public inbox for linux-kernel@vger.kernel.org
* [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
@ 2004-06-06 15:39 Con Kolivas
  2004-06-06 16:59 ` Jan Killius
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 15:39 UTC (permalink / raw)
  To: Linux Kernel Mailinglist; +Cc: Zwane Mwaikambo, William Lee Irwin III

[-- Attachment #1: Type: Text/Plain, Size: 3146 bytes --]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

This is an update of the scheduler policy mechanism rewrite using the 
infrastructure of the current O(1) scheduler. The design has changed 
slightly since the original, enough to warrant a detailed description. 
The changes have eliminated all known corner cases and maintain cpu 
distribution much better. It has proven stable in my testing and is now 
ready for more widespread public testing.


Aims:
 - Interactive by design rather than having interactivity bolted on.
 - Good scalability.
 - Simple predictable design.
 - Maintain appropriate cpu distribution and fairness.
 - Low scheduling latency for normal policy tasks.
 - Low overhead.
 - Make renicing actually matter for CPU distribution (nice 0 gets 20 times 
the cpu of nice +19)
 - Resistant to priority inversion
 - More forgiving of poor locking
 - Tunable for a server workload or computational tasks


Description:
 - All tasks start at a dynamic priority based on their nice value. They run 
for one RR_INTERVAL (nominally 10ms) and then drop one stair (one priority 
level). If they sleep before running again, they then run for two intervals 
before being demoted a priority, and so on, until they receive all their 
intervals at their best priority: 20 intervals for nice 0; 1 interval for 
nice +19.


- - The staircase elevation mechanism can be disabled, leaving all tasks to 
simply descend the stairs, with the sysctl:

echo 0 > /proc/sys/kernel/interactive

This has the effect of maintaining cpu distribution much more strictly 
according to nice, whereas the default mechanism allows interactive tasks 
bursts of cpu before they settle to their appropriate cpu distribution.


- - The time tasks spend cpu bound can be increased with the sysctl:

echo 1 > /proc/sys/kernel/compute

which extends the RR_INTERVAL to 100ms and disables the staircase elevation, 
improving conditions for pure computational tasks by optimising cache 
benefits and decreasing context switching (gains another 1.5% on kernbench).


Performance:
- - All cpu throughput benchmarks show performance equivalent to or better 
than mainline. Note that disabling the interactive setting actually 
_worsens_ some benchmarks because of their dependence on yield(), so I 
don't recommend disabling it unless you do a comparison first.
- - Interactivity is approximately equivalent to mainline 2.6, but with 
faster application startup and no known behavioural quirks.


Comments and testing welcome.

 fs/proc/array.c        |    2
 include/linux/sched.h  |    8
 include/linux/sysctl.h |    2
 kernel/sched.c         |  676 +++++++++++++------------------------------------
 kernel/sysctl.c        |   16 +
 5 files changed, 212 insertions(+), 492 deletions(-)

Can be downloaded here:
http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3

and below
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAwzquZUg7+tp6mRURArBHAJ9SIBOKX6MYOdkJzdb+xRNnW82JQgCghLou
wXhrRsBsfY2BIqbLT1tUWcs=
=g1no
-----END PGP SIGNATURE-----

[-- Attachment #2: patch-2.6.7-rc2-s6.3 --]
[-- Type: text/x-diff, Size: 34187 bytes --]

diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/fs/proc/array.c linux-2.6.7-rc2-s6.3/fs/proc/array.c
--- linux-2.6.7-rc2-base/fs/proc/array.c	2004-05-31 21:29:15.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/fs/proc/array.c	2004-06-07 00:03:56.959133536 +1000
@@ -155,7 +155,6 @@ static inline char * task_state(struct t
 	read_lock(&tasklist_lock);
 	buffer += sprintf(buffer,
 		"State:\t%s\n"
-		"SleepAVG:\t%lu%%\n"
 		"Tgid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
@@ -163,7 +162,6 @@ static inline char * task_state(struct t
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
-		(p->sleep_avg/1024)*100/(1020000000/1024),
 	       	p->tgid,
 		p->pid, p->pid ? p->real_parent->pid : 0,
 		p->pid && p->ptrace ? p->parent->pid : 0,
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/include/linux/sched.h linux-2.6.7-rc2-s6.3/include/linux/sched.h
--- linux-2.6.7-rc2-base/include/linux/sched.h	2004-05-31 21:29:21.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/include/linux/sched.h	2004-06-07 00:10:37.450576503 +1000
@@ -164,6 +164,7 @@ extern void show_stack(struct task_struc
 
 void io_schedule(void);
 long io_schedule_timeout(long timeout);
+extern int interactive, compute;
 
 extern void cpu_init (void);
 extern void trap_init(void);
@@ -394,14 +395,13 @@ struct task_struct {
 	struct list_head run_list;
 	prio_array_t *array;
 
-	unsigned long sleep_avg;
-	long interactive_credit;
 	unsigned long long timestamp;
-	int activated;
+	unsigned long runtime, totalrun;
+	unsigned int deadline;
 
 	unsigned long policy;
 	cpumask_t cpus_allowed;
-	unsigned int time_slice, first_time_slice;
+	unsigned int slice, time_slice, first_time_slice;
 
 	struct list_head tasks;
 	struct list_head ptrace_children;
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/include/linux/sysctl.h linux-2.6.7-rc2-s6.3/include/linux/sysctl.h
--- linux-2.6.7-rc2-base/include/linux/sysctl.h	2004-05-31 21:29:21.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/include/linux/sysctl.h	2004-06-07 00:06:13.007895851 +1000
@@ -133,6 +133,8 @@ enum
 	KERN_NGROUPS_MAX=63,	/* int: NGROUPS_MAX */
 	KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */
 	KERN_HZ_TIMER=65,	/* int: hz timer on or off */
+	KERN_INTERACTIVE=66,	/* interactive tasks to have cpu bursts */
+	KERN_COMPUTE=67,	/* adjust timeslices for a compute server */
 };
 
 
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/kernel/sched.c linux-2.6.7-rc2-s6.3/kernel/sched.c
--- linux-2.6.7-rc2-base/kernel/sched.c	2004-05-31 21:29:24.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/kernel/sched.c	2004-06-07 00:36:43.382396246 +1000
@@ -16,7 +16,10 @@
  *		by Davide Libenzi, preemptible kernel bits by Robert Love.
  *  2003-09-03	Interactivity tuning by Con Kolivas.
  *  2004-04-02	Scheduler domains code by Nick Piggin
- */
+ *  2004-03-19	New staircase scheduling policy by Con Kolivas with help
+ *		from Zwane Mwaikambo and useful suggestions by
+ *		William Lee Irwin III.
+ */
 
 #include <linux/mm.h>
 #include <linux/module.h>
@@ -66,8 +69,6 @@
 #define USER_PRIO(p)		((p)-MAX_RT_PRIO)
 #define TASK_USER_PRIO(p)	USER_PRIO((p)->static_prio)
 #define MAX_USER_PRIO		(USER_PRIO(MAX_PRIO))
-#define AVG_TIMESLICE	(MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
-			(MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))
 
 /*
  * Some helpers for converting nanosecond timing to jiffy resolution
@@ -75,110 +76,18 @@
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
 
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
- * maximum timeslice is 200 msecs. Timeslices get refilled after
- * they expire.
- */
-#define MIN_TIMESLICE		( 10 * HZ / 1000)
-#define MAX_TIMESLICE		(200 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT	 30
-#define CHILD_PENALTY		 95
-#define PARENT_PENALTY		100
-#define EXIT_WEIGHT		  3
-#define PRIO_BONUS_RATIO	 25
-#define MAX_BONUS		(MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA	  2
-#define MAX_SLEEP_AVG		(AVG_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT	(MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG	(JIFFIES_TO_NS(MAX_SLEEP_AVG))
-#define CREDIT_LIMIT		100
-
-/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
- *
- * This part scales the interactivity limit depending on niceness.
- *
- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
- * Here are a few examples of different nice levels:
- *
- *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
- *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
- *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
- *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
- *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
- *
- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
- *  priority range a task can explore, a value of '1' means the
- *  task is rated interactive.)
- *
- * Ie. nice +19 tasks can never get 'interactive' enough to be
- * reinserted into the active array. And only heavily CPU-hog nice -20
- * tasks will be expired. Default nice 0 tasks are somewhere between,
- * it takes some effort for them to get interactive, but it's not
- * too hard.
- */
-
-#define CURRENT_BONUS(p) \
-	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
-		MAX_SLEEP_AVG)
-
-#ifdef CONFIG_SMP
-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
-			num_online_cpus())
-#else
-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
-#endif
-
-#define SCALE(v1,v1_max,v2_max) \
-	(v1) * (v2_max) / (v1_max)
-
-#define DELTA(p) \
-	(SCALE(TASK_NICE(p), 40, MAX_BONUS) + INTERACTIVE_DELTA)
-
-#define TASK_INTERACTIVE(p) \
-	((p)->prio <= (p)->static_prio - DELTA(p))
-
-#define INTERACTIVE_SLEEP(p) \
-	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
-		(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
-#define HIGH_CREDIT(p) \
-	((p)->interactive_credit > CREDIT_LIMIT)
-
-#define LOW_CREDIT(p) \
-	((p)->interactive_credit < -CREDIT_LIMIT)
+int compute = 0;
+/*
+ * This is the time all tasks within the same priority round robin.
+ * The compute setting is reserved for dedicated computational scheduling
+ * and has ten times larger intervals.
+ */
+#define _RR_INTERVAL		((10 * HZ / 1000) ? : 1)
+#define RR_INTERVAL		(_RR_INTERVAL * (1 + 9 * compute))
 
 #define TASK_PREEMPTS_CURR(p, rq) \
 	((p)->prio < (rq)->curr->prio)
 
-/*
- * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
- * to time slice values.
- *
- * The higher a thread's priority, the bigger timeslices
- * it gets during one round of execution. But even the lowest
- * priority thread gets MIN_TIMESLICE worth of execution time.
- *
- * task_timeslice() is the interface that is used by the scheduler.
- */
-
-#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
-		((MAX_TIMESLICE - MIN_TIMESLICE) * \
-			(MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))
-
-static unsigned int task_timeslice(task_t *p)
-{
-	return BASE_TIMESLICE(p);
-}
-
 #define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
 
 /*
@@ -192,7 +101,7 @@ typedef struct runqueue runqueue_t;
 struct prio_array {
 	unsigned int nr_active;
 	unsigned long bitmap[BITMAP_SIZE];
-	struct list_head queue[MAX_PRIO];
+	struct list_head queue[MAX_PRIO + 1];
 };
 
 /*
@@ -214,12 +123,11 @@ struct runqueue {
 	unsigned long cpu_load;
 #endif
 	unsigned long long nr_switches;
-	unsigned long expired_timestamp, nr_uninterruptible;
+	unsigned long nr_uninterruptible;
 	unsigned long long timestamp_last_tick;
 	task_t *curr, *idle;
 	struct mm_struct *prev_mm;
-	prio_array_t *active, *expired, arrays[2];
-	int best_expired_prio;
+	prio_array_t array;
 	atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
@@ -300,16 +208,18 @@ static inline void rq_unlock(runqueue_t 
 /*
  * Adding/removing a task to/from a priority array:
  */
-static void dequeue_task(struct task_struct *p, prio_array_t *array)
+static void dequeue_task(struct task_struct *p, runqueue_t *rq)
 {
+	prio_array_t* array = &rq->array;
 	array->nr_active--;
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
 		__clear_bit(p->prio, array->bitmap);
 }
 
-static void enqueue_task(struct task_struct *p, prio_array_t *array)
+static void enqueue_task(struct task_struct *p, runqueue_t *rq)
 {
+	prio_array_t* array = &rq->array;
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
 	array->nr_active++;
@@ -321,8 +231,9 @@ static void enqueue_task(struct task_str
  * remote queue so we want these tasks to show up at the head of the
  * local queue:
  */
-static inline void enqueue_task_head(struct task_struct *p, prio_array_t *array)
+static inline void enqueue_task_head(struct task_struct *p, runqueue_t *rq)
 {
+	prio_array_t* array = &rq->array;
 	list_add(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
 	array->nr_active++;
@@ -330,42 +241,11 @@ static inline void enqueue_task_head(str
 }
 
 /*
- * effective_prio - return the priority that is based on the static
- * priority but is modified by bonuses/penalties.
- *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
- * We use 25% of the full 0...39 priority range so that:
- *
- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
- *
- * Both properties are important to certain workloads.
- */
-static int effective_prio(task_t *p)
-{
-	int bonus, prio;
-
-	if (rt_task(p))
-		return p->prio;
-
-	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
-
-	prio = p->static_prio - bonus;
-	if (prio < MAX_RT_PRIO)
-		prio = MAX_RT_PRIO;
-	if (prio > MAX_PRIO-1)
-		prio = MAX_PRIO-1;
-	return prio;
-}
-
-/*
  * __activate_task - move a task to the runqueue.
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
-	enqueue_task(p, rq->active);
+	enqueue_task(p, rq);
 	rq->nr_running++;
 }
 
@@ -374,95 +254,112 @@ static inline void __activate_task(task_
  */
 static inline void __activate_idle_task(task_t *p, runqueue_t *rq)
 {
-	enqueue_task_head(p, rq->active);
+	enqueue_task_head(p, rq);
 	rq->nr_running++;
 }
 
-static void recalc_task_prio(task_t *p, unsigned long long now)
+// deadline - the best deadline rank a task can have.
+static unsigned int deadline(task_t *p)
 {
-	unsigned long long __sleep_time = now - p->timestamp;
-	unsigned long sleep_time;
-
-	if (__sleep_time > NS_MAX_SLEEP_AVG)
-		sleep_time = NS_MAX_SLEEP_AVG;
+	unsigned int task_user_prio;
+	if (rt_task(p))
+		return p->deadline;
+	task_user_prio = TASK_USER_PRIO(p);
+	if (likely(task_user_prio < 40))
+		return 39 - task_user_prio;
 	else
-		sleep_time = (unsigned long)__sleep_time;
+		return 0;
+}
 
-	if (likely(sleep_time > 0)) {
-		/*
-		 * User tasks that sleep a long time are categorised as
-		 * idle and will get just interactive status to stay active &
-		 * prevent them suddenly becoming cpu hogs and starving
-		 * other processes.
-		 */
-		if (p->mm && p->activated != -1 &&
-			sleep_time > INTERACTIVE_SLEEP(p)) {
-				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
-						AVG_TIMESLICE);
-				if (!HIGH_CREDIT(p))
-					p->interactive_credit++;
-		} else {
-			/*
-			 * The lower the sleep avg a task has the more
-			 * rapidly it will rise with sleep time.
-			 */
-			sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+static void inc_deadline(task_t *p)
+{
+	unsigned int best_deadline;
+	best_deadline = deadline(p);
+	if (!p->mm || rt_task(p) || !best_deadline)
+		return;
+	if (p->deadline < best_deadline)
+		p->deadline++;
+}
 
-			/*
-			 * Tasks with low interactive_credit are limited to
-			 * one timeslice worth of sleep avg bonus.
-			 */
-			if (LOW_CREDIT(p) &&
-			    sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
-				sleep_time = JIFFIES_TO_NS(task_timeslice(p));
+static void dec_deadline(task_t *p)
+{
+	if (!p->mm || rt_task(p))
+		return;
+	if (p->deadline)
+		p->deadline--;
+}
 
-			/*
-			 * Non high_credit tasks waking from uninterruptible
-			 * sleep are limited in their sleep_avg rise as they
-			 * are likely to be cpu hogs waiting on I/O
-			 */
-			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm) {
-				if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
-					sleep_time = 0;
-				else if (p->sleep_avg + sleep_time >=
-						INTERACTIVE_SLEEP(p)) {
-					p->sleep_avg = INTERACTIVE_SLEEP(p);
-					sleep_time = 0;
-				}
-			}
+// slice - the duration a task runs before losing a deadline rank.
+static unsigned int slice(task_t *p)
+{
+	unsigned int slice = RR_INTERVAL;
+	if (!rt_task(p))
+		slice += deadline(p) * RR_INTERVAL;
+	return slice;
+}
 
-			/*
-			 * This code gives a bonus to interactive tasks.
-			 *
-			 * The boost works by updating the 'average sleep time'
-			 * value here, based on ->timestamp. The more time a
-			 * task spends sleeping, the higher the average gets -
-			 * and the higher the priority boost gets as well.
-			 */
-			p->sleep_avg += sleep_time;
+// interactive - interactive tasks get longer intervals at best priority
+int interactive = 1;
 
-			if (p->sleep_avg > NS_MAX_SLEEP_AVG) {
-				p->sleep_avg = NS_MAX_SLEEP_AVG;
-				if (!HIGH_CREDIT(p))
-					p->interactive_credit++;
-			}
+/*
+ * effective_prio - dynamic priority dependent on deadline rank.
+ * With 0 deadline ranks the priority decreases by one each RR_INTERVAL.
+ * As the deadline rank increases the priority stays at the top "stair" or
+ * value for longer.
+ */
+static int effective_prio(task_t *p)
+{
+	int prio;
+	unsigned int full_slice, used_slice, first_slice;
+	unsigned int best_deadline;
+	if (rt_task(p))
+		return p->prio;
+
+	best_deadline = deadline(p);
+	full_slice = slice(p);
+	used_slice = full_slice - p->slice;
+	if (p->deadline > best_deadline || !p->mm)
+		p->deadline = best_deadline;
+	first_slice = RR_INTERVAL;
+	if (interactive && !compute)
+		first_slice *= (p->deadline + 1);
+	prio = MAX_PRIO - 1 - best_deadline;
+
+	if (used_slice < first_slice)
+		return prio;
+	if (p->mm)
+		prio += 1 + (used_slice - first_slice) / RR_INTERVAL;
+	if (prio > MAX_PRIO - 1)
+		prio = MAX_PRIO - 1;
+	return prio;
+}
+
+static void recalc_task_prio(task_t *p, unsigned long long now)
+{
+	unsigned long sleep_time = now - p->timestamp;
+	unsigned long run_time = NS_TO_JIFFIES(p->runtime);
+	unsigned long total_run = NS_TO_JIFFIES(p->totalrun) + run_time;
+	if (!run_time && NS_TO_JIFFIES(p->runtime + sleep_time) < RR_INTERVAL) {
+		if (p->slice - total_run < 1) {
+			p->totalrun = 0;
+			dec_deadline(p);
+		} else {
+			p->totalrun += p->runtime;
+			p->slice -= NS_TO_JIFFIES(p->totalrun);
 		}
+	} else {
+		inc_deadline(p);
+		p->runtime = 0;
+		p->totalrun = 0;
 	}
-
-	p->prio = effective_prio(p);
 }
 
 /*
  * activate_task - move a task to the runqueue and do priority recalculation
- *
- * Update all the scheduling statistics stuff. (sleep average
- * calculation, priority modifiers, etc.)
  */
 static void activate_task(task_t *p, runqueue_t *rq, int local)
 {
-	unsigned long long now;
-
-	now = sched_clock();
+	unsigned long long now = sched_clock();
 #ifdef CONFIG_SMP
 	if (!local) {
 		/* Compensate for drifting sched_clock */
@@ -471,33 +368,14 @@ static void activate_task(task_t *p, run
 			+ rq->timestamp_last_tick;
 	}
 #endif
-
-	recalc_task_prio(p, now);
-
-	/*
-	 * This checks to make sure it's not an uninterruptible task
-	 * that is now waking up.
-	 */
-	if (!p->activated) {
-		/*
-		 * Tasks which were woken up by interrupts (ie. hw events)
-		 * are most likely of interactive nature. So we give them
-		 * the credit of extending their sleep time to the period
-		 * of time they spend on the runqueue, waiting for execution
-		 * on a CPU, first time around:
-		 */
-		if (in_interrupt())
-			p->activated = 2;
-		else {
-			/*
-			 * Normal first-time wakeups get a credit too for
-			 * on-runqueue time, but it will be weighted down:
-			 */
-			p->activated = 1;
-		}
-	}
+	p->slice = slice(p);
+	if (!rt_task(p) && p->mm)
+		recalc_task_prio(p, now);
+	else
+		inc_deadline(p);
+	p->prio = effective_prio(p);
+	p->time_slice = RR_INTERVAL;
 	p->timestamp = now;
-
 	__activate_task(p, rq);
 }
 
@@ -507,9 +385,12 @@ static void activate_task(task_t *p, run
 static void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
 	rq->nr_running--;
-	if (p->state == TASK_UNINTERRUPTIBLE)
+	if (p->state == TASK_UNINTERRUPTIBLE) {
 		rq->nr_uninterruptible++;
-	dequeue_task(p, p->array);
+		if (p->deadline > (deadline(p) - 2) && p->deadline && p->mm)
+			p->deadline--;
+	}
+	dequeue_task(p, rq);
 	p->array = NULL;
 }
 
@@ -813,14 +694,8 @@ out_set_cpu:
 
 out_activate:
 #endif /* CONFIG_SMP */
-	if (old_state == TASK_UNINTERRUPTIBLE) {
+	if (old_state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible--;
-		/*
-		 * Tasks on involuntary sleep don't earn
-		 * sleep_avg beyond just interactive state.
-		 */
-		p->activated = -1;
-	}
 
 	/*
 	 * Sync wakeups (i.e. those types of wakeups where the waker
@@ -890,12 +765,14 @@ void fastcall sched_fork(task_t *p)
 	 */
 	local_irq_disable();
 	p->time_slice = (current->time_slice + 1) >> 1;
+	p->slice = (current->slice + 1) >> 1;
 	/*
 	 * The remainder of the first timeslice might be recovered by
 	 * the parent if the child exits early enough.
 	 */
 	p->first_time_slice = 1;
 	current->time_slice >>= 1;
+	current->slice >>= 1;
 	p->timestamp = sched_clock();
 	if (!current->time_slice) {
 		/*
@@ -923,33 +800,14 @@ void fastcall wake_up_forked_process(tas
 	unsigned long flags;
 	runqueue_t *rq = task_rq_lock(current, &flags);
 
+	// Forked process gets a lower deadline rank to prevent fork bombs.
+	p->deadline = 0;
 	BUG_ON(p->state != TASK_RUNNING);
 
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive.
-	 */
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->interactive_credit = 0;
-
-	p->prio = effective_prio(p);
 	set_task_cpu(p, smp_processor_id());
 
-	if (unlikely(!current->array))
-		__activate_task(p, rq);
-	else {
-		p->prio = current->prio;
-		list_add_tail(&p->run_list, &current->run_list);
-		p->array = current->array;
-		p->array->nr_active++;
-		rq->nr_running++;
-	}
+	p->prio = effective_prio(p);
+	__activate_task(p, rq);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -970,19 +828,11 @@ void fastcall sched_exit(task_t * p)
 	local_irq_save(flags);
 	if (p->first_time_slice) {
 		p->parent->time_slice += p->time_slice;
-		if (unlikely(p->parent->time_slice > MAX_TIMESLICE))
-			p->parent->time_slice = MAX_TIMESLICE;
+		if (unlikely(p->parent->time_slice > slice(p->parent)))
+			p->parent->time_slice = slice(p->parent);
 	}
 	local_irq_restore(flags);
-	/*
-	 * If the child was a (relative-) CPU hog then decrease
-	 * the sleep_avg of the parent as well.
-	 */
 	rq = task_rq_lock(p->parent, &flags);
-	if (p->sleep_avg < p->parent->sleep_avg)
-		p->parent->sleep_avg = p->parent->sleep_avg /
-		(EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
-		(EXIT_WEIGHT + 1);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -1246,18 +1096,6 @@ lock_again:
 		double_rq_unlock(this_rq, rq);
 		goto lock_again;
 	}
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive.
-	 */
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->interactive_credit = 0;
 
 	p->prio = effective_prio(p);
 	set_task_cpu(p, cpu);
@@ -1369,15 +1207,15 @@ static void double_lock_balance(runqueue
  * pull_task - move a task from a remote runqueue to the local runqueue.
  * Both runqueues must be locked.
  */
-static inline
-void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p,
-	       runqueue_t *this_rq, prio_array_t *this_array, int this_cpu)
+static inline 
+void pull_task(runqueue_t *src_rq, task_t *p, 
+		runqueue_t *this_rq, int this_cpu)
 {
-	dequeue_task(p, src_array);
+	dequeue_task(p, src_rq);
 	src_rq->nr_running--;
 	set_task_cpu(p, this_cpu);
 	this_rq->nr_running++;
-	enqueue_task(p, this_array);
+	enqueue_task(p, this_rq);
 	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
 				+ this_rq->timestamp_last_tick;
 	/*
@@ -1427,29 +1265,16 @@ static int move_tasks(runqueue_t *this_r
 		      unsigned long max_nr_move, struct sched_domain *sd,
 		      enum idle_type idle)
 {
-	prio_array_t *array, *dst_array;
+	prio_array_t* array, *dst_array;
 	struct list_head *head, *curr;
 	int idx, pulled = 0;
 	task_t *tmp;
 
 	if (max_nr_move <= 0 || busiest->nr_running <= 1)
 		goto out;
+	array = &busiest->array;
+	dst_array = &this_rq->array;
 
-	/*
-	 * We first consider expired tasks. Those will likely not be
-	 * executed in the near future, and they are most likely to
-	 * be cache-cold, thus switching CPUs has the least effect
-	 * on them.
-	 */
-	if (busiest->expired->nr_active) {
-		array = busiest->expired;
-		dst_array = this_rq->expired;
-	} else {
-		array = busiest->active;
-		dst_array = this_rq->active;
-	}
-
-new_array:
 	/* Start searching at priority 0: */
 	idx = 0;
 skip_bitmap:
@@ -1457,14 +1282,8 @@ skip_bitmap:
 		idx = sched_find_first_bit(array->bitmap);
 	else
 		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
-	if (idx >= MAX_PRIO) {
-		if (array == busiest->expired && busiest->active->nr_active) {
-			array = busiest->active;
-			dst_array = this_rq->active;
-			goto new_array;
-		}
+	if (idx >= MAX_PRIO) 
 		goto out;
-	}
 
 	head = array->queue + idx;
 	curr = head->prev;
@@ -1479,7 +1298,7 @@ skip_queue:
 		idx++;
 		goto skip_bitmap;
 	}
-	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
+	pull_task(busiest, tmp, this_rq, this_cpu);
 	pulled++;
 
 	/* We only want to steal up to the prescribed number of tasks. */
@@ -1938,22 +1757,6 @@ DEFINE_PER_CPU(struct kernel_stat, kstat
 EXPORT_PER_CPU_SYMBOL(kstat);
 
 /*
- * We place interactive tasks back into the active array, if possible.
- *
- * To guarantee that this does not starve expired tasks we ignore the
- * interactivity of a task if the first expired task had to wait more
- * than a 'reasonable' amount of time. This deadline timeout is
- * load-dependent, as the frequency of array switched decreases with
- * increasing number of running tasks. We also ignore the interactivity
- * if a better static_prio task has expired:
- */
-#define EXPIRED_STARVING(rq) \
-	((STARVATION_LIMIT && ((rq)->expired_timestamp && \
-		(jiffies - (rq)->expired_timestamp >= \
-			STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
-			((rq)->curr->static_prio > (rq)->best_expired_prio))
-
-/*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
  *
@@ -1998,76 +1801,38 @@ void scheduler_tick(int user_ticks, int 
 	cpustat->system += sys_ticks;
 
 	/* Task might have expired already, but not scheduled off yet */
-	if (p->array != rq->active) {
+	if (p->array != &rq->array) {
 		set_tsk_need_resched(p);
 		goto out;
 	}
 	spin_lock(&rq->lock);
-	/*
-	 * The task was running during this tick - update the
-	 * time slice counter. Note: we do not update a thread's
-	 * priority until it either goes to sleep or uses up its
-	 * timeslice. This makes it possible for interactive tasks
-	 * to use up their timeslices at their highest priority levels.
-	 */
-	if (unlikely(rt_task(p))) {
-		/*
-		 * RR tasks need a special form of timeslice management.
-		 * FIFO tasks have no timeslices.
-		 */
-		if ((p->policy == SCHED_RR) && !--p->time_slice) {
-			p->time_slice = task_timeslice(p);
-			p->first_time_slice = 0;
-			set_tsk_need_resched(p);
-
-			/* put it at the end of the queue: */
-			dequeue_task(p, rq->active);
-			enqueue_task(p, rq->active);
-		}
+	
+	// SCHED_FIFO tasks never run out of timeslice.
+	if (unlikely(p->policy == SCHED_FIFO))
+		goto out_unlock;
+	// Tasks lose a deadline rank each time they use up a full slice().
+	if (!--p->slice) {
+		set_tsk_need_resched(p);
+		dequeue_task(p, rq);
+		dec_deadline(p);
+		p->slice = slice(p);
+		p->prio = effective_prio(p);
+		p->time_slice = RR_INTERVAL;
+		enqueue_task(p, rq);
+		p->first_time_slice = 0;
 		goto out_unlock;
 	}
+	/*
+	 * Tasks that run out of time_slice but still have slice left get
+	 * requeued with a lower priority and an RR_INTERVAL time_slice.
+	 */
 	if (!--p->time_slice) {
-		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
+		dequeue_task(p, rq);
 		p->prio = effective_prio(p);
-		p->time_slice = task_timeslice(p);
-		p->first_time_slice = 0;
-
-		if (!rq->expired_timestamp)
-			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
-				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
-	} else {
-		/*
-		 * Prevent a too long timeslice allowing a task to monopolize
-		 * the CPU. We do this by splitting up the timeslice into
-		 * smaller pieces.
-		 *
-		 * Note: this does not mean the task's timeslices expire or
-		 * get lost in any way, they just might be preempted by
-		 * another task of equal priority. (one with higher
-		 * priority would have preempted this task already.) We
-		 * requeue this task to the end of the list on this priority
-		 * level, which is in essence a round-robin of tasks with
-		 * equal priority.
-		 *
-		 * This only applies to tasks in the interactive
-		 * delta range with at least TIMESLICE_GRANULARITY to requeue.
-		 */
-		if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
-			p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
-			(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
-			(p->array == rq->active)) {
-
-			dequeue_task(p, rq->active);
-			set_tsk_need_resched(p);
-			p->prio = effective_prio(p);
-			enqueue_task(p, rq->active);
-		}
+		p->time_slice = RR_INTERVAL;
+		enqueue_task(p, rq);
+		goto out_unlock;
 	}
 out_unlock:
 	spin_unlock(&rq->lock);
@@ -2131,8 +1896,8 @@ static inline int dependent_sleeper(int 
 		 * task from using an unfair proportion of the
 		 * physical cpu's resources. -ck
 		 */
-		if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) >
-			task_timeslice(p) || rt_task(smt_curr)) &&
+		if (((smt_curr->slice * (100 - sd->per_cpu_gain) / 100) >
+			slice(p) || rt_task(smt_curr)) &&
 			p->mm && smt_curr->mm && !rt_task(p))
 				ret = 1;
 
@@ -2141,8 +1906,8 @@ static inline int dependent_sleeper(int 
 		 * or wake it up if it has been put to sleep for priority
 		 * reasons.
 		 */
-		if ((((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
-			task_timeslice(smt_curr) || rt_task(p)) &&
+		if ((((p->slice * (100 - sd->per_cpu_gain) / 100) > 
+			slice(smt_curr) || rt_task(p)) && 
 			smt_curr->mm && p->mm && !rt_task(smt_curr)) ||
 			(smt_curr == smt_rq->idle && smt_rq->nr_running))
 				resched_task(smt_curr);
@@ -2168,10 +1933,9 @@ asmlinkage void __sched schedule(void)
 	long *switch_count;
 	task_t *prev, *next;
 	runqueue_t *rq;
-	prio_array_t *array;
+	prio_array_t* array;
 	struct list_head *queue;
 	unsigned long long now;
-	unsigned long run_time;
 	int cpu, idx;
 
 	/*
@@ -2193,19 +1957,8 @@ need_resched:
 
 	release_kernel_lock(prev);
 	now = sched_clock();
-	if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
-		run_time = now - prev->timestamp;
-	else
-		run_time = NS_MAX_SLEEP_AVG;
-
-	/*
-	 * Tasks with interactive credits get charged less run_time
-	 * at high sleep_avg to delay them losing their interactive
-	 * status
-	 */
-	if (HIGH_CREDIT(prev))
-		run_time /= (CURRENT_BONUS(prev) ? : 1);
 
+	prev->runtime = now - prev->timestamp;
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -2227,56 +1980,22 @@ need_resched:
 		idle_balance(cpu, rq);
 		if (!rq->nr_running) {
 			next = rq->idle;
-			rq->expired_timestamp = 0;
 			wake_sleeping_dependent(cpu, rq);
 			goto switch_tasks;
 		}
 	}
 
-	array = rq->active;
-	if (unlikely(!array->nr_active)) {
-		/*
-		 * Switch the active and expired arrays.
-		 */
-		rq->active = rq->expired;
-		rq->expired = array;
-		array = rq->active;
-		rq->expired_timestamp = 0;
-		rq->best_expired_prio = MAX_PRIO;
-	}
-
+	array = &rq->array;
 	idx = sched_find_first_bit(array->bitmap);
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
 
-	if (dependent_sleeper(cpu, rq, next)) {
+	if (dependent_sleeper(cpu, rq, next))
 		next = rq->idle;
-		goto switch_tasks;
-	}
-
-	if (!rt_task(next) && next->activated > 0) {
-		unsigned long long delta = now - next->timestamp;
-
-		if (next->activated == 1)
-			delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
-
-		array = next->array;
-		dequeue_task(next, array);
-		recalc_task_prio(next, next->timestamp + delta);
-		enqueue_task(next, array);
-	}
-	next->activated = 0;
 switch_tasks:
 	prefetch(next);
 	clear_tsk_need_resched(prev);
 	RCU_qsctr(task_cpu(prev))++;
-
-	prev->sleep_avg -= run_time;
-	if ((long)prev->sleep_avg <= 0) {
-		prev->sleep_avg = 0;
-		if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
-			prev->interactive_credit--;
-	}
 	prev->timestamp = now;
 
 	if (likely(prev != next)) {
@@ -2542,7 +2261,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
 void set_user_nice(task_t *p, long nice)
 {
 	unsigned long flags;
-	prio_array_t *array;
+	prio_array_t* array;
 	runqueue_t *rq;
 	int old_prio, new_prio, delta;
 
@@ -2565,7 +2284,7 @@ void set_user_nice(task_t *p, long nice)
 	}
 	array = p->array;
 	if (array)
-		dequeue_task(p, array);
+		dequeue_task(p, rq);
 
 	old_prio = p->prio;
 	new_prio = NICE_TO_PRIO(nice);
@@ -2574,7 +2293,7 @@ void set_user_nice(task_t *p, long nice)
 	p->prio += delta;
 
 	if (array) {
-		enqueue_task(p, array);
+		enqueue_task(p, rq);
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
@@ -2696,7 +2415,7 @@ static int setscheduler(pid_t pid, int p
 	struct sched_param lp;
 	int retval = -EINVAL;
 	int oldprio;
-	prio_array_t *array;
+	prio_array_t* array;
 	unsigned long flags;
 	runqueue_t *rq;
 	task_t *p;
@@ -2965,28 +2684,17 @@ out_unlock:
 /**
  * sys_sched_yield - yield the current processor to other threads.
  *
- * this function yields the current CPU by moving the calling thread
- * to the expired array. If there are no other threads running on this
  * CPU then this function will return.
  */
 asmlinkage long sys_sched_yield(void)
 {
 	runqueue_t *rq = this_rq_lock();
-	prio_array_t *array = current->array;
-	prio_array_t *target = rq->expired;
 
-	/*
-	 * We implement yielding by moving the task into the expired
-	 * queue.
-	 *
-	 * (special rule: RT tasks will just roundrobin in the active
-	 *  array.)
-	 */
-	if (unlikely(rt_task(current)))
-		target = rq->active;
-
-	dequeue_task(current, array);
-	enqueue_task(current, target);
+	dequeue_task(current, rq);
+	current->slice = slice(current);
+	current->time_slice = current->slice;
+	current->prio = effective_prio(current);
+	enqueue_task(current, rq);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
@@ -3125,7 +2833,7 @@ long sys_sched_rr_get_interval(pid_t pid
 		goto out_unlock;
 
 	jiffies_to_timespec(p->policy & SCHED_FIFO ?
-				0 : task_timeslice(p), &t);
+				0 : slice(p), &t);
 	read_unlock(&tasklist_lock);
 	retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
 out_nounlock:
@@ -3246,6 +2954,7 @@ void __devinit init_idle(task_t *idle, i
 	idle->array = NULL;
 	idle->prio = MAX_PRIO;
 	idle->state = TASK_RUNNING;
+	idle->deadline = 0;
 	set_task_cpu(idle, cpu);
 	double_rq_unlock(idle_rq, rq);
 	set_tsk_need_resched(idle);
@@ -3878,7 +3587,7 @@ int in_sched_functions(unsigned long add
 void __init sched_init(void)
 {
 	runqueue_t *rq;
-	int i, j, k;
+	int i, j;
 
 #ifdef CONFIG_SMP
 	/* Set up an initial dummy domain for early boot */
@@ -3899,13 +3608,10 @@ void __init sched_init(void)
 #endif
 
 	for (i = 0; i < NR_CPUS; i++) {
-		prio_array_t *array;
+		prio_array_t* array;
 
 		rq = cpu_rq(i);
 		spin_lock_init(&rq->lock);
-		rq->active = rq->arrays;
-		rq->expired = rq->arrays + 1;
-		rq->best_expired_prio = MAX_PRIO;
 
 #ifdef CONFIG_SMP
 		rq->sd = &sched_domain_init;
@@ -3916,16 +3622,14 @@ void __init sched_init(void)
 		INIT_LIST_HEAD(&rq->migration_queue);
 #endif
 		atomic_set(&rq->nr_iowait, 0);
+		array = &rq->array;
 
-		for (j = 0; j < 2; j++) {
-			array = rq->arrays + j;
-			for (k = 0; k < MAX_PRIO; k++) {
-				INIT_LIST_HEAD(array->queue + k);
-				__clear_bit(k, array->bitmap);
-			}
-			// delimiter for bitsearch
-			__set_bit(MAX_PRIO, array->bitmap);
+		for (j = 0; j <= MAX_PRIO; j++) {
+			INIT_LIST_HEAD(array->queue + j);
+			__clear_bit(j, array->bitmap);
 		}
+		// delimiter for bitsearch
+		__set_bit(MAX_PRIO, array->bitmap);
 	}
 	/*
 	 * We have to do a little magic to get the first
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/kernel/sysctl.c linux-2.6.7-rc2-s6.3/kernel/sysctl.c
--- linux-2.6.7-rc2-base/kernel/sysctl.c	2004-05-31 21:29:24.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/kernel/sysctl.c	2004-06-07 00:09:21.513441965 +1000
@@ -636,6 +636,22 @@ static ctl_table kern_table[] = {
 		.mode		= 0444,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= KERN_INTERACTIVE,
+		.procname	= "interactive",
+		.data		= &interactive,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= KERN_COMPUTE,
+		.procname	= "compute",
+		.data		= &compute,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };
 

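The staircase arithmetic in the patch above can be sketched in userspace. This is a simplified model, not kernel code: it assumes HZ=100 (so RR_INTERVAL is one jiffy, 10ms) and the 2.6 priority layout (MAX_RT_PRIO=100, MAX_PRIO=140); `best_rank()`, `slice_len()` and `staircase_prio()` are illustrative stand-ins for the patch's deadline(), slice() and effective_prio().

```c
#include <assert.h>

/* Illustrative constants: HZ=100 makes RR_INTERVAL one jiffy (10ms);
 * MAX_PRIO/MAX_RT_PRIO follow the 2.6 priority layout. */
#define MAX_RT_PRIO	100
#define MAX_PRIO	140
#define RR_INTERVAL	1

/* Best deadline rank a nice value can earn (cf. the patch's deadline()):
 * nice 0 -> 19, nice +19 -> 0, nice -20 -> 39. */
static unsigned int best_rank(int nice)
{
	unsigned int user_prio = (unsigned int)(nice + 20);
	return user_prio < 40 ? 39 - user_prio : 0;
}

/* Full slice: one RR_INTERVAL per earned rank, plus one (cf. the patch's
 * slice()); a nice-0 task at its best rank runs 20 intervals. */
static unsigned int slice_len(unsigned int rank)
{
	return RR_INTERVAL * (rank + 1);
}

/* Dynamic priority after `used` jiffies of the slice have been consumed
 * (cf. the patch's effective_prio()): with `interactive` set, first_slice
 * spans the whole slice, so the task holds its top stair until the slice
 * expires; with it clear, the task descends one stair per RR_INTERVAL. */
static int staircase_prio(unsigned int rank, unsigned int used, int interactive)
{
	int prio = MAX_PRIO - 1 - rank;
	unsigned int first_slice = RR_INTERVAL;

	if (interactive)
		first_slice *= rank + 1;
	if (used < first_slice)
		return prio;
	prio += 1 + (used - first_slice) / RR_INTERVAL;
	return prio > MAX_PRIO - 1 ? MAX_PRIO - 1 : prio;
}
```

With interactive enabled, a nice-0 task at full rank sits at priority 120 for its whole 20-interval slice and only descends when dec_deadline() lowers its rank at slice expiry; with interactive disabled it loses a stair every interval, giving the stricter nice-proportional distribution the changelog describes.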

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
@ 2004-06-06 16:59 ` Jan Killius
  2004-06-06 17:40 ` Felipe Alfaro Solana
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Jan Killius @ 2004-06-06 16:59 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3041 bytes --]

On Sunday 06 June 2004 17:39, you wrote:
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.
>
>
> Aims:
>  - Interactive by design rather than have interactivity bolted on.
>  - Good scalability.
>  - Simple predictable design.
>  - Maintain appropriate cpu distribution and fairness.
>  - Low scheduling latency for normal policy tasks.
>  - Low overhead.
>  - Making renicing processes actually matter for CPU distribution (nice 0
> gets 20 times what nice +20 gets)
>  - Resistant to priority inversion
>  - More forgiving of poor locking
>  - Tunable for a server workload or computational tasks
>
>
> Description:
>  - All tasks start at a dynamic priority based on their nice value. They
> run for one RR_INTERVAL (nominally set to 10ms) and then drop one stair
> (priority). If they sleep before running again they get to run for 2
> intervals before being demoted a priority and so on until they get all
> their intervals at their best priority: 20 intervals for nice 0; 1 interval
> for nice +19.
>
>
> - The staircase elevation mechanism can be disabled and all tasks can
> simply descend stairs using the sysctl:
>
> echo 0 > /proc/sys/kernel/interactive
>
> this has the effect of maintaining cpu distribution much more strictly
> according to nice whereas the default mechanism allows bursts of cpu by
> interactive tasks before settling to appropriate cpu distribution.
>
>
> - The time tasks are cpu bound can be increased by using the sysctl:
>
> echo 1 > /proc/sys/kernel/compute
>
> which extends the RR_INTERVAL to 100ms and disables the staircase elevation
> improving conditions for pure computational tasks by optimising cache
> benefits and decreasing context switching (gains another 1.5% on
> kernbench).
>
>
> Performance:
> - All cpu throughput benchmarks show equivalent or better performance than
> mainline. Note that disabling the interactive setting actually _worsens_
> some benchmarks because of their dependence on yield() so I don't recommend
> disabling it unless you do a comparison first.
> - Interactivity is approximately equivalent to mainline 2.6 but with faster
> application startup and no known behavioural quirks.
>
>
> Comments and testing welcome.
>
>  fs/proc/array.c        |    2
>  include/linux/sched.h  |    8
>  include/linux/sysctl.h |    2
>  kernel/sched.c         |  676
> +++++++++++++------------------------------------
>  kernel/sysctl.c        |   16 +
>  5 files changed, 212 insertions(+), 492 deletions(-)
>
> Can be downloaded here:
> http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3
>
> and below
> Con
The same patch, modified to apply cleanly on 2.6.7-rc2-mm2.
-- 
        Jan

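The fork handling in the attached patch halves the remaining slice between parent and child (in sched_fork()) and starts the child at deadline rank 0 (in wake_up_forked_process()), so a fork bomb cannot mint fresh CPU time. A minimal userspace sketch of that split, with a toy struct standing in for task_struct:

```c
#include <assert.h>

/* Toy task holding just the staircase fields this sketch needs. */
struct task {
	unsigned int slice;		/* full staircase slice, in jiffies */
	unsigned int time_slice;	/* remaining RR interval, in jiffies */
	unsigned int deadline;		/* earned deadline rank */
};

/* Fork-time split mirroring the patch: the child gets half the parent's
 * remaining time_slice and slice (rounded up), the parent keeps the other
 * half, and the child restarts at deadline rank 0. */
static void fork_slices(struct task *parent, struct task *child)
{
	child->time_slice = (parent->time_slice + 1) >> 1;
	child->slice = (parent->slice + 1) >> 1;
	parent->time_slice >>= 1;
	parent->slice >>= 1;
	child->deadline = 0;
}
```

Because the two halves sum to the original value, total runnable CPU time is conserved across the fork; the remainder can later flow back to the parent via sched_exit() if the child exits within its first timeslice.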
[-- Attachment #2: patch-2.6.7-rc2-mm2-s6.3 --]
[-- Type: text/x-diff, Size: 34235 bytes --]

diff -Naurp linux-2.6.7-rc2-mm2/fs/proc/array.c linux-2.6.7-rc2-mm2-s6.3/fs/proc/array.c
--- linux-2.6.7-rc2-mm2/fs/proc/array.c	2004-06-06 18:51:52.000000000 +0200
+++ linux-2.6.7-rc2-mm2-s6.3/fs/proc/array.c	2004-06-06 18:27:58.343412208 +0200
@@ -155,7 +155,6 @@ static inline char * task_state(struct t
 	read_lock(&tasklist_lock);
 	buffer += sprintf(buffer,
 		"State:\t%s\n"
-		"SleepAVG:\t%lu%%\n"
 		"Tgid:\t%d\n"
 		"Pid:\t%d\n"
 		"PPid:\t%d\n"
@@ -163,7 +162,6 @@ static inline char * task_state(struct t
 		"Uid:\t%d\t%d\t%d\t%d\n"
 		"Gid:\t%d\t%d\t%d\t%d\n",
 		get_task_state(p),
-		(p->sleep_avg/1024)*100/(1020000000/1024),
 	       	p->tgid,
 		p->pid, p->pid ? p->real_parent->pid : 0,
 		p->pid && p->ptrace ? p->parent->pid : 0,
diff -Naurp linux-2.6.7-rc2-mm2/include/linux/sched.h linux-2.6.7-rc2-mm2-s6.3/include/linux/sched.h
--- linux-2.6.7-rc2-mm2/include/linux/sched.h	2004-06-06 18:51:52.000000000 +0200
+++ linux-2.6.7-rc2-mm2-s6.3/include/linux/sched.h	2004-06-06 18:27:58.343412208 +0200
@@ -172,6 +172,7 @@ extern void show_stack(struct task_struc
 
 void io_schedule(void);
 long io_schedule_timeout(long timeout);
+extern int interactive, compute;
 
 extern void cpu_init (void);
 extern void trap_init(void);
@@ -410,14 +411,13 @@ struct task_struct {
 	struct list_head run_list;
 	prio_array_t *array;
 
-	unsigned long sleep_avg;
-	long interactive_credit;
 	unsigned long long timestamp;
-	int activated;
+	unsigned long runtime, totalrun;
+	unsigned int deadline;
 
 	unsigned long policy;
 	cpumask_t cpus_allowed;
-	unsigned int time_slice, first_time_slice;
+	unsigned int slice, time_slice, first_time_slice;
 
 	struct list_head tasks;
 	struct list_head ptrace_children;
diff -Naurp linux-2.6.7-rc2-mm2/include/linux/sysctl.h linux-2.6.7-rc2-mm2-s6.3/include/linux/sysctl.h
--- linux-2.6.7-rc2-mm2/include/linux/sysctl.h	2004-06-06 18:51:52.000000000 +0200
+++ linux-2.6.7-rc2-mm2-s6.3/include/linux/sysctl.h	2004-06-06 18:27:58.345411904 +0200
@@ -133,6 +133,8 @@ enum
 	KERN_NGROUPS_MAX=63,	/* int: NGROUPS_MAX */
 	KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */
 	KERN_HZ_TIMER=65,	/* int: hz timer on or off */
+	KERN_INTERACTIVE=66,	/* interactive tasks to have cpu bursts */
+	KERN_COMPUTE=67,	/* adjust timeslices for a compute server */
 };
 
 
diff -Naurp linux-2.6.7-rc2-mm2/kernel/sched.c linux-2.6.7-rc2-mm2-s6.3/kernel/sched.c
--- linux-2.6.7-rc2-mm2/kernel/sched.c	2004-06-06 18:51:52.000000000 +0200
+++ linux-2.6.7-rc2-mm2-s6.3/kernel/sched.c	2004-06-06 18:33:49.541022056 +0200
@@ -16,7 +16,10 @@
  *		by Davide Libenzi, preemptible kernel bits by Robert Love.
  *  2003-09-03	Interactivity tuning by Con Kolivas.
  *  2004-04-02	Scheduler domains code by Nick Piggin
- */
+ *  2004-03-19	New staircase scheduling policy by Con Kolivas with help
+ *		from Zwane Mwaikambo and useful suggestions by
+ *		William Lee Irwin III.
+ */
 
 #include <linux/mm.h>
 #include <linux/module.h>
@@ -70,8 +73,6 @@
 #define USER_PRIO(p)		((p)-MAX_RT_PRIO)
 #define TASK_USER_PRIO(p)	USER_PRIO((p)->static_prio)
 #define MAX_USER_PRIO		(USER_PRIO(MAX_PRIO))
-#define AVG_TIMESLICE	(MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
-			(MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))
 
 /*
  * Some helpers for converting nanosecond timing to jiffy resolution
@@ -79,110 +80,18 @@
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
 #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
 
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
- * maximum timeslice is 200 msecs. Timeslices get refilled after
- * they expire.
- */
-#define MIN_TIMESLICE		( 10 * HZ / 1000)
-#define MAX_TIMESLICE		(200 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT	 30
-#define CHILD_PENALTY		 95
-#define PARENT_PENALTY		100
-#define EXIT_WEIGHT		  3
-#define PRIO_BONUS_RATIO	 25
-#define MAX_BONUS		(MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA	  2
-#define MAX_SLEEP_AVG		(AVG_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT	(MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG	(JIFFIES_TO_NS(MAX_SLEEP_AVG))
-#define CREDIT_LIMIT		100
-
-/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
- *
- * This part scales the interactivity limit depending on niceness.
- *
- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
- * Here are a few examples of different nice levels:
- *
- *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
- *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
- *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
- *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
- *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
- *
- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
- *  priority range a task can explore, a value of '1' means the
- *  task is rated interactive.)
- *
- * Ie. nice +19 tasks can never get 'interactive' enough to be
- * reinserted into the active array. And only heavily CPU-hog nice -20
- * tasks will be expired. Default nice 0 tasks are somewhere between,
- * it takes some effort for them to get interactive, but it's not
- * too hard.
- */
-
-#define CURRENT_BONUS(p) \
-	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
-		MAX_SLEEP_AVG)
-
-#ifdef CONFIG_SMP
-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
-			num_online_cpus())
-#else
-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
-#endif
-
-#define SCALE(v1,v1_max,v2_max) \
-	(v1) * (v2_max) / (v1_max)
-
-#define DELTA(p) \
-	(SCALE(TASK_NICE(p), 40, MAX_BONUS) + INTERACTIVE_DELTA)
-
-#define TASK_INTERACTIVE(p) \
-	((p)->prio <= (p)->static_prio - DELTA(p))
-
-#define INTERACTIVE_SLEEP(p) \
-	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
-		(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
-#define HIGH_CREDIT(p) \
-	((p)->interactive_credit > CREDIT_LIMIT)
-
-#define LOW_CREDIT(p) \
-	((p)->interactive_credit < -CREDIT_LIMIT)
+int compute = 0;
+/*
+ * This is the time all tasks within the same priority round robin.
+ * The compute setting is reserved for dedicated computational scheduling
+ * and has ten times larger intervals.
+ */
+#define _RR_INTERVAL		((10 * HZ / 1000) ? : 1)
+#define RR_INTERVAL		(_RR_INTERVAL * (1 + 9 * compute))
 
 #define TASK_PREEMPTS_CURR(p, rq) \
 	((p)->prio < (rq)->curr->prio)
 
-/*
- * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
- * to time slice values.
- *
- * The higher a thread's priority, the bigger timeslices
- * it gets during one round of execution. But even the lowest
- * priority thread gets MIN_TIMESLICE worth of execution time.
- *
- * task_timeslice() is the interface that is used by the scheduler.
- */
-
-#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
-		((MAX_TIMESLICE - MIN_TIMESLICE) * \
-			(MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))
-
-static unsigned int task_timeslice(task_t *p)
-{
-	return BASE_TIMESLICE(p);
-}
-
 #define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
 
 /*
@@ -196,7 +105,7 @@ typedef struct runqueue runqueue_t;
 struct prio_array {
 	unsigned int nr_active;
 	unsigned long bitmap[BITMAP_SIZE];
-	struct list_head queue[MAX_PRIO];
+	struct list_head queue[MAX_PRIO + 1];
 };
 
 /*
@@ -218,13 +127,12 @@ struct runqueue {
 	unsigned long cpu_load;
 #endif
 	unsigned long long nr_switches;
-	unsigned long expired_timestamp, nr_uninterruptible;
+	unsigned long nr_uninterruptible;
 	unsigned long long timestamp_last_tick;
 	task_t *curr, *idle;
 
 	struct mm_struct *prev_mm;
-	prio_array_t *active, *expired, arrays[2];
-	int best_expired_prio;
+	prio_array_t array;
 	atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
@@ -406,16 +314,18 @@ static inline void rq_unlock(runqueue_t 
 /*
  * Adding/removing a task to/from a priority array:
  */
-static void dequeue_task(struct task_struct *p, prio_array_t *array)
+static void dequeue_task(struct task_struct *p, runqueue_t *rq)
 {
+	prio_array_t* array = &rq->array;
 	array->nr_active--;
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
 		__clear_bit(p->prio, array->bitmap);
 }
 
-static void enqueue_task(struct task_struct *p, prio_array_t *array)
+static void enqueue_task(struct task_struct *p, runqueue_t *rq)
 {
+	prio_array_t* array = &rq->array;
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
 	array->nr_active++;
@@ -427,8 +337,9 @@ static void enqueue_task(struct task_str
  * remote queue so we want these tasks to show up at the head of the
  * local queue:
  */
-static inline void enqueue_task_head(struct task_struct *p, prio_array_t *array)
+static inline void enqueue_task_head(struct task_struct *p, runqueue_t *rq)
 {
+	prio_array_t* array = &rq->array;
 	list_add(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
 	array->nr_active++;
@@ -436,42 +347,11 @@ static inline void enqueue_task_head(str
 }
 
 /*
- * effective_prio - return the priority that is based on the static
- * priority but is modified by bonuses/penalties.
- *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
- * We use 25% of the full 0...39 priority range so that:
- *
- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
- *
- * Both properties are important to certain workloads.
- */
-static int effective_prio(task_t *p)
-{
-	int bonus, prio;
-
-	if (rt_task(p))
-		return p->prio;
-
-	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
-
-	prio = p->static_prio - bonus;
-	if (prio < MAX_RT_PRIO)
-		prio = MAX_RT_PRIO;
-	if (prio > MAX_PRIO-1)
-		prio = MAX_PRIO-1;
-	return prio;
-}
-
-/*
  * __activate_task - move a task to the runqueue.
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
-	enqueue_task(p, rq->active);
+	enqueue_task(p, rq);
 	rq->nr_running++;
 }
 
@@ -480,95 +360,112 @@ static inline void __activate_task(task_
  */
 static inline void __activate_idle_task(task_t *p, runqueue_t *rq)
 {
-	enqueue_task_head(p, rq->active);
+	enqueue_task_head(p, rq);
 	rq->nr_running++;
 }
 
-static void recalc_task_prio(task_t *p, unsigned long long now)
+// deadline - the best deadline rank a task can have.
+static unsigned int deadline(task_t *p)
 {
-	unsigned long long __sleep_time = now - p->timestamp;
-	unsigned long sleep_time;
-
-	if (__sleep_time > NS_MAX_SLEEP_AVG)
-		sleep_time = NS_MAX_SLEEP_AVG;
+	unsigned int task_user_prio;
+	if (rt_task(p))
+		return p->deadline;
+	task_user_prio = TASK_USER_PRIO(p);
+	if (likely(task_user_prio < 40))
+		return 39 - task_user_prio;
 	else
-		sleep_time = (unsigned long)__sleep_time;
+		return 0;
+}
 
-	if (likely(sleep_time > 0)) {
-		/*
-		 * User tasks that sleep a long time are categorised as
-		 * idle and will get just interactive status to stay active &
-		 * prevent them suddenly becoming cpu hogs and starving
-		 * other processes.
-		 */
-		if (p->mm && p->activated != -1 &&
-			sleep_time > INTERACTIVE_SLEEP(p)) {
-				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
-						AVG_TIMESLICE);
-				if (!HIGH_CREDIT(p))
-					p->interactive_credit++;
-		} else {
-			/*
-			 * The lower the sleep avg a task has the more
-			 * rapidly it will rise with sleep time.
-			 */
-			sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+static void inc_deadline(task_t *p)
+{
+	unsigned int best_deadline;
+	best_deadline = deadline(p);
+	if (!p->mm || rt_task(p) || !best_deadline)
+		return;
+	if (p->deadline < best_deadline)
+		p->deadline++;
+}
 
-			/*
-			 * Tasks with low interactive_credit are limited to
-			 * one timeslice worth of sleep avg bonus.
-			 */
-			if (LOW_CREDIT(p) &&
-			    sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
-				sleep_time = JIFFIES_TO_NS(task_timeslice(p));
+static void dec_deadline(task_t *p)
+{
+	if (!p->mm || rt_task(p))
+		return;
+	if (p->deadline)
+		p->deadline--;
+}
 
-			/*
-			 * Non high_credit tasks waking from uninterruptible
-			 * sleep are limited in their sleep_avg rise as they
-			 * are likely to be cpu hogs waiting on I/O
-			 */
-			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm) {
-				if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
-					sleep_time = 0;
-				else if (p->sleep_avg + sleep_time >=
-						INTERACTIVE_SLEEP(p)) {
-					p->sleep_avg = INTERACTIVE_SLEEP(p);
-					sleep_time = 0;
-				}
-			}
+// slice - the duration a task runs before losing a deadline rank.
+static unsigned int slice(task_t *p)
+{
+	unsigned int slice = RR_INTERVAL;
+	if (!rt_task(p))
+		slice += deadline(p) * RR_INTERVAL;
+	return slice;
+}
 
-			/*
-			 * This code gives a bonus to interactive tasks.
-			 *
-			 * The boost works by updating the 'average sleep time'
-			 * value here, based on ->timestamp. The more time a
-			 * task spends sleeping, the higher the average gets -
-			 * and the higher the priority boost gets as well.
-			 */
-			p->sleep_avg += sleep_time;
+// interactive - interactive tasks get longer intervals at best priority
+int interactive = 1;
 
-			if (p->sleep_avg > NS_MAX_SLEEP_AVG) {
-				p->sleep_avg = NS_MAX_SLEEP_AVG;
-				if (!HIGH_CREDIT(p))
-					p->interactive_credit++;
-			}
+/*
+ * effective_prio - dynamic priority dependent on deadline rank.
+ * With 0 deadline ranks the priority decreases by one each RR_INTERVAL.
+ * As the deadline rank increases the priority stays at the top "stair" or
+ * value for longer.
+ */
+static int effective_prio(task_t *p)
+{
+	int prio;
+	unsigned int full_slice, used_slice, first_slice;
+	unsigned int best_deadline;
+	if (rt_task(p))
+		return p->prio;
+
+	best_deadline = deadline(p);
+	full_slice = slice(p);
+	used_slice = full_slice - p->slice;
+	if (p->deadline > best_deadline || !p->mm)
+		p->deadline = best_deadline;
+	first_slice = RR_INTERVAL;
+	if (interactive && !compute)
+		first_slice *= (p->deadline + 1);
+	prio = MAX_PRIO - 1 - best_deadline;
+
+	if (used_slice < first_slice)
+		return prio;
+	if (p->mm)
+		prio += 1 + (used_slice - first_slice) / RR_INTERVAL;
+	if (prio > MAX_PRIO - 1)
+		prio = MAX_PRIO - 1;
+	return prio;
+}
+
+static void recalc_task_prio(task_t *p, unsigned long long now)
+{
+	unsigned long sleep_time = now - p->timestamp;
+	unsigned long run_time = NS_TO_JIFFIES(p->runtime);
+	unsigned long total_run = NS_TO_JIFFIES(p->totalrun) + run_time;
+	if (!run_time && NS_TO_JIFFIES(p->runtime + sleep_time) < RR_INTERVAL) {
+		if (p->slice - total_run < 1) {
+			p->totalrun = 0;
+			dec_deadline(p);
+		} else {
+			p->totalrun += p->runtime;
+			p->slice -= NS_TO_JIFFIES(p->totalrun);
 		}
+	} else {
+		inc_deadline(p);
+		p->runtime = 0;
+		p->totalrun = 0;
 	}
-
-	p->prio = effective_prio(p);
 }
 
 /*
  * activate_task - move a task to the runqueue and do priority recalculation
- *
- * Update all the scheduling statistics stuff. (sleep average
- * calculation, priority modifiers, etc.)
  */
 static void activate_task(task_t *p, runqueue_t *rq, int local)
 {
-	unsigned long long now;
-
-	now = sched_clock();
+	unsigned long long now = sched_clock();
 #ifdef CONFIG_SMP
 	if (!local) {
 		/* Compensate for drifting sched_clock */
@@ -577,33 +474,14 @@ static void activate_task(task_t *p, run
 			+ rq->timestamp_last_tick;
 	}
 #endif
-
-	recalc_task_prio(p, now);
-
-	/*
-	 * This checks to make sure it's not an uninterruptible task
-	 * that is now waking up.
-	 */
-	if (!p->activated) {
-		/*
-		 * Tasks which were woken up by interrupts (ie. hw events)
-		 * are most likely of interactive nature. So we give them
-		 * the credit of extending their sleep time to the period
-		 * of time they spend on the runqueue, waiting for execution
-		 * on a CPU, first time around:
-		 */
-		if (in_interrupt())
-			p->activated = 2;
-		else {
-			/*
-			 * Normal first-time wakeups get a credit too for
-			 * on-runqueue time, but it will be weighted down:
-			 */
-			p->activated = 1;
-		}
-	}
+	p->slice = slice(p);
+	if (!rt_task(p) && p->mm)
+		recalc_task_prio(p, now);
+	else
+		inc_deadline(p);
+	p->prio = effective_prio(p);
+	p->time_slice = RR_INTERVAL;
 	p->timestamp = now;
-
 	__activate_task(p, rq);
 }
 
@@ -613,9 +491,12 @@ static void activate_task(task_t *p, run
 static void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
 	rq->nr_running--;
-	if (p->state == TASK_UNINTERRUPTIBLE)
+	if (p->state == TASK_UNINTERRUPTIBLE) {
 		rq->nr_uninterruptible++;
-	dequeue_task(p, p->array);
+		if (p->deadline > (deadline(p) - 2) && p->deadline && p->mm)
+			p->deadline--;
+	}
+	dequeue_task(p, rq);
 	p->array = NULL;
 }
 
@@ -946,14 +827,8 @@ out_set_cpu:
 
 out_activate:
 #endif /* CONFIG_SMP */
-	if (old_state == TASK_UNINTERRUPTIBLE) {
+	if (old_state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible--;
-		/*
-		 * Tasks on involuntary sleep don't earn
-		 * sleep_avg beyond just interactive state.
-		 */
-		p->activated = -1;
-	}
 
 	/*
 	 * Sync wakeups (i.e. those types of wakeups where the waker
@@ -1023,12 +898,14 @@ void fastcall sched_fork(task_t *p)
 	 */
 	local_irq_disable();
 	p->time_slice = (current->time_slice + 1) >> 1;
+	p->slice = (current->slice + 1) >> 1;
 	/*
 	 * The remainder of the first timeslice might be recovered by
 	 * the parent if the child exits early enough.
 	 */
 	p->first_time_slice = 1;
 	current->time_slice >>= 1;
+	current->slice >>= 1;
 	p->timestamp = sched_clock();
 	if (!current->time_slice) {
 		/*
@@ -1056,33 +933,14 @@ void fastcall wake_up_forked_process(tas
 	unsigned long flags;
 	runqueue_t *rq = task_rq_lock(current, &flags);
 
+	// Forked process gets a lower deadline rank to prevent fork bombs.
+	p->deadline = 0;
 	BUG_ON(p->state != TASK_RUNNING);
 
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive.
-	 */
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->interactive_credit = 0;
-
-	p->prio = effective_prio(p);
 	set_task_cpu(p, smp_processor_id());
 
-	if (unlikely(!current->array))
-		__activate_task(p, rq);
-	else {
-		p->prio = current->prio;
-		list_add_tail(&p->run_list, &current->run_list);
-		p->array = current->array;
-		p->array->nr_active++;
-		rq->nr_running++;
-	}
+	p->prio = effective_prio(p);
+	__activate_task(p, rq);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -1103,19 +961,11 @@ void fastcall sched_exit(task_t * p)
 	local_irq_save(flags);
 	if (p->first_time_slice) {
 		p->parent->time_slice += p->time_slice;
-		if (unlikely(p->parent->time_slice > MAX_TIMESLICE))
-			p->parent->time_slice = MAX_TIMESLICE;
+		if (unlikely(p->parent->time_slice > slice(p->parent)))
+			p->parent->time_slice = slice(p->parent);
 	}
 	local_irq_restore(flags);
-	/*
-	 * If the child was a (relative-) CPU hog then decrease
-	 * the sleep_avg of the parent as well.
-	 */
 	rq = task_rq_lock(p->parent, &flags);
-	if (p->sleep_avg < p->parent->sleep_avg)
-		p->parent->sleep_avg = p->parent->sleep_avg /
-		(EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
-		(EXIT_WEIGHT + 1);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -1379,18 +1229,6 @@ lock_again:
 		double_rq_unlock(this_rq, rq);
 		goto lock_again;
 	}
-	/*
-	 * We decrease the sleep average of forking parents
-	 * and children as well, to keep max-interactive tasks
-	 * from forking tasks that are max-interactive.
-	 */
-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
-	p->interactive_credit = 0;
 
 	p->prio = effective_prio(p);
 	set_task_cpu(p, cpu);
@@ -1505,15 +1343,15 @@ static void double_lock_balance(runqueue
  * pull_task - move a task from a remote runqueue to the local runqueue.
  * Both runqueues must be locked.
  */
-static inline
-void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p,
-	       runqueue_t *this_rq, prio_array_t *this_array, int this_cpu)
+static inline 
+void pull_task(runqueue_t *src_rq, task_t *p, 
+		runqueue_t *this_rq, int this_cpu)
 {
-	dequeue_task(p, src_array);
+	dequeue_task(p, src_rq);
 	src_rq->nr_running--;
 	set_task_cpu(p, this_cpu);
 	this_rq->nr_running++;
-	enqueue_task(p, this_array);
+	enqueue_task(p, this_rq);
 	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
 				+ this_rq->timestamp_last_tick;
 	/*
@@ -1570,29 +1408,16 @@ static int move_tasks(runqueue_t *this_r
 		      unsigned long max_nr_move, struct sched_domain *sd,
 		      enum idle_type idle)
 {
-	prio_array_t *array, *dst_array;
+	prio_array_t* array, *dst_array;
 	struct list_head *head, *curr;
 	int idx, pulled = 0;
 	task_t *tmp;
 
 	if (max_nr_move <= 0 || busiest->nr_running <= 1)
 		goto out;
+	array = &busiest->array;
+	dst_array = &this_rq->array;
 
-	/*
-	 * We first consider expired tasks. Those will likely not be
-	 * executed in the near future, and they are most likely to
-	 * be cache-cold, thus switching CPUs has the least effect
-	 * on them.
-	 */
-	if (busiest->expired->nr_active) {
-		array = busiest->expired;
-		dst_array = this_rq->expired;
-	} else {
-		array = busiest->active;
-		dst_array = this_rq->active;
-	}
-
-new_array:
 	/* Start searching at priority 0: */
 	idx = 0;
 skip_bitmap:
@@ -1600,14 +1425,8 @@ skip_bitmap:
 		idx = sched_find_first_bit(array->bitmap);
 	else
 		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
-	if (idx >= MAX_PRIO) {
-		if (array == busiest->expired && busiest->active->nr_active) {
-			array = busiest->active;
-			dst_array = this_rq->active;
-			goto new_array;
-		}
+	if (idx >= MAX_PRIO) 
 		goto out;
-	}
 
 	head = array->queue + idx;
 	curr = head->prev;
@@ -1622,7 +1441,7 @@ skip_queue:
 		idx++;
 		goto skip_bitmap;
 	}
-	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
+	pull_task(busiest, tmp, this_rq, this_cpu);
 	pulled++;
 
 	/* We only want to steal up to the prescribed number of tasks. */
@@ -2114,22 +1933,6 @@ DEFINE_PER_CPU(struct kernel_stat, kstat
 EXPORT_PER_CPU_SYMBOL(kstat);
 
 /*
- * We place interactive tasks back into the active array, if possible.
- *
- * To guarantee that this does not starve expired tasks we ignore the
- * interactivity of a task if the first expired task had to wait more
- * than a 'reasonable' amount of time. This deadline timeout is
- * load-dependent, as the frequency of array switched decreases with
- * increasing number of running tasks. We also ignore the interactivity
- * if a better static_prio task has expired:
- */
-#define EXPIRED_STARVING(rq) \
-	((STARVATION_LIMIT && ((rq)->expired_timestamp && \
-		(jiffies - (rq)->expired_timestamp >= \
-			STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
-			((rq)->curr->static_prio > (rq)->best_expired_prio))
-
-/*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
  *
@@ -2174,76 +1977,38 @@ void scheduler_tick(int user_ticks, int 
 	cpustat->system += sys_ticks;
 
 	/* Task might have expired already, but not scheduled off yet */
-	if (p->array != rq->active) {
+	if (p->array != &rq->array) {
 		set_tsk_need_resched(p);
 		goto out;
 	}
 	spin_lock(&rq->lock);
-	/*
-	 * The task was running during this tick - update the
-	 * time slice counter. Note: we do not update a thread's
-	 * priority until it either goes to sleep or uses up its
-	 * timeslice. This makes it possible for interactive tasks
-	 * to use up their timeslices at their highest priority levels.
-	 */
-	if (unlikely(rt_task(p))) {
-		/*
-		 * RR tasks need a special form of timeslice management.
-		 * FIFO tasks have no timeslices.
-		 */
-		if ((p->policy == SCHED_RR) && !--p->time_slice) {
-			p->time_slice = task_timeslice(p);
-			p->first_time_slice = 0;
-			set_tsk_need_resched(p);
-
-			/* put it at the end of the queue: */
-			dequeue_task(p, rq->active);
-			enqueue_task(p, rq->active);
-		}
+	
+	// SCHED_FIFO tasks never run out of timeslice.
+	if (unlikely(p->policy == SCHED_FIFO))
+		goto out_unlock;
+	// Tasks lose a deadline rank each time they use up a full slice().
+	if (!--p->slice) {
+		set_tsk_need_resched(p);
+		dequeue_task(p, rq);
+		dec_deadline(p);
+		p->slice = slice(p);
+		p->prio = effective_prio(p);
+		p->time_slice = RR_INTERVAL;
+		enqueue_task(p, rq);
+		p->first_time_slice = 0;
 		goto out_unlock;
 	}
+	/*
+	 * Tasks that run out of time_slice but still have slice left get
+	 * requeued with a lower priority && RR_INTERVAL time_slice.
+	 */
 	if (!--p->time_slice) {
-		dequeue_task(p, rq->active);
 		set_tsk_need_resched(p);
+		dequeue_task(p, rq);
 		p->prio = effective_prio(p);
-		p->time_slice = task_timeslice(p);
-		p->first_time_slice = 0;
-
-		if (!rq->expired_timestamp)
-			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
-				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
-	} else {
-		/*
-		 * Prevent a too long timeslice allowing a task to monopolize
-		 * the CPU. We do this by splitting up the timeslice into
-		 * smaller pieces.
-		 *
-		 * Note: this does not mean the task's timeslices expire or
-		 * get lost in any way, they just might be preempted by
-		 * another task of equal priority. (one with higher
-		 * priority would have preempted this task already.) We
-		 * requeue this task to the end of the list on this priority
-		 * level, which is in essence a round-robin of tasks with
-		 * equal priority.
-		 *
-		 * This only applies to tasks in the interactive
-		 * delta range with at least TIMESLICE_GRANULARITY to requeue.
-		 */
-		if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
-			p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
-			(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
-			(p->array == rq->active)) {
-
-			dequeue_task(p, rq->active);
-			set_tsk_need_resched(p);
-			p->prio = effective_prio(p);
-			enqueue_task(p, rq->active);
-		}
+		p->time_slice = RR_INTERVAL;
+		enqueue_task(p, rq);
+		goto out_unlock;
 	}
 out_unlock:
 	spin_unlock(&rq->lock);
@@ -2307,8 +2072,8 @@ static inline int dependent_sleeper(int 
 		 * task from using an unfair proportion of the
 		 * physical cpu's resources. -ck
 		 */
-		if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) >
-			task_timeslice(p) || rt_task(smt_curr)) &&
+		if (((smt_curr->slice * (100 - sd->per_cpu_gain) / 100) >
+			slice(p) || rt_task(smt_curr)) &&
 			p->mm && smt_curr->mm && !rt_task(p))
 				ret = 1;
 
@@ -2317,8 +2082,8 @@ static inline int dependent_sleeper(int 
 		 * or wake it up if it has been put to sleep for priority
 		 * reasons.
 		 */
-		if ((((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
-			task_timeslice(smt_curr) || rt_task(p)) &&
+		if ((((p->slice * (100 - sd->per_cpu_gain) / 100) > 
+			slice(smt_curr) || rt_task(p)) && 
 			smt_curr->mm && p->mm && !rt_task(smt_curr)) ||
 			(smt_curr == smt_rq->idle && smt_rq->nr_running))
 				resched_task(smt_curr);
@@ -2344,10 +2109,9 @@ asmlinkage void __sched schedule(void)
 	long *switch_count;
 	task_t *prev, *next;
 	runqueue_t *rq;
-	prio_array_t *array;
+	prio_array_t* array;
 	struct list_head *queue;
 	unsigned long long now;
-	unsigned long run_time;
 	int cpu, idx;
 
 	/*
@@ -2370,19 +2134,8 @@ need_resched:
 	release_kernel_lock(prev);
  	schedstat_inc(rq, sched_cnt);
 	now = sched_clock();
-	if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
-		run_time = now - prev->timestamp;
-	else
-		run_time = NS_MAX_SLEEP_AVG;
-
-	/*
-	 * Tasks with interactive credits get charged less run_time
-	 * at high sleep_avg to delay them losing their interactive
-	 * status
-	 */
-	if (HIGH_CREDIT(prev))
-		run_time /= (CURRENT_BONUS(prev) ? : 1);
 
+	prev->runtime = now - prev->timestamp;
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -2404,58 +2157,24 @@ need_resched:
 		idle_balance(cpu, rq);
 		if (!rq->nr_running) {
 			next = rq->idle;
-			rq->expired_timestamp = 0;
 			wake_sleeping_dependent(cpu, rq);
  			schedstat_inc(rq, sched_idle);
 			goto switch_tasks;
 		}
 	}
 
-	array = rq->active;
-	if (unlikely(!array->nr_active)) {
-		/*
-		 * Switch the active and expired arrays.
-		 */
-		schedstat_inc(rq, sched_switch);
-		rq->active = rq->expired;
-		rq->expired = array;
-		array = rq->active;
-		rq->expired_timestamp = 0;
-		rq->best_expired_prio = MAX_PRIO;
-	}
-
+	array = &rq->array;
 	idx = sched_find_first_bit(array->bitmap);
 	queue = array->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
-
-	if (dependent_sleeper(cpu, rq, next)) {
+	if (dependent_sleeper(cpu, rq, next))
 		next = rq->idle;
-		goto switch_tasks;
-	}
-
-	if (!rt_task(next) && next->activated > 0) {
-		unsigned long long delta = now - next->timestamp;
-
-		if (next->activated == 1)
-			delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
 
-		array = next->array;
-		dequeue_task(next, array);
-		recalc_task_prio(next, next->timestamp + delta);
-		enqueue_task(next, array);
-	}
-	next->activated = 0;
 switch_tasks:
 	prefetch(next);
 	clear_tsk_need_resched(prev);
 	RCU_qsctr(task_cpu(prev))++;
 
-	prev->sleep_avg -= run_time;
-	if ((long)prev->sleep_avg <= 0) {
-		prev->sleep_avg = 0;
-		if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
-			prev->interactive_credit--;
-	}
 	prev->timestamp = now;
 
 	if (likely(prev != next)) {
@@ -2721,7 +2440,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
 void set_user_nice(task_t *p, long nice)
 {
 	unsigned long flags;
-	prio_array_t *array;
+	prio_array_t* array;
 	runqueue_t *rq;
 	int old_prio, new_prio, delta;
 
@@ -2744,7 +2463,7 @@ void set_user_nice(task_t *p, long nice)
 	}
 	array = p->array;
 	if (array)
-		dequeue_task(p, array);
+		dequeue_task(p, rq);
 
 	old_prio = p->prio;
 	new_prio = NICE_TO_PRIO(nice);
@@ -2753,7 +2472,7 @@ void set_user_nice(task_t *p, long nice)
 	p->prio += delta;
 
 	if (array) {
-		enqueue_task(p, array);
+		enqueue_task(p, rq);
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
@@ -2882,7 +2601,7 @@ static int setscheduler(pid_t pid, int p
 	struct sched_param lp;
 	int retval = -EINVAL;
 	int oldprio;
-	prio_array_t *array;
+	prio_array_t* array;
 	unsigned long flags;
 	runqueue_t *rq;
 	task_t *p;
@@ -3151,29 +2870,17 @@ out_unlock:
 /**
  * sys_sched_yield - yield the current processor to other threads.
  *
- * this function yields the current CPU by moving the calling thread
- * to the expired array. If there are no other threads running on this
  * CPU then this function will return.
  */
 asmlinkage long sys_sched_yield(void)
 {
 	runqueue_t *rq = this_rq_lock();
-	prio_array_t *array = current->array;
-	prio_array_t *target = rq->expired;
 
-	schedstat_inc(rq, yld_cnt);
-	/*
-	 * We implement yielding by moving the task into the expired
-	 * queue.
-	 *
-	 * (special rule: RT tasks will just roundrobin in the active
-	 *  array.)
-	 */
-	if (unlikely(rt_task(current)))
-		target = rq->active;
-
-	dequeue_task(current, array);
-	enqueue_task(current, target);
+	dequeue_task(current, rq);
+	current->slice = slice(current);
+	current->time_slice = current->slice;
+	current->prio = effective_prio(current);
+	enqueue_task(current, rq);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
@@ -3312,7 +3019,7 @@ long sys_sched_rr_get_interval(pid_t pid
 		goto out_unlock;
 
 	jiffies_to_timespec(p->policy & SCHED_FIFO ?
-				0 : task_timeslice(p), &t);
+				0 : slice(p), &t);
 	read_unlock(&tasklist_lock);
 	retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
 out_nounlock:
@@ -3433,6 +3140,7 @@ void __devinit init_idle(task_t *idle, i
 	idle->array = NULL;
 	idle->prio = MAX_PRIO;
 	idle->state = TASK_RUNNING;
+	idle->deadline = 0;
 	set_task_cpu(idle, cpu);
 	double_rq_unlock(idle_rq, rq);
 	set_tsk_need_resched(idle);
@@ -4068,7 +3776,7 @@ int in_sched_functions(unsigned long add
 void __init sched_init(void)
 {
 	runqueue_t *rq;
-	int i, j, k;
+	int i, j;
 
 #ifdef CONFIG_SMP
 	/* Set up an initial dummy domain for early boot */
@@ -4089,13 +3797,10 @@ void __init sched_init(void)
 #endif
 
 	for (i = 0; i < NR_CPUS; i++) {
-		prio_array_t *array;
+		prio_array_t* array;
 
 		rq = cpu_rq(i);
 		spin_lock_init(&rq->lock);
-		rq->active = rq->arrays;
-		rq->expired = rq->arrays + 1;
-		rq->best_expired_prio = MAX_PRIO;
 
 #ifdef CONFIG_SMP
 		rq->sd = &sched_domain_init;
@@ -4106,16 +3811,14 @@ void __init sched_init(void)
 		INIT_LIST_HEAD(&rq->migration_queue);
 #endif
 		atomic_set(&rq->nr_iowait, 0);
+		array = &rq->array;
 
-		for (j = 0; j < 2; j++) {
-			array = rq->arrays + j;
-			for (k = 0; k < MAX_PRIO; k++) {
-				INIT_LIST_HEAD(array->queue + k);
-				__clear_bit(k, array->bitmap);
-			}
-			// delimiter for bitsearch
-			__set_bit(MAX_PRIO, array->bitmap);
+		for (j = 0; j <= MAX_PRIO; j++) {
+			INIT_LIST_HEAD(array->queue + j);
+			__clear_bit(j, array->bitmap);
 		}
+		// delimiter for bitsearch
+		__set_bit(MAX_PRIO, array->bitmap);
 	}
 	/*
 	 * We have to do a little magic to get the first
diff -Naurp linux-2.6.7-rc2-mm2/kernel/sysctl.c linux-2.6.7-rc2-mm2-s6.3/kernel/sysctl.c
--- linux-2.6.7-rc2-mm2/kernel/sysctl.c	2004-06-06 18:51:52.000000000 +0200
+++ linux-2.6.7-rc2-mm2-s6.3/kernel/sysctl.c	2004-06-06 18:27:58.359409776 +0200
@@ -618,6 +618,22 @@ static ctl_table kern_table[] = {
 		.mode		= 0444,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= KERN_INTERACTIVE,
+		.procname	= "interactive",
+		.data		= &interactive,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= KERN_COMPUTE,
+		.procname	= "compute",
+		.data		= &compute,
+		.maxlen		= sizeof (int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };
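The new tick handler above demotes a task in two stages: using up time_slice requeues it one priority lower with a fresh RR_INTERVAL, while using up the whole slice costs it a deadline rank as well. A minimal userspace sketch of that bookkeeping (the slice_of() scaling and the priority arithmetic are illustrative assumptions, not the patch's exact formulas):

```c
#include <assert.h>

#define RR_INTERVAL 10  /* ticks per stair, mirroring the patch's nominal 10ms */

/* Hypothetical task state; slice_of() is an assumed stand-in for the
 * kernel's slice(), which scales the total slice with the task's rank. */
struct task {
	int prio;       /* dynamic priority: one stair down per RR_INTERVAL used */
	int deadline;   /* staircase rank, dropped when a whole slice is used up */
	int slice;      /* ticks left before losing a deadline rank */
	int time_slice; /* ticks left before being requeued one stair lower */
};

static int slice_of(const struct task *t)
{
	return (t->deadline + 1) * RR_INTERVAL;
}

/* One timer tick for the running task, echoing the two-step logic in
 * scheduler_tick(): exhausting slice drops a rank and refills everything;
 * exhausting time_slice refills it and steps one priority lower. */
static void tick(struct task *t)
{
	if (!--t->slice) {
		if (t->deadline > 0)
			t->deadline--;
		t->slice = slice_of(t);
		t->time_slice = RR_INTERVAL;
		t->prio++;
	} else if (!--t->time_slice) {
		t->time_slice = RR_INTERVAL;
		t->prio++;
	}
}
```

Running the sketch for three RR intervals shows the task descending one stair per interval and losing a deadline rank once its whole slice is consumed.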
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  2004-06-06 16:59 ` Jan Killius
@ 2004-06-06 17:40 ` Felipe Alfaro Solana
  2004-06-06 22:58   ` Con Kolivas
  2004-06-06 20:43 ` Prakash K. Cheemplavam
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Felipe Alfaro Solana @ 2004-06-06 17:40 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> This is an update of the scheduler policy mechanism rewrite using the 
> infrastructure of the current O(1) scheduler. Slight changes from the 
> original design require a detailed description. The change to the original 
> design has enabled all known corner cases to be abolished and cpu 
> distribution to be much better maintained. It has proven to be stable in my 
> testing and is ready for more widespread public testing now.

I'm impressed... I'm currently playing with linux-2.6.7-rc2-bk7 plus
staircase plus autoswappiness and my system behaves exceptionally well.
It seems pretty responsive even under heavy load (while true; do a=2;
done). Nice work.



* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  2004-06-06 16:59 ` Jan Killius
  2004-06-06 17:40 ` Felipe Alfaro Solana
@ 2004-06-06 20:43 ` Prakash K. Cheemplavam
  2004-06-06 22:55   ` Con Kolivas
  2004-06-06 20:59 ` Felipe Alfaro Solana
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Prakash K. Cheemplavam @ 2004-06-06 20:43 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Hi,

this is the first time I have tried this scheduler, with 2.6.7-rc2-mm2. A 
knock-out criterion for me: playing ut2004 generated a lot of static 
(sound-wise). I consider this a regression compared to O(1). Nick's 
scheduler plays nicely as well. For Nick's I have X reniced to -10; your 
scheduler doesn't like that either: when playing a tune via xmms and 
then switching to another (virtual) desktop, I get pops in the sound for 
fractions of a second. Putting X back to 0 fixes this, but I don't know 
how to "fix" ut2004 with staircase. :-(

Hth,

Prakash


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
                   ` (2 preceding siblings ...)
  2004-06-06 20:43 ` Prakash K. Cheemplavam
@ 2004-06-06 20:59 ` Felipe Alfaro Solana
  2004-06-06 22:57   ` Con Kolivas
  2004-06-06 23:47 ` Con Kolivas
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Felipe Alfaro Solana @ 2004-06-06 20:59 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> This is an update of the scheduler policy mechanism rewrite using the 
> infrastructure of the current O(1) scheduler. Slight changes from the 
> original design require a detailed description. The change to the original 
> design has enabled all known corner cases to be abolished and cpu 
> distribution to be much better maintained. It has proven to be stable in my 
> testing and is ready for more widespread public testing now.

It seems this staircase scheduler has some strange interactions with
networking and PM suspend. I can't suspend my laptop while any program is
using the network and, what's more, after the suspend mechanism fails,
the program that was using the network stays stuck in the D state and
can't be killed.

I can trigger this with evolution 1.4.8 using an IMAP mail account, or
simply by launching Konqueror, pointing it at a URL, and then trying to
suspend while Konqueror is retrieving the page from the Internet.

I'm open to suggestions.



* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 20:43 ` Prakash K. Cheemplavam
@ 2004-06-06 22:55   ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 22:55 UTC (permalink / raw)
  To: Prakash K. Cheemplavam
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 06:43, Prakash K. Cheemplavam wrote:
> Hi,

Hi. Thanks for testing.

> this is the first time I have tried this scheduler, with 2.6.7-rc2-mm2. A
> knock-out criterion for me: playing ut2004 generated a lot of static
> (sound-wise). I consider this a regression compared to O(1). Nick's
> scheduler plays nicely as well. For Nick's I have X reniced to -10; your
> scheduler doesn't like that either: when playing a tune via xmms and
> then switching to another (virtual) desktop, I get pops in the sound for
> fractions of a second. Putting X back to 0 fixes this, but I don't know
> how to "fix" ut2004 with staircase. :-(

Yes, this is designed for X at nice 0.

Try the interactive = 0 setting for gaming.

Con


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 20:59 ` Felipe Alfaro Solana
@ 2004-06-06 22:57   ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 22:57 UTC (permalink / raw)
  To: Felipe Alfaro Solana
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 06:59, Felipe Alfaro Solana wrote:
> On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This is an update of the scheduler policy mechanism rewrite using the
> > infrastructure of the current O(1) scheduler. Slight changes from the
> > original design require a detailed description. The change to the
> > original design has enabled all known corner cases to be abolished and
> > cpu distribution to be much better maintained. It has proven to be stable
> > in my testing and is ready for more widespread public testing now.
>
> It seems this staircase scheduler has some strange interactions with
> networking and PM suspend. I can't suspend my laptop when any program is
> using the network and, what's more, after the suspend mechanism fails,
> the program that was using the network stays stuck at D state and can't
> be killed.

Could well be the autoregulated vm swappiness, which triggers the known 
bug where suspend fails when swappiness=0. If not, this staircase 
scheduler is quite a large change to the scheduler priority design and 
there could be some weird interaction.

Con


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 17:40 ` Felipe Alfaro Solana
@ 2004-06-06 22:58   ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 22:58 UTC (permalink / raw)
  To: Felipe Alfaro Solana
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 03:40, Felipe Alfaro Solana wrote:
> On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This is an update of the scheduler policy mechanism rewrite using the
> > infrastructure of the current O(1) scheduler. Slight changes from the
> > original design require a detailed description. The change to the
> > original design has enabled all known corner cases to be abolished and
> > cpu distribution to be much better maintained. It has proven to be stable
> > in my testing and is ready for more widespread public testing now.
>
> I'm impressed... I'm currently playing with linux-2.6.7-rc2-bk7 plus
> staircase plus autoswappiness and my system behaves exceptionally. It
> seems pretty responsive even when under heavy load (while true; do a=2;
> done). Nice work.

Sounds good.

Thanks for testing.

Con


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
                   ` (3 preceding siblings ...)
  2004-06-06 20:59 ` Felipe Alfaro Solana
@ 2004-06-06 23:47 ` Con Kolivas
  2004-06-07 13:56 ` William Lee Irwin III
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 23:47 UTC (permalink / raw)
  To: Linux Kernel Mailinglist; +Cc: Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 01:39, Con Kolivas wrote:
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. 

Seems it wasn't clear unless you look at the code; this has only a single 
priority queue with no expired array.

Con
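The point can be illustrated with a userspace sketch of a single prio_array: a bitmap marks which per-priority queues are occupied, and the next task always comes from the lowest set bit, as sched_find_first_bit() does in the patch. The struct layout here (a counter per priority instead of list_heads, a single 64-bit bitmap) is a simplification, not the kernel's code:

```c
#include <assert.h>

#define MAX_PRIO 40  /* small illustrative value; the kernel's is larger */

/* One array per runqueue: bit p is set exactly when queue[p] is non-empty. */
struct prio_array {
	unsigned long long bitmap;
	int queue[MAX_PRIO];  /* task count per priority (lists in the kernel) */
};

static void enqueue(struct prio_array *a, int prio)
{
	a->queue[prio]++;
	a->bitmap |= 1ULL << prio;
}

static void dequeue(struct prio_array *a, int prio)
{
	if (--a->queue[prio] == 0)
		a->bitmap &= ~(1ULL << prio);
}

/* Lowest set bit = best runnable priority; MAX_PRIO means "nothing queued",
 * matching the delimiter bit the real bitsearch relies on. */
static int best_prio(const struct prio_array *a)
{
	return a->bitmap ? __builtin_ctzll(a->bitmap) : MAX_PRIO;
}
```

With no expired array, there is no array switch: tasks re-enter this one array at whatever priority the staircase assigns them.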


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
                   ` (4 preceding siblings ...)
  2004-06-06 23:47 ` Con Kolivas
@ 2004-06-07 13:56 ` William Lee Irwin III
  2004-06-07 13:57   ` William Lee Irwin III
  2004-06-07 19:57 ` Phy Prabab
  2004-06-29 11:10 ` Pavel Machek
  7 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 13:56 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 01:39:26AM +1000, Con Kolivas wrote:
> This is an update of the scheduler policy mechanism rewrite using the 
> infrastructure of the current O(1) scheduler. Slight changes from the 
> original design require a detailed description. The change to the original 
> design has enabled all known corner cases to be abolished and cpu 
> distribution to be much better maintained. It has proven to be stable in my 
> testing and is ready for more widespread public testing now.

There is only one "array" per runqueue; the removed code should be at
most a BUG_ON() or WARN_ON(); I opted to delete it altogether.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 06:19:45.938605000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 06:53:21.555185000 -0700
@@ -1789,13 +1789,7 @@
 		cpustat->user += user_ticks;
 	cpustat->system += sys_ticks;
 
-	/* Task might have expired already, but not scheduled off yet */
-	if (p->array != &rq->array) {
-		set_tsk_need_resched(p);
-		goto out;
-	}
 	spin_lock(&rq->lock);
-	
 	// SCHED_FIFO tasks never run out of timeslice.
 	if (unlikely(p->policy == SCHED_FIFO))
 		goto out_unlock;


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 13:56 ` William Lee Irwin III
@ 2004-06-07 13:57   ` William Lee Irwin III
  2004-06-07 14:04     ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 13:57 UTC (permalink / raw)
  To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 06:56:31AM -0700, William Lee Irwin III wrote:
> There is only one "array" per runqueue; the removed code should be at
> most a BUG_ON() or WARN_ON(); I opted to delete it altogether.

JIFFIES_TO_NS() is unused.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 06:49:10.104411000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 06:50:39.987747000 -0700
@@ -74,7 +74,6 @@
  * Some helpers for converting nanosecond timing to jiffy resolution
  */
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
 
 //This is the time all tasks within the same priority round robin.
 #define RR_INTERVAL		((10 * HZ / 1000) ? : 1)
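For reference, the RR_INTERVAL definition left in place by this hunk relies on a GNU C shorthand: `a ?: b` evaluates to `a` when it is non-zero, else `b`. A hedged sketch of why the `?: 1` guard matters (HZ is parameterized here purely for illustration; the kernel macro uses the build-time HZ directly):

```c
/* On low-HZ configs (HZ < 100), 10 * HZ / 1000 truncates to 0 in integer
 * arithmetic; the GNU "?: 1" fallback clamps the interval to at least
 * one jiffy so the round-robin period never becomes zero. */
#define RR_INTERVAL_FOR(hz) ((10 * (hz) / 1000) ? : 1)
```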


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 13:57   ` William Lee Irwin III
@ 2004-06-07 14:04     ` William Lee Irwin III
  2004-06-07 14:07       ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 14:04 UTC (permalink / raw)
  To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 06:57:38AM -0700, William Lee Irwin III wrote:
> JIFFIES_TO_NS() is unused.

array->nr_active only ever modified, never examined.


Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 06:53:26.779390000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 07:03:00.910109000 -0700
@@ -92,7 +92,6 @@
 typedef struct runqueue runqueue_t;
 
 struct prio_array {
-	unsigned int nr_active;
 	unsigned long bitmap[BITMAP_SIZE];
 	struct list_head queue[MAX_PRIO + 1];
 };
@@ -204,7 +203,6 @@
 static void dequeue_task(struct task_struct *p, runqueue_t *rq)
 {
 	prio_array_t* array = &rq->array;
-	array->nr_active--;
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
 		__clear_bit(p->prio, array->bitmap);
@@ -215,7 +213,6 @@
 	prio_array_t* array = &rq->array;
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
-	array->nr_active++;
 	p->array = array;
 }
 
@@ -229,7 +226,6 @@
 	prio_array_t* array = &rq->array;
 	list_add(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
-	array->nr_active++;
 	p->array = array;
 }
 
@@ -1095,7 +1091,6 @@
 			p->prio = current->prio;
 			list_add_tail(&p->run_list, &current->run_list);
 			p->array = current->array;
-			p->array->nr_active++;
 			rq->nr_running++;
 		}
 	} else {


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 14:04     ` William Lee Irwin III
@ 2004-06-07 14:07       ` William Lee Irwin III
  2004-06-07 15:00         ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 14:07 UTC (permalink / raw)
  To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 07:04:45AM -0700, William Lee Irwin III wrote:
> array->nr_active only ever modified, never examined.

cpu_to_node_mask() is dead.


Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 07:03:00.910109000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 07:06:28.072616000 -0700
@@ -46,12 +46,6 @@
 
 #include <asm/unistd.h>
 
-#ifdef CONFIG_NUMA
-#define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
-#else
-#define cpu_to_node_mask(cpu) (cpu_online_map)
-#endif
-
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],


* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 14:07       ` William Lee Irwin III
@ 2004-06-07 15:00         ` William Lee Irwin III
  2004-06-07 15:11           ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 15:00 UTC (permalink / raw)
  To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 07:04:45AM -0700, William Lee Irwin III wrote:
>> array->nr_active only ever modified, never examined.

On Mon, Jun 07, 2004 at 07:07:14AM -0700, William Lee Irwin III wrote:
> cpu_to_node_mask() is dead.

task->array is nothing more than a boolean flag. Shove it into
task->thread_info->flags, saving sizeof(prio_array_t *) from sizeof(task_t).
Compile-tested on sparc64 only.
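The flag-based bookkeeping can be sketched in userspace C. These non-atomic helpers only illustrate the idea; the kernel's actual thread-flag accessors operate atomically on thread_info->flags, and the bit number is arbitrary here:

```c
/* A per-thread flags word standing in for thread_info->flags; the patch
 * replaces the task->array pointer with a single TIF_QUEUED bit. */
#define TIF_QUEUED 17

struct thread_info {
	unsigned long flags;
};

static void set_ti_thread_flag(struct thread_info *ti, int flag)
{
	ti->flags |= 1UL << flag;
}

static void clear_ti_thread_flag(struct thread_info *ti, int flag)
{
	ti->flags &= ~(1UL << flag);
}

static int test_ti_thread_flag(const struct thread_info *ti, int flag)
{
	return (ti->flags >> flag) & 1;
}
```

With a single runqueue array, "which array is the task on" collapses to "is the task queued at all", which is why one bit suffices.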


Index: kolivas-2.6.7-rc2/include/asm-alpha/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-alpha/thread_info.h	2004-05-29 23:26:43.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-alpha/thread_info.h	2004-06-07 07:13:08.775700000 -0700
@@ -77,6 +77,7 @@
 #define TIF_UAC_NOPRINT		6	/* see sysinfo.h */
 #define TIF_UAC_NOFIX		7
 #define TIF_UAC_SIGBUS		8
+#define TIF_QUEUED		9	/* queued for scheduling */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-arm/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-arm/thread_info.h	2004-05-29 23:26:10.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-arm/thread_info.h	2004-06-07 07:14:01.315712000 -0700
@@ -123,6 +123,7 @@
  *  TIF_NEED_RESCHED	- rescheduling necessary
  *  TIF_USEDFPU		- FPU was used by this task this quantum (SMP)
  *  TIF_POLLING_NRFLAG	- true if poll_idle() is polling TIF_NEED_RESCHED
+ *  TIF_QUEUED		- true if on scheduler runqueue
  */
 #define TIF_NOTIFY_RESUME	0
 #define TIF_SIGPENDING		1
@@ -130,6 +131,7 @@
 #define TIF_SYSCALL_TRACE	8
 #define TIF_USED_FPU		16
 #define TIF_POLLING_NRFLAG	17
+#define TIF_QUEUED		18
 
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-arm26/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-arm26/thread_info.h	2004-05-29 23:25:53.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-arm26/thread_info.h	2004-06-07 07:14:30.291307000 -0700
@@ -118,6 +118,7 @@
  *  TIF_NEED_RESCHED	- rescheduling necessary
  *  TIF_USEDFPU		- FPU was used by this task this quantum (SMP)
  *  TIF_POLLING_NRFLAG	- true if poll_idle() is polling TIF_NEED_RESCHED
+ *  TIF_QUEUED		- true if on scheduler runqueue
  */
 #define TIF_NOTIFY_RESUME	0
 #define TIF_SIGPENDING		1
@@ -125,6 +126,7 @@
 #define TIF_SYSCALL_TRACE	8
 #define TIF_USED_FPU		16
 #define TIF_POLLING_NRFLAG	17
+#define TIF_QUEUED		18
 
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-cris/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-cris/thread_info.h	2004-05-29 23:26:48.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-cris/thread_info.h	2004-06-07 07:14:57.512169000 -0700
@@ -85,6 +85,7 @@
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-h8300/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-h8300/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-h8300/thread_info.h	2004-06-07 07:16:00.653570000 -0700
@@ -93,6 +93,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
 					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-i386/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-i386/thread_info.h	2004-05-29 23:25:45.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-i386/thread_info.h	2004-06-07 07:16:15.755274000 -0700
@@ -145,6 +145,7 @@
 #define TIF_IRET		5	/* return with iret */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ia64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ia64/thread_info.h	2004-05-29 23:26:49.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ia64/thread_info.h	2004-06-07 07:16:53.195583000 -0700
@@ -74,6 +74,7 @@
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SYSCALL_TRACE	3	/* syscall trace active */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */
 
 #define TIF_WORK_MASK		0x7	/* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE */
 #define TIF_ALLWORK_MASK	0xf	/* bits 0..3 are "work to do on user-return" bits */
Index: kolivas-2.6.7-rc2/include/asm-m68k/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-m68k/thread_info.h	2004-05-29 23:26:19.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-m68k/thread_info.h	2004-06-07 07:23:29.772294000 -0700
@@ -11,7 +11,7 @@
 	__s32			preempt_count; /* 0 => preemptable, <0 => BUG */
 	__u32 cpu; /* should always be 0 on m68k */
 	struct restart_block    restart_block;
-
+	unsigned char		queued;
 	__u8			supervisor_stack[0];
 };
 
@@ -48,6 +48,7 @@
 #define TIF_NOTIFY_RESUME	2	/* resumption notification requested */
 #define TIF_SIGPENDING		3	/* signal pending */
 #define TIF_NEED_RESCHED	4	/* rescheduling necessary */
+#define TIF_QUEUED		5	/* true if on scheduler runqueue */
 
 extern int thread_flag_fixme(void);
 
@@ -67,6 +68,9 @@
 	case TIF_SYSCALL_TRACE:				\
 		tsk->thread.work.syscall_trace = val;	\
 		break;					\
+	case TIF_QUEUED:				\
+		(tsk)->thread_info.queued = 1;		\
+		break;					\
 	default:					\
 		thread_flag_fixme();			\
 	}						\
@@ -84,6 +88,9 @@
 	case TIF_SYSCALL_TRACE:				\
 		___res = tsk->thread.work.syscall_trace;\
 		break;					\
+	case TIF_QUEUED:				\
+		___res = tsk->thread_info.queued;	\
+		break;					\
 	default:					\
 		___res = thread_flag_fixme();		\
 	}						\
Index: kolivas-2.6.7-rc2/include/asm-m68knommu/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-m68knommu/thread_info.h	2004-05-29 23:26:09.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-m68knommu/thread_info.h	2004-06-07 07:24:30.203107000 -0700
@@ -91,6 +91,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
 					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		16	/* true if on scheduler runqueue */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-mips/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-mips/thread_info.h	2004-05-29 23:25:40.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-mips/thread_info.h	2004-06-07 07:25:03.044114000 -0700
@@ -113,6 +113,7 @@
 #define TIF_SYSCALL_AUDIT	4	/* syscall auditing active */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		18	/* true if on scheduler runqueue */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-parisc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-parisc/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-parisc/thread_info.h	2004-06-07 07:25:26.559539000 -0700
@@ -60,6 +60,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_32BIT               5       /* 32 bit binary */
+#define TIF_QUEUED		6	/* true if on scheduler runqueue */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ppc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ppc/thread_info.h	2004-05-29 23:25:52.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ppc/thread_info.h	2004-06-07 07:26:16.802901000 -0700
@@ -86,6 +86,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
 					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		5	/* true if on scheduler runqueue */
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ppc64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ppc64/thread_info.h	2004-05-29 23:25:45.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ppc64/thread_info.h	2004-06-07 07:26:34.180260000 -0700
@@ -94,6 +94,7 @@
 #define TIF_RUN_LIGHT		6	/* iSeries run light */
 #define TIF_ABI_PENDING		7	/* 32/64 bit switch needed */
 #define TIF_SYSCALL_AUDIT	8	/* syscall auditing active */
+#define TIF_QUEUED		9	/* true if on scheduler runqueue */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-s390/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-s390/thread_info.h	2004-05-29 23:26:19.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-s390/thread_info.h	2004-06-07 07:27:50.350680000 -0700
@@ -89,6 +89,7 @@
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling 
 					   TIF_NEED_RESCHED */
 #define TIF_31BIT		18	/* 32bit process */ 
+#define TIF_QUEUED		19	/* true if on scheduler runqueue */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-sh/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sh/thread_info.h	2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sh/thread_info.h	2004-06-07 07:28:08.307950000 -0700
@@ -93,6 +93,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		18	/* true if on scheduler runqueue */
 #define TIF_USERSPACE		31	/* true if FS sets userspace */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-sparc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sparc/thread_info.h	2004-05-29 23:26:43.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sparc/thread_info.h	2004-06-07 07:29:22.011745000 -0700
@@ -137,6 +137,7 @@
 					 * this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	9	/* true if poll_idle() is polling
 					 * TIF_NEED_RESCHED */
+#define TIF_QUEUED		10	/* true if on scheduler runqueue */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-sparc64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sparc64/thread_info.h	2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sparc64/thread_info.h	2004-06-07 07:30:54.506684000 -0700
@@ -228,6 +228,7 @@
  *       an immediate value in instructions such as andcc.
  */
 #define TIF_ABI_PENDING		12
+#define TIF_QUEUED		13	/* true if on scheduler runqueue */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-um/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-um/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-um/thread_info.h	2004-06-07 07:35:20.439256000 -0700
@@ -65,6 +65,7 @@
 #define TIF_POLLING_NRFLAG      3       /* true if poll_idle() is polling 
 					 * TIF_NEED_RESCHED 
 					 */
+#define TIF_QUEUED		4	/* true if on scheduler runqueue */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-v850/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-v850/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-v850/thread_info.h	2004-06-07 07:36:38.430400000 -0700
@@ -83,6 +83,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
 					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		5	/* true if on scheduler runqueue */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-x86_64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-x86_64/thread_info.h	2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-x86_64/thread_info.h	2004-06-07 07:37:09.234717000 -0700
@@ -106,6 +106,7 @@
 #define TIF_IA32		17	/* 32bit process */ 
 #define TIF_FORK		18	/* ret_from_fork */
 #define TIF_ABI_PENDING		19
+#define TIF_QUEUED		20	/* true if on scheduler runqueue */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/linux/sched.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/linux/sched.h	2004-06-07 06:19:45.891612000 -0700
+++ kolivas-2.6.7-rc2/include/linux/sched.h	2004-06-07 07:37:49.068661000 -0700
@@ -325,7 +325,6 @@
 extern struct user_struct root_user;
 #define INIT_USER (&root_user)
 
-typedef struct prio_array prio_array_t;
 struct backing_dev_info;
 struct reclaim_state;
 
@@ -392,8 +391,6 @@
 
 	int prio, static_prio;
 	struct list_head run_list;
-	prio_array_t *array;
-
 	unsigned long long timestamp;
 	unsigned long runtime, totalrun;
 	unsigned int deadline;
Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 07:06:28.072616000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 07:59:22.711997000 -0700
@@ -85,10 +85,10 @@
 
 typedef struct runqueue runqueue_t;
 
-struct prio_array {
+typedef struct prio_array {
 	unsigned long bitmap[BITMAP_SIZE];
 	struct list_head queue[MAX_PRIO + 1];
-};
+} prio_array_t;
 
 /*
  * This is the main, per-CPU runqueue data structure.
@@ -191,6 +191,21 @@
 	spin_unlock_irq(&rq->lock);
 }
 
+static inline int task_queued(task_t *task)
+{
+	return test_ti_thread_flag(task->thread_info, TIF_QUEUED);
+}
+
+static inline void set_task_queued(task_t *task)
+{
+	set_ti_thread_flag(task->thread_info, TIF_QUEUED);
+}
+
+static inline void clear_task_queued(task_t *task)
+{
+	clear_ti_thread_flag(task->thread_info, TIF_QUEUED);
+}
+
 /*
  * Adding/removing a task to/from a priority array:
  */
@@ -207,7 +222,7 @@
 	prio_array_t* array = &rq->array;
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
-	p->array = array;
+	set_task_queued(p);
 }
 
 /*
@@ -220,7 +235,7 @@
 	prio_array_t* array = &rq->array;
 	list_add(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
-	p->array = array;
+	set_task_queued(p);
 }
 
 /*
@@ -369,7 +384,7 @@
 			p->deadline--;
 	}
 	dequeue_task(p, rq);
-	p->array = NULL;
+	clear_task_queued(p);
 }
 
 /*
@@ -442,7 +457,7 @@
 	 * If the task is not on a runqueue (and not running), then
 	 * it is sufficient to simply update the task's cpu field.
 	 */
-	if (!p->array && !task_running(rq, p)) {
+	if (!task_queued(p) && !task_running(rq, p)) {
 		set_task_cpu(p, dest_cpu);
 		return 0;
 	}
@@ -473,7 +488,7 @@
 repeat:
 	rq = task_rq_lock(p, &flags);
 	/* Must be off runqueue entirely, not preempted. */
-	if (unlikely(p->array)) {
+	if (unlikely(task_queued(p))) {
 		/* If it's preempted, we yield.  It could be a while. */
 		preempted = !task_running(rq, p);
 		task_rq_unlock(rq, &flags);
@@ -603,7 +618,7 @@
 	if (!(old_state & state))
 		goto out;
 
-	if (p->array)
+	if (task_queued(p))
 		goto out_running;
 
 	cpu = task_cpu(p);
@@ -663,7 +678,7 @@
 		old_state = p->state;
 		if (!(old_state & state))
 			goto out;
-		if (p->array)
+		if (task_queued(p))
 			goto out_running;
 
 		this_cpu = smp_processor_id();
@@ -725,7 +740,7 @@
 	 */
 	p->state = TASK_RUNNING;
 	INIT_LIST_HEAD(&p->run_list);
-	p->array = NULL;
+	clear_task_queued(p);
 	spin_lock_init(&p->switch_lock);
 #ifdef CONFIG_PREEMPT
 	/*
@@ -1079,12 +1094,12 @@
 	set_task_cpu(p, cpu);
 
 	if (cpu == this_cpu) {
-		if (unlikely(!current->array))
+		if (unlikely(!task_queued(current)))
 			__activate_task(p, rq);
 		else {
 			p->prio = current->prio;
 			list_add_tail(&p->run_list, &current->run_list);
-			p->array = current->array;
+			set_task_queued(p);
 			rq->nr_running++;
 		}
 	} else {
@@ -2232,9 +2247,8 @@
 void set_user_nice(task_t *p, long nice)
 {
 	unsigned long flags;
-	prio_array_t* array;
 	runqueue_t *rq;
-	int old_prio, new_prio, delta;
+	int queued, old_prio, new_prio, delta;
 
 	if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
 		return;
@@ -2253,8 +2267,7 @@
 		p->static_prio = NICE_TO_PRIO(nice);
 		goto out_unlock;
 	}
-	array = p->array;
-	if (array)
+	if ((queued = task_queued(p)))
 		dequeue_task(p, rq);
 
 	old_prio = p->prio;
@@ -2263,7 +2276,7 @@
 	p->static_prio = NICE_TO_PRIO(nice);
 	p->prio += delta;
 
-	if (array) {
+	if (queued) {
 		enqueue_task(p, rq);
 		/*
 		 * If the task increased its priority or is running and
@@ -2369,7 +2382,7 @@
 /* Actually do priority change: must hold rq lock. */
 static void __setscheduler(struct task_struct *p, int policy, int prio)
 {
-	BUG_ON(p->array);
+	BUG_ON(task_queued(p));
 	p->policy = policy;
 	p->rt_priority = prio;
 	if (policy != SCHED_NORMAL)
@@ -2385,8 +2398,7 @@
 {
 	struct sched_param lp;
 	int retval = -EINVAL;
-	int oldprio;
-	prio_array_t* array;
+	int queued, oldprio;
 	unsigned long flags;
 	runqueue_t *rq;
 	task_t *p;
@@ -2446,13 +2458,12 @@
 	if (retval)
 		goto out_unlock;
 
-	array = p->array;
-	if (array)
+	if ((queued = task_queued(p)))
 		deactivate_task(p, task_rq(p));
 	retval = 0;
 	oldprio = p->prio;
 	__setscheduler(p, policy, lp.sched_priority);
-	if (array) {
+	if (queued) {
 		__activate_task(p, task_rq(p));
 		/*
 		 * Reschedule if we are currently running on this runqueue and
@@ -2922,7 +2933,7 @@
 
 	idle_rq->curr = idle_rq->idle = idle;
 	deactivate_task(idle, rq);
-	idle->array = NULL;
+	clear_task_queued(idle);
 	idle->prio = MAX_PRIO;
 	idle->state = TASK_RUNNING;
 	idle->deadline = 0;
@@ -3034,7 +3045,7 @@
 		goto out;
 
 	set_task_cpu(p, dest_cpu);
-	if (p->array) {
+	if (task_queued(p)) {
 		/*
 		 * Sync timestamp with rq_dest's before activating.
 		 * The same thing could be achieved by doing this step

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 15:00         ` William Lee Irwin III
@ 2004-06-07 15:11           ` William Lee Irwin III
  0 siblings, 0 replies; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 15:11 UTC (permalink / raw)
  To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 07:07:14AM -0700, William Lee Irwin III wrote:
>> cpu_to_node_mask() is dead.

On Mon, Jun 07, 2004 at 08:00:37AM -0700, William Lee Irwin III wrote:
> task->array is nothing more than a boolean flag. Shove it into
> task->thread_info->flags, saving sizeof(prio_array_t *) from sizeof(task_t).
> Compile-tested on sparc64 only.
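The quoted idea can be sketched in isolation: the "on runqueue" boolean becomes one bit in the thread_info flags word, so task_struct no longer pays for a prio_array_t pointer. This is only an illustrative user-space model of the patch below; `thread_info_stub` is a stand-in for the real per-arch struct thread_info, not kernel code.

```c
#include <assert.h>

/* Model of the TIF_QUEUED approach: a per-task boolean stored as one
 * bit in thread_info->flags instead of a prio_array_t pointer in
 * task_struct.  The bit index mirrors the i386 hunk in the patch. */

#define TIF_QUEUED	17

struct thread_info_stub {
	unsigned long flags;
};

static int task_queued(const struct thread_info_stub *ti)
{
	return (ti->flags >> TIF_QUEUED) & 1UL;
}

static void set_task_queued(struct thread_info_stub *ti)
{
	ti->flags |= 1UL << TIF_QUEUED;
}

static void clear_task_queued(struct thread_info_stub *ti)
{
	ti->flags &= ~(1UL << TIF_QUEUED);
}
```

The kernel patch uses set_ti_thread_flag()/test_ti_thread_flag() for the same bit operations; the helpers above just spell them out.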

prio_array_t is no longer a useful abstraction. Compile-tested on sparc64
only.
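With prio_array_t folded away, the runqueue itself carries the priority bitmap and per-priority queues, and picking the next task is "find the lowest set bit, take the head of that queue". A minimal user-space sketch of that mechanism (stand-in types throughout; the linear scan below models sched_find_first_bit, and list heads are reduced to counters):

```c
#include <assert.h>
#include <string.h>

/* Model of the post-patch runqueue layout: bitmap plus one queue per
 * priority embedded directly in the runqueue, with priority 0 searched
 * first, as in move_tasks()/schedule() in the patch. */

#define MAX_PRIO	140
#define BITS_PER_LONG	(8 * (int)sizeof(long))
#define BITS_TO_LONGS_(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct rq_stub {
	unsigned long bitmap[BITS_TO_LONGS_(MAX_PRIO + 1)];
	int queue_len[MAX_PRIO + 1];
};

static void rq_init(struct rq_stub *rq)
{
	memset(rq, 0, sizeof(*rq));
	/* delimiter for the bit search, as in sched_init() */
	rq->bitmap[MAX_PRIO / BITS_PER_LONG] |= 1UL << (MAX_PRIO % BITS_PER_LONG);
}

static void rq_enqueue(struct rq_stub *rq, int prio)
{
	rq->queue_len[prio]++;
	rq->bitmap[prio / BITS_PER_LONG] |= 1UL << (prio % BITS_PER_LONG);
}

static void rq_dequeue(struct rq_stub *rq, int prio)
{
	if (--rq->queue_len[prio] == 0)
		rq->bitmap[prio / BITS_PER_LONG] &= ~(1UL << (prio % BITS_PER_LONG));
}

/* Lowest set bit = highest-priority non-empty queue.  O(MAX_PRIO) here;
 * the kernel does this with sched_find_first_bit(). */
static int rq_first_prio(const struct rq_stub *rq)
{
	int i;

	for (i = 0; i <= MAX_PRIO; i++)
		if (rq->bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
			return i;
	return MAX_PRIO;
}
```

An empty runqueue reports MAX_PRIO (the delimiter bit), which in the real scheduler corresponds to selecting the idle task.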


Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 07:59:22.711997000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 08:07:09.705003000 -0700
@@ -81,15 +81,8 @@
  * These are the runqueue data structures:
  */
 
-#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))
-
 typedef struct runqueue runqueue_t;
 
-typedef struct prio_array {
-	unsigned long bitmap[BITMAP_SIZE];
-	struct list_head queue[MAX_PRIO + 1];
-} prio_array_t;
-
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -113,7 +106,8 @@
 	unsigned long long timestamp_last_tick;
 	task_t *curr, *idle;
 	struct mm_struct *prev_mm;
-	prio_array_t array;
+	unsigned long bitmap[BITS_TO_LONGS(MAX_PRIO+1)];
+	struct list_head queue[MAX_PRIO + 1];
 	atomic_t nr_iowait;
 
 #ifdef CONFIG_SMP
@@ -207,21 +201,19 @@
 }
 
 /*
- * Adding/removing a task to/from a priority array:
+ * Adding/removing a task to/from a runqueue:
  */
 static void dequeue_task(struct task_struct *p, runqueue_t *rq)
 {
-	prio_array_t* array = &rq->array;
 	list_del(&p->run_list);
-	if (list_empty(array->queue + p->prio))
-		__clear_bit(p->prio, array->bitmap);
+	if (list_empty(rq->queue + p->prio))
+		__clear_bit(p->prio, rq->bitmap);
 }
 
 static void enqueue_task(struct task_struct *p, runqueue_t *rq)
 {
-	prio_array_t* array = &rq->array;
-	list_add_tail(&p->run_list, array->queue + p->prio);
-	__set_bit(p->prio, array->bitmap);
+	list_add_tail(&p->run_list, rq->queue + p->prio);
+	__set_bit(p->prio, rq->bitmap);
 	set_task_queued(p);
 }
 
@@ -232,9 +224,8 @@
  */
 static inline void enqueue_task_head(struct task_struct *p, runqueue_t *rq)
 {
-	prio_array_t* array = &rq->array;
-	list_add(&p->run_list, array->queue + p->prio);
-	__set_bit(p->prio, array->bitmap);
+	list_add(&p->run_list, rq->queue + p->prio);
+	__set_bit(p->prio, rq->bitmap);
 	set_task_queued(p);
 }
 
@@ -1257,27 +1248,24 @@
 		      unsigned long max_nr_move, struct sched_domain *sd,
 		      enum idle_type idle)
 {
-	prio_array_t* array, *dst_array;
 	struct list_head *head, *curr;
 	int idx, pulled = 0;
 	task_t *tmp;
 
 	if (max_nr_move <= 0 || busiest->nr_running <= 1)
 		goto out;
-	array = &busiest->array;
-	dst_array = &this_rq->array;
 
 	/* Start searching at priority 0: */
 	idx = 0;
 skip_bitmap:
 	if (!idx)
-		idx = sched_find_first_bit(array->bitmap);
+		idx = sched_find_first_bit(busiest->bitmap);
 	else
-		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
+		idx = find_next_bit(busiest->bitmap, MAX_PRIO, idx);
 	if (idx >= MAX_PRIO) 
 		goto out;
 
-	head = array->queue + idx;
+	head = busiest->queue + idx;
 	curr = head->prev;
 skip_queue:
 	tmp = list_entry(curr, task_t, run_list);
@@ -1919,7 +1907,6 @@
 	long *switch_count;
 	task_t *prev, *next;
 	runqueue_t *rq;
-	prio_array_t* array;
 	struct list_head *queue;
 	unsigned long long now;
 	int cpu, idx;
@@ -1971,9 +1958,8 @@
 		}
 	}
 
-	array = &rq->array;
-	idx = sched_find_first_bit(array->bitmap);
-	queue = array->queue + idx;
+	idx = sched_find_first_bit(rq->bitmap);
+	queue = rq->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
 
 	if (dependent_sleeper(cpu, rq, next))
@@ -3590,8 +3576,6 @@
 #endif
 
 	for (i = 0; i < NR_CPUS; i++) {
-		prio_array_t* array;
-
 		rq = cpu_rq(i);
 		spin_lock_init(&rq->lock);
 
@@ -3604,14 +3588,11 @@
 		INIT_LIST_HEAD(&rq->migration_queue);
 #endif
 		atomic_set(&rq->nr_iowait, 0);
-		array = &rq->array;
-
-		for (j = 0; j <= MAX_PRIO; j++) {
-			INIT_LIST_HEAD(array->queue + j);
-			__clear_bit(j, array->bitmap);
-		}
+		for (j = 0; j <= MAX_PRIO; j++)
+			INIT_LIST_HEAD(&rq->queue[j]);
+		memset(rq->bitmap, 0, BITS_TO_LONGS(MAX_PRIO+1)*sizeof(long));
 		// delimiter for bitsearch
-		__set_bit(MAX_PRIO, array->bitmap);
+		__set_bit(MAX_PRIO, rq->bitmap);
 	}
 	/*
 	 * We have to do a little magic to get the first

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
                   ` (5 preceding siblings ...)
  2004-06-07 13:56 ` William Lee Irwin III
@ 2004-06-07 19:57 ` Phy Prabab
  2004-06-07 21:12   ` Con Kolivas
  2004-06-29 11:10 ` Pavel Machek
  7 siblings, 1 reply; 31+ messages in thread
From: Phy Prabab @ 2004-06-07 19:57 UTC (permalink / raw)
  To: Con Kolivas, Linux Kernel Mailinglist
  Cc: Zwane Mwaikambo, William Lee Irwin III

FYI:

I have had a chance to test this patch.  I have a make
system that has presented 2.6 kernels with problems,
so I am using this as the test.  Observations show
that turning off interactive is much more
deterministic:

2.6.7-rc2-bk8-s63:
echo 0 > /proc/sys/kernel/interactive

A:  35.57user 38.18system 1:20.28elapsed 91%CPU 
B:  35.54user 38.40system 1:19.48elapsed 93%CPU
C:  35.48user 38.28system 1:20.94elapsed 91%CPU

2.6.7-rc2-bk8-s63:
A:  35.32user 38.51system 1:26.47elapsed 85%CPU
B:  35.43user 38.35system 1:20.79elapsed 91%CPU
C:  35.61user 38.23system 1:25.00elapsed 86%CPU

However, something is still slower than the 2.4.x
kernels:

2.4.23:
A:  28.32user 29.51system 1:01.17elapsed 93%CPU
B:  28.54user 29.40system 1:01.48elapsed 92%CPU
C:  28.23user 28.80system 1:00.21elapsed 94%CPU

Nice work, as I can now turn off some functionality
within the kernel that is causing me some slow down.

Thanks!
Phy


--- Con Kolivas <kernel@kolivas.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> This is an update of the scheduler policy mechanism
> rewrite using the 
> infrastructure of the current O(1) scheduler. Slight
> changes from the 
> original design require a detailed description. The
> change to the original 
> design has enabled all known corner cases to be
> abolished and cpu 
> distribution to be much better maintained. It has
> proven to be stable in my 
> testing and is ready for more widespread public
> testing now.
> 
> 
> Aims:
>  - Interactive by design rather than have
> interactivity bolted on.
>  - Good scalability.
>  - Simple predictable design.
>  - Maintain appropriate cpu distribution and
> fairness.
>  - Low scheduling latency for normal policy tasks.
>  - Low overhead.
>  - Making renicing processes actually matter for CPU
> distribution (nice 0 gets 
> 20 times what nice +20 gets)
>  - Resistant to priority inversion
>  - More forgiving of poor locking
>  - Tunable for a server workload or computational
> tasks
> 
> 
> Description:
>  - All tasks start at a dynamic priority based on
> their nice value. They run 
> for one RR_INTERVAL (nominally set to 10ms) and then
> drop one stair 
> (priority). If they sleep before running again they
> get to run for 2 
> intervals before being demoted a priority and so on
> until they get all their 
> intervals at their best priority: 20 intervals for
> nice 0; 1 interval for 
> nice +19.
> 
> 
> - - The staircase elevation mechanism can be
> disabled and all tasks can simply 
> descend stairs using the sysctl:
> 
> echo 0 > /proc/sys/kernel/interactive
> 
> this has the effect of maintaining cpu distribution
> much more strictly 
> according to nice whereas the default mechanism
> allows bursts of cpu by 
> interactive tasks before settling to appropriate cpu
> distribution.
> 
> 
> - - The time tasks are cpu bound can be increased by
> using the sysctl:
> 
> echo 1 > /proc/sys/kernel/compute
> 
> which extends the RR_INTERVAL to 100ms and disables
> the staircase elevation 
> improving conditions for pure computational tasks by
> optimising cache 
> benefits and decreasing context switching (gains
> another 1.5% on kernbench).
> 
> 
> Performance:
> - - All cpu throughput benchmarks show equivalent or
> better performance than 
> mainline. Note that disabling the interactive
> setting actually _worsens_ some 
> benchmarks because of their dependence on yield() so
> I don't recommend 
> disabling it unless you do a comparison first. 
> - - Interactivity is approximately equivalent to
> mainline 2.6 but with faster 
> application startup and no known behavioural quirks.
> 
> 
> Comments and testing welcome.
> 
>  fs/proc/array.c        |    2
>  include/linux/sched.h  |    8
>  include/linux/sysctl.h |    2
>  kernel/sched.c         |  676 
> +++++++++++++------------------------------------
>  kernel/sysctl.c        |   16 +
>  5 files changed, 212 insertions(+), 492
> deletions(-)
> 
> Can be downloaded here:
>
http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3
> 
> and below
> Con
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.4 (GNU/Linux)
> 
>
iD8DBQFAwzquZUg7+tp6mRURArBHAJ9SIBOKX6MYOdkJzdb+xRNnW82JQgCghLou
> wXhrRsBsfY2BIqbLT1tUWcs=
> =g1no
> -----END PGP SIGNATURE-----
> > diff -Naurp --exclude-from=dontdiff
> linux-2.6.7-rc2-base/fs/proc/array.c
> linux-2.6.7-rc2-s6.3/fs/proc/array.c
> --- linux-2.6.7-rc2-base/fs/proc/array.c	2004-05-31
> 21:29:15.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/fs/proc/array.c	2004-06-07
> 00:03:56.959133536 +1000
> @@ -155,7 +155,6 @@ static inline char *
> task_state(struct t
>  	read_lock(&tasklist_lock);
>  	buffer += sprintf(buffer,
>  		"State:\t%s\n"
> -		"SleepAVG:\t%lu%%\n"
>  		"Tgid:\t%d\n"
>  		"Pid:\t%d\n"
>  		"PPid:\t%d\n"
> @@ -163,7 +162,6 @@ static inline char *
> task_state(struct t
>  		"Uid:\t%d\t%d\t%d\t%d\n"
>  		"Gid:\t%d\t%d\t%d\t%d\n",
>  		get_task_state(p),
> -		(p->sleep_avg/1024)*100/(1020000000/1024),
>  	       	p->tgid,
>  		p->pid, p->pid ? p->real_parent->pid : 0,
>  		p->pid && p->ptrace ? p->parent->pid : 0,
> diff -Naurp --exclude-from=dontdiff
> linux-2.6.7-rc2-base/include/linux/sched.h
> linux-2.6.7-rc2-s6.3/include/linux/sched.h
> --- linux-2.6.7-rc2-base/include/linux/sched.h
> 2004-05-31 21:29:21.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/include/linux/sched.h
> 2004-06-07 00:10:37.450576503 +1000
> @@ -164,6 +164,7 @@ extern void show_stack(struct
> task_struc
>  
>  void io_schedule(void);
>  long io_schedule_timeout(long timeout);
> +extern int interactive, compute;
>  
>  extern void cpu_init (void);
>  extern void trap_init(void);
> @@ -394,14 +395,13 @@ struct task_struct {
>  	struct list_head run_list;
>  	prio_array_t *array;
>  
> -	unsigned long sleep_avg;
> -	long interactive_credit;
>  	unsigned long long timestamp;
> -	int activated;
> +	unsigned long runtime, totalrun;
> +	unsigned int deadline;
>  
>  	unsigned long policy;
>  	cpumask_t cpus_allowed;
> -	unsigned int time_slice, first_time_slice;
> +	unsigned int slice, time_slice, first_time_slice;
>  
>  	struct list_head tasks;
>  	struct list_head ptrace_children;
> diff -Naurp --exclude-from=dontdiff
> linux-2.6.7-rc2-base/include/linux/sysctl.h
> linux-2.6.7-rc2-s6.3/include/linux/sysctl.h
> --- linux-2.6.7-rc2-base/include/linux/sysctl.h
> 2004-05-31 21:29:21.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/include/linux/sysctl.h
> 2004-06-07 00:06:13.007895851 +1000
> @@ -133,6 +133,8 @@ enum
>  	KERN_NGROUPS_MAX=63,	/* int: NGROUPS_MAX */
>  	KERN_SPARC_SCONS_PWROFF=64, /* int: serial console
> power-off halt */
>  	KERN_HZ_TIMER=65,	/* int: hz timer on or off */
> +	KERN_INTERACTIVE=66,	/* interactive tasks to have
> cpu bursts */
> +	KERN_COMPUTE=67,	/* adjust timeslices for a
> compute server */
>  };
>  
>  
> diff -Naurp --exclude-from=dontdiff
> linux-2.6.7-rc2-base/kernel/sched.c
> linux-2.6.7-rc2-s6.3/kernel/sched.c
> --- linux-2.6.7-rc2-base/kernel/sched.c	2004-05-31
> 21:29:24.000000000 +1000
> +++ linux-2.6.7-rc2-s6.3/kernel/sched.c	2004-06-07
> 00:36:43.382396246 +1000
> @@ -16,7 +16,10 @@
>   *		by Davide Libenzi, preemptible kernel bits by
> Robert Love.
>   *  2003-09-03	Interactivity tuning by Con Kolivas.
>   *  2004-04-02	Scheduler domains code by Nick
> Piggin
> - */
> + *  2004-03-19.	New staircase scheduling policy by
> Con Kolivas with help
> + *		from Zwane Mwaikambo and useful suggestions by 
> + *		William Lee Irwin III. 
> +*/
>  
>  #include <linux/mm.h>
>  #include <linux/module.h>
> @@ -66,8 +69,6 @@
>  #define USER_PRIO(p)		((p)-MAX_RT_PRIO)
>  #define TASK_USER_PRIO(p)
> USER_PRIO((p)->static_prio)
>  #define MAX_USER_PRIO		(USER_PRIO(MAX_PRIO))
> -#define AVG_TIMESLICE	(MIN_TIMESLICE +
> ((MAX_TIMESLICE - MIN_TIMESLICE) *\
> -			(MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO -
> 1)))
>  
>  /*
>   * Some helpers for converting nanosecond timing to
> jiffy resolution
> @@ -75,110 +76,18 @@
>  #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 /
> HZ))
>  #define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 /
> HZ))
>  
> -/*
> - * These are the 'tuning knobs' of the scheduler:
> - *
> - * Minimum timeslice is 10 msecs, default timeslice
> is 100 msecs,
> - * maximum timeslice is 200 msecs. Timeslices get
> refilled after
> - * they expire.
> - */
> -#define MIN_TIMESLICE		( 10 * HZ / 1000)
> -#define MAX_TIMESLICE		(200 * HZ / 1000)
> -#define ON_RUNQUEUE_WEIGHT	 30
> -#define CHILD_PENALTY		 95
> -#define PARENT_PENALTY		100
> -#define EXIT_WEIGHT		  3
> -#define PRIO_BONUS_RATIO	 25
> -#define MAX_BONUS		(MAX_USER_PRIO *
> PRIO_BONUS_RATIO / 100)
> -#define INTERACTIVE_DELTA	  2
> -#define MAX_SLEEP_AVG		(AVG_TIMESLICE * MAX_BONUS)
> -#define STARVATION_LIMIT	(MAX_SLEEP_AVG)
> -#define NS_MAX_SLEEP_AVG
> (JIFFIES_TO_NS(MAX_SLEEP_AVG))
> -#define CREDIT_LIMIT		100
> -
> -/*
> - * If a task is 'interactive' then we reinsert it
> in the active
> - * array after it has expired its current
> timeslice. (it will not
> - * continue to run immediately, it will still
> roundrobin with
> - * other interactive tasks.)
> - *
> - * This part scales the interactivity limit
> depending on niceness.
> - *
> - * We scale it linearly, offset by the
> INTERACTIVE_DELTA delta.
> - * Here are a few examples of different nice
> levels:
> - *
> - *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
> - *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
> - *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
> - *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
> - *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
> - *
> - * (the X axis represents the possible -5 ... 0 ...
> +5 dynamic
> - *  priority range a task can explore, a value of
> '1' means the
> - *  task is rated interactive.)
> - *
> - * Ie. nice +19 tasks can never get 'interactive'
> enough to be
> - * reinserted into the active array. And only
> heavily CPU-hog nice -20
> - * tasks will be expired. Default nice 0 tasks are
> somewhere between,
> - * it takes some effort for them to get
> interactive, but it's not
> - * too hard.
> - */
> -
> -#define CURRENT_BONUS(p) \
> -	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
> -		MAX_SLEEP_AVG)
> -
> -#ifdef CONFIG_SMP
> -#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
> -		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) -
> 1)) * \
> -			num_online_cpus())
> -#else
> -#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
> -		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) -
> 1)))
> -#endif
> -
> -#define SCALE(v1,v1_max,v2_max) \
> -	(v1) * (v2_max) / (v1_max)
> 
=== message truncated ===



	
		

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 19:57 ` Phy Prabab
@ 2004-06-07 21:12   ` Con Kolivas
  2004-06-07 21:40     ` Phy Prabab
  0 siblings, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2004-06-07 21:12 UTC (permalink / raw)
  To: Phy Prabab
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Tue, 8 Jun 2004 05:57, Phy Prabab wrote:
> I have had a chance to test this patch.  

Thanks

> I have a make 
> system that has presented 2.6 kernels with problems,
> so  I am using this as the test.  Observations show
> that turning off interactive is much more
> deterministic:
>
> 2.6.7-rc2-bk8-s63:
> echo 0 > /proc/sys/kernel/interactive
>
> A:  35.57user 38.18system 1:20.28elapsed 91%CPU
> B:  35.54user 38.40system 1:19.48elapsed 93%CPU
> C:  35.48user 38.28system 1:20.94elapsed 91%CPU
>
> 2.6.7-rc2-bk8-s63:
> A:  35.32user 38.51system 1:26.47elapsed 85%CPU
> B:  35.43user 38.35system 1:20.79elapsed 91%CPU
> C:  35.61user 38.23system 1:25.00elapsed 86%CPU
>
> However, something is still slower than the 2.4.x
> kernels:
>
> 2.4.23:
> A:  28.32user 29.51system 1:01.17elapsed 93%CPU
> B:  28.54user 29.40system 1:01.48elapsed 92%CPU
> B:  28.23user 28.80system 1:00.21elapsed 94%CPU
>
> Nice work, as I can now turn off some functionality
> within the kernel that is causing me some slow down.

Glad to see it does what you require. Turning off "interactive" should still 
provide moderately good interactive performance at low to moderate loads, but 
is much stricter about cpu usage distribution as you can see.

Cheers,
Con

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 21:40     ` Phy Prabab
@ 2004-06-07 21:34       ` Con Kolivas
  2004-06-08  0:06         ` Phy Prabab
  2004-06-08  2:50       ` Andrew Morton
  1 sibling, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2004-06-07 21:34 UTC (permalink / raw)
  To: Phy Prabab
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Quoting Phy Prabab <phyprabab@yahoo.com>:

> OOOPPPSSSS....
> 
> I need to make a correction on my previous data.  I
> had inadvertently turned off interactivity and also
> increased the compute time to 100.  I confirmed that
> just setting interactivity off, does not solve my
> problem:
> 
> 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive):
> A:  37.30user 40.56system 1:42.01elapsed 76%CPU
> B:  37.29user 40.35system 1:23.87elapsed 92%CPU
> C:  37.30user 40.56system 1:36.01elapsed 81%CPU
> 
> 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive & 1
> /proc/sys/kernel/compute):
> A:  37.28user 40.36system 1:25.60elapsed 90%CPU
> B:  37.22user 40.35system 1:22.17elapsed 94%CPU
> C:  37.27user 40.35system 1:24.71elapsed 91%CPU
> 
> The question here, noticing that user and kernel time
> are the same, where is the dead time coming from and
> why is it sooooo much more deterministic with compute
> time at 100 vs 10?  Maybe I am misinterpreting the
> data, but this suggests to me that something is going
> awry (ping-pong, no settle, ???) within the kernel?
> 
> 
> Also please note the degradation between
> 2.6.7-rc2-bk8-s63:
> 
> A:  35.57user 38.18system 1:20.28elapsed 91%CPU
> B:  35.54user 38.40system 1:19.48elapsed 93%CPU
> C:  35.48user 38.28system 1:20.94elapsed 91%CPU
> 
> Interesting how much more time is spent in both user
> and kernel space between the two kernels.  Also note
> that 2.4.x exhibits even greater delta:
> 
> A:  28.32user 29.51system 1:01.17elapsed 93%CPU
> B:  28.54user 29.40system 1:01.48elapsed 92%CPU
> B:  28.23user 28.80system 1:00.21elapsed 94%CPU
> 
> Could anyone suggest a way to understand why the
> difference between the 2.6 kernels and the 2.4
> kernels?
> 
> Thank you for your time.
> Phy
> 
Hi.

How repeatable are the numbers normally? Some idea of what it is you're
benchmarking may also help in understanding the problem; locking may be an
issue with what you're benchmarking and out-of-order scheduling is not as
forgiving of poor locking. Extending the RR_INTERVAL and turning off
interactivity makes it more in-order and more forgiving of poor locking or
yield().

Setting compute==1 deactivates interactivity anyway, but that's not really
relevant to your figures since you had already set interactive to 0 when you
set compute to 1.
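For reference, the two knobs discussed in this thread both live under
/proc/sys/kernel on a kernel with the staircase patch applied. A minimal
sketch of the settings used in the runs above (requires root; per Con's
description, compute also deactivates interactivity):

```shell
# Strict cpu distribution: disable only the interactivity bonus.
echo 0 > /proc/sys/kernel/interactive

# "compute" mode: tune for computational loads; implies interactive off.
echo 1 > /proc/sys/kernel/compute
```

Either setting can be flipped back at runtime by writing the opposite value.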

Con


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 21:12   ` Con Kolivas
@ 2004-06-07 21:40     ` Phy Prabab
  2004-06-07 21:34       ` Con Kolivas
  2004-06-08  2:50       ` Andrew Morton
  0 siblings, 2 replies; 31+ messages in thread
From: Phy Prabab @ 2004-06-07 21:40 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

OOOPPPSSSS....

I need to make a correction on my previous data.  I
had inadvertently turned off interactivity and also
increased the compute time to 100.  I confirmed that
just setting interactivity off, does not solve my
problem:

2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive):
A:  37.30user 40.56system 1:42.01elapsed 76%CPU
B:  37.29user 40.35system 1:23.87elapsed 92%CPU
C:  37.30user 40.56system 1:36.01elapsed 81%CPU

2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive & 1
/proc/sys/kernel/compute):
A:  37.28user 40.36system 1:25.60elapsed 90%CPU
B:  37.22user 40.35system 1:22.17elapsed 94%CPU
C:  37.27user 40.35system 1:24.71elapsed 91%CPU

The question here, noticing that user and kernel time
are the same, where is the dead time coming from and
why is it sooooo much more deterministic with compute
time at 100 vs 10?  Maybe I am misinterpreting the
data, but this suggests to me that something is going
awry (ping-pong, no settle, ???) within the kernel?
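The "dead time" being asked about can be read straight off the time(1)
output: elapsed wall time minus (user + system) CPU time. A quick check for
run A of the interactive=0 numbers above (the awk invocation is an editorial
sketch; the figures are from the post):

```shell
# Run A: 37.30user 40.56system 1:42.01elapsed 76%CPU
awk 'BEGIN {
    elapsed = 1*60 + 42.01          # wall time in seconds
    busy    = 37.30 + 40.56         # user + system seconds
    printf "dead time: %.2fs of %.2fs elapsed (%.0f%% CPU)\n",
           elapsed - busy, elapsed, 100 * busy / elapsed
}'
```

The roughly 24 seconds of unaccounted time is consistent with the reported
76%CPU: about a quarter of the wall time was spent off-CPU.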


Also please note the degradation between
2.6.7-rc2-bk8-s63:

A:  35.57user 38.18system 1:20.28elapsed 91%CPU
B:  35.54user 38.40system 1:19.48elapsed 93%CPU
C:  35.48user 38.28system 1:20.94elapsed 91%CPU

Interesting how much more time is spent in both user
and kernel space between the two kernels.  Also note
that 2.4.x exhibits even greater delta:

A:  28.32user 29.51system 1:01.17elapsed 93%CPU
B:  28.54user 29.40system 1:01.48elapsed 92%CPU
B:  28.23user 28.80system 1:00.21elapsed 94%CPU

Could anyone suggest a way to understand why the
difference between the 2.6 kernels and the 2.4
kernels?

Thank you for your time.
Phy





--- Con Kolivas <kernel@kolivas.org> wrote:
> On Tue, 8 Jun 2004 05:57, Phy Prabab wrote:
> > I have had a chance to test this patch.  
> 
> Thanks
> 
> > I have a make 
> > system that has presented 2.6 kernels with
> problems,
> > so  I am using this as the test.  Observations
> show
> > that turning off interactive is much more
> > deterministic:
> >
> > 2.6.7-rc2-bk8-s63:
> > echo 0 > /proc/sys/kernel/interactive
> >
> > A:  35.57user 38.18system 1:20.28elapsed 91%CPU
> > B:  35.54user 38.40system 1:19.48elapsed 93%CPU
> > C:  35.48user 38.28system 1:20.94elapsed 91%CPU
> >
> > 2.6.7-rc2-bk8-s63:
> > A:  35.32user 38.51system 1:26.47elapsed 85%CPU
> > B:  35.43user 38.35system 1:20.79elapsed 91%CPU
> > C:  35.61user 38.23system 1:25.00elapsed 86%CPU
> >
> > However, something is still slower than the 2.4.x
> > kernels:
> >
> > 2.4.23:
> > A:  28.32user 29.51system 1:01.17elapsed 93%CPU
> > B:  28.54user 29.40system 1:01.48elapsed 92%CPU
> > B:  28.23user 28.80system 1:00.21elapsed 94%CPU
> >
> > Nice work, as I can now turn off some
> functionality
> > within the kernel that is causing me some slow
> down.
> 
> Glad to see it does what you require. Turning off
> "interactive" should still 
> provide moderately good interactive performance at
> low to moderate loads, but 
> is much stricter about cpu usage distribution as you
> can see.
> 
> Cheers,
> Con



	
		

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  0:06         ` Phy Prabab
@ 2004-06-07 22:08           ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-07 22:08 UTC (permalink / raw)
  To: Phy Prabab
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Quoting Phy Prabab <phyprabab@yahoo.com>:

> 
> Just to clarify, setting compute 1 implies interactive
> 0?

Yes.

> 
> These numbers are very reproducible; I have run them
> (in a continuous loop) for two hours.

Ok.

> 
> The test is a make of headers for a proprietary exec.
> Making headers is rather simple: all it does is link
> a bunch of .h files (traversing dirs) and some
> dependency generation (3 files, yacc and lex).  I have
> moved the source code base to local disk to discount
> nfs issues (though the difference is negligible and
> nfs performance on 2.6 is generally faster than 2.4).

Out of curiosity, what happens over nfs? I/O effects can be significant. I
assume you've excluded memory effects?

> I have tried to get a good test case that can be
> submitted. Still trying. 

Linking kernel headers with full debugging?

> Any suggestions to try to diagnose this?

Profiling.

> Thanks!
> Phy

Cheers,
Con

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 21:34       ` Con Kolivas
@ 2004-06-08  0:06         ` Phy Prabab
  2004-06-07 22:08           ` Con Kolivas
  0 siblings, 1 reply; 31+ messages in thread
From: Phy Prabab @ 2004-06-08  0:06 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III


Just to clarify, setting compute 1 implies interactive
0?

These numbers are very reproducible; I have run them
(in a continuous loop) for two hours.

The test is a make of headers for a proprietary exec.
Making headers is rather simple: all it does is link
a bunch of .h files (traversing dirs) and some
dependency generation (3 files, yacc and lex).  I have
moved the source code base to local disk to discount
nfs issues (though the difference is negligible and
nfs performance on 2.6 is generally faster than 2.4).

I have tried to get a good test case that can be
submitted. Still trying. 

Any suggestions to try to diagnose this?

Thanks!
Phy
> > 
> Hi.
> 
> How repeatable are the numbers normally? Some idea
> of what it is you're
> benchmarking may also help in understanding the
> problem; locking may be an
> issue with what you're benchmarking and out-of-order
> scheduling is not as
> forgiving of poor locking. Extending the RR_INTERVAL
> and turning off
> interactivity makes it more in-order and more
> forgiving of poor locking or
> yield().
> 
> Setting compute==1 deactivates interactivity
> anyway, but that's not really
> relevant to your figures since you had already set
> interactive to 0 when you set compute to 1.
> 
> Con
> 



	
		

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  3:05         ` Nick Piggin
@ 2004-06-08  1:07           ` Con Kolivas
  2004-06-08  5:51           ` Phy Prabab
  1 sibling, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-08  1:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, Phy Prabab, linux-kernel, zwane, wli

Quoting Nick Piggin <nickpiggin@yahoo.com.au>:

> Andrew Morton wrote:
> 
> > At a guess I'd say either a) you're hitting some path in the kernel which
> > is going for a giant and bogus romp through memory, trashing CPU caches or
> > b) your workload really dislikes run-child-first-after-fork or c) the page
> > allocator is serving up pages which your access pattern dislikes or d)
> > something else.
> > 
> 
> e) it's the staircase scheduler patch?
> 

Phy Prabab wrote:
>Could anyone suggest a way to understand why the
>difference between the 2.6 kernels and the 2.4
>kernels

I guess Phy had better tell us whether it's unique to staircase; if so,
there's clearly a bug unique to it.

Con

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 21:40     ` Phy Prabab
  2004-06-07 21:34       ` Con Kolivas
@ 2004-06-08  2:50       ` Andrew Morton
  2004-06-08  3:02         ` Con Kolivas
                           ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Andrew Morton @ 2004-06-08  2:50 UTC (permalink / raw)
  To: Phy Prabab; +Cc: kernel, linux-kernel, zwane, wli

Phy Prabab <phyprabab@yahoo.com> wrote:
>
> Also please note the degradation between
>  2.6.7-rc2-bk8-s63:
> 
>  A:  35.57user 38.18system 1:20.28elapsed 91%CPU
>  B:  35.54user 38.40system 1:19.48elapsed 93%CPU
>  C:  35.48user 38.28system 1:20.94elapsed 91%CPU
> 
>  Interesting how much more time is spent in both user
>  and kernel space between the two kernels.  Also note
>  that 2.4.x exhibits even greater delta:
> 
>  A:  28.32user 29.51system 1:01.17elapsed 93%CPU
>  B:  28.54user 29.40system 1:01.48elapsed 92%CPU
>  B:  28.23user 28.80system 1:00.21elapsed 94%CPU
> 
>  Could anyone suggest a way to understand why the
>  difference between the 2.6 kernels and the 2.4
>  kernels?

This is very very bad.

It's a uniprocessor machine, yes?

Could you describe the workload a bit more?  Is it something which others
can get their hands on?

It spends a lot of time in the kernel for a build system.  I wonder why.

At a guess I'd say either a) you're hitting some path in the kernel which
is going for a giant and bogus romp through memory, trashing CPU caches or
b) your workload really dislikes run-child-first-after-fork or c) the page
allocator is serving up pages which your access pattern dislikes or d)
something else.

It's certainly interesting.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  2:50       ` Andrew Morton
@ 2004-06-08  3:02         ` Con Kolivas
  2004-06-08  3:05         ` Nick Piggin
  2004-06-08  5:42         ` Phy Prabab
  2 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-08  3:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Phy Prabab, linux-kernel, zwane, wli

Quoting Andrew Morton <akpm@osdl.org>:

> Phy Prabab <phyprabab@yahoo.com> wrote:
> >
> > Also please note the degradation between
> >  2.6.7-rc2-bk8-s63:
> > 
> >  A:  35.57user 38.18system 1:20.28elapsed 91%CPU
> >  B:  35.54user 38.40system 1:19.48elapsed 93%CPU
> >  C:  35.48user 38.28system 1:20.94elapsed 91%CPU
> > 
> >  Interesting how much more time is spent in both user
> >  and kernel space between the two kernels.  Also note
> >  that 2.4.x exhibits even greater delta:
> > 
> >  A:  28.32user 29.51system 1:01.17elapsed 93%CPU
> >  B:  28.54user 29.40system 1:01.48elapsed 92%CPU
> >  B:  28.23user 28.80system 1:00.21elapsed 94%CPU
> > 
> >  Could anyone suggest a way to understand why the
> >  difference between the 2.6 kernels and the 2.4
> >  kernels?
> 
> This is very very bad.
> 
> It's a uniprocessor machine, yes?
> 
> Could you describe the workload a bit more?  Is it something which others
> can get their hands on?
> 
> It spends a lot of time in the kernel for a build system.  I wonder why.
> 
> At a guess I'd say either a) you're hitting some path in the kernel which
> is going for a giant and bogus romp through memory, trashing CPU caches or
> b) your workload really dislikes run-child-first-after-fork or c) the page
> allocator is serving up pages which your access pattern dislikes or d)
> something else.
> 
> It's certainly interesting.

Ordinary kernbench on 8x OSDL hardware fails to show anything like this:
Summary:
time in seconds for optimal run:
2.6.7-rc2: 129.948
2.6.7-rc2-s6.3: 129.528
2.6.7-rc2-s6.3compute: 127.63

Detailed:
kernel: patch-2.6.7-rc2
plmid: 2987
Host: stp8-002
Average Optimal -j 32 Load Run:
Elapsed Time 129.948
User Time 873.302
System Time 86.318
Percent CPU 738
Context Switches 39853.4
Sleeps 29045.8
Average Maximal -j Load Run:
Elapsed Time 128.362
User Time 866.274
System Time 84.352
Percent CPU 740
Context Switches 25742.6
Sleeps 20615.8
Average Half Load -j 4 Run:
Elapsed Time 231.04
User Time 833.69
System Time 72.94
Percent CPU 392
Context Switches 12672.2
Sleeps 22267.2


kernel: staircase6.2
plmid: 3002
Host: stp8-002
Average Optimal -j 32 Load Run:
Elapsed Time 129.528
User Time 873.866
System Time 85.372
Percent CPU 740.2
Context Switches 38286.6
Sleeps 28331
Average Maximal -j Load Run:
Elapsed Time 128.074
User Time 862.728
System Time 85.842
Percent CPU 740.2
Context Switches 20894
Sleeps 22189.2
Average Half Load -j 4 Run:
Elapsed Time 230.942
User Time 833.03
System Time 71.126
Percent CPU 391
Context Switches 3467
Sleeps 22389.2


kernel: crunchah6.3
plmid: 3008
Host: stp8-002
Average Optimal -j 32 Load Run:
Elapsed Time 127.63
User Time 861.668
System Time 83.078
Percent CPU 739.6
Context Switches 19818.4
Sleeps 28442.6
Average Maximal -j Load Run:
Elapsed Time 127.354
User Time 856.632
System Time 82.952
Percent CPU 737
Context Switches 12389.2
Sleeps 21788.6
Average Half Load -j 4 Run:
Elapsed Time 230.8
User Time 833.378
System Time 71.304
Percent CPU 391.6
Context Switches 3738.4
Sleeps 22288.6


Con


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  2:50       ` Andrew Morton
  2004-06-08  3:02         ` Con Kolivas
@ 2004-06-08  3:05         ` Nick Piggin
  2004-06-08  1:07           ` Con Kolivas
  2004-06-08  5:51           ` Phy Prabab
  2004-06-08  5:42         ` Phy Prabab
  2 siblings, 2 replies; 31+ messages in thread
From: Nick Piggin @ 2004-06-08  3:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Phy Prabab, kernel, linux-kernel, zwane, wli

Andrew Morton wrote:

> At a guess I'd say either a) you're hitting some path in the kernel which
> is going for a giant and bogus romp through memory, trashing CPU caches or
> b) your workload really dislikes run-child-first-after-fork or c) the page
> allocator is serving up pages which your access pattern dislikes or d)
> something else.
> 

e) it's the staircase scheduler patch?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  2:50       ` Andrew Morton
  2004-06-08  3:02         ` Con Kolivas
  2004-06-08  3:05         ` Nick Piggin
@ 2004-06-08  5:42         ` Phy Prabab
  2004-06-08  5:50           ` Zwane Mwaikambo
  2 siblings, 1 reply; 31+ messages in thread
From: Phy Prabab @ 2004-06-08  5:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: kernel, linux-kernel, zwane, wli

The test case is a build system that links headers (ln
-s) and runs bison and flex on a couple of files. 
strace shows no difference between running on 2.4.23
compared to running on 2.6.7-rc2bk8s63.

Unfortunately, I cannot send this out, but I am
trying to get a test case that will demonstrate this.

In the meantime, what can I do to try to understand this
slowdown?

Thank you for your time.
Phy

--- Andrew Morton <akpm@osdl.org> wrote:
> Phy Prabab <phyprabab@yahoo.com> wrote:
> >
> > Also please note the degradation between
> >  2.6.7-rc2-bk8-s63:
> > 
> >  A:  35.57user 38.18system 1:20.28elapsed 91%CPU
> >  B:  35.54user 38.40system 1:19.48elapsed 93%CPU
> >  C:  35.48user 38.28system 1:20.94elapsed 91%CPU
> > 
> >  Interesting how much more time is spent in both
> user
> >  and kernel space between the two kernels.  Also
> note
> >  that 2.4.x exhibits even greater delta:
> > 
> >  A:  28.32user 29.51system 1:01.17elapsed 93%CPU
> >  B:  28.54user 29.40system 1:01.48elapsed 92%CPU
> >  B:  28.23user 28.80system 1:00.21elapsed 94%CPU
> > 
> >  Could anyone suggest a way to understand why the
> >  difference between the 2.6 kernels and the 2.4
> >  kernels?
> 
> This is very very bad.
> 
> It's a uniprocessor machine, yes?
> 
> Could you describe the workload a bit more?  Is it
> something which others
> can get their hands on?
> 
> It spends a lot of time in the kernel for a build
> system.  I wonder why.
> 
> At a guess I'd say either a) you're hitting some
> path in the kernel which
> is going for a giant and bogus romp through memory,
> trashing CPU caches or
> b) your workload really dislikes
> run-child-first-after-fork or c) the page
> allocator is serving up pages which your access
> pattern dislikes or d)
> something else.
> 
> It's certainly interesting.


	
		

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  5:42         ` Phy Prabab
@ 2004-06-08  5:50           ` Zwane Mwaikambo
  0 siblings, 0 replies; 31+ messages in thread
From: Zwane Mwaikambo @ 2004-06-08  5:50 UTC (permalink / raw)
  To: Phy Prabab; +Cc: Andrew Morton, kernel, linux-kernel, wli

On Mon, 7 Jun 2004, Phy Prabab wrote:

> The test case is a build system that links headers (ln
> -s) and runs bison and flex on a couple of files.
> strace shows no difference between running on 2.4.23
> compared to running on 2.6.7-rc2bk8s63.
>
> Unfortunately, I cannot send this out, but I am
> trying to get a test case that will demonstrate this.
>
> In the meantime, what can I do to try to understand this
> slowdown?

Check whether the regression is also present without the staircase scheduler.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  3:05         ` Nick Piggin
  2004-06-08  1:07           ` Con Kolivas
@ 2004-06-08  5:51           ` Phy Prabab
  1 sibling, 0 replies; 31+ messages in thread
From: Phy Prabab @ 2004-06-08  5:51 UTC (permalink / raw)
  To: Nick Piggin, Andrew Morton; +Cc: Phy Prabab, kernel, linux-kernel, zwane, wli

The results I published also included 2.4.23 runs,
which do not have the staircase scheduler.

Also, I have run this on 2.6-1->2.6.7-rc<x>, all the
2.6.x series are slower than the 2.4 series.

Thank you for your time.
Phy

--- Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Andrew Morton wrote:
> 
> > At a guess I'd say either a) you're hitting some
> path in the kernel which
> > is going for a giant and bogus romp through
> memory, trashing CPU caches or
> > b) your workload really dislikes
> run-child-first-after-fork or c) the page
> > allocator is serving up pages which your access
> pattern dislikes or d)
> > something else.
> > 
> 
> e) it's the staircase scheduler patch?


	
		

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
                   ` (6 preceding siblings ...)
  2004-06-07 19:57 ` Phy Prabab
@ 2004-06-29 11:10 ` Pavel Machek
  2004-06-29 11:14   ` Con Kolivas
  2004-06-29 12:05   ` Felipe Alfaro Solana
  7 siblings, 2 replies; 31+ messages in thread
From: Pavel Machek @ 2004-06-29 11:10 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Hi!

> This is an update of the scheduler policy mechanism rewrite using the 
> infrastructure of the current O(1) scheduler. Slight changes from the 
> original design require a detailed description. The change to the original 
> design has enabled all known corner cases to be abolished and cpu 
> distribution to be much better maintained. It has proven to be stable in my 
> testing and is ready for more widespread public testing now.
> 
> 
> Aims:
>  - Interactive by design rather than have interactivity bolted on.
>  - Good scalability.
>  - Simple predictable design.
>  - Maintain appropriate cpu distribution and fairness.
>  - Low scheduling latency for normal policy tasks.
>  - Low overhead.
>  - Making renicing processes actually matter for CPU distribution (nice 0 gets 
> 20 times what nice +20 gets)
>  - Resistant to priority inversion

How do you solve priority inversion?

Can you do "true idle threads" now? (I.e. nice +infinity?)
									Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-29 11:10 ` Pavel Machek
@ 2004-06-29 11:14   ` Con Kolivas
  2004-06-29 12:05   ` Felipe Alfaro Solana
  1 sibling, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-29 11:14 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pavel Machek wrote:
| Hi!
|
|
|>This is an update of the scheduler policy mechanism rewrite using the
|>infrastructure of the current O(1) scheduler. Slight changes from the
|>original design require a detailed description. The change to the
original
|>design has enabled all known corner cases to be abolished and cpu
|>distribution to be much better maintained. It has proven to be stable
in my
|>testing and is ready for more widespread public testing now.
|>
|>
|>Aims:
|> - Interactive by design rather than have interactivity bolted on.
|> - Good scalability.
|> - Simple predictable design.
|> - Maintain appropriate cpu distribution and fairness.
|> - Low scheduling latency for normal policy tasks.
|> - Low overhead.
|> - Making renicing processes actually matter for CPU distribution
(nice 0 gets
|>20 times what nice +20 gets)
|> - Resistant to priority inversion
|
|
| How do you solve priority inversion?

I don't solve it. It is relatively resistant to priority inversion because
tasks traverse all the priorities below their own rather than
sitting at one priority. It does this more strictly when interactive is
disabled, which I recommend for server and multiuser setups.
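As an editorial illustration of that resistance (a toy model with made-up
priority numbers, not code from the patch): a CPU hog cannot camp on its best
priority, because each RR_INTERVAL (nominally 10ms) of CPU it consumes
demotes it one stair, so a task stuck at a lower priority is overtaken after
a bounded number of intervals:

```shell
# Hypothetical dynamic priorities; a lower number is a better priority.
awk 'BEGIN {
    hog = 10; waiter = 14
    for (t = 0; hog <= waiter; t += 10)
        hog++                # hog descends one stair per 10ms RR_INTERVAL
    printf "waiter outranks the hog after %dms of hog CPU time\n", t
}'
```

Here the hog starting 4 stairs above the waiter is demoted past it after
five 10ms intervals, i.e. the inversion is bounded at 50ms in this sketch.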

| Can you do "true idle threads" now? (I.e. nice +infinity?)

Not safely, no. I have a batch (idle) implementation that is relatively
safe but can be abused (it's in my -ck patchset only). At some stage, if
the codebase settles down enough, I'll try to introduce semaphore
tracking for batch processes that will elevate them to normal scheduling
to prevent DoSing.

Cheers,
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFA4U8oZUg7+tp6mRURAircAKCSvuZ4D6rM7AsfAbHKhQoCw1C7NACaA44C
UyUJAByRIE894CchzlN2OiU=
=9xVI
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-29 11:10 ` Pavel Machek
  2004-06-29 11:14   ` Con Kolivas
@ 2004-06-29 12:05   ` Felipe Alfaro Solana
  1 sibling, 0 replies; 31+ messages in thread
From: Felipe Alfaro Solana @ 2004-06-29 12:05 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo,
	William Lee Irwin III

On Tue, 2004-06-29 at 13:10 +0200, Pavel Machek wrote:
> Hi!
> 
> > This is an update of the scheduler policy mechanism rewrite using the 
> > infrastructure of the current O(1) scheduler. Slight changes from the 
> > original design require a detailed description. The change to the original 
> > design has enabled all known corner cases to be abolished and cpu 
> > distribution to be much better maintained. It has proven to be stable in my 
> > testing and is ready for more widespread public testing now.
> > 
> > 
> > Aims:
> >  - Interactive by design rather than have interactivity bolted on.
> >  - Good scalability.
> >  - Simple predictable design.
> >  - Maintain appropriate cpu distribution and fairness.
> >  - Low scheduling latency for normal policy tasks.
> >  - Low overhead.
> >  - Making renicing processes actually matter for CPU distribution (nice 0 gets 
> > 20 times what nice +20 gets)
> >  - Resistant to priority inversion
> 
> How do you solve priority inversion?
> 
> Can you do "true idle threads" now? (I.e. nice +infinity?)

I think idle threads can be achieved with the batch scheduler policy
present in -ck patches. It works nicely, and I use it a lot.

PS: At least, if by "idle threads" you mean those that only run
when there are idle CPU cycles (i.e. all other processes are sleeping).


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2004-06-29 12:05 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
2004-06-06 16:59 ` Jan Killius
2004-06-06 17:40 ` Felipe Alfaro Solana
2004-06-06 22:58   ` Con Kolivas
2004-06-06 20:43 ` Prakash K. Cheemplavam
2004-06-06 22:55   ` Con Kolivas
2004-06-06 20:59 ` Felipe Alfaro Solana
2004-06-06 22:57   ` Con Kolivas
2004-06-06 23:47 ` Con Kolivas
2004-06-07 13:56 ` William Lee Irwin III
2004-06-07 13:57   ` William Lee Irwin III
2004-06-07 14:04     ` William Lee Irwin III
2004-06-07 14:07       ` William Lee Irwin III
2004-06-07 15:00         ` William Lee Irwin III
2004-06-07 15:11           ` William Lee Irwin III
2004-06-07 19:57 ` Phy Prabab
2004-06-07 21:12   ` Con Kolivas
2004-06-07 21:40     ` Phy Prabab
2004-06-07 21:34       ` Con Kolivas
2004-06-08  0:06         ` Phy Prabab
2004-06-07 22:08           ` Con Kolivas
2004-06-08  2:50       ` Andrew Morton
2004-06-08  3:02         ` Con Kolivas
2004-06-08  3:05         ` Nick Piggin
2004-06-08  1:07           ` Con Kolivas
2004-06-08  5:51           ` Phy Prabab
2004-06-08  5:42         ` Phy Prabab
2004-06-08  5:50           ` Zwane Mwaikambo
2004-06-29 11:10 ` Pavel Machek
2004-06-29 11:14   ` Con Kolivas
2004-06-29 12:05   ` Felipe Alfaro Solana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox