public inbox for linux-kernel@vger.kernel.org
From: Wes Janzen <superchkn@sbcglobal.net>
To: Con Kolivas <kernel@kolivas.org>
Cc: linux kernel mailing list <linux-kernel@vger.kernel.org>,
	William Lee Irwin III <wli@holomorphy.com>,
	Zwane Mwaikambo <zwane@linuxpower.ca>,
	Pauli Virtanen <pauli.virtanen@hut.fi>
Subject: Re: [PATCH] Staircase scheduler v7.4
Date: Sat, 26 Jun 2004 15:04:28 -0500	[thread overview]
Message-ID: <40DDD6CC.7000201@sbcglobal.net> (raw)
In-Reply-To: <40DC38D0.9070905@kolivas.org>

Hi Con,

I don't know what's going on, but 2.6.7-mm2 with staircase v7.4 (with 
or without the staircase7.4-1 incremental) takes about 3 hours to get 
from loading the kernel from grub to the login prompt.  Now I realize 
my K6-2 400 isn't state of the art...  I don't have this problem 
running plain 2.6.7-mm2.

It just pauses for an extended period after starting nearly every 
service.  It responds to SysRq keys but just seems to be doing nothing 
while waiting.

Any suggestions?

Thanks,

Wes Janzen

Con Kolivas wrote:

> This is a scheduler policy rewrite designed to be interactive without 
> tweaks or tuning, and to be lean and extensible for all sorts of 
> settings. (See previous announcements for more detail.)
>
> Patches (including incrementals from previous versions) against 2.6.7 
> and 2.6.7-mm2 can be downloaded from:
> http://ck.kolivas.org/patches/2.6/
>
> At this point all known bugs that have come up from testers have been 
> addressed. Testers of this patch, please note that renicing tasks to 
> negative values will not improve the behaviour of games and the like 
> (usually it worsens it). You will instead be rewarded by judicious use 
> of positive nice values, as nice has profound effects with this 
> scheduler.
>
> Note, for multiuser machines and servers I recommend disabling 
> interactive mode:
> echo 0 > /proc/sys/kernel/interactive
>
>
> Changes since v7.3
>
> A very difficult to track (and reproduce) bug was finally sorted out, 
> where a task that forked repeated cpu-bound children of just the right 
> duration could DoS other tasks. Basically, since the parent woke up 
> back at its best priority every time it went to sleep, the children 
> needed to run only long enough to stay at that same priority, thus 
> hogging all cpu resources. This was addressed using infrastructure 
> already in place in staircase for dealing with ultra-short-running 
> tasks: each subsequent wakeup continues as though the task were still 
> running its first timeslice, so its priority drops over time in spite 
> of repeated wakeups. Great thanks to Pauli Virtanen for his testing 
> and help in nailing this.
>
> The other change addresses a design issue inherited from the O(1) 
> scheduler. If task a is preempted by a higher-priority task b, task a 
> gets requeued to run after all tasks of the same priority as itself. 
> Preempted tasks are now flagged as such and go ahead of their peers, 
> continuing where they left off. This tends to smooth out the jitter of 
> things like X and audio somewhat, and because task a may run again 
> soon if task b runs only briefly, it helps preserve cache. Thus on 
> preliminary benchmarks I've found a slight improvement in throughput 
> under heavy load.
>
> Attached is the patch against 2.6.7.
>
> Regards,
> Con
>
>------------------------------------------------------------------------
>
>Index: linux-2.6.7-staircase/fs/proc/array.c
>===================================================================
>--- linux-2.6.7-staircase.orig/fs/proc/array.c	2004-06-25 02:24:28.106117207 +1000
>+++ linux-2.6.7-staircase/fs/proc/array.c	2004-06-25 02:24:33.881209658 +1000
>@@ -155,7 +155,7 @@
> 	read_lock(&tasklist_lock);
> 	buffer += sprintf(buffer,
> 		"State:\t%s\n"
>-		"SleepAVG:\t%lu%%\n"
>+		"Burst:\t%d\n"
> 		"Tgid:\t%d\n"
> 		"Pid:\t%d\n"
> 		"PPid:\t%d\n"
>@@ -163,7 +163,7 @@
> 		"Uid:\t%d\t%d\t%d\t%d\n"
> 		"Gid:\t%d\t%d\t%d\t%d\n",
> 		get_task_state(p),
>-		(p->sleep_avg/1024)*100/(1020000000/1024),
>+		p->burst,
> 	       	p->tgid,
> 		p->pid, p->pid ? p->real_parent->pid : 0,
> 		p->pid && p->ptrace ? p->parent->pid : 0,
>Index: linux-2.6.7-staircase/include/linux/sched.h
>===================================================================
>--- linux-2.6.7-staircase.orig/include/linux/sched.h	2004-06-25 02:24:28.113116107 +1000
>+++ linux-2.6.7-staircase/include/linux/sched.h	2004-06-25 02:24:34.720077833 +1000
>@@ -164,6 +164,7 @@
> 
> void io_schedule(void);
> long io_schedule_timeout(long timeout);
>+extern int interactive, compute;
> 
> extern void cpu_init (void);
> extern void trap_init(void);
>@@ -325,7 +326,6 @@
> extern struct user_struct root_user;
> #define INIT_USER (&root_user)
> 
>-typedef struct prio_array prio_array_t;
> struct backing_dev_info;
> struct reclaim_state;
> 
>@@ -392,16 +392,13 @@
> 
> 	int prio, static_prio;
> 	struct list_head run_list;
>-	prio_array_t *array;
>-
>-	unsigned long sleep_avg;
>-	long interactive_credit;
> 	unsigned long long timestamp;
>-	int activated;
>+	unsigned long runtime, totalrun;
>+	unsigned int burst;
> 
> 	unsigned long policy;
> 	cpumask_t cpus_allowed;
>-	unsigned int time_slice, first_time_slice;
>+	unsigned int slice, time_slice;
> 
> 	struct list_head tasks;
> 	struct list_head ptrace_children;
>@@ -549,6 +546,8 @@
> #define PF_SWAPOFF	0x00080000	/* I am in swapoff */
> #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
> #define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */
>+#define PF_FORKED	0x00400000	/* I have just forked */
>+#define PF_PREEMPTED	0x00800000	/* I have just been preempted */
> 
> #ifdef CONFIG_SMP
> #define SCHED_LOAD_SCALE	128UL	/* increase resolution of load */
>@@ -742,7 +741,6 @@
>  }
> #endif
> extern void FASTCALL(sched_fork(task_t * p));
>-extern void FASTCALL(sched_exit(task_t * p));
> 
> extern int in_group_p(gid_t);
> extern int in_egroup_p(gid_t);
>Index: linux-2.6.7-staircase/include/linux/sysctl.h
>===================================================================
>--- linux-2.6.7-staircase.orig/include/linux/sysctl.h	2004-06-25 02:24:28.113116107 +1000
>+++ linux-2.6.7-staircase/include/linux/sysctl.h	2004-06-25 02:24:33.884209187 +1000
>@@ -133,6 +133,8 @@
> 	KERN_NGROUPS_MAX=63,	/* int: NGROUPS_MAX */
> 	KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */
> 	KERN_HZ_TIMER=65,	/* int: hz timer on or off */
>+	KERN_INTERACTIVE=66,	/* interactive tasks can have cpu bursts */
>+	KERN_COMPUTE=67,	/* adjust timeslices for a compute server */
> };
> 
> 
>Index: linux-2.6.7-staircase/init/main.c
>===================================================================
>--- linux-2.6.7-staircase.orig/init/main.c	2004-06-25 02:24:28.108116892 +1000
>+++ linux-2.6.7-staircase/init/main.c	2004-06-25 02:24:33.885209030 +1000
>@@ -314,8 +314,15 @@
> #define smp_init()	do { } while (0)
> #endif
> 
>+unsigned long cache_decay_ticks;
> static inline void setup_per_cpu_areas(void) { }
>-static inline void smp_prepare_cpus(unsigned int maxcpus) { }
>+static void smp_prepare_cpus(unsigned int maxcpus)
>+{
>+	// Generic 2 tick cache_decay for uniprocessor
>+	cache_decay_ticks = 2;
>+	printk("Generic cache decay timeout: %ld msecs.\n",
>+		(cache_decay_ticks * 1000 / HZ));
>+}
> 
> #else
> 
>Index: linux-2.6.7-staircase/kernel/exit.c
>===================================================================
>--- linux-2.6.7-staircase.orig/kernel/exit.c	2004-06-25 02:24:28.111116421 +1000
>+++ linux-2.6.7-staircase/kernel/exit.c	2004-06-25 02:24:33.887208715 +1000
>@@ -96,7 +96,6 @@
> 	p->parent->cmaj_flt += p->maj_flt + p->cmaj_flt;
> 	p->parent->cnvcsw += p->nvcsw + p->cnvcsw;
> 	p->parent->cnivcsw += p->nivcsw + p->cnivcsw;
>-	sched_exit(p);
> 	write_unlock_irq(&tasklist_lock);
> 	spin_unlock(&p->proc_lock);
> 	proc_pid_flush(proc_dentry);
>Index: linux-2.6.7-staircase/kernel/sched.c
>===================================================================
>--- linux-2.6.7-staircase.orig/kernel/sched.c	2004-06-25 02:24:28.110116578 +1000
>+++ linux-2.6.7-staircase/kernel/sched.c	2004-06-25 02:24:34.722077519 +1000
>@@ -16,6 +16,8 @@
>  *		by Davide Libenzi, preemptible kernel bits by Robert Love.
>  *  2003-09-03	Interactivity tuning by Con Kolivas.
>  *  2004-04-02	Scheduler domains code by Nick Piggin
>+ *  2004-06-11	New staircase scheduling policy by Con Kolivas with help
>+ *		from William Lee Irwin III, Zwane Mwaikambo & Peter Williams.
>  */
> 
> #include <linux/mm.h>
>@@ -43,12 +45,6 @@
> 
> #include <asm/unistd.h>
> 
>-#ifdef CONFIG_NUMA
>-#define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
>-#else
>-#define cpu_to_node_mask(cpu) (cpu_online_map)
>-#endif
>-
> /*
>  * Convert user-nice values [ -20 ... 0 ... 19 ]
>  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
>@@ -66,118 +62,20 @@
> #define USER_PRIO(p)		((p)-MAX_RT_PRIO)
> #define TASK_USER_PRIO(p)	USER_PRIO((p)->static_prio)
> #define MAX_USER_PRIO		(USER_PRIO(MAX_PRIO))
>-#define AVG_TIMESLICE	(MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
>-			(MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))
> 
> /*
>  * Some helpers for converting nanosecond timing to jiffy resolution
>  */
> #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
>-#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
>-
>-/*
>- * These are the 'tuning knobs' of the scheduler:
>- *
>- * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
>- * maximum timeslice is 200 msecs. Timeslices get refilled after
>- * they expire.
>- */
>-#define MIN_TIMESLICE		( 10 * HZ / 1000)
>-#define MAX_TIMESLICE		(200 * HZ / 1000)
>-#define ON_RUNQUEUE_WEIGHT	 30
>-#define CHILD_PENALTY		 95
>-#define PARENT_PENALTY		100
>-#define EXIT_WEIGHT		  3
>-#define PRIO_BONUS_RATIO	 25
>-#define MAX_BONUS		(MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
>-#define INTERACTIVE_DELTA	  2
>-#define MAX_SLEEP_AVG		(AVG_TIMESLICE * MAX_BONUS)
>-#define STARVATION_LIMIT	(MAX_SLEEP_AVG)
>-#define NS_MAX_SLEEP_AVG	(JIFFIES_TO_NS(MAX_SLEEP_AVG))
>-#define CREDIT_LIMIT		100
>-
>-/*
>- * If a task is 'interactive' then we reinsert it in the active
>- * array after it has expired its current timeslice. (it will not
>- * continue to run immediately, it will still roundrobin with
>- * other interactive tasks.)
>- *
>- * This part scales the interactivity limit depending on niceness.
>- *
>- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
>- * Here are a few examples of different nice levels:
>- *
>- *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
>- *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
>- *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
>- *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
>- *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
>- *
>- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
>- *  priority range a task can explore, a value of '1' means the
>- *  task is rated interactive.)
>- *
>- * Ie. nice +19 tasks can never get 'interactive' enough to be
>- * reinserted into the active array. And only heavily CPU-hog nice -20
>- * tasks will be expired. Default nice 0 tasks are somewhere between,
>- * it takes some effort for them to get interactive, but it's not
>- * too hard.
>- */
>-
>-#define CURRENT_BONUS(p) \
>-	(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
>-		MAX_SLEEP_AVG)
>-
>-#ifdef CONFIG_SMP
>-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
>-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
>-			num_online_cpus())
>-#else
>-#define TIMESLICE_GRANULARITY(p)	(MIN_TIMESLICE * \
>-		(1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
>-#endif
>-
>-#define SCALE(v1,v1_max,v2_max) \
>-	(v1) * (v2_max) / (v1_max)
>-
>-#define DELTA(p) \
>-	(SCALE(TASK_NICE(p), 40, MAX_BONUS) + INTERACTIVE_DELTA)
> 
>-#define TASK_INTERACTIVE(p) \
>-	((p)->prio <= (p)->static_prio - DELTA(p))
>-
>-#define INTERACTIVE_SLEEP(p) \
>-	(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
>-		(MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
>-
>-#define HIGH_CREDIT(p) \
>-	((p)->interactive_credit > CREDIT_LIMIT)
>-
>-#define LOW_CREDIT(p) \
>-	((p)->interactive_credit < -CREDIT_LIMIT)
>-
>-#define TASK_PREEMPTS_CURR(p, rq) \
>-	((p)->prio < (rq)->curr->prio)
>-
>-/*
>- * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
>- * to time slice values.
>- *
>- * The higher a thread's priority, the bigger timeslices
>- * it gets during one round of execution. But even the lowest
>- * priority thread gets MIN_TIMESLICE worth of execution time.
>- *
>- * task_timeslice() is the interface that is used by the scheduler.
>+int compute = 0;
>+/* 
>+ *This is the time all tasks within the same priority round robin.
>+ *compute setting is reserved for dedicated computational scheduling
>+ *and has ten times larger intervals.
>  */
>-
>-#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
>-		((MAX_TIMESLICE - MIN_TIMESLICE) * \
>-			(MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))
>-
>-static unsigned int task_timeslice(task_t *p)
>-{
>-	return BASE_TIMESLICE(p);
>-}
>+#define _RR_INTERVAL		((10 * HZ / 1000) ? : 1)
>+#define RR_INTERVAL		(_RR_INTERVAL * (1 + 9 * compute))
> 
> #define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
> 
>@@ -185,16 +83,8 @@
>  * These are the runqueue data structures:
>  */
> 
>-#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))
>-
> typedef struct runqueue runqueue_t;
> 
>-struct prio_array {
>-	unsigned int nr_active;
>-	unsigned long bitmap[BITMAP_SIZE];
>-	struct list_head queue[MAX_PRIO];
>-};
>-
> /*
>  * This is the main, per-CPU runqueue data structure.
>  *
>@@ -214,12 +104,13 @@
> 	unsigned long cpu_load;
> #endif
> 	unsigned long long nr_switches;
>-	unsigned long expired_timestamp, nr_uninterruptible;
>+	unsigned long nr_uninterruptible;
> 	unsigned long long timestamp_last_tick;
>+	unsigned int cache_ticks, preempted;
> 	task_t *curr, *idle;
> 	struct mm_struct *prev_mm;
>-	prio_array_t *active, *expired, arrays[2];
>-	int best_expired_prio;
>+	unsigned long bitmap[BITS_TO_LONGS(MAX_PRIO+1)];
>+	struct list_head queue[MAX_PRIO + 1];
> 	atomic_t nr_iowait;
> 
> #ifdef CONFIG_SMP
>@@ -297,67 +188,53 @@
> 	spin_unlock_irq(&rq->lock);
> }
> 
>-/*
>- * Adding/removing a task to/from a priority array:
>- */
>-static void dequeue_task(struct task_struct *p, prio_array_t *array)
>+static int task_preempts_curr(struct task_struct *p, runqueue_t *rq)
> {
>-	array->nr_active--;
>-	list_del(&p->run_list);
>-	if (list_empty(array->queue + p->prio))
>-		__clear_bit(p->prio, array->bitmap);
>+	if (p->prio >= rq->curr->prio)
>+		return 0;
>+	if (!compute || rq->cache_ticks >= cache_decay_ticks ||
>+		rt_task(p) || !p->mm || rq->curr == rq->idle) {
>+			rq->curr->flags |= PF_PREEMPTED;
>+			return 1;
>+	}
>+	rq->preempted = 1;
>+		return 0;
> }
> 
>-static void enqueue_task(struct task_struct *p, prio_array_t *array)
>+static inline int task_queued(task_t *task)
> {
>-	list_add_tail(&p->run_list, array->queue + p->prio);
>-	__set_bit(p->prio, array->bitmap);
>-	array->nr_active++;
>-	p->array = array;
>+	return !list_empty(&task->run_list);
> }
> 
> /*
>- * Used by the migration code - we pull tasks from the head of the
>- * remote queue so we want these tasks to show up at the head of the
>- * local queue:
>+ * Adding/removing a task to/from a runqueue:
>  */
>-static inline void enqueue_task_head(struct task_struct *p, prio_array_t *array)
>+static void dequeue_task(struct task_struct *p, runqueue_t *rq)
> {
>-	list_add(&p->run_list, array->queue + p->prio);
>-	__set_bit(p->prio, array->bitmap);
>-	array->nr_active++;
>-	p->array = array;
>+	list_del_init(&p->run_list);
>+	if (list_empty(rq->queue + p->prio))
>+		__clear_bit(p->prio, rq->bitmap);
>+}
>+
>+static void enqueue_task(struct task_struct *p, runqueue_t *rq)
>+{
>+	if (rq->curr->flags & PF_PREEMPTED) {
>+		rq->curr->flags &= ~PF_PREEMPTED;
>+		list_add(&p->run_list, rq->queue + p->prio);
>+	} else
>+		list_add_tail(&p->run_list, rq->queue + p->prio);
>+	__set_bit(p->prio, rq->bitmap);
> }
> 
> /*
>- * effective_prio - return the priority that is based on the static
>- * priority but is modified by bonuses/penalties.
>- *
>- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
>- * into the -5 ... 0 ... +5 bonus/penalty range.
>- *
>- * We use 25% of the full 0...39 priority range so that:
>- *
>- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
>- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
>- *
>- * Both properties are important to certain workloads.
>+ * Used by the migration code - we pull tasks from the head of the
>+ * remote queue so we want these tasks to show up at the head of the
>+ * local queue:
>  */
>-static int effective_prio(task_t *p)
>+static inline void enqueue_task_head(struct task_struct *p, runqueue_t *rq)
> {
>-	int bonus, prio;
>-
>-	if (rt_task(p))
>-		return p->prio;
>-
>-	bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
>-
>-	prio = p->static_prio - bonus;
>-	if (prio < MAX_RT_PRIO)
>-		prio = MAX_RT_PRIO;
>-	if (prio > MAX_PRIO-1)
>-		prio = MAX_PRIO-1;
>-	return prio;
>+	list_add(&p->run_list, rq->queue + p->prio);
>+	__set_bit(p->prio, rq->bitmap);
> }
> 
> /*
>@@ -365,7 +242,7 @@
>  */
> static inline void __activate_task(task_t *p, runqueue_t *rq)
> {
>-	enqueue_task(p, rq->active);
>+	enqueue_task(p, rq);
> 	rq->nr_running++;
> }
> 
>@@ -374,95 +251,109 @@
>  */
> static inline void __activate_idle_task(task_t *p, runqueue_t *rq)
> {
>-	enqueue_task_head(p, rq->active);
>+	enqueue_task_head(p, rq);
> 	rq->nr_running++;
> }
> 
>-static void recalc_task_prio(task_t *p, unsigned long long now)
>+// burst - extra intervals an interactive task can run for at best priority
>+static unsigned int burst(task_t *p)
> {
>-	unsigned long long __sleep_time = now - p->timestamp;
>-	unsigned long sleep_time;
>-
>-	if (__sleep_time > NS_MAX_SLEEP_AVG)
>-		sleep_time = NS_MAX_SLEEP_AVG;
>+	unsigned int task_user_prio;
>+	if (rt_task(p))
>+		return p->burst;
>+	task_user_prio = TASK_USER_PRIO(p);
>+	if (likely(task_user_prio < 40))
>+		return 39 - task_user_prio;
> 	else
>-		sleep_time = (unsigned long)__sleep_time;
>+		return 0;
>+}
> 
>-	if (likely(sleep_time > 0)) {
>-		/*
>-		 * User tasks that sleep a long time are categorised as
>-		 * idle and will get just interactive status to stay active &
>-		 * prevent them suddenly becoming cpu hogs and starving
>-		 * other processes.
>-		 */
>-		if (p->mm && p->activated != -1 &&
>-			sleep_time > INTERACTIVE_SLEEP(p)) {
>-				p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
>-						AVG_TIMESLICE);
>-				if (!HIGH_CREDIT(p))
>-					p->interactive_credit++;
>-		} else {
>-			/*
>-			 * The lower the sleep avg a task has the more
>-			 * rapidly it will rise with sleep time.
>-			 */
>-			sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
>+static void inc_burst(task_t *p)
>+{
>+	unsigned int best_burst;
>+	best_burst = burst(p);
>+	if (p->burst < best_burst)
>+		p->burst++;
>+}
> 
>-			/*
>-			 * Tasks with low interactive_credit are limited to
>-			 * one timeslice worth of sleep avg bonus.
>-			 */
>-			if (LOW_CREDIT(p) &&
>-			    sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
>-				sleep_time = JIFFIES_TO_NS(task_timeslice(p));
>+static void dec_burst(task_t *p)
>+{
>+	if (p->burst)
>+		p->burst--;
>+}
> 
>-			/*
>-			 * Non high_credit tasks waking from uninterruptible
>-			 * sleep are limited in their sleep_avg rise as they
>-			 * are likely to be cpu hogs waiting on I/O
>-			 */
>-			if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm) {
>-				if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
>-					sleep_time = 0;
>-				else if (p->sleep_avg + sleep_time >=
>-						INTERACTIVE_SLEEP(p)) {
>-					p->sleep_avg = INTERACTIVE_SLEEP(p);
>-					sleep_time = 0;
>-				}
>-			}
>+// slice - the duration a task runs before losing burst
>+static unsigned int slice(task_t *p)
>+{
>+	unsigned int slice = RR_INTERVAL;
>+	if (!rt_task(p))
>+		slice += burst(p) * RR_INTERVAL;
>+	return slice;
>+}
> 
>-			/*
>-			 * This code gives a bonus to interactive tasks.
>-			 *
>-			 * The boost works by updating the 'average sleep time'
>-			 * value here, based on ->timestamp. The more time a
>-			 * task spends sleeping, the higher the average gets -
>-			 * and the higher the priority boost gets as well.
>-			 */
>-			p->sleep_avg += sleep_time;
>+// interactive - interactive tasks get longer intervals at best priority
>+int interactive = 1;
> 
>-			if (p->sleep_avg > NS_MAX_SLEEP_AVG) {
>-				p->sleep_avg = NS_MAX_SLEEP_AVG;
>-				if (!HIGH_CREDIT(p))
>-					p->interactive_credit++;
>+/*
>+ * effective_prio - dynamic priority dependent on burst.
>+ * The priority normally decreases by one each RR_INTERVAL.
>+ * As the burst increases the priority stays at the top "stair" or
>+ * priority for longer.
>+ */
>+static int effective_prio(task_t *p)
>+{
>+	int prio;
>+	unsigned int full_slice, used_slice, first_slice;
>+	unsigned int best_burst;
>+	if (rt_task(p))
>+		return p->prio;
>+
>+	best_burst = burst(p);
>+	full_slice = slice(p);
>+	used_slice = full_slice - p->slice;
>+	if (p->burst > best_burst)
>+		p->burst = best_burst;
>+	first_slice = RR_INTERVAL;
>+	if (interactive && !compute)
>+		first_slice *= (p->burst + 1);
>+	prio = MAX_PRIO - 1 - best_burst;
>+
>+	if (used_slice < first_slice)
>+		return prio;
>+	prio += 1 + (used_slice - first_slice) / RR_INTERVAL;
>+	if (prio > MAX_PRIO - 1)
>+		prio = MAX_PRIO - 1;
>+	return prio;
>+}
>+
>+static void recalc_task_prio(task_t *p, unsigned long long now)
>+{
>+	unsigned long sleep_time = now - p->timestamp;
>+	unsigned long run_time = NS_TO_JIFFIES(p->runtime);
>+	unsigned long total_run = NS_TO_JIFFIES(p->totalrun) + run_time;
>+	if ((!run_time && NS_TO_JIFFIES(p->runtime + sleep_time) <
>+		RR_INTERVAL) || p->flags & PF_FORKED) {
>+			p->flags &= ~PF_FORKED;
>+			if (p->slice - total_run < 1) {
>+				p->totalrun = 0;
>+				dec_burst(p);
>+			} else {
>+				p->totalrun += p->runtime;
>+				p->slice -= NS_TO_JIFFIES(p->totalrun);
> 			}
>-		}
>+	} else {
>+		inc_burst(p);
>+		p->runtime = 0;
>+		p->totalrun = 0;
> 	}
>-
>-	p->prio = effective_prio(p);
> }
> 
> /*
>  * activate_task - move a task to the runqueue and do priority recalculation
>- *
>- * Update all the scheduling statistics stuff. (sleep average
>- * calculation, priority modifiers, etc.)
>  */
> static void activate_task(task_t *p, runqueue_t *rq, int local)
> {
>-	unsigned long long now;
>-
>-	now = sched_clock();
>+	unsigned long long now = sched_clock();
> #ifdef CONFIG_SMP
> 	if (!local) {
> 		/* Compensate for drifting sched_clock */
>@@ -471,33 +362,11 @@
> 			+ rq->timestamp_last_tick;
> 	}
> #endif
>-
>+	p->slice = slice(p);
> 	recalc_task_prio(p, now);
>-
>-	/*
>-	 * This checks to make sure it's not an uninterruptible task
>-	 * that is now waking up.
>-	 */
>-	if (!p->activated) {
>-		/*
>-		 * Tasks which were woken up by interrupts (ie. hw events)
>-		 * are most likely of interactive nature. So we give them
>-		 * the credit of extending their sleep time to the period
>-		 * of time they spend on the runqueue, waiting for execution
>-		 * on a CPU, first time around:
>-		 */
>-		if (in_interrupt())
>-			p->activated = 2;
>-		else {
>-			/*
>-			 * Normal first-time wakeups get a credit too for
>-			 * on-runqueue time, but it will be weighted down:
>-			 */
>-			p->activated = 1;
>-		}
>-	}
>+	p->prio = effective_prio(p);
>+	p->time_slice = RR_INTERVAL;
> 	p->timestamp = now;
>-
> 	__activate_task(p, rq);
> }
> 
>@@ -509,8 +378,7 @@
> 	rq->nr_running--;
> 	if (p->state == TASK_UNINTERRUPTIBLE)
> 		rq->nr_uninterruptible++;
>-	dequeue_task(p, p->array);
>-	p->array = NULL;
>+	dequeue_task(p, rq);
> }
> 
> /*
>@@ -583,7 +451,7 @@
> 	 * If the task is not on a runqueue (and not running), then
> 	 * it is sufficient to simply update the task's cpu field.
> 	 */
>-	if (!p->array && !task_running(rq, p)) {
>+	if (!task_queued(p) && !task_running(rq, p)) {
> 		set_task_cpu(p, dest_cpu);
> 		return 0;
> 	}
>@@ -614,7 +482,7 @@
> repeat:
> 	rq = task_rq_lock(p, &flags);
> 	/* Must be off runqueue entirely, not preempted. */
>-	if (unlikely(p->array)) {
>+	if (unlikely(task_queued(p))) {
> 		/* If it's preempted, we yield.  It could be a while. */
> 		preempted = !task_running(rq, p);
> 		task_rq_unlock(rq, &flags);
>@@ -744,7 +612,7 @@
> 	if (!(old_state & state))
> 		goto out;
> 
>-	if (p->array)
>+	if (task_queued(p))
> 		goto out_running;
> 
> 	cpu = task_cpu(p);
>@@ -811,7 +679,7 @@
> 		old_state = p->state;
> 		if (!(old_state & state))
> 			goto out;
>-		if (p->array)
>+		if (task_queued(p))
> 			goto out_running;
> 
> 		this_cpu = smp_processor_id();
>@@ -820,14 +688,8 @@
> 
> out_activate:
> #endif /* CONFIG_SMP */
>-	if (old_state == TASK_UNINTERRUPTIBLE) {
>+	if (old_state == TASK_UNINTERRUPTIBLE)
> 		rq->nr_uninterruptible--;
>-		/*
>-		 * Tasks on involuntary sleep don't earn
>-		 * sleep_avg beyond just interactive state.
>-		 */
>-		p->activated = -1;
>-	}
> 
> 	/*
> 	 * Sync wakeups (i.e. those types of wakeups where the waker
>@@ -839,7 +701,7 @@
> 	 */
> 	activate_task(p, rq, cpu == this_cpu);
> 	if (!sync || cpu != this_cpu) {
>-		if (TASK_PREEMPTS_CURR(p, rq))
>+		if (task_preempts_curr(p, rq))
> 			resched_task(rq->curr);
> 	}
> 	success = 1;
>@@ -879,7 +741,6 @@
> 	 */
> 	p->state = TASK_RUNNING;
> 	INIT_LIST_HEAD(&p->run_list);
>-	p->array = NULL;
> 	spin_lock_init(&p->switch_lock);
> #ifdef CONFIG_PREEMPT
> 	/*
>@@ -890,33 +751,6 @@
> 	 */
> 	p->thread_info->preempt_count = 1;
> #endif
>-	/*
>-	 * Share the timeslice between parent and child, thus the
>-	 * total amount of pending timeslices in the system doesn't change,
>-	 * resulting in more scheduling fairness.
>-	 */
>-	local_irq_disable();
>-	p->time_slice = (current->time_slice + 1) >> 1;
>-	/*
>-	 * The remainder of the first timeslice might be recovered by
>-	 * the parent if the child exits early enough.
>-	 */
>-	p->first_time_slice = 1;
>-	current->time_slice >>= 1;
>-	p->timestamp = sched_clock();
>-	if (!current->time_slice) {
>-		/*
>-		 * This case is rare, it happens when the parent has only
>-		 * a single jiffy left from its timeslice. Taking the
>-		 * runqueue lock is not a problem.
>-		 */
>-		current->time_slice = 1;
>-		preempt_disable();
>-		scheduler_tick(0, 0);
>-		local_irq_enable();
>-		preempt_enable();
>-	} else
>-		local_irq_enable();
> }
> 
> /*
>@@ -930,66 +764,14 @@
> 	unsigned long flags;
> 	runqueue_t *rq = task_rq_lock(current, &flags);
> 
>+	//Forked process gets no burst to prevent fork bombs.
>+	p->burst = 0;
> 	BUG_ON(p->state != TASK_RUNNING);
> 
>-	/*
>-	 * We decrease the sleep average of forking parents
>-	 * and children as well, to keep max-interactive tasks
>-	 * from forking tasks that are max-interactive.
>-	 */
>-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
>-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
>-
>-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
>-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
>-
>-	p->interactive_credit = 0;
>-
>-	p->prio = effective_prio(p);
> 	set_task_cpu(p, smp_processor_id());
> 
>-	if (unlikely(!current->array))
>-		__activate_task(p, rq);
>-	else {
>-		p->prio = current->prio;
>-		list_add_tail(&p->run_list, &current->run_list);
>-		p->array = current->array;
>-		p->array->nr_active++;
>-		rq->nr_running++;
>-	}
>-	task_rq_unlock(rq, &flags);
>-}
>-
>-/*
>- * Potentially available exiting-child timeslices are
>- * retrieved here - this way the parent does not get
>- * penalized for creating too many threads.
>- *
>- * (this cannot be used to 'generate' timeslices
>- * artificially, because any timeslice recovered here
>- * was given away by the parent in the first place.)
>- */
>-void fastcall sched_exit(task_t * p)
>-{
>-	unsigned long flags;
>-	runqueue_t *rq;
>-
>-	local_irq_save(flags);
>-	if (p->first_time_slice) {
>-		p->parent->time_slice += p->time_slice;
>-		if (unlikely(p->parent->time_slice > MAX_TIMESLICE))
>-			p->parent->time_slice = MAX_TIMESLICE;
>-	}
>-	local_irq_restore(flags);
>-	/*
>-	 * If the child was a (relative-) CPU hog then decrease
>-	 * the sleep_avg of the parent as well.
>-	 */
>-	rq = task_rq_lock(p->parent, &flags);
>-	if (p->sleep_avg < p->parent->sleep_avg)
>-		p->parent->sleep_avg = p->parent->sleep_avg /
>-		(EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
>-		(EXIT_WEIGHT + 1);
>+	__activate_task(p, rq);
>+	current->flags |= PF_FORKED;
> 	task_rq_unlock(rq, &flags);
> }
> 
>@@ -1253,30 +1035,16 @@
> 		double_rq_unlock(this_rq, rq);
> 		goto lock_again;
> 	}
>-	/*
>-	 * We decrease the sleep average of forking parents
>-	 * and children as well, to keep max-interactive tasks
>-	 * from forking tasks that are max-interactive.
>-	 */
>-	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
>-		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
>-
>-	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
>-		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
>-
>-	p->interactive_credit = 0;
> 
> 	p->prio = effective_prio(p);
> 	set_task_cpu(p, cpu);
> 
> 	if (cpu == this_cpu) {
>-		if (unlikely(!current->array))
>+		if (unlikely(!task_queued(current)))
> 			__activate_task(p, rq);
> 		else {
> 			p->prio = current->prio;
> 			list_add_tail(&p->run_list, &current->run_list);
>-			p->array = current->array;
>-			p->array->nr_active++;
> 			rq->nr_running++;
> 		}
> 	} else {
>@@ -1284,7 +1052,7 @@
> 		p->timestamp = (p->timestamp - this_rq->timestamp_last_tick)
> 					+ rq->timestamp_last_tick;
> 		__activate_task(p, rq);
>-		if (TASK_PREEMPTS_CURR(p, rq))
>+		if (task_preempts_curr(p, rq))
> 			resched_task(rq->curr);
> 	}
> 
>@@ -1376,22 +1144,22 @@
>  * pull_task - move a task from a remote runqueue to the local runqueue.
>  * Both runqueues must be locked.
>  */
>-static inline
>-void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p,
>-	       runqueue_t *this_rq, prio_array_t *this_array, int this_cpu)
>+static inline 
>+void pull_task(runqueue_t *src_rq, task_t *p, 
>+		runqueue_t *this_rq, int this_cpu)
> {
>-	dequeue_task(p, src_array);
>+	dequeue_task(p, src_rq);
> 	src_rq->nr_running--;
> 	set_task_cpu(p, this_cpu);
> 	this_rq->nr_running++;
>-	enqueue_task(p, this_array);
>+	enqueue_task(p, this_rq);
> 	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
> 				+ this_rq->timestamp_last_tick;
> 	/*
> 	 * Note that idle threads have a prio of MAX_PRIO, for this test
> 	 * to be always true for them.
> 	 */
>-	if (TASK_PREEMPTS_CURR(p, this_rq))
>+	if (task_preempts_curr(p, this_rq))
> 		resched_task(this_rq->curr);
> }
> 
>@@ -1434,7 +1202,6 @@
> 		      unsigned long max_nr_move, struct sched_domain *sd,
> 		      enum idle_type idle)
> {
>-	prio_array_t *array, *dst_array;
> 	struct list_head *head, *curr;
> 	int idx, pulled = 0;
> 	task_t *tmp;
>@@ -1442,38 +1209,17 @@
> 	if (max_nr_move <= 0 || busiest->nr_running <= 1)
> 		goto out;
> 
>-	/*
>-	 * We first consider expired tasks. Those will likely not be
>-	 * executed in the near future, and they are most likely to
>-	 * be cache-cold, thus switching CPUs has the least effect
>-	 * on them.
>-	 */
>-	if (busiest->expired->nr_active) {
>-		array = busiest->expired;
>-		dst_array = this_rq->expired;
>-	} else {
>-		array = busiest->active;
>-		dst_array = this_rq->active;
>-	}
>-
>-new_array:
> 	/* Start searching at priority 0: */
> 	idx = 0;
> skip_bitmap:
> 	if (!idx)
>-		idx = sched_find_first_bit(array->bitmap);
>+		idx = sched_find_first_bit(busiest->bitmap);
> 	else
>-		idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
>-	if (idx >= MAX_PRIO) {
>-		if (array == busiest->expired && busiest->active->nr_active) {
>-			array = busiest->active;
>-			dst_array = this_rq->active;
>-			goto new_array;
>-		}
>+		idx = find_next_bit(busiest->bitmap, MAX_PRIO, idx);
>+	if (idx >= MAX_PRIO) 
> 		goto out;
>-	}
> 
>-	head = array->queue + idx;
>+	head = busiest->queue + idx;
> 	curr = head->prev;
> skip_queue:
> 	tmp = list_entry(curr, task_t, run_list);
>@@ -1486,7 +1232,7 @@
> 		idx++;
> 		goto skip_bitmap;
> 	}
>-	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
>+	pull_task(busiest, tmp, this_rq, this_cpu);
> 	pulled++;
> 
> 	/* We only want to steal up to the prescribed number of tasks. */
>@@ -1956,22 +1702,6 @@
> EXPORT_PER_CPU_SYMBOL(kstat);
> 
> /*
>- * We place interactive tasks back into the active array, if possible.
>- *
>- * To guarantee that this does not starve expired tasks we ignore the
>- * interactivity of a task if the first expired task had to wait more
>- * than a 'reasonable' amount of time. This deadline timeout is
>- * load-dependent, as the frequency of array switched decreases with
>- * increasing number of running tasks. We also ignore the interactivity
>- * if a better static_prio task has expired:
>- */
>-#define EXPIRED_STARVING(rq) \
>-	((STARVATION_LIMIT && ((rq)->expired_timestamp && \
>-		(jiffies - (rq)->expired_timestamp >= \
>-			STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
>-			((rq)->curr->static_prio > (rq)->best_expired_prio))
>-
>-/*
>  * This function gets called by the timer code, with HZ frequency.
>  * We call it with interrupts disabled.
>  *
>@@ -2015,78 +1745,36 @@
> 		cpustat->user += user_ticks;
> 	cpustat->system += sys_ticks;
> 
>-	/* Task might have expired already, but not scheduled off yet */
>-	if (p->array != rq->active) {
>+	spin_lock(&rq->lock);
>+	// SCHED_FIFO tasks never run out of timeslice.
>+	if (unlikely(p->policy == SCHED_FIFO))
>+		goto out_unlock;
>+	rq->cache_ticks++;
>+	// Tasks lose burst each time they use up a full slice().
>+	if (!--p->slice) {
> 		set_tsk_need_resched(p);
>-		goto out;
>+		dequeue_task(p, rq);
>+		dec_burst(p);
>+		p->slice = slice(p);
>+		p->prio = effective_prio(p);
>+		p->time_slice = RR_INTERVAL;
>+		enqueue_task(p, rq);
>+		goto out_unlock;
> 	}
>-	spin_lock(&rq->lock);
> 	/*
>-	 * The task was running during this tick - update the
>-	 * time slice counter. Note: we do not update a thread's
>-	 * priority until it either goes to sleep or uses up its
>-	 * timeslice. This makes it possible for interactive tasks
>-	 * to use up their timeslices at their highest priority levels.
>+	 * Tasks that run out of time_slice but still have slice left get
>+	 * requeued with a lower priority && RR_INTERVAL time_slice.
> 	 */
>-	if (unlikely(rt_task(p))) {
>-		/*
>-		 * RR tasks need a special form of timeslice management.
>-		 * FIFO tasks have no timeslices.
>-		 */
>-		if ((p->policy == SCHED_RR) && !--p->time_slice) {
>-			p->time_slice = task_timeslice(p);
>-			p->first_time_slice = 0;
>-			set_tsk_need_resched(p);
>-
>-			/* put it at the end of the queue: */
>-			dequeue_task(p, rq->active);
>-			enqueue_task(p, rq->active);
>-		}
>-		goto out_unlock;
>-	}
> 	if (!--p->time_slice) {
>-		dequeue_task(p, rq->active);
> 		set_tsk_need_resched(p);
>+		dequeue_task(p, rq);
> 		p->prio = effective_prio(p);
>-		p->time_slice = task_timeslice(p);
>-		p->first_time_slice = 0;
>-
>-		if (!rq->expired_timestamp)
>-			rq->expired_timestamp = jiffies;
>-		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
>-			enqueue_task(p, rq->expired);
>-			if (p->static_prio < rq->best_expired_prio)
>-				rq->best_expired_prio = p->static_prio;
>-		} else
>-			enqueue_task(p, rq->active);
>-	} else {
>-		/*
>-		 * Prevent a too long timeslice allowing a task to monopolize
>-		 * the CPU. We do this by splitting up the timeslice into
>-		 * smaller pieces.
>-		 *
>-		 * Note: this does not mean the task's timeslices expire or
>-		 * get lost in any way, they just might be preempted by
>-		 * another task of equal priority. (one with higher
>-		 * priority would have preempted this task already.) We
>-		 * requeue this task to the end of the list on this priority
>-		 * level, which is in essence a round-robin of tasks with
>-		 * equal priority.
>-		 *
>-		 * This only applies to tasks in the interactive
>-		 * delta range with at least TIMESLICE_GRANULARITY to requeue.
>-		 */
>-		if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
>-			p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
>-			(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
>-			(p->array == rq->active)) {
>-
>-			dequeue_task(p, rq->active);
>-			set_tsk_need_resched(p);
>-			p->prio = effective_prio(p);
>-			enqueue_task(p, rq->active);
>-		}
>+		p->time_slice = RR_INTERVAL;
>+		enqueue_task(p, rq);
>+		goto out_unlock;
> 	}
>+	if (rq->preempted && rq->cache_ticks >= cache_decay_ticks)
>+		set_tsk_need_resched(p);
> out_unlock:
> 	spin_unlock(&rq->lock);
> out:
>@@ -2149,8 +1837,8 @@
> 		 * task from using an unfair proportion of the
> 		 * physical cpu's resources. -ck
> 		 */
>-		if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) >
>-			task_timeslice(p) || rt_task(smt_curr)) &&
>+		if (((smt_curr->slice * (100 - sd->per_cpu_gain) / 100) >
>+			slice(p) || rt_task(smt_curr)) &&
> 			p->mm && smt_curr->mm && !rt_task(p))
> 				ret = 1;
> 
>@@ -2159,8 +1847,8 @@
> 		 * or wake it up if it has been put to sleep for priority
> 		 * reasons.
> 		 */
>-		if ((((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
>-			task_timeslice(smt_curr) || rt_task(p)) &&
>+		if ((((p->slice * (100 - sd->per_cpu_gain) / 100) > 
>+			slice(smt_curr) || rt_task(p)) && 
> 			smt_curr->mm && p->mm && !rt_task(smt_curr)) ||
> 			(smt_curr == smt_rq->idle && smt_rq->nr_running))
> 				resched_task(smt_curr);
>@@ -2186,10 +1874,8 @@
> 	long *switch_count;
> 	task_t *prev, *next;
> 	runqueue_t *rq;
>-	prio_array_t *array;
> 	struct list_head *queue;
> 	unsigned long long now;
>-	unsigned long run_time;
> 	int cpu, idx;
> 
> 	/*
>@@ -2211,19 +1897,8 @@
> 
> 	release_kernel_lock(prev);
> 	now = sched_clock();
>-	if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
>-		run_time = now - prev->timestamp;
>-	else
>-		run_time = NS_MAX_SLEEP_AVG;
>-
>-	/*
>-	 * Tasks with interactive credits get charged less run_time
>-	 * at high sleep_avg to delay them losing their interactive
>-	 * status
>-	 */
>-	if (HIGH_CREDIT(prev))
>-		run_time /= (CURRENT_BONUS(prev) ? : 1);
> 
>+	prev->runtime = now - prev->timestamp;
> 	spin_lock_irq(&rq->lock);
> 
> 	/*
>@@ -2245,59 +1920,26 @@
> 		idle_balance(cpu, rq);
> 		if (!rq->nr_running) {
> 			next = rq->idle;
>-			rq->expired_timestamp = 0;
> 			wake_sleeping_dependent(cpu, rq);
> 			goto switch_tasks;
> 		}
> 	}
> 
>-	array = rq->active;
>-	if (unlikely(!array->nr_active)) {
>-		/*
>-		 * Switch the active and expired arrays.
>-		 */
>-		rq->active = rq->expired;
>-		rq->expired = array;
>-		array = rq->active;
>-		rq->expired_timestamp = 0;
>-		rq->best_expired_prio = MAX_PRIO;
>-	}
>-
>-	idx = sched_find_first_bit(array->bitmap);
>-	queue = array->queue + idx;
>+	idx = sched_find_first_bit(rq->bitmap);
>+	queue = rq->queue + idx;
> 	next = list_entry(queue->next, task_t, run_list);
> 
>-	if (dependent_sleeper(cpu, rq, next)) {
>+	if (dependent_sleeper(cpu, rq, next))
> 		next = rq->idle;
>-		goto switch_tasks;
>-	}
>-
>-	if (!rt_task(next) && next->activated > 0) {
>-		unsigned long long delta = now - next->timestamp;
>-
>-		if (next->activated == 1)
>-			delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
>-
>-		array = next->array;
>-		dequeue_task(next, array);
>-		recalc_task_prio(next, next->timestamp + delta);
>-		enqueue_task(next, array);
>-	}
>-	next->activated = 0;
> switch_tasks:
> 	prefetch(next);
> 	clear_tsk_need_resched(prev);
> 	RCU_qsctr(task_cpu(prev))++;
>-
>-	prev->sleep_avg -= run_time;
>-	if ((long)prev->sleep_avg <= 0) {
>-		prev->sleep_avg = 0;
>-		if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
>-			prev->interactive_credit--;
>-	}
> 	prev->timestamp = now;
> 
> 	if (likely(prev != next)) {
>+		rq->preempted = 0;
>+		rq->cache_ticks = 0;
> 		next->timestamp = now;
> 		rq->nr_switches++;
> 		rq->curr = next;
>@@ -2560,9 +2202,8 @@
> void set_user_nice(task_t *p, long nice)
> {
> 	unsigned long flags;
>-	prio_array_t *array;
> 	runqueue_t *rq;
>-	int old_prio, new_prio, delta;
>+	int queued, old_prio, new_prio, delta;
> 
> 	if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
> 		return;
>@@ -2581,9 +2222,8 @@
> 		p->static_prio = NICE_TO_PRIO(nice);
> 		goto out_unlock;
> 	}
>-	array = p->array;
>-	if (array)
>-		dequeue_task(p, array);
>+	if ((queued = task_queued(p)))
>+		dequeue_task(p, rq);
> 
> 	old_prio = p->prio;
> 	new_prio = NICE_TO_PRIO(nice);
>@@ -2591,8 +2231,8 @@
> 	p->static_prio = NICE_TO_PRIO(nice);
> 	p->prio += delta;
> 
>-	if (array) {
>-		enqueue_task(p, array);
>+	if (queued) {
>+		enqueue_task(p, rq);
> 		/*
> 		 * If the task increased its priority or is running and
> 		 * lowered its priority, then reschedule its CPU:
>@@ -2697,7 +2337,7 @@
> /* Actually do priority change: must hold rq lock. */
> static void __setscheduler(struct task_struct *p, int policy, int prio)
> {
>-	BUG_ON(p->array);
>+	BUG_ON(task_queued(p));
> 	p->policy = policy;
> 	p->rt_priority = prio;
> 	if (policy != SCHED_NORMAL)
>@@ -2713,8 +2353,7 @@
> {
> 	struct sched_param lp;
> 	int retval = -EINVAL;
>-	int oldprio;
>-	prio_array_t *array;
>+	int queued, oldprio;
> 	unsigned long flags;
> 	runqueue_t *rq;
> 	task_t *p;
>@@ -2774,13 +2413,12 @@
> 	if (retval)
> 		goto out_unlock;
> 
>-	array = p->array;
>-	if (array)
>+	if ((queued = task_queued(p)))
> 		deactivate_task(p, task_rq(p));
> 	retval = 0;
> 	oldprio = p->prio;
> 	__setscheduler(p, policy, lp.sched_priority);
>-	if (array) {
>+	if (queued) {
> 		__activate_task(p, task_rq(p));
> 		/*
> 		 * Reschedule if we are currently running on this runqueue and
>@@ -2790,7 +2428,7 @@
> 		if (task_running(rq, p)) {
> 			if (p->prio > oldprio)
> 				resched_task(rq->curr);
>-		} else if (TASK_PREEMPTS_CURR(p, rq))
>+		} else if (task_preempts_curr(p, rq))
> 			resched_task(rq->curr);
> 	}
> 
>@@ -2983,28 +2621,19 @@
> /**
>  * sys_sched_yield - yield the current processor to other threads.
>  *
>- * this function yields the current CPU by moving the calling thread
>- * to the expired array. If there are no other threads running on this
>  * CPU then this function will return.
>  */
> asmlinkage long sys_sched_yield(void)
> {
> 	runqueue_t *rq = this_rq_lock();
>-	prio_array_t *array = current->array;
>-	prio_array_t *target = rq->expired;
>-
>-	/*
>-	 * We implement yielding by moving the task into the expired
>-	 * queue.
>-	 *
>-	 * (special rule: RT tasks will just roundrobin in the active
>-	 *  array.)
>-	 */
>-	if (unlikely(rt_task(current)))
>-		target = rq->active;
> 
>-	dequeue_task(current, array);
>-	enqueue_task(current, target);
>+	dequeue_task(current, rq);
>+ 	current->slice = slice(current);
>+ 	current->time_slice = RR_INTERVAL;
>+	if (!rt_task(current))
>+		current->prio = MAX_PRIO - 1;
>+	current->burst = 0;
>+	enqueue_task(current, rq);
> 
> 	/*
> 	 * Since we are going to call schedule() anyway, there's
>@@ -3143,7 +2772,7 @@
> 		goto out_unlock;
> 
> 	jiffies_to_timespec(p->policy & SCHED_FIFO ?
>-				0 : task_timeslice(p), &t);
>+				0 : slice(p), &t);
> 	read_unlock(&tasklist_lock);
> 	retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
> out_nounlock:
>@@ -3261,9 +2890,9 @@
> 
> 	idle_rq->curr = idle_rq->idle = idle;
> 	deactivate_task(idle, rq);
>-	idle->array = NULL;
> 	idle->prio = MAX_PRIO;
> 	idle->state = TASK_RUNNING;
>+	idle->burst = 0;
> 	set_task_cpu(idle, cpu);
> 	double_rq_unlock(idle_rq, rq);
> 	set_tsk_need_resched(idle);
>@@ -3372,7 +3001,7 @@
> 		goto out;
> 
> 	set_task_cpu(p, dest_cpu);
>-	if (p->array) {
>+	if (task_queued(p)) {
> 		/*
> 		 * Sync timestamp with rq_dest's before activating.
> 		 * The same thing could be achieved by doing this step
>@@ -3383,7 +3012,7 @@
> 				+ rq_dest->timestamp_last_tick;
> 		deactivate_task(p, rq_src);
> 		activate_task(p, rq_dest, 0);
>-		if (TASK_PREEMPTS_CURR(p, rq_dest))
>+		if (task_preempts_curr(p, rq_dest))
> 			resched_task(rq_dest->curr);
> 	}
> 
>@@ -3896,7 +3525,7 @@
> void __init sched_init(void)
> {
> 	runqueue_t *rq;
>-	int i, j, k;
>+	int i, j;
> 
> #ifdef CONFIG_SMP
> 	/* Set up an initial dummy domain for early boot */
>@@ -3917,13 +3546,11 @@
> #endif
> 
> 	for (i = 0; i < NR_CPUS; i++) {
>-		prio_array_t *array;
>-
> 		rq = cpu_rq(i);
> 		spin_lock_init(&rq->lock);
>-		rq->active = rq->arrays;
>-		rq->expired = rq->arrays + 1;
>-		rq->best_expired_prio = MAX_PRIO;
>+		
>+		rq->cache_ticks = 0;
>+		rq->preempted = 0;
> 
> #ifdef CONFIG_SMP
> 		rq->sd = &sched_domain_init;
>@@ -3934,16 +3561,11 @@
> 		INIT_LIST_HEAD(&rq->migration_queue);
> #endif
> 		atomic_set(&rq->nr_iowait, 0);
>-
>-		for (j = 0; j < 2; j++) {
>-			array = rq->arrays + j;
>-			for (k = 0; k < MAX_PRIO; k++) {
>-				INIT_LIST_HEAD(array->queue + k);
>-				__clear_bit(k, array->bitmap);
>-			}
>-			// delimiter for bitsearch
>-			__set_bit(MAX_PRIO, array->bitmap);
>-		}
>+		for (j = 0; j <= MAX_PRIO; j++)
>+			INIT_LIST_HEAD(&rq->queue[j]);
>+		memset(rq->bitmap, 0, BITS_TO_LONGS(MAX_PRIO+1)*sizeof(long));
>+		// delimiter for bitsearch
>+		__set_bit(MAX_PRIO, rq->bitmap);
> 	}
> 	/*
> 	 * We have to do a little magic to get the first
>Index: linux-2.6.7-staircase/kernel/sysctl.c
>===================================================================
>--- linux-2.6.7-staircase.orig/kernel/sysctl.c	2004-06-25 02:24:28.112116264 +1000
>+++ linux-2.6.7-staircase/kernel/sysctl.c	2004-06-25 02:24:33.895207459 +1000
>@@ -636,6 +636,22 @@
> 		.mode		= 0444,
> 		.proc_handler	= &proc_dointvec,
> 	},
>+	{
>+		.ctl_name	= KERN_INTERACTIVE,
>+		.procname	= "interactive",
>+		.data		= &interactive,
>+		.maxlen		= sizeof (int),
>+		.mode		= 0644,
>+		.proc_handler	= &proc_dointvec,
>+	},
>+	{
>+		.ctl_name	= KERN_COMPUTE,
>+		.procname	= "compute",
>+		.data		= &compute,
>+		.maxlen		= sizeof (int),
>+		.mode		= 0644,
>+		.proc_handler	= &proc_dointvec,
>+	},
> 	{ .ctl_name = 0 }
> };
> 
>  
>
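For readers skimming the hunks above: the core structural change is that the per-CPU runqueue drops the paired active/expired prio_array_t in favour of a single queue[] of list heads plus one bitmap, with bit MAX_PRIO kept permanently set as a delimiter for the bit search (see the sched_init() and schedule() hunks). A minimal userspace sketch of that lookup structure, assuming nothing beyond the patch — the names (mini_rq, rq_first_prio) and the linear bit scan standing in for sched_find_first_bit() are illustrative, not kernel code:

```c
#include <assert.h>
#include <string.h>

#define MAX_PRIO 140   /* matches MAX_PRIO in 2.6-era kernels */
#define WORD_BITS 32

/* Hypothetical miniature of the single-queue runqueue the patch
 * introduces: one entry per priority plus a bitmap, with the
 * MAX_PRIO bit permanently set as a search delimiter. */
struct mini_rq {
	unsigned long bitmap[(MAX_PRIO + 1 + WORD_BITS - 1) / WORD_BITS];
	int nr_at[MAX_PRIO + 1];   /* stand-in for the per-prio task lists */
};

static void rq_init(struct mini_rq *rq)
{
	memset(rq, 0, sizeof(*rq));
	/* delimiter for bitsearch, as in the patched sched_init() */
	rq->bitmap[MAX_PRIO / WORD_BITS] |= 1UL << (MAX_PRIO % WORD_BITS);
}

static void rq_enqueue(struct mini_rq *rq, int prio)
{
	rq->nr_at[prio]++;
	rq->bitmap[prio / WORD_BITS] |= 1UL << (prio % WORD_BITS);
}

static void rq_dequeue(struct mini_rq *rq, int prio)
{
	if (--rq->nr_at[prio] == 0)
		rq->bitmap[prio / WORD_BITS] &= ~(1UL << (prio % WORD_BITS));
}

/* Linear stand-in for sched_find_first_bit(): the lowest set bit,
 * i.e. the best (numerically smallest) priority with queued tasks;
 * an empty runqueue yields MAX_PRIO, the idle task's priority. */
static int rq_first_prio(const struct mini_rq *rq)
{
	int i;

	for (i = 0; i <= MAX_PRIO; i++)
		if (rq->bitmap[i / WORD_BITS] & (1UL << (i % WORD_BITS)))
			return i;
	return MAX_PRIO;
}
```

This is also why load balancing gets simpler in move_tasks(): with no expired array there is no new_array retry, just one bitmap walk on the busiest runqueue.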

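The rewritten scheduler_tick() above is what gives the scheduler its staircase shape: each task carries both a per-quantum time_slice and a total slice; using up a quantum merely requeues the task one priority step down, while exhausting the whole slice also decays its burst and refills everything. A toy model of that control flow — RR_INTERVAL's value here and the slice_of()/prio_of() formulas are invented for illustration, the real slice() and effective_prio() are elsewhere in the patch:

```c
#include <assert.h>

#define RR_INTERVAL 10   /* assumed quantum length, illustration only */

/* Hypothetical task fields mirrored from the patch: 'slice' is the
 * remaining total entitlement, 'time_slice' the round-robin quantum,
 * 'burst' the interactivity bonus that decays on full-slice use. */
struct mini_task {
	int total;       /* slice granted at the last refill */
	int slice;
	int time_slice;
	int burst;
	int prio;
};

/* Assumed stand-ins for the kernel's slice()/effective_prio():
 * more burst buys a longer slice and a better (lower) priority,
 * and priority steps down as the slice is consumed. */
static int slice_of(const struct mini_task *p)
{
	return RR_INTERVAL * (1 + p->burst);
}

static int prio_of(const struct mini_task *p)
{
	return 118 - p->burst + (p->total - p->slice) / RR_INTERVAL;
}

/* One timer tick, following the rewritten scheduler_tick();
 * returns 1 when the task must be requeued (resched needed). */
static int mini_tick(struct mini_task *p)
{
	if (--p->slice == 0) {
		/* Whole slice consumed: burst decays, all refilled. */
		if (p->burst > 0)
			p->burst--;
		p->total = p->slice = slice_of(p);
		p->prio = prio_of(p);
		p->time_slice = RR_INTERVAL;
		return 1;
	}
	if (--p->time_slice == 0) {
		/* Quantum used, slice remains: one step down the stairs. */
		p->prio = prio_of(p);
		p->time_slice = RR_INTERVAL;
		return 1;
	}
	return 0;
}
```

Note how this model also explains the nice behaviour mentioned in the announcement: anything that shortens slice() makes the descent steeper, so generous +nice values translate directly into less CPU at lower priority.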
