public inbox for linux-kernel@vger.kernel.org
* [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
@ 2006-02-28 22:32 Peter Williams
  2006-03-01  2:36 ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-02-28 22:32 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Chris Han, Con Kolivas, William Lee Irwin III, Jake Moilanen,
	Paolo Ornati, Ingo Molnar

This version updates the staircase scheduler to version 14.1 (thanks,
Con) and includes the latest smpnice patches.

A patch for 2.6.16-rc5 is available at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.1-for-2.6.16-rc5.patch?download>

Very Brief Documentation:

You can select a default scheduler at kernel build time.  If you wish to
boot with a scheduler other than the default it can be selected at boot
time by adding:

cpusched=<scheduler>

to the boot command line where <scheduler> is one of: ingosched,
ingo_ll, nicksched, staircase, spa_no_frills, spa_ws, spa_svr, spa_ebs
or zaphod.  If you don't change the default when you build the kernel,
the default scheduler will be ingosched (which is the normal scheduler).

The scheduler in force on a running system can be determined by the
contents of:

/proc/scheduler

Control parameters for the scheduler can be read/set via files in:

/sys/cpusched/<scheduler>/
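
In practice, a session on a PlugSched kernel might look like the
following (a hypothetical transcript: the parameter name shown is an
example only, since the files exported under /sys/cpusched/ differ from
scheduler to scheduler):

```shell
# At boot: select the staircase scheduler instead of the compiled-in
# default by appending this to the kernel command line:
#     cpusched=staircase

# On the running system, confirm which scheduler is in force:
cat /proc/scheduler

# List the control parameters the active scheduler exports:
ls /sys/cpusched/staircase/

# Parameters are plain files; read one, then write a new value.
# ("interactive" is an illustrative name -- check the ls output
# above for the names your scheduler actually provides.)
cat /sys/cpusched/staircase/interactive
echo 0 > /sys/cpusched/staircase/interactive
```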

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-02-28 22:32 Peter Williams
@ 2006-03-01  2:36 ` Peter Williams
  2006-04-02  2:04   ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-03-01  2:36 UTC (permalink / raw)
  To: Linux Kernel Mailing List
  Cc: Chris Han, Con Kolivas, William Lee Irwin III, Jake Moilanen,
	Paolo Ornati, Ingo Molnar

Peter Williams wrote:
> This version updates the staircase scheduler to version 14.1 (thanks,
> Con) and includes the latest smpnice patches.
> 
> A patch for 2.6.16-rc5 is available at:
> 
> <http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.1-for-2.6.16-rc5.patch?download> 
> 

and for 2.6.16-rc5-mm1 at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.1-for-2.6.16-rc5-mm1.patch?download>

> 
> Very Brief Documentation:
> 
> You can select a default scheduler at kernel build time.  If you wish to
> boot with a scheduler other than the default it can be selected at boot
> time by adding:
> 
> cpusched=<scheduler>
> 
> to the boot command line where <scheduler> is one of: ingosched,
> ingo_ll, nicksched, staircase, spa_no_frills, spa_ws, spa_svr, spa_ebs
> or zaphod.  If you don't change the default when you build the kernel,
> the default scheduler will be ingosched (which is the normal scheduler).
> 
> The scheduler in force on a running system can be determined by the
> contents of:
> 
> /proc/scheduler
> 
> Control parameters for the scheduler can be read/set via files in:
> 
> /sys/cpusched/<scheduler>/
> 
> Peter


-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-03-01  2:36 ` Peter Williams
@ 2006-04-02  2:04   ` Peter Williams
  2006-04-02  6:02     ` Con Kolivas
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-02  2:04 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Chris Han, Con Kolivas,
	William Lee Irwin III, Jake Moilanen, Paolo Ornati, Ingo Molnar

Peter Williams wrote:
> Peter Williams wrote:
>> This version updates the staircase scheduler to version 14.1 (thanks,
>> Con) and includes the latest smpnice patches.
>>
>> A patch for 2.6.16-rc5 is available at:
>>
>> <http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.1-for-2.6.16-rc5.patch?download> 
>>
> 
> and for 2.6.16-rc5-mm1 at:
> 
> <http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.1-for-2.6.16-rc5-mm1.patch?download> 
> 
> 
>>
>> Very Brief Documentation:
>>
>> You can select a default scheduler at kernel build time.  If you wish to
>> boot with a scheduler other than the default it can be selected at boot
>> time by adding:
>>
>> cpusched=<scheduler>
>>
>> to the boot command line where <scheduler> is one of: ingosched,
>> ingo_ll, nicksched, staircase, spa_no_frills, spa_ws, spa_svr, spa_ebs
>> or zaphod.  If you don't change the default when you build the kernel,
>> the default scheduler will be ingosched (which is the normal scheduler).
>>
>> The scheduler in force on a running system can be determined by the
>> contents of:
>>
>> /proc/scheduler
>>
>> Control parameters for the scheduler can be read/set via files in:
>>
>> /sys/cpusched/<scheduler>/
>>
>> Peter
> 
> 

Now available for 2.6.16 at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.1-for-2.6.16.patch?download>

and 2.6.16-mm2 at:

<http://prdownloads.sourceforge.net/cpuse/plugsched-6.3.1-for-2.6.16-mm2.patch?download>

Con and Nick,
	I've taken the liberty of modifying staircase and nicksched (in the 
2.6.16-mm2 version) to support priority inheritance.  I'd appreciate it 
if you could review the code.

Thanks,
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-02  2:04   ` Peter Williams
@ 2006-04-02  6:02     ` Con Kolivas
  0 siblings, 0 replies; 28+ messages in thread
From: Con Kolivas @ 2006-04-02  6:02 UTC (permalink / raw)
  To: Peter Williams
  Cc: Linux Kernel Mailing List, Chris Han, William Lee Irwin III,
	Jake Moilanen, Paolo Ornati, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 759 bytes --]

On Sunday 02 April 2006 12:04, Peter Williams wrote:
> Con and Nick,
> 	I've taken the liberty of modifying staircase and nicksched (in the
> 2.6.16-mm2 version) to support priority inheritance.  I'd appreciate it
> if you could review the code?

Looks fine to me.

Here are two patches (I know it's bad form to send two patches as
attachments under normal circumstances, but I'm sure you won't mind).
The first brings us up to staircase v15.  The second adds the
sched_system_tick function, called from account_system_time, that
staircase v15 needs to work properly.  Unfortunately I can only
build-test these patches at this time (the family needs my only PC),
but they build fine and the changes are mostly straightforward, so
they should be ok.

Cheers,
Con

[-- Attachment #2: plugsched-6.3.1-staircase14.1_15.patch --]
[-- Type: text/x-diff, Size: 17266 bytes --]

---
 include/linux/sched_runq.h |    3 
 include/linux/sched_task.h |    2 
 kernel/staircase.c         |  228 ++++++++++++++++++++++++---------------------
 3 files changed, 127 insertions(+), 106 deletions(-)

Index: linux-2.6.16-mm2/include/linux/sched_runq.h
===================================================================
--- linux-2.6.16-mm2.orig/include/linux/sched_runq.h	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/include/linux/sched_runq.h	2006-04-02 14:29:55.000000000 +1000
@@ -39,8 +39,7 @@ struct ingo_runqueue_queue {
 struct staircase_runqueue_queue {
 	DECLARE_BITMAP(bitmap, STAIRCASE_NUM_PRIO_SLOTS);
 	struct list_head queue[STAIRCASE_NUM_PRIO_SLOTS - 1];
-	unsigned int cache_ticks;
-	unsigned int preempted;
+	unsigned short cache_ticks, preempted;
 };
 #endif
 
Index: linux-2.6.16-mm2/include/linux/sched_task.h
===================================================================
--- linux-2.6.16-mm2.orig/include/linux/sched_task.h	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/include/linux/sched_task.h	2006-04-02 14:31:39.000000000 +1000
@@ -48,7 +48,7 @@ struct ingo_ll_sched_drv_task {
 #ifdef CONFIG_CPUSCHED_STAIRCASE
 struct staircase_sched_drv_task {
 	unsigned long sflags;
-	unsigned long runtime, totalrun, ns_debit;
+	unsigned long runtime, totalrun, ns_debit, systime;
 	unsigned int bonus;
 	unsigned int slice, time_slice;
 };
Index: linux-2.6.16-mm2/kernel/staircase.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/staircase.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/staircase.c	2006-04-02 15:11:10.000000000 +1000
@@ -2,8 +2,8 @@
  *  kernel/staircase.c
  *  Copyright (C) 2002-2006 Con Kolivas
  *
- * 2006-02-22 Staircase scheduler by Con Kolivas <kernel@kolivas.org>
- *            Staircase v14.1
+ * 2006-04-02 Staircase scheduler by Con Kolivas <kernel@kolivas.org>
+ *            Staircase v15
  */
 #include <linux/sched.h>
 #include <linux/init.h>
@@ -23,8 +23,7 @@ static void staircase_init_runqueue_queu
 {
 	int k;
 
-	qup->staircase.cache_ticks = 0;
-	qup->staircase.preempted = 0;
+	qup->staircase.cache_ticks = qup->staircase.preempted = 0;
 
 	for (k = 0; k < STAIRCASE_MAX_PRIO; k++) {
 		INIT_LIST_HEAD(qup->staircase.queue + k);
@@ -42,7 +41,9 @@ static int staircase_idle_prio(void)
 static void staircase_set_oom_time_slice(struct task_struct *p,
 	unsigned long t)
 {
-	p->sdu.staircase.slice = p->sdu.staircase.time_slice = t;
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+
+	sp->slice = sp->time_slice = t;
 }
 
 /*
@@ -53,13 +54,14 @@ static void staircase_set_oom_time_slice
 #define USER_PRIO(p)		((p)-MAX_RT_PRIO)
 #define TASK_USER_PRIO(p)	USER_PRIO((p)->static_prio)
 #define MAX_USER_PRIO		(USER_PRIO(STAIRCASE_MAX_PRIO))
+#define MIN_USER_PRIO		(STAIRCASE_MAX_PRIO - 1)
 
 /*
  * Some helpers for converting nanosecond timing to jiffy resolution
  */
-#define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))
 #define NSJIFFY			(1000000000 / HZ)	/* One jiffy in ns */
+#define NS_TO_JIFFIES(TIME)	((TIME) / NSJIFFY)
+#define JIFFIES_TO_NS(TIME)	((TIME) * NSJIFFY)
 
 int sched_compute __read_mostly = 0;
 /*
@@ -68,7 +70,7 @@ int sched_compute __read_mostly = 0;
  *and has twenty times larger intervals. Set to a minimum of 6ms.
  */
 #define _RR_INTERVAL		((6 * HZ / 1001) + 1)
-#define RR_INTERVAL()		(_RR_INTERVAL * (1 + 16 * sched_compute))
+#define RR_INTERVAL()		(_RR_INTERVAL * (1 + 9 * sched_compute))
 #define DEF_TIMESLICE		(RR_INTERVAL() * 19)
 
 #define TASK_PREEMPTS_CURR(p, rq) \
@@ -81,7 +83,7 @@ static unsigned long ns_diff(const unsig
 	const unsigned long long v2)
 {
 	unsigned long long vdiff;
-	if (likely(v1 > v2)) {
+	if (likely(v1 >= v2)) {
 		vdiff = v1 - v2;
 #if BITS_PER_LONG < 64
 		if (vdiff > (1 << 31))
@@ -103,10 +105,12 @@ static unsigned long ns_diff(const unsig
 static inline void dequeue_task(struct task_struct *p,
 	struct staircase_runqueue_queue *rqq)
 {
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+
 	list_del_init(&p->run_list);
 	if (list_empty(rqq->queue + p->prio))
 		__clear_bit(p->prio, rqq->bitmap);
-	p->sdu.staircase.ns_debit = 0;
+	sp->ns_debit = 0;
 }
 
 static void enqueue_task(struct task_struct *p,
@@ -117,10 +121,19 @@ static void enqueue_task(struct task_str
 	__set_bit(p->prio, rqq->bitmap);
 }
 
-static inline void requeue_task(struct task_struct *p,
-	struct staircase_runqueue_queue *rq)
+static void fastcall requeue_task(struct task_struct *p,
+	struct staircase_runqueue_queue *rqq, const int prio)
 {
-	list_move_tail(&p->run_list, rq->queue + p->prio);
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+
+	list_move_tail(&p->run_list, rqq->queue + prio);
+	if (p->prio != prio) {
+		if (list_empty(rqq->queue + p->prio))
+			__clear_bit(p->prio, rqq->bitmap);
+		p->prio = prio;
+		__set_bit(prio, rqq->bitmap);
+	}
+	sp->ns_debit = 0;
 }
 
 /*
@@ -140,7 +153,9 @@ static inline void enqueue_task_head(str
  */
 static inline void __activate_task(task_t *p, runqueue_t *rq)
 {
-	enqueue_task(p, &rq->qu.staircase);
+	struct staircase_runqueue_queue *rqq = &rq->qu.staircase;
+
+	enqueue_task(p, rqq);
 	inc_nr_running(p, rq);
 }
 
@@ -150,7 +165,9 @@ static inline void __activate_task(task_
  */
 static inline void __activate_idle_task(task_t *p, runqueue_t *rq)
 {
-	enqueue_task_head(p, &rq->qu.staircase);
+	struct staircase_runqueue_queue *rqq = &rq->qu.staircase;
+
+	enqueue_task_head(p, rqq);
 	inc_nr_running(p, rq);
 }
 #endif
@@ -239,22 +256,24 @@ static inline void staircase_set_load_we
 static void inc_bonus(task_t *p, const unsigned long totalrun,
 	const unsigned long sleep)
 {
-	unsigned int best_bonus;
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+	unsigned int best_bonus = sleep / (totalrun + 1);
 
-	best_bonus = sleep / (totalrun + 1);
-	if (p->sdu.staircase.bonus >= best_bonus)
+	if (sp->bonus >= best_bonus)
 		return;
 
-	p->sdu.staircase.bonus++;
 	best_bonus = bonus(p);
-	if (p->sdu.staircase.bonus > best_bonus)
-		p->sdu.staircase.bonus = best_bonus;
+	if (sp->bonus < best_bonus)
+		sp->bonus++;
 }
 
-static void dec_bonus(task_t *p)
+static inline void dec_bonus(task_t *p)
 {
-	if (p->sdu.staircase.bonus)
-		p->sdu.staircase.bonus--;
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+
+	sp->totalrun = 0;
+	if (sp->bonus)
+		sp->bonus--;
 }
 
 /*
@@ -270,41 +289,43 @@ int sched_interactive __read_mostly = 1;
  */
 static int staircase_normal_prio(task_t *p)
 {
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
 	int prio;
 	unsigned int full_slice, used_slice = 0;
 	unsigned int best_bonus, rr;
 
 	full_slice = slice(p);
-	if (full_slice > p->sdu.staircase.slice)
-		used_slice = full_slice - p->sdu.staircase.slice;
+	if (full_slice > sp->slice)
+		used_slice = full_slice - sp->slice;
 
 	best_bonus = bonus(p);
 	prio = MAX_RT_PRIO + best_bonus;
 	if (sched_interactive && !sched_compute && p->policy != SCHED_BATCH)
-		prio -= p->sdu.staircase.bonus;
+		prio -= sp->bonus;
 
 	rr = rr_interval(p);
 	prio += used_slice / rr;
-	if (prio > STAIRCASE_MAX_PRIO - 1)
-		prio = STAIRCASE_MAX_PRIO - 1;
+	if (prio > MIN_USER_PRIO)
+		prio = MIN_USER_PRIO;
 	return prio;
 }
 
 static inline void continue_slice(task_t *p)
 {
-	unsigned long total_run = NS_TO_JIFFIES(p->sdu.staircase.totalrun);
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+	unsigned long total_run = NS_TO_JIFFIES(sp->totalrun);
 
-	if (total_run >= p->sdu.staircase.slice) {
- 		p->sdu.staircase.totalrun -=
- 			JIFFIES_TO_NS(p->sdu.staircase.slice);
+	if (total_run >= sp->slice || p->prio == MIN_USER_PRIO)
 		dec_bonus(p);
-	} else {
-		unsigned int remainder;
+	else {
+		unsigned long remainder;
 
-		p->sdu.staircase.slice -= total_run;
-		remainder = p->sdu.staircase.slice % rr_interval(p);
+		sp->slice -= total_run;
+		if (sp->slice <= sp->time_slice)
+			dec_bonus(p);
+		remainder = sp->slice % rr_interval(p);
 		if (remainder)
-			p->sdu.staircase.time_slice = remainder;
+			sp->time_slice = remainder;
  	}
 }
 
@@ -315,35 +336,36 @@ static inline void continue_slice(task_t
  */
 static inline void recalc_task_prio(task_t *p, const unsigned long long now)
 {
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+	/* Double the systime to account for missed sub-jiffy time */
+	unsigned long ns_systime = JIFFIES_TO_NS(sp->systime) * 2;
 	unsigned long sleep_time = ns_diff(now, p->timestamp);
 
 	/*
-	 * Add the total for this last scheduled run (p->runtime) to the
-	 * running total so far used (p->totalrun).
+	 * Add the total for this last scheduled run (sp->runtime) and system
+	 * time (sp->systime) done on behalf of p to the running total so far
+	 * used (sp->totalrun).
 	 */
-	p->sdu.staircase.totalrun += p->sdu.staircase.runtime;
+	sp->totalrun += sp->runtime + ns_systime;
+
+	/* systime is unintentionally seen as sleep, subtract it */
+	if (likely(ns_systime < sleep_time))
+		sleep_time -= ns_systime;
+	else
+		sleep_time = 0;
 
 	/*
 	 * If we sleep longer than our running total and have not set the
-	 * PF_NONSLEEP flag we gain a bonus.
+	 * SF_NONSLEEP flag we gain a bonus.
 	 */
-	if (sleep_time >= p->sdu.staircase.totalrun &&
-		!(p->sdu.staircase.sflags & SF_NONSLEEP) &&
-		!sched_compute) {
-			inc_bonus(p, p->sdu.staircase.totalrun, sleep_time);
-			p->sdu.staircase.totalrun = 0;
-			return;
+	if (sleep_time >= sp->totalrun && !(sp->sflags & SF_NONSLEEP)) {
+		inc_bonus(p, sp->totalrun, sleep_time);
+		sp->totalrun = 0;
+		return;
 	}
 
-	/*
-	 * If we have not set the PF_NONSLEEP flag we elevate priority by the
-	 * amount of time we slept.
-	 */
-	if (p->sdu.staircase.sflags & SF_NONSLEEP)
-		p->sdu.staircase.sflags &= ~SF_NONSLEEP;
-	else
-		p->sdu.staircase.totalrun -= sleep_time;
-
+	/* We elevate priority by the amount of time we slept. */
+	sp->totalrun -= sleep_time;
 	continue_slice(p);
 }
 
@@ -355,6 +377,7 @@ static inline void recalc_task_prio(task
  */
 static void activate_task(task_t *p, runqueue_t *rq, const int local)
 {
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
 	unsigned long long now = sched_clock();
 	unsigned long rr = rr_interval(p);
 
@@ -366,11 +389,12 @@ static void activate_task(task_t *p, run
 			+ rq->timestamp_last_tick;
 	}
 #endif
-	p->sdu.staircase.slice = slice(p);
-	p->sdu.staircase.time_slice = p->sdu.staircase.slice % rr ? : rr;
+	sp->slice = slice(p);
+	sp->time_slice = sp->slice % rr ? : rr;
 	if (!rt_task(p)) {
 		recalc_task_prio(p, now);
-		p->sdu.staircase.sflags &= ~SF_NONSLEEP;
+		sp->sflags &= ~SF_NONSLEEP;
+		sp->systime = 0;
 		p->prio = effective_prio(p);
 	}
 	p->timestamp = now;
@@ -398,12 +422,17 @@ static void fastcall deactivate_task(tas
  */
 static void fastcall preempt(const task_t *p, runqueue_t *rq)
 {
-	if (p->prio >= rq->curr->prio)
+	struct staircase_runqueue_queue *rqq = &rq->qu.staircase;
+	task_t *curr = rq->curr;
+
+	if (p->prio >= curr->prio)
 		return;
-	if (!sched_compute || rq->qu.staircase.cache_ticks >= CACHE_DELAY ||
-		!p->mm || rt_task(p))
-			resched_task(rq->curr);
-	rq->qu.staircase.preempted = 1;
+	if (!sched_compute || rqq->cache_ticks >= CACHE_DELAY || !p->mm ||
+	    rt_task(p) || curr == rq->idle) {
+		resched_task(curr);
+		return;
+	}
+	rqq->preempted = 1;
 }
 
 /***
@@ -436,7 +465,7 @@ static void staircase_wake_up_task(task_
 
 /*
  * Perform scheduler related setup for a newly forked process p.
- * p is forked by current.
+ * p is forked by current. (nothing to do)
  */
 static void staircase_fork(task_t *__unused)
 {
@@ -452,6 +481,8 @@ static void staircase_fork(task_t *__unu
 static void staircase_wake_up_new_task(task_t *p,
 	const unsigned long clone_flags)
 {
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+	struct staircase_sched_drv_task *scurr = &current->sdu.staircase;
 	unsigned long flags;
 	int this_cpu, cpu;
 	runqueue_t *rq, *this_rq;
@@ -461,21 +492,20 @@ static void staircase_wake_up_new_task(t
 	this_cpu = smp_processor_id();
 	cpu = task_cpu(p);
 
-	/*
-	 * Forked process gets no bonus to prevent fork bombs.
-	 */
-	p->sdu.staircase.bonus = 0;
+	/* Forked process gets no bonus to prevent fork bombs. */
+	sp->bonus = 0;
+	scurr->sflags |= SF_NONSLEEP;
 
 	if (likely(cpu == this_cpu)) {
-		current->sdu.staircase.sflags |= SF_NONSLEEP;
 		activate_task(p, rq, 1);
-		if (!(clone_flags & CLONE_VM))
+		if (!(clone_flags & CLONE_VM)) {
 			/*
 			 * The VM isn't cloned, so we're in a good position to
 			 * do child-runs-first in anticipation of an exec. This
 			 * usually avoids a lot of COW overhead.
 			 */
 			set_need_resched();
+		}
 		/*
 		 * We skip the following code due to cpu == this_cpu
 	 	 *
@@ -501,20 +531,13 @@ static void staircase_wake_up_new_task(t
 		 */
 		task_rq_unlock(rq, &flags);
 		this_rq = task_rq_lock(current, &flags);
-		current->sdu.staircase.sflags |= SF_NONSLEEP;
 	}
 
 	task_rq_unlock(this_rq, &flags);
 }
 
 /*
- * Potentially available exiting-child timeslices are
- * retrieved here - this way the parent does not get
- * penalized for creating too many threads.
- *
- * (this cannot be used to 'generate' timeslices
- * artificially, because any timeslice recovered here
- * was given away by the parent in the first place.)
+ * Perform task exit functions (nothing to do)
  */
 static void staircase_exit(task_t *__unused)
 {
@@ -620,11 +643,12 @@ out:
 static void time_slice_expired(task_t *p, runqueue_t *rq)
 {
 	struct staircase_runqueue_queue *rqq = &rq->qu.staircase;
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
 
 	set_tsk_need_resched(p);
 	dequeue_task(p, rqq);
 	p->prio = effective_prio(p);
-	p->sdu.staircase.time_slice = rr_interval(p);
+	sp->time_slice = rr_interval(p);
 	enqueue_task(p, rqq);
 }
 
@@ -635,6 +659,8 @@ static void time_slice_expired(task_t *p
 static void staircase_tick(struct task_struct *p, struct runqueue *rq,
 	unsigned long long now)
 {
+	struct staircase_runqueue_queue *rqq = &rq->qu.staircase;
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
 	int cpu = smp_processor_id();
 	unsigned long debit, expired_balance = rq->nr_running;
 
@@ -661,31 +687,31 @@ static void staircase_tick(struct task_s
 
 	spin_lock(&rq->lock);
 	debit = ns_diff(rq->timestamp_last_tick, p->timestamp);
-	p->sdu.staircase.ns_debit += debit;
-	if (p->sdu.staircase.ns_debit < NSJIFFY)
+	sp->ns_debit += debit;
+	if (sp->ns_debit < NSJIFFY)
 		goto out_unlock;
-	p->sdu.staircase.ns_debit %= NSJIFFY;
+	sp->ns_debit %= NSJIFFY;
 	/*
 	 * Tasks lose bonus each time they use up a full slice().
 	 */
-	if (!--p->sdu.staircase.slice) {
+	if (!--sp->slice) {
 		dec_bonus(p);
-		p->sdu.staircase.slice = slice(p);
+		sp->slice = slice(p);
 		time_slice_expired(p, rq);
-		p->sdu.staircase.totalrun = 0;
+		sp->totalrun = 0;
 		goto out_unlock;
 	}
 	/*
 	 * Tasks that run out of time_slice but still have slice left get
 	 * requeued with a lower priority && RR_INTERVAL time_slice.
 	 */
-	if (!--p->sdu.staircase.time_slice) {
+	if (!--sp->time_slice) {
 		time_slice_expired(p, rq);
 		goto out_unlock;
 	}
-	rq->qu.staircase.cache_ticks++;
-	if (rq->qu.staircase.preempted &&
-		rq->qu.staircase.cache_ticks >= CACHE_DELAY) {
+	rqq->cache_ticks++;
+	if (rqq->preempted &&
+		rqq->cache_ticks >= CACHE_DELAY) {
 		set_tsk_need_resched(p);
 		goto out_unlock;
 	}
@@ -721,6 +747,7 @@ static void staircase_schedule(void)
 	int cpu, idx;
 	struct task_struct *prev = current, *next;
 	struct runqueue *rq = this_rq();
+	struct staircase_runqueue_queue *rqq = &rq->qu.staircase;
 	unsigned long long now = sched_clock();
 	unsigned long debit;
 	struct list_head *queue;
@@ -781,8 +808,8 @@ go_idle:
 			goto go_idle;
 	}
 
-	idx = sched_find_first_bit(rq->qu.staircase.bitmap);
-	queue = rq->qu.staircase.queue + idx;
+	idx = sched_find_first_bit(rqq->bitmap);
+	queue = rqq->queue + idx;
 	next = list_entry(queue->next, task_t, run_list);
 
 switch_tasks:
@@ -799,8 +826,7 @@ switch_tasks:
 
 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
-		rq->qu.staircase.preempted = 0;
-		rq->qu.staircase.cache_ticks = 0;
+		rqq->preempted = rqq->cache_ticks = 0;
 		next->timestamp = now;
 		rq->nr_switches++;
 		rq->curr = next;
@@ -950,14 +976,9 @@ static long staircase_sys_yield(void)
 	current->sdu.staircase.slice = slice(current);
 	current->sdu.staircase.time_slice = rr_interval(current);
 	if (likely(!rt_task(current)))
-		newprio = STAIRCASE_MAX_PRIO - 1;
+		newprio = MIN_USER_PRIO;
 
-	if (newprio != current->prio) {
-		dequeue_task(current, rqq);
-		current->prio = newprio;
-		enqueue_task(current, rqq);
-	} else
-		requeue_task(current, rqq);
+	requeue_task(current, rqq, newprio);
 
 	/*
 	 * Since we are going to call schedule() anyway, there's
@@ -1023,9 +1044,10 @@ static void staircase_migrate_dead_tasks
 {
 	unsigned i;
 	struct runqueue *rq = cpu_rq(dead_cpu);
+	struct staircase_runqueue_queue *rqq = &rq->qu.staircase;
 
 	for (i = 0; i < STAIRCASE_MAX_PRIO; i++) {
-		struct list_head *list = &rq->qu.staircase.queue[i];
+		struct list_head *list = &rqq->queue[i];
 		while (!list_empty(list))
 			migrate_dead(dead_cpu, list_entry(list->next, task_t,
 				run_list));

[-- Attachment #3: plugsched-6.3.1-sched_system_time.patch --]
[-- Type: text/x-diff, Size: 8216 bytes --]

---
 include/linux/sched_drv.h |    1 +
 include/linux/sched_spa.h |    1 +
 kernel/ingo_ll.c          |    5 +++++
 kernel/ingosched.c        |    5 +++++
 kernel/nicksched.c        |    5 +++++
 kernel/sched.c            |    1 +
 kernel/sched_spa.c        |    5 +++++
 kernel/sched_spa_ebs.c    |    1 +
 kernel/sched_spa_svr.c    |    1 +
 kernel/sched_spa_ws.c     |    1 +
 kernel/sched_zaphod.c     |    1 +
 kernel/staircase.c        |    8 ++++++++
 12 files changed, 35 insertions(+)

Index: linux-2.6.16-mm2/include/linux/sched_drv.h
===================================================================
--- linux-2.6.16-mm2.orig/include/linux/sched_drv.h	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/include/linux/sched_drv.h	2006-04-02 15:28:56.000000000 +1000
@@ -32,6 +32,7 @@ struct sched_drv {
 	int (*move_tasks)(runqueue_t *, int, runqueue_t *, unsigned long, unsigned long,
 		 struct sched_domain *, enum idle_type, int *all_pinned);
 #endif
+	void (*sched_system_tick)(task_t *);
 	void (*tick)(struct task_struct*, struct runqueue *, unsigned long long);
 #ifdef CONFIG_SCHED_SMT
 	struct task_struct *(*head_of_queue)(union runqueue_queue *);
Index: linux-2.6.16-mm2/kernel/ingo_ll.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/ingo_ll.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/ingo_ll.c	2006-04-02 15:31:29.000000000 +1000
@@ -752,6 +752,10 @@ out:
 }
 #endif
 
+static void ingo_system_tick(struct task_struct *p)
+{
+}
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -1290,6 +1294,7 @@ const struct sched_drv ingo_ll_sched_drv
 #ifdef CONFIG_SMP
 	.move_tasks = ingo_move_tasks,
 #endif
+	.sched_system_tick = ingo_system_tick,
 	.tick = ingo_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = ingo_head_of_queue,
Index: linux-2.6.16-mm2/kernel/ingosched.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/ingosched.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/ingosched.c	2006-04-02 15:31:40.000000000 +1000
@@ -777,6 +777,10 @@ out:
 }
 #endif
 
+static void ingo_system_tick(struct task_struct *p)
+{
+}
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -1336,6 +1340,7 @@ const struct sched_drv ingo_sched_drv = 
 #ifdef CONFIG_SMP
 	.move_tasks = ingo_move_tasks,
 #endif
+	.sched_system_tick = ingo_system_tick,
 	.tick = ingo_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = ingo_head_of_queue,
Index: linux-2.6.16-mm2/kernel/nicksched.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/nicksched.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/nicksched.c	2006-04-02 15:32:54.000000000 +1000
@@ -642,6 +642,10 @@ out:
 }
 #endif
 
+static void nick_system_tick(struct task_struct *p)
+{
+}
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -1087,6 +1091,7 @@ const struct sched_drv nick_sched_drv = 
 #ifdef CONFIG_SMP
 	.move_tasks = nick_move_tasks,
 #endif
+	.sched_system_tick = nick_system_tick,
 	.tick = nick_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = nick_head_of_queue,
Index: linux-2.6.16-mm2/kernel/sched.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/sched.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/sched.c	2006-04-02 15:29:01.000000000 +1000
@@ -1693,6 +1693,7 @@ void account_system_time(struct task_str
 		cpustat->idle = cputime64_add(cpustat->idle, tmp);
 	/* Account for system time used */
 	acct_update_integrals(p);
+	sched_drvp->sched_system_tick(p);
 }
 
 /*
Index: linux-2.6.16-mm2/kernel/sched_spa.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/sched_spa.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/sched_spa.c	2006-04-02 15:39:37.000000000 +1000
@@ -825,6 +825,10 @@ static void spa_nf_runq_data_tick(unsign
 {
 }
 
+void spa_system_tick(struct task_struct *p)
+{
+}
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -1658,6 +1662,7 @@ const struct sched_drv spa_nf_sched_drv 
 #ifdef CONFIG_SMP
 	.move_tasks = spa_move_tasks,
 #endif
+	.sched_system_tick = spa_system_tick,
 	.tick = spa_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = spa_head_of_queue,
Index: linux-2.6.16-mm2/kernel/staircase.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/staircase.c	2006-04-02 15:11:10.000000000 +1000
+++ linux-2.6.16-mm2/kernel/staircase.c	2006-04-02 15:35:48.000000000 +1000
@@ -652,6 +652,13 @@ static void time_slice_expired(task_t *p
 	enqueue_task(p, rqq);
 }
 
+static void staircase_system_tick(struct task_struct *p)
+{
+	struct staircase_sched_drv_task *sp = &p->sdu.staircase;
+
+	sp->systime++;
+}
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -1115,6 +1122,7 @@ const struct sched_drv staircase_sched_d
 #ifdef CONFIG_SMP
 	.move_tasks = staircase_move_tasks,
 #endif
+	.sched_system_tick = staircase_system_tick,
 	.tick = staircase_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = staircase_head_of_queue,
Index: linux-2.6.16-mm2/kernel/sched_spa_ebs.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/sched_spa_ebs.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/sched_spa_ebs.c	2006-04-02 15:37:52.000000000 +1000
@@ -370,6 +370,7 @@ const struct sched_drv spa_ebs_sched_drv
 #ifdef CONFIG_SMP
 	.move_tasks = spa_move_tasks,
 #endif
+	.sched_system_tick = spa_system_tick,
 	.tick = spa_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = spa_head_of_queue,
Index: linux-2.6.16-mm2/kernel/sched_spa_svr.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/sched_spa_svr.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/sched_spa_svr.c	2006-04-02 15:38:21.000000000 +1000
@@ -170,6 +170,7 @@ const struct sched_drv spa_svr_sched_drv
 #ifdef CONFIG_SMP
 	.move_tasks = spa_move_tasks,
 #endif
+	.sched_system_tick = spa_system_tick,
 	.tick = spa_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = spa_head_of_queue,
Index: linux-2.6.16-mm2/kernel/sched_spa_ws.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/sched_spa_ws.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/sched_spa_ws.c	2006-04-02 15:38:12.000000000 +1000
@@ -317,6 +317,7 @@ const struct sched_drv spa_ws_sched_drv 
 #ifdef CONFIG_SMP
 	.move_tasks = spa_move_tasks,
 #endif
+	.sched_system_tick = spa_system_tick,
 	.tick = spa_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = spa_head_of_queue,
Index: linux-2.6.16-mm2/kernel/sched_zaphod.c
===================================================================
--- linux-2.6.16-mm2.orig/kernel/sched_zaphod.c	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/kernel/sched_zaphod.c	2006-04-02 15:38:02.000000000 +1000
@@ -607,6 +607,7 @@ const struct sched_drv zaphod_sched_drv 
 #ifdef CONFIG_SMP
 	.move_tasks = spa_move_tasks,
 #endif
+	.sched_system_tick = spa_system_tick,
 	.tick = spa_tick,
 #ifdef CONFIG_SCHED_SMT
 	.head_of_queue = spa_head_of_queue,
Index: linux-2.6.16-mm2/include/linux/sched_spa.h
===================================================================
--- linux-2.6.16-mm2.orig/include/linux/sched_spa.h	2006-04-02 14:20:34.000000000 +1000
+++ linux-2.6.16-mm2/include/linux/sched_spa.h	2006-04-02 15:40:47.000000000 +1000
@@ -111,6 +111,7 @@ void spa_wake_up_task(struct task_struct
 void spa_fork(task_t *);
 void spa_wake_up_new_task(task_t *, unsigned long);
 void spa_exit(task_t *);
+void spa_system_tick(struct task_struct *);
 void spa_tick(struct task_struct *, struct runqueue *, unsigned long long);
 void spa_schedule(void);
 void spa_set_normal_task_nice(task_t *, long);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
@ 2006-04-03 11:59 Al Boldi
  2006-04-03 12:13 ` Paolo Ornati
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Al Boldi @ 2006-04-03 11:59 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

Peter Williams wrote:
> Peter Williams wrote:
> > Peter Williams wrote:
>
> Now available for 2.6.16 at:

Thanks a lot!

> >> You can select a default scheduler at kernel build time.  If you wish
> >> to boot with a scheduler other than the default it can be selected at
> >> boot time by adding:
> >>
> >> cpusched=<scheduler>

Can this be made runtime selectable/loadable, akin to iosched?

> >> Control parameters for the scheduler can be read/set via files in:
> >>
> >> /sys/cpusched/<scheduler>/

The default values for spa make it really easy to lock up the system.
Is there a module to autotune these values according to cpu/mem/ctxt 
performance?

Also, different schedulers per cpu could be rather useful.

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-03 11:59 [ANNOUNCE][RFC] PlugSched-6.3.1 for 2.6.16-rc5 Al Boldi
@ 2006-04-03 12:13 ` Paolo Ornati
  2006-04-03 23:04 ` Peter Williams
  2006-04-03 23:27 ` Peter Williams
  2 siblings, 0 replies; 28+ messages in thread
From: Paolo Ornati @ 2006-04-03 12:13 UTC (permalink / raw)
  To: Al Boldi; +Cc: Peter Williams, linux-kernel

On Mon, 3 Apr 2006 14:59:51 +0300
Al Boldi <a1426z@gawab.com> wrote:

> > >> You can select a default scheduler at kernel build time.  If you wish
> > >> to boot with a scheduler other than the default it can be selected at
> > >> boot time by adding:
> > >>
> > >> cpusched=<scheduler>
> 
> Can this be made runtime selectable/loadable, akin to iosched?

There is a project to do so:

http://groups.google.com/group/fa.linux.kernel/msg/d555ecee596690d1

-- 
	Paolo Ornati
	Linux 2.6.16-mm2 on x86_64


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-03 11:59 [ANNOUNCE][RFC] PlugSched-6.3.1 for 2.6.16-rc5 Al Boldi
  2006-04-03 12:13 ` Paolo Ornati
@ 2006-04-03 23:04 ` Peter Williams
  2006-04-03 23:29   ` Con Kolivas
  2006-04-04 13:27   ` Al Boldi
  2006-04-03 23:27 ` Peter Williams
  2 siblings, 2 replies; 28+ messages in thread
From: Peter Williams @ 2006-04-03 23:04 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Al Boldi wrote:
> Peter Williams wrote:
>> Peter Williams wrote:
>>> Peter Williams wrote:
>> Now available for 2.6.16 at:
> 
> Thanks a lot!
> 
>>>> You can select a default scheduler at kernel build time.  If you wish
>>>> to boot with a scheduler other than the default it can be selected at
>>>> boot time by adding:
>>>>
>>>> cpusched=<scheduler>
> 
> Can this be made runtime selectable/loadable, akin to iosched?

See <https://sourceforge.net/projects/dynsched>.  It's an extension to 
PlugSched that allows schedulers to be changed at run time.

> 
>>>> Control parameters for the scheduler can be read/set via files in:
>>>>
>>>> /sys/cpusched/<scheduler>/
> 
> The default values for spa make it really easy to lock up the system.

Which one of the SPA schedulers and under what conditions?  I've been 
mucking around with these and may have broken something.  If so I'd like 
to fix it.

> Is there a module to autotune these values according to cpu/mem/ctxt 
> performance?
> 
> Also, different schedulers per cpu could be rather useful.

I think that would be dangerous.  However, different schedulers per 
cpuset might make sense but it would involve a fair bit of work.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-03 11:59 [ANNOUNCE][RFC] PlugSched-6.3.1 for 2.6.16-rc5 Al Boldi
  2006-04-03 12:13 ` Paolo Ornati
  2006-04-03 23:04 ` Peter Williams
@ 2006-04-03 23:27 ` Peter Williams
  2006-04-04 13:27   ` Al Boldi
  2 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-03 23:27 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel, Jake Moilanen

Al Boldi wrote:
> 
> The default values for spa make it really easy to lock up the system.
> Is there a module to autotune these values according to cpu/mem/ctxt 
> performance?

Jake Moilanen had a genetic algorithm autotuner for Zaphod at one time 
which I believe he ported over to PlugSched (as he said he was going 
to).  However, some time ago I removed the per run queue statistics 
that he used, and he hasn't complained, so I've assumed that he's given 
up on it.  I've CC'd him on this mail so he may tell you what's happening.

I could generate a patch to gather the statistics again and make them 
available via /proc if you would like to try a user space version of 
Jake's work (his was in kernel).

Peter
PS The reason that I removed the stats gathering is that they add 
overhead, which is undesirable if they're not being used, so if I put 
them back it would probably be as a build-time configurable option.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-03 23:04 ` Peter Williams
@ 2006-04-03 23:29   ` Con Kolivas
  2006-04-04  0:01     ` Peter Williams
  2006-04-04 13:27   ` Al Boldi
  1 sibling, 1 reply; 28+ messages in thread
From: Con Kolivas @ 2006-04-03 23:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Williams, Al Boldi

On Tuesday 04 April 2006 09:04, Peter Williams wrote:
> Al Boldi wrote:

> > Is there a module to autotune these values according to cpu/mem/ctxt
> > performance?

I think you're thinking of Jake's genetic algorithms (separate patch). 
They tune the zaphod scheduler, but bear in mind the limitation of such 
algorithms: they can only tune for one workload, which means that if you 
have two workloads running concurrently with different requirements, one 
of them will suffer.

> > Also, different schedulers per cpu could be rather useful.
> > Peter Williams wrote:
> I think that would be dangerous.  However, different schedulers per
> cpuset might make sense but it involve a fair bit of work.

I'm curious. How do you think different schedulers per cpu would be useful?

Cheers,
Con


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-03 23:29   ` Con Kolivas
@ 2006-04-04  0:01     ` Peter Williams
  2006-04-04  0:12       ` Con Kolivas
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-04  0:01 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, Al Boldi

Con Kolivas wrote:
> On Tuesday 04 April 2006 09:04, Peter Williams wrote:
>> Al Boldi wrote:
> 
>>> Is there a module to autotune these values according to cpu/mem/ctxt
>>> performance?
> 
> I think you're thinking of Jake's genetic algorithms (separate patch). They 
> tune the zaphod scheduler but bear in mind the limitation of such an 
> algorithm is they can only tune for one workload which means that if you have 
> two workloads running concurrently with different requirements, the other 
> will suffer.
> 
>>> Also, different schedulers per cpu could be rather useful.
>>> Peter Williams wrote:
>> I think that would be dangerous.  However, different schedulers per
>> cpuset might make sense but it involve a fair bit of work.
> 
> I'm curious. How do you think different schedulers per cpu would be useful?

I don't, but I think they MIGHT make sense for cpusets, e.g. one set 
with a scheduler targeted at interactive tasks and another targeted at 
server tasks.  NB the emphasis on might.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-04  0:01     ` Peter Williams
@ 2006-04-04  0:12       ` Con Kolivas
  2006-04-04  1:29         ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Con Kolivas @ 2006-04-04  0:12 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel, Al Boldi

On Tuesday 04 April 2006 10:01, Peter Williams wrote:
> Con Kolivas wrote:
> > On Tuesday 04 April 2006 09:04, Peter Williams wrote:
> >> Al Boldi wrote:
> >>> Also, different schedulers per cpu could be rather useful.
> >>> Peter Williams wrote:
> >>
> >> I think that would be dangerous.  However, different schedulers per
> >> cpuset might make sense but it involve a fair bit of work.
> >
> > I'm curious. How do you think different schedulers per cpu would be
> > useful?
>
> I don't but I think they MIGHT make sense for cpusets e.g. one set with
> a scheduler targeted at interactive tasks and another targeted at server
> tasks.  NB the emphasis on might.

I am curious as to Al's answer since he asked for the feature.  It 
would be easy for me to modify the staircase cpu scheduler to allow the 
interactive and compute modes to be set on a per-cpu basis if that was 
desired.  For that to be helpful, of course, you'd have to manually set 
affinity for the tasks or logins you wanted to run on each cpu.

Cheers,
Con


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-04  0:12       ` Con Kolivas
@ 2006-04-04  1:29         ` Peter Williams
  2006-04-04 13:27           ` Al Boldi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-04  1:29 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, Al Boldi

Con Kolivas wrote:
> On Tuesday 04 April 2006 10:01, Peter Williams wrote:
>> Con Kolivas wrote:
>>> On Tuesday 04 April 2006 09:04, Peter Williams wrote:
>>>> Al Boldi wrote:
>>>>> Also, different schedulers per cpu could be rather useful.
>>>>> Peter Williams wrote:
>>>> I think that would be dangerous.  However, different schedulers per
>>>> cpuset might make sense but it involve a fair bit of work.
>>> I'm curious. How do you think different schedulers per cpu would be
>>> useful?
>> I don't but I think they MIGHT make sense for cpusets e.g. one set with
>> a scheduler targeted at interactive tasks and another targeted at server
>> tasks.  NB the emphasis on might.
> 
> I am curious as to Al's answer since he asked for the feature.

OK.

> It would be 
> easy for me to modify the staircase cpu scheduler to allow the interactive 
> and compute modes be set on a per-cpu basis if that was desired.  For that to
> be helpful of course you'd have to manually set affinity for the tasks or 
> logins you wanted to run on each cpu(s).

Yes, I agree that it would not be a good idea for CPUs that are sharing 
(via load balancing) the same set of tasks to have different schedulers 
or policies, which is why I suggested only doing it at the cpuset level.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-04  1:29         ` Peter Williams
@ 2006-04-04 13:27           ` Al Boldi
  0 siblings, 0 replies; 28+ messages in thread
From: Al Boldi @ 2006-04-04 13:27 UTC (permalink / raw)
  To: Con Kolivas; +Cc: linux-kernel, Peter Williams

Peter Williams wrote:
> Con Kolivas wrote:
> >>>> Al Boldi wrote:
> >>>>> Also, different schedulers per cpu could be rather useful.
> >>>>
> >>>> I think that would be dangerous.  However, different schedulers per
> >>>> cpuset might make sense but it involve a fair bit of work.
> >>>
> >>> I'm curious. How do you think different schedulers per cpu would be
> >>> useful?
> >>
> >> I don't but I think they MIGHT make sense for cpusets e.g. one set with
> >> a scheduler targeted at interactive tasks and another targeted at
> >> server tasks.  NB the emphasis on might.

Exactly.

> > I am curious as to Al's answer since he asked for the feature.

Can you imagine how neat it would be to set timeslice per cpuset/workload?

> > It would be
> > easy for me to modify the staircase cpu scheduler to allow the
> > interactive and compute modes be set on a per-cpu basis if that was
> > desired.  For that to be helpful of course you'd have to manually set
> > affinity for the tasks or logins you wanted to run on each cpu(s).

Your staircase scheduler is great, and adding this feature would make it 
unique.

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-03 23:27 ` Peter Williams
@ 2006-04-04 13:27   ` Al Boldi
  2006-04-04 23:20     ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Al Boldi @ 2006-04-04 13:27 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel, Jake Moilanen

Peter Williams wrote:
> Al Boldi wrote:
> > The default values for spa make it really easy to lock up the system.
> > Is there a module to autotune these values according to cpu/mem/ctxt
> > performance?
>
> Jake Moilanen had a genetic algorithm autotuner for Zaphod at one time
> which I believe he ported over to PlugSched

Would this be a load-adaptive dynamic tuner?

What I meant was a lock-preventive static tuner.  Something that would take 
hw-latencies into account at boot and set values for non-locking console 
operation.

> I could generate a patch to gather the statistic again and make them
> available via /proc if you would like to try a user space version of
> Jake's work (his was in kernel).

That would be great!

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-03 23:04 ` Peter Williams
  2006-04-03 23:29   ` Con Kolivas
@ 2006-04-04 13:27   ` Al Boldi
  2006-04-04 23:17     ` Peter Williams
  1 sibling, 1 reply; 28+ messages in thread
From: Al Boldi @ 2006-04-04 13:27 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

Peter Williams wrote:
> Al Boldi wrote:
> >>>> Control parameters for the scheduler can be read/set via files in:
> >>>>
> >>>> /sys/cpusched/<scheduler>/
> >
> > The default values for spa make it really easy to lock up the system.
>
> Which one of the SPA schedulers and under what conditions?  I've been
> mucking around with these and may have broken something.  If so I'd like
> to fix it.

spa_no_frills, with a malloc-hog whose run time is less than the 
timeslice.  Setting promotion_floor to max unlocks the console.

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-04 13:27   ` Al Boldi
@ 2006-04-04 23:17     ` Peter Williams
  2006-04-05  8:16       ` Al Boldi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-04 23:17 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Al Boldi wrote:
> Peter Williams wrote:
>> Al Boldi wrote:
>>>>>> Control parameters for the scheduler can be read/set via files in:
>>>>>>
>>>>>> /sys/cpusched/<scheduler>/
>>> The default values for spa make it really easy to lock up the system.
>> Which one of the SPA schedulers and under what conditions?  I've been
>> mucking around with these and may have broken something.  If so I'd like
>> to fix it.
> 
> spa_no_frills, with a malloc-hog less than timeslice.  Setting 
> promotion_floor to max unlocks the console.

OK, you could also try increasing the promotion interval.

It should be noted that spa_no_frills isn't really expected to behave 
very well as it's a pure round robin scheduler.  Its intended purpose 
is as a basis for more sophisticated schedulers.  I've been thinking 
about removing it as a bootable scheduler and only making its children 
available, but I find it useful to compare benchmark and other test 
results from it with those from the other schedulers to get an idea of 
the extra costs involved.

Similarly, zaphod is really just a vehicle for trying different ideas, 
and spa_ws, spa_svr and spa_ebs are the ones intended for use on real 
systems.  Of these, spa_svr isn't very good for interactive systems as 
it is designed to maximize throughput on a server (it actually beats 
spa_no_frills by about 1% on kernbench), which isn't always compatible 
with good interactive response.

Thanks,
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-04 13:27   ` Al Boldi
@ 2006-04-04 23:20     ` Peter Williams
  2006-04-05  8:16       ` Al Boldi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-04 23:20 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel, Jake Moilanen

Al Boldi wrote:
> Peter Williams wrote:
>> Al Boldi wrote:
>>> The default values for spa make it really easy to lock up the system.
>>> Is there a module to autotune these values according to cpu/mem/ctxt
>>> performance?
>> Jake Moilanen had a genetic algorithm autotuner for Zaphod at one time
>> which I believe he ported over to PlugSched
> 
> Would this be a load-adaptive dynamic tuner?

Yes.

> 
> What I meant was a lock-preventive static tuner.  Something that would take 
> hw-latencies into account at boot and set values for non-locking console 
> operation.

I'm not sure what you mean here.  Can you elaborate?

> 
>> I could generate a patch to gather the statistic again and make them
>> available via /proc if you would like to try a user space version of
>> Jake's work (his was in kernel).
> 
> That would be great!

OK, I'll put it on my "to do" list.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-04 23:17     ` Peter Williams
@ 2006-04-05  8:16       ` Al Boldi
  2006-04-05 22:53         ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Al Boldi @ 2006-04-05  8:16 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

Peter Williams wrote:
> Al Boldi wrote:
> > Peter Williams wrote:
> >> Al Boldi wrote:
> >>>>>> Control parameters for the scheduler can be read/set via files in:
> >>>>>>
> >>>>>> /sys/cpusched/<scheduler>/
> >>>
> >>> The default values for spa make it really easy to lock up the system.
> >>
> >> Which one of the SPA schedulers and under what conditions?  I've been
> >> mucking around with these and may have broken something.  If so I'd
> >> like to fix it.
> >
> > spa_no_frills, with a malloc-hog less than timeslice.  Setting
> > promotion_floor to max unlocks the console.
>
> OK, you could also try increasing the promotion interval.

Seems that this will only delay the lockup in spa_svr but not inhibit it.

> It should be noted that spa_no_frills isn't really expected to behave
> very well as it's a pure round robin scheduler.

It's a bare-bones scheduler that lets the admin prioritize procs as 
desired, instead of leaving the priority management to the scheduler, 
which may be undesirable for some but not all.

> It's intended purpose is as a basis for more sophisticated schedulers.

And that's why the same problem exists in the child scheds, i.e. spa_ws, 
spa_svr, zaphod, but not spa_ebs.

> I've been thinking
> about removing it as a bootable scheduler and only making its children
> available but I find it useful to compare benchmark and other test
> results from it with that from the other schedulers to get an idea of
> the extra costs involved.

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-04 23:20     ` Peter Williams
@ 2006-04-05  8:16       ` Al Boldi
  0 siblings, 0 replies; 28+ messages in thread
From: Al Boldi @ 2006-04-05  8:16 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel, Jake Moilanen

Peter Williams wrote:
> Al Boldi wrote:
> > Peter Williams wrote:
> >> Al Boldi wrote:
> >>> The default values for spa make it really easy to lock up the system.
> >>> Is there a module to autotune these values according to cpu/mem/ctxt
> >>> performance?
> >>
> >> Jake Moilanen had a genetic algorithm autotuner for Zaphod at one time
> >> which I believe he ported over to PlugSched
> >
> > Would this be a load-adaptive dynamic tuner?
>
> Yes.

Wow!

> > What I meant was a lock-preventive static tuner.  Something that would
> > take hw-latencies into account at boot and set values for non-locking
> > console operation.
>
> I'm not sure what you mean here.  Can you elaborate?

In another thread Al Boldi wrote:
> After playing w/ these tunables it occurred to me that they are really
> only deadline limits, w/ a direct relation to cpu/mem/ctxt perf.
>
> i.e timeslice=1 on i386sx means something other than timeslice=1 on amd64
>
> It follows that w/o autotuning, the static default values have to be
> selected to allow for a large underlying perf range w/ a preference for
> the high range.  This is also the reason why 2.6 feels really crummy on
> low perf ranges.
>
> Autotuning the default values would allow to tighten this range specific
> to the hw used, thus allowing for a smoother desktop experience.

> >> I could generate a patch to gather the statistic again and make them
> >> available via /proc if you would like to try a user space version of
> >> Jake's work (his was in kernel).
> >
> > That would be great!
>
> OK, I'll put it on my "to do" list.

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-05  8:16       ` Al Boldi
@ 2006-04-05 22:53         ` Peter Williams
  2006-04-07 21:32           ` Al Boldi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-05 22:53 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Al Boldi wrote:
> Peter Williams wrote:
>> Al Boldi wrote:
>>> Peter Williams wrote:
>>>> Al Boldi wrote:
>>>>>>>> Control parameters for the scheduler can be read/set via files in:
>>>>>>>>
>>>>>>>> /sys/cpusched/<scheduler>/
>>>>> The default values for spa make it really easy to lock up the system.
>>>> Which one of the SPA schedulers and under what conditions?  I've been
>>>> mucking around with these and may have broken something.  If so I'd
>>>> like to fix it.
>>> spa_no_frills, with a malloc-hog less than timeslice.  Setting
>>> promotion_floor to max unlocks the console.
>> OK, you could also try increasing the promotion interval.
> 
> Seems that this will only delay the lock in spa_svr but not inhibit it.

OK. But turning the promotion mechanism off completely (which is what 
setting the floor to the maximum does) runs the risk of a runaway high 
priority task locking the whole system up.  IMHO the only SPA scheduler 
where it's safe for the promotion floor to be greater than MAX_RT_PRIO 
is spa_ebs.  So a better solution is highly desirable.

I'd like to fix this problem but don't fully understand what it is. 
What do you mean by a malloc-hog?  Would it be possible for you to give 
me an example of how to reproduce the problem?

> 
>> It should be noted that spa_no_frills isn't really expected to behave
>> very well as it's a pure round robin scheduler.
> 
> It's a bare bone scheduler that allows to prioritize procs to the admins 
> desire, instead of leaving the priority management to the scheduler, which 
> may be undesirable for some but not all.

OK.  But it's important to realize that "nice" does not (in general) 
control the "amount of CPU" that tasks get with this scheduler.  It 
merely controls the order in which runnable tasks run, i.e. it's a 
"priority based" interpretation of "nice", not an "entitlement based" 
interpretation.  The difference is important.

If you want a "bare bones" scheduler with an "entitlement based" 
interpretation of "nice" use spa_ebs with the maximum bonuses set to 
zero.  In that state spa_ebs is safe for promotion to be completely 
disabled unless you also want to use cpu caps (in which case you'd need 
to set the floor to about 137 to make sure that capped tasks don't get 
completely starved).

> 
>> It's intended purpose is as a basis for more sophisticated schedulers.
> 
> And that's why the same problem exists in the child scheds, i.e. spa_ws, 
> spa_svr, zaphod, but not spa_ebs.

OK.

> 
>> I've been thinking
>> about removing it as a bootable scheduler and only making its children
>> available but I find it useful to compare benchmark and other test
>> results from it with that from the other schedulers to get an idea of
>> the extra costs involved.

Thanks,
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-05 22:53         ` Peter Williams
@ 2006-04-07 21:32           ` Al Boldi
  2006-04-08  1:29             ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Al Boldi @ 2006-04-07 21:32 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1808 bytes --]

Peter Williams wrote:
> Al Boldi wrote:
> > Peter Williams wrote:
> >> Al Boldi wrote:
> >>> Peter Williams wrote:
> >>>> Al Boldi wrote:
> >>>>>>>> Control parameters for the scheduler can be read/set via files
> >>>>>>>> in:
> >>>>>>>>
> >>>>>>>> /sys/cpusched/<scheduler>/
> >>>>>
> >>>>> The default values for spa make it really easy to lock up the
> >>>>> system.
> >>>>
> >>>> Which one of the SPA schedulers and under what conditions?  I've been
> >>>> mucking around with these and may have broken something.  If so I'd
> >>>> like to fix it.
> >>>
> >>> spa_no_frills, with a malloc-hog less than timeslice.  Setting
> >>> promotion_floor to max unlocks the console.
> >>
> >> OK, you could also try increasing the promotion interval.
> >
> > Seems that this will only delay the lock in spa_svr but not inhibit it.
>
> OK. But turning the promotion mechanism off completely (which is what
> setting the floor to the maximum) runs the risk of a runaway high
> priority task locking the whole system up.  IMHO the only SPA scheduler
> where it's safe for the promotion floor to be greater than MAX_RT_PRIO
> is spa_ebs.  So a better solution is highly desirable.

Yes.

> I'd like to fix this problem but don't fully understand what it is.
> What do you mean by a malloc-hog?  Would it possible for you to give me
> an example of how to reproduce the problem?

Can you try the attached mem-eater, passing it the number of KB to be eaten?

	i.e. '# while :; do ./eatm 9999 ; done' 

This will print the number of bytes eaten and the timing in ms.

Adjust the number of KB to be eaten such that the timing is less than 
the timeslice (120ms by default for spa).  Switch to another vt and 
start pressing enter.  A console lockup should follow within seconds for 
all SPA schedulers except spa_ebs.

Thanks!

--
Al



[-- Attachment #2: eatm.c --]
[-- Type: text/x-csrc, Size: 810 bytes --]

#include <stdio.h>
#include <stdlib.h>	/* malloc(), atol() */
#include <sys/time.h>

/* Millisecond stopwatch: call with 1 to start, 0 to read. */
unsigned long elapsed(int start) {

	static struct timeval s,e;

	if (start) return gettimeofday(&s, NULL);

	gettimeofday(&e, NULL);

	return ((e.tv_sec - s.tv_sec) * 1000 + (e.tv_usec - s.tv_usec) / 1000);

}

int main(int argc, char **argv) {

    unsigned long int i,j,max;
    unsigned char *p;

    if (argc>1)
	max=atol(argv[1]);
    else
	max=0x60000;


    elapsed(1); 

    /* Eat whole megabytes first, writing a byte per KB so the
     * pages are actually faulted in. */
    for (i=0;((i<max/1024) && (p = (unsigned char *)malloc(1024*1024)));i++) {
        for (j=0;j<1024;p[1024*j++]=0);
	fprintf(stderr,"\r%lu MB ",i+1);
    }

    /* Then eat the remainder in 1 KB chunks. */
    for (j=max-(i*=1024);((i<max) && (p = (unsigned char *)malloc(1024)));i++) {
	*p = 0;
    }
    fprintf(stderr,"%lu KB ",j-(max-i));

    fprintf(stderr,"eaten in %lu msec (%lu MB/s)\n",elapsed(0),i/(elapsed(0)?:1)*1000/1024);

    return 0;
}


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-07 21:32           ` Al Boldi
@ 2006-04-08  1:29             ` Peter Williams
  2006-04-08 20:31               ` Al Boldi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-08  1:29 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Al Boldi wrote:
> Peter Williams wrote:
>> Al Boldi wrote:
>>> Peter Williams wrote:
>>>> Al Boldi wrote:
>>>>> Peter Williams wrote:
>>>>>> Al Boldi wrote:
>>>>>>>>>> Control parameters for the scheduler can be read/set via files
>>>>>>>>>> in:
>>>>>>>>>>
>>>>>>>>>> /sys/cpusched/<scheduler>/
>>>>>>> The default values for spa make it really easy to lock up the
>>>>>>> system.
>>>>>> Which one of the SPA schedulers and under what conditions?  I've been
>>>>>> mucking around with these and may have broken something.  If so I'd
>>>>>> like to fix it.
>>>>> spa_no_frills, with a malloc-hog less than timeslice.  Setting
>>>>> promotion_floor to max unlocks the console.
>>>> OK, you could also try increasing the promotion interval.
>>> Seems that this will only delay the lock in spa_svr but not inhibit it.
>> OK. But turning the promotion mechanism off completely (which is what
>> setting the floor to the maximum) runs the risk of a runaway high
>> priority task locking the whole system up.  IMHO the only SPA scheduler
>> where it's safe for the promotion floor to be greater than MAX_RT_PRIO
>> is spa_ebs.  So a better solution is highly desirable.
> 
> Yes.
> 
>> I'd like to fix this problem but don't fully understand what it is.
>> What do you mean by a malloc-hog?  Would it possible for you to give me
>> an example of how to reproduce the problem?
> 
> Can you try the attached mem-eater passing it the number of kb to be eaten.
> 
> 	i.e. '# while :; do ./eatm 9999 ; done' 
> 
> This will print the number of bytes eaten and the timing in ms.
> 
> Adjust the number of kb to be eaten such that the timing will be less than 
> timeslice (120ms by default for spa).  Switch to another vt and start 
> pressing enter.  A console lockup should follow within seconds for all spas 
> except ebs.

This doesn't seem to present a problem (other than the eatm loop being 
hard to kill with control-C) on my system using spa_ws with standard 
settings.  I tried both UP and SMP.  I may be doing something wrong or 
perhaps don't understand what you mean by a console lockup.  When you 
say "less than the timeslice", how much smaller do you mean?

Peter
PS I even managed to do a kernel build with the eatm loop running on a 
single processor system.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-08  1:29             ` Peter Williams
@ 2006-04-08 20:31               ` Al Boldi
  2006-04-09  2:58                 ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Al Boldi @ 2006-04-08 20:31 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

Peter Williams wrote:
> Al Boldi wrote:
> > Can you try the attached mem-eater passing it the number of kb to be
> > eaten.
> >
> > 	i.e. '# while :; do ./eatm 9999 ; done'
> >
> > This will print the number of bytes eaten and the timing in ms.
> >
> > Adjust the number of kb to be eaten such that the timing will be less
> > than timeslice (120ms by default for spa).  Switch to another vt and
> > start pressing enter.  A console lockup should follow within seconds for
> > all spas except ebs.
>
> This doesn't seem to present a problem (other than the eatme loop being
> hard to kill with control-C) on my system using spa_ws with standard
> settings.  I tried both UP and SMP.  I may be doing something wrong or
> perhaps don't understand what you mean by a console lock up.

Switching from one vt to another receives hardly any response.

This is especially visible in spa_no_frills, and spa_ws recovers from this 
lockup somewhat and starts exhibiting this problem as a choking behavior.

Running '# top d.1 (then shift T)' on another vt shows this choking behavior 
as the proc gets boosted.

> When you say "less than the timeslice" how much smaller do you mean?

This depends on your machine's performance.  On my 400MhzP2 UP 128MB, w/ 
spa_no_frills default settings, looping eatm 9999 takes 63ms per eat and 
causes the rest of the system to be starved.  Raising kb to 19999 takes 
126ms which is greater than the default 120ms timeslice and causes no system 
starvation.

What numbers do you get?

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-08 20:31               ` Al Boldi
@ 2006-04-09  2:58                 ` Peter Williams
  2006-04-09  5:04                   ` Al Boldi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-09  2:58 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Al Boldi wrote:
> Peter Williams wrote:
>> Al Boldi wrote:
>>> Can you try the attached mem-eater passing it the number of kb to be
>>> eaten.
>>>
>>> 	i.e. '# while :; do ./eatm 9999 ; done'
>>>
>>> This will print the number of bytes eaten and the timing in ms.
>>>
>>> Adjust the number of kb to be eaten such that the timing will be less
>>> than timeslice (120ms by default for spa).  Switch to another vt and
>>> start pressing enter.  A console lockup should follow within seconds for
>>> all spas except ebs.
>> This doesn't seem to present a problem (other than the eatme loop being
>> hard to kill with control-C) on my system using spa_ws with standard
>> settings.  I tried both UP and SMP.  I may be doing something wrong or
>> perhaps don't understand what you mean by a console lock up.
> 
> Switching from one vt to another receives hardly any response.

Aah.  Virtual terminals.  I was using Gnome terminals under X.

> 
> This is especially visible in spa_no_frills, and spa_ws recovers from this 
> lockup somewhat and starts exhibiting this problem as a choking behavior.
> 
> Running '# top d.1 (then shift T)' on another vt shows this choking behavior 
> as the proc gets boosted.
> 
>> When you say "less than the timeslice" how much smaller do you mean?
> 
> This depends on your machine's performance.  On my 400MhzP2 UP 128MB, w/ 
> spa_no_frills default settings, looping eatm 9999 takes 63ms per eat and 
> causes the rest of the system to be starved.  Raising kb to 19999 takes 
> 126ms which is greater than the default 120ms timeslice and causes no system 
> starvation.
> 
> What numbers do you get?

For 9999 I get 20ms.  I have 1GB of memory and no swapping is taking 
place but with only 128MB it's possible that your system is swapping and 
that could make the effect more pronounced.

But anyway, based on the evidence, I think the problem is caused by the 
fact that the eatm tasks are running to completion in less than one time 
slice without sleeping and this means that they never have their 
priorities reassessed.  The reason that spa_ebs doesn't demonstrate the 
problem is that it uses a smaller time slice for the first time slice 
that a task gets. The reason that it does this is that it gives newly 
forked processes a fairly high priority and if they're left to run for a 
full 120 msecs at that high priority they can hose the system.  Having a 
shorter first time slice gives the scheduler a chance to reassess the 
task's priority before it does much damage.
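
[The shorter-first-slice idea can be sketched as follows; the 20ms probationary value and the function name are assumptions for illustration, not PlugSched's actual code:]

```python
DEFAULT_TIMESLICE_MS = 120  # the spa default mentioned in this thread
FIRST_TIMESLICE_MS = 20     # assumed: a much shorter probationary first slice

def timeslice_for(slices_used):
    # A freshly forked task gets only a short first slice, so the scheduler
    # can reassess its (optimistically high) priority before the task can
    # hog the CPU; after that it receives the normal slice.
    return FIRST_TIMESLICE_MS if slices_used == 0 else DEFAULT_TIMESLICE_MS
```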

The reason that the other schedulers don't have this strategy is that I 
didn't think that it was necessary.  Obviously I was wrong and should 
extend it to the other schedulers.  It's doubtful whether this will help 
a great deal with spa_no_frills as it is pure round robin and doesn't 
reassess priorities except when a task's nice value changes or the task 
changes policy.  This is one good reason not to use spa_no_frills on 
production systems.  Perhaps you should consider creating a child 
scheduler on top of it that meets your needs?

Anyway, an alternative (and safer) way to reduce the effects of this 
problem (while you're waiting for me to do the above change) is to reduce 
the size of the time slice.  The only bad effect of doing this is that 
you'll do slightly worse (less than 1%) on kernbench.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-09  2:58                 ` Peter Williams
@ 2006-04-09  5:04                   ` Al Boldi
  2006-04-09 23:53                     ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Al Boldi @ 2006-04-09  5:04 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

Peter Williams wrote:
> Al Boldi wrote:
> > This is especially visible in spa_no_frills, and spa_ws recovers from
> > this lockup somewhat and starts exhibiting this problem as a choking
> > behavior.
> >
> > Running '# top d.1 (then shift T)' on another vt shows this choking
> > behavior as the proc gets boosted.
>
> But anyway, based on the evidence, I think the problem is caused by the
> fact that the eatm tasks are running to completion in less than one time
> slice without sleeping and this means that they never have their
> priorities reassessed. 

Yes.

> The reason that spa_ebs doesn't demonstrate the
> problem is that it uses a smaller time slice for the first time slice
> that a task gets. The reason that it does this is that it gives newly
> forked processes a fairly high priority and if they're left to run for a
> full 120 msecs at that high priority they can hose the system.  Having a
> shorter first time slice gives the scheduler a chance to reassess the
> task's priority before it does much damage.

But how does this explain spa_no_frills setting promotion to max not having 
this problem?

> The reason that the other schedulers don't have this strategy is that I
> didn't think that it was necessary.  Obviously I was wrong and should
> extend it to the other schedulers.  It's doubtful whether this will help
> a great deal with spa_no_frills as it is pure round robin and doesn't
> reassess priorities except when a task's nice value changes or the task
> changes policy.

Would it hurt to add it to spa_no_frills and let the children inherit it?

> This is one good reason not to use spa_no_frills on
> production systems.

spa_ebs is great, but rather bursty.  Even setting max_ia_bonus=0 doesn't fix 
that.  Is there a way to smooth it like spa_no_frills?

> Perhaps you should consider creating a child
> scheduler on top of it that meets your needs?

Perhaps.

> Anyway, an alternative (and safer) way to reduce the effects of this
> problem (while you're waiting for me to do the above change) is to reduce
> the size of the time slice.  The only bad effect of doing this is that
> you'll do slightly worse (less than 1%) on kernbench.

Actually, setting timeslice to 5,50,100 gives me better performance on 
kernbench.  After closer inspection, I found 120ms a rather awkward 
timeslice whereas 5,50, and 100 exhibited a smoother and faster response, 
which may be machine dependent, thus the need for an autotuner.

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-09  5:04                   ` Al Boldi
@ 2006-04-09 23:53                     ` Peter Williams
  2006-04-10 14:43                       ` Al Boldi
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Williams @ 2006-04-09 23:53 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Al Boldi wrote:
> Peter Williams wrote:
>> Al Boldi wrote:
>>> This is especially visible in spa_no_frills, and spa_ws recovers from
>>> this lockup somewhat and starts exhibiting this problem as a choking
>>> behavior.
>>>
>>> Running '# top d.1 (then shift T)' on another vt shows this choking
>>> behavior as the proc gets boosted.
>> But anyway, based on the evidence, I think the problem is caused by the
>> fact that the eatm tasks are running to completion in less than one time
>> slice without sleeping and this means that they never have their
>> priorities reassessed. 
> 
> Yes.
> 
>> The reason that spa_ebs doesn't demonstrate the
>> problem is that it uses a smaller time slice for the first time slice
>> that a task gets. The reason that it does this is that it gives newly
>> forked processes a fairly high priority and if they're left to run for a
>> full 120 msecs at that high priority they can hose the system.  Having a
>> shorter first time slice gives the scheduler a chance to reassess the
>> task's priority before it does much damage.
> 
> But how does this explain spa_no_frills setting promotion to max not having 
> this problem?

I'm still puzzled by this.  The only thing I can think of is that the 
promotion mechanism is too simple in that it just moves all promotable 
tasks up one slot without regard for how long they've been on the queue. 
  Doing this was a deliberate decision based on the desire to minimize 
overhead and the belief that it wouldn't matter in the grand scheme of 
things.  I may do some experimenting with slightly more sophisticated 
version.

Properly done, promotion should hardly ever occur but the cost would be 
slightly more complex enqueue/dequeue operations.  The current version 
will do unnecessary promotions but it was felt this was more than 
compensated for by the lower enqueue/dequeue costs.  We'll see how a 
more sophisticated version goes in terms of trade offs.
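
[The trade-off between the two promotion schemes can be sketched like this; names, the wait threshold, and the task representation are illustrative assumptions, not the PlugSched implementation:]

```python
MAX_RT_PRIO = 100  # priorities below this are real-time; promotion stops here

def promote_simple(tasks):
    # Current scheme: every queued task moves up one slot (lower prio value
    # is better), regardless of how long it has waited.  Cheap to run, but
    # it performs unnecessary promotions.
    for t in tasks:
        t["prio"] = max(t["prio"] - 1, MAX_RT_PRIO)

def promote_by_wait(tasks, now_ms, min_wait_ms=500):
    # Hypothetical refinement: promote only tasks that have actually waited
    # a while, at the cost of tracking an enqueue timestamp per task.
    for t in tasks:
        if now_ms - t["enqueued_ms"] >= min_wait_ms:
            t["prio"] = max(t["prio"] - 1, MAX_RT_PRIO)
```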

> 
>> The reason that the other schedulers don't have this strategy is that I
>> didn't think that it was necessary.  Obviously I was wrong and should
>> extend it to the other schedulers.  It's doubtful whether this will help
>> a great deal with spa_no_frills as it is pure round robin and doesn't
>> reassess priorities except when a task's nice value changes or the task
>> changes policy.
> 
> Would it hurt to add it to spa_no_frills and let the children inherit it?

That would be the plan :-)

> 
>> This is one good reason not to use spa_no_frills on
>> production systems.
> 
> spa_ebs is great, but rather bursty.  Even setting max_ia_bonus=0 doesn't fix 
> that.   Is there a way to smooth it like spa_no_frills?

The principal determinant would be the smoothness of the yardstick. 
This is supposed to represent the task with the highest (recent) CPU 
usage rate per share and is used to determine how fairly CPU is being 
distributed among the currently active tasks.  Tasks are given a 
priority based on how their CPU usage rate per share compares to this 
yardstick.  This means that as the system load and/or type of task 
running changes the priorities of the tasks can change dramatically.
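
[The yardstick comparison described above can be sketched roughly as follows; the linear mapping and all names are illustrative assumptions about the idea, not spa_ebs's actual formula:]

```python
def ebs_prio(rate_per_share, yardstick, base_prio=100, prio_slots=40):
    # Map a task's recent CPU usage rate per share onto a priority slot by
    # comparing it to the yardstick (the highest rate per share among the
    # currently active tasks).  Heavier users land on worse (numerically
    # higher) prio values; if the yardstick moves, every priority moves.
    if yardstick <= 0.0:
        return base_prio
    ratio = min(rate_per_share / yardstick, 1.0)
    return base_prio + int(ratio * (prio_slots - 1))
```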

Is the burstiness that you're seeing just in the observed priorities or 
is it associated with behavioural burstiness as well?

> 
>> Perhaps you should consider creating a child
>> scheduler on top of it that meets your needs?
> 
> Perhaps.

Good.  I've been hoping that other interested parties might be 
encouraged by the small interface to SPA children to try different ideas 
for scheduling.

> 
>> Anyway, an alternative (and safer) way to reduce the effects of this
>> problem (while you're waiting for me to do the above change) is to reduce
>> the size of the time slice.  The only bad effect of doing this is that
>> you'll do slightly worse (less than 1%) on kernbench.
> 
> Actually, setting timeslice to 5,50,100 gives me better performance on 
> kernbench.  After closer inspection, I found 120ms a rather awkward 
> timeslice whereas 5,50, and 100 exhibited a smoother and faster response, 
> which may be machine dependent, thus the need for an autotuner.

When I had the SPA schedulers fully instrumented I did some long term 
measurements of my work station and found that the average CPU burst for 
all tasks was only a few msecs.  The exceptions were some of the tasks 
involved in building kernels.  So the only bad effects of reducing the 
time slice will be causing those tasks to have more context switches 
than otherwise and this will slightly reduce their throughput.

One thing that could be played with here is to vary the time slice based 
on the priority.  This would be in the opposite direction to the normal 
scheduler with higher priority tasks (i.e. those with lower prio values) 
getting smaller time slices.  The rationale being:

1. stop tasks that have been given large bonuses from shutting out other 
tasks for too long, and
2. reduce the context switch rate for tasks that haven't received bonuses.

Because tasks that get large bonuses will have short CPU bursts they 
should not be adversely affected (if this is done properly) as they will 
(except in exceptional circumstances such as a change in behaviour) 
surrender the CPU voluntarily before their reduced time slice has 
expired.  Imaginative use of the available statistics could make this 
largely automatic but there would be a need to be aware that the 
statistics can be distorted by the shorter time slices.

On the other hand, giving tasks without bonuses longer time slices 
shouldn't adversely affect interactive performance as the interactive 
tasks will (courtesy of their bonuses) preempt them.
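
[The inverse priority-to-slice mapping proposed above might be sketched as follows; the linear ramp and the 20-120 ms range are assumptions for illustration only:]

```python
def slice_for_prio(prio, min_prio=100, max_prio=139,
                   min_slice_ms=20, max_slice_ms=120):
    # Opposite of the normal scheduler: the best (numerically lowest) prio
    # values get the SHORTEST slices, so heavily-bonused tasks can't shut
    # others out for long, while unbonused tasks keep long slices and
    # context-switch less often.
    frac = (prio - min_prio) / (max_prio - min_prio)
    return int(min_slice_ms + frac * (max_slice_ms - min_slice_ms))
```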

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-09 23:53                     ` Peter Williams
@ 2006-04-10 14:43                       ` Al Boldi
  2006-04-11  2:07                         ` Peter Williams
  0 siblings, 1 reply; 28+ messages in thread
From: Al Boldi @ 2006-04-10 14:43 UTC (permalink / raw)
  To: Peter Williams; +Cc: linux-kernel

Peter Williams wrote:
> Al Boldi wrote:
> > But how does this explain spa_no_frills setting promotion to max not
> > having this problem?
>
> I'm still puzzled by this.  The only thing I can think of is that the
> promotion mechanism is too simple in that it just moves all promotable
> tasks up one slot without regard for how long they've been on the queue.
>   Doing this was a deliberate decision based on the desire to minimize
> overhead and the belief that it wouldn't matter in the grand scheme of
> things.  I may do some experimenting with a slightly more sophisticated
> version.
>
> Properly done, promotion should hardly ever occur but the cost would be
> slightly more complex enqueue/dequeue operations.  The current version
> will do unnecessary promotions but it was felt this was more than
> compensated for by the lower enqueue/dequeue costs.  We'll see how a
> more sophisticated version goes in terms of trade offs.

Would this affect the current, nearly perfect, spa_no_frills rr-behaviour w/ 
its ability to circumvent the timeslice problem when setting promo to max?

> >> This is one good reason not to use spa_no_frills on
> >> production systems.
> >
> > spa_ebs is great, but rather bursty.  Even setting max_ia_bonus=0
> > doesn't fix that.   Is there a way to smooth it like spa_no_frills?
>
> The principal determinant would be the smoothness of the yardstick.
> This is supposed to represent the task with the highest (recent) CPU
> usage rate per share and is used to determine how fairly CPU is being
> distributed among the currently active tasks.  Tasks are given a
> priority based on how their CPU usage rate per share compares to this
> yardstick.  This means that as the system load and/or type of task
> running changes the priorities of the tasks can change dramatically.
>
> Is the burstiness that you're seeing just in the observed priorities or
> is it associated with behavioural burstiness as well?

It's behavioural, exhibited in a choking style, like a jumpy mouse move 
during ia boosts.

> >> Perhaps you should consider creating a child
> >> scheduler on top of it that meets your needs?
> >
> > Perhaps.
>
> Good.  I've been hoping that other interested parties might be
> encouraged by the small interface to SPA children to try different ideas
> for scheduling.

Is there a no-op child skeleton available?

> One thing that could be played with here is to vary the time slice based
> on the priority.  This would be in the opposite direction to the normal
> scheduler with higher priority tasks (i.e. those with lower prio values)
> getting smaller time slices.  The rationale being:
>
> 1. stop tasks that have been given large bonuses from shutting out other
> tasks for too long, and
> 2. reduce the context switch rate for tasks that haven't received bonuses.
>
> Because tasks that get large bonuses will have short CPU bursts they
> should not be adversely effected (if this is done properly) as they will
> (except in exceptional circumstances such as a change in behaviour)
> surrender the CPU voluntarily before their reduced time slice has
> expired.  Imaginative use of the available statistics could make this
> largely automatic but there would be a need to be aware that the
> statistics can be distorted by the shorter time slices.
>
> On the other hand, giving tasks without bonuses longer time slices
> shouldn't adversely affect interactive performance as the interactive
> tasks will (courtesy of their bonuses) preempt them.

I couldn't agree more.  Tackling the problem on both fronts (prio/tslice) may 
give us more control, which could result in a more appropriate / fairer / 
smoother scheduler.

Thanks!

--
Al



* Re: [ANNOUNCE][RFC] PlugSched-6.3.1 for  2.6.16-rc5
  2006-04-10 14:43                       ` Al Boldi
@ 2006-04-11  2:07                         ` Peter Williams
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Williams @ 2006-04-11  2:07 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-kernel

Al Boldi wrote:
> Peter Williams wrote:
>> Al Boldi wrote:
>>> But how does this explain spa_no_frills setting promotion to max not
>>> having this problem?
>> I'm still puzzled by this.  The only thing I can think of is that the
>> promotion mechanism is too simple in that it just moves all promotable
>> tasks up one slot without regard for how long they've been on the queue.
>>   Doing this was a deliberate decision based on the desire to minimize
>> overhead and the belief that it wouldn't matter in the grand scheme of
>> things.  I may do some experimenting with a slightly more sophisticated
>> version.
>>
>> Properly done, promotion should hardly ever occur but the cost would be
>> slightly more complex enqueue/dequeue operations.  The current version
>> will do unnecessary promotions but it was felt this was more than
>> compensated for by the lower enqueue/dequeue costs.  We'll see how a
>> more sophisticated version goes in terms of trade offs.
> 
> Would this affect the current, nearly perfect, spa_no_frills rr-behaviour w/ 
> its ability to circumvent the timeslice problem when setting promo to max?

No, I'd leave those controls in there.

> 
>>>> This is one good reason not to use spa_no_frills on
>>>> production systems.
>>> spa_ebs is great, but rather bursty.  Even setting max_ia_bonus=0
>>> doesn't fix that.   Is there a way to smooth it like spa_no_frills?
>> The principal determinant would be the smoothness of the yardstick.
>> This is supposed to represent the task with the highest (recent) CPU
>> usage rate per share and is used to determine how fairly CPU is being
>> distributed among the currently active tasks.  Tasks are given a
>> priority based on how their CPU usage rate per share compares to this
>> yardstick.  This means that as the system load and/or type of task
>> running changes the priorities of the tasks can change dramatically.
>>
>> Is the burstiness that you're seeing just in the observed priorities or
>> is it associated with behavioural burstiness as well?
> 
> It's behavioural, exhibited in a choking style, like a jumpy mouse move 
> during ia boosts.

Yeah, I just tried it on my machine with the same results.  It used to 
behave quite well so I must have broken something recently.  I've been 
trying different things for IA bonus calculations.

BTW I've increased the smoothing of my rate statistics and that should 
help smooth scheduling as a whole.  It used to average a task's behaviour 
over its last 10 cycles but now it does it over 44.  Plus I've moved 
initial_time_slice as discussed.  I'll post patches for 2.6.17-rc1-mm2 
shortly.
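
[One common way to average behaviour over roughly the last N cycles is an exponentially weighted moving average; this is only a sketch of the general technique, not necessarily how PlugSched's rate statistics are computed:]

```python
def smoothed(old_avg, sample, cycles=44):
    # Exponentially weighted moving average with an effective window of
    # roughly `cycles` samples; a larger window reacts more slowly to a
    # task's changes in behaviour but yields steadier priorities.
    alpha = 2.0 / (cycles + 1)
    return old_avg + alpha * (sample - old_avg)
```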

> 
>>>> Perhaps you should consider creating a child
>>>> scheduler on top of it that meets your needs?
>>> Perhaps.
>> Good.  I've been hoping that other interested parties might be
>> encouraged by the small interface to SPA children to try different ideas
>> for scheduling.
> 
> Is there a no-op child skeleton available?

No.  But I could create one.
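
[The SPA child interface itself is not shown in this thread; a purely illustrative no-op skeleton, with every hook name and signature hypothetical, might take a shape like:]

```python
class NoOpChild:
    """Hypothetical no-op SPA child: every hook defers to the parent's
    decision, yielding the parent's plain round-robin behaviour."""
    name = "spa_noop"

    def task_forked(self, task):
        pass  # no per-task bookkeeping

    def reassess_prio(self, task, parent_prio):
        return parent_prio  # leave the parent's priority untouched

    def timeslice_ms(self, task, default_ms=120):
        return default_ms   # keep the default slice
```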

> 
>> One thing that could be played with here is to vary the time slice based
>> on the priority.  This would be in the opposite direction to the normal
>> scheduler with higher priority tasks (i.e. those with lower prio values)
>> getting smaller time slices.  The rationale being:
>>
>> 1. stop tasks that have been given large bonuses from shutting out other
>> tasks for too long, and
>> 2. reduce the context switch rate for tasks that haven't received bonuses.
>>
>> Because tasks that get large bonuses will have short CPU bursts they
>> should not be adversely affected (if this is done properly) as they will
>> (except in exceptional circumstances such as a change in behaviour)
>> surrender the CPU voluntarily before their reduced time slice has
>> expired.  Imaginative use of the available statistics could make this
>> largely automatic but there would be a need to be aware that the
>> statistics can be distorted by the shorter time slices.
>>
>> On the other hand, giving tasks without bonuses longer time slices
>> shouldn't adversely affect interactive performance as the interactive
>> tasks will (courtesy of their bonuses) preempt them.
> 
> I couldn't agree more.  Tackling the problem on both fronts (prio/tslice) may 
> give us more control, which could result in a more appropriate / fairer / 
> smoother scheduler.

"Hedging one's bets" as punters would say.

> 
> Thanks!

My pleasure.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


end of thread

Thread overview: 28+ messages
2006-04-03 11:59 [ANNOUNCE][RFC] PlugSched-6.3.1 for 2.6.16-rc5 Al Boldi
2006-04-03 12:13 ` Paolo Ornati
2006-04-03 23:04 ` Peter Williams
2006-04-03 23:29   ` Con Kolivas
2006-04-04  0:01     ` Peter Williams
2006-04-04  0:12       ` Con Kolivas
2006-04-04  1:29         ` Peter Williams
2006-04-04 13:27           ` Al Boldi
2006-04-04 13:27   ` Al Boldi
2006-04-04 23:17     ` Peter Williams
2006-04-05  8:16       ` Al Boldi
2006-04-05 22:53         ` Peter Williams
2006-04-07 21:32           ` Al Boldi
2006-04-08  1:29             ` Peter Williams
2006-04-08 20:31               ` Al Boldi
2006-04-09  2:58                 ` Peter Williams
2006-04-09  5:04                   ` Al Boldi
2006-04-09 23:53                     ` Peter Williams
2006-04-10 14:43                       ` Al Boldi
2006-04-11  2:07                         ` Peter Williams
2006-04-03 23:27 ` Peter Williams
2006-04-04 13:27   ` Al Boldi
2006-04-04 23:20     ` Peter Williams
2006-04-05  8:16       ` Al Boldi
  -- strict thread matches above, loose matches on Subject: below --
2006-02-28 22:32 Peter Williams
2006-03-01  2:36 ` Peter Williams
2006-04-02  2:04   ` Peter Williams
2006-04-02  6:02     ` Con Kolivas
