* [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
@ 2004-06-06 15:39 Con Kolivas
From: Con Kolivas @ 2004-06-06 15:39 UTC (permalink / raw)
To: Linux Kernel Mailinglist; +Cc: Zwane Mwaikambo, William Lee Irwin III
[-- Attachment #1: Type: Text/Plain, Size: 3146 bytes --]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
This is an update of the scheduler policy mechanism rewrite built on the
infrastructure of the current O(1) scheduler. The design has changed slightly
from the original, so a detailed description follows. These changes have
eliminated all known corner cases and maintain cpu distribution much better.
It has proven stable in my testing and is now ready for more widespread
public testing.
Aims:
- Interactive by design rather than having interactivity bolted on.
- Good scalability.
- Simple predictable design.
- Maintain appropriate cpu distribution and fairness.
- Low scheduling latency for normal policy tasks.
- Low overhead.
- Make renicing processes actually matter for CPU distribution (nice 0 gets
20 times what nice +20 gets)
- Resistant to priority inversion
- More forgiving of poor locking
- Tunable for a server workload or computational tasks
Description:
- All tasks start at a dynamic priority based on their nice value. They run
for one RR_INTERVAL (nominally set to 10ms) and then drop one stair
(priority). If they sleep before running again, they get to run for 2
intervals before being demoted one priority, and so on, until they get all
their intervals at their best priority: 20 intervals for nice 0; 1 interval
for nice +19.
- The staircase elevation mechanism can be disabled, so that all tasks simply
descend stairs, using the sysctl:
echo 0 > /proc/sys/kernel/interactive
This has the effect of maintaining cpu distribution much more strictly
according to nice, whereas the default mechanism allows interactive tasks
bursts of cpu before settling to the appropriate cpu distribution.
- The length of time for which tasks remain cpu bound can be increased using
the sysctl:
echo 1 > /proc/sys/kernel/compute
which extends the RR_INTERVAL to 100ms and disables the staircase elevation,
improving conditions for pure computational tasks by optimising cache
benefits and decreasing context switching (gains another 1.5% on kernbench).
Performance:
- All cpu throughput benchmarks show performance equivalent to or better than
mainline. Note that disabling the interactive setting actually _worsens_ some
benchmarks because of their dependence on yield(), so I don't recommend
disabling it unless you do a comparison first.
- Interactivity is approximately equivalent to mainline 2.6, but with faster
application startup and no known behavioural quirks.
Comments and testing welcome.
fs/proc/array.c | 2
include/linux/sched.h | 8
include/linux/sysctl.h | 2
kernel/sched.c | 676 +++++++++++++------------------------------------
kernel/sysctl.c | 16 +
5 files changed, 212 insertions(+), 492 deletions(-)
Can be downloaded here:
http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3
and below
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
iD8DBQFAwzquZUg7+tp6mRURArBHAJ9SIBOKX6MYOdkJzdb+xRNnW82JQgCghLou
wXhrRsBsfY2BIqbLT1tUWcs=
=g1no
-----END PGP SIGNATURE-----
[-- Attachment #2: patch-2.6.7-rc2-s6.3 --]
[-- Type: text/x-diff, Size: 34187 bytes --]
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/fs/proc/array.c linux-2.6.7-rc2-s6.3/fs/proc/array.c
--- linux-2.6.7-rc2-base/fs/proc/array.c 2004-05-31 21:29:15.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/fs/proc/array.c 2004-06-07 00:03:56.959133536 +1000
@@ -155,7 +155,6 @@ static inline char * task_state(struct t
read_lock(&tasklist_lock);
buffer += sprintf(buffer,
"State:\t%s\n"
- "SleepAVG:\t%lu%%\n"
"Tgid:\t%d\n"
"Pid:\t%d\n"
"PPid:\t%d\n"
@@ -163,7 +162,6 @@ static inline char * task_state(struct t
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
- (p->sleep_avg/1024)*100/(1020000000/1024),
p->tgid,
p->pid, p->pid ? p->real_parent->pid : 0,
p->pid && p->ptrace ? p->parent->pid : 0,
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/include/linux/sched.h linux-2.6.7-rc2-s6.3/include/linux/sched.h
--- linux-2.6.7-rc2-base/include/linux/sched.h 2004-05-31 21:29:21.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/include/linux/sched.h 2004-06-07 00:10:37.450576503 +1000
@@ -164,6 +164,7 @@ extern void show_stack(struct task_struc
void io_schedule(void);
long io_schedule_timeout(long timeout);
+extern int interactive, compute;
extern void cpu_init (void);
extern void trap_init(void);
@@ -394,14 +395,13 @@ struct task_struct {
struct list_head run_list;
prio_array_t *array;
- unsigned long sleep_avg;
- long interactive_credit;
unsigned long long timestamp;
- int activated;
+ unsigned long runtime, totalrun;
+ unsigned int deadline;
unsigned long policy;
cpumask_t cpus_allowed;
- unsigned int time_slice, first_time_slice;
+ unsigned int slice, time_slice, first_time_slice;
struct list_head tasks;
struct list_head ptrace_children;
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/include/linux/sysctl.h linux-2.6.7-rc2-s6.3/include/linux/sysctl.h
--- linux-2.6.7-rc2-base/include/linux/sysctl.h 2004-05-31 21:29:21.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/include/linux/sysctl.h 2004-06-07 00:06:13.007895851 +1000
@@ -133,6 +133,8 @@ enum
KERN_NGROUPS_MAX=63, /* int: NGROUPS_MAX */
KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */
KERN_HZ_TIMER=65, /* int: hz timer on or off */
+ KERN_INTERACTIVE=66, /* interactive tasks to have cpu bursts */
+ KERN_COMPUTE=67, /* adjust timeslices for a compute server */
};
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/kernel/sched.c linux-2.6.7-rc2-s6.3/kernel/sched.c
--- linux-2.6.7-rc2-base/kernel/sched.c 2004-05-31 21:29:24.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/kernel/sched.c 2004-06-07 00:36:43.382396246 +1000
@@ -16,7 +16,10 @@
* by Davide Libenzi, preemptible kernel bits by Robert Love.
* 2003-09-03 Interactivity tuning by Con Kolivas.
* 2004-04-02 Scheduler domains code by Nick Piggin
- */
+ * 2004-03-19. New staircase scheduling policy by Con Kolivas with help
+ * from Zwane Mwaikambo and useful suggestions by
+ * William Lee Irwin III.
+*/
#include <linux/mm.h>
#include <linux/module.h>
@@ -66,8 +69,6 @@
#define USER_PRIO(p) ((p)-MAX_RT_PRIO)
#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
-#define AVG_TIMESLICE (MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
- (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))
/*
* Some helpers for converting nanosecond timing to jiffy resolution
@@ -75,110 +76,18 @@
#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
-/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
- * maximum timeslice is 200 msecs. Timeslices get refilled after
- * they expire.
- */
-#define MIN_TIMESLICE ( 10 * HZ / 1000)
-#define MAX_TIMESLICE (200 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT 30
-#define CHILD_PENALTY 95
-#define PARENT_PENALTY 100
-#define EXIT_WEIGHT 3
-#define PRIO_BONUS_RATIO 25
-#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT (MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
-#define CREDIT_LIMIT 100
-
-/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
- *
- * This part scales the interactivity limit depending on niceness.
- *
- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
- * Here are a few examples of different nice levels:
- *
- * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
- * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
- * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
- *
- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
- * priority range a task can explore, a value of '1' means the
- * task is rated interactive.)
- *
- * Ie. nice +19 tasks can never get 'interactive' enough to be
- * reinserted into the active array. And only heavily CPU-hog nice -20
- * tasks will be expired. Default nice 0 tasks are somewhere between,
- * it takes some effort for them to get interactive, but it's not
- * too hard.
- */
-
-#define CURRENT_BONUS(p) \
- (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
- MAX_SLEEP_AVG)
-
-#ifdef CONFIG_SMP
-#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
- (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
- num_online_cpus())
-#else
-#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
- (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
-#endif
-
-#define SCALE(v1,v1_max,v2_max) \
- (v1) * (v2_max) / (v1_max)
-
-#define DELTA(p) \
- (SCALE(TASK_NICE(p), 40, MAX_BONUS) + INTERACTIVE_DELTA)
-
-#define TASK_INTERACTIVE(p) \
- ((p)->prio <= (p)->static_prio - DELTA(p))
-
-#define INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
-#define HIGH_CREDIT(p) \
- ((p)->interactive_credit > CREDIT_LIMIT)
-
-#define LOW_CREDIT(p) \
- ((p)->interactive_credit < -CREDIT_LIMIT)
+int compute = 0;
+/*
+ *This is the time all tasks within the same priority round robin.
+ *compute setting is reserved for dedicated computational scheduling
+ *and has ten times larger intervals.
+ */
+#define _RR_INTERVAL ((10 * HZ / 1000) ? : 1)
+#define RR_INTERVAL (_RR_INTERVAL * (1 + 9 * compute))
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
-/*
- * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
- * to time slice values.
- *
- * The higher a thread's priority, the bigger timeslices
- * it gets during one round of execution. But even the lowest
- * priority thread gets MIN_TIMESLICE worth of execution time.
- *
- * task_timeslice() is the interface that is used by the scheduler.
- */
-
-#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
- ((MAX_TIMESLICE - MIN_TIMESLICE) * \
- (MAX_PRIO-1 - (p)->static_prio) / (MAX_USER_PRIO-1)))
-
-static unsigned int task_timeslice(task_t *p)
-{
- return BASE_TIMESLICE(p);
-}
-
#define task_hot(p, now, sd) ((now) - (p)->timestamp < (sd)->cache_hot_time)
/*
@@ -192,7 +101,7 @@ typedef struct runqueue runqueue_t;
struct prio_array {
unsigned int nr_active;
unsigned long bitmap[BITMAP_SIZE];
- struct list_head queue[MAX_PRIO];
+ struct list_head queue[MAX_PRIO + 1];
};
/*
@@ -214,12 +123,11 @@ struct runqueue {
unsigned long cpu_load;
#endif
unsigned long long nr_switches;
- unsigned long expired_timestamp, nr_uninterruptible;
+ unsigned long nr_uninterruptible;
unsigned long long timestamp_last_tick;
task_t *curr, *idle;
struct mm_struct *prev_mm;
- prio_array_t *active, *expired, arrays[2];
- int best_expired_prio;
+ prio_array_t array;
atomic_t nr_iowait;
#ifdef CONFIG_SMP
@@ -300,16 +208,18 @@ static inline void rq_unlock(runqueue_t
/*
* Adding/removing a task to/from a priority array:
*/
-static void dequeue_task(struct task_struct *p, prio_array_t *array)
+static void dequeue_task(struct task_struct *p, runqueue_t *rq)
{
+ prio_array_t* array = &rq->array;
array->nr_active--;
list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
}
-static void enqueue_task(struct task_struct *p, prio_array_t *array)
+static void enqueue_task(struct task_struct *p, runqueue_t *rq)
{
+ prio_array_t* array = &rq->array;
list_add_tail(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
array->nr_active++;
@@ -321,8 +231,9 @@ static void enqueue_task(struct task_str
* remote queue so we want these tasks to show up at the head of the
* local queue:
*/
-static inline void enqueue_task_head(struct task_struct *p, prio_array_t *array)
+static inline void enqueue_task_head(struct task_struct *p, runqueue_t *rq)
{
+ prio_array_t* array = &rq->array;
list_add(&p->run_list, array->queue + p->prio);
__set_bit(p->prio, array->bitmap);
array->nr_active++;
@@ -330,42 +241,11 @@ static inline void enqueue_task_head(str
}
/*
- * effective_prio - return the priority that is based on the static
- * priority but is modified by bonuses/penalties.
- *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
- * We use 25% of the full 0...39 priority range so that:
- *
- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
- *
- * Both properties are important to certain workloads.
- */
-static int effective_prio(task_t *p)
-{
- int bonus, prio;
-
- if (rt_task(p))
- return p->prio;
-
- bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
-
- prio = p->static_prio - bonus;
- if (prio < MAX_RT_PRIO)
- prio = MAX_RT_PRIO;
- if (prio > MAX_PRIO-1)
- prio = MAX_PRIO-1;
- return prio;
-}
-
-/*
* __activate_task - move a task to the runqueue.
*/
static inline void __activate_task(task_t *p, runqueue_t *rq)
{
- enqueue_task(p, rq->active);
+ enqueue_task(p, rq);
rq->nr_running++;
}
@@ -374,95 +254,112 @@ static inline void __activate_task(task_
*/
static inline void __activate_idle_task(task_t *p, runqueue_t *rq)
{
- enqueue_task_head(p, rq->active);
+ enqueue_task_head(p, rq);
rq->nr_running++;
}
-static void recalc_task_prio(task_t *p, unsigned long long now)
+// deadline - the best deadline rank a task can have.
+static unsigned int deadline(task_t *p)
{
- unsigned long long __sleep_time = now - p->timestamp;
- unsigned long sleep_time;
-
- if (__sleep_time > NS_MAX_SLEEP_AVG)
- sleep_time = NS_MAX_SLEEP_AVG;
+ unsigned int task_user_prio;
+ if (rt_task(p))
+ return p->deadline;
+ task_user_prio = TASK_USER_PRIO(p);
+ if (likely(task_user_prio < 40))
+ return 39 - task_user_prio;
else
- sleep_time = (unsigned long)__sleep_time;
+ return 0;
+}
- if (likely(sleep_time > 0)) {
- /*
- * User tasks that sleep a long time are categorised as
- * idle and will get just interactive status to stay active &
- * prevent them suddenly becoming cpu hogs and starving
- * other processes.
- */
- if (p->mm && p->activated != -1 &&
- sleep_time > INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- AVG_TIMESLICE);
- if (!HIGH_CREDIT(p))
- p->interactive_credit++;
- } else {
- /*
- * The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time.
- */
- sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
+static void inc_deadline(task_t *p)
+{
+ unsigned int best_deadline;
+ best_deadline = deadline(p);
+ if (!p->mm || rt_task(p) || !best_deadline)
+ return;
+ if (p->deadline < best_deadline)
+ p->deadline++;
+}
- /*
- * Tasks with low interactive_credit are limited to
- * one timeslice worth of sleep avg bonus.
- */
- if (LOW_CREDIT(p) &&
- sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
- sleep_time = JIFFIES_TO_NS(task_timeslice(p));
+static void dec_deadline(task_t *p)
+{
+ if (!p->mm || rt_task(p))
+ return;
+ if (p->deadline)
+ p->deadline--;
+}
- /*
- * Non high_credit tasks waking from uninterruptible
- * sleep are limited in their sleep_avg rise as they
- * are likely to be cpu hogs waiting on I/O
- */
- if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm) {
- if (p->sleep_avg >= INTERACTIVE_SLEEP(p))
- sleep_time = 0;
- else if (p->sleep_avg + sleep_time >=
- INTERACTIVE_SLEEP(p)) {
- p->sleep_avg = INTERACTIVE_SLEEP(p);
- sleep_time = 0;
- }
- }
+// slice - the duration a task runs before losing a deadline rank.
+static unsigned int slice(task_t *p)
+{
+ unsigned int slice = RR_INTERVAL;
+ if (!rt_task(p))
+ slice += deadline(p) * RR_INTERVAL;
+ return slice;
+}
- /*
- * This code gives a bonus to interactive tasks.
- *
- * The boost works by updating the 'average sleep time'
- * value here, based on ->timestamp. The more time a
- * task spends sleeping, the higher the average gets -
- * and the higher the priority boost gets as well.
- */
- p->sleep_avg += sleep_time;
+// interactive - interactive tasks get longer intervals at best priority
+int interactive = 1;
- if (p->sleep_avg > NS_MAX_SLEEP_AVG) {
- p->sleep_avg = NS_MAX_SLEEP_AVG;
- if (!HIGH_CREDIT(p))
- p->interactive_credit++;
- }
+/*
+ * effective_prio - dynamic priority dependent on deadline rank.
+ * With 0 deadline ranks the priority decreases by one each RR_INTERVAL.
+ * As the deadline rank increases the priority stays at the top "stair" or
+ * value for longer.
+ */
+static int effective_prio(task_t *p)
+{
+ int prio;
+ unsigned int full_slice, used_slice, first_slice;
+ unsigned int best_deadline;
+ if (rt_task(p))
+ return p->prio;
+
+ best_deadline = deadline(p);
+ full_slice = slice(p);
+ used_slice = full_slice - p->slice;
+ if (p->deadline > best_deadline || !p->mm)
+ p->deadline = best_deadline;
+ first_slice = RR_INTERVAL;
+ if (interactive && !compute)
+ first_slice *= (p->deadline + 1);
+ prio = MAX_PRIO - 1 - best_deadline;
+
+ if (used_slice < first_slice)
+ return prio;
+ if (p->mm)
+ prio += 1 + (used_slice - first_slice) / RR_INTERVAL;
+ if (prio > MAX_PRIO - 1)
+ prio = MAX_PRIO - 1;
+ return prio;
+}
+
+static void recalc_task_prio(task_t *p, unsigned long long now)
+{
+ unsigned long sleep_time = now - p->timestamp;
+ unsigned long run_time = NS_TO_JIFFIES(p->runtime);
+ unsigned long total_run = NS_TO_JIFFIES(p->totalrun) + run_time;
+ if (!run_time && NS_TO_JIFFIES(p->runtime + sleep_time) < RR_INTERVAL) {
+ if (p->slice - total_run < 1) {
+ p->totalrun = 0;
+ dec_deadline(p);
+ } else {
+ p->totalrun += p->runtime;
+ p->slice -= NS_TO_JIFFIES(p->totalrun);
}
+ } else {
+ inc_deadline(p);
+ p->runtime = 0;
+ p->totalrun = 0;
}
-
- p->prio = effective_prio(p);
}
/*
* activate_task - move a task to the runqueue and do priority recalculation
- *
- * Update all the scheduling statistics stuff. (sleep average
- * calculation, priority modifiers, etc.)
*/
static void activate_task(task_t *p, runqueue_t *rq, int local)
{
- unsigned long long now;
-
- now = sched_clock();
+ unsigned long long now = sched_clock();
#ifdef CONFIG_SMP
if (!local) {
/* Compensate for drifting sched_clock */
@@ -471,33 +368,14 @@ static void activate_task(task_t *p, run
+ rq->timestamp_last_tick;
}
#endif
-
- recalc_task_prio(p, now);
-
- /*
- * This checks to make sure it's not an uninterruptible task
- * that is now waking up.
- */
- if (!p->activated) {
- /*
- * Tasks which were woken up by interrupts (ie. hw events)
- * are most likely of interactive nature. So we give them
- * the credit of extending their sleep time to the period
- * of time they spend on the runqueue, waiting for execution
- * on a CPU, first time around:
- */
- if (in_interrupt())
- p->activated = 2;
- else {
- /*
- * Normal first-time wakeups get a credit too for
- * on-runqueue time, but it will be weighted down:
- */
- p->activated = 1;
- }
- }
+ p->slice = slice(p);
+ if (!rt_task(p) && p->mm)
+ recalc_task_prio(p, now);
+ else
+ inc_deadline(p);
+ p->prio = effective_prio(p);
+ p->time_slice = RR_INTERVAL;
p->timestamp = now;
-
__activate_task(p, rq);
}
@@ -507,9 +385,12 @@ static void activate_task(task_t *p, run
static void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
rq->nr_running--;
- if (p->state == TASK_UNINTERRUPTIBLE)
+ if (p->state == TASK_UNINTERRUPTIBLE) {
rq->nr_uninterruptible++;
- dequeue_task(p, p->array);
+ if (p->deadline > (deadline(p) - 2) && p->deadline && p->mm)
+ p->deadline--;
+ }
+ dequeue_task(p, rq);
p->array = NULL;
}
@@ -813,14 +694,8 @@ out_set_cpu:
out_activate:
#endif /* CONFIG_SMP */
- if (old_state == TASK_UNINTERRUPTIBLE) {
+ if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
- /*
- * Tasks on involuntary sleep don't earn
- * sleep_avg beyond just interactive state.
- */
- p->activated = -1;
- }
/*
* Sync wakeups (i.e. those types of wakeups where the waker
@@ -890,12 +765,14 @@ void fastcall sched_fork(task_t *p)
*/
local_irq_disable();
p->time_slice = (current->time_slice + 1) >> 1;
+ p->slice = (current->slice + 1) >> 1;
/*
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
p->first_time_slice = 1;
current->time_slice >>= 1;
+ current->slice >>= 1;
p->timestamp = sched_clock();
if (!current->time_slice) {
/*
@@ -923,33 +800,14 @@ void fastcall wake_up_forked_process(tas
unsigned long flags;
runqueue_t *rq = task_rq_lock(current, &flags);
+ // Forked process gets a lower deadline rank to prevent fork bombs.
+ p->deadline = 0;
BUG_ON(p->state != TASK_RUNNING);
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive.
- */
- current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
- PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->interactive_credit = 0;
-
- p->prio = effective_prio(p);
set_task_cpu(p, smp_processor_id());
- if (unlikely(!current->array))
- __activate_task(p, rq);
- else {
- p->prio = current->prio;
- list_add_tail(&p->run_list, ¤t->run_list);
- p->array = current->array;
- p->array->nr_active++;
- rq->nr_running++;
- }
+ p->prio = effective_prio(p);
+ __activate_task(p, rq);
task_rq_unlock(rq, &flags);
}
@@ -970,19 +828,11 @@ void fastcall sched_exit(task_t * p)
local_irq_save(flags);
if (p->first_time_slice) {
p->parent->time_slice += p->time_slice;
- if (unlikely(p->parent->time_slice > MAX_TIMESLICE))
- p->parent->time_slice = MAX_TIMESLICE;
+ if (unlikely(p->parent->time_slice > slice(p->parent)))
+ p->parent->time_slice = slice(p->parent);
}
local_irq_restore(flags);
- /*
- * If the child was a (relative-) CPU hog then decrease
- * the sleep_avg of the parent as well.
- */
rq = task_rq_lock(p->parent, &flags);
- if (p->sleep_avg < p->parent->sleep_avg)
- p->parent->sleep_avg = p->parent->sleep_avg /
- (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
- (EXIT_WEIGHT + 1);
task_rq_unlock(rq, &flags);
}
@@ -1246,18 +1096,6 @@ lock_again:
double_rq_unlock(this_rq, rq);
goto lock_again;
}
- /*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive.
- */
- current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
- PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
-
- p->interactive_credit = 0;
p->prio = effective_prio(p);
set_task_cpu(p, cpu);
@@ -1369,15 +1207,15 @@ static void double_lock_balance(runqueue
* pull_task - move a task from a remote runqueue to the local runqueue.
* Both runqueues must be locked.
*/
-static inline
-void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p,
- runqueue_t *this_rq, prio_array_t *this_array, int this_cpu)
+static inline
+void pull_task(runqueue_t *src_rq, task_t *p,
+ runqueue_t *this_rq, int this_cpu)
{
- dequeue_task(p, src_array);
+ dequeue_task(p, src_rq);
src_rq->nr_running--;
set_task_cpu(p, this_cpu);
this_rq->nr_running++;
- enqueue_task(p, this_array);
+ enqueue_task(p, this_rq);
p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
+ this_rq->timestamp_last_tick;
/*
@@ -1427,29 +1265,16 @@ static int move_tasks(runqueue_t *this_r
unsigned long max_nr_move, struct sched_domain *sd,
enum idle_type idle)
{
- prio_array_t *array, *dst_array;
+ prio_array_t* array, *dst_array;
struct list_head *head, *curr;
int idx, pulled = 0;
task_t *tmp;
if (max_nr_move <= 0 || busiest->nr_running <= 1)
goto out;
+ array = &busiest->array;
+ dst_array = &this_rq->array;
- /*
- * We first consider expired tasks. Those will likely not be
- * executed in the near future, and they are most likely to
- * be cache-cold, thus switching CPUs has the least effect
- * on them.
- */
- if (busiest->expired->nr_active) {
- array = busiest->expired;
- dst_array = this_rq->expired;
- } else {
- array = busiest->active;
- dst_array = this_rq->active;
- }
-
-new_array:
/* Start searching at priority 0: */
idx = 0;
skip_bitmap:
@@ -1457,14 +1282,8 @@ skip_bitmap:
idx = sched_find_first_bit(array->bitmap);
else
idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
- if (idx >= MAX_PRIO) {
- if (array == busiest->expired && busiest->active->nr_active) {
- array = busiest->active;
- dst_array = this_rq->active;
- goto new_array;
- }
+ if (idx >= MAX_PRIO)
goto out;
- }
head = array->queue + idx;
curr = head->prev;
@@ -1479,7 +1298,7 @@ skip_queue:
idx++;
goto skip_bitmap;
}
- pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
+ pull_task(busiest, tmp, this_rq, this_cpu);
pulled++;
/* We only want to steal up to the prescribed number of tasks. */
@@ -1938,22 +1757,6 @@ DEFINE_PER_CPU(struct kernel_stat, kstat
EXPORT_PER_CPU_SYMBOL(kstat);
/*
- * We place interactive tasks back into the active array, if possible.
- *
- * To guarantee that this does not starve expired tasks we ignore the
- * interactivity of a task if the first expired task had to wait more
- * than a 'reasonable' amount of time. This deadline timeout is
- * load-dependent, as the frequency of array switched decreases with
- * increasing number of running tasks. We also ignore the interactivity
- * if a better static_prio task has expired:
- */
-#define EXPIRED_STARVING(rq) \
- ((STARVATION_LIMIT && ((rq)->expired_timestamp && \
- (jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
- ((rq)->curr->static_prio > (rq)->best_expired_prio))
-
-/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
@@ -1998,76 +1801,38 @@ void scheduler_tick(int user_ticks, int
cpustat->system += sys_ticks;
/* Task might have expired already, but not scheduled off yet */
- if (p->array != rq->active) {
+ if (p->array != &rq->array) {
set_tsk_need_resched(p);
goto out;
}
spin_lock(&rq->lock);
- /*
- * The task was running during this tick - update the
- * time slice counter. Note: we do not update a thread's
- * priority until it either goes to sleep or uses up its
- * timeslice. This makes it possible for interactive tasks
- * to use up their timeslices at their highest priority levels.
- */
- if (unlikely(rt_task(p))) {
- /*
- * RR tasks need a special form of timeslice management.
- * FIFO tasks have no timeslices.
- */
- if ((p->policy == SCHED_RR) && !--p->time_slice) {
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
- set_tsk_need_resched(p);
-
- /* put it at the end of the queue: */
- dequeue_task(p, rq->active);
- enqueue_task(p, rq->active);
- }
+
+ // SCHED_FIFO tasks never run out of timeslice.
+ if (unlikely(p->policy == SCHED_FIFO))
+ goto out_unlock;
+ // Tasks lose a deadline rank each time they use up a full slice().
+ if (!--p->slice) {
+ set_tsk_need_resched(p);
+ dequeue_task(p, rq);
+ dec_deadline(p);
+ p->slice = slice(p);
+ p->prio = effective_prio(p);
+ p->time_slice = RR_INTERVAL;
+ enqueue_task(p, rq);
+ p->first_time_slice = 0;
goto out_unlock;
}
+ /*
+ * Tasks that run out of time_slice but still have slice left get
+ * requeued with a lower priority && RR_INTERVAL time_slice.
+ */
if (!--p->time_slice) {
- dequeue_task(p, rq->active);
set_tsk_need_resched(p);
+ dequeue_task(p, rq);
p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
-
- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
- enqueue_task(p, rq->expired);
- if (p->static_prio < rq->best_expired_prio)
- rq->best_expired_prio = p->static_prio;
- } else
- enqueue_task(p, rq->active);
- } else {
- /*
- * Prevent a too long timeslice allowing a task to monopolize
- * the CPU. We do this by splitting up the timeslice into
- * smaller pieces.
- *
- * Note: this does not mean the task's timeslices expire or
- * get lost in any way, they just might be preempted by
- * another task of equal priority. (one with higher
- * priority would have preempted this task already.) We
- * requeue this task to the end of the list on this priority
- * level, which is in essence a round-robin of tasks with
- * equal priority.
- *
- * This only applies to tasks in the interactive
- * delta range with at least TIMESLICE_GRANULARITY to requeue.
- */
- if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
- p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
- (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
- (p->array == rq->active)) {
-
- dequeue_task(p, rq->active);
- set_tsk_need_resched(p);
- p->prio = effective_prio(p);
- enqueue_task(p, rq->active);
- }
+ p->time_slice = RR_INTERVAL;
+ enqueue_task(p, rq);
+ goto out_unlock;
}
out_unlock:
spin_unlock(&rq->lock);
@@ -2131,8 +1896,8 @@ static inline int dependent_sleeper(int
* task from using an unfair proportion of the
* physical cpu's resources. -ck
*/
- if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) >
- task_timeslice(p) || rt_task(smt_curr)) &&
+ if (((smt_curr->slice * (100 - sd->per_cpu_gain) / 100) >
+ slice(p) || rt_task(smt_curr)) &&
p->mm && smt_curr->mm && !rt_task(p))
ret = 1;
@@ -2141,8 +1906,8 @@ static inline int dependent_sleeper(int
* or wake it up if it has been put to sleep for priority
* reasons.
*/
- if ((((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
- task_timeslice(smt_curr) || rt_task(p)) &&
+ if ((((p->slice * (100 - sd->per_cpu_gain) / 100) >
+ slice(smt_curr) || rt_task(p)) &&
smt_curr->mm && p->mm && !rt_task(smt_curr)) ||
(smt_curr == smt_rq->idle && smt_rq->nr_running))
resched_task(smt_curr);
@@ -2168,10 +1933,9 @@ asmlinkage void __sched schedule(void)
long *switch_count;
task_t *prev, *next;
runqueue_t *rq;
- prio_array_t *array;
+ prio_array_t* array;
struct list_head *queue;
unsigned long long now;
- unsigned long run_time;
int cpu, idx;
/*
@@ -2193,19 +1957,8 @@ need_resched:
release_kernel_lock(prev);
now = sched_clock();
- if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
- run_time = now - prev->timestamp;
- else
- run_time = NS_MAX_SLEEP_AVG;
-
- /*
- * Tasks with interactive credits get charged less run_time
- * at high sleep_avg to delay them losing their interactive
- * status
- */
- if (HIGH_CREDIT(prev))
- run_time /= (CURRENT_BONUS(prev) ? : 1);
+ prev->runtime = now - prev->timestamp;
spin_lock_irq(&rq->lock);
/*
@@ -2227,56 +1980,22 @@ need_resched:
idle_balance(cpu, rq);
if (!rq->nr_running) {
next = rq->idle;
- rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu, rq);
goto switch_tasks;
}
}
- array = rq->active;
- if (unlikely(!array->nr_active)) {
- /*
- * Switch the active and expired arrays.
- */
- rq->active = rq->expired;
- rq->expired = array;
- array = rq->active;
- rq->expired_timestamp = 0;
- rq->best_expired_prio = MAX_PRIO;
- }
-
+ array = &rq->array;
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
- if (dependent_sleeper(cpu, rq, next)) {
+ if (dependent_sleeper(cpu, rq, next))
next = rq->idle;
- goto switch_tasks;
- }
-
- if (!rt_task(next) && next->activated > 0) {
- unsigned long long delta = now - next->timestamp;
-
- if (next->activated == 1)
- delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
-
- array = next->array;
- dequeue_task(next, array);
- recalc_task_prio(next, next->timestamp + delta);
- enqueue_task(next, array);
- }
- next->activated = 0;
switch_tasks:
prefetch(next);
clear_tsk_need_resched(prev);
RCU_qsctr(task_cpu(prev))++;
-
- prev->sleep_avg -= run_time;
- if ((long)prev->sleep_avg <= 0) {
- prev->sleep_avg = 0;
- if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
- prev->interactive_credit--;
- }
prev->timestamp = now;
if (likely(prev != next)) {
@@ -2542,7 +2261,7 @@ EXPORT_SYMBOL(sleep_on_timeout);
void set_user_nice(task_t *p, long nice)
{
unsigned long flags;
prio_array_t *array;
runqueue_t *rq;
int old_prio, new_prio, delta;
@@ -2565,7 +2284,7 @@ void set_user_nice(task_t *p, long nice)
}
array = p->array;
if (array)
- dequeue_task(p, array);
+ dequeue_task(p, rq);
old_prio = p->prio;
new_prio = NICE_TO_PRIO(nice);
@@ -2574,7 +2293,7 @@ void set_user_nice(task_t *p, long nice)
p->prio += delta;
if (array) {
- enqueue_task(p, array);
+ enqueue_task(p, rq);
/*
* If the task increased its priority or is running and
* lowered its priority, then reschedule its CPU:
@@ -2696,7 +2415,7 @@ static int setscheduler(pid_t pid, int p
struct sched_param lp;
int retval = -EINVAL;
int oldprio;
prio_array_t *array;
unsigned long flags;
runqueue_t *rq;
task_t *p;
@@ -2965,28 +2684,17 @@ out_unlock:
/**
* sys_sched_yield - yield the current processor to other threads.
*
- * this function yields the current CPU by moving the calling thread
- * to the expired array. If there are no other threads running on this
+ * This function yields the current CPU by requeueing the calling
+ * thread with a fresh slice and recalculated priority. If there
+ * are no other threads running on this
* CPU then this function will return.
*/
asmlinkage long sys_sched_yield(void)
{
runqueue_t *rq = this_rq_lock();
- prio_array_t *array = current->array;
- prio_array_t *target = rq->expired;
- /*
- * We implement yielding by moving the task into the expired
- * queue.
- *
- * (special rule: RT tasks will just roundrobin in the active
- * array.)
- */
- if (unlikely(rt_task(current)))
- target = rq->active;
-
- dequeue_task(current, array);
- enqueue_task(current, target);
+ dequeue_task(current, rq);
+ current->slice = slice(current);
+ current->time_slice = current->slice;
+ current->prio = effective_prio(current);
+ enqueue_task(current, rq);
/*
* Since we are going to call schedule() anyway, there's
@@ -3125,7 +2833,7 @@ long sys_sched_rr_get_interval(pid_t pid
goto out_unlock;
jiffies_to_timespec(p->policy & SCHED_FIFO ?
- 0 : task_timeslice(p), &t);
+ 0 : slice(p), &t);
read_unlock(&tasklist_lock);
retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
out_nounlock:
@@ -3246,6 +2954,7 @@ void __devinit init_idle(task_t *idle, i
idle->array = NULL;
idle->prio = MAX_PRIO;
idle->state = TASK_RUNNING;
+ idle->deadline = 0;
set_task_cpu(idle, cpu);
double_rq_unlock(idle_rq, rq);
set_tsk_need_resched(idle);
@@ -3878,7 +3587,7 @@ int in_sched_functions(unsigned long add
void __init sched_init(void)
{
runqueue_t *rq;
- int i, j, k;
+	int i, j;
#ifdef CONFIG_SMP
/* Set up an initial dummy domain for early boot */
@@ -3899,13 +3608,10 @@ void __init sched_init(void)
#endif
for (i = 0; i < NR_CPUS; i++) {
prio_array_t *array;
rq = cpu_rq(i);
spin_lock_init(&rq->lock);
- rq->active = rq->arrays;
- rq->expired = rq->arrays + 1;
- rq->best_expired_prio = MAX_PRIO;
#ifdef CONFIG_SMP
rq->sd = &sched_domain_init;
@@ -3916,16 +3622,14 @@ void __init sched_init(void)
INIT_LIST_HEAD(&rq->migration_queue);
#endif
atomic_set(&rq->nr_iowait, 0);
+ array = &rq->array;
- for (j = 0; j < 2; j++) {
- array = rq->arrays + j;
- for (k = 0; k < MAX_PRIO; k++) {
- INIT_LIST_HEAD(array->queue + k);
- __clear_bit(k, array->bitmap);
- }
- // delimiter for bitsearch
- __set_bit(MAX_PRIO, array->bitmap);
+ for (j = 0; j <= MAX_PRIO; j++) {
+ INIT_LIST_HEAD(array->queue + j);
+ __clear_bit(j, array->bitmap);
}
+ /* delimiter for bitsearch */
+ __set_bit(MAX_PRIO, array->bitmap);
}
/*
* We have to do a little magic to get the first
diff -Naurp --exclude-from=dontdiff linux-2.6.7-rc2-base/kernel/sysctl.c linux-2.6.7-rc2-s6.3/kernel/sysctl.c
--- linux-2.6.7-rc2-base/kernel/sysctl.c 2004-05-31 21:29:24.000000000 +1000
+++ linux-2.6.7-rc2-s6.3/kernel/sysctl.c 2004-06-07 00:09:21.513441965 +1000
@@ -636,6 +636,22 @@ static ctl_table kern_table[] = {
.mode = 0444,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = KERN_INTERACTIVE,
+ .procname = "interactive",
+ .data = &interactive,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = KERN_COMPUTE,
+ .procname = "compute",
+ .data = &compute,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
From: Jan Killius @ 2004-06-06 16:59 UTC (permalink / raw)
To: linux-kernel

On Sunday 06 June 2004 17:39, you wrote:
> [full announcement quoted above]

The same patch modified to apply clean on 2.6.7-rc2-mm2.

-- 
Jan

[-- Attachment #2: patch-2.6.7-rc2-mm2-s6.3 --]
[-- Type: text/x-diff, Size: 34235 bytes --]
+2870,17 @@ out_unlock: /** * sys_sched_yield - yield the current processor to other threads. * - * this function yields the current CPU by moving the calling thread - * to the expired array. If there are no other threads running on this * CPU then this function will return. */ asmlinkage long sys_sched_yield(void) { runqueue_t *rq = this_rq_lock(); - prio_array_t *array = current->array; - prio_array_t *target = rq->expired; - schedstat_inc(rq, yld_cnt); - /* - * We implement yielding by moving the task into the expired - * queue. - * - * (special rule: RT tasks will just roundrobin in the active - * array.) - */ - if (unlikely(rt_task(current))) - target = rq->active; - - dequeue_task(current, array); - enqueue_task(current, target); + dequeue_task(current, rq); + current->slice = slice(current); + current->time_slice = current->slice; + current->prio = effective_prio(current); + enqueue_task(current, rq); /* * Since we are going to call schedule() anyway, there's @@ -3312,7 +3019,7 @@ long sys_sched_rr_get_interval(pid_t pid goto out_unlock; jiffies_to_timespec(p->policy & SCHED_FIFO ? - 0 : task_timeslice(p), &t); + 0 : slice(p), &t); read_unlock(&tasklist_lock); retval = copy_to_user(interval, &t, sizeof(t)) ? 
-EFAULT : 0; out_nounlock: @@ -3433,6 +3140,7 @@ void __devinit init_idle(task_t *idle, i idle->array = NULL; idle->prio = MAX_PRIO; idle->state = TASK_RUNNING; + idle->deadline = 0; set_task_cpu(idle, cpu); double_rq_unlock(idle_rq, rq); set_tsk_need_resched(idle); @@ -4068,7 +3776,7 @@ int in_sched_functions(unsigned long add void __init sched_init(void) { runqueue_t *rq; - int i, j, k; + int i, j; #ifdef CONFIG_SMP /* Set up an initial dummy domain for early boot */ @@ -4089,13 +3797,10 @@ void __init sched_init(void) #endif for (i = 0; i < NR_CPUS; i++) { - prio_array_t *array; + prio_array_t* array; rq = cpu_rq(i); spin_lock_init(&rq->lock); - rq->active = rq->arrays; - rq->expired = rq->arrays + 1; - rq->best_expired_prio = MAX_PRIO; #ifdef CONFIG_SMP rq->sd = &sched_domain_init; @@ -4106,16 +3811,14 @@ void __init sched_init(void) INIT_LIST_HEAD(&rq->migration_queue); #endif atomic_set(&rq->nr_iowait, 0); + array = &rq->array; - for (j = 0; j < 2; j++) { - array = rq->arrays + j; - for (k = 0; k < MAX_PRIO; k++) { - INIT_LIST_HEAD(array->queue + k); - __clear_bit(k, array->bitmap); - } - // delimiter for bitsearch - __set_bit(MAX_PRIO, array->bitmap); + for (j = 0; j <= MAX_PRIO; j++) { + INIT_LIST_HEAD(array->queue + j); + __clear_bit(j, array->bitmap); } + // delimiter for bitsearch + __set_bit(MAX_PRIO, array->bitmap); } /* * We have to do a little magic to get the first diff -Naurp linux-2.6.7-rc2-mm2/kernel/sysctl.c linux-2.6.7-rc2-mm2-s6.3/kernel/sysctl.c --- linux-2.6.7-rc2-mm2/kernel/sysctl.c 2004-06-06 18:51:52.000000000 +0200 +++ linux-2.6.7-rc2-mm2-s6.3/kernel/sysctl.c 2004-06-06 18:27:58.359409776 +0200 @@ -618,6 +618,22 @@ static ctl_table kern_table[] = { .mode = 0444, .proc_handler = &proc_dointvec, }, + { + .ctl_name = KERN_INTERACTIVE, + .procname = "interactive", + .data = &interactive, + .maxlen = sizeof (int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = KERN_COMPUTE, + .procname = "compute", + .data = 
&compute, + .maxlen = sizeof (int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, { .ctl_name = 0 } }; ^ permalink raw reply [flat|nested] 31+ messages in thread
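The two-level expiry in the scheduler_tick() hunk above — a per-tick time_slice (one RR_INTERVAL) nested inside a whole-slice budget, with a priority step down at each expiry — can be sketched in userspace. This is an illustrative Python model of the staircase descent, not the kernel code; the constants and the exact dec_deadline() semantics are assumptions:

```python
RR_INTERVAL = 10   # ticks per round-robin quantum (illustrative, not HZ-derived)
MAX_PRIO = 39      # worst (lowest) dynamic priority (illustrative)

class Task:
    """Loose model of a staircase task: a quantum (time_slice) nested
    inside a full slice budget (slice), as in the scheduler_tick() patch."""
    def __init__(self, prio, intervals):
        self.prio = prio
        self.intervals = intervals              # quanta per full slice
        self.slice = intervals * RR_INTERVAL
        self.time_slice = RR_INTERVAL

    def tick(self):
        """Account one timer tick; returns True when a requeue happens."""
        self.slice -= 1
        self.time_slice -= 1
        if self.slice == 0:
            # Whole slice used: descend a stair and refill both budgets,
            # loosely mirroring dec_deadline()/slice() in the patch.
            self.intervals = max(1, self.intervals - 1)
            self.slice = self.intervals * RR_INTERVAL
            self.time_slice = RR_INTERVAL
            self.prio = min(self.prio + 1, MAX_PRIO)
            return True
        if self.time_slice == 0:
            # Quantum used but slice remains: requeue one priority lower
            # with a fresh RR_INTERVAL quantum.
            self.time_slice = RR_INTERVAL
            self.prio = min(self.prio + 1, MAX_PRIO)
            return True
        return False

t = Task(prio=20, intervals=3)
reschedules = sum(t.tick() for _ in range(3 * RR_INTERVAL))
print(t.prio, reschedules)  # the task descended one stair per quantum
```

The point the model makes is that a cpu hog steps down the staircase every quantum, so it cannot starve tasks that wake at better priorities.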
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  2004-06-06 16:59 ` Jan Killius
@ 2004-06-06 17:40 ` Felipe Alfaro Solana
  2004-06-06 22:58   ` Con Kolivas
  2004-06-06 20:43 ` Prakash K. Cheemplavam
  ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Felipe Alfaro Solana @ 2004-06-06 17:40 UTC (permalink / raw)
To: Con Kolivas
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.

I'm impressed... I'm currently playing with linux-2.6.7-rc2-bk7 plus
staircase plus autoswappiness and my system behaves exceptionally. It
seems pretty responsive even when under heavy load (while true; do a=2;
done). Nice work.
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 17:40 ` Felipe Alfaro Solana
@ 2004-06-06 22:58 ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 22:58 UTC (permalink / raw)
To: Felipe Alfaro Solana
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 03:40, Felipe Alfaro Solana wrote:
> On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This is an update of the scheduler policy mechanism rewrite using the
> > infrastructure of the current O(1) scheduler. Slight changes from the
> > original design require a detailed description. The change to the
> > original design has enabled all known corner cases to be abolished and
> > cpu distribution to be much better maintained. It has proven to be stable
> > in my testing and is ready for more widespread public testing now.
>
> I'm impressed... I'm currently playing with linux-2.6.7-rc2-bk7 plus
> staircase plus autoswappiness and my system behaves exceptionally. It
> seems pretty responsive even when under heavy load (while true; do a=2;
> done). Nice work.

Sounds good. Thanks for testing.

Con
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  2004-06-06 16:59 ` Jan Killius
  2004-06-06 17:40 ` Felipe Alfaro Solana
@ 2004-06-06 20:43 ` Prakash K. Cheemplavam
  2004-06-06 22:55   ` Con Kolivas
  2004-06-06 20:59 ` Felipe Alfaro Solana
  ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Prakash K. Cheemplavam @ 2004-06-06 20:43 UTC (permalink / raw)
To: Con Kolivas
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Hi,

this is the first time I have tried this scheduler, with 2.6.7-rc2-mm2.
A k.o. criterion for me: playing ut2004 generates a lot of static
(sound-wise). I consider this a regression compared to O(1). Nick's
scheduler plays nice as well; for Nick's I have X reniced to -10. Your
scheduler doesn't like that either: when playing some tune via xmms and
then switching to another (virtual) desktop, I get pops in the sound for
fractions of a second. Putting X back to 0 fixes this. But I don't know
how to "fix" ut2004 with staircase. :-(

Hth,

Prakash
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 20:43 ` Prakash K. Cheemplavam
@ 2004-06-06 22:55 ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 22:55 UTC (permalink / raw)
To: Prakash K. Cheemplavam
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 06:43, Prakash K. Cheemplavam wrote:
> Hi,

Hi. Thanks for testing.

> it is the first time I tried this scheduler with 2.6.7-rc2-mm2. A k.o
> criteria for me: Playing ut2004 it generated a lot of statics (sound
> wise). I consider this a regression in contrast to O(1). Nick's
> scheduler plays nice as well. For Nick's I have X reniced to -10. Your
> scheduler doesn't like this, as well: When plaing some tune via xmms and
> then switching to another (virtual) desktop, I get pops in the sound for
> fractions of a second. Putting X back to 0 fixes this. But I don't know
> how to "fix" ut2004 with staircase. :-(

Yes, this is designed for X at nice == 0. Try the interactive = 0 setting
for gaming.

Con
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  ` (2 preceding siblings ...)
  2004-06-06 20:43 ` Prakash K. Cheemplavam
@ 2004-06-06 20:59 ` Felipe Alfaro Solana
  2004-06-06 22:57   ` Con Kolivas
  2004-06-06 23:47 ` Con Kolivas
  ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Felipe Alfaro Solana @ 2004-06-06 20:59 UTC (permalink / raw)
To: Con Kolivas
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.

It seems this staircase scheduler has some strange interactions with
networking and PM suspend. I can't suspend my laptop when any program is
using the network and, what's more, after the suspend mechanism fails,
the program that was using the network stays stuck in D state and can't
be killed.

I can trigger this with evolution 1.4.8 using an IMAP mail account, or
simply by launching Konqueror, pointing it at a URL and then trying to
suspend while Konqueror is retrieving the page from the Internet.

I'm open to suggestions.
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 20:59 ` Felipe Alfaro Solana
@ 2004-06-06 22:57 ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 22:57 UTC (permalink / raw)
To: Felipe Alfaro Solana
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 06:59, Felipe Alfaro Solana wrote:
> On Sun, 2004-06-06 at 17:39, Con Kolivas wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > This is an update of the scheduler policy mechanism rewrite using the
> > infrastructure of the current O(1) scheduler. Slight changes from the
> > original design require a detailed description. The change to the
> > original design has enabled all known corner cases to be abolished and
> > cpu distribution to be much better maintained. It has proven to be stable
> > in my testing and is ready for more widespread public testing now.
>
> It seems this staircase scheduler has some strange interactions with
> networking and PM suspend. I can't suspend my laptop when any program is
> using the network and, what's more, after the suspend mechanism fails,
> the program that was using the network stays stuck at D state and can't
> be killed.

Could well be the autoregulated vm swappiness, which triggers the
unable-to-suspend bug seen when swappiness=0. If not, this staircase
scheduler is quite a large change to the scheduler priority design and
there could be some weird interaction.

Con
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  ` (3 preceding siblings ...)
  2004-06-06 20:59 ` Felipe Alfaro Solana
@ 2004-06-06 23:47 ` Con Kolivas
  2004-06-07 13:56 ` William Lee Irwin III
  ` (2 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-06 23:47 UTC (permalink / raw)
To: Linux Kernel Mailinglist; +Cc: Zwane Mwaikambo, William Lee Irwin III

On Mon, 7 Jun 2004 01:39, Con Kolivas wrote:
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description.

Seems it wasn't clear unless you looked at the code: this has only a
single priority queue, with no expired array.

Con
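The single-queue design Con describes — one list per priority plus a bitmap, so the best runnable task is found with a first-set-bit scan rather than by swapping active/expired arrays — can be modelled in userspace. An illustrative Python sketch (not the kernel structures; names and sizes are assumptions):

```python
MAX_PRIO = 40   # number of priority levels (illustrative)

class Runqueue:
    """Toy model of the staircase runqueue: a single array of per-priority
    lists with a bitmap, so picking the next task is a first-set-bit scan,
    analogous to sched_find_first_bit() over rq->array.bitmap."""
    def __init__(self):
        self.queue = [[] for _ in range(MAX_PRIO)]
        self.bitmap = 0

    def enqueue(self, task, prio):
        self.queue[prio].append(task)
        self.bitmap |= 1 << prio          # mark this priority non-empty

    def dequeue(self, task, prio):
        self.queue[prio].remove(task)
        if not self.queue[prio]:
            self.bitmap &= ~(1 << prio)   # last task gone: clear the bit

    def pick_next(self):
        if not self.bitmap:
            return None                   # runqueue empty: run the idle task
        # Isolate the lowest set bit = numerically best priority.
        idx = (self.bitmap & -self.bitmap).bit_length() - 1
        return self.queue[idx][0]         # head of the best-priority list

rq = Runqueue()
rq.enqueue("editor", 17)
rq.enqueue("compiler", 25)
print(rq.pick_next())
```

Because priority changes happen at (de)queue time, there is no array switch and no expired-starvation bookkeeping; the pick stays O(1) in the number of tasks.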
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  ` (4 preceding siblings ...)
  2004-06-06 23:47 ` Con Kolivas
@ 2004-06-07 13:56 ` William Lee Irwin III
  2004-06-07 13:57   ` William Lee Irwin III
  2004-06-07 19:57 ` Phy Prabab
  2004-06-29 11:10 ` Pavel Machek
  7 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 13:56 UTC (permalink / raw)
To: Con Kolivas; +Cc: Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 01:39:26AM +1000, Con Kolivas wrote:
> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the original
> design has enabled all known corner cases to be abolished and cpu
> distribution to be much better maintained. It has proven to be stable in my
> testing and is ready for more widespread public testing now.

There is only one "array" per runqueue; the removed code should be at
most a BUG_ON() or WARN_ON(); I opted to delete it altogether.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 06:19:45.938605000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 06:53:21.555185000 -0700
@@ -1789,13 +1789,7 @@
 	cpustat->user += user_ticks;
 	cpustat->system += sys_ticks;

-	/* Task might have expired already, but not scheduled off yet */
-	if (p->array != &rq->array) {
-		set_tsk_need_resched(p);
-		goto out;
-	}
 	spin_lock(&rq->lock);
-
 	// SCHED_FIFO tasks never run out of timeslice.
 	if (unlikely(p->policy == SCHED_FIFO))
 		goto out_unlock;
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 13:56 ` William Lee Irwin III
@ 2004-06-07 13:57 ` William Lee Irwin III
  2004-06-07 14:04   ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 13:57 UTC (permalink / raw)
To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 06:56:31AM -0700, William Lee Irwin III wrote:
> There is only one "array" per runqueue; the removed code should be at
> most a BUG_ON() or WARN_ON(); I opted to delete it altogether.

JIFFIES_TO_NS() is unused.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 06:49:10.104411000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 06:50:39.987747000 -0700
@@ -74,7 +74,6 @@
  * Some helpers for converting nanosecond timing to jiffy resolution
  */
 #define NS_TO_JIFFIES(TIME)	((TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME)	((TIME) * (1000000000 / HZ))

 //This is the time all tasks within the same priority round robin.
 #define RR_INTERVAL		((10 * HZ / 1000) ? : 1)
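The RR_INTERVAL definition visible in the context of the hunk above converts 10 ms to jiffies using GCC's elvis operator (`x ? : 1`) so integer truncation can never yield a zero-length quantum. A quick Python check of that arithmetic (an illustrative model, not kernel code):

```python
def rr_interval(hz):
    """10 ms expressed in jiffies at a given HZ.  GCC's `x ? : 1`
    behaves like Python's `x or 1`: if the division truncates to 0
    (HZ below 100), fall back to a 1-jiffy quantum."""
    return (10 * hz // 1000) or 1

# At HZ=1000 a quantum is 10 jiffies; at HZ=100 it truncates to exactly 1;
# below HZ=100 the `or 1` guard prevents a zero-tick round-robin interval.
print([(hz, rr_interval(hz)) for hz in (50, 100, 250, 1000)])
```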
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 13:57 ` William Lee Irwin III
@ 2004-06-07 14:04 ` William Lee Irwin III
  2004-06-07 14:07   ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 14:04 UTC (permalink / raw)
To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 06:57:38AM -0700, William Lee Irwin III wrote:
> JIFFIES_TO_NS() is unused.

array->nr_active is only ever modified, never examined.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 06:53:26.779390000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 07:03:00.910109000 -0700
@@ -92,7 +92,6 @@
 typedef struct runqueue runqueue_t;

 struct prio_array {
-	unsigned int nr_active;
 	unsigned long bitmap[BITMAP_SIZE];
 	struct list_head queue[MAX_PRIO + 1];
 };
@@ -204,7 +203,6 @@
 static void dequeue_task(struct task_struct *p, runqueue_t *rq)
 {
 	prio_array_t* array = &rq->array;
-	array->nr_active--;
 	list_del(&p->run_list);
 	if (list_empty(array->queue + p->prio))
 		__clear_bit(p->prio, array->bitmap);
@@ -215,7 +213,6 @@
 	prio_array_t* array = &rq->array;
 	list_add_tail(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
-	array->nr_active++;
 	p->array = array;
 }
@@ -229,7 +226,6 @@
 	prio_array_t* array = &rq->array;
 	list_add(&p->run_list, array->queue + p->prio);
 	__set_bit(p->prio, array->bitmap);
-	array->nr_active++;
 	p->array = array;
 }
@@ -1095,7 +1091,6 @@
 			p->prio = current->prio;
 			list_add_tail(&p->run_list, &current->run_list);
 			p->array = current->array;
-			p->array->nr_active++;
 			rq->nr_running++;
 		}
 	} else {
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 14:04 ` William Lee Irwin III
@ 2004-06-07 14:07 ` William Lee Irwin III
  2004-06-07 15:00   ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 14:07 UTC (permalink / raw)
To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 07:04:45AM -0700, William Lee Irwin III wrote:
> array->nr_active only ever modified, never examined.

cpu_to_node_mask() is dead.

Index: kolivas-2.6.7-rc2/kernel/sched.c
===================================================================
--- kolivas-2.6.7-rc2.orig/kernel/sched.c	2004-06-07 07:03:00.910109000 -0700
+++ kolivas-2.6.7-rc2/kernel/sched.c	2004-06-07 07:06:28.072616000 -0700
@@ -46,12 +46,6 @@

 #include <asm/unistd.h>

-#ifdef CONFIG_NUMA
-#define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu))
-#else
-#define cpu_to_node_mask(cpu) (cpu_online_map)
-#endif
-
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 14:07 ` William Lee Irwin III
@ 2004-06-07 15:00 ` William Lee Irwin III
  2004-06-07 15:11   ` William Lee Irwin III
  0 siblings, 1 reply; 31+ messages in thread
From: William Lee Irwin III @ 2004-06-07 15:00 UTC (permalink / raw)
To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo

On Mon, Jun 07, 2004 at 07:04:45AM -0700, William Lee Irwin III wrote:
>> array->nr_active only ever modified, never examined.

On Mon, Jun 07, 2004 at 07:07:14AM -0700, William Lee Irwin III wrote:
> cpu_to_node_mask() is dead.

task->array is nothing more than a boolean flag. Shove it into
task->thread_info->flags, saving sizeof(prio_array_t *) from
sizeof(task_t). Compile-tested on sparc64 only.

Index: kolivas-2.6.7-rc2/include/asm-alpha/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-alpha/thread_info.h	2004-05-29 23:26:43.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-alpha/thread_info.h	2004-06-07 07:13:08.775700000 -0700
@@ -77,6 +77,7 @@
 #define TIF_UAC_NOPRINT		6	/* see sysinfo.h */
 #define TIF_UAC_NOFIX		7
 #define TIF_UAC_SIGBUS		8
+#define TIF_QUEUED		9	/* queued for scheduling */

 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-arm/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-arm/thread_info.h	2004-05-29 23:26:10.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-arm/thread_info.h	2004-06-07 07:14:01.315712000 -0700
@@ -123,6 +123,7 @@
  * TIF_NEED_RESCHED	- rescheduling necessary
  * TIF_USEDFPU		- FPU was used by this task this quantum (SMP)
  * TIF_POLLING_NRFLAG	- true if poll_idle() is polling TIF_NEED_RESCHED
+ * TIF_QUEUED		- true if on scheduler runqueue
  */
 #define TIF_NOTIFY_RESUME	0
 #define TIF_SIGPENDING		1
@@ -130,6 +131,7 @@
 #define TIF_SYSCALL_TRACE	8
 #define TIF_USED_FPU		16
 #define TIF_POLLING_NRFLAG	17
+#define TIF_QUEUED		18

 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-arm26/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-arm26/thread_info.h	2004-05-29 23:25:53.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-arm26/thread_info.h	2004-06-07 07:14:30.291307000 -0700
@@ -118,6 +118,7 @@
  * TIF_NEED_RESCHED	- rescheduling necessary
  * TIF_USEDFPU		- FPU was used by this task this quantum (SMP)
  * TIF_POLLING_NRFLAG	- true if poll_idle() is polling TIF_NEED_RESCHED
+ * TIF_QUEUED		- true if on scheduler runqueue
  */
 #define TIF_NOTIFY_RESUME	0
 #define TIF_SIGPENDING		1
@@ -125,6 +126,7 @@
 #define TIF_SYSCALL_TRACE	8
 #define TIF_USED_FPU		16
 #define TIF_POLLING_NRFLAG	17
+#define TIF_QUEUED		18

 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-cris/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-cris/thread_info.h	2004-05-29 23:26:48.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-cris/thread_info.h	2004-06-07 07:14:57.512169000 -0700
@@ -85,6 +85,7 @@
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */

 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-h8300/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-h8300/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-h8300/thread_info.h	2004-06-07 07:16:00.653570000 -0700
@@ -93,6 +93,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */

 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-i386/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-i386/thread_info.h	2004-05-29 23:25:45.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-i386/thread_info.h	2004-06-07 07:16:15.755274000 -0700
@@ -145,6 +145,7 @@
 #define TIF_IRET		5	/* return with iret */
 #define TIF_SYSCALL_AUDIT	7	/* syscall auditing active */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */

 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ia64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ia64/thread_info.h	2004-05-29 23:26:49.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ia64/thread_info.h	2004-06-07 07:16:53.195583000 -0700
@@ -74,6 +74,7 @@
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
 #define TIF_SYSCALL_TRACE	3	/* syscall trace active */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		17	/* true if on scheduler runqueue */

 #define TIF_WORK_MASK	0x7	/* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE */
 #define TIF_ALLWORK_MASK	0xf	/* bits 0..3 are "work to do on user-return" bits */
Index: kolivas-2.6.7-rc2/include/asm-m68k/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-m68k/thread_info.h	2004-05-29 23:26:19.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-m68k/thread_info.h	2004-06-07 07:23:29.772294000 -0700
@@ -11,7 +11,7 @@
 	__s32 preempt_count;	/* 0 => preemptable, <0 => BUG */
 	__u32 cpu;		/* should always be 0 on m68k */
 	struct restart_block restart_block;
-
+	unsigned char queued;
 	__u8 supervisor_stack[0];
 };
@@ -48,6 +48,7 @@
 #define TIF_NOTIFY_RESUME	2	/* resumption notification requested */
 #define TIF_SIGPENDING		3	/* signal pending */
 #define TIF_NEED_RESCHED	4	/* rescheduling necessary */
+#define TIF_QUEUED		5	/* true if on scheduler runqueue */

 extern int thread_flag_fixme(void);
@@ -67,6 +68,9 @@
 	case TIF_SYSCALL_TRACE:					\
 		tsk->thread.work.syscall_trace = val;		\
 		break;						\
+	case TIF_QUEUED:					\
+		(tsk)->thread_info.queued = 1;			\
+		break;						\
 	default:						\
 		thread_flag_fixme();				\
 	}							\
@@ -84,6 +88,9 @@
 	case TIF_SYSCALL_TRACE:					\
 		___res = tsk->thread.work.syscall_trace;	\
 		break;						\
+	case TIF_QUEUED:					\
+		___res = tsk->thread_info.queued;		\
+		break;						\
 	default:						\
 		___res = thread_flag_fixme();			\
 	}							\
Index: kolivas-2.6.7-rc2/include/asm-m68knommu/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-m68knommu/thread_info.h	2004-05-29 23:26:09.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-m68knommu/thread_info.h	2004-06-07 07:24:30.203107000 -0700
@@ -91,6 +91,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		16	/* true if on scheduler runqueue */

 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-mips/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-mips/thread_info.h	2004-05-29 23:25:40.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-mips/thread_info.h	2004-06-07 07:25:03.044114000 -0700
@@ -113,6 +113,7 @@
 #define TIF_SYSCALL_AUDIT	4	/* syscall auditing active */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		18	/* true if on scheduler runqueue */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */

 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-parisc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-parisc/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-parisc/thread_info.h	2004-06-07 07:25:26.559539000 -0700
@@ -60,6 +60,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_32BIT		5	/* 32 bit binary */
+#define TIF_QUEUED		6	/* true if on scheduler runqueue */

 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ppc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ppc/thread_info.h	2004-05-29 23:25:52.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ppc/thread_info.h	2004-06-07 07:26:16.802901000 -0700
@@ -86,6 +86,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		5	/* true if on scheduler runqueue */

 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-ppc64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-ppc64/thread_info.h	2004-05-29 23:25:45.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-ppc64/thread_info.h	2004-06-07 07:26:34.180260000 -0700
@@ -94,6 +94,7 @@
 #define TIF_RUN_LIGHT		6	/* iSeries run light */
 #define TIF_ABI_PENDING		7	/* 32/64 bit switch needed */
 #define TIF_SYSCALL_AUDIT	8	/* syscall auditing active */
+#define TIF_QUEUED		9	/* true if on scheduler runqueue */

 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-s390/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-s390/thread_info.h	2004-05-29 23:26:19.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-s390/thread_info.h	2004-06-07 07:27:50.350680000 -0700
@@ -89,6 +89,7 @@
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling
					   TIF_NEED_RESCHED */
 #define TIF_31BIT		18	/* 32bit process */
+#define TIF_QUEUED		19	/* true if on scheduler runqueue */

 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-sh/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sh/thread_info.h	2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sh/thread_info.h	2004-06-07 07:28:08.307950000 -0700
@@ -93,6 +93,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */
+#define TIF_QUEUED		18	/* true if on scheduler runqueue */
 #define TIF_USERSPACE		31	/* true if FS sets userspace */

 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-sparc/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sparc/thread_info.h	2004-05-29 23:26:43.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sparc/thread_info.h	2004-06-07 07:29:22.011745000 -0700
@@ -137,6 +137,7 @@
					 * this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	9	/* true if poll_idle() is polling
					 * TIF_NEED_RESCHED */
+#define TIF_QUEUED		10	/* true if on scheduler runqueue */

 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-sparc64/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-sparc64/thread_info.h	2004-05-29 23:26:26.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-sparc64/thread_info.h	2004-06-07 07:30:54.506684000 -0700
@@ -228,6 +228,7 @@
  * an immediate value in instructions such as andcc.
  */
 #define TIF_ABI_PENDING		12
+#define TIF_QUEUED		13	/* true if on scheduler runqueue */

 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
Index: kolivas-2.6.7-rc2/include/asm-um/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-um/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-um/thread_info.h	2004-06-07 07:35:20.439256000 -0700
@@ -65,6 +65,7 @@
 #define TIF_POLLING_NRFLAG	3	/* true if poll_idle() is polling
					 * TIF_NEED_RESCHED */
+#define TIF_QUEUED		4	/* true if on scheduler runqueue */

 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
Index: kolivas-2.6.7-rc2/include/asm-v850/thread_info.h
===================================================================
--- kolivas-2.6.7-rc2.orig/include/asm-v850/thread_info.h	2004-05-29 23:26:03.000000000 -0700
+++ kolivas-2.6.7-rc2/include/asm-v850/thread_info.h	2004-06-07 07:36:38.430400000 -0700
@@ -83,6 +83,7 @@
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_POLLING_NRFLAG	4	/* true if poll_idle() is polling
					   TIF_NEED_RESCHED */
+#define TIF_QUEUED		5	/* true if on scheduler runqueue */

 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
Index: kolivas-2.6.7-rc2/include/asm-x86_64/thread_info.h
===================================================================
---
kolivas-2.6.7-rc2.orig/include/asm-x86_64/thread_info.h 2004-05-29 23:26:26.000000000 -0700 +++ kolivas-2.6.7-rc2/include/asm-x86_64/thread_info.h 2004-06-07 07:37:09.234717000 -0700 @@ -106,6 +106,7 @@ #define TIF_IA32 17 /* 32bit process */ #define TIF_FORK 18 /* ret_from_fork */ #define TIF_ABI_PENDING 19 +#define TIF_QUEUED 20 /* true if on scheduler runqueue */ #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME) Index: kolivas-2.6.7-rc2/include/linux/sched.h =================================================================== --- kolivas-2.6.7-rc2.orig/include/linux/sched.h 2004-06-07 06:19:45.891612000 -0700 +++ kolivas-2.6.7-rc2/include/linux/sched.h 2004-06-07 07:37:49.068661000 -0700 @@ -325,7 +325,6 @@ extern struct user_struct root_user; #define INIT_USER (&root_user) -typedef struct prio_array prio_array_t; struct backing_dev_info; struct reclaim_state; @@ -392,8 +391,6 @@ int prio, static_prio; struct list_head run_list; - prio_array_t *array; - unsigned long long timestamp; unsigned long runtime, totalrun; unsigned int deadline; Index: kolivas-2.6.7-rc2/kernel/sched.c =================================================================== --- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 07:06:28.072616000 -0700 +++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 07:59:22.711997000 -0700 @@ -85,10 +85,10 @@ typedef struct runqueue runqueue_t; -struct prio_array { +typedef struct prio_array { unsigned long bitmap[BITMAP_SIZE]; struct list_head queue[MAX_PRIO + 1]; -}; +} prio_array_t; /* * This is the main, per-CPU runqueue data structure. 
@@ -191,6 +191,21 @@ spin_unlock_irq(&rq->lock); } +static inline int task_queued(task_t *task) +{ + return test_ti_thread_flag(task->thread_info, TIF_QUEUED); +} + +static inline void set_task_queued(task_t *task) +{ + set_ti_thread_flag(task->thread_info, TIF_QUEUED); +} + +static inline void clear_task_queued(task_t *task) +{ + clear_ti_thread_flag(task->thread_info, TIF_QUEUED); +} + /* * Adding/removing a task to/from a priority array: */ @@ -207,7 +222,7 @@ prio_array_t* array = &rq->array; list_add_tail(&p->run_list, array->queue + p->prio); __set_bit(p->prio, array->bitmap); - p->array = array; + set_task_queued(p); } /* @@ -220,7 +235,7 @@ prio_array_t* array = &rq->array; list_add(&p->run_list, array->queue + p->prio); __set_bit(p->prio, array->bitmap); - p->array = array; + set_task_queued(p); } /* @@ -369,7 +384,7 @@ p->deadline--; } dequeue_task(p, rq); - p->array = NULL; + clear_task_queued(p); } /* @@ -442,7 +457,7 @@ * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!task_queued(p) && !task_running(rq, p)) { set_task_cpu(p, dest_cpu); return 0; } @@ -473,7 +488,7 @@ repeat: rq = task_rq_lock(p, &flags); /* Must be off runqueue entirely, not preempted. */ - if (unlikely(p->array)) { + if (unlikely(task_queued(p))) { /* If it's preempted, we yield. It could be a while. 
*/ preempted = !task_running(rq, p); task_rq_unlock(rq, &flags); @@ -603,7 +618,7 @@ if (!(old_state & state)) goto out; - if (p->array) + if (task_queued(p)) goto out_running; cpu = task_cpu(p); @@ -663,7 +678,7 @@ old_state = p->state; if (!(old_state & state)) goto out; - if (p->array) + if (task_queued(p)) goto out_running; this_cpu = smp_processor_id(); @@ -725,7 +740,7 @@ */ p->state = TASK_RUNNING; INIT_LIST_HEAD(&p->run_list); - p->array = NULL; + clear_task_queued(p); spin_lock_init(&p->switch_lock); #ifdef CONFIG_PREEMPT /* @@ -1079,12 +1094,12 @@ set_task_cpu(p, cpu); if (cpu == this_cpu) { - if (unlikely(!current->array)) + if (unlikely(!task_queued(current))) __activate_task(p, rq); else { p->prio = current->prio; list_add_tail(&p->run_list, &current->run_list); - p->array = current->array; + set_task_queued(p); rq->nr_running++; } } else { @@ -2232,9 +2247,8 @@ void set_user_nice(task_t *p, long nice) { unsigned long flags; - prio_array_t* array; runqueue_t *rq; - int old_prio, new_prio, delta; + int queued, old_prio, new_prio, delta; if (TASK_NICE(p) == nice || nice < -20 || nice > 19) return; @@ -2253,8 +2267,7 @@ p->static_prio = NICE_TO_PRIO(nice); goto out_unlock; } - array = p->array; - if (array) + if ((queued = task_queued(p))) dequeue_task(p, rq); old_prio = p->prio; @@ -2263,7 +2276,7 @@ p->static_prio = NICE_TO_PRIO(nice); p->prio += delta; - if (array) { + if (queued) { enqueue_task(p, rq); /* * If the task increased its priority or is running and @@ -2369,7 +2382,7 @@ /* Actually do priority change: must hold rq lock.
*/ static void __setscheduler(struct task_struct *p, int policy, int prio) { - BUG_ON(p->array); + BUG_ON(task_queued(p)); p->policy = policy; p->rt_priority = prio; if (policy != SCHED_NORMAL) @@ -2385,8 +2398,7 @@ { struct sched_param lp; int retval = -EINVAL; - int oldprio; - prio_array_t* array; + int queued, oldprio; unsigned long flags; runqueue_t *rq; task_t *p; @@ -2446,13 +2458,12 @@ if (retval) goto out_unlock; - array = p->array; - if (array) + if ((queued = task_queued(p))) deactivate_task(p, task_rq(p)); retval = 0; oldprio = p->prio; __setscheduler(p, policy, lp.sched_priority); - if (array) { + if (queued) { __activate_task(p, task_rq(p)); /* * Reschedule if we are currently running on this runqueue and @@ -2922,7 +2933,7 @@ idle_rq->curr = idle_rq->idle = idle; deactivate_task(idle, rq); - idle->array = NULL; + clear_task_queued(idle); idle->prio = MAX_PRIO; idle->state = TASK_RUNNING; idle->deadline = 0; @@ -3034,7 +3045,7 @@ goto out; set_task_cpu(p, dest_cpu); - if (p->array) { + if (task_queued(p)) { /* * Sync timestamp with rq_dest's before activating. * The same thing could be achieved by doing this step ^ permalink raw reply [flat|nested] 31+ messages in thread
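The patch above replaces the task->array pointer with a single TIF_QUEUED bit in thread_info->flags, saving a pointer per task_struct. The underlying pattern — a boolean packed into an existing per-thread flags word behind test/set/clear helpers — can be sketched in plain C. This is a hypothetical userspace illustration only: the kernel's set_ti_thread_flag()/test_ti_thread_flag() are built on atomic bitops, which this sketch does not attempt to model.

```c
/* Sketch of the TIF_QUEUED technique: a boolean stored as one bit of an
 * existing per-thread flags word, so no extra struct member is needed.
 * Names mirror the patch, but this is a non-atomic userspace model. */

#define TIF_QUEUED 5  /* bit number, as in several of the patched headers */

struct thread_info {
    unsigned long flags;
};

static inline int task_queued(const struct thread_info *ti)
{
    return (ti->flags >> TIF_QUEUED) & 1UL;
}

static inline void set_task_queued(struct thread_info *ti)
{
    ti->flags |= 1UL << TIF_QUEUED;
}

static inline void clear_task_queued(struct thread_info *ti)
{
    ti->flags &= ~(1UL << TIF_QUEUED);
}
```

The space win is exactly what wli states below: sizeof(prio_array_t *) shaved off sizeof(task_t), at the cost of one bit in a word the task already carries.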
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 2004-06-07 15:00 ` William Lee Irwin III @ 2004-06-07 15:11 ` William Lee Irwin III 0 siblings, 0 replies; 31+ messages in thread From: William Lee Irwin III @ 2004-06-07 15:11 UTC (permalink / raw) To: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo On Mon, Jun 07, 2004 at 07:07:14AM -0700, William Lee Irwin III wrote: >> cpu_to_node_mask() is dead. On Mon, Jun 07, 2004 at 08:00:37AM -0700, William Lee Irwin III wrote: > task->array is nothing more than a boolean flag. Shove it into > task->thread_info->flags, saving sizeof(prio_array_t *) from sizeof(task_t). > Compiletested only on sparc64 only. prio_array_t is no longer a useful abstraction. Compile-tested on sparc64 only. Index: kolivas-2.6.7-rc2/kernel/sched.c =================================================================== --- kolivas-2.6.7-rc2.orig/kernel/sched.c 2004-06-07 07:59:22.711997000 -0700 +++ kolivas-2.6.7-rc2/kernel/sched.c 2004-06-07 08:07:09.705003000 -0700 @@ -81,15 +81,8 @@ * These are the runqueue data structures: */ -#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long)) - typedef struct runqueue runqueue_t; -typedef struct prio_array { - unsigned long bitmap[BITMAP_SIZE]; - struct list_head queue[MAX_PRIO + 1]; -} prio_array_t; - /* * This is the main, per-CPU runqueue data structure.
* @@ -113,7 +106,8 @@ unsigned long long timestamp_last_tick; task_t *curr, *idle; struct mm_struct *prev_mm; - prio_array_t array; + unsigned long bitmap[BITS_TO_LONGS(MAX_PRIO+1)]; + struct list_head queue[MAX_PRIO + 1]; atomic_t nr_iowait; #ifdef CONFIG_SMP @@ -207,21 +201,19 @@ } /* - * Adding/removing a task to/from a priority array: + * Adding/removing a task to/from a runqueue: */ static void dequeue_task(struct task_struct *p, runqueue_t *rq) { - prio_array_t* array = &rq->array; list_del(&p->run_list); - if (list_empty(array->queue + p->prio)) - __clear_bit(p->prio, array->bitmap); + if (list_empty(rq->queue + p->prio)) + __clear_bit(p->prio, rq->bitmap); } static void enqueue_task(struct task_struct *p, runqueue_t *rq) { - prio_array_t* array = &rq->array; - list_add_tail(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); + list_add_tail(&p->run_list, rq->queue + p->prio); + __set_bit(p->prio, rq->bitmap); set_task_queued(p); } @@ -232,9 +224,8 @@ */ static inline void enqueue_task_head(struct task_struct *p, runqueue_t *rq) { - prio_array_t* array = &rq->array; - list_add(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); + list_add(&p->run_list, rq->queue + p->prio); + __set_bit(p->prio, rq->bitmap); set_task_queued(p); } @@ -1257,27 +1248,24 @@ unsigned long max_nr_move, struct sched_domain *sd, enum idle_type idle) { - prio_array_t* array, *dst_array; struct list_head *head, *curr; int idx, pulled = 0; task_t *tmp; if (max_nr_move <= 0 || busiest->nr_running <= 1) goto out; - array = &busiest->array; - dst_array = &this_rq->array; /* Start searching at priority 0: */ idx = 0; skip_bitmap: if (!idx) - idx = sched_find_first_bit(array->bitmap); + idx = sched_find_first_bit(busiest->bitmap); else - idx = find_next_bit(array->bitmap, MAX_PRIO, idx); + idx = find_next_bit(busiest->bitmap, MAX_PRIO, idx); if (idx >= MAX_PRIO) goto out; - head = array->queue + idx; + head = busiest->queue + idx; curr = 
head->prev; skip_queue: tmp = list_entry(curr, task_t, run_list); @@ -1919,7 +1907,6 @@ long *switch_count; task_t *prev, *next; runqueue_t *rq; - prio_array_t* array; struct list_head *queue; unsigned long long now; int cpu, idx; @@ -1971,9 +1958,8 @@ } } - array = &rq->array; - idx = sched_find_first_bit(array->bitmap); - queue = array->queue + idx; + idx = sched_find_first_bit(rq->bitmap); + queue = rq->queue + idx; next = list_entry(queue->next, task_t, run_list); if (dependent_sleeper(cpu, rq, next)) @@ -3590,8 +3576,6 @@ #endif for (i = 0; i < NR_CPUS; i++) { - prio_array_t* array; - rq = cpu_rq(i); spin_lock_init(&rq->lock); @@ -3604,14 +3588,11 @@ INIT_LIST_HEAD(&rq->migration_queue); #endif atomic_set(&rq->nr_iowait, 0); - array = &rq->array; - - for (j = 0; j <= MAX_PRIO; j++) { - INIT_LIST_HEAD(array->queue + j); - __clear_bit(j, array->bitmap); - } + for (j = 0; j <= MAX_PRIO; j++) + INIT_LIST_HEAD(&rq->queue[j]); + memset(rq->bitmap, 0, BITS_TO_LONGS(MAX_PRIO+1)*sizeof(long)); // delimiter for bitsearch - __set_bit(MAX_PRIO, array->bitmap); + __set_bit(MAX_PRIO, rq->bitmap); } /* * We have to do a little magic to get the first ^ permalink raw reply [flat|nested] 31+ messages in thread
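This second patch folds the priority array straight into the runqueue: one list head per priority plus a bitmap whose set bits mark non-empty levels, so picking the next task is a find-first-bit over the bitmap. A minimal userspace model of that lookup is below; the naive linear scan stands in for the kernel's optimized sched_find_first_bit(), and the sentinel bit at MAX_PRIO plays the same "delimiter for bitsearch" role as in sched_init() above. The function names are invented for illustration.

```c
/* Minimal model of the O(1) pick-next path: a bitmap with one bit per
 * priority level; a set bit means that level's queue is non-empty. */
#include <limits.h>

#define MAX_PRIO 140
#define BITS_PER_LONG ((int)(CHAR_BIT * sizeof(unsigned long)))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static unsigned long bitmap[BITS_TO_LONGS(MAX_PRIO + 1)];

static void mark_queued(int prio)
{
    bitmap[prio / BITS_PER_LONG] |= 1UL << (prio % BITS_PER_LONG);
}

static void mark_empty(int prio)
{
    bitmap[prio / BITS_PER_LONG] &= ~(1UL << (prio % BITS_PER_LONG));
}

/* Return the lowest set bit index, i.e. the best non-empty priority.
 * Naive scan in place of the kernel's sched_find_first_bit(). */
static int find_first_bit_naive(void)
{
    int i;
    for (i = 0; i <= MAX_PRIO; i++)
        if (bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return MAX_PRIO + 1;
}
```

Because the delimiter bit at MAX_PRIO is always set, the real search never has to handle an all-empty runqueue as a special case.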
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas ` (5 preceding siblings ...) 2004-06-07 13:56 ` William Lee Irwin III @ 2004-06-07 19:57 ` Phy Prabab 2004-06-07 21:12 ` Con Kolivas 2004-06-29 11:10 ` Pavel Machek 7 siblings, 1 reply; 31+ messages in thread From: Phy Prabab @ 2004-06-07 19:57 UTC (permalink / raw) To: Con Kolivas, Linux Kernel Mailinglist Cc: Zwane Mwaikambo, William Lee Irwin III FYI: I have had a chance to test this patch. I have a make system that has presented 2.6 kernels with problems, so I am using this as the test. Observations show that turning off interactive is much more deterministic: 2.6.7-rc2-bk8-s63: echo 0 > /proc/sys/kernel/interactive A: 35.57user 38.18system 1:20.28elapsed 91%CPU B: 35.54user 38.40system 1:19.48elapsed 93%CPU C: 35.48user 38.28system 1:20.94elapsed 91%CPU 2.6.7-rc2-bk8-s63: A: 35.32user 38.51system 1:26.47elapsed 85%CPU B: 35.43user 38.35system 1:20.79elapsed 91%CPU C: 35.61user 38.23system 1:25.00elapsed 86%CPU However, something is still slower than the 2.4.x kernels: 2.4.23: A: 28.32user 29.51system 1:01.17elapsed 93%CPU B: 28.54user 29.40system 1:01.48elapsed 92%CPU C: 28.23user 28.80system 1:00.21elapsed 94%CPU Nice work, as I can now turn off some functionality within the kernel that is causing me some slowdown. Thanks! Phy --- Con Kolivas <kernel@kolivas.org> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > This is an update of the scheduler policy mechanism > rewrite using the > infrastructure of the current O(1) scheduler. Slight > changes from the > original design require a detailed description. The > change to the original > design has enabled all known corner cases to be > abolished and cpu > distribution to be much better maintained. It has > proven to be stable in my > testing and is ready for more widespread public > testing now.
> > > Aims: > - Interactive by design rather than have > interactivity bolted on. > - Good scalability. > - Simple predictable design. > - Maintain appropriate cpu distribution and > fairness. > - Low scheduling latency for normal policy tasks. > - Low overhead. > - Making renicing processes actually matter for CPU > distribution (nice 0 gets > 20 times what nice +20 gets) > - Resistant to priority inversion > - More forgiving of poor locking > - Tunable for a server workload or computational > tasks > > > Description: > - All tasks start at a dynamic priority based on > their nice value. They run > for one RR_INTERVAL (nominally set to 10ms) and then > drop one stair > (priority). If they sleep before running again they > get to run for 2 > intervals before being demoted a priority and so on > until they get all their > intervals at their best priority: 20 intervals for > nice 0; 1 interval for > nice +19. > > > - - The staircase elevation mechanism can be > disabled and all tasks can simply > descend stairs using the sysctl: > > echo 0 > /proc/sys/kernel/interactive > > this has the effect of maintaining cpu distribution > much more strictly > according to nice whereas the default mechanism > allows bursts of cpu by > interactive tasks before settling to appropriate cpu > distribution. > > > - - The time tasks are cpu bound can be increased by > using the sysctl: > > echo 1 > /proc/sys/kernel/compute > > which extends the RR_INTERVAL to 100ms and disables > the staircase elevation > improving conditions for pure computational tasks by > optimising cache > benefits and decreasing context switching (gains > another 1.5% on kernbench). > > > Performance: > - - All cpu throughput benchmarks show equivalent or > better performance than > mainline. Note that disabling the interactive > setting actually _worsens_ some > benchmarks because of their dependence on yield() so > I don't recommend > disabling it unless you do a comparison first. 
> - - Interactivity is approximately equivalent to > mainline 2.6 but with faster > application startup and no known behavioural quirks. > > > Comments and testing welcome. > > fs/proc/array.c | 2 > include/linux/sched.h | 8 > include/linux/sysctl.h | 2 > kernel/sched.c | 676 > +++++++++++++------------------------------------ > kernel/sysctl.c | 16 + > 5 files changed, 212 insertions(+), 492 > deletions(-) > > Can be downloaded here: > http://ck.kolivas.org/patches/2.6/2.6.7-rc2/patch-2.6.7-rc2-s6.3 > > and below > Con > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.2.4 (GNU/Linux) > > iD8DBQFAwzquZUg7+tp6mRURArBHAJ9SIBOKX6MYOdkJzdb+xRNnW82JQgCghLou > wXhrRsBsfY2BIqbLT1tUWcs= > =g1no > -----END PGP SIGNATURE----- > > diff -Naurp --exclude-from=dontdiff > linux-2.6.7-rc2-base/fs/proc/array.c > linux-2.6.7-rc2-s6.3/fs/proc/array.c > --- linux-2.6.7-rc2-base/fs/proc/array.c 2004-05-31 > 21:29:15.000000000 +1000 > +++ linux-2.6.7-rc2-s6.3/fs/proc/array.c 2004-06-07 > 00:03:56.959133536 +1000 > @@ -155,7 +155,6 @@ static inline char * > task_state(struct t > read_lock(&tasklist_lock); > buffer += sprintf(buffer, > "State:\t%s\n" > - "SleepAVG:\t%lu%%\n" > "Tgid:\t%d\n" > "Pid:\t%d\n" > "PPid:\t%d\n" > @@ -163,7 +162,6 @@ static inline char * > task_state(struct t > "Uid:\t%d\t%d\t%d\t%d\n" > "Gid:\t%d\t%d\t%d\t%d\n", > get_task_state(p), > - (p->sleep_avg/1024)*100/(1020000000/1024), > p->tgid, > p->pid, p->pid ? p->real_parent->pid : 0, > p->pid && p->ptrace ? 
p->parent->pid : 0, > diff -Naurp --exclude-from=dontdiff > linux-2.6.7-rc2-base/include/linux/sched.h > linux-2.6.7-rc2-s6.3/include/linux/sched.h > --- linux-2.6.7-rc2-base/include/linux/sched.h > 2004-05-31 21:29:21.000000000 +1000 > +++ linux-2.6.7-rc2-s6.3/include/linux/sched.h > 2004-06-07 00:10:37.450576503 +1000 > @@ -164,6 +164,7 @@ extern void show_stack(struct > task_struc > > void io_schedule(void); > long io_schedule_timeout(long timeout); > +extern int interactive, compute; > > extern void cpu_init (void); > extern void trap_init(void); > @@ -394,14 +395,13 @@ struct task_struct { > struct list_head run_list; > prio_array_t *array; > > - unsigned long sleep_avg; > - long interactive_credit; > unsigned long long timestamp; > - int activated; > + unsigned long runtime, totalrun; > + unsigned int deadline; > > unsigned long policy; > cpumask_t cpus_allowed; > - unsigned int time_slice, first_time_slice; > + unsigned int slice, time_slice, first_time_slice; > > struct list_head tasks; > struct list_head ptrace_children; > diff -Naurp --exclude-from=dontdiff > linux-2.6.7-rc2-base/include/linux/sysctl.h > linux-2.6.7-rc2-s6.3/include/linux/sysctl.h > --- linux-2.6.7-rc2-base/include/linux/sysctl.h > 2004-05-31 21:29:21.000000000 +1000 > +++ linux-2.6.7-rc2-s6.3/include/linux/sysctl.h > 2004-06-07 00:06:13.007895851 +1000 > @@ -133,6 +133,8 @@ enum > KERN_NGROUPS_MAX=63, /* int: NGROUPS_MAX */ > KERN_SPARC_SCONS_PWROFF=64, /* int: serial console > power-off halt */ > KERN_HZ_TIMER=65, /* int: hz timer on or off */ > + KERN_INTERACTIVE=66, /* interactive tasks to have > cpu bursts */ > + KERN_COMPUTE=67, /* adjust timeslices for a > compute server */ > }; > > > diff -Naurp --exclude-from=dontdiff > linux-2.6.7-rc2-base/kernel/sched.c > linux-2.6.7-rc2-s6.3/kernel/sched.c > --- linux-2.6.7-rc2-base/kernel/sched.c 2004-05-31 > 21:29:24.000000000 +1000 > +++ linux-2.6.7-rc2-s6.3/kernel/sched.c 2004-06-07 > 00:36:43.382396246 +1000 > @@ -16,7 +16,10 @@ > * by 
Davide Libenzi, preemptible kernel bits by > Robert Love. > * 2003-09-03 Interactivity tuning by Con Kolivas. > * 2004-04-02 Scheduler domains code by Nick > Piggin > - */ > + * 2004-03-19. New staircase scheduling policy by > Con Kolivas with help > + * from Zwane Mwaikambo and useful suggestions by > + * William Lee Irwin III. > +*/ > > #include <linux/mm.h> > #include <linux/module.h> > @@ -66,8 +69,6 @@ > #define USER_PRIO(p) ((p)-MAX_RT_PRIO) > #define TASK_USER_PRIO(p) > USER_PRIO((p)->static_prio) > #define MAX_USER_PRIO (USER_PRIO(MAX_PRIO)) > -#define AVG_TIMESLICE (MIN_TIMESLICE + > ((MAX_TIMESLICE - MIN_TIMESLICE) *\ > - (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - > 1))) > > /* > * Some helpers for converting nanosecond timing to > jiffy resolution > @@ -75,110 +76,18 @@ > #define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / > HZ)) > #define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / > HZ)) > > -/* > - * These are the 'tuning knobs' of the scheduler: > - * > - * Minimum timeslice is 10 msecs, default timeslice > is 100 msecs, > - * maximum timeslice is 200 msecs. Timeslices get > refilled after > - * they expire. > - */ > -#define MIN_TIMESLICE ( 10 * HZ / 1000) > -#define MAX_TIMESLICE (200 * HZ / 1000) > -#define ON_RUNQUEUE_WEIGHT 30 > -#define CHILD_PENALTY 95 > -#define PARENT_PENALTY 100 > -#define EXIT_WEIGHT 3 > -#define PRIO_BONUS_RATIO 25 > -#define MAX_BONUS (MAX_USER_PRIO * > PRIO_BONUS_RATIO / 100) > -#define INTERACTIVE_DELTA 2 > -#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS) > -#define STARVATION_LIMIT (MAX_SLEEP_AVG) > -#define NS_MAX_SLEEP_AVG > (JIFFIES_TO_NS(MAX_SLEEP_AVG)) > -#define CREDIT_LIMIT 100 > - > -/* > - * If a task is 'interactive' then we reinsert it > in the active > - * array after it has expired its current > timeslice. (it will not > - * continue to run immediately, it will still > roundrobin with > - * other interactive tasks.) > - * > - * This part scales the interactivity limit > depending on niceness. 
> - * > - * We scale it linearly, offset by the > INTERACTIVE_DELTA delta. > - * Here are a few examples of different nice > levels: > - * > - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] > - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] > - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] > - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] > - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] > - * > - * (the X axis represents the possible -5 ... 0 ... > +5 dynamic > - * priority range a task can explore, a value of > '1' means the > - * task is rated interactive.) > - * > - * Ie. nice +19 tasks can never get 'interactive' > enough to be > - * reinserted into the active array. And only > heavily CPU-hog nice -20 > - * tasks will be expired. Default nice 0 tasks are > somewhere between, > - * it takes some effort for them to get > interactive, but it's not > - * too hard. > - */ > - > -#define CURRENT_BONUS(p) \ > - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ > - MAX_SLEEP_AVG) > - > -#ifdef CONFIG_SMP > -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ > - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - > 1)) * \ > - num_online_cpus()) > -#else > -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ > - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - > 1))) > -#endif > - > -#define SCALE(v1,v1_max,v2_max) \ > - (v1) * (v2_max) / (v1_max) > === message truncated === __________________________________ Do you Yahoo!? Friends. Fun. Try the all-new Yahoo! Messenger. http://messenger.yahoo.com/ ^ permalink raw reply [flat|nested] 31+ messages in thread
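The staircase arithmetic quoted in the description above — run one RR_INTERVAL, descend a stair; each sleep before the next run earns one more interval per stair, up to 20 intervals for nice 0 and 1 for nice +19 — can be modeled with a toy calculation. Note the assumptions: the linear 20 - nice mapping below is interpolated from the two endpoints the description gives, and the struct and helper names are invented for illustration; none of this is the patch's actual accounting code.

```c
/* Toy model of the staircase descent described in the patch posting.
 * A task runs `burst` intervals at each priority before stepping down
 * one stair; each sleep before rerunning bumps `burst` by one, capped
 * at the task's per-nice maximum. Illustration only. */

#define RR_INTERVAL_MS 10  /* nominal value, per the description */

/* Assumed linear mapping matching the two quoted endpoints:
 * nice 0 -> 20 intervals, nice +19 -> 1 interval. */
static int max_intervals(int nice)
{
    return 20 - nice;   /* valid for 0 <= nice <= 19 */
}

struct toy_task {
    int nice;
    int burst;  /* intervals granted per stair, grows with sleeps */
};

/* One sleep before running again earns one more interval per stair. */
static void toy_sleep(struct toy_task *t)
{
    if (t->burst < max_intervals(t->nice))
        t->burst++;
}

/* CPU time (ms) the task may run at one priority before descending. */
static int stair_runtime_ms(const struct toy_task *t)
{
    return t->burst * RR_INTERVAL_MS;
}
```

This also shows why echoing 0 into /proc/sys/kernel/interactive tightens cpu distribution: with the elevation disabled, `burst` never grows, so every task simply descends at one interval per stair regardless of its sleep pattern.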
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 2004-06-07 19:57 ` Phy Prabab @ 2004-06-07 21:12 ` Con Kolivas 2004-06-07 21:40 ` Phy Prabab 0 siblings, 1 reply; 31+ messages in thread From: Con Kolivas @ 2004-06-07 21:12 UTC (permalink / raw) To: Phy Prabab Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III On Tue, 8 Jun 2004 05:57, Phy Prabab wrote: > I have had a chance to test this patch. Thanks > I have a make > system that has presented 2.6 kernels with problems, > so I am using this as the test. Observations show > that turning off interactive is much more > deterministic: > > 2.6.7-rc2-bk8-s63: > echo 0 > /proc/sys/kernel/interactive > > A: 35.57user 38.18system 1:20.28elapsed 91%CPU > B: 35.54user 38.40system 1:19.48elapsed 93%CPU > C: 35.48user 38.28system 1:20.94elapsed 91%CPU > > 2.6.7-rc2-bk8-s63: > A: 35.32user 38.51system 1:26.47elapsed 85%CPU > B: 35.43user 38.35system 1:20.79elapsed 91%CPU > C: 35.61user 38.23system 1:25.00elapsed 86%CPU > > However, something is still slower than the 2.4.x > kernels: > > 2.4.23: > A: 28.32user 29.51system 1:01.17elapsed 93%CPU > B: 28.54user 29.40system 1:01.48elapsed 92%CPU > B: 28.23user 28.80system 1:00.21elapsed 94%CPU > > Nice work, as I can now turn off some functionality > within the kernel that is causing me some slow down. Glad to see it does what you require. Turning off "interactive" should still provide moderately good interactive performance at low to moderate loads, but is much stricter about cpu usage distribution as you can see. Cheers, Con ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 2004-06-07 21:12 ` Con Kolivas @ 2004-06-07 21:40 ` Phy Prabab 2004-06-07 21:34 ` Con Kolivas 2004-06-08 2:50 ` Andrew Morton 0 siblings, 2 replies; 31+ messages in thread From: Phy Prabab @ 2004-06-07 21:40 UTC (permalink / raw) To: Con Kolivas Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III OOOPPPSSSS.... I need to make a correction on my previous data. I had inadvertently turned off interactivity and also increased the compute time to 100. I confirmed that just setting interactivity off does not solve my problem: 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive): A: 37.30user 40.56system 1:42.01elapsed 76%CPU B: 37.29user 40.35system 1:23.87elapsed 92%CPU C: 37.30user 40.56system 1:36.01elapsed 81%CPU 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive & 1 @ /proc/sys/kernel/compute): A: 37.28user 40.36system 1:25.60elapsed 90%CPU B: 37.22user 40.35system 1:22.17elapsed 94%CPU C: 37.27user 40.35system 1:24.71elapsed 91%CPU The question here, noticing that user and kernel time are the same, is where the dead time is coming from, and why it is so much more deterministic with compute time at 100 vs 10. Maybe I am misinterpreting the data, but this suggests to me that something is going awry (ping-pong, no settle, ???) within the kernel. Also please note the degradation from 2.6.7-rc2-bk8-s63: A: 35.57user 38.18system 1:20.28elapsed 91%CPU B: 35.54user 38.40system 1:19.48elapsed 93%CPU C: 35.48user 38.28system 1:20.94elapsed 91%CPU Interesting how much more time is spent in both user and kernel space between the two kernels. Also note that 2.4.x exhibits an even greater delta: A: 28.32user 29.51system 1:01.17elapsed 93%CPU B: 28.54user 29.40system 1:01.48elapsed 92%CPU C: 28.23user 28.80system 1:00.21elapsed 94%CPU Could anyone suggest a way to understand the difference between the 2.6 kernels and the 2.4 kernels? Thank you for your time.
Phy --- Con Kolivas <kernel@kolivas.org> wrote: > On Tue, 8 Jun 2004 05:57, Phy Prabab wrote: > > I have had a chance to test this patch. > > Thanks > > > I have a make > > system that has presented 2.6 kernels with > problems, > > so I am using this as the test. Observations > show > > that turning off interactive is much more > > deterministic: > > > > 2.6.7-rc2-bk8-s63: > > echo 0 > /proc/sys/kernel/interactive > > > > A: 35.57user 38.18system 1:20.28elapsed 91%CPU > > B: 35.54user 38.40system 1:19.48elapsed 93%CPU > > C: 35.48user 38.28system 1:20.94elapsed 91%CPU > > > > 2.6.7-rc2-bk8-s63: > > A: 35.32user 38.51system 1:26.47elapsed 85%CPU > > B: 35.43user 38.35system 1:20.79elapsed 91%CPU > > C: 35.61user 38.23system 1:25.00elapsed 86%CPU > > > > However, something is still slower than the 2.4.x > > kernels: > > > > 2.4.23: > > A: 28.32user 29.51system 1:01.17elapsed 93%CPU > > B: 28.54user 29.40system 1:01.48elapsed 92%CPU > > B: 28.23user 28.80system 1:00.21elapsed 94%CPU > > > > Nice work, as I can now turn off some > functionality > > within the kernel that is causing me some slow > down. > > Glad to see it does what you require. Turning off > "interactive" should still > provide moderately good interactive performance at > low to moderate loads, but > is much stricter about cpu usage distribution as you > can see. > > Cheers, > Con __________________________________ Do you Yahoo!? Friends. Fun. Try the all-new Yahoo! Messenger. http://messenger.yahoo.com/ ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 21:40 ` Phy Prabab
@ 2004-06-07 21:34 ` Con Kolivas
  2004-06-08  0:06   ` Phy Prabab
  2004-06-08  2:50   ` Andrew Morton
  1 sibling, 1 reply; 31+ messages in thread
From: Con Kolivas @ 2004-06-07 21:34 UTC (permalink / raw)
To: Phy Prabab
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Quoting Phy Prabab <phyprabab@yahoo.com>:
> OOOPPPSSSS....
>
> I need to make a correction on my previous data. I had inadvertently
> turned off interactivity and also increased the compute time to 100.
> I confirmed that just setting interactivity off does not solve my
> problem:
>
> 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive):
> A: 37.30user 40.56system 1:42.01elapsed 76%CPU
> B: 37.29user 40.35system 1:23.87elapsed 92%CPU
> C: 37.30user 40.56system 1:36.01elapsed 81%CPU
>
> 2.6.7-rc3-s63 (0 @ /proc/sys/kernel/interactive & 1 @
> /proc/sys/kernel/compute):
> A: 37.28user 40.36system 1:25.60elapsed 90%CPU
> B: 37.22user 40.35system 1:22.17elapsed 94%CPU
> C: 37.27user 40.35system 1:24.71elapsed 91%CPU
>
> The question here, noticing that user and kernel time are the same:
> where is the dead time coming from, and why is it so much more
> deterministic with compute time at 100 vs 10? Maybe I am
> misinterpreting the data, but this suggests to me that something is
> going awry (ping-pong, no settling, ???) within the kernel?
>
> Also please note the degradation between 2.6.7-rc2-bk8-s63:
>
> A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> C: 35.48user 38.28system 1:20.94elapsed 91%CPU
>
> Interesting how much more time is spent in both user and kernel
> space between the two kernels. Also note that 2.4.x exhibits an even
> greater delta:
>
> A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> C: 28.23user 28.80system 1:00.21elapsed 94%CPU
>
> Could anyone suggest a way to understand the difference between the
> 2.6 kernels and the 2.4 kernels?
>
> Thank you for your time.
> Phy

Hi.

How repeatable are the numbers normally? Some idea of what it is
you're benchmarking may also help in understanding the problem;
locking may be an issue with what you're benchmarking, and
out-of-order scheduling is not as forgiving of poor locking.
Extending the RR_INTERVAL and turning off interactivity makes it more
in-order and more forgiving of poor locking or yield().

Compute==1 inactivates interactivity anyway, but that's not really
relevant to your figures since you had set interactive to 0 when you
set compute to 1.

Con

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 21:34 ` Con Kolivas
@ 2004-06-08  0:06 ` Phy Prabab
  2004-06-07 22:08   ` Con Kolivas
  0 siblings, 1 reply; 31+ messages in thread
From: Phy Prabab @ 2004-06-08 0:06 UTC (permalink / raw)
To: Con Kolivas
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Just to clarify, setting compute to 1 implies interactive 0?

These numbers are very reproducible, and I have run them (in a
continuous loop) for two hours.

The test is a make of headers for a proprietary exec. Making headers
is rather simple: all it does is link a bunch of .h files (traversing
dirs) plus some dependency generation (3 files, yacc and lex). I have
moved the source code base to local disk to discount nfs issues
(though the difference is negligible, and nfs performance on 2.6 is
generally faster than 2.4).

I have tried to get a good test case that can be submitted. Still
trying.

Any suggestions to try to diagnose this?

Thanks!
Phy

> Hi.
>
> How repeatable are the numbers normally? Some idea of what it is
> you're benchmarking may also help in understanding the problem;
> locking may be an issue with what you're benchmarking, and
> out-of-order scheduling is not as forgiving of poor locking.
> Extending the RR_INTERVAL and turning off interactivity makes it
> more in-order and more forgiving of poor locking or yield().
>
> Compute==1 inactivates interactivity anyway, but that's not really
> relevant to your figures since you had set interactive to 0 when you
> set compute to 1.
>
> Con

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  0:06 ` Phy Prabab
@ 2004-06-07 22:08 ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-07 22:08 UTC (permalink / raw)
To: Phy Prabab
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Quoting Phy Prabab <phyprabab@yahoo.com>:
> Just to clarify, setting compute to 1 implies interactive 0?

Yes.

> These numbers are very reproducible, and I have run them (in a
> continuous loop) for two hours.

Ok.

> The test is a make of headers for a proprietary exec. Making headers
> is rather simple: all it does is link a bunch of .h files (traversing
> dirs) plus some dependency generation (3 files, yacc and lex). I
> have moved the source code base to local disk to discount nfs issues
> (though the difference is negligible, and nfs performance on 2.6 is
> generally faster than 2.4).

Out of curiosity, what happens from nfs? i/o effects can be
significant. I assume you've excluded memory effects?

> I have tried to get a good test case that can be submitted. Still
> trying.

Linking kernel headers with full debugging?

> Any suggestions to try to diagnose this?

Profiling.

> Thanks!
> Phy

Cheers,
Con

^ permalink raw reply	[flat|nested] 31+ messages in thread
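[Editor's note: the interactive/compute toggles being compared in this
sub-thread can be scripted. A minimal sketch, assuming root and a
staircase-patched kernel (the /proc paths come from the patch itself);
`make headers` is a hypothetical stand-in for Phy's proprietary build
step, which was never posted:]

```shell
#!/bin/sh
# Sketch of the A/B/C comparison from this thread.  Requires root and a
# kernel carrying the staircase patch (which adds these two sysctls).
BENCH="make headers"    # placeholder workload -- substitute your own

run_mode() {
    echo "$1" > /proc/sys/kernel/interactive
    echo "$2" > /proc/sys/kernel/compute
    for tag in A B C; do
        printf '%s: ' "$tag"
        # GNU time's default output matches the "Xuser Ysystem
        # Zelapsed W%CPU" lines quoted in the thread.
        /usr/bin/time $BENCH > /dev/null
    done
}

run_mode 1 0   # default: interactive on
run_mode 0 0   # interactive off (stricter cpu distribution)
run_mode 0 1   # compute on (implies interactive off)
```

This cannot run on an unpatched kernel; it documents the procedure
rather than being portable tooling.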
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-07 21:40 ` Phy Prabab
  2004-06-07 21:34 ` Con Kolivas
@ 2004-06-08  2:50 ` Andrew Morton
  2004-06-08  3:02   ` Con Kolivas
  ` (2 more replies)
  1 sibling, 3 replies; 31+ messages in thread
From: Andrew Morton @ 2004-06-08 2:50 UTC (permalink / raw)
To: Phy Prabab; +Cc: kernel, linux-kernel, zwane, wli

Phy Prabab <phyprabab@yahoo.com> wrote:
>
> Also please note the degradation between 2.6.7-rc2-bk8-s63:
>
> A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> C: 35.48user 38.28system 1:20.94elapsed 91%CPU
>
> Interesting how much more time is spent in both user and kernel
> space between the two kernels. Also note that 2.4.x exhibits an even
> greater delta:
>
> A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> C: 28.23user 28.80system 1:00.21elapsed 94%CPU
>
> Could anyone suggest a way to understand the difference between the
> 2.6 kernels and the 2.4 kernels?

This is very very bad.

It's a uniprocessor machine, yes?

Could you describe the workload a bit more? Is it something which
others can get their hands on?

It spends a lot of time in the kernel for a build system. I wonder why.

At a guess I'd say either a) you're hitting some path in the kernel
which is going for a giant and bogus romp through memory, trashing CPU
caches or b) your workload really dislikes run-child-first-after-fork
or c) the page allocator is serving up pages which your access pattern
dislikes or d) something else.

It's certainly interesting.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  2:50 ` Andrew Morton
@ 2004-06-08  3:02 ` Con Kolivas
  2004-06-08  3:05 ` Nick Piggin
  2004-06-08  5:42 ` Phy Prabab
  2 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-08 3:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Phy Prabab, linux-kernel, zwane, wli

Quoting Andrew Morton <akpm@osdl.org>:
> Phy Prabab <phyprabab@yahoo.com> wrote:
> >
> > Also please note the degradation between 2.6.7-rc2-bk8-s63:
> >
> > A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> > B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> > C: 35.48user 38.28system 1:20.94elapsed 91%CPU
> >
> > Interesting how much more time is spent in both user and kernel
> > space between the two kernels. Also note that 2.4.x exhibits an
> > even greater delta:
> >
> > A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> > B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> > C: 28.23user 28.80system 1:00.21elapsed 94%CPU
> >
> > Could anyone suggest a way to understand the difference between
> > the 2.6 kernels and the 2.4 kernels?
>
> This is very very bad.
>
> It's a uniprocessor machine, yes?
>
> Could you describe the workload a bit more? Is it something which
> others can get their hands on?
>
> It spends a lot of time in the kernel for a build system. I wonder
> why.
>
> At a guess I'd say either a) you're hitting some path in the kernel
> which is going for a giant and bogus romp through memory, trashing
> CPU caches or b) your workload really dislikes
> run-child-first-after-fork or c) the page allocator is serving up
> pages which your access pattern dislikes or d) something else.
>
> It's certainly interesting.

Ordinary kernbench on 8x OSDL hardware fails to show anything like
this:

Summary (time in seconds for optimal run):
2.6.7-rc2:             129.948
2.6.7-rc2-s6.3:        129.528
2.6.7-rc2-s6.3compute: 127.63

Detailed:

kernel: patch-2.6.7-rc2  plmid: 2987  Host: stp8-002

Average Optimal -j 32 Load Run:
Elapsed Time 129.948
User Time 873.302
System Time 86.318
Percent CPU 738
Context Switches 39853.4
Sleeps 29045.8

Average Maximal -j Load Run:
Elapsed Time 128.362
User Time 866.274
System Time 84.352
Percent CPU 740
Context Switches 25742.6
Sleeps 20615.8

Average Half Load -j 4 Run:
Elapsed Time 231.04
User Time 833.69
System Time 72.94
Percent CPU 392
Context Switches 12672.2
Sleeps 22267.2

kernel: staircase6.2  plmid: 3002  Host: stp8-002

Average Optimal -j 32 Load Run:
Elapsed Time 129.528
User Time 873.866
System Time 85.372
Percent CPU 740.2
Context Switches 38286.6
Sleeps 28331

Average Maximal -j Load Run:
Elapsed Time 128.074
User Time 862.728
System Time 85.842
Percent CPU 740.2
Context Switches 20894
Sleeps 22189.2

Average Half Load -j 4 Run:
Elapsed Time 230.942
User Time 833.03
System Time 71.126
Percent CPU 391
Context Switches 3467
Sleeps 22389.2

kernel: crunchah6.3  plmid: 3008  Host: stp8-002

Average Optimal -j 32 Load Run:
Elapsed Time 127.63
User Time 861.668
System Time 83.078
Percent CPU 739.6
Context Switches 19818.4
Sleeps 28442.6

Average Maximal -j Load Run:
Elapsed Time 127.354
User Time 856.632
System Time 82.952
Percent CPU 737
Context Switches 12389.2
Sleeps 21788.6

Average Half Load -j 4 Run:
Elapsed Time 230.8
User Time 833.378
System Time 71.304
Percent CPU 391.6
Context Switches 3738.4
Sleeps 22288.6

Con

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  2:50 ` Andrew Morton
  2004-06-08  3:02 ` Con Kolivas
@ 2004-06-08  3:05 ` Nick Piggin
  2004-06-08  1:07   ` Con Kolivas
  2004-06-08  5:51   ` Phy Prabab
  2 siblings, 2 replies; 31+ messages in thread
From: Nick Piggin @ 2004-06-08 3:05 UTC (permalink / raw)
To: Andrew Morton; +Cc: Phy Prabab, kernel, linux-kernel, zwane, wli

Andrew Morton wrote:

> At a guess I'd say either a) you're hitting some path in the kernel
> which is going for a giant and bogus romp through memory, trashing
> CPU caches or b) your workload really dislikes
> run-child-first-after-fork or c) the page allocator is serving up
> pages which your access pattern dislikes or d) something else.

e) it's the staircase scheduler patch?

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  3:05 ` Nick Piggin
@ 2004-06-08  1:07 ` Con Kolivas
  2004-06-08  5:51 ` Phy Prabab
  1 sibling, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-08 1:07 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, Phy Prabab, linux-kernel, zwane, wli

Quoting Nick Piggin <nickpiggin@yahoo.com.au>:
> Andrew Morton wrote:
>
> > At a guess I'd say either a) you're hitting some path in the
> > kernel which is going for a giant and bogus romp through memory,
> > trashing CPU caches or b) your workload really dislikes
> > run-child-first-after-fork or c) the page allocator is serving up
> > pages which your access pattern dislikes or d) something else.
>
> e) it's the staircase scheduler patch?

Phy Prabab wrote:
> Could anyone suggest a way to understand why the difference between
> the 2.6 kernels and the 2.4 kernels

I guess Phy better tell us if it's unique to staircase; and if so
there's clearly a bug unique to it.

Con

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  3:05 ` Nick Piggin
  2004-06-08  1:07 ` Con Kolivas
@ 2004-06-08  5:51 ` Phy Prabab
  1 sibling, 0 replies; 31+ messages in thread
From: Phy Prabab @ 2004-06-08 5:51 UTC (permalink / raw)
To: Nick Piggin, Andrew Morton; +Cc: Phy Prabab, kernel, linux-kernel, zwane, wli

The results I published also included 2.4.23 runs, which do not have
the staircase scheduler. Also, I have run this on 2.6.1 -> 2.6.7-rc<x>;
all the 2.6.x series are slower than the 2.4 series.

Thank you for your time.
Phy

--- Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Andrew Morton wrote:
>
> > At a guess I'd say either a) you're hitting some path in the
> > kernel which is going for a giant and bogus romp through memory,
> > trashing CPU caches or b) your workload really dislikes
> > run-child-first-after-fork or c) the page allocator is serving up
> > pages which your access pattern dislikes or d) something else.
>
> e) it's the staircase scheduler patch?

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  2:50 ` Andrew Morton
  2004-06-08  3:02 ` Con Kolivas
  2004-06-08  3:05 ` Nick Piggin
@ 2004-06-08  5:42 ` Phy Prabab
  2004-06-08  5:50   ` Zwane Mwaikambo
  2 siblings, 1 reply; 31+ messages in thread
From: Phy Prabab @ 2004-06-08 5:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: kernel, linux-kernel, zwane, wli

The test case is a build system that links headers (ln -s) and runs
bison and flex on a couple of files. strace shows no difference
between running on 2.4.23 compared to running on 2.6.7-rc2bk8s63.

Unfortunately, I can not send this out, but I am trying to get a test
case that will demonstrate this.

In the meantime, what can I do to try to understand this slowdown?

Thank you for your time.
Phy

--- Andrew Morton <akpm@osdl.org> wrote:
> Phy Prabab <phyprabab@yahoo.com> wrote:
> >
> > Also please note the degradation between 2.6.7-rc2-bk8-s63:
> >
> > A: 35.57user 38.18system 1:20.28elapsed 91%CPU
> > B: 35.54user 38.40system 1:19.48elapsed 93%CPU
> > C: 35.48user 38.28system 1:20.94elapsed 91%CPU
> >
> > Interesting how much more time is spent in both user and kernel
> > space between the two kernels. Also note that 2.4.x exhibits an
> > even greater delta:
> >
> > A: 28.32user 29.51system 1:01.17elapsed 93%CPU
> > B: 28.54user 29.40system 1:01.48elapsed 92%CPU
> > C: 28.23user 28.80system 1:00.21elapsed 94%CPU
> >
> > Could anyone suggest a way to understand the difference between
> > the 2.6 kernels and the 2.4 kernels?
>
> This is very very bad.
>
> It's a uniprocessor machine, yes?
>
> Could you describe the workload a bit more? Is it something which
> others can get their hands on?
>
> It spends a lot of time in the kernel for a build system. I wonder
> why.
>
> At a guess I'd say either a) you're hitting some path in the kernel
> which is going for a giant and bogus romp through memory, trashing
> CPU caches or b) your workload really dislikes
> run-child-first-after-fork or c) the page allocator is serving up
> pages which your access pattern dislikes or d) something else.
>
> It's certainly interesting.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-08  5:42 ` Phy Prabab
@ 2004-06-08  5:50 ` Zwane Mwaikambo
  0 siblings, 0 replies; 31+ messages in thread
From: Zwane Mwaikambo @ 2004-06-08 5:50 UTC (permalink / raw)
To: Phy Prabab; +Cc: Andrew Morton, kernel, linux-kernel, wli

On Mon, 7 Jun 2004, Phy Prabab wrote:

> The test case is a build system that links headers (ln -s) and runs
> bison and flex on a couple of files. strace shows no difference
> between running on 2.4.23 compared to running on 2.6.7-rc2bk8s63.
>
> Unfortunately, I can not send this out, but I am trying to get a
> test case that will demonstrate this.
>
> In the meantime, what can I do to try to understand this slowdown?

Check whether the regression is also present without the staircase
scheduler.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
  ` (6 preceding siblings ...)
  2004-06-07 19:57 ` Phy Prabab
@ 2004-06-29 11:10 ` Pavel Machek
  2004-06-29 11:14   ` Con Kolivas
  2004-06-29 12:05   ` Felipe Alfaro Solana
  7 siblings, 2 replies; 31+ messages in thread
From: Pavel Machek @ 2004-06-29 11:10 UTC (permalink / raw)
To: Con Kolivas
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

Hi!

> This is an update of the scheduler policy mechanism rewrite using the
> infrastructure of the current O(1) scheduler. Slight changes from the
> original design require a detailed description. The change to the
> original design has enabled all known corner cases to be abolished
> and cpu distribution to be much better maintained. It has proven to
> be stable in my testing and is ready for more widespread public
> testing now.
>
> Aims:
> - Interactive by design rather than have interactivity bolted on.
> - Good scalability.
> - Simple predictable design.
> - Maintain appropriate cpu distribution and fairness.
> - Low scheduling latency for normal policy tasks.
> - Low overhead.
> - Making renicing processes actually matter for CPU distribution
>   (nice 0 gets 20 times what nice +20 gets)
> - Resistant to priority inversion

How do you solve priority inversion?

Can you do "true idle threads" now? (I.e. nice +infinity?)

Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-29 11:10 ` Pavel Machek
@ 2004-06-29 11:14 ` Con Kolivas
  0 siblings, 0 replies; 31+ messages in thread
From: Con Kolivas @ 2004-06-29 11:14 UTC (permalink / raw)
To: Pavel Machek
Cc: Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Pavel Machek wrote:
| Hi!
|
|> This is an update of the scheduler policy mechanism rewrite using
|> the infrastructure of the current O(1) scheduler. Slight changes
|> from the original design require a detailed description. The change
|> to the original design has enabled all known corner cases to be
|> abolished and cpu distribution to be much better maintained. It has
|> proven to be stable in my testing and is ready for more widespread
|> public testing now.
|>
|> Aims:
|> - Interactive by design rather than have interactivity bolted on.
|> - Good scalability.
|> - Simple predictable design.
|> - Maintain appropriate cpu distribution and fairness.
|> - Low scheduling latency for normal policy tasks.
|> - Low overhead.
|> - Making renicing processes actually matter for CPU distribution
|>   (nice 0 gets 20 times what nice +20 gets)
|> - Resistant to priority inversion
|
| How do you solve priority inversion?

I don't solve it. It is relatively resistant to priority inversion
because tasks traverse all the priorities below their own rather than
sitting at one priority. It does this more strictly when interactive
is disabled, which I recommend for server and multiuser setups.

| Can you do "true idle threads" now? (I.e. nice +infinity?)

Not safely, no. I have a batch (idle) implementation that is
relatively safe but can be abused (it's in my -ck patchset only). At
some stage, if the codebase settles down enough, I'll try to introduce
semaphore tracking for batch processes that will elevate them to
normal scheduling, to prevent DoSing.

Cheers,
Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFA4U8oZUg7+tp6mRURAircAKCSvuZ4D6rM7AsfAbHKhQoCw1C7NACaA44C
UyUJAByRIE894CchzlN2OiU=
=9xVI
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2
  2004-06-29 11:10 ` Pavel Machek
  2004-06-29 11:14 ` Con Kolivas
@ 2004-06-29 12:05 ` Felipe Alfaro Solana
  1 sibling, 0 replies; 31+ messages in thread
From: Felipe Alfaro Solana @ 2004-06-29 12:05 UTC (permalink / raw)
To: Pavel Machek
Cc: Con Kolivas, Linux Kernel Mailinglist, Zwane Mwaikambo, William Lee Irwin III

On Tue, 2004-06-29 at 13:10 +0200, Pavel Machek wrote:
> Hi!
>
> > This is an update of the scheduler policy mechanism rewrite using
> > the infrastructure of the current O(1) scheduler. Slight changes
> > from the original design require a detailed description. The
> > change to the original design has enabled all known corner cases
> > to be abolished and cpu distribution to be much better maintained.
> > It has proven to be stable in my testing and is ready for more
> > widespread public testing now.
> >
> > Aims:
> > - Interactive by design rather than have interactivity bolted on.
> > - Good scalability.
> > - Simple predictable design.
> > - Maintain appropriate cpu distribution and fairness.
> > - Low scheduling latency for normal policy tasks.
> > - Low overhead.
> > - Making renicing processes actually matter for CPU distribution
> >   (nice 0 gets 20 times what nice +20 gets)
> > - Resistant to priority inversion
>
> How do you solve priority inversion?
>
> Can you do "true idle threads" now? (I.e. nice +infinity?)

I think idle threads can be achieved with the batch scheduler policy
present in the -ck patches. It works nicely, and I use it a lot.

PS: At least, if by "idle threads" you mean those that run only when
there are idle CPU cycles (i.e. all other processes are sleeping).

^ permalink raw reply	[flat|nested] 31+ messages in thread
end of thread, other threads:[~2004-06-29 12:05 UTC | newest]

Thread overview: 31+ messages:
2004-06-06 15:39 [PATCH] Staircase Scheduler v6.3 for 2.6.7-rc2 Con Kolivas
2004-06-06 16:59 ` Jan Killius
2004-06-06 17:40 ` Felipe Alfaro Solana
2004-06-06 22:58 ` Con Kolivas
2004-06-06 20:43 ` Prakash K. Cheemplavam
2004-06-06 22:55 ` Con Kolivas
2004-06-06 20:59 ` Felipe Alfaro Solana
2004-06-06 22:57 ` Con Kolivas
2004-06-06 23:47 ` Con Kolivas
2004-06-07 13:56 ` William Lee Irwin III
2004-06-07 13:57 ` William Lee Irwin III
2004-06-07 14:04 ` William Lee Irwin III
2004-06-07 14:07 ` William Lee Irwin III
2004-06-07 15:00 ` William Lee Irwin III
2004-06-07 15:11 ` William Lee Irwin III
2004-06-07 19:57 ` Phy Prabab
2004-06-07 21:12 ` Con Kolivas
2004-06-07 21:40 ` Phy Prabab
2004-06-07 21:34 ` Con Kolivas
2004-06-08  0:06 ` Phy Prabab
2004-06-07 22:08 ` Con Kolivas
2004-06-08  2:50 ` Andrew Morton
2004-06-08  3:02 ` Con Kolivas
2004-06-08  3:05 ` Nick Piggin
2004-06-08  1:07 ` Con Kolivas
2004-06-08  5:51 ` Phy Prabab
2004-06-08  5:42 ` Phy Prabab
2004-06-08  5:50 ` Zwane Mwaikambo
2004-06-29 11:10 ` Pavel Machek
2004-06-29 11:14 ` Con Kolivas
2004-06-29 12:05 ` Felipe Alfaro Solana