public inbox for linux-ia64@vger.kernel.org
From: Mike Galbraith <efault@gmx.de>
To: linux-ia64@vger.kernel.org
Subject: [Linux-ia64] Re: web page on O(1) scheduler
Date: Sun, 25 May 2003 09:17:49 +0000	[thread overview]
Message-ID: <marc-linux-ia64-105590723706032@msgid-missing> (raw)
In-Reply-To: <marc-linux-ia64-105590723705966@msgid-missing>

[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]

At 07:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>At 08:38 PM 5/21/2003 -0400, Rik van Riel wrote:
>>On Wed, 21 May 2003, Mike Galbraith wrote:
>> > At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
>> > >Recently, I started to look into some odd performance behaviors of the
>> > >O(1) scheduler.  I decided to document what I found in a web page
>> > >at:
>> > >
>> > >         http://www.hpl.hp.com/research/linux/kernel/o1.php
>> >
>> > The page mentions persistent starvation.  My own explorations of this
>> > issue indicate that the primary source is always selecting the highest
>> > priority queue.
>>
>>It's deeper than that.  The O(1) scheduler doesn't consider
>>actual CPU usage as a factor of CPU priority.
>
>Oh, there's no doubt in my mind that it's _much_ deeper than my little 
>surface scratchings ;-)
>
>It does consider cpu usage though.  Your run history is right there in 
>your accumulated sleep_avg.  Unfortunately (in some ways, fortunate in 
>others.. conflict) that information can be diluted down to nothing 
>instantly by new input from one wakeup.
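
(For illustration, the dilution can be sketched in userspace roughly like this -- hypothetical numbers, mirroring only the sleep-credit part of the 2.5.69 activate_task() logic; the real code also applies interactivity bonuses:)

```c
/* Toy model of sleep_avg crediting: activate_task() adds the whole
 * sleep time to sleep_avg, capped at MAX_SLEEP_AVG.  A task that
 * burned its sleep_avg down to zero looks fully "interactive" again
 * after one long enough sleep -- the run history is gone instantly. */
#include <assert.h>

#define HZ            1000
#define MAX_SLEEP_AVG (10 * HZ)

static unsigned long credit_sleep(unsigned long sleep_avg, long sleep_time)
{
	if (sleep_time > 0) {
		sleep_avg += sleep_time;
		if (sleep_avg > MAX_SLEEP_AVG)
			sleep_avg = MAX_SLEEP_AVG;
	}
	return sleep_avg;
}
```
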

Or did you mean that it misses a bunch of cpu usage?  I went looking at cpu 
usage, and...

Unless there's something seriously b0rked in the attached (I don't _think_ 
so, but...;), trusting an irq that fires once per millisecond to tick 
tasks isn't 100% effective, even if you aren't context switching a zillion 
times faster than that.  The attached diff produced the attached log while 
running test-starve.c.  I see some pretty serious thievery even when doing 
ho-hum stuff.  We can deactivate tasks at _really_ high frequency, and 
besides, if the timer interrupt doesn't fire while the last runnable task 
is active, it can be missed for a while and accumulate nearly a full ms of 
uncharged run time.  It seems to me that with the right timing, you could 
end up stealing a bunch of cpu (my logs say it's happening a _lot_ under 
load; then again, the diff might be b0rked).
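(To make that concrete, here's a toy userspace model -- hypothetical microsecond timestamps, not kernel code -- of tick-based charging, where a tick is billed to whoever happens to be current when the irq fires:)

```c
/* The timer fires at t = 1000, 2000, ... microseconds.  A task running
 * over the half-open window (start, start+run] is charged one full
 * tick for every firing that lands in its window -- so a task that
 * always yields just before the tick is never charged at all, and a
 * task that runs briefly across a tick boundary eats a whole ms. */
#include <assert.h>

#define TICK_US 1000	/* 1 ms tick, i.e. HZ=1000 */

static int ticks_charged(int start_us, int run_us)
{
	int end_us = start_us + run_us;
	int charged = 0;
	int t;

	for (t = TICK_US; t <= end_us; t += TICK_US)
		if (t > start_us)
			charged++;
	return charged;
}
```

A task that repeats the first pattern forever gets 90% of the cpu and is never ticked; its victim on the same queue pays for it.
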

         -Mike 

[-- Attachment #2: xx.log --]
[-- Type: application/octet-stream, Size: 7260 bytes --]

COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
ROADRUNNER! 308:19 stole 2 msecs.
COYOTE! 309:22 not ticked for 27 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 4 msecs.
COYOTE! 309:22 not ticked for 5 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 36 msecs.
COYOTE! 309:22 not ticked for 37 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 74 msecs.
COYOTE! 309:22 not ticked for 75 msecs.
COYOTE! 309:22 not ticked for 76 msecs.
COYOTE! 309:22 not ticked for 77 msecs.
COYOTE! 309:22 not ticked for 78 msecs.
COYOTE! 309:22 not ticked for 79 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 65 msecs.
COYOTE! 309:22 not ticked for 66 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 4 msecs.
COYOTE! 309:22 not ticked for 5 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 38 msecs.
COYOTE! 309:22 not ticked for 39 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 3 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:22 not ticked for 2 msecs.
COYOTE! 309:22 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 47 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
ROADRUNNER! 309:23 stole 30 msecs.
ROADRUNNER! 309:23 stole 1 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 68 msecs.
COYOTE! 309:23 not ticked for 69 msecs.
COYOTE! 309:23 not ticked for 70 msecs.
COYOTE! 309:23 not ticked for 71 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 60 msecs.
COYOTE! 309:23 not ticked for 61 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 3 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 309:23 not ticked for 2 msecs.
COYOTE! 309:23 not ticked for 1 msecs.
COYOTE! 308:20 not ticked for 1 msecs.

[-- Attachment #3: xx.diff --]
[-- Type: application/octet-stream, Size: 6575 bytes --]

--- linux-2.5.69.virgin/include/linux/sched.h.org	Fri May 23 07:14:23 2003
+++ linux-2.5.69.virgin/include/linux/sched.h	Sat May 24 08:03:40 2003
@@ -328,7 +328,8 @@
 	prio_array_t *array;
 
 	unsigned long sleep_avg;
-	unsigned long last_run;
+	unsigned long long last_run;
+	unsigned int used_nsecs;
 
 	unsigned long policy;
 	unsigned long cpus_allowed;
--- linux-2.5.69.virgin/kernel/sched.c.org	Sun May 25 06:05:42 2003
+++ linux-2.5.69.virgin/kernel/sched.c	Sun May 25 10:04:14 2003
@@ -74,6 +74,10 @@
 #define MAX_SLEEP_AVG		(10*HZ)
 #define STARVATION_LIMIT	(10*HZ)
 #define NODE_THRESHOLD		125
+#define SCHED_NANOSECOND	1
+#define SCHED_MILLISECOND	(1000000 * SCHED_NANOSECOND)
+
+extern unsigned long long monotonic_clock(void);
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -342,7 +346,8 @@
  */
 static inline int activate_task(task_t *p, runqueue_t *rq)
 {
-	long sleep_time = jiffies - p->last_run - 1;
+	unsigned long long now = monotonic_clock();
+	long sleep_time =  (long)(now - p->last_run) / SCHED_MILLISECOND;
 	int requeue_waker = 0;
 
 	if (sleep_time > 0) {
@@ -514,7 +519,9 @@
 				__activate_task(p, rq);
 			else {
 				requeue_waker = activate_task(p, rq);
-				if (p->prio < rq->curr->prio)
+				if (p->prio < rq->curr->prio ||
+						(0 && p->prio == rq->curr->prio &&
+						p->time_slice > rq->curr->time_slice))
 					resched_task(rq->curr);
 			}
 			success = 1;
@@ -1182,11 +1189,12 @@
 	int cpu = smp_processor_id();
 	runqueue_t *rq = this_rq();
 	task_t *p = current;
+	int idle = p == rq->idle, used_msecs;
 
 	if (rcu_pending(cpu))
 		rcu_check_callbacks(cpu, user_ticks);
 
-	if (p == rq->idle) {
+	if (idle) {
 		/* note: this timer irq context must be accounted for as well */
 		if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
 			kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1194,8 +1202,7 @@
 			kstat_cpu(cpu).cpustat.iowait += sys_ticks;
 		else
 			kstat_cpu(cpu).cpustat.idle += sys_ticks;
-		rebalance_tick(rq, 1);
-		return;
+		goto out;
 	}
 	if (TASK_NICE(p) > 0)
 		kstat_cpu(cpu).cpustat.nice += user_ticks;
@@ -1203,12 +1210,12 @@
 		kstat_cpu(cpu).cpustat.user += user_ticks;
 	kstat_cpu(cpu).cpustat.system += sys_ticks;
 
+	spin_lock(&rq->lock);
 	/* Task might have expired already, but not scheduled off yet */
 	if (p->array != rq->active) {
 		set_tsk_need_resched(p);
-		return;
+		goto out_unlock;
 	}
-	spin_lock(&rq->lock);
 	/*
 	 * The task was running during this tick - update the
 	 * time slice counter and the sleep average. Note: we
@@ -1217,8 +1224,15 @@
 	 * it possible for interactive tasks to use up their
 	 * timeslices at their highest priority levels.
 	 */
-	if (p->sleep_avg)
-		p->sleep_avg--;
+	if ((used_msecs = p->used_nsecs / SCHED_MILLISECOND)) {
+		p->used_nsecs -= used_msecs * SCHED_MILLISECOND;
+		if (p->sleep_avg >= used_msecs)
+			p->sleep_avg -= used_msecs;
+		else p->sleep_avg = 0;
+		if (p->policy != SCHED_FIFO && used_msecs > p->time_slice)
+			printk(KERN_DEBUG "ROADRUNNER! %d:%d stole %d msecs.\n",
+				p->pid, p->prio-100, used_msecs - p->time_slice);
+	}
 	if (unlikely(rt_task(p))) {
 		/*
 		 * RR tasks need a special form of timeslice management.
@@ -1233,7 +1247,7 @@
 			dequeue_task(p, rq->active);
 			enqueue_task(p, rq->active);
 		}
-		goto out;
+		goto out_unlock;
 	}
 	if (!--p->time_slice) {
 		dequeue_task(p, rq->active);
@@ -1249,9 +1263,10 @@
 		} else
 			enqueue_task(p, rq->active);
 	}
-out:
+out_unlock:
 	spin_unlock(&rq->lock);
-	rebalance_tick(rq, 0);
+out:
+	rebalance_tick(rq, idle);
 }
 
 void scheduling_functions_start_here(void) { }
@@ -1264,8 +1279,9 @@
 	task_t *prev, *next;
 	runqueue_t *rq;
 	prio_array_t *array;
-	struct list_head *queue;
+	struct list_head *head, *curr;
 	int idx;
+	static int last_pid, last_msused;
 
 	/*
 	 * Test if we are atomic.  Since do_exit() needs to call into
@@ -1286,7 +1302,6 @@
 	rq = this_rq();
 
 	release_kernel_lock(prev);
-	prev->last_run = jiffies;
 	spin_lock_irq(&rq->lock);
 
 	/*
@@ -1331,8 +1346,32 @@
 	}
 
 	idx = sched_find_first_bit(array->bitmap);
-	queue = array->queue + idx;
-	next = list_entry(queue->next, task_t, run_list);
+next_queue:
+	head = array->queue + idx;
+	curr = head->next;
+	next = list_entry(curr, task_t, run_list);
+	curr = curr->next;
+	/*
+	 * If we are about to wrap back to the head of the queue,
+	 * give a lower priority queue a chance to sneak one in.
+	 */
+	if (0 && idx == prev->prio && curr == head && array->nr_active > 1) {
+		int tmp = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+		if (tmp < MAX_PRIO) {
+			idx = tmp;
+			goto next_queue;
+		}
+	}
+	if (prev != rq->idle && prev->array == rq->active &&
+			prev->used_nsecs / SCHED_MILLISECOND >= 1) {
+		if (last_pid != prev->pid ||
+				last_msused != prev->used_nsecs / SCHED_MILLISECOND)
+		printk(KERN_DEBUG "COYOTE! %d:%d not ticked for %d msecs.\n",
+			prev->pid, prev->prio-100,
+			prev->used_nsecs / SCHED_MILLISECOND);
+		last_pid = prev->pid;
+		last_msused = prev->used_nsecs / SCHED_MILLISECOND;
+	}
 
 switch_tasks:
 	prefetch(next);
@@ -1340,8 +1379,11 @@
 	RCU_qsctr(prev->thread_info->cpu)++;
 
 	if (likely(prev != next)) {
+		unsigned long long now = monotonic_clock();
 		rq->nr_switches++;
 		rq->curr = next;
+		prev->used_nsecs += now - prev->last_run;
+		prev->last_run = next->last_run = now;
 
 		prepare_arch_switch(rq, next);
 		prev = context_switch(rq, prev, next);
--- linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c.org	Sat May 24 08:55:05 2003
+++ linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c	Sat May 24 09:26:09 2003
@@ -102,12 +102,13 @@
 static unsigned long long monotonic_clock_tsc(void)
 {
 	unsigned long long last_offset, this_offset, base;
+	unsigned long flags;
 	
 	/* atomically read monotonic base & last_offset */
-	read_lock_irq(&monotonic_lock);
+	read_lock_irqsave(&monotonic_lock, flags);
 	last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low;
 	base = monotonic_base;
-	read_unlock_irq(&monotonic_lock);
+	read_unlock_irqrestore(&monotonic_lock, flags);
 
 	/* Read the Time Stamp Counter */
 	rdtscll(this_offset);
--- linux-2.5.69.virgin/kernel/printk.c.org	Sun May 25 09:10:38 2003
+++ linux-2.5.69.virgin/kernel/printk.c	Sun May 25 09:11:04 2003
@@ -510,8 +510,10 @@
 	console_may_schedule = 0;
 	up(&console_sem);
 	spin_unlock_irqrestore(&logbuf_lock, flags);
+#if 0
 	if (wake_klogd && !oops_in_progress && waitqueue_active(&log_wait))
 		wake_up_interruptible(&log_wait);
+#endif
 }
 
 /** console_conditional_schedule - yield the CPU if required
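(The used_nsecs bookkeeping in the schedule()/scheduler_tick() hunks above boils down to something like the following userspace sketch -- a simplification for illustration, not the actual kernel code:)

```c
/* On every context switch, charge prev the precise nanoseconds it ran
 * since switch-in; at every timer tick, drain whole milliseconds from
 * the accumulator into sleep_avg, keeping the sub-ms remainder. */
#include <assert.h>

#define SCHED_NANOSECOND  1ULL
#define SCHED_MILLISECOND (1000000ULL * SCHED_NANOSECOND)

struct task {
	unsigned long long last_run;	/* monotonic clock at switch-in */
	unsigned long long used_nsecs;	/* run time not yet charged */
	unsigned long sleep_avg;
};

/* Mirrors the schedule() hunk: prev switched out, next switched in,
 * `now` is the current monotonic_clock() value. */
static void switch_accounting(struct task *prev, struct task *next,
			      unsigned long long now)
{
	prev->used_nsecs += now - prev->last_run;
	prev->last_run = next->last_run = now;
}

/* Mirrors the scheduler_tick() hunk: charge whole milliseconds against
 * sleep_avg and return how many were charged. */
static unsigned int tick_accounting(struct task *p)
{
	unsigned int used_msecs = p->used_nsecs / SCHED_MILLISECOND;

	if (used_msecs) {
		p->used_nsecs -= (unsigned long long)used_msecs * SCHED_MILLISECOND;
		if (p->sleep_avg >= used_msecs)
			p->sleep_avg -= used_msecs;
		else
			p->sleep_avg = 0;
	}
	return used_msecs;
}
```

With this, a task that ran 2.5ms between ticks is charged 2ms at the next tick and carries the 0.5ms remainder forward, instead of being charged only for ticks that happened to land on it.
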


Thread overview: 25+ messages
2003-05-21  9:01 [Linux-ia64] Re: web page on O(1) scheduler Arjan van de Ven
2003-05-21  9:26 ` Mike Galbraith
2003-05-21  9:30 ` Mike Galbraith
2003-05-21 10:40 ` Duraid Madina
2003-05-21 10:43 ` Christoph Hellwig
2003-05-21 15:18 ` David Mosberger
2003-05-21 17:56 ` David Mosberger
2003-05-21 20:46 ` Mike Galbraith
2003-05-22  0:38 ` Rik van Riel
2003-05-22  5:52 ` Mike Galbraith
2003-05-22  9:52 ` Mike Galbraith
2003-05-22 16:25 ` Mike Galbraith
2003-05-22 17:58 ` David Mosberger
2003-05-23  1:07 ` Hans Boehm
2003-05-23  8:30 ` Arjan van de Ven
2003-05-23 17:48 ` Boehm, Hans
2003-05-23 18:04 ` Davide Libenzi
2003-05-24  0:10 ` Boehm, Hans
2003-05-24  0:20 ` Davide Libenzi
2003-05-24  0:53 ` Boehm, Hans
2003-05-24  5:38 ` Davide Libenzi
2003-05-24 14:43 ` Davide Libenzi
2003-05-24 16:50 ` Hans Boehm
2003-05-24 21:41 ` Davide Libenzi
2003-05-25  9:17 ` Mike Galbraith [this message]
