* Question about sched_yield()
@ 2002-06-15 22:15 mgix
2002-06-16 14:43 ` [patch] " Ingo Molnar
2002-06-18 0:46 ` David Schwartz
0 siblings, 2 replies; 52+ messages in thread
From: mgix @ 2002-06-15 22:15 UTC (permalink / raw)
To: linux-kernel
Hello,
I am seeing some strange linux scheduler behaviours,
and I thought this'd be the best place to ask.
I have two processes, one that loops forever and
does nothing but call sched_yield(), and the other
is basically benchmarking how fast it can compute
some long math calculation.
When I run the math simulator with nothing else running,
I get the following:
lenslark # ./loop
1.86543 MLoops per sec, sum=0.291969
1.84847 MLoops per sec, sum=1.65587
1.84064 MLoops per sec, sum=1.74601
When I run one instance of the math simulator and
one instance of the sched_yielder, I see the behaviour
I was expecting (and want): all CPU is allocated to the
process that does some actual work:
lenslark # ./yield &
[1] 26316
lenslark #
lenslark # ./loop
1.86502 MLoops per sec, sum=0.291969
1.84868 MLoops per sec, sum=1.65587
1.84035 MLoops per sec, sum=1.74601
and top confirms what I'm seeing: loop gets 99% of
the CPU, and yield gets close to none.
However, when I have two instances of the
sched_yielder running, things start to behave strangely:
lenslark # ./yield &
[1] 26350
lenslark # ./yield &
[2] 26351
lenslark # ./loop
0.686901 MLoops per sec, sum=0.291969
0.674352 MLoops per sec, sum=1.65587
0.665542 MLoops per sec, sum=1.74601
0.674117 MLoops per sec, sum=0.407357
and sure enough, top shows that for
some reason, each sched_yielder gets
about a third of the CPU:
26364 root 18 0 364 364 304 R 36.0 0.3 0:04 loop
26351 root 12 0 252 252 204 R 32.2 0.2 0:31 yield
26350 root 14 0 252 252 204 R 31.2 0.2 0:33 yield
26365 root 11 0 1024 1024 812 R 0.3 1.0 0:00 top
Does anyone have a clue as to why this happens?
Is this the expected behaviour given the current
scheduler algorithm (of which I know nothing :))?
For comparison purposes, I ran the same test on
Win2k+MingWin32, and things behave the way I'd expect
there: no matter how many sched_yielders are running,
loop gets almost all of the CPU, and barely slows down.
I am running on an AMD K6 @ 400 MHz, on the following kernel:
lenslark # cat /proc/version
Linux version 2.4.18 (root@lenslark.mgix.com) (gcc version 2.96 20000731 (Mandrake Linux 8.2 2.96-0.76mdk)) #1 Sun Apr 28 15:25:19 PDT 2002
This kernel was compiled with SMP turned off.
If anyone can explain to me what's going on, or even better
how to replicate the behaviour I'm seeing on Win2k, I'd
be very happy. Also, I'm not a subscriber to linux-kernel,
so I'd appreciate a CC.
I've included the source to the two small programs below:
================================= YIELD.CPP
#ifdef linux
#include <sched.h>
int main()
{
    while(1)
    {
        sched_yield();
    }
}
#endif
#ifdef WIN32
#include <windows.h>
int main()
{
    while(1)
    {
        Sleep(0);
    }
}
#endif
================================= LOOP.CPP
#include <math.h>
#include <stdio.h>
#include <sys/time.h>
#ifdef WIN32
#include <windows.h>
#endif
typedef unsigned long long uint64_t;

uint64_t usecs()
{
#ifdef linux
    struct timeval t;
    gettimeofday(&t,0);
    uint64_t result;
    result= t.tv_sec;
    result*= 1000000;
    result+= t.tv_usec;
    return result;
#endif
#ifdef WIN32
    return 1000*(uint64_t)GetTickCount();
#endif
}

int main()
{
    uint64_t t= 0;
    double sum= 0.0;
    uint64_t lastTime= 0;
    uint64_t lastLoops= 0;
    while(t<100000000)
    {
        sum+= sin((double)t);
        if((t%4000000)==0)
        {
            uint64_t nowTime= usecs();
            uint64_t nowLoops= t;
            if(lastTime!=0)
            {
                double loops= (nowLoops-lastLoops);
                double elapsed= (nowTime-lastTime)/1000000.0;
                fprintf(
                    stderr,
                    "%g MLoops per sec, sum=%g\n",
                    (loops/elapsed)/1000000.0,
                    sum
                );
            }
            lastTime= nowTime;
            lastLoops= nowLoops;
        }
        ++t;
    }
    fprintf(stderr,"Sum=%g\n",sum);
    return 0;
}
^ permalink raw reply	[flat|nested] 52+ messages in thread

* [patch] Re: Question about sched_yield()
  2002-06-15 22:15 Question about sched_yield() mgix
@ 2002-06-16 14:43 ` Ingo Molnar
  2002-06-18  0:46 ` David Schwartz
  1 sibling, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2002-06-16 14:43 UTC (permalink / raw)
To: mgix; +Cc: linux-kernel, Linus Torvalds

On Sat, 15 Jun 2002 mgix@mgix.com wrote:

> I am seeing some strange linux scheduler behaviours, and I thought
> this'd be the best place to ask.
>
> I have two processes, one that loops forever and does nothing but
> calling sched_yield(), and the other is basically benchmarking how fast
> it can compute some long math calculation.

sched_yield() is indeed misbehaving both in 2.4 and 2.5. (I think you are
using some variant of 2.4.18, does that kernel have the O(1) scheduler
patches applied?) In fact in 2.5 it's behaving slightly worse than in
vanilla 2.4.18.

the O(1)-scheduler's sched_yield() implementation does the following to
'give up' the CPU:

 - it decreases its priority by 1 until it reaches the lowest level
 - it queues the task to the end of the priority queue

this scheme works fine in most cases, but if sched_yield()-active tasks
are mixed with CPU-using processes then it's quite likely that the
CPU-using process is in the expired array. In that case the yield()-ing
process only requeues itself in the active array - a true context-switch
to the expired process will only occur once the timeslice of the
yield()-ing process has expired: in ~150 msecs. This leads the
yield()-ing and CPU-using processes to use up roughly the same amount of
CPU-time, which is arguably deficient.

i've fixed this problem by extending sched_yield() the following way:

+ * There are three levels of how a yielding task will give up
+ * the current CPU:
+ *
+ * #1 - it decreases its priority by one. This priority loss is
+ *      temporary, it's recovered once the current timeslice
+ *      expires.
+ *
+ * #2 - once it has reached the lowest priority level,
+ *      it will give up timeslices one by one. (We do not
+ *      want to give them up all at once, it's gradual,
+ *      to protect the casual yield()er.)
+ *
+ * #3 - once all timeslices are gone we put the process into
+ *      the expired array.
+ *
+ * (special rule: RT tasks do not lose any priority, they just
+ * roundrobin on their current priority level.)
+ */

the attached patch against vanilla 2.5.21 also includes this
sched_yield() improvement, and in my tests it now behaves well. Could you
give it a go, does it now behave for your workload as well?

	Ingo

diff -rNu linux-2.5.21-final/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- linux-2.5.21-final/arch/i386/kernel/entry.S	Sun Jun  9 07:28:19 2002
+++ linux/arch/i386/kernel/entry.S	Sun Jun 16 16:35:52 2002
@@ -193,6 +193,7 @@
 ENTRY(ret_from_fork)
 #if CONFIG_SMP || CONFIG_PREEMPT
+	# NOTE: this function takes a parameter but it's unused on x86.
 	call schedule_tail
 #endif
 	GET_THREAD_INFO(%ebx)
diff -rNu linux-2.5.21-final/fs/pipe.c linux/fs/pipe.c
--- linux-2.5.21-final/fs/pipe.c	Sun Jun  9 07:26:29 2002
+++ linux/fs/pipe.c	Sun Jun 16 16:35:36 2002
@@ -119,7 +119,7 @@
 			 * writers synchronously that there is more
 			 * room.
 			 */
-			wake_up_interruptible(PIPE_WAIT(*inode));
+			wake_up_interruptible_sync(PIPE_WAIT(*inode));
 			kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
 			if (!PIPE_EMPTY(*inode))
 				BUG();
@@ -219,7 +219,7 @@
 		 * is going to give up this CPU, so it doesnt have
 		 * to do idle reschedules.
 		 */
-		wake_up_interruptible(PIPE_WAIT(*inode));
+		wake_up_interruptible_sync(PIPE_WAIT(*inode));
 		kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
 		PIPE_WAITING_WRITERS(*inode)++;
 		pipe_wait(inode);
diff -rNu linux-2.5.21-final/include/asm-i386/system.h linux/include/asm-i386/system.h
--- linux-2.5.21-final/include/asm-i386/system.h	Sun Jun  9 07:26:29 2002
+++ linux/include/asm-i386/system.h	Sun Jun 16 16:35:52 2002
@@ -11,9 +11,12 @@
 struct task_struct;	/* one of the stranger aspects of C forward declarations.. */
 extern void FASTCALL(__switch_to(struct task_struct *prev, struct task_struct *next));

-#define prepare_to_switch()	do { } while(0)
+#define prepare_arch_schedule(prev)	do { } while(0)
+#define finish_arch_schedule(prev)	do { } while(0)
+#define prepare_arch_switch(rq)		do { } while(0)
+#define finish_arch_switch(rq)		spin_unlock_irq(&(rq)->lock)

-#define switch_to(prev,next) do {					\
+#define switch_to(prev,next,last) do {					\
 	asm volatile("pushl %%esi\n\t"					\
 		     "pushl %%edi\n\t"					\
 		     "pushl %%ebp\n\t"					\
diff -rNu linux-2.5.21-final/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.21-final/include/linux/sched.h	Sun Jun  9 07:27:21 2002
+++ linux/include/linux/sched.h	Sun Jun 16 16:35:36 2002
@@ -491,6 +491,7 @@
 extern unsigned long prof_shift;

 extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
 extern void FASTCALL(sleep_on(wait_queue_head_t *q));
 extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q, signed long timeout));
@@ -507,6 +508,11 @@
 #define wake_up_interruptible(x)	__wake_up((x),TASK_INTERRUPTIBLE, 1)
 #define wake_up_interruptible_nr(x, nr)	__wake_up((x),TASK_INTERRUPTIBLE, nr)
 #define wake_up_interruptible_all(x)	__wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x)	__wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x)	__wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
 asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

 extern int in_group_p(gid_t);
diff -rNu linux-2.5.21-final/include/linux/spinlock.h linux/include/linux/spinlock.h
--- linux-2.5.21-final/include/linux/spinlock.h	Sun Jun  9 07:30:01 2002
+++ linux/include/linux/spinlock.h	Sun Jun 16 16:35:46 2002
@@ -26,6 +26,7 @@
 #define write_lock_bh(lock)	do { local_bh_disable(); write_lock(lock); } while (0)

 #define spin_unlock_irqrestore(lock, flags)	do { spin_unlock(lock); local_irq_restore(flags); } while (0)
+#define _raw_spin_unlock_irqrestore(lock, flags) do { _raw_spin_unlock(lock); local_irq_restore(flags); } while (0)
 #define spin_unlock_irq(lock)	do { spin_unlock(lock); local_irq_enable(); } while (0)
 #define spin_unlock_bh(lock)	do { spin_unlock(lock); local_bh_enable(); } while (0)
@@ -183,6 +184,12 @@
 		preempt_schedule(); \
 } while (0)

+#define preempt_check_resched() \
+do { \
+	if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+		preempt_schedule(); \
+} while (0)
+
 #define spin_lock(lock) \
 do { \
 	preempt_disable(); \
@@ -197,6 +204,12 @@
 	preempt_enable(); \
 } while (0)

+#define spin_unlock_no_resched(lock) \
+do { \
+	_raw_spin_unlock(lock); \
+	preempt_enable_no_resched(); \
+} while (0)
+
 #define read_lock(lock)		({preempt_disable(); _raw_read_lock(lock);})
 #define read_unlock(lock)	({_raw_read_unlock(lock); preempt_enable();})
 #define write_lock(lock)	({preempt_disable(); _raw_write_lock(lock);})
@@ -206,20 +219,22 @@

 #else

-#define preempt_get_count()	(0)
-#define preempt_disable()	do { } while (0)
+#define preempt_get_count()		(0)
+#define preempt_disable()		do { } while (0)
 #define preempt_enable_no_resched()	do {} while(0)
-#define preempt_enable()	do { } while (0)
+#define preempt_enable()		do { } while (0)
+#define preempt_check_resched()		do { } while (0)

-#define spin_lock(lock)		_raw_spin_lock(lock)
-#define spin_trylock(lock)	_raw_spin_trylock(lock)
-#define spin_unlock(lock)	_raw_spin_unlock(lock)
-
-#define read_lock(lock)		_raw_read_lock(lock)
-#define read_unlock(lock)	_raw_read_unlock(lock)
-#define write_lock(lock)	_raw_write_lock(lock)
-#define write_unlock(lock)	_raw_write_unlock(lock)
-#define write_trylock(lock)	_raw_write_trylock(lock)
+#define spin_lock(lock)			_raw_spin_lock(lock)
+#define spin_trylock(lock)		_raw_spin_trylock(lock)
+#define spin_unlock(lock)		_raw_spin_unlock(lock)
+#define spin_unlock_no_resched(lock)	_raw_spin_unlock(lock)
+
+#define read_lock(lock)			_raw_read_lock(lock)
+#define read_unlock(lock)		_raw_read_unlock(lock)
+#define write_lock(lock)		_raw_write_lock(lock)
+#define write_unlock(lock)		_raw_write_unlock(lock)
+#define write_trylock(lock)		_raw_write_trylock(lock)

 #endif

 /* "lock on reference count zero" */
diff -rNu linux-2.5.21-final/kernel/ksyms.c linux/kernel/ksyms.c
--- linux-2.5.21-final/kernel/ksyms.c	Sun Jun  9 07:26:33 2002
+++ linux/kernel/ksyms.c	Sun Jun 16 16:35:36 2002
@@ -457,6 +457,9 @@
 /* process management */
 EXPORT_SYMBOL(complete_and_exit);
 EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
 EXPORT_SYMBOL(wake_up_process);
 EXPORT_SYMBOL(sleep_on);
 EXPORT_SYMBOL(sleep_on_timeout);
diff -rNu linux-2.5.21-final/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.21-final/kernel/sched.c	Sun Jun  9 07:28:13 2002
+++ linux/kernel/sched.c	Sun Jun 16 16:36:00 2002
@@ -135,7 +135,6 @@
  */
 struct runqueue {
 	spinlock_t lock;
-	spinlock_t frozen;
 	unsigned long nr_running, nr_switches, expired_timestamp;
 	signed long nr_uninterruptible;
 	task_t *curr, *idle;
@@ -153,17 +152,21 @@
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define rt_task(p)		((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ */
 static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
 {
 	struct runqueue *rq;

repeat_lock_task:
-	preempt_disable();
+	local_irq_save(*flags);
 	rq = task_rq(p);
-	spin_lock_irqsave(&rq->lock, *flags);
+	spin_lock(&rq->lock);
 	if (unlikely(rq != task_rq(p))) {
 		spin_unlock_irqrestore(&rq->lock, *flags);
-		preempt_enable();
 		goto repeat_lock_task;
 	}
 	return rq;
@@ -172,7 +175,23 @@
 static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
 {
 	spin_unlock_irqrestore(&rq->lock, *flags);
-	preempt_enable();
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+	local_irq_disable();
+	rq = this_rq();
+	spin_lock(&rq->lock);
+	return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+	spin_unlock(&rq->lock);
+	local_irq_enable();
 }

 /*
@@ -284,9 +303,15 @@
 repeat:
 	preempt_disable();
 	rq = task_rq(p);
-	while (unlikely(rq->curr == p)) {
+	if (unlikely(rq->curr == p)) {
 		cpu_relax();
-		barrier();
+		/*
+		 * enable/disable preemption just to make this
+		 * a preemption point - we are busy-waiting
+		 * anyway.
+		 */
+		preempt_enable();
+		goto repeat;
 	}
 	rq = task_rq_lock(p, &flags);
 	if (unlikely(rq->curr == p)) {
@@ -322,40 +347,50 @@
  * "current->state = TASK_RUNNING" to mark yourself runnable
  * without the overhead of this.
  */
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
 {
 	unsigned long flags;
 	int success = 0;
 	long old_state;
 	runqueue_t *rq;

+repeat_lock_task:
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
-	p->state = TASK_RUNNING;
 	if (!p->array) {
+		if (unlikely(sync && (rq->curr != p))) {
+			if (p->thread_info->cpu != smp_processor_id()) {
+				p->thread_info->cpu = smp_processor_id();
+				task_rq_unlock(rq, &flags);
+				goto repeat_lock_task;
+			}
+		}
 		if (old_state == TASK_UNINTERRUPTIBLE)
 			rq->nr_uninterruptible--;
 		activate_task(p, rq);
+		/*
+		 * If sync is set, a resched_task() is a NOOP
+		 */
 		if (p->prio < rq->curr->prio)
 			resched_task(rq->curr);
 		success = 1;
 	}
+	p->state = TASK_RUNNING;
 	task_rq_unlock(rq, &flags);
+
 	return success;
 }

 int wake_up_process(task_t * p)
 {
-	return try_to_wake_up(p);
+	return try_to_wake_up(p, 0);
 }

 void wake_up_forked_process(task_t * p)
 {
 	runqueue_t *rq;

-	preempt_disable();
-	rq = this_rq();
-	spin_lock_irq(&rq->lock);
+	rq = rq_lock(rq);

 	p->state = TASK_RUNNING;
 	if (!rt_task(p)) {
@@ -371,8 +406,7 @@
 	p->thread_info->cpu = smp_processor_id();
 	activate_task(p, rq);

-	spin_unlock_irq(&rq->lock);
-	preempt_enable();
+	rq_unlock(rq);
 }

 /*
@@ -401,19 +435,18 @@
 }

 #if CONFIG_SMP || CONFIG_PREEMPT
-asmlinkage void schedule_tail(void)
+asmlinkage void schedule_tail(task_t *prev)
 {
-	spin_unlock_irq(&this_rq()->frozen);
+	finish_arch_switch(this_rq());
+	finish_arch_schedule(prev);
 }
 #endif

-static inline void context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(task_t *prev, task_t *next)
 {
 	struct mm_struct *mm = next->mm;
 	struct mm_struct *oldmm = prev->active_mm;

-	prepare_to_switch();
-
 	if (unlikely(!mm)) {
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
@@ -427,7 +460,9 @@
 	}

 	/* Here we just switch the register state and the stack. */
-	switch_to(prev, next);
+	switch_to(prev, next, prev);
+
+	return prev;
 }

 unsigned long nr_running(void)
@@ -773,6 +808,7 @@
 	rq = this_rq();

 	release_kernel_lock(prev, smp_processor_id());
+	prepare_arch_schedule(prev);
 	prev->sleep_timestamp = jiffies;
 	spin_lock_irq(&rq->lock);
@@ -828,28 +864,20 @@
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		rq->curr = next;
-		spin_lock(&rq->frozen);
-		spin_unlock(&rq->lock);
-
-		context_switch(prev, next);
-
-		/*
-		 * The runqueue pointer might be from another CPU
-		 * if the new task was last running on a different
-		 * CPU - thus re-load it.
-		 */
-		mb();
+
+		prepare_arch_switch(rq);
+		prev = context_switch(prev, next);
+		barrier();
 		rq = this_rq();
-		spin_unlock_irq(&rq->frozen);
-	} else {
+		finish_arch_switch(rq);
+	} else
 		spin_unlock_irq(&rq->lock);
-	}
+	finish_arch_schedule(prev);

 	reacquire_kernel_lock(current);
 	preempt_enable_no_resched();
 	if (test_thread_flag(TIF_NEED_RESCHED))
 		goto need_resched;
-	return;
 }

 #ifdef CONFIG_PREEMPT
@@ -880,7 +908,7 @@
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
 {
 	struct list_head *tmp;
 	unsigned int state;
@@ -891,7 +919,7 @@
 		curr = list_entry(tmp, wait_queue_t, task_list);
 		p = curr->task;
 		state = p->state;
-		if ((state & mode) && try_to_wake_up(p) &&
+		if ((state & mode) && try_to_wake_up(p, sync) &&
 			((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
 				break;
 	}
@@ -905,17 +933,36 @@
 		return;

 	spin_lock_irqsave(&q->lock, flags);
-	__wake_up_common(q, mode, nr_exclusive);
+	__wake_up_common(q, mode, nr_exclusive, 0);
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+	unsigned long flags;
+
+	if (unlikely(!q))
+		return;
+
+	spin_lock_irqsave(&q->lock, flags);
+	if (likely(nr_exclusive))
+		__wake_up_common(q, mode, nr_exclusive, 1);
+	else
+		__wake_up_common(q, mode, nr_exclusive, 0);
 	spin_unlock_irqrestore(&q->lock, flags);
 }

+#endif
+
 void complete(struct completion *x)
 {
 	unsigned long flags;

 	spin_lock_irqsave(&x->wait.lock, flags);
 	x->done++;
-	__wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+	__wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
 	spin_unlock_irqrestore(&x->wait.lock, flags);
 }

@@ -1339,27 +1386,34 @@

 asmlinkage long sys_sched_yield(void)
 {
-	runqueue_t *rq;
-	prio_array_t *array;
-
-	preempt_disable();
-	rq = this_rq();
+	runqueue_t *rq = rq_lock(rq);
+	prio_array_t *array = current->array;

 	/*
-	 * Decrease the yielding task's priority by one, to avoid
-	 * livelocks. This priority loss is temporary, it's recovered
-	 * once the current timeslice expires.
+	 * There are three levels of how a yielding task will give up
+	 * the current CPU:
 	 *
-	 * If priority is already MAX_PRIO-1 then we still
-	 * roundrobin the task within the runlist.
-	 */
-	spin_lock_irq(&rq->lock);
-	array = current->array;
-	/*
-	 * If the task has reached maximum priority (or is a RT task)
-	 * then just requeue the task to the end of the runqueue:
+	 * #1 - it decreases its priority by one. This priority loss is
+	 *      temporary, it's recovered once the current timeslice
+	 *      expires.
+	 *
+	 * #2 - once it has reached the lowest priority level,
+	 *      it will give up timeslices one by one. (We do not
+	 *      want to give them up all at once, it's gradual,
+	 *      to protect the casual yield()er.)
+	 *
+	 * #3 - once all timeslices are gone we put the process into
+	 *      the expired array.
+	 *
+	 * (special rule: RT tasks do not lose any priority, they just
+	 * roundrobin on their current priority level.)
 	 */
-	if (likely(current->prio == MAX_PRIO-1 || rt_task(current))) {
+	if (likely(current->prio == MAX_PRIO-1)) {
+		if (current->time_slice <= 1) {
+			dequeue_task(current, rq->active);
+			enqueue_task(current, rq->expired);
+		} else
+			current->time_slice--;
+	} else if (unlikely(rt_task(current))) {
 		list_del(&current->run_list);
 		list_add_tail(&current->run_list, array->queue + current->prio);
 	} else {
@@ -1370,8 +1424,7 @@
 		list_add_tail(&current->run_list, array->queue + current->prio);
 		__set_bit(current->prio, array->bitmap);
 	}
-	spin_unlock(&rq->lock);
-	preempt_enable_no_resched();
+	spin_unlock_no_resched(&rq->lock);

 	schedule();

@@ -1599,7 +1652,6 @@
 		rq->active = rq->arrays;
 		rq->expired = rq->arrays + 1;
 		spin_lock_init(&rq->lock);
-		spin_lock_init(&rq->frozen);
 		INIT_LIST_HEAD(&rq->migration_queue);

 		for (j = 0; j < 2; j++) {
@@ -1687,7 +1739,15 @@
 		task_rq_unlock(rq, &flags);
 		goto out;
 	}
-
+	/*
+	 * If the task is not on a runqueue (and not running), then
+	 * it is sufficient to simply update the task's cpu field.
+	 */
+	if (!p->array && (p != rq->curr)) {
+		p->thread_info->cpu = __ffs(p->cpus_allowed);
+		task_rq_unlock(rq, &flags);
+		goto out;
+	}
 	init_MUTEX_LOCKED(&req.sem);
 	req.task = p;
 	list_add(&req.list, &rq->migration_queue);

^ permalink raw reply	[flat|nested] 52+ messages in thread
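[Editor's note: Ingo's three-level yield policy can be sketched as a tiny user-space model. MAX_PRIO, the field names, and the timeslice counts below are illustrative stand-ins, not the kernel's actual values or data structures.]

```c
#include <stdbool.h>

/* Toy model of the three-level sched_yield() policy quoted above.
 * MAX_PRIO and the timeslice numbers are made up for illustration. */
#define MAX_PRIO 40

struct toy_task {
    int  prio;        /* higher number = lower priority */
    int  time_slice;  /* remaining timeslices */
    bool expired;     /* true once moved to the expired array */
    bool rt;          /* real-time task? */
};

static void toy_yield(struct toy_task *t)
{
    if (t->rt)
        return;                  /* RT tasks round-robin at their level */
    if (t->prio < MAX_PRIO - 1)
        t->prio++;               /* level #1: drop one priority level */
    else if (t->time_slice > 1)
        t->time_slice--;         /* level #2: give up timeslices one by one */
    else
        t->expired = true;       /* level #3: move to the expired array */
}
```

Each yield costs the caller a little more, so a tight sched_yield() loop gradually sinks into the expired array instead of starving an already-expired CPU hog, which is the behaviour the patch aims for.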
* Re: Question about sched_yield()
  2002-06-15 22:15 Question about sched_yield() mgix
  2002-06-16 14:43 ` [patch] " Ingo Molnar
@ 2002-06-18  0:46 ` David Schwartz
  2002-06-18  0:55   ` Robert Love
  ` (4 more replies)
  1 sibling, 5 replies; 52+ messages in thread
From: David Schwartz @ 2002-06-18 0:46 UTC (permalink / raw)
To: mgix, linux-kernel

On Sat, 15 Jun 2002 15:15:32 -0700, mgix@mgix.com wrote:

>Hello,
>
>I am seeing some strange linux scheduler behaviours,
>and I thought this'd be the best place to ask.
>
>I have two processes, one that loops forever and
>does nothing but calling sched_yield(), and the other
>is basically benchmarking how fast it can compute
>some long math calculation.

	[snip]

	You seem to have a misconception about what sched_yield is for. It is not
a replacement for blocking or a scheduling priority adjustment. It simply
lets other ready-to-run tasks be scheduled before returning to the current
task.

	Here's a quote from SuS3:

	"The sched_yield() function shall force the running thread to relinquish
the processor until it again becomes the head of its thread list. It takes
no arguments."

	This neither says nor implies anything about CPU usage. It simply says
that the current thread will yield and be put at the end of the list.

	DS

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: Question about sched_yield()
  2002-06-18  0:46 ` David Schwartz
@ 2002-06-18  0:55 ` Robert Love
  2002-06-18  1:51   ` mgix
  ` (2 more replies)
  2002-06-18  1:41 ` mgix
  ` (3 subsequent siblings)
  4 siblings, 3 replies; 52+ messages in thread
From: Robert Love @ 2002-06-18 0:55 UTC (permalink / raw)
To: David Schwartz; +Cc: mgix, linux-kernel

On Mon, 2002-06-17 at 17:46, David Schwartz wrote:

> You seem to have a misconception about what sched_yield is for.
> It is not a replacement for blocking or a scheduling priority
> adjustment. It simply lets other ready-to-run tasks be scheduled
> before returning to the current task.
>
> Here's a quote from SuS3:
>
> "The sched_yield() function shall force the running thread to relinquish
> the processor until it again becomes the head of its thread list. It
> takes no arguments."
>
> This neither says nor implies anything about CPU usage. It simply says
> that the current thread will yield and be put at the end of the list.

And you seem to have a misconception about sched_yield, too. If a
machine has n tasks, half of which are doing CPU-intense work and the
other half of which are just yielding... why on Earth would the yielding
tasks get any noticeable amount of CPU use?

Seems to me the behavior of sched_yield is a bit broken. If the tasks
are correctly returned to the end of their runqueue, this should not
happen. Note, for example, you will not see this behavior in 2.5.

Quite frankly, even if the supposed standard says nothing of this... I
do not care: calling sched_yield in a loop should not show up as a CPU
hog.

	Robert Love

^ permalink raw reply	[flat|nested] 52+ messages in thread
* RE: Question about sched_yield()
  2002-06-18  0:55 ` Robert Love
@ 2002-06-18  1:51 ` mgix
  2002-06-18  3:18 ` David Schwartz
  2002-06-18  9:36 ` David Schwartz
  2 siblings, 0 replies; 52+ messages in thread
From: mgix @ 2002-06-18 1:51 UTC (permalink / raw)
To: Robert Love, David Schwartz; +Cc: linux-kernel

> Seems to me the behavior of sched_yield is a bit broken. If the tasks
> are correctly returned to the end of their runqueue, this should not
> happen. Note, for example, you will not see this behavior in 2.5.

Actually, it seems to happen in 2.5 too. However, Ingo sent me a patch
for 2.5.21 that fixes the issue. See this message:

http://marc.theaimsgroup.com/?l=linux-kernel&m=102423901727214&w=2

- Mgix

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: Question about sched_yield()
  2002-06-18  0:55 ` Robert Love
  2002-06-18  1:51 ` mgix
@ 2002-06-18  3:18 ` David Schwartz
  2002-06-18  9:36 ` David Schwartz
  2 siblings, 0 replies; 52+ messages in thread
From: David Schwartz @ 2002-06-18 3:18 UTC (permalink / raw)
To: rml; +Cc: mgix, linux-kernel

>> This neither says nor implies anything about CPU usage. It simply says
>>that the current thread will yield and be put at the end of the list.

>And you seem to have a misconception about sched_yield, too. If a
>machine has n tasks, half of which are doing CPU-intense work and the
>other half of which are just yielding... why on Earth would the yielding
>tasks get any noticeable amount of CPU use?

	Because they're running infinite loops!

>Quite frankly, even if the supposed standard says nothing of this... I
>do not care: calling sched_yield in a loop should not show up as a CPU
>hog.

	Calling any function that does not block in an endless loop *should* show
up as a CPU hog. Yielding is not blocking or even lowering your priority.

	DS

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: Question about sched_yield()
  2002-06-18  0:55 ` Robert Love
  2002-06-18  1:51 ` mgix
  2002-06-18  3:18 ` David Schwartz
@ 2002-06-18  9:36 ` David Schwartz
  2002-06-18 16:58   ` Chris Friesen
  2 siblings, 1 reply; 52+ messages in thread
From: David Schwartz @ 2002-06-18 9:36 UTC (permalink / raw)
To: rml; +Cc: mgix, linux-kernel

>And you seem to have a misconception about sched_yield, too. If a
>machine has n tasks, half of which are doing CPU-intense work and the
>other half of which are just yielding... why on Earth would the yielding
>tasks get any noticeable amount of CPU use?

	Because they are not blocking. They are in an endless CPU burning loop.
They should get CPU use for the same reason they should get CPU use if
they're the only threads running. They are always ready-to-run.

>Quite frankly, even if the supposed standard says nothing of this... I
>do not care: calling sched_yield in a loop should not show up as a CPU
>hog.

	It has to. What if the only task running is:

	while(1) sched_yield();

	What would you expect?

	You cannot use sched_yield as a replacement for blocking or scheduling
priority. You cannot use it to be 'nice'. You can only use it to try to
give other threads a chance to run, usually to give them a chance to
release a resource.

	Imagine if a spinlock uses sched_yield in its wait loop, what would you
expect on a dual-CPU system with a process on CPU A holding the lock? You
*want* the sched_yield process to burn the CPU fully, so that it notices
the spinlock acquisition as soon as possible.

	While linux's sched_yield behavior is definitely sub-optimal, the
particular criticism being leveled (that sched_yield processes get too
much CPU) is wrong-headed.

	DS

^ permalink raw reply	[flat|nested] 52+ messages in thread
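[Editor's note: the spinlock scenario mentioned above is the classic legitimate use of sched_yield(): keep spinning, but let the lock holder run between attempts. A minimal user-space sketch using C11 atomics follows; the type and function names are invented for illustration and are not any real kernel or libc API.]

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical yield-based spinlock: spin until the flag is free,
 * yielding the CPU after each failed attempt so the current holder
 * (perhaps on another CPU) gets a chance to run and release it. */
typedef struct {
    atomic_flag locked;
} yield_lock_t;

static void yield_lock(yield_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        sched_yield();   /* busy, but polite: let others run */
}

static void yield_unlock(yield_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A waiter built this way is always ready-to-run, which is exactly the point being argued: it must not be demoted too aggressively, or it will be slow to notice the release.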
* Re: Question about sched_yield()
  2002-06-18  9:36 ` David Schwartz
@ 2002-06-18 16:58 ` Chris Friesen
  2002-06-18 17:12   ` Richard B. Johnson
  ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Chris Friesen @ 2002-06-18 16:58 UTC (permalink / raw)
To: David Schwartz; +Cc: rml, mgix, linux-kernel

David Schwartz wrote:
>
> >And you seem to have a misconception about sched_yield, too. If a
> >machine has n tasks, half of which are doing CPU-intense work and the
> >other half of which are just yielding... why on Earth would the yielding
> >tasks get any noticeable amount of CPU use?
>
> Because they are not blocking. They are in an endless CPU burning loop. They
> should get CPU use for the same reason they should get CPU use if they're the
> only threads running. They are always ready-to-run.
>
> >Quite frankly, even if the supposed standard says nothing of this... I
> >do not care: calling sched_yield in a loop should not show up as a CPU
> >hog.
>
> It has to. What if the only task running is:
>
> while(1) sched_yield();
>
> What would you expect?

If there is only the one task, then sure it's going to be 100% cpu on that task.

However, if there is anything else other than the idle task that wants to run,
then it should run until it exhausts its timeslice.

One process looping on sched_yield() and another one doing calculations should
result in almost the entire system being devoted to calculations.

Chris

--
Chris Friesen                | MailStop: 043/33/F10
Nortel Networks              | work: (613) 765-0557
3500 Carling Avenue          | fax:  (613) 765-2986
Nepean, ON K2H 8E9 Canada    | email: cfriesen@nortelnetworks.com

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: Question about sched_yield()
  2002-06-18 16:58 ` Chris Friesen
@ 2002-06-18 17:12 ` Richard B. Johnson
  2002-06-18 17:19   ` mgix
  2002-06-18 17:13 ` Robert Love
  2002-06-18 17:23 ` Rik van Riel
  2 siblings, 1 reply; 52+ messages in thread
From: Richard B. Johnson @ 2002-06-18 17:12 UTC (permalink / raw)
To: Chris Friesen; +Cc: David Schwartz, rml, mgix, linux-kernel

On Tue, 18 Jun 2002, Chris Friesen wrote:

> David Schwartz wrote:
> >
> > >And you seem to have a misconception about sched_yield, too. If a
> > >machine has n tasks, half of which are doing CPU-intense work and the
> > >other half of which are just yielding... why on Earth would the yielding
> > >tasks get any noticeable amount of CPU use?
> >
> > Because they are not blocking. They are in an endless CPU burning loop. They
> > should get CPU use for the same reason they should get CPU use if they're the
> > only threads running. They are always ready-to-run.
> >
> > >Quite frankly, even if the supposed standard says nothing of this... I
> > >do not care: calling sched_yield in a loop should not show up as a CPU
> > >hog.
> >
> > It has to. What if the only task running is:
> >
> > while(1) sched_yield();
> >
> > What would you expect?
>
> If there is only the one task, then sure it's going to be 100% cpu on that task.
>
> However, if there is anything else other than the idle task that wants to run,
> then it should run until it exhausts its timeslice.
>
> One process looping on sched_yield() and another one doing calculations should
> result in almost the entire system being devoted to calculations.
>
> Chris

It's all in the accounting. Use usleep(0) if you want it to "look good".

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Windows-2000/Professional isn't.

^ permalink raw reply	[flat|nested] 52+ messages in thread
* RE: Question about sched_yield()
  2002-06-18 17:12 ` Richard B. Johnson
@ 2002-06-18 17:19 ` mgix
  2002-06-18 18:01   ` David Schwartz
  2002-06-18 18:21   ` Richard B. Johnson
  0 siblings, 2 replies; 52+ messages in thread
From: mgix @ 2002-06-18 17:19 UTC (permalink / raw)
To: root, Chris Friesen; +Cc: David Schwartz, rml, linux-kernel

> It's all in the accounting. Use usleep(0) if you want it to "look good".

Two things:

1. First, I think there's a misunderstanding on what my original issue
was: I am not interested in any way in CPU cycle accounting, or in
whether the yielding loop should log any of it. All I want is: when I
run a bunch of yielders and an actual working process, I want the
working process to not be slowed down (wall clock) in any way. That's
all. What top shows is of little interest (to me). What matters is how
many real-world seconds it takes for the actually working process to
complete its task. And that should not be affected by the presence of
running yielders. And, David, no one is arguing the fact that a yielder
running all by itself should log 100% of the CPU.

2. I have a question about usleep(0). You seem to make the point that
usleep(0) is equivalent to sched_yield(). I can see how intuitively this
should be the case, but I am not sure if it will always be true. It's
certainly not documented anywhere.

- Mgix

^ permalink raw reply	[flat|nested] 52+ messages in thread
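[Editor's note: whether usleep(0) really behaves like sched_yield() on a given kernel and libc is easiest to settle by measuring. A rough timing sketch follows; the helper names and iteration counts are arbitrary, and the numbers will vary wildly between systems, so no expected output is claimed.]

```c
#include <sched.h>
#include <unistd.h>
#include <sys/time.h>

/* Microseconds since the epoch, same approach as loop.cpp above. */
static long long usecs_now(void)
{
    struct timeval t;
    gettimeofday(&t, 0);
    return (long long)t.tv_sec * 1000000 + t.tv_usec;
}

/* Time n calls of fn, returning elapsed microseconds. */
static long long time_calls(void (*fn)(void), int n)
{
    long long t0 = usecs_now();
    for (int i = 0; i < n; ++i)
        fn();
    return usecs_now() - t0;
}

static void call_yield(void)   { sched_yield(); }
static void call_usleep0(void) { usleep(0); }
```

Comparing time_calls(call_yield, 100000) against time_calls(call_usleep0, 100000) shows immediately whether the two calls cost about the same per iteration on the system under test, or whether usleep(0) takes a different, slower path.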
* RE: Question about sched_yield() 2002-06-18 17:19 ` mgix @ 2002-06-18 18:01 ` David Schwartz 2002-06-18 18:05 ` mgix 2002-06-18 22:43 ` Olivier Galibert 2002-06-18 18:21 ` Richard B. Johnson 1 sibling, 2 replies; 52+ messages in thread From: David Schwartz @ 2002-06-18 18:01 UTC (permalink / raw) To: mgix, root, Chris Friesen; +Cc: rml, linux-kernel >All I want is: when I run a bunch of >yielders and an actual working process, I want the >working process to not be slowed down (wall clock) in >any way. That's all. What top shows is of little interest >(to me). What matters is how many real-world seconds it takes >for the actually working process to complete its task. >And that should not be affected by the presence of running >yielders. And, David, no one is arguing the fact that a yielder >running all by itself should log 100% of the CPU. Your assumptions are just plain wrong. The yielder is being nice, so it should get preferential treatment, not worse treatment. All threads are ready-to-run all the time. Yielding is not the same as blocking or lowering your priority. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 18:01 ` David Schwartz @ 2002-06-18 18:05 ` mgix 2002-06-18 19:11 ` David Schwartz 2002-06-18 22:43 ` Olivier Galibert 1 sibling, 1 reply; 52+ messages in thread From: mgix @ 2002-06-18 18:05 UTC (permalink / raw) To: David Schwartz, root, Chris Friesen; +Cc: rml, linux-kernel > Your assumptions are just plain wrong. The yielder is being nice, so it > should get preferential treatment, not worse treatment. All threads are > ready-to-run all the time. Yielding is not the same as blocking or lowering > your priority. In other words, the more you yield, the nicer you are and the more CPU you get, and those nasty processes that are trying to actually use the CPU to do something with it and wear it down should get it as little as possible. I get it. - Mgix ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 18:05 ` mgix @ 2002-06-18 19:11 ` David Schwartz 2002-06-18 16:58 ` Rob Landley 2002-06-18 19:25 ` Robert Love 0 siblings, 2 replies; 52+ messages in thread From: David Schwartz @ 2002-06-18 19:11 UTC (permalink / raw) To: mgix, root, Chris Friesen; +Cc: rml, linux-kernel On Tue, 18 Jun 2002 11:05:36 -0700, mgix@mgix.com wrote: >> Your assumptions are just plain wrong. The yielder is being nice, so it >>should get preferential treatment, not worse treatment. All threads are >>ready-to-run all the time. Yielding is not the same as blocking or lowering >>your priority. >In other words, the more you yield, the nicer you >are and the more CPU you get, and those nasty processes >that are trying to actually use the CPU to do something >with it and wear it down should get it as little as possible. >I get it. > > - Mgix I'm sorry, but you are being entirely unreasonable. The kernel has no way to know which processes are doing something useful and which ones are just wasting CPU. All it knows is that they are all ready-to-run and they all have the same priority and none of them are blocking. The yielding threads are playing nice and shouldn't be penalized for it. The following code: while(1) sched_yield(); is the problem, not the kernel. You can't expect the kernel's ESP to figure out that you really meant something else or that your process isn't really doing anything useful. If you didn't mean to burn the CPU in an endless loop, WHY DID YOU? You should never call sched_yield in a loop like this unless your intent is to burn the CPU until some other thread/process does something. Since you rarely want to do this, you should seldom if ever call sched_yield in a loop. What sched_yield is good for is if you encounter a situation where you need/want some resource and another thread/process has it. You call sched_yield once, and maybe when you run again, the other thread/process will have released it. 
You can also use it as the spin function in spinlocks. But your expectation that it will reduce CPU usage is just plain wrong. If you have one thread spinning on sched_yield, on a single CPU machine it will definitely get 100% of the CPU. If you have two, they will each definitely get 50% of the CPU. There are blocking functions and scheduler priority functions for this purpose. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 19:11 ` David Schwartz @ 2002-06-18 16:58 ` Rob Landley 2002-06-18 19:25 ` Robert Love 1 sibling, 0 replies; 52+ messages in thread From: Rob Landley @ 2002-06-18 16:58 UTC (permalink / raw) To: David Schwartz, mgix, root, Chris Friesen; +Cc: rml, linux-kernel On Tuesday 18 June 2002 03:11 pm, David Schwartz wrote: > I'm sorry, but you are being entirely unreasonable. The kernel has no way > to know which processes are doing something useful and which ones are just > wasting CPU. So the fact one of them is calling sched_yield (repeatedly) and the other one isn't doesn't count as just a little BIT of a hint? > What sched_yield is good for is if you encounter a situation where you > need/want some resource and another thread/process has it. You call > sched_yield once, and maybe when you run again, the other thread/process > will have released it. Not if it was a higher-priority task that has already exhausted its time slice... > You can also use it as the spin function in > spinlocks. In user space? Rob ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 19:11 ` David Schwartz 2002-06-18 16:58 ` Rob Landley @ 2002-06-18 19:25 ` Robert Love 2002-06-18 19:53 ` David Schwartz ` (2 more replies) 1 sibling, 3 replies; 52+ messages in thread From: Robert Love @ 2002-06-18 19:25 UTC (permalink / raw) To: David Schwartz; +Cc: mgix, root, Chris Friesen, linux-kernel On Tue, 2002-06-18 at 12:11, David Schwartz wrote: > I'm sorry, but you are being entirely unreasonable. No, sorry, you are. Listen to everyone else here. > If you didn't mean to burn the CPU in an endless loop, WHY DID YOU? It is not an endless loop. Here is the problem. You have n tasks. One type executes: while(1) ; the others execute: while(1) sched_yield(); the second bunch should _not_ receive as much CPU time as the first. This has nothing to do with intelligent work or blocking or picking your nose. It has everything to do with the fact that yielding means "relinquish my timeslice" and "put me at the end of the runqueue". If we are doing this, then why does the sched_yield'ing task monopolize the CPU? BECAUSE IT IS BROKEN. > You should never call sched_yield in a loop like this unless your intent is > to burn the CPU until some other thread/process does something. Since you > rarely want to do this, you should seldom if ever call sched_yield in a loop. But there are other tasks that wish to do something in these examples... > But your expectation that it will reduce CPU usage is just plain wrong. If > you have one thread spinning on sched_yield, on a single CPU machine it will > definitely get 100% of the CPU. If you have two, they will each definitely > get 50% of the CPU. There are blocking functions and scheduler priority > functions for this purpose. If they are all by themselves, of course they will get 100% of the CPU. No one is saying sched_yield is equivalent to blocking. 
But all the processes being equal, one that keeps yielding its timeslice should not get as much CPU time as one that does not. Why is that not logical to you? The original report was that given one task of the second case (above) and two of the first, everything was fine - the yielding task received little CPU as it continually relinquishes its timeslice. In the alternative case, there are two of each types of tasks. Now, the CPU is split and the yielding tasks are receiving a chunk of it. Why? Because the yielding behavior is broken and the tasks are continually yielding and rescheduling back and forth. So there is an example of how it should work and how it does. It is broken. There isn't even really an argument. Ingo and I have both identified this is a problem in 2.4 and 2.5 and Ingo fixed it in 2.5. If 2.5 no longer has this behavior, then are you saying it is NOW broken? Robert Love ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 19:25 ` Robert Love @ 2002-06-18 19:53 ` David Schwartz 2002-06-18 20:12 ` mgix 2002-06-18 20:08 ` Richard B. Johnson 2002-06-19 11:10 ` Bill Davidsen 2 siblings, 1 reply; 52+ messages in thread From: David Schwartz @ 2002-06-18 19:53 UTC (permalink / raw) To: rml; +Cc: mgix, root, Chris Friesen, cfriesen, linux-kernel On 18 Jun 2002 12:25:13 -0700, Robert Love wrote: >> If you didn't mean to burn the CPU in an endless loop, WHY DID YOU? >It is not an endless loop. Here is the problem. You have n tasks. One >type executes: > > while(1) ; > >the others execute: > > while(1) > sched_yield(); >the second bunch should _not_ receive has much CPU time as the first. Why not? They're both at the same priority. They're both always ready-to-run. Neither blocks. Why should the kernel assume that one will do useful work while the other won't? >This has nothing to do with intelligent work or blocking or picking your >nose. It has everything to do with the fact that yielding means >"relinquish my timeslice" and "put me at the end of the runqueue". And consider me immediately ready to run again. >> You should never call sched_yield in a loop like this unless your >>intent >>is >>to burn the CPU until some other thread/process does something. Since you >>rarely want to do this, you should seldom if ever call sched_yield in a >>loop. >But there are other tasks that wish to do something in these examples... All the tasks are equally situated. Same priority, all ready-to-run. >> But your expectation that it will reduce CPU usage is just plain wrong. >>If >> >>you have one thread spinning on sched_yield, on a single CPU machine it >>will >>definitely get 100% of the CPU. If you have two, they will each definitely >>get 50% of the CPU. There are blocking functions and scheduler priority >>functions for this purpose. >If they are all by themselves, of course they will get 100% of the CPU. 
>No one is saying sched_yield is equivalent to blocking. I am not even >saying it should get no CPU! It should get a bunch. But all the >processes being equal, one that keeps yielding its timeslice should not >get as much CPU time as one that does not. Why is that not logical to >you? Because a thread/process that voluntarily yields its timeslice should be rewarded over one that burns it up, not penalized. >The original report was that given one task of the second case (above) >and two of the first, everything was fine - the yielding task received >little CPU as it continually relinquishes its timeslice. In the >alternative case, there are two of each types of tasks. Now, the CPU is >split and the yielding tasks are receiving a chunk of it. Why? Because >the yielding behavior is broken and the tasks are continually yielding >and rescheduling back and forth. So there is an example of how it >should work and how it does. It is broken. So you're saying that the yielding tasks should be penalized for yielding rather than just being descheduled? >There isn't even really an argument. Ingo and I have both identified >this is a problem in 2.4 and 2.5 and Ingo fixed it in 2.5. If 2.5 no >longer has this behavior, then are you saying it is NOW broken? I'm saying that the fixed behavior is better than the previous behavior and that the previous behavior wasn't particularly good. But I'm also saying that people are misusing sched_yield and have incorrect expectations about it. If you spin on sched_yield, expect CPU wastage. It's really that simple. Threads/processes that use sched_yield intelligently should be rewarded for doing so, not penalized. Otherwise sched_yield becomes much less useful. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 19:53 ` David Schwartz @ 2002-06-18 20:12 ` mgix 2002-06-18 20:42 ` David Schwartz 0 siblings, 1 reply; 52+ messages in thread From: mgix @ 2002-06-18 20:12 UTC (permalink / raw) To: David Schwartz, rml; +Cc: root, Chris.Friesen, cfriesen, linux-kernel Let's do a little thought experiment with 2 naive scheduling algorithms and see what happens: Algorithm 1: the FIFO (favours those that haven't had it yet over those who already had it)
- All ready-to-run processes are in a FIFO queue.
- The top of the queue is the yielder.
- It's given a slice of CPU time to run on.
- It gives it back right away.
- It's sent to the back of the queue.
- The next in line is a task that does real work.
- It gets a time slice and fully uses it.
- etc ...
Algorithm 1 will have the expected behaviour: yielders won't consume anything, workers get it all. Algorithm 2: the Monte-Carlo scheduler (favours no one, schedules blindly)
- All ready-to-run processes are kept in a black bag.
- The scheduler randomly grabs one out of the bag.
- It's given a slice of CPU time to run on.
- If it's a yielder, it gives the CPU right back.
- If it's an actual worker, it makes full use of the slice.
- etc ...
Lo and behold, Algorithm 2 exhibits the same behaviour as well: yielders get nothing since they give it all away, and workers get it all. Now, if I understand you well enough David, you'd like an algorithm where the less you want the CPU, the more you get it. I'd love it if you could actually give us an outline of your ideal scheduler so I can try my thought experiment on it, because from what I've understood so far, your hypothetical scheduler would allocate all of the CPU to the yielders. Also, since it seems to worry you: no, I'm not using sched_yield to implement pseudo-blocking behaviour. - Mgix ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 20:12 ` mgix @ 2002-06-18 20:42 ` David Schwartz 2002-06-18 20:47 ` mgix 2002-06-18 22:28 ` Ingo Molnar 0 siblings, 2 replies; 52+ messages in thread From: David Schwartz @ 2002-06-18 20:42 UTC (permalink / raw) To: mgix, rml; +Cc: root, Chris Friesen, cfriesen, linux-kernel >Now, if I understand you well enough David, you'd like an >algorithm where the less you want the CPU, the more you get >it. Exactly. This is the UNIX tradition of static and dynamic priorities. The more polite you are about yielding the CPU when you don't need it, the more claim you have to getting it when you do need it. >I'd love if you could actually give us an outlook of >your ideal scheduler so I can try my thought experiment on it, >because from what I've understood so far, your hypothetical >scheduler would allocate all of the CPU to the yielders. Not all, just the same share any other process gets. They're all ready-to-run, they're all at the same priority. >Also, since it seems to worry you: no I'm not using sched_yield >to implement pseudo-blocking behaviour. Then tell us what you are doing so we can tell you the *right* way to do it. Unless this is just an abstract theoretical exercise, you shouldn't complain when ready-to-run threads get the CPU. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 20:42 ` David Schwartz @ 2002-06-18 20:47 ` mgix 2002-06-18 22:00 ` David Schwartz 2002-06-18 22:28 ` Ingo Molnar 1 sibling, 1 reply; 52+ messages in thread From: mgix @ 2002-06-18 20:47 UTC (permalink / raw) To: David Schwartz, rml; +Cc: root, Chris.Friesen, cfriesen, linux-kernel > >Now, if I understand you well enough David, you'd like an > >algorithm where the less you want the CPU, the more you get > >it. > > Exactly. This is the UNIX tradition of static and dynamic priorities. The > more polite you are about yielding the CPU when you don't need it, the more > claim you have to getting it when you do need it. > > >I'd love if you could actually give us an outlook of > >your ideal scheduler so I can try my thought experiment on it, > >because from what I've understood so far, your hypothetical > >scheduler would allocate all of the CPU to the yielders. > > Not all, just the same share any other process gets. They're all > ready-to-run, they're all at the same priority. Correct my logic, please: 1. Rule: The less you want the CPU, the more you get it. 2. A yielder process never wants the CPU 3. As a result of Rule 1, it always gets it. ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 20:47 ` mgix @ 2002-06-18 22:00 ` David Schwartz 0 siblings, 0 replies; 52+ messages in thread From: David Schwartz @ 2002-06-18 22:00 UTC (permalink / raw) To: mgix, rml; +Cc: root, Chris Friesen, cfriesen, linux-kernel >Correct my logic, please: > > 1. Rule: The less you want the CPU, the more you get it. The more you relinquish the CPU, the more you get it when you do want it. (Dynamic priority.) > 2. A yielder process never wants the CPU A yielder process *always* wants the CPU, but always relinquishes it when it gets it. (It's always ready-to-run.) > 3. As a result of Rule 1, it always gets it. The correct rules 1 and 2 don't lead to the conclusion you think. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 20:42 ` David Schwartz 2002-06-18 20:47 ` mgix @ 2002-06-18 22:28 ` Ingo Molnar 1 sibling, 0 replies; 52+ messages in thread From: Ingo Molnar @ 2002-06-18 22:28 UTC (permalink / raw) To: David Schwartz Cc: mgix, Robert Love, root, Chris Friesen, cfriesen, linux-kernel On Tue, 18 Jun 2002, David Schwartz wrote: > Exactly. This is the UNIX tradition of static and dynamic > priorities. The more polite you are about yielding the CPU when you > don't need it, the more claim you have to getting it when you do need > it. firstly, the thing that defines the scheduler's implementation details is not tradition but actual, hard, verifiable use. If you've ever seen sched_yield() code under Linux then you'll quickly realize that it's mainly used as an "oh damn, i cannot continue now, i have no proper kernel object to sleep on, lets busy-wait" kind of mechanism. (Which, by the way, is as far from polite as it gets.) secondly, i'd like to ask everyone arguing one way or the other, and this means you and Robert and everyone else as well, to actually read and think about the fix i did. The new implementation of sched_yield() is *not* throwing away your timeslices immediately. In fact it first tries it the 'polite' way to get some progress by gradually decreasing its priority - *IF* that fails (and it has to fail and keep looping for at least ~10 times if there are default priorities) then it will start dropping away timeslices - one by one. That gives another 20 opportunities for the task to actually get some work done. Even after this we still let the task run one more jiffy. *THEN* only is the task pushed back into the expired array. so the tests quoted here were the extreme example: a task did nothing but sched_yield(). And the modified scheduler did a good job of suppressing this process, if there was other, non-sched_yield() work around. A database process with a casual sched_yield() call should not see much effect. 
Ingo ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 19:25 ` Robert Love 2002-06-18 19:53 ` David Schwartz @ 2002-06-18 20:08 ` Richard B. Johnson 2002-06-19 11:10 ` Bill Davidsen 2 siblings, 0 replies; 52+ messages in thread From: Richard B. Johnson @ 2002-06-18 20:08 UTC (permalink / raw) To: Robert Love; +Cc: David Schwartz, mgix, Chris Friesen, linux-kernel On 18 Jun 2002, Robert Love wrote: > On Tue, 2002-06-18 at 12:11, David Schwartz wrote: > > > I'm sorry, but you are being entirely unreasonable. > > No, sorry, you are. Listen to everyone else here. > > > If you didn't mean to burn the CPU in an endless loop, WHY DID YOU? > > It is not an endless loop. Here is the problem. You have n tasks. One > type executes: > > while(1) ; > > the others execute: > > while(1) > sched_yield(); > > the second bunch should _not_ receive has much CPU time as the first. > This has nothing to do with intelligent work or blocking or picking your > nose. It has everything to do with the fact that yielding means > "relinquish my timeslice" and "put me at the end of the runqueue". > > If we are doing this, then why does the sched_yield'ing task monopolize > the CPU? BECAUSE IT IS BROKEN. > > > You should never call sched_yield in a loop like this unless your intent is > > to burn the CPU until some other thread/process does something. Since you > > rarely want to do this, you should seldom if ever call sched_yield in a loop. > > But there are other tasks that wish to do something in these examples... > > > But your expectation that it will reduce CPU usage is just plain wrong. If > > you have one thread spinning on sched_yield, on a single CPU machine it will > > definitely get 100% of the CPU. If you have two, they will each definitely > > get 50% of the CPU. There are blocking functions and scheduler priority > > functions for this purpose. > > If they are all by themselves, of course they will get 100% of the CPU. > No one is saying sched_yield is equivalent to blocking. 
I am not even > saying it should get no CPU! It should get a bunch. But all the > processes being equal, one that keeps yielding its timeslice should not > get as much CPU time as one that does not. Why is that not logical to > you? > > The original report was that given one task of the second case (above) > and two of the first, everything was fine - the yielding task received > little CPU as it continually relinquishes its timeslice. In the > alternative case, there are two of each types of tasks. Now, the CPU is > split and the yielding tasks are receiving a chunk of it. Why? Because > the yielding behavior is broken and the tasks are continually yielding > and rescheduling back and forth. So there is an example of how it > should work and how it does. It is broken. > > There isn't even really an argument. Ingo and I have both identified > this is a problem in 2.4 and 2.5 and Ingo fixed it in 2.5. If 2.5 no > longer has this behavior, then are you saying it is NOW broken? > > Robert Love Let's put in some numbers.

Task 0: for(;;) ;
Task 1: for(;;) sched_yield();

I will assume that scheduling takes 0 seconds and system calls also take 0 seconds. This should be fair if I modify Task 0 so it does: for(;;) getuid(); which means both tasks make continuous system calls, which should take, roughly, the same time. Initial conditions have Task 0 spinning for HZ time until preempted, so Task 0 starts up with HZ more time than Task 1. Task 1 now gets the CPU. When Task 1 gets the CPU, it immediately gives it up; it does not wait until it's preempted like Task 0. Task 0 now gets the CPU and keeps it for HZ time. Clearly, Task 0 should be getting at least HZ more CPU time than Task 1. And, yawn, it probably does! I just launched 400 of the following programs:

#include <sched.h>

int main(int c, char *argv[])
{
    for(;;)
        sched_yield();
    return c;
}

This is a dual CPU system. Top shows 184% system CPU usage and 14% user CPU usage. 
3:54pm up 23:30, 2 users, load average: 14.74, 8.08, 3.30
463 processes: 62 sleeping, 401 running, 0 zombie, 0 stopped
CPU states: 13.7% user, 186.3% system, 0.0% nice, 0.0% idle
Mem: 320720K av, 215816K used, 104904K free, 0K shrd, 11752K buff
Swap: 1044208K av, 1044208K used, 0K free, 131732K cached

PID   USER  PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
8627  root  18  0  208  208 172   R    0   15.6 0.0  0:31 xxx
8637  root  18  0  208  208 172   R    0   15.6 0.0  0:30 xxx
8638  root  18  0  208  208 172   R    0   15.6 0.0  0:30 xxx
8624  root  15  0  208  208 172   R    0   13.7 0.0  0:32 xxx
8629  root  15  0  208  208 172   R    0   13.7 0.0  0:31 xxx
8632  root  15  0  208  208 172   R    0   13.7 0.0  0:30 xxx
8626  root  15  0  208  208 172   R    0   12.7 0.0  0:31 xxx
8628  root  15  0  208  208 172   R    0   12.7 0.0  0:31 xxx
8633  root  15  0  208  208 172   R    0   12.7 0.0  0:30 xxx
8639  root  15  0  208  208 172   R    0   12.7 0.0  0:30 xxx
8625  root  14  0  208  208 172   R    0   11.7 0.0  0:32 xxx
8630  root  15  0  208  208 172   R    0   11.7 0.0  0:30 xxx
8634  root  14  0  208  208 172   R    0   11.7 0.0  0:30 xxx
8635  root  15  0  208  208 172   R    0   11.7 0.0  0:30 xxx
8636  root  15  0  208  208 172   R    0   11.7 0.0  0:30 xxx
[SNIPPED]

The system is as responsive as ever even though there are all those 'CPU burners', each showing they take roughly 12 percent of the CPU. Let's see 400 * 0.12 = 48 == accounting error, nothing more. Cheers, Dick Johnson Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips). Windows-2000/Professional isn't. ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 19:25 ` Robert Love 2002-06-18 19:53 ` David Schwartz 2002-06-18 20:08 ` Richard B. Johnson @ 2002-06-19 11:10 ` Bill Davidsen 2002-06-19 12:04 ` Ingo Molnar 2 siblings, 1 reply; 52+ messages in thread From: Bill Davidsen @ 2002-06-19 11:10 UTC (permalink / raw) To: Robert Love; +Cc: Linux Kernel Mailing List On 18 Jun 2002, Robert Love wrote: > the second bunch should _not_ receive has much CPU time as the first. > This has nothing to do with intelligent work or blocking or picking your > nose. It has everything to do with the fact that yielding means > "relinquish my timeslice" and "put me at the end of the runqueue". I have put some ideas below, I agree with "relinquish" but not end of queue. > If we are doing this, then why does the sched_yield'ing task monopolize > the CPU? BECAUSE IT IS BROKEN. Consider the case where a threaded application and a CPU hog are running. In sum the threaded process is also a hog, although any one thread is not. I find I can figure out problems like this if I start at the desired result and work back, and what I believe is the desired result is that the pure hog gets half the CPU and the threaded application, in aggregate, gets the other. So to provide this, I would think that the desired action at sched_yield() would look somewhat like this:

if there is less than {sched overhead} on the timeslice
    add to end of queue
    do accounting to reflect the time used
    run the next thread
else if there is another thread of this process
    queue this thread following the last thread of this process
    give the remaining time slice to the top thread of this process
else
    put the yield thread at the end of the queue

One question which arises is cpu affinity, and I confess that I can't predict the better of process affinity or contested resource affinity (run the next thread now). But for uni machines, I think what I outline above does result in the behaviour I find most intuitive and desirable. 
On your comment about leaving the yielded thread at the top of the queue, putting the yielded thread at the end of the queue effectively forces the context switch time to be the sum of all the timeslices given in one pass through the queue. In a system with a contested resource and at least one hog, I think that would make the threaded task starve for CPU. As noted by several people, queueing back at the head also can behave badly for some cases, although the demo program provided is probably not typical. I think what I suggest will avoid both problems, although it might have other drawbacks, like sched overhead if the queue was very long and didn't have other threads. Atypical, but a max "scan for another thread" count could be provided, or even a per-process count of runnable threads in the queue, which is probably more overhead than just scanning for light load. At least under light load you can afford it. Your comments? -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-19 11:10 ` Bill Davidsen @ 2002-06-19 12:04 ` Ingo Molnar 0 siblings, 0 replies; 52+ messages in thread From: Ingo Molnar @ 2002-06-19 12:04 UTC (permalink / raw) To: Bill Davidsen; +Cc: Robert Love, Linux Kernel Mailing List On Wed, 19 Jun 2002, Bill Davidsen wrote: > Consider the case where a threaded application and a CPU hog are > running. In sum the threaded process is also a hog, although any one > thread is not. it is a mistake to assume that sched_yield() == 'threaded applications'. sched_yield() is a horrible interface to base userspace spinlocks on, it's not in any way cheaper than real blocking, even in the best-case situation when there is no 'useless' yielding back and forth. In the worst-case situation it can be a few orders more expensive (!) than blocking-based solutions. LinuxThread's use of sched_yield() was and remains a hack. I'm actually surprised that it works for real applications. We are trying to improve things as much as possible on the kernel scheduler side, but since there is absolutely no object/contention information available for kernel space (why it wants to schedule away, which other processes are using the blocked object, etc.), it's impossible to do a perfect job - in fact it's not even possible to do a 'good' job. IMO people interested in high-performance threading should concentrate on the lightweight, kernel-backed userspace semaphores patch(es) that have been presented a few weeks ago, and merge those concepts into LinuxThreads (or into their own threading/spinlock solutions). Those patches do it properly, and the scheduler will be able to give all the help it can. Ingo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 18:01 ` David Schwartz 2002-06-18 18:05 ` mgix @ 2002-06-18 22:43 ` Olivier Galibert 1 sibling, 0 replies; 52+ messages in thread From: Olivier Galibert @ 2002-06-18 22:43 UTC (permalink / raw) To: linux-kernel On Tue, Jun 18, 2002 at 11:01:53AM -0700, David Schwartz wrote: > Your assumptions are just plain wrong. The yielder is being nice, so it > should get preferential treatment, not worse treatment. Heh? Yielding is not about being nice, if you want to be nice you lower your priority. Yielding is telling the scheduler "I am temporarily out of things to do, maybe for a while, but at least until all the other running threads got a chance to run too". Any scheduler that runs you immediately again without running the other threads is _broken_. OG. ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 17:19 ` mgix 2002-06-18 18:01 ` David Schwartz @ 2002-06-18 18:21 ` Richard B. Johnson 1 sibling, 0 replies; 52+ messages in thread From: Richard B. Johnson @ 2002-06-18 18:21 UTC (permalink / raw) To: mgix; +Cc: Chris Friesen, David Schwartz, rml, linux-kernel On Tue, 18 Jun 2002 mgix@mgix.com wrote: > > > It's all in the accounting. Use usleep(0) if you want it to "look good". > > > Two things: > > 1. First, I think there's a misunderstanding about what my > original issue was: I am not interested in any way in > CPU cycle accounting, or in whether the yielding loop should > log any of it. All I want is: when I run a bunch of > yielders and an actual working process, I want the > working process to not be slowed down (wall clock) in > any way. That's all. What top shows is of little interest > (to me). What matters is how many real-world seconds it takes > for the actually working process to complete its task. > And that should not be affected by the presence of running > yielders. And, David, no one is arguing the fact that a yielder > running all by itself should log 100% of the CPU. > Well the problem seems to be an inconsistency I reported to 'the list' some time ago. As I recall, Ingo replied that I should use usleep(0) and the problem I reported would "go away". It did go away. However, if you run this on a single-processor machine, 'top' will show that each task gets 99+ percent of the CPU, which of course can't be correct.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sched.h>

int main(int c, char *argv[])
{
    if(!fork()) {
        strcpy(argv[0], "Child");
        for(;;)
            ;
    }
    strcpy(argv[0], "Parent");
    for(;;)
        sched_yield();
    return c;
}

So, it seems that the guy that's yielding the CPU gets 'charged' for the CPU time he gave away. Fair enough, I guess. As I see it, sched_yield() will give the CPU to any single computable task once. After this, the caller gets the CPU back whether or not he wants it. > 2. 
I have a question about usleep(0). You seem to make the point > that usleep(0) is equivalent to sched_yield(). I can see how > intuitively this should be the case, but I am not sure if it > will always be true. It's certainly not documented anywhere. > No. I did not mention or imply "equivalent", only that you can use it instead, in many, perhaps most, instances. Cheers, Dick Johnson Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips). Windows-2000/Professional isn't. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 16:58 ` Chris Friesen 2002-06-18 17:12 ` Richard B. Johnson @ 2002-06-18 17:13 ` Robert Love 2002-06-18 18:00 ` David Schwartz 2002-06-18 17:23 ` Rik van Riel 2 siblings, 1 reply; 52+ messages in thread From: Robert Love @ 2002-06-18 17:13 UTC (permalink / raw) To: Chris Friesen; +Cc: David Schwartz, mgix, linux-kernel On Tue, 2002-06-18 at 09:58, Chris Friesen wrote: > David Schwartz wrote: > > > What would you expect? > > If there is only the one task, then sure it's going to be 100% cpu on that > task. > > However, if there is anything else other than the idle task that wants to > run, then it should run until it exhausts its timeslice. > > One process looping on sched_yield() and another one doing calculations > should result in almost the entire system being devoted to calculations. Exactly. The reason the behavior is odd is not because the sched_yield task is getting any CPU, David. I realize sched_yield is not equivalent to blocking. The reason this behavior is suspect is because the task is receiving a similar amount of CPU to tasks that are _not_ yielding but in fact doing useful work for the entire duration of their timeslice. A task that continually uses its timeslice vs one that yields should easily receive a greater amount of CPU, but this is not the case. As someone who works in the scheduler, this points out that sched_yield is, well, broken. First guess would be it is queuing to the front of the runqueue (it once did this but I thought it was fixed) or otherwise exhausting the timeslice wrong. Someone pointed out this bug existed similarly in 2.5, although it was a bit different. 2.5 has a different (and better, imo) sched_yield implementation that tries to overcome certain shortcomings and also perform optimally and fairly. Robert Love ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 17:13 ` Robert Love @ 2002-06-18 18:00 ` David Schwartz 2002-06-18 22:45 ` Stevie O 0 siblings, 1 reply; 52+ messages in thread From: David Schwartz @ 2002-06-18 18:00 UTC (permalink / raw) To: rml, Chris Friesen; +Cc: mgix, linux-kernel >Exactly. The reason the behavior is odd is not because the sched_yield >task is getting any CPU, David. I realize sched_yield is not equivalent >to blocking. Good. >The reason this behavior is suspect is because the task is receiving a >similar amount of CPU to tasks that are _not_ yielding but in fact doing >useful work for the entire duration of their timeslice. This is the same error repeated again. Since you realize that an endless loop on sched_yield is *not* equivalent to blocking, why do you then say "in fact doing useful work"? By what form of ESP is the kernel supposed to determine that the sched_yield task is not 'doing useful work' and the other task is? The kernel doesn't know the loop is endless. For all it knows, as soon as another process gets a drop of CPU, the yielding process may be able to succeed. And because the yielding process is being nice (by yielding), it should get better and better treatment over processes that are burning CPU rather than yielding. >A task that continually uses its timeslice vs one that yields should >easily receive a greater amount of CPU, but this is not the case. Why should the mean task get preferential treatment over the nice task when both are always ready-to-run? DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 18:00 ` David Schwartz @ 2002-06-18 22:45 ` Stevie O 2002-06-19 2:11 ` David Schwartz 2002-06-20 20:31 ` David Schwartz 0 siblings, 2 replies; 52+ messages in thread From: Stevie O @ 2002-06-18 22:45 UTC (permalink / raw) To: David Schwartz, rml, Chris Friesen; +Cc: mgix, linux-kernel At 11:00 AM 6/18/2002 -0700, David Schwartz wrote: >This is the same error repeated again. Since you realize that an endless >loop on sched_yield is *not* equivalent to blocking, why do you then say "in >fact doing useful work"? By what form of ESP is the kernel supposed to >determine that the sched_yield task is not 'doing useful work' and the other >task is? By this form of ESP: sched_yield() means "I have nothing better to do right now, give my time to someone who does". If a thread is doing useful work, why would it call sched_yield() ?!? -- Stevie-O Real programmers use COPY CON PROGRAM.EXE ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 22:45 ` Stevie O @ 2002-06-19 2:11 ` David Schwartz 2002-06-19 2:52 ` Stevie O 2002-06-20 20:31 ` David Schwartz 1 sibling, 1 reply; 52+ messages in thread From: David Schwartz @ 2002-06-19 2:11 UTC (permalink / raw) To: stevie, rml, Chris Friesen; +Cc: mgix, linux-kernel On Tue, 18 Jun 2002 18:45:55 -0400, Stevie O wrote: >At 11:00 AM 6/18/2002 -0700, David Schwartz wrote: >>This is the same error repeated again. Since you realize that an endless >>loop on sched_yield is *not* equivalent to blocking, why do you then say >>"in >>fact doing useful work"? By what form of ESP is the kernel supposed to >>determine that the sched_yield task is not 'doing useful work' and the >>other >>task is? >By this form of ESP: sched_yield() means "I have nothing better to do right >now, give my time to someone who does". No, this is not what sched_yield means. What 'sched_yield' means is that you're at a point where it's convenient for another process to run. For example, perhaps you just released a mutex that you held for a long period of time, or perhaps you could function more efficiently if you could acquire a resource another thread holds. >If a thread is doing useful work, >why would it call sched_yield() ?!? Perhaps to allow other threads to make forward progress. Perhaps to give other threads a chance to use a resource it just released. Perhaps in hopes that another thread will release a resource it could benefit from being able to acquire. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-19 2:11 ` David Schwartz @ 2002-06-19 2:52 ` Stevie O 0 siblings, 0 replies; 52+ messages in thread From: Stevie O @ 2002-06-19 2:52 UTC (permalink / raw) To: David Schwartz, rml, Chris Friesen; +Cc: mgix, linux-kernel At 07:11 PM 6/18/2002 -0700, David Schwartz wrote: >>By this form of ESP: sched_yield() means "I have nothing better to do right >>now, give my time to someone who does". > > No, this is not what sched_yield means. What 'sched_yield' means is that >you're at a point where it's convenient for another process to run. For >example, perhaps you just released a mutex that you held for a long period of >time, or perhaps you could function more efficiently if you could acquire a >resource another thread holds. To-may-to, to-mah-to. Okay. I'll try again. sched_yield() means "It would be convenient (or I would function more efficiently) if another process or thread were to run right now." i.e., "Would you be so kind as to run somebody else? As in, not me.". From what I've gathered in this thread, something akin to this happens: You are trying to get information out of any of three people. Two of the three people will have better/more reliable information, so you give them a higher priority. You choose a person to ask for information -> schedule() You ask person #1 for information. Person #1 says: "Ask someone else" -> sched_yield() You choose a person to ask for information -> schedule() You ask person #2 for information. Person #2 says: "Ask someone else" -> sched_yield() You choose a person to ask for information -> schedule() Now, any rational human being (who actually wants this information) will certainly proceed as such: You ask person #3 for information. Person #3 says: "Here is the information you need." You go on your merry way. 
However, the Linux scheduler will look at its options, see that Person #1 has a higher priority than Person #3 and that Person #1 is marked ready-to-run, and proceed: You ask person #1 for information. Person #1 says: "Ask someone else" -> sched_yield() Proceed to thrash between #1 and #2, ignoring #3. Now, don't get me wrong; I understand your argument. Like I said, Person #1 is not blocking (sched_yield() does not block), and is at a higher priority than #3. All other things being equal, Person #1 should be scheduled after Person #2 yields. But things aren't equal. There's a crucial difference here: Person #1 has relinquished his timeslice, and Person #3 hasn't. And I'm arguing that, if a thread/process/whatever calls sched_yield, then that thread should only be run in that timeslice if: * There are no other threads that are ready-to-run on that processor, or * Every other thread has called sched_yield(). Yes, the current behavior is technically correct; sched_yield() properly gives up control to another process. But it will sometimes give up the processor to another process that previously did a sched_yield() in the current timeslice. And that's just stupid. >>If a thread is doing useful work, >>why would it call sched_yield() ?!? > Perhaps to allow other threads to make forward progress. Perhaps to give >other threads a chance to use a resource it just released. Perhaps in hopes >that another thread will release a resource it could benefit from being able >to acquire. Yeah. And if two threads decide to be 'nice' -- to allow a third thread to make forward progress -- neither will. The scheduler will thrash between the two threads, in preference to scheduling the third thread. This is in accordance with the strictest interpretation of sched_yield()'s definition. -- Stevie-O Real programmers use COPY CON PROGRAM.EXE ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 22:45 ` Stevie O 2002-06-19 2:11 ` David Schwartz @ 2002-06-20 20:31 ` David Schwartz 1 sibling, 0 replies; 52+ messages in thread From: David Schwartz @ 2002-06-20 20:31 UTC (permalink / raw) To: stevie, rml, Chris Friesen; +Cc: mgix, linux-kernel On Tue, 18 Jun 2002 18:45:55 -0400, Stevie O wrote: >At 11:00 AM 6/18/2002 -0700, David Schwartz wrote: >>This is the same error repeated again. Since you realize that an endless >>loop on sched_yield is *not* equivalent to blocking, why do you then say >>"in >>fact doing useful work"? By what form of ESP is the kernel supposed to >>determine that the sched_yield task is not 'doing useful work' and the >>other >>task is? >By this form of ESP: sched_yield() means "I have nothing better to do right >now, give my time to someone who does". If a thread is doing useful work, >why would it call sched_yield() ?!? To give other threads a chance to do useful work too, perhaps because it just released a mutex that other threads might need that it held for an unusually long time. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 16:58 ` Chris Friesen 2002-06-18 17:12 ` Richard B. Johnson 2002-06-18 17:13 ` Robert Love @ 2002-06-18 17:23 ` Rik van Riel 2002-06-18 17:50 ` Chris Friesen 2 siblings, 1 reply; 52+ messages in thread From: Rik van Riel @ 2002-06-18 17:23 UTC (permalink / raw) To: Chris Friesen; +Cc: David Schwartz, rml, mgix, linux-kernel On Tue, 18 Jun 2002, Chris Friesen wrote: > > It has to. What if the only task running is: > > > > while(1) sched_yield(); > > > > What would you expect? > > If there is only the one task, then sure it's going to be 100% cpu on > that task. > > However, if there is anything else other than the idle task that wants > to run, then it should run until it exhausts its timeslice. > > One process looping on sched_yield() and another one doing calculations > should result in almost the entire system being devoted to calculations. So if you have a database with 20 threads yielding to each other each time a lock is contended and one CPU hog the database should be reduced to a very small percentage of the CPU ? regards, Rik -- http://www.linuxsymposium.org/2002/ "You're one of those condescending OLS attendants" "Here's a nickle kid. Go buy yourself a real t-shirt" http://www.surriel.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 17:23 ` Rik van Riel @ 2002-06-18 17:50 ` Chris Friesen 0 siblings, 0 replies; 52+ messages in thread From: Chris Friesen @ 2002-06-18 17:50 UTC (permalink / raw) To: Rik van Riel; +Cc: David Schwartz, rml, mgix, linux-kernel Rik van Riel wrote: > > On Tue, 18 Jun 2002, Chris Friesen wrote: > > However, if there is anything else other than the idle task that wants > > to run, then it should run until it exhausts its timeslice. > > > > One process looping on sched_yield() and another one doing calculations > > should result in almost the entire system being devoted to calculations. > > So if you have a database with 20 threads yielding to each other > each time a lock is contended and one CPU hog the database should > be reduced to a very small percentage of the CPU ? If the database threads are at a higher priority than the cpu hog, then no they shouldn't. The one that owns the lock should progress as normal. Assuming that only the thread owning the lock progresses, the cpu usage ratio between the threads as a group and the hog should be roughly equivalent to a single database thread and a single hog. We certainly shouldn't thrash the yielding threads and starve the hog. Chris -- Chris Friesen | MailStop: 043/33/F10 Nortel Networks | work: (613) 765-0557 3500 Carling Avenue | fax: (613) 765-2986 Nepean, ON K2H 8E9 Canada | email: cfriesen@nortelnetworks.com ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 0:46 ` David Schwartz 2002-06-18 0:55 ` Robert Love @ 2002-06-18 1:41 ` mgix 2002-06-18 3:21 ` David Schwartz 2002-06-18 4:55 ` Ingo Molnar ` (2 subsequent siblings) 4 siblings, 1 reply; 52+ messages in thread From: mgix @ 2002-06-18 1:41 UTC (permalink / raw) To: David Schwartz, linux-kernel > -----Original Message----- > From: David Schwartz [mailto:davids@webmaster.com] > Sent: Monday, June 17, 2002 5:46 PM > To: mgix@mgix.com; linux-kernel@vger.kernel.org > Subject: Re: Question about sched_yield() > > > > On Sat, 15 Jun 2002 15:15:32 -0700, mgix@mgix.com wrote: > > > >Hello, > > > >I am seeing some strange linux scheduler behaviours, > >and I thought this'd be the best place to ask. > > > >I have two processes, one that loops forever and > >does nothing but calling sched_yield(), and the other > >is basically benchmarking how fast it can compute > >some long math calculation. > [snip] > > You seem to have a misconception about what sched_yield is for. It is not a > replacement for blocking or a scheduling priority adjustment. It simply lets > other ready-to-run tasks be scheduled before returning to the current task. > > Here's a quote from SuS3: > > "The sched_yield() function shall force the running thread to relinquish the > processor until it again becomes the head of its thread list. It takes no > arguments." > > This neither says nor implies anything about CPU usage. It simply says that > the current thread will yield and be put at the end of the list. If so, please enlighten me as to when, why, and what for you would use sched_yield. If willingly and knowingly relinquishing a CPU does not make it possible for other processes to use what would otherwise have been your very own slice of processing time, then what could it be used for, I really wonder. 
Second, I have tried to run my misconception on various other OSes I have access to: Win2k, Mac OSX and OpenBSD, and surprisingly enough, all of them seem to share my twisted views of How Things Should Be (tm). - Mgix ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 1:41 ` mgix @ 2002-06-18 3:21 ` David Schwartz 2002-06-18 3:52 ` mgix 0 siblings, 1 reply; 52+ messages in thread From: David Schwartz @ 2002-06-18 3:21 UTC (permalink / raw) To: mgix, linux-kernel >> This neither says nor implies anything about CPU usage. It simply says >>that >>the current thread will yield and be put at the end of the list. > >If so, please enlighten me as to when, why, and what for you would use >sched_yield. Generally because you can't do something until some other thread/process does something, so you give it a chance to finish immediately before trying something in a more expensive way. >If willingly and knowingly relinquishing a CPU does not make it possible >for other processes to use what would otherwise have been your very own >slice >of processing time then what could it be used for, I really wonder. It does! That's what it's for. >Second, I have tried to run my misconception on various other OSes I have >access to: Win2k, Mac OSX and OpenBSD, and surprisingly enough, all of them >seem to be sharing my twisted views of How Things Should Be (tm). Just because they do what you naively expect doesn't validate your expectations. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield() 2002-06-18 3:21 ` David Schwartz @ 2002-06-18 3:52 ` mgix 0 siblings, 0 replies; 52+ messages in thread From: mgix @ 2002-06-18 3:52 UTC (permalink / raw) To: David Schwartz, linux-kernel > >If willingly and knowingly relinquinshing a CPU does not make it possible > >for other processes to use what would otherwise have been your very own > >slice > >of processing time then what could it be used for, I really wonder. > > It does! That's what it's for. Good, and now that we agree: Now back to the original point: sched_yield does not do the above on Linux as of today, which was the point of my original posting, and which is the reason Ingo posted a scheduler patch. - Mgix ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 0:46 ` David Schwartz 2002-06-18 0:55 ` Robert Love 2002-06-18 1:41 ` mgix @ 2002-06-18 4:55 ` Ingo Molnar 2002-06-19 11:24 ` Bill Davidsen 2002-06-18 18:56 ` Question about sched_yield() Rusty Russell 2002-06-19 2:10 ` jw schultz 4 siblings, 1 reply; 52+ messages in thread From: Ingo Molnar @ 2002-06-18 4:55 UTC (permalink / raw) To: David Schwartz; +Cc: mgix, linux-kernel On Mon, 17 Jun 2002, David Schwartz wrote: > >I am seeing some strange linux scheduler behaviours, > >and I thought this'd be the best place to ask. > > > >I have two processes, one that loops forever and > >does nothing but calling sched_yield(), and the other > >is basically benchmarking how fast it can compute > >some long math calculation. > [...] It is not a replacement for blocking or a scheduling priority > adjustment. It simply lets other ready-to-run tasks be scheduled before > returning to the current task. and this is what the scheduler didn't do properly: it actually didn't schedule valid ready-to-run processes for long milliseconds, it switched between two sched_yield processes, starving the CPU-intensive process. I posted a patch for 2.5 that fixes this, and the 2.4.19-pre10-ac2 backport i did includes this fix as well: http://redhat.com/~mingo/O(1)-scheduler/sched-2.4.19-pre10-ac2-A4 a good sched_yield() implementation should give *all* other tasks a chance to run, instead of switching between multiple sched_yield()-ing tasks. I don't think this is specified anywhere, but it's a quality-of-implementation issue. Ingo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 4:55 ` Ingo Molnar @ 2002-06-19 11:24 ` Bill Davidsen 2002-06-19 11:47 ` scheduler timeslice distribution, threads, processes. [was: Re: Question about sched_yield()] Ingo Molnar 0 siblings, 1 reply; 52+ messages in thread From: Bill Davidsen @ 2002-06-19 11:24 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linux Kernel Mailing List On Tue, 18 Jun 2002, Ingo Molnar wrote: > > On Mon, 17 Jun 2002, David Schwartz wrote: > > > >I am seeing some strange linux scheduler behaviours, > > >and I thought this'd be the best place to ask. > > > > > >I have two processes, one that loops forever and > > >does nothing but calling sched_yield(), and the other > > >is basically benchmarking how fast it can compute > > >some long math calculation. > > > [...] It is not a replacement for blocking or a scheduling priority > > adjustment. It simply lets other ready-to-run tasks be scheduled before > > returning to the current task. > > and this is what the scheduler didnt do properly, it actually didnt > schedule valid ready-to-run processes for long milliseconds, it switched > between two sched_yield processes, starving the CPU-intensive process. I > posted a patch for 2.5 that fixes this, and the 2.4.19-pre10-ac2 backport > i did includes this fix as well: > > http://redhat.com/~mingo/O(1)-scheduler/sched-2.4.19-pre10-ac2-A4 > > a good sched_yield() implementation should give *all* other tasks a chance > to run, instead of switching between multiple sched_yield()-ing tasks. I > don think this is specified anywhere, but it's a quality of implementation > issue. Clearly your fix is subtle enough that a quick reading, at least by me, can't follow all the nuances, but it looks on first reading as if your patch fixes this too well, and will now starve the threaded process by only giving one turn at the CPU per full pass through the queue, rather than sharing the timeslice between threads of a process. 
I posted some thoughts on this to Robert; if you would comment, I would appreciate it. My thought is that if you have N processes and one is threaded, the aggregate of all threads should be 1/N of the CPU(s), not vastly more or less. I think with your change it will be less if they are sharing a resource. Feel free to tell me I misread what it does, or that my desired behaviour is not correct. I do run 15 machines with a long-running threaded server application and periodic CPU-hog things like log roll and compress, stats generation, certain database cleanup, etc, and I have more than intellectual curiosity about this behaviour. -- bill davidsen <davidsen@tmr.com> CTO, TMR Associates, Inc Doing interesting things with little computers since 1979. ^ permalink raw reply [flat|nested] 52+ messages in thread
* scheduler timeslice distribution, threads, processes. [was: Re: Question about sched_yield()] 2002-06-19 11:24 ` Bill Davidsen @ 2002-06-19 11:47 ` Ingo Molnar 0 siblings, 0 replies; 52+ messages in thread From: Ingo Molnar @ 2002-06-19 11:47 UTC (permalink / raw) To: Bill Davidsen; +Cc: Linux Kernel Mailing List On Wed, 19 Jun 2002, Bill Davidsen wrote: > Clearly your fix is subtle enough that a quick reading, at least by me, > can't follow all the nuances, but it looks on first reading as if your > patch fixes this too well, and will now starve the threaded process by > only giving one turn at the CPU per full pass through the queue, rather > than sharing the timeslice between threads of a process. the important point is that to the Linux scheduler there is only one context of execution that matters: one Linux thread. The kernel itself is deeply threaded - in kernel mode all threads access all data structures of each other. Whether threads share certain resources (the VM or files) at the userspace level is irrelevant to the scheduler. My strong point is that this is not an accidental property of Linux, it's very intentional. CPU time partitioning and VM-sharing are two distinct concepts and they should not be artificially connected. > I posted some thoughts on this to Robert, if you would comment I would > appreciate. My thought is that if you have N processes and one is > threaded, the aggregate of all threads should be 1/N of the CPU(s), not > vastly more or less. I think with your change it will be less if they > are sharing a resource. Feel free to tell me I misread what it does, or > my desired behaviour is not correct. 'processes' are threads that have an isolated set of resources. The possible sharing relationships between threads are much more complex than what can be expressed via a simplistic 'lightweight process' vs. 'heavyweight process' picture. 
[or the 'container process/shell' and 'member threads' concept used in a particular OS, designed in the early 90s ;-) ] It's a fundamental property of the scheduler that it shares timeslices between threads of execution - regardless of the resource-sharing done by those threads. [in fact it's almost impossible to get an accurate picture of resource sharing that can be used by the scheduler, the resource sharing capabilities of Linux are very finegrained. It's possible to share just a single file descriptor between two threads, and it's possible to share a given segment of virtual memory as well.] so in your example, if you have 10 isolated threads ('normal processes'), and a 'process' that has 5 shared-all threads, the scheduler only sees that there are 15 threads in the system - and each thread will get 1/15th of the CPU resources [if all threads have the same priority]. but i'm not fundamentally against enabling users to partition their CPU resource usage better - ie. to be able to tell which particular set of threads should get a given percentage of CPU time. What i'm against is to tie, hardcode this to the 'all VM sharing' property of threads - like other OSs do. Why shouldn't it be possible to give the 10 isolated threads 30% of all CPU time, and the remaining 5 shared-all threads 70% CPU time? By decoupling the ability to partition CPU time in a finegrained way we not only can do what you describe, but we also win lots of flexibility, and the whole model becomes much cleaner. 
the CPU time used up increases with the number of threads), it's pretty close. But being able to partition timeslices more accurately is something that will happen eventually - and the scheduler is ready for it. Ingo ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 0:46 ` David Schwartz ` (2 preceding siblings ...) 2002-06-18 4:55 ` Ingo Molnar @ 2002-06-18 18:56 ` Rusty Russell 2002-06-18 19:12 ` David Schwartz 2002-06-19 11:29 ` Bill Davidsen 2002-06-19 2:10 ` jw schultz 4 siblings, 2 replies; 52+ messages in thread From: Rusty Russell @ 2002-06-18 18:56 UTC (permalink / raw) To: David Schwartz; +Cc: mgix, linux-kernel On Mon, 17 Jun 2002 17:46:29 -0700 David Schwartz <davids@webmaster.com> wrote: > "The sched_yield() function shall force the running thread to relinquish the > processor until it again becomes the head of its thread list. It takes no > arguments." Notice how incredibly useless this definition is. It's even defined in terms of UP. Rusty. -- there are those who do and those who hang on and you don't see too many doers quoting their contemporaries. -- Larry McVoy ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 18:56 ` Question about sched_yield() Rusty Russell @ 2002-06-18 19:12 ` David Schwartz 2002-06-18 20:19 ` Rusty Russell 2002-06-19 11:29 ` Bill Davidsen 1 sibling, 1 reply; 52+ messages in thread From: David Schwartz @ 2002-06-18 19:12 UTC (permalink / raw) To: rusty; +Cc: mgix, linux-kernel On Wed, 19 Jun 2002 04:56:06 +1000, Rusty Russell wrote: >On Mon, 17 Jun 2002 17:46:29 -0700 >David Schwartz <davids@webmaster.com> wrote: >>"The sched_yield() function shall force the running thread to relinquish >>the >>processor until it again becomes the head of its thread list. It takes no >>arguments." >Notice how incredibly useless this definition is. It's even defined in >terms >of UP. Huh?! This definition is beautiful in that it makes no such assumptions. How would you say this is invalid on an SMP machine? By "the processor", they mean "the processor on which the thread is running" (the only one it could relinquish, after all). DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 19:12 ` David Schwartz @ 2002-06-18 20:19 ` Rusty Russell 2002-06-18 20:40 ` David Schwartz 2002-06-18 20:42 ` mgix 0 siblings, 2 replies; 52+ messages in thread From: Rusty Russell @ 2002-06-18 20:19 UTC (permalink / raw) To: David Schwartz; +Cc: mgix, linux-kernel, mingo In message <20020618191233.AAA27954@shell.webmaster.com@whenever> you write: > > On Wed, 19 Jun 2002 04:56:06 +1000, Rusty Russell wrote: > > >On Mon, 17 Jun 2002 17:46:29 -0700 > >David Schwartz <davids@webmaster.com> wrote: > > >>"The sched_yield() function shall force the running thread to relinquish > >>the processor until it again becomes the head of its thread list. > >> It takes no arguments." > > >Notice how incredibly useless this definition is. It's even defined in > >terms of UP. > > Huh?! This definition is beautiful in that it makes no such > assumptions. How would you say this is invalid on an SMP machine? By > "the processor", they mean "the processor on which the thread is > running" (the only one it could relinquish, after all). Read again: they use "relinquish ... until", not "relinquish". Subtle difference. I have 32 processors and 32 threads. One does a yield(). What happens? What should happen? Given that yield is "sleep for some time but I won't tell you what I'm doing", I have no sympathy for yield users 8) Rusty. -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: Question about sched_yield() 2002-06-18 20:19 ` Rusty Russell @ 2002-06-18 20:40 ` David Schwartz 2002-06-18 20:42 ` mgix 1 sibling, 0 replies; 52+ messages in thread From: David Schwartz @ 2002-06-18 20:40 UTC (permalink / raw) To: rusty; +Cc: mgix, linux-kernel, mingo >>>>"The sched_yield() function shall force the running thread to relinquish >>>>the processor until it again becomes the head of its thread list. >>>>It takes no arguments." >>>Notice how incredibly useless this definition is. It's even defined in >>>terms of UP. >>Huh?! This definition is beautiful in that it makes no such= >>assumptions. How would you say this is invalid on an SMP machine? By >>"the= processor", they mean "the process on which the thread is >>running" (the only one= it could relinquish, after all). >Read again: they use "relinquish ... until", not "relinquish". Subtle >difference. So? >I have 32 processors and 32 threads. One does a yield(). What >happens? What should happen? It should relinquish the processor it is running on until it again becomes the head of its thread list. (IE, for as long as it takes the scheduler to decide that it's the thread to run.) >Given that yield is "sleep for some time but I won't tell you what I'm >doing", I have no sympathy for yield users 8) I have sympathy for those who use it properly, I have no sympathy for those who loop on sched_yield burning the CPU and then complaining that it burns CPU. DS ^ permalink raw reply [flat|nested] 52+ messages in thread
* RE: Question about sched_yield()
  2002-06-18 20:19 ` Rusty Russell
  2002-06-18 20:40 ` David Schwartz
@ 2002-06-18 20:42 ` mgix
  2002-06-18 22:03 ` David Schwartz
  2002-06-18 22:36 ` Ingo Molnar
  1 sibling, 2 replies; 52+ messages in thread
From: mgix @ 2002-06-18 20:42 UTC (permalink / raw)
To: rusty, David Schwartz; +Cc: linux-kernel, mingo

> Given that yield is "sleep for some time but I won't tell you what I'm
> doing", I have no sympathy for yield users 8)

Here's something interesting at last. True, when userland calls
sched_yield, the kernel is left in the dark. However, I am not aware of
any alternative for communicating what I really want to the scheduler,
and here's why. If anyone has ideas on how to do this better, please,
I'm all ears.

It's basically about spinlocks and the cost of task switching.
I'm trying to implement "smart" spinlocks.

Regular blocking locks based on some kind of kernel object are:

1. Slow, because every unsuccessful use of the lock will trigger a task
switch. There is no way to have two threads exchange messages at a very
fast rate with a blocking lock.

2. Expensive in terms of kernel resource consumption: you end up
allocating a kernel object each time you need a lock. If your app's
locking is sufficiently fine-grained, you end up allocating a huge
number of kernel objects.

3. To avoid problem 2, you can try to allocate kernel objects on the
fly, but then performance gets even worse: every lock
acquisition/release requires the creation/destruction of a kernel
object.

The alternative, userland spinlocks:

1. Cost zero in terms of kernel resource consumption.

2. Are a major catastrophe when running on a UP box (you can spin as
long as you want, nothing's going to change since there's only one CPU
and you're hogging it).

3. Are a CPU hog at best when running on an SMP box: the spinning
thread gobbles up a whole 100% of a CPU.

"Smart" spinlocks basically try to do it this way:

    int spinLoops = GetNumberOfProcsICanRunOn() > 1 ? someBigNumber : 1;
    while(1)
    {
        int n = spinLoops;
        while(n--) tryAndGetTheSpinLock();
        if(gotIt) break;
        sched_yield();
    }

These seem to have all the qualities I want:

1. On a UP box, the thread will poke at the spinlock only once, then
yield and *hopefully* not get scheduled again till later, giving the
lock owner a chance to finish its stuff and release. Net result: the
spinning thread causes no CPU consumption, no kernel object is
allocated, and locking behaves properly.

2. On an SMP box, the thread will bang on the spinlock a large number
of times, hoping to get it before it gets task-switched away. If it
does, great: no time lost. If it doesn't, we're out of luck; yield the
CPU and try again next time. Net result: a tunable balance between CPU
consumption and task-switching costs, no kernel object allocated, and
proper locking behaviour.

I'm sure this issue has been beaten to death before, but I just wanted
to give my $.02

- Mgix
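[Editor's note: mgix's pseudocode above can be fleshed out into a small
compilable sketch. The names (spin_t, spin_try, spin_acquire) and the
spin budget of 1000 are illustrative choices made here, not part of the
original post.]

```c
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* A "smart" spinlock sketch in the spirit of the post above. */
typedef struct { atomic_int locked; } spin_t;

/* How many CPUs this process can run on; fall back to 1 on error. */
static int ncpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* One attempt: atomically flip 0 -> 1; returns nonzero on success. */
static int spin_try(spin_t *s)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&s->locked, &expected, 1);
}

static void spin_acquire(spin_t *s)
{
    /* Spin hard only if another CPU could release the lock meanwhile;
     * on a UP box, poke once and yield immediately. */
    int budget = ncpus() > 1 ? 1000 : 1;
    for (;;) {
        for (int n = budget; n--; )
            if (spin_try(s))
                return;
        sched_yield();  /* let the lock holder make progress */
    }
}

static void spin_release(spin_t *s)
{
    atomic_store(&s->locked, 0);
}
```

As David Schwartz points out downthread, a loop like this still burns
CPU under sustained contention; the spin budget only bounds how long
each pass spins before yielding.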
* RE: Question about sched_yield()
  2002-06-18 20:42 ` mgix
@ 2002-06-18 22:03 ` David Schwartz
  2002-06-18 22:36 ` Ingo Molnar
  1 sibling, 0 replies; 52+ messages in thread
From: David Schwartz @ 2002-06-18 22:03 UTC (permalink / raw)
To: mgix, rusty; +Cc: linux-kernel, mingo

> 3. Are a CPU hog at best when running on an SMP box: the spinning
> thread gobbles up a whole 100% of a CPU.

For the few hundred cycles some other thread holds the lock.

>"Smart" spinlocks basically try to do it this way:
>
>    int spinLoops = GetNumberOfProcsICanRunOn() > 1 ? someBigNumber : 1;
>    while(1)
>    {
>        int n = spinLoops;
>        while(n--) tryAndGetTheSpinLock();
>        if(gotIt) break;
>        sched_yield();
>    }
>
>These seem to have all the qualities I want:

Almost.

>2. On an SMP box, the thread will bang on the spinlock a large
>number of times, hoping to get it before it gets task-switched away.
>If it does, great: no time lost.
>If it doesn't, we're out of luck; yield the CPU and try again next time.

You should limit how many times you spin in this loop. If it gets to be
too many, you should block. One way to block is simply to sleep for a few
milliseconds. If you don't like the idea that one thread will release the
lock while the other wastes time sleeping, then associate a kernel lock
with the spinlock when a thread gives up waiting, have your unlock
function check for an associated kernel lock, and if there is one,
unlock it.

DS
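[Editor's note: one way to read this advice is "spin a bounded number of
times on the userland fast path, then fall back to a kernel-backed lock
instead of sched_yield()". A rough sketch under those assumptions
follows; the names (sb_lock_t, sb_try) and the spin limit of 100 are
invented here, and a condition variable stands in for the "associated
kernel lock".]

```c
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    atomic_int locked;    /* 0 = free, 1 = held: the userland fast path */
    int waiters;          /* threads that gave up spinning and blocked */
    pthread_mutex_t m;    /* the "associated kernel lock" */
    pthread_cond_t cv;
} sb_lock_t;

#define SB_INIT { 0, 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }
#define SB_SPIN_LIMIT 100 /* arbitrary bound for the example */

static int sb_try(sb_lock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

static void sb_acquire(sb_lock_t *l)
{
    for (int n = 0; n < SB_SPIN_LIMIT; n++)
        if (sb_try(l))
            return;       /* got it on the cheap path */
    /* Contended for too long: block in the kernel instead of burning CPU. */
    pthread_mutex_lock(&l->m);
    l->waiters++;
    while (!sb_try(l))
        pthread_cond_wait(&l->cv, &l->m);
    l->waiters--;
    pthread_mutex_unlock(&l->m);
}

static void sb_release(sb_lock_t *l)
{
    atomic_store(&l->locked, 0);
    /* Enter the kernel only if someone actually gave up and blocked. */
    pthread_mutex_lock(&l->m);
    if (l->waiters > 0)
        pthread_cond_signal(&l->cv);
    pthread_mutex_unlock(&l->m);
}
```

The releaser takes the mutex before signalling, so a waiter that failed
its sb_try while holding the mutex cannot miss the wakeup.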
* RE: Question about sched_yield()
  2002-06-18 20:42 ` mgix
  2002-06-18 22:03 ` David Schwartz
@ 2002-06-18 22:36 ` Ingo Molnar
  1 sibling, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2002-06-18 22:36 UTC (permalink / raw)
To: mgix; +Cc: Rusty Russell, David Schwartz, linux-kernel, mingo

On Tue, 18 Jun 2002 mgix@mgix.com wrote:

> However, I am not aware of any alternative to communicate what I really
> want to the scheduler, and here's why. If anyone has ideas on how to do
> this better, please, I'm all ears.
>
> It's basically about spinlocks and the cost of task switching.
>
> I'm trying to implement "smart" spinlocks.

the right solution for you is the new ultra-fast kernel-helped
user-space semaphores. Those are kernel objects where the semaphore
data structure is accessible to user-space as well. Acquiring the
semaphore in the no-contention case is very cheap, basically a single
assembly instruction. In the contention case you'll enter the kernel
and schedule away. (in the sched_yield() case you schedule away just as
much in the contention case, so there isn't any difference.)

for short-lived but contended locks this mechanism can be improved a
bit by looping 'some more' on the lock before entering the kernel and
blocking - but it could also be a loss if the amount of additional
looping is too long or too short. Since the O(1) scheduler
context-switches in under 1 microsecond on a 1 GHz box, and has no SMP
interlocking and cache-thrashing problems, you'll still get 1 million
context switches per second, per CPU. And this is the 'slow' case.

	Ingo
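[Editor's note: the mechanism Ingo describes is what became futexes. A
minimal sketch in terms of the modern Linux futex(2) interface, which
post-dates this thread, is below. The 0/1 lock protocol shown is the
simplest correct one, not Ingo's actual implementation, and the code is
Linux-specific (glibc has no futex() wrapper, so raw syscall() is used).]

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrappers over the raw futex syscall. */
static long futex_wait(atomic_int *uaddr, int expected)
{
    /* Sleeps only if *uaddr still equals expected (no lost wakeups). */
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long futex_wake(atomic_int *uaddr, int nwake)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, nwake, NULL, NULL, 0);
}

/* Lock word: 0 = free, 1 = taken. */
static void ulock_acquire(atomic_int *f)
{
    int expected = 0;
    /* No-contention case: one compare-and-swap, no syscall at all. */
    while (!atomic_compare_exchange_strong(f, &expected, 1)) {
        /* Contended: sleep in the kernel while the word is still 1. */
        futex_wait(f, 1);
        expected = 0;
    }
}

static void ulock_release(atomic_int *f)
{
    atomic_store(f, 0);
    /* Always waking is wasteful; real implementations encode "waiters
     * present" in the lock word to skip this syscall when uncontended. */
    futex_wake(f, 1);
}
```

This gets exactly the cost profile Ingo describes: an uncontended
acquire is one atomic instruction, and only the contended path ever
enters the kernel.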
* Re: Question about sched_yield()
  2002-06-18 18:56 ` Question about sched_yield() Rusty Russell
  2002-06-18 19:12 ` David Schwartz
@ 2002-06-19 11:29 ` Bill Davidsen
  2002-06-19 14:03 ` Rusty Russell
  1 sibling, 1 reply; 52+ messages in thread
From: Bill Davidsen @ 2002-06-19 11:29 UTC (permalink / raw)
To: Rusty Russell; +Cc: Linux Kernel Mailing List

On Wed, 19 Jun 2002, Rusty Russell wrote:

> On Mon, 17 Jun 2002 17:46:29 -0700
> David Schwartz <davids@webmaster.com> wrote:
> > "The sched_yield() function shall force the running thread to relinquish the
> > processor until it again becomes the head of its thread list. It takes no
> > arguments."
>
> Notice how incredibly useless this definition is. It's even defined in terms
> of UP.

I think you parse this differently than I do; I see no reference to UP.
The term "the processor" clearly (to me at least) means the processor
running that thread at the time of the yield.

The number of processors running a single thread at any one time is an
integer in the range zero to one.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Question about sched_yield()
  2002-06-19 11:29 ` Bill Davidsen
@ 2002-06-19 14:03 ` Rusty Russell
  2002-06-19 22:25 ` Bill Davidsen
  0 siblings, 1 reply; 52+ messages in thread
From: Rusty Russell @ 2002-06-19 14:03 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Linux Kernel Mailing List

In message <Pine.LNX.3.96.1020619072548.1119D-100000@gatekeeper.tmr.com> you write:
> On Wed, 19 Jun 2002, Rusty Russell wrote:
>
> > On Mon, 17 Jun 2002 17:46:29 -0700
> > David Schwartz <davids@webmaster.com> wrote:
> > > "The sched_yield() function shall force the running thread to relinquish the
> > > processor until it again becomes the head of its thread list. It takes no
> > > arguments."
> >
> > Notice how incredibly useless this definition is. It's even defined in terms
> > of UP.
>
> I think you parse this differently than I do; I see no reference to UP.
> The term "the processor" clearly (to me at least) means the processor
> running that thread at the time of the yield.
>
> The number of processors running a single thread at any one time is an
> integer in the range zero to one.

It's the word "until": "relinquish the processor until".

It's pretty clearly implied that it's going to "unrelinquish" *the
processor* at the end of this process.

So, by your definition, it can be scheduled on another CPU before it
becomes head of the thread list?

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
* Re: Question about sched_yield()
  2002-06-19 14:03 ` Rusty Russell
@ 2002-06-19 22:25 ` Bill Davidsen
  2002-06-19 22:37 ` Ingo Molnar
  0 siblings, 1 reply; 52+ messages in thread
From: Bill Davidsen @ 2002-06-19 22:25 UTC (permalink / raw)
To: Rusty Russell; +Cc: Linux Kernel Mailing List

On Thu, 20 Jun 2002, Rusty Russell wrote:

> In message <Pine.LNX.3.96.1020619072548.1119D-100000@gatekeeper.tmr.com> you write:
> > On Wed, 19 Jun 2002, Rusty Russell wrote:
> >
> > > On Mon, 17 Jun 2002 17:46:29 -0700
> > > David Schwartz <davids@webmaster.com> wrote:
> > > > "The sched_yield() function shall force the running thread to relinquish the
> > > > processor until it again becomes the head of its thread list. It takes no
> > > > arguments."
> > >
> > > Notice how incredibly useless this definition is. It's even defined in terms
> > > of UP.
> >
> > I think you parse this differently than I do; I see no reference to UP.
> > The term "the processor" clearly (to me at least) means the processor
> > running that thread at the time of the yield.
> >
> > The number of processors running a single thread at any one time is an
> > integer in the range zero to one.
>
> It's the word "until": "relinquish the processor until".
>
> It's pretty clearly implied that it's going to "unrelinquish" *the
> processor* at the end of this process.
>
> So, by your definition, it can be scheduled on another CPU before it
> becomes head of the thread list?

I have to read "head of the thread list" broadly, and assume it means the
thread will be run when it is the most appropriate thread to run. I don't
read the wording as requiring or forbidding SMP, uni, or a strict
round-robin scheduler. The term "head of the thread list" doesn't state
that all other processes must get a time slice. I believe I clarified my
concerns in another post; I don't want to repeat them.

I don't want a process with threads contending for a resource to get all
or none of the CPU, each of which is possible with various schedulers and
patches recently available. I'd like to see threads of a single process be
able to get, use, and share a timeslice before some CPU hog comes in and
gets his timeslice.

I don't read the text you quoted as requiring much more than "run someone
else," because it's comfortably vague. To me anyway. I'm not knocking
anyone who reads it more strictly; I just don't see it the way you did.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
* Re: Question about sched_yield()
  2002-06-19 22:25 ` Bill Davidsen
@ 2002-06-19 22:37 ` Ingo Molnar
  0 siblings, 0 replies; 52+ messages in thread
From: Ingo Molnar @ 2002-06-19 22:37 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Rusty Russell, Linux Kernel Mailing List

On Wed, 19 Jun 2002, Bill Davidsen wrote:

> [...] I'd like to see threads of a single process be able to get, use,
> and share a timeslice before some CPU hog comes in and gets his
> timeslice.

there is no such concept as 'threads of a single process' in Linux, and
this is not just a naming difference. In Linux threads are threads, and
whether they share the same set of pagetables or not is secondary to
the kernel. (there are lots of other resources they might or might not
share with each other.)

the OS where processes 'own' threads, where the process is a container,
where this concept is pretty much the only meaningful multiprogramming
concept, and where the kernel API is separated into per-thread and
per-process parts is not called Linux.

	Ingo
* Re: Question about sched_yield()
  2002-06-18 0:46 ` David Schwartz
  ` (3 preceding siblings ...)
  2002-06-18 18:56 ` Question about sched_yield() Rusty Russell
@ 2002-06-19 2:10 ` jw schultz
  4 siblings, 0 replies; 52+ messages in thread
From: jw schultz @ 2002-06-19 2:10 UTC (permalink / raw)
To: linux-kernel

Let's back up a moment please.

On Mon, Jun 17, 2002 at 05:46:29PM -0700, David Schwartz wrote:
>
> On Sat, 15 Jun 2002 15:15:32 -0700, mgix@mgix.com wrote:
> >
> >Hello,
> >
> >I am seeing some strange linux scheduler behaviours,
> >and I thought this'd be the best place to ask.
> >
> >I have two processes, one that loops forever and
> >does nothing but calling sched_yield(), and the other
> >is basically benchmarking how fast it can compute
> >some long math calculation.
> [snip]
>
> You seem to have a misconception about what sched_yield is for. It is not a
> replacement for blocking or a scheduling priority adjustment. It simply lets
> other ready-to-run tasks be scheduled before returning to the current task.
>
> Here's a quote from SuS3:
>
> "The sched_yield() function shall force the running thread to relinquish the
> processor until it again becomes the head of its thread list. It takes no
> arguments."

From this description it sounds like sched_yield was intended to deal
with scheduling of threads within a process, not between processes.
Namely, that there is a "thread list" within a single process context.
This would be used for yielding to another thread within the same list
(context) when the current thread needs a contended-for resource (an
inter-thread/intra-context lock) or the result of another thread's
calculation. Or simply to allow switching, round-robin style, between a
set of compute-intensive threads by salting the compute loops with
sched_yield. Am I right?

Linux, as we know, treats threads and independent processes as equals.
Therefore, the process scheduler becomes responsible for dealing with
sched_yield and somehow translating sched_yield's intent to a situation
where threads are contending with other processes instead of just with
their fellows, and their fellows can wind up with differing priorities.
Handling sched_yield wrongly could result in a live/deadlock (I'm fuzzy
at the moment on the distinction, sorry) if one thread is
sched_yield/spinning on a resource pending action of a lower priority
thread.

It would seem that sched_yield should make sure that the calling thread
has a priority lower than the lowest priority thread within the thread
group. "until it again becomes the head of its thread list" sounds to
me like all other threads in the group should have an opportunity to
run before the calling thread does. I would suggest that in the case
where there are no other runnable threads in the group, the process
should sleep(0) so another (lower priority) process can run.

--
________________________________________________________________
  J.W. Schultz            Pegasystems Technologies
  email address:          jw@pegasys.ws

  Remember Cernan and Schmitt
Thread overview: 52+ messages
2002-06-15 22:15 Question about sched_yield() mgix
2002-06-16 14:43 ` [patch] " Ingo Molnar
2002-06-18  0:46 ` David Schwartz
2002-06-18  0:55   ` Robert Love
2002-06-18  1:51   ` mgix
2002-06-18  3:18     ` David Schwartz
2002-06-18  9:36       ` David Schwartz
2002-06-18 16:58         ` Chris Friesen
2002-06-18 17:12           ` Richard B. Johnson
2002-06-18 17:19             ` mgix
2002-06-18 18:01               ` David Schwartz
2002-06-18 18:05                 ` mgix
2002-06-18 19:11                   ` David Schwartz
2002-06-18 16:58         ` Rob Landley
2002-06-18 19:25           ` Robert Love
2002-06-18 19:53             ` David Schwartz
2002-06-18 20:12               ` mgix
2002-06-18 20:42                 ` David Schwartz
2002-06-18 20:47                   ` mgix
2002-06-18 22:00                     ` David Schwartz
2002-06-18 22:28                       ` Ingo Molnar
2002-06-18 20:08             ` Richard B. Johnson
2002-06-19 11:10               ` Bill Davidsen
2002-06-19 12:04                 ` Ingo Molnar
2002-06-18 22:43           ` Olivier Galibert
2002-06-18 18:21         ` Richard B. Johnson
2002-06-18 17:13       ` Robert Love
2002-06-18 18:00         ` David Schwartz
2002-06-18 22:45           ` Stevie O
2002-06-19  2:11             ` David Schwartz
2002-06-19  2:52               ` Stevie O
2002-06-20 20:31                 ` David Schwartz
2002-06-18 17:23       ` Rik van Riel
2002-06-18 17:50         ` Chris Friesen
2002-06-18  1:41   ` mgix
2002-06-18  3:21     ` David Schwartz
2002-06-18  3:52       ` mgix
2002-06-18  4:55         ` Ingo Molnar
2002-06-19 11:24           ` Bill Davidsen
2002-06-19 11:47             ` scheduler timeslice distribution, threads, processes. [was: Re: Question about sched_yield()] Ingo Molnar
2002-06-18 18:56   ` Question about sched_yield() Rusty Russell
2002-06-18 19:12     ` David Schwartz
2002-06-18 20:19       ` Rusty Russell
2002-06-18 20:40         ` David Schwartz
2002-06-18 20:42         ` mgix
2002-06-18 22:03           ` David Schwartz
2002-06-18 22:36           ` Ingo Molnar
2002-06-19 11:29     ` Bill Davidsen
2002-06-19 14:03       ` Rusty Russell
2002-06-19 22:25         ` Bill Davidsen
2002-06-19 22:37           ` Ingo Molnar
2002-06-19  2:10 ` jw schultz