public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch] CFS scheduler, -v11
@ 2007-05-08 15:04 Ingo Molnar
  2007-05-09 15:46 ` Kasper Sandberg
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-05-08 15:04 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Mark Lord


* Mike Galbraith <efault@gmx.de> wrote:

> [...] Values 0x3 and 0xb are merely context switch happy.

thanks Mike - value 0x8 looks pretty good here and doesn't have the 
artifacts you found. I've done a quick -v11 release with that fixed, 
available at the usual place:

    http://people.redhat.com/mingo/cfs-scheduler/

with no other changes.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-08 15:04 [patch] CFS scheduler, -v11 Ingo Molnar
@ 2007-05-09 15:46 ` Kasper Sandberg
  2007-05-11 11:07   ` Ingo Molnar
  2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
  2007-05-10 16:59 ` [patch] CFS scheduler, -v11 Christian
  2 siblings, 1 reply; 19+ messages in thread
From: Kasper Sandberg @ 2007-05-09 15:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord

Hello Ingo.

Sorry it has taken this long, but I've been quite busy with work.

Here are my results with v11 for smoothness.

Under slight load (spamassassin niced to 19), it's now doing okay in
smoothness, almost as good as SD. But under harder load, such as pressing
a link in a browser while running 3D (at nice 0), or spamassassin at
nice 0, it still goes stuttery instead of smooth. Overall, though, it
does seem better.

On Tue, 2007-05-08 at 17:04 +0200, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > [...] Values 0x3 and 0xb are merely context switch happy.
> 
> thanks Mike - value 0x8 looks pretty good here and doesn't have the 
> artifacts you found. I've done a quick -v11 release with that fixed, 
> available at the usual place:
> 
>     http://people.redhat.com/mingo/cfs-scheduler/
> 
> with no other changes.
> 
> 	Ingo
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-08 15:04 [patch] CFS scheduler, -v11 Ingo Molnar
  2007-05-09 15:46 ` Kasper Sandberg
@ 2007-05-09 18:02 ` Srivatsa Vaddagiri
  2007-05-09 19:24   ` Dmitry Adamushko
                     ` (4 more replies)
  2007-05-10 16:59 ` [patch] CFS scheduler, -v11 Christian
  2 siblings, 5 replies; 19+ messages in thread
From: Srivatsa Vaddagiri @ 2007-05-09 18:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	tingy, tong.n.li

[-- Attachment #1: Type: text/plain, Size: 1738 bytes --]

On Tue, May 08, 2007 at 05:04:31PM +0200, Ingo Molnar wrote:
> thanks Mike - value 0x8 looks pretty good here and doesn't have the 
> artifacts you found. I've done a quick -v11 release with that fixed, 
> available at the usual place:
> 
>     http://people.redhat.com/mingo/cfs-scheduler/
> 
> with no other changes.

Ingo,
	I had a question with respect to the definition of fairness used,
especially for tasks that are not 100% cpu hogs.

Ex: consider two equally important tasks T1 and T2 running on the same CPU
and whose execution nature is:

	T1 = 100% cpu hog
	T2 = 60% cpu hog (run for 600ms, sleep for 400ms)

Over an arbitrary observation period of 10 sec, 

	T1 was ready to run for all 10sec
	T2 was ready to run for 6 sec

Over this observation period, how much execution time should T2 get,
under a "fair" scheduler?

I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
wrong expectation of fairness?

Anyway, the results of this experiment (using the attached testcase) are
below. T2 gets way below its fair share IMO (under both cfs and sd).


2.6.21.1:

 5444 vatsa     16   0  2468  460  388 R   59  0.0   0:19.76 3 T1
 5443 vatsa     25   0  2468  460  388 R   40  0.0   0:15.36 3 T2


2.6.21.1 + cfs-v11:

 5460 vatsa     31   0  2464  460  388 R   70  0.0   0:15.28 3 T1
 5461 vatsa     29   0  2468  460  388 R   30  0.0   0:05.65 3 T2


2.6.21 + sd-0.48:

 5459 vatsa     23   0  2468  460  388 R   70  0.0   0:17.02 3 T1
 5460 vatsa     21   0  2464  460  388 R   30  0.0   0:06.21 3 T2


Note: 

T1 is started as ./cpuhog 600 0 10 > /dev/null &
T2 is started as ./cpuhog 600 400 10 > /dev/null &

First arg = runtime in ms
Second arg = sleeptime in ms
Third arg = Observation period in seconds


-- 
Regards,
vatsa

[-- Attachment #2: cpuhog.c --]
[-- Type: text/plain, Size: 1931 bytes --]


#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>	/* atoi() */
#include <time.h>	/* time() */
#include <sys/time.h>
#include <sys/resource.h>

double loops_per_ms;

double elapsed_time(struct timeval *tv1, struct timeval *tv2)
{
	double d1, d2;

	d1 = tv1->tv_sec + tv1->tv_usec * 1e-6;
	d2 = tv2->tv_sec + tv2->tv_usec * 1e-6;

	return (d2 - d1) * 1000;	/* milliseconds, without truncating to int */
}

void calibrate_delay(void)
{
	volatile int i;	/* volatile: keep the empty loop under optimization */
	double elapsed_ms;
	struct timeval tv1, tv2;

	gettimeofday(&tv1, NULL);
#define LOOP_COUNT 100000000
	for (i = 0; i < LOOP_COUNT; ++i)
		;
	gettimeofday(&tv2, NULL);
	elapsed_ms = elapsed_time(&tv1, &tv2);
	loops_per_ms = LOOP_COUNT / elapsed_ms;

	printf ("loops_per_ms = %f\n", loops_per_ms);
}

int run_length  = 52; 	// in milliseconds
int sleep_length = 24;	// in milliseconds
int epoch_time = 5;	// in seconds

int main(int argc, char *argv[])
{
	volatile long int i;	/* volatile: keep the busy loop under optimization */
	long int delay;
	time_t prevtime;
	double prevusage = 0;

	if (argc > 1) {
		run_length = atoi(argv[1]);
		if (argc > 2)
			sleep_length = atoi(argv[2]);
		if (argc > 3)
			epoch_time = atoi(argv[3]);
	}

	calibrate_delay();

	delay = run_length * loops_per_ms;

	printf ("run time = %d ms (%ld loops), sleep time = %d ms,"
		" epoch time = %d s\n", run_length, delay, sleep_length,
		 epoch_time);

	prevtime = time(NULL);
	while (1) {
		time_t curtime, deltatime;
		struct rusage stats;

		for (i = 0; i < delay; ++i)
			;
		usleep(sleep_length * 1000);

		curtime = time(NULL);
		deltatime = curtime - prevtime;
		if (deltatime >= epoch_time) {
			double curusage, deltausage;

			getrusage(0, &stats);
			curusage = stats.ru_utime.tv_sec +
				   stats.ru_utime.tv_usec * 1e-6 +
				   stats.ru_stime.tv_sec + 
				   stats.ru_stime.tv_usec * 1e-6;

			deltausage = curusage - prevusage;
			printf ("Obtained %3.2f seconds of execution time in"
				" %ld elapsed seconds\n", deltausage, (long)deltatime);
			prevtime = curtime;
			prevusage = curusage;
		}
	}
}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
@ 2007-05-09 19:24   ` Dmitry Adamushko
  2007-05-10 16:41     ` Srivatsa Vaddagiri
  2007-05-09 20:24   ` William Lee Irwin III
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Dmitry Adamushko @ 2007-05-09 19:24 UTC (permalink / raw)
  To: vatsa; +Cc: Ingo Molnar, Con Kolivas, Mike Galbraith, Linux Kernel

On 09/05/07, Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
> On Tue, May 08, 2007 at 05:04:31PM +0200, Ingo Molnar wrote:
> > thanks Mike - value 0x8 looks pretty good here and doesn't have the
> > artifacts you found. I've done a quick -v11 release with that fixed,
> > available at the usual place:
> >
> >     http://people.redhat.com/mingo/cfs-scheduler/
> >
> > with no other changes.
>
> Ingo,
>         I had a question with respect to the definition of fairness used, esp
> for tasks that are not 100% cpu hogs.
>
> Ex: consider two equally important tasks T1 and T2 running on same CPU and
> whose execution nature is:
>
>         T1 = 100% cpu hog
>         T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
>
> Over an arbitrary observation period of 10 sec,
>
>         T1 was ready to run for all 10sec
>         T2 was ready to run for 6 sec
>

One quick observation.

Isn't it important for both processes to have the same "loops_per_ms" value?

At least on my machine, that's not always the case. See how the results
differ in the following two tests:

(1) when the loops_per_ms values happen to be quite different;
(2) when they happen to be nearly the same.

Both are for mainline 2.6.19 and both processes have equal characteristics,
namely:

./cpuhog 600 400 10

I ran them one by one from different shells.

(1)

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 94517.958412  <---------- (*)

run time = 600 ms (56710775 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 6.39 seconds of execution time in 10 elapsed seconds
Obtained 4.52 seconds of execution time in 11 elapsed seconds
Obtained 4.47 seconds of execution time in 10 elapsed seconds
Obtained 4.47 seconds of execution time in 10 elapsed seconds
Obtained 4.53 seconds of execution time in 10 elapsed seconds
Obtained 4.14 seconds of execution time in 10 elapsed seconds
...

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 51599.587203  <------------- (**)

run time = 600 ms (30959752 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 3.91 seconds of execution time in 10 elapsed seconds
Obtained 3.16 seconds of execution time in 10 elapsed seconds
Obtained 3.18 seconds of execution time in 10 elapsed seconds
Obtained 3.19 seconds of execution time in 10 elapsed seconds
Obtained 2.97 seconds of execution time in 10 elapsed seconds

Looking at the difference between (*) and (**), we can see that the
reported results differ, although one would expect both to be treated
"equally" by the current scheduler (whatever meaning one puts into
that :)


(2) now another attempt.

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 96711.798839

run time = 600 ms (58027079 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 4.32 seconds of execution time in 10 elapsed seconds
Obtained 3.84 seconds of execution time in 11 elapsed seconds
Obtained 3.86 seconds of execution time in 11 elapsed seconds
Obtained 3.84 seconds of execution time in 11 elapsed seconds
Obtained 3.86 seconds of execution time in 11 elapsed seconds
...

dimm@dimm:/mnt/storage/kernel/sched-bench> ./cpuhog 600 400 10
loops_per_ms = 96899.224806

run time = 600 ms (58139534 loops), sleep time = 400 ms, epoch time = 10 s
Obtained 4.28 seconds of execution time in 10 elapsed seconds
Obtained 3.88 seconds of execution time in 11 elapsed seconds
Obtained 3.88 seconds of execution time in 10 elapsed seconds
Obtained 3.40 seconds of execution time in 10 elapsed seconds
Obtained 3.42 seconds of execution time in 10 elapsed seconds
Obtained 3.89 seconds of execution time in 11 elapsed seconds
...


In this case, "loops_per_ms" happened to be nearly equal, so the
reported results look quite similar. "top" showed very close numbers
in this case as well.

Maybe it's different in your case (and I don't claim that everything
is OK with SD and CFS), but I guess this could be one of the reasons.

At least, fork()-ing the second process (with different
run_time/sleep_time characteristics) from the first one would ensure
both have the same "loops_per_ms".


-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
  2007-05-09 19:24   ` Dmitry Adamushko
@ 2007-05-09 20:24   ` William Lee Irwin III
  2007-05-10 17:13     ` Srivatsa Vaddagiri
  2007-05-10  4:22   ` David Schwartz
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: William Lee Irwin III @ 2007-05-09 20:24 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, tingy, tong.n.li

On Wed, May 09, 2007 at 11:32:05PM +0530, Srivatsa Vaddagiri wrote:
> 	I had a question with respect to the definition of fairness used, esp
> for tasks that are not 100% cpu hogs.
> Ex: consider two equally important tasks T1 and T2 running on same CPU and 
> whose execution nature is:
> 	T1 = 100% cpu hog
> 	T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
> Over an arbitrary observation period of 10 sec, 
> 	T1 was ready to run for all 10sec
> 	T2 was ready to run for 6 sec
> Over this observation period, how much execution time should T2 get,
> under a "fair" scheduler?
> I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
> wrong expectation of fairness?
> Anyway, the results of this experiment (using the attached testcase) are below.
> T2 gets way below its fair share IMO (under both cfs and sd).

It's not even feasible much of the time. Suppose your task ran for
100ms then slept for 900ms. It can't get more than 10% of the CPU in
any scheduler, work-conserving or not.

Second, anything that would credit tasks for sleep in such a manner
would flunk ringtest.c or analogues otherwise arranged to pass
feasibility checks.

Fairness applies only to running tasks. The fair share of CPU must be
granted while the task is running. As for sleep, a task has to use its
fair share of CPU or lose it. The numerous pathologies that arise
from trying to credit tasks for sleep in this fashion are why
fairness in the orthodox sense I describe is now such a high priority.

However, it is possible to credit tasks for sleep in other ways. For
instance, EEVDF (which is very close to CFS) has a deadline parameter
expressing latency in addition to one expressing bandwidth that could,
in principle, be used for the purpose of crediting sleeping tasks. It's
not clear to me whether trying to use it for such purposes would be
sound, or, for that matter, whether tasks should receive latency
credits for sleep as opposed to other sorts of behaviors even if
utilizing the latency parameter for credits and bonuses for various
sorts of behavior is sound.


-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
  2007-05-09 19:24   ` Dmitry Adamushko
  2007-05-09 20:24   ` William Lee Irwin III
@ 2007-05-10  4:22   ` David Schwartz
  2007-05-10  8:51   ` Mike Galbraith
  2007-05-10 19:54   ` Ting Yang
  4 siblings, 0 replies; 19+ messages in thread
From: David Schwartz @ 2007-05-10  4:22 UTC (permalink / raw)
  To: Linux-Kernel@Vger. Kernel. Org


> Ex: consider two equally important tasks T1 and T2 running on
> same CPU and
> whose execution nature is:
>
> 	T1 = 100% cpu hog
> 	T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
>
> Over an arbitrary observation period of 10 sec,
>
> 	T1 was ready to run for all 10sec
> 	T2 was ready to run for 6 sec
>
> Over this observation period, how much execution time should T2 get,
> under a "fair" scheduler?

	Anywhere between 30% and 50% of the CPU is fair. There is nothing
particularly fair or unfair about how much credit you give an interactive
task. Less than 30% is unfair as the task is being punished for voluntarily
yielding. More than 50% is unfair as the task should not be entitled to more
than half the CPU if another task with equal static priority wants as much
CPU as possible.

	Slightly less than 30% or slightly more than 50% can also be reasonable due
to rounding or the scheduling interval beating against T2's internal timing.

	From a practical standpoint, we'd like such a task to get a bit more
than 30%. How much more may reasonably depend on exactly what the sleep/run
interval is for the task. It would be unreasonable to expect a task to be
credited for sleeping for an entire day, but equally unreasonable for a
task to get no credit at all for voluntarily yielding the CPU when it is
later ready to run.

	Typical UNIX schedulers bump a task's dynamic priority when it
voluntarily yields. As a result, it can pre-empt a CPU-burner when it later
becomes ready to run. This is intended more to reduce latency than to
increase CPU share. However, it should also have the effect of giving a
polite task more CPU than a pure CPU-burner when it needs it. (That's only
fair.)

	DS



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
                     ` (2 preceding siblings ...)
  2007-05-10  4:22   ` David Schwartz
@ 2007-05-10  8:51   ` Mike Galbraith
  2007-05-10 20:02     ` Ingo Molnar
  2007-05-10 19:54   ` Ting Yang
  4 siblings, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2007-05-10  8:51 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	tingy, tong.n.li

On Wed, 2007-05-09 at 23:32 +0530, Srivatsa Vaddagiri wrote:

> Ingo,
> 	I had a question with respect to the definition of fairness used, esp
> for tasks that are not 100% cpu hogs.
> 
> Ex: consider two equally important tasks T1 and T2 running on same CPU and 
> whose execution nature is:
> 
> 	T1 = 100% cpu hog
> 	T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
> 
> Over an arbitrary observation period of 10 sec, 
> 
> 	T1 was ready to run for all 10sec
> 	T2 was ready to run for 6 sec
> 
> Over this observation period, how much execution time should T2 get,
> under a "fair" scheduler?
> 
> I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
> wrong expectation of fairness?

Depends on how long your fairness yardstick is I suppose.

> Anyway, the results of this experiment (using the attached testcase) are below.
> T2 gets way below its fair share IMO (under both cfs and sd).
> 
> 
> 2.6.21.1:
> 
>  5444 vatsa     16   0  2468  460  388 R   59  0.0   0:19.76 3 T1
>  5443 vatsa     25   0  2468  460  388 R   40  0.0   0:15.36 3 T2
> 
> 
> 2.6.21.1 + cfs-v11:
> 
>  5460 vatsa     31   0  2464  460  388 R   70  0.0   0:15.28 3 T1
>  5461 vatsa     29   0  2468  460  388 R   30  0.0   0:05.65 3 T2
> 
> 
> 2.6.21 + sd-0.48:
> 
>  5459 vatsa     23   0  2468  460  388 R   70  0.0   0:17.02 3 T1
>  5460 vatsa     21   0  2464  460  388 R   30  0.0   0:06.21 3 T2
> 
> 
> Note: 
> 
> T1 is started as ./cpuhog 600 0 10 > /dev/null &

 6524 root      20   0  1432  396  336 S   51  0.0   2:31.83 1 cpuhog
 6525 root      20   0  1436  356  296 R   48  0.0   2:25.76 1 chew

> T2 is started as ./cpuhog 600 400 10 > /dev/null &

That's cpuhog in the above, and chew is the 100% hog.  The below is
cpuhog as you started them. 

 6565 root      20   0  1436  396  336 R   51  0.0   1:49.30 1 cpuhog
 6566 root      20   0  1432  396  336 S   49  0.0   1:59.16 1 cpuhog

FWIW, I can squeeze some better long term fairness out of it by doing
the below.

If a task isn't going to sleep, it's always in competition with someone
even if it happens to be the invisible man right now, so runners will
drift to the right.  Also,  these two tasks aren't necessarily the same
at dequeue time wrt the effect they have (or rather _will_ have) on the
system, but are being treated as if they're identical twins.

	-Mike

Experimental gefingerpoken und mittengrabben.

--- kernel/sched_fair.c.org	2007-05-08 07:51:48.000000000 +0200
+++ kernel/sched_fair.c	2007-05-10 09:43:17.000000000 +0200
@@ -20,7 +20,7 @@ unsigned int sysctl_sched_granularity __
 
 unsigned int sysctl_sched_sleep_history_max __read_mostly = 2000000000;
 
-unsigned int sysctl_sched_load_smoothing = 1 | 2 | 4 | 0;
+unsigned int sysctl_sched_load_smoothing = 0 | 0 | 0 | 0 | 16;
 
 /*
  * Wake-up granularity.
@@ -140,6 +140,8 @@ static void limit_wait_runtime(struct ta
 {
 	s64 limit = sysctl_sched_granularity * 16;
 
+	if (sysctl_sched_load_smoothing & 16)
+		limit = sysctl_sched_sleep_history_max >> 1;
 	if (p->wait_runtime > limit)
 		p->wait_runtime = limit;
 	if (p->wait_runtime < -limit)
@@ -150,10 +152,11 @@ static void limit_wait_runtime(struct ta
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
  */
-static inline void update_curr(struct rq *rq, u64 now)
+static inline void update_curr(struct rq *rq, u64 now, int load_weight)
 {
 	u64 delta_exec, delta_fair, delta_mine;
 	struct task_struct *curr = rq->curr;
+	unsigned long load = get_rq_load(rq);
 
 	if (curr->sched_class != &fair_sched_class || curr == rq->idle
 			|| !curr->on_rq)
@@ -167,8 +170,6 @@ static inline void update_curr(struct rq
 		curr->exec_max = delta_exec;
 
 	if (sysctl_sched_load_smoothing & 1) {
-		unsigned long load = get_rq_load(rq);
-
 		if (sysctl_sched_load_smoothing & 2) {
 			delta_fair = delta_exec << NICE_0_SHIFT;
 			do_div(delta_fair, load);
@@ -178,13 +179,13 @@ static inline void update_curr(struct rq
 		}
 
 		delta_mine = delta_exec * curr->load_weight;
-		do_div(delta_mine, load);
+		do_div(delta_mine, load + load_weight);
 	} else {
 		delta_fair = delta_exec << NICE_0_SHIFT;
-		do_div(delta_fair, rq->raw_weighted_load);
+		do_div(delta_fair, load);
 
 		delta_mine = delta_exec * curr->load_weight;
-		do_div(delta_mine, rq->raw_weighted_load);
+		do_div(delta_mine, load + load_weight);
 	}
 
 	curr->sum_exec_runtime += delta_exec;
@@ -300,7 +301,7 @@ update_stats_wait_end(struct rq *rq, str
 static inline void
 update_stats_dequeue(struct rq *rq, struct task_struct *p, u64 now)
 {
-	update_curr(rq, now);
+	update_curr(rq, now, !p->sleep_start_fair ? NICE_0_LOAD : 0);
 	/*
 	 * Mark the end of the wait period if dequeueing a
 	 * waiting task:
@@ -327,7 +328,7 @@ update_stats_curr_start(struct rq *rq, s
 static inline void
 update_stats_curr_end(struct rq *rq, struct task_struct *p, u64 now)
 {
-	update_curr(rq, now);
+	update_curr(rq, now, !p->sleep_start_fair ? NICE_0_LOAD : 0);
 
 	p->exec_start = 0;
 }
@@ -378,7 +379,7 @@ enqueue_task_fair(struct rq *rq, struct 
 	/*
 	 * Update the fair clock.
 	 */
-	update_curr(rq, now);
+	update_curr(rq, now, 0);
 
 	if (wakeup) {
 		if (p->sleep_start) {




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-09 19:24   ` Dmitry Adamushko
@ 2007-05-10 16:41     ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 19+ messages in thread
From: Srivatsa Vaddagiri @ 2007-05-10 16:41 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Ingo Molnar, Con Kolivas, Mike Galbraith, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 281 bytes --]

On Wed, May 09, 2007 at 09:24:17PM +0200, Dmitry Adamushko wrote:
> One quick observation.
> 
> Isn't it important for both processes to have the same "loops_per_ms" value?

Good catch.

I have modified the testcase based on this observation (using
setitimer).

-- 
Regards,
vatsa

[-- Attachment #2: cpuhog.c --]
[-- Type: text/plain, Size: 1764 bytes --]

#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

volatile int time_elapsed;
int run_length  = 52; 	// in milliseconds
int sleep_length = 24;	// in milliseconds
int epoch_time = 5;	// in seconds

void alrm_handler(int signo)
{
	time_elapsed = 1;
}

int main(int argc, char *argv[])
{
	double prevusage = 0;
	struct itimerval timer;
	time_t prevtime;

	if (argc > 1) {
		run_length = atoi(argv[1]);
		if (argc > 2)
			sleep_length = atoi(argv[2]);
		if (argc > 3)
			epoch_time = atoi(argv[3]);
	}

	signal(SIGVTALRM, alrm_handler);
	memset(&timer, 0, sizeof(timer));

	timer.it_value.tv_sec = run_length / 1000;
	timer.it_value.tv_usec = (run_length % 1000) * 1000;

	printf ("run time = %d ms (%ld sec + %ld usec), sleep time = %d ms,"
		" epoch time = %d s\n", run_length, (long)timer.it_value.tv_sec,
		(long)timer.it_value.tv_usec, sleep_length, epoch_time);

	prevtime = time(NULL);
	while (1) {
		time_t curtime, deltatime;
		struct rusage stats;
		int rc;

		rc = setitimer(ITIMER_VIRTUAL, &timer, NULL);
		if (rc < 0) {
			perror("setitimer");
			exit(1);
		}

		time_elapsed = 0;
		while (!time_elapsed)
		;

		usleep(sleep_length * 1000);

		curtime = time(NULL);
		deltatime = curtime - prevtime;
		if (deltatime >=  epoch_time) {
			double curusage, deltausage;

			getrusage(0, &stats);
			curusage = stats.ru_utime.tv_sec +
				   stats.ru_utime.tv_usec * 1e-6 +
				   stats.ru_stime.tv_sec + 
				   stats.ru_stime.tv_usec * 1e-6;

			deltausage = curusage - prevusage;
			printf ("Obtained %3.2f seconds of execution time in"
				" %ld elapsed seconds\n", deltausage, (long)deltatime);
			prevtime = curtime;
			prevusage = curusage;
		}
	}
}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-08 15:04 [patch] CFS scheduler, -v11 Ingo Molnar
  2007-05-09 15:46 ` Kasper Sandberg
  2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
@ 2007-05-10 16:59 ` Christian
  2007-05-10 17:10   ` Kasper Sandberg
  2 siblings, 1 reply; 19+ messages in thread
From: Christian @ 2007-05-10 16:59 UTC (permalink / raw)
  To: linux-kernel, Ingo Molnar

Hello lkml, hello Ingo!

I've been using CFS-v10 for a few days and I must say that I'm verry 
impressed ;-)

Desktop performance without any manual renicing is excellent, even with 
make -j20. Gaming performance is at least on par with SD now! I've tried to 
change the sched_load_smoothing config to "8" but there is no visible 
difference when it's set to "7".

Both schedulers are verrry good! I can't really tell which one is better.
I noticed that while compiling a kernel (with -j4) my CPU temperature is two 
to three degrees hotter than with mainline. I have not done any timing tests, 
but I suspect that it's a little faster while preserving excellent desktop 
usability. Great work!! :-)

-Christian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-10 16:59 ` [patch] CFS scheduler, -v11 Christian
@ 2007-05-10 17:10   ` Kasper Sandberg
  2007-05-10 18:51     ` Christian
  0 siblings, 1 reply; 19+ messages in thread
From: Kasper Sandberg @ 2007-05-10 17:10 UTC (permalink / raw)
  To: Christian; +Cc: linux-kernel, Ingo Molnar

On Thu, 2007-05-10 at 18:59 +0200, Christian wrote:
> Hello lkml, hello Ingo!
> 
> I've been using CFS-v10 for a few days and I must say that I'm verry 
> impressed ;-)
> 
> Desktop performance without any manual renicing is excellent, even with 
> make -j20. Gaming performance is at least on par with SD now! I've tried to 
Which games are you trying? And have you tried other workloads than make
-j20?

Try having a window with some 3D game open and a browser beside it, and
press a link. I can't seem to get smooth results with CFS.

Perhaps I could also conduct tests on my hardware with the games you are
trying.

> change the sched_load_smoothing config to "8" but there is no visible 
> difference when it's set to "7".
> 
> Both schedulers are verrry good! I can't really tell which one is better.
> I noticed that while compiling a kernel (with -j4) my CPU temperature is two 
> to three degrees hotter than with mainline. I have not done any timing tests, 
> but I suspect that it's a little faster while preserving excellent desktop 
> usability. Great work!! :-)
> 
> -Christian


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-09 20:24   ` William Lee Irwin III
@ 2007-05-10 17:13     ` Srivatsa Vaddagiri
  2007-05-10 18:55       ` William Lee Irwin III
  0 siblings, 1 reply; 19+ messages in thread
From: Srivatsa Vaddagiri @ 2007-05-10 17:13 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, tingy, tong.n.li

On Wed, May 09, 2007 at 01:24:18PM -0700, William Lee Irwin III wrote:
> It's not even feasible much of the time. Suppose your task ran for
> 100ms then slept for 900ms. It can't get more than 10% of the CPU in
> any scheduler, work-conserving or not.

Sure. The question of fairness arises when such a task has to "fight" for
space/cputime in those 100ms, resulting in less than 10% bandwidth.
Having some statistics shown on cpu demand vs. obtained bandwidth might
be good?

> Fairness applies only to running tasks. The fair share of CPU must be
> granted while the task is running. As for sleep, it has to use its
> fair share of CPU or lose it. The numerous of pathologies that arise
> from trying to credit tasks for sleep in this fashion this are why
> fairness in the orthodox sense I describe is now such a high priority.
> 
> However, it is possible to credit tasks for sleep in other ways. For
> instance, EEVDF (which is very close to CFS) has a deadline parameter
> expressing latency in addition to one expressing bandwidth that could,
> in principle, be used for the purpose of crediting sleeping tasks. It's
> not clear to me whether trying to use it for such purposes would be
> sound, or, for that matter, whether tasks should receive latency
> credits for sleep as opposed to other sorts of behaviors even if
> utilizing the latency parameter for credits and bonuses for various
> sorts of behavior is sound.

I guess the fairness interval also matters, as Mike pointed out. Over
shorter intervals, it may appear more fair to such sleeping tasks than
over longer intervals. Btw, CFS seems to leave this fairness interval
undefined (it's dependent on N, the number of tasks, and
sysctl_sched_granularity?).

I suspect container/database/webserver workloads would like to receive
such latency/bandwidth credits for sleep. Will check with our
database/workload management folks on this.

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-10 17:10   ` Kasper Sandberg
@ 2007-05-10 18:51     ` Christian
  2007-05-10 19:45       ` Ingo Molnar
  2007-05-10 20:05       ` Ingo Molnar
  0 siblings, 2 replies; 19+ messages in thread
From: Christian @ 2007-05-10 18:51 UTC (permalink / raw)
  To: linux-kernel

On Thursday 10 May 2007 19:10:44 Kasper Sandberg wrote:
> On Thu, 2007-05-10 at 18:59 +0200, Christian wrote:
> > Hello lkml, hello Ingo!
> >
> > I've been using CFS-v10 for a few days and I must say that I'm verry
> > impressed ;-)
> >
> > Desktop performance without any manual renicing is excellent, even with
> > make -j20. Gaming performance is at least on par with SD now! I've tried
> > to
>
> Which games are you trying? and have you tried other workloads then make
> -j20?
>
> try to have a window with some 3d game open, and a browser beside it, and
> press a link. i cant seem to get smooth results with CFS.
>
> Perhaps i could also conduct tests with the games you are trying on my
> hardware.
>
> > change the sched_load_smoothing config to "8" but there is no visible
> > difference when it's set to "7".
> >
> > Both schedulers are verrry good! I can't really tell which one is better.
> > I noticed that while compiling a kernel (with -j4) my CPU temperature is
> > two to three degrees hotter than with mainline. I have not done any
> > timing tests, but I suspect that it's a little faster while preserving
> > excellent desktop usability. Great work!! :-)
> >
> > -Christian

I've tried many different workloads: kernel compile (normal -j4), extreme 
kernel compile (-j20) and browsing/OpenOffice. GLXGears, Briquolo and 
enemy-territory work really well under these loads.

I just tried another test with "nice make -j20" and I see latency blips while 
gaming. SD did not have this. Latencies with nice are _worse_ than latencies 
without nice on my system. Playing with sched_load_smoothing does not change 
anything. (I've tested values in the range 1-10)

-Christian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-10 17:13     ` Srivatsa Vaddagiri
@ 2007-05-10 18:55       ` William Lee Irwin III
  0 siblings, 0 replies; 19+ messages in thread
From: William Lee Irwin III @ 2007-05-10 18:55 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, tingy, tong.n.li, Davide Libenzi

On Wed, May 09, 2007 at 01:24:18PM -0700, William Lee Irwin III wrote:
>> It's not even feasible much of the time. Suppose your task ran for
>> 100ms then slept for 900ms. It can't get more than 10% of the CPU in
>> any scheduler, work-conserving or not.

On Thu, May 10, 2007 at 10:43:12PM +0530, Srivatsa Vaddagiri wrote:
> sure. The question of fairnes arises when such a task has to "fight" for
> space/cputime in those 100ms, resulting in lesser than 10% bandwidth.
> Having some statistics shown on cpu demand vs obtained bandwidth may be
> good?

Davide Libenzi has a testcase which can essentially do this sort of
test involving sleep/wakeup behavior for you.


On Wed, May 09, 2007 at 01:24:18PM -0700, William Lee Irwin III wrote:
>> Fairness applies only to running tasks. The fair share of CPU must be
>> granted while the task is running. As for sleep, it has to use its
>> fair share of CPU or lose it. The numerous pathologies that arise
>> from trying to credit tasks for sleep in this fashion are why
>> fairness in the orthodox sense I describe is now such a high priority.
>> However, it is possible to credit tasks for sleep in other ways. For
>> instance, EEVDF (which is very close to CFS) has a deadline parameter
>> expressing latency in addition to one expressing bandwidth that could,
>> in principle, be used for the purpose of crediting sleeping tasks. It's
>> not clear to me whether trying to use it for such purposes would be
>> sound, or, for that matter, whether tasks should receive latency
>> credits for sleep as opposed to other sorts of behaviors even if
>> utilizing the latency parameter for credits and bonuses for various
>> sorts of behavior is sound.

On Thu, May 10, 2007 at 10:43:12PM +0530, Srivatsa Vaddagiri wrote:
> I guess fairness interval also matters, as Mike pointed out. Over
> shorter intervals, it may appear more fair to such sleeping tasks
> than over longer intervals.  Btw CFS seems to leave this fairness
> interval undefined (it's dependent on N, the number of tasks, and
> sysctl_sched_granularity?).

Usually things appear fairer over longer than shorter intervals. One
could regard the rate of convergence as a sort of performance metric
for the fairness of a scheduler, provided it converges to fairness at
all.


On Thu, May 10, 2007 at 10:43:12PM +0530, Srivatsa Vaddagiri wrote:
> I suspect container/database/webserver workloads would like to receive
> such latency/bandwidth credits for sleep. Will check with our
> database/workload management folks on this.

Bear in mind that bandwidth credits have soundness issues. Crediting
bandwidth in exchange for sleep will certainly flunk ringtest.c.

Crediting latency in exchange for sleep at least has different problems
if it has any at all. I actually expect it to be sound to credit
latency (the l_i in EEVDF) in exchange for sleep, but the experts have
yet to weigh in on the subject.


-- wli

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-10 18:51     ` Christian
@ 2007-05-10 19:45       ` Ingo Molnar
  2007-05-10 20:05       ` Ingo Molnar
  1 sibling, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-05-10 19:45 UTC (permalink / raw)
  To: Christian; +Cc: linux-kernel


* Christian <christiand59@web.de> wrote:

> [...] Latencies with nice are _worse_ than latencies without nice on 
> my system. [...]

i'll check that - this must be a CFS buglet.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
                     ` (3 preceding siblings ...)
  2007-05-10  8:51   ` Mike Galbraith
@ 2007-05-10 19:54   ` Ting Yang
  4 siblings, 0 replies; 19+ messages in thread
From: Ting Yang @ 2007-05-10 19:54 UTC (permalink / raw)
  To: vatsa
  Cc: Ingo Molnar, Mike Galbraith, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, tong.n.li

Hi, Srivatsa

    I would say that in this case, your specifications of these tasks do 
not actually give a baseline for evaluating fairness. When you want to 
check whether a generated schedule is _fair_ or not, you first have to 
set up a baseline which indicates the behavior the tasks should have, 
and then compare the generated schedule against that "should" case for 
fairness. In fact, the results you get (in the following) all make 
perfect sense, based on the "should" defined by the different model 
adopted by each scheduler.

    Now I will try to explain all of this in my creepy English:
> Ex: consider two equally important tasks T1 and T2 running on same CPU and 
> whose execution nature is:
>
> 	T1 = 100% cpu hog
> 	T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
>   
First, this specification does not give any idea of the amount of real 
work that should be done by each task during any time period. Now 
suppose you have 4 instances of T1 running: what do you expect each task 
to consume in 10 sec? 2.5 sec. Hmmm, what if one of them is processing 
something I need to be 3 times faster than the others? Then your 
expectation becomes 5, 5/3, 5/3, 5/3. And what if the 3x faster process 
is not actually more "important" than the others?
The same thing happens with T2, which is a little more complicated. When 
T2 is active and other tasks are running at the same time, what is the 
amount of work it should get done during its active period? And when T2 
is sleeping, how does that affect the tasks still running, especially 
once T2 becomes active again? Should the scheduler remember the history 
to be fair long term, or should it forget about the history?

All these questions require a complete model (different approaches use 
different models), and there are 3 critical things needed for each task 
in the system, which are currently more or less entangled together in 
all existing schedulers.

For each task you need:
               _B_ (CPU bandwidth), _T_ (response window), _P_ (priority 
or importance)

Here, B is the amount of bandwidth needed, and T is the time period over 
which B must be ensured. For example, if a task needs 20ms to process a 
request, and the request has to be processed within 50ms, then B is 40% 
and T is 50ms. B captures the throughput needs of the task, while T 
captures its response requirement. For the above task, if you give it 
(50%, 100ms), then some of its requests may miss their deadlines, even 
though throughput is guaranteed.  The combination (B, T) describes what 
a task requires in order to run at least "well" enough; however, it is 
not necessarily related to the task's importance, therefore P is needed, 
so that the scheduler can make choices when the CPU is overloaded, 
specifically when the sum of the Bs goes beyond the CPU capacity.

In real-time work, we have pretty good knowledge of a task's 
requirements (B, T), therefore strict admission control or bandwidth 
control is usually applied to ensure that admitted tasks are always 
satisfied. When CPU overload does happen, either B or T serves as a 
static priority, as in rate-monotonic scheduling, or dynamic priorities 
are calculated, as in EDF (earliest deadline first). Various strategies 
based on capacity reservation and bandwidth reservation were proposed 
(in the late 80s and mid 90s) to enforce protection between tasks when 
the workload in the system fluctuates.

However, on a general-purpose machine, everything is dynamic. We do not 
have any knowledge of (B, T) for any task. Tens of daemons and kernel 
threads wake up at arbitrary intervals with arbitrary amounts of work. 
Each may or may not have a specific response deadline, and deadlines may 
be soft or hard, etc. The required pre-knowledge and the scalability 
issues make capacity/bandwidth reservation based approaches unsuitable 
for a general-purpose scheduler.

The solution used here is to dynamically figure out (B, T) based on the 
workload in the system (so that sum(B_i) never goes beyond the CPU 
capacity), which leads to either _Fair Share_ scheduling or 
_Proportional Share_ scheduling (the distinction between them is vague, 
though); these use different models of fairness, as I will explain later.

In Linux, we have only one parameter per task, the _NICE_ value, which 
has to serve the purpose of conveying (B, T, P). Usually a small nice 
value indicates higher priority and higher bandwidth (a larger time 
quantum), and may or may not say anything about T. Now let's see how the 
_Fair Share_ model affects the vanilla scheduler, and how the 
_Proportional Share_ model affects the CFS and RSDL/SD schedulers. 
(Note: the specification you give above only indicates when a task is 
runnable; it gives no information about (B, T).)

> 2.6.21.1:
>
>  5444 vatsa     16   0  2468  460  388 R   59  0.0   0:19.76 3 T1
>  5443 vatsa     25   0  2468  460  388 R   40  0.0   0:15.36 3 T2
>
>   
The _Fair Share_ (FS) model, which the vanilla scheduler adopts, says 
that over a long enough time window, the amount of work done by each 
task should be proportional to its weight (which comes from the nice 
value). Therefore the B_i of each task is given by w_i/sum(w), in your 
case 50%:50% for T1 and T2, but the model says _nothing_ about T_i for 
each task.

Therefore, the Fair Share model is inherently problematic! Imagine that 
T1 is a 100% cpu hog, and T2 works like this: work 10ms, sleep 90ms, 
work 10ms, sleep 90ms, ..., work 10ms, sleep 90ms, then work constantly 
(100% cpu hog). When T2 enters its last stage and becomes a 100% cpu 
hog, it will starve T1 for an arbitrary amount of time in the FS model 
(depending on how long it executed before). The problem here is that FS 
remembers arbitrarily long history and tries to compensate for it in 
the future, which leads to starvation, long latencies, and weird 
interactivity problems.

The vanilla scheduler approximates the FS model by calculating dynamic 
priorities based on execution time and sleep time _heuristically_. It 
also tries to solve the above problem by: (1) dividing long quanta into 
smaller timeslices, and (2) only remembering a limited amount of 
history; that is why it puts an upper bound on the sleep time credited 
to each task, which is used when calculating goodness.

Overall, this kind of scheduler is heuristic-oriented, targets 
long-term fairness with low accuracy, and provides almost no guarantees 
for throughput and response time. That is why people are constantly 
looking for better ones.

> 2.6.21.1 + cfs-v11:
>
>  5460 vatsa     31   0  2464  460  388 R   70  0.0   0:15.28 3 T1
>  5461 vatsa     29   0  2468  460  388 R   30  0.0   0:05.65 3 T2
>
>
> 2.6.21 + sd-0.48:
>
>  5459 vatsa     23   0  2468  460  388 R   70  0.0   0:17.02 3 T1
>  5460 vatsa     21   0  2464  460  388 R   30  0.0   0:06.21 3 T2
Both CFS and RSDL/SD use Proportional Share (PS). It was originally 
designed for packet transmission, where response time is relatively 
important, and is based on an ideal fluid-flow model called GPS. The 
GPS model says that any _active_ task should receive cpu bandwidth 
proportional to its weight. The only difference between PS and the FS 
model above is that when deciding B for a task, PS only takes _active_ 
tasks into account, while FS takes all tasks. In other words, PS 
forgets about history, and sleep time (different from the time a task 
stays on the run queue :-)) does not affect future scheduling. In this 
way it implicitly provides an upper bound on T for each active task 
(the exact bound depends on the algorithm and implementation), usually 
measured as the difference between the actual work done and the work 
that would be done in the ideal GPS model. For example, here CFS and SD 
give a delay bound of O(n), where n is the number of active tasks.

In your example: when T2 is active, it gets B_2 = w_i/sum(w), a 50% 
share, and nothing while it sleeps. Therefore the share over time is:
              T1: 400 * 100% (share) + 600 * 50% (share) = 700
              T2: 400 * 0% (share) + 600 * 50% (share) = 300
which is exactly the CPU split given in your results above for both CFS 
and SD!

Overall, proportional share schedulers, such as CFS and RSDL (other 
similar ones worth mentioning: WFQ, WF2Q, SCFQ, SFQ, SFS, EEVDF, all 
based on fair queuing approximating GPS), are more accurate, provide 
short-term fairness, and give controlled bounds on throughput and 
latency. However, all this comes at a cost: these schedulers are more 
difficult to implement, need more complicated data structures, may take 
O(n) or O(log n) time per operation, need more frequent context 
switches, etc.

Given the fluctuation of workloads, the variety of application 
behaviors, the uncertainty of user intentions, etc., it is almost 
impossible to set up a model that captures all of them. That is why 
building a good general-purpose scheduler is so difficult, and why many 
people like Ingo, Con, ... are working so hard on it :-)

Sorry for another long email.

Ting


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Definition of fairness (was Re: [patch] CFS scheduler, -v11)
  2007-05-10  8:51   ` Mike Galbraith
@ 2007-05-10 20:02     ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-05-10 20:02 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: vatsa, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Mark Lord, tingy, tong.n.li


* Mike Galbraith <efault@gmx.de> wrote:

> On Wed, 2007-05-09 at 23:32 +0530, Srivatsa Vaddagiri wrote:
> 
> > Ingo,
> > 	I had a question with respect to the definition of fairness used, esp
> > for tasks that are not 100% cpu hogs.
> > 
> > Ex: consider two equally important tasks T1 and T2 running on same CPU and 
> > whose execution nature is:
> > 
> > 	T1 = 100% cpu hog
> > 	T2 = 60% cpu hog (run for 600ms, sleep for 400ms)
> > 
> > Over a arbitrary observation period of 10 sec, 
> > 
> > 	T1 was ready to run for all 10sec
> > 	T2 was ready to run for 6 sec
> > 
> > Over this observation period, how much execution time should T2 get,
> > under a "fair" scheduler?
> > 
> > I was expecting both T2 and T1 to get 5 sec (50:50 split). Is this a
> > wrong expectation of fairness?
> 
> Depends on how long your fairness yardstick is I suppose.

yeah, i'd agree with that. I think a 400 msecs sleep period is still 
within the range that we should define as being within the scope, but 
it's probably borderline. The ultimate threshold is the reaction time of 
humans - somewhere between 30 msecs and 1 second. Sleep periods beyond 
that are typically not expected to be 'smoothly and fairly scheduled' 
the same way as, say, CPU hogs are scheduled - because you can already 
'see' the effects of the sleep - so 'smoothness' is not possible 
anymore.

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-10 18:51     ` Christian
  2007-05-10 19:45       ` Ingo Molnar
@ 2007-05-10 20:05       ` Ingo Molnar
  2007-05-11 12:01         ` Christian
  1 sibling, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2007-05-10 20:05 UTC (permalink / raw)
  To: Christian; +Cc: linux-kernel, Kasper Sandberg


* Christian <christiand59@web.de> wrote:

> > > Desktop performance without any manual renicing is excellent, even 
> > > with make -j20. Gaming performance is at least on par with SD now! 
> > > I've tried to
> >
> > Which games are you trying? and have you tried other workloads than 
> > make -j20?
> >
> > try to have a window with some 3d game open, and a browser beside it, 
> > and press a link. i cant seem to get smooth results with CFS.
[...]
> 
> I've tried many different workloads: kernel compile (normal -j4), 
> extreme kernel compile (-j20) and browsing/OpenOffice. GLXGears, 
> Briquolo and enemy-territory work really well under these loads.

do you have CONFIG_PREEMPT enabled - while perhaps Kasper has it 
disabled? Kasper, if so, could you try a quick test with CONFIG_PREEMPT 
enabled, does that make any difference to your gaming-smoothness 
results? [it's not a real solution, but this would help us narrow down 
the smoothness problem you are seeing.]

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-09 15:46 ` Kasper Sandberg
@ 2007-05-11 11:07   ` Ingo Molnar
  0 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2007-05-11 11:07 UTC (permalink / raw)
  To: Kasper Sandberg
  Cc: Mike Galbraith, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord


[ remailing this on-list too, with some more explanations, i suspect 
  others might be affected by this 3D performance problem as well. ]

* Kasper Sandberg <lkml@metanurb.dk> wrote:

> [...] but under harder load such as pressing a link in a browser while 
> 3d(at nice 0), or spamasassin at nice 0 still makes it go stutterish 
> instead of smooth. But overall it does seem better.

ok, i think i have finally managed to track this one down.

certain 3D drivers grew a subtle performance dependency on a 
sys_sched_yield() implementation/behavioral bug/misbehavior of the 
upstream kernel - an implementation which SD shares, but which CFS fixes 
by always yielding efficiently. The result of this bug/dependency is 
extremely low FPS during any CPU-intense workload.

you are using an Nvidia 6600 card so i dont know for sure whether you 
are affected by this problem (Radeon cards are affected and i can now 
reproduce it) - but the symptoms i've reproduced seem to match your 
system's symptoms.

I've added a workaround for this to CFS, do you have some time to try 
it? I've attached sched-cfs-v12-rc4.patch (a delta patch on top of a 
CFS -v11 tree), and once you have booted it you can activate the 
workaround via:

     echo 1 > /proc/sys/kernel/sched_yield_bug_workaround

does this make any difference to the drastic 3D smoothness problems you 
are experiencing?

	Ingo

---
 Makefile                     |    2 +-
 drivers/char/drm/radeon_cp.c |    5 +++++
 include/linux/sched.h        |    2 +-
 kernel/sched_fair.c          |   23 +++++++++++++++++++----
 kernel/sysctl.c              |   12 ++++++------
 5 files changed, 32 insertions(+), 12 deletions(-)

Index: linux/Makefile
===================================================================
--- linux.orig/Makefile
+++ linux/Makefile
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 6
 SUBLEVEL = 21
-EXTRAVERSION = -cfs-v11
+EXTRAVERSION = -cfs-v12
 NAME = Nocturnal Monster Puppy
 
 # *DOCUMENTATION*
Index: linux/drivers/char/drm/radeon_cp.c
===================================================================
--- linux.orig/drivers/char/drm/radeon_cp.c
+++ linux/drivers/char/drm/radeon_cp.c
@@ -2210,6 +2210,11 @@ int radeon_driver_load(struct drm_device
 
 	DRM_DEBUG("%s card detected\n",
 		  ((dev_priv->flags & RADEON_IS_AGP) ? "AGP" : (((dev_priv->flags & RADEON_IS_PCIE) ? "PCIE" : "PCI"))));
+	if (sysctl_sched_yield_bug_workaround == -1) {
+		sysctl_sched_yield_bug_workaround = 1;
+		printk(KERN_WARNING "quirk installed: turning on "
+			"sys_sched_yield() workaround for Radeon DRM.\n");
+	}
 	return ret;
 }
 
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1253,9 +1253,9 @@ extern char * sched_print_task_state(str
 
 extern unsigned int sysctl_sched_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
-extern unsigned int sysctl_sched_sleep_history_max;
 extern unsigned int sysctl_sched_child_runs_first;
 extern unsigned int sysctl_sched_load_smoothing;
+extern int sysctl_sched_yield_bug_workaround;
 
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -18,10 +18,6 @@
  */
 unsigned int sysctl_sched_granularity __read_mostly = 2000000;
 
-unsigned int sysctl_sched_sleep_history_max __read_mostly = 2000000000;
-
-unsigned int sysctl_sched_load_smoothing = 0 | 0 | 0 | 8;
-
 /*
  * Wake-up granularity.
  * (default: 1 msec, units: nanoseconds)
@@ -32,6 +28,19 @@ unsigned int sysctl_sched_load_smoothing
  */
 unsigned int sysctl_sched_wakeup_granularity __read_mostly = 0;
 
+unsigned int sysctl_sched_load_smoothing __read_mostly = 0 | 0 | 0 | 8;
+
+/*
+ * sys_sched_yield unfairness bug workaround switch.
+ * (default: -1:auto-detect+disabled. Other values: 0:disabled, 1:enabled)
+ *
+ * This option switches the unfair yield implementation of the
+ * old scheduler back on. Needed for good performance of certain
+ * apps like 3D games on Radeon cards.
+ */
+int sysctl_sched_yield_bug_workaround __read_mostly = -1;
+
+EXPORT_SYMBOL_GPL(sysctl_sched_yield_bug_workaround);
 
 extern struct sched_class fair_sched_class;
 
@@ -462,6 +471,12 @@ yield_task_fair(struct rq *rq, struct ta
 	u64 now;
 
 	/*
+	 * Bug workaround for 3D apps running on the radeon 3D driver:
+	 */
+	if (unlikely(sysctl_sched_yield_bug_workaround > 0))
+		return;
+
+	/*
 	 * yield-to support: if we are on the same runqueue then
 	 * give half of our wait_runtime (if it's positive) to the other task:
 	 */
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -223,24 +223,24 @@ static ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
-		.procname	= "sched_sleep_history_max_ns",
-		.data		= &sysctl_sched_sleep_history_max,
+		.procname	= "sched_child_runs_first",
+		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
-		.procname	= "sched_child_runs_first",
-		.data		= &sysctl_sched_child_runs_first,
+		.procname	= "sched_load_smoothing",
+		.data		= &sysctl_sched_load_smoothing,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
-		.procname	= "sched_load_smoothing",
-		.data		= &sysctl_sched_load_smoothing,
+		.procname	= "sched_yield_bug_workaround",
+		.data		= &sysctl_sched_yield_bug_workaround,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [patch] CFS scheduler, -v11
  2007-05-10 20:05       ` Ingo Molnar
@ 2007-05-11 12:01         ` Christian
  0 siblings, 0 replies; 19+ messages in thread
From: Christian @ 2007-05-11 12:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Kasper Sandberg

On Thursday 10 May 2007 22:05:00 Ingo Molnar wrote:
> * Christian <christiand59@web.de> wrote:
> > > > Desktop performance without any manual renicing is excellent, even
> > > > with make -j20. Gaming performance is at least on par with SD now!
> > > > I've tried to
> > >
> > > Which games are you trying? and have you tried other workloads than
> > > make -j20?
> > >
> > > try to have a window with some 3d game open, and a browser beside it,
> > > and press a link. i cant seem to get smooth results with CFS.
>
> [...]
>
> > I've tried many different workloads: kernel compile (normal -j4),
> > extreme kernel compile (-j20) and browsing/OpenOffice. GLXGears,
> > Briquolo and enemy-territory work really well under these loads.
>
> do you have CONFIG_PREEMPT enabled - while perhaps Kasper has it
> disabled? Kasper, if so, could you try a quick test with CONFIG_PREEMPT
> enabled, does that make any difference to your gaming-smoothness
> results? [it's not a real solution, but this would help us narrow down
> the smoothness problem you are seeing.]

Yes, I have enabled full preemption and 1000 HZ.

-Christian



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2007-05-11 12:03 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-08 15:04 [patch] CFS scheduler, -v11 Ingo Molnar
2007-05-09 15:46 ` Kasper Sandberg
2007-05-11 11:07   ` Ingo Molnar
2007-05-09 18:02 ` Definition of fairness (was Re: [patch] CFS scheduler, -v11) Srivatsa Vaddagiri
2007-05-09 19:24   ` Dmitry Adamushko
2007-05-10 16:41     ` Srivatsa Vaddagiri
2007-05-09 20:24   ` William Lee Irwin III
2007-05-10 17:13     ` Srivatsa Vaddagiri
2007-05-10 18:55       ` William Lee Irwin III
2007-05-10  4:22   ` David Schwartz
2007-05-10  8:51   ` Mike Galbraith
2007-05-10 20:02     ` Ingo Molnar
2007-05-10 19:54   ` Ting Yang
2007-05-10 16:59 ` [patch] CFS scheduler, -v11 Christian
2007-05-10 17:10   ` Kasper Sandberg
2007-05-10 18:51     ` Christian
2007-05-10 19:45       ` Ingo Molnar
2007-05-10 20:05       ` Ingo Molnar
2007-05-11 12:01         ` Christian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox