Re: [patch] CFS scheduler, v3

From: Ingo Molnar <mingo@elte.hu>
To: William Lee Irwin III <wli@holomorphy.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>,
	linux-kernel@vger.kernel.org,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Con Kolivas <kernel@kolivas.org>, Nick Piggin <npiggin@suse.de>,
	Mike Galbraith <efault@gmx.de>,
	Arjan van de Ven <arjan@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	caglar@pardus.org.tr, Willy Tarreau <w@1wt.eu>,
	Gene Heskett <gene.heskett@gmail.com>
Subject: Re: [patch] CFS scheduler, v3
Date: Sat, 21 Apr 2007 10:57:29 +0200
Message-ID: <20070421085729.GD29800@elte.hu>
In-Reply-To: <20070421083317.GN31925@holomorphy.com>

* William Lee Irwin III <wli@holomorphy.com> wrote:

> I suppose this is a special case of the dreaded priority inversion. 
> What of, say, nice 19 tasks holding fs semaphores and/or mutexes that 
> nice -19 tasks are waiting to acquire? Perhaps rt_mutex should be the 
> default mutex implementation.

while i agree that it could be an issue, lock inversion is nothing 
really new, so i'd not go _that_ drastic to convert all mutexes to 
rtmutexes. (i've taken my -rt/PREEMPT_RT hat off)

For example, reiser3-based systems get pretty laggy under significant 
reniced load (even with the vanilla scheduler) if CONFIG_PREEMPT_BKL is 
enabled: reiser3 holds the BKL for extended periods of time, so a "make 
-j50" workload can starve it significantly, and the tty layer's BKL use 
makes any sort of keyboard input (even over ssh) laggy.

Other locks, though, are not held this frequently, and the mutex 
implementation is pretty fair for waiters anyway. (the semaphore 
implementation is not nearly as fair, and the Big Kernel Semaphore 
is still struct semaphore-based) So i'd really wait for specific 
workloads to trigger problems, and _maybe_ convert certain mutexes to 
rtmutexes, on an as-needed basis.

> > In any case, it is clear that rq->raw_cpu_load should be used instead of 
> > rq->nr_running, when calculating the fair clock, but i begin to like the 
> > nice_offset solution too in addition of this: it's effective in practice 
> > and starvation-free in theory, and most importantly, it's very simple. 
> > We could even make the nice offset granularity tunable, just in case 
> > anyone wants to weaken (or strengthen) the effectivity of nice levels. 
> > What do you think, can you see any obvious (or less obvious) 
> > showstoppers with this approach?
> 
> ->nice_offset's semantics are not meaningful to the end user, 
> regardless of whether it's effective. [...]

yeah, agreed. That's one reason why i didn't make it tunable: it's pretty 
meaningless to the user.

> [...] If there is something to be tuned, it should be relative shares 
> of CPU bandwidth (load_weight) corresponding to each nice level or 
> something else directly observable. The implementation could be 
> ->nice_offset, if it suffices.
> 
> Suppose a table of nice weights like the following is tuned via 
> /proc/:
> 
> nice	weight			nice	weight
> -20	21			 0	1
>  -1	2			19	0.0476
> 
> Essentially 1/(n+1) when n >= 0 and 1-n when n < 0.
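
For illustration, the quoted rule can be written out as a small user-space sketch. The function name nice_weight is hypothetical and not part of any kernel interface; it only evaluates 1/(n+1) for n >= 0 and 1-n for n < 0. Note that the rule as stated gives 1/20 = 0.05 for nice 19, whereas the 0.0476 in the table equals 1/21.

```python
from fractions import Fraction


def nice_weight(nice: int) -> Fraction:
    """Relative CPU share for a nice level under the quoted rule.

    Hypothetical helper: 1/(n+1) for n >= 0 and 1-n for n < 0.
    """
    if not -20 <= nice <= 19:
        raise ValueError("nice level must be in [-20, 19]")
    if nice >= 0:
        return Fraction(1, nice + 1)
    return Fraction(1 - nice)


if __name__ == "__main__":
    # Print the four table entries quoted above.
    for n in (-20, -1, 0, 19):
        print(f"{n:4d}  {float(nice_weight(n)):.4f}")
```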

ok, thanks for thinking about it. I have changed the nice weight in 
CFSv5-to-be so that it defaults to something pretty close to your 
suggestion: the ratio between a nice 0 loop and a nice 19 loop is now 
set to about 2%. (This is something that users have requested for some 
time: the default ~5% is a tad high when running reniced SETI jobs, etc.)

the actual percentage scales almost directly with the nice offset 
granularity value, but if this should be exposed to users at all, i 
agree that it would be better to directly expose this as some sort of 
'ratio between nice 0 and nice 19 tasks', right? Or some other, more 
fine-grained metric. Whole-percent units are too coarse i think, and 
0.1% units aren't intuitive enough. The sysctl handler would then 
transform that 'human readable' sysctl value into the appropriate 
internal nice-offset-granularity value (or whatever mechanism the 
implementation ends up using).

I'd not do this as a per-nice-level thing but as a single value that 
rescales the whole nice level range at once. That's a lot harder to 
misconfigure, and we've got enough nice levels for users to pick from 
almost arbitrarily, as long as they have the ability to influence the 
max.
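
A minimal sketch of the single-value idea, assuming geometric spacing between nice levels (the mail does not specify the spacing, and the real nice-offset mechanism may well differ). The name rescaled_weights and the parameter ratio_0_to_19 are hypothetical; the parameter stands for the 'ratio between nice 0 and nice 19 tasks', so 50 corresponds to roughly 2%.

```python
def rescaled_weights(ratio_0_to_19: float) -> dict:
    """Map every nice level to a relative weight from one tunable.

    ratio_0_to_19 is a hypothetical user-facing value: how many times
    more weight a nice 0 task gets than a nice 19 task.
    Geometric spacing between levels is an assumption of this sketch.
    """
    if ratio_0_to_19 <= 1:
        raise ValueError("ratio must be greater than 1")
    # Per-level multiplier chosen so that 19 steps span the full ratio.
    step = ratio_0_to_19 ** (1 / 19)
    # nice 0 gets weight 1; each step up in nice divides the weight by `step`.
    return {nice: step ** -nice for nice in range(-20, 20)}


if __name__ == "__main__":
    weights = rescaled_weights(50)
    for nice in (-20, 0, 19):
        print(f"{nice:4d}  {weights[nice]:.4f}")
```

With ratio_0_to_19 = 50, nice 0 maps to weight 1 and nice 19 to about 1/50; changing the one value stretches or compresses all 40 levels together, which is what keeps it hard to misconfigure.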

does this sound mostly OK to you?

	Ingo
