* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] @ 2007-04-15 18:47 Tim Tassonis 0 siblings, 0 replies; 304+ messages in thread From: Tim Tassonis @ 2007-04-15 18:47 UTC (permalink / raw) To: linux-kernel > + printk("Fair Scheduler: Copyright (c) 2007 Red Hat, Inc., Ingo Molnar\n"); So that's what all the fuss about the staircase scheduler is all about then! At last, I see your point. > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! > How pathetic can you get? Tim, really looking forward to the CL final where Liverpool will beat the shit out of Scum (and there's a lot to be beaten out). ^ permalink raw reply [flat|nested] 304+ messages in thread
* [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
@ 2007-04-13 20:21 Ingo Molnar
2007-04-13 20:27 ` Bill Huey
` (13 more replies)
0 siblings, 14 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-13 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
i'm pleased to announce the first release of the "Modular Scheduler Core
and Completely Fair Scheduler [CFS]" patchset:
http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
This project is a complete rewrite of the Linux task scheduler. My goal
is to address various feature requests and to fix deficiencies in the
vanilla scheduler that were suggested/found in the past few years, both
for desktop scheduling and for server scheduling workloads.
[ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
new scheduler will be active by default and all tasks will default
to the new SCHED_FAIR interactive scheduling class. ]
Highlights are:
- the introduction of Scheduling Classes: an extensible hierarchy of
scheduler modules. These modules encapsulate scheduling policy
details and are handled by the scheduler core without the core
code assuming too much about them.
- sched_fair.c implements the 'CFS desktop scheduler': it is a
replacement for the vanilla scheduler's SCHED_OTHER interactivity
code.
i'd like to give credit to Con Kolivas for the general approach here:
he has proven via RSDL/SD that 'fair scheduling' is possible and that
it results in better desktop scheduling. Kudos Con!
The CFS patch uses a completely different approach and implementation
from RSDL/SD. My goal was to make CFS's interactivity quality exceed
that of RSDL/SD, which is a high standard to meet :-) Testing
feedback is welcome to decide this one way or another. [ and, in any
case, all of SD's logic could be added via a kernel/sched_sd.c module
as well, if Con is interested in such an approach. ]
CFS's design is quite radical: it does not use runqueues; it uses a
time-ordered rbtree to build a 'timeline' of future task execution,
and thus has no 'array switch' artifacts (by which both the vanilla
scheduler and RSDL/SD are affected). [ an illustrative sketch of such
a timeline enqueue follows right after this list of highlights. ]
CFS uses nanosecond granularity accounting and does not rely on any
jiffies or other HZ detail. Thus the CFS scheduler has no notion of
'timeslices' and has no heuristics whatsoever. There is only one
central tunable:
/proc/sys/kernel/sched_granularity_ns
which can be used to tune the scheduler from 'desktop' (low
latencies) to 'server' (good batching) workloads. It defaults to a
setting suitable for desktop workloads. SCHED_BATCH is handled by the
CFS scheduler module too.
due to its design, the CFS scheduler is not prone to any of the
'attacks' that exist today against the heuristics of the stock
scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
work fine and do not impact interactivity and produce the expected
behavior.
the CFS scheduler has a much stronger handling of nice levels and
SCHED_BATCH: both types of workloads should be isolated much more
aggressively than under the vanilla scheduler.
( another detail: due to nanosec accounting and timeline sorting,
sched_yield() support is very simple under CFS, and in fact under
CFS sched_yield() behaves much better than under any other
scheduler i have tested so far. )
- sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
way than the vanilla scheduler does. It uses 100 runqueues (for all
100 RT priority levels, instead of 140 in the vanilla scheduler)
and it needs no expired array.
- reworked/sanitized SMP load-balancing: the runqueue-walking
assumptions are gone from the load-balancing code now, and
iterators of the scheduling modules are used. The balancing code got
quite a bit simpler as a result.
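[ To make the 'timeline' idea above concrete, here is a minimal sketch of
  enqueueing a task into a time-ordered rbtree while caching the leftmost
  node. This is an illustration only, not code from the patch: the struct,
  variable and function names are invented for the example, and only the
  <linux/rbtree.h> API itself is real. ]

#include <linux/types.h>
#include <linux/rbtree.h>

struct fair_task {
        u64 fair_key;                   /* nanosecond-based timeline key */
        struct rb_node run_node;
};

static struct rb_root timeline = RB_ROOT;
static struct rb_node *leftmost;        /* cached: next task to run */

static void timeline_enqueue(struct fair_task *p)
{
        struct rb_node **link = &timeline.rb_node, *parent = NULL;
        int is_leftmost = 1;

        /* O(log n) walk to the insertion point, ordered by fair_key */
        while (*link) {
                struct fair_task *entry;

                parent = *link;
                entry = rb_entry(parent, struct fair_task, run_node);
                if (p->fair_key < entry->fair_key) {
                        link = &parent->rb_left;
                } else {
                        link = &parent->rb_right;
                        is_leftmost = 0;
                }
        }
        rb_link_node(&p->run_node, parent, link);
        rb_insert_color(&p->run_node, &timeline);

        /* caching the leftmost node keeps 'pick next task' O(1) */
        if (is_leftmost)
                leftmost = &p->run_node;
}

With the leftmost node cached, picking the next task never has to walk the
tree; only insertion and removal pay the O(log n) cost.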
the core scheduler got smaller by more than 700 lines:
kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
1 file changed, 372 insertions(+), 1082 deletions(-)
and even adding all the scheduling modules, the total size impact is
relatively small:
18 files changed, 1454 insertions(+), 1133 deletions(-)
most of the increase is due to extensive comments. The kernel size
impact is in fact a small negative:
text data bss dec hex filename
23366 4001 24 27391 6aff kernel/sched.o.vanilla
24159 2705 56 26920 6928 kernel/sched.o.CFS
(this is mainly due to the benefit of getting rid of the expired array
and its data structure overhead.)
thanks go to Thomas Gleixner and Arjan van de Ven for review of this
patchset.
as usual, any sort of feedback, bugreports, fixes and suggestions are
more than welcome,
Ingo
^ permalink raw reply [flat|nested] 304+ messages in thread* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar @ 2007-04-13 20:27 ` Bill Huey 2007-04-13 20:55 ` Ingo Molnar 2007-04-13 21:50 ` Ingo Molnar ` (12 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Bill Huey @ 2007-04-13 20:27 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] ... > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] Ingo, Con has been asking for module support for years if I understand your patch correctly. You'll also need this for -rt as well with regards to bandwidth scheduling. Good to see that you're moving in this direction. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:27 ` Bill Huey @ 2007-04-13 20:55 ` Ingo Molnar 2007-04-13 21:21 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 20:55 UTC (permalink / raw) To: Bill Huey Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > Con has been asking for module support for years if I understand your > patch corectly. [...] Yeah. Note that there are some subtle but crucial differences between PlugSched (which Con used, and which i opposed in the past) and this approach. PlugSched cuts the interfaces at a high level in a monolithic way and introduces kernel/scheduler.c that uses one pluggable scheduler (represented via the 'scheduler' global template) at a time. while in this CFS patchset i'm using modularization ('scheduler classes') to simplify the _existing_ multi-policy implementation of the scheduler. These 'scheduler classes' are in a hierarchy and are stacked on top of each other. They are all in use at once. Currently there's two of them: sched_ops_rt is stacked on top of sched_ops_fair. Fortunately the performance impact is minimal. So scheduler classes are mainly a simplification of the design of the scheduler - not just a mere facility to select multiple schedulers. Their ability to also facilitate easier experimentation with schedulers is 'just' a happy side-effect. So, all in all: it's a fairly different model from PlugSched (and that's why i didn't reuse PlugSched) - but there's indeed overlap. > [...] You'll also need this for -rt as well with regards to bandwidth > scheduling. yeah. scheduler classes are also useful for other purposes like containers and virtualization, hierarchical/group scheduling, security encapsulation, etc. - features that can be on-demand layered, and which we don't necessarily want to have enabled all the time. > [...] Good to see that you're moving in this direction. thanks! :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
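[ To illustrate the stacking described above: a minimal sketch of what a
  NULL-terminated chain of scheduler classes can look like. The class names
  (sched_ops_rt, sched_ops_fair) and the ->next chaining come from the mail
  above; the member names and the pick_next_task() walk are assumptions made
  for the example, not the interface from the posted patch. ]

struct rq;
struct task_struct;

struct sched_ops {
        /* next, lower-priority class; NULL ends the chain */
        const struct sched_ops *next;

        void (*enqueue_task)(struct rq *rq, struct task_struct *p);
        void (*dequeue_task)(struct rq *rq, struct task_struct *p);
        struct task_struct *(*pick_next_task)(struct rq *rq);
};

/* sched_ops_rt is stacked on top of sched_ops_fair */
extern const struct sched_ops sched_ops_fair;   /* .next = NULL */
extern const struct sched_ops sched_ops_rt;     /* .next = &sched_ops_fair */

/* the core asks each class in turn; the first one with a runnable task wins */
static struct task_struct *pick_next_task(struct rq *rq)
{
        const struct sched_ops *class;
        struct task_struct *p;

        for (class = &sched_ops_rt; class; class = class->next) {
                p = class->pick_next_task(rq);
                if (p)
                        return p;
        }
        return NULL;    /* nothing runnable: switch to the idle thread */
}

This is also why the per-class policy details stay out of the core: the core
only ever walks the chain and talks to whichever class answers first.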
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:55 ` Ingo Molnar @ 2007-04-13 21:21 ` William Lee Irwin III 2007-04-13 21:35 ` Bill Huey 2007-04-13 21:39 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 21:21 UTC (permalink / raw) To: Ingo Molnar Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote: > Yeah. Note that there are some subtle but crutial differences between > PlugSched (which Con used, and which i opposed in the past) and this > approach. > PlugSched cuts the interfaces at a high level in a monolithic way and > introduces kernel/scheduler.c that uses one pluggable scheduler > (represented via the 'scheduler' global template) at a time. What I originally did did so for a good reason, which was that it was intended to support far more radical reorganizations, for instance, things that changed the per-cpu runqueue affairs for gang scheduling. I wrote a top-level driver that did support scheduling classes in a similar fashion, though it didn't survive others maintaining the patches. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 21:21 ` William Lee Irwin III @ 2007-04-13 21:35 ` Bill Huey 0 siblings, 0 replies; 304+ messages in thread From: Bill Huey @ 2007-04-13 21:35 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote: > On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote: > > Yeah. Note that there are some subtle but crutial differences between > > PlugSched (which Con used, and which i opposed in the past) and this > > approach. > > PlugSched cuts the interfaces at a high level in a monolithic way and > > introduces kernel/scheduler.c that uses one pluggable scheduler > > (represented via the 'scheduler' global template) at a time. > > What I originally did did so for a good reason, which was that it was > intended to support far more radical reorganizations, for instance, > things that changed the per-cpu runqueue affairs for gang scheduling. > I wrote a top-level driver that did support scheduling classes in a > similar fashion, though it didn't survive others maintaining the patches. Also, gang scheduling is needed to solve virtualization issues regarding spinlocks in a guest image. You could potentially be spinning on a thread that isn't currently running which, needless to say, is very bad. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 21:21 ` William Lee Irwin III 2007-04-13 21:35 ` Bill Huey @ 2007-04-13 21:39 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 21:39 UTC (permalink / raw) To: William Lee Irwin III Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > What I originally did did so for a good reason, which was that it was > intended to support far more radical reorganizations, for instance, > things that changed the per-cpu runqueue affairs for gang scheduling. > I wrote a top-level driver that did support scheduling classes in a > similar fashion, though it didn't survive others maintaining the > patches. yeah - i looked at plugsched-6.5-for-2.6.20.patch in particular. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar 2007-04-13 20:27 ` Bill Huey @ 2007-04-13 21:50 ` Ingo Molnar 2007-04-13 21:57 ` Michal Piotrowski ` (11 subsequent siblings) 13 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 21:50 UTC (permalink / raw) To: linux-kernel Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > and even adding all the scheduling modules, the total size impact is > relatively small: > > 18 files changed, 1454 insertions(+), 1133 deletions(-) > > most of the increase is due to extensive comments. The kernel size > impact is in fact a small negative: > > text data bss dec hex filename > 23366 4001 24 27391 6aff kernel/sched.o.vanilla > 24159 2705 56 26920 6928 kernel/sched.o.CFS update: these were older numbers, here are the stats redone with the latest patch: text data bss dec hex filename 23366 4001 24 27391 6aff kernel/sched.o.vanilla 23671 4548 24 28243 6e53 kernel/sched.o.sd.v40 23349 2705 24 26078 65de kernel/sched.o.cfs so CFS is now a win both for text and for data size :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar 2007-04-13 20:27 ` Bill Huey 2007-04-13 21:50 ` Ingo Molnar @ 2007-04-13 21:57 ` Michal Piotrowski 2007-04-13 22:15 ` Daniel Walker ` (10 subsequent siblings) 13 siblings, 0 replies; 304+ messages in thread From: Michal Piotrowski @ 2007-04-13 21:57 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar napisał(a): > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > Friday the 13th, my lucky day :). /mnt/md0/devel/linux-msc-cfs/usr/include/linux/sched.h requires linux/rbtree.h, which does not exist in exported headers make[3]: *** No rule to make target `/mnt/md0/devel/linux-msc-cfs/usr/include/linux/.check.sched.h', needed by `__headerscheck'. Stop. make[2]: *** [linux] Error 2 make[1]: *** [headers_check] Error 2 make: *** [vmlinux] Error 2 Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (PL) (http://www.stardust.webpages.pl/ltg/) LTG - Linux Testers Group (EN) (http://www.stardust.webpages.pl/linux_testers_group_en/) Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> --- linux-msc-cfs-clean/include/linux/Kbuild 2007-04-13 23:52:47.000000000 +0200 +++ linux-msc-cfs/include/linux/Kbuild 2007-04-13 23:49:41.000000000 +0200 @@ -133,6 +133,7 @@ header-y += quotaio_v1.h header-y += quotaio_v2.h header-y += radeonfb.h header-y += raw.h +header-y += rbtree.h header-y += resource.h header-y += rose.h header-y += smbno.h ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (2 preceding siblings ...) 2007-04-13 21:57 ` Michal Piotrowski @ 2007-04-13 22:15 ` Daniel Walker 2007-04-13 22:30 ` Ingo Molnar 2007-04-13 22:21 ` William Lee Irwin III ` (9 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Daniel Walker @ 2007-04-13 22:15 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 2007-04-13 at 22:21 +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. I'm not in love with the current or other schedulers, so I'm indifferent to this change. However, I was reviewing your release notes and the patch and found myself wondering what the algorithmic complexity of this new scheduler is. I assumed it would also be constant time, but the __enqueue_task_fair doesn't appear to be constant time (rbtree insert complexity). Maybe that's not a critical path, but I thought I would at least comment on it. Daniel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:15 ` Daniel Walker @ 2007-04-13 22:30 ` Ingo Molnar 2007-04-13 22:37 ` Willy Tarreau 2007-04-13 23:59 ` Daniel Walker 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 22:30 UTC (permalink / raw) To: Daniel Walker Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Daniel Walker <dwalker@mvista.com> wrote: > I'm not in love with the current or other schedulers, so I'm > indifferent to this change. However, I was reviewing your release > notes and the patch and found myself wonder what the logarithmic > complexity of this new scheduler is .. I assumed it would also be > constant time , but the __enqueue_task_fair doesn't appear to be > constant time (rbtree insert complexity).. [...] i've been worried about that myself and i've done extensive measurements before choosing this implementation. The rbtree turned out to be a quite compact data structure: we get it quite cheaply as part of the task structure cachemisses - which have to be touched anyway. For 1000 tasks it's a loop of ~10 - that's still very fast and bound in practice. here's a test i did under CFS. Lets take some ridiculous load: 1000 infinite loop tasks running at SCHED_BATCH on a single CPU (all inserted into the same rbtree), and lets run lat_ctx: neptune:~/l> uptime 22:51:23 up 8 min, 2 users, load average: 713.06, 254.64, 91.51 neptune:~/l> ./lat_ctx -s 0 2 "size=0k ovr=1.61 2 1.41 lets stop the 1000 tasks and only have ~2 tasks in the runqueue: neptune:~/l> ./lat_ctx -s 0 2 "size=0k ovr=1.70 2 1.16 so the overhead is 0.25 usecs. Considering the load (1000 tasks trash the cache like crazy already), this is more than acceptable. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:30 ` Ingo Molnar @ 2007-04-13 22:37 ` Willy Tarreau 2007-04-13 23:59 ` Daniel Walker 1 sibling, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-13 22:37 UTC (permalink / raw) To: Ingo Molnar Cc: Daniel Walker, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:30:17AM +0200, Ingo Molnar wrote: > > * Daniel Walker <dwalker@mvista.com> wrote: > > > I'm not in love with the current or other schedulers, so I'm > > indifferent to this change. However, I was reviewing your release > > notes and the patch and found myself wonder what the logarithmic > > complexity of this new scheduler is .. I assumed it would also be > > constant time , but the __enqueue_task_fair doesn't appear to be > > constant time (rbtree insert complexity).. [...] > > i've been worried about that myself and i've done extensive measurements > before choosing this implementation. The rbtree turned out to be a quite > compact data structure: we get it quite cheaply as part of the task > structure cachemisses - which have to be touched anyway. For 1000 tasks > it's a loop of ~10 - that's still very fast and bound in practice. I'm not worried at all by O(log(n)) algorithms, and generally prefer smart log(n) than dumb O(1). In a userland TCP stack I started to write 2 years ago, I used a comparable scheduler and could reach a sustained rate of 145000 connections/s at 4 millions of concurrent connections. And yes, each time a packet was sent or received, a task was queued/dequeued (so about 450k/s with 4 million tasks, on an athlon 1.5 GHz). So that seems much higher than what we currently need. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:30 ` Ingo Molnar 2007-04-13 22:37 ` Willy Tarreau @ 2007-04-13 23:59 ` Daniel Walker 2007-04-14 10:55 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Daniel Walker @ 2007-04-13 23:59 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner One other thing, what happens in the case of slow, frequency-changing, and/or inaccurate clocks? Is the old sched_clock behavior still tolerated? Daniel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:59 ` Daniel Walker @ 2007-04-14 10:55 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 10:55 UTC (permalink / raw) To: Daniel Walker Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Daniel Walker <dwalker@mvista.com> wrote: > One other thing, what happens in the case of slow, frequency changing, > are/or inaccurate clocks .. Is the old sched_clock behavior still > tolerated? yeah, good question. Yesterday i did a quick testboot with that too, and it seemed to behave pretty OK with the low-res [jiffies based] sched_clock() too. Although in that case things are much more of an approximation and rounding/arithmetics artifacts are possible. CFS works best with a high-resolution cycle counter. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
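[ For reference, a rough sketch of the jiffies-based fallback being discussed
  here - not the exact in-tree implementation. It only advances in 1/HZ
  steps, which is why per-task accounting becomes an approximation unless a
  high-resolution cycle counter backs sched_clock(): ]

#include <linux/jiffies.h>

/* low-res fallback: returns nanoseconds, but only with jiffy (1/HZ) granularity */
unsigned long long sched_clock(void)
{
        return (unsigned long long)jiffies * (1000000000 / HZ);
}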
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (3 preceding siblings ...) 2007-04-13 22:15 ` Daniel Walker @ 2007-04-13 22:21 ` William Lee Irwin III 2007-04-13 22:52 ` Ingo Molnar 2007-04-14 22:38 ` Davide Libenzi 2007-04-13 22:31 ` Willy Tarreau ` (8 subsequent siblings) 13 siblings, 2 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 22:21 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] A pleasant surprise, though I did see it coming. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > Highlights are: > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. It probably needs further clarification that they're things on the order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization amongst the classes is furthermore assumed, and so on. They're not quite capable of being full-blown alternative policies, though quite a bit can be crammed into them. There are issues with the per- scheduling class data not being very well-abstracted. A union for per-class data might help, if not a dynamically allocated scheduling class -private structure. Getting an alternative policy floating around that actually clashes a little with the stock data in the task structure would help clarify what's needed. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! Bob Mullens banged out a virtual deadline interactive task scheduler for Multics back in 1976 or thereabouts. ISTR the name Ferranti in connection with deadline task scheduling for UNIX in particular. I've largely seen deadline schedulers as a realtime topic, though. In any event, it's not so radical as to lack a fair number of precedents. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. 
[ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). A binomial heap would likely serve your purposes better than rbtrees. It's faster to have the next item to dequeue at the root of the tree structure rather than a leaf, for one. There are, of course, other priority queue structures (e.g. van Emde Boas) able to exploit the limited precision of the priority key for faster asymptotics, though actual performance is an open question. Another advantage of heaps is that they support decreasing priorities directly, so that instead of removal and reinsertion, a less invasive movement within the tree is possible. This nets additional constant factor improvements beyond those for the next item to dequeue for the case where a task remains runnable, but is preempted and its priority decreased while it remains runnable. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > /proc/sys/kernel/sched_granularity_ns > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. I like not relying on timeslices. Timeslices ultimately get you into a 2.4.x -like epoch expiry scenarios and introduce a number of RR-esque artifacts therefore. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. I'm always suspicious of these claims. A moderately formal regression test suite needs to be assembled and the testcases rather seriously cleaned up so they e.g. run for a deterministic period of time, have their parameters passable via command-line options instead of editing and recompiling, don't need Lindenting to be legible, and so on. With that in hand, a battery of regression tests can be run against scheduler modifications to verify their correctness and to detect any disturbance in scheduling semantics they might cause. A very serious concern is that while a fresh scheduler may pass all these tests, later modifications may later cause failures unnoticed because no one's doing the regression tests and there's no obvious test suite for testing types to latch onto. Another is that the testcases themselves may bitrot if they're not maintainable code. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. Speaking of regression tests, let's please at least state intended nice semantics and get a regression test for CPU bandwidth distribution by nice levels going. 
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) And there's another one. sched_yield() semantics need a regression test more transparent than VolanoMark or other macrobenchmarks. At some point we really need to decide what our sched_yield() is intended to do and get something out there to detect whether it's behaving as intended. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. The SMP load balancing class operations strike me as unusual and likely to trip over semantic issues in alternative scheduling classes. Getting some alternative scheduling classes out there to clarify the issues would help here, too. A more general question here is what you mean by "completely fair;" there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or inter-user fairness going on, though one might argue those are relatively obscure notions of fairness. Complete fairness arguably precludes static prioritization by nice levels, so there is also that. There is also the issue of what a fair CPU bandwidth distribution between tasks of varying desired in-isolation CPU utilization might be. I suppose my thorniest point is where the demonstration of fairness is as, say, a testcase. Perhaps it's fair now; when will we find out when that fairness has been disturbed? What these things mean when there are multiple CPU's to schedule across may also be of concern. I propose the following two testcases: (1) CPU bandwidth distribution of CPU-bound tasks of varying nice levels Create a number of tasks at varying nice levels. Measure the CPU bandwidth allocated to each. Success depends on intent: we decide up-front that a given nice level should correspond to a given share of CPU bandwidth. Check to see how far from the intended distribution of CPU bandwidth according to those decided-up-front shares the actual distribution of CPU bandwidth is for the test. (2) CPU bandwidth distribution of tasks with varying CPU demands Create a number of tasks that would in isolation consume varying %cpu. Measure the CPU bandwidth allocated to each. Success depends on intent here, too. Decide up-front that a given %cpu that would be consumed in isolation should correspond to a given share of CPU bandwidth and check the actual distribution of CPU bandwidth vs. what was intended. Note that the shares need not linearly correspond to the %cpu; various sorts of things related to interactivity will make this nonlinear. A third testcase for sched_yield() should be brewed up. These testcases are oblivious to SMP. This will demand that a scheduling policy integrate with load balancing to the extent that load balancing occurs for the sake of distributing CPU bandwidth according to nice level. Some explicit decision should be made regarding that. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
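[ A minimal sketch of testcase (1) above - the nice levels, runtime and
  output format are assumptions, and it is deliberately oblivious to SMP and
  to what the 'intended' per-nice-level shares should be: it only measures
  what each competing hog actually received. ]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <sys/resource.h>

#define RUNTIME 30      /* seconds the hogs compete for the CPU */

int main(void)
{
        int nice_levels[] = { 0, 5, 10, 15, 19 };
        int n = sizeof(nice_levels) / sizeof(nice_levels[0]);
        pid_t pid[n];
        int i;

        for (i = 0; i < n; i++) {
                pid[i] = fork();
                if (pid[i] == 0) {
                        nice(nice_levels[i]);
                        for (;;)        /* pure CPU hog */
                                ;
                }
        }

        sleep(RUNTIME);

        for (i = 0; i < n; i++) {
                struct rusage ru;

                kill(pid[i], SIGKILL);
                wait4(pid[i], NULL, 0, &ru);
                printf("nice %3d: %ld.%06ld s of CPU\n", nice_levels[i],
                       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        }
        return 0;
}

Comparing the printed CPU times against the decided-up-front shares (and
repeating the run pinned to one CPU as well as spread across several) gives
the pass/fail criterion described above.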
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:21 ` William Lee Irwin III @ 2007-04-13 22:52 ` Ingo Molnar 2007-04-13 23:30 ` William Lee Irwin III 2007-04-14 22:38 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 22:52 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > > is to address various feature requests and to fix deficiencies in the > > vanilla scheduler that were suggested/found in the past few years, both > > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > > new scheduler will be active by default and all tasks will default > > to the new SCHED_FAIR interactive scheduling class. ] > > A pleasant surprise, though I did see it coming. hey ;) > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > > scheduler modules. These modules encapsulate scheduling policy > > details and are handled by the scheduler core without the core > > code assuming about them too much. > > It probably needs further clarification that they're things on the > order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization > amongst the classes is furthermore assumed, and so on. [...] yep - they are linked via sched_ops->next pointer, with NULL delimiting the last one. > [...] They're not quite capable of being full-blown alternative > policies, though quite a bit can be crammed into them. yeah, they are not full-blown: i extended them on-demand, for the specific purposes of sched_fair.c and sched_rt.c. More can be done too. > There are issues with the per- scheduling class data not being very > well-abstracted. [...] yes. It's on my TODO list: i'll work more on extending the cleanups to those fields too. > A binomial heap would likely serve your purposes better than rbtrees. > It's faster to have the next item to dequeue at the root of the tree > structure rather than a leaf, for one. There are, of course, other > priority queue structures (e.g. van Emde Boas) able to exploit the > limited precision of the priority key for faster asymptotics, though > actual performance is an open question. i'm caching the leftmost leaf, which serves as an alternate, task-pick centric root in essence. > Another advantage of heaps is that they support decreasing priorities > directly, so that instead of removal and reinsertion, a less invasive > movement within the tree is possible. This nets additional constant > factor improvements beyond those for the next item to dequeue for the > case where a task remains runnable, but is preempted and its priority > decreased while it remains runnable. yeah. 
(Note that in CFS i'm not decreasing priorities anywhere though - all the priority levels in CFS stay constant, fairness is not achieved via rotating priorities or similar, it is achieved via the accounting code.) > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > due to its design, the CFS scheduler is not prone to any of the > > 'attacks' that exist today against the heuristics of the stock > > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > > work fine and do not impact interactivity and produce the expected > > behavior. > > I'm always suspicious of these claims. [...] hey, sure - but please give it a go nevertheless, i _did_ test all these ;) > A moderately formal regression test suite needs to be assembled [...] by all means feel free! ;) > A more general question here is what you mean by "completely fair;" by that i mean the most common-sense definition: with N tasks running each gets 1/N CPU time if observed for a reasonable amount of time. Now extend this to arbitrary scheduling patterns, the end result should still be completely fair, according to the fundamental 1/N(time) rule individually applied to all the small scheduling patterns that the scheduling patterns give. (this assumes that the scheduling patterns are reasonably independent of each other - if they are not then there's no reasonable definition of fairness that makes sense, and we might as well use the 1/N rule for those cases too.) > there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or > inter-user fairness going on, though one might argue those are > relatively obscure notions of fairness. [...] sure, i mainly concentrated on what we have in Linux today. The things you mention are add-ons that i can see handling via new scheduling classes: all the CKRM and containers type of CPU time management facilities. > What these things mean when there are multiple CPU's to schedule > across may also be of concern. that is handled by the existing smp-nice load balancer, that logic is preserved under CFS. > These testcases are oblivious to SMP. This will demand that a > scheduling policy integrate with load balancing to the extent that > load balancing occurs for the sake of distributing CPU bandwidth > according to nice level. Some explicit decision should be made > regarding that. this should already work reasonably fine with CFS: try massive_intr.c on an SMP box. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
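[ Continuing the illustrative sketch from the announcement above (same
  invented names, not patch code): with the leftmost node cached, picking the
  next task is O(1), and the cache is refreshed with rb_next() when that task
  leaves the timeline. ]

static struct fair_task *timeline_pick_next(void)
{
        if (!leftmost)
                return NULL;    /* timeline empty */
        return rb_entry(leftmost, struct fair_task, run_node);
}

static void timeline_dequeue(struct fair_task *p)
{
        /* keep the 'alternate root' pointing at the next-leftmost node */
        if (leftmost == &p->run_node)
                leftmost = rb_next(&p->run_node);
        rb_erase(&p->run_node, &timeline);
}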
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:52 ` Ingo Molnar @ 2007-04-13 23:30 ` William Lee Irwin III 2007-04-13 23:44 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> A binomial heap would likely serve your purposes better than rbtrees. >> It's faster to have the next item to dequeue at the root of the tree >> structure rather than a leaf, for one. There are, of course, other >> priority queue structures (e.g. van Emde Boas) able to exploit the >> limited precision of the priority key for faster asymptotics, though >> actual performance is an open question. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > i'm caching the leftmost leaf, which serves as an alternate, task-pick > centric root in essence. I noticed that, yes. It seemed a better idea to me to use a data structure that has what's needed built-in, but I suppose it's not gospel. * William Lee Irwin III <wli@holomorphy.com> wrote: >> Another advantage of heaps is that they support decreasing priorities >> directly, so that instead of removal and reinsertion, a less invasive >> movement within the tree is possible. This nets additional constant >> factor improvements beyond those for the next item to dequeue for the >> case where a task remains runnable, but is preempted and its priority >> decreased while it remains runnable. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > yeah. (Note that in CFS i'm not decreasing priorities anywhere though - > all the priority levels in CFS stay constant, fairness is not achieved > via rotating priorities or similar, it is achieved via the accounting > code.) Sorry, "priority" here would be from the POV of the queue data structure. From the POV of the scheduler it would be resetting the deadline or whatever the nomenclature cooked up for things is, most obviously in requeue_task_fair() and task_tick_fair(). * William Lee Irwin III <wli@holomorphy.com> wrote: >> I'm always suspicious of these claims. [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > hey, sure - but please give it a go nevertheless, i _did_ test all these > ;) The suspicion essentially centers around how long the state of affairs will hold up because comprehensive re-testing is not noticeably done upon updates to scheduling code or kernel point releases. * William Lee Irwin III <wli@holomorphy.com> wrote: >> A moderately formal regression test suite needs to be assembled [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > by all means feel free! ;) I can only do so much, but I have done work to clean up other testcases going around. I'm mostly looking at testcases as I go over them or develop some interest in the subject and rewriting those that already exist or hammering out new ones as I need them. The main contribution toward this is that I've sort of made a mental note to stash the results of the effort somewhere and pass them along to those who do regular testing on kernels or otherwise import test suites into their collections. 
* William Lee Irwin III <wli@holomorphy.com> wrote: >> A more general question here is what you mean by "completely fair;" On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > by that i mean the most common-sense definition: with N tasks running > each gets 1/N CPU time if observed for a reasonable amount of time. Now > extend this to arbitrary scheduling patterns, the end result should > still be completely fair, according to the fundamental 1/N(time) rule > individually applied to all the small scheduling patterns that the > scheduling patterns give. (this assumes that the scheduling patterns are > reasonably independent of each other - if they are not then there's no > reasonable definition of fairness that makes sense, and we might as well > use the 1/N rule for those cases too.) I'd start with identically-behaving CPU-bound tasks here. It's easy enough to hammer out a testcase that starts up N CPU-bound tasks, runs them for a few minutes, stops them, collects statistics on their runtime, and gives us an idea of whether 1/N came out properly. I'll get around to that at some point. Where it gets complex is when the behavior patterns vary, e.g. they're not entirely CPU-bound and their desired in-isolation CPU utilization varies, or when nice levels vary, or both vary. I went on about testcases for those in particular in the prior post, though not both at once. The nice level one in particular needs an up-front goal for distribution of CPU bandwidth in a mixture of competing tasks with varying nice levels. There are different ways to define fairness, but a uniform distribution of CPU bandwidth across a set of identical competing tasks is a good, testable definition. * William Lee Irwin III <wli@holomorphy.com> wrote: >> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or >> inter-user fairness going on, though one might argue those are >> relatively obscure notions of fairness. [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > sure, i mainly concentrated on what we have in Linux today. The things > you mention are add-ons that i can see handling via new scheduling > classes: all the CKRM and containers type of CPU time management > facilities. At some point the CKRM and container people should be pinged to see what (if anything) they need to achieve these sorts of things. It's not clear to me that the specific cases I cited are considered relevant to anyone. I presume that if they are, someone will pipe up with a feature request. It was more a sort of catalogue of different notions of fairness that could arise than any sort of suggestion. * William Lee Irwin III <wli@holomorphy.com> wrote: >> What these things mean when there are multiple CPU's to schedule >> across may also be of concern. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > that is handled by the existing smp-nice load balancer, that logic is > preserved under CFS. Given the things going wrong, I'm curious as to whether that works, and if so, how well. I'll drop that into my list of testcases that should be arranged for, though I won't guarantee that I'll get to it myself in any sort of timely fashion. What this ultimately needs is specifying the semantics of nice levels so that we can say that a mixture of competing tasks with varying nice levels should have an ideal distribution of CPU bandwidth to check for. * William Lee Irwin III <wli@holomorphy.com> wrote: >> These testcases are oblivious to SMP. 
This will demand that a >> scheduling policy integrate with load balancing to the extent that >> load balancing occurs for the sake of distributing CPU bandwidth >> according to nice level. Some explicit decision should be made >> regarding that. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > this should already work reasonably fine with CFS: try massive_intr.c on > an SMP box. Where is massive_intr.c, BTW? -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:30 ` William Lee Irwin III @ 2007-04-13 23:44 ` Ingo Molnar 2007-04-13 23:58 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 23:44 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 1358 bytes --] * William Lee Irwin III <wli@holomorphy.com> wrote: > Where it gets complex is when the behavior patterns vary, e.g. they're > not entirely CPU-bound and their desired in-isolation CPU utilization > varies, or when nice levels vary, or both vary. [...] yes. I tested things like 'massive_intr.c' (attached, written by Satoru Takeuchi) which starts N tasks which each work for 8msec then sleep 1msec: from its output, the second column is the CPU time each thread got, the more even, the fairer the scheduling. On vanilla i get: mercury:~> ./massive_intr 10 10 024873 00000150 024874 00000123 024870 00000069 024868 00000068 024866 00000051 024875 00000206 024872 00000093 024869 00000138 024867 00000078 024871 00000223 on CFS i get: neptune:~> ./massive_intr 10 10 002266 00000112 002260 00000113 002261 00000112 002267 00000112 002269 00000112 002265 00000112 002262 00000113 002268 00000113 002264 00000112 002263 00000113 so it is quite a bit more even ;) another related test-utility is one i wrote: http://people.redhat.com/mingo/scheduler-patches/ring-test.c this is a ring of 100 tasks each doing work for 100 msecs and then sleeping for 1 msec. I usually test this by also running a CPU hog in parallel to it, and checking whether it gets ~50.0% of CPU time under CFS. (it does) Ingo [-- Attachment #2: massive_intr.c --] [-- Type: text/plain, Size: 9833 bytes --] #if 0 Hi Ingo and all, When I was executing massive interactive processes, I found that some of them occupy CPU time and the others hardly run. It seems that some of processes which occupy CPU time always has max effective prio (default+5) and the others have max - 1. What happen here is... 1. If there are moderate number of max interactive processes, they can be re-inserted into active queue without falling down its priority again and again. 2. In this case, the others seldom run, and can't get max effective priority at next exhausting because scheduler considers them to sleep too long. 3. Goto 1, OOPS! Unfortunately I haven't been able to make the patch resolving this problem yet. Any idea? I also attach the test program which easily recreates this problem. Test program flow: 1. First process starts child proesses and wait for 5 minutes. 2. Each child process executes "work 8 msec and sleep 1 msec" loop continuously. 3. After 3 minits have passed, each child processes prints the # of loops which executed. What expected: Each child processes execute nearly equal # of loops. Test environment: - kernel: 2.6.20(*1) - # of CPUs: 1 or 2 - # of child processes: 200 or 400 - nice value: 0 or 20(*2) *1) I confirmed that 2.6.21-rc5 has no change regarding this problem. *2) If a process have nice 20, scheduler never regards it as interactive. 
Test results: -----------+----------------+------+------------------------------------ # of CPUs | # of processes | nice | result -----------+----------------+------+------------------------------------ | | 20 | looks good 1(i386) | +------+------------------------------------ | | 0 | 4 processes occupy 98% of CPU time -----------+ 200 +------+------------------------------------ | | 20 | looks good | +------+------------------------------------ | | 0 | 8 processes occupy 72% of CPU time 2(ia64) +----------------+------+------------------------------------ | 400 | 20 | looks good | +------+------------------------------------ | | 0 | 8 processes occupy 98% of CPU time -----------+----------------+------+------------------------------------ FYI. 2.6.21-rc3-mm1 (enabling RSDL scheduler) works fine in the all casees :-) Thanks, Satoru ------------------------------------------------------------------------------- #endif /* * massive_intr - run @nproc interactive processes and print the number of * loops(*1) each process executes in @runtime secs. * * *1) "work 8 msec and sleep 1msec" loop * * Usage: massive_intr <nproc> <runtime> * * @nproc: number of processes * @runtime: execute time[sec] * * ex) If you want to run 300 processes for 5 mins, issue the * command as follows: * * $ massive_intr 300 300 * * How to build: * * cc -o massive_intr massive_intr.c -lrt * * * Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> * * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or (at * your option) any later version. * * This program is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. * * You should have received a copy of the GNU General Public License along * with this program; if not, write to the Free Software Foundation, Inc., * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. 
* * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */ #include <sys/time.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <sys/wait.h> #include <fcntl.h> #include <unistd.h> #include <semaphore.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <string.h> #include <errno.h> #include <err.h> #define WORK_MSECS 8 #define SLEEP_MSECS 1 #define MAX_PROC 1024 #define SAMPLE_COUNT 1000000000 #define USECS_PER_SEC 1000000 #define USECS_PER_MSEC 1000 #define NSECS_PER_MSEC 1000000 #define SHMEMSIZE 4096 static const char *shmname = "/sched_interactive_shmem"; static void *shmem; static sem_t *printsem; static int nproc; static int runtime; static int fd; static time_t *first; static pid_t pid[MAX_PROC]; static int return_code; static void cleanup_resources(void) { if (sem_destroy(printsem) < 0) warn("sem_destroy() failed"); if (munmap(shmem, SHMEMSIZE) < 0) warn("munmap() failed"); if (close(fd) < 0) warn("close() failed"); } static void abnormal_exit(void) { if (kill(getppid(), SIGUSR2) < 0) err(EXIT_FAILURE, "kill() failed"); } static void sighandler(int signo) { } static void sighandler2(int signo) { return_code = EXIT_FAILURE; } static void loopfnc(int nloop) { int i; for (i = 0; i < nloop; i++) ; } static int loop_per_msec(void) { struct timeval tv[2]; int before, after; if (gettimeofday(&tv[0], NULL) < 0) return -1; loopfnc(SAMPLE_COUNT); if (gettimeofday(&tv[1], NULL) < 0) return -1; before = tv[0].tv_sec*USECS_PER_SEC+tv[0].tv_usec; after = tv[1].tv_sec*USECS_PER_SEC+tv[1].tv_usec; return SAMPLE_COUNT/(after - before)*USECS_PER_MSEC; } static void *test_job(void *arg) { int l = (int)arg; int count = 0; time_t current; sigset_t sigset; struct sigaction sa; struct timespec ts = { 0, NSECS_PER_MSEC*SLEEP_MSECS}; sa.sa_handler = sighandler; if (sigemptyset(&sa.sa_mask) < 0) { warn("sigemptyset() failed"); abnormal_exit(); } sa.sa_flags = 0; if (sigaction(SIGUSR1, &sa, NULL) < 0) { warn("sigaction() failed"); abnormal_exit(); } if (sigemptyset(&sigset) < 0) { warn("sigfillset() failed"); abnormal_exit(); } sigsuspend(&sigset); if (errno != EINTR) { warn("sigsuspend() failed"); abnormal_exit(); } /* main loop */ do { loopfnc(WORK_MSECS*l); if (nanosleep(&ts, NULL) < 0) { warn("nanosleep() failed"); abnormal_exit(); } count++; if (time(&current) == -1) { warn("time() failed"); abnormal_exit(); } } while (difftime(current, *first) < runtime); if (sem_wait(printsem) < 0) { warn("sem_wait() failed"); abnormal_exit(); } printf("%06d\t%08d\n", getpid(), count); if (sem_post(printsem) < 0) { warn("sem_post() failed"); abnormal_exit(); } exit(EXIT_SUCCESS); } static void usage(void) { fprintf(stderr, "Usage : massive_intr <nproc> <runtime>\n" "\t\tnproc : number of processes\n" "\t\truntime : execute time[sec]\n"); exit(EXIT_FAILURE); } int main(int argc, char **argv) { int i, j; int status; sigset_t sigset; struct sigaction sa; int c; if (argc != 3) usage(); nproc = strtol(argv[1], NULL, 10); if (errno || nproc < 1 || nproc > MAX_PROC) err(EXIT_FAILURE, "invalid multinum"); runtime = strtol(argv[2], NULL, 10); if (errno || runtime <= 0) err(EXIT_FAILURE, "invalid runtime"); sa.sa_handler = sighandler2; if (sigemptyset(&sa.sa_mask) < 0) err(EXIT_FAILURE, "sigemptyset() failed"); sa.sa_flags = 0; if (sigaction(SIGUSR2, &sa, NULL) < 0) err(EXIT_FAILURE, "sigaction() failed"); if (sigemptyset(&sigset) < 0) err(EXIT_FAILURE, "sigemptyset() failed"); if (sigaddset(&sigset, SIGUSR1) < 0) err(EXIT_FAILURE, "sigaddset() failed"); if
(sigaddset(&sigset, SIGUSR2) < 0) err(EXIT_FAILURE, "sigaddset() failed"); if (sigprocmask(SIG_BLOCK, &sigset, NULL) < 0) err(EXIT_FAILURE, "sigprocmask() failed"); /* setup shared memory */ if ((fd = shm_open(shmname, O_CREAT | O_RDWR, 0644)) < 0) err(EXIT_FAILURE, "shm_open() failed"); if (shm_unlink(shmname) < 0) { warn("shm_unlink() failed"); goto err_close; } if (ftruncate(fd, SHMEMSIZE) < 0) { warn("ftruncate() failed"); goto err_close; } shmem = mmap(NULL, SHMEMSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (shmem == (void *)-1) { warn("mmap() failed"); goto err_unmap; } printsem = shmem; first = shmem + sizeof(*printsem); /* initialize semaphore */ if ((sem_init(printsem, 1, 1)) < 0) { warn("sem_init() failed"); goto err_unmap; } if ((c = loop_per_msec()) < 0) { fprintf(stderr, "loop_per_msec() failed\n"); goto err_sem; } for (i = 0; i < nproc; i++) { pid[i] = fork(); if (pid[i] == -1) { warn("fork() failed\n"); for (j = 0; j < i; j++) if (kill(pid[j], SIGKILL) < 0) warn("kill() failed"); goto err_sem; } if (pid[i] == 0) test_job((void *)c); } if (sigemptyset(&sigset) < 0) { warn("sigemptyset() failed"); goto err_proc; } if (sigaddset(&sigset, SIGUSR2) < 0) { warn("sigaddset() failed"); goto err_proc; } if (sigprocmask(SIG_UNBLOCK, &sigset, NULL) < 0) { warn("sigprocmask() failed"); goto err_proc; } if (time(first) < 0) { warn("time() failed"); goto err_proc; } if ((kill(0, SIGUSR1)) == -1) { warn("kill() failed"); goto err_proc; } for (i = 0; i < nproc; i++) { if (wait(&status) < 0) { warn("wait() failed"); goto err_proc; } } cleanup_resources(); exit(return_code); err_proc: for (i = 0; i < nproc; i++) if (kill(pid[i], SIGKILL) < 0) if (errno != ESRCH) warn("kill() failed"); err_sem: if (sem_destroy(printsem) < 0) warn("sem_destroy() failed"); err_unmap: if (munmap(shmem, SHMEMSIZE) < 0) warn("munmap() failed"); err_close: if (close(fd) < 0) warn("close() failed"); exit(EXIT_FAILURE); } ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:44 ` Ingo Molnar @ 2007-04-13 23:58 ` William Lee Irwin III 0 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 23:58 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> Where it gets complex is when the behavior patterns vary, e.g. they're >> not entirely CPU-bound and their desired in-isolation CPU utilization >> varies, or when nice levels vary, or both vary. [...] On Sat, Apr 14, 2007 at 01:44:44AM +0200, Ingo Molnar wrote: > yes. I tested things like 'massive_intr.c' (attached, written by Satoru > Takeuchi) which starts N tasks which each work for 8msec then sleep > 1msec: [...] > another related test-utility is one i wrote: > http://people.redhat.com/mingo/scheduler-patches/ring-test.c > this is a ring of 100 tasks each doing work for 100 msecs and then > sleeping for 1 msec. I usually test this by also running a CPU hog in > parallel to it, and checking whether it gets ~50.0% of CPU time under > CFS. (it does) These are both tremendously useful. The code is also in rather good shape so only minimal modifications (for massive_intr.c I'm not even sure if any are needed at all) are needed to plug them into the test harness I'm aware of. I'll queue them both for me to adjust and send over to testers I don't want to burden with hacking on testcases I myself am asking them to add to their suites. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:21 ` William Lee Irwin III 2007-04-13 22:52 ` Ingo Molnar @ 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi ` (2 more replies) 1 sibling, 3 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-14 22:38 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, William Lee Irwin III wrote: > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > The CFS patch uses a completely different approach and implementation > > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > > that of RSDL/SD, which is a high standard to meet :-) Testing > > feedback is welcome to decide this one way or another. [ and, in any > > case, all of SD's logic could be added via a kernel/sched_sd.c module > > as well, if Con is interested in such an approach. ] > > CFS's design is quite radical: it does not use runqueues, it uses a > > time-ordered rbtree to build a 'timeline' of future task execution, > > and thus has no 'array switch' artifacts (by which both the vanilla > > scheduler and RSDL/SD are affected). > > A binomial heap would likely serve your purposes better than rbtrees. > It's faster to have the next item to dequeue at the root of the tree > structure rather than a leaf, for one. There are, of course, other > priority queue structures (e.g. van Emde Boas) able to exploit the > limited precision of the priority key for faster asymptotics, though > actual performance is an open question. Haven't looked at the scheduler code yet, but for a similar problem I use a time ring. The ring has Ns (2 power is better) slots (where tasks are queued - in my case they were som sort of timers), and it has a current base index (Ib), a current base time (Tb) and a time granularity (Tg). It also has a bitmap with bits telling you which slots contains queued tasks. An item (task) that has to be scheduled at time T, will be queued in the slot: S = Ib + min((T - Tb) / Tg, Ns - 1); Items with T longer than Ns*Tg will be scheduled in the relative last slot (chosing a proper Ns and Tg can minimize this). Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to suite to your needs. This is a simple bench between time-ring (TR) and CFS queueing: http://www.xmailserver.org/smart-queue.c In my box (Dual Opteron 252): davide@alien:~$ ./smart-queue -n 8 CFS = 142.21 cycles/loop TR = 72.33 cycles/loop davide@alien:~$ ./smart-queue -n 16 CFS = 188.74 cycles/loop TR = 83.79 cycles/loop davide@alien:~$ ./smart-queue -n 32 CFS = 221.36 cycles/loop TR = 75.93 cycles/loop davide@alien:~$ ./smart-queue -n 64 CFS = 242.89 cycles/loop TR = 81.29 cycles/loop - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi @ 2007-04-14 23:26 ` Davide Libenzi 2007-04-15 4:01 ` William Lee Irwin III 2007-04-15 23:09 ` Pavel Pisa 2 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-14 23:26 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, 14 Apr 2007, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the slot: > > S = Ib + min((T - Tb) / Tg, Ns - 1); ... mod Ns, of course ;) - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
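To make the structure described above concrete, here is a minimal sketch of such a time ring in C, with the mod-Ns correction applied. All names (tring, tring_enqueue, ...) are illustrative and do not come from any posted code; this is only the idea: enqueue hashes the due time into a slot in O(1), dequeue scans at most Ns slots, and items within one slot are deliberately left unordered (the Tg-granularity limitation discussed later in the thread).

#include <stddef.h>

#define NS_SLOTS 256				/* Ns: number of slots, power of two */

struct tring_item {
	struct tring_item *next;
	unsigned long long key;			/* T: absolute time the item is due */
};

struct tring {
	struct tring_item *slot[NS_SLOTS];	/* one unordered queue per slot */
	unsigned char bitmap[NS_SLOTS / 8];	/* which slots are non-empty */
	unsigned long long base_time;		/* Tb */
	unsigned long long granularity;		/* Tg */
	unsigned int base_idx;			/* Ib */
};

/* O(1): S = (Ib + min((T - Tb) / Tg, Ns - 1)) mod Ns */
static void tring_enqueue(struct tring *tr, struct tring_item *it)
{
	unsigned long long off = 0;
	unsigned int s;

	if (it->key > tr->base_time)
		off = (it->key - tr->base_time) / tr->granularity;
	if (off > NS_SLOTS - 1)
		off = NS_SLOTS - 1;		/* far-future items share the last slot */
	s = (tr->base_idx + (unsigned int)off) & (NS_SLOTS - 1);

	it->next = tr->slot[s];
	tr->slot[s] = it;
	tr->bitmap[s >> 3] |= 1 << (s & 7);
}

/* O(Ns) worst case: scan forward from Ib for the first occupied slot */
static struct tring_item *tring_dequeue(struct tring *tr)
{
	unsigned int i;

	for (i = 0; i < NS_SLOTS; i++) {
		unsigned int s = (tr->base_idx + i) & (NS_SLOTS - 1);

		if (tr->bitmap[s >> 3] & (1 << (s & 7))) {
			struct tring_item *it = tr->slot[s];

			tr->slot[s] = it->next;
			if (!tr->slot[s])
				tr->bitmap[s >> 3] &= ~(1 << (s & 7));
			/* advance Ib/Tb to the slot we drained from */
			tr->base_idx = s;
			tr->base_time += i * tr->granularity;
			return it;
		}
	}
	return NULL;
}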
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi @ 2007-04-15 4:01 ` William Lee Irwin III 2007-04-15 4:18 ` Davide Libenzi 2007-04-15 23:09 ` Pavel Pisa 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 4:01 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, William Lee Irwin III wrote: >> A binomial heap would likely serve your purposes better than rbtrees. [...] On Sat, Apr 14, 2007 at 03:38:04PM -0700, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the slot: > S = Ib + min((T - Tb) / Tg, Ns - 1); > Items with T longer than Ns*Tg will be scheduled in the relative last slot > (chosing a proper Ns and Tg can minimize this). > Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to > suite to your needs. I used a similar sort of queue in the virtual deadline scheduler I wrote in 2003 or thereabouts. CFS uses queue priorities with too high a precision to map directly to this (queue priorities are marked as "key" in the cfs code and should not be confused with task priorities). The elder virtual deadline scheduler used millisecond resolution and a rather different calculation for its equivalent of ->key, which explains how it coped with a limited priority space. The two basic attacks on such large priority spaces are the near future vs. far future subdivisions and subdividing the priority space into (most often regular) intervals. Subdividing the priority space into intervals is the most obvious; you simply use some O(lg(n)) priority queue as the bucket discipline in the "time ring," queue by the upper bits of the queue priority in the time ring, and by the lower bits in the O(lg(n)) bucket discipline. The near future vs. far future subdivision is maintaining the first N tasks in a low-constant-overhead structure like a sorted list and the remainder in some other sort of queue structure intended to handle large numbers of elements gracefully. The distribution of queue priorities strongly influences which of the methods is most potent, though it should be clear the methods can be used in combination. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 4:01 ` William Lee Irwin III @ 2007-04-15 4:18 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-15 4:18 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, 14 Apr 2007, William Lee Irwin III wrote: > The two basic attacks on such large priority spaces are the near future > vs. far future subdivisions and subdividing the priority space into > (most often regular) intervals. Subdividing the priority space into > intervals is the most obvious; you simply use some O(lg(n)) priority > queue as the bucket discipline in the "time ring," queue by the upper > bits of the queue priority in the time ring, and by the lower bits in > the O(lg(n)) bucket discipline. Sure. If you really need sub-millisecond precision, you can replace the bucket's list_head with an rb_root. It may be not necessary though for a cpu scheduler (still, didn't read Ingo's code yet). - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
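Purely to illustrate the combination being discussed, here is a rough kernel-context sketch against the generic rbtree API (rb_link_node/rb_insert_color/rb_first/rb_erase); it is not code from any posted patch, the slot count and shift are placeholders, and the Tb/Ib wrap-around bookkeeping from the time-ring post is omitted. The upper bits of the key select a ring slot in O(1); each slot keeps an rbtree ordered by the full-precision key, so dequeue is rb_first() of the first occupied slot.

#include <linux/rbtree.h>

#define RING_SLOTS	256	/* power of two */
#define SLOT_SHIFT	20	/* upper bits of the nanosecond key select the slot */

struct hybrid_queue {
	struct rb_root slot[RING_SLOTS];	/* each slot must start out as RB_ROOT */
	unsigned int base_idx;
};

struct hq_item {
	struct rb_node node;
	unsigned long long key;			/* full-precision key, e.g. CFS's ->key */
};

/* ring indexed by the upper key bits; rbtree orders the rest at full precision */
static void hq_enqueue(struct hybrid_queue *q, struct hq_item *it)
{
	unsigned int s = (unsigned int)(it->key >> SLOT_SHIFT) & (RING_SLOTS - 1);
	struct rb_node **link = &q->slot[s].rb_node, *parent = NULL;

	while (*link) {
		struct hq_item *entry;

		parent = *link;
		entry = rb_entry(parent, struct hq_item, node);
		if (it->key < entry->key)
			link = &parent->rb_left;
		else
			link = &parent->rb_right;
	}
	rb_link_node(&it->node, parent, link);
	rb_insert_color(&it->node, &q->slot[s]);
}

/* scan forward from the base slot; rb_first() of the first busy slot wins */
static struct hq_item *hq_dequeue(struct hybrid_queue *q)
{
	unsigned int i;

	for (i = 0; i < RING_SLOTS; i++) {
		unsigned int s = (q->base_idx + i) & (RING_SLOTS - 1);
		struct rb_node *first = rb_first(&q->slot[s]);

		if (first) {
			struct hq_item *it = rb_entry(first, struct hq_item, node);

			rb_erase(first, &q->slot[s]);
			q->base_idx = s;
			return it;
		}
	}
	return NULL;
}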
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi 2007-04-15 4:01 ` William Lee Irwin III @ 2007-04-15 23:09 ` Pavel Pisa 2007-04-16 5:47 ` Davide Libenzi 2 siblings, 1 reply; 304+ messages in thread From: Pavel Pisa @ 2007-04-15 23:09 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007 00:38, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the > slot: > > S = Ib + min((T - Tb) / Tg, Ns - 1); > > Items with T longer than Ns*Tg will be scheduled in the relative last slot > (chosing a proper Ns and Tg can minimize this). > Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to > suite to your needs. > This is a simple bench between time-ring (TR) and CFS queueing: > > http://www.xmailserver.org/smart-queue.c > > In my box (Dual Opteron 252): > > davide@alien:~$ ./smart-queue -n 8 > CFS = 142.21 cycles/loop > TR = 72.33 cycles/loop > davide@alien:~$ ./smart-queue -n 16 > CFS = 188.74 cycles/loop > TR = 83.79 cycles/loop > davide@alien:~$ ./smart-queue -n 32 > CFS = 221.36 cycles/loop > TR = 75.93 cycles/loop > davide@alien:~$ ./smart-queue -n 64 > CFS = 242.89 cycles/loop > TR = 81.29 cycles/loop Hello all, I cannot help myself to not report results with GAVL tree algorithm there as an another race competitor. I believe, that it is better solution for large priority queues than RB-tree and even heap trees. It could be disputable if the scheduler needs such scalability on the other hand. The AVL heritage guarantees lower height which results in shorter search times which could be profitable for other uses in kernel. GAVL algorithm is AVL tree based, so it does not suffer from "infinite" priorities granularity there as TR does. It allows use for generalized case where tree is not fully balanced. This allows to cut the first item withour rebalancing. This leads to the degradation of the tree by one more level (than non degraded AVL gives) in maximum, which is still considerably better than RB-trees maximum. http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c The description behind the code is there http://cmp.felk.cvut.cz/~pisa/ulan/gavl.pdf The code is part of much more covering uLUt library http://cmp.felk.cvut.cz/~pisa/ulan/ulut.pdf http://sourceforge.net/project/showfiles.php?group_id=118937&package_id=130840 I have included all required GAVL code directly into smart-queue-v-gavl.c to provide it for easy testing. There are tests run on my little dated computer - Duron 600 MHz. Test are run twice to suppress run order influence. 
./smart-queue-v-gavl -n 1 -l 2000000 gavl_cfs = 55.66 cycles/loop CFS = 88.33 cycles/loop TR = 141.78 cycles/loop CFS = 90.45 cycles/loop gavl_cfs = 55.38 cycles/loop ./smart-queue-v-gavl -n 2 -l 2000000 gavl_cfs = 82.85 cycles/loop CFS = 104.18 cycles/loop TR = 145.21 cycles/loop CFS = 102.74 cycles/loop gavl_cfs = 82.05 cycles/loop ./smart-queue-v-gavl -n 4 -l 2000000 gavl_cfs = 137.45 cycles/loop CFS = 156.47 cycles/loop TR = 142.00 cycles/loop CFS = 152.65 cycles/loop gavl_cfs = 139.38 cycles/loop ./smart-queue-v-gavl -n 10 -l 2000000 gavl_cfs = 229.22 cycles/loop (WORSE) CFS = 206.26 cycles/loop TR = 140.81 cycles/loop CFS = 208.29 cycles/loop gavl_cfs = 223.62 cycles/loop (WORSE) ./smart-queue-v-gavl -n 100 -l 2000000 gavl_cfs = 257.66 cycles/loop CFS = 329.68 cycles/loop TR = 142.20 cycles/loop CFS = 319.34 cycles/loop gavl_cfs = 260.02 cycles/loop ./smart-queue-v-gavl -n 1000 -l 2000000 gavl_cfs = 258.41 cycles/loop CFS = 393.04 cycles/loop TR = 134.76 cycles/loop CFS = 392.20 cycles/loop gavl_cfs = 260.93 cycles/loop ./smart-queue-v-gavl -n 10000 -l 2000000 gavl_cfs = 259.45 cycles/loop CFS = 605.89 cycles/loop TR = 196.69 cycles/loop CFS = 622.60 cycles/loop gavl_cfs = 262.72 cycles/loop ./smart-queue-v-gavl -n 100000 -l 2000000 gavl_cfs = 258.21 cycles/loop CFS = 845.62 cycles/loop TR = 315.37 cycles/loop CFS = 860.21 cycles/loop gavl_cfs = 258.94 cycles/loop The GAVL code has not been tuned by any "likely"/"unlikely" constructs. It brings even some other overhead from it generic design which is not necessary for this use - it keeps permanently even pointer to the last element, ensures, that the insertion order is preserved for same key values etc. But it still proves much better scalability then kernel used RB-tree code. On the other hand, it does not encode color/height in one of the pointers and requires additional field for height. May it be, that difference is due some bug in my testing, then I would be interrested in correction. The test case is oversimplified probably. I have already run more different tests against GAVL code in the past to compare it with different tree and queues implementations and I have not found case with real performance degradation. On the other hand, there are cases for small items counts where GAVL is sometimes a little worse than others (array based heap-tree for example). The GAVL code itself is used in more opensource and commercial projects and we have noticed no problems after one small fix at the time of the first release in 2004. Best wishes Pavel Pisa e-mail: pisa@cmp.felk.cvut.cz www: http://cmp.felk.cvut.cz/~pisa work: http://www.pikron.com ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:09 ` Pavel Pisa @ 2007-04-16 5:47 ` Davide Libenzi 2007-04-17 0:37 ` Pavel Pisa 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-16 5:47 UTC (permalink / raw) To: Pavel Pisa Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, 16 Apr 2007, Pavel Pisa wrote: > I cannot help myself to not report results with GAVL > tree algorithm there as an another race competitor. > I believe, that it is better solution for large priority > queues than RB-tree and even heap trees. It could be > disputable if the scheduler needs such scalability on > the other hand. The AVL heritage guarantees lower height > which results in shorter search times which could > be profitable for other uses in kernel. > > GAVL algorithm is AVL tree based, so it does not suffer from > "infinite" priorities granularity there as TR does. It allows > use for generalized case where tree is not fully balanced. > This allows to cut the first item withour rebalancing. > This leads to the degradation of the tree by one more level > (than non degraded AVL gives) in maximum, which is still > considerably better than RB-trees maximum. > > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c Here are the results on my Opteron 252: Testing N=1 gavl_cfs = 187.20 cycles/loop CFS = 194.16 cycles/loop TR = 314.87 cycles/loop CFS = 194.15 cycles/loop gavl_cfs = 187.15 cycles/loop Testing N=2 gavl_cfs = 268.94 cycles/loop CFS = 305.53 cycles/loop TR = 313.78 cycles/loop CFS = 289.58 cycles/loop gavl_cfs = 266.02 cycles/loop Testing N=4 gavl_cfs = 452.13 cycles/loop CFS = 518.81 cycles/loop TR = 311.54 cycles/loop CFS = 516.23 cycles/loop gavl_cfs = 450.73 cycles/loop Testing N=8 gavl_cfs = 609.29 cycles/loop CFS = 644.65 cycles/loop TR = 308.11 cycles/loop CFS = 667.01 cycles/loop gavl_cfs = 592.89 cycles/loop Testing N=16 gavl_cfs = 686.30 cycles/loop CFS = 807.41 cycles/loop TR = 317.20 cycles/loop CFS = 810.24 cycles/loop gavl_cfs = 688.42 cycles/loop Testing N=32 gavl_cfs = 756.57 cycles/loop CFS = 852.14 cycles/loop TR = 301.22 cycles/loop CFS = 876.12 cycles/loop gavl_cfs = 758.46 cycles/loop Testing N=64 gavl_cfs = 831.97 cycles/loop CFS = 997.16 cycles/loop TR = 304.74 cycles/loop CFS = 1003.26 cycles/loop gavl_cfs = 832.83 cycles/loop Testing N=128 gavl_cfs = 897.33 cycles/loop CFS = 1030.36 cycles/loop TR = 295.65 cycles/loop CFS = 1035.29 cycles/loop gavl_cfs = 892.51 cycles/loop Testing N=256 gavl_cfs = 963.17 cycles/loop CFS = 1146.04 cycles/loop TR = 295.35 cycles/loop CFS = 1162.04 cycles/loop gavl_cfs = 966.31 cycles/loop Testing N=512 gavl_cfs = 1029.82 cycles/loop CFS = 1218.34 cycles/loop TR = 288.78 cycles/loop CFS = 1257.97 cycles/loop gavl_cfs = 1029.83 cycles/loop Testing N=1024 gavl_cfs = 1091.76 cycles/loop CFS = 1318.47 cycles/loop TR = 287.74 cycles/loop CFS = 1311.72 cycles/loop gavl_cfs = 1093.29 cycles/loop Testing N=2048 gavl_cfs = 1153.03 cycles/loop CFS = 1398.84 cycles/loop TR = 286.75 cycles/loop CFS = 1438.68 cycles/loop gavl_cfs = 1149.97 cycles/loop There seem to be some difference from your numbers. This is with: gcc version 4.1.2 and -O2. But then and Opteron can behave quite differentyl than a Duron on a bench like this ;) - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:47 ` Davide Libenzi @ 2007-04-17 0:37 ` Pavel Pisa 0 siblings, 0 replies; 304+ messages in thread From: Pavel Pisa @ 2007-04-17 0:37 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 07:47, Davide Libenzi wrote: > On Mon, 16 Apr 2007, Pavel Pisa wrote: > > I cannot help myself to not report results with GAVL > > tree algorithm there as an another race competitor. > > I believe, that it is better solution for large priority > > queues than RB-tree and even heap trees. It could be > > disputable if the scheduler needs such scalability on > > the other hand. The AVL heritage guarantees lower height > > which results in shorter search times which could > > be profitable for other uses in kernel. > > > > GAVL algorithm is AVL tree based, so it does not suffer from > > "infinite" priorities granularity there as TR does. It allows > > use for generalized case where tree is not fully balanced. > > This allows to cut the first item withour rebalancing. > > This leads to the degradation of the tree by one more level > > (than non degraded AVL gives) in maximum, which is still > > considerably better than RB-trees maximum. > > > > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c > > Here are the results on my Opteron 252: > > Testing N=1 > gavl_cfs = 187.20 cycles/loop > CFS = 194.16 cycles/loop > TR = 314.87 cycles/loop > CFS = 194.15 cycles/loop > gavl_cfs = 187.15 cycles/loop > > Testing N=2 > gavl_cfs = 268.94 cycles/loop > CFS = 305.53 cycles/loop > TR = 313.78 cycles/loop > CFS = 289.58 cycles/loop > gavl_cfs = 266.02 cycles/loop > > Testing N=4 > gavl_cfs = 452.13 cycles/loop > CFS = 518.81 cycles/loop > TR = 311.54 cycles/loop > CFS = 516.23 cycles/loop > gavl_cfs = 450.73 cycles/loop > > Testing N=8 > gavl_cfs = 609.29 cycles/loop > CFS = 644.65 cycles/loop > TR = 308.11 cycles/loop > CFS = 667.01 cycles/loop > gavl_cfs = 592.89 cycles/loop > > Testing N=16 > gavl_cfs = 686.30 cycles/loop > CFS = 807.41 cycles/loop > TR = 317.20 cycles/loop > CFS = 810.24 cycles/loop > gavl_cfs = 688.42 cycles/loop > > Testing N=32 > gavl_cfs = 756.57 cycles/loop > CFS = 852.14 cycles/loop > TR = 301.22 cycles/loop > CFS = 876.12 cycles/loop > gavl_cfs = 758.46 cycles/loop > > Testing N=64 > gavl_cfs = 831.97 cycles/loop > CFS = 997.16 cycles/loop > TR = 304.74 cycles/loop > CFS = 1003.26 cycles/loop > gavl_cfs = 832.83 cycles/loop > > Testing N=128 > gavl_cfs = 897.33 cycles/loop > CFS = 1030.36 cycles/loop > TR = 295.65 cycles/loop > CFS = 1035.29 cycles/loop > gavl_cfs = 892.51 cycles/loop > > Testing N=256 > gavl_cfs = 963.17 cycles/loop > CFS = 1146.04 cycles/loop > TR = 295.35 cycles/loop > CFS = 1162.04 cycles/loop > gavl_cfs = 966.31 cycles/loop > > Testing N=512 > gavl_cfs = 1029.82 cycles/loop > CFS = 1218.34 cycles/loop > TR = 288.78 cycles/loop > CFS = 1257.97 cycles/loop > gavl_cfs = 1029.83 cycles/loop > > Testing N=1024 > gavl_cfs = 1091.76 cycles/loop > CFS = 1318.47 cycles/loop > TR = 287.74 cycles/loop > CFS = 1311.72 cycles/loop > gavl_cfs = 1093.29 cycles/loop > > Testing N=2048 > gavl_cfs = 1153.03 cycles/loop > CFS = 1398.84 cycles/loop > TR = 286.75 cycles/loop > CFS = 1438.68 cycles/loop > gavl_cfs = 1149.97 cycles/loop > > > There seem to be some difference from your numbers. 
This is with: > > gcc version 4.1.2 > > and -O2. But then and Opteron can behave quite differentyl than a Duron on > a bench like this ;) Thanks for testing, but yours numbers are more correct than my first report. My numbers seemed to be over-optimistic even to me, In the fact I have been surprised that difference is so high. But I have tested bad version of code without GAVL_FAFTER option set. The code pushed to the web page has been the correct one. I have not get to look into case until now because I have busy day to prepare some Linux based labs at university. Without GAVL_FAFTER option, insert operation does fail if item with same key is already inserted (intended feature of the code) and as result of that, not all items have been inserted in the test. The meaning of GAVL_FAFTER is find/insert after all items with the same key value. Default behavior is operate on unique keys in tree and reject duplicates. My results are even worse for GAVL than yours. It is possible to try tweak code and optimize it more (likely/unlikely/do not keep last ptr etc) for this actual usage. May it be, that I try this exercise, but I do not expect that the result after tuning would be so much better, that it would outweight some redesign work. I could see some advantages of AVL still, but it has its own drawbacks with need of separate height field and little worse delete in the middle timing. So excuse me for disturbance. I have been only curious how GAVL code would behave in the comparison of other algorithms and I did not kept my premature enthusiasm under the lock. Best wishes Pavel Pisa ./smart-queue-v-gavl -n 4 gavl_cfs = 279.02 cycles/loop CFS = 200.87 cycles/loop TR = 229.55 cycles/loop CFS = 201.23 cycles/loop gavl_cfs = 276.08 cycles/loop ./smart-queue-v-gavl -n 8 gavl_cfs = 310.92 cycles/loop CFS = 288.45 cycles/loop TR = 192.46 cycles/loop CFS = 284.94 cycles/loop gavl_cfs = 357.02 cycles/loop ./smart-queue-v-gavl -n 16 gavl_cfs = 350.45 cycles/loop CFS = 354.01 cycles/loop TR = 189.79 cycles/loop CFS = 320.08 cycles/loop gavl_cfs = 387.43 cycles/loop ./smart-queue-v-gavl -n 32 gavl_cfs = 419.23 cycles/loop CFS = 406.88 cycles/loop TR = 198.10 cycles/loop CFS = 398.15 cycles/loop gavl_cfs = 412.57 cycles/loop ./smart-queue-v-gavl -n 64 gavl_cfs = 442.81 cycles/loop CFS = 429.62 cycles/loop TR = 235.40 cycles/loop CFS = 389.54 cycles/loop gavl_cfs = 433.56 cycles/loop ./smart-queue-v-gavl -n 128 gavl_cfs = 358.20 cycles/loop CFS = 605.49 cycles/loop TR = 236.01 cycles/loop CFS = 458.50 cycles/loop gavl_cfs = 455.05 cycles/loop ./smart-queue-v-gavl -n 256 gavl_cfs = 529.72 cycles/loop CFS = 530.98 cycles/loop TR = 193.75 cycles/loop CFS = 533.75 cycles/loop gavl_cfs = 471.47 cycles/loop ./smart-queue-v-gavl -n 512 gavl_cfs = 525.80 cycles/loop CFS = 550.63 cycles/loop TR = 188.71 cycles/loop CFS = 549.81 cycles/loop gavl_cfs = 494.73 cycles/loop ./smart-queue-v-gavl -n 1024 gavl_cfs = 544.91 cycles/loop CFS = 561.68 cycles/loop TR = 230.97 cycles/loop CFS = 522.68 cycles/loop gavl_cfs = 542.40 cycles/loop ./smart-queue-v-gavl -n 2048 gavl_cfs = 567.46 cycles/loop CFS = 581.85 cycles/loop TR = 229.69 cycles/loop CFS = 585.41 cycles/loop gavl_cfs = 563.22 cycles/loop ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (4 preceding siblings ...) 2007-04-13 22:21 ` William Lee Irwin III @ 2007-04-13 22:31 ` Willy Tarreau 2007-04-13 23:18 ` Ingo Molnar 2007-04-13 23:07 ` Gabriel C ` (7 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-13 22:31 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Hi Ingo, On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] (...) > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). I have a high confidence this will work better : I've been using time-ordered trees in userland projects for several years, and never found anything better. To be honnest, I never understood the concept behind the array switch, but as I never felt brave enough to hack something in this kernel area, I simply preferred to shut up (not enough knowledge and not enough time). However, I have been using a very fast struct timeval-ordered RADIX tree. I found generic rbtree code to generally be slower, certainly because of the call to a function with arguments on every node. Both trees are O(log(n)), the rbtree being balanced and the radix tree being unbalanced. If you're interested, I can try to see how that would fit (but not this week-end). Also, I had spent much time in the past doing paper work on how to improve fairness between interactive tasks and batch tasks. I came up with the conclusion that for perfectness, tasks should not be ordered by their expected wakeup time, but by their expected completion time, which automatically takes account of their allocated and used timeslice. It would also allow both types of workloads to share equal CPU time with better responsiveness for the most interactive one through the reallocation of a "credit" for the tasks which have not consumed all of their timeslices. I remember we had discussed this with Mike about one year ago when he fixed lots of problems in mainline scheduler. The downside is that I never found how to make this algo fit in O(log(n)). I always ended in something like O(n.log(n)) IIRC. But maybe this is overkill for real life anyway. Given that a basic two arrays switch (which I never understood) was sufficient for many people, probably that a basic tree will be an order of magnitude better. > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns > > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. I find this useful, but to be fair with Mike and Con, they both have proposed similar tuning knobs in the past and you said you did not want to add that complexity for admins. People can sometimes be demotivated by seeing their proposals finally used by people who first rejected them. 
And since both Mike and Con both have done a wonderful job in that area, we need their experience and continued active participation more than ever. > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. I'm very pleased to read this. Because as I have already said it, my major concern with 2.6 was the stock scheduler. Recently, RSDL fixed most of the basic problems for me to the point that I switched the default lilo entry on my notebook to 2.6 ! I hope that whatever the next scheduler will be, we'll definitely get rid of any heuristics. Heuristics are good in 95% of situations and extremely bad in the remaining 5%. I prefer something reasonably good in 100% of situations. > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. > > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) > > - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler > way than the vanilla scheduler does. It uses 100 runqueues (for all > 100 RT priority levels, instead of 140 in the vanilla scheduler) > and it needs no expired array. > > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. Will this have any impact on NUMA/HT/multi-core/etc... ? > the core scheduler got smaller by more than 700 lines: Well done ! Cheers, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:31 ` Willy Tarreau @ 2007-04-13 23:18 ` Ingo Molnar 2007-04-14 18:48 ` Bill Huey 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 23:18 UTC (permalink / raw) To: Willy Tarreau Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > > central tunable: > > > > /proc/sys/kernel/sched_granularity_ns > > > > which can be used to tune the scheduler from 'desktop' (low > > latencies) to 'server' (good batching) workloads. It defaults to a > > setting suitable for desktop workloads. SCHED_BATCH is handled by the > > CFS scheduler module too. > > I find this useful, but to be fair with Mike and Con, they both have > proposed similar tuning knobs in the past and you said you did not > want to add that complexity for admins. [...] yeah. [ Note that what i opposed in the past was mostly the 'export all the zillion of sched.c knobs to /sys and let people mess with them' kind of patches which did exist and still exist. A _single_ knob, which represents basically the totality of parameters within sched_fair.c is less of a problem. I dont think i ever objected to this knob within staircase/SD. (If i did then i was dead wrong.) ] > [...] People can sometimes be demotivated by seeing their proposals > finally used by people who first rejected them. And since both Mike > and Con both have done a wonderful job in that area, we need their > experience and continued active participation more than ever. very much so! Both Con and Mike has contributed regularly to upstream sched.c: $ git-log kernel/sched.c | grep 'by: Con Kolivas' 1 | wc -l 19 $ git-log kernel/sched.c | grep 'by: Mike' | wc -l 6 and i'd very much like both counts to increase steadily in the future too :) > > - reworked/sanitized SMP load-balancing: the runqueue-walking > > assumptions are gone from the load-balancing code now, and > > iterators of the scheduling modules are used. The balancing code > > got quite a bit simpler as a result. > > Will this have any impact on NUMA/HT/multi-core/etc... ? it will inevitably have some sort of effect - and if it's negative, i'll try to fix it. I got rid of the explicit cache-hot tracking code and replaced it with a more natural pure 'pick the next-to-run task first, that is likely the most cache-cold one' logic. That just derives naturally from the rbtree approach. > > the core scheduler got smaller by more than 700 lines: > > Well done ! thanks :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:18 ` Ingo Molnar
@ 2007-04-14 18:48 ` Bill Huey
  0 siblings, 0 replies; 304+ messages in thread
From: Bill Huey @ 2007-04-14 18:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, linux-kernel, Linus Torvalds, Andrew Morton,
    Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
    Thomas Gleixner, Bill Huey (hui)

On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote:
> very much so! Both Con and Mike has contributed regularly to upstream
> sched.c:

The problem here is that Con can get demotivated (and rather upset) when
an idea gets proposed, like SchedPlug, only to have people be hostile to
it and then suddenly turn around and adopt this idea. It gives the
impression that you, in this specific case, were more interested in
controlling the situation and the track of development instead of
actually being inclusive of the development process, with discussion and
serious consideration, etc... This is how the Linux community can be
perceived as elitist.

The old guard would serve the community better if people were more
mindful of and sensitive to developer issues. There was a particular
speech at OLS 2006 that turned me off because it pretty much pandered to
the "old guard's" needs over those of newer developers. Since I'm a
somewhat established engineer in -rt (being the only other person that
mapped the lock hierarchy out for full preemptibility), I had the
confidence to pretty much ignore it, while previously this could have
really upset me and been highly discouraging to a relatively new
developer.

As Linux gets larger and larger this is going to be an increasing
problem as folks come into the community with new ideas, and the
community will need to change if it intends to integrate these folks.
IMO, a lot of these flame wars wouldn't need to exist if folks listened
to each other better and permitted co-ownership of code like the
scheduler, since it needs multiple hands in it to adapt to new loads and
situations, etc...

I'm saying this nicely now since I can be nasty about it.

bill

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (5 preceding siblings ...) 2007-04-13 22:31 ` Willy Tarreau @ 2007-04-13 23:07 ` Gabriel C 2007-04-13 23:25 ` Ingo Molnar 2007-04-14 2:04 ` Nick Piggin ` (6 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Gabriel C @ 2007-04-13 23:07 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > [...] > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, > Compile error here. ... CC kernel/sched.o kernel/sched.c: In function '__rq_clock': kernel/sched.c:219: error: 'struct rq' has no member named 'cpu' kernel/sched.c:219: warning: type defaults to 'int' in declaration of '__ret_warn_once' kernel/sched.c:219: error: 'struct rq' has no member named 'cpu' kernel/sched.c: In function 'rq_clock': kernel/sched.c:230: error: 'struct rq' has no member named 'cpu' kernel/sched.c: In function 'sched_init': kernel/sched.c:6013: warning: unused variable 'j' make[1]: *** [kernel/sched.o] Error 1 make: *** [kernel] Error 2 ==> ERROR: Build Failed. Aborting... ... There the config : http://frugalware.org/~crazy/other/kernel/config > Ingo > - > > Regards, Gabriel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:07 ` Gabriel C @ 2007-04-13 23:25 ` Ingo Molnar 2007-04-13 23:39 ` Gabriel C 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 23:25 UTC (permalink / raw) To: Gabriel C Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Gabriel C <nix.or.die@googlemail.com> wrote: > > as usual, any sort of feedback, bugreports, fixes and suggestions > > are more than welcome, > > Compile error here. ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also updated the full patch at the cfs-scheduler URL) Ingo -----------------------> From: Ingo Molnar <mingo@elte.hu> Subject: [cfs] fix !CONFIG_SMP build fix the !CONFIG_SMP build error reported by Gabriel C Signed-off-by: Ingo Molnar <mingo@elte.hu> Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -257,16 +257,6 @@ static inline unsigned long long __rq_cl return rq->rq_clock; } -static inline unsigned long long rq_clock(struct rq *rq) -{ - int this_cpu = smp_processor_id(); - - if (this_cpu == rq->cpu) - return __rq_clock(rq); - - return rq->rq_clock; -} - static inline int cpu_of(struct rq *rq) { #ifdef CONFIG_SMP @@ -276,6 +266,16 @@ static inline int cpu_of(struct rq *rq) #endif } +static inline unsigned long long rq_clock(struct rq *rq) +{ + int this_cpu = smp_processor_id(); + + if (this_cpu == cpu_of(rq)) + return __rq_clock(rq); + + return rq->rq_clock; +} + /* * The domain tree (rq->sd) is protected by RCU's quiescent state transition. * See detach_destroy_domains: synchronize_sched for details. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:25 ` Ingo Molnar @ 2007-04-13 23:39 ` Gabriel C 0 siblings, 0 replies; 304+ messages in thread From: Gabriel C @ 2007-04-13 23:39 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Gabriel C <nix.or.die@googlemail.com> wrote: > > >>> as usual, any sort of feedback, bugreports, fixes and suggestions >>> are more than welcome, >>> >> Compile error here. >> > > ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also > updated the full patch at the cfs-scheduler URL) > Yes it does , thx :) , only the " warning: unused variable 'j' " left. > Ingo > Regards, Gabriel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (6 preceding siblings ...) 2007-04-13 23:07 ` Gabriel C @ 2007-04-14 2:04 ` Nick Piggin 2007-04-14 6:32 ` Ingo Molnar 2007-04-14 15:09 ` S.Çağlar Onur ` (5 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-14 2:04 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch Always good to see another contender ;) > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] I don't know why there is such noise about fairness right now... I thought fairness was one of the fundamental properties of a good CPU scheduler, and my scheduler definitely always aims for that above most other things. Why not just keep SCHED_OTHER? > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. Don't really like this, but anyway... > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. > > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! I guess the 2.4 and earlier scheduler kind of did that as well. > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] Comment about the code: shouldn't you be requeueing the task in the rbtree wherever you change wait_runtime? eg. task_new_fair? (I've only had a quick look so far). > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). > > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. Well, I guess there is still some mechanism to decide which process is most eligible to run? 
;) Considering that question has no "right" answer for SCHED_OTHER scheduling, I guess you could say it has heuristics. But granted they are obviously fairly elegant in contrast to the O(1) scheduler ;) > There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns Suppose you have 2 CPU hogs running, is sched_granularity_ns the frequency at which they will context switch? > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) What is better behaviour for sched_yield? Thanks, Nick ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 2:04 ` Nick Piggin @ 2007-04-14 6:32 ` Ingo Molnar 2007-04-14 6:43 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 6:32 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > The CFS patch uses a completely different approach and implementation > > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > > that of RSDL/SD, which is a high standard to meet :-) Testing > > feedback is welcome to decide this one way or another. [ and, in any > > case, all of SD's logic could be added via a kernel/sched_sd.c module > > as well, if Con is interested in such an approach. ] > > Comment about the code: shouldn't you be requeueing the task in the > rbtree wherever you change wait_runtime? eg. task_new_fair? [...] yes: the task's position within the rbtree is updated every time wherever wait_runtime is change. task_new_fair is the method during new task creation, but indeed i forgot to requeue the parent. I've fixed this in my tree (see the delta patch below) - thanks! Ingo -----------> From: Ingo Molnar <mingo@elte.hu> Subject: [cfs] fix parent's rbtree position Nick noticed that upon fork we change parent->wait_runtime but we do not requeue it within the rbtree. Signed-off-by: Ingo Molnar <mingo@elte.hu> Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -524,6 +524,8 @@ static void task_new_fair(struct rq *rq, p->wait_runtime = parent->wait_runtime/2; parent->wait_runtime /= 2; + requeue_task_fair(rq, parent); + /* * For the first timeslice we allow child threads * to move their parent-inherited fairness back ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 6:32 ` Ingo Molnar @ 2007-04-14 6:43 ` Ingo Molnar 2007-04-14 8:08 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 6:43 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > Nick noticed that upon fork we change parent->wait_runtime but we do > not requeue it within the rbtree. this fix is not complete - because the child runqueue is locked here, not the parent's. I've fixed this properly in my tree and have uploaded a new sched-modular+cfs.patch. (the effects of the original bug are mostly harmless, the rbtree position gets corrected the first time the parent reschedules. The fix might improve heavy forker handling.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 6:43 ` Ingo Molnar @ 2007-04-14 8:08 ` Willy Tarreau 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:36 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 8:08 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@elte.hu> wrote: > > > Nick noticed that upon fork we change parent->wait_runtime but we do > > not requeue it within the rbtree. > > this fix is not complete - because the child runqueue is locked here, > not the parent's. I've fixed this properly in my tree and have uploaded > a new sched-modular+cfs.patch. (the effects of the original bug are > mostly harmless, the rbtree position gets corrected the first time the > parent reschedules. The fix might improve heavy forker handling.) It looks like it did not reach your public dir yet. BTW, I've given it a try. It seems pretty usable. I have also tried the usual meaningless "glxgears" test with 12 of them at the same time, and they rotate very smoothly, there is absolutely no pause in any of them. But they don't all run at same speed, and top reports their CPU load varying from 3.4 to 10.8%, with what looks like more CPU is assigned to the first processes, and less CPU for the last ones. But this is just a rough observation on a stupid test, I would not call that one scientific in any way (and X has its share in the test too). I'll perform other tests when I can rebuild with your fixed patch. Cheers, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:08 ` Willy Tarreau @ 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 19:48 ` William Lee Irwin III 2007-04-14 10:36 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 8:36 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 10:08:34AM +0200, Willy Tarreau wrote: > On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote: > > > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > > Nick noticed that upon fork we change parent->wait_runtime but we do > > > not requeue it within the rbtree. > > > > this fix is not complete - because the child runqueue is locked here, > > not the parent's. I've fixed this properly in my tree and have uploaded > > a new sched-modular+cfs.patch. (the effects of the original bug are > > mostly harmless, the rbtree position gets corrected the first time the > > parent reschedules. The fix might improve heavy forker handling.) > > It looks like it did not reach your public dir yet. > > BTW, I've given it a try. It seems pretty usable. I have also tried > the usual meaningless "glxgears" test with 12 of them at the same time, > and they rotate very smoothly, there is absolutely no pause in any of > them. But they don't all run at same speed, and top reports their CPU > load varying from 3.4 to 10.8%, with what looks like more CPU is > assigned to the first processes, and less CPU for the last ones. But > this is just a rough observation on a stupid test, I would not call > that one scientific in any way (and X has its share in the test too). Follow-up: I think this is mostly X-related. I've started 100 scheddos, and all get the same CPU percentage. Interestingly, mpg123 in parallel does never skip at all because it needs quite less than 1% CPU and gets its fair share at a load of 112. Xterms are slow to respond to typing with the 12 gears and 100 scheddos, and expectedly it was X which was starving. renicing it to -5 restores normal feeling with very slow but smooth gear rotations. Leaving X niced at 0 and killing the gears also restores normal behaviour. All in all, it seems logical that processes which serve many others become a bottleneck for them. Forking becomes very slow above a load of 100 it seems. Sometimes, the shell takes 2 or 3 seconds to return to prompt after I run "scheddos &" Those are very promising results, I nearly observe the same responsiveness as I had on a solaris 10 with 10k running processes on a bigger machine. I would be curious what a mysql test result would look like now. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:36 ` Willy Tarreau @ 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 15:17 ` Mark Lord 2007-04-14 19:48 ` William Lee Irwin III 1 sibling, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 10:53 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Forking becomes very slow above a load of 100 it seems. Sometimes, the > shell takes 2 or 3 seconds to return to prompt after I run "scheddos > &" this might be changed/impacted by the parent-requeue fix that is in the updated (for real, promise! ;) patch. Right now on CFS a forking parent shares its own run stats with the child 50%/50%. This means that heavy forkers are indeed penalized. Another logical choice would be 100%/0%: a child has to earn its own right. i kept the 50%/50% rule from the old scheduler, but maybe it's a more pristine (and smaller/faster) approach to just not give new children any stats history to begin with. I've implemented an add-on patch that implements this, you can find it at: http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch > Those are very promising results, I nearly observe the same > responsiveness as I had on a solaris 10 with 10k running processes on > a bigger machine. cool and thanks for the feedback! (Btw., as another test you could also try to renice "scheddos" to +19. While that does not push the scheduler nearly as hard as nice 0, it is perhaps more indicative of how a truly abusive many-tasks workload would be run in practice.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
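To make the two fork policies concrete: the difference is confined to how task_new_fair() seeds the child's wait_runtime. A simplified kernel-context sketch follows; the 50%/50% lines are taken from the fix earlier in this thread, while the 100%/0% variant is only a guess at what sched-fair-fork.patch boils down to (locking and the surrounding requeue details are omitted, and the exact patch contents are an assumption here):

	/* 50%/50%: the child inherits half of the parent's accumulated credit */
	p->wait_runtime = parent->wait_runtime/2;
	parent->wait_runtime /= 2;
	requeue_task_fair(rq, parent);	/* parent's key changed: fix its rbtree position */

	/* 100%/0% (sched-fair-fork.patch, presumably): the child starts with no
	 * inherited history, so a heavy forker cannot spread its credit around */
	p->wait_runtime = 0;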
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 10:53 ` Ingo Molnar @ 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau ` (2 more replies) 2007-04-14 15:17 ` Mark Lord 1 sibling, 3 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 13:01 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:53:39PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Forking becomes very slow above a load of 100 it seems. Sometimes, the > > shell takes 2 or 3 seconds to return to prompt after I run "scheddos > > &" > > this might be changed/impacted by the parent-requeue fix that is in the > updated (for real, promise! ;) patch. Right now on CFS a forking parent > shares its own run stats with the child 50%/50%. This means that heavy > forkers are indeed penalized. Another logical choice would be 100%/0%: a > child has to earn its own right. > > i kept the 50%/50% rule from the old scheduler, but maybe it's a more > pristine (and smaller/faster) approach to just not give new children any > stats history to begin with. I've implemented an add-on patch that > implements this, you can find it at: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch Not tried yet, it already looks better with the update and sched-fair-hog. Now xterm open "instantly" even with 1000 running processes. > > Those are very promising results, I nearly observe the same > > responsiveness as I had on a solaris 10 with 10k running processes on > > a bigger machine. > > cool and thanks for the feedback! (Btw., as another test you could also > try to renice "scheddos" to +19. While that does not push the scheduler > nearly as hard as nice 0, it is perhaps more indicative of how a truly > abusive many-tasks workload would be run in practice.) Good idea. The machine I'm typing from now has 1000 scheddos running at +19, and 12 gears at nice 0. 
Top keeps reporting different cpu usages for all gears, but I'm pretty sure that it's a top artifact now because the cumulated times are roughly identical : 14:33:13 up 13 min, 7 users, load average: 900.30, 443.75, 177.70 1088 processes: 80 sleeping, 1008 running, 0 zombie, 0 stopped CPU0 states: 56.0% user 43.0% system 23.0% nice 0.0% iowait 0.0% idle CPU1 states: 94.0% user 5.0% system 0.0% nice 0.0% iowait 0.0% idle Mem: 1034764k av, 223788k used, 810976k free, 0k shrd, 7192k buff 104400k active, 51904k inactive Swap: 497972k av, 0k used, 497972k free 68020k cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 1325 root 20 0 69240 9400 3740 R 27.6 0.9 4:46 1 X 1412 willy 20 0 6284 2552 1740 R 14.2 0.2 1:09 1 glxgears 1419 willy 20 0 6256 2384 1612 R 10.7 0.2 1:09 1 glxgears 1409 willy 20 0 2824 1940 788 R 8.9 0.1 0:25 1 top 1414 willy 20 0 6280 2544 1728 S 8.9 0.2 1:08 0 glxgears 1415 willy 20 0 6256 2376 1600 R 8.9 0.2 1:07 1 glxgears 1417 willy 20 0 6256 2384 1612 S 8.9 0.2 1:05 1 glxgears 1420 willy 20 0 6284 2552 1740 R 8.9 0.2 1:07 1 glxgears 1410 willy 20 0 6256 2372 1600 S 7.1 0.2 1:11 1 glxgears 1413 willy 20 0 6260 2388 1612 S 7.1 0.2 1:08 0 glxgears 1416 willy 20 0 6284 2544 1728 S 6.2 0.2 1:06 0 glxgears 1418 willy 20 0 6252 2384 1612 S 6.2 0.2 1:09 0 glxgears 1411 willy 20 0 6280 2548 1740 S 5.3 0.2 1:15 1 glxgears 1421 willy 20 0 6280 2536 1728 R 5.3 0.2 1:05 1 glxgears >From time to time, one of the 12 aligned gears will quickly perform a full quarter of round while others slowly turn by a few degrees. In fact, while I don't know this process's CPU usage pattern, there's something useful in it : it allows me to visually see when process accelerate/deceleraet. What would be best would be just a clock requiring low X ressources and eating vast amounts of CPU between movements. It will help visually monitor CPU distribution without being too much impacted by X. I've just added another 100 scheddos at nice 0, and the system is still amazingly usable. I just tried exchanging a 1-byte token between 188 "dd" processes which communicate through circular pipes. 
The context switch rate is rather high but this has no impact on the rest : willy@pcw:c$ dd if=/tmp/fifo bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | (echo -n a;dd bs=1) | dd bs=1 of=/tmp/fifo procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 1105 0 1 0 781108 8364 68180 0 0 0 12 5 82187 59 41 0 1114 0 1 0 781108 8364 68180 0 0 0 0 0 81528 58 42 0 1112 0 1 0 781108 8364 68180 0 0 0 0 1 80899 58 42 0 1113 0 1 0 781108 8364 68180 0 0 0 0 26 83466 58 42 0 1106 0 2 0 781108 8376 68168 0 0 0 8 91 83193 58 42 0 1107 0 1 0 781108 8376 68180 0 0 0 4 7 79951 58 42 0 1106 0 1 0 781108 8376 68180 0 0 0 0 46 80939 57 43 0 1114 0 1 0 781108 8376 68180 0 0 0 0 21 82019 56 44 0 1116 0 1 0 781108 8376 68180 0 0 0 0 16 85134 56 44 0 1114 0 3 0 781108 8388 68168 0 0 0 16 20 85871 56 44 0 1112 0 1 0 781108 8388 68168 0 0 0 0 15 80412 57 43 0 1112 0 1 0 781108 8388 68180 0 0 0 0 101 83002 58 42 0 1113 0 1 0 781108 8388 68180 0 0 0 0 25 82230 56 44 0 Playing with the sched_max_hog_history_ns does not seem to change anything. Maybe it's useful for other workloads. Anyway, I have nothing to complain about, because it's not common for me to be able to normally type a mail on a system with more than 1000 running processes ;-) Also, mixed with this load, I have started injecting HTTP requests between two local processes. The load is stable at 7700 req/s (11800 when alone), and what I was interested in is the response time. It's perfectly stable between 9.0 and 9.4 ms with a standard deviation of about 6.0 ms. Those were varying a lot under stock scheduler, with some sessions sometimes pausing for seconds. (RSDL fixed this though). 
Well, I'll stop heating the room for now as I get out of ideas about how to defeat it. I'm convinced. I'm impatient to read about Mike's feedback with his workload which behaves strangely on RSDL. If it works OK here, it will be the proof that heuristics should not be needed. Congrats ! Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
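The dd pipeline used above is easier to reproduce programmatically. Below is a
minimal C sketch of the same idea -- N processes connected in a ring of pipes,
forwarding a single 1-byte token forever -- useful for generating a comparable
context-switch load. This is only an illustration, not the tool used in this
thread; the process count mirrors the 188-stage dd example and is otherwise
arbitrary. Stop it with ^C.

  /* ring.c -- pass a 1-byte token around a ring of N processes.
   * Sketch only; N is arbitrary. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/wait.h>

  #define N 188   /* number of processes in the ring */

  int main(void)
  {
      int fds[N][2];
      int i;

      for (i = 0; i < N; i++) {
          if (pipe(fds[i]) < 0) {
              perror("pipe");
              exit(1);
          }
      }

      for (i = 0; i < N; i++) {
          pid_t pid = fork();

          if (pid < 0) {
              perror("fork");
              exit(1);
          }
          if (pid == 0) {
              /* child i: read the token from its own pipe and
               * forward it to the next process in the ring */
              char c;
              int in = fds[i][0];
              int out = fds[(i + 1) % N][1];
              int j;

              /* close every pipe end this child does not use */
              for (j = 0; j < N; j++) {
                  if (fds[j][0] != in)
                      close(fds[j][0]);
                  if (fds[j][1] != out)
                      close(fds[j][1]);
              }
              for (;;) {
                  if (read(in, &c, 1) != 1)
                      _exit(0);
                  if (write(out, &c, 1) != 1)
                      _exit(0);
              }
          }
      }

      /* parent: inject the token, then just wait */
      write(fds[0][1], "a", 1);
      for (i = 0; i < N; i++)
          close(fds[i][0]);
      for (i = 0; i < N; i++)
          wait(NULL);
      return 0;
  }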
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau @ 2007-04-14 13:27 ` Willy Tarreau 2007-04-14 14:45 ` Willy Tarreau 2007-04-14 16:19 ` Ingo Molnar 2007-04-15 7:54 ` Mike Galbraith 2007-04-19 9:01 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 13:27 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > Well, I'll stop heating the room for now as I get out of ideas about how > to defeat it. Ah, I found something nasty. If I start large batches of processes like this : $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done the ramp up slows down after 700-800 processes, but something very strange happens. If I'm under X, I can switch the focus to all xterms (the WM is still alive) but all xterms are frozen. On the console, after one moment I simply cannot switch to another VT anymore while I can still start commands locally. But "chvt 2" simply blocks. SysRq-K killed everything and restored full control. Dmesg shows lots of : SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. I wonder if part of the problem would be too many processes bound to the same tty :-/ I'll investigate a bit. Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:27 ` Willy Tarreau @ 2007-04-14 14:45 ` Willy Tarreau 2007-04-14 16:14 ` Ingo Molnar 2007-04-14 16:19 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 14:45 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 03:27:32PM +0200, Willy Tarreau wrote: > On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > > > Well, I'll stop heating the room for now as I get out of ideas about how > > to defeat it. > > Ah, I found something nasty. > If I start large batches of processes like this : > > $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done > > the ramp up slows down after 700-800 processes, but something very > strange happens. If I'm under X, I can switch the focus to all xterms > (the WM is still alive) but all xterms are frozen. On the console, > after one moment I simply cannot switch to another VT anymore while > I can still start commands locally. But "chvt 2" simply blocks. > SysRq-K killed everything and restored full control. Dmesg shows lots > of : > SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > I wonder if part of the problem would be too many processes bound to > the same tty :-/ Does not seem easy to reproduce, it looks like some resource pools are kept pre-allocated after a first run, because if I kill scheddos during the ramp up then start it again, it can go further. The problem happens when the parent is forking. Also, I modified scheddos to close(0,1,2) and to perform the forks itself and it does not cause any problem, even with 4000 processes running. So I really suspect that the problem I encountered above was tty-related. BTW, I've tried your fork patch. It definitely helps forking because it takes below one second to create 4000 processes, then the load slowly increases. As you said, the children have to earn their share, and I find that it makes it easier to conserve control of the whole system's stability. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
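For reference, a self-forking CPU hog along the lines described above could
look like the sketch below: it closes fds 0-2 so nothing stays bound to the
invoking tty, then forks N children that alternate busy-looping and sleeping.
This is not the scheddos2 source (which is not posted in this thread); the
process count and burn/idle durations are illustrative parameters only.

  /* hog.c -- fork N detached CPU hogs that busy-loop for burn_us
   * microseconds, then sleep for idle_us microseconds, with fds 0-2
   * closed so nothing stays attached to the invoking tty.
   * Sketch only.  Usage: ./hog <nprocs> <burn_us> <idle_us> */
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/time.h>

  static long long now_us(void)
  {
      struct timeval tv;

      gettimeofday(&tv, NULL);
      return tv.tv_sec * 1000000LL + tv.tv_usec;
  }

  int main(int argc, char **argv)
  {
      int nprocs   = argc > 1 ? atoi(argv[1]) : 1000;
      long burn_us = argc > 2 ? atol(argv[2]) : 4000;
      long idle_us = argc > 3 ? atol(argv[3]) : 4000;
      int i;

      /* keep no fd open on the controlling tty */
      close(0);
      close(1);
      close(2);

      for (i = 0; i < nprocs; i++) {
          if (fork() == 0) {
              for (;;) {
                  long long end = now_us() + burn_us;

                  while (now_us() < end)
                      ;               /* burn CPU */
                  usleep(idle_us);
              }
          }
      }
      for (;;)
          pause();    /* parent just sits there; ^C kills the group */
      return 0;
  }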
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 14:45 ` Willy Tarreau
@ 2007-04-14 16:14 ` Ingo Molnar
0 siblings, 0 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:14 UTC (permalink / raw)
To: Willy Tarreau
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Willy Tarreau <w@1wt.eu> wrote:

> BTW, I've tried your fork patch. It definitely helps forking because
> it takes below one second to create 4000 processes, then the load
> slowly increases. As you said, the children have to earn their share,
> and I find that it makes it easier to conserve control of the whole
> system's stability.

ok, thanks for testing this out, i think i'll integrate this one back
into the core. (I'm still unsure about the cpu-hog one.) And it saves
some code-size too:

   text    data     bss     dec     hex filename
  23349    2705      24   26078    65de kernel/sched.o.cfs-v1
  23189    2705      24   25918    653e kernel/sched.o.cfs-before
  23052    2705      24   25781    64b5 kernel/sched.o.cfs-after
  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  23671    4548      24   28243    6e53 kernel/sched.o.sd.v40

	Ingo

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:27 ` Willy Tarreau 2007-04-14 14:45 ` Willy Tarreau @ 2007-04-14 16:19 ` Ingo Molnar 2007-04-14 17:15 ` Eric W. Biederman 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 16:19 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Eric W. Biederman, Jiri Slaby, Alan Cox * Willy Tarreau <w@1wt.eu> wrote: > On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > > > Well, I'll stop heating the room for now as I get out of ideas about how > > to defeat it. > > Ah, I found something nasty. > If I start large batches of processes like this : > > $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done > > the ramp up slows down after 700-800 processes, but something very > strange happens. If I'm under X, I can switch the focus to all xterms > (the WM is still alive) but all xterms are frozen. On the console, > after one moment I simply cannot switch to another VT anymore while I > can still start commands locally. But "chvt 2" simply blocks. SysRq-K > killed everything and restored full control. Dmesg shows lots of : > SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > I wonder if part of the problem would be too many processes bound to > the same tty :-/ hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), maybe this description rings a bell with them? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 16:19 ` Ingo Molnar @ 2007-04-14 17:15 ` Eric W. Biederman 2007-04-14 17:29 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Eric W. Biederman @ 2007-04-14 17:15 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Ingo Molnar <mingo@elte.hu> writes: > * Willy Tarreau <w@1wt.eu> wrote: > >> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: >> > >> > Well, I'll stop heating the room for now as I get out of ideas about how >> > to defeat it. >> >> Ah, I found something nasty. >> If I start large batches of processes like this : >> >> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done >> >> the ramp up slows down after 700-800 processes, but something very >> strange happens. If I'm under X, I can switch the focus to all xterms >> (the WM is still alive) but all xterms are frozen. On the console, >> after one moment I simply cannot switch to another VT anymore while I >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K >> killed everything and restored full control. Dmesg shows lots of : > >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. This. Yes. SAK is noisy and tells you everything it kills. >> I wonder if part of the problem would be too many processes bound to >> the same tty :-/ > > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), > maybe this description rings a bell with them? Is there any swapping going on? I'm inclined to suspect that it is a problem that has more to do with the number of processes and has nothing to do with ttys. Anyway you can easily rule out ttys by having your startup program detach from a controlling tty before you start everything. I'm more inclined to guess something is reading /proc a lot, or doing something that holds the tasklist lock, a lot or something like that, if the problem isn't that you are being kicked into swap. Eric ^ permalink raw reply [flat|nested] 304+ messages in thread
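Eric's suggestion of taking the controlling terminal out of the picture can be
done with the usual fork()+setsid() idiom. A minimal sketch of such a launcher
follows (the program name is hypothetical and unrelated to any tool mentioned
in the thread):

  /* detach.c -- run a command with no controlling terminal, to rule
   * ttys out of the picture.  Sketch only.
   * Usage: ./detach <command> [args...] */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      if (argc < 2) {
          fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
          return 1;
      }

      /* first fork: make sure we are not a process group leader,
       * so that setsid() cannot fail */
      if (fork() > 0)
          return 0;

      /* new session: no controlling terminal any more */
      if (setsid() < 0) {
          perror("setsid");
          return 1;
      }

      /* note: if the spawned load later opens a tty without O_NOCTTY,
       * it could re-acquire a controlling terminal */
      execvp(argv[1], argv + 1);
      perror("execvp");
      return 1;
  }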
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:15 ` Eric W. Biederman @ 2007-04-14 17:29 ` Willy Tarreau 2007-04-14 17:44 ` Eric W. Biederman 2007-04-14 17:50 ` Linus Torvalds 0 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 17:29 UTC (permalink / raw) To: Eric W. Biederman Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Hi Eric, [...] > >> the ramp up slows down after 700-800 processes, but something very > >> strange happens. If I'm under X, I can switch the focus to all xterms > >> (the WM is still alive) but all xterms are frozen. On the console, > >> after one moment I simply cannot switch to another VT anymore while I > >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K > >> killed everything and restored full control. Dmesg shows lots of : > > > >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > This. Yes. SAK is noisy and tells you everything it kills. OK, that's what I suspected, but I did not know if the fact that it talked about the session was systematic or related to any particular state when it killed the task. > >> I wonder if part of the problem would be too many processes bound to > >> the same tty :-/ > > > > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), > > maybe this description rings a bell with them? > > Is there any swapping going on? Not at all. > I'm inclined to suspect that it is a problem that has more to do with the > number of processes and has nothing to do with ttys. It is clearly possible. What I found strange is that I could still fork processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. It first happened under X with frozen xterms but a perfectly usable WM, then I reproduced it on pure console to rule out any potential X problem. > Anyway you can easily rule out ttys by having your startup program > detach from a controlling tty before you start everything. > > I'm more inclined to guess something is reading /proc a lot, or doing > something that holds the tasklist lock, a lot or something like that, > if the problem isn't that you are being kicked into swap. Oh I'm sorry you were invited into the discussion without a first description of the context. I was giving a try to Ingo's new scheduler, and trying to reach corner cases with lots of processes competing for CPU. I simply used a "for" loop in bash to fork 1000 processes, and this problem happened between 700-800 children. The program only uses a busy loop and a pause. I then changed my program to close 0,1,2 and perform the fork itself, and the problem vanished. So there are two differences here : - bash not forking anymore - far less FDs on /dev/tty1 At first, I had around 2200 fds on /dev/tty1, reason why I suspected something in this area. I agree that this is not normal usage at all, I'm just trying to attack Ingo's scheduler to ensure it is more robust than the stock one. But sometimes brute force methods can make other sleeping problems pop up. Thinking about it, I don't know if there are calls to schedule() while switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2" simply blocked. It would have been possible that a schedule() call somewhere got starved due to the load, I don't know. Thanks, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:29 ` Willy Tarreau @ 2007-04-14 17:44 ` Eric W. Biederman 2007-04-14 17:54 ` Ingo Molnar 2007-04-14 17:50 ` Linus Torvalds 1 sibling, 1 reply; 304+ messages in thread From: Eric W. Biederman @ 2007-04-14 17:44 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Willy Tarreau <w@1wt.eu> writes: > Hi Eric, > > [...] >> >> the ramp up slows down after 700-800 processes, but something very >> >> strange happens. If I'm under X, I can switch the focus to all xterms >> >> (the WM is still alive) but all xterms are frozen. On the console, >> >> after one moment I simply cannot switch to another VT anymore while I >> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K >> >> killed everything and restored full control. Dmesg shows lots of : >> > >> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. >> >> This. Yes. SAK is noisy and tells you everything it kills. > > OK, that's what I suspected, but I did not know if the fact that it talked > about the session was systematic or related to any particular state when it > killed the task. > >> >> I wonder if part of the problem would be too many processes bound to >> >> the same tty :-/ >> > >> > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), >> > maybe this description rings a bell with them? >> >> Is there any swapping going on? > > Not at all. > >> I'm inclined to suspect that it is a problem that has more to do with the >> number of processes and has nothing to do with ttys. > > It is clearly possible. What I found strange is that I could still fork > processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. > It first happened under X with frozen xterms but a perfectly usable WM, > then I reproduced it on pure console to rule out any potential X problem. > >> Anyway you can easily rule out ttys by having your startup program >> detach from a controlling tty before you start everything. >> >> I'm more inclined to guess something is reading /proc a lot, or doing >> something that holds the tasklist lock, a lot or something like that, >> if the problem isn't that you are being kicked into swap. > > Oh I'm sorry you were invited into the discussion without a first description > of the context. I was giving a try to Ingo's new scheduler, and trying to > reach corner cases with lots of processes competing for CPU. > > I simply used a "for" loop in bash to fork 1000 processes, and this problem > happened between 700-800 children. The program only uses a busy loop and a > pause. I then changed my program to close 0,1,2 and perform the fork itself, > and the problem vanished. So there are two differences here : > > - bash not forking anymore > - far less FDs on /dev/tty1 Yes. But with /dev/tty1 being the controlling terminal in both cases, as you haven't dropped your session, or disassociated your tty. The bash problem may have something to setpgid or scheduling effects. Hmm. I just looked and setpgid does grab the tasklist lock for writing so we may possibly have some contention there. > At first, I had around 2200 fds on /dev/tty1, reason why I suspected something > in this area. > > I agree that this is not normal usage at all, I'm just trying to attack > Ingo's scheduler to ensure it is more robust than the stock one. 
> But sometimes brute force methods can make other sleeping problems pop up.

Yep.  If we can narrow it down to one, that would be interesting.  Of
course it also means that when we start finding other possible sleeping
problems, people are working in areas of code they don't normally touch,
so we must investigate.

> Thinking about it, I don't know if there are calls to schedule() while
> switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
> simply blocked. It would have been possible that a schedule() call
> somewhere got starved due to the load, I don't know.

It looks like there is a call to schedule_work.

There are two pieces of the path.  If you are switching in and out of a
tty controlled by something like X, user space has to grant permission
before the operation happens.  Where there isn't a gatekeeper I know it
is cheaper, but I don't know by how much; I suspect there is still a
schedule happening in there.

Eric

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:44 ` Eric W. Biederman @ 2007-04-14 17:54 ` Ingo Molnar 2007-04-14 18:18 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 17:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * Eric W. Biederman <ebiederm@xmission.com> wrote: > > Thinking about it, I don't know if there are calls to schedule() > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > > "chvt 2" simply blocked. It would have been possible that a > > schedule() call somewhere got starved due to the load, I don't know. > > It looks like there is a call to schedule_work. so this goes over keventd, right? > There are two pieces of the path. If you are switching in and out of a > tty controlled by something like X. User space has to grant > permission before the operation happens. Where there isn't a gate > keeper I know it is cheaper but I don't know by how much, I suspect > there is still a schedule happening in there. Could keventd perhaps be starved? Willy, to exclude this possibility, could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then the command to set it to SCHED_FIFO:50 would be: chrt -f -p 50 5 but ... events/0 is reniced to -5 by default, so it should definitely not be starved. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
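On boxes without chrt, the same change can be made directly through the
sched_setscheduler() syscall; a minimal sketch, equivalent to
'chrt -f -p <prio> <pid>' (run as root):

  /* fifoprio.c -- put an existing task into SCHED_FIFO at a given
   * priority, equivalent to 'chrt -f -p <prio> <pid>'.  Sketch only.
   * Usage (as root): ./fifoprio <pid> <prio> */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sched.h>
  #include <sys/types.h>

  int main(int argc, char **argv)
  {
      struct sched_param sp;
      pid_t pid;

      if (argc != 3) {
          fprintf(stderr, "usage: %s <pid> <prio>\n", argv[0]);
          return 1;
      }
      pid = atoi(argv[1]);
      sp.sched_priority = atoi(argv[2]);

      if (sched_setscheduler(pid, SCHED_FIFO, &sp) < 0) {
          perror("sched_setscheduler");
          return 1;
      }
      return 0;
  }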
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:54 ` Ingo Molnar @ 2007-04-14 18:18 ` Willy Tarreau 2007-04-14 18:40 ` Eric W. Biederman 2007-04-15 17:55 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 18:18 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: > > * Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > Thinking about it, I don't know if there are calls to schedule() > > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > > > "chvt 2" simply blocked. It would have been possible that a > > > schedule() call somewhere got starved due to the load, I don't know. > > > > It looks like there is a call to schedule_work. > > so this goes over keventd, right? > > > There are two pieces of the path. If you are switching in and out of a > > tty controlled by something like X. User space has to grant > > permission before the operation happens. Where there isn't a gate > > keeper I know it is cheaper but I don't know by how much, I suspect > > there is still a schedule happening in there. > > Could keventd perhaps be starved? Willy, to exclude this possibility, > could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then > the command to set it to SCHED_FIFO:50 would be: > > chrt -f -p 50 5 > > but ... events/0 is reniced to -5 by default, so it should definitely > not be starved. Well, since I merged the fair-fork patch, I cannot reproduce (in fact, bash forks 1000 processes, then progressively execs scheddos, but it takes some time). So I'm rebuilding right now. But I think that Linus has an interesting clue about GPM and notification before switching the terminal. I think it was enabled in console mode. I don't know how that translates to frozen xterms, but let's attack the problems one at a time. Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 18:18 ` Willy Tarreau @ 2007-04-14 18:40 ` Eric W. Biederman 2007-04-14 19:01 ` Willy Tarreau 2007-04-15 17:55 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Eric W. Biederman @ 2007-04-14 18:40 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Willy Tarreau <w@1wt.eu> writes: > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: >> >> * Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> > > Thinking about it, I don't know if there are calls to schedule() >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and >> > > "chvt 2" simply blocked. It would have been possible that a >> > > schedule() call somewhere got starved due to the load, I don't know. >> > >> > It looks like there is a call to schedule_work. >> >> so this goes over keventd, right? >> >> > There are two pieces of the path. If you are switching in and out of a >> > tty controlled by something like X. User space has to grant >> > permission before the operation happens. Where there isn't a gate >> > keeper I know it is cheaper but I don't know by how much, I suspect >> > there is still a schedule happening in there. >> >> Could keventd perhaps be starved? Willy, to exclude this possibility, >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then >> the command to set it to SCHED_FIFO:50 would be: >> >> chrt -f -p 50 5 >> >> but ... events/0 is reniced to -5 by default, so it should definitely >> not be starved. > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > bash forks 1000 processes, then progressively execs scheddos, but it > takes some time). So I'm rebuilding right now. But I think that Linus > has an interesting clue about GPM and notification before switching > the terminal. I think it was enabled in console mode. I don't know > how that translates to frozen xterms, but let's attack the problems > one at a time. I think it is a good clue. However the intention of the mechanism is that only processes that change the video mode on a VT are supposed to use it. So I really don't think gpm is the culprit. However it easily could be something else that has similar characteristics. I just realized we do have proof that schedule_work is actually working because SAK works, and we can't sanely do SAK from interrupt context so we call schedule work. Eric ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 18:40 ` Eric W. Biederman @ 2007-04-14 19:01 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 19:01 UTC (permalink / raw) To: Eric W. Biederman Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, Apr 14, 2007 at 12:40:15PM -0600, Eric W. Biederman wrote: > Willy Tarreau <w@1wt.eu> writes: > > > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: > >> > >> * Eric W. Biederman <ebiederm@xmission.com> wrote: > >> > >> > > Thinking about it, I don't know if there are calls to schedule() > >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > >> > > "chvt 2" simply blocked. It would have been possible that a > >> > > schedule() call somewhere got starved due to the load, I don't know. > >> > > >> > It looks like there is a call to schedule_work. > >> > >> so this goes over keventd, right? > >> > >> > There are two pieces of the path. If you are switching in and out of a > >> > tty controlled by something like X. User space has to grant > >> > permission before the operation happens. Where there isn't a gate > >> > keeper I know it is cheaper but I don't know by how much, I suspect > >> > there is still a schedule happening in there. > >> > >> Could keventd perhaps be starved? Willy, to exclude this possibility, > >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then > >> the command to set it to SCHED_FIFO:50 would be: > >> > >> chrt -f -p 50 5 > >> > >> but ... events/0 is reniced to -5 by default, so it should definitely > >> not be starved. > > > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > > bash forks 1000 processes, then progressively execs scheddos, but it > > takes some time). So I'm rebuilding right now. But I think that Linus > > has an interesting clue about GPM and notification before switching > > the terminal. I think it was enabled in console mode. I don't know > > how that translates to frozen xterms, but let's attack the problems > > one at a time. > > I think it is a good clue. However the intention of the mechanism is > that only processes that change the video mode on a VT are supposed to > use it. So I really don't think gpm is the culprit. However it easily could > be something else that has similar characteristics. > > I just realized we do have proof that schedule_work is actually working > because SAK works, and we can't sanely do SAK from interrupt context > so we call schedule work. Eric, I can say that Linus, Ingo and you all got on the right track. I could reproduce, I got a hung tty around 1400 running processes. Fortunately, it was the one with the root shell which was reniced to -19. I could strace chvt 2 : 20:44:23.761117 open("/dev/tty", O_RDONLY) = 3 <0.004000> 20:44:23.765117 ioctl(3, KDGKBTYPE, 0xbfa305a3) = 0 <0.024002> 20:44:23.789119 ioctl(3, VIDIOC_G_COMP or VT_ACTIVATE, 0x3) = 0 <0.000000> 20:44:23.789119 ioctl(3, VIDIOC_S_COMP or VT_WAITACTIVE <unfinished ...> Then I applied Ingo's suggestion about changing keventd prio : root@pcw:~# ps auxw|grep event root 8 0.0 0.0 0 0 ? SW< 20:31 0:00 [events/0] root 9 0.0 0.0 0 0 ? RW< 20:31 0:00 [events/1] root@pcw:~# rtprio -s 1 -p 50 8 9 (I don't have chrt but it does the same) My VT immediately switched as soon as I hit Enter. Everything's working fine again now. 
So the good news is that it's not a bug in the tty code, nor a deadlock.

Now, maybe keventd should get a higher prio? It seems worrying to me that
it can be starved when it is so critical to the system.

Also, that may explain why I couldn't reproduce with the fork patch. Since
all new processes got no runtime at first, their impact on existing ones
must have been lower. But I think that if I had waited longer, I would
have hit the problem again (though I did not see it even under a load
of 7800).

Regards,
Willy

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 18:18 ` Willy Tarreau
2007-04-14 18:40 ` Eric W. Biederman
@ 2007-04-15 17:55 ` Ingo Molnar
2007-04-15 18:06 ` Willy Tarreau
1 sibling, 1 reply; 304+ messages in thread
From: Ingo Molnar @ 2007-04-15 17:55 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox

* Willy Tarreau <w@1wt.eu> wrote:

> Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> bash forks 1000 processes, then progressively execs scheddos, but it
> takes some time). So I'm rebuilding right now. But I think that Linus
> has an interesting clue about GPM and notification before switching
> the terminal. I think it was enabled in console mode. I don't know how
> that translates to frozen xterms, but let's attack the problems one at
> a time.

to debug this, could you try to apply this add-on as well:

   http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch

with this patch applied you should have a /proc/sched_debug file that
prints all runnable tasks and other interesting info from the runqueue.

[ i've refreshed all the patches on the CFS webpage, so if this doesnt
  apply cleanly to your current tree then you'll probably have to
  refresh one of the patches. ]

The output should look like this:

 Sched Debug Version: v0.01
 now at 226761724575 nsecs

 cpu: 0
   .nr_running            : 3
   .raw_weighted_load     : 384
   .nr_switches           : 13666
   .nr_uninterruptible    : 0
   .next_balance          : 4294947416
   .curr->pid             : 2179
   .rq_clock              : 241337421233
   .fair_clock            : 7503791206
   .wait_runtime          : 2269918379

 runnable tasks:
            task |  PID |    tree-key |   -delta | waiting | switches
 -----------------------------------------------------------------
 +           cat   2179   7501930066   -1861140   1861140          2
      loop_silent   2149   7503010354    -780852         0        911
      loop_silent   2148   7503510048    -281158    280753        918

now for your workload the list should be considerably larger. If there's
starvation going on then the 'switches' field (number of context
switches) of one of the tasks would never increase while you have this
'cannot switch consoles' problem.

maybe you'll have to unapply the fair-fork patch to make it trigger
again. (fair-fork does not fix anything, so it probably just hides a
real bug.)

(i'm meanwhile busy running your scheddos utilities to reproduce it
locally as well :)

	Ingo

^ permalink raw reply	[flat|nested] 304+ messages in thread
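A small helper can make the check Ingo describes mechanical: snapshot
/proc/sched_debug twice and print the lines matching a task name, so it is
easy to see whether that task's 'switches' column ever moves. The sketch below
only does a substring match, since the exact file layout depends on the
sched-fair-print patch version in use:

  /* watchdbg.c -- print /proc/sched_debug lines matching a task name,
   * twice, a few seconds apart; if the trailing 'switches' count of a
   * task does not move, it is not getting scheduled.  Sketch only.
   * Usage: ./watchdbg <task-name> [seconds] */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  static void dump_matching(const char *name)
  {
      char line[512];
      FILE *f = fopen("/proc/sched_debug", "r");

      if (!f) {
          perror("/proc/sched_debug");
          exit(1);
      }
      while (fgets(line, sizeof(line), f)) {
          if (strstr(line, name))
              fputs(line, stdout);
      }
      fclose(f);
  }

  int main(int argc, char **argv)
  {
      int delay = argc > 2 ? atoi(argv[2]) : 5;

      if (argc < 2) {
          fprintf(stderr, "usage: %s <task-name> [seconds]\n", argv[0]);
          return 1;
      }
      printf("--- first snapshot ---\n");
      dump_matching(argv[1]);
      sleep(delay);
      printf("--- %d seconds later ---\n", delay);
      dump_matching(argv[1]);
      return 0;
  }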
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:55 ` Ingo Molnar @ 2007-04-15 18:06 ` Willy Tarreau 2007-04-15 19:20 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-15 18:06 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Hi Ingo, On Sun, Apr 15, 2007 at 07:55:55PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > > bash forks 1000 processes, then progressively execs scheddos, but it > > takes some time). So I'm rebuilding right now. But I think that Linus > > has an interesting clue about GPM and notification before switching > > the terminal. I think it was enabled in console mode. I don't know how > > that translates to frozen xterms, but let's attack the problems one at > > a time. > > to debug this, could you try to apply this add-on as well: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch > > with this patch applied you should have a /proc/sched_debug file that > prints all runnable tasks and other interesting info from the runqueue. I don't know if you have seen my mail from yesterday evening (here). I found that changing keventd prio fixed the problem. You may be interested in the description. I sent it at 21:01 (+200). > [ i've refreshed all the patches on the CFS webpage, so if this doesnt > apply cleanly to your current tree then you'll probably have to > refresh one of the patches.] Fine, I'll have a look. I already had to rediff the sched-fair-fork patch last time. > The output should look like this: > > Sched Debug Version: v0.01 > now at 226761724575 nsecs > > cpu: 0 > .nr_running : 3 > .raw_weighted_load : 384 > .nr_switches : 13666 > .nr_uninterruptible : 0 > .next_balance : 4294947416 > .curr->pid : 2179 > .rq_clock : 241337421233 > .fair_clock : 7503791206 > .wait_runtime : 2269918379 > > runnable tasks: > task | PID | tree-key | -delta | waiting | switches > ----------------------------------------------------------------- > + cat 2179 7501930066 -1861140 1861140 2 > loop_silent 2149 7503010354 -780852 0 911 > loop_silent 2148 7503510048 -281158 280753 918 Nice. > now for your workload the list should be considerably larger. If there's > starvation going on then the 'switches' field (number of context > switches) of one of the tasks would never increase while you have this > 'cannot switch consoles' problem. > > maybe you'll have to unapply the fair-fork patch to make it trigger > again. (fair-fork does not fix anything, so it probably just hides a > real bug.) > > (i'm meanwhile busy running your scheddos utilities to reproduce it > locally as well :) I discovered I had the frame-buffer enabled (I did not notice it first because I do not have the logo and the resolution is the same as text). It's matroxfb with a G400, if that can help. It may be possible that it needs some CPU that it cannot get to clear the display before switching, I don't know. However I won't try this right now, I'm deep in userland at the moment. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 18:06 ` Willy Tarreau
@ 2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
2007-04-15 19:37 ` Ingo Molnar
0 siblings, 2 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:20 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox

* Willy Tarreau <w@1wt.eu> wrote:

> > to debug this, could you try to apply this add-on as well:
> >
> >    http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
> >
> > with this patch applied you should have a /proc/sched_debug file
> > that prints all runnable tasks and other interesting info from the
> > runqueue.
>
> I don't know if you have seen my mail from yesterday evening (here). I
> found that changing keventd prio fixed the problem. You may be
> interested in the description. I sent it at 21:01 (+200).

ah, indeed i missed that mail - the response to the patches was quite
overwhelming (and i naively thought people dont do Linux hacking over
the weekends anymore ;).

so Linus was right: this was caused by scheduler starvation. I can see
one immediate problem already: the 'nice offset' is not divided by
nr_running as it should. The patch below should fix this but i have
yet to test it accurately, this change might as well render nice
levels unacceptably ineffective under high loads.

	Ingo

--------->
---
 kernel/sched_fair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;
 
-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / rq->nr_running;
 
 	p->fair_key = key;

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:20 ` Ingo Molnar @ 2007-04-15 19:35 ` William Lee Irwin III 2007-04-15 19:57 ` Ingo Molnar 2007-04-15 19:37 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 19:35 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote: > so Linus was right: this was caused by scheduler starvation. I can see > one immediate problem already: the 'nice offset' is not divided by > nr_running as it should. The patch below should fix this but i have yet > to test it accurately, this change might as well render nice levels > unacceptably ineffective under high loads. I've been suggesting testing CPU bandwidth allocation as influenced by nice numbers for a while now for a reason. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:35 ` William Lee Irwin III @ 2007-04-15 19:57 ` Ingo Molnar 2007-04-15 23:54 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 19:57 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: > On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote: > > so Linus was right: this was caused by scheduler starvation. I can > > see one immediate problem already: the 'nice offset' is not divided > > by nr_running as it should. The patch below should fix this but i > > have yet to test it accurately, this change might as well render > > nice levels unacceptably ineffective under high loads. > > I've been suggesting testing CPU bandwidth allocation as influenced by > nice numbers for a while now for a reason. Oh I was very much testing "CPU bandwidth allocation as influenced by nice numbers" - it's one of the basic things i do when modifying the scheduler. An automated tool, while nice (all automation is nice) wouldnt necessarily show such bugs though, because here too it needed thousands of running tasks to trigger in practice. Any volunteers? ;) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:57 ` Ingo Molnar @ 2007-04-15 23:54 ` William Lee Irwin III 2007-04-16 11:24 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 23:54 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: >> I've been suggesting testing CPU bandwidth allocation as influenced by >> nice numbers for a while now for a reason. On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote: > Oh I was very much testing "CPU bandwidth allocation as influenced by > nice numbers" - it's one of the basic things i do when modifying the > scheduler. An automated tool, while nice (all automation is nice) > wouldnt necessarily show such bugs though, because here too it needed > thousands of running tasks to trigger in practice. Any volunteers? ;) Worse comes to worse I might actually get around to doing it myself. Any more detailed descriptions of the test for a rainy day? -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:54 ` William Lee Irwin III @ 2007-04-16 11:24 ` Ingo Molnar 2007-04-16 13:46 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 11:24 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: > On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote: > > Oh I was very much testing "CPU bandwidth allocation as influenced by > > nice numbers" - it's one of the basic things i do when modifying the > > scheduler. An automated tool, while nice (all automation is nice) > > wouldnt necessarily show such bugs though, because here too it needed > > thousands of running tasks to trigger in practice. Any volunteers? ;) > > Worse comes to worse I might actually get around to doing it myself. > Any more detailed descriptions of the test for a rainy day? the main complication here is that the handling of nice levels is still typically a 2nd or 3rd degree design factor when writing schedulers. The reason isnt carelessness, the reason is simply that users typically only care about a single nice level: the one that all tasks run under by default. Also, often there's just one or two good ways to attack the problem within a given scheduler approach and the quality of nice levels often suffers under other, more important design factors like performance. This means that for example for the vanilla scheduler the distribution of CPU power depends on HZ, on the number of tasks and on the scheduling pattern. The distribution of CPU power amongst nice levels is basically a function of _everything_. That makes any automated test pretty challenging. Both with SD and with CFS there's a good chance to actually formalize the meaning of nice levels, but i'd not go as far as to mandate any particular behavior by rigidly saying "pass this automated tool, else ...", other than "make nice levels resonable". All the other more formal CPU resource limitation techniques are then a matter of CKRM-alike patches, which offer much more finegrained mechanisms than pure nice levels anyway. so to answer your question: it's pretty much freely defined. Make up your mind about it and figure out the ways how people use nice levels and think about which aspects of that experience are worth testing for intelligently. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:24 ` Ingo Molnar @ 2007-04-16 13:46 ` William Lee Irwin III 0 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 13:46 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: >> Worse comes to worse I might actually get around to doing it myself. >> Any more detailed descriptions of the test for a rainy day? On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > the main complication here is that the handling of nice levels is still > typically a 2nd or 3rd degree design factor when writing schedulers. The > reason isnt carelessness, the reason is simply that users typically only > care about a single nice level: the one that all tasks run under by > default. I'm a bit unconvinced here. Support for prioritization is a major scheduler feature IMHO. On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > Also, often there's just one or two good ways to attack the problem > within a given scheduler approach and the quality of nice levels often > suffers under other, more important design factors like performance. > This means that for example for the vanilla scheduler the distribution > of CPU power depends on HZ, on the number of tasks and on the scheduling > pattern. The distribution of CPU power amongst nice levels is basically > a function of _everything_. That makes any automated test pretty > challenging. Both with SD and with CFS there's a good chance to actually > formalize the meaning of nice levels, but i'd not go as far as to > mandate any particular behavior by rigidly saying "pass this automated > tool, else ...", other than "make nice levels resonable". All the other > more formal CPU resource limitation techniques are then a matter of > CKRM-alike patches, which offer much more finegrained mechanisms than > pure nice levels anyway. Some of the issues with respect to the number of tasks and scheduling patterns can be made part of the test; one can furthermore insist that the system be quiescent in a variety of ways. I'm not convinced that formalization of nice levels is a bad idea. They're the standard UNIX prioritization facility, and it should work with some definite value of "work." Even supposing one doesn't care to bolt down the semantics of nice levels, there should at least be some awareness of what those semantics are and when and how they're changing. So in that respect a test for CPU bandwidth distribution according to nice level remains valuable even supposing that the semantics aren't required to be rigidly fixed. As far as CKRM goes, I'm wild about it. I wish things would get in shape to be merged (if they're not already) and merged ASAP on that front. I think with so much agreement in concept we can work with changing out implementations as-needed with it sitting in mainline once the the user API/ABI is decided upon, and I think it already is. I'm not entirely convinced CKRM answers this, though. If the scheduler can't support nice levels, how is it supposed to support prioritization or CPU bandwidth allocation according to CKRM configurations? I'm relatively certain schedulers must be able to support prioritization with deterministic CPU bandwidth as essential functionality. 
This is, of course, not to say my certainty about things sets the standards for what testcases are considered meaningful and valid. On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > so to answer your question: it's pretty much freely defined. Make up > your mind about it and figure out the ways how people use nice levels > and think about which aspects of that experience are worth testing for > intelligently. Looking for usage cases is a good idea; I'll do that before coding any testcase for nice semantics. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
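One very rough shape such a nice-level bandwidth test could take (an
illustration only, not a test either of them actually wrote): fork one
busy-looping child per nice level of interest, let them compete for a fixed
wall-clock interval, then report the CPU time each one accumulated. The nice
levels and duration below are arbitrary, and the ratios are only meaningful if
the spinners share a single CPU (e.g. pinned with taskset).

  /* nicebw.c -- crude CPU-bandwidth-vs-nice measurement: one spinner
   * per nice level, all competing for the same interval, then report
   * how much CPU time each one got.  Sketch only. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <signal.h>
  #include <sys/types.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sys/wait.h>

  static const int nice_levels[] = { 0, 5, 10, 19 };
  #define NTASKS (sizeof(nice_levels) / sizeof(nice_levels[0]))
  #define DURATION 30     /* seconds of competition */

  int main(void)
  {
      pid_t pids[NTASKS];
      unsigned int i;

      for (i = 0; i < NTASKS; i++) {
          pids[i] = fork();
          if (pids[i] < 0) {
              perror("fork");
              exit(1);
          }
          if (pids[i] == 0) {
              nice(nice_levels[i]);
              for (;;)
                  ;       /* burn CPU at this nice level */
          }
      }

      sleep(DURATION);

      for (i = 0; i < NTASKS; i++)
          kill(pids[i], SIGKILL);

      for (i = 0; i < NTASKS; i++) {
          struct rusage ru;
          double cpu;

          wait4(pids[i], NULL, 0, &ru);
          cpu = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
              + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
          printf("nice %3d: %6.2fs CPU (%5.1f%% of wall time)\n",
                 nice_levels[i], cpu, 100.0 * cpu / DURATION);
      }
      return 0;
  }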
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
@ 2007-04-15 19:37 ` Ingo Molnar
1 sibling, 0 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:37 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox

* Ingo Molnar <mingo@elte.hu> wrote:

> so Linus was right: this was caused by scheduler starvation. I can see
> one immediate problem already: the 'nice offset' is not divided by
> nr_running as it should. The patch below should fix this but i have
> yet to test it accurately, this change might as well render nice
> levels unacceptably ineffective under high loads.

erm, rather the updated patch below if you want to use this on a 32-bit
system. But ... i think you should wait until i have all this re-tested.

	Ingo

---
 include/linux/sched.h |    2 +-
 kernel/sched_fair.c   |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -839,7 +839,7 @@ struct task_struct {
 
 	s64 wait_runtime;
 	u64 exec_runtime, fair_key;
-	s64 nice_offset, hog_limit;
+	s32 nice_offset, hog_limit;
 
 	unsigned long policy;
 	cpumask_t cpus_allowed;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;
 
-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / (rq->nr_running + 1);
 
 	p->fair_key = key;

^ permalink raw reply	[flat|nested] 304+ messages in thread
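To see the scaling effect Ingo is worried about, the division used in the two
patches above can be replayed in userspace: with any fixed nice_offset, the
per-task key offset shrinks quickly as nr_running grows. The offset value
below is made up purely for illustration and is not the value CFS uses.

  /* offset-scale.c -- show how a fixed nice_offset, divided by the
   * number of runnable tasks as in the patches above, shrinks as the
   * load grows.  The base offset is a hypothetical number. */
  #include <stdio.h>

  int main(void)
  {
      long long nice_offset = 10000000;   /* hypothetical, in ns */
      int nr_running;

      for (nr_running = 1; nr_running <= 1024; nr_running *= 2)
          printf("nr_running = %4d  ->  key offset = %lld\n",
                 nr_running, nice_offset / (nr_running + 1));
      return 0;
  }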
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:29 ` Willy Tarreau 2007-04-14 17:44 ` Eric W. Biederman @ 2007-04-14 17:50 ` Linus Torvalds 1 sibling, 0 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-14 17:50 UTC (permalink / raw) To: Willy Tarreau Cc: Eric W. Biederman, Ingo Molnar, Nick Piggin, linux-kernel, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, 14 Apr 2007, Willy Tarreau wrote: > > It is clearly possible. What I found strange is that I could still fork > processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. Considering the patches in question, it's almost definitely just a CPU scheduling problem with starvation. The VT switching is obviously done by the kernel, but the kernel will signal and wait for the "controlling process" for the VT. The most obvious case of that is X, of course, but even in text mode I think gpm will have taken control of the VT's it runs on (all of them), which means that when you initiate a VT switch, the kernel will actually signal the controlling process (gpm), and wait for it to acknowledge the switch. If gpm doesn't get a timeslice for some reason (and it sounds like there may be some serious unfairness after "fork()"), your behaviour is explainable. (NOTE! I've never actually looked at gpm sources or what it really does, so maybe I'm wrong, and it doesn't try to do the controlling VT thing, and something else is going on, but quite frankly, it sounds like the obvious candidate for this bug. Explaining it with some non-scheduler-related thing sounds unlikely, considering the patch in question). Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau @ 2007-04-15 7:54 ` Mike Galbraith 2007-04-15 8:58 ` Ingo Molnar 2007-04-19 9:01 ` Ingo Molnar 2 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 7:54 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote: > Well, I'll stop heating the room for now as I get out of ideas about how > to defeat it. I'm convinced. I'm impatient to read about Mike's feedback > with his workload which behaves strangely on RSDL. If it works OK here, > it will be the proof that heuristics should not be needed. You mean the X + mp3 player + audio visualization test? X+Gforce visualization have problems getting half of my box in the presence of two other heavy cpu using tasks. Behavior is _much_ better than RSDL/SD, but the synchronous nature of X/client seems to be a problem. With this scheduler, renicing X/client does cure it, whereas with SD it did not help one bit. (I know a trivial way to cure that, and this framework makes that possible without dorking up fairness as a general policy.) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 7:54 ` Mike Galbraith @ 2007-04-15 8:58 ` Ingo Molnar 2007-04-15 9:11 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 8:58 UTC (permalink / raw) To: Mike Galbraith Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner * Mike Galbraith <efault@gmx.de> wrote: > On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote: > > > Well, I'll stop heating the room for now as I get out of ideas about > > how to defeat it. I'm convinced. I'm impatient to read about Mike's > > feedback with his workload which behaves strangely on RSDL. If it > > works OK here, it will be the proof that heuristics should not be > > needed. > > You mean the X + mp3 player + audio visualization test? X+Gforce > visualization have problems getting half of my box in the presence of > two other heavy cpu using tasks. Behavior is _much_ better than > RSDL/SD, but the synchronous nature of X/client seems to be a problem. > > With this scheduler, renicing X/client does cure it, whereas with SD > it did not help one bit. [...] thanks for testing it! I was quite worried about your setup - two tasks using up 50%/50% of CPU time, pitted against a kernel rebuild workload seems to be a hard workload to get right. > [...] (I know a trivial way to cure that, and this framework makes > that possible without dorking up fairness as a general policy.) great! Please send patches so i can add them (once you are happy with the solution) - i think your workload isnt special in any way and could hit other people too. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:58 ` Ingo Molnar @ 2007-04-15 9:11 ` Mike Galbraith 0 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 9:11 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 10:58 +0200, Ingo Molnar wrote: > * Mike Galbraith <efault@gmx.de> wrote: > > [...] (I know a trivial way to cure that, and this framework makes > > that possible without dorking up fairness as a general policy.) > > great! Please send patches so i can add them (once you are happy with > the solution) - i think your workload isnt special in any way and could > hit other people too. I'll give it a shot. (have to read and actually understand your new code first though, then see if it's really viable) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau 2007-04-15 7:54 ` Mike Galbraith @ 2007-04-19 9:01 ` Ingo Molnar 2007-04-19 12:54 ` Willy Tarreau 2007-04-19 17:32 ` Gene Heskett 2 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 9:01 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Good idea. The machine I'm typing from now has 1000 scheddos running > at +19, and 12 gears at nice 0. [...] > From time to time, one of the 12 aligned gears will quickly perform a > full quarter of round while others slowly turn by a few degrees. In > fact, while I don't know this process's CPU usage pattern, there's > something useful in it : it allows me to visually see when process > accelerate/decelerate. [...] cool idea - i have just tried this and it rocks - you can easily see the 'nature' of CPU time distribution just via visual feedback. (Is there any easy way to start up 12 glxgears fully aligned, or does one always have to mouse around to get them into proper position?) btw., i am using another method to quickly judge X's behavior: i started the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth opengl-rendered snow fall on the desktop background. That gives me an idea about how well X is scheduling under various workloads, without having to instrument it explicitly. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 9:01 ` Ingo Molnar @ 2007-04-19 12:54 ` Willy Tarreau 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:32 ` Gene Heskett 1 sibling, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-19 12:54 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Hi Ingo, On Thu, Apr 19, 2007 at 11:01:44AM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Good idea. The machine I'm typing from now has 1000 scheddos running > > at +19, and 12 gears at nice 0. [...] > > > From time to time, one of the 12 aligned gears will quickly perform a > > full quarter of round while others slowly turn by a few degrees. In > > fact, while I don't know this process's CPU usage pattern, there's > > something useful in it : it allows me to visually see when process > > accelerate/decelerate. [...] > > cool idea - i have just tried this and it rocks - you can easily see the > 'nature' of CPU time distribution just via visual feedback. (Is there > any easy way to start up 12 glxgears fully aligned, or does one always > have to mouse around to get them into proper position?) -- Replying quickly, I'm short in time -- You can certainly script it with -geometry. But it is the wrong application for this matter, because you benchmark X more than glxgears itself. What would be better is something like a line rotating 360 degrees and doing some short stuff between each degree, so that X is not much sollicitated, but the CPU would be spent more on the processes themselves. Benchmarking interactions between X and multiple clients is a completely different test IMHO. Glxgears is between those two, making it inappropriate for scheduler tuning. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
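A minimal sketch of the "-geometry scripting" idea raised in the two mails above, for readers who want to reproduce the aligned-gears setup: it forks one glxgears per cell of a 4x3 grid with a computed -geometry string. The grid size, the window dimensions and the assumption that the local glxgears accepts -geometry are illustrative only, not part of the patch set.

/* Illustrative launcher: start a 4x3 grid of aligned glxgears windows.
 * Window size and grid layout are arbitrary; adjust for your screen. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const int cols = 4, rows = 3;	/* 12 instances total */
	const int w = 300, h = 300;	/* assumed window size */
	int x, y;

	for (y = 0; y < rows; y++) {
		for (x = 0; x < cols; x++) {
			char geom[64];

			snprintf(geom, sizeof(geom), "%dx%d+%d+%d",
				 w, h, x * w, y * h);
			if (fork() == 0) {
				execlp("glxgears", "glxgears",
				       "-geometry", geom, (char *)NULL);
				perror("execlp glxgears");
				_exit(1);
			}
		}
	}
	return 0;	/* the children keep running after the launcher exits */
}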
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 12:54 ` Willy Tarreau @ 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 15:18 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > You can certainly script it with -geometry. But it is the wrong > application for this matter, because you benchmark X more than > glxgears itself. What would be better is something like a line > rotating 360 degrees and doing some short stuff between each degree, > so that X is not much sollicitated, but the CPU would be spent more on > the processes themselves. at least on my setup glxgears goes via DRI/DRM so there's no X scheduling inbetween at all, and the visual appearance of glxgears is a direct function of its scheduling. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar @ 2007-04-19 17:34 ` Gene Heskett 2007-04-19 18:45 ` Willy Tarreau 2007-04-19 23:52 ` Jan Knutar 2 siblings, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-19 17:34 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Ingo Molnar wrote: >* Willy Tarreau <w@1wt.eu> wrote: >> You can certainly script it with -geometry. But it is the wrong >> application for this matter, because you benchmark X more than >> glxgears itself. What would be better is something like a line >> rotating 360 degrees and doing some short stuff between each degree, >> so that X is not much sollicitated, but the CPU would be spent more on >> the processes themselves. > >at least on my setup glxgears goes via DRI/DRM so there's no X >scheduling inbetween at all, and the visual appearance of glxgears is a >direct function of its scheduling. > > Ingo That doesn't appear to be the case here Ingo. Even when I know the rest of the system is lagged, glxgears continues to show very smooth and steady movement. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Yow! I just went below the poverty line! ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett @ 2007-04-19 18:45 ` Willy Tarreau 2007-04-21 10:31 ` Ingo Molnar 2007-04-19 23:52 ` Jan Knutar 2 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-19 18:45 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 05:18:03PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > You can certainly script it with -geometry. But it is the wrong > > application for this matter, because you benchmark X more than > > glxgears itself. What would be better is something like a line > > rotating 360 degrees and doing some short stuff between each degree, > > so that X is not much sollicitated, but the CPU would be spent more on > > the processes themselves. > > at least on my setup glxgears goes via DRI/DRM so there's no X > scheduling inbetween at all, and the visual appearance of glxgears is a > direct function of its scheduling. OK, I thought that something looking like a clock would be useful, especially if we could tune the amount of CPU spent per task instead of being limited by graphics drivers. I searched Freshmeat for a clock and found "orbitclock" by Jeremy Weatherford, which was exactly what I was looking for : - small - C only - X11 only - needed less than 5 minutes and no knowledge of X11 for the complete hack ! => Kudos to its author, sincerely ! I hacked it a bit to make it accept two parameters : -R <run_time_in_microsecond> : time spent burning CPU cycles at each round -S <sleep_time_in_microsecond> : time spent getting a rest It now advances what it thinks is a second at each iteration, so that it is easy to compare its progress with other instances (there are seconds, minutes and hours, so it's easy to visually count up to around 43200). The modified code is here : http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz What is interesting to note is that it's easy to make X work a lot (99%) by using 0 as the sleeping time, and it's easy to make the process work a lot by using large values for the running time associated with very low values (or 0) for the sleep time. Ah, and it supports -geometry ;-) It could become a useful scheduler benchmark ! Have fun ! Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
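For readers who do not want to fetch the tarball, the core of the -R/-S idea described above is just a busy-loop/sleep pair per iteration. The fragment below is a rough sketch of that pattern only; it is not the actual orbitclock/ocbench source, and the function names are invented for illustration.

/* Sketch of the -R (run) / -S (sleep) duty cycle; illustrative only. */
#include <sys/time.h>
#include <unistd.h>

static long long now_us(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

static void one_iteration(long run_us, long sleep_us)
{
	long long deadline = now_us() + run_us;

	while (now_us() < deadline)
		;			/* burn CPU for run_us microseconds */
	if (sleep_us > 0)
		usleep(sleep_us);	/* then rest for sleep_us microseconds */
	/* ...advance the displayed "second" and redraw here... */
}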
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 18:45 ` Willy Tarreau @ 2007-04-21 10:31 ` Ingo Molnar 2007-04-21 10:38 ` Ingo Molnar 2007-04-21 10:45 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 10:31 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Willy Tarreau <w@1wt.eu> wrote: > I hacked it a bit to make it accept two parameters : > -R <run_time_in_microsecond> : time spent burning CPU cycles at each round > -S <sleep_time_in_microsecond> : time spent getting a rest > > It now advances what it thinks is a second at each iteration, so that > it makes it easy to compare its progress with other instances (there > are seconds, minutes and hours, so it's easy to visually count up to > around 43200). > > The modified code is here : > > http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz > > What is interesting to note is that it's easy to make X work a lot > (99%) by using 0 as the sleeping time, and it's easy to make the > process work a lot by using large values for the running time > associated with very low values (or 0) for the sleep time. > > Ah, and it supports -geometry ;-) > > It could become a useful scheduler benchmark ! i just tried ocbench-0.3, and it is indeed very nice! Would it make sense perhaps to (optionally?) also log some sort of periodic text feedback to stdout, about the quality of scheduling? Maybe even a 'run this many seconds' option plus a summary text output at the end (which would output measured runtime, observed longest/smallest latency and standard deviation of latencies maybe)? That would make it directly usable both as a 'consistency of X app scheduling' visual test and as an easily shareable benchmark with an objective numeric result as well. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
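A rough idea of the numeric summary being asked for here: sleep for a fixed target interval, measure how late each wakeup actually was, and report min/max/average/standard deviation at the end. This is only one possible way to do it and is not part of ocbench; the interval and iteration count are arbitrary.

/* Sketch: measure wakeup latency over a number of iterations and print
 * a summary line. Illustrative only; build with -lm for sqrt(). */
#include <math.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	const long target_us = 10000;	/* ask for 10ms sleeps */
	const int iterations = 1000;
	double min = 1e9, max = 0.0, sum = 0.0, sumsq = 0.0;
	double late, mean, stddev;
	struct timeval t0, t1;
	int i;

	for (i = 0; i < iterations; i++) {
		gettimeofday(&t0, NULL);
		usleep(target_us);
		gettimeofday(&t1, NULL);

		late = (t1.tv_sec - t0.tv_sec) * 1e6 +
		       (t1.tv_usec - t0.tv_usec) - target_us;
		if (late < min) min = late;
		if (late > max) max = late;
		sum += late;
		sumsq += late * late;
	}
	mean = sum / iterations;
	stddev = sqrt(sumsq / iterations - mean * mean);
	printf("wakeup latency (us): min %.0f max %.0f avg %.1f stddev %.1f\n",
	       min, max, mean, stddev);
	return 0;
}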
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:31 ` Ingo Molnar @ 2007-04-21 10:38 ` Ingo Molnar 2007-04-21 10:45 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 10:38 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > The modified code is here : > > > > http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz > > > > What is interesting to note is that it's easy to make X work a lot > > (99%) by using 0 as the sleeping time, and it's easy to make the > > process work a lot by using large values for the running time > > associated with very low values (or 0) for the sleep time. > > > > Ah, and it supports -geometry ;-) > > > > It could become a useful scheduler benchmark ! > > i just tried ocbench-0.3, and it is indeed very nice! another thing i just noticed: when starting up lots of ocbench tasks (say -x 6 -y 6) then they (naturally) get started up with an already visible offset. It's nice to observe the startup behavior, but after that it would be useful if it were possible to 'resync' all those ocbench tasks so that they start at the same offset. [ Maybe a "killall -SIGUSR1 ocbench" could serve this purpose, without having to synchronize the tasks explicitly? ] Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
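The resync-on-SIGUSR1 idea suggested above needs very little on the ocbench side: the handler only sets a flag and the drawing loop resets its counters when it sees it. The sketch below is not Willy's actual v0.4 implementation; treating HOUR/MIN/SEC as plain counters that can simply be zeroed is an assumption.

/* Sketch of a SIGUSR1 "resync" hook; illustrative only. */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t resync_requested;

static void on_sigusr1(int sig)
{
	(void)sig;
	resync_requested = 1;	/* async-signal-safe: only set a flag */
}

static void install_resync_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigusr1;
	sigaction(SIGUSR1, &sa, NULL);
}

/* In the main loop, before each redraw:
 *
 *	if (resync_requested) {
 *		resync_requested = 0;
 *		HOUR = MIN = SEC = 0;	// restart this instance in phase
 *	}
 *
 * so that "killall -SIGUSR1 ocbench" realigns every window at once.
 */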
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:31 ` Ingo Molnar 2007-04-21 10:38 ` Ingo Molnar @ 2007-04-21 10:45 ` Ingo Molnar 2007-04-21 11:07 ` Willy Tarreau 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 10:45 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > It could become a useful scheduler benchmark ! > > i just tried ocbench-0.3, and it is indeed very nice! another thing i noticed: when using a -y larger then 1, then the window title (at least on Metacity) overlaps and thus the ocbench tasks have different X overhead and get scheduled a bit assymetrically as well. Is there any way to start them up title-less perhaps? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:45 ` Ingo Molnar @ 2007-04-21 11:07 ` Willy Tarreau 2007-04-21 11:29 ` Björn Steinbrink 0 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-21 11:07 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Hi Ingo, I'm replying to your 3 mails at once. On Sat, Apr 21, 2007 at 12:45:22PM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > It could become a useful scheduler benchmark ! > > > > i just tried ocbench-0.3, and it is indeed very nice! So as you've noticed just one minute after I put it there, I've updated the tool and renamed it ocbench. For others, it's here : http://linux.1wt.eu/sched/ The useful news is proper positioning, automatic forking, and more visible progress with smaller windows, which eat less of X resources. Now about your idea of making it report information on stdout, I don't know if it would be that useful. There are many other command line tools for this purpose. This one's goal is to eat CPU with a visual control of CPU distribution only. Concerning your idea of using a signal to resync every process, I agree with you. Running at 8x8 shows a noticeable offset. I've just uploaded v0.4 which supports your idea of sending USR1. > another thing i noticed: when using a -y larger then 1, then the window > title (at least on Metacity) overlaps and thus the ocbench tasks have > different X overhead and get scheduled a bit assymetrically as well. Is > there any way to start them up title-less perhaps? It has annoyed me a bit too, but I'm no X developer at all, so I don't know at all if it's possible nor how to do this. I know that my window manager even adds title bars to xeyes, so I'm not sure we can do this. Right now, I've added a "-B <border size>" argument so that you can skip the size of your title bar. It's dirty but it's not my main job :-) Thanks for your feedback Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 11:07 ` Willy Tarreau @ 2007-04-21 11:29 ` Björn Steinbrink 2007-04-21 11:51 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Björn Steinbrink @ 2007-04-21 11:29 UTC (permalink / raw) To: Willy Tarreau; +Cc: Ingo Molnar, linux-kernel Hi, On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote: > > another thing i noticed: when using a -y larger then 1, then the window > > title (at least on Metacity) overlaps and thus the ocbench tasks have > > different X overhead and get scheduled a bit assymetrically as well. Is > > there any way to start them up title-less perhaps? > > It has annoyed me a bit too, but I'm no X developer at all, so I don't > know at all if it's possible nor how to do this. I know that my window > manager even adds title bars to xeyes, so I'm not sure we can do this. > > Right now, I've added a "-B <border size>" argument so that you can > skip the size of your title bar. It's dirty but it's not my main job :-) Here's a small patch that makes the windows unmanaged, which also causes ocbench to start up quite a bit faster on my box with larger number of windows, so it probably avoids some window manager overhead, which is a nice side-effect. Björn -- diff -u ocbench-0.4/ocbench.c ocbench-0.4.1/ocbench.c --- ocbench-0.4/ocbench.c 2007-04-21 13:05:55.000000000 +0200 +++ ocbench-0.4.1/ocbench.c 2007-04-21 13:24:01.000000000 +0200 @@ -213,6 +213,7 @@ int main(int argc, char *argv[]) { Window root; XGCValues gc_setup; + XSetWindowAttributes swa; int c, index, proc_x, proc_y, pid; int *pcount[] = {&HOUR, &MIN, &SEC}; char *p, *q; @@ -342,8 +343,11 @@ alloc_color(fg, &orange); alloc_color(fg2, &blue); - win = XCreateSimpleWindow(dpy, root, X, Y, width, height, 0, - black.pixel, black.pixel); + swa.override_redirect = 1; + + win = XCreateWindow(dpy, root, X, Y, width, height, 0, + CopyFromParent, InputOutput, CopyFromParent, + CWOverrideRedirect, &swa); XStoreName(dpy, win, "ocbench"); XSelectInput(dpy, win, ExposureMask | StructureNotifyMask); Only in ocbench-0.4.1/: .README.swp ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 11:29 ` Björn Steinbrink @ 2007-04-21 11:51 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-21 11:51 UTC (permalink / raw) To: Björn Steinbrink, Ingo Molnar, linux-kernel Hi Björn, On Sat, Apr 21, 2007 at 01:29:41PM +0200, Björn Steinbrink wrote: > Hi, > > On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote: > > > another thing i noticed: when using a -y larger then 1, then the window > > > title (at least on Metacity) overlaps and thus the ocbench tasks have > > > different X overhead and get scheduled a bit assymetrically as well. Is > > > there any way to start them up title-less perhaps? > > > > It has annoyed me a bit too, but I'm no X developer at all, so I don't > > know at all if it's possible nor how to do this. I know that my window > > manager even adds title bars to xeyes, so I'm not sure we can do this. > > > > Right now, I've added a "-B <border size>" argument so that you can > > skip the size of your title bar. It's dirty but it's not my main job :-) > > Here's a small patch that makes the windows unmanaged, which also causes > ocbench to start up quite a bit faster on my box with larger number of > windows, so it probably avoids some window manager overhead, which is a > nice side-effect. Excellent ! I've just merged it but made it conditional on a "-u" argument so that we can keep the previous behaviour (moving the windows is useful especially when there are few of them). So the new version 0.5 is available there : http://linux.1wt.eu/sched/ I believe it's the last one for today as I'm late on some work. Thanks ! Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett 2007-04-19 18:45 ` Willy Tarreau @ 2007-04-19 23:52 ` Jan Knutar 2007-04-20 5:05 ` Willy Tarreau 2 siblings, 1 reply; 304+ messages in thread From: Jan Knutar @ 2007-04-19 23:52 UTC (permalink / raw) To: linux-kernel Cc: Ingo Molnar, Willy Tarreau, Nick Piggin, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007 18:18, Ingo Molnar wrote: > * Willy Tarreau <w@1wt.eu> wrote: > > You can certainly script it with -geometry. But it is the wrong > > application for this matter, because you benchmark X more than > > glxgears itself. What would be better is something like a line > > rotating 360 degrees and doing some short stuff between each > > degree, so that X is not much sollicitated, but the CPU would be > > spent more on the processes themselves. > > at least on my setup glxgears goes via DRI/DRM so there's no X > scheduling inbetween at all, and the visual appearance of glxgears is > a direct function of its scheduling. How much of the subjective interactiveness-feel of the desktop is at the mercy of the X server's scheduling and not the cpu scheduler? I've noticed that video playback is significantly smoother and resistant to other load, when using MPlayer's opengl output, especially if "heavy" programs are running at the same time. Especially firefox and ksysguard seem to have found a way to cause video through Xv to look annoyingly jittery. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 23:52 ` Jan Knutar @ 2007-04-20 5:05 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-20 5:05 UTC (permalink / raw) To: Jan Knutar Cc: linux-kernel, Ingo Molnar, Nick Piggin, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 20, 2007 at 02:52:38AM +0300, Jan Knutar wrote: > On Thursday 19 April 2007 18:18, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > You can certainly script it with -geometry. But it is the wrong > > > application for this matter, because you benchmark X more than > > > glxgears itself. What would be better is something like a line > > > rotating 360 degrees and doing some short stuff between each > > > degree, so that X is not much sollicitated, but the CPU would be > > > spent more on the processes themselves. > > > > at least on my setup glxgears goes via DRI/DRM so there's no X > > scheduling inbetween at all, and the visual appearance of glxgears is > > a direct function of its scheduling. > > How much of the subjective interactiveness-feel of the desktop is at the > mercy of the X server's scheduling and not the cpu scheduler? probably a lot. Hence the reason why I wanted something visually noticeable but using far less X resources than glxgears. The modified orbitclock is perfect IMHO. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 9:01 ` Ingo Molnar 2007-04-19 12:54 ` Willy Tarreau @ 2007-04-19 17:32 ` Gene Heskett 1 sibling, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-19 17:32 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Ingo Molnar wrote: >* Willy Tarreau <w@1wt.eu> wrote: >> Good idea. The machine I'm typing from now has 1000 scheddos running >> at +19, and 12 gears at nice 0. [...] >> >> From time to time, one of the 12 aligned gears will quickly perform a >> full quarter of round while others slowly turn by a few degrees. In >> fact, while I don't know this process's CPU usage pattern, there's >> something useful in it : it allows me to visually see when process >> accelerate/decelerate. [...] > >cool idea - i have just tried this and it rocks - you can easily see the >'nature' of CPU time distribution just via visual feedback. (Is there >any easy way to start up 12 glxgears fully aligned, or does one always >have to mouse around to get them into proper position?) > >btw., i am using another method to quickly judge X's behavior: i started >the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth >opengl-rendered snow fall on the desktop background. That gives me an >idea about how well X is scheduling under various workloads, without >having to instrument it explicitly. > yes, its a cute idea, till you switch away from that screen to check progress on something else, like to compose this message. =========== 5913 frames in 5.0 seconds = 1182.499 FPS 6238 frames in 5.0 seconds = 1247.556 FPS 11380 frames in 5.0 seconds = 2275.905 FPS 10691 frames in 5.0 seconds = 2138.173 FPS 8707 frames in 5.0 seconds = 1741.305 FPS 10669 frames in 5.0 seconds = 2133.708 FPS 11392 frames in 5.0 seconds = 2278.037 FPS 11379 frames in 5.0 seconds = 2275.711 FPS 11310 frames in 5.0 seconds = 2261.861 FPS 11386 frames in 5.0 seconds = 2277.081 FPS 11292 frames in 5.0 seconds = 2258.353 FPS 11352 frames in 5.0 seconds = 2270.297 FPS 11415 frames in 5.0 seconds = 2282.886 FPS 11406 frames in 5.0 seconds = 2281.037 FPS 11483 frames in 5.0 seconds = 2296.533 FPS 11510 frames in 5.0 seconds = 2301.883 FPS 11123 frames in 5.0 seconds = 2224.266 FPS 8980 frames in 5.0 seconds = 1795.861 FPS ======= The over 2000fps reports were while I was either looking at htop, or starting this message, both on different screens. htop said it was using 95+ % of the cpu even when its display was going to /dev/null. So 'Kewl' doesn't seem to get us apples to apples numbers we can go to the window and bet win-place-show based on them alone. FWIW, running the nvidia-9755 drivers here. So if we are going to use that as a judgement operator, it obviously needs some intelligently applied scaling before they are worth more than a subjective feel is. > Ingo >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The confusion of a staff member is measured by the length of his memos. -- New York Times, Jan. 
20, 1981 ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 13:01 ` Willy Tarreau @ 2007-04-14 15:17 ` Mark Lord 1 sibling, 0 replies; 304+ messages in thread From: Mark Lord @ 2007-04-14 15:17 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > i kept the 50%/50% rule from the old scheduler, but maybe it's a more > pristine (and smaller/faster) approach to just not give new children any > stats history to begin with. I've implemented an add-on patch that > implements this, you can find it at: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch I've been running my desktop (single-core Pentium-M w/2GB RAM, Kubuntu Dapper) with the new CFS for much of this morning now, with the odd switch back to the stock scheduler for comparison. Here, CFS really works and feels better than the stock scheduler. Even with a "make -j2" kernel rebuild happening (no manual renice, either!) things "just work" about as smoothly as ever. That's something which RSDL never achieved for me, though I have not retested RSDL beyond v0.34 or so. Well done, Ingo! I *want* this as my default scheduler. Things seemed slightly less smooth when I had the CPU hogs and fair-fork extension patches both applied. I'm going to try again now with just the fair-fork added on. Cheers Mark ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:53 ` Ingo Molnar @ 2007-04-14 19:48 ` William Lee Irwin III 2007-04-14 20:12 ` Willy Tarreau 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-14 19:48 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote: > Forking becomes very slow above a load of 100 it seems. Sometimes, > the shell takes 2 or 3 seconds to return to prompt after I run > "scheddos &" > Those are very promising results, I nearly observe the same responsiveness > as I had on a solaris 10 with 10k running processes on a bigger machine. > I would be curious what a mysql test result would look like now. Where is scheddos? -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 19:48 ` William Lee Irwin III @ 2007-04-14 20:12 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 20:12 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:48:55PM -0700, William Lee Irwin III wrote: > On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote: > > Forking becomes very slow above a load of 100 it seems. Sometimes, > > the shell takes 2 or 3 seconds to return to prompt after I run > > "scheddos &" > > Those are very promising results, I nearly observe the same responsiveness > > as I had on a solaris 10 with 10k running processes on a bigger machine. > > I would be curious what a mysql test result would look like now. > > Where is scheddos? I will send it to you off-list. I've been avoiding publishing it for a long time because the stock scheduler was *very* sensitive to trivial attacks (freezes longer than 30s, impossible to log in). It's very basic, and I have no problem sending it to anyone who requests it, it's just that as long as some distros ship early 2.6 kernels I do not want it to appear on mailing list archives for anyone to grab it and annoy their admins for free. Cheers, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:08 ` Willy Tarreau 2007-04-14 8:36 ` Willy Tarreau @ 2007-04-14 10:36 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 10:36 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > > this fix is not complete - because the child runqueue is locked > > here, not the parent's. I've fixed this properly in my tree and have > > uploaded a new sched-modular+cfs.patch. (the effects of the original > > bug are mostly harmless, the rbtree position gets corrected the > > first time the parent reschedules. The fix might improve heavy > > forker handling.) > > It looks like it did not reach your public dir yet. oops, forgot to do the last step - should be fixed now. > BTW, I've given it a try. It seems pretty usable. I have also tried > the usual meaningless "glxgears" test with 12 of them at the same > time, and they rotate very smoothly, there is absolutely no pause in > any of them. But they don't all run at same speed, and top reports > their CPU load varying from 3.4 to 10.8%, with what looks like more > CPU is assigned to the first processes, and less CPU for the last > ones. But this is just a rough observation on a stupid test, I would > not call that one scientific in any way (and X has its share in the > test too). ok, i'll try that too - there should be nothing particularly special about glxgears. there's another tweak you could try: echo 500000 > /proc/sys/kernel/sched_granularity_ns note that this causes preemption to be done as fast as the scheduler can do it. (in practice it will be mainly driven by CONFIG_HZ, so to get the best results a CONFIG_HZ of 1000 is useful.) plus there's an add-on to CFS at: http://redhat.com/~mingo/cfs-scheduler/sched-fair-hog.patch this makes the 'CPU usage history cutoff' configurable and sets it to a default of 100 msecs. This means that CPU hogs (tasks which actively kept other tasks from running) will be remembered, for up to 100 msecs of their 'hogness'. Setting this limit back to 0 gives the 'vanilla' CFS scheduler's behavior: echo 0 > /proc/sys/kernel/sched_max_hog_history_ns (So when trying this you dont have to reboot with this patch applied/unapplied, just set this value.) > I'll perform other tests when I can rebuild with your fixed patch. cool, thanks! Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
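For completeness, the two tunables mentioned in the mail above can also be set from a program rather than with echo. A small hedged sketch follows: the /proc paths come straight from the mails, everything else (helper name, error handling, the chosen values) is illustration only.

/* Sketch: write a CFS tunable from C instead of using 'echo'. */
#include <stdio.h>

static int write_sysctl(const char *path, long long val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%lld\n", val);
	return fclose(f);
}

int main(void)
{
	/* low-latency 'desktop' preemption granularity, as suggested above */
	write_sysctl("/proc/sys/kernel/sched_granularity_ns", 500000LL);
	/* 0 = vanilla CFS behaviour for the hog-history add-on patch */
	write_sysctl("/proc/sys/kernel/sched_max_hog_history_ns", 0LL);
	return 0;
}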
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (7 preceding siblings ...) 2007-04-14 2:04 ` Nick Piggin @ 2007-04-14 15:09 ` S.Çağlar Onur 2007-04-14 16:09 ` Ingo Molnar 2007-04-15 3:27 ` Con Kolivas ` (4 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: S.Çağlar Onur @ 2007-04-14 15:09 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 1018 bytes --] 13 Nis 2007 Cum tarihinde, Ingo Molnar şunları yazmıştı: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: Currently im using Linus's current git + your extra patches + CFS for a while. Kaffeine constantly freezes (and uses %80+ CPU time) [1] if i seek forward/backward while its playing a video with some workload (checking out SVN repositories, compiling something). Stopping other process didn't help kaffeine so it stays freezed stated until i kill it. I'm not sure whether its a xine-lib or kaffeine bug (cause mplayer didn't have that problem) but i can't reproduce this with mainline or mainline + sd-0.39. [1] http://cekirdek.pardus.org.tr/~caglar/psaux -- S.Çağlar Onur <caglar@pardus.org.tr> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 15:09 ` S.Çağlar Onur @ 2007-04-14 16:09 ` Ingo Molnar 2007-04-14 16:59 ` S.Çağlar Onur 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 16:09 UTC (permalink / raw) To: S.Çağlar Onur Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * S.Çağlar Onur <caglar@pardus.org.tr> wrote: > > i'm pleased to announce the first release of the "Modular Scheduler > > Core and Completely Fair Scheduler [CFS]" patchset: > > Currently im using Linus's current git + your extra patches + CFS for > a while. Kaffeine constantly freezes (and uses %80+ CPU time) [1] if i > seek forward/backward while its playing a video with some workload > (checking out SVN repositories, compiling something). Stopping other > process didn't help kaffeine so it stays freezed stated until i kill > it. hm, could you try to strace it and/or attach gdb to it and figure out what's wrong? (perhaps involving the Kaffeine developers too?) As long as it's not a kernel level crash i cannot see how the scheduler could directly cause this - other than by accident creating a scheduling pattern that triggers a user-space bug more often than with other schedulers. > [1] http://cekirdek.pardus.org.tr/~caglar/psaux looks quite weird! Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 16:09 ` Ingo Molnar @ 2007-04-14 16:59 ` S.Çağlar Onur 0 siblings, 0 replies; 304+ messages in thread From: S.Çağlar Onur @ 2007-04-14 16:59 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 1180 bytes --] 14 Nis 2007 Cts tarihinde, Ingo Molnar şunları yazmıştı: > hm, could you try to strace it and/or attach gdb to it and figure out > what's wrong? (perhaps involving the Kaffeine developers too?) As long > as it's not a kernel level crash i cannot see how the scheduler could > directly cause this - other than by accident creating a scheduling > pattern that triggers a user-space bug more often than with other > schedulers. ... futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call) --- SIGINT (Interrupt) @ 0 (0) --- +++ killed by SIGINT +++ is where freeze occurs. Full log can be found at [1] > > [1] http://cekirdek.pardus.org.tr/~caglar/psaux > > looks quite weird! :) [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine -- S.Çağlar Onur <caglar@pardus.org.tr> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (8 preceding siblings ...) 2007-04-14 15:09 ` S.Çağlar Onur @ 2007-04-15 3:27 ` Con Kolivas 2007-04-15 5:16 ` Bill Huey ` (2 more replies) 2007-04-15 12:29 ` Esben Nielsen ` (3 subsequent siblings) 13 siblings, 3 replies; 304+ messages in thread From: Con Kolivas @ 2007-04-15 3:27 UTC (permalink / raw) To: Ingo Molnar, ck list, Peter Williams, Bill Huey Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Saturday 14 April 2007 06:21, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. The casual observer will be completely confused by what on earth has happened here so let me try to demystify things for them. 1. I tried in vain some time ago to push a working extensable pluggable cpu scheduler framework (based on wli's work) for the linux kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he didn't like it) as being absolutely the wrong approach and that we should never do that. Oddly enough the linux-kernel-mailing list was -dead- at the time and the discussion did not make it to the mailing list. Every time I've tried to forward it to the mailing list the spam filter decided to drop it so most people have not even seen this original veto-forever discussion. 2. Since then I've been thinking/working on a cpu scheduler design that takes away all the guesswork out of scheduling and gives very predictable, as fair as possible, cpu distribution and latency while preserving as solid interactivity as possible within those confines. For weeks now, Ingo has said that the interactivity regressions were showstoppers and we should address them, never mind the fact that the so-called regressions were purely "it slows down linearly with load" which to me is perfectly desirable behaviour. While this was not perma-vetoed, I predicted pretty accurately your intent was to veto it based on this. People kept claiming scheduling problems were few and far between but what was really happening is users were terrified of lkml and instead used 1. windows and 2. 2.4 kernels. The problems were there. So where are we now? Here is where your latest patch comes in. As a solution to the many scheduling problems we finally all agree exist, you propose a patch that adds 1. a limited pluggable framework and 2. a fairness based cpu scheduler policy... o_O So I should be happy at last now that the things I was promoting you are also promoting, right? Well I'll fill in the rest of the gaps and let other people decide how I should feel. > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, In the last 4 weeks I've spent time lying in bed drugged to the eyeballs and having trips in and out of hospitals for my condition. I appreciate greatly the sympathy and patience from people in this regard. 
However at one stage I virtually begged for support with my attempts and help with the code. Dmitry Adamushko is the only person who actually helped me with the code in the interim, while others poked sticks at it. Sure the sticks helped at times but the sticks always seemed to have their ends kerosene doused and flaming for reasons I still don't get. No other help was forthcoming. Now that you're agreeing my direction was correct you've done the usual Linux kernel thing - ignore all my previous code and write your own version. Oh well, that I've come to expect; at least you get a copyright notice in the bootup and somewhere in the comments give me credit for proving it's possible. Let's give some other credit here too. William Lee Irwin provided the major architecture behind plugsched at my request and I simply finished the work and got it working. He is also responsible for many IRC discussions I've had about cpu scheduling fairness, designs, programming history and code help. Even though he did not contribute code directly to SD, his comments have been invaluable. So let's look at the code. kernel/sched.c kernel/sched_fair.c kernel/sched_rt.c It turns out this is not a pluggable cpu scheduler framework at all, and I guess you didn't really promote it as such. It's a "modular scheduler core". Which means you moved code from sched.c into sched_fair.c and sched_rt.c. This abstracts out each _scheduling policy's_ functions into struct sched_class and allows each scheduling policy's functions to be in a separate file etc. Ok so what it means is that instead of whole cpu schedulers being able to be plugged into this framework we can plug in only cpu scheduling policies.... hrm... So let's look on -#define SCHED_NORMAL 0 Ok once upon a time we rename SCHED_OTHER which every other unix calls the standard policy 99.9% of applications used into a more meaningful name, SCHED_NORMAL. That's fine since all it did was change the description internally for those reading the code. Let's see what you've done now: +#define SCHED_FAIR 0 You've renamed it again. This is, I don't know what exactly to call it, but an interesting way of making it look like there is now more choice. Well, whatever you call it, everything in linux spawned from init without specifying a policy still gets policy 0. This is SCHED_OTHER still, renamed SCHED_NORMAL and now SCHED_FAIR. You encouraged me to create a sched_sd.c to add onto your design as well. Well, what do I do with that? I need to create another scheduling policy for that code to even be used. A separate scheduling policy requires a userspace change to even benefit from it. Even if I make that sched_sd.c patch, people cannot use SD as their default scheduler unless they hack SCHED_FAIR 0 to read SCHED_SD 0 or similar. The same goes for original staircase cpusched, nicksched, zaphod, spa_ws, ebs and so on. So what you've achieved with your patch is - replaced the current scheduler with another one and moved it into another file. There is no choice, and no pluggability, just code trumping. Do I support this? In this form.... no. It's not that I don't like your new scheduler. Heck it's beautiful like most of your _serious_ code. It even comes with a catchy name that's bound to give people hard-ons (even though many schedulers aim to be completely fair, yours has been named that for maximum selling power). 
The complaint I have is that you are not providing quite what you advertise (on the modular front), or perhaps you're advertising it as such to make it look more appealing; I'm not sure. Since we'll just end up with your code, don't pretend SCHED_NORMAL is anything different, and that this is anything other than your NIH (Not Invented Here) cpu scheduling policy rewrite which will probably end up taking its position in mainline after yet another truckload of regression/performance tests and so on. I haven't seen an awful lot of comparisons with SD yet, just people jumping on your bandwagon which is fine I guess. Maybe a few tiny tests showing less than 5% variation in their fairness from what I can see. Either way, I already feel you've killed off SD... like pretty much everything else I've done lately. At least I no longer have to try and support my code mostly by myself. In the interest of putting aside any ego concerns since this is about linux and not me... Because... You are a hair's breadth away from producing something that I would support, which _does_ do what you say and produces the pluggability we're all begging for with only tiny changes to the code you've already done. Make Kconfig let you choose which sched_*.c gets built into the kernel, and make SCHED_OTHER choose which SCHED_* gets chosen as the default from Kconfig and even choose one of the alternative built in ones with boot parameters - your code has more clout than mine will (ie do exactly what plugsched does). Then we can have 7 schedulers in the linux kernel within a few weeks. Oh no! This is the very thing Linus didn't want in specialisation with the cpu schedulers! Does this mean this idea will be vetoed yet again? In all likelihood, yes. I guess I have lots to put into -ck still... sigh. > Ingo -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
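To make the "modular scheduler core" part of this exchange concrete for readers who have not read the patch: the idea under discussion is that sched.c dispatches through a per-policy ops table instead of hard-coding one policy. The structure below is a heavily simplified illustration of that shape; it is not the actual struct sched_class from Ingo's patch, and the field names and signatures are invented for the sketch.

/* Heavily simplified illustration of a per-policy ops table; the real
 * struct sched_class in the patch differs in names and detail. */
struct task;	/* stand-in for struct task_struct */
struct rq;	/* stand-in for the per-CPU runqueue */

struct sched_policy_ops {
	void		(*enqueue_task)(struct rq *rq, struct task *p);
	void		(*dequeue_task)(struct rq *rq, struct task *p);
	struct task	*(*pick_next_task)(struct rq *rq);
	void		(*task_tick)(struct rq *rq, struct task *p);
};

/* sched_fair.c and sched_rt.c would each provide one such table; the
 * core picks a table from the task's policy and never reaches into the
 * policy's own data structures. */
extern const struct sched_policy_ops fair_policy_ops;
extern const struct sched_policy_ops rt_policy_ops;

static inline const struct sched_policy_ops *ops_of(int policy)
{
	return policy == 0 /* SCHED_FAIR */ ? &fair_policy_ops
					    : &rt_policy_ops;
}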
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` Con Kolivas @ 2007-04-15 5:16 ` Bill Huey 2007-04-15 8:44 ` Ingo Molnar 2007-04-15 16:11 ` Bernd Eckenfels 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 15:05 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Bill Huey @ 2007-04-15 5:16 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote: ... > Now that you're agreeing my direction was correct you've done the usual Linux > kernel thing - ignore all my previous code and write your own version. Oh > well, that I've come to expect; at least you get a copyright notice in the > bootup and somewhere in the comments give me credit for proving it's > possible. Let's give some other credit here too. William Lee Irwin provided > the major architecture behind plugsched at my request and I simply finished > the work and got it working. He is also responsible for many IRC discussions > I've had about cpu scheduling fairness, designs, programming history and code > help. Even though he did not contribute code directly to SD, his comments > have been invaluable. Hello folks, I think the main failure I see here is that Con wasn't included in this design or privately in review process. There could have been better co-ownership of the code. This could also have been done openly on lkml (since this is kind of what this medium is about to significant degree) so that consensus can happen (Con can be reasoned with). It would have achieved the same thing but probably more smoothly if folks just listened, considered an idea and then, in this case, created something that would allow for experimentation from outsiders in a fluid fashion. If these issues aren't fixed, you're going to stuck with the same kind of creeping elitism that has gradually killed the FreeBSD project and other BSDs. I can't comment on the code implementation. I'm focus on other things now that I'm at NetApp and I can't help out as much as I could. Being former BSDi, I had a first hand account of these issues as they played out. A development process like this is likely to exclude smart people from wanting to contribute to Linux and folks should be conscious about this issues. It's basically a lot of code and concept that at least two individuals have worked on (wli and con) only to have it be rejected and then sudden replaced by code from a community gatekeeper. In this case, this results in both Con and Bill Irwin being woefully under utilized. If I were one of these people. I'd be mighty pissed. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 5:16 ` Bill Huey @ 2007-04-15 8:44 ` Ingo Molnar 2007-04-15 9:51 ` Bill Huey 2007-04-15 16:11 ` Bernd Eckenfels 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 8:44 UTC (permalink / raw) To: Bill Huey Cc: Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > Hello folks, > > I think the main failure I see here is that Con wasn't included in > this design or privately in review process. There could have been > better co-ownership of the code. This could also have been done openly > on lkml [...] Bill, you come from a BSD background and you are still relatively new to Linux development, so i dont at all fault you for misunderstanding this situation, and fortunately i have a really easy resolution for your worries: i did exactly that! :) i wrote the first line of code of the CFS patch this week, 8am Wednesday morning, and released it to lkml 62 hours later, 22pm on Friday. (I've listed the file timestamps of my backup patches further below, for all the fine details.) I prefer such early releases to lkml _alot_ more than any private review process. I released the CFS code about 6 hours after i thought "okay, this looks pretty good" and i spent those final 6 hours on testing it (making sure it doesnt blow up on your box, etc.), in the final 2 hours i showed it to two folks i could reach on IRC (Arjan and Thomas) and on various finishing touches. It doesnt get much faster than that and i definitely didnt want to sit on it even one day longer because i very much thought that Con and others should definitely see this work! And i very much credited (and still credit) Con for the whole fairness angle: || i'd like to give credit to Con Kolivas for the general approach here: || he has proven via RSDL/SD that 'fair scheduling' is possible and that || it results in better desktop scheduling. Kudos Con! the 'design consultation' phase you are talking about is _NOW_! :) I got the v1 code out to Con, to Mike and to many others ASAP. That's how you are able to comment on this thread and be part of the development process to begin with, in a 'private consultation' setup you'd not have had any opportunity to see _any_ of this. In the BSD space there seem to be more 'political' mechanisms for development, but Linux is truly about doing things out in the open, and doing it immediately. Okay? 
;-) Here's the timestamps of all my backups of the patch, from its humble 4K beginnings to the 100K first-cut v1 result: -rw-rw-r-- 1 mingo mingo 4230 Apr 11 08:47 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 7653 Apr 11 09:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 7728 Apr 11 09:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 14416 Apr 11 10:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 24211 Apr 11 10:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 27878 Apr 11 10:45 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 33807 Apr 11 11:05 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 34524 Apr 11 11:09 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 39650 Apr 11 11:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40231 Apr 11 11:34 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40627 Apr 11 11:48 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40638 Apr 11 11:54 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 42733 Apr 11 12:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 42817 Apr 11 12:31 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 43270 Apr 11 12:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 43531 Apr 11 12:48 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 44331 Apr 11 12:51 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45173 Apr 11 12:56 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45288 Apr 11 12:59 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45368 Apr 11 13:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45370 Apr 11 13:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45815 Apr 11 13:14 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45887 Apr 11 13:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45914 Apr 11 13:25 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45850 Apr 11 13:29 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 49196 Apr 11 13:39 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 64317 Apr 11 13:45 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 64403 Apr 11 13:52 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:03 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:07 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 68995 Apr 11 14:50 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 69919 Apr 11 15:23 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71065 Apr 11 16:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 70642 Apr 11 16:28 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 72334 Apr 11 16:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71624 Apr 11 17:01 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71854 Apr 11 17:20 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 73571 Apr 11 17:42 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:51 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 75144 Apr 11 17:57 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80722 Apr 11 18:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:59 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 89356 Apr 11 21:32 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 95278 Apr 12 08:36 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97749 Apr 12 10:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97687 Apr 12 10:58 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97722 Apr 12 11:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97933 Apr 12 11:22 
patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:09 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 100405 Apr 12 12:29 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 100380 Apr 12 12:50 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101631 Apr 12 13:32 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102293 Apr 12 14:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102431 Apr 12 14:28 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102502 Apr 12 14:53 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102128 Apr 13 11:13 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102473 Apr 13 12:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102536 Apr 13 12:24 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102481 Apr 13 12:30 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103408 Apr 13 13:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103441 Apr 13 13:31 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104759 Apr 13 14:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104815 Apr 13 14:39 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104762 Apr 13 15:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 105978 Apr 13 16:18 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 105977 Apr 13 16:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 106761 Apr 13 17:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 106358 Apr 13 17:40 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 107802 Apr 13 19:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104427 Apr 13 19:35 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103927 Apr 13 19:40 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101867 Apr 13 20:30 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101011 Apr 13 21:05 patches/sched-fair.patch i hope this helps :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:44 ` Ingo Molnar @ 2007-04-15 9:51 ` Bill Huey 2007-04-15 10:39 ` Pekka Enberg 0 siblings, 1 reply; 304+ messages in thread From: Bill Huey @ 2007-04-15 9:51 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote: > I prefer such early releases to lkml _alot_ more than any private review > process. I released the CFS code about 6 hours after i thought "okay, > this looks pretty good" and i spent those final 6 hours on testing it > (making sure it doesnt blow up on your box, etc.), in the final 2 hours > i showed it to two folks i could reach on IRC (Arjan and Thomas) and on > various finishing touches. It doesnt get much faster than that and i > definitely didnt want to sit on it even one day longer because i very > much thought that Con and others should definitely see this work! > > And i very much credited (and still credit) Con for the whole fairness > angle: > > || i'd like to give credit to Con Kolivas for the general approach here: > || he has proven via RSDL/SD that 'fair scheduling' is possible and that > || it results in better desktop scheduling. Kudos Con! > > the 'design consultation' phase you are talking about is _NOW_! :) > > I got the v1 code out to Con, to Mike and to many others ASAP. That's > how you are able to comment on this thread and be part of the > development process to begin with, in a 'private consultation' setup > you'd not have had any opportunity to see _any_ of this. > > In the BSD space there seem to be more 'political' mechanisms for > development, but Linux is truly about doing things out in the open, and > doing it immediately. I can't even begin to talk about how screwed up BSD development is. Maybe another time privately. Ok, Linux development and inclusiveness can be improved. I'm not trying to "call you out" (slang for accusing you with the sole intention to call you crazy in a highly confrontative manner). This is discussed publically here to bring this issue to light, open a communication channel as a means to resolve it. > Okay? ;-) It's cool. We're still getting to know each other professionally and it's okay to a certain degree to have a communication disconnect but only as long as it clears. Your productivity is amazing BTW. But here's the problem, there's this perception that NIH is the default mentality here in Linux. Con feels that this kind of action is intentional and has a malicious quality to it as means of "churn squating" sections of the kernel tree. The perception here is that there is that there is this expectation that sections of the Linux kernel are intentionally "churn squated" to prevent any other ideas from creeping in other than of the owner of that subsytem (VM, scheduling, etc...) because of lack of modularity in the kernel. This isn't an API question but a question possibly general code quality and how maintenance () of it can . This was predicted by folks and then this perception was *realized* when you wrote the equivalent kind of code that has technical overlap with SDL (this is just one dry example). To a person that is writing new code for Linux, having one of the old guards write equivalent code to that of a newcomer has the effect of displacing that person both with regards to code and responsibility with that. 
When this happens over and over again and folks get annoyed by it, it starts seeming that Linux development is elitist. I know this because I heard (read) Con's IRC chats about these matters all of the time. This is not just his view but a view held by other kernel folks as well. The closing talk at OLS 2006 was highly disturbing in many ways. It went "Christoph is right, everybody else is wrong", which sends a highly negative message to new kernel developers that, say, don't work for RH directly or any of the other mainstream Linux companies. After a while, it starts seeming like this kind of behavior is completely intentional and that Linux is full of arrogant bastards. What I would have done here was to contact Peter Williams, Bill Irwin and Con about what you're doing and reach a common consensus about how to create something that would be inclusive of all of their ideas. Discussions can get technically heated but that's ok, the discussion is happening and it brings down the wall of this perception. Bill and Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra. It might be very useful, it might not be. Folks are all stubborn about their ideas and hold on to them for dear life. Effective leaders can deconstruct this hostility and animosity. I don't claim to be one. Because of past hostility to something like schedplugin, the hostility and terseness of responses can be perceived simply as "I'm right, you're wrong", which is condescending. This affects discussion and outright destroys a constructive process if it happens continually, since it reinforces that view of "You're an outsider, we don't care about you". Nobody is listening to each other at that point, folks get pissed. Then they think "I'm going to NIH this person with patch X because he/she did the same here", which is dysfunctional. Oddly enough, sometimes you're the best person to get a new idea into the tree. What's not happening here is communication. That takes sensitivity, careful listening (which is a difficult skill), and then an understanding of the characters involved to unify creative energies. That's a very difficult thing to do for folks that are used to working solo. It takes time to develop trust in those relationships so that a true collaboration can happen. I know that there is a lot of creativity in folks like Con and Bill. It would be wise to develop a dialog with them to see if they can offload some of your work for you (we all know you're really busy) yet have you be a key facilitator of their and your ideas. That's a really tough thing to do and it requires practice. Just imagine (assuming they can follow through) what could have positively happened if their collective knowledge was leveraged better. It's not all clear and rosy, but I think these people are more on your side than you might realize and it might be a good thing to discover that. This is tough because I know the personalities involved and I know kind of how people function and malfunction in this discussion on a personal basis. [We can continue privately. This is not just about you but applicable to open source development in general] The tone of this email is intellectually critical (not meant as a personality attack) and calm. If I'm otherwise, then I'm a bastard. :) bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 9:51 ` Bill Huey @ 2007-04-15 10:39 ` Pekka Enberg 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 15:16 ` Gene Heskett 0 siblings, 2 replies; 304+ messages in thread From: Pekka Enberg @ 2007-04-15 10:39 UTC (permalink / raw) To: hui Bill Huey Cc: Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > The perception here is that there is an expectation that sections > of the Linux kernel are intentionally "churn squatted" to prevent > any other ideas from creeping in other than those of the owner of that subsystem Strangely enough, my perception is that Ingo is simply trying to address the issues Mike's testing discovered in RSDL and SD. It's not surprising Ingo made it a separate patch set as Con has repeatedly stated that the "problems" are in fact by design and won't be fixed. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 10:39 ` Pekka Enberg @ 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg ` (2 more replies) 2007-04-15 15:16 ` Gene Heskett 1 sibling, 3 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-15 12:45 UTC (permalink / raw) To: Pekka Enberg Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 01:39:27PM +0300, Pekka Enberg wrote: > On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > >The perception here is that there is an expectation that sections > >of the Linux kernel are intentionally "churn squatted" to prevent > >any other ideas from creeping in other than those of the owner of that subsystem > > Strangely enough, my perception is that Ingo is simply trying to > address the issues Mike's testing discovered in RSDL and SD. It's not > surprising Ingo made it a separate patch set as Con has repeatedly > stated that the "problems" are in fact by design and won't be fixed. That's not exactly the problem. There are people who work very hard to try to improve some areas of the kernel. They progress slowly, and acquire more and more skills. Sometimes they feel like they need to change some concepts and propose those changes which are required for them to go further, or to develop faster. Those are rejected. So they are constrained to work in a delimited perimeter from which it is difficult for them to escape. Then, the same person who rejected their changes comes with something shiny new, better and which took him far less time. But he sort of broke the rules because what was forbidden to the first persons is suddenly permitted. Maybe for very good reasons, I'm not discussing that. The good reason should have been valid the first time too. The fact is that when changes are rejected, we should not simply say "no", but explain why and define what would be acceptable. Some people here have excellent teaching skills for this, but most others do not. Anyway, the rules should be the same for everybody. Also, there is what can be perceived as marketing here. Con worked on his idea with conviction, he took time to write some generous documentation, but he hit a wall where his concept was suboptimal on a given workload. But at least, all the work was oriented on a technical basis: design + code + doc. Then, Ingo comes in with something looking amazingly better, with virtually no documentation, an appealing announcement, and shiny advertising at boot. All this implemented without the constraints other people had to respect. It already looks like definitive work which will be merged as-is without many changes except a few bugfixes. If those were two companies, the first one would simply have accused the second one of not having respected contracts and of having employed heavy marketing to take the first place. People here do not code for a living, they do it at least because they believe in what they are doing, and some of them want a bit of gratitude for their work. I've met people who were proud to say they implemented this or that feature in the kernel, so it is something important for them. And being cited in an email is nothing compared to advertising at boot time. When the discussion was blocked between Con and Mike concerning the design problems, that is where a new discussion should have taken place. 
Ingo could have publicly spoken with them about his ideas of killing the O(1) scheduler and replacing it with an rbtree-based one, and using part of Bill's work to speed up development. It is far easier to resign yourself when people explain what concepts are wrong and how they intend to proceed than when they suddenly present something out of nowhere which is already better. And it's not specific to Ingo (though I think his ability to work that fast alone makes him tend to practise this more often than others). Imagine if Con had worked another full week on his scheduler with better results on Mike's workload, but still not as good as Ingo's, and they both published at the same time. You certainly can imagine he would have preferred to be informed first that it was pointless to continue in that direction. Now I hope he and Bill will get over this and accept to work on improving this scheduler, because I really find it smarter than a dumb O(1). I even agree with Mike that we now have a solid basis for future work. But for this, maybe a good starting point would be to remove the selfish printk at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR come to mind) and improve the documentation a bit so that people can work together on the new design, without feeling like their work will only serve to promote X or Y. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau @ 2007-04-15 13:08 ` Pekka J Enberg 2007-04-15 17:32 ` Mike Galbraith 2007-04-15 15:26 ` William Lee Irwin III 2007-04-15 15:39 ` Ingo Molnar 2 siblings, 1 reply; 304+ messages in thread From: Pekka J Enberg @ 2007-04-15 13:08 UTC (permalink / raw) To: Willy Tarreau Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Willy Tarreau wrote: > Ingo could have publicly spoken with them about his ideas of killing > the O(1) scheduler and replacing it with an rbtree-based one, and using > part of Bill's work to speed up development. He did exactly that and he did it with a patch. Nothing new here. This is how development on LKML proceeds when you have two or more competing designs. There's absolutely no need to get upset or hurt your feelings over it. It's not malicious, it's how we do Linux development. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 13:08 ` Pekka J Enberg @ 2007-04-15 17:32 ` Mike Galbraith 2007-04-15 17:59 ` Linus Torvalds 0 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 17:32 UTC (permalink / raw) To: Pekka J Enberg Cc: Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote: > On Sun, 15 Apr 2007, Willy Tarreau wrote: > > Ingo could have publicly spoken with them about his ideas of killing > > the O(1) scheduler and replacing it with an rbtree-based one, and using > > part of Bill's work to speed up development. > > He did exactly that and he did it with a patch. Nothing new here. This is > how development on LKML proceeds when you have two or more competing > designs. There's absolutely no need to get upset or hurt your feelings > over it. It's not malicious, it's how we do Linux development. Yes. Exactly. This is what it's all about, this is what makes it work. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:32 ` Mike Galbraith @ 2007-04-15 17:59 ` Linus Torvalds 2007-04-15 19:00 ` Jonathan Lundell 0 siblings, 1 reply; 304+ messages in thread From: Linus Torvalds @ 2007-04-15 17:59 UTC (permalink / raw) To: Mike Galbraith Cc: Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Mike Galbraith wrote: > On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote: > > > > He did exactly that and he did it with a patch. Nothing new here. This is > > how development on LKML proceeds when you have two or more competing > > designs. There's absolutely no need to get upset or hurt your feelings > > over it. It's not malicious, it's how we do Linux development. > > Yes. Exactly. This is what it's all about, this is what makes it work. I obviously agree, but I will also add that one of the most motivating things there *is* in open source is "personal pride". It's a really good thing, and it means that if somebody shows that your code is flawed in some way (by, for example, making a patch that people claim gets better behaviour or numbers), any *good* programmer that actually cares about his code will obviously suddenly be very motivated to out-do the out-doer! Does this mean that there will be tension and rivalry? Hell yes. But that's kind of the point. Life is a game, and if you aren't in it to win, what the heck are you still doing here? As long as it's reasonably civil (I'm not personally a huge believer in being too polite or "politically correct", so I think the "reasonably" is more important than the "civil" part!), and as long as the end result is judged on TECHNICAL MERIT, it's all good. We don't want to play politics. But encouraging people's competitive feelings? Oh, yes. Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:59 ` Linus Torvalds @ 2007-04-15 19:00 ` Jonathan Lundell 2007-04-15 22:52 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Jonathan Lundell @ 2007-04-15 19:00 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > It's a really good thing, and it means that if somebody shows that > your > code is flawed in some way (by, for example, making a patch that > people > claim gets better behaviour or numbers), any *good* programmer that > actually cares about his code will obviously suddenly be very > motivated to > out-do the out-doer! "No one who cannot rejoice in the discovery of his own mistakes deserves to be called a scholar." --Don Foster, "literary sleuth", on retracting his attribution of "A Funerall Elegye" to Shakespeare (it's more likely John Ford's work). ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:00 ` Jonathan Lundell @ 2007-04-15 22:52 ` Con Kolivas 2007-04-16 2:28 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-15 22:52 UTC (permalink / raw) To: Jonathan Lundell Cc: Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 05:00, Jonathan Lundell wrote: > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > > It's a really good thing, and it means that if somebody shows that > > your > > code is flawed in some way (by, for example, making a patch that > > people > > claim gets better behaviour or numbers), any *good* programmer that > > actually cares about his code will obviously suddenly be very > > motivated to > > out-do the out-doer! > > "No one who cannot rejoice in the discovery of his own mistakes > deserves to be called a scholar." Lovely comment. I realise this is not truly directed at me but clearly in the context it has been said people will assume it is directed my way, so while we're all spinning lkml quality rhetoric, let me have a right of reply. One thing I have never tried to do was to ignore bug reports. I'm forever joking that I keep pulling code out of my arse to improve what I've done. RSDL/SD was no exception; heck it had 40 iterations. The reason I could not reply to bug report A with "Oh that is problem B so I'll fix it with code C" was, as I've said many many times over, health related. I did indeed try to fix many of them without spending hours replying to sometimes unpleasant emails. If health wasn't an issue there might have been 1000 iterations of SD. There was only ever _one_ thing that I was absolutely steadfast on as a concept that I refused to fix that people might claim was "a mistake I did not rejoice in to be a scholar". That was that the _correct_ behaviour for a scheduler is to be fair such that proportional slowdown with load is (using that awful pun) a feature, not a bug. Now there are people who will still disagree violently with me on that. SD attempted to be a fairness first virtual-deadline design. If I failed on that front, then so be it (and at least one person certainly has said in lovely warm fuzzy friendly communication that I'm a global failure on all fronts with SD). But let me point out now that Ingo's shiny new scheduler is a fairness-first virtual-deadline design which will have proportional slowdown with load. So it will have a very similar feature. I dare anyone to claim that proportional slowdown with load is a bug, because I will no longer feel like I'm standing alone with a BFG9000 trying to defend my standpoint. Others can take up the post at last. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:52 ` Con Kolivas @ 2007-04-16 2:28 ` Nick Piggin 2007-04-16 3:15 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-16 2:28 UTC (permalink / raw) To: Con Kolivas Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 08:52:33AM +1000, Con Kolivas wrote: > On Monday 16 April 2007 05:00, Jonathan Lundell wrote: > > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > > > It's a really good thing, and it means that if somebody shows that > > > your > > > code is flawed in some way (by, for example, making a patch that > > > people > > > claim gets better behaviour or numbers), any *good* programmer that > > > actually cares about his code will obviously suddenly be very > > > motivated to > > > out-do the out-doer! > > > > "No one who cannot rejoice in the discovery of his own mistakes > > deserves to be called a scholar." > > Lovely comment. I realise this is not truly directed at me but clearly in the > context it has been said people will assume it is directed my way, so while > we're all spinning lkml quality rhetoric, let me have a right of reply. > > One thing I have never tried to do was to ignore bug reports. I'm forever > joking that I keep pulling code out of my arse to improve what I've done. > RSDL/SD was no exception; heck it had 40 iterations. The reason I could not > reply to bug report A with "Oh that is problem B so I'll fix it with code C" > was, as I've said many many times over, health related. I did indeed try to > fix many of them without spending hours replying to sometimes unpleasant > emails. If health wasn't an issue there might have been 1000 iterations of > SD. Well what matters is the code and development. I don't think Ingo's scheduler is the final word, although I worry that Linus might jump the gun and merge something "just to give it a test", which we then get stuck with :P I don't know how anybody can think Ingo's new scheduler is anything but a good thing (so long as it has to compete before being merged). And that's coming from someone who wants *their* scheduler to get merged... I think mine can compete ;) and if it can't, then I'd rather be using the scheduler that beats it. > There was only ever _one_ thing that I was absolutely steadfast on as a > concept that I refused to fix that people might claim was "a mistake I did > not rejoice in to be a scholar". That was that the _correct_ behaviour for a > scheduler is to be fair such that proportional slowdown with load is (using > that awful pun) a feature, not a bug. If something is using more than a fair share of CPU time, over some macro period, in order to be interactive, then definitely it should get throttled. I've always maintained (since starting scheduler work) that the 2.6 scheduler is horrible because it allows these cases where some things can get more CPU time just by how they behave. Glad people are starting to come around on that point. So, on to something productive, we have 3 candidates for a new scheduler so far. How do we decide which way to go? (and yes, I still think switchable schedulers is wrong and a copout) This is one area where it is virtually impossible to discount any decent design on correctness/performance/etc. and even testing in -mm isn't really enough. 
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 2:28 ` Nick Piggin @ 2007-04-16 3:15 ` Con Kolivas 2007-04-16 3:34 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-16 3:15 UTC (permalink / raw) To: Nick Piggin Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 12:28, Nick Piggin wrote: > So, on to something productive, we have 3 candidates for a new scheduler so > far. How do we decide which way to go? (and yes, I still think switchable > schedulers is wrong and a copout) This is one area where it is virtually > impossible to discount any decent design on correctness/performance/etc. > and even testing in -mm isn't really enough. We're in agreement! YAY! Actually this is simpler than that. I'm taking SD out of the picture. It has served its purpose of proving that we need to seriously address all the scheduling issues and did more than a half decent job at it. Unfortunately I also cannot sit around supporting it forever by myself. My own life is more important, so consider SD not even running the race any more. I'm off to continue maintaining permanent-out-of-tree leisurely code at my own pace. What's more, I think I'll just stick to staircase Gen I version blah and shelve SD and try to have fond memories of SD as an intellectual prompting exercise only. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:15 ` Con Kolivas @ 2007-04-16 3:34 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-16 3:34 UTC (permalink / raw) To: Con Kolivas Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 01:15:27PM +1000, Con Kolivas wrote: > On Monday 16 April 2007 12:28, Nick Piggin wrote: > > So, on to something productive, we have 3 candidates for a new scheduler so > > far. How do we decide which way to go? (and yes, I still think switchable > > schedulers is wrong and a copout) This is one area where it is virtually > > impossible to discount any decent design on correctness/performance/etc. > > and even testing in -mm isn't really enough. > > We're in agreement! YAY! > > Actually this is simpler than that. I'm taking SD out of the picture. It has > served it's purpose of proving that we need to seriously address all the > scheduling issues and did more than a half decent job at it. Unfortunately I > also cannot sit around supporting it forever by myself. My own life is more > important, so consider SD not even running the race any more. > > I'm off to continue maintaining permanent-out-of-tree leisurely code at my own > pace. What's more is, I think I'll just stick to staircase Gen I version blah > and shelve SD and try to have fond memories of SD as an intellectual > prompting exercise only. Well I would hope that _if_ we decide to switch schedulers, then you get a chance to field something (and I hope you will decide to and have time to), and I hope we don't rush into the decision. We've had the current scheduler for so many years now that it is much more important to make sure we take the time to do the right thing rather than absolutely have to merge a new scheduler right now ;) ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg @ 2007-04-15 15:26 ` William Lee Irwin III 2007-04-16 15:55 ` Chris Friesen 2007-04-15 15:39 ` Ingo Molnar 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 15:26 UTC (permalink / raw) To: Willy Tarreau Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 02:45:27PM +0200, Willy Tarreau wrote: > Now I hope he and Bill will get over this and accept to work on improving > this scheduler, because I really find it smarter than a dumb O(1). I even > agree with Mike that we now have a solid basis for future work. But for > this, maybe a good starting point would be to remove the selfish printk > at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR come to mind) > and improve the documentation a bit so that people can work together on > the new design, without feeling like their work will only server to > promote X or Y. While I appreciate people coming to my defense, or at least the good intentions behind such, my only actual interest in pointing out 4-year-old work is getting some acknowledgment of having done something relevant at all. Sometimes it has "I told you so" value. At other times it's merely clarifying what went on when people refer to it since in a number of cases the patches are no longer extant, so they can't actually look at it to get an idea of what was or wasn't done. At other times I'm miffed about not being credited, whether I should've been or whether dead and buried code has an implementation of the same idea resurfacing without the author(s) having any knowledge of my prior work. One should note that in this case, the first work of mine this trips over (scheduling classes) was never publicly posted as it was only a part of the original plugsched (an alternate scheduler implementation devised to demonstrate plugsched's flexibility with respect to scheduling policies), and a part that was dropped by subsequent maintainers. The second work of mine this trips over, a virtual deadline scheduler named "vdls," was also never publicly posted. Both are from around the same time period, which makes them approximately 4 years dead. Neither of the codebases are extant, having been lost in a transition between employers, though various people recall having been sent them privately, and plugsched survives in a mutated form as maintained by Peter Williams, who's been very good about acknowledging my original contribution. If I care to become a direct participant in scheduler work, I can do so easily enough. I'm not entirely sure what this is about a basis for future work. By and large one should alter the API's and data structures to fit the policy being implemented. While the array swapping was nice for algorithmically improving 2.4.x -style epoch expiry, most algorithms not based on the 2.4.x scheduler (in however mutated a form) should use a different queue structure, in fact, one designed around their policy's specific algorithmic needs. IOW, when one alters the scheduler, one should also alter the queue data structure appropriately. I'd not expect the priority queue implementation in cfs to continue to be used unaltered as it matures, nor would I expect any significant modification of the scheduler to necessarily use a similar one. 
By and large I've been mystified as to why there is such a penchant for preserving the existing queue structures in the various scheduler patches floating around. I am now every bit as mystified at the point of view that seems to be emerging that a change of queue structure is particularly significant. These are all largely internal changes to sched.c, and as such, rather small changes in and of themselves. While they do tend to have user-visible effects, from this point of view even changing out every line of sched.c is effectively a micropatch. Something more significant might be altering the schedule() API to take a mandatory description of the intention of the call to it, or breaking up schedule() into several different functions to distinguish between different sorts of uses of it to which one would then respond differently. Also more significant would be adding a new state beyond TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, and TASK_RUNNING for some tasks to respond only to fatal signals, then sweeping TASK_UNINTERRUPTIBLE users to use the new state and handle those fatal signals. While not quite as ostentatious in their user-visible effects as SCHED_OTHER policy affairs, they are tremendously more work than switching out the implementation of a single C file, and so somewhat more respectable. Even as scheduling semantics go, these are micropatches. So SCHED_OTHER changes a little. Where are the gang schedulers? Where are the batch schedulers (SCHED_BATCH is not truly such)? Where are the isochronous (frame) schedulers? I suppose there is some CKRM work that actually has a semantic impact despite being largely devoted to SCHED_OTHER, and there's some spufs gang scheduling going on, though not all that much. And to reiterate a point from other threads, even as SCHED_OTHER patches go, I see precious little verification that things like the semantics of nice numbers or other sorts of CPU bandwidth allocation between competing tasks of various natures are staying the same while other things are changed, or at least being consciously modified in such a fashion as to improve them. I've literally only seen one or two tests (and rather inflexible ones with respect to sleep and running time mixtures) with any sort of quantification of how CPU bandwidth is distributed get run on all this. So from my point of view, there's a lot of churn and craziness going on in one tiny corner of the kernel and people don't seem to have a very solid grip on what effects their changes have or how they might potentially break userspace. So I've developed a sudden interest in regression testing of the scheduler in order to ensure that various sorts of semantics on which userspace relies are not broken, and am trying to spark more interest in general in nailing down scheduling semantics and verifying that those semantics are honored and remain honored by whatever future scheduler implementations might be merged. Thus far, the laundry list of semantics I'd like to have nailed down are specifically: (1) CPU bandwidth allocation according to nice numbers (2) CPU bandwidth allocation among mixtures of tasks with varying sleep/wakeup behavior e.g. 
that consume some percentage of cpu in isolation, perhaps also varying the granularity of their sleep/wakeup patterns (3) sched_yield(), so multitier userspace locking doesn't go haywire (4) How these work with SMP; most people agree it should be mostly the same as it works on UP, but it's not being verified, as most testcases are barely SMP-aware if at all, and corner cases where proportionality breaks down aren't considered The sorts of like explicit decisions I'd like to be made for these are: (1) In a mixture of tasks with varying nice numbers, a given nice number corresponds to some share of CPU bandwidth. Implementations should not have the freedom to change this arbitrarily according to some intention. (2) A given scheduler _implementation_ intends to distribute CPU bandwidth among mixtures of tasks that would each consume some percentage of the CPU in isolation varying across tasks in some particular pattern. For example, maybe some scheduler implementation assigns a share of 1/%cpu to a task that would consume %cpu in isolation, for a CPU bandwidth allocation of (1/%cpu)/(sum 1/%cpu(t)) as t ranges over all competing tasks (this is not to say that such a policy makes sense). (3) sched_yield() is intended to result in some particular scheduling pattern in a given scheduler implementation. For instance, an implementation may intend that a set of CPU hogs calling sched_yield() between repeatedly printf()'ing their pid's will see their printf()'s come out in an approximately consistent order as the scheduler cycles between them. (4) What an implementation intends to do with respect to SMP CPU bandwidth allocation when precise emulation of UP behavior is impossible, considering sched_yield() scheduling patterns when possible as well. For instance, perhaps an implementation intends to ensure equal CPU bandwidth among competing CPU-bound tasks of equal priority at all costs, and so triggers migration and/or load balancing to make it so. Or perhaps an implementation intends to ensure precise sched_yield() ordering at all costs even on SMP. Some sort of specification of the intention, then a verification that the intention is carried out in a testcase. Also, if there's a semantic issue to be resolved, I want it to have something describing it and verifying it. For instance, characterizing whatever sort of scheduling artifacts queue-swapping causes in the mainline scheduler and then a testcase to demonstrate the artifact and its resolution in a given scheduler rewrite would be a good design statement and verification. For instance, if someone wants to go back to queue-swapping or other epoch expiry semantics, it would make them (and hopefully everyone else) conscious of the semantic issue the change raises, or possibly serve as a demonstration that the artifacts can be mitigated in some implementation retaining epoch expiry semantics. As I become aware of more potential issues I'll add more to my laundry list, and I'll hammer out testcases as I go. My concern with the scheduler is that this sort of basic functionality may be significantly disturbed with no one noticing at all until a distro issues a prerelease and benchmarks go haywire, and furthermore that changes to this kind of basic behavior may be signs of things going awry, particularly as more churn happens. So now that I've clarified my role in all this to date and my point of view on it, it should be clear that accepting something and working on some particular scheduler implementation don't make sense as suggestions to me. 
-- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
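For illustration of the kind of specification asked for in point (2) above, here is a small worked example of the hypothetical 1/%cpu allocation rule; this is purely a sketch (the task mix is invented, and the original mail explicitly notes such a policy is not necessarily sensible):

    #include <stdio.h>

    int main(void)
    {
            /* fraction of the CPU each task would consume in isolation (invented mix) */
            double isolated[] = { 1.00, 0.50, 0.25 };
            int n = sizeof(isolated) / sizeof(isolated[0]);
            double sum = 0.0;
            int i;

            for (i = 0; i < n; i++)
                    sum += 1.0 / isolated[i];        /* share proportional to 1/%cpu */

            for (i = 0; i < n; i++)
                    printf("task %d: %3.0f%% in isolation -> %5.1f%% allocated\n",
                           i, isolated[i] * 100.0,
                           100.0 * (1.0 / isolated[i]) / sum);
            return 0;
    }

Under this particular rule the lightest task ends up with the largest share (about 57% here), which is exactly why the caveat above says the rule itself may not make sense; the point is only that whatever rule an implementation intends should be written down and checkable by a testcase.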
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:26 ` William Lee Irwin III @ 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Chris Friesen @ 2007-04-16 15:55 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > The sorts of like explicit decisions I'd like to be made for these are: > (1) In a mixture of tasks with varying nice numbers, a given nice number > corresponds to some share of CPU bandwidth. Implementations > should not have the freedom to change this arbitrarily according > to some intention. The first question that comes to my mind is whether nice levels should be linear or not. I would lean towards nonlinear as it allows a wider range (although of course at the expense of precision). Maybe something like "each nice level gives X times the cpu of the previous"? I think a value of X somewhere between 1.15 and 1.25 might be reasonable. What about also having something that looks at latency, and how latency changes with niceness? What about specifying the timeframe over which the cpu bandwidth is measured? I currently have a system where the application designers would like it to be totally fair over a period of 1 second. As you can imagine, mainline doesn't do very well in this case. Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen @ 2007-04-16 16:13 ` William Lee Irwin III 2007-04-17 0:04 ` Peter Williams 2007-04-17 13:07 ` James Bruce 2 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 16:13 UTC (permalink / raw) To: Chris Friesen Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. On Mon, Apr 16, 2007 at 09:55:14AM -0600, Chris Friesen wrote: > The first question that comes to my mind is whether nice levels should > be linear or not. I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. > What about also having something that looks at latency, and how latency > changes with niceness? > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. As you can > imagine, mainline doesn't do very well in this case. It's unclear how latency enters the picture as the semantics of nice levels relevant to such are essentially priority preemption, which is not particularly easy to mess up. I suppose tests to ensure priority preemption occurs properly are in order. I don't really have a preference regarding specific semantics for nice numbers, just that they should be deterministic and specified somewhere. It's not really for us to decide what those semantics are as it's more of a userspace ABI/API issue. The timeframe is also relevant, but I suspect it's more of a performance metric than a strict requirement. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III @ 2007-04-17 0:04 ` Peter Williams 2007-04-17 13:07 ` James Bruce 2 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 0:04 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > William Lee Irwin III wrote: > >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. > > The first question that comes to my mind is whether nice levels should > be linear or not. No. That squishes one end of the table too much. It needs to be (approximately) piecewise linear around nice == 0. Here's the mapping I use in my entitlement based schedulers: #define NICE_TO_LP(nice) ((nice >=0) ? (20 - (nice)) : (20 + (nice) * (nice))) It has the (good) feature that a nice == 19 task has 1/20th the entitlement of a nice == 0 task and a nice == -20 task has 21 times the entitlement of a nice == 0 task. It's not strictly linear for negative nice values but is very cheap to calculate and quite easy to invert if necessary. > I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. > > What about also having something that looks at latency, and how latency > changes with niceness? > > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. Have you tried the spa_ebs scheduler? The half life is no longer a run time configurable parameter (as making it highly adjustable results in less efficient code) but it could be adjusted to be approximately equivalent to 0.5 seconds by changing some constants in the code. > As you can > imagine, mainline doesn't do very well in this case. You should look back through the plugsched patches where many of these ideas have been experimented with. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
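For illustration, the quoted NICE_TO_LP() mapping can be tabulated with a few lines of userspace C. The program below is only a sketch that reproduces the macro from the mail above and prints each nice level's entitlement relative to nice 0; it is not taken from the spa_ebs code itself.

    #include <stdio.h>

    /* the mapping quoted above: linear for nice >= 0, quadratic for nice < 0 */
    #define NICE_TO_LP(nice) (((nice) >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))

    int main(void)
    {
            int nice;

            for (nice = -20; nice <= 19; nice++)
                    printf("nice %3d  lp %3d  relative to nice 0: %6.2f\n",
                           nice, NICE_TO_LP(nice),
                           (double)NICE_TO_LP(nice) / NICE_TO_LP(0));
            return 0;
    }

A nice 19 task comes out at 0.05 (1/20th) of a nice 0 task and a nice -20 task at 21.00 times, matching the figures quoted, with the quadratic branch giving the negative-nice range its extra spread.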
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III 2007-04-17 0:04 ` Peter Williams @ 2007-04-17 13:07 ` James Bruce 2007-04-17 20:05 ` William Lee Irwin III 2 siblings, 1 reply; 304+ messages in thread From: James Bruce @ 2007-04-17 13:07 UTC (permalink / raw) To: linux-kernel; +Cc: ck Chris Friesen wrote: > William Lee Irwin III wrote: > >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. > > The first question that comes to my mind is whether nice levels should > be linear or not. I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 That value has the property that a nice=10 task gets 1/10th the cpu of a nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that would be fairly easy to explain to admins and users so that they can know what to expect from nicing tasks. > What about also having something that looks at latency, and how latency > changes with niceness? I think this would be a lot harder to pin down, since it's a function of all the other tasks running and their nice levels. Do you have any of the RT-derived analysis models in mind? > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. As you can > imagine, mainline doesn't do very well in this case. It might be easier to specify the maximum deviation from the ideal bandwidth over a certain period. I.e. something like "over a period of one second, each task receives within 10% of the expected bandwidth". - Jim Bruce ^ permalink raw reply [flat|nested] 304+ messages in thread
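The property described above is easy to check numerically; the standalone sketch below (illustrative only, not kernel code) computes the CPU share of a nice=n task relative to nice=0 under a purely multiplicative scheme with X = exp(ln(10)/10):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
            double x = exp(log(10.0) / 10.0);   /* ~1.2589: each nice level costs a factor x */
            int nice;

            printf("X = %.4f\n", x);
            for (nice = 0; nice <= 20; nice += 5)
                    printf("nice %2d gets %.4f of the nice 0 share\n",
                           nice, pow(x, -(double)nice));
            return 0;
    }

Built with something like cc nice.c -lm, this prints 0.1000 for nice 10 and 0.0100 for nice 20, i.e. 1/10th and 1/100th of the nice 0 share as claimed.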
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:07 ` James Bruce @ 2007-04-17 20:05 ` William Lee Irwin III 0 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 20:05 UTC (permalink / raw) To: James Bruce Cc: Chris Friesen, Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > That value has the property that a nice=10 task gets 1/10th the cpu of a > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > would be fairly easy to explain to admins and users so that they can > know what to expect from nicing tasks. [...additional good commentary trimmed...] Lots of good ideas here. I'll follow them. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg 2007-04-15 15:26 ` William Lee Irwin III @ 2007-04-15 15:39 ` Ingo Molnar 2007-04-15 15:47 ` William Lee Irwin III 2007-04-16 5:27 ` Peter Williams 2 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 15:39 UTC (permalink / raw) To: Willy Tarreau Cc: Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Ingo could have publicly spoken with them about his ideas of killing > the O(1) scheduler and replacing it with an rbtree-based one, [...] yes, that's precisely what i did, via a patchset :) [ I can even tell you when it all started: i was thinking about Mike's throttling patches while watching Manchester United beat the crap out of AS Roma (7 to 1 end result), Tuesday evening. I started coding it Wednesday morning and sent the patch Friday evening. I very much believe in low-latency when it comes to development too ;) ] (if this had been done via a committee then today we'd probably still be trying to find a suitable timeslot for the initial conference call where we'd discuss the election of a chair who would be tasked with writing up an initial document of feature requests, on which we'd take a vote, possibly this year already, because the matter is really urgent you know ;-) > [...] and using part of Bill's work to speed up development. ok, let me make this absolutely clear: i didnt use any bit of plugsched - in fact the most difficult bits of the modularization was for areas of sched.c that plugsched never even touched AFAIK. (the load-balancer for example.) Plugsched simply does something else: i modularized scheduling policies in essence that have to cooperate with each other, while plugsched modularized complete schedulers which are compile-time or boot-time selected, with no runtime cooperation between them. (one has to be selected at a time) (and i have no trouble at all with crediting Will's work either: a few years ago i used Will's PID rework concepts for an NPTL related speedup and Will is very much credited for it in today's kernel/pid.c and he continued to contribute to it later on.) (the tree walking bits of sched_fair.c were in fact derived from kernel/hrtimer.c, the rbtree code written by Thomas and me :-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
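To make the distinction drawn here concrete, a hedged sketch of what "scheduling policies that cooperate at runtime" can look like at the interface level follows. The structure and names (example_sched_class, highest_class, and so on) are invented for illustration and are not the actual interface in the CFS patch; the point is only that several classes are active at once and are consulted in priority order, whereas plugsched selects one complete scheduler at build or boot time.

    #include <stddef.h>

    struct rq;             /* per-CPU runqueue, opaque in this sketch */
    struct task_struct;    /* task, opaque in this sketch */

    /* one scheduling policy module; classes form a priority-ordered list */
    struct example_sched_class {
            const struct example_sched_class *next;
            void (*enqueue_task)(struct rq *rq, struct task_struct *p);
            void (*dequeue_task)(struct rq *rq, struct task_struct *p);
            struct task_struct *(*pick_next_task)(struct rq *rq);
    };

    /* e.g. an RT-style class would sit ahead of a fair class here */
    static const struct example_sched_class *highest_class;

    /* the core asks each class in turn for runnable work: the classes
     * cooperate at runtime instead of one monolithic scheduler being
     * swapped in wholesale */
    static struct task_struct *example_pick_next(struct rq *rq)
    {
            const struct example_sched_class *class;

            for (class = highest_class; class != NULL; class = class->next) {
                    struct task_struct *p = class->pick_next_task(rq);
                    if (p)
                            return p;
            }
            return NULL;    /* nothing runnable: go idle */
    }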
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:39 ` Ingo Molnar @ 2007-04-15 15:47 ` William Lee Irwin III 2007-04-16 5:27 ` Peter Williams 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 15:47 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: >> [...] and using part of Bill's work to speed up development. On Sun, Apr 15, 2007 at 05:39:33PM +0200, Ingo Molnar wrote: > ok, let me make this absolutely clear: i didnt use any bit of plugsched > - in fact the most difficult bits of the modularization was for areas of > sched.c that plugsched never even touched AFAIK. (the load-balancer for > example.) > Plugsched simply does something else: i modularized scheduling policies > in essence that have to cooperate with each other, while plugsched > modularized complete schedulers which are compile-time or boot-time > selected, with no runtime cooperation between them. (one has to be > selected at a time) > (and i have no trouble at all with crediting Will's work either: a few > years ago i used Will's PID rework concepts for an NPTL related speedup > and Will is very much credited for it in today's kernel/pid.c and he > continued to contribute to it later on.) > (the tree walking bits of sched_fair.c were in fact derived from > kernel/hrtimer.c, the rbtree code written by Thomas and me :-) The extant plugsched patches have nothing to do with cfs; I suspect what everyone else is going on about is terminological confusion. The 4-year-old sample policy with scheduling classes for the original plugsched is something you had no way of knowing about, as it was never publicly posted. There isn't really anything all that interesting going on here, apart from pointing out that it's been done before. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:39 ` Ingo Molnar 2007-04-15 15:47 ` William Lee Irwin III @ 2007-04-16 5:27 ` Peter Williams 2007-04-16 6:23 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 5:27 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Willy Tarreau <w@1wt.eu> wrote: > >> Ingo could have publicly spoken with them about his ideas of killing >> the O(1) scheduler and replacing it with an rbtree-based one, [...] > > yes, that's precisely what i did, via a patchset :) > > [ I can even tell you when it all started: i was thinking about Mike's > throttling patches while watching Manchester United beat the crap out > of AS Roma (7 to 1 end result), Tuesday evening. I started coding it > Wednesday morning and sent the patch Friday evening. I very much > believe in low-latency when it comes to development too ;) ] > > (if this had been done via a committee then today we'd probably still be > trying to find a suitable timeslot for the initial conference call where > we'd discuss the election of a chair who would be tasked with writing up > an initial document of feature requests, on which we'd take a vote, > possibly this year already, because the matter is really urgent you know > ;-) > >> [...] and using part of Bill's work to speed up development. > > ok, let me make this absolutely clear: i didnt use any bit of plugsched > - in fact the most difficult bits of the modularization was for areas of > sched.c that plugsched never even touched AFAIK. (the load-balancer for > example.) This sounds like your new scheduler intends to increase the coupling between scheduling and load balancing. I think that this would be a mistake and lead (down the track) to spiralling complexity as you make changes to the code to address the corner conditions that it will create. > > Plugsched simply does something else: i modularized scheduling policies > in essence that have to cooperate with each other, while plugsched > modularized complete schedulers which are compile-time or boot-time > selected, with no runtime cooperation between them. (one has to be > selected at a time) You can't really have more than one scheduler operating in the same priority range on the same CPU as they will be fighting each other trying to achieve their separate and not necessarily compatible (in fact highly likely to be incompatible) aims. Multiple schedulers on the same CPU have to have a pecking order just like SCHED_OTHER and real time policies. It wouldn't be hard to prove that SCHED_RR and SCHED_FIFO are a problem in waiting if ever someone tried to use them both on a highly real time system. > > (and i have no trouble at all with crediting Will's work either: a few > years ago i used Will's PID rework concepts for an NPTL related speedup > and Will is very much credited for it in today's kernel/pid.c and he > continued to contribute to it later on.) > > (the tree walking bits of sched_fair.c were in fact derived from > kernel/hrtimer.c, the rbtree code written by Thomas and me :-) > > Ingo Are your new patches available somewhere for easy download or do I have to try to dig them out of the mailing list archive? Or could you mail them to me separately? I'm keen to see how your new scheduler proposal works. 
Thanks Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:27 ` Peter Williams @ 2007-04-16 6:23 ` Peter Williams 2007-04-16 6:40 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 6:23 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > > Are your new patches available somewhere for easy download or do I have > to try to dig them out of the mailing list archive? Or could you mail > them to me separately? I'm keen to see how your new scheduler proposal > works. Forget about this. I found the patch. After a quick look, I like a lot of what I see, especially the removal of the dual arrays in the run queue. Some minor suggestions: 1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to initialize the task structure in init_task.h. 2. the on_rq field in the task structure is unnecessary as many years of experience with ingosched in plugsched indicates that !list_empty(&(p)->run_list) does the job provided list_del_init() is used when dequeueing and there is no noticeable overhead incurred so there's no gain by caching the result. Also it removes the possibility of errors creeping in due to the value of on_rq being inconsistent with the task's actual state. 3. having modular load balancing is a good idea but it should be decoupled from the scheduler and provided as a separate interface. This would enable different schedulers to use the same load balancer if they desired. 4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be fair(ish) anyway. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
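Point 2 above relies on a standard kernel list idiom: if the dequeue path uses list_del_init() rather than list_del(), the removed node is left pointing at itself, so list_empty() applied to the task's own run_list node answers "is this task queued?" without a separate on_rq flag. The userspace sketch below re-implements just enough of the circular-list primitives to demonstrate the idiom; it is illustrative only and not the kernel's list.h.

    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

    static void list_add(struct list_head *new, struct list_head *head)
    {
            new->next = head->next;
            new->prev = head;
            head->next->prev = new;
            head->next = new;
    }

    static void list_del_init(struct list_head *e)
    {
            e->prev->next = e->next;
            e->next->prev = e->prev;
            INIT_LIST_HEAD(e);              /* key step: node points back at itself */
    }

    static int list_empty(const struct list_head *h) { return h->next == h; }

    struct task { struct list_head run_list; };

    int main(void)
    {
            struct list_head runqueue;
            struct task t;

            INIT_LIST_HEAD(&runqueue);
            INIT_LIST_HEAD(&t.run_list);
            printf("queued? %d\n", !list_empty(&t.run_list));   /* 0: never enqueued */
            list_add(&t.run_list, &runqueue);                   /* enqueue */
            printf("queued? %d\n", !list_empty(&t.run_list));   /* 1: on the runqueue */
            list_del_init(&t.run_list);                         /* dequeue */
            printf("queued? %d\n", !list_empty(&t.run_list));   /* 0 again, no flag needed */
            return 0;
    }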
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 6:23 ` Peter Williams @ 2007-04-16 6:40 ` Peter Williams 2007-04-16 7:32 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 6:40 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Peter Williams wrote: >> >> Are your new patches available somewhere for easy download or do I >> have to try to dig them out of the mailing list archive? Or could you >> mail them to me separately? I'm keen to see how your new scheduler >> proposal works. > > Forget about this. I found the patch. > > After a quick look, I like a lot of what I see, especially the removal of > the dual arrays in the run queue. > > Some minor suggestions: > > 1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to > initialize the task structure in init_task.h. > 2. the on_rq field in the task structure is unnecessary as many years of > experience with ingosched in plugsched indicates that > !list_empty(&(p)->run_list) does the job provided list_del_init() is used > when dequeueing and there is no noticeable overhead incurred so there's > no gain by caching the result. Also it removes the possibility of > errors creeping in due to the value of on_rq being inconsistent with the > task's actual state. > 3. having modular load balancing is a good idea but it should be > decoupled from the scheduler and provided as a separate interface. This > would enable different schedulers to use the same load balancer if they > desired. > 4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be > fair(ish) anyway. One more quick comment. The claim that there is no concept of time slice in the new scheduler is only true in the sense of the rather arcane implementation of time slices extant in the O(1) scheduler. Your new parameter sched_granularity_ns is equivalent to the concept of time slice in most other kernels that I've peeked inside and computing literature in general (going back over several decades e.g. the magic garden). Welcome to the mainstream, Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 6:40 ` Peter Williams @ 2007-04-16 7:32 ` Ingo Molnar 2007-04-16 8:54 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 7:32 UTC (permalink / raw) To: Peter Williams Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Peter Williams <pwil3058@bigpond.net.au> wrote: > One more quick comment. The claim that there is no concept of time > slice in the new scheduler is only true in the sense of the rather > arcane implementation of time slices extant in the O(1) scheduler. yeah. AFAIK most other mainstream OSs also still often use similarly 'arcane' concepts (i'm here ignoring literature, you can find everything and its opposite suggested in literature) so i felt the need to point out the difference ;) After all Linux is about doing a better mainstream OS, it is not about beating the OS literature at lunacy ;-) The precise statement would be: "there's no concept of giving out a time-slice to a task and sticking to it unless a higher-prio task comes along, nor is there a concept of having a low-res granularity ->time_slice thing. There is accurate accounting of how much CPU time a task used up, and there is a granularity setting that together gives the current task a fairness advantage of a given amount of nanoseconds - which has similar [but not equivalent] effects to traditional timeslices that most mainstream OSs use". > Your new parameter sched_granularity_ns is equivalent to the concept > of time slice in most other kernels that I've peeked inside and > computing literature in general (going back over several decades e.g. > the magic garden). note that you can set it to 0 and the box still functions - so sched_granularity_ns, while useful for performance/bandwidth workloads, isnt truly inherent to the design. So in the announcement i just opted for a short sentence: "there's no concept of timeslices", albeit like most short sentences it's not a technically 100% accurate statement - but still it conveyed the intended information more effectively to the interested lkml reader than the longer version could ever have =B-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
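[ Editorial sketch: a rough illustration of the "accurate nanosecond accounting plus a granularity-sized fairness advantage" idea Ingo describes above. The structure, the field names and the 5ms default below are assumptions made for the example; this is not CFS's actual code. ]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical nanosecond-accounted tasks; not CFS's real structures. */
struct task {
	uint64_t used_ns;	/* CPU time this task has actually consumed */
};

static uint64_t sched_granularity_ns = 5000000;	/* illustrative 5ms default */

/* The running task keeps a "fairness advantage" of one granularity: it
 * is preempted only once the best waiting task trails it by more than
 * sched_granularity_ns of consumed CPU time. */
static bool should_preempt(const struct task *curr, const struct task *waiter)
{
	return curr->used_ns > waiter->used_ns + sched_granularity_ns;
}

int main(void)
{
	struct task curr = { .used_ns = 4000000 };	/* 4ms used so far */
	struct task waiter = { .used_ns = 0 };

	printf("5ms granularity: preempt=%d\n", should_preempt(&curr, &waiter));
	sched_granularity_ns = 0;	/* knob at 0: the box still "works"... */
	printf("0ns granularity: preempt=%d\n", should_preempt(&curr, &waiter));
	/* ...but every imbalance now triggers a switch, hence Peter's point
	 * in the next message about the context switch rate. */
	return 0;
}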
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 7:32 ` Ingo Molnar @ 2007-04-16 8:54 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-16 8:54 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Peter Williams <pwil3058@bigpond.net.au> wrote: > >> One more quick comment. The claim that there is no concept of time >> slice in the new scheduler is only true in the sense of the rather >> arcane implementation of time slices extant in the O(1) scheduler. > > yeah. AFAIK most other mainstream OSs also still often use similarly > 'arcane' concepts (i'm here ignoring literature, you can find everything > and its opposite suggested in literature) so i felt the need to point > out the difference ;) After all Linux is about doing a better mainstream > OS, it is not about beating the OS literature at lunacy ;-) > > The precise statement would be: "there's no concept of giving out a > time-slice to a task and sticking to it unless a higher-prio task comes > along, I would have said "no concept of using time slices to implement nice" which always seemed strange to me. If it really does what you just said then a (malicious or otherwise) CPU intensive task that never sleeps, once it got the CPU, would completely hog the CPU. > nor is there a concept of having a low-res granularity > ->time_slice thing. There is accurate accounting of how much CPU time a > task used up, and there is a granularity setting that together gives the > current task a fairness advantage of a given amount of nanoseconds - > which has similar [but not equivalent] effects to traditional timeslices > that most mainstream OSs use". Most traditional OSes have more or less fixed time slices and do the scheduling by fiddling the dynamic priority. Using total CPU used will also come to grief when used for long-running tasks. Eventually, even very low bandwidth tasks will accumulate enough total CPU to look busy. The CPU bandwidth the task is using is what needs to be controlled. Or have I not looked closely enough at what sched_granularity_ns does? Is it really a control for the decay rate of a CPU usage bandwidth metric? > >> Your new parameter sched_granularity_ns is equivalent to the concept >> of time slice in most other kernels that I've peeked inside and >> computing literature in general (going back over several decades e.g. >> the magic garden). > > note that you can set it to 0 and the box still functions - so > sched_granularity_ns, while useful for performance/bandwidth workloads, > isnt truly inherent to the design. Just like my SPA schedulers. But if you set it to zero you'll get a fairly high context switch rate with associated overhead, won't you? > > So in the announcement i just opted for a short sentence: "there's no > concept of timeslices", albeit like most short stentences it's not a > technically 100% accurate statement - but still it conveyed the intended > information more effectively to the interested lkml reader than the > longer version could ever have =B-) I hope that I implied that I was being picky :-) (I meant to -- imply I was being picky, that is). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
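[ Editorial sketch: the distinction Peter draws above between lifetime CPU totals and recent CPU bandwidth, using a decayed average of burst length versus scheduling-cycle length in the spirit of his later description. The constants, decay factor and field names are invented for the illustration. ]

#include <stdio.h>
#include <stdint.h>

struct usage {
	uint64_t total_ns;	/* lifetime CPU consumed: only ever grows  */
	uint64_t avg_burst_ns;	/* decayed average length of on-CPU bursts */
	uint64_t avg_cycle_ns;	/* decayed average on-CPU-to-on-CPU period */
};

/* new_avg = 7/8 * old_avg + 1/8 * sample: a simple exponential decay */
static uint64_t decay(uint64_t avg, uint64_t sample)
{
	return (avg * 7 + sample) / 8;
}

static void account(struct usage *u, uint64_t burst_ns, uint64_t cycle_ns)
{
	u->total_ns     += burst_ns;
	u->avg_burst_ns  = decay(u->avg_burst_ns, burst_ns);
	u->avg_cycle_ns  = decay(u->avg_cycle_ns, cycle_ns);
}

int main(void)
{
	struct usage u = { 0, 0, 0 };
	long i;

	/* A low-bandwidth task: 50us of CPU out of every 10ms, for a long time. */
	for (i = 0; i < 1000000; i++)
		account(&u, 50000, 10000000);

	printf("lifetime total: %llu ns (eventually looks 'busy')\n",
	       (unsigned long long)u.total_ns);
	printf("recent bandwidth: %.2f%% of the CPU\n",
	       100.0 * (double)u.avg_burst_ns / (double)u.avg_cycle_ns);
	return 0;
}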
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 10:39 ` Pekka Enberg 2007-04-15 12:45 ` Willy Tarreau @ 2007-04-15 15:16 ` Gene Heskett 2007-04-15 16:43 ` Con Kolivas 1 sibling, 1 reply; 304+ messages in thread From: Gene Heskett @ 2007-04-15 15:16 UTC (permalink / raw) To: Pekka Enberg Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Pekka Enberg wrote: >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: >> The perception here is that there is that there is this expectation that >> sections of the Linux kernel are intentionally "churn squated" to prevent >> any other ideas from creeping in other than of the owner of that subsytem > >Strangely enough, my perception is that Ingo is simply trying to >address the issues Mike's testing discovered in RDSL and SD. It's not >surprising Ingo made it a separate patch set as Con has repeatedly >stated that the "problems" are in fact by design and won't be fixed. I won't get into the middle of this just yet, not having decided which dog I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for about 24 hours, it's been generally usable, but gzip still causes lots of 5 to 10+ second lags when it's running. I'm coming to the conclusion that gzip simply doesn't play well with others... Amazing to me, the cpu it's using stays generally below 80%, and often below 60%, even while the kmail composer has a full sentence in its buffer that it still hasn't shown me when I switch to the htop screen to check, and back to the kmail screen to see if it's updated yet. The screen switch doesn't seem to lag so I don't think renicing x would be helpful. Those are the obvious lags, and I'll build & reboot to the CFS patch at some point this morning (what's left of it, that is :). And report in due time of course -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) knot in cables caused data stream to become twisted and kinked ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:16 ` Gene Heskett @ 2007-04-15 16:43 ` Con Kolivas 2007-04-15 16:58 ` Gene Heskett 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-15 16:43 UTC (permalink / raw) To: Gene Heskett Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 01:16, Gene Heskett wrote: > On Sunday 15 April 2007, Pekka Enberg wrote: > >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > >> The perception here is that there is that there is this expectation that > >> sections of the Linux kernel are intentionally "churn squated" to > >> prevent any other ideas from creeping in other than of the owner of that > >> subsytem > > > >Strangely enough, my perception is that Ingo is simply trying to > >address the issues Mike's testing discovered in RDSL and SD. It's not > >surprising Ingo made it a separate patch set as Con has repeatedly > >stated that the "problems" are in fact by design and won't be fixed. > > I won't get into the middle of this just yet, not having decided which dog > I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for > about 24 hours, its been generally usable, but gzip still causes lots of 5 > to 10+ second lags when its running. I'm coming to the conclusion that > gzip simply doesn't play well with others... Actually Gene I think you're being bitten here by something I/O bound since the cpu usage never tops out. If that's the case and gzip is dumping truckloads of writes then you're suffering something that irks me even more than the scheduler in linux, and that's how much writes hurt just about everything else. Try your testcase with bzip2 instead (since that won't be i/o bound), or drop your dirty ratio to as low as possible which helps a little bit (5% is the minimum) echo 5 > /proc/sys/vm/dirty_ratio and finally try the braindead noop i/o scheduler as well. echo noop > /sys/block/sda/queue/scheduler (replace sda with your drive obviously). I'd wager a big one that's what causes your gzip pain. If it wasn't for the fact that I've decided to all but give up ever trying to provide code for mainline again, trying my best to make writes hurt less on linux would be my next big thing [tm]. Oh and for the others watching, (points to vm hackers) I found a bug when playing with the dirty ratio code. If you modify it to allow it drop below 5% but still above the minimum in the vm code, stalls happen somewhere in the vm where nothing much happens for sometimes 20 or 30 seconds worst case scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to be set ultra low because these stalls were gross. > Amazing to me, the cpu its using stays generally below 80%, and often below > 60%, even while the kmail composer has a full sentence in its buffer that > it still hasn't shown me when I switch to the htop screen to check, and > back to the kmail screen to see if its updated yet. The screen switch > doesn't seem to lag so I don't think renicing x would be helpfull. Those > are the obvious lags, and I'll build & reboot to the CFS patch at some > point this morning (whats left of it that is :). And report in due time of > course -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:43 ` Con Kolivas @ 2007-04-15 16:58 ` Gene Heskett 2007-04-15 18:00 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Gene Heskett @ 2007-04-15 16:58 UTC (permalink / raw) To: Con Kolivas Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Con Kolivas wrote: >On Monday 16 April 2007 01:16, Gene Heskett wrote: >> On Sunday 15 April 2007, Pekka Enberg wrote: >> >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: >> >> The perception here is that there is that there is this expectation >> >> that sections of the Linux kernel are intentionally "churn squated" to >> >> prevent any other ideas from creeping in other than of the owner of >> >> that subsytem >> > >> >Strangely enough, my perception is that Ingo is simply trying to >> >address the issues Mike's testing discovered in RDSL and SD. It's not >> >surprising Ingo made it a separate patch set as Con has repeatedly >> >stated that the "problems" are in fact by design and won't be fixed. >> >> I won't get into the middle of this just yet, not having decided which dog >> I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for >> about 24 hours, its been generally usable, but gzip still causes lots of 5 >> to 10+ second lags when its running. I'm coming to the conclusion that >> gzip simply doesn't play well with others... > >Actually Gene I think you're being bitten here by something I/O bound since >the cpu usage never tops out. If that's the case and gzip is dumping >truckloads of writes then you're suffering something that irks me even more >than the scheduler in linux, and that's how much writes hurt just about >everything else. Try your testcase with bzip2 instead (since that won't be >i/o bound), or drop your dirty ratio to as low as possible which helps a >little bit (5% is the minimum) > >echo 5 > /proc/sys/vm/dirty_ratio > >and finally try the braindead noop i/o scheduler as well. > >echo noop > /sys/block/sda/queue/scheduler > >(replace sda with your drive obviously). > >I'd wager a big one that's what causes your gzip pain. If it wasn't for the >fact that I've decided to all but give up ever trying to provide code for >mainline again, trying my best to make writes hurt less on linux would be my >next big thing [tm]. Chuckle, possibly but then I'm not anything even remotely close to an expert here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and profanity as the case may call for. >Oh and for the others watching, (points to vm hackers) I found a bug when >playing with the dirty ratio code. If you modify it to allow it drop below > 5% but still above the minimum in the vm code, stalls happen somewhere in > the vm where nothing much happens for sometimes 20 or 30 seconds worst case > scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to > be set ultra low because these stalls were gross. I think I'd need a bit of tutoring on how to do that. I recall that one other time, several weeks back, I thought I would try one of those famous echo this >/proc/that ideas that went by on this list, but even though I was root, apparently /proc was read-only AFAIWC. 
>> Amazing to me, the cpu its using stays generally below 80%, and often >> below 60%, even while the kmail composer has a full sentence in its buffer >> that it still hasn't shown me when I switch to the htop screen to check, >> and back to the kmail screen to see if its updated yet. The screen switch >> doesn't seem to lag so I don't think renicing x would be helpfull. Those >> are the obvious lags, and I'll build & reboot to the CFS patch at some >> point this morning (whats left of it that is :). And report in due time >> of course And now I wonder if I applied the right patch. This one feels good ATM, but I don't think its the CFS thingy. No, I'm sure of it now, none of the patches I've saved say a thing about CFS. Backtrack up the list time I guess, ignore me for the nonce. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Microsoft: Re-inventing square wheels -- From a Slashdot.org post ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:58 ` Gene Heskett @ 2007-04-15 18:00 ` Mike Galbraith 2007-04-16 0:18 ` Gene Heskett 0 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 18:00 UTC (permalink / raw) To: Gene Heskett Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote: > Chuckle, possibly but then I'm not anything even remotely close to an expert > here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 + > sched-mike-5.patch for grins and giggles, or frowns and profanity as the case > may call for. Erm, that patch is embarrassingly buggy, so profanity should dominate. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 18:00 ` Mike Galbraith @ 2007-04-16 0:18 ` Gene Heskett 0 siblings, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-16 0:18 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Mike Galbraith wrote: >On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote: >> Chuckle, possibly but then I'm not anything even remotely close to an >> expert here Con, just reporting what I get. And I just rebooted to >> 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and >> profanity as the case may call for. > >Erm, that patch is embarrassingly buggy, so profanity should dominate. > > -Mike Chuckle, ROTFLMAO even. I didn't run it that long as I immediately rebuilt and rebooted when I found I'd used the wrong patch, and in fact had tested that one and found it sub-optimal before I'd built and ran Con's -0.40 version. As for bugs of the type that make it to the screen or logs, I didn't see any. OTOH, my eyesight is slowly going downhill, now 20/25. It was 20/10 30 years ago. Now thats reason for profanity... -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Unix weanies are as bad at this as anyone. -- Larry Wall in <199702111730.JAA28598@wall.org> ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 5:16 ` Bill Huey 2007-04-15 8:44 ` Ingo Molnar @ 2007-04-15 16:11 ` Bernd Eckenfels 1 sibling, 0 replies; 304+ messages in thread From: Bernd Eckenfels @ 2007-04-15 16:11 UTC (permalink / raw) To: linux-kernel In article <20070415051645.GA28438@gnuppy.monkey.org> you wrote: > A development process like this is likely to exclude smart people from wanting > to contribute to Linux and folks should be conscious about this issues. Nobody is excluded, you can always have a next iteration. Gruss Bernd ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` Con Kolivas 2007-04-15 5:16 ` Bill Huey @ 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 8:36 ` Bill Huey 2007-04-17 0:06 ` Peter Williams 2007-04-15 15:05 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 6:43 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Peter Williams, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote: > On Saturday 14 April 2007 06:21, Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > [CFS] > > > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > > > This project is a complete rewrite of the Linux task scheduler. My goal > > is to address various feature requests and to fix deficiencies in the > > vanilla scheduler that were suggested/found in the past few years, both > > for desktop scheduling and for server scheduling workloads. > > The casual observer will be completely confused by what on earth has happened > here so let me try to demystify things for them. [...] Demystify what? The casual observer need only read either your attempt at writing a scheduler, or my attempts at fixing the one we have, to see that it was high time for someone with the necessary skills to step in. Now progress can happen, which was _not_ happening before. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 6:43 ` Mike Galbraith @ 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Bill Huey @ 2007-04-15 8:36 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > [...] > > Demystify what? The casual observer need only read either your attempt Here's the problem. You're a casual observer and obviously not paying attention. > at writing a scheduler, or my attempts at fixing the one we have, to see > that it was high time for someone with the necessary skills to step in. > Now progress can happen, which was _not_ happening before. I think that's inaccurate and there are plenty of folks that have that technical skill and background. The scheduler code isn't a deep mystery and there are plenty of good kernel hackers out here across many communities. Ingo isn't the only person on this planet to have deep scheduler knowledge. Priority heaps are not new and Solaris has had a pluggable scheduler framework for years. Con's characterization is something that I'm more prone to believe about how Linux kernel development works versus your view. I think it's a great shame to have folks like Bill Irwin and Con waste time trying to do something right only to have their ideas attacked, then copied and held up as the solution for this kind of technical problem in a complete reversal of technical opinion as it suits the moment. This is just wrong in so many ways. It outlines the problems with Linux kernel development and questionable elitism regarding ownership of certain sections of the kernel code. I call it "churn squat" and instances like this only support that view, which I would rather turn out to be completely wrong and inaccurate instead. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey @ 2007-04-15 8:45 ` Mike Galbraith 2007-04-15 9:06 ` Ingo Molnar 2007-04-15 16:25 ` Arjan van de Ven 2 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 8:45 UTC (permalink / raw) To: Bill Huey Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 01:36 -0700, Bill Huey wrote: > On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > > [...] > > > > Demystify what? The casual observer need only read either your attempt > > Here's the problem. You're a casual observer and obviously not paying > attention. > > > at writing a scheduler, or my attempts at fixing the one we have, to see > > that it was high time for someone with the necessary skills to step in. > > Now progress can happen, which was _not_ happening before. > > I think that's inaccurate and there are plenty of folks that have that > technical skill and background. The scheduler code isn't a deep mystery > and there are plenty of good kernel hackers out here across many > communities. Ingo isn't the only person on this planet to have deep > scheduler knowledge. Ok <shrug>, I'm not paying attention, and you can't read. We're even. Have a nice life. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith @ 2007-04-15 9:06 ` Ingo Molnar 2007-04-16 10:00 ` Ingo Molnar 2007-04-15 16:25 ` Arjan van de Ven 2 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 9:06 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > > [...] > > > > Demystify what? The casual observer need only read either your > > attempt > > Here's the problem. You're a casual observer and obviously not paying > attention. guys, please calm down. Judging by the number of contributions to sched.c the main folks who are not 'observers' here and who thus have an unalienable right to be involved in a nasty flamewar about scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else is just a happy bystander, ok? ;-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 9:06 ` Ingo Molnar @ 2007-04-16 10:00 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 10:00 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Peter Williams, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > guys, please calm down. Judging by the number of contributions to > sched.c the main folks who are not 'observers' here and who thus have > an unalienable right to be involved in a nasty flamewar about > scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else > is just a happy bystander, ok? ;-) just to make sure: this is a short (and incomplete) list of contributors related to scheduler interactivity code. The full list of contributors to sched.c includes many other people as well: Peter, Suresh, Christoph, Kenneth and many others. Even the git logs, which only span 2 years out of 15, already list 79 individual contributors to kernel/sched.c. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith 2007-04-15 9:06 ` Ingo Molnar @ 2007-04-15 16:25 ` Arjan van de Ven 2007-04-16 5:36 ` Bill Huey 2 siblings, 1 reply; 304+ messages in thread From: Arjan van de Ven @ 2007-04-15 16:25 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Thomas Gleixner > It outlines the problems with Linux kernel development and questionable > elistism regarding ownership of certain sections of the kernel code. I have to step in and disagree here.... Linux is not about who writes the code. Linux is about getting the best solution for a problem. Who wrote which line of the code is irrelevant in the big picture. That often means that multiple implementations happen, and that a darwinistic process decides that the best solution wins. This darwinistic process often happens in the form of discussion, and that discussion can happen with words or with code. In this case it happened with a code proposal. To make this specific: it has happened many times to me that when I solved an issue with code, someone else stepped in and wrote a different solution (although that was usually for smaller pieces). Was I upset about that? No! I was happy because my *problem got solved* in the best possible way. Now this doesn't mean that people shouldn't be nice to each other, not cooperate or steal credits, but I don't get the impression that that is happening here. Ingo is taking part in the discussion with a counter proposal for discussion *on the mailing list*. What more do you want?? If you or anyone else can improve it or do better, take part in this discussion and show what you mean either in words or in code. Your qualification of the discussion as an elitist takeover... I disagree with that. It's a *discussion*. Now if you agree that Ingo's patch is better technically, you and others should be happy about that because your problem is getting solved better. If you don't agree that his patch is better technically, take part in the technical discussion. -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:25 ` Arjan van de Ven @ 2007-04-16 5:36 ` Bill Huey 2007-04-16 6:17 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Bill Huey @ 2007-04-16 5:36 UTC (permalink / raw) To: Arjan van de Ven Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote: > Now this doesn't mean that people shouldn't be nice to each other, not > cooperate or steal credits, but I don't get the impression that that is > happening here. Ingo is taking part in the discussion with a counter > proposal for discussion *on the mailing list*. What more do you want?? Con should have been CCed from the first moment this was put into motion to limit the perception of exclusion. That was mistake number one and a big-time failure to understand this dynamic. After all, it was Con's idea. Why the hell he was excluded from Ingo's development process is baffling to me and him (most likely). He put a lot of effort into SD and his experiences with scheduling should still be seriously considered in this development process even if he doesn't write a single line of code from this moment on. What should have happened is that our very busy associate at RH by the name of Ingo Molnar should have leveraged more of Con's and Bill's work and used them as a proxy for his own ideas. They would have loved to have contributed more and our very busy Ingo Molnar would have gotten a lot of his work and ideas implemented without him even opening a single source file for editing. They would have happily done this work for Ingo. Ingo could have been used for something else more important like making KVM less of a freaking ugly hack and we all would have benefitted from this. He could have been working on SystemTap so that you stop losing accounts to Sun and Solaris 10's DTrace. He could have been working with Riel to fix your butt ugly page scanning problem causing horrible contention via the Clock/Pro algorithm, etc... He could have been fixing the ugly futex rwsem mapping problem that's killing -rt and anything that uses Posix threads. He could have created a userspace thread control block (TCB) with Mr. Drepper so that we can turn off preemption in userspace (userspace per CPU local storage) and implement a very quick non-kernel crossing implementation of priority ceilings (userspace check for priority and flags at preempt_schedule() in the TCB) so that our -rt Posix API doesn't suck donkey shit... Need I say more? As programmers like Ingo get spread more thinly, he needs super smart folks like Bill Irwin and Con to help him out, and he needs to learn to resist NIHing folks' stuff out of some weird fear. When this happens, folks like Ingo must learn to "facilitate" development in addition to implementing it with those kinds of folks. It takes time and practice to learn to entrust folks to do things for him. Ingo is the best method of getting new Linux kernel ideas communicated to Linus. His value goes beyond just code; he is often the biggest hammer we have in the Linux community to get stuff into the kernel. "Facilitation" of others is something that solo programmers need when groups like the Linux kernel get larger and larger every year. Understand? Are we in embarrassing agreement here? bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:36 ` Bill Huey @ 2007-04-16 6:17 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-16 6:17 UTC (permalink / raw) To: Bill Huey Cc: Arjan van de Ven, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner On Sun, Apr 15, 2007 at 10:36:29PM -0700, Bill Huey wrote: > On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote: > > Now this doesn't mean that people shouldn't be nice to each other, not > > cooperate or steal credits, but I don't get the impression that that is > > happening here. Ingo is taking part in the discussion with a counter > > proposal for discussion *on the mailing list*. What more do you want?? > > Con should have been CCed from the first moment this was put into motion > to limit the perception of exclusion. That was mistake number one and big > time failures to understand this dynamic. After it was Con's idea. Why > the hell he was excluded from Ingo's development process is baffling to > me and him (most likely). Ingo's scheduler is completely different to any I've seen proposed for Linux. And after he did an initial implementation, he did post it to everyone. Maybe something he said offended someone, but the process followed is exactly how Linux kernel development works (ie. if you think you can do better, then write the code). Sometimes you can give suggestions, but other times if you come up with a different idea then it is better just to do it yourself. Con's code is still out there. If it is better than Ingo's then it should win out. Nobody has a monopoly on schedulers or ideas or posting patches. > He put int a lot of effort into SDL and his experiences with scheduling > should still be seriously considered in this development process even if > he doesn't write a single line of code from this moment on. > > What should have happened is that our very busy associate at RH by the > name of Ingo Molnar should have leverage more of Con's and Bill's work > and use them as a proxy for his own ideas. They would have loved to have > contributed more and our very busy Ingo Molnar would have gotten a lot > of his work and ideas implemented without him even opening a single > source file for editting. They would have happily done this work for > Ingo. Ingo could have been used for something else more important like > making KVM less of a freaking ugly hack and we all would have benefitted > from this. > > He could have been working on SystemTap so that you stop losing accounts > to Sun and Solaris 10's Dtrace. He could have been working with Riel to > fix your butt ugly page scanning problem causing horrible contention via > the Clock/Pro algorithm, etc... He could have been fixing the ugly futex > rwsem mapping problem that's killing -rt and anything that uses Posix > threads. He could have created a userspace thread control block (TCB) > with Mr. Drepper so that we can turn off preemption in userspace > (userspace per CPU local storage) and implement a very quick non-kernel > crossing implementation of priority ceilings (userspace check for priority > and flags at preempt_schedule() in the TCB) so that our -rt Posix API > doesn't suck donkey shit... Need I say more ? Well that's some pretty strong criticism of Linux and of someone who does a great deal to improve things... Let's stick to the topic of schedulers in this thread and try keeping it constructive. 
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 8:36 ` Bill Huey @ 2007-04-17 0:06 ` Peter Williams 2007-04-17 2:29 ` Mike Galbraith 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 0:06 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner Mike Galbraith wrote: > On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote: >> On Saturday 14 April 2007 06:21, Ingo Molnar wrote: >>> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler >>> [CFS] >>> >>> i'm pleased to announce the first release of the "Modular Scheduler Core >>> and Completely Fair Scheduler [CFS]" patchset: >>> >>> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch >>> >>> This project is a complete rewrite of the Linux task scheduler. My goal >>> is to address various feature requests and to fix deficiencies in the >>> vanilla scheduler that were suggested/found in the past few years, both >>> for desktop scheduling and for server scheduling workloads. >> The casual observer will be completely confused by what on earth has happened >> here so let me try to demystify things for them. > > [...] > > Demystify what? The casual observer need only read either your attempt > at writing a scheduler, or my attempts at fixing the one we have, to see > that it was high time for someone with the necessary skills to step in. Make that "someone with the necessary clout". > Now progress can happen, which was _not_ happening before. > This is true. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 0:06 ` Peter Williams @ 2007-04-17 2:29 ` Mike Galbraith 2007-04-17 3:40 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-17 2:29 UTC (permalink / raw) To: Peter Williams Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > Mike Galbraith wrote: > > > > Demystify what? The casual observer need only read either your attempt > > at writing a scheduler, or my attempts at fixing the one we have, to see > > that it was high time for someone with the necessary skills to step in. > > Make that "someone with the necessary clout". No, I was brutally honest to both of us, but quite correct. > > Now progress can happen, which was _not_ happening before. > > > > This is true. Yup, and progress _is_ happening now, quite rapidly. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 2:29 ` Mike Galbraith @ 2007-04-17 3:40 ` Nick Piggin 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 4:17 ` Peter Williams 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 3:40 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > > Mike Galbraith wrote: > > > > > > Demystify what? The casual observer need only read either your attempt > > > at writing a scheduler, or my attempts at fixing the one we have, to see > > > that it was high time for someone with the necessary skills to step in. > > > > Make that "someone with the necessary clout". > > No, I was brutally honest to both of us, but quite correct. > > > > Now progress can happen, which was _not_ happening before. > > > > > > > This is true. > > Yup, and progress _is_ happening now, quite rapidly. Progress as in progress on Ingo's scheduler. I still don't know how we'd decide when to replace the mainline scheduler or with what. I don't think we can say Ingo's is better than the alternatives, can we? If there is some kind of bakeoff, then I'd like one of Con's designs to be involved, and mine, and Peter's... Maybe the progress is that more key people are becoming open to the idea of changing the scheduler. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:40 ` Nick Piggin @ 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 4:14 ` Nick Piggin 2007-04-20 20:36 ` Bill Davidsen 2007-04-17 4:17 ` Peter Williams 1 sibling, 2 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-17 4:01 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > > Yup, and progress _is_ happening now, quite rapidly. > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > decide when to replace the mainline scheduler or with what. > > I don't think we can say Ingo's is better than the alternatives, can we? No, that would require massive performance testing of all alternatives. > If there is some kind of bakeoff, then I'd like one of Con's designs to > be involved, and mine, and Peter's... The trouble with a bakeoff is that it's pretty darn hard to get people to test in the first place, and then comes weighting the subjective and hard performance numbers. If they're close in numbers, do you go with the one which starts the least flamewars or what? > Maybe the progress is that more key people are becoming open to the idea > of changing the scheduler. Could be. All was quiet for quite a while, but when RSDL showed up, it aroused enough interest to show that scheduling woes is on folks radar. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:01 ` Mike Galbraith @ 2007-04-17 4:14 ` Nick Piggin 2007-04-17 6:26 ` Peter Williams 2007-04-17 9:51 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 4:14 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 06:01:29AM +0200, Mike Galbraith wrote: > On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: > > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > > > > Yup, and progress _is_ happening now, quite rapidly. > > > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > > decide when to replace the mainline scheduler or with what. > > > > I don't think we can say Ingo's is better than the alternatives, can we? > > No, that would require massive performance testing of all alternatives. > > > If there is some kind of bakeoff, then I'd like one of Con's designs to > > be involved, and mine, and Peter's... > > The trouble with a bakeoff is that it's pretty darn hard to get people > to test in the first place, and then comes weighting the subjective and > hard performance numbers. If they're close in numbers, do you go with > the one which starts the least flamewars or what? I don't know how you'd do it. I know you wouldn't count people telling you how good they are (getting people to tell you how bad they are, and whether others do better in a given situation might be slightly more viable). But we have to choose somehow. I'd hope that is going to be based solely on the results and technical properties of the code, so... if we were to somehow determine that the results are exactly the same, we'd go for the simpler one, wouldn't we? > > Maybe the progress is that more key people are becoming open to the idea > > of changing the scheduler. > > Could be. All was quiet for quite a while, but when RSDL showed up, it > aroused enough interest to show that scheduling woes is on folks radar. Well I know people have had woes with the scheduler for ever (I guess that isn't going to change :P). I think people generally lost a bit of interest in trying to improve the situation because of the upstream problem. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:14 ` Nick Piggin @ 2007-04-17 6:26 ` Peter Williams 2007-04-17 9:51 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 6:26 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > Well I know people have had woes with the scheduler for ever (I guess that > isn't going to change :P). I think people generally lost a bit of interest > in trying to improve the situation because of the upstream problem. Yes. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:14 ` Nick Piggin 2007-04-17 6:26 ` Peter Williams @ 2007-04-17 9:51 ` Ingo Molnar 2007-04-17 13:44 ` Peter Williams 2007-04-20 20:47 ` Bill Davidsen 1 sibling, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:51 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Peter Williams, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > > Maybe the progress is that more key people are becoming open to > > > the idea of changing the scheduler. > > > > Could be. All was quiet for quite a while, but when RSDL showed up, > > it aroused enough interest to show that scheduling woes is on folks > > radar. > > Well I know people have had woes with the scheduler for ever (I guess > that isn't going to change :P). [...] yes, that part isnt going to change, because the CPU is a _scarce resource_ that is perhaps the most frequently overcommitted physical computer resource in existence, and because the kernel does not (yet) track eye movements of humans to figure out which tasks are more important to them. So critical human constraints are unknown to the scheduler and thus complaints will always come. The upstream scheduler thought it had enough information: the sleep average. So now the attempt is to go back and _simplify_ the scheduler and remove that information, and concentrate on getting fairness precisely right. The magic thing about 'fairness' is that it's a pretty good default policy if we decide that we simply do not have enough information to make an intelligent choice. ( Let's be cautious though: the jury is still out whether people actually like this more than the current approach. While CFS feedback looks promising after a whopping 3 days of it being released [ ;-) ], the test coverage of all 'fairness centric' schedulers, even considering years of availability, is less than 1% i'm afraid, and that < 1% was mostly self-selecting. ) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:51 ` Ingo Molnar @ 2007-04-17 13:44 ` Peter Williams 2007-04-17 23:00 ` Michael K. Edwards 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 13:44 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Nick Piggin <npiggin@suse.de> wrote: > >>>> Maybe the progress is that more key people are becoming open to >>>> the idea of changing the scheduler. >>> Could be. All was quiet for quite a while, but when RSDL showed up, >>> it aroused enough interest to show that scheduling woes is on folks >>> radar. >> Well I know people have had woes with the scheduler for ever (I guess >> that isn't going to change :P). [...] > > yes, that part isnt going to change, because the CPU is a _scarce > resource_ that is perhaps the most frequently overcommitted physical > computer resource in existence, and because the kernel does not (yet) > track eye movements of humans to figure out which tasks are more > important them. So critical human constraints are unknown to the > scheduler and thus complaints will always come. > > The upstream scheduler thought it had enough information: the sleep > average. So now the attempt is to go back and _simplify_ the scheduler > and remove that information, and concentrate on getting fairness > precisely right. The magic thing about 'fairness' is that it's a pretty > good default policy if we decide that we simply have not enough > information to do an intelligent choice. > > ( Lets be cautious though: the jury is still out whether people actually > like this more than the current approach. While CFS feedback looks > promising after a whopping 3 days of it being released [ ;-) ], the > test coverage of all 'fairness centric' schedulers, even considering > years of availability is less than 1% i'm afraid, and that < 1% was > mostly self-selecting. ) At this point I'd like to make the observation that spa_ebs is a very fair scheduler if you consider "nice" to be an indication of the relative entitlement of tasks to CPU bandwidth. It works by mapping nice to shares using a function very similar to the one for calculating p->load_weight except it's not offset by the RT priorities as RT is handled separately. In theory, a runnable task's entitlement to CPU bandwidth at any time is the ratio of its shares to the total shares held by runnable tasks on the same CPU (in reality, a smoothed average of this sum is used to make scheduling smoother). The dynamic priorities of the runnable tasks are then fiddled to try to keep each task's CPU bandwidth usage in proportion to its entitlement. That's the theory anyway. The actual implementation looks a bit different due to efficiency considerations. The modifications to the above theory boil down to keeping a running measure of the (recent) highest CPU bandwidth use per share for tasks running on the CPU -- I call this the yardstick for this CPU. When it's time to put a task on the run queue, its dynamic priority is determined by comparing its CPU bandwidth per share value with the yardstick for its CPU. If it's greater than the yardstick, this value becomes the new yardstick and the task gets given the lowest possible dynamic priority (for its scheduling class).
If the value is zero, it gets the highest possible priority (for its scheduling class), which would be MAX_RT_PRIO for a SCHED_OTHER task. Otherwise it gets given a priority between these two extremes proportional to the ratio of its CPU bandwidth per share value and the yardstick. Quite simple really. The other way in which the code deviates from the original is that (for a few years now) I no longer calculate CPU bandwidth usage directly. I've found that the overhead is less if I keep a running average of the size of a task's CPU bursts and the length of its scheduling cycle (i.e. from on CPU one time to on CPU next time) and use the ratio of these values as a measure of bandwidth usage. Anyway, it works and gives very predictable allocations of CPU bandwidth based on nice. Another good feature is that (in this pure form) it's starvation free. However, if you fiddle with it and do things like giving bonus priority boosts to interactive tasks it becomes susceptible to starvation. This can be fixed by using an anti-starvation mechanism such as SPA's promotion scheme, and that's what spa_ebs does. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
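[ Editorial sketch: the yardstick scheme Peter describes in the preceding message. The priority range, the type names and the ratio arithmetic below are illustrative assumptions, not his actual spa_ebs code. ]

#include <stdint.h>
#include <stdio.h>

#define PRIO_BEST	100	/* e.g. MAX_RT_PRIO for a SCHED_OTHER task */
#define PRIO_WORST	139	/* lowest priority in the class */

struct cpu_rq { uint64_t yardstick; };			/* per-CPU running maximum   */
struct task   { uint64_t bw_per_share; int prio; };	/* recent bandwidth / shares */

static void set_dynamic_prio(struct cpu_rq *rq, struct task *p)
{
	if (p->bw_per_share == 0) {
		p->prio = PRIO_BEST;			/* used nothing recently */
	} else if (p->bw_per_share >= rq->yardstick) {
		rq->yardstick = p->bw_per_share;	/* heaviest user sets the bar */
		p->prio = PRIO_WORST;
	} else {
		/* interpolate between the extremes in proportion to the ratio */
		p->prio = PRIO_BEST + (int)((PRIO_WORST - PRIO_BEST) *
					    p->bw_per_share / rq->yardstick);
	}
}

int main(void)
{
	struct cpu_rq rq = { .yardstick = 0 };
	struct task hog = { .bw_per_share = 800 };
	struct task light = { .bw_per_share = 80 };

	set_dynamic_prio(&rq, &hog);	/* becomes the yardstick, gets 139 */
	set_dynamic_prio(&rq, &light);	/* 10% of the yardstick, ends up near 100 */
	printf("hog prio=%d light prio=%d\n", hog.prio, light.prio);
	return 0;
}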
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:44 ` Peter Williams @ 2007-04-17 23:00 ` Michael K. Edwards 2007-04-17 23:07 ` William Lee Irwin III 2007-04-18 2:39 ` Peter Williams 0 siblings, 2 replies; 304+ messages in thread From: Michael K. Edwards @ 2007-04-17 23:00 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > The other way in which the code deviates from the original as that (for > a few years now) I no longer calculated CPU bandwidth usage directly. > I've found that the overhead is less if I keep a running average of the > size of a tasks CPU bursts and the length of its scheduling cycle (i.e. > from on CPU one time to on CPU next time) and using the ratio of these > values as a measure of bandwidth usage. > > Anyway it works and gives very predictable allocations of CPU bandwidth > based on nice. Works, that is, right up until you add nonlinear interactions with CPU speed scaling. From my perspective as an embedded platform integrator, clock/voltage scaling is the elephant in the scheduler's living room. Patch in DPM (now OpPoint?) to scale the clock based on what task is being scheduled, and suddenly the dynamic priority calculations go wild. Nip this in the bud by putting an RT priority on the relevant threads (which you have to do anyway if you need remotely audio-grade latency), and the lock affinity heuristics break, so you have to hand-tune all the thread priorities. Blecch. Not to mention the likelihood that the task whose clock speed you're trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority than the application. (You want to crank the CPU for this task because it runs with the RF hot, which may cost you as much power as the rest of the platform.) You'd better hope you can remove it from the dynamic priority heuristics with SCHED_BATCH. Otherwise everything _else_ has to be RT priority (or it'll be starved by the soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double blecch! Is it too much to ask for someone with actual engineering training (not me, unfortunately) to sit down and build a negative-feedback control system that handles soft-real-time _and_ dynamic-priority _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock scaling? And actually separates the accounting and control mechanisms from the heuristics, so the latter can be tuned (within a well documented stable range) to reflect the expected system usage patterns? It's not like there isn't a vast literature in this area over the past decade, including some dealing specifically with clock scaling consistent with low-latency applications. It's a pity that people doing academic work in this area rarely wade into LKML, even when they're hacking on a Linux fork. But then, there's not much economic incentive for them to do so, and they can usually get their fill of citation politics and dominance games without leaving their home department. :-P Seriously, though. If you're really going to put the mainline scheduler through this kind of churn, please please pretty please knit in per-task clock scaling (possibly even rejigged during the slice; see e. g. 
Yuan and Nahrstedt's GRACE-OS papers) and some sort of linger mechanism to keep from taking context switch hits when you're confident that an I/O will complete quickly. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:00 ` Michael K. Edwards @ 2007-04-17 23:07 ` William Lee Irwin III 2007-04-17 23:52 ` Michael K. Edwards 2007-04-18 2:39 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 23:07 UTC (permalink / raw) To: Michael K. Edwards Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:00:53PM -0700, Michael K. Edwards wrote: > Works, that is, right up until you add nonlinear interactions with CPU > speed scaling. From my perspective as an embedded platform > integrator, clock/voltage scaling is the elephant in the scheduler's > living room. Patch in DPM (now OpPoint?) to scale the clock based on > what task is being scheduled, and suddenly the dynamic priority > calculations go wild. Nip this in the bud by putting an RT priority > on the relevant threads (which you have to do anyway if you need > remotely audio-grade latency), and the lock affinity heuristics break, > so you have to hand-tune all the thread priorities. Blecch. [...not terribly enlightening stuff trimmed...] The ongoing scheduler work is on a much more basic level than these affairs I'm guessing you googled. When the basics work as intended it will be possible to move on to more advanced issues. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:07 ` William Lee Irwin III @ 2007-04-17 23:52 ` Michael K. Edwards 2007-04-18 0:36 ` Bill Huey 0 siblings, 1 reply; 304+ messages in thread From: Michael K. Edwards @ 2007-04-17 23:52 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote: > The ongoing scheduler work is on a much more basic level than these > affairs I'm guessing you googled. When the basics work as intended it > will be possible to move on to more advanced issues. OK, let me try this in smaller words for people who can't tell bitter experience from Google hits. CPU clock scaling for power efficiency is already the only thing that matters about the Linux scheduler in my world, because battery-powered device vendors in their infinite wisdom are abandoning real RTOSes in favor of Linux now that WiFi is the "in" thing (again). And on the timescale that anyone will actually be _using_ this shiny new scheduler of Ingo's, it'll be nearly the only thing that matters about the Linux scheduler in anyone's world, because the amount of work the CPU can get done in a given minute will depend mostly on how intelligently it can spend its heat dissipation budget. Clock scaling schemes that aren't integral to the scheduler design make a bad situation (scheduling embedded loads with shotgun heuristics tuned for desktop CPUs) worse, because the opaque heuristics are now being applied to distorted data. Add a "smoothing" scheme for the distorted data, and you may find that you have introduced an actual control-path instability. A small fluctuation in the data (say, two bursts of interrupt traffic at just the right interval) can result in a long-lasting oscillation in some task's "dynamic priority" -- and, on a fully loaded CPU, in the time that task actually gets. If anything else depends on how much work this task gets done each time around, the oscillation can easily propagate throughout the system. Thrash city. (If you haven't seen this happen on real production systems under what shouldn't be a pathological load, you haven't been around long. The classic mechanisms that triggered oscillations in, say, early SMP Solaris boxes haven't bitten recently, perhaps because most modern CPUs don't lose their marbles so comprehensively on context switch. But I got to live this nightmare again recently on ARM Linux, due to some impressively broken application-level threading/locking "design", whose assumptions about scheduler behavior got broken when I switched to an NPTL toolchain.) I don't have the training to design a scheduler that isn't vulnerable to control-feedback oscillations. Neither do you, if you haven't taken (and excelled at) a control theory course, which nowadays seems to be taught by applied math and ECE departments and too often skipped by CS types. But I can recognize an impending train wreck when I see it. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:52 ` Michael K. Edwards @ 2007-04-18 0:36 ` Bill Huey 0 siblings, 0 replies; 304+ messages in thread From: Bill Huey @ 2007-04-18 0:36 UTC (permalink / raw) To: Michael K. Edwards Cc: William Lee Irwin III, Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Tue, Apr 17, 2007 at 04:52:08PM -0700, Michael K. Edwards wrote: > On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote: > >The ongoing scheduler work is on a much more basic level than these > >affairs I'm guessing you googled. When the basics work as intended it > >will be possible to move on to more advanced issues. ... Will probably shouldn't have dismissed your points, but he probably means that you can't even get at this stuff until the fundamentals are in place. > Clock scaling schemes that aren't integral to the scheduler design > make a bad situation (scheduling embedded loads with shotgun > heuristics tuned for desktop CPUs) worse, because the opaque > heuristics are now being applied to distorted data. Add a "smoothing" > scheme for the distorted data, and you may find that you have > introduced an actual control-path instability. A small fluctuation in > the data (say, two bursts of interrupt traffic at just the right > interval) can result in a long-lasting oscillation in some task's > "dynamic priority" -- and, on a fully loaded CPU, in the time that > task actually gets. If anything else depends on how much work this > task gets done each time around, the oscillation can easily propagate > throughout the system. Thrash city. Hyperthreading issues are quite similar to clock scaling issues. Con's infrastructure changes to move things in that direction were rejected, as well as other infrastructure changes, further infuriating Con and leading him to drop development on RSDL and derivatives. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:00 ` Michael K. Edwards 2007-04-17 23:07 ` William Lee Irwin III @ 2007-04-18 2:39 ` Peter Williams 1 sibling, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-18 2:39 UTC (permalink / raw) To: Michael K. Edwards Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Michael K. Edwards wrote: > On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >> The other way in which the code deviates from the original as that (for >> a few years now) I no longer calculated CPU bandwidth usage directly. >> I've found that the overhead is less if I keep a running average of the >> size of a tasks CPU bursts and the length of its scheduling cycle (i.e. >> from on CPU one time to on CPU next time) and using the ratio of these >> values as a measure of bandwidth usage. >> >> Anyway it works and gives very predictable allocations of CPU bandwidth >> based on nice. > > Works, that is, right up until you add nonlinear interactions with CPU > speed scaling. From my perspective as an embedded platform > integrator, clock/voltage scaling is the elephant in the scheduler's > living room. Patch in DPM (now OpPoint?) to scale the clock based on > what task is being scheduled, and suddenly the dynamic priority > calculations go wild. Nip this in the bud by putting an RT priority > on the relevant threads (which you have to do anyway if you need > remotely audio-grade latency), and the lock affinity heuristics break, > so you have to hand-tune all the thread priorities. Blecch. > > Not to mention the likelihood that the task whose clock speed you're > trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority > than the application. (You want to crank the CPU for this task > because it runs with the RF hot, which may cost you as much power as > the rest of the platform.) You'd better hope you can remove it from > the dynamic priority heuristics with SCHED_BATCH. Otherwise > everything _else_ has to be RT priority (or it'll be starved by the > soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double > blecch! > > Is it too much to ask for someone with actual engineering training > (not me, unfortunately) to sit down and build a negative-feedback > control system that handles soft-real-time _and_ dynamic-priority > _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock > scaling? And actually separates the accounting and control mechanisms > from the heuristics, so the latter can be tuned (within a well > documented stable range) to reflect the expected system usage > patterns? > > It's not like there isn't a vast literature in this area over the past > decade, including some dealing specifically with clock scaling > consistent with low-latency applications. It's a pity that people > doing academic work in this area rarely wade into LKML, even when > they're hacking on a Linux fork. But then, there's not much economic > incentive for them to do so, and they can usually get their fill of > citation politics and dominance games without leaving their home > department. :-P > > Seriously, though. If you're really going to put the mainline > scheduler through this kind of churn, please please pretty please knit > in per-task clock scaling (possibly even rejigged during the slice; > see e. g. 
Yuan and Nahrstedt's GRACE-OS papers) and some sort of > linger mechanism to keep from taking context switch hits when you're > confident that an I/O will complete quickly. I think that this doesn't effect the basic design principles of spa_ebs but just means that the statistics that it uses need to be rethought. E.g. instead of measuring average CPU usage per burst in terms of wall clock time spent on the CPU measure it in terms of CPU capacity (for the want of a better word) used per burst. I don't have suitable hardware for investigating this line of attack further, unfortunately, and have no idea what would be the best way to calculate this new statistic. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
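The burst-ratio statistic Peter describes (a decaying average of on-CPU time per burst divided by a decaying average of the on-CPU-to-on-CPU cycle length) is easy to mock up outside the kernel. The fragment below is an illustration only: the names, the 1/8 decay weight and the fixed-point choices are invented here rather than taken from the spa_ebs sources, and it tacks on the frequency-weighted "capacity used per burst" variant Peter proposes in his reply, assuming the CPU frequency for each burst is available (e.g. from cpufreq).

#include <stdint.h>
#include <stdio.h>

/*
 * Illustration only: a decaying average of per-burst CPU time and of the
 * scheduling-cycle length (on-CPU one time to on-CPU the next), with their
 * ratio used as the bandwidth estimate.  Names, the 1/8 decay weight and
 * the fixed-point scheme are invented, not taken from spa_ebs.
 */
#define AVG_SHIFT 3                      /* new = avg + (sample - avg) / 8 */

struct ebs_stats {
	uint64_t avg_burst_ns;           /* average on-CPU time per burst     */
	uint64_t avg_cycle_ns;           /* average on-CPU-to-on-CPU interval */
};

static uint64_t decay_avg(uint64_t avg, uint64_t sample)
{
	if (sample >= avg)
		return avg + ((sample - avg) >> AVG_SHIFT);
	return avg - ((avg - sample) >> AVG_SHIFT);
}

static void ebs_update(struct ebs_stats *s, uint64_t burst_ns, uint64_t cycle_ns)
{
	s->avg_burst_ns = decay_avg(s->avg_burst_ns, burst_ns);
	s->avg_cycle_ns = decay_avg(s->avg_cycle_ns, cycle_ns);
}

/* bandwidth usage estimate in parts per thousand */
static unsigned int ebs_bandwidth_permille(const struct ebs_stats *s)
{
	if (!s->avg_cycle_ns)
		return 0;
	return (unsigned int)(s->avg_burst_ns * 1000 / s->avg_cycle_ns);
}

/*
 * The variant suggested in the reply above: weight each burst by the CPU
 * frequency it ran at, so the statistic tracks capacity used rather than
 * wall-clock time.  cur_khz/max_khz would come from cpufreq; here they are
 * plain parameters.
 */
static uint64_t burst_capacity_ns(uint64_t burst_ns,
				  unsigned int cur_khz, unsigned int max_khz)
{
	return burst_ns * cur_khz / max_khz;
}

int main(void)
{
	struct ebs_stats s = { 0, 0 };
	int i;

	/* a task that runs 2ms out of every 10ms at half clock speed */
	for (i = 0; i < 32; i++)
		ebs_update(&s, burst_capacity_ns(2000000, 500000, 1000000),
			   10000000);
	printf("estimated bandwidth: %u/1000\n", ebs_bandwidth_permille(&s));
	return 0;
}

Compiled as ordinary userspace C, the toy main() settles at roughly 100/1000 for a task running 2ms of every 10ms at half clock speed, i.e. about 10% of full-speed capacity rather than the 20% a wall-clock measure would report.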
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:51 ` Ingo Molnar 2007-04-17 13:44 ` Peter Williams @ 2007-04-20 20:47 ` Bill Davidsen 2007-04-21 7:39 ` Nick Piggin 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-20 20:47 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven Ingo Molnar wrote: > ( Lets be cautious though: the jury is still out whether people actually > like this more than the current approach. While CFS feedback looks > promising after a whopping 3 days of it being released [ ;-) ], the > test coverage of all 'fairness centric' schedulers, even considering > years of availability is less than 1% i'm afraid, and that < 1% was > mostly self-selecting. ) > All of my testing has been on desktop machines, although in most cases they were really loaded desktops which had load avg 10..100 from time to time, and none were low memory machines. Up to CFS v3 I thought nicksched was my winner, now CFSv3 looks better, by not having stumbles under stupid loads. I have not tested: 1 - server loads, nntp, smtp, etc 2 - low memory machines 3 - uniprocessor systems I think this should be done before drawing conclusions. Or if someone has tried this, perhaps they would report what they saw. People are talking about smoothness, but not how many pages per second come out of their overloaded web server. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 20:47 ` Bill Davidsen @ 2007-04-21 7:39 ` Nick Piggin 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-21 7:39 UTC (permalink / raw) To: Bill Davidsen Cc: Ingo Molnar, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven On Fri, Apr 20, 2007 at 04:47:27PM -0400, Bill Davidsen wrote: > Ingo Molnar wrote: > > >( Lets be cautious though: the jury is still out whether people actually > > like this more than the current approach. While CFS feedback looks > > promising after a whopping 3 days of it being released [ ;-) ], the > > test coverage of all 'fairness centric' schedulers, even considering > > years of availability is less than 1% i'm afraid, and that < 1% was > > mostly self-selecting. ) > > > All of my testing has been on desktop machines, although in most cases > they were really loaded desktops which had load avg 10..100 from time to > time, and none were low memory machines. Up to CFS v3 I thought > nicksched was my winner, now CFSv3 looks better, by not having stumbles > under stupid loads. What base_timeslice were you using for nicksched, and what HZ? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 20:47 ` Bill Davidsen 2007-04-21 7:39 ` Nick Piggin @ 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 8:33 UTC (permalink / raw) To: Bill Davidsen Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven * Bill Davidsen <davidsen@tmr.com> wrote: > All of my testing has been on desktop machines, although in most cases > they were really loaded desktops which had load avg 10..100 from time > to time, and none were low memory machines. Up to CFS v3 I thought > nicksched was my winner, now CFSv3 looks better, by not having > stumbles under stupid loads. nice! I hope CFSv4 kept that good tradition too ;) > I have not tested: > 1 - server loads, nntp, smtp, etc > 2 - low memory machines > 3 - uniprocessor systems > > I think this should be done before drawing conclusions. Or if someone > has tried this, perhaps they would report what they saw. People are > talking about smoothness, but not how many pages per second come out > of their overloaded web server. i tested heavily swapping systems. (make -j50 workloads easily trigger that) I also tested UP systems and a handful of SMP systems. I have also tested massive_intr.c which i believe is an indicator of how fairly CPU time is distributed between partly sleeping partly running server threads. But i very much agree that diverse feedback is sought and welcome, both from those who are happy with the current scheduler and those who are unhappy about it. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 4:14 ` Nick Piggin @ 2007-04-20 20:36 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-20 20:36 UTC (permalink / raw) To: Mike Galbraith Cc: Nick Piggin, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Mike Galbraith wrote: > On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: >> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > >>> Yup, and progress _is_ happening now, quite rapidly. >> Progress as in progress on Ingo's scheduler. I still don't know how we'd >> decide when to replace the mainline scheduler or with what. >> >> I don't think we can say Ingo's is better than the alternatives, can we? > > No, that would require massive performance testing of all alternatives. > >> If there is some kind of bakeoff, then I'd like one of Con's designs to >> be involved, and mine, and Peter's... > > The trouble with a bakeoff is that it's pretty darn hard to get people > to test in the first place, and then comes weighting the subjective and > hard performance numbers. If they're close in numbers, do you go with > the one which starts the least flamewars or what? > Here we disagree... I picked a scheduler not by running benchmarks, but by running loads which piss me off with the mainline scheduler. And then I ran the other schedulers for a while to find the things, normal things I do, which resulted in bad behavior. And when I found one which had (so far) no such cases I called it my winner, but I haven't tested it under server load, so I can't begin to say it's "the best." What we need is for lots of people to run every scheduler in real life, and do "worst case analysis" by finding the cases which cause bad behavior. And if there were a way to easily choose another scheduler, call it plugable, modular, or Russian Roulette, people who found a worst case would report it (aka bitch about it) and try another. But the average user is better able to boot with an option like "sched=cfs" (or sc, or nick, or ...) than to patch and build a kernel. So if we don't get easily switched schedulers people will not test nearly as well. The best scheduler isn't the one 2% faster than the rest, it's the one with the fewest jackpot cases where it sucks. And if the mainline had multiple schedulers this testing would get done, authors would get more reports and have a better chance of fixing corner cases. Note that we really need multiple schedulers to make people happy, because fairness is not the most desirable behavior on all machines, and adding knobs probably isn't the answer. I want a server to degrade gently, I want my desktop to show my movie and echo my typing, and if that's hard on compiles or the file transfer, so be it. Con doesn't want to compromise his goals, I agree but want to have an option if I don't share them. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:40 ` Nick Piggin 2007-04-17 4:01 ` Mike Galbraith @ 2007-04-17 4:17 ` Peter Williams 2007-04-17 4:29 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 4:17 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: >> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: >>> Mike Galbraith wrote: >>>> Demystify what? The casual observer need only read either your attempt >>>> at writing a scheduler, or my attempts at fixing the one we have, to see >>>> that it was high time for someone with the necessary skills to step in. >>> Make that "someone with the necessary clout". >> No, I was brutally honest to both of us, but quite correct. >> >>>> Now progress can happen, which was _not_ happening before. >>>> >>> This is true. >> Yup, and progress _is_ happening now, quite rapidly. > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > decide when to replace the mainline scheduler or with what. > > I don't think we can say Ingo's is better than the alternatives, can we? > If there is some kind of bakeoff, then I'd like one of Con's designs to > be involved, and mine, and Peter's... I myself was thinking of this as the chance for a much needed simplification of the scheduling code and if this can be done with the result being "reasonable" it then gives us the basis on which to propose improvements based on the ideas of others such as you mention. As the size of the cpusched indicates, trying to evaluate alternative proposals based on the current O(1) scheduler is fraught. Hopefully, this initiative can fix this problem. Then we just need Ingo to listen to suggestions and he's showing signs of being willing to do this :-) > > Maybe the progress is that more key people are becoming open to the idea > of changing the scheduler. That too. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:17 ` Peter Williams @ 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 4:29 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > >>On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > >>>Mike Galbraith wrote: > >>>>Demystify what? The casual observer need only read either your attempt > >>>>at writing a scheduler, or my attempts at fixing the one we have, to see > >>>>that it was high time for someone with the necessary skills to step in. > >>>Make that "someone with the necessary clout". > >>No, I was brutally honest to both of us, but quite correct. > >> > >>>>Now progress can happen, which was _not_ happening before. > >>>> > >>>This is true. > >>Yup, and progress _is_ happening now, quite rapidly. > > > >Progress as in progress on Ingo's scheduler. I still don't know how we'd > >decide when to replace the mainline scheduler or with what. > > > >I don't think we can say Ingo's is better than the alternatives, can we? > >If there is some kind of bakeoff, then I'd like one of Con's designs to > >be involved, and mine, and Peter's... > > I myself was thinking of this as the chance for a much needed > simplification of the scheduling code and if this can be done with the > result being "reasonable" it then gives us the basis on which to propose > improvements based on the ideas of others such as you mention. > > As the size of the cpusched indicates, trying to evaluate alternative > proposals based on the current O(1) scheduler is fraught. Hopefully, I don't know why. The problem is that you can't really evaluate good proposals by looking at the code (you can say that one is bad, ie. the current one, which has a huge amount of temporal complexity and is explicitly unfair), but it is pretty hard to say one behaves well. And my scheduler for example cuts down the amount of policy code and code size significantly. I haven't looked at Con's ones for a while, but I believe they are also much more straightforward than mainline... For example, let's say all else is equal between them, then why would we go with the O(logN) implementation rather than the O(1)? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin @ 2007-04-17 5:53 ` Willy Tarreau 2007-04-17 6:10 ` Nick Piggin 2007-04-17 6:09 ` William Lee Irwin III 2007-04-17 6:23 ` Peter Williams 2 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-17 5:53 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Hi Nick, On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: (...) > And my scheduler for example cuts down the amount of policy code and > code size significantly. I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... > > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? Of course, if this is the case, the question will be raised. But as a general rule, I don't see much potential in O(1) to finely tune scheduling according to several criteria. In O(logN), you can adjust scheduling in realtime at a very low cost. Better processing of varying priorities or fork() comes to mind. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 5:53 ` Willy Tarreau @ 2007-04-17 6:10 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:10 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 07:53:55AM +0200, Willy Tarreau wrote: > Hi Nick, > > On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > (...) > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. I haven't looked at Con's ones for a while, > > but I believe they are also much more straightforward than mainline... > > > > For example, let's say all else is equal between them, then why would > > we go with the O(logN) implementation rather than the O(1)? > > Of course, if this is the case, the question will be raised. But as a > general rule, I don't see much potential in O(1) to finely tune scheduling > according to several criteria. What do you mean? By what criteria? > In O(logN), you can adjust scheduling in > realtime at a very low cost. Better processing of varying priorities or > fork() comes to mind. The main problem as I see it is choosing which task to run next and how much time to run it for. And given that there are typically far less than 58 (the number of priorities in nicksched) runnable tasks for a desktop system, I don't find it at all constraining to quantize my "next runnable" criteria onto that size of key. Even if you do expect a huge number of runnable tasks, you would hope for fewer interactive ones toward the higher end of the priority scale. Handwaving or even detailed design descriptions is simply not the best way to decide on a new scheduler, is all I'm saying. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau @ 2007-04-17 6:09 ` William Lee Irwin III 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:23 ` Peter Williams 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:09 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: >> I myself was thinking of this as the chance for a much needed >> simplification of the scheduling code and if this can be done with the >> result being "reasonable" it then gives us the basis on which to propose >> improvements based on the ideas of others such as you mention. >> As the size of the cpusched indicates, trying to evaluate alternative >> proposals based on the current O(1) scheduler is fraught. Hopefully, On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > I don't know why. The problem is that you can't really evaluate good > proposals by looking at the code (you can say that one is bad, ie. the > current one, which has a huge amount of temporal complexity and is > explicitly unfair), but it is pretty hard to say one behaves well. > And my scheduler for example cuts down the amount of policy code and > code size significantly. I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? All things are not equal; they all have different properties. I like getting rid of the queue-swapping artifacts as ebs and cfs have done, as the artifacts introduced there are nasty IMNSHO. I'm queueing up a demonstration of epoch expiry scheduling artifacts as a testcase, albeit one with no pass/fail semantics for its results, just detecting scheduler properties. That said, inequality/inequivalence is not a superiority/inferiority ranking per se. What needs to come out of these discussions is a set of standards which a candidate for mainline must pass to be considered correct and a set of performance metrics by which to rank them. Video game framerates and some sort of way to automate window wiggle tests sound like good ideas, but automating such is beyond my userspace programming abilities. An organization able to devote manpower to devising such testcases will likely have to get involved for them to happen, I suspect. On a random note, limitations on kernel address space make O(lg(n)) effectively O(1), albeit with large upper bounds on the worst case and an expected case much faster than the worst case. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
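Since the O(1)-versus-O(log N) question keeps recurring in this sub-thread, the rough shape of the two pick-next-task paths being compared may be worth seeing side by side. The following is a simplified userspace sketch, not kernel code: the bitmap half mirrors the mainline O(1) idea (per-priority run lists plus a find-first-bit over a fixed priority range) and the tree half the time-ordered-tree idea, where picking the next task is a walk to the leftmost node (CFS additionally caches that node).

#include <stddef.h>
#include <strings.h>                     /* ffs() */

#define NR_PRIO 140                      /* fixed priority range, O(1) flavour */

struct task { struct task *next; int prio; };

/* O(1) flavour: per-priority run lists plus a bitmap of non-empty lists. */
struct prio_array {
	unsigned int bitmap[(NR_PRIO + 31) / 32];
	struct task *queue[NR_PRIO];
};

static struct task *pick_next_o1(struct prio_array *a)
{
	size_t w;

	for (w = 0; w < (NR_PRIO + 31) / 32; w++)
		if (a->bitmap[w])        /* first set bit == best non-empty list */
			return a->queue[w * 32 + ffs(a->bitmap[w]) - 1];
	return NULL;
}

/* O(log N) flavour: tasks keyed on a time-ordered tree; the next task to
 * run is the leftmost node (insertion is where the log N cost lives). */
struct tnode { struct tnode *left, *right; unsigned long long key_ns; };

static struct tnode *pick_next_tree(struct tnode *root)
{
	if (!root)
		return NULL;
	while (root->left)
		root = root->left;
	return root;
}

int main(void)
{
	static struct task t = { NULL, 120 };
	struct prio_array a = { { 0 }, { NULL } };
	struct tnode n_right = { NULL, NULL, 300 };
	struct tnode n_left  = { NULL, NULL, 100 };
	struct tnode root    = { &n_left, &n_right, 200 };

	a.queue[t.prio] = &t;
	a.bitmap[t.prio / 32] |= 1u << (t.prio % 32);

	return (pick_next_o1(&a) == &t && pick_next_tree(&root) == &n_left) ? 0 : 1;
}

With a bounded number of runnable tasks the log N factor stays small in practice, which is the point made above about O(lg n) being effectively O(1) with a large constant on the worst case.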
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:09 ` William Lee Irwin III @ 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:26 ` William Lee Irwin III 2007-04-17 6:50 ` Davide Libenzi 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:15 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: > >> I myself was thinking of this as the chance for a much needed > >> simplification of the scheduling code and if this can be done with the > >> result being "reasonable" it then gives us the basis on which to propose > >> improvements based on the ideas of others such as you mention. > >> As the size of the cpusched indicates, trying to evaluate alternative > >> proposals based on the current O(1) scheduler is fraught. Hopefully, > > On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > > I don't know why. The problem is that you can't really evaluate good > > proposals by looking at the code (you can say that one is bad, ie. the > > current one, which has a huge amount of temporal complexity and is > > explicitly unfair), but it is pretty hard to say one behaves well. > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. I haven't looked at Con's ones for a while, > > but I believe they are also much more straightforward than mainline... > > For example, let's say all else is equal between them, then why would > > we go with the O(logN) implementation rather than the O(1)? > > All things are not equal; they all have different properties. I like Exactly. So we have to explore those properties and evaluate performance (in all meanings of the word). That's only logical. > On a random note, limitations on kernel address space make O(lg(n)) > effectively O(1), albeit with large upper bounds on the worst case > and an expected case much faster than the worst case. Yeah. O(n!) is also O(1) if you can put an upper bound on n ;) ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:15 ` Nick Piggin @ 2007-04-17 6:26 ` William Lee Irwin III 2007-04-17 7:01 ` Nick Piggin 2007-04-17 6:50 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:26 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: >> All things are not equal; they all have different properties. I like On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > Exactly. So we have to explore those properties and evaluate performance > (in all meanings of the word). That's only logical. Any chance you'd be willing to put down a few thoughts on what sorts of standards you'd like to set for both correctness (i.e. the bare minimum a scheduler implementation must do to be considered valid beyond not oopsing) and performance metrics (i.e. things that produce numbers for each scheduler you can compare to say "this scheduler is better than this other scheduler at this."). -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:26 ` William Lee Irwin III @ 2007-04-17 7:01 ` Nick Piggin 2007-04-17 8:23 ` William Lee Irwin III 2007-04-17 21:39 ` Matt Mackall 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > >> All things are not equal; they all have different properties. I like > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > Exactly. So we have to explore those properties and evaluate performance > > (in all meanings of the word). That's only logical. > > Any chance you'd be willing to put down a few thoughts on what sorts > of standards you'd like to set for both correctness (i.e. the bare > minimum a scheduler implementation must do to be considered valid > beyond not oopsing) and performance metrics (i.e. things that produce > numbers for each scheduler you can compare to say "this scheduler is > better than this other scheduler at this."). Yeah I guess that's the hard part :) For correctness, I guess fairness is an easy one. I think that unfairness is basically a bug and that it would be very unfortunate to merge something unfair. But this is just within the context of a single runqueue... for better or worse, we allow some unfairness in multiprocessors for performance reasons of course. Latency. Given N tasks in the system, an arbitrary task should get onto the CPU in a bounded amount of time (excluding events like freak IRQ holdoffs and such, obviously -- ie. just considering the context of the scheduler's state machine). I wouldn't like to see a significant drop in any micro or macro benchmarks or even worse real workloads, but I could accept some if it means having a fair scheduler by default. Now it isn't actually too hard to achieve the above, I think. The hard bit is trying to compare interactivity. Ideally, we'd be able to get scripted dumps of login sessions, and measure scheduling latencies of key processes (sh/X/wm/xmms/firefox/etc). People would send a dump if they were having problems with any scheduler, and we could compare all of them against it. Wishful thinking! ^ permalink raw reply [flat|nested] 304+ messages in thread
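One cheap way to approximate the "measure scheduling latencies of key processes" idea, short of full scripted session dumps, is to sample the per-task run-delay counter the kernel already exports when schedstats are enabled. The sketch below assumes the /proc/<pid>/schedstat layout described in the scheduler-statistics documentation (cumulative run time, cumulative runqueue wait, timeslice count); the units of the first two fields have varied across kernel versions, so only deltas over an interval are reported and compared.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/* Read the three cumulative counters from /proc/<pid>/schedstat. */
static int read_schedstat(pid_t pid, unsigned long long *run,
			  unsigned long long *rq_wait, unsigned long long *slices)
{
	char path[64];
	FILE *f;
	int n;

	snprintf(path, sizeof(path), "/proc/%d/schedstat", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fscanf(f, "%llu %llu %llu", run, rq_wait, slices);
	fclose(f);
	return n == 3 ? 0 : -1;
}

int main(int argc, char **argv)
{
	unsigned long long r0, w0, s0, r1, w1, s1;
	pid_t pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);

	if (read_schedstat(pid, &r0, &w0, &s0))
		return 1;
	sleep(5);                        /* sampling interval */
	if (read_schedstat(pid, &r1, &w1, &s1))
		return 1;

	printf("pid %d: +%llu run, +%llu runqueue wait, +%llu timeslices\n",
	       (int)pid, r1 - r0, w1 - w0, s1 - s0);
	return 0;
}

Pointing something like this at the pid of X, the window manager or a media player while a make -j load runs in the background gives at least a crude, repeatable number to compare between schedulers.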
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:01 ` Nick Piggin @ 2007-04-17 8:23 ` William Lee Irwin III 2007-04-17 22:23 ` Davide Libenzi 2007-04-17 21:39 ` Matt Mackall 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 8:23 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: >> Any chance you'd be willing to put down a few thoughts on what sorts >> of standards you'd like to set for both correctness (i.e. the bare >> minimum a scheduler implementation must do to be considered valid >> beyond not oopsing) and performance metrics (i.e. things that produce >> numbers for each scheduler you can compare to say "this scheduler is >> better than this other scheduler at this."). On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Yeah I guess that's the hard part :) > For correctness, I guess fairness is an easy one. I think that unfairness > is basically a bug and that it would be very unfortunate to merge something > unfair. But this is just within the context of a single runqueue... for > better or worse, we allow some unfairness in multiprocessors for performance > reasons of course. Requiring that identical tasks be allocated equal shares of CPU bandwidth is the easy part here. ringtest.c exercises another aspect of fairness that is extremely important. Generalizing ringtest.c is a good idea for fairness testing. But another aspect of fairness is that "controlled unfairness" is also intended to exist, in no small part by virtue of nice levels, but also in the form of favoring tasks that are considered interactive somehow. Testing various forms of controlled unfairness to ensure that they are indeed controlled and otherwise have the semantics intended is IMHO the more difficult aspect of fairness testing. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Latency. Given N tasks in the system, an arbitrary task should get > onto the CPU in a bounded amount of time (excluding events like freak > IRQ holdoffs and such, obviously -- ie. just considering the context > of the scheduler's state machine). ISTR Davide Libenzi having a scheduling latency test a number of years ago. Resurrecting that and tuning it to the needs of this kind of testing sounds relevant here. The test suite Peter Willliams mentioned would also help. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > I wouldn't like to see a significant drop in any micro or macro > benchmarks or even worse real workloads, but I could accept some if it > means haaving a fair scheduler by default. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Now it isn't actually too hard to achieve the above, I think. The hard bit > is trying to compare interactivity. Ideally, we'd be able to get scripted > dumps of login sessions, and measure scheduling latencies of key proceses > (sh/X/wm/xmms/firefox/etc). People would send a dump if they were having > problems with any scheduler, and we could compare all of them against it. > Wishful thinking! That's a pretty good idea. I'll queue up writing something of that form as well. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:23 ` William Lee Irwin III @ 2007-04-17 22:23 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 22:23 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > Latency. Given N tasks in the system, an arbitrary task should get > > onto the CPU in a bounded amount of time (excluding events like freak > > IRQ holdoffs and such, obviously -- ie. just considering the context > > of the scheduler's state machine). > > ISTR Davide Libenzi having a scheduling latency test a number of years > ago. Resurrecting that and tuning it to the needs of this kind of > testing sounds relevant here. The test suite Peter Williams mentioned > would also help. That helped me a lot at that time. At every context switch it was sampling critical scheduler parameters for both the entering and exiting task (and associated timestamps). Then the data was collected through a /dev/idontremember from userspace for analysis. It'd be very useful to have it these days, to study what really happens under the hood (scheduler internal parameter variations and such) when those weird loads make the scheduler unstable. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:01 ` Nick Piggin 2007-04-17 8:23 ` William Lee Irwin III @ 2007-04-17 21:39 ` Matt Mackall 2007-04-17 23:23 ` Peter Williams 2007-04-18 3:15 ` Nick Piggin 1 sibling, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 21:39 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > >> All things are not equal; they all have different properties. I like > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > Exactly. So we have to explore those properties and evaluate performance > > > (in all meanings of the word). That's only logical. > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > of standards you'd like to set for both correctness (i.e. the bare > > minimum a scheduler implementation must do to be considered valid > > beyond not oopsing) and performance metrics (i.e. things that produce > > numbers for each scheduler you can compare to say "this scheduler is > > better than this other scheduler at this."). > > Yeah I guess that's the hard part :) > > For correctness, I guess fairness is an easy one. I think that unfairness > is basically a bug and that it would be very unfortunate to merge something > unfair. But this is just within the context of a single runqueue... for > better or worse, we allow some unfairness in multiprocessors for performance > reasons of course. I'm a big fan of fairness, but I think it's a bit early to declare it a mandatory feature. Bounded unfairness is probably something we can agree on, ie "if we decide to be unfair, no process suffers more than a factor of x". > Latency. Given N tasks in the system, an arbitrary task should get > onto the CPU in a bounded amount of time (excluding events like freak > IRQ holdoffs and such, obviously -- ie. just considering the context > of the scheduler's state machine). This is a slightly stronger statement than starvation-free (which is obviously mandatory). I think you're looking for something like "worst-case scheduling latency is proportional to the number of runnable tasks". Which I think is quite a reasonable requirement. I'm pretty sure the stock scheduler falls short of both of these guarantees though. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
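The two criteria in this exchange ("no process suffers more than a factor of x" and "worst-case scheduling latency proportional to the number of runnable tasks") translate into very small checks over measured data. The following is only a restatement of those criteria in code, with the unfairness factor and the per-task latency constant left as hypothetical tunables; it is not part of any existing test suite.

#include <stdbool.h>
#include <stddef.h>

#define X_FACTOR		2.0	/* max allowed CPU-share ratio between tasks */
#define LATENCY_PER_TASK_MS	20.0	/* max wait per runnable task */

/* Bounded unfairness: for identical tasks, no share ratio exceeds X_FACTOR. */
static bool bounded_unfairness(const double *share, size_t n)
{
	double lo = share[0], hi = share[0];
	size_t i;

	for (i = 1; i < n; i++) {
		if (share[i] < lo)
			lo = share[i];
		if (share[i] > hi)
			hi = share[i];
	}
	return lo > 0.0 && hi / lo <= X_FACTOR;
}

/* Worst-case wait bounded by a constant times the number of runnable tasks. */
static bool proportional_latency(const double *worst_wait_ms, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (worst_wait_ms[i] > LATENCY_PER_TASK_MS * (double)n)
			return false;
	return true;
}

int main(void)
{
	double share[3]   = { 0.32, 0.35, 0.33 }; /* observed CPU fractions     */
	double wait_ms[3] = { 18.0, 25.0, 40.0 }; /* observed worst-case waits  */

	return (bounded_unfairness(share, 3) &&
		proportional_latency(wait_ms, 3)) ? 0 : 1;
}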
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 21:39 ` Matt Mackall @ 2007-04-17 23:23 ` Peter Williams 2007-04-17 23:19 ` Matt Mackall 2007-04-18 3:15 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 23:23 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Matt Mackall wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: >> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: >>> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: >>>>> All things are not equal; they all have different properties. I like >>> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: >>>> Exactly. So we have to explore those properties and evaluate performance >>>> (in all meanings of the word). That's only logical. >>> Any chance you'd be willing to put down a few thoughts on what sorts >>> of standards you'd like to set for both correctness (i.e. the bare >>> minimum a scheduler implementation must do to be considered valid >>> beyond not oopsing) and performance metrics (i.e. things that produce >>> numbers for each scheduler you can compare to say "this scheduler is >>> better than this other scheduler at this."). >> Yeah I guess that's the hard part :) >> >> For correctness, I guess fairness is an easy one. I think that unfairness >> is basically a bug and that it would be very unfortunate to merge something >> unfair. But this is just within the context of a single runqueue... for >> better or worse, we allow some unfairness in multiprocessors for performance >> reasons of course. > > I'm a big fan of fairness, but I think it's a bit early to declare it > a mandatory feature. Bounded unfairness is probably something we can > agree on, ie "if we decide to be unfair, no process suffers more than > a factor of x". > >> Latency. Given N tasks in the system, an arbitrary task should get >> onto the CPU in a bounded amount of time (excluding events like freak >> IRQ holdoffs and such, obviously -- ie. just considering the context >> of the scheduler's state machine). > > This is a slightly stronger statement than starvation-free (which is > obviously mandatory). I think you're looking for something like > "worst-case scheduling latency is proportional to the number of > runnable tasks". add "taking into consideration nice and/or real time priorities of runnable tasks". I.e. if a task is nice 19 it can expect to wait longer to get onto the CPU than if it was nice 0. > Which I think is quite a reasonable requirement. > > I'm pretty sure the stock scheduler falls short of both of these > guarantees though. > Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:23 ` Peter Williams @ 2007-04-17 23:19 ` Matt Mackall 0 siblings, 0 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 23:19 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 09:23:42AM +1000, Peter Williams wrote: > Matt Mackall wrote: > >On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > >>On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > >>>On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > >>>>>All things are not equal; they all have different properties. I like > >>>On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > >>>>Exactly. So we have to explore those properties and evaluate performance > >>>>(in all meanings of the word). That's only logical. > >>>Any chance you'd be willing to put down a few thoughts on what sorts > >>>of standards you'd like to set for both correctness (i.e. the bare > >>>minimum a scheduler implementation must do to be considered valid > >>>beyond not oopsing) and performance metrics (i.e. things that produce > >>>numbers for each scheduler you can compare to say "this scheduler is > >>>better than this other scheduler at this."). > >>Yeah I guess that's the hard part :) > >> > >>For correctness, I guess fairness is an easy one. I think that unfairness > >>is basically a bug and that it would be very unfortunate to merge > >>something > >>unfair. But this is just within the context of a single runqueue... for > >>better or worse, we allow some unfairness in multiprocessors for > >>performance > >>reasons of course. > > > >I'm a big fan of fairness, but I think it's a bit early to declare it > >a mandatory feature. Bounded unfairness is probably something we can > >agree on, ie "if we decide to be unfair, no process suffers more than > >a factor of x". > > > >>Latency. Given N tasks in the system, an arbitrary task should get > >>onto the CPU in a bounded amount of time (excluding events like freak > >>IRQ holdoffs and such, obviously -- ie. just considering the context > >>of the scheduler's state machine). > > > >This is a slightly stronger statement than starvation-free (which is > >obviously mandatory). I think you're looking for something like > >"worst-case scheduling latency is proportional to the number of > >runnable tasks". > > add "taking into consideration nice and/or real time priorities of > runnable tasks". I.e. if a task is nice 19 it can expect to wait longer > to get onto the CPU than if it was nice 0. Yes. Assuming we meet the "bounded unfairness" criterion above, this follows. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 21:39 ` Matt Mackall 2007-04-17 23:23 ` Peter Williams @ 2007-04-18 3:15 ` Nick Piggin 2007-04-18 3:45 ` Mike Galbraith 2007-04-18 4:38 ` Matt Mackall 1 sibling, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 3:15 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > > >> All things are not equal; they all have different properties. I like > > > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > > Exactly. So we have to explore those properties and evaluate performance > > > > (in all meanings of the word). That's only logical. > > > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > > of standards you'd like to set for both correctness (i.e. the bare > > > minimum a scheduler implementation must do to be considered valid > > > beyond not oopsing) and performance metrics (i.e. things that produce > > > numbers for each scheduler you can compare to say "this scheduler is > > > better than this other scheduler at this."). > > > > Yeah I guess that's the hard part :) > > > > For correctness, I guess fairness is an easy one. I think that unfairness > > is basically a bug and that it would be very unfortunate to merge something > > unfair. But this is just within the context of a single runqueue... for > > better or worse, we allow some unfairness in multiprocessors for performance > > reasons of course. > > I'm a big fan of fairness, but I think it's a bit early to declare it > a mandatory feature. Bounded unfairness is probably something we can > agree on, ie "if we decide to be unfair, no process suffers more than > a factor of x". I don't know why this would be a useful feature (of course I'm talking about processes at the same nice level). One of the big problems with the current scheduler is that it is unfair in some corner cases. It works OK for most people, but when it breaks down it really hurts. At least if you start with a fair scheduler, you can alter priorities until it satisfies your need... with an unfair one your guess is as good as mine. So on what basis would you allow unfairness? On the basis that it doesn't seem to harm anyone? It doesn't seem to harm testers? I think we should aim for something better. > > Latency. Given N tasks in the system, an arbitrary task should get > > onto the CPU in a bounded amount of time (excluding events like freak > > IRQ holdoffs and such, obviously -- ie. just considering the context > > of the scheduler's state machine). > > This is a slightly stronger statement than starvation-free (which is > obviously mandatory). I think you're looking for something like > "worst-case scheduling latency is proportional to the number of > runnable tasks". Which I think is quite a reasonable requirement. Yes, bounded and proportional to. > I'm pretty sure the stock scheduler falls short of both of these > guarantees though. And I think that's what its main problems are. It's interactivity obviously can't be too bad for most people. 
Its performance seems to be pretty good. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:15 ` Nick Piggin @ 2007-04-18 3:45 ` Mike Galbraith 2007-04-18 3:56 ` Nick Piggin 2007-04-18 4:38 ` Matt Mackall 1 sibling, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-18 3:45 UTC (permalink / raw) To: Nick Piggin Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > a mandatory feature. Bounded unfairness is probably something we can > > agree on, ie "if we decide to be unfair, no process suffers more than > > a factor of x". > > I don't know why this would be a useful feature (of course I'm talking > about processes at the same nice level). One of the big problems with > the current scheduler is that it is unfair in some corner cases. It > works OK for most people, but when it breaks down it really hurts. At > least if you start with a fair scheduler, you can alter priorities > until it satisfies your need... with an unfair one your guess is as > good as mine. > > So on what basis would you allow unfairness? On the basis that it doesn't > seem to harm anyone? It doesn't seem to harm testers? Well, there's short term fair and long term fair. Seems to me a burst load having to always merge with a steady stream load using a short term fairness yardstick absolutely must 'starve' relative to the steady load, so to be long term fair, you have to add some short term unfairness. The mainline scheduler is more long term fair (discounting the rather obnoxious corner cases). -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:45 ` Mike Galbraith @ 2007-04-18 3:56 ` Nick Piggin 2007-04-18 4:29 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 3:56 UTC (permalink / raw) To: Mike Galbraith Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote: > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > > a mandatory feature. Bounded unfairness is probably something we can > > > agree on, ie "if we decide to be unfair, no process suffers more than > > > a factor of x". > > > > I don't know why this would be a useful feature (of course I'm talking > > about processes at the same nice level). One of the big problems with > > the current scheduler is that it is unfair in some corner cases. It > > works OK for most people, but when it breaks down it really hurts. At > > least if you start with a fair scheduler, you can alter priorities > > until it satisfies your need... with an unfair one your guess is as > > good as mine. > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > seem to harm anyone? It doesn't seem to harm testers? > > Well, there's short term fair and long term fair. Seems to me a burst > load having to always merge with a steady stream load using a short term > fairness yardstick absolutely must 'starve' relative to the steady load, > so to be long term fair, you have to add some short term unfairness. > The mainline scheduler is more long term fair (discounting the rather > obnoxious corner cases). Oh yes definitely I mean long term fair. I guess it is impossible to be completely fair so long as we have to timeshare the CPU :) So a constant delta is fine and unavoidable. But I don't think I agree with a constant factor: that means you can pick a time where process 1 is allowed an arbitrary T more CPU time than process 2. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:56 ` Nick Piggin @ 2007-04-18 4:29 ` Mike Galbraith 0 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-18 4:29 UTC (permalink / raw) To: Nick Piggin Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 05:56 +0200, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote: > > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > > > > > > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > > seem to harm anyone? It doesn't seem to harm testers? > > > > Well, there's short term fair and long term fair. Seems to me a burst > > load having to always merge with a steady stream load using a short term > > fairness yardstick absolutely must 'starve' relative to the steady load, > > so to be long term fair, you have to add some short term unfairness. > > The mainline scheduler is more long term fair (discounting the rather > > obnoxious corner cases). > > Oh yes definitely I mean long term fair. I guess it is impossible to be > completely fair so long as we have to timeshare the CPU :) > > So a constant delta is fine and unavoidable. But I don't think I agree > with a constant factor: that means you can pick a time where process 1 > is allowed an arbitrary T more CPU time than process 2. Definitely. Using constants with no consideration of what else is running is what causes the fairness mechanism in mainline to break down under load. (aside: What I was experimenting with before this new scheduler came along was to turn the sleep_avg thing into an off-cpu period thing. Once a time slice begins execution [runqueue wait doesn't count], that task has the right to use its slice in one go, and _anything_ that knocked it off the cpu added to its credit. Knocking someone else off detracts from credit, and to get to the point where you can knock others off costs you stored credit proportional to the dynamic priority you attain by using it. All tasks that have credit stay active, no favorites.) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
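For anyone who wants to experiment with the credit idea in Mike's aside, a very rough paraphrase of it is sketched below. All of the names, the units and the linear "priority boost times slice" cost are invented for illustration; the actual experiment's code is not shown anywhere in this thread.

#include <stdint.h>

/*
 * Paraphrase of the aside above: bank credit for time a slice-holder spends
 * involuntarily off the CPU, and spend credit (scaled by the priority boost
 * obtained) to preempt others.  Illustrative only.
 */
struct credit {
	int64_t ns;                      /* stored credit, in nanoseconds */
};

/* Preempted while still holding an unexpired slice: bank the time lost. */
static void credit_preempted(struct credit *c, uint64_t off_cpu_ns)
{
	c->ns += (int64_t)off_cpu_ns;
}

/*
 * Ask to preempt someone else: the price grows with the priority boost the
 * preemption buys.  Returns nonzero and charges the cost if affordable.
 */
static int credit_try_preempt(struct credit *c, unsigned int prio_boost,
			      uint64_t slice_ns)
{
	int64_t cost = (int64_t)(prio_boost * slice_ns);

	if (c->ns < cost)
		return 0;
	c->ns -= cost;
	return 1;
}

/* "All tasks that have credit stay active, no favorites." */
static int credit_keeps_active(const struct credit *c)
{
	return c->ns > 0;
}

int main(void)
{
	struct credit c = { 0 };

	credit_preempted(&c, 2000000);              /* lost 2ms of an owned slice */
	return (credit_try_preempt(&c, 1, 1000000)  /* afford a one-level, 1ms jump */
		&& credit_keeps_active(&c)) ? 0 : 1;
}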
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:15 ` Nick Piggin 2007-04-18 3:45 ` Mike Galbraith @ 2007-04-18 4:38 ` Matt Mackall 2007-04-18 5:00 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-18 4:38 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > > > >> All things are not equal; they all have different properties. I like > > > > > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > > > Exactly. So we have to explore those properties and evaluate performance > > > > > (in all meanings of the word). That's only logical. > > > > > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > > > of standards you'd like to set for both correctness (i.e. the bare > > > > minimum a scheduler implementation must do to be considered valid > > > > beyond not oopsing) and performance metrics (i.e. things that produce > > > > numbers for each scheduler you can compare to say "this scheduler is > > > > better than this other scheduler at this."). > > > > > > Yeah I guess that's the hard part :) > > > > > > For correctness, I guess fairness is an easy one. I think that unfairness > > > is basically a bug and that it would be very unfortunate to merge something > > > unfair. But this is just within the context of a single runqueue... for > > > better or worse, we allow some unfairness in multiprocessors for performance > > > reasons of course. > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > a mandatory feature. Bounded unfairness is probably something we can > > agree on, ie "if we decide to be unfair, no process suffers more than > > a factor of x". > > I don't know why this would be a useful feature (of course I'm talking > about processes at the same nice level). One of the big problems with > the current scheduler is that it is unfair in some corner cases. It > works OK for most people, but when it breaks down it really hurts. At > least if you start with a fair scheduler, you can alter priorities > until it satisfies your need... with an unfair one your guess is as > good as mine. > > So on what basis would you allow unfairness? On the basis that it doesn't > seem to harm anyone? It doesn't seem to harm testers? On the basis that there's only anecdotal evidence thus far that fairness is the right approach. It's not yet clear that a fair scheduler can do the right thing with X, with various kernel threads, etc. without fiddling with nice levels. Which makes it no longer "completely fair". It's also not yet clear that a scheduler can't be taught to do the right thing with X without fiddling with nice levels. So I'm just not yet willing to completely rule out systems that aren't "completely fair". But I think we should rule out schedulers that don't have rigid bounds on that unfairness. That's where the really ugly behavior lies. -- Mathematics is the supreme nostalgia of our time. 
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 4:38 ` Matt Mackall @ 2007-04-18 5:00 ` Nick Piggin 2007-04-18 5:55 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 5:00 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:38:31PM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote: > > > > I don't know why this would be a useful feature (of course I'm talking > > about processes at the same nice level). One of the big problems with > > the current scheduler is that it is unfair in some corner cases. It > > works OK for most people, but when it breaks down it really hurts. At > > least if you start with a fair scheduler, you can alter priorities > > until it satisfies your need... with an unfair one your guess is as > > good as mine. > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > seem to harm anyone? It doesn't seem to harm testers? > > On the basis that there's only anecdotal evidence thus far that > fairness is the right approach. > > It's not yet clear that a fair scheduler can do the right thing with X, > with various kernel threads, etc. without fiddling with nice levels. > Which makes it no longer "completely fair". Of course I mean SCHED_OTHER tasks at the same nice level. Otherwise I would be arguing to make nice basically a noop. > It's also not yet clear that a scheduler can't be taught to do the > right thing with X without fiddling with nice levels. Being fair doesn't prevent that. Implicit unfairness is wrong though, because it will bite people. What's wrong with allowing X to get more than it's fair share of CPU time by "fiddling with nice levels"? That's what they're there for. > So I'm just not yet willing to completely rule out systems that aren't > "completely fair". > > But I think we should rule out schedulers that don't have rigid bounds on > that unfairness. That's where the really ugly behavior lies. Been a while since I really looked at the mainline scheduler, but I don't think it can permanently starve something, so I don't know what your bounded unfairness would help with. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:00 ` Nick Piggin @ 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-18 5:55 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote: > > It's also not yet clear that a scheduler can't be taught to do the > > right thing with X without fiddling with nice levels. > > Being fair doesn't prevent that. Implicit unfairness is wrong though, > because it will bite people. > > What's wrong with allowing X to get more than it's fair share of CPU > time by "fiddling with nice levels"? That's what they're there for. Why is X special? Because it does work on behalf of other processes? Lots of things do this. Perhaps a scheduler should focus entirely on the implicit and directed wakeup matrix and optimizing that instead[1]. Why are processes special? Should user A be able to get more CPU time for his job than user B by splitting it into N parallel jobs? Should we be fair per process, per user, per thread group, per session, per controlling terminal? Some weighted combination of the preceding?[2] Why is the measure CPU time? I can imagine a scheduler that weighed memory bandwidth in the equation. Or power consumption. Or execution unit usage. Fairness is nice. It's simple, it's obvious, it's predictable. But it's just not clear that it's optimal. If the question is (and it was!) "what should the basic requirements for the scheduler be?" it's not clear that fairness is a requirement or even how to pick a metric for fairness that's obviously and uniquely superior. It's instead much easier to try to recognize and rule out really bad behaviour with bounded latencies, minimum service guarantees, etc. [1] That's basically how Google decides to prioritize webpages, which it seems to do moderately well. And how a number of other optimization problems are solved. [2] It's trivial to construct two or more perfectly reasonable and desirable definitions of fairness that are mutually incompatible. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall @ 2007-04-18 6:37 ` Nick Piggin 2007-04-18 6:55 ` Matt Mackall 2007-04-18 13:08 ` William Lee Irwin III 2007-04-18 14:48 ` Linus Torvalds 2 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 6:37 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote: > > > It's also not yet clear that a scheduler can't be taught to do the > > > right thing with X without fiddling with nice levels. > > > > Being fair doesn't prevent that. Implicit unfairness is wrong though, > > because it will bite people. > > > > What's wrong with allowing X to get more than it's fair share of CPU > > time by "fiddling with nice levels"? That's what they're there for. > > Why is X special? Because it does work on behalf of other processes? The high level reason is that giving it more than its fair share of CPU allows a desktop to remain interactive under load. And it isn't just about doing work on behalf of other processes. Mouse interrupts are a big part of it, for example. > Lots of things do this. Perhaps a scheduler should focus entirely on > the implicit and directed wakeup matrix and optimizing that > instead[1]. You could do that, and I tried a variant of it at one point. The problem was that it leads to unexpected bad things too. UNIX programs more or less expect fair SCHED_OTHER scheduling, and given the principle of least surprise... > Why are processes special? Should user A be able to get more CPU time > for his job than user B by splitting it into N parallel jobs? Should > we be fair per process, per user, per thread group, per session, per > controlling terminal? Some weighted combination of the preceding?[2] I don't know how that supports your argument for unfairness, but processes are special only because that's how we've always done scheduling. I'm not precluding other groupings for fairness, though. > Why is the measure CPU time? I can imagine a scheduler that weighed > memory bandwidth in the equation. Or power consumption. Or execution > unit usage. Feel free. And I'd also argue that once you schedule for those metrics then fairness is also important there too. > Fairness is nice. It's simple, it's obvious, it's predictable. But > it's just not clear that it's optimal. If the question is (and it > was!) "what should the basic requirements for the scheduler be?" it's > not clear that fairness is a requirement or even how to pick a metric > for fairness that's obviously and uniquely superior. What do you mean optimal? If your criteria is fairness, then of course it is optimal. If your criteria is throughput, then it probably isn't. Considering it is simple and what we've always done, measuring fairness by CPU time per process is obvious for a general purpose scheduler. If you accept that, then I argue that fairness is an optimal property given that the alternative is unfairness. > It's instead much easier to try to recognize and rule out really bad > behaviour with bounded latencies, minimum service guarantees, etc. It's the bad behaviour that you didn't recognize that is the problem. If you start with explicit fairness, then unfairness will never be one of those problems. 
> [1] That's basically how Google decides to prioritize webpages, which > it seems to do moderately well. And how a number of other optimization > problems are solved. This is not an optimization problem, it is a heuristic. There is no right and wrong answer. > [2] It's trivial to construct two or more perfectly reasonable and > desirable definitions of fairness that are mutually incompatible. Probably not if you use common sense, and in the context of a replacement for the 2.6 scheduler. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:37 ` Nick Piggin @ 2007-04-18 6:55 ` Matt Mackall 2007-04-18 7:24 ` Nick Piggin 2007-04-21 13:33 ` Bill Davidsen 0 siblings, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-18 6:55 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: > I don't know how that supports your argument for unfairness, I never had such an argument. I like fairness. My argument is that -you- don't have an argument for making fairness a -requirement-. > processes are special only because that's how we've always done > scheduling. I'm not precluding other groupings for fairness, though. If you make one form of fairness a -requirement- for all acceptable algorithms, your -are- precluding most other forms of fairness. If you refuse to define what "fairness" means when specifying your requirement, what's the point of requiring it? > What do you mean optimal? If your criteria is fairness, then of course > it is optimal. If your criteria is throughput, then it probably isn't. I don't know what optimal behavior is. And neither do you. It may or may not be fair. It very likely includes small deviations from fair. > > [2] It's trivial to construct two or more perfectly reasonable and > > desirable definitions of fairness that are mutually incompatible. > > Probably not if you use common sense, and in the context of a replacement > for the 2.6 scheduler. Ok, trivial example. You cannot allocate equal CPU time to processes/tasks and simultaneously allocate equal time to thread groups. Is it common sense that a heavily-threaded app should be able to get hugely more CPU than a well-written app? No. I don't want Joe's stupid Java app to make my compile crawl. On the other hand, if my heavily threaded app is, say, a voicemail server serving 30 customers, I probably want it to get 30x the CPU of my gzip job. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
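To put numbers on that example (a worked illustration, not part of the original mail): with the 30-thread voicemail server and the single-threaded gzip both runnable, equal time per thread gives the server 30/31 (roughly 97%) of the CPU and gzip roughly 3%, while equal time per process gives each of them 50%. No single schedule satisfies both definitions at once, which is the incompatibility being claimed.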
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:55 ` Matt Mackall @ 2007-04-18 7:24 ` Nick Piggin 2007-04-21 13:33 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 7:24 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 01:55:34AM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: > > I don't know how that supports your argument for unfairness, > > I never had such an argument. I like fairness. > > My argument is that -you- don't have an argument for making fairness a > -requirement-. It seems easy enough that there is no point accepting unfair behaviour like the old scheduler if we're going to go to all this trouble to replace it. The old scheduler seems to have bounded unfairness and bounded starvation, so let the good times roll. > > processes are special only because that's how we've always done > > scheduling. I'm not precluding other groupings for fairness, though. > > If you make one form of fairness a -requirement- for all acceptable > algorithms, your -are- precluding most other forms of fairness. > > If you refuse to define what "fairness" means when specifying your > requirement, what's the point of requiring it? I don't refuse. I'm talking about per-process CPU time fairness. My paragraph above was pointing out that subsequent work to add other classes of fairness are not excluded as configurable features, but this basic type of fairness should be included. > > What do you mean optimal? If your criteria is fairness, then of course > > it is optimal. If your criteria is throughput, then it probably isn't. > > I don't know what optimal behavior is. And neither do you. It may or > may not be fair. It very likely includes small deviations from fair. You misunderstand me. There is no single "optimal" when you're talking about fairness (or most other scheduler things). So pondering whether fairness is optimal or not doesn't really make sense. I'm saying it should be a basic axiom, not that it is quantitively better. It isn't a refutable argument. I state it because that it is what users and programs expect. You can reject that, and fine. I guess if a scheduler comes along that does exactly the right thing for everyone, then it is better than any fair scheduler. So OK, while we're talking theoretical, I won't dismiss an unfair scheduler out of hand. > > > [2] It's trivial to construct two or more perfectly reasonable and > > > desirable definitions of fairness that are mutually incompatible. > > > > Probably not if you use common sense, and in the context of a replacement > > for the 2.6 scheduler. > > Ok, trivial example. You cannot allocate equal CPU time to > processes/tasks and simultaneously allocate equal time to thread > groups. Is it common sense that a heavily-threaded app should be able > to get hugely more CPU than a well-written app? No. I don't want Joe's > stupid Java app to make my compile crawl. > > On the other hand, if my heavily threaded app is, say, a voicemail > server serving 30 customers, I probably want it to get 30x the CPU of > my gzip job. So that might be a nice addition, but the base funcionality is threads simply because that's what we've always done. Just common sense. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:55 ` Matt Mackall 2007-04-18 7:24 ` Nick Piggin @ 2007-04-21 13:33 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-21 13:33 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Matt Mackall wrote: > On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: >>> [2] It's trivial to construct two or more perfectly reasonable and >>> desirable definitions of fairness that are mutually incompatible. >> Probably not if you use common sense, and in the context of a replacement >> for the 2.6 scheduler. > > Ok, trivial example. You cannot allocate equal CPU time to > processes/tasks and simultaneously allocate equal time to thread > groups. Is it common sense that a heavily-threaded app should be able > to get hugely more CPU than a well-written app? No. I don't want Joe's > stupid Java app to make my compile crawl. > > On the other hand, if my heavily threaded app is, say, a voicemail > server serving 30 customers, I probably want it to get 30x the CPU of > my gzip job. > Matt, you tickled a thought... on one hand we have a single user running a threaded application, and it ideally should get the same total CPU as a user running a single thread process. On the other hand we have a threaded application, call it sendmail, nnrpd, httpd, bind, whatever. In that case each thread is really providing service for an independent user, and should get an appropriate share of the CPU. Perhaps the solution is to add a means for identifying server processes, by capability, or by membership in a "server" group, or by having the initiating process set some flag at exec() time. That doesn't necessarily solve problems, but it may provide more information to allow them to be soluble. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin @ 2007-04-18 13:08 ` William Lee Irwin III 2007-04-18 19:48 ` Davide Libenzi 2007-04-18 14:48 ` Linus Torvalds 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 13:08 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote: > Why are processes special? Should user A be able to get more CPU time > for his job than user B by splitting it into N parallel jobs? Should > we be fair per process, per user, per thread group, per session, per > controlling terminal? Some weighted combination of the preceding?[2] On a side note, I think a combination of all of the above is a very good idea, plus process groups (pgrp's). All the make -j loads should come up in one pgrp of one session for one user and hence should be automatically kept isolated in its own corner by such policies. Thread bombs, forkbombs, and so on get handled too, which is good when on e.g. a compileserver and someone rudely spawns too many tasks. Thinking of the scheduler as a CPU bandwidth allocator, this means handing out shares of CPU bandwidth to all users on the system, which in turn hand out shares of bandwidth to all sessions, which in turn hand out shares of bandwidth to all process groups, which in turn hand out shares of bandwidth to all thread groups, which in turn hand out shares of bandwidth to threads. The event handlers for the scheduler need not deal with this apart from task creation and exit and various sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.). They just determine what the scheduler sees as ->load_weight or some analogue of ->static_prio, though it is possible to do this by means of data structure organization instead of numerical prioritization. It'd probably have to be calculated on the fly by, say, doing fixpoint arithmetic something like user_share(p)*session_share(p)*pgrp_share(p)*tgrp_share(p)*task_share(p) so that readjusting the shares of aggregates doesn't have to traverse lists and remains O(1). Each of the share computations can instead just do some analogue of the calculation p->load_weight/rq->raw_weighted_load in fixpoint, though precision issues with this make me queasy. There is maybe a slight nasty point in that the ->raw_weighted_load analogue for users or whatever the highest level chosen is ends up being global. One might as well get users in there and omit intermediate levels if any are to be omitted so that the truly global state is as read-only as possible. I suppose jacking up the fixpoint precision to 128-bit or 256-bit all below the radix point (our max is 1.0 after all) until precision issues vanish can be done but the idea of that much number crunching in the scheduler makes me rather uncomfortable. I hope u64 or u32 [2] can be gotten away with as far as fixpoint goes. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
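A minimal sketch of the kind of fixpoint share computation described above, assuming 31 fractional bits (so the product of two shares still fits in 64 bits) and a plain "weight over total weight at this level" definition of each share. The structure, field and function names are invented for illustration, and the truncation in fix_mul() is exactly where the precision worries above come from:

	#include <stdint.h>

	#define FIX_SHIFT 31			/* all shares are <= 1.0 */

	struct level {
		uint32_t load_weight;		/* this entity's weight            */
		uint32_t level_weight;		/* total weight at this level, > 0 */
	};

	/* share of the parent's bandwidth, as a 0.31 fixpoint fraction */
	static uint64_t level_share(const struct level *l)
	{
		return ((uint64_t)l->load_weight << FIX_SHIFT) / l->level_weight;
	}

	/* multiply two 0.31 fractions; low bits are simply discarded */
	static uint64_t fix_mul(uint64_t a, uint64_t b)
	{
		return (a * b) >> FIX_SHIFT;
	}

	/*
	 * user_share * session_share * pgrp_share * tgrp_share * task_share,
	 * recomputed on the fly so that reweighting an aggregate stays O(1).
	 * lvl[] is ordered user, session, pgrp, tgrp, task.
	 */
	static uint64_t task_bandwidth(const struct level lvl[5])
	{
		uint64_t share = level_share(&lvl[0]);
		int i;

		for (i = 1; i < 5; i++)
			share = fix_mul(share, level_share(&lvl[i]));
		return share;	/* fraction of total CPU bandwidth, 0.31 fixpoint */
	}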
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 13:08 ` William Lee Irwin III @ 2007-04-18 19:48 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 19:48 UTC (permalink / raw) To: William Lee Irwin III Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, William Lee Irwin III wrote: > Thinking of the scheduler as a CPU bandwidth allocator, this means > handing out shares of CPU bandwidth to all users on the system, which > in turn hand out shares of bandwidth to all sessions, which in turn > hand out shares of bandwidth to all process groups, which in turn hand > out shares of bandwidth to all thread groups, which in turn hand out > shares of bandwidth to threads. The event handlers for the scheduler > need not deal with this apart from task creation and exit and various > sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.). Yes, it really becomes a hierarchical problem once you consider user and processes. Top level sees a "user" can be scheduled (put itself on the virtual run queue), and passes the ball to the "process" scheduler inside the "user" container, down to maybe "threads". With all the "key" calculation parameters kept at each level (with up-propagation). - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin 2007-04-18 13:08 ` William Lee Irwin III @ 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall ` (2 more replies) 2 siblings, 3 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 14:48 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > > Why is X special? Because it does work on behalf of other processes? > Lots of things do this. Perhaps a scheduler should focus entirely on > the implicit and directed wakeup matrix and optimizing that > instead[1]. I 100% agree - the perfect scheduler would indeed take into account where the wakeups come from, and try to "weigh" processes that help other processes make progress more. That would naturally give server processes more CPU power, because they help others I don't believe for a second that "fairness" means "give everybody the same amount of CPU". That's a totally illogical measure of fairness. All processes are _not_ created equal. That said, even trying to do "fairness by effective user ID" would probably already do a lot. In a desktop environment, X would get as much CPU time as the user processes, simply because it's in a different protection domain (and that's really what "effective user ID" means: it's not about "users", it's really about "protection domains"). And "fairness by euid" is probably a hell of a lot easier to do than trying to figure out the wakeup matrix. Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds @ 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds ` (2 more replies) 2007-04-19 3:18 ` Nick Piggin 2007-04-21 13:40 ` Bill Davidsen 2 siblings, 3 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-18 15:23 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. For the record, you actually don't need to track a whole NxN matrix (or do the implied O(n**3) matrix inversion!) to get to the same result. You can converge on the same node weightings (ie dynamic priorities) by applying a damped function at each transition point (directed wakeup, preemption, fork, exit). The trouble with any scheme like this is that it needs careful tuning of the damping factor to converge rapidly and not oscillate and precise numerical attention to the transition functions so that the sum of dynamic priorities is conserved. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
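As a rough illustration of the shape of such a damped update (not Matt's actual scheme; only the directed-wakeup transition is shown, and the damping divisor and field names are assumptions), the idea is that each wakeup transfers a small, damped amount of dynamic weight from waker to wakee while keeping the total conserved:

	#define DAMP_DIV 64	/* transfer 1/64 of the weight per event */

	struct ent {
		long dyn_weight;	/* dynamic priority analogue */
	};

	static void wakeup_transfer(struct ent *waker, struct ent *wakee)
	{
		long delta = waker->dyn_weight / DAMP_DIV;

		/* conserve the sum of dynamic weights, as noted above */
		waker->dyn_weight -= delta;
		wakee->dyn_weight += delta;
	}

Tuning DAMP_DIV is the convergence/oscillation trade-off mentioned above.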
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall @ 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:49 ` Ingo Molnar ` (3 more replies) 2007-04-18 19:05 ` Davide Libenzi 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 4 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 17:22 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > > On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > And "fairness by euid" is probably a hell of a lot easier to do than > > trying to figure out the wakeup matrix. > > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. I'm sure you can do things differently, but the reason I think "fairness by euid" is actually worth looking at is that it's pretty much the *identical* issue that we'll have with "fairness by virtual machine" and a number of other "container" issues. The fact is: - "fairness" is *not* about giving everybody the same amount of CPU time (scaled by some niceness level or not). Anybody who thinks that is "fair" is just being silly and hasn't thought it through. - "fairness" is multi-level. You want to be fair to threads within a thread group (where "process" may be one good approximation of what a "thread group" is, but not necessarily the only one). But you *also* want to be fair in between those "thread groups", and then you want to be fair across "containers" (where "user" may be one such container). So I claim that anything that cannot be fair by user ID is actually really REALLY unfair. I think it's absolutely humongously STUPID to call something the "Completely Fair Scheduler", and then just be fair on a thread level. That's not fair AT ALL! It's the anti-thesis of being fair! So if you have 2 users on a machine running CPU hogs, you should *first* try to be fair among users. If one user then runs 5 programs, and the other one runs just 1, then the *one* program should get 50% of the CPU time (the users fair share), and the five programs should get 10% of CPU time each. And if one of them uses two threads, each thread should get 5%. So you should see one thread get 50& CPU (single thread of one user), 4 threads get 10% CPU (their fair share of that users time), and 2 threads get 5% CPU (the fair share within that thread group!). Any scheduling argument that just considers the above to be "7 threads total" and gives each thread 14% of CPU time "fairly" is *anything* but fair. It's a joke if that kind of scheduler then calls itself CFS! And yes, that's largely what the current scheduler will do, but at least the current scheduler doesn't claim to be fair! So the current scheduler is a lot *better* if only in the sense that it doesn't make ridiculous claims that aren't true! Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
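The split described above is just each level dividing its share equally among its runnable children. A toy version, for equal nice levels only, with the struct and its fields invented for illustration:

	struct group {
		struct group *parent;	/* NULL at the top level            */
		int nr_running;		/* runnable children of this group  */
	};

	/* share of the whole machine for one thread, in millionths;
	 * g is the thread's immediate group (its process) */
	static long leaf_share(const struct group *g)
	{
		long share = 1000000;

		for (; g; g = g->parent)
			share /= g->nr_running;
		return share;
	}

Walking up through process, user and top level, this yields 1000000/1/1/2 = 50% for the lone user's single process, 1000000/1/5/2 = 10% for each single-threaded process of the busy user, and 1000000/2/5/2 = 5% for each thread of that user's two-threaded process, matching the figures above.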
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds @ 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 17:59 ` Ingo Molnar 2007-04-18 19:23 ` Linus Torvalds 2007-04-18 18:02 ` William Lee Irwin III ` (2 subsequent siblings) 3 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 17:49 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > The fact is: > > - "fairness" is *not* about giving everybody the same amount of CPU > time (scaled by some niceness level or not). Anybody who thinks > that is "fair" is just being silly and hasn't thought it through. yeah, very much so. But note that most of the reported CFS interactivity wins, as surprising as it might be, were due to fairness between _the same user's tasks_. In the typical case, 99% of the desktop CPU time is executed either as X (root user) or under the uid of the logged in user, and X is just one task. Even with a bad hack of making X super-high-prio, interactivity as experienced by users still sucks without having fairness between the other 100-200 user tasks that a desktop system is typically using. 'renicing X to -10' is a broken way of achieving: 'root uid should get its share of CPU time too, no matter how many user tasks are running'. We can do this much cleaner by saying: 'each uid, if it has any tasks running, should get its fair share of CPU time, independently of the number of tasks it is running'. In that sense 'fairness' is not global (and in fact it is almost _never_ a global property, as X runs under root uid [*]), it is only the most lowlevel scheduling machinery that can then be built upon. Higher-level controls to allocate CPU power between groups of tasks very much make sense - but according to the CFS interactivity test results i got from people so far, they very much need this basic fairness machinery _within_ the uid group too. So 'fairness' is still a powerful lower level scheduling concept. And this all makes lots of sense to me. One purpose of doing the hierarchical scheduling classes stuff was to enable such higher scope task group decisions too. Next i'll try to figure out whether 'task group bandwidth' logic should live right within sched_fair.c itself, or whether it should be layered separately as a sched_group.c. Intutively i'd say it should live within sched_fair.c. Ingo [*] There are distributions where X does not run under root uid anymore. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:49 ` Ingo Molnar @ 2007-04-18 17:59 ` Ingo Molnar 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:23 ` Linus Torvalds 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 17:59 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > In that sense 'fairness' is not global (and in fact it is almost > _never_ a global property, as X runs under root uid [*]), it is only > the most lowlevel scheduling machinery that can then be built upon. > [...] perhaps a more fitting term would be 'precise group-scheduling'. Within the lowest level task group entity (be that thread group or uid group, etc.) 'precise scheduling' is equivalent to 'fairness'. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:59 ` Ingo Molnar @ 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:43 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 19:40 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > > perhaps a more fitting term would be 'precise group-scheduling'. Within > the lowest level task group entity (be that thread group or uid group, > etc.) 'precise scheduling' is equivalent to 'fairness'. Yes. Absolutely. Except I think that at least if you're going to name somethign "complete" (or "perfect" or "precise"), you should also admit that groups can be hierarchical. The "threads in a process" thing is a great example of a hierarchical group. Imagine if X was running as a collection of threads - then each server thread would no longer be more important than the clients! But if you have a mix of "bags of threads" and "single process" kind applications, then very arguably the single thread in a single traditional process should get as much time as the "bag of threads" process gets total. So it really should be a hierarchical notion, where each thread is owned by one "process", and each process is owned by one "user", and each user is in one "virtual machine" - there's at least three different levels to this, and you'd want to schedule this thing top-down: virtual machines should be given CPU time "fairly" (which doesn't need to mean "equally", of course - nice-values could very well work at that level too), and then within each virtual machine users or "scheduling groups" should be scheduled fairly, and then within each scheduling group the processes should be scheduled, and within each process threads should equally get their fair share at _that_ level. And no, I don't think we necessarily need to do something quite that elaborate. But I think that's the kind of "obviously good goal" to keep in mind. Can we perhaps _approximate_ something like that by other means? For example, maybe we can approximate it by spreading out the statistics: right now you have things like - last_ran, wait_runtime, sum_wait_runtime.. be per-thread things. Maybe some of those can be spread out, so that you put a part of them in the "struct vm_struct" thing (to approximate processes), part of them in the "struct user" struct (to approximate the user-level thing), and part of it in a per-container thing for when/if we support that kind of thing? IOW, I don't think the scheduling "groups" have to be explicit boxes or anything like that. I suspect you can make do with just heurstics that penalize the same "struct user" and "struct vm_struct" to get overly much scheduling time, and you'll get the same _effect_. And I don't think it's wrong to look at the "one hundred processes by the same user" case as being an important case. But it should not be the *only* case or even necessarily the *main* case that matters. 
I think a benchmark that literally does

	pid_t pid = fork();
	if (pid < 0)
		exit(1);
	if (pid) {
		if (setuid(500) < 0)
			exit(2);
		for (;;)
			/* Do nothing */;
	}
	if (setuid(501) < 0)
		exit(3);
	fork();
	for (;;)
		/* Do nothing in two processes */;

and I think that it's a really valid benchmark: if the scheduler gives
25% of time to each of the two processes of user 501, and 50% to user
500, then THAT is a good scheduler.

If somebody wants to actually write and test the above as a test-script,
and add it to a collection of scheduler tests, I think that could be a
good thing.

Linus

^ permalink raw reply [flat|nested] 304+ messages in thread
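Filled out so that it actually builds, the fragment above becomes something like the following (a sketch following the mail's description, not an official test; it has to be started as root so the setuid() calls succeed, and uids 500/501 are simply the ones from the sketch):

	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/types.h>

	int main(void)
	{
		pid_t pid = fork();

		if (pid < 0)
			exit(1);
		if (pid) {
			/* parent: one CPU hog running as uid 500 */
			if (setuid(500) < 0)
				exit(2);
			for (;;)
				;
		}
		/* child: two CPU hogs running as uid 501 */
		if (setuid(501) < 0)
			exit(3);
		fork();
		for (;;)
			;
	}

On a single CPU, a per-user fair scheduler should show the uid 500 hog at about 50% and each uid 501 hog at about 25% in top, while a purely per-thread fair scheduler would show roughly 33% each.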
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:40 ` Linus Torvalds @ 2007-04-18 19:43 ` Ingo Molnar 2007-04-18 20:07 ` Davide Libenzi 2007-04-18 21:04 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 19:43 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > For example, maybe we can approximate it by spreading out the > statistics: right now you have things like > > - last_ran, wait_runtime, sum_wait_runtime.. > > be per-thread things. [...] yes, yes, yes! :) My thinking is "struct sched_group" embedded into _arbitrary_ other resource containers and abstractions, which sched_group's are then in a simple hierarchy and are driven by the core scheduling machinery. > [...] Maybe some of those can be spread out, so that you put a part of > them in the "struct vm_struct" thing (to approximate processes), part > of them in the "struct user" struct (to approximate the user-level > thing), and part of it in a per-container thing for when/if we support > that kind of thing? yes. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 19:40 ` Linus Torvalds
2007-04-18 19:43 ` Ingo Molnar
@ 2007-04-18 20:07 ` Davide Libenzi
2007-04-18 21:48 ` Ingo Molnar
2007-04-18 21:04 ` Ingo Molnar
2 siblings, 1 reply; 304+ messages in thread
From: Davide Libenzi @ 2007-04-18 20:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III,
Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, 18 Apr 2007, Linus Torvalds wrote:

> For example, maybe we can approximate it by spreading out the statistics:
> right now you have things like
>
> - last_ran, wait_runtime, sum_wait_runtime..
>
> be per-thread things. Maybe some of those can be spread out, so that you
> put a part of them in the "struct vm_struct" thing (to approximate
> processes), part of them in the "struct user" struct (to approximate the
> user-level thing), and part of it in a per-container thing for when/if we
> support that kind of thing?

I think Ingo's idea of a new sched_group to contain the generic
parameters needed for the "key" calculation works better than adding
more fields to existing structures (which would, of course, host
pointers to it). Otherwise I can already see the struct_signal being
the target for other unrelated fields :)

- Davide

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 20:07 ` Davide Libenzi @ 2007-04-18 21:48 ` Ingo Molnar 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 6:52 ` Mike Galbraith 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 21:48 UTC (permalink / raw) To: Davide Libenzi Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Davide Libenzi <davidel@xmailserver.org> wrote: > I think Ingo's idea of a new sched_group to contain the generic > parameters needed for the "key" calculation, works better than adding > more fields to existing strctures (that would, of course, host > pointers to it). Otherwise I can already the the struct_signal being > the target for other unrelated fields :) yeah. Another detail is that for global containers like uids, the statistics will have to be percpu_alloc()-ed, both for correctness (runqueues are per CPU) and for performance. That's one reason why i dont think it's necessarily a good idea to group-schedule threads, we dont really want to do a per thread group percpu_alloc(). In fact for threads the _reverse_ problem exists, threaded apps tend to _strive_ for more performance - hence their desperation of using the threaded programming model to begin with ;) (just think of media playback apps which are typically multithreaded) I dont think threads are all that different. Also, the resource-conserving act of using CLONE_VM to share the VM (and to use a different programming environment like Java) should not be 'punished' by forcing the thread group to be accounted as a single, shared entity against other 'fat' tasks. so my current impression is that we want per UID accounting to solve the X problem, the kernel threads problem and the many-users problem, but i'd not want to do it for threads just yet because for them there's not really any apparent problem to be solved. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 21:48 ` Ingo Molnar @ 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 17:39 ` Bernd Eckenfels 2007-04-19 6:52 ` Mike Galbraith 1 sibling, 2 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > That's one reason why i dont think it's necessarily a good idea to > group-schedule threads, we dont really want to do a per thread group > percpu_alloc(). I still do not have clear how much overhead this will bring into the table, but I think (like Linus was pointing out) the hierarchy should look like: Top (VCPU maybe?) User Process Thread The "run_queue" concept (and data) that now is bound to a CPU, need to be replicated in: ROOT <- VCPUs add themselves here VCPU <- USERs add themselves here USER <- PROCs add themselves here PROC <- THREADs add themselves here THREAD (ultimate fine grained scheduling unit) So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking up a new task would mean: VCPU = ROOT->lookup(); USER = VCPU->lookup(); PROC = USER->lookup(); THREAD = PROC->lookup(); Run-time statistics should propagate back the other way around. > In fact for threads the _reverse_ problem exists, threaded apps tend to > _strive_ for more performance - hence their desperation of using the > threaded programming model to begin with ;) (just think of media > playback apps which are typically multithreaded) The same user nicing two different multi-threaded processes would expect a predictable CPU distribution too. Doing that efficently (the old per-cpu run-queue is pretty nice from many POVs) is the real challenge. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
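A compressed sketch of that lookup chain, with each level keeping its own run queue of child entities; the types and the ->pick hook are illustrative only and are not taken from any posted patch:

	struct rq;				/* per-level run queue */

	struct sched_entity {
		struct rq *children;		/* run queue of the level below,
						   NULL for a THREAD */
		struct sched_entity *(*pick)(struct rq *rq);
	};

	/* ROOT -> VCPU -> USER -> PROC -> THREAD;
	 * run-time statistics then propagate back up the same chain */
	static struct sched_entity *pick_next_entity(struct sched_entity *root)
	{
		struct sched_entity *e = root;

		while (e->children)
			e = e->pick(e->children);
		return e;			/* the thread to run */
	}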
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 23:30 ` Davide Libenzi @ 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 15:43 ` Davide Libenzi 2007-04-21 14:09 ` Bill Davidsen 2007-04-19 17:39 ` Bernd Eckenfels 1 sibling, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 8:00 UTC (permalink / raw) To: Davide Libenzi Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Davide Libenzi <davidel@xmailserver.org> wrote: > > That's one reason why i dont think it's necessarily a good idea to > > group-schedule threads, we dont really want to do a per thread group > > percpu_alloc(). > > I still do not have clear how much overhead this will bring into the > table, but I think (like Linus was pointing out) the hierarchy should > look like: > > Top (VCPU maybe?) > User > Process > Thread > > The "run_queue" concept (and data) that now is bound to a CPU, need to be > replicated in: > > ROOT <- VCPUs add themselves here > VCPU <- USERs add themselves here > USER <- PROCs add themselves here > PROC <- THREADs add themselves here > THREAD (ultimate fine grained scheduling unit) > > So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking > up a new task would mean: > > VCPU = ROOT->lookup(); > USER = VCPU->lookup(); > PROC = USER->lookup(); > THREAD = PROC->lookup(); > > Run-time statistics should propagate back the other way around. yeah, but this looks quite bad from an overhead POV ... i think we can do alot simpler to solve X and kernel threads prioritization. > > In fact for threads the _reverse_ problem exists, threaded apps tend > > to _strive_ for more performance - hence their desperation of using > > the threaded programming model to begin with ;) (just think of media > > playback apps which are typically multithreaded) > > The same user nicing two different multi-threaded processes would > expect a predictable CPU distribution too. [...] i disagree that the user 'would expect' this. Some users might. Others would say: 'my 10-thread rendering engine is more important than a 1-thread job because it's using 10 threads for a reason'. And the CFS feedback so far strengthens this point: the default behavior of treating the thread as a single scheduling (and CPU time accounting) unit works pretty well on the desktop. think about it in another, 'kernel policy' way as well: we'd like to _encourage_ more parallel user applications. Hurting them by accounting all threads together sends the exact opposite message. > [...] Doing that efficently (the old per-cpu run-queue is pretty nice > from many POVs) is the real challenge. yeah. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-19 8:00 ` Ingo Molnar
@ 2007-04-19 15:43 ` Davide Libenzi
2007-04-21 14:09 ` Bill Davidsen
0 siblings, 0 replies; 304+ messages in thread
From: Davide Libenzi @ 2007-04-19 15:43 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III,
Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Thu, 19 Apr 2007, Ingo Molnar wrote:

> i disagree that the user 'would expect' this. Some users might. Others
> would say: 'my 10-thread rendering engine is more important than a
> 1-thread job because it's using 10 threads for a reason'. And the CFS
> feedback so far strengthens this point: the default behavior of treating
> the thread as a single scheduling (and CPU time accounting) unit works
> pretty well on the desktop.
>
> think about it in another, 'kernel policy' way as well: we'd like to
> _encourage_ more parallel user applications. Hurting them by accounting
> all threads together sends the exact opposite message.

There are counter arguments too. Like, not every user knows if a
certain process is MT or not. I agree though that doing accounting and
fairness at a depth lower than USER is messy, and not only for
performance.

- Davide

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 15:43 ` Davide Libenzi @ 2007-04-21 14:09 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-21 14:09 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Davide Libenzi <davidel@xmailserver.org> wrote: >> The same user nicing two different multi-threaded processes would >> expect a predictable CPU distribution too. [...] > > i disagree that the user 'would expect' this. Some users might. Others > would say: 'my 10-thread rendering engine is more important than a > 1-thread job because it's using 10 threads for a reason'. And the CFS > feedback so far strengthens this point: the default behavior of treating > the thread as a single scheduling (and CPU time accounting) unit works > pretty well on the desktop. > If by desktop you mean "one and only one interactive user," that's true. On a shared machine it's very hard to preserve any semblance of fairness when one user gets far more than another, based not on the value of what they're doing but the tools they use to to it. > think about it in another, 'kernel policy' way as well: we'd like to > _encourage_ more parallel user applications. Hurting them by accounting > all threads together sends the exact opposite message. > Why is that? There are lots of things which are intrinsically single threaded, how are we hurting hurting multi-threaded applications by refusing to give them more CPU than an application running on behalf of another user? By accounting all threads together we encourage writing an application in the most logical way. Threads are a solution, not a goal in themselves. >> [...] Doing that efficently (the old per-cpu run-queue is pretty nice >> from many POVs) is the real challenge. > > yeah. > > Ingo -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 8:00 ` Ingo Molnar @ 2007-04-19 17:39 ` Bernd Eckenfels 1 sibling, 0 replies; 304+ messages in thread From: Bernd Eckenfels @ 2007-04-19 17:39 UTC (permalink / raw) To: linux-kernel In article <Pine.LNX.4.64.0704181515290.25880@alien.or.mcafeemobile.com> you wrote: > Top (VCPU maybe?) > User > Process > Thread The problem with that is, that not all Schedulers might work on the User level. You can think of Batch/Job, Parent, Group, Session or namespace level. That would be IMHO a generic Top, with no need for a level above. Greetings Bernd ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 21:48 ` Ingo Molnar 2007-04-18 23:30 ` Davide Libenzi @ 2007-04-19 6:52 ` Mike Galbraith 2007-04-19 7:09 ` Ingo Molnar 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 2 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-19 6:52 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote: > so my current impression is that we want per UID accounting to solve the > X problem, the kernel threads problem and the many-users problem, but > i'd not want to do it for threads just yet because for them there's not > really any apparent problem to be solved. If you really mean UID vs EUID as Linus mentioned, I suppose I could learn to login as !root, and set KDE up to always give me root shells. With a heavily reniced X (perfectly fine), that should indeed solve my daily usage pattern nicely (always need godmode for shells, but not for mozilla and ilk. 50/50 split automatic without renice of entire gui) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:52 ` Mike Galbraith @ 2007-04-19 7:09 ` Ingo Molnar 2007-04-19 7:32 ` Mike Galbraith 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 7:09 UTC (permalink / raw) To: Mike Galbraith; +Cc: linux-kernel * Mike Galbraith <efault@gmx.de> wrote: > With a heavily reniced X (perfectly fine), that should indeed solve my > daily usage pattern nicely (always need godmode for shells, but not > for mozilla and ilk. 50/50 split automatic without renice of entire > gui) how about the first-approximation solution i suggested in the previous mail: to add a per UID default nice level? (With this default defaulting to '-10' for all root-owned processes, and defaulting to '0' for everything else.) That would solve most of the current CFS regressions at hand. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
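Purely as a sketch of what that first approximation could look like (nothing like this exists in the posted patch, and any per-uid override mechanism on top of it is an assumption):

	/* default nice level a freshly created task inherits from its uid */
	static int uid_default_nice(uid_t uid)
	{
		if (uid == 0)
			return -10;	/* root-owned processes */
		return 0;		/* everything else */
	}

A per-uid override stored alongside struct user would then let administrators adjust the default for individual users.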
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-19 7:09 ` Ingo Molnar
@ 2007-04-19 7:32 ` Mike Galbraith
2007-04-19 16:55 ` Davide Libenzi
0 siblings, 1 reply; 304+ messages in thread
From: Mike Galbraith @ 2007-04-19 7:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote:

> * Mike Galbraith <efault@gmx.de> wrote:
>
> > With a heavily reniced X (perfectly fine), that should indeed solve my
> > daily usage pattern nicely (always need godmode for shells, but not
> > for mozilla and ilk. 50/50 split automatic without renice of entire
> > gui)
>
> how about the first-approximation solution i suggested in the previous
> mail: to add a per UID default nice level? (With this default defaulting
> to '-10' for all root-owned processes, and defaulting to '0' for
> everything else.) That would solve most of the current CFS regressions
> at hand.

That would make my kernel builds etc. interfere with my other self's
surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the
X portion of my Joe-User activity pushes the compile portion of root
down in bandwidth utilization automagically, which is exactly the right
thing, because the root me is not as important as the Joe-User me using
the GUI at that time. If the idea of X disturbing root upsets some,
they can move X to another UID. Generally, it seems perfect for here.

-Mike

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 7:32 ` Mike Galbraith @ 2007-04-19 16:55 ` Davide Libenzi 2007-04-20 5:16 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-19 16:55 UTC (permalink / raw) To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel On Thu, 19 Apr 2007, Mike Galbraith wrote: > On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote: > > * Mike Galbraith <efault@gmx.de> wrote: > > > > > With a heavily reniced X (perfectly fine), that should indeed solve my > > > daily usage pattern nicely (always need godmode for shells, but not > > > for mozilla and ilk. 50/50 split automatic without renice of entire > > > gui) > > > > how about the first-approximation solution i suggested in the previous > > mail: to add a per UID default nice level? (With this default defaulting > > to '-10' for all root-owned processes, and defaulting to '0' for > > everything else.) That would solve most of the current CFS regressions > > at hand. > > That would make my kernel builds etc interfere with my other self's > surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the > X portion of my Joe-User activity pushes the compile portion of root > down in bandwidth utilization automagically, which is exactly the right > thing, because the root me is not as important as the Joe-User me using > the GUI at that time. If the idea of X disturbing root upsets some, > they can move X to another UID. Generally, it seems perfect for here. Now guys, I did not follow the whole lengthy and feisty thread, but IIRC Con's scheduler has been attacked because, among other arguments, it was requiring X to be reniced. This happened like a month ago IINM. I did not have time to look at Con's scheduler, and I only had a brief look at Ingo's one (looks very promising IMO, but so was the initial O(1) post before all the corner-cases fixes went in). But this is not about technical merit, this is about applying the same rules of judgement to others as well to ourselves. We went from a "renicing X to -10 is bad because the scheduler should be able to correctly handle the problem w/out additional external plugs" to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads class, on top of all the tasks owned by root" [1]. From a spectator POV like myself in this case, this looks rather "unfair". [1] I think, before and now, that that's more a duct tape patch than a real solution. OTOH if the "solution" is gonna be another maze of macros and heuristics filled with pretty bad corner cases, I may prefer the former. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 16:55 ` Davide Libenzi @ 2007-04-20 5:16 ` Mike Galbraith 0 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-20 5:16 UTC (permalink / raw) To: Davide Libenzi; +Cc: Ingo Molnar, linux-kernel On Thu, 2007-04-19 at 09:55 -0700, Davide Libenzi wrote: > On Thu, 19 Apr 2007, Mike Galbraith wrote: > > > On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote: > > > * Mike Galbraith <efault@gmx.de> wrote: > > > > > > > With a heavily reniced X (perfectly fine), that should indeed solve my > > > > daily usage pattern nicely (always need godmode for shells, but not > > > > for mozilla and ilk. 50/50 split automatic without renice of entire > > > > gui) > > > > > > how about the first-approximation solution i suggested in the previous > > > mail: to add a per UID default nice level? (With this default defaulting > > > to '-10' for all root-owned processes, and defaulting to '0' for > > > everything else.) That would solve most of the current CFS regressions > > > at hand. > > > > That would make my kernel builds etc interfere with my other self's > > surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the > > X portion of my Joe-User activity pushes the compile portion of root > > down in bandwidth utilization automagically, which is exactly the right > > thing, because the root me is not as important as the Joe-User me using > > the GUI at that time. If the idea of X disturbing root upsets some, > > they can move X to another UID. Generally, it seems perfect for here. > > Now guys, I did not follow the whole lengthy and feisty thread, but IIRC > Con's scheduler has been attacked because, among other arguments, it was > requiring X to be reniced. This happened like a month ago IINM. I don't object to renicing X if you want it to receive _more_ than its fair share. I do object to having to renice X in order for it to _get_ its fair share. That's what I attacked. > I did not have time to look at Con's scheduler, and I only had a brief > look at Ingo's one (looks very promising IMO, but so was the initial O(1) > post before all the corner-cases fixes went in). > But this is not about technical merit, this is about applying the same > rules of judgement to others as well to ourselves. I'm running the same tests with CFS that I ran for RSDL/SD. It falls short in one key area (to me) in that X+client cannot yet split my box 50/50 with two concurrent tasks. In the CFS case, renicing both X and client does work, but it should not be necessary IMHO. With RSDL/SD renicing didn't help. > We went from a "renicing X to -10 is bad because the scheduler should > be able to correctly handle the problem w/out additional external plugs" > to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads > class, on top of all the tasks owned by root" [1]. > From a spectator POV like myself in this case, this looks rather "unfair". Well, for me, the renicing I mentioned above is only interesting as a way to improve long term fairness with schedulers with no history. I found Linus' EUID idea intriguing in that by putting the server together with a steady load in one 'fair' domain, and clients in another, X can, if prioritized to empower it to do so, modulate the steady load in its domain (but can't starve it!), the clients modulate X, and the steady load gets it all when X and clients are idle.
The nice level of X determines to what _extent_ X can modulate the constant load rather like a mixer slider. The synchronous (I'm told) nature of X/client then becomes kind of an asset to the desktop instead of a liability. The specific case I was thinking about is the X+Gforce test where both RSDL and CFS fail to provide fairness (as defined by me;). X and Gforce are mostly not concurrent. The make -j2 I put them up against are mostly concurrent. I don't call giving 1/3 of my CPU to X+Client fair at _all_, but that's what you'll get if your fairstick of the instant generally can't see the fourth competing task. Seemed pretty cool to me because it creates the missing connection between client and server, though also likely complicated (and maybe full of perils, who knows). -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
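To spell out the arithmetic behind Mike's 1/3 figure: if X and its client mostly alternate (so only one of the pair is runnable at any instant) while make -j2 keeps two jobs runnable, a scheduler that is fair only among the currently runnable tasks sees three equal tasks and splits the CPU three ways. A back-of-the-envelope sketch, assuming perfect alternation and everything CPU-bound:

#include <stdio.h>

/* Toy model of the X+Gforce vs. make -j2 case: X and the client
 * alternate (only one of the two is runnable at a time), while the two
 * compile jobs are always runnable.  A per-task fair scheduler that
 * only looks at the runnable set splits the CPU three ways at every
 * instant. */
int main(void)
{
    double per_runnable = 1.0 / 3.0;      /* 3 runnable tasks at any instant */
    double gui_total    = per_runnable;   /* X and client share one "slot"   */
    double make_total   = 2 * per_runnable;

    printf("X + client together: %.0f%%\n", gui_total * 100);
    printf("make -j2 together:   %.0f%%\n", make_total * 100);
    /* ...whereas the 50/50 split asked for above would give each side 50%. */
    return 0;
}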
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:52 ` Mike Galbraith 2007-04-19 7:09 ` Ingo Molnar @ 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-19 7:14 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Thu, 2007-04-19 at 08:52 +0200, Mike Galbraith wrote: > On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote: > > > so my current impression is that we want per UID accounting to solve the > > X problem, the kernel threads problem and the many-users problem, but > > i'd not want to do it for threads just yet because for them there's not > > really any apparent problem to be solved. > > If you really mean UID vs EUID as Linus mentioned, I suppose I could > learn to login as !root, and set KDE up to always give me root shells. > > With a heavily reniced X (perfectly fine), that should indeed solve my > daily usage pattern nicely (always need godmode for shells, but not for > mozilla and ilk. 50/50 split automatic without renice of entire gui) Backward, needs to be EUID as Linus suggested. Kernel builds etc along with reniced X in root's bucket, surfing and whatnot in Joe-User's bucket. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
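For reference on the UID versus EUID distinction being corrected here: the effective uid is what defines the protection domain a task is currently acting in, and it is what a per-"user" scheduling bucket would most naturally key on. A trivial illustration using the standard syscalls, nothing scheduler-specific:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* Print the real and effective uid of the current process.  For a
 * setuid binary, or after seteuid(), the two differ; grouping tasks
 * "per user" by euid groups them by the protection domain they are
 * acting in rather than by who logged in. */
int main(void)
{
    printf("uid  = %d\n", (int)getuid());
    printf("euid = %d\n", (int)geteuid());
    return 0;
}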
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:43 ` Ingo Molnar 2007-04-18 20:07 ` Davide Libenzi @ 2007-04-18 21:04 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > perhaps a more fitting term would be 'precise group-scheduling'. > > Within the lowest level task group entity (be that thread group or > > uid group, etc.) 'precise scheduling' is equivalent to 'fairness'. > > Yes. Absolutely. Except I think that at least if you're going to name > somethign "complete" (or "perfect" or "precise"), you should also > admit that groups can be hierarchical. yes. Am i correct to sum up your impression as: " Ingo, for you the hierarchy still appears to be an after-thought, while in practice it's easily the most important thing! Why are you so hung up about 'fairness', it makes no sense!" right? and you would definitely be right if you suggested that i neglected the 'group scheduling' aspects of CFS (except for a minimalistic nice level implementation, which is a poor-man's-non-automatic-group-scheduling), but i very much know its important and i'll definitely fix it for -v4. But please let me explain my reasons for my different focus: yes, group scheduling in practice is the most important first-layer thing, and without it any of the other 'CFS wins' can easily be useless. Firstly, i have not neglected the group scheduling related CFS regressions at all, mainly because there _is_ already a quick hack to check whether group scheduling would solve these regressions: renice. And it was tried in both of the two CFS regression cases i'm aware of: Mike's X starvation problem and Willy's "kevents starvation with thousands of scheddos tasks running" problem. And in both cases, applying the renice hack [which should be properly and automatically implemented as uid group scheduling] fixed the regression for them! So i was not worried at all, group scheduling _provably solves_ these CFS regressions. I rather concentrated on the CFS regressions that were much less clear. But PLEASE believe me: even with perfect cross-group CPU allocation but with a simple non-heuristic scheduler underlying it, you can _easily_ get a sucky desktop experience! I know it because i tried it and others tried it too. (in fact the first version of sched_fair.c was tick based and low-res, and it sucked) Two more things were needed: - the high precision of nsec/64-bit accounting ('reliability of scheduling') - extremely even time-distribution of CPU power ('determinism/smoothness, human perception') (i'm expanding on these two concepts further below) take out any of these and group scheduling or not, you are easily going to have a sucky desktop! (We know that from years of experiments: many people tried to rip out the unfairness from the scheduler and there were always nasty corner cases that 'should' have worked but didnt.) Without these we'd in essence start again at square one, just at a different square, this time with another group of people being irritated! But the biggest and hardest to achieve _wins_ of CFS are _NOT_ achieved via a simple 'get rid of the unfairness of the upstream scheduler and apply group scheduling'. 
(I know that because i tried it before and because others tried it before, for many many years.) You will _easily_ get sucky desktop experience. The other two things are very much needed too: - the high precision of nsec/64-bit accounting, and the many corner-cases this solves. (For example on a typical desktop there are _lots_ of timing-driven workloads that are in essence 'invisible' to low-resolution, timer-tick based accounting and are heavily skewed.) - extremely even time-distribution of CPU power. CFS behaves pretty well even under the dreaded 'make -jN in an xterm' kernel build workload as reported by Mark Lord, because it also distributes CPU power in a _finegrained_ way. A shell prompt under CFS still behaves acceptably on a single-CPU testbox of mine with a "make -j50" workload. (yes, fifty) Humans react alot more negatively to sudden changes in application behavior ('lags', pauses, short hangs) than they react to fine, gradual, all-encompassing slowdowns. This is a key property of CFS. ( Otherwise renicing X to -10 would have solved most of the interactivity complaints against the vanilla scheduler, otherwise renicing X to -10 would have fixed Mike's setup under SD (it didnt) while it worked much better under CFS, otherwise Gene wouldnt have found CFS markedly better than SD, etc., etc. So getting rid of the heuristics is less than 50% of the road to the perfect desktop scheduler. ) and i claim that these were the really hard bits, and i spent most of the CFS coding only on getting _these_ details 100% right under various workloads, and it makes a night and day difference _even without any group scheduling help_. and note another reason here: group scheduling _masks_ many other scheduling deficiencies that are possible in scheduler. So since CFS doesnt do group scheduling, i get a _fuller_ picture of the behavior of the core "precise scheduling" engine. At the initial stage i didnt want to hide bugs by masking them via group scheduling, especially because the renice workaround/hack was available. Guess how nice it all will get if we also add group scheduling to the mix, and people wouldnt have to add nasty and fragile renice based hacks, it will 'just work' out of box? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
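The renice workaround Ingo keeps referring to is just an ordinary nice adjustment applied by hand, e.g. 'renice -10 -p <pid of X>' run as root. A minimal programmatic sketch of the same thing with setpriority(2), assuming the X server's pid is passed on the command line:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Renice one process to -10: the stopgap discussed above for giving X
 * more CPU than a strictly task-fair scheduler would hand it.  Lowering
 * the nice value requires root. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    if (setpriority(PRIO_PROCESS, pid, -10) != 0) {
        perror("setpriority");
        return 1;
    }
    return 0;
}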
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 17:59 ` Ingo Molnar @ 2007-04-18 19:23 ` Linus Torvalds 2007-04-18 19:56 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 19:23 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > > But note that most of the reported CFS interactivity wins, as surprising > as it might be, were due to fairness between _the same user's tasks_. And *ALL* of the CFS interactivity *losses* and complaints have been because it did the wrong thing _between different users' tasks_ So what's your point? Your point was that when people try it out as a single user, it is indeed fair. But that's no point at all, since it totally missed _my_ point. The problems with X scheduling are exactly that "other user" kind of thing. The problem with kernel thread starvation due to user threads getting all the CPU time is exactly the same issue. As long as you think that all threads are equal, and should be treated equally, you CANNOT make it work well. People can say "ok, you can renice X", but the whole problem stems from the fact that you're trying to be fair based on A TOTALLY INVALID NOTION of what "fair" is. > In the typical case, 99% of the desktop CPU time is executed either as X > (root user) or under the uid of the logged in user, and X is just one > task. So? You are ignoring the argument again. You're totally bringing up a red herring: > Even with a bad hack of making X super-high-prio, interactivity as > experienced by users still sucks without having fairness between the > other 100-200 user tasks that a desktop system is typically using. I didn't say that you should be *unfair* within one user group. What kind of *idiotic* argument are you trying to put forth? OF COURSE you should be fair "within the user group". Nobody contests that the "other 100-200 user tasks" should be scheduled fairly _amongst themselves_. The only point I had was that you cannot just lump all threads together and say "these threads are equally important". The 100-200 user tasks may be equally important, and should get equal amounts of preference, but that has absolutely _zero_ bearing on the _single_ task run in another "scheduling group", ie by other users or by X. I'm not arguing against fairness. I'm arguing against YOUR notion of fairness, which is obviously bogus. It is *not* fair to try to give out CPU time evenly! Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:23 ` Linus Torvalds @ 2007-04-18 19:56 ` Davide Libenzi 2007-04-18 20:11 ` Linus Torvalds 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 19:56 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Linus Torvalds wrote: > I'm not arguing against fairness. I'm arguing against YOUR notion of > fairness, which is obviously bogus. It is *not* fair to try to give out > CPU time evenly! "Perhaps on the rare occasion pursuing the right course demands an act of unfairness, unfairness itself can be the right course?" - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:56 ` Davide Libenzi @ 2007-04-18 20:11 ` Linus Torvalds 2007-04-19 0:22 ` Davide Libenzi 0 siblings, 1 reply; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 20:11 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Davide Libenzi wrote: > > "Perhaps on the rare occasion pursuing the right course demands an act of > unfairness, unfairness itself can be the right course?" I don't think that's the right issue. It's just that "fairness" != "equal". Do you think it "fair" to pay everybody the same regardless of how good a job they do? I don't think anybody really believes that. Equating "fair" and "equal" is simply a very fundamental mistake. They're not the same thing. Never have been, and never will. Now, there's no question that "equal" is much easier to implement, if only because it's a lot easier to agree what it means. "Equal parts" is something everybody can agree on. "Fair parts" automatically involves a balancing act, and people will invariably count things differently and thus disagree about what is "fair" and what is not. I don't think we can ever get a "perfect" setup for that reason, but I think we can get something that at least gets reasonably close, at least for the obvious cases. So my suggested test-case of running one process as one user and two processes as another one has a fairly "obviously correct" solution if you have just one CPU, and you can probably be pretty fair in practice on two CPU's (there's an obvious theoretical solution, whether you can get there with a practical algorithm is another thing). On three or more CPU's, you obviously wouldn't even *want* to be fair, since you can very naturally just give a CPU to each.. Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
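A small sketch of the hierarchical split being described, assuming equal weights at every level and a single CPU, purely for illustration: divide the CPU equally among users, each user's share equally among that user's processes, and each process's share among its threads. Linus's one-process-vs-two-processes test case then comes out 50/25/25, and the one-vs-five-processes example quoted elsewhere in the thread comes out 50/10/10/10/10/5/5:

#include <stdio.h>

/* Split 100% of one CPU equally among users, then among each user's
 * processes, then among each process's threads.  Equal weights
 * everywhere, everybody CPU-bound, one CPU. */
static void split(int nusers, const int procs_per_user[],
                  const int threads_per_proc[][8])
{
    double user_share = 1.0 / nusers;
    for (int u = 0; u < nusers; u++) {
        double proc_share = user_share / procs_per_user[u];
        for (int p = 0; p < procs_per_user[u]; p++) {
            int nthreads = threads_per_proc[u][p];
            double thread_share = proc_share / nthreads;
            for (int t = 0; t < nthreads; t++)
                printf("user %d proc %d thread %d: %5.1f%%\n",
                       u, p, t, thread_share * 100.0);
        }
    }
}

int main(void)
{
    /* One user runs 1 process, the other runs 2: 50 / 25 / 25. */
    const int procs_a[]      = { 1, 2 };
    const int threads_a[][8] = { { 1 }, { 1, 1 } };
    split(2, procs_a, threads_a);

    /* One user runs 1 process, the other runs 5, the fifth with two
     * threads: 50 / 10 / 10 / 10 / 10 / 5 / 5. */
    const int procs_b[]      = { 1, 5 };
    const int threads_b[][8] = { { 1 }, { 1, 1, 1, 1, 2 } };
    split(2, procs_b, threads_b);

    return 0;
}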
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 20:11 ` Linus Torvalds @ 2007-04-19 0:22 ` Davide Libenzi 2007-04-19 0:30 ` Linus Torvalds 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-19 0:22 UTC (permalink / raw) To: Linus Torvalds Cc: Davide Libenzi, Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Linus Torvalds wrote: > On Wed, 18 Apr 2007, Davide Libenzi wrote: > > > > "Perhaps on the rare occasion pursuing the right course demands an act of > > unfairness, unfairness itself can be the right course?" > > I don't think that's the right issue. > > It's just that "fairness" != "equal". > > Do you think it "fair" to pay everybody the same regardless of how good a > job they do? I don't think anybody really believes that. > > Equating "fair" and "equal" is simply a very fundamental mistake. They're > not the same thing. Never have been, and never will. I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :) - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 0:22 ` Davide Libenzi @ 2007-04-19 0:30 ` Linus Torvalds 0 siblings, 0 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-19 0:30 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Davide Libenzi wrote: > > I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :) Ahh, I'm clearly not cultured enough, I didn't catch that reference. Linus "yes, I've seen the movie, but it apparently left more of a mark in other people" Torvalds ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:49 ` Ingo Molnar @ 2007-04-18 18:02 ` William Lee Irwin III 2007-04-18 18:12 ` Ingo Molnar 2007-04-18 18:36 ` Diego Calleja 2007-04-19 0:37 ` Peter Williams 3 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 18:02 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote: > So I claim that anything that cannot be fair by user ID is actually really > REALLY unfair. I think it's absolutely humongously STUPID to call > something the "Completely Fair Scheduler", and then just be fair on a > thread level. That's not fair AT ALL! It's the anti-thesis of being fair! > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the users fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. > So you should see one thread get 50& CPU (single thread of one user), 4 > threads get 10% CPU (their fair share of that users time), and 2 threads > get 5% CPU (the fair share within that thread group!). > Any scheduling argument that just considers the above to be "7 threads > total" and gives each thread 14% of CPU time "fairly" is *anything* but > fair. It's a joke if that kind of scheduler then calls itself CFS! I don't think it's completely fair [sic] to come down on it that hard. It does largely achieve the sort of fairness it set out for itself as its design goal. One should also note that the queueing mechanism is more than flexible enough to handle prioritization by a number of different methods, and the large precision of its priorities is useful there. So a rather broad variety of policies can be implemented by changing the ->fair_key calculations. In some respects, the vast priority space and very high clock precision are two of its most crucial advantages. On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote: > And yes, that's largely what the current scheduler will do, but at least > the current scheduler doesn't claim to be fair! So the current scheduler > is a lot *better* if only in the sense that it doesn't make ridiculous > claims that aren't true! The name chosen was somewhat buzzwordy. I'd have named it something more descriptive of the algorithm, though what's implemented in the current dynamic priority (i.e. ->fair_key) calculations are somewhat difficult to precisely categorize. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 18:02 ` William Lee Irwin III @ 2007-04-18 18:12 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 18:12 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > It does largely achieve the sort of fairness it set out for itself as > its design goal. One should also note that the queueing mechanism is > more than flexible enough to handle prioritization by a number of > different methods, and the large precision of its priorities is useful > there. So a rather broad variety of policies can be implemented by > changing the ->fair_key calculations. yeah. Note that i concentrated on the bit that makes the largest interactivity improvement: to implement "precise scheduling" (a'ka complete fairness) between the 100+ user tasks that do a complex scheduling dance on a typical desktop on various workloads. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 18:02 ` William Lee Irwin III @ 2007-04-18 18:36 ` Diego Calleja 2007-04-19 0:37 ` Peter Williams 3 siblings, 0 replies; 304+ messages in thread From: Diego Calleja @ 2007-04-18 18:36 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007 10:22:59 -0700 (PDT), Linus Torvalds <torvalds@linux-foundation.org> wrote: > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the users fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. "Fairness between users" was implemented a long time ago by Rik van Riel (http://surriel.com/patches/2.4/2.4.19-fairsched). Some people have been asking for functionality like that for a long time, ie: universities that want to avoid gcc processes from one student who is trying to learn how fork() works from starving the processes of the rest of the students. But not only do they want "fairness between users", they also want "priorities between users and/or groups of users", ie: "the 'students' group shouldn't starve the 'admins' group". ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds ` (2 preceding siblings ...) 2007-04-18 18:36 ` Diego Calleja @ 2007-04-19 0:37 ` Peter Williams 3 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-19 0:37 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner Linus Torvalds wrote: > > On Wed, 18 Apr 2007, Matt Mackall wrote: >> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: >>> And "fairness by euid" is probably a hell of a lot easier to do than >>> trying to figure out the wakeup matrix. >> For the record, you actually don't need to track a whole NxN matrix >> (or do the implied O(n**3) matrix inversion!) to get to the same >> result. > > I'm sure you can do things differently, but the reason I think "fairness > by euid" is actually worth looking at is that it's pretty much the > *identical* issue that we'll have with "fairness by virtual machine" and a > number of other "container" issues. > > The fact is: > > - "fairness" is *not* about giving everybody the same amount of CPU time > (scaled by some niceness level or not). Anybody who thinks that is > "fair" is just being silly and hasn't thought it through. > > - "fairness" is multi-level. You want to be fair to threads within a > thread group (where "process" may be one good approximation of what a > "thread group" is, but not necessarily the only one). > > But you *also* want to be fair in between those "thread groups", and > then you want to be fair across "containers" (where "user" may be one > such container). > > So I claim that anything that cannot be fair by user ID is actually really > REALLY unfair. I think it's absolutely humongously STUPID to call > something the "Completely Fair Scheduler", and then just be fair on a > thread level. That's not fair AT ALL! It's the anti-thesis of being fair! > > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the users fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. > > So you should see one thread get 50& CPU (single thread of one user), 4 > threads get 10% CPU (their fair share of that users time), and 2 threads > get 5% CPU (the fair share within that thread group!). > > Any scheduling argument that just considers the above to be "7 threads > total" and gives each thread 14% of CPU time "fairly" is *anything* but > fair. It's a joke if that kind of scheduler then calls itself CFS! > > And yes, that's largely what the current scheduler will do, but at least > the current scheduler doesn't claim to be fair! So the current scheduler > is a lot *better* if only in the sense that it doesn't make ridiculous > claims that aren't true! > > Linus Sounds a lot like the PLFS (process level fair sharing) scheduler in Aurema's ARMTech (for whom I used to work). The "fair" in the title is a bit misleading as it's all about unfair scheduling in order to meet specific policies. 
But it's based on the principle that if you can allocate CPU bandwidth "fairly" (which really means in proportion to the entitlement each process is allocated) then you can allocate CPU bandwidth "fairly" between higher level entities such as process groups, user groups and so on by subdividing the entitlements downwards. The tricky part of implementing this was the fact that not all entities at the various levels have sufficient demand for CPU bandwidth to use their entitlements and this in turn means that the entities above them will have difficulty using their entitlements even if some of their subordinates have sufficient demand (because their entitlements will be too small). The trick is to have a measure of each entity's demand for CPU bandwidth and use that to modify the way entitlement is divided among subordinates. As a first guess, an entity's CPU bandwidth usage is an indicator of demand but doesn't take into account unmet demand due to tasks waiting on a run queue for access to the CPU. On the other hand, usage plus time waiting on the queue isn't a good measure of demand either (although it's probably a good upper bound) as it's unlikely that the task would have used the same amount of CPU as the waiting time if it had gone straight to the CPU. But my main point is that it is possible to build schedulers that can achieve higher level scheduling policies. Versions of PLFS work on Windows from user space by twiddling process priorities. Part of my more recent work at Aurema involved patching Linux's scheduler so that nice worked more predictably so that we could release a user space version of PLFS for Linux. The other part was to add hard CPU bandwidth caps for processes so that ARMTech could enforce hard CPU bandwidth caps on higher level entities (as this can't be done without the kernel being able to do it at that level). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
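One way to picture the entitlement-subdivision problem Peter describes is to give each child a nominal slice of its parent's entitlement, cap the slice at the child's measured demand, and hand whatever goes unused back to the siblings whose demand is not yet met. The sketch below only illustrates that idea with made-up demand figures; it is not ARMTech's or PLFS's actual algorithm:

#include <stdio.h>

#define N 4

/* Divide a parent's CPU entitlement among N children, capping each
 * child at its demand and redistributing the leftover to children
 * whose demand is not yet met.  Iterates until nothing more can be
 * handed out. */
static void divide(double parent, const double demand[N], double out[N])
{
    int capped[N] = { 0 };
    double left = parent;
    int uncapped = N;

    for (int i = 0; i < N; i++)
        out[i] = 0.0;

    while (left > 1e-9 && uncapped > 0) {
        double slice = left / uncapped;
        left = 0.0;
        for (int i = 0; i < N; i++) {
            if (capped[i])
                continue;
            double want = demand[i] - out[i];
            if (want <= slice) {
                out[i] += want;
                left += slice - want;   /* unused part goes back to the pool */
                capped[i] = 1;
                uncapped--;
            } else {
                out[i] += slice;
            }
        }
    }
}

int main(void)
{
    /* Hypothetical demands, as fractions of one CPU. */
    const double demand[N] = { 0.05, 0.10, 0.60, 0.90 };
    double share[N];

    divide(1.0, demand, share);
    for (int i = 0; i < N; i++)
        printf("child %d: demand %.2f -> share %.2f\n", i, demand[i], share[i]);
    return 0;
}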
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds @ 2007-04-18 19:05 ` Davide Libenzi 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 19:05 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > And "fairness by euid" is probably a hell of a lot easier to do than > > trying to figure out the wakeup matrix. > > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. You can converge on the same node weightings (ie dynamic > priorities) by applying a damped function at each transition point > (directed wakeup, preemption, fork, exit). > > The trouble with any scheme like this is that it needs careful tuning > of the damping factor to converge rapidly and not oscillate and > precise numerical attention to the transition functions so that the sum of > dynamic priorities is conserved. Doing that inside the boundaries of the time constraints imposed by a scheduler is the interesting part. Given also that the size (and members) of it (matrix) is dynamic. Also, a "wakeup matrix" (if the name correctly pictures what it is for) would help with latencies and priority inheritance, but not for global fairness. The maniacal fairness focus we're seeing now is due to the fact that the mainline can have extremely unfair behaviour under certain conditions. IMO fairness, although important, should not be the main objective of the scheduler rewrite. Simplification and predictability should be a higher priority, with interactivity achievements bound to decent fairness constraints. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
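For readers wondering what "a damped function at each transition point" might look like in its very simplest form: an update applied on every directed wakeup that moves a damped fraction of weight from the wakee to the waker, so the total stays conserved and tasks that many others wait on (an X server, say) gradually accumulate weight. This is a toy sketch with an invented damping factor, not code from any scheduler in this thread, and it deliberately ignores the tuning and convergence questions Matt raises:

#include <stdio.h>

#define DAMPING 0.1    /* invented damping factor: too high oscillates,
                          too low converges slowly */

struct task {
    const char *name;
    double weight;     /* dynamic priority / share weighting */
};

/* On a directed wakeup, shift a damped fraction of the wakee's weight
 * to the waker.  The sum of weights is conserved, so repeated wakeups
 * gradually tilt CPU share toward tasks that do work on behalf of
 * others, without letting them starve anybody. */
static void wakeup_transition(struct task *waker, struct task *wakee)
{
    double delta = DAMPING * wakee->weight;
    waker->weight += delta;
    wakee->weight -= delta;
}

int main(void)
{
    struct task xserver = { "X", 1.0 };
    struct task client  = { "client", 1.0 };

    for (int i = 0; i < 5; i++) {
        wakeup_transition(&xserver, &client);    /* X wakes the client */
        printf("after wakeup %d: %s=%.3f %s=%.3f (sum %.3f)\n", i + 1,
               xserver.name, xserver.weight, client.name, client.weight,
               xserver.weight + client.weight);
    }
    return 0;
}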
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 19:05 ` Davide Libenzi @ 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 0 replies; 304+ messages in thread From: Michael K. Edwards @ 2007-04-18 19:13 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/18/07, Matt Mackall <mpm@selenic.com> wrote: > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. You can converge on the same node weightings (ie dynamic > priorities) by applying a damped function at each transition point > (directed wakeup, preemption, fork, exit). > > The trouble with any scheme like this is that it needs careful tuning > of the damping factor to converge rapidly and not oscillate and > precise numerical attention to the transition functions so that the sum of > dynamic priorities is conserved. That would be the control theory approach. And yes, you have to get both the theoretical transfer function and the numerics right. It sometimes helps to use a control-systems framework like the classic Takagi-Sugeno-Kang fuzzy logic controller; get the numerics right once and for all, and treat the heuristics as data, not logic. (I haven't worked in this area in almost twenty years, but Google -- yes, I do use Google+brain for fact-checking; what do you do? -- says that people are still doing active research on TSK models, and solid fixed-point reference implementations are readily available.) That seems like an attractive strategy here because you could easily embed the control engine in the kernel and load rule sets dynamically. Done right, that could give most of the advantages of pluggable schedulers (different heuristic strokes for different folks) without diluting the tester pool for the actual engine code. (Of course, different scheduling strategies require different input data, and you might not want the overhead of collecting data that your chosen heuristics won't use. But that's not much different from the netfilter situation, and is obviously a solvable problem, if anyone cares to put that much work in. The people who ought to be funding this kind of work are Sun and IBM, who don't have a chance on the desktop and are in big trouble in the database tier; their future as processor vendors depends on being able to service presentation-tier and business-logic-tier loads efficiently on their massively multi-core chips. MIPS should pitch in too, on behalf of licensees like Cavium who need more predictable behavior on multi-core embedded Linux.) Note also that you might not even want to persistently prioritize particular processes or process groups. You might want a heuristic that notices that some task (say, the X server) often responds to being awakened by doing a little work and then unblocking the task that awakened it. When it is pinged from some highly interactive task, you want it to jump the scheduler queue just long enough to unblock the interactive task, which may mean letting it flush some work out of its internal queue. 
But otherwise you want to batch things up until there's too much "scheduler pressure" behind it, then let it work more or less until it runs out of things to do, because its working set is so large that repeatedly scheduling it in and out is hell on caches. (Priority inheritance is the classic solution to the blocked-high-priority-task problem _in_isolation_. It is not without its pitfalls, especially when the designer of the "server" didn't expect to lose his timeslice instantly on releasing the lock. True priority inheritance is probably not something you want to inflict on a non-real-time system, but you do need some urgency heuristic. What a "fuzzy logic" framework does for you is to let you combine competing heuristics in a way that remains amenable to analysis using control theory techniques.) What does any of this have to do with "fairness"? Nothing whatsoever! There's work that has to be done, and choosing when to do it is almost entirely a matter of staying out of the way of more urgent work while minimizing the task's negative impact on the rest of the system. Does that mean that the X server is "special", kind of the way that latency-sensitive A/V applications are "special", and belongs in a separate scheduler class? No. Nowadays, workloads where the kernel has any idea what tasks belong to what "users" are the exception, not the norm. The X server is the canary in the coal mine, and a scheduler that won't do the right thing for X without hand tweaking won't do the right thing for other eyeball-driven, multiple-tiers-on-one-box scenarios either. If you want fairness among users to the extent that their demands _compete_, you might as well partition the whole machine, and have a separate fairness-oriented scheduler (let's call it a "hypervisor") that lives outside the kernel. (Talk about two students running gcc on the same shell server, with more important people also doing things on the same system, is so 1990's!) Not that the design of scheduler heuristics shouldn't include "fairness"-like considerations; but they're probably only interesting as a fallback for when the scheduler has no idea what it ought to schedule next. So why is Ingo's scheduler apparently working well for desktop loads? I haven't tried it or even looked at its code, but from its marketing I would guess that it effectively penalizes tasks whose I/O requests can be serviced from (or directed to) cache long enough to actually consume a whole timeslice. This is prima facie evidence that their _current_behavior_ is non-interactive. Presumably this penalty expires quickly when the task again asks for information that is not readily at hand (or writes data that the system is not willing to cache) -- which usually implies either actual user interaction or a change of working set, both of which deserve an "urgency premium". The mainline scheduler seems to contain various heuristics that mistake a burst of non-interactive _activity_ for a persistently non-interactive _task_. Take them away in the name of "fairness", and the system adapts more quickly to the change of working set implied by a change of user focus. There are probably fewer pathological load patterns too, since manual knob-turning uninformed by control theory is a lot less likely to get you into trouble when there are few knobs and no deliberately inserted long-time-constant feedback paths. 
But you can't say there are _no_ pathological load patterns, or even that the major economic drivers of the Linux ecosystem don't generate them, until you do some authentic engineering analysis. In short (too late!) -- alternate schedulers are fun to experiment with, and the sort of people who would actually try out patches floated on LKML may find that they improve their desktop experience, hosting farm throughput, etc. But even if the mainline scheduler is a hack atop a kludge covering a crock, it's more or less what production applications have expected since the last major architectural shift (NPTL). There's just no sense in replacing it until you can either add real value (say, integral clock scaling for power efficiency, with a reasonable "spinning reserve" for peaking load) or demonstrate stability by engineering analysis instead of trial and error. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
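For the curious, the zero-order form of the Takagi-Sugeno-Kang controller Michael mentions boils down to a weighted average: each rule's activation is computed from membership functions of the inputs, and the crisp output is the activation-weighted mean of the rules' constant consequents. The rules, membership functions and inputs below are invented purely to show the mechanics; this is not a proposal for actual scheduler heuristics:

#include <stdio.h>

/* "low" and "high" membership over [0,1], piecewise linear. */
static double mf_low(double x)  { return x < 0.0 ? 1.0 : x > 1.0 ? 0.0 : 1.0 - x; }
static double mf_high(double x) { return 1.0 - mf_low(x); }

/* Zero-order TSK evaluation: rule activations from the membership
 * functions, constant consequents, weighted-average defuzzification. */
static double tsk_boost(double sleep_ratio, double wakeup_rate)
{
    /* Rule 1: sleeps a lot AND wakes others often -> big boost. */
    double w1 = mf_high(sleep_ratio) * mf_high(wakeup_rate), z1 = 10.0;
    /* Rule 2: sleeps a lot but wakes nobody -> small boost. */
    double w2 = mf_high(sleep_ratio) * mf_low(wakeup_rate),  z2 = 3.0;
    /* Rule 3: hogs the CPU -> no boost. */
    double w3 = mf_low(sleep_ratio),                          z3 = 0.0;

    double wsum = w1 + w2 + w3;
    return wsum > 0.0 ? (w1 * z1 + w2 * z2 + w3 * z3) / wsum : 0.0;
}

int main(void)
{
    printf("interactive-ish task: boost %.2f\n", tsk_boost(0.8, 0.7));
    printf("batch task:           boost %.2f\n", tsk_boost(0.1, 0.1));
    return 0;
}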
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall @ 2007-04-19 3:18 ` Nick Piggin 2007-04-19 5:14 ` Andrew Morton 2007-04-21 13:40 ` Bill Davidsen 2 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-19 3:18 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > > On Wed, 18 Apr 2007, Matt Mackall wrote: > > > > Why is X special? Because it does work on behalf of other processes? > > Lots of things do this. Perhaps a scheduler should focus entirely on > > the implicit and directed wakeup matrix and optimizing that > > instead[1]. > > I 100% agree - the perfect scheduler would indeed take into account where > the wakeups come from, and try to "weigh" processes that help other > processes make progress more. That would naturally give server processes > more CPU power, because they help others > > I don't believe for a second that "fairness" means "give everybody the > same amount of CPU". That's a totally illogical measure of fairness. All > processes are _not_ created equal. I believe that unless the kernel is told of these inequalities, then it must schedule fairly. And yes, by fairly, I mean fairly among all threads as a base resource class, because that's what Linux has always done (and if you aggregate into higher classes, you still need that per-thread scheduling). So I'm not excluding extra scheduling classes like per-process, per-user, but among any class of equal schedulable entities, fair scheduling is the only option because the alternative of unfairness is just insane. > That said, even trying to do "fairness by effective user ID" would > probably already do a lot. In a desktop environment, X would get as much > CPU time as the user processes, simply because it's in a different > protection domain (and that's really what "effective user ID" means: it's > not about "users", it's really about "protection domains"). > > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. Well my X server has an euid of root, which would mean my X clients can cause X to do work and eat into root's resources. Or as Ingo said, X may not be running as root. Seems like just another hack to try to implicitly solve the X problem and probably create a lot of others along the way. All fairness issues aside, in the context of keeping a very heavily loaded desktop interactive, X is special. That you are trying to think up funny rules that would implicitly give X better priority is kind of indicative of that. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 3:18 ` Nick Piggin @ 2007-04-19 5:14 ` Andrew Morton 2007-04-19 6:38 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Andrew Morton @ 2007-04-19 5:14 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, 19 Apr 2007 05:18:07 +0200 Nick Piggin <npiggin@suse.de> wrote: > And yes, by fairly, I mean fairly among all threads as a base resource > class, because that's what Linux has always done Yes, there are potential compatibility problems. Example: a machine with 100 busy httpd processes and suddenly a big gzip starts up from console or cron. Under current kernels, that gzip will take ages and the httpds will take a 1% slowdown, which may well be exactly the behaviour which is desired. If we were to schedule by UID then the gzip suddenly gets 50% of the CPU and those httpd's all take a 50% hit, which could be quite serious. That's simple to fix via nicing, but people have to know to do that, and there will be a transition period where some disruption is possible. ^ permalink raw reply [flat|nested] 304+ messages in thread
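Putting numbers on Andrew's example, assuming a single CPU and everything CPU-bound: per-task fairness gives the gzip and each of the 100 httpds roughly 1% apiece, while per-UID fairness gives the gzip 50% and each httpd roughly 0.5%:

#include <stdio.h>

int main(void)
{
    int httpds = 100, gzips = 1;

    /* Per-task fairness: 101 equal runnable tasks. */
    double per_task = 100.0 / (httpds + gzips);
    printf("per-task: each httpd %.2f%%, gzip %.2f%%\n", per_task, per_task);

    /* Per-UID fairness: two uids split the CPU, the httpds then split
     * their uid's half among themselves. */
    double per_uid_httpd = 50.0 / httpds;
    printf("per-uid:  each httpd %.2f%%, gzip %.2f%%\n", per_uid_httpd, 50.0);
    return 0;
}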
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 5:14 ` Andrew Morton @ 2007-04-19 6:38 ` Ingo Molnar 2007-04-19 7:57 ` William Lee Irwin III 2007-04-19 8:33 ` Nick Piggin 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 6:38 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner * Andrew Morton <akpm@linux-foundation.org> wrote: > > And yes, by fairly, I mean fairly among all threads as a base > > resource class, because that's what Linux has always done > > Yes, there are potential compatibility problems. Example: a machine > with 100 busy httpd processes and suddenly a big gzip starts up from > console or cron. > > Under current kernels, that gzip will take ages and the httpds will > take a 1% slowdown, which may well be exactly the behaviour which is > desired. > > If we were to schedule by UID then the gzip suddenly gets 50% of the > CPU and those httpd's all take a 50% hit, which could be quite > serious. > > That's simple to fix via nicing, but people have to know to do that, > and there will be a transition period where some disruption is > possible. hmmmm. How about the following then: default to nice -10 for all (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ special: root already has disk space reserved to it, root has special memory allocation allowances, etc. I dont see a reason why we couldnt by default make all root tasks have nice -10. This would be instantly loved by sysadmins i suspect ;-) (distros that go the extra mile of making Xorg run under non-root could also go another extra one foot to renice that X server to -10.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:38 ` Ingo Molnar @ 2007-04-19 7:57 ` William Lee Irwin III 2007-04-19 11:50 ` Peter Williams 2007-04-19 8:33 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-19 7:57 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner * Andrew Morton <akpm@linux-foundation.org> wrote: >> Yes, there are potential compatibility problems. Example: a machine >> with 100 busy httpd processes and suddenly a big gzip starts up from >> console or cron. [...] On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: > hmmmm. How about the following then: default to nice -10 for all > (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ > special: root already has disk space reserved to it, root has special > memory allocation allowances, etc. I dont see a reason why we couldnt by > default make all root tasks have nice -10. This would be instantly loved > by sysadmins i suspect ;-) > (distros that go the extra mile of making Xorg run under non-root could > also go another extra one foot to renice that X server to -10.) I'd further recommend making priority levels accessible to kernel threads that are not otherwise accessible to processes, both above and below user-available priority levels. Basically, if you can get SCHED_RR and SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN scheduler class can coexist with SCHED_OTHER in like fashion, but with availability of higher and lower priorities than any userspace process is allowed, and potentially some differing scheduling semantics. In such a manner nonessential background processing intended not to ever disturb userspace can be given priorities appropriate to it (perhaps even con's SCHED_IDLEPRIO would make sense), and other, urgent processing can be given priority over userspace altogether. I believe root's default priority can be adjusted in userspace as things now stand somewhere in /etc/ but I'm not sure of the specifics. Word is somewhere in /etc/security/limits.conf -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
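The file wli has in mind is pam_limits' /etc/security/limits.conf, which in the versions I'm familiar with accepts a 'priority' item that sets the default nice level for matching login sessions. Assuming a pam_limits new enough to support it, something along these lines would give root logins the -10 default Ingo suggested (check limits.conf(5) on the target system; the @wheel line is only an illustration):

# /etc/security/limits.conf (pam_limits)
# <domain>    <type>    <item>      <value>
root          -         priority    -10
@wheel        -         priority    -5
*             -         priority    0

Note that this only affects sessions that go through pam_limits, i.e. login shells and their children; it does nothing for kernel threads or for daemons started directly from init.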
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 7:57 ` William Lee Irwin III @ 2007-04-19 11:50 ` Peter Williams 2007-04-20 5:26 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-19 11:50 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > * Andrew Morton <akpm@linux-foundation.org> wrote: >>> Yes, there are potential compatibility problems. Example: a machine >>> with 100 busy httpd processes and suddenly a big gzip starts up from >>> console or cron. > [...] > > On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: >> hmmmm. How about the following then: default to nice -10 for all >> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ >> special: root already has disk space reserved to it, root has special >> memory allocation allowances, etc. I dont see a reason why we couldnt by >> default make all root tasks have nice -10. This would be instantly loved >> by sysadmins i suspect ;-) >> (distros that go the extra mile of making Xorg run under non-root could >> also go another extra one foot to renice that X server to -10.) > > I'd further recommend making priority levels accessible to kernel threads > that are not otherwise accessible to processes, both above and below > user-available priority levels. Basically, if you can get SCHED_RR and > SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN > scheduler class can coexist with SCHED_OTHER in like fashion, but with > availability of higher and lower priorities than any userspace process > is allowed, and potentially some differing scheduling semantics. In such > a manner nonessential background processing intended not to ever disturb > userspace can be given priorities appropriate to it (perhaps even con's > SCHED_IDLEPRIO would make sense), and other, urgent processing can be > given priority over userspace altogether. > > I believe root's default priority can be adjusted in userspace as > things now stand somewhere in /etc/ but I'm not sure of the specifics. > Word is somewhere in /etc/security/limits.conf This is sounding very much like System V Release 4 (and descendants) except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that are in system mode dynamic priorities in the SCHED_SYS range (to avoid priority inversion, I believe). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 11:50 ` Peter Williams @ 2007-04-20 5:26 ` William Lee Irwin III 2007-04-20 6:16 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-20 5:26 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> I'd further recommend making priority levels accessible to kernel threads >> that are not otherwise accessible to processes, both above and below >> user-available priority levels. Basically, if you can get SCHED_RR and >> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN >> scheduler class can coexist with SCHED_OTHER in like fashion, but with >> availability of higher and lower priorities than any userspace process >> is allowed, and potentially some differing scheduling semantics. In such >> a manner nonessential background processing intended not to ever disturb >> userspace can be given priorities appropriate to it (perhaps even con's >> SCHED_IDLEPRIO would make sense), and other, urgent processing can be >> given priority over userspace altogether. On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote: > This is sounding very much like System V Release 4 (and descendants) > except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that > are in system mode dynamic priorities in the SCHED_SYS range (to avoid > priority inversion, I believe). Descriptions of that are probably where I got the idea (hurrah for OS textbooks). It makes a fair amount of sense. Not sure what the take on the specific precedent is. The only content here is expanding the priority range with ranges above and below for the exclusive use of ultra-privileged tasks, so it's really trivial. Actually it might be so trivial it should just be some permission checks in the SCHED_OTHER renicing code. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
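The "permission checks in the SCHED_OTHER renicing code" wli mentions amount to accepting a wider nice range for sufficiently privileged callers. The fragment below is hypothetical illustration only: the extended-range constants and the helper name are invented, and it is not based on the real kernel renice path; it just shows the shape of such a check:

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical nice ranges: ordinary tasks keep the usual -20..19,
 * while kernel threads and root may use an extended band above and
 * below it, so background kthreads can be pushed under all userspace
 * and urgent ones lifted above it. */
#define USER_NICE_MIN   (-20)
#define USER_NICE_MAX     19
#define PRIV_NICE_MIN   (-30)   /* invented value */
#define PRIV_NICE_MAX     29    /* invented value */

static bool nice_value_permitted(int nice, bool is_root, bool is_kthread)
{
    if (is_root || is_kthread)
        return nice >= PRIV_NICE_MIN && nice <= PRIV_NICE_MAX;
    return nice >= USER_NICE_MIN && nice <= USER_NICE_MAX;
}

int main(void)
{
    printf("nice -25 as ordinary user: %s\n",
           nice_value_permitted(-25, false, false) ? "allowed" : "denied");
    printf("nice -25 as kthread:       %s\n",
           nice_value_permitted(-25, false, true) ? "allowed" : "denied");
    return 0;
}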
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 5:26 ` William Lee Irwin III @ 2007-04-20 6:16 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-20 6:16 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> I'd further recommend making priority levels accessible to kernel threads >>> that are not otherwise accessible to processes, both above and below >>> user-available priority levels. Basically, if you can get SCHED_RR and >>> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN >>> scheduler class can coexist with SCHED_OTHER in like fashion, but with >>> availability of higher and lower priorities than any userspace process >>> is allowed, and potentially some differing scheduling semantics. In such >>> a manner nonessential background processing intended not to ever disturb >>> userspace can be given priorities appropriate to it (perhaps even con's >>> SCHED_IDLEPRIO would make sense), and other, urgent processing can be >>> given priority over userspace altogether. > > On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote: >> This is sounding very much like System V Release 4 (and descendants) >> except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that >> are in system mode dynamic priorities in the SCHED_SYS range (to avoid >> priority inversion, I believe). > > Descriptions of that are probably where I got the idea (hurrah for OS > textbooks). And long term background memory. :-) > It makes a fair amount of sense. Yes. You could also add a SCHED_IA in between SCHED_SYS and SCHED_OTHER (a la Solaris) for interactive tasks. The only problem is how to get a task into SCHED_IA without root privileges. > Not sure what the take on > the specific precedent is. The only content here is expanding the > priority range with ranges above and below for the exclusive use of > ultra-privileged tasks, so it's really trivial. Actually it might be so > trivial it should just be some permission checks in the SCHED_OTHER > renicing code. Perhaps. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:38 ` Ingo Molnar 2007-04-19 7:57 ` William Lee Irwin III @ 2007-04-19 8:33 ` Nick Piggin 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-19 8:33 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > And yes, by fairly, I mean fairly among all threads as a base > > > resource class, because that's what Linux has always done > > > > Yes, there are potential compatibility problems. Example: a machine > > with 100 busy httpd processes and suddenly a big gzip starts up from > > console or cron. > > > > Under current kernels, that gzip will take ages and the httpds will > > take a 1% slowdown, which may well be exactly the behaviour which is > > desired. > > > > If we were to schedule by UID then the gzip suddenly gets 50% of the > > CPU and those httpd's all take a 50% hit, which could be quite > > serious. > > > > That's simple to fix via nicing, but people have to know to do that, > > and there will be a transition period where some disruption is > > possible. > > hmmmm. How about the following then: default to nice -10 for all > (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ > special: root already has disk space reserved to it, root has special > memory allocation allowances, etc. I dont see a reason why we couldnt by > default make all root tasks have nice -10. This would be instantly loved > by sysadmins i suspect ;-) I have no problem with doing fancy new fairness classes and things. But considering that we _need_ to have per-thread fairness and that is also what the current scheduler has and what we need to do well for obvious reasons, the best path to take is to get per-thread scheduling up to a point where it is able to replace the current scheduler, then look at more complex things after that. ^ permalink raw reply [flat|nested] 304+ messages in thread
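A minimal sketch of the "nice -10 by default for kernel threads and root-owned tasks" idea from the mail above, written as a standalone helper. The struct, the field names and the policy encoding are assumptions made purely for illustration.

struct boosted_task {
	unsigned int uid;
	int kernel_thread;	/* 1 for in-kernel threads */
	int nice;
	int policy;		/* 0 stands in for SCHED_NORMAL in this model */
};

/* give kernel threads and root-owned SCHED_NORMAL tasks nice -10 by default */
static void apply_default_boost(struct boosted_task *t)
{
	if (t->policy != 0)
		return;
	if ((t->kernel_thread || t->uid == 0) && t->nice > -10)
		t->nice = -10;
}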
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall 2007-04-19 3:18 ` Nick Piggin @ 2007-04-21 13:40 ` Bill Davidsen 2 siblings, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-21 13:40 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner Linus Torvalds wrote: > > On Wed, 18 Apr 2007, Matt Mackall wrote: >> Why is X special? Because it does work on behalf of other processes? >> Lots of things do this. Perhaps a scheduler should focus entirely on >> the implicit and directed wakeup matrix and optimizing that >> instead[1]. > > I 100% agree - the perfect scheduler would indeed take into account where > the wakeups come from, and try to "weigh" processes that help other > processes make progress more. That would naturally give server processes > more CPU power, because they help others > > I don't believe for a second that "fairness" means "give everybody the > same amount of CPU". That's a totally illogical measure of fairness. All > processes are _not_ created equal. > > That said, even trying to do "fairness by effective user ID" would > probably already do a lot. In a desktop environment, X would get as much > CPU time as the user processes, simply because it's in a different > protection domain (and that's really what "effective user ID" means: it's > not about "users", it's really about "protection domains"). > > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. > You probably want to consider the controlling terminal as well... do you want to have people starting 'at' jobs competing on equal footing with people typing at a terminal? I'm not offering an answer, just raising the question. And for some database applications, everyone in a group may connect with the same login-id, then do sub authorization to the database application. euid may be an issue there as well. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
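As a reading aid, a rough model of "fairness by effective user ID": CPU time is accounted per euid, the least-served euid is picked first, and then the least-served task within it. All names, the data layout and the fixed two-level pick are hypothetical assumptions, not code from the thread.

/* Two-level pick: least-served protection domain, then its least-served task. */
#include <stddef.h>

struct euid_task {
	unsigned long long cpu_ns;	/* CPU time this task has received */
};

struct euid_group {
	unsigned int euid;
	unsigned long long cpu_ns;	/* CPU time charged to this euid */
	struct euid_task *tasks;
	size_t nr_tasks;
};

static struct euid_task *pick_next(struct euid_group *groups, size_t nr_groups)
{
	struct euid_group *g = NULL;
	struct euid_task *t = NULL;
	size_t i;

	for (i = 0; i < nr_groups; i++)		/* least-served euid first */
		if (groups[i].nr_tasks && (!g || groups[i].cpu_ns < g->cpu_ns))
			g = &groups[i];
	if (!g)
		return NULL;
	for (i = 0; i < g->nr_tasks; i++)	/* then its least-served task */
		if (!t || g->tasks[i].cpu_ns < t->cpu_ns)
			t = &g->tasks[i];
	return t;
}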
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:26 ` William Lee Irwin III @ 2007-04-17 6:50 ` Davide Libenzi 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:11 ` Nick Piggin 1 sibling, 2 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 6:50 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, Nick Piggin wrote: > > All things are not equal; they all have different properties. I like > > Exactly. So we have to explore those properties and evaluate performance > (in all meanings of the word). That's only logical. I had a quick look at Ingo's code yesterday. Ingo is always smart to prepare a main dish (feature) with a nice side dish (code cleanup) to Linus ;) And even this code does that pretty nicely. The deadline design looks good, although I think the final "key" calculation code will end up quite different from how it looks now. I would suggest thoroughly testing all your alternatives before deciding. Some code and design may look very good and small at the beginning, but when you start patching it to cover all the dark spots, you effectively end up with another thing (in both design and code footprint). About O(1), I never thought it was a must (besides being good marketing material), and O(log(N)) *may* be just fine (to be verified, of course). - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:50 ` Davide Libenzi @ 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams ` (3 more replies) 2007-04-17 7:11 ` Nick Piggin 1 sibling, 4 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 7:09 UTC (permalink / raw) To: Davide Libenzi Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > I had a quick look at Ingo's code yesterday. Ingo is always smart to > prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) > And even this code does that pretty nicely. The deadline designs looks > good, although I think the final "key" calculation code will end up quite > different from what it looks now. The additive nice_offset breaks nice levels. A multiplicative priority weighting of a different, nonnegative metric of cpu utilization from what's now used is required for nice levels to work. I've been trying to point this out politely by strongly suggesting testing whether nice levels work. On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > I would suggest to thoroughly test all your alternatives before deciding. > Some code and design may look very good and small at the beginning, but > when you start patching it to cover all the dark spots, you effectively > end up with another thing (in both design and code footprint). > About O(1), I never thought it was a must (besides a good marketing > material), and O(log(N)) *may* be just fine (to be verified, of course). The trouble with thorough testing right now is that no one agrees on what the tests should be and a number of the testcases are not in great shape. An agreed-upon set of testcases for basic correctness should be devised and the implementations of those testcases need to be maintainable code and the tests set up for automated testing and changing their parameters without recompiling via command-line options. Once there's a standard regression test suite for correctness, one needs to be devised for performance, including interactive performance. The primary difficulty I see along these lines is finding a way to automate tests of graphics and input device response performance. Others, like how deterministically priorities are respected over progressively smaller time intervals and noninteractive workload performance are nowhere near as difficult to arrange and in many cases already exist. Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
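To make the additive-versus-multiplicative point concrete: the mail above argues that an additive nice_offset washes out as runtime accumulates, whereas multiplicative weights keep the long-run share of each always-runnable task proportional to its weight. The small runnable example below only shows the multiplicative arithmetic; the roughly +25%/-20% per-step factors are arbitrary assumptions for the example, not CFS's table.

#include <stdio.h>

/* weight per nice level; an assumed geometric series, for illustration only */
static double nice_weight(int nice)
{
	double w = 1.0;

	for (; nice < 0; nice++)
		w *= 1.25;	/* each step below 0: ~25% more CPU */
	for (; nice > 0; nice--)
		w *= 0.8;	/* each step above 0: ~20% less CPU */
	return w;
}

int main(void)
{
	int nices[] = { 0, 5, 10 };
	double sum = 0.0;
	int i;

	for (i = 0; i < 3; i++)
		sum += nice_weight(nices[i]);
	/* long-run share of an always-runnable task = weight / total weight */
	for (i = 0; i < 3; i++)
		printf("nice %3d -> %5.1f%% of the CPU\n",
		       nices[i], 100.0 * nice_weight(nices[i]) / sum);
	return 0;
}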
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III @ 2007-04-17 7:22 ` Peter Williams 2007-04-17 7:23 ` Nick Piggin ` (2 subsequent siblings) 3 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 7:22 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: >> I had a quick look at Ingo's code yesterday. Ingo is always smart to >> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) >> And even this code does that pretty nicely. The deadline designs looks >> good, although I think the final "key" calculation code will end up quite >> different from what it looks now. > > The additive nice_offset breaks nice levels. A multiplicative priority > weighting of a different, nonnegative metric of cpu utilization from > what's now used is required for nice levels to work. I've been trying > to point this out politely by strongly suggesting testing whether nice > levels work. > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: >> I would suggest to thoroughly test all your alternatives before deciding. >> Some code and design may look very good and small at the beginning, but >> when you start patching it to cover all the dark spots, you effectively >> end up with another thing (in both design and code footprint). >> About O(1), I never thought it was a must (besides a good marketing >> material), and O(log(N)) *may* be just fine (to be verified, of course). > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. At this point, I'd like direct everyone's attention to the simloads package: <http://downloads.sourceforge.net/cpuse/simloads-0.1.1.tar.gz> which contains a set of programs designed to be used in the construction of CPU scheduler tests. Of particular use is the aspin program which can be used to launch tasks with specified sleep/wake characteristics. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
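In the same spirit as the aspin program described above (but not taken from simloads), a minimal duty-cycle load generator: burn the CPU for busy_ms, sleep for sleep_ms, repeat. The defaults below are arbitrary.

#include <stdlib.h>
#include <time.h>

/* spin on the monotonic clock for roughly @ms milliseconds */
static void busy_wait_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

int main(int argc, char **argv)
{
	long busy_ms = argc > 1 ? atol(argv[1]) : 5;
	long sleep_ms = argc > 2 ? atol(argv[2]) : 45;
	struct timespec ts;

	ts.tv_sec = sleep_ms / 1000;
	ts.tv_nsec = (sleep_ms % 1000) * 1000000L;
	for (;;) {
		busy_wait_ms(busy_ms);
		nanosleep(&ts, NULL);
	}
}

Running a handful of these with different busy/sleep ratios alongside a CPU hog gives a crude but repeatable mix of "interactive-looking" and batch load.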
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams @ 2007-04-17 7:23 ` Nick Piggin 2007-04-17 7:27 ` Davide Libenzi 2007-04-17 7:33 ` Ingo Molnar 3 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:23 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 12:09:49AM -0700, William Lee Irwin III wrote: > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. Definitely. It would be really good if we could have interactivity regression tests too (see my earlier wishful email). The problem with a lot of the scripted interactivity tests I see is that they don't really capture the complexities of the interactions within, say, an interactive X session. Others just go straight for trying to exploit the design by making lots of high priority processes runnable at once. This just provides an unrealistic decoy and you end up trying to tune for the wrong thing. ^ permalink raw reply [flat|nested] 304+ messages in thread
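As one example of a probe that measures something user-visible rather than a decoy workload, here is a crude wakeup-latency test: request a 10 ms sleep and record how much later the task actually runs again, while the machine is otherwise loaded. It is only a sketch, not one of the benchmarks named in the thread.

#include <stdio.h>
#include <time.h>

static long long ns_of(const struct timespec *ts)
{
	return ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

int main(void)
{
	struct timespec req, before, after;
	long long delay, worst = 0;
	int i;

	req.tv_sec = 0;
	req.tv_nsec = 10 * 1000 * 1000;		/* ask for a 10ms sleep */
	for (i = 0; i < 1000; i++) {
		clock_gettime(CLOCK_MONOTONIC, &before);
		nanosleep(&req, NULL);
		clock_gettime(CLOCK_MONOTONIC, &after);
		/* excess over the requested sleep is scheduling delay */
		delay = ns_of(&after) - ns_of(&before) - ns_of(&req);
		if (delay > worst)
			worst = delay;
	}
	printf("worst wakeup delay: %lld us\n", worst / 1000);
	return 0;
}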
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams 2007-04-17 7:23 ` Nick Piggin @ 2007-04-17 7:27 ` Davide Libenzi 2007-04-17 7:33 ` Nick Piggin 2007-04-17 7:33 ` Ingo Molnar 3 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 7:27 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > I would suggest to thoroughly test all your alternatives before deciding. > > Some code and design may look very good and small at the beginning, but > > when you start patching it to cover all the dark spots, you effectively > > end up with another thing (in both design and code footprint). > > About O(1), I never thought it was a must (besides a good marketing > > material), and O(log(N)) *may* be just fine (to be verified, of course). > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. What I meant was, that the rules (requirements and associated test cases) for this new Scheduler Amazing Race should be set forward, and not kept a moving target to fit&follow one or the other implementation. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:27 ` Davide Libenzi @ 2007-04-17 7:33 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:33 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 12:27:28AM -0700, Davide Libenzi wrote: > On Tue, 17 Apr 2007, William Lee Irwin III wrote: > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > > I would suggest to thoroughly test all your alternatives before deciding. > > > Some code and design may look very good and small at the beginning, but > > > when you start patching it to cover all the dark spots, you effectively > > > end up with another thing (in both design and code footprint). > > > About O(1), I never thought it was a must (besides a good marketing > > > material), and O(log(N)) *may* be just fine (to be verified, of course). > > > > The trouble with thorough testing right now is that no one agrees on > > what the tests should be and a number of the testcases are not in great > > shape. An agreed-upon set of testcases for basic correctness should be > > devised and the implementations of those testcases need to be > > maintainable code and the tests set up for automated testing and > > changing their parameters without recompiling via command-line options. > > > > Once there's a standard regression test suite for correctness, one > > needs to be devised for performance, including interactive performance. > > The primary difficulty I see along these lines is finding a way to > > automate tests of graphics and input device response performance. Others, > > like how deterministically priorities are respected over progressively > > smaller time intervals and noninteractive workload performance are > > nowhere near as difficult to arrange and in many cases already exist. > > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. > > What I meant was, that the rules (requirements and associated test cases) > for this new Scheduler Amazing Race should be set forward, and not kept a > moving target to fit&follow one or the other implementation. Exactly. Well I don't mind if it is a moving target as such, just as long as the decisions are rational (no "blah is more important because I say so"). ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III ` (2 preceding siblings ...) 2007-04-17 7:27 ` Davide Libenzi @ 2007-04-17 7:33 ` Ingo Molnar 2007-04-17 7:40 ` Nick Piggin 2007-04-17 9:05 ` William Lee Irwin III 3 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 7:33 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > > prepare a main dish (feature) with a nice sider (code cleanup) to > > Linus ;) And even this code does that pretty nicely. The deadline > > designs looks good, although I think the final "key" calculation > > code will end up quite different from what it looks now. > > The additive nice_offset breaks nice levels. A multiplicative priority > weighting of a different, nonnegative metric of cpu utilization from > what's now used is required for nice levels to work. I've been trying > to point this out politely by strongly suggesting testing whether nice > levels work. granted, CFS's nice code is still incomplete, but you err quite significantly with this extreme statement that they are "broken". nice levels certainly work to a fair degree even in the current code and much of the focus is elsewhere - just try it. (In fact i claim that CFS's nice levels often work _better_ than the mainline scheduler's nice level support, for the testcases that matter to users.) The precise behavior of nice levels, as i pointed it out in previous mails, is largely 'uninteresting' and it has changed multiple times in the past 10 years. What matters to users is mainly: whether X reniced to -10 does get enough CPU time and whether stuff reniced to +19 doesnt take away too much CPU time from the rest of the system. _How_ a Linux scheduler achieves this is an internal matter and certainly CFS does it in a hacky way at the moment. All the rest, 'CPU bandwidth utilization' or whatever abstract metric we could come up with is just a fancy academic technicality that has no real significance to any of the testers who are trying CFS right now. Sure we prefer final solutions that are clean and make sense (because such things are the easiest to maintain long-term), and often such final solutions are quite close to academic concepts, and i think Davide correctly observed this by saying that "the final key calculation code will end up quite different from what it looks now", but your extreme-end claim of 'breakage' for something that is just plain incomplete is not really a fair characterisation at this point. Anyone who thinks that there exists only two kinds of code: 100% correct and 100% incorrect with no shades of grey inbetween is in reality a sort of an extremist: whom, depending on mood and affection, we could call either a 'coding purist' or a 'coding taliban' ;-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
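The property described above ("X reniced to -10 gets enough CPU, +19 doesn't take away too much") is easy to spot-check. A rough harness, with error handling omitted: start two CPU hogs, renice one to +19, let them run for ten seconds, then compare the CPU time each accumulated according to /proc/<pid>/stat.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/resource.h>

static pid_t spawn_hog(int nice_val)
{
	pid_t pid = fork();

	if (pid == 0) {
		volatile unsigned long x = 0;

		setpriority(PRIO_PROCESS, 0, nice_val);
		for (;;)
			x++;		/* burn CPU until killed */
	}
	return pid;
}

/* utime + stime of @pid in seconds, from fields 14 and 15 of /proc/pid/stat */
static double cpu_seconds(pid_t pid)
{
	char path[64], buf[1024];
	unsigned long utime, stime;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	f = fopen(path, "r");
	fgets(buf, sizeof(buf), f);
	fclose(f);
	sscanf(strrchr(buf, ')') + 2,
	       "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
	       &utime, &stime);
	return (double)(utime + stime) / sysconf(_SC_CLK_TCK);
}

int main(void)
{
	pid_t normal = spawn_hog(0), niced = spawn_hog(19);

	sleep(10);
	printf("nice  0: %.1fs   nice 19: %.1fs\n",
	       cpu_seconds(normal), cpu_seconds(niced));
	kill(normal, SIGKILL);
	kill(niced, SIGKILL);
	wait(NULL);
	wait(NULL);
	return 0;
}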
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:33 ` Ingo Molnar @ 2007-04-17 7:40 ` Nick Piggin 2007-04-17 7:58 ` Ingo Molnar 2007-04-17 9:05 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:40 UTC (permalink / raw) To: Ingo Molnar Cc: William Lee Irwin III, Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > > * William Lee Irwin III <wli@holomorphy.com> wrote: > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > > > prepare a main dish (feature) with a nice sider (code cleanup) to > > > Linus ;) And even this code does that pretty nicely. The deadline > > > designs looks good, although I think the final "key" calculation > > > code will end up quite different from what it looks now. > > > > The additive nice_offset breaks nice levels. A multiplicative priority > > weighting of a different, nonnegative metric of cpu utilization from > > what's now used is required for nice levels to work. I've been trying > > to point this out politely by strongly suggesting testing whether nice > > levels work. > > granted, CFS's nice code is still incomplete, but you err quite > significantly with this extreme statement that they are "broken". > > nice levels certainly work to a fair degree even in the current code and > much of the focus is elsewhere - just try it. (In fact i claim that > CFS's nice levels often work _better_ than the mainline scheduler's nice > level support, for the testcases that matter to users.) > > The precise behavior of nice levels, as i pointed it out in previous > mails, is largely 'uninteresting' and it has changed multiple times in > the past 10 years. > > What matters to users is mainly: whether X reniced to -10 does get > enough CPU time and whether stuff reniced to +19 doesnt take away too > much CPU time from the rest of the system. I agree there. > _How_ a Linux scheduler > achieves this is an internal matter and certainly CFS does it in a hacky > way at the moment. > > All the rest, 'CPU bandwidth utilization' or whatever abstract metric we > could come up with is just a fancy academic technicality that has no > real significance to any of the testers who are trying CFS right now. > > Sure we prefer final solutions that are clean and make sense (because > such things are the easiest to maintain long-term), and often such final > solutions are quite close to academic concepts, and i think Davide > correctly observed this by saying that "the final key calculation code > will end up quite different from what it looks now", but your > extreme-end claim of 'breakage' for something that is just plain > incomplete is not really a fair characterisation at this point. > > Anyone who thinks that there exists only two kinds of code: 100% correct > and 100% incorrect with no shades of grey inbetween is in reality a sort > of an extremist: whom, depending on mood and affection, we could call > either a 'coding purist' or a 'coding taliban' ;-) Only if you are an extremist-naming extremist with no shades of grey. Others, like myself, also include 'coding al-qaeda' and 'coding john howard' in that scale. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:40 ` Nick Piggin @ 2007-04-17 7:58 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 7:58 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > Anyone who thinks that there exists only two kinds of code: 100% > > correct and 100% incorrect with no shades of grey inbetween is in > > reality a sort of an extremist: whom, depending on mood and > > affection, we could call either a 'coding purist' or a 'coding > > taliban' ;-) > > Only if you are an extremist-naming extremist with no shades of grey. > Others, like myself, also include 'coding al-qaeda' and 'coding john > howard' in that scale. heh ;) You, you ... nitpicking extremist! ;) And beware that you just commited another act of extremism too: > I agree there. because you just went to the extreme position of saying that "i agree with this portion 100%", instead of saying "this seems to be 91.5% correct in my opinion, Tue, 17 Apr 2007 09:40:25 +0200". and the nasty thing is, that in reality even shades of grey, if you print them out, are just a set of extreme black dots on an extreme white sheet of paper! ;) [ so i guess we've got to consider the scope of extremism too: the larger the scope, the more limiting and hence the more dangerous it is. ] Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:33 ` Ingo Molnar 2007-04-17 7:40 ` Nick Piggin @ 2007-04-17 9:05 ` William Lee Irwin III 2007-04-17 9:24 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 9:05 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> The additive nice_offset breaks nice levels. A multiplicative priority >> weighting of a different, nonnegative metric of cpu utilization from >> what's now used is required for nice levels to work. I've been trying >> to point this out politely by strongly suggesting testing whether nice >> levels work. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > granted, CFS's nice code is still incomplete, but you err quite > significantly with this extreme statement that they are "broken". I used the word relatively loosely. Nothing extreme is going on. Maybe the phrasing exaggerated the force of the opinion. I'm sorry about having misspoke so. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > nice levels certainly work to a fair degree even in the current code and > much of the focus is elsewhere - just try it. (In fact i claim that > CFS's nice levels often work _better_ than the mainline scheduler's nice > level support, for the testcases that matter to users.) Al Boldi's testcase appears to reveal some issues. I'm plotting a testcase of my own if I can ever get past responding to email. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > The precise behavior of nice levels, as i pointed it out in previous > mails, is largely 'uninteresting' and it has changed multiple times in > the past 10 years. I expect that whether a scheduler can handle such prioritization has a rather strong predictive quality regarding whether it can handle, say, CKRM controls. I remain convinced that there should be some target behavior and that some attempt should be made to achieve it. I don't think any particular behavior is best, just that the behavior should be well-defined. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > What matters to users is mainly: whether X reniced to -10 does get > enough CPU time and whether stuff reniced to +19 doesnt take away too > much CPU time from the rest of the system. _How_ a Linux scheduler > achieves this is an internal matter and certainly CFS does it in a hacky > way at the moment. It's not so far out. Basically just changing the key calculation in a relatively simple manner should get things into relatively good shape. It can, of course, be done other ways (I did it a rather different way in vdls, though that method is not likely to be considered desirable). I can't really write a testcase for such loose semantics, so the above description is useless to me. These squishy sorts of definitions of semantics are also uninformative to users, who, I would argue, do have some interest in what nice levels mean. There have been at least a small number of concerns about the strength of nice levels, and it would reveal issues surrounding that area earlier if there were an objective one could test to see if it were achieved. It's furthermore a user-visible change in system call semantics we should be more careful about changing out from beneath users. 
So I see a lot of good reasons to pin down nice numbers. Incompleteness is not a particularly mortal sin, but the proliferation of competing schedulers is creating a need for standards, and that's what I'm really on about. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > All the rest, 'CPU bandwidth utilization' or whatever abstract metric we > could come up with is just a fancy academic technicality that has no > real significance to any of the testers who are trying CFS right now. I could say "percent cpu" if it sounds less like formal jargon, which "CPU bandwidth utilization" isn't really. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > Sure we prefer final solutions that are clean and make sense (because > such things are the easiest to maintain long-term), and often such final > solutions are quite close to academic concepts, and i think Davide > correctly observed this by saying that "the final key calculation code > will end up quite different from what it looks now", but your > extreme-end claim of 'breakage' for something that is just plain > incomplete is not really a fair characterisation at this point. It wasn't meant to be quite as strong a statement as it came out. Sorry about that. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > Anyone who thinks that there exists only two kinds of code: 100% correct > and 100% incorrect with no shades of grey inbetween is in reality a sort > of an extremist: whom, depending on mood and affection, we could call > either a 'coding purist' or a 'coding taliban' ;-) I've made no such claims. Also rest assured that the tone of the critique is not hostile, and wasn't meant to sound that way. Also, given the general comments it appears clear that some statistical metric of deviation from the intended behavior furthermore qualified by timescale is necessary, so this appears to be headed toward a sort of performance metric as opposed to a pass/fail test anyway. However, to even measure this at all, some statement of intention is required. I'd prefer that there be a Linux-standard semantics for nice so results are more directly comparable and so that users also get similar nice behavior from the scheduler as it varies over time and possibly implementations if users should care to switch them out with some scheduler patch or other. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
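One possible concrete "statement of intention" of the kind asked for here, phrased as a metric rather than a pass/fail test: over a sampling window, each task's target is its weight's share of the CPU time actually handed out, and the score is the worst relative deviation from that target. The weights and the window length are whatever a given test chooses; only the metric is shown, and it is an assumption of this sketch rather than anything agreed in the thread.

#include <math.h>
#include <stddef.h>

struct sample {
	double weight;		/* relative entitlement of the task */
	double cpu;		/* CPU time it received during the window */
};

/* returns max_i |observed_i - target_i| / target_i over the window */
static double worst_deviation(const struct sample *s, size_t n)
{
	double wsum = 0.0, csum = 0.0, worst = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		wsum += s[i].weight;
		csum += s[i].cpu;
	}
	for (i = 0; i < n; i++) {
		double target = csum * s[i].weight / wsum;
		double dev = fabs(s[i].cpu - target) / target;

		if (dev > worst)
			worst = dev;
	}
	return worst;
}

Reporting this number for several window sizes gives the "qualified by timescale" part: a scheduler may be fair over seconds but quite unfair over tens of milliseconds.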
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:05 ` William Lee Irwin III @ 2007-04-17 9:24 ` Ingo Molnar 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 22:08 ` Matt Mackall 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:24 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > [...] Also rest assured that the tone of the critique is not hostile, > and wasn't meant to sound that way. ok :) (And i guess i was too touchy - sorry about coming out swinging.) > Also, given the general comments it appears clear that some > statistical metric of deviation from the intended behavior furthermore > qualified by timescale is necessary, so this appears to be headed > toward a sort of performance metric as opposed to a pass/fail test > anyway. However, to even measure this at all, some statement of > intention is required. I'd prefer that there be a Linux-standard > semantics for nice so results are more directly comparable and so that > users also get similar nice behavior from the scheduler as it varies > over time and possibly implementations if users should care to switch > them out with some scheduler patch or other. yeah. If you could come up with a sane definition that also translates into low overhead on the algorithm side that would be great! The only good generic definition i could come up with (nice levels are isolated buckets with a constant maximum relative percentage of CPU time available to every active bucket) resulted in having a per-nice-level array of rbtree roots, which did not look worth the hassle at first sight :-) until now the main approach for nice levels in Linux was always: "implement your main scheduling logic for nice 0 and then look for some low-overhead method that can be glued to it that does something that behaves like nice levels". Feel free to turn that around into a more natural approach, but the algorithm should remain fairly simple i think. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
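A data-structure sketch of the "isolated buckets" definition above: one queue per nice level, each with a weight, and the next level to run is the one furthest behind its entitled share. The per-level task queue (the rbtree root mentioned in the mail) is left as a comment; the weight fields and the selection rule are assumptions for illustration.

#include <stddef.h>

#define NICE_LEVELS 40

struct nice_bucket {
	unsigned int weight;		/* share this level is entitled to */
	unsigned long long received;	/* CPU time its tasks have received */
	size_t nr_tasks;
	/* per-level task queue (an rbtree root in the proposal) would go here */
};

struct nice_queue {
	struct nice_bucket bucket[NICE_LEVELS];	/* nice -20 .. +19 */
	unsigned long long total_received;
	unsigned int total_weight;		/* sum over non-empty buckets */
};

/* pick the non-empty level whose received/entitled ratio is smallest */
static int pick_nice_level(const struct nice_queue *q)
{
	double best_ratio = 0.0;
	int best = -1, i;

	for (i = 0; i < NICE_LEVELS; i++) {
		const struct nice_bucket *b = &q->bucket[i];
		double entitled, ratio;

		if (!b->nr_tasks)
			continue;
		entitled = (double)q->total_received * b->weight / q->total_weight;
		ratio = entitled > 0.0 ? b->received / entitled : 0.0;
		if (best < 0 || ratio < best_ratio) {
			best = i;
			best_ratio = ratio;
		}
	}
	return best;		/* index into bucket[], or -1 if all empty */
}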
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:24 ` Ingo Molnar @ 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 10:01 ` Ingo Molnar 2007-04-17 11:31 ` William Lee Irwin III 2007-04-17 22:08 ` Matt Mackall 1 sibling, 2 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 9:57 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> Also, given the general comments it appears clear that some >> statistical metric of deviation from the intended behavior furthermore >> qualified by timescale is necessary, so this appears to be headed >> toward a sort of performance metric as opposed to a pass/fail test [...] On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > yeah. If you could come up with a sane definition that also translates > into low overhead on the algorithm side that would be great! The only > good generic definition i could come up with (nice levels are isolated > buckets with a constant maximum relative percentage of CPU time > available to every active bucket) resulted in having a per-nice-level > array of rbtree roots, which did not look worth the hassle at first > sight :-) Interesting! That's what vdls did, except its fundamental data structure was more like a circular buffer data structure (resembling Davide Libenzi's timer ring in concept, but with all the details different). I'm not entirely sure how that would've turned out performancewise if I'd done any tuning at all. I was mostly interested in doing something like what I heard Bob Mullens did in 1976 for basic pedagogical value about schedulers to prepare for writing patches for gang scheduling as opposed to creating a viable replacement for the mainline scheduler. I'm relatively certain a different key calculation will suffice, but it may disturb other desired semantics since they really need to be nonnegative for multiplying by a scaling factor corresponding to its nice number to work properly. Well, as the cfs code now stands, it would correspond to negative keys. Dividing positive keys by the nice scaling factor is my first thought of how to extend the method to the current key semantics. Or such are my thoughts on the subject. I expect that all that's needed is to fiddle with those numbers a bit. There's quite some capacity for expression there given the precision. On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > until now the main approach for nice levels in Linux was always: > "implement your main scheduling logic for nice 0 and then look for some > low-overhead method that can be glued to it that does something that > behaves like nice levels". Feel free to turn that around into a more > natural approach, but the algorithm should remain fairly simple i think. Part of my insistence was because it seemed to be relatively close to a one-liner, though I'm not entirely sure what particular computation to use to handle the signedness of the keys. I guess I could pick some particular nice semantics myself and then sweep the extant schedulers to use them after getting a testcase hammered out. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
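The "circular buffer" arrangement mentioned above, in miniature, is also a useful reading aid for the vdls patch posted later in the thread, whose batch_queue indexes an array of run lists modulo the array size from a moving base. Sizes, types and the singly linked lists here are simplified assumptions, not the patch's code.

#include <stddef.h>

#define RING_SLOTS 128

struct ring_task {
	struct ring_task *next;		/* singly linked run list */
	unsigned int prio;		/* distance ahead of the ring's base */
};

struct run_ring {
	unsigned int base;		/* slot currently being served */
	struct ring_task *slot[RING_SLOTS];
};

/* enqueue @t 'prio' slots ahead of the current base */
static void ring_enqueue(struct run_ring *r, struct ring_task *t)
{
	unsigned int idx = (r->base + t->prio) % RING_SLOTS;

	t->next = r->slot[idx];
	r->slot[idx] = t;
}

/* advance the base to the next occupied slot and take one task from it */
static struct ring_task *ring_dequeue(struct run_ring *r)
{
	unsigned int i;

	for (i = 0; i < RING_SLOTS; i++) {
		unsigned int idx = (r->base + i) % RING_SLOTS;
		struct ring_task *t = r->slot[idx];

		if (t) {
			r->base = idx;
			r->slot[idx] = t->next;
			return t;
		}
	}
	return NULL;			/* ring empty */
}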
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:57 ` William Lee Irwin III @ 2007-04-17 10:01 ` Ingo Molnar 2007-04-17 11:31 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 10:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > > > until now the main approach for nice levels in Linux was always: > > "implement your main scheduling logic for nice 0 and then look for > > some low-overhead method that can be glued to it that does something > > that behaves like nice levels". Feel free to turn that around into a > > more natural approach, but the algorithm should remain fairly simple > > i think. > > Part of my insistence was because it seemed to be relatively close to > a one-liner, though I'm not entirely sure what particular computation > to use to handle the signedness of the keys. I guess I could pick some > particular nice semantics myself and then sweep the extant schedulers > to use them after getting a testcase hammered out. i'd love to have a oneliner solution :-) wrt. signedness: note that in v2 i have made rq_running signed, and most calculations (especially those related to nice) are signed values. (On 64-bit systems this all isnt a big issue - most of the arithmetics gymnastics in CFS are done to keep deltas within 32 bits, so that divisions and multiplications are sane.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
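The 32-bit remark above, shown in isolation: clamp a 64-bit delta to 32 bits before scaling it, so the product stays within 64 bits and the arithmetic stays cheap on 32-bit hosts. The clamp limit and the mult/div parameters are illustrative assumptions, not values from the patch.

#include <stdint.h>

static uint64_t scale_delta(uint64_t delta_ns, uint32_t mult, uint32_t div)
{
	/* a 32-bit delta times a 32-bit multiplier still fits in 64 bits */
	if (delta_ns > UINT32_MAX)
		delta_ns = UINT32_MAX;
	return (delta_ns * mult) / div;
}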
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 10:01 ` Ingo Molnar @ 2007-04-17 11:31 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 11:31 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:57:49AM -0700, William Lee Irwin III wrote: > Interesting! That's what vdls did, except its fundamental data structure > was more like a circular buffer data structure (resembling Davide > Libenzi's timer ring in concept, but with all the details different). > I'm not entirely sure how that would've turned out performancewise if > I'd done any tuning at all. I was mostly interested in doing something > like what I heard Bob Mullens did in 1976 for basic pedagogical value > about schedulers to prepare for writing patches for gang scheduling as > opposed to creating a viable replacement for the mainline scheduler. Con helped me dredge up the vdls bits, so here is the last version I before I got tired of toying with the idea. It's not all that clean, with a fair amount of debug code floating around and a number of idiocies (it seems there was a plot to use a heap somewhere I forgot about entirely, never mind other cruft), but I thought I should at least say something more provable than "there was a patch I never posted." Enjoy! -- wli diff -prauN linux-2.6.0-test11/fs/proc/array.c sched-2.6.0-test11-5/fs/proc/array.c --- linux-2.6.0-test11/fs/proc/array.c 2003-11-26 12:44:26.000000000 -0800 +++ sched-2.6.0-test11-5/fs/proc/array.c 2003-12-17 07:37:11.000000000 -0800 @@ -162,7 +162,7 @@ static inline char * task_state(struct t "Uid:\t%d\t%d\t%d\t%d\n" "Gid:\t%d\t%d\t%d\t%d\n", get_task_state(p), - (p->sleep_avg/1024)*100/(1000000000/1024), + 0UL, /* was ->sleep_avg */ p->tgid, p->pid, p->pid ? p->real_parent->pid : 0, p->pid && p->ptrace ? 
p->parent->pid : 0, @@ -345,7 +345,7 @@ int proc_pid_stat(struct task_struct *ta read_unlock(&tasklist_lock); res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \ %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \ -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n", +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %d %d\n", task->pid, task->comm, state, @@ -390,8 +390,8 @@ int proc_pid_stat(struct task_struct *ta task->cnswap, task->exit_signal, task_cpu(task), - task->rt_priority, - task->policy); + task_prio(task), + task_sched_policy(task)); if(mm) mmput(mm); return res; diff -prauN linux-2.6.0-test11/include/asm-i386/thread_info.h sched-2.6.0-test11-5/include/asm-i386/thread_info.h --- linux-2.6.0-test11/include/asm-i386/thread_info.h 2003-11-26 12:43:06.000000000 -0800 +++ sched-2.6.0-test11-5/include/asm-i386/thread_info.h 2003-12-17 04:55:22.000000000 -0800 @@ -114,6 +114,8 @@ static inline struct thread_info *curren #define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */ #define TIF_IRET 5 /* return with iret */ #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */ +#define TIF_QUEUED 17 +#define TIF_PREEMPT 18 #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME) diff -prauN linux-2.6.0-test11/include/linux/binomial.h sched-2.6.0-test11-5/include/linux/binomial.h --- linux-2.6.0-test11/include/linux/binomial.h 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/include/linux/binomial.h 2003-12-20 15:53:33.000000000 -0800 @@ -0,0 +1,16 @@ +/* + * Simple binomial heaps. + */ + +struct binomial { + unsigned priority, degree; + struct binomial *parent, *child, *sibling; +}; + + +struct binomial *binomial_minimum(struct binomial **); +void binomial_union(struct binomial **, struct binomial **, struct binomial **); +void binomial_insert(struct binomial **, struct binomial *); +struct binomial *binomial_extract_min(struct binomial **); +void binomial_decrease(struct binomial **, struct binomial *, unsigned); +void binomial_delete(struct binomial **, struct binomial *); diff -prauN linux-2.6.0-test11/include/linux/init_task.h sched-2.6.0-test11-5/include/linux/init_task.h --- linux-2.6.0-test11/include/linux/init_task.h 2003-11-26 12:42:58.000000000 -0800 +++ sched-2.6.0-test11-5/include/linux/init_task.h 2003-12-18 05:51:16.000000000 -0800 @@ -56,6 +56,12 @@ .siglock = SPIN_LOCK_UNLOCKED, \ } +#define INIT_SCHED_INFO(info) \ +{ \ + .run_list = LIST_HEAD_INIT((info).run_list), \ + .policy = 1 /* SCHED_POLICY_TS */, \ +} + /* * INIT_TASK is used to set up the first task table, touch at * your own risk!. 
Base=0, limit=0x1fffff (=2MB) @@ -67,14 +73,10 @@ .usage = ATOMIC_INIT(2), \ .flags = 0, \ .lock_depth = -1, \ - .prio = MAX_PRIO-20, \ - .static_prio = MAX_PRIO-20, \ - .policy = SCHED_NORMAL, \ + .sched_info = INIT_SCHED_INFO(tsk.sched_info), \ .cpus_allowed = CPU_MASK_ALL, \ .mm = NULL, \ .active_mm = &init_mm, \ - .run_list = LIST_HEAD_INIT(tsk.run_list), \ - .time_slice = HZ, \ .tasks = LIST_HEAD_INIT(tsk.tasks), \ .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \ .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \ diff -prauN linux-2.6.0-test11/include/linux/sched.h sched-2.6.0-test11-5/include/linux/sched.h --- linux-2.6.0-test11/include/linux/sched.h 2003-11-26 12:42:58.000000000 -0800 +++ sched-2.6.0-test11-5/include/linux/sched.h 2003-12-23 03:47:45.000000000 -0800 @@ -126,6 +126,8 @@ extern unsigned long nr_iowait(void); #define SCHED_NORMAL 0 #define SCHED_FIFO 1 #define SCHED_RR 2 +#define SCHED_BATCH 3 +#define SCHED_IDLE 4 struct sched_param { int sched_priority; @@ -281,10 +283,14 @@ struct signal_struct { #define MAX_USER_RT_PRIO 100 #define MAX_RT_PRIO MAX_USER_RT_PRIO - -#define MAX_PRIO (MAX_RT_PRIO + 40) - -#define rt_task(p) ((p)->prio < MAX_RT_PRIO) +#define NICE_QLEN 128 +#define MIN_TS_PRIO MAX_RT_PRIO +#define MAX_TS_PRIO (40*NICE_QLEN) +#define MIN_BATCH_PRIO (MAX_RT_PRIO + MAX_TS_PRIO) +#define MAX_BATCH_PRIO 100 +#define MAX_PRIO (MIN_BATCH_PRIO + MAX_BATCH_PRIO) +#define USER_PRIO(prio) ((prio) - MAX_RT_PRIO) +#define MAX_USER_PRIO USER_PRIO(MAX_PRIO) /* * Some day this will be a full-fledged user tracking system.. @@ -330,6 +336,36 @@ struct k_itimer { struct io_context; /* See blkdev.h */ void exit_io_context(void); +struct rt_data { + int prio, rt_policy; + unsigned long quantum, ticks; +}; + +/* XXX: do %cpu estimation for ts wakeup levels */ +struct ts_data { + int nice; + unsigned long ticks, frac_cpu; + unsigned long sample_start, sample_ticks; +}; + +struct bt_data { + int prio; + unsigned long ticks; +}; + +union class_data { + struct rt_data rt; + struct ts_data ts; + struct bt_data bt; +}; + +struct sched_info { + int idx; /* queue index, used by all classes */ + unsigned long policy; /* scheduling policy */ + struct list_head run_list; /* list links for priority queues */ + union class_data cl_data; /* class-specific data */ +}; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -339,18 +375,9 @@ struct task_struct { int lock_depth; /* Lock depth */ - int prio, static_prio; - struct list_head run_list; - prio_array_t *array; - - unsigned long sleep_avg; - long interactive_credit; - unsigned long long timestamp; - int activated; + struct sched_info sched_info; - unsigned long policy; cpumask_t cpus_allowed; - unsigned int time_slice, first_time_slice; struct list_head tasks; struct list_head ptrace_children; @@ -391,7 +418,6 @@ struct task_struct { int __user *set_child_tid; /* CLONE_CHILD_SETTID */ int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ - unsigned long rt_priority; unsigned long it_real_value, it_prof_value, it_virt_value; unsigned long it_real_incr, it_prof_incr, it_virt_incr; struct timer_list real_timer; @@ -520,12 +546,14 @@ extern void node_nr_running_init(void); #define node_nr_running_init() {} #endif -extern void set_user_nice(task_t *p, long nice); -extern int task_prio(task_t *p); -extern int task_nice(task_t *p); -extern int task_curr(task_t *p); -extern int idle_cpu(int cpu); - +void set_user_nice(task_t *task, long nice); +int 
task_prio(task_t *task); +int task_nice(task_t *task); +int task_sched_policy(task_t *task); +void set_task_sched_policy(task_t *task, int policy); +int rt_task(task_t *task); +int task_curr(task_t *task); +int idle_cpu(int cpu); void yield(void); /* @@ -844,6 +872,21 @@ static inline int need_resched(void) return unlikely(test_thread_flag(TIF_NEED_RESCHED)); } +static inline void set_task_queued(task_t *task) +{ + set_tsk_thread_flag(task, TIF_QUEUED); +} + +static inline void clear_task_queued(task_t *task) +{ + clear_tsk_thread_flag(task, TIF_QUEUED); +} + +static inline int task_queued(task_t *task) +{ + return test_tsk_thread_flag(task, TIF_QUEUED); +} + extern void __cond_resched(void); static inline void cond_resched(void) { diff -prauN linux-2.6.0-test11/kernel/Makefile sched-2.6.0-test11-5/kernel/Makefile --- linux-2.6.0-test11/kernel/Makefile 2003-11-26 12:43:24.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/Makefile 2003-12-17 03:30:08.000000000 -0800 @@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o capability.o ptrace.o timer.o user.o \ signal.o sys.o kmod.o workqueue.o pid.o \ - rcupdate.o intermodule.o extable.o params.o posix-timers.o + rcupdate.o intermodule.o extable.o params.o posix-timers.o sched/ obj-$(CONFIG_FUTEX) += futex.o obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o diff -prauN linux-2.6.0-test11/kernel/exit.c sched-2.6.0-test11-5/kernel/exit.c --- linux-2.6.0-test11/kernel/exit.c 2003-11-26 12:45:29.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/exit.c 2003-12-17 07:04:02.000000000 -0800 @@ -225,7 +225,7 @@ void reparent_to_init(void) /* Set the exit signal to SIGCHLD so we signal init on exit */ current->exit_signal = SIGCHLD; - if ((current->policy == SCHED_NORMAL) && (task_nice(current) < 0)) + if (task_nice(current) < 0) set_user_nice(current, 0); /* cpus_allowed? */ /* rt_priority? */ diff -prauN linux-2.6.0-test11/kernel/fork.c sched-2.6.0-test11-5/kernel/fork.c --- linux-2.6.0-test11/kernel/fork.c 2003-11-26 12:42:58.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/fork.c 2003-12-23 06:22:59.000000000 -0800 @@ -836,6 +836,9 @@ struct task_struct *copy_process(unsigne atomic_inc(&p->user->__count); atomic_inc(&p->user->processes); + clear_tsk_thread_flag(p, TIF_SIGPENDING); + clear_tsk_thread_flag(p, TIF_QUEUED); + /* * If multiple threads are within copy_process(), then this check * triggers too late. 
This doesn't hurt, the check is only there @@ -861,13 +864,21 @@ struct task_struct *copy_process(unsigne p->state = TASK_UNINTERRUPTIBLE; copy_flags(clone_flags, p); - if (clone_flags & CLONE_IDLETASK) + if (clone_flags & CLONE_IDLETASK) { p->pid = 0; - else { + set_task_sched_policy(p, SCHED_IDLE); + } else { + if (task_sched_policy(p) == SCHED_IDLE) { + memset(&p->sched_info, 0, sizeof(struct sched_info)); + set_task_sched_policy(p, SCHED_NORMAL); + set_user_nice(p, 0); + } p->pid = alloc_pidmap(); if (p->pid == -1) goto bad_fork_cleanup; } + if (p->pid == 1) + BUG_ON(task_nice(p)); retval = -EFAULT; if (clone_flags & CLONE_PARENT_SETTID) if (put_user(p->pid, parent_tidptr)) @@ -875,8 +886,7 @@ struct task_struct *copy_process(unsigne p->proc_dentry = NULL; - INIT_LIST_HEAD(&p->run_list); - + INIT_LIST_HEAD(&p->sched_info.run_list); INIT_LIST_HEAD(&p->children); INIT_LIST_HEAD(&p->sibling); INIT_LIST_HEAD(&p->posix_timers); @@ -885,8 +895,6 @@ struct task_struct *copy_process(unsigne spin_lock_init(&p->alloc_lock); spin_lock_init(&p->switch_lock); spin_lock_init(&p->proc_lock); - - clear_tsk_thread_flag(p, TIF_SIGPENDING); init_sigpending(&p->pending); p->it_real_value = p->it_virt_value = p->it_prof_value = 0; @@ -898,7 +906,6 @@ struct task_struct *copy_process(unsigne p->tty_old_pgrp = 0; p->utime = p->stime = 0; p->cutime = p->cstime = 0; - p->array = NULL; p->lock_depth = -1; /* -1 = no lock */ p->start_time = get_jiffies_64(); p->security = NULL; @@ -948,33 +955,6 @@ struct task_struct *copy_process(unsigne p->pdeath_signal = 0; /* - * Share the timeslice between parent and child, thus the - * total amount of pending timeslices in the system doesn't change, - * resulting in more scheduling fairness. - */ - local_irq_disable(); - p->time_slice = (current->time_slice + 1) >> 1; - /* - * The remainder of the first timeslice might be recovered by - * the parent if the child exits early enough. - */ - p->first_time_slice = 1; - current->time_slice >>= 1; - p->timestamp = sched_clock(); - if (!current->time_slice) { - /* - * This case is rare, it happens when the parent has only - * a single jiffy left from its timeslice. Taking the - * runqueue lock is not a problem. - */ - current->time_slice = 1; - preempt_disable(); - scheduler_tick(0, 0); - local_irq_enable(); - preempt_enable(); - } else - local_irq_enable(); - /* * Ok, add it to the run-queues and make it * visible to the rest of the system. 
* diff -prauN linux-2.6.0-test11/kernel/sched/Makefile sched-2.6.0-test11-5/kernel/sched/Makefile --- linux-2.6.0-test11/kernel/sched/Makefile 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/Makefile 2003-12-17 03:32:21.000000000 -0800 @@ -0,0 +1 @@ +obj-y = util.o ts.o idle.o rt.o batch.o diff -prauN linux-2.6.0-test11/kernel/sched/batch.c sched-2.6.0-test11-5/kernel/sched/batch.c --- linux-2.6.0-test11/kernel/sched/batch.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/batch.c 2003-12-19 21:32:49.000000000 -0800 @@ -0,0 +1,190 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +struct batch_queue { + int base, tasks; + task_t *curr; + unsigned long bitmap[BITS_TO_LONGS(MAX_BATCH_PRIO)]; + struct list_head queue[MAX_BATCH_PRIO]; +}; + +static int batch_quantum = 1024; +static DEFINE_PER_CPU(struct batch_queue, batch_queues); + +static int batch_init(struct policy *policy, int cpu) +{ + int k; + struct batch_queue *queue = &per_cpu(batch_queues, cpu); + + policy->queue = (struct queue *)queue; + for (k = 0; k < MAX_BATCH_PRIO; ++k) + INIT_LIST_HEAD(&queue->queue[k]); + return 0; +} + +static int batch_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + + cpustat->nice += user_ticks; + cpustat->system += sys_ticks; + + task->sched_info.cl_data.bt.ticks--; + if (!task->sched_info.cl_data.bt.ticks) { + int new_idx; + + task->sched_info.cl_data.bt.ticks = batch_quantum; + new_idx = (task->sched_info.idx + task->sched_info.cl_data.bt.prio) + % MAX_BATCH_PRIO; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); + list_move_tail(&task->sched_info.run_list, + &queue->queue[new_idx]); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + task->sched_info.idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + set_need_resched(); + } + return 0; +} + +static void batch_yield(struct queue *__queue, task_t *task) +{ + int new_idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + new_idx = (queue->base + MAX_BATCH_PRIO - 1) % MAX_BATCH_PRIO; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); + list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + task->sched_info.idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + set_need_resched(); +} + +static task_t *batch_curr(struct queue *__queue) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + return queue->curr; +} + +static void batch_set_curr(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + queue->curr = task; +} + +static task_t *batch_best(struct queue *__queue) +{ + int idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + idx = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + BUG_ON(idx >= MAX_BATCH_PRIO); + BUG_ON(list_empty(&queue->queue[idx])); + return list_entry(queue->queue[idx].next, task_t, sched_info.run_list); +} + +static void batch_enqueue(struct queue *__queue, task_t *task) +{ + int 
idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + idx = (queue->base + task->sched_info.cl_data.bt.prio) % MAX_BATCH_PRIO; + if (!test_bit(idx, queue->bitmap)) + __set_bit(idx, queue->bitmap); + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); + task->sched_info.idx = idx; + task->sched_info.cl_data.bt.ticks = batch_quantum; + queue->tasks++; + if (!queue->curr) + queue->curr = task; +} + +static void batch_dequeue(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + queue->tasks--; + if (!queue->tasks) + queue->curr = NULL; + else if (task == queue->curr) + queue->curr = batch_best(__queue); +} + +static int batch_preempt(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + if (!queue->curr) + return 1; + else + return task->sched_info.cl_data.bt.prio + < queue->curr->sched_info.cl_data.bt.prio; +} + +static int batch_tasks(struct queue *__queue) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + return queue->tasks; +} + +static int batch_nice(struct queue *queue, task_t *task) +{ + return 20; +} + +static int batch_prio(task_t *task) +{ + return USER_PRIO(task->sched_info.cl_data.bt.prio + MIN_BATCH_PRIO); +} + +static void batch_setprio(task_t *task, int prio) +{ + BUG_ON(prio < 0); + BUG_ON(prio >= MAX_BATCH_PRIO); + task->sched_info.cl_data.bt.prio = prio; +} + +struct queue_ops batch_ops = { + .init = batch_init, + .fini = nop_fini, + .tick = batch_tick, + .yield = batch_yield, + .curr = batch_curr, + .set_curr = batch_set_curr, + .tasks = batch_tasks, + .best = batch_best, + .enqueue = batch_enqueue, + .dequeue = batch_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = batch_preempt, + .nice = batch_nice, + .renice = nop_renice, + .prio = batch_prio, + .setprio = batch_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy batch_policy = { + .ops = &batch_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/idle.c sched-2.6.0-test11-5/kernel/sched/idle.c --- linux-2.6.0-test11/kernel/sched/idle.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/idle.c 2003-12-19 17:31:39.000000000 -0800 @@ -0,0 +1,99 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +static DEFINE_PER_CPU(task_t *, idle_tasks) = NULL; + +static int idle_nice(struct queue *queue, task_t *task) +{ + return 20; +} + +static int idle_tasks(struct queue *queue) +{ + task_t **idle = (task_t **)queue; + return !!(*idle); +} + +static task_t *idle_task(struct queue *queue) +{ + return *((task_t **)queue); +} + +static void idle_yield(struct queue *queue, task_t *task) +{ + set_need_resched(); +} + +static void idle_enqueue(struct queue *queue, task_t *task) +{ + task_t **idle = (task_t **)queue; + *idle = task; +} + +static void idle_dequeue(struct queue *queue, task_t *task) +{ +} + +static int idle_preempt(struct queue *queue, task_t *task) +{ + return 0; +} + +static int idle_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + runqueue_t *rq = &per_cpu(runqueues, smp_processor_id()); + + if 
(atomic_read(&rq->nr_iowait) > 0) + cpustat->iowait += sys_ticks; + else + cpustat->idle += sys_ticks; + return 1; +} + +static int idle_init(struct policy *policy, int cpu) +{ + policy->queue = (struct queue *)&per_cpu(idle_tasks, cpu); + return 0; +} + +static int idle_prio(task_t *task) +{ + return MAX_USER_PRIO; +} + +static void idle_setprio(task_t *task, int prio) +{ +} + +static struct queue_ops idle_ops = { + .init = idle_init, + .fini = nop_fini, + .tick = idle_tick, + .yield = idle_yield, + .curr = idle_task, + .set_curr = queue_nop, + .tasks = idle_tasks, + .best = idle_task, + .enqueue = idle_enqueue, + .dequeue = idle_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = idle_preempt, + .nice = idle_nice, + .renice = nop_renice, + .prio = idle_prio, + .setprio = idle_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy idle_policy = { + .ops = &idle_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/queue.h sched-2.6.0-test11-5/kernel/sched/queue.h --- linux-2.6.0-test11/kernel/sched/queue.h 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/queue.h 2003-12-23 03:58:02.000000000 -0800 @@ -0,0 +1,104 @@ +#define SCHED_POLICY_RT 0 +#define SCHED_POLICY_TS 1 +#define SCHED_POLICY_BATCH 2 +#define SCHED_POLICY_IDLE 3 + +#define RT_POLICY_FIFO 0 +#define RT_POLICY_RR 1 + +#define NODE_THRESHOLD 125 + +struct queue; +struct queue_ops; + +struct policy { + struct queue *queue; + struct queue_ops *ops; +}; + +extern struct policy rt_policy, ts_policy, batch_policy, idle_policy; + +struct runqueue { + spinlock_t lock; + int curr; + task_t *__curr; + unsigned long policy_bitmap; + struct policy *policies[BITS_PER_LONG]; + unsigned long nr_running, nr_switches, nr_uninterruptible; + struct mm_struct *prev_mm; + int prev_cpu_load[NR_CPUS]; +#ifdef CONFIG_NUMA + atomic_t *node_nr_running; + int prev_node_load[MAX_NUMNODES]; +#endif + task_t *migration_thread; + struct list_head migration_queue; + + atomic_t nr_iowait; +}; + +typedef struct runqueue runqueue_t; + +struct queue_ops { + int (*init)(struct policy *, int); + void (*fini)(struct policy *, int); + task_t *(*curr)(struct queue *); + void (*set_curr)(struct queue *, task_t *); + task_t *(*best)(struct queue *); + int (*tick)(struct queue *, task_t *, int, int); + int (*tasks)(struct queue *); + void (*enqueue)(struct queue *, task_t *); + void (*dequeue)(struct queue *, task_t *); + void (*start_wait)(struct queue *, task_t *); + void (*stop_wait)(struct queue *, task_t *); + void (*sleep)(struct queue *, task_t *); + void (*wake)(struct queue *, task_t *); + int (*preempt)(struct queue *, task_t *); + void (*yield)(struct queue *, task_t *); + int (*prio)(task_t *); + void (*setprio)(task_t *, int); + int (*nice)(struct queue *, task_t *); + void (*renice)(struct queue *, task_t *, int); + unsigned long (*timeslice)(struct queue *, task_t *); + void (*set_timeslice)(struct queue *, task_t *, unsigned long); +}; + +DECLARE_PER_CPU(runqueue_t, runqueues); + +int find_first_circular_bit(unsigned long *, int, int); +void queue_nop(struct queue *, task_t *); +void nop_renice(struct queue *, task_t *, int); +void nop_fini(struct policy *, int); +unsigned long nop_timeslice(struct queue *, task_t *); +void nop_set_timeslice(struct queue *, task_t *, unsigned long); + +/* #define DEBUG_SCHED */ + +#ifdef DEBUG_SCHED +#define __check_task_policy(idx) \ +do { \ + unsigned long __idx__ = (idx); \ + if (__idx__ 
> SCHED_POLICY_IDLE) { \ + printk("invalid policy 0x%lx\n", __idx__); \ + BUG(); \ + } \ +} while (0) + +#define check_task_policy(task) \ +do { \ + __check_task_policy((task)->sched_info.policy); \ +} while (0) + +#define check_policy(policy) \ +do { \ + BUG_ON((policy) != &rt_policy && \ + (policy) != &ts_policy && \ + (policy) != &batch_policy && \ + (policy) != &idle_policy); \ +} while (0) + +#else /* !DEBUG_SCHED */ +#define __check_task_policy(idx) do { } while (0) +#define check_task_policy(task) do { } while (0) +#define check_policy(policy) do { } while (0) +#endif /* !DEBUG_SCHED */ diff -prauN linux-2.6.0-test11/kernel/sched/rt.c sched-2.6.0-test11-5/kernel/sched/rt.c --- linux-2.6.0-test11/kernel/sched/rt.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/rt.c 2003-12-19 18:16:07.000000000 -0800 @@ -0,0 +1,208 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +#ifdef DEBUG_SCHED +#define check_rt_policy(task) \ +do { \ + BUG_ON((task)->sched_info.policy != SCHED_POLICY_RT); \ + BUG_ON((task)->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR \ + && \ + (task)->sched_info.cl_data.rt.rt_policy!=RT_POLICY_FIFO); \ + BUG_ON((task)->sched_info.cl_data.rt.prio < 0); \ + BUG_ON((task)->sched_info.cl_data.rt.prio >= MAX_RT_PRIO); \ +} while (0) +#else +#define check_rt_policy(task) do { } while (0) +#endif + +struct rt_queue { + unsigned long bitmap[BITS_TO_LONGS(MAX_RT_PRIO)]; + struct list_head queue[MAX_RT_PRIO]; + task_t *curr; + int tasks; +}; + +static DEFINE_PER_CPU(struct rt_queue, rt_queues); + +static int rt_init(struct policy *policy, int cpu) +{ + int k; + struct rt_queue *queue = &per_cpu(rt_queues, cpu); + + policy->queue = (struct queue *)queue; + for (k = 0; k < MAX_RT_PRIO; ++k) + INIT_LIST_HEAD(&queue->queue[k]); + return 0; +} + +static void rt_yield(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio])) + set_need_resched(); + list_add_tail(&task->sched_info.run_list, + &queue->queue[task->sched_info.cl_data.rt.prio]); + check_rt_policy(task); +} + +static int rt_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + check_rt_policy(task); + cpustat->user += user_ticks; + cpustat->system += sys_ticks; + if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) { + task->sched_info.cl_data.rt.ticks--; + if (!task->sched_info.cl_data.rt.ticks) { + task->sched_info.cl_data.rt.ticks = + task->sched_info.cl_data.rt.quantum; + rt_yield(queue, task); + } + } + check_rt_policy(task); + return 0; +} + +static task_t *rt_curr(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + task_t *task = queue->curr; + check_rt_policy(task); + return task; +} + +static void rt_set_curr(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + queue->curr = task; + check_rt_policy(task); +} + +static task_t *rt_best(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + task_t *task; + int idx; + idx = find_first_bit(queue->bitmap, MAX_RT_PRIO); + BUG_ON(idx >= MAX_RT_PRIO); + task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list); + check_rt_policy(task); + return task; +} + +static void 
rt_enqueue(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + if (!test_bit(task->sched_info.cl_data.rt.prio, queue->bitmap)) + __set_bit(task->sched_info.cl_data.rt.prio, queue->bitmap); + list_add_tail(&task->sched_info.run_list, + &queue->queue[task->sched_info.cl_data.rt.prio]); + check_rt_policy(task); + queue->tasks++; + if (!queue->curr) + queue->curr = task; +} + +static void rt_dequeue(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio])) + __clear_bit(task->sched_info.cl_data.rt.prio, queue->bitmap); + queue->tasks--; + check_rt_policy(task); + if (!queue->tasks) + queue->curr = NULL; + else if (task == queue->curr) + queue->curr = rt_best(__queue); +} + +static int rt_preempt(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + if (!queue->curr) + return 1; + check_rt_policy(queue->curr); + return task->sched_info.cl_data.rt.prio + < queue->curr->sched_info.cl_data.rt.prio; +} + +static int rt_tasks(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + return queue->tasks; +} + +static int rt_nice(struct queue *queue, task_t *task) +{ + check_rt_policy(task); + return -20; +} + +static unsigned long rt_timeslice(struct queue *queue, task_t *task) +{ + check_rt_policy(task); + if (task->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR) + return 0; + else + return task->sched_info.cl_data.rt.quantum; +} + +static void rt_set_timeslice(struct queue *queue, task_t *task, unsigned long n) +{ + check_rt_policy(task); + if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) + task->sched_info.cl_data.rt.quantum = n; + check_rt_policy(task); +} + +static void rt_setprio(task_t *task, int prio) +{ + check_rt_policy(task); + BUG_ON(prio < 0); + BUG_ON(prio >= MAX_RT_PRIO); + task->sched_info.cl_data.rt.prio = prio; +} + +static int rt_prio(task_t *task) +{ + check_rt_policy(task); + return USER_PRIO(task->sched_info.cl_data.rt.prio); +} + +static struct queue_ops rt_ops = { + .init = rt_init, + .fini = nop_fini, + .tick = rt_tick, + .yield = rt_yield, + .curr = rt_curr, + .set_curr = rt_set_curr, + .tasks = rt_tasks, + .best = rt_best, + .enqueue = rt_enqueue, + .dequeue = rt_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = rt_preempt, + .nice = rt_nice, + .renice = nop_renice, + .prio = rt_prio, + .setprio = rt_setprio, + .timeslice = rt_timeslice, + .set_timeslice = rt_set_timeslice, +}; + +struct policy rt_policy = { + .ops = &rt_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/ts.c sched-2.6.0-test11-5/kernel/sched/ts.c --- linux-2.6.0-test11/kernel/sched/ts.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/ts.c 2003-12-23 08:24:55.000000000 -0800 @@ -0,0 +1,841 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +#ifdef DEBUG_SCHED +#define check_ts_policy(task) \ +do { \ + BUG_ON((task)->sched_info.policy != SCHED_POLICY_TS); \ +} while (0) + +#define check_nice(__queue__) \ +({ \ + int __k__, __count__ = 0; \ + if ((__queue__)->tasks < 0) { \ + printk("negative nice task count %d\n", \ + (__queue__)->tasks); \ + BUG(); \ + } \ + for (__k__ = 
0; __k__ < NICE_QLEN; ++__k__) { \ + task_t *__task__; \ + if (list_empty(&(__queue__)->queue[__k__])) { \ + if (test_bit(__k__, (__queue__)->bitmap)) { \ + printk("wrong nice bit set\n"); \ + BUG(); \ + } \ + } else { \ + if (!test_bit(__k__, (__queue__)->bitmap)) { \ + printk("wrong nice bit clear\n"); \ + BUG(); \ + } \ + } \ + list_for_each_entry(__task__, \ + &(__queue__)->queue[__k__], \ + sched_info.run_list) { \ + check_ts_policy(__task__); \ + if (__task__->sched_info.idx != __k__) { \ + printk("nice index mismatch\n"); \ + BUG(); \ + } \ + ++__count__; \ + } \ + } \ + if ((__queue__)->tasks != __count__) { \ + printk("wrong nice task count\n"); \ + printk("expected %d, got %d\n", \ + (__queue__)->tasks, \ + __count__); \ + BUG(); \ + } \ + __count__; \ +}) + +#define check_queue(__queue) \ +do { \ + int __k, __count = 0; \ + if ((__queue)->tasks < 0) { \ + printk("negative queue task count %d\n", \ + (__queue)->tasks); \ + BUG(); \ + } \ + for (__k = 0; __k < 40; ++__k) { \ + struct nice_queue *__nice; \ + if (list_empty(&(__queue)->nices[__k])) { \ + if (test_bit(__k, (__queue)->bitmap)) { \ + printk("wrong queue bit set\n"); \ + BUG(); \ + } \ + } else { \ + if (!test_bit(__k, (__queue)->bitmap)) { \ + printk("wrong queue bit clear\n"); \ + BUG(); \ + } \ + } \ + list_for_each_entry(__nice, \ + &(__queue)->nices[__k], \ + list) { \ + __count += check_nice(__nice); \ + if (__nice->idx != __k) { \ + printk("queue index mismatch\n"); \ + BUG(); \ + } \ + } \ + } \ + if ((__queue)->tasks != __count) { \ + printk("wrong queue task count\n"); \ + printk("expected %d, got %d\n", \ + (__queue)->tasks, \ + __count); \ + BUG(); \ + } \ +} while (0) + +#else /* !DEBUG_SCHED */ +#define check_ts_policy(task) do { } while (0) +#define check_nice(nice) do { } while (0) +#define check_queue(queue) do { } while (0) +#endif + +/* + * Hybrid deadline/multilevel scheduling. Cpu utilization + * -dependent deadlines at wake. Queue rotation every 50ms or when + * demotions empty the highest level, setting demoted deadlines + * relative to the new highest level. Intra-level RR quantum at 10ms. + */ +struct nice_queue { + int idx, nice, base, tasks, level_quantum, expired; + unsigned long bitmap[BITS_TO_LONGS(NICE_QLEN)]; + struct list_head list, queue[NICE_QLEN]; + task_t *curr; +}; + +/* + * Deadline schedule nice levels with priority-dependent deadlines, + * default quantum of 100ms. Queue rotates at demotions emptying the + * highest level, setting the demoted deadline relative to the new + * highest level. + */ +struct ts_queue { + struct nice_queue nice_levels[40]; + struct list_head nices[40]; + int base, quantum, tasks; + unsigned long bitmap[BITS_TO_LONGS(40)]; + struct nice_queue *curr; +}; + +/* + * Make these sysctl-tunable. 
+ */ +static int nice_quantum = 100; +static int rr_quantum = 10; +static int level_quantum = 50; +static int sample_interval = HZ; + +static DEFINE_PER_CPU(struct ts_queue, ts_queues); + +static task_t *nice_best(struct nice_queue *); +static struct nice_queue *ts_best_nice(struct ts_queue *); + +static void nice_init(struct nice_queue *queue) +{ + int k; + + INIT_LIST_HEAD(&queue->list); + for (k = 0; k < NICE_QLEN; ++k) { + INIT_LIST_HEAD(&queue->queue[k]); + } +} + +static int ts_init(struct policy *policy, int cpu) +{ + int k; + struct ts_queue *queue = &per_cpu(ts_queues, cpu); + + policy->queue = (struct queue *)queue; + queue->quantum = nice_quantum; + + for (k = 0; k < 40; ++k) { + nice_init(&queue->nice_levels[k]); + queue->nice_levels[k].nice = k; + INIT_LIST_HEAD(&queue->nices[k]); + } + return 0; +} + +static int task_deadline(task_t *task) +{ + u64 frac_cpu = task->sched_info.cl_data.ts.frac_cpu; + frac_cpu *= (u64)NICE_QLEN; + frac_cpu >>= 32; + return (int)min((u32)(NICE_QLEN - 1), (u32)frac_cpu); +} + +static void nice_rotate_queue(struct nice_queue *queue) +{ + int idx, new_idx, deadline, idxdiff; + task_t *task = queue->curr; + + check_nice(queue); + + /* shit what if idxdiff == NICE_QLEN - 1?? */ + idx = queue->curr->sched_info.idx; + idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN; + deadline = min(1 + task_deadline(task), NICE_QLEN - idxdiff - 1); + new_idx = (idx + deadline) % NICE_QLEN; +#if 0 + if (idx == new_idx) { + /* + * buggy; it sets queue->base = idx because in this case + * we have task_deadline(task) == 0 + */ + new_idx = (idx - task_deadline(task) + NICE_QLEN) % NICE_QLEN; + if (queue->base != new_idx) + queue->base = new_idx; + return; + } + BUG_ON(!deadline); + BUG_ON(queue->base <= new_idx && new_idx <= idx); + BUG_ON(idx < queue->base && queue->base <= new_idx); + BUG_ON(new_idx <= idx && idx < queue->base); + if (0 && idx == new_idx) { + printk("FUCKUP: pid = %d, tdl = %d, dl = %d, idx = %d, " + "base = %d, diff = %d, fcpu = 0x%lx\n", + queue->curr->pid, + task_deadline(queue->curr), + deadline, + idx, + queue->base, + idxdiff, + task->sched_info.cl_data.ts.frac_cpu); + BUG(); + } +#else + /* + * RR in the last deadline + * special-cased so as not to trip BUG_ON()'s below + */ + if (idx == new_idx) { + /* if we got here these two things must hold */ + BUG_ON(idxdiff != NICE_QLEN - 1); + BUG_ON(deadline); + list_move_tail(&task->sched_info.run_list, &queue->queue[idx]); + if (queue->expired) { + queue->level_quantum = level_quantum; + queue->expired = 0; + } + return; + } +#endif + task->sched_info.idx = new_idx; + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&task->sched_info.run_list, + &queue->queue[new_idx]); + + /* expired until list drains */ + if (!list_empty(&queue->queue[idx])) + queue->expired = 1; + else { + int k, w, m = NICE_QLEN % BITS_PER_LONG; + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + + for (w = 0, k = 0; k < NICE_QLEN/BITS_PER_LONG; ++k) + w += hweight_long(queue->bitmap[k]); + if (NICE_QLEN % BITS_PER_LONG) + w += hweight_long(queue->bitmap[k] & ((1UL << m) - 1)); + if (w > 1) + queue->base = (queue->base + 1) % NICE_QLEN; + queue->level_quantum = level_quantum; + queue->expired = 0; + } + check_nice(queue); +} + +static void nice_tick(struct nice_queue *queue, task_t *task) +{ + int idx = task->sched_info.idx; + BUG_ON(!task_queued(task)); + BUG_ON(task != queue->curr); + 
BUG_ON(!test_bit(idx, queue->bitmap)); + BUG_ON(list_empty(&queue->queue[idx])); + check_ts_policy(task); + check_nice(queue); + + if (task->sched_info.cl_data.ts.ticks) + task->sched_info.cl_data.ts.ticks--; + + if (queue->level_quantum > level_quantum) { + WARN_ON(1); + queue->level_quantum = 1; + } + + if (!queue->expired) { + if (queue->level_quantum) + queue->level_quantum--; + } else if (0 && queue->queue[idx].prev != &task->sched_info.run_list) { + int queued = 0, new_idx = (queue->base + 1) % NICE_QLEN; + task_t *curr, *sav; + task_t *victim = list_entry(queue->queue[idx].prev, + task_t, + sched_info.run_list); + victim->sched_info.idx = new_idx; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); +#if 1 + list_for_each_entry_safe(curr, sav, &queue->queue[new_idx], sched_info.run_list) { + if (victim->sched_info.cl_data.ts.frac_cpu + < curr->sched_info.cl_data.ts.frac_cpu) { + queued = 1; + list_move(&victim->sched_info.run_list, + curr->sched_info.run_list.prev); + break; + } + } + if (!queued) + list_move_tail(&victim->sched_info.run_list, + &queue->queue[new_idx]); +#else + list_move(&victim->sched_info.run_list, &queue->queue[new_idx]); +#endif + BUG_ON(list_empty(&queue->queue[idx])); + } + + if (!queue->level_quantum && !queue->expired) { + check_nice(queue); + nice_rotate_queue(queue); + check_nice(queue); + set_need_resched(); + } else if (!task->sched_info.cl_data.ts.ticks) { + int idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN; + check_nice(queue); + task->sched_info.cl_data.ts.ticks = rr_quantum; + BUG_ON(!test_bit(idx, queue->bitmap)); + BUG_ON(list_empty(&queue->queue[idx])); + if (queue->expired) + nice_rotate_queue(queue); + else if (idxdiff == NICE_QLEN - 1) + list_move_tail(&task->sched_info.run_list, + &queue->queue[idx]); + else { + int new_idx = (idx + 1) % NICE_QLEN; + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + task->sched_info.idx = new_idx; + list_add(&task->sched_info.run_list, + &queue->queue[new_idx]); + } + check_nice(queue); + set_need_resched(); + } + check_nice(queue); + check_ts_policy(task); +} + +static void ts_rotate_queue(struct ts_queue *queue) +{ + int idx, new_idx, idxdiff, off, deadline; + + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + + /* shit what if idxdiff == 39?? 
*/ + check_queue(queue); + idx = queue->curr->idx; + idxdiff = (idx - queue->base + 40) % 40; + off = (int)(queue->curr - queue->nice_levels); + deadline = min(1 + off, 40 - idxdiff - 1); + new_idx = (idx + deadline) % 40; + if (idx == new_idx) { + new_idx = (idx - off + 40) % 40; + if (queue->base != new_idx) + queue->base = new_idx; + return; + } + BUG_ON(!deadline); + BUG_ON(queue->base <= new_idx && new_idx <= idx); + BUG_ON(idx < queue->base && queue->base <= new_idx); + BUG_ON(new_idx <= idx && idx < queue->base); + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&queue->curr->list, &queue->nices[new_idx]); + queue->curr->idx = new_idx; + + if (list_empty(&queue->nices[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + queue->base = (queue->base + 1) % 40; + } + check_queue(queue); +} + +static int ts_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + int nice_idx = (int)(queue->curr - queue->nice_levels); + unsigned long sample_end, delta; + + check_queue(queue); + check_ts_policy(task); + BUG_ON(!nice); + BUG_ON(nice_idx != task->sched_info.cl_data.ts.nice); + BUG_ON(!test_bit(nice->idx, queue->bitmap)); + BUG_ON(list_empty(&queue->nices[nice->idx])); + + sample_end = jiffies; + delta = sample_end - task->sched_info.cl_data.ts.sample_start; + if (delta) + task->sched_info.cl_data.ts.sample_ticks++; + else { + task->sched_info.cl_data.ts.sample_start = jiffies; + task->sched_info.cl_data.ts.sample_ticks = 1; + } + + if (delta >= sample_interval) { + u64 frac_cpu; + frac_cpu = (u64)task->sched_info.cl_data.ts.sample_ticks << 32; + do_div(frac_cpu, delta); + frac_cpu = 2*frac_cpu + task->sched_info.cl_data.ts.frac_cpu; + do_div(frac_cpu, 3); + frac_cpu = min(frac_cpu, (1ULL << 32) - 1); + task->sched_info.cl_data.ts.frac_cpu = (unsigned long)frac_cpu; + task->sched_info.cl_data.ts.sample_start = sample_end; + task->sched_info.cl_data.ts.sample_ticks = 0; + } + + cpustat->user += user_ticks; + cpustat->system += sys_ticks; + nice_tick(nice, task); + if (queue->quantum > nice_quantum) { + queue->quantum = 0; + WARN_ON(1); + } else if (queue->quantum) + queue->quantum--; + if (!queue->quantum) { + queue->quantum = nice_quantum; + ts_rotate_queue(queue); + set_need_resched(); + } + check_queue(queue); + check_ts_policy(task); + return 0; +} + +static void nice_yield(struct nice_queue *queue, task_t *task) +{ + int idx, new_idx = (queue->base + NICE_QLEN - 1) % NICE_QLEN; + + check_nice(queue); + check_ts_policy(task); + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]); + idx = task->sched_info.idx; + task->sched_info.idx = new_idx; + set_need_resched(); + + if (list_empty(&queue->queue[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + queue->curr = nice_best(queue); +#if 0 + if (queue->curr->sched_info.idx != queue->base) + queue->base = queue->curr->sched_info.idx; +#endif + check_nice(queue); + check_ts_policy(task); +} + +/* + * This is somewhat problematic; nice_yield() only parks tasks on + * the end of their current nice levels. 
+ */ +static void ts_yield(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + + check_queue(queue); + check_ts_policy(task); + nice_yield(nice, task); + + /* + * If there's no one to yield to, move the whole nice level. + * If this is problematic, setting nice-dependent deadlines + * on a single unified queue may be in order. + */ + if (nice->tasks == 1) { + int idx, new_idx = (queue->base + 40 - 1) % 40; + idx = nice->idx; + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&nice->list, &queue->nices[new_idx]); + if (list_empty(&queue->nices[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + nice->idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + 40); + BUG_ON(queue->base >= 40); + BUG_ON(!test_bit(queue->base, queue->bitmap)); + queue->curr = ts_best_nice(queue); + } + check_queue(queue); + check_ts_policy(task); +} + +static task_t *ts_curr(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + task_t *task = queue->curr->curr; + check_queue(queue); + if (task) + check_ts_policy(task); + return task; +} + +static void ts_set_curr(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + queue->curr = nice; + nice->curr = task; + check_queue(queue); + check_ts_policy(task); +} + +static task_t *nice_best(struct nice_queue *queue) +{ + task_t *task; + int idx = find_first_circular_bit(queue->bitmap, + queue->base, + NICE_QLEN); + check_nice(queue); + if (idx >= NICE_QLEN) + return NULL; + BUG_ON(list_empty(&queue->queue[idx])); + BUG_ON(!test_bit(idx, queue->bitmap)); + task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list); + check_nice(queue); + check_ts_policy(task); + return task; +} + +static struct nice_queue *ts_best_nice(struct ts_queue *queue) +{ + int idx = find_first_circular_bit(queue->bitmap, queue->base, 40); + check_queue(queue); + if (idx >= 40) + return NULL; + BUG_ON(list_empty(&queue->nices[idx])); + BUG_ON(!test_bit(idx, queue->bitmap)); + return list_entry(queue->nices[idx].next, struct nice_queue, list); +} + +static task_t *ts_best(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = ts_best_nice(queue); + return nice ? 
nice_best(nice) : NULL; +} + +static void nice_enqueue(struct nice_queue *queue, task_t *task) +{ + task_t *curr, *sav; + int queued = 0, idx, deadline, base, idxdiff; + check_nice(queue); + check_ts_policy(task); + + /* don't livelock when queue->expired */ + deadline = min(!!queue->expired + task_deadline(task), NICE_QLEN - 1); + idx = (queue->base + deadline) % NICE_QLEN; + + if (!test_bit(idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[idx])); + __set_bit(idx, queue->bitmap); + } + +#if 1 + /* keep nice level's queue sorted -- use binomial heaps here soon */ + list_for_each_entry_safe(curr, sav, &queue->queue[idx], sched_info.run_list) { + if (task->sched_info.cl_data.ts.frac_cpu + >= curr->sched_info.cl_data.ts.frac_cpu) { + list_add(&task->sched_info.run_list, + curr->sched_info.run_list.prev); + queued = 1; + break; + } + } + if (!queued) + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); +#else + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); +#endif + task->sched_info.idx = idx; + /* if (!task->sched_info.cl_data.ts.ticks) */ + task->sched_info.cl_data.ts.ticks = rr_quantum; + + if (queue->tasks) + BUG_ON(!queue->curr); + else { + BUG_ON(queue->curr); + queue->curr = task; + } + queue->tasks++; + check_nice(queue); + check_ts_policy(task); +} + +static void ts_enqueue(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + if (!nice->tasks) { + int idx = (queue->base + task->sched_info.cl_data.ts.nice) % 40; + if (!test_bit(idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[idx])); + __set_bit(idx, queue->bitmap); + } + list_add_tail(&nice->list, &queue->nices[idx]); + nice->idx = idx; + if (!queue->curr) + queue->curr = nice; + } + nice_enqueue(nice, task); + queue->tasks++; + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + check_queue(queue); + check_ts_policy(task); +} + +static void nice_dequeue(struct nice_queue *queue, task_t *task) +{ + check_nice(queue); + check_ts_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.idx])) { + BUG_ON(!test_bit(task->sched_info.idx, queue->bitmap)); + __clear_bit(task->sched_info.idx, queue->bitmap); + } + queue->tasks--; + if (task == queue->curr) { + queue->curr = nice_best(queue); +#if 0 + if (queue->curr) + queue->base = queue->curr->sched_info.idx; +#endif + } + check_nice(queue); + check_ts_policy(task); +} + +static void ts_dequeue(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + + BUG_ON(!queue->tasks); + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + + nice_dequeue(nice, task); + queue->tasks--; + if (!nice->tasks) { + list_del_init(&nice->list); + if (list_empty(&queue->nices[nice->idx])) { + BUG_ON(!test_bit(nice->idx, queue->bitmap)); + __clear_bit(nice->idx, queue->bitmap); + } + if (nice == queue->curr) + queue->curr = ts_best_nice(queue); + } + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + if (queue->base >= 40) + queue->base = 0; + check_queue(queue); + check_ts_policy(task); +} + +static int ts_tasks(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + check_queue(queue); + return queue->tasks; +} + +static int ts_nice(struct queue 
*__queue, task_t *task) +{ + int nice = task->sched_info.cl_data.ts.nice - 20; + check_ts_policy(task); + BUG_ON(nice < -20); + BUG_ON(nice >= 20); + return nice; +} + +static void ts_renice(struct queue *queue, task_t *task, int nice) +{ + check_queue((struct ts_queue *)queue); + check_ts_policy(task); + BUG_ON(nice < -20); + BUG_ON(nice >= 20); + task->sched_info.cl_data.ts.nice = nice + 20; + check_queue((struct ts_queue *)queue); +} + +static int nice_task_prio(struct nice_queue *nice, task_t *task) +{ + if (!task_queued(task)) + return task_deadline(task); + else { + int prio = task->sched_info.idx - nice->base; + return prio < 0 ? prio + NICE_QLEN : prio; + } +} + +static int ts_nice_prio(struct ts_queue *ts, struct nice_queue *nice) +{ + if (list_empty(&nice->list)) + return (int)(nice - ts->nice_levels); + else { + int prio = nice->idx - ts->base; + return prio < 0 ? prio + 40 : prio; + } +} + +/* 100% fake priority to report heuristics and the like */ +static int ts_prio(task_t *task) +{ + int policy_idx; + struct policy *policy; + struct ts_queue *ts; + struct nice_queue *nice; + + policy_idx = task->sched_info.policy; + policy = per_cpu(runqueues, task_cpu(task)).policies[policy_idx]; + ts = (struct ts_queue *)policy->queue; + nice = &ts->nice_levels[task->sched_info.cl_data.ts.nice]; + return 40*ts_nice_prio(ts, nice) + nice_task_prio(nice, task); +} + +static void ts_setprio(task_t *task, int prio) +{ +} + +static void ts_start_wait(struct queue *__queue, task_t *task) +{ +} + +static void ts_stop_wait(struct queue *__queue, task_t *task) +{ +} + +static void ts_sleep(struct queue *__queue, task_t *task) +{ +} + +static void ts_wake(struct queue *__queue, task_t *task) +{ +} + +static int nice_preempt(struct nice_queue *queue, task_t *task) +{ + check_nice(queue); + check_ts_policy(task); + /* assume FB style preemption at wakeup */ + if (!task_queued(task) || !queue->curr) + return 1; + else { + int delta_t, delta_q; + delta_t = (task->sched_info.idx - queue->base + NICE_QLEN) + % NICE_QLEN; + delta_q = (queue->curr->sched_info.idx - queue->base + + NICE_QLEN) + % NICE_QLEN; + if (delta_t < delta_q) + return 1; + else if (task->sched_info.cl_data.ts.frac_cpu + < queue->curr->sched_info.cl_data.ts.frac_cpu) + return 1; + else + return 0; + } + check_nice(queue); +} + +static int ts_preempt(struct queue *__queue, task_t *task) +{ + int curr_nice; + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + + check_queue(queue); + check_ts_policy(task); + if (!queue->curr) + return 1; + + curr_nice = (int)(nice - queue->nice_levels); + + /* preempt when nice number is lower, or the above for matches */ + if (task->sched_info.cl_data.ts.nice != curr_nice) + return task->sched_info.cl_data.ts.nice < curr_nice; + else + return nice_preempt(nice, task); +} + +static struct queue_ops ts_ops = { + .init = ts_init, + .fini = nop_fini, + .tick = ts_tick, + .yield = ts_yield, + .curr = ts_curr, + .set_curr = ts_set_curr, + .tasks = ts_tasks, + .best = ts_best, + .enqueue = ts_enqueue, + .dequeue = ts_dequeue, + .start_wait = ts_start_wait, + .stop_wait = ts_stop_wait, + .sleep = ts_sleep, + .wake = ts_wake, + .preempt = ts_preempt, + .nice = ts_nice, + .renice = ts_renice, + .prio = ts_prio, + .setprio = ts_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy ts_policy = { + .ops = &ts_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/util.c sched-2.6.0-test11-5/kernel/sched/util.c --- 
linux-2.6.0-test11/kernel/sched/util.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/util.c 2003-12-19 08:43:20.000000000 -0800 @@ -0,0 +1,37 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <asm/page.h> +#include "queue.h" + +int find_first_circular_bit(unsigned long *addr, int start, int end) +{ + int bit = find_next_bit(addr, end, start); + if (bit < end) + return bit; + bit = find_first_bit(addr, start); + if (bit < start) + return bit; + return end; +} + +void queue_nop(struct queue *queue, task_t *task) +{ +} + +void nop_renice(struct queue *queue, task_t *task, int nice) +{ +} + +void nop_fini(struct policy *policy, int cpu) +{ +} + +unsigned long nop_timeslice(struct queue *queue, task_t *task) +{ + return 0; +} + +void nop_set_timeslice(struct queue *queue, task_t *task, unsigned long n) +{ +} diff -prauN linux-2.6.0-test11/kernel/sched.c sched-2.6.0-test11-5/kernel/sched.c --- linux-2.6.0-test11/kernel/sched.c 2003-11-26 12:45:17.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched.c 2003-12-21 06:06:32.000000000 -0800 @@ -15,6 +15,8 @@ * and per-CPU runqueues. Cleanups and useful suggestions * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. + * 2003-12-17 Total rewrite and generalized scheduler policies + * by William Irwin. */ #include <linux/mm.h> @@ -38,6 +40,8 @@ #include <linux/cpu.h> #include <linux/percpu.h> +#include "sched/queue.h" + #ifdef CONFIG_NUMA #define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu)) #else @@ -45,181 +49,79 @@ #endif /* - * Convert user-nice values [ -20 ... 0 ... 19 ] - * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ], - * and back. - */ -#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20) -#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20) -#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio) - -/* - * 'User priority' is the nice value converted to something we - * can work with better when scaling various scheduler parameters, - * it's a [ 0 ... 39 ] range. - */ -#define USER_PRIO(p) ((p)-MAX_RT_PRIO) -#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio) -#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO)) -#define AVG_TIMESLICE (MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\ - (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1))) - -/* - * Some helpers for converting nanosecond timing to jiffy resolution - */ -#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ)) -#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ)) - -/* - * These are the 'tuning knobs' of the scheduler: - * - * Minimum timeslice is 10 msecs, default timeslice is 100 msecs, - * maximum timeslice is 200 msecs. Timeslices get refilled after - * they expire. - */ -#define MIN_TIMESLICE ( 10 * HZ / 1000) -#define MAX_TIMESLICE (200 * HZ / 1000) -#define ON_RUNQUEUE_WEIGHT 30 -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 -#define PRIO_BONUS_RATIO 25 -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) -#define INTERACTIVE_DELTA 2 -#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS) -#define STARVATION_LIMIT (MAX_SLEEP_AVG) -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG)) -#define NODE_THRESHOLD 125 -#define CREDIT_LIMIT 100 - -/* - * If a task is 'interactive' then we reinsert it in the active - * array after it has expired its current timeslice. (it will not - * continue to run immediately, it will still roundrobin with - * other interactive tasks.) 
- * - * This part scales the interactivity limit depending on niceness. - * - * We scale it linearly, offset by the INTERACTIVE_DELTA delta. - * Here are a few examples of different nice levels: - * - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] - * - * (the X axis represents the possible -5 ... 0 ... +5 dynamic - * priority range a task can explore, a value of '1' means the - * task is rated interactive.) - * - * Ie. nice +19 tasks can never get 'interactive' enough to be - * reinserted into the active array. And only heavily CPU-hog nice -20 - * tasks will be expired. Default nice 0 tasks are somewhere between, - * it takes some effort for them to get interactive, but it's not - * too hard. - */ - -#define CURRENT_BONUS(p) \ - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ - MAX_SLEEP_AVG) - -#ifdef CONFIG_SMP -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \ - num_online_cpus()) -#else -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1))) -#endif - -#define SCALE(v1,v1_max,v2_max) \ - (v1) * (v2_max) / (v1_max) - -#define DELTA(p) \ - (SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \ - INTERACTIVE_DELTA) - -#define TASK_INTERACTIVE(p) \ - ((p)->prio <= (p)->static_prio - DELTA(p)) - -#define JUST_INTERACTIVE_SLEEP(p) \ - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \ - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1)) - -#define HIGH_CREDIT(p) \ - ((p)->interactive_credit > CREDIT_LIMIT) - -#define LOW_CREDIT(p) \ - ((p)->interactive_credit < -CREDIT_LIMIT) - -#define TASK_PREEMPTS_CURR(p, rq) \ - ((p)->prio < (rq)->curr->prio) - -/* - * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ] - * to time slice values. - * - * The higher a thread's priority, the bigger timeslices - * it gets during one round of execution. But even the lowest - * priority thread gets MIN_TIMESLICE worth of execution time. - * - * task_timeslice() is the interface that is used by the scheduler. - */ - -#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \ - ((MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1-(p)->static_prio)/(MAX_USER_PRIO - 1))) - -static inline unsigned int task_timeslice(task_t *p) -{ - return BASE_TIMESLICE(p); -} - -/* - * These are the runqueue data structures: - */ - -#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long)) - -typedef struct runqueue runqueue_t; - -struct prio_array { - int nr_active; - unsigned long bitmap[BITMAP_SIZE]; - struct list_head queue[MAX_PRIO]; -}; - -/* * This is the main, per-CPU runqueue data structure. * * Locking rule: those places that want to lock multiple runqueues * (such as the load balancing or the thread migration code), lock * acquire operations must be ordered by ascending &runqueue. 
*/ -struct runqueue { - spinlock_t lock; - unsigned long nr_running, nr_switches, expired_timestamp, - nr_uninterruptible; - task_t *curr, *idle; - struct mm_struct *prev_mm; - prio_array_t *active, *expired, arrays[2]; - int prev_cpu_load[NR_CPUS]; -#ifdef CONFIG_NUMA - atomic_t *node_nr_running; - int prev_node_load[MAX_NUMNODES]; -#endif - task_t *migration_thread; - struct list_head migration_queue; +DEFINE_PER_CPU(struct runqueue, runqueues); - atomic_t nr_iowait; +struct policy *policies[] = { + &rt_policy, + &ts_policy, + &batch_policy, + &idle_policy, + NULL, }; -static DEFINE_PER_CPU(struct runqueue, runqueues); - #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() (&__get_cpu_var(runqueues)) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) +#define rq_curr(rq) (rq)->__curr +#define cpu_curr(cpu) rq_curr(cpu_rq(cpu)) + +static inline struct policy *task_policy(task_t *task) +{ + unsigned long idx; + struct policy *policy; + idx = task->sched_info.policy; + __check_task_policy(idx); + policy = task_rq(task)->policies[idx]; + check_policy(policy); + return policy; +} + +static inline struct policy *rq_policy(runqueue_t *rq) +{ + unsigned long idx; + task_t *task; + struct policy *policy; + + task = rq_curr(rq); + BUG_ON(!task); + BUG_ON((unsigned long)task < PAGE_OFFSET); + idx = task->sched_info.policy; + __check_task_policy(idx); + policy = rq->policies[idx]; + check_policy(policy); + return policy; +} + +static int __task_nice(task_t *task) +{ + struct policy *policy = task_policy(task); + return policy->ops->nice(policy->queue, task); +} + +static inline void set_rq_curr(runqueue_t *rq, task_t *task) +{ + rq->curr = task->sched_info.policy; + __check_task_policy(rq->curr); + rq->__curr = task; +} + +static inline int task_preempts_curr(task_t *task, runqueue_t *rq) +{ + check_task_policy(rq_curr(rq)); + check_task_policy(task); + if (rq_curr(rq)->sched_info.policy != task->sched_info.policy) + return task->sched_info.policy < rq_curr(rq)->sched_info.policy; + else { + struct policy *policy = rq_policy(rq); + return policy->ops->preempt(policy->queue, task); + } +} /* * Default context-switch locking: @@ -227,7 +129,7 @@ static DEFINE_PER_CPU(struct runqueue, r #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(rq, p) (rq_curr(rq) == (p)) #endif #ifdef CONFIG_NUMA @@ -320,53 +222,32 @@ static inline void rq_unlock(runqueue_t } /* - * Adding/removing a task to/from a priority array: + * Adding/removing a task to/from a policy's queue. + * We dare not BUG_ON() a wrong task_queued() as boot-time + * calls may trip it. 
*/ -static inline void dequeue_task(struct task_struct *p, prio_array_t *array) +static inline void dequeue_task(task_t *task, runqueue_t *rq) { - array->nr_active--; - list_del(&p->run_list); - if (list_empty(array->queue + p->prio)) - __clear_bit(p->prio, array->bitmap); + struct policy *policy = task_policy(task); + BUG_ON(!task_queued(task)); + policy->ops->dequeue(policy->queue, task); + if (!policy->ops->tasks(policy->queue)) { + BUG_ON(!test_bit(task->sched_info.policy, &rq->policy_bitmap)); + __clear_bit(task->sched_info.policy, &rq->policy_bitmap); + } + clear_task_queued(task); } -static inline void enqueue_task(struct task_struct *p, prio_array_t *array) +static inline void enqueue_task(task_t *task, runqueue_t *rq) { - list_add_tail(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); - array->nr_active++; - p->array = array; -} - -/* - * effective_prio - return the priority that is based on the static - * priority but is modified by bonuses/penalties. - * - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG] - * into the -5 ... 0 ... +5 bonus/penalty range. - * - * We use 25% of the full 0...39 priority range so that: - * - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs. - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks. - * - * Both properties are important to certain workloads. - */ -static int effective_prio(task_t *p) -{ - int bonus, prio; - - if (rt_task(p)) - return p->prio; - - bonus = CURRENT_BONUS(p) - MAX_BONUS / 2; - - prio = p->static_prio - bonus; - if (prio < MAX_RT_PRIO) - prio = MAX_RT_PRIO; - if (prio > MAX_PRIO-1) - prio = MAX_PRIO-1; - return prio; + struct policy *policy = task_policy(task); + BUG_ON(task_queued(task)); + if (!policy->ops->tasks(policy->queue)) { + BUG_ON(test_bit(task->sched_info.policy, &rq->policy_bitmap)); + __set_bit(task->sched_info.policy, &rq->policy_bitmap); + } + policy->ops->enqueue(policy->queue, task); + set_task_queued(task); } /* @@ -374,134 +255,34 @@ static int effective_prio(task_t *p) */ static inline void __activate_task(task_t *p, runqueue_t *rq) { - enqueue_task(p, rq->active); + enqueue_task(p, rq); nr_running_inc(rq); } -static void recalc_task_prio(task_t *p, unsigned long long now) -{ - unsigned long long __sleep_time = now - p->timestamp; - unsigned long sleep_time; - - if (__sleep_time > NS_MAX_SLEEP_AVG) - sleep_time = NS_MAX_SLEEP_AVG; - else - sleep_time = (unsigned long)__sleep_time; - - if (likely(sleep_time > 0)) { - /* - * User tasks that sleep a long time are categorised as - * idle and will get just interactive status to stay active & - * prevent them suddenly becoming cpu hogs and starving - * other processes. - */ - if (p->mm && p->activated != -1 && - sleep_time > JUST_INTERACTIVE_SLEEP(p)){ - p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - - AVG_TIMESLICE); - if (!HIGH_CREDIT(p)) - p->interactive_credit++; - } else { - /* - * The lower the sleep avg a task has the more - * rapidly it will rise with sleep time. - */ - sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1; - - /* - * Tasks with low interactive_credit are limited to - * one timeslice worth of sleep avg bonus. 
- */ - if (LOW_CREDIT(p) && - sleep_time > JIFFIES_TO_NS(task_timeslice(p))) - sleep_time = - JIFFIES_TO_NS(task_timeslice(p)); - - /* - * Non high_credit tasks waking from uninterruptible - * sleep are limited in their sleep_avg rise as they - * are likely to be cpu hogs waiting on I/O - */ - if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){ - if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p)) - sleep_time = 0; - else if (p->sleep_avg + sleep_time >= - JUST_INTERACTIVE_SLEEP(p)){ - p->sleep_avg = - JUST_INTERACTIVE_SLEEP(p); - sleep_time = 0; - } - } - - /* - * This code gives a bonus to interactive tasks. - * - * The boost works by updating the 'average sleep time' - * value here, based on ->timestamp. The more time a task - * spends sleeping, the higher the average gets - and the - * higher the priority boost gets as well. - */ - p->sleep_avg += sleep_time; - - if (p->sleep_avg > NS_MAX_SLEEP_AVG){ - p->sleep_avg = NS_MAX_SLEEP_AVG; - if (!HIGH_CREDIT(p)) - p->interactive_credit++; - } - } - } - - p->prio = effective_prio(p); -} - /* * activate_task - move a task to the runqueue and do priority recalculation * * Update all the scheduling statistics stuff. (sleep average * calculation, priority modifiers, etc.) */ -static inline void activate_task(task_t *p, runqueue_t *rq) +static inline void activate_task(task_t *task, runqueue_t *rq) { - unsigned long long now = sched_clock(); - - recalc_task_prio(p, now); - - /* - * This checks to make sure it's not an uninterruptible task - * that is now waking up. - */ - if (!p->activated){ - /* - * Tasks which were woken up by interrupts (ie. hw events) - * are most likely of interactive nature. So we give them - * the credit of extending their sleep time to the period - * of time they spend on the runqueue, waiting for execution - * on a CPU, first time around: - */ - if (in_interrupt()) - p->activated = 2; - else - /* - * Normal first-time wakeups get a credit too for on-runqueue - * time, but it will be weighted down: - */ - p->activated = 1; - } - p->timestamp = now; - - __activate_task(p, rq); + struct policy *policy = task_policy(task); + policy->ops->wake(policy->queue, task); + __activate_task(task, rq); } /* * deactivate_task - remove a task from the runqueue. */ -static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) +static inline void deactivate_task(task_t *task, runqueue_t *rq) { + struct policy *policy = task_policy(task); nr_running_dec(rq); - if (p->state == TASK_UNINTERRUPTIBLE) + if (task->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; - dequeue_task(p, p->array); - p->array = NULL; + policy->ops->sleep(policy->queue, task); + dequeue_task(task, rq); } /* @@ -625,7 +406,7 @@ repeat_lock_task: rq = task_rq_lock(p, &flags); old_state = p->state; if (old_state & state) { - if (!p->array) { + if (!task_queued(p)) { /* * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. @@ -644,14 +425,13 @@ repeat_lock_task: * Tasks on involuntary sleep don't earn * sleep_avg beyond just interactive state. */ - p->activated = -1; } if (sync) __activate_task(p, rq); else { activate_task(p, rq); - if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); + if (task_preempts_curr(p, rq)) + resched_task(rq_curr(rq)); } success = 1; } @@ -679,68 +459,26 @@ int wake_up_state(task_t *p, unsigned in * This function will do some initial scheduler statistics housekeeping * that must be done for every newly created process. 
*/ -void wake_up_forked_process(task_t * p) +void wake_up_forked_process(task_t *task) { unsigned long flags; runqueue_t *rq = task_rq_lock(current, &flags); - p->state = TASK_RUNNING; - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. - */ - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - - p->interactive_credit = 0; - - p->prio = effective_prio(p); - set_task_cpu(p, smp_processor_id()); - - if (unlikely(!current->array)) - __activate_task(p, rq); - else { - p->prio = current->prio; - list_add_tail(&p->run_list, ¤t->run_list); - p->array = current->array; - p->array->nr_active++; - nr_running_inc(rq); - } + task->state = TASK_RUNNING; + set_task_cpu(task, smp_processor_id()); + if (unlikely(!task_queued(current))) + __activate_task(task, rq); + else + activate_task(task, rq); task_rq_unlock(rq, &flags); } /* - * Potentially available exiting-child timeslices are - * retrieved here - this way the parent does not get - * penalized for creating too many threads. - * - * (this cannot be used to 'generate' timeslices - * artificially, because any timeslice recovered here - * was given away by the parent in the first place.) + * Policies that depend on trapping fork() and exit() may need to + * put a hook here. */ -void sched_exit(task_t * p) +void sched_exit(task_t *task) { - unsigned long flags; - - local_irq_save(flags); - if (p->first_time_slice) { - p->parent->time_slice += p->time_slice; - if (unlikely(p->parent->time_slice > MAX_TIMESLICE)) - p->parent->time_slice = MAX_TIMESLICE; - } - local_irq_restore(flags); - /* - * If the child was a (relative-) CPU hog then decrease - * the sleep_avg of the parent as well. - */ - if (p->sleep_avg < p->parent->sleep_avg) - p->parent->sleep_avg = p->parent->sleep_avg / - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / - (EXIT_WEIGHT + 1); } /** @@ -1128,18 +866,18 @@ out: * pull_task - move a task from a remote runqueue to the local runqueue. * Both runqueues must be locked. */ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +static inline void pull_task(runqueue_t *src_rq, task_t *p, runqueue_t *this_rq, int this_cpu) { - dequeue_task(p, src_array); + dequeue_task(p, src_rq); nr_running_dec(src_rq); set_task_cpu(p, this_cpu); nr_running_inc(this_rq); - enqueue_task(p, this_rq->active); + enqueue_task(p, this_rq); /* * Note that idle threads have a prio of MAX_PRIO, for this test * to be always true for them. */ - if (TASK_PREEMPTS_CURR(p, this_rq)) + if (task_preempts_curr(p, this_rq)) set_need_resched(); } @@ -1150,14 +888,14 @@ static inline void pull_task(runqueue_t * ((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \ * cache_decay_ticks)) && !task_running(rq, p) && \ * cpu_isset(this_cpu, (p)->cpus_allowed)) + * + * Since there isn't a timestamp anymore, this needs adjustment. 
*/ static inline int can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle) { - unsigned long delta = sched_clock() - tsk->timestamp; - - if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks))) + if (!idle) return 0; if (task_running(rq, tsk)) return 0; @@ -1176,11 +914,8 @@ can_migrate_task(task_t *tsk, runqueue_t */ static void load_balance(runqueue_t *this_rq, int idle, cpumask_t cpumask) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, this_cpu = smp_processor_id(); runqueue_t *busiest; - prio_array_t *array; - struct list_head *head, *curr; - task_t *tmp; busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask); if (!busiest) @@ -1192,37 +927,6 @@ static void load_balance(runqueue_t *thi */ imbalance /= 2; - /* - * We first consider expired tasks. Those will likely not be - * executed in the near future, and they are most likely to - * be cache-cold, thus switching CPUs has the least effect - * on them. - */ - if (busiest->expired->nr_active) - array = busiest->expired; - else - array = busiest->active; - -new_array: - /* Start searching at priority 0: */ - idx = 0; -skip_bitmap: - if (!idx) - idx = sched_find_first_bit(array->bitmap); - else - idx = find_next_bit(array->bitmap, MAX_PRIO, idx); - if (idx >= MAX_PRIO) { - if (array == busiest->expired) { - array = busiest->active; - goto new_array; - } - goto out_unlock; - } - - head = array->queue + idx; - curr = head->prev; -skip_queue: - tmp = list_entry(curr, task_t, run_list); /* * We do not migrate tasks that are: @@ -1231,21 +935,19 @@ skip_queue: * 3) are cache-hot on their current CPU. */ - curr = curr->prev; + do { + struct policy *policy; + task_t *task; + + policy = rq_migrate_policy(busiest); + if (!policy) + break; + task = policy->migrate(policy->queue); + if (!task) + break; + pull_task(busiest, task, this_rq, this_cpu); + } while (!idle && --imbalance); - if (!can_migrate_task(tmp, busiest, this_cpu, idle)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } out_unlock: spin_unlock(&busiest->lock); out: @@ -1356,10 +1058,10 @@ EXPORT_PER_CPU_SYMBOL(kstat); */ void scheduler_tick(int user_ticks, int sys_ticks) { - int cpu = smp_processor_id(); + int idle, cpu = smp_processor_id(); struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + struct policy *policy; runqueue_t *rq = this_rq(); - task_t *p = current; if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_ticks); @@ -1373,98 +1075,28 @@ void scheduler_tick(int user_ticks, int sys_ticks = 0; } - if (p == rq->idle) { - if (atomic_read(&rq->nr_iowait) > 0) - cpustat->iowait += sys_ticks; - else - cpustat->idle += sys_ticks; - rebalance_tick(rq, 1); - return; - } - if (TASK_NICE(p) > 0) - cpustat->nice += user_ticks; - else - cpustat->user += user_ticks; - cpustat->system += sys_ticks; - - /* Task might have expired already, but not scheduled off yet */ - if (p->array != rq->active) { - set_tsk_need_resched(p); - goto out; - } spin_lock(&rq->lock); - /* - * The task was running during this tick - update the - * time slice counter. Note: we do not update a thread's - * priority until it either goes to sleep or uses up its - * timeslice. This makes it possible for interactive tasks - * to use up their timeslices at their highest priority levels. 
- */ - if (unlikely(rt_task(p))) { - /* - * RR tasks need a special form of timeslice management. - * FIFO tasks have no timeslices. - */ - if ((p->policy == SCHED_RR) && !--p->time_slice) { - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - set_tsk_need_resched(p); - - /* put it at the end of the queue: */ - dequeue_task(p, rq->active); - enqueue_task(p, rq->active); - } - goto out_unlock; - } - if (!--p->time_slice) { - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - - if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; - if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { - enqueue_task(p, rq->expired); - } else - enqueue_task(p, rq->active); - } else { - /* - * Prevent a too long timeslice allowing a task to monopolize - * the CPU. We do this by splitting up the timeslice into - * smaller pieces. - * - * Note: this does not mean the task's timeslices expire or - * get lost in any way, they just might be preempted by - * another task of equal priority. (one with higher - * priority would have preempted this task already.) We - * requeue this task to the end of the list on this priority - * level, which is in essence a round-robin of tasks with - * equal priority. - * - * This only applies to tasks in the interactive - * delta range with at least TIMESLICE_GRANULARITY to requeue. - */ - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - - p->time_slice) % TIMESLICE_GRANULARITY(p)) && - (p->time_slice >= TIMESLICE_GRANULARITY(p)) && - (p->array == rq->active)) { - - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - enqueue_task(p, rq->active); - } - } -out_unlock: + policy = rq_policy(rq); + idle = policy->ops->tick(policy->queue, current, user_ticks, sys_ticks); spin_unlock(&rq->lock); -out: - rebalance_tick(rq, 0); + rebalance_tick(rq, idle); } void scheduling_functions_start_here(void) { } +static inline task_t *find_best_task(runqueue_t *rq) +{ + int idx; + struct policy *policy; + + BUG_ON(!rq->policy_bitmap); + idx = __ffs(rq->policy_bitmap); + __check_task_policy(idx); + policy = rq->policies[idx]; + check_policy(policy); + return policy->ops->best(policy->queue); +} + /* * schedule() is the main scheduler function. */ @@ -1472,11 +1104,7 @@ asmlinkage void schedule(void) { task_t *prev, *next; runqueue_t *rq; - prio_array_t *array; - struct list_head *queue; - unsigned long long now; - unsigned long run_time; - int idx; + struct policy *policy; /* * Test if we are atomic. Since do_exit() needs to call into @@ -1494,22 +1122,9 @@ need_resched: preempt_disable(); prev = current; rq = this_rq(); + policy = rq_policy(rq); release_kernel_lock(prev); - now = sched_clock(); - if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG)) - run_time = now - prev->timestamp; - else - run_time = NS_MAX_SLEEP_AVG; - - /* - * Tasks with interactive credits get charged less run_time - * at high sleep_avg to delay them losing their interactive - * status - */ - if (HIGH_CREDIT(prev)) - run_time /= (CURRENT_BONUS(prev) ? 
: 1); - spin_lock_irq(&rq->lock); /* @@ -1530,66 +1145,27 @@ need_resched: prev->nvcsw++; break; case TASK_RUNNING: + policy->ops->start_wait(policy->queue, prev); prev->nivcsw++; } + pick_next_task: - if (unlikely(!rq->nr_running)) { #ifdef CONFIG_SMP + if (unlikely(!rq->nr_running)) load_balance(rq, 1, cpu_to_node_mask(smp_processor_id())); - if (rq->nr_running) - goto pick_next_task; #endif - next = rq->idle; - rq->expired_timestamp = 0; - goto switch_tasks; - } - - array = rq->active; - if (unlikely(!array->nr_active)) { - /* - * Switch the active and expired arrays. - */ - rq->active = rq->expired; - rq->expired = array; - array = rq->active; - rq->expired_timestamp = 0; - } - - idx = sched_find_first_bit(array->bitmap); - queue = array->queue + idx; - next = list_entry(queue->next, task_t, run_list); - - if (next->activated > 0) { - unsigned long long delta = now - next->timestamp; - - if (next->activated == 1) - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; - - array = next->array; - dequeue_task(next, array); - recalc_task_prio(next, next->timestamp + delta); - enqueue_task(next, array); - } - next->activated = 0; -switch_tasks: + next = find_best_task(rq); + BUG_ON(!next); prefetch(next); clear_tsk_need_resched(prev); RCU_qsctr(task_cpu(prev))++; - prev->sleep_avg -= run_time; - if ((long)prev->sleep_avg <= 0){ - prev->sleep_avg = 0; - if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev))) - prev->interactive_credit--; - } - prev->timestamp = now; - if (likely(prev != next)) { - next->timestamp = now; rq->nr_switches++; - rq->curr = next; - prepare_arch_switch(rq, next); + policy = task_policy(next); + policy->ops->set_curr(policy->queue, next); + set_rq_curr(rq, next); prev = context_switch(rq, prev, next); barrier(); @@ -1845,45 +1421,46 @@ void scheduling_functions_end_here(void) void set_user_nice(task_t *p, long nice) { unsigned long flags; - prio_array_t *array; runqueue_t *rq; - int old_prio, new_prio, delta; + struct policy *policy; + int delta, queued; - if (TASK_NICE(p) == nice || nice < -20 || nice > 19) + if (nice < -20 || nice > 19) return; /* * We have to be careful, if called from sys_setpriority(), * the task might be in the middle of scheduling on another CPU. 
*/ rq = task_rq_lock(p, &flags); + delta = nice - __task_nice(p); + if (!delta) { + if (p->pid == 0 || p->pid == 1) + printk("no change in nice, set_user_nice() nops!\n"); + goto out_unlock; + } + + policy = task_policy(p); + /* * The RT priorities are set via setscheduler(), but we still * allow the 'normal' nice value to be set - but as expected * it wont have any effect on scheduling until the task is * not SCHED_NORMAL: */ - if (rt_task(p)) { - p->static_prio = NICE_TO_PRIO(nice); - goto out_unlock; - } - array = p->array; - if (array) - dequeue_task(p, array); - - old_prio = p->prio; - new_prio = NICE_TO_PRIO(nice); - delta = new_prio - old_prio; - p->static_prio = NICE_TO_PRIO(nice); - p->prio += delta; + queued = task_queued(p); + if (queued) + dequeue_task(p, rq); + + policy->ops->renice(policy->queue, p, nice); - if (array) { - enqueue_task(p, array); + if (queued) { + enqueue_task(p, rq); /* * If the task increased its priority or is running and * lowered its priority, then reschedule its CPU: */ if (delta < 0 || (delta > 0 && task_running(rq, p))) - resched_task(rq->curr); + resched_task(rq_curr(rq)); } out_unlock: task_rq_unlock(rq, &flags); @@ -1919,7 +1496,7 @@ asmlinkage long sys_nice(int increment) if (increment > 40) increment = 40; - nice = PRIO_TO_NICE(current->static_prio) + increment; + nice = task_nice(current) + increment; if (nice < -20) nice = -20; if (nice > 19) @@ -1935,6 +1512,12 @@ asmlinkage long sys_nice(int increment) #endif +static int __task_prio(task_t *task) +{ + struct policy *policy = task_policy(task); + return policy->ops->prio(task); +} + /** * task_prio - return the priority value of a given task. * @p: the task in question. @@ -1943,29 +1526,111 @@ asmlinkage long sys_nice(int increment) * RT tasks are offset by -200. Normal tasks are centered * around 0, value goes from -16 to +15. */ -int task_prio(task_t *p) +int task_prio(task_t *task) { - return p->prio - MAX_RT_PRIO; + int prio; + unsigned long flags; + runqueue_t *rq; + + rq = task_rq_lock(task, &flags); + prio = __task_prio(task); + task_rq_unlock(rq, &flags); + return prio; } /** * task_nice - return the nice value of a given task. * @p: the task in question. 
*/ -int task_nice(task_t *p) +int task_nice(task_t *task) { - return TASK_NICE(p); + int nice; + unsigned long flags; + runqueue_t *rq; + + + rq = task_rq_lock(task, &flags); + nice = __task_nice(task); + task_rq_unlock(rq, &flags); + return nice; } EXPORT_SYMBOL(task_nice); +int task_sched_policy(task_t *task) +{ + check_task_policy(task); + switch (task->sched_info.policy) { + case SCHED_POLICY_RT: + if (task->sched_info.cl_data.rt.rt_policy + == RT_POLICY_RR) + return SCHED_RR; + else + return SCHED_FIFO; + case SCHED_POLICY_TS: + return SCHED_NORMAL; + case SCHED_POLICY_BATCH: + return SCHED_BATCH; + case SCHED_POLICY_IDLE: + return SCHED_IDLE; + default: + BUG(); + return -1; + } +} +EXPORT_SYMBOL(task_sched_policy); + +void set_task_sched_policy(task_t *task, int policy) +{ + check_task_policy(task); + BUG_ON(task_queued(task)); + switch (policy) { + case SCHED_FIFO: + task->sched_info.policy = SCHED_POLICY_RT; + task->sched_info.cl_data.rt.rt_policy = RT_POLICY_FIFO; + break; + case SCHED_RR: + task->sched_info.policy = SCHED_POLICY_RT; + task->sched_info.cl_data.rt.rt_policy = RT_POLICY_RR; + break; + case SCHED_NORMAL: + task->sched_info.policy = SCHED_POLICY_TS; + break; + case SCHED_BATCH: + task->sched_info.policy = SCHED_POLICY_BATCH; + break; + case SCHED_IDLE: + task->sched_info.policy = SCHED_POLICY_IDLE; + break; + default: + BUG(); + break; + } + check_task_policy(task); +} +EXPORT_SYMBOL(set_task_sched_policy); + +int rt_task(task_t *task) +{ + check_task_policy(task); + return !!(task->sched_info.policy == SCHED_POLICY_RT); +} +EXPORT_SYMBOL(rt_task); + /** * idle_cpu - is a given cpu idle currently? * @cpu: the processor in question. */ int idle_cpu(int cpu) { - return cpu_curr(cpu) == cpu_rq(cpu)->idle; + int idle; + unsigned long flags; + runqueue_t *rq = cpu_rq(cpu); + + spin_lock_irqsave(&rq->lock, flags); + idle = !!(rq->curr == SCHED_POLICY_IDLE); + spin_unlock_irqrestore(&rq->lock, flags); + return idle; } EXPORT_SYMBOL_GPL(idle_cpu); @@ -1985,11 +1650,10 @@ static inline task_t *find_process_by_pi static int setscheduler(pid_t pid, int policy, struct sched_param __user *param) { struct sched_param lp; - int retval = -EINVAL; - int oldprio; - prio_array_t *array; + int queued, retval = -EINVAL; unsigned long flags; runqueue_t *rq; + struct policy *rq_policy; task_t *p; if (!param || pid < 0) @@ -2017,7 +1681,7 @@ static int setscheduler(pid_t pid, int p rq = task_rq_lock(p, &flags); if (policy < 0) - policy = p->policy; + policy = task_sched_policy(p); else { retval = -EINVAL; if (policy != SCHED_FIFO && policy != SCHED_RR && @@ -2047,29 +1711,23 @@ static int setscheduler(pid_t pid, int p if (retval) goto out_unlock; - array = p->array; - if (array) + queued = task_queued(p); + if (queued) deactivate_task(p, task_rq(p)); retval = 0; - p->policy = policy; - p->rt_priority = lp.sched_priority; - oldprio = p->prio; - if (policy != SCHED_NORMAL) - p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority; - else - p->prio = p->static_prio; - if (array) { + set_task_sched_policy(p, policy); + check_task_policy(p); + rq_policy = rq->policies[p->sched_info.policy]; + check_policy(rq_policy); + rq_policy->ops->setprio(p, lp.sched_priority); + if (queued) { __activate_task(p, task_rq(p)); /* * Reschedule if we are currently running on this runqueue and * our priority decreased, or if we are not currently running on * this runqueue and our priority is higher than the current's */ - if (rq->curr == p) { - if (p->prio > oldprio) - resched_task(rq->curr); - } else if (p->prio < 
rq->curr->prio) - resched_task(rq->curr); + resched_task(rq_curr(rq)); } out_unlock: @@ -2121,7 +1779,7 @@ asmlinkage long sys_sched_getscheduler(p if (p) { retval = security_task_getscheduler(p); if (!retval) - retval = p->policy; + retval = task_sched_policy(p); } read_unlock(&tasklist_lock); @@ -2153,7 +1811,7 @@ asmlinkage long sys_sched_getparam(pid_t if (retval) goto out_unlock; - lp.sched_priority = p->rt_priority; + lp.sched_priority = task_prio(p); read_unlock(&tasklist_lock); /* @@ -2262,32 +1920,13 @@ out_unlock: */ asmlinkage long sys_sched_yield(void) { + struct policy *policy; runqueue_t *rq = this_rq_lock(); - prio_array_t *array = current->array; - - /* - * We implement yielding by moving the task into the expired - * queue. - * - * (special rule: RT tasks will just roundrobin in the active - * array.) - */ - if (likely(!rt_task(current))) { - dequeue_task(current, array); - enqueue_task(current, rq->expired); - } else { - list_del(¤t->run_list); - list_add_tail(¤t->run_list, array->queue + current->prio); - } - /* - * Since we are going to call schedule() anyway, there's - * no need to preempt: - */ + policy = rq_policy(rq); + policy->ops->yield(policy->queue, current); _raw_spin_unlock(&rq->lock); preempt_enable_no_resched(); - schedule(); - return 0; } @@ -2387,6 +2026,19 @@ asmlinkage long sys_sched_get_priority_m return ret; } +static inline unsigned long task_timeslice(task_t *task) +{ + unsigned long flags, timeslice; + struct policy *policy; + runqueue_t *rq; + + rq = task_rq_lock(task, &flags); + policy = task_policy(task); + timeslice = policy->ops->timeslice(policy->queue, task); + task_rq_unlock(rq, &flags); + return timeslice; +} + /** * sys_sched_rr_get_interval - return the default timeslice of a process. * @pid: pid of the process. @@ -2414,8 +2066,7 @@ asmlinkage long sys_sched_rr_get_interva if (retval) goto out_unlock; - jiffies_to_timespec(p->policy & SCHED_FIFO ? - 0 : task_timeslice(p), &t); + jiffies_to_timespec(task_timeslice(p), &t); read_unlock(&tasklist_lock); retval = copy_to_user(interval, &t, sizeof(t)) ? 
-EFAULT : 0; out_nounlock: @@ -2523,17 +2174,22 @@ void show_state(void) void __init init_idle(task_t *idle, int cpu) { runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle)); + struct policy *policy; unsigned long flags; local_irq_save(flags); double_rq_lock(idle_rq, rq); - - idle_rq->curr = idle_rq->idle = idle; + policy = rq_policy(rq); + BUG_ON(policy != task_policy(idle)); + printk("deactivating, have %d tasks\n", + policy->ops->tasks(policy->queue)); deactivate_task(idle, rq); - idle->array = NULL; - idle->prio = MAX_PRIO; + set_task_sched_policy(idle, SCHED_IDLE); idle->state = TASK_RUNNING; set_task_cpu(idle, cpu); + activate_task(idle, rq); + nr_running_dec(rq); + set_rq_curr(rq, idle); double_rq_unlock(idle_rq, rq); set_tsk_need_resched(idle); local_irq_restore(flags); @@ -2804,38 +2460,27 @@ __init static void init_kstat(void) { void __init sched_init(void) { runqueue_t *rq; - int i, j, k; + int i, j; /* Init the kstat counters */ init_kstat(); for (i = 0; i < NR_CPUS; i++) { - prio_array_t *array; - rq = cpu_rq(i); - rq->active = rq->arrays; - rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); INIT_LIST_HEAD(&rq->migration_queue); atomic_set(&rq->nr_iowait, 0); nr_running_init(rq); - - for (j = 0; j < 2; j++) { - array = rq->arrays + j; - for (k = 0; k < MAX_PRIO; k++) { - INIT_LIST_HEAD(array->queue + k); - __clear_bit(k, array->bitmap); - } - // delimiter for bitsearch - __set_bit(MAX_PRIO, array->bitmap); - } + memcpy(rq->policies, policies, sizeof(policies)); + for (j = 0; j < BITS_PER_LONG && rq->policies[j]; ++j) + rq->policies[j]->ops->init(rq->policies[j], i); } /* * We have to do a little magic to get the first * thread right in SMP mode. */ rq = this_rq(); - rq->curr = current; - rq->idle = current; + set_task_sched_policy(current, SCHED_NORMAL); + set_rq_curr(rq, current); set_task_cpu(current, smp_processor_id()); wake_up_forked_process(current); diff -prauN linux-2.6.0-test11/lib/Makefile sched-2.6.0-test11-5/lib/Makefile --- linux-2.6.0-test11/lib/Makefile 2003-11-26 12:42:55.000000000 -0800 +++ sched-2.6.0-test11-5/lib/Makefile 2003-12-20 15:09:16.000000000 -0800 @@ -5,7 +5,7 @@ lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \ bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \ - kobject.o idr.o div64.o parser.o + kobject.o idr.o div64.o parser.o binomial.o lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o diff -prauN linux-2.6.0-test11/lib/binomial.c sched-2.6.0-test11-5/lib/binomial.c --- linux-2.6.0-test11/lib/binomial.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/lib/binomial.c 2003-12-20 17:32:09.000000000 -0800 @@ -0,0 +1,138 @@ +#include <linux/kernel.h> +#include <linux/binomial.h> + +struct binomial *binomial_minimum(struct binomial **heap) +{ + struct binomial *minimum, *tmp; + + for (minimum = NULL, tmp = *heap; tmp; tmp = tmp->sibling) { + if (!minimum || minimum->priority > tmp->priority) + minimum = tmp; + } + return minimum; +} + +static void binomial_link(struct binomial *left, struct binomial *right) +{ + left->parent = right; + left->sibling = right->child; + right->child = left; + right->degree++; +} + +static void binomial_merge(struct binomial **both, struct binomial **left, + struct binomial **right) +{ + while (*left && *right) { + if ((*left)->degree < (*right)->degree) { + *both = *left; + left = &(*left)->sibling; + } else { + *both = *right; + right = &(*right)->sibling; + } + both = &(*both)->sibling; + } + /* + * for more safety: + * 
*left = *right = NULL; + */ +} + +void binomial_union(struct binomial **both, struct binomial **left, + struct binomial **right) +{ + struct binomial *prev, *tmp, *next; + + binomial_merge(both, left, right); + if (!(tmp = *both)) + return; + + for (prev = NULL, next = tmp->sibling; next; next = tmp->sibling) { + if ((next->sibling && next->sibling->degree == tmp->degree) + || tmp->degree != next->degree) { + prev = tmp; + tmp = next; + } else if (tmp->priority <= next->priority) { + tmp->sibling = next->sibling; + binomial_link(next, tmp); + } else { + if (!prev) + *both = next; + else + prev->sibling = next; + binomial_link(tmp, next); + tmp = next; + } + } +} + +void binomial_insert(struct binomial **heap, struct binomial *element) +{ + element->parent = NULL; + element->child = NULL; + element->sibling = NULL; + element->degree = 0; + binomial_union(heap, heap, &element); +} + +static void binomial_reverse(struct binomial **in, struct binomial **out) +{ + while (*in) { + struct binomial *tmp = *in; + *in = (*in)->sibling; + tmp->sibling = *out; + *out = tmp; + } +} + +struct binomial *binomial_extract_min(struct binomial **heap) +{ + struct binomial *tmp, *minimum, *last, *min_last, *new_heap; + + minimum = last = min_last = new_heap = NULL; + for (tmp = *heap; tmp; last = tmp, tmp = tmp->sibling) { + if (!minimum || tmp->priority < minimum->priority) { + minimum = tmp; + min_last = last; + } + } + if (min_last && minimum) + min_last->sibling = minimum->sibling; + else if (minimum) + (*heap)->sibling = minimum->sibling; + else + return NULL; + binomial_reverse(&minimum->child, &new_heap); + binomial_union(heap, heap, &new_heap); + return minimum; +} + +void binomial_decrease(struct binomial **heap, struct binomial *element, + unsigned increment) +{ + struct binomial *tmp, *last = NULL; + + element->priority -= min(element->priority, increment); + last = element; + tmp = last->parent; + while (tmp && last->priority < tmp->priority) { + unsigned tmp_prio = tmp->priority; + tmp->priority = last->priority; + last->priority = tmp_prio; + last = tmp; + tmp = tmp->parent; + } +} + +void binomial_delete(struct binomial **heap, struct binomial *element) +{ + struct binomial *tmp, *last = element; + for (tmp = last->parent; tmp; last = tmp, tmp = tmp->parent) { + unsigned tmp_prio = tmp->priority; + tmp->priority = last->priority; + last->priority = tmp_prio; + } + binomial_reverse(&last->child, &tmp); + binomial_union(heap, heap, &tmp); +} diff -prauN linux-2.6.0-test11/mm/oom_kill.c sched-2.6.0-test11-5/mm/oom_kill.c --- linux-2.6.0-test11/mm/oom_kill.c 2003-11-26 12:44:16.000000000 -0800 +++ sched-2.6.0-test11-5/mm/oom_kill.c 2003-12-17 07:07:53.000000000 -0800 @@ -158,7 +158,6 @@ static void __oom_kill_task(task_t *p) * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ - p->time_slice = HZ; p->flags |= PF_MEMALLOC | PF_MEMDIE; /* This process has hardware access, be more careful. */ ^ permalink raw reply [flat|nested] 304+ messages in thread
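To make the lib/binomial.c interface added by the patch above easier to follow, here is a rough sketch of how a scheduling policy could queue tasks on such a heap and pick the one with the smallest key. The wrapper struct and helper names below are invented for illustration and are not part of the posted patch; only the binomial_insert()/binomial_extract_min() calls and the 'priority' field come from the patch itself (which also adds the linux/binomial.h header that lib/binomial.c includes).

#include <linux/kernel.h>
#include <linux/binomial.h>

/*
 * Hypothetical per-task queueing node for a policy built on the
 * binomial heap above; nothing in this fragment is in the patch.
 */
struct heap_node {
        struct binomial node;   /* keyed by 'priority', smaller is better */
        task_t *task;
};

static struct binomial *heap;   /* root list of the heap, NULL when empty */

static void policy_enqueue(struct heap_node *hn, unsigned int key)
{
        /* The caller sets the key; binomial_insert() links the node in. */
        hn->node.priority = key;
        binomial_insert(&heap, &hn->node);
}

static task_t *policy_pick_next(void)
{
        struct binomial *min = binomial_extract_min(&heap);

        /* Empty queue: let the caller fall back to the idle task. */
        if (!min)
                return NULL;
        return container_of(min, struct heap_node, node)->task;
}

Extract-min on a binomial heap is at worst O(log n) in the number of queued entries, which is one reason the O(1)-versus-O(log N) question keeps coming up later in the thread.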
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:24 ` Ingo Molnar 2007-04-17 9:57 ` William Lee Irwin III @ 2007-04-17 22:08 ` Matt Mackall 2007-04-17 22:32 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-17 22:08 UTC (permalink / raw) To: Ingo Molnar Cc: William Lee Irwin III, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > > * William Lee Irwin III <wli@holomorphy.com> wrote: > > > [...] Also rest assured that the tone of the critique is not hostile, > > and wasn't meant to sound that way. > > ok :) (And i guess i was too touchy - sorry about coming out swinging.) > > > Also, given the general comments it appears clear that some > > statistical metric of deviation from the intended behavior furthermore > > qualified by timescale is necessary, so this appears to be headed > > toward a sort of performance metric as opposed to a pass/fail test > > anyway. However, to even measure this at all, some statement of > > intention is required. I'd prefer that there be a Linux-standard > > semantics for nice so results are more directly comparable and so that > > users also get similar nice behavior from the scheduler as it varies > > over time and possibly implementations if users should care to switch > > them out with some scheduler patch or other. > > yeah. If you could come up with a sane definition that also translates > into low overhead on the algorithm side that would be great! How's this: If you're running two identical CPU hog tasks A and B differing only by nice level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a constant f(Anice - Bnice). Other definitions make things hard to analyze and probably not well-bounded when confronted with > 2 tasks. I -think- this implies keeping a separate scaled CPU usage counter, where the scaling factor is a trivial exponential function of nice level where f(0) == 1. Then you schedule based on this scaled usage counter rather than unscaled. I also suspect we want to keep the exponential base small so that the maximal difference is 10x-100x. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
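As a rough illustration of the scaled-usage-counter idea described above, here is a small user-space sketch. The base of 1.25, the structures and the function names are invented for the example and are not taken from any posted scheduler; the only point is that charging each task weight-scaled usage and always running the least-charged task yields the constant cputime ratio asked for.

#include <math.h>
#include <stdio.h>

#define NTASKS 3

struct task {
        int nice;               /* -20 .. 19 */
        double scaled_usage;    /* CPU time charged, scaled by nice weight */
};

/* f(0) == 1; every nice level multiplies the charge by a constant base. */
static double nice_scale(int nice)
{
        return pow(1.25, nice);
}

/* Always run the task that has accumulated the least scaled usage. */
static struct task *pick_next(struct task *t, int n)
{
        struct task *best = &t[0];
        int i;

        for (i = 1; i < n; i++)
                if (t[i].scaled_usage < best->scaled_usage)
                        best = &t[i];
        return best;
}

int main(void)
{
        struct task tasks[NTASKS] = { { -5, 0 }, { 0, 0 }, { 5, 0 } };
        long got[NTASKS] = { 0 };
        long tick;
        int i;

        /* Three CPU hogs competing for one CPU, one unit of charge per tick. */
        for (tick = 0; tick < 1000000; tick++) {
                struct task *t = pick_next(tasks, NTASKS);

                t->scaled_usage += nice_scale(t->nice);
                got[t - tasks]++;
        }
        for (i = 0; i < NTASKS; i++)
                printf("nice %3d: %5.1f%% of the CPU\n",
                       tasks[i].nice, 100.0 * got[i] / 1000000.0);
        return 0;
}

Over a long run the ratio of CPU received by any two of the hogs converges to 1.25 raised to their nice difference, i.e. a function of (Anice - Bnice) only, which is exactly the property proposed above.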
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:08 ` Matt Mackall @ 2007-04-17 22:32 ` William Lee Irwin III 2007-04-17 22:39 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 22:32 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: >> yeah. If you could come up with a sane definition that also translates >> into low overhead on the algorithm side that would be great! On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote: > How's this: > If you're running two identical CPU hog tasks A and B differing only by nice > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a > constant f(Anice - Bnice). > Other definitions make things hard to analyze and probably not > well-bounded when confronted with > 2 tasks. > I -think- this implies keeping a separate scaled CPU usage counter, > where the scaling factor is a trivial exponential function of nice > level where f(0) == 1. Then you schedule based on this scaled usage > counter rather than unscaled. > I also suspect we want to keep the exponential base small so that the > maximal difference is 10x-100x. I'm already working with this as my assumed nice semantics (actually something with a specific exponential base, suggested in other emails) until others start saying they want something different and agree. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:32 ` William Lee Irwin III @ 2007-04-17 22:39 ` Matt Mackall 2007-04-17 22:59 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-17 22:39 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > >> yeah. If you could come up with a sane definition that also translates > >> into low overhead on the algorithm side that would be great! > > On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote: > > How's this: > > If you're running two identical CPU hog tasks A and B differing only by nice > > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a > > constant f(Anice - Bnice). > > Other definitions make things hard to analyze and probably not > > well-bounded when confronted with > 2 tasks. > > I -think- this implies keeping a separate scaled CPU usage counter, > > where the scaling factor is a trivial exponential function of nice > > level where f(0) == 1. Then you schedule based on this scaled usage > > counter rather than unscaled. > > I also suspect we want to keep the exponential base small so that the > > maximal difference is 10x-100x. > > I'm already working with this as my assumed nice semantics (actually > something with a specific exponential base, suggested in other emails) > until others start saying they want something different and agree. Good. This has a couple nice mathematical properties, including "bounded unfairness" which I mentioned earlier. What base are you looking at? -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:39 ` Matt Mackall @ 2007-04-17 22:59 ` William Lee Irwin III 2007-04-17 22:57 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 22:59 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: >> I'm already working with this as my assumed nice semantics (actually >> something with a specific exponential base, suggested in other emails) >> until others start saying they want something different and agree. On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote: > Good. This has a couple nice mathematical properties, including > "bounded unfairness" which I mentioned earlier. What base are you > looking at? I'm working with the following suggestion: On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > That value has the property that a nice=10 task gets 1/10th the cpu of a > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > would be fairly easy to explain to admins and users so that they can > know what to expect from nicing tasks. I'm not likely to write the testcase until this upcoming weekend, though. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
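For reference, a throwaway program to check the property claimed for that base; it contains nothing beyond the arithmetic in the quoted suggestion.

#include <math.h>
#include <stdio.h>

int main(void)
{
        double base = exp(log(10.0) / 10.0);    /* ~1.2589 */
        int nice;

        /* Relative CPU weight of a task at each nice level vs nice 0. */
        for (nice = -20; nice <= 20; nice += 10)
                printf("nice %3d: weight %8.2f\n", nice, pow(base, -nice));

        printf("nice 10 vs nice 0: %.2f\n", pow(base, -10));   /* 0.10 */
        printf("nice 20 vs nice 0: %.2f\n", pow(base, -20));   /* 0.01 */
        return 0;
}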
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:59 ` William Lee Irwin III @ 2007-04-17 22:57 ` Matt Mackall 2007-04-18 4:29 ` William Lee Irwin III 2007-04-18 7:29 ` James Bruce 0 siblings, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 22:57 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: > >> I'm already working with this as my assumed nice semantics (actually > >> something with a specific exponential base, suggested in other emails) > >> until others start saying they want something different and agree. > > On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote: > > Good. This has a couple nice mathematical properties, including > > "bounded unfairness" which I mentioned earlier. What base are you > > looking at? > > I'm working with the following suggestion: > > > On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > > That value has the property that a nice=10 task gets 1/10th the cpu of a > > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > > would be fairly easy to explain to admins and users so that they can > > know what to expect from nicing tasks. > > I'm not likely to write the testcase until this upcoming weekend, though. So that means there's a 10000:1 ratio between nice 20 and nice -19. In that sort of dynamic range, you're likely to have non-trivial numerical accuracy issues in integer/fixed-point math. (Especially if your clock is jiffies-scale, which a significant number of machines will continue to be.) I really think if we want to have vastly different ratios, we probably want to be looking at BATCH and RT scheduling classes instead. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
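To put a number on the dynamic-range worry above, here is a back-of-the-envelope sketch. The 32-bit counter, HZ=1000 and the 10000:1 weight spread are assumptions chosen to match the figures in the mail, not anyone's actual implementation; the point is simply how much headroom the range costs.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t max_weight = 10000;    /* ~nice -19 vs nice 20 */
        uint32_t hz = 1000;             /* jiffies-resolution clock */

        /* Charging scaled usage as (weight * jiffies), with the heaviest
         * weight at 10000: a 32-bit counter wraps within minutes. */
        uint32_t horizon = UINT32_MAX / (max_weight * hz);
        printf("32-bit counter wraps after ~%u s (~%u min)\n",
               horizon, horizon / 60);

        /* Normalising the other way round (heaviest task charged 1 per
         * jiffy) just moves the problem: the lightest task's charge of
         * 1/10000 per jiffy rounds to zero in integer math. */
        printf("light-end charge per jiffy, integer math: %u\n",
               1 / max_weight);
        return 0;
}

Either way the 10000:1 spread eats roughly 13-14 bits of precision, which is why 64-bit counters or extra fixed-point fraction bits come into the picture on jiffies-scale clocks.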
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:57 ` Matt Mackall @ 2007-04-18 4:29 ` William Lee Irwin III 2007-04-18 4:42 ` Davide Libenzi 2007-04-18 7:29 ` James Bruce 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 4:29 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: >>> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 >>> That value has the property that a nice=10 task gets 1/10th the cpu of a >>> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that >>> would be fairly easy to explain to admins and users so that they can >>> know what to expect from nicing tasks. On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: >> I'm not likely to write the testcase until this upcoming weekend, though. On Tue, Apr 17, 2007 at 05:57:23PM -0500, Matt Mackall wrote: > So that means there's a 10000:1 ratio between nice 20 and nice -19. In > that sort of dynamic range, you're likely to have non-trivial > numerical accuracy issues in integer/fixed-point math. > (Especially if your clock is jiffies-scale, which a significant number > of machines will continue to be.) > I really think if we want to have vastly different ratios, we probably > want to be looking at BATCH and RT scheduling classes instead. 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and even 1000**(1/39.0) ~= 1.19378 still seems weak. I suspect that in order to get low nice numbers strong enough without making high nice numbers too strong something sub-exponential may need to be used. Maybe just picking percentages outright as opposed to some particular function. We may also be better off defining it in terms of a share weighting as opposed to two tasks in competition. In such a manner the extension to N tasks is more automatic. f(n) would be a univariate function of nice numbers and two tasks in competition with nice numbers n_1 and n_2 would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In the exponential case f(n) = K*e**(r*n) this ends up as 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for other choices it's not so. f(n) = n+K for K >= 20 results in a share weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n when n <= 0 is highly plausible. An exponent or an additive constant may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21, and the ratio of shares is 420, which is still arithmeticaly feasible. -10 vs. 0 and 0 vs. 10 are both 10:1. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
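The piecewise weighting sketched above is easy to play with numerically. The following throwaway program just evaluates f(n) and the resulting two-task shares exactly as defined in the mail; the names are arbitrary.

#include <stdio.h>

/* f(n) = 1 - n for n <= 0, f(n) = 1/(n + 1) for n >= 0. */
static double f(int nice)
{
        return nice <= 0 ? 1.0 - nice : 1.0 / (nice + 1);
}

/* Two tasks in competition get shares proportional to f(nice). */
static void share(int n1, int n2)
{
        double w1 = f(n1), w2 = f(n2);

        printf("nice %3d vs nice %3d: %5.1f%% / %4.1f%% (ratio %.0f:1)\n",
               n1, n2, 100.0 * w1 / (w1 + w2), 100.0 * w2 / (w1 + w2),
               w1 / w2);
}

int main(void)
{
        share(-19, 20);         /* f(-19) = 20, f(20) = 1/21 */
        share(-10, 0);          /* f(-10) = 11, f(0) = 1 */
        share(0, 10);           /* f(0) = 1, f(10) = 1/11 */
        return 0;
}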
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 4:29 ` William Lee Irwin III @ 2007-04-18 4:42 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 4:42 UTC (permalink / raw) To: William Lee Irwin III Cc: Matt Mackall, Ingo Molnar, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and > even 1000**(1/39.0) ~= 1.19378 still seems weak. > > I suspect that in order to get low nice numbers strong enough without > making high nice numbers too strong something sub-exponential may need > to be used. Maybe just picking percentages outright as opposed to some > particular function. > > We may also be better off defining it in terms of a share weighting as > opposed to two tasks in competition. In such a manner the extension to > N tasks is more automatic. f(n) would be a univariate function of nice > numbers and two tasks in competition with nice numbers n_1 and n_2 > would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In > the exponential case f(n) = K*e**(r*n) this ends up as > 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for > other choices it's not so. f(n) = n+K for K >= 20 results in a share > weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear > in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n > when n <= 0 is highly plausible. An exponent or an additive constant > may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21, > and the ratio of shares is 420, which is still arithmeticaly feasible. > -10 vs. 0 and 0 vs. 10 are both 10:1. This makes more sense, and the ratio at the extremes is something reasonable. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:57 ` Matt Mackall 2007-04-18 4:29 ` William Lee Irwin III @ 2007-04-18 7:29 ` James Bruce 1 sibling, 0 replies; 304+ messages in thread From: James Bruce @ 2007-04-18 7:29 UTC (permalink / raw) To: linux-kernel; +Cc: ck Matt Mackall wrote: > On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: >> On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: >> I'm working with the following suggestion: >> >> On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: >>> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 >>> That value has the property that a nice=10 task gets 1/10th the cpu of a >>> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that >>> would be fairly easy to explain to admins and users so that they can >>> know what to expect from nicing tasks. >> I'm not likely to write the testcase until this upcoming weekend, though. > > So that means there's a 10000:1 ratio between nice 20 and nice -19. In > that sort of dynamic range, you're likely to have non-trivial > numerical accuracy issues in integer/fixed-point math. Well, you *are* specifying vastly different priorities. The question is how many nice=20 tasks should it take to interfere with a nice=-19 task? If you've only got a 100:1 ratio, 100 nice=20 tasks will take ~50% of the CPU away from a nice=-19 task. I don't think that's ideal, as in my mind a -19 task shouldn't have to care how many nice=20 tasks there are (within reason). IMHO, if a user is running a CPU hog at nice=-19, and expecting nice=20 tasks to run immediately, I don't think the scheduler is the problem. > (Especially if your clock is jiffies-scale, which a significant number > of machines will continue to be.) > > I really think if we want to have vastly different ratios, we probably > want to be looking at BATCH and RT scheduling classes instead. I, like all users, can live with anything, but there should be a clear specification of what the user should expect. Magic changes in the function at nice=0, or no real clear meaning at all (mainline), are both things that don't help the users to figure that out. I like the exponential base because shifting all tasks up or down one nice level does not change the relative cpu distribution (i.e. two tasks {nice=-5,nice=0} get the same relative cpu distribution as if they were {nice=0,nice=5}. An exponential base is the only way that property can hold. Now, perhaps implementation issues may prevent something like the "1.2589" ratio rule from being realized, but I'm not sure we should throw it out _before_ we know that it's actually a problem. This is the same sort of resistance that the timekeeping code updates faces (using nanoseconds everywhere instead of "natural" clock bases), but that got addressed eventually. - Jim Bruce ^ permalink raw reply [flat|nested] 304+ messages in thread
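The shift-invariance argument above is easy to verify numerically. This small check assumes the exp(ln(10)/10) base quoted earlier and per-task weights proportional to base^(-nice); with any such exponential weighting, only nice differences affect the split.

#include <math.h>
#include <stdio.h>

static double w(double base, int nice)
{
        return pow(base, -nice);
}

/* Print how two competing CPU hogs split the CPU under weight w(). */
static void split(double base, int n1, int n2)
{
        double w1 = w(base, n1), w2 = w(base, n2);

        printf("base %.4f, nice {%d, %d}: %.2f%% / %.2f%%\n",
               base, n1, n2,
               100.0 * w1 / (w1 + w2), 100.0 * w2 / (w1 + w2));
}

int main(void)
{
        double base = exp(log(10.0) / 10.0);    /* ~1.2589, as suggested */

        split(base, -5, 0);     /* prints the same split as ... */
        split(base, 0, 5);      /* ... this one */
        return 0;
}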
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:50 ` Davide Libenzi 2007-04-17 7:09 ` William Lee Irwin III @ 2007-04-17 7:11 ` Nick Piggin 2007-04-17 7:21 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:11 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > On Tue, 17 Apr 2007, Nick Piggin wrote: > > > > All things are not equal; they all have different properties. I like > > > > Exactly. So we have to explore those properties and evaluate performance > > (in all meanings of the word). That's only logical. > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) > And even this code does that pretty nicely. The deadline designs looks > good, although I think the final "key" calculation code will end up quite > different from what it looks now. > I would suggest to thoroughly test all your alternatives before deciding. > Some code and design may look very good and small at the beginning, but > when you start patching it to cover all the dark spots, you effectively > end up with another thing (in both design and code footprint). > About O(1), I never thought it was a must (besides a good marketing > material), and O(log(N)) *may* be just fine (to be verified, of course). To be clear, I'm not saying O(logN) itself is a big problem. Type plot [10:100] x with lines, log(x) with lines, 1 with lines into gnuplot. I was just trying to point out that we need to evalute things. Considering how long we've had this scheduler with its known deficiencies, let's pick a new one wisely. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:11 ` Nick Piggin @ 2007-04-17 7:21 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 7:21 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, Nick Piggin wrote: > To be clear, I'm not saying O(logN) itself is a big problem. Type > > plot [10:100] x with lines, log(x) with lines, 1 with lines Haha, Nick, I know what a log() looks like :) The Time Ring I posted as an example (which is nothing other than a ring-based bucket sort) keeps O(1) if you can concede some timer clustering. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
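For readers unfamiliar with the structure Davide refers to, here is a rough user-space sketch of a ring of deadline buckets. The slot width, ring size and names are invented for illustration and are not Davide's posted code; the point is that insertion hashes straight into a slot (O(1)), at the price of treating everything that lands in one slot as equivalent (the "timer clustering" trade-off).

#include <stdio.h>
#include <stddef.h>

#define RING_SLOTS      256             /* power of two */
#define SLOT_NS         1000000ULL      /* 1 ms of deadline per slot */

struct ring_entry {
        unsigned long long deadline;
        struct ring_entry *next;
};

struct time_ring {
        struct ring_entry *slot[RING_SLOTS];
        unsigned int hand;      /* slot the "clock hand" points at */
};

/* O(1): hash the deadline into a slot and push onto that bucket. */
static void ring_insert(struct time_ring *ring, struct ring_entry *e)
{
        unsigned int idx = (e->deadline / SLOT_NS) & (RING_SLOTS - 1);

        e->next = ring->slot[idx];
        ring->slot[idx] = e;
}

/* Advance the hand to the next non-empty bucket; the scan is bounded by
 * the ring size, not by the number of queued entries. */
static struct ring_entry *ring_pop(struct time_ring *ring)
{
        unsigned int i;

        for (i = 0; i < RING_SLOTS; i++) {
                unsigned int idx = (ring->hand + i) & (RING_SLOTS - 1);
                struct ring_entry *e = ring->slot[idx];

                if (e) {
                        ring->slot[idx] = e->next;
                        ring->hand = idx;
                        return e;
                }
        }
        return NULL;    /* ring empty */
}

int main(void)
{
        static struct time_ring ring;
        struct ring_entry a = { .deadline = 3 * SLOT_NS };
        struct ring_entry b = { .deadline = 1 * SLOT_NS };

        ring_insert(&ring, &a);
        ring_insert(&ring, &b);
        printf("first due: %llu ns\n", ring_pop(&ring)->deadline);
        return 0;
}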
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau 2007-04-17 6:09 ` William Lee Irwin III @ 2007-04-17 6:23 ` Peter Williams 2007-04-17 6:44 ` Nick Piggin 2007-04-17 8:44 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 6:23 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: >>>> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: >>>>> Mike Galbraith wrote: >>>>>> Demystify what? The casual observer need only read either your attempt >>>>>> at writing a scheduler, or my attempts at fixing the one we have, to see >>>>>> that it was high time for someone with the necessary skills to step in. >>>>> Make that "someone with the necessary clout". >>>> No, I was brutally honest to both of us, but quite correct. >>>> >>>>>> Now progress can happen, which was _not_ happening before. >>>>>> >>>>> This is true. >>>> Yup, and progress _is_ happening now, quite rapidly. >>> Progress as in progress on Ingo's scheduler. I still don't know how we'd >>> decide when to replace the mainline scheduler or with what. >>> >>> I don't think we can say Ingo's is better than the alternatives, can we? >>> If there is some kind of bakeoff, then I'd like one of Con's designs to >>> be involved, and mine, and Peter's... >> I myself was thinking of this as the chance for a much needed >> simplification of the scheduling code and if this can be done with the >> result being "reasonable" it then gives us the basis on which to propose >> improvements based on the ideas of others such as you mention. >> >> As the size of the cpusched indicates, trying to evaluate alternative >> proposals based on the current O(1) scheduler is fraught. Hopefully, > > I don't know why. The problem is that you can't really evaluate good > proposals by looking at the code (you can say that one is bad, ie. the > current one, which has a huge amount of temporal complexity and is > explicitly unfair), but it is pretty hard to say one behaves well. I meant that it's indicative of the amount of work that you have to do to implement a new scheduling discipline for evaluation. > > And my scheduler for example cuts down the amount of policy code and > code size significantly. Yours is one of the smaller patches mainly because you perpetuate (or you did in the last one I looked at) the (horrible to my eyes) dual array (active/expired) mechanism. That this idea was bad should have been apparent to all as soon as the decision was made to excuse some tasks from being moved from the active array to the expired array. This essentially meant that there would be circumstances where extreme unfairness (to the extent of starvation in some cases) could occur -- the very things that the mechanism was originally designed to prevent (as far as I can gather). Right about then in the development of the O(1) scheduler alternative solutions should have been sought. Other hints that it was a bad idea were the need to transfer time slices between children and parents during fork() and exit().
This disregard for the dual array mechanism has prevented me from looking at the rest of your scheduler in any great detail so I can't comment on any other ideas that may be in there. > I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... I like Con's scheduler (partly because it uses a single array) but mainly because it's nice and simple. However, his earlier schedulers were prone to starvation (admittedly, only if you went out of your way to make it happen) and I tried to convince him to use the anti-starvation mechanism in my SPA schedulers but was unsuccessful. I haven't looked at his latest scheduler that sparked all this furore so can't comment on it. > > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? In the highly unlikely event that you can't separate them on technical grounds, Occam's razor recommends choosing the simplest solution. :-) To digress, my main concern is that load balancing is being lumped in with this new change. It's becoming "accept this big lump of new code or nothing". I'd rather see a good fix to the intra runqueue/CPU scheduler problem implemented first and then, if there really are any outstanding problems with the load balancer, attack them later. Them all being mixed up together gives me a nasty deja vu of impending disaster. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:23 ` Peter Williams @ 2007-04-17 6:44 ` Nick Piggin 2007-04-17 7:48 ` Peter Williams 2007-04-17 8:44 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:44 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >And my scheduler for example cuts down the amount of policy code and > >code size significantly. > > Yours is one of the smaller patches mainly because you perpetuate (or > you did in the last one I looked at) the (horrible to my eyes) dual > array (active/expired) mechanism. Actually, I wasn't comparing with other out of tree schedulers (but it is good to know mine is among the smaller ones). I was comparing with the mainline scheduler, which also has the dual arrays. > That this idea was bad should have > been apparent to all as soon as the decision was made to excuse some > tasks from being moved from the active array to the expired array. This My patch doesn't implement any such excusing. > essentially meant that there would be circumstances where extreme > unfairness (to the extent of starvation in some cases) -- the very > things that the mechanism was originally designed to ensure (as far as I > can gather). Right about then in the development of the O(1) scheduler > alternative solutions should have been sought. Fairness has always been my first priority, and I consider it a bug if it is possible for any process to get more CPU time than a CPU hog over the long term. Or over another task doing the same thing, for that matter. > Other hints that it was a bad idea was the need to transfer time slices > between children and parents during fork() and exit(). I don't see how that has anything to do with dual arrays. If you put a new child at the back of the queue, then your various interactive shell commands that typically do a lot of dependant forking get slowed right down behind your compile job. If you give a new child its own timeslice irrespective of the parent, then you have things like 'make' (which doesn't use a lot of CPU time) spawning off lots of high priority children. You need to do _something_ (Ingo's does). I don't see why this would be tied with a dual array. FWIW, mine doesn't do anything on exit() like most others, but it may need more tuning in this area. > This disregard for the dual array mechanism has prevented me from > looking at the rest of your scheduler in any great detail so I can't > comment on any other ideas that may be in there. Well I wasn't really asking you to review it. As I said, everyone has their own idea of what a good design does, and review can't really distinguish between the better of two reasonable designs. A fair evaluation of the alternatives seems like a good idea though. Nobody is actually against this, are they? > >I haven't looked at Con's ones for a while, > >but I believe they are also much more straightforward than mainline... > > I like Con's scheduler (partly because it uses a single array) but > mainly because it's nice and simple. However, his earlier schedulers > were prone to starvation (admittedly, only if you went out of your way > to make it happen) and I tried to convince him to use the anti > starvation mechanism in my SPA schedulers but was unsuccessful. 
I > haven't looked at his latest scheduler that sparked all this furore so > can't comment on it. I agree starvation or unfairness is unacceptable for a new scheduler. > >For example, let's say all else is equal between them, then why would > >we go with the O(logN) implementation rather than the O(1)? > > In the highly unlikely event that you can't separate them on technical > grounds, Occam's razor recommends choosing the simplest solution. :-) O(logN) vs O(1) is technical grounds. But yeah, see my earlier comment: simplicity would be a factor too. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:44 ` Nick Piggin @ 2007-04-17 7:48 ` Peter Williams 2007-04-17 7:56 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 7:48 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> And my scheduler for example cuts down the amount of policy code and >>> code size significantly. >> Yours is one of the smaller patches mainly because you perpetuate (or >> you did in the last one I looked at) the (horrible to my eyes) dual >> array (active/expired) mechanism. > > Actually, I wasn't comparing with other out of tree schedulers (but it > is good to know mine is among the smaller ones). I was comparing with > the mainline scheduler, which also has the dual arrays. > > >> That this idea was bad should have >> been apparent to all as soon as the decision was made to excuse some >> tasks from being moved from the active array to the expired array. This > > My patch doesn't implement any such excusing. > > >> essentially meant that there would be circumstances where extreme >> unfairness (to the extent of starvation in some cases) -- the very >> things that the mechanism was originally designed to ensure (as far as I >> can gather). Right about then in the development of the O(1) scheduler >> alternative solutions should have been sought. > > Fairness has always been my first priority, and I consider it a bug > if it is possible for any process to get more CPU time than a CPU hog > over the long term. Or over another task doing the same thing, for > that matter. > > >> Other hints that it was a bad idea was the need to transfer time slices >> between children and parents during fork() and exit(). > > I don't see how that has anything to do with dual arrays. It's totally to do with the dual arrays. The only real purpose of the time slice in O(1) (regardless of what its perceived purpose was) was to control the switching between the arrays. > If you put > a new child at the back of the queue, then your various interactive > shell commands that typically do a lot of dependant forking get slowed > right down behind your compile job. If you give a new child its own > timeslice irrespective of the parent, then you have things like 'make' > (which doesn't use a lot of CPU time) spawning off lots of high > priority children. This is an artefact of trying to control nice using time slices while using them for controlling array switching and whatever else they were being used for. Priority (static and dynamic) is the the best way to implement nice. > > You need to do _something_ (Ingo's does). I don't see why this would > be tied with a dual array. FWIW, mine doesn't do anything on exit() > like most others, but it may need more tuning in this area. > > >> This disregard for the dual array mechanism has prevented me from >> looking at the rest of your scheduler in any great detail so I can't >> comment on any other ideas that may be in there. > > Well I wasn't really asking you to review it. As I said, everyone > has their own idea of what a good design does, and review can't really > distinguish between the better of two reasonable designs. > > A fair evaluation of the alternatives seems like a good idea though. 
> Nobody is actually against this, are they? No. It would be nice if the basic ideas that each scheduler tries to implement could be extracted and explained though. This could lead to a melding of ideas that leads to something quite good. > > >>> I haven't looked at Con's ones for a while, >>> but I believe they are also much more straightforward than mainline... >> I like Con's scheduler (partly because it uses a single array) but >> mainly because it's nice and simple. However, his earlier schedulers >> were prone to starvation (admittedly, only if you went out of your way >> to make it happen) and I tried to convince him to use the anti >> starvation mechanism in my SPA schedulers but was unsuccessful. I >> haven't looked at his latest scheduler that sparked all this furore so >> can't comment on it. > > I agree starvation or unfairness is unacceptable for a new scheduler. > > >>> For example, let's say all else is equal between them, then why would >>> we go with the O(logN) implementation rather than the O(1)? >> In the highly unlikely event that you can't separate them on technical >> grounds, Occam's razor recommends choosing the simplest solution. :-) > > O(logN) vs O(1) is technical grounds. In that case I'd go O(1) provided that the k factor for the O(1) wasn't greater than O(logN)'s k factor multiplied by logMaxN. > > But yeah, see my earlier comment: simplicity would be a factor too. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:48 ` Peter Williams @ 2007-04-17 7:56 ` Nick Piggin 2007-04-17 13:16 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:56 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >>Other hints that it was a bad idea was the need to transfer time slices > >>between children and parents during fork() and exit(). > > > >I don't see how that has anything to do with dual arrays. > > It's totally to do with the dual arrays. The only real purpose of the > time slice in O(1) (regardless of what its perceived purpose was) was to > control the switching between the arrays. The O(1) design is pretty convoluted in that regard. In my scheduler, the only purpose of the arrays is to renew time slices. The fork/exit logic is added to make interactivity better. Ingo's scheduler has similar equivalent logic. > >If you put > >a new child at the back of the queue, then your various interactive > >shell commands that typically do a lot of dependant forking get slowed > >right down behind your compile job. If you give a new child its own > >timeslice irrespective of the parent, then you have things like 'make' > >(which doesn't use a lot of CPU time) spawning off lots of high > >priority children. > > This is an artefact of trying to control nice using time slices while > using them for controlling array switching and whatever else they were > being used for. Priority (static and dynamic) is the the best way to > implement nice. I don't like the timeslice based nice in mainline. It's too nasty with latencies. nicksched is far better in that regard IMO. But I don't know how you can assert a particular way is the best way to do something. > >You need to do _something_ (Ingo's does). I don't see why this would > >be tied with a dual array. FWIW, mine doesn't do anything on exit() > >like most others, but it may need more tuning in this area. > > > > > >>This disregard for the dual array mechanism has prevented me from > >>looking at the rest of your scheduler in any great detail so I can't > >>comment on any other ideas that may be in there. > > > >Well I wasn't really asking you to review it. As I said, everyone > >has their own idea of what a good design does, and review can't really > >distinguish between the better of two reasonable designs. > > > >A fair evaluation of the alternatives seems like a good idea though. > >Nobody is actually against this, are they? > > No. It would be nice if the basic ideas that each scheduler tries to > implement could be extracted and explained though. This could lead to a > melding of ideas that leads to something quite good. > > > > > > >>>I haven't looked at Con's ones for a while, > >>>but I believe they are also much more straightforward than mainline... > >>I like Con's scheduler (partly because it uses a single array) but > >>mainly because it's nice and simple. However, his earlier schedulers > >>were prone to starvation (admittedly, only if you went out of your way > >>to make it happen) and I tried to convince him to use the anti > >>starvation mechanism in my SPA schedulers but was unsuccessful. I > >>haven't looked at his latest scheduler that sparked all this furore so > >>can't comment on it. 
> > > >I agree starvation or unfairness is unacceptable for a new scheduler. > > > > > >>>For example, let's say all else is equal between them, then why would > >>>we go with the O(logN) implementation rather than the O(1)? > >>In the highly unlikely event that you can't separate them on technical > >>grounds, Occam's razor recommends choosing the simplest solution. :-) > > > >O(logN) vs O(1) is technical grounds. > > In that case I'd go O(1) provided that the k factor for the O(1) wasn't > greater than O(logN)'s k factor multiplied by logMaxN. Yes, or even significantly greater around typical large sizes of N. ^ permalink raw reply [flat|nested] 304+ messages in thread
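A quick sketch of the break-even arithmetic Peter and Nick settle on above: prefer the O(1) design only while its constant cost stays below the O(log N) design's constant multiplied by log2 of the largest N of interest. The constants and the threads-max style figure below are illustrative assumptions, not measurements.

#include <math.h>
#include <stdio.h>

/* cost(O(1) pick) ~= k_o1;  cost(O(log N) pick) ~= k_logn * log2(N) */
static int prefer_o1(double k_o1, double k_logn, double n_max)
{
	return k_o1 <= k_logn * log2(n_max);
}

int main(void)
{
	double k_o1   = 120.0;	/* assumed constant cost of the O(1) pick */
	double k_logn = 25.0;	/* assumed per-level cost of the tree pick */
	double n_max  = 32768;	/* e.g. a threads-max style bound for N */

	printf("prefer O(1): %s\n",
	       prefer_o1(k_o1, k_logn, n_max) ? "yes" : "no");
	return 0;
}

With these made-up numbers the O(log N) design wins; flip the constants and the answer flips, which is the point about typical large sizes of N.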
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Nick Piggin @ 2007-04-17 13:16 ` Peter Williams 2007-04-18 4:46 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 13:16 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>>> Other hints that it was a bad idea was the need to transfer time slices >>>> between children and parents during fork() and exit(). >>> I don't see how that has anything to do with dual arrays. >> It's totally to do with the dual arrays. The only real purpose of the >> time slice in O(1) (regardless of what its perceived purpose was) was to >> control the switching between the arrays. > > The O(1) design is pretty convoluted in that regard. In my scheduler, > the only purpose of the arrays is to renew time slices. > > The fork/exit logic is added to make interactivity better. Ingo's > scheduler has similar equivalent logic. > > >>> If you put >>> a new child at the back of the queue, then your various interactive >>> shell commands that typically do a lot of dependant forking get slowed >>> right down behind your compile job. If you give a new child its own >>> timeslice irrespective of the parent, then you have things like 'make' >>> (which doesn't use a lot of CPU time) spawning off lots of high >>> priority children. >> This is an artefact of trying to control nice using time slices while >> using them for controlling array switching and whatever else they were >> being used for. Priority (static and dynamic) is the the best way to >> implement nice. > > I don't like the timeslice based nice in mainline. It's too nasty > with latencies. nicksched is far better in that regard IMO. > > But I don't know how you can assert a particular way is the best way > to do something. I should have added "I may be wrong but I think that ...". My opinion is based on a lot of experience with different types of scheduler design and the observation from gathering scheduling statistics while playing with these schedulers that the size of the time slices we're talking about is much larger than the CPU chunks most tasks use in any one go so time slice size has no real effect on most tasks and the faster CPUs become the more this becomes true. > > >>> You need to do _something_ (Ingo's does). I don't see why this would >>> be tied with a dual array. FWIW, mine doesn't do anything on exit() >>> like most others, but it may need more tuning in this area. >>> >>> >>>> This disregard for the dual array mechanism has prevented me from >>>> looking at the rest of your scheduler in any great detail so I can't >>>> comment on any other ideas that may be in there. >>> Well I wasn't really asking you to review it. As I said, everyone >>> has their own idea of what a good design does, and review can't really >>> distinguish between the better of two reasonable designs. >>> >>> A fair evaluation of the alternatives seems like a good idea though. >>> Nobody is actually against this, are they? >> No. It would be nice if the basic ideas that each scheduler tries to >> implement could be extracted and explained though. This could lead to a >> melding of ideas that leads to something quite good. 
>> >>> >>>>> I haven't looked at Con's ones for a while, >>>>> but I believe they are also much more straightforward than mainline... >>>> I like Con's scheduler (partly because it uses a single array) but >>>> mainly because it's nice and simple. However, his earlier schedulers >>>> were prone to starvation (admittedly, only if you went out of your way >>>> to make it happen) and I tried to convince him to use the anti >>>> starvation mechanism in my SPA schedulers but was unsuccessful. I >>>> haven't looked at his latest scheduler that sparked all this furore so >>>> can't comment on it. >>> I agree starvation or unfairness is unacceptable for a new scheduler. >>> >>> >>>>> For example, let's say all else is equal between them, then why would >>>>> we go with the O(logN) implementation rather than the O(1)? >>>> In the highly unlikely event that you can't separate them on technical >>>> grounds, Occam's razor recommends choosing the simplest solution. :-) >>> O(logN) vs O(1) is technical grounds. >> In that case I'd go O(1) provided that the k factor for the O(1) wasn't >> greater than O(logN)'s k factor multiplied by logMaxN. > > Yes, or even significantly greater around typical large sizes of N. Yes. In fact its' probably better to use the maximum number of threads allowed on the system for N. We know that value don't we? Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:16 ` Peter Williams @ 2007-04-18 4:46 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 4:46 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:16:54PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >I don't like the timeslice based nice in mainline. It's too nasty > >with latencies. nicksched is far better in that regard IMO. > > > >But I don't know how you can assert a particular way is the best way > >to do something. > > I should have added "I may be wrong but I think that ...". > > My opinion is based on a lot of experience with different types of > scheduler design and the observation from gathering scheduling > statistics while playing with these schedulers that the size of the time > slices we're talking about is much larger than the CPU chunks most tasks > use in any one go so time slice size has no real effect on most tasks > and the faster CPUs become the more this becomes true. For desktop loads, maybe. But for things that are compute bound, the cost of context switching I believe still gets worse as CPUs continue to be able to execute more instructions per cycle, get clocked faster, and get larger caches. > >>In that case I'd go O(1) provided that the k factor for the O(1) wasn't > >>greater than O(logN)'s k factor multiplied by logMaxN. > > > >Yes, or even significantly greater around typical large sizes of N. > > Yes. In fact its' probably better to use the maximum number of threads > allowed on the system for N. We know that value don't we? Well we might be able to work it out by looking at the tunables or amount of kernel memory available, but I guess it is hard to just pick a number. I'll try running a few more benchmarks. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:23 ` Peter Williams 2007-04-17 6:44 ` Nick Piggin @ 2007-04-17 8:44 ` Ingo Molnar 2007-04-19 2:20 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 8:44 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Peter Williams <pwil3058@bigpond.net.au> wrote: > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. > > Yours is one of the smaller patches mainly because you perpetuate (or > you did in the last one I looked at) the (horrible to my eyes) dual > array (active/expired) mechanism. That this idea was bad should have > been apparent to all as soon as the decision was made to excuse some > tasks from being moved from the active array to the expired array. > This essentially meant that there would be circumstances where extreme > unfairness (to the extent of starvation in some cases) -- the very > things that the mechanism was originally designed to ensure (as far as > I can gather). Right about then in the development of the O(1) > scheduler alternative solutions should have been sought. in hindsight i'd agree. But back then we were clearly not ready for fine-grained accurate statistics + trees (cpus are alot faster at more complex arithmetics today, plus people still believed that low-res can be done well enough), and taking out any of these two concepts from CFS would result in a similarly complex runqueue implementation. Also, the array switch was just thought to be of another piece of 'if the heuristics go wrong, we fall back to an array switch' logic, right in line with the other heuristics. And you have to accept it, mainline's ability to auto-renice make -j jobs (and other CPU hogs) was quite a plus for developers, so it had (and probably still has) quite some inertia. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
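For readers who have not stared at the O(1) code, a much-simplified sketch of the active/expired dual-array scheme being criticised above; the structure and function names are assumptions for illustration, not the actual kernel code.

#define NUM_PRIO 140

struct task {
	int prio;
	int time_slice;
	struct task *next;		/* per-priority singly linked list */
};

struct prio_array {
	struct task *queue[NUM_PRIO];
};

struct runqueue {
	struct prio_array arrays[2];
	struct prio_array *active, *expired;
};

static void enqueue(struct prio_array *a, struct task *t)
{
	t->next = a->queue[t->prio];
	a->queue[t->prio] = t;
}

/* A task that exhausts its slice gets a fresh one and is parked on the
 * expired array instead of going straight back onto the active one. */
void tick_expire(struct runqueue *rq, struct task *t, int new_slice)
{
	t->time_slice = new_slice;
	enqueue(rq->expired, t);
}

/* The O(1) "array switch": only when the active array has run dry do the
 * two pointers swap.  This is where the fairness guarantee is meant to
 * come from -- and why exempting some tasks from expiry undermines it. */
void array_switch(struct runqueue *rq)
{
	struct prio_array *tmp = rq->active;

	rq->active  = rq->expired;
	rq->expired = tmp;
}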
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:44 ` Ingo Molnar @ 2007-04-19 2:20 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-19 2:20 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Peter Williams <pwil3058@bigpond.net.au> wrote: > >>> And my scheduler for example cuts down the amount of policy code and >>> code size significantly. >> Yours is one of the smaller patches mainly because you perpetuate (or >> you did in the last one I looked at) the (horrible to my eyes) dual >> array (active/expired) mechanism. That this idea was bad should have >> been apparent to all as soon as the decision was made to excuse some >> tasks from being moved from the active array to the expired array. >> This essentially meant that there would be circumstances where extreme >> unfairness (to the extent of starvation in some cases) -- the very >> things that the mechanism was originally designed to ensure (as far as >> I can gather). Right about then in the development of the O(1) >> scheduler alternative solutions should have been sought. > > in hindsight i'd agree. Hindsight's a wonderful place isn't it :-) and, of course, it's where I was making my comments from. > But back then we were clearly not ready for > fine-grained accurate statistics + trees (cpus are alot faster at more > complex arithmetics today, plus people still believed that low-res can > be done well enough), and taking out any of these two concepts from CFS > would result in a similarly complex runqueue implementation. I disagree. The single priority array with a promotion mechanism that I use in the SPA schedulers can do the job of avoiding starvation with no measurable increase in the overhead. Fairness, nice, good interactive responsiveness can then be managed by how you determine tasks' dynamic priorities. > Also, the > array switch was just thought to be of another piece of 'if the > heuristics go wrong, we fall back to an array switch' logic, right in > line with the other heuristics. And you have to accept it, mainline's > ability to auto-renice make -j jobs (and other CPU hogs) was quite a > plus for developers, so it had (and probably still has) quite some > inertia. I agree, it wasn't totally useless especially for the average user. My main problem with it was that the effect of "nice" wasn't consistent or predictable enough for reliable resource allocation. I also agree with the aims of the various heuristics i.e. you have to be unfair and give some tasks preferential treatment in order to give the users the type of responsiveness that they want. It's just a shame that it got broken in the process but as you say it's easier to see these things in hindsight than in the middle of the melee. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
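A minimal sketch of the "single priority array plus promotion" anti-starvation idea Peter describes above, assuming a fixed promotion interval; the names and numbers are placeholders for illustration, not the SPA code.

#include <stddef.h>

#define NUM_PRIO	40
#define PROMOTE_TICKS	100	/* assumed promotion interval */

struct task {
	int prio;
	struct task *next;
};

struct runqueue {
	struct task *queue[NUM_PRIO];	/* index 0 = highest priority */
	unsigned long ticks;
};

/* Every PROMOTE_TICKS ticks, shift each non-empty list one slot towards
 * the head of the array, so long-waiting low-priority tasks drift upward
 * one level at a time instead of starving behind CPU hogs. */
void maybe_promote(struct runqueue *rq)
{
	int p;

	if (++rq->ticks % PROMOTE_TICKS)
		return;

	for (p = 1; p < NUM_PRIO; p++) {
		struct task *list = rq->queue[p], *t;

		if (!list)
			continue;
		rq->queue[p] = NULL;
		if (!rq->queue[p - 1]) {
			rq->queue[p - 1] = list;
		} else {
			for (t = rq->queue[p - 1]; t->next; t = t->next)
				;
			t->next = list;
		}
	}
}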
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` Con Kolivas 2007-04-15 5:16 ` Bill Huey 2007-04-15 6:43 ` Mike Galbraith @ 2007-04-15 15:05 ` Ingo Molnar 2007-04-15 20:05 ` Matt Mackall 2007-04-16 5:16 ` Con Kolivas 2 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 15:05 UTC (permalink / raw) To: Con Kolivas Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Con Kolivas <kernel@kolivas.org> wrote: [ i'm quoting this bit out of order: ] > 2. Since then I've been thinking/working on a cpu scheduler design > that takes away all the guesswork out of scheduling and gives very > predictable, as fair as possible, cpu distribution and latency while > preserving as solid interactivity as possible within those confines. yeah. I think you were right on target with this call. I've applied the sched.c change attached at the bottom of this mail to the CFS patch, if you dont mind. (or feel free to suggest some other text instead.) > 1. I tried in vain some time ago to push a working extensable > pluggable cpu scheduler framework (based on wli's work) for the linux > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he > didn't like it) as being absolutely the wrong approach and that we > should never do that. [...] i partially replied to that point to Will already, and i'd like to make it clear again: yes, i rejected plugsched 2-3 years ago (which already drifted away from wli's original codebase) and i would still reject it today. First and foremost, please dont take such rejections too personally - i had my own share of rejections (and in fact, as i mentioned it in a previous mail, i had a fair number of complete project throwaways: 4g:4g, in-kernel Tux, irqrate and many others). I know that they can hurt and can demoralize, but if i dont like something it's my job to tell that. Can i sum up your argument as: "you rejected plugsched, but then why on earth did you modularize portions of the scheduler in CFS? Isnt your position thus woefully inconsistent?" (i'm sure you would never put it this impolitely though, but i guess i can flame myself with impunity ;) While having an inconsistent position isnt a terminal sin in itself, please realize that the scheduler classes code in CFS is quite different from plugsched: it was a result of what i saw to be technological pressure for _internal modularization_. (This internal/policy modularization aspect is something that Will said was present in his original plugsched code, but which aspect i didnt see in the plugsched patches that i reviewed.) That possibility never even occured to me to until 3 days ago. You never raised it either AFAIK. No patches to simplify the scheduler that way were ever sent. Plugsched doesnt even touch the core load-balancer for example, and most of the time i spent with the modularization was to get the load-balancing details right. So it's really apples to oranges. 
My view about plugsched: first please take a look at the latest plugsched code: http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch 26 files changed, 8951 insertions(+), 1495 deletions(-) As an experiment i've removed all the add-on schedulers (both the core and the include files, only kept the vanilla one) from the plugsched patch (and the makefile and kconfig complications, etc), to see the 'infrastructure cost', and it still gave: 12 files changed, 1933 insertions(+), 1479 deletions(-) that's the extra complication i didnt like 3 years ago and which i still dont like today. What the current plugsched code does is that it simplifies the adding of new experimental schedulers, but it doesnt really do what i wanted: to simplify the _scheduler itself_. Personally i'm still not primarily interested in having a large selection of schedulers, i'm mainly interested in a good and maintainable scheduler that works for people. so the rejection was on these grounds, and i still very much stand by that position here and today: i didnt want to see the Linux scheduler landscape balkanized and i saw no technological reasons for the complication that external modularization brings. the new scheding classes code in the CFS patch was not a result of "oh, i want to write a new scheduler, lets make schedulers pluggable" kind of thinking. That result was just a side-effect of it. (and as you correctly noted it, the CFS related modularization is incomplete). Btw., the thing that triggered the scheduling classes code wasnt even plugsched or RSDL/SD, it was Mike's patches. Mike had an itch and he fixed it within the framework of the existing scheduler, and the end result behaved quite well when i threw various testloads on it. But i felt a bit uncomfortable that it added another few hundred lines of code to an already complex sched.c. This felt unnatural so i mailed Mike that i'd attempt to clean these infrastructure aspects of sched.c up a bit so that it becomes more hackable to him. Thus 3 days ago, without having made up my mind about anything, i started this experiment (which ended up in the modularization and in the CFS scheduler) to simplify the code and to enable Mike to fix such itches in an easier way. By your logic Mike should in fact be quite upset about this: if the new code works out and proves to be useful then it obsoletes a whole lot of code of him! > For weeks now, Ingo has said that the interactivity regressions were > showstoppers and we should address them, never mind the fact that the > so-called regressions were purely "it slows down linearly with load" > which to me is perfectly desirable behaviour. [...] yes. For me the first thing when considering a large scheduler patch is: "does a patch do what it claims" and "does it work". If those goals are met (and if it's a complete scheduler i actually try it quite extensively) then i look at the code cleanliness issues. Mike's patch was the first one that seemed to meet that threshold in my own humble testing, and CFS was a direct result of that. note that i tried the same workloads with CFS and while it wasnt as good as mainline, it handled them better than SD. Mike reported the same, and Mark Lord (who too reported SD interactivity problems) reported success yesterday too. (but .. CFS is a mere 2 days old so we cannot really tell anything with certainty yet.) > [...] However at one stage I virtually begged for support with my > attempts and help with the code. 
Dmitry Adamushko is the only person > who actually helped me with the code in the interim, while others > poked sticks at it. Sure the sticks helped at times but the sticks > always seemed to have their ends kerosene doused and flaming for > reasons I still don't get. No other help was forthcoming. i'm really sorry you got that impression. in 2004 i had a good look at the staircase scheduler and said: http://www.uwsg.iu.edu/hypermail/linux/kernel/0408.0/1146.html "But in general i'm quite positive about the staircase scheduler." and even tested it and gave you feedback: http://lwn.net/Articles/96562/ i think i even told Andrew that i dont really like pluggable schedulers and if there's any replacement for the current scheduler then that would be a full replacement, and it would be the staircase scheduler. Hey, i told this to you as recently as 1 month ago as well: http://lkml.org/lkml/2007/3/8/54 "cool! I like this even more than i liked your original staircase scheduler from 2 years ago :)" Ingo -----------> Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -16,6 +16,7 @@ * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. * 2004-04-02 Scheduler domains code by Nick Piggin + * 2007-04-15 Con Kolivas was dead right: fairness matters! :) */ ^ permalink raw reply [flat|nested] 304+ messages in thread
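Roughly the shape of the "scheduling classes" internal modularization being argued about in this subthread: the core dispatches through a small per-policy table of hooks rather than hard-coding one policy. The hook names below are an approximation for illustration and are not lifted from the CFS patch.

#include <stddef.h>

struct rq;
struct task_struct;

struct sched_class {
	void (*enqueue_task)(struct rq *rq, struct task_struct *p);
	void (*dequeue_task)(struct rq *rq, struct task_struct *p);
	struct task_struct *(*pick_next_task)(struct rq *rq);
	void (*task_tick)(struct rq *rq, struct task_struct *p);
};

/* The core asks each class in a fixed priority order (e.g. realtime
 * before the fair class); the first class with a runnable task wins. */
struct task_struct *core_pick_next(struct rq *rq,
				   const struct sched_class **classes,
				   int nr_classes)
{
	int i;

	for (i = 0; i < nr_classes; i++) {
		struct task_struct *p = classes[i]->pick_next_task(rq);

		if (p)
			return p;
	}
	return NULL;	/* nothing runnable: idle */
}

The distinction being drawn above is that such a table lives inside the core for the core's own cleanliness, rather than being an externally selectable scheduler in the plugsched sense.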
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:05 ` Ingo Molnar @ 2007-04-15 20:05 ` Matt Mackall 2007-04-15 20:48 ` Ingo Molnar 2007-04-16 5:16 ` Con Kolivas 1 sibling, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-15 20:05 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 05:05:36PM +0200, Ingo Molnar wrote: > so the rejection was on these grounds, and i still very much stand by > that position here and today: i didnt want to see the Linux scheduler > landscape balkanized and i saw no technological reasons for the > complication that external modularization brings. But "balkanization" is a good thing. "Monoculture" is a bad thing. Look at what happened with I/O scheduling. Opening things up to some new ideas by making it possible to select your I/O scheduler took us from 10 years of stagnation to healthy, competitive development, which gave us a substantially better I/O scheduler. Look at what's happening right now with TCP congestion algorithms. We've had decades of tweaking Reno slightly now turned into a vibrant research area with lots of radical alternatives. A winner will eventually emerge and it will probably look quite a bit different than Reno. Similar things have gone on since the beginning with filesystems on Linux. Being able to easily compare filesystems head to head has been immensely valuable in improving our 'core' Linux filesystems. And what we've had up to now is a scheduler monoculture. Until Andrew put RSDL in -mm, if people wanted to experiment with other schedulers, they had to go well off the beaten path to do it. So all the people who've been hopelessy frustrated with the mainline scheduler go off to the -ck ghetto, or worse, stick with 2.4. Whether your motivations have been protectionist or merely shortsighted, you've stomped pretty heavily on alternative scheduler development by completely rejecting the whole plugsched concept. If we'd opened up mainline to a variety of schedulers _3 years ago_, we'd probably have gotten to where we are today much sooner. Hopefully, the next time Rik suggests pluggable page replacement algorithms, folks will actually seriously consider it. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 20:05 ` Matt Mackall @ 2007-04-15 20:48 ` Ingo Molnar 2007-04-15 21:31 ` Matt Mackall 2007-04-15 23:39 ` William Lee Irwin III 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 20:48 UTC (permalink / raw) To: Matt Mackall Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Matt Mackall <mpm@selenic.com> wrote: > Look at what happened with I/O scheduling. Opening things up to some > new ideas by making it possible to select your I/O scheduler took us > from 10 years of stagnation to healthy, competitive development, which > gave us a substantially better I/O scheduler. actually, 2-3 years ago we already had IO schedulers, and my opinion against plugsched back then (also shared by Nick and Linus) was very much considering them. There are at least 4 reasons why I/O schedulers are different from CPU schedulers: 1) CPUs are a non-persistent resource shared by _all_ tasks and workloads in the system. Disks are _persistent_ resources very much attached to specific workloads. (If tasks had to be 'persistent' to the CPU they were started on we'd have much different scheduling technology, and there would be much less complexity.) More analogous to CPU schedulers would perhaps be VM/MM schedulers, and those tend to be hard to modularize in a technologically sane way too. (and unlike disks there's no good generic way to attach VM/MM schedulers to particular workloads.) So it's apples to oranges. in practice it comes down to having one good scheduler that runs all workloads on a system reasonably well. And given that a very large portion of system runs mixed workloads, the demand for one good scheduler is pretty high. While i can run with mixed IO schedulers just fine. 2) plugsched did not allow on the fly selection of schedulers, nor did it allow a per CPU selection of schedulers. IO schedulers you can change per disk, on the fly, making them much more useful in practice. Also, IO schedulers (while definitely not being slow!) are alot less performance sensitive than CPU schedulers. 3) I/O schedulers are pretty damn clean code, and plugsched, at least the last version i saw of it, didnt come even close. 4) the good thing that happened to I/O, after years of stagnation isnt I/O schedulers. The good thing that happened to I/O is called Jens Axboe. If you care about the I/O subystem then print that name out and hang it on the wall. That and only that is what mattered. all in one, while there are definitely uses (embedded would like to have a smaller/different scheduler, etc.), the technical case for modularization for the sake of selectability is alot lower for CPU schedulers than it is for I/O schedulers. nor was the non-modularity of some piece of code ever an impediment to competition. May i remind you of the pretty competitive SLAB allocator landscape, resulting in things like the SLOB allocator, written by yourself? ;-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
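The per-disk, on-the-fly I/O scheduler switching contrasted with plugsched above is a single sysfs write; a minimal sketch (the device name is an assumption, and the write needs root):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");

	if (!f) {
		perror("scheduler attribute");
		return 1;
	}
	fputs("cfq\n", f);	/* select CFQ for this one disk, at runtime */
	return fclose(f) ? 1 : 0;
}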
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 20:48 ` Ingo Molnar @ 2007-04-15 21:31 ` Matt Mackall 2007-04-16 3:03 ` Nick Piggin 2007-04-16 15:45 ` William Lee Irwin III 2007-04-15 23:39 ` William Lee Irwin III 1 sibling, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-15 21:31 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > > * Matt Mackall <mpm@selenic.com> wrote: > > > Look at what happened with I/O scheduling. Opening things up to some > > new ideas by making it possible to select your I/O scheduler took us > > from 10 years of stagnation to healthy, competitive development, which > > gave us a substantially better I/O scheduler. > > actually, 2-3 years ago we already had IO schedulers, and my opinion > against plugsched back then (also shared by Nick and Linus) was very > much considering them. There are at least 4 reasons why I/O schedulers > are different from CPU schedulers: ... > 3) I/O schedulers are pretty damn clean code, and plugsched, at least > the last version i saw of it, didnt come even close. That's irrelevant. Plugsched was an attempt to get alternative schedulers exposure in mainline. I know, because I remember encouraging Bill to pursue it. Not only did you veto plugsched (which may have been a perfectly reasonable thing to do), but you also vetoed the whole concept of multiple schedulers in the tree too. "We don't want to balkanize the scheduling landscape". And that latter part is what I'm claiming has set us back for years. It's not a technical argument but a strategic one. And it's just not a good strategy. > 4) the good thing that happened to I/O, after years of stagnation isnt > I/O schedulers. The good thing that happened to I/O is called Jens > Axboe. If you care about the I/O subystem then print that name out > and hang it on the wall. That and only that is what mattered. Disagree. Things didn't actually get interesting until Nick showed up with AS and got it in-tree to demonstrate the huge amount of room we had for improvement. It took several iterations of AS and CFQ (with a couple complete rewrites) before CFQ began to look like the winner. The resulting time-sliced CFQ was fairly heavily influenced by the ideas in AS. Similarly, things in scheduler land had been pretty damn boring until Con finally got Andrew to take one of his schedulers for a spin. > nor was the non-modularity of some piece of code ever an impediment to > competition. May i remind you of the pretty competitive SLAB allocator > landscape, resulting in things like the SLOB allocator, written by > yourself? ;-) Thankfully no one came out and said "we don't want to balkanize the allocator landscape" when I submitted it or I probably would have just dropped it, rather than painfully dragging it along out of tree for years. I'm not nearly the glutton for punishment that Con is. :-P -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 21:31 ` Matt Mackall @ 2007-04-16 3:03 ` Nick Piggin 2007-04-16 14:28 ` Matt Mackall 2007-04-16 15:45 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-16 3:03 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote: > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > > > 4) the good thing that happened to I/O, after years of stagnation isnt > > I/O schedulers. The good thing that happened to I/O is called Jens > > Axboe. If you care about the I/O subystem then print that name out > > and hang it on the wall. That and only that is what mattered. > > Disagree. Things didn't actually get interesting until Nick showed up > with AS and got it in-tree to demonstrate the huge amount of room we > had for improvement. It took several iterations of AS and CFQ (with a > couple complete rewrites) before CFQ began to look like the winner. > The resulting time-sliced CFQ was fairly heavily influenced by the > ideas in AS. Well to be fair, Jens had just implemented deadline, which got me interested ;) Actually, I would still like to be able to deprecate deadline for AS, because AS has a tunable that you can switch to turn off read anticipation and revert to deadline behaviour (or very close to). It would have been nice if CFQ were then a layer on top of AS that implemented priorities (or vice versa). And then AS could be deprecated and we'd be back to 1 primary scheduler. Well CFQ seems to be going in the right direction with that, however some large users still find AS faster for some reason... Anyway, moral of the story is that I think it would have been nice if we hadn't proliferated IO schedulers, however in practice it isn't easy to just layer features on top of each other, and also keeping deadline helped a lot to be able to debug and examine performance regressions and actually get code upstream. And this was true even when it was globally boottine switchable only. I'd prefer if we kept a single CPU scheduler in mainline, because I think that simplifies analysis and focuses testing. I think we can have one that is good enough for everyone. But if the only other option for progress is that Linus or Andrew just pull one out of a hat, then I would rather merge all of them. Yes I think Con's scheduler should get a fair go, ditto for Ingo's, mine, and anyone else's. > > nor was the non-modularity of some piece of code ever an impediment to > > competition. May i remind you of the pretty competitive SLAB allocator > > landscape, resulting in things like the SLOB allocator, written by > > yourself? ;-) > > Thankfully no one came out and said "we don't want to balkanize the > allocator landscape" when I submitted it or I probably would have just > dropped it, rather than painfully dragging it along out of tree for > years. I'm not nearly the glutton for punishment that Con is. :-P I don't think this is a fault of the people or the code involved. We just didn't have much collective drive to replace the scheduler, and even less an idea of how to decide between any two of them. I've kept nicksched around since 2003 or so and no hard feelings ;) ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:03 ` Nick Piggin @ 2007-04-16 14:28 ` Matt Mackall 2007-04-17 3:31 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-16 14:28 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > I'd prefer if we kept a single CPU scheduler in mainline, because I > think that simplifies analysis and focuses testing. I think you'll find something like 80-90% of the testing will be done on the default choice, even if other choices exist. So you really won't have much of a problem here. But when the only choice for other schedulers is to go out-of-tree, then only 1% of the people will try it out and those people are guaranteed to be the ones who saw scheduling problems in mainline. So the alternative won't end up getting any testing on many of the workloads that work fine in mainstream so their feedback won't tell you very much at all. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 14:28 ` Matt Mackall @ 2007-04-17 3:31 ` Nick Piggin 2007-04-17 17:35 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 3:31 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote: > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > > I'd prefer if we kept a single CPU scheduler in mainline, because I > > think that simplifies analysis and focuses testing. > > I think you'll find something like 80-90% of the testing will be done > on the default choice, even if other choices exist. So you really > won't have much of a problem here. > > But when the only choice for other schedulers is to go out-of-tree, > then only 1% of the people will try it out and those people are > guaranteed to be the ones who saw scheduling problems in mainline. > So the alternative won't end up getting any testing on many of the > workloads that work fine in mainstream so their feedback won't tell > you very much at all. Yeah I concede that perhaps it is the only way to get things going any further. But how do we decide if and when the current scheduler should be demoted from default, and which should replace it? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:31 ` Nick Piggin @ 2007-04-17 17:35 ` Matt Mackall 0 siblings, 0 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 17:35 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 05:31:20AM +0200, Nick Piggin wrote: > On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote: > > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > > > I'd prefer if we kept a single CPU scheduler in mainline, because I > > > think that simplifies analysis and focuses testing. > > > > I think you'll find something like 80-90% of the testing will be done > > on the default choice, even if other choices exist. So you really > > won't have much of a problem here. > > > > But when the only choice for other schedulers is to go out-of-tree, > > then only 1% of the people will try it out and those people are > > guaranteed to be the ones who saw scheduling problems in mainline. > > So the alternative won't end up getting any testing on many of the > > workloads that work fine in mainstream so their feedback won't tell > > you very much at all. > > Yeah I concede that perhaps it is the only way to get things going > any further. But how do we decide if and when the current scheduler > should be demoted from default, and which should replace it? Step one is ship both in -mm. If that doesn't give us enough confidence, ship both in mainline. If that doesn't give us enough confidence, wait until vendors ship both. Eventually a clear picture should emerge. If it doesn't, either the change is not significant or no one cares. But it really is important to be able to do controlled experiments on this stuff with little effort. That's the recipe for getting lots of valid feedback. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 21:31 ` Matt Mackall 2007-04-16 3:03 ` Nick Piggin @ 2007-04-16 15:45 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 15:45 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote: > That's irrelevant. Plugsched was an attempt to get alternative > schedulers exposure in mainline. I know, because I remember > encouraging Bill to pursue it. Not only did you veto plugsched (which > may have been a perfectly reasonable thing to do), but you also vetoed > the whole concept of multiple schedulers in the tree too. "We don't > want to balkanize the scheduling landscape". > And that latter part is what I'm claiming has set us back for years. > It's not a technical argument but a strategic one. And it's just not a > good strategy. [... excellent post trimmed...] These are some rather powerful arguments. I think I'll actually start looking at plugsched again. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 20:48 ` Ingo Molnar 2007-04-15 21:31 ` Matt Mackall @ 2007-04-15 23:39 ` William Lee Irwin III 2007-04-16 1:06 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 23:39 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > 2) plugsched did not allow on the fly selection of schedulers, nor did > it allow a per CPU selection of schedulers. IO schedulers you can > change per disk, on the fly, making them much more useful in > practice. Also, IO schedulers (while definitely not being slow!) are > alot less performance sensitive than CPU schedulers. One of the reasons I never posted my own code is that it never met its own design goals, which absolutely included switching on the fly. I think Peter Williams may have done something about that. It was my hope to be able to do insmod sched_foo.ko until it became clear that the effort it was intended to assist wasn't going to get even the limited hardware access required, at which point I largely stopped working on it. On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > 3) I/O schedulers are pretty damn clean code, and plugsched, at least > the last version i saw of it, didnt come even close. I'm not sure what happened there. It wasn't a big enough patch to take hits in this area due to getting overwhelmed by the programming burden like some other efforts of mine. Maybe things started getting ugly once on-the-fly switching entered the picture. My guess is that Peter Williams will have to chime in here, since things have diverged enough from my one-time contribution 4 years ago. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:39 ` William Lee Irwin III @ 2007-04-16 1:06 ` Peter Williams 2007-04-16 3:04 ` William Lee Irwin III 2007-04-16 17:22 ` Chris Friesen 0 siblings, 2 replies; 304+ messages in thread From: Peter Williams @ 2007-04-16 1:06 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: >> 2) plugsched did not allow on the fly selection of schedulers, nor did >> it allow a per CPU selection of schedulers. IO schedulers you can >> change per disk, on the fly, making them much more useful in >> practice. Also, IO schedulers (while definitely not being slow!) are >> alot less performance sensitive than CPU schedulers. > > One of the reasons I never posted my own code is that it never met its > own design goals, which absolutely included switching on the fly. I > think Peter Williams may have done something about that. I didn't but some students did. In a previous life, I did implement a runtime configurable CPU scheduling mechanism (implemented on True64, Solaris and Linux) that allowed schedulers to be loaded as modules at run time. This was released commercially on True64 and Solaris. So I know that it can be done. I have thought about doing something similar for the SPA schedulers which differ in only small ways from each other but lack motivation. > It was my hope > to be able to do insmod sched_foo.ko until it became clear that the > effort it was intended to assist wasn't going to get even the limited > hardware access required, at which point I largely stopped working on > it. > > > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: >> 3) I/O schedulers are pretty damn clean code, and plugsched, at least >> the last version i saw of it, didnt come even close. > > I'm not sure what happened there. It wasn't a big enough patch to take > hits in this area due to getting overwhelmed by the programming burden > like some other efforts of mine. Maybe things started getting ugly once > on-the-fly switching entered the picture. My guess is that Peter Williams > will have to chime in here, since things have diverged enough from my > one-time contribution 4 years ago. From my POV, the current version of plugsched is considerably simpler than it was when I took the code over from Con as I put considerable effort into minimizing code overlap in the various schedulers. I also put considerable effort into minimizing any changes to the load balancing code (something Ingo seems to think is a deficiency) and the result is that plugsched allows "intra run queue" scheduling to be easily modified WITHOUT effecting load balancing. To my mind scheduling and load balancing are orthogonal and keeping them that way simplifies things. As Ingo correctly points out, plugsched does not allow different schedulers to be used per CPU but it would not be difficult to modify it so that they could. Although I've considered doing this over the years I decided not to as it would just increase the complexity and the amount of work required to keep the patch set going. 
About six months ago I decided to reduce the amount of work I was doing on plugsched (as it was obviously never going to be accepted) and now only publish patches against the vanilla kernel's major releases (and the only reason that I kept doing that is that the download figures indicated that about 80 users were interested in the experiment). Peter PS I no longer read LKML (due to time constraints) and would appreciate it if I could be CC'd on any e-mails suggesting scheduler changes. PPS I'm just happy to see that Ingo has finally accepted that the vanilla scheduler was badly in need of fixing and don't really care who fixes it. PPS Different schedulers for different aims (i.e. server or work station) do make a difference. E.g. the spa_svr scheduler in plugsched does about 1% better on kernbench than the next best scheduler in the bunch. PPPS Con, fairness isn't always best as humans aren't very altruistic and we need to give unfair preference to interactive tasks in order to stop the users flinging their PCs out the window. But the current scheduler doesn't do this very well and is also not very good at fairness so needs to change. But the changes need to address interactive response and fairness not just fairness. -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 1:06 ` Peter Williams @ 2007-04-16 3:04 ` William Lee Irwin III 2007-04-16 5:09 ` Peter Williams 2007-04-16 17:22 ` Chris Friesen 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 3:04 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> One of the reasons I never posted my own code is that it never met its >> own design goals, which absolutely included switching on the fly. I >> think Peter Williams may have done something about that. >> It was my hope >> to be able to do insmod sched_foo.ko until it became clear that the >> effort it was intended to assist wasn't going to get even the limited >> hardware access required, at which point I largely stopped working on >> it. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > I didn't but some students did. > In a previous life, I did implement a runtime configurable CPU > scheduling mechanism (implemented on True64, Solaris and Linux) that > allowed schedulers to be loaded as modules at run time. This was > released commercially on True64 and Solaris. So I know that it can be done. > I have thought about doing something similar for the SPA schedulers > which differ in only small ways from each other but lack motivation. Driver models for scheduling are not so far out. AFAICS it's largely a tug-of-war over design goals, e.g. maintaining per-cpu runqueues and switching out intra-queue policies vs. switching out whole-system policies, SMP handling and all. Whether this involves load balancing depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x scheduler module, for instance, would not have a load balancer at all, as it has only one global runqueue. There are other sorts of policies wanting significant changes to SMP handling vs. the stock load balancing. William Lee Irwin III wrote: >> I'm not sure what happened there. It wasn't a big enough patch to take >> hits in this area due to getting overwhelmed by the programming burden >> like some other efforts of mine. Maybe things started getting ugly once >> on-the-fly switching entered the picture. My guess is that Peter Williams >> will have to chime in here, since things have diverged enough from my >> one-time contribution 4 years ago. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > From my POV, the current version of plugsched is considerably simpler > than it was when I took the code over from Con as I put considerable > effort into minimizing code overlap in the various schedulers. > I also put considerable effort into minimizing any changes to the load > balancing code (something Ingo seems to think is a deficiency) and the > result is that plugsched allows "intra run queue" scheduling to be > easily modified WITHOUT effecting load balancing. To my mind scheduling > and load balancing are orthogonal and keeping them that way simplifies > things. ISTR rearranging things for con in such a fashion that it no longer worked out of the box (though that wasn't the intention; restructuring it to be more suited to his purposes was) and that's what he worked off of afterward. I don't remember very well what changed there as I clearly invested less effort there than the prior versions. 
Now that I think of it, that may have been where the sample policy demonstrating scheduling classes was lost. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > As Ingo correctly points out, plugsched does not allow different > schedulers to be used per CPU but it would not be difficult to modify it > so that they could. Although I've considered doing this over the years > I decided not to as it would just increase the complexity and the amount > of work required to keep the patch set going. About six months ago I > decided to reduce the amount of work I was doing on plugsched (as it was > obviously never going to be accepted) and now only publish patches > against the vanilla kernel's major releases (and the only reason that I > kept doing that is that the download figures indicated that about 80 > users were interested in the experiment). That's a rather different goal from what I was going on about with it, so it's all diverged quite a bit. Where I had a significant need for mucking with the entire concept of how SMP was handled, this is rather different. At this point I'm questioning the relevance of my own work, though it was already relatively marginal as it started life as an attempt at a sort of debug patch to help gang scheduling (which is in itself a rather marginally relevant feature to most users) code along. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > PS I no longer read LKML (due to time constraints) and would appreciate > it if I could be CC'd on any e-mails suggesting scheduler changes. > PPS I'm just happy to see that Ingo has finally accepted that the > vanilla scheduler was badly in need of fixing and don't really care who > fixes it. > PPS Different schedulers for different aims (i.e. server or work > station) do make a difference. E.g. the spa_svr scheduler in plugsched > does about 1% better on kernbench than the next best scheduler in the bunch. > PPPS Con, fairness isn't always best as humans aren't very altruistic > and we need to give unfair preference to interactive tasks in order to > stop the users flinging their PCs out the window. But the current > scheduler doesn't do this very well and is also not very good at > fairness so needs to change. But the changes need to address > interactive response and fairness not just fairness. Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are better ones. I'd not bother citing kernel compile results. In any event, I'm not sure what to say about different schedulers for different aims. My intentions with plugsched were not centered around production usage or intra-queue policy. I'm relatively indifferent to the notion of having pluggable CPU schedulers, intra-queue or otherwise, in mainline. I don't see any particular harm in it, but neither am I particularly motivated to have it in. I had a rather strong sense of instrumentality about it, and since it became useless to me (at a conceptual level; the implementation was never finished ot the point of dynamic loading of scheduler modules) for assisting development on large systems via reboot avoidance by dint of it becoming clear that access to such was never going to happen, I've stopped looking at it. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:04 ` William Lee Irwin III @ 2007-04-16 5:09 ` Peter Williams 2007-04-16 11:04 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 5:09 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> One of the reasons I never posted my own code is that it never met its >>> own design goals, which absolutely included switching on the fly. I >>> think Peter Williams may have done something about that. >>> It was my hope >>> to be able to do insmod sched_foo.ko until it became clear that the >>> effort it was intended to assist wasn't going to get even the limited >>> hardware access required, at which point I largely stopped working on >>> it. > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> I didn't but some students did. >> In a previous life, I did implement a runtime configurable CPU >> scheduling mechanism (implemented on True64, Solaris and Linux) that >> allowed schedulers to be loaded as modules at run time. This was >> released commercially on True64 and Solaris. So I know that it can be done. >> I have thought about doing something similar for the SPA schedulers >> which differ in only small ways from each other but lack motivation. > > Driver models for scheduling are not so far out. AFAICS it's largely a > tug-of-war over design goals, e.g. maintaining per-cpu runqueues and > switching out intra-queue policies vs. switching out whole-system > policies, SMP handling and all. Whether this involves load balancing > depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x > scheduler module, for instance, would not have a load balancer at all, > as it has only one global runqueue. There are other sorts of policies > wanting significant changes to SMP handling vs. the stock load > balancing. Well a single run queue removes the need for load balancing but has scalability issues on large systems. Personally, I think something in between would be the best solution i.e. multiple run queues but more than one CPU per run queue. I think that this would be a particularly good solution to the problems introduced by hyper threading and multi core systems and also NUMA systems. E.g. if all CPUs in a hyper thread package are using the one queue then the case where one CPU is trying to run a high priority task and the other a low priority task (i.e. the cases that the sleeping dependent mechanism tried to address) is less likely to occur. By the way, I think that it's a very bad idea for the scheduling mechanism and the load balancing mechanism to be coupled. The anomalies that will be experienced and the attempts to make ad hoc fixes for them will lead to complexity spiralling out of control. > > > William Lee Irwin III wrote: >>> I'm not sure what happened there. It wasn't a big enough patch to take >>> hits in this area due to getting overwhelmed by the programming burden >>> like some other efforts of mine. Maybe things started getting ugly once >>> on-the-fly switching entered the picture. My guess is that Peter Williams >>> will have to chime in here, since things have diverged enough from my >>> one-time contribution 4 years ago. 
> > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> From my POV, the current version of plugsched is considerably simpler >> than it was when I took the code over from Con as I put considerable >> effort into minimizing code overlap in the various schedulers. >> I also put considerable effort into minimizing any changes to the load >> balancing code (something Ingo seems to think is a deficiency) and the >> result is that plugsched allows "intra run queue" scheduling to be >> easily modified WITHOUT effecting load balancing. To my mind scheduling >> and load balancing are orthogonal and keeping them that way simplifies >> things. > > ISTR rearranging things for con in such a fashion that it no longer > worked out of the box (though that wasn't the intention; restructuring it > to be more suited to his purposes was) and that's what he worked off of > afterward. I don't remember very well what changed there as I clearly > invested less effort there than the prior versions. Now that I think of > it, that may have been where the sample policy demonstrating scheduling > classes was lost. I can't comment here as (as far as I can recall) I never saw your code and only became involved when Con posted his version of cpusched. > > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> As Ingo correctly points out, plugsched does not allow different >> schedulers to be used per CPU but it would not be difficult to modify it >> so that they could. Although I've considered doing this over the years >> I decided not to as it would just increase the complexity and the amount >> of work required to keep the patch set going. About six months ago I >> decided to reduce the amount of work I was doing on plugsched (as it was >> obviously never going to be accepted) and now only publish patches >> against the vanilla kernel's major releases (and the only reason that I >> kept doing that is that the download figures indicated that about 80 >> users were interested in the experiment). > > That's a rather different goal from what I was going on about with it, > so it's all diverged quite a bit. Yes, pragmatic considerations dictated a change of tack. > Where I had a significant need for > mucking with the entire concept of how SMP was handled, this is rather > different. Yes, I went with the idea of intra run queue scheduling being orthogonal to load balancing for two reasons: 1. I think that coupling them is a bad idea from the complexity POV, and 2. it's enough of a battle fighting for modifications to one bit of the code without trying to do it to two simultaneously. > At this point I'm questioning the relevance of my own work, > though it was already relatively marginal as it started life as an > attempt at a sort of debug patch to help gang scheduling (which is in > itself a rather marginally relevant feature to most users) code along. The main commercial plug in scheduler used with the run time loadable module scheduler that I mentioned earlier did gang scheduling (at the insistence of the Tru64 kernel folks). As this scheduler was a hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" ("unfairly" really in according to an allocation policy) among higher level entities such as users, groups and applications as well as processes; it was fairly easy to make it a gang scheduler by modifying it to give all of a process's threads the same priority based on the process's CPU usage rather than different priorities based on the threads' usage rates. 
In fact, it would have been possible to select between gang and non gang on a per process basis if that was considered desirable. The fact that threads and processes are distinct entities on Tru64 and Solaris made this easier to do on them than on Linux. My experience with this scheduler leads me to believe that to achieve gang scheduling and fairness, etc. you need (usage) statistics based schedulers. > > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> PS I no longer read LKML (due to time constraints) and would appreciate >> it if I could be CC'd on any e-mails suggesting scheduler changes. >> PPS I'm just happy to see that Ingo has finally accepted that the >> vanilla scheduler was badly in need of fixing and don't really care who >> fixes it. >> PPS Different schedulers for different aims (i.e. server or work >> station) do make a difference. E.g. the spa_svr scheduler in plugsched >> does about 1% better on kernbench than the next best scheduler in the bunch. >> PPPS Con, fairness isn't always best as humans aren't very altruistic >> and we need to give unfair preference to interactive tasks in order to >> stop the users flinging their PCs out the window. But the current >> scheduler doesn't do this very well and is also not very good at >> fairness so needs to change. But the changes need to address >> interactive response and fairness not just fairness. > > Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are > better ones. I'd not bother citing kernel compile results. spa_svr actually does its best work when the system isn't fully loaded as the type of improvement it strives to achieve (minimizing on queue wait time) hasn't got much room to manoeuvre when the system is fully loaded. Therefore, the fact that it's 1% better even in these circumstances is a good result and also indicates that the overhead for keeping the scheduling statistics it uses for its decision making is well spent. Especially, when you consider that the total available room for improvement on this benchmark is less than 3%. To elaborate, the motivation for this scheduler was acquired from the observation of scheduling statistics (in particular, on queue wait time) on systems running at about 30% to 50% load. Theoretically, at these load levels there should be no such waiting but the statistics show that there is considerable waiting (sometimes as high as 30% to 50%). I put this down to "lack of serendipity" e.g. everyone sleeping at the same time and then trying to run at the same time would be complete lack of serendipity. On the other hand, if everyone is synced then there would be total serendipity. Obviously, from the POV of a client, time the server task spends waiting on the queue adds to the response time for any request that has been made so reduction of this time on a server is a good thing(tm). Equally obviously, trying to achieve this synchronization by asking the tasks to cooperate with each other is not a feasible solution and some external influence needs to be exerted and this is what spa_svr does -- it nudges the scheduling order of the tasks in a way that makes them become well synced. Unfortunately, this is not a good scheduler for an interactive system as it minimizes the response times for ALL tasks (and the system as a whole) and this can result in increased response time for some interactive tasks (clunkiness) which annoys interactive users. 
When you start fiddling with this scheduler to bring back "interactive unfairness" you kill a lot of its superior low overall wait time performance. So this is why I think "horses for courses" schedulers are worth while. > > In any event, I'm not sure what to say about different schedulers for > different aims. My intentions with plugsched were not centered around > production usage or intra-queue policy. I'm relatively indifferent to > the notion of having pluggable CPU schedulers, intra-queue or otherwise, > in mainline. I don't see any particular harm in it, but neither am I > particularly motivated to have it in. If you look at the struct sched_spa_child in the file include/linux/sched_spa.h you'll see that the interface for switching between the various SPA schedulers is quite small and making them runtime switchable would be easy (I haven't done this in cpusched as I wanted to keep the same interface for switching schedulers for all schedulers: i.e. all run time switchable or none run time switchable; as the main aim of plugsched had become a mechanism for evaluating different intra queue scheduling designs.) > I had a rather strong sense of > instrumentality about it, and since it became useless to me (at a > conceptual level; the implementation was never finished ot the point of > dynamic loading of scheduler modules) for assisting development on > large systems via reboot avoidance by dint of it becoming clear that > access to such was never going to happen, I've stopped looking at it. I'll probably stop looking at this problem as well at least for the time being until all this new code has settled. Peter PS As I no longer read LKML, I haven't yet seen Ingo's or Con's or Nick's new schedulers yet so am unable to comment on their technical merits with respect to my comments above. -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
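The small switching interface Peter describes is easier to picture with a sketch. What follows is a hypothetical, userspace-only illustration of an intra-queue policy reduced to a tiny ops table, so that switching policies amounts to swapping a pointer while the surrounding code and the load balancer stay untouched. The struct and function names here are invented for the example and are not the actual sched_spa_child interface from plugsched.

/*
 * Hypothetical userspace sketch (not the real plugsched interface):
 * an intra-runqueue scheduling policy reduced to a small ops table.
 * Switching policies is then just a pointer swap; nothing else in the
 * "scheduler" or the load balancer needs to change.
 */
#include <stddef.h>
#include <stdio.h>

struct task { int prio; struct task *next; };
struct runqueue { struct task *head; };

struct policy_ops {
	const char *name;
	void (*enqueue)(struct runqueue *rq, struct task *t);
	struct task *(*pick_next)(struct runqueue *rq);
};

/* FIFO policy: append at the tail, always pick the head. */
static void fifo_enqueue(struct runqueue *rq, struct task *t)
{
	struct task **p = &rq->head;

	while (*p)
		p = &(*p)->next;
	t->next = NULL;
	*p = t;
}

static struct task *head_pick(struct runqueue *rq)
{
	struct task *t = rq->head;

	if (t)
		rq->head = t->next;
	return t;
}

/* Priority policy: keep the list sorted by priority on insert. */
static void prio_enqueue(struct runqueue *rq, struct task *t)
{
	struct task **p = &rq->head;

	while (*p && (*p)->prio <= t->prio)
		p = &(*p)->next;
	t->next = *p;
	*p = t;
}

static const struct policy_ops fifo_policy = { "fifo", fifo_enqueue, head_pick };
static const struct policy_ops prio_policy = { "prio", prio_enqueue, head_pick };

/* Runtime switch: swap this pointer (a real kernel needs locking here). */
static const struct policy_ops *active = &fifo_policy;

int main(void)
{
	struct runqueue rq = { NULL };
	struct task a = { .prio = 3 }, b = { .prio = 1 };

	active = &prio_policy;		/* switch scheduling discipline */
	active->enqueue(&rq, &a);
	active->enqueue(&rq, &b);
	printf("%s policy picks prio %d first\n",
	       active->name, active->pick_next(&rq)->prio);
	return 0;
}

The only point of the sketch is that, once the policy surface is this narrow, whether the choice is made at compile time, through /proc, or by loading a module becomes a packaging decision rather than a scheduling one.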
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:09 ` Peter Williams @ 2007-04-16 11:04 ` William Lee Irwin III 2007-04-16 12:55 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 11:04 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> Driver models for scheduling are not so far out. AFAICS it's largely a >> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and >> switching out intra-queue policies vs. switching out whole-system >> policies, SMP handling and all. Whether this involves load balancing >> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x >> scheduler module, for instance, would not have a load balancer at all, >> as it has only one global runqueue. There are other sorts of policies >> wanting significant changes to SMP handling vs. the stock load >> balancing. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Well a single run queue removes the need for load balancing but has > scalability issues on large systems. Personally, I think something in > between would be the best solution i.e. multiple run queues but more > than one CPU per run queue. I think that this would be a particularly > good solution to the problems introduced by hyper threading and multi > core systems and also NUMA systems. E.g. if all CPUs in a hyper thread > package are using the one queue then the case where one CPU is trying to > run a high priority task and the other a low priority task (i.e. the > cases that the sleeping dependent mechanism tried to address) is less > likely to occur. This wasn't meant to sing the praises of the 2.4.x scheduler; it was rather meant to point out that the 2.4.x scheduler, among others, is unimplementable within the framework if it assumes per-cpu runqueues. More plausibly useful single-queue schedulers would likely use a vastly different policy and attempt to carry out all queue manipulations via lockless operations. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > By the way, I think that it's a very bad idea for the scheduling > mechanism and the load balancing mechanism to be coupled. The anomalies > that will be experienced and the attempts to make ad hoc fixes for them > will lead to complexity spiralling out of control. This is clearly unavoidable in the case of gang scheduling. There is simply no other way to schedule N tasks which must all be run simultaneously when they run at all on N cpus of the system without such coupling and furthermore at an extremely intimate level, particularly when multiple competing gangs must be scheduled in such a fashion. William Lee Irwin III wrote: >> Where I had a significant need for >> mucking with the entire concept of how SMP was handled, this is rather >> different. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Yes, I went with the idea of intra run queue scheduling being orthogonal > to load balancing for two reasons: > 1. I think that coupling them is a bad idea from the complexity POV, and > 2. it's enough of a battle fighting for modifications to one bit of the > code without trying to do it to two simultaneously. As nice as that sounds, such a code structure would've precluded the entire raison d'etre of the patch, i.e. gang scheduling. 
William Lee Irwin III wrote: >> At this point I'm questioning the relevance of my own work, >> though it was already relatively marginal as it started life as an >> attempt at a sort of debug patch to help gang scheduling (which is in >> itself a rather marginally relevant feature to most users) code along. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > The main commercial plug in scheduler used with the run time loadable > module scheduler that I mentioned earlier did gang scheduling (at the > insistence of the Tru64 kernel folks). As this scheduler was a > hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" > ("unfairly" really in according to an allocation policy) among higher > level entities such as users, groups and applications as well as > processes; it was fairly easy to make it a gang scheduler by modifying > it to give all of a process's threads the same priority based on the > process's CPU usage rather than different priorities based on the > threads' usage rates. In fact, it would have been possible to select > between gang and non gang on a per process basis if that was considered > desirable. > The fact that threads and processes are distinct entities on Tru64 and > Solaris made this easier to do on them than on Linux. > My experience with this scheduler leads me to believe that to achieve > gang scheduling and fairness, etc. you need (usage) statistics based > schedulers. This does not appear to make sense unless it's based on an incorrect use of the term "gang scheduling." I'm referring to a gang as a set of tasks (typically restricted to threads of the same process) which must all be considered runnable or unrunnable simultaneously, and are for the sake of performance required to all actually be run simultaneously. This means a gang of N threads, when run, must run on N processors at once. A time and a set of processors must be chosen for any time interval where the gang is running. This interacts with load balancing by needing to choose the cpus to run the gang on, and also arranging for a set of cpus available for the gang to use to exist by means of migrating tasks off the chosen cpus. William Lee Irwin III wrote: >> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are >> better ones. I'd not bother citing kernel compile results. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > spa_svr actually does its best work when the system isn't fully loaded > as the type of improvement it strives to achieve (minimizing on queue > wait time) hasn't got much room to manoeuvre when the system is fully > loaded. Therefore, the fact that it's 1% better even in these > circumstances is a good result and also indicates that the overhead for > keeping the scheduling statistics it uses for its decision making is > well spent. Especially, when you consider that the total available room > for improvement on this benchmark is less than 3%. None of these benchmarks require the system to be fully loaded. They are, on the other hand, vastly more realistic simulated workloads than kernel compiles, and furthermore are actually developed as benchmarks, with in some cases even measurements of variance, iteration to convergence, and similar such things that make them actually scientific. 
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > To elaborate, the motivation for this scheduler was acquired from the > observation of scheduling statistics (in particular, on queue wait time) > on systems running at about 30% to 50% load. Theoretically, at these > load levels there should be no such waiting but the statistics show that > there is considerable waiting (sometimes as high as 30% to 50%). I put > this down to "lack of serendipity" e.g. everyone sleeping at the same > time and then trying to run at the same time would be complete lack of > serendipity. On the other hand, if everyone is synced then there would > be total serendipity. > Obviously, from the POV of a client, time the server task spends waiting > on the queue adds to the response time for any request that has been > made so reduction of this time on a server is a good thing(tm). Equally > obviously, trying to achieve this synchronization by asking the tasks to > cooperate with each other is not a feasible solution and some external > influence needs to be exerted and this is what spa_svr does -- it nudges > the scheduling order of the tasks in a way that makes them become well > synced. This all sounds like a relatively good idea. So it's good for throughput vs. latency or otherwise not particularly interactive. No big deal, just use it where it makes sense. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Unfortunately, this is not a good scheduler for an interactive system as > it minimizes the response times for ALL tasks (and the system as a > whole) and this can result in increased response time for some > interactive tasks (clunkiness) which annoys interactive users. When you > start fiddling with this scheduler to bring back "interactive > unfairness" you kill a lot of its superior low overall wait time > performance. > So this is why I think "horses for courses" schedulers are worth while. I have no particular objection to using an appropriate scheduler for the system's workload. I also have little or no preference as to how that's accomplished overall. But I really think that if we want to push pluggable scheduling it should load schedulers as kernel modules on the fly and so on versus pure /proc/ tunables and a compiled-in set of alternatives. William Lee Irwin III wrote: >> In any event, I'm not sure what to say about different schedulers for >> different aims. My intentions with plugsched were not centered around >> production usage or intra-queue policy. I'm relatively indifferent to >> the notion of having pluggable CPU schedulers, intra-queue or otherwise, >> in mainline. I don't see any particular harm in it, but neither am I >> particularly motivated to have it in. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > If you look at the struct sched_spa_child in the file > include/linux/sched_spa.h you'll see that the interface for switching > between the various SPA schedulers is quite small and making them > runtime switchable would be easy (I haven't done this in cpusched as I > wanted to keep the same interface for switching schedulers for all > schedulers: i.e. all run time switchable or none run time switchable; as > the main aim of plugsched had become a mechanism for evaluating > different intra queue scheduling designs.) I remember actually looking at this, and I would almost characterize the differences between the SPA schedulers as a tunable parameter. 
I have a different concept of what pluggability means from how the SPA schedulers were switched, but no particular objection to the method given the commonalities between them. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
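As a reading aid for the definition William gives above, the following toy C sketch shows the all-or-nothing dispatch constraint that this notion of gang scheduling implies: a gang of N threads runs only when N CPUs can be handed to it at once. The names are invented, and the genuinely hard part he points to, coordinating with the load balancer to migrate displaced tasks off the chosen CPUs, is deliberately left out.

/*
 * Toy illustration of gang scheduling as defined above: the gang is
 * dispatched only if enough CPUs can be claimed for it simultaneously,
 * otherwise it does not run at all.  Migration of the displaced tasks
 * (the load-balancer side of the problem) is ignored here.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

struct cpu { bool claimable; };		/* could this CPU be given up right now? */
struct gang { int nr_threads; };	/* threads that must run together */

static bool try_dispatch_gang(struct cpu cpus[], const struct gang *g)
{
	int free = 0;

	for (int i = 0; i < NR_CPUS; i++)
		if (cpus[i].claimable)
			free++;

	if (free < g->nr_threads)
		return false;		/* all or nothing: never run a partial gang */

	for (int i = 0, placed = 0; i < NR_CPUS && placed < g->nr_threads; i++) {
		if (cpus[i].claimable) {
			cpus[i].claimable = false;	/* a gang thread now owns it */
			placed++;
		}
	}
	return true;
}

int main(void)
{
	struct cpu cpus[NR_CPUS] = { {true}, {true}, {false}, {true} };
	struct gang g = { .nr_threads = 3 };

	printf("gang of %d threads: %s\n", g.nr_threads,
	       try_dispatch_gang(cpus, &g) ? "dispatched" : "must wait");
	return 0;
}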
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:04 ` William Lee Irwin III @ 2007-04-16 12:55 ` Peter Williams 2007-04-16 23:10 ` Michael K. Edwards [not found] ` <20070416135915.GK8915@holomorphy.com> 0 siblings, 2 replies; 304+ messages in thread From: Peter Williams @ 2007-04-16 12:55 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> Driver models for scheduling are not so far out. AFAICS it's largely a >>> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and >>> switching out intra-queue policies vs. switching out whole-system >>> policies, SMP handling and all. Whether this involves load balancing >>> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x >>> scheduler module, for instance, would not have a load balancer at all, >>> as it has only one global runqueue. There are other sorts of policies >>> wanting significant changes to SMP handling vs. the stock load >>> balancing. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Well a single run queue removes the need for load balancing but has >> scalability issues on large systems. Personally, I think something in >> between would be the best solution i.e. multiple run queues but more >> than one CPU per run queue. I think that this would be a particularly >> good solution to the problems introduced by hyper threading and multi >> core systems and also NUMA systems. E.g. if all CPUs in a hyper thread >> package are using the one queue then the case where one CPU is trying to >> run a high priority task and the other a low priority task (i.e. the >> cases that the sleeping dependent mechanism tried to address) is less >> likely to occur. > > This wasn't meant to sing the praises of the 2.4.x scheduler; it was > rather meant to point out that the 2.4.x scheduler, among others, is > unimplementable within the framework if it assumes per-cpu runqueues. > More plausibly useful single-queue schedulers would likely use a vastly > different policy and attempt to carry out all queue manipulations via > lockless operations. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> By the way, I think that it's a very bad idea for the scheduling >> mechanism and the load balancing mechanism to be coupled. The anomalies >> that will be experienced and the attempts to make ad hoc fixes for them >> will lead to complexity spiralling out of control. > > This is clearly unavoidable in the case of gang scheduling. There is > simply no other way to schedule N tasks which must all be run > simultaneously when they run at all on N cpus of the system without > such coupling and furthermore at an extremely intimate level, > particularly when multiple competing gangs must be scheduled in such > a fashion. I can't see the logic here or why you would want to do such a thing. It certainly doesn't coincide with what I interpret "gang scheduling" to mean. > > > William Lee Irwin III wrote: >>> Where I had a significant need for >>> mucking with the entire concept of how SMP was handled, this is rather >>> different. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Yes, I went with the idea of intra run queue scheduling being orthogonal >> to load balancing for two reasons: >> 1. 
I think that coupling them is a bad idea from the complexity POV, and >> 2. it's enough of a battle fighting for modifications to one bit of the >> code without trying to do it to two simultaneously. > > As nice as that sounds, such a code structure would've precluded the > entire raison d'etre of the patch, i.e. gang scheduling. Not for what I understand "gang scheduling" to mean. As I understand it the constraints of gang scheduling are no where near as strict as you seem to think they are. And for what it's worth I don't think that what you think it means is in any sense a reasonable target. > > > William Lee Irwin III wrote: >>> At this point I'm questioning the relevance of my own work, >>> though it was already relatively marginal as it started life as an >>> attempt at a sort of debug patch to help gang scheduling (which is in >>> itself a rather marginally relevant feature to most users) code along. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> The main commercial plug in scheduler used with the run time loadable >> module scheduler that I mentioned earlier did gang scheduling (at the >> insistence of the Tru64 kernel folks). As this scheduler was a >> hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" >> ("unfairly" really in according to an allocation policy) among higher >> level entities such as users, groups and applications as well as >> processes; it was fairly easy to make it a gang scheduler by modifying >> it to give all of a process's threads the same priority based on the >> process's CPU usage rather than different priorities based on the >> threads' usage rates. In fact, it would have been possible to select >> between gang and non gang on a per process basis if that was considered >> desirable. >> The fact that threads and processes are distinct entities on Tru64 and >> Solaris made this easier to do on them than on Linux. >> My experience with this scheduler leads me to believe that to achieve >> gang scheduling and fairness, etc. you need (usage) statistics based >> schedulers. > > This does not appear to make sense unless it's based on an incorrect > use of the term "gang scheduling." It's become obvious that we mean different things. > I'm referring to a gang as a set of > tasks (typically restricted to threads of the same process) which must > all be considered runnable or unrunnable simultaneously, and are for > the sake of performance required to all actually be run simultaneously. > This means a gang of N threads, when run, must run on N processors at > once. A time and a set of processors must be chosen for any time > interval where the gang is running. This interacts with load balancing > by needing to choose the cpus to run the gang on, and also arranging > for a set of cpus available for the gang to use to exist by means of > migrating tasks off the chosen cpus. Sounds like a job for the load balancer NOT the scheduler. Also I can't see you meeting such strict constraints without making the tasks all SCHED_FIFO. > > > William Lee Irwin III wrote: >>> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are >>> better ones. I'd not bother citing kernel compile results. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> spa_svr actually does its best work when the system isn't fully loaded >> as the type of improvement it strives to achieve (minimizing on queue >> wait time) hasn't got much room to manoeuvre when the system is fully >> loaded. 
Therefore, the fact that it's 1% better even in these >> circumstances is a good result and also indicates that the overhead for >> keeping the scheduling statistics it uses for its decision making is >> well spent. Especially, when you consider that the total available room >> for improvement on this benchmark is less than 3%. > > None of these benchmarks require the system to be fully loaded. They > are, on the other hand, vastly more realistic simulated workloads than > kernel compiles, and furthermore are actually developed as benchmarks, > with in some cases even measurements of variance, iteration to > convergence, and similar such things that make them actually scientific. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> To elaborate, the motivation for this scheduler was acquired from the >> observation of scheduling statistics (in particular, on queue wait time) >> on systems running at about 30% to 50% load. Theoretically, at these >> load levels there should be no such waiting but the statistics show that >> there is considerable waiting (sometimes as high as 30% to 50%). I put >> this down to "lack of serendipity" e.g. everyone sleeping at the same >> time and then trying to run at the same time would be complete lack of >> serendipity. On the other hand, if everyone is synced then there would >> be total serendipity. >> Obviously, from the POV of a client, time the server task spends waiting >> on the queue adds to the response time for any request that has been >> made so reduction of this time on a server is a good thing(tm). Equally >> obviously, trying to achieve this synchronization by asking the tasks to >> cooperate with each other is not a feasible solution and some external >> influence needs to be exerted and this is what spa_svr does -- it nudges >> the scheduling order of the tasks in a way that makes them become well >> synced. > > This all sounds like a relatively good idea. So it's good for throughput > vs. latency or otherwise not particularly interactive. No big deal, just > use it where it makes sense. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Unfortunately, this is not a good scheduler for an interactive system as >> it minimizes the response times for ALL tasks (and the system as a >> whole) and this can result in increased response time for some >> interactive tasks (clunkiness) which annoys interactive users. When you >> start fiddling with this scheduler to bring back "interactive >> unfairness" you kill a lot of its superior low overall wait time >> performance. >> So this is why I think "horses for courses" schedulers are worth while. > > I have no particular objection to using an appropriate scheduler for the > system's workload. I also have little or no preference as to how that's > accomplished overall. But I really think that if we want to push > pluggable scheduling it should load schedulers as kernel modules on the > fly and so on versus pure /proc/ tunables and a compiled-in set of > alternatives. > > > William Lee Irwin III wrote: >>> In any event, I'm not sure what to say about different schedulers for >>> different aims. My intentions with plugsched were not centered around >>> production usage or intra-queue policy. I'm relatively indifferent to >>> the notion of having pluggable CPU schedulers, intra-queue or otherwise, >>> in mainline. I don't see any particular harm in it, but neither am I >>> particularly motivated to have it in. 
> > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> If you look at the struct sched_spa_child in the file >> include/linux/sched_spa.h you'll see that the interface for switching >> between the various SPA schedulers is quite small and making them >> runtime switchable would be easy (I haven't done this in cpusched as I >> wanted to keep the same interface for switching schedulers for all >> schedulers: i.e. all run time switchable or none run time switchable; as >> the main aim of plugsched had become a mechanism for evaluating >> different intra queue scheduling designs.) > > I remember actually looking at this, and I would almost characterize > the differences between the SPA schedulers as a tunable parameter. I > have a different concept of what pluggability means from how the SPA > schedulers were switched, but no particular objection to the method > given the commonalities between them. Yes, that's the way I look at them (in fact, in Zaphod that's exactly what they were -- i.e. Zaphod could be made to behave like various SPA schedulers by fiddling its run time parameters). They illustrate (to my mind) that once you get rid of the O(1) scheduler and replace it with a simple mechanism such as SPA (where there's a small number of points where the scheduling discipline gets to do its thing rather than being interspersed willy nilly throughout the rest of the code) adding run time switchable "horses for courses" scheduler disciplines becomes simple. I think that the simplifications in Ingo's new scheduler (whose scheduling classes now look a lot like Solaris's and its predecessor OSes' scheduler classes) may make it possible to have switchable scheduling disciplines within a scheduling class. I think that something similar (i.e. switchability) could be done for load balancing so that different load balancers could be used when required. By keeping this load balancing functionality orthogonal to the intra run queue scheduling disciplines you increase the number of options available. As I see it, if the scheduling discipline in use does its job properly within a run queue and the load balancer does its job of keeping the weighted load/demand on each run queue roughly equal (except where it has to do otherwise for your version of "gang scheduling") then the overall outcome will meet expectations. Note that I talk of run queues not CPUs as I think a shift to multiple CPUs per run queue may be a good idea. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 12:55 ` Peter Williams @ 2007-04-16 23:10 ` Michael K. Edwards 2007-04-17 3:55 ` Nick Piggin [not found] ` <20070416135915.GK8915@holomorphy.com> 1 sibling, 1 reply; 304+ messages in thread From: Michael K. Edwards @ 2007-04-16 23:10 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > Note that I talk of run queues > not CPUs as I think a shift to multiple CPUs per run queue may be a good > idea. This observation of Peter's is the best thing to come out of this whole foofaraw. Looking at what's happening in CPU-land, I think it's going to be necessary, within a couple of years, to replace the whole idea of "CPU scheduling" with "run queue scheduling" across a complex, possibly dynamic mix of CPU-ish resources. Ergo, there's not much point in churning the mainline scheduler through a design that isn't significantly more flexible than any of those now under discussion. For instance, there are architectures where several "CPUs" (instruction stream decoders feeding execution pipelines) share parts of a cache hierarchy ("chip-level multitasking"). On these machines, you may want to co-schedule a "real" processing task on one pipeline with a "cache warming" task on the other pipeline -- but only for tasks whose memory access patterns have been sufficiently analyzed to write the "cache warming" task code. Some other tasks may want to idle the second pipeline so they can use the full cache-to-RAM bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O bound but so context-heavy that it's not worth yielding the CPU during quick I/Os), and hence perfectly happy to run concurrently with an unrelated task on the other pipeline. There are other architectures where several "hardware threads" fight over parts of a cache hierarchy (sometimes bizarrely described as "sharing" the cache, kind of the way most two-year-olds "share" toys). On these machines, one instruction pipeline can't help the other along cache-wise, but it sure can hurt. A scheduler designed, tested, and tuned principally on one of these architectures (hint: "hyperthreading") will probably leave a lot of performance on the floor on processors in the former category. In the not-so-distant future, we're likely to see architectures with dynamically reconfigurable interconnect between instruction issue units and execution resources. (This is already quite feasible on, say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with as many Nios II cores as fit on the chip.) Restoring task context may involve not just MMU swaps and FPU instructions (with state-dependent hidden costs) but processsor reconfiguration. Achieving "fairness" according to any standard that a platform integrator cares about (let alone an end user) will require a fairly detailed model of the hidden costs associated with different sorts of task switch. So if you are interested in schedulers for some reason other than a paycheck, let the distros worry about 5% improvements on x86[_64]. 
Get hold of some different "hardware" -- say: - a Xilinx ML410 if you've got $3K to blow and want to explore reconfigurable processors; - a SunFire T2000 if you've got $11K and want to mess with a CMT system that's actually shipping; - a QEMU-simulated massively SMP x86 if you're poor but clever enough to implement funky cross-core cache effects yourself; or - a cycle-accurate simulator from Gaisler or Virtio if you want a real research project. Then go explore some more interesting regions of parameter space and see what the demands on mainline Linux will look like in a few years. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 23:10 ` Michael K. Edwards @ 2007-04-17 3:55 ` Nick Piggin 2007-04-17 4:25 ` Peter Williams 2007-04-17 8:24 ` William Lee Irwin III 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 3:55 UTC (permalink / raw) To: Michael K. Edwards Cc: Peter Williams, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: > On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > >Note that I talk of run queues > >not CPUs as I think a shift to multiple CPUs per run queue may be a good > >idea. > > This observation of Peter's is the best thing to come out of this > whole foofaraw. Looking at what's happening in CPU-land, I think it's > going to be necessary, within a couple of years, to replace the whole > idea of "CPU scheduling" with "run queue scheduling" across a complex, > possibly dynamic mix of CPU-ish resources. Ergo, there's not much > point in churning the mainline scheduler through a design that isn't > significantly more flexible than any of those now under discussion. Why? If you do that, then your load balancer just becomes less flexible because it is harder to have tasks run on one or the other. You can have single-runqueue-per-domain behaviour (or close to) just by relaxing all restrictions on idle load balancing within that domain. It is harder to go the other way and place any per-cpu affinity or restirctions with multiple cpus on a single runqueue. > For instance, there are architectures where several "CPUs" > (instruction stream decoders feeding execution pipelines) share parts > of a cache hierarchy ("chip-level multitasking"). On these machines, > you may want to co-schedule a "real" processing task on one pipeline > with a "cache warming" task on the other pipeline -- but only for > tasks whose memory access patterns have been sufficiently analyzed to > write the "cache warming" task code. Some other tasks may want to > idle the second pipeline so they can use the full cache-to-RAM > bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O > bound but so context-heavy that it's not worth yielding the CPU during > quick I/Os), and hence perfectly happy to run concurrently with an > unrelated task on the other pipeline. We can do all that now with load balancing, affinities or by shutting down threads dynamically. > There are other architectures where several "hardware threads" fight > over parts of a cache hierarchy (sometimes bizarrely described as > "sharing" the cache, kind of the way most two-year-olds "share" toys). > On these machines, one instruction pipeline can't help the other > along cache-wise, but it sure can hurt. A scheduler designed, tested, > and tuned principally on one of these architectures (hint: > "hyperthreading") will probably leave a lot of performance on the > floor on processors in the former category. > > In the not-so-distant future, we're likely to see architectures with > dynamically reconfigurable interconnect between instruction issue > units and execution resources. (This is already quite feasible on, > say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with > as many Nios II cores as fit on the chip.) 
Restoring task context may > involve not just MMU swaps and FPU instructions (with state-dependent > hidden costs) but processsor reconfiguration. Achieving "fairness" > according to any standard that a platform integrator cares about (let > alone an end user) will require a fairly detailed model of the hidden > costs associated with different sorts of task switch. > > So if you are interested in schedulers for some reason other than a > paycheck, let the distros worry about 5% improvements on x86[_64]. > Get hold of some different "hardware" -- say: > - a Xilinx ML410 if you've got $3K to blow and want to explore > reconfigurable processors; > - a SunFire T2000 if you've got $11K and want to mess with a CMT > system that's actually shipping; > - a QEMU-simulated massively SMP x86 if you're poor but clever > enough to implement funky cross-core cache effects yourself; or > - a cycle-accurate simulator from Gaisler or Virtio if you want a > real research project. > Then go explore some more interesting regions of parameter space and > see what the demands on mainline Linux will look like in a few years. There are no doubt improvements to be made, but they are generally intended to be able to be done within the sched-domains framework. I am not aware of a particular need that would be impossible to do using that topology hierarchy and per-CPU runqueues, and there are added complications involved with multiple CPUs per runqueue. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:55 ` Nick Piggin @ 2007-04-17 4:25 ` Peter Williams 2007-04-17 4:34 ` Nick Piggin 2007-04-17 8:24 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 4:25 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >>> Note that I talk of run queues >>> not CPUs as I think a shift to multiple CPUs per run queue may be a good >>> idea. >> This observation of Peter's is the best thing to come out of this >> whole foofaraw. Looking at what's happening in CPU-land, I think it's >> going to be necessary, within a couple of years, to replace the whole >> idea of "CPU scheduling" with "run queue scheduling" across a complex, >> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >> point in churning the mainline scheduler through a design that isn't >> significantly more flexible than any of those now under discussion. > > Why? If you do that, then your load balancer just becomes less flexible > because it is harder to have tasks run on one or the other. > > You can have single-runqueue-per-domain behaviour (or close to) just by > relaxing all restrictions on idle load balancing within that domain. It > is harder to go the other way and place any per-cpu affinity or > restirctions with multiple cpus on a single runqueue. Allowing N (where N can be one or greater) CPUs per run queue actually increases flexibility as you can still set N to 1 to get the current behaviour. One advantage of allowing multiple CPUs per run queue would be at the smaller end of the system scale i.e. a PC with a single hyper threading chip (i.e. 2 CPUs) would not need to worry about load balancing at all if both CPUs used the one runqueue and all the nasty side effects that come with hyper threading would be minimized at the same time. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
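A minimal sketch of the arrangement Peter is suggesting, assuming two hyperthreads per package: both siblings index into the same shared queue, so there is nothing to balance between them and whichever sibling becomes free next simply takes the most eligible task. The mapping and names below are invented for illustration and are not taken from plugsched or the mainline scheduler.

/*
 * Minimal sketch of "more than one CPU per run queue": SMT siblings
 * share a single queue, so no balancing is needed between them and a
 * queued task is visible to whichever sibling becomes free first.
 * Illustrative only; the queue is reduced to a counter and a priority.
 */
#include <stdio.h>

#define NR_CPUS		4
#define CPUS_PER_RQ	2	/* e.g. two hyperthreads per package */

struct shared_rq {
	int nr_running;
	int best_prio_task;	/* stand-in for a real queue of tasks */
};

static struct shared_rq rqs[NR_CPUS / CPUS_PER_RQ];

static struct shared_rq *cpu_rq(int cpu)
{
	return &rqs[cpu / CPUS_PER_RQ];		/* siblings map to one queue */
}

static int pick_next(int cpu)
{
	struct shared_rq *rq = cpu_rq(cpu);

	if (!rq->nr_running)
		return -1;			/* nothing to do: go idle */
	rq->nr_running--;
	return rq->best_prio_task;
}

int main(void)
{
	cpu_rq(0)->nr_running = 1;
	cpu_rq(0)->best_prio_task = 42;

	/* CPU 1 is CPU 0's sibling, so it sees the task without any balancing. */
	printf("cpu1 picked task %d\n", pick_next(1));
	/* CPU 2 belongs to the other package and its queue is empty. */
	printf("cpu2 picked %d (idle)\n", pick_next(2));
	return 0;
}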
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:25 ` Peter Williams @ 2007-04-17 4:34 ` Nick Piggin 2007-04-17 6:03 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 4:34 UTC (permalink / raw) To: Peter Williams Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: > >>On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > >>>Note that I talk of run queues > >>>not CPUs as I think a shift to multiple CPUs per run queue may be a good > >>>idea. > >>This observation of Peter's is the best thing to come out of this > >>whole foofaraw. Looking at what's happening in CPU-land, I think it's > >>going to be necessary, within a couple of years, to replace the whole > >>idea of "CPU scheduling" with "run queue scheduling" across a complex, > >>possibly dynamic mix of CPU-ish resources. Ergo, there's not much > >>point in churning the mainline scheduler through a design that isn't > >>significantly more flexible than any of those now under discussion. > > > >Why? If you do that, then your load balancer just becomes less flexible > >because it is harder to have tasks run on one or the other. > > > >You can have single-runqueue-per-domain behaviour (or close to) just by > >relaxing all restrictions on idle load balancing within that domain. It > >is harder to go the other way and place any per-cpu affinity or > >restirctions with multiple cpus on a single runqueue. > > Allowing N (where N can be one or greater) CPUs per run queue actually > increases flexibility as you can still set N to 1 to get the current > behaviour. But you add extra code for that on top of what we have, and are also prevented from making per-cpu assumptions. And you can get N CPUs per runqueue behaviour by having them in a domain with no restrictions on idle balancing. So where does your increased flexibilty come from? > One advantage of allowing multiple CPUs per run queue would be at the > smaller end of the system scale i.e. a PC with a single hyper threading > chip (i.e. 2 CPUs) would not need to worry about load balancing at all > if both CPUs used the one runqueue and all the nasty side effects that > come with hyper threading would be minimized at the same time. I don't know about that -- the current load balancer already minimises the nasty multi threading effects. SMT is very important for IBM's chips for example, and they've never had any problem with that side of it since it was introduced and bugs ironed out (at least, none that I've heard). ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:34 ` Nick Piggin @ 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 6:03 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >>>> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >>>>> Note that I talk of run queues >>>>> not CPUs as I think a shift to multiple CPUs per run queue may be a good >>>>> idea. >>>> This observation of Peter's is the best thing to come out of this >>>> whole foofaraw. Looking at what's happening in CPU-land, I think it's >>>> going to be necessary, within a couple of years, to replace the whole >>>> idea of "CPU scheduling" with "run queue scheduling" across a complex, >>>> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >>>> point in churning the mainline scheduler through a design that isn't >>>> significantly more flexible than any of those now under discussion. >>> Why? If you do that, then your load balancer just becomes less flexible >>> because it is harder to have tasks run on one or the other. >>> >>> You can have single-runqueue-per-domain behaviour (or close to) just by >>> relaxing all restrictions on idle load balancing within that domain. It >>> is harder to go the other way and place any per-cpu affinity or >>> restirctions with multiple cpus on a single runqueue. >> Allowing N (where N can be one or greater) CPUs per run queue actually >> increases flexibility as you can still set N to 1 to get the current >> behaviour. > > But you add extra code for that on top of what we have, and are also > prevented from making per-cpu assumptions. > > And you can get N CPUs per runqueue behaviour by having them in a domain > with no restrictions on idle balancing. So where does your increased > flexibilty come from? > >> One advantage of allowing multiple CPUs per run queue would be at the >> smaller end of the system scale i.e. a PC with a single hyper threading >> chip (i.e. 2 CPUs) would not need to worry about load balancing at all >> if both CPUs used the one runqueue and all the nasty side effects that >> come with hyper threading would be minimized at the same time. > > I don't know about that -- the current load balancer already minimises > the nasty multi threading effects. SMT is very important for IBM's chips > for example, and they've never had any problem with that side of it > since it was introduced and bugs ironed out (at least, none that I've > heard). > There's a lot of ugly code in the load balancer that is only there to overcome the side effects of SMT and dual core. A lot of it was put there by Intel employees trying to make load balancing more friendly to their systems. What I'm suggesting is that an N CPUs per runqueue is a better way of achieving that end. I may (of course) be wrong but I think that the idea deserves more consideration than you're willing to give it. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." 
-- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams @ 2007-04-17 6:14 ` William Lee Irwin III 2007-04-17 6:23 ` Nick Piggin 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:14 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, Michael K. Edwards, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote: > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly to > their systems. What I'm suggesting is that an N CPUs per runqueue is a > better way of achieving that end. I may (of course) be wrong but I > think that the idea deserves more consideration than you're willing to > give it. This may be a good one to ask Ingo about, as he did significant performance work on per-core runqueues for SMT. While I did write per-node runqueue code for NUMA at some point in the past, I did no tuning or other performance work on it, only functionality. I've actually dealt with kernels using elder versions of Ingo's code for per-core runqueues on SMT, but was never called upon to examine that particular code for either performance or stability, so I'm largely ignorant of what the perceived outcome of it was. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III @ 2007-04-17 6:23 ` Nick Piggin 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:23 UTC (permalink / raw) To: Peter Williams Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote: > Nick Piggin wrote: > > > >But you add extra code for that on top of what we have, and are also > >prevented from making per-cpu assumptions. > > > >And you can get N CPUs per runqueue behaviour by having them in a domain > >with no restrictions on idle balancing. So where does your increased > >flexibilty come from? > > > >>One advantage of allowing multiple CPUs per run queue would be at the > >>smaller end of the system scale i.e. a PC with a single hyper threading > >>chip (i.e. 2 CPUs) would not need to worry about load balancing at all > >>if both CPUs used the one runqueue and all the nasty side effects that > >>come with hyper threading would be minimized at the same time. > > > >I don't know about that -- the current load balancer already minimises > >the nasty multi threading effects. SMT is very important for IBM's chips > >for example, and they've never had any problem with that side of it > >since it was introduced and bugs ironed out (at least, none that I've > >heard). > > > > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly to I agree that some of that has exploded complexity. I have some thoughts about better approaches for some of those things, but basically been stuck working on VM problems for a while. > their systems. What I'm suggesting is that an N CPUs per runqueue is a > better way of achieving that end. I may (of course) be wrong but I > think that the idea deserves more consideration than you're willing to > give it. Put it this way: it is trivial to group the load balancing stats of N CPUs with their own runqueues. Just put them under a domain and take the sum. The domain essentially takes on the same function as a single queue with N CPUs under it. Anything _further_ you can do with individual runqueues (like naturally adding an affinity pressure ranging from nothing to absolute) are things that you don't trivially get with 1:N approach. AFAIKS. So I will definitely give any idea consideration, but I just need to be shown where the benefit comes from. ^ permalink raw reply [flat|nested] 304+ messages in thread
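A rough sketch of the point Nick is making here: keeping one runqueue per CPU does not prevent group-level decisions, because a domain can treat the sum of its members' loads as a single figure and balance between those sums. This is a drastic simplification of the real sched-domains code; the structures and numbers below are illustrative only.

/*
 * Simplified illustration of grouping per-CPU runqueue load under a
 * domain: each group's load is just the sum of its members', and the
 * balancer compares group sums rather than individual queues.  Not the
 * real sched-domains code; spans and loads are made up for the example.
 */
#include <stdio.h>

#define NR_CPUS 4

static unsigned long cpu_load[NR_CPUS] = { 3, 1, 7, 5 };

struct sched_group_sketch {
	int first_cpu, nr_cpus;		/* contiguous span, for simplicity */
};

static unsigned long group_load(const struct sched_group_sketch *g)
{
	unsigned long sum = 0;

	for (int i = 0; i < g->nr_cpus; i++)
		sum += cpu_load[g->first_cpu + i];
	return sum;
}

int main(void)
{
	/* Two groups of two CPUs each, e.g. two SMT packages in one domain. */
	struct sched_group_sketch a = { 0, 2 }, b = { 2, 2 };
	unsigned long la = group_load(&a), lb = group_load(&b);

	printf("group A load %lu, group B load %lu: pull from %s\n",
	       la, lb, la > lb ? "A" : "B");
	return 0;
}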
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III 2007-04-17 6:23 ` Nick Piggin @ 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:36 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, Michael K. Edwards, William Lee Irwin III, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Peter Williams <pwil3058@bigpond.net.au> wrote: > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly > to their systems. What I'm suggesting is that an N CPUs per runqueue > is a better way of achieving that end. I may (of course) be wrong but > I think that the idea deserves more consideration than you're willing > to give it. i actually implemented that some time ago and i'm afraid it was ugly as hell and pretty fragile. Load-balancing gets simpler, but task picking gets a lot uglier. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:55 ` Nick Piggin 2007-04-17 4:25 ` Peter Williams @ 2007-04-17 8:24 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 8:24 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, Peter Williams, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >> This observation of Peter's is the best thing to come out of this >> whole foofaraw. Looking at what's happening in CPU-land, I think it's >> going to be necessary, within a couple of years, to replace the whole >> idea of "CPU scheduling" with "run queue scheduling" across a complex, >> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >> point in churning the mainline scheduler through a design that isn't >> significantly more flexible than any of those now under discussion. On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote: > Why? If you do that, then your load balancer just becomes less flexible > because it is harder to have tasks run on one or the other. On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote: > You can have single-runqueue-per-domain behaviour (or close to) just by > relaxing all restrictions on idle load balancing within that domain. It > is harder to go the other way and place any per-cpu affinity or > restirctions with multiple cpus on a single runqueue. The big sticking point here is order-sensitivity. One can point to stringent sched_yield() ordering but that's not so important in and of itself. The more significant case is RT applications which are order- sensitive. Per-cpu runqueues rather significantly disturb the ordering requirements of applications that care about it. In terms of a plugging framework, the per-cpu arrangement precludes or makes extremely awkward scheduling policies that don't have per-cpu runqueues, for instance, the 2.4.x policy. There is also the alternate SMP scalability strategy of a lockless scheduler with a single global queue, which is more performance-oriented. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
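William's aside about a lockless scheduler with a single global queue is worth a concrete picture. Below is a textbook compare-and-swap push and pop on one global list, written with C11 atomics; it only shows the flavour of lockless queue manipulation and is not a scheduler. A real lockless run queue would also need priority or deadline ordering and a proper answer to the ABA problem on removal, neither of which is attempted here.

/*
 * Flavour of "all queue manipulations via lockless operations": a
 * textbook lock-free push/pop on a single global list using
 * compare-and-swap (C11 atomics).  Illustrative only: there is no
 * ordering policy and the ABA problem on pop is ignored.
 */
#include <stdatomic.h>
#include <stdio.h>

struct task {
	int pid;
	struct task *next;
};

static _Atomic(struct task *) global_queue;

static void lockless_push(struct task *t)
{
	struct task *old = atomic_load(&global_queue);

	do {
		t->next = old;	/* old is refreshed by a failed CAS */
	} while (!atomic_compare_exchange_weak(&global_queue, &old, t));
}

static struct task *lockless_pop(void)
{
	struct task *old = atomic_load(&global_queue);

	while (old && !atomic_compare_exchange_weak(&global_queue, &old, old->next))
		;		/* retry with the refreshed head */
	return old;
}

int main(void)
{
	struct task a = { .pid = 1 }, b = { .pid = 2 };
	struct task *first, *second;

	lockless_push(&a);
	lockless_push(&b);
	first = lockless_pop();
	second = lockless_pop();
	printf("popped pid %d then pid %d\n", first->pid, second->pid);
	return 0;
}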
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] [not found] ` <20070417064109.GP8915@holomorphy.com> @ 2007-04-17 8:00 ` Peter Williams 2007-04-17 10:41 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 8:00 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 04:34:36PM +1000, Peter Williams wrote: >> This doesn't make any sense to me. >> For a start, exact simultaneous operation would be impossible to achieve >> except with highly specialized architecture such as the long departed >> transputer. And secondly, I can't see why it's necessary. > > We're not going to make any headway here, so we might as well drop the > thread. Yes, we were starting to go around in circles, weren't we? > > There are other things to talk about anyway, for instance I'm seeing > interest in plugsched come about from elsewhere and am taking an > interest in getting it into shape wrt. various design goals therefore. > > Probably the largest issue of note is getting scheduler drivers > loadable as kernel modules. Addressing the points Ingo made that can > be addressed are also lined up for this effort. > > Comments on which directions you'd like this to go in these respects > would be appreciated, as I regard you as the current "project owner." I'd do a scan through LKML from about 18 months ago looking for mention of a runtime configurable version of plugsched. Some students at a university (in Germany, I think) posted some patches adding this feature to plugsched around about then. I never added them to plugsched proper as I knew (from previous experience when the company I worked for posted patches with similar functionality) that Linus would like this idea less than he did the current plugsched mechanism. Unfortunately, my own cache of the relevant e-mails got overwritten during a Fedora Core upgrade (I've since moved /var onto a separate drive to avoid a repetition) or I would dig them out and send them to you. I'd provided them with copies of the company's patches to use as a guide to how to overcome the problems associated with changing schedulers on a running system (a few non trivial locking issues pop up). Maybe if one of the students still reads LKML he will provide a pointer. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:00 ` Peter Williams @ 2007-04-17 10:41 ` William Lee Irwin III 2007-04-17 13:48 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 10:41 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: >> Comments on which directions you'd like this to go in these respects >> would be appreciated, as I regard you as the current "project owner." On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > I'd do scan through LKML from about 18 months ago looking for mention of > runtime configurable version of plugsched. Some students at a > university (in Germany, I think) posted some patches adding this feature > to plugsched around about then. Excellent. I'll go hunting for that. On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > I never added them to plugsched proper as I knew (from previous > experience when the company I worked for posted patches with similar > functionality) that Linux would like this idea less than he did the > current plugsched mechanism. Odd how the requirements ended up including that. Fickleness abounds. If only we knew up-front what the end would be. On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > Unfortunately, my own cache of the relevant e-mails got overwritten > during a Fedora Core upgrade (I've since moved /var onto a separate > drive to avoid a repetition) or I would dig them out and send them to > you. I'd provided with copies of the company's patches to use as a > guide to how to overcome the problems associated with changing > schedulers on a running system (a few non trivial locking issues pop up). > Maybe if one of the students still reads LKML he will provide a pointer. I was tempted to restart from scratch given Ingo's comments, but I reconsidered and I'll be working with your code (and the German students' as well). If everything has to change, so be it, but it'll still be a derived work. It would be ignoring precedent and failure to properly attribute if I did otherwise. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 10:41 ` William Lee Irwin III @ 2007-04-17 13:48 ` Peter Williams 2007-04-18 0:27 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 13:48 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> Comments on which directions you'd like this to go in these respects >>> would be appreciated, as I regard you as the current "project owner." > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> I'd do scan through LKML from about 18 months ago looking for mention of >> runtime configurable version of plugsched. Some students at a >> university (in Germany, I think) posted some patches adding this feature >> to plugsched around about then. > > Excellent. I'll go hunting for that. > > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> I never added them to plugsched proper as I knew (from previous >> experience when the company I worked for posted patches with similar >> functionality) that Linux would like this idea less than he did the >> current plugsched mechanism. > > Odd how the requirements ended up including that. Fickleness abounds. > If only we knew up-front what the end would be. > > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> Unfortunately, my own cache of the relevant e-mails got overwritten >> during a Fedora Core upgrade (I've since moved /var onto a separate >> drive to avoid a repetition) or I would dig them out and send them to >> you. I'd provided with copies of the company's patches to use as a >> guide to how to overcome the problems associated with changing >> schedulers on a running system (a few non trivial locking issues pop up). >> Maybe if one of the students still reads LKML he will provide a pointer. > > I was tempted to restart from scratch given Ingo's comments, but I > reconsidered and I'll be working with your code (and the German > students' as well). If everything has to change, so be it, but it'll > still be a derived work. It would be ignoring precedent and failure to > properly attribute if I did otherwise. I can give you a patch (or set of patches) against the latest git vanilla kernel version if that would help. There have been changes to the vanilla scheduler code since 2.6.20, so the latest patch on sourceforge won't apply cleanly. I've found that implementing this as a series of patches rather than one big patch makes it easier for me to cope with changes to the underlying code. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:48 ` Peter Williams @ 2007-04-18 0:27 ` Peter Williams 2007-04-18 2:03 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-18 0:27 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List Peter Williams wrote: > William Lee Irwin III wrote: >> I was tempted to restart from scratch given Ingo's comments, but I >> reconsidered and I'll be working with your code (and the German >> students' as well). If everything has to change, so be it, but it'll >> still be a derived work. It would be ignoring precedent and failure to >> properly attribute if I did otherwise. > > I can give you a patch (or set of patches) against the latest git > vanilla kernel version if that would help. There have been changes to > the vanilla scheduler code since 2.6.20 so the latest patch on > sourceforge won't apply cleanly. I've found that implementing this as a > series of patches rather than one big patch makes it easier fro me to > cope with changes to the underlying code. I've just placed a single patch for plugsched against 2.6.21-rc7 updated to Linus's git tree as of an hour or two ago on sourceforge: <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> This should at least enable you to get it to apply cleanly to the latest kernel sources. Let me know if you'd also like this as a quilt/mq friendly patch series? Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 0:27 ` Peter Williams @ 2007-04-18 2:03 ` William Lee Irwin III 2007-04-18 2:31 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 2:03 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List > Peter Williams wrote: > >William Lee Irwin III wrote: > >>I was tempted to restart from scratch given Ingo's comments, but I > >>reconsidered and I'll be working with your code (and the German > >>students' as well). If everything has to change, so be it, but it'll > >>still be a derived work. It would be ignoring precedent and failure to > >>properly attribute if I did otherwise. > > > >I can give you a patch (or set of patches) against the latest git > >vanilla kernel version if that would help. There have been changes to > >the vanilla scheduler code since 2.6.20 so the latest patch on > >sourceforge won't apply cleanly. I've found that implementing this as a > >series of patches rather than one big patch makes it easier fro me to > >cope with changes to the underlying code. > On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote: > I've just placed a single patch for plugsched against 2.6.21-rc7 updated > to Linus's git tree as of an hour or two ago on sourceforge: > <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> > This should at least enable you to get it to apply cleanly to the latest > kernel sources. Let me know if you'd also like this as a quilt/mq > friendly patch series? A quilt-friendly series would be most excellent if you could arrange it. Thanks. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 2:03 ` William Lee Irwin III @ 2007-04-18 2:31 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-18 2:31 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: >> Peter Williams wrote: >>> William Lee Irwin III wrote: >>>> I was tempted to restart from scratch given Ingo's comments, but I >>>> reconsidered and I'll be working with your code (and the German >>>> students' as well). If everything has to change, so be it, but it'll >>>> still be a derived work. It would be ignoring precedent and failure to >>>> properly attribute if I did otherwise. >>> I can give you a patch (or set of patches) against the latest git >>> vanilla kernel version if that would help. There have been changes to >>> the vanilla scheduler code since 2.6.20 so the latest patch on >>> sourceforge won't apply cleanly. I've found that implementing this as a >>> series of patches rather than one big patch makes it easier fro me to >>> cope with changes to the underlying code. > On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote: >> I've just placed a single patch for plugsched against 2.6.21-rc7 updated >> to Linus's git tree as of an hour or two ago on sourceforge: >> <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> >> This should at least enable you to get it to apply cleanly to the latest >> kernel sources. Let me know if you'd also like this as a quilt/mq >> friendly patch series? > > A quilt-friendly series would be most excellent if you could arrange it. Done: <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch-series.tar.gz> Just untar this in the base directory of your Linux kernel source and Bob's your uncle. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 1:06 ` Peter Williams 2007-04-16 3:04 ` William Lee Irwin III @ 2007-04-16 17:22 ` Chris Friesen 2007-04-17 0:54 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: Chris Friesen @ 2007-04-16 17:22 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > To my mind scheduling > and load balancing are orthogonal and keeping them that way simplifies > things. Scuse me if I jump in here, but doesn't the load balancer need some way to figure out a) when to run, and b) which tasks to pull and where to push them? I suppose you could abstract this into a per-scheduler API, but to me at least these are the hard parts of the load balancer... Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 17:22 ` Chris Friesen @ 2007-04-17 0:54 ` Peter Williams 2007-04-17 15:52 ` Chris Friesen 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 0:54 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: > >> To my mind scheduling and load balancing are orthogonal and keeping >> them that way simplifies things. > > Scuse me if I jump in here, but doesn't the load balancer need some way > to figure out a) when to run, and b) which tasks to pull and where to > push them? Yes, but both of these are independent of the scheduler discipline in force. > > I suppose you could abstract this into a per-scheduler API, but to me at > least these are the hard parts of the load balancer... Load balancing needs to be based on the static priorities (i.e. nice or real-time priority) of the runnable tasks, not the dynamic priorities. If the load balancer manages to keep the weighted (according to static priority) load and distribution of priorities within the loads on the CPUs roughly equal, and the scheduler does a good job of ensuring fairness, interactive responsiveness etc. for the tasks within a CPU, then the result will be good system performance within the constraints set by the sys admin's use of real-time priorities and nice. The smpnice modifications to the load balancer were meant to give it the appropriate behaviour, and what we need to fix now is the intra-CPU scheduling. Even if the load balancer isn't yet perfect, perfecting it can be done separately from fixing the scheduler, preferably with as little interdependency as possible. Probably the only contribution to load balancing that the scheduler really needs to make is the calculation of the average weighted load on each of the CPUs (or run queues if there's more than one CPU per runqueue). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
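As a rough illustration of the weighted-load idea Peter describes -- a per-CPU figure summed from static-priority weights, with no reference to dynamic priority -- a sketch along the lines below is enough to convey it. The weight scale here is made up for the example; it is not the actual smpnice table.

    #include <stddef.h>

    /* Illustrative nice-to-weight mapping: lower nice (higher static
     * priority) counts for more.  The scale is invented for this sketch. */
    static unsigned int nice_to_weight(int nice)
    {
            return (unsigned int)(48 - 2 * nice);   /* nice -20 -> 88, 0 -> 48, 19 -> 10 */
    }

    struct demo_task {
            int nice;                               /* static priority only */
    };

    /* The per-CPU figure the balancer compares: the sum of the static
     * weights of the runnable tasks on that CPU. */
    static unsigned long weighted_cpu_load(const struct demo_task *tasks, size_t nr)
    {
            unsigned long load = 0;
            size_t i;

            for (i = 0; i < nr; i++)
                    load += nice_to_weight(tasks[i].nice);
            return load;
    }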
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 0:54 ` Peter Williams @ 2007-04-17 15:52 ` Chris Friesen 2007-04-17 23:50 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Chris Friesen @ 2007-04-17 15:52 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Chris Friesen wrote: >> Scuse me if I jump in here, but doesn't the load balancer need some >> way to figure out a) when to run, and b) which tasks to pull and where >> to push them? > Yes but both of these are independent of the scheduler discipline in force. It is not clear to me that this is always the case, especially once you mix in things like resource groups. > If > the load balancer manages to keep the weighted (according to static > priority) load and distribution of priorities within the loads on the > CPUs roughly equal and the scheduler does a good job of ensuring > fairness, interactive responsiveness etc. for the tasks within a CPU > then the result will be good system performance within the constraints > set by the sys admins use of real time priorities and nice. Suppose I have a really high priority task running. Another very high priority task wakes up and would normally preempt the first one. However, there happens to be another cpu available. It seems like it would be a win if we moved one of those tasks to the available cpu immediately so they can both run simultaneously. This would seem to require some communication between the scheduler and the load balancer. Certainly the above design could introduce a lot of context switching. But if my goal is a scheduler that minimizes latency (even at the cost of throughput) then that's an acceptable price to pay. Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 15:52 ` Chris Friesen @ 2007-04-17 23:50 ` Peter Williams 2007-04-18 5:43 ` Chris Friesen 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 23:50 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: >> Chris Friesen wrote: >>> Scuse me if I jump in here, but doesn't the load balancer need some >>> way to figure out a) when to run, and b) which tasks to pull and >>> where to push them? > >> Yes but both of these are independent of the scheduler discipline in >> force. > > It is not clear to me that this is always the case, especially once you > mix in things like resource groups. > >> If >> the load balancer manages to keep the weighted (according to static >> priority) load and distribution of priorities within the loads on the >> CPUs roughly equal and the scheduler does a good job of ensuring >> fairness, interactive responsiveness etc. for the tasks within a CPU >> then the result will be good system performance within the constraints >> set by the sys admins use of real time priorities and nice. > > Suppose I have a really high priority task running. Another very high > priority task wakes up and would normally preempt the first one. > However, there happens to be another cpu available. It seems like it > would be a win if we moved one of those tasks to the available cpu > immediately so they can both run simultaneously. This would seem to > require some communication between the scheduler and the load balancer. Not really; the load balancer can do this on its own, AND the decision should be based on the STATIC priority of the task being woken. > > Certainly the above design could introduce a lot of context switching. > But if my goal is a scheduler that minimizes latency (even at the cost > of throughput) then that's an acceptable price to pay. It would actually probably reduce context switching, as putting the woken task on the best CPU at wake-up means you don't have to move it later on. The wake-up code already does a little bit in this direction when it chooses which CPU to put a newly woken task on, but could do more -- the only real cost would be the cost of looking at more candidate CPUs than it currently does. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
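The wake-up placement Peter has in mind can be pictured roughly as below: scan the CPUs the task is allowed on and prefer the least loaded (ideally idle) one, so the woken task starts where it can run immediately instead of being migrated later. Everything here is a stand-in for illustration, not the kernel's actual wake-up code.

    /* Illustrative wake-up placement over a handful of CPUs. */
    #define NR_DEMO_CPUS 4

    /* Weighted, static-priority based load per CPU (see the earlier sketch). */
    static unsigned long demo_cpu_load[NR_DEMO_CPUS];

    static int demo_select_cpu_on_wakeup(unsigned int allowed_mask)
    {
            unsigned long best_load = ~0UL;
            int cpu, best = -1;

            for (cpu = 0; cpu < NR_DEMO_CPUS; cpu++) {
                    if (!(allowed_mask & (1u << cpu)))
                            continue;
                    /* An idle CPU (load 0) always wins; otherwise take the
                     * least loaded allowed CPU.  The only extra cost is
                     * looking at more candidates than the current code does. */
                    if (demo_cpu_load[cpu] < best_load) {
                            best_load = demo_cpu_load[cpu];
                            best = cpu;
                    }
            }
            return best;
    }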
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:50 ` Peter Williams @ 2007-04-18 5:43 ` Chris Friesen 2007-04-18 13:00 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Chris Friesen @ 2007-04-18 5:43 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Chris Friesen wrote: >> Suppose I have a really high priority task running. Another very high >> priority task wakes up and would normally preempt the first one. >> However, there happens to be another cpu available. It seems like it >> would be a win if we moved one of those tasks to the available cpu >> immediately so they can both run simultaneously. This would seem to >> require some communication between the scheduler and the load balancer. > > > Not really the load balancer can do this on its own AND the decision > should be based on the STATIC priority of the task being woken. I guess I don't follow. How would the load balancer know that it needs to run? Running on every task wake-up seems expensive. Also, static priority isn't everything. What about the gang-scheduler concept where certain tasks must be scheduled simultaneously on different cpus? What about a resource-group scenario where you have per-cpu resource limits, so that for good latency/fairness you need to force a high priority task to migrate to another cpu once it has consumed the cpu allocation of that group on the current cpu? I can see having a generic load balancer core code, but it seems to me that the scheduler proper needs to have some way of triggering the load balancer to run, and some kind of goodness functions to indicate a) which tasks to move, and b) where to move them. Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:43 ` Chris Friesen @ 2007-04-18 13:00 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-18 13:00 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: >> Chris Friesen wrote: > >>> Suppose I have a really high priority task running. Another very >>> high priority task wakes up and would normally preempt the first one. >>> However, there happens to be another cpu available. It seems like it >>> would be a win if we moved one of those tasks to the available cpu >>> immediately so they can both run simultaneously. This would seem to >>> require some communication between the scheduler and the load balancer. >> >> >> Not really the load balancer can do this on its own AND the decision >> should be based on the STATIC priority of the task being woken. > > I guess I don't follow. How would the load balancer know that it needs > to run? Running on every task wake-up seems expensive. Also, static > priority isn't everything. What about the gang-scheduler concept where > certain tasks must be scheduled simultaneously on different cpus? What > about a resource-group scenario where you have per-cpu resource limits, > so that for good latency/fairness you need to force a high priority task > to migrate to another cpu once it has consumed the cpu allocation of > that group on the current cpu? > > I can see having a generic load balancer core code, but it seems to me > that the scheduler proper needs to have some way of triggering the load > balancer to run, It doesn't have to be closely coupled with the load balancer to do this. It just needs to know where the trigger is. > and some kind of goodness functions to indicate a) > which tasks to move, and b) where to move them. That's the load balancer's job, and even if you use dynamic priority for load balancing it still wouldn't need to be closely coupled. The load balancer would just need to know how to find a process's dynamic priority. In fact, in the current setup, the load balancer decides how much load needs to be moved based on the static load on the CPUs, but uses dynamic priority (to a large degree) to decide which ones to move. This is due more to computational efficiency considerations than to any deliberate design (I suspect), as the fact that tasks are stored on the runqueue in dynamic priority order makes looking at processes in dynamic priority order the most efficient strategy. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
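The split Peter describes -- how much to move is computed from static weights, while which tasks to move falls out of walking the queue in the order it is already sorted in -- looks roughly like this; the structures and names are purely illustrative.

    #include <stddef.h>

    struct demo_task {
            unsigned int    weight;         /* derived from static priority */
            int             picked;         /* marked for migration */
    };

    /* 'queue' is assumed to be sorted in dynamic-priority order already,
     * which is why walking it in that order is essentially free.  Tasks
     * are pulled until enough static weight has been moved. */
    static void demo_pick_tasks_to_move(struct demo_task *queue, size_t nr,
                                        unsigned long weight_to_move)
    {
            unsigned long moved = 0;
            size_t i;

            for (i = 0; i < nr && moved < weight_to_move; i++) {
                    queue[i].picked = 1;
                    moved += queue[i].weight;
            }
    }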
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:05 ` Ingo Molnar 2007-04-15 20:05 ` Matt Mackall @ 2007-04-16 5:16 ` Con Kolivas 2007-04-16 5:48 ` Gene Heskett 1 sibling, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-16 5:16 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 01:05, Ingo Molnar wrote: > * Con Kolivas <kernel@kolivas.org> wrote: > > 2. Since then I've been thinking/working on a cpu scheduler design > > that takes away all the guesswork out of scheduling and gives very > > predictable, as fair as possible, cpu distribution and latency while > > preserving as solid interactivity as possible within those confines. > > yeah. I think you were right on target with this call. Yay thank goodness :) It's time to fix the damn cpu scheduler once and for all. Everyone uses this; it's no minor driver or $bigsmp or $bigram or $small_embedded_RT_hardware feature. > I've applied the > sched.c change attached at the bottom of this mail to the CFS patch, if > you dont mind. (or feel free to suggest some other text instead.) > * 2003-09-03 Interactivity tuning by Con Kolivas. > * 2004-04-02 Scheduler domains code by Nick Piggin > + * 2007-04-15 Con Kolivas was dead right: fairness matters! :) LOL that's awful. I'd prefer something meaningful like "Work begun on replacing all interactivity tuning with a fair virtual-deadline design by Con Kolivas". While you're at it, it's worth getting rid of a few slightly pointless name changes too. Don't rename SCHED_NORMAL yet again, and don't call all your things sched_fair blah_fair __blah_fair and so on. It means that anything else is by proxy going to be considered unfair. Leave SCHED_NORMAL as is, replace the use of the word _fair with _cfs. I don't really care how many copyright notices you put into our already noisy bootup but it's redundant since there is no choice; we all get the same cpu scheduler. > > 1. I tried in vain some time ago to push a working extensable > > pluggable cpu scheduler framework (based on wli's work) for the linux > > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he > > didn't like it) as being absolutely the wrong approach and that we > > should never do that. [...] > > i partially replied to that point to Will already, and i'd like to make > it clear again: yes, i rejected plugsched 2-3 years ago (which already > drifted away from wli's original codebase) and i would still reject it > today. No that was just me being flabbergasted by what appeared to be you posting your own plugsched. Note nowhere in the 40 iterations of rsdl->sd did I ask/suggest for plugsched. I said in my first announcement my aim was to create a scheduling policy robust enough for all situations rather than fantastic a lot of the time and awful sometimes. There are plenty of people ready to throw out arguments for plugsched now and I don't have the energy to continue that fight (I never did really). But my question still stands about this comment: > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] What exactly would be the purpose of such a module that governs nothing in particular? 
Since there'll be no pluggable scheduler by your admission it has no control over SCHED_NORMAL, and would require another scheduling policy for it to govern which there is no express way to use at the moment and people tend to just use the default without great effort. > First and foremost, please dont take such rejections too personally - i > had my own share of rejections (and in fact, as i mentioned it in a > previous mail, i had a fair number of complete project throwaways: > 4g:4g, in-kernel Tux, irqrate and many others). I know that they can > hurt and can demoralize, but if i dont like something it's my job to > tell that. Hmm? No that's not what this is about. Remember dynticks which was not originally my code but I tried to bring it up to mainline standard which I fought with for months? You came along with yet another rewrite from scratch and the flaws in the design I was working with were obvious so I instantly bowed down to that and never touched my code again. I didn't ask for credit back then, but obviously brought the requirement for a no idle tick implementation to the table. > My view about plugsched: first please take a look at the latest > plugsched code: > > http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch > > 26 files changed, 8951 insertions(+), 1495 deletions(-) > > As an experiment i've removed all the add-on schedulers (both the core > and the include files, only kept the vanilla one) from the plugsched > patch (and the makefile and kconfig complications, etc), to see the > 'infrastructure cost', and it still gave: > > 12 files changed, 1933 insertions(+), 1479 deletions(-) I do not see extra code per-se as being a bad thing. I've heard said a few times before "ever notice how when the correct solution is done it is a lot more code than the quick hack that ultimately fails?". Insert long winded discussion of perfect is the enemy of good here, _but_ I'm not arguing perfect versus good, I'm talking about solid code versus quick fix. Again, none of this comment is directed specifically at this implementation of plugsched, its code quality or intent, but using "extra code is bad" as an argument is not enough. > By your logic Mike should in fact be quite upset about this: if the > new code works out and proves to be useful then it obsoletes a whole lot > of code of him! > > [...] However at one stage I virtually begged for support with my > > attempts and help with the code. Dmitry Adamushko is the only person > > who actually helped me with the code in the interim, while others > > poked sticks at it. Sure the sticks helped at times but the sticks > > always seemed to have their ends kerosene doused and flaming for > > reasons I still don't get. No other help was forthcoming. > Hey, i told this to you as recently as 1 month ago as well: > > http://lkml.org/lkml/2007/3/8/54 > > "cool! I like this even more than i liked your original staircase > scheduler from 2 years ago :)" Email has an awful knack of disguising intent so I took that on face value that you did like the idea :). Above when I said "no other help was forthcoming" all I was hoping for was really simple obvious bugfixes to help me along while I was laid up in bed such as "I like what you're doing but oh your use of memset here is bogus, here is a one line patch". I wasn't specifically expecting you to fix my code; you've got truckloads of things you need to do. It just reminds me that the concept of "release early, release often" doesn't actually work in the kernel. 
What is far more obvious is "release code only when it's so close to perfect that noone can argue against it" since most of the work is done by one person, otherwise someone will come out with a counterpatch that is _complete_ earlier but in all possibility not as good, it's just ready sooner. *NOTE* In no way am I saying your code is not as good as mine; I would have to say exactly the opposite is true pretty much always (<sarcasm>conversely then I doubt if I dropped you in my work environment you'd do as good a job as I do</sarcasm>). At one stage wli (again at my request) put together a quick hack to check for non-preemptible regions within the kernel. From that quick hack you adopted it and turned it into that beautiful latency tracer that is the cornerstone of the -rt tree testing. However, there are many instances I've seen good evolving code in the linux kernel be trumped by not-as-good but already-working alternatives written from scratch with no reference to the original work. This is the NIH (not invented here) mechanism I see happening that is worth objecting to. What you may find most amusing is the very first iterations of RSDL looked _nothing_ like the mainline scheduler. There were all sorts of different structures, mechanisms, one priority array, plans to remove scheduler_tick entirely and so on. Most of those were never made for public consumption. I spent about half a dozen iterations of RSDL removing all of that and making it as close to the mainline design as possible, thus minimising the size of the patch, and to make it readily readable for most people familiar with the scheduler policy code in sched.c (all 5 of them). I should have just said bugger it and started everything from scratch with little to no reference to the original scheduler but found myself obliged to try to do things the minimal code patch size readable difference thingy that was valued in linux kernel development. I think the radically different approach would have been better in the long run. Trying to play ball I ruined it. Either way I've decided for myself, my family, my career and my sanity I'm abandoning SD. I will shelve SD and try to have fond memories of SD as an intellectual prompting exercise only > Ingo -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:16 ` Con Kolivas @ 2007-04-16 5:48 ` Gene Heskett 0 siblings, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-16 5:48 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007, Con Kolivas wrote: And I snipped. Sorry, fellas. Con's original submission was, to me, quite an improvement. But I have to say it, and no denigration of your efforts is intended, Con, but you did 'pull the trigger' and get this thing rolling by scratching the itch & drawing attention to an ugly lack of user interactivity that had crept into the 2.6 family. So from me to Con, a tip of the hat, and a deep bow in your direction, thank you. Now, you have done what you aimed to do, so please get well. I've now been through most of an amanda session using Ingo's "CFS" and I have to say that it is another improvement over your 0.40 that is just as obvious as your first patch was against the stock scheduler. No other scheduler yet has allowed the full utilization of the cpu, and maintained user interactivity as well as this one has; my cpu is running about 5 degrees F hotter just from this effect alone. gzip, if the rest of the system is in between tasks, is consistently showing around 95%, but let anything else stick up its hand, like procmail etc, and gzip now dutifully steps aside, dropping into the 40% range until procmail and spamd are done, at which point there is no rest for the wicked and the cpu never gets a chance to cool. There was, just now, a pause of about 2 seconds, while amanda moved a tarball from the holding disk area on /dev/hda to the vtapes disk on /dev/hdd, so that would have been an I/O bound situation. This one, Ingo, even without any other patches (and I think I did see one go by in this thread which I didn't apply), is a definite keeper. Sweet even. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) A word to the wise is enough. -- Miguel de Cervantes ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (9 preceding siblings ...) 2007-04-15 3:27 ` Con Kolivas @ 2007-04-15 12:29 ` Esben Nielsen 2007-04-15 13:04 ` Ingo Molnar 2007-04-15 22:49 ` Ismail Dönmez ` (2 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Esben Nielsen @ 2007-04-15 12:29 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [...] I took a brief look at it. Have you tested priority inheritance? As far as I can see rt_mutex_setprio doesn't have much effect on SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task changes scheduler class when boosted in rt_mutex_setprio(). Esben ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:29 ` Esben Nielsen @ 2007-04-15 13:04 ` Ingo Molnar 2007-04-16 7:16 ` Esben Nielsen 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 13:04 UTC (permalink / raw) To: Esben Nielsen Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Esben Nielsen <nielsen.esben@googlemail.com> wrote: > I took a brief look at it. Have you tested priority inheritance? yeah, you are right, it's broken at the moment, i'll fix it. But the good news is that i think PI could become cleaner via scheduling classes. > As far as I can see rt_mutex_setprio doesn't have much effect on > SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task > change scheduler class when boosted in rt_mutex_setprio(). i think via scheduling classes we dont have to do the p->policy and p->prio based gymnastics anymore, we can just have a clean look at p->sched_class and stack the original scheduling class into p->real_sched_class. It would probably also make sense to 'privatize' p->prio into the scheduling class. That way PI would be a pure property of sched_rt, and the PI scheduler would be driven purely by p->rt_priority, not by p->prio. That way all the normal_prio() kind of complications and interactions with SCHED_OTHER/SCHED_FAIR would be eliminated as well. What do you think? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
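The stacking Ingo sketches -- remember the task's own class in a real_sched_class field, run it under the RT class while boosted, and restore it afterwards -- can be pictured like this. It is only a sketch of the idea; the field and helper names are illustrative, not what rt_mutex_setprio actually ended up doing.

    struct demo_sched_class { const char *name; };

    static const struct demo_sched_class demo_rt_sched_class   = { "rt" };
    static const struct demo_sched_class demo_fair_sched_class = { "fair" };

    struct demo_task {
            const struct demo_sched_class *sched_class;
            const struct demo_sched_class *real_sched_class; /* saved across a boost */
            int rt_priority;        /* PI driven purely by this, per the idea above */
    };

    static void demo_pi_boost(struct demo_task *p, int boosted_rt_prio)
    {
            if (p->sched_class != &demo_rt_sched_class) {
                    p->real_sched_class = p->sched_class;
                    p->sched_class = &demo_rt_sched_class;
            }
            p->rt_priority = boosted_rt_prio;
    }

    static void demo_pi_unboost(struct demo_task *p)
    {
            p->sched_class = p->real_sched_class;
    }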
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 13:04 ` Ingo Molnar @ 2007-04-16 7:16 ` Esben Nielsen 0 siblings, 0 replies; 304+ messages in thread From: Esben Nielsen @ 2007-04-16 7:16 UTC (permalink / raw) To: Ingo Molnar Cc: Esben Nielsen, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Ingo Molnar wrote: > > * Esben Nielsen <nielsen.esben@googlemail.com> wrote: > >> I took a brief look at it. Have you tested priority inheritance? > > yeah, you are right, it's broken at the moment, i'll fix it. But the > good news is that i think PI could become cleaner via scheduling > classes. > >> As far as I can see rt_mutex_setprio doesn't have much effect on >> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task >> change scheduler class when boosted in rt_mutex_setprio(). > > i think via scheduling classes we dont have to do the p->policy and > p->prio based gymnastics anymore, we can just have a clean look at > p->sched_class and stack the original scheduling class into > p->real_sched_class. It would probably also make sense to 'privatize' > p->prio into the scheduling class. That way PI would be a pure property > of sched_rt, and the PI scheduler would be driven purely by > p->rt_priority, not by p->prio. That way all the normal_prio() kind of > complications and interactions with SCHED_OTHER/SCHED_FAIR would be > eliminated as well. What do you think? > Now I have not read your patch in detail. But I agree it would be nice to have it more "OO" and remove cross references between schedulers. But first one should consider whether PI between SCHED_FAIR tasks is useful or not. Does PI among dynamic priorities make sense at all? I think it does: On heavily loaded systems, where a nice 19 task might not get the CPU for very long, a nice -20 task can be priority inverted for a very long time. But I see no need for it to take the dynamic part of the effective priorities into account. The current/old solution of mapping the static nice values into a global priority index which can incorporate the two scheduler classes is probably good enough - it just has to be "switched on" again :-) But what about other scheduler classes which some people want to add in the future? What about having a "cleaner design"? My thought was to generalize the concept of 'priority' to be an object (a struct prio) to be interpreted with help from a scheduler class instead of a globally interpreted integer.

int compare_prio(struct prio *a, struct prio *b)
{
	if (a->sched_class->class_prio < b->sched_class->class_prio)
		return -1;
	if (a->sched_class->class_prio > b->sched_class->class_prio)
		return +1;
	return a->sched_class->compare_prio(a, b);
}

Problem 1: Performance. Problem 2: Operations on a plist with these generalized priorities are not bounded because the number of different priorities is not bounded. Problem 2 could be solved by using a combined plist (for rt priorities) and rbtree (for fair priorities) - making operations logarithmic just as in the fair scheduler itself. But that would take more memory for every rtmutex. I conclude that is too complicated and go on to the obvious idea: Use a global priority index where each scheduler class gets its own range (rt: 0-99, fair 100-139 :-). Let the scheduler class have a function returning it instead of reading it directly from task_struct such that new scheduler classes can return their own numbers.
Esben > Ingo > ^ permalink raw reply [flat|nested] 304+ messages in thread
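The "obvious idea" Esben lands on -- each class exposing a function that maps its tasks into a disjoint range of one bounded global index -- can be sketched as below, using the ranges from his example (rt: 0-99, fair: 100-139). The code itself is illustrative only, not a patch.

    /* Sketch of a per-class global priority index: rt tasks map to 0..99,
     * fair tasks to 100..139, and a lower index always wins.  This keeps
     * the PI/rtmutex comparison a single bounded integer compare. */
    enum demo_class { DEMO_CLASS_RT, DEMO_CLASS_FAIR };

    struct demo_task {
            enum demo_class sched_class;
            int rt_priority;        /* 0..99, higher means more important */
            int nice;               /* -20..19 for the fair class */
    };

    static int demo_global_prio_index(const struct demo_task *p)
    {
            if (p->sched_class == DEMO_CLASS_RT)
                    return 99 - p->rt_priority;
            return 100 + (p->nice + 20);
    }

    static int demo_compare_prio(const struct demo_task *a, const struct demo_task *b)
    {
            return demo_global_prio_index(a) - demo_global_prio_index(b);
    }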
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (10 preceding siblings ...) 2007-04-15 12:29 ` Esben Nielsen @ 2007-04-15 22:49 ` Ismail Dönmez 2007-04-15 23:23 ` Arjan van de Ven 2007-04-16 11:58 ` Ingo Molnar 2007-04-16 22:00 ` Andi Kleen 2007-04-17 7:56 ` Andy Whitcroft 13 siblings, 2 replies; 304+ messages in thread From: Ismail Dönmez @ 2007-04-15 22:49 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 573 bytes --] Hi, On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch Tested this on top of Linus' GIT tree, but the system gets very unresponsive during high disk i/o using ext3 as the filesystem; even writing a 300mb file to a usb disk (an iPod, actually) has the same effect. Regards, ismail [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:49 ` Ismail Dönmez @ 2007-04-15 23:23 ` Arjan van de Ven 2007-04-15 23:33 ` Ismail Dönmez 0 siblings, 1 reply; 304+ messages in thread From: Arjan van de Ven @ 2007-04-15 23:23 UTC (permalink / raw) To: Ismail Dönmez Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote: > Hi, > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > [CFS] > > > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > Tested this on top of Linus' GIT tree but the system gets very unresponsive > during high disk i/o using ext3 as filesystem but even writing a 300mb file > to a usb disk (iPod actually) has the same affect. just to make sure; this exact same workload but with the stock scheduler does not have this effect? if so, then it could well be that the scheduler is too fair for its own good (being really fair inevitably ends up not batching as much as one should, and batching is needed to get any kind of decent performance out of disks nowadays) -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:23 ` Arjan van de Ven @ 2007-04-15 23:33 ` Ismail Dönmez 0 siblings, 0 replies; 304+ messages in thread From: Ismail Dönmez @ 2007-04-15 23:33 UTC (permalink / raw) To: Arjan van de Ven Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner On Monday 16 April 2007 02:23:08 Arjan van de Ven wrote: > On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote: > > Hi, > > > > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > > [CFS] > > > > > > i'm pleased to announce the first release of the "Modular Scheduler > > > Core and Completely Fair Scheduler [CFS]" patchset: > > > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > > > Tested this on top of Linus' GIT tree but the system gets very > > unresponsive during high disk i/o using ext3 as filesystem but even > > writing a 300mb file to a usb disk (iPod actually) has the same affect. > > just to make sure; this exact same workload but with the stock scheduler > does not have this effect? > > if so, then it could well be that the scheduler is too fair for it's own > good (being really fair inevitably ends up not batching as much as one > should, and batching is needed to get any kind of decent performance out > of disks nowadays) Tried with make install in kdepim (which made system sluggish with CFS) and the system is just fine (using CFQ). Regards, ismail ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:49 ` Ismail Dönmez 2007-04-15 23:23 ` Arjan van de Ven @ 2007-04-16 11:58 ` Ingo Molnar 2007-04-16 12:02 ` Ismail Dönmez 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 11:58 UTC (permalink / raw) To: Ismail Dönmez Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ismail Dönmez <ismail@pardus.org.tr> wrote: > Tested this on top of Linus' GIT tree but the system gets very > unresponsive during high disk i/o using ext3 as filesystem but even > writing a 300mb file to a usb disk (iPod actually) has the same > affect. hm. Is this an SMP system+kernel by any chance? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:58 ` Ingo Molnar @ 2007-04-16 12:02 ` Ismail Dönmez 0 siblings, 0 replies; 304+ messages in thread From: Ismail Dönmez @ 2007-04-16 12:02 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 14:58:54 Ingo Molnar wrote: > * Ismail Dönmez <ismail@pardus.org.tr> wrote: > > Tested this on top of Linus' GIT tree but the system gets very > > unresponsive during high disk i/o using ext3 as filesystem but even > > writing a 300mb file to a usb disk (iPod actually) has the same > > affect. > > hm. Is this an SMP system+kernel by any chance? Nope, both the kernel and the system are UP. Regards, ismail ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (11 preceding siblings ...) 2007-04-15 22:49 ` Ismail Dönmez @ 2007-04-16 22:00 ` Andi Kleen 2007-04-16 21:05 ` Ingo Molnar 2007-04-17 7:56 ` Andy Whitcroft 13 siblings, 1 reply; 304+ messages in thread From: Andi Kleen @ 2007-04-16 22:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar <mingo@elte.hu> writes: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch I would suggest to drop the tsc.c change. The "small errors" can be really large on some systems and you can also see large backward jumps. I have a proper (but complicated) solution pending in ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/sched-clock-share BTW with all this CPU time measurement it would be really nice to report it to the user too. It seems a bit bizarre that the scheduler keeps track of ns, but top only knows jiffies with large sampling errors. -Andi ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 22:00 ` Andi Kleen @ 2007-04-16 21:05 ` Ingo Molnar 2007-04-16 21:21 ` Andi Kleen 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 21:05 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel * Andi Kleen <andi@firstfloor.org> wrote: > > i'm pleased to announce the first release of the "Modular Scheduler > > Core and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > I would suggest to drop the tsc.c change. The "small errors" can be > really large on some systems and you can also see large backward > jumps. actually, i designed the CFS code assuming a per-CPU TSC (with no global synchronization), not assuming any globally sync TSC. In fact i wrote it on such systems: a CoreDuo2 box that stops the TSC in C3 and where the different cores have wildly different TSC values, and a dual-core Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock() change for now. > BTW with all this CPU time measurement it would be really nice to > report it to the user too. It seems a bit bizarre that the scheduler > keeps track of ns, but top only knows jiffies with large sampling > errors. yeah - i'll fix that too if someone doesnt beat me at it. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 21:05 ` Ingo Molnar @ 2007-04-16 21:21 ` Andi Kleen 0 siblings, 0 replies; 304+ messages in thread From: Andi Kleen @ 2007-04-16 21:21 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andi Kleen, linux-kernel > actually, i designed the CFS code assuming a per-CPU TSC (with no global > synchronization), not assuming any globally sync TSC. In fact i wrote it That already worked in the old scheduler (just in a hackish way) > on such systems: a CoreDuo2 box that has stops the TSC in C3 and the > different cores have wildly different TSC values and a dual-core > Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock() > change for now. The problem is not CPU synchronized TSC, but TSC with varying frequency on a single CPU like on the A64. The old implementation can lose really badly on that because it mixes measurements at different frequencies together without individual scaling. The error gets worse the longer the system runs. >> BTW with all this CPU time measurement it would be really nice to >> report it to the user too. It seems a bit bizarre that the scheduler >> keeps track of ns, but top only knows jiffies with large sampling >> errors. > yeah - i'll fix that too if someone doesnt beat me at it. I've been pondering for some time if doubling the NMI watchdog as a ring 0 counter for this is worth it. So far I'm still undecided (and it's moot now since it's disabled by default :/) -Andi ^ permalink raw reply [flat|nested] 304+ messages in thread
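The individual scaling Andi is talking about -- converting each TSC delta to nanoseconds with the frequency that was in effect when the delta was measured, instead of mixing measurements taken at different frequencies -- boils down to something like the sketch below. The mult/shift style mirrors common cyc2ns practice, but the struct and function names are invented for illustration; this is not the actual pending patch.

    #include <stdint.h>

    /* Illustrative per-CPU sched_clock state: ns accumulates already-scaled
     * time, so a later frequency change can never distort it retroactively. */
    struct demo_sched_clock {
            uint64_t last_tsc;      /* raw TSC at the last update */
            uint64_t ns;            /* accumulated nanoseconds */
            uint32_t mult;          /* ns = (delta * mult) >> shift */
            uint32_t shift;
    };

    static uint64_t demo_sched_clock_update(struct demo_sched_clock *c, uint64_t tsc_now)
    {
            uint64_t delta = tsc_now - c->last_tsc;

            c->last_tsc = tsc_now;
            c->ns += (delta * c->mult) >> c->shift;
            return c->ns;
    }

    /* Called from a cpufreq notifier: only deltas measured from now on use
     * the new scale factor.  tsc_khz cycles happen per millisecond, and a
     * millisecond is 1000000 ns, hence the factor below. */
    static void demo_sched_clock_set_freq(struct demo_sched_clock *c, uint64_t tsc_khz)
    {
            c->shift = 10;
            c->mult  = (uint32_t)((1000000ULL << c->shift) / tsc_khz);
    }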
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (12 preceding siblings ...) 2007-04-16 22:00 ` Andi Kleen @ 2007-04-17 7:56 ` Andy Whitcroft 2007-04-17 9:32 ` Nick Piggin 2007-04-18 10:22 ` Ingo Molnar 13 siblings, 2 replies; 304+ messages in thread From: Andy Whitcroft @ 2007-04-17 7:56 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] > > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. > > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. > > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! > > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] > > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). > > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns > > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. > > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. 
> > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. > > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) > > - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler > way than the vanilla scheduler does. It uses 100 runqueues (for all > 100 RT priority levels, instead of 140 in the vanilla scheduler) > and it needs no expired array. > > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. > > the core scheduler got smaller by more than 700 lines: > > kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------ > 1 file changed, 372 insertions(+), 1082 deletions(-) > > and even adding all the scheduling modules, the total size impact is > relatively small: > > 18 files changed, 1454 insertions(+), 1133 deletions(-) > > most of the increase is due to extensive comments. The kernel size > impact is in fact a small negative: > > text data bss dec hex filename > 23366 4001 24 27391 6aff kernel/sched.o.vanilla > 24159 2705 56 26920 6928 kernel/sched.o.CFS > > (this is mainly due to the benefit of getting rid of the expired array > and its data structure overhead.) > > thanks go to Thomas Gleixner and Arjan van de Ven for review of this > patchset. > > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, Pushed this through the test.kernel.org and nothing new blew up. Notably the kernbench figures are within expectations even on the bigger numa systems, commonly badly affected by balancing problems in the schedular. I see there is a second one out, I'll push that one through too. -apw ^ permalink raw reply [flat|nested] 304+ messages in thread
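For readers skimming the quoted announcement, the "time-ordered rbtree" timeline amounts to the usual kernel rbtree insertion pattern, roughly as sketched below: the leftmost node is always the next task due to run. The struct and key names here are made up for illustration; they are not CFS's actual fields.

    #include <linux/rbtree.h>
    #include <linux/types.h>

    /* Illustrative timeline entry, keyed by a nanosecond-based value. */
    struct demo_timeline_node {
            struct rb_node  run_node;
            u64             key_ns;
    };

    static void demo_timeline_enqueue(struct rb_root *root, struct demo_timeline_node *se)
    {
            struct rb_node **link = &root->rb_node;
            struct rb_node *parent = NULL;

            while (*link) {
                    struct demo_timeline_node *entry;

                    parent = *link;
                    entry = rb_entry(parent, struct demo_timeline_node, run_node);
                    if (se->key_ns < entry->key_ns)
                            link = &parent->rb_left;
                    else
                            link = &parent->rb_right;
            }
            rb_link_node(&se->run_node, parent, link);
            rb_insert_color(&se->run_node, root);
    }

    /* Picking the next task to run is just the leftmost node. */
    static struct demo_timeline_node *demo_timeline_first(struct rb_root *root)
    {
            struct rb_node *left = rb_first(root);

            return left ? rb_entry(left, struct demo_timeline_node, run_node) : NULL;
    }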
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Andy Whitcroft @ 2007-04-17 9:32 ` Nick Piggin 2007-04-17 9:59 ` Ingo Molnar 2007-04-18 10:22 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 9:32 UTC (permalink / raw) To: Andy Whitcroft Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 08:56:27AM +0100, Andy Whitcroft wrote: > > > > as usual, any sort of feedback, bugreports, fixes and suggestions are > > more than welcome, > > Pushed this through the test.kernel.org and nothing new blew up. > Notably the kernbench figures are within expectations even on the bigger > numa systems, commonly badly affected by balancing problems in the > scheduler. > > I see there is a second one out, I'll push that one through too. Well I just sent some feedback on cfs-v2, but realised it went off-list, so I'll resend here because others may find it interesting too. Sorry about jamming it in here, but it is relevant to performance... Anyway, roughly in the context of good cfs-v2 interactivity, I wrote: Well I'm not too surprised. I am disappointed that it uses such small timeslices (or whatever they are called) as the default. Using small timeslices is actually a pretty easy way to ensure everything stays smooth even under load, but is bad for efficiency. Sure you can say you'll have desktop and server tunings, but... With nicksched I'm testing a default timeslice of *300ms* even on the desktop, whereas Ingo's seems to be effectively 3ms :P So if you compare default tunings, it isn't exactly fair! Kbuild times on a 2x Xeon: 2.6.21-rc7 508.87user 32.47system 2:17.82elapsed 392%CPU 509.05user 32.25system 2:17.84elapsed 392%CPU 508.75user 32.26system 2:17.83elapsed 392%CPU 508.63user 32.17system 2:17.88elapsed 392%CPU 509.01user 32.26system 2:17.90elapsed 392%CPU 509.08user 32.20system 2:17.95elapsed 392%CPU 2.6.21-rc7-cfs-v2 534.80user 30.92system 2:23.64elapsed 393%CPU 534.75user 31.01system 2:23.70elapsed 393%CPU 534.66user 31.07system 2:23.76elapsed 393%CPU 534.56user 30.91system 2:23.76elapsed 393%CPU 534.66user 31.07system 2:23.67elapsed 393%CPU 535.43user 30.62system 2:23.72elapsed 393%CPU 2.6.21-rc7-nicksched 505.60user 32.31system 2:17.91elapsed 390%CPU 506.55user 32.42system 2:17.66elapsed 391%CPU 506.41user 32.30system 2:17.85elapsed 390%CPU 506.48user 32.36system 2:17.77elapsed 391%CPU 506.10user 32.40system 2:17.81elapsed 390%CPU 506.69user 32.16system 2:17.78elapsed 391%CPU ^ permalink raw reply [flat|nested] 304+ messages in thread
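For anyone wanting to reproduce this kind of comparison, a minimal sketch of the timing loop, assuming GNU time and an already-configured 2.6.21-rc7 tree; the exact invocation Nick used is not shown in the thread:

    # Hedged sketch, not Nick's actual script: repeat a timed make -j8 run.
    # GNU /usr/bin/time's default report is the "Xuser Ysystem Zelapsed %CPU"
    # format quoted above; -a -o collects the reports in one file.
    cd linux-2.6.21-rc7
    make -j8 > /dev/null 2>&1                        # warm-up build to prime caches
    for i in 1 2 3 4 5 6; do
        make clean > /dev/null 2>&1
        /usr/bin/time -a -o kbuild.times make -j8 > /dev/null 2>&1
    done
    cat kbuild.times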
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:32 ` Nick Piggin @ 2007-04-17 9:59 ` Ingo Molnar 2007-04-17 11:11 ` Nick Piggin 2007-04-18 8:55 ` Nick Piggin 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:59 UTC (permalink / raw) To: Nick Piggin Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Nick Piggin <npiggin@suse.de> wrote: > 2.6.21-rc7-cfs-v2 > 534.80user 30.92system 2:23.64elapsed 393%CPU > 534.75user 31.01system 2:23.70elapsed 393%CPU > 534.66user 31.07system 2:23.76elapsed 393%CPU > 534.56user 30.91system 2:23.76elapsed 393%CPU > 534.66user 31.07system 2:23.67elapsed 393%CPU > 535.43user 30.62system 2:23.72elapsed 393%CPU Thanks for testing this! Could you please try this also with: echo 100000000 > /proc/sys/kernel/sched_granularity on the same system, so that we can get a complete set of numbers? Just to make sure that lowering the preemption frequency indeed has the expected result of moving kernbench numbers back to mainline levels. (if not then that would indicate some CFS buglet) could you maybe even try a more extreme setting of: echo 500000000 > /proc/sys/kernel/sched_granularity for kicks? This would allow us to see how much kernbench we lose due to preemption granularity. Thanks! Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
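A sketch of the sweep Ingo is asking for, reusing the timing loop above and assuming the sysctl name and nanosecond units from the echo commands he quotes:

    # Hedged sketch: retime the same build at several preemption granularities.
    for g in 1000000 100000000 500000000; do
        echo $g > /proc/sys/kernel/sched_granularity    # value in nanoseconds
        make clean > /dev/null 2>&1
        /usr/bin/time -a -o kbuild.times.$g make -j8 > /dev/null 2>&1
    done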
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:59 ` Ingo Molnar @ 2007-04-17 11:11 ` Nick Piggin 2007-04-18 8:55 ` Nick Piggin 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 11:11 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > Thanks for testing this! Could you please try this also with: > > echo 100000000 > /proc/sys/kernel/sched_granularity > > on the same system, so that we can get a complete set of numbers? Just > to make sure that lowering the preemption frequency indeed has the > expected result of moving kernbench numbers back to mainline levels. (if > not then that would indicate some CFS buglet) > > could you maybe even try a more extreme setting of: > > echo 500000000 > /proc/sys/kernel/sched_granularity > > for kicks? This would allow us to see how much kernbench we lose due to > preemption granularity. Thanks! Yeah but I just powered down the test-box, so I'll have to get onto that tomorrow. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:59 ` Ingo Molnar 2007-04-17 11:11 ` Nick Piggin @ 2007-04-18 8:55 ` Nick Piggin 2007-04-18 9:33 ` Con Kolivas 2007-04-18 9:53 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 8:55 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > Thanks for testing this! Could you please try this also with: > > echo 100000000 > /proc/sys/kernel/sched_granularity 507.68user 31.87system 2:18.05elapsed 390%CPU 507.99user 31.93system 2:18.09elapsed 390%CPU 507.46user 31.78system 2:18.03elapsed 390%CPU 507.68user 31.93system 2:18.11elapsed 390%CPU 507.63user 31.98system 2:18.01elapsed 390%CPU 507.83user 31.94system 2:18.28elapsed 390%CPU > could you maybe even try a more extreme setting of: > > echo 500000000 > /proc/sys/kernel/sched_granularity 504.87user 32.13system 2:18.03elapsed 389%CPU 505.94user 32.29system 2:17.87elapsed 390%CPU 506.10user 31.90system 2:17.96elapsed 389%CPU 505.02user 32.02system 2:17.96elapsed 389%CPU 506.69user 31.96system 2:17.82elapsed 390%CPU 505.70user 31.84system 2:17.90elapsed 389%CPU Again, for comparison 2.6.21-rc7 mainline: 508.87user 32.47system 2:17.82elapsed 392%CPU 509.05user 32.25system 2:17.84elapsed 392%CPU 508.75user 32.26system 2:17.83elapsed 392%CPU 508.63user 32.17system 2:17.88elapsed 392%CPU 509.01user 32.26system 2:17.90elapsed 392%CPU 509.08user 32.20system 2:17.95elapsed 392%CPU So looking at elapsed time, a granularity of 100ms is just behind the mainline score. However it is using slightly less user time and slightly more idle time, which indicates that balancing might have got a bit less aggressive. But anyway, it conclusively shows the efficiency impact of such tiny timeslices. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 8:55 ` Nick Piggin @ 2007-04-18 9:33 ` Con Kolivas 2007-04-18 12:14 ` Nick Piggin 2007-04-18 9:53 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-18 9:33 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > > > Thanks for testing this! Could you please try this also with: > > > > echo 100000000 > /proc/sys/kernel/sched_granularity > > 507.68user 31.87system 2:18.05elapsed 390%CPU > 507.99user 31.93system 2:18.09elapsed 390%CPU > 507.46user 31.78system 2:18.03elapsed 390%CPU > 507.68user 31.93system 2:18.11elapsed 390%CPU > 507.63user 31.98system 2:18.01elapsed 390%CPU > 507.83user 31.94system 2:18.28elapsed 390%CPU > > > could you maybe even try a more extreme setting of: > > > > echo 500000000 > /proc/sys/kernel/sched_granularity > > 504.87user 32.13system 2:18.03elapsed 389%CPU > 505.94user 32.29system 2:17.87elapsed 390%CPU > 506.10user 31.90system 2:17.96elapsed 389%CPU > 505.02user 32.02system 2:17.96elapsed 389%CPU > 506.69user 31.96system 2:17.82elapsed 390%CPU > 505.70user 31.84system 2:17.90elapsed 389%CPU > > > Again, for comparison 2.6.21-rc7 mainline: > > 508.87user 32.47system 2:17.82elapsed 392%CPU > 509.05user 32.25system 2:17.84elapsed 392%CPU > 508.75user 32.26system 2:17.83elapsed 392%CPU > 508.63user 32.17system 2:17.88elapsed 392%CPU > 509.01user 32.26system 2:17.90elapsed 392%CPU > 509.08user 32.20system 2:17.95elapsed 392%CPU > > So looking at elapsed time, a granularity of 100ms is just behind the > mainline score. However it is using slightly less user time and > slightly more idle time, which indicates that balancing might have got > a bit less aggressive. > > But anyway, it conclusively shows the efficiency impact of such tiny > timeslices. See test.kernel.org for how (the now defunct) SD was performing on kernbench. It had low latency _and_ equivalent throughput to mainline. Set the standard appropriately on both counts please. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 9:33 ` Con Kolivas @ 2007-04-18 12:14 ` Nick Piggin 2007-04-18 12:33 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 12:14 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > Again, for comparison 2.6.21-rc7 mainline: > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > mainline score. However it is using slightly less user time and > > slightly more idle time, which indicates that balancing might have got > > a bit less aggressive. > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > timeslices. > > See test.kernel.org for how (the now defunct) SD was performing on kernbench. > It had low latency _and_ equivalent throughput to mainline. Set the standard > appropriately on both counts please. I can give it a run. Got an updated patch against -rc7? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:14 ` Nick Piggin @ 2007-04-18 12:33 ` Con Kolivas 2007-04-18 21:49 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-18 12:33 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:14, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > > Again, for comparison 2.6.21-rc7 mainline: > > > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > mainline score. However it is using slightly less user time and > > > slightly more idle time, which indicates that balancing might have got > > > a bit less aggressive. > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > timeslices. > > > > See test.kernel.org for how (the now defunct) SD was performing on > > kernbench. It had low latency _and_ equivalent throughput to mainline. > > Set the standard appropriately on both counts please. > > I can give it a run. Got an updated patch against -rc7? I said I wasn't pursuing it but since you're offering, the rc6 patch should apply ok. http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:33 ` Con Kolivas @ 2007-04-18 21:49 ` Con Kolivas 0 siblings, 0 replies; 304+ messages in thread From: Con Kolivas @ 2007-04-18 21:49 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:33, Con Kolivas wrote: > On Wednesday 18 April 2007 22:14, Nick Piggin wrote: > > On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > > > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > > > Again, for comparison 2.6.21-rc7 mainline: > > > > > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > > mainline score. However it is using slightly less user time and > > > > slightly more idle time, which indicates that balancing might have > > > > got a bit less aggressive. > > > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > > timeslices. > > > > > > See test.kernel.org for how (the now defunct) SD was performing on > > > kernbench. It had low latency _and_ equivalent throughput to mainline. > > > Set the standard appropriately on both counts please. > > > > I can give it a run. Got an updated patch against -rc7? > > I said I wasn't pursuing it but since you're offering, the rc6 patch should > apply ok. > > http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch Oh and if you go to the effort of trying you may as well try the timeslice tweak to see what effect it has on SD as well. /proc/sys/kernel/rr_interval 100 is the highest. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
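For the SD side of the comparison, the knob Con points at could be exercised the same way; a hedged sketch, assuming rr_interval takes milliseconds as in the SD patches:

    cat /proc/sys/kernel/rr_interval            # current SD timeslice, in ms
    echo 100 > /proc/sys/kernel/rr_interval     # the maximum value Con mentions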
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 8:55 ` Nick Piggin 2007-04-18 9:33 ` Con Kolivas @ 2007-04-18 9:53 ` Ingo Molnar 2007-04-18 12:13 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 9:53 UTC (permalink / raw) To: Nick Piggin Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Nick Piggin <npiggin@suse.de> wrote: > > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > > > Thanks for testing this! Could you please try this also with: > > > > echo 100000000 > /proc/sys/kernel/sched_granularity > > 507.68user 31.87system 2:18.05elapsed 390%CPU > 507.99user 31.93system 2:18.09elapsed 390%CPU > > could you maybe even try a more extreme setting of: > > > > echo 500000000 > /proc/sys/kernel/sched_granularity > 506.69user 31.96system 2:17.82elapsed 390%CPU > 505.70user 31.84system 2:17.90elapsed 389%CPU > Again, for comparison 2.6.21-rc7 mainline: > > 508.87user 32.47system 2:17.82elapsed 392%CPU > 509.05user 32.25system 2:17.84elapsed 392%CPU thanks for testing this! > So looking at elapsed time, a granularity of 100ms is just behind the > mainline score. However it is using slightly less user time and > slightly more idle time, which indicates that balancing might have got > a bit less aggressive. > > But anyway, it conclusively shows the efficiency impact of such tiny > timeslices. yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is not unexpected when going to really frequent preemption. Clearly, the default preemption granularity needs to be tuned up. I think you said you measured ~3msec average preemption rate per CPU? That would suggest the average cache-trashing cost was 120 usecs per every 3 msec window. Taking that as a ballpark figure, to get the difference back into the noise range we'd have to either use ~5 msec: echo 5000000 > /proc/sys/kernel/sched_granularity or 15 msec: echo 15000000 > /proc/sys/kernel/sched_granularity (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i correctly understood your 3msec value. I'd have to know your kernbench workload's approximate 'steady state' context-switch rate to do a more accurate calculation.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
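One way to get the 'steady state' context-switch rate Ingo asks for is to sample the machine-wide ctxt counter in /proc/stat during the run; a hedged sketch (divide the result by the number of busy runqueues for a per-CPU figure):

    # Sample the cumulative context-switch count over a 10-second window
    # while the kernbench run is in its steady state.
    c0=$(awk '/^ctxt/ {print $2}' /proc/stat)
    sleep 10
    c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
    echo "$(( (c1 - c0) / 10 )) context switches/sec (whole machine)"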
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 9:53 ` Ingo Molnar @ 2007-04-18 12:13 ` Nick Piggin 2007-04-18 12:49 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 12:13 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > So looking at elapsed time, a granularity of 100ms is just behind the > > mainline score. However it is using slightly less user time and > > slightly more idle time, which indicates that balancing might have got > > a bit less aggressive. > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > timeslices. > > yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is > not unexpected when going to really frequent preemption. Clearly, the > default preemption granularity needs to be tuned up. > > I think you said you measured ~3msec average preemption rate per CPU? This was just looking at ctxsw numbers from running 2 cpu hogs on the same runqueue. > That would suggest the average cache-trashing cost was 120 usecs per > every 3 msec window. Taking that as a ballpark figure, to get the > difference back into the noise range we'd have to either use ~5 msec: > > echo 5000000 > /proc/sys/kernel/sched_granularity > > or 15 msec: > > echo 15000000 > /proc/sys/kernel/sched_granularity > > (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i > correctly understood your 3msec value. I'd have to know your kernbench > workload's approximate 'steady state' context-switch rate to do a more > accurate calculation.) The kernel compile (make -j8 on 4 thread system) is doing 1800 total context switches per second (450/s per runqueue) for cfs, and 670 for mainline. Going up to 20ms granularity for cfs brings the context switch numbers similar, but user time is still a % or so higher. I'd be more worried about compute heavy threads which naturally don't do much context switching. Some other numbers on the same system Hackbench: 2.6.21-rc7 cfs-v2 1ms[*] nicksched 10 groups: Time: 1.332 0.743 0.607 20 groups: Time: 1.197 1.100 1.241 30 groups: Time: 1.754 2.376 1.834 40 groups: Time: 3.451 2.227 2.503 50 groups: Time: 3.726 3.399 3.220 60 groups: Time: 3.548 4.567 3.668 70 groups: Time: 4.206 4.905 4.314 80 groups: Time: 4.551 6.324 4.879 90 groups: Time: 7.904 6.962 5.335 100 groups: Time: 7.293 7.799 5.857 110 groups: Time: 10.595 8.728 6.517 120 groups: Time: 7.543 9.304 7.082 130 groups: Time: 8.269 10.639 8.007 140 groups: Time: 11.867 8.250 8.302 150 groups: Time: 14.852 8.656 8.662 160 groups: Time: 9.648 9.313 9.541 Mainline seems pretty inconsistent here. lmbench 0K ctxsw latency bound to CPU0: tasks 2 2.59 3.42 2.50 4 3.26 3.54 3.09 8 3.01 3.64 3.22 16 3.00 3.66 3.50 32 2.99 3.70 3.49 64 3.09 4.17 3.50 128 4.80 5.58 4.74 256 5.79 6.37 5.76 cfs is noticeably disadvantaged. [*] 500ms didn't make much difference in either test. ^ permalink raw reply [flat|nested] 304+ messages in thread
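For reference, the invocations behind tables like these are typically along the lines below; a hedged sketch assuming the classic hackbench.c and lmbench's lat_ctx, since the exact flags used on Nick's box are not shown:

    # hackbench takes the number of groups and prints "Time: ...";
    # lat_ctx measures 0K context-switch latency, here pinned to CPU0.
    for g in 10 20 30 40 50 60 70 80; do ./hackbench $g; done
    taskset -c 0 ./lat_ctx -s 0 2 4 8 16 32 64 128 256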
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:13 ` Nick Piggin @ 2007-04-18 12:49 ` Con Kolivas 2007-04-19 3:28 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-18 12:49 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:13, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > mainline score. However it is using slightly less user time and > > > slightly more idle time, which indicates that balancing might have got > > > a bit less aggressive. > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > timeslices. > > > > yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is > > not unexpected when going to really frequent preemption. Clearly, the > > default preemption granularity needs to be tuned up. > > > > I think you said you measured ~3msec average preemption rate per CPU? > > This was just looking at ctxsw numbers from running 2 cpu hogs on the > same runqueue. > > > That would suggest the average cache-trashing cost was 120 usecs per > > every 3 msec window. Taking that as a ballpark figure, to get the > > difference back into the noise range we'd have to either use ~5 msec: > > > > echo 5000000 > /proc/sys/kernel/sched_granularity > > > > or 15 msec: > > > > echo 15000000 > /proc/sys/kernel/sched_granularity > > > > (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i > > correctly understood your 3msec value. I'd have to know your kernbench > > workload's approximate 'steady state' context-switch rate to do a more > > accurate calculation.) > > The kernel compile (make -j8 on 4 thread system) is doing 1800 total > context switches per second (450/s per runqueue) for cfs, and 670 > for mainline. Going up to 20ms granularity for cfs brings the context > switch numbers similar, but user time is still a % or so higher. I'd > be more worried about compute heavy threads which naturally don't do > much context switching. While kernel compiles are nice and easy to do I've seen enough criticism of them in the past to wonder about their usefulness as a standard benchmark on their own. > > Some other numbers on the same system > Hackbench: 2.6.21-rc7 cfs-v2 1ms[*] nicksched > 10 groups: Time: 1.332 0.743 0.607 > 20 groups: Time: 1.197 1.100 1.241 > 30 groups: Time: 1.754 2.376 1.834 > 40 groups: Time: 3.451 2.227 2.503 > 50 groups: Time: 3.726 3.399 3.220 > 60 groups: Time: 3.548 4.567 3.668 > 70 groups: Time: 4.206 4.905 4.314 > 80 groups: Time: 4.551 6.324 4.879 > 90 groups: Time: 7.904 6.962 5.335 > 100 groups: Time: 7.293 7.799 5.857 > 110 groups: Time: 10.595 8.728 6.517 > 120 groups: Time: 7.543 9.304 7.082 > 130 groups: Time: 8.269 10.639 8.007 > 140 groups: Time: 11.867 8.250 8.302 > 150 groups: Time: 14.852 8.656 8.662 > 160 groups: Time: 9.648 9.313 9.541 Hackbench even more so. In a prolonged discussion with Rusty Russell on this issue, he suggested hackbench was more a pass/fail benchmark to ensure there was no starvation scenario that never ended, and very little value should be placed on the actual results returned from it.
Wli's concerns regarding some sort of standard framework for a battery of accepted meaningful benchmarks come to mind as important, rather than ones that highlight one over the other. So while interesting for their own endpoints, I certainly wouldn't put either benchmark as some sort of yardstick for a "winner". Note I'm not saying that we shouldn't be looking at them per se, but since the whole drive for a new scheduler is trying to be more objective, we need to start expanding the range of benchmarks. Even though I don't feel the need to have SD in the "race", I guess it stands for more data to compare what is possible/where as well. > Mainline seems pretty inconsistent here. > > lmbench 0K ctxsw latency bound to CPU0: > tasks > 2 2.59 3.42 2.50 > 4 3.26 3.54 3.09 > 8 3.01 3.64 3.22 > 16 3.00 3.66 3.50 > 32 2.99 3.70 3.49 > 64 3.09 4.17 3.50 > 128 4.80 5.58 4.74 > 256 5.79 6.37 5.76 > > cfs is noticeably disadvantaged. > > [*] 500ms didn't make much difference in either test. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:49 ` Con Kolivas @ 2007-04-19 3:28 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-19 3:28 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 10:49:45PM +1000, Con Kolivas wrote: > On Wednesday 18 April 2007 22:13, Nick Piggin wrote: > > > > The kernel compile (make -j8 on 4 thread system) is doing 1800 total > > context switches per second (450/s per runqueue) for cfs, and 670 > > for mainline. Going up to 20ms granularity for cfs brings the context > > switch numbers similar, but user time is still a % or so higher. I'd > > be more worried about compute heavy threads which naturally don't do > > much context switching. > > While kernel compiles are nice and easy to do I've seen enough criticism of > them in the past to wonder about their usefulness as a standard benchmark on > their own. Actually it is a real workload for most kernel developers, including you no doubt :) The criticisms of kernbench for the kernel are probably fair in that kernel compiles don't exercise a lot of kernel functionality (page allocator and fault paths mostly, IIRC). However as far as I'm concerned, they're great for testing the CPU scheduler, because it doesn't actually matter whether you're running in userspace or kernel space for a context switch to blow your caches. The results are quite stable. You could actually make up a benchmark that hurts a whole lot more from context switching, but I figure that kernbench is a real-world thing that shows it up quite well. > > Some other numbers on the same system > > Hackbench: 2.6.21-rc7 cfs-v2 1ms[*] nicksched > > 10 groups: Time: 1.332 0.743 0.607 > > 20 groups: Time: 1.197 1.100 1.241 > > 30 groups: Time: 1.754 2.376 1.834 > > 40 groups: Time: 3.451 2.227 2.503 > > 50 groups: Time: 3.726 3.399 3.220 > > 60 groups: Time: 3.548 4.567 3.668 > > 70 groups: Time: 4.206 4.905 4.314 > > 80 groups: Time: 4.551 6.324 4.879 > > 90 groups: Time: 7.904 6.962 5.335 > > 100 groups: Time: 7.293 7.799 5.857 > > 110 groups: Time: 10.595 8.728 6.517 > > 120 groups: Time: 7.543 9.304 7.082 > > 130 groups: Time: 8.269 10.639 8.007 > > 140 groups: Time: 11.867 8.250 8.302 > > 150 groups: Time: 14.852 8.656 8.662 > > 160 groups: Time: 9.648 9.313 9.541 > > Hackbench even more so. In a prolonged discussion with Rusty Russell on this > issue, he suggested hackbench was more a pass/fail benchmark to ensure there > was no starvation scenario that never ended, and very little value should be > placed on the actual results returned from it. Yeah, cfs seems to do a little worse than nicksched here, but I include the numbers not because I think that is significant, but to show mainline's poor characteristics. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Andy Whitcroft 2007-04-17 9:32 ` Nick Piggin @ 2007-04-18 10:22 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 10:22 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Andy Whitcroft <apw@shadowen.org> wrote: > > as usual, any sort of feedback, bugreports, fixes and suggestions > > are more than welcome, > > Pushed this through the test.kernel.org and nothing new blew up. > Notably the kernbench figures are within expectations even on the > bigger numa systems, commonly badly affected by balancing problems in > the schedular. thanks! Given the really low preemption latency/granularity default (roughly equivalent to 'timeslice length'), and that basically all of my focus was on interactivity characteristics, this is a pretty good result. I suspect it will be necessary to increase the default to 10 msecs (or more) to be on the safe side. (Nick has reported a 4% kernbench drop so for his kernbench workload it's needed.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
Thread overview: 304+ messages
2007-04-15 18:47 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Tim Tassonis
-- strict thread matches above, loose matches on Subject: below --
2007-04-13 20:21 Ingo Molnar
2007-04-13 20:27 ` Bill Huey
2007-04-13 20:55 ` Ingo Molnar
2007-04-13 21:21 ` William Lee Irwin III
2007-04-13 21:35 ` Bill Huey
2007-04-13 21:39 ` Ingo Molnar
2007-04-13 21:50 ` Ingo Molnar
2007-04-13 21:57 ` Michal Piotrowski
2007-04-13 22:15 ` Daniel Walker
2007-04-13 22:30 ` Ingo Molnar
2007-04-13 22:37 ` Willy Tarreau
2007-04-13 23:59 ` Daniel Walker
2007-04-14 10:55 ` Ingo Molnar
2007-04-13 22:21 ` William Lee Irwin III
2007-04-13 22:52 ` Ingo Molnar
2007-04-13 23:30 ` William Lee Irwin III
2007-04-13 23:44 ` Ingo Molnar
2007-04-13 23:58 ` William Lee Irwin III
2007-04-14 22:38 ` Davide Libenzi
2007-04-14 23:26 ` Davide Libenzi
2007-04-15 4:01 ` William Lee Irwin III
2007-04-15 4:18 ` Davide Libenzi
2007-04-15 23:09 ` Pavel Pisa
2007-04-16 5:47 ` Davide Libenzi
2007-04-17 0:37 ` Pavel Pisa
2007-04-13 22:31 ` Willy Tarreau
2007-04-13 23:18 ` Ingo Molnar
2007-04-14 18:48 ` Bill Huey
2007-04-13 23:07 ` Gabriel C
2007-04-13 23:25 ` Ingo Molnar
2007-04-13 23:39 ` Gabriel C
2007-04-14 2:04 ` Nick Piggin
2007-04-14 6:32 ` Ingo Molnar
2007-04-14 6:43 ` Ingo Molnar
2007-04-14 8:08 ` Willy Tarreau
2007-04-14 8:36 ` Willy Tarreau
2007-04-14 10:53 ` Ingo Molnar
2007-04-14 13:01 ` Willy Tarreau
2007-04-14 13:27 ` Willy Tarreau
2007-04-14 14:45 ` Willy Tarreau
2007-04-14 16:14 ` Ingo Molnar
2007-04-14 16:19 ` Ingo Molnar
2007-04-14 17:15 ` Eric W. Biederman
2007-04-14 17:29 ` Willy Tarreau
2007-04-14 17:44 ` Eric W. Biederman
2007-04-14 17:54 ` Ingo Molnar
2007-04-14 18:18 ` Willy Tarreau
2007-04-14 18:40 ` Eric W. Biederman
2007-04-14 19:01 ` Willy Tarreau
2007-04-15 17:55 ` Ingo Molnar
2007-04-15 18:06 ` Willy Tarreau
2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
2007-04-15 19:57 ` Ingo Molnar
2007-04-15 23:54 ` William Lee Irwin III
2007-04-16 11:24 ` Ingo Molnar
2007-04-16 13:46 ` William Lee Irwin III
2007-04-15 19:37 ` Ingo Molnar
2007-04-14 17:50 ` Linus Torvalds
2007-04-15 7:54 ` Mike Galbraith
2007-04-15 8:58 ` Ingo Molnar
2007-04-15 9:11 ` Mike Galbraith
2007-04-19 9:01 ` Ingo Molnar
2007-04-19 12:54 ` Willy Tarreau
2007-04-19 15:18 ` Ingo Molnar
2007-04-19 17:34 ` Gene Heskett
2007-04-19 18:45 ` Willy Tarreau
2007-04-21 10:31 ` Ingo Molnar
2007-04-21 10:38 ` Ingo Molnar
2007-04-21 10:45 ` Ingo Molnar
2007-04-21 11:07 ` Willy Tarreau
2007-04-21 11:29 ` Björn Steinbrink
2007-04-21 11:51 ` Willy Tarreau
2007-04-19 23:52 ` Jan Knutar
2007-04-20 5:05 ` Willy Tarreau
2007-04-19 17:32 ` Gene Heskett
2007-04-14 15:17 ` Mark Lord
2007-04-14 19:48 ` William Lee Irwin III
2007-04-14 20:12 ` Willy Tarreau
2007-04-14 10:36 ` Ingo Molnar
2007-04-14 15:09 ` S.Çağlar Onur
2007-04-14 16:09 ` Ingo Molnar
2007-04-14 16:59 ` S.Çağlar Onur
2007-04-15 3:27 ` Con Kolivas
2007-04-15 5:16 ` Bill Huey
2007-04-15 8:44 ` Ingo Molnar
2007-04-15 9:51 ` Bill Huey
2007-04-15 10:39 ` Pekka Enberg
2007-04-15 12:45 ` Willy Tarreau
2007-04-15 13:08 ` Pekka J Enberg
2007-04-15 17:32 ` Mike Galbraith
2007-04-15 17:59 ` Linus Torvalds
2007-04-15 19:00 ` Jonathan Lundell
2007-04-15 22:52 ` Con Kolivas
2007-04-16 2:28 ` Nick Piggin
2007-04-16 3:15 ` Con Kolivas
2007-04-16 3:34 ` Nick Piggin
2007-04-15 15:26 ` William Lee Irwin III
2007-04-16 15:55 ` Chris Friesen
2007-04-16 16:13 ` William Lee Irwin III
2007-04-17 0:04 ` Peter Williams
2007-04-17 13:07 ` James Bruce
2007-04-17 20:05 ` William Lee Irwin III
2007-04-15 15:39 ` Ingo Molnar
2007-04-15 15:47 ` William Lee Irwin III
2007-04-16 5:27 ` Peter Williams
2007-04-16 6:23 ` Peter Williams
2007-04-16 6:40 ` Peter Williams
2007-04-16 7:32 ` Ingo Molnar
2007-04-16 8:54 ` Peter Williams
2007-04-15 15:16 ` Gene Heskett
2007-04-15 16:43 ` Con Kolivas
2007-04-15 16:58 ` Gene Heskett
2007-04-15 18:00 ` Mike Galbraith
2007-04-16 0:18 ` Gene Heskett
2007-04-15 16:11 ` Bernd Eckenfels
2007-04-15 6:43 ` Mike Galbraith
2007-04-15 8:36 ` Bill Huey
2007-04-15 8:45 ` Mike Galbraith
2007-04-15 9:06 ` Ingo Molnar
2007-04-16 10:00 ` Ingo Molnar
2007-04-15 16:25 ` Arjan van de Ven
2007-04-16 5:36 ` Bill Huey
2007-04-16 6:17 ` Nick Piggin
2007-04-17 0:06 ` Peter Williams
2007-04-17 2:29 ` Mike Galbraith
2007-04-17 3:40 ` Nick Piggin
2007-04-17 4:01 ` Mike Galbraith
2007-04-17 4:14 ` Nick Piggin
2007-04-17 6:26 ` Peter Williams
2007-04-17 9:51 ` Ingo Molnar
2007-04-17 13:44 ` Peter Williams
2007-04-17 23:00 ` Michael K. Edwards
2007-04-17 23:07 ` William Lee Irwin III
2007-04-17 23:52 ` Michael K. Edwards
2007-04-18 0:36 ` Bill Huey
2007-04-18 2:39 ` Peter Williams
2007-04-20 20:47 ` Bill Davidsen
2007-04-21 7:39 ` Nick Piggin
2007-04-21 8:33 ` Ingo Molnar
2007-04-20 20:36 ` Bill Davidsen
2007-04-17 4:17 ` Peter Williams
2007-04-17 4:29 ` Nick Piggin
2007-04-17 5:53 ` Willy Tarreau
2007-04-17 6:10 ` Nick Piggin
2007-04-17 6:09 ` William Lee Irwin III
2007-04-17 6:15 ` Nick Piggin
2007-04-17 6:26 ` William Lee Irwin III
2007-04-17 7:01 ` Nick Piggin
2007-04-17 8:23 ` William Lee Irwin III
2007-04-17 22:23 ` Davide Libenzi
2007-04-17 21:39 ` Matt Mackall
2007-04-17 23:23 ` Peter Williams
2007-04-17 23:19 ` Matt Mackall
2007-04-18 3:15 ` Nick Piggin
2007-04-18 3:45 ` Mike Galbraith
2007-04-18 3:56 ` Nick Piggin
2007-04-18 4:29 ` Mike Galbraith
2007-04-18 4:38 ` Matt Mackall
2007-04-18 5:00 ` Nick Piggin
2007-04-18 5:55 ` Matt Mackall
2007-04-18 6:37 ` Nick Piggin
2007-04-18 6:55 ` Matt Mackall
2007-04-18 7:24 ` Nick Piggin
2007-04-21 13:33 ` Bill Davidsen
2007-04-18 13:08 ` William Lee Irwin III
2007-04-18 19:48 ` Davide Libenzi
2007-04-18 14:48 ` Linus Torvalds
2007-04-18 15:23 ` Matt Mackall
2007-04-18 17:22 ` Linus Torvalds
2007-04-18 17:49 ` Ingo Molnar
2007-04-18 17:59 ` Ingo Molnar
2007-04-18 19:40 ` Linus Torvalds
2007-04-18 19:43 ` Ingo Molnar
2007-04-18 20:07 ` Davide Libenzi
2007-04-18 21:48 ` Ingo Molnar
2007-04-18 23:30 ` Davide Libenzi
2007-04-19 8:00 ` Ingo Molnar
2007-04-19 15:43 ` Davide Libenzi
2007-04-21 14:09 ` Bill Davidsen
2007-04-19 17:39 ` Bernd Eckenfels
2007-04-19 6:52 ` Mike Galbraith
2007-04-19 7:09 ` Ingo Molnar
2007-04-19 7:32 ` Mike Galbraith
2007-04-19 16:55 ` Davide Libenzi
2007-04-20 5:16 ` Mike Galbraith
2007-04-19 7:14 ` Mike Galbraith
2007-04-18 21:04 ` Ingo Molnar
2007-04-18 19:23 ` Linus Torvalds
2007-04-18 19:56 ` Davide Libenzi
2007-04-18 20:11 ` Linus Torvalds
2007-04-19 0:22 ` Davide Libenzi
2007-04-19 0:30 ` Linus Torvalds
2007-04-18 18:02 ` William Lee Irwin III
2007-04-18 18:12 ` Ingo Molnar
2007-04-18 18:36 ` Diego Calleja
2007-04-19 0:37 ` Peter Williams
2007-04-18 19:05 ` Davide Libenzi
2007-04-18 19:13 ` Michael K. Edwards
2007-04-19 3:18 ` Nick Piggin
2007-04-19 5:14 ` Andrew Morton
2007-04-19 6:38 ` Ingo Molnar
2007-04-19 7:57 ` William Lee Irwin III
2007-04-19 11:50 ` Peter Williams
2007-04-20 5:26 ` William Lee Irwin III
2007-04-20 6:16 ` Peter Williams
2007-04-19 8:33 ` Nick Piggin
2007-04-21 13:40 ` Bill Davidsen
2007-04-17 6:50 ` Davide Libenzi
2007-04-17 7:09 ` William Lee Irwin III
2007-04-17 7:22 ` Peter Williams
2007-04-17 7:23 ` Nick Piggin
2007-04-17 7:27 ` Davide Libenzi
2007-04-17 7:33 ` Nick Piggin
2007-04-17 7:33 ` Ingo Molnar
2007-04-17 7:40 ` Nick Piggin
2007-04-17 7:58 ` Ingo Molnar
2007-04-17 9:05 ` William Lee Irwin III
2007-04-17 9:24 ` Ingo Molnar
2007-04-17 9:57 ` William Lee Irwin III
2007-04-17 10:01 ` Ingo Molnar
2007-04-17 11:31 ` William Lee Irwin III
2007-04-17 22:08 ` Matt Mackall
2007-04-17 22:32 ` William Lee Irwin III
2007-04-17 22:39 ` Matt Mackall
2007-04-17 22:59 ` William Lee Irwin III
2007-04-17 22:57 ` Matt Mackall
2007-04-18 4:29 ` William Lee Irwin III
2007-04-18 4:42 ` Davide Libenzi
2007-04-18 7:29 ` James Bruce
2007-04-17 7:11 ` Nick Piggin
2007-04-17 7:21 ` Davide Libenzi
2007-04-17 6:23 ` Peter Williams
2007-04-17 6:44 ` Nick Piggin
2007-04-17 7:48 ` Peter Williams
2007-04-17 7:56 ` Nick Piggin
2007-04-17 13:16 ` Peter Williams
2007-04-18 4:46 ` Nick Piggin
2007-04-17 8:44 ` Ingo Molnar
2007-04-19 2:20 ` Peter Williams
2007-04-15 15:05 ` Ingo Molnar
2007-04-15 20:05 ` Matt Mackall
2007-04-15 20:48 ` Ingo Molnar
2007-04-15 21:31 ` Matt Mackall
2007-04-16 3:03 ` Nick Piggin
2007-04-16 14:28 ` Matt Mackall
2007-04-17 3:31 ` Nick Piggin
2007-04-17 17:35 ` Matt Mackall
2007-04-16 15:45 ` William Lee Irwin III
2007-04-15 23:39 ` William Lee Irwin III
2007-04-16 1:06 ` Peter Williams
2007-04-16 3:04 ` William Lee Irwin III
2007-04-16 5:09 ` Peter Williams
2007-04-16 11:04 ` William Lee Irwin III
2007-04-16 12:55 ` Peter Williams
2007-04-16 23:10 ` Michael K. Edwards
2007-04-17 3:55 ` Nick Piggin
2007-04-17 4:25 ` Peter Williams
2007-04-17 4:34 ` Nick Piggin
2007-04-17 6:03 ` Peter Williams
2007-04-17 6:14 ` William Lee Irwin III
2007-04-17 6:23 ` Nick Piggin
2007-04-17 9:36 ` Ingo Molnar
2007-04-17 8:24 ` William Lee Irwin III
[not found] ` <20070416135915.GK8915@holomorphy.com>
[not found] ` <46241677.7060909@bigpond.net.au>
[not found] ` <20070417025704.GM8915@holomorphy.com>
[not found] ` <462445EC.1060306@bigpond.net.au>
[not found] ` <20070417053147.GN8915@holomorphy.com>
[not found] ` <46246A7C.8050501@bigpond.net.au>
[not found] ` <20070417064109.GP8915@holomorphy.com>
2007-04-17 8:00 ` Peter Williams
2007-04-17 10:41 ` William Lee Irwin III
2007-04-17 13:48 ` Peter Williams
2007-04-18 0:27 ` Peter Williams
2007-04-18 2:03 ` William Lee Irwin III
2007-04-18 2:31 ` Peter Williams
2007-04-16 17:22 ` Chris Friesen
2007-04-17 0:54 ` Peter Williams
2007-04-17 15:52 ` Chris Friesen
2007-04-17 23:50 ` Peter Williams
2007-04-18 5:43 ` Chris Friesen
2007-04-18 13:00 ` Peter Williams
2007-04-16 5:16 ` Con Kolivas
2007-04-16 5:48 ` Gene Heskett
2007-04-15 12:29 ` Esben Nielsen
2007-04-15 13:04 ` Ingo Molnar
2007-04-16 7:16 ` Esben Nielsen
2007-04-15 22:49 ` Ismail Dönmez
2007-04-15 23:23 ` Arjan van de Ven
2007-04-15 23:33 ` Ismail Dönmez
2007-04-16 11:58 ` Ingo Molnar
2007-04-16 12:02 ` Ismail Dönmez
2007-04-16 22:00 ` Andi Kleen
2007-04-16 21:05 ` Ingo Molnar
2007-04-16 21:21 ` Andi Kleen
2007-04-17 7:56 ` Andy Whitcroft
2007-04-17 9:32 ` Nick Piggin
2007-04-17 9:59 ` Ingo Molnar
2007-04-17 11:11 ` Nick Piggin
2007-04-18 8:55 ` Nick Piggin
2007-04-18 9:33 ` Con Kolivas
2007-04-18 12:14 ` Nick Piggin
2007-04-18 12:33 ` Con Kolivas
2007-04-18 21:49 ` Con Kolivas
2007-04-18 9:53 ` Ingo Molnar
2007-04-18 12:13 ` Nick Piggin
2007-04-18 12:49 ` Con Kolivas
2007-04-19 3:28 ` Nick Piggin
2007-04-18 10:22 ` Ingo Molnar