* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] @ 2007-04-15 18:47 Tim Tassonis 0 siblings, 0 replies; 304+ messages in thread From: Tim Tassonis @ 2007-04-15 18:47 UTC (permalink / raw) To: linux-kernel > + printk("Fair Scheduler: Copyright (c) 2007 Red Hat, Inc., Ingo Molnar\n"); So that's what all the fuss about the staircase scheduler is all about then! At last, I see your point. > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! > How pathetic can you get? Tim, really looking forward to the CL final where Liverpool will beat the shit out of Scum (and there's a lot to be beaten out). ^ permalink raw reply [flat|nested] 304+ messages in thread
* [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
@ 2007-04-13 20:21 Ingo Molnar
2007-04-13 20:27 ` Bill Huey
` (13 more replies)
0 siblings, 14 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-13 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Thomas Gleixner
[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
i'm pleased to announce the first release of the "Modular Scheduler Core
and Completely Fair Scheduler [CFS]" patchset:
http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch
This project is a complete rewrite of the Linux task scheduler. My goal
is to address various feature requests and to fix deficiencies in the
vanilla scheduler that were suggested/found in the past few years, both
for desktop scheduling and for server scheduling workloads.
[ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The
new scheduler will be active by default and all tasks will default
to the new SCHED_FAIR interactive scheduling class. ]
Highlights are:
- the introduction of Scheduling Classes: an extensible hierarchy of
scheduler modules. These modules encapsulate scheduling policy
details and are handled by the scheduler core without the core
code assuming too much about them.
- sched_fair.c implements the 'CFS desktop scheduler': it is a
replacement for the vanilla scheduler's SCHED_OTHER interactivity
code.
i'd like to give credit to Con Kolivas for the general approach here:
he has proven via RSDL/SD that 'fair scheduling' is possible and that
it results in better desktop scheduling. Kudos Con!
The CFS patch uses a completely different approach and implementation
from RSDL/SD. My goal was to make CFS's interactivity quality exceed
that of RSDL/SD, which is a high standard to meet :-) Testing
feedback is welcome to decide this one way or another. [ and, in any
case, all of SD's logic could be added via a kernel/sched_sd.c module
as well, if Con is interested in such an approach. ]
CFS's design is quite radical: it does not use runqueues; it uses a
time-ordered rbtree to build a 'timeline' of future task execution,
and thus has no 'array switch' artifacts (by which both the vanilla
scheduler and RSDL/SD are affected). [ an illustrative sketch of such
a timeline enqueue follows right after this list of highlights. ]
CFS uses nanosecond granularity accounting and does not rely on any
jiffies or other HZ detail. Thus the CFS scheduler has no notion of
'timeslices' and has no heuristics whatsoever. There is only one
central tunable:
/proc/sys/kernel/sched_granularity_ns
which can be used to tune the scheduler from 'desktop' (low
latencies) to 'server' (good batching) workloads. It defaults to a
setting suitable for desktop workloads. SCHED_BATCH is handled by the
CFS scheduler module too.
due to its design, the CFS scheduler is not prone to any of the
'attacks' that exist today against the heuristics of the stock
scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
work fine and do not impact interactivity and produce the expected
behavior.
the CFS scheduler has a much stronger handling of nice levels and
SCHED_BATCH: both types of workloads should be isolated much more
aggressively than under the vanilla scheduler.
( another detail: due to nanosec accounting and timeline sorting,
sched_yield() support is very simple under CFS, and in fact under
CFS sched_yield() behaves much better than under any other
scheduler i have tested so far. )
- sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
way than the vanilla scheduler does. It uses 100 runqueues (for all
100 RT priority levels, instead of 140 in the vanilla scheduler)
and it needs no expired array.
- reworked/sanitized SMP load-balancing: the runqueue-walking
assumptions are gone from the load-balancing code now, and
iterators of the scheduling modules are used. The balancing code got
quite a bit simpler as a result.
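[ To make the 'timeline' idea above concrete, here is a minimal sketch of
  enqueueing a task into a time-ordered rbtree while caching the leftmost
  node. This is an illustration only, not code from the patch: the struct,
  variable and function names are invented for the example, and only the
  <linux/rbtree.h> API itself is real. ]

#include <linux/types.h>
#include <linux/rbtree.h>

struct fair_task {
        u64 fair_key;                   /* nanosecond-based timeline key */
        struct rb_node run_node;
};

static struct rb_root timeline = RB_ROOT;
static struct rb_node *leftmost;        /* cached: next task to run */

static void timeline_enqueue(struct fair_task *p)
{
        struct rb_node **link = &timeline.rb_node, *parent = NULL;
        int is_leftmost = 1;

        /* O(log n) walk to the insertion point, ordered by fair_key */
        while (*link) {
                struct fair_task *entry;

                parent = *link;
                entry = rb_entry(parent, struct fair_task, run_node);
                if (p->fair_key < entry->fair_key) {
                        link = &parent->rb_left;
                } else {
                        link = &parent->rb_right;
                        is_leftmost = 0;
                }
        }
        rb_link_node(&p->run_node, parent, link);
        rb_insert_color(&p->run_node, &timeline);

        /* caching the leftmost node keeps 'pick next task' O(1) */
        if (is_leftmost)
                leftmost = &p->run_node;
}

With the leftmost node cached, picking the next task never has to walk the
tree; only insertion and removal pay the O(log n) cost.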
the core scheduler got smaller by more than 700 lines:
kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------
1 file changed, 372 insertions(+), 1082 deletions(-)
and even adding all the scheduling modules, the total size impact is
relatively small:
18 files changed, 1454 insertions(+), 1133 deletions(-)
most of the increase is due to extensive comments. The kernel size
impact is in fact a small negative:
text data bss dec hex filename
23366 4001 24 27391 6aff kernel/sched.o.vanilla
24159 2705 56 26920 6928 kernel/sched.o.CFS
(this is mainly due to the benefit of getting rid of the expired array
and its data structure overhead.)
thanks go to Thomas Gleixner and Arjan van de Ven for review of this
patchset.
as usual, any sort of feedback, bugreports, fixes and suggestions are
more than welcome,
Ingo
^ permalink raw reply [flat|nested] 304+ messages in thread* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar @ 2007-04-13 20:27 ` Bill Huey 2007-04-13 20:55 ` Ingo Molnar 2007-04-13 21:50 ` Ingo Molnar ` (12 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Bill Huey @ 2007-04-13 20:27 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] ... > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] Ingo, Con has been asking for module support for years if I understand your patch correctly. You'll also need this for -rt as well with regards to bandwidth scheduling. Good to see that you're moving in this direction. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:27 ` Bill Huey @ 2007-04-13 20:55 ` Ingo Molnar 2007-04-13 21:21 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 20:55 UTC (permalink / raw) To: Bill Huey Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > Con has been asking for module support for years if I understand your > patch corectly. [...] Yeah. Note that there are some subtle but crucial differences between PlugSched (which Con used, and which i opposed in the past) and this approach. PlugSched cuts the interfaces at a high level in a monolithic way and introduces kernel/scheduler.c that uses one pluggable scheduler (represented via the 'scheduler' global template) at a time. while in this CFS patchset i'm using modularization ('scheduler classes') to simplify the _existing_ multi-policy implementation of the scheduler. These 'scheduler classes' are in a hierarchy and are stacked on top of each other. They are all in use at once. Currently there's two of them: sched_ops_rt is stacked on top of sched_ops_fair. Fortunately the performance impact is minimal. So scheduler classes are mainly a simplification of the design of the scheduler - not just a mere facility to select multiple schedulers. Their ability to also facilitate easier experimentation with schedulers is 'just' a happy side-effect. So, all in all: it's a fairly different model from PlugSched (and that's why i didn't reuse PlugSched) - but there's indeed overlap. > [...] You'll also need this for -rt as well with regards to bandwidth > scheduling. yeah. scheduler classes are also useful for other purposes like containers and virtualization, hierarchical/group scheduling, security encapsulation, etc. - features that can be on-demand layered, and which we don't necessarily want to have enabled all the time. > [...] Good to see that you're moving in this direction. thanks! :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
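[ To illustrate the stacking described above: a minimal sketch of what a
  NULL-terminated chain of scheduler classes can look like. The class names
  (sched_ops_rt, sched_ops_fair) and the ->next chaining come from the mail
  above; the member names and the pick_next_task() walk are assumptions made
  for the example, not the interface from the posted patch. ]

struct rq;
struct task_struct;

struct sched_ops {
        /* next, lower-priority class; NULL ends the chain */
        const struct sched_ops *next;

        void (*enqueue_task)(struct rq *rq, struct task_struct *p);
        void (*dequeue_task)(struct rq *rq, struct task_struct *p);
        struct task_struct *(*pick_next_task)(struct rq *rq);
};

/* sched_ops_rt is stacked on top of sched_ops_fair */
extern const struct sched_ops sched_ops_fair;   /* .next = NULL */
extern const struct sched_ops sched_ops_rt;     /* .next = &sched_ops_fair */

/* the core asks each class in turn; the first one with a runnable task wins */
static struct task_struct *pick_next_task(struct rq *rq)
{
        const struct sched_ops *class;
        struct task_struct *p;

        for (class = &sched_ops_rt; class; class = class->next) {
                p = class->pick_next_task(rq);
                if (p)
                        return p;
        }
        return NULL;    /* nothing runnable: switch to the idle thread */
}

This is also why the per-class policy details stay out of the core: the core
only ever walks the chain and talks to whichever class answers first.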
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:55 ` Ingo Molnar @ 2007-04-13 21:21 ` William Lee Irwin III 2007-04-13 21:35 ` Bill Huey 2007-04-13 21:39 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 21:21 UTC (permalink / raw) To: Ingo Molnar Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote: > Yeah. Note that there are some subtle but crutial differences between > PlugSched (which Con used, and which i opposed in the past) and this > approach. > PlugSched cuts the interfaces at a high level in a monolithic way and > introduces kernel/scheduler.c that uses one pluggable scheduler > (represented via the 'scheduler' global template) at a time. What I originally did did so for a good reason, which was that it was intended to support far more radical reorganizations, for instance, things that changed the per-cpu runqueue affairs for gang scheduling. I wrote a top-level driver that did support scheduling classes in a similar fashion, though it didn't survive others maintaining the patches. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 21:21 ` William Lee Irwin III @ 2007-04-13 21:35 ` Bill Huey 0 siblings, 0 replies; 304+ messages in thread From: Bill Huey @ 2007-04-13 21:35 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Fri, Apr 13, 2007 at 02:21:10PM -0700, William Lee Irwin III wrote: > On Fri, Apr 13, 2007 at 10:55:45PM +0200, Ingo Molnar wrote: > > Yeah. Note that there are some subtle but crutial differences between > > PlugSched (which Con used, and which i opposed in the past) and this > > approach. > > PlugSched cuts the interfaces at a high level in a monolithic way and > > introduces kernel/scheduler.c that uses one pluggable scheduler > > (represented via the 'scheduler' global template) at a time. > > What I originally did did so for a good reason, which was that it was > intended to support far more radical reorganizations, for instance, > things that changed the per-cpu runqueue affairs for gang scheduling. > I wrote a top-level driver that did support scheduling classes in a > similar fashion, though it didn't survive others maintaining the patches. Also, gang scheduling is needed to solve virtualization issues regarding spinlocks in a guest image. You could potentially be spinning on a thread that isn't currently running which, needless to say, is very bad. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 21:21 ` William Lee Irwin III 2007-04-13 21:35 ` Bill Huey @ 2007-04-13 21:39 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 21:39 UTC (permalink / raw) To: William Lee Irwin III Cc: Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > What I originally did did so for a good reason, which was that it was > intended to support far more radical reorganizations, for instance, > things that changed the per-cpu runqueue affairs for gang scheduling. > I wrote a top-level driver that did support scheduling classes in a > similar fashion, though it didn't survive others maintaining the > patches. yeah - i looked at plugsched-6.5-for-2.6.20.patch in particular. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar 2007-04-13 20:27 ` Bill Huey @ 2007-04-13 21:50 ` Ingo Molnar 2007-04-13 21:57 ` Michal Piotrowski ` (11 subsequent siblings) 13 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 21:50 UTC (permalink / raw) To: linux-kernel Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > and even adding all the scheduling modules, the total size impact is > relatively small: > > 18 files changed, 1454 insertions(+), 1133 deletions(-) > > most of the increase is due to extensive comments. The kernel size > impact is in fact a small negative: > > text data bss dec hex filename > 23366 4001 24 27391 6aff kernel/sched.o.vanilla > 24159 2705 56 26920 6928 kernel/sched.o.CFS update: these were older numbers, here are the stats redone with the latest patch: text data bss dec hex filename 23366 4001 24 27391 6aff kernel/sched.o.vanilla 23671 4548 24 28243 6e53 kernel/sched.o.sd.v40 23349 2705 24 26078 65de kernel/sched.o.cfs so CFS is now a win both for text and for data size :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar 2007-04-13 20:27 ` Bill Huey 2007-04-13 21:50 ` Ingo Molnar @ 2007-04-13 21:57 ` Michal Piotrowski 2007-04-13 22:15 ` Daniel Walker ` (10 subsequent siblings) 13 siblings, 0 replies; 304+ messages in thread From: Michal Piotrowski @ 2007-04-13 21:57 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar napisał(a): > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > Friday the 13th, my lucky day :). /mnt/md0/devel/linux-msc-cfs/usr/include/linux/sched.h requires linux/rbtree.h, which does not exist in exported headers make[3]: *** No rule to make target `/mnt/md0/devel/linux-msc-cfs/usr/include/linux/.check.sched.h', needed by `__headerscheck'. Stop. make[2]: *** [linux] Error 2 make[1]: *** [headers_check] Error 2 make: *** [vmlinux] Error 2 Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (PL) (http://www.stardust.webpages.pl/ltg/) LTG - Linux Testers Group (EN) (http://www.stardust.webpages.pl/linux_testers_group_en/) Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com> --- linux-msc-cfs-clean/include/linux/Kbuild 2007-04-13 23:52:47.000000000 +0200 +++ linux-msc-cfs/include/linux/Kbuild 2007-04-13 23:49:41.000000000 +0200 @@ -133,6 +133,7 @@ header-y += quotaio_v1.h header-y += quotaio_v2.h header-y += radeonfb.h header-y += raw.h +header-y += rbtree.h header-y += resource.h header-y += rose.h header-y += smbno.h ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (2 preceding siblings ...) 2007-04-13 21:57 ` Michal Piotrowski @ 2007-04-13 22:15 ` Daniel Walker 2007-04-13 22:30 ` Ingo Molnar 2007-04-13 22:21 ` William Lee Irwin III ` (9 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Daniel Walker @ 2007-04-13 22:15 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 2007-04-13 at 22:21 +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. I'm not in love with the current or other schedulers, so I'm indifferent to this change. However, I was reviewing your release notes and the patch and found myself wondering what the algorithmic complexity of this new scheduler is. I assumed it would also be constant time, but the __enqueue_task_fair doesn't appear to be constant time (rbtree insert complexity). Maybe that's not a critical path, but I thought I would at least comment on it. Daniel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:15 ` Daniel Walker @ 2007-04-13 22:30 ` Ingo Molnar 2007-04-13 22:37 ` Willy Tarreau 2007-04-13 23:59 ` Daniel Walker 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 22:30 UTC (permalink / raw) To: Daniel Walker Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Daniel Walker <dwalker@mvista.com> wrote: > I'm not in love with the current or other schedulers, so I'm > indifferent to this change. However, I was reviewing your release > notes and the patch and found myself wonder what the logarithmic > complexity of this new scheduler is .. I assumed it would also be > constant time , but the __enqueue_task_fair doesn't appear to be > constant time (rbtree insert complexity).. [...] i've been worried about that myself and i've done extensive measurements before choosing this implementation. The rbtree turned out to be a quite compact data structure: we get it quite cheaply as part of the task structure cachemisses - which have to be touched anyway. For 1000 tasks it's a loop of ~10 - that's still very fast and bound in practice. here's a test i did under CFS. Lets take some ridiculous load: 1000 infinite loop tasks running at SCHED_BATCH on a single CPU (all inserted into the same rbtree), and lets run lat_ctx: neptune:~/l> uptime 22:51:23 up 8 min, 2 users, load average: 713.06, 254.64, 91.51 neptune:~/l> ./lat_ctx -s 0 2 "size=0k ovr=1.61 2 1.41 lets stop the 1000 tasks and only have ~2 tasks in the runqueue: neptune:~/l> ./lat_ctx -s 0 2 "size=0k ovr=1.70 2 1.16 so the overhead is 0.25 usecs. Considering the load (1000 tasks trash the cache like crazy already), this is more than acceptable. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:30 ` Ingo Molnar @ 2007-04-13 22:37 ` Willy Tarreau 2007-04-13 23:59 ` Daniel Walker 1 sibling, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-13 22:37 UTC (permalink / raw) To: Ingo Molnar Cc: Daniel Walker, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:30:17AM +0200, Ingo Molnar wrote: > > * Daniel Walker <dwalker@mvista.com> wrote: > > > I'm not in love with the current or other schedulers, so I'm > > indifferent to this change. However, I was reviewing your release > > notes and the patch and found myself wonder what the logarithmic > > complexity of this new scheduler is .. I assumed it would also be > > constant time , but the __enqueue_task_fair doesn't appear to be > > constant time (rbtree insert complexity).. [...] > > i've been worried about that myself and i've done extensive measurements > before choosing this implementation. The rbtree turned out to be a quite > compact data structure: we get it quite cheaply as part of the task > structure cachemisses - which have to be touched anyway. For 1000 tasks > it's a loop of ~10 - that's still very fast and bound in practice. I'm not worried at all by O(log(n)) algorithms, and generally prefer smart log(n) than dumb O(1). In a userland TCP stack I started to write 2 years ago, I used a comparable scheduler and could reach a sustained rate of 145000 connections/s at 4 millions of concurrent connections. And yes, each time a packet was sent or received, a task was queued/dequeued (so about 450k/s with 4 million tasks, on an athlon 1.5 GHz). So that seems much higher than what we currently need. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:30 ` Ingo Molnar 2007-04-13 22:37 ` Willy Tarreau @ 2007-04-13 23:59 ` Daniel Walker 2007-04-14 10:55 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Daniel Walker @ 2007-04-13 23:59 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner One other thing, what happens in the case of slow, frequency-changing, and/or inaccurate clocks? Is the old sched_clock behavior still tolerated? Daniel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:59 ` Daniel Walker @ 2007-04-14 10:55 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 10:55 UTC (permalink / raw) To: Daniel Walker Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Daniel Walker <dwalker@mvista.com> wrote: > One other thing, what happens in the case of slow, frequency changing, > are/or inaccurate clocks .. Is the old sched_clock behavior still > tolerated? yeah, good question. Yesterday i did a quick testboot with that too, and it seemed to behave pretty OK with the low-res [jiffies based] sched_clock() too. Although in that case things are much more of an approximation and rounding/arithmetics artifacts are possible. CFS works best with a high-resolution cycle counter. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
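[ For reference, a rough sketch of the jiffies-based fallback being discussed
  here - not the exact in-tree implementation. It only advances in 1/HZ
  steps, which is why per-task accounting becomes an approximation unless a
  high-resolution cycle counter backs sched_clock(): ]

#include <linux/jiffies.h>

/* low-res fallback: returns nanoseconds, but only with jiffy (1/HZ) granularity */
unsigned long long sched_clock(void)
{
        return (unsigned long long)jiffies * (1000000000 / HZ);
}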
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (3 preceding siblings ...) 2007-04-13 22:15 ` Daniel Walker @ 2007-04-13 22:21 ` William Lee Irwin III 2007-04-13 22:52 ` Ingo Molnar 2007-04-14 22:38 ` Davide Libenzi 2007-04-13 22:31 ` Willy Tarreau ` (8 subsequent siblings) 13 siblings, 2 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 22:21 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] A pleasant surprise, though I did see it coming. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > Highlights are: > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. It probably needs further clarification that they're things on the order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization amongst the classes is furthermore assumed, and so on. They're not quite capable of being full-blown alternative policies, though quite a bit can be crammed into them. There are issues with the per- scheduling class data not being very well-abstracted. A union for per-class data might help, if not a dynamically allocated scheduling class -private structure. Getting an alternative policy floating around that actually clashes a little with the stock data in the task structure would help clarify what's needed. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! Bob Mullens banged out a virtual deadline interactive task scheduler for Multics back in 1976 or thereabouts. ISTR the name Ferranti in connection with deadline task scheduling for UNIX in particular. I've largely seen deadline schedulers as a realtime topic, though. In any event, it's not so radical as to lack a fair number of precedents. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. 
[ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). A binomial heap would likely serve your purposes better than rbtrees. It's faster to have the next item to dequeue at the root of the tree structure rather than a leaf, for one. There are, of course, other priority queue structures (e.g. van Emde Boas) able to exploit the limited precision of the priority key for faster asymptotics, though actual performance is an open question. Another advantage of heaps is that they support decreasing priorities directly, so that instead of removal and reinsertion, a less invasive movement within the tree is possible. This nets additional constant factor improvements beyond those for the next item to dequeue for the case where a task remains runnable, but is preempted and its priority decreased while it remains runnable. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > /proc/sys/kernel/sched_granularity_ns > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. I like not relying on timeslices. Timeslices ultimately get you into a 2.4.x -like epoch expiry scenarios and introduce a number of RR-esque artifacts therefore. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. I'm always suspicious of these claims. A moderately formal regression test suite needs to be assembled and the testcases rather seriously cleaned up so they e.g. run for a deterministic period of time, have their parameters passable via command-line options instead of editing and recompiling, don't need Lindenting to be legible, and so on. With that in hand, a battery of regression tests can be run against scheduler modifications to verify their correctness and to detect any disturbance in scheduling semantics they might cause. A very serious concern is that while a fresh scheduler may pass all these tests, later modifications may later cause failures unnoticed because no one's doing the regression tests and there's no obvious test suite for testing types to latch onto. Another is that the testcases themselves may bitrot if they're not maintainable code. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. Speaking of regression tests, let's please at least state intended nice semantics and get a regression test for CPU bandwidth distribution by nice levels going. 
On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) And there's another one. sched_yield() semantics need a regression test more transparent than VolanoMark or other macrobenchmarks. At some point we really need to decide what our sched_yield() is intended to do and get something out there to detect whether it's behaving as intended. On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. The SMP load balancing class operations strike me as unusual and likely to trip over semantic issues in alternative scheduling classes. Getting some alternative scheduling classes out there to clarify the issues would help here, too. A more general question here is what you mean by "completely fair;" there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or inter-user fairness going on, though one might argue those are relatively obscure notions of fairness. Complete fairness arguably precludes static prioritization by nice levels, so there is also that. There is also the issue of what a fair CPU bandwidth distribution between tasks of varying desired in-isolation CPU utilization might be. I suppose my thorniest point is where the demonstration of fairness is as, say, a testcase. Perhaps it's fair now; when will we find out when that fairness has been disturbed? What these things mean when there are multiple CPU's to schedule across may also be of concern. I propose the following two testcases: (1) CPU bandwidth distribution of CPU-bound tasks of varying nice levels Create a number of tasks at varying nice levels. Measure the CPU bandwidth allocated to each. Success depends on intent: we decide up-front that a given nice level should correspond to a given share of CPU bandwidth. Check to see how far from the intended distribution of CPU bandwidth according to those decided-up-front shares the actual distribution of CPU bandwidth is for the test. (2) CPU bandwidth distribution of tasks with varying CPU demands Create a number of tasks that would in isolation consume varying %cpu. Measure the CPU bandwidth allocated to each. Success depends on intent here, too. Decide up-front that a given %cpu that would be consumed in isolation should correspond to a given share of CPU bandwidth and check the actual distribution of CPU bandwidth vs. what was intended. Note that the shares need not linearly correspond to the %cpu; various sorts of things related to interactivity will make this nonlinear. A third testcase for sched_yield() should be brewed up. These testcases are oblivious to SMP. This will demand that a scheduling policy integrate with load balancing to the extent that load balancing occurs for the sake of distributing CPU bandwidth according to nice level. Some explicit decision should be made regarding that. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
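[ A minimal sketch of testcase (1) above - the nice levels, runtime and
  output format are assumptions, and it is deliberately oblivious to SMP and
  to what the 'intended' per-nice-level shares should be: it only measures
  what each competing hog actually received. ]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <sys/resource.h>

#define RUNTIME 30      /* seconds the hogs compete for the CPU */

int main(void)
{
        int nice_levels[] = { 0, 5, 10, 15, 19 };
        int n = sizeof(nice_levels) / sizeof(nice_levels[0]);
        pid_t pid[n];
        int i;

        for (i = 0; i < n; i++) {
                pid[i] = fork();
                if (pid[i] == 0) {
                        nice(nice_levels[i]);
                        for (;;)        /* pure CPU hog */
                                ;
                }
        }

        sleep(RUNTIME);

        for (i = 0; i < n; i++) {
                struct rusage ru;

                kill(pid[i], SIGKILL);
                wait4(pid[i], NULL, 0, &ru);
                printf("nice %3d: %ld.%06ld s of CPU\n", nice_levels[i],
                       (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        }
        return 0;
}

Comparing the printed CPU times against the decided-up-front shares (and
repeating the run pinned to one CPU as well as spread across several) gives
the pass/fail criterion described above.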
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:21 ` William Lee Irwin III @ 2007-04-13 22:52 ` Ingo Molnar 2007-04-13 23:30 ` William Lee Irwin III 2007-04-14 22:38 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 22:52 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > > is to address various feature requests and to fix deficiencies in the > > vanilla scheduler that were suggested/found in the past few years, both > > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > > new scheduler will be active by default and all tasks will default > > to the new SCHED_FAIR interactive scheduling class. ] > > A pleasant surprise, though I did see it coming. hey ;) > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > > scheduler modules. These modules encapsulate scheduling policy > > details and are handled by the scheduler core without the core > > code assuming about them too much. > > It probably needs further clarification that they're things on the > order of SCHED_FIFO, SCHED_RR, SCHED_NORMAL, etc.; some prioritization > amongst the classes is furthermore assumed, and so on. [...] yep - they are linked via sched_ops->next pointer, with NULL delimiting the last one. > [...] They're not quite capable of being full-blown alternative > policies, though quite a bit can be crammed into them. yeah, they are not full-blown: i extended them on-demand, for the specific purposes of sched_fair.c and sched_rt.c. More can be done too. > There are issues with the per- scheduling class data not being very > well-abstracted. [...] yes. It's on my TODO list: i'll work more on extending the cleanups to those fields too. > A binomial heap would likely serve your purposes better than rbtrees. > It's faster to have the next item to dequeue at the root of the tree > structure rather than a leaf, for one. There are, of course, other > priority queue structures (e.g. van Emde Boas) able to exploit the > limited precision of the priority key for faster asymptotics, though > actual performance is an open question. i'm caching the leftmost leaf, which serves as an alternate, task-pick centric root in essence. > Another advantage of heaps is that they support decreasing priorities > directly, so that instead of removal and reinsertion, a less invasive > movement within the tree is possible. This nets additional constant > factor improvements beyond those for the next item to dequeue for the > case where a task remains runnable, but is preempted and its priority > decreased while it remains runnable. yeah. 
(Note that in CFS i'm not decreasing priorities anywhere though - all the priority levels in CFS stay constant, fairness is not achieved via rotating priorities or similar, it is achieved via the accounting code.) > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > due to its design, the CFS scheduler is not prone to any of the > > 'attacks' that exist today against the heuristics of the stock > > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > > work fine and do not impact interactivity and produce the expected > > behavior. > > I'm always suspicious of these claims. [...] hey, sure - but please give it a go nevertheless, i _did_ test all these ;) > A moderately formal regression test suite needs to be assembled [...] by all means feel free! ;) > A more general question here is what you mean by "completely fair;" by that i mean the most common-sense definition: with N tasks running each gets 1/N CPU time if observed for a reasonable amount of time. Now extend this to arbitrary scheduling patterns, the end result should still be completely fair, according to the fundamental 1/N(time) rule individually applied to all the small scheduling patterns that the scheduling patterns give. (this assumes that the scheduling patterns are reasonably independent of each other - if they are not then there's no reasonable definition of fairness that makes sense, and we might as well use the 1/N rule for those cases too.) > there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or > inter-user fairness going on, though one might argue those are > relatively obscure notions of fairness. [...] sure, i mainly concentrated on what we have in Linux today. The things you mention are add-ons that i can see handling via new scheduling classes: all the CKRM and containers type of CPU time management facilities. > What these things mean when there are multiple CPU's to schedule > across may also be of concern. that is handled by the existing smp-nice load balancer, that logic is preserved under CFS. > These testcases are oblivious to SMP. This will demand that a > scheduling policy integrate with load balancing to the extent that > load balancing occurs for the sake of distributing CPU bandwidth > according to nice level. Some explicit decision should be made > regarding that. this should already work reasonably fine with CFS: try massive_intr.c on an SMP box. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
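[ Continuing the illustrative sketch from the announcement above (same
  invented names, not patch code): with the leftmost node cached, picking the
  next task is O(1), and the cache is refreshed with rb_next() when that task
  leaves the timeline. ]

static struct fair_task *timeline_pick_next(void)
{
        if (!leftmost)
                return NULL;    /* timeline empty */
        return rb_entry(leftmost, struct fair_task, run_node);
}

static void timeline_dequeue(struct fair_task *p)
{
        /* keep the 'alternate root' pointing at the next-leftmost node */
        if (leftmost == &p->run_node)
                leftmost = rb_next(&p->run_node);
        rb_erase(&p->run_node, &timeline);
}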
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:52 ` Ingo Molnar @ 2007-04-13 23:30 ` William Lee Irwin III 2007-04-13 23:44 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> A binomial heap would likely serve your purposes better than rbtrees. >> It's faster to have the next item to dequeue at the root of the tree >> structure rather than a leaf, for one. There are, of course, other >> priority queue structures (e.g. van Emde Boas) able to exploit the >> limited precision of the priority key for faster asymptotics, though >> actual performance is an open question. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > i'm caching the leftmost leaf, which serves as an alternate, task-pick > centric root in essence. I noticed that, yes. It seemed a better idea to me to use a data structure that has what's needed built-in, but I suppose it's not gospel. * William Lee Irwin III <wli@holomorphy.com> wrote: >> Another advantage of heaps is that they support decreasing priorities >> directly, so that instead of removal and reinsertion, a less invasive >> movement within the tree is possible. This nets additional constant >> factor improvements beyond those for the next item to dequeue for the >> case where a task remains runnable, but is preempted and its priority >> decreased while it remains runnable. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > yeah. (Note that in CFS i'm not decreasing priorities anywhere though - > all the priority levels in CFS stay constant, fairness is not achieved > via rotating priorities or similar, it is achieved via the accounting > code.) Sorry, "priority" here would be from the POV of the queue data structure. From the POV of the scheduler it would be resetting the deadline or whatever the nomenclature cooked up for things is, most obviously in requeue_task_fair() and task_tick_fair(). * William Lee Irwin III <wli@holomorphy.com> wrote: >> I'm always suspicious of these claims. [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > hey, sure - but please give it a go nevertheless, i _did_ test all these > ;) The suspicion essentially centers around how long the state of affairs will hold up because comprehensive re-testing is not noticeably done upon updates to scheduling code or kernel point releases. * William Lee Irwin III <wli@holomorphy.com> wrote: >> A moderately formal regression test suite needs to be assembled [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > by all means feel free! ;) I can only do so much, but I have done work to clean up other testcases going around. I'm mostly looking at testcases as I go over them or develop some interest in the subject and rewriting those that already exist or hammering out new ones as I need them. The main contribution toward this is that I've sort of made a mental note to stash the results of the effort somewhere and pass them along to those who do regular testing on kernels or otherwise import test suites into their collections. 
* William Lee Irwin III <wli@holomorphy.com> wrote: >> A more general question here is what you mean by "completely fair;" On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > by that i mean the most common-sense definition: with N tasks running > each gets 1/N CPU time if observed for a reasonable amount of time. Now > extend this to arbitrary scheduling patterns, the end result should > still be completely fair, according to the fundamental 1/N(time) rule > individually applied to all the small scheduling patterns that the > scheduling patterns give. (this assumes that the scheduling patterns are > reasonably independent of each other - if they are not then there's no > reasonable definition of fairness that makes sense, and we might as well > use the 1/N rule for those cases too.) I'd start with identically-behaving CPU-bound tasks here. It's easy enough to hammer out a testcase that starts up N CPU-bound tasks, runs them for a few minutes, stops them, collects statistics on their runtime, and gives us an idea of whether 1/N came out properly. I'll get around to that at some point. Where it gets complex is when the behavior patterns vary, e.g. they're not entirely CPU-bound and their desired in-isolation CPU utilization varies, or when nice levels vary, or both vary. I went on about testcases for those in particular in the prior post, though not both at once. The nice level one in particular needs an up-front goal for distribution of CPU bandwidth in a mixture of competing tasks with varying nice levels. There are different ways to define fairness, but a uniform distribution of CPU bandwidth across a set of identical competing tasks is a good, testable definition. * William Lee Irwin III <wli@holomorphy.com> wrote: >> there doesn't appear to be inter-tgrp, inter-pgrp, inter-session, or >> inter-user fairness going on, though one might argue those are >> relatively obscure notions of fairness. [...] On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > sure, i mainly concentrated on what we have in Linux today. The things > you mention are add-ons that i can see handling via new scheduling > classes: all the CKRM and containers type of CPU time management > facilities. At some point the CKRM and container people should be pinged to see what (if anything) they need to achieve these sorts of things. It's not clear to me that the specific cases I cited are considered relevant to anyone. I presume that if they are, someone will pipe up with a feature request. It was more a sort of catalogue of different notions of fairness that could arise than any sort of suggestion. * William Lee Irwin III <wli@holomorphy.com> wrote: >> What these things mean when there are multiple CPU's to schedule >> across may also be of concern. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > that is handled by the existing smp-nice load balancer, that logic is > preserved under CFS. Given the things going wrong, I'm curious as to whether that works, and if so, how well. I'll drop that into my list of testcases that should be arranged for, though I won't guarantee that I'll get to it myself in any sort of timely fashion. What this ultimately needs is specifying the semantics of nice levels so that we can say that a mixture of competing tasks with varying nice levels should have an ideal distribution of CPU bandwidth to check for. * William Lee Irwin III <wli@holomorphy.com> wrote: >> These testcases are oblivious to SMP. 
This will demand that a >> scheduling policy integrate with load balancing to the extent that >> load balancing occurs for the sake of distributing CPU bandwidth >> according to nice level. Some explicit decision should be made >> regarding that. On Sat, Apr 14, 2007 at 12:52:16AM +0200, Ingo Molnar wrote: > this should already work reasonably fine with CFS: try massive_intr.c on > an SMP box. Where is massive_intr.c, BTW? -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:30 ` William Lee Irwin III @ 2007-04-13 23:44 ` Ingo Molnar 2007-04-13 23:58 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 23:44 UTC (permalink / raw) To: William Lee Irwin III Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 1358 bytes --] * William Lee Irwin III <wli@holomorphy.com> wrote: > Where it gets complex is when the behavior patterns vary, e.g. they're > not entirely CPU-bound and their desired in-isolation CPU utilization > varies, or when nice levels vary, or both vary. [...] yes. I tested things like 'massive_intr.c' (attached, written by Satoru Takeuchi) which starts N tasks which each work for 8msec then sleep 1msec: from its output, the second column is the CPU time each thread got, the more even, the fairer the scheduling. On vanilla i get: mercury:~> ./massive_intr 10 10 024873 00000150 024874 00000123 024870 00000069 024868 00000068 024866 00000051 024875 00000206 024872 00000093 024869 00000138 024867 00000078 024871 00000223 on CFS i get: neptune:~> ./massive_intr 10 10 002266 00000112 002260 00000113 002261 00000112 002267 00000112 002269 00000112 002265 00000112 002262 00000113 002268 00000113 002264 00000112 002263 00000113 so it is quite a bit more even ;) another related test-utility is one i wrote: http://people.redhat.com/mingo/scheduler-patches/ring-test.c this is a ring of 100 tasks each doing work for 100 msecs and then sleeping for 1 msec. I usually test this by also running a CPU hog in parallel to it, and checking whether it gets ~50.0% of CPU time under CFS. (it does) Ingo [-- Attachment #2: massive_intr.c --] [-- Type: text/plain, Size: 9833 bytes --] #if 0 Hi Ingo and all, When I was executing massive interactive processes, I found that some of them occupy CPU time and the others hardly run. It seems that some of processes which occupy CPU time always has max effective prio (default+5) and the others have max - 1. What happen here is... 1. If there are moderate number of max interactive processes, they can be re-inserted into active queue without falling down its priority again and again. 2. In this case, the others seldom run, and can't get max effective priority at next exhausting because scheduler considers them to sleep too long. 3. Goto 1, OOPS! Unfortunately I haven't been able to make the patch resolving this problem yet. Any idea? I also attach the test program which easily recreates this problem. Test program flow: 1. First process starts child proesses and wait for 5 minutes. 2. Each child process executes "work 8 msec and sleep 1 msec" loop continuously. 3. After 3 minits have passed, each child processes prints the # of loops which executed. What expected: Each child processes execute nearly equal # of loops. Test environment: - kernel: 2.6.20(*1) - # of CPUs: 1 or 2 - # of child processes: 200 or 400 - nice value: 0 or 20(*2) *1) I confirmed that 2.6.21-rc5 has no change regarding this problem. *2) If a process have nice 20, scheduler never regards it as interactive. 
Test results: -----------+----------------+------+------------------------------------ # of CPUs | # of processes | nice | result -----------+----------------+------+------------------------------------ | | 20 | looks good 1(i386) | +------+------------------------------------ | | 0 | 4 processes occupy 98% of CPU time -----------+ 200 +------+------------------------------------ | | 20 | looks good | +------+------------------------------------ | | 0 | 8 processes occupy 72% of CPU time 2(ia64) +----------------+------+------------------------------------ | 400 | 20 | looks good | +------+------------------------------------ | | 0 | 8 processes occupy 98% of CPU time -----------+----------------+------+------------------------------------ FYI. 2.6.21-rc3-mm1 (enabling RSDL scheduler) works fine in the all casees :-) Thanks, Satoru ------------------------------------------------------------------------------- #endif /* * massive_intr - run @nproc interactive processes and print the number of * loops(*1) each process executes in @runtime secs. * * *1) "work 8 msec and sleep 1msec" loop * * Usage: massive_intr <nproc> <runtime> * * @nproc: number of processes * @runtime: execute time[sec] * * ex) If you want to run 300 processes for 5 mins, issue the * command as follows: * * $ massive_intr 300 300 * * How to build: * * cc -o massive_intr massive_intr.c -lrt * * * Copyright (C) 2007 Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> * * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or (at * your option) any later version. * * This program is distributed in the hope that it will be useful, but * WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * General Public License for more details. * * You should have received a copy of the GNU General Public License along * with this program; if not, write to the Free Software Foundation, Inc., * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. 
* * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */ #include <sys/time.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <sys/wait.h> #include <fcntl.h> #include <unistd.h> #include <semaphore.h> #include <stdio.h> #include <stdlib.h> #include <time.h> #include <string.h> #include <errno.h> #include <err.h> #define WORK_MSECS 8 #define SLEEP_MSECS 1 #define MAX_PROC 1024 #define SAMPLE_COUNT 1000000000 #define USECS_PER_SEC 1000000 #define USECS_PER_MSEC 1000 #define NSECS_PER_MSEC 1000000 #define SHMEMSIZE 4096 static const char *shmname = "/sched_interactive_shmem"; static void *shmem; static sem_t *printsem; static int nproc; static int runtime; static int fd; static time_t *first; static pid_t pid[MAX_PROC]; static int return_code; static void cleanup_resources(void) { if (sem_destroy(printsem) < 0) warn("sem_destroy() failed"); if (munmap(shmem, SHMEMSIZE) < 0) warn("munmap() failed"); if (close(fd) < 0) warn("close() failed"); } static void abnormal_exit(void) { if (kill(getppid(), SIGUSR2) < 0) err(EXIT_FAILURE, "kill() failed"); } static void sighandler(int signo) { } static void sighandler2(int signo) { return_code = EXIT_FAILURE; } static void loopfnc(int nloop) { int i; for (i = 0; i < nloop; i++) ; } static int loop_per_msec(void) { struct timeval tv[2]; int before, after; if (gettimeofday(&tv[0], NULL) < 0) return -1; loopfnc(SAMPLE_COUNT); if (gettimeofday(&tv[1], NULL) < 0) return -1; before = tv[0].tv_sec*USECS_PER_SEC+tv[0].tv_usec; after = tv[1].tv_sec*USECS_PER_SEC+tv[1].tv_usec; return SAMPLE_COUNT/(after - before)*USECS_PER_MSEC; } static void *test_job(void *arg) { int l = (int)arg; int count = 0; time_t current; sigset_t sigset; struct sigaction sa; struct timespec ts = { 0, NSECS_PER_MSEC*SLEEP_MSECS}; sa.sa_handler = sighandler; if (sigemptyset(&sa.sa_mask) < 0) { warn("sigemptyset() failed"); abnormal_exit(); } sa.sa_flags = 0; if (sigaction(SIGUSR1, &sa, NULL) < 0) { warn("sigaction() failed"); abnormal_exit(); } if (sigemptyset(&sigset) < 0) { warn("sigfillset() failed"); abnormal_exit(); } sigsuspend(&sigset); if (errno != EINTR) { warn("sigsuspend() failed"); abnormal_exit(); } /* main loop */ do { loopfnc(WORK_MSECS*l); if (nanosleep(&ts, NULL) < 0) { warn("nanosleep() failed"); abnormal_exit(); } count++; if (time(&current) == -1) { warn("time() failed"); abnormal_exit(); } } while (difftime(current, *first) < runtime); if (sem_wait(printsem) < 0) { warn("sem_wait() failed"); abnormal_exit(); } printf("%06d\t%08d\n", getpid(), count); if (sem_post(printsem) < 0) { warn("sem_post() failed"); abnormal_exit(); } exit(EXIT_SUCCESS); } static void usage(void) { fprintf(stderr, "Usage : massive_intr <nproc> <runtime>\n" "\t\tnproc : number of processes\n" "\t\truntime : execute time[sec]\n"); exit(EXIT_FAILURE); } int main(int argc, char **argv) { int i, j; int status; sigset_t sigset; struct sigaction sa; int c; if (argc != 3) usage(); nproc = strtol(argv[1], NULL, 10); if (errno || nproc < 1 || nproc > MAX_PROC) err(EXIT_FAILURE, "invalid multinum"); runtime = strtol(argv[2], NULL, 10); if (errno || runtime <= 0) err(EXIT_FAILURE, "invalid runtime"); sa.sa_handler = sighandler2; if (sigemptyset(&sa.sa_mask) < 0) err(EXIT_FAILURE, "sigemptyset() failed"); sa.sa_flags = 0; if (sigaction(SIGUSR2, &sa, NULL) < 0) err(EXIT_FAILURE, "sigaction() failed"); if (sigemptyset(&sigset) < 0) err(EXIT_FAILURE, "sigemptyset() failed"); if (sigaddset(&sigset, SIGUSR1) < 0) err(EXIT_FAILURE, "sigaddset() failed"); if
(sigaddset(&sigset, SIGUSR2) < 0) err(EXIT_FAILURE, "sigaddset() failed"); if (sigprocmask(SIG_BLOCK, &sigset, NULL) < 0) err(EXIT_FAILURE, "sigprocmask() failed"); /* setup shared memory */ if ((fd = shm_open(shmname, O_CREAT | O_RDWR, 0644)) < 0) err(EXIT_FAILURE, "shm_open() failed"); if (shm_unlink(shmname) < 0) { warn("shm_unlink() failed"); goto err_close; } if (ftruncate(fd, SHMEMSIZE) < 0) { warn("ftruncate() failed"); goto err_close; } shmem = mmap(NULL, SHMEMSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (shmem == (void *)-1) { warn("mmap() failed"); goto err_unmap; } printsem = shmem; first = shmem + sizeof(*printsem); /* initialize semaphore */ if ((sem_init(printsem, 1, 1)) < 0) { warn("sem_init() failed"); goto err_unmap; } if ((c = loop_per_msec()) < 0) { fprintf(stderr, "loop_per_msec() failed\n"); goto err_sem; } for (i = 0; i < nproc; i++) { pid[i] = fork(); if (pid[i] == -1) { warn("fork() failed\n"); for (j = 0; j < i; j++) if (kill(pid[j], SIGKILL) < 0) warn("kill() failed"); goto err_sem; } if (pid[i] == 0) test_job((void *)c); } if (sigemptyset(&sigset) < 0) { warn("sigemptyset() failed"); goto err_proc; } if (sigaddset(&sigset, SIGUSR2) < 0) { warn("sigaddset() failed"); goto err_proc; } if (sigprocmask(SIG_UNBLOCK, &sigset, NULL) < 0) { warn("sigprocmask() failed"); goto err_proc; } if (time(first) < 0) { warn("time() failed"); goto err_proc; } if ((kill(0, SIGUSR1)) == -1) { warn("kill() failed"); goto err_proc; } for (i = 0; i < nproc; i++) { if (wait(&status) < 0) { warn("wait() failed"); goto err_proc; } } cleanup_resources(); exit(return_code); err_proc: for (i = 0; i < nproc; i++) if (kill(pid[i], SIGKILL) < 0) if (errno != ESRCH) warn("kill() failed"); err_sem: if (sem_destroy(printsem) < 0) warn("sem_destroy() failed"); err_unmap: if (munmap(shmem, SHMEMSIZE) < 0) warn("munmap() failed"); err_close: if (close(fd) < 0) warn("close() failed"); exit(EXIT_FAILURE); } ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:44 ` Ingo Molnar @ 2007-04-13 23:58 ` William Lee Irwin III 0 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-13 23:58 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> Where it gets complex is when the behavior patterns vary, e.g. they're >> not entirely CPU-bound and their desired in-isolation CPU utilization >> varies, or when nice levels vary, or both vary. [...] On Sat, Apr 14, 2007 at 01:44:44AM +0200, Ingo Molnar wrote: > yes. I tested things like 'massive_intr.c' (attached, written by Satoru > Takeuchi) which starts N tasks which each work for 8msec then sleep > 1msec: [...] > another related test-utility is one i wrote: > http://people.redhat.com/mingo/scheduler-patches/ring-test.c > this is a ring of 100 tasks each doing work for 100 msecs and then > sleeping for 1 msec. I usually test this by also running a CPU hog in > parallel to it, and checking whether it gets ~50.0% of CPU time under > CFS. (it does) These are both tremendously useful. The code is also in rather good shape so only minimal modifications (for massive_intr.c I'm not even sure if any are needed at all) are needed to plug them into the test harness I'm aware of. I'll queue them both for me to adjust and send over to testers I don't want to burden with hacking on testcases I myself am asking them to add to their suites. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:21 ` William Lee Irwin III 2007-04-13 22:52 ` Ingo Molnar @ 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi ` (2 more replies) 1 sibling, 3 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-14 22:38 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, William Lee Irwin III wrote: > On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > > The CFS patch uses a completely different approach and implementation > > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > > that of RSDL/SD, which is a high standard to meet :-) Testing > > feedback is welcome to decide this one way or another. [ and, in any > > case, all of SD's logic could be added via a kernel/sched_sd.c module > > as well, if Con is interested in such an approach. ] > > CFS's design is quite radical: it does not use runqueues, it uses a > > time-ordered rbtree to build a 'timeline' of future task execution, > > and thus has no 'array switch' artifacts (by which both the vanilla > > scheduler and RSDL/SD are affected). > > A binomial heap would likely serve your purposes better than rbtrees. > It's faster to have the next item to dequeue at the root of the tree > structure rather than a leaf, for one. There are, of course, other > priority queue structures (e.g. van Emde Boas) able to exploit the > limited precision of the priority key for faster asymptotics, though > actual performance is an open question. Haven't looked at the scheduler code yet, but for a similar problem I use a time ring. The ring has Ns (2 power is better) slots (where tasks are queued - in my case they were som sort of timers), and it has a current base index (Ib), a current base time (Tb) and a time granularity (Tg). It also has a bitmap with bits telling you which slots contains queued tasks. An item (task) that has to be scheduled at time T, will be queued in the slot: S = Ib + min((T - Tb) / Tg, Ns - 1); Items with T longer than Ns*Tg will be scheduled in the relative last slot (chosing a proper Ns and Tg can minimize this). Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to suite to your needs. This is a simple bench between time-ring (TR) and CFS queueing: http://www.xmailserver.org/smart-queue.c In my box (Dual Opteron 252): davide@alien:~$ ./smart-queue -n 8 CFS = 142.21 cycles/loop TR = 72.33 cycles/loop davide@alien:~$ ./smart-queue -n 16 CFS = 188.74 cycles/loop TR = 83.79 cycles/loop davide@alien:~$ ./smart-queue -n 32 CFS = 221.36 cycles/loop TR = 75.93 cycles/loop davide@alien:~$ ./smart-queue -n 64 CFS = 242.89 cycles/loop TR = 81.29 cycles/loop - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi @ 2007-04-14 23:26 ` Davide Libenzi 2007-04-15 4:01 ` William Lee Irwin III 2007-04-15 23:09 ` Pavel Pisa 2 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-14 23:26 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, 14 Apr 2007, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the slot: > > S = Ib + min((T - Tb) / Tg, Ns - 1); ... mod Ns, of course ;) - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
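To make the structure described above concrete, here is a minimal sketch of such a time ring in C, with the mod-Ns correction applied. All names (tring, tring_enqueue, ...) are illustrative and do not come from any posted code; this is only the idea: enqueue hashes the due time into a slot in O(1), dequeue scans at most Ns slots, and items within one slot are deliberately left unordered (the Tg-granularity limitation discussed later in the thread).

#include <stddef.h>

#define NS_SLOTS 256				/* Ns: number of slots, power of two */

struct tring_item {
	struct tring_item *next;
	unsigned long long key;			/* T: absolute time the item is due */
};

struct tring {
	struct tring_item *slot[NS_SLOTS];	/* one unordered queue per slot */
	unsigned char bitmap[NS_SLOTS / 8];	/* which slots are non-empty */
	unsigned long long base_time;		/* Tb */
	unsigned long long granularity;		/* Tg */
	unsigned int base_idx;			/* Ib */
};

/* O(1): S = (Ib + min((T - Tb) / Tg, Ns - 1)) mod Ns */
static void tring_enqueue(struct tring *tr, struct tring_item *it)
{
	unsigned long long off = 0;
	unsigned int s;

	if (it->key > tr->base_time)
		off = (it->key - tr->base_time) / tr->granularity;
	if (off > NS_SLOTS - 1)
		off = NS_SLOTS - 1;		/* far-future items share the last slot */
	s = (tr->base_idx + (unsigned int)off) & (NS_SLOTS - 1);

	it->next = tr->slot[s];
	tr->slot[s] = it;
	tr->bitmap[s >> 3] |= 1 << (s & 7);
}

/* O(Ns) worst case: scan forward from Ib for the first occupied slot */
static struct tring_item *tring_dequeue(struct tring *tr)
{
	unsigned int i;

	for (i = 0; i < NS_SLOTS; i++) {
		unsigned int s = (tr->base_idx + i) & (NS_SLOTS - 1);

		if (tr->bitmap[s >> 3] & (1 << (s & 7))) {
			struct tring_item *it = tr->slot[s];

			tr->slot[s] = it->next;
			if (!tr->slot[s])
				tr->bitmap[s >> 3] &= ~(1 << (s & 7));
			/* advance Ib/Tb to the slot we drained from */
			tr->base_idx = s;
			tr->base_time += i * tr->granularity;
			return it;
		}
	}
	return NULL;
}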
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi @ 2007-04-15 4:01 ` William Lee Irwin III 2007-04-15 4:18 ` Davide Libenzi 2007-04-15 23:09 ` Pavel Pisa 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 4:01 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, William Lee Irwin III wrote: >> A binomial heap would likely serve your purposes better than rbtrees. [...] On Sat, Apr 14, 2007 at 03:38:04PM -0700, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the slot: > S = Ib + min((T - Tb) / Tg, Ns - 1); > Items with T longer than Ns*Tg will be scheduled in the relative last slot > (chosing a proper Ns and Tg can minimize this). > Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to > suite to your needs. I used a similar sort of queue in the virtual deadline scheduler I wrote in 2003 or thereabouts. CFS uses queue priorities with too high a precision to map directly to this (queue priorities are marked as "key" in the cfs code and should not be confused with task priorities). The elder virtual deadline scheduler used millisecond resolution and a rather different calculation for its equivalent of ->key, which explains how it coped with a limited priority space. The two basic attacks on such large priority spaces are the near future vs. far future subdivisions and subdividing the priority space into (most often regular) intervals. Subdividing the priority space into intervals is the most obvious; you simply use some O(lg(n)) priority queue as the bucket discipline in the "time ring," queue by the upper bits of the queue priority in the time ring, and by the lower bits in the O(lg(n)) bucket discipline. The near future vs. far future subdivision is maintaining the first N tasks in a low-constant-overhead structure like a sorted list and the remainder in some other sort of queue structure intended to handle large numbers of elements gracefully. The distribution of queue priorities strongly influences which of the methods is most potent, though it should be clear the methods can be used in combination. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 4:01 ` William Lee Irwin III @ 2007-04-15 4:18 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-15 4:18 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, 14 Apr 2007, William Lee Irwin III wrote: > The two basic attacks on such large priority spaces are the near future > vs. far future subdivisions and subdividing the priority space into > (most often regular) intervals. Subdividing the priority space into > intervals is the most obvious; you simply use some O(lg(n)) priority > queue as the bucket discipline in the "time ring," queue by the upper > bits of the queue priority in the time ring, and by the lower bits in > the O(lg(n)) bucket discipline. Sure. If you really need sub-millisecond precision, you can replace the bucket's list_head with an rb_root. It may be not necessary though for a cpu scheduler (still, didn't read Ingo's code yet). - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
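Purely to illustrate the combination being discussed, here is a rough kernel-context sketch against the generic rbtree API (rb_link_node/rb_insert_color/rb_first/rb_erase); it is not code from any posted patch, the slot count and shift are placeholders, and the Tb/Ib wrap-around bookkeeping from the time-ring post is omitted. The upper bits of the key select a ring slot in O(1); each slot keeps an rbtree ordered by the full-precision key, so dequeue is rb_first() of the first occupied slot.

#include <linux/rbtree.h>

#define RING_SLOTS	256	/* power of two */
#define SLOT_SHIFT	20	/* upper bits of the nanosecond key select the slot */

struct hybrid_queue {
	struct rb_root slot[RING_SLOTS];	/* each slot must start out as RB_ROOT */
	unsigned int base_idx;
};

struct hq_item {
	struct rb_node node;
	unsigned long long key;			/* full-precision key, e.g. CFS's ->key */
};

/* ring indexed by the upper key bits; rbtree orders the rest at full precision */
static void hq_enqueue(struct hybrid_queue *q, struct hq_item *it)
{
	unsigned int s = (unsigned int)(it->key >> SLOT_SHIFT) & (RING_SLOTS - 1);
	struct rb_node **link = &q->slot[s].rb_node, *parent = NULL;

	while (*link) {
		struct hq_item *entry;

		parent = *link;
		entry = rb_entry(parent, struct hq_item, node);
		if (it->key < entry->key)
			link = &parent->rb_left;
		else
			link = &parent->rb_right;
	}
	rb_link_node(&it->node, parent, link);
	rb_insert_color(&it->node, &q->slot[s]);
}

/* scan forward from the base slot; rb_first() of the first busy slot wins */
static struct hq_item *hq_dequeue(struct hybrid_queue *q)
{
	unsigned int i;

	for (i = 0; i < RING_SLOTS; i++) {
		unsigned int s = (q->base_idx + i) & (RING_SLOTS - 1);
		struct rb_node *first = rb_first(&q->slot[s]);

		if (first) {
			struct hq_item *it = rb_entry(first, struct hq_item, node);

			rb_erase(first, &q->slot[s]);
			q->base_idx = s;
			return it;
		}
	}
	return NULL;
}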
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 22:38 ` Davide Libenzi 2007-04-14 23:26 ` Davide Libenzi 2007-04-15 4:01 ` William Lee Irwin III @ 2007-04-15 23:09 ` Pavel Pisa 2007-04-16 5:47 ` Davide Libenzi 2 siblings, 1 reply; 304+ messages in thread From: Pavel Pisa @ 2007-04-15 23:09 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007 00:38, Davide Libenzi wrote: > Haven't looked at the scheduler code yet, but for a similar problem I use > a time ring. The ring has Ns (2 power is better) slots (where tasks are > queued - in my case they were som sort of timers), and it has a current > base index (Ib), a current base time (Tb) and a time granularity (Tg). It > also has a bitmap with bits telling you which slots contains queued tasks. > An item (task) that has to be scheduled at time T, will be queued in the > slot: > > S = Ib + min((T - Tb) / Tg, Ns - 1); > > Items with T longer than Ns*Tg will be scheduled in the relative last slot > (chosing a proper Ns and Tg can minimize this). > Queueing is O(1) and de-queueing is O(Ns). You can play with Ns and Tg to > suite to your needs. > This is a simple bench between time-ring (TR) and CFS queueing: > > http://www.xmailserver.org/smart-queue.c > > In my box (Dual Opteron 252): > > davide@alien:~$ ./smart-queue -n 8 > CFS = 142.21 cycles/loop > TR = 72.33 cycles/loop > davide@alien:~$ ./smart-queue -n 16 > CFS = 188.74 cycles/loop > TR = 83.79 cycles/loop > davide@alien:~$ ./smart-queue -n 32 > CFS = 221.36 cycles/loop > TR = 75.93 cycles/loop > davide@alien:~$ ./smart-queue -n 64 > CFS = 242.89 cycles/loop > TR = 81.29 cycles/loop Hello all, I cannot help myself to not report results with GAVL tree algorithm there as an another race competitor. I believe, that it is better solution for large priority queues than RB-tree and even heap trees. It could be disputable if the scheduler needs such scalability on the other hand. The AVL heritage guarantees lower height which results in shorter search times which could be profitable for other uses in kernel. GAVL algorithm is AVL tree based, so it does not suffer from "infinite" priorities granularity there as TR does. It allows use for generalized case where tree is not fully balanced. This allows to cut the first item withour rebalancing. This leads to the degradation of the tree by one more level (than non degraded AVL gives) in maximum, which is still considerably better than RB-trees maximum. http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c The description behind the code is there http://cmp.felk.cvut.cz/~pisa/ulan/gavl.pdf The code is part of much more covering uLUt library http://cmp.felk.cvut.cz/~pisa/ulan/ulut.pdf http://sourceforge.net/project/showfiles.php?group_id=118937&package_id=130840 I have included all required GAVL code directly into smart-queue-v-gavl.c to provide it for easy testing. There are tests run on my little dated computer - Duron 600 MHz. Test are run twice to suppress run order influence. 
./smart-queue-v-gavl -n 1 -l 2000000 gavl_cfs = 55.66 cycles/loop CFS = 88.33 cycles/loop TR = 141.78 cycles/loop CFS = 90.45 cycles/loop gavl_cfs = 55.38 cycles/loop ./smart-queue-v-gavl -n 2 -l 2000000 gavl_cfs = 82.85 cycles/loop CFS = 104.18 cycles/loop TR = 145.21 cycles/loop CFS = 102.74 cycles/loop gavl_cfs = 82.05 cycles/loop ./smart-queue-v-gavl -n 4 -l 2000000 gavl_cfs = 137.45 cycles/loop CFS = 156.47 cycles/loop TR = 142.00 cycles/loop CFS = 152.65 cycles/loop gavl_cfs = 139.38 cycles/loop ./smart-queue-v-gavl -n 10 -l 2000000 gavl_cfs = 229.22 cycles/loop (WORSE) CFS = 206.26 cycles/loop TR = 140.81 cycles/loop CFS = 208.29 cycles/loop gavl_cfs = 223.62 cycles/loop (WORSE) ./smart-queue-v-gavl -n 100 -l 2000000 gavl_cfs = 257.66 cycles/loop CFS = 329.68 cycles/loop TR = 142.20 cycles/loop CFS = 319.34 cycles/loop gavl_cfs = 260.02 cycles/loop ./smart-queue-v-gavl -n 1000 -l 2000000 gavl_cfs = 258.41 cycles/loop CFS = 393.04 cycles/loop TR = 134.76 cycles/loop CFS = 392.20 cycles/loop gavl_cfs = 260.93 cycles/loop ./smart-queue-v-gavl -n 10000 -l 2000000 gavl_cfs = 259.45 cycles/loop CFS = 605.89 cycles/loop TR = 196.69 cycles/loop CFS = 622.60 cycles/loop gavl_cfs = 262.72 cycles/loop ./smart-queue-v-gavl -n 100000 -l 2000000 gavl_cfs = 258.21 cycles/loop CFS = 845.62 cycles/loop TR = 315.37 cycles/loop CFS = 860.21 cycles/loop gavl_cfs = 258.94 cycles/loop The GAVL code has not been tuned by any "likely"/"unlikely" constructs. It brings even some other overhead from it generic design which is not necessary for this use - it keeps permanently even pointer to the last element, ensures, that the insertion order is preserved for same key values etc. But it still proves much better scalability then kernel used RB-tree code. On the other hand, it does not encode color/height in one of the pointers and requires additional field for height. May it be, that difference is due some bug in my testing, then I would be interrested in correction. The test case is oversimplified probably. I have already run more different tests against GAVL code in the past to compare it with different tree and queues implementations and I have not found case with real performance degradation. On the other hand, there are cases for small items counts where GAVL is sometimes a little worse than others (array based heap-tree for example). The GAVL code itself is used in more opensource and commercial projects and we have noticed no problems after one small fix at the time of the first release in 2004. Best wishes Pavel Pisa e-mail: pisa@cmp.felk.cvut.cz www: http://cmp.felk.cvut.cz/~pisa work: http://www.pikron.com ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:09 ` Pavel Pisa @ 2007-04-16 5:47 ` Davide Libenzi 2007-04-17 0:37 ` Pavel Pisa 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-16 5:47 UTC (permalink / raw) To: Pavel Pisa Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, 16 Apr 2007, Pavel Pisa wrote: > I cannot help myself to not report results with GAVL > tree algorithm there as an another race competitor. > I believe, that it is better solution for large priority > queues than RB-tree and even heap trees. It could be > disputable if the scheduler needs such scalability on > the other hand. The AVL heritage guarantees lower height > which results in shorter search times which could > be profitable for other uses in kernel. > > GAVL algorithm is AVL tree based, so it does not suffer from > "infinite" priorities granularity there as TR does. It allows > use for generalized case where tree is not fully balanced. > This allows to cut the first item withour rebalancing. > This leads to the degradation of the tree by one more level > (than non degraded AVL gives) in maximum, which is still > considerably better than RB-trees maximum. > > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c Here are the results on my Opteron 252: Testing N=1 gavl_cfs = 187.20 cycles/loop CFS = 194.16 cycles/loop TR = 314.87 cycles/loop CFS = 194.15 cycles/loop gavl_cfs = 187.15 cycles/loop Testing N=2 gavl_cfs = 268.94 cycles/loop CFS = 305.53 cycles/loop TR = 313.78 cycles/loop CFS = 289.58 cycles/loop gavl_cfs = 266.02 cycles/loop Testing N=4 gavl_cfs = 452.13 cycles/loop CFS = 518.81 cycles/loop TR = 311.54 cycles/loop CFS = 516.23 cycles/loop gavl_cfs = 450.73 cycles/loop Testing N=8 gavl_cfs = 609.29 cycles/loop CFS = 644.65 cycles/loop TR = 308.11 cycles/loop CFS = 667.01 cycles/loop gavl_cfs = 592.89 cycles/loop Testing N=16 gavl_cfs = 686.30 cycles/loop CFS = 807.41 cycles/loop TR = 317.20 cycles/loop CFS = 810.24 cycles/loop gavl_cfs = 688.42 cycles/loop Testing N=32 gavl_cfs = 756.57 cycles/loop CFS = 852.14 cycles/loop TR = 301.22 cycles/loop CFS = 876.12 cycles/loop gavl_cfs = 758.46 cycles/loop Testing N=64 gavl_cfs = 831.97 cycles/loop CFS = 997.16 cycles/loop TR = 304.74 cycles/loop CFS = 1003.26 cycles/loop gavl_cfs = 832.83 cycles/loop Testing N=128 gavl_cfs = 897.33 cycles/loop CFS = 1030.36 cycles/loop TR = 295.65 cycles/loop CFS = 1035.29 cycles/loop gavl_cfs = 892.51 cycles/loop Testing N=256 gavl_cfs = 963.17 cycles/loop CFS = 1146.04 cycles/loop TR = 295.35 cycles/loop CFS = 1162.04 cycles/loop gavl_cfs = 966.31 cycles/loop Testing N=512 gavl_cfs = 1029.82 cycles/loop CFS = 1218.34 cycles/loop TR = 288.78 cycles/loop CFS = 1257.97 cycles/loop gavl_cfs = 1029.83 cycles/loop Testing N=1024 gavl_cfs = 1091.76 cycles/loop CFS = 1318.47 cycles/loop TR = 287.74 cycles/loop CFS = 1311.72 cycles/loop gavl_cfs = 1093.29 cycles/loop Testing N=2048 gavl_cfs = 1153.03 cycles/loop CFS = 1398.84 cycles/loop TR = 286.75 cycles/loop CFS = 1438.68 cycles/loop gavl_cfs = 1149.97 cycles/loop There seem to be some difference from your numbers. This is with: gcc version 4.1.2 and -O2. But then and Opteron can behave quite differentyl than a Duron on a bench like this ;) - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:47 ` Davide Libenzi @ 2007-04-17 0:37 ` Pavel Pisa 0 siblings, 0 replies; 304+ messages in thread From: Pavel Pisa @ 2007-04-17 0:37 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Ingo Molnar, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 07:47, Davide Libenzi wrote: > On Mon, 16 Apr 2007, Pavel Pisa wrote: > > I cannot help myself to not report results with GAVL > > tree algorithm there as an another race competitor. > > I believe, that it is better solution for large priority > > queues than RB-tree and even heap trees. It could be > > disputable if the scheduler needs such scalability on > > the other hand. The AVL heritage guarantees lower height > > which results in shorter search times which could > > be profitable for other uses in kernel. > > > > GAVL algorithm is AVL tree based, so it does not suffer from > > "infinite" priorities granularity there as TR does. It allows > > use for generalized case where tree is not fully balanced. > > This allows to cut the first item withour rebalancing. > > This leads to the degradation of the tree by one more level > > (than non degraded AVL gives) in maximum, which is still > > considerably better than RB-trees maximum. > > > > http://cmp.felk.cvut.cz/~pisa/linux/smart-queue-v-gavl.c > > Here are the results on my Opteron 252: > > Testing N=1 > gavl_cfs = 187.20 cycles/loop > CFS = 194.16 cycles/loop > TR = 314.87 cycles/loop > CFS = 194.15 cycles/loop > gavl_cfs = 187.15 cycles/loop > > Testing N=2 > gavl_cfs = 268.94 cycles/loop > CFS = 305.53 cycles/loop > TR = 313.78 cycles/loop > CFS = 289.58 cycles/loop > gavl_cfs = 266.02 cycles/loop > > Testing N=4 > gavl_cfs = 452.13 cycles/loop > CFS = 518.81 cycles/loop > TR = 311.54 cycles/loop > CFS = 516.23 cycles/loop > gavl_cfs = 450.73 cycles/loop > > Testing N=8 > gavl_cfs = 609.29 cycles/loop > CFS = 644.65 cycles/loop > TR = 308.11 cycles/loop > CFS = 667.01 cycles/loop > gavl_cfs = 592.89 cycles/loop > > Testing N=16 > gavl_cfs = 686.30 cycles/loop > CFS = 807.41 cycles/loop > TR = 317.20 cycles/loop > CFS = 810.24 cycles/loop > gavl_cfs = 688.42 cycles/loop > > Testing N=32 > gavl_cfs = 756.57 cycles/loop > CFS = 852.14 cycles/loop > TR = 301.22 cycles/loop > CFS = 876.12 cycles/loop > gavl_cfs = 758.46 cycles/loop > > Testing N=64 > gavl_cfs = 831.97 cycles/loop > CFS = 997.16 cycles/loop > TR = 304.74 cycles/loop > CFS = 1003.26 cycles/loop > gavl_cfs = 832.83 cycles/loop > > Testing N=128 > gavl_cfs = 897.33 cycles/loop > CFS = 1030.36 cycles/loop > TR = 295.65 cycles/loop > CFS = 1035.29 cycles/loop > gavl_cfs = 892.51 cycles/loop > > Testing N=256 > gavl_cfs = 963.17 cycles/loop > CFS = 1146.04 cycles/loop > TR = 295.35 cycles/loop > CFS = 1162.04 cycles/loop > gavl_cfs = 966.31 cycles/loop > > Testing N=512 > gavl_cfs = 1029.82 cycles/loop > CFS = 1218.34 cycles/loop > TR = 288.78 cycles/loop > CFS = 1257.97 cycles/loop > gavl_cfs = 1029.83 cycles/loop > > Testing N=1024 > gavl_cfs = 1091.76 cycles/loop > CFS = 1318.47 cycles/loop > TR = 287.74 cycles/loop > CFS = 1311.72 cycles/loop > gavl_cfs = 1093.29 cycles/loop > > Testing N=2048 > gavl_cfs = 1153.03 cycles/loop > CFS = 1398.84 cycles/loop > TR = 286.75 cycles/loop > CFS = 1438.68 cycles/loop > gavl_cfs = 1149.97 cycles/loop > > > There seem to be some difference from your numbers. 
This is with: > > gcc version 4.1.2 > > and -O2. But then and Opteron can behave quite differentyl than a Duron on > a bench like this ;) Thanks for testing, but yours numbers are more correct than my first report. My numbers seemed to be over-optimistic even to me, In the fact I have been surprised that difference is so high. But I have tested bad version of code without GAVL_FAFTER option set. The code pushed to the web page has been the correct one. I have not get to look into case until now because I have busy day to prepare some Linux based labs at university. Without GAVL_FAFTER option, insert operation does fail if item with same key is already inserted (intended feature of the code) and as result of that, not all items have been inserted in the test. The meaning of GAVL_FAFTER is find/insert after all items with the same key value. Default behavior is operate on unique keys in tree and reject duplicates. My results are even worse for GAVL than yours. It is possible to try tweak code and optimize it more (likely/unlikely/do not keep last ptr etc) for this actual usage. May it be, that I try this exercise, but I do not expect that the result after tuning would be so much better, that it would outweight some redesign work. I could see some advantages of AVL still, but it has its own drawbacks with need of separate height field and little worse delete in the middle timing. So excuse me for disturbance. I have been only curious how GAVL code would behave in the comparison of other algorithms and I did not kept my premature enthusiasm under the lock. Best wishes Pavel Pisa ./smart-queue-v-gavl -n 4 gavl_cfs = 279.02 cycles/loop CFS = 200.87 cycles/loop TR = 229.55 cycles/loop CFS = 201.23 cycles/loop gavl_cfs = 276.08 cycles/loop ./smart-queue-v-gavl -n 8 gavl_cfs = 310.92 cycles/loop CFS = 288.45 cycles/loop TR = 192.46 cycles/loop CFS = 284.94 cycles/loop gavl_cfs = 357.02 cycles/loop ./smart-queue-v-gavl -n 16 gavl_cfs = 350.45 cycles/loop CFS = 354.01 cycles/loop TR = 189.79 cycles/loop CFS = 320.08 cycles/loop gavl_cfs = 387.43 cycles/loop ./smart-queue-v-gavl -n 32 gavl_cfs = 419.23 cycles/loop CFS = 406.88 cycles/loop TR = 198.10 cycles/loop CFS = 398.15 cycles/loop gavl_cfs = 412.57 cycles/loop ./smart-queue-v-gavl -n 64 gavl_cfs = 442.81 cycles/loop CFS = 429.62 cycles/loop TR = 235.40 cycles/loop CFS = 389.54 cycles/loop gavl_cfs = 433.56 cycles/loop ./smart-queue-v-gavl -n 128 gavl_cfs = 358.20 cycles/loop CFS = 605.49 cycles/loop TR = 236.01 cycles/loop CFS = 458.50 cycles/loop gavl_cfs = 455.05 cycles/loop ./smart-queue-v-gavl -n 256 gavl_cfs = 529.72 cycles/loop CFS = 530.98 cycles/loop TR = 193.75 cycles/loop CFS = 533.75 cycles/loop gavl_cfs = 471.47 cycles/loop ./smart-queue-v-gavl -n 512 gavl_cfs = 525.80 cycles/loop CFS = 550.63 cycles/loop TR = 188.71 cycles/loop CFS = 549.81 cycles/loop gavl_cfs = 494.73 cycles/loop ./smart-queue-v-gavl -n 1024 gavl_cfs = 544.91 cycles/loop CFS = 561.68 cycles/loop TR = 230.97 cycles/loop CFS = 522.68 cycles/loop gavl_cfs = 542.40 cycles/loop ./smart-queue-v-gavl -n 2048 gavl_cfs = 567.46 cycles/loop CFS = 581.85 cycles/loop TR = 229.69 cycles/loop CFS = 585.41 cycles/loop gavl_cfs = 563.22 cycles/loop ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (4 preceding siblings ...) 2007-04-13 22:21 ` William Lee Irwin III @ 2007-04-13 22:31 ` Willy Tarreau 2007-04-13 23:18 ` Ingo Molnar 2007-04-13 23:07 ` Gabriel C ` (7 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-13 22:31 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Hi Ingo, On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] (...) > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). I have a high confidence this will work better : I've been using time-ordered trees in userland projects for several years, and never found anything better. To be honnest, I never understood the concept behind the array switch, but as I never felt brave enough to hack something in this kernel area, I simply preferred to shut up (not enough knowledge and not enough time). However, I have been using a very fast struct timeval-ordered RADIX tree. I found generic rbtree code to generally be slower, certainly because of the call to a function with arguments on every node. Both trees are O(log(n)), the rbtree being balanced and the radix tree being unbalanced. If you're interested, I can try to see how that would fit (but not this week-end). Also, I had spent much time in the past doing paper work on how to improve fairness between interactive tasks and batch tasks. I came up with the conclusion that for perfectness, tasks should not be ordered by their expected wakeup time, but by their expected completion time, which automatically takes account of their allocated and used timeslice. It would also allow both types of workloads to share equal CPU time with better responsiveness for the most interactive one through the reallocation of a "credit" for the tasks which have not consumed all of their timeslices. I remember we had discussed this with Mike about one year ago when he fixed lots of problems in mainline scheduler. The downside is that I never found how to make this algo fit in O(log(n)). I always ended in something like O(n.log(n)) IIRC. But maybe this is overkill for real life anyway. Given that a basic two arrays switch (which I never understood) was sufficient for many people, probably that a basic tree will be an order of magnitude better. > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns > > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. I find this useful, but to be fair with Mike and Con, they both have proposed similar tuning knobs in the past and you said you did not want to add that complexity for admins. People can sometimes be demotivated by seeing their proposals finally used by people who first rejected them. 
And since both Mike and Con both have done a wonderful job in that area, we need their experience and continued active participation more than ever. > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. I'm very pleased to read this. Because as I have already said it, my major concern with 2.6 was the stock scheduler. Recently, RSDL fixed most of the basic problems for me to the point that I switched the default lilo entry on my notebook to 2.6 ! I hope that whatever the next scheduler will be, we'll definitely get rid of any heuristics. Heuristics are good in 95% of situations and extremely bad in the remaining 5%. I prefer something reasonably good in 100% of situations. > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. > > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) > > - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler > way than the vanilla scheduler does. It uses 100 runqueues (for all > 100 RT priority levels, instead of 140 in the vanilla scheduler) > and it needs no expired array. > > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. Will this have any impact on NUMA/HT/multi-core/etc... ? > the core scheduler got smaller by more than 700 lines: Well done ! Cheers, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 22:31 ` Willy Tarreau @ 2007-04-13 23:18 ` Ingo Molnar 2007-04-14 18:48 ` Bill Huey 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 23:18 UTC (permalink / raw) To: Willy Tarreau Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > > central tunable: > > > > /proc/sys/kernel/sched_granularity_ns > > > > which can be used to tune the scheduler from 'desktop' (low > > latencies) to 'server' (good batching) workloads. It defaults to a > > setting suitable for desktop workloads. SCHED_BATCH is handled by the > > CFS scheduler module too. > > I find this useful, but to be fair with Mike and Con, they both have > proposed similar tuning knobs in the past and you said you did not > want to add that complexity for admins. [...] yeah. [ Note that what i opposed in the past was mostly the 'export all the zillion of sched.c knobs to /sys and let people mess with them' kind of patches which did exist and still exist. A _single_ knob, which represents basically the totality of parameters within sched_fair.c is less of a problem. I dont think i ever objected to this knob within staircase/SD. (If i did then i was dead wrong.) ] > [...] People can sometimes be demotivated by seeing their proposals > finally used by people who first rejected them. And since both Mike > and Con both have done a wonderful job in that area, we need their > experience and continued active participation more than ever. very much so! Both Con and Mike has contributed regularly to upstream sched.c: $ git-log kernel/sched.c | grep 'by: Con Kolivas' 1 | wc -l 19 $ git-log kernel/sched.c | grep 'by: Mike' | wc -l 6 and i'd very much like both counts to increase steadily in the future too :) > > - reworked/sanitized SMP load-balancing: the runqueue-walking > > assumptions are gone from the load-balancing code now, and > > iterators of the scheduling modules are used. The balancing code > > got quite a bit simpler as a result. > > Will this have any impact on NUMA/HT/multi-core/etc... ? it will inevitably have some sort of effect - and if it's negative, i'll try to fix it. I got rid of the explicit cache-hot tracking code and replaced it with a more natural pure 'pick the next-to-run task first, that is likely the most cache-cold one' logic. That just derives naturally from the rbtree approach. > > the core scheduler got smaller by more than 700 lines: > > Well done ! thanks :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
  2007-04-13 23:18 ` Ingo Molnar
@ 2007-04-14 18:48 ` Bill Huey
  0 siblings, 0 replies; 304+ messages in thread
From: Bill Huey @ 2007-04-14 18:48 UTC (permalink / raw)
To: Ingo Molnar
Cc: Willy Tarreau, linux-kernel, Linus Torvalds, Andrew Morton,
    Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
    Thomas Gleixner, Bill Huey (hui)

On Sat, Apr 14, 2007 at 01:18:09AM +0200, Ingo Molnar wrote:
> very much so! Both Con and Mike has contributed regularly to upstream
> sched.c:

The problem here is that Con can get demotivated (and rather upset) when
an idea gets proposed, like SchedPlug, only to have people be hostile to
it and then suddenly turn around and adopt this idea. It gives the
impression that you, in this specific case, were more interested in
controlling the situation and the track of development instead of
actually being inclusive of the development process, with discussion and
serious consideration, etc... This is how the Linux community can be
perceived as elitist.

The old guard would serve the community better if people were more
mindful of and sensitive to developer issues. There was a particular
speech at OLS 2006 that turned me off because it pretty much pandered to
the "old guard's" needs over those of newer developers. Since I'm a
somewhat established engineer in -rt (being the only other person that
mapped the lock hierarchy out for full preemptibility), I had the
confidence to pretty much ignore it, while previously this could have
really upset me and been highly discouraging to a relatively new
developer.

As Linux gets larger and larger this is going to be an increasing
problem as folks come into the community with new ideas, and the
community will need to change if it intends to integrate these folks.
IMO, a lot of these flame wars wouldn't need to exist if folks listened
to each other better and permitted co-ownership of code like the
scheduler, since it needs multiple hands in it to adapt to new loads and
situations, etc...

I'm saying this nicely now since I can be nasty about it.

bill

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (5 preceding siblings ...) 2007-04-13 22:31 ` Willy Tarreau @ 2007-04-13 23:07 ` Gabriel C 2007-04-13 23:25 ` Ingo Molnar 2007-04-14 2:04 ` Nick Piggin ` (6 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Gabriel C @ 2007-04-13 23:07 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > [...] > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, > Compile error here. ... CC kernel/sched.o kernel/sched.c: In function '__rq_clock': kernel/sched.c:219: error: 'struct rq' has no member named 'cpu' kernel/sched.c:219: warning: type defaults to 'int' in declaration of '__ret_warn_once' kernel/sched.c:219: error: 'struct rq' has no member named 'cpu' kernel/sched.c: In function 'rq_clock': kernel/sched.c:230: error: 'struct rq' has no member named 'cpu' kernel/sched.c: In function 'sched_init': kernel/sched.c:6013: warning: unused variable 'j' make[1]: *** [kernel/sched.o] Error 1 make: *** [kernel] Error 2 ==> ERROR: Build Failed. Aborting... ... There the config : http://frugalware.org/~crazy/other/kernel/config > Ingo > - > > Regards, Gabriel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:07 ` Gabriel C @ 2007-04-13 23:25 ` Ingo Molnar 2007-04-13 23:39 ` Gabriel C 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-13 23:25 UTC (permalink / raw) To: Gabriel C Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Gabriel C <nix.or.die@googlemail.com> wrote: > > as usual, any sort of feedback, bugreports, fixes and suggestions > > are more than welcome, > > Compile error here. ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also updated the full patch at the cfs-scheduler URL) Ingo -----------------------> From: Ingo Molnar <mingo@elte.hu> Subject: [cfs] fix !CONFIG_SMP build fix the !CONFIG_SMP build error reported by Gabriel C Signed-off-by: Ingo Molnar <mingo@elte.hu> Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -257,16 +257,6 @@ static inline unsigned long long __rq_cl return rq->rq_clock; } -static inline unsigned long long rq_clock(struct rq *rq) -{ - int this_cpu = smp_processor_id(); - - if (this_cpu == rq->cpu) - return __rq_clock(rq); - - return rq->rq_clock; -} - static inline int cpu_of(struct rq *rq) { #ifdef CONFIG_SMP @@ -276,6 +266,16 @@ static inline int cpu_of(struct rq *rq) #endif } +static inline unsigned long long rq_clock(struct rq *rq) +{ + int this_cpu = smp_processor_id(); + + if (this_cpu == cpu_of(rq)) + return __rq_clock(rq); + + return rq->rq_clock; +} + /* * The domain tree (rq->sd) is protected by RCU's quiescent state transition. * See detach_destroy_domains: synchronize_sched for details. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 23:25 ` Ingo Molnar @ 2007-04-13 23:39 ` Gabriel C 0 siblings, 0 replies; 304+ messages in thread From: Gabriel C @ 2007-04-13 23:39 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Gabriel C <nix.or.die@googlemail.com> wrote: > > >>> as usual, any sort of feedback, bugreports, fixes and suggestions >>> are more than welcome, >>> >> Compile error here. >> > > ah, !CONFIG_SMP. Does the patch below do the trick for you? (I've also > updated the full patch at the cfs-scheduler URL) > Yes it does , thx :) , only the " warning: unused variable 'j' " left. > Ingo > Regards, Gabriel ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (6 preceding siblings ...) 2007-04-13 23:07 ` Gabriel C @ 2007-04-14 2:04 ` Nick Piggin 2007-04-14 6:32 ` Ingo Molnar 2007-04-14 15:09 ` S.Çağlar Onur ` (5 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-14 2:04 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 13, 2007 at 10:21:00PM +0200, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch Always good to see another contender ;) > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] I don't know why there is such noise about fairness right now... I thought fairness was one of the fundamental properties of a good CPU scheduler, and my scheduler definitely always aims for that above most other things. Why not just keep SCHED_OTHER? > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. Don't really like this, but anyway... > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. > > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! I guess the 2.4 and earlier scheduler kind of did that as well. > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] Comment about the code: shouldn't you be requeueing the task in the rbtree wherever you change wait_runtime? eg. task_new_fair? (I've only had a quick look so far). > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). > > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. Well, I guess there is still some mechanism to decide which process is most eligible to run? 
;) Considering that question has no "right" answer for SCHED_OTHER scheduling, I guess you could say it has heuristics. But granted they are obviously fairly elegant in contrast to the O(1) scheduler ;) > There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns Suppose you have 2 CPU hogs running, is sched_granularity_ns the frequency at which they will context switch? > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) What is better behaviour for sched_yield? Thanks, Nick ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 2:04 ` Nick Piggin @ 2007-04-14 6:32 ` Ingo Molnar 2007-04-14 6:43 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 6:32 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > The CFS patch uses a completely different approach and implementation > > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > > that of RSDL/SD, which is a high standard to meet :-) Testing > > feedback is welcome to decide this one way or another. [ and, in any > > case, all of SD's logic could be added via a kernel/sched_sd.c module > > as well, if Con is interested in such an approach. ] > > Comment about the code: shouldn't you be requeueing the task in the > rbtree wherever you change wait_runtime? eg. task_new_fair? [...] yes: the task's position within the rbtree is updated every time wherever wait_runtime is change. task_new_fair is the method during new task creation, but indeed i forgot to requeue the parent. I've fixed this in my tree (see the delta patch below) - thanks! Ingo -----------> From: Ingo Molnar <mingo@elte.hu> Subject: [cfs] fix parent's rbtree position Nick noticed that upon fork we change parent->wait_runtime but we do not requeue it within the rbtree. Signed-off-by: Ingo Molnar <mingo@elte.hu> Index: linux/kernel/sched_fair.c =================================================================== --- linux.orig/kernel/sched_fair.c +++ linux/kernel/sched_fair.c @@ -524,6 +524,8 @@ static void task_new_fair(struct rq *rq, p->wait_runtime = parent->wait_runtime/2; parent->wait_runtime /= 2; + requeue_task_fair(rq, parent); + /* * For the first timeslice we allow child threads * to move their parent-inherited fairness back ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 6:32 ` Ingo Molnar @ 2007-04-14 6:43 ` Ingo Molnar 2007-04-14 8:08 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 6:43 UTC (permalink / raw) To: Nick Piggin Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > Nick noticed that upon fork we change parent->wait_runtime but we do > not requeue it within the rbtree. this fix is not complete - because the child runqueue is locked here, not the parent's. I've fixed this properly in my tree and have uploaded a new sched-modular+cfs.patch. (the effects of the original bug are mostly harmless, the rbtree position gets corrected the first time the parent reschedules. The fix might improve heavy forker handling.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 6:43 ` Ingo Molnar @ 2007-04-14 8:08 ` Willy Tarreau 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:36 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 8:08 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@elte.hu> wrote: > > > Nick noticed that upon fork we change parent->wait_runtime but we do > > not requeue it within the rbtree. > > this fix is not complete - because the child runqueue is locked here, > not the parent's. I've fixed this properly in my tree and have uploaded > a new sched-modular+cfs.patch. (the effects of the original bug are > mostly harmless, the rbtree position gets corrected the first time the > parent reschedules. The fix might improve heavy forker handling.) It looks like it did not reach your public dir yet. BTW, I've given it a try. It seems pretty usable. I have also tried the usual meaningless "glxgears" test with 12 of them at the same time, and they rotate very smoothly, there is absolutely no pause in any of them. But they don't all run at same speed, and top reports their CPU load varying from 3.4 to 10.8%, with what looks like more CPU is assigned to the first processes, and less CPU for the last ones. But this is just a rough observation on a stupid test, I would not call that one scientific in any way (and X has its share in the test too). I'll perform other tests when I can rebuild with your fixed patch. Cheers, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:08 ` Willy Tarreau @ 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 19:48 ` William Lee Irwin III 2007-04-14 10:36 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 8:36 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 10:08:34AM +0200, Willy Tarreau wrote: > On Sat, Apr 14, 2007 at 08:43:34AM +0200, Ingo Molnar wrote: > > > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > > Nick noticed that upon fork we change parent->wait_runtime but we do > > > not requeue it within the rbtree. > > > > this fix is not complete - because the child runqueue is locked here, > > not the parent's. I've fixed this properly in my tree and have uploaded > > a new sched-modular+cfs.patch. (the effects of the original bug are > > mostly harmless, the rbtree position gets corrected the first time the > > parent reschedules. The fix might improve heavy forker handling.) > > It looks like it did not reach your public dir yet. > > BTW, I've given it a try. It seems pretty usable. I have also tried > the usual meaningless "glxgears" test with 12 of them at the same time, > and they rotate very smoothly, there is absolutely no pause in any of > them. But they don't all run at same speed, and top reports their CPU > load varying from 3.4 to 10.8%, with what looks like more CPU is > assigned to the first processes, and less CPU for the last ones. But > this is just a rough observation on a stupid test, I would not call > that one scientific in any way (and X has its share in the test too). Follow-up: I think this is mostly X-related. I've started 100 scheddos, and all get the same CPU percentage. Interestingly, mpg123 in parallel does never skip at all because it needs quite less than 1% CPU and gets its fair share at a load of 112. Xterms are slow to respond to typing with the 12 gears and 100 scheddos, and expectedly it was X which was starving. renicing it to -5 restores normal feeling with very slow but smooth gear rotations. Leaving X niced at 0 and killing the gears also restores normal behaviour. All in all, it seems logical that processes which serve many others become a bottleneck for them. Forking becomes very slow above a load of 100 it seems. Sometimes, the shell takes 2 or 3 seconds to return to prompt after I run "scheddos &" Those are very promising results, I nearly observe the same responsiveness as I had on a solaris 10 with 10k running processes on a bigger machine. I would be curious what a mysql test result would look like now. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:36 ` Willy Tarreau @ 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 15:17 ` Mark Lord 2007-04-14 19:48 ` William Lee Irwin III 1 sibling, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 10:53 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Forking becomes very slow above a load of 100 it seems. Sometimes, the > shell takes 2 or 3 seconds to return to prompt after I run "scheddos > &" this might be changed/impacted by the parent-requeue fix that is in the updated (for real, promise! ;) patch. Right now on CFS a forking parent shares its own run stats with the child 50%/50%. This means that heavy forkers are indeed penalized. Another logical choice would be 100%/0%: a child has to earn its own right. i kept the 50%/50% rule from the old scheduler, but maybe it's a more pristine (and smaller/faster) approach to just not give new children any stats history to begin with. I've implemented an add-on patch that implements this, you can find it at: http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch > Those are very promising results, I nearly observe the same > responsiveness as I had on a solaris 10 with 10k running processes on > a bigger machine. cool and thanks for the feedback! (Btw., as another test you could also try to renice "scheddos" to +19. While that does not push the scheduler nearly as hard as nice 0, it is perhaps more indicative of how a truly abusive many-tasks workload would be run in practice.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
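To make the two fork policies concrete: the difference is confined to how task_new_fair() seeds the child's wait_runtime. A simplified kernel-context sketch follows; the 50%/50% lines are taken from the fix earlier in this thread, while the 100%/0% variant is only a guess at what sched-fair-fork.patch boils down to (locking and the surrounding requeue details are omitted, and the exact patch contents are an assumption here):

	/* 50%/50%: the child inherits half of the parent's accumulated credit */
	p->wait_runtime = parent->wait_runtime/2;
	parent->wait_runtime /= 2;
	requeue_task_fair(rq, parent);	/* parent's key changed: fix its rbtree position */

	/* 100%/0% (sched-fair-fork.patch, presumably): the child starts with no
	 * inherited history, so a heavy forker cannot spread its credit around */
	p->wait_runtime = 0;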
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 10:53 ` Ingo Molnar @ 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau ` (2 more replies) 2007-04-14 15:17 ` Mark Lord 1 sibling, 3 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 13:01 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:53:39PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Forking becomes very slow above a load of 100 it seems. Sometimes, the > > shell takes 2 or 3 seconds to return to prompt after I run "scheddos > > &" > > this might be changed/impacted by the parent-requeue fix that is in the > updated (for real, promise! ;) patch. Right now on CFS a forking parent > shares its own run stats with the child 50%/50%. This means that heavy > forkers are indeed penalized. Another logical choice would be 100%/0%: a > child has to earn its own right. > > i kept the 50%/50% rule from the old scheduler, but maybe it's a more > pristine (and smaller/faster) approach to just not give new children any > stats history to begin with. I've implemented an add-on patch that > implements this, you can find it at: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch Not tried yet, it already looks better with the update and sched-fair-hog. Now xterm open "instantly" even with 1000 running processes. > > Those are very promising results, I nearly observe the same > > responsiveness as I had on a solaris 10 with 10k running processes on > > a bigger machine. > > cool and thanks for the feedback! (Btw., as another test you could also > try to renice "scheddos" to +19. While that does not push the scheduler > nearly as hard as nice 0, it is perhaps more indicative of how a truly > abusive many-tasks workload would be run in practice.) Good idea. The machine I'm typing from now has 1000 scheddos running at +19, and 12 gears at nice 0. 
Top keeps reporting different cpu usages for all gears, but I'm pretty sure that it's a top artifact now because the cumulated times are roughly identical : 14:33:13 up 13 min, 7 users, load average: 900.30, 443.75, 177.70 1088 processes: 80 sleeping, 1008 running, 0 zombie, 0 stopped CPU0 states: 56.0% user 43.0% system 23.0% nice 0.0% iowait 0.0% idle CPU1 states: 94.0% user 5.0% system 0.0% nice 0.0% iowait 0.0% idle Mem: 1034764k av, 223788k used, 810976k free, 0k shrd, 7192k buff 104400k active, 51904k inactive Swap: 497972k av, 0k used, 497972k free 68020k cached PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND 1325 root 20 0 69240 9400 3740 R 27.6 0.9 4:46 1 X 1412 willy 20 0 6284 2552 1740 R 14.2 0.2 1:09 1 glxgears 1419 willy 20 0 6256 2384 1612 R 10.7 0.2 1:09 1 glxgears 1409 willy 20 0 2824 1940 788 R 8.9 0.1 0:25 1 top 1414 willy 20 0 6280 2544 1728 S 8.9 0.2 1:08 0 glxgears 1415 willy 20 0 6256 2376 1600 R 8.9 0.2 1:07 1 glxgears 1417 willy 20 0 6256 2384 1612 S 8.9 0.2 1:05 1 glxgears 1420 willy 20 0 6284 2552 1740 R 8.9 0.2 1:07 1 glxgears 1410 willy 20 0 6256 2372 1600 S 7.1 0.2 1:11 1 glxgears 1413 willy 20 0 6260 2388 1612 S 7.1 0.2 1:08 0 glxgears 1416 willy 20 0 6284 2544 1728 S 6.2 0.2 1:06 0 glxgears 1418 willy 20 0 6252 2384 1612 S 6.2 0.2 1:09 0 glxgears 1411 willy 20 0 6280 2548 1740 S 5.3 0.2 1:15 1 glxgears 1421 willy 20 0 6280 2536 1728 R 5.3 0.2 1:05 1 glxgears >From time to time, one of the 12 aligned gears will quickly perform a full quarter of round while others slowly turn by a few degrees. In fact, while I don't know this process's CPU usage pattern, there's something useful in it : it allows me to visually see when process accelerate/deceleraet. What would be best would be just a clock requiring low X ressources and eating vast amounts of CPU between movements. It will help visually monitor CPU distribution without being too much impacted by X. I've just added another 100 scheddos at nice 0, and the system is still amazingly usable. I just tried exchanging a 1-byte token between 188 "dd" processes which communicate through circular pipes. 
The context switch rate is rather high but this has no impact on the rest : willy@pcw:c$ dd if=/tmp/fifo bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | dd bs=1 | (echo -n a;dd bs=1) | dd bs=1 of=/tmp/fifo procs memory swap io system cpu r b w swpd free buff cache si so bi bo in cs us sy id 1105 0 1 0 781108 8364 68180 0 0 0 12 5 82187 59 41 0 1114 0 1 0 781108 8364 68180 0 0 0 0 0 81528 58 42 0 1112 0 1 0 781108 8364 68180 0 0 0 0 1 80899 58 42 0 1113 0 1 0 781108 8364 68180 0 0 0 0 26 83466 58 42 0 1106 0 2 0 781108 8376 68168 0 0 0 8 91 83193 58 42 0 1107 0 1 0 781108 8376 68180 0 0 0 4 7 79951 58 42 0 1106 0 1 0 781108 8376 68180 0 0 0 0 46 80939 57 43 0 1114 0 1 0 781108 8376 68180 0 0 0 0 21 82019 56 44 0 1116 0 1 0 781108 8376 68180 0 0 0 0 16 85134 56 44 0 1114 0 3 0 781108 8388 68168 0 0 0 16 20 85871 56 44 0 1112 0 1 0 781108 8388 68168 0 0 0 0 15 80412 57 43 0 1112 0 1 0 781108 8388 68180 0 0 0 0 101 83002 58 42 0 1113 0 1 0 781108 8388 68180 0 0 0 0 25 82230 56 44 0 Playing with the sched_max_hog_history_ns does not seem to change anything. Maybe it's useful for other workloads. Anyway, I have nothing to complain about, because it's not common for me to be able to normally type a mail on a system with more than 1000 running processes ;-) Also, mixed with this load, I have started injecting HTTP requests between two local processes. The load is stable at 7700 req/s (11800 when alone), and what I was interested in is the response time. It's perfectly stable between 9.0 and 9.4 ms with a standard deviation of about 6.0 ms. Those were varying a lot under stock scheduler, with some sessions sometimes pausing for seconds. (RSDL fixed this though). 
Well, I'll stop heating the room for now as I get out of ideas about how to defeat it. I'm convinced. I'm impatient to read about Mike's feedback with his workload which behaves strangely on RSDL. If it works OK here, it will be the proof that heuristics should not be needed. Congrats ! Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
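The dd pipeline used above is easier to reproduce programmatically. Below is a
minimal C sketch of the same idea -- N processes connected in a ring of pipes,
forwarding a single 1-byte token forever -- useful for generating a comparable
context-switch load. This is only an illustration, not the tool used in this
thread; the process count mirrors the 188-stage dd example and is otherwise
arbitrary. Stop it with ^C.

  /* ring.c -- pass a 1-byte token around a ring of N processes.
   * Sketch only; N is arbitrary. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/wait.h>

  #define N 188   /* number of processes in the ring */

  int main(void)
  {
      int fds[N][2];
      int i;

      for (i = 0; i < N; i++) {
          if (pipe(fds[i]) < 0) {
              perror("pipe");
              exit(1);
          }
      }

      for (i = 0; i < N; i++) {
          pid_t pid = fork();

          if (pid < 0) {
              perror("fork");
              exit(1);
          }
          if (pid == 0) {
              /* child i: read the token from its own pipe and
               * forward it to the next process in the ring */
              char c;
              int in = fds[i][0];
              int out = fds[(i + 1) % N][1];
              int j;

              /* close every pipe end this child does not use */
              for (j = 0; j < N; j++) {
                  if (fds[j][0] != in)
                      close(fds[j][0]);
                  if (fds[j][1] != out)
                      close(fds[j][1]);
              }
              for (;;) {
                  if (read(in, &c, 1) != 1)
                      _exit(0);
                  if (write(out, &c, 1) != 1)
                      _exit(0);
              }
          }
      }

      /* parent: inject the token, then just wait */
      write(fds[0][1], "a", 1);
      for (i = 0; i < N; i++)
          close(fds[i][0]);
      for (i = 0; i < N; i++)
          wait(NULL);
      return 0;
  }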
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau @ 2007-04-14 13:27 ` Willy Tarreau 2007-04-14 14:45 ` Willy Tarreau 2007-04-14 16:19 ` Ingo Molnar 2007-04-15 7:54 ` Mike Galbraith 2007-04-19 9:01 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 13:27 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > Well, I'll stop heating the room for now as I get out of ideas about how > to defeat it. Ah, I found something nasty. If I start large batches of processes like this : $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done the ramp up slows down after 700-800 processes, but something very strange happens. If I'm under X, I can switch the focus to all xterms (the WM is still alive) but all xterms are frozen. On the console, after one moment I simply cannot switch to another VT anymore while I can still start commands locally. But "chvt 2" simply blocks. SysRq-K killed everything and restored full control. Dmesg shows lots of : SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. I wonder if part of the problem would be too many processes bound to the same tty :-/ I'll investigate a bit. Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:27 ` Willy Tarreau @ 2007-04-14 14:45 ` Willy Tarreau 2007-04-14 16:14 ` Ingo Molnar 2007-04-14 16:19 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 14:45 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 03:27:32PM +0200, Willy Tarreau wrote: > On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > > > Well, I'll stop heating the room for now as I get out of ideas about how > > to defeat it. > > Ah, I found something nasty. > If I start large batches of processes like this : > > $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done > > the ramp up slows down after 700-800 processes, but something very > strange happens. If I'm under X, I can switch the focus to all xterms > (the WM is still alive) but all xterms are frozen. On the console, > after one moment I simply cannot switch to another VT anymore while > I can still start commands locally. But "chvt 2" simply blocks. > SysRq-K killed everything and restored full control. Dmesg shows lots > of : > SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > I wonder if part of the problem would be too many processes bound to > the same tty :-/ Does not seem easy to reproduce, it looks like some resource pools are kept pre-allocated after a first run, because if I kill scheddos during the ramp up then start it again, it can go further. The problem happens when the parent is forking. Also, I modified scheddos to close(0,1,2) and to perform the forks itself and it does not cause any problem, even with 4000 processes running. So I really suspect that the problem I encountered above was tty-related. BTW, I've tried your fork patch. It definitely helps forking because it takes below one second to create 4000 processes, then the load slowly increases. As you said, the children have to earn their share, and I find that it makes it easier to conserve control of the whole system's stability. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
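For reference, a self-forking CPU hog along the lines described above could
look like the sketch below: it closes fds 0-2 so nothing stays bound to the
invoking tty, then forks N children that alternate busy-looping and sleeping.
This is not the scheddos2 source (which is not posted in this thread); the
process count and burn/idle durations are illustrative parameters only.

  /* hog.c -- fork N detached CPU hogs that busy-loop for burn_us
   * microseconds, then sleep for idle_us microseconds, with fds 0-2
   * closed so nothing stays attached to the invoking tty.
   * Sketch only.  Usage: ./hog <nprocs> <burn_us> <idle_us> */
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/time.h>

  static long long now_us(void)
  {
      struct timeval tv;

      gettimeofday(&tv, NULL);
      return tv.tv_sec * 1000000LL + tv.tv_usec;
  }

  int main(int argc, char **argv)
  {
      int nprocs   = argc > 1 ? atoi(argv[1]) : 1000;
      long burn_us = argc > 2 ? atol(argv[2]) : 4000;
      long idle_us = argc > 3 ? atol(argv[3]) : 4000;
      int i;

      /* keep no fd open on the controlling tty */
      close(0);
      close(1);
      close(2);

      for (i = 0; i < nprocs; i++) {
          if (fork() == 0) {
              for (;;) {
                  long long end = now_us() + burn_us;

                  while (now_us() < end)
                      ;               /* burn CPU */
                  usleep(idle_us);
              }
          }
      }
      for (;;)
          pause();    /* parent just sits there; ^C kills the group */
      return 0;
  }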
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 14:45 ` Willy Tarreau
@ 2007-04-14 16:14 ` Ingo Molnar
0 siblings, 0 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-14 16:14 UTC (permalink / raw)
To: Willy Tarreau
Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner

* Willy Tarreau <w@1wt.eu> wrote:

> BTW, I've tried your fork patch. It definitely helps forking because
> it takes below one second to create 4000 processes, then the load
> slowly increases. As you said, the children have to earn their share,
> and I find that it makes it easier to conserve control of the whole
> system's stability.

ok, thanks for testing this out, i think i'll integrate this one back
into the core. (I'm still unsure about the cpu-hog one.) And it saves
some code-size too:

   text    data     bss     dec     hex filename
  23349    2705      24   26078    65de kernel/sched.o.cfs-v1
  23189    2705      24   25918    653e kernel/sched.o.cfs-before
  23052    2705      24   25781    64b5 kernel/sched.o.cfs-after
  23366    4001      24   27391    6aff kernel/sched.o.vanilla
  23671    4548      24   28243    6e53 kernel/sched.o.sd.v40

	Ingo

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:27 ` Willy Tarreau 2007-04-14 14:45 ` Willy Tarreau @ 2007-04-14 16:19 ` Ingo Molnar 2007-04-14 17:15 ` Eric W. Biederman 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 16:19 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Eric W. Biederman, Jiri Slaby, Alan Cox * Willy Tarreau <w@1wt.eu> wrote: > On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: > > > > Well, I'll stop heating the room for now as I get out of ideas about how > > to defeat it. > > Ah, I found something nasty. > If I start large batches of processes like this : > > $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done > > the ramp up slows down after 700-800 processes, but something very > strange happens. If I'm under X, I can switch the focus to all xterms > (the WM is still alive) but all xterms are frozen. On the console, > after one moment I simply cannot switch to another VT anymore while I > can still start commands locally. But "chvt 2" simply blocks. SysRq-K > killed everything and restored full control. Dmesg shows lots of : > SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > I wonder if part of the problem would be too many processes bound to > the same tty :-/ hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), maybe this description rings a bell with them? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 16:19 ` Ingo Molnar @ 2007-04-14 17:15 ` Eric W. Biederman 2007-04-14 17:29 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Eric W. Biederman @ 2007-04-14 17:15 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Ingo Molnar <mingo@elte.hu> writes: > * Willy Tarreau <w@1wt.eu> wrote: > >> On Sat, Apr 14, 2007 at 03:01:01PM +0200, Willy Tarreau wrote: >> > >> > Well, I'll stop heating the room for now as I get out of ideas about how >> > to defeat it. >> >> Ah, I found something nasty. >> If I start large batches of processes like this : >> >> $ for i in $(seq 1 1000); do ./scheddos2 4000 4000 & done >> >> the ramp up slows down after 700-800 processes, but something very >> strange happens. If I'm under X, I can switch the focus to all xterms >> (the WM is still alive) but all xterms are frozen. On the console, >> after one moment I simply cannot switch to another VT anymore while I >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K >> killed everything and restored full control. Dmesg shows lots of : > >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. This. Yes. SAK is noisy and tells you everything it kills. >> I wonder if part of the problem would be too many processes bound to >> the same tty :-/ > > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), > maybe this description rings a bell with them? Is there any swapping going on? I'm inclined to suspect that it is a problem that has more to do with the number of processes and has nothing to do with ttys. Anyway you can easily rule out ttys by having your startup program detach from a controlling tty before you start everything. I'm more inclined to guess something is reading /proc a lot, or doing something that holds the tasklist lock, a lot or something like that, if the problem isn't that you are being kicked into swap. Eric ^ permalink raw reply [flat|nested] 304+ messages in thread
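Eric's suggestion of taking the controlling terminal out of the picture can be
done with the usual fork()+setsid() idiom. A minimal sketch of such a launcher
follows (the program name is hypothetical and unrelated to any tool mentioned
in the thread):

  /* detach.c -- run a command with no controlling terminal, to rule
   * ttys out of the picture.  Sketch only.
   * Usage: ./detach <command> [args...] */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      if (argc < 2) {
          fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
          return 1;
      }

      /* first fork: make sure we are not a process group leader,
       * so that setsid() cannot fail */
      if (fork() > 0)
          return 0;

      /* new session: no controlling terminal any more */
      if (setsid() < 0) {
          perror("setsid");
          return 1;
      }

      /* note: if the spawned load later opens a tty without O_NOCTTY,
       * it could re-acquire a controlling terminal */
      execvp(argv[1], argv + 1);
      perror("execvp");
      return 1;
  }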
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:15 ` Eric W. Biederman @ 2007-04-14 17:29 ` Willy Tarreau 2007-04-14 17:44 ` Eric W. Biederman 2007-04-14 17:50 ` Linus Torvalds 0 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 17:29 UTC (permalink / raw) To: Eric W. Biederman Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Hi Eric, [...] > >> the ramp up slows down after 700-800 processes, but something very > >> strange happens. If I'm under X, I can switch the focus to all xterms > >> (the WM is still alive) but all xterms are frozen. On the console, > >> after one moment I simply cannot switch to another VT anymore while I > >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K > >> killed everything and restored full control. Dmesg shows lots of : > > > >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. > > This. Yes. SAK is noisy and tells you everything it kills. OK, that's what I suspected, but I did not know if the fact that it talked about the session was systematic or related to any particular state when it killed the task. > >> I wonder if part of the problem would be too many processes bound to > >> the same tty :-/ > > > > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), > > maybe this description rings a bell with them? > > Is there any swapping going on? Not at all. > I'm inclined to suspect that it is a problem that has more to do with the > number of processes and has nothing to do with ttys. It is clearly possible. What I found strange is that I could still fork processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. It first happened under X with frozen xterms but a perfectly usable WM, then I reproduced it on pure console to rule out any potential X problem. > Anyway you can easily rule out ttys by having your startup program > detach from a controlling tty before you start everything. > > I'm more inclined to guess something is reading /proc a lot, or doing > something that holds the tasklist lock, a lot or something like that, > if the problem isn't that you are being kicked into swap. Oh I'm sorry you were invited into the discussion without a first description of the context. I was giving a try to Ingo's new scheduler, and trying to reach corner cases with lots of processes competing for CPU. I simply used a "for" loop in bash to fork 1000 processes, and this problem happened between 700-800 children. The program only uses a busy loop and a pause. I then changed my program to close 0,1,2 and perform the fork itself, and the problem vanished. So there are two differences here : - bash not forking anymore - far less FDs on /dev/tty1 At first, I had around 2200 fds on /dev/tty1, reason why I suspected something in this area. I agree that this is not normal usage at all, I'm just trying to attack Ingo's scheduler to ensure it is more robust than the stock one. But sometimes brute force methods can make other sleeping problems pop up. Thinking about it, I don't know if there are calls to schedule() while switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2" simply blocked. It would have been possible that a schedule() call somewhere got starved due to the load, I don't know. Thanks, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:29 ` Willy Tarreau @ 2007-04-14 17:44 ` Eric W. Biederman 2007-04-14 17:54 ` Ingo Molnar 2007-04-14 17:50 ` Linus Torvalds 1 sibling, 1 reply; 304+ messages in thread From: Eric W. Biederman @ 2007-04-14 17:44 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Willy Tarreau <w@1wt.eu> writes: > Hi Eric, > > [...] >> >> the ramp up slows down after 700-800 processes, but something very >> >> strange happens. If I'm under X, I can switch the focus to all xterms >> >> (the WM is still alive) but all xterms are frozen. On the console, >> >> after one moment I simply cannot switch to another VT anymore while I >> >> can still start commands locally. But "chvt 2" simply blocks. SysRq-K >> >> killed everything and restored full control. Dmesg shows lots of : >> > >> >> SAK: killed process xxxx (scheddos2): process_session(p)==tty->session. >> >> This. Yes. SAK is noisy and tells you everything it kills. > > OK, that's what I suspected, but I did not know if the fact that it talked > about the session was systematic or related to any particular state when it > killed the task. > >> >> I wonder if part of the problem would be too many processes bound to >> >> the same tty :-/ >> > >> > hm, that's really weird. I've Cc:-ed the tty experts (Erik, Jiri, Alan), >> > maybe this description rings a bell with them? >> >> Is there any swapping going on? > > Not at all. > >> I'm inclined to suspect that it is a problem that has more to do with the >> number of processes and has nothing to do with ttys. > > It is clearly possible. What I found strange is that I could still fork > processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. > It first happened under X with frozen xterms but a perfectly usable WM, > then I reproduced it on pure console to rule out any potential X problem. > >> Anyway you can easily rule out ttys by having your startup program >> detach from a controlling tty before you start everything. >> >> I'm more inclined to guess something is reading /proc a lot, or doing >> something that holds the tasklist lock, a lot or something like that, >> if the problem isn't that you are being kicked into swap. > > Oh I'm sorry you were invited into the discussion without a first description > of the context. I was giving a try to Ingo's new scheduler, and trying to > reach corner cases with lots of processes competing for CPU. > > I simply used a "for" loop in bash to fork 1000 processes, and this problem > happened between 700-800 children. The program only uses a busy loop and a > pause. I then changed my program to close 0,1,2 and perform the fork itself, > and the problem vanished. So there are two differences here : > > - bash not forking anymore > - far less FDs on /dev/tty1 Yes. But with /dev/tty1 being the controlling terminal in both cases, as you haven't dropped your session, or disassociated your tty. The bash problem may have something to setpgid or scheduling effects. Hmm. I just looked and setpgid does grab the tasklist lock for writing so we may possibly have some contention there. > At first, I had around 2200 fds on /dev/tty1, reason why I suspected something > in this area. > > I agree that this is not normal usage at all, I'm just trying to attack > Ingo's scheduler to ensure it is more robust than the stock one. 
> But sometimes brute force methods can make other sleeping problems pop up.

Yep.  If we can narrow it down to one, that would be interesting.  Of
course it also means that when we start finding other possible sleeping
problems, people are working in areas of code they don't normally touch,
so we must investigate.

> Thinking about it, I don't know if there are calls to schedule() while
> switching from tty1 to tty2. Alt-F2 had no effect anymore, and "chvt 2"
> simply blocked. It would have been possible that a schedule() call
> somewhere got starved due to the load, I don't know.

It looks like there is a call to schedule_work.

There are two pieces of the path.  If you are switching in and out of a
tty controlled by something like X, user space has to grant permission
before the operation happens.  Where there isn't a gatekeeper I know it
is cheaper, but I don't know by how much; I suspect there is still a
schedule happening in there.

Eric

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:44 ` Eric W. Biederman @ 2007-04-14 17:54 ` Ingo Molnar 2007-04-14 18:18 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 17:54 UTC (permalink / raw) To: Eric W. Biederman Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * Eric W. Biederman <ebiederm@xmission.com> wrote: > > Thinking about it, I don't know if there are calls to schedule() > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > > "chvt 2" simply blocked. It would have been possible that a > > schedule() call somewhere got starved due to the load, I don't know. > > It looks like there is a call to schedule_work. so this goes over keventd, right? > There are two pieces of the path. If you are switching in and out of a > tty controlled by something like X. User space has to grant > permission before the operation happens. Where there isn't a gate > keeper I know it is cheaper but I don't know by how much, I suspect > there is still a schedule happening in there. Could keventd perhaps be starved? Willy, to exclude this possibility, could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then the command to set it to SCHED_FIFO:50 would be: chrt -f -p 50 5 but ... events/0 is reniced to -5 by default, so it should definitely not be starved. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
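On boxes without chrt, the same change can be made directly through the
sched_setscheduler() syscall; a minimal sketch, equivalent to
'chrt -f -p <prio> <pid>' (run as root):

  /* fifoprio.c -- put an existing task into SCHED_FIFO at a given
   * priority, equivalent to 'chrt -f -p <prio> <pid>'.  Sketch only.
   * Usage (as root): ./fifoprio <pid> <prio> */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sched.h>
  #include <sys/types.h>

  int main(int argc, char **argv)
  {
      struct sched_param sp;
      pid_t pid;

      if (argc != 3) {
          fprintf(stderr, "usage: %s <pid> <prio>\n", argv[0]);
          return 1;
      }
      pid = atoi(argv[1]);
      sp.sched_priority = atoi(argv[2]);

      if (sched_setscheduler(pid, SCHED_FIFO, &sp) < 0) {
          perror("sched_setscheduler");
          return 1;
      }
      return 0;
  }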
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:54 ` Ingo Molnar @ 2007-04-14 18:18 ` Willy Tarreau 2007-04-14 18:40 ` Eric W. Biederman 2007-04-15 17:55 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 18:18 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: > > * Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > Thinking about it, I don't know if there are calls to schedule() > > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > > > "chvt 2" simply blocked. It would have been possible that a > > > schedule() call somewhere got starved due to the load, I don't know. > > > > It looks like there is a call to schedule_work. > > so this goes over keventd, right? > > > There are two pieces of the path. If you are switching in and out of a > > tty controlled by something like X. User space has to grant > > permission before the operation happens. Where there isn't a gate > > keeper I know it is cheaper but I don't know by how much, I suspect > > there is still a schedule happening in there. > > Could keventd perhaps be starved? Willy, to exclude this possibility, > could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then > the command to set it to SCHED_FIFO:50 would be: > > chrt -f -p 50 5 > > but ... events/0 is reniced to -5 by default, so it should definitely > not be starved. Well, since I merged the fair-fork patch, I cannot reproduce (in fact, bash forks 1000 processes, then progressively execs scheddos, but it takes some time). So I'm rebuilding right now. But I think that Linus has an interesting clue about GPM and notification before switching the terminal. I think it was enabled in console mode. I don't know how that translates to frozen xterms, but let's attack the problems one at a time. Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 18:18 ` Willy Tarreau @ 2007-04-14 18:40 ` Eric W. Biederman 2007-04-14 19:01 ` Willy Tarreau 2007-04-15 17:55 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Eric W. Biederman @ 2007-04-14 18:40 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Willy Tarreau <w@1wt.eu> writes: > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: >> >> * Eric W. Biederman <ebiederm@xmission.com> wrote: >> >> > > Thinking about it, I don't know if there are calls to schedule() >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and >> > > "chvt 2" simply blocked. It would have been possible that a >> > > schedule() call somewhere got starved due to the load, I don't know. >> > >> > It looks like there is a call to schedule_work. >> >> so this goes over keventd, right? >> >> > There are two pieces of the path. If you are switching in and out of a >> > tty controlled by something like X. User space has to grant >> > permission before the operation happens. Where there isn't a gate >> > keeper I know it is cheaper but I don't know by how much, I suspect >> > there is still a schedule happening in there. >> >> Could keventd perhaps be starved? Willy, to exclude this possibility, >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then >> the command to set it to SCHED_FIFO:50 would be: >> >> chrt -f -p 50 5 >> >> but ... events/0 is reniced to -5 by default, so it should definitely >> not be starved. > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > bash forks 1000 processes, then progressively execs scheddos, but it > takes some time). So I'm rebuilding right now. But I think that Linus > has an interesting clue about GPM and notification before switching > the terminal. I think it was enabled in console mode. I don't know > how that translates to frozen xterms, but let's attack the problems > one at a time. I think it is a good clue. However the intention of the mechanism is that only processes that change the video mode on a VT are supposed to use it. So I really don't think gpm is the culprit. However it easily could be something else that has similar characteristics. I just realized we do have proof that schedule_work is actually working because SAK works, and we can't sanely do SAK from interrupt context so we call schedule work. Eric ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 18:40 ` Eric W. Biederman @ 2007-04-14 19:01 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 19:01 UTC (permalink / raw) To: Eric W. Biederman Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, Apr 14, 2007 at 12:40:15PM -0600, Eric W. Biederman wrote: > Willy Tarreau <w@1wt.eu> writes: > > > On Sat, Apr 14, 2007 at 07:54:33PM +0200, Ingo Molnar wrote: > >> > >> * Eric W. Biederman <ebiederm@xmission.com> wrote: > >> > >> > > Thinking about it, I don't know if there are calls to schedule() > >> > > while switching from tty1 to tty2. Alt-F2 had no effect anymore, and > >> > > "chvt 2" simply blocked. It would have been possible that a > >> > > schedule() call somewhere got starved due to the load, I don't know. > >> > > >> > It looks like there is a call to schedule_work. > >> > >> so this goes over keventd, right? > >> > >> > There are two pieces of the path. If you are switching in and out of a > >> > tty controlled by something like X. User space has to grant > >> > permission before the operation happens. Where there isn't a gate > >> > keeper I know it is cheaper but I don't know by how much, I suspect > >> > there is still a schedule happening in there. > >> > >> Could keventd perhaps be starved? Willy, to exclude this possibility, > >> could you perhaps chrt keventd to RT priority? If events/0 is PID 5 then > >> the command to set it to SCHED_FIFO:50 would be: > >> > >> chrt -f -p 50 5 > >> > >> but ... events/0 is reniced to -5 by default, so it should definitely > >> not be starved. > > > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > > bash forks 1000 processes, then progressively execs scheddos, but it > > takes some time). So I'm rebuilding right now. But I think that Linus > > has an interesting clue about GPM and notification before switching > > the terminal. I think it was enabled in console mode. I don't know > > how that translates to frozen xterms, but let's attack the problems > > one at a time. > > I think it is a good clue. However the intention of the mechanism is > that only processes that change the video mode on a VT are supposed to > use it. So I really don't think gpm is the culprit. However it easily could > be something else that has similar characteristics. > > I just realized we do have proof that schedule_work is actually working > because SAK works, and we can't sanely do SAK from interrupt context > so we call schedule work. Eric, I can say that Linus, Ingo and you all got on the right track. I could reproduce, I got a hung tty around 1400 running processes. Fortunately, it was the one with the root shell which was reniced to -19. I could strace chvt 2 : 20:44:23.761117 open("/dev/tty", O_RDONLY) = 3 <0.004000> 20:44:23.765117 ioctl(3, KDGKBTYPE, 0xbfa305a3) = 0 <0.024002> 20:44:23.789119 ioctl(3, VIDIOC_G_COMP or VT_ACTIVATE, 0x3) = 0 <0.000000> 20:44:23.789119 ioctl(3, VIDIOC_S_COMP or VT_WAITACTIVE <unfinished ...> Then I applied Ingo's suggestion about changing keventd prio : root@pcw:~# ps auxw|grep event root 8 0.0 0.0 0 0 ? SW< 20:31 0:00 [events/0] root 9 0.0 0.0 0 0 ? RW< 20:31 0:00 [events/1] root@pcw:~# rtprio -s 1 -p 50 8 9 (I don't have chrt but it does the same) My VT immediately switched as soon as I hit Enter. Everything's working fine again now. 
So the good news is that it's not a bug in the tty code, nor a deadlock.

Now, maybe keventd should get a higher prio? It seems worrying to me that
it can be starved when it is so critical to the system.

Also, that may explain why I couldn't reproduce with the fork patch. Since
all new processes got no runtime at first, their impact on existing ones
must have been lower. But I think that if I had waited longer, I would
have hit the problem again (though I did not see it even under a load
of 7800).

Regards,
Willy

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-14 18:18 ` Willy Tarreau
2007-04-14 18:40 ` Eric W. Biederman
@ 2007-04-15 17:55 ` Ingo Molnar
2007-04-15 18:06 ` Willy Tarreau
1 sibling, 1 reply; 304+ messages in thread
From: Ingo Molnar @ 2007-04-15 17:55 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox

* Willy Tarreau <w@1wt.eu> wrote:

> Well, since I merged the fair-fork patch, I cannot reproduce (in fact,
> bash forks 1000 processes, then progressively execs scheddos, but it
> takes some time). So I'm rebuilding right now. But I think that Linus
> has an interesting clue about GPM and notification before switching
> the terminal. I think it was enabled in console mode. I don't know how
> that translates to frozen xterms, but let's attack the problems one at
> a time.

to debug this, could you try to apply this add-on as well:

   http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch

with this patch applied you should have a /proc/sched_debug file that
prints all runnable tasks and other interesting info from the runqueue.

[ i've refreshed all the patches on the CFS webpage, so if this doesnt
  apply cleanly to your current tree then you'll probably have to
  refresh one of the patches. ]

The output should look like this:

 Sched Debug Version: v0.01
 now at 226761724575 nsecs

 cpu: 0
   .nr_running            : 3
   .raw_weighted_load     : 384
   .nr_switches           : 13666
   .nr_uninterruptible    : 0
   .next_balance          : 4294947416
   .curr->pid             : 2179
   .rq_clock              : 241337421233
   .fair_clock            : 7503791206
   .wait_runtime          : 2269918379

 runnable tasks:
            task |  PID |    tree-key |   -delta | waiting | switches
 -----------------------------------------------------------------
 +           cat   2179   7501930066   -1861140   1861140          2
      loop_silent   2149   7503010354    -780852         0        911
      loop_silent   2148   7503510048    -281158    280753        918

now for your workload the list should be considerably larger. If there's
starvation going on then the 'switches' field (number of context
switches) of one of the tasks would never increase while you have this
'cannot switch consoles' problem.

maybe you'll have to unapply the fair-fork patch to make it trigger
again. (fair-fork does not fix anything, so it probably just hides a
real bug.)

(i'm meanwhile busy running your scheddos utilities to reproduce it
locally as well :)

	Ingo

^ permalink raw reply	[flat|nested] 304+ messages in thread
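A small helper can make the check Ingo describes mechanical: snapshot
/proc/sched_debug twice and print the lines matching a task name, so it is
easy to see whether that task's 'switches' column ever moves. The sketch below
only does a substring match, since the exact file layout depends on the
sched-fair-print patch version in use:

  /* watchdbg.c -- print /proc/sched_debug lines matching a task name,
   * twice, a few seconds apart; if the trailing 'switches' count of a
   * task does not move, it is not getting scheduled.  Sketch only.
   * Usage: ./watchdbg <task-name> [seconds] */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  static void dump_matching(const char *name)
  {
      char line[512];
      FILE *f = fopen("/proc/sched_debug", "r");

      if (!f) {
          perror("/proc/sched_debug");
          exit(1);
      }
      while (fgets(line, sizeof(line), f)) {
          if (strstr(line, name))
              fputs(line, stdout);
      }
      fclose(f);
  }

  int main(int argc, char **argv)
  {
      int delay = argc > 2 ? atoi(argv[2]) : 5;

      if (argc < 2) {
          fprintf(stderr, "usage: %s <task-name> [seconds]\n", argv[0]);
          return 1;
      }
      printf("--- first snapshot ---\n");
      dump_matching(argv[1]);
      sleep(delay);
      printf("--- %d seconds later ---\n", delay);
      dump_matching(argv[1]);
      return 0;
  }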
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:55 ` Ingo Molnar @ 2007-04-15 18:06 ` Willy Tarreau 2007-04-15 19:20 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-15 18:06 UTC (permalink / raw) To: Ingo Molnar Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox Hi Ingo, On Sun, Apr 15, 2007 at 07:55:55PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Well, since I merged the fair-fork patch, I cannot reproduce (in fact, > > bash forks 1000 processes, then progressively execs scheddos, but it > > takes some time). So I'm rebuilding right now. But I think that Linus > > has an interesting clue about GPM and notification before switching > > the terminal. I think it was enabled in console mode. I don't know how > > that translates to frozen xterms, but let's attack the problems one at > > a time. > > to debug this, could you try to apply this add-on as well: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch > > with this patch applied you should have a /proc/sched_debug file that > prints all runnable tasks and other interesting info from the runqueue. I don't know if you have seen my mail from yesterday evening (here). I found that changing keventd prio fixed the problem. You may be interested in the description. I sent it at 21:01 (+200). > [ i've refreshed all the patches on the CFS webpage, so if this doesnt > apply cleanly to your current tree then you'll probably have to > refresh one of the patches.] Fine, I'll have a look. I already had to rediff the sched-fair-fork patch last time. > The output should look like this: > > Sched Debug Version: v0.01 > now at 226761724575 nsecs > > cpu: 0 > .nr_running : 3 > .raw_weighted_load : 384 > .nr_switches : 13666 > .nr_uninterruptible : 0 > .next_balance : 4294947416 > .curr->pid : 2179 > .rq_clock : 241337421233 > .fair_clock : 7503791206 > .wait_runtime : 2269918379 > > runnable tasks: > task | PID | tree-key | -delta | waiting | switches > ----------------------------------------------------------------- > + cat 2179 7501930066 -1861140 1861140 2 > loop_silent 2149 7503010354 -780852 0 911 > loop_silent 2148 7503510048 -281158 280753 918 Nice. > now for your workload the list should be considerably larger. If there's > starvation going on then the 'switches' field (number of context > switches) of one of the tasks would never increase while you have this > 'cannot switch consoles' problem. > > maybe you'll have to unapply the fair-fork patch to make it trigger > again. (fair-fork does not fix anything, so it probably just hides a > real bug.) > > (i'm meanwhile busy running your scheddos utilities to reproduce it > locally as well :) I discovered I had the frame-buffer enabled (I did not notice it first because I do not have the logo and the resolution is the same as text). It's matroxfb with a G400, if that can help. It may be possible that it needs some CPU that it cannot get to clear the display before switching, I don't know. However I won't try this right now, I'm deep in userland at the moment. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 18:06 ` Willy Tarreau
@ 2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
2007-04-15 19:37 ` Ingo Molnar
0 siblings, 2 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:20 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox

* Willy Tarreau <w@1wt.eu> wrote:

> > to debug this, could you try to apply this add-on as well:
> >
> >    http://redhat.com/~mingo/cfs-scheduler/sched-fair-print.patch
> >
> > with this patch applied you should have a /proc/sched_debug file
> > that prints all runnable tasks and other interesting info from the
> > runqueue.
>
> I don't know if you have seen my mail from yesterday evening (here). I
> found that changing keventd prio fixed the problem. You may be
> interested in the description. I sent it at 21:01 (+200).

ah, indeed i missed that mail - the response to the patches was quite
overwhelming (and i naively thought people dont do Linux hacking over
the weekends anymore ;).

so Linus was right: this was caused by scheduler starvation. I can see
one immediate problem already: the 'nice offset' is not divided by
nr_running as it should. The patch below should fix this but i have
yet to test it accurately, this change might as well render nice
levels unacceptably ineffective under high loads.

	Ingo

--------->
---
 kernel/sched_fair.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;
 
-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / rq->nr_running;
 
 	p->fair_key = key;

^ permalink raw reply	[flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:20 ` Ingo Molnar @ 2007-04-15 19:35 ` William Lee Irwin III 2007-04-15 19:57 ` Ingo Molnar 2007-04-15 19:37 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 19:35 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote: > so Linus was right: this was caused by scheduler starvation. I can see > one immediate problem already: the 'nice offset' is not divided by > nr_running as it should. The patch below should fix this but i have yet > to test it accurately, this change might as well render nice levels > unacceptably ineffective under high loads. I've been suggesting testing CPU bandwidth allocation as influenced by nice numbers for a while now for a reason. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:35 ` William Lee Irwin III @ 2007-04-15 19:57 ` Ingo Molnar 2007-04-15 23:54 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 19:57 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: > On Sun, Apr 15, 2007 at 09:20:46PM +0200, Ingo Molnar wrote: > > so Linus was right: this was caused by scheduler starvation. I can > > see one immediate problem already: the 'nice offset' is not divided > > by nr_running as it should. The patch below should fix this but i > > have yet to test it accurately, this change might as well render > > nice levels unacceptably ineffective under high loads. > > I've been suggesting testing CPU bandwidth allocation as influenced by > nice numbers for a while now for a reason. Oh I was very much testing "CPU bandwidth allocation as influenced by nice numbers" - it's one of the basic things i do when modifying the scheduler. An automated tool, while nice (all automation is nice) wouldnt necessarily show such bugs though, because here too it needed thousands of running tasks to trigger in practice. Any volunteers? ;) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:57 ` Ingo Molnar @ 2007-04-15 23:54 ` William Lee Irwin III 2007-04-16 11:24 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 23:54 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: >> I've been suggesting testing CPU bandwidth allocation as influenced by >> nice numbers for a while now for a reason. On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote: > Oh I was very much testing "CPU bandwidth allocation as influenced by > nice numbers" - it's one of the basic things i do when modifying the > scheduler. An automated tool, while nice (all automation is nice) > wouldnt necessarily show such bugs though, because here too it needed > thousands of running tasks to trigger in practice. Any volunteers? ;) Worse comes to worse I might actually get around to doing it myself. Any more detailed descriptions of the test for a rainy day? -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:54 ` William Lee Irwin III @ 2007-04-16 11:24 ` Ingo Molnar 2007-04-16 13:46 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 11:24 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: > On Sun, Apr 15, 2007 at 09:57:48PM +0200, Ingo Molnar wrote: > > Oh I was very much testing "CPU bandwidth allocation as influenced by > > nice numbers" - it's one of the basic things i do when modifying the > > scheduler. An automated tool, while nice (all automation is nice) > > wouldnt necessarily show such bugs though, because here too it needed > > thousands of running tasks to trigger in practice. Any volunteers? ;) > > Worse comes to worse I might actually get around to doing it myself. > Any more detailed descriptions of the test for a rainy day? the main complication here is that the handling of nice levels is still typically a 2nd or 3rd degree design factor when writing schedulers. The reason isnt carelessness, the reason is simply that users typically only care about a single nice level: the one that all tasks run under by default. Also, often there's just one or two good ways to attack the problem within a given scheduler approach and the quality of nice levels often suffers under other, more important design factors like performance. This means that for example for the vanilla scheduler the distribution of CPU power depends on HZ, on the number of tasks and on the scheduling pattern. The distribution of CPU power amongst nice levels is basically a function of _everything_. That makes any automated test pretty challenging. Both with SD and with CFS there's a good chance to actually formalize the meaning of nice levels, but i'd not go as far as to mandate any particular behavior by rigidly saying "pass this automated tool, else ...", other than "make nice levels resonable". All the other more formal CPU resource limitation techniques are then a matter of CKRM-alike patches, which offer much more finegrained mechanisms than pure nice levels anyway. so to answer your question: it's pretty much freely defined. Make up your mind about it and figure out the ways how people use nice levels and think about which aspects of that experience are worth testing for intelligently. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:24 ` Ingo Molnar @ 2007-04-16 13:46 ` William Lee Irwin III 0 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 13:46 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox * William Lee Irwin III <wli@holomorphy.com> wrote: >> Worse comes to worse I might actually get around to doing it myself. >> Any more detailed descriptions of the test for a rainy day? On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > the main complication here is that the handling of nice levels is still > typically a 2nd or 3rd degree design factor when writing schedulers. The > reason isnt carelessness, the reason is simply that users typically only > care about a single nice level: the one that all tasks run under by > default. I'm a bit unconvinced here. Support for prioritization is a major scheduler feature IMHO. On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > Also, often there's just one or two good ways to attack the problem > within a given scheduler approach and the quality of nice levels often > suffers under other, more important design factors like performance. > This means that for example for the vanilla scheduler the distribution > of CPU power depends on HZ, on the number of tasks and on the scheduling > pattern. The distribution of CPU power amongst nice levels is basically > a function of _everything_. That makes any automated test pretty > challenging. Both with SD and with CFS there's a good chance to actually > formalize the meaning of nice levels, but i'd not go as far as to > mandate any particular behavior by rigidly saying "pass this automated > tool, else ...", other than "make nice levels resonable". All the other > more formal CPU resource limitation techniques are then a matter of > CKRM-alike patches, which offer much more finegrained mechanisms than > pure nice levels anyway. Some of the issues with respect to the number of tasks and scheduling patterns can be made part of the test; one can furthermore insist that the system be quiescent in a variety of ways. I'm not convinced that formalization of nice levels is a bad idea. They're the standard UNIX prioritization facility, and it should work with some definite value of "work." Even supposing one doesn't care to bolt down the semantics of nice levels, there should at least be some awareness of what those semantics are and when and how they're changing. So in that respect a test for CPU bandwidth distribution according to nice level remains valuable even supposing that the semantics aren't required to be rigidly fixed. As far as CKRM goes, I'm wild about it. I wish things would get in shape to be merged (if they're not already) and merged ASAP on that front. I think with so much agreement in concept we can work with changing out implementations as-needed with it sitting in mainline once the the user API/ABI is decided upon, and I think it already is. I'm not entirely convinced CKRM answers this, though. If the scheduler can't support nice levels, how is it supposed to support prioritization or CPU bandwidth allocation according to CKRM configurations? I'm relatively certain schedulers must be able to support prioritization with deterministic CPU bandwidth as essential functionality. 
This is, of course, not to say my certainty about things sets the standards for what testcases are considered meaningful and valid. On Mon, Apr 16, 2007 at 01:24:40PM +0200, Ingo Molnar wrote: > so to answer your question: it's pretty much freely defined. Make up > your mind about it and figure out the ways how people use nice levels > and think about which aspects of that experience are worth testing for > intelligently. Looking for usage cases is a good idea; I'll do that before coding any testcase for nice semantics. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
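One very rough shape such a nice-level bandwidth test could take (an
illustration only, not a test either of them actually wrote): fork one
busy-looping child per nice level of interest, let them compete for a fixed
wall-clock interval, then report the CPU time each one accumulated. The nice
levels and duration below are arbitrary, and the ratios are only meaningful if
the spinners share a single CPU (e.g. pinned with taskset).

  /* nicebw.c -- crude CPU-bandwidth-vs-nice measurement: one spinner
   * per nice level, all competing for the same interval, then report
   * how much CPU time each one got.  Sketch only. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <signal.h>
  #include <sys/types.h>
  #include <sys/time.h>
  #include <sys/resource.h>
  #include <sys/wait.h>

  static const int nice_levels[] = { 0, 5, 10, 19 };
  #define NTASKS (sizeof(nice_levels) / sizeof(nice_levels[0]))
  #define DURATION 30     /* seconds of competition */

  int main(void)
  {
      pid_t pids[NTASKS];
      unsigned int i;

      for (i = 0; i < NTASKS; i++) {
          pids[i] = fork();
          if (pids[i] < 0) {
              perror("fork");
              exit(1);
          }
          if (pids[i] == 0) {
              nice(nice_levels[i]);
              for (;;)
                  ;       /* burn CPU at this nice level */
          }
      }

      sleep(DURATION);

      for (i = 0; i < NTASKS; i++)
          kill(pids[i], SIGKILL);

      for (i = 0; i < NTASKS; i++) {
          struct rusage ru;
          double cpu;

          wait4(pids[i], NULL, 0, &ru);
          cpu = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
              + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
          printf("nice %3d: %6.2fs CPU (%5.1f%% of wall time)\n",
                 nice_levels[i], cpu, 100.0 * cpu / DURATION);
      }
      return 0;
  }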
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
@ 2007-04-15 19:37 ` Ingo Molnar
1 sibling, 0 replies; 304+ messages in thread
From: Ingo Molnar @ 2007-04-15 19:37 UTC (permalink / raw)
To: Willy Tarreau
Cc: Eric W. Biederman, Nick Piggin, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, Jiri Slaby, Alan Cox

* Ingo Molnar <mingo@elte.hu> wrote:

> so Linus was right: this was caused by scheduler starvation. I can see
> one immediate problem already: the 'nice offset' is not divided by
> nr_running as it should. The patch below should fix this but i have
> yet to test it accurately, this change might as well render nice
> levels unacceptably ineffective under high loads.

erm, rather the updated patch below if you want to use this on a 32-bit
system. But ... i think you should wait until i have all this re-tested.

	Ingo

---
 include/linux/sched.h |    2 +-
 kernel/sched_fair.c   |    4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -839,7 +839,7 @@ struct task_struct {
 
 	s64 wait_runtime;
 	u64 exec_runtime, fair_key;
-	s64 nice_offset, hog_limit;
+	s32 nice_offset, hog_limit;
 
 	unsigned long policy;
 	cpumask_t cpus_allowed;
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -31,7 +31,9 @@ static void __enqueue_task_fair(struct r
 	int leftmost = 1;
 	long long key;
 
-	key = rq->fair_clock - p->wait_runtime + p->nice_offset;
+	key = rq->fair_clock - p->wait_runtime;
+	if (unlikely(p->nice_offset))
+		key += p->nice_offset / (rq->nr_running + 1);
 
 	p->fair_key = key;

^ permalink raw reply	[flat|nested] 304+ messages in thread
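To see the scaling effect Ingo is worried about, the division used in the two
patches above can be replayed in userspace: with any fixed nice_offset, the
per-task key offset shrinks quickly as nr_running grows. The offset value
below is made up purely for illustration and is not the value CFS uses.

  /* offset-scale.c -- show how a fixed nice_offset, divided by the
   * number of runnable tasks as in the patches above, shrinks as the
   * load grows.  The base offset is a hypothetical number. */
  #include <stdio.h>

  int main(void)
  {
      long long nice_offset = 10000000;   /* hypothetical, in ns */
      int nr_running;

      for (nr_running = 1; nr_running <= 1024; nr_running *= 2)
          printf("nr_running = %4d  ->  key offset = %lld\n",
                 nr_running, nice_offset / (nr_running + 1));
      return 0;
  }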
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 17:29 ` Willy Tarreau 2007-04-14 17:44 ` Eric W. Biederman @ 2007-04-14 17:50 ` Linus Torvalds 1 sibling, 0 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-14 17:50 UTC (permalink / raw) To: Willy Tarreau Cc: Eric W. Biederman, Ingo Molnar, Nick Piggin, linux-kernel, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Jiri Slaby, Alan Cox On Sat, 14 Apr 2007, Willy Tarreau wrote: > > It is clearly possible. What I found strange is that I could still fork > processes (eg: ls, dmesg|tail), ... but not switch to another VT anymore. Considering the patches in question, it's almost definitely just a CPU scheduling problem with starvation. The VT switching is obviously done by the kernel, but the kernel will signal and wait for the "controlling process" for the VT. The most obvious case of that is X, of course, but even in text mode I think gpm will have taken control of the VT's it runs on (all of them), which means that when you initiate a VT switch, the kernel will actually signal the controlling process (gpm), and wait for it to acknowledge the switch. If gpm doesn't get a timeslice for some reason (and it sounds like there may be some serious unfairness after "fork()"), your behaviour is explainable. (NOTE! I've never actually looked at gpm sources or what it really does, so maybe I'm wrong, and it doesn't try to do the controlling VT thing, and something else is going on, but quite frankly, it sounds like the obvious candidate for this bug. Explaining it with some non-scheduler-related thing sounds unlikely, considering the patch in question). Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau @ 2007-04-15 7:54 ` Mike Galbraith 2007-04-15 8:58 ` Ingo Molnar 2007-04-19 9:01 ` Ingo Molnar 2 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 7:54 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote: > Well, I'll stop heating the room for now as I get out of ideas about how > to defeat it. I'm convinced. I'm impatient to read about Mike's feedback > with his workload which behaves strangely on RSDL. If it works OK here, > it will be the proof that heuristics should not be needed. You mean the X + mp3 player + audio visualization test? X+Gforce visualization have problems getting half of my box in the presence of two other heavy cpu using tasks. Behavior is _much_ better than RSDL/SD, but the synchronous nature of X/client seems to be a problem. With this scheduler, renicing X/client does cure it, whereas with SD it did not help one bit. (I know a trivial way to cure that, and this framework makes that possible without dorking up fairness as a general policy.) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 7:54 ` Mike Galbraith @ 2007-04-15 8:58 ` Ingo Molnar 2007-04-15 9:11 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 8:58 UTC (permalink / raw) To: Mike Galbraith Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner * Mike Galbraith <efault@gmx.de> wrote: > On Sat, 2007-04-14 at 15:01 +0200, Willy Tarreau wrote: > > > Well, I'll stop heating the room for now as I get out of ideas about > > how to defeat it. I'm convinced. I'm impatient to read about Mike's > > feedback with his workload which behaves strangely on RSDL. If it > > works OK here, it will be the proof that heuristics should not be > > needed. > > You mean the X + mp3 player + audio visualization test? X+Gforce > visualization have problems getting half of my box in the presence of > two other heavy cpu using tasks. Behavior is _much_ better than > RSDL/SD, but the synchronous nature of X/client seems to be a problem. > > With this scheduler, renicing X/client does cure it, whereas with SD > it did not help one bit. [...] thanks for testing it! I was quite worried about your setup - two tasks using up 50%/50% of CPU time, pitted against a kernel rebuild workload seems to be a hard workload to get right. > [...] (I know a trivial way to cure that, and this framework makes > that possible without dorking up fairness as a general policy.) great! Please send patches so i can add them (once you are happy with the solution) - i think your workload isnt special in any way and could hit other people too. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:58 ` Ingo Molnar @ 2007-04-15 9:11 ` Mike Galbraith 0 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 9:11 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 10:58 +0200, Ingo Molnar wrote: > * Mike Galbraith <efault@gmx.de> wrote: > > [...] (I know a trivial way to cure that, and this framework makes > > that possible without dorking up fairness as a general policy.) > > great! Please send patches so i can add them (once you are happy with > the solution) - i think your workload isnt special in any way and could > hit other people too. I'll give it a shot. (have to read and actually understand your new code first though, then see if it's really viable) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 13:01 ` Willy Tarreau 2007-04-14 13:27 ` Willy Tarreau 2007-04-15 7:54 ` Mike Galbraith @ 2007-04-19 9:01 ` Ingo Molnar 2007-04-19 12:54 ` Willy Tarreau 2007-04-19 17:32 ` Gene Heskett 2 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 9:01 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Good idea. The machine I'm typing from now has 1000 scheddos running > at +19, and 12 gears at nice 0. [...] > From time to time, one of the 12 aligned gears will quickly perform a > full quarter of round while others slowly turn by a few degrees. In > fact, while I don't know this process's CPU usage pattern, there's > something useful in it : it allows me to visually see when process > accelerate/decelerate. [...] cool idea - i have just tried this and it rocks - you can easily see the 'nature' of CPU time distribution just via visual feedback. (Is there any easy way to start up 12 glxgears fully aligned, or does one always have to mouse around to get them into proper position?) btw., i am using another method to quickly judge X's behavior: i started the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth opengl-rendered snow fall on the desktop background. That gives me an idea about how well X is scheduling under various workloads, without having to instrument it explicitly. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 9:01 ` Ingo Molnar @ 2007-04-19 12:54 ` Willy Tarreau 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:32 ` Gene Heskett 1 sibling, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-19 12:54 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Hi Ingo, On Thu, Apr 19, 2007 at 11:01:44AM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > Good idea. The machine I'm typing from now has 1000 scheddos running > > at +19, and 12 gears at nice 0. [...] > > > From time to time, one of the 12 aligned gears will quickly perform a > > full quarter of round while others slowly turn by a few degrees. In > > fact, while I don't know this process's CPU usage pattern, there's > > something useful in it : it allows me to visually see when process > > accelerate/decelerate. [...] > > cool idea - i have just tried this and it rocks - you can easily see the > 'nature' of CPU time distribution just via visual feedback. (Is there > any easy way to start up 12 glxgears fully aligned, or does one always > have to mouse around to get them into proper position?) -- Replying quickly, I'm short in time -- You can certainly script it with -geometry. But it is the wrong application for this matter, because you benchmark X more than glxgears itself. What would be better is something like a line rotating 360 degrees and doing some short stuff between each degree, so that X is not much sollicitated, but the CPU would be spent more on the processes themselves. Benchmarking interactions between X and multiple clients is a completely different test IMHO. Glxgears is between those two, making it inappropriate for scheduler tuning. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
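A minimal sketch of the "-geometry scripting" idea raised in the two mails above, for readers who want to reproduce the aligned-gears setup: it forks one glxgears per cell of a 4x3 grid with a computed -geometry string. The grid size, the window dimensions and the assumption that the local glxgears accepts -geometry are illustrative only, not part of the patch set.

/* Illustrative launcher: start a 4x3 grid of aligned glxgears windows.
 * Window size and grid layout are arbitrary; adjust for your screen. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const int cols = 4, rows = 3;	/* 12 instances total */
	const int w = 300, h = 300;	/* assumed window size */
	int x, y;

	for (y = 0; y < rows; y++) {
		for (x = 0; x < cols; x++) {
			char geom[64];

			snprintf(geom, sizeof(geom), "%dx%d+%d+%d",
				 w, h, x * w, y * h);
			if (fork() == 0) {
				execlp("glxgears", "glxgears",
				       "-geometry", geom, (char *)NULL);
				perror("execlp glxgears");
				_exit(1);
			}
		}
	}
	return 0;	/* the children keep running after the launcher exits */
}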
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 12:54 ` Willy Tarreau @ 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 15:18 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > You can certainly script it with -geometry. But it is the wrong > application for this matter, because you benchmark X more than > glxgears itself. What would be better is something like a line > rotating 360 degrees and doing some short stuff between each degree, > so that X is not much sollicitated, but the CPU would be spent more on > the processes themselves. at least on my setup glxgears goes via DRI/DRM so there's no X scheduling inbetween at all, and the visual appearance of glxgears is a direct function of its scheduling. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar @ 2007-04-19 17:34 ` Gene Heskett 2007-04-19 18:45 ` Willy Tarreau 2007-04-19 23:52 ` Jan Knutar 2 siblings, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-19 17:34 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Ingo Molnar wrote: >* Willy Tarreau <w@1wt.eu> wrote: >> You can certainly script it with -geometry. But it is the wrong >> application for this matter, because you benchmark X more than >> glxgears itself. What would be better is something like a line >> rotating 360 degrees and doing some short stuff between each degree, >> so that X is not much sollicitated, but the CPU would be spent more on >> the processes themselves. > >at least on my setup glxgears goes via DRI/DRM so there's no X >scheduling inbetween at all, and the visual appearance of glxgears is a >direct function of its scheduling. > > Ingo That doesn't appear to be the case here Ingo. Even when I know the rest of the system is lagged, glxgears continues to show very smooth and steady movement. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Yow! I just went below the poverty line! ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett @ 2007-04-19 18:45 ` Willy Tarreau 2007-04-21 10:31 ` Ingo Molnar 2007-04-19 23:52 ` Jan Knutar 2 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-19 18:45 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 05:18:03PM +0200, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > You can certainly script it with -geometry. But it is the wrong > > application for this matter, because you benchmark X more than > > glxgears itself. What would be better is something like a line > > rotating 360 degrees and doing some short stuff between each degree, > > so that X is not much sollicitated, but the CPU would be spent more on > > the processes themselves. > > at least on my setup glxgears goes via DRI/DRM so there's no X > scheduling inbetween at all, and the visual appearance of glxgears is a > direct function of its scheduling. OK, I thought that something looking like a clock would be useful, especially if we could tune the amount of CPU spent per task instead of being limited by graphics drivers. I searched Freshmeat for a clock and found "orbitclock" by Jeremy Weatherford, which was exactly what I was looking for : - small - C only - X11 only - needed less than 5 minutes and no knowledge of X11 for the complete hack ! => Kudos to its author, sincerely ! I hacked it a bit to make it accept two parameters : -R <run_time_in_microsecond> : time spent burning CPU cycles at each round -S <sleep_time_in_microsecond> : time spent getting a rest It now advances what it thinks is a second at each iteration, so that it is easy to compare its progress with other instances (there are seconds, minutes and hours, so it's easy to visually count up to around 43200). The modified code is here : http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz What is interesting to note is that it's easy to make X work a lot (99%) by using 0 as the sleeping time, and it's easy to make the process work a lot by using large values for the running time associated with very low values (or 0) for the sleep time. Ah, and it supports -geometry ;-) It could become a useful scheduler benchmark ! Have fun ! Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
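For readers who do not want to fetch the tarball, the core of the -R/-S idea described above is just a busy-loop/sleep pair per iteration. The fragment below is a rough sketch of that pattern only; it is not the actual orbitclock/ocbench source, and the function names are invented for illustration.

/* Sketch of the -R (run) / -S (sleep) duty cycle; illustrative only. */
#include <sys/time.h>
#include <unistd.h>

static long long now_us(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

static void one_iteration(long run_us, long sleep_us)
{
	long long deadline = now_us() + run_us;

	while (now_us() < deadline)
		;			/* burn CPU for run_us microseconds */
	if (sleep_us > 0)
		usleep(sleep_us);	/* then rest for sleep_us microseconds */
	/* ...advance the displayed "second" and redraw here... */
}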
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 18:45 ` Willy Tarreau @ 2007-04-21 10:31 ` Ingo Molnar 2007-04-21 10:38 ` Ingo Molnar 2007-04-21 10:45 ` Ingo Molnar 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 10:31 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Willy Tarreau <w@1wt.eu> wrote: > I hacked it a bit to make it accept two parameters : > -R <run_time_in_microsecond> : time spent burning CPU cycles at each round > -S <sleep_time_in_microsecond> : time spent getting a rest > > It now advances what it thinks is a second at each iteration, so that > it makes it easy to compare its progress with other instances (there > are seconds, minutes and hours, so it's easy to visually count up to > around 43200). > > The modified code is here : > > http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz > > What is interesting to note is that it's easy to make X work a lot > (99%) by using 0 as the sleeping time, and it's easy to make the > process work a lot by using large values for the running time > associated with very low values (or 0) for the sleep time. > > Ah, and it supports -geometry ;-) > > It could become a useful scheduler benchmark ! i just tried ocbench-0.3, and it is indeed very nice! Would it make sense perhaps to (optionally?) also log some sort of periodic text feedback to stdout, about the quality of scheduling? Maybe even a 'run this many seconds' option plus a summary text output at the end (which would output measured runtime, observed longest/smallest latency and standard deviation of latencies maybe)? That would make it directly usable both as a 'consistency of X app scheduling' visual test and as an easily shareable benchmark with an objective numeric result as well. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
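A rough idea of the numeric summary being asked for here: sleep for a fixed target interval, measure how late each wakeup actually was, and report min/max/average/standard deviation at the end. This is only one possible way to do it and is not part of ocbench; the interval and iteration count are arbitrary.

/* Sketch: measure wakeup latency over a number of iterations and print
 * a summary line. Illustrative only; build with -lm for sqrt(). */
#include <math.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
	const long target_us = 10000;	/* ask for 10ms sleeps */
	const int iterations = 1000;
	double min = 1e9, max = 0.0, sum = 0.0, sumsq = 0.0;
	double late, mean, stddev;
	struct timeval t0, t1;
	int i;

	for (i = 0; i < iterations; i++) {
		gettimeofday(&t0, NULL);
		usleep(target_us);
		gettimeofday(&t1, NULL);

		late = (t1.tv_sec - t0.tv_sec) * 1e6 +
		       (t1.tv_usec - t0.tv_usec) - target_us;
		if (late < min) min = late;
		if (late > max) max = late;
		sum += late;
		sumsq += late * late;
	}
	mean = sum / iterations;
	stddev = sqrt(sumsq / iterations - mean * mean);
	printf("wakeup latency (us): min %.0f max %.0f avg %.1f stddev %.1f\n",
	       min, max, mean, stddev);
	return 0;
}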
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:31 ` Ingo Molnar @ 2007-04-21 10:38 ` Ingo Molnar 2007-04-21 10:45 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 10:38 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > The modified code is here : > > > > http://linux.1wt.eu/sched/orbitclock-0.2bench.tgz > > > > What is interesting to note is that it's easy to make X work a lot > > (99%) by using 0 as the sleeping time, and it's easy to make the > > process work a lot by using large values for the running time > > associated with very low values (or 0) for the sleep time. > > > > Ah, and it supports -geometry ;-) > > > > It could become a useful scheduler benchmark ! > > i just tried ocbench-0.3, and it is indeed very nice! another thing i just noticed: when starting up lots of ocbench tasks (say -x 6 -y 6) then they (naturally) get started up with an already visible offset. It's nice to observe the startup behavior, but after that it would be useful if it were possible to 'resync' all those ocbench tasks so that they start at the same offset. [ Maybe a "killall -SIGUSR1 ocbench" could serve this purpose, without having to synchronize the tasks explicitly? ] Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
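The resync-on-SIGUSR1 idea suggested above needs very little on the ocbench side: the handler only sets a flag and the drawing loop resets its counters when it sees it. The sketch below is not Willy's actual v0.4 implementation; treating HOUR/MIN/SEC as plain counters that can simply be zeroed is an assumption.

/* Sketch of a SIGUSR1 "resync" hook; illustrative only. */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t resync_requested;

static void on_sigusr1(int sig)
{
	(void)sig;
	resync_requested = 1;	/* async-signal-safe: only set a flag */
}

static void install_resync_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigusr1;
	sigaction(SIGUSR1, &sa, NULL);
}

/* In the main loop, before each redraw:
 *
 *	if (resync_requested) {
 *		resync_requested = 0;
 *		HOUR = MIN = SEC = 0;	// restart this instance in phase
 *	}
 *
 * so that "killall -SIGUSR1 ocbench" realigns every window at once.
 */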
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:31 ` Ingo Molnar 2007-04-21 10:38 ` Ingo Molnar @ 2007-04-21 10:45 ` Ingo Molnar 2007-04-21 11:07 ` Willy Tarreau 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 10:45 UTC (permalink / raw) To: Willy Tarreau; +Cc: linux-kernel * Ingo Molnar <mingo@elte.hu> wrote: > > It could become a useful scheduler benchmark ! > > i just tried ocbench-0.3, and it is indeed very nice! another thing i noticed: when using a -y larger then 1, then the window title (at least on Metacity) overlaps and thus the ocbench tasks have different X overhead and get scheduled a bit assymetrically as well. Is there any way to start them up title-less perhaps? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 10:45 ` Ingo Molnar @ 2007-04-21 11:07 ` Willy Tarreau 2007-04-21 11:29 ` Björn Steinbrink 0 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-21 11:07 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Hi Ingo, I'm replying to your 3 mails at once. On Sat, Apr 21, 2007 at 12:45:22PM +0200, Ingo Molnar wrote: > > * Ingo Molnar <mingo@elte.hu> wrote: > > > > It could become a useful scheduler benchmark ! > > > > i just tried ocbench-0.3, and it is indeed very nice! So as you've noticed just one minute after I put it there, I've updated the tool and renamed it ocbench. For others, it's here : http://linux.1wt.eu/sched/ The useful news is proper positioning, automatic forking, and more visible progress with smaller windows, which eat less of X resources. Now about your idea of making it report information on stdout, I don't know if it would be that useful. There are many other command line tools for this purpose. This one's goal is to eat CPU with a visual control of CPU distribution only. Concerning your idea of using a signal to resync every process, I agree with you. Running at 8x8 shows a noticeable offset. I've just uploaded v0.4 which supports your idea of sending USR1. > another thing i noticed: when using a -y larger then 1, then the window > title (at least on Metacity) overlaps and thus the ocbench tasks have > different X overhead and get scheduled a bit assymetrically as well. Is > there any way to start them up title-less perhaps? It has annoyed me a bit too, but I'm no X developer at all, so I don't know at all if it's possible nor how to do this. I know that my window manager even adds title bars to xeyes, so I'm not sure we can do this. Right now, I've added a "-B <border size>" argument so that you can skip the size of your title bar. It's dirty but it's not my main job :-) Thanks for your feedback Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 11:07 ` Willy Tarreau @ 2007-04-21 11:29 ` Björn Steinbrink 2007-04-21 11:51 ` Willy Tarreau 0 siblings, 1 reply; 304+ messages in thread From: Björn Steinbrink @ 2007-04-21 11:29 UTC (permalink / raw) To: Willy Tarreau; +Cc: Ingo Molnar, linux-kernel Hi, On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote: > > another thing i noticed: when using a -y larger then 1, then the window > > title (at least on Metacity) overlaps and thus the ocbench tasks have > > different X overhead and get scheduled a bit assymetrically as well. Is > > there any way to start them up title-less perhaps? > > It has annoyed me a bit too, but I'm no X developer at all, so I don't > know at all if it's possible nor how to do this. I know that my window > manager even adds title bars to xeyes, so I'm not sure we can do this. > > Right now, I've added a "-B <border size>" argument so that you can > skip the size of your title bar. It's dirty but it's not my main job :-) Here's a small patch that makes the windows unmanaged, which also causes ocbench to start up quite a bit faster on my box with larger number of windows, so it probably avoids some window manager overhead, which is a nice side-effect. Björn -- diff -u ocbench-0.4/ocbench.c ocbench-0.4.1/ocbench.c --- ocbench-0.4/ocbench.c 2007-04-21 13:05:55.000000000 +0200 +++ ocbench-0.4.1/ocbench.c 2007-04-21 13:24:01.000000000 +0200 @@ -213,6 +213,7 @@ int main(int argc, char *argv[]) { Window root; XGCValues gc_setup; + XSetWindowAttributes swa; int c, index, proc_x, proc_y, pid; int *pcount[] = {&HOUR, &MIN, &SEC}; char *p, *q; @@ -342,8 +343,11 @@ alloc_color(fg, &orange); alloc_color(fg2, &blue); - win = XCreateSimpleWindow(dpy, root, X, Y, width, height, 0, - black.pixel, black.pixel); + swa.override_redirect = 1; + + win = XCreateWindow(dpy, root, X, Y, width, height, 0, + CopyFromParent, InputOutput, CopyFromParent, + CWOverrideRedirect, &swa); XStoreName(dpy, win, "ocbench"); XSelectInput(dpy, win, ExposureMask | StructureNotifyMask); Only in ocbench-0.4.1/: .README.swp ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-21 11:29 ` Björn Steinbrink @ 2007-04-21 11:51 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-21 11:51 UTC (permalink / raw) To: Björn Steinbrink, Ingo Molnar, linux-kernel Hi Björn, On Sat, Apr 21, 2007 at 01:29:41PM +0200, Björn Steinbrink wrote: > Hi, > > On 2007.04.21 13:07:48 +0200, Willy Tarreau wrote: > > > another thing i noticed: when using a -y larger then 1, then the window > > > title (at least on Metacity) overlaps and thus the ocbench tasks have > > > different X overhead and get scheduled a bit assymetrically as well. Is > > > there any way to start them up title-less perhaps? > > > > It has annoyed me a bit too, but I'm no X developer at all, so I don't > > know at all if it's possible nor how to do this. I know that my window > > manager even adds title bars to xeyes, so I'm not sure we can do this. > > > > Right now, I've added a "-B <border size>" argument so that you can > > skip the size of your title bar. It's dirty but it's not my main job :-) > > Here's a small patch that makes the windows unmanaged, which also causes > ocbench to start up quite a bit faster on my box with larger number of > windows, so it probably avoids some window manager overhead, which is a > nice side-effect. Excellent ! I've just merged it but made it conditional on a "-u" argument so that we can keep the previous behaviour (moving the windows is useful especially when there are few of them). So the new version 0.5 is available there : http://linux.1wt.eu/sched/ I believe it's the last one for today as I'm late on some work. Thanks ! Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 15:18 ` Ingo Molnar 2007-04-19 17:34 ` Gene Heskett 2007-04-19 18:45 ` Willy Tarreau @ 2007-04-19 23:52 ` Jan Knutar 2007-04-20 5:05 ` Willy Tarreau 2 siblings, 1 reply; 304+ messages in thread From: Jan Knutar @ 2007-04-19 23:52 UTC (permalink / raw) To: linux-kernel Cc: Ingo Molnar, Willy Tarreau, Nick Piggin, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007 18:18, Ingo Molnar wrote: > * Willy Tarreau <w@1wt.eu> wrote: > > You can certainly script it with -geometry. But it is the wrong > > application for this matter, because you benchmark X more than > > glxgears itself. What would be better is something like a line > > rotating 360 degrees and doing some short stuff between each > > degree, so that X is not much sollicitated, but the CPU would be > > spent more on the processes themselves. > > at least on my setup glxgears goes via DRI/DRM so there's no X > scheduling inbetween at all, and the visual appearance of glxgears is > a direct function of its scheduling. How much of the subjective interactiveness-feel of the desktop is at the mercy of the X server's scheduling and not the cpu scheduler? I've noticed that video playback is significantly smoother and resistant to other load, when using MPlayer's opengl output, especially if "heavy" programs are running at the same time. Especially firefox and ksysguard seem to have found a way to cause video through Xv to look annoyingly jittery. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 23:52 ` Jan Knutar @ 2007-04-20 5:05 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-20 5:05 UTC (permalink / raw) To: Jan Knutar Cc: linux-kernel, Ingo Molnar, Nick Piggin, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, Apr 20, 2007 at 02:52:38AM +0300, Jan Knutar wrote: > On Thursday 19 April 2007 18:18, Ingo Molnar wrote: > > * Willy Tarreau <w@1wt.eu> wrote: > > > You can certainly script it with -geometry. But it is the wrong > > > application for this matter, because you benchmark X more than > > > glxgears itself. What would be better is something like a line > > > rotating 360 degrees and doing some short stuff between each > > > degree, so that X is not much sollicitated, but the CPU would be > > > spent more on the processes themselves. > > > > at least on my setup glxgears goes via DRI/DRM so there's no X > > scheduling inbetween at all, and the visual appearance of glxgears is > > a direct function of its scheduling. > > How much of the subjective interactiveness-feel of the desktop is at the > mercy of the X server's scheduling and not the cpu scheduler? probably a lot. Hence the reason why I wanted something visually noticeable but using far less X resources than glxgears. The modified orbitclock is perfect IMHO. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 9:01 ` Ingo Molnar 2007-04-19 12:54 ` Willy Tarreau @ 2007-04-19 17:32 ` Gene Heskett 1 sibling, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-19 17:32 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Thursday 19 April 2007, Ingo Molnar wrote: >* Willy Tarreau <w@1wt.eu> wrote: >> Good idea. The machine I'm typing from now has 1000 scheddos running >> at +19, and 12 gears at nice 0. [...] >> >> From time to time, one of the 12 aligned gears will quickly perform a >> full quarter of round while others slowly turn by a few degrees. In >> fact, while I don't know this process's CPU usage pattern, there's >> something useful in it : it allows me to visually see when process >> accelerate/decelerate. [...] > >cool idea - i have just tried this and it rocks - you can easily see the >'nature' of CPU time distribution just via visual feedback. (Is there >any easy way to start up 12 glxgears fully aligned, or does one always >have to mouse around to get them into proper position?) > >btw., i am using another method to quickly judge X's behavior: i started >the 'snowflakes' plugin in Beryl on Fedora 7, which puts a nice smooth >opengl-rendered snow fall on the desktop background. That gives me an >idea about how well X is scheduling under various workloads, without >having to instrument it explicitly. > yes, its a cute idea, till you switch away from that screen to check progress on something else, like to compose this message. =========== 5913 frames in 5.0 seconds = 1182.499 FPS 6238 frames in 5.0 seconds = 1247.556 FPS 11380 frames in 5.0 seconds = 2275.905 FPS 10691 frames in 5.0 seconds = 2138.173 FPS 8707 frames in 5.0 seconds = 1741.305 FPS 10669 frames in 5.0 seconds = 2133.708 FPS 11392 frames in 5.0 seconds = 2278.037 FPS 11379 frames in 5.0 seconds = 2275.711 FPS 11310 frames in 5.0 seconds = 2261.861 FPS 11386 frames in 5.0 seconds = 2277.081 FPS 11292 frames in 5.0 seconds = 2258.353 FPS 11352 frames in 5.0 seconds = 2270.297 FPS 11415 frames in 5.0 seconds = 2282.886 FPS 11406 frames in 5.0 seconds = 2281.037 FPS 11483 frames in 5.0 seconds = 2296.533 FPS 11510 frames in 5.0 seconds = 2301.883 FPS 11123 frames in 5.0 seconds = 2224.266 FPS 8980 frames in 5.0 seconds = 1795.861 FPS ======= The over 2000fps reports were while I was either looking at htop, or starting this message, both on different screens. htop said it was using 95+ % of the cpu even when its display was going to /dev/null. So 'Kewl' doesn't seem to get us apples to apples numbers we can go to the window and bet win-place-show based on them alone. FWIW, running the nvidia-9755 drivers here. So if we are going to use that as a judgement operator, it obviously needs some intelligently applied scaling before they are worth more than a subjective feel is. > Ingo >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The confusion of a staff member is measured by the length of his memos. -- New York Times, Jan. 
20, 1981 ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 10:53 ` Ingo Molnar 2007-04-14 13:01 ` Willy Tarreau @ 2007-04-14 15:17 ` Mark Lord 1 sibling, 0 replies; 304+ messages in thread From: Mark Lord @ 2007-04-14 15:17 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > i kept the 50%/50% rule from the old scheduler, but maybe it's a more > pristine (and smaller/faster) approach to just not give new children any > stats history to begin with. I've implemented an add-on patch that > implements this, you can find it at: > > http://redhat.com/~mingo/cfs-scheduler/sched-fair-fork.patch I've been running my desktop (single-core Pentium-M w/2GB RAM, Kubuntu Dapper) with the new CFS for much of this morning now, with the odd switch back to the stock scheduler for comparison. Here, CFS really works and feels better than the stock scheduler. Even with a "make -j2" kernel rebuild happening (no manual renice, either!) things "just work" about as smoothly as ever. That's something which RSDL never achieved for me, though I have not retested RSDL beyond v0.34 or so. Well done, Ingo! I *want* this as my default scheduler. Things seemed slightly less smooth when I had the CPU hogs and fair-fork extension patches both applied. I'm going to try again now with just the fair-fork added on. Cheers Mark ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:36 ` Willy Tarreau 2007-04-14 10:53 ` Ingo Molnar @ 2007-04-14 19:48 ` William Lee Irwin III 2007-04-14 20:12 ` Willy Tarreau 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-14 19:48 UTC (permalink / raw) To: Willy Tarreau Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote: > Forking becomes very slow above a load of 100 it seems. Sometimes, > the shell takes 2 or 3 seconds to return to prompt after I run > "scheddos &" > Those are very promising results, I nearly observe the same responsiveness > as I had on a solaris 10 with 10k running processes on a bigger machine. > I would be curious what a mysql test result would look like now. Where is scheddos? -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 19:48 ` William Lee Irwin III @ 2007-04-14 20:12 ` Willy Tarreau 0 siblings, 0 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-14 20:12 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sat, Apr 14, 2007 at 12:48:55PM -0700, William Lee Irwin III wrote: > On Sat, Apr 14, 2007 at 10:36:25AM +0200, Willy Tarreau wrote: > > Forking becomes very slow above a load of 100 it seems. Sometimes, > > the shell takes 2 or 3 seconds to return to prompt after I run > > "scheddos &" > > Those are very promising results, I nearly observe the same responsiveness > > as I had on a solaris 10 with 10k running processes on a bigger machine. > > I would be curious what a mysql test result would look like now. > > Where is scheddos? I will send it to you off-list. I've been avoiding publishing it for a long time because the stock scheduler was *very* sensitive to trivial attacks (freezes longer than 30s, impossible to log in). It's very basic, and I have no problem sending it to anyone who requests it, it's just that as long as some distros ship early 2.6 kernels I do not want it to appear on mailing list archives for anyone to grab it and annoy their admins for free. Cheers, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 8:08 ` Willy Tarreau 2007-04-14 8:36 ` Willy Tarreau @ 2007-04-14 10:36 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 10:36 UTC (permalink / raw) To: Willy Tarreau Cc: Nick Piggin, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > > this fix is not complete - because the child runqueue is locked > > here, not the parent's. I've fixed this properly in my tree and have > > uploaded a new sched-modular+cfs.patch. (the effects of the original > > bug are mostly harmless, the rbtree position gets corrected the > > first time the parent reschedules. The fix might improve heavy > > forker handling.) > > It looks like it did not reach your public dir yet. oops, forgot to do the last step - should be fixed now. > BTW, I've given it a try. It seems pretty usable. I have also tried > the usual meaningless "glxgears" test with 12 of them at the same > time, and they rotate very smoothly, there is absolutely no pause in > any of them. But they don't all run at same speed, and top reports > their CPU load varying from 3.4 to 10.8%, with what looks like more > CPU is assigned to the first processes, and less CPU for the last > ones. But this is just a rough observation on a stupid test, I would > not call that one scientific in any way (and X has its share in the > test too). ok, i'll try that too - there should be nothing particularly special about glxgears. there's another tweak you could try: echo 500000 > /proc/sys/kernel/sched_granularity_ns note that this causes preemption to be done as fast as the scheduler can do it. (in practice it will be mainly driven by CONFIG_HZ, so to get the best results a CONFIG_HZ of 1000 is useful.) plus there's an add-on to CFS at: http://redhat.com/~mingo/cfs-scheduler/sched-fair-hog.patch this makes the 'CPU usage history cutoff' configurable and sets it to a default of 100 msecs. This means that CPU hogs (tasks which actively kept other tasks from running) will be remembered, for up to 100 msecs of their 'hogness'. Setting this limit back to 0 gives the 'vanilla' CFS scheduler's behavior: echo 0 > /proc/sys/kernel/sched_max_hog_history_ns (So when trying this you dont have to reboot with this patch applied/unapplied, just set this value.) > I'll perform other tests when I can rebuild with your fixed patch. cool, thanks! Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
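For completeness, the two tunables mentioned in the mail above can also be set from a program rather than with echo. A small hedged sketch follows: the /proc paths come straight from the mails, everything else (helper name, error handling, the chosen values) is illustration only.

/* Sketch: write a CFS tunable from C instead of using 'echo'. */
#include <stdio.h>

static int write_sysctl(const char *path, long long val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%lld\n", val);
	return fclose(f);
}

int main(void)
{
	/* low-latency 'desktop' preemption granularity, as suggested above */
	write_sysctl("/proc/sys/kernel/sched_granularity_ns", 500000LL);
	/* 0 = vanilla CFS behaviour for the hog-history add-on patch */
	write_sysctl("/proc/sys/kernel/sched_max_hog_history_ns", 0LL);
	return 0;
}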
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (7 preceding siblings ...) 2007-04-14 2:04 ` Nick Piggin @ 2007-04-14 15:09 ` S.Çağlar Onur 2007-04-14 16:09 ` Ingo Molnar 2007-04-15 3:27 ` Con Kolivas ` (4 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: S.Çağlar Onur @ 2007-04-14 15:09 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 1018 bytes --] 13 Nis 2007 Cum tarihinde, Ingo Molnar şunları yazmıştı: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: Currently im using Linus's current git + your extra patches + CFS for a while. Kaffeine constantly freezes (and uses %80+ CPU time) [1] if i seek forward/backward while its playing a video with some workload (checking out SVN repositories, compiling something). Stopping other process didn't help kaffeine so it stays freezed stated until i kill it. I'm not sure whether its a xine-lib or kaffeine bug (cause mplayer didn't have that problem) but i can't reproduce this with mainline or mainline + sd-0.39. [1] http://cekirdek.pardus.org.tr/~caglar/psaux -- S.Çağlar Onur <caglar@pardus.org.tr> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 15:09 ` S.Çağlar Onur @ 2007-04-14 16:09 ` Ingo Molnar 2007-04-14 16:59 ` S.Çağlar Onur 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-14 16:09 UTC (permalink / raw) To: S.Çağlar Onur Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * S.Çağlar Onur <caglar@pardus.org.tr> wrote: > > i'm pleased to announce the first release of the "Modular Scheduler > > Core and Completely Fair Scheduler [CFS]" patchset: > > Currently im using Linus's current git + your extra patches + CFS for > a while. Kaffeine constantly freezes (and uses %80+ CPU time) [1] if i > seek forward/backward while its playing a video with some workload > (checking out SVN repositories, compiling something). Stopping other > process didn't help kaffeine so it stays freezed stated until i kill > it. hm, could you try to strace it and/or attach gdb to it and figure out what's wrong? (perhaps involving the Kaffeine developers too?) As long as it's not a kernel level crash i cannot see how the scheduler could directly cause this - other than by accident creating a scheduling pattern that triggers a user-space bug more often than with other schedulers. > [1] http://cekirdek.pardus.org.tr/~caglar/psaux looks quite weird! Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-14 16:09 ` Ingo Molnar @ 2007-04-14 16:59 ` S.Çağlar Onur 0 siblings, 0 replies; 304+ messages in thread From: S.Çağlar Onur @ 2007-04-14 16:59 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 1180 bytes --] 14 Nis 2007 Cts tarihinde, Ingo Molnar şunları yazmıştı: > hm, could you try to strace it and/or attach gdb to it and figure out > what's wrong? (perhaps involving the Kaffeine developers too?) As long > as it's not a kernel level crash i cannot see how the scheduler could > directly cause this - other than by accident creating a scheduling > pattern that triggers a user-space bug more often than with other > schedulers. ... futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0 futex(0x89ac218, FUTEX_WAIT, 2, NULL) = -1 EINTR (Interrupted system call) --- SIGINT (Interrupt) @ 0 (0) --- +++ killed by SIGINT +++ is where freeze occurs. Full log can be found at [1] > > [1] http://cekirdek.pardus.org.tr/~caglar/psaux > > looks quite weird! :) [1] http://cekirdek.pardus.org.tr/~caglar/strace.kaffeine -- S.Çağlar Onur <caglar@pardus.org.tr> http://cekirdek.pardus.org.tr/~caglar/ Linux is like living in a teepee. No Windows, no Gates and an Apache in house! [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (8 preceding siblings ...) 2007-04-14 15:09 ` S.Çağlar Onur @ 2007-04-15 3:27 ` Con Kolivas 2007-04-15 5:16 ` Bill Huey ` (2 more replies) 2007-04-15 12:29 ` Esben Nielsen ` (3 subsequent siblings) 13 siblings, 3 replies; 304+ messages in thread From: Con Kolivas @ 2007-04-15 3:27 UTC (permalink / raw) To: Ingo Molnar, ck list, Peter Williams, Bill Huey Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Saturday 14 April 2007 06:21, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. The casual observer will be completely confused by what on earth has happened here so let me try to demystify things for them. 1. I tried in vain some time ago to push a working extensable pluggable cpu scheduler framework (based on wli's work) for the linux kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he didn't like it) as being absolutely the wrong approach and that we should never do that. Oddly enough the linux-kernel-mailing list was -dead- at the time and the discussion did not make it to the mailing list. Every time I've tried to forward it to the mailing list the spam filter decided to drop it so most people have not even seen this original veto-forever discussion. 2. Since then I've been thinking/working on a cpu scheduler design that takes away all the guesswork out of scheduling and gives very predictable, as fair as possible, cpu distribution and latency while preserving as solid interactivity as possible within those confines. For weeks now, Ingo has said that the interactivity regressions were showstoppers and we should address them, never mind the fact that the so-called regressions were purely "it slows down linearly with load" which to me is perfectly desirable behaviour. While this was not perma-vetoed, I predicted pretty accurately your intent was to veto it based on this. People kept claiming scheduling problems were few and far between but what was really happening is users were terrified of lkml and instead used 1. windows and 2. 2.4 kernels. The problems were there. So where are we now? Here is where your latest patch comes in. As a solution to the many scheduling problems we finally all agree exist, you propose a patch that adds 1. a limited pluggable framework and 2. a fairness based cpu scheduler policy... o_O So I should be happy at last now that the things I was promoting you are also promoting, right? Well I'll fill in the rest of the gaps and let other people decide how I should feel. > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, In the last 4 weeks I've spent time lying in bed drugged to the eyeballs and having trips in and out of hospitals for my condition. I appreciate greatly the sympathy and patience from people in this regard. 
However at one stage I virtually begged for support with my attempts and help with the code. Dmitry Adamushko is the only person who actually helped me with the code in the interim, while others poked sticks at it. Sure the sticks helped at times but the sticks always seemed to have their ends kerosene doused and flaming for reasons I still don't get. No other help was forthcoming. Now that you're agreeing my direction was correct you've done the usual Linux kernel thing - ignore all my previous code and write your own version. Oh well, that I've come to expect; at least you get a copyright notice in the bootup and somewhere in the comments give me credit for proving it's possible. Let's give some other credit here too. William Lee Irwin provided the major architecture behind plugsched at my request and I simply finished the work and got it working. He is also responsible for many IRC discussions I've had about cpu scheduling fairness, designs, programming history and code help. Even though he did not contribute code directly to SD, his comments have been invaluable. So let's look at the code. kernel/sched.c kernel/sched_fair.c kernel/sched_rt.c It turns out this is not a pluggable cpu scheduler framework at all, and I guess you didn't really promote it as such. It's a "modular scheduler core". Which means you moved code from sched.c into sched_fair.c and sched_rt.c. This abstracts out each _scheduling policy's_ functions into struct sched_class and allows each scheduling policy's functions to be in a separate file etc. Ok so what it means is that instead of whole cpu schedulers being able to be plugged into this framework we can plug in only cpu scheduling policies.... hrm... So let's look on -#define SCHED_NORMAL 0 Ok once upon a time we rename SCHED_OTHER which every other unix calls the standard policy 99.9% of applications used into a more meaningful name, SCHED_NORMAL. That's fine since all it did was change the description internally for those reading the code. Let's see what you've done now: +#define SCHED_FAIR 0 You've renamed it again. This is, I don't know what exactly to call it, but an interesting way of making it look like there is now more choice. Well, whatever you call it, everything in linux spawned from init without specifying a policy still gets policy 0. This is SCHED_OTHER still, renamed SCHED_NORMAL and now SCHED_FAIR. You encouraged me to create a sched_sd.c to add onto your design as well. Well, what do I do with that? I need to create another scheduling policy for that code to even be used. A separate scheduling policy requires a userspace change to even benefit from it. Even if I make that sched_sd.c patch, people cannot use SD as their default scheduler unless they hack SCHED_FAIR 0 to read SCHED_SD 0 or similar. The same goes for original staircase cpusched, nicksched, zaphod, spa_ws, ebs and so on. So what you've achieved with your patch is - replaced the current scheduler with another one and moved it into another file. There is no choice, and no pluggability, just code trumping. Do I support this? In this form.... no. It's not that I don't like your new scheduler. Heck it's beautiful like most of your _serious_ code. It even comes with a catchy name that's bound to give people hard-ons (even though many schedulers aim to be completely fair, yours has been named that for maximum selling power). 
The complaint I have is that you are not providing quite what you advertise (on the modular front), or perhaps you're advertising it as such to make it look more appealing; I'm not sure. Since we'll just end up with your code, don't pretend SCHED_NORMAL is anything different, and that this is anything other than your NIH (Not Invented Here) cpu scheduling policy rewrite which will probably end up taking its position in mainline after yet another truckload of regression/performance tests and so on. I haven't seen an awful lot of comparisons with SD yet, just people jumping on your bandwagon which is fine I guess. Maybe a few tiny tests showing less than 5% variation in their fairness from what I can see. Either way, I already feel you've killed off SD... like pretty much everything else I've done lately. At least I no longer have to try and support my code mostly by myself. In the interest of putting aside any ego concerns since this is about linux and not me... Because... You are a hair's breadth away from producing something that I would support, which _does_ do what you say and produces the pluggability we're all begging for with only tiny changes to the code you've already done. Make Kconfig let you choose which sched_*.c gets built into the kernel, and make SCHED_OTHER choose which SCHED_* gets chosen as the default from Kconfig and even choose one of the alternative built in ones with boot parameters - your code has more clout than mine will (ie do exactly what plugsched does). Then we can have 7 schedulers in the linux kernel within a few weeks. Oh no! This is the very thing Linus didn't want in specialisation with the cpu schedulers! Does this mean this idea will be vetoed yet again? In all likelihood, yes. I guess I have lots to put into -ck still... sigh. > Ingo -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
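To make the "modular scheduler core" part of this exchange concrete for readers who have not read the patch: the idea under discussion is that sched.c dispatches through a per-policy ops table instead of hard-coding one policy. The structure below is a heavily simplified illustration of that shape; it is not the actual struct sched_class from Ingo's patch, and the field names and signatures are invented for the sketch.

/* Heavily simplified illustration of a per-policy ops table; the real
 * struct sched_class in the patch differs in names and detail. */
struct task;	/* stand-in for struct task_struct */
struct rq;	/* stand-in for the per-CPU runqueue */

struct sched_policy_ops {
	void		(*enqueue_task)(struct rq *rq, struct task *p);
	void		(*dequeue_task)(struct rq *rq, struct task *p);
	struct task	*(*pick_next_task)(struct rq *rq);
	void		(*task_tick)(struct rq *rq, struct task *p);
};

/* sched_fair.c and sched_rt.c would each provide one such table; the
 * core picks a table from the task's policy and never reaches into the
 * policy's own data structures. */
extern const struct sched_policy_ops fair_policy_ops;
extern const struct sched_policy_ops rt_policy_ops;

static inline const struct sched_policy_ops *ops_of(int policy)
{
	return policy == 0 /* SCHED_FAIR */ ? &fair_policy_ops
					    : &rt_policy_ops;
}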
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` Con Kolivas @ 2007-04-15 5:16 ` Bill Huey 2007-04-15 8:44 ` Ingo Molnar 2007-04-15 16:11 ` Bernd Eckenfels 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 15:05 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Bill Huey @ 2007-04-15 5:16 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 01:27:13PM +1000, Con Kolivas wrote: ... > Now that you're agreeing my direction was correct you've done the usual Linux > kernel thing - ignore all my previous code and write your own version. Oh > well, that I've come to expect; at least you get a copyright notice in the > bootup and somewhere in the comments give me credit for proving it's > possible. Let's give some other credit here too. William Lee Irwin provided > the major architecture behind plugsched at my request and I simply finished > the work and got it working. He is also responsible for many IRC discussions > I've had about cpu scheduling fairness, designs, programming history and code > help. Even though he did not contribute code directly to SD, his comments > have been invaluable. Hello folks, I think the main failure I see here is that Con wasn't included in this design or privately in review process. There could have been better co-ownership of the code. This could also have been done openly on lkml (since this is kind of what this medium is about to significant degree) so that consensus can happen (Con can be reasoned with). It would have achieved the same thing but probably more smoothly if folks just listened, considered an idea and then, in this case, created something that would allow for experimentation from outsiders in a fluid fashion. If these issues aren't fixed, you're going to stuck with the same kind of creeping elitism that has gradually killed the FreeBSD project and other BSDs. I can't comment on the code implementation. I'm focus on other things now that I'm at NetApp and I can't help out as much as I could. Being former BSDi, I had a first hand account of these issues as they played out. A development process like this is likely to exclude smart people from wanting to contribute to Linux and folks should be conscious about this issues. It's basically a lot of code and concept that at least two individuals have worked on (wli and con) only to have it be rejected and then sudden replaced by code from a community gatekeeper. In this case, this results in both Con and Bill Irwin being woefully under utilized. If I were one of these people. I'd be mighty pissed. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 5:16 ` Bill Huey @ 2007-04-15 8:44 ` Ingo Molnar 2007-04-15 9:51 ` Bill Huey 2007-04-15 16:11 ` Bernd Eckenfels 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 8:44 UTC (permalink / raw) To: Bill Huey Cc: Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > Hello folks, > > I think the main failure I see here is that Con wasn't included in > this design or privately in review process. There could have been > better co-ownership of the code. This could also have been done openly > on lkml [...] Bill, you come from a BSD background and you are still relatively new to Linux development, so i dont at all fault you for misunderstanding this situation, and fortunately i have a really easy resolution for your worries: i did exactly that! :) i wrote the first line of code of the CFS patch this week, 8am Wednesday morning, and released it to lkml 62 hours later, 22pm on Friday. (I've listed the file timestamps of my backup patches further below, for all the fine details.) I prefer such early releases to lkml _alot_ more than any private review process. I released the CFS code about 6 hours after i thought "okay, this looks pretty good" and i spent those final 6 hours on testing it (making sure it doesnt blow up on your box, etc.), in the final 2 hours i showed it to two folks i could reach on IRC (Arjan and Thomas) and on various finishing touches. It doesnt get much faster than that and i definitely didnt want to sit on it even one day longer because i very much thought that Con and others should definitely see this work! And i very much credited (and still credit) Con for the whole fairness angle: || i'd like to give credit to Con Kolivas for the general approach here: || he has proven via RSDL/SD that 'fair scheduling' is possible and that || it results in better desktop scheduling. Kudos Con! the 'design consultation' phase you are talking about is _NOW_! :) I got the v1 code out to Con, to Mike and to many others ASAP. That's how you are able to comment on this thread and be part of the development process to begin with, in a 'private consultation' setup you'd not have had any opportunity to see _any_ of this. In the BSD space there seem to be more 'political' mechanisms for development, but Linux is truly about doing things out in the open, and doing it immediately. Okay? 
;-) Here's the timestamps of all my backups of the patch, from its humble 4K beginnings to the 100K first-cut v1 result: -rw-rw-r-- 1 mingo mingo 4230 Apr 11 08:47 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 7653 Apr 11 09:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 7728 Apr 11 09:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 14416 Apr 11 10:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 24211 Apr 11 10:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 27878 Apr 11 10:45 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 33807 Apr 11 11:05 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 34524 Apr 11 11:09 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 39650 Apr 11 11:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40231 Apr 11 11:34 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40627 Apr 11 11:48 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 40638 Apr 11 11:54 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 42733 Apr 11 12:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 42817 Apr 11 12:31 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 43270 Apr 11 12:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 43531 Apr 11 12:48 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 44331 Apr 11 12:51 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45173 Apr 11 12:56 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45288 Apr 11 12:59 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45368 Apr 11 13:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45370 Apr 11 13:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45815 Apr 11 13:14 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45887 Apr 11 13:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45914 Apr 11 13:25 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 45850 Apr 11 13:29 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 49196 Apr 11 13:39 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 64317 Apr 11 13:45 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 64403 Apr 11 13:52 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:03 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 65199 Apr 11 14:07 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 68995 Apr 11 14:50 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 69919 Apr 11 15:23 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71065 Apr 11 16:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 70642 Apr 11 16:28 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 72334 Apr 11 16:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71624 Apr 11 17:01 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 71854 Apr 11 17:20 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 73571 Apr 11 17:42 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 74708 Apr 11 17:51 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 75144 Apr 11 17:57 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80722 Apr 11 18:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:41 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 80727 Apr 11 18:59 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 89356 Apr 11 21:32 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 95278 Apr 12 08:36 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97749 Apr 12 10:49 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97687 Apr 12 10:58 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97722 Apr 12 11:06 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 97933 Apr 12 11:22 
patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 98167 Apr 12 12:09 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 100405 Apr 12 12:29 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 100380 Apr 12 12:50 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101631 Apr 12 13:32 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102293 Apr 12 14:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102431 Apr 12 14:28 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102502 Apr 12 14:53 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102128 Apr 13 11:13 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102473 Apr 13 12:12 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102536 Apr 13 12:24 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 102481 Apr 13 12:30 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103408 Apr 13 13:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103441 Apr 13 13:31 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104759 Apr 13 14:19 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104815 Apr 13 14:39 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104762 Apr 13 15:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 105978 Apr 13 16:18 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 105977 Apr 13 16:26 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 106761 Apr 13 17:08 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 106358 Apr 13 17:40 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 107802 Apr 13 19:04 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 104427 Apr 13 19:35 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 103927 Apr 13 19:40 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101867 Apr 13 20:30 patches/sched-fair.patch -rw-rw-r-- 1 mingo mingo 101011 Apr 13 21:05 patches/sched-fair.patch i hope this helps :) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:44 ` Ingo Molnar @ 2007-04-15 9:51 ` Bill Huey 2007-04-15 10:39 ` Pekka Enberg 0 siblings, 1 reply; 304+ messages in thread From: Bill Huey @ 2007-04-15 9:51 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 10:44:47AM +0200, Ingo Molnar wrote: > I prefer such early releases to lkml _alot_ more than any private review > process. I released the CFS code about 6 hours after i thought "okay, > this looks pretty good" and i spent those final 6 hours on testing it > (making sure it doesnt blow up on your box, etc.), in the final 2 hours > i showed it to two folks i could reach on IRC (Arjan and Thomas) and on > various finishing touches. It doesnt get much faster than that and i > definitely didnt want to sit on it even one day longer because i very > much thought that Con and others should definitely see this work! > > And i very much credited (and still credit) Con for the whole fairness > angle: > > || i'd like to give credit to Con Kolivas for the general approach here: > || he has proven via RSDL/SD that 'fair scheduling' is possible and that > || it results in better desktop scheduling. Kudos Con! > > the 'design consultation' phase you are talking about is _NOW_! :) > > I got the v1 code out to Con, to Mike and to many others ASAP. That's > how you are able to comment on this thread and be part of the > development process to begin with, in a 'private consultation' setup > you'd not have had any opportunity to see _any_ of this. > > In the BSD space there seem to be more 'political' mechanisms for > development, but Linux is truly about doing things out in the open, and > doing it immediately. I can't even begin to talk about how screwed up BSD development is. Maybe another time privately. Ok, Linux development and inclusiveness can be improved. I'm not trying to "call you out" (slang for accusing you with the sole intention to call you crazy in a highly confrontative manner). This is discussed publically here to bring this issue to light, open a communication channel as a means to resolve it. > Okay? ;-) It's cool. We're still getting to know each other professionally and it's okay to a certain degree to have a communication disconnect but only as long as it clears. Your productivity is amazing BTW. But here's the problem, there's this perception that NIH is the default mentality here in Linux. Con feels that this kind of action is intentional and has a malicious quality to it as means of "churn squating" sections of the kernel tree. The perception here is that there is that there is this expectation that sections of the Linux kernel are intentionally "churn squated" to prevent any other ideas from creeping in other than of the owner of that subsytem (VM, scheduling, etc...) because of lack of modularity in the kernel. This isn't an API question but a question possibly general code quality and how maintenance () of it can . This was predicted by folks and then this perception was *realized* when you wrote the equivalent kind of code that has technical overlap with SDL (this is just one dry example). To a person that is writing new code for Linux, having one of the old guards write equivalent code to that of a newcomer has the effect of displacing that person both with regards to code and responsibility with that. 
When this happens over and over again and folks get annoyed by it, it starts seeming that Linux development is elitist. I know this because I heard (read) Con's IRC chats about these matters all of the time. This is not just his view but a view held by other kernel folks as well. The closing talk at OLS 2006 was highly disturbing in many ways. It went "Christoph is right, everybody else is wrong", which sends a highly negative message to new kernel developers that, say, don't work for RH directly or any of the other mainstream Linux companies. After a while, it starts seeming like this kind of behavior is completely intentional and that Linux is full of arrogant bastards. What I would have done here was to contact Peter Williams, Bill Irwin and Con about what you're doing and reach a common consensus about how to create something that would be inclusive of all of their ideas. Discussions can get technically heated but that's ok, the discussion is happening and it brings down the wall of this perception. Bill and Con are on oftc.net/#offtopic2. Riel is there as well as Peter Zijlstra. It might be very useful, it might not be. Folks are all stubborn about their ideas and hold on to them for dear life. Effective leaders can deconstruct this hostility and animosity. I don't claim to be one. Because of past hostility to something like schedplugin, the hostility and terseness of responses can be perceived simply as "I'm right, you're wrong", which is condescending. This affects discussion and outright destroys a constructive process if it happens continually, since it reinforces that view of "You're an outsider, we don't care about you". Nobody is listening to each other at that point, folks get pissed. Then they think "I'm going to NIH this person with patch X because he/she did the same here", which is dysfunctional. Oddly enough, sometimes you're the best person to get a new idea into the tree. What's not happening here is communication. That takes sensitivity, careful listening (which is a difficult skill), and then an understanding of the characters involved to unify creative energies. That's a very difficult thing to do for folks that are used to working solo. It takes time to develop trust in those relationships so that a true collaboration can happen. I know that there is a lot of creativity in folks like Con and Bill. It would be wise to develop a dialog with them to see if they can offload some of your work for you (we all know you're really busy) yet have you be a key facilitator of their and your ideas. That's a really tough thing to do and it requires practice. Just imagine (assuming they can follow through) what could have positively happened if their collective knowledge was leveraged better. It's not all clear and rosy, but I think these people are more on your side than you might realize and it might be a good thing to discover that. This is tough because I know the personalities involved and I know kind of how people function and malfunction in this discussion on a personal basis. [We can continue privately. This is not just about you but applicable to open source development in general] The tone of this email is intellectually critical (not meant as a personality attack) and calm. If I'm otherwise, then I'm a bastard. :) bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 9:51 ` Bill Huey @ 2007-04-15 10:39 ` Pekka Enberg 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 15:16 ` Gene Heskett 0 siblings, 2 replies; 304+ messages in thread From: Pekka Enberg @ 2007-04-15 10:39 UTC (permalink / raw) To: hui Bill Huey Cc: Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > The perception here is that there is an expectation that sections > of the Linux kernel are intentionally "churn squatted" to prevent > any other ideas from creeping in other than those of the owner of that subsystem Strangely enough, my perception is that Ingo is simply trying to address the issues Mike's testing discovered in RSDL and SD. It's not surprising Ingo made it a separate patch set as Con has repeatedly stated that the "problems" are in fact by design and won't be fixed. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 10:39 ` Pekka Enberg @ 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg ` (2 more replies) 2007-04-15 15:16 ` Gene Heskett 1 sibling, 3 replies; 304+ messages in thread From: Willy Tarreau @ 2007-04-15 12:45 UTC (permalink / raw) To: Pekka Enberg Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 01:39:27PM +0300, Pekka Enberg wrote: > On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > >The perception here is that there is an expectation that sections > >of the Linux kernel are intentionally "churn squatted" to prevent > >any other ideas from creeping in other than those of the owner of that subsystem > > Strangely enough, my perception is that Ingo is simply trying to > address the issues Mike's testing discovered in RSDL and SD. It's not > surprising Ingo made it a separate patch set as Con has repeatedly > stated that the "problems" are in fact by design and won't be fixed. That's not exactly the problem. There are people who work very hard to try to improve some areas of the kernel. They progress slowly, and acquire more and more skills. Sometimes they feel like they need to change some concepts and propose those changes which are required for them to go further, or to develop faster. Those are rejected. So they are constrained to work in a delimited perimeter from which it is difficult for them to escape. Then, the same person who rejected their changes comes with something shiny new, better and which took him far less time. But he sort of broke the rules because what was forbidden to the first persons is suddenly permitted. Maybe for very good reasons, I'm not discussing that. The good reason should have been valid the first time too. The fact is that when changes are rejected, we should not simply say "no", but explain why and define what would be acceptable. Some people here have excellent teaching skills for this, but most others do not. Anyway, the rules should be the same for everybody. Also, there is what can be perceived as marketing here. Con worked on his idea with conviction, he took time to write some generous documentation, but he hit a wall where his concept was suboptimal on a given workload. But at least, all the work was oriented on a technical basis: design + code + doc. Then, Ingo comes in with something looking amazingly better, with virtually no documentation, an appealing announcement, and shiny advertising at boot. All this implemented without the constraints other people had to respect. It already looks like definitive work which will be merged as-is without many changes except a few bugfixes. If those were two companies, the first one would simply have accused the second one of not having respected contracts and of having employed heavy marketing to take the first place. People here do not code for a living, they do it at least because they believe in what they are doing, and some of them want a bit of gratitude for their work. I've met people who were proud to say they implemented this or that feature in the kernel, so it is something important for them. And being cited in an email is nothing compared to advertising at boot time. When the discussion was blocked between Con and Mike concerning the design problems, that is where a new discussion should have taken place. 
Ingo could have publicly spoken with them about his ideas of killing the O(1) scheduler and replacing it with an rbtree-based one, and using part of Bill's work to speed up development. It is far easier to resign yourself when people explain what concepts are wrong and how they intend to proceed than when they suddenly present something out of nowhere which is already better. And it's not specific to Ingo (though I think his ability to work that fast alone makes him tend to practise this more often than others). Imagine if Con had worked another full week on his scheduler with better results on Mike's workload, but still not as good as Ingo's, and they both published at the same time. You certainly can imagine he would have preferred to be informed first that it was pointless to continue in that direction. Now I hope he and Bill will get over this and accept to work on improving this scheduler, because I really find it smarter than a dumb O(1). I even agree with Mike that we now have a solid basis for future work. But for this, maybe a good starting point would be to remove the selfish printk at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR come to mind) and improve the documentation a bit so that people can work together on the new design, without feeling like their work will only serve to promote X or Y. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau @ 2007-04-15 13:08 ` Pekka J Enberg 2007-04-15 17:32 ` Mike Galbraith 2007-04-15 15:26 ` William Lee Irwin III 2007-04-15 15:39 ` Ingo Molnar 2 siblings, 1 reply; 304+ messages in thread From: Pekka J Enberg @ 2007-04-15 13:08 UTC (permalink / raw) To: Willy Tarreau Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Willy Tarreau wrote: > Ingo could have publicly spoken with them about his ideas of killing > the O(1) scheduler and replacing it with an rbtree-based one, and using > part of Bill's work to speed up development. He did exactly that and he did it with a patch. Nothing new here. This is how development on LKML proceeds when you have two or more competing designs. There's absolutely no need to get upset or hurt your feelings over it. It's not malicious, it's how we do Linux development. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 13:08 ` Pekka J Enberg @ 2007-04-15 17:32 ` Mike Galbraith 2007-04-15 17:59 ` Linus Torvalds 0 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 17:32 UTC (permalink / raw) To: Pekka J Enberg Cc: Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote: > On Sun, 15 Apr 2007, Willy Tarreau wrote: > > Ingo could have publicly spoken with them about his ideas of killing > > the O(1) scheduler and replacing it with an rbtree-based one, and using > > part of Bill's work to speed up development. > > He did exactly that and he did it with a patch. Nothing new here. This is > how development on LKML proceeds when you have two or more competing > designs. There's absolutely no need to get upset or hurt your feelings > over it. It's not malicious, it's how we do Linux development. Yes. Exactly. This is what it's all about, this is what makes it work. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:32 ` Mike Galbraith @ 2007-04-15 17:59 ` Linus Torvalds 2007-04-15 19:00 ` Jonathan Lundell 0 siblings, 1 reply; 304+ messages in thread From: Linus Torvalds @ 2007-04-15 17:59 UTC (permalink / raw) To: Mike Galbraith Cc: Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Mike Galbraith wrote: > On Sun, 2007-04-15 at 16:08 +0300, Pekka J Enberg wrote: > > > > He did exactly that and he did it with a patch. Nothing new here. This is > > how development on LKML proceeds when you have two or more competing > > designs. There's absolutely no need to get upset or hurt your feelings > > over it. It's not malicious, it's how we do Linux development. > > Yes. Exactly. This is what it's all about, this is what makes it work. I obviously agree, but I will also add that one of the most motivating things there *is* in open source is "personal pride". It's a really good thing, and it means that if somebody shows that your code is flawed in some way (by, for example, making a patch that people claim gets better behaviour or numbers), any *good* programmer that actually cares about his code will obviously suddenly be very motivated to out-do the out-doer! Does this mean that there will be tension and rivalry? Hell yes. But that's kind of the point. Life is a game, and if you aren't in it to win, what the heck are you still doing here? As long as it's reasonably civil (I'm not personally a huge believer in being too polite or "politically correct", so I think the "reasonably" is more important than the "civil" part!), and as long as the end result is judged on TECHNICAL MERIT, it's all good. We don't want to play politics. But encouraging people's competitive feelings? Oh, yes. Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 17:59 ` Linus Torvalds @ 2007-04-15 19:00 ` Jonathan Lundell 2007-04-15 22:52 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Jonathan Lundell @ 2007-04-15 19:00 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > It's a really good thing, and it means that if somebody shows that > your > code is flawed in some way (by, for example, making a patch that > people > claim gets better behaviour or numbers), any *good* programmer that > actually cares about his code will obviously suddenly be very > motivated to > out-do the out-doer! "No one who cannot rejoice in the discovery of his own mistakes deserves to be called a scholar." --Don Foster, "literary sleuth", on retracting his attribution of "A Funerall Elegye" to Shakespeare (it's more likely John Ford's work). ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 19:00 ` Jonathan Lundell @ 2007-04-15 22:52 ` Con Kolivas 2007-04-16 2:28 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-15 22:52 UTC (permalink / raw) To: Jonathan Lundell Cc: Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 05:00, Jonathan Lundell wrote: > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > > It's a really good thing, and it means that if somebody shows that > > your > > code is flawed in some way (by, for example, making a patch that > > people > > claim gets better behaviour or numbers), any *good* programmer that > > actually cares about his code will obviously suddenly be very > > motivated to > > out-do the out-doer! > > "No one who cannot rejoice in the discovery of his own mistakes > deserves to be called a scholar." Lovely comment. I realise this is not truly directed at me but clearly in the context it has been said people will assume it is directed my way, so while we're all spinning lkml quality rhetoric, let me have a right of reply. One thing I have never tried to do was to ignore bug reports. I'm forever joking that I keep pulling code out of my arse to improve what I've done. RSDL/SD was no exception; heck it had 40 iterations. The reason I could not reply to bug report A with "Oh that is problem B so I'll fix it with code C" was, as I've said many many times over, health related. I did indeed try to fix many of them without spending hours replying to sometimes unpleasant emails. If health wasn't an issue there might have been 1000 iterations of SD. There was only ever _one_ thing that I was absolutely steadfast on as a concept that I refused to fix that people might claim was "a mistake I did not rejoice in to be a scholar". That was that the _correct_ behaviour for a scheduler is to be fair such that proportional slowdown with load is (using that awful pun) a feature, not a bug. Now there are people who will still disagree violently with me on that. SD attempted to be a fairness first virtual-deadline design. If I failed on that front, then so be it (and at least one person certainly has said in lovely warm fuzzy friendly communication that I'm a global failure on all fronts with SD). But let me point out now that Ingo's shiny new scheduler is a fairness-first virtual-deadline design which will have proportional slowdown with load. So it will have a very similar feature. I dare anyone to claim that proportional slowdown with load is a bug, because I will no longer feel like I'm standing alone with a BFG9000 trying to defend my standpoint. Others can take up the post at last. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:52 ` Con Kolivas @ 2007-04-16 2:28 ` Nick Piggin 2007-04-16 3:15 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-16 2:28 UTC (permalink / raw) To: Con Kolivas Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 08:52:33AM +1000, Con Kolivas wrote: > On Monday 16 April 2007 05:00, Jonathan Lundell wrote: > > On Apr 15, 2007, at 10:59 AM, Linus Torvalds wrote: > > > It's a really good thing, and it means that if somebody shows that > > > your > > > code is flawed in some way (by, for example, making a patch that > > > people > > > claim gets better behaviour or numbers), any *good* programmer that > > > actually cares about his code will obviously suddenly be very > > > motivated to > > > out-do the out-doer! > > > > "No one who cannot rejoice in the discovery of his own mistakes > > deserves to be called a scholar." > > Lovely comment. I realise this is not truly directed at me but clearly in the > context it has been said people will assume it is directed my way, so while > we're all spinning lkml quality rhetoric, let me have a right of reply. > > One thing I have never tried to do was to ignore bug reports. I'm forever > joking that I keep pulling code out of my arse to improve what I've done. > RSDL/SD was no exception; heck it had 40 iterations. The reason I could not > reply to bug report A with "Oh that is problem B so I'll fix it with code C" > was, as I've said many many times over, health related. I did indeed try to > fix many of them without spending hours replying to sometimes unpleasant > emails. If health wasn't an issue there might have been 1000 iterations of > SD. Well what matters is the code and development. I don't think Ingo's scheduler is the final word, although I worry that Linus might jump the gun and merge something "just to give it a test", which we then get stuck with :P I don't know how anybody can think Ingo's new scheduler is anything but a good thing (so long as it has to compete before being merged). And that's coming from someone who wants *their* scheduler to get merged... I think mine can compete ;) and if it can't, then I'd rather be using the scheduler that beats it. > There was only ever _one_ thing that I was absolutely steadfast on as a > concept that I refused to fix that people might claim was "a mistake I did > not rejoice in to be a scholar". That was that the _correct_ behaviour for a > scheduler is to be fair such that proportional slowdown with load is (using > that awful pun) a feature, not a bug. If something is using more than a fair share of CPU time, over some macro period, in order to be interactive, then definitely it should get throttled. I've always maintained (since starting scheduler work) that the 2.6 scheduler is horrible because it allows these cases where some things can get more CPU time just by how they behave. Glad people are starting to come around on that point. So, on to something productive, we have 3 candidates for a new scheduler so far. How do we decide which way to go? (and yes, I still think switchable schedulers is wrong and a copout) This is one area where it is virtually impossible to discount any decent design on correctness/performance/etc. and even testing in -mm isn't really enough. 
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 2:28 ` Nick Piggin @ 2007-04-16 3:15 ` Con Kolivas 2007-04-16 3:34 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-16 3:15 UTC (permalink / raw) To: Nick Piggin Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 12:28, Nick Piggin wrote: > So, on to something productive, we have 3 candidates for a new scheduler so > far. How do we decide which way to go? (and yes, I still think switchable > schedulers is wrong and a copout) This is one area where it is virtually > impossible to discount any decent design on correctness/performance/etc. > and even testing in -mm isn't really enough. We're in agreement! YAY! Actually this is simpler than that. I'm taking SD out of the picture. It has served its purpose of proving that we need to seriously address all the scheduling issues and did more than a half decent job at it. Unfortunately I also cannot sit around supporting it forever by myself. My own life is more important, so consider SD not even running the race any more. I'm off to continue maintaining permanent-out-of-tree leisurely code at my own pace. What's more, I think I'll just stick to staircase Gen I version blah and shelve SD and try to have fond memories of SD as an intellectual prompting exercise only. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:15 ` Con Kolivas @ 2007-04-16 3:34 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-16 3:34 UTC (permalink / raw) To: Con Kolivas Cc: Jonathan Lundell, Linus Torvalds, Mike Galbraith, Pekka J Enberg, Willy Tarreau, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 01:15:27PM +1000, Con Kolivas wrote: > On Monday 16 April 2007 12:28, Nick Piggin wrote: > > So, on to something productive, we have 3 candidates for a new scheduler so > > far. How do we decide which way to go? (and yes, I still think switchable > > schedulers is wrong and a copout) This is one area where it is virtually > > impossible to discount any decent design on correctness/performance/etc. > > and even testing in -mm isn't really enough. > > We're in agreement! YAY! > > Actually this is simpler than that. I'm taking SD out of the picture. It has > served it's purpose of proving that we need to seriously address all the > scheduling issues and did more than a half decent job at it. Unfortunately I > also cannot sit around supporting it forever by myself. My own life is more > important, so consider SD not even running the race any more. > > I'm off to continue maintaining permanent-out-of-tree leisurely code at my own > pace. What's more is, I think I'll just stick to staircase Gen I version blah > and shelve SD and try to have fond memories of SD as an intellectual > prompting exercise only. Well I would hope that _if_ we decide to switch schedulers, then you get a chance to field something (and I hope you will decide to and have time to), and I hope we don't rush into the decision. We've had the current scheduler for so many years now that it is much more important to make sure we take the time to do the right thing rather than absolutely have to merge a new scheduler right now ;) ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg @ 2007-04-15 15:26 ` William Lee Irwin III 2007-04-16 15:55 ` Chris Friesen 2007-04-15 15:39 ` Ingo Molnar 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 15:26 UTC (permalink / raw) To: Willy Tarreau Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 02:45:27PM +0200, Willy Tarreau wrote: > Now I hope he and Bill will get over this and accept to work on improving > this scheduler, because I really find it smarter than a dumb O(1). I even > agree with Mike that we now have a solid basis for future work. But for > this, maybe a good starting point would be to remove the selfish printk > at boot, revert useless changes (SCHED_NORMAL->SCHED_FAIR come to mind) > and improve the documentation a bit so that people can work together on > the new design, without feeling like their work will only server to > promote X or Y. While I appreciate people coming to my defense, or at least the good intentions behind such, my only actual interest in pointing out 4-year-old work is getting some acknowledgment of having done something relevant at all. Sometimes it has "I told you so" value. At other times it's merely clarifying what went on when people refer to it since in a number of cases the patches are no longer extant, so they can't actually look at it to get an idea of what was or wasn't done. At other times I'm miffed about not being credited, whether I should've been or whether dead and buried code has an implementation of the same idea resurfacing without the author(s) having any knowledge of my prior work. One should note that in this case, the first work of mine this trips over (scheduling classes) was never publicly posted as it was only a part of the original plugsched (an alternate scheduler implementation devised to demonstrate plugsched's flexibility with respect to scheduling policies), and a part that was dropped by subsequent maintainers. The second work of mine this trips over, a virtual deadline scheduler named "vdls," was also never publicly posted. Both are from around the same time period, which makes them approximately 4 years dead. Neither of the codebases are extant, having been lost in a transition between employers, though various people recall having been sent them privately, and plugsched survives in a mutated form as maintained by Peter Williams, who's been very good about acknowledging my original contribution. If I care to become a direct participant in scheduler work, I can do so easily enough. I'm not entirely sure what this is about a basis for future work. By and large one should alter the API's and data structures to fit the policy being implemented. While the array swapping was nice for algorithmically improving 2.4.x -style epoch expiry, most algorithms not based on the 2.4.x scheduler (in however mutated a form) should use a different queue structure, in fact, one designed around their policy's specific algorithmic needs. IOW, when one alters the scheduler, one should also alter the queue data structure appropriately. I'd not expect the priority queue implementation in cfs to continue to be used unaltered as it matures, nor would I expect any significant modification of the scheduler to necessarily use a similar one. 
By and large I've been mystified as to why there is such a penchant for preserving the existing queue structures in the various scheduler patches floating around. I am now every bit as mystified at the point of view that seems to be emerging that a change of queue structure is particularly significant. These are all largely internal changes to sched.c, and as such, rather small changes in and of themselves. While they do tend to have user-visible effects, from this point of view even changing out every line of sched.c is effectively a micropatch. Something more significant might be altering the schedule() API to take a mandatory description of the intention of the call to it, or breaking up schedule() into several different functions to distinguish between different sorts of uses of it to which one would then respond differently. Also more significant would be adding a new state beyond TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, and TASK_RUNNING for some tasks to respond only to fatal signals, then sweeping TASK_UNINTERRUPTIBLE users to use the new state and handle those fatal signals. While not quite as ostentatious in their user-visible effects as SCHED_OTHER policy affairs, they are tremendously more work than switching out the implementation of a single C file, and so somewhat more respectable. Even as scheduling semantics go, these are micropatches. So SCHED_OTHER changes a little. Where are the gang schedulers? Where are the batch schedulers (SCHED_BATCH is not truly such)? Where are the isochronous (frame) schedulers? I suppose there is some CKRM work that actually has a semantic impact despite being largely devoted to SCHED_OTHER, and there's some spufs gang scheduling going on, though not all that much. And to reiterate a point from other threads, even as SCHED_OTHER patches go, I see precious little verification that things like the semantics of nice numbers or other sorts of CPU bandwidth allocation between competing tasks of various natures are staying the same while other things are changed, or at least being consciously modified in such a fashion as to improve them. I've literally only seen one or two tests (and rather inflexible ones with respect to sleep and running time mixtures) with any sort of quantification of how CPU bandwidth is distributed get run on all this. So from my point of view, there's a lot of churn and craziness going on in one tiny corner of the kernel and people don't seem to have a very solid grip on what effects their changes have or how they might potentially break userspace. So I've developed a sudden interest in regression testing of the scheduler in order to ensure that various sorts of semantics on which userspace relies are not broken, and am trying to spark more interest in general in nailing down scheduling semantics and verifying that those semantics are honored and remain honored by whatever future scheduler implementations might be merged. Thus far, the laundry list of semantics I'd like to have nailed down are specifically: (1) CPU bandwidth allocation according to nice numbers (2) CPU bandwidth allocation among mixtures of tasks with varying sleep/wakeup behavior e.g. 
that consume some percentage of cpu in isolation, perhaps also varying the granularity of their sleep/wakeup patterns (3) sched_yield(), so multitier userspace locking doesn't go haywire (4) How these work with SMP; most people agree it should be mostly the same as it works on UP, but it's not being verified, as most testcases are barely SMP-aware if at all, and corner cases where proportionality breaks down aren't considered The sorts of like explicit decisions I'd like to be made for these are: (1) In a mixture of tasks with varying nice numbers, a given nice number corresponds to some share of CPU bandwidth. Implementations should not have the freedom to change this arbitrarily according to some intention. (2) A given scheduler _implementation_ intends to distribute CPU bandwidth among mixtures of tasks that would each consume some percentage of the CPU in isolation varying across tasks in some particular pattern. For example, maybe some scheduler implementation assigns a share of 1/%cpu to a task that would consume %cpu in isolation, for a CPU bandwidth allocation of (1/%cpu)/(sum 1/%cpu(t)) as t ranges over all competing tasks (this is not to say that such a policy makes sense). (3) sched_yield() is intended to result in some particular scheduling pattern in a given scheduler implementation. For instance, an implementation may intend that a set of CPU hogs calling sched_yield() between repeatedly printf()'ing their pid's will see their printf()'s come out in an approximately consistent order as the scheduler cycles between them. (4) What an implementation intends to do with respect to SMP CPU bandwidth allocation when precise emulation of UP behavior is impossible, considering sched_yield() scheduling patterns when possible as well. For instance, perhaps an implementation intends to ensure equal CPU bandwidth among competing CPU-bound tasks of equal priority at all costs, and so triggers migration and/or load balancing to make it so. Or perhaps an implementation intends to ensure precise sched_yield() ordering at all costs even on SMP. Some sort of specification of the intention, then a verification that the intention is carried out in a testcase. Also, if there's a semantic issue to be resolved, I want it to have something describing it and verifying it. For instance, characterizing whatever sort of scheduling artifacts queue-swapping causes in the mainline scheduler and then a testcase to demonstrate the artifact and its resolution in a given scheduler rewrite would be a good design statement and verification. For instance, if someone wants to go back to queue-swapping or other epoch expiry semantics, it would make them (and hopefully everyone else) conscious of the semantic issue the change raises, or possibly serve as a demonstration that the artifacts can be mitigated in some implementation retaining epoch expiry semantics. As I become aware of more potential issues I'll add more to my laundry list, and I'll hammer out testcases as I go. My concern with the scheduler is that this sort of basic functionality may be significantly disturbed with no one noticing at all until a distro issues a prerelease and benchmarks go haywire, and furthermore that changes to this kind of basic behavior may be signs of things going awry, particularly as more churn happens. So now that I've clarified my role in all this to date and my point of view on it, it should be clear that accepting something and working on some particular scheduler implementation don't make sense as suggestions to me. 
-- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
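For illustration of the kind of specification asked for in point (2) above, here is a small worked example of the hypothetical 1/%cpu allocation rule; this is purely a sketch (the task mix is invented, and the original mail explicitly notes such a policy is not necessarily sensible):

    #include <stdio.h>

    int main(void)
    {
            /* fraction of the CPU each task would consume in isolation (invented mix) */
            double isolated[] = { 1.00, 0.50, 0.25 };
            int n = sizeof(isolated) / sizeof(isolated[0]);
            double sum = 0.0;
            int i;

            for (i = 0; i < n; i++)
                    sum += 1.0 / isolated[i];        /* share proportional to 1/%cpu */

            for (i = 0; i < n; i++)
                    printf("task %d: %3.0f%% in isolation -> %5.1f%% allocated\n",
                           i, isolated[i] * 100.0,
                           100.0 * (1.0 / isolated[i]) / sum);
            return 0;
    }

Under this particular rule the lightest task ends up with the largest share (about 57% here), which is exactly why the caveat above says the rule itself may not make sense; the point is only that whatever rule an implementation intends should be written down and checkable by a testcase.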
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:26 ` William Lee Irwin III @ 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Chris Friesen @ 2007-04-16 15:55 UTC (permalink / raw) To: William Lee Irwin III Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > The sorts of like explicit decisions I'd like to be made for these are: > (1) In a mixture of tasks with varying nice numbers, a given nice number > corresponds to some share of CPU bandwidth. Implementations > should not have the freedom to change this arbitrarily according > to some intention. The first question that comes to my mind is whether nice levels should be linear or not. I would lean towards nonlinear as it allows a wider range (although of course at the expense of precision). Maybe something like "each nice level gives X times the cpu of the previous"? I think a value of X somewhere between 1.15 and 1.25 might be reasonable. What about also having something that looks at latency, and how latency changes with niceness? What about specifying the timeframe over which the cpu bandwidth is measured? I currently have a system where the application designers would like it to be totally fair over a period of 1 second. As you can imagine, mainline doesn't do very well in this case. Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen @ 2007-04-16 16:13 ` William Lee Irwin III 2007-04-17 0:04 ` Peter Williams 2007-04-17 13:07 ` James Bruce 2 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 16:13 UTC (permalink / raw) To: Chris Friesen Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. On Mon, Apr 16, 2007 at 09:55:14AM -0600, Chris Friesen wrote: > The first question that comes to my mind is whether nice levels should > be linear or not. I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. > What about also having something that looks at latency, and how latency > changes with niceness? > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. As you can > imagine, mainline doesn't do very well in this case. It's unclear how latency enters the picture as the semantics of nice levels relevant to such are essentially priority preemption, which is not particularly easy to mess up. I suppose tests to ensure priority preemption occurs properly are in order. I don't really have a preference regarding specific semantics for nice numbers, just that they should be deterministic and specified somewhere. It's not really for us to decide what those semantics are as it's more of a userspace ABI/API issue. The timeframe is also relevant, but I suspect it's more of a performance metric than a strict requirement. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III @ 2007-04-17 0:04 ` Peter Williams 2007-04-17 13:07 ` James Bruce 2 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 0:04 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > William Lee Irwin III wrote: > >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. > > The first question that comes to my mind is whether nice levels should > be linear or not. No. That squishes one end of the table too much. It needs to be (approximately) piecewise linear around nice == 0. Here's the mapping I use in my entitlement based schedulers: #define NICE_TO_LP(nice) ((nice >=0) ? (20 - (nice)) : (20 + (nice) * (nice))) It has the (good) feature that a nice == 19 task has 1/20th the entitlement of a nice == 0 task and a nice == -20 task has 21 times the entitlement of a nice == 0 task. It's not strictly linear for negative nice values but is very cheap to calculate and quite easy to invert if necessary. > I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. > > What about also having something that looks at latency, and how latency > changes with niceness? > > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. Have you tried the spa_ebs scheduler? The half life is no longer a run time configurable parameter (as making it highly adjustable results in less efficient code) but it could be adjusted to be approximately equivalent to 0.5 seconds by changing some constants in the code. > As you can > imagine, mainline doesn't do very well in this case. You should look back through the plugsched patches where many of these ideas have been experimented with. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
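For illustration, the quoted NICE_TO_LP() mapping can be tabulated with a few lines of userspace C. The program below is only a sketch that reproduces the macro from the mail above and prints each nice level's entitlement relative to nice 0; it is not taken from the spa_ebs code itself.

    #include <stdio.h>

    /* the mapping quoted above: linear for nice >= 0, quadratic for nice < 0 */
    #define NICE_TO_LP(nice) (((nice) >= 0) ? (20 - (nice)) : (20 + (nice) * (nice)))

    int main(void)
    {
            int nice;

            for (nice = -20; nice <= 19; nice++)
                    printf("nice %3d  lp %3d  relative to nice 0: %6.2f\n",
                           nice, NICE_TO_LP(nice),
                           (double)NICE_TO_LP(nice) / NICE_TO_LP(0));
            return 0;
    }

A nice 19 task comes out at 0.05 (1/20th) of a nice 0 task and a nice -20 task at 21.00 times, matching the figures quoted, with the quadratic branch giving the negative-nice range its extra spread.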
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 15:55 ` Chris Friesen 2007-04-16 16:13 ` William Lee Irwin III 2007-04-17 0:04 ` Peter Williams @ 2007-04-17 13:07 ` James Bruce 2007-04-17 20:05 ` William Lee Irwin III 2 siblings, 1 reply; 304+ messages in thread From: James Bruce @ 2007-04-17 13:07 UTC (permalink / raw) To: linux-kernel; +Cc: ck Chris Friesen wrote: > William Lee Irwin III wrote: > >> The sorts of like explicit decisions I'd like to be made for these are: >> (1) In a mixture of tasks with varying nice numbers, a given nice number >> corresponds to some share of CPU bandwidth. Implementations >> should not have the freedom to change this arbitrarily according >> to some intention. > > The first question that comes to my mind is whether nice levels should > be linear or not. I would lean towards nonlinear as it allows a wider > range (although of course at the expense of precision). Maybe something > like "each nice level gives X times the cpu of the previous"? I think a > value of X somewhere between 1.15 and 1.25 might be reasonable. Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 That value has the property that a nice=10 task gets 1/10th the cpu of a nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that would be fairly easy to explain to admins and users so that they can know what to expect from nicing tasks. > What about also having something that looks at latency, and how latency > changes with niceness? I think this would be a lot harder to pin down, since it's a function of all the other tasks running and their nice levels. Do you have any of the RT-derived analysis models in mind? > What about specifying the timeframe over which the cpu bandwidth is > measured? I currently have a system where the application designers > would like it to be totally fair over a period of 1 second. As you can > imagine, mainline doesn't do very well in this case. It might be easier to specify the maximum deviation from the ideal bandwidth over a certain period. I.e. something like "over a period of one second, each task receives within 10% of the expected bandwidth". - Jim Bruce ^ permalink raw reply [flat|nested] 304+ messages in thread
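The property described above is easy to check numerically; the standalone sketch below (illustrative only, not kernel code) computes the CPU share of a nice=n task relative to nice=0 under a purely multiplicative scheme with X = exp(ln(10)/10):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
            double x = exp(log(10.0) / 10.0);   /* ~1.2589: each nice level costs a factor x */
            int nice;

            printf("X = %.4f\n", x);
            for (nice = 0; nice <= 20; nice += 5)
                    printf("nice %2d gets %.4f of the nice 0 share\n",
                           nice, pow(x, -(double)nice));
            return 0;
    }

Built with something like cc nice.c -lm, this prints 0.1000 for nice 10 and 0.0100 for nice 20, i.e. 1/10th and 1/100th of the nice 0 share as claimed.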
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:07 ` James Bruce @ 2007-04-17 20:05 ` William Lee Irwin III 0 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 20:05 UTC (permalink / raw) To: James Bruce Cc: Chris Friesen, Willy Tarreau, Pekka Enberg, hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > That value has the property that a nice=10 task gets 1/10th the cpu of a > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > would be fairly easy to explain to admins and users so that they can > know what to expect from nicing tasks. [...additional good commentary trimmed...] Lots of good ideas here. I'll follow them. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:45 ` Willy Tarreau 2007-04-15 13:08 ` Pekka J Enberg 2007-04-15 15:26 ` William Lee Irwin III @ 2007-04-15 15:39 ` Ingo Molnar 2007-04-15 15:47 ` William Lee Irwin III 2007-04-16 5:27 ` Peter Williams 2 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 15:39 UTC (permalink / raw) To: Willy Tarreau Cc: Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: > Ingo could have publicly spoken with them about his ideas of killing > the O(1) scheduler and replacing it with an rbtree-based one, [...] yes, that's precisely what i did, via a patchset :) [ I can even tell you when it all started: i was thinking about Mike's throttling patches while watching Manchester United beat the crap out of AS Roma (7 to 1 end result), Tuesday evening. I started coding it Wednesday morning and sent the patch Friday evening. I very much believe in low-latency when it comes to development too ;) ] (if this had been done via a committee then today we'd probably still be trying to find a suitable timeslot for the initial conference call where we'd discuss the election of a chair who would be tasked with writing up an initial document of feature requests, on which we'd take a vote, possibly this year already, because the matter is really urgent you know ;-) > [...] and using part of Bill's work to speed up development. ok, let me make this absolutely clear: i didnt use any bit of plugsched - in fact the most difficult bits of the modularization was for areas of sched.c that plugsched never even touched AFAIK. (the load-balancer for example.) Plugsched simply does something else: i modularized scheduling policies in essence that have to cooperate with each other, while plugsched modularized complete schedulers which are compile-time or boot-time selected, with no runtime cooperation between them. (one has to be selected at a time) (and i have no trouble at all with crediting Will's work either: a few years ago i used Will's PID rework concepts for an NPTL related speedup and Will is very much credited for it in today's kernel/pid.c and he continued to contribute to it later on.) (the tree walking bits of sched_fair.c were in fact derived from kernel/hrtimer.c, the rbtree code written by Thomas and me :-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
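To make the distinction drawn here concrete, a hedged sketch of what "scheduling policies that cooperate at runtime" can look like at the interface level follows. The structure and names (example_sched_class, highest_class, and so on) are invented for illustration and are not the actual interface in the CFS patch; the point is only that several classes are active at once and are consulted in priority order, whereas plugsched selects one complete scheduler at build or boot time.

    #include <stddef.h>

    struct rq;             /* per-CPU runqueue, opaque in this sketch */
    struct task_struct;    /* task, opaque in this sketch */

    /* one scheduling policy module; classes form a priority-ordered list */
    struct example_sched_class {
            const struct example_sched_class *next;
            void (*enqueue_task)(struct rq *rq, struct task_struct *p);
            void (*dequeue_task)(struct rq *rq, struct task_struct *p);
            struct task_struct *(*pick_next_task)(struct rq *rq);
    };

    /* e.g. an RT-style class would sit ahead of a fair class here */
    static const struct example_sched_class *highest_class;

    /* the core asks each class in turn for runnable work: the classes
     * cooperate at runtime instead of one monolithic scheduler being
     * swapped in wholesale */
    static struct task_struct *example_pick_next(struct rq *rq)
    {
            const struct example_sched_class *class;

            for (class = highest_class; class != NULL; class = class->next) {
                    struct task_struct *p = class->pick_next_task(rq);
                    if (p)
                            return p;
            }
            return NULL;    /* nothing runnable: go idle */
    }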
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:39 ` Ingo Molnar @ 2007-04-15 15:47 ` William Lee Irwin III 2007-04-16 5:27 ` Peter Williams 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 15:47 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Willy Tarreau <w@1wt.eu> wrote: >> [...] and using part of Bill's work to speed up development. On Sun, Apr 15, 2007 at 05:39:33PM +0200, Ingo Molnar wrote: > ok, let me make this absolutely clear: i didnt use any bit of plugsched > - in fact the most difficult bits of the modularization was for areas of > sched.c that plugsched never even touched AFAIK. (the load-balancer for > example.) > Plugsched simply does something else: i modularized scheduling policies > in essence that have to cooperate with each other, while plugsched > modularized complete schedulers which are compile-time or boot-time > selected, with no runtime cooperation between them. (one has to be > selected at a time) > (and i have no trouble at all with crediting Will's work either: a few > years ago i used Will's PID rework concepts for an NPTL related speedup > and Will is very much credited for it in today's kernel/pid.c and he > continued to contribute to it later on.) > (the tree walking bits of sched_fair.c were in fact derived from > kernel/hrtimer.c, the rbtree code written by Thomas and me :-) The extant plugsched patches have nothing to do with cfs; I suspect what everyone else is going on about is terminological confusion. The 4-year-old sample policy with scheduling classes for the original plugsched is something you had no way of knowing about, as it was never publicly posted. There isn't really anything all that interesting going on here, apart from pointing out that it's been done before. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:39 ` Ingo Molnar 2007-04-15 15:47 ` William Lee Irwin III @ 2007-04-16 5:27 ` Peter Williams 2007-04-16 6:23 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 5:27 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Willy Tarreau <w@1wt.eu> wrote: > >> Ingo could have publicly spoken with them about his ideas of killing >> the O(1) scheduler and replacing it with an rbtree-based one, [...] > > yes, that's precisely what i did, via a patchset :) > > [ I can even tell you when it all started: i was thinking about Mike's > throttling patches while watching Manchester United beat the crap out > of AS Roma (7 to 1 end result), Tuesday evening. I started coding it > Wednesday morning and sent the patch Friday evening. I very much > believe in low-latency when it comes to development too ;) ] > > (if this had been done via a committee then today we'd probably still be > trying to find a suitable timeslot for the initial conference call where > we'd discuss the election of a chair who would be tasked with writing up > an initial document of feature requests, on which we'd take a vote, > possibly this year already, because the matter is really urgent you know > ;-) > >> [...] and using part of Bill's work to speed up development. > > ok, let me make this absolutely clear: i didnt use any bit of plugsched > - in fact the most difficult bits of the modularization was for areas of > sched.c that plugsched never even touched AFAIK. (the load-balancer for > example.) This sounds like your new scheduler intends to increase the coupling between scheduling and load balancing. I think that this would be a mistake and lead (down the track) to spiralling complexity as you make changes to the code to address the corner conditions that it will create. > > Plugsched simply does something else: i modularized scheduling policies > in essence that have to cooperate with each other, while plugsched > modularized complete schedulers which are compile-time or boot-time > selected, with no runtime cooperation between them. (one has to be > selected at a time) You can't really have more than one scheduler operating in the same priority range on the same CPU as they will be fighting each other trying to achieve their separate and not necessarily compatible (in fact highly likely to be incompatible) aims. Multiple schedulers on the same CPU have to have a pecking order just like SCHED_OTHER and real time policies. It wouldn't be hard to prove that SCHED_RR and SCHED_FIFO are a problem in waiting if ever someone tried to use them both on a highly real time system. > > (and i have no trouble at all with crediting Will's work either: a few > years ago i used Will's PID rework concepts for an NPTL related speedup > and Will is very much credited for it in today's kernel/pid.c and he > continued to contribute to it later on.) > > (the tree walking bits of sched_fair.c were in fact derived from > kernel/hrtimer.c, the rbtree code written by Thomas and me :-) > > Ingo Are your new patches available somewhere for easy download or do I have to try to dig them out of the mailing list archive? Or could you mail them to me separately? I'm keen to see how your new scheduler proposal works. 
Thanks Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:27 ` Peter Williams @ 2007-04-16 6:23 ` Peter Williams 2007-04-16 6:40 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 6:23 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > > Are your new patches available somewhere for easy download or do I have > to try to dig them out of the mailing list archive? Or could you mail > them to me separately? I'm keen to see how your new scheduler proposal > works. Forget about this. I found the patch. After a quick look, I like a lot of what I see, especially the removal of the dual arrays in the run queue. Some minor suggestions: 1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to initialize the task structure in init_task.h. 2. the on_rq field in the task structure is unnecessary as many years of experience with ingosched in plugsched indicates that !list_empty(&(p)->run_list) does the job provided list_del_init() is used when dequeueing and there is no noticeable overhead incurred so there's no gain by caching the result. Also it removes the possibility of errors creeping in due to the value of on_rq being inconsistent with the task's actual state. 3. having modular load balancing is a good idea but it should be decoupled from the scheduler and provided as a separate interface. This would enable different schedulers to use the same load balancer if they desired. 4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be fair(ish) anyway. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
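Point 2 above relies on a standard kernel list idiom: if the dequeue path uses list_del_init() rather than list_del(), the removed node is left pointing at itself, so list_empty() applied to the task's own run_list node answers "is this task queued?" without a separate on_rq flag. The userspace sketch below re-implements just enough of the circular-list primitives to demonstrate the idiom; it is illustrative only and not the kernel's list.h.

    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

    static void list_add(struct list_head *new, struct list_head *head)
    {
            new->next = head->next;
            new->prev = head;
            head->next->prev = new;
            head->next = new;
    }

    static void list_del_init(struct list_head *e)
    {
            e->prev->next = e->next;
            e->next->prev = e->prev;
            INIT_LIST_HEAD(e);              /* key step: node points back at itself */
    }

    static int list_empty(const struct list_head *h) { return h->next == h; }

    struct task { struct list_head run_list; };

    int main(void)
    {
            struct list_head runqueue;
            struct task t;

            INIT_LIST_HEAD(&runqueue);
            INIT_LIST_HEAD(&t.run_list);
            printf("queued? %d\n", !list_empty(&t.run_list));   /* 0: never enqueued */
            list_add(&t.run_list, &runqueue);                   /* enqueue */
            printf("queued? %d\n", !list_empty(&t.run_list));   /* 1: on the runqueue */
            list_del_init(&t.run_list);                         /* dequeue */
            printf("queued? %d\n", !list_empty(&t.run_list));   /* 0 again, no flag needed */
            return 0;
    }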
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 6:23 ` Peter Williams @ 2007-04-16 6:40 ` Peter Williams 2007-04-16 7:32 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 6:40 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, hui Bill Huey, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Peter Williams wrote: >> >> Are your new patches available somewhere for easy download or do I >> have to try to dig them out of the mailing list archive? Or could you >> mail them to me separately? I'm keen to see how your new scheduler >> proposal works. > > Forget about this. I found the patch. > > After a quick look, I like a lot of what I see, especially the removal of > the dual arrays in the run queue. > > Some minor suggestions: > > 1. having defined DEFAULT_PRIO in sched.h shouldn't you use it to > initialize the task structure in init_task.h. > 2. the on_rq field in the task structure is unnecessary as many years of > experience with ingosched in plugsched indicates that > !list_empty(&(p)->run_list) does the job provided list_del_init() is used > when dequeueing and there is no noticeable overhead incurred so there's > no gain by caching the result. Also it removes the possibility of > errors creeping in due to the value of on_rq being inconsistent with the > task's actual state. > 3. having modular load balancing is a good idea but it should be > decoupled from the scheduler and provided as a separate interface. This > would enable different schedulers to use the same load balancer if they > desired. > 4. why rename SCHED_OTHER to SCHED_FAIR? SCHED_OTHER's supposed to be > fair(ish) anyway. One more quick comment. The claim that there is no concept of time slice in the new scheduler is only true in the sense of the rather arcane implementation of time slices extant in the O(1) scheduler. Your new parameter sched_granularity_ns is equivalent to the concept of time slice in most other kernels that I've peeked inside and computing literature in general (going back over several decades e.g. the magic garden). Welcome to the mainstream, Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 6:40 ` Peter Williams @ 2007-04-16 7:32 ` Ingo Molnar 2007-04-16 8:54 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 7:32 UTC (permalink / raw) To: Peter Williams Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Peter Williams <pwil3058@bigpond.net.au> wrote: > One more quick comment. The claim that there is no concept of time > slice in the new scheduler is only true in the sense of the rather > arcane implementation of time slices extant in the O(1) scheduler. yeah. AFAIK most other mainstream OSs also still often use similarly 'arcane' concepts (i'm here ignoring literature, you can find everything and its opposite suggested in literature) so i felt the need to point out the difference ;) After all Linux is about doing a better mainstream OS, it is not about beating the OS literature at lunacy ;-) The precise statement would be: "there's no concept of giving out a time-slice to a task and sticking to it unless a higher-prio task comes along, nor is there a concept of having a low-res granularity ->time_slice thing. There is accurate accounting of how much CPU time a task used up, and there is a granularity setting that together gives the current task a fairness advantage of a given amount of nanoseconds - which has similar [but not equivalent] effects to traditional timeslices that most mainstream OSs use". > Your new parameter sched_granularity_ns is equivalent to the concept > of time slice in most other kernels that I've peeked inside and > computing literature in general (going back over several decades e.g. > the magic garden). note that you can set it to 0 and the box still functions - so sched_granularity_ns, while useful for performance/bandwidth workloads, isnt truly inherent to the design. So in the announcement i just opted for a short sentence: "there's no concept of timeslices", albeit like most short sentences it's not a technically 100% accurate statement - but still it conveyed the intended information more effectively to the interested lkml reader than the longer version could ever have =B-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
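[ Editorial sketch: a rough illustration of the "accurate nanosecond accounting plus a granularity-sized fairness advantage" idea Ingo describes above. The structure, the field names and the 5ms default below are assumptions made for the example; this is not CFS's actual code. ]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical nanosecond-accounted tasks; not CFS's real structures. */
struct task {
	uint64_t used_ns;	/* CPU time this task has actually consumed */
};

static uint64_t sched_granularity_ns = 5000000;	/* illustrative 5ms default */

/* The running task keeps a "fairness advantage" of one granularity: it
 * is preempted only once the best waiting task trails it by more than
 * sched_granularity_ns of consumed CPU time. */
static bool should_preempt(const struct task *curr, const struct task *waiter)
{
	return curr->used_ns > waiter->used_ns + sched_granularity_ns;
}

int main(void)
{
	struct task curr = { .used_ns = 4000000 };	/* 4ms used so far */
	struct task waiter = { .used_ns = 0 };

	printf("5ms granularity: preempt=%d\n", should_preempt(&curr, &waiter));
	sched_granularity_ns = 0;	/* knob at 0: the box still "works"... */
	printf("0ns granularity: preempt=%d\n", should_preempt(&curr, &waiter));
	/* ...but every imbalance now triggers a switch, hence Peter's point
	 * in the next message about the context switch rate. */
	return 0;
}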
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 7:32 ` Ingo Molnar @ 2007-04-16 8:54 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-16 8:54 UTC (permalink / raw) To: Ingo Molnar Cc: Willy Tarreau, Pekka Enberg, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Peter Williams <pwil3058@bigpond.net.au> wrote: > >> One more quick comment. The claim that there is no concept of time >> slice in the new scheduler is only true in the sense of the rather >> arcane implementation of time slices extant in the O(1) scheduler. > > yeah. AFAIK most other mainstream OSs also still often use similarly > 'arcane' concepts (i'm here ignoring literature, you can find everything > and its opposite suggested in literature) so i felt the need to point > out the difference ;) After all Linux is about doing a better mainstream > OS, it is not about beating the OS literature at lunacy ;-) > > The precise statement would be: "there's no concept of giving out a > time-slice to a task and sticking to it unless a higher-prio task comes > along, I would have said "no concept of using time slices to implement nice" which always seemed strange to me. If it really does what you just said then a (malicious or otherwise) CPU intensive task that never sleeps, once it got the CPU, would completely hog the CPU. > nor is there a concept of having a low-res granularity > ->time_slice thing. There is accurate accounting of how much CPU time a > task used up, and there is a granularity setting that together gives the > current task a fairness advantage of a given amount of nanoseconds - > which has similar [but not equivalent] effects to traditional timeslices > that most mainstream OSs use". Most traditional OSes have more or less fixed time slices and do the scheduling by fiddling the dynamic priority. Using total CPU used will also come to grief when used for long-running tasks. Eventually, even very low bandwidth tasks will accumulate enough total CPU to look busy. The CPU bandwidth the task is using is what needs to be controlled. Or have I not looked closely enough at what sched_granularity_ns does? Is it really a control for the decay rate of a CPU usage bandwidth metric? > >> Your new parameter sched_granularity_ns is equivalent to the concept >> of time slice in most other kernels that I've peeked inside and >> computing literature in general (going back over several decades e.g. >> the magic garden). > > note that you can set it to 0 and the box still functions - so > sched_granularity_ns, while useful for performance/bandwidth workloads, > isnt truly inherent to the design. Just like my SPA schedulers. But if you set it to zero you'll get a fairly high context switch rate with associated overhead, won't you? > > So in the announcement i just opted for a short sentence: "there's no > concept of timeslices", albeit like most short stentences it's not a > technically 100% accurate statement - but still it conveyed the intended > information more effectively to the interested lkml reader than the > longer version could ever have =B-) I hope that I implied that I was being picky :-) (I meant to -- imply I was being picky, that is). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
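[ Editorial sketch: the distinction Peter draws above between lifetime CPU totals and recent CPU bandwidth, using a decayed average of burst length versus scheduling-cycle length in the spirit of his later description. The constants, decay factor and field names are invented for the illustration. ]

#include <stdio.h>
#include <stdint.h>

struct usage {
	uint64_t total_ns;	/* lifetime CPU consumed: only ever grows  */
	uint64_t avg_burst_ns;	/* decayed average length of on-CPU bursts */
	uint64_t avg_cycle_ns;	/* decayed average on-CPU-to-on-CPU period */
};

/* new_avg = 7/8 * old_avg + 1/8 * sample: a simple exponential decay */
static uint64_t decay(uint64_t avg, uint64_t sample)
{
	return (avg * 7 + sample) / 8;
}

static void account(struct usage *u, uint64_t burst_ns, uint64_t cycle_ns)
{
	u->total_ns     += burst_ns;
	u->avg_burst_ns  = decay(u->avg_burst_ns, burst_ns);
	u->avg_cycle_ns  = decay(u->avg_cycle_ns, cycle_ns);
}

int main(void)
{
	struct usage u = { 0, 0, 0 };
	long i;

	/* A low-bandwidth task: 50us of CPU out of every 10ms, for a long time. */
	for (i = 0; i < 1000000; i++)
		account(&u, 50000, 10000000);

	printf("lifetime total: %llu ns (eventually looks 'busy')\n",
	       (unsigned long long)u.total_ns);
	printf("recent bandwidth: %.2f%% of the CPU\n",
	       100.0 * (double)u.avg_burst_ns / (double)u.avg_cycle_ns);
	return 0;
}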
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 10:39 ` Pekka Enberg 2007-04-15 12:45 ` Willy Tarreau @ 2007-04-15 15:16 ` Gene Heskett 2007-04-15 16:43 ` Con Kolivas 1 sibling, 1 reply; 304+ messages in thread From: Gene Heskett @ 2007-04-15 15:16 UTC (permalink / raw) To: Pekka Enberg Cc: hui Bill Huey, Ingo Molnar, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Pekka Enberg wrote: >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: >> The perception here is that there is that there is this expectation that >> sections of the Linux kernel are intentionally "churn squated" to prevent >> any other ideas from creeping in other than of the owner of that subsytem > >Strangely enough, my perception is that Ingo is simply trying to >address the issues Mike's testing discovered in RDSL and SD. It's not >surprising Ingo made it a separate patch set as Con has repeatedly >stated that the "problems" are in fact by design and won't be fixed. I won't get into the middle of this just yet, not having decided which dog I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for about 24 hours, it's been generally usable, but gzip still causes lots of 5 to 10+ second lags when it's running. I'm coming to the conclusion that gzip simply doesn't play well with others... Amazing to me, the cpu it's using stays generally below 80%, and often below 60%, even while the kmail composer has a full sentence in its buffer that it still hasn't shown me when I switch to the htop screen to check, and back to the kmail screen to see if it's updated yet. The screen switch doesn't seem to lag so I don't think renicing x would be helpful. Those are the obvious lags, and I'll build & reboot to the CFS patch at some point this morning (what's left of it, that is :). And report in due time of course -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) knot in cables caused data stream to become twisted and kinked ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:16 ` Gene Heskett @ 2007-04-15 16:43 ` Con Kolivas 2007-04-15 16:58 ` Gene Heskett 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-15 16:43 UTC (permalink / raw) To: Gene Heskett Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 01:16, Gene Heskett wrote: > On Sunday 15 April 2007, Pekka Enberg wrote: > >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: > >> The perception here is that there is that there is this expectation that > >> sections of the Linux kernel are intentionally "churn squated" to > >> prevent any other ideas from creeping in other than of the owner of that > >> subsytem > > > >Strangely enough, my perception is that Ingo is simply trying to > >address the issues Mike's testing discovered in RDSL and SD. It's not > >surprising Ingo made it a separate patch set as Con has repeatedly > >stated that the "problems" are in fact by design and won't be fixed. > > I won't get into the middle of this just yet, not having decided which dog > I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for > about 24 hours, its been generally usable, but gzip still causes lots of 5 > to 10+ second lags when its running. I'm coming to the conclusion that > gzip simply doesn't play well with others... Actually Gene I think you're being bitten here by something I/O bound since the cpu usage never tops out. If that's the case and gzip is dumping truckloads of writes then you're suffering something that irks me even more than the scheduler in linux, and that's how much writes hurt just about everything else. Try your testcase with bzip2 instead (since that won't be i/o bound), or drop your dirty ratio to as low as possible which helps a little bit (5% is the minimum) echo 5 > /proc/sys/vm/dirty_ratio and finally try the braindead noop i/o scheduler as well. echo noop > /sys/block/sda/queue/scheduler (replace sda with your drive obviously). I'd wager a big one that's what causes your gzip pain. If it wasn't for the fact that I've decided to all but give up ever trying to provide code for mainline again, trying my best to make writes hurt less on linux would be my next big thing [tm]. Oh and for the others watching, (points to vm hackers) I found a bug when playing with the dirty ratio code. If you modify it to allow it drop below 5% but still above the minimum in the vm code, stalls happen somewhere in the vm where nothing much happens for sometimes 20 or 30 seconds worst case scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to be set ultra low because these stalls were gross. > Amazing to me, the cpu its using stays generally below 80%, and often below > 60%, even while the kmail composer has a full sentence in its buffer that > it still hasn't shown me when I switch to the htop screen to check, and > back to the kmail screen to see if its updated yet. The screen switch > doesn't seem to lag so I don't think renicing x would be helpfull. Those > are the obvious lags, and I'll build & reboot to the CFS patch at some > point this morning (whats left of it that is :). And report in due time of > course -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:43 ` Con Kolivas @ 2007-04-15 16:58 ` Gene Heskett 2007-04-15 18:00 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Gene Heskett @ 2007-04-15 16:58 UTC (permalink / raw) To: Con Kolivas Cc: Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Con Kolivas wrote: >On Monday 16 April 2007 01:16, Gene Heskett wrote: >> On Sunday 15 April 2007, Pekka Enberg wrote: >> >On 4/15/07, hui Bill Huey <billh@gnuppy.monkey.org> wrote: >> >> The perception here is that there is that there is this expectation >> >> that sections of the Linux kernel are intentionally "churn squated" to >> >> prevent any other ideas from creeping in other than of the owner of >> >> that subsytem >> > >> >Strangely enough, my perception is that Ingo is simply trying to >> >address the issues Mike's testing discovered in RDSL and SD. It's not >> >surprising Ingo made it a separate patch set as Con has repeatedly >> >stated that the "problems" are in fact by design and won't be fixed. >> >> I won't get into the middle of this just yet, not having decided which dog >> I should bet on yet. I've been running 2.6.21-rc6 + Con's 0.40 patch for >> about 24 hours, its been generally usable, but gzip still causes lots of 5 >> to 10+ second lags when its running. I'm coming to the conclusion that >> gzip simply doesn't play well with others... > >Actually Gene I think you're being bitten here by something I/O bound since >the cpu usage never tops out. If that's the case and gzip is dumping >truckloads of writes then you're suffering something that irks me even more >than the scheduler in linux, and that's how much writes hurt just about >everything else. Try your testcase with bzip2 instead (since that won't be >i/o bound), or drop your dirty ratio to as low as possible which helps a >little bit (5% is the minimum) > >echo 5 > /proc/sys/vm/dirty_ratio > >and finally try the braindead noop i/o scheduler as well. > >echo noop > /sys/block/sda/queue/scheduler > >(replace sda with your drive obviously). > >I'd wager a big one that's what causes your gzip pain. If it wasn't for the >fact that I've decided to all but give up ever trying to provide code for >mainline again, trying my best to make writes hurt less on linux would be my >next big thing [tm]. Chuckle, possibly but then I'm not anything even remotely close to an expert here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and profanity as the case may call for. >Oh and for the others watching, (points to vm hackers) I found a bug when >playing with the dirty ratio code. If you modify it to allow it drop below > 5% but still above the minimum in the vm code, stalls happen somewhere in > the vm where nothing much happens for sometimes 20 or 30 seconds worst case > scenario. I had to drop a patch in 2.6.19 that allowed the dirty ratio to > be set ultra low because these stalls were gross. I think I'd need a bit of tutoring on how to do that. I recall that one other time, several weeks back, I thought I would try one of those famous echo this >/proc/that ideas that went by on this list, but even though I was root, apparently /proc was read-only AFAIWC. 
>> Amazing to me, the cpu its using stays generally below 80%, and often >> below 60%, even while the kmail composer has a full sentence in its buffer >> that it still hasn't shown me when I switch to the htop screen to check, >> and back to the kmail screen to see if its updated yet. The screen switch >> doesn't seem to lag so I don't think renicing x would be helpfull. Those >> are the obvious lags, and I'll build & reboot to the CFS patch at some >> point this morning (whats left of it that is :). And report in due time >> of course And now I wonder if I applied the right patch. This one feels good ATM, but I don't think its the CFS thingy. No, I'm sure of it now, none of the patches I've saved say a thing about CFS. Backtrack up the list time I guess, ignore me for the nonce. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Microsoft: Re-inventing square wheels -- From a Slashdot.org post ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:58 ` Gene Heskett @ 2007-04-15 18:00 ` Mike Galbraith 2007-04-16 0:18 ` Gene Heskett 0 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 18:00 UTC (permalink / raw) To: Gene Heskett Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote: > Chuckle, possibly but then I'm not anything even remotely close to an expert > here Con, just reporting what I get. And I just rebooted to 2.6.21-rc6 + > sched-mike-5.patch for grins and giggles, or frowns and profanity as the case > may call for. Erm, that patch is embarrassingly buggy, so profanity should dominate. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 18:00 ` Mike Galbraith @ 2007-04-16 0:18 ` Gene Heskett 0 siblings, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-16 0:18 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Pekka Enberg, hui Bill Huey, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sunday 15 April 2007, Mike Galbraith wrote: >On Sun, 2007-04-15 at 12:58 -0400, Gene Heskett wrote: >> Chuckle, possibly but then I'm not anything even remotely close to an >> expert here Con, just reporting what I get. And I just rebooted to >> 2.6.21-rc6 + sched-mike-5.patch for grins and giggles, or frowns and >> profanity as the case may call for. > >Erm, that patch is embarrassingly buggy, so profanity should dominate. > > -Mike Chuckle, ROTFLMAO even. I didn't run it that long as I immediately rebuilt and rebooted when I found I'd used the wrong patch, and in fact had tested that one and found it sub-optimal before I'd built and ran Con's -0.40 version. As for bugs of the type that make it to the screen or logs, I didn't see any. OTOH, my eyesight is slowly going downhill, now 20/25. It was 20/10 30 years ago. Now thats reason for profanity... -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Unix weanies are as bad at this as anyone. -- Larry Wall in <199702111730.JAA28598@wall.org> ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 5:16 ` Bill Huey 2007-04-15 8:44 ` Ingo Molnar @ 2007-04-15 16:11 ` Bernd Eckenfels 1 sibling, 0 replies; 304+ messages in thread From: Bernd Eckenfels @ 2007-04-15 16:11 UTC (permalink / raw) To: linux-kernel In article <20070415051645.GA28438@gnuppy.monkey.org> you wrote: > A development process like this is likely to exclude smart people from wanting > to contribute to Linux and folks should be conscious about this issues. Nobody is excluded, you can always have a next iteration. Gruss Bernd ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` Con Kolivas 2007-04-15 5:16 ` Bill Huey @ 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 8:36 ` Bill Huey 2007-04-17 0:06 ` Peter Williams 2007-04-15 15:05 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 6:43 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Peter Williams, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote: > On Saturday 14 April 2007 06:21, Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > [CFS] > > > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > > > This project is a complete rewrite of the Linux task scheduler. My goal > > is to address various feature requests and to fix deficiencies in the > > vanilla scheduler that were suggested/found in the past few years, both > > for desktop scheduling and for server scheduling workloads. > > The casual observer will be completely confused by what on earth has happened > here so let me try to demystify things for them. [...] Demystify what? The casual observer need only read either your attempt at writing a scheduler, or my attempts at fixing the one we have, to see that it was high time for someone with the necessary skills to step in. Now progress can happen, which was _not_ happening before. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 6:43 ` Mike Galbraith @ 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Bill Huey @ 2007-04-15 8:36 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > [...] > > Demystify what? The casual observer need only read either your attempt Here's the problem. You're a casual observer and obviously not paying attention. > at writing a scheduler, or my attempts at fixing the one we have, to see > that it was high time for someone with the necessary skills to step in. > Now progress can happen, which was _not_ happening before. I think that's inaccurate and there are plenty of folks that have that technical skill and background. The scheduler code isn't a deep mystery and there are plenty of good kernel hackers out here across many communities. Ingo isn't the only person on this planet to have deep scheduler knowledge. Priority heaps are not new and Solaris has had a pluggable scheduler framework for years. Con's characterization is something that I'm more prone to believe about how Linux kernel development works versus your view. I think it's a great shame to have folks like Bill Irwin and Con waste time trying to do something right only to have their ideas attacked, then copied and held up as the solution for this kind of technical problem in a complete reversal of technical opinion as it suits the moment. This is just wrong in so many ways. It outlines the problems with Linux kernel development and questionable elitism regarding ownership of certain sections of the kernel code. I call it "churn squat" and instances like this only support that view, which I would rather turn out to be completely wrong and inaccurate instead. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey @ 2007-04-15 8:45 ` Mike Galbraith 2007-04-15 9:06 ` Ingo Molnar 2007-04-15 16:25 ` Arjan van de Ven 2 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-15 8:45 UTC (permalink / raw) To: Bill Huey Cc: Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Sun, 2007-04-15 at 01:36 -0700, Bill Huey wrote: > On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > > [...] > > > > Demystify what? The casual observer need only read either your attempt > > Here's the problem. You're a casual observer and obviously not paying > attention. > > > at writing a scheduler, or my attempts at fixing the one we have, to see > > that it was high time for someone with the necessary skills to step in. > > Now progress can happen, which was _not_ happening before. > > I think that's inaccurate and there are plenty of folks that have that > technical skill and background. The scheduler code isn't a deep mystery > and there are plenty of good kernel hackers out here across many > communities. Ingo isn't the only person on this planet to have deep > scheduler knowledge. Ok <shrug>, I'm not paying attention, and you can't read. We're even. Have a nice life. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith @ 2007-04-15 9:06 ` Ingo Molnar 2007-04-16 10:00 ` Ingo Molnar 2007-04-15 16:25 ` Arjan van de Ven 2 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 9:06 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner * Bill Huey <billh@gnuppy.monkey.org> wrote: > On Sun, Apr 15, 2007 at 08:43:04AM +0200, Mike Galbraith wrote: > > [...] > > > > Demystify what? The casual observer need only read either your > > attempt > > Here's the problem. You're a casual observer and obviously not paying > attention. guys, please calm down. Judging by the number of contributions to sched.c the main folks who are not 'observers' here and who thus have an unalienable right to be involved in a nasty flamewar about scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else is just a happy bystander, ok? ;-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 9:06 ` Ingo Molnar @ 2007-04-16 10:00 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 10:00 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Peter Williams, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > guys, please calm down. Judging by the number of contributions to > sched.c the main folks who are not 'observers' here and who thus have > an unalienable right to be involved in a nasty flamewar about > scheduler interactivity are Con, Mike, Nick and me ;-) Everyone else > is just a happy bystander, ok? ;-) just to make sure: this is a short (and incomplete) list of contributors related to scheduler interactivity code. The full list of contributors to sched.c includes many other people as well: Peter, Suresh, Christoph, Kenneth and many others. Even the git logs, which only span 2 years out of 15, already list 79 individual contributors to kernel/sched.c. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 8:36 ` Bill Huey 2007-04-15 8:45 ` Mike Galbraith 2007-04-15 9:06 ` Ingo Molnar @ 2007-04-15 16:25 ` Arjan van de Ven 2007-04-16 5:36 ` Bill Huey 2 siblings, 1 reply; 304+ messages in thread From: Arjan van de Ven @ 2007-04-15 16:25 UTC (permalink / raw) To: Bill Huey Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Thomas Gleixner > It outlines the problems with Linux kernel development and questionable > elistism regarding ownership of certain sections of the kernel code. I have to step in and disagree here.... Linux is not about who writes the code. Linux is about getting the best solution for a problem. Who wrote which line of the code is irrelevant in the big picture. That often means that multiple implementations happen, and that a darwinistic process decides that the best solution wins. This darwinistic process often happens in the form of discussion, and that discussion can happen with words or with code. In this case it happened with a code proposal. To make this specific: it has happened many times to me that when I solved an issue with code, someone else stepped in and wrote a different solution (although that was usually for smaller pieces). Was I upset about that? No! I was happy because my *problem got solved* in the best possible way. Now this doesn't mean that people shouldn't be nice to each other, not cooperate or steal credits, but I don't get the impression that that is happening here. Ingo is taking part in the discussion with a counter proposal for discussion *on the mailing list*. What more do you want?? If you or anyone else can improve it or do better, take part in this discussion and show what you mean either in words or in code. Your qualification of the discussion as an elitist takeover... I disagree with that. It's a *discussion*. Now if you agree that Ingo's patch is better technically, you and others should be happy about that because your problem is getting solved better. If you don't agree that his patch is better technically, take part in the technical discussion. -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 16:25 ` Arjan van de Ven @ 2007-04-16 5:36 ` Bill Huey 2007-04-16 6:17 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Bill Huey @ 2007-04-16 5:36 UTC (permalink / raw) To: Arjan van de Ven Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Thomas Gleixner, Bill Huey (hui) On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote: > Now this doesn't mean that people shouldn't be nice to each other, not > cooperate or steal credits, but I don't get the impression that that is > happening here. Ingo is taking part in the discussion with a counter > proposal for discussion *on the mailing list*. What more do you want?? Con should have been CCed from the first moment this was put into motion to limit the perception of exclusion. That was mistake number one and a big-time failure to understand this dynamic. After all, it was Con's idea. Why the hell he was excluded from Ingo's development process is baffling to me and him (most likely). He put a lot of effort into SD and his experiences with scheduling should still be seriously considered in this development process even if he doesn't write a single line of code from this moment on. What should have happened is that our very busy associate at RH by the name of Ingo Molnar should have leveraged more of Con's and Bill's work and used them as a proxy for his own ideas. They would have loved to have contributed more and our very busy Ingo Molnar would have gotten a lot of his work and ideas implemented without him even opening a single source file for editing. They would have happily done this work for Ingo. Ingo could have been used for something else more important like making KVM less of a freaking ugly hack and we all would have benefitted from this. He could have been working on SystemTap so that you stop losing accounts to Sun and Solaris 10's DTrace. He could have been working with Riel to fix your butt ugly page scanning problem causing horrible contention via the Clock/Pro algorithm, etc... He could have been fixing the ugly futex rwsem mapping problem that's killing -rt and anything that uses Posix threads. He could have created a userspace thread control block (TCB) with Mr. Drepper so that we can turn off preemption in userspace (userspace per CPU local storage) and implement a very quick non-kernel crossing implementation of priority ceilings (userspace check for priority and flags at preempt_schedule() in the TCB) so that our -rt Posix API doesn't suck donkey shit... Need I say more? As programmers like Ingo get spread more thinly, he needs super smart folks like Bill Irwin and Con to help him out, and he needs to learn to resist NIHing folks' stuff out of some weird fear. When this happens, folks like Ingo must learn to "facilitate" development in addition to implementing it with those kinds of folks. It takes time and practice to learn to entrust folks to do things for him. Ingo is the best method of getting new Linux kernel ideas communicated to Linus. His value goes beyond just code; he is often the biggest hammer we have in the Linux community to get stuff into the kernel. "Facilitation" of others is something that solo programmers need when groups like the Linux kernel get larger and larger every year. Understand? Are we in embarrassing agreement here? bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:36 ` Bill Huey @ 2007-04-16 6:17 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-16 6:17 UTC (permalink / raw) To: Bill Huey Cc: Arjan van de Ven, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner On Sun, Apr 15, 2007 at 10:36:29PM -0700, Bill Huey wrote: > On Sun, Apr 15, 2007 at 09:25:07AM -0700, Arjan van de Ven wrote: > > Now this doesn't mean that people shouldn't be nice to each other, not > > cooperate or steal credits, but I don't get the impression that that is > > happening here. Ingo is taking part in the discussion with a counter > > proposal for discussion *on the mailing list*. What more do you want?? > > Con should have been CCed from the first moment this was put into motion > to limit the perception of exclusion. That was mistake number one and big > time failures to understand this dynamic. After it was Con's idea. Why > the hell he was excluded from Ingo's development process is baffling to > me and him (most likely). Ingo's scheduler is completely different to any I've seen proposed for Linux. And after he did an initial implementation, he did post it to everyone. Maybe something he said offended someone, but the process followed is exactly how Linux kernel development works (ie. if you think you can do better, then write the code). Sometimes you can give suggestions, but other times if you come up with a different idea then it is better just to do it yourself. Con's code is still out there. If it is better than Ingo's then it should win out. Nobody has a monopoly on schedulers or ideas or posting patches. > He put int a lot of effort into SDL and his experiences with scheduling > should still be seriously considered in this development process even if > he doesn't write a single line of code from this moment on. > > What should have happened is that our very busy associate at RH by the > name of Ingo Molnar should have leverage more of Con's and Bill's work > and use them as a proxy for his own ideas. They would have loved to have > contributed more and our very busy Ingo Molnar would have gotten a lot > of his work and ideas implemented without him even opening a single > source file for editting. They would have happily done this work for > Ingo. Ingo could have been used for something else more important like > making KVM less of a freaking ugly hack and we all would have benefitted > from this. > > He could have been working on SystemTap so that you stop losing accounts > to Sun and Solaris 10's Dtrace. He could have been working with Riel to > fix your butt ugly page scanning problem causing horrible contention via > the Clock/Pro algorithm, etc... He could have been fixing the ugly futex > rwsem mapping problem that's killing -rt and anything that uses Posix > threads. He could have created a userspace thread control block (TCB) > with Mr. Drepper so that we can turn off preemption in userspace > (userspace per CPU local storage) and implement a very quick non-kernel > crossing implementation of priority ceilings (userspace check for priority > and flags at preempt_schedule() in the TCB) so that our -rt Posix API > doesn't suck donkey shit... Need I say more ? Well that's some pretty strong criticism of Linux and of someone who does a great deal to improve things... Let's stick to the topic of schedulers in this thread and try keeping it constructive. 
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 6:43 ` Mike Galbraith 2007-04-15 8:36 ` Bill Huey @ 2007-04-17 0:06 ` Peter Williams 2007-04-17 2:29 ` Mike Galbraith 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 0:06 UTC (permalink / raw) To: Mike Galbraith Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner Mike Galbraith wrote: > On Sun, 2007-04-15 at 13:27 +1000, Con Kolivas wrote: >> On Saturday 14 April 2007 06:21, Ingo Molnar wrote: >>> [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler >>> [CFS] >>> >>> i'm pleased to announce the first release of the "Modular Scheduler Core >>> and Completely Fair Scheduler [CFS]" patchset: >>> >>> http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch >>> >>> This project is a complete rewrite of the Linux task scheduler. My goal >>> is to address various feature requests and to fix deficiencies in the >>> vanilla scheduler that were suggested/found in the past few years, both >>> for desktop scheduling and for server scheduling workloads. >> The casual observer will be completely confused by what on earth has happened >> here so let me try to demystify things for them. > > [...] > > Demystify what? The casual observer need only read either your attempt > at writing a scheduler, or my attempts at fixing the one we have, to see > that it was high time for someone with the necessary skills to step in. Make that "someone with the necessary clout". > Now progress can happen, which was _not_ happening before. > This is true. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 0:06 ` Peter Williams @ 2007-04-17 2:29 ` Mike Galbraith 2007-04-17 3:40 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-17 2:29 UTC (permalink / raw) To: Peter Williams Cc: Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Arjan van de Ven, Thomas Gleixner On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > Mike Galbraith wrote: > > > > Demystify what? The casual observer need only read either your attempt > > at writing a scheduler, or my attempts at fixing the one we have, to see > > that it was high time for someone with the necessary skills to step in. > > Make that "someone with the necessary clout". No, I was brutally honest to both of us, but quite correct. > > Now progress can happen, which was _not_ happening before. > > > > This is true. Yup, and progress _is_ happening now, quite rapidly. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 2:29 ` Mike Galbraith @ 2007-04-17 3:40 ` Nick Piggin 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 4:17 ` Peter Williams 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 3:40 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > > Mike Galbraith wrote: > > > > > > Demystify what? The casual observer need only read either your attempt > > > at writing a scheduler, or my attempts at fixing the one we have, to see > > > that it was high time for someone with the necessary skills to step in. > > > > Make that "someone with the necessary clout". > > No, I was brutally honest to both of us, but quite correct. > > > > Now progress can happen, which was _not_ happening before. > > > > > > > This is true. > > Yup, and progress _is_ happening now, quite rapidly. Progress as in progress on Ingo's scheduler. I still don't know how we'd decide when to replace the mainline scheduler or with what. I don't think we can say Ingo's is better than the alternatives, can we? If there is some kind of bakeoff, then I'd like one of Con's designs to be involved, and mine, and Peter's... Maybe the progress is that more key people are becoming open to the idea of changing the scheduler. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:40 ` Nick Piggin @ 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 4:14 ` Nick Piggin 2007-04-20 20:36 ` Bill Davidsen 2007-04-17 4:17 ` Peter Williams 1 sibling, 2 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-17 4:01 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > > Yup, and progress _is_ happening now, quite rapidly. > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > decide when to replace the mainline scheduler or with what. > > I don't think we can say Ingo's is better than the alternatives, can we? No, that would require massive performance testing of all alternatives. > If there is some kind of bakeoff, then I'd like one of Con's designs to > be involved, and mine, and Peter's... The trouble with a bakeoff is that it's pretty darn hard to get people to test in the first place, and then comes weighting the subjective and hard performance numbers. If they're close in numbers, do you go with the one which starts the least flamewars or what? > Maybe the progress is that more key people are becoming open to the idea > of changing the scheduler. Could be. All was quiet for quite a while, but when RSDL showed up, it aroused enough interest to show that scheduling woes is on folks radar. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:01 ` Mike Galbraith @ 2007-04-17 4:14 ` Nick Piggin 2007-04-17 6:26 ` Peter Williams 2007-04-17 9:51 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 4:14 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 06:01:29AM +0200, Mike Galbraith wrote: > On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: > > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > > > > Yup, and progress _is_ happening now, quite rapidly. > > > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > > decide when to replace the mainline scheduler or with what. > > > > I don't think we can say Ingo's is better than the alternatives, can we? > > No, that would require massive performance testing of all alternatives. > > > If there is some kind of bakeoff, then I'd like one of Con's designs to > > be involved, and mine, and Peter's... > > The trouble with a bakeoff is that it's pretty darn hard to get people > to test in the first place, and then comes weighting the subjective and > hard performance numbers. If they're close in numbers, do you go with > the one which starts the least flamewars or what? I don't know how you'd do it. I know you wouldn't count people telling you how good they are (getting people to tell you how bad they are, and whether others do better in a given situation might be slightly more viable). But we have to choose somehow. I'd hope that is going to be based solely on the results and technical properties of the code, so... if we were to somehow determine that the results are exactly the same, we'd go for the simpler one, wouldn't we? > > Maybe the progress is that more key people are becoming open to the idea > > of changing the scheduler. > > Could be. All was quiet for quite a while, but when RSDL showed up, it > aroused enough interest to show that scheduling woes is on folks radar. Well I know people have had woes with the scheduler for ever (I guess that isn't going to change :P). I think people generally lost a bit of interest in trying to improve the situation because of the upstream problem. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:14 ` Nick Piggin @ 2007-04-17 6:26 ` Peter Williams 2007-04-17 9:51 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 6:26 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > Well I know people have had woes with the scheduler for ever (I guess that > isn't going to change :P). I think people generally lost a bit of interest > in trying to improve the situation because of the upstream problem. Yes. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:14 ` Nick Piggin 2007-04-17 6:26 ` Peter Williams @ 2007-04-17 9:51 ` Ingo Molnar 2007-04-17 13:44 ` Peter Williams 2007-04-20 20:47 ` Bill Davidsen 1 sibling, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:51 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Peter Williams, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > > Maybe the progress is that more key people are becoming open to > > > the idea of changing the scheduler. > > > > Could be. All was quiet for quite a while, but when RSDL showed up, > > it aroused enough interest to show that scheduling woes is on folks > > radar. > > Well I know people have had woes with the scheduler for ever (I guess > that isn't going to change :P). [...] yes, that part isnt going to change, because the CPU is a _scarce resource_ that is perhaps the most frequently overcommitted physical computer resource in existence, and because the kernel does not (yet) track eye movements of humans to figure out which tasks are more important to them. So critical human constraints are unknown to the scheduler and thus complaints will always come. The upstream scheduler thought it had enough information: the sleep average. So now the attempt is to go back and _simplify_ the scheduler and remove that information, and concentrate on getting fairness precisely right. The magic thing about 'fairness' is that it's a pretty good default policy if we decide that we simply do not have enough information to make an intelligent choice. ( Let's be cautious though: the jury is still out whether people actually like this more than the current approach. While CFS feedback looks promising after a whopping 3 days of it being released [ ;-) ], the test coverage of all 'fairness centric' schedulers, even considering years of availability, is less than 1% i'm afraid, and that < 1% was mostly self-selecting. ) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:51 ` Ingo Molnar @ 2007-04-17 13:44 ` Peter Williams 2007-04-17 23:00 ` Michael K. Edwards 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 13:44 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Nick Piggin <npiggin@suse.de> wrote: > >>>> Maybe the progress is that more key people are becoming open to >>>> the idea of changing the scheduler. >>> Could be. All was quiet for quite a while, but when RSDL showed up, >>> it aroused enough interest to show that scheduling woes is on folks >>> radar. >> Well I know people have had woes with the scheduler for ever (I guess >> that isn't going to change :P). [...] > > yes, that part isnt going to change, because the CPU is a _scarce > resource_ that is perhaps the most frequently overcommitted physical > computer resource in existence, and because the kernel does not (yet) > track eye movements of humans to figure out which tasks are more > important them. So critical human constraints are unknown to the > scheduler and thus complaints will always come. > > The upstream scheduler thought it had enough information: the sleep > average. So now the attempt is to go back and _simplify_ the scheduler > and remove that information, and concentrate on getting fairness > precisely right. The magic thing about 'fairness' is that it's a pretty > good default policy if we decide that we simply have not enough > information to do an intelligent choice. > > ( Lets be cautious though: the jury is still out whether people actually > like this more than the current approach. While CFS feedback looks > promising after a whopping 3 days of it being released [ ;-) ], the > test coverage of all 'fairness centric' schedulers, even considering > years of availability is less than 1% i'm afraid, and that < 1% was > mostly self-selecting. ) At this point I'd like to make the observation that spa_ebs is a very fair scheduler if you consider "nice" to be an indication of the relative entitlement of tasks to CPU bandwidth. It works by mapping nice to shares using a function very similar to the one for calculating p->load_weight except it's not offset by the RT priorities as RT is handled separately. In theory, a runnable task's entitlement to CPU bandwidth at any time is the ratio of its shares to the total shares held by runnable tasks on the same CPU (in reality, a smoothed average of this sum is used to make scheduling smoother). The dynamic priorities of the runnable tasks are then fiddled to try to keep each task's CPU bandwidth usage in proportion to its entitlement. That's the theory anyway. The actual implementation looks a bit different due to efficiency considerations. The modifications to the above theory boil down to keeping a running measure of the (recent) highest CPU bandwidth use per share for tasks running on the CPU -- I call this the yardstick for this CPU. When it's time to put a task on the run queue, its dynamic priority is determined by comparing its CPU bandwidth per share value with the yardstick for its CPU. If it's greater than the yardstick, this value becomes the new yardstick and the task gets given the lowest possible dynamic priority (for its scheduling class).
If the value is zero, it gets the highest possible priority (for its scheduling class), which would be MAX_RT_PRIO for a SCHED_OTHER task. Otherwise it gets given a priority between these two extremes proportional to the ratio of its CPU bandwidth per share value and the yardstick. Quite simple really. The other way in which the code deviates from the original is that (for a few years now) I no longer calculate CPU bandwidth usage directly. I've found that the overhead is less if I keep a running average of the size of a task's CPU bursts and the length of its scheduling cycle (i.e. from on CPU one time to on CPU next time) and use the ratio of these values as a measure of bandwidth usage. Anyway, it works and gives very predictable allocations of CPU bandwidth based on nice. Another good feature is that (in this pure form) it's starvation free. However, if you fiddle with it and do things like giving bonus priority boosts to interactive tasks it becomes susceptible to starvation. This can be fixed by using an anti-starvation mechanism such as SPA's promotion scheme, and that's what spa_ebs does. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
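[ Editorial sketch: the yardstick scheme Peter describes in the preceding message. The priority range, the type names and the ratio arithmetic below are illustrative assumptions, not his actual spa_ebs code. ]

#include <stdint.h>
#include <stdio.h>

#define PRIO_BEST	100	/* e.g. MAX_RT_PRIO for a SCHED_OTHER task */
#define PRIO_WORST	139	/* lowest priority in the class */

struct cpu_rq { uint64_t yardstick; };			/* per-CPU running maximum   */
struct task   { uint64_t bw_per_share; int prio; };	/* recent bandwidth / shares */

static void set_dynamic_prio(struct cpu_rq *rq, struct task *p)
{
	if (p->bw_per_share == 0) {
		p->prio = PRIO_BEST;			/* used nothing recently */
	} else if (p->bw_per_share >= rq->yardstick) {
		rq->yardstick = p->bw_per_share;	/* heaviest user sets the bar */
		p->prio = PRIO_WORST;
	} else {
		/* interpolate between the extremes in proportion to the ratio */
		p->prio = PRIO_BEST + (int)((PRIO_WORST - PRIO_BEST) *
					    p->bw_per_share / rq->yardstick);
	}
}

int main(void)
{
	struct cpu_rq rq = { .yardstick = 0 };
	struct task hog = { .bw_per_share = 800 };
	struct task light = { .bw_per_share = 80 };

	set_dynamic_prio(&rq, &hog);	/* becomes the yardstick, gets 139 */
	set_dynamic_prio(&rq, &light);	/* 10% of the yardstick, ends up near 100 */
	printf("hog prio=%d light prio=%d\n", hog.prio, light.prio);
	return 0;
}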
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:44 ` Peter Williams @ 2007-04-17 23:00 ` Michael K. Edwards 2007-04-17 23:07 ` William Lee Irwin III 2007-04-18 2:39 ` Peter Williams 0 siblings, 2 replies; 304+ messages in thread From: Michael K. Edwards @ 2007-04-17 23:00 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > The other way in which the code deviates from the original as that (for > a few years now) I no longer calculated CPU bandwidth usage directly. > I've found that the overhead is less if I keep a running average of the > size of a tasks CPU bursts and the length of its scheduling cycle (i.e. > from on CPU one time to on CPU next time) and using the ratio of these > values as a measure of bandwidth usage. > > Anyway it works and gives very predictable allocations of CPU bandwidth > based on nice. Works, that is, right up until you add nonlinear interactions with CPU speed scaling. From my perspective as an embedded platform integrator, clock/voltage scaling is the elephant in the scheduler's living room. Patch in DPM (now OpPoint?) to scale the clock based on what task is being scheduled, and suddenly the dynamic priority calculations go wild. Nip this in the bud by putting an RT priority on the relevant threads (which you have to do anyway if you need remotely audio-grade latency), and the lock affinity heuristics break, so you have to hand-tune all the thread priorities. Blecch. Not to mention the likelihood that the task whose clock speed you're trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority than the application. (You want to crank the CPU for this task because it runs with the RF hot, which may cost you as much power as the rest of the platform.) You'd better hope you can remove it from the dynamic priority heuristics with SCHED_BATCH. Otherwise everything _else_ has to be RT priority (or it'll be starved by the soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double blecch! Is it too much to ask for someone with actual engineering training (not me, unfortunately) to sit down and build a negative-feedback control system that handles soft-real-time _and_ dynamic-priority _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock scaling? And actually separates the accounting and control mechanisms from the heuristics, so the latter can be tuned (within a well documented stable range) to reflect the expected system usage patterns? It's not like there isn't a vast literature in this area over the past decade, including some dealing specifically with clock scaling consistent with low-latency applications. It's a pity that people doing academic work in this area rarely wade into LKML, even when they're hacking on a Linux fork. But then, there's not much economic incentive for them to do so, and they can usually get their fill of citation politics and dominance games without leaving their home department. :-P Seriously, though. If you're really going to put the mainline scheduler through this kind of churn, please please pretty please knit in per-task clock scaling (possibly even rejigged during the slice; see e. g. 
Yuan and Nahrstedt's GRACE-OS papers) and some sort of linger mechanism to keep from taking context switch hits when you're confident that an I/O will complete quickly. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:00 ` Michael K. Edwards @ 2007-04-17 23:07 ` William Lee Irwin III 2007-04-17 23:52 ` Michael K. Edwards 2007-04-18 2:39 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 23:07 UTC (permalink / raw) To: Michael K. Edwards Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:00:53PM -0700, Michael K. Edwards wrote: > Works, that is, right up until you add nonlinear interactions with CPU > speed scaling. From my perspective as an embedded platform > integrator, clock/voltage scaling is the elephant in the scheduler's > living room. Patch in DPM (now OpPoint?) to scale the clock based on > what task is being scheduled, and suddenly the dynamic priority > calculations go wild. Nip this in the bud by putting an RT priority > on the relevant threads (which you have to do anyway if you need > remotely audio-grade latency), and the lock affinity heuristics break, > so you have to hand-tune all the thread priorities. Blecch. [...not terribly enlightening stuff trimmed...] The ongoing scheduler work is on a much more basic level than these affairs I'm guessing you googled. When the basics work as intended it will be possible to move on to more advanced issues. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:07 ` William Lee Irwin III @ 2007-04-17 23:52 ` Michael K. Edwards 2007-04-18 0:36 ` Bill Huey 0 siblings, 1 reply; 304+ messages in thread From: Michael K. Edwards @ 2007-04-17 23:52 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote: > The ongoing scheduler work is on a much more basic level than these > affairs I'm guessing you googled. When the basics work as intended it > will be possible to move on to more advanced issues. OK, let me try this in smaller words for people who can't tell bitter experience from Google hits. CPU clock scaling for power efficiency is already the only thing that matters about the Linux scheduler in my world, because battery-powered device vendors in their infinite wisdom are abandoning real RTOSes in favor of Linux now that WiFi is the "in" thing (again). And on the timescale that anyone will actually be _using_ this shiny new scheduler of Ingo's, it'll be nearly the only thing that matters about the Linux scheduler in anyone's world, because the amount of work the CPU can get done in a given minute will depend mostly on how intelligently it can spend its heat dissipation budget. Clock scaling schemes that aren't integral to the scheduler design make a bad situation (scheduling embedded loads with shotgun heuristics tuned for desktop CPUs) worse, because the opaque heuristics are now being applied to distorted data. Add a "smoothing" scheme for the distorted data, and you may find that you have introduced an actual control-path instability. A small fluctuation in the data (say, two bursts of interrupt traffic at just the right interval) can result in a long-lasting oscillation in some task's "dynamic priority" -- and, on a fully loaded CPU, in the time that task actually gets. If anything else depends on how much work this task gets done each time around, the oscillation can easily propagate throughout the system. Thrash city. (If you haven't seen this happen on real production systems under what shouldn't be a pathological load, you haven't been around long. The classic mechanisms that triggered oscillations in, say, early SMP Solaris boxes haven't bitten recently, perhaps because most modern CPUs don't lose their marbles so comprehensively on context switch. But I got to live this nightmare again recently on ARM Linux, due to some impressively broken application-level threading/locking "design", whose assumptions about scheduler behavior got broken when I switched to an NPTL toolchain.) I don't have the training to design a scheduler that isn't vulnerable to control-feedback oscillations. Neither do you, if you haven't taken (and excelled at) a control theory course, which nowadays seems to be taught by applied math and ECE departments and too often skipped by CS types. But I can recognize an impending train wreck when I see it. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:52 ` Michael K. Edwards @ 2007-04-18 0:36 ` Bill Huey 0 siblings, 0 replies; 304+ messages in thread From: Bill Huey @ 2007-04-18 0:36 UTC (permalink / raw) To: Michael K. Edwards Cc: William Lee Irwin III, Peter Williams, Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner, Bill Huey (hui) On Tue, Apr 17, 2007 at 04:52:08PM -0700, Michael K. Edwards wrote: > On 4/17/07, William Lee Irwin III <wli@holomorphy.com> wrote: > >The ongoing scheduler work is on a much more basic level than these > >affairs I'm guessing you googled. When the basics work as intended it > >will be possible to move on to more advanced issues. ... Will probably shouldn't have dismissed your points, but he probably means that you can't even get at this stuff until the fundamentals are in place. > Clock scaling schemes that aren't integral to the scheduler design > make a bad situation (scheduling embedded loads with shotgun > heuristics tuned for desktop CPUs) worse, because the opaque > heuristics are now being applied to distorted data. Add a "smoothing" > scheme for the distorted data, and you may find that you have > introduced an actual control-path instability. A small fluctuation in > the data (say, two bursts of interrupt traffic at just the right > interval) can result in a long-lasting oscillation in some task's > "dynamic priority" -- and, on a fully loaded CPU, in the time that > task actually gets. If anything else depends on how much work this > task gets done each time around, the oscillation can easily propagate > throughout the system. Thrash city. Hyperthreading issues are quite similar to clock scaling issues. Con's infrastructure changes to move things in that direction were rejected, as well as other infrastructure changes, further infuriating Con and leading him to drop development on RSDL and derivatives. bill ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:00 ` Michael K. Edwards 2007-04-17 23:07 ` William Lee Irwin III @ 2007-04-18 2:39 ` Peter Williams 1 sibling, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-18 2:39 UTC (permalink / raw) To: Michael K. Edwards Cc: Ingo Molnar, Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Michael K. Edwards wrote: > On 4/17/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >> The other way in which the code deviates from the original as that (for >> a few years now) I no longer calculated CPU bandwidth usage directly. >> I've found that the overhead is less if I keep a running average of the >> size of a tasks CPU bursts and the length of its scheduling cycle (i.e. >> from on CPU one time to on CPU next time) and using the ratio of these >> values as a measure of bandwidth usage. >> >> Anyway it works and gives very predictable allocations of CPU bandwidth >> based on nice. > > Works, that is, right up until you add nonlinear interactions with CPU > speed scaling. From my perspective as an embedded platform > integrator, clock/voltage scaling is the elephant in the scheduler's > living room. Patch in DPM (now OpPoint?) to scale the clock based on > what task is being scheduled, and suddenly the dynamic priority > calculations go wild. Nip this in the bud by putting an RT priority > on the relevant threads (which you have to do anyway if you need > remotely audio-grade latency), and the lock affinity heuristics break, > so you have to hand-tune all the thread priorities. Blecch. > > Not to mention the likelihood that the task whose clock speed you're > trying to crank up (say, a WiFi soft MAC) needs to be _lower_ priority > than the application. (You want to crank the CPU for this task > because it runs with the RF hot, which may cost you as much power as > the rest of the platform.) You'd better hope you can remove it from > the dynamic priority heuristics with SCHED_BATCH. Otherwise > everything _else_ has to be RT priority (or it'll be starved by the > soft MAC) and you've basically tossed SCHED_NORMAL in the bin. Double > blecch! > > Is it too much to ask for someone with actual engineering training > (not me, unfortunately) to sit down and build a negative-feedback > control system that handles soft-real-time _and_ dynamic-priority > _and_ batch loads, CPU _and_ I/O scheduling, preemption _and_ clock > scaling? And actually separates the accounting and control mechanisms > from the heuristics, so the latter can be tuned (within a well > documented stable range) to reflect the expected system usage > patterns? > > It's not like there isn't a vast literature in this area over the past > decade, including some dealing specifically with clock scaling > consistent with low-latency applications. It's a pity that people > doing academic work in this area rarely wade into LKML, even when > they're hacking on a Linux fork. But then, there's not much economic > incentive for them to do so, and they can usually get their fill of > citation politics and dominance games without leaving their home > department. :-P > > Seriously, though. If you're really going to put the mainline > scheduler through this kind of churn, please please pretty please knit > in per-task clock scaling (possibly even rejigged during the slice; > see e. g. 
Yuan and Nahrstedt's GRACE-OS papers) and some sort of > linger mechanism to keep from taking context switch hits when you're > confident that an I/O will complete quickly. I think that this doesn't effect the basic design principles of spa_ebs but just means that the statistics that it uses need to be rethought. E.g. instead of measuring average CPU usage per burst in terms of wall clock time spent on the CPU measure it in terms of CPU capacity (for the want of a better word) used per burst. I don't have suitable hardware for investigating this line of attack further, unfortunately, and have no idea what would be the best way to calculate this new statistic. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
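The burst-ratio statistic Peter describes (a decaying average of on-CPU time per burst divided by a decaying average of the on-CPU-to-on-CPU cycle length) is easy to mock up outside the kernel. The fragment below is an illustration only: the names, the 1/8 decay weight and the fixed-point choices are invented here rather than taken from the spa_ebs sources, and it tacks on the frequency-weighted "capacity used per burst" variant Peter proposes in his reply, assuming the CPU frequency for each burst is available (e.g. from cpufreq).

#include <stdint.h>
#include <stdio.h>

/*
 * Illustration only: a decaying average of per-burst CPU time and of the
 * scheduling-cycle length (on-CPU one time to on-CPU the next), with their
 * ratio used as the bandwidth estimate.  Names, the 1/8 decay weight and
 * the fixed-point scheme are invented, not taken from spa_ebs.
 */
#define AVG_SHIFT 3                      /* new = avg + (sample - avg) / 8 */

struct ebs_stats {
	uint64_t avg_burst_ns;           /* average on-CPU time per burst     */
	uint64_t avg_cycle_ns;           /* average on-CPU-to-on-CPU interval */
};

static uint64_t decay_avg(uint64_t avg, uint64_t sample)
{
	if (sample >= avg)
		return avg + ((sample - avg) >> AVG_SHIFT);
	return avg - ((avg - sample) >> AVG_SHIFT);
}

static void ebs_update(struct ebs_stats *s, uint64_t burst_ns, uint64_t cycle_ns)
{
	s->avg_burst_ns = decay_avg(s->avg_burst_ns, burst_ns);
	s->avg_cycle_ns = decay_avg(s->avg_cycle_ns, cycle_ns);
}

/* bandwidth usage estimate in parts per thousand */
static unsigned int ebs_bandwidth_permille(const struct ebs_stats *s)
{
	if (!s->avg_cycle_ns)
		return 0;
	return (unsigned int)(s->avg_burst_ns * 1000 / s->avg_cycle_ns);
}

/*
 * The variant suggested in the reply above: weight each burst by the CPU
 * frequency it ran at, so the statistic tracks capacity used rather than
 * wall-clock time.  cur_khz/max_khz would come from cpufreq; here they are
 * plain parameters.
 */
static uint64_t burst_capacity_ns(uint64_t burst_ns,
				  unsigned int cur_khz, unsigned int max_khz)
{
	return burst_ns * cur_khz / max_khz;
}

int main(void)
{
	struct ebs_stats s = { 0, 0 };
	int i;

	/* a task that runs 2ms out of every 10ms at half clock speed */
	for (i = 0; i < 32; i++)
		ebs_update(&s, burst_capacity_ns(2000000, 500000, 1000000),
			   10000000);
	printf("estimated bandwidth: %u/1000\n", ebs_bandwidth_permille(&s));
	return 0;
}

Compiled as ordinary userspace C, the toy main() settles at roughly 100/1000 for a task running 2ms of every 10ms at half clock speed, i.e. about 10% of full-speed capacity rather than the 20% a wall-clock measure would report.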
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:51 ` Ingo Molnar 2007-04-17 13:44 ` Peter Williams @ 2007-04-20 20:47 ` Bill Davidsen 2007-04-21 7:39 ` Nick Piggin 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-20 20:47 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven Ingo Molnar wrote: > ( Lets be cautious though: the jury is still out whether people actually > like this more than the current approach. While CFS feedback looks > promising after a whopping 3 days of it being released [ ;-) ], the > test coverage of all 'fairness centric' schedulers, even considering > years of availability is less than 1% i'm afraid, and that < 1% was > mostly self-selecting. ) > All of my testing has been on desktop machines, although in most cases they were really loaded desktops which had load avg 10..100 from time to time, and none were low memory machines. Up to CFS v3 I thought nicksched was my winner, now CFSv3 looks better, by not having stumbles under stupid loads. I have not tested: 1 - server loads, nntp, smtp, etc 2 - low memory machines 3 - uniprocessor systems I think this should be done before drawing conclusions. Or if someone has tried this, perhaps they would report what they saw. People are talking about smoothness, but not how many pages per second come out of their overloaded web server. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 20:47 ` Bill Davidsen @ 2007-04-21 7:39 ` Nick Piggin 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-21 7:39 UTC (permalink / raw) To: Bill Davidsen Cc: Ingo Molnar, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven On Fri, Apr 20, 2007 at 04:47:27PM -0400, Bill Davidsen wrote: > Ingo Molnar wrote: > > >( Lets be cautious though: the jury is still out whether people actually > > like this more than the current approach. While CFS feedback looks > > promising after a whopping 3 days of it being released [ ;-) ], the > > test coverage of all 'fairness centric' schedulers, even considering > > years of availability is less than 1% i'm afraid, and that < 1% was > > mostly self-selecting. ) > > > All of my testing has been on desktop machines, although in most cases > they were really loaded desktops which had load avg 10..100 from time to > time, and none were low memory machines. Up to CFS v3 I thought > nicksched was my winner, now CFSv3 looks better, by not having stumbles > under stupid loads. What base_timeslice were you using for nicksched, and what HZ? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 20:47 ` Bill Davidsen 2007-04-21 7:39 ` Nick Piggin @ 2007-04-21 8:33 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-21 8:33 UTC (permalink / raw) To: Bill Davidsen Cc: Nick Piggin, Bill Huey, Mike Galbraith, Peter Williams, linux-kernel, ck list, Andrew Morton, Linus Torvalds, Thomas Gleixner, Arjan van de Ven * Bill Davidsen <davidsen@tmr.com> wrote: > All of my testing has been on desktop machines, although in most cases > they were really loaded desktops which had load avg 10..100 from time > to time, and none were low memory machines. Up to CFS v3 I thought > nicksched was my winner, now CFSv3 looks better, by not having > stumbles under stupid loads. nice! I hope CFSv4 kept that good tradition too ;) > I have not tested: > 1 - server loads, nntp, smtp, etc > 2 - low memory machines > 3 - uniprocessor systems > > I think this should be done before drawing conclusions. Or if someone > has tried this, perhaps they would report what they saw. People are > talking about smoothness, but not how many pages per second come out > of their overloaded web server. i tested heavily swapping systems. (make -j50 workloads easily trigger that) I also tested UP systems and a handful of SMP systems. I have also tested massive_intr.c which i believe is an indicator of how fairly CPU time is distributed between partly sleeping partly running server threads. But i very much agree that diverse feedback is sought and welcome, both from those who are happy with the current scheduler and those who are unhappy about it. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:01 ` Mike Galbraith 2007-04-17 4:14 ` Nick Piggin @ 2007-04-20 20:36 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-20 20:36 UTC (permalink / raw) To: Mike Galbraith Cc: Nick Piggin, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Mike Galbraith wrote: > On Tue, 2007-04-17 at 05:40 +0200, Nick Piggin wrote: >> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > >>> Yup, and progress _is_ happening now, quite rapidly. >> Progress as in progress on Ingo's scheduler. I still don't know how we'd >> decide when to replace the mainline scheduler or with what. >> >> I don't think we can say Ingo's is better than the alternatives, can we? > > No, that would require massive performance testing of all alternatives. > >> If there is some kind of bakeoff, then I'd like one of Con's designs to >> be involved, and mine, and Peter's... > > The trouble with a bakeoff is that it's pretty darn hard to get people > to test in the first place, and then comes weighting the subjective and > hard performance numbers. If they're close in numbers, do you go with > the one which starts the least flamewars or what? > Here we disagree... I picked a scheduler not by running benchmarks, but by running loads which piss me off with the mainline scheduler. And then I ran the other schedulers for a while to find the things, normal things I do, which resulted in bad behavior. And when I found one which had (so far) no such cases I called it my winner, but I haven't tested it under server load, so I can't begin to say it's "the best." What we need is for lots of people to run every scheduler in real life, and do "worst case analysis" by finding the cases which cause bad behavior. And if there were a way to easily choose another scheduler, call it plugable, modular, or Russian Roulette, people who found a worst case would report it (aka bitch about it) and try another. But the average user is better able to boot with an option like "sched=cfs" (or sc, or nick, or ...) than to patch and build a kernel. So if we don't get easily switched schedulers people will not test nearly as well. The best scheduler isn't the one 2% faster than the rest, it's the one with the fewest jackpot cases where it sucks. And if the mainline had multiple schedulers this testing would get done, authors would get more reports and have a better chance of fixing corner cases. Note that we really need multiple schedulers to make people happy, because fairness is not the most desirable behavior on all machines, and adding knobs probably isn't the answer. I want a server to degrade gently, I want my desktop to show my movie and echo my typing, and if that's hard on compiles or the file transfer, so be it. Con doesn't want to compromise his goals, I agree but want to have an option if I don't share them. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:40 ` Nick Piggin 2007-04-17 4:01 ` Mike Galbraith @ 2007-04-17 4:17 ` Peter Williams 2007-04-17 4:29 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 4:17 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: >> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: >>> Mike Galbraith wrote: >>>> Demystify what? The casual observer need only read either your attempt >>>> at writing a scheduler, or my attempts at fixing the one we have, to see >>>> that it was high time for someone with the necessary skills to step in. >>> Make that "someone with the necessary clout". >> No, I was brutally honest to both of us, but quite correct. >> >>>> Now progress can happen, which was _not_ happening before. >>>> >>> This is true. >> Yup, and progress _is_ happening now, quite rapidly. > > Progress as in progress on Ingo's scheduler. I still don't know how we'd > decide when to replace the mainline scheduler or with what. > > I don't think we can say Ingo's is better than the alternatives, can we? > If there is some kind of bakeoff, then I'd like one of Con's designs to > be involved, and mine, and Peter's... I myself was thinking of this as the chance for a much needed simplification of the scheduling code and if this can be done with the result being "reasonable" it then gives us the basis on which to propose improvements based on the ideas of others such as you mention. As the size of the cpusched indicates, trying to evaluate alternative proposals based on the current O(1) scheduler is fraught. Hopefully, this initiative can fix this problem. Then we just need Ingo to listen to suggestions and he's showing signs of being willing to do this :-) > > Maybe the progress is that more key people are becoming open to the idea > of changing the scheduler. That too. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:17 ` Peter Williams @ 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 4:29 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: > >>On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: > >>>Mike Galbraith wrote: > >>>>Demystify what? The casual observer need only read either your attempt > >>>>at writing a scheduler, or my attempts at fixing the one we have, to see > >>>>that it was high time for someone with the necessary skills to step in. > >>>Make that "someone with the necessary clout". > >>No, I was brutally honest to both of us, but quite correct. > >> > >>>>Now progress can happen, which was _not_ happening before. > >>>> > >>>This is true. > >>Yup, and progress _is_ happening now, quite rapidly. > > > >Progress as in progress on Ingo's scheduler. I still don't know how we'd > >decide when to replace the mainline scheduler or with what. > > > >I don't think we can say Ingo's is better than the alternatives, can we? > >If there is some kind of bakeoff, then I'd like one of Con's designs to > >be involved, and mine, and Peter's... > > I myself was thinking of this as the chance for a much needed > simplification of the scheduling code and if this can be done with the > result being "reasonable" it then gives us the basis on which to propose > improvements based on the ideas of others such as you mention. > > As the size of the cpusched indicates, trying to evaluate alternative > proposals based on the current O(1) scheduler is fraught. Hopefully, I don't know why. The problem is that you can't really evaluate good proposals by looking at the code (you can say that one is bad, ie. the current one, which has a huge amount of temporal complexity and is explicitly unfair), but it is pretty hard to say one behaves well. And my scheduler for example cuts down the amount of policy code and code size significantly. I haven't looked at Con's ones for a while, but I believe they are also much more straightforward than mainline... For example, let's say all else is equal between them, then why would we go with the O(logN) implementation rather than the O(1)? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin @ 2007-04-17 5:53 ` Willy Tarreau 2007-04-17 6:10 ` Nick Piggin 2007-04-17 6:09 ` William Lee Irwin III 2007-04-17 6:23 ` Peter Williams 2 siblings, 1 reply; 304+ messages in thread From: Willy Tarreau @ 2007-04-17 5:53 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Hi Nick, On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: (...) > And my scheduler for example cuts down the amount of policy code and > code size significantly. I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... > > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? Of course, if this is the case, the question will be raised. But as a general rule, I don't see much potential in O(1) to finely tune scheduling according to several criteria. In O(logN), you can adjust scheduling in realtime at a very low cost. Better processing of varying priorities or fork() comes to mind. Regards, Willy ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 5:53 ` Willy Tarreau @ 2007-04-17 6:10 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:10 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 07:53:55AM +0200, Willy Tarreau wrote: > Hi Nick, > > On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > (...) > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. I haven't looked at Con's ones for a while, > > but I believe they are also much more straightforward than mainline... > > > > For example, let's say all else is equal between them, then why would > > we go with the O(logN) implementation rather than the O(1)? > > Of course, if this is the case, the question will be raised. But as a > general rule, I don't see much potential in O(1) to finely tune scheduling > according to several criteria. What do you mean? By what criteria? > In O(logN), you can adjust scheduling in > realtime at a very low cost. Better processing of varying priorities or > fork() comes to mind. The main problem as I see it is choosing which task to run next and how much time to run it for. And given that there are typically far less than 58 (the number of priorities in nicksched) runnable tasks for a desktop system, I don't find it at all constraining to quantize my "next runnable" criteria onto that size of key. Even if you do expect a huge number of runnable tasks, you would hope for fewer interactive ones toward the higher end of the priority scale. Handwaving or even detailed design descriptions is simply not the best way to decide on a new scheduler, is all I'm saying. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau @ 2007-04-17 6:09 ` William Lee Irwin III 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:23 ` Peter Williams 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:09 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: >> I myself was thinking of this as the chance for a much needed >> simplification of the scheduling code and if this can be done with the >> result being "reasonable" it then gives us the basis on which to propose >> improvements based on the ideas of others such as you mention. >> As the size of the cpusched indicates, trying to evaluate alternative >> proposals based on the current O(1) scheduler is fraught. Hopefully, On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > I don't know why. The problem is that you can't really evaluate good > proposals by looking at the code (you can say that one is bad, ie. the > current one, which has a huge amount of temporal complexity and is > explicitly unfair), but it is pretty hard to say one behaves well. > And my scheduler for example cuts down the amount of policy code and > code size significantly. I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? All things are not equal; they all have different properties. I like getting rid of the queue-swapping artifacts as ebs and cfs have done, as the artifacts introduced there are nasty IMNSHO. I'm queueing up a demonstration of epoch expiry scheduling artifacts as a testcase, albeit one with no pass/fail semantics for its results, just detecting scheduler properties. That said, inequality/inequivalence is not a superiority/inferiority ranking per se. What needs to come out of these discussions is a set of standards which a candidate for mainline must pass to be considered correct and a set of performance metrics by which to rank them. Video game framerates and some sort of way to automate window wiggle tests sound like good ideas, but automating such is beyond my userspace programming abilities. An organization able to devote manpower to devising such testcases will likely have to get involved for them to happen, I suspect. On a random note, limitations on kernel address space make O(lg(n)) effectively O(1), albeit with large upper bounds on the worst case and an expected case much faster than the worst case. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
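Since the O(1)-versus-O(log N) question keeps recurring in this sub-thread, the rough shape of the two pick-next-task paths being compared may be worth seeing side by side. The following is a simplified userspace sketch, not kernel code: the bitmap half mirrors the mainline O(1) idea (per-priority run lists plus a find-first-bit over a fixed priority range) and the tree half the time-ordered-tree idea, where picking the next task is a walk to the leftmost node (CFS additionally caches that node).

#include <stddef.h>
#include <strings.h>                     /* ffs() */

#define NR_PRIO 140                      /* fixed priority range, O(1) flavour */

struct task { struct task *next; int prio; };

/* O(1) flavour: per-priority run lists plus a bitmap of non-empty lists. */
struct prio_array {
	unsigned int bitmap[(NR_PRIO + 31) / 32];
	struct task *queue[NR_PRIO];
};

static struct task *pick_next_o1(struct prio_array *a)
{
	size_t w;

	for (w = 0; w < (NR_PRIO + 31) / 32; w++)
		if (a->bitmap[w])        /* first set bit == best non-empty list */
			return a->queue[w * 32 + ffs(a->bitmap[w]) - 1];
	return NULL;
}

/* O(log N) flavour: tasks keyed on a time-ordered tree; the next task to
 * run is the leftmost node (insertion is where the log N cost lives). */
struct tnode { struct tnode *left, *right; unsigned long long key_ns; };

static struct tnode *pick_next_tree(struct tnode *root)
{
	if (!root)
		return NULL;
	while (root->left)
		root = root->left;
	return root;
}

int main(void)
{
	static struct task t = { NULL, 120 };
	struct prio_array a = { { 0 }, { NULL } };
	struct tnode n_right = { NULL, NULL, 300 };
	struct tnode n_left  = { NULL, NULL, 100 };
	struct tnode root    = { &n_left, &n_right, 200 };

	a.queue[t.prio] = &t;
	a.bitmap[t.prio / 32] |= 1u << (t.prio % 32);

	return (pick_next_o1(&a) == &t && pick_next_tree(&root) == &n_left) ? 0 : 1;
}

With a bounded number of runnable tasks the log N factor stays small in practice, which is the point made above about O(lg n) being effectively O(1) with a large constant on the worst case.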
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:09 ` William Lee Irwin III @ 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:26 ` William Lee Irwin III 2007-04-17 6:50 ` Davide Libenzi 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:15 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: > >> I myself was thinking of this as the chance for a much needed > >> simplification of the scheduling code and if this can be done with the > >> result being "reasonable" it then gives us the basis on which to propose > >> improvements based on the ideas of others such as you mention. > >> As the size of the cpusched indicates, trying to evaluate alternative > >> proposals based on the current O(1) scheduler is fraught. Hopefully, > > On Tue, Apr 17, 2007 at 06:29:54AM +0200, Nick Piggin wrote: > > I don't know why. The problem is that you can't really evaluate good > > proposals by looking at the code (you can say that one is bad, ie. the > > current one, which has a huge amount of temporal complexity and is > > explicitly unfair), but it is pretty hard to say one behaves well. > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. I haven't looked at Con's ones for a while, > > but I believe they are also much more straightforward than mainline... > > For example, let's say all else is equal between them, then why would > > we go with the O(logN) implementation rather than the O(1)? > > All things are not equal; they all have different properties. I like Exactly. So we have to explore those properties and evaluate performance (in all meanings of the word). That's only logical. > On a random note, limitations on kernel address space make O(lg(n)) > effectively O(1), albeit with large upper bounds on the worst case > and an expected case much faster than the worst case. Yeah. O(n!) is also O(1) if you can put an upper bound on n ;) ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:15 ` Nick Piggin @ 2007-04-17 6:26 ` William Lee Irwin III 2007-04-17 7:01 ` Nick Piggin 2007-04-17 6:50 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:26 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: >> All things are not equal; they all have different properties. I like On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > Exactly. So we have to explore those properties and evaluate performance > (in all meanings of the word). That's only logical. Any chance you'd be willing to put down a few thoughts on what sorts of standards you'd like to set for both correctness (i.e. the bare minimum a scheduler implementation must do to be considered valid beyond not oopsing) and performance metrics (i.e. things that produce numbers for each scheduler you can compare to say "this scheduler is better than this other scheduler at this."). -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:26 ` William Lee Irwin III @ 2007-04-17 7:01 ` Nick Piggin 2007-04-17 8:23 ` William Lee Irwin III 2007-04-17 21:39 ` Matt Mackall 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > >> All things are not equal; they all have different properties. I like > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > Exactly. So we have to explore those properties and evaluate performance > > (in all meanings of the word). That's only logical. > > Any chance you'd be willing to put down a few thoughts on what sorts > of standards you'd like to set for both correctness (i.e. the bare > minimum a scheduler implementation must do to be considered valid > beyond not oopsing) and performance metrics (i.e. things that produce > numbers for each scheduler you can compare to say "this scheduler is > better than this other scheduler at this."). Yeah I guess that's the hard part :) For correctness, I guess fairness is an easy one. I think that unfairness is basically a bug and that it would be very unfortunate to merge something unfair. But this is just within the context of a single runqueue... for better or worse, we allow some unfairness in multiprocessors for performance reasons of course. Latency. Given N tasks in the system, an arbitrary task should get onto the CPU in a bounded amount of time (excluding events like freak IRQ holdoffs and such, obviously -- ie. just considering the context of the scheduler's state machine). I wouldn't like to see a significant drop in any micro or macro benchmarks or even worse real workloads, but I could accept some if it means having a fair scheduler by default. Now it isn't actually too hard to achieve the above, I think. The hard bit is trying to compare interactivity. Ideally, we'd be able to get scripted dumps of login sessions, and measure scheduling latencies of key processes (sh/X/wm/xmms/firefox/etc). People would send a dump if they were having problems with any scheduler, and we could compare all of them against it. Wishful thinking! ^ permalink raw reply [flat|nested] 304+ messages in thread
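One cheap way to approximate the "measure scheduling latencies of key processes" idea, short of full scripted session dumps, is to sample the per-task run-delay counter the kernel already exports when schedstats are enabled. The sketch below assumes the /proc/<pid>/schedstat layout described in the scheduler-statistics documentation (cumulative run time, cumulative runqueue wait, timeslice count); the units of the first two fields have varied across kernel versions, so only deltas over an interval are reported and compared.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/* Read the three cumulative counters from /proc/<pid>/schedstat. */
static int read_schedstat(pid_t pid, unsigned long long *run,
			  unsigned long long *rq_wait, unsigned long long *slices)
{
	char path[64];
	FILE *f;
	int n;

	snprintf(path, sizeof(path), "/proc/%d/schedstat", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fscanf(f, "%llu %llu %llu", run, rq_wait, slices);
	fclose(f);
	return n == 3 ? 0 : -1;
}

int main(int argc, char **argv)
{
	unsigned long long r0, w0, s0, r1, w1, s1;
	pid_t pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	pid = (pid_t)atoi(argv[1]);

	if (read_schedstat(pid, &r0, &w0, &s0))
		return 1;
	sleep(5);                        /* sampling interval */
	if (read_schedstat(pid, &r1, &w1, &s1))
		return 1;

	printf("pid %d: +%llu run, +%llu runqueue wait, +%llu timeslices\n",
	       (int)pid, r1 - r0, w1 - w0, s1 - s0);
	return 0;
}

Pointing something like this at the pid of X, the window manager or a media player while a make -j load runs in the background gives at least a crude, repeatable number to compare between schedulers.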
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:01 ` Nick Piggin @ 2007-04-17 8:23 ` William Lee Irwin III 2007-04-17 22:23 ` Davide Libenzi 2007-04-17 21:39 ` Matt Mackall 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 8:23 UTC (permalink / raw) To: Nick Piggin Cc: Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: >> Any chance you'd be willing to put down a few thoughts on what sorts >> of standards you'd like to set for both correctness (i.e. the bare >> minimum a scheduler implementation must do to be considered valid >> beyond not oopsing) and performance metrics (i.e. things that produce >> numbers for each scheduler you can compare to say "this scheduler is >> better than this other scheduler at this."). On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Yeah I guess that's the hard part :) > For correctness, I guess fairness is an easy one. I think that unfairness > is basically a bug and that it would be very unfortunate to merge something > unfair. But this is just within the context of a single runqueue... for > better or worse, we allow some unfairness in multiprocessors for performance > reasons of course. Requiring that identical tasks be allocated equal shares of CPU bandwidth is the easy part here. ringtest.c exercises another aspect of fairness that is extremely important. Generalizing ringtest.c is a good idea for fairness testing. But another aspect of fairness is that "controlled unfairness" is also intended to exist, in no small part by virtue of nice levels, but also in the form of favoring tasks that are considered interactive somehow. Testing various forms of controlled unfairness to ensure that they are indeed controlled and otherwise have the semantics intended is IMHO the more difficult aspect of fairness testing. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Latency. Given N tasks in the system, an arbitrary task should get > onto the CPU in a bounded amount of time (excluding events like freak > IRQ holdoffs and such, obviously -- ie. just considering the context > of the scheduler's state machine). ISTR Davide Libenzi having a scheduling latency test a number of years ago. Resurrecting that and tuning it to the needs of this kind of testing sounds relevant here. The test suite Peter Willliams mentioned would also help. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > I wouldn't like to see a significant drop in any micro or macro > benchmarks or even worse real workloads, but I could accept some if it > means haaving a fair scheduler by default. On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > Now it isn't actually too hard to achieve the above, I think. The hard bit > is trying to compare interactivity. Ideally, we'd be able to get scripted > dumps of login sessions, and measure scheduling latencies of key proceses > (sh/X/wm/xmms/firefox/etc). People would send a dump if they were having > problems with any scheduler, and we could compare all of them against it. > Wishful thinking! That's a pretty good idea. I'll queue up writing something of that form as well. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:23 ` William Lee Irwin III @ 2007-04-17 22:23 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 22:23 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > Latency. Given N tasks in the system, an arbitrary task should get > > onto the CPU in a bounded amount of time (excluding events like freak > > IRQ holdoffs and such, obviously -- ie. just considering the context > > of the scheduler's state machine). > > ISTR Davide Libenzi having a scheduling latency test a number of years > ago. Resurrecting that and tuning it to the needs of this kind of > testing sounds relevant here. The test suite Peter Williams mentioned > would also help. That helped me a lot at that time. At every context switch it was sampling critical scheduler parameters for both the entering and exiting task (and associated timestamps). Then the data was collected through a /dev/idontremember from userspace for analysis. It'd be very useful to have it these days, to study what really happens under the hood (scheduler internal parameter variations and such) when those weird loads make the scheduler unstable. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:01 ` Nick Piggin 2007-04-17 8:23 ` William Lee Irwin III @ 2007-04-17 21:39 ` Matt Mackall 2007-04-17 23:23 ` Peter Williams 2007-04-18 3:15 ` Nick Piggin 1 sibling, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 21:39 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > >> All things are not equal; they all have different properties. I like > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > Exactly. So we have to explore those properties and evaluate performance > > > (in all meanings of the word). That's only logical. > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > of standards you'd like to set for both correctness (i.e. the bare > > minimum a scheduler implementation must do to be considered valid > > beyond not oopsing) and performance metrics (i.e. things that produce > > numbers for each scheduler you can compare to say "this scheduler is > > better than this other scheduler at this."). > > Yeah I guess that's the hard part :) > > For correctness, I guess fairness is an easy one. I think that unfairness > is basically a bug and that it would be very unfortunate to merge something > unfair. But this is just within the context of a single runqueue... for > better or worse, we allow some unfairness in multiprocessors for performance > reasons of course. I'm a big fan of fairness, but I think it's a bit early to declare it a mandatory feature. Bounded unfairness is probably something we can agree on, ie "if we decide to be unfair, no process suffers more than a factor of x". > Latency. Given N tasks in the system, an arbitrary task should get > onto the CPU in a bounded amount of time (excluding events like freak > IRQ holdoffs and such, obviously -- ie. just considering the context > of the scheduler's state machine). This is a slightly stronger statement than starvation-free (which is obviously mandatory). I think you're looking for something like "worst-case scheduling latency is proportional to the number of runnable tasks". Which I think is quite a reasonable requirement. I'm pretty sure the stock scheduler falls short of both of these guarantees though. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
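The two criteria in this exchange ("no process suffers more than a factor of x" and "worst-case scheduling latency proportional to the number of runnable tasks") translate into very small checks over measured data. The following is only a restatement of those criteria in code, with the unfairness factor and the per-task latency constant left as hypothetical tunables; it is not part of any existing test suite.

#include <stdbool.h>
#include <stddef.h>

#define X_FACTOR		2.0	/* max allowed CPU-share ratio between tasks */
#define LATENCY_PER_TASK_MS	20.0	/* max wait per runnable task */

/* Bounded unfairness: for identical tasks, no share ratio exceeds X_FACTOR. */
static bool bounded_unfairness(const double *share, size_t n)
{
	double lo = share[0], hi = share[0];
	size_t i;

	for (i = 1; i < n; i++) {
		if (share[i] < lo)
			lo = share[i];
		if (share[i] > hi)
			hi = share[i];
	}
	return lo > 0.0 && hi / lo <= X_FACTOR;
}

/* Worst-case wait bounded by a constant times the number of runnable tasks. */
static bool proportional_latency(const double *worst_wait_ms, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (worst_wait_ms[i] > LATENCY_PER_TASK_MS * (double)n)
			return false;
	return true;
}

int main(void)
{
	double share[3]   = { 0.32, 0.35, 0.33 }; /* observed CPU fractions     */
	double wait_ms[3] = { 18.0, 25.0, 40.0 }; /* observed worst-case waits  */

	return (bounded_unfairness(share, 3) &&
		proportional_latency(wait_ms, 3)) ? 0 : 1;
}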
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 21:39 ` Matt Mackall @ 2007-04-17 23:23 ` Peter Williams 2007-04-17 23:19 ` Matt Mackall 2007-04-18 3:15 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 23:23 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Matt Mackall wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: >> On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: >>> On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: >>>>> All things are not equal; they all have different properties. I like >>> On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: >>>> Exactly. So we have to explore those properties and evaluate performance >>>> (in all meanings of the word). That's only logical. >>> Any chance you'd be willing to put down a few thoughts on what sorts >>> of standards you'd like to set for both correctness (i.e. the bare >>> minimum a scheduler implementation must do to be considered valid >>> beyond not oopsing) and performance metrics (i.e. things that produce >>> numbers for each scheduler you can compare to say "this scheduler is >>> better than this other scheduler at this."). >> Yeah I guess that's the hard part :) >> >> For correctness, I guess fairness is an easy one. I think that unfairness >> is basically a bug and that it would be very unfortunate to merge something >> unfair. But this is just within the context of a single runqueue... for >> better or worse, we allow some unfairness in multiprocessors for performance >> reasons of course. > > I'm a big fan of fairness, but I think it's a bit early to declare it > a mandatory feature. Bounded unfairness is probably something we can > agree on, ie "if we decide to be unfair, no process suffers more than > a factor of x". > >> Latency. Given N tasks in the system, an arbitrary task should get >> onto the CPU in a bounded amount of time (excluding events like freak >> IRQ holdoffs and such, obviously -- ie. just considering the context >> of the scheduler's state machine). > > This is a slightly stronger statement than starvation-free (which is > obviously mandatory). I think you're looking for something like > "worst-case scheduling latency is proportional to the number of > runnable tasks". add "taking into consideration nice and/or real time priorities of runnable tasks". I.e. if a task is nice 19 it can expect to wait longer to get onto the CPU than if it was nice 0. > Which I think is quite a reasonable requirement. > > I'm pretty sure the stock scheduler falls short of both of these > guarantees though. > Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:23 ` Peter Williams @ 2007-04-17 23:19 ` Matt Mackall 0 siblings, 0 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 23:19 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 09:23:42AM +1000, Peter Williams wrote: > Matt Mackall wrote: > >On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > >>On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > >>>On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > >>>>>All things are not equal; they all have different properties. I like > >>>On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > >>>>Exactly. So we have to explore those properties and evaluate performance > >>>>(in all meanings of the word). That's only logical. > >>>Any chance you'd be willing to put down a few thoughts on what sorts > >>>of standards you'd like to set for both correctness (i.e. the bare > >>>minimum a scheduler implementation must do to be considered valid > >>>beyond not oopsing) and performance metrics (i.e. things that produce > >>>numbers for each scheduler you can compare to say "this scheduler is > >>>better than this other scheduler at this."). > >>Yeah I guess that's the hard part :) > >> > >>For correctness, I guess fairness is an easy one. I think that unfairness > >>is basically a bug and that it would be very unfortunate to merge > >>something > >>unfair. But this is just within the context of a single runqueue... for > >>better or worse, we allow some unfairness in multiprocessors for > >>performance > >>reasons of course. > > > >I'm a big fan of fairness, but I think it's a bit early to declare it > >a mandatory feature. Bounded unfairness is probably something we can > >agree on, ie "if we decide to be unfair, no process suffers more than > >a factor of x". > > > >>Latency. Given N tasks in the system, an arbitrary task should get > >>onto the CPU in a bounded amount of time (excluding events like freak > >>IRQ holdoffs and such, obviously -- ie. just considering the context > >>of the scheduler's state machine). > > > >This is a slightly stronger statement than starvation-free (which is > >obviously mandatory). I think you're looking for something like > >"worst-case scheduling latency is proportional to the number of > >runnable tasks". > > add "taking into consideration nice and/or real time priorities of > runnable tasks". I.e. if a task is nice 19 it can expect to wait longer > to get onto the CPU than if it was nice 0. Yes. Assuming we meet the "bounded unfairness" criterion above, this follows. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 21:39 ` Matt Mackall 2007-04-17 23:23 ` Peter Williams @ 2007-04-18 3:15 ` Nick Piggin 2007-04-18 3:45 ` Mike Galbraith 2007-04-18 4:38 ` Matt Mackall 1 sibling, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 3:15 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > > >> All things are not equal; they all have different properties. I like > > > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > > Exactly. So we have to explore those properties and evaluate performance > > > > (in all meanings of the word). That's only logical. > > > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > > of standards you'd like to set for both correctness (i.e. the bare > > > minimum a scheduler implementation must do to be considered valid > > > beyond not oopsing) and performance metrics (i.e. things that produce > > > numbers for each scheduler you can compare to say "this scheduler is > > > better than this other scheduler at this."). > > > > Yeah I guess that's the hard part :) > > > > For correctness, I guess fairness is an easy one. I think that unfairness > > is basically a bug and that it would be very unfortunate to merge something > > unfair. But this is just within the context of a single runqueue... for > > better or worse, we allow some unfairness in multiprocessors for performance > > reasons of course. > > I'm a big fan of fairness, but I think it's a bit early to declare it > a mandatory feature. Bounded unfairness is probably something we can > agree on, ie "if we decide to be unfair, no process suffers more than > a factor of x". I don't know why this would be a useful feature (of course I'm talking about processes at the same nice level). One of the big problems with the current scheduler is that it is unfair in some corner cases. It works OK for most people, but when it breaks down it really hurts. At least if you start with a fair scheduler, you can alter priorities until it satisfies your need... with an unfair one your guess is as good as mine. So on what basis would you allow unfairness? On the basis that it doesn't seem to harm anyone? It doesn't seem to harm testers? I think we should aim for something better. > > Latency. Given N tasks in the system, an arbitrary task should get > > onto the CPU in a bounded amount of time (excluding events like freak > > IRQ holdoffs and such, obviously -- ie. just considering the context > > of the scheduler's state machine). > > This is a slightly stronger statement than starvation-free (which is > obviously mandatory). I think you're looking for something like > "worst-case scheduling latency is proportional to the number of > runnable tasks". Which I think is quite a reasonable requirement. Yes, bounded and proportional to. > I'm pretty sure the stock scheduler falls short of both of these > guarantees though. And I think that's what its main problems are. It's interactivity obviously can't be too bad for most people. 
Its performance seems to be pretty good. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:15 ` Nick Piggin @ 2007-04-18 3:45 ` Mike Galbraith 2007-04-18 3:56 ` Nick Piggin 2007-04-18 4:38 ` Matt Mackall 1 sibling, 1 reply; 304+ messages in thread From: Mike Galbraith @ 2007-04-18 3:45 UTC (permalink / raw) To: Nick Piggin Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > a mandatory feature. Bounded unfairness is probably something we can > > agree on, ie "if we decide to be unfair, no process suffers more than > > a factor of x". > > I don't know why this would be a useful feature (of course I'm talking > about processes at the same nice level). One of the big problems with > the current scheduler is that it is unfair in some corner cases. It > works OK for most people, but when it breaks down it really hurts. At > least if you start with a fair scheduler, you can alter priorities > until it satisfies your need... with an unfair one your guess is as > good as mine. > > So on what basis would you allow unfairness? On the basis that it doesn't > seem to harm anyone? It doesn't seem to harm testers? Well, there's short term fair and long term fair. Seems to me a burst load having to always merge with a steady stream load using a short term fairness yardstick absolutely must 'starve' relative to the steady load, so to be long term fair, you have to add some short term unfairness. The mainline scheduler is more long term fair (discounting the rather obnoxious corner cases). -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:45 ` Mike Galbraith @ 2007-04-18 3:56 ` Nick Piggin 2007-04-18 4:29 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 3:56 UTC (permalink / raw) To: Mike Galbraith Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote: > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > > a mandatory feature. Bounded unfairness is probably something we can > > > agree on, ie "if we decide to be unfair, no process suffers more than > > > a factor of x". > > > > I don't know why this would be a useful feature (of course I'm talking > > about processes at the same nice level). One of the big problems with > > the current scheduler is that it is unfair in some corner cases. It > > works OK for most people, but when it breaks down it really hurts. At > > least if you start with a fair scheduler, you can alter priorities > > until it satisfies your need... with an unfair one your guess is as > > good as mine. > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > seem to harm anyone? It doesn't seem to harm testers? > > Well, there's short term fair and long term fair. Seems to me a burst > load having to always merge with a steady stream load using a short term > fairness yardstick absolutely must 'starve' relative to the steady load, > so to be long term fair, you have to add some short term unfairness. > The mainline scheduler is more long term fair (discounting the rather > obnoxious corner cases). Oh yes definitely I mean long term fair. I guess it is impossible to be completely fair so long as we have to timeshare the CPU :) So a constant delta is fine and unavoidable. But I don't think I agree with a constant factor: that means you can pick a time where process 1 is allowed an arbitrary T more CPU time than process 2. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:56 ` Nick Piggin @ 2007-04-18 4:29 ` Mike Galbraith 0 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-18 4:29 UTC (permalink / raw) To: Nick Piggin Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 05:56 +0200, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 05:45:20AM +0200, Mike Galbraith wrote: > > On Wed, 2007-04-18 at 05:15 +0200, Nick Piggin wrote: > > > > > > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > > seem to harm anyone? It doesn't seem to harm testers? > > > > Well, there's short term fair and long term fair. Seems to me a burst > > load having to always merge with a steady stream load using a short term > > fairness yardstick absolutely must 'starve' relative to the steady load, > > so to be long term fair, you have to add some short term unfairness. > > The mainline scheduler is more long term fair (discounting the rather > > obnoxious corner cases). > > Oh yes definitely I mean long term fair. I guess it is impossible to be > completely fair so long as we have to timeshare the CPU :) > > So a constant delta is fine and unavoidable. But I don't think I agree > with a constant factor: that means you can pick a time where process 1 > is allowed an arbitrary T more CPU time than process 2. Definitely. Using constants with no consideration of what else is running is what causes the fairness mechanism in mainline to break down under load. (aside: What I was experimenting with before this new scheduler came along was to turn the sleep_avg thing into an off-cpu period thing. Once a time slice begins execution [runqueue wait doesn't count], that task has the right to use its slice in one go, and _anything_ that knocked it off the cpu added to its credit. Knocking someone else off detracts from credit, and to get to the point where you can knock others off costs you stored credit proportional to the dynamic priority you attain by using it. All tasks that have credit stay active, no favorites.) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
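For anyone who wants to experiment with the credit idea in Mike's aside, a very rough paraphrase of it is sketched below. All of the names, the units and the linear "priority boost times slice" cost are invented for illustration; the actual experiment's code is not shown anywhere in this thread.

#include <stdint.h>

/*
 * Paraphrase of the aside above: bank credit for time a slice-holder spends
 * involuntarily off the CPU, and spend credit (scaled by the priority boost
 * obtained) to preempt others.  Illustrative only.
 */
struct credit {
	int64_t ns;                      /* stored credit, in nanoseconds */
};

/* Preempted while still holding an unexpired slice: bank the time lost. */
static void credit_preempted(struct credit *c, uint64_t off_cpu_ns)
{
	c->ns += (int64_t)off_cpu_ns;
}

/*
 * Ask to preempt someone else: the price grows with the priority boost the
 * preemption buys.  Returns nonzero and charges the cost if affordable.
 */
static int credit_try_preempt(struct credit *c, unsigned int prio_boost,
			      uint64_t slice_ns)
{
	int64_t cost = (int64_t)(prio_boost * slice_ns);

	if (c->ns < cost)
		return 0;
	c->ns -= cost;
	return 1;
}

/* "All tasks that have credit stay active, no favorites." */
static int credit_keeps_active(const struct credit *c)
{
	return c->ns > 0;
}

int main(void)
{
	struct credit c = { 0 };

	credit_preempted(&c, 2000000);              /* lost 2ms of an owned slice */
	return (credit_try_preempt(&c, 1, 1000000)  /* afford a one-level, 1ms jump */
		&& credit_keeps_active(&c)) ? 0 : 1;
}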
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 3:15 ` Nick Piggin 2007-04-18 3:45 ` Mike Galbraith @ 2007-04-18 4:38 ` Matt Mackall 2007-04-18 5:00 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-18 4:38 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:39:54PM -0500, Matt Mackall wrote: > > On Tue, Apr 17, 2007 at 09:01:55AM +0200, Nick Piggin wrote: > > > On Mon, Apr 16, 2007 at 11:26:21PM -0700, William Lee Irwin III wrote: > > > > On Mon, Apr 16, 2007 at 11:09:55PM -0700, William Lee Irwin III wrote: > > > > >> All things are not equal; they all have different properties. I like > > > > > > > > On Tue, Apr 17, 2007 at 08:15:03AM +0200, Nick Piggin wrote: > > > > > Exactly. So we have to explore those properties and evaluate performance > > > > > (in all meanings of the word). That's only logical. > > > > > > > > Any chance you'd be willing to put down a few thoughts on what sorts > > > > of standards you'd like to set for both correctness (i.e. the bare > > > > minimum a scheduler implementation must do to be considered valid > > > > beyond not oopsing) and performance metrics (i.e. things that produce > > > > numbers for each scheduler you can compare to say "this scheduler is > > > > better than this other scheduler at this."). > > > > > > Yeah I guess that's the hard part :) > > > > > > For correctness, I guess fairness is an easy one. I think that unfairness > > > is basically a bug and that it would be very unfortunate to merge something > > > unfair. But this is just within the context of a single runqueue... for > > > better or worse, we allow some unfairness in multiprocessors for performance > > > reasons of course. > > > > I'm a big fan of fairness, but I think it's a bit early to declare it > > a mandatory feature. Bounded unfairness is probably something we can > > agree on, ie "if we decide to be unfair, no process suffers more than > > a factor of x". > > I don't know why this would be a useful feature (of course I'm talking > about processes at the same nice level). One of the big problems with > the current scheduler is that it is unfair in some corner cases. It > works OK for most people, but when it breaks down it really hurts. At > least if you start with a fair scheduler, you can alter priorities > until it satisfies your need... with an unfair one your guess is as > good as mine. > > So on what basis would you allow unfairness? On the basis that it doesn't > seem to harm anyone? It doesn't seem to harm testers? On the basis that there's only anecdotal evidence thus far that fairness is the right approach. It's not yet clear that a fair scheduler can do the right thing with X, with various kernel threads, etc. without fiddling with nice levels. Which makes it no longer "completely fair". It's also not yet clear that a scheduler can't be taught to do the right thing with X without fiddling with nice levels. So I'm just not yet willing to completely rule out systems that aren't "completely fair". But I think we should rule out schedulers that don't have rigid bounds on that unfairness. That's where the really ugly behavior lies. -- Mathematics is the supreme nostalgia of our time. 
^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 4:38 ` Matt Mackall @ 2007-04-18 5:00 ` Nick Piggin 2007-04-18 5:55 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 5:00 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:38:31PM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 05:15:11AM +0200, Nick Piggin wrote: > > > > I don't know why this would be a useful feature (of course I'm talking > > about processes at the same nice level). One of the big problems with > > the current scheduler is that it is unfair in some corner cases. It > > works OK for most people, but when it breaks down it really hurts. At > > least if you start with a fair scheduler, you can alter priorities > > until it satisfies your need... with an unfair one your guess is as > > good as mine. > > > > So on what basis would you allow unfairness? On the basis that it doesn't > > seem to harm anyone? It doesn't seem to harm testers? > > On the basis that there's only anecdotal evidence thus far that > fairness is the right approach. > > It's not yet clear that a fair scheduler can do the right thing with X, > with various kernel threads, etc. without fiddling with nice levels. > Which makes it no longer "completely fair". Of course I mean SCHED_OTHER tasks at the same nice level. Otherwise I would be arguing to make nice basically a noop. > It's also not yet clear that a scheduler can't be taught to do the > right thing with X without fiddling with nice levels. Being fair doesn't prevent that. Implicit unfairness is wrong though, because it will bite people. What's wrong with allowing X to get more than it's fair share of CPU time by "fiddling with nice levels"? That's what they're there for. > So I'm just not yet willing to completely rule out systems that aren't > "completely fair". > > But I think we should rule out schedulers that don't have rigid bounds on > that unfairness. That's where the really ugly behavior lies. Been a while since I really looked at the mainline scheduler, but I don't think it can permanently starve something, so I don't know what your bounded unfairness would help with. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:00 ` Nick Piggin @ 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-18 5:55 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote: > > It's also not yet clear that a scheduler can't be taught to do the > > right thing with X without fiddling with nice levels. > > Being fair doesn't prevent that. Implicit unfairness is wrong though, > because it will bite people. > > What's wrong with allowing X to get more than it's fair share of CPU > time by "fiddling with nice levels"? That's what they're there for. Why is X special? Because it does work on behalf of other processes? Lots of things do this. Perhaps a scheduler should focus entirely on the implicit and directed wakeup matrix and optimizing that instead[1]. Why are processes special? Should user A be able to get more CPU time for his job than user B by splitting it into N parallel jobs? Should we be fair per process, per user, per thread group, per session, per controlling terminal? Some weighted combination of the preceding?[2] Why is the measure CPU time? I can imagine a scheduler that weighed memory bandwidth in the equation. Or power consumption. Or execution unit usage. Fairness is nice. It's simple, it's obvious, it's predictable. But it's just not clear that it's optimal. If the question is (and it was!) "what should the basic requirements for the scheduler be?" it's not clear that fairness is a requirement or even how to pick a metric for fairness that's obviously and uniquely superior. It's instead much easier to try to recognize and rule out really bad behaviour with bounded latencies, minimum service guarantees, etc. [1] That's basically how Google decides to prioritize webpages, which it seems to do moderately well. And how a number of other optimization problems are solved. [2] It's trivial to construct two or more perfectly reasonable and desirable definitions of fairness that are mutually incompatible. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall @ 2007-04-18 6:37 ` Nick Piggin 2007-04-18 6:55 ` Matt Mackall 2007-04-18 13:08 ` William Lee Irwin III 2007-04-18 14:48 ` Linus Torvalds 2 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 6:37 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 07:00:24AM +0200, Nick Piggin wrote: > > > It's also not yet clear that a scheduler can't be taught to do the > > > right thing with X without fiddling with nice levels. > > > > Being fair doesn't prevent that. Implicit unfairness is wrong though, > > because it will bite people. > > > > What's wrong with allowing X to get more than it's fair share of CPU > > time by "fiddling with nice levels"? That's what they're there for. > > Why is X special? Because it does work on behalf of other processes? The high level reason is that giving it more than its fair share of CPU allows a desktop to remain interactive under load. And it isn't just about doing work on behalf of other processes. Mouse interrupts are a big part of it, for example. > Lots of things do this. Perhaps a scheduler should focus entirely on > the implicit and directed wakeup matrix and optimizing that > instead[1]. You could do that, and I tried a variant of it at one point. The problem was that it leads to unexpected bad things too. UNIX programs more or less expect fair SCHED_OTHER scheduling, and given the principle of least surprise... > Why are processes special? Should user A be able to get more CPU time > for his job than user B by splitting it into N parallel jobs? Should > we be fair per process, per user, per thread group, per session, per > controlling terminal? Some weighted combination of the preceding?[2] I don't know how that supports your argument for unfairness, but processes are special only because that's how we've always done scheduling. I'm not precluding other groupings for fairness, though. > Why is the measure CPU time? I can imagine a scheduler that weighed > memory bandwidth in the equation. Or power consumption. Or execution > unit usage. Feel free. And I'd also argue that once you schedule for those metrics then fairness is also important there too. > Fairness is nice. It's simple, it's obvious, it's predictable. But > it's just not clear that it's optimal. If the question is (and it > was!) "what should the basic requirements for the scheduler be?" it's > not clear that fairness is a requirement or even how to pick a metric > for fairness that's obviously and uniquely superior. What do you mean optimal? If your criteria is fairness, then of course it is optimal. If your criteria is throughput, then it probably isn't. Considering it is simple and what we've always done, measuring fairness by CPU time per process is obvious for a general purpose scheduler. If you accept that, then I argue that fairness is an optimal property given that the alternative is unfairness. > It's instead much easier to try to recognize and rule out really bad > behaviour with bounded latencies, minimum service guarantees, etc. It's the bad behaviour that you didn't recognize that is the problem. If you start with explicit fairness, then unfairness will never be one of those problems. 
> [1] That's basically how Google decides to prioritize webpages, which > it seems to do moderately well. And how a number of other optimization > problems are solved. This is not an optimization problem, it is a heuristic. There is no right and wrong answer. > [2] It's trivial to construct two or more perfectly reasonable and > desirable definitions of fairness that are mutually incompatible. Probably not if you use common sense, and in the context of a replacement for the 2.6 scheduler. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:37 ` Nick Piggin @ 2007-04-18 6:55 ` Matt Mackall 2007-04-18 7:24 ` Nick Piggin 2007-04-21 13:33 ` Bill Davidsen 0 siblings, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-18 6:55 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: > I don't know how that supports your argument for unfairness, I never had such an argument. I like fairness. My argument is that -you- don't have an argument for making fairness a -requirement-. > processes are special only because that's how we've always done > scheduling. I'm not precluding other groupings for fairness, though. If you make one form of fairness a -requirement- for all acceptable algorithms, your -are- precluding most other forms of fairness. If you refuse to define what "fairness" means when specifying your requirement, what's the point of requiring it? > What do you mean optimal? If your criteria is fairness, then of course > it is optimal. If your criteria is throughput, then it probably isn't. I don't know what optimal behavior is. And neither do you. It may or may not be fair. It very likely includes small deviations from fair. > > [2] It's trivial to construct two or more perfectly reasonable and > > desirable definitions of fairness that are mutually incompatible. > > Probably not if you use common sense, and in the context of a replacement > for the 2.6 scheduler. Ok, trivial example. You cannot allocate equal CPU time to processes/tasks and simultaneously allocate equal time to thread groups. Is it common sense that a heavily-threaded app should be able to get hugely more CPU than a well-written app? No. I don't want Joe's stupid Java app to make my compile crawl. On the other hand, if my heavily threaded app is, say, a voicemail server serving 30 customers, I probably want it to get 30x the CPU of my gzip job. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
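To put numbers on that example (a worked illustration, not part of the original mail): with the 30-thread voicemail server and the single-threaded gzip both runnable, equal time per thread gives the server 30/31 (roughly 97%) of the CPU and gzip roughly 3%, while equal time per process gives each of them 50%. No single schedule satisfies both definitions at once, which is the incompatibility being claimed.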
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:55 ` Matt Mackall @ 2007-04-18 7:24 ` Nick Piggin 2007-04-21 13:33 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 7:24 UTC (permalink / raw) To: Matt Mackall Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 01:55:34AM -0500, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: > > I don't know how that supports your argument for unfairness, > > I never had such an argument. I like fairness. > > My argument is that -you- don't have an argument for making fairness a > -requirement-. It seems easy enough that there is no point accepting unfair behaviour like the old scheduler if we're going to go to all this trouble to replace it. The old scheduler seems to have bounded unfairness and bounded starvation, so let the good times roll. > > processes are special only because that's how we've always done > > scheduling. I'm not precluding other groupings for fairness, though. > > If you make one form of fairness a -requirement- for all acceptable > algorithms, your -are- precluding most other forms of fairness. > > If you refuse to define what "fairness" means when specifying your > requirement, what's the point of requiring it? I don't refuse. I'm talking about per-process CPU time fairness. My paragraph above was pointing out that subsequent work to add other classes of fairness are not excluded as configurable features, but this basic type of fairness should be included. > > What do you mean optimal? If your criteria is fairness, then of course > > it is optimal. If your criteria is throughput, then it probably isn't. > > I don't know what optimal behavior is. And neither do you. It may or > may not be fair. It very likely includes small deviations from fair. You misunderstand me. There is no single "optimal" when you're talking about fairness (or most other scheduler things). So pondering whether fairness is optimal or not doesn't really make sense. I'm saying it should be a basic axiom, not that it is quantitively better. It isn't a refutable argument. I state it because that it is what users and programs expect. You can reject that, and fine. I guess if a scheduler comes along that does exactly the right thing for everyone, then it is better than any fair scheduler. So OK, while we're talking theoretical, I won't dismiss an unfair scheduler out of hand. > > > [2] It's trivial to construct two or more perfectly reasonable and > > > desirable definitions of fairness that are mutually incompatible. > > > > Probably not if you use common sense, and in the context of a replacement > > for the 2.6 scheduler. > > Ok, trivial example. You cannot allocate equal CPU time to > processes/tasks and simultaneously allocate equal time to thread > groups. Is it common sense that a heavily-threaded app should be able > to get hugely more CPU than a well-written app? No. I don't want Joe's > stupid Java app to make my compile crawl. > > On the other hand, if my heavily threaded app is, say, a voicemail > server serving 30 customers, I probably want it to get 30x the CPU of > my gzip job. So that might be a nice addition, but the base funcionality is threads simply because that's what we've always done. Just common sense. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 6:55 ` Matt Mackall 2007-04-18 7:24 ` Nick Piggin @ 2007-04-21 13:33 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-21 13:33 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Matt Mackall wrote: > On Wed, Apr 18, 2007 at 08:37:11AM +0200, Nick Piggin wrote: >>> [2] It's trivial to construct two or more perfectly reasonable and >>> desirable definitions of fairness that are mutually incompatible. >> Probably not if you use common sense, and in the context of a replacement >> for the 2.6 scheduler. > > Ok, trivial example. You cannot allocate equal CPU time to > processes/tasks and simultaneously allocate equal time to thread > groups. Is it common sense that a heavily-threaded app should be able > to get hugely more CPU than a well-written app? No. I don't want Joe's > stupid Java app to make my compile crawl. > > On the other hand, if my heavily threaded app is, say, a voicemail > server serving 30 customers, I probably want it to get 30x the CPU of > my gzip job. > Matt, you tickled a thought... on one hand we have a single user running a threaded application, and it ideally should get the same total CPU as a user running a single thread process. On the other hand we have a threaded application, call it sendmail, nnrpd, httpd, bind, whatever. In that case each thread is really providing service for an independent user, and should get an appropriate share of the CPU. Perhaps the solution is to add a means for identifying server processes, by capability, or by membership in a "server" group, or by having the initiating process set some flag at exec() time. That doesn't necessarily solve problems, but it may provide more information to allow them to be soluble. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin @ 2007-04-18 13:08 ` William Lee Irwin III 2007-04-18 19:48 ` Davide Libenzi 2007-04-18 14:48 ` Linus Torvalds 2 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 13:08 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 12:55:25AM -0500, Matt Mackall wrote: > Why are processes special? Should user A be able to get more CPU time > for his job than user B by splitting it into N parallel jobs? Should > we be fair per process, per user, per thread group, per session, per > controlling terminal? Some weighted combination of the preceding?[2] On a side note, I think a combination of all of the above is a very good idea, plus process groups (pgrp's). All the make -j loads should come up in one pgrp of one session for one user and hence should be automatically kept isolated in its own corner by such policies. Thread bombs, forkbombs, and so on get handled too, which is good when on e.g. a compileserver and someone rudely spawns too many tasks. Thinking of the scheduler as a CPU bandwidth allocator, this means handing out shares of CPU bandwidth to all users on the system, which in turn hand out shares of bandwidth to all sessions, which in turn hand out shares of bandwidth to all process groups, which in turn hand out shares of bandwidth to all thread groups, which in turn hand out shares of bandwidth to threads. The event handlers for the scheduler need not deal with this apart from task creation and exit and various sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.). They just determine what the scheduler sees as ->load_weight or some analogue of ->static_prio, though it is possible to do this by means of data structure organization instead of numerical prioritization. It'd probably have to be calculated on the fly by, say, doing fixpoint arithmetic something like user_share(p)*session_share(p)*pgrp_share(p)*tgrp_share(p)*task_share(p) so that readjusting the shares of aggregates doesn't have to traverse lists and remains O(1). Each of the share computations can instead just do some analogue of the calculation p->load_weight/rq->raw_weighted_load in fixpoint, though precision issues with this make me queasy. There is maybe a slight nasty point in that the ->raw_weighted_load analogue for users or whatever the highest level chosen is ends up being global. One might as well get users in there and omit intermediate levels if any are to be omitted so that the truly global state is as read-only as possible. I suppose jacking up the fixpoint precision to 128-bit or 256-bit all below the radix point (our max is 1.0 after all) until precision issues vanish can be done but the idea of that much number crunching in the scheduler makes me rather uncomfortable. I hope u64 or u32 [2] can be gotten away with as far as fixpoint goes. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
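A minimal sketch of the kind of fixpoint share computation described above, assuming 31 fractional bits (so the product of two shares still fits in 64 bits) and a plain "weight over total weight at this level" definition of each share. The structure, field and function names are invented for illustration, and the truncation in fix_mul() is exactly where the precision worries above come from:

	#include <stdint.h>

	#define FIX_SHIFT 31			/* all shares are <= 1.0 */

	struct level {
		uint32_t load_weight;		/* this entity's weight            */
		uint32_t level_weight;		/* total weight at this level, > 0 */
	};

	/* share of the parent's bandwidth, as a 0.31 fixpoint fraction */
	static uint64_t level_share(const struct level *l)
	{
		return ((uint64_t)l->load_weight << FIX_SHIFT) / l->level_weight;
	}

	/* multiply two 0.31 fractions; low bits are simply discarded */
	static uint64_t fix_mul(uint64_t a, uint64_t b)
	{
		return (a * b) >> FIX_SHIFT;
	}

	/*
	 * user_share * session_share * pgrp_share * tgrp_share * task_share,
	 * recomputed on the fly so that reweighting an aggregate stays O(1).
	 * lvl[] is ordered user, session, pgrp, tgrp, task.
	 */
	static uint64_t task_bandwidth(const struct level lvl[5])
	{
		uint64_t share = level_share(&lvl[0]);
		int i;

		for (i = 1; i < 5; i++)
			share = fix_mul(share, level_share(&lvl[i]));
		return share;	/* fraction of total CPU bandwidth, 0.31 fixpoint */
	}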
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 13:08 ` William Lee Irwin III @ 2007-04-18 19:48 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 19:48 UTC (permalink / raw) To: William Lee Irwin III Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, William Lee Irwin III wrote: > Thinking of the scheduler as a CPU bandwidth allocator, this means > handing out shares of CPU bandwidth to all users on the system, which > in turn hand out shares of bandwidth to all sessions, which in turn > hand out shares of bandwidth to all process groups, which in turn hand > out shares of bandwidth to all thread groups, which in turn hand out > shares of bandwidth to threads. The event handlers for the scheduler > need not deal with this apart from task creation and exit and various > sorts of process ID changes (e.g. setsid(), setpgrp(), setuid(), etc.). Yes, it really becomes a hierarchical problem once you consider user and processes. Top level sees a "user" can be scheduled (put itself on the virtual run queue), and passes the ball to the "process" scheduler inside the "user" container, down to maybe "threads". With all the "key" calculation parameters kept at each level (with up-propagation). - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:55 ` Matt Mackall 2007-04-18 6:37 ` Nick Piggin 2007-04-18 13:08 ` William Lee Irwin III @ 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall ` (2 more replies) 2 siblings, 3 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 14:48 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > > Why is X special? Because it does work on behalf of other processes? > Lots of things do this. Perhaps a scheduler should focus entirely on > the implicit and directed wakeup matrix and optimizing that > instead[1]. I 100% agree - the perfect scheduler would indeed take into account where the wakeups come from, and try to "weigh" processes that help other processes make progress more. That would naturally give server processes more CPU power, because they help others I don't believe for a second that "fairness" means "give everybody the same amount of CPU". That's a totally illogical measure of fairness. All processes are _not_ created equal. That said, even trying to do "fairness by effective user ID" would probably already do a lot. In a desktop environment, X would get as much CPU time as the user processes, simply because it's in a different protection domain (and that's really what "effective user ID" means: it's not about "users", it's really about "protection domains"). And "fairness by euid" is probably a hell of a lot easier to do than trying to figure out the wakeup matrix. Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds @ 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds ` (2 more replies) 2007-04-19 3:18 ` Nick Piggin 2007-04-21 13:40 ` Bill Davidsen 2 siblings, 3 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-18 15:23 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. For the record, you actually don't need to track a whole NxN matrix (or do the implied O(n**3) matrix inversion!) to get to the same result. You can converge on the same node weightings (ie dynamic priorities) by applying a damped function at each transition point (directed wakeup, preemption, fork, exit). The trouble with any scheme like this is that it needs careful tuning of the damping factor to converge rapidly and not oscillate and precise numerical attention to the transition functions so that the sum of dynamic priorities is conserved. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
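As a rough illustration of the shape of such a damped update (not Matt's actual scheme; only the directed-wakeup transition is shown, and the damping divisor and field names are assumptions), the idea is that each wakeup transfers a small, damped amount of dynamic weight from waker to wakee while keeping the total conserved:

	#define DAMP_DIV 64	/* transfer 1/64 of the weight per event */

	struct ent {
		long dyn_weight;	/* dynamic priority analogue */
	};

	static void wakeup_transfer(struct ent *waker, struct ent *wakee)
	{
		long delta = waker->dyn_weight / DAMP_DIV;

		/* conserve the sum of dynamic weights, as noted above */
		waker->dyn_weight -= delta;
		wakee->dyn_weight += delta;
	}

Tuning DAMP_DIV is the convergence/oscillation trade-off mentioned above.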
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall @ 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:49 ` Ingo Molnar ` (3 more replies) 2007-04-18 19:05 ` Davide Libenzi 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 4 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 17:22 UTC (permalink / raw) To: Matt Mackall Cc: Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > > On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > And "fairness by euid" is probably a hell of a lot easier to do than > > trying to figure out the wakeup matrix. > > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. I'm sure you can do things differently, but the reason I think "fairness by euid" is actually worth looking at is that it's pretty much the *identical* issue that we'll have with "fairness by virtual machine" and a number of other "container" issues. The fact is: - "fairness" is *not* about giving everybody the same amount of CPU time (scaled by some niceness level or not). Anybody who thinks that is "fair" is just being silly and hasn't thought it through. - "fairness" is multi-level. You want to be fair to threads within a thread group (where "process" may be one good approximation of what a "thread group" is, but not necessarily the only one). But you *also* want to be fair in between those "thread groups", and then you want to be fair across "containers" (where "user" may be one such container). So I claim that anything that cannot be fair by user ID is actually really REALLY unfair. I think it's absolutely humongously STUPID to call something the "Completely Fair Scheduler", and then just be fair on a thread level. That's not fair AT ALL! It's the anti-thesis of being fair! So if you have 2 users on a machine running CPU hogs, you should *first* try to be fair among users. If one user then runs 5 programs, and the other one runs just 1, then the *one* program should get 50% of the CPU time (the users fair share), and the five programs should get 10% of CPU time each. And if one of them uses two threads, each thread should get 5%. So you should see one thread get 50& CPU (single thread of one user), 4 threads get 10% CPU (their fair share of that users time), and 2 threads get 5% CPU (the fair share within that thread group!). Any scheduling argument that just considers the above to be "7 threads total" and gives each thread 14% of CPU time "fairly" is *anything* but fair. It's a joke if that kind of scheduler then calls itself CFS! And yes, that's largely what the current scheduler will do, but at least the current scheduler doesn't claim to be fair! So the current scheduler is a lot *better* if only in the sense that it doesn't make ridiculous claims that aren't true! Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
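The split described above is just each level dividing its share equally among its runnable children. A toy version, for equal nice levels only, with the struct and its fields invented for illustration:

	struct group {
		struct group *parent;	/* NULL at the top level            */
		int nr_running;		/* runnable children of this group  */
	};

	/* share of the whole machine for one thread, in millionths;
	 * g is the thread's immediate group (its process) */
	static long leaf_share(const struct group *g)
	{
		long share = 1000000;

		for (; g; g = g->parent)
			share /= g->nr_running;
		return share;
	}

Walking up through process, user and top level, this yields 1000000/1/1/2 = 50% for the lone user's single process, 1000000/1/5/2 = 10% for each single-threaded process of the busy user, and 1000000/2/5/2 = 5% for each thread of that user's two-threaded process, matching the figures above.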
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds @ 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 17:59 ` Ingo Molnar 2007-04-18 19:23 ` Linus Torvalds 2007-04-18 18:02 ` William Lee Irwin III ` (2 subsequent siblings) 3 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 17:49 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > The fact is: > > - "fairness" is *not* about giving everybody the same amount of CPU > time (scaled by some niceness level or not). Anybody who thinks > that is "fair" is just being silly and hasn't thought it through. yeah, very much so. But note that most of the reported CFS interactivity wins, as surprising as it might be, were due to fairness between _the same user's tasks_. In the typical case, 99% of the desktop CPU time is executed either as X (root user) or under the uid of the logged in user, and X is just one task. Even with a bad hack of making X super-high-prio, interactivity as experienced by users still sucks without having fairness between the other 100-200 user tasks that a desktop system is typically using. 'renicing X to -10' is a broken way of achieving: 'root uid should get its share of CPU time too, no matter how many user tasks are running'. We can do this much cleaner by saying: 'each uid, if it has any tasks running, should get its fair share of CPU time, independently of the number of tasks it is running'. In that sense 'fairness' is not global (and in fact it is almost _never_ a global property, as X runs under root uid [*]), it is only the most lowlevel scheduling machinery that can then be built upon. Higher-level controls to allocate CPU power between groups of tasks very much make sense - but according to the CFS interactivity test results i got from people so far, they very much need this basic fairness machinery _within_ the uid group too. So 'fairness' is still a powerful lower level scheduling concept. And this all makes lots of sense to me. One purpose of doing the hierarchical scheduling classes stuff was to enable such higher scope task group decisions too. Next i'll try to figure out whether 'task group bandwidth' logic should live right within sched_fair.c itself, or whether it should be layered separately as a sched_group.c. Intutively i'd say it should live within sched_fair.c. Ingo [*] There are distributions where X does not run under root uid anymore. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:49 ` Ingo Molnar @ 2007-04-18 17:59 ` Ingo Molnar 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:23 ` Linus Torvalds 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 17:59 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Ingo Molnar <mingo@elte.hu> wrote: > In that sense 'fairness' is not global (and in fact it is almost > _never_ a global property, as X runs under root uid [*]), it is only > the most lowlevel scheduling machinery that can then be built upon. > [...] perhaps a more fitting term would be 'precise group-scheduling'. Within the lowest level task group entity (be that thread group or uid group, etc.) 'precise scheduling' is equivalent to 'fairness'. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:59 ` Ingo Molnar @ 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:43 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 19:40 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > > perhaps a more fitting term would be 'precise group-scheduling'. Within > the lowest level task group entity (be that thread group or uid group, > etc.) 'precise scheduling' is equivalent to 'fairness'. Yes. Absolutely. Except I think that at least if you're going to name somethign "complete" (or "perfect" or "precise"), you should also admit that groups can be hierarchical. The "threads in a process" thing is a great example of a hierarchical group. Imagine if X was running as a collection of threads - then each server thread would no longer be more important than the clients! But if you have a mix of "bags of threads" and "single process" kind applications, then very arguably the single thread in a single traditional process should get as much time as the "bag of threads" process gets total. So it really should be a hierarchical notion, where each thread is owned by one "process", and each process is owned by one "user", and each user is in one "virtual machine" - there's at least three different levels to this, and you'd want to schedule this thing top-down: virtual machines should be given CPU time "fairly" (which doesn't need to mean "equally", of course - nice-values could very well work at that level too), and then within each virtual machine users or "scheduling groups" should be scheduled fairly, and then within each scheduling group the processes should be scheduled, and within each process threads should equally get their fair share at _that_ level. And no, I don't think we necessarily need to do something quite that elaborate. But I think that's the kind of "obviously good goal" to keep in mind. Can we perhaps _approximate_ something like that by other means? For example, maybe we can approximate it by spreading out the statistics: right now you have things like - last_ran, wait_runtime, sum_wait_runtime.. be per-thread things. Maybe some of those can be spread out, so that you put a part of them in the "struct vm_struct" thing (to approximate processes), part of them in the "struct user" struct (to approximate the user-level thing), and part of it in a per-container thing for when/if we support that kind of thing? IOW, I don't think the scheduling "groups" have to be explicit boxes or anything like that. I suspect you can make do with just heurstics that penalize the same "struct user" and "struct vm_struct" to get overly much scheduling time, and you'll get the same _effect_. And I don't think it's wrong to look at the "one hundred processes by the same user" case as being an important case. But it should not be the *only* case or even necessarily the *main* case that matters. 
I think a benchmark that literally does

	pid_t pid = fork();
	if (pid < 0)
		exit(1);
	if (pid) {
		if (setuid(500) < 0)
			exit(2);
		for (;;)
			/* Do nothing */;
	}
	if (setuid(501) < 0)
		exit(3);
	fork();
	for (;;)
		/* Do nothing in two processes */;

and I think that it's a really valid benchmark: if the scheduler gives
25% of time to each of the two processes of user 501, and 50% to user
500, then THAT is a good scheduler.

If somebody wants to actually write and test the above as a test-script,
and add it to a collection of scheduler tests, I think that could be a
good thing.

Linus

^ permalink raw reply [flat|nested] 304+ messages in thread
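Filled out so that it actually builds, the fragment above becomes something like the following (a sketch following the mail's description, not an official test; it has to be started as root so the setuid() calls succeed, and uids 500/501 are simply the ones from the sketch):

	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/types.h>

	int main(void)
	{
		pid_t pid = fork();

		if (pid < 0)
			exit(1);
		if (pid) {
			/* parent: one CPU hog running as uid 500 */
			if (setuid(500) < 0)
				exit(2);
			for (;;)
				;
		}
		/* child: two CPU hogs running as uid 501 */
		if (setuid(501) < 0)
			exit(3);
		fork();
		for (;;)
			;
	}

On a single CPU, a per-user fair scheduler should show the uid 500 hog at about 50% and each uid 501 hog at about 25% in top, while a purely per-thread fair scheduler would show roughly 33% each.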
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:40 ` Linus Torvalds @ 2007-04-18 19:43 ` Ingo Molnar 2007-04-18 20:07 ` Davide Libenzi 2007-04-18 21:04 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 19:43 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > For example, maybe we can approximate it by spreading out the > statistics: right now you have things like > > - last_ran, wait_runtime, sum_wait_runtime.. > > be per-thread things. [...] yes, yes, yes! :) My thinking is "struct sched_group" embedded into _arbitrary_ other resource containers and abstractions, which sched_group's are then in a simple hierarchy and are driven by the core scheduling machinery. > [...] Maybe some of those can be spread out, so that you put a part of > them in the "struct vm_struct" thing (to approximate processes), part > of them in the "struct user" struct (to approximate the user-level > thing), and part of it in a per-container thing for when/if we support > that kind of thing? yes. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-18 19:40 ` Linus Torvalds
2007-04-18 19:43 ` Ingo Molnar
@ 2007-04-18 20:07 ` Davide Libenzi
2007-04-18 21:48 ` Ingo Molnar
2007-04-18 21:04 ` Ingo Molnar
2 siblings, 1 reply; 304+ messages in thread
From: Davide Libenzi @ 2007-04-18 20:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III,
Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Wed, 18 Apr 2007, Linus Torvalds wrote:

> For example, maybe we can approximate it by spreading out the statistics:
> right now you have things like
>
> - last_ran, wait_runtime, sum_wait_runtime..
>
> be per-thread things. Maybe some of those can be spread out, so that you
> put a part of them in the "struct vm_struct" thing (to approximate
> processes), part of them in the "struct user" struct (to approximate the
> user-level thing), and part of it in a per-container thing for when/if we
> support that kind of thing?

I think Ingo's idea of a new sched_group to contain the generic
parameters needed for the "key" calculation works better than adding
more fields to existing structures (which would, of course, host
pointers to it). Otherwise I can already see the struct_signal being
the target for other unrelated fields :)

- Davide

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 20:07 ` Davide Libenzi @ 2007-04-18 21:48 ` Ingo Molnar 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 6:52 ` Mike Galbraith 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 21:48 UTC (permalink / raw) To: Davide Libenzi Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Davide Libenzi <davidel@xmailserver.org> wrote: > I think Ingo's idea of a new sched_group to contain the generic > parameters needed for the "key" calculation, works better than adding > more fields to existing strctures (that would, of course, host > pointers to it). Otherwise I can already the the struct_signal being > the target for other unrelated fields :) yeah. Another detail is that for global containers like uids, the statistics will have to be percpu_alloc()-ed, both for correctness (runqueues are per CPU) and for performance. That's one reason why i dont think it's necessarily a good idea to group-schedule threads, we dont really want to do a per thread group percpu_alloc(). In fact for threads the _reverse_ problem exists, threaded apps tend to _strive_ for more performance - hence their desperation of using the threaded programming model to begin with ;) (just think of media playback apps which are typically multithreaded) I dont think threads are all that different. Also, the resource-conserving act of using CLONE_VM to share the VM (and to use a different programming environment like Java) should not be 'punished' by forcing the thread group to be accounted as a single, shared entity against other 'fat' tasks. so my current impression is that we want per UID accounting to solve the X problem, the kernel threads problem and the many-users problem, but i'd not want to do it for threads just yet because for them there's not really any apparent problem to be solved. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 21:48 ` Ingo Molnar @ 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 17:39 ` Bernd Eckenfels 2007-04-19 6:52 ` Mike Galbraith 1 sibling, 2 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > That's one reason why i dont think it's necessarily a good idea to > group-schedule threads, we dont really want to do a per thread group > percpu_alloc(). I still do not have clear how much overhead this will bring into the table, but I think (like Linus was pointing out) the hierarchy should look like: Top (VCPU maybe?) User Process Thread The "run_queue" concept (and data) that now is bound to a CPU, need to be replicated in: ROOT <- VCPUs add themselves here VCPU <- USERs add themselves here USER <- PROCs add themselves here PROC <- THREADs add themselves here THREAD (ultimate fine grained scheduling unit) So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking up a new task would mean: VCPU = ROOT->lookup(); USER = VCPU->lookup(); PROC = USER->lookup(); THREAD = PROC->lookup(); Run-time statistics should propagate back the other way around. > In fact for threads the _reverse_ problem exists, threaded apps tend to > _strive_ for more performance - hence their desperation of using the > threaded programming model to begin with ;) (just think of media > playback apps which are typically multithreaded) The same user nicing two different multi-threaded processes would expect a predictable CPU distribution too. Doing that efficently (the old per-cpu run-queue is pretty nice from many POVs) is the real challenge. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
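A compressed sketch of that lookup chain, with each level keeping its own run queue of child entities; the types and the ->pick hook are illustrative only and are not taken from any posted patch:

	struct rq;				/* per-level run queue */

	struct sched_entity {
		struct rq *children;		/* run queue of the level below,
						   NULL for a THREAD */
		struct sched_entity *(*pick)(struct rq *rq);
	};

	/* ROOT -> VCPU -> USER -> PROC -> THREAD;
	 * run-time statistics then propagate back up the same chain */
	static struct sched_entity *pick_next_entity(struct sched_entity *root)
	{
		struct sched_entity *e = root;

		while (e->children)
			e = e->pick(e->children);
		return e;			/* the thread to run */
	}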
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 23:30 ` Davide Libenzi @ 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 15:43 ` Davide Libenzi 2007-04-21 14:09 ` Bill Davidsen 2007-04-19 17:39 ` Bernd Eckenfels 1 sibling, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 8:00 UTC (permalink / raw) To: Davide Libenzi Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Davide Libenzi <davidel@xmailserver.org> wrote: > > That's one reason why i dont think it's necessarily a good idea to > > group-schedule threads, we dont really want to do a per thread group > > percpu_alloc(). > > I still do not have clear how much overhead this will bring into the > table, but I think (like Linus was pointing out) the hierarchy should > look like: > > Top (VCPU maybe?) > User > Process > Thread > > The "run_queue" concept (and data) that now is bound to a CPU, need to be > replicated in: > > ROOT <- VCPUs add themselves here > VCPU <- USERs add themselves here > USER <- PROCs add themselves here > PROC <- THREADs add themselves here > THREAD (ultimate fine grained scheduling unit) > > So ROOT, VCPU, USER and PROC will have their own "run_queue". Picking > up a new task would mean: > > VCPU = ROOT->lookup(); > USER = VCPU->lookup(); > PROC = USER->lookup(); > THREAD = PROC->lookup(); > > Run-time statistics should propagate back the other way around. yeah, but this looks quite bad from an overhead POV ... i think we can do alot simpler to solve X and kernel threads prioritization. > > In fact for threads the _reverse_ problem exists, threaded apps tend > > to _strive_ for more performance - hence their desperation of using > > the threaded programming model to begin with ;) (just think of media > > playback apps which are typically multithreaded) > > The same user nicing two different multi-threaded processes would > expect a predictable CPU distribution too. [...] i disagree that the user 'would expect' this. Some users might. Others would say: 'my 10-thread rendering engine is more important than a 1-thread job because it's using 10 threads for a reason'. And the CFS feedback so far strengthens this point: the default behavior of treating the thread as a single scheduling (and CPU time accounting) unit works pretty well on the desktop. think about it in another, 'kernel policy' way as well: we'd like to _encourage_ more parallel user applications. Hurting them by accounting all threads together sends the exact opposite message. > [...] Doing that efficently (the old per-cpu run-queue is pretty nice > from many POVs) is the real challenge. yeah. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-19 8:00 ` Ingo Molnar
@ 2007-04-19 15:43 ` Davide Libenzi
2007-04-21 14:09 ` Bill Davidsen
0 siblings, 0 replies; 304+ messages in thread
From: Davide Libenzi @ 2007-04-19 15:43 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III,
Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey,
Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner

On Thu, 19 Apr 2007, Ingo Molnar wrote:

> i disagree that the user 'would expect' this. Some users might. Others
> would say: 'my 10-thread rendering engine is more important than a
> 1-thread job because it's using 10 threads for a reason'. And the CFS
> feedback so far strengthens this point: the default behavior of treating
> the thread as a single scheduling (and CPU time accounting) unit works
> pretty well on the desktop.
>
> think about it in another, 'kernel policy' way as well: we'd like to
> _encourage_ more parallel user applications. Hurting them by accounting
> all threads together sends the exact opposite message.

There are counter arguments too. Like, not every user knows if a
certain process is MT or not. I agree though that doing accounting and
fairness at a depth lower than USER is messy, and not only for
performance.

- Davide

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 8:00 ` Ingo Molnar 2007-04-19 15:43 ` Davide Libenzi @ 2007-04-21 14:09 ` Bill Davidsen 1 sibling, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-21 14:09 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Davide Libenzi <davidel@xmailserver.org> wrote: >> The same user nicing two different multi-threaded processes would >> expect a predictable CPU distribution too. [...] > > i disagree that the user 'would expect' this. Some users might. Others > would say: 'my 10-thread rendering engine is more important than a > 1-thread job because it's using 10 threads for a reason'. And the CFS > feedback so far strengthens this point: the default behavior of treating > the thread as a single scheduling (and CPU time accounting) unit works > pretty well on the desktop. > If by desktop you mean "one and only one interactive user," that's true. On a shared machine it's very hard to preserve any semblance of fairness when one user gets far more than another, based not on the value of what they're doing but the tools they use to to it. > think about it in another, 'kernel policy' way as well: we'd like to > _encourage_ more parallel user applications. Hurting them by accounting > all threads together sends the exact opposite message. > Why is that? There are lots of things which are intrinsically single threaded, how are we hurting hurting multi-threaded applications by refusing to give them more CPU than an application running on behalf of another user? By accounting all threads together we encourage writing an application in the most logical way. Threads are a solution, not a goal in themselves. >> [...] Doing that efficently (the old per-cpu run-queue is pretty nice >> from many POVs) is the real challenge. > > yeah. > > Ingo -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 23:30 ` Davide Libenzi 2007-04-19 8:00 ` Ingo Molnar @ 2007-04-19 17:39 ` Bernd Eckenfels 1 sibling, 0 replies; 304+ messages in thread From: Bernd Eckenfels @ 2007-04-19 17:39 UTC (permalink / raw) To: linux-kernel In article <Pine.LNX.4.64.0704181515290.25880@alien.or.mcafeemobile.com> you wrote: > Top (VCPU maybe?) > User > Process > Thread The problem with that is, that not all Schedulers might work on the User level. You can think of Batch/Job, Parent, Group, Session or namespace level. That would be IMHO a generic Top, with no need for a level above. Greetings Bernd ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 21:48 ` Ingo Molnar 2007-04-18 23:30 ` Davide Libenzi @ 2007-04-19 6:52 ` Mike Galbraith 2007-04-19 7:09 ` Ingo Molnar 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 2 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-19 6:52 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote: > so my current impression is that we want per UID accounting to solve the > X problem, the kernel threads problem and the many-users problem, but > i'd not want to do it for threads just yet because for them there's not > really any apparent problem to be solved. If you really mean UID vs EUID as Linus mentioned, I suppose I could learn to login as !root, and set KDE up to always give me root shells. With a heavily reniced X (perfectly fine), that should indeed solve my daily usage pattern nicely (always need godmode for shells, but not for mozilla and ilk. 50/50 split automatic without renice of entire gui) -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:52 ` Mike Galbraith @ 2007-04-19 7:09 ` Ingo Molnar 2007-04-19 7:32 ` Mike Galbraith 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 7:09 UTC (permalink / raw) To: Mike Galbraith; +Cc: linux-kernel * Mike Galbraith <efault@gmx.de> wrote: > With a heavily reniced X (perfectly fine), that should indeed solve my > daily usage pattern nicely (always need godmode for shells, but not > for mozilla and ilk. 50/50 split automatic without renice of entire > gui) how about the first-approximation solution i suggested in the previous mail: to add a per UID default nice level? (With this default defaulting to '-10' for all root-owned processes, and defaulting to '0' for everything else.) That would solve most of the current CFS regressions at hand. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
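Purely as a sketch of what that first approximation could look like (nothing like this exists in the posted patch, and any per-uid override mechanism on top of it is an assumption):

	/* default nice level a freshly created task inherits from its uid */
	static int uid_default_nice(uid_t uid)
	{
		if (uid == 0)
			return -10;	/* root-owned processes */
		return 0;		/* everything else */
	}

A per-uid override stored alongside struct user would then let administrators adjust the default for individual users.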
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]
2007-04-19 7:09 ` Ingo Molnar
@ 2007-04-19 7:32 ` Mike Galbraith
2007-04-19 16:55 ` Davide Libenzi
0 siblings, 1 reply; 304+ messages in thread
From: Mike Galbraith @ 2007-04-19 7:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote:

> * Mike Galbraith <efault@gmx.de> wrote:
>
> > With a heavily reniced X (perfectly fine), that should indeed solve my
> > daily usage pattern nicely (always need godmode for shells, but not
> > for mozilla and ilk. 50/50 split automatic without renice of entire
> > gui)
>
> how about the first-approximation solution i suggested in the previous
> mail: to add a per UID default nice level? (With this default defaulting
> to '-10' for all root-owned processes, and defaulting to '0' for
> everything else.) That would solve most of the current CFS regressions
> at hand.

That would make my kernel builds etc. interfere with my other self's
surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the
X portion of my Joe-User activity pushes the compile portion of root
down in bandwidth utilization automagically, which is exactly the right
thing, because the root me is not as important as the Joe-User me using
the GUI at that time. If the idea of X disturbing root upsets some,
they can move X to another UID. Generally, it seems perfect for here.

-Mike

^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 7:32 ` Mike Galbraith @ 2007-04-19 16:55 ` Davide Libenzi 2007-04-20 5:16 ` Mike Galbraith 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-19 16:55 UTC (permalink / raw) To: Mike Galbraith; +Cc: Ingo Molnar, linux-kernel On Thu, 19 Apr 2007, Mike Galbraith wrote: > On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote: > > * Mike Galbraith <efault@gmx.de> wrote: > > > > > With a heavily reniced X (perfectly fine), that should indeed solve my > > > daily usage pattern nicely (always need godmode for shells, but not > > > for mozilla and ilk. 50/50 split automatic without renice of entire > > > gui) > > > > how about the first-approximation solution i suggested in the previous > > mail: to add a per UID default nice level? (With this default defaulting > > to '-10' for all root-owned processes, and defaulting to '0' for > > everything else.) That would solve most of the current CFS regressions > > at hand. > > That would make my kernel builds etc interfere with my other self's > surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the > X portion of my Joe-User activity pushes the compile portion of root > down in bandwidth utilization automagically, which is exactly the right > thing, because the root me is not as important as the Joe-User me using > the GUI at that time. If the idea of X disturbing root upsets some, > they can move X to another UID. Generally, it seems perfect for here. Now guys, I did not follow the whole lengthy and feisty thread, but IIRC Con's scheduler has been attacked because, among other arguments, it was requiring X to be reniced. This happened like a month ago IINM. I did not have time to look at Con's scheduler, and I only had a brief look at Ingo's one (looks very promising IMO, but so was the initial O(1) post before all the corner-cases fixes went in). But this is not about technical merit, this is about applying the same rules of judgement to others as well to ourselves. We went from a "renicing X to -10 is bad because the scheduler should be able to correctly handle the problem w/out additional external plugs" to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads class, on top of all the tasks owned by root" [1]. From a spectator POV like myself in this case, this looks rather "unfair". [1] I think, before and now, that that's more a duct tape patch than a real solution. OTOH if the "solution" is gonna be another maze of macros and heuristics filled with pretty bad corner cases, I may prefer the former. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 16:55 ` Davide Libenzi @ 2007-04-20 5:16 ` Mike Galbraith 0 siblings, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-20 5:16 UTC (permalink / raw) To: Davide Libenzi; +Cc: Ingo Molnar, linux-kernel On Thu, 2007-04-19 at 09:55 -0700, Davide Libenzi wrote: > On Thu, 19 Apr 2007, Mike Galbraith wrote: > > > On Thu, 2007-04-19 at 09:09 +0200, Ingo Molnar wrote: > > > * Mike Galbraith <efault@gmx.de> wrote: > > > > > > > With a heavily reniced X (perfectly fine), that should indeed solve my > > > > daily usage pattern nicely (always need godmode for shells, but not > > > > for mozilla and ilk. 50/50 split automatic without renice of entire > > > > gui) > > > > > > how about the first-approximation solution i suggested in the previous > > > mail: to add a per UID default nice level? (With this default defaulting > > > to '-10' for all root-owned processes, and defaulting to '0' for > > > everything else.) That would solve most of the current CFS regressions > > > at hand. > > > > That would make my kernel builds etc interfere with my other self's > > surfing and whatnot. With it by EUID, when I'm surfing or whatnot, the > > X portion of my Joe-User activity pushes the compile portion of root > > down in bandwidth utilization automagically, which is exactly the right > > thing, because the root me is not as important as the Joe-User me using > > the GUI at that time. If the idea of X disturbing root upsets some, > > they can move X to another UID. Generally, it seems perfect for here. > > Now guys, I did not follow the whole lengthy and feisty thread, but IIRC > Con's scheduler has been attacked because, among other arguments, it was > requiring X to be reniced. This happened like a month ago IINM. I don't object to renicing X if you want it to receive _more_ than its fair share. I do object to having to renice X in order for it to _get_ its fair share. That's what I attacked. > I did not have time to look at Con's scheduler, and I only had a brief > look at Ingo's one (looks very promising IMO, but so was the initial O(1) > post before all the corner-cases fixes went in). > But this is not about technical merit, this is about applying the same > rules of judgement to others as well to ourselves. I'm running the same tests with CFS that I ran for RSDL/SD. It falls short in one key area (to me) in that X+client cannot yet split my box 50/50 with two concurrent tasks. In the CFS case, renicing both X and client does work, but it should not be necessary IMHO. With RSDL/SD renicing didn't help. > We went from a "renicing X to -10 is bad because the scheduler should > be able to correctly handle the problem w/out additional external plugs" > to a totally opposite "let's renice -10 X, the whole SCHED_NORMAL kthreads > class, on top of all the tasks owned by root" [1]. > From a spectator POV like myself in this case, this looks rather "unfair". Well, for me, the renicing I mentioned above is only interesting as a way to improve long term fairness with schedulers with no history. I found Linus' EUID idea intriguing in that by putting the server together with a steady load in one 'fair' domain, and clients in another, X can, if prioritized to empower it to do so, modulate the steady load in its domain (but can't starve it!), the clients modulate X, and the steady load gets it all when X and clients are idle.
The nice level of X determines to what _extent_ X can modulate the constant load rather like a mixer slider. The synchronous (I'm told) nature of X/client then becomes kind of an asset to the desktop instead of a liability. The specific case I was thinking about is the X+Gforce test where both RSDL and CFS fail to provide fairness (as defined by me;). X and Gforce are mostly not concurrent. The make -j2 I put them up against are mostly concurrent. I don't call giving 1/3 of my CPU to X+Client fair at _all_, but that's what you'll get if your fairstick of the instant generally can't see the fourth competing task. Seemed pretty cool to me because it creates the missing connection between client and server, though also likely complicated (and maybe full of perils, who knows). -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
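To spell out the arithmetic behind Mike's 1/3 figure: if X and its client mostly alternate (so only one of the pair is runnable at any instant) while make -j2 keeps two jobs runnable, a scheduler that is fair only among the currently runnable tasks sees three equal tasks and splits the CPU three ways. A back-of-the-envelope sketch, assuming perfect alternation and everything CPU-bound:

#include <stdio.h>

/* Toy model of the X+Gforce vs. make -j2 case: X and the client
 * alternate (only one of the two is runnable at a time), while the two
 * compile jobs are always runnable.  A per-task fair scheduler that
 * only looks at the runnable set splits the CPU three ways at every
 * instant. */
int main(void)
{
    double per_runnable = 1.0 / 3.0;      /* 3 runnable tasks at any instant */
    double gui_total    = per_runnable;   /* X and client share one "slot"   */
    double make_total   = 2 * per_runnable;

    printf("X + client together: %.0f%%\n", gui_total * 100);
    printf("make -j2 together:   %.0f%%\n", make_total * 100);
    /* ...whereas the 50/50 split asked for above would give each side 50%. */
    return 0;
}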
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:52 ` Mike Galbraith 2007-04-19 7:09 ` Ingo Molnar @ 2007-04-19 7:14 ` Mike Galbraith 1 sibling, 0 replies; 304+ messages in thread From: Mike Galbraith @ 2007-04-19 7:14 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Linus Torvalds, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Thu, 2007-04-19 at 08:52 +0200, Mike Galbraith wrote: > On Wed, 2007-04-18 at 23:48 +0200, Ingo Molnar wrote: > > > so my current impression is that we want per UID accounting to solve the > > X problem, the kernel threads problem and the many-users problem, but > > i'd not want to do it for threads just yet because for them there's not > > really any apparent problem to be solved. > > If you really mean UID vs EUID as Linus mentioned, I suppose I could > learn to login as !root, and set KDE up to always give me root shells. > > With a heavily reniced X (perfectly fine), that should indeed solve my > daily usage pattern nicely (always need godmode for shells, but not for > mozilla and ilk. 50/50 split automatic without renice of entire gui) Backward, needs to be EUID as Linus suggested. Kernel builds etc along with reniced X in root's bucket, surfing and whatnot in Joe-User's bucket. -Mike ^ permalink raw reply [flat|nested] 304+ messages in thread
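For reference on the UID versus EUID distinction being corrected here: the effective uid is what defines the protection domain a task is currently acting in, and it is what a per-"user" scheduling bucket would most naturally key on. A trivial illustration using the standard syscalls, nothing scheduler-specific:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* Print the real and effective uid of the current process.  For a
 * setuid binary, or after seteuid(), the two differ; grouping tasks
 * "per user" by euid groups them by the protection domain they are
 * acting in rather than by who logged in. */
int main(void)
{
    printf("uid  = %d\n", (int)getuid());
    printf("euid = %d\n", (int)geteuid());
    return 0;
}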
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:40 ` Linus Torvalds 2007-04-18 19:43 ` Ingo Molnar 2007-04-18 20:07 ` Davide Libenzi @ 2007-04-18 21:04 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 21:04 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > perhaps a more fitting term would be 'precise group-scheduling'. > > Within the lowest level task group entity (be that thread group or > > uid group, etc.) 'precise scheduling' is equivalent to 'fairness'. > > Yes. Absolutely. Except I think that at least if you're going to name > somethign "complete" (or "perfect" or "precise"), you should also > admit that groups can be hierarchical. yes. Am i correct to sum up your impression as: " Ingo, for you the hierarchy still appears to be an after-thought, while in practice it's easily the most important thing! Why are you so hung up about 'fairness', it makes no sense!" right? and you would definitely be right if you suggested that i neglected the 'group scheduling' aspects of CFS (except for a minimalistic nice level implementation, which is a poor-man's-non-automatic-group-scheduling), but i very much know its important and i'll definitely fix it for -v4. But please let me explain my reasons for my different focus: yes, group scheduling in practice is the most important first-layer thing, and without it any of the other 'CFS wins' can easily be useless. Firstly, i have not neglected the group scheduling related CFS regressions at all, mainly because there _is_ already a quick hack to check whether group scheduling would solve these regressions: renice. And it was tried in both of the two CFS regression cases i'm aware of: Mike's X starvation problem and Willy's "kevents starvation with thousands of scheddos tasks running" problem. And in both cases, applying the renice hack [which should be properly and automatically implemented as uid group scheduling] fixed the regression for them! So i was not worried at all, group scheduling _provably solves_ these CFS regressions. I rather concentrated on the CFS regressions that were much less clear. But PLEASE believe me: even with perfect cross-group CPU allocation but with a simple non-heuristic scheduler underlying it, you can _easily_ get a sucky desktop experience! I know it because i tried it and others tried it too. (in fact the first version of sched_fair.c was tick based and low-res, and it sucked) Two more things were needed: - the high precision of nsec/64-bit accounting ('reliability of scheduling') - extremely even time-distribution of CPU power ('determinism/smoothness, human perception') (i'm expanding on these two concepts further below) take out any of these and group scheduling or not, you are easily going to have a sucky desktop! (We know that from years of experiments: many people tried to rip out the unfairness from the scheduler and there were always nasty corner cases that 'should' have worked but didnt.) Without these we'd in essence start again at square one, just at a different square, this time with another group of people being irritated! But the biggest and hardest to achieve _wins_ of CFS are _NOT_ achieved via a simple 'get rid of the unfairness of the upstream scheduler and apply group scheduling'. 
(I know that because i tried it before and because others tried it before, for many many years.) You will _easily_ get sucky desktop experience. The other two things are very much needed too: - the high precision of nsec/64-bit accounting, and the many corner-cases this solves. (For example on a typical desktop there are _lots_ of timing-driven workloads that are in essence 'invisible' to low-resolution, timer-tick based accounting and are heavily skewed.) - extremely even time-distribution of CPU power. CFS behaves pretty well even under the dreaded 'make -jN in an xterm' kernel build workload as reported by Mark Lord, because it also distributes CPU power in a _finegrained_ way. A shell prompt under CFS still behaves acceptably on a single-CPU testbox of mine with a "make -j50" workload. (yes, fifty) Humans react alot more negatively to sudden changes in application behavior ('lags', pauses, short hangs) than they react to fine, gradual, all-encompassing slowdowns. This is a key property of CFS. ( Otherwise renicing X to -10 would have solved most of the interactivity complaints against the vanilla scheduler, otherwise renicing X to -10 would have fixed Mike's setup under SD (it didnt) while it worked much better under CFS, otherwise Gene wouldnt have found CFS markedly better than SD, etc., etc. So getting rid of the heuristics is less than 50% of the road to the perfect desktop scheduler. ) and i claim that these were the really hard bits, and i spent most of the CFS coding only on getting _these_ details 100% right under various workloads, and it makes a night and day difference _even without any group scheduling help_. and note another reason here: group scheduling _masks_ many other scheduling deficiencies that are possible in scheduler. So since CFS doesnt do group scheduling, i get a _fuller_ picture of the behavior of the core "precise scheduling" engine. At the initial stage i didnt want to hide bugs by masking them via group scheduling, especially because the renice workaround/hack was available. Guess how nice it all will get if we also add group scheduling to the mix, and people wouldnt have to add nasty and fragile renice based hacks, it will 'just work' out of box? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
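The renice workaround Ingo keeps referring to is just an ordinary nice adjustment applied by hand, e.g. 'renice -10 -p <pid of X>' run as root. A minimal programmatic sketch of the same thing with setpriority(2), assuming the X server's pid is passed on the command line:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Renice one process to -10: the stopgap discussed above for giving X
 * more CPU than a strictly task-fair scheduler would hand it.  Lowering
 * the nice value requires root. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);

    if (setpriority(PRIO_PROCESS, pid, -10) != 0) {
        perror("setpriority");
        return 1;
    }
    return 0;
}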
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 17:59 ` Ingo Molnar @ 2007-04-18 19:23 ` Linus Torvalds 2007-04-18 19:56 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 19:23 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Ingo Molnar wrote: > > But note that most of the reported CFS interactivity wins, as surprising > as it might be, were due to fairness between _the same user's tasks_. And *ALL* of the CFS interactivity *losses* and complaints have been because it did the wrong thing _between different users' tasks_ So what's your point? Your point was that when people try it out as a single user, it is indeed fair. But that's no point at all, since it totally missed _my_ point. The problems with X scheduling are exactly that "other user" kind of thing. The problem with kernel thread starvation due to user threads getting all the CPU time is exactly the same issue. As long as you think that all threads are equal, and should be treated equally, you CANNOT make it work well. People can say "ok, you can renice X", but the whole problem stems from the fact that you're trying to be fair based on A TOTALLY INVALID NOTION of what "fair" is. > In the typical case, 99% of the desktop CPU time is executed either as X > (root user) or under the uid of the logged in user, and X is just one > task. So? You are ignoring the argument again. You're totally bringing up a red herring: > Even with a bad hack of making X super-high-prio, interactivity as > experienced by users still sucks without having fairness between the > other 100-200 user tasks that a desktop system is typically using. I didn't say that you should be *unfair* within one user group. What kind of *idiotic* argument are you trying to put forth? OF COURSE you should be fair "within the user group". Nobody contests that the "other 100-200 user tasks" should be scheduled fairly _amongst themselves_. The only point I had was that you cannot just lump all threads together and say "these threads are equally important". The 100-200 user tasks may be equally important, and should get equal amounts of preference, but that has absolutely _zero_ bearing on the _single_ task run in another "scheduling group", ie by other users or by X. I'm not arguing against fairness. I'm arguing against YOUR notion of fairness, which is obviously bogus. It is *not* fair to try to give out CPU time evenly! Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:23 ` Linus Torvalds @ 2007-04-18 19:56 ` Davide Libenzi 2007-04-18 20:11 ` Linus Torvalds 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 19:56 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Linus Torvalds wrote: > I'm not arguing against fairness. I'm arguing against YOUR notion of > fairness, which is obviously bogus. It is *not* fair to try to give out > CPU time evenly! "Perhaps on the rare occasion pursuing the right course demands an act of unfairness, unfairness itself can be the right course?" - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 19:56 ` Davide Libenzi @ 2007-04-18 20:11 ` Linus Torvalds 2007-04-19 0:22 ` Davide Libenzi 0 siblings, 1 reply; 304+ messages in thread From: Linus Torvalds @ 2007-04-18 20:11 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Davide Libenzi wrote: > > "Perhaps on the rare occasion pursuing the right course demands an act of > unfairness, unfairness itself can be the right course?" I don't think that's the right issue. It's just that "fairness" != "equal". Do you think it "fair" to pay everybody the same regardless of how good a job they do? I don't think anybody really believes that. Equating "fair" and "equal" is simply a very fundamental mistake. They're not the same thing. Never have been, and never will. Now, there's no question that "equal" is much easier to implement, if only because it's a lot easier to agree what it means. "Equal parts" is something everybody can agree on. "Fair parts" automatically involves a balancing act, and people will invariably count things differently and thus disagree about what is "fair" and what is not. I don't think we can ever get a "perfect" setup for that reason, but I think we can get something that at least gets reasonably close, at least for the obvious cases. So my suggested test-case of running one process as one user and two processes as another one has a fairly "obviously correct" solution if you have just one CPU, and you can probably be pretty fair in practice on two CPU's (there's an obvious theoretical solution, whether you can get there with a practical algorithm is another thing). On three or more CPU's, you obviously wouldn't even *want* to be fair, since you can very naturally just give a CPU to each.. Linus ^ permalink raw reply [flat|nested] 304+ messages in thread
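A small sketch of the hierarchical split being described, assuming equal weights at every level and a single CPU, purely for illustration: divide the CPU equally among users, each user's share equally among that user's processes, and each process's share among its threads. Linus's one-process-vs-two-processes test case then comes out 50/25/25, and the one-vs-five-processes example quoted elsewhere in the thread comes out 50/10/10/10/10/5/5:

#include <stdio.h>

/* Split 100% of one CPU equally among users, then among each user's
 * processes, then among each process's threads.  Equal weights
 * everywhere, everybody CPU-bound, one CPU. */
static void split(int nusers, const int procs_per_user[],
                  const int threads_per_proc[][8])
{
    double user_share = 1.0 / nusers;
    for (int u = 0; u < nusers; u++) {
        double proc_share = user_share / procs_per_user[u];
        for (int p = 0; p < procs_per_user[u]; p++) {
            int nthreads = threads_per_proc[u][p];
            double thread_share = proc_share / nthreads;
            for (int t = 0; t < nthreads; t++)
                printf("user %d proc %d thread %d: %5.1f%%\n",
                       u, p, t, thread_share * 100.0);
        }
    }
}

int main(void)
{
    /* One user runs 1 process, the other runs 2: 50 / 25 / 25. */
    const int procs_a[]      = { 1, 2 };
    const int threads_a[][8] = { { 1 }, { 1, 1 } };
    split(2, procs_a, threads_a);

    /* One user runs 1 process, the other runs 5, the fifth with two
     * threads: 50 / 10 / 10 / 10 / 10 / 5 / 5. */
    const int procs_b[]      = { 1, 5 };
    const int threads_b[][8] = { { 1 }, { 1, 1, 1, 1, 2 } };
    split(2, procs_b, threads_b);

    return 0;
}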
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 20:11 ` Linus Torvalds @ 2007-04-19 0:22 ` Davide Libenzi 2007-04-19 0:30 ` Linus Torvalds 0 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-19 0:22 UTC (permalink / raw) To: Linus Torvalds Cc: Davide Libenzi, Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Linus Torvalds wrote: > On Wed, 18 Apr 2007, Davide Libenzi wrote: > > > > "Perhaps on the rare occasion pursuing the right course demands an act of > > unfairness, unfairness itself can be the right course?" > > I don't think that's the right issue. > > It's just that "fairness" != "equal". > > Do you think it "fair" to pay everybody the same regardless of how good a > job they do? I don't think anybody really believes that. > > Equating "fair" and "equal" is simply a very fundamental mistake. They're > not the same thing. Never have been, and never will. I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :) - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 0:22 ` Davide Libenzi @ 2007-04-19 0:30 ` Linus Torvalds 0 siblings, 0 replies; 304+ messages in thread From: Linus Torvalds @ 2007-04-19 0:30 UTC (permalink / raw) To: Davide Libenzi Cc: Ingo Molnar, Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Davide Libenzi wrote: > > I know, we agree there. But that did not fit my "Pirates of the Caribbean" quote :) Ahh, I'm clearly not cultured enough, I didn't catch that reference. Linus "yes, I've seen the movie, but it apparently left more of a mark in other people" Torvalds ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:49 ` Ingo Molnar @ 2007-04-18 18:02 ` William Lee Irwin III 2007-04-18 18:12 ` Ingo Molnar 2007-04-18 18:36 ` Diego Calleja 2007-04-19 0:37 ` Peter Williams 3 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 18:02 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote: > So I claim that anything that cannot be fair by user ID is actually really > REALLY unfair. I think it's absolutely humongously STUPID to call > something the "Completely Fair Scheduler", and then just be fair on a > thread level. That's not fair AT ALL! It's the anti-thesis of being fair! > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the users fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. > So you should see one thread get 50& CPU (single thread of one user), 4 > threads get 10% CPU (their fair share of that users time), and 2 threads > get 5% CPU (the fair share within that thread group!). > Any scheduling argument that just considers the above to be "7 threads > total" and gives each thread 14% of CPU time "fairly" is *anything* but > fair. It's a joke if that kind of scheduler then calls itself CFS! I don't think it's completely fair [sic] to come down on it that hard. It does largely achieve the sort of fairness it set out for itself as its design goal. One should also note that the queueing mechanism is more than flexible enough to handle prioritization by a number of different methods, and the large precision of its priorities is useful there. So a rather broad variety of policies can be implemented by changing the ->fair_key calculations. In some respects, the vast priority space and very high clock precision are two of its most crucial advantages. On Wed, Apr 18, 2007 at 10:22:59AM -0700, Linus Torvalds wrote: > And yes, that's largely what the current scheduler will do, but at least > the current scheduler doesn't claim to be fair! So the current scheduler > is a lot *better* if only in the sense that it doesn't make ridiculous > claims that aren't true! The name chosen was somewhat buzzwordy. I'd have named it something more descriptive of the algorithm, though what's implemented in the current dynamic priority (i.e. ->fair_key) calculations are somewhat difficult to precisely categorize. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 18:02 ` William Lee Irwin III @ 2007-04-18 18:12 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 18:12 UTC (permalink / raw) To: William Lee Irwin III Cc: Linus Torvalds, Matt Mackall, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > It does largely achieve the sort of fairness it set out for itself as > its design goal. One should also note that the queueing mechanism is > more than flexible enough to handle prioritization by a number of > different methods, and the large precision of its priorities is useful > there. So a rather broad variety of policies can be implemented by > changing the ->fair_key calculations. yeah. Note that i concentrated on the bit that makes the largest interactivity improvement: to implement "precise scheduling" (a'ka complete fairness) between the 100+ user tasks that do a complex scheduling dance on a typical desktop on various workloads. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 17:49 ` Ingo Molnar 2007-04-18 18:02 ` William Lee Irwin III @ 2007-04-18 18:36 ` Diego Calleja 2007-04-19 0:37 ` Peter Williams 3 siblings, 0 replies; 304+ messages in thread From: Diego Calleja @ 2007-04-18 18:36 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007 10:22:59 -0700 (PDT), Linus Torvalds <torvalds@linux-foundation.org> wrote: > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the users fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. "Fairness between users" was implemented a long time ago by Rik van Riel (http://surriel.com/patches/2.4/2.4.19-fairsched). Some people have been asking for functionality like that for a long time, ie: universities that want to avoid gcc processes from one student who is trying to learn how fork() works from starving the processes of the rest of the students. But not only do they want "fairness between users", they also want "priorities between users and/or groups of users", ie: "the 'students' group shouldn't starve the 'admins' group". ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 17:22 ` Linus Torvalds ` (2 preceding siblings ...) 2007-04-18 18:36 ` Diego Calleja @ 2007-04-19 0:37 ` Peter Williams 3 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-19 0:37 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner Linus Torvalds wrote: > > On Wed, 18 Apr 2007, Matt Mackall wrote: >> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: >>> And "fairness by euid" is probably a hell of a lot easier to do than >>> trying to figure out the wakeup matrix. >> For the record, you actually don't need to track a whole NxN matrix >> (or do the implied O(n**3) matrix inversion!) to get to the same >> result. > > I'm sure you can do things differently, but the reason I think "fairness > by euid" is actually worth looking at is that it's pretty much the > *identical* issue that we'll have with "fairness by virtual machine" and a > number of other "container" issues. > > The fact is: > > - "fairness" is *not* about giving everybody the same amount of CPU time > (scaled by some niceness level or not). Anybody who thinks that is > "fair" is just being silly and hasn't thought it through. > > - "fairness" is multi-level. You want to be fair to threads within a > thread group (where "process" may be one good approximation of what a > "thread group" is, but not necessarily the only one). > > But you *also* want to be fair in between those "thread groups", and > then you want to be fair across "containers" (where "user" may be one > such container). > > So I claim that anything that cannot be fair by user ID is actually really > REALLY unfair. I think it's absolutely humongously STUPID to call > something the "Completely Fair Scheduler", and then just be fair on a > thread level. That's not fair AT ALL! It's the anti-thesis of being fair! > > So if you have 2 users on a machine running CPU hogs, you should *first* > try to be fair among users. If one user then runs 5 programs, and the > other one runs just 1, then the *one* program should get 50% of the CPU > time (the users fair share), and the five programs should get 10% of CPU > time each. And if one of them uses two threads, each thread should get 5%. > > So you should see one thread get 50& CPU (single thread of one user), 4 > threads get 10% CPU (their fair share of that users time), and 2 threads > get 5% CPU (the fair share within that thread group!). > > Any scheduling argument that just considers the above to be "7 threads > total" and gives each thread 14% of CPU time "fairly" is *anything* but > fair. It's a joke if that kind of scheduler then calls itself CFS! > > And yes, that's largely what the current scheduler will do, but at least > the current scheduler doesn't claim to be fair! So the current scheduler > is a lot *better* if only in the sense that it doesn't make ridiculous > claims that aren't true! > > Linus Sounds a lot like the PLFS (process level fair sharing) scheduler in Aurema's ARMTech (for whom I used to work). The "fair" in the title is a bit misleading as it's all about unfair scheduling in order to meet specific policies. 
But it's based on the principle that if you can allocate CPU bandwidth "fairly" (which really means in proportion to the entitlement each process is allocated) then you can allocate CPU bandwidth "fairly" between higher level entities such as process groups, user groups and so on by subdividing the entitlements downwards. The tricky part of implementing this was the fact that not all entities at the various levels have sufficient demand for CPU bandwidth to use their entitlements and this in turn means that the entities above them will have difficulty using their entitlements even if some of their subordinates have sufficient demand (because their entitlements will be too small). The trick is to have a measure of each entity's demand for CPU bandwidth and use that to modify the way entitlement is divided among subordinates. As a first guess, an entity's CPU bandwidth usage is an indicator of demand but doesn't take into account unmet demand due to tasks waiting on a run queue for access to the CPU. On the other hand, usage plus time waiting on the queue isn't a good measure of demand either (although it's probably a good upper bound) as it's unlikely that the task would have used the same amount of CPU as the waiting time if it had gone straight to the CPU. But my main point is that it is possible to build schedulers that can achieve higher level scheduling policies. Versions of PLFS work on Windows from user space by twiddling process priorities. Part of my more recent work at Aurema involved patching Linux's scheduler so that nice worked more predictably so that we could release a user space version of PLFS for Linux. The other part was to add hard CPU bandwidth caps for processes so that ARMTech could enforce hard CPU bandwidth caps on higher level entities (as this can't be done without the kernel being able to do it at that level). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
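One way to picture the entitlement-subdivision problem Peter describes is to give each child a nominal slice of its parent's entitlement, cap the slice at the child's measured demand, and hand whatever goes unused back to the siblings whose demand is not yet met. The sketch below only illustrates that idea with made-up demand figures; it is not ARMTech's or PLFS's actual algorithm:

#include <stdio.h>

#define N 4

/* Divide a parent's CPU entitlement among N children, capping each
 * child at its demand and redistributing the leftover to children
 * whose demand is not yet met.  Iterates until nothing more can be
 * handed out. */
static void divide(double parent, const double demand[N], double out[N])
{
    int capped[N] = { 0 };
    double left = parent;
    int uncapped = N;

    for (int i = 0; i < N; i++)
        out[i] = 0.0;

    while (left > 1e-9 && uncapped > 0) {
        double slice = left / uncapped;
        left = 0.0;
        for (int i = 0; i < N; i++) {
            if (capped[i])
                continue;
            double want = demand[i] - out[i];
            if (want <= slice) {
                out[i] += want;
                left += slice - want;   /* unused part goes back to the pool */
                capped[i] = 1;
                uncapped--;
            } else {
                out[i] += slice;
            }
        }
    }
}

int main(void)
{
    /* Hypothetical demands, as fractions of one CPU. */
    const double demand[N] = { 0.05, 0.10, 0.60, 0.90 };
    double share[N];

    divide(1.0, demand, share);
    for (int i = 0; i < N; i++)
        printf("child %d: demand %.2f -> share %.2f\n", i, demand[i], share[i]);
    return 0;
}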
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds @ 2007-04-18 19:05 ` Davide Libenzi 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 19:05 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, 18 Apr 2007, Matt Mackall wrote: > On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > And "fairness by euid" is probably a hell of a lot easier to do than > > trying to figure out the wakeup matrix. > > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. You can converge on the same node weightings (ie dynamic > priorities) by applying a damped function at each transition point > (directed wakeup, preemption, fork, exit). > > The trouble with any scheme like this is that it needs careful tuning > of the damping factor to converge rapidly and not oscillate and > precise numerical attention to the transition functions so that the sum of > dynamic priorities is conserved. Doing that inside the boundaries of the time constraints imposed by a scheduler is the interesting part. Given also that the size (and members) of it (matrix) is dynamic. Also, a "wakeup matrix" (if the name correctly pictures what it is for) would help with latencies and priority inheritance, but not for global fairness. The maniacal fairness focus we're seeing now is due to the fact that the mainline can have extremely unfair behaviour under certain conditions. IMO fairness, although important, should not be the main objective of the scheduler rewrite. Simplification and predictability should be a higher priority, with interactivity achievements bound to decent fairness constraints. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
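For readers wondering what "a damped function at each transition point" might look like in its very simplest form: an update applied on every directed wakeup that moves a damped fraction of weight from the wakee to the waker, so the total stays conserved and tasks that many others wait on (an X server, say) gradually accumulate weight. This is a toy sketch with an invented damping factor, not code from any scheduler in this thread, and it deliberately ignores the tuning and convergence questions Matt raises:

#include <stdio.h>

#define DAMPING 0.1    /* invented damping factor: too high oscillates,
                          too low converges slowly */

struct task {
    const char *name;
    double weight;     /* dynamic priority / share weighting */
};

/* On a directed wakeup, shift a damped fraction of the wakee's weight
 * to the waker.  The sum of weights is conserved, so repeated wakeups
 * gradually tilt CPU share toward tasks that do work on behalf of
 * others, without letting them starve anybody. */
static void wakeup_transition(struct task *waker, struct task *wakee)
{
    double delta = DAMPING * wakee->weight;
    waker->weight += delta;
    wakee->weight -= delta;
}

int main(void)
{
    struct task xserver = { "X", 1.0 };
    struct task client  = { "client", 1.0 };

    for (int i = 0; i < 5; i++) {
        wakeup_transition(&xserver, &client);    /* X wakes the client */
        printf("after wakeup %d: %s=%.3f %s=%.3f (sum %.3f)\n", i + 1,
               xserver.name, xserver.weight, client.name, client.weight,
               xserver.weight + client.weight);
    }
    return 0;
}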
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 15:23 ` Matt Mackall 2007-04-18 17:22 ` Linus Torvalds 2007-04-18 19:05 ` Davide Libenzi @ 2007-04-18 19:13 ` Michael K. Edwards 2 siblings, 0 replies; 304+ messages in thread From: Michael K. Edwards @ 2007-04-18 19:13 UTC (permalink / raw) To: Matt Mackall Cc: Linus Torvalds, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On 4/18/07, Matt Mackall <mpm@selenic.com> wrote: > For the record, you actually don't need to track a whole NxN matrix > (or do the implied O(n**3) matrix inversion!) to get to the same > result. You can converge on the same node weightings (ie dynamic > priorities) by applying a damped function at each transition point > (directed wakeup, preemption, fork, exit). > > The trouble with any scheme like this is that it needs careful tuning > of the damping factor to converge rapidly and not oscillate and > precise numerical attention to the transition functions so that the sum of > dynamic priorities is conserved. That would be the control theory approach. And yes, you have to get both the theoretical transfer function and the numerics right. It sometimes helps to use a control-systems framework like the classic Takagi-Sugeno-Kang fuzzy logic controller; get the numerics right once and for all, and treat the heuristics as data, not logic. (I haven't worked in this area in almost twenty years, but Google -- yes, I do use Google+brain for fact-checking; what do you do? -- says that people are still doing active research on TSK models, and solid fixed-point reference implementations are readily available.) That seems like an attractive strategy here because you could easily embed the control engine in the kernel and load rule sets dynamically. Done right, that could give most of the advantages of pluggable schedulers (different heuristic strokes for different folks) without diluting the tester pool for the actual engine code. (Of course, different scheduling strategies require different input data, and you might not want the overhead of collecting data that your chosen heuristics won't use. But that's not much different from the netfilter situation, and is obviously a solvable problem, if anyone cares to put that much work in. The people who ought to be funding this kind of work are Sun and IBM, who don't have a chance on the desktop and are in big trouble in the database tier; their future as processor vendors depends on being able to service presentation-tier and business-logic-tier loads efficiently on their massively multi-core chips. MIPS should pitch in too, on behalf of licensees like Cavium who need more predictable behavior on multi-core embedded Linux.) Note also that you might not even want to persistently prioritize particular processes or process groups. You might want a heuristic that notices that some task (say, the X server) often responds to being awakened by doing a little work and then unblocking the task that awakened it. When it is pinged from some highly interactive task, you want it to jump the scheduler queue just long enough to unblock the interactive task, which may mean letting it flush some work out of its internal queue. 
But otherwise you want to batch things up until there's too much "scheduler pressure" behind it, then let it work more or less until it runs out of things to do, because its working set is so large that repeatedly scheduling it in and out is hell on caches. (Priority inheritance is the classic solution to the blocked-high-priority-task problem _in_isolation_. It is not without its pitfalls, especially when the designer of the "server" didn't expect to lose his timeslice instantly on releasing the lock. True priority inheritance is probably not something you want to inflict on a non-real-time system, but you do need some urgency heuristic. What a "fuzzy logic" framework does for you is to let you combine competing heuristics in a way that remains amenable to analysis using control theory techniques.) What does any of this have to do with "fairness"? Nothing whatsoever! There's work that has to be done, and choosing when to do it is almost entirely a matter of staying out of the way of more urgent work while minimizing the task's negative impact on the rest of the system. Does that mean that the X server is "special", kind of the way that latency-sensitive A/V applications are "special", and belongs in a separate scheduler class? No. Nowadays, workloads where the kernel has any idea what tasks belong to what "users" are the exception, not the norm. The X server is the canary in the coal mine, and a scheduler that won't do the right thing for X without hand tweaking won't do the right thing for other eyeball-driven, multiple-tiers-on-one-box scenarios either. If you want fairness among users to the extent that their demands _compete_, you might as well partition the whole machine, and have a separate fairness-oriented scheduler (let's call it a "hypervisor") that lives outside the kernel. (Talk about two students running gcc on the same shell server, with more important people also doing things on the same system, is so 1990's!) Not that the design of scheduler heuristics shouldn't include "fairness"-like considerations; but they're probably only interesting as a fallback for when the scheduler has no idea what it ought to schedule next. So why is Ingo's scheduler apparently working well for desktop loads? I haven't tried it or even looked at its code, but from its marketing I would guess that it effectively penalizes tasks whose I/O requests can be serviced from (or directed to) cache long enough to actually consume a whole timeslice. This is prima facie evidence that their _current_behavior_ is non-interactive. Presumably this penalty expires quickly when the task again asks for information that is not readily at hand (or writes data that the system is not willing to cache) -- which usually implies either actual user interaction or a change of working set, both of which deserve an "urgency premium". The mainline scheduler seems to contain various heuristics that mistake a burst of non-interactive _activity_ for a persistently non-interactive _task_. Take them away in the name of "fairness", and the system adapts more quickly to the change of working set implied by a change of user focus. There are probably fewer pathological load patterns too, since manual knob-turning uninformed by control theory is a lot less likely to get you into trouble when there are few knobs and no deliberately inserted long-time-constant feedback paths. 
But you can't say there are _no_ pathological load patterns, or even that the major economic drivers of the Linux ecosystem don't generate them, until you do some authentic engineering analysis. In short (too late!) -- alternate schedulers are fun to experiment with, and the sort of people who would actually try out patches floated on LKML may find that they improve their desktop experience, hosting farm throughput, etc. But even if the mainline scheduler is a hack atop a kludge covering a crock, it's more or less what production applications have expected since the last major architectural shift (NPTL). There's just no sense in replacing it until you can either add real value (say, integral clock scaling for power efficiency, with a reasonable "spinning reserve" for peaking load) or demonstrate stability by engineering analysis instead of trial and error. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
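For the curious, the zero-order form of the Takagi-Sugeno-Kang controller Michael mentions boils down to a weighted average: each rule's activation is computed from membership functions of the inputs, and the crisp output is the activation-weighted mean of the rules' constant consequents. The rules, membership functions and inputs below are invented purely to show the mechanics; this is not a proposal for actual scheduler heuristics:

#include <stdio.h>

/* "low" and "high" membership over [0,1], piecewise linear. */
static double mf_low(double x)  { return x < 0.0 ? 1.0 : x > 1.0 ? 0.0 : 1.0 - x; }
static double mf_high(double x) { return 1.0 - mf_low(x); }

/* Zero-order TSK evaluation: rule activations from the membership
 * functions, constant consequents, weighted-average defuzzification. */
static double tsk_boost(double sleep_ratio, double wakeup_rate)
{
    /* Rule 1: sleeps a lot AND wakes others often -> big boost. */
    double w1 = mf_high(sleep_ratio) * mf_high(wakeup_rate), z1 = 10.0;
    /* Rule 2: sleeps a lot but wakes nobody -> small boost. */
    double w2 = mf_high(sleep_ratio) * mf_low(wakeup_rate),  z2 = 3.0;
    /* Rule 3: hogs the CPU -> no boost. */
    double w3 = mf_low(sleep_ratio),                          z3 = 0.0;

    double wsum = w1 + w2 + w3;
    return wsum > 0.0 ? (w1 * z1 + w2 * z2 + w3 * z3) / wsum : 0.0;
}

int main(void)
{
    printf("interactive-ish task: boost %.2f\n", tsk_boost(0.8, 0.7));
    printf("batch task:           boost %.2f\n", tsk_boost(0.1, 0.1));
    return 0;
}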
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall @ 2007-04-19 3:18 ` Nick Piggin 2007-04-19 5:14 ` Andrew Morton 2007-04-21 13:40 ` Bill Davidsen 2 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-19 3:18 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote: > > > On Wed, 18 Apr 2007, Matt Mackall wrote: > > > > Why is X special? Because it does work on behalf of other processes? > > Lots of things do this. Perhaps a scheduler should focus entirely on > > the implicit and directed wakeup matrix and optimizing that > > instead[1]. > > I 100% agree - the perfect scheduler would indeed take into account where > the wakeups come from, and try to "weigh" processes that help other > processes make progress more. That would naturally give server processes > more CPU power, because they help others > > I don't believe for a second that "fairness" means "give everybody the > same amount of CPU". That's a totally illogical measure of fairness. All > processes are _not_ created equal. I believe that unless the kernel is told of these inequalities, then it must schedule fairly. And yes, by fairly, I mean fairly among all threads as a base resource class, because that's what Linux has always done (and if you aggregate into higher classes, you still need that per-thread scheduling). So I'm not excluding extra scheduling classes like per-process, per-user, but among any class of equal schedulable entities, fair scheduling is the only option because the alternative of unfairness is just insane. > That said, even trying to do "fairness by effective user ID" would > probably already do a lot. In a desktop environment, X would get as much > CPU time as the user processes, simply because it's in a different > protection domain (and that's really what "effective user ID" means: it's > not about "users", it's really about "protection domains"). > > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. Well my X server has an euid of root, which would mean my X clients can cause X to do work and eat into root's resources. Or as Ingo said, X may not be running as root. Seems like just another hack to try to implicitly solve the X problem and probably create a lot of others along the way. All fairness issues aside, in the context of keeping a very heavily loaded desktop interactive, X is special. That you are trying to think up funny rules that would implicitly give X better priority is kind of indicative of that. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 3:18 ` Nick Piggin @ 2007-04-19 5:14 ` Andrew Morton 2007-04-19 6:38 ` Ingo Molnar 0 siblings, 1 reply; 304+ messages in thread From: Andrew Morton @ 2007-04-19 5:14 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, 19 Apr 2007 05:18:07 +0200 Nick Piggin <npiggin@suse.de> wrote: > And yes, by fairly, I mean fairly among all threads as a base resource > class, because that's what Linux has always done Yes, there are potential compatibility problems. Example: a machine with 100 busy httpd processes and suddenly a big gzip starts up from console or cron. Under current kernels, that gzip will take ages and the httpds will take a 1% slowdown, which may well be exactly the behaviour which is desired. If we were to schedule by UID then the gzip suddenly gets 50% of the CPU and those httpd's all take a 50% hit, which could be quite serious. That's simple to fix via nicing, but people have to know to do that, and there will be a transition period where some disruption is possible. ^ permalink raw reply [flat|nested] 304+ messages in thread
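Putting numbers on Andrew's example, assuming a single CPU and everything CPU-bound: per-task fairness gives the gzip and each of the 100 httpds roughly 1% apiece, while per-UID fairness gives the gzip 50% and each httpd roughly 0.5%:

#include <stdio.h>

int main(void)
{
    int httpds = 100, gzips = 1;

    /* Per-task fairness: 101 equal runnable tasks. */
    double per_task = 100.0 / (httpds + gzips);
    printf("per-task: each httpd %.2f%%, gzip %.2f%%\n", per_task, per_task);

    /* Per-UID fairness: two uids split the CPU, the httpds then split
     * their uid's half among themselves. */
    double per_uid_httpd = 50.0 / httpds;
    printf("per-uid:  each httpd %.2f%%, gzip %.2f%%\n", per_uid_httpd, 50.0);
    return 0;
}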
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 5:14 ` Andrew Morton @ 2007-04-19 6:38 ` Ingo Molnar 2007-04-19 7:57 ` William Lee Irwin III 2007-04-19 8:33 ` Nick Piggin 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-19 6:38 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner * Andrew Morton <akpm@linux-foundation.org> wrote: > > And yes, by fairly, I mean fairly among all threads as a base > > resource class, because that's what Linux has always done > > Yes, there are potential compatibility problems. Example: a machine > with 100 busy httpd processes and suddenly a big gzip starts up from > console or cron. > > Under current kernels, that gzip will take ages and the httpds will > take a 1% slowdown, which may well be exactly the behaviour which is > desired. > > If we were to schedule by UID then the gzip suddenly gets 50% of the > CPU and those httpd's all take a 50% hit, which could be quite > serious. > > That's simple to fix via nicing, but people have to know to do that, > and there will be a transition period where some disruption is > possible. hmmmm. How about the following then: default to nice -10 for all (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ special: root already has disk space reserved to it, root has special memory allocation allowances, etc. I dont see a reason why we couldnt by default make all root tasks have nice -10. This would be instantly loved by sysadmins i suspect ;-) (distros that go the extra mile of making Xorg run under non-root could also go another extra one foot to renice that X server to -10.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:38 ` Ingo Molnar @ 2007-04-19 7:57 ` William Lee Irwin III 2007-04-19 11:50 ` Peter Williams 2007-04-19 8:33 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-19 7:57 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner * Andrew Morton <akpm@linux-foundation.org> wrote: >> Yes, there are potential compatibility problems. Example: a machine >> with 100 busy httpd processes and suddenly a big gzip starts up from >> console or cron. [...] On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: > hmmmm. How about the following then: default to nice -10 for all > (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ > special: root already has disk space reserved to it, root has special > memory allocation allowances, etc. I dont see a reason why we couldnt by > default make all root tasks have nice -10. This would be instantly loved > by sysadmins i suspect ;-) > (distros that go the extra mile of making Xorg run under non-root could > also go another extra one foot to renice that X server to -10.) I'd further recommend making priority levels accessible to kernel threads that are not otherwise accessible to processes, both above and below user-available priority levels. Basically, if you can get SCHED_RR and SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN scheduler class can coexist with SCHED_OTHER in like fashion, but with availability of higher and lower priorities than any userspace process is allowed, and potentially some differing scheduling semantics. In such a manner nonessential background processing intended not to ever disturb userspace can be given priorities appropriate to it (perhaps even con's SCHED_IDLEPRIO would make sense), and other, urgent processing can be given priority over userspace altogether. I believe root's default priority can be adjusted in userspace as things now stand somewhere in /etc/ but I'm not sure of the specifics. Word is somewhere in /etc/security/limits.conf -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
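The file wli has in mind is pam_limits' /etc/security/limits.conf, which in the versions I'm familiar with accepts a 'priority' item that sets the default nice level for matching login sessions. Assuming a pam_limits new enough to support it, something along these lines would give root logins the -10 default Ingo suggested (check limits.conf(5) on the target system; the @wheel line is only an illustration):

# /etc/security/limits.conf (pam_limits)
# <domain>    <type>    <item>      <value>
root          -         priority    -10
@wheel        -         priority    -5
*             -         priority    0

Note that this only affects sessions that go through pam_limits, i.e. login shells and their children; it does nothing for kernel threads or for daemons started directly from init.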
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 7:57 ` William Lee Irwin III @ 2007-04-19 11:50 ` Peter Williams 2007-04-20 5:26 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-19 11:50 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > * Andrew Morton <akpm@linux-foundation.org> wrote: >>> Yes, there are potential compatibility problems. Example: a machine >>> with 100 busy httpd processes and suddenly a big gzip starts up from >>> console or cron. > [...] > > On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: >> hmmmm. How about the following then: default to nice -10 for all >> (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ >> special: root already has disk space reserved to it, root has special >> memory allocation allowances, etc. I dont see a reason why we couldnt by >> default make all root tasks have nice -10. This would be instantly loved >> by sysadmins i suspect ;-) >> (distros that go the extra mile of making Xorg run under non-root could >> also go another extra one foot to renice that X server to -10.) > > I'd further recommend making priority levels accessible to kernel threads > that are not otherwise accessible to processes, both above and below > user-available priority levels. Basically, if you can get SCHED_RR and > SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN > scheduler class can coexist with SCHED_OTHER in like fashion, but with > availability of higher and lower priorities than any userspace process > is allowed, and potentially some differing scheduling semantics. In such > a manner nonessential background processing intended not to ever disturb > userspace can be given priorities appropriate to it (perhaps even con's > SCHED_IDLEPRIO would make sense), and other, urgent processing can be > given priority over userspace altogether. > > I believe root's default priority can be adjusted in userspace as > things now stand somewhere in /etc/ but I'm not sure of the specifics. > Word is somewhere in /etc/security/limits.conf This is sounding very much like System V Release 4 (and descendants) except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that are in system mode dynamic priorities in the SCHED_SYS range (to avoid priority inversion, I believe). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 11:50 ` Peter Williams @ 2007-04-20 5:26 ` William Lee Irwin III 2007-04-20 6:16 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-20 5:26 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> I'd further recommend making priority levels accessible to kernel threads >> that are not otherwise accessible to processes, both above and below >> user-available priority levels. Basically, if you can get SCHED_RR and >> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN >> scheduler class can coexist with SCHED_OTHER in like fashion, but with >> availability of higher and lower priorities than any userspace process >> is allowed, and potentially some differing scheduling semantics. In such >> a manner nonessential background processing intended not to ever disturb >> userspace can be given priorities appropriate to it (perhaps even con's >> SCHED_IDLEPRIO would make sense), and other, urgent processing can be >> given priority over userspace altogether. On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote: > This is sounding very much like System V Release 4 (and descendants) > except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that > are in system mode dynamic priorities in the SCHED_SYS range (to avoid > priority inversion, I believe). Descriptions of that are probably where I got the idea (hurrah for OS textbooks). It makes a fair amount of sense. Not sure what the take on the specific precedent is. The only content here is expanding the priority range with ranges above and below for the exclusive use of ultra-privileged tasks, so it's really trivial. Actually it might be so trivial it should just be some permission checks in the SCHED_OTHER renicing code. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
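The "permission checks in the SCHED_OTHER renicing code" wli mentions amount to accepting a wider nice range for sufficiently privileged callers. The fragment below is hypothetical illustration only: the extended-range constants and the helper name are invented, and it is not based on the real kernel renice path; it just shows the shape of such a check:

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical nice ranges: ordinary tasks keep the usual -20..19,
 * while kernel threads and root may use an extended band above and
 * below it, so background kthreads can be pushed under all userspace
 * and urgent ones lifted above it. */
#define USER_NICE_MIN   (-20)
#define USER_NICE_MAX     19
#define PRIV_NICE_MIN   (-30)   /* invented value */
#define PRIV_NICE_MAX     29    /* invented value */

static bool nice_value_permitted(int nice, bool is_root, bool is_kthread)
{
    if (is_root || is_kthread)
        return nice >= PRIV_NICE_MIN && nice <= PRIV_NICE_MAX;
    return nice >= USER_NICE_MIN && nice <= USER_NICE_MAX;
}

int main(void)
{
    printf("nice -25 as ordinary user: %s\n",
           nice_value_permitted(-25, false, false) ? "allowed" : "denied");
    printf("nice -25 as kthread:       %s\n",
           nice_value_permitted(-25, false, true) ? "allowed" : "denied");
    return 0;
}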
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-20 5:26 ` William Lee Irwin III @ 2007-04-20 6:16 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-20 6:16 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Andrew Morton, Nick Piggin, Linus Torvalds, Matt Mackall, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> I'd further recommend making priority levels accessible to kernel threads >>> that are not otherwise accessible to processes, both above and below >>> user-available priority levels. Basically, if you can get SCHED_RR and >>> SCHED_FIFO to coexist as "intimate scheduler classes," then a SCHED_KERN >>> scheduler class can coexist with SCHED_OTHER in like fashion, but with >>> availability of higher and lower priorities than any userspace process >>> is allowed, and potentially some differing scheduling semantics. In such >>> a manner nonessential background processing intended not to ever disturb >>> userspace can be given priorities appropriate to it (perhaps even con's >>> SCHED_IDLEPRIO would make sense), and other, urgent processing can be >>> given priority over userspace altogether. > > On Thu, Apr 19, 2007 at 09:50:19PM +1000, Peter Williams wrote: >> This is sounding very much like System V Release 4 (and descendants) >> except that they call it SCHED_SYS and also give SCHED_NORMAL tasks that >> are in system mode dynamic priorities in the SCHED_SYS range (to avoid >> priority inversion, I believe). > > Descriptions of that are probably where I got the idea (hurrah for OS > textbooks). And long term background memory. :-) > It makes a fair amount of sense. Yes. You could also add a SCHED_IA in between SCHED_SYS and SCHED_OTHER (a la Solaris) for interactive tasks. The only problem is how to get a task into SCHED_IA without root privileges. > Not sure what the take on > the specific precedent is. The only content here is expanding the > priority range with ranges above and below for the exclusive use of > ultra-privileged tasks, so it's really trivial. Actually it might be so > trivial it should just be some permission checks in the SCHED_OTHER > renicing code. Perhaps. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-19 6:38 ` Ingo Molnar 2007-04-19 7:57 ` William Lee Irwin III @ 2007-04-19 8:33 ` Nick Piggin 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-19 8:33 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Linus Torvalds, Matt Mackall, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Arjan van de Ven, Thomas Gleixner On Thu, Apr 19, 2007 at 08:38:10AM +0200, Ingo Molnar wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > And yes, by fairly, I mean fairly among all threads as a base > > > resource class, because that's what Linux has always done > > > > Yes, there are potential compatibility problems. Example: a machine > > with 100 busy httpd processes and suddenly a big gzip starts up from > > console or cron. > > > > Under current kernels, that gzip will take ages and the httpds will > > take a 1% slowdown, which may well be exactly the behaviour which is > > desired. > > > > If we were to schedule by UID then the gzip suddenly gets 50% of the > > CPU and those httpd's all take a 50% hit, which could be quite > > serious. > > > > That's simple to fix via nicing, but people have to know to do that, > > and there will be a transition period where some disruption is > > possible. > > hmmmm. How about the following then: default to nice -10 for all > (SCHED_NORMAL) kernel threads and all root-owned tasks. Root _is_ > special: root already has disk space reserved to it, root has special > memory allocation allowances, etc. I dont see a reason why we couldnt by > default make all root tasks have nice -10. This would be instantly loved > by sysadmins i suspect ;-) I have no problem with doing fancy new fairness classes and things. But considering that we _need_ to have per-thread fairness and that is also what the current scheduler has and what we need to do well for obvious reasons, the best path to take is to get per-thread scheduling up to a point where it is able to replace the current scheduler, then look at more complex things after that. ^ permalink raw reply [flat|nested] 304+ messages in thread
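A minimal sketch of the "nice -10 by default for kernel threads and root-owned tasks" idea from the mail above, written as a standalone helper. The struct, the field names and the policy encoding are assumptions made purely for illustration.

struct boosted_task {
	unsigned int uid;
	int kernel_thread;	/* 1 for in-kernel threads */
	int nice;
	int policy;		/* 0 stands in for SCHED_NORMAL in this model */
};

/* give kernel threads and root-owned SCHED_NORMAL tasks nice -10 by default */
static void apply_default_boost(struct boosted_task *t)
{
	if (t->policy != 0)
		return;
	if ((t->kernel_thread || t->uid == 0) && t->nice > -10)
		t->nice = -10;
}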
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 14:48 ` Linus Torvalds 2007-04-18 15:23 ` Matt Mackall 2007-04-19 3:18 ` Nick Piggin @ 2007-04-21 13:40 ` Bill Davidsen 2 siblings, 0 replies; 304+ messages in thread From: Bill Davidsen @ 2007-04-21 13:40 UTC (permalink / raw) To: Linus Torvalds Cc: Matt Mackall, Nick Piggin, William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Andrew Morton, Arjan van de Ven, Thomas Gleixner Linus Torvalds wrote: > > On Wed, 18 Apr 2007, Matt Mackall wrote: >> Why is X special? Because it does work on behalf of other processes? >> Lots of things do this. Perhaps a scheduler should focus entirely on >> the implicit and directed wakeup matrix and optimizing that >> instead[1]. > > I 100% agree - the perfect scheduler would indeed take into account where > the wakeups come from, and try to "weigh" processes that help other > processes make progress more. That would naturally give server processes > more CPU power, because they help others > > I don't believe for a second that "fairness" means "give everybody the > same amount of CPU". That's a totally illogical measure of fairness. All > processes are _not_ created equal. > > That said, even trying to do "fairness by effective user ID" would > probably already do a lot. In a desktop environment, X would get as much > CPU time as the user processes, simply because it's in a different > protection domain (and that's really what "effective user ID" means: it's > not about "users", it's really about "protection domains"). > > And "fairness by euid" is probably a hell of a lot easier to do than > trying to figure out the wakeup matrix. > You probably want to consider the controlling terminal as well... do you want to have people starting 'at' jobs competing on equal footing with people typing at a terminal? I'm not offering an answer, just raising the question. And for some database applications, everyone in a group may connect with the same login-id, then do sub authorization to the database application. euid may be an issue there as well. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 304+ messages in thread
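As a reading aid, a rough model of "fairness by effective user ID": CPU time is accounted per euid, the least-served euid is picked first, and then the least-served task within it. All names, the data layout and the fixed two-level pick are hypothetical assumptions, not code from the thread.

/* Two-level pick: least-served protection domain, then its least-served task. */
#include <stddef.h>

struct euid_task {
	unsigned long long cpu_ns;	/* CPU time this task has received */
};

struct euid_group {
	unsigned int euid;
	unsigned long long cpu_ns;	/* CPU time charged to this euid */
	struct euid_task *tasks;
	size_t nr_tasks;
};

static struct euid_task *pick_next(struct euid_group *groups, size_t nr_groups)
{
	struct euid_group *g = NULL;
	struct euid_task *t = NULL;
	size_t i;

	for (i = 0; i < nr_groups; i++)		/* least-served euid first */
		if (groups[i].nr_tasks && (!g || groups[i].cpu_ns < g->cpu_ns))
			g = &groups[i];
	if (!g)
		return NULL;
	for (i = 0; i < g->nr_tasks; i++)	/* then its least-served task */
		if (!t || g->tasks[i].cpu_ns < t->cpu_ns)
			t = &g->tasks[i];
	return t;
}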
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:15 ` Nick Piggin 2007-04-17 6:26 ` William Lee Irwin III @ 2007-04-17 6:50 ` Davide Libenzi 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:11 ` Nick Piggin 1 sibling, 2 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 6:50 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, Nick Piggin wrote: > > All things are not equal; they all have different properties. I like > > Exactly. So we have to explore those properties and evaluate performance > (in all meanings of the word). That's only logical. I had a quick look at Ingo's code yesterday. Ingo is always smart to prepare a main dish (feature) with a nice side dish (code cleanup) to Linus ;) And even this code does that pretty nicely. The deadline design looks good, although I think the final "key" calculation code will end up quite different from how it looks now. I would suggest thoroughly testing all your alternatives before deciding. Some code and design may look very good and small at the beginning, but when you start patching it to cover all the dark spots, you effectively end up with another thing (in both design and code footprint). About O(1), I never thought it was a must (besides being good marketing material), and O(log(N)) *may* be just fine (to be verified, of course). - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:50 ` Davide Libenzi @ 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams ` (3 more replies) 2007-04-17 7:11 ` Nick Piggin 1 sibling, 4 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 7:09 UTC (permalink / raw) To: Davide Libenzi Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > I had a quick look at Ingo's code yesterday. Ingo is always smart to > prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) > And even this code does that pretty nicely. The deadline designs looks > good, although I think the final "key" calculation code will end up quite > different from what it looks now. The additive nice_offset breaks nice levels. A multiplicative priority weighting of a different, nonnegative metric of cpu utilization from what's now used is required for nice levels to work. I've been trying to point this out politely by strongly suggesting testing whether nice levels work. On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > I would suggest to thoroughly test all your alternatives before deciding. > Some code and design may look very good and small at the beginning, but > when you start patching it to cover all the dark spots, you effectively > end up with another thing (in both design and code footprint). > About O(1), I never thought it was a must (besides a good marketing > material), and O(log(N)) *may* be just fine (to be verified, of course). The trouble with thorough testing right now is that no one agrees on what the tests should be and a number of the testcases are not in great shape. An agreed-upon set of testcases for basic correctness should be devised and the implementations of those testcases need to be maintainable code and the tests set up for automated testing and changing their parameters without recompiling via command-line options. Once there's a standard regression test suite for correctness, one needs to be devised for performance, including interactive performance. The primary difficulty I see along these lines is finding a way to automate tests of graphics and input device response performance. Others, like how deterministically priorities are respected over progressively smaller time intervals and noninteractive workload performance are nowhere near as difficult to arrange and in many cases already exist. Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
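To make the additive-versus-multiplicative point concrete: the mail above argues that an additive nice_offset washes out as runtime accumulates, whereas multiplicative weights keep the long-run share of each always-runnable task proportional to its weight. The small runnable example below only shows the multiplicative arithmetic; the roughly +25%/-20% per-step factors are arbitrary assumptions for the example, not CFS's table.

#include <stdio.h>

/* weight per nice level; an assumed geometric series, for illustration only */
static double nice_weight(int nice)
{
	double w = 1.0;

	for (; nice < 0; nice++)
		w *= 1.25;	/* each step below 0: ~25% more CPU */
	for (; nice > 0; nice--)
		w *= 0.8;	/* each step above 0: ~20% less CPU */
	return w;
}

int main(void)
{
	int nices[] = { 0, 5, 10 };
	double sum = 0.0;
	int i;

	for (i = 0; i < 3; i++)
		sum += nice_weight(nices[i]);
	/* long-run share of an always-runnable task = weight / total weight */
	for (i = 0; i < 3; i++)
		printf("nice %3d -> %5.1f%% of the CPU\n",
		       nices[i], 100.0 * nice_weight(nices[i]) / sum);
	return 0;
}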
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III @ 2007-04-17 7:22 ` Peter Williams 2007-04-17 7:23 ` Nick Piggin ` (2 subsequent siblings) 3 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 7:22 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: >> I had a quick look at Ingo's code yesterday. Ingo is always smart to >> prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) >> And even this code does that pretty nicely. The deadline designs looks >> good, although I think the final "key" calculation code will end up quite >> different from what it looks now. > > The additive nice_offset breaks nice levels. A multiplicative priority > weighting of a different, nonnegative metric of cpu utilization from > what's now used is required for nice levels to work. I've been trying > to point this out politely by strongly suggesting testing whether nice > levels work. > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: >> I would suggest to thoroughly test all your alternatives before deciding. >> Some code and design may look very good and small at the beginning, but >> when you start patching it to cover all the dark spots, you effectively >> end up with another thing (in both design and code footprint). >> About O(1), I never thought it was a must (besides a good marketing >> material), and O(log(N)) *may* be just fine (to be verified, of course). > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. At this point, I'd like direct everyone's attention to the simloads package: <http://downloads.sourceforge.net/cpuse/simloads-0.1.1.tar.gz> which contains a set of programs designed to be used in the construction of CPU scheduler tests. Of particular use is the aspin program which can be used to launch tasks with specified sleep/wake characteristics. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
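In the same spirit as the aspin program described above (but not taken from simloads), a minimal duty-cycle load generator: burn the CPU for busy_ms, sleep for sleep_ms, repeat. The defaults below are arbitrary.

#include <stdlib.h>
#include <time.h>

/* spin on the monotonic clock for roughly @ms milliseconds */
static void busy_wait_ms(long ms)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000L +
		 (now.tv_nsec - start.tv_nsec) / 1000000L < ms);
}

int main(int argc, char **argv)
{
	long busy_ms = argc > 1 ? atol(argv[1]) : 5;
	long sleep_ms = argc > 2 ? atol(argv[2]) : 45;
	struct timespec ts;

	ts.tv_sec = sleep_ms / 1000;
	ts.tv_nsec = (sleep_ms % 1000) * 1000000L;
	for (;;) {
		busy_wait_ms(busy_ms);
		nanosleep(&ts, NULL);
	}
}

Running a handful of these with different busy/sleep ratios alongside a CPU hog gives a crude but repeatable mix of "interactive-looking" and batch load.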
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams @ 2007-04-17 7:23 ` Nick Piggin 2007-04-17 7:27 ` Davide Libenzi 2007-04-17 7:33 ` Ingo Molnar 3 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:23 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 12:09:49AM -0700, William Lee Irwin III wrote: > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. Definitely. It would be really good if we could have interactivity regression tests too (see my earlier wishful email). The problem with a lot of the scripted interactivity tests I see is that they don't really capture the complexities of the interactions within, say, an interactive X session. Others just go straight for trying to exploit the design by making lots of high priority processes runnable at once. This just provides an unrealistic decoy and you end up trying to tune for the wrong thing. ^ permalink raw reply [flat|nested] 304+ messages in thread
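As one example of a probe that measures something user-visible rather than a decoy workload, here is a crude wakeup-latency test: request a 10 ms sleep and record how much later the task actually runs again, while the machine is otherwise loaded. It is only a sketch, not one of the benchmarks named in the thread.

#include <stdio.h>
#include <time.h>

static long long ns_of(const struct timespec *ts)
{
	return ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

int main(void)
{
	struct timespec req, before, after;
	long long delay, worst = 0;
	int i;

	req.tv_sec = 0;
	req.tv_nsec = 10 * 1000 * 1000;		/* ask for a 10ms sleep */
	for (i = 0; i < 1000; i++) {
		clock_gettime(CLOCK_MONOTONIC, &before);
		nanosleep(&req, NULL);
		clock_gettime(CLOCK_MONOTONIC, &after);
		/* excess over the requested sleep is scheduling delay */
		delay = ns_of(&after) - ns_of(&before) - ns_of(&req);
		if (delay > worst)
			worst = delay;
	}
	printf("worst wakeup delay: %lld us\n", worst / 1000);
	return 0;
}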
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III 2007-04-17 7:22 ` Peter Williams 2007-04-17 7:23 ` Nick Piggin @ 2007-04-17 7:27 ` Davide Libenzi 2007-04-17 7:33 ` Nick Piggin 2007-04-17 7:33 ` Ingo Molnar 3 siblings, 1 reply; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 7:27 UTC (permalink / raw) To: William Lee Irwin III Cc: Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > I would suggest to thoroughly test all your alternatives before deciding. > > Some code and design may look very good and small at the beginning, but > > when you start patching it to cover all the dark spots, you effectively > > end up with another thing (in both design and code footprint). > > About O(1), I never thought it was a must (besides a good marketing > > material), and O(log(N)) *may* be just fine (to be verified, of course). > > The trouble with thorough testing right now is that no one agrees on > what the tests should be and a number of the testcases are not in great > shape. An agreed-upon set of testcases for basic correctness should be > devised and the implementations of those testcases need to be > maintainable code and the tests set up for automated testing and > changing their parameters without recompiling via command-line options. > > Once there's a standard regression test suite for correctness, one > needs to be devised for performance, including interactive performance. > The primary difficulty I see along these lines is finding a way to > automate tests of graphics and input device response performance. Others, > like how deterministically priorities are respected over progressively > smaller time intervals and noninteractive workload performance are > nowhere near as difficult to arrange and in many cases already exist. > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. What I meant was, that the rules (requirements and associated test cases) for this new Scheduler Amazing Race should be set forward, and not kept a moving target to fit&follow one or the other implementation. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:27 ` Davide Libenzi @ 2007-04-17 7:33 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:33 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 12:27:28AM -0700, Davide Libenzi wrote: > On Tue, 17 Apr 2007, William Lee Irwin III wrote: > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > > I would suggest to thoroughly test all your alternatives before deciding. > > > Some code and design may look very good and small at the beginning, but > > > when you start patching it to cover all the dark spots, you effectively > > > end up with another thing (in both design and code footprint). > > > About O(1), I never thought it was a must (besides a good marketing > > > material), and O(log(N)) *may* be just fine (to be verified, of course). > > > > The trouble with thorough testing right now is that no one agrees on > > what the tests should be and a number of the testcases are not in great > > shape. An agreed-upon set of testcases for basic correctness should be > > devised and the implementations of those testcases need to be > > maintainable code and the tests set up for automated testing and > > changing their parameters without recompiling via command-line options. > > > > Once there's a standard regression test suite for correctness, one > > needs to be devised for performance, including interactive performance. > > The primary difficulty I see along these lines is finding a way to > > automate tests of graphics and input device response performance. Others, > > like how deterministically priorities are respected over progressively > > smaller time intervals and noninteractive workload performance are > > nowhere near as difficult to arrange and in many cases already exist. > > Just reuse SDET, AIM7/AIM9, OAST, contest, interbench, et al. > > What I meant was, that the rules (requirements and associated test cases) > for this new Scheduler Amazing Race should be set forward, and not kept a > moving target to fit&follow one or the other implementation. Exactly. Well I don't mind if it is a moving target as such, just as long as the decisions are rational (no "blah is more important because I say so"). ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:09 ` William Lee Irwin III ` (2 preceding siblings ...) 2007-04-17 7:27 ` Davide Libenzi @ 2007-04-17 7:33 ` Ingo Molnar 2007-04-17 7:40 ` Nick Piggin 2007-04-17 9:05 ` William Lee Irwin III 3 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 7:33 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > > prepare a main dish (feature) with a nice sider (code cleanup) to > > Linus ;) And even this code does that pretty nicely. The deadline > > designs looks good, although I think the final "key" calculation > > code will end up quite different from what it looks now. > > The additive nice_offset breaks nice levels. A multiplicative priority > weighting of a different, nonnegative metric of cpu utilization from > what's now used is required for nice levels to work. I've been trying > to point this out politely by strongly suggesting testing whether nice > levels work. granted, CFS's nice code is still incomplete, but you err quite significantly with this extreme statement that they are "broken". nice levels certainly work to a fair degree even in the current code and much of the focus is elsewhere - just try it. (In fact i claim that CFS's nice levels often work _better_ than the mainline scheduler's nice level support, for the testcases that matter to users.) The precise behavior of nice levels, as i pointed it out in previous mails, is largely 'uninteresting' and it has changed multiple times in the past 10 years. What matters to users is mainly: whether X reniced to -10 does get enough CPU time and whether stuff reniced to +19 doesnt take away too much CPU time from the rest of the system. _How_ a Linux scheduler achieves this is an internal matter and certainly CFS does it in a hacky way at the moment. All the rest, 'CPU bandwidth utilization' or whatever abstract metric we could come up with is just a fancy academic technicality that has no real significance to any of the testers who are trying CFS right now. Sure we prefer final solutions that are clean and make sense (because such things are the easiest to maintain long-term), and often such final solutions are quite close to academic concepts, and i think Davide correctly observed this by saying that "the final key calculation code will end up quite different from what it looks now", but your extreme-end claim of 'breakage' for something that is just plain incomplete is not really a fair characterisation at this point. Anyone who thinks that there exists only two kinds of code: 100% correct and 100% incorrect with no shades of grey inbetween is in reality a sort of an extremist: whom, depending on mood and affection, we could call either a 'coding purist' or a 'coding taliban' ;-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
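The property described above ("X reniced to -10 gets enough CPU, +19 doesn't take away too much") is easy to spot-check. A rough harness, with error handling omitted: start two CPU hogs, renice one to +19, let them run for ten seconds, then compare the CPU time each accumulated according to /proc/<pid>/stat.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/resource.h>

static pid_t spawn_hog(int nice_val)
{
	pid_t pid = fork();

	if (pid == 0) {
		volatile unsigned long x = 0;

		setpriority(PRIO_PROCESS, 0, nice_val);
		for (;;)
			x++;		/* burn CPU until killed */
	}
	return pid;
}

/* utime + stime of @pid in seconds, from fields 14 and 15 of /proc/pid/stat */
static double cpu_seconds(pid_t pid)
{
	char path[64], buf[1024];
	unsigned long utime, stime;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	f = fopen(path, "r");
	fgets(buf, sizeof(buf), f);
	fclose(f);
	sscanf(strrchr(buf, ')') + 2,
	       "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
	       &utime, &stime);
	return (double)(utime + stime) / sysconf(_SC_CLK_TCK);
}

int main(void)
{
	pid_t normal = spawn_hog(0), niced = spawn_hog(19);

	sleep(10);
	printf("nice  0: %.1fs   nice 19: %.1fs\n",
	       cpu_seconds(normal), cpu_seconds(niced));
	kill(normal, SIGKILL);
	kill(niced, SIGKILL);
	wait(NULL);
	wait(NULL);
	return 0;
}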
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:33 ` Ingo Molnar @ 2007-04-17 7:40 ` Nick Piggin 2007-04-17 7:58 ` Ingo Molnar 2007-04-17 9:05 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:40 UTC (permalink / raw) To: Ingo Molnar Cc: William Lee Irwin III, Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > > * William Lee Irwin III <wli@holomorphy.com> wrote: > > > On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > > > prepare a main dish (feature) with a nice sider (code cleanup) to > > > Linus ;) And even this code does that pretty nicely. The deadline > > > designs looks good, although I think the final "key" calculation > > > code will end up quite different from what it looks now. > > > > The additive nice_offset breaks nice levels. A multiplicative priority > > weighting of a different, nonnegative metric of cpu utilization from > > what's now used is required for nice levels to work. I've been trying > > to point this out politely by strongly suggesting testing whether nice > > levels work. > > granted, CFS's nice code is still incomplete, but you err quite > significantly with this extreme statement that they are "broken". > > nice levels certainly work to a fair degree even in the current code and > much of the focus is elsewhere - just try it. (In fact i claim that > CFS's nice levels often work _better_ than the mainline scheduler's nice > level support, for the testcases that matter to users.) > > The precise behavior of nice levels, as i pointed it out in previous > mails, is largely 'uninteresting' and it has changed multiple times in > the past 10 years. > > What matters to users is mainly: whether X reniced to -10 does get > enough CPU time and whether stuff reniced to +19 doesnt take away too > much CPU time from the rest of the system. I agree there. > _How_ a Linux scheduler > achieves this is an internal matter and certainly CFS does it in a hacky > way at the moment. > > All the rest, 'CPU bandwidth utilization' or whatever abstract metric we > could come up with is just a fancy academic technicality that has no > real significance to any of the testers who are trying CFS right now. > > Sure we prefer final solutions that are clean and make sense (because > such things are the easiest to maintain long-term), and often such final > solutions are quite close to academic concepts, and i think Davide > correctly observed this by saying that "the final key calculation code > will end up quite different from what it looks now", but your > extreme-end claim of 'breakage' for something that is just plain > incomplete is not really a fair characterisation at this point. > > Anyone who thinks that there exists only two kinds of code: 100% correct > and 100% incorrect with no shades of grey inbetween is in reality a sort > of an extremist: whom, depending on mood and affection, we could call > either a 'coding purist' or a 'coding taliban' ;-) Only if you are an extremist-naming extremist with no shades of grey. Others, like myself, also include 'coding al-qaeda' and 'coding john howard' in that scale. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:40 ` Nick Piggin @ 2007-04-17 7:58 ` Ingo Molnar 0 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 7:58 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Davide Libenzi, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Nick Piggin <npiggin@suse.de> wrote: > > Anyone who thinks that there exists only two kinds of code: 100% > > correct and 100% incorrect with no shades of grey inbetween is in > > reality a sort of an extremist: whom, depending on mood and > > affection, we could call either a 'coding purist' or a 'coding > > taliban' ;-) > > Only if you are an extremist-naming extremist with no shades of grey. > Others, like myself, also include 'coding al-qaeda' and 'coding john > howard' in that scale. heh ;) You, you ... nitpicking extremist! ;) And beware that you just commited another act of extremism too: > I agree there. because you just went to the extreme position of saying that "i agree with this portion 100%", instead of saying "this seems to be 91.5% correct in my opinion, Tue, 17 Apr 2007 09:40:25 +0200". and the nasty thing is, that in reality even shades of grey, if you print them out, are just a set of extreme black dots on an extreme white sheet of paper! ;) [ so i guess we've got to consider the scope of extremism too: the larger the scope, the more limiting and hence the more dangerous it is. ] Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:33 ` Ingo Molnar 2007-04-17 7:40 ` Nick Piggin @ 2007-04-17 9:05 ` William Lee Irwin III 2007-04-17 9:24 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 9:05 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> The additive nice_offset breaks nice levels. A multiplicative priority >> weighting of a different, nonnegative metric of cpu utilization from >> what's now used is required for nice levels to work. I've been trying >> to point this out politely by strongly suggesting testing whether nice >> levels work. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > granted, CFS's nice code is still incomplete, but you err quite > significantly with this extreme statement that they are "broken". I used the word relatively loosely. Nothing extreme is going on. Maybe the phrasing exaggerated the force of the opinion. I'm sorry about having misspoke so. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > nice levels certainly work to a fair degree even in the current code and > much of the focus is elsewhere - just try it. (In fact i claim that > CFS's nice levels often work _better_ than the mainline scheduler's nice > level support, for the testcases that matter to users.) Al Boldi's testcase appears to reveal some issues. I'm plotting a testcase of my own if I can ever get past responding to email. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > The precise behavior of nice levels, as i pointed it out in previous > mails, is largely 'uninteresting' and it has changed multiple times in > the past 10 years. I expect that whether a scheduler can handle such prioritization has a rather strong predictive quality regarding whether it can handle, say, CKRM controls. I remain convinced that there should be some target behavior and that some attempt should be made to achieve it. I don't think any particular behavior is best, just that the behavior should be well-defined. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > What matters to users is mainly: whether X reniced to -10 does get > enough CPU time and whether stuff reniced to +19 doesnt take away too > much CPU time from the rest of the system. _How_ a Linux scheduler > achieves this is an internal matter and certainly CFS does it in a hacky > way at the moment. It's not so far out. Basically just changing the key calculation in a relatively simple manner should get things into relatively good shape. It can, of course, be done other ways (I did it a rather different way in vdls, though that method is not likely to be considered desirable). I can't really write a testcase for such loose semantics, so the above description is useless to me. These squishy sorts of definitions of semantics are also uninformative to users, who, I would argue, do have some interest in what nice levels mean. There have been at least a small number of concerns about the strength of nice levels, and it would reveal issues surrounding that area earlier if there were an objective one could test to see if it were achieved. It's furthermore a user-visible change in system call semantics we should be more careful about changing out from beneath users. 
So I see a lot of good reasons to pin down nice numbers. Incompleteness is not a particularly mortal sin, but the proliferation of competing schedulers is creating a need for standards, and that's what I'm really on about. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > All the rest, 'CPU bandwidth utilization' or whatever abstract metric we > could come up with is just a fancy academic technicality that has no > real significance to any of the testers who are trying CFS right now. I could say "percent cpu" if it sounds less like formal jargon, which "CPU bandwidth utilization" isn't really. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > Sure we prefer final solutions that are clean and make sense (because > such things are the easiest to maintain long-term), and often such final > solutions are quite close to academic concepts, and i think Davide > correctly observed this by saying that "the final key calculation code > will end up quite different from what it looks now", but your > extreme-end claim of 'breakage' for something that is just plain > incomplete is not really a fair characterisation at this point. It wasn't meant to be quite as strong a statement as it came out. Sorry about that. On Tue, Apr 17, 2007 at 09:33:08AM +0200, Ingo Molnar wrote: > Anyone who thinks that there exists only two kinds of code: 100% correct > and 100% incorrect with no shades of grey inbetween is in reality a sort > of an extremist: whom, depending on mood and affection, we could call > either a 'coding purist' or a 'coding taliban' ;-) I've made no such claims. Also rest assured that the tone of the critique is not hostile, and wasn't meant to sound that way. Also, given the general comments it appears clear that some statistical metric of deviation from the intended behavior furthermore qualified by timescale is necessary, so this appears to be headed toward a sort of performance metric as opposed to a pass/fail test anyway. However, to even measure this at all, some statement of intention is required. I'd prefer that there be a Linux-standard semantics for nice so results are more directly comparable and so that users also get similar nice behavior from the scheduler as it varies over time and possibly implementations if users should care to switch them out with some scheduler patch or other. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
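One possible concrete "statement of intention" of the kind asked for here, phrased as a metric rather than a pass/fail test: over a sampling window, each task's target is its weight's share of the CPU time actually handed out, and the score is the worst relative deviation from that target. The weights and the window length are whatever a given test chooses; only the metric is shown, and it is an assumption of this sketch rather than anything agreed in the thread.

#include <math.h>
#include <stddef.h>

struct sample {
	double weight;		/* relative entitlement of the task */
	double cpu;		/* CPU time it received during the window */
};

/* returns max_i |observed_i - target_i| / target_i over the window */
static double worst_deviation(const struct sample *s, size_t n)
{
	double wsum = 0.0, csum = 0.0, worst = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		wsum += s[i].weight;
		csum += s[i].cpu;
	}
	for (i = 0; i < n; i++) {
		double target = csum * s[i].weight / wsum;
		double dev = fabs(s[i].cpu - target) / target;

		if (dev > worst)
			worst = dev;
	}
	return worst;
}

Reporting this number for several window sizes gives the "qualified by timescale" part: a scheduler may be fair over seconds but quite unfair over tens of milliseconds.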
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:05 ` William Lee Irwin III @ 2007-04-17 9:24 ` Ingo Molnar 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 22:08 ` Matt Mackall 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:24 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > [...] Also rest assured that the tone of the critique is not hostile, > and wasn't meant to sound that way. ok :) (And i guess i was too touchy - sorry about coming out swinging.) > Also, given the general comments it appears clear that some > statistical metric of deviation from the intended behavior furthermore > qualified by timescale is necessary, so this appears to be headed > toward a sort of performance metric as opposed to a pass/fail test > anyway. However, to even measure this at all, some statement of > intention is required. I'd prefer that there be a Linux-standard > semantics for nice so results are more directly comparable and so that > users also get similar nice behavior from the scheduler as it varies > over time and possibly implementations if users should care to switch > them out with some scheduler patch or other. yeah. If you could come up with a sane definition that also translates into low overhead on the algorithm side that would be great! The only good generic definition i could come up with (nice levels are isolated buckets with a constant maximum relative percentage of CPU time available to every active bucket) resulted in having a per-nice-level array of rbtree roots, which did not look worth the hassle at first sight :-) until now the main approach for nice levels in Linux was always: "implement your main scheduling logic for nice 0 and then look for some low-overhead method that can be glued to it that does something that behaves like nice levels". Feel free to turn that around into a more natural approach, but the algorithm should remain fairly simple i think. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
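A data-structure sketch of the "isolated buckets" definition above: one queue per nice level, each with a weight, and the next level to run is the one furthest behind its entitled share. The per-level task queue (the rbtree root mentioned in the mail) is left as a comment; the weight fields and the selection rule are assumptions for illustration.

#include <stddef.h>

#define NICE_LEVELS 40

struct nice_bucket {
	unsigned int weight;		/* share this level is entitled to */
	unsigned long long received;	/* CPU time its tasks have received */
	size_t nr_tasks;
	/* per-level task queue (an rbtree root in the proposal) would go here */
};

struct nice_queue {
	struct nice_bucket bucket[NICE_LEVELS];	/* nice -20 .. +19 */
	unsigned long long total_received;
	unsigned int total_weight;		/* sum over non-empty buckets */
};

/* pick the non-empty level whose received/entitled ratio is smallest */
static int pick_nice_level(const struct nice_queue *q)
{
	double best_ratio = 0.0;
	int best = -1, i;

	for (i = 0; i < NICE_LEVELS; i++) {
		const struct nice_bucket *b = &q->bucket[i];
		double entitled, ratio;

		if (!b->nr_tasks)
			continue;
		entitled = (double)q->total_received * b->weight / q->total_weight;
		ratio = entitled > 0.0 ? b->received / entitled : 0.0;
		if (best < 0 || ratio < best_ratio) {
			best = i;
			best_ratio = ratio;
		}
	}
	return best;		/* index into bucket[], or -1 if all empty */
}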
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:24 ` Ingo Molnar @ 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 10:01 ` Ingo Molnar 2007-04-17 11:31 ` William Lee Irwin III 2007-04-17 22:08 ` Matt Mackall 1 sibling, 2 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 9:57 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: >> Also, given the general comments it appears clear that some >> statistical metric of deviation from the intended behavior furthermore >> qualified by timescale is necessary, so this appears to be headed >> toward a sort of performance metric as opposed to a pass/fail test [...] On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > yeah. If you could come up with a sane definition that also translates > into low overhead on the algorithm side that would be great! The only > good generic definition i could come up with (nice levels are isolated > buckets with a constant maximum relative percentage of CPU time > available to every active bucket) resulted in having a per-nice-level > array of rbtree roots, which did not look worth the hassle at first > sight :-) Interesting! That's what vdls did, except its fundamental data structure was more like a circular buffer data structure (resembling Davide Libenzi's timer ring in concept, but with all the details different). I'm not entirely sure how that would've turned out performancewise if I'd done any tuning at all. I was mostly interested in doing something like what I heard Bob Mullens did in 1976 for basic pedagogical value about schedulers to prepare for writing patches for gang scheduling as opposed to creating a viable replacement for the mainline scheduler. I'm relatively certain a different key calculation will suffice, but it may disturb other desired semantics since they really need to be nonnegative for multiplying by a scaling factor corresponding to its nice number to work properly. Well, as the cfs code now stands, it would correspond to negative keys. Dividing positive keys by the nice scaling factor is my first thought of how to extend the method to the current key semantics. Or such are my thoughts on the subject. I expect that all that's needed is to fiddle with those numbers a bit. There's quite some capacity for expression there given the precision. On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > until now the main approach for nice levels in Linux was always: > "implement your main scheduling logic for nice 0 and then look for some > low-overhead method that can be glued to it that does something that > behaves like nice levels". Feel free to turn that around into a more > natural approach, but the algorithm should remain fairly simple i think. Part of my insistence was because it seemed to be relatively close to a one-liner, though I'm not entirely sure what particular computation to use to handle the signedness of the keys. I guess I could pick some particular nice semantics myself and then sweep the extant schedulers to use them after getting a testcase hammered out. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
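The "circular buffer" arrangement mentioned above, in miniature, is also a useful reading aid for the vdls patch posted later in the thread, whose batch_queue indexes an array of run lists modulo the array size from a moving base. Sizes, types and the singly linked lists here are simplified assumptions, not the patch's code.

#include <stddef.h>

#define RING_SLOTS 128

struct ring_task {
	struct ring_task *next;		/* singly linked run list */
	unsigned int prio;		/* distance ahead of the ring's base */
};

struct run_ring {
	unsigned int base;		/* slot currently being served */
	struct ring_task *slot[RING_SLOTS];
};

/* enqueue @t 'prio' slots ahead of the current base */
static void ring_enqueue(struct run_ring *r, struct ring_task *t)
{
	unsigned int idx = (r->base + t->prio) % RING_SLOTS;

	t->next = r->slot[idx];
	r->slot[idx] = t;
}

/* advance the base to the next occupied slot and take one task from it */
static struct ring_task *ring_dequeue(struct run_ring *r)
{
	unsigned int i;

	for (i = 0; i < RING_SLOTS; i++) {
		unsigned int idx = (r->base + i) % RING_SLOTS;
		struct ring_task *t = r->slot[idx];

		if (t) {
			r->base = idx;
			r->slot[idx] = t->next;
			return t;
		}
	}
	return NULL;			/* ring empty */
}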
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:57 ` William Lee Irwin III @ 2007-04-17 10:01 ` Ingo Molnar 2007-04-17 11:31 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 10:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * William Lee Irwin III <wli@holomorphy.com> wrote: > On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > > > until now the main approach for nice levels in Linux was always: > > "implement your main scheduling logic for nice 0 and then look for > > some low-overhead method that can be glued to it that does something > > that behaves like nice levels". Feel free to turn that around into a > > more natural approach, but the algorithm should remain fairly simple > > i think. > > Part of my insistence was because it seemed to be relatively close to > a one-liner, though I'm not entirely sure what particular computation > to use to handle the signedness of the keys. I guess I could pick some > particular nice semantics myself and then sweep the extant schedulers > to use them after getting a testcase hammered out. i'd love to have a oneliner solution :-) wrt. signedness: note that in v2 i have made rq_running signed, and most calculations (especially those related to nice) are signed values. (On 64-bit systems this all isnt a big issue - most of the arithmetics gymnastics in CFS are done to keep deltas within 32 bits, so that divisions and multiplications are sane.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
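The 32-bit remark above, shown in isolation: clamp a 64-bit delta to 32 bits before scaling it, so the product stays within 64 bits and the arithmetic stays cheap on 32-bit hosts. The clamp limit and the mult/div parameters are illustrative assumptions, not values from the patch.

#include <stdint.h>

static uint64_t scale_delta(uint64_t delta_ns, uint32_t mult, uint32_t div)
{
	/* a 32-bit delta times a 32-bit multiplier still fits in 64 bits */
	if (delta_ns > UINT32_MAX)
		delta_ns = UINT32_MAX;
	return (delta_ns * mult) / div;
}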
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:57 ` William Lee Irwin III 2007-04-17 10:01 ` Ingo Molnar @ 2007-04-17 11:31 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 11:31 UTC (permalink / raw) To: Ingo Molnar Cc: Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:57:49AM -0700, William Lee Irwin III wrote: > Interesting! That's what vdls did, except its fundamental data structure > was more like a circular buffer data structure (resembling Davide > Libenzi's timer ring in concept, but with all the details different). > I'm not entirely sure how that would've turned out performancewise if > I'd done any tuning at all. I was mostly interested in doing something > like what I heard Bob Mullens did in 1976 for basic pedagogical value > about schedulers to prepare for writing patches for gang scheduling as > opposed to creating a viable replacement for the mainline scheduler. Con helped me dredge up the vdls bits, so here is the last version I before I got tired of toying with the idea. It's not all that clean, with a fair amount of debug code floating around and a number of idiocies (it seems there was a plot to use a heap somewhere I forgot about entirely, never mind other cruft), but I thought I should at least say something more provable than "there was a patch I never posted." Enjoy! -- wli diff -prauN linux-2.6.0-test11/fs/proc/array.c sched-2.6.0-test11-5/fs/proc/array.c --- linux-2.6.0-test11/fs/proc/array.c 2003-11-26 12:44:26.000000000 -0800 +++ sched-2.6.0-test11-5/fs/proc/array.c 2003-12-17 07:37:11.000000000 -0800 @@ -162,7 +162,7 @@ static inline char * task_state(struct t "Uid:\t%d\t%d\t%d\t%d\n" "Gid:\t%d\t%d\t%d\t%d\n", get_task_state(p), - (p->sleep_avg/1024)*100/(1000000000/1024), + 0UL, /* was ->sleep_avg */ p->tgid, p->pid, p->pid ? p->real_parent->pid : 0, p->pid && p->ptrace ? 
p->parent->pid : 0, @@ -345,7 +345,7 @@ int proc_pid_stat(struct task_struct *ta read_unlock(&tasklist_lock); res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \ %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \ -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n", +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %d %d\n", task->pid, task->comm, state, @@ -390,8 +390,8 @@ int proc_pid_stat(struct task_struct *ta task->cnswap, task->exit_signal, task_cpu(task), - task->rt_priority, - task->policy); + task_prio(task), + task_sched_policy(task)); if(mm) mmput(mm); return res; diff -prauN linux-2.6.0-test11/include/asm-i386/thread_info.h sched-2.6.0-test11-5/include/asm-i386/thread_info.h --- linux-2.6.0-test11/include/asm-i386/thread_info.h 2003-11-26 12:43:06.000000000 -0800 +++ sched-2.6.0-test11-5/include/asm-i386/thread_info.h 2003-12-17 04:55:22.000000000 -0800 @@ -114,6 +114,8 @@ static inline struct thread_info *curren #define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */ #define TIF_IRET 5 /* return with iret */ #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */ +#define TIF_QUEUED 17 +#define TIF_PREEMPT 18 #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE) #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME) diff -prauN linux-2.6.0-test11/include/linux/binomial.h sched-2.6.0-test11-5/include/linux/binomial.h --- linux-2.6.0-test11/include/linux/binomial.h 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/include/linux/binomial.h 2003-12-20 15:53:33.000000000 -0800 @@ -0,0 +1,16 @@ +/* + * Simple binomial heaps. + */ + +struct binomial { + unsigned priority, degree; + struct binomial *parent, *child, *sibling; +}; + + +struct binomial *binomial_minimum(struct binomial **); +void binomial_union(struct binomial **, struct binomial **, struct binomial **); +void binomial_insert(struct binomial **, struct binomial *); +struct binomial *binomial_extract_min(struct binomial **); +void binomial_decrease(struct binomial **, struct binomial *, unsigned); +void binomial_delete(struct binomial **, struct binomial *); diff -prauN linux-2.6.0-test11/include/linux/init_task.h sched-2.6.0-test11-5/include/linux/init_task.h --- linux-2.6.0-test11/include/linux/init_task.h 2003-11-26 12:42:58.000000000 -0800 +++ sched-2.6.0-test11-5/include/linux/init_task.h 2003-12-18 05:51:16.000000000 -0800 @@ -56,6 +56,12 @@ .siglock = SPIN_LOCK_UNLOCKED, \ } +#define INIT_SCHED_INFO(info) \ +{ \ + .run_list = LIST_HEAD_INIT((info).run_list), \ + .policy = 1 /* SCHED_POLICY_TS */, \ +} + /* * INIT_TASK is used to set up the first task table, touch at * your own risk!. 
Base=0, limit=0x1fffff (=2MB) @@ -67,14 +73,10 @@ .usage = ATOMIC_INIT(2), \ .flags = 0, \ .lock_depth = -1, \ - .prio = MAX_PRIO-20, \ - .static_prio = MAX_PRIO-20, \ - .policy = SCHED_NORMAL, \ + .sched_info = INIT_SCHED_INFO(tsk.sched_info), \ .cpus_allowed = CPU_MASK_ALL, \ .mm = NULL, \ .active_mm = &init_mm, \ - .run_list = LIST_HEAD_INIT(tsk.run_list), \ - .time_slice = HZ, \ .tasks = LIST_HEAD_INIT(tsk.tasks), \ .ptrace_children= LIST_HEAD_INIT(tsk.ptrace_children), \ .ptrace_list = LIST_HEAD_INIT(tsk.ptrace_list), \ diff -prauN linux-2.6.0-test11/include/linux/sched.h sched-2.6.0-test11-5/include/linux/sched.h --- linux-2.6.0-test11/include/linux/sched.h 2003-11-26 12:42:58.000000000 -0800 +++ sched-2.6.0-test11-5/include/linux/sched.h 2003-12-23 03:47:45.000000000 -0800 @@ -126,6 +126,8 @@ extern unsigned long nr_iowait(void); #define SCHED_NORMAL 0 #define SCHED_FIFO 1 #define SCHED_RR 2 +#define SCHED_BATCH 3 +#define SCHED_IDLE 4 struct sched_param { int sched_priority; @@ -281,10 +283,14 @@ struct signal_struct { #define MAX_USER_RT_PRIO 100 #define MAX_RT_PRIO MAX_USER_RT_PRIO - -#define MAX_PRIO (MAX_RT_PRIO + 40) - -#define rt_task(p) ((p)->prio < MAX_RT_PRIO) +#define NICE_QLEN 128 +#define MIN_TS_PRIO MAX_RT_PRIO +#define MAX_TS_PRIO (40*NICE_QLEN) +#define MIN_BATCH_PRIO (MAX_RT_PRIO + MAX_TS_PRIO) +#define MAX_BATCH_PRIO 100 +#define MAX_PRIO (MIN_BATCH_PRIO + MAX_BATCH_PRIO) +#define USER_PRIO(prio) ((prio) - MAX_RT_PRIO) +#define MAX_USER_PRIO USER_PRIO(MAX_PRIO) /* * Some day this will be a full-fledged user tracking system.. @@ -330,6 +336,36 @@ struct k_itimer { struct io_context; /* See blkdev.h */ void exit_io_context(void); +struct rt_data { + int prio, rt_policy; + unsigned long quantum, ticks; +}; + +/* XXX: do %cpu estimation for ts wakeup levels */ +struct ts_data { + int nice; + unsigned long ticks, frac_cpu; + unsigned long sample_start, sample_ticks; +}; + +struct bt_data { + int prio; + unsigned long ticks; +}; + +union class_data { + struct rt_data rt; + struct ts_data ts; + struct bt_data bt; +}; + +struct sched_info { + int idx; /* queue index, used by all classes */ + unsigned long policy; /* scheduling policy */ + struct list_head run_list; /* list links for priority queues */ + union class_data cl_data; /* class-specific data */ +}; + struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ struct thread_info *thread_info; @@ -339,18 +375,9 @@ struct task_struct { int lock_depth; /* Lock depth */ - int prio, static_prio; - struct list_head run_list; - prio_array_t *array; - - unsigned long sleep_avg; - long interactive_credit; - unsigned long long timestamp; - int activated; + struct sched_info sched_info; - unsigned long policy; cpumask_t cpus_allowed; - unsigned int time_slice, first_time_slice; struct list_head tasks; struct list_head ptrace_children; @@ -391,7 +418,6 @@ struct task_struct { int __user *set_child_tid; /* CLONE_CHILD_SETTID */ int __user *clear_child_tid; /* CLONE_CHILD_CLEARTID */ - unsigned long rt_priority; unsigned long it_real_value, it_prof_value, it_virt_value; unsigned long it_real_incr, it_prof_incr, it_virt_incr; struct timer_list real_timer; @@ -520,12 +546,14 @@ extern void node_nr_running_init(void); #define node_nr_running_init() {} #endif -extern void set_user_nice(task_t *p, long nice); -extern int task_prio(task_t *p); -extern int task_nice(task_t *p); -extern int task_curr(task_t *p); -extern int idle_cpu(int cpu); - +void set_user_nice(task_t *task, long nice); +int 
task_prio(task_t *task); +int task_nice(task_t *task); +int task_sched_policy(task_t *task); +void set_task_sched_policy(task_t *task, int policy); +int rt_task(task_t *task); +int task_curr(task_t *task); +int idle_cpu(int cpu); void yield(void); /* @@ -844,6 +872,21 @@ static inline int need_resched(void) return unlikely(test_thread_flag(TIF_NEED_RESCHED)); } +static inline void set_task_queued(task_t *task) +{ + set_tsk_thread_flag(task, TIF_QUEUED); +} + +static inline void clear_task_queued(task_t *task) +{ + clear_tsk_thread_flag(task, TIF_QUEUED); +} + +static inline int task_queued(task_t *task) +{ + return test_tsk_thread_flag(task, TIF_QUEUED); +} + extern void __cond_resched(void); static inline void cond_resched(void) { diff -prauN linux-2.6.0-test11/kernel/Makefile sched-2.6.0-test11-5/kernel/Makefile --- linux-2.6.0-test11/kernel/Makefile 2003-11-26 12:43:24.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/Makefile 2003-12-17 03:30:08.000000000 -0800 @@ -6,7 +6,7 @@ obj-y = sched.o fork.o exec_domain.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o capability.o ptrace.o timer.o user.o \ signal.o sys.o kmod.o workqueue.o pid.o \ - rcupdate.o intermodule.o extable.o params.o posix-timers.o + rcupdate.o intermodule.o extable.o params.o posix-timers.o sched/ obj-$(CONFIG_FUTEX) += futex.o obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o diff -prauN linux-2.6.0-test11/kernel/exit.c sched-2.6.0-test11-5/kernel/exit.c --- linux-2.6.0-test11/kernel/exit.c 2003-11-26 12:45:29.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/exit.c 2003-12-17 07:04:02.000000000 -0800 @@ -225,7 +225,7 @@ void reparent_to_init(void) /* Set the exit signal to SIGCHLD so we signal init on exit */ current->exit_signal = SIGCHLD; - if ((current->policy == SCHED_NORMAL) && (task_nice(current) < 0)) + if (task_nice(current) < 0) set_user_nice(current, 0); /* cpus_allowed? */ /* rt_priority? */ diff -prauN linux-2.6.0-test11/kernel/fork.c sched-2.6.0-test11-5/kernel/fork.c --- linux-2.6.0-test11/kernel/fork.c 2003-11-26 12:42:58.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/fork.c 2003-12-23 06:22:59.000000000 -0800 @@ -836,6 +836,9 @@ struct task_struct *copy_process(unsigne atomic_inc(&p->user->__count); atomic_inc(&p->user->processes); + clear_tsk_thread_flag(p, TIF_SIGPENDING); + clear_tsk_thread_flag(p, TIF_QUEUED); + /* * If multiple threads are within copy_process(), then this check * triggers too late. 
This doesn't hurt, the check is only there @@ -861,13 +864,21 @@ struct task_struct *copy_process(unsigne p->state = TASK_UNINTERRUPTIBLE; copy_flags(clone_flags, p); - if (clone_flags & CLONE_IDLETASK) + if (clone_flags & CLONE_IDLETASK) { p->pid = 0; - else { + set_task_sched_policy(p, SCHED_IDLE); + } else { + if (task_sched_policy(p) == SCHED_IDLE) { + memset(&p->sched_info, 0, sizeof(struct sched_info)); + set_task_sched_policy(p, SCHED_NORMAL); + set_user_nice(p, 0); + } p->pid = alloc_pidmap(); if (p->pid == -1) goto bad_fork_cleanup; } + if (p->pid == 1) + BUG_ON(task_nice(p)); retval = -EFAULT; if (clone_flags & CLONE_PARENT_SETTID) if (put_user(p->pid, parent_tidptr)) @@ -875,8 +886,7 @@ struct task_struct *copy_process(unsigne p->proc_dentry = NULL; - INIT_LIST_HEAD(&p->run_list); - + INIT_LIST_HEAD(&p->sched_info.run_list); INIT_LIST_HEAD(&p->children); INIT_LIST_HEAD(&p->sibling); INIT_LIST_HEAD(&p->posix_timers); @@ -885,8 +895,6 @@ struct task_struct *copy_process(unsigne spin_lock_init(&p->alloc_lock); spin_lock_init(&p->switch_lock); spin_lock_init(&p->proc_lock); - - clear_tsk_thread_flag(p, TIF_SIGPENDING); init_sigpending(&p->pending); p->it_real_value = p->it_virt_value = p->it_prof_value = 0; @@ -898,7 +906,6 @@ struct task_struct *copy_process(unsigne p->tty_old_pgrp = 0; p->utime = p->stime = 0; p->cutime = p->cstime = 0; - p->array = NULL; p->lock_depth = -1; /* -1 = no lock */ p->start_time = get_jiffies_64(); p->security = NULL; @@ -948,33 +955,6 @@ struct task_struct *copy_process(unsigne p->pdeath_signal = 0; /* - * Share the timeslice between parent and child, thus the - * total amount of pending timeslices in the system doesn't change, - * resulting in more scheduling fairness. - */ - local_irq_disable(); - p->time_slice = (current->time_slice + 1) >> 1; - /* - * The remainder of the first timeslice might be recovered by - * the parent if the child exits early enough. - */ - p->first_time_slice = 1; - current->time_slice >>= 1; - p->timestamp = sched_clock(); - if (!current->time_slice) { - /* - * This case is rare, it happens when the parent has only - * a single jiffy left from its timeslice. Taking the - * runqueue lock is not a problem. - */ - current->time_slice = 1; - preempt_disable(); - scheduler_tick(0, 0); - local_irq_enable(); - preempt_enable(); - } else - local_irq_enable(); - /* * Ok, add it to the run-queues and make it * visible to the rest of the system. 
* diff -prauN linux-2.6.0-test11/kernel/sched/Makefile sched-2.6.0-test11-5/kernel/sched/Makefile --- linux-2.6.0-test11/kernel/sched/Makefile 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/Makefile 2003-12-17 03:32:21.000000000 -0800 @@ -0,0 +1 @@ +obj-y = util.o ts.o idle.o rt.o batch.o diff -prauN linux-2.6.0-test11/kernel/sched/batch.c sched-2.6.0-test11-5/kernel/sched/batch.c --- linux-2.6.0-test11/kernel/sched/batch.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/batch.c 2003-12-19 21:32:49.000000000 -0800 @@ -0,0 +1,190 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +struct batch_queue { + int base, tasks; + task_t *curr; + unsigned long bitmap[BITS_TO_LONGS(MAX_BATCH_PRIO)]; + struct list_head queue[MAX_BATCH_PRIO]; +}; + +static int batch_quantum = 1024; +static DEFINE_PER_CPU(struct batch_queue, batch_queues); + +static int batch_init(struct policy *policy, int cpu) +{ + int k; + struct batch_queue *queue = &per_cpu(batch_queues, cpu); + + policy->queue = (struct queue *)queue; + for (k = 0; k < MAX_BATCH_PRIO; ++k) + INIT_LIST_HEAD(&queue->queue[k]); + return 0; +} + +static int batch_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + + cpustat->nice += user_ticks; + cpustat->system += sys_ticks; + + task->sched_info.cl_data.bt.ticks--; + if (!task->sched_info.cl_data.bt.ticks) { + int new_idx; + + task->sched_info.cl_data.bt.ticks = batch_quantum; + new_idx = (task->sched_info.idx + task->sched_info.cl_data.bt.prio) + % MAX_BATCH_PRIO; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); + list_move_tail(&task->sched_info.run_list, + &queue->queue[new_idx]); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + task->sched_info.idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + set_need_resched(); + } + return 0; +} + +static void batch_yield(struct queue *__queue, task_t *task) +{ + int new_idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + new_idx = (queue->base + MAX_BATCH_PRIO - 1) % MAX_BATCH_PRIO; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); + list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + task->sched_info.idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + set_need_resched(); +} + +static task_t *batch_curr(struct queue *__queue) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + return queue->curr; +} + +static void batch_set_curr(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + queue->curr = task; +} + +static task_t *batch_best(struct queue *__queue) +{ + int idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + idx = find_first_circular_bit(queue->bitmap, + queue->base, + MAX_BATCH_PRIO); + BUG_ON(idx >= MAX_BATCH_PRIO); + BUG_ON(list_empty(&queue->queue[idx])); + return list_entry(queue->queue[idx].next, task_t, sched_info.run_list); +} + +static void batch_enqueue(struct queue *__queue, task_t *task) +{ + int 
idx; + struct batch_queue *queue = (struct batch_queue *)__queue; + + idx = (queue->base + task->sched_info.cl_data.bt.prio) % MAX_BATCH_PRIO; + if (!test_bit(idx, queue->bitmap)) + __set_bit(idx, queue->bitmap); + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); + task->sched_info.idx = idx; + task->sched_info.cl_data.bt.ticks = batch_quantum; + queue->tasks++; + if (!queue->curr) + queue->curr = task; +} + +static void batch_dequeue(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.idx])) + __clear_bit(task->sched_info.idx, queue->bitmap); + queue->tasks--; + if (!queue->tasks) + queue->curr = NULL; + else if (task == queue->curr) + queue->curr = batch_best(__queue); +} + +static int batch_preempt(struct queue *__queue, task_t *task) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + if (!queue->curr) + return 1; + else + return task->sched_info.cl_data.bt.prio + < queue->curr->sched_info.cl_data.bt.prio; +} + +static int batch_tasks(struct queue *__queue) +{ + struct batch_queue *queue = (struct batch_queue *)__queue; + return queue->tasks; +} + +static int batch_nice(struct queue *queue, task_t *task) +{ + return 20; +} + +static int batch_prio(task_t *task) +{ + return USER_PRIO(task->sched_info.cl_data.bt.prio + MIN_BATCH_PRIO); +} + +static void batch_setprio(task_t *task, int prio) +{ + BUG_ON(prio < 0); + BUG_ON(prio >= MAX_BATCH_PRIO); + task->sched_info.cl_data.bt.prio = prio; +} + +struct queue_ops batch_ops = { + .init = batch_init, + .fini = nop_fini, + .tick = batch_tick, + .yield = batch_yield, + .curr = batch_curr, + .set_curr = batch_set_curr, + .tasks = batch_tasks, + .best = batch_best, + .enqueue = batch_enqueue, + .dequeue = batch_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = batch_preempt, + .nice = batch_nice, + .renice = nop_renice, + .prio = batch_prio, + .setprio = batch_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy batch_policy = { + .ops = &batch_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/idle.c sched-2.6.0-test11-5/kernel/sched/idle.c --- linux-2.6.0-test11/kernel/sched/idle.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/idle.c 2003-12-19 17:31:39.000000000 -0800 @@ -0,0 +1,99 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +static DEFINE_PER_CPU(task_t *, idle_tasks) = NULL; + +static int idle_nice(struct queue *queue, task_t *task) +{ + return 20; +} + +static int idle_tasks(struct queue *queue) +{ + task_t **idle = (task_t **)queue; + return !!(*idle); +} + +static task_t *idle_task(struct queue *queue) +{ + return *((task_t **)queue); +} + +static void idle_yield(struct queue *queue, task_t *task) +{ + set_need_resched(); +} + +static void idle_enqueue(struct queue *queue, task_t *task) +{ + task_t **idle = (task_t **)queue; + *idle = task; +} + +static void idle_dequeue(struct queue *queue, task_t *task) +{ +} + +static int idle_preempt(struct queue *queue, task_t *task) +{ + return 0; +} + +static int idle_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + runqueue_t *rq = &per_cpu(runqueues, smp_processor_id()); + + if 
(atomic_read(&rq->nr_iowait) > 0) + cpustat->iowait += sys_ticks; + else + cpustat->idle += sys_ticks; + return 1; +} + +static int idle_init(struct policy *policy, int cpu) +{ + policy->queue = (struct queue *)&per_cpu(idle_tasks, cpu); + return 0; +} + +static int idle_prio(task_t *task) +{ + return MAX_USER_PRIO; +} + +static void idle_setprio(task_t *task, int prio) +{ +} + +static struct queue_ops idle_ops = { + .init = idle_init, + .fini = nop_fini, + .tick = idle_tick, + .yield = idle_yield, + .curr = idle_task, + .set_curr = queue_nop, + .tasks = idle_tasks, + .best = idle_task, + .enqueue = idle_enqueue, + .dequeue = idle_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = idle_preempt, + .nice = idle_nice, + .renice = nop_renice, + .prio = idle_prio, + .setprio = idle_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy idle_policy = { + .ops = &idle_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/queue.h sched-2.6.0-test11-5/kernel/sched/queue.h --- linux-2.6.0-test11/kernel/sched/queue.h 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/queue.h 2003-12-23 03:58:02.000000000 -0800 @@ -0,0 +1,104 @@ +#define SCHED_POLICY_RT 0 +#define SCHED_POLICY_TS 1 +#define SCHED_POLICY_BATCH 2 +#define SCHED_POLICY_IDLE 3 + +#define RT_POLICY_FIFO 0 +#define RT_POLICY_RR 1 + +#define NODE_THRESHOLD 125 + +struct queue; +struct queue_ops; + +struct policy { + struct queue *queue; + struct queue_ops *ops; +}; + +extern struct policy rt_policy, ts_policy, batch_policy, idle_policy; + +struct runqueue { + spinlock_t lock; + int curr; + task_t *__curr; + unsigned long policy_bitmap; + struct policy *policies[BITS_PER_LONG]; + unsigned long nr_running, nr_switches, nr_uninterruptible; + struct mm_struct *prev_mm; + int prev_cpu_load[NR_CPUS]; +#ifdef CONFIG_NUMA + atomic_t *node_nr_running; + int prev_node_load[MAX_NUMNODES]; +#endif + task_t *migration_thread; + struct list_head migration_queue; + + atomic_t nr_iowait; +}; + +typedef struct runqueue runqueue_t; + +struct queue_ops { + int (*init)(struct policy *, int); + void (*fini)(struct policy *, int); + task_t *(*curr)(struct queue *); + void (*set_curr)(struct queue *, task_t *); + task_t *(*best)(struct queue *); + int (*tick)(struct queue *, task_t *, int, int); + int (*tasks)(struct queue *); + void (*enqueue)(struct queue *, task_t *); + void (*dequeue)(struct queue *, task_t *); + void (*start_wait)(struct queue *, task_t *); + void (*stop_wait)(struct queue *, task_t *); + void (*sleep)(struct queue *, task_t *); + void (*wake)(struct queue *, task_t *); + int (*preempt)(struct queue *, task_t *); + void (*yield)(struct queue *, task_t *); + int (*prio)(task_t *); + void (*setprio)(task_t *, int); + int (*nice)(struct queue *, task_t *); + void (*renice)(struct queue *, task_t *, int); + unsigned long (*timeslice)(struct queue *, task_t *); + void (*set_timeslice)(struct queue *, task_t *, unsigned long); +}; + +DECLARE_PER_CPU(runqueue_t, runqueues); + +int find_first_circular_bit(unsigned long *, int, int); +void queue_nop(struct queue *, task_t *); +void nop_renice(struct queue *, task_t *, int); +void nop_fini(struct policy *, int); +unsigned long nop_timeslice(struct queue *, task_t *); +void nop_set_timeslice(struct queue *, task_t *, unsigned long); + +/* #define DEBUG_SCHED */ + +#ifdef DEBUG_SCHED +#define __check_task_policy(idx) \ +do { \ + unsigned long __idx__ = (idx); \ + if (__idx__ 
> SCHED_POLICY_IDLE) { \ + printk("invalid policy 0x%lx\n", __idx__); \ + BUG(); \ + } \ +} while (0) + +#define check_task_policy(task) \ +do { \ + __check_task_policy((task)->sched_info.policy); \ +} while (0) + +#define check_policy(policy) \ +do { \ + BUG_ON((policy) != &rt_policy && \ + (policy) != &ts_policy && \ + (policy) != &batch_policy && \ + (policy) != &idle_policy); \ +} while (0) + +#else /* !DEBUG_SCHED */ +#define __check_task_policy(idx) do { } while (0) +#define check_task_policy(task) do { } while (0) +#define check_policy(policy) do { } while (0) +#endif /* !DEBUG_SCHED */ diff -prauN linux-2.6.0-test11/kernel/sched/rt.c sched-2.6.0-test11-5/kernel/sched/rt.c --- linux-2.6.0-test11/kernel/sched/rt.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/rt.c 2003-12-19 18:16:07.000000000 -0800 @@ -0,0 +1,208 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +#ifdef DEBUG_SCHED +#define check_rt_policy(task) \ +do { \ + BUG_ON((task)->sched_info.policy != SCHED_POLICY_RT); \ + BUG_ON((task)->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR \ + && \ + (task)->sched_info.cl_data.rt.rt_policy!=RT_POLICY_FIFO); \ + BUG_ON((task)->sched_info.cl_data.rt.prio < 0); \ + BUG_ON((task)->sched_info.cl_data.rt.prio >= MAX_RT_PRIO); \ +} while (0) +#else +#define check_rt_policy(task) do { } while (0) +#endif + +struct rt_queue { + unsigned long bitmap[BITS_TO_LONGS(MAX_RT_PRIO)]; + struct list_head queue[MAX_RT_PRIO]; + task_t *curr; + int tasks; +}; + +static DEFINE_PER_CPU(struct rt_queue, rt_queues); + +static int rt_init(struct policy *policy, int cpu) +{ + int k; + struct rt_queue *queue = &per_cpu(rt_queues, cpu); + + policy->queue = (struct queue *)queue; + for (k = 0; k < MAX_RT_PRIO; ++k) + INIT_LIST_HEAD(&queue->queue[k]); + return 0; +} + +static void rt_yield(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio])) + set_need_resched(); + list_add_tail(&task->sched_info.run_list, + &queue->queue[task->sched_info.cl_data.rt.prio]); + check_rt_policy(task); +} + +static int rt_tick(struct queue *queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + check_rt_policy(task); + cpustat->user += user_ticks; + cpustat->system += sys_ticks; + if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) { + task->sched_info.cl_data.rt.ticks--; + if (!task->sched_info.cl_data.rt.ticks) { + task->sched_info.cl_data.rt.ticks = + task->sched_info.cl_data.rt.quantum; + rt_yield(queue, task); + } + } + check_rt_policy(task); + return 0; +} + +static task_t *rt_curr(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + task_t *task = queue->curr; + check_rt_policy(task); + return task; +} + +static void rt_set_curr(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + queue->curr = task; + check_rt_policy(task); +} + +static task_t *rt_best(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + task_t *task; + int idx; + idx = find_first_bit(queue->bitmap, MAX_RT_PRIO); + BUG_ON(idx >= MAX_RT_PRIO); + task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list); + check_rt_policy(task); + return task; +} + +static void 
rt_enqueue(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + if (!test_bit(task->sched_info.cl_data.rt.prio, queue->bitmap)) + __set_bit(task->sched_info.cl_data.rt.prio, queue->bitmap); + list_add_tail(&task->sched_info.run_list, + &queue->queue[task->sched_info.cl_data.rt.prio]); + check_rt_policy(task); + queue->tasks++; + if (!queue->curr) + queue->curr = task; +} + +static void rt_dequeue(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.cl_data.rt.prio])) + __clear_bit(task->sched_info.cl_data.rt.prio, queue->bitmap); + queue->tasks--; + check_rt_policy(task); + if (!queue->tasks) + queue->curr = NULL; + else if (task == queue->curr) + queue->curr = rt_best(__queue); +} + +static int rt_preempt(struct queue *__queue, task_t *task) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + check_rt_policy(task); + if (!queue->curr) + return 1; + check_rt_policy(queue->curr); + return task->sched_info.cl_data.rt.prio + < queue->curr->sched_info.cl_data.rt.prio; +} + +static int rt_tasks(struct queue *__queue) +{ + struct rt_queue *queue = (struct rt_queue *)__queue; + return queue->tasks; +} + +static int rt_nice(struct queue *queue, task_t *task) +{ + check_rt_policy(task); + return -20; +} + +static unsigned long rt_timeslice(struct queue *queue, task_t *task) +{ + check_rt_policy(task); + if (task->sched_info.cl_data.rt.rt_policy != RT_POLICY_RR) + return 0; + else + return task->sched_info.cl_data.rt.quantum; +} + +static void rt_set_timeslice(struct queue *queue, task_t *task, unsigned long n) +{ + check_rt_policy(task); + if (task->sched_info.cl_data.rt.rt_policy == RT_POLICY_RR) + task->sched_info.cl_data.rt.quantum = n; + check_rt_policy(task); +} + +static void rt_setprio(task_t *task, int prio) +{ + check_rt_policy(task); + BUG_ON(prio < 0); + BUG_ON(prio >= MAX_RT_PRIO); + task->sched_info.cl_data.rt.prio = prio; +} + +static int rt_prio(task_t *task) +{ + check_rt_policy(task); + return USER_PRIO(task->sched_info.cl_data.rt.prio); +} + +static struct queue_ops rt_ops = { + .init = rt_init, + .fini = nop_fini, + .tick = rt_tick, + .yield = rt_yield, + .curr = rt_curr, + .set_curr = rt_set_curr, + .tasks = rt_tasks, + .best = rt_best, + .enqueue = rt_enqueue, + .dequeue = rt_dequeue, + .start_wait = queue_nop, + .stop_wait = queue_nop, + .sleep = queue_nop, + .wake = queue_nop, + .preempt = rt_preempt, + .nice = rt_nice, + .renice = nop_renice, + .prio = rt_prio, + .setprio = rt_setprio, + .timeslice = rt_timeslice, + .set_timeslice = rt_set_timeslice, +}; + +struct policy rt_policy = { + .ops = &rt_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/ts.c sched-2.6.0-test11-5/kernel/sched/ts.c --- linux-2.6.0-test11/kernel/sched/ts.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/ts.c 2003-12-23 08:24:55.000000000 -0800 @@ -0,0 +1,841 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <linux/kernel_stat.h> +#include <asm/page.h> +#include "queue.h" + +#ifdef DEBUG_SCHED +#define check_ts_policy(task) \ +do { \ + BUG_ON((task)->sched_info.policy != SCHED_POLICY_TS); \ +} while (0) + +#define check_nice(__queue__) \ +({ \ + int __k__, __count__ = 0; \ + if ((__queue__)->tasks < 0) { \ + printk("negative nice task count %d\n", \ + (__queue__)->tasks); \ + BUG(); \ + } \ + for (__k__ = 
0; __k__ < NICE_QLEN; ++__k__) { \ + task_t *__task__; \ + if (list_empty(&(__queue__)->queue[__k__])) { \ + if (test_bit(__k__, (__queue__)->bitmap)) { \ + printk("wrong nice bit set\n"); \ + BUG(); \ + } \ + } else { \ + if (!test_bit(__k__, (__queue__)->bitmap)) { \ + printk("wrong nice bit clear\n"); \ + BUG(); \ + } \ + } \ + list_for_each_entry(__task__, \ + &(__queue__)->queue[__k__], \ + sched_info.run_list) { \ + check_ts_policy(__task__); \ + if (__task__->sched_info.idx != __k__) { \ + printk("nice index mismatch\n"); \ + BUG(); \ + } \ + ++__count__; \ + } \ + } \ + if ((__queue__)->tasks != __count__) { \ + printk("wrong nice task count\n"); \ + printk("expected %d, got %d\n", \ + (__queue__)->tasks, \ + __count__); \ + BUG(); \ + } \ + __count__; \ +}) + +#define check_queue(__queue) \ +do { \ + int __k, __count = 0; \ + if ((__queue)->tasks < 0) { \ + printk("negative queue task count %d\n", \ + (__queue)->tasks); \ + BUG(); \ + } \ + for (__k = 0; __k < 40; ++__k) { \ + struct nice_queue *__nice; \ + if (list_empty(&(__queue)->nices[__k])) { \ + if (test_bit(__k, (__queue)->bitmap)) { \ + printk("wrong queue bit set\n"); \ + BUG(); \ + } \ + } else { \ + if (!test_bit(__k, (__queue)->bitmap)) { \ + printk("wrong queue bit clear\n"); \ + BUG(); \ + } \ + } \ + list_for_each_entry(__nice, \ + &(__queue)->nices[__k], \ + list) { \ + __count += check_nice(__nice); \ + if (__nice->idx != __k) { \ + printk("queue index mismatch\n"); \ + BUG(); \ + } \ + } \ + } \ + if ((__queue)->tasks != __count) { \ + printk("wrong queue task count\n"); \ + printk("expected %d, got %d\n", \ + (__queue)->tasks, \ + __count); \ + BUG(); \ + } \ +} while (0) + +#else /* !DEBUG_SCHED */ +#define check_ts_policy(task) do { } while (0) +#define check_nice(nice) do { } while (0) +#define check_queue(queue) do { } while (0) +#endif + +/* + * Hybrid deadline/multilevel scheduling. Cpu utilization + * -dependent deadlines at wake. Queue rotation every 50ms or when + * demotions empty the highest level, setting demoted deadlines + * relative to the new highest level. Intra-level RR quantum at 10ms. + */ +struct nice_queue { + int idx, nice, base, tasks, level_quantum, expired; + unsigned long bitmap[BITS_TO_LONGS(NICE_QLEN)]; + struct list_head list, queue[NICE_QLEN]; + task_t *curr; +}; + +/* + * Deadline schedule nice levels with priority-dependent deadlines, + * default quantum of 100ms. Queue rotates at demotions emptying the + * highest level, setting the demoted deadline relative to the new + * highest level. + */ +struct ts_queue { + struct nice_queue nice_levels[40]; + struct list_head nices[40]; + int base, quantum, tasks; + unsigned long bitmap[BITS_TO_LONGS(40)]; + struct nice_queue *curr; +}; + +/* + * Make these sysctl-tunable. 
+ */ +static int nice_quantum = 100; +static int rr_quantum = 10; +static int level_quantum = 50; +static int sample_interval = HZ; + +static DEFINE_PER_CPU(struct ts_queue, ts_queues); + +static task_t *nice_best(struct nice_queue *); +static struct nice_queue *ts_best_nice(struct ts_queue *); + +static void nice_init(struct nice_queue *queue) +{ + int k; + + INIT_LIST_HEAD(&queue->list); + for (k = 0; k < NICE_QLEN; ++k) { + INIT_LIST_HEAD(&queue->queue[k]); + } +} + +static int ts_init(struct policy *policy, int cpu) +{ + int k; + struct ts_queue *queue = &per_cpu(ts_queues, cpu); + + policy->queue = (struct queue *)queue; + queue->quantum = nice_quantum; + + for (k = 0; k < 40; ++k) { + nice_init(&queue->nice_levels[k]); + queue->nice_levels[k].nice = k; + INIT_LIST_HEAD(&queue->nices[k]); + } + return 0; +} + +static int task_deadline(task_t *task) +{ + u64 frac_cpu = task->sched_info.cl_data.ts.frac_cpu; + frac_cpu *= (u64)NICE_QLEN; + frac_cpu >>= 32; + return (int)min((u32)(NICE_QLEN - 1), (u32)frac_cpu); +} + +static void nice_rotate_queue(struct nice_queue *queue) +{ + int idx, new_idx, deadline, idxdiff; + task_t *task = queue->curr; + + check_nice(queue); + + /* shit what if idxdiff == NICE_QLEN - 1?? */ + idx = queue->curr->sched_info.idx; + idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN; + deadline = min(1 + task_deadline(task), NICE_QLEN - idxdiff - 1); + new_idx = (idx + deadline) % NICE_QLEN; +#if 0 + if (idx == new_idx) { + /* + * buggy; it sets queue->base = idx because in this case + * we have task_deadline(task) == 0 + */ + new_idx = (idx - task_deadline(task) + NICE_QLEN) % NICE_QLEN; + if (queue->base != new_idx) + queue->base = new_idx; + return; + } + BUG_ON(!deadline); + BUG_ON(queue->base <= new_idx && new_idx <= idx); + BUG_ON(idx < queue->base && queue->base <= new_idx); + BUG_ON(new_idx <= idx && idx < queue->base); + if (0 && idx == new_idx) { + printk("FUCKUP: pid = %d, tdl = %d, dl = %d, idx = %d, " + "base = %d, diff = %d, fcpu = 0x%lx\n", + queue->curr->pid, + task_deadline(queue->curr), + deadline, + idx, + queue->base, + idxdiff, + task->sched_info.cl_data.ts.frac_cpu); + BUG(); + } +#else + /* + * RR in the last deadline + * special-cased so as not to trip BUG_ON()'s below + */ + if (idx == new_idx) { + /* if we got here these two things must hold */ + BUG_ON(idxdiff != NICE_QLEN - 1); + BUG_ON(deadline); + list_move_tail(&task->sched_info.run_list, &queue->queue[idx]); + if (queue->expired) { + queue->level_quantum = level_quantum; + queue->expired = 0; + } + return; + } +#endif + task->sched_info.idx = new_idx; + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&task->sched_info.run_list, + &queue->queue[new_idx]); + + /* expired until list drains */ + if (!list_empty(&queue->queue[idx])) + queue->expired = 1; + else { + int k, w, m = NICE_QLEN % BITS_PER_LONG; + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + + for (w = 0, k = 0; k < NICE_QLEN/BITS_PER_LONG; ++k) + w += hweight_long(queue->bitmap[k]); + if (NICE_QLEN % BITS_PER_LONG) + w += hweight_long(queue->bitmap[k] & ((1UL << m) - 1)); + if (w > 1) + queue->base = (queue->base + 1) % NICE_QLEN; + queue->level_quantum = level_quantum; + queue->expired = 0; + } + check_nice(queue); +} + +static void nice_tick(struct nice_queue *queue, task_t *task) +{ + int idx = task->sched_info.idx; + BUG_ON(!task_queued(task)); + BUG_ON(task != queue->curr); + 
BUG_ON(!test_bit(idx, queue->bitmap)); + BUG_ON(list_empty(&queue->queue[idx])); + check_ts_policy(task); + check_nice(queue); + + if (task->sched_info.cl_data.ts.ticks) + task->sched_info.cl_data.ts.ticks--; + + if (queue->level_quantum > level_quantum) { + WARN_ON(1); + queue->level_quantum = 1; + } + + if (!queue->expired) { + if (queue->level_quantum) + queue->level_quantum--; + } else if (0 && queue->queue[idx].prev != &task->sched_info.run_list) { + int queued = 0, new_idx = (queue->base + 1) % NICE_QLEN; + task_t *curr, *sav; + task_t *victim = list_entry(queue->queue[idx].prev, + task_t, + sched_info.run_list); + victim->sched_info.idx = new_idx; + if (!test_bit(new_idx, queue->bitmap)) + __set_bit(new_idx, queue->bitmap); +#if 1 + list_for_each_entry_safe(curr, sav, &queue->queue[new_idx], sched_info.run_list) { + if (victim->sched_info.cl_data.ts.frac_cpu + < curr->sched_info.cl_data.ts.frac_cpu) { + queued = 1; + list_move(&victim->sched_info.run_list, + curr->sched_info.run_list.prev); + break; + } + } + if (!queued) + list_move_tail(&victim->sched_info.run_list, + &queue->queue[new_idx]); +#else + list_move(&victim->sched_info.run_list, &queue->queue[new_idx]); +#endif + BUG_ON(list_empty(&queue->queue[idx])); + } + + if (!queue->level_quantum && !queue->expired) { + check_nice(queue); + nice_rotate_queue(queue); + check_nice(queue); + set_need_resched(); + } else if (!task->sched_info.cl_data.ts.ticks) { + int idxdiff = (idx - queue->base + NICE_QLEN) % NICE_QLEN; + check_nice(queue); + task->sched_info.cl_data.ts.ticks = rr_quantum; + BUG_ON(!test_bit(idx, queue->bitmap)); + BUG_ON(list_empty(&queue->queue[idx])); + if (queue->expired) + nice_rotate_queue(queue); + else if (idxdiff == NICE_QLEN - 1) + list_move_tail(&task->sched_info.run_list, + &queue->queue[idx]); + else { + int new_idx = (idx + 1) % NICE_QLEN; + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + task->sched_info.idx = new_idx; + list_add(&task->sched_info.run_list, + &queue->queue[new_idx]); + } + check_nice(queue); + set_need_resched(); + } + check_nice(queue); + check_ts_policy(task); +} + +static void ts_rotate_queue(struct ts_queue *queue) +{ + int idx, new_idx, idxdiff, off, deadline; + + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + + /* shit what if idxdiff == 39?? 
*/ + check_queue(queue); + idx = queue->curr->idx; + idxdiff = (idx - queue->base + 40) % 40; + off = (int)(queue->curr - queue->nice_levels); + deadline = min(1 + off, 40 - idxdiff - 1); + new_idx = (idx + deadline) % 40; + if (idx == new_idx) { + new_idx = (idx - off + 40) % 40; + if (queue->base != new_idx) + queue->base = new_idx; + return; + } + BUG_ON(!deadline); + BUG_ON(queue->base <= new_idx && new_idx <= idx); + BUG_ON(idx < queue->base && queue->base <= new_idx); + BUG_ON(new_idx <= idx && idx < queue->base); + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&queue->curr->list, &queue->nices[new_idx]); + queue->curr->idx = new_idx; + + if (list_empty(&queue->nices[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + queue->base = (queue->base + 1) % 40; + } + check_queue(queue); +} + +static int ts_tick(struct queue *__queue, task_t *task, int user_ticks, int sys_ticks) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + int nice_idx = (int)(queue->curr - queue->nice_levels); + unsigned long sample_end, delta; + + check_queue(queue); + check_ts_policy(task); + BUG_ON(!nice); + BUG_ON(nice_idx != task->sched_info.cl_data.ts.nice); + BUG_ON(!test_bit(nice->idx, queue->bitmap)); + BUG_ON(list_empty(&queue->nices[nice->idx])); + + sample_end = jiffies; + delta = sample_end - task->sched_info.cl_data.ts.sample_start; + if (delta) + task->sched_info.cl_data.ts.sample_ticks++; + else { + task->sched_info.cl_data.ts.sample_start = jiffies; + task->sched_info.cl_data.ts.sample_ticks = 1; + } + + if (delta >= sample_interval) { + u64 frac_cpu; + frac_cpu = (u64)task->sched_info.cl_data.ts.sample_ticks << 32; + do_div(frac_cpu, delta); + frac_cpu = 2*frac_cpu + task->sched_info.cl_data.ts.frac_cpu; + do_div(frac_cpu, 3); + frac_cpu = min(frac_cpu, (1ULL << 32) - 1); + task->sched_info.cl_data.ts.frac_cpu = (unsigned long)frac_cpu; + task->sched_info.cl_data.ts.sample_start = sample_end; + task->sched_info.cl_data.ts.sample_ticks = 0; + } + + cpustat->user += user_ticks; + cpustat->system += sys_ticks; + nice_tick(nice, task); + if (queue->quantum > nice_quantum) { + queue->quantum = 0; + WARN_ON(1); + } else if (queue->quantum) + queue->quantum--; + if (!queue->quantum) { + queue->quantum = nice_quantum; + ts_rotate_queue(queue); + set_need_resched(); + } + check_queue(queue); + check_ts_policy(task); + return 0; +} + +static void nice_yield(struct nice_queue *queue, task_t *task) +{ + int idx, new_idx = (queue->base + NICE_QLEN - 1) % NICE_QLEN; + + check_nice(queue); + check_ts_policy(task); + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&task->sched_info.run_list, &queue->queue[new_idx]); + idx = task->sched_info.idx; + task->sched_info.idx = new_idx; + set_need_resched(); + + if (list_empty(&queue->queue[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + queue->curr = nice_best(queue); +#if 0 + if (queue->curr->sched_info.idx != queue->base) + queue->base = queue->curr->sched_info.idx; +#endif + check_nice(queue); + check_ts_policy(task); +} + +/* + * This is somewhat problematic; nice_yield() only parks tasks on + * the end of their current nice levels. 
+ */ +static void ts_yield(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + + check_queue(queue); + check_ts_policy(task); + nice_yield(nice, task); + + /* + * If there's no one to yield to, move the whole nice level. + * If this is problematic, setting nice-dependent deadlines + * on a single unified queue may be in order. + */ + if (nice->tasks == 1) { + int idx, new_idx = (queue->base + 40 - 1) % 40; + idx = nice->idx; + if (!test_bit(new_idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[new_idx])); + __set_bit(new_idx, queue->bitmap); + } + list_move_tail(&nice->list, &queue->nices[new_idx]); + if (list_empty(&queue->nices[idx])) { + BUG_ON(!test_bit(idx, queue->bitmap)); + __clear_bit(idx, queue->bitmap); + } + nice->idx = new_idx; + queue->base = find_first_circular_bit(queue->bitmap, + queue->base, + 40); + BUG_ON(queue->base >= 40); + BUG_ON(!test_bit(queue->base, queue->bitmap)); + queue->curr = ts_best_nice(queue); + } + check_queue(queue); + check_ts_policy(task); +} + +static task_t *ts_curr(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + task_t *task = queue->curr->curr; + check_queue(queue); + if (task) + check_ts_policy(task); + return task; +} + +static void ts_set_curr(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + queue->curr = nice; + nice->curr = task; + check_queue(queue); + check_ts_policy(task); +} + +static task_t *nice_best(struct nice_queue *queue) +{ + task_t *task; + int idx = find_first_circular_bit(queue->bitmap, + queue->base, + NICE_QLEN); + check_nice(queue); + if (idx >= NICE_QLEN) + return NULL; + BUG_ON(list_empty(&queue->queue[idx])); + BUG_ON(!test_bit(idx, queue->bitmap)); + task = list_entry(queue->queue[idx].next, task_t, sched_info.run_list); + check_nice(queue); + check_ts_policy(task); + return task; +} + +static struct nice_queue *ts_best_nice(struct ts_queue *queue) +{ + int idx = find_first_circular_bit(queue->bitmap, queue->base, 40); + check_queue(queue); + if (idx >= 40) + return NULL; + BUG_ON(list_empty(&queue->nices[idx])); + BUG_ON(!test_bit(idx, queue->bitmap)); + return list_entry(queue->nices[idx].next, struct nice_queue, list); +} + +static task_t *ts_best(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = ts_best_nice(queue); + return nice ? 
nice_best(nice) : NULL; +} + +static void nice_enqueue(struct nice_queue *queue, task_t *task) +{ + task_t *curr, *sav; + int queued = 0, idx, deadline, base, idxdiff; + check_nice(queue); + check_ts_policy(task); + + /* don't livelock when queue->expired */ + deadline = min(!!queue->expired + task_deadline(task), NICE_QLEN - 1); + idx = (queue->base + deadline) % NICE_QLEN; + + if (!test_bit(idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->queue[idx])); + __set_bit(idx, queue->bitmap); + } + +#if 1 + /* keep nice level's queue sorted -- use binomial heaps here soon */ + list_for_each_entry_safe(curr, sav, &queue->queue[idx], sched_info.run_list) { + if (task->sched_info.cl_data.ts.frac_cpu + >= curr->sched_info.cl_data.ts.frac_cpu) { + list_add(&task->sched_info.run_list, + curr->sched_info.run_list.prev); + queued = 1; + break; + } + } + if (!queued) + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); +#else + list_add_tail(&task->sched_info.run_list, &queue->queue[idx]); +#endif + task->sched_info.idx = idx; + /* if (!task->sched_info.cl_data.ts.ticks) */ + task->sched_info.cl_data.ts.ticks = rr_quantum; + + if (queue->tasks) + BUG_ON(!queue->curr); + else { + BUG_ON(queue->curr); + queue->curr = task; + } + queue->tasks++; + check_nice(queue); + check_ts_policy(task); +} + +static void ts_enqueue(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + if (!nice->tasks) { + int idx = (queue->base + task->sched_info.cl_data.ts.nice) % 40; + if (!test_bit(idx, queue->bitmap)) { + BUG_ON(!list_empty(&queue->nices[idx])); + __set_bit(idx, queue->bitmap); + } + list_add_tail(&nice->list, &queue->nices[idx]); + nice->idx = idx; + if (!queue->curr) + queue->curr = nice; + } + nice_enqueue(nice, task); + queue->tasks++; + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + check_queue(queue); + check_ts_policy(task); +} + +static void nice_dequeue(struct nice_queue *queue, task_t *task) +{ + check_nice(queue); + check_ts_policy(task); + list_del(&task->sched_info.run_list); + if (list_empty(&queue->queue[task->sched_info.idx])) { + BUG_ON(!test_bit(task->sched_info.idx, queue->bitmap)); + __clear_bit(task->sched_info.idx, queue->bitmap); + } + queue->tasks--; + if (task == queue->curr) { + queue->curr = nice_best(queue); +#if 0 + if (queue->curr) + queue->base = queue->curr->sched_info.idx; +#endif + } + check_nice(queue); + check_ts_policy(task); +} + +static void ts_dequeue(struct queue *__queue, task_t *task) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice; + + BUG_ON(!queue->tasks); + check_queue(queue); + check_ts_policy(task); + nice = &queue->nice_levels[task->sched_info.cl_data.ts.nice]; + + nice_dequeue(nice, task); + queue->tasks--; + if (!nice->tasks) { + list_del_init(&nice->list); + if (list_empty(&queue->nices[nice->idx])) { + BUG_ON(!test_bit(nice->idx, queue->bitmap)); + __clear_bit(nice->idx, queue->bitmap); + } + if (nice == queue->curr) + queue->curr = ts_best_nice(queue); + } + queue->base = find_first_circular_bit(queue->bitmap, queue->base, 40); + if (queue->base >= 40) + queue->base = 0; + check_queue(queue); + check_ts_policy(task); +} + +static int ts_tasks(struct queue *__queue) +{ + struct ts_queue *queue = (struct ts_queue *)__queue; + check_queue(queue); + return queue->tasks; +} + +static int ts_nice(struct queue 
*__queue, task_t *task) +{ + int nice = task->sched_info.cl_data.ts.nice - 20; + check_ts_policy(task); + BUG_ON(nice < -20); + BUG_ON(nice >= 20); + return nice; +} + +static void ts_renice(struct queue *queue, task_t *task, int nice) +{ + check_queue((struct ts_queue *)queue); + check_ts_policy(task); + BUG_ON(nice < -20); + BUG_ON(nice >= 20); + task->sched_info.cl_data.ts.nice = nice + 20; + check_queue((struct ts_queue *)queue); +} + +static int nice_task_prio(struct nice_queue *nice, task_t *task) +{ + if (!task_queued(task)) + return task_deadline(task); + else { + int prio = task->sched_info.idx - nice->base; + return prio < 0 ? prio + NICE_QLEN : prio; + } +} + +static int ts_nice_prio(struct ts_queue *ts, struct nice_queue *nice) +{ + if (list_empty(&nice->list)) + return (int)(nice - ts->nice_levels); + else { + int prio = nice->idx - ts->base; + return prio < 0 ? prio + 40 : prio; + } +} + +/* 100% fake priority to report heuristics and the like */ +static int ts_prio(task_t *task) +{ + int policy_idx; + struct policy *policy; + struct ts_queue *ts; + struct nice_queue *nice; + + policy_idx = task->sched_info.policy; + policy = per_cpu(runqueues, task_cpu(task)).policies[policy_idx]; + ts = (struct ts_queue *)policy->queue; + nice = &ts->nice_levels[task->sched_info.cl_data.ts.nice]; + return 40*ts_nice_prio(ts, nice) + nice_task_prio(nice, task); +} + +static void ts_setprio(task_t *task, int prio) +{ +} + +static void ts_start_wait(struct queue *__queue, task_t *task) +{ +} + +static void ts_stop_wait(struct queue *__queue, task_t *task) +{ +} + +static void ts_sleep(struct queue *__queue, task_t *task) +{ +} + +static void ts_wake(struct queue *__queue, task_t *task) +{ +} + +static int nice_preempt(struct nice_queue *queue, task_t *task) +{ + check_nice(queue); + check_ts_policy(task); + /* assume FB style preemption at wakeup */ + if (!task_queued(task) || !queue->curr) + return 1; + else { + int delta_t, delta_q; + delta_t = (task->sched_info.idx - queue->base + NICE_QLEN) + % NICE_QLEN; + delta_q = (queue->curr->sched_info.idx - queue->base + + NICE_QLEN) + % NICE_QLEN; + if (delta_t < delta_q) + return 1; + else if (task->sched_info.cl_data.ts.frac_cpu + < queue->curr->sched_info.cl_data.ts.frac_cpu) + return 1; + else + return 0; + } + check_nice(queue); +} + +static int ts_preempt(struct queue *__queue, task_t *task) +{ + int curr_nice; + struct ts_queue *queue = (struct ts_queue *)__queue; + struct nice_queue *nice = queue->curr; + + check_queue(queue); + check_ts_policy(task); + if (!queue->curr) + return 1; + + curr_nice = (int)(nice - queue->nice_levels); + + /* preempt when nice number is lower, or the above for matches */ + if (task->sched_info.cl_data.ts.nice != curr_nice) + return task->sched_info.cl_data.ts.nice < curr_nice; + else + return nice_preempt(nice, task); +} + +static struct queue_ops ts_ops = { + .init = ts_init, + .fini = nop_fini, + .tick = ts_tick, + .yield = ts_yield, + .curr = ts_curr, + .set_curr = ts_set_curr, + .tasks = ts_tasks, + .best = ts_best, + .enqueue = ts_enqueue, + .dequeue = ts_dequeue, + .start_wait = ts_start_wait, + .stop_wait = ts_stop_wait, + .sleep = ts_sleep, + .wake = ts_wake, + .preempt = ts_preempt, + .nice = ts_nice, + .renice = ts_renice, + .prio = ts_prio, + .setprio = ts_setprio, + .timeslice = nop_timeslice, + .set_timeslice = nop_set_timeslice, +}; + +struct policy ts_policy = { + .ops = &ts_ops, +}; diff -prauN linux-2.6.0-test11/kernel/sched/util.c sched-2.6.0-test11-5/kernel/sched/util.c --- 
linux-2.6.0-test11/kernel/sched/util.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched/util.c 2003-12-19 08:43:20.000000000 -0800 @@ -0,0 +1,37 @@ +#include <linux/sched.h> +#include <linux/list.h> +#include <linux/percpu.h> +#include <asm/page.h> +#include "queue.h" + +int find_first_circular_bit(unsigned long *addr, int start, int end) +{ + int bit = find_next_bit(addr, end, start); + if (bit < end) + return bit; + bit = find_first_bit(addr, start); + if (bit < start) + return bit; + return end; +} + +void queue_nop(struct queue *queue, task_t *task) +{ +} + +void nop_renice(struct queue *queue, task_t *task, int nice) +{ +} + +void nop_fini(struct policy *policy, int cpu) +{ +} + +unsigned long nop_timeslice(struct queue *queue, task_t *task) +{ + return 0; +} + +void nop_set_timeslice(struct queue *queue, task_t *task, unsigned long n) +{ +} diff -prauN linux-2.6.0-test11/kernel/sched.c sched-2.6.0-test11-5/kernel/sched.c --- linux-2.6.0-test11/kernel/sched.c 2003-11-26 12:45:17.000000000 -0800 +++ sched-2.6.0-test11-5/kernel/sched.c 2003-12-21 06:06:32.000000000 -0800 @@ -15,6 +15,8 @@ * and per-CPU runqueues. Cleanups and useful suggestions * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. + * 2003-12-17 Total rewrite and generalized scheduler policies + * by William Irwin. */ #include <linux/mm.h> @@ -38,6 +40,8 @@ #include <linux/cpu.h> #include <linux/percpu.h> +#include "sched/queue.h" + #ifdef CONFIG_NUMA #define cpu_to_node_mask(cpu) node_to_cpumask(cpu_to_node(cpu)) #else @@ -45,181 +49,79 @@ #endif /* - * Convert user-nice values [ -20 ... 0 ... 19 ] - * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ], - * and back. - */ -#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20) -#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20) -#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio) - -/* - * 'User priority' is the nice value converted to something we - * can work with better when scaling various scheduler parameters, - * it's a [ 0 ... 39 ] range. - */ -#define USER_PRIO(p) ((p)-MAX_RT_PRIO) -#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio) -#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO)) -#define AVG_TIMESLICE (MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\ - (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1))) - -/* - * Some helpers for converting nanosecond timing to jiffy resolution - */ -#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ)) -#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ)) - -/* - * These are the 'tuning knobs' of the scheduler: - * - * Minimum timeslice is 10 msecs, default timeslice is 100 msecs, - * maximum timeslice is 200 msecs. Timeslices get refilled after - * they expire. - */ -#define MIN_TIMESLICE ( 10 * HZ / 1000) -#define MAX_TIMESLICE (200 * HZ / 1000) -#define ON_RUNQUEUE_WEIGHT 30 -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 -#define PRIO_BONUS_RATIO 25 -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) -#define INTERACTIVE_DELTA 2 -#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS) -#define STARVATION_LIMIT (MAX_SLEEP_AVG) -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG)) -#define NODE_THRESHOLD 125 -#define CREDIT_LIMIT 100 - -/* - * If a task is 'interactive' then we reinsert it in the active - * array after it has expired its current timeslice. (it will not - * continue to run immediately, it will still roundrobin with - * other interactive tasks.) 
- * - * This part scales the interactivity limit depending on niceness. - * - * We scale it linearly, offset by the INTERACTIVE_DELTA delta. - * Here are a few examples of different nice levels: - * - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] - * - * (the X axis represents the possible -5 ... 0 ... +5 dynamic - * priority range a task can explore, a value of '1' means the - * task is rated interactive.) - * - * Ie. nice +19 tasks can never get 'interactive' enough to be - * reinserted into the active array. And only heavily CPU-hog nice -20 - * tasks will be expired. Default nice 0 tasks are somewhere between, - * it takes some effort for them to get interactive, but it's not - * too hard. - */ - -#define CURRENT_BONUS(p) \ - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ - MAX_SLEEP_AVG) - -#ifdef CONFIG_SMP -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \ - num_online_cpus()) -#else -#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1))) -#endif - -#define SCALE(v1,v1_max,v2_max) \ - (v1) * (v2_max) / (v1_max) - -#define DELTA(p) \ - (SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \ - INTERACTIVE_DELTA) - -#define TASK_INTERACTIVE(p) \ - ((p)->prio <= (p)->static_prio - DELTA(p)) - -#define JUST_INTERACTIVE_SLEEP(p) \ - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \ - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1)) - -#define HIGH_CREDIT(p) \ - ((p)->interactive_credit > CREDIT_LIMIT) - -#define LOW_CREDIT(p) \ - ((p)->interactive_credit < -CREDIT_LIMIT) - -#define TASK_PREEMPTS_CURR(p, rq) \ - ((p)->prio < (rq)->curr->prio) - -/* - * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ] - * to time slice values. - * - * The higher a thread's priority, the bigger timeslices - * it gets during one round of execution. But even the lowest - * priority thread gets MIN_TIMESLICE worth of execution time. - * - * task_timeslice() is the interface that is used by the scheduler. - */ - -#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \ - ((MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1-(p)->static_prio)/(MAX_USER_PRIO - 1))) - -static inline unsigned int task_timeslice(task_t *p) -{ - return BASE_TIMESLICE(p); -} - -/* - * These are the runqueue data structures: - */ - -#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long)) - -typedef struct runqueue runqueue_t; - -struct prio_array { - int nr_active; - unsigned long bitmap[BITMAP_SIZE]; - struct list_head queue[MAX_PRIO]; -}; - -/* * This is the main, per-CPU runqueue data structure. * * Locking rule: those places that want to lock multiple runqueues * (such as the load balancing or the thread migration code), lock * acquire operations must be ordered by ascending &runqueue. 
*/ -struct runqueue { - spinlock_t lock; - unsigned long nr_running, nr_switches, expired_timestamp, - nr_uninterruptible; - task_t *curr, *idle; - struct mm_struct *prev_mm; - prio_array_t *active, *expired, arrays[2]; - int prev_cpu_load[NR_CPUS]; -#ifdef CONFIG_NUMA - atomic_t *node_nr_running; - int prev_node_load[MAX_NUMNODES]; -#endif - task_t *migration_thread; - struct list_head migration_queue; +DEFINE_PER_CPU(struct runqueue, runqueues); - atomic_t nr_iowait; +struct policy *policies[] = { + &rt_policy, + &ts_policy, + &batch_policy, + &idle_policy, + NULL, }; -static DEFINE_PER_CPU(struct runqueue, runqueues); - #define cpu_rq(cpu) (&per_cpu(runqueues, (cpu))) #define this_rq() (&__get_cpu_var(runqueues)) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) +#define rq_curr(rq) (rq)->__curr +#define cpu_curr(cpu) rq_curr(cpu_rq(cpu)) + +static inline struct policy *task_policy(task_t *task) +{ + unsigned long idx; + struct policy *policy; + idx = task->sched_info.policy; + __check_task_policy(idx); + policy = task_rq(task)->policies[idx]; + check_policy(policy); + return policy; +} + +static inline struct policy *rq_policy(runqueue_t *rq) +{ + unsigned long idx; + task_t *task; + struct policy *policy; + + task = rq_curr(rq); + BUG_ON(!task); + BUG_ON((unsigned long)task < PAGE_OFFSET); + idx = task->sched_info.policy; + __check_task_policy(idx); + policy = rq->policies[idx]; + check_policy(policy); + return policy; +} + +static int __task_nice(task_t *task) +{ + struct policy *policy = task_policy(task); + return policy->ops->nice(policy->queue, task); +} + +static inline void set_rq_curr(runqueue_t *rq, task_t *task) +{ + rq->curr = task->sched_info.policy; + __check_task_policy(rq->curr); + rq->__curr = task; +} + +static inline int task_preempts_curr(task_t *task, runqueue_t *rq) +{ + check_task_policy(rq_curr(rq)); + check_task_policy(task); + if (rq_curr(rq)->sched_info.policy != task->sched_info.policy) + return task->sched_info.policy < rq_curr(rq)->sched_info.policy; + else { + struct policy *policy = rq_policy(rq); + return policy->ops->preempt(policy->queue, task); + } +} /* * Default context-switch locking: @@ -227,7 +129,7 @@ static DEFINE_PER_CPU(struct runqueue, r #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(rq, p) (rq_curr(rq) == (p)) #endif #ifdef CONFIG_NUMA @@ -320,53 +222,32 @@ static inline void rq_unlock(runqueue_t } /* - * Adding/removing a task to/from a priority array: + * Adding/removing a task to/from a policy's queue. + * We dare not BUG_ON() a wrong task_queued() as boot-time + * calls may trip it. 
*/ -static inline void dequeue_task(struct task_struct *p, prio_array_t *array) +static inline void dequeue_task(task_t *task, runqueue_t *rq) { - array->nr_active--; - list_del(&p->run_list); - if (list_empty(array->queue + p->prio)) - __clear_bit(p->prio, array->bitmap); + struct policy *policy = task_policy(task); + BUG_ON(!task_queued(task)); + policy->ops->dequeue(policy->queue, task); + if (!policy->ops->tasks(policy->queue)) { + BUG_ON(!test_bit(task->sched_info.policy, &rq->policy_bitmap)); + __clear_bit(task->sched_info.policy, &rq->policy_bitmap); + } + clear_task_queued(task); } -static inline void enqueue_task(struct task_struct *p, prio_array_t *array) +static inline void enqueue_task(task_t *task, runqueue_t *rq) { - list_add_tail(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); - array->nr_active++; - p->array = array; -} - -/* - * effective_prio - return the priority that is based on the static - * priority but is modified by bonuses/penalties. - * - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG] - * into the -5 ... 0 ... +5 bonus/penalty range. - * - * We use 25% of the full 0...39 priority range so that: - * - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs. - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks. - * - * Both properties are important to certain workloads. - */ -static int effective_prio(task_t *p) -{ - int bonus, prio; - - if (rt_task(p)) - return p->prio; - - bonus = CURRENT_BONUS(p) - MAX_BONUS / 2; - - prio = p->static_prio - bonus; - if (prio < MAX_RT_PRIO) - prio = MAX_RT_PRIO; - if (prio > MAX_PRIO-1) - prio = MAX_PRIO-1; - return prio; + struct policy *policy = task_policy(task); + BUG_ON(task_queued(task)); + if (!policy->ops->tasks(policy->queue)) { + BUG_ON(test_bit(task->sched_info.policy, &rq->policy_bitmap)); + __set_bit(task->sched_info.policy, &rq->policy_bitmap); + } + policy->ops->enqueue(policy->queue, task); + set_task_queued(task); } /* @@ -374,134 +255,34 @@ static int effective_prio(task_t *p) */ static inline void __activate_task(task_t *p, runqueue_t *rq) { - enqueue_task(p, rq->active); + enqueue_task(p, rq); nr_running_inc(rq); } -static void recalc_task_prio(task_t *p, unsigned long long now) -{ - unsigned long long __sleep_time = now - p->timestamp; - unsigned long sleep_time; - - if (__sleep_time > NS_MAX_SLEEP_AVG) - sleep_time = NS_MAX_SLEEP_AVG; - else - sleep_time = (unsigned long)__sleep_time; - - if (likely(sleep_time > 0)) { - /* - * User tasks that sleep a long time are categorised as - * idle and will get just interactive status to stay active & - * prevent them suddenly becoming cpu hogs and starving - * other processes. - */ - if (p->mm && p->activated != -1 && - sleep_time > JUST_INTERACTIVE_SLEEP(p)){ - p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG - - AVG_TIMESLICE); - if (!HIGH_CREDIT(p)) - p->interactive_credit++; - } else { - /* - * The lower the sleep avg a task has the more - * rapidly it will rise with sleep time. - */ - sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1; - - /* - * Tasks with low interactive_credit are limited to - * one timeslice worth of sleep avg bonus. 
- */ - if (LOW_CREDIT(p) && - sleep_time > JIFFIES_TO_NS(task_timeslice(p))) - sleep_time = - JIFFIES_TO_NS(task_timeslice(p)); - - /* - * Non high_credit tasks waking from uninterruptible - * sleep are limited in their sleep_avg rise as they - * are likely to be cpu hogs waiting on I/O - */ - if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){ - if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p)) - sleep_time = 0; - else if (p->sleep_avg + sleep_time >= - JUST_INTERACTIVE_SLEEP(p)){ - p->sleep_avg = - JUST_INTERACTIVE_SLEEP(p); - sleep_time = 0; - } - } - - /* - * This code gives a bonus to interactive tasks. - * - * The boost works by updating the 'average sleep time' - * value here, based on ->timestamp. The more time a task - * spends sleeping, the higher the average gets - and the - * higher the priority boost gets as well. - */ - p->sleep_avg += sleep_time; - - if (p->sleep_avg > NS_MAX_SLEEP_AVG){ - p->sleep_avg = NS_MAX_SLEEP_AVG; - if (!HIGH_CREDIT(p)) - p->interactive_credit++; - } - } - } - - p->prio = effective_prio(p); -} - /* * activate_task - move a task to the runqueue and do priority recalculation * * Update all the scheduling statistics stuff. (sleep average * calculation, priority modifiers, etc.) */ -static inline void activate_task(task_t *p, runqueue_t *rq) +static inline void activate_task(task_t *task, runqueue_t *rq) { - unsigned long long now = sched_clock(); - - recalc_task_prio(p, now); - - /* - * This checks to make sure it's not an uninterruptible task - * that is now waking up. - */ - if (!p->activated){ - /* - * Tasks which were woken up by interrupts (ie. hw events) - * are most likely of interactive nature. So we give them - * the credit of extending their sleep time to the period - * of time they spend on the runqueue, waiting for execution - * on a CPU, first time around: - */ - if (in_interrupt()) - p->activated = 2; - else - /* - * Normal first-time wakeups get a credit too for on-runqueue - * time, but it will be weighted down: - */ - p->activated = 1; - } - p->timestamp = now; - - __activate_task(p, rq); + struct policy *policy = task_policy(task); + policy->ops->wake(policy->queue, task); + __activate_task(task, rq); } /* * deactivate_task - remove a task from the runqueue. */ -static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) +static inline void deactivate_task(task_t *task, runqueue_t *rq) { + struct policy *policy = task_policy(task); nr_running_dec(rq); - if (p->state == TASK_UNINTERRUPTIBLE) + if (task->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; - dequeue_task(p, p->array); - p->array = NULL; + policy->ops->sleep(policy->queue, task); + dequeue_task(task, rq); } /* @@ -625,7 +406,7 @@ repeat_lock_task: rq = task_rq_lock(p, &flags); old_state = p->state; if (old_state & state) { - if (!p->array) { + if (!task_queued(p)) { /* * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. @@ -644,14 +425,13 @@ repeat_lock_task: * Tasks on involuntary sleep don't earn * sleep_avg beyond just interactive state. */ - p->activated = -1; } if (sync) __activate_task(p, rq); else { activate_task(p, rq); - if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); + if (task_preempts_curr(p, rq)) + resched_task(rq_curr(rq)); } success = 1; } @@ -679,68 +459,26 @@ int wake_up_state(task_t *p, unsigned in * This function will do some initial scheduler statistics housekeeping * that must be done for every newly created process. 
*/ -void wake_up_forked_process(task_t * p) +void wake_up_forked_process(task_t *task) { unsigned long flags; runqueue_t *rq = task_rq_lock(current, &flags); - p->state = TASK_RUNNING; - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. - */ - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - - p->interactive_credit = 0; - - p->prio = effective_prio(p); - set_task_cpu(p, smp_processor_id()); - - if (unlikely(!current->array)) - __activate_task(p, rq); - else { - p->prio = current->prio; - list_add_tail(&p->run_list, ¤t->run_list); - p->array = current->array; - p->array->nr_active++; - nr_running_inc(rq); - } + task->state = TASK_RUNNING; + set_task_cpu(task, smp_processor_id()); + if (unlikely(!task_queued(current))) + __activate_task(task, rq); + else + activate_task(task, rq); task_rq_unlock(rq, &flags); } /* - * Potentially available exiting-child timeslices are - * retrieved here - this way the parent does not get - * penalized for creating too many threads. - * - * (this cannot be used to 'generate' timeslices - * artificially, because any timeslice recovered here - * was given away by the parent in the first place.) + * Policies that depend on trapping fork() and exit() may need to + * put a hook here. */ -void sched_exit(task_t * p) +void sched_exit(task_t *task) { - unsigned long flags; - - local_irq_save(flags); - if (p->first_time_slice) { - p->parent->time_slice += p->time_slice; - if (unlikely(p->parent->time_slice > MAX_TIMESLICE)) - p->parent->time_slice = MAX_TIMESLICE; - } - local_irq_restore(flags); - /* - * If the child was a (relative-) CPU hog then decrease - * the sleep_avg of the parent as well. - */ - if (p->sleep_avg < p->parent->sleep_avg) - p->parent->sleep_avg = p->parent->sleep_avg / - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / - (EXIT_WEIGHT + 1); } /** @@ -1128,18 +866,18 @@ out: * pull_task - move a task from a remote runqueue to the local runqueue. * Both runqueues must be locked. */ -static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) +static inline void pull_task(runqueue_t *src_rq, task_t *p, runqueue_t *this_rq, int this_cpu) { - dequeue_task(p, src_array); + dequeue_task(p, src_rq); nr_running_dec(src_rq); set_task_cpu(p, this_cpu); nr_running_inc(this_rq); - enqueue_task(p, this_rq->active); + enqueue_task(p, this_rq); /* * Note that idle threads have a prio of MAX_PRIO, for this test * to be always true for them. */ - if (TASK_PREEMPTS_CURR(p, this_rq)) + if (task_preempts_curr(p, this_rq)) set_need_resched(); } @@ -1150,14 +888,14 @@ static inline void pull_task(runqueue_t * ((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \ * cache_decay_ticks)) && !task_running(rq, p) && \ * cpu_isset(this_cpu, (p)->cpus_allowed)) + * + * Since there isn't a timestamp anymore, this needs adjustment. 
*/ static inline int can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle) { - unsigned long delta = sched_clock() - tsk->timestamp; - - if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks))) + if (!idle) return 0; if (task_running(rq, tsk)) return 0; @@ -1176,11 +914,8 @@ can_migrate_task(task_t *tsk, runqueue_t */ static void load_balance(runqueue_t *this_rq, int idle, cpumask_t cpumask) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, this_cpu = smp_processor_id(); runqueue_t *busiest; - prio_array_t *array; - struct list_head *head, *curr; - task_t *tmp; busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask); if (!busiest) @@ -1192,37 +927,6 @@ static void load_balance(runqueue_t *thi */ imbalance /= 2; - /* - * We first consider expired tasks. Those will likely not be - * executed in the near future, and they are most likely to - * be cache-cold, thus switching CPUs has the least effect - * on them. - */ - if (busiest->expired->nr_active) - array = busiest->expired; - else - array = busiest->active; - -new_array: - /* Start searching at priority 0: */ - idx = 0; -skip_bitmap: - if (!idx) - idx = sched_find_first_bit(array->bitmap); - else - idx = find_next_bit(array->bitmap, MAX_PRIO, idx); - if (idx >= MAX_PRIO) { - if (array == busiest->expired) { - array = busiest->active; - goto new_array; - } - goto out_unlock; - } - - head = array->queue + idx; - curr = head->prev; -skip_queue: - tmp = list_entry(curr, task_t, run_list); /* * We do not migrate tasks that are: @@ -1231,21 +935,19 @@ skip_queue: * 3) are cache-hot on their current CPU. */ - curr = curr->prev; + do { + struct policy *policy; + task_t *task; + + policy = rq_migrate_policy(busiest); + if (!policy) + break; + task = policy->migrate(policy->queue); + if (!task) + break; + pull_task(busiest, task, this_rq, this_cpu); + } while (!idle && --imbalance); - if (!can_migrate_task(tmp, busiest, this_cpu, idle)) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } - pull_task(busiest, array, tmp, this_rq, this_cpu); - if (!idle && --imbalance) { - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; - } out_unlock: spin_unlock(&busiest->lock); out: @@ -1356,10 +1058,10 @@ EXPORT_PER_CPU_SYMBOL(kstat); */ void scheduler_tick(int user_ticks, int sys_ticks) { - int cpu = smp_processor_id(); + int idle, cpu = smp_processor_id(); struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat; + struct policy *policy; runqueue_t *rq = this_rq(); - task_t *p = current; if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_ticks); @@ -1373,98 +1075,28 @@ void scheduler_tick(int user_ticks, int sys_ticks = 0; } - if (p == rq->idle) { - if (atomic_read(&rq->nr_iowait) > 0) - cpustat->iowait += sys_ticks; - else - cpustat->idle += sys_ticks; - rebalance_tick(rq, 1); - return; - } - if (TASK_NICE(p) > 0) - cpustat->nice += user_ticks; - else - cpustat->user += user_ticks; - cpustat->system += sys_ticks; - - /* Task might have expired already, but not scheduled off yet */ - if (p->array != rq->active) { - set_tsk_need_resched(p); - goto out; - } spin_lock(&rq->lock); - /* - * The task was running during this tick - update the - * time slice counter. Note: we do not update a thread's - * priority until it either goes to sleep or uses up its - * timeslice. This makes it possible for interactive tasks - * to use up their timeslices at their highest priority levels. 
- */ - if (unlikely(rt_task(p))) { - /* - * RR tasks need a special form of timeslice management. - * FIFO tasks have no timeslices. - */ - if ((p->policy == SCHED_RR) && !--p->time_slice) { - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - set_tsk_need_resched(p); - - /* put it at the end of the queue: */ - dequeue_task(p, rq->active); - enqueue_task(p, rq->active); - } - goto out_unlock; - } - if (!--p->time_slice) { - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - - if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; - if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { - enqueue_task(p, rq->expired); - } else - enqueue_task(p, rq->active); - } else { - /* - * Prevent a too long timeslice allowing a task to monopolize - * the CPU. We do this by splitting up the timeslice into - * smaller pieces. - * - * Note: this does not mean the task's timeslices expire or - * get lost in any way, they just might be preempted by - * another task of equal priority. (one with higher - * priority would have preempted this task already.) We - * requeue this task to the end of the list on this priority - * level, which is in essence a round-robin of tasks with - * equal priority. - * - * This only applies to tasks in the interactive - * delta range with at least TIMESLICE_GRANULARITY to requeue. - */ - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - - p->time_slice) % TIMESLICE_GRANULARITY(p)) && - (p->time_slice >= TIMESLICE_GRANULARITY(p)) && - (p->array == rq->active)) { - - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - enqueue_task(p, rq->active); - } - } -out_unlock: + policy = rq_policy(rq); + idle = policy->ops->tick(policy->queue, current, user_ticks, sys_ticks); spin_unlock(&rq->lock); -out: - rebalance_tick(rq, 0); + rebalance_tick(rq, idle); } void scheduling_functions_start_here(void) { } +static inline task_t *find_best_task(runqueue_t *rq) +{ + int idx; + struct policy *policy; + + BUG_ON(!rq->policy_bitmap); + idx = __ffs(rq->policy_bitmap); + __check_task_policy(idx); + policy = rq->policies[idx]; + check_policy(policy); + return policy->ops->best(policy->queue); +} + /* * schedule() is the main scheduler function. */ @@ -1472,11 +1104,7 @@ asmlinkage void schedule(void) { task_t *prev, *next; runqueue_t *rq; - prio_array_t *array; - struct list_head *queue; - unsigned long long now; - unsigned long run_time; - int idx; + struct policy *policy; /* * Test if we are atomic. Since do_exit() needs to call into @@ -1494,22 +1122,9 @@ need_resched: preempt_disable(); prev = current; rq = this_rq(); + policy = rq_policy(rq); release_kernel_lock(prev); - now = sched_clock(); - if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG)) - run_time = now - prev->timestamp; - else - run_time = NS_MAX_SLEEP_AVG; - - /* - * Tasks with interactive credits get charged less run_time - * at high sleep_avg to delay them losing their interactive - * status - */ - if (HIGH_CREDIT(prev)) - run_time /= (CURRENT_BONUS(prev) ? 
: 1); - spin_lock_irq(&rq->lock); /* @@ -1530,66 +1145,27 @@ need_resched: prev->nvcsw++; break; case TASK_RUNNING: + policy->ops->start_wait(policy->queue, prev); prev->nivcsw++; } + pick_next_task: - if (unlikely(!rq->nr_running)) { #ifdef CONFIG_SMP + if (unlikely(!rq->nr_running)) load_balance(rq, 1, cpu_to_node_mask(smp_processor_id())); - if (rq->nr_running) - goto pick_next_task; #endif - next = rq->idle; - rq->expired_timestamp = 0; - goto switch_tasks; - } - - array = rq->active; - if (unlikely(!array->nr_active)) { - /* - * Switch the active and expired arrays. - */ - rq->active = rq->expired; - rq->expired = array; - array = rq->active; - rq->expired_timestamp = 0; - } - - idx = sched_find_first_bit(array->bitmap); - queue = array->queue + idx; - next = list_entry(queue->next, task_t, run_list); - - if (next->activated > 0) { - unsigned long long delta = now - next->timestamp; - - if (next->activated == 1) - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; - - array = next->array; - dequeue_task(next, array); - recalc_task_prio(next, next->timestamp + delta); - enqueue_task(next, array); - } - next->activated = 0; -switch_tasks: + next = find_best_task(rq); + BUG_ON(!next); prefetch(next); clear_tsk_need_resched(prev); RCU_qsctr(task_cpu(prev))++; - prev->sleep_avg -= run_time; - if ((long)prev->sleep_avg <= 0){ - prev->sleep_avg = 0; - if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev))) - prev->interactive_credit--; - } - prev->timestamp = now; - if (likely(prev != next)) { - next->timestamp = now; rq->nr_switches++; - rq->curr = next; - prepare_arch_switch(rq, next); + policy = task_policy(next); + policy->ops->set_curr(policy->queue, next); + set_rq_curr(rq, next); prev = context_switch(rq, prev, next); barrier(); @@ -1845,45 +1421,46 @@ void scheduling_functions_end_here(void) void set_user_nice(task_t *p, long nice) { unsigned long flags; - prio_array_t *array; runqueue_t *rq; - int old_prio, new_prio, delta; + struct policy *policy; + int delta, queued; - if (TASK_NICE(p) == nice || nice < -20 || nice > 19) + if (nice < -20 || nice > 19) return; /* * We have to be careful, if called from sys_setpriority(), * the task might be in the middle of scheduling on another CPU. 
*/ rq = task_rq_lock(p, &flags); + delta = nice - __task_nice(p); + if (!delta) { + if (p->pid == 0 || p->pid == 1) + printk("no change in nice, set_user_nice() nops!\n"); + goto out_unlock; + } + + policy = task_policy(p); + /* * The RT priorities are set via setscheduler(), but we still * allow the 'normal' nice value to be set - but as expected * it wont have any effect on scheduling until the task is * not SCHED_NORMAL: */ - if (rt_task(p)) { - p->static_prio = NICE_TO_PRIO(nice); - goto out_unlock; - } - array = p->array; - if (array) - dequeue_task(p, array); - - old_prio = p->prio; - new_prio = NICE_TO_PRIO(nice); - delta = new_prio - old_prio; - p->static_prio = NICE_TO_PRIO(nice); - p->prio += delta; + queued = task_queued(p); + if (queued) + dequeue_task(p, rq); + + policy->ops->renice(policy->queue, p, nice); - if (array) { - enqueue_task(p, array); + if (queued) { + enqueue_task(p, rq); /* * If the task increased its priority or is running and * lowered its priority, then reschedule its CPU: */ if (delta < 0 || (delta > 0 && task_running(rq, p))) - resched_task(rq->curr); + resched_task(rq_curr(rq)); } out_unlock: task_rq_unlock(rq, &flags); @@ -1919,7 +1496,7 @@ asmlinkage long sys_nice(int increment) if (increment > 40) increment = 40; - nice = PRIO_TO_NICE(current->static_prio) + increment; + nice = task_nice(current) + increment; if (nice < -20) nice = -20; if (nice > 19) @@ -1935,6 +1512,12 @@ asmlinkage long sys_nice(int increment) #endif +static int __task_prio(task_t *task) +{ + struct policy *policy = task_policy(task); + return policy->ops->prio(task); +} + /** * task_prio - return the priority value of a given task. * @p: the task in question. @@ -1943,29 +1526,111 @@ asmlinkage long sys_nice(int increment) * RT tasks are offset by -200. Normal tasks are centered * around 0, value goes from -16 to +15. */ -int task_prio(task_t *p) +int task_prio(task_t *task) { - return p->prio - MAX_RT_PRIO; + int prio; + unsigned long flags; + runqueue_t *rq; + + rq = task_rq_lock(task, &flags); + prio = __task_prio(task); + task_rq_unlock(rq, &flags); + return prio; } /** * task_nice - return the nice value of a given task. * @p: the task in question. 
*/ -int task_nice(task_t *p) +int task_nice(task_t *task) { - return TASK_NICE(p); + int nice; + unsigned long flags; + runqueue_t *rq; + + + rq = task_rq_lock(task, &flags); + nice = __task_nice(task); + task_rq_unlock(rq, &flags); + return nice; } EXPORT_SYMBOL(task_nice); +int task_sched_policy(task_t *task) +{ + check_task_policy(task); + switch (task->sched_info.policy) { + case SCHED_POLICY_RT: + if (task->sched_info.cl_data.rt.rt_policy + == RT_POLICY_RR) + return SCHED_RR; + else + return SCHED_FIFO; + case SCHED_POLICY_TS: + return SCHED_NORMAL; + case SCHED_POLICY_BATCH: + return SCHED_BATCH; + case SCHED_POLICY_IDLE: + return SCHED_IDLE; + default: + BUG(); + return -1; + } +} +EXPORT_SYMBOL(task_sched_policy); + +void set_task_sched_policy(task_t *task, int policy) +{ + check_task_policy(task); + BUG_ON(task_queued(task)); + switch (policy) { + case SCHED_FIFO: + task->sched_info.policy = SCHED_POLICY_RT; + task->sched_info.cl_data.rt.rt_policy = RT_POLICY_FIFO; + break; + case SCHED_RR: + task->sched_info.policy = SCHED_POLICY_RT; + task->sched_info.cl_data.rt.rt_policy = RT_POLICY_RR; + break; + case SCHED_NORMAL: + task->sched_info.policy = SCHED_POLICY_TS; + break; + case SCHED_BATCH: + task->sched_info.policy = SCHED_POLICY_BATCH; + break; + case SCHED_IDLE: + task->sched_info.policy = SCHED_POLICY_IDLE; + break; + default: + BUG(); + break; + } + check_task_policy(task); +} +EXPORT_SYMBOL(set_task_sched_policy); + +int rt_task(task_t *task) +{ + check_task_policy(task); + return !!(task->sched_info.policy == SCHED_POLICY_RT); +} +EXPORT_SYMBOL(rt_task); + /** * idle_cpu - is a given cpu idle currently? * @cpu: the processor in question. */ int idle_cpu(int cpu) { - return cpu_curr(cpu) == cpu_rq(cpu)->idle; + int idle; + unsigned long flags; + runqueue_t *rq = cpu_rq(cpu); + + spin_lock_irqsave(&rq->lock, flags); + idle = !!(rq->curr == SCHED_POLICY_IDLE); + spin_unlock_irqrestore(&rq->lock, flags); + return idle; } EXPORT_SYMBOL_GPL(idle_cpu); @@ -1985,11 +1650,10 @@ static inline task_t *find_process_by_pi static int setscheduler(pid_t pid, int policy, struct sched_param __user *param) { struct sched_param lp; - int retval = -EINVAL; - int oldprio; - prio_array_t *array; + int queued, retval = -EINVAL; unsigned long flags; runqueue_t *rq; + struct policy *rq_policy; task_t *p; if (!param || pid < 0) @@ -2017,7 +1681,7 @@ static int setscheduler(pid_t pid, int p rq = task_rq_lock(p, &flags); if (policy < 0) - policy = p->policy; + policy = task_sched_policy(p); else { retval = -EINVAL; if (policy != SCHED_FIFO && policy != SCHED_RR && @@ -2047,29 +1711,23 @@ static int setscheduler(pid_t pid, int p if (retval) goto out_unlock; - array = p->array; - if (array) + queued = task_queued(p); + if (queued) deactivate_task(p, task_rq(p)); retval = 0; - p->policy = policy; - p->rt_priority = lp.sched_priority; - oldprio = p->prio; - if (policy != SCHED_NORMAL) - p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority; - else - p->prio = p->static_prio; - if (array) { + set_task_sched_policy(p, policy); + check_task_policy(p); + rq_policy = rq->policies[p->sched_info.policy]; + check_policy(rq_policy); + rq_policy->ops->setprio(p, lp.sched_priority); + if (queued) { __activate_task(p, task_rq(p)); /* * Reschedule if we are currently running on this runqueue and * our priority decreased, or if we are not currently running on * this runqueue and our priority is higher than the current's */ - if (rq->curr == p) { - if (p->prio > oldprio) - resched_task(rq->curr); - } else if (p->prio < 
rq->curr->prio) - resched_task(rq->curr); + resched_task(rq_curr(rq)); } out_unlock: @@ -2121,7 +1779,7 @@ asmlinkage long sys_sched_getscheduler(p if (p) { retval = security_task_getscheduler(p); if (!retval) - retval = p->policy; + retval = task_sched_policy(p); } read_unlock(&tasklist_lock); @@ -2153,7 +1811,7 @@ asmlinkage long sys_sched_getparam(pid_t if (retval) goto out_unlock; - lp.sched_priority = p->rt_priority; + lp.sched_priority = task_prio(p); read_unlock(&tasklist_lock); /* @@ -2262,32 +1920,13 @@ out_unlock: */ asmlinkage long sys_sched_yield(void) { + struct policy *policy; runqueue_t *rq = this_rq_lock(); - prio_array_t *array = current->array; - - /* - * We implement yielding by moving the task into the expired - * queue. - * - * (special rule: RT tasks will just roundrobin in the active - * array.) - */ - if (likely(!rt_task(current))) { - dequeue_task(current, array); - enqueue_task(current, rq->expired); - } else { - list_del(¤t->run_list); - list_add_tail(¤t->run_list, array->queue + current->prio); - } - /* - * Since we are going to call schedule() anyway, there's - * no need to preempt: - */ + policy = rq_policy(rq); + policy->ops->yield(policy->queue, current); _raw_spin_unlock(&rq->lock); preempt_enable_no_resched(); - schedule(); - return 0; } @@ -2387,6 +2026,19 @@ asmlinkage long sys_sched_get_priority_m return ret; } +static inline unsigned long task_timeslice(task_t *task) +{ + unsigned long flags, timeslice; + struct policy *policy; + runqueue_t *rq; + + rq = task_rq_lock(task, &flags); + policy = task_policy(task); + timeslice = policy->ops->timeslice(policy->queue, task); + task_rq_unlock(rq, &flags); + return timeslice; +} + /** * sys_sched_rr_get_interval - return the default timeslice of a process. * @pid: pid of the process. @@ -2414,8 +2066,7 @@ asmlinkage long sys_sched_rr_get_interva if (retval) goto out_unlock; - jiffies_to_timespec(p->policy & SCHED_FIFO ? - 0 : task_timeslice(p), &t); + jiffies_to_timespec(task_timeslice(p), &t); read_unlock(&tasklist_lock); retval = copy_to_user(interval, &t, sizeof(t)) ? 
-EFAULT : 0; out_nounlock: @@ -2523,17 +2174,22 @@ void show_state(void) void __init init_idle(task_t *idle, int cpu) { runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(task_cpu(idle)); + struct policy *policy; unsigned long flags; local_irq_save(flags); double_rq_lock(idle_rq, rq); - - idle_rq->curr = idle_rq->idle = idle; + policy = rq_policy(rq); + BUG_ON(policy != task_policy(idle)); + printk("deactivating, have %d tasks\n", + policy->ops->tasks(policy->queue)); deactivate_task(idle, rq); - idle->array = NULL; - idle->prio = MAX_PRIO; + set_task_sched_policy(idle, SCHED_IDLE); idle->state = TASK_RUNNING; set_task_cpu(idle, cpu); + activate_task(idle, rq); + nr_running_dec(rq); + set_rq_curr(rq, idle); double_rq_unlock(idle_rq, rq); set_tsk_need_resched(idle); local_irq_restore(flags); @@ -2804,38 +2460,27 @@ __init static void init_kstat(void) { void __init sched_init(void) { runqueue_t *rq; - int i, j, k; + int i, j; /* Init the kstat counters */ init_kstat(); for (i = 0; i < NR_CPUS; i++) { - prio_array_t *array; - rq = cpu_rq(i); - rq->active = rq->arrays; - rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); INIT_LIST_HEAD(&rq->migration_queue); atomic_set(&rq->nr_iowait, 0); nr_running_init(rq); - - for (j = 0; j < 2; j++) { - array = rq->arrays + j; - for (k = 0; k < MAX_PRIO; k++) { - INIT_LIST_HEAD(array->queue + k); - __clear_bit(k, array->bitmap); - } - // delimiter for bitsearch - __set_bit(MAX_PRIO, array->bitmap); - } + memcpy(rq->policies, policies, sizeof(policies)); + for (j = 0; j < BITS_PER_LONG && rq->policies[j]; ++j) + rq->policies[j]->ops->init(rq->policies[j], i); } /* * We have to do a little magic to get the first * thread right in SMP mode. */ rq = this_rq(); - rq->curr = current; - rq->idle = current; + set_task_sched_policy(current, SCHED_NORMAL); + set_rq_curr(rq, current); set_task_cpu(current, smp_processor_id()); wake_up_forked_process(current); diff -prauN linux-2.6.0-test11/lib/Makefile sched-2.6.0-test11-5/lib/Makefile --- linux-2.6.0-test11/lib/Makefile 2003-11-26 12:42:55.000000000 -0800 +++ sched-2.6.0-test11-5/lib/Makefile 2003-12-20 15:09:16.000000000 -0800 @@ -5,7 +5,7 @@ lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \ bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \ - kobject.o idr.o div64.o parser.o + kobject.o idr.o div64.o parser.o binomial.o lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o diff -prauN linux-2.6.0-test11/lib/binomial.c sched-2.6.0-test11-5/lib/binomial.c --- linux-2.6.0-test11/lib/binomial.c 1969-12-31 16:00:00.000000000 -0800 +++ sched-2.6.0-test11-5/lib/binomial.c 2003-12-20 17:32:09.000000000 -0800 @@ -0,0 +1,138 @@ +#include <linux/kernel.h> +#include <linux/binomial.h> + +struct binomial *binomial_minimum(struct binomial **heap) +{ + struct binomial *minimum, *tmp; + + for (minimum = NULL, tmp = *heap; tmp; tmp = tmp->sibling) { + if (!minimum || minimum->priority > tmp->priority) + minimum = tmp; + } + return minimum; +} + +static void binomial_link(struct binomial *left, struct binomial *right) +{ + left->parent = right; + left->sibling = right->child; + right->child = left; + right->degree++; +} + +static void binomial_merge(struct binomial **both, struct binomial **left, + struct binomial **right) +{ + while (*left && *right) { + if ((*left)->degree < (*right)->degree) { + *both = *left; + left = &(*left)->sibling; + } else { + *both = *right; + right = &(*right)->sibling; + } + both = &(*both)->sibling; + } + /* + * for more safety: + * 
*left = *right = NULL; + */ +} + +void binomial_union(struct binomial **both, struct binomial **left, + struct binomial **right) +{ + struct binomial *prev, *tmp, *next; + + binomial_merge(both, left, right); + if (!(tmp = *both)) + return; + + for (prev = NULL, next = tmp->sibling; next; next = tmp->sibling) { + if ((next->sibling && next->sibling->degree == tmp->degree) + || tmp->degree != next->degree) { + prev = tmp; + tmp = next; + } else if (tmp->priority <= next->priority) { + tmp->sibling = next->sibling; + binomial_link(next, tmp); + } else { + if (!prev) + *both = next; + else + prev->sibling = next; + binomial_link(tmp, next); + tmp = next; + } + } +} + +void binomial_insert(struct binomial **heap, struct binomial *element) +{ + element->parent = NULL; + element->child = NULL; + element->sibling = NULL; + element->degree = 0; + binomial_union(heap, heap, &element); +} + +static void binomial_reverse(struct binomial **in, struct binomial **out) +{ + while (*in) { + struct binomial *tmp = *in; + *in = (*in)->sibling; + tmp->sibling = *out; + *out = tmp; + } +} + +struct binomial *binomial_extract_min(struct binomial **heap) +{ + struct binomial *tmp, *minimum, *last, *min_last, *new_heap; + + minimum = last = min_last = new_heap = NULL; + for (tmp = *heap; tmp; last = tmp, tmp = tmp->sibling) { + if (!minimum || tmp->priority < minimum->priority) { + minimum = tmp; + min_last = last; + } + } + if (min_last && minimum) + min_last->sibling = minimum->sibling; + else if (minimum) + (*heap)->sibling = minimum->sibling; + else + return NULL; + binomial_reverse(&minimum->child, &new_heap); + binomial_union(heap, heap, &new_heap); + return minimum; +} + +void binomial_decrease(struct binomial **heap, struct binomial *element, + unsigned increment) +{ + struct binomial *tmp, *last = NULL; + + element->priority -= min(element->priority, increment); + last = element; + tmp = last->parent; + while (tmp && last->priority < tmp->priority) { + unsigned tmp_prio = tmp->priority; + tmp->priority = last->priority; + last->priority = tmp_prio; + last = tmp; + tmp = tmp->parent; + } +} + +void binomial_delete(struct binomial **heap, struct binomial *element) +{ + struct binomial *tmp, *last = element; + for (tmp = last->parent; tmp; last = tmp, tmp = tmp->parent) { + unsigned tmp_prio = tmp->priority; + tmp->priority = last->priority; + last->priority = tmp_prio; + } + binomial_reverse(&last->child, &tmp); + binomial_union(heap, heap, &tmp); +} diff -prauN linux-2.6.0-test11/mm/oom_kill.c sched-2.6.0-test11-5/mm/oom_kill.c --- linux-2.6.0-test11/mm/oom_kill.c 2003-11-26 12:44:16.000000000 -0800 +++ sched-2.6.0-test11-5/mm/oom_kill.c 2003-12-17 07:07:53.000000000 -0800 @@ -158,7 +158,6 @@ static void __oom_kill_task(task_t *p) * all the memory it needs. That way it should be able to * exit() and clear out its resources quickly... */ - p->time_slice = HZ; p->flags |= PF_MEMALLOC | PF_MEMDIE; /* This process has hardware access, be more careful. */ ^ permalink raw reply [flat|nested] 304+ messages in thread
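To make the lib/binomial.c interface added by the patch above easier to follow, here is a rough sketch of how a scheduling policy could queue tasks on such a heap and pick the one with the smallest key. The wrapper struct and helper names below are invented for illustration and are not part of the posted patch; only the binomial_insert()/binomial_extract_min() calls and the 'priority' field come from the patch itself (which also adds the linux/binomial.h header that lib/binomial.c includes).

#include <linux/kernel.h>
#include <linux/binomial.h>

/*
 * Hypothetical per-task queueing node for a policy built on the
 * binomial heap above; nothing in this fragment is in the patch.
 */
struct heap_node {
        struct binomial node;   /* keyed by 'priority', smaller is better */
        task_t *task;
};

static struct binomial *heap;   /* root list of the heap, NULL when empty */

static void policy_enqueue(struct heap_node *hn, unsigned int key)
{
        /* The caller sets the key; binomial_insert() links the node in. */
        hn->node.priority = key;
        binomial_insert(&heap, &hn->node);
}

static task_t *policy_pick_next(void)
{
        struct binomial *min = binomial_extract_min(&heap);

        /* Empty queue: let the caller fall back to the idle task. */
        if (!min)
                return NULL;
        return container_of(min, struct heap_node, node)->task;
}

Extract-min on a binomial heap is at worst O(log n) in the number of queued entries, which is one reason the O(1)-versus-O(log N) question keeps coming up later in the thread.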
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:24 ` Ingo Molnar 2007-04-17 9:57 ` William Lee Irwin III @ 2007-04-17 22:08 ` Matt Mackall 2007-04-17 22:32 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-17 22:08 UTC (permalink / raw) To: Ingo Molnar Cc: William Lee Irwin III, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > > * William Lee Irwin III <wli@holomorphy.com> wrote: > > > [...] Also rest assured that the tone of the critique is not hostile, > > and wasn't meant to sound that way. > > ok :) (And i guess i was too touchy - sorry about coming out swinging.) > > > Also, given the general comments it appears clear that some > > statistical metric of deviation from the intended behavior furthermore > > qualified by timescale is necessary, so this appears to be headed > > toward a sort of performance metric as opposed to a pass/fail test > > anyway. However, to even measure this at all, some statement of > > intention is required. I'd prefer that there be a Linux-standard > > semantics for nice so results are more directly comparable and so that > > users also get similar nice behavior from the scheduler as it varies > > over time and possibly implementations if users should care to switch > > them out with some scheduler patch or other. > > yeah. If you could come up with a sane definition that also translates > into low overhead on the algorithm side that would be great! How's this: If you're running two identical CPU hog tasks A and B differing only by nice level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a constant f(Anice - Bnice). Other definitions make things hard to analyze and probably not well-bounded when confronted with > 2 tasks. I -think- this implies keeping a separate scaled CPU usage counter, where the scaling factor is a trivial exponential function of nice level where f(0) == 1. Then you schedule based on this scaled usage counter rather than unscaled. I also suspect we want to keep the exponential base small so that the maximal difference is 10x-100x. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
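As a rough illustration of the scaled-usage-counter idea described above, here is a small user-space sketch. The base of 1.25, the structures and the function names are invented for the example and are not taken from any posted scheduler; the only point is that charging each task weight-scaled usage and always running the least-charged task yields the constant cputime ratio asked for.

#include <math.h>
#include <stdio.h>

#define NTASKS 3

struct task {
        int nice;               /* -20 .. 19 */
        double scaled_usage;    /* CPU time charged, scaled by nice weight */
};

/* f(0) == 1; every nice level multiplies the charge by a constant base. */
static double nice_scale(int nice)
{
        return pow(1.25, nice);
}

/* Always run the task that has accumulated the least scaled usage. */
static struct task *pick_next(struct task *t, int n)
{
        struct task *best = &t[0];
        int i;

        for (i = 1; i < n; i++)
                if (t[i].scaled_usage < best->scaled_usage)
                        best = &t[i];
        return best;
}

int main(void)
{
        struct task tasks[NTASKS] = { { -5, 0 }, { 0, 0 }, { 5, 0 } };
        long got[NTASKS] = { 0 };
        long tick;
        int i;

        /* Three CPU hogs competing for one CPU, one unit of charge per tick. */
        for (tick = 0; tick < 1000000; tick++) {
                struct task *t = pick_next(tasks, NTASKS);

                t->scaled_usage += nice_scale(t->nice);
                got[t - tasks]++;
        }
        for (i = 0; i < NTASKS; i++)
                printf("nice %3d: %5.1f%% of the CPU\n",
                       tasks[i].nice, 100.0 * got[i] / 1000000.0);
        return 0;
}

Over a long run the ratio of CPU received by any two of the hogs converges to 1.25 raised to their nice difference, i.e. a function of (Anice - Bnice) only, which is exactly the property proposed above.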
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:08 ` Matt Mackall @ 2007-04-17 22:32 ` William Lee Irwin III 2007-04-17 22:39 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 22:32 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: >> yeah. If you could come up with a sane definition that also translates >> into low overhead on the algorithm side that would be great! On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote: > How's this: > If you're running two identical CPU hog tasks A and B differing only by nice > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a > constant f(Anice - Bnice). > Other definitions make things hard to analyze and probably not > well-bounded when confronted with > 2 tasks. > I -think- this implies keeping a separate scaled CPU usage counter, > where the scaling factor is a trivial exponential function of nice > level where f(0) == 1. Then you schedule based on this scaled usage > counter rather than unscaled. > I also suspect we want to keep the exponential base small so that the > maximal difference is 10x-100x. I'm already working with this as my assumed nice semantics (actually something with a specific exponential base, suggested in other emails) until others start saying they want something different and agree. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:32 ` William Lee Irwin III @ 2007-04-17 22:39 ` Matt Mackall 2007-04-17 22:59 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-17 22:39 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 11:24:22AM +0200, Ingo Molnar wrote: > >> yeah. If you could come up with a sane definition that also translates > >> into low overhead on the algorithm side that would be great! > > On Tue, Apr 17, 2007 at 05:08:09PM -0500, Matt Mackall wrote: > > How's this: > > If you're running two identical CPU hog tasks A and B differing only by nice > > level (Anice, Bnice), the ratio cputime(A)/cputime(B) should be a > > constant f(Anice - Bnice). > > Other definitions make things hard to analyze and probably not > > well-bounded when confronted with > 2 tasks. > > I -think- this implies keeping a separate scaled CPU usage counter, > > where the scaling factor is a trivial exponential function of nice > > level where f(0) == 1. Then you schedule based on this scaled usage > > counter rather than unscaled. > > I also suspect we want to keep the exponential base small so that the > > maximal difference is 10x-100x. > > I'm already working with this as my assumed nice semantics (actually > something with a specific exponential base, suggested in other emails) > until others start saying they want something different and agree. Good. This has a couple nice mathematical properties, including "bounded unfairness" which I mentioned earlier. What base are you looking at? -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:39 ` Matt Mackall @ 2007-04-17 22:59 ` William Lee Irwin III 2007-04-17 22:57 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 22:59 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: >> I'm already working with this as my assumed nice semantics (actually >> something with a specific exponential base, suggested in other emails) >> until others start saying they want something different and agree. On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote: > Good. This has a couple nice mathematical properties, including > "bounded unfairness" which I mentioned earlier. What base are you > looking at? I'm working with the following suggestion: On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > That value has the property that a nice=10 task gets 1/10th the cpu of a > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > would be fairly easy to explain to admins and users so that they can > know what to expect from nicing tasks. I'm not likely to write the testcase until this upcoming weekend, though. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
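For reference, a throwaway program to check the property claimed for that base; it contains nothing beyond the arithmetic in the quoted suggestion.

#include <math.h>
#include <stdio.h>

int main(void)
{
        double base = exp(log(10.0) / 10.0);    /* ~1.2589 */
        int nice;

        /* Relative CPU weight of a task at each nice level vs nice 0. */
        for (nice = -20; nice <= 20; nice += 10)
                printf("nice %3d: weight %8.2f\n", nice, pow(base, -nice));

        printf("nice 10 vs nice 0: %.2f\n", pow(base, -10));   /* 0.10 */
        printf("nice 20 vs nice 0: %.2f\n", pow(base, -20));   /* 0.01 */
        return 0;
}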
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:59 ` William Lee Irwin III @ 2007-04-17 22:57 ` Matt Mackall 2007-04-18 4:29 ` William Lee Irwin III 2007-04-18 7:29 ` James Bruce 0 siblings, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 22:57 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: > >> I'm already working with this as my assumed nice semantics (actually > >> something with a specific exponential base, suggested in other emails) > >> until others start saying they want something different and agree. > > On Tue, Apr 17, 2007 at 05:39:09PM -0500, Matt Mackall wrote: > > Good. This has a couple nice mathematical properties, including > > "bounded unfairness" which I mentioned earlier. What base are you > > looking at? > > I'm working with the following suggestion: > > > On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: > > Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 > > That value has the property that a nice=10 task gets 1/10th the cpu of a > > nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that > > would be fairly easy to explain to admins and users so that they can > > know what to expect from nicing tasks. > > I'm not likely to write the testcase until this upcoming weekend, though. So that means there's a 10000:1 ratio between nice 20 and nice -19. In that sort of dynamic range, you're likely to have non-trivial numerical accuracy issues in integer/fixed-point math. (Especially if your clock is jiffies-scale, which a significant number of machines will continue to be.) I really think if we want to have vastly different ratios, we probably want to be looking at BATCH and RT scheduling classes instead. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
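To put a number on the dynamic-range worry above, here is a back-of-the-envelope sketch. The 32-bit counter, HZ=1000 and the 10000:1 weight spread are assumptions chosen to match the figures in the mail, not anyone's actual implementation; the point is simply how much headroom the range costs.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t max_weight = 10000;    /* ~nice -19 vs nice 20 */
        uint32_t hz = 1000;             /* jiffies-resolution clock */

        /* Charging scaled usage as (weight * jiffies), with the heaviest
         * weight at 10000: a 32-bit counter wraps within minutes. */
        uint32_t horizon = UINT32_MAX / (max_weight * hz);
        printf("32-bit counter wraps after ~%u s (~%u min)\n",
               horizon, horizon / 60);

        /* Normalising the other way round (heaviest task charged 1 per
         * jiffy) just moves the problem: the lightest task's charge of
         * 1/10000 per jiffy rounds to zero in integer math. */
        printf("light-end charge per jiffy, integer math: %u\n",
               1 / max_weight);
        return 0;
}

Either way the 10000:1 spread eats roughly 13-14 bits of precision, which is why 64-bit counters or extra fixed-point fraction bits come into the picture on jiffies-scale clocks.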
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:57 ` Matt Mackall @ 2007-04-18 4:29 ` William Lee Irwin III 2007-04-18 4:42 ` Davide Libenzi 2007-04-18 7:29 ` James Bruce 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 4:29 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Davide Libenzi, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: >>> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 >>> That value has the property that a nice=10 task gets 1/10th the cpu of a >>> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that >>> would be fairly easy to explain to admins and users so that they can >>> know what to expect from nicing tasks. On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: >> I'm not likely to write the testcase until this upcoming weekend, though. On Tue, Apr 17, 2007 at 05:57:23PM -0500, Matt Mackall wrote: > So that means there's a 10000:1 ratio between nice 20 and nice -19. In > that sort of dynamic range, you're likely to have non-trivial > numerical accuracy issues in integer/fixed-point math. > (Especially if your clock is jiffies-scale, which a significant number > of machines will continue to be.) > I really think if we want to have vastly different ratios, we probably > want to be looking at BATCH and RT scheduling classes instead. 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and even 1000**(1/39.0) ~= 1.19378 still seems weak. I suspect that in order to get low nice numbers strong enough without making high nice numbers too strong something sub-exponential may need to be used. Maybe just picking percentages outright as opposed to some particular function. We may also be better off defining it in terms of a share weighting as opposed to two tasks in competition. In such a manner the extension to N tasks is more automatic. f(n) would be a univariate function of nice numbers and two tasks in competition with nice numbers n_1 and n_2 would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In the exponential case f(n) = K*e**(r*n) this ends up as 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for other choices it's not so. f(n) = n+K for K >= 20 results in a share weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n when n <= 0 is highly plausible. An exponent or an additive constant may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21, and the ratio of shares is 420, which is still arithmeticaly feasible. -10 vs. 0 and 0 vs. 10 are both 10:1. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
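The piecewise weighting sketched above is easy to play with numerically. The following throwaway program just evaluates f(n) and the resulting two-task shares exactly as defined in the mail; the names are arbitrary.

#include <stdio.h>

/* f(n) = 1 - n for n <= 0, f(n) = 1/(n + 1) for n >= 0. */
static double f(int nice)
{
        return nice <= 0 ? 1.0 - nice : 1.0 / (nice + 1);
}

/* Two tasks in competition get shares proportional to f(nice). */
static void share(int n1, int n2)
{
        double w1 = f(n1), w2 = f(n2);

        printf("nice %3d vs nice %3d: %5.1f%% / %4.1f%% (ratio %.0f:1)\n",
               n1, n2, 100.0 * w1 / (w1 + w2), 100.0 * w2 / (w1 + w2),
               w1 / w2);
}

int main(void)
{
        share(-19, 20);         /* f(-19) = 20, f(20) = 1/21 */
        share(-10, 0);          /* f(-10) = 11, f(0) = 1 */
        share(0, 10);           /* f(0) = 1, f(10) = 1/11 */
        return 0;
}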
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 4:29 ` William Lee Irwin III @ 2007-04-18 4:42 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-18 4:42 UTC (permalink / raw) To: William Lee Irwin III Cc: Matt Mackall, Ingo Molnar, Nick Piggin, Peter Williams, Mike Galbraith, Con Kolivas, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, William Lee Irwin III wrote: > 100**(1/39.0) ~= 1.12534 may do if so, but it seems a little weak, and > even 1000**(1/39.0) ~= 1.19378 still seems weak. > > I suspect that in order to get low nice numbers strong enough without > making high nice numbers too strong something sub-exponential may need > to be used. Maybe just picking percentages outright as opposed to some > particular function. > > We may also be better off defining it in terms of a share weighting as > opposed to two tasks in competition. In such a manner the extension to > N tasks is more automatic. f(n) would be a univariate function of nice > numbers and two tasks in competition with nice numbers n_1 and n_2 > would get shares f(n_1)/(f(n_1)+f(n_2)) and f(n_2)/(f(n_1)+f(n_2)). In > the exponential case f(n) = K*e**(r*n) this ends up as > 1/(1+e**(r*(n_2-n_1))) which is indeed a function of n_1-n_2 but for > other choices it's not so. f(n) = n+K for K >= 20 results in a share > weighting of (n_1+K,n_2+K)/(n_1+n_2+2*K), which is not entirely clear > in its impact. My guess is that f(n)=1/(n+1) when n >= 0 and f(n)=1-n > when n <= 0 is highly plausible. An exponent or an additive constant > may be worthwhile to throw in. In this case, f(-19) = 20, f(20) = 1/21, > and the ratio of shares is 420, which is still arithmeticaly feasible. > -10 vs. 0 and 0 vs. 10 are both 10:1. This makes more sense, and the ratio at the extremes is something reasonable. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 22:57 ` Matt Mackall 2007-04-18 4:29 ` William Lee Irwin III @ 2007-04-18 7:29 ` James Bruce 1 sibling, 0 replies; 304+ messages in thread From: James Bruce @ 2007-04-18 7:29 UTC (permalink / raw) To: linux-kernel; +Cc: ck Matt Mackall wrote: > On Tue, Apr 17, 2007 at 03:59:02PM -0700, William Lee Irwin III wrote: >> On Tue, Apr 17, 2007 at 03:32:56PM -0700, William Lee Irwin III wrote: >> I'm working with the following suggestion: >> >> On Tue, Apr 17, 2007 at 09:07:49AM -0400, James Bruce wrote: >>> Nonlinear is a must IMO. I would suggest X = exp(ln(10)/10) ~= 1.2589 >>> That value has the property that a nice=10 task gets 1/10th the cpu of a >>> nice=0 task, and a nice=20 task gets 1/100 of nice=0. I think that >>> would be fairly easy to explain to admins and users so that they can >>> know what to expect from nicing tasks. >> I'm not likely to write the testcase until this upcoming weekend, though. > > So that means there's a 10000:1 ratio between nice 20 and nice -19. In > that sort of dynamic range, you're likely to have non-trivial > numerical accuracy issues in integer/fixed-point math. Well, you *are* specifying vastly different priorities. The question is how many nice=20 tasks should it take to interfere with a nice=-19 task? If you've only got a 100:1 ratio, 100 nice=20 tasks will take ~50% of the CPU away from a nice=-19 task. I don't think that's ideal, as in my mind a -19 task shouldn't have to care how many nice=20 tasks there are (within reason). IMHO, if a user is running a CPU hog at nice=-19, and expecting nice=20 tasks to run immediately, I don't think the scheduler is the problem. > (Especially if your clock is jiffies-scale, which a significant number > of machines will continue to be.) > > I really think if we want to have vastly different ratios, we probably > want to be looking at BATCH and RT scheduling classes instead. I, like all users, can live with anything, but there should be a clear specification of what the user should expect. Magic changes in the function at nice=0, or no real clear meaning at all (mainline), are both things that don't help the users to figure that out. I like the exponential base because shifting all tasks up or down one nice level does not change the relative cpu distribution (i.e. two tasks {nice=-5,nice=0} get the same relative cpu distribution as if they were {nice=0,nice=5}. An exponential base is the only way that property can hold. Now, perhaps implementation issues may prevent something like the "1.2589" ratio rule from being realized, but I'm not sure we should throw it out _before_ we know that it's actually a problem. This is the same sort of resistance that the timekeeping code updates faces (using nanoseconds everywhere instead of "natural" clock bases), but that got addressed eventually. - Jim Bruce ^ permalink raw reply [flat|nested] 304+ messages in thread
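The shift-invariance argument above is easy to verify numerically. This small check assumes the exp(ln(10)/10) base quoted earlier and per-task weights proportional to base^(-nice); with any such exponential weighting, only nice differences affect the split.

#include <math.h>
#include <stdio.h>

static double w(double base, int nice)
{
        return pow(base, -nice);
}

/* Print how two competing CPU hogs split the CPU under weight w(). */
static void split(double base, int n1, int n2)
{
        double w1 = w(base, n1), w2 = w(base, n2);

        printf("base %.4f, nice {%d, %d}: %.2f%% / %.2f%%\n",
               base, n1, n2,
               100.0 * w1 / (w1 + w2), 100.0 * w2 / (w1 + w2));
}

int main(void)
{
        double base = exp(log(10.0) / 10.0);    /* ~1.2589, as suggested */

        split(base, -5, 0);     /* prints the same split as ... */
        split(base, 0, 5);      /* ... this one */
        return 0;
}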
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:50 ` Davide Libenzi 2007-04-17 7:09 ` William Lee Irwin III @ 2007-04-17 7:11 ` Nick Piggin 2007-04-17 7:21 ` Davide Libenzi 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:11 UTC (permalink / raw) To: Davide Libenzi Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 11:50:03PM -0700, Davide Libenzi wrote: > On Tue, 17 Apr 2007, Nick Piggin wrote: > > > > All things are not equal; they all have different properties. I like > > > > Exactly. So we have to explore those properties and evaluate performance > > (in all meanings of the word). That's only logical. > > I had a quick look at Ingo's code yesterday. Ingo is always smart to > prepare a main dish (feature) with a nice sider (code cleanup) to Linus ;) > And even this code does that pretty nicely. The deadline designs looks > good, although I think the final "key" calculation code will end up quite > different from what it looks now. > I would suggest to thoroughly test all your alternatives before deciding. > Some code and design may look very good and small at the beginning, but > when you start patching it to cover all the dark spots, you effectively > end up with another thing (in both design and code footprint). > About O(1), I never thought it was a must (besides a good marketing > material), and O(log(N)) *may* be just fine (to be verified, of course). To be clear, I'm not saying O(logN) itself is a big problem. Type plot [10:100] x with lines, log(x) with lines, 1 with lines into gnuplot. I was just trying to point out that we need to evalute things. Considering how long we've had this scheduler with its known deficiencies, let's pick a new one wisely. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:11 ` Nick Piggin @ 2007-04-17 7:21 ` Davide Libenzi 0 siblings, 0 replies; 304+ messages in thread From: Davide Libenzi @ 2007-04-17 7:21 UTC (permalink / raw) To: Nick Piggin Cc: William Lee Irwin III, Peter Williams, Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, Linux Kernel Mailing List, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, 17 Apr 2007, Nick Piggin wrote: > To be clear, I'm not saying O(logN) itself is a big problem. Type > > plot [10:100] x with lines, log(x) with lines, 1 with lines Haha, Nick, I know what a log() looks like :) The Time Ring I posted as an example (which is nothing other than a ring-based bucket sort) keeps O(1) if you can concede some timer clustering. - Davide ^ permalink raw reply [flat|nested] 304+ messages in thread
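For readers unfamiliar with the structure Davide refers to, here is a rough user-space sketch of a ring of deadline buckets. The slot width, ring size and names are invented for illustration and are not Davide's posted code; the point is that insertion hashes straight into a slot (O(1)), at the price of treating everything that lands in one slot as equivalent (the "timer clustering" trade-off).

#include <stdio.h>
#include <stddef.h>

#define RING_SLOTS      256             /* power of two */
#define SLOT_NS         1000000ULL      /* 1 ms of deadline per slot */

struct ring_entry {
        unsigned long long deadline;
        struct ring_entry *next;
};

struct time_ring {
        struct ring_entry *slot[RING_SLOTS];
        unsigned int hand;      /* slot the "clock hand" points at */
};

/* O(1): hash the deadline into a slot and push onto that bucket. */
static void ring_insert(struct time_ring *ring, struct ring_entry *e)
{
        unsigned int idx = (e->deadline / SLOT_NS) & (RING_SLOTS - 1);

        e->next = ring->slot[idx];
        ring->slot[idx] = e;
}

/* Advance the hand to the next non-empty bucket; the scan is bounded by
 * the ring size, not by the number of queued entries. */
static struct ring_entry *ring_pop(struct time_ring *ring)
{
        unsigned int i;

        for (i = 0; i < RING_SLOTS; i++) {
                unsigned int idx = (ring->hand + i) & (RING_SLOTS - 1);
                struct ring_entry *e = ring->slot[idx];

                if (e) {
                        ring->slot[idx] = e->next;
                        ring->hand = idx;
                        return e;
                }
        }
        return NULL;    /* ring empty */
}

int main(void)
{
        static struct time_ring ring;
        struct ring_entry a = { .deadline = 3 * SLOT_NS };
        struct ring_entry b = { .deadline = 1 * SLOT_NS };

        ring_insert(&ring, &a);
        ring_insert(&ring, &b);
        printf("first due: %llu ns\n", ring_pop(&ring)->deadline);
        return 0;
}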
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:29 ` Nick Piggin 2007-04-17 5:53 ` Willy Tarreau 2007-04-17 6:09 ` William Lee Irwin III @ 2007-04-17 6:23 ` Peter Williams 2007-04-17 6:44 ` Nick Piggin 2007-04-17 8:44 ` Ingo Molnar 2 siblings, 2 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 6:23 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 02:17:22PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> On Tue, Apr 17, 2007 at 04:29:01AM +0200, Mike Galbraith wrote: >>>> On Tue, 2007-04-17 at 10:06 +1000, Peter Williams wrote: >>>>> Mike Galbraith wrote: >>>>>> Demystify what? The casual observer need only read either your attempt >>>>>> at writing a scheduler, or my attempts at fixing the one we have, to see >>>>>> that it was high time for someone with the necessary skills to step in. >>>>> Make that "someone with the necessary clout". >>>> No, I was brutally honest to both of us, but quite correct. >>>> >>>>>> Now progress can happen, which was _not_ happening before. >>>>>> >>>>> This is true. >>>> Yup, and progress _is_ happening now, quite rapidly. >>> Progress as in progress on Ingo's scheduler. I still don't know how we'd >>> decide when to replace the mainline scheduler or with what. >>> >>> I don't think we can say Ingo's is better than the alternatives, can we? >>> If there is some kind of bakeoff, then I'd like one of Con's designs to >>> be involved, and mine, and Peter's... >> I myself was thinking of this as the chance for a much needed >> simplification of the scheduling code and if this can be done with the >> result being "reasonable" it then gives us the basis on which to propose >> improvements based on the ideas of others such as you mention. >> >> As the size of the cpusched indicates, trying to evaluate alternative >> proposals based on the current O(1) scheduler is fraught. Hopefully, > > I don't know why. The problem is that you can't really evaluate good > proposals by looking at the code (you can say that one is bad, ie. the > current one, which has a huge amount of temporal complexity and is > explicitly unfair), but it is pretty hard to say one behaves well. I meant that it's indicative of the amount of work that you have to do to implement a new scheduling discipline for evaluation. > > And my scheduler for example cuts down the amount of policy code and > code size significantly. Yours is one of the smaller patches mainly because you perpetuate (or you did in the last one I looked at) the (horrible to my eyes) dual array (active/expired) mechanism. That this idea was bad should have been apparent to all as soon as the decision was made to excuse some tasks from being moved from the active array to the expired array. This essentially meant that there would be circumstances where extreme unfairness (to the extent of starvation in some cases) could occur -- the very things that the mechanism was originally designed to prevent (as far as I can gather). Right about then in the development of the O(1) scheduler alternative solutions should have been sought. Other hints that it was a bad idea were the need to transfer time slices between children and parents during fork() and exit().
This disregard for the dual array mechanism has prevented me from looking at the rest of your scheduler in any great detail so I can't comment on any other ideas that may be in there. > I haven't looked at Con's ones for a while, > but I believe they are also much more straightforward than mainline... I like Con's scheduler (partly because it uses a single array) but mainly because it's nice and simple. However, his earlier schedulers were prone to starvation (admittedly, only if you went out of your way to make it happen) and I tried to convince him to use the anti-starvation mechanism in my SPA schedulers but was unsuccessful. I haven't looked at his latest scheduler that sparked all this furore so can't comment on it. > > For example, let's say all else is equal between them, then why would > we go with the O(logN) implementation rather than the O(1)? In the highly unlikely event that you can't separate them on technical grounds, Occam's razor recommends choosing the simplest solution. :-) To digress, my main concern is that load balancing is being lumped in with this new change. It's becoming "accept this big lump of new code or nothing". I'd rather see a good fix to the intra runqueue/CPU scheduler problem implemented first and then, if there really are any outstanding problems with the load balancer, attack them later. Them all being mixed up together gives me a nasty deja vu of impending disaster. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:23 ` Peter Williams @ 2007-04-17 6:44 ` Nick Piggin 2007-04-17 7:48 ` Peter Williams 2007-04-17 8:44 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:44 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >And my scheduler for example cuts down the amount of policy code and > >code size significantly. > > Yours is one of the smaller patches mainly because you perpetuate (or > you did in the last one I looked at) the (horrible to my eyes) dual > array (active/expired) mechanism. Actually, I wasn't comparing with other out of tree schedulers (but it is good to know mine is among the smaller ones). I was comparing with the mainline scheduler, which also has the dual arrays. > That this idea was bad should have > been apparent to all as soon as the decision was made to excuse some > tasks from being moved from the active array to the expired array. This My patch doesn't implement any such excusing. > essentially meant that there would be circumstances where extreme > unfairness (to the extent of starvation in some cases) -- the very > things that the mechanism was originally designed to ensure (as far as I > can gather). Right about then in the development of the O(1) scheduler > alternative solutions should have been sought. Fairness has always been my first priority, and I consider it a bug if it is possible for any process to get more CPU time than a CPU hog over the long term. Or over another task doing the same thing, for that matter. > Other hints that it was a bad idea was the need to transfer time slices > between children and parents during fork() and exit(). I don't see how that has anything to do with dual arrays. If you put a new child at the back of the queue, then your various interactive shell commands that typically do a lot of dependant forking get slowed right down behind your compile job. If you give a new child its own timeslice irrespective of the parent, then you have things like 'make' (which doesn't use a lot of CPU time) spawning off lots of high priority children. You need to do _something_ (Ingo's does). I don't see why this would be tied with a dual array. FWIW, mine doesn't do anything on exit() like most others, but it may need more tuning in this area. > This disregard for the dual array mechanism has prevented me from > looking at the rest of your scheduler in any great detail so I can't > comment on any other ideas that may be in there. Well I wasn't really asking you to review it. As I said, everyone has their own idea of what a good design does, and review can't really distinguish between the better of two reasonable designs. A fair evaluation of the alternatives seems like a good idea though. Nobody is actually against this, are they? > >I haven't looked at Con's ones for a while, > >but I believe they are also much more straightforward than mainline... > > I like Con's scheduler (partly because it uses a single array) but > mainly because it's nice and simple. However, his earlier schedulers > were prone to starvation (admittedly, only if you went out of your way > to make it happen) and I tried to convince him to use the anti > starvation mechanism in my SPA schedulers but was unsuccessful. 
I > haven't looked at his latest scheduler that sparked all this furore so > can't comment on it. I agree starvation or unfairness is unacceptable for a new scheduler. > >For example, let's say all else is equal between them, then why would > >we go with the O(logN) implementation rather than the O(1)? > > In the highly unlikely event that you can't separate them on technical > grounds, Occam's razor recommends choosing the simplest solution. :-) O(logN) vs O(1) is technical grounds. But yeah, see my earlier comment: simplicity would be a factor too. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:44 ` Nick Piggin @ 2007-04-17 7:48 ` Peter Williams 2007-04-17 7:56 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 7:48 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 04:23:37PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> And my scheduler for example cuts down the amount of policy code and >>> code size significantly. >> Yours is one of the smaller patches mainly because you perpetuate (or >> you did in the last one I looked at) the (horrible to my eyes) dual >> array (active/expired) mechanism. > > Actually, I wasn't comparing with other out of tree schedulers (but it > is good to know mine is among the smaller ones). I was comparing with > the mainline scheduler, which also has the dual arrays. > > >> That this idea was bad should have >> been apparent to all as soon as the decision was made to excuse some >> tasks from being moved from the active array to the expired array. This > > My patch doesn't implement any such excusing. > > >> essentially meant that there would be circumstances where extreme >> unfairness (to the extent of starvation in some cases) -- the very >> things that the mechanism was originally designed to ensure (as far as I >> can gather). Right about then in the development of the O(1) scheduler >> alternative solutions should have been sought. > > Fairness has always been my first priority, and I consider it a bug > if it is possible for any process to get more CPU time than a CPU hog > over the long term. Or over another task doing the same thing, for > that matter. > > >> Other hints that it was a bad idea was the need to transfer time slices >> between children and parents during fork() and exit(). > > I don't see how that has anything to do with dual arrays. It's totally to do with the dual arrays. The only real purpose of the time slice in O(1) (regardless of what its perceived purpose was) was to control the switching between the arrays. > If you put > a new child at the back of the queue, then your various interactive > shell commands that typically do a lot of dependant forking get slowed > right down behind your compile job. If you give a new child its own > timeslice irrespective of the parent, then you have things like 'make' > (which doesn't use a lot of CPU time) spawning off lots of high > priority children. This is an artefact of trying to control nice using time slices while using them for controlling array switching and whatever else they were being used for. Priority (static and dynamic) is the the best way to implement nice. > > You need to do _something_ (Ingo's does). I don't see why this would > be tied with a dual array. FWIW, mine doesn't do anything on exit() > like most others, but it may need more tuning in this area. > > >> This disregard for the dual array mechanism has prevented me from >> looking at the rest of your scheduler in any great detail so I can't >> comment on any other ideas that may be in there. > > Well I wasn't really asking you to review it. As I said, everyone > has their own idea of what a good design does, and review can't really > distinguish between the better of two reasonable designs. > > A fair evaluation of the alternatives seems like a good idea though. 
> Nobody is actually against this, are they? No. It would be nice if the basic ideas that each scheduler tries to implement could be extracted and explained though. This could lead to a melding of ideas that leads to something quite good. > > >>> I haven't looked at Con's ones for a while, >>> but I believe they are also much more straightforward than mainline... >> I like Con's scheduler (partly because it uses a single array) but >> mainly because it's nice and simple. However, his earlier schedulers >> were prone to starvation (admittedly, only if you went out of your way >> to make it happen) and I tried to convince him to use the anti >> starvation mechanism in my SPA schedulers but was unsuccessful. I >> haven't looked at his latest scheduler that sparked all this furore so >> can't comment on it. > > I agree starvation or unfairness is unacceptable for a new scheduler. > > >>> For example, let's say all else is equal between them, then why would >>> we go with the O(logN) implementation rather than the O(1)? >> In the highly unlikely event that you can't separate them on technical >> grounds, Occam's razor recommends choosing the simplest solution. :-) > > O(logN) vs O(1) is technical grounds. In that case I'd go O(1) provided that the k factor for the O(1) wasn't greater than O(logN)'s k factor multiplied by logMaxN. > > But yeah, see my earlier comment: simplicity would be a factor too. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:48 ` Peter Williams @ 2007-04-17 7:56 ` Nick Piggin 2007-04-17 13:16 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 7:56 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >>Other hints that it was a bad idea was the need to transfer time slices > >>between children and parents during fork() and exit(). > > > >I don't see how that has anything to do with dual arrays. > > It's totally to do with the dual arrays. The only real purpose of the > time slice in O(1) (regardless of what its perceived purpose was) was to > control the switching between the arrays. The O(1) design is pretty convoluted in that regard. In my scheduler, the only purpose of the arrays is to renew time slices. The fork/exit logic is added to make interactivity better. Ingo's scheduler has similar equivalent logic. > >If you put > >a new child at the back of the queue, then your various interactive > >shell commands that typically do a lot of dependant forking get slowed > >right down behind your compile job. If you give a new child its own > >timeslice irrespective of the parent, then you have things like 'make' > >(which doesn't use a lot of CPU time) spawning off lots of high > >priority children. > > This is an artefact of trying to control nice using time slices while > using them for controlling array switching and whatever else they were > being used for. Priority (static and dynamic) is the the best way to > implement nice. I don't like the timeslice based nice in mainline. It's too nasty with latencies. nicksched is far better in that regard IMO. But I don't know how you can assert a particular way is the best way to do something. > >You need to do _something_ (Ingo's does). I don't see why this would > >be tied with a dual array. FWIW, mine doesn't do anything on exit() > >like most others, but it may need more tuning in this area. > > > > > >>This disregard for the dual array mechanism has prevented me from > >>looking at the rest of your scheduler in any great detail so I can't > >>comment on any other ideas that may be in there. > > > >Well I wasn't really asking you to review it. As I said, everyone > >has their own idea of what a good design does, and review can't really > >distinguish between the better of two reasonable designs. > > > >A fair evaluation of the alternatives seems like a good idea though. > >Nobody is actually against this, are they? > > No. It would be nice if the basic ideas that each scheduler tries to > implement could be extracted and explained though. This could lead to a > melding of ideas that leads to something quite good. > > > > > > >>>I haven't looked at Con's ones for a while, > >>>but I believe they are also much more straightforward than mainline... > >>I like Con's scheduler (partly because it uses a single array) but > >>mainly because it's nice and simple. However, his earlier schedulers > >>were prone to starvation (admittedly, only if you went out of your way > >>to make it happen) and I tried to convince him to use the anti > >>starvation mechanism in my SPA schedulers but was unsuccessful. I > >>haven't looked at his latest scheduler that sparked all this furore so > >>can't comment on it. 
> > > >I agree starvation or unfairness is unacceptable for a new scheduler. > > > > > >>>For example, let's say all else is equal between them, then why would > >>>we go with the O(logN) implementation rather than the O(1)? > >>In the highly unlikely event that you can't separate them on technical > >>grounds, Occam's razor recommends choosing the simplest solution. :-) > > > >O(logN) vs O(1) is technical grounds. > > In that case I'd go O(1) provided that the k factor for the O(1) wasn't > greater than O(logN)'s k factor multiplied by logMaxN. Yes, or even significantly greater around typical large sizes of N. ^ permalink raw reply [flat|nested] 304+ messages in thread
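A quick sketch of the break-even arithmetic Peter and Nick settle on above: prefer the O(1) design only while its constant cost stays below the O(log N) design's constant multiplied by log2 of the largest N of interest. The constants and the threads-max style figure below are illustrative assumptions, not measurements.

#include <math.h>
#include <stdio.h>

/* cost(O(1) pick) ~= k_o1;  cost(O(log N) pick) ~= k_logn * log2(N) */
static int prefer_o1(double k_o1, double k_logn, double n_max)
{
	return k_o1 <= k_logn * log2(n_max);
}

int main(void)
{
	double k_o1   = 120.0;	/* assumed constant cost of the O(1) pick */
	double k_logn = 25.0;	/* assumed per-level cost of the tree pick */
	double n_max  = 32768;	/* e.g. a threads-max style bound for N */

	printf("prefer O(1): %s\n",
	       prefer_o1(k_o1, k_logn, n_max) ? "yes" : "no");
	return 0;
}

With these made-up numbers the O(log N) design wins; flip the constants and the answer flips, which is the point about typical large sizes of N.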
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Nick Piggin @ 2007-04-17 13:16 ` Peter Williams 2007-04-18 4:46 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 13:16 UTC (permalink / raw) To: Nick Piggin Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 05:48:55PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>>> Other hints that it was a bad idea was the need to transfer time slices >>>> between children and parents during fork() and exit(). >>> I don't see how that has anything to do with dual arrays. >> It's totally to do with the dual arrays. The only real purpose of the >> time slice in O(1) (regardless of what its perceived purpose was) was to >> control the switching between the arrays. > > The O(1) design is pretty convoluted in that regard. In my scheduler, > the only purpose of the arrays is to renew time slices. > > The fork/exit logic is added to make interactivity better. Ingo's > scheduler has similar equivalent logic. > > >>> If you put >>> a new child at the back of the queue, then your various interactive >>> shell commands that typically do a lot of dependant forking get slowed >>> right down behind your compile job. If you give a new child its own >>> timeslice irrespective of the parent, then you have things like 'make' >>> (which doesn't use a lot of CPU time) spawning off lots of high >>> priority children. >> This is an artefact of trying to control nice using time slices while >> using them for controlling array switching and whatever else they were >> being used for. Priority (static and dynamic) is the the best way to >> implement nice. > > I don't like the timeslice based nice in mainline. It's too nasty > with latencies. nicksched is far better in that regard IMO. > > But I don't know how you can assert a particular way is the best way > to do something. I should have added "I may be wrong but I think that ...". My opinion is based on a lot of experience with different types of scheduler design and the observation from gathering scheduling statistics while playing with these schedulers that the size of the time slices we're talking about is much larger than the CPU chunks most tasks use in any one go so time slice size has no real effect on most tasks and the faster CPUs become the more this becomes true. > > >>> You need to do _something_ (Ingo's does). I don't see why this would >>> be tied with a dual array. FWIW, mine doesn't do anything on exit() >>> like most others, but it may need more tuning in this area. >>> >>> >>>> This disregard for the dual array mechanism has prevented me from >>>> looking at the rest of your scheduler in any great detail so I can't >>>> comment on any other ideas that may be in there. >>> Well I wasn't really asking you to review it. As I said, everyone >>> has their own idea of what a good design does, and review can't really >>> distinguish between the better of two reasonable designs. >>> >>> A fair evaluation of the alternatives seems like a good idea though. >>> Nobody is actually against this, are they? >> No. It would be nice if the basic ideas that each scheduler tries to >> implement could be extracted and explained though. This could lead to a >> melding of ideas that leads to something quite good. 
>> >>> >>>>> I haven't looked at Con's ones for a while, >>>>> but I believe they are also much more straightforward than mainline... >>>> I like Con's scheduler (partly because it uses a single array) but >>>> mainly because it's nice and simple. However, his earlier schedulers >>>> were prone to starvation (admittedly, only if you went out of your way >>>> to make it happen) and I tried to convince him to use the anti >>>> starvation mechanism in my SPA schedulers but was unsuccessful. I >>>> haven't looked at his latest scheduler that sparked all this furore so >>>> can't comment on it. >>> I agree starvation or unfairness is unacceptable for a new scheduler. >>> >>> >>>>> For example, let's say all else is equal between them, then why would >>>>> we go with the O(logN) implementation rather than the O(1)? >>>> In the highly unlikely event that you can't separate them on technical >>>> grounds, Occam's razor recommends choosing the simplest solution. :-) >>> O(logN) vs O(1) is technical grounds. >> In that case I'd go O(1) provided that the k factor for the O(1) wasn't >> greater than O(logN)'s k factor multiplied by logMaxN. > > Yes, or even significantly greater around typical large sizes of N. Yes. In fact its' probably better to use the maximum number of threads allowed on the system for N. We know that value don't we? Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:16 ` Peter Williams @ 2007-04-18 4:46 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 4:46 UTC (permalink / raw) To: Peter Williams Cc: Mike Galbraith, Con Kolivas, Ingo Molnar, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 11:16:54PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >I don't like the timeslice based nice in mainline. It's too nasty > >with latencies. nicksched is far better in that regard IMO. > > > >But I don't know how you can assert a particular way is the best way > >to do something. > > I should have added "I may be wrong but I think that ...". > > My opinion is based on a lot of experience with different types of > scheduler design and the observation from gathering scheduling > statistics while playing with these schedulers that the size of the time > slices we're talking about is much larger than the CPU chunks most tasks > use in any one go so time slice size has no real effect on most tasks > and the faster CPUs become the more this becomes true. For desktop loads, maybe. But for things that are compute bound, the cost of context switching I believe still gets worse as CPUs continue to be able to execute more instructions per cycle, get clocked faster, and get larger caches. > >>In that case I'd go O(1) provided that the k factor for the O(1) wasn't > >>greater than O(logN)'s k factor multiplied by logMaxN. > > > >Yes, or even significantly greater around typical large sizes of N. > > Yes. In fact its' probably better to use the maximum number of threads > allowed on the system for N. We know that value don't we? Well we might be able to work it out by looking at the tunables or amount of kernel memory available, but I guess it is hard to just pick a number. I'll try running a few more benchmarks. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:23 ` Peter Williams 2007-04-17 6:44 ` Nick Piggin @ 2007-04-17 8:44 ` Ingo Molnar 2007-04-19 2:20 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 8:44 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner * Peter Williams <pwil3058@bigpond.net.au> wrote: > > And my scheduler for example cuts down the amount of policy code and > > code size significantly. > > Yours is one of the smaller patches mainly because you perpetuate (or > you did in the last one I looked at) the (horrible to my eyes) dual > array (active/expired) mechanism. That this idea was bad should have > been apparent to all as soon as the decision was made to excuse some > tasks from being moved from the active array to the expired array. > This essentially meant that there would be circumstances where extreme > unfairness (to the extent of starvation in some cases) -- the very > things that the mechanism was originally designed to ensure (as far as > I can gather). Right about then in the development of the O(1) > scheduler alternative solutions should have been sought. in hindsight i'd agree. But back then we were clearly not ready for fine-grained accurate statistics + trees (cpus are alot faster at more complex arithmetics today, plus people still believed that low-res can be done well enough), and taking out any of these two concepts from CFS would result in a similarly complex runqueue implementation. Also, the array switch was just thought to be of another piece of 'if the heuristics go wrong, we fall back to an array switch' logic, right in line with the other heuristics. And you have to accept it, mainline's ability to auto-renice make -j jobs (and other CPU hogs) was quite a plus for developers, so it had (and probably still has) quite some inertia. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
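For readers who have not stared at the O(1) code, a much-simplified sketch of the active/expired dual-array scheme being criticised above; the structure and function names are assumptions for illustration, not the actual kernel code.

#define NUM_PRIO 140

struct task {
	int prio;
	int time_slice;
	struct task *next;		/* per-priority singly linked list */
};

struct prio_array {
	struct task *queue[NUM_PRIO];
};

struct runqueue {
	struct prio_array arrays[2];
	struct prio_array *active, *expired;
};

static void enqueue(struct prio_array *a, struct task *t)
{
	t->next = a->queue[t->prio];
	a->queue[t->prio] = t;
}

/* A task that exhausts its slice gets a fresh one and is parked on the
 * expired array instead of going straight back onto the active one. */
void tick_expire(struct runqueue *rq, struct task *t, int new_slice)
{
	t->time_slice = new_slice;
	enqueue(rq->expired, t);
}

/* The O(1) "array switch": only when the active array has run dry do the
 * two pointers swap.  This is where the fairness guarantee is meant to
 * come from -- and why exempting some tasks from expiry undermines it. */
void array_switch(struct runqueue *rq)
{
	struct prio_array *tmp = rq->active;

	rq->active  = rq->expired;
	rq->expired = tmp;
}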
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:44 ` Ingo Molnar @ 2007-04-19 2:20 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-19 2:20 UTC (permalink / raw) To: Ingo Molnar Cc: Nick Piggin, Mike Galbraith, Con Kolivas, ck list, Bill Huey, linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven, Thomas Gleixner Ingo Molnar wrote: > * Peter Williams <pwil3058@bigpond.net.au> wrote: > >>> And my scheduler for example cuts down the amount of policy code and >>> code size significantly. >> Yours is one of the smaller patches mainly because you perpetuate (or >> you did in the last one I looked at) the (horrible to my eyes) dual >> array (active/expired) mechanism. That this idea was bad should have >> been apparent to all as soon as the decision was made to excuse some >> tasks from being moved from the active array to the expired array. >> This essentially meant that there would be circumstances where extreme >> unfairness (to the extent of starvation in some cases) -- the very >> things that the mechanism was originally designed to ensure (as far as >> I can gather). Right about then in the development of the O(1) >> scheduler alternative solutions should have been sought. > > in hindsight i'd agree. Hindsight's a wonderful place isn't it :-) and, of course, it's where I was making my comments from. > But back then we were clearly not ready for > fine-grained accurate statistics + trees (cpus are alot faster at more > complex arithmetics today, plus people still believed that low-res can > be done well enough), and taking out any of these two concepts from CFS > would result in a similarly complex runqueue implementation. I disagree. The single priority array with a promotion mechanism that I use in the SPA schedulers can do the job of avoiding starvation with no measurable increase in the overhead. Fairness, nice, good interactive responsiveness can then be managed by how you determine tasks' dynamic priorities. > Also, the > array switch was just thought to be of another piece of 'if the > heuristics go wrong, we fall back to an array switch' logic, right in > line with the other heuristics. And you have to accept it, mainline's > ability to auto-renice make -j jobs (and other CPU hogs) was quite a > plus for developers, so it had (and probably still has) quite some > inertia. I agree, it wasn't totally useless especially for the average user. My main problem with it was that the effect of "nice" wasn't consistent or predictable enough for reliable resource allocation. I also agree with the aims of the various heuristics i.e. you have to be unfair and give some tasks preferential treatment in order to give the users the type of responsiveness that they want. It's just a shame that it got broken in the process but as you say it's easier to see these things in hindsight than in the middle of the melee. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
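A minimal sketch of the "single priority array plus promotion" anti-starvation idea Peter describes above, assuming a fixed promotion interval; the names and numbers are placeholders for illustration, not the SPA code.

#include <stddef.h>

#define NUM_PRIO	40
#define PROMOTE_TICKS	100	/* assumed promotion interval */

struct task {
	int prio;
	struct task *next;
};

struct runqueue {
	struct task *queue[NUM_PRIO];	/* index 0 = highest priority */
	unsigned long ticks;
};

/* Every PROMOTE_TICKS ticks, shift each non-empty list one slot towards
 * the head of the array, so long-waiting low-priority tasks drift upward
 * one level at a time instead of starving behind CPU hogs. */
void maybe_promote(struct runqueue *rq)
{
	int p;

	if (++rq->ticks % PROMOTE_TICKS)
		return;

	for (p = 1; p < NUM_PRIO; p++) {
		struct task *list = rq->queue[p], *t;

		if (!list)
			continue;
		rq->queue[p] = NULL;
		if (!rq->queue[p - 1]) {
			rq->queue[p - 1] = list;
		} else {
			for (t = rq->queue[p - 1]; t->next; t = t->next)
				;
			t->next = list;
		}
	}
}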
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 3:27 ` Con Kolivas 2007-04-15 5:16 ` Bill Huey 2007-04-15 6:43 ` Mike Galbraith @ 2007-04-15 15:05 ` Ingo Molnar 2007-04-15 20:05 ` Matt Mackall 2007-04-16 5:16 ` Con Kolivas 2 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 15:05 UTC (permalink / raw) To: Con Kolivas Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Con Kolivas <kernel@kolivas.org> wrote: [ i'm quoting this bit out of order: ] > 2. Since then I've been thinking/working on a cpu scheduler design > that takes away all the guesswork out of scheduling and gives very > predictable, as fair as possible, cpu distribution and latency while > preserving as solid interactivity as possible within those confines. yeah. I think you were right on target with this call. I've applied the sched.c change attached at the bottom of this mail to the CFS patch, if you dont mind. (or feel free to suggest some other text instead.) > 1. I tried in vain some time ago to push a working extensable > pluggable cpu scheduler framework (based on wli's work) for the linux > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he > didn't like it) as being absolutely the wrong approach and that we > should never do that. [...] i partially replied to that point to Will already, and i'd like to make it clear again: yes, i rejected plugsched 2-3 years ago (which already drifted away from wli's original codebase) and i would still reject it today. First and foremost, please dont take such rejections too personally - i had my own share of rejections (and in fact, as i mentioned it in a previous mail, i had a fair number of complete project throwaways: 4g:4g, in-kernel Tux, irqrate and many others). I know that they can hurt and can demoralize, but if i dont like something it's my job to tell that. Can i sum up your argument as: "you rejected plugsched, but then why on earth did you modularize portions of the scheduler in CFS? Isnt your position thus woefully inconsistent?" (i'm sure you would never put it this impolitely though, but i guess i can flame myself with impunity ;) While having an inconsistent position isnt a terminal sin in itself, please realize that the scheduler classes code in CFS is quite different from plugsched: it was a result of what i saw to be technological pressure for _internal modularization_. (This internal/policy modularization aspect is something that Will said was present in his original plugsched code, but which aspect i didnt see in the plugsched patches that i reviewed.) That possibility never even occured to me to until 3 days ago. You never raised it either AFAIK. No patches to simplify the scheduler that way were ever sent. Plugsched doesnt even touch the core load-balancer for example, and most of the time i spent with the modularization was to get the load-balancing details right. So it's really apples to oranges. 
My view about plugsched: first please take a look at the latest plugsched code: http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch 26 files changed, 8951 insertions(+), 1495 deletions(-) As an experiment i've removed all the add-on schedulers (both the core and the include files, only kept the vanilla one) from the plugsched patch (and the makefile and kconfig complications, etc), to see the 'infrastructure cost', and it still gave: 12 files changed, 1933 insertions(+), 1479 deletions(-) that's the extra complication i didnt like 3 years ago and which i still dont like today. What the current plugsched code does is that it simplifies the adding of new experimental schedulers, but it doesnt really do what i wanted: to simplify the _scheduler itself_. Personally i'm still not primarily interested in having a large selection of schedulers, i'm mainly interested in a good and maintainable scheduler that works for people. so the rejection was on these grounds, and i still very much stand by that position here and today: i didnt want to see the Linux scheduler landscape balkanized and i saw no technological reasons for the complication that external modularization brings. the new scheding classes code in the CFS patch was not a result of "oh, i want to write a new scheduler, lets make schedulers pluggable" kind of thinking. That result was just a side-effect of it. (and as you correctly noted it, the CFS related modularization is incomplete). Btw., the thing that triggered the scheduling classes code wasnt even plugsched or RSDL/SD, it was Mike's patches. Mike had an itch and he fixed it within the framework of the existing scheduler, and the end result behaved quite well when i threw various testloads on it. But i felt a bit uncomfortable that it added another few hundred lines of code to an already complex sched.c. This felt unnatural so i mailed Mike that i'd attempt to clean these infrastructure aspects of sched.c up a bit so that it becomes more hackable to him. Thus 3 days ago, without having made up my mind about anything, i started this experiment (which ended up in the modularization and in the CFS scheduler) to simplify the code and to enable Mike to fix such itches in an easier way. By your logic Mike should in fact be quite upset about this: if the new code works out and proves to be useful then it obsoletes a whole lot of code of him! > For weeks now, Ingo has said that the interactivity regressions were > showstoppers and we should address them, never mind the fact that the > so-called regressions were purely "it slows down linearly with load" > which to me is perfectly desirable behaviour. [...] yes. For me the first thing when considering a large scheduler patch is: "does a patch do what it claims" and "does it work". If those goals are met (and if it's a complete scheduler i actually try it quite extensively) then i look at the code cleanliness issues. Mike's patch was the first one that seemed to meet that threshold in my own humble testing, and CFS was a direct result of that. note that i tried the same workloads with CFS and while it wasnt as good as mainline, it handled them better than SD. Mike reported the same, and Mark Lord (who too reported SD interactivity problems) reported success yesterday too. (but .. CFS is a mere 2 days old so we cannot really tell anything with certainty yet.) > [...] However at one stage I virtually begged for support with my > attempts and help with the code. 
Dmitry Adamushko is the only person > who actually helped me with the code in the interim, while others > poked sticks at it. Sure the sticks helped at times but the sticks > always seemed to have their ends kerosene doused and flaming for > reasons I still don't get. No other help was forthcoming. i'm really sorry you got that impression. in 2004 i had a good look at the staircase scheduler and said: http://www.uwsg.iu.edu/hypermail/linux/kernel/0408.0/1146.html "But in general i'm quite positive about the staircase scheduler." and even tested it and gave you feedback: http://lwn.net/Articles/96562/ i think i even told Andrew that i dont really like pluggable schedulers and if there's any replacement for the current scheduler then that would be a full replacement, and it would be the staircase scheduler. Hey, i told this to you as recently as 1 month ago as well: http://lkml.org/lkml/2007/3/8/54 "cool! I like this even more than i liked your original staircase scheduler from 2 years ago :)" Ingo -----------> Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -16,6 +16,7 @@ * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. * 2004-04-02 Scheduler domains code by Nick Piggin + * 2007-04-15 Con Kolivas was dead right: fairness matters! :) */ ^ permalink raw reply [flat|nested] 304+ messages in thread
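Roughly the shape of the "scheduling classes" internal modularization being argued about in this subthread: the core dispatches through a small per-policy table of hooks rather than hard-coding one policy. The hook names below are an approximation for illustration and are not lifted from the CFS patch.

#include <stddef.h>

struct rq;
struct task_struct;

struct sched_class {
	void (*enqueue_task)(struct rq *rq, struct task_struct *p);
	void (*dequeue_task)(struct rq *rq, struct task_struct *p);
	struct task_struct *(*pick_next_task)(struct rq *rq);
	void (*task_tick)(struct rq *rq, struct task_struct *p);
};

/* The core asks each class in a fixed priority order (e.g. realtime
 * before the fair class); the first class with a runnable task wins. */
struct task_struct *core_pick_next(struct rq *rq,
				   const struct sched_class **classes,
				   int nr_classes)
{
	int i;

	for (i = 0; i < nr_classes; i++) {
		struct task_struct *p = classes[i]->pick_next_task(rq);

		if (p)
			return p;
	}
	return NULL;	/* nothing runnable: idle */
}

The distinction being drawn above is that such a table lives inside the core for the core's own cleanliness, rather than being an externally selectable scheduler in the plugsched sense.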
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:05 ` Ingo Molnar @ 2007-04-15 20:05 ` Matt Mackall 2007-04-15 20:48 ` Ingo Molnar 2007-04-16 5:16 ` Con Kolivas 1 sibling, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-15 20:05 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 05:05:36PM +0200, Ingo Molnar wrote: > so the rejection was on these grounds, and i still very much stand by > that position here and today: i didnt want to see the Linux scheduler > landscape balkanized and i saw no technological reasons for the > complication that external modularization brings. But "balkanization" is a good thing. "Monoculture" is a bad thing. Look at what happened with I/O scheduling. Opening things up to some new ideas by making it possible to select your I/O scheduler took us from 10 years of stagnation to healthy, competitive development, which gave us a substantially better I/O scheduler. Look at what's happening right now with TCP congestion algorithms. We've had decades of tweaking Reno slightly now turned into a vibrant research area with lots of radical alternatives. A winner will eventually emerge and it will probably look quite a bit different than Reno. Similar things have gone on since the beginning with filesystems on Linux. Being able to easily compare filesystems head to head has been immensely valuable in improving our 'core' Linux filesystems. And what we've had up to now is a scheduler monoculture. Until Andrew put RSDL in -mm, if people wanted to experiment with other schedulers, they had to go well off the beaten path to do it. So all the people who've been hopelessy frustrated with the mainline scheduler go off to the -ck ghetto, or worse, stick with 2.4. Whether your motivations have been protectionist or merely shortsighted, you've stomped pretty heavily on alternative scheduler development by completely rejecting the whole plugsched concept. If we'd opened up mainline to a variety of schedulers _3 years ago_, we'd probably have gotten to where we are today much sooner. Hopefully, the next time Rik suggests pluggable page replacement algorithms, folks will actually seriously consider it. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 20:05 ` Matt Mackall @ 2007-04-15 20:48 ` Ingo Molnar 2007-04-15 21:31 ` Matt Mackall 2007-04-15 23:39 ` William Lee Irwin III 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 20:48 UTC (permalink / raw) To: Matt Mackall Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Matt Mackall <mpm@selenic.com> wrote: > Look at what happened with I/O scheduling. Opening things up to some > new ideas by making it possible to select your I/O scheduler took us > from 10 years of stagnation to healthy, competitive development, which > gave us a substantially better I/O scheduler. actually, 2-3 years ago we already had IO schedulers, and my opinion against plugsched back then (also shared by Nick and Linus) was very much considering them. There are at least 4 reasons why I/O schedulers are different from CPU schedulers: 1) CPUs are a non-persistent resource shared by _all_ tasks and workloads in the system. Disks are _persistent_ resources very much attached to specific workloads. (If tasks had to be 'persistent' to the CPU they were started on we'd have much different scheduling technology, and there would be much less complexity.) More analogous to CPU schedulers would perhaps be VM/MM schedulers, and those tend to be hard to modularize in a technologically sane way too. (and unlike disks there's no good generic way to attach VM/MM schedulers to particular workloads.) So it's apples to oranges. in practice it comes down to having one good scheduler that runs all workloads on a system reasonably well. And given that a very large portion of system runs mixed workloads, the demand for one good scheduler is pretty high. While i can run with mixed IO schedulers just fine. 2) plugsched did not allow on the fly selection of schedulers, nor did it allow a per CPU selection of schedulers. IO schedulers you can change per disk, on the fly, making them much more useful in practice. Also, IO schedulers (while definitely not being slow!) are alot less performance sensitive than CPU schedulers. 3) I/O schedulers are pretty damn clean code, and plugsched, at least the last version i saw of it, didnt come even close. 4) the good thing that happened to I/O, after years of stagnation isnt I/O schedulers. The good thing that happened to I/O is called Jens Axboe. If you care about the I/O subystem then print that name out and hang it on the wall. That and only that is what mattered. all in one, while there are definitely uses (embedded would like to have a smaller/different scheduler, etc.), the technical case for modularization for the sake of selectability is alot lower for CPU schedulers than it is for I/O schedulers. nor was the non-modularity of some piece of code ever an impediment to competition. May i remind you of the pretty competitive SLAB allocator landscape, resulting in things like the SLOB allocator, written by yourself? ;-) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
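The per-disk, on-the-fly I/O scheduler switching contrasted with plugsched above is a single sysfs write; a minimal sketch (the device name is an assumption, and the write needs root):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");

	if (!f) {
		perror("scheduler attribute");
		return 1;
	}
	fputs("cfq\n", f);	/* select CFQ for this one disk, at runtime */
	return fclose(f) ? 1 : 0;
}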
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 20:48 ` Ingo Molnar @ 2007-04-15 21:31 ` Matt Mackall 2007-04-16 3:03 ` Nick Piggin 2007-04-16 15:45 ` William Lee Irwin III 2007-04-15 23:39 ` William Lee Irwin III 1 sibling, 2 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-15 21:31 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > > * Matt Mackall <mpm@selenic.com> wrote: > > > Look at what happened with I/O scheduling. Opening things up to some > > new ideas by making it possible to select your I/O scheduler took us > > from 10 years of stagnation to healthy, competitive development, which > > gave us a substantially better I/O scheduler. > > actually, 2-3 years ago we already had IO schedulers, and my opinion > against plugsched back then (also shared by Nick and Linus) was very > much considering them. There are at least 4 reasons why I/O schedulers > are different from CPU schedulers: ... > 3) I/O schedulers are pretty damn clean code, and plugsched, at least > the last version i saw of it, didnt come even close. That's irrelevant. Plugsched was an attempt to get alternative schedulers exposure in mainline. I know, because I remember encouraging Bill to pursue it. Not only did you veto plugsched (which may have been a perfectly reasonable thing to do), but you also vetoed the whole concept of multiple schedulers in the tree too. "We don't want to balkanize the scheduling landscape". And that latter part is what I'm claiming has set us back for years. It's not a technical argument but a strategic one. And it's just not a good strategy. > 4) the good thing that happened to I/O, after years of stagnation isnt > I/O schedulers. The good thing that happened to I/O is called Jens > Axboe. If you care about the I/O subystem then print that name out > and hang it on the wall. That and only that is what mattered. Disagree. Things didn't actually get interesting until Nick showed up with AS and got it in-tree to demonstrate the huge amount of room we had for improvement. It took several iterations of AS and CFQ (with a couple complete rewrites) before CFQ began to look like the winner. The resulting time-sliced CFQ was fairly heavily influenced by the ideas in AS. Similarly, things in scheduler land had been pretty damn boring until Con finally got Andrew to take one of his schedulers for a spin. > nor was the non-modularity of some piece of code ever an impediment to > competition. May i remind you of the pretty competitive SLAB allocator > landscape, resulting in things like the SLOB allocator, written by > yourself? ;-) Thankfully no one came out and said "we don't want to balkanize the allocator landscape" when I submitted it or I probably would have just dropped it, rather than painfully dragging it along out of tree for years. I'm not nearly the glutton for punishment that Con is. :-P -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 21:31 ` Matt Mackall @ 2007-04-16 3:03 ` Nick Piggin 2007-04-16 14:28 ` Matt Mackall 2007-04-16 15:45 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-16 3:03 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote: > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > > > 4) the good thing that happened to I/O, after years of stagnation isnt > > I/O schedulers. The good thing that happened to I/O is called Jens > > Axboe. If you care about the I/O subystem then print that name out > > and hang it on the wall. That and only that is what mattered. > > Disagree. Things didn't actually get interesting until Nick showed up > with AS and got it in-tree to demonstrate the huge amount of room we > had for improvement. It took several iterations of AS and CFQ (with a > couple complete rewrites) before CFQ began to look like the winner. > The resulting time-sliced CFQ was fairly heavily influenced by the > ideas in AS. Well to be fair, Jens had just implemented deadline, which got me interested ;) Actually, I would still like to be able to deprecate deadline for AS, because AS has a tunable that you can switch to turn off read anticipation and revert to deadline behaviour (or very close to). It would have been nice if CFQ were then a layer on top of AS that implemented priorities (or vice versa). And then AS could be deprecated and we'd be back to 1 primary scheduler. Well CFQ seems to be going in the right direction with that, however some large users still find AS faster for some reason... Anyway, moral of the story is that I think it would have been nice if we hadn't proliferated IO schedulers, however in practice it isn't easy to just layer features on top of each other, and also keeping deadline helped a lot to be able to debug and examine performance regressions and actually get code upstream. And this was true even when it was globally boottine switchable only. I'd prefer if we kept a single CPU scheduler in mainline, because I think that simplifies analysis and focuses testing. I think we can have one that is good enough for everyone. But if the only other option for progress is that Linus or Andrew just pull one out of a hat, then I would rather merge all of them. Yes I think Con's scheduler should get a fair go, ditto for Ingo's, mine, and anyone else's. > > nor was the non-modularity of some piece of code ever an impediment to > > competition. May i remind you of the pretty competitive SLAB allocator > > landscape, resulting in things like the SLOB allocator, written by > > yourself? ;-) > > Thankfully no one came out and said "we don't want to balkanize the > allocator landscape" when I submitted it or I probably would have just > dropped it, rather than painfully dragging it along out of tree for > years. I'm not nearly the glutton for punishment that Con is. :-P I don't think this is a fault of the people or the code involved. We just didn't have much collective drive to replace the scheduler, and even less an idea of how to decide between any two of them. I've kept nicksched around since 2003 or so and no hard feelings ;) ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:03 ` Nick Piggin @ 2007-04-16 14:28 ` Matt Mackall 2007-04-17 3:31 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Matt Mackall @ 2007-04-16 14:28 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > I'd prefer if we kept a single CPU scheduler in mainline, because I > think that simplifies analysis and focuses testing. I think you'll find something like 80-90% of the testing will be done on the default choice, even if other choices exist. So you really won't have much of a problem here. But when the only choice for other schedulers is to go out-of-tree, then only 1% of the people will try it out and those people are guaranteed to be the ones who saw scheduling problems in mainline. So the alternative won't end up getting any testing on many of the workloads that work fine in mainstream so their feedback won't tell you very much at all. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 14:28 ` Matt Mackall @ 2007-04-17 3:31 ` Nick Piggin 2007-04-17 17:35 ` Matt Mackall 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 3:31 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote: > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > > I'd prefer if we kept a single CPU scheduler in mainline, because I > > think that simplifies analysis and focuses testing. > > I think you'll find something like 80-90% of the testing will be done > on the default choice, even if other choices exist. So you really > won't have much of a problem here. > > But when the only choice for other schedulers is to go out-of-tree, > then only 1% of the people will try it out and those people are > guaranteed to be the ones who saw scheduling problems in mainline. > So the alternative won't end up getting any testing on many of the > workloads that work fine in mainstream so their feedback won't tell > you very much at all. Yeah I concede that perhaps it is the only way to get things going any further. But how do we decide if and when the current scheduler should be demoted from default, and which should replace it? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:31 ` Nick Piggin @ 2007-04-17 17:35 ` Matt Mackall 0 siblings, 0 replies; 304+ messages in thread From: Matt Mackall @ 2007-04-17 17:35 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 05:31:20AM +0200, Nick Piggin wrote: > On Mon, Apr 16, 2007 at 09:28:24AM -0500, Matt Mackall wrote: > > On Mon, Apr 16, 2007 at 05:03:49AM +0200, Nick Piggin wrote: > > > I'd prefer if we kept a single CPU scheduler in mainline, because I > > > think that simplifies analysis and focuses testing. > > > > I think you'll find something like 80-90% of the testing will be done > > on the default choice, even if other choices exist. So you really > > won't have much of a problem here. > > > > But when the only choice for other schedulers is to go out-of-tree, > > then only 1% of the people will try it out and those people are > > guaranteed to be the ones who saw scheduling problems in mainline. > > So the alternative won't end up getting any testing on many of the > > workloads that work fine in mainstream so their feedback won't tell > > you very much at all. > > Yeah I concede that perhaps it is the only way to get things going > any further. But how do we decide if and when the current scheduler > should be demoted from default, and which should replace it? Step one is ship both in -mm. If that doesn't give us enough confidence, ship both in mainline. If that doesn't give us enough confidence, wait until vendors ship both. Eventually a clear picture should emerge. If it doesn't, either the change is not significant or no one cares. But it really is important to be able to do controlled experiments on this stuff with little effort. That's the recipe for getting lots of valid feedback. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 21:31 ` Matt Mackall 2007-04-16 3:03 ` Nick Piggin @ 2007-04-16 15:45 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 15:45 UTC (permalink / raw) To: Matt Mackall Cc: Ingo Molnar, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 04:31:54PM -0500, Matt Mackall wrote: > That's irrelevant. Plugsched was an attempt to get alternative > schedulers exposure in mainline. I know, because I remember > encouraging Bill to pursue it. Not only did you veto plugsched (which > may have been a perfectly reasonable thing to do), but you also vetoed > the whole concept of multiple schedulers in the tree too. "We don't > want to balkanize the scheduling landscape". > And that latter part is what I'm claiming has set us back for years. > It's not a technical argument but a strategic one. And it's just not a > good strategy. [... excellent post trimmed...] These are some rather powerful arguments. I think I'll actually start looking at plugsched again. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 20:48 ` Ingo Molnar 2007-04-15 21:31 ` Matt Mackall @ 2007-04-15 23:39 ` William Lee Irwin III 2007-04-16 1:06 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-15 23:39 UTC (permalink / raw) To: Ingo Molnar Cc: Matt Mackall, Con Kolivas, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > 2) plugsched did not allow on the fly selection of schedulers, nor did > it allow a per CPU selection of schedulers. IO schedulers you can > change per disk, on the fly, making them much more useful in > practice. Also, IO schedulers (while definitely not being slow!) are > alot less performance sensitive than CPU schedulers. One of the reasons I never posted my own code is that it never met its own design goals, which absolutely included switching on the fly. I think Peter Williams may have done something about that. It was my hope to be able to do insmod sched_foo.ko until it became clear that the effort it was intended to assist wasn't going to get even the limited hardware access required, at which point I largely stopped working on it. On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: > 3) I/O schedulers are pretty damn clean code, and plugsched, at least > the last version i saw of it, didnt come even close. I'm not sure what happened there. It wasn't a big enough patch to take hits in this area due to getting overwhelmed by the programming burden like some other efforts of mine. Maybe things started getting ugly once on-the-fly switching entered the picture. My guess is that Peter Williams will have to chime in here, since things have diverged enough from my one-time contribution 4 years ago. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:39 ` William Lee Irwin III @ 2007-04-16 1:06 ` Peter Williams 2007-04-16 3:04 ` William Lee Irwin III 2007-04-16 17:22 ` Chris Friesen 0 siblings, 2 replies; 304+ messages in thread From: Peter Williams @ 2007-04-16 1:06 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: >> 2) plugsched did not allow on the fly selection of schedulers, nor did >> it allow a per CPU selection of schedulers. IO schedulers you can >> change per disk, on the fly, making them much more useful in >> practice. Also, IO schedulers (while definitely not being slow!) are >> alot less performance sensitive than CPU schedulers. > > One of the reasons I never posted my own code is that it never met its > own design goals, which absolutely included switching on the fly. I > think Peter Williams may have done something about that. I didn't but some students did. In a previous life, I did implement a runtime configurable CPU scheduling mechanism (implemented on True64, Solaris and Linux) that allowed schedulers to be loaded as modules at run time. This was released commercially on True64 and Solaris. So I know that it can be done. I have thought about doing something similar for the SPA schedulers which differ in only small ways from each other but lack motivation. > It was my hope > to be able to do insmod sched_foo.ko until it became clear that the > effort it was intended to assist wasn't going to get even the limited > hardware access required, at which point I largely stopped working on > it. > > > On Sun, Apr 15, 2007 at 10:48:24PM +0200, Ingo Molnar wrote: >> 3) I/O schedulers are pretty damn clean code, and plugsched, at least >> the last version i saw of it, didnt come even close. > > I'm not sure what happened there. It wasn't a big enough patch to take > hits in this area due to getting overwhelmed by the programming burden > like some other efforts of mine. Maybe things started getting ugly once > on-the-fly switching entered the picture. My guess is that Peter Williams > will have to chime in here, since things have diverged enough from my > one-time contribution 4 years ago. From my POV, the current version of plugsched is considerably simpler than it was when I took the code over from Con as I put considerable effort into minimizing code overlap in the various schedulers. I also put considerable effort into minimizing any changes to the load balancing code (something Ingo seems to think is a deficiency) and the result is that plugsched allows "intra run queue" scheduling to be easily modified WITHOUT effecting load balancing. To my mind scheduling and load balancing are orthogonal and keeping them that way simplifies things. As Ingo correctly points out, plugsched does not allow different schedulers to be used per CPU but it would not be difficult to modify it so that they could. Although I've considered doing this over the years I decided not to as it would just increase the complexity and the amount of work required to keep the patch set going. 
About six months ago I decided to reduce the amount of work I was doing on plugsched (as it was obviously never going to be accepted) and now only publish patches against the vanilla kernel's major releases (and the only reason that I kept doing that is that the download figures indicated that about 80 users were interested in the experiment). Peter PS I no longer read LKML (due to time constraints) and would appreciate it if I could be CC'd on any e-mails suggesting scheduler changes. PPS I'm just happy to see that Ingo has finally accepted that the vanilla scheduler was badly in need of fixing and don't really care who fixes it. PPS Different schedulers for different aims (i.e. server or work station) do make a difference. E.g. the spa_svr scheduler in plugsched does about 1% better on kernbench than the next best scheduler in the bunch. PPPS Con, fairness isn't always best as humans aren't very altruistic and we need to give unfair preference to interactive tasks in order to stop the users flinging their PCs out the window. But the current scheduler doesn't do this very well and is also not very good at fairness so needs to change. But the changes need to address interactive response and fairness not just fairness. -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 1:06 ` Peter Williams @ 2007-04-16 3:04 ` William Lee Irwin III 2007-04-16 5:09 ` Peter Williams 2007-04-16 17:22 ` Chris Friesen 1 sibling, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 3:04 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> One of the reasons I never posted my own code is that it never met its >> own design goals, which absolutely included switching on the fly. I >> think Peter Williams may have done something about that. >> It was my hope >> to be able to do insmod sched_foo.ko until it became clear that the >> effort it was intended to assist wasn't going to get even the limited >> hardware access required, at which point I largely stopped working on >> it. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > I didn't but some students did. > In a previous life, I did implement a runtime configurable CPU > scheduling mechanism (implemented on True64, Solaris and Linux) that > allowed schedulers to be loaded as modules at run time. This was > released commercially on True64 and Solaris. So I know that it can be done. > I have thought about doing something similar for the SPA schedulers > which differ in only small ways from each other but lack motivation. Driver models for scheduling are not so far out. AFAICS it's largely a tug-of-war over design goals, e.g. maintaining per-cpu runqueues and switching out intra-queue policies vs. switching out whole-system policies, SMP handling and all. Whether this involves load balancing depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x scheduler module, for instance, would not have a load balancer at all, as it has only one global runqueue. There are other sorts of policies wanting significant changes to SMP handling vs. the stock load balancing. William Lee Irwin III wrote: >> I'm not sure what happened there. It wasn't a big enough patch to take >> hits in this area due to getting overwhelmed by the programming burden >> like some other efforts of mine. Maybe things started getting ugly once >> on-the-fly switching entered the picture. My guess is that Peter Williams >> will have to chime in here, since things have diverged enough from my >> one-time contribution 4 years ago. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > From my POV, the current version of plugsched is considerably simpler > than it was when I took the code over from Con as I put considerable > effort into minimizing code overlap in the various schedulers. > I also put considerable effort into minimizing any changes to the load > balancing code (something Ingo seems to think is a deficiency) and the > result is that plugsched allows "intra run queue" scheduling to be > easily modified WITHOUT effecting load balancing. To my mind scheduling > and load balancing are orthogonal and keeping them that way simplifies > things. ISTR rearranging things for con in such a fashion that it no longer worked out of the box (though that wasn't the intention; restructuring it to be more suited to his purposes was) and that's what he worked off of afterward. I don't remember very well what changed there as I clearly invested less effort there than the prior versions. 
Now that I think of it, that may have been where the sample policy demonstrating scheduling classes was lost. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > As Ingo correctly points out, plugsched does not allow different > schedulers to be used per CPU but it would not be difficult to modify it > so that they could. Although I've considered doing this over the years > I decided not to as it would just increase the complexity and the amount > of work required to keep the patch set going. About six months ago I > decided to reduce the amount of work I was doing on plugsched (as it was > obviously never going to be accepted) and now only publish patches > against the vanilla kernel's major releases (and the only reason that I > kept doing that is that the download figures indicated that about 80 > users were interested in the experiment). That's a rather different goal from what I was going on about with it, so it's all diverged quite a bit. Where I had a significant need for mucking with the entire concept of how SMP was handled, this is rather different. At this point I'm questioning the relevance of my own work, though it was already relatively marginal as it started life as an attempt at a sort of debug patch to help gang scheduling (which is in itself a rather marginally relevant feature to most users) code along. On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: > PS I no longer read LKML (due to time constraints) and would appreciate > it if I could be CC'd on any e-mails suggesting scheduler changes. > PPS I'm just happy to see that Ingo has finally accepted that the > vanilla scheduler was badly in need of fixing and don't really care who > fixes it. > PPS Different schedulers for different aims (i.e. server or work > station) do make a difference. E.g. the spa_svr scheduler in plugsched > does about 1% better on kernbench than the next best scheduler in the bunch. > PPPS Con, fairness isn't always best as humans aren't very altruistic > and we need to give unfair preference to interactive tasks in order to > stop the users flinging their PCs out the window. But the current > scheduler doesn't do this very well and is also not very good at > fairness so needs to change. But the changes need to address > interactive response and fairness not just fairness. Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are better ones. I'd not bother citing kernel compile results. In any event, I'm not sure what to say about different schedulers for different aims. My intentions with plugsched were not centered around production usage or intra-queue policy. I'm relatively indifferent to the notion of having pluggable CPU schedulers, intra-queue or otherwise, in mainline. I don't see any particular harm in it, but neither am I particularly motivated to have it in. I had a rather strong sense of instrumentality about it, and since it became useless to me (at a conceptual level; the implementation was never finished ot the point of dynamic loading of scheduler modules) for assisting development on large systems via reboot avoidance by dint of it becoming clear that access to such was never going to happen, I've stopped looking at it. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 3:04 ` William Lee Irwin III @ 2007-04-16 5:09 ` Peter Williams 2007-04-16 11:04 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-16 5:09 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> One of the reasons I never posted my own code is that it never met its >>> own design goals, which absolutely included switching on the fly. I >>> think Peter Williams may have done something about that. >>> It was my hope >>> to be able to do insmod sched_foo.ko until it became clear that the >>> effort it was intended to assist wasn't going to get even the limited >>> hardware access required, at which point I largely stopped working on >>> it. > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> I didn't but some students did. >> In a previous life, I did implement a runtime configurable CPU >> scheduling mechanism (implemented on True64, Solaris and Linux) that >> allowed schedulers to be loaded as modules at run time. This was >> released commercially on True64 and Solaris. So I know that it can be done. >> I have thought about doing something similar for the SPA schedulers >> which differ in only small ways from each other but lack motivation. > > Driver models for scheduling are not so far out. AFAICS it's largely a > tug-of-war over design goals, e.g. maintaining per-cpu runqueues and > switching out intra-queue policies vs. switching out whole-system > policies, SMP handling and all. Whether this involves load balancing > depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x > scheduler module, for instance, would not have a load balancer at all, > as it has only one global runqueue. There are other sorts of policies > wanting significant changes to SMP handling vs. the stock load > balancing. Well a single run queue removes the need for load balancing but has scalability issues on large systems. Personally, I think something in between would be the best solution i.e. multiple run queues but more than one CPU per run queue. I think that this would be a particularly good solution to the problems introduced by hyper threading and multi core systems and also NUMA systems. E.g. if all CPUs in a hyper thread package are using the one queue then the case where one CPU is trying to run a high priority task and the other a low priority task (i.e. the cases that the sleeping dependent mechanism tried to address) is less likely to occur. By the way, I think that it's a very bad idea for the scheduling mechanism and the load balancing mechanism to be coupled. The anomalies that will be experienced and the attempts to make ad hoc fixes for them will lead to complexity spiralling out of control. > > > William Lee Irwin III wrote: >>> I'm not sure what happened there. It wasn't a big enough patch to take >>> hits in this area due to getting overwhelmed by the programming burden >>> like some other efforts of mine. Maybe things started getting ugly once >>> on-the-fly switching entered the picture. My guess is that Peter Williams >>> will have to chime in here, since things have diverged enough from my >>> one-time contribution 4 years ago. 
> > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> From my POV, the current version of plugsched is considerably simpler >> than it was when I took the code over from Con as I put considerable >> effort into minimizing code overlap in the various schedulers. >> I also put considerable effort into minimizing any changes to the load >> balancing code (something Ingo seems to think is a deficiency) and the >> result is that plugsched allows "intra run queue" scheduling to be >> easily modified WITHOUT effecting load balancing. To my mind scheduling >> and load balancing are orthogonal and keeping them that way simplifies >> things. > > ISTR rearranging things for con in such a fashion that it no longer > worked out of the box (though that wasn't the intention; restructuring it > to be more suited to his purposes was) and that's what he worked off of > afterward. I don't remember very well what changed there as I clearly > invested less effort there than the prior versions. Now that I think of > it, that may have been where the sample policy demonstrating scheduling > classes was lost. I can't comment here as (as far as I can recall) I never saw your code and only became involved when Con posted his version of cpusched. > > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> As Ingo correctly points out, plugsched does not allow different >> schedulers to be used per CPU but it would not be difficult to modify it >> so that they could. Although I've considered doing this over the years >> I decided not to as it would just increase the complexity and the amount >> of work required to keep the patch set going. About six months ago I >> decided to reduce the amount of work I was doing on plugsched (as it was >> obviously never going to be accepted) and now only publish patches >> against the vanilla kernel's major releases (and the only reason that I >> kept doing that is that the download figures indicated that about 80 >> users were interested in the experiment). > > That's a rather different goal from what I was going on about with it, > so it's all diverged quite a bit. Yes, pragmatic considerations dictated a change of tack. > Where I had a significant need for > mucking with the entire concept of how SMP was handled, this is rather > different. Yes, I went with the idea of intra run queue scheduling being orthogonal to load balancing for two reasons: 1. I think that coupling them is a bad idea from the complexity POV, and 2. it's enough of a battle fighting for modifications to one bit of the code without trying to do it to two simultaneously. > At this point I'm questioning the relevance of my own work, > though it was already relatively marginal as it started life as an > attempt at a sort of debug patch to help gang scheduling (which is in > itself a rather marginally relevant feature to most users) code along. The main commercial plug in scheduler used with the run time loadable module scheduler that I mentioned earlier did gang scheduling (at the insistence of the Tru64 kernel folks). As this scheduler was a hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" ("unfairly" really in according to an allocation policy) among higher level entities such as users, groups and applications as well as processes; it was fairly easy to make it a gang scheduler by modifying it to give all of a process's threads the same priority based on the process's CPU usage rather than different priorities based on the threads' usage rates. 
In fact, it would have been possible to select between gang and non gang on a per process basis if that was considered desirable. The fact that threads and processes are distinct entities on Tru64 and Solaris made this easier to do on them than on Linux. My experience with this scheduler leads me to believe that to achieve gang scheduling and fairness, etc. you need (usage) statistics based schedulers. > > > On Mon, Apr 16, 2007 at 11:06:56AM +1000, Peter Williams wrote: >> PS I no longer read LKML (due to time constraints) and would appreciate >> it if I could be CC'd on any e-mails suggesting scheduler changes. >> PPS I'm just happy to see that Ingo has finally accepted that the >> vanilla scheduler was badly in need of fixing and don't really care who >> fixes it. >> PPS Different schedulers for different aims (i.e. server or work >> station) do make a difference. E.g. the spa_svr scheduler in plugsched >> does about 1% better on kernbench than the next best scheduler in the bunch. >> PPPS Con, fairness isn't always best as humans aren't very altruistic >> and we need to give unfair preference to interactive tasks in order to >> stop the users flinging their PCs out the window. But the current >> scheduler doesn't do this very well and is also not very good at >> fairness so needs to change. But the changes need to address >> interactive response and fairness not just fairness. > > Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are > better ones. I'd not bother citing kernel compile results. spa_svr actually does its best work when the system isn't fully loaded as the type of improvement it strives to achieve (minimizing on queue wait time) hasn't got much room to manoeuvre when the system is fully loaded. Therefore, the fact that it's 1% better even in these circumstances is a good result and also indicates that the overhead for keeping the scheduling statistics it uses for its decision making is well spent. Especially, when you consider that the total available room for improvement on this benchmark is less than 3%. To elaborate, the motivation for this scheduler was acquired from the observation of scheduling statistics (in particular, on queue wait time) on systems running at about 30% to 50% load. Theoretically, at these load levels there should be no such waiting but the statistics show that there is considerable waiting (sometimes as high as 30% to 50%). I put this down to "lack of serendipity" e.g. everyone sleeping at the same time and then trying to run at the same time would be complete lack of serendipity. On the other hand, if everyone is synced then there would be total serendipity. Obviously, from the POV of a client, time the server task spends waiting on the queue adds to the response time for any request that has been made so reduction of this time on a server is a good thing(tm). Equally obviously, trying to achieve this synchronization by asking the tasks to cooperate with each other is not a feasible solution and some external influence needs to be exerted and this is what spa_svr does -- it nudges the scheduling order of the tasks in a way that makes them become well synced. Unfortunately, this is not a good scheduler for an interactive system as it minimizes the response times for ALL tasks (and the system as a whole) and this can result in increased response time for some interactive tasks (clunkiness) which annoys interactive users. 
When you start fiddling with this scheduler to bring back "interactive unfairness" you kill a lot of its superior low overall wait time performance. So this is why I think "horses for courses" schedulers are worth while. > > In any event, I'm not sure what to say about different schedulers for > different aims. My intentions with plugsched were not centered around > production usage or intra-queue policy. I'm relatively indifferent to > the notion of having pluggable CPU schedulers, intra-queue or otherwise, > in mainline. I don't see any particular harm in it, but neither am I > particularly motivated to have it in. If you look at the struct sched_spa_child in the file include/linux/sched_spa.h you'll see that the interface for switching between the various SPA schedulers is quite small and making them runtime switchable would be easy (I haven't done this in cpusched as I wanted to keep the same interface for switching schedulers for all schedulers: i.e. all run time switchable or none run time switchable; as the main aim of plugsched had become a mechanism for evaluating different intra queue scheduling designs.) > I had a rather strong sense of > instrumentality about it, and since it became useless to me (at a > conceptual level; the implementation was never finished ot the point of > dynamic loading of scheduler modules) for assisting development on > large systems via reboot avoidance by dint of it becoming clear that > access to such was never going to happen, I've stopped looking at it. I'll probably stop looking at this problem as well at least for the time being until all this new code has settled. Peter PS As I no longer read LKML, I haven't yet seen Ingo's or Con's or Nick's new schedulers yet so am unable to comment on their technical merits with respect to my comments above. -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
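The small switching interface Peter describes is easier to picture with a sketch. What follows is a hypothetical, userspace-only illustration of an intra-queue policy reduced to a tiny ops table, so that switching policies amounts to swapping a pointer while the surrounding code and the load balancer stay untouched. The struct and function names here are invented for the example and are not the actual sched_spa_child interface from plugsched.

/*
 * Hypothetical userspace sketch (not the real plugsched interface):
 * an intra-runqueue scheduling policy reduced to a small ops table.
 * Switching policies is then just a pointer swap; nothing else in the
 * "scheduler" or the load balancer needs to change.
 */
#include <stddef.h>
#include <stdio.h>

struct task { int prio; struct task *next; };
struct runqueue { struct task *head; };

struct policy_ops {
	const char *name;
	void (*enqueue)(struct runqueue *rq, struct task *t);
	struct task *(*pick_next)(struct runqueue *rq);
};

/* FIFO policy: append at the tail, always pick the head. */
static void fifo_enqueue(struct runqueue *rq, struct task *t)
{
	struct task **p = &rq->head;

	while (*p)
		p = &(*p)->next;
	t->next = NULL;
	*p = t;
}

static struct task *head_pick(struct runqueue *rq)
{
	struct task *t = rq->head;

	if (t)
		rq->head = t->next;
	return t;
}

/* Priority policy: keep the list sorted by priority on insert. */
static void prio_enqueue(struct runqueue *rq, struct task *t)
{
	struct task **p = &rq->head;

	while (*p && (*p)->prio <= t->prio)
		p = &(*p)->next;
	t->next = *p;
	*p = t;
}

static const struct policy_ops fifo_policy = { "fifo", fifo_enqueue, head_pick };
static const struct policy_ops prio_policy = { "prio", prio_enqueue, head_pick };

/* Runtime switch: swap this pointer (a real kernel needs locking here). */
static const struct policy_ops *active = &fifo_policy;

int main(void)
{
	struct runqueue rq = { NULL };
	struct task a = { .prio = 3 }, b = { .prio = 1 };

	active = &prio_policy;		/* switch scheduling discipline */
	active->enqueue(&rq, &a);
	active->enqueue(&rq, &b);
	printf("%s policy picks prio %d first\n",
	       active->name, active->pick_next(&rq)->prio);
	return 0;
}

The only point of the sketch is that, once the policy surface is this narrow, whether the choice is made at compile time, through /proc, or by loading a module becomes a packaging decision rather than a scheduling one.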
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:09 ` Peter Williams @ 2007-04-16 11:04 ` William Lee Irwin III 2007-04-16 12:55 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-16 11:04 UTC (permalink / raw) To: Peter Williams Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: >> Driver models for scheduling are not so far out. AFAICS it's largely a >> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and >> switching out intra-queue policies vs. switching out whole-system >> policies, SMP handling and all. Whether this involves load balancing >> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x >> scheduler module, for instance, would not have a load balancer at all, >> as it has only one global runqueue. There are other sorts of policies >> wanting significant changes to SMP handling vs. the stock load >> balancing. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Well a single run queue removes the need for load balancing but has > scalability issues on large systems. Personally, I think something in > between would be the best solution i.e. multiple run queues but more > than one CPU per run queue. I think that this would be a particularly > good solution to the problems introduced by hyper threading and multi > core systems and also NUMA systems. E.g. if all CPUs in a hyper thread > package are using the one queue then the case where one CPU is trying to > run a high priority task and the other a low priority task (i.e. the > cases that the sleeping dependent mechanism tried to address) is less > likely to occur. This wasn't meant to sing the praises of the 2.4.x scheduler; it was rather meant to point out that the 2.4.x scheduler, among others, is unimplementable within the framework if it assumes per-cpu runqueues. More plausibly useful single-queue schedulers would likely use a vastly different policy and attempt to carry out all queue manipulations via lockless operations. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > By the way, I think that it's a very bad idea for the scheduling > mechanism and the load balancing mechanism to be coupled. The anomalies > that will be experienced and the attempts to make ad hoc fixes for them > will lead to complexity spiralling out of control. This is clearly unavoidable in the case of gang scheduling. There is simply no other way to schedule N tasks which must all be run simultaneously when they run at all on N cpus of the system without such coupling and furthermore at an extremely intimate level, particularly when multiple competing gangs must be scheduled in such a fashion. William Lee Irwin III wrote: >> Where I had a significant need for >> mucking with the entire concept of how SMP was handled, this is rather >> different. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Yes, I went with the idea of intra run queue scheduling being orthogonal > to load balancing for two reasons: > 1. I think that coupling them is a bad idea from the complexity POV, and > 2. it's enough of a battle fighting for modifications to one bit of the > code without trying to do it to two simultaneously. As nice as that sounds, such a code structure would've precluded the entire raison d'etre of the patch, i.e. gang scheduling. 
William Lee Irwin III wrote: >> At this point I'm questioning the relevance of my own work, >> though it was already relatively marginal as it started life as an >> attempt at a sort of debug patch to help gang scheduling (which is in >> itself a rather marginally relevant feature to most users) code along. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > The main commercial plug in scheduler used with the run time loadable > module scheduler that I mentioned earlier did gang scheduling (at the > insistence of the Tru64 kernel folks). As this scheduler was a > hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" > ("unfairly" really in according to an allocation policy) among higher > level entities such as users, groups and applications as well as > processes; it was fairly easy to make it a gang scheduler by modifying > it to give all of a process's threads the same priority based on the > process's CPU usage rather than different priorities based on the > threads' usage rates. In fact, it would have been possible to select > between gang and non gang on a per process basis if that was considered > desirable. > The fact that threads and processes are distinct entities on Tru64 and > Solaris made this easier to do on them than on Linux. > My experience with this scheduler leads me to believe that to achieve > gang scheduling and fairness, etc. you need (usage) statistics based > schedulers. This does not appear to make sense unless it's based on an incorrect use of the term "gang scheduling." I'm referring to a gang as a set of tasks (typically restricted to threads of the same process) which must all be considered runnable or unrunnable simultaneously, and are for the sake of performance required to all actually be run simultaneously. This means a gang of N threads, when run, must run on N processors at once. A time and a set of processors must be chosen for any time interval where the gang is running. This interacts with load balancing by needing to choose the cpus to run the gang on, and also arranging for a set of cpus available for the gang to use to exist by means of migrating tasks off the chosen cpus. William Lee Irwin III wrote: >> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are >> better ones. I'd not bother citing kernel compile results. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > spa_svr actually does its best work when the system isn't fully loaded > as the type of improvement it strives to achieve (minimizing on queue > wait time) hasn't got much room to manoeuvre when the system is fully > loaded. Therefore, the fact that it's 1% better even in these > circumstances is a good result and also indicates that the overhead for > keeping the scheduling statistics it uses for its decision making is > well spent. Especially, when you consider that the total available room > for improvement on this benchmark is less than 3%. None of these benchmarks require the system to be fully loaded. They are, on the other hand, vastly more realistic simulated workloads than kernel compiles, and furthermore are actually developed as benchmarks, with in some cases even measurements of variance, iteration to convergence, and similar such things that make them actually scientific. 
On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > To elaborate, the motivation for this scheduler was acquired from the > observation of scheduling statistics (in particular, on queue wait time) > on systems running at about 30% to 50% load. Theoretically, at these > load levels there should be no such waiting but the statistics show that > there is considerable waiting (sometimes as high as 30% to 50%). I put > this down to "lack of serendipity" e.g. everyone sleeping at the same > time and then trying to run at the same time would be complete lack of > serendipity. On the other hand, if everyone is synced then there would > be total serendipity. > Obviously, from the POV of a client, time the server task spends waiting > on the queue adds to the response time for any request that has been > made so reduction of this time on a server is a good thing(tm). Equally > obviously, trying to achieve this synchronization by asking the tasks to > cooperate with each other is not a feasible solution and some external > influence needs to be exerted and this is what spa_svr does -- it nudges > the scheduling order of the tasks in a way that makes them become well > synced. This all sounds like a relatively good idea. So it's good for throughput vs. latency or otherwise not particularly interactive. No big deal, just use it where it makes sense. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > Unfortunately, this is not a good scheduler for an interactive system as > it minimizes the response times for ALL tasks (and the system as a > whole) and this can result in increased response time for some > interactive tasks (clunkiness) which annoys interactive users. When you > start fiddling with this scheduler to bring back "interactive > unfairness" you kill a lot of its superior low overall wait time > performance. > So this is why I think "horses for courses" schedulers are worth while. I have no particular objection to using an appropriate scheduler for the system's workload. I also have little or no preference as to how that's accomplished overall. But I really think that if we want to push pluggable scheduling it should load schedulers as kernel modules on the fly and so on versus pure /proc/ tunables and a compiled-in set of alternatives. William Lee Irwin III wrote: >> In any event, I'm not sure what to say about different schedulers for >> different aims. My intentions with plugsched were not centered around >> production usage or intra-queue policy. I'm relatively indifferent to >> the notion of having pluggable CPU schedulers, intra-queue or otherwise, >> in mainline. I don't see any particular harm in it, but neither am I >> particularly motivated to have it in. On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: > If you look at the struct sched_spa_child in the file > include/linux/sched_spa.h you'll see that the interface for switching > between the various SPA schedulers is quite small and making them > runtime switchable would be easy (I haven't done this in cpusched as I > wanted to keep the same interface for switching schedulers for all > schedulers: i.e. all run time switchable or none run time switchable; as > the main aim of plugsched had become a mechanism for evaluating > different intra queue scheduling designs.) I remember actually looking at this, and I would almost characterize the differences between the SPA schedulers as a tunable parameter. 
I have a different concept of what pluggability means from how the SPA schedulers were switched, but no particular objection to the method given the commonalities between them. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
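As a reading aid for the definition William gives above, the following toy C sketch shows the all-or-nothing dispatch constraint that this notion of gang scheduling implies: a gang of N threads runs only when N CPUs can be handed to it at once. The names are invented, and the genuinely hard part he points to, coordinating with the load balancer to migrate displaced tasks off the chosen CPUs, is deliberately left out.

/*
 * Toy illustration of gang scheduling as defined above: the gang is
 * dispatched only if enough CPUs can be claimed for it simultaneously,
 * otherwise it does not run at all.  Migration of the displaced tasks
 * (the load-balancer side of the problem) is ignored here.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

struct cpu { bool claimable; };		/* could this CPU be given up right now? */
struct gang { int nr_threads; };	/* threads that must run together */

static bool try_dispatch_gang(struct cpu cpus[], const struct gang *g)
{
	int free = 0;

	for (int i = 0; i < NR_CPUS; i++)
		if (cpus[i].claimable)
			free++;

	if (free < g->nr_threads)
		return false;		/* all or nothing: never run a partial gang */

	for (int i = 0, placed = 0; i < NR_CPUS && placed < g->nr_threads; i++) {
		if (cpus[i].claimable) {
			cpus[i].claimable = false;	/* a gang thread now owns it */
			placed++;
		}
	}
	return true;
}

int main(void)
{
	struct cpu cpus[NR_CPUS] = { {true}, {true}, {false}, {true} };
	struct gang g = { .nr_threads = 3 };

	printf("gang of %d threads: %s\n", g.nr_threads,
	       try_dispatch_gang(cpus, &g) ? "dispatched" : "must wait");
	return 0;
}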
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:04 ` William Lee Irwin III @ 2007-04-16 12:55 ` Peter Williams 2007-04-16 23:10 ` Michael K. Edwards [not found] ` <20070416135915.GK8915@holomorphy.com> 0 siblings, 2 replies; 304+ messages in thread From: Peter Williams @ 2007-04-16 12:55 UTC (permalink / raw) To: William Lee Irwin III Cc: Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> Driver models for scheduling are not so far out. AFAICS it's largely a >>> tug-of-war over design goals, e.g. maintaining per-cpu runqueues and >>> switching out intra-queue policies vs. switching out whole-system >>> policies, SMP handling and all. Whether this involves load balancing >>> depends strongly on e.g. whether you have per-cpu runqueues. A 2.4.x >>> scheduler module, for instance, would not have a load balancer at all, >>> as it has only one global runqueue. There are other sorts of policies >>> wanting significant changes to SMP handling vs. the stock load >>> balancing. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Well a single run queue removes the need for load balancing but has >> scalability issues on large systems. Personally, I think something in >> between would be the best solution i.e. multiple run queues but more >> than one CPU per run queue. I think that this would be a particularly >> good solution to the problems introduced by hyper threading and multi >> core systems and also NUMA systems. E.g. if all CPUs in a hyper thread >> package are using the one queue then the case where one CPU is trying to >> run a high priority task and the other a low priority task (i.e. the >> cases that the sleeping dependent mechanism tried to address) is less >> likely to occur. > > This wasn't meant to sing the praises of the 2.4.x scheduler; it was > rather meant to point out that the 2.4.x scheduler, among others, is > unimplementable within the framework if it assumes per-cpu runqueues. > More plausibly useful single-queue schedulers would likely use a vastly > different policy and attempt to carry out all queue manipulations via > lockless operations. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> By the way, I think that it's a very bad idea for the scheduling >> mechanism and the load balancing mechanism to be coupled. The anomalies >> that will be experienced and the attempts to make ad hoc fixes for them >> will lead to complexity spiralling out of control. > > This is clearly unavoidable in the case of gang scheduling. There is > simply no other way to schedule N tasks which must all be run > simultaneously when they run at all on N cpus of the system without > such coupling and furthermore at an extremely intimate level, > particularly when multiple competing gangs must be scheduled in such > a fashion. I can't see the logic here or why you would want to do such a thing. It certainly doesn't coincide with what I interpret "gang scheduling" to mean. > > > William Lee Irwin III wrote: >>> Where I had a significant need for >>> mucking with the entire concept of how SMP was handled, this is rather >>> different. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Yes, I went with the idea of intra run queue scheduling being orthogonal >> to load balancing for two reasons: >> 1. 
I think that coupling them is a bad idea from the complexity POV, and >> 2. it's enough of a battle fighting for modifications to one bit of the >> code without trying to do it to two simultaneously. > > As nice as that sounds, such a code structure would've precluded the > entire raison d'etre of the patch, i.e. gang scheduling. Not for what I understand "gang scheduling" to mean. As I understand it the constraints of gang scheduling are no where near as strict as you seem to think they are. And for what it's worth I don't think that what you think it means is in any sense a reasonable target. > > > William Lee Irwin III wrote: >>> At this point I'm questioning the relevance of my own work, >>> though it was already relatively marginal as it started life as an >>> attempt at a sort of debug patch to help gang scheduling (which is in >>> itself a rather marginally relevant feature to most users) code along. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> The main commercial plug in scheduler used with the run time loadable >> module scheduler that I mentioned earlier did gang scheduling (at the >> insistence of the Tru64 kernel folks). As this scheduler was a >> hierarchical "fair share" scheduler: i.e. allocating CPU "fairly" >> ("unfairly" really in according to an allocation policy) among higher >> level entities such as users, groups and applications as well as >> processes; it was fairly easy to make it a gang scheduler by modifying >> it to give all of a process's threads the same priority based on the >> process's CPU usage rather than different priorities based on the >> threads' usage rates. In fact, it would have been possible to select >> between gang and non gang on a per process basis if that was considered >> desirable. >> The fact that threads and processes are distinct entities on Tru64 and >> Solaris made this easier to do on them than on Linux. >> My experience with this scheduler leads me to believe that to achieve >> gang scheduling and fairness, etc. you need (usage) statistics based >> schedulers. > > This does not appear to make sense unless it's based on an incorrect > use of the term "gang scheduling." It's become obvious that we mean different things. > I'm referring to a gang as a set of > tasks (typically restricted to threads of the same process) which must > all be considered runnable or unrunnable simultaneously, and are for > the sake of performance required to all actually be run simultaneously. > This means a gang of N threads, when run, must run on N processors at > once. A time and a set of processors must be chosen for any time > interval where the gang is running. This interacts with load balancing > by needing to choose the cpus to run the gang on, and also arranging > for a set of cpus available for the gang to use to exist by means of > migrating tasks off the chosen cpus. Sounds like a job for the load balancer NOT the scheduler. Also I can't see you meeting such strict constraints without making the tasks all SCHED_FIFO. > > > William Lee Irwin III wrote: >>> Kernel compiles not so useful a benchmark. SDET, OAST, AIM7, etc. are >>> better ones. I'd not bother citing kernel compile results. > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> spa_svr actually does its best work when the system isn't fully loaded >> as the type of improvement it strives to achieve (minimizing on queue >> wait time) hasn't got much room to manoeuvre when the system is fully >> loaded. 
Therefore, the fact that it's 1% better even in these >> circumstances is a good result and also indicates that the overhead for >> keeping the scheduling statistics it uses for its decision making is >> well spent. Especially, when you consider that the total available room >> for improvement on this benchmark is less than 3%. > > None of these benchmarks require the system to be fully loaded. They > are, on the other hand, vastly more realistic simulated workloads than > kernel compiles, and furthermore are actually developed as benchmarks, > with in some cases even measurements of variance, iteration to > convergence, and similar such things that make them actually scientific. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> To elaborate, the motivation for this scheduler was acquired from the >> observation of scheduling statistics (in particular, on queue wait time) >> on systems running at about 30% to 50% load. Theoretically, at these >> load levels there should be no such waiting but the statistics show that >> there is considerable waiting (sometimes as high as 30% to 50%). I put >> this down to "lack of serendipity" e.g. everyone sleeping at the same >> time and then trying to run at the same time would be complete lack of >> serendipity. On the other hand, if everyone is synced then there would >> be total serendipity. >> Obviously, from the POV of a client, time the server task spends waiting >> on the queue adds to the response time for any request that has been >> made so reduction of this time on a server is a good thing(tm). Equally >> obviously, trying to achieve this synchronization by asking the tasks to >> cooperate with each other is not a feasible solution and some external >> influence needs to be exerted and this is what spa_svr does -- it nudges >> the scheduling order of the tasks in a way that makes them become well >> synced. > > This all sounds like a relatively good idea. So it's good for throughput > vs. latency or otherwise not particularly interactive. No big deal, just > use it where it makes sense. > > > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> Unfortunately, this is not a good scheduler for an interactive system as >> it minimizes the response times for ALL tasks (and the system as a >> whole) and this can result in increased response time for some >> interactive tasks (clunkiness) which annoys interactive users. When you >> start fiddling with this scheduler to bring back "interactive >> unfairness" you kill a lot of its superior low overall wait time >> performance. >> So this is why I think "horses for courses" schedulers are worth while. > > I have no particular objection to using an appropriate scheduler for the > system's workload. I also have little or no preference as to how that's > accomplished overall. But I really think that if we want to push > pluggable scheduling it should load schedulers as kernel modules on the > fly and so on versus pure /proc/ tunables and a compiled-in set of > alternatives. > > > William Lee Irwin III wrote: >>> In any event, I'm not sure what to say about different schedulers for >>> different aims. My intentions with plugsched were not centered around >>> production usage or intra-queue policy. I'm relatively indifferent to >>> the notion of having pluggable CPU schedulers, intra-queue or otherwise, >>> in mainline. I don't see any particular harm in it, but neither am I >>> particularly motivated to have it in. 
> > On Mon, Apr 16, 2007 at 03:09:31PM +1000, Peter Williams wrote: >> If you look at the struct sched_spa_child in the file >> include/linux/sched_spa.h you'll see that the interface for switching >> between the various SPA schedulers is quite small and making them >> runtime switchable would be easy (I haven't done this in cpusched as I >> wanted to keep the same interface for switching schedulers for all >> schedulers: i.e. all run time switchable or none run time switchable; as >> the main aim of plugsched had become a mechanism for evaluating >> different intra queue scheduling designs.) > > I remember actually looking at this, and I would almost characterize > the differences between the SPA schedulers as a tunable parameter. I > have a different concept of what pluggability means from how the SPA > schedulers were switched, but no particular objection to the method > given the commonalities between them. Yes, that's the way I look at them (in fact, in Zaphod that's exactly what they were -- i.e. Zaphod could be made to behave like various SPA schedulers by fiddling its run time parameters). They illustrate (to my mind) that once you get rid of the O(1) scheduler and replace it with a simple mechanism such as SPA (where there's a small number of points where the scheduling discipline gets to do its thing rather than being interspersed willy nilly throughout the rest of the code) adding run time switchable "horses for courses" scheduler disciplines becomes simple. I think that the simplifications in Ingo's new scheduler (whose scheduling classes now look a lot like Solaris's and its predecessor OSes' scheduler classes) may make it possible to have switchable scheduling disciplines within a scheduling class. I think that something similar (i.e. switchability) could be done for load balancing so that different load balancers could be used when required. By keeping this load balancing functionality orthogonal to the intra run queue scheduling disciplines you increase the number of options available. As I see it, if the scheduling discipline in use does its job properly within a run queue and the load balancer does its job of keeping the weighted load/demand on each run queue roughly equal (except where it has to do otherwise for your version of "gang scheduling") then the overall outcome will meet expectations. Note that I talk of run queues not CPUs as I think a shift to multiple CPUs per run queue may be a good idea. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 12:55 ` Peter Williams @ 2007-04-16 23:10 ` Michael K. Edwards 2007-04-17 3:55 ` Nick Piggin [not found] ` <20070416135915.GK8915@holomorphy.com> 1 sibling, 1 reply; 304+ messages in thread From: Michael K. Edwards @ 2007-04-16 23:10 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > Note that I talk of run queues > not CPUs as I think a shift to multiple CPUs per run queue may be a good > idea. This observation of Peter's is the best thing to come out of this whole foofaraw. Looking at what's happening in CPU-land, I think it's going to be necessary, within a couple of years, to replace the whole idea of "CPU scheduling" with "run queue scheduling" across a complex, possibly dynamic mix of CPU-ish resources. Ergo, there's not much point in churning the mainline scheduler through a design that isn't significantly more flexible than any of those now under discussion. For instance, there are architectures where several "CPUs" (instruction stream decoders feeding execution pipelines) share parts of a cache hierarchy ("chip-level multitasking"). On these machines, you may want to co-schedule a "real" processing task on one pipeline with a "cache warming" task on the other pipeline -- but only for tasks whose memory access patterns have been sufficiently analyzed to write the "cache warming" task code. Some other tasks may want to idle the second pipeline so they can use the full cache-to-RAM bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O bound but so context-heavy that it's not worth yielding the CPU during quick I/Os), and hence perfectly happy to run concurrently with an unrelated task on the other pipeline. There are other architectures where several "hardware threads" fight over parts of a cache hierarchy (sometimes bizarrely described as "sharing" the cache, kind of the way most two-year-olds "share" toys). On these machines, one instruction pipeline can't help the other along cache-wise, but it sure can hurt. A scheduler designed, tested, and tuned principally on one of these architectures (hint: "hyperthreading") will probably leave a lot of performance on the floor on processors in the former category. In the not-so-distant future, we're likely to see architectures with dynamically reconfigurable interconnect between instruction issue units and execution resources. (This is already quite feasible on, say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with as many Nios II cores as fit on the chip.) Restoring task context may involve not just MMU swaps and FPU instructions (with state-dependent hidden costs) but processsor reconfiguration. Achieving "fairness" according to any standard that a platform integrator cares about (let alone an end user) will require a fairly detailed model of the hidden costs associated with different sorts of task switch. So if you are interested in schedulers for some reason other than a paycheck, let the distros worry about 5% improvements on x86[_64]. 
Get hold of some different "hardware" -- say: - a Xilinx ML410 if you've got $3K to blow and want to explore reconfigurable processors; - a SunFire T2000 if you've got $11K and want to mess with a CMT system that's actually shipping; - a QEMU-simulated massively SMP x86 if you're poor but clever enough to implement funky cross-core cache effects yourself; or - a cycle-accurate simulator from Gaisler or Virtio if you want a real research project. Then go explore some more interesting regions of parameter space and see what the demands on mainline Linux will look like in a few years. Cheers, - Michael ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 23:10 ` Michael K. Edwards @ 2007-04-17 3:55 ` Nick Piggin 2007-04-17 4:25 ` Peter Williams 2007-04-17 8:24 ` William Lee Irwin III 0 siblings, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 3:55 UTC (permalink / raw) To: Michael K. Edwards Cc: Peter Williams, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: > On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > >Note that I talk of run queues > >not CPUs as I think a shift to multiple CPUs per run queue may be a good > >idea. > > This observation of Peter's is the best thing to come out of this > whole foofaraw. Looking at what's happening in CPU-land, I think it's > going to be necessary, within a couple of years, to replace the whole > idea of "CPU scheduling" with "run queue scheduling" across a complex, > possibly dynamic mix of CPU-ish resources. Ergo, there's not much > point in churning the mainline scheduler through a design that isn't > significantly more flexible than any of those now under discussion. Why? If you do that, then your load balancer just becomes less flexible because it is harder to have tasks run on one or the other. You can have single-runqueue-per-domain behaviour (or close to) just by relaxing all restrictions on idle load balancing within that domain. It is harder to go the other way and place any per-cpu affinity or restirctions with multiple cpus on a single runqueue. > For instance, there are architectures where several "CPUs" > (instruction stream decoders feeding execution pipelines) share parts > of a cache hierarchy ("chip-level multitasking"). On these machines, > you may want to co-schedule a "real" processing task on one pipeline > with a "cache warming" task on the other pipeline -- but only for > tasks whose memory access patterns have been sufficiently analyzed to > write the "cache warming" task code. Some other tasks may want to > idle the second pipeline so they can use the full cache-to-RAM > bandwidth. Yet other tasks may be genuinely CPU-intensive (or I/O > bound but so context-heavy that it's not worth yielding the CPU during > quick I/Os), and hence perfectly happy to run concurrently with an > unrelated task on the other pipeline. We can do all that now with load balancing, affinities or by shutting down threads dynamically. > There are other architectures where several "hardware threads" fight > over parts of a cache hierarchy (sometimes bizarrely described as > "sharing" the cache, kind of the way most two-year-olds "share" toys). > On these machines, one instruction pipeline can't help the other > along cache-wise, but it sure can hurt. A scheduler designed, tested, > and tuned principally on one of these architectures (hint: > "hyperthreading") will probably leave a lot of performance on the > floor on processors in the former category. > > In the not-so-distant future, we're likely to see architectures with > dynamically reconfigurable interconnect between instruction issue > units and execution resources. (This is already quite feasible on, > say, Virtex4 FX devices with multiple PPC cores, or Altera FPGAs with > as many Nios II cores as fit on the chip.) 
Restoring task context may > involve not just MMU swaps and FPU instructions (with state-dependent > hidden costs) but processsor reconfiguration. Achieving "fairness" > according to any standard that a platform integrator cares about (let > alone an end user) will require a fairly detailed model of the hidden > costs associated with different sorts of task switch. > > So if you are interested in schedulers for some reason other than a > paycheck, let the distros worry about 5% improvements on x86[_64]. > Get hold of some different "hardware" -- say: > - a Xilinx ML410 if you've got $3K to blow and want to explore > reconfigurable processors; > - a SunFire T2000 if you've got $11K and want to mess with a CMT > system that's actually shipping; > - a QEMU-simulated massively SMP x86 if you're poor but clever > enough to implement funky cross-core cache effects yourself; or > - a cycle-accurate simulator from Gaisler or Virtio if you want a > real research project. > Then go explore some more interesting regions of parameter space and > see what the demands on mainline Linux will look like in a few years. There are no doubt improvements to be made, but they are generally intended to be able to be done within the sched-domains framework. I am not aware of a particular need that would be impossible to do using that topology hierarchy and per-CPU runqueues, and there are added complications involved with multiple CPUs per runqueue. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:55 ` Nick Piggin @ 2007-04-17 4:25 ` Peter Williams 2007-04-17 4:34 ` Nick Piggin 2007-04-17 8:24 ` William Lee Irwin III 1 sibling, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 4:25 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >>> Note that I talk of run queues >>> not CPUs as I think a shift to multiple CPUs per run queue may be a good >>> idea. >> This observation of Peter's is the best thing to come out of this >> whole foofaraw. Looking at what's happening in CPU-land, I think it's >> going to be necessary, within a couple of years, to replace the whole >> idea of "CPU scheduling" with "run queue scheduling" across a complex, >> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >> point in churning the mainline scheduler through a design that isn't >> significantly more flexible than any of those now under discussion. > > Why? If you do that, then your load balancer just becomes less flexible > because it is harder to have tasks run on one or the other. > > You can have single-runqueue-per-domain behaviour (or close to) just by > relaxing all restrictions on idle load balancing within that domain. It > is harder to go the other way and place any per-cpu affinity or > restirctions with multiple cpus on a single runqueue. Allowing N (where N can be one or greater) CPUs per run queue actually increases flexibility as you can still set N to 1 to get the current behaviour. One advantage of allowing multiple CPUs per run queue would be at the smaller end of the system scale i.e. a PC with a single hyper threading chip (i.e. 2 CPUs) would not need to worry about load balancing at all if both CPUs used the one runqueue and all the nasty side effects that come with hyper threading would be minimized at the same time. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
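A minimal sketch of the arrangement Peter is suggesting, assuming two hyperthreads per package: both siblings index into the same shared queue, so there is nothing to balance between them and whichever sibling becomes free next simply takes the most eligible task. The mapping and names below are invented for illustration and are not taken from plugsched or the mainline scheduler.

/*
 * Minimal sketch of "more than one CPU per run queue": SMT siblings
 * share a single queue, so no balancing is needed between them and a
 * queued task is visible to whichever sibling becomes free first.
 * Illustrative only; the queue is reduced to a counter and a priority.
 */
#include <stdio.h>

#define NR_CPUS		4
#define CPUS_PER_RQ	2	/* e.g. two hyperthreads per package */

struct shared_rq {
	int nr_running;
	int best_prio_task;	/* stand-in for a real queue of tasks */
};

static struct shared_rq rqs[NR_CPUS / CPUS_PER_RQ];

static struct shared_rq *cpu_rq(int cpu)
{
	return &rqs[cpu / CPUS_PER_RQ];		/* siblings map to one queue */
}

static int pick_next(int cpu)
{
	struct shared_rq *rq = cpu_rq(cpu);

	if (!rq->nr_running)
		return -1;			/* nothing to do: go idle */
	rq->nr_running--;
	return rq->best_prio_task;
}

int main(void)
{
	cpu_rq(0)->nr_running = 1;
	cpu_rq(0)->best_prio_task = 42;

	/* CPU 1 is CPU 0's sibling, so it sees the task without any balancing. */
	printf("cpu1 picked task %d\n", pick_next(1));
	/* CPU 2 belongs to the other package and its queue is empty. */
	printf("cpu2 picked %d (idle)\n", pick_next(2));
	return 0;
}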
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:25 ` Peter Williams @ 2007-04-17 4:34 ` Nick Piggin 2007-04-17 6:03 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 4:34 UTC (permalink / raw) To: Peter Williams Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote: > Nick Piggin wrote: > >On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: > >>On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: > >>>Note that I talk of run queues > >>>not CPUs as I think a shift to multiple CPUs per run queue may be a good > >>>idea. > >>This observation of Peter's is the best thing to come out of this > >>whole foofaraw. Looking at what's happening in CPU-land, I think it's > >>going to be necessary, within a couple of years, to replace the whole > >>idea of "CPU scheduling" with "run queue scheduling" across a complex, > >>possibly dynamic mix of CPU-ish resources. Ergo, there's not much > >>point in churning the mainline scheduler through a design that isn't > >>significantly more flexible than any of those now under discussion. > > > >Why? If you do that, then your load balancer just becomes less flexible > >because it is harder to have tasks run on one or the other. > > > >You can have single-runqueue-per-domain behaviour (or close to) just by > >relaxing all restrictions on idle load balancing within that domain. It > >is harder to go the other way and place any per-cpu affinity or > >restirctions with multiple cpus on a single runqueue. > > Allowing N (where N can be one or greater) CPUs per run queue actually > increases flexibility as you can still set N to 1 to get the current > behaviour. But you add extra code for that on top of what we have, and are also prevented from making per-cpu assumptions. And you can get N CPUs per runqueue behaviour by having them in a domain with no restrictions on idle balancing. So where does your increased flexibilty come from? > One advantage of allowing multiple CPUs per run queue would be at the > smaller end of the system scale i.e. a PC with a single hyper threading > chip (i.e. 2 CPUs) would not need to worry about load balancing at all > if both CPUs used the one runqueue and all the nasty side effects that > come with hyper threading would be minimized at the same time. I don't know about that -- the current load balancer already minimises the nasty multi threading effects. SMT is very important for IBM's chips for example, and they've never had any problem with that side of it since it was introduced and bugs ironed out (at least, none that I've heard). ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 4:34 ` Nick Piggin @ 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 304+ messages in thread From: Peter Williams @ 2007-04-17 6:03 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Nick Piggin wrote: > On Tue, Apr 17, 2007 at 02:25:39PM +1000, Peter Williams wrote: >> Nick Piggin wrote: >>> On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >>>> On 4/16/07, Peter Williams <pwil3058@bigpond.net.au> wrote: >>>>> Note that I talk of run queues >>>>> not CPUs as I think a shift to multiple CPUs per run queue may be a good >>>>> idea. >>>> This observation of Peter's is the best thing to come out of this >>>> whole foofaraw. Looking at what's happening in CPU-land, I think it's >>>> going to be necessary, within a couple of years, to replace the whole >>>> idea of "CPU scheduling" with "run queue scheduling" across a complex, >>>> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >>>> point in churning the mainline scheduler through a design that isn't >>>> significantly more flexible than any of those now under discussion. >>> Why? If you do that, then your load balancer just becomes less flexible >>> because it is harder to have tasks run on one or the other. >>> >>> You can have single-runqueue-per-domain behaviour (or close to) just by >>> relaxing all restrictions on idle load balancing within that domain. It >>> is harder to go the other way and place any per-cpu affinity or >>> restirctions with multiple cpus on a single runqueue. >> Allowing N (where N can be one or greater) CPUs per run queue actually >> increases flexibility as you can still set N to 1 to get the current >> behaviour. > > But you add extra code for that on top of what we have, and are also > prevented from making per-cpu assumptions. > > And you can get N CPUs per runqueue behaviour by having them in a domain > with no restrictions on idle balancing. So where does your increased > flexibilty come from? > >> One advantage of allowing multiple CPUs per run queue would be at the >> smaller end of the system scale i.e. a PC with a single hyper threading >> chip (i.e. 2 CPUs) would not need to worry about load balancing at all >> if both CPUs used the one runqueue and all the nasty side effects that >> come with hyper threading would be minimized at the same time. > > I don't know about that -- the current load balancer already minimises > the nasty multi threading effects. SMT is very important for IBM's chips > for example, and they've never had any problem with that side of it > since it was introduced and bugs ironed out (at least, none that I've > heard). > There's a lot of ugly code in the load balancer that is only there to overcome the side effects of SMT and dual core. A lot of it was put there by Intel employees trying to make load balancing more friendly to their systems. What I'm suggesting is that an N CPUs per runqueue is a better way of achieving that end. I may (of course) be wrong but I think that the idea deserves more consideration than you're willing to give it. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." 
-- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams @ 2007-04-17 6:14 ` William Lee Irwin III 2007-04-17 6:23 ` Nick Piggin 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 6:14 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, Michael K. Edwards, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote: > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly to > their systems. What I'm suggesting is that an N CPUs per runqueue is a > better way of achieving that end. I may (of course) be wrong but I > think that the idea deserves more consideration than you're willing to > give it. This may be a good one to ask Ingo about, as he did significant performance work on per-core runqueues for SMT. While I did write per-node runqueue code for NUMA at some point in the past, I did no tuning or other performance work on it, only functionality. I've actually dealt with kernels using elder versions of Ingo's code for per-core runqueues on SMT, but was never called upon to examine that particular code for either performance or stability, so I'm largely ignorant of what the perceived outcome of it was. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III @ 2007-04-17 6:23 ` Nick Piggin 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 6:23 UTC (permalink / raw) To: Peter Williams Cc: Michael K. Edwards, William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Tue, Apr 17, 2007 at 04:03:41PM +1000, Peter Williams wrote: > Nick Piggin wrote: > > > >But you add extra code for that on top of what we have, and are also > >prevented from making per-cpu assumptions. > > > >And you can get N CPUs per runqueue behaviour by having them in a domain > >with no restrictions on idle balancing. So where does your increased > >flexibilty come from? > > > >>One advantage of allowing multiple CPUs per run queue would be at the > >>smaller end of the system scale i.e. a PC with a single hyper threading > >>chip (i.e. 2 CPUs) would not need to worry about load balancing at all > >>if both CPUs used the one runqueue and all the nasty side effects that > >>come with hyper threading would be minimized at the same time. > > > >I don't know about that -- the current load balancer already minimises > >the nasty multi threading effects. SMT is very important for IBM's chips > >for example, and they've never had any problem with that side of it > >since it was introduced and bugs ironed out (at least, none that I've > >heard). > > > > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly to I agree that some of that has exploded complexity. I have some thoughts about better approaches for some of those things, but basically been stuck working on VM problems for a while. > their systems. What I'm suggesting is that an N CPUs per runqueue is a > better way of achieving that end. I may (of course) be wrong but I > think that the idea deserves more consideration than you're willing to > give it. Put it this way: it is trivial to group the load balancing stats of N CPUs with their own runqueues. Just put them under a domain and take the sum. The domain essentially takes on the same function as a single queue with N CPUs under it. Anything _further_ you can do with individual runqueues (like naturally adding an affinity pressure ranging from nothing to absolute) are things that you don't trivially get with 1:N approach. AFAIKS. So I will definitely give any idea consideration, but I just need to be shown where the benefit comes from. ^ permalink raw reply [flat|nested] 304+ messages in thread
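A rough sketch of the point Nick is making here: keeping one runqueue per CPU does not prevent group-level decisions, because a domain can treat the sum of its members' loads as a single figure and balance between those sums. This is a drastic simplification of the real sched-domains code; the structures and numbers below are illustrative only.

/*
 * Simplified illustration of grouping per-CPU runqueue load under a
 * domain: each group's load is just the sum of its members', and the
 * balancer compares group sums rather than individual queues.  Not the
 * real sched-domains code; spans and loads are made up for the example.
 */
#include <stdio.h>

#define NR_CPUS 4

static unsigned long cpu_load[NR_CPUS] = { 3, 1, 7, 5 };

struct sched_group_sketch {
	int first_cpu, nr_cpus;		/* contiguous span, for simplicity */
};

static unsigned long group_load(const struct sched_group_sketch *g)
{
	unsigned long sum = 0;

	for (int i = 0; i < g->nr_cpus; i++)
		sum += cpu_load[g->first_cpu + i];
	return sum;
}

int main(void)
{
	/* Two groups of two CPUs each, e.g. two SMT packages in one domain. */
	struct sched_group_sketch a = { 0, 2 }, b = { 2, 2 };
	unsigned long la = group_load(&a), lb = group_load(&b);

	printf("group A load %lu, group B load %lu: pull from %s\n",
	       la, lb, la > lb ? "A" : "B");
	return 0;
}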
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 6:03 ` Peter Williams 2007-04-17 6:14 ` William Lee Irwin III 2007-04-17 6:23 ` Nick Piggin @ 2007-04-17 9:36 ` Ingo Molnar 2 siblings, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:36 UTC (permalink / raw) To: Peter Williams Cc: Nick Piggin, Michael K. Edwards, William Lee Irwin III, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Peter Williams <pwil3058@bigpond.net.au> wrote: > There's a lot of ugly code in the load balancer that is only there to > overcome the side effects of SMT and dual core. A lot of it was put > there by Intel employees trying to make load balancing more friendly > to their systems. What I'm suggesting is that an N CPUs per runqueue > is a better way of achieving that end. I may (of course) be wrong but > I think that the idea deserves more consideration than you're willing > to give it. i actually implemented that some time ago and i'm afraid it was ugly as hell and pretty fragile. Load-balancing gets simpler, but task picking gets a lot uglier. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 3:55 ` Nick Piggin 2007-04-17 4:25 ` Peter Williams @ 2007-04-17 8:24 ` William Lee Irwin III 1 sibling, 0 replies; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 8:24 UTC (permalink / raw) To: Nick Piggin Cc: Michael K. Edwards, Peter Williams, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Mon, Apr 16, 2007 at 04:10:59PM -0700, Michael K. Edwards wrote: >> This observation of Peter's is the best thing to come out of this >> whole foofaraw. Looking at what's happening in CPU-land, I think it's >> going to be necessary, within a couple of years, to replace the whole >> idea of "CPU scheduling" with "run queue scheduling" across a complex, >> possibly dynamic mix of CPU-ish resources. Ergo, there's not much >> point in churning the mainline scheduler through a design that isn't >> significantly more flexible than any of those now under discussion. On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote: > Why? If you do that, then your load balancer just becomes less flexible > because it is harder to have tasks run on one or the other. On Tue, Apr 17, 2007 at 05:55:28AM +0200, Nick Piggin wrote: > You can have single-runqueue-per-domain behaviour (or close to) just by > relaxing all restrictions on idle load balancing within that domain. It > is harder to go the other way and place any per-cpu affinity or > restirctions with multiple cpus on a single runqueue. The big sticking point here is order-sensitivity. One can point to stringent sched_yield() ordering but that's not so important in and of itself. The more significant case is RT applications which are order- sensitive. Per-cpu runqueues rather significantly disturb the ordering requirements of applications that care about it. In terms of a plugging framework, the per-cpu arrangement precludes or makes extremely awkward scheduling policies that don't have per-cpu runqueues, for instance, the 2.4.x policy. There is also the alternate SMP scalability strategy of a lockless scheduler with a single global queue, which is more performance-oriented. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
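William's aside about a lockless scheduler with a single global queue is worth a concrete picture. Below is a textbook compare-and-swap push and pop on one global list, written with C11 atomics; it only shows the flavour of lockless queue manipulation and is not a scheduler. A real lockless run queue would also need priority or deadline ordering and a proper answer to the ABA problem on removal, neither of which is attempted here.

/*
 * Flavour of "all queue manipulations via lockless operations": a
 * textbook lock-free push/pop on a single global list using
 * compare-and-swap (C11 atomics).  Illustrative only: there is no
 * ordering policy and the ABA problem on pop is ignored.
 */
#include <stdatomic.h>
#include <stdio.h>

struct task {
	int pid;
	struct task *next;
};

static _Atomic(struct task *) global_queue;

static void lockless_push(struct task *t)
{
	struct task *old = atomic_load(&global_queue);

	do {
		t->next = old;	/* old is refreshed by a failed CAS */
	} while (!atomic_compare_exchange_weak(&global_queue, &old, t));
}

static struct task *lockless_pop(void)
{
	struct task *old = atomic_load(&global_queue);

	while (old && !atomic_compare_exchange_weak(&global_queue, &old, old->next))
		;		/* retry with the refreshed head */
	return old;
}

int main(void)
{
	struct task a = { .pid = 1 }, b = { .pid = 2 };
	struct task *first, *second;

	lockless_push(&a);
	lockless_push(&b);
	first = lockless_pop();
	second = lockless_pop();
	printf("popped pid %d then pid %d\n", first->pid, second->pid);
	return 0;
}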
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] [not found] ` <20070417064109.GP8915@holomorphy.com> @ 2007-04-17 8:00 ` Peter Williams 2007-04-17 10:41 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 8:00 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: > On Tue, Apr 17, 2007 at 04:34:36PM +1000, Peter Williams wrote: >> This doesn't make any sense to me. >> For a start, exact simultaneous operation would be impossible to achieve >> except with highly specialized architecture such as the long departed >> transputer. And secondly, I can't see why it's necessary. > > We're not going to make any headway here, so we might as well drop the > thread. Yes, we were starting to go around in circles, weren't we? > > There are other things to talk about anyway, for instance I'm seeing > interest in plugsched come about from elsewhere and am taking an > interest in getting it into shape wrt. various design goals therefore. > > Probably the largest issue of note is getting scheduler drivers > loadable as kernel modules. Addressing the points Ingo made that can > be addressed are also lined up for this effort. > > Comments on which directions you'd like this to go in these respects > would be appreciated, as I regard you as the current "project owner." I'd do a scan through LKML from about 18 months ago looking for mention of a runtime configurable version of plugsched. Some students at a university (in Germany, I think) posted some patches adding this feature to plugsched around about then. I never added them to plugsched proper as I knew (from previous experience when the company I worked for posted patches with similar functionality) that Linus would like this idea less than he did the current plugsched mechanism. Unfortunately, my own cache of the relevant e-mails got overwritten during a Fedora Core upgrade (I've since moved /var onto a separate drive to avoid a repetition) or I would dig them out and send them to you. I'd provided them with copies of the company's patches to use as a guide to how to overcome the problems associated with changing schedulers on a running system (a few non trivial locking issues pop up). Maybe if one of the students still reads LKML he will provide a pointer. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 8:00 ` Peter Williams @ 2007-04-17 10:41 ` William Lee Irwin III 2007-04-17 13:48 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-17 10:41 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: >> Comments on which directions you'd like this to go in these respects >> would be appreciated, as I regard you as the current "project owner." On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > I'd do scan through LKML from about 18 months ago looking for mention of > runtime configurable version of plugsched. Some students at a > university (in Germany, I think) posted some patches adding this feature > to plugsched around about then. Excellent. I'll go hunting for that. On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > I never added them to plugsched proper as I knew (from previous > experience when the company I worked for posted patches with similar > functionality) that Linux would like this idea less than he did the > current plugsched mechanism. Odd how the requirements ended up including that. Fickleness abounds. If only we knew up-front what the end would be. On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: > Unfortunately, my own cache of the relevant e-mails got overwritten > during a Fedora Core upgrade (I've since moved /var onto a separate > drive to avoid a repetition) or I would dig them out and send them to > you. I'd provided with copies of the company's patches to use as a > guide to how to overcome the problems associated with changing > schedulers on a running system (a few non trivial locking issues pop up). > Maybe if one of the students still reads LKML he will provide a pointer. I was tempted to restart from scratch given Ingo's comments, but I reconsidered and I'll be working with your code (and the German students' as well). If everything has to change, so be it, but it'll still be a derived work. It would be ignoring precedent and failure to properly attribute if I did otherwise. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 10:41 ` William Lee Irwin III @ 2007-04-17 13:48 ` Peter Williams 2007-04-18 0:27 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 13:48 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: > William Lee Irwin III wrote: >>> Comments on which directions you'd like this to go in these respects >>> would be appreciated, as I regard you as the current "project owner." > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> I'd do scan through LKML from about 18 months ago looking for mention of >> runtime configurable version of plugsched. Some students at a >> university (in Germany, I think) posted some patches adding this feature >> to plugsched around about then. > > Excellent. I'll go hunting for that. > > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> I never added them to plugsched proper as I knew (from previous >> experience when the company I worked for posted patches with similar >> functionality) that Linux would like this idea less than he did the >> current plugsched mechanism. > > Odd how the requirements ended up including that. Fickleness abounds. > If only we knew up-front what the end would be. > > > On Tue, Apr 17, 2007 at 06:00:06PM +1000, Peter Williams wrote: >> Unfortunately, my own cache of the relevant e-mails got overwritten >> during a Fedora Core upgrade (I've since moved /var onto a separate >> drive to avoid a repetition) or I would dig them out and send them to >> you. I'd provided with copies of the company's patches to use as a >> guide to how to overcome the problems associated with changing >> schedulers on a running system (a few non trivial locking issues pop up). >> Maybe if one of the students still reads LKML he will provide a pointer. > > I was tempted to restart from scratch given Ingo's comments, but I > reconsidered and I'll be working with your code (and the German > students' as well). If everything has to change, so be it, but it'll > still be a derived work. It would be ignoring precedent and failure to > properly attribute if I did otherwise. I can give you a patch (or set of patches) against the latest git vanilla kernel version if that would help. There have been changes to the vanilla scheduler code since 2.6.20, so the latest patch on sourceforge won't apply cleanly. I've found that implementing this as a series of patches rather than one big patch makes it easier for me to cope with changes to the underlying code. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 13:48 ` Peter Williams @ 2007-04-18 0:27 ` Peter Williams 2007-04-18 2:03 ` William Lee Irwin III 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-18 0:27 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List Peter Williams wrote: > William Lee Irwin III wrote: >> I was tempted to restart from scratch given Ingo's comments, but I >> reconsidered and I'll be working with your code (and the German >> students' as well). If everything has to change, so be it, but it'll >> still be a derived work. It would be ignoring precedent and failure to >> properly attribute if I did otherwise. > > I can give you a patch (or set of patches) against the latest git > vanilla kernel version if that would help. There have been changes to > the vanilla scheduler code since 2.6.20 so the latest patch on > sourceforge won't apply cleanly. I've found that implementing this as a > series of patches rather than one big patch makes it easier fro me to > cope with changes to the underlying code. I've just placed a single patch for plugsched against 2.6.21-rc7 updated to Linus's git tree as of an hour or two ago on sourceforge: <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> This should at least enable you to get it to apply cleanly to the latest kernel sources. Let me know if you'd also like this as a quilt/mq friendly patch series? Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 0:27 ` Peter Williams @ 2007-04-18 2:03 ` William Lee Irwin III 2007-04-18 2:31 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: William Lee Irwin III @ 2007-04-18 2:03 UTC (permalink / raw) To: Peter Williams; +Cc: Linux Kernel Mailing List > Peter Williams wrote: > >William Lee Irwin III wrote: > >>I was tempted to restart from scratch given Ingo's comments, but I > >>reconsidered and I'll be working with your code (and the German > >>students' as well). If everything has to change, so be it, but it'll > >>still be a derived work. It would be ignoring precedent and failure to > >>properly attribute if I did otherwise. > > > >I can give you a patch (or set of patches) against the latest git > >vanilla kernel version if that would help. There have been changes to > >the vanilla scheduler code since 2.6.20 so the latest patch on > >sourceforge won't apply cleanly. I've found that implementing this as a > >series of patches rather than one big patch makes it easier fro me to > >cope with changes to the underlying code. > On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote: > I've just placed a single patch for plugsched against 2.6.21-rc7 updated > to Linus's git tree as of an hour or two ago on sourceforge: > <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> > This should at least enable you to get it to apply cleanly to the latest > kernel sources. Let me know if you'd also like this as a quilt/mq > friendly patch series? A quilt-friendly series would be most excellent if you could arrange it. Thanks. -- wli ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 2:03 ` William Lee Irwin III @ 2007-04-18 2:31 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-18 2:31 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Linux Kernel Mailing List William Lee Irwin III wrote: >> Peter Williams wrote: >>> William Lee Irwin III wrote: >>>> I was tempted to restart from scratch given Ingo's comments, but I >>>> reconsidered and I'll be working with your code (and the German >>>> students' as well). If everything has to change, so be it, but it'll >>>> still be a derived work. It would be ignoring precedent and failure to >>>> properly attribute if I did otherwise. >>> I can give you a patch (or set of patches) against the latest git >>> vanilla kernel version if that would help. There have been changes to >>> the vanilla scheduler code since 2.6.20 so the latest patch on >>> sourceforge won't apply cleanly. I've found that implementing this as a >>> series of patches rather than one big patch makes it easier fro me to >>> cope with changes to the underlying code. > On Wed, Apr 18, 2007 at 10:27:27AM +1000, Peter Williams wrote: >> I've just placed a single patch for plugsched against 2.6.21-rc7 updated >> to Linus's git tree as of an hour or two ago on sourceforge: >> <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch> >> This should at least enable you to get it to apply cleanly to the latest >> kernel sources. Let me know if you'd also like this as a quilt/mq >> friendly patch series? > > A quilt-friendly series would be most excellent if you could arrange it. Done: <http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.21-rc7.patch-series.tar.gz> Just untar this in the base directory of your Linux kernel source and Bob's your uncle. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 1:06 ` Peter Williams 2007-04-16 3:04 ` William Lee Irwin III @ 2007-04-16 17:22 ` Chris Friesen 2007-04-17 0:54 ` Peter Williams 1 sibling, 1 reply; 304+ messages in thread From: Chris Friesen @ 2007-04-16 17:22 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > To my mind scheduling > and load balancing are orthogonal and keeping them that way simplifies > things. Scuse me if I jump in here, but doesn't the load balancer need some way to figure out a) when to run, and b) which tasks to pull and where to push them? I suppose you could abstract this into a per-scheduler API, but to me at least these are the hard parts of the load balancer... Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 17:22 ` Chris Friesen @ 2007-04-17 0:54 ` Peter Williams 2007-04-17 15:52 ` Chris Friesen 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 0:54 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: > >> To my mind scheduling and load balancing are orthogonal and keeping >> them that way simplifies things. > > Scuse me if I jump in here, but doesn't the load balancer need some way > to figure out a) when to run, and b) which tasks to pull and where to > push them? Yes, but both of these are independent of the scheduler discipline in force. > > I suppose you could abstract this into a per-scheduler API, but to me at > least these are the hard parts of the load balancer... Load balancing needs to be based on the static priorities (i.e. nice or real-time priority) of the runnable tasks, not the dynamic priorities. If the load balancer manages to keep the weighted (according to static priority) load and distribution of priorities within the loads on the CPUs roughly equal, and the scheduler does a good job of ensuring fairness, interactive responsiveness etc. for the tasks within a CPU, then the result will be good system performance within the constraints set by the sys admin's use of real-time priorities and nice. The smpnice modifications to the load balancer were meant to give it the appropriate behaviour, and what we need to fix now is the intra-CPU scheduling. Even if the load balancer isn't yet perfect, perfecting it can be done separately from fixing the scheduler, preferably with as little interdependency as possible. Probably the only contribution to load balancing that the scheduler really needs to make is the calculation of the average weighted load on each of the CPUs (or run queues if there's more than one CPU per runqueue). Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
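As a rough illustration of the weighted-load idea Peter describes -- a per-CPU figure summed from static-priority weights, with no reference to dynamic priority -- a sketch along the lines below is enough to convey it. The weight scale here is made up for the example; it is not the actual smpnice table.

    #include <stddef.h>

    /* Illustrative nice-to-weight mapping: lower nice (higher static
     * priority) counts for more.  The scale is invented for this sketch. */
    static unsigned int nice_to_weight(int nice)
    {
            return (unsigned int)(48 - 2 * nice);   /* nice -20 -> 88, 0 -> 48, 19 -> 10 */
    }

    struct demo_task {
            int nice;                               /* static priority only */
    };

    /* The per-CPU figure the balancer compares: the sum of the static
     * weights of the runnable tasks on that CPU. */
    static unsigned long weighted_cpu_load(const struct demo_task *tasks, size_t nr)
    {
            unsigned long load = 0;
            size_t i;

            for (i = 0; i < nr; i++)
                    load += nice_to_weight(tasks[i].nice);
            return load;
    }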
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 0:54 ` Peter Williams @ 2007-04-17 15:52 ` Chris Friesen 2007-04-17 23:50 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Chris Friesen @ 2007-04-17 15:52 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Chris Friesen wrote: >> Scuse me if I jump in here, but doesn't the load balancer need some >> way to figure out a) when to run, and b) which tasks to pull and where >> to push them? > Yes but both of these are independent of the scheduler discipline in force. It is not clear to me that this is always the case, especially once you mix in things like resource groups. > If > the load balancer manages to keep the weighted (according to static > priority) load and distribution of priorities within the loads on the > CPUs roughly equal and the scheduler does a good job of ensuring > fairness, interactive responsiveness etc. for the tasks within a CPU > then the result will be good system performance within the constraints > set by the sys admins use of real time priorities and nice. Suppose I have a really high priority task running. Another very high priority task wakes up and would normally preempt the first one. However, there happens to be another cpu available. It seems like it would be a win if we moved one of those tasks to the available cpu immediately so they can both run simultaneously. This would seem to require some communication between the scheduler and the load balancer. Certainly the above design could introduce a lot of context switching. But if my goal is a scheduler that minimizes latency (even at the cost of throughput) then that's an acceptable price to pay. Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 15:52 ` Chris Friesen @ 2007-04-17 23:50 ` Peter Williams 2007-04-18 5:43 ` Chris Friesen 0 siblings, 1 reply; 304+ messages in thread From: Peter Williams @ 2007-04-17 23:50 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: >> Chris Friesen wrote: >>> Scuse me if I jump in here, but doesn't the load balancer need some >>> way to figure out a) when to run, and b) which tasks to pull and >>> where to push them? > >> Yes but both of these are independent of the scheduler discipline in >> force. > > It is not clear to me that this is always the case, especially once you > mix in things like resource groups. > >> If >> the load balancer manages to keep the weighted (according to static >> priority) load and distribution of priorities within the loads on the >> CPUs roughly equal and the scheduler does a good job of ensuring >> fairness, interactive responsiveness etc. for the tasks within a CPU >> then the result will be good system performance within the constraints >> set by the sys admins use of real time priorities and nice. > > Suppose I have a really high priority task running. Another very high > priority task wakes up and would normally preempt the first one. > However, there happens to be another cpu available. It seems like it > would be a win if we moved one of those tasks to the available cpu > immediately so they can both run simultaneously. This would seem to > require some communication between the scheduler and the load balancer. Not really; the load balancer can do this on its own, AND the decision should be based on the STATIC priority of the task being woken. > > Certainly the above design could introduce a lot of context switching. > But if my goal is a scheduler that minimizes latency (even at the cost > of throughput) then that's an acceptable price to pay. It would actually probably reduce context switching, as putting the woken task on the best CPU at wake-up means you don't have to move it later on. The wake-up code already does a little bit in this direction when it chooses which CPU to put a newly woken task on, but could do more -- the only real cost would be the cost of looking at more candidate CPUs than it currently does. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
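The wake-up placement Peter has in mind can be pictured roughly as below: scan the CPUs the task is allowed on and prefer the least loaded (ideally idle) one, so the woken task starts where it can run immediately instead of being migrated later. Everything here is a stand-in for illustration, not the kernel's actual wake-up code.

    /* Illustrative wake-up placement over a handful of CPUs. */
    #define NR_DEMO_CPUS 4

    /* Weighted, static-priority based load per CPU (see the earlier sketch). */
    static unsigned long demo_cpu_load[NR_DEMO_CPUS];

    static int demo_select_cpu_on_wakeup(unsigned int allowed_mask)
    {
            unsigned long best_load = ~0UL;
            int cpu, best = -1;

            for (cpu = 0; cpu < NR_DEMO_CPUS; cpu++) {
                    if (!(allowed_mask & (1u << cpu)))
                            continue;
                    /* An idle CPU (load 0) always wins; otherwise take the
                     * least loaded allowed CPU.  The only extra cost is
                     * looking at more candidates than the current code does. */
                    if (demo_cpu_load[cpu] < best_load) {
                            best_load = demo_cpu_load[cpu];
                            best = cpu;
                    }
            }
            return best;
    }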
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 23:50 ` Peter Williams @ 2007-04-18 5:43 ` Chris Friesen 2007-04-18 13:00 ` Peter Williams 0 siblings, 1 reply; 304+ messages in thread From: Chris Friesen @ 2007-04-18 5:43 UTC (permalink / raw) To: Peter Williams Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Peter Williams wrote: > Chris Friesen wrote: >> Suppose I have a really high priority task running. Another very high >> priority task wakes up and would normally preempt the first one. >> However, there happens to be another cpu available. It seems like it >> would be a win if we moved one of those tasks to the available cpu >> immediately so they can both run simultaneously. This would seem to >> require some communication between the scheduler and the load balancer. > > > Not really the load balancer can do this on its own AND the decision > should be based on the STATIC priority of the task being woken. I guess I don't follow. How would the load balancer know that it needs to run? Running on every task wake-up seems expensive. Also, static priority isn't everything. What about the gang-scheduler concept where certain tasks must be scheduled simultaneously on different cpus? What about a resource-group scenario where you have per-cpu resource limits, so that for good latency/fairness you need to force a high priority task to migrate to another cpu once it has consumed the cpu allocation of that group on the current cpu? I can see having a generic load balancer core code, but it seems to me that the scheduler proper needs to have some way of triggering the load balancer to run, and some kind of goodness functions to indicate a) which tasks to move, and b) where to move them. Chris ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 5:43 ` Chris Friesen @ 2007-04-18 13:00 ` Peter Williams 0 siblings, 0 replies; 304+ messages in thread From: Peter Williams @ 2007-04-18 13:00 UTC (permalink / raw) To: Chris Friesen Cc: William Lee Irwin III, Ingo Molnar, Matt Mackall, Con Kolivas, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner Chris Friesen wrote: > Peter Williams wrote: >> Chris Friesen wrote: > >>> Suppose I have a really high priority task running. Another very >>> high priority task wakes up and would normally preempt the first one. >>> However, there happens to be another cpu available. It seems like it >>> would be a win if we moved one of those tasks to the available cpu >>> immediately so they can both run simultaneously. This would seem to >>> require some communication between the scheduler and the load balancer. >> >> >> Not really the load balancer can do this on its own AND the decision >> should be based on the STATIC priority of the task being woken. > > I guess I don't follow. How would the load balancer know that it needs > to run? Running on every task wake-up seems expensive. Also, static > priority isn't everything. What about the gang-scheduler concept where > certain tasks must be scheduled simultaneously on different cpus? What > about a resource-group scenario where you have per-cpu resource limits, > so that for good latency/fairness you need to force a high priority task > to migrate to another cpu once it has consumed the cpu allocation of > that group on the current cpu? > > I can see having a generic load balancer core code, but it seems to me > that the scheduler proper needs to have some way of triggering the load > balancer to run, It doesn't have to be closely coupled with the load balancer to do this. It just needs to know where the trigger is. > and some kind of goodness functions to indicate a) > which tasks to move, and b) where to move them. That's the load balancer's job, and even if you use dynamic priority for load balancing it still wouldn't need to be closely coupled. The load balancer would just need to know how to find a process's dynamic priority. In fact, in the current setup, the load balancer decides how much load needs to be moved based on the static load on the CPUs, but uses dynamic priority (to a large degree) to decide which ones to move. This is due more to computational efficiency considerations than to any deliberate design (I suspect), as the fact that tasks are stored on the runqueue in dynamic priority order makes looking at processes in dynamic priority order the most efficient strategy. Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 304+ messages in thread
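The split Peter describes -- how much to move is computed from static weights, while which tasks to move falls out of walking the queue in the order it is already sorted in -- looks roughly like this; the structures and names are purely illustrative.

    #include <stddef.h>

    struct demo_task {
            unsigned int    weight;         /* derived from static priority */
            int             picked;         /* marked for migration */
    };

    /* 'queue' is assumed to be sorted in dynamic-priority order already,
     * which is why walking it in that order is essentially free.  Tasks
     * are pulled until enough static weight has been moved. */
    static void demo_pick_tasks_to_move(struct demo_task *queue, size_t nr,
                                        unsigned long weight_to_move)
    {
            unsigned long moved = 0;
            size_t i;

            for (i = 0; i < nr && moved < weight_to_move; i++) {
                    queue[i].picked = 1;
                    moved += queue[i].weight;
            }
    }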
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 15:05 ` Ingo Molnar 2007-04-15 20:05 ` Matt Mackall @ 2007-04-16 5:16 ` Con Kolivas 2007-04-16 5:48 ` Gene Heskett 1 sibling, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-16 5:16 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 01:05, Ingo Molnar wrote: > * Con Kolivas <kernel@kolivas.org> wrote: > > 2. Since then I've been thinking/working on a cpu scheduler design > > that takes away all the guesswork out of scheduling and gives very > > predictable, as fair as possible, cpu distribution and latency while > > preserving as solid interactivity as possible within those confines. > > yeah. I think you were right on target with this call. Yay thank goodness :) It's time to fix the damn cpu scheduler once and for all. Everyone uses this; it's no minor driver or $bigsmp or $bigram or $small_embedded_RT_hardware feature. > I've applied the > sched.c change attached at the bottom of this mail to the CFS patch, if > you dont mind. (or feel free to suggest some other text instead.) > * 2003-09-03 Interactivity tuning by Con Kolivas. > * 2004-04-02 Scheduler domains code by Nick Piggin > + * 2007-04-15 Con Kolivas was dead right: fairness matters! :) LOL that's awful. I'd prefer something meaningful like "Work begun on replacing all interactivity tuning with a fair virtual-deadline design by Con Kolivas". While you're at it, it's worth getting rid of a few slightly pointless name changes too. Don't rename SCHED_NORMAL yet again, and don't call all your things sched_fair blah_fair __blah_fair and so on. It means that anything else is by proxy going to be considered unfair. Leave SCHED_NORMAL as is, replace the use of the word _fair with _cfs. I don't really care how many copyright notices you put into our already noisy bootup but it's redundant since there is no choice; we all get the same cpu scheduler. > > 1. I tried in vain some time ago to push a working extensable > > pluggable cpu scheduler framework (based on wli's work) for the linux > > kernel. It was perma-vetoed by Linus and Ingo (and Nick also said he > > didn't like it) as being absolutely the wrong approach and that we > > should never do that. [...] > > i partially replied to that point to Will already, and i'd like to make > it clear again: yes, i rejected plugsched 2-3 years ago (which already > drifted away from wli's original codebase) and i would still reject it > today. No that was just me being flabbergasted by what appeared to be you posting your own plugsched. Note nowhere in the 40 iterations of rsdl->sd did I ask/suggest for plugsched. I said in my first announcement my aim was to create a scheduling policy robust enough for all situations rather than fantastic a lot of the time and awful sometimes. There are plenty of people ready to throw out arguments for plugsched now and I don't have the energy to continue that fight (I never did really). But my question still stands about this comment: > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] What exactly would be the purpose of such a module that governs nothing in particular? 
Since there'll be no pluggable scheduler by your admission it has no control over SCHED_NORMAL, and would require another scheduling policy for it to govern which there is no express way to use at the moment and people tend to just use the default without great effort. > First and foremost, please dont take such rejections too personally - i > had my own share of rejections (and in fact, as i mentioned it in a > previous mail, i had a fair number of complete project throwaways: > 4g:4g, in-kernel Tux, irqrate and many others). I know that they can > hurt and can demoralize, but if i dont like something it's my job to > tell that. Hmm? No that's not what this is about. Remember dynticks which was not originally my code but I tried to bring it up to mainline standard which I fought with for months? You came along with yet another rewrite from scratch and the flaws in the design I was working with were obvious so I instantly bowed down to that and never touched my code again. I didn't ask for credit back then, but obviously brought the requirement for a no idle tick implementation to the table. > My view about plugsched: first please take a look at the latest > plugsched code: > > http://downloads.sourceforge.net/cpuse/plugsched-6.5-for-2.6.20.patch > > 26 files changed, 8951 insertions(+), 1495 deletions(-) > > As an experiment i've removed all the add-on schedulers (both the core > and the include files, only kept the vanilla one) from the plugsched > patch (and the makefile and kconfig complications, etc), to see the > 'infrastructure cost', and it still gave: > > 12 files changed, 1933 insertions(+), 1479 deletions(-) I do not see extra code per-se as being a bad thing. I've heard said a few times before "ever notice how when the correct solution is done it is a lot more code than the quick hack that ultimately fails?". Insert long winded discussion of perfect is the enemy of good here, _but_ I'm not arguing perfect versus good, I'm talking about solid code versus quick fix. Again, none of this comment is directed specifically at this implementation of plugsched, its code quality or intent, but using "extra code is bad" as an argument is not enough. > By your logic Mike should in fact be quite upset about this: if the > new code works out and proves to be useful then it obsoletes a whole lot > of code of him! > > [...] However at one stage I virtually begged for support with my > > attempts and help with the code. Dmitry Adamushko is the only person > > who actually helped me with the code in the interim, while others > > poked sticks at it. Sure the sticks helped at times but the sticks > > always seemed to have their ends kerosene doused and flaming for > > reasons I still don't get. No other help was forthcoming. > Hey, i told this to you as recently as 1 month ago as well: > > http://lkml.org/lkml/2007/3/8/54 > > "cool! I like this even more than i liked your original staircase > scheduler from 2 years ago :)" Email has an awful knack of disguising intent so I took that on face value that you did like the idea :). Above when I said "no other help was forthcoming" all I was hoping for was really simple obvious bugfixes to help me along while I was laid up in bed such as "I like what you're doing but oh your use of memset here is bogus, here is a one line patch". I wasn't specifically expecting you to fix my code; you've got truckloads of things you need to do. It just reminds me that the concept of "release early, release often" doesn't actually work in the kernel. 
What is far more obvious is "release code only when it's so close to perfect that noone can argue against it" since most of the work is done by one person, otherwise someone will come out with a counterpatch that is _complete_ earlier but in all possibility not as good, it's just ready sooner. *NOTE* In no way am I saying your code is not as good as mine; I would have to say exactly the opposite is true pretty much always (<sarcasm>conversely then I doubt if I dropped you in my work environment you'd do as good a job as I do</sarcasm>). At one stage wli (again at my request) put together a quick hack to check for non-preemptible regions within the kernel. From that quick hack you adopted it and turned it into that beautiful latency tracer that is the cornerstone of the -rt tree testing. However, there are many instances I've seen good evolving code in the linux kernel be trumped by not-as-good but already-working alternatives written from scratch with no reference to the original work. This is the NIH (not invented here) mechanism I see happening that is worth objecting to. What you may find most amusing is the very first iterations of RSDL looked _nothing_ like the mainline scheduler. There were all sorts of different structures, mechanisms, one priority array, plans to remove scheduler_tick entirely and so on. Most of those were never made for public consumption. I spent about half a dozen iterations of RSDL removing all of that and making it as close to the mainline design as possible, thus minimising the size of the patch, and to make it readily readable for most people familiar with the scheduler policy code in sched.c (all 5 of them). I should have just said bugger it and started everything from scratch with little to no reference to the original scheduler but found myself obliged to try to do things the minimal code patch size readable difference thingy that was valued in linux kernel development. I think the radically different approach would have been better in the long run. Trying to play ball I ruined it. Either way I've decided for myself, my family, my career and my sanity I'm abandoning SD. I will shelve SD and try to have fond memories of SD as an intellectual prompting exercise only > Ingo -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 5:16 ` Con Kolivas @ 2007-04-16 5:48 ` Gene Heskett 0 siblings, 0 replies; 304+ messages in thread From: Gene Heskett @ 2007-04-16 5:48 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007, Con Kolivas wrote: And I snipped. Sorry, fellas. Con's original submission was, to me, quite an improvement. But I have to say it, and no denigration of your efforts is intended, Con, but you did 'pull the trigger' and get this thing rolling by scratching the itch & drawing attention to an ugly lack of user interactivity that had crept into the 2.6 family. So from me to Con, a tip of the hat, and a deep bow in your direction, thank you. Now, you have done what you aimed to do, so please get well. I've now been through most of an amanda session using Ingo's "CFS" and I have to say that it is another improvement over your 0.40 that is just as obvious as your first patch was against the stock scheduler. No other scheduler yet has allowed the full utilization of the cpu, and maintained user interactivity as well as this one has; my cpu is running about 5 degrees F hotter just from this effect alone. gzip, if the rest of the system is in between tasks, is consistently showing around 95%, but let anything else stick up its hand, like procmail etc, and gzip now dutifully steps aside, dropping into the 40% range until procmail and spamd are done, at which point there is no rest for the wicked and the cpu never gets a chance to cool. There was, just now, a pause of about 2 seconds, while amanda moved a tarball from the holding disk area on /dev/hda to the vtapes disk on /dev/hdd, so that would have been an I/O bound situation. This one, Ingo, even without any other patches (and I think I did see one go by in this thread which I didn't apply), is a definite keeper. Sweet even. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) A word to the wise is enough. -- Miguel de Cervantes ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (9 preceding siblings ...) 2007-04-15 3:27 ` Con Kolivas @ 2007-04-15 12:29 ` Esben Nielsen 2007-04-15 13:04 ` Ingo Molnar 2007-04-15 22:49 ` Ismail Dönmez ` (2 subsequent siblings) 13 siblings, 1 reply; 304+ messages in thread From: Esben Nielsen @ 2007-04-15 12:29 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Fri, 13 Apr 2007, Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [...] I took a brief look at it. Have you tested priority inheritance? As far as I can see rt_mutex_setprio doesn't have much effect on SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task changes scheduler class when boosted in rt_mutex_setprio(). Esben ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 12:29 ` Esben Nielsen @ 2007-04-15 13:04 ` Ingo Molnar 2007-04-16 7:16 ` Esben Nielsen 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-15 13:04 UTC (permalink / raw) To: Esben Nielsen Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Esben Nielsen <nielsen.esben@googlemail.com> wrote: > I took a brief look at it. Have you tested priority inheritance? yeah, you are right, it's broken at the moment, i'll fix it. But the good news is that i think PI could become cleaner via scheduling classes. > As far as I can see rt_mutex_setprio doesn't have much effect on > SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task > change scheduler class when boosted in rt_mutex_setprio(). i think via scheduling classes we dont have to do the p->policy and p->prio based gymnastics anymore, we can just have a clean look at p->sched_class and stack the original scheduling class into p->real_sched_class. It would probably also make sense to 'privatize' p->prio into the scheduling class. That way PI would be a pure property of sched_rt, and the PI scheduler would be driven purely by p->rt_priority, not by p->prio. That way all the normal_prio() kind of complications and interactions with SCHED_OTHER/SCHED_FAIR would be eliminated as well. What do you think? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
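The stacking Ingo sketches -- remember the task's own class in a real_sched_class field, run it under the RT class while boosted, and restore it afterwards -- can be pictured like this. It is only a sketch of the idea; the field and helper names are illustrative, not what rt_mutex_setprio actually ended up doing.

    struct demo_sched_class { const char *name; };

    static const struct demo_sched_class demo_rt_sched_class   = { "rt" };
    static const struct demo_sched_class demo_fair_sched_class = { "fair" };

    struct demo_task {
            const struct demo_sched_class *sched_class;
            const struct demo_sched_class *real_sched_class; /* saved across a boost */
            int rt_priority;        /* PI driven purely by this, per the idea above */
    };

    static void demo_pi_boost(struct demo_task *p, int boosted_rt_prio)
    {
            if (p->sched_class != &demo_rt_sched_class) {
                    p->real_sched_class = p->sched_class;
                    p->sched_class = &demo_rt_sched_class;
            }
            p->rt_priority = boosted_rt_prio;
    }

    static void demo_pi_unboost(struct demo_task *p)
    {
            p->sched_class = p->real_sched_class;
    }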
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 13:04 ` Ingo Molnar @ 2007-04-16 7:16 ` Esben Nielsen 0 siblings, 0 replies; 304+ messages in thread From: Esben Nielsen @ 2007-04-16 7:16 UTC (permalink / raw) To: Ingo Molnar Cc: Esben Nielsen, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Sun, 15 Apr 2007, Ingo Molnar wrote: > > * Esben Nielsen <nielsen.esben@googlemail.com> wrote: > >> I took a brief look at it. Have you tested priority inheritance? > > yeah, you are right, it's broken at the moment, i'll fix it. But the > good news is that i think PI could become cleaner via scheduling > classes. > >> As far as I can see rt_mutex_setprio doesn't have much effect on >> SCHED_FAIR/SCHED_BATCH. I am looking for a place where such a task >> change scheduler class when boosted in rt_mutex_setprio(). > > i think via scheduling classes we dont have to do the p->policy and > p->prio based gymnastics anymore, we can just have a clean look at > p->sched_class and stack the original scheduling class into > p->real_sched_class. It would probably also make sense to 'privatize' > p->prio into the scheduling class. That way PI would be a pure property > of sched_rt, and the PI scheduler would be driven purely by > p->rt_priority, not by p->prio. That way all the normal_prio() kind of > complications and interactions with SCHED_OTHER/SCHED_FAIR would be > eliminated as well. What do you think? > Now I have not read your patch in detail. But I agree it would be nice to have it more "OO" and remove cross references between schedulers. But first one should consider whether PI between SCHED_FAIR tasks is useful or not. Does PI among dynamic priorities make sense at all? I think it does: On heavily loaded systems, where a nice 19 task might not get the CPU for very long, a nice -20 task can be priority inverted for a very long time. But I see no need for it to take the dynamic part of the effective priorities into account. The current/old solution of mapping the static nice values into a global priority index which can incorporate the two scheduler classes is probably good enough - it just has to be "switched on" again :-) But what about other scheduler classes which some people want to add in the future? What about having a "cleaner design"? My thought was to generalize the concept of 'priority' to be an object (a struct prio) to be interpreted with help from a scheduler class instead of a globally interpreted integer.

int compare_prio(struct prio *a, struct prio *b)
{
	if (a->sched_class->class_prio < b->sched_class->class_prio)
		return -1;
	if (a->sched_class->class_prio > b->sched_class->class_prio)
		return +1;
	return a->sched_class->compare_prio(a, b);
}

Problem 1: Performance. Problem 2: Operations on a plist with these generalized priorities are not bounded because the number of different priorities is not bounded. Problem 2 could be solved by using a combined plist (for rt priorities) and rbtree (for fair priorities) - making operations logarithmic just as in the fair scheduler itself. But that would take more memory for every rtmutex. I conclude that is too complicated and go on to the obvious idea: Use a global priority index where each scheduler class gets its own range (rt: 0-99, fair 100-139 :-). Let the scheduler class have a function returning it instead of reading it directly from task_struct such that new scheduler classes can return their own numbers.
Esben > Ingo > ^ permalink raw reply [flat|nested] 304+ messages in thread
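The "obvious idea" Esben lands on -- each class exposing a function that maps its tasks into a disjoint range of one bounded global index -- can be sketched as below, using the ranges from his example (rt: 0-99, fair: 100-139). The code itself is illustrative only, not a patch.

    /* Sketch of a per-class global priority index: rt tasks map to 0..99,
     * fair tasks to 100..139, and a lower index always wins.  This keeps
     * the PI/rtmutex comparison a single bounded integer compare. */
    enum demo_class { DEMO_CLASS_RT, DEMO_CLASS_FAIR };

    struct demo_task {
            enum demo_class sched_class;
            int rt_priority;        /* 0..99, higher means more important */
            int nice;               /* -20..19 for the fair class */
    };

    static int demo_global_prio_index(const struct demo_task *p)
    {
            if (p->sched_class == DEMO_CLASS_RT)
                    return 99 - p->rt_priority;
            return 100 + (p->nice + 20);
    }

    static int demo_compare_prio(const struct demo_task *a, const struct demo_task *b)
    {
            return demo_global_prio_index(a) - demo_global_prio_index(b);
    }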
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (10 preceding siblings ...) 2007-04-15 12:29 ` Esben Nielsen @ 2007-04-15 22:49 ` Ismail Dönmez 2007-04-15 23:23 ` Arjan van de Ven 2007-04-16 11:58 ` Ingo Molnar 2007-04-16 22:00 ` Andi Kleen 2007-04-17 7:56 ` Andy Whitcroft 13 siblings, 2 replies; 304+ messages in thread From: Ismail Dönmez @ 2007-04-15 22:49 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 573 bytes --] Hi, On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch Tested this on top of Linus' GIT tree, but the system gets very unresponsive during high disk i/o using ext3 as the filesystem; even writing a 300mb file to a usb disk (an iPod, actually) has the same effect. Regards, ismail [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:49 ` Ismail Dönmez @ 2007-04-15 23:23 ` Arjan van de Ven 2007-04-15 23:33 ` Ismail Dönmez 0 siblings, 1 reply; 304+ messages in thread From: Arjan van de Ven @ 2007-04-15 23:23 UTC (permalink / raw) To: Ismail Dönmez Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote: > Hi, > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > [CFS] > > > > i'm pleased to announce the first release of the "Modular Scheduler Core > > and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > Tested this on top of Linus' GIT tree but the system gets very unresponsive > during high disk i/o using ext3 as filesystem but even writing a 300mb file > to a usb disk (iPod actually) has the same affect. just to make sure; this exact same workload but with the stock scheduler does not have this effect? if so, then it could well be that the scheduler is too fair for its own good (being really fair inevitably ends up not batching as much as one should, and batching is needed to get any kind of decent performance out of disks nowadays) -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 23:23 ` Arjan van de Ven @ 2007-04-15 23:33 ` Ismail Dönmez 0 siblings, 0 replies; 304+ messages in thread From: Ismail Dönmez @ 2007-04-15 23:33 UTC (permalink / raw) To: Arjan van de Ven Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Thomas Gleixner On Monday 16 April 2007 02:23:08 Arjan van de Ven wrote: > On Mon, 2007-04-16 at 01:49 +0300, Ismail Dönmez wrote: > > Hi, > > > > On Friday 13 April 2007 23:21:00 Ingo Molnar wrote: > > > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler > > > [CFS] > > > > > > i'm pleased to announce the first release of the "Modular Scheduler > > > Core and Completely Fair Scheduler [CFS]" patchset: > > > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > > > Tested this on top of Linus' GIT tree but the system gets very > > unresponsive during high disk i/o using ext3 as filesystem but even > > writing a 300mb file to a usb disk (iPod actually) has the same affect. > > just to make sure; this exact same workload but with the stock scheduler > does not have this effect? > > if so, then it could well be that the scheduler is too fair for it's own > good (being really fair inevitably ends up not batching as much as one > should, and batching is needed to get any kind of decent performance out > of disks nowadays) Tried with make install in kdepim (which made system sluggish with CFS) and the system is just fine (using CFQ). Regards, ismail ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-15 22:49 ` Ismail Dönmez 2007-04-15 23:23 ` Arjan van de Ven @ 2007-04-16 11:58 ` Ingo Molnar 2007-04-16 12:02 ` Ismail Dönmez 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 11:58 UTC (permalink / raw) To: Ismail Dönmez Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner * Ismail Dönmez <ismail@pardus.org.tr> wrote: > Tested this on top of Linus' GIT tree but the system gets very > unresponsive during high disk i/o using ext3 as filesystem but even > writing a 300mb file to a usb disk (iPod actually) has the same > affect. hm. Is this an SMP system+kernel by any chance? Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 11:58 ` Ingo Molnar @ 2007-04-16 12:02 ` Ismail Dönmez 0 siblings, 0 replies; 304+ messages in thread From: Ismail Dönmez @ 2007-04-16 12:02 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner On Monday 16 April 2007 14:58:54 Ingo Molnar wrote: > * Ismail Dönmez <ismail@pardus.org.tr> wrote: > > Tested this on top of Linus' GIT tree but the system gets very > > unresponsive during high disk i/o using ext3 as filesystem but even > > writing a 300mb file to a usb disk (iPod actually) has the same > > affect. > > hm. Is this an SMP system+kernel by any chance? Nope, both the kernel and the system are UP. Regards, ismail ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (11 preceding siblings ...) 2007-04-15 22:49 ` Ismail Dönmez @ 2007-04-16 22:00 ` Andi Kleen 2007-04-16 21:05 ` Ingo Molnar 2007-04-17 7:56 ` Andy Whitcroft 13 siblings, 1 reply; 304+ messages in thread From: Andi Kleen @ 2007-04-16 22:00 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar <mingo@elte.hu> writes: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch I would suggest to drop the tsc.c change. The "small errors" can be really large on some systems and you can also see large backward jumps. I have a proper (but complicated) solution pending in ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/sched-clock-share BTW with all this CPU time measurement it would be really nice to report it to the user too. It seems a bit bizarre that the scheduler keeps track of ns, but top only knows jiffies with large sampling errors. -Andi ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 22:00 ` Andi Kleen @ 2007-04-16 21:05 ` Ingo Molnar 2007-04-16 21:21 ` Andi Kleen 0 siblings, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-16 21:05 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel * Andi Kleen <andi@firstfloor.org> wrote: > > i'm pleased to announce the first release of the "Modular Scheduler > > Core and Completely Fair Scheduler [CFS]" patchset: > > > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > I would suggest to drop the tsc.c change. The "small errors" can be > really large on some systems and you can also see large backward > jumps. actually, i designed the CFS code assuming a per-CPU TSC (with no global synchronization), not assuming any globally sync TSC. In fact i wrote it on such systems: a CoreDuo2 box that stops the TSC in C3 and where the different cores have wildly different TSC values, and a dual-core Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock() change for now. > BTW with all this CPU time measurement it would be really nice to > report it to the user too. It seems a bit bizarre that the scheduler > keeps track of ns, but top only knows jiffies with large sampling > errors. yeah - i'll fix that too if someone doesnt beat me at it. Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-16 21:05 ` Ingo Molnar @ 2007-04-16 21:21 ` Andi Kleen 0 siblings, 0 replies; 304+ messages in thread From: Andi Kleen @ 2007-04-16 21:21 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andi Kleen, linux-kernel > actually, i designed the CFS code assuming a per-CPU TSC (with no global > synchronization), not assuming any globally sync TSC. In fact i wrote it That already worked in the old scheduler (just in a hackish way) > on such systems: a CoreDuo2 box that has stops the TSC in C3 and the > different cores have wildly different TSC values and a dual-core > Athlon64 that quickly drifts its TSC. So i'll keep the sched_clock() > change for now. The problem is not CPU synchronized TSC, but TSC with varying frequency on a single CPU like on the A64. The old implementation can lose really badly on that because it mixes measurements at different frequencies together without individual scaling. The error gets worse the longer the system runs. >> BTW with all this CPU time measurement it would be really nice to >> report it to the user too. It seems a bit bizarre that the scheduler >> keeps track of ns, but top only knows jiffies with large sampling >> errors. > yeah - i'll fix that too if someone doesnt beat me at it. I've been pondering for some time if doubling the NMI watchdog as a ring 0 counter for this is worth it. So far I'm still undecided (and it's moot now since it's disabled by default :/) -Andi ^ permalink raw reply [flat|nested] 304+ messages in thread
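The individual scaling Andi is talking about -- converting each TSC delta to nanoseconds with the frequency that was in effect when the delta was measured, instead of mixing measurements taken at different frequencies -- boils down to something like the sketch below. The mult/shift style mirrors common cyc2ns practice, but the struct and function names are invented for illustration; this is not the actual pending patch.

    #include <stdint.h>

    /* Illustrative per-CPU sched_clock state: ns accumulates already-scaled
     * time, so a later frequency change can never distort it retroactively. */
    struct demo_sched_clock {
            uint64_t last_tsc;      /* raw TSC at the last update */
            uint64_t ns;            /* accumulated nanoseconds */
            uint32_t mult;          /* ns = (delta * mult) >> shift */
            uint32_t shift;
    };

    static uint64_t demo_sched_clock_update(struct demo_sched_clock *c, uint64_t tsc_now)
    {
            uint64_t delta = tsc_now - c->last_tsc;

            c->last_tsc = tsc_now;
            c->ns += (delta * c->mult) >> c->shift;
            return c->ns;
    }

    /* Called from a cpufreq notifier: only deltas measured from now on use
     * the new scale factor.  tsc_khz cycles happen per millisecond, and a
     * millisecond is 1000000 ns, hence the factor below. */
    static void demo_sched_clock_set_freq(struct demo_sched_clock *c, uint64_t tsc_khz)
    {
            c->shift = 10;
            c->mult  = (uint32_t)((1000000ULL << c->shift) / tsc_khz);
    }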
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-13 20:21 Ingo Molnar ` (12 preceding siblings ...) 2007-04-16 22:00 ` Andi Kleen @ 2007-04-17 7:56 ` Andy Whitcroft 2007-04-17 9:32 ` Nick Piggin 2007-04-18 10:22 ` Ingo Molnar 13 siblings, 2 replies; 304+ messages in thread From: Andy Whitcroft @ 2007-04-17 7:56 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan Ingo Molnar wrote: > [announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] > > i'm pleased to announce the first release of the "Modular Scheduler Core > and Completely Fair Scheduler [CFS]" patchset: > > http://redhat.com/~mingo/cfs-scheduler/sched-modular+cfs.patch > > This project is a complete rewrite of the Linux task scheduler. My goal > is to address various feature requests and to fix deficiencies in the > vanilla scheduler that were suggested/found in the past few years, both > for desktop scheduling and for server scheduling workloads. > > [ QuickStart: apply the patch to v2.6.21-rc6, recompile, reboot. The > new scheduler will be active by default and all tasks will default > to the new SCHED_FAIR interactive scheduling class. ] > > Highlights are: > > - the introduction of Scheduling Classes: an extensible hierarchy of > scheduler modules. These modules encapsulate scheduling policy > details and are handled by the scheduler core without the core > code assuming about them too much. > > - sched_fair.c implements the 'CFS desktop scheduler': it is a > replacement for the vanilla scheduler's SCHED_OTHER interactivity > code. > > i'd like to give credit to Con Kolivas for the general approach here: > he has proven via RSDL/SD that 'fair scheduling' is possible and that > it results in better desktop scheduling. Kudos Con! > > The CFS patch uses a completely different approach and implementation > from RSDL/SD. My goal was to make CFS's interactivity quality exceed > that of RSDL/SD, which is a high standard to meet :-) Testing > feedback is welcome to decide this one way or another. [ and, in any > case, all of SD's logic could be added via a kernel/sched_sd.c module > as well, if Con is interested in such an approach. ] > > CFS's design is quite radical: it does not use runqueues, it uses a > time-ordered rbtree to build a 'timeline' of future task execution, > and thus has no 'array switch' artifacts (by which both the vanilla > scheduler and RSDL/SD are affected). > > CFS uses nanosecond granularity accounting and does not rely on any > jiffies or other HZ detail. Thus the CFS scheduler has no notion of > 'timeslices' and has no heuristics whatsoever. There is only one > central tunable: > > /proc/sys/kernel/sched_granularity_ns > > which can be used to tune the scheduler from 'desktop' (low > latencies) to 'server' (good batching) workloads. It defaults to a > setting suitable for desktop workloads. SCHED_BATCH is handled by the > CFS scheduler module too. > > due to its design, the CFS scheduler is not prone to any of the > 'attacks' that exist today against the heuristics of the stock > scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all > work fine and do not impact interactivity and produce the expected > behavior. 
> > the CFS scheduler has a much stronger handling of nice levels and > SCHED_BATCH: both types of workloads should be isolated much more > agressively than under the vanilla scheduler. > > ( another rdetail: due to nanosec accounting and timeline sorting, > sched_yield() support is very simple under CFS, and in fact under > CFS sched_yield() behaves much better than under any other > scheduler i have tested so far. ) > > - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler > way than the vanilla scheduler does. It uses 100 runqueues (for all > 100 RT priority levels, instead of 140 in the vanilla scheduler) > and it needs no expired array. > > - reworked/sanitized SMP load-balancing: the runqueue-walking > assumptions are gone from the load-balancing code now, and > iterators of the scheduling modules are used. The balancing code got > quite a bit simpler as a result. > > the core scheduler got smaller by more than 700 lines: > > kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------ > 1 file changed, 372 insertions(+), 1082 deletions(-) > > and even adding all the scheduling modules, the total size impact is > relatively small: > > 18 files changed, 1454 insertions(+), 1133 deletions(-) > > most of the increase is due to extensive comments. The kernel size > impact is in fact a small negative: > > text data bss dec hex filename > 23366 4001 24 27391 6aff kernel/sched.o.vanilla > 24159 2705 56 26920 6928 kernel/sched.o.CFS > > (this is mainly due to the benefit of getting rid of the expired array > and its data structure overhead.) > > thanks go to Thomas Gleixner and Arjan van de Ven for review of this > patchset. > > as usual, any sort of feedback, bugreports, fixes and suggestions are > more than welcome, Pushed this through the test.kernel.org and nothing new blew up. Notably the kernbench figures are within expectations even on the bigger numa systems, commonly badly affected by balancing problems in the schedular. I see there is a second one out, I'll push that one through too. -apw ^ permalink raw reply [flat|nested] 304+ messages in thread
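For readers skimming the quoted announcement, the "time-ordered rbtree" timeline amounts to the usual kernel rbtree insertion pattern, roughly as sketched below: the leftmost node is always the next task due to run. The struct and key names here are made up for illustration; they are not CFS's actual fields.

    #include <linux/rbtree.h>
    #include <linux/types.h>

    /* Illustrative timeline entry, keyed by a nanosecond-based value. */
    struct demo_timeline_node {
            struct rb_node  run_node;
            u64             key_ns;
    };

    static void demo_timeline_enqueue(struct rb_root *root, struct demo_timeline_node *se)
    {
            struct rb_node **link = &root->rb_node;
            struct rb_node *parent = NULL;

            while (*link) {
                    struct demo_timeline_node *entry;

                    parent = *link;
                    entry = rb_entry(parent, struct demo_timeline_node, run_node);
                    if (se->key_ns < entry->key_ns)
                            link = &parent->rb_left;
                    else
                            link = &parent->rb_right;
            }
            rb_link_node(&se->run_node, parent, link);
            rb_insert_color(&se->run_node, root);
    }

    /* Picking the next task to run is just the leftmost node. */
    static struct demo_timeline_node *demo_timeline_first(struct rb_root *root)
    {
            struct rb_node *left = rb_first(root);

            return left ? rb_entry(left, struct demo_timeline_node, run_node) : NULL;
    }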
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Andy Whitcroft @ 2007-04-17 9:32 ` Nick Piggin 2007-04-17 9:59 ` Ingo Molnar 2007-04-18 10:22 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-17 9:32 UTC (permalink / raw) To: Andy Whitcroft Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 08:56:27AM +0100, Andy Whitcroft wrote: > > > > as usual, any sort of feedback, bugreports, fixes and suggestions are > > more than welcome, > > Pushed this through the test.kernel.org and nothing new blew up. > Notably the kernbench figures are within expectations even on the bigger > numa systems, commonly badly affected by balancing problems in the > scheduler. > > I see there is a second one out, I'll push that one through too. Well I just sent some feedback on cfs-v2, but realised it went off-list, so I'll resend here because others may find it interesting too. Sorry about jamming it in here, but it is relevant to performance... Anyway, roughly in the context of good cfs-v2 interactivity, I wrote: Well I'm not too surprised. I am disappointed that it uses such small timeslices (or whatever they are called) as the default. Using small timeslices is actually a pretty easy way to ensure everything stays smooth even under load, but is bad for efficiency. Sure you can say you'll have desktop and server tunings, but... With nicksched I'm testing a default timeslice of *300ms* even on the desktop, whereas Ingo's seems to be effectively 3ms :P So if you compare default tunings, it isn't exactly fair! Kbuild times on a 2x Xeon: 2.6.21-rc7 508.87user 32.47system 2:17.82elapsed 392%CPU 509.05user 32.25system 2:17.84elapsed 392%CPU 508.75user 32.26system 2:17.83elapsed 392%CPU 508.63user 32.17system 2:17.88elapsed 392%CPU 509.01user 32.26system 2:17.90elapsed 392%CPU 509.08user 32.20system 2:17.95elapsed 392%CPU 2.6.21-rc7-cfs-v2 534.80user 30.92system 2:23.64elapsed 393%CPU 534.75user 31.01system 2:23.70elapsed 393%CPU 534.66user 31.07system 2:23.76elapsed 393%CPU 534.56user 30.91system 2:23.76elapsed 393%CPU 534.66user 31.07system 2:23.67elapsed 393%CPU 535.43user 30.62system 2:23.72elapsed 393%CPU 2.6.21-rc7-nicksched 505.60user 32.31system 2:17.91elapsed 390%CPU 506.55user 32.42system 2:17.66elapsed 391%CPU 506.41user 32.30system 2:17.85elapsed 390%CPU 506.48user 32.36system 2:17.77elapsed 391%CPU 506.10user 32.40system 2:17.81elapsed 390%CPU 506.69user 32.16system 2:17.78elapsed 391%CPU ^ permalink raw reply [flat|nested] 304+ messages in thread
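For anyone wanting to reproduce this kind of comparison, a minimal sketch of the timing loop, assuming GNU time and an already-configured 2.6.21-rc7 tree; the exact invocation Nick used is not shown in the thread:

    # Hedged sketch, not Nick's actual script: repeat a timed make -j8 run.
    # GNU /usr/bin/time's default report is the "Xuser Ysystem Zelapsed %CPU"
    # format quoted above; -a -o collects the reports in one file.
    cd linux-2.6.21-rc7
    make -j8 > /dev/null 2>&1                        # warm-up build to prime caches
    for i in 1 2 3 4 5 6; do
        make clean > /dev/null 2>&1
        /usr/bin/time -a -o kbuild.times make -j8 > /dev/null 2>&1
    done
    cat kbuild.times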
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:32 ` Nick Piggin @ 2007-04-17 9:59 ` Ingo Molnar 2007-04-17 11:11 ` Nick Piggin 2007-04-18 8:55 ` Nick Piggin 0 siblings, 2 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-17 9:59 UTC (permalink / raw) To: Nick Piggin Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Nick Piggin <npiggin@suse.de> wrote: > 2.6.21-rc7-cfs-v2 > 534.80user 30.92system 2:23.64elapsed 393%CPU > 534.75user 31.01system 2:23.70elapsed 393%CPU > 534.66user 31.07system 2:23.76elapsed 393%CPU > 534.56user 30.91system 2:23.76elapsed 393%CPU > 534.66user 31.07system 2:23.67elapsed 393%CPU > 535.43user 30.62system 2:23.72elapsed 393%CPU Thanks for testing this! Could you please try this also with: echo 100000000 > /proc/sys/kernel/sched_granularity on the same system, so that we can get a complete set of numbers? Just to make sure that lowering the preemption frequency indeed has the expected result of moving kernbench numbers back to mainline levels. (if not then that would indicate some CFS buglet) could you maybe even try a more extreme setting of: echo 500000000 > /proc/sys/kernel/sched_granularity for kicks? This would allow us to see how much kernbench we lose due to preemption granularity. Thanks! Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
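A sketch of the sweep Ingo is asking for, reusing the timing loop above and assuming the sysctl name and nanosecond units from the echo commands he quotes:

    # Hedged sketch: retime the same build at several preemption granularities.
    for g in 1000000 100000000 500000000; do
        echo $g > /proc/sys/kernel/sched_granularity    # value in nanoseconds
        make clean > /dev/null 2>&1
        /usr/bin/time -a -o kbuild.times.$g make -j8 > /dev/null 2>&1
    done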
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:59 ` Ingo Molnar @ 2007-04-17 11:11 ` Nick Piggin 2007-04-18 8:55 ` Nick Piggin 1 sibling, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-17 11:11 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > Thanks for testing this! Could you please try this also with: > > echo 100000000 > /proc/sys/kernel/sched_granularity > > on the same system, so that we can get a complete set of numbers? Just > to make sure that lowering the preemption frequency indeed has the > expected result of moving kernbench numbers back to mainline levels. (if > not then that would indicate some CFS buglet) > > could you maybe even try a more extreme setting of: > > echo 500000000 > /proc/sys/kernel/sched_granularity > > for kicks? This would allow us to see how much kernbench we lose due to > preemption granularity. Thanks! Yeah but I just powered down the test-box, so I'll have to get onto that tomorrow. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 9:59 ` Ingo Molnar 2007-04-17 11:11 ` Nick Piggin @ 2007-04-18 8:55 ` Nick Piggin 2007-04-18 9:33 ` Con Kolivas 2007-04-18 9:53 ` Ingo Molnar 1 sibling, 2 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-18 8:55 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > Thanks for testing this! Could you please try this also with: > > echo 100000000 > /proc/sys/kernel/sched_granularity 507.68user 31.87system 2:18.05elapsed 390%CPU 507.99user 31.93system 2:18.09elapsed 390%CPU 507.46user 31.78system 2:18.03elapsed 390%CPU 507.68user 31.93system 2:18.11elapsed 390%CPU 507.63user 31.98system 2:18.01elapsed 390%CPU 507.83user 31.94system 2:18.28elapsed 390%CPU > could you maybe even try a more extreme setting of: > > echo 500000000 > /proc/sys/kernel/sched_granularity 504.87user 32.13system 2:18.03elapsed 389%CPU 505.94user 32.29system 2:17.87elapsed 390%CPU 506.10user 31.90system 2:17.96elapsed 389%CPU 505.02user 32.02system 2:17.96elapsed 389%CPU 506.69user 31.96system 2:17.82elapsed 390%CPU 505.70user 31.84system 2:17.90elapsed 389%CPU Again, for comparison 2.6.21-rc7 mainline: 508.87user 32.47system 2:17.82elapsed 392%CPU 509.05user 32.25system 2:17.84elapsed 392%CPU 508.75user 32.26system 2:17.83elapsed 392%CPU 508.63user 32.17system 2:17.88elapsed 392%CPU 509.01user 32.26system 2:17.90elapsed 392%CPU 509.08user 32.20system 2:17.95elapsed 392%CPU So looking at elapsed time, a granularity of 100ms is just behind the mainline score. However it is using slightly less user time and slightly more idle time, which indicates that balancing might have got a bit less aggressive. But anyway, it conclusively shows the efficiency impact of such tiny timeslices. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 8:55 ` Nick Piggin @ 2007-04-18 9:33 ` Con Kolivas 2007-04-18 12:14 ` Nick Piggin 2007-04-18 9:53 ` Ingo Molnar 1 sibling, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-18 9:33 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > On Tue, Apr 17, 2007 at 11:59:00AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > 2.6.21-rc7-cfs-v2 > > > 534.80user 30.92system 2:23.64elapsed 393%CPU > > > 534.75user 31.01system 2:23.70elapsed 393%CPU > > > 534.66user 31.07system 2:23.76elapsed 393%CPU > > > 534.56user 30.91system 2:23.76elapsed 393%CPU > > > 534.66user 31.07system 2:23.67elapsed 393%CPU > > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > > > Thanks for testing this! Could you please try this also with: > > > > echo 100000000 > /proc/sys/kernel/sched_granularity > > 507.68user 31.87system 2:18.05elapsed 390%CPU > 507.99user 31.93system 2:18.09elapsed 390%CPU > 507.46user 31.78system 2:18.03elapsed 390%CPU > 507.68user 31.93system 2:18.11elapsed 390%CPU > 507.63user 31.98system 2:18.01elapsed 390%CPU > 507.83user 31.94system 2:18.28elapsed 390%CPU > > > could you maybe even try a more extreme setting of: > > > > echo 500000000 > /proc/sys/kernel/sched_granularity > > 504.87user 32.13system 2:18.03elapsed 389%CPU > 505.94user 32.29system 2:17.87elapsed 390%CPU > 506.10user 31.90system 2:17.96elapsed 389%CPU > 505.02user 32.02system 2:17.96elapsed 389%CPU > 506.69user 31.96system 2:17.82elapsed 390%CPU > 505.70user 31.84system 2:17.90elapsed 389%CPU > > > Again, for comparison 2.6.21-rc7 mainline: > > 508.87user 32.47system 2:17.82elapsed 392%CPU > 509.05user 32.25system 2:17.84elapsed 392%CPU > 508.75user 32.26system 2:17.83elapsed 392%CPU > 508.63user 32.17system 2:17.88elapsed 392%CPU > 509.01user 32.26system 2:17.90elapsed 392%CPU > 509.08user 32.20system 2:17.95elapsed 392%CPU > > So looking at elapsed time, a granularity of 100ms is just behind the > mainline score. However it is using slightly less user time and > slightly more idle time, which indicates that balancing might have got > a bit less aggressive. > > But anyway, it conclusively shows the efficiency impact of such tiny > timeslices. See test.kernel.org for how (the now defunct) SD was performing on kernbench. It had low latency _and_ equivalent throughput to mainline. Set the standard appropriately on both counts please. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 9:33 ` Con Kolivas @ 2007-04-18 12:14 ` Nick Piggin 2007-04-18 12:33 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 12:14 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > Again, for comparison 2.6.21-rc7 mainline: > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > mainline score. However it is using slightly less user time and > > slightly more idle time, which indicates that balancing might have got > > a bit less aggressive. > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > timeslices. > > See test.kernel.org for how (the now defunct) SD was performing on kernbench. > It had low latency _and_ equivalent throughput to mainline. Set the standard > appropriately on both counts please. I can give it a run. Got an updated patch against -rc7? ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:14 ` Nick Piggin @ 2007-04-18 12:33 ` Con Kolivas 2007-04-18 21:49 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-18 12:33 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:14, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > > Again, for comparison 2.6.21-rc7 mainline: > > > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > mainline score. However it is using slightly less user time and > > > slightly more idle time, which indicates that balancing might have got > > > a bit less aggressive. > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > timeslices. > > > > See test.kernel.org for how (the now defunct) SD was performing on > > kernbench. It had low latency _and_ equivalent throughput to mainline. > > Set the standard appropriately on both counts please. > > I can give it a run. Got an updated patch against -rc7? I said I wasn't pursuing it but since you're offering, the rc6 patch should apply ok. http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:33 ` Con Kolivas @ 2007-04-18 21:49 ` Con Kolivas 0 siblings, 0 replies; 304+ messages in thread From: Con Kolivas @ 2007-04-18 21:49 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:33, Con Kolivas wrote: > On Wednesday 18 April 2007 22:14, Nick Piggin wrote: > > On Wed, Apr 18, 2007 at 07:33:56PM +1000, Con Kolivas wrote: > > > On Wednesday 18 April 2007 18:55, Nick Piggin wrote: > > > > Again, for comparison 2.6.21-rc7 mainline: > > > > > > > > 508.87user 32.47system 2:17.82elapsed 392%CPU > > > > 509.05user 32.25system 2:17.84elapsed 392%CPU > > > > 508.75user 32.26system 2:17.83elapsed 392%CPU > > > > 508.63user 32.17system 2:17.88elapsed 392%CPU > > > > 509.01user 32.26system 2:17.90elapsed 392%CPU > > > > 509.08user 32.20system 2:17.95elapsed 392%CPU > > > > > > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > > mainline score. However it is using slightly less user time and > > > > slightly more idle time, which indicates that balancing might have > > > > got a bit less aggressive. > > > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > > timeslices. > > > > > > See test.kernel.org for how (the now defunct) SD was performing on > > > kernbench. It had low latency _and_ equivalent throughput to mainline. > > > Set the standard appropriately on both counts please. > > > > I can give it a run. Got an updated patch against -rc7? > > I said I wasn't pursuing it but since you're offering, the rc6 patch should > apply ok. > > http://ck.kolivas.org/patches/staircase-deadline/2.6.21-rc6-sd-0.40.patch Oh and if you go to the effort of trying you may as well try the timeslice tweak to see what effect it has on SD as well. /proc/sys/kernel/rr_interval 100 is the highest. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
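For the SD side of the comparison, the knob Con points at could be exercised the same way; a hedged sketch, assuming rr_interval takes milliseconds as in the SD patches:

    cat /proc/sys/kernel/rr_interval            # current SD timeslice, in ms
    echo 100 > /proc/sys/kernel/rr_interval     # the maximum value Con mentions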
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 8:55 ` Nick Piggin 2007-04-18 9:33 ` Con Kolivas @ 2007-04-18 9:53 ` Ingo Molnar 2007-04-18 12:13 ` Nick Piggin 1 sibling, 1 reply; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 9:53 UTC (permalink / raw) To: Nick Piggin Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Nick Piggin <npiggin@suse.de> wrote: > > > 535.43user 30.62system 2:23.72elapsed 393%CPU > > > > Thanks for testing this! Could you please try this also with: > > > > echo 100000000 > /proc/sys/kernel/sched_granularity > > 507.68user 31.87system 2:18.05elapsed 390%CPU > 507.99user 31.93system 2:18.09elapsed 390%CPU > > could you maybe even try a more extreme setting of: > > > > echo 500000000 > /proc/sys/kernel/sched_granularity > 506.69user 31.96system 2:17.82elapsed 390%CPU > 505.70user 31.84system 2:17.90elapsed 389%CPU > Again, for comparison 2.6.21-rc7 mainline: > > 508.87user 32.47system 2:17.82elapsed 392%CPU > 509.05user 32.25system 2:17.84elapsed 392%CPU thanks for testing this! > So looking at elapsed time, a granularity of 100ms is just behind the > mainline score. However it is using slightly less user time and > slightly more idle time, which indicates that balancing might have got > a bit less aggressive. > > But anyway, it conclusively shows the efficiency impact of such tiny > timeslices. yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is not unexpected when going to really frequent preemption. Clearly, the default preemption granularity needs to be tuned up. I think you said you measured ~3msec average preemption rate per CPU? That would suggest the average cache-trashing cost was 120 usecs per every 3 msec window. Taking that as a ballpark figure, to get the difference back into the noise range we'd have to either use ~5 msec: echo 5000000 > /proc/sys/kernel/sched_granularity or 15 msec: echo 15000000 > /proc/sys/kernel/sched_granularity (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i correctly understood your 3msec value. I'd have to know your kernbench workload's approximate 'steady state' context-switch rate to do a more accurate calculation.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
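One way to get the 'steady state' context-switch rate Ingo asks for is to sample the machine-wide ctxt counter in /proc/stat during the run; a hedged sketch (divide the result by the number of busy runqueues for a per-CPU figure):

    # Sample the cumulative context-switch count over a 10-second window
    # while the kernbench run is in its steady state.
    c0=$(awk '/^ctxt/ {print $2}' /proc/stat)
    sleep 10
    c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
    echo "$(( (c1 - c0) / 10 )) context switches/sec (whole machine)"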
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 9:53 ` Ingo Molnar @ 2007-04-18 12:13 ` Nick Piggin 2007-04-18 12:49 ` Con Kolivas 0 siblings, 1 reply; 304+ messages in thread From: Nick Piggin @ 2007-04-18 12:13 UTC (permalink / raw) To: Ingo Molnar Cc: Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > So looking at elapsed time, a granularity of 100ms is just behind the > > mainline score. However it is using slightly less user time and > > slightly more idle time, which indicates that balancing might have got > > a bit less aggressive. > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > timeslices. > > yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is > not unexpected when going to really frequent preemption. Clearly, the > default preemption granularity needs to be tuned up. > > I think you said you measured ~3msec average preemption rate per CPU? This was just looking at ctxsw numbers from running 2 cpu hogs on the same runqueue. > That would suggest the average cache-trashing cost was 120 usecs per > every 3 msec window. Taking that as a ballpark figure, to get the > difference back into the noise range we'd have to either use ~5 msec: > > echo 5000000 > /proc/sys/kernel/sched_granularity > > or 15 msec: > > echo 15000000 > /proc/sys/kernel/sched_granularity > > (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i > correctly understood your 3msec value. I'd have to know your kernbench > workload's approximate 'steady state' context-switch rate to do a more > accurate calculation.) The kernel compile (make -j8 on 4 thread system) is doing 1800 total context switches per second (450/s per runqueue) for cfs, and 670 for mainline. Going up to 20ms granularity for cfs brings the context switch numbers similar, but user time is still a % or so higher. I'd be more worried about compute heavy threads which naturally don't do much context switching. Some other numbers on the same system Hackbench: 2.6.21-rc7 cfs-v2 1ms[*] nicksched 10 groups: Time: 1.332 0.743 0.607 20 groups: Time: 1.197 1.100 1.241 30 groups: Time: 1.754 2.376 1.834 40 groups: Time: 3.451 2.227 2.503 50 groups: Time: 3.726 3.399 3.220 60 groups: Time: 3.548 4.567 3.668 70 groups: Time: 4.206 4.905 4.314 80 groups: Time: 4.551 6.324 4.879 90 groups: Time: 7.904 6.962 5.335 100 groups: Time: 7.293 7.799 5.857 110 groups: Time: 10.595 8.728 6.517 120 groups: Time: 7.543 9.304 7.082 130 groups: Time: 8.269 10.639 8.007 140 groups: Time: 11.867 8.250 8.302 150 groups: Time: 14.852 8.656 8.662 160 groups: Time: 9.648 9.313 9.541 Mainline seems pretty inconsistent here. lmbench 0K ctxsw latency bound to CPU0: tasks 2 2.59 3.42 2.50 4 3.26 3.54 3.09 8 3.01 3.64 3.22 16 3.00 3.66 3.50 32 2.99 3.70 3.49 64 3.09 4.17 3.50 128 4.80 5.58 4.74 256 5.79 6.37 5.76 cfs is noticeably disadvantaged. [*] 500ms didn't make much difference in either test. ^ permalink raw reply [flat|nested] 304+ messages in thread
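For reference, the invocations behind tables like these are typically along the lines below; a hedged sketch assuming the classic hackbench.c and lmbench's lat_ctx, since the exact flags used on Nick's box are not shown:

    # hackbench takes the number of groups and prints "Time: ...";
    # lat_ctx measures 0K context-switch latency, here pinned to CPU0.
    for g in 10 20 30 40 50 60 70 80; do ./hackbench $g; done
    taskset -c 0 ./lat_ctx -s 0 2 4 8 16 32 64 128 256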
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:13 ` Nick Piggin @ 2007-04-18 12:49 ` Con Kolivas 2007-04-19 3:28 ` Nick Piggin 0 siblings, 1 reply; 304+ messages in thread From: Con Kolivas @ 2007-04-18 12:49 UTC (permalink / raw) To: Nick Piggin Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wednesday 18 April 2007 22:13, Nick Piggin wrote: > On Wed, Apr 18, 2007 at 11:53:34AM +0200, Ingo Molnar wrote: > > * Nick Piggin <npiggin@suse.de> wrote: > > > So looking at elapsed time, a granularity of 100ms is just behind the > > > mainline score. However it is using slightly less user time and > > > slightly more idle time, which indicates that balancing might have got > > > a bit less aggressive. > > > > > > But anyway, it conclusively shows the efficiency impact of such tiny > > > timeslices. > > > > yeah, the 4% drop in a CPU-cache-sensitive workload like kernbench is > > not unexpected when going to really frequent preemption. Clearly, the > > default preemption granularity needs to be tuned up. > > > > I think you said you measured ~3msec average preemption rate per CPU? > > This was just looking at ctxsw numbers from running 2 cpu hogs on the > same runqueue. > > > That would suggest the average cache-trashing cost was 120 usecs per > > every 3 msec window. Taking that as a ballpark figure, to get the > > difference back into the noise range we'd have to either use ~5 msec: > > > > echo 5000000 > /proc/sys/kernel/sched_granularity > > > > or 15 msec: > > > > echo 15000000 > /proc/sys/kernel/sched_granularity > > > > (depending on whether it's 5x 3msec or 5x 1msec - i'm still not sure i > > correctly understood your 3msec value. I'd have to know your kernbench > > workload's approximate 'steady state' context-switch rate to do a more > > accurate calculation.) > > The kernel compile (make -j8 on 4 thread system) is doing 1800 total > context switches per second (450/s per runqueue) for cfs, and 670 > for mainline. Going up to 20ms granularity for cfs brings the context > switch numbers similar, but user time is still a % or so higher. I'd > be more worried about compute heavy threads which naturally don't do > much context switching. While kernel compiles are nice and easy to do I've seen enough criticism of them in the past to wonder about their usefulness as a standard benchmark on their own. > > Some other numbers on the same system > Hackbench: 2.6.21-rc7 cfs-v2 1ms[*] nicksched > 10 groups: Time: 1.332 0.743 0.607 > 20 groups: Time: 1.197 1.100 1.241 > 30 groups: Time: 1.754 2.376 1.834 > 40 groups: Time: 3.451 2.227 2.503 > 50 groups: Time: 3.726 3.399 3.220 > 60 groups: Time: 3.548 4.567 3.668 > 70 groups: Time: 4.206 4.905 4.314 > 80 groups: Time: 4.551 6.324 4.879 > 90 groups: Time: 7.904 6.962 5.335 > 100 groups: Time: 7.293 7.799 5.857 > 110 groups: Time: 10.595 8.728 6.517 > 120 groups: Time: 7.543 9.304 7.082 > 130 groups: Time: 8.269 10.639 8.007 > 140 groups: Time: 11.867 8.250 8.302 > 150 groups: Time: 14.852 8.656 8.662 > 160 groups: Time: 9.648 9.313 9.541 Hackbench even more so. In a prolonged discussion with Rusty Russell on this issue, he suggested hackbench was more a pass/fail benchmark to ensure there was no starvation scenario that never ended, and very little value should be placed on the actual results returned from it.
Wli's concerns regarding some sort of standard framework for a battery of accepted meaningful benchmarks come to mind as important, rather than ones that highlight one over the other. So while interesting for their own endpoints, I certainly wouldn't put either benchmark as some sort of yardstick for a "winner". Note I'm not saying that we shouldn't be looking at them per se, but since the whole drive for a new scheduler is trying to be more objective, we need to start expanding the range of benchmarks. Even though I don't feel the need to have SD in the "race", I guess it stands for more data to compare what is possible/where as well. > Mainline seems pretty inconsistent here. > > lmbench 0K ctxsw latency bound to CPU0: > tasks > 2 2.59 3.42 2.50 > 4 3.26 3.54 3.09 > 8 3.01 3.64 3.22 > 16 3.00 3.66 3.50 > 32 2.99 3.70 3.49 > 64 3.09 4.17 3.50 > 128 4.80 5.58 4.74 > 256 5.79 6.37 5.76 > > cfs is noticeably disadvantaged. > > [*] 500ms didn't make much difference in either test. -- -ck ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-18 12:49 ` Con Kolivas @ 2007-04-19 3:28 ` Nick Piggin 0 siblings, 0 replies; 304+ messages in thread From: Nick Piggin @ 2007-04-19 3:28 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, Andy Whitcroft, linux-kernel, Linus Torvalds, Andrew Morton, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan On Wed, Apr 18, 2007 at 10:49:45PM +1000, Con Kolivas wrote: > On Wednesday 18 April 2007 22:13, Nick Piggin wrote: > > > > The kernel compile (make -j8 on 4 thread system) is doing 1800 total > > context switches per second (450/s per runqueue) for cfs, and 670 > > for mainline. Going up to 20ms granularity for cfs brings the context > > switch numbers similar, but user time is still a % or so higher. I'd > > be more worried about compute heavy threads which naturally don't do > > much context switching. > > While kernel compiles are nice and easy to do I've seen enough criticism of > them in the past to wonder about their usefulness as a standard benchmark on > their own. Actually it is a real workload for most kernel developers, including you no doubt :) The criticisms of kernbench for the kernel are probably fair in that kernel compiles don't exercise a lot of kernel functionality (page allocator and fault paths mostly, IIRC). However as far as I'm concerned, they're great for testing the CPU scheduler, because it doesn't actually matter whether you're running in userspace or kernel space for a context switch to blow your caches. The results are quite stable. You could actually make up a benchmark that hurts a whole lot more from context switching, but I figure that kernbench is a real-world thing that shows it up quite well. > > Some other numbers on the same system > > Hackbench: 2.6.21-rc7 cfs-v2 1ms[*] nicksched > > 10 groups: Time: 1.332 0.743 0.607 > > 20 groups: Time: 1.197 1.100 1.241 > > 30 groups: Time: 1.754 2.376 1.834 > > 40 groups: Time: 3.451 2.227 2.503 > > 50 groups: Time: 3.726 3.399 3.220 > > 60 groups: Time: 3.548 4.567 3.668 > > 70 groups: Time: 4.206 4.905 4.314 > > 80 groups: Time: 4.551 6.324 4.879 > > 90 groups: Time: 7.904 6.962 5.335 > > 100 groups: Time: 7.293 7.799 5.857 > > 110 groups: Time: 10.595 8.728 6.517 > > 120 groups: Time: 7.543 9.304 7.082 > > 130 groups: Time: 8.269 10.639 8.007 > > 140 groups: Time: 11.867 8.250 8.302 > > 150 groups: Time: 14.852 8.656 8.662 > > 160 groups: Time: 9.648 9.313 9.541 > > Hackbench even more so. In a prolonged discussion with Rusty Russell on this > issue, he suggested hackbench was more a pass/fail benchmark to ensure there > was no starvation scenario that never ended, and very little value should be > placed on the actual results returned from it. Yeah, cfs seems to do a little worse than nicksched here, but I include the numbers not because I think that is significant, but to show mainline's poor characteristics. ^ permalink raw reply [flat|nested] 304+ messages in thread
* Re: [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] 2007-04-17 7:56 ` Andy Whitcroft 2007-04-17 9:32 ` Nick Piggin @ 2007-04-18 10:22 ` Ingo Molnar 1 sibling, 0 replies; 304+ messages in thread From: Ingo Molnar @ 2007-04-18 10:22 UTC (permalink / raw) To: Andy Whitcroft Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, Steve Fox, Nishanth Aravamudan * Andy Whitcroft <apw@shadowen.org> wrote: > > as usual, any sort of feedback, bugreports, fixes and suggestions > > are more than welcome, > > Pushed this through the test.kernel.org and nothing new blew up. > Notably the kernbench figures are within expectations even on the > bigger numa systems, commonly badly affected by balancing problems in > the schedular. thanks! Given the really low preemption latency/granularity default (roughly equivalent to 'timeslice length'), and that basically all of my focus was on interactivity characteristics, this is a pretty good result. I suspect it will be necessary to increase the default to 10 msecs (or more) to be on the safe side. (Nick has reported a 4% kernbench drop so for his kernbench workload it's needed.) Ingo ^ permalink raw reply [flat|nested] 304+ messages in thread
Thread overview: 304+ messages
2007-04-15 18:47 [Announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] Tim Tassonis
-- strict thread matches above, loose matches on Subject: below --
2007-04-13 20:21 Ingo Molnar
2007-04-13 20:27 ` Bill Huey
2007-04-13 20:55 ` Ingo Molnar
2007-04-13 21:21 ` William Lee Irwin III
2007-04-13 21:35 ` Bill Huey
2007-04-13 21:39 ` Ingo Molnar
2007-04-13 21:50 ` Ingo Molnar
2007-04-13 21:57 ` Michal Piotrowski
2007-04-13 22:15 ` Daniel Walker
2007-04-13 22:30 ` Ingo Molnar
2007-04-13 22:37 ` Willy Tarreau
2007-04-13 23:59 ` Daniel Walker
2007-04-14 10:55 ` Ingo Molnar
2007-04-13 22:21 ` William Lee Irwin III
2007-04-13 22:52 ` Ingo Molnar
2007-04-13 23:30 ` William Lee Irwin III
2007-04-13 23:44 ` Ingo Molnar
2007-04-13 23:58 ` William Lee Irwin III
2007-04-14 22:38 ` Davide Libenzi
2007-04-14 23:26 ` Davide Libenzi
2007-04-15 4:01 ` William Lee Irwin III
2007-04-15 4:18 ` Davide Libenzi
2007-04-15 23:09 ` Pavel Pisa
2007-04-16 5:47 ` Davide Libenzi
2007-04-17 0:37 ` Pavel Pisa
2007-04-13 22:31 ` Willy Tarreau
2007-04-13 23:18 ` Ingo Molnar
2007-04-14 18:48 ` Bill Huey
2007-04-13 23:07 ` Gabriel C
2007-04-13 23:25 ` Ingo Molnar
2007-04-13 23:39 ` Gabriel C
2007-04-14 2:04 ` Nick Piggin
2007-04-14 6:32 ` Ingo Molnar
2007-04-14 6:43 ` Ingo Molnar
2007-04-14 8:08 ` Willy Tarreau
2007-04-14 8:36 ` Willy Tarreau
2007-04-14 10:53 ` Ingo Molnar
2007-04-14 13:01 ` Willy Tarreau
2007-04-14 13:27 ` Willy Tarreau
2007-04-14 14:45 ` Willy Tarreau
2007-04-14 16:14 ` Ingo Molnar
2007-04-14 16:19 ` Ingo Molnar
2007-04-14 17:15 ` Eric W. Biederman
2007-04-14 17:29 ` Willy Tarreau
2007-04-14 17:44 ` Eric W. Biederman
2007-04-14 17:54 ` Ingo Molnar
2007-04-14 18:18 ` Willy Tarreau
2007-04-14 18:40 ` Eric W. Biederman
2007-04-14 19:01 ` Willy Tarreau
2007-04-15 17:55 ` Ingo Molnar
2007-04-15 18:06 ` Willy Tarreau
2007-04-15 19:20 ` Ingo Molnar
2007-04-15 19:35 ` William Lee Irwin III
2007-04-15 19:57 ` Ingo Molnar
2007-04-15 23:54 ` William Lee Irwin III
2007-04-16 11:24 ` Ingo Molnar
2007-04-16 13:46 ` William Lee Irwin III
2007-04-15 19:37 ` Ingo Molnar
2007-04-14 17:50 ` Linus Torvalds
2007-04-15 7:54 ` Mike Galbraith
2007-04-15 8:58 ` Ingo Molnar
2007-04-15 9:11 ` Mike Galbraith
2007-04-19 9:01 ` Ingo Molnar
2007-04-19 12:54 ` Willy Tarreau
2007-04-19 15:18 ` Ingo Molnar
2007-04-19 17:34 ` Gene Heskett
2007-04-19 18:45 ` Willy Tarreau
2007-04-21 10:31 ` Ingo Molnar
2007-04-21 10:38 ` Ingo Molnar
2007-04-21 10:45 ` Ingo Molnar
2007-04-21 11:07 ` Willy Tarreau
2007-04-21 11:29 ` Björn Steinbrink
2007-04-21 11:51 ` Willy Tarreau
2007-04-19 23:52 ` Jan Knutar
2007-04-20 5:05 ` Willy Tarreau
2007-04-19 17:32 ` Gene Heskett
2007-04-14 15:17 ` Mark Lord
2007-04-14 19:48 ` William Lee Irwin III
2007-04-14 20:12 ` Willy Tarreau
2007-04-14 10:36 ` Ingo Molnar
2007-04-14 15:09 ` S.Çağlar Onur
2007-04-14 16:09 ` Ingo Molnar
2007-04-14 16:59 ` S.Çağlar Onur
2007-04-15 3:27 ` Con Kolivas
2007-04-15 5:16 ` Bill Huey
2007-04-15 8:44 ` Ingo Molnar
2007-04-15 9:51 ` Bill Huey
2007-04-15 10:39 ` Pekka Enberg
2007-04-15 12:45 ` Willy Tarreau
2007-04-15 13:08 ` Pekka J Enberg
2007-04-15 17:32 ` Mike Galbraith
2007-04-15 17:59 ` Linus Torvalds
2007-04-15 19:00 ` Jonathan Lundell
2007-04-15 22:52 ` Con Kolivas
2007-04-16 2:28 ` Nick Piggin
2007-04-16 3:15 ` Con Kolivas
2007-04-16 3:34 ` Nick Piggin
2007-04-15 15:26 ` William Lee Irwin III
2007-04-16 15:55 ` Chris Friesen
2007-04-16 16:13 ` William Lee Irwin III
2007-04-17 0:04 ` Peter Williams
2007-04-17 13:07 ` James Bruce
2007-04-17 20:05 ` William Lee Irwin III
2007-04-15 15:39 ` Ingo Molnar
2007-04-15 15:47 ` William Lee Irwin III
2007-04-16 5:27 ` Peter Williams
2007-04-16 6:23 ` Peter Williams
2007-04-16 6:40 ` Peter Williams
2007-04-16 7:32 ` Ingo Molnar
2007-04-16 8:54 ` Peter Williams
2007-04-15 15:16 ` Gene Heskett
2007-04-15 16:43 ` Con Kolivas
2007-04-15 16:58 ` Gene Heskett
2007-04-15 18:00 ` Mike Galbraith
2007-04-16 0:18 ` Gene Heskett
2007-04-15 16:11 ` Bernd Eckenfels
2007-04-15 6:43 ` Mike Galbraith
2007-04-15 8:36 ` Bill Huey
2007-04-15 8:45 ` Mike Galbraith
2007-04-15 9:06 ` Ingo Molnar
2007-04-16 10:00 ` Ingo Molnar
2007-04-15 16:25 ` Arjan van de Ven
2007-04-16 5:36 ` Bill Huey
2007-04-16 6:17 ` Nick Piggin
2007-04-17 0:06 ` Peter Williams
2007-04-17 2:29 ` Mike Galbraith
2007-04-17 3:40 ` Nick Piggin
2007-04-17 4:01 ` Mike Galbraith
2007-04-17 4:14 ` Nick Piggin
2007-04-17 6:26 ` Peter Williams
2007-04-17 9:51 ` Ingo Molnar
2007-04-17 13:44 ` Peter Williams
2007-04-17 23:00 ` Michael K. Edwards
2007-04-17 23:07 ` William Lee Irwin III
2007-04-17 23:52 ` Michael K. Edwards
2007-04-18 0:36 ` Bill Huey
2007-04-18 2:39 ` Peter Williams
2007-04-20 20:47 ` Bill Davidsen
2007-04-21 7:39 ` Nick Piggin
2007-04-21 8:33 ` Ingo Molnar
2007-04-20 20:36 ` Bill Davidsen
2007-04-17 4:17 ` Peter Williams
2007-04-17 4:29 ` Nick Piggin
2007-04-17 5:53 ` Willy Tarreau
2007-04-17 6:10 ` Nick Piggin
2007-04-17 6:09 ` William Lee Irwin III
2007-04-17 6:15 ` Nick Piggin
2007-04-17 6:26 ` William Lee Irwin III
2007-04-17 7:01 ` Nick Piggin
2007-04-17 8:23 ` William Lee Irwin III
2007-04-17 22:23 ` Davide Libenzi
2007-04-17 21:39 ` Matt Mackall
2007-04-17 23:23 ` Peter Williams
2007-04-17 23:19 ` Matt Mackall
2007-04-18 3:15 ` Nick Piggin
2007-04-18 3:45 ` Mike Galbraith
2007-04-18 3:56 ` Nick Piggin
2007-04-18 4:29 ` Mike Galbraith
2007-04-18 4:38 ` Matt Mackall
2007-04-18 5:00 ` Nick Piggin
2007-04-18 5:55 ` Matt Mackall
2007-04-18 6:37 ` Nick Piggin
2007-04-18 6:55 ` Matt Mackall
2007-04-18 7:24 ` Nick Piggin
2007-04-21 13:33 ` Bill Davidsen
2007-04-18 13:08 ` William Lee Irwin III
2007-04-18 19:48 ` Davide Libenzi
2007-04-18 14:48 ` Linus Torvalds
2007-04-18 15:23 ` Matt Mackall
2007-04-18 17:22 ` Linus Torvalds
2007-04-18 17:49 ` Ingo Molnar
2007-04-18 17:59 ` Ingo Molnar
2007-04-18 19:40 ` Linus Torvalds
2007-04-18 19:43 ` Ingo Molnar
2007-04-18 20:07 ` Davide Libenzi
2007-04-18 21:48 ` Ingo Molnar
2007-04-18 23:30 ` Davide Libenzi
2007-04-19 8:00 ` Ingo Molnar
2007-04-19 15:43 ` Davide Libenzi
2007-04-21 14:09 ` Bill Davidsen
2007-04-19 17:39 ` Bernd Eckenfels
2007-04-19 6:52 ` Mike Galbraith
2007-04-19 7:09 ` Ingo Molnar
2007-04-19 7:32 ` Mike Galbraith
2007-04-19 16:55 ` Davide Libenzi
2007-04-20 5:16 ` Mike Galbraith
2007-04-19 7:14 ` Mike Galbraith
2007-04-18 21:04 ` Ingo Molnar
2007-04-18 19:23 ` Linus Torvalds
2007-04-18 19:56 ` Davide Libenzi
2007-04-18 20:11 ` Linus Torvalds
2007-04-19 0:22 ` Davide Libenzi
2007-04-19 0:30 ` Linus Torvalds
2007-04-18 18:02 ` William Lee Irwin III
2007-04-18 18:12 ` Ingo Molnar
2007-04-18 18:36 ` Diego Calleja
2007-04-19 0:37 ` Peter Williams
2007-04-18 19:05 ` Davide Libenzi
2007-04-18 19:13 ` Michael K. Edwards
2007-04-19 3:18 ` Nick Piggin
2007-04-19 5:14 ` Andrew Morton
2007-04-19 6:38 ` Ingo Molnar
2007-04-19 7:57 ` William Lee Irwin III
2007-04-19 11:50 ` Peter Williams
2007-04-20 5:26 ` William Lee Irwin III
2007-04-20 6:16 ` Peter Williams
2007-04-19 8:33 ` Nick Piggin
2007-04-21 13:40 ` Bill Davidsen
2007-04-17 6:50 ` Davide Libenzi
2007-04-17 7:09 ` William Lee Irwin III
2007-04-17 7:22 ` Peter Williams
2007-04-17 7:23 ` Nick Piggin
2007-04-17 7:27 ` Davide Libenzi
2007-04-17 7:33 ` Nick Piggin
2007-04-17 7:33 ` Ingo Molnar
2007-04-17 7:40 ` Nick Piggin
2007-04-17 7:58 ` Ingo Molnar
2007-04-17 9:05 ` William Lee Irwin III
2007-04-17 9:24 ` Ingo Molnar
2007-04-17 9:57 ` William Lee Irwin III
2007-04-17 10:01 ` Ingo Molnar
2007-04-17 11:31 ` William Lee Irwin III
2007-04-17 22:08 ` Matt Mackall
2007-04-17 22:32 ` William Lee Irwin III
2007-04-17 22:39 ` Matt Mackall
2007-04-17 22:59 ` William Lee Irwin III
2007-04-17 22:57 ` Matt Mackall
2007-04-18 4:29 ` William Lee Irwin III
2007-04-18 4:42 ` Davide Libenzi
2007-04-18 7:29 ` James Bruce
2007-04-17 7:11 ` Nick Piggin
2007-04-17 7:21 ` Davide Libenzi
2007-04-17 6:23 ` Peter Williams
2007-04-17 6:44 ` Nick Piggin
2007-04-17 7:48 ` Peter Williams
2007-04-17 7:56 ` Nick Piggin
2007-04-17 13:16 ` Peter Williams
2007-04-18 4:46 ` Nick Piggin
2007-04-17 8:44 ` Ingo Molnar
2007-04-19 2:20 ` Peter Williams
2007-04-15 15:05 ` Ingo Molnar
2007-04-15 20:05 ` Matt Mackall
2007-04-15 20:48 ` Ingo Molnar
2007-04-15 21:31 ` Matt Mackall
2007-04-16 3:03 ` Nick Piggin
2007-04-16 14:28 ` Matt Mackall
2007-04-17 3:31 ` Nick Piggin
2007-04-17 17:35 ` Matt Mackall
2007-04-16 15:45 ` William Lee Irwin III
2007-04-15 23:39 ` William Lee Irwin III
2007-04-16 1:06 ` Peter Williams
2007-04-16 3:04 ` William Lee Irwin III
2007-04-16 5:09 ` Peter Williams
2007-04-16 11:04 ` William Lee Irwin III
2007-04-16 12:55 ` Peter Williams
2007-04-16 23:10 ` Michael K. Edwards
2007-04-17 3:55 ` Nick Piggin
2007-04-17 4:25 ` Peter Williams
2007-04-17 4:34 ` Nick Piggin
2007-04-17 6:03 ` Peter Williams
2007-04-17 6:14 ` William Lee Irwin III
2007-04-17 6:23 ` Nick Piggin
2007-04-17 9:36 ` Ingo Molnar
2007-04-17 8:24 ` William Lee Irwin III
[not found] ` <20070416135915.GK8915@holomorphy.com>
[not found] ` <46241677.7060909@bigpond.net.au>
[not found] ` <20070417025704.GM8915@holomorphy.com>
[not found] ` <462445EC.1060306@bigpond.net.au>
[not found] ` <20070417053147.GN8915@holomorphy.com>
[not found] ` <46246A7C.8050501@bigpond.net.au>
[not found] ` <20070417064109.GP8915@holomorphy.com>
2007-04-17 8:00 ` Peter Williams
2007-04-17 10:41 ` William Lee Irwin III
2007-04-17 13:48 ` Peter Williams
2007-04-18 0:27 ` Peter Williams
2007-04-18 2:03 ` William Lee Irwin III
2007-04-18 2:31 ` Peter Williams
2007-04-16 17:22 ` Chris Friesen
2007-04-17 0:54 ` Peter Williams
2007-04-17 15:52 ` Chris Friesen
2007-04-17 23:50 ` Peter Williams
2007-04-18 5:43 ` Chris Friesen
2007-04-18 13:00 ` Peter Williams
2007-04-16 5:16 ` Con Kolivas
2007-04-16 5:48 ` Gene Heskett
2007-04-15 12:29 ` Esben Nielsen
2007-04-15 13:04 ` Ingo Molnar
2007-04-16 7:16 ` Esben Nielsen
2007-04-15 22:49 ` Ismail Dönmez
2007-04-15 23:23 ` Arjan van de Ven
2007-04-15 23:33 ` Ismail Dönmez
2007-04-16 11:58 ` Ingo Molnar
2007-04-16 12:02 ` Ismail Dönmez
2007-04-16 22:00 ` Andi Kleen
2007-04-16 21:05 ` Ingo Molnar
2007-04-16 21:21 ` Andi Kleen
2007-04-17 7:56 ` Andy Whitcroft
2007-04-17 9:32 ` Nick Piggin
2007-04-17 9:59 ` Ingo Molnar
2007-04-17 11:11 ` Nick Piggin
2007-04-18 8:55 ` Nick Piggin
2007-04-18 9:33 ` Con Kolivas
2007-04-18 12:14 ` Nick Piggin
2007-04-18 12:33 ` Con Kolivas
2007-04-18 21:49 ` Con Kolivas
2007-04-18 9:53 ` Ingo Molnar
2007-04-18 12:13 ` Nick Piggin
2007-04-18 12:49 ` Con Kolivas
2007-04-19 3:28 ` Nick Piggin
2007-04-18 10:22 ` Ingo Molnar