* [patch] CFS (Completely Fair Scheduler), v2
@ 2007-04-16 22:07 Ingo Molnar
2007-04-16 22:12 ` S.Çağlar Onur
` (3 more replies)
0 siblings, 4 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-16 22:07 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko
this is the second release of the CFS (Completely Fair Scheduler)
patchset, against v2.6.21-rc7:
http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
i'd like to thank everyone for the tremendous amount of feedback and
testing the v1 patch got - i could hardly keep up with just reading the
mails! Some of the stuff people addressed i couldn't implement yet, i
mostly concentrated on bugs, regressions and debuggability.
there's a fair amount of churn:
15 files changed, 456 insertions(+), 241 deletions(-)
But it's an encouraging sign that there was no crash bug found in v1,
all the bugs were related to scheduling-behavior details. The code was
tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
code size increase in -v2 is due to debugging helpers, they'll be
removed later. (The new /proc/sched_debug file can be used to see the
fine details of CFS scheduling.)
Changes since -v1:
- make nice levels less starvable. (reported by Willy Tarreau)
- fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
flag can be used to turn it on/off. (This might fix the Kaffeine bug
reported by S.Çağlar Onur)
- changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)
- UP build fix. (reported by Gabriel C)
- timer tick micro-optimization (Dmitry Adamushko)
- preemption fix: sched_class->check_preempt_curr method to decide
whether to preempt after a wakeup (or at a timer tick). (Found via a
fairness-test-utility written for CFS by Mike Galbraith)
- start forked children with neutral statistics instead of trying to
inherit them from the parent: Willy Tarreau reported that this
results in better behavior on extreme workloads, and it also
simplifies the code quite nicely. Removed sched_exit() and the
->task_exit() methods.
- make nice levels independent of the sched_granularity value
- new /proc/sched_debug file listing runqueue details and the rbtree
- new SCH-* fields in /proc/<NR>/status to see scheduling details
- new cpu-hog feature (off by default) and sysctl tunable to set it:
/proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
0 (off). Positive values set the maximum 'memory' that the
scheduler has of CPU hogs.
- various code cleanups
- added more statistics temporarily: sum_exec_runtime,
sum_wait_runtime.
- added -CFS-v2 to EXTRAVERSION
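The two sysctl files named in the list above can be inspected with a few lines of shell. This is only a sketch: the file names are taken from the list, while the helper name show_cfs_tunables and its optional directory argument are made up for illustration. On a kernel without this patch the files will simply be reported as not available.

```sh
#!/bin/sh
# Print the CFS -v2 tunables named in the change list, if they exist.
# The optional first argument (default /proc/sys/kernel) is a hypothetical
# convenience so the helper can be pointed at another directory.
show_cfs_tunables() {
    dir=${1:-/proc/sys/kernel}
    for name in sched_child_runs_first sched_max_hog_history_ns; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s: not available\n' "$name"
        fi
    done
}
```

Calling show_cfs_tunables with no argument reads the live values; changing them needs root and a write to the same files.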
as usual, any sort of feedback, bugreports, fixes and suggestions are
more than welcome,
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
@ 2007-04-16 22:12 ` S.Çağlar Onur
2007-04-17 8:59 ` Ingo Molnar
2007-04-17 4:06 ` Peter Williams
` (2 subsequent siblings)
3 siblings, 1 reply; 37+ messages in thread
From: S.Çağlar Onur @ 2007-04-16 22:12 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko
[-- Attachment #1: Type: text/plain, Size: 621 bytes --]
On Tuesday 17 Apr 2007, Ingo Molnar wrote:
> - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
> flag can be used to turn it on/off. (This might fix the Kaffeine bug
> reported by S.Çağlar Onur)
Sorry for the delayed response, i only just found some free time. Do you still
want me to test mainline + the "parent-runs first" patch, or should i drop that
one and test v2, which can change the default behaviour?
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/
Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
2007-04-16 22:12 ` S.Çağlar Onur
@ 2007-04-17 4:06 ` Peter Williams
2007-04-17 6:49 ` Ingo Molnar
2007-04-17 4:53 ` Gene Heskett
2007-04-17 6:46 ` Peter Williams
3 siblings, 1 reply; 37+ messages in thread
From: Peter Williams @ 2007-04-17 4:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko
Ingo Molnar wrote:
> this is the second release of the CFS (Completely Fair Scheduler)
> patchset, against v2.6.21-rc7:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>
> i'd like to thank everyone for the tremendous amount of feedback and
> testing the v1 patch got - i could hardly keep up with just reading the
> mails! Some of the stuff people addressed i couldn't implement yet, i
> mostly concentrated on bugs, regressions and debuggability.
Can I make a suggestion?
Would it be possible (from now on) to publish changes relative to the
previous patch (eventually leading to a series of patches that describes
the evolution of the new scheduler) so that it's easier for us
reviewers/critics to see the latest changes? E.g. if I import such
changes into something like quilt (using my gquilt GUI wrapper, of
course :-)) I can then use meld (or similar) to follow what's going on as
suggestions get folded in and bugs get fixed etc.
Thanks
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
2007-04-16 22:12 ` S.Çağlar Onur
2007-04-17 4:06 ` Peter Williams
@ 2007-04-17 4:53 ` Gene Heskett
2007-04-17 5:25 ` Willy Tarreau
2007-04-17 6:18 ` Ingo Molnar
2007-04-17 6:46 ` Peter Williams
3 siblings, 2 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 4:53 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko
This one (v2-rc2) is not a keeper, I'm sorry to say, Ingo. v2-rc0 was much
better. Watching amanda run with htop, kmail's composer is being subjected to
5 to 10 second pauses, and htop says that gzip -best isn't getting more than
15% of the cpu. The /amandatapes drive is being written to in a regular
pattern that seems to be the cause of the pauses according to gkrellm, which
also seems to track the size of the writes, and can show anything from 4.3k
to 54 megs as being written in one cycle of its screen update.
Normally hdd will fire up and take it at about 40+M/second steady till it's
done when there is a file ready to write, even if it's a 7GB file. And I can
type right on during the disk i/o. But not now.
In short, I seem to be heavily I/O bound. But when the write to /dev/hdd3 is
done, then gzip -best pops right up to 90% plus cpu and I get my machine
back.
In between file writes I checked the drive's speed with hdparm:
[root@coyote ~]# hdparm -Tt /dev/hdd
/dev/hdd:
Timing cached reads: 856 MB in 2.01 seconds = 426.15 MB/sec
Timing buffered disk reads: 222 MB in 3.01 seconds = 73.68 MB/sec
That's not too shabby, and obviously dma is active at least for the reading.
gzip -best was running while this was executing. So I think the drive is fine
and the scheduling is what's funkity. Sorry.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
After they got rid of capital punishment, they had to hang twice
as many people as before.
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 4:53 ` Gene Heskett
@ 2007-04-17 5:25 ` Willy Tarreau
2007-04-17 5:51 ` Gene Heskett
` (3 more replies)
2007-04-17 6:18 ` Ingo Molnar
1 sibling, 4 replies; 37+ messages in thread
From: Willy Tarreau @ 2007-04-17 5:25 UTC (permalink / raw)
To: Gene Heskett
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko
Hi Gene,
On Tue, Apr 17, 2007 at 12:53:56AM -0400, Gene Heskett wrote:
> This one (v2-rc2) is not a keeper, I'm sorry to say, Ingo. v2-rc0 was much
> better. Watching amanda run with htop, kmail's composer is being subjected to
> 5 to 10 second pauses, and htop says that gzip -best isn't getting more than
> 15% of the cpu, and the /amandatapes drive is being written to in a regular
> pattern that seems to be the cause of the pauses according to gkrellm, which
> also seems to track the size of the writes, and can show anything from 4.3k
> to 54 megs as being written in one cycle of its screen update.
Have you tried the previous version with the fair-fork patch? It might be
that your workload is sensitive to the fork()'s child getting much CPU upon
startup.
Ingo, maybe I'm saying something stupid, but in my userland scheduler, when
new tasks are "forked", they are queued at the end of the run queue with a
fixed priority. In our case, this would translate into assigning them the
same prio and timeslice as their parent, but queuing them at the end so that
they don't make existing tasks starve during huge fork() loads.
I don't know how that would be possible (nor if that would help in anything),
but I found it was a good compromise over sharing the timeslice with the
parent. Perhaps we should have some absolute timeslice and some relative
timeslice (eg: X percent of total time divided by the number of tasks) ?
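One possible reading of the "relative timeslice" idea above is plain integer arithmetic: take X percent of some scheduling period and split it evenly among the runnable tasks. The sketch below only illustrates that arithmetic; the function name, the microsecond unit and the notion of a fixed period are assumptions, and it is not what the CFS patch itself does.

```sh
#!/bin/sh
# relative_timeslice_us PERIOD_US PERCENT NTASKS
# Share PERCENT of a scheduling period equally among NTASKS runnable tasks.
# All three parameters are hypothetical inputs chosen for illustration.
relative_timeslice_us() {
    period_us=$1
    percent=$2
    ntasks=$3
    # Refuse to divide by zero when nothing is runnable.
    [ "$ntasks" -gt 0 ] || return 1
    echo $(( period_us * percent / 100 / ntasks ))
}
```

For example, relative_timeslice_us 100000 50 10 prints 5000: half of a 100 ms period shared by ten tasks leaves 5 ms each, so the slice shrinks as a fork() storm adds tasks.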
Regards,
Willy
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 5:25 ` Willy Tarreau
@ 2007-04-17 5:51 ` Gene Heskett
2007-04-17 7:18 ` Paolo Ornati
2007-04-17 5:51 ` Mike Galbraith
` (2 subsequent siblings)
3 siblings, 1 reply; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 5:51 UTC (permalink / raw)
To: Willy Tarreau
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko
On Tuesday 17 April 2007, Willy Tarreau wrote:
>Hi Gene,
>
>On Tue, Apr 17, 2007 at 12:53:56AM -0400, Gene Heskett wrote:
>> This one (v2-rc2) is not a keeper, I'm sorry to say, Ingo. v2-rc0 was much
>> better. Watching amanda run with htop, kmail's composer is being subjected
>> to 5 to 10 second pauses, and htop says that gzip -best isn't getting more
>> than 15% of the cpu, and the /amandatapes drive is being written to in a
>> regular pattern that seems to be the cause of the pauses according to
>> gkrellm, which also seems to track the size of the writes, and can show
>> anything from 4.3k to 54 megs as being written in one cycle of its screen
>> update.
Somewhat related to this, I have amanda doing a verify phase too. During
the verify phase (and while I was waiting for gmail to transmit this message,
it took 30 minutes before it showed up on the list) I noted that when
amrestore fired up, it and its child tar were only taking about 20% of the
cpu between them, and that /dev/hdd was showing a pretty steady 55 to
75MB/sec being read. As to what this tells us, I'm not going to hazard a
guess because it wouldn't, this time of the night here in WV, USA, even be a
SWAG. It's coming up on 2am and the toothpicks holding my eyes open are
sagging badly, making creaking noises even.
>Have you tried the previous version with the fair-fork patch? It might be
> possible that your workload is sensitive to the fork()'s child getting much
> CPU upon startup.
Willy, I think that patch went by, and was followed by the v2-rc2 so fast that
I never got a chance to try it with the v2-rc0 framework. So I believe the
answer there is probably no. I never saw a problem with the v2-rc0, but Ingo
shot me a message about it without enough detail that I could have tested for
it.
FWIW, I've been using the CFQ I/O scheduler for quite a while, is it time I
gave the AS or Deadline versions another check? They are all built in but I
don't know how to change the default on the fly, or even if it can be done.
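On the "on the fly" question: 2.6 kernels of this era expose the I/O scheduler per block device in sysfs, so it can normally be read and switched at run time without a reboot. The sketch below assumes the drive appears as /sys/block/hdd (taken from the device name above); the two helper names are made up for illustration.

```sh
#!/bin/sh
# active_iosched LINE
# Print the scheduler shown in [brackets] in the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "noop anticipatory deadline [cfq]".
active_iosched() {
    line=$1
    case $line in
        *\[*\]*) ;;
        *) return 1 ;;      # no bracketed entry found
    esac
    line=${line#*\[}        # drop everything up to the first "["
    echo "${line%%\]*}"     # drop the "]" and whatever follows it
}

# set_iosched DEVICE NAME   (needs root)
# DEVICE is a block device name such as hdd; NAME must be a scheduler
# that is built in or loaded, e.g. deadline.
set_iosched() {
    echo "$2" > "/sys/block/$1/queue/scheduler"
}
```

A typical session would be active_iosched "$(cat /sys/block/hdd/queue/scheduler)" to see the current choice, then set_iosched hdd deadline as root to try another one. The change applies to that device only and does not survive a reboot.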
>Ingo, maybe I'm saying something stupid, but in my userland scheduler, when
>new tasks are "forked", they are queued at the end of the run queue with a
>fixed priority. In our case, this would translate into assigning them the
>same prio and timeslice as their parent, but queuing them at the end so that
>they don't make existing tasks starve during huge fork() loads.
>
>I don't know how that would be possible (nor if that would help in
> anything), but I found it was a good compromise over sharing the timeslice
> with the parent. Perhaps we should have some absolute timeslice and some
> relative timeslice (eg: X percent of total time divided by the number of
> tasks) ?
>
>Regards,
>Willy
Thanks Willy.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
"I take Him shopping with me. I say, 'OK, Jesus, help me find a bargain'"
--Tammy Faye Bakker
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 5:25 ` Willy Tarreau
2007-04-17 5:51 ` Gene Heskett
@ 2007-04-17 5:51 ` Mike Galbraith
2007-04-17 6:27 ` Ingo Molnar
2007-04-18 0:06 ` Peter Williams
3 siblings, 0 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-04-17 5:51 UTC (permalink / raw)
To: Willy Tarreau
Cc: Gene Heskett, Ingo Molnar, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven,
Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko
On Tue, 2007-04-17 at 07:25 +0200, Willy Tarreau wrote:
> Have you tried the previous version with the fair-fork patch? It might be
> that your workload is sensitive to the fork()'s child getting much CPU upon
> startup.
Dunno about that, but here's a possibly related datapoint. I reported
to Ingo yesterday that I was sometimes losing control of my GUI (KDE)
under heavy IO. I just reproduced it in mainline rc7. If I start a
bonnie, and click around popping windows to the foreground, then poke
KDE's menu button, I may lose all GUI capability for a _very_ long time.
Here, with bonnie, that means until it gets past writing with putc, and
moves on to rewrite. Ages.
-Mike
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 4:53 ` Gene Heskett
2007-04-17 5:25 ` Willy Tarreau
@ 2007-04-17 6:18 ` Ingo Molnar
2007-04-17 7:01 ` Ingo Molnar
` (2 more replies)
1 sibling, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 6:18 UTC (permalink / raw)
To: Gene Heskett
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko
* Gene Heskett <gene.heskett@gmail.com> wrote:
> This one (v2-rc2) is not a keeper, I'm sorry to say, Ingo. v2-rc0 was
> much better. Watching amanda run with htop, kmail's composer is being
> subjected to 5 to 10 second pauses, and htop says that gzip -best
> isn't getting more than 15% of the cpu, and the /amandatapes drive is
> being written to in a regular pattern that seems to be the cause of
> the pauses according to gkrellm, which also seems to track the size of
> the writes, and can show anything from 4.3k to 54 megs as being
> written in one cycle of its screen update.
ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
small. One difference is the child-runs-first fix. To restore the
parent-runs-first logic, do this:
echo 0 > /proc/sys/kernel/sched_child_runs_first
does this make any difference?
If not then pretty much the only other change was the nice level tweak i
did. Could you try to grab a few snapshots of scheduling state via
something like:
while sleep 1; do cat /proc/sched_debug >> to-ingo.txt; done
(and tell me the PID of the kmail composer, to make sure i'm checking
the right task's behavior.)
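The snapshot loop above runs until interrupted and does not mark where one snapshot ends and the next begins. A bounded, timestamped variant might look like the sketch below; the function name and its arguments are illustrative only, and /proc/sched_debug exists only on a kernel carrying this patch.

```sh
#!/bin/sh
# snapshot_sched_debug OUT COUNT [INTERVAL] [SRC]
# Append COUNT timestamped copies of SRC (default /proc/sched_debug) to OUT,
# sleeping INTERVAL seconds (default 1) between copies.
snapshot_sched_debug() {
    out=$1
    count=$2
    interval=${3:-1}
    src=${4:-/proc/sched_debug}
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '=== snapshot %s at %s ===\n' "$i" "$(date '+%H:%M:%S')" >> "$out"
        cat "$src" >> "$out" || return 1
        i=$((i + 1))
        # No need to sleep after the final snapshot.
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
```

For instance, snapshot_sched_debug to-ingo.txt 10 would collect ten snapshots one second apart while the stall is being reproduced.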
also, as a separate experiment, could you perhaps run this script as
root:
cd /proc; for N in [1-9]*; do renice -n 0 $N; done
this will move all tasks in the system to nice level 0 and should make
any nice level handling logic in the scheduler irrelevant. Do you have X
reniced perhaps?
Lots of system threads have negative or positive nice levels, so once
you have executed this script, only a reboot will be a practical way to
restore it to the previous settings.
Ingo
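Since the renice experiment above is described as practically irreversible without a reboot, one workaround is to record every task's nice level first and reapply the values afterwards. The sketch below is an assumption-laden illustration, not part of the patch: the helper names are invented, the nice value is read from field 19 of /proc/<pid>/stat, and the old-style renice form without -n is used on the assumption that it sets an absolute value, as util-linux renice does (check your renice first). Tasks started or exited in between are not handled.

```sh
#!/bin/sh
# nice_from_stat_line LINE
# Print the nice value (field 19) from one line of /proc/<pid>/stat.
# The command name in field 2 may contain spaces or parentheses, so
# everything up to the last ") " is dropped before splitting.
nice_from_stat_line() {
    rest=${1##*) }
    set -- $rest            # $1 is now field 3 (state), so field 19 is ${17}
    echo "${17}"
}

# save_nice_levels FILE: record "pid nice" for every visible process.
save_nice_levels() {
    : > "$1"
    for d in /proc/[1-9]*; do
        [ -r "$d/stat" ] || continue
        line=$(cat "$d/stat" 2>/dev/null) || continue
        echo "${d#/proc/} $(nice_from_stat_line "$line")" >> "$1"
    done
}

# restore_nice_levels FILE: reapply the saved values (needs root).
restore_nice_levels() {
    while read -r pid ni; do
        renice "$ni" -p "$pid" >/dev/null 2>&1
    done < "$1"
    return 0
}
```

Running save_nice_levels /root/nice.before ahead of the experiment and restore_nice_levels /root/nice.before after it should put long-lived tasks, including reniced system threads, back where they were.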
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 5:25 ` Willy Tarreau
2007-04-17 5:51 ` Gene Heskett
2007-04-17 5:51 ` Mike Galbraith
@ 2007-04-17 6:27 ` Ingo Molnar
2007-04-18 0:06 ` Peter Williams
3 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 6:27 UTC (permalink / raw)
To: Willy Tarreau
Cc: Gene Heskett, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko
[-- Attachment #1: Type: text/plain, Size: 506 bytes --]
* Willy Tarreau <w@1wt.eu> wrote:
> Have you tried the previous version with the fair-fork patch? It might be
> possible that your workload is sensitive to the fork()'s child getting
> much CPU upon startup.
the fair-fork patch is now included in -v2, but it was already in the
-v2-rc0 that i sent to Gene separately. I've attached the
-rc0->final delta.
Gene, could you please apply this patch to your -v2-rc0 tree and do a
quick double-check that indeed these changes cause the regression?
Ingo
[-- Attachment #2: sched-cfs-v2-rc0-final-delta.patch --]
[-- Type: text/plain, Size: 22652 bytes --]
---
include/linux/sched.h | 7 +
kernel/exit.c | 2
kernel/posix-cpu-timers.c | 24 ++---
kernel/rtmutex.c | 2
kernel/sched.c | 191 +++++++++++++++++++++++++---------------------
kernel/sched_debug.c | 14 +--
kernel/sched_fair.c | 80 +++++++++++++------
kernel/sched_rt.c | 21 +++++
kernel/sysctl.c | 8 +
9 files changed, 218 insertions(+), 131 deletions(-)
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -798,12 +798,15 @@ struct sched_class {
void (*dequeue_task) (struct rq *rq, struct task_struct *p);
void (*requeue_task) (struct rq *rq, struct task_struct *p);
+ void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
+
struct task_struct * (*pick_next_task) (struct rq *rq);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
struct task_struct * (*load_balance_start) (struct rq *rq);
struct task_struct * (*load_balance_next) (struct rq *rq);
void (*task_tick) (struct rq *rq, struct task_struct *p);
+ void (*task_new) (struct rq *rq, struct task_struct *p);
void (*task_init) (struct rq *rq, struct task_struct *p);
};
@@ -838,7 +841,8 @@ struct task_struct {
u64 last_ran;
s64 wait_runtime;
- u64 exec_runtime, fair_key;
+ u64 sum_exec_runtime, fair_key;
+ s64 sum_wait_runtime;
long nice_offset;
s64 hog_limit;
@@ -1236,6 +1240,7 @@ extern char * sched_print_task_state(str
extern unsigned int sysctl_sched_max_hog_history;
extern unsigned int sysctl_sched_granularity;
+extern unsigned int sysctl_sched_child_runs_first;
#ifdef CONFIG_RT_MUTEXES
extern int rt_mutex_getprio(struct task_struct *p);
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -112,7 +112,7 @@ static void __exit_signal(struct task_st
sig->maj_flt += tsk->maj_flt;
sig->nvcsw += tsk->nvcsw;
sig->nivcsw += tsk->nivcsw;
- sig->sum_sched_runtime += tsk->exec_runtime;
+ sig->sum_sched_runtime += tsk->sum_exec_runtime;
sig = NULL; /* Marker for below. */
}
Index: linux/kernel/posix-cpu-timers.c
===================================================================
--- linux.orig/kernel/posix-cpu-timers.c
+++ linux/kernel/posix-cpu-timers.c
@@ -161,7 +161,7 @@ static inline cputime_t virt_ticks(struc
}
static inline unsigned long long sched_ns(struct task_struct *p)
{
- return (p == current) ? current_sched_runtime(p) : p->exec_runtime;
+ return (p == current) ? current_sched_runtime(p) : p->sum_exec_runtime;
}
int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
@@ -249,7 +249,7 @@ static int cpu_clock_sample_group_locked
cpu->sched = p->signal->sum_sched_runtime;
/* Add in each other live thread. */
while ((t = next_thread(t)) != p) {
- cpu->sched += t->exec_runtime;
+ cpu->sched += t->sum_exec_runtime;
}
cpu->sched += sched_ns(p);
break;
@@ -422,7 +422,7 @@ int posix_cpu_timer_del(struct k_itimer
*/
static void cleanup_timers(struct list_head *head,
cputime_t utime, cputime_t stime,
- unsigned long long exec_runtime)
+ unsigned long long sum_exec_runtime)
{
struct cpu_timer_list *timer, *next;
cputime_t ptime = cputime_add(utime, stime);
@@ -451,10 +451,10 @@ static void cleanup_timers(struct list_h
++head;
list_for_each_entry_safe(timer, next, head, entry) {
list_del_init(&timer->entry);
- if (timer->expires.sched < exec_runtime) {
+ if (timer->expires.sched < sum_exec_runtime) {
timer->expires.sched = 0;
} else {
- timer->expires.sched -= exec_runtime;
+ timer->expires.sched -= sum_exec_runtime;
}
}
}
@@ -467,7 +467,7 @@ static void cleanup_timers(struct list_h
void posix_cpu_timers_exit(struct task_struct *tsk)
{
cleanup_timers(tsk->cpu_timers,
- tsk->utime, tsk->stime, tsk->exec_runtime);
+ tsk->utime, tsk->stime, tsk->sum_exec_runtime);
}
void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -475,7 +475,7 @@ void posix_cpu_timers_exit_group(struct
cleanup_timers(tsk->signal->cpu_timers,
cputime_add(tsk->utime, tsk->signal->utime),
cputime_add(tsk->stime, tsk->signal->stime),
- tsk->exec_runtime + tsk->signal->sum_sched_runtime);
+ tsk->sum_exec_runtime + tsk->signal->sum_sched_runtime);
}
@@ -536,7 +536,7 @@ static void process_timer_rebalance(stru
nsleft = max_t(unsigned long long, nsleft, 1);
do {
if (likely(!(t->flags & PF_EXITING))) {
- ns = t->exec_runtime + nsleft;
+ ns = t->sum_exec_runtime + nsleft;
if (t->it_sched_expires == 0 ||
t->it_sched_expires > ns) {
t->it_sched_expires = ns;
@@ -1004,7 +1004,7 @@ static void check_thread_timers(struct t
struct cpu_timer_list *t = list_entry(timers->next,
struct cpu_timer_list,
entry);
- if (!--maxfire || tsk->exec_runtime < t->expires.sched) {
+ if (!--maxfire || tsk->sum_exec_runtime < t->expires.sched) {
tsk->it_sched_expires = t->expires.sched;
break;
}
@@ -1049,7 +1049,7 @@ static void check_process_timers(struct
do {
utime = cputime_add(utime, t->utime);
stime = cputime_add(stime, t->stime);
- sum_sched_runtime += t->exec_runtime;
+ sum_sched_runtime += t->sum_exec_runtime;
t = next_thread(t);
} while (t != tsk);
ptime = cputime_add(utime, stime);
@@ -1208,7 +1208,7 @@ static void check_process_timers(struct
t->it_virt_expires = ticks;
}
- sched = t->exec_runtime + sched_left;
+ sched = t->sum_exec_runtime + sched_left;
if (sched_expires && (t->it_sched_expires == 0 ||
t->it_sched_expires > sched)) {
t->it_sched_expires = sched;
@@ -1300,7 +1300,7 @@ void run_posix_cpu_timers(struct task_st
if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
(tsk->it_sched_expires == 0 ||
- tsk->exec_runtime < tsk->it_sched_expires))
+ tsk->sum_exec_runtime < tsk->it_sched_expires))
return;
#undef UNEXPIRED
Index: linux/kernel/rtmutex.c
===================================================================
--- linux.orig/kernel/rtmutex.c
+++ linux/kernel/rtmutex.c
@@ -337,7 +337,7 @@ static inline int try_to_steal_lock(stru
* interrupted, so we would delay a waiter with higher
* priority as current->normal_prio.
*
- * Note: in the rare case of a SCHED_FAIR task changing
+ * Note: in the rare case of a SCHED_OTHER task changing
* its priority and thus stealing the lock, next->task
* might be current:
*/
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -101,8 +101,10 @@ unsigned long long __attribute__((weak))
#define MIN_TIMESLICE max(5 * HZ / 1000, 1)
#define DEF_TIMESLICE (100 * HZ / 1000)
-#define TASK_PREEMPTS_CURR(p, rq) \
- ((p)->prio < (rq)->curr->prio)
+static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
+{
+ p->sched_class->check_preempt_curr(rq, p);
+}
#define SCALE_PRIO(x, prio) \
max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
@@ -227,7 +229,7 @@ char * sched_print_task_state(struct tas
P(exec_start);
P(last_ran);
P(wait_runtime);
- P(exec_runtime);
+ P(sum_exec_runtime);
#undef P
t0 = sched_clock();
@@ -431,38 +433,46 @@ static inline struct rq *this_rq_lock(vo
return rq;
}
-#include "sched_stats.h"
-#include "sched_rt.c"
-#include "sched_fair.c"
-#include "sched_debug.c"
+/*
+ * resched_task - mark a task 'to be rescheduled now'.
+ *
+ * On UP this means the setting of the need_resched flag, on SMP it
+ * might also involve a cross-CPU call to trigger the scheduler on
+ * the target CPU.
+ */
+#ifdef CONFIG_SMP
-#define sched_class_highest (&rt_sched_class)
+#ifndef tsk_is_polling
+#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
+#endif
-static void enqueue_task(struct rq *rq, struct task_struct *p)
+static void resched_task(struct task_struct *p)
{
- sched_info_queued(p);
- p->sched_class->enqueue_task(rq, p);
- p->on_rq = 1;
-}
+ int cpu;
-static void dequeue_task(struct rq *rq, struct task_struct *p)
-{
- p->sched_class->dequeue_task(rq, p);
- p->on_rq = 0;
-}
+ assert_spin_locked(&task_rq(p)->lock);
-static void requeue_task(struct rq *rq, struct task_struct *p)
-{
- p->sched_class->requeue_task(rq, p);
-}
+ if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
+ return;
-/*
- * __normal_prio - return the priority that is based on the static prio
- */
-static inline int __normal_prio(struct task_struct *p)
+ set_tsk_thread_flag(p, TIF_NEED_RESCHED);
+
+ cpu = task_cpu(p);
+ if (cpu == smp_processor_id())
+ return;
+
+ /* NEED_RESCHED must be visible before we test polling */
+ smp_mb();
+ if (!tsk_is_polling(p))
+ smp_send_reschedule(cpu);
+}
+#else
+static inline void resched_task(struct task_struct *p)
{
- return p->static_prio;
+ assert_spin_locked(&task_rq(p)->lock);
+ set_tsk_need_resched(p);
}
+#endif
/*
* To aid in avoiding the subversion of "niceness" due to uneven distribution
@@ -528,6 +538,41 @@ static inline void dec_nr_running(struct
dec_raw_weighted_load(rq, p);
}
+static void activate_task(struct rq *rq, struct task_struct *p);
+
+#include "sched_stats.h"
+#include "sched_rt.c"
+#include "sched_fair.c"
+#include "sched_debug.c"
+
+#define sched_class_highest (&rt_sched_class)
+
+static void enqueue_task(struct rq *rq, struct task_struct *p)
+{
+ sched_info_queued(p);
+ p->sched_class->enqueue_task(rq, p);
+ p->on_rq = 1;
+}
+
+static void dequeue_task(struct rq *rq, struct task_struct *p)
+{
+ p->sched_class->dequeue_task(rq, p);
+ p->on_rq = 0;
+}
+
+static void requeue_task(struct rq *rq, struct task_struct *p)
+{
+ p->sched_class->requeue_task(rq, p);
+}
+
+/*
+ * __normal_prio - return the priority that is based on the static prio
+ */
+static inline int __normal_prio(struct task_struct *p)
+{
+ return p->static_prio;
+}
+
/*
* Calculate the expected normal priority: i.e. priority
* without taking RT-inheritance into account. Might be
@@ -593,47 +638,6 @@ static void deactivate_task(struct rq *r
dec_nr_running(p, rq);
}
-/*
- * resched_task - mark a task 'to be rescheduled now'.
- *
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
- */
-#ifdef CONFIG_SMP
-
-#ifndef tsk_is_polling
-#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
-#endif
-
-static void resched_task(struct task_struct *p)
-{
- int cpu;
-
- assert_spin_locked(&task_rq(p)->lock);
-
- if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
- return;
-
- set_tsk_thread_flag(p, TIF_NEED_RESCHED);
-
- cpu = task_cpu(p);
- if (cpu == smp_processor_id())
- return;
-
- /* NEED_RESCHED must be visible before we test polling */
- smp_mb();
- if (!tsk_is_polling(p))
- smp_send_reschedule(cpu);
-}
-#else
-static inline void resched_task(struct task_struct *p)
-{
- assert_spin_locked(&task_rq(p)->lock);
- set_tsk_need_resched(p);
-}
-#endif
-
/**
* task_curr - is this task currently executing on a CPU?
* @p: the task in question.
@@ -1113,10 +1117,8 @@ out_activate:
* the waker guarantees that the freshly woken up task is going
* to be considered on this CPU.)
*/
- if (!sync || cpu != this_cpu) {
- if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
- }
+ if (!sync || cpu != this_cpu)
+ check_preempt_curr(rq, p);
success = 1;
out_running:
@@ -1159,7 +1161,8 @@ static void task_running_tick(struct rq
static void __sched_fork(struct task_struct *p)
{
p->wait_start_fair = p->exec_start = p->last_ran = 0;
- p->exec_runtime = p->wait_runtime = 0;
+ p->sum_exec_runtime = p->wait_runtime = 0;
+ p->sum_wait_runtime = 0;
INIT_LIST_HEAD(&p->run_list);
p->on_rq = 0;
@@ -1208,6 +1211,12 @@ void sched_fork(struct task_struct *p, i
}
/*
+ * After fork, child runs first. (default) If set to 0 then
+ * parent will (try to) run first.
+ */
+unsigned int __read_mostly sysctl_sched_child_runs_first = 1;
+
+/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
@@ -1218,15 +1227,25 @@ void fastcall wake_up_new_task(struct ta
{
unsigned long flags;
struct rq *rq;
+ int this_cpu;
rq = task_rq_lock(p, &flags);
BUG_ON(p->state != TASK_RUNNING);
+ this_cpu = smp_processor_id(); /* parent's CPU */
p->prio = effective_prio(p);
- activate_task(rq, p);
- if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
+ if (!sysctl_sched_child_runs_first || (clone_flags & CLONE_VM) ||
+ task_cpu(p) != this_cpu || !current->on_rq) {
+ activate_task(rq, p);
+ } else {
+ /*
+ * Let the scheduling class do new task startup
+ * management (if any):
+ */
+ p->sched_class->task_new(rq, p);
+ }
+ check_preempt_curr(rq, p);
task_rq_unlock(rq, &flags);
}
@@ -1559,8 +1578,7 @@ static void pull_task(struct rq *src_rq,
* Note that idle threads have a prio of MAX_PRIO, for this test
* to be always true for them.
*/
- if (TASK_PREEMPTS_CURR(p, this_rq))
- resched_task(this_rq->curr);
+ check_preempt_curr(this_rq, p);
}
/*
@@ -2467,7 +2485,7 @@ DEFINE_PER_CPU(struct kernel_stat, kstat
EXPORT_PER_CPU_SYMBOL(kstat);
/*
- * Return current->exec_runtime plus any more ns on the sched_clock
+ * Return current->sum_exec_runtime plus any more ns on the sched_clock
* that have not yet been banked.
*/
unsigned long long current_sched_runtime(const struct task_struct *p)
@@ -2476,7 +2494,7 @@ unsigned long long current_sched_runtime
unsigned long flags;
local_irq_save(flags);
- ns = p->exec_runtime + sched_clock() - p->last_ran;
+ ns = p->sum_exec_runtime + sched_clock() - p->last_ran;
local_irq_restore(flags);
return ns;
@@ -3176,8 +3194,9 @@ void rt_mutex_setprio(struct task_struct
if (task_running(rq, p)) {
if (p->prio > oldprio)
resched_task(rq->curr);
- } else if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
+ } else {
+ check_preempt_curr(rq, p);
+ }
}
task_rq_unlock(rq, &flags);
}
@@ -3469,8 +3488,9 @@ recheck:
if (task_running(rq, p)) {
if (p->prio > oldprio)
resched_task(rq->curr);
- } else if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
+ } else {
+ check_preempt_curr(rq, p);
+ }
}
__task_rq_unlock(rq);
spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -4183,8 +4203,7 @@ static int __migrate_task(struct task_st
if (p->on_rq) {
deactivate_task(rq_src, p);
activate_task(rq_dest, p);
- if (TASK_PREEMPTS_CURR(p, rq_dest))
- resched_task(rq_dest->curr);
+ check_preempt_curr(rq_dest, p);
}
ret = 1;
out:
Index: linux/kernel/sched_debug.c
===================================================================
--- linux.orig/kernel/sched_debug.c
+++ linux/kernel/sched_debug.c
@@ -51,10 +51,10 @@ print_task(struct seq_file *m, struct rq
p->prio,
p->nice_offset,
p->hog_limit,
- p->wait_start_fair,
+ p->wait_start_fair - rq->fair_clock,
p->exec_start,
- p->last_ran,
- p->exec_runtime);
+ p->sum_exec_runtime,
+ p->sum_wait_runtime);
}
static void print_rq(struct seq_file *m, struct rq *rq, u64 now)
@@ -66,10 +66,10 @@ static void print_rq(struct seq_file *m,
"\nrunnable tasks:\n"
" task PID tree-key delta waiting"
" switches prio nice-offset hog-limit wstart-fair exec-start"
- " last-ran exec-runtime\n"
- "------------------------------------------------------------------"
- "------------------------------------------------------------------"
- "-------------------\n");
+ " sum-exec sum-wait\n"
+ "---------------------------------------------------------"
+ "--------------------------------------------------------------------"
+ "--------------------------\n");
curr = first_fair(rq);
while (curr) {
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -27,15 +27,9 @@ static void __enqueue_task_fair(struct r
{
struct rb_node **link = &rq->tasks_timeline.rb_node;
struct rb_node *parent = NULL;
+ long long key = p->fair_key;
struct task_struct *entry;
int leftmost = 1;
- long long key;
-
- key = rq->fair_clock - p->wait_runtime;
- if (unlikely(p->nice_offset))
- key += p->nice_offset / (rq->nr_running + 1);
-
- p->fair_key = key;
/*
* Find the right place in the rbtree:
@@ -48,9 +42,9 @@ static void __enqueue_task_fair(struct r
* the same key stay together.
*/
if (key < entry->fair_key) {
- link = &(*link)->rb_left;
+ link = &parent->rb_left;
} else {
- link = &(*link)->rb_right;
+ link = &parent->rb_right;
leftmost = 0;
}
}
@@ -138,7 +132,7 @@ static inline void update_curr(struct rq
delta_exec = convert_delta(rq, now - curr->exec_start, curr);
delta_fair = delta_exec/rq->nr_running;
- curr->exec_runtime += delta_exec;
+ curr->sum_exec_runtime += delta_exec;
curr->exec_start = now;
rq->fair_clock += delta_fair;
@@ -182,6 +176,11 @@ update_stats_enqueue(struct rq *rq, stru
*/
if (p != rq->curr)
update_stats_wait_start(rq, p, now);
+
+ /*
+ * Update the key:
+ */
+ p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
}
/*
@@ -195,6 +194,7 @@ static inline void update_stats_wait_end
delta = scale_nice_down(rq, p, delta);
p->wait_runtime += delta;
+ p->sum_wait_runtime += delta;
rq->wait_runtime += delta;
p->wait_start_fair = 0;
@@ -275,6 +275,24 @@ static void requeue_task_fair(struct rq
p->on_rq = 1;
}
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
+{
+ struct task_struct *curr = rq->curr;
+ long long __delta = curr->fair_key - p->fair_key;
+
+ /*
+ * Take scheduling granularity into account - do not
+ * preempt the current task unless the best task has
+ * a larger than sched_granularity fairness advantage:
+ */
+ if (p->prio < curr->prio ||
+ __delta > (unsigned long long)sysctl_sched_granularity)
+ resched_task(curr);
+}
+
static struct task_struct * pick_next_task_fair(struct rq *rq)
{
struct task_struct *p = __pick_next_task_fair(rq);
@@ -362,25 +380,36 @@ static void task_tick_fair(struct rq *rq
* Dequeue and enqueue the task to update its
* position within the tree:
*/
- dequeue_task_fair(rq, curr);
- curr->on_rq = 0;
- enqueue_task_fair(rq, curr);
- curr->on_rq = 1;
+ requeue_task_fair(rq, curr);
/*
* Reschedule if another task tops the current one.
- *
- * Take scheduling granularity into account - do not
- * preempt the current task unless the best task has
- * a larger than sched_granularity fairness advantage:
*/
next = __pick_next_task_fair(rq);
- if (next != curr) {
- unsigned long long __delta = curr->fair_key - next->fair_key;
+ if (next != curr)
+ check_preempt_curr(rq, next);
+}
- if (__delta > (unsigned long long)sysctl_sched_granularity)
- set_tsk_need_resched(curr);
- }
+/*
+ * Share the fairness runtime between parent and child, thus the
+ * total amount of pressure for CPU stays equal - new tasks
+ * get a chance to run but frequent forkers are not allowed to
+ * monopolize the CPU. Note: the parent runqueue is locked,
+ * the child is not running yet.
+ */
+static void task_new_fair(struct rq *rq, struct task_struct *p)
+{
+ sched_info_queued(p);
+ update_stats_enqueue(rq, p);
+ /*
+ * Child runs first: we let it run before the parent
+ * until it reschedules once. We set up a key so that
+ * it will preempt the parent:
+ */
+ p->fair_key = current->fair_key - sysctl_sched_granularity - 1;
+ __enqueue_task_fair(rq, p);
+ p->on_rq = 1;
+ inc_nr_running(p, rq);
}
static inline long
@@ -418,6 +447,8 @@ hog_limit(struct rq *rq, struct task_str
return -(long long)limit;
}
+#define NICE_OFFSET_GRANULARITY 100000
+
/*
* Calculate and cache the nice offset and the hog limit values:
*/
@@ -441,12 +472,15 @@ struct sched_class fair_sched_class __re
.dequeue_task = dequeue_task_fair,
.requeue_task = requeue_task_fair,
+ .check_preempt_curr = check_preempt_curr_fair,
+
.pick_next_task = pick_next_task_fair,
.put_prev_task = put_prev_task_fair,
.load_balance_start = load_balance_start_fair,
.load_balance_next = load_balance_next_fair,
.task_tick = task_tick_fair,
+ .task_new = task_new_fair,
.task_init = task_init_fair,
};
Index: linux/kernel/sched_rt.c
===================================================================
--- linux.orig/kernel/sched_rt.c
+++ linux/kernel/sched_rt.c
@@ -34,6 +34,15 @@ static void requeue_task_rt(struct rq *r
list_move_tail(&p->run_list, array->queue + p->prio);
}
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p)
+{
+ if (p->prio < rq->curr->prio)
+ resched_task(rq->curr);
+}
+
static struct task_struct * pick_next_task_rt(struct rq *rq)
{
struct prio_array *array = &rq->active;
@@ -140,6 +149,15 @@ static void task_tick_rt(struct rq *rq,
}
}
+/*
+ * No parent/child timeslice management necessary for RT tasks,
+ * just activate them:
+ */
+static void task_new_rt(struct rq *rq, struct task_struct *p)
+{
+ activate_task(rq, p);
+}
+
static void task_init_rt(struct rq *rq, struct task_struct *p)
{
}
@@ -149,6 +167,8 @@ static struct sched_class rt_sched_class
.dequeue_task = dequeue_task_rt,
.requeue_task = requeue_task_rt,
+ .check_preempt_curr = check_preempt_curr_rt,
+
.pick_next_task = pick_next_task_rt,
.put_prev_task = put_prev_task_rt,
@@ -156,5 +176,6 @@ static struct sched_class rt_sched_class
.load_balance_next = load_balance_next_rt,
.task_tick = task_tick_rt,
+ .task_new = task_new_rt,
.task_init = task_init_rt,
};
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -222,6 +222,14 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
{
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_child_runs_first",
+ .data = &sysctl_sched_child_runs_first,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
.ctl_name = KERN_PANIC,
.procname = "panic",
.data = &panic_timeout,
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
` (2 preceding siblings ...)
2007-04-17 4:53 ` Gene Heskett
@ 2007-04-17 6:46 ` Peter Williams
2007-04-17 7:51 ` William Lee Irwin III
2007-04-17 9:53 ` Ingo Molnar
3 siblings, 2 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-17 6:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko
Ingo Molnar wrote:
> this is the second release of the CFS (Completely Fair Scheduler)
> patchset, against v2.6.21-rc7:
>
> http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>
> i'd like to thank everyone for the tremendous amount of feedback and
> testing the v1 patch got - i could hardly keep up with just reading the
> mails! Some of the stuff people addressed i couldnt implement yet, i
> mostly concentrated on bugs, regressions and debuggability.
Have you considered using rq->raw_weighted_load instead of
rq->nr_running in calculating fair_clock? This would take the nice
value (or RT priority) of the other tasks into account when determining
what's fair.
Peter
PS You'd have to change the migration thread's load_weight from 0 to 1
in order to prevent divide by zero without having to explicitly check
for it every time.
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 4:06 ` Peter Williams
@ 2007-04-17 6:49 ` Ingo Molnar
0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 6:49 UTC (permalink / raw)
To: Peter Williams
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko
* Peter Williams <pwil3058@bigpond.net.au> wrote:
> Can I make a suggestion?
>
> Would it be possible (from now on) to publish changes relevant to the
> previous patch (eventually leading to a series of patches that
> describes the evolution of the new scheduler) so that it's easier for
> us reviewers/critics to see the latest changes. E.g. if I import such
> changes into something like quilt (using my gquilt GUI wrapper, of
> course :-)) I can then use meld (or similar) to follow what's going on
> as suggestions get folded in and bugs get fixed etc.
the v1 patch is still downloadable so you can do a delta by first
applying the v1 patch to a quilt queue, doing a 'quilt snapshot', then
'quilt pop', add the v2 patch to the series file, do a 'quilt push',
then doing a "quilt diff --snapshot". (I just posted the delta patch in
this thread so you can pick it from there too.)
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 6:18 ` Ingo Molnar
@ 2007-04-17 7:01 ` Ingo Molnar
2007-04-17 7:31 ` Davide Libenzi
` (2 more replies)
2007-04-17 8:03 ` Davide Libenzi
2007-04-17 16:12 ` Gene Heskett
2 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 7:01 UTC (permalink / raw)
To: Gene Heskett
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko
* Ingo Molnar <mingo@elte.hu> wrote:
> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
> small. One difference is the child-runs-first fix. To restore the
> parent-runs-first logic, do this:
>
> echo 0 > /proc/sys/kernel/sched_child_runs_first
>
> does this make any difference?
ok, i've got something better to test: i separated the delta out into a
more finegrained stack of 3 patches. You can pick them up from:
http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
i test-built and test-booted all 4 steps of this. The baseline -v2-rc0
patch should be the one that works - you might want to double-check it,
just to be sure. One of the other 3 patches ontop of this baseline
causes the regression on your desktop. My current bet is on preempt-fix,
so i have put that one first. The other one would be the second patch,
child-runs-first. The misc patch should have no effect on behavior - but
i've included it for completeness. (and i was wrong about the 'nice
fix', it is not in this delta)
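
The preempt-fix and child-runs-first changes both act through the fair key, so it may help to see the two pieces side by side. What follows is a minimal user-space sketch modelled on check_preempt_curr_fair() and task_new_fair() from the delta patch. The struct, the model_* names and any granularity value passed in are illustrative assumptions. The sketch also compares the delta as a signed value, which is an assumption about the intent: the patch itself casts the granularity to unsigned long long.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the task fields the check looks at. */
struct model_task {
    int prio;            /* lower value means higher priority */
    long long fair_key;  /* timeline position; a smaller key runs earlier */
};

/* Models check_preempt_curr_fair(): should task p preempt curr? */
static bool model_should_preempt(const struct model_task *curr,
                                 const struct model_task *p,
                                 long long granularity)
{
    long long delta = curr->fair_key - p->fair_key;

    /* Signed comparison is an assumption of this sketch. */
    return p->prio < curr->prio || delta > granularity;
}

/* Models the key task_new_fair() assigns so the child runs before the parent. */
static long long model_child_key(long long parent_key, long long granularity)
{
    return parent_key - granularity - 1;
}
```

Under this model a freshly forked child always clears the granularity threshold by exactly one unit, so it preempts the parent, while a task whose key is only granularity units ahead does not.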
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 5:51 ` Gene Heskett
@ 2007-04-17 7:18 ` Paolo Ornati
0 siblings, 0 replies; 37+ messages in thread
From: Paolo Ornati @ 2007-04-17 7:18 UTC (permalink / raw)
To: Gene Heskett
Cc: Willy Tarreau, Ingo Molnar, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Dmitry Adamushko
On Tue, 17 Apr 2007 01:51:08 -0400
Gene Heskett <gene.heskett@gmail.com> wrote:
> FWIW, I've been using the CFQ I/O scheduler for quite a while, is it time I
> gave the AS or Deadline versions another check? They are all built in but I
> don't know how to change the default on the fly, or even if it can be done.
easy :)
# cat /sys/block/DEVICE/queue/scheduler
as noop [cfq] ...
# echo IO_SCHED > /sys/block/DEVICE/queue/scheduler
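
The sysfs file marks the active scheduler with square brackets. For a program that needs to read it rather than a shell, a minimal sketch of the parsing step could look like this; active_io_scheduler() is a hypothetical helper written for this example, and it only handles a line that has already been read from /sys/block/DEVICE/queue/scheduler:

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) scheduler name from a sysfs line such as
 * "as noop [cfq]" into out. Returns 0 on success, -1 if no bracketed
 * name is found or it does not fit into out.
 */
static int active_io_scheduler(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close || outlen == 0)
        return -1;

    len = (size_t)(close - open - 1);
    if (len >= outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```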
--
Paolo Ornati
Linux 2.6.21-rc7 on x86_64
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 7:01 ` Ingo Molnar
@ 2007-04-17 7:31 ` Davide Libenzi
2007-04-17 7:39 ` Ingo Molnar
2007-04-17 17:15 ` Gene Heskett
2007-04-17 17:22 ` Gene Heskett
2 siblings, 1 reply; 37+ messages in thread
From: Davide Libenzi @ 2007-04-17 7:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
On Tue, 17 Apr 2007, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
> > ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
> > small. One difference is the child-runs-first fix. To restore the
> > parent-runs-first logic, do this:
> >
> > echo 0 > /proc/sys/kernel/sched_child_runs_first
> >
> > does this make any difference?
>
> ok, i've got something better to test: i separated the delta out into a
> more finegrained stack of 3 patches. You can pick them up from:
>
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
Isn't that easier for everyone if you keep them as quilt series (ala
syslets)?
- Davide
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 7:31 ` Davide Libenzi
@ 2007-04-17 7:39 ` Ingo Molnar
2007-04-17 17:18 ` Gene Heskett
0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 7:39 UTC (permalink / raw)
To: Davide Libenzi
Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
* Davide Libenzi <davidel@xmailserver.org> wrote:
> > ok, i've got something better to test: i separated the delta out
> > into a more finegrained stack of 3 patches. You can pick them up
> > from:
> >
> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
>
> Isn't that easier for everyone if you keep them as quilt series (ala
> syslets)?
i _do_ have a quilt tree, but i never had the clean splitup above. Why?
Because i worked on all of these aspects (and a whole lot of other
aspects as well) in parallel during the past 2 days, back and forth,
often mixing changes, etc. and there was never any clean splitup.
Now it turned out that the clean splitup of -rc0->final delta would ease
Gene's testing so i created it. Note that this is just 30% of the total
v1->v2 delta and i just saved the work of having to do a clean splitup
of the other 70%. (and note that this splitup will be undone because it
makes no sense for any potential upstream merge at all, it's only to
ease testing for Gene)
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 6:46 ` Peter Williams
@ 2007-04-17 7:51 ` William Lee Irwin III
2007-04-17 8:16 ` Ingo Molnar
2007-04-17 8:30 ` Peter Williams
2007-04-17 9:53 ` Ingo Molnar
1 sibling, 2 replies; 37+ messages in thread
From: William Lee Irwin III @ 2007-04-17 7:51 UTC (permalink / raw)
To: Peter Williams
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
Dmitry Adamushko
Ingo Molnar wrote:
>> this is the second release of the CFS (Completely Fair Scheduler)
>> patchset, against v2.6.21-rc7:
>> http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>> i'd like to thank everyone for the tremendous amount of feedback and
>> testing the v1 patch got - i could hardly keep up with just reading the
>> mails! Some of the stuff people addressed i couldnt implement yet, i
>> mostly concentrated on bugs, regressions and debuggability.
On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
> Have you considered using rq->raw_weighted_load instead of
> rq->nr_running in calculating fair_clock? This would take the nice
> value (or RT priority) of the other tasks into account when determining
> what's fair.
I suspect you mean (curr->load_weight*delta_exec)/rq->raw_weighted_load
in update_curr().
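
As a rough illustration of the two variants under discussion, here is a minimal user-space sketch. The struct rq_model type, the function names and the zero-divisor guards are assumptions made for the example; only the two formulas come from this thread and from update_curr() in the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the runqueue fields mentioned in the thread. */
struct rq_model {
    unsigned long nr_running;        /* number of runnable tasks */
    unsigned long raw_weighted_load; /* sum of the runnable tasks' load_weight */
};

/* Current -v2 form: every runnable task counts equally. */
static uint64_t delta_fair_unweighted(const struct rq_model *rq,
                                      uint64_t delta_exec)
{
    if (rq->nr_running == 0)  /* guard added for the sketch */
        return 0;
    return delta_exec / rq->nr_running;
}

/* Suggested form: scale by the running task's share of the total weight. */
static uint64_t delta_fair_weighted(const struct rq_model *rq,
                                    unsigned long curr_load_weight,
                                    uint64_t delta_exec)
{
    if (rq->raw_weighted_load == 0)  /* guard added for the sketch */
        return 0;
    return ((uint64_t)curr_load_weight * delta_exec) / rq->raw_weighted_load;
}
```

With two runnable tasks of (arbitrary) weights 3 and 1 and delta_exec = 4000, the unweighted form yields 2000 whichever task is running, while the weighted form yields 3000 for the heavier task and 1000 for the lighter one.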
-- wli
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 6:18 ` Ingo Molnar
2007-04-17 7:01 ` Ingo Molnar
@ 2007-04-17 8:03 ` Davide Libenzi
2007-04-17 8:18 ` Nick Piggin
2007-04-17 8:20 ` Ingo Molnar
2007-04-17 16:12 ` Gene Heskett
2 siblings, 2 replies; 37+ messages in thread
From: Davide Libenzi @ 2007-04-17 8:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
On Tue, 17 Apr 2007, Ingo Molnar wrote:
> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
> small. One difference is the child-runs-first fix. To restore the
> parent-runs-first logic, do this:
>
> echo 0 > /proc/sys/kernel/sched_child_runs_first
Sorry, I did not follow the latest developments, but how many tunables do we
have so far in CFS? Are those for debug only or are they supposed to stay?
Weren't those listed inside the Axis of Evil (just to remain in topic :)
till yesterday?
- Davide
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 7:51 ` William Lee Irwin III
@ 2007-04-17 8:16 ` Ingo Molnar
2007-04-17 8:52 ` Ingo Molnar
2007-04-17 14:05 ` Peter Williams
2007-04-17 8:30 ` Peter Williams
1 sibling, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:16 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
Dmitry Adamushko
* William Lee Irwin III <wli@holomorphy.com> wrote:
> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>
> > Have you considered using rq->raw_weighted_load instead of
> > rq->nr_running in calculating fair_clock? This would take the nice
> > value (or RT priority) of the other tasks into account when
> > determining what's fair.
>
> I suspect you mean
> (curr->load_weight*delta_exec)/rq->raw_weighted_load in update_curr().
good idea, i'll try that.
Ingo
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:03 ` Davide Libenzi
@ 2007-04-17 8:18 ` Nick Piggin
2007-04-17 8:26 ` Ingo Molnar
2007-04-17 8:20 ` Ingo Molnar
1 sibling, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2007-04-17 8:18 UTC (permalink / raw)
To: Davide Libenzi
Cc: Ingo Molnar, Gene Heskett, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
On Tue, Apr 17, 2007 at 01:03:46AM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, Ingo Molnar wrote:
>
> > ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
> > small. One difference is the child-runs-first fix. To restore the
> > parent-runs-first logic, do this:
> >
> > echo 0 > /proc/sys/kernel/sched_child_runs_first
>
> Sorry, I did not follow the latest developments, but how many tunables we
> have so far in CFS? Are those for debug only or they're supposed to stay?
> Weren't those listed inside the Axis of Evil (just to remain in topic :)
> till yesterday?
Actually I think this is something that makes sense to add, even if
just for debugging, but maybe also for production, depending on how
much it impacts things. Child runs first is an heuristic optimisation
that exploits a VM detail (however fundamental). But for things that
don't exec right after forking (and maybe some things that do), it
can be nicer to reduce context switches, improve cache patterns, and
allow children to be load balanced away before touching memory, if
child_runs_first is turned off.
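
The conditions under which the v2 patch actually takes the child-first path can be read off wake_up_new_task(). A minimal sketch of that predicate follows; struct new_task_ctx and takes_child_first_path() are hypothetical names for this example, and MODEL_CLONE_VM mirrors the value of the CLONE_VM flag.

```c
#include <stdbool.h>

#define MODEL_CLONE_VM 0x00000100UL  /* same value as CLONE_VM */

/* Hypothetical bundle of the inputs wake_up_new_task() tests in the patch. */
struct new_task_ctx {
    unsigned int child_runs_first; /* value of the sched_child_runs_first sysctl */
    unsigned long clone_flags;     /* flags passed to clone()/fork() */
    int child_cpu;                 /* CPU the child was placed on */
    int this_cpu;                  /* CPU the parent is running on */
    bool parent_on_rq;             /* is the parent still on the runqueue? */
};

/*
 * True when the patch would call sched_class->task_new() (child-first
 * path); false when it falls back to a plain activate_task().
 */
static bool takes_child_first_path(const struct new_task_ctx *c)
{
    if (!c->child_runs_first)
        return false;
    if (c->clone_flags & MODEL_CLONE_VM)  /* threads share the VM */
        return false;
    if (c->child_cpu != c->this_cpu)
        return false;
    if (!c->parent_on_rq)
        return false;
    return true;
}
```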
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:03 ` Davide Libenzi
2007-04-17 8:18 ` Nick Piggin
@ 2007-04-17 8:20 ` Ingo Molnar
1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:20 UTC (permalink / raw)
To: Davide Libenzi
Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
* Davide Libenzi <davidel@xmailserver.org> wrote:
> > ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
> > small. One difference is the child-runs-first fix. To restore the
> > parent-runs-first logic, do this:
> >
> > echo 0 > /proc/sys/kernel/sched_child_runs_first
>
> Sorry, I did not follow the latest developments, but how many tunables
> we have so far in CFS? Are those for debug only or they're supposed to
> stay?
yeah, debug only. I strongly suspect the Kaffeine breakage for example
was related to child-runs-first, so userspace developers might be
interested in a switch to turn this on/off.
while reviewing the upstream scheduler it occurred to me that we are
probably _not_ doing child-runs-first there due to the list_add_tail()
[it should be a list_add() for it to be child-first. But i haven't
instrumented this heavily and this portion of the mainline scheduler is
pretty fragile.]. So via this flag we could also see the performance
impact, besides the compatibility impact.
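
To make the ordering question concrete, here is a minimal user-space sketch of the two list primitives. It reimplements a circular doubly-linked list along the lines of the kernel's list.h; list_insert_between() and list_position() are helper names invented for the example. It models only where a node lands relative to the node passed as second argument, not the scheduler's pick logic.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link node between two neighbours that are currently adjacent. */
static void list_insert_between(struct list_head *node,
                                struct list_head *prev,
                                struct list_head *next)
{
    next->prev = node;
    node->next = next;
    node->prev = prev;
    prev->next = node;
}

/* Insert node right after head. */
static void list_add(struct list_head *node, struct list_head *head)
{
    list_insert_between(node, head, head->next);
}

/* Insert node right before head. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
    list_insert_between(node, head->prev, head);
}

/* Index of node when walking forward from the queue head, or -1. */
static int list_position(const struct list_head *queue,
                         const struct list_head *node)
{
    const struct list_head *it;
    int i = 0;

    for (it = queue->next; it != queue; it = it->next, i++)
        if (it == node)
            return i;
    return -1;
}
```

When the second argument is another entry's node rather than the queue head, list_add_tail() places the new node immediately before that entry in a forward walk, and list_add() places it immediately after.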
> Weren't those listed inside the Axis of Evil (just to remain in topic
> :) till yesterday?
heh ;)
Ingo
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:18 ` Nick Piggin
@ 2007-04-17 8:26 ` Ingo Molnar
2007-04-17 8:41 ` Nick Piggin
0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Davide Libenzi, Gene Heskett, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
* Nick Piggin <npiggin@suse.de> wrote:
> Actually I think this is something that makes sense to add, even if
> just for debugging, but maybe also for production, depending on how
> much it impacts things. Child runs first is an heuristic optimisation
> that exploits a VM detail (however fundamental). But for things that
> don't exec right after forking (and maybe some things that do), it can
> be nicer to reduce context switches, improve cache patterns, and allow
> children to be load balanced away before touching memory, if
> child_runs_first is turned off.
yeah, the primary intent was debug. Nick, am i confused to conclude that
mainline in fact runs the _parent_ first, despite all the elaborate
runqueue juggling we do there? This piece of code in wake_up_new_task()
caught my eyes:
p->prio = current->prio;
p->normal_prio = current->normal_prio;
list_add_tail(&p->run_list, &current->run_list);
p->array = current->array;
p->array->nr_active++;
inc_nr_running(p, rq);
shouldnt the list_add_tail() be list_add(), so that task pickup sees the
child first? Maybe we still do child-runs-first in practice, due to the
timeslice and sleep average fixups that happen if the parent preempts,
but the above piece of code seems a quite elaborate way of doing
activate_task(). To have the child _before_ the parent we'd need the
add-on patch below. But ... i could be wrong, this is just a quick
thought.
Ingo
---
kernel/sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1685,7 +1685,7 @@ void fastcall wake_up_new_task(struct ta
else {
p->prio = current->prio;
p->normal_prio = current->normal_prio;
- list_add_tail(&p->run_list, &current->run_list);
+ list_add(&p->run_list, &current->run_list);
p->array = current->array;
p->array->nr_active++;
inc_nr_running(p, rq);
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 7:51 ` William Lee Irwin III
2007-04-17 8:16 ` Ingo Molnar
@ 2007-04-17 8:30 ` Peter Williams
2007-04-18 19:15 ` Peter Williams
1 sibling, 1 reply; 37+ messages in thread
From: Peter Williams @ 2007-04-17 8:30 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
Dmitry Adamushko
William Lee Irwin III wrote:
> Ingo Molnar wrote:
>>> this is the second release of the CFS (Completely Fair Scheduler)
>>> patchset, against v2.6.21-rc7:
>>> http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>>> i'd like to thank everyone for the tremendous amount of feedback and
>>> testing the v1 patch got - i could hardly keep up with just reading the
>>> mails! Some of the stuff people addressed i couldnt implement yet, i
>>> mostly concentrated on bugs, regressions and debuggability.
>
> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>> Have you considered using rq->raw_weighted_load instead of
>> rq->nr_running in calculating fair_clock? This would take the nice
>> value (or RT priority) of the other tasks into account when determining
>> what's fair.
>
> I suspect you mean (curr->load_weight*delta_exec)/rq->raw_weighted_load
> in update_curr().
Or something like that, yes. :-)
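The expression quoted above can be tried out in isolation. The following is only a userspace sketch in C: the function name weighted_delta, the integer types, the zero-load fallback and the overflow assertion are assumptions made for illustration, not code from the CFS patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of (curr->load_weight * delta_exec) / rq->raw_weighted_load.
 * All names and types here are hypothetical stand-ins.
 */
static uint64_t weighted_delta(unsigned long load_weight,
                               uint64_t delta_exec,
                               unsigned long raw_weighted_load)
{
    /* Assumption: with no recorded load, leave the delta unscaled. */
    if (raw_weighted_load == 0)
        return delta_exec;

    /* Keep the 64-bit multiplication below from overflowing. */
    assert(delta_exec <= UINT64_MAX / (load_weight ? load_weight : 1));

    return ((uint64_t)load_weight * delta_exec) / raw_weighted_load;
}
```

With a task weight of 128 on a runqueue whose total weight is 256, a delta_exec of 1000 is scaled to 500, i.e. the task is charged in proportion to its share of the weighted load rather than by a plain task count.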
I was trying to make the point that the weighted load stuff provides
useful data for implementing nice (in a number of ways e.g. see spa_ebs).
Also, now that the old time slices are gone, a simpler more efficient
function for mapping RT priority or nice (as appropriate) to
p->load_weight can be used instead of the current one which uses the
time slice the task would have been allocated as a basis. I'd suggest
the function that the current one replaced. (Because it was mine :-)).
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:26 ` Ingo Molnar
@ 2007-04-17 8:41 ` Nick Piggin
2007-04-17 8:57 ` Ingo Molnar
0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2007-04-17 8:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Gene Heskett, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
On Tue, Apr 17, 2007 at 10:26:31AM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <npiggin@suse.de> wrote:
>
> > Actually I think this is something that makes sense to add, even if
> > just for debugging, but maybe also for production, depending on how
> > much it impacts things. Child runs first is an heuristic optimisation
> > that exploits a VM detail (however fundamental). But for things that
> > don't exec right after forking (and maybe some things that do), it can
> > be nicer to reduce context switches, improve cache patterns, and allow
> > children to be load balanced away before touching memory, if
> > child_runs_first is turned off.
>
> yeah, the primary intent was debug. Nick, am i confused to conclude that
> mainline in fact runs the _parent_ first, despite all the elaborate
> runqueue juggling we do there? This piece of code in wake_up_new_task()
> caught my eyes:
>
> p->prio = current->prio;
> p->normal_prio = current->normal_prio;
> list_add_tail(&p->run_list, &current->run_list);
> p->array = current->array;
> p->array->nr_active++;
> inc_nr_running(p, rq);
>
> shouldnt the list_add_tail() be list_add(), so that task pickup sees the
> child first? Maybe we still do child-runs-first in practice, due to the
> timeslice and sleep average fixups that happen if the parent preempts,
> but the above piece of code seems a quite elaborate way of doing
> activate_task(). To have the child _before_ the parent we'd need the
> add-on patch below. But ... i could be wrong, this is just a quick
> thought.
I think that it works because the list we're adding to is not the
normal runqueue list head, but the parent's list_head on that runqueue.
Which adds the child directly ahead of the parent... I think?
>
> Ingo
>
> ---
> kernel/sched.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -1685,7 +1685,7 @@ void fastcall wake_up_new_task(struct ta
> else {
> p->prio = current->prio;
> p->normal_prio = current->normal_prio;
> - list_add_tail(&p->run_list, &current->run_list);
> + list_add(&p->run_list, &current->run_list);
> p->array = current->array;
> p->array->nr_active++;
> inc_nr_running(p, rq);
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:16 ` Ingo Molnar
@ 2007-04-17 8:52 ` Ingo Molnar
2007-04-17 14:05 ` Peter Williams
1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:52 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
Dmitry Adamushko
* Ingo Molnar <mingo@elte.hu> wrote:
> * William Lee Irwin III <wli@holomorphy.com> wrote:
>
> > On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
> >
> > > Have you considered using rq->raw_weighted_load instead of
> > > rq->nr_running in calculating fair_clock? This would take the
> > > nice value (or RT priority) of the other tasks into account when
> > > determining what's fair.
> >
> > I suspect you mean
> > (curr->load_weight*delta_exec)/rq->raw_weighted_load in
> > update_curr().
>
> good idea, i'll try that.
i'll try another thing too: we could perhaps get rid of rq->nr_running
and only use raw_weighted_load, because now the only main remaining
property of ->nr_running is "is it zero or not".
[ ->nr_running's only other significant use is 'group_capacity', but in
reality it is only interested in whether all CPUs in the group are
busy and what the combined cpu power of that group is, and this could
be restructured to use rq->curr and cpu_power - and become independent
of nr_running. ]
[ then there are other details like load-average, but we could change
that to be weighted-cpu-load driven - that makes sense anyway: a
reniced task should have less effect on the 'system load' than a
non-reniced task. ]
that would be one less variable to maintain in the scheduler hotpath,
and it would make smpnice an effective _replacement_ for nr_running,
instead of an add-on thing that costs a bit of performance.
Ingo
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:41 ` Nick Piggin
@ 2007-04-17 8:57 ` Ingo Molnar
0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:57 UTC (permalink / raw)
To: Nick Piggin
Cc: Davide Libenzi, Gene Heskett, Linux Kernel Mailing List,
Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
* Nick Piggin <npiggin@suse.de> wrote:
> > list_add_tail(&p->run_list, &current->run_list);
[...]
> > shouldnt the list_add_tail() be list_add(), so that task pickup sees
> > the child first? [...]
[...]
> I think that it works because the list we're adding to is not the
> normal runqueue list head, but the parent's list_head on that
> runqueue. Which adds the child directly ahead of the parent... I
> think?
yeah, you are right, i was confused: list_add() adds _after_ the head,
list_add_tail() adds _before_ the head - and in the middle of the list
if we do a list_add_tail() it adds before that entry. So everything's
fine and working as expected :)
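To make the before/after semantics concrete, here is a minimal userspace sketch in C. It is a simplified re-implementation for illustration, not the kernel's list header; the helper names list_init, list_insert_between and list_position and the parameter name item are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of a circular doubly-linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link 'item' between two nodes that are currently adjacent. */
static void list_insert_between(struct list_head *item,
                                struct list_head *prev,
                                struct list_head *next)
{
    next->prev = item;
    item->next = next;
    item->prev = prev;
    prev->next = item;
}

/* Insert 'item' directly AFTER 'pos'. */
static void list_add(struct list_head *item, struct list_head *pos)
{
    list_insert_between(item, pos, pos->next);
}

/* Insert 'item' directly BEFORE 'pos'. */
static void list_add_tail(struct list_head *item, struct list_head *pos)
{
    list_insert_between(item, pos->prev, pos);
}

/* 0-based position of 'node' walking forward from 'head', or -1. */
static int list_position(const struct list_head *head,
                         const struct list_head *node)
{
    const struct list_head *p;
    int i = 0;

    assert(head != NULL && node != NULL);
    for (p = head->next; p != head; p = p->next, i++)
        if (p == node)
            return i;
    return -1;
}
```

If a parent node is already queued and a child is inserted with list_add_tail(&child, &parent), a forward walk from the queue head reaches the child first. That is the case discussed above: 'pos' is the parent's own list entry, not the runqueue's list head.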
Ingo
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-16 22:12 ` S.Çağlar Onur
@ 2007-04-17 8:59 ` Ingo Molnar
2007-04-17 14:45 ` S.Çağlar Onur
0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 8:59 UTC (permalink / raw)
To: S.Çağlar Onur
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko
* S.Çağlar Onur <caglar@pardus.org.tr> wrote:
> On Tuesday 17 April 2007, Ingo Molnar wrote:
> > - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
> > flag can be used to turn it on/off. (This might fix the Kaffeine bug
> > reported by S.Çağlar Onur <)
>
> Sorry for delayed response but i just find some free time, do you
> still want me to test mainline + "parent-runs first" patch or will i
> drop that one and test v2 which can change default behaviour?
i suspect for now it would be sufficient if you could check the v2
patch.
if it _works_, please try this:
echo 0 > /proc/sys/kernel/sched_child_runs_first
this should break Kaffeine again :)
(if it doesnt work then the Kaffeine problem is unrelated to
child-runs-first.)
Ingo
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 6:46 ` Peter Williams
2007-04-17 7:51 ` William Lee Irwin III
@ 2007-04-17 9:53 ` Ingo Molnar
1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 9:53 UTC (permalink / raw)
To: Peter Williams
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko
* Peter Williams <pwil3058@bigpond.net.au> wrote:
> Have you considered using rq->raw_weighted_load instead of
> rq->nr_running in calculating fair_clock? This would take the nice
> value (or RT priority) of the other tasks into account when
> determining what's fair.
>
> Peter
> PS You'd have to change the migration thread's load_weight from 0 to 1
> in order to prevent divide by zero without having to explicitly check
> for it every time.
yeah - nice idea, i'll try this.
Ingo
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:16 ` Ingo Molnar
2007-04-17 8:52 ` Ingo Molnar
@ 2007-04-17 14:05 ` Peter Williams
1 sibling, 0 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-17 14:05 UTC (permalink / raw)
To: Ingo Molnar
Cc: William Lee Irwin III, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau,
Gene Heskett, Dmitry Adamushko
Ingo Molnar wrote:
> * William Lee Irwin III <wli@holomorphy.com> wrote:
>
>> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>>
>>> Have you considered using rq->raw_weighted_load instead of
>>> rq->nr_running in calculating fair_clock? This would take the nice
>>> value (or RT priority) of the other tasks into account when
>>> determining what's fair.
>> I suspect you mean
>> (curr->load_weight*delta_exec)/rq->raw_weighted_load in update_curr().
>
> good idea, i'll try that.
In the longer term, I'd suggest modifying this idea to use the maximum
of rq->raw_weighted_load and a running average of rq->raw_weighted_load
much the same as was done within the load balancer code. This will tend
to make scheduling "smoother". To try the idea out you could (on an SMP
system) use one of the rq->cpu_load[] metrics as the running average.
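As a rough illustration of the "maximum of the instantaneous value and a running average" idea, here is a small C sketch. The exponential average used here is an assumption chosen for simplicity and is not claimed to match how rq->cpu_load[] is maintained; the names running_avg and smoothed_load are hypothetical.

```c
#include <assert.h>

/*
 * Move 'avg' toward 'sample' by 1/2^shift of the difference.
 * This decay rule is an assumption, not the kernel's cpu_load[] update.
 */
static unsigned long running_avg(unsigned long avg, unsigned long sample,
                                 unsigned int shift)
{
    assert(shift < 8 * sizeof(unsigned long));
    if (sample >= avg)
        return avg + ((sample - avg) >> shift);
    return avg - ((avg - sample) >> shift);
}

/* Use whichever is larger: the current load or its running average. */
static unsigned long smoothed_load(unsigned long raw_weighted_load,
                                   unsigned long avg_load)
{
    return raw_weighted_load > avg_load ? raw_weighted_load : avg_load;
}
```

Taking the maximum means a sudden drop in load is followed only gradually (the average decays slowly), while a sudden rise takes effect immediately, which is one way to read the "smoother" behaviour described above.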
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:59 ` Ingo Molnar
@ 2007-04-17 14:45 ` S.Çağlar Onur
2007-04-17 15:48 ` Gabriel C
2007-04-17 16:01 ` Ingo Molnar
0 siblings, 2 replies; 37+ messages in thread
From: S.Çağlar Onur @ 2007-04-17 14:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko
On Tuesday 17 April 2007, Ingo Molnar wrote:
> > Sorry for delayed response but i just find some free time, do you
> > still want me to test mainline + "parent-runs first" patch or will i
> > drop that one and test v2 which can change default behaviour?
>
> i suspect for now it would be sufficient if you could check the v2
> patch.
>
> if it _works_, please try this:
>
> echo 0 > /proc/sys/kernel/sched_child_runs_first
>
> this should break Kaffeine again :)
>
> (if it doesnt work then the Kaffeine problem is unrelated to
> child-runs-first.)
OK, i tested both plain -rc7 and -rc7 + CFSv2, with
sched_child_runs_first both enabled and disabled.
I always use the same video file and try to reproduce the freeze by constantly
pressing the forward/backward buttons. With CFS, 2-3 forward/backward attempts
reproduce this behaviour.
And here are the results.
Mainline still has no issues with both xine-lib/kaffeine and xine-ui
(kaffeine-0.8.4, xine-lib-1.1.5 [both xcb enabled], xine-ui-0.99.4). I really
try hard to reproduce the freeze, but i can't...
And CFSv2 still fails for both child_runs_first and parent_runs_first cases
with same strace output (FUTEX_WAIT).
If you want me to test something else just ask please :)
Cheers
--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/
Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 14:45 ` S.Çağlar Onur
@ 2007-04-17 15:48 ` Gabriel C
2007-04-17 16:01 ` Ingo Molnar
1 sibling, 0 replies; 37+ messages in thread
From: Gabriel C @ 2007-04-17 15:48 UTC (permalink / raw)
To: caglar
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
Peter Williams, Thomas Gleixner, Willy Tarreau, Gene Heskett,
Dmitry Adamushko
S.Çağlar Onur wrote:
> On Tuesday 17 April 2007, Ingo Molnar wrote:
>
>>> Sorry for delayed response but i just find some free time, do you
>>> still want me to test mainline + "parent-runs first" patch or will i
>>> drop that one and test v2 which can change default behaviour?
>>>
>> i suspect for now it would be sufficient if you could check the v2
>> patch.
>>
>> if it _works_, please try this:
>>
>> echo 0 > /proc/sys/kernel/sched_child_runs_first
>>
>> this should break Kaffeine again :)
>>
>> (if it doesnt work then the Kaffeine problem is unrelated to
>> child-runs-first.)
>>
>
> OK, i tested both plain -rc7 and -rc7 + CFSv2, with
> sched_child_runs_first both enabled and disabled.
>
> I always use the same video file and try to reproduce the freeze by constantly
> pressing the forward/backward buttons. With CFS, 2-3 forward/backward attempts
> reproduce this behaviour.
>
> And here are the results.
>
> Mainline still has no issues with both xine-lib/kaffeine and xine-ui
> (kaffeine-0.8.4, xine-lib-1.1.5 [both xcb enabled], xine-ui-0.99.4). I really
> try hard to reproduce the freeze, but i can't...
>
> And CFSv2 still fails for both child_runs_first and parent_runs_first cases
> with same strace output (FUTEX_WAIT).
>
I have the same problem here ( same packages ).
Even with VLC, if I go forward/backward and then play again, it starts to
freeze randomly here, but only for 1 - 2 seconds maybe.
> If you want me to test something else just ask please :)
>
> Cheers
>
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 14:45 ` S.Çağlar Onur
2007-04-17 15:48 ` Gabriel C
@ 2007-04-17 16:01 ` Ingo Molnar
1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 16:01 UTC (permalink / raw)
To: S.Çağlar Onur
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko,
Christophe Thommeret, Christoph Pfister, Jurgen Kofler
* S.Çağlar Onur <caglar@pardus.org.tr> wrote:
> If you want me to test something else just ask please :)
yes, it would be nice to do a:
strace -o kaffine.log -f -tttTTT kaffeine
log. Because in your old log this is visible:
clone(child_stack=0xb02394a4,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0xb0239bd8, {entry_number:6, base_addr:0xb0239b90,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1},
child_tidptr=0xb0239bd8) = 11340
futex(0x89ac218, FUTEX_WAKE, 1) = 1
we cloned a task and immediately afterwards we used futex 0x89ac218.
After that point many things happen, but the lockup itself:
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
is the same futex. Probably related to the same child thread? It would
be nice to also get a gdb backtrace:
gdb kaffeine
<reproduce the hang>
Ctrl-C
bt
this should give you a gdb backtrace of that kaffeine hang. Thanks,
Ingo
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 6:18 ` Ingo Molnar
2007-04-17 7:01 ` Ingo Molnar
2007-04-17 8:03 ` Davide Libenzi
@ 2007-04-17 16:12 ` Gene Heskett
2 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 16:12 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko
On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Gene Heskett <gene.heskett@gmail.com> wrote:
>> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo. v2-rc0 was
>> much better. Watching amanda run with htop, kmail's composer is being
>> subjected to 5 to 10 second pauses, and htop says that gzip -best
>> isn't getting more than 15% of the cpu, and the /amandatapes drive is
>> being written to in a regular pattern that seems to be the cause of
>> the pauses according to gkrellm, which also seems to track the size of
>> the writes, and can show anything from 4.3k to 54 megs as being
>> written in one cycle of its screen update.
>
>ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
>small. One difference is the child-runs-first fix. To restore the
>parent-runs-first logic, do this:
>
I'm running 21-rc7-CFS-v2-rc0.1 now.
> echo 0 > /proc/sys/kernel/sched_child_runs_first
This is currently a 1. Reset to 0 by the above.
>does this make any difference?
Hard to tell, not much running except fetchmail/procmail and this composer.
>
>If not then pretty much the only other change was the nice level tweak i
>did. Could you try to grab a few snapshots of scheduling state via
>something like:
>
> while sleep 1; do cat /proc/sched_debug >> to-ingo.txt; done
The crf1.txt is with it=1, the cfr0.txt is with it zeroed.
>(and tell me the PID of the kmail composer, to make sure i'm checking
>the right task's behavior.)
And I let the crf0 version run longer as I was looking for the composer's pid,
but htop (or I) can't see it. Even a ps -e isn't seeing it! But it's
running, I'm actively typing in it. So you get 3 files, the third one called
ps-e.txt, in private mail. I thought it was called composer, I really did.
>
>also, as a separate experiment, could you perhaps run this script as
>root:
>
> cd /proc; for N in [1-9]*; do renice -n 0 $N; done
>
>this will move all tasks in the system to nice level 0 and should make
>any nice level handling logic in the scheduler irrelevant. Do you have X
>reniced perhaps?
>
>Lots of system threads have negative or positive nice levels, so once
>you have executed this script, only a reboot will be a practical way to
>restore it to the previous settings.
>
> Ingo
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I have many CHARTS and DIAGRAMS..
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 7:01 ` Ingo Molnar
2007-04-17 7:31 ` Davide Libenzi
@ 2007-04-17 17:15 ` Gene Heskett
2007-04-17 17:22 ` Gene Heskett
2 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 17:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko
On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <mingo@elte.hu> wrote:
>> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
>> small. One difference is the child-runs-first fix. To restore the
>> parent-runs-first logic, do this:
>>
>> echo 0 > /proc/sys/kernel/sched_child_runs_first
>>
>> does this make any difference?
>
>ok, i've got something better to test: i separated the delta out into a
>more finegrained stack of 3 patches. You can pick them up from:
>
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
>
Ahh, so many cats, and so few recipes here Ingo. In this case cats=patches &
recipes=time to test adequately. I do have another box, but it would
probably take a week & about a big buck to get that old rh7.3 brought up to
date & suitable, and its only a 500MHZ K-III, which might make the diffs more
obvious. It would need a video card to replace its dinosaur Diamond and a
fresh dvd drive. And its motherboard has very buggy usb chips. TYAN S-1590.
Never could get anything bigger than a mouse packet through them.
>i test-built and test-booted all 4 steps of this. The baseline -v2-rc0
>patch should be the one that works - you might want to double-check it,
>just to be sure. One of the other 3 patches ontop of this baseline
>causes the regression on your desktop. My current bet is on preempt-fix,
>so i have put that one first. The other one would be the second patch,
>child-runs-first. The misc patch should have no effect on behavior - but
>i've included it for completeness. (and i was wrong about the 'nice
>fix', it is not in this delta)
>
> Ingo
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Q: What's the difference between a dead dog in the road and a dead
lawyer in the road?
A: There are skid marks in front of the dog.
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 7:39 ` Ingo Molnar
@ 2007-04-17 17:18 ` Gene Heskett
0 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 17:18 UTC (permalink / raw)
To: Ingo Molnar
Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
Willy Tarreau, Dmitry Adamushko
On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Davide Libenzi <davidel@xmailserver.org> wrote:
>> > ok, i've got something better to test: i separated the delta out
>> > into a more finegrained stack of 3 patches. You can pick them up
>> > from:
>> >
>> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
>> >
>> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
>> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
>> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
>>
>> Isn't that easier for everyone if you keep them as quilt series (ala
>> syslets)?
>
>i _do_ have a quilt tree, but i never had the clean splitup above. Why?
>Because i worked on all of these aspects (and a whole lot of other
>aspects as well) in parallel during the past 2 days, back and forth,
>often mixing changes, etc. and there was never any clean splitup.
>
>Now it turned out that the clean splitup of -rc0->final delta would ease
>Gene's testing so i created it. Note that this is just 30% of the total
>v1->v2 delta and i just saved the work of having to do a clean splitup
>of the other 70%. (and note that this splitup will be undone because it
>makes no sense for any potential upstream merge at all, it's only to
>ease testing for Gene)
Now he tells me. :-) But I have some CHO stuff to do, so it will be about 36
hours before I can get back to this.
> Ingo
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Support the Girl Scouts!
(Today's Brownie is tomorrow's Cookie!)
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 7:01 ` Ingo Molnar
2007-04-17 7:31 ` Davide Libenzi
2007-04-17 17:15 ` Gene Heskett
@ 2007-04-17 17:22 ` Gene Heskett
2 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 17:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko
On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <mingo@elte.hu> wrote:
>> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
>> small. One difference is the child-runs-first fix. To restore the
>> parent-runs-first logic, do this:
>>
>> echo 0 > /proc/sys/kernel/sched_child_runs_first
>>
>> does this make any difference?
>
>ok, i've got something better to test: i separated the delta out into a
>more finegrained stack of 3 patches. You can pick them up from:
>
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
>
Got them all saved Ingo, but it will be late tomorrow before I can play again.
>i test-built and test-booted all 4 steps of this. The baseline -v2-rc0
>patch should be the one that works - you might want to double-check it,
>just to be sure. One of the other 3 patches ontop of this baseline
>causes the regression on your desktop. My current bet is on preempt-fix,
>so i have put that one first. The other one would be the second patch,
>child-runs-first. The misc patch should have no effect on behavior - but
>i've included it for completeness. (and i was wrong about the 'nice
>fix', it is not in this delta)
>
> Ingo
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
This life is yours. Some of it was given to you; the rest, you made yourself.
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 5:25 ` Willy Tarreau
` (2 preceding siblings ...)
2007-04-17 6:27 ` Ingo Molnar
@ 2007-04-18 0:06 ` Peter Williams
3 siblings, 0 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-18 0:06 UTC (permalink / raw)
To: Willy Tarreau
Cc: Gene Heskett, Ingo Molnar, linux-kernel, Linus Torvalds,
Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
Arjan van de Ven, Thomas Gleixner, caglar, Dmitry Adamushko
Willy Tarreau wrote:
> Hi Gene,
>
> On Tue, Apr 17, 2007 at 12:53:56AM -0400, Gene Heskett wrote:
>> On Monday 16 April 2007, Ingo Molnar wrote:
>>> this is the second release of the CFS (Completely Fair Scheduler)
>>> patchset, against v2.6.21-rc7:
>>>
>>> http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>>>
>>> i'd like to thank everyone for the tremendous amount of feedback and
>>> testing the v1 patch got - i could hardly keep up with just reading the
>>> mails! Some of the stuff people addressed i couldnt implement yet, i
>>> mostly concentrated on bugs, regressions and debuggability.
>>>
>>> there's a fair amount of churn:
>>>
>>> 15 files changed, 456 insertions(+), 241 deletions(-)
>>>
>>> But it's an encouraging sign that there was no crash bug found in v1,
>>> all the bugs were related to scheduling-behavior details. The code was
>>> tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
>>> code size increase in -v2 is due to debugging helpers, they'll be
>>> removed later. (The new /proc/sched_debug file can be used to see the
>>> fine details of CFS scheduling.)
>>>
>>> Changes since -v1:
>>>
>>> - make nice levels less starvable. (reported by Willy Tarreau)
>>>
>>> - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
>>> flag can be used to turn it on/off. (This might fix the Kaffeine bug
>>> reported by S.Çağlar Onur)
>>>
>>> - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)
>>>
>>> - UP build fix. (reported by Gabriel C)
>>>
>>> - timer tick micro-optimization (Dmitry Adamushko)
>>>
>>> - preemption fix: sched_class->check_preempt_curr method to decide
>>> whether to preempt after a wakeup (or at a timer tick). (Found via a
>>> fairness-test-utility written for CFS by Mike Galbraith)
>>>
>>> - start forked children with neutral statistics instead of trying to
>>> inherit them from the parent: Willy Tarreau reported that this
>>> results in better behavior on extreme workloads, and it also
>>> simplifies the code quite nicely. Removed sched_exit() and the
>>> ->task_exit() methods.
>>>
>>> - make nice levels independent of the sched_granularity value
>>>
>>> - new /proc/sched_debug file listing runqueue details and the rbtree
>>>
>>> - new SCH-* fields in /proc/<NR>/status to see scheduling details
>>>
>>> - new cpu-hog feature (off by default) and sysctl tunable to set it:
>>> /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
>>> 0 (off). Positive values are meant the maximum 'memory' that the
>>> scheduler has of CPU hogs.
>>>
>>> - various code cleanups
>>>
>>> - added more statistics temporarily: sum_exec_runtime,
>>> sum_wait_runtime.
>>>
>>> - added -CFS-v2 to EXTRAVERSION
>>>
>>> as usual, any sort of feedback, bugreports, fixes and suggestions are
>>> more than welcome,
>>>
>>> Ingo
>> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo. v2-rc0 was much
>> better. Watching amanda run with htop, kmails composer is being subjected to
>> 5 to 10 second pauses, and htop says that gzip -best isn't getting more that
>> 15% of the cpu, and the /amandatapes drive is being written to in a regular
>> pattern that seems to be the cause of the pauses according to gkrellm, which
>> also seems to track the size of the writes, and can show anything from 4.3k
>> to 54 megs as being written in one cycle of its screen update.
>
> Have you tried previous version with the fair-fork patch ? It might be possible
> that your workload is sensible to the fork()'s child getting much CPU upon
> startup.
>
> Ingo, maybe I'm saying something stupid, but in my userland scheduler, when
> new tasks are "forked", they are queued at the end of the run queue with a
> fixed priority. In our case, this would translate into assigning them the
> same prio and timeslice as their parent, but queuing them at the end so that
> they don't make existing tasks starve during huge fork() loads.
>
> I don't know how that would be possible (nor if that would help in anything),
> but I found it was a good compromise over sharing the timeslice with the
> parent. Perhaps we should have some absolute timeslice and some relative
> timeslice (eg: X percent of total time divided by the number of tasks) ?
One way of handling forked tasks is to give them a high priority but a
small chunk (i.e. give them a relatively short time to do some work and
surrender the CPU voluntarily before you boot them off). If you choose
the size of this reduced chunk well, the vast majority of tasks will
never be booted off: they will do a small bit of work, then either exit
or sleep, and will suffer no penalty as a result of this mechanism. But it
gives you a chance to move any newly forked process that turns out to be
a CPU hog to a lower priority before it gets its next chunk of CPU. At
that point it can revert to getting normal size chunks, as pre-emption
will stop it hogging the CPU from then on.
I've trialled this mechanism in some of my schedulers and it works well.
I found that 10 milliseconds was a good value for the initial chunk of
CPU for a newly forked process.
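A minimal sketch of that mechanism, for illustration only. None of these
names come from the CFS patch or from any existing scheduler: struct
fork_task, fork_task_init(), fork_task_chunk_done() and NORMAL_CHUNK_NS are
hypothetical, and only the 10 millisecond initial chunk is taken from the
text above. The sketch also assumes that a lower prio value means a higher
priority and that the chunk size reverts to normal after the first chunk
whether or not the task was demoted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 10 ms initial chunk, the value reported above. */
#define INITIAL_CHUNK_NS 10000000ULL
/* Assumed normal chunk size, chosen only for illustration. */
#define NORMAL_CHUNK_NS  100000000ULL

/* Hypothetical per-task state; not a kernel structure. */
struct fork_task {
	int prio;               /* lower value = higher priority (assumption) */
	uint64_t chunk_ns;      /* CPU time allowed before being booted off */
	bool on_initial_chunk;  /* true until the first chunk has ended */
};

/* Start a newly forked child at high priority with a short chunk. */
static void fork_task_init(struct fork_task *child, int high_prio)
{
	assert(child != NULL);
	child->prio = high_prio;
	child->chunk_ns = INITIAL_CHUNK_NS;
	child->on_initial_chunk = true;
}

/*
 * Called when the task leaves the CPU after running for ran_ns.
 * A task that used up its whole initial chunk had to be booted off, so
 * it is treated as a CPU hog and moved to demoted_prio. A task that
 * exited or slept earlier keeps its priority and suffers no penalty.
 * Returns true if the task was demoted.
 */
static bool fork_task_chunk_done(struct fork_task *t, uint64_t ran_ns,
				 int demoted_prio)
{
	bool demoted = false;

	assert(t != NULL);
	if (t->on_initial_chunk) {
		if (ran_ns >= t->chunk_ns) {
			t->prio = demoted_prio;
			demoted = true;
		}
		/* From now on the task gets normal size chunks. */
		t->chunk_ns = NORMAL_CHUNK_NS;
		t->on_initial_chunk = false;
	}
	return demoted;
}
```

A child that sleeps after 2 ms keeps the priority it was forked with,
while one that runs for the full 10 ms is demoted before it gets its next
chunk.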
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
* Re: [patch] CFS (Completely Fair Scheduler), v2
2007-04-17 8:30 ` Peter Williams
@ 2007-04-18 19:15 ` Peter Williams
0 siblings, 0 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-18 19:15 UTC (permalink / raw)
To: William Lee Irwin III, Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko
Peter Williams wrote:
> William Lee Irwin III wrote:
>> Ingo Molnar wrote:
>>>> this is the second release of the CFS (Completely Fair Scheduler)
>>>> patchset, against v2.6.21-rc7:
>>>> http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>>>> i'd like to thank everyone for the tremendous amount of feedback and
>>>> testing the v1 patch got - i could hardly keep up with just reading
>>>> the mails! Some of the stuff people addressed i couldnt implement
>>>> yet, i mostly concentrated on bugs, regressions and debuggability.
>>
>> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>>> Have you considered using rq->raw_weighted_load instead of
>>> rq->nr_running in calculating fair_clock? This would take the nice
>>> value (or RT priority) of the other tasks into account when
>>> determining what's fair.
>>
>> I suspect you mean (curr->load_weight*delta_exec)/rq->raw_weighted_load
>> in update_curr().
>
> Or something like that, yes. :-)
Actually, this formula can't be used for the migration thread itself as
its load_weight isn't an accurate reflection of its static priority.
But as the migration thread is a real time task this probably isn't an
issue, right?
If this assumption is correct (i.e. curr is never a real time task) then
my earlier caveat re division by zero being possible is invalid because
the migration task will never be the only task on the runqueue when this
code is called.
I'm also assuming here that (because of its name) curr is already on the
runqueue when this code is called. If it isn't the divisor in the above
expression should be (rq->raw_weighted_load + curr->load_weight). This
would also preclude the possibility of divide by zero.
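To make the two divisor cases concrete, here is a rough sketch. It is not
the update_curr() code from the patch: the struct layouts, the function
name weighted_delta() and the curr_on_rq flag are hypothetical, and 64-bit
fields are assumed so the example stands alone. Only the names
raw_weighted_load, load_weight and delta_exec come from the discussion
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the runqueue and task; not kernel types. */
struct wl_rq {
	uint64_t raw_weighted_load;  /* sum of load_weight of queued tasks */
};

struct wl_task {
	uint64_t load_weight;        /* weight derived from the nice level */
};

/*
 * Scale delta_exec by curr's share of the runqueue load:
 *
 *   (curr->load_weight * delta_exec) / divisor
 *
 * If curr is already counted in rq->raw_weighted_load the divisor is
 * that load. If it is not, curr's own weight is added. Either way the
 * divisor is at least curr->load_weight, so it cannot be zero as long
 * as curr has a non-zero weight.
 */
static uint64_t weighted_delta(const struct wl_rq *rq,
			       const struct wl_task *curr,
			       uint64_t delta_exec, bool curr_on_rq)
{
	uint64_t divisor = rq->raw_weighted_load;

	if (!curr_on_rq)
		divisor += curr->load_weight;

	assert(divisor != 0);
	/* The product can overflow for very large inputs; ignored here. */
	return (curr->load_weight * delta_exec) / divisor;
}
```

With curr as the only task on the runqueue the result is simply
delta_exec. With curr holding one third of the total weight, it is one
third of delta_exec.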
Peter
--
Peter Williams pwil3058@bigpond.net.au
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Thread overview: 37+ messages
2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
2007-04-16 22:12 ` S.Çağlar Onur
2007-04-17 8:59 ` Ingo Molnar
2007-04-17 14:45 ` S.Çağlar Onur
2007-04-17 15:48 ` Gabriel C
2007-04-17 16:01 ` Ingo Molnar
2007-04-17 4:06 ` Peter Williams
2007-04-17 6:49 ` Ingo Molnar
2007-04-17 4:53 ` Gene Heskett
2007-04-17 5:25 ` Willy Tarreau
2007-04-17 5:51 ` Gene Heskett
2007-04-17 7:18 ` Paolo Ornati
2007-04-17 5:51 ` Mike Galbraith
2007-04-17 6:27 ` Ingo Molnar
2007-04-18 0:06 ` Peter Williams
2007-04-17 6:18 ` Ingo Molnar
2007-04-17 7:01 ` Ingo Molnar
2007-04-17 7:31 ` Davide Libenzi
2007-04-17 7:39 ` Ingo Molnar
2007-04-17 17:18 ` Gene Heskett
2007-04-17 17:15 ` Gene Heskett
2007-04-17 17:22 ` Gene Heskett
2007-04-17 8:03 ` Davide Libenzi
2007-04-17 8:18 ` Nick Piggin
2007-04-17 8:26 ` Ingo Molnar
2007-04-17 8:41 ` Nick Piggin
2007-04-17 8:57 ` Ingo Molnar
2007-04-17 8:20 ` Ingo Molnar
2007-04-17 16:12 ` Gene Heskett
2007-04-17 6:46 ` Peter Williams
2007-04-17 7:51 ` William Lee Irwin III
2007-04-17 8:16 ` Ingo Molnar
2007-04-17 8:52 ` Ingo Molnar
2007-04-17 14:05 ` Peter Williams
2007-04-17 8:30 ` Peter Williams
2007-04-18 19:15 ` Peter Williams
2007-04-17 9:53 ` Ingo Molnar