All of lore.kernel.org
 help / color / mirror / Atom feed
* [patch] CFS (Completely Fair Scheduler), v2
@ 2007-04-16 22:07 Ingo Molnar
  2007-04-16 22:12 ` S.Çağlar Onur
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-16 22:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko


this is the second release of the CFS (Completely Fair Scheduler) 
patchset, against v2.6.21-rc7:

   http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch

i'd like to thank everyone for the tremendous amount of feedback and 
testing the v1 patch got - i could hardly keep up with just reading the 
mails! Some of the stuff people addressed i couldnt implement yet, i 
mostly concentrated on bugs, regressions and debuggability.

there's a fair amount of churn:

   15 files changed, 456 insertions(+), 241 deletions(-)

But it's an encouraging sign that there was no crash bug found in v1, 
all the bugs were related to scheduling-behavior details. The code was 
tested on 3 architectures so far: i686, x86_64 and ia64. Most of the 
code size increase in -v2 is due to debugging helpers, they'll be 
removed later. (The new /proc/sched_debug file can be used to see the 
fine details of CFS scheduling.)

Changes since -v1:

 - make nice levels less starvable. (reported by Willy Tarreau)

 - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first 
   flag can be used to turn it on/off. (This might fix the Kaffeine bug
   reported by S.Çağlar Onur <)

 - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)

 - UP build fix. (reported by Gabriel C)

 - timer tick micro-optimization (Dmitry Adamushko)

 - preemption fix: sched_class->check_preempt_curr method to decide 
   whether to preempt after a wakeup (or at a timer tick). (Found via a
   fairness-test-utility written for CFS by Mike Galbraith)

 - start forked children with neutral statistics instead of trying to 
   inherit them from the parent: Willy Tarreau reported that this 
   results in better behavior on extreme workloads, and it also 
   simplifies the code quite nicely. Removed sched_exit() and the 
   ->task_exit() methods.

 - make nice levels independent of the sched_granularity value

 - new /proc/sched_debug file listing runqueue details and the rbtree

 - new SCH-* fields in /proc/<NR>/status to see scheduling details

 - new cpu-hog feature (off by default) and sysctl tunable to set it: 
   /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
   0 (off). Positive values are meant the maximum 'memory' that the 
   scheduler has of CPU hogs.

 - various code cleanups

 - added more statistics temporarily: sum_exec_runtime, 
   sum_wait_runtime.

 - added -CFS-v2 to EXTRAVERSION

as usual, any sort of feedback, bugreports, fixes and suggestions are 
more than welcome,

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
@ 2007-04-16 22:12 ` S.Çağlar Onur
  2007-04-17  8:59   ` Ingo Molnar
  2007-04-17  4:06 ` Peter Williams
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: S.Çağlar Onur @ 2007-04-16 22:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko

[-- Attachment #1: Type: text/plain, Size: 621 bytes --]

17 Nis 2007 Sal tarihinde, Ingo Molnar şunları yazmıştı: 
>  - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
>    flag can be used to turn it on/off. (This might fix the Kaffeine bug
>    reported by S.Çağlar Onur <)

Sorry for delayed response but i just find some free time, do you still want 
me to test mainline + "parent-runs first" patch or will i drop that one and 
test v2 which can change default behaviour?

-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
  2007-04-16 22:12 ` S.Çağlar Onur
@ 2007-04-17  4:06 ` Peter Williams
  2007-04-17  6:49   ` Ingo Molnar
  2007-04-17  4:53 ` Gene Heskett
  2007-04-17  6:46 ` Peter Williams
  3 siblings, 1 reply; 37+ messages in thread
From: Peter Williams @ 2007-04-17  4:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko

Ingo Molnar wrote:
> this is the second release of the CFS (Completely Fair Scheduler) 
> patchset, against v2.6.21-rc7:
> 
>    http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
> 
> i'd like to thank everyone for the tremendous amount of feedback and 
> testing the v1 patch got - i could hardly keep up with just reading the 
> mails! Some of the stuff people addressed i couldnt implement yet, i 
> mostly concentrated on bugs, regressions and debuggability.

Can I make a suggestion?

Would it be possible (from now on) to publish changes relevant to the 
previous patch (eventually leading to a series of patches that describes 
the evolution of the new scheduler) so that it's easier for us 
reviewers/critics to see the latest changes.  E.g. if import such 
changes into something like quilt (using my gquilt GUI wrapper, of 
course :-)) I can then use meld (or similar) to follow what's going as 
suggestions get folded in and bugs get fixed etc.

Thanks
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
  2007-04-16 22:12 ` S.Çağlar Onur
  2007-04-17  4:06 ` Peter Williams
@ 2007-04-17  4:53 ` Gene Heskett
  2007-04-17  5:25   ` Willy Tarreau
  2007-04-17  6:18   ` Ingo Molnar
  2007-04-17  6:46 ` Peter Williams
  3 siblings, 2 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17  4:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko

On Monday 16 April 2007, Ingo Molnar wrote:
>this is the second release of the CFS (Completely Fair Scheduler)
>patchset, against v2.6.21-rc7:
>
>   http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>
>i'd like to thank everyone for the tremendous amount of feedback and
>testing the v1 patch got - i could hardly keep up with just reading the
>mails! Some of the stuff people addressed i couldnt implement yet, i
>mostly concentrated on bugs, regressions and debuggability.
>
>there's a fair amount of churn:
>
>   15 files changed, 456 insertions(+), 241 deletions(-)
>
>But it's an encouraging sign that there was no crash bug found in v1,
>all the bugs were related to scheduling-behavior details. The code was
>tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
>code size increase in -v2 is due to debugging helpers, they'll be
>removed later. (The new /proc/sched_debug file can be used to see the
>fine details of CFS scheduling.)
>
>Changes since -v1:
>
> - make nice levels less starvable. (reported by Willy Tarreau)
>
> - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
>   flag can be used to turn it on/off. (This might fix the Kaffeine bug
>   reported by S.Çağlar Onur <)
>
> - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)
>
> - UP build fix. (reported by Gabriel C)
>
> - timer tick micro-optimization (Dmitry Adamushko)
>
> - preemption fix: sched_class->check_preempt_curr method to decide
>   whether to preempt after a wakeup (or at a timer tick). (Found via a
>   fairness-test-utility written for CFS by Mike Galbraith)
>
> - start forked children with neutral statistics instead of trying to
>   inherit them from the parent: Willy Tarreau reported that this
>   results in better behavior on extreme workloads, and it also
>   simplifies the code quite nicely. Removed sched_exit() and the
>   ->task_exit() methods.
>
> - make nice levels independent of the sched_granularity value
>
> - new /proc/sched_debug file listing runqueue details and the rbtree
>
> - new SCH-* fields in /proc/<NR>/status to see scheduling details
>
> - new cpu-hog feature (off by default) and sysctl tunable to set it:
>   /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
>   0 (off). Positive values are meant the maximum 'memory' that the
>   scheduler has of CPU hogs.
>
> - various code cleanups
>
> - added more statistics temporarily: sum_exec_runtime,
>   sum_wait_runtime.
>
> - added -CFS-v2 to EXTRAVERSION
>
>as usual, any sort of feedback, bugreports, fixes and suggestions are
>more than welcome,
>
>	Ingo

This one (v2-rc2) is not a keeper I'm sorry to say, Ingo.  v2-rc0 was much 
better.  Watching amanda run with htop, kmails composer is being subjected to 
5 to 10 second pauses, and htop says that gzip -best isn't getting more that 
15% of the cpu, and the /amandatapes drive is being written to in a regular 
pattern that seems to be the cause of the pauses  according to gkrellm, which 
also seems to track the size of the writes, and can show anything from 4.3k 
to 54 megs as being written in one cycle of its screen update.

Normally hdd will fire up and take it at about 40+M/second steady till its 
done when there is a file ready to write even if its a 7GB file.  And I can 
type right on during the disk i/o.  But not now.

In short, I seem to be heavily I/O bound.  But when the write to /dev/hdd3 is 
done, then gzip -best pops right up to 90% plus cpu and I get my machine 
back.

In between file writes I checked the drives speed with hdparm:

root@coyote ~]# hdparm -Tt /dev/hdd

/dev/hdd:
 Timing cached reads:   856 MB in  2.01 seconds = 426.15 MB/sec
 Timing buffered disk reads:  222 MB in  3.01 seconds =  73.68 MB/sec

That's not too shabby, and obviously dma is active at least for the reading.

gzip -best was running while this was executing. So I think the drive is fine 
and the scheduling is whats funkity.  Sorry.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
  After they got rid of capital punishment, they had to hang twice
  as many people as before.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  4:53 ` Gene Heskett
@ 2007-04-17  5:25   ` Willy Tarreau
  2007-04-17  5:51     ` Gene Heskett
                       ` (3 more replies)
  2007-04-17  6:18   ` Ingo Molnar
  1 sibling, 4 replies; 37+ messages in thread
From: Willy Tarreau @ 2007-04-17  5:25 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko

Hi Gene,

On Tue, Apr 17, 2007 at 12:53:56AM -0400, Gene Heskett wrote:
> On Monday 16 April 2007, Ingo Molnar wrote:
> >this is the second release of the CFS (Completely Fair Scheduler)
> >patchset, against v2.6.21-rc7:
> >
> >   http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
> >
> >i'd like to thank everyone for the tremendous amount of feedback and
> >testing the v1 patch got - i could hardly keep up with just reading the
> >mails! Some of the stuff people addressed i couldnt implement yet, i
> >mostly concentrated on bugs, regressions and debuggability.
> >
> >there's a fair amount of churn:
> >
> >   15 files changed, 456 insertions(+), 241 deletions(-)
> >
> >But it's an encouraging sign that there was no crash bug found in v1,
> >all the bugs were related to scheduling-behavior details. The code was
> >tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
> >code size increase in -v2 is due to debugging helpers, they'll be
> >removed later. (The new /proc/sched_debug file can be used to see the
> >fine details of CFS scheduling.)
> >
> >Changes since -v1:
> >
> > - make nice levels less starvable. (reported by Willy Tarreau)
> >
> > - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
> >   flag can be used to turn it on/off. (This might fix the Kaffeine bug
> >   reported by S.Ça??lar Onur <)
> >
> > - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)
> >
> > - UP build fix. (reported by Gabriel C)
> >
> > - timer tick micro-optimization (Dmitry Adamushko)
> >
> > - preemption fix: sched_class->check_preempt_curr method to decide
> >   whether to preempt after a wakeup (or at a timer tick). (Found via a
> >   fairness-test-utility written for CFS by Mike Galbraith)
> >
> > - start forked children with neutral statistics instead of trying to
> >   inherit them from the parent: Willy Tarreau reported that this
> >   results in better behavior on extreme workloads, and it also
> >   simplifies the code quite nicely. Removed sched_exit() and the
> >   ->task_exit() methods.
> >
> > - make nice levels independent of the sched_granularity value
> >
> > - new /proc/sched_debug file listing runqueue details and the rbtree
> >
> > - new SCH-* fields in /proc/<NR>/status to see scheduling details
> >
> > - new cpu-hog feature (off by default) and sysctl tunable to set it:
> >   /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
> >   0 (off). Positive values are meant the maximum 'memory' that the
> >   scheduler has of CPU hogs.
> >
> > - various code cleanups
> >
> > - added more statistics temporarily: sum_exec_runtime,
> >   sum_wait_runtime.
> >
> > - added -CFS-v2 to EXTRAVERSION
> >
> >as usual, any sort of feedback, bugreports, fixes and suggestions are
> >more than welcome,
> >
> >	Ingo
> 
> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo.  v2-rc0 was much 
> better.  Watching amanda run with htop, kmails composer is being subjected to 
> 5 to 10 second pauses, and htop says that gzip -best isn't getting more that 
> 15% of the cpu, and the /amandatapes drive is being written to in a regular 
> pattern that seems to be the cause of the pauses  according to gkrellm, which 
> also seems to track the size of the writes, and can show anything from 4.3k 
> to 54 megs as being written in one cycle of its screen update.

Have you tried previous version with the fair-fork patch ? It might be possible
that your workload is sensible to the fork()'s child getting much CPU upon
startup.

Ingo, maybe I'm saying something stupid, but in my userland scheduler, when
new tasks are "forked", they are queued at the end of the run queue with a
fixed priority. In our case, this would translate into assigning them the
same prio and timeslice as their parent, but queuing them at the end so that
they don't make existing tasks starve during huge fork() loads.

I don't know how that would be possible (nor if that would help in anything),
but I found it was a good compromise over sharing the timeslice with the
parent. Perhaps we should have some absolute timeslice and some relative
timeslice (eg: X percent of total time divided by the number of tasks) ?

Regards,
Willy


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  5:25   ` Willy Tarreau
@ 2007-04-17  5:51     ` Gene Heskett
  2007-04-17  7:18       ` Paolo Ornati
  2007-04-17  5:51     ` Mike Galbraith
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Gene Heskett @ 2007-04-17  5:51 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko

On Tuesday 17 April 2007, Willy Tarreau wrote:
>Hi Gene,
>
>On Tue, Apr 17, 2007 at 12:53:56AM -0400, Gene Heskett wrote:
>> On Monday 16 April 2007, Ingo Molnar wrote:
>> >this is the second release of the CFS (Completely Fair Scheduler)
>> >patchset, against v2.6.21-rc7:
>> >
>> >   http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>> >
>> >i'd like to thank everyone for the tremendous amount of feedback and
>> >testing the v1 patch got - i could hardly keep up with just reading the
>> >mails! Some of the stuff people addressed i couldnt implement yet, i
>> >mostly concentrated on bugs, regressions and debuggability.
>> >
>> >there's a fair amount of churn:
>> >
>> >   15 files changed, 456 insertions(+), 241 deletions(-)
>> >
>> >But it's an encouraging sign that there was no crash bug found in v1,
>> >all the bugs were related to scheduling-behavior details. The code was
>> >tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
>> >code size increase in -v2 is due to debugging helpers, they'll be
>> >removed later. (The new /proc/sched_debug file can be used to see the
>> >fine details of CFS scheduling.)
>> >
>> >Changes since -v1:
>> >
>> > - make nice levels less starvable. (reported by Willy Tarreau)
>> >
>> > - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
>> >   flag can be used to turn it on/off. (This might fix the Kaffeine bug
>> >   reported by S.Ça??lar Onur <)
>> >
>> > - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)
>> >
>> > - UP build fix. (reported by Gabriel C)
>> >
>> > - timer tick micro-optimization (Dmitry Adamushko)
>> >
>> > - preemption fix: sched_class->check_preempt_curr method to decide
>> >   whether to preempt after a wakeup (or at a timer tick). (Found via a
>> >   fairness-test-utility written for CFS by Mike Galbraith)
>> >
>> > - start forked children with neutral statistics instead of trying to
>> >   inherit them from the parent: Willy Tarreau reported that this
>> >   results in better behavior on extreme workloads, and it also
>> >   simplifies the code quite nicely. Removed sched_exit() and the
>> >   ->task_exit() methods.
>> >
>> > - make nice levels independent of the sched_granularity value
>> >
>> > - new /proc/sched_debug file listing runqueue details and the rbtree
>> >
>> > - new SCH-* fields in /proc/<NR>/status to see scheduling details
>> >
>> > - new cpu-hog feature (off by default) and sysctl tunable to set it:
>> >   /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
>> >   0 (off). Positive values are meant the maximum 'memory' that the
>> >   scheduler has of CPU hogs.
>> >
>> > - various code cleanups
>> >
>> > - added more statistics temporarily: sum_exec_runtime,
>> >   sum_wait_runtime.
>> >
>> > - added -CFS-v2 to EXTRAVERSION
>> >
>> >as usual, any sort of feedback, bugreports, fixes and suggestions are
>> >more than welcome,
>> >
>> >	Ingo
>>
>> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo.  v2-rc0 was much
>> better.  Watching amanda run with htop, kmails composer is being subjected
>> to 5 to 10 second pauses, and htop says that gzip -best isn't getting more
>> that 15% of the cpu, and the /amandatapes drive is being written to in a
>> regular pattern that seems to be the cause of the pauses  according to
>> gkrellm, which also seems to track the size of the writes, and can show
>> anything from 4.3k to 54 megs as being written in one cycle of its screen
>> update.

Somewhat interesting to this, I have amanda doing a verify phase too.  During 
the verify phase (and while I was waiting for gmail to transmit this message, 
it took 30 minutes before it showed up on the list) I noted that when 
amrestore fired up, it, and its child tar were only taking about 20% of the 
cpu between them, and that /dev/hdd was showing a pretty steady 55 to 
75MB/sec being read.  As to what this tells us, I'm not going to hazard a 
guess because it wouldn't, this time of the night here in WV, USA, even be a 
SWAG.  Its coming up on 2am and the toothpicks holding my eyes open are 
sagging badly, making creaking noises even.

>Have you tried previous version with the fair-fork patch ? It might be
> possible that your workload is sensible to the fork()'s child getting much
> CPU upon startup.

Willy, I think that patch went by, and was followed by the v2-rc2 so fast that 
I never got a chance to try it with the v2-rc0 framework.  So I believe the 
answer there is probably no.  I never saw a problem with the v2-rc0, but Ingo 
shot me a message about it without enough detail that I could have tested for 
it.

FWIW, I've been using the CFQ I/O scheduler for quite a while, is it time I 
gave the AS or Deadline versions another check?  They are all built in but I 
don't know how to change the default on the fly, or even if it can be done.

>Ingo, maybe I'm saying something stupid, but in my userland scheduler, when
>new tasks are "forked", they are queued at the end of the run queue with a
>fixed priority. In our case, this would translate into assigning them the
>same prio and timeslice as their parent, but queuing them at the end so that
>they don't make existing tasks starve during huge fork() loads.
>
>I don't know how that would be possible (nor if that would help in
> anything), but I found it was a good compromise over sharing the timeslice
> with the parent. Perhaps we should have some absolute timeslice and some
> relative timeslice (eg: X percent of total time divided by the number of
> tasks) ?
>
>Regards,
>Willy

Thanks Willy.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
"I take Him shopping with me. I say, 'OK, Jesus, help me find a bargain'" 
--Tammy Faye Bakker

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  5:25   ` Willy Tarreau
  2007-04-17  5:51     ` Gene Heskett
@ 2007-04-17  5:51     ` Mike Galbraith
  2007-04-17  6:27     ` Ingo Molnar
  2007-04-18  0:06     ` Peter Williams
  3 siblings, 0 replies; 37+ messages in thread
From: Mike Galbraith @ 2007-04-17  5:51 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Gene Heskett, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko

On Tue, 2007-04-17 at 07:25 +0200, Willy Tarreau wrote:

> Have you tried previous version with the fair-fork patch ? It might be possible
> that your workload is sensible to the fork()'s child getting much CPU upon
> startup.

Dunno about that, but here's a possibly related datapoint.  I reported
to Ingo yesterday that I was sometimes losing control of my GUI (KDE)
under heavy IO.  I just reproduced it in mainline rc7.  If I start a
bonnie, and click around popping windows to the foreground, then poke
KDE's menu button, I may lose all GUI capability for a _very_ long time.
Here, with bonnie, that means until it gets past writing with putc, and
moves on to rewrite.  Ages.

	-Mike


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  4:53 ` Gene Heskett
  2007-04-17  5:25   ` Willy Tarreau
@ 2007-04-17  6:18   ` Ingo Molnar
  2007-04-17  7:01     ` Ingo Molnar
                       ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  6:18 UTC (permalink / raw)
  To: Gene Heskett
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko


* Gene Heskett <gene.heskett@gmail.com> wrote:

> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo.  v2-rc0 was 
> much better.  Watching amanda run with htop, kmails composer is being 
> subjected to 5 to 10 second pauses, and htop says that gzip -best 
> isn't getting more that 15% of the cpu, and the /amandatapes drive is 
> being written to in a regular pattern that seems to be the cause of 
> the pauses according to gkrellm, which also seems to track the size of 
> the writes, and can show anything from 4.3k to 54 megs as being 
> written in one cycle of its screen update.

ok - fortunately the delta between -v2-rc0 and -v2-final is pretty 
small. One difference is the child-runs-first fix. To restore the 
parent-runs-first logic, do this: 

	echo 0 > /proc/sys/kernel/sched_child_runs_first

does this make any difference?

If not then pretty much the only other change was the nice level tweak i 
did. Could you try to grab a few snapshots of scheduling state via 
something like:

   while sleep 1; do cat /proc/sched_debug >> to-ingo.txt; done

(and tell me the PID of the kmail composer, to make sure i'm checking 
the right task's behavior.)

also, as a separate experiment, could you perhaps run this script as 
root:

   cd /proc; for N in [1-9]*; do renice -n 0 $N; done

this will move all tasks in the system to nice level 0 and should make 
any nice level handling logic in the scheduler irrelevant. Do you have X 
reniced perhaps?

Lots of system threads have negative or positive nice levels, so once 
you have executed this script, only a reboot will be a practical way to 
restore it to the previous settings.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  5:25   ` Willy Tarreau
  2007-04-17  5:51     ` Gene Heskett
  2007-04-17  5:51     ` Mike Galbraith
@ 2007-04-17  6:27     ` Ingo Molnar
  2007-04-18  0:06     ` Peter Williams
  3 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  6:27 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Gene Heskett, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Dmitry Adamushko

[-- Attachment #1: Type: text/plain, Size: 506 bytes --]


* Willy Tarreau <w@1wt.eu> wrote:

> Have you tried previous version with the fair-fork patch ? It might be 
> possible that your workload is sensible to the fork()'s child getting 
> much CPU upon startup.

the fair-fork patch is now included in -v2, but that was already in 
-v2-rc0 too that i sent to Gene separately. I've attached the 
-rc0->final delta.

Gene, could you please apply this patch to your -v2-rc0 tree and do a 
quick double-check that indeed these changes cause the regression?

	Ingo

[-- Attachment #2: sched-cfs-v2-rc0-final-delta.patch --]
[-- Type: text/plain, Size: 22652 bytes --]

---
 include/linux/sched.h     |    7 +
 kernel/exit.c             |    2 
 kernel/posix-cpu-timers.c |   24 ++---
 kernel/rtmutex.c          |    2 
 kernel/sched.c            |  191 +++++++++++++++++++++++++---------------------
 kernel/sched_debug.c      |   14 +--
 kernel/sched_fair.c       |   80 +++++++++++++------
 kernel/sched_rt.c         |   21 +++++
 kernel/sysctl.c           |    8 +
 9 files changed, 218 insertions(+), 131 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -798,12 +798,15 @@ struct sched_class {
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p);
 	void (*requeue_task) (struct rq *rq, struct task_struct *p);
 
+	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
+
 	struct task_struct * (*pick_next_task) (struct rq *rq);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 	struct task_struct * (*load_balance_start) (struct rq *rq);
 	struct task_struct * (*load_balance_next) (struct rq *rq);
 	void (*task_tick) (struct rq *rq, struct task_struct *p);
+	void (*task_new) (struct rq *rq, struct task_struct *p);
 
 	void (*task_init) (struct rq *rq, struct task_struct *p);
 };
@@ -838,7 +841,8 @@ struct task_struct {
 	u64 last_ran;
 
 	s64 wait_runtime;
-	u64 exec_runtime, fair_key;
+	u64 sum_exec_runtime, fair_key;
+	s64 sum_wait_runtime;
 	long nice_offset;
 	s64 hog_limit;
 
@@ -1236,6 +1240,7 @@ extern char * sched_print_task_state(str
 
 extern unsigned int sysctl_sched_max_hog_history;
 extern unsigned int sysctl_sched_granularity;
+extern unsigned int sysctl_sched_child_runs_first;
 
 #ifdef CONFIG_RT_MUTEXES
 extern int rt_mutex_getprio(struct task_struct *p);
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -112,7 +112,7 @@ static void __exit_signal(struct task_st
 		sig->maj_flt += tsk->maj_flt;
 		sig->nvcsw += tsk->nvcsw;
 		sig->nivcsw += tsk->nivcsw;
-		sig->sum_sched_runtime += tsk->exec_runtime;
+		sig->sum_sched_runtime += tsk->sum_exec_runtime;
 		sig = NULL; /* Marker for below. */
 	}
 
Index: linux/kernel/posix-cpu-timers.c
===================================================================
--- linux.orig/kernel/posix-cpu-timers.c
+++ linux/kernel/posix-cpu-timers.c
@@ -161,7 +161,7 @@ static inline cputime_t virt_ticks(struc
 }
 static inline unsigned long long sched_ns(struct task_struct *p)
 {
-	return (p == current) ? current_sched_runtime(p) : p->exec_runtime;
+	return (p == current) ? current_sched_runtime(p) : p->sum_exec_runtime;
 }
 
 int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp)
@@ -249,7 +249,7 @@ static int cpu_clock_sample_group_locked
 		cpu->sched = p->signal->sum_sched_runtime;
 		/* Add in each other live thread.  */
 		while ((t = next_thread(t)) != p) {
-			cpu->sched += t->exec_runtime;
+			cpu->sched += t->sum_exec_runtime;
 		}
 		cpu->sched += sched_ns(p);
 		break;
@@ -422,7 +422,7 @@ int posix_cpu_timer_del(struct k_itimer 
  */
 static void cleanup_timers(struct list_head *head,
 			   cputime_t utime, cputime_t stime,
-			   unsigned long long exec_runtime)
+			   unsigned long long sum_exec_runtime)
 {
 	struct cpu_timer_list *timer, *next;
 	cputime_t ptime = cputime_add(utime, stime);
@@ -451,10 +451,10 @@ static void cleanup_timers(struct list_h
 	++head;
 	list_for_each_entry_safe(timer, next, head, entry) {
 		list_del_init(&timer->entry);
-		if (timer->expires.sched < exec_runtime) {
+		if (timer->expires.sched < sum_exec_runtime) {
 			timer->expires.sched = 0;
 		} else {
-			timer->expires.sched -= exec_runtime;
+			timer->expires.sched -= sum_exec_runtime;
 		}
 	}
 }
@@ -467,7 +467,7 @@ static void cleanup_timers(struct list_h
 void posix_cpu_timers_exit(struct task_struct *tsk)
 {
 	cleanup_timers(tsk->cpu_timers,
-		       tsk->utime, tsk->stime, tsk->exec_runtime);
+		       tsk->utime, tsk->stime, tsk->sum_exec_runtime);
 
 }
 void posix_cpu_timers_exit_group(struct task_struct *tsk)
@@ -475,7 +475,7 @@ void posix_cpu_timers_exit_group(struct 
 	cleanup_timers(tsk->signal->cpu_timers,
 		       cputime_add(tsk->utime, tsk->signal->utime),
 		       cputime_add(tsk->stime, tsk->signal->stime),
-		       tsk->exec_runtime + tsk->signal->sum_sched_runtime);
+		       tsk->sum_exec_runtime + tsk->signal->sum_sched_runtime);
 }
 
 
@@ -536,7 +536,7 @@ static void process_timer_rebalance(stru
 		nsleft = max_t(unsigned long long, nsleft, 1);
 		do {
 			if (likely(!(t->flags & PF_EXITING))) {
-				ns = t->exec_runtime + nsleft;
+				ns = t->sum_exec_runtime + nsleft;
 				if (t->it_sched_expires == 0 ||
 				    t->it_sched_expires > ns) {
 					t->it_sched_expires = ns;
@@ -1004,7 +1004,7 @@ static void check_thread_timers(struct t
 		struct cpu_timer_list *t = list_entry(timers->next,
 						      struct cpu_timer_list,
 						      entry);
-		if (!--maxfire || tsk->exec_runtime < t->expires.sched) {
+		if (!--maxfire || tsk->sum_exec_runtime < t->expires.sched) {
 			tsk->it_sched_expires = t->expires.sched;
 			break;
 		}
@@ -1049,7 +1049,7 @@ static void check_process_timers(struct 
 	do {
 		utime = cputime_add(utime, t->utime);
 		stime = cputime_add(stime, t->stime);
-		sum_sched_runtime += t->exec_runtime;
+		sum_sched_runtime += t->sum_exec_runtime;
 		t = next_thread(t);
 	} while (t != tsk);
 	ptime = cputime_add(utime, stime);
@@ -1208,7 +1208,7 @@ static void check_process_timers(struct 
 				t->it_virt_expires = ticks;
 			}
 
-			sched = t->exec_runtime + sched_left;
+			sched = t->sum_exec_runtime + sched_left;
 			if (sched_expires && (t->it_sched_expires == 0 ||
 					      t->it_sched_expires > sched)) {
 				t->it_sched_expires = sched;
@@ -1300,7 +1300,7 @@ void run_posix_cpu_timers(struct task_st
 
 	if (UNEXPIRED(prof) && UNEXPIRED(virt) &&
 	    (tsk->it_sched_expires == 0 ||
-	     tsk->exec_runtime < tsk->it_sched_expires))
+	     tsk->sum_exec_runtime < tsk->it_sched_expires))
 		return;
 
 #undef	UNEXPIRED
Index: linux/kernel/rtmutex.c
===================================================================
--- linux.orig/kernel/rtmutex.c
+++ linux/kernel/rtmutex.c
@@ -337,7 +337,7 @@ static inline int try_to_steal_lock(stru
 	 * interrupted, so we would delay a waiter with higher
 	 * priority as current->normal_prio.
 	 *
-	 * Note: in the rare case of a SCHED_FAIR task changing
+	 * Note: in the rare case of a SCHED_OTHER task changing
 	 * its priority and thus stealing the lock, next->task
 	 * might be current:
 	 */
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -101,8 +101,10 @@ unsigned long long __attribute__((weak))
 #define MIN_TIMESLICE		max(5 * HZ / 1000, 1)
 #define DEF_TIMESLICE		(100 * HZ / 1000)
 
-#define TASK_PREEMPTS_CURR(p, rq) \
-	((p)->prio < (rq)->curr->prio)
+static inline void check_preempt_curr(struct rq *rq, struct task_struct *p)
+{
+	p->sched_class->check_preempt_curr(rq, p);
+}
 
 #define SCALE_PRIO(x, prio) \
 	max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE)
@@ -227,7 +229,7 @@ char * sched_print_task_state(struct tas
 	P(exec_start);
 	P(last_ran);
 	P(wait_runtime);
-	P(exec_runtime);
+	P(sum_exec_runtime);
 #undef P
 
 	t0 = sched_clock();
@@ -431,38 +433,46 @@ static inline struct rq *this_rq_lock(vo
 	return rq;
 }
 
-#include "sched_stats.h"
-#include "sched_rt.c"
-#include "sched_fair.c"
-#include "sched_debug.c"
+/*
+ * resched_task - mark a task 'to be rescheduled now'.
+ *
+ * On UP this means the setting of the need_resched flag, on SMP it
+ * might also involve a cross-CPU call to trigger the scheduler on
+ * the target CPU.
+ */
+#ifdef CONFIG_SMP
 
-#define sched_class_highest (&rt_sched_class)
+#ifndef tsk_is_polling
+#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
+#endif
 
-static void enqueue_task(struct rq *rq, struct task_struct *p)
+static void resched_task(struct task_struct *p)
 {
-	sched_info_queued(p);
-	p->sched_class->enqueue_task(rq, p);
-	p->on_rq = 1;
-}
+	int cpu;
 
-static void dequeue_task(struct rq *rq, struct task_struct *p)
-{
-	p->sched_class->dequeue_task(rq, p);
-	p->on_rq = 0;
-}
+	assert_spin_locked(&task_rq(p)->lock);
 
-static void requeue_task(struct rq *rq, struct task_struct *p)
-{
-	p->sched_class->requeue_task(rq, p);
-}
+	if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
+		return;
 
-/*
- * __normal_prio - return the priority that is based on the static prio
- */
-static inline int __normal_prio(struct task_struct *p)
+	set_tsk_thread_flag(p, TIF_NEED_RESCHED);
+
+	cpu = task_cpu(p);
+	if (cpu == smp_processor_id())
+		return;
+
+	/* NEED_RESCHED must be visible before we test polling */
+	smp_mb();
+	if (!tsk_is_polling(p))
+		smp_send_reschedule(cpu);
+}
+#else
+static inline void resched_task(struct task_struct *p)
 {
-	return p->static_prio;
+	assert_spin_locked(&task_rq(p)->lock);
+	set_tsk_need_resched(p);
 }
+#endif
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
@@ -528,6 +538,41 @@ static inline void dec_nr_running(struct
 	dec_raw_weighted_load(rq, p);
 }
 
+static void activate_task(struct rq *rq, struct task_struct *p);
+
+#include "sched_stats.h"
+#include "sched_rt.c"
+#include "sched_fair.c"
+#include "sched_debug.c"
+
+#define sched_class_highest (&rt_sched_class)
+
+static void enqueue_task(struct rq *rq, struct task_struct *p)
+{
+	sched_info_queued(p);
+	p->sched_class->enqueue_task(rq, p);
+	p->on_rq = 1;
+}
+
+static void dequeue_task(struct rq *rq, struct task_struct *p)
+{
+	p->sched_class->dequeue_task(rq, p);
+	p->on_rq = 0;
+}
+
+static void requeue_task(struct rq *rq, struct task_struct *p)
+{
+	p->sched_class->requeue_task(rq, p);
+}
+
+/*
+ * __normal_prio - return the priority that is based on the static prio
+ */
+static inline int __normal_prio(struct task_struct *p)
+{
+	return p->static_prio;
+}
+
 /*
  * Calculate the expected normal priority: i.e. priority
  * without taking RT-inheritance into account. Might be
@@ -593,47 +638,6 @@ static void deactivate_task(struct rq *r
 	dec_nr_running(p, rq);
 }
 
-/*
- * resched_task - mark a task 'to be rescheduled now'.
- *
- * On UP this means the setting of the need_resched flag, on SMP it
- * might also involve a cross-CPU call to trigger the scheduler on
- * the target CPU.
- */
-#ifdef CONFIG_SMP
-
-#ifndef tsk_is_polling
-#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
-#endif
-
-static void resched_task(struct task_struct *p)
-{
-	int cpu;
-
-	assert_spin_locked(&task_rq(p)->lock);
-
-	if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
-		return;
-
-	set_tsk_thread_flag(p, TIF_NEED_RESCHED);
-
-	cpu = task_cpu(p);
-	if (cpu == smp_processor_id())
-		return;
-
-	/* NEED_RESCHED must be visible before we test polling */
-	smp_mb();
-	if (!tsk_is_polling(p))
-		smp_send_reschedule(cpu);
-}
-#else
-static inline void resched_task(struct task_struct *p)
-{
-	assert_spin_locked(&task_rq(p)->lock);
-	set_tsk_need_resched(p);
-}
-#endif
-
 /**
  * task_curr - is this task currently executing on a CPU?
  * @p: the task in question.
@@ -1113,10 +1117,8 @@ out_activate:
 	 * the waker guarantees that the freshly woken up task is going
 	 * to be considered on this CPU.)
 	 */
-	if (!sync || cpu != this_cpu) {
-		if (TASK_PREEMPTS_CURR(p, rq))
-			resched_task(rq->curr);
-	}
+	if (!sync || cpu != this_cpu)
+		check_preempt_curr(rq, p);
 	success = 1;
 
 out_running:
@@ -1159,7 +1161,8 @@ static void task_running_tick(struct rq 
 static void __sched_fork(struct task_struct *p)
 {
 	p->wait_start_fair = p->exec_start = p->last_ran = 0;
-	p->exec_runtime = p->wait_runtime = 0;
+	p->sum_exec_runtime = p->wait_runtime = 0;
+	p->sum_wait_runtime = 0;
 
 	INIT_LIST_HEAD(&p->run_list);
 	p->on_rq = 0;
@@ -1208,6 +1211,12 @@ void sched_fork(struct task_struct *p, i
 }
 
 /*
+ * After fork, child runs first. (default) If set to 0 then
+ * parent will (try to) run first.
+ */
+unsigned int __read_mostly sysctl_sched_child_runs_first = 1;
+
+/*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
  * This function will do some initial scheduler statistics housekeeping
@@ -1218,15 +1227,25 @@ void fastcall wake_up_new_task(struct ta
 {
 	unsigned long flags;
 	struct rq *rq;
+	int this_cpu;
 
 	rq = task_rq_lock(p, &flags);
 	BUG_ON(p->state != TASK_RUNNING);
+	this_cpu = smp_processor_id(); /* parent's CPU */
 
 	p->prio = effective_prio(p);
-	activate_task(rq, p);
-	if (TASK_PREEMPTS_CURR(p, rq))
-		resched_task(rq->curr);
 
+	if (!sysctl_sched_child_runs_first || (clone_flags & CLONE_VM) ||
+			task_cpu(p) != this_cpu || !current->on_rq) {
+		activate_task(rq, p);
+	} else {
+		/*
+		 * Let the scheduling class do new task startup
+		 * management (if any):
+		 */
+		p->sched_class->task_new(rq, p);
+	}
+	check_preempt_curr(rq, p);
 	task_rq_unlock(rq, &flags);
 }
 
@@ -1559,8 +1578,7 @@ static void pull_task(struct rq *src_rq,
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
 	 * to be always true for them.
 	 */
-	if (TASK_PREEMPTS_CURR(p, this_rq))
-		resched_task(this_rq->curr);
+	check_preempt_curr(this_rq, p);
 }
 
 /*
@@ -2467,7 +2485,7 @@ DEFINE_PER_CPU(struct kernel_stat, kstat
 EXPORT_PER_CPU_SYMBOL(kstat);
 
 /*
- * Return current->exec_runtime plus any more ns on the sched_clock
+ * Return current->sum_exec_runtime plus any more ns on the sched_clock
  * that have not yet been banked.
  */
 unsigned long long current_sched_runtime(const struct task_struct *p)
@@ -2476,7 +2494,7 @@ unsigned long long current_sched_runtime
 	unsigned long flags;
 
 	local_irq_save(flags);
-	ns = p->exec_runtime + sched_clock() - p->last_ran;
+	ns = p->sum_exec_runtime + sched_clock() - p->last_ran;
 	local_irq_restore(flags);
 
 	return ns;
@@ -3176,8 +3194,9 @@ void rt_mutex_setprio(struct task_struct
 		if (task_running(rq, p)) {
 			if (p->prio > oldprio)
 				resched_task(rq->curr);
-		} else if (TASK_PREEMPTS_CURR(p, rq))
-			resched_task(rq->curr);
+		} else {
+			check_preempt_curr(rq, p);
+		}
 	}
 	task_rq_unlock(rq, &flags);
 }
@@ -3469,8 +3488,9 @@ recheck:
 		if (task_running(rq, p)) {
 			if (p->prio > oldprio)
 				resched_task(rq->curr);
-		} else if (TASK_PREEMPTS_CURR(p, rq))
-			resched_task(rq->curr);
+		} else {
+			check_preempt_curr(rq, p);
+		}
 	}
 	__task_rq_unlock(rq);
 	spin_unlock_irqrestore(&p->pi_lock, flags);
@@ -4183,8 +4203,7 @@ static int __migrate_task(struct task_st
 	if (p->on_rq) {
 		deactivate_task(rq_src, p);
 		activate_task(rq_dest, p);
-		if (TASK_PREEMPTS_CURR(p, rq_dest))
-			resched_task(rq_dest->curr);
+		check_preempt_curr(rq_dest, p);
 	}
 	ret = 1;
 out:
Index: linux/kernel/sched_debug.c
===================================================================
--- linux.orig/kernel/sched_debug.c
+++ linux/kernel/sched_debug.c
@@ -51,10 +51,10 @@ print_task(struct seq_file *m, struct rq
 		p->prio,
 		p->nice_offset,
 		p->hog_limit,
-		p->wait_start_fair,
+		p->wait_start_fair - rq->fair_clock,
 		p->exec_start,
-		p->last_ran,
-		p->exec_runtime);
+		p->sum_exec_runtime,
+		p->sum_wait_runtime);
 }
 
 static void print_rq(struct seq_file *m, struct rq *rq, u64 now)
@@ -66,10 +66,10 @@ static void print_rq(struct seq_file *m,
 	"\nrunnable tasks:\n"
 	"           task   PID     tree-key       delta    waiting"
 	"  switches  prio  nice-offset    hog-limit  wstart-fair   exec-start"
-	"     last-ran exec-runtime\n"
-	"------------------------------------------------------------------"
-	"------------------------------------------------------------------"
-	"-------------------\n");
+	"     sum-exec     sum-wait\n"
+	"---------------------------------------------------------"
+	"--------------------------------------------------------------------"
+	"--------------------------\n");
 
 	curr = first_fair(rq);
 	while (curr) {
Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -27,15 +27,9 @@ static void __enqueue_task_fair(struct r
 {
 	struct rb_node **link = &rq->tasks_timeline.rb_node;
 	struct rb_node *parent = NULL;
+	long long key = p->fair_key;
 	struct task_struct *entry;
 	int leftmost = 1;
-	long long key;
-
-	key = rq->fair_clock - p->wait_runtime;
-	if (unlikely(p->nice_offset))
-		key += p->nice_offset / (rq->nr_running + 1);
-
-	p->fair_key = key;
 
 	/*
 	 * Find the right place in the rbtree:
@@ -48,9 +42,9 @@ static void __enqueue_task_fair(struct r
 		 * the same key stay together.
 		 */
 		if (key < entry->fair_key) {
-			link = &(*link)->rb_left;
+			link = &parent->rb_left;
 		} else {
-			link = &(*link)->rb_right;
+			link = &parent->rb_right;
 			leftmost = 0;
 		}
 	}
@@ -138,7 +132,7 @@ static inline void update_curr(struct rq
 	delta_exec = convert_delta(rq, now - curr->exec_start, curr);
 	delta_fair = delta_exec/rq->nr_running;
 
-	curr->exec_runtime += delta_exec;
+	curr->sum_exec_runtime += delta_exec;
 	curr->exec_start = now;
 
 	rq->fair_clock += delta_fair;
@@ -182,6 +176,11 @@ update_stats_enqueue(struct rq *rq, stru
 	 */
 	if (p != rq->curr)
 		update_stats_wait_start(rq, p, now);
+
+	/*
+	 * Update the key:
+	 */
+	p->fair_key = rq->fair_clock - p->wait_runtime + p->nice_offset;
 }
 
 /*
@@ -195,6 +194,7 @@ static inline void update_stats_wait_end
 	delta = scale_nice_down(rq, p, delta);
 
 	p->wait_runtime += delta;
+	p->sum_wait_runtime += delta;
 	rq->wait_runtime += delta;
 
 	p->wait_start_fair = 0;
@@ -275,6 +275,24 @@ static void requeue_task_fair(struct rq 
 	p->on_rq = 1;
 }
 
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p)
+{
+	struct task_struct *curr = rq->curr;
+	long long __delta = curr->fair_key - p->fair_key;
+
+	/*
+	 * Take scheduling granularity into account - do not
+	 * preempt the current task unless the best task has
+	 * a larger than sched_granularity fairness advantage:
+	 */
+	if (p->prio < curr->prio ||
+			__delta > (unsigned long long)sysctl_sched_granularity)
+		resched_task(curr);
+}
+
 static struct task_struct * pick_next_task_fair(struct rq *rq)
 {
 	struct task_struct *p = __pick_next_task_fair(rq);
@@ -362,25 +380,36 @@ static void task_tick_fair(struct rq *rq
 	 * Dequeue and enqueue the task to update its
 	 * position within the tree:
 	 */
-	dequeue_task_fair(rq, curr);
-	curr->on_rq = 0;
-	enqueue_task_fair(rq, curr);
-	curr->on_rq = 1;
+	requeue_task_fair(rq, curr);
 
 	/*
 	 * Reschedule if another task tops the current one.
-	 *
-	 * Take scheduling granularity into account - do not
-	 * preempt the current task unless the best task has
-	 * a larger than sched_granularity fairness advantage:
 	 */
 	next = __pick_next_task_fair(rq);
-	if (next != curr) {
-		unsigned long long __delta = curr->fair_key - next->fair_key;
+	if (next != curr)
+		check_preempt_curr(rq, next);
+}
 
-		if (__delta > (unsigned long long)sysctl_sched_granularity)
-			set_tsk_need_resched(curr);
-	}
+/*
+ * Share the fairness runtime between parent and child, thus the
+ * total amount of pressure for CPU stays equal - new tasks
+ * get a chance to run but frequent forkers are not allowed to
+ * monopolize the CPU. Note: the parent runqueue is locked,
+ * the child is not running yet.
+ */
+static void task_new_fair(struct rq *rq, struct task_struct *p)
+{
+	sched_info_queued(p);
+	update_stats_enqueue(rq, p);
+	/*
+	 * Child runs first: we let it run before the parent
+	 * until it reschedules once. We set up a key so that
+	 * it will preempt the parent:
+	 */
+	p->fair_key = current->fair_key - sysctl_sched_granularity - 1;
+	__enqueue_task_fair(rq, p);
+	p->on_rq = 1;
+	inc_nr_running(p, rq);
 }
 
 static inline long
@@ -418,6 +447,8 @@ hog_limit(struct rq *rq, struct task_str
 	return -(long long)limit;
 }
 
+#define NICE_OFFSET_GRANULARITY 100000
+
 /*
  * Calculate and cache the nice offset and the hog limit values:
  */
@@ -441,12 +472,15 @@ struct sched_class fair_sched_class __re
 	.dequeue_task		= dequeue_task_fair,
 	.requeue_task		= requeue_task_fair,
 
+	.check_preempt_curr	= check_preempt_curr_fair,
+
 	.pick_next_task		= pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 
 	.load_balance_start	= load_balance_start_fair,
 	.load_balance_next	= load_balance_next_fair,
 	.task_tick		= task_tick_fair,
+	.task_new		= task_new_fair,
 
 	.task_init		= task_init_fair,
 };
Index: linux/kernel/sched_rt.c
===================================================================
--- linux.orig/kernel/sched_rt.c
+++ linux/kernel/sched_rt.c
@@ -34,6 +34,15 @@ static void requeue_task_rt(struct rq *r
 	list_move_tail(&p->run_list, array->queue + p->prio);
 }
 
+/*
+ * Preempt the current task with a newly woken task if needed:
+ */
+static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p)
+{
+	if (p->prio < rq->curr->prio)
+		resched_task(rq->curr);
+}
+
 static struct task_struct * pick_next_task_rt(struct rq *rq)
 {
 	struct prio_array *array = &rq->active;
@@ -140,6 +149,15 @@ static void task_tick_rt(struct rq *rq, 
 	}
 }
 
+/*
+ * No parent/child timeslice management necessary for RT tasks,
+ * just activate them:
+ */
+static void task_new_rt(struct rq *rq, struct task_struct *p)
+{
+	activate_task(rq, p);
+}
+
 static void task_init_rt(struct rq *rq, struct task_struct *p)
 {
 }
@@ -149,6 +167,8 @@ static struct sched_class rt_sched_class
 	.dequeue_task		= dequeue_task_rt,
 	.requeue_task		= requeue_task_rt,
 
+	.check_preempt_curr	= check_preempt_curr_rt,
+
 	.pick_next_task		= pick_next_task_rt,
 	.put_prev_task		= put_prev_task_rt,
 
@@ -156,5 +176,6 @@ static struct sched_class rt_sched_class
 	.load_balance_next	= load_balance_next_rt,
 
 	.task_tick		= task_tick_rt,
+	.task_new		= task_new_rt,
 	.task_init		= task_init_rt,
 };
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -222,6 +222,14 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_child_runs_first",
+		.data		= &sysctl_sched_child_runs_first,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_PANIC,
 		.procname	= "panic",
 		.data		= &panic_timeout,

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
                   ` (2 preceding siblings ...)
  2007-04-17  4:53 ` Gene Heskett
@ 2007-04-17  6:46 ` Peter Williams
  2007-04-17  7:51   ` William Lee Irwin III
  2007-04-17  9:53   ` Ingo Molnar
  3 siblings, 2 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-17  6:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko

Ingo Molnar wrote:
> this is the second release of the CFS (Completely Fair Scheduler) 
> patchset, against v2.6.21-rc7:
> 
>    http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
> 
> i'd like to thank everyone for the tremendous amount of feedback and 
> testing the v1 patch got - i could hardly keep up with just reading the 
> mails! Some of the stuff people addressed i couldnt implement yet, i 
> mostly concentrated on bugs, regressions and debuggability.

Have you considered using rq->raw_weighted_load instead of 
rq->nr_running in calculating fair_clock?  This would take the nice 
value (or RT priority) of the other tasks into account when determining 
what's fair.

Peter
PS You'd have to change the migration thread's load_weight from 0 to 1 
in order to prevent divide by zero without having to explicitly check 
for it every time.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  4:06 ` Peter Williams
@ 2007-04-17  6:49   ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  6:49 UTC (permalink / raw)
  To: Peter Williams
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> Can I make a suggestion?
> 
> Would it be possible (from now on) to publish changes relevant to the 
> previous patch (eventually leading to a series of patches that 
> describes the evolution of the new scheduler) so that it's easier for 
> us reviewers/critics to see the latest changes.  E.g. if import such 
> changes into something like quilt (using my gquilt GUI wrapper, of 
> course :-)) I can then use meld (or similar) to follow what's going as 
> suggestions get folded in and bugs get fixed etc.

the v1 patch is still downloadable so you can do a delta by first 
applying the v1 patch to a quilt queue, doing a 'quilt snapshot', then 
'quilt pop', add the v2 patch to the series file, do a 'quilt push', 
then doing a "quilt diff --snapshot". (I just posted the delta patch in 
this thread so you can pick it from there too.)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  6:18   ` Ingo Molnar
@ 2007-04-17  7:01     ` Ingo Molnar
  2007-04-17  7:31       ` Davide Libenzi
                         ` (2 more replies)
  2007-04-17  8:03     ` Davide Libenzi
  2007-04-17 16:12     ` Gene Heskett
  2 siblings, 3 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  7:01 UTC (permalink / raw)
  To: Gene Heskett
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko


* Ingo Molnar <mingo@elte.hu> wrote:

> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty 
> small. One difference is the child-runs-first fix. To restore the 
> parent-runs-first logic, do this:
> 
> 	echo 0 > /proc/sys/kernel/sched_child_runs_first
> 
> does this make any difference?

ok, i've got something better to test: i separated the delta out into a 
more finegrained stack of 3 patches. You can pick them up from:

 http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
 http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
 http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
 http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch

i test-built and test-booted all 4 steps of this. The baseline -v2-rc0 
patch should be the one that works - you might want to double-check it, 
just to be sure. One of the other 3 patches ontop of this baseline 
causes the regression on your desktop. My current bet is on preempt-fix, 
so i have put that one first. The other one would be the second patch, 
child-runs-first. The misc patch should have no effect on behavior - but 
i've included it for completeness. (and i was wrong about the 'nice 
fix', it is not in this delta)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  5:51     ` Gene Heskett
@ 2007-04-17  7:18       ` Paolo Ornati
  0 siblings, 0 replies; 37+ messages in thread
From: Paolo Ornati @ 2007-04-17  7:18 UTC (permalink / raw)
  To: Gene Heskett
  Cc: Willy Tarreau, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Dmitry Adamushko

On Tue, 17 Apr 2007 01:51:08 -0400
Gene Heskett <gene.heskett@gmail.com> wrote:

> FWIW, I've been using the CFQ I/O scheduler for quite a while, is it time I 
> gave the AS or Deadline versions another check?  They are all built in but I 
> don't know how to change the default on the fly, or even if it can be done.

easy :)

# cat /sys/block/DEVICE/queue/scheduler
as noop [cfq] ...

# echo IO_SCHED > /sys/block/DEVICE/queue/scheduler

-- 
	Paolo Ornati
	Linux 2.6.21-rc7 on x86_64

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  7:01     ` Ingo Molnar
@ 2007-04-17  7:31       ` Davide Libenzi
  2007-04-17  7:39         ` Ingo Molnar
  2007-04-17 17:15       ` Gene Heskett
  2007-04-17 17:22       ` Gene Heskett
  2 siblings, 1 reply; 37+ messages in thread
From: Davide Libenzi @ 2007-04-17  7:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko

On Tue, 17 Apr 2007, Ingo Molnar wrote:

> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > ok - fortunately the delta between -v2-rc0 and -v2-final is pretty 
> > small. One difference is the child-runs-first fix. To restore the 
> > parent-runs-first logic, do this:
> > 
> > 	echo 0 > /proc/sys/kernel/sched_child_runs_first
> > 
> > does this make any difference?
> 
> ok, i've got something better to test: i separated the delta out into a 
> more finegrained stack of 3 patches. You can pick them up from:
> 
>  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
>  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
>  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
>  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch

Isn't that easier for everyone if you keep them as quilt series (ala 
syslets)?


- Davide



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  7:31       ` Davide Libenzi
@ 2007-04-17  7:39         ` Ingo Molnar
  2007-04-17 17:18           ` Gene Heskett
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  7:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > ok, i've got something better to test: i separated the delta out 
> > into a more finegrained stack of 3 patches. You can pick them up 
> > from:
> > 
> >  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
> >  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
> >  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
> >  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
> 
> Isn't that easier for everyone if you keep them as quilt series (ala 
> syslets)?

i _do_ have a quilt tree, but i never had the clean splitup above. Why? 
Because i worked on all of these aspects (and a whole lot of other 
aspects as well) in parallel during the past 2 days, back and forth, 
often mixing changes, etc. and there was never any clean splitup.

Now it turned out that the clean splitup of -rc0->final delta would ease 
Gene's testing so i created it. Note that this is just 30% of the total 
v1->v2 delta and i just saved the work of having to do a clean splitup 
of the other 70%. (and note that this splitup will be undone because it 
makes no sense for any potential upstream merge at all, it's only to 
ease testing for Gene)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  6:46 ` Peter Williams
@ 2007-04-17  7:51   ` William Lee Irwin III
  2007-04-17  8:16     ` Ingo Molnar
  2007-04-17  8:30     ` Peter Williams
  2007-04-17  9:53   ` Ingo Molnar
  1 sibling, 2 replies; 37+ messages in thread
From: William Lee Irwin III @ 2007-04-17  7:51 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
	Dmitry Adamushko

Ingo Molnar wrote:
>> this is the second release of the CFS (Completely Fair Scheduler) 
>> patchset, against v2.6.21-rc7:
>>    http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>> i'd like to thank everyone for the tremendous amount of feedback and 
>> testing the v1 patch got - i could hardly keep up with just reading the 
>> mails! Some of the stuff people addressed i couldnt implement yet, i 
>> mostly concentrated on bugs, regressions and debuggability.

On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
> Have you considered using rq->raw_weighted_load instead of 
> rq->nr_running in calculating fair_clock?  This would take the nice 
> value (or RT priority) of the other tasks into account when determining 
> what's fair.

I suspect you mean (curr->load_weight*delta_exec)/rq->raw_weighted_load
in update_curr().


-- wli

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  6:18   ` Ingo Molnar
  2007-04-17  7:01     ` Ingo Molnar
@ 2007-04-17  8:03     ` Davide Libenzi
  2007-04-17  8:18       ` Nick Piggin
  2007-04-17  8:20       ` Ingo Molnar
  2007-04-17 16:12     ` Gene Heskett
  2 siblings, 2 replies; 37+ messages in thread
From: Davide Libenzi @ 2007-04-17  8:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko

On Tue, 17 Apr 2007, Ingo Molnar wrote:

> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty 
> small. One difference is the child-runs-first fix. To restore the 
> parent-runs-first logic, do this: 
> 
> 	echo 0 > /proc/sys/kernel/sched_child_runs_first

Sorry, I did not follow the latest developments, but how many tunables we 
have so far in CFS? Are those for debug only or they're supposed to stay?
Weren't those listed inside the Axis of Evil (just to remain in topic :) 
till yesterday?


- Davide



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  7:51   ` William Lee Irwin III
@ 2007-04-17  8:16     ` Ingo Molnar
  2007-04-17  8:52       ` Ingo Molnar
  2007-04-17 14:05       ` Peter Williams
  2007-04-17  8:30     ` Peter Williams
  1 sibling, 2 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  8:16 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
	Dmitry Adamushko


* William Lee Irwin III <wli@holomorphy.com> wrote:

> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>
> > Have you considered using rq->raw_weighted_load instead of 
> > rq->nr_running in calculating fair_clock?  This would take the nice 
> > value (or RT priority) of the other tasks into account when 
> > determining what's fair.
> 
> I suspect you mean 
> (curr->load_weight*delta_exec)/rq->raw_weighted_load in update_curr().

good idea, i'll try that.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:03     ` Davide Libenzi
@ 2007-04-17  8:18       ` Nick Piggin
  2007-04-17  8:26         ` Ingo Molnar
  2007-04-17  8:20       ` Ingo Molnar
  1 sibling, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2007-04-17  8:18 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Ingo Molnar, Gene Heskett, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko

On Tue, Apr 17, 2007 at 01:03:46AM -0700, Davide Libenzi wrote:
> On Tue, 17 Apr 2007, Ingo Molnar wrote:
> 
> > ok - fortunately the delta between -v2-rc0 and -v2-final is pretty 
> > small. One difference is the child-runs-first fix. To restore the 
> > parent-runs-first logic, do this: 
> > 
> > 	echo 0 > /proc/sys/kernel/sched_child_runs_first
> 
> Sorry, I did not follow the latest developments, but how many tunables we 
> have so far in CFS? Are those for debug only or they're supposed to stay?
> Weren't those listed inside the Axis of Evil (just to remain in topic :) 
> till yesterday?

Actually I think this is something that makes sense to add, even if
just for debugging, but maybe also for production, depending on how
much it impacts things. Child runs first is an heuristic optimisation
that exploits a VM detail (however fundamental). But for things that
don't exec right after forking (and maybe some things that do), it
can be nicer to reduce context switches, improve cache patterns, and
allow children to be load balanced away before touching memory, if
child_runs_first is turned off.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:03     ` Davide Libenzi
  2007-04-17  8:18       ` Nick Piggin
@ 2007-04-17  8:20       ` Ingo Molnar
  1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  8:20 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Gene Heskett, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko


* Davide Libenzi <davidel@xmailserver.org> wrote:

> > ok - fortunately the delta between -v2-rc0 and -v2-final is pretty 
> > small. One difference is the child-runs-first fix. To restore the 
> > parent-runs-first logic, do this:
> > 
> > 	echo 0 > /proc/sys/kernel/sched_child_runs_first
> 
> Sorry, I did not follow the latest developments, but how many tunables 
> we have so far in CFS? Are those for debug only or they're supposed to 
> stay?

yeah, debug only. I strongly suspect the Kaffeine breakage for example 
was related to child-runs-first, so userspace developers might be 
interested in a switch to turn this on/off.

while reviewing the upstream scheduler it occured to me that we are 
probably _not_ doing child-runs-first there due to the list_add_tail() 
[it should be a list_add() for it to be child-first. But i havent 
instrumented this heavily and this portion of the mainline scheduler is 
pretty fragile.]. So via this flag we could also see the performance 
impact, besides the compatibility impact.

> Weren't those listed inside the Axis of Evil (just to remain in topic
> :) till yesterday?

heh ;)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:18       ` Nick Piggin
@ 2007-04-17  8:26         ` Ingo Molnar
  2007-04-17  8:41           ` Nick Piggin
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  8:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Davide Libenzi, Gene Heskett, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko


* Nick Piggin <npiggin@suse.de> wrote:

> Actually I think this is something that makes sense to add, even if 
> just for debugging, but maybe also for production, depending on how 
> much it impacts things. Child runs first is an heuristic optimisation 
> that exploits a VM detail (however fundamental). But for things that 
> don't exec right after forking (and maybe some things that do), it can 
> be nicer to reduce context switches, improve cache patterns, and allow 
> children to be load balanced away before touching memory, if 
> child_runs_first is turned off.

yeah, the primary intent was debug. Nick, am i confused to conclude that 
mainline in fact runs the _parent_ first, despite all the elaborate 
runqueue juggling we do there? This piece of code in wake_up_new_task() 
caught my eyes:

                                p->prio = current->prio;
                                p->normal_prio = current->normal_prio;
                                list_add_tail(&p->run_list, &current->run_list);
                                p->array = current->array;
                                p->array->nr_active++;
                                inc_nr_running(p, rq);

shouldnt the list_add_tail() be list_add(), so that task pickup sees the 
child first? Maybe we still do child-runs-first in practice, due to the 
timeslice and sleep average fixups that happen if the parent preempts, 
but the above piece of code seems a quite elaborate way of doing 
activate_task(). To have the child _before_ the parent we'd need the 
add-on patch below. But ... i could be wrong, this is just a quick 
thought.

	Ingo

---
 kernel/sched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1685,7 +1685,7 @@ void fastcall wake_up_new_task(struct ta
 			else {
 				p->prio = current->prio;
 				p->normal_prio = current->normal_prio;
-				list_add_tail(&p->run_list, &current->run_list);
+				list_add(&p->run_list, &current->run_list);
 				p->array = current->array;
 				p->array->nr_active++;
 				inc_nr_running(p, rq);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  7:51   ` William Lee Irwin III
  2007-04-17  8:16     ` Ingo Molnar
@ 2007-04-17  8:30     ` Peter Williams
  2007-04-18 19:15       ` Peter Williams
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Williams @ 2007-04-17  8:30 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
	Dmitry Adamushko

William Lee Irwin III wrote:
> Ingo Molnar wrote:
>>> this is the second release of the CFS (Completely Fair Scheduler) 
>>> patchset, against v2.6.21-rc7:
>>>    http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>>> i'd like to thank everyone for the tremendous amount of feedback and 
>>> testing the v1 patch got - i could hardly keep up with just reading the 
>>> mails! Some of the stuff people addressed i couldnt implement yet, i 
>>> mostly concentrated on bugs, regressions and debuggability.
> 
> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>> Have you considered using rq->raw_weighted_load instead of 
>> rq->nr_running in calculating fair_clock?  This would take the nice 
>> value (or RT priority) of the other tasks into account when determining 
>> what's fair.
> 
> I suspect you mean (curr->load_weight*delta_exec)/rq->raw_weighted_load
> in update_curr().

Or something like that, yes. :-)

I was trying to make the point that the weighted load stuff provides 
useful data for implementing nice (in a number of ways e.g. see spa_ebs).

Also, now that the old time slices are gone, a simpler more efficient 
function for mapping RT priority or nice (as appropriate) to 
p->load_weight can be used instead of the current one which uses the 
time slice the task would have been allocated as a basis.  I'd suggest 
the function that the current one replaced.  (Because it was mine :-)).

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:26         ` Ingo Molnar
@ 2007-04-17  8:41           ` Nick Piggin
  2007-04-17  8:57             ` Ingo Molnar
  0 siblings, 1 reply; 37+ messages in thread
From: Nick Piggin @ 2007-04-17  8:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Gene Heskett, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko

On Tue, Apr 17, 2007 at 10:26:31AM +0200, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > Actually I think this is something that makes sense to add, even if 
> > just for debugging, but maybe also for production, depending on how 
> > much it impacts things. Child runs first is an heuristic optimisation 
> > that exploits a VM detail (however fundamental). But for things that 
> > don't exec right after forking (and maybe some things that do), it can 
> > be nicer to reduce context switches, improve cache patterns, and allow 
> > children to be load balanced away before touching memory, if 
> > child_runs_first is turned off.
> 
> yeah, the primary intent was debug. Nick, am i confused to conclude that 
> mainline in fact runs the _parent_ first, despite all the elaborate 
> runqueue juggling we do there? This piece of code in wake_up_new_task() 
> caught my eyes:
> 
>                                 p->prio = current->prio;
>                                 p->normal_prio = current->normal_prio;
>                                 list_add_tail(&p->run_list, &current->run_list);
>                                 p->array = current->array;
>                                 p->array->nr_active++;
>                                 inc_nr_running(p, rq);
> 
> shouldnt the list_add_tail() be list_add(), so that task pickup sees the 
> child first? Maybe we still do child-runs-first in practice, due to the 
> timeslice and sleep average fixups that happen if the parent preempts, 
> but the above piece of code seems a quite elaborate way of doing 
> activate_task(). To have the child _before_ the parent we'd need the 
> add-on patch below. But ... i could be wrong, this is just a quick 
> thought.

I think that it works because the list we're adding to is not the
normal runqueue list head, but the parent's list_head on that runqueue.
Which adds the child directly ahead of the parent... I think?

> 
> 	Ingo
> 
> ---
>  kernel/sched.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux/kernel/sched.c
> ===================================================================
> --- linux.orig/kernel/sched.c
> +++ linux/kernel/sched.c
> @@ -1685,7 +1685,7 @@ void fastcall wake_up_new_task(struct ta
>  			else {
>  				p->prio = current->prio;
>  				p->normal_prio = current->normal_prio;
> -				list_add_tail(&p->run_list, &current->run_list);
> +				list_add(&p->run_list, &current->run_list);
>  				p->array = current->array;
>  				p->array->nr_active++;
>  				inc_nr_running(p, rq);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:16     ` Ingo Molnar
@ 2007-04-17  8:52       ` Ingo Molnar
  2007-04-17 14:05       ` Peter Williams
  1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  8:52 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Peter Williams, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett,
	Dmitry Adamushko


* Ingo Molnar <mingo@elte.hu> wrote:

> * William Lee Irwin III <wli@holomorphy.com> wrote:
> 
> > On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
> >
> > > Have you considered using rq->raw_weighted_load instead of 
> > > rq->nr_running in calculating fair_clock?  This would take the 
> > > nice value (or RT priority) of the other tasks into account when 
> > > determining what's fair.
> > 
> > I suspect you mean 
> > (curr->load_weight*delta_exec)/rq->raw_weighted_load in 
> > update_curr().
> 
> good idea, i'll try that.

i'll try another thing too: we could perhaps get rid of rq->nr_running 
and only use raw_weighted_load, because now the only main remaining 
property of ->nr_running is "is it zero or not".

[ ->nr_running's only other significant use is 'group_capacity', but in
  reality it is only interested in whether all CPUs in the group are
  busy and what the combined cpu power of that group is, and this could
  be restructured to use rq->curr and cpu_power - and become independent
  of nr_running. ]

[ then there are other details like load-average, but we could change
  that to be weighted-cpu-load driven - that makes sense anyway: a
  reniced task should have less effect on the 'system load' than a
  non-reniced task. ]

that would be one less variable to maintain in the scheduler hotpath, 
and it would make smpnice an effective _replacement_ for nr_running, 
instead of an add-on thing that costs a bit of performance.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:41           ` Nick Piggin
@ 2007-04-17  8:57             ` Ingo Molnar
  0 siblings, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  8:57 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Davide Libenzi, Gene Heskett, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Con Kolivas, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko


* Nick Piggin <npiggin@suse.de> wrote:

> >                   list_add_tail(&p->run_list, &current->run_list);
[...]
> > shouldnt the list_add_tail() be list_add(), so that task pickup sees 
> > the child first? [...]
[...]
> I think that it works because the list we're adding to is not the 
> normal runqueue list head, but the parent's list_head on that 
> runqueue. Which adds the child directly ahead of the parent... I 
> think?

yeah, you are right, i was confused: list_add() adds _after_ the head, 
list_add_tail() adds _before_ the head - and in the middle of the list 
if we do a list_add_tail() it adds before that entry. So everything's 
fine and working as expected :)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-16 22:12 ` S.Çağlar Onur
@ 2007-04-17  8:59   ` Ingo Molnar
  2007-04-17 14:45     ` S.Çağlar Onur
  0 siblings, 1 reply; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  8:59 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko


* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> 17 Nis 2007 Sal tarihinde, Ingo Molnar şunları yazmıştı: 
> >  - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
> >    flag can be used to turn it on/off. (This might fix the Kaffeine bug
> >    reported by S.Çağlar Onur <)
> 
> Sorry for delayed response but i just find some free time, do you 
> still want me to test mainline + "parent-runs first" patch or will i 
> drop that one and test v2 which can change default behaviour?

i suspect for now it would be sufficient if you could check the v2 
patch.

if it _works_, please try this:

    echo 0 > /proc/sys/kernel/sched_child_runs_first

this should break Kaffeine again :)

(if it doesnt work then the Kaffeine problem is unrelated to 
child-runs-first.)

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  6:46 ` Peter Williams
  2007-04-17  7:51   ` William Lee Irwin III
@ 2007-04-17  9:53   ` Ingo Molnar
  1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17  9:53 UTC (permalink / raw)
  To: Peter Williams
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> Have you considered using rq->raw_weighted_load instead of 
> rq->nr_running in calculating fair_clock?  This would take the nice 
> value (or RT priority) of the other tasks into account when 
> determining what's fair.
> 
> Peter

> PS You'd have to change the migration thread's load_weight from 0 to 1 
> in order to prevent divide by zero without having to explicitly check 
> for it every time.

yeah - nice idea, i'll try this.

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:16     ` Ingo Molnar
  2007-04-17  8:52       ` Ingo Molnar
@ 2007-04-17 14:05       ` Peter Williams
  1 sibling, 0 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-17 14:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: William Lee Irwin III, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Dmitry Adamushko

Ingo Molnar wrote:
> * William Lee Irwin III <wli@holomorphy.com> wrote:
> 
>> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>>
>>> Have you considered using rq->raw_weighted_load instead of 
>>> rq->nr_running in calculating fair_clock?  This would take the nice 
>>> value (or RT priority) of the other tasks into account when 
>>> determining what's fair.
>> I suspect you mean 
>> (curr->load_weight*delta_exec)/rq->raw_weighted_load in update_curr().
> 
> good idea, i'll try that.

In the longer term, I'd suggest modifying this idea to use the maximum 
of rq->raw_weighted_load and a running average of rq->raw_weighted_load 
much the same as was done within the load balancer code.  This will tend 
to make scheduling "smoother".  To try the idea out you could (on an SMP 
system) use one of the rq->cpu_load[] metrics as the running average.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:59   ` Ingo Molnar
@ 2007-04-17 14:45     ` S.Çağlar Onur
  2007-04-17 15:48       ` Gabriel C
  2007-04-17 16:01       ` Ingo Molnar
  0 siblings, 2 replies; 37+ messages in thread
From: S.Çağlar Onur @ 2007-04-17 14:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko

[-- Attachment #1: Type: text/plain, Size: 1464 bytes --]

17 Nis 2007 Sal tarihinde, Ingo Molnar şunları yazmıştı: 
> > Sorry for delayed response but i just find some free time, do you
> > still want me to test mainline + "parent-runs first" patch or will i
> > drop that one and test v2 which can change default behaviour?
>
> i suspect for now it would be sufficient if you could check the v2
> patch.
>
> if it _works_, please try this:
>
>     echo 0 > /proc/sys/kernel/sched_child_runs_first
>
> this should break Kaffeine again :)
>
> (if it doesnt work then the Kaffeine problem is unrelated to
> child-runs-first.)

OK, i tested both plain -rc7 and -rc7 + CFSv2 with while 
sched_child_runs_first enabled/disabled.

I'm always using same video file and try to reproduce freeze with constantly 
pressing forward/backward buttons. With CFS 2-3 forward/backward attempt 
reproduces this behaviour. 

And here are the results.

Mainline still has no issues with both xine-lib/kaffeine and xine-ui 
(kaffeine-0.8.4, xine-lib-1.1.5 [both xcb enabled], xine-ui-0.99.4). I really 
try hard to reproduce the freeze, but i can't...

And CFSv2 still fails for both child_runs_first and parent_runs_first cases 
with same strace output (FUTEX_WAIT).

If you want me to test something else just ask please :) 

Cheers
-- 
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17 14:45     ` S.Çağlar Onur
@ 2007-04-17 15:48       ` Gabriel C
  2007-04-17 16:01       ` Ingo Molnar
  1 sibling, 0 replies; 37+ messages in thread
From: Gabriel C @ 2007-04-17 15:48 UTC (permalink / raw)
  To: caglar
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, Willy Tarreau, Gene Heskett,
	Dmitry Adamushko

S.Çağlar Onur wrote:
> 17 Nis 2007 Sal tarihinde, Ingo Molnar şunları yazmıştı: 
>   
>>> Sorry for delayed response but i just find some free time, do you
>>> still want me to test mainline + "parent-runs first" patch or will i
>>> drop that one and test v2 which can change default behaviour?
>>>       
>> i suspect for now it would be sufficient if you could check the v2
>> patch.
>>
>> if it _works_, please try this:
>>
>>     echo 0 > /proc/sys/kernel/sched_child_runs_first
>>
>> this should break Kaffeine again :)
>>
>> (if it doesnt work then the Kaffeine problem is unrelated to
>> child-runs-first.)
>>     
>
> OK, i tested both plain -rc7 and -rc7 + CFSv2 with while 
> sched_child_runs_first enabled/disabled.
>
> I'm always using same video file and try to reproduce freeze with constantly 
> pressing forward/backward buttons. With CFS 2-3 forward/backward attempt 
> reproduces this behaviour. 
>
> And here are the results.
>
> Mainline still has no issues with both xine-lib/kaffeine and xine-ui 
> (kaffeine-0.8.4, xine-lib-1.1.5 [both xcb enabled], xine-ui-0.99.4). I really 
> try hard to reproduce the freeze, but i can't...
>
> And CFSv2 still fails for both child_runs_first and parent_runs_first cases 
> with same strace output (FUTEX_WAIT).
>   

I have the same problem here ( same packages ).

Even VLC if I go forward/backward and then play again its start to 
ramdom freeze here but only for 1 - 2 seconds maybe.


> If you want me to test something else just ask please :) 
>
> Cheers
>   


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17 14:45     ` S.Çağlar Onur
  2007-04-17 15:48       ` Gabriel C
@ 2007-04-17 16:01       ` Ingo Molnar
  1 sibling, 0 replies; 37+ messages in thread
From: Ingo Molnar @ 2007-04-17 16:01 UTC (permalink / raw)
  To: S.Çağlar Onur
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, Willy Tarreau, Gene Heskett, Dmitry Adamushko,
	Christophe Thommeret, Christoph Pfister, Jurgen Kofler


* S.Çağlar Onur <caglar@pardus.org.tr> wrote:

> If you want me to test something else just ask please :)

yes, it would be nice to do a:

	strace -o kaffine.log -f -tttTTT kaffeine

log. Because in your old log this is visible:

 clone(child_stack=0xb02394a4, 
 flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, 
 parent_tidptr=0xb0239bd8, {entry_number:6, base_addr:0xb0239b90, 
 limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, 
 limit_in_pages:1, seg_not_present:0, useable:1}, 
 child_tidptr=0xb0239bd8) = 11340
 futex(0x89ac218, FUTEX_WAKE, 1)         = 1

we cloned a task and immediately afterwards we used futex 0x89ac218. 
After that point many things happen, but the lockup itself:

 futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
 futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0
 futex(0x89ac218, FUTEX_WAIT, 2, NULL)   = 0

is the same futex. Probably related to the same child thread? It would 
be nice to also get a gdb backtrace:

	gdb kaffine
	<reproduce the hang>
	Ctrl-C
	bt

this should give you a gdb backtrace of that kaffeine hang. Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  6:18   ` Ingo Molnar
  2007-04-17  7:01     ` Ingo Molnar
  2007-04-17  8:03     ` Davide Libenzi
@ 2007-04-17 16:12     ` Gene Heskett
  2 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 16:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko

On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Gene Heskett <gene.heskett@gmail.com> wrote:
>> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo.  v2-rc0 was
>> much better.  Watching amanda run with htop, kmails composer is being
>> subjected to 5 to 10 second pauses, and htop says that gzip -best
>> isn't getting more that 15% of the cpu, and the /amandatapes drive is
>> being written to in a regular pattern that seems to be the cause of
>> the pauses according to gkrellm, which also seems to track the size of
>> the writes, and can show anything from 4.3k to 54 megs as being
>> written in one cycle of its screen update.
>
>ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
>small. One difference is the child-runs-first fix. To restore the
>parent-runs-first logic, do this:
>
I'm running 21-rc7-CFS-v2-rc0.1 now.

>	echo 0 > /proc/sys/kernel/sched_child_runs_first

This is currently a 1.  Reset to 0 by the above.

>does this make any difference?

Hard to tell, not much running except fetchmail/procmail and this composer.

>
>If not then pretty much the only other change was the nice level tweak i
>did. Could you try to grab a few snapshots of scheduling state via
>something like:
>
>   while sleep 1; do cat /proc/sched_debug >> to-ingo.txt; done
The crf1.txt is with it=1, the cfr0.txt is with it zeroed.

>(and tell me the PID of the kmail composer, to make sure i'm checking
>the right task's behavior.)

And I let the crf0 version run longer as I was looking for the composer's pid, 
but htop (or I) can't see it.  Even a ps -e isn't seeing it!  But its 
running, I'm actively typing in it.  So you get 3 files, the third one called 
ps-e.txt, in private mail.  I thought it was called composer, I really did.

>
>also, as a separate experiment, could you perhaps run this script as
>root:
>
>   cd /proc; for N in [1-9]*; do renice -n 0 $N; done
>
>this will move all tasks in the system to nice level 0 and should make
>any nice level handling logic in the scheduler irrelevant. Do you have X
>reniced perhaps?
>
>Lots of system threads have negative or positive nice levels, so once
>you have executed this script, only a reboot will be a practical way to
>restore it to the previous settings.
>
>	Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I have many CHARTS and DIAGRAMS..

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  7:01     ` Ingo Molnar
  2007-04-17  7:31       ` Davide Libenzi
@ 2007-04-17 17:15       ` Gene Heskett
  2007-04-17 17:22       ` Gene Heskett
  2 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 17:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko

On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <mingo@elte.hu> wrote:
>> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
>> small. One difference is the child-runs-first fix. To restore the
>> parent-runs-first logic, do this:
>>
>> 	echo 0 > /proc/sys/kernel/sched_child_runs_first
>>
>> does this make any difference?
>
>ok, i've got something better to test: i separated the delta out into a
>more finegrained stack of 3 patches. You can pick them up from:
>
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.p
>atch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-fi
>rst.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
>
Ahh, so many cats, and so few recipes here Ingo.  In this case cats=patches & 
recipes=time to test adequately.  I do have another box, but it would 
probably take a week & about a big buck to get that old rh7.3 brought up to 
date & suitable, and its only a 500MHZ K-III, which might make the diffs more 
obvious.  It would need a video card to replace its dinosaur Diamond and a 
fresh dvd drive.  And its motherboard has very buggy usb chips.  TYAN S-1590.  
Never could get anything bigger than a mouse packet through them.

>i test-built and test-booted all 4 steps of this. The baseline -v2-rc0
>patch should be the one that works - you might want to double-check it,
>just to be sure. One of the other 3 patches ontop of this baseline
>causes the regression on your desktop. My current bet is on preempt-fix,
>so i have put that one first. The other one would be the second patch,
>child-runs-first. The misc patch should have no effect on behavior - but
>i've included it for completeness. (and i was wrong about the 'nice
>fix', it is not in this delta)
>
>	Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Q:	What's the difference between a dead dog in the road and a dead
	lawyer in the road?
A:	There are skid marks in front of the dog.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  7:39         ` Ingo Molnar
@ 2007-04-17 17:18           ` Gene Heskett
  0 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 17:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Davide Libenzi, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Dmitry Adamushko

On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Davide Libenzi <davidel@xmailserver.org> wrote:
>> > ok, i've got something better to test: i separated the delta out
>> > into a more finegrained stack of 3 patches. You can pick them up
>> > from:
>> >
>> >  http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
>> > 
>> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fi
>> >x.patch
>> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs
>> >-first.patch
>> > http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
>>
>> Isn't that easier for everyone if you keep them as quilt series (ala
>> syslets)?
>
>i _do_ have a quilt tree, but i never had the clean splitup above. Why?
>Because i worked on all of these aspects (and a whole lot of other
>aspects as well) in parallel during the past 2 days, back and forth,
>often mixing changes, etc. and there was never any clean splitup.
>
>Now it turned out that the clean splitup of -rc0->final delta would ease
>Gene's testing so i created it. Note that this is just 30% of the total
>v1->v2 delta and i just saved the work of having to do a clean splitup
>of the other 70%. (and note that this splitup will be undone because it
>makes no sense for any potential upstream merge at all, it's only to
>ease testing for Gene)

Now he tells me.  :-)  But I have some CHO stuff to do, so it will be about 36 
hours before I can get back to this.

>	Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Support the Girl Scouts!
	(Today's Brownie is tomorrow's Cookie!)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  7:01     ` Ingo Molnar
  2007-04-17  7:31       ` Davide Libenzi
  2007-04-17 17:15       ` Gene Heskett
@ 2007-04-17 17:22       ` Gene Heskett
  2 siblings, 0 replies; 37+ messages in thread
From: Gene Heskett @ 2007-04-17 17:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Dmitry Adamushko

On Tuesday 17 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <mingo@elte.hu> wrote:
>> ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
>> small. One difference is the child-runs-first fix. To restore the
>> parent-runs-first logic, do this:
>>
>> 	echo 0 > /proc/sys/kernel/sched_child_runs_first
>>
>> does this make any difference?
>
>ok, i've got something better to test: i separated the delta out into a
>more finegrained stack of 3 patches. You can pick them up from:
>
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.p
>atch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-fi
>rst.patch
> http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch
>
Got them all saved Ingo, but it will be late tomorrow before I can play again.

>i test-built and test-booted all 4 steps of this. The baseline -v2-rc0
>patch should be the one that works - you might want to double-check it,
>just to be sure. One of the other 3 patches ontop of this baseline
>causes the regression on your desktop. My current bet is on preempt-fix,
>so i have put that one first. The other one would be the second patch,
>child-runs-first. The misc patch should have no effect on behavior - but
>i've included it for completeness. (and i was wrong about the 'nice
>fix', it is not in this delta)
>
>	Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
This life is yours.  Some of it was given to you; the rest, you made yourself.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  5:25   ` Willy Tarreau
                       ` (2 preceding siblings ...)
  2007-04-17  6:27     ` Ingo Molnar
@ 2007-04-18  0:06     ` Peter Williams
  3 siblings, 0 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-18  0:06 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Gene Heskett, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, caglar, Dmitry Adamushko

Willy Tarreau wrote:
> Hi Gene,
> 
> On Tue, Apr 17, 2007 at 12:53:56AM -0400, Gene Heskett wrote:
>> On Monday 16 April 2007, Ingo Molnar wrote:
>>> this is the second release of the CFS (Completely Fair Scheduler)
>>> patchset, against v2.6.21-rc7:
>>>
>>>   http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>>>
>>> i'd like to thank everyone for the tremendous amount of feedback and
>>> testing the v1 patch got - i could hardly keep up with just reading the
>>> mails! Some of the stuff people addressed i couldnt implement yet, i
>>> mostly concentrated on bugs, regressions and debuggability.
>>>
>>> there's a fair amount of churn:
>>>
>>>   15 files changed, 456 insertions(+), 241 deletions(-)
>>>
>>> But it's an encouraging sign that there was no crash bug found in v1,
>>> all the bugs were related to scheduling-behavior details. The code was
>>> tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
>>> code size increase in -v2 is due to debugging helpers, they'll be
>>> removed later. (The new /proc/sched_debug file can be used to see the
>>> fine details of CFS scheduling.)
>>>
>>> Changes since -v1:
>>>
>>> - make nice levels less starvable. (reported by Willy Tarreau)
>>>
>>> - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
>>>   flag can be used to turn it on/off. (This might fix the Kaffeine bug
>>>   reported by S.Ça??lar Onur <)
>>>
>>> - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)
>>>
>>> - UP build fix. (reported by Gabriel C)
>>>
>>> - timer tick micro-optimization (Dmitry Adamushko)
>>>
>>> - preemption fix: sched_class->check_preempt_curr method to decide
>>>   whether to preempt after a wakeup (or at a timer tick). (Found via a
>>>   fairness-test-utility written for CFS by Mike Galbraith)
>>>
>>> - start forked children with neutral statistics instead of trying to
>>>   inherit them from the parent: Willy Tarreau reported that this
>>>   results in better behavior on extreme workloads, and it also
>>>   simplifies the code quite nicely. Removed sched_exit() and the
>>>   ->task_exit() methods.
>>>
>>> - make nice levels independent of the sched_granularity value
>>>
>>> - new /proc/sched_debug file listing runqueue details and the rbtree
>>>
>>> - new SCH-* fields in /proc/<NR>/status to see scheduling details
>>>
>>> - new cpu-hog feature (off by default) and sysctl tunable to set it:
>>>   /proc/sys/kernel/sched_max_hog_history_ns tunable defaults to
>>>   0 (off). Positive values are meant the maximum 'memory' that the
>>>   scheduler has of CPU hogs.
>>>
>>> - various code cleanups
>>>
>>> - added more statistics temporarily: sum_exec_runtime,
>>>   sum_wait_runtime.
>>>
>>> - added -CFS-v2 to EXTRAVERSION
>>>
>>> as usual, any sort of feedback, bugreports, fixes and suggestions are
>>> more than welcome,
>>>
>>> 	Ingo
>> This one (v2-rc2) is not a keeper I'm sorry to say, Ingo.  v2-rc0 was much 
>> better.  Watching amanda run with htop, kmails composer is being subjected to 
>> 5 to 10 second pauses, and htop says that gzip -best isn't getting more that 
>> 15% of the cpu, and the /amandatapes drive is being written to in a regular 
>> pattern that seems to be the cause of the pauses  according to gkrellm, which 
>> also seems to track the size of the writes, and can show anything from 4.3k 
>> to 54 megs as being written in one cycle of its screen update.
> 
> Have you tried previous version with the fair-fork patch ? It might be possible
> that your workload is sensible to the fork()'s child getting much CPU upon
> startup.
> 
> Ingo, maybe I'm saying something stupid, but in my userland scheduler, when
> new tasks are "forked", they are queued at the end of the run queue with a
> fixed priority. In our case, this would translate into assigning them the
> same prio and timeslice as their parent, but queuing them at the end so that
> they don't make existing tasks starve during huge fork() loads.
> 
> I don't know how that would be possible (nor if that would help in anything),
> but I found it was a good compromise over sharing the timeslice with the
> parent. Perhaps we should have some absolute timeslice and some relative
> timeslice (eg: X percent of total time divided by the number of tasks) ?

One way of handling forked tasks is to give them a high priority but a 
small chunk (i.e. give them a relatively short time to do some work and 
surrender the CPU voluntarily before you boot them off).  If you choose 
the size of this reduced chunk well the vast majority of tasks will 
never be booted off and will do a small bit of work and either exit or 
sleep and will suffer no penalty as a result of this mechanism.  But it 
gives you a chance to move any newly forked process that turns out to be 
a CPU hog to a lower priority before it gets its next chunk of CPU at 
which time it can revert to getting normal size chunks as pre-emption 
will stop it hogging the CPU from then on.

I've trialled this mechanism in some of my schedulers and it works well.

I found that 10 milliseconds was a good value for the initial chunk of 
CPU for a newly forked process.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [patch] CFS (Completely Fair Scheduler), v2
  2007-04-17  8:30     ` Peter Williams
@ 2007-04-18 19:15       ` Peter Williams
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Williams @ 2007-04-18 19:15 UTC (permalink / raw)
  To: William Lee Irwin III, Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Dmitry Adamushko

Peter Williams wrote:
> William Lee Irwin III wrote:
>> Ingo Molnar wrote:
>>>> this is the second release of the CFS (Completely Fair Scheduler) 
>>>> patchset, against v2.6.21-rc7:
>>>>    http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch
>>>> i'd like to thank everyone for the tremendous amount of feedback and 
>>>> testing the v1 patch got - i could hardly keep up with just reading 
>>>> the mails! Some of the stuff people addressed i couldnt implement 
>>>> yet, i mostly concentrated on bugs, regressions and debuggability.
>>
>> On Tue, Apr 17, 2007 at 04:46:57PM +1000, Peter Williams wrote:
>>> Have you considered using rq->raw_weighted_load instead of 
>>> rq->nr_running in calculating fair_clock?  This would take the nice 
>>> value (or RT priority) of the other tasks into account when 
>>> determining what's fair.
>>
>> I suspect you mean (curr->load_weight*delta_exec)/rq->raw_weighted_load
>> in update_curr().
> 
> Or something like that, yes. :-)

Actually, this formula can't be used for the migration thread itself as 
its load_weight isn't an accurate reflection of its static priority. 
But as the migration thread is a real time task this probably isn't an 
issue, right?

If this assumption is correct (i.e. curr is never a real time task) then 
my earlier caveat re division by zero being possible is invalid because 
the migration task will never be the only task on the runqueue when this 
code is called.

I'm also assuming here that (because of its name) curr is already on the 
runqueue when this code is called.  If it isn't the divisor in the above 
expression should be (rq->raw_weighted_load + curr->load_weight).  This 
would also preclude the possibility of divide by zero.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2007-04-18 19:15 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-16 22:07 [patch] CFS (Completely Fair Scheduler), v2 Ingo Molnar
2007-04-16 22:12 ` S.Çağlar Onur
2007-04-17  8:59   ` Ingo Molnar
2007-04-17 14:45     ` S.Çağlar Onur
2007-04-17 15:48       ` Gabriel C
2007-04-17 16:01       ` Ingo Molnar
2007-04-17  4:06 ` Peter Williams
2007-04-17  6:49   ` Ingo Molnar
2007-04-17  4:53 ` Gene Heskett
2007-04-17  5:25   ` Willy Tarreau
2007-04-17  5:51     ` Gene Heskett
2007-04-17  7:18       ` Paolo Ornati
2007-04-17  5:51     ` Mike Galbraith
2007-04-17  6:27     ` Ingo Molnar
2007-04-18  0:06     ` Peter Williams
2007-04-17  6:18   ` Ingo Molnar
2007-04-17  7:01     ` Ingo Molnar
2007-04-17  7:31       ` Davide Libenzi
2007-04-17  7:39         ` Ingo Molnar
2007-04-17 17:18           ` Gene Heskett
2007-04-17 17:15       ` Gene Heskett
2007-04-17 17:22       ` Gene Heskett
2007-04-17  8:03     ` Davide Libenzi
2007-04-17  8:18       ` Nick Piggin
2007-04-17  8:26         ` Ingo Molnar
2007-04-17  8:41           ` Nick Piggin
2007-04-17  8:57             ` Ingo Molnar
2007-04-17  8:20       ` Ingo Molnar
2007-04-17 16:12     ` Gene Heskett
2007-04-17  6:46 ` Peter Williams
2007-04-17  7:51   ` William Lee Irwin III
2007-04-17  8:16     ` Ingo Molnar
2007-04-17  8:52       ` Ingo Molnar
2007-04-17 14:05       ` Peter Williams
2007-04-17  8:30     ` Peter Williams
2007-04-18 19:15       ` Peter Williams
2007-04-17  9:53   ` Ingo Molnar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.