* Re: [patch] CFS scheduler, -v8
@ 2007-05-03 8:20 Zoltan Boszormenyi
2007-05-03 13:02 ` Ingo Molnar
0 siblings, 1 reply; 75+ messages in thread
From: Zoltan Boszormenyi @ 2007-05-03 8:20 UTC (permalink / raw)
To: linux-kernel
Hi!
> *** Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > The problem is with comparing s64 values with (s64)ULONG_MAX, which
> > evaluates to -1. Then we check if exec_delta64 and fair_delta64 are
> > greater than (s64)ULONG_MAX (-1), if so we assign (s64)ULONG_MAX to
> > the respective values.
>
> ah, indeed ...
>
> > The fix is to compare these values against (s64)LONG_MAX and assign
> > (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
> > than (s64)LONG_MAX.
> >
> > Tested on PowerPC, the regression is gone, tasks are load balanced as
> > they were in v7.
>
> thanks, applied!
>
> Ingo
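
The cast bug described above is easy to reproduce outside the kernel: on a 64-bit machine, (s64)ULONG_MAX is -1, so a signed "greater than" test against it clamps every positive delta to -1 instead of acting as an upper bound. A minimal sketch in Python (emulating the C casts with two's-complement arithmetic; only the comparison logic from the discussion is modeled, not the surrounding kernel code):

```python
ULONG_MAX = 2**64 - 1          # unsigned long on a 64-bit kernel
LONG_MAX = 2**63 - 1

def s64(v):
    """Emulate a C cast to signed 64-bit (s64) via two's complement."""
    v &= (1 << 64) - 1
    return v - (1 << 64) if v >= (1 << 63) else v

# The heart of the bug: (s64)ULONG_MAX is -1.
assert s64(ULONG_MAX) == -1

def clamp_broken(exec_delta64):
    # Comparing against (s64)ULONG_MAX, i.e. -1: every positive delta
    # "exceeds" the limit and collapses to -1.
    if exec_delta64 > s64(ULONG_MAX):
        exec_delta64 = s64(ULONG_MAX)
    return exec_delta64

def clamp_fixed(exec_delta64):
    # The fix: clamp against (s64)LONG_MAX instead.
    if exec_delta64 > s64(LONG_MAX):
        exec_delta64 = s64(LONG_MAX)
    return exec_delta64

assert clamp_broken(1000) == -1        # bug: a sane delta becomes -1
assert clamp_fixed(1000) == 1000       # small deltas pass through
assert clamp_fixed(2**63) == LONG_MAX  # huge deltas clamp as intended
```

The same reasoning applies to fair_delta64 in the patch under discussion.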
I started up 12 glxgears to see the effect of CFS v8
on my Athlon64 X2 4200.
Without this patch all the GL load was handled by the second core;
the system only stressed the first core if other processes were also
started, e.g. a kernel compilation. With this patch the load is evenly
balanced across the two cores all the time. And while doing make -j4
on the kernel, the 12 gears are still spinning at 185+ FPS and
there are only slightly visible hiccups. Switching between workspaces,
i.e. refreshing the large windows of Thunderbird and Firefox, is done
very quickly, and the whole system is exceptionally responsive.
Thanks for this great work!
Best regards,
Zoltán Böszörményi
^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [patch] CFS scheduler, -v8
  2007-05-03  8:20 [patch] CFS scheduler, -v8 Zoltan Boszormenyi
@ 2007-05-03 13:02 ` Ingo Molnar
  2007-05-03 13:29   ` Damien Wyart
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2007-05-03 13:02 UTC (permalink / raw)
To: Zoltan Boszormenyi; +Cc: linux-kernel

* Zoltan Boszormenyi <zboszor@dunaweb.hu> wrote:

> I started up 12 glxgears to see the effect of CFS v8 on my Athlon64 X2
> 4200.
>
> Without this patch all the GL load was handled by the second core, the
> system only stressed the first core if other processes were also
> started, i.e. a kernel compilation. With this patch the load is evenly
> balanced across the two cores all the time.

ok, i didnt realize that it would affect x86_64 too. I'll do a -v9
release with this fix included.

> [...] And while doing make -j4 on the kernel, the 12 gears are still
> spinning about 185+ FPS and there are only slightly visible hiccups.
> Switching between workspaces, i.e. refreshing the large windows of
> Thunderbird and Firefox are done very quickly, the whole system is
> exceptionally responsive.

great! So it seems -v8 does improve OpenGL handling too :-)

> Thanks for this great work!

you are welcome :)

	Ingo
* Re: [patch] CFS scheduler, -v8
  2007-05-03 13:02 ` Ingo Molnar
@ 2007-05-03 13:29   ` Damien Wyart
  2007-05-03 14:53     ` Srivatsa Vaddagiri
  2007-05-07  0:04     ` Bill Davidsen
  0 siblings, 2 replies; 75+ messages in thread
From: Damien Wyart @ 2007-05-03 13:29 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

Hello,

* Ingo Molnar <mingo@elte.hu> [2007-05-03 15:02]:
> great! So it seems -v8 does improve OpenGL handling too :-)

What are your thoughts/plans concerning merging CFS into mainline ? Is
it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
change the default scheduler rapidly (the discussion was about sd at
that time) to get more testing than in -mm.

-- 
Damien Wyart
* Re: [patch] CFS scheduler, -v8
  2007-05-03 13:29 ` Damien Wyart
@ 2007-05-03 14:53   ` Srivatsa Vaddagiri
  2007-05-03 15:53     ` William Lee Irwin III
  2007-05-07  0:04   ` Bill Davidsen
  1 sibling, 1 reply; 75+ messages in thread
From: Srivatsa Vaddagiri @ 2007-05-03 14:53 UTC (permalink / raw)
To: Damien Wyart; +Cc: Ingo Molnar, linux-kernel, torvalds

On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
> What are your thoughts/plans concerning merging CFS into mainline ? Is
> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> change the default scheduler rapidly (the discussion was about sd at
> that time) to get more testing than in -mm.

And what about group scheduling extensions? Do you have plans to work on
it? I was beginning to work on a prototype to do group scheduling based
on CFS, basically on the lines of what you and Linus had outlined
earlier:

	http://lkml.org/lkml/2007/4/18/271
	http://lkml.org/lkml/2007/4/18/244

-- 
Regards,
vatsa
* Re: [patch] CFS scheduler, -v8
  2007-05-03 14:53 ` Srivatsa Vaddagiri
@ 2007-05-03 15:53   ` William Lee Irwin III
  2007-05-03 18:44     ` Li, Tong N
  2007-05-07 14:22     ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 75+ messages in thread
From: William Lee Irwin III @ 2007-05-03 15:53 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: Damien Wyart, Ingo Molnar, linux-kernel, torvalds, Tong N. Li

On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
>> What are your thoughts/plans concerning merging CFS into mainline ? Is
>> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
>> change the default scheduler rapidly (the discussion was about sd at
>> that time) to get more testing than in -mm.

On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> And what about group scheduling extensions? Do you have plans to work on
> it? I was begining to work on a prototype to do group scheduling based
> on CFS, basically on the lines of what you and Linus had outlined
> earlier:
> http://lkml.org/lkml/2007/4/18/271
> http://lkml.org/lkml/2007/4/18/244

Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
have the mindshare cfs seems to have acquired.

The hyperlink seems to have broken, though:
http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

-- wli
* Re: [patch] CFS scheduler, -v8
  2007-05-03 15:53 ` William Lee Irwin III
@ 2007-05-03 18:44   ` Li, Tong N
  2007-05-03 19:52     ` William Lee Irwin III
  0 siblings, 1 reply; 75+ messages in thread
From: Li, Tong N @ 2007-05-03 18:44 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Srivatsa Vaddagiri, Damien Wyart, Ingo Molnar, linux-kernel,
	torvalds

On Thu, 2007-05-03 at 08:53 -0700, William Lee Irwin III wrote:
> On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
> >> What are your thoughts/plans concerning merging CFS into mainline ? Is
> >> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> >> change the default scheduler rapidly (the discussion was about sd at
> >> that time) to get more testing than in -mm.
>
> On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> > And what about group scheduling extensions? Do you have plans to work on
> > it? I was begining to work on a prototype to do group scheduling based
> > on CFS, basically on the lines of what you and Linus had outlined
> > earlier:
> > http://lkml.org/lkml/2007/4/18/271
> > http://lkml.org/lkml/2007/4/18/244
>
> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
> have the mindshare cfs seems to have acquired.
>
> The hyperlink seems to have broken, though:
> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

Yeah, I just fixed the link. The code was written based on the 2.6.19.2
scheduler. I could forward-port it to the latest kernel if there's
interest and I can find time. :) Here's a description of the design:

http://lkml.org/lkml/2007/4/20/286

Thanks,

  tong
* Re: [patch] CFS scheduler, -v8
  2007-05-03 18:44 ` Li, Tong N
@ 2007-05-03 19:52   ` William Lee Irwin III
  0 siblings, 0 replies; 75+ messages in thread
From: William Lee Irwin III @ 2007-05-03 19:52 UTC (permalink / raw)
To: Li, Tong N
Cc: Srivatsa Vaddagiri, Damien Wyart, Ingo Molnar, linux-kernel,
	torvalds

On Thu, 2007-05-03 at 08:53 -0700, William Lee Irwin III wrote:
>> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
>> have the mindshare cfs seems to have acquired.
>> The hyperlink seems to have broken, though:
>> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

On Thu, May 03, 2007 at 11:44:27AM -0700, Li, Tong N wrote:
> Yeah, I just fixed the link. The code was written based on the 2.6.19.2
> scheduler. I could forward-port it to the latest kernel if there's
> interest and I can find time. :) Here's a description of the design:
> http://lkml.org/lkml/2007/4/20/286

I have the interest. I don't see performance issues with any of the
schedulers, but am rather interested in the feature.

-- wli
* Re: [patch] CFS scheduler, -v8
  2007-05-03 15:53 ` William Lee Irwin III
  2007-05-03 18:44   ` Li, Tong N
@ 2007-05-07 14:22   ` Srivatsa Vaddagiri
  2007-05-07 20:54     ` Li, Tong N
  1 sibling, 1 reply; 75+ messages in thread
From: Srivatsa Vaddagiri @ 2007-05-07 14:22 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Damien Wyart, Ingo Molnar, linux-kernel, torvalds, Tong N. Li

On Thu, May 03, 2007 at 08:53:47AM -0700, William Lee Irwin III wrote:
> On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> > And what about group scheduling extensions? Do you have plans to work on
> > it? I was begining to work on a prototype to do group scheduling based
> > on CFS, basically on the lines of what you and Linus had outlined
> > earlier:
> > http://lkml.org/lkml/2007/4/18/271
> > http://lkml.org/lkml/2007/4/18/244
>
> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
> have the mindshare cfs seems to have acquired.
>
> The hyperlink seems to have broken, though:
> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

The big question I have is: how well does DWRR fit into "currently hot"
scheduling frameworks like CFS? For example, if the goal is to do fair
(group) scheduling of SCHED_NORMAL tasks, can CFS and DWRR co-exist?
Both seem to be radically different algorithms, and my initial
impression of them co-existing is "no", but I would be glad to be
corrected if I am wrong. If they can't co-exist, then we need a
different way of doing group scheduling on top of CFS, as that is
gaining more popularity on account of better handling of interactivity.

Tong,
	I understand a central hallmark of DWRR is SMP fairness. Have
you considered how good or bad the other alternative to achieve SMP
fairness which is in vogue today is: pressure/weight-based balancing
(ex: smpnice and the CKRM CPU scheduler -
ckrm.sourceforge.net/downloads/ckrm-ols03-slides.pdf)?

-- 
Regards,
vatsa
* Re: [patch] CFS scheduler, -v8
  2007-05-07 14:22 ` Srivatsa Vaddagiri
@ 2007-05-07 20:54   ` Li, Tong N
  0 siblings, 0 replies; 75+ messages in thread
From: Li, Tong N @ 2007-05-07 20:54 UTC (permalink / raw)
To: vatsa
Cc: William Lee Irwin III, Damien Wyart, Ingo Molnar, linux-kernel,
	torvalds

On Mon, 2007-05-07 at 19:52 +0530, Srivatsa Vaddagiri wrote:
> The big question I have is: how well does DWRR fit into "currently hot"
> scheduling frameworks like CFS? For example, if the goal is to do fair
> (group) scheduling of SCHED_NORMAL tasks, can CFS and DWRR co-exist?
> Both seem to be radically different algorithms, and my initial
> impression of them co-existing is "no", but I would be glad to be
> corrected if I am wrong. If they can't co-exist, then we need a
> different way of doing group scheduling on top of CFS, as that is
> gaining more popularity on account of better handling of interactivity.

Yeah, the intent of DWRR was to provide proportional fairness and rely
on the underlying scheduler to support interactivity. In a way, DWRR
ensures that each task receives its fair share, while the underlying
scheduler determines the right order in which to run the tasks. Since
SD is structurally similar to the stock scheduler, DWRR should co-exist
with it easily. Co-existing with CFS requires more work, but I think
the round-robin mechanism in DWRR could be applicable to CFS to
facilitate cross-processor fairness.

> Tong,
> 	I understand a central hallmark of DWRR is SMP fairness. Have
> you considered how good or bad the other alternative to achieve SMP
> fairness which is in vogue today is: pressure/weight-based balancing
> (ex: smpnice and the CKRM CPU scheduler -
> ckrm.sourceforge.net/downloads/ckrm-ols03-slides.pdf)?

The disadvantage of DWRR is its potential overhead, and the advantage
is that it can provide stronger fairness. If we have 2 processors and 3
tasks, DWRR ensures that each task gets 66% of the total CPU time,
while smpnice would keep two tasks on the same processor and the third
one on another. I did an analysis and showed that the lag bound of DWRR
is constant if task weights are bounded by a constant. On the other
hand, the cost DWRR pays is that it requires more task migrations. I
tested with a set of benchmarks on an SMP system and didn't see
migrations causing much performance impact, but this is certainly a big
issue for NUMA.

  tong

PS. I'm now porting the code to the latest kernel and will post it as
soon as I'm done.
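
The 2-processor/3-task figure above can be checked with a bit of arithmetic. This sketch is purely illustrative (it computes ideal shares, not actual scheduler behaviour): with fair cross-CPU rotation each task's share is nr_cpus/nr_tasks, while a static two-plus-one placement pins the shares at 50%/50%/100%.

```python
from fractions import Fraction

nr_cpus, nr_tasks = 2, 3

# Ideal proportional share (what DWRR's round-robin rotation aims for):
# every task gets nr_cpus/nr_tasks of one CPU's worth of time.
ideal = Fraction(nr_cpus, nr_tasks)
assert ideal == Fraction(2, 3)          # ~66% per task, as in the mail

# Static placement: two tasks sharing CPU0, one task alone on CPU1.
static = [Fraction(1, 2), Fraction(1, 2), Fraction(1, 1)]
assert sum(static) == nr_cpus           # all CPU time is used either way

# The per-task unfairness (lag) of the static split: the co-scheduled
# tasks each fall 1/6 short, the solo task runs 1/3 ahead.
lags = [share - ideal for share in static]
assert lags == [Fraction(-1, 6), Fraction(-1, 6), Fraction(1, 3)]
```

Note that both placements use all available CPU time; the difference is only in how it is divided among tasks of equal weight.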
* Re: [patch] CFS scheduler, -v8
  2007-05-03 13:29 ` Damien Wyart
  2007-05-03 14:53   ` Srivatsa Vaddagiri
@ 2007-05-07  0:04   ` Bill Davidsen
  1 sibling, 0 replies; 75+ messages in thread
From: Bill Davidsen @ 2007-05-07 0:04 UTC (permalink / raw)
To: Damien Wyart
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Con Kolivas,
	William Lee Irwin III

Damien Wyart wrote:
> Hello,
>
> * Ingo Molnar <mingo@elte.hu> [2007-05-03 15:02]:
>> great! So it seems -v8 does improve OpenGL handling too :-)
>
> What are your thoughts/plans concerning merging CFS into mainline ? Is
> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> change the default scheduler rapidly (the discussion was about sd at
> that time) to get more testing than in -mm.

It would seem that a number of us are testing the variants and
reporting results (I now have to pull all the new versions and redo
stuff, but I have a new test I'm going to run as well). Given that and
the fact that there is another idea which can be tested from Tong Li, I
think that decision can be postponed, Linus willing. There appear to be
enough testers to be driving evolution now, authors may disagree, of
course.

Several folks added to cc list...

-- 
Bill Davidsen <davidsen@tmr.com>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
* [patch] CFS scheduler, -v8
@ 2007-05-01 21:22 Ingo Molnar
2007-05-02 2:57 ` Ting Yang
` (5 more replies)
0 siblings, 6 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-01 21:22 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter,
buddabrod, Balbir Singh
i'm pleased to announce release -v8 of the CFS scheduler patchset. (The
main goal of CFS is to implement "desktop scheduling" with as high
quality as technically possible.)
The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be
downloaded from the usual place:
http://people.redhat.com/mingo/cfs-scheduler/
-v7 resolved a couple of important regressions while not introducing new
regressions, so i felt it was time to step forward: -v8 tries to address
one of the last (known) frontiers: 3D/OpenGL games^H^H^H applications
'smoothness'.
To achieve more scheduling smoothness, -v8 introduces a new 'precise
load calculation and smoothing' feature. (A variant of this was
suggested by Peter Williams in earlier CFS discussions - thanks Peter!)
i was able to reuse the rq->cpu_load[] load average calculation from
Peter Williams's smpnice code, and it's thus now utilized on UP too. As
a nice side-effect of CFS using a smoothed load metric, apps should also
start up faster under load. CFS now utilizes the full range of smpnice
metrics.
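
The announcement does not include the smoothing formula itself, but the general shape of an rq->cpu_load[]-style smoothed load metric is an exponential moving average of the instantaneous runqueue load. A hedged sketch of that idea (the decay factor and update rule here are illustrative only, not the actual -v8 code):

```python
def smooth_load(samples, decay=8):
    """Exponentially smooth a series of instantaneous load samples.

    Each tick the smoothed value moves 1/decay of the way toward the
    new sample, so short spikes are damped while sustained load shows
    through fully.
    """
    smoothed = 0.0
    history = []
    for load in samples:
        smoothed += (load - smoothed) / decay
        history.append(smoothed)
    return history

# A one-tick spike barely moves the metric; sustained load converges.
spike = smooth_load([0, 0, 1024, 0, 0])
sustained = smooth_load([1024] * 40)
assert spike[2] == 128.0                # 1024/8: the spike is damped 8x
assert sustained[-1] > 1000             # sustained load is fully reflected
```

The point of such smoothing for interactivity is visible in the two cases: a briefly runnable task (an app starting up under load) barely perturbs the load metric, while a sustained CPU hog is weighted at its true load.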
No other fundamental portion of CFS was touched, so the rate of change
is moderate:
7 files changed, 140 insertions(+), 79 deletions(-)
Changes since -v7:
- powerpc debug output and build warning fixes (Balbir Singh)
- documentation fixes (Zach Carter)
- interactivity: precise load calculation and load smoothing
As usual, any sort of feedback, bugreport, fix and suggestion is more
than welcome,
Ingo
* Re: [patch] CFS scheduler, -v8
  2007-05-01 21:22 [patch] CFS scheduler, -v8 Ingo Molnar
@ 2007-05-02  2:57 ` Ting Yang
  2007-05-02  5:10   ` Willy Tarreau
  ` (6 more replies)
  2007-05-02  6:37 ` Mike Galbraith
  ` (4 subsequent siblings)
  5 siblings, 7 replies; 75+ messages in thread
From: Ting Yang @ 2007-05-02 2:57 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

Hi, Ingo

My name is Ting Yang, a graduate student from UMASS. I am currently
studying the Linux scheduler and virtual memory manager to solve some
page swapping problems. I am very excited about the new scheduler CFS.
After I read through your code, I think that you might be interested in
reading this paper:

  "A Proportional Share Resource Allocation Algorithm for Real-Time,
  Time-Shared Systems", by Ion Stoica. You can find the paper here:
  http://citeseer.ist.psu.edu/37752.html

The authors of this paper proposed a scheduler: Earliest Eligible
Virtual Deadline First (EEVDF). EEVDF uses exactly the same method as
CFS to track the execution of each running task. The only difference
between EEVDF and CFS is that EEVDF tries to be _deadline_ fair while
CFS is _start-time_ fair. Scheduling based on deadlines gives a better
response time bound and seems to be more fair.

In the following part of this email, I will try to explain the
similarities and differences between EEVDF and CFS. Hopefully, this
might provide you with some useful information w.r.t. your current
work on CFS.

Similarities:

1. Both EEVDF and CFS use virtual time to track the system's progress.

   CFS maintains rq->fair_clock for each cpu, which is advanced every
   tick by:

     SCALE/sum(p_i->loadweight)

   where p_i->loadweight is the weight of each task, mapped from its
   nice value via prio_to_load_shift[], given that the default weight
   is SCALE (1024).

   EEVDF maintains a virtual time, which is advanced every tick by:

     1/sum(w_i)

   where w_i is the weight of each task, given that the default weight
   is 1.
   Therefore, EEVDF and CFS monitor system progress in the same way,
   just normalized to different scales.

2. EEVDF and CFS track each task's progress in the same way.

   CFS maintains a p->fair_key which indicates the amount of work done
   by this task. When p executes for a period t, p->fair_key is
   incremented by:

     t * SCALE/p->loadweight   // the default weight is SCALE
                               // (based on solving equations in your code :-))

   EEVDF does the same thing under the assumption that the default
   weight is 1; it uses the same equation to calculate deadlines for
   running tasks.

Differences:

The main difference between CFS and EEVDF lies in the scheduling
policy, although they follow the same model for tracking tasks.

*** CFS: when a task starts, it gets p->fair_key from the current
virtual time rq->fair_clock. It will not be scheduled for execution
until all other running tasks' fair_keys go beyond p->fair_key by a
certain number of virtual ticks (which is 5 ticks for the current
setting of CFS).

(I wish I could show examples with graphs, instead of my not-so-good
English, but I am not sure if it is appropriate to attach figures on
this mailing list.)

EXAMPLE: assume the system runs at 1000 ticks/second, i.e. 1ms a tick,
and the granularity of preemption for CFS is 5 virtual ticks (the
current setting). If, at time t=0, we start 2 tasks: p1 and p2, both
with nice value 0 (weight 1024), and rq->fair_clock is initialized to
0, then we have:

  p1->fair_key = p2->fair_key = rq->fair_clock = 0.

CFS breaks the tie arbitrarily, say it executes p1. After 1 system
tick (1ms later) t=1, we have:

  rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0.

Suppose a new task p3 starts with nice value -10 at this moment, that
is, p3->fair_key = 1/2. In this case, CFS will not schedule p3 for
execution until the fair_keys of p1 and p2 go beyond 5 + 1/2 (which
translates to about 10ms later in this setting), _regardless of_ the
priority (weight) of p3.
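
The bookkeeping in the example above can be reproduced with a few lines of arithmetic. This is a simplified model of the description in this mail (only SCALE, fair_clock and fair_key; all other CFS details omitted, and the weight chosen for nice -10 is illustrative), not the actual kernel code:

```python
from fractions import Fraction

SCALE = 1024

class Task:
    def __init__(self, name, weight, fair_key):
        self.name, self.weight, self.fair_key = name, weight, fair_key

def tick(fair_clock, tasks, running, t=1):
    """Advance the virtual clock and charge the running task for t ticks."""
    fair_clock += t * Fraction(SCALE, sum(p.weight for p in tasks))
    running.fair_key += t * Fraction(SCALE, running.weight)
    return fair_clock

# t=0: two nice-0 tasks (default weight SCALE), fair_clock starts at 0.
fair_clock = Fraction(0)
p1 = Task("p1", SCALE, Fraction(0))
p2 = Task("p2", SCALE, Fraction(0))

# CFS breaks the tie arbitrarily and runs p1 for one 1ms tick.
fair_clock = tick(fair_clock, [p1, p2], running=p1)
assert fair_clock == Fraction(1, 2)
assert (p1.fair_key, p2.fair_key) == (1, 0)

# t=1: p3 (nice -10, here given an arbitrary higher weight) starts and
# inherits fair_clock as its key -- its weight plays no role in when it
# first gets the CPU, which is exactly the problem described above.
p3 = Task("p3", 2 * SCALE, fair_key=fair_clock)
assert p3.fair_key == Fraction(1, 2)
```
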
Further imagine that we start n tasks at time t=0 and then start a
task p_(n+1) at time t=1: the delay of task p_(n+1) is actually
proportional to the number of running tasks n. Formally speaking, CFS
can have an *O(n)* lag w.r.t. the ideal proportional-share fluid-flow
system, which can be of arbitrarily fine granularity. The cause of
this problem is that p->fair_key only reflects a fair point at which a
task should be started, but does not carry any information about how
urgent the task is (i.e. the priority or weight of the task).

*** In EEVDF, each task p_i is specified by 2 parameters: weight w_i
and timeslice length l_i. EEVDF tries to schedule tasks based on the
virtual deadline vd_i by which a timeslice l_i should be finished.

EEVDF keeps a virtual start time (ve_i) and a virtual deadline (vd_i)
for each task. When a task starts, its ve_i is initialized to the
current virtual time, and its virtual deadline is calculated as:

  vd_i = ve_i + l_i/w_i   // the same method CFS uses to update fair_key

When l_i amount of work is done, the just-expired vd_i becomes the new
ve_i. That is, the virtual start time of the next l_i amount of work
is the deadline of the previously finished timeslice. The virtual
deadline vd_i is then updated using the above equation.

EEVDF scheduling policy: always execute the task with the _earliest_
virtual deadline.

EXAMPLE: assume the system has 1000 ticks per second. At time t=0, we
start 2 tasks: p1 and p2, such that w_1 = w_2 = 1 and l_1 = l_2 = 5
ticks, i.e. 5ms (which reflects the similar setting in the CFS case).
Furthermore, the system virtual time vt is initialized to 0. Now at
time t=0, we have:

  vt = 0
  ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 5
  ve_2 = 0, vd_2 = ve_2 + l_2/w_2 = 5

As p1 and p2 have the same virtual deadline, EEVDF breaks the tie
arbitrarily, say it executes p1.
After 1 system tick (1ms later), we have:

  vt = 0 + 1/(w_1 + w_2) = 1/2   // just as CFS updates rq->fair_clock
  ve_1 = 0, vd_1 = 5             // not changed
  ve_2 = 0, vd_2 = 5             // not changed

Suppose we start a new task p3 at this moment, with weight w_3 = 2 and
timeslice l_3 = 5 ticks (5ms). Then:

  ve_3 = vt = 1/2
  vd_3 = ve_3 + l_3/w_3 = 1/2 + 5/2 = 3

Hmmm, the scheduler will now execute task p3 first, since it has the
earliest deadline. The deadline implicitly contains some very
important information that the CFS fair_key does not have: how
urgently this amount of work has to be done, which comes from the
weight of the task.

Formally speaking, EEVDF has a constant O(1) lag w.r.t. the ideal
proportional-share fluid-flow system. (Please look at the paper for
the detailed proofs.) In normal cases, EEVDF ensures that a task p_i
does not fall behind the ideal system by more than l_i time.
Occasionally, the delay can be max(l_i), the maximum timeslice used by
all active tasks, due to tasks dynamically entering and leaving.
(Again, the paper gives a more detailed explanation of this.)

We can see that the timeslice l_i used by a task p_i actually controls
the accuracy of scheduling p_i: the smaller l_i, the closer to the
ideal system during p_i's execution. For example, in the above case,
suppose p3 has w_3 = 1 (the same as p1 and p2) and l_3 = 3 (3ms).
Since vd_3 = 1/2 + l_3/w_3 = 7/2, the scheduler still executes p3
first, even though p1, p2 and p3 have the same weight. A smaller l_i
leads the scheduler to handle p_i at a finer granularity: intuitively,
it makes the scheduler check the deadline of p_i more frequently and
ensures each small piece of work is done on time, while a larger l_i
does the reverse.

The decoupling of weight w_i and timeslice l_i is important. Generally
speaking, weight determines throughput and timeslice determines the
responsiveness of a task.
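
The EEVDF arithmetic in the example can likewise be checked mechanically. This sketch implements only the virtual-time bookkeeping described in this mail (ve/vd initialization and earliest-virtual-deadline-first selection), under the same parameters as the example:

```python
from fractions import Fraction

class Task:
    def __init__(self, name, weight, timeslice, ve):
        self.name, self.w, self.l = name, weight, timeslice
        self.ve = ve
        self.vd = ve + Fraction(timeslice, weight)  # vd_i = ve_i + l_i/w_i

def pick(tasks):
    """EEVDF policy: run the task with the earliest virtual deadline."""
    return min(tasks, key=lambda p: p.vd)

# t=0: two tasks, w=1, l=5 ticks; virtual time vt starts at 0.
vt = Fraction(0)
p1 = Task("p1", 1, 5, ve=vt)
p2 = Task("p2", 1, 5, ve=vt)
assert p1.vd == p2.vd == 5               # tie, broken arbitrarily: run p1

# One 1ms tick of p1: vt advances by 1/sum(w_i); deadlines are unchanged.
vt += Fraction(1, p1.w + p2.w)
assert vt == Fraction(1, 2)

# t=1: p3 starts with w=2, l=5; its deadline lands *before* p1's and p2's.
p3 = Task("p3", 2, 5, ve=vt)
assert p3.vd == 3                        # 1/2 + 5/2
assert pick([p1, p2, p3]).name == "p3"   # weight feeds into urgency

# Variant from the mail: equal weight but a shorter timeslice also wins.
p3b = Task("p3b", 1, 3, ve=vt)
assert p3b.vd == Fraction(7, 2)          # 1/2 + 3/1
assert pick([p1, p2, p3b]).name == "p3b"
```
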
In normal situations, high-priority tasks usually need more CPU
capacity within a short period of time (bursty, such as keyboard and
mouse input, X updates, daemons, etc.), and need to be processed as
quickly as possible (responsiveness and interactivity). Following the
analysis above, we know that for higher-priority tasks we should give
a _higher weight_ to ensure their CPU throughput, and at the same time
a _smaller timeslice_ to ensure better responsiveness. This is a bit
counter-intuitive compared to the current Linux implementation, where
a smaller nice value leads to both a higher weight and a larger
timeslice.

Now let's see what happens in the real-time world. Usually a real-time
application is specified by (C_i, T_i), where C_i is the amount of
work and T_i is the period of time within which the work has to be
finished. For example, (20ms, 50ms) says the scheduler has to do 20ms
of work within a window of 50ms for this task. A smaller T_i gives
tighter response constraints. Note that B_i = C_i/T_i is really the
CPU proportion for the task during its execution.

In our proportional-share world, weight w_i decides the CPU
proportion, which maps to B_i, and timeslice l_i gives the amount of
work that should be done each time, which maps to C_i. Then, using w_i
and l_i, we can derive a window size, which actually maps to T_i.
Therefore a smaller l_i leads to a smaller T_i, which means tighter
response constraints. It seems to me all of this makes a lot of sense
in both the proportional-share and the real-time world.

Based on my understanding, adopting something like EEVDF in CFS should
not be very difficult given their similarities, although I do not have
any idea how this impacts load balancing for SMP. Is this worth a try?

Sorry for such a long email :-)

Ting
* Re: [patch] CFS scheduler, -v8
  2007-05-02  2:57 ` Ting Yang
@ 2007-05-02  5:10   ` Willy Tarreau
  2007-05-02  5:30   ` William Lee Irwin III
  ` (5 subsequent siblings)
  6 siblings, 0 replies; 75+ messages in thread
From: Willy Tarreau @ 2007-05-02 5:10 UTC (permalink / raw)
To: Ting Yang; +Cc: Ingo Molnar, linux-kernel

Hi Ting,

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
>
> Hi, Ingo
>
> My name is Ting Yang, a graduate student from UMASS. I am currently
> studying the linux scheduler and virtual memory manager to solve some
> page swapping problems. I am very excited with the new scheduler CFS.
> After I read through your code, I think that you might be interested in
> reading this paper:
>
> "A Proportional Share REsource Allocation Algorithm for Real-Time,
> Time-Shared Systems", by Ion Stoica. You can find the paper here:
> http://citeseer.ist.psu.edu/37752.html
>
> Authors of this paper proposed a scheduler: Earlist Eligible Virtual
> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to
> track the execution of each running task. The only difference between
> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is
> _start-time_ fair. Scheduling based on deadline gives better reponse
> time bound and seems to more fair.
>
> In the following part of this email, I will try to explain the
> similarities and differences between EEVDF and CFS. Hopefully, this
> might provide you with some useful information w.r.t your current work
> on CFS.
(...)

Thanks very much for this very clear explanation. Now I realize that
some of the principles I've had in mind for a long time already exist
and are documented ! That's what I called sorting by job completion
time in the past, which might not have been clear for everyone. Now
you have put words on all those concepts, it's more clear ;-)

> The decouple of weight w_i and timeslice l_i is important.
> Generally speaking, weight determines throughput and timeslice
> determines the responsiveness of a task.

I 100% agree. That's the problem we have with nice today. Some people
want to use nice to assign more CPU to tasks (as has always been for
years) and others want to use nice to get better interactivity
(meaning nice as when you're in a queue and letting the old woman go
before you). IMHO, the two concepts are opposed. Either you're a CPU
hog OR you get quick responsiveness.

> In normal situation, high priority tasks usually need more cpu
> capacity within short period of time (bursty, such as keyboard, mouse
> move, X updates, daemons, etc), and need to be processed as quick as
> possible (responsiveness and interactiveness). Follow the analysis
> above, we know that for higher priority tasks we should give _higher
> weight_ to ensure its CPU throughput, and at the same time give
> _smaller timeslice_ to ensure better responsiveness. This is a bit
> counter-intuitive against the current linux implementation: smaller
> nice value leads to higher weight and larger timeslice.

We have an additional problem in Linux, and not the least: it already
exists and is deployed everywhere, so we cannot break existing setups.
More specifically, we don't want to play with nice values of processes
such as X. That's why I think that monitoring the amount of the
time-slice (l_i) consumed by the task is important. I proposed to
conserve the unused part of l_i as a credit (and conversely the credit
can be negative if the time-slice has been over-used). This credit
would serve two purposes:

  - reassign the unused part of l_i on next time-slices to get the
    most fair share of CPU between tasks

  - use it as an interactivity key to sort the tasks. Basically, if we
    note u_i the unused CPU cycles, you can sort based on (d_i - u_i)
    instead of just d_i, and the less hungry tasks will reach the CPU
    faster than others.

(...)
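
The credit idea above can be sketched in a few lines. This is purely illustrative of the proposed (d_i - u_i) sort key; the task mix and numbers are invented, and nothing here corresponds to existing scheduler code:

```python
from fractions import Fraction

# (name, deadline d_i, unused-timeslice credit u_i)
tasks = [
    ("cpu_hog",  Fraction(5), Fraction(0)),    # always burns its full slice
    ("editor",   Fraction(5), Fraction(4)),    # mostly sleeps: large credit
    ("overuser", Fraction(4), Fraction(-2)),   # over-used its slice: debt
]

# Plain deadline order would run "overuser" first; crediting unused
# cycles lets the least hungry task ("editor") reach the CPU first.
by_deadline = sorted(tasks, key=lambda t: t[1])
by_credit = sorted(tasks, key=lambda t: t[1] - t[2])

assert by_deadline[0][0] == "overuser"
assert by_credit[0][0] == "editor"       # key 5 - 4 = 1, the smallest
```
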
> Based on my understanding, adopting something like EEVDF in CFS should
> not be very difficult given their similarities, although I do not have
> any idea on how this impacts the load balancing for SMP. Does this
> worth a try?

I think that if you have time to spend on this, everyone would like to
see the difference. All the work on the scheduler is more or less
experimental and several people are exchanging ideas right now, so it
should be the right moment. You seem to understand both approaches
very well and it's likely that it will not take you too much time :-)

> Sorry for such a long email :-)

It was worth it, thanks !

Willy
* Re: [patch] CFS scheduler, -v8
  2007-05-02  2:57 ` Ting Yang
  2007-05-02  5:10   ` Willy Tarreau
@ 2007-05-02  5:30   ` William Lee Irwin III
  2007-05-02 10:05   ` Bill Huey
  ` (4 subsequent siblings)
  6 siblings, 0 replies; 75+ messages in thread
From: William Lee Irwin III @ 2007-05-02 5:30 UTC (permalink / raw)
To: Ting Yang; +Cc: Ingo Molnar, linux-kernel

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
> Authors of this paper proposed a scheduler: Earlist Eligible Virtual
> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to
> track the execution of each running task. The only difference between
> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is
> _start-time_ fair. Scheduling based on deadline gives better reponse
> time bound and seems to more fair.
> In the following part of this email, I will try to explain the
> similarities and differences between EEVDF and CFS. Hopefully, this
> might provide you with some useful information w.r.t your current work
> on CFS.

Any chance you could write a patch to convert CFS to EEVDF? People may
have an easier time understanding code than theoretical explanations.
(I guess I could do it if sufficiently pressed.)

-- wli
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang 2007-05-02 5:10 ` Willy Tarreau 2007-05-02 5:30 ` William Lee Irwin III @ 2007-05-02 10:05 ` Bill Huey 2007-05-02 10:27 ` Ingo Molnar ` (3 subsequent siblings) 6 siblings, 0 replies; 75+ messages in thread From: Bill Huey @ 2007-05-02 10:05 UTC (permalink / raw) To: Ting Yang Cc: Ingo Molnar, linux-kernel, Con Kolivas, William Lee Irwin III, Bill Huey (hui) On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: > Based on my understanding, adopting something like EEVDF in CFS should > not be very difficult given their similarities, although I do not have > any idea on how this impacts the load balancing for SMP. Does this worth > a try? > > Sorry for such a long email :-) An excellent long email. Thanks. Have you looked at Con's SD ? What did you think about it analytically and do you think that these ideas could be incorporated into that scheduler ? bill ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (2 preceding siblings ...) 2007-05-02 10:05 ` Bill Huey @ 2007-05-02 10:27 ` Ingo Molnar 2007-05-02 17:36 ` Srivatsa Vaddagiri ` (2 subsequent siblings) 6 siblings, 0 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 10:27 UTC (permalink / raw) To: Ting Yang; +Cc: linux-kernel * Ting Yang <tingy@cs.umass.edu> wrote: > My name is Ting Yang, a graduate student from UMASS. I am currently > studying the linux scheduler and virtual memory manager to solve some > page swapping problems. I am very excited with the new scheduler CFS. > After I read through your code, I think that you might be interested > in reading this paper: thanks for your detailed analysis - it was very interesting! > Based on my understanding, adopting something like EEVDF in CFS > should not be very difficult given their similarities, although I do > not have any idea on how this impacts the load balancing for SMP. Does > this worth a try? It would definitely be interesting to try! I don't think it should negatively impact load balancing on SMP. The current fork-time behavior of CFS is really just a first-approximation thing, and what you propose seems to make more sense to me too because it preserves the fluidity of fairness. (I'd probably apply your patch even if there was no directly measurable impact on workloads, because the more natural approaches tend to be more maintainable in the long run.) So by all means, please feel free to do a patch for this. > Sorry for such a long email :-) it made a lot of sense and was very useful :-) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (3 preceding siblings ...) 2007-05-02 10:27 ` Ingo Molnar @ 2007-05-02 17:36 ` Srivatsa Vaddagiri 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 23:41 ` Ting Yang 2007-05-02 18:42 ` Li, Tong N 2007-05-03 8:50 ` Ingo Molnar 6 siblings, 2 replies; 75+ messages in thread From: Srivatsa Vaddagiri @ 2007-05-02 17:36 UTC (permalink / raw) To: Ting Yang; +Cc: Ingo Molnar, linux-kernel On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: > "A Proportional Share REsource Allocation Algorithm for Real-Time, > Time-Shared Systems", by Ion Stoica. You can find the paper here: > http://citeseer.ist.psu.edu/37752.html Good paper ..thanks for the pointer. I briefly went thr' the paper and my impression is it expects each task to specify the length of each new request it initiates. Is that correct? If we have to apply EEVDF to SCHED_NORMAL task scheduling under CFS, how would we calculate that "length of each new request" (which is required before we calculate its virtual deadline)? > EXAMPLE: assume the system runs at 1000 tick/second, i.e. 1ms a tick, > and the granularity of pre-exemption for CFS is 5 virtual ticks (the > current setting). If, at time t=0, we start 2 tasks: p1 and p2, both > have nice value 0 (weight 1024), and rq->fair_clock is initialized to 0. > Now we have: > p1->fair_key = p2->fair_key = rq->fair_clock = 0. > CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick > (1ms later) t=1, we have: > rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0. > Suppose, a new task p3 starts with nice value -10 at this moment, that > is p3->fair_key=1/2. In this case, CFS will not schedule p3 for > execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which > translates to about 10ms later in this setting), _regardless_ the > priority (weight) of p3. There is also p->wait_runtime which is taken into account when calculating p->fair_key. 
So if p3 had been waiting in the runqueue for long before, it can get to run quicker than 10ms later. -- Regards, vatsa ^ permalink raw reply [flat|nested] 75+ messages in thread
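The tick arithmetic in the quoted example can be checked with a toy model of the fair clock (weights normalized so that nice-0 is 1.0; this is a sketch of the accounting, not the kernel's fixed-point code):

```python
# Toy model of the fair_clock / fair_key arithmetic from the example:
# rq->fair_clock advances by dt / (sum of weights), while the running
# task's fair_key advances by dt / (its own weight).
def tick(fair_clock, weights, keys, running, dt=1.0):
    fair_clock += dt / sum(weights.values())  # rq->fair_clock advance
    keys[running] += dt / weights[running]    # running task's key advance
    return fair_clock

weights = {"p1": 1.0, "p2": 1.0}   # two nice-0 tasks
keys = {"p1": 0.0, "p2": 0.0}
fair_clock = 0.0

fair_clock = tick(fair_clock, weights, keys, running="p1")  # one 1 ms tick
# matches the example: rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0

# p3 now starts with nice -10 (a much higher weight), but its initial key
# is just the current fair_clock -- the key carries no weight information:
keys["p3"] = fair_clock
```

This reproduces the numbers quoted above and illustrates Ting's point that a new task's fair_key is independent of its weight.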
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:36 ` Srivatsa Vaddagiri @ 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 18:15 ` Ingo Molnar ` (2 more replies) 2007-05-02 23:41 ` Ting Yang 1 sibling, 3 replies; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 17:48 UTC (permalink / raw) To: Srivatsa Vaddagiri; +Cc: Ting Yang, Ingo Molnar, linux-kernel On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: >> "A Proportional Share REsource Allocation Algorithm for Real-Time, >> Time-Shared Systems", by Ion Stoica. You can find the paper here: >> http://citeseer.ist.psu.edu/37752.html On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote: > Good paper ..thanks for the pointer. > I briefly went thr' the paper and my impression is it expect each task > to specify the length of each new request it initiates. Is that correct? > If we have to apply EEVDF to SCHED_NORMAL task scheduling under CFS, how > would we calculate that "length of each new request" (which is reqd > before we calculate its virtual deadline)? l_i and w_i are both functions of the priority. You essentially arrange l_i to express QoS wrt. latency, and w_i to express QoS wrt. bandwidth. On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: >> EXAMPLE: assume the system runs at 1000 tick/second, i.e. 1ms a tick, >> and the granularity of pre-exemption for CFS is 5 virtual ticks (the >> current setting). If, at time t=0, we start 2 tasks: p1 and p2, both >> have nice value 0 (weight 1024), and rq->fair_clock is initialized to 0. >> Now we have: >> p1->fair_key = p2->fair_key = rq->fair_clock = 0. >> CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick >> (1ms later) t=1, we have: >> rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0. >> Suppose, a new task p3 starts with nice value -10 at this moment, that >> is p3->fair_key=1/2. 
In this case, CFS will not schedule p3 for >> execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which >> translates to about 10ms later in this setting), _regardless_ the >> priority (weight) of p3. On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote: > There is also p->wait_runtime which is taken into account when > calculating p->fair_key. So if p3 had waiting in runqueue for long > before, it can get to run quicker than 10ms later. Virtual time is time from the task's point of view, which it has spent executing. ->wait_runtime is a device to subtract out time spent on the runqueue but not running from what would otherwise be virtual time to express lag, whether deliberately or coincidentally. ->wait_runtime would not be useful for EEVDF AFAICT, though it may be interesting to report. -- wli ^ permalink raw reply [flat|nested] 75+ messages in thread
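The l_i/w_i split described above (w_i for bandwidth QoS, l_i for latency QoS) can be illustrated with a hypothetical nice-to-parameters mapping; the constants below are invented for illustration and are not taken from any kernel:

```python
# Hypothetical nice -> (w_i, l_i) mapping in the spirit described above:
# weight w_i (bandwidth QoS) grows as nice decreases, while the
# timeslice l_i (latency QoS) shrinks, so high-priority tasks get more
# CPU in smaller, more frequent chunks.
def qos_params(nice: int):
    w = 1.25 ** (-nice)      # ~25% more bandwidth per nice level (invented)
    l = 4.0 * 1.25 ** nice   # ms; smaller slice for higher priority (invented)
    return w, l

params = {n: qos_params(n) for n in (-10, 0, 10)}
# weight decreases monotonically with nice, timeslice increases with nice
```

The exact constants don't matter here; the point is that the two QoS axes can be tuned independently through the priority mapping.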
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:48 ` William Lee Irwin III @ 2007-05-02 18:15 ` Ingo Molnar 2007-05-02 18:56 ` William Lee Irwin III 2007-05-03 2:48 ` Ting Yang 2007-05-03 3:18 ` Ting Yang 2 siblings, 1 reply; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 18:15 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel * William Lee Irwin III <wli@holomorphy.com> wrote: > > There is also p->wait_runtime which is taken into account when > > calculating p->fair_key. So if p3 had waiting in runqueue for long > > before, it can get to run quicker than 10ms later. > > Virtual time is time from the task's point of view, which it has spent > executing. ->wait_runtime is a device to subtract out time spent on > the runqueue but not running from what would otherwise be virtual time > to express lag, whether deliberately or coincidentally. [...] CFS is in fact _built around_ the ->wait_runtime metric (which, as its name suggests already, expresses the precise lag a task observes relative to 'ideal' fair execution), so what exactly makes you suspect that this property of the ->wait_runtime metric might be 'coincidental'? ;-) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:15 ` Ingo Molnar @ 2007-05-02 18:56 ` William Lee Irwin III 2007-05-02 19:12 ` Ingo Molnar 0 siblings, 1 reply; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 18:56 UTC (permalink / raw) To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel * William Lee Irwin III <wli@holomorphy.com> wrote: >> Virtual time is time from the task's point of view, which it has spent >> executing. ->wait_runtime is a device to subtract out time spent on >> the runqueue but not running from what would otherwise be virtual time >> to express lag, whether deliberately or coincidentally. [...] On Wed, May 02, 2007 at 08:15:33PM +0200, Ingo Molnar wrote: > CFS is in fact _built around_ the ->wait_runtime metric (which, as its > name suggests already, expresses the precise lag a task observes > relative to 'ideal' fair execution), so what exactly makes you suspect > that this property of the ->wait_runtime metric might be 'coincidental'? > ;-) The coincidental aspect would be that at the time it was written, the formal notion of lag was not being used particularly with respect to priorities and load weights. The documentation doesn't describe it in those terms, and I can't read minds, so I refrained from guessing. Things are moving in good directions on all this as far as I'm concerned. Moving according to Ting Yang's analysis should wrap up the soundness concerns about intra-queue policy I've had. OTOH load balancing I know much less about (not that I was ever any sort of an expert on single queue affairs). That remains a very deep concern, as load balancing is where most of the enterprise performance improvements and degradations occur. -- wli ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:56 ` William Lee Irwin III @ 2007-05-02 19:12 ` Ingo Molnar 2007-05-02 19:42 ` William Lee Irwin III 0 siblings, 1 reply; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 19:12 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel * William Lee Irwin III <wli@holomorphy.com> wrote: > The coincidental aspect would be that at the time it was written, the > formal notion of lag was not being used particularly with respect to > priorities and load weights. [...] (nice levels for SCHED_OTHER are 'just' an add-on concept to the core of CFS, in fact i had two wildly different approaches that both did the trick for users, so i fail to see the relevance of priorities to the core concept of measuring how much a task is waiting to get on the runqueue via the 'fair clock' ... but lets move on.) > Things are moving in good directions on all this as far as I'm > concerned. Moving according to Ting Yang's analysis should wrap up the > soundness concerns about intra-queue policy I've had. OTOH load > balancing I know much less about (not that I was ever any sort of an > expert on single queue affairs). [...] the whole move to ->load_weight based calculations was to make CFS integrate better with load-balancing and to bring the smpnice infrastructure even more into the scheduler mainstream. [ There's a small smpnice related buglet i fixed in -v9-to-be (based on Balbir Singh's feedback), but otherwise it behaves quite well on SMP and that's not a big surprise: i left the load-balancer largely intact. ] Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 19:12 ` Ingo Molnar @ 2007-05-02 19:42 ` William Lee Irwin III 0 siblings, 0 replies; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 19:42 UTC (permalink / raw) To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel, Tong N. Li * William Lee Irwin III <wli@holomorphy.com> wrote: >> Things are moving in good directions on all this as far as I'm >> concerned. Moving according to Ting Yang's analysis should wrap up the >> soundness concerns about intra-queue policy I've had. OTOH load >> balancing I know much less about (not that I was ever any sort of an >> expert on single queue affairs). [...] On Wed, May 02, 2007 at 09:12:35PM +0200, Ingo Molnar wrote: > the whole move to ->load_weight based calculations was to make CFS > integrate better with load-balancing and to bring the smpnice > infrastructure even more into the scheduler mainstream. [ There's a > small smpnice related buglet i fixed in -v9-to-be (based on Balbir > Singh's feedback), but otherwise it behaves quite well on SMP and that's > not a big surprise: i left the load-balancer largely intact. ] Despite the original motive, the ->load_weight -based calculations largely resolved my concerns about intra-queue prioritization. They were the change to the ->fair_key computation I wanted to see, which were still not enough because of the O(rq->nr_running) lag issue due to the differences from EEVDF, but I wasn't entirely aware of that, only suspicious that some issue remained. Load balancing has non-negligible impacts on fairness but I'm almost entirely ignorant of the aspects of the theory relating to it. More knowledgeable people, e.g. Tong Li and Ting Yang, need to take over reviewing here much as they did on the intra-queue front, where I only knew enough to drop hints and not to make more useful suggestions. O(1) vs. 
O(lg(n)) priority queues are not what matter here (well, obviously O(n) priority queues would break), but rather O(1) vs. O(n) lag. -- wli ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 18:15 ` Ingo Molnar @ 2007-05-03 2:48 ` Ting Yang 2007-05-03 3:18 ` Ting Yang 2 siblings, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 2:48 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ingo Molnar, linux-kernel Hi, As encouraged by some of you, I have started implementing EEVDF. However, I am quite new in this area, and may not be experienced enough to get it through quickly. The main problem I am facing now is how to treat the semantics of yield() and yield_to(). I probably will throw a lot of questions along the way of my implementation. Also I found my previous email was not clear enough in describing the properties of CFS and EEVDF and caused some confusion, and there were some mistakes too. In this email, I will try to make up for that. *** Let's start from CFS: For simplicity, let's assume that CFS preempts the current task p1 with another task p2, when p1->key - p2->key > 1, and the virtual time rq->fair_clock is initialized to be 0. Suppose, at time t = 0, we start n+1 tasks that run long enough. Task p1 has weight n and all other tasks have weight 1. It is clear that, at time t=0, p_1->key = p_2->key = ... = p_(n+1)->key = rq->fair_clock = 0 Since all tasks have the same key, CFS breaks the ties arbitrarily, which leads to many possibilities. Let's consider 2 of them: _Case One:_ p1, which has weight n, executes first: t = 1: rq->fair_clock = 1/2n, p1->key = 1/n // others are not changed. t = 2: rq->fair_clock = 2/2n, p1->key = 2/n ... t = n: rq->fair_clock = n/2n, p1->key = n/n = 1 Only after p1 executes n ticks will the scheduler pick another task for execution. Between time [0, n) the amount of actual work done by p1 is n. The amount of work that should be done in the ideal fluid-flow system is n * n/2n = n/2. Therefore the lag is n/2 - n = -n/2; a negative value means p1 runs ahead of the ideal case. As we can see this lag is O(n). 
_Case Two:_ the scheduler executes the tasks in the order p2, p3, ..., p_(n+1), p1 t = 1: rq->fair_clock = 1/2n, p2->key = 1; // others are not changed t = 2: rq->fair_clock = 2/2n, p3->key = 1; .... t = n: rq->fair_clock = n/2n, p_(n+1)->key = 1; Then the scheduler picks p1 (weight n) for execution. Between time [0, n) the amount of actual work done by p1 is 0, and the ideal amount is n/2. Therefore the lag is n/2 - 0 = n/2; a positive value means p1 falls behind the ideal case. The lag here for p1 is also O(n). As I said in the previous email, p->fair_key only carries information about the past execution of a task and reflects a fair start point. It carries no information about weight. *** Now, let's look at EEVDF. I have to say that I missed a very important concept in EEVDF which led to confusion here. EEVDF stands for Earliest _Eligible_ Virtual Deadline First, and I did not explain what _eligible_ means. EEVDF maintains a virtual start time ve_i and virtual deadline vd_i for each task p_i, as well as a virtual time vt. A newly started/woken task has its ve_i initialized to be the current virtual time. Once a timeslice l_i amount of work is done, the new virtual start time is set to be the previous virtual deadline, and then the virtual deadline vd_i is recalculated. A task is eligible if and only if ve_i <= current virtual time vt. EEVDF, at every tick, always picks the eligible task which has the earliest virtual deadline for execution. Let's see how it works using a similar example as for CFS above. Suppose, at time t = 0, we start n+1 tasks. p1 has weight n, and all others have weight 1. For simplicity, we assume all tasks use timeslice l_i = 1, and virtual time vt is initialized to be 0. - at time t = 0, we have vt = 0; ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 1/n ve_2 = 0, vd_2 = ve_2 + l_2/w_2 = 1 ... ve_(n+1) = 0, vd_(n+1) = ve_(n+1) + l_(n+1)/w_(n+1) = 1; Since p1 is eligible and has the earliest deadline 1/n, the scheduler will execute it first. 
(Here, the weight encoded in the deadline plays an important role, and allows higher weight tasks to be executed first.) - at time t = 1: vt = 1/2n, ve_1 = 1/n (previous vd_1), vd_1 = ve_1 + 1/n = 2/n Since ve_1 > vt, p1 is _not_ eligible. EEVDF picks another task for execution by breaking the tie, say it executes p2. - at time t = 2: vt = 2/2n = 1/n, ve_1 = 1/n, vd_1 = 2/n ve_2 = 1, vd_2 = ve_2 + 1/1 = 2 // this makes p2 not eligible Since vt = ve_1, p1 becomes eligible again and has the earliest deadline 2/n, so it will be scheduled for execution. As EEVDF repeats, it gives a schedule like p1, p2, p1, p3, p1, p4, p1, ... (one entry per tick). As you can see, now p1 never falls behind or runs ahead of the ideal case by more than 1. Now, let's check how timeslice l_i impacts the system. Suppose we change the timeslice of p1 from 1 to 2, and keep the others unchanged. EEVDF gives a schedule like: p1, p1, p2, p3, p1, p1, p4, p5, p1, p1, ... (one entry per tick) Similarly, if the timeslice of p1 is set to 3, EEVDF gives p1, p1, p1, p2, p3, p4, p1, p1, p1, ... (one entry per tick) As the timeslice of p1 increases, the system checks for reschedule less frequently, which makes the lag of p1 larger, but the lag will not exceed the maximum timeslice used, as long as that is a fixed constant. On the other hand, increasing the timeslice of other tasks has no effect on when p1 is scheduled. (You can try the algorithm yourself :-)) In CFS, a task has to increase p->fair_key by a certain amount before the scheduler considers preempting it. Higher weight leads to less progress in p->fair_key, and thus effectively a larger timeslice. Suppose the preempt granularity is 5 virtual ticks; then a task of weight 1 needs to run 5 ticks, weight 2 needs 10 ticks, weight 10 needs 50 ticks. The effect is that the timeslice increases linearly w.r.t weight, and causes the O(n) lag. 
In fact, if we use timeslice n for p1 in the above EEVDF example, it behaves exactly like CFS. Now, let me fix an incorrect statement about EEVDF's lag bound in my previous email: Under EEVDF, a task's lag (the difference between the amount of work that should be done in the ideal fluid-flow system and the actual amount of work done) is bounded by the maximum timeslice used (not occasionally, as I said in my previous email). This actually means the maximum timeslice used controls the total responsiveness of the system. Also, the combination of high weight and smaller timeslice will give a better response guarantee for those bursty response-time sensitive tasks. Sorry for the late reply. I had an eye exam today and got my eyes dilated, which forced me to stay away from my computer for a while :-) Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
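The EEVDF rules described above are simple enough to simulate directly. The sketch below uses n = 3 (p1 of weight 3 plus three weight-1 tasks, all with timeslice 1) and reproduces both the p1, p2, p1, p3, p1, p4 interleaving and the bounded lag; tie-breaking by list order is an arbitrary choice, which the algorithm permits:

```python
# Minimal EEVDF simulation: a task is eligible iff ve_i <= vt, and
# among eligible tasks the one with the earliest virtual deadline runs.
from fractions import Fraction as F

class Task:
    def __init__(self, name, weight, l=1):
        self.name, self.w, self.l = name, weight, l
        self.ve, self.vd = F(0), F(l, weight)   # ve_i = 0, vd_i = ve_i + l_i/w_i
        self.done = 0                           # ticks of actual work completed

    def account(self):
        self.done += 1
        self.ve = self.vd                       # next request starts at old deadline
        self.vd = self.ve + F(self.l, self.w)

tasks = [Task("p1", 3), Task("p2", 1), Task("p3", 1), Task("p4", 1)]
W = sum(t.w for t in tasks)                     # total weight = 6
vt = F(0)
schedule, max_lag = [], F(0)

for t in range(6):
    eligible = [x for x in tasks if x.ve <= vt]
    cur = min(eligible, key=lambda x: x.vd)     # min() is stable: list order breaks ties
    schedule.append(cur.name)
    cur.account()
    vt += F(1, W)                               # virtual time: one tick / total weight
    # lag of p1 = ideal fluid-flow work minus actual work, after t+1 ticks
    ideal = F(tasks[0].w, W) * (t + 1)
    max_lag = max(max_lag, abs(ideal - tasks[0].done))

# p1 interleaves with the weight-1 tasks instead of running in one burst,
# and its lag never exceeds the timeslice (here it peaks at 1/2).
```

Changing p1's timeslice to n in this model collapses the interleaving back into the CFS-like burst described above.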
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 18:15 ` Ingo Molnar 2007-05-03 2:48 ` Ting Yang @ 2007-05-03 3:18 ` Ting Yang 2007-05-03 10:19 ` Bill Huey 2 siblings, 1 reply; 75+ messages in thread From: Ting Yang @ 2007-05-03 3:18 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ingo Molnar, linux-kernel > On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote: > >> There is also p->wait_runtime which is taken into account when >> calculating p->fair_key. So if p3 had waiting in runqueue for long >> before, it can get to run quicker than 10ms later. >> > > Virtual time is time from the task's point of view, which it has spent > executing. ->wait_runtime is a device to subtract out time spent on the > runqueue but not running from what would otherwise be virtual time to > express lag, whether deliberately or coincidentally. ->wait_runtime > would not be useful for EEVDF AFAICT, though it may be interesting to > report. I just want to point out that ->wait_runtime, in fact, stores the lag of each task in CFS, except that it is also used by other things, and occasionally tweaked (heuristically ?). Under normal circumstances the sum of the lags of all active tasks in such a system should be a constant 0. The lag information is equally important to EEVDF, when some tasks leave the system (become inactive) carrying a certain amount of lag. The key point here is that we have to spread the lag (either negative or positive) to all remaining tasks, so that the fairness of the system is preserved. I think the CFS implementation does not seem to handle this properly. I am running out of time today :-( I will write an email about CFS -v8 tomorrow, describing 2 issues in CFS I found related to this. Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
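The lag bookkeeping argued above can be put into a small sketch: the lags of active tasks sum to zero, and when a task leaves carrying lag, spreading that lag over the remaining tasks in proportion to their weights preserves the invariant. This is purely illustrative, not CFS code:

```python
# Illustration of the zero-sum lag invariant: when a task leaves the
# system carrying lag, the carried lag is redistributed to the
# remaining tasks in proportion to their weights, keeping sum(lag) == 0.
from fractions import Fraction as F

def redistribute(lags, weights, leaver):
    carried = lags.pop(leaver)
    remaining = {k: w for k, w in weights.items() if k != leaver}
    total_w = sum(remaining.values())
    for k, w in remaining.items():
        lags[k] += carried * F(w, total_w)   # weight-proportional share
    return lags

lags    = {"p1": F(-1, 2), "p2": F(1, 4), "p3": F(1, 4)}  # sums to 0
weights = {"p1": 2, "p2": 1, "p3": 1}
lags = redistribute(lags, weights, "p1")     # p1 exits with negative lag
```

If the carried lag were simply dropped instead, the remaining tasks' lags would no longer sum to zero and total fairness would drift, which is the failure mode described above.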
* Re: [patch] CFS scheduler, -v8 2007-05-03 3:18 ` Ting Yang @ 2007-05-03 10:19 ` Bill Huey 0 siblings, 0 replies; 75+ messages in thread From: Bill Huey @ 2007-05-03 10:19 UTC (permalink / raw) To: Ting Yang Cc: William Lee Irwin III, Srivatsa Vaddagiri, Ingo Molnar, linux-kernel, Bill Huey (hui) On Wed, May 02, 2007 at 11:18:45PM -0400, Ting Yang wrote: > I just want to point out that ->wait_runtime, in fact, stores the lag of > each task in CFS, except that it is also used by other things, and > occasionally tweaked (heuristically ?). Under normal cases the sum of > lags of all active tasks in such a system, should be a constant 0. The > lag information is equally important to EEVDF, when some tasks leave the > system (becomes inactive) carrying certain amount of lag. The key point > here is that we have to spread the lag (either negative or positive) to > all remaining task, so that the fairness of the system is preserved. I > thinks CFS implementation does not seems to handle this properly. > > I am running out time today :-( I will write an email about CFS -v8 > tomorrow, describing 2 issues in CFS I found related to this. Interesting. I haven't looked at the code carefully but that wouldn't surprise me if this was the case and it led to odd corner cases. I'm eagerly awaiting your analysis and explanation. bill ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:36 ` Srivatsa Vaddagiri 2007-05-02 17:48 ` William Lee Irwin III @ 2007-05-02 23:41 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-02 23:41 UTC (permalink / raw) To: vatsa; +Cc: Ingo Molnar, linux-kernel Srivatsa Vaddagiri wrote: > I briefly went thr' the paper and my impression is it expect each task > to specify the length of each new request it initiates. Is that correct? > No, the timeslice l_i here serves as a granularity control w.r.t responsiveness (or latency, depending on how you interpret it). As wli said, it can be expressed as a function of the priority, as we do for weight now. It is not related to the length of each new request. A request may be 1 second long, but the scheduler may still process it using a 10ms timeslice. A smaller timeslice leads to more accuracy, i.e. closer to the ideal case. However, the maximum timeslice l_i used by all active tasks determines the total responsiveness of the system, which I will explain in detail later. > There is also p->wait_runtime which is taken into account when > calculating p->fair_key. So if p3 had waiting in runqueue for long > before, it can get to run quicker than 10ms later. Consider if p3 is a newly started task or a woken-up task that carries no p->wait_runtime. Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (4 preceding siblings ...) 2007-05-02 17:36 ` Srivatsa Vaddagiri @ 2007-05-02 18:42 ` Li, Tong N 2007-05-02 19:10 ` William Lee Irwin III 2007-05-03 3:07 ` Ting Yang 2007-05-03 8:50 ` Ingo Molnar 6 siblings, 2 replies; 75+ messages in thread From: Li, Tong N @ 2007-05-02 18:42 UTC (permalink / raw) To: tingy; +Cc: Ingo Molnar, linux-kernel > Based on my understanding, adopting something like EEVDF in CFS should > not be very difficult given their similarities, although I do not have > any idea on how this impacts the load balancing for SMP. Does this worth > a try? > > Sorry for such a long email :-) Thanks for the excellent explanation. I think EEVDF and many algs alike assume global ordering of all tasks in the system (based on virtual time), whereas CFS does so locally on each processor and relies on load balancing to achieve fairness across processors. It'd achieve strong fairness locally, but I'm not sure about its global fairness properties in an MP environment. If ideally the total load weight on each processor is always the same, then local fairness would imply global fairness, but this is a bin packing problem and is intractable ... tong ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:42 ` Li, Tong N @ 2007-05-02 19:10 ` William Lee Irwin III 2007-05-03 3:07 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 19:10 UTC (permalink / raw) To: Li, Tong N; +Cc: tingy, Ingo Molnar, linux-kernel At some point in the past, Ting Yang wrote: >> Based on my understanding, adopting something like EEVDF in CFS should >> not be very difficult given their similarities, although I do not have >> any idea on how this impacts the load balancing for SMP. Does this worth >> a try? >> Sorry for such a long email :-) On Wed, May 02, 2007 at 11:42:20AM -0700, Li, Tong N wrote: > Thanks for the excellent explanation. I think EEVDF and many algs alike > assume global ordering of all tasks in the system (based on virtual > time), whereas CFS does so locally on each processor and relies on load > balancing to achieve fairness across processors. It'd achieve strong > fairness locally, but I'm not sure about its global fairness properties > in an MP environment. If ideally the total load weight on each processor > is always the same, then local fairness would imply global fairness, but > this is a bin packing problem and is intractable ... It's sort of obvious how to approximate it, but not entirely obvious whether a given approximation actually suffices. More help with the theoretical aspects is needed. - wli ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:42 ` Li, Tong N 2007-05-02 19:10 ` William Lee Irwin III @ 2007-05-03 3:07 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 3:07 UTC (permalink / raw) To: Li, Tong N; +Cc: Ingo Molnar, linux-kernel, William Lee Irwin III Li, Tong N wrote: > Thanks for the excellent explanation. I think EEVDF and many algs alike > assume global ordering of all tasks in the system (based on virtual > time), whereas CFS does so locally on each processor and relies on load > balancing to achieve fairness across processors. It'd achieve strong > fairness locally, but I'm not sure about its global fairness properties > in an MP environment. If ideally the total load weight on each processor > is always the same, then local fairness would imply global fairness, but > this is a bin packing problem and is intractable ... First, I am not assuming a global ordering of all tasks. As in the current implementation, EEVDF should maintain virtual time locally for each CPU. EEVDF is a proportional time share scheduler, therefore the relative weight and actual cpu share for each task vary when tasks join and leave. There is no bin-packing problem for such systems. I understand that the bin-packing problem does exist in the real-time world. Suppose a system has 2 cpus and 3 tasks, all of which need to finish 30ms of work within a window of 50ms. Any 2 of them together exceed the bandwidth of one cpu. There is a bin-packing problem, unless the system is clever enough to break one of them down into 2 requests of 15ms/25ms, and execute them on different cpus at different times without overlap, which is quite difficult :-) In the proportional world, weights and cpu shares are scaled to fit the bandwidth of a cpu. Therefore putting 2 of them on one cpu is fine, and the fairness for each cpu is preserved. 
On the other hand, moving one task back and forth between 2 cpus does give better throughput and better global fairness. I have not dug into the SMP load balancing algorithms yet, so I leave that aside for now; first things first :-) Thanks! Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
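The real-time bin-packing example above can be checked in numbers: three tasks each needing 30 ms of work per 50 ms window, on 2 CPUs. Each task fits a CPU and the total demand fits the machine, yet no assignment of whole tasks works:

```python
# Numeric check of the 2-CPU / 3-task example: per-task utilization is
# 30/50 = 0.6, so total demand (1.8) fits 2 CPUs and each task fits one
# CPU, but any two whole tasks on the same CPU need 1.2 CPUs -- the
# bin-packing failure described above.
from itertools import combinations

util = [30 / 50] * 3                       # 0.6 each
fits_machine = sum(util) <= 2.0            # total demand fits 2 CPUs
fits_one_cpu = all(u <= 1.0 for u in util) # each task fits one CPU
# ... yet every pairing of two whole tasks overloads a single CPU:
overloaded = all(a + b > 1.0 for a, b in combinations(util, 2))
# Proportional share sidesteps this: two 0.6-weight tasks on one CPU
# simply get scaled down to 0.5 each -- per-CPU fairness is preserved,
# though the 50 ms deadlines are not.
```

This is exactly the distinction drawn above between real-time guarantees (where bin packing bites) and proportional sharing (where shares scale instead).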
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (5 preceding siblings ...) 2007-05-02 18:42 ` Li, Tong N @ 2007-05-03 8:50 ` Ingo Molnar 2007-05-03 14:26 ` Srivatsa Vaddagiri 2007-05-03 15:02 ` Ting Yang 6 siblings, 2 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-03 8:50 UTC (permalink / raw) To: Ting Yang; +Cc: linux-kernel * Ting Yang <tingy@cs.umass.edu> wrote: > Authors of this paper proposed a scheduler: Earliest Eligible Virtual > Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to > track the execution of each running task. The only difference between > EEVDF and CFS is that EEVDF tries to be _deadline_ fair while CFS is > _start-time_ fair. [...] Well, this is a difference but note that it's far from being the 'only difference' between CFS and EEVDF: - in CFS you have to "earn" your right to be on the CPU, while EEVDF gives out timeslices (quanta) - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they know the length of work units - while CFS does not need any knowledge about 'future work', it measures 'past behavior' and makes its decisions based on that. So CFS is purely 'history-based'. - thus in CFS there's no concept of "deadline" either (the 'D' from EEVDF). - EEVDF seems to be calculating timeslices in units of milliseconds, while CFS follows a very strict 'precise' accounting scheme on the nanoseconds scale. - the EEVDF paper is also silent on SMP issues. - it seems EEVDF never existed as a kernel scheduler, it was a user-space prototype under FreeBSD with simulated workloads. (have they released that code at all?). 
The main common ground seems to be that both CFS and EEVDF share the view that the central metric is 'virtual time' proportional to the load of the CPU (called the 'fair clock' in CFS) - but even for this the details of the actual mechanism differ: EEVDF uses 1/N while CFS (since -v8) uses a precise, smoothed and weighted load average that is close to (and reuses portions of) Peter Williams's load metric used in smp-nice. The EEVDF mechanism could perhaps be more appropriate for real-time systems (the main target of their paper), while the CFS one i believe is more appropriate for general purpose workloads. So i'd say there's more in common between SD and CFS than between EEVDF and CFS. So ... it would certainly be interesting to try the EEVDF paper based on CFS (or whatever way you'd like to try it) and turn the EEVDF paper into a real kernel scheduler - the two mechanisms are quite dissimilar and they could behave wildly differently on various workloads. Depending on test results we could use bits of EEVDF's approaches in CFS too, if it manages to out-schedule CFS :-) (your observation about CFS's fork handling is correct nevertheless!) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
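For readers comparing the two designs, the EEVDF selection rule itself is compact. Below is a minimal userspace sketch with an invented task record (the paper's virtual deadline is vd = ve + r/w for timeslice r and weight w); this is an illustration of the paper's rule, not CFS or kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record for illustrating EEVDF; the struct and field
 * names are made up.  ve is the virtual time at which the task becomes
 * eligible, vd its virtual deadline (ve + timeslice/weight). */
struct eevdf_task {
	double ve;	/* eligible virtual time */
	double vd;	/* virtual deadline */
};

/* Earliest Eligible Virtual Deadline First: among the tasks that are
 * already eligible at virtual time V, pick the one with the earliest
 * virtual deadline.  Returns NULL if nothing is eligible yet. */
static struct eevdf_task *pick_eevdf(struct eevdf_task *t, int n, double V)
{
	struct eevdf_task *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (t[i].ve > V)	/* not yet eligible */
			continue;
		if (!best || t[i].vd < best->vd)
			best = &t[i];
	}
	return best;
}
```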
* Re: [patch] CFS scheduler, -v8 2007-05-03 8:50 ` Ingo Molnar @ 2007-05-03 14:26 ` Srivatsa Vaddagiri 2007-05-03 15:19 ` Ting Yang 2007-05-03 15:02 ` Ting Yang 1 sibling, 1 reply; 75+ messages in thread From: Srivatsa Vaddagiri @ 2007-05-03 14:26 UTC (permalink / raw) To: Ingo Molnar; +Cc: Ting Yang, linux-kernel On Thu, May 03, 2007 at 10:50:15AM +0200, Ingo Molnar wrote: > - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they > know the length of work units This is what I was thinking when I wrote earlier that EEVDF expects each task to specify the "length of each new request" (http://lkml.org/lkml/2007/5/2/339). The other observation that I have of EEVDF is that it tries to be fair in the virtual time scale (each client will get 'wi' real units in 1 virtual unit), whereas sometimes fairness in the real-time scale also matters. For ex: a multi-media app would call the scheduler fair if it recvs at least 1 ms of cpu time every 10 *real* milliseconds (frame-time). A rogue user (or workload) that does a fork bomb may skew this fairness for that multi-media app in the real-time scale under EEVDF? -- Regards, vatsa ^ permalink raw reply [flat|nested] 75+ messages in thread
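The fork-bomb concern above reduces to simple weight arithmetic: under proportional sharing a task's real-time share is w_i / Σw, so anything that inflates Σw dilutes a fixed real-time requirement. A toy illustration with invented numbers:

```c
#include <assert.h>

/* A weight-proportional scheduler gives task i the real-time fraction
 * w_i / total_weight.  E.g. a media task needing 1ms per 10ms frame
 * needs a 0.10 share; a fork bomb that inflates total_weight pushes
 * its share below that, even though virtual-time fairness still holds. */
static double real_share(double wi, double total_weight)
{
	return wi / total_weight;
}
```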
* Re: [patch] CFS scheduler, -v8 2007-05-03 14:26 ` Srivatsa Vaddagiri @ 2007-05-03 15:19 ` Ting Yang 0 siblings, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 15:19 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Srivatsa Vaddagiri wrote: > On Thu, May 03, 2007 at 10:50:15AM +0200, Ingo Molnar wrote: > >> - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they >> know the length of work units >> > > This is what I was thinking when I wrote earlier that EEVDF expects each > task will specify "length of each new request" > (http://lkml.org/lkml/2007/5/2/339). >
This is not quite true based on my understanding of EEVDF; please see the email I just sent to Ingo for an explanation. > The other observation that I have of EEVDF is that it tries to be fair > in the virtual time scale (each client will get 'wi' real units in 1 > virtual unit), whereas sometimes fairness in real-time scale also > matters? > For ex: a multi-media app would call scheduler fair to it, it it recvs > atleast 1 ms cpu time every 10 *real* milleseconds (frame-time). A rogue > user (or workload) that does a fork bomb may skew this fairness for that > multi-media app in real-time scale under EEVDF? >
First of all, CFS does not seem to address this issue either. This is a typical real-time or soft real-time question: not only does the bandwidth of a task have to be fixed, i.e. 10% of cpu bandwidth (which proportional-share systems like CFS and EEVDF do not do), but the work also needs to satisfy a deadline. In both CFS and EEVDF, the scheduler would have to keep tweaking weights to give an application a fixed bandwidth. The authors of EEVDF claim this can be done, but it was never implemented :-( Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-03 8:50 ` Ingo Molnar 2007-05-03 14:26 ` Srivatsa Vaddagiri @ 2007-05-03 15:02 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 15:02 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar wrote: > * Ting Yang <tingy@cs.umass.edu> wrote: > > >> Authors of this paper proposed a scheduler: Earliest Eligible Virtual >> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to >> track the execution of each running task. The only difference between >> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is >> _start-time_ fair. [...] >> > > Well, this is a difference but note that it's far from being the 'only > difference' between CFS and EEVDF: > >
I re-ordered the comments a little bit to make my discussion easier: > - in CFS you have to "earn" your right to be on the CPU, while EEVDF > gives out timeslices (quanta) > > - thus in CFS there's no concept of "deadline" either (the 'D' from > EEVDF). >
These two are needed to implement the deadline-fair policy, and are therefore covered by the difference I mentioned. Because the authors use different terms, the paper can cause some confusion; 3 important concepts are needed here: - the scheduling quantum *q* in the paper: it refers to the minimum unit in which a scheduler can schedule tasks. With 1000 system ticks per second, q is 1ms. With 100 system ticks per second (old linux), q is 10ms. - the length of a request submitted by a task: it refers to the amount of work needed to process a request. EEVDF does _not_ need this information for scheduling at all. - the timeslice used to process requests, denoted as *r* in the paper: it refers to the unit that the scheduler uses to process a request. If the length of a request is larger than the timeslice r, the request is effectively divided into pieces of length r. The scheduler can use a suitable timeslice at will to handle a process.
Using the timeslice, it can estimate the deadline based on the weight of a task, which tells how urgent the task is, and therefore removes the potential O(n) (but it comes with a cost: more context switches). > - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they > know the length of work units That is _not_ really the case, although the authors of this paper used a quite aggressive title and claimed so. Look carefully at Theorem 3 and its Corollary on pg 16; it says: if no request of client k is larger than a time quantum, then at any time t the lag is bounded by -q < lag_k(t) < q It gives three facts: 1. For the system to be hard real-time, the request length (note: not the quantum q or the timeslice r) of any task at any time must be less than the scheduler quantum, which in modern Linux is 1ms. This makes EEVDF effectively useless for hard real-time systems, where applications cannot miss deadlines. 2. For the system to be soft real-time: EEVDF first needs to fix the bandwidth of a task, so that it does not vary when tasks join and leave. Then EEVDF can give a soft real-time guarantee with a bounded delay of max(timeslice r). The authors mentioned this, but unfortunately the feature was never actually implemented. 3. EEVDF does give a proportional time-share system in which no task running on it is delayed longer than max(timeslice r) w.r.t. the ideal fluid-flow system. The authors of this paper only developed a framework that could possibly be used by real-time systems (either hard or soft); their main attempt was to glue real-time and proportional time-share together. The real-time part of this paper does not really ring a bell for me. However, isn't fact 3 listed above exactly what we need? > - while CFS does not need any knowledge > about 'future work', it measures 'past behavior' and makes its > decisions based on that. So CFS is purely 'history-based'. 
>
As I explained above, EEVDF does not need any future knowledge; instead it uses a unit (the timeslice) to process requests from tasks, predicting the deadline by which this unit has to be finished. There is a firm theoretical basis behind this "guess", which did not appear in the paper: as long as the total bandwidth needed by all active tasks does not exceed the bandwidth of the CPU, the predicted virtual deadline, when mapped back to a real time t in the system, is exactly the deadline for the chosen unit in the ideal system. Since EEVDF is proportional-share, the bandwidth of each task is scaled based on its weight, and the sum of them does not exceed the cpu capacity. Therefore EEVDF's "guess" is always right :-) Predicting the future with the weight taken into account makes it choose better tasks to execute. A smaller timeslice makes it stick closer to the ideal system, at the cost of more context switches. > - EEVDF seems to be calculating timeslices in units of milliseconds, > while CFS follows a very strict 'precise' accounting scheme on the > nanoseconds scale. >
This is just an implementation issue, right? It is possible to implement EEVDF at a finer granularity. > - the EEVDF paper is also silent on SMP issues. >
Yes, you are right. I do not have much knowledge about load balancing. > - it seems EEVDF never existed as a kernel scheduler, it was a > user-space prototype under FreeBSD with simulated workloads. (have > they released that code at all?). >
They never released the code (quite typical behavior of systems researchers :-)), and that is why I asked if it was worth a try. 
> The main common ground seems to be that both CFS and EEVDF share the > view that the central metric is 'virtual time' proportional to the load > of the CPU (called the 'fair clock' in CFS) - but even for this the > details of the actual mechanism differ: EEVDF uses 1/N while CFS (since > -v8) uses a precise, smoothed and weighted load average that is close to > (and reuses portions of) Peter Williams's load metric used in smp-nice. > > Thanks a lot for giving your opinions, very nice discussing with you. Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-01 21:22 Ingo Molnar 2007-05-02 2:57 ` Ting Yang @ 2007-05-02 6:37 ` Mike Galbraith 2007-05-02 6:45 ` Ingo Molnar 2007-05-02 8:03 ` Gene Heskett 2007-05-02 7:59 ` Mike Galbraith ` (3 subsequent siblings) 5 siblings, 2 replies; 75+ messages in thread From: Mike Galbraith @ 2007-05-02 6:37 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: > i'm pleased to announce release -v8 of the CFS scheduler patchset. (The > main goal of CFS is to implement "desktop scheduling" with as high > quality as technically possible.) ... > As usual, any sort of feedback, bugreport, fix and suggestion is more > than welcome, Greetings, I noticed a (harmless) bounds warning triggered by the reduction in size of array->bitmap. Patchlet below. -Mike

  CC      kernel/sched.o
kernel/sched_rt.c: In function ‘load_balance_start_rt’:
include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds
kernel/sched_rt.c: In function ‘pick_next_task_rt’:
include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds

--- linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h.org	2007-05-02 07:16:47.000000000 +0200
+++ linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h	2007-05-02 07:20:45.000000000 +0200
@@ -6,28 +6,23 @@
 
 /*
  * Every architecture must define this function. It's the fastest
- * way of searching a 140-bit bitmap where the first 100 bits are
- * unlikely to be set. It's guaranteed that at least one of the 140
- * bits is cleared.
+ * way of searching a 100-bit bitmap. It's guaranteed that at least
+ * one of the 100 bits is cleared.
  */
 static inline int sched_find_first_bit(const unsigned long *b)
 {
 #if BITS_PER_LONG == 64
-	if (unlikely(b[0]))
+	if (b[0])
 		return __ffs(b[0]);
-	if (likely(b[1]))
-		return __ffs(b[1]) + 64;
-	return __ffs(b[2]) + 128;
+	return __ffs(b[1]) + 64;
 #elif BITS_PER_LONG == 32
-	if (unlikely(b[0]))
+	if (b[0])
 		return __ffs(b[0]);
-	if (unlikely(b[1]))
+	if (b[1])
 		return __ffs(b[1]) + 32;
-	if (unlikely(b[2]))
+	if (b[2])
 		return __ffs(b[2]) + 64;
-	if (b[3])
-		return __ffs(b[3]) + 96;
-	return __ffs(b[4]) + 128;
+	return __ffs(b[3]) + 96;
 #else
 #error BITS_PER_LONG not defined
 #endif

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 6:37 ` Mike Galbraith @ 2007-05-02 6:45 ` Ingo Molnar 2007-05-02 8:03 ` Gene Heskett 1 sibling, 0 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 6:45 UTC (permalink / raw) To: Mike Galbraith Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh * Mike Galbraith <efault@gmx.de> wrote: > > As usual, any sort of feedback, bugreport, fix and suggestion is > > more than welcome, > > Greetings, > > I noticed a (harmless) bounds warning triggered by the reduction in > size of array->bitmap. Patchlet below. thanks, applied! Your patch should also speed up task selection of RT tasks a bit. (the patch removes ~40 bytes of code). And on 64-bit we now fit into 2x 64-bit bitmap and thus are now down to two Find-First-Set instructions. Nice :) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
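To make the "two Find-First-Set instructions" point concrete, here is a userspace analogue of the patched 64-bit branch — illustrative only, assuming a 64-bit unsigned long and using GCC's __builtin_ctzl as a stand-in for the kernel's __ffs:

```c
#include <assert.h>

/* With the bitmap reduced from 140 to 100 bits, priority p lives in
 * word p / 64 at bit p % 64, so two 64-bit words cover the whole range
 * (e.g. prio 99 -> word 1, bit 35). */
#define PRIO_WORD(p) ((p) / 64)
#define PRIO_BIT(p)  ((p) % 64)

/* Userspace sketch of the patched sched_find_first_bit() for
 * BITS_PER_LONG == 64: at most two find-first-set operations. */
static int find_first_prio(const unsigned long b[2])
{
	if (b[0])
		return __builtin_ctzl(b[0]);
	return __builtin_ctzl(b[1]) + 64;
}
```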
* Re: [patch] CFS scheduler, -v8 2007-05-02 6:37 ` Mike Galbraith 2007-05-02 6:45 ` Ingo Molnar @ 2007-05-02 8:03 ` Gene Heskett 2007-05-02 8:12 ` Mike Galbraith 2007-05-02 8:13 ` Ingo Molnar 1 sibling, 2 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:03 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Mike Galbraith wrote: >On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: >> i'm pleased to announce release -v8 of the CFS scheduler patchset. (The >> main goal of CFS is to implement "desktop scheduling" with as high >> quality as technically possible.) > >... > >> As usual, any sort of feedback, bugreport, fix and suggestion is more >> than welcome, > >Greetings, > >I noticed a (harmless) bounds warning triggered by the reduction in size >of array->bitmap. Patchlet below. > > -Mike I just checked my logs, and it appears my workload didn't trigger this one Mike. And so far, v8 is working great here. And that great is in my best "Tony the Tiger" voice, stolen shamelessly from the breakfast cereal tv commercial of 30+ years ago. :) Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100 is the best of the breed. I'll give this one a 100 till something better shows up. 
>  CC      kernel/sched.o
>kernel/sched_rt.c: In function ‘load_balance_start_rt’:
>include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds
>kernel/sched_rt.c: In function ‘pick_next_task_rt’:
>include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds
>
>--- linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h.org	2007-05-02 07:16:47.000000000 +0200
>+++ linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h	2007-05-02 07:20:45.000000000 +0200
>@@ -6,28 +6,23 @@
> 
> /*
>  * Every architecture must define this function. It's the fastest
>- * way of searching a 140-bit bitmap where the first 100 bits are
>- * unlikely to be set. It's guaranteed that at least one of the 140
>- * bits is cleared.
>+ * way of searching a 100-bit bitmap. It's guaranteed that at least
>+ * one of the 100 bits is cleared.
>  */
> static inline int sched_find_first_bit(const unsigned long *b)
> {
> #if BITS_PER_LONG == 64
>-	if (unlikely(b[0]))
>+	if (b[0])
> 		return __ffs(b[0]);
>-	if (likely(b[1]))
>-		return __ffs(b[1]) + 64;
>-	return __ffs(b[2]) + 128;
>+	return __ffs(b[1]) + 64;
> #elif BITS_PER_LONG == 32
>-	if (unlikely(b[0]))
>+	if (b[0])
> 		return __ffs(b[0]);
>-	if (unlikely(b[1]))
>+	if (b[1])
> 		return __ffs(b[1]) + 32;
>-	if (unlikely(b[2]))
>+	if (b[2])
> 		return __ffs(b[2]) + 64;
>-	if (b[3])
>-		return __ffs(b[3]) + 96;
>-	return __ffs(b[4]) + 128;
>+	return __ffs(b[3]) + 96;
> #else
> #error BITS_PER_LONG not defined
> #endif

-- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) I met my latest girl friend in a department store. She was looking at clothes, and I was putting Slinkys on the escalators. -- Steven Wright

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:03 ` Gene Heskett @ 2007-05-02 8:12 ` Mike Galbraith 2007-05-02 8:48 ` Gene Heskett 2007-05-02 8:13 ` Ingo Molnar 1 sibling, 1 reply; 75+ messages in thread From: Mike Galbraith @ 2007-05-02 8:12 UTC (permalink / raw) To: Gene Heskett Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wed, 2007-05-02 at 04:03 -0400, Gene Heskett wrote: > I just checked my logs, and it appears my workload didn't trigger this one > Mike. It's just a build time compiler warning. > Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100 is > the best of the breed. I'll give this one a 100 till something better shows > up. Ditto. (so far... ya never know;) -Mike ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:12 ` Mike Galbraith @ 2007-05-02 8:48 ` Gene Heskett 0 siblings, 0 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:48 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Mike Galbraith wrote: >On Wed, 2007-05-02 at 04:03 -0400, Gene Heskett wrote: >> I just checked my logs, and it appears my workload didn't trigger this one >> Mike. > >It's just a build time compiler warning. Duh. I have a couple of pages of "may be used uninitialized" warnings. Including one in serial.c for the raw channel. And I have a bit of trouble there too. Related? Dunno. >> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100 >> is the best of the breed. I'll give this one a 100 till something better >> shows up. > >Ditto. (so far... ya never know;) > > -Mike Yup, this is sweet so far. :-) -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The vulcan-death-grip ping has been applied. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:03 ` Gene Heskett 2007-05-02 8:12 ` Mike Galbraith @ 2007-05-02 8:13 ` Ingo Molnar 2007-05-02 8:51 ` Gene Heskett 1 sibling, 1 reply; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 8:13 UTC (permalink / raw) To: Gene Heskett Cc: Mike Galbraith, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh * Gene Heskett <gene.heskett@gmail.com> wrote: > > I noticed a (harmless) bounds warning triggered by the reduction in > > size of array->bitmap. Patchlet below. > > I just checked my logs, and it appears my workload didn't trigger this > one Mike. [...] yeah: this is a build-time warning and it needs a newer/smarter gcc to notice that provably redundant piece of code. It's a harmless thing - but nevertheless Mike's fix is a nice little micro-optimization as well: it always bothered me a bit that at 140 priority levels we were _just_ past the 128 bits boundary by 12 bits. Now on 64-bit boxes it's just two 64-bit words to cover all 100 priority levels of RT tasks. > [...] And so far, v8 is working great here. And that great is in my > best "Tony the Tiger" voice, stolen shamelessly from the breakfast > cereal tv commercial of 30+ years ago. :) heh :-) > Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and > 100 is the best of the breed. I'll give this one a 100 till something > better shows up. nice - and you arent even using any OpenGL games ;) The 0-100 rating is really useful to me so that i can see the impact of regressions (if any) and it's also one single number representing the subjective impression - that way it's easier to keep tab of things. btw., do you still renice kmail slightly, or does it now work out of box with default nice 0? Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:13 ` Ingo Molnar @ 2007-05-02 8:51 ` Gene Heskett 0 siblings, 0 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:51 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Galbraith, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Ingo Molnar wrote: >* Gene Heskett <gene.heskett@gmail.com> wrote: >> > I noticed a (harmless) bounds warning triggered by the reduction in >> > size of array->bitmap. Patchlet below. >> >> I just checked my logs, and it appears my workload didn't trigger this >> one Mike. [...] > >yeah: this is a build-time warning and it needs a newer/smarter gcc to >notice that provably redundant piece of code. It's a harmless thing - >but nevertheless Mike's fix is a nice little micro-optimization as well: >it always bothered me a bit that at 140 priority levels we were _just_ >past the 128 bits boundary by 12 bits. Now on 64-bit boxes it's just two >64-bit words to cover all 100 priority levels of RT tasks. > >> [...] And so far, v8 is working great here. And that great is in my >> best "Tony the Tiger" voice, stolen shamelessly from the breakfast >> cereal tv commercial of 30+ years ago. :) > >heh :-) > >> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and >> 100 is the best of the breed. I'll give this one a 100 till something >> better shows up. > >nice - and you arent even using any OpenGL games ;) > >The 0-100 rating is really useful to me so that i can see the impact of >regressions (if any) and it's also one single number representing the >subjective impression - that way it's easier to keep tab of things. > >btw., do you still renice kmail slightly, or does it now work out of box >with default nice 0? > > Ingo For this last couple of boots, its "right out of the box" and isn't getting under my skin. 
A make -j4 didn't bother it either. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The goys have proven the following theorem... -- Physicist John von Neumann, at the start of a classroom lecture. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-01 21:22 Ingo Molnar 2007-05-02 2:57 ` Ting Yang 2007-05-02 6:37 ` Mike Galbraith @ 2007-05-02 7:59 ` Mike Galbraith 2007-05-02 8:11 ` Gene Heskett 2007-05-02 10:40 ` Ingo Molnar 2007-05-02 9:08 ` Balbir Singh ` (2 subsequent siblings) 5 siblings, 2 replies; 75+ messages in thread From: Mike Galbraith @ 2007-05-02 7:59 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: > - interactivity: precise load calculation and load smoothing This seems to help quite a bit.

(5 second top sample)

 2636 root      15  -5 19148  15m 5324 R   73  1.5   1:42.29 0 amarok_libvisua
 5440 root      20   0  320m  36m 8388 S   18  3.6   3:28.55 1 Xorg
 4621 root      20   0 22776  18m 4168 R   12  1.8   0:00.63 1 cc1
 4616 root      20   0 19456  13m 2200 R    9  1.3   0:00.43 0 cc1

I no longer have to renice both X and Gforce to achieve a perfect display when they are sharing my box with a make -j2. X is displaying everything it's being fed beautifully with no help. I have to renice Gforce (amarok_libvisual), but given its very heavy CPU usage, that seems perfectly fine. No regressions noticed so far. Box is _very_ responsive under load, seemingly even more so than with previous releases. That is purely subjective, but the first impression was very distinct. -Mike ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 7:59 ` Mike Galbraith @ 2007-05-02 8:11 ` Gene Heskett 2007-05-02 10:40 ` Ingo Molnar 1 sibling, 0 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:11 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Mike Galbraith wrote: >On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: >> - interactivity: precise load calculation and load smoothing > >This seems to help quite a bit. > >(5 second top sample) > > 2636 root 15 -5 19148 15m 5324 R 73 1.5 1:42.29 0 > amarok_libvisua 5440 root 20 0 320m 36m 8388 S 18 3.6 3:28.55 > 1 Xorg 4621 root 20 0 22776 18m 4168 R 12 1.8 0:00.63 1 cc1 > 4616 root 20 0 19456 13m 2200 R 9 1.3 0:00.43 0 cc1 > >I no longer have to renice both X and Gforce to achieve a perfect >display when they are sharing my box with a make -j2. X is displaying >everything it's being fed beautifully with no help. I have to renice >Gforce (amarok_libvisual), but given it's very heavy CPU usage, that >seems perfectly fine. > >No regressions noticed so far. Box is _very_ responsive under load, >seemingly even more so than with previous releases. That is purely >subjective, but the first impression was very distinct. > > -Mike And I have a make -j4 running to apply your patch Mike, and can't tell it from here, no lag. This is another keeper folks. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) It takes less time to do a thing right than it does to explain why you did it wrong. -- H.W. Longfellow ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 7:59 ` Mike Galbraith 2007-05-02 8:11 ` Gene Heskett @ 2007-05-02 10:40 ` Ingo Molnar 1 sibling, 0 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 10:40 UTC (permalink / raw) To: Mike Galbraith Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh * Mike Galbraith <efault@gmx.de> wrote: > On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: > > - interactivity: precise load calculation and load smoothing > > This seems to help quite a bit. great :) > (5 second top sample) > > 2636 root 15 -5 19148 15m 5324 R 73 1.5 1:42.29 0 amarok_libvisua > 5440 root 20 0 320m 36m 8388 S 18 3.6 3:28.55 1 Xorg > 4621 root 20 0 22776 18m 4168 R 12 1.8 0:00.63 1 cc1 > 4616 root 20 0 19456 13m 2200 R 9 1.3 0:00.43 0 cc1 > > I no longer have to renice both X and Gforce to achieve a perfect > display when they are sharing my box with a make -j2. X is displaying > everything it's being fed beautifully with no help. I have to renice > Gforce (amarok_libvisual), but given it's very heavy CPU usage, that > seems perfectly fine. ah! Besides OpenGL behavior and app startup performance i didnt originally have Xorg in mind with this change, but thinking about it, precise load calculations and load smoothing does have a positive effect on 'coupled' workloads where under the previous variant of CFS's load calculation one task component of the workload could become 'invisible' to another task and hence cause macro-scale scheduling artifacts not expected by humans. With smoothing these are dealt with more consistently. Xorg can be a quite strongly coupled workload. > No regressions noticed so far. Box is _very_ responsive under load, > seemingly even more so than with previous releases. That is purely > subjective, but the first impression was very distinct. 
yeah, make -jN workloads (and any mixture of non-identical scheduling patterns, which most real workloads consist of) should be handled more consistently too by -v8. so your workload (and Gene's workload) were the ones in fact that i had hoped not to _hurt_ with -v8 (neither of you being the hardcore gamer type ;), and in reality -v8 ended up helping them too. These sorts of side-effects are always welcome ;-) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
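The "load smoothing" discussed in this exchange can be illustrated with a generic exponentially-decaying average — a toy sketch with an invented decay factor, not CFS's actual code:

```c
#include <assert.h>

/* Fold the current instantaneous runqueue load into a running average
 * each tick: new = (3*old + current) / 4.  Short spikes are damped,
 * while sustained load converges to its true value, so one task's load
 * cannot briefly become 'invisible' to another between samples.  The
 * 3/4 decay factor is illustrative only. */
static unsigned long smooth_load(unsigned long avg, unsigned long cur)
{
	return (3 * avg + cur) / 4;
}
```

A spike from 0 to 400 registers as only 100 after one tick, while a steady 400 stays at 400.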
* Re: [patch] CFS scheduler, -v8 2007-05-01 21:22 Ingo Molnar ` (2 preceding siblings ...) 2007-05-02 7:59 ` Mike Galbraith @ 2007-05-02 9:08 ` Balbir Singh 2007-05-02 10:05 ` Ingo Molnar 2007-05-02 12:58 ` Mark Lord 2007-05-02 12:58 ` Vegard Nossum 5 siblings, 1 reply; 75+ messages in thread From: Balbir Singh @ 2007-05-02 9:08 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh Ingo Molnar wrote: > Changes since -v7: > > - powerpc debug output and build warning fixes (Balbir Singh) > > - documentation fixes (Zach Carter) > > - interactivity: precise load calculation and load smoothing > > As usual, any sort of feedback, bugreport, fix and suggestion is more > than welcome, > > Ingo Hi, Ingo, I would like to report, what I think is a regression with -v8. With -v7 I would run the n/n+1 test. Basically on a system with n cpus, I would run n+1 tasks and see how their load is distributed. I usually find that the last two tasks would get stuck on one CPU on the system and would get half the cpu time as their other peers. I think this issue has been around for long even before CFS. But while I was investigating that, I found that with -v8, all the n+1 tasks are stuck on the same cpu. 
Output of /proc/sched_debug

# cat /proc/sched_debug
Sched Debug Version: v0.02
now at 1507287574145 nsecs

cpu: 0
  .nr_running            : 0
  .raw_weighted_load     : 0
  .nr_switches           : 111130
  .nr_load_updates       : 376821
  .nr_uninterruptible    : 18446744073709551550
  .next_balance          : 4295269119
  .curr->pid             : 0
  .clock                 : 7431167968202137
  .prev_clock_raw        : 7431167968202136
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 26969582038
  .prev_fair_clock       : 26969539422
  .exec_clock            : 9881536864
  .prev_exec_clock       : 9881494248
  .wait_runtime          : 116431647
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------

cpu: 1
  .nr_running            : 0
  .raw_weighted_load     : 0
  .nr_switches           : 56374
  .nr_load_updates       : 376767
  .nr_uninterruptible    : 156
  .next_balance          : 4295269118
  .curr->pid             : 0
  .clock                 : 7431167857161633
  .prev_clock_raw        : 7431167857161632
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 34038615236
  .prev_fair_clock       : 34038615236
  .exec_clock            : 18764126904
  .prev_exec_clock       : 18764126904
  .wait_runtime          : 132146856
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------

cpu: 2
  .nr_running            : 5
  .raw_weighted_load     : 5120
  .nr_switches           : 140351
  .nr_load_updates       : 376767
  .nr_uninterruptible    : 18446744073709551559
  .next_balance          : 4295269128
  .curr->pid             : 6462
  .clock                 : 7431167968695481
  .prev_clock_raw        : 7431167968695480
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 178895812434
  .prev_fair_clock       : 178895727748
  .exec_clock            : 858569069824
  .prev_exec_clock       : 858568528616
  .wait_runtime          : 2643237421
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------
 R bash  6462  178897659138  1846704  -1846958  19646  120  -178895812434  169799117688  135410790136
   bash  6461  178897934427  2121993  -7673376  19538  120  -5551118       169989747968  135499300276
   bash  6460  178898353788  2541354  -6492732  19608  120  -3951111       170136703840  135648219117
   bash  6459  178899921997  4109563  -6460948  19747  120  -2351093       170559324432  135812802778
   bash  6458  178901052918  5240484  -5991881  19756  120  -751111        171257975848  135805570391

cpu: 3
  .nr_running            : 1
  .raw_weighted_load     : 1024
  .nr_switches           : 43253
  .nr_load_updates       : 376767
  .nr_uninterruptible    : 18446744073709551583
  .next_balance          : 4295269180
  .curr->pid             : 7524
  .clock                 : 7431167970150081
  .prev_clock_raw        : 7431167970150080
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 24318712701
  .prev_fair_clock       : 24318712701
  .exec_clock            : 20098322728
  .prev_exec_clock       : 20098322728
  .wait_runtime          : 178370619
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------
 R cat   7524   24318779730    67029    -67029         3  120  -24318712701        1661560          2277

Output of top

 6459 root      20   0  4912  792  252 R   20  0.0   8:29.33 bash
 6458 root      20   0  4912  792  252 R   20  0.0   8:29.90 bash
 6460 root      20   0  4912  792  252 R   20  0.0   8:28.94 bash
 6461 root      20   0  4912  792  252 R   20  0.0   8:28.88 bash
 6462 root      20   0  4912  792  252 R   20  0.0   8:28.54 bash

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02  9:08 ` Balbir Singh
@ 2007-05-02 10:05   ` Ingo Molnar
  2007-05-02 10:59     ` Balbir Singh
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2007-05-02 10:05 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Zach Carter, buddabrod

* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> With -v7 I would run the n/n+1 test. Basically on a system with n
> cpus, I would run n+1 tasks and see how their load is distributed. I
> usually find that the last two tasks would get stuck on one CPU on the
> system and would get half the cpu time as their other peers. I think
> this issue has been around for long even before CFS. But while I was
> investigating that, I found that with -v8, all the n+1 tasks are stuck
> on the same cpu.

i believe this problem is specific to powerpc - load is distributed fine
on i686/x86_64, and your sched_debug shows a cpu_load[0] == 0 on CPU#2
which is 'impossible'. (I sent a few suggestions off-Cc about how to
debug this.)

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 10:05 ` Ingo Molnar
@ 2007-05-02 10:59   ` Balbir Singh
  2007-05-02 11:17     ` Ingo Molnar
  0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2007-05-02 10:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Zach Carter, buddabrod

Ingo Molnar wrote:
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> With -v7 I would run the n/n+1 test. Basically on a system with n
>> cpus, I would run n+1 tasks and see how their load is distributed. I
>> usually find that the last two tasks would get stuck on one CPU on the
>> system and would get half the cpu time as their other peers. I think
>> this issue has been around for long even before CFS. But while I was
>> investigating that, I found that with -v8, all the n+1 tasks are stuck
>> on the same cpu.
>
> i believe this problem is specific to powerpc - load is distributed fine
> on i686/x86_64 and your sched_debug shows a cpu_load[0] == 0 on CPU#2
> which is 'impossible'. (I sent a few suggestions off-Cc about how to
> debug this.)
>
> 	Ingo

Hi, Ingo

The suggestions helped; here is a fix, tested on PowerPC only.

Patch and Description
=====================

Load balancing on PowerPC is broken. Running 5 tasks on a 4 cpu system
results in all 5 tasks running on the same CPU. Based on Ingo's
feedback, I instrumented and debugged update_load_fair().

The problem is with comparing s64 values against (s64)ULONG_MAX, which
evaluates to -1. We check if exec_delta64 and fair_delta64 are greater
than (s64)ULONG_MAX (-1); if so, we assign (s64)ULONG_MAX to the
respective values.

The fix is to compare these values against (s64)LONG_MAX and assign
(s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
than (s64)LONG_MAX.

Tested on PowerPC; the regression is gone, tasks are load balanced as
they were in v7.

Output of top

 5614 root      20   0  4912  784  252 R   52  0.0   3:27.49 3 bash
 5620 root      20   0  4912  784  252 R   47  0.0   3:07.38 2 bash
 5617 root      20   0  4912  784  252 R   47  0.0   3:08.18 0 bash
 5624 root      20   0  4912  784  252 R   26  0.0   1:42.97 1 bash
 5621 root      20   0  4912  784  252 R   26  0.0   1:43.14 1 bash

Tasks 5624 and 5621 getting half of their peer values is a separate
issue altogether.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 kernel/sched.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff -puN kernel/sched.c~cfs-fix-load-balancing-arith kernel/sched.c
--- linux-2.6.21/kernel/sched.c~cfs-fix-load-balancing-arith	2007-05-02 16:16:20.000000000 +0530
+++ linux-2.6.21-balbir/kernel/sched.c	2007-05-02 16:16:47.000000000 +0530
@@ -1533,19 +1533,19 @@ static void update_load_fair(struct rq *
 	this_rq->prev_exec_clock = this_rq->exec_clock;
 	WARN_ON_ONCE(exec_delta64 <= 0);
 
-	if (fair_delta64 > (s64)ULONG_MAX)
-		fair_delta64 = (s64)ULONG_MAX;
+	if (fair_delta64 > (s64)LONG_MAX)
+		fair_delta64 = (s64)LONG_MAX;
 	fair_delta = (unsigned long)fair_delta64;
 
-	if (exec_delta64 > (s64)ULONG_MAX)
-		exec_delta64 = (s64)ULONG_MAX;
+	if (exec_delta64 > (s64)LONG_MAX)
+		exec_delta64 = (s64)LONG_MAX;
 	exec_delta = (unsigned long)exec_delta64;
 
 	if (exec_delta > TICK_NSEC)
 		exec_delta = TICK_NSEC;
 	idle_delta = TICK_NSEC - exec_delta;
 
-	tmp = SCHED_LOAD_SCALE * exec_delta / fair_delta;
+	tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta;
 	tmp64 = (u64)tmp * (u64)exec_delta;
 	do_div(tmp64, TICK_NSEC);
 	this_load = (unsigned long)tmp64;
_

--
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 10:59 ` Balbir Singh
@ 2007-05-02 11:17   ` Ingo Molnar
  2007-05-05  8:31     ` Esben Nielsen
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2007-05-02 11:17 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Zach Carter, buddabrod

* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> The problem is with comparing s64 values against (s64)ULONG_MAX, which
> evaluates to -1. We check if exec_delta64 and fair_delta64 are
> greater than (s64)ULONG_MAX (-1); if so, we assign (s64)ULONG_MAX to
> the respective values.

ah, indeed ...

> The fix is to compare these values against (s64)LONG_MAX and assign
> (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
> than (s64)LONG_MAX.
>
> Tested on PowerPC; the regression is gone, tasks are load balanced as
> they were in v7.

thanks, applied!

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 11:17 ` Ingo Molnar
@ 2007-05-05  8:31   ` Esben Nielsen
  2007-05-05 17:44     ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Esben Nielsen @ 2007-05-05 8:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Balbir Singh, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Wed, 2 May 2007, Ingo Molnar wrote:

> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> The problem is with comparing s64 values against (s64)ULONG_MAX, which
>> evaluates to -1. We check if exec_delta64 and fair_delta64 are
>> greater than (s64)ULONG_MAX (-1), if so we assign (s64)ULONG_MAX to
>> the respective values.
>
> ah, indeed ...
>
>> The fix is to compare these values against (s64)LONG_MAX and assign
>> (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
>> than (s64)LONG_MAX.
>>
>> Tested on PowerPC, the regression is gone, tasks are load balanced as
>> they were in v7.
>
> thanks, applied!
>
> 	Ingo

I have been wondering why you use unsigned for timers anyway. It is also
like that in hrtimers. Why not use signed and avoid (almost) all worries
about wrap-around issues? The trick is that when every
	a < b
is replaced by
	a - b < 0
the code will work on all two's-complement machines even if the (signed!)
integers a and b wrap around.

In both the hrtimer and CFS patch, 32 bit timers could be used internally
on 32 bit architectures to avoid expensive 64 bit integer calculations.
The only constraint is that timeouts can not be bigger than 2^31 in the
chosen units.

I have successfully coded a (much more primitive) hrtimer system for
another OS on both ARM and PPC using the above trick in my former job. On
both I used the machine's internal clock as the internal representation
of time and only scaled to a struct timespec in the user interface.

	Esben

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-05  8:31 ` Esben Nielsen
@ 2007-05-05 17:44   ` Linus Torvalds
  2007-05-06  8:29     ` Ingo Molnar
  2007-05-07 11:09     ` Esben Nielsen
  1 sibling, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2007-05-05 17:44 UTC (permalink / raw)
To: Esben Nielsen
Cc: Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Sat, 5 May 2007, Esben Nielsen wrote:
>
> I have been wondering why you use unsigned for timers anyway. It is also
> like that in hrtimers. Why not use signed and avoid (almost) all worries
> about wrap-around issues? The trick is that when every
> 	a < b
> is replaced by
> 	a - b < 0
> the code will work on all two's-complement machines even if the (signed!)
> integers a and b wrap around.

No. BOTH of the above are buggy.

The C language definition doesn't allow signed integers to wrap (ie it's
undefined behaviour), so "a-b < 0" can be rewritten by the compiler as a
simple signed "a < b".

And the unsigned (or signed) "a < b" is just broken wrt any kind of
wrap-around (whether wrapping around zero or the sign bit).

So the _only_ valid way to handle timers is to
 - either not allow wrapping at all (in which case "unsigned" is better,
   since it is bigger)
 - or use wrapping explicitly, and use unsigned arithmetic (which is
   well-defined in C) and do something like "(long)(a-b) > 0".

Notice? The signed variant is basically _never_ correct.

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-05 17:44 ` Linus Torvalds
@ 2007-05-06  8:29   ` Ingo Molnar
  2007-05-06  8:36     ` Willy Tarreau
  2007-05-06 17:45     ` Linus Torvalds
  1 sibling, 2 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-06 8:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Esben Nielsen, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So the _only_ valid way to handle timers is to
>  - either not allow wrapping at all (in which case "unsigned" is better,
>    since it is bigger)
>  - or use wrapping explicitly, and use unsigned arithmetic (which is
>    well-defined in C) and do something like "(long)(a-b) > 0".

hm, there is a corner-case in CFS where a fix like this is necessary.

CFS uses 64-bit values for almost everything, and the majority of values
are of 'relative' nature with no danger of overflow. (They are signed
because they are relative values that center around zero and can be
negative or positive.)

there is one value that is of 'absolute' nature though: the 'fair clock'
(which is the same as the normal system clock when load is 1.0, and it
slows down in a load-proportional way as load increases), which has
units of nanoseconds - and the 'keys' in the rbtree are derived from
such clock values.

That particular clock could produce an s64 overflow at around 292 years
after bootup (a bit later if the system is also used for something in
that timeframe). While i dont think that's realistic to worry about, the
fix below should solve that theoretical problem too.

	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -60,7 +60,7 @@ static inline void __enqueue_task_fair(s
 		 * We dont care about collisions. Nodes with
 		 * the same key stay together.
 		 */
-		if (key < entry->fair_key) {
+		if ((s64)(entry->fair_key - key) > 0) {
 			link = &parent->rb_left;
 		} else {
 			link = &parent->rb_right;

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06  8:29 ` Ingo Molnar
@ 2007-05-06  8:36   ` Willy Tarreau
  2007-05-06  8:52     ` Ingo Molnar
  0 siblings, 1 reply; 75+ messages in thread
From: Willy Tarreau @ 2007-05-06 8:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Esben Nielsen, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

Hi Ingo,

On Sun, May 06, 2007 at 10:29:11AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So the _only_ valid way to handle timers is to
> >  - either not allow wrapping at all (in which case "unsigned" is better,
> >    since it is bigger)
> >  - or use wrapping explicitly, and use unsigned arithmetic (which is
> >    well-defined in C) and do something like "(long)(a-b) > 0".
>
> hm, there is a corner-case in CFS where a fix like this is necessary.
>
> CFS uses 64-bit values for almost everything, and the majority of values
> are of 'relative' nature with no danger of overflow. (They are signed
> because they are relative values that center around zero and can be
> negative or positive.)
(...)
> -		if (key < entry->fair_key) {
> +		if ((s64)(entry->fair_key - key) > 0) {

Just a hint: while your code here is correct, it is a good practise to
check against < 0 instead, so that if for any reason you sometimes
forget to cast to signed, the compiler will emit a warning stating that
the condition is always false. This would simply become :

-		if (key < entry->fair_key) {
+		if ((s64)(key - entry->fair_key) < 0) {

Just my .02 euros :-)

Willy

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06  8:36 ` Willy Tarreau
@ 2007-05-06  8:52   ` Ingo Molnar
  0 siblings, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-06 8:52 UTC (permalink / raw)
To: Willy Tarreau
Cc: Linus Torvalds, Esben Nielsen, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

* Willy Tarreau <w@1wt.eu> wrote:

> Just a hint: while your code here is correct, it is a good practise to
> check against < 0 instead, so that if for any reason you sometimes
> forget to cast to signed, the compiler will emit a warning stating
> that the condition is always false. This would simply become :
>
> -		if (key < entry->fair_key) {
> +		if ((s64)(key - entry->fair_key) < 0) {

done :-) (Btw., the key is still s64 so the cast is technically
unnecessary - but now the code would be correct with the key being u64
too.)

	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -60,7 +60,7 @@ static inline void __enqueue_task_fair(s
 		 * We dont care about collisions. Nodes with
 		 * the same key stay together.
 		 */
-		if (key < entry->fair_key) {
+		if ((s64)(key - entry->fair_key) < 0) {
 			link = &parent->rb_left;
 		} else {
 			link = &parent->rb_right;

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06  8:29 ` Ingo Molnar
  2007-05-06  8:36 ` Willy Tarreau
@ 2007-05-06 17:45   ` Linus Torvalds
  2007-05-07 11:30     ` Esben Nielsen
  1 sibling, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2007-05-06 17:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: Esben Nielsen, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Sun, 6 May 2007, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So the _only_ valid way to handle timers is to
> >  - either not allow wrapping at all (in which case "unsigned" is better,
> >    since it is bigger)
> >  - or use wrapping explicitly, and use unsigned arithmetic (which is
> >    well-defined in C) and do something like "(long)(a-b) > 0".
>
> hm, there is a corner-case in CFS where a fix like this is necessary.
>
> CFS uses 64-bit values for almost everything, and the majority of values
> are of 'relative' nature with no danger of overflow. (They are signed
> because they are relative values that center around zero and can be
> negative or positive.)

Well, I'd like to just worry about that for a while.

You say there is "no danger of overflow", and I mostly agree that once
we're talking about 64-bit values, the overflow issue simply doesn't
exist, and furthermore the difference between 63 and 64 bits is not
really relevant, so there's no major reason to actively avoid signed
entries.

So in that sense, it all sounds perfectly sane. And I'm definitely not
sure your "292 years after bootup" worry is really worth even
considering.

When we're really so well off that we expect the hardware and software
stack to be stable over a hundred years, I'd start to think about issues
like that; in the meantime, to me worrying about those kinds of issues
just means that you're worrying about the wrong things.

BUT.

There's a fundamental reason relative timestamps are difficult and
almost always have overflow issues: the "long long in the future" case
as an approximation of "infinite timeout" is almost always relevant.

So rather than worry about the system staying up 292 years, I'd worry
about whether people pass in big numbers (like some MAX_S64
approximation) as an approximation for "infinite", and once you have
things like that, the "64 bits never overflows" argument is totally
bogus.

There's a damn good reason for using only *absolute* time. The whole
"signed values of relative time" may _sound_ good, but it really sucks
in subtle and horrible ways!

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06 17:45 ` Linus Torvalds
@ 2007-05-07 11:30   ` Esben Nielsen
  2007-05-07 15:55     ` Ingo Molnar
  2007-05-07 16:11     ` Linus Torvalds
  2007-05-08  0:35     ` Peter Williams
  0 siblings, 3 replies; 75+ messages in thread
From: Esben Nielsen @ 2007-05-07 11:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Esben Nielsen, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Sun, 6 May 2007, Linus Torvalds wrote:

> On Sun, 6 May 2007, Ingo Molnar wrote:
>>
>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>> So the _only_ valid way to handle timers is to
>>>  - either not allow wrapping at all (in which case "unsigned" is better,
>>>    since it is bigger)
>>>  - or use wrapping explicitly, and use unsigned arithmetic (which is
>>>    well-defined in C) and do something like "(long)(a-b) > 0".
>>
>> hm, there is a corner-case in CFS where a fix like this is necessary.
>>
>> CFS uses 64-bit values for almost everything, and the majority of values
>> are of 'relative' nature with no danger of overflow. (They are signed
>> because they are relative values that center around zero and can be
>> negative or positive.)
>
> Well, I'd like to just worry about that for a while.
>
> You say there is "no danger of overflow", and I mostly agree that once
> we're talking about 64-bit values, the overflow issue simply doesn't
> exist, and furthermore the difference between 63 and 64 bits is not
> really relevant, so there's no major reason to actively avoid signed
> entries.
>
> So in that sense, it all sounds perfectly sane. And I'm definitely not
> sure your "292 years after bootup" worry is really worth even
> considering.

I would hate to tell mission control for Mankind's first mission to
another star to reboot every 200 years because "there is no need to
worry about it."

As a matter of principle an OS should never need a reboot (with
exception for upgrading). If you say you have to reboot every 200 years,
why not every 100? Every 50? .... Every 45 days (you know what I am
referring to :-) ?

> When we're really so well off that we expect the hardware and software
> stack to be stable over a hundred years, I'd start to think about issues
> like that; in the meantime, to me worrying about those kinds of issues
> just means that you're worrying about the wrong things.
>
> BUT.
>
> There's a fundamental reason relative timestamps are difficult and
> almost always have overflow issues: the "long long in the future" case
> as an approximation of "infinite timeout" is almost always relevant.
>
> So rather than worry about the system staying up 292 years, I'd worry
> about whether people pass in big numbers (like some MAX_S64
> approximation) as an approximation for "infinite", and once you have
> things like that, the "64 bits never overflows" argument is totally
> bogus.
>
> There's a damn good reason for using only *absolute* time. The whole
> "signed values of relative time" may _sound_ good, but it really sucks
> in subtle and horrible ways!

I think you are wrong here. The only place you need absolute time is
for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
representation when you could have used a 32 bit one. With a 32 bit
implementation you are forced to handle the corner cases with wrap
around and too-big arguments up front. With a 64 bit one you hide those
problems.

I think CFS would be best off using a 32 bit timer counting in
microseconds. That would wrap around in 72 minutes. But as the timers
are relative, you will never be able to specify a timer larger than 36
minutes in the future. 36 minutes is ridiculously long for a scheduler,
though, and a simple test limiting time values to that bound would not
break anything.

	Esben

> 		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 11:30 ` Esben Nielsen
@ 2007-05-07 15:55   ` Ingo Molnar
  0 siblings, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-07 15:55 UTC (permalink / raw)
To: Esben Nielsen
Cc: Linus Torvalds, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

* Esben Nielsen <nielsen.esben@googlemail.com> wrote:

> I would hate to tell mission control for Mankind's first mission to
> another star to reboot every 200 years because "there is no need to
> worry about it."

ok, i need no other argument - fixed :-) (btw., if anyone from the
planning committee of Mankind's first mission to another star has any
other particular wish about CFS's design, i'm all ears and it's going to
get the highest possible priority!)

i really mean it! (And i guess you discovered my weak spot. Darn :)
Hopefully they wont ask for STREAMS support, that would put us into a
_really_ difficult position.)

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 11:30 ` Esben Nielsen
  2007-05-07 15:55 ` Ingo Molnar
@ 2007-05-07 16:11   ` Linus Torvalds
  2007-05-08  0:35     ` Peter Williams
  0 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2007-05-07 16:11 UTC (permalink / raw)
To: Esben Nielsen
Cc: Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Mon, 7 May 2007, Esben Nielsen wrote:
>
> I would hate to tell mission control for Mankind's first mission to another
> star to reboot every 200 years because "there is no need to worry about it."

I'd like to put it another way:

 - if your mission to another star *depends* on every single piece of
   complex equipment staying up with zero reboots for 200+ years, you
   have some serious technology problems.

In other words: your argument is populist, and totally silly. Trust me,
if you are going to another star, you'd better have the capabilities to
handle bugs. You'd better have multiple fail-over etc.

A notion of "robustness" cannot and must not hinge on "no bugs". That's
unrealistic.

> As a matter of principle an OS should never need a reboot (with exception for
> upgrading). If you say you have to reboot every 200 years, why not every 100?
> Every 50? .... Every 45 days (you know what I am referring to :-) ?

There's something of a difference between 45 days and 292 years. I'm not
saying we can't get there, but worrying about it in the current state is
just stupid. It's not just looking at the trees and not seeing the
forest, it's looking at one *leaf*, and missing the forest.

And quite frankly, if you work for NASA and are aiming for the stars,
you'd better not be that person who looks at one leaf and cannot see the
forest around you. That's the kind of thing that makes you miss the
difference between miles and kilometers.

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 11:30 ` Esben Nielsen
  2007-05-07 15:55 ` Ingo Molnar
  2007-05-07 16:11 ` Linus Torvalds
@ 2007-05-08  0:35   ` Peter Williams
  2007-05-08  9:05     ` Esben Nielsen
  2007-05-10 13:09     ` Pavel Machek
  2 siblings, 2 replies; 75+ messages in thread
From: Peter Williams @ 2007-05-08 0:35 UTC (permalink / raw)
To: Esben Nielsen
Cc: Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

Esben Nielsen wrote:
>
> On Sun, 6 May 2007, Linus Torvalds wrote:
>
>> On Sun, 6 May 2007, Ingo Molnar wrote:
>>>
>>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>>> So the _only_ valid way to handle timers is to
>>>>  - either not allow wrapping at all (in which case "unsigned" is
>>>>    better, since it is bigger)
>>>>  - or use wrapping explicitly, and use unsigned arithmetic (which is
>>>>    well-defined in C) and do something like "(long)(a-b) > 0".
>>>
>>> hm, there is a corner-case in CFS where a fix like this is necessary.
>>>
>>> CFS uses 64-bit values for almost everything, and the majority of
>>> values are of 'relative' nature with no danger of overflow. (They are
>>> signed because they are relative values that center around zero and
>>> can be negative or positive.)
>>
>> Well, I'd like to just worry about that for a while.
>>
>> You say there is "no danger of overflow", and I mostly agree that once
>> we're talking about 64-bit values, the overflow issue simply doesn't
>> exist, and furthermore the difference between 63 and 64 bits is not
>> really relevant, so there's no major reason to actively avoid signed
>> entries.
>>
>> So in that sense, it all sounds perfectly sane. And I'm definitely not
>> sure your "292 years after bootup" worry is really worth even
>> considering.
>
> I would hate to tell mission control for Mankind's first mission to
> another star to reboot every 200 years because "there is no need to
> worry about it."
>
> As a matter of principle an OS should never need a reboot (with
> exception for upgrading). If you say you have to reboot every 200
> years, why not every 100? Every 50? .... Every 45 days (you know what
> I am referring to :-) ?

There's always going to be an upper limit on the representation of time.
At least until we figure out how to implement infinity properly.

>> When we're really so well off that we expect the hardware and software
>> stack to be stable over a hundred years, I'd start to think about
>> issues like that; in the meantime, to me worrying about those kinds of
>> issues just means that you're worrying about the wrong things.
>>
>> BUT.
>>
>> There's a fundamental reason relative timestamps are difficult and
>> almost always have overflow issues: the "long long in the future" case
>> as an approximation of "infinite timeout" is almost always relevant.
>>
>> So rather than worry about the system staying up 292 years, I'd worry
>> about whether people pass in big numbers (like some MAX_S64
>> approximation) as an approximation for "infinite", and once you have
>> things like that, the "64 bits never overflows" argument is totally
>> bogus.
>>
>> There's a damn good reason for using only *absolute* time. The whole
>> "signed values of relative time" may _sound_ good, but it really sucks
>> in subtle and horrible ways!
>
> I think you are wrong here. The only place you need absolute time is
> for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
> representation when you could have used a 32 bit one. With a 32 bit
> implementation you are forced to handle the corner cases with wrap
> around and too-big arguments up front. With a 64 bit one you hide
> those problems.

As does the other method. A 32 bit signed offset with a 32 bit base is
just a crude version of 64 bit absolute time.

> I think CFS would be best off using a 32 bit timer counting in
> microseconds. That would wrap around in 72 minutes. But as the timers
> are relative, you will never be able to specify a timer larger than 36
> minutes in the future. 36 minutes is ridiculously long for a scheduler,
> though, and a simple test limiting time values to that bound would not
> break anything.

Except if you're measuring sleep times. I think that you'll find lots
of tasks sleep for more than 72 minutes.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-08 0:35 ` Peter Williams @ 2007-05-08 9:05 ` Esben Nielsen 2007-05-09 0:01 ` Peter Williams 2007-05-10 13:09 ` Pavel Machek 1 sibling, 1 reply; 75+ messages in thread From: Esben Nielsen @ 2007-05-08 9:05 UTC (permalink / raw) To: Peter Williams Cc: Esben Nielsen, Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Tue, 8 May 2007, Peter Williams wrote: > Esben Nielsen wrote: >> >> >> On Sun, 6 May 2007, Linus Torvalds wrote: >> >> > >> > >> > On Sun, 6 May 2007, Ingo Molnar wrote: >> > > >> > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > > >> > > > So the _only_ valid way to handle timers is to >> > > > - either not allow wrapping at all (in which case "unsigned" is >> > > > better, >> > > > since it is bigger) >> > > > - or use wrapping explicitly, and use unsigned arithmetic (which is >> > > > well-defined in C) and do something like "(long)(a-b) > 0". >> > > >> > > hm, there is a corner-case in CFS where a fix like this is necessary. >> > > >> > > CFS uses 64-bit values for almost everything, and the majority of >> > > values >> > > are of 'relative' nature with no danger of overflow. (They are signed >> > > because they are relative values that center around zero and can be >> > > negative or positive.) >> > >> > Well, I'd like to just worry about that for a while. >> > >> > You say there is "no danger of overflow", and I mostly agree that once >> > we're talking about 64-bit values, the overflow issue simply doesn't >> > exist, and furthermore the difference between 63 and 64 bits is not >> > really >> > relevant, so there's no major reason to actively avoid signed entries. >> > >> > So in that sense, it all sounds perfectly sane. 
And I'm definitely not >> > sure your "292 years after bootup" worry is really worth even >> > considering. >> > >> >> I would hate to tell mission control for Mankind's first mission to >> another >> star to reboot every 200 years because "there is no need to worry about >> it." >> >> As a matter of principle an OS should never need a reboot (with exception >> for upgrading). If you say you have to reboot every 200 years, why not >> every 100? Every 50? .... Every 45 days (you know what I am referring to >> :-) ? > > There's always going to be an upper limit on the representation of time. > At least until we figure out how to implement infinity properly. Well you need infinite memory for that :-) But that is my point: Why go into the problem of storing absolute time when you can use relative time? > >> >> > When we're really so well off that we expect the hardware and software >> > stack to be stable over a hundred years, I'd start to think about issues >> > like that, in the meantime, to me worrying about those kinds of issues >> > just means that you're worrying about the wrong things. >> > >> > BUT. >> > >> > There's a fundamental reason relative timestamps are difficult and >> > almost >> > always have overflow issues: the "long long in the future" case as an >> > approximation of "infinite timeout" is almost always relevant. >> > >> > So rather than worry about the system staying up 292 years, I'd worry >> > about whether people pass in big numbers (like some MAX_S64 >> > approximation) >> > as an approximation for "infinite", and once you have things like that, >> > the "64 bits never overflows" argument is totally bogus. >> > >> > There's a damn good reason for using only *absolute* time. The whole >> > "signed values of relative time" may _sound_ good, but it really sucks >> > in >> > subtle and horrible ways! >> > >> >> I think you are wrong here. The only place you need absolute time is a for >> the clock (CLOCK_REALTIME). 
You waste CPU using a 64 bit >> representation when you could have used a 32 bit. With a 32 bit >> implementation you are forced to handle the corner cases with wrap around >> and too big arguments up front. With a 64 bit you hide those problems. > > As does the other method. A 32 bit signed offset with a 32 bit base is just > a crude version of 64 bit absolute time. 64 bit is also relative - just over a much longer period. 32 bit signed offset is relative - and you know it. But with 64 people think it is absolute and put in large values as Linus said above. With 32 bit future developers will know it is relative and code for it. And they will get their corner cases tested, because the code soon will run into those corners. > >> >> I think CFS would be best off using a 32 bit timer counting in micro >> seconds. That would wrap around in 72 minutes. But as the timers are >> relative you will never be able to specify a timer larger than 36 minutes >> in the future. But 36 minutes is ridiculously long for a scheduler and a >> simple test limiting time values to that value would not break anything. > > Except if you're measuring sleep times. I think that you'll find lots of > tasks sleep for more than 72 minutes. I don't think those large values will be relevant. You can easily cut off sleep times at 30 min or even 1 min. But you need to detect that the task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do that with a 32 bit timer alone. In that case you need to use a (at least) 64 bit timer - which is needed in the OS anyways. But not internally in the wait queue, where the repeated calculations are. Esben > > Peter > -- > Peter Williams pwil3058@bigpond.net.au > > "Learning, n. The kind of ignorance distinguishing the studious." > -- Ambrose Bierce > > ^ permalink raw reply [flat|nested] 75+ messages in thread
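[Editorial note: the wrap arithmetic being debated above can be sketched in plain C. This is an illustration with an invented helper name, not CFS code: a free-running 32-bit microsecond counter wraps every 2^32 us (about 71.6 minutes), and the usual wrap-safe "later than" test is only valid while the true distance between two stamps is under half the range (2^31 us, about 35.8 minutes), which is the 36-minute limit Esben refers to.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only.  Wrap-safe "a is later than b" for a free-running
 * 32-bit microsecond counter: the unsigned subtraction is well-defined
 * modulo 2^32, and we test the sign bit of the distance.  Correct only
 * while the real distance is below 2^31 us (~35.8 minutes). */
static int usec_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}
```

Note how the test keeps working across the wrap point, but silently inverts once the distance exceeds half the counter range, which is why a 32-bit scheme must clamp deltas (e.g. sleep times) well below that limit.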
* Re: [patch] CFS scheduler, -v8 2007-05-08 9:05 ` Esben Nielsen @ 2007-05-09 0:01 ` Peter Williams 0 siblings, 0 replies; 75+ messages in thread From: Peter Williams @ 2007-05-09 0:01 UTC (permalink / raw) To: Esben Nielsen Cc: Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod Esben Nielsen wrote: > > > On Tue, 8 May 2007, Peter Williams wrote: > >> Esben Nielsen wrote: >>> >>> >>> On Sun, 6 May 2007, Linus Torvalds wrote: >>> >>> > > > On Sun, 6 May 2007, Ingo Molnar wrote: >>> > > > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> > > > > > So the _only_ valid way to handle timers is to >>> > > > - either not allow wrapping at all (in which case "unsigned" >>> is > > > better, >>> > > > since it is bigger) >>> > > > - or use wrapping explicitly, and use unsigned arithmetic >>> (which is >>> > > > well-defined in C) and do something like "(long)(a-b) > 0". >>> > > > > hm, there is a corner-case in CFS where a fix like this is >>> necessary. >>> > > > > CFS uses 64-bit values for almost everything, and the >>> majority of > > values >>> > > are of 'relative' nature with no danger of overflow. (They are >>> signed >>> > > because they are relative values that center around zero and can be >>> > > negative or positive.) >>> > > Well, I'd like to just worry about that for a while. >>> > > You say there is "no danger of overflow", and I mostly agree >>> that once >>> > we're talking about 64-bit values, the overflow issue simply doesn't >>> > exist, and furthermore the difference between 63 and 64 bits is >>> not > really >>> > relevant, so there's no major reason to actively avoid signed >>> entries. >>> > > So in that sense, it all sounds perfectly sane. And I'm >>> definitely not >>> > sure your "292 years after bootup" worry is really worth even > >>> considering. 
>>> > >>> >>> I would hate to tell mission control for Mankind's first mission to >>> another >>> star to reboot every 200 years because "there is no need to worry about >>> it." >>> >>> As a matter of principle an OS should never need a reboot (with >>> exception >>> for upgrading). If you say you have to reboot every 200 years, why not >>> every 100? Every 50? .... Every 45 days (you know what I am >>> referring to >>> :-) ? >> >> There's always going to be an upper limit on the representation of time. >> At least until we figure out how to implement infinity properly. > > Well you need infinite memory for that :-) > But that is my point: Why go into the problem of storing absolute time > when you can use relative time? I'd reverse that and say "Why go to the bother of using relative time when you can use absolute time?". Absolute time being time since boot, of course. > > >> >>> >>> > When we're really so well off that we expect the hardware and >>> software >>> > stack to be stable over a hundred years, I'd start to think about >>> issues >>> > like that, in the meantime, to me worrying about those kinds of >>> issues >>> > just means that you're worrying about the wrong things. >>> > > BUT. >>> > > There's a fundamental reason relative timestamps are difficult >>> and > almost >>> > always have overflow issues: the "long long in the future" case as an >>> > approximation of "infinite timeout" is almost always relevant. >>> > > So rather than worry about the system staying up 292 years, I'd >>> worry >>> > about whether people pass in big numbers (like some MAX_S64 > >>> approximation) >>> > as an approximation for "infinite", and once you have things like >>> that, >>> > the "64 bits never overflows" argument is totally bogus. >>> > > There's a damn good reason for using only *absolute* time. The >>> whole >>> > "signed values of relative time" may _sound_ good, but it really >>> sucks > in >>> > subtle and horrible ways! 
>>> > >>> >>> I think you are wrong here. The only place you need absolute time is >>> for >>> the clock (CLOCK_REALTIME). You waste CPU using a 64 bit >>> representation when you could have used a 32 bit. With a 32 bit >>> implementation you are forced to handle the corner cases with wrap >>> around >>> and too big arguments up front. With a 64 bit you hide those problems. >> >> As does the other method. A 32 bit signed offset with a 32 bit base >> is just a crude version of 64 bit absolute time. > > 64 bit is also relative - just over a much longer period. Yes, relative to boot. > 32 bit signed offset is relative - and you know it. But with 64 people > think it is absolute and put in large values as Linus said above. What people? Who gets to feed times into the scheduler? Isn't it just using the time as determined by the system? > With > 32 bit future developers will know it is relative and code for it. And > they will get their corner cases tested, because the code soon will run > into those corners. > >> >>> >>> I think CFS would be best off using a 32 bit timer counting in micro >>> seconds. That would wrap around in 72 minutes. But as the timers are >>> relative you will never be able to specify a timer larger than 36 >>> minutes >>> in the future. But 36 minutes is ridiculously long for a scheduler >>> and a >>> simple test limiting time values to that value would not break >>> anything. >> >> Except if you're measuring sleep times. I think that you'll find lots >> of tasks sleep for more than 72 minutes. > > I don't think those large values will be relevant. You can easily cut > off sleep times at 30 min or even 1 min. The aim is to make the code as simple as possible, not to add this kind of rubbish, and 1 minute would be far too low. > But you need to detect that the > task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do that > with a 32 bit timer alone.
In that case you need to use a (at least) 64 bit > timer - which is needed in the OS anyways. But not internally in the > wait queue, where the repeated calculations are. > Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-08 0:35 ` Peter Williams 2007-05-08 9:05 ` Esben Nielsen @ 2007-05-10 13:09 ` Pavel Machek 2007-05-11 16:50 ` Linus Torvalds 1 sibling, 1 reply; 75+ messages in thread From: Pavel Machek @ 2007-05-10 13:09 UTC (permalink / raw) To: Peter Williams Cc: Esben Nielsen, Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod Hi! > >>You say there is "no danger of overflow", and I mostly > >>agree that once > >>we're talking about 64-bit values, the overflow issue > >>simply doesn't > >>exist, and furthermore the difference between 63 and > >>64 bits is not really > >>relevant, so there's no major reason to actively avoid > >>signed entries. > >> > >>So in that sense, it all sounds perfectly sane. And > >>I'm definitely not > >>sure your "292 years after bootup" worry is really > >>worth even considering. > >> > > > >I would hate to tell mission control for Mankind's > >first mission to another > >star to reboot every 200 years because "there is no > >need to worry about it." > > > >As a matter of principle an OS should never need a > >reboot (with exception for upgrading). If you say you > >have to reboot every 200 years, why not every 100? > >Every 50? .... Every 45 days (you know what I am > >referring to :-) ? > > There's always going to be an upper limit on the > representation of time. At least until we figure out > how to implement infinity properly. There's also upper limit on life time of this universe. 1000 bits is certainly enough to represent that in u-seconds. Also notice that current cpus were not designed to work 300 years. When we have hw designed for 50 years+, we can start to worry. 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-10 13:09 ` Pavel Machek @ 2007-05-11 16:50 ` Linus Torvalds 2007-05-11 19:18 ` Pavel Machek 0 siblings, 1 reply; 75+ messages in thread From: Linus Torvalds @ 2007-05-11 16:50 UTC (permalink / raw) To: Pavel Machek Cc: Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod [-- Attachment #1: Type: TEXT/PLAIN, Size: 1587 bytes --] On Thu, 10 May 2007, Pavel Machek wrote: > > Also notice that current cpus were not designed to work 300 years. > When we have hw designed for 50 years+, we can start to worry. Indeed. CPU manufacturers don't seem to talk about it very much, and searching for it with google on intel.com comes up with The failure rate and Mean Time Between Failure (MTBF) data is not currently available on our website. You may contact Intel® Customer Support for this information. which seems to be just a fancy way of saying "we don't actually want to talk about it". Probably not because it's actually all that bad, but simply because people don't think about it, and there's no reason a CPU manufacturer would *want* people to think about it. But if you wondered why server CPU's usually run at a lower frequency, it's because of MTBF issues. I think a desktop CPU is usually specced to run for 5 years (and that's expecting that it's turned off or at least idle much of the time), while a server CPU is expected to last longer and be active a much bigger percentage of time. ("Active" == "heat" == "more damage due to atom migration etc". Which is part of why you're not supposed to overclock stuff: it may well work well for you, but for all you know it will cut your expected CPU life by 90%). Of course, all the other components have a MTBF too (and I'd generally expect a power supply component to fail long before the CPU does). 
And yes, some machines will last for decades. But I suspect we've all heard of machines that just don't work reliably any more. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-11 16:50 ` Linus Torvalds @ 2007-05-11 19:18 ` Pavel Machek 2007-05-11 19:37 ` Willy Tarreau 0 siblings, 1 reply; 75+ messages in thread From: Pavel Machek @ 2007-05-11 19:18 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod Hi! > > Also notice that current cpus were not designed to work 300 years. > > When we have hw designed for 50 years+, we can start to worry. > > Indeed. CPU manufacturers don't seem to talk about it very much, and > searching for it with google on intel.com comes up with > > The failure rate and Mean Time Between Failure (MTBF) data is not > currently available on our website. You may contact Intel? > Customer Support for this information. > > which seems to be just a fancy way of saying "we don't actually want to > talk about it". Probably not because it's actually all that bad, but > simply because people don't think about it, and there's no reason a CPU > manufacturer would *want* people to think about it. > > But if you wondered why server CPU's usually run at a lower frequency, > it's because of MTBF issues. I think a desktop CPU is usually specced to > run for 5 years (and that's expecting that it's turned off or at least > idle much of the time), while a server CPU is expected to last longer and > be active a much bigger percentage of time. > > ("Active" == "heat" == "more damage due to atom migration etc". Which is > part of why you're not supposed to overclock stuff: it may well work well > for you, but for all you know it will cut your expected CPU life by 90%). Actually, when I talked with AMD, they told me that cpus should last 10 years *at their max specced temperature*... which is 95Celsius. So overclocking is not that evil, according to my info. 
(That would mean way more than 10 years if you use your cpu 'normally'.) But I guess capacitors from cpu power supply will hate you running cpu at 95C... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-11 19:18 ` Pavel Machek @ 2007-05-11 19:37 ` Willy Tarreau 2007-05-11 20:53 ` Kevin Bowling 0 siblings, 1 reply; 75+ messages in thread From: Willy Tarreau @ 2007-05-11 19:37 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Fri, May 11, 2007 at 07:18:25PM +0000, Pavel Machek wrote: > Hi! > > > > Also notice that current cpus were not designed to work 300 years. > > > When we have hw designed for 50 years+, we can start to worry. > > > > Indeed. CPU manufacturers don't seem to talk about it very much, and > > searching for it with google on intel.com comes up with > > > > The failure rate and Mean Time Between Failure (MTBF) data is not > > currently available on our website. You may contact Intel? > > Customer Support for this information. > > > > which seems to be just a fancy way of saying "we don't actually want to > > talk about it". Probably not because it's actually all that bad, but > > simply because people don't think about it, and there's no reason a CPU > > manufacturer would *want* people to think about it. > > > > But if you wondered why server CPU's usually run at a lower frequency, > > it's because of MTBF issues. I think a desktop CPU is usually specced to > > run for 5 years (and that's expecting that it's turned off or at least > > idle much of the time), while a server CPU is expected to last longer and > > be active a much bigger percentage of time. > > > > ("Active" == "heat" == "more damage due to atom migration etc". Which is > > part of why you're not supposed to overclock stuff: it may well work well > > for you, but for all you know it will cut your expected CPU life by 90%). 
> > Actually, when I talked with AMD, they told me that cpus should last > 10 years *at their max specced temperature*... which is 95Celsius. So > overclocking is not that evil, according to my info. I tend to believe that. I've slowed down the FANs on my dual-athlon XP to silence them, and I found the system to be (apparently) stable till 96 deg Celsius with FANs unplugged. So I regulated them in order to maintain a temperature below 90 deg for a safety margin, and they seem to be as happy as my ears. They've been like that for the last 3-4 years, I don't remember. On a related note, older technology is less sensitive. My old VAX VLC4000 from 1991 which received a small heatsink in exchange for its noisy fans is still doing well in a closed place. I'm more worried for the EPROMs which store the boot code. Also, I think that the RS6000 processed at half-micron which run the Mars rovers might run for hundreds of years at 30 MHz. > (That would mean way more than 10 years if you use your cpu > 'normally'.) > > But I guess capacitors from cpu power supply will hate you running cpu > at 95C... even at 60, many of them die within a few years. But I think we're "slightly" off-topic now... > Pavel Cheers Willy ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-11 19:37 ` Willy Tarreau @ 2007-05-11 20:53 ` Kevin Bowling 0 siblings, 0 replies; 75+ messages in thread From: Kevin Bowling @ 2007-05-11 20:53 UTC (permalink / raw) To: Willy Tarreau Cc: Pavel Machek, Linus Torvalds, Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Gene Heskett, Mark Lord, Zach Carter, buddabrod On 5/11/07, Willy Tarreau <w@1wt.eu> wrote: > On Fri, May 11, 2007 at 07:18:25PM +0000, Pavel Machek wrote: > > Hi! > > > > > > Also notice that current cpus were not designed to work 300 years. > > > > When we have hw designed for 50 years+, we can start to worry. > > On a related note, older technology is less sensible. My old VAX VLC4000 > from 1991 which received a small heatsink in exchange for its noisy fans > is still doing well in an closed place. I'm more worried for the EPROMs > which store the boot code. Also, I think that the RS6000 processed at > half-micron which run the Mars rovers might run for hundreds of years > at 30 MHz. Not your typical RS/6000. http://en.wikipedia.org/wiki/RAD6000 Without a doubt, decently built computers will easily last at least 10 years. From what I can tell, the issue here was simply OS runtime? ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-05 17:44 ` Linus Torvalds 2007-05-06 8:29 ` Ingo Molnar @ 2007-05-07 11:09 ` Esben Nielsen 2007-05-07 16:28 ` Linus Torvalds 1 sibling, 1 reply; 75+ messages in thread From: Esben Nielsen @ 2007-05-07 11:09 UTC (permalink / raw) To: Linus Torvalds Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Sat, 5 May 2007, Linus Torvalds wrote: > > > On Sat, 5 May 2007, Esben Nielsen wrote: >> >> I have been wondering why you use unsigned for timers anyway. It is also like >> that in hrtimers. Why not use signed and avoid (almost) all worries about wrap >> around issues. The trick is that when all >> a < b >> is replaced by >> a - b < 0 >> the code will work on all two's-complement machines even if the (signed!) integers >> a and b wrap around. > > No. BOTH of the above are buggy. > > The C language definition doesn't allow signed integers to wrap (ie it's > undefined behaviour), so "a-b < 0" can be rewritten by the compiler as a > simple signed "a < b". > > And the unsigned (or signed) "a < b" is just broken wrt any kind of > wrap-around (whether wrapping around zero or the sign bit). > > So the _only_ valid way to handle timers is to > - either not allow wrapping at all (in which case "unsigned" is better, > since it is bigger) > - or use wrapping explicitly, and use unsigned arithmetic (which is > well-defined in C) and do something like "(long)(a-b) > 0". > > Notice? The signed variant is basically _never_ correct. > What is (long)(a-b) ? I have tried to look it up in the C99 standard but I can't find it. Maybe it is in the referred LIA-1 standard, which I can't find with google. I think the best would be to use "a-b > ULONG_MAX/2" when you mean "a<b" as that should be completely portable. 
According to C99 Appendix H2.2 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf) an implementation can choose to do modulo signed integers as it is mandatory for unsigned integers. If an implementation has chosen to do that, it must be a bug to do the "a-b < 0" -> "a<b" optimization. I have never experienced a compiler/architecture combination _not_ doing wrapped signed integers. Esben > Linus > ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-07 11:09 ` Esben Nielsen @ 2007-05-07 16:28 ` Linus Torvalds 2007-05-07 18:39 ` Johannes Stezenbach 2007-05-08 5:36 ` Matt Mackall 0 siblings, 2 replies; 75+ messages in thread From: Linus Torvalds @ 2007-05-07 16:28 UTC (permalink / raw) To: Esben Nielsen Cc: Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, 7 May 2007, Esben Nielsen wrote: > > What is (long)(a-b) ? I have tried to look it up in the C99 standard but I > can't find it. Maybe it is in the referred LIA-1 standard, which I can't find > with google. I don't worry about non-2's-complement machines (they don't exist, and likely won't exist in the future either). So I worry about compilers rewriting my code. So "(long)(a-b) < 0" (with "a" and "b" being unsigned long) is basically a portable way of testing the high bit of the result. > I think the best would be to use "a-b > ULONG_MAX/2" when you mean "a<b" as > that should be completely portable. That certainly works too, but the difference is irrelevant, since Linux is unlikely to work on insane machines anyway (ie we do make a lot of other assumptions about the architecture, being two's-complement is the least of those). So you basically shouldn't worry about hardware: everybody is pretty much the same. You should worry about *compilers* - that's where the differences show up. So "(long)(a-b)" may be "implementation defined" (but since implementations are all 2's complement, we don't care), but a signed "(a-b)" that over/underflows is *undefined*, and that is much worse because it means that the compiler can do some funky stuff, and _that_ is a real practical worry. And no, I also don't worry about porting Linux to 18-bit machines, or to ternary CPU's. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
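[Editorial note: the idiom Linus describes can be exercised directly. The helper name below is invented for illustration; the point is that the subtraction is done in *unsigned* arithmetic, which C defines to wrap, and only the sign bit of the result is then inspected -- assuming a two's-complement machine, as he does.]

```c
#include <assert.h>
#include <limits.h>

/* Illustrative only.  "a is before b" on a wrapping unsigned counter:
 * the unsigned subtraction a - b is well-defined modulo ULONG_MAX+1,
 * and the cast to long lets us test the sign bit of the distance. */
static int stamp_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}
```

A naive `a < b` on the same values gives the wrong answer as soon as one stamp has wrapped past zero and the other has not, which is exactly the case the cast-and-sign-test handles.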
* Re: [patch] CFS scheduler, -v8 2007-05-07 16:28 ` Linus Torvalds @ 2007-05-07 18:39 ` Johannes Stezenbach 2007-05-07 18:55 ` Linus Torvalds 2007-05-08 7:34 ` Esben Nielsen 0 siblings, 2 replies; 75+ messages in thread From: Johannes Stezenbach @ 2007-05-07 18:39 UTC (permalink / raw) To: Linus Torvalds Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, May 07, 2007, Linus Torvalds wrote: > On Mon, 7 May 2007, Esben Nielsen wrote: > > > > What is (long)(a-b) ? I have tried to look it up in the C99 standard but I > > can't find it. Maybe it is in the referred LIA-1 standard, which I can't find > > with google. C99 defines unsigned overflow semantics, but it doesn't say anything about signed overflow, thus it's undefined -- and you have a hard time finding it out. However, I have no clue *why* it's undefined and not implementation defined. Does someone know? > I don't worry about non-2's-complement machines (they don't exist, and > likely won't exist in the future either). I think DSPs can do saturating arithmetic (clamp to min/max values instead of wrap around). Not that it matters for Linux... > So I worry about compilers rewriting my code. gcc has -fwrapv and -ftrapv to change signed integer overflow behaviour. 
One baffling example where gcc rewrites code is when conditionals depend on signed integer overflow:

$ cat xx.c
#include <assert.h>

int foo(int a)
{
	assert(a + 100 > a);
	return a;
}

int bar(int a)
{
	if (a + 100 > a)
		a += 100;
	return a;
}
$ gcc -Wall -Wextra -fomit-frame-pointer -c xx.c
$ objdump -dr xx.o

xx.o:     file format elf32-i386

Disassembly of section .text:

00000000 <foo>:
   0:	8b 44 24 04	mov    0x4(%esp),%eax
   4:	c3       	ret

00000005 <bar>:
   5:	83 44 24 04 64	addl   $0x64,0x4(%esp)
   a:	8b 44 24 04	mov    0x4(%esp),%eax
   e:	c3       	ret

The assert and the condition were just dropped by gcc -- without any warning.

gcc-4.2 will add -fstrict-overflow and -Wstrict-overflow.
http://gcc.gnu.org/gcc-4.2/changes.html

Johannes
^ permalink raw reply [flat|nested] 75+ messages in thread
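[Editorial note: an illustrative counterpart, not from the original mail. The unsigned version of bar() behaves differently, because unsigned overflow *is* defined in C, so the compiler must preserve the check instead of folding it to true. Function name invented for the sketch.]

```c
#include <assert.h>
#include <limits.h>

/* Unlike the signed bar() above, this check cannot be optimized away:
 * a + 100u wraps modulo UINT_MAX+1 by definition, so the condition is
 * genuinely false near the top of the range. */
static unsigned int bar_u(unsigned int a)
{
    if (a + 100u > a)   /* false exactly when a + 100 wraps */
        a += 100u;
    return a;
}
```

So `bar_u(UINT_MAX)` returns UINT_MAX unchanged, while the signed bar() above was compiled to an unconditional add.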
* Re: [patch] CFS scheduler, -v8 2007-05-07 18:39 ` Johannes Stezenbach @ 2007-05-07 18:55 ` Linus Torvalds 0 siblings, 0 replies; 75+ messages in thread From: Linus Torvalds @ 2007-05-07 18:55 UTC (permalink / raw) To: Johannes Stezenbach Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, 7 May 2007, Johannes Stezenbach wrote: > > One baffling example where gcc rewrites code is when > conditionals depend on signed integer overflow: Yes. This is one of my favourite beefs with gcc. Some of the optimization decisions seem to make no sense. Your example is a good one, but my private beef has been in alias handling. Alias analysis is an important part of optimization, and there's two kinds: the static (and exact, aka "safe") kind that you can do regardless of any language definitions, because you *know* that you aren't actually changing behaviour, and the additional type-based heuristics that the C language allows. So which ones would you expect a compiler to consider more important? And which one do you think gcc will use? Right. You can have static analysis that *guarantees* that two objects alias, but if gcc determines that they have different types and thus might not alias, it decides to use the heuristic instead of the firm knowledge, and generate code that doesn't work. "Because the language definition allows it". Oh well. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-07 18:39 ` Johannes Stezenbach 2007-05-07 18:55 ` Linus Torvalds @ 2007-05-08 7:34 ` Esben Nielsen 2007-05-08 9:54 ` Johannes Stezenbach 1 sibling, 1 reply; 75+ messages in thread From: Esben Nielsen @ 2007-05-08 7:34 UTC (permalink / raw) To: Johannes Stezenbach Cc: Linus Torvalds, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, 7 May 2007, Johannes Stezenbach wrote: > On Mon, May 07, 2007, Linus Torvalds wrote: >> On Mon, 7 May 2007, Esben Nielsen wrote: >>> >>> What is (long)(a-b) ? I have tried to look it up in the C99 standard but I >>> can't find it. Maybe it is in the referred LIA-1 standard, which I can't find >>> with google. > > C99 defines unsigned overflow semantics, but it doesn't say anything > about signed overflow, thus it's undefined -- and you have a hard > time finding it out. > > However, I have no clue *why* it's undefined and not > implementation defined. Does someone know? > >> I don't worry about non-2's-complement machines (they don't exist, and >> likely won't exist in the future either). > > I think DSPs can do saturating arithmetic (clamp to min/max > values instead of wrap around). Not that it matters for Linux... > >> So I worry about compilers rewriting my code. > > gcc has -fwrapv and -ftrapv to change signed integer overflow > behaviour. 
> One baffling example where gcc rewrites code is when
> conditionals depend on signed integer overflow:
>
> $ cat xx.c
> #include <assert.h>
>
> int foo(int a)
> {
> 	assert(a + 100 > a);
> 	return a;
> }
>
> int bar(int a)
> {
> 	if (a + 100 > a)
> 		a += 100;
> 	return a;
> }
> $ gcc -Wall -Wextra -fomit-frame-pointer -c xx.c
> $ objdump -dr xx.o
>
> xx.o:     file format elf32-i386
>
> Disassembly of section .text:
>
> 00000000 <foo>:
>    0:	8b 44 24 04	mov    0x4(%esp),%eax
>    4:	c3       	ret
>
> 00000005 <bar>:
>    5:	83 44 24 04 64	addl   $0x64,0x4(%esp)
>    a:	8b 44 24 04	mov    0x4(%esp),%eax
>    e:	c3       	ret
>
> The assert and the condition were just dropped
> by gcc -- without any warning.
>
> gcc-4.2 will add -fstrict-overflow and -Wstrict-overflow.
> http://gcc.gnu.org/gcc-4.2/changes.html
>
> Johannes

This is contrary to C99 standard annex H2.2 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):

"An implementation that defines signed integer types as also being modulo need not detect integer overflow, in which case, only integer divide-by-zero need be detected."

So if it doesn't properly define wrapping it has to detect integer overflow, right? gcc does neither with that optimization :-(

Esben
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-08  7:34             ` Esben Nielsen
@ 2007-05-08  9:54               ` Johannes Stezenbach
  2007-05-08 10:27                 ` Esben Nielsen
  0 siblings, 1 reply; 75+ messages in thread
From: Johannes Stezenbach @ 2007-05-08  9:54 UTC (permalink / raw)
To: Esben Nielsen
Cc: Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel,
    Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
    Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
    Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Tue, May 08, 2007, Esben Nielsen wrote:
>
> This is contrary to C99 standard annex H2.2
> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):
>
> "An implementation that defines signed integer types as also being
> modulo need not detect integer overflow, in which case, only integer
> divide-by-zero need be detected."
>
> So if it doesn't properly define wrapping it has to detect integer
> overflow, right?

No. Annex H (informative!) only talks about LIA-1 conformance.

C99 isn't LIA-1 conformant. H2.2 describes what an implementation
might do to make signed integers LIA-1 compatible, which is what
gcc does with -fwrapv or -ftrapv.

At least that's how I understand it; the C99 standard seems to have
been written with the "it was hard to write, so it should be hard to
read" mindset. :-/

I still don't know _why_ signed integer overflow behaviour isn't
defined in C. It just goes against everyone's expectation and thus
causes bugs.

Johannes

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-08  9:54               ` Johannes Stezenbach
@ 2007-05-08 10:27                 ` Esben Nielsen
  0 siblings, 0 replies; 75+ messages in thread
From: Esben Nielsen @ 2007-05-08 10:27 UTC (permalink / raw)
To: Johannes Stezenbach
Cc: Esben Nielsen, Linus Torvalds, Ingo Molnar, Balbir Singh,
    linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin,
    Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner,
    caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter,
    buddabrod

On Tue, 8 May 2007, Johannes Stezenbach wrote:

> On Tue, May 08, 2007, Esben Nielsen wrote:
>>
>> This is contrary to C99 standard annex H2.2
>> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):
>>
>> "An implementation that defines signed integer types as also being
>> modulo need not detect integer overflow, in which case, only integer
>> divide-by-zero need be detected."
>>
>> So if it doesn't properly define wrapping it has to detect integer
>> overflow, right?
>
> No. Annex H (informative!) only talks about LIA-1 conformance.
>
> C99 isn't LIA-1 conformant. H2.2 describes what an implementation
> might do to make signed integers LIA-1 compatible.

"The signed C integer types int, long int, long long int, and the
corresponding unsigned types are compatible with LIA-1."

I read this as: any C99 implementation must be compatible. I would
like to see LIA-1 to check.

> , which is what gcc does with -fwrapv or -ftrapv.

Yes, either/or: either wrap or trap.

> At least that's how I understand it; the C99 standard seems to have
> been written with the "it was hard to write, so it should be hard to
> read" mindset. :-/
>
> I still don't know _why_ signed integer overflow behaviour isn't
> defined in C. It just goes against everyone's expectation and thus
> causes bugs.

Because it is hard to make wrapping work on non-two's-complement
architectures. There it is easier to trap.

Esben

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 16:28           ` Linus Torvalds
  2007-05-07 18:39             ` Johannes Stezenbach
@ 2007-05-08  5:36             ` Matt Mackall
  1 sibling, 0 replies; 75+ messages in thread
From: Matt Mackall @ 2007-05-08  5:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel,
    Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
    Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
    Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Mon, May 07, 2007 at 09:28:32AM -0700, Linus Torvalds wrote:
>
> On Mon, 7 May 2007, Esben Nielsen wrote:
>>
>> What is (long)(a-b)? I have tried to look it up in the C99 standard
>> but I can't find it. Maybe it is in the referred LIA-1 standard,
>> which I can't find with google.
>
> I don't worry about non-2's-complement machines (they don't exist, and
> likely won't exist in the future either).

They do exist. And they run Linux. SLES9 in fact.

http://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#UNISYS_ClearPath_IX_series
http://www.unisys.com/eprise/main/admin/corporate/doc/ClearPath_Plus_Dorado_Model_390_Server_Specification_Sheet.pdf

That machine is a direct descendant of the Univac 1101 from 1950 and
is still software-compatible with 1107s from 1962. (Granted, they only
run Linux on the x86 side.)

--
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-01 21:22 Ingo Molnar
                   ` (3 preceding siblings ...)
  2007-05-02  9:08 ` Balbir Singh
@ 2007-05-02 12:58 ` Mark Lord
  2007-05-02 12:58 ` Vegard Nossum
  5 siblings, 0 replies; 75+ messages in thread
From: Mark Lord @ 2007-05-02 12:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
    Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
    Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Zach Carter,
    buddabrod, Balbir Singh

Ingo Molnar wrote:
> i'm pleased to announce release -v8 of the CFS scheduler patchset.
> (The main goal of CFS is to implement "desktop scheduling" with as
> high quality as technically possible.)
>
> The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be
> downloaded from the usual place:
>
>     http://people.redhat.com/mingo/cfs-scheduler/

Excellent. I've switched machines here, so my "old" single-core 2GHz
notebook is no longer being used as much. The new machine is a more or
less identical notebook, but with a 2.1GHz Core2Duo CPU.

And.. man, the mainline scheduler really sucks eggs on the dual-core!
On single core it was nearly always "nice" enough for me, but wow..
what a degradation with the extra CPU.

So, CFS is now a permanent add-on for kernels I use here on either
machine, and I also can say that it works
GGGGGGGRRRRRRRRRRREEEEEEEEEAAAAAAAAAATTTTTTTT!!!

(please excuse the North American excuse for a "cultural" reference). ;)

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-01 21:22 Ingo Molnar
                   ` (4 preceding siblings ...)
  2007-05-02 12:58 ` Mark Lord
@ 2007-05-02 12:58 ` Vegard Nossum
  2007-05-02 16:41   ` Ingo Molnar
  5 siblings, 1 reply; 75+ messages in thread
From: Vegard Nossum @ 2007-05-02 12:58 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

On Tue, May 1, 2007 11:22 pm, Ingo Molnar wrote:
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,

Hi,

The sys_sched_yield_to() syscall is not callable from userspace on
i386 because it is not part of the syscall table
(arch/i386/kernel/syscall_table.S). This causes sysenter_entry
(arch/i386/kernel/entry.S) to use the wrong count for nr_syscalls (320
instead of 321) and return with -ENOSYS.

Vegard

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 12:58 ` Vegard Nossum
@ 2007-05-02 16:41   ` Ingo Molnar
  0 siblings, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-02 16:41 UTC (permalink / raw)
To: Vegard Nossum; +Cc: linux-kernel

* Vegard Nossum <vegard@peltkore.net> wrote:

> The sys_sched_yield_to() is not callable from userspace on i386
> because it is not part of the syscall table
> (arch/i386/kernel/syscall_table.S). This causes sysenter_entry
> (arch/i386/kernel/entry.S) to use the wrong count for nr_syscalls
> (320 instead of 321) and return with -ENOSYS.

oops, indeed - the patch below should fix this. (x86 should really
adopt the nice x86_64 technique of building the syscall table out of
the unistd.h enumeration definitions.)

	Ingo

Index: linux/arch/i386/kernel/syscall_table.S
===================================================================
--- linux.orig/arch/i386/kernel/syscall_table.S
+++ linux/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_sched_yield_to	/* 320 */

^ permalink raw reply	[flat|nested] 75+ messages in thread