* Re: [patch] CFS scheduler, -v8
@ 2007-05-03 8:20 Zoltan Boszormenyi
2007-05-03 13:02 ` Ingo Molnar
0 siblings, 1 reply; 75+ messages in thread
From: Zoltan Boszormenyi @ 2007-05-03 8:20 UTC (permalink / raw)
To: linux-kernel
Hi!
> *** Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
> > The problem is with comparing s64 values with (s64)ULONG_MAX, which
> > evaluates to -1. Then we check if exec_delta64 and fair_delta64 are
> > greater than (s64)ULONG_MAX (-1), if so we assign (s64)ULONG_MAX to
> > the respective values.
>
> ah, indeed ...
>
> > The fix is to compare these values against (s64)LONG_MAX and assign
> > (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
> > than (s64)LONG_MAX.
> >
> > Tested on PowerPC, the regression is gone, tasks are load balanced as
> > they were in v7.
>
> thanks, applied!
>
> Ingo
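
The cast bug described above is easy to reproduce outside the kernel: on a 64-bit machine, (s64)ULONG_MAX is -1, so a signed "greater than" test against it clamps every positive delta to -1 instead of acting as an upper bound. A minimal sketch in Python (emulating the C casts with two's-complement arithmetic; only the comparison logic from the discussion is modeled, not the surrounding kernel code):

```python
ULONG_MAX = 2**64 - 1          # unsigned long on a 64-bit kernel
LONG_MAX = 2**63 - 1

def s64(v):
    """Emulate a C cast to signed 64-bit (s64) via two's complement."""
    v &= (1 << 64) - 1
    return v - (1 << 64) if v >= (1 << 63) else v

# The heart of the bug: (s64)ULONG_MAX is -1.
assert s64(ULONG_MAX) == -1

def clamp_broken(exec_delta64):
    # Comparing against (s64)ULONG_MAX, i.e. -1: every positive delta
    # "exceeds" the limit and collapses to -1.
    if exec_delta64 > s64(ULONG_MAX):
        exec_delta64 = s64(ULONG_MAX)
    return exec_delta64

def clamp_fixed(exec_delta64):
    # The fix: clamp against (s64)LONG_MAX instead.
    if exec_delta64 > s64(LONG_MAX):
        exec_delta64 = s64(LONG_MAX)
    return exec_delta64

assert clamp_broken(1000) == -1        # bug: a sane delta becomes -1
assert clamp_fixed(1000) == 1000       # small deltas pass through
assert clamp_fixed(2**63) == LONG_MAX  # huge deltas clamp as intended
```

The same reasoning applies to fair_delta64 in the patch under discussion.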
I started up 12 glxgears to see the effect of CFS v8
on my Athlon64 X2 4200.
Without this patch all the GL load was handled by the second core;
the system only stressed the first core if other processes were also
started, e.g. a kernel compilation. With this patch the load is evenly
balanced across the two cores all the time. And while doing make -j4
on the kernel, the 12 gears are still spinning at 185+ FPS and
there are only slightly visible hiccups. Switching between workspaces,
i.e. refreshing the large windows of Thunderbird and Firefox, is done
very quickly, and the whole system is exceptionally responsive.
Thanks for this great work!
Best regards,
Zoltán Böszörményi
^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [patch] CFS scheduler, -v8
  2007-05-03  8:20 [patch] CFS scheduler, -v8 Zoltan Boszormenyi
@ 2007-05-03 13:02 ` Ingo Molnar
  2007-05-03 13:29   ` Damien Wyart
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2007-05-03 13:02 UTC (permalink / raw)
To: Zoltan Boszormenyi; +Cc: linux-kernel

* Zoltan Boszormenyi <zboszor@dunaweb.hu> wrote:

> I started up 12 glxgears to see the effect of CFS v8 on my Athlon64 X2
> 4200.
>
> Without this patch all the GL load was handled by the second core, the
> system only stressed the first core if other processes were also
> started, i.e. a kernel compilation. With this patch the load is evenly
> balanced across the two cores all the time.

ok, i didnt realize that it would affect x86_64 too. I'll do a -v9
release with this fix included.

> [...] And while doing make -j4 on the kernel, the 12 gears are still
> spinning about 185+ FPS and there are only slightly visible hiccups.
> Switching between workspaces, i.e. refreshing the large windows of
> Thunderbird and Firefox are done very quickly, the whole system is
> exceptionally responsive.

great! So it seems -v8 does improve OpenGL handling too :-)

> Thanks for this great work!

you are welcome :)

	Ingo
* Re: [patch] CFS scheduler, -v8
  2007-05-03 13:02 ` Ingo Molnar
@ 2007-05-03 13:29   ` Damien Wyart
  2007-05-03 14:53     ` Srivatsa Vaddagiri
  2007-05-07  0:04     ` Bill Davidsen
  0 siblings, 2 replies; 75+ messages in thread
From: Damien Wyart @ 2007-05-03 13:29 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

Hello,

* Ingo Molnar <mingo@elte.hu> [2007-05-03 15:02]:
> great! So it seems -v8 does improve OpenGL handling too :-)

What are your thoughts/plans concerning merging CFS into mainline ? Is
it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
change the default scheduler rapidly (the discussion was about sd at
that time) to get more testing than in -mm.

-- 
Damien Wyart
* Re: [patch] CFS scheduler, -v8
  2007-05-03 13:29 ` Damien Wyart
@ 2007-05-03 14:53   ` Srivatsa Vaddagiri
  2007-05-03 15:53     ` William Lee Irwin III
  2007-05-07  0:04   ` Bill Davidsen
  1 sibling, 1 reply; 75+ messages in thread
From: Srivatsa Vaddagiri @ 2007-05-03 14:53 UTC (permalink / raw)
To: Damien Wyart; +Cc: Ingo Molnar, linux-kernel, torvalds

On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
> What are your thoughts/plans concerning merging CFS into mainline ? Is
> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> change the default scheduler rapidly (the discussion was about sd at
> that time) to get more testing than in -mm.

And what about group scheduling extensions? Do you have plans to work on
it? I was beginning to work on a prototype to do group scheduling based
on CFS, basically on the lines of what you and Linus had outlined
earlier:

	http://lkml.org/lkml/2007/4/18/271
	http://lkml.org/lkml/2007/4/18/244

-- 
Regards,
vatsa
* Re: [patch] CFS scheduler, -v8
  2007-05-03 14:53 ` Srivatsa Vaddagiri
@ 2007-05-03 15:53   ` William Lee Irwin III
  2007-05-03 18:44     ` Li, Tong N
  2007-05-07 14:22     ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 75+ messages in thread
From: William Lee Irwin III @ 2007-05-03 15:53 UTC (permalink / raw)
To: Srivatsa Vaddagiri
Cc: Damien Wyart, Ingo Molnar, linux-kernel, torvalds, Tong N. Li

On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
>> What are your thoughts/plans concerning merging CFS into mainline ? Is
>> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
>> change the default scheduler rapidly (the discussion was about sd at
>> that time) to get more testing than in -mm.

On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> And what about group scheduling extensions? Do you have plans to work on
> it? I was begining to work on a prototype to do group scheduling based
> on CFS, basically on the lines of what you and Linus had outlined
> earlier:
> http://lkml.org/lkml/2007/4/18/271
> http://lkml.org/lkml/2007/4/18/244

Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
have the mindshare cfs seems to have acquired.

The hyperlink seems to have broken, though:
http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

-- wli
* Re: [patch] CFS scheduler, -v8
  2007-05-03 15:53 ` William Lee Irwin III
@ 2007-05-03 18:44   ` Li, Tong N
  2007-05-03 19:52     ` William Lee Irwin III
  0 siblings, 1 reply; 75+ messages in thread
From: Li, Tong N @ 2007-05-03 18:44 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Srivatsa Vaddagiri, Damien Wyart, Ingo Molnar, linux-kernel,
	torvalds

On Thu, 2007-05-03 at 08:53 -0700, William Lee Irwin III wrote:
> On Thu, May 03, 2007 at 03:29:32PM +0200, Damien Wyart wrote:
> >> What are your thoughts/plans concerning merging CFS into mainline ? Is
> >> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> >> change the default scheduler rapidly (the discussion was about sd at
> >> that time) to get more testing than in -mm.
>
> On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> > And what about group scheduling extensions? Do you have plans to work on
> > it? I was begining to work on a prototype to do group scheduling based
> > on CFS, basically on the lines of what you and Linus had outlined
> > earlier:
> > http://lkml.org/lkml/2007/4/18/271
> > http://lkml.org/lkml/2007/4/18/244
>
> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
> have the mindshare cfs seems to have acquired.
>
> The hyperlink seems to have broken, though:
> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

Yeah, I just fixed the link. The code was written based on the 2.6.19.2
scheduler. I could forward-port it to the latest kernel if there's
interest and I can find time. :) Here's a description of the design:

http://lkml.org/lkml/2007/4/20/286

Thanks,

  tong
* Re: [patch] CFS scheduler, -v8
  2007-05-03 18:44 ` Li, Tong N
@ 2007-05-03 19:52   ` William Lee Irwin III
  0 siblings, 0 replies; 75+ messages in thread
From: William Lee Irwin III @ 2007-05-03 19:52 UTC (permalink / raw)
To: Li, Tong N
Cc: Srivatsa Vaddagiri, Damien Wyart, Ingo Molnar, linux-kernel,
	torvalds

On Thu, 2007-05-03 at 08:53 -0700, William Lee Irwin III wrote:
>> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
>> have the mindshare cfs seems to have acquired.
>> The hyperlink seems to have broken, though:
>> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

On Thu, May 03, 2007 at 11:44:27AM -0700, Li, Tong N wrote:
> Yeah, I just fixed the link. The code was written based on the 2.6.19.2
> scheduler. I could forward-port it to the latest kernel if there's
> interest and I can find time. :) Here's a description of the design:
> http://lkml.org/lkml/2007/4/20/286

I have the interest. I don't see performance issues with any of the
schedulers, but am rather interested in the feature.

-- wli
* Re: [patch] CFS scheduler, -v8
  2007-05-03 15:53 ` William Lee Irwin III
  2007-05-03 18:44   ` Li, Tong N
@ 2007-05-07 14:22   ` Srivatsa Vaddagiri
  2007-05-07 20:54     ` Li, Tong N
  1 sibling, 1 reply; 75+ messages in thread
From: Srivatsa Vaddagiri @ 2007-05-07 14:22 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Damien Wyart, Ingo Molnar, linux-kernel, torvalds, Tong N. Li

On Thu, May 03, 2007 at 08:53:47AM -0700, William Lee Irwin III wrote:
> On Thu, May 03, 2007 at 08:23:18PM +0530, Srivatsa Vaddagiri wrote:
> > And what about group scheduling extensions? Do you have plans to work on
> > it? I was begining to work on a prototype to do group scheduling based
> > on CFS, basically on the lines of what you and Linus had outlined
> > earlier:
> > http://lkml.org/lkml/2007/4/18/271
> > http://lkml.org/lkml/2007/4/18/244
>
> Tong Li's Trio scheduler does a bit of this, though it doesn't seem to
> have the mindshare cfs seems to have acquired.
>
> The hyperlink seems to have broken, though:
> http://www.cs.duke.edu/~tongli/linux/linux-2.6.19.2-trio.patch

The big question I have is: how well does DWRR fit into "currently hot"
scheduling frameworks like CFS? For example, if the goal is to do fair
(group) scheduling of SCHED_NORMAL tasks, can CFS and DWRR co-exist?
Both seem to be radically different algorithms, and my initial
impression of them co-existing is "no", but I would be glad to be
corrected if I am wrong. If they can't co-exist, then we need a
different way of doing group scheduling on top of CFS, as that is
gaining more popularity on account of better handling of interactivity.

Tong,
	I understand a central hallmark of DWRR is SMP fairness. Have
you considered how good or bad the other alternative to achieve SMP
fairness which is in vogue today is: pressure/weight-based balancing
(ex: smpnice and the CKRM CPU scheduler -
ckrm.sourceforge.net/downloads/ckrm-ols03-slides.pdf)?

-- 
Regards,
vatsa
* Re: [patch] CFS scheduler, -v8
  2007-05-07 14:22 ` Srivatsa Vaddagiri
@ 2007-05-07 20:54   ` Li, Tong N
  0 siblings, 0 replies; 75+ messages in thread
From: Li, Tong N @ 2007-05-07 20:54 UTC (permalink / raw)
To: vatsa
Cc: William Lee Irwin III, Damien Wyart, Ingo Molnar, linux-kernel,
	torvalds

On Mon, 2007-05-07 at 19:52 +0530, Srivatsa Vaddagiri wrote:
> The big question I have is: how well does DWRR fit into "currently hot"
> scheduling frameworks like CFS? For example, if the goal is to do fair
> (group) scheduling of SCHED_NORMAL tasks, can CFS and DWRR co-exist?
> Both seem to be radically different algorithms, and my initial
> impression of them co-existing is "no", but I would be glad to be
> corrected if I am wrong. If they can't co-exist, then we need a
> different way of doing group scheduling on top of CFS, as that is
> gaining more popularity on account of better handling of interactivity.

Yeah, the intent of DWRR was to provide proportional fairness and rely
on the underlying scheduler to support interactivity. In a way, DWRR
ensures that each task receives its fair share, while the underlying
scheduler determines the right order in which to run the tasks. Since
SD is structurally similar to the stock scheduler, DWRR should co-exist
with it easily. Co-existing with CFS requires more work, but I think
the round-robin mechanism in DWRR could be applicable to CFS to
facilitate cross-processor fairness.

> Tong,
> 	I understand a central hallmark of DWRR is SMP fairness. Have
> you considered how good or bad the other alternative to achieve SMP
> fairness which is in vogue today is: pressure/weight-based balancing
> (ex: smpnice and the CKRM CPU scheduler -
> ckrm.sourceforge.net/downloads/ckrm-ols03-slides.pdf)?

The disadvantage of DWRR is its potential overhead, and the advantage
is that it can provide stronger fairness. If we have 2 processors and 3
tasks, DWRR ensures that each task gets 66% of the total CPU time,
while smpnice would keep two tasks on the same processor and the third
one on another. I did an analysis and showed that the lag bound of DWRR
is constant if task weights are bounded by a constant. On the other
hand, the cost DWRR pays is that it requires more task migrations. I
tested with a set of benchmarks on an SMP system and didn't see
migrations causing much performance impact, but this is certainly a big
issue for NUMA.

  tong

PS. I'm now porting the code to the latest kernel and will post it as
soon as I'm done.
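
The 2-processor/3-task figure above can be checked with a bit of arithmetic. This sketch is purely illustrative (it computes ideal shares, not actual scheduler behaviour): with fair cross-CPU rotation each task's share is nr_cpus/nr_tasks, while a static two-plus-one placement pins the shares at 50%/50%/100%.

```python
from fractions import Fraction

nr_cpus, nr_tasks = 2, 3

# Ideal proportional share (what DWRR's round-robin rotation aims for):
# every task gets nr_cpus/nr_tasks of one CPU's worth of time.
ideal = Fraction(nr_cpus, nr_tasks)
assert ideal == Fraction(2, 3)          # ~66% per task, as in the mail

# Static placement: two tasks sharing CPU0, one task alone on CPU1.
static = [Fraction(1, 2), Fraction(1, 2), Fraction(1, 1)]
assert sum(static) == nr_cpus           # all CPU time is used either way

# The per-task unfairness (lag) of the static split: the co-scheduled
# tasks each fall 1/6 short, the solo task runs 1/3 ahead.
lags = [share - ideal for share in static]
assert lags == [Fraction(-1, 6), Fraction(-1, 6), Fraction(1, 3)]
```

Note that both placements use all available CPU time; the difference is only in how it is divided among tasks of equal weight.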
* Re: [patch] CFS scheduler, -v8
  2007-05-03 13:29 ` Damien Wyart
  2007-05-03 14:53   ` Srivatsa Vaddagiri
@ 2007-05-07  0:04   ` Bill Davidsen
  1 sibling, 0 replies; 75+ messages in thread
From: Bill Davidsen @ 2007-05-07 0:04 UTC (permalink / raw)
To: Damien Wyart
Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Con Kolivas,
	William Lee Irwin III

Damien Wyart wrote:
> Hello,
>
> * Ingo Molnar <mingo@elte.hu> [2007-05-03 15:02]:
>> great! So it seems -v8 does improve OpenGL handling too :-)
>
> What are your thoughts/plans concerning merging CFS into mainline ? Is
> it a bit premature to get it into 2.6.22 ? I remember Linus was ok to
> change the default scheduler rapidly (the discussion was about sd at
> that time) to get more testing than in -mm.

It would seem that a number of us are testing the variants and
reporting results (I now have to pull all the new versions and redo
stuff, but I have a new test I'm going to run as well). Given that and
the fact that there is another idea which can be tested from Tong Li, I
think that decision can be postponed, Linus willing. There appear to be
enough testers to be driving evolution now, authors may disagree, of
course.

Several folks added to cc list...

-- 
Bill Davidsen <davidsen@tmr.com>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
* [patch] CFS scheduler, -v8
@ 2007-05-01 21:22 Ingo Molnar
2007-05-02 2:57 ` Ting Yang
` (5 more replies)
0 siblings, 6 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-01 21:22 UTC (permalink / raw)
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner,
caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter,
buddabrod, Balbir Singh
i'm pleased to announce release -v8 of the CFS scheduler patchset. (The
main goal of CFS is to implement "desktop scheduling" with as high
quality as technically possible.)
The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be
downloaded from the usual place:
http://people.redhat.com/mingo/cfs-scheduler/
-v7 resolved a couple of important regressions while not introducing new
regressions, so i felt it was time to step forward: -v8 tries to address
one of the last (known) frontiers: 3D/OpenGL games^H^H^H applications
'smoothness'.
To achieve more scheduling smoothness, -v8 introduces a new 'precise
load calculation and smoothing' feature. (A variant of this was
suggested by Peter Williams in earlier CFS discussions - thanks Peter!)
i was able to reuse the rq->cpu_load[] load average calculation from
Peter Williams's smpnice code, and it's thus now utilized on UP too. As
a nice side-effect of CFS using a smoothed load metric, apps should also
start up faster under load. CFS now utilizes the full range of smpnice
metrics.
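
The announcement does not include the smoothing formula itself, but the general shape of an rq->cpu_load[]-style smoothed load metric is an exponential moving average of the instantaneous runqueue load. A hedged sketch of that idea (the decay factor and update rule here are illustrative only, not the actual -v8 code):

```python
def smooth_load(samples, decay=8):
    """Exponentially smooth a series of instantaneous load samples.

    Each tick the smoothed value moves 1/decay of the way toward the
    new sample, so short spikes are damped while sustained load shows
    through fully.
    """
    smoothed = 0.0
    history = []
    for load in samples:
        smoothed += (load - smoothed) / decay
        history.append(smoothed)
    return history

# A one-tick spike barely moves the metric; sustained load converges.
spike = smooth_load([0, 0, 1024, 0, 0])
sustained = smooth_load([1024] * 40)
assert spike[2] == 128.0                # 1024/8: the spike is damped 8x
assert sustained[-1] > 1000             # sustained load is fully reflected
```

The point of such smoothing for interactivity is visible in the two cases: a briefly runnable task (an app starting up under load) barely perturbs the load metric, while a sustained CPU hog is weighted at its true load.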
No other fundamental portion of CFS was touched, so the rate of change
is moderate:
7 files changed, 140 insertions(+), 79 deletions(-)
Changes since -v7:
- powerpc debug output and build warning fixes (Balbir Singh)
- documentation fixes (Zach Carter)
- interactivity: precise load calculation and load smoothing
As usual, any sort of feedback, bugreport, fix and suggestion is more
than welcome,
Ingo
* Re: [patch] CFS scheduler, -v8
  2007-05-01 21:22 [patch] CFS scheduler, -v8 Ingo Molnar
@ 2007-05-02  2:57 ` Ting Yang
  2007-05-02  5:10   ` Willy Tarreau
  ` (6 more replies)
  2007-05-02  6:37 ` Mike Galbraith
  ` (4 subsequent siblings)
  5 siblings, 7 replies; 75+ messages in thread
From: Ting Yang @ 2007-05-02 2:57 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

Hi, Ingo

My name is Ting Yang, a graduate student from UMASS. I am currently
studying the Linux scheduler and virtual memory manager to solve some
page swapping problems. I am very excited about the new scheduler CFS.
After I read through your code, I think that you might be interested in
reading this paper:

  "A Proportional Share Resource Allocation Algorithm for Real-Time,
  Time-Shared Systems", by Ion Stoica. You can find the paper here:
  http://citeseer.ist.psu.edu/37752.html

The authors of this paper proposed a scheduler: Earliest Eligible
Virtual Deadline First (EEVDF). EEVDF uses exactly the same method as
CFS to track the execution of each running task. The only difference
between EEVDF and CFS is that EEVDF tries to be _deadline_ fair while
CFS is _start-time_ fair. Scheduling based on deadlines gives a better
response time bound and seems to be more fair.

In the following part of this email, I will try to explain the
similarities and differences between EEVDF and CFS. Hopefully, this
might provide you with some useful information w.r.t. your current
work on CFS.

Similarities:

1. Both EEVDF and CFS use virtual time to track the system's progress.

   CFS maintains rq->fair_clock for each cpu, which is advanced every
   tick by:

     SCALE/sum(p_i->loadweight)

   where p_i->loadweight is the weight of each task, mapped from its
   nice value via prio_to_load_shift[], given that the default weight
   is SCALE (1024).

   EEVDF maintains a virtual time, which is advanced every tick by:

     1/sum(w_i)

   where w_i is the weight of each task, given that the default weight
   is 1.
   Therefore, EEVDF and CFS monitor system progress in the same way,
   just normalized to different scales.

2. EEVDF and CFS track each task's progress in the same way.

   CFS maintains a p->fair_key which indicates the amount of work done
   by this task. When p executes for a period t, p->fair_key is
   incremented by:

     t * SCALE/p->loadweight   // the default weight is SCALE
                               // (based on solving equations in your code :-))

   EEVDF does the same thing under the assumption that the default
   weight is 1; it uses the same equation to calculate deadlines for
   running tasks.

Differences:

The main difference between CFS and EEVDF lies in the scheduling
policy, although they follow the same model for tracking tasks.

*** CFS: when a task starts, it gets p->fair_key from the current
virtual time rq->fair_clock. It will not be scheduled for execution
until all other running tasks' fair_keys go beyond p->fair_key by a
certain number of virtual ticks (which is 5 ticks for the current
setting of CFS).

(I wish I could show examples with graphs, instead of my not-so-good
English, but I am not sure if it is appropriate to attach figures on
this mailing list.)

EXAMPLE: assume the system runs at 1000 ticks/second, i.e. 1ms a tick,
and the granularity of preemption for CFS is 5 virtual ticks (the
current setting). If, at time t=0, we start 2 tasks: p1 and p2, both
with nice value 0 (weight 1024), and rq->fair_clock is initialized to
0, then we have:

  p1->fair_key = p2->fair_key = rq->fair_clock = 0.

CFS breaks the tie arbitrarily, say it executes p1. After 1 system
tick (1ms later) t=1, we have:

  rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0.

Suppose a new task p3 starts with nice value -10 at this moment, that
is, p3->fair_key = 1/2. In this case, CFS will not schedule p3 for
execution until the fair_keys of p1 and p2 go beyond 5 + 1/2 (which
translates to about 10ms later in this setting), _regardless of_ the
priority (weight) of p3.
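
The bookkeeping in the example above can be reproduced with a few lines of arithmetic. This is a simplified model of the description in this mail (only SCALE, fair_clock and fair_key; all other CFS details omitted, and the weight chosen for nice -10 is illustrative), not the actual kernel code:

```python
from fractions import Fraction

SCALE = 1024

class Task:
    def __init__(self, name, weight, fair_key):
        self.name, self.weight, self.fair_key = name, weight, fair_key

def tick(fair_clock, tasks, running, t=1):
    """Advance the virtual clock and charge the running task for t ticks."""
    fair_clock += t * Fraction(SCALE, sum(p.weight for p in tasks))
    running.fair_key += t * Fraction(SCALE, running.weight)
    return fair_clock

# t=0: two nice-0 tasks (default weight SCALE), fair_clock starts at 0.
fair_clock = Fraction(0)
p1 = Task("p1", SCALE, Fraction(0))
p2 = Task("p2", SCALE, Fraction(0))

# CFS breaks the tie arbitrarily and runs p1 for one 1ms tick.
fair_clock = tick(fair_clock, [p1, p2], running=p1)
assert fair_clock == Fraction(1, 2)
assert (p1.fair_key, p2.fair_key) == (1, 0)

# t=1: p3 (nice -10, here given an arbitrary higher weight) starts and
# inherits fair_clock as its key -- its weight plays no role in when it
# first gets the CPU, which is exactly the problem described above.
p3 = Task("p3", 2 * SCALE, fair_key=fair_clock)
assert p3.fair_key == Fraction(1, 2)
```
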
Further imagine that we start n tasks at time t=0 and then start a
task p_(n+1) at time t=1: the delay of task p_(n+1) is actually
proportional to the number of running tasks n. Formally speaking, CFS
can have an *O(n)* lag w.r.t. the ideal proportional-share fluid-flow
system, which can be of arbitrarily fine granularity. The cause of
this problem is that p->fair_key only reflects a fair point at which a
task should be started, but does not carry any information about how
urgent the task is (i.e. the priority or weight of the task).

*** In EEVDF, each task p_i is specified by 2 parameters: weight w_i
and timeslice length l_i. EEVDF tries to schedule tasks based on the
virtual deadline vd_i by which a timeslice l_i should be finished.

EEVDF keeps a virtual start time (ve_i) and a virtual deadline (vd_i)
for each task. When a task starts, its ve_i is initialized to the
current virtual time, and its virtual deadline is calculated as:

  vd_i = ve_i + l_i/w_i   // the same method CFS uses to update fair_key

When l_i amount of work is done, the just-expired vd_i becomes the new
ve_i. That is, the virtual start time of the next l_i amount of work
is the deadline of the previously finished timeslice. The virtual
deadline vd_i is then updated using the above equation.

EEVDF scheduling policy: always execute the task with the _earliest_
virtual deadline.

EXAMPLE: assume the system has 1000 ticks per second. At time t=0, we
start 2 tasks: p1 and p2, such that w_1 = w_2 = 1 and l_1 = l_2 = 5
ticks, i.e. 5ms (which reflects the similar setting in the CFS case).
Furthermore, the system virtual time vt is initialized to 0. Now at
time t=0, we have:

  vt = 0
  ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 5
  ve_2 = 0, vd_2 = ve_2 + l_2/w_2 = 5

As p1 and p2 have the same virtual deadline, EEVDF breaks the tie
arbitrarily, say it executes p1.
After 1 system tick (1ms later), we have:

  vt = 0 + 1/(w_1 + w_2) = 1/2   // just as CFS updates rq->fair_clock
  ve_1 = 0, vd_1 = 5             // not changed
  ve_2 = 0, vd_2 = 5             // not changed

Suppose we start a new task p3 at this moment, with weight w_3 = 2 and
timeslice l_3 = 5 ticks (5ms). Then:

  ve_3 = vt = 1/2
  vd_3 = ve_3 + l_3/w_3 = 1/2 + 5/2 = 3

Hmmm, the scheduler will now execute task p3 first, since it has the
earliest deadline. The deadline implicitly contains some very
important information that the CFS fair_key does not have: how
urgently this amount of work has to be done, which comes from the
weight of the task.

Formally speaking, EEVDF has a constant O(1) lag w.r.t. the ideal
proportional-share fluid-flow system. (Please look at the paper for
the detailed proofs.) In normal cases, EEVDF ensures that a task p_i
does not fall behind the ideal system by more than l_i time.
Occasionally, the delay can be max(l_i), the maximum timeslice used by
all active tasks, due to tasks dynamically entering and leaving.
(Again, the paper gives a more detailed explanation of this.)

We can see that the timeslice l_i used by a task p_i actually controls
the accuracy of scheduling p_i: the smaller l_i, the closer to the
ideal system during p_i's execution. For example, in the above case,
suppose p3 has w_3 = 1 (the same as p1 and p2) and l_3 = 3 (3ms).
Since vd_3 = 1/2 + l_3/w_3 = 7/2, the scheduler still executes p3
first, even though p1, p2 and p3 have the same weight. A smaller l_i
leads the scheduler to handle p_i at a finer granularity: intuitively,
it makes the scheduler check the deadline of p_i more frequently and
ensures each small piece of work is done on time, while a larger l_i
does the reverse.

The decoupling of weight w_i and timeslice l_i is important. Generally
speaking, weight determines throughput and timeslice determines the
responsiveness of a task.
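
The EEVDF arithmetic in the example can likewise be checked mechanically. This sketch implements only the virtual-time bookkeeping described in this mail (ve/vd initialization and earliest-virtual-deadline-first selection), under the same parameters as the example:

```python
from fractions import Fraction

class Task:
    def __init__(self, name, weight, timeslice, ve):
        self.name, self.w, self.l = name, weight, timeslice
        self.ve = ve
        self.vd = ve + Fraction(timeslice, weight)  # vd_i = ve_i + l_i/w_i

def pick(tasks):
    """EEVDF policy: run the task with the earliest virtual deadline."""
    return min(tasks, key=lambda p: p.vd)

# t=0: two tasks, w=1, l=5 ticks; virtual time vt starts at 0.
vt = Fraction(0)
p1 = Task("p1", 1, 5, ve=vt)
p2 = Task("p2", 1, 5, ve=vt)
assert p1.vd == p2.vd == 5               # tie, broken arbitrarily: run p1

# One 1ms tick of p1: vt advances by 1/sum(w_i); deadlines are unchanged.
vt += Fraction(1, p1.w + p2.w)
assert vt == Fraction(1, 2)

# t=1: p3 starts with w=2, l=5; its deadline lands *before* p1's and p2's.
p3 = Task("p3", 2, 5, ve=vt)
assert p3.vd == 3                        # 1/2 + 5/2
assert pick([p1, p2, p3]).name == "p3"   # weight feeds into urgency

# Variant from the mail: equal weight but a shorter timeslice also wins.
p3b = Task("p3b", 1, 3, ve=vt)
assert p3b.vd == Fraction(7, 2)          # 1/2 + 3/1
assert pick([p1, p2, p3b]).name == "p3b"
```
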
In normal situations, high-priority tasks usually need more CPU
capacity within a short period of time (bursty, such as keyboard and
mouse input, X updates, daemons, etc.), and need to be processed as
quickly as possible (responsiveness and interactivity). Following the
analysis above, we know that for higher-priority tasks we should give
a _higher weight_ to ensure their CPU throughput, and at the same time
a _smaller timeslice_ to ensure better responsiveness. This is a bit
counter-intuitive compared to the current Linux implementation, where
a smaller nice value leads to both a higher weight and a larger
timeslice.

Now let's see what happens in the real-time world. Usually a real-time
application is specified by (C_i, T_i), where C_i is the amount of
work and T_i is the period of time within which the work has to be
finished. For example, (20ms, 50ms) says the scheduler has to do 20ms
of work within a window of 50ms for this task. A smaller T_i gives
tighter response constraints. Note that B_i = C_i/T_i is really the
CPU proportion for the task during its execution.

In our proportional-share world, weight w_i decides the CPU
proportion, which maps to B_i, and timeslice l_i gives the amount of
work that should be done each time, which maps to C_i. Then, using w_i
and l_i, we can derive a window size, which actually maps to T_i.
Therefore a smaller l_i leads to a smaller T_i, which means tighter
response constraints. It seems to me all of this makes a lot of sense
in both the proportional-share and the real-time world.

Based on my understanding, adopting something like EEVDF in CFS should
not be very difficult given their similarities, although I do not have
any idea how this impacts load balancing for SMP. Is this worth a try?

Sorry for such a long email :-)

Ting
* Re: [patch] CFS scheduler, -v8
  2007-05-02  2:57 ` Ting Yang
@ 2007-05-02  5:10   ` Willy Tarreau
  2007-05-02  5:30   ` William Lee Irwin III
  ` (5 subsequent siblings)
  6 siblings, 0 replies; 75+ messages in thread
From: Willy Tarreau @ 2007-05-02 5:10 UTC (permalink / raw)
To: Ting Yang; +Cc: Ingo Molnar, linux-kernel

Hi Ting,

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
>
> Hi, Ingo
>
> My name is Ting Yang, a graduate student from UMASS. I am currently
> studying the linux scheduler and virtual memory manager to solve some
> page swapping problems. I am very excited with the new scheduler CFS.
> After I read through your code, I think that you might be interested in
> reading this paper:
>
> "A Proportional Share REsource Allocation Algorithm for Real-Time,
> Time-Shared Systems", by Ion Stoica. You can find the paper here:
> http://citeseer.ist.psu.edu/37752.html
>
> Authors of this paper proposed a scheduler: Earlist Eligible Virtual
> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to
> track the execution of each running task. The only difference between
> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is
> _start-time_ fair. Scheduling based on deadline gives better reponse
> time bound and seems to more fair.
>
> In the following part of this email, I will try to explain the
> similarities and differences between EEVDF and CFS. Hopefully, this
> might provide you with some useful information w.r.t your current work
> on CFS.
(...)

Thanks very much for this very clear explanation. Now I realize that
some of the principles I've had in mind for a long time already exist
and are documented ! That's what I called sorting by job completion
time in the past, which might not have been clear for everyone. Now
you have put words on all those concepts, it's more clear ;-)

> The decouple of weight w_i and timeslice l_i is important.
> Generally speaking, weight determines throughput and timeslice
> determines the responsiveness of a task.

I 100% agree. That's the problem we have with nice today. Some people
want to use nice to assign more CPU to tasks (as has always been for
years) and others want to use nice to get better interactivity
(meaning nice as when you're in a queue and letting the old woman go
before you). IMHO, the two concepts are opposed. Either you're a CPU
hog OR you get quick responsiveness.

> In normal situation, high priority tasks usually need more cpu
> capacity within short period of time (bursty, such as keyboard, mouse
> move, X updates, daemons, etc), and need to be processed as quick as
> possible (responsiveness and interactiveness). Follow the analysis
> above, we know that for higher priority tasks we should give _higher
> weight_ to ensure its CPU throughput, and at the same time give
> _smaller timeslice_ to ensure better responsiveness. This is a bit
> counter-intuitive against the current linux implementation: smaller
> nice value leads to higher weight and larger timeslice.

We have an additional problem in Linux, and not the least: it already
exists and is deployed everywhere, so we cannot break existing setups.
More specifically, we don't want to play with nice values of processes
such as X. That's why I think that monitoring the amount of the
time-slice (l_i) consumed by the task is important. I proposed to
conserve the unused part of l_i as a credit (and conversely the credit
can be negative if the time-slice has been over-used). This credit
would serve two purposes:

  - reassign the unused part of l_i on next time-slices to get the
    most fair share of CPU between tasks

  - use it as an interactivity key to sort the tasks. Basically, if we
    note u_i the unused CPU cycles, you can sort based on (d_i - u_i)
    instead of just d_i, and the less hungry tasks will reach the CPU
    faster than others.

(...)
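
The credit idea above can be sketched in a few lines. This is purely illustrative of the proposed (d_i - u_i) sort key; the task mix and numbers are invented, and nothing here corresponds to existing scheduler code:

```python
from fractions import Fraction

# (name, deadline d_i, unused-timeslice credit u_i)
tasks = [
    ("cpu_hog",  Fraction(5), Fraction(0)),    # always burns its full slice
    ("editor",   Fraction(5), Fraction(4)),    # mostly sleeps: large credit
    ("overuser", Fraction(4), Fraction(-2)),   # over-used its slice: debt
]

# Plain deadline order would run "overuser" first; crediting unused
# cycles lets the least hungry task ("editor") reach the CPU first.
by_deadline = sorted(tasks, key=lambda t: t[1])
by_credit = sorted(tasks, key=lambda t: t[1] - t[2])

assert by_deadline[0][0] == "overuser"
assert by_credit[0][0] == "editor"       # key 5 - 4 = 1, the smallest
```
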
> Based on my understanding, adopting something like EEVDF in CFS should
> not be very difficult given their similarities, although I do not have
> any idea on how this impacts the load balancing for SMP. Does this
> worth a try?

I think that if you have time to spend on this, everyone would like to
see the difference. All the work on the scheduler is more or less
experimental and several people are exchanging ideas right now, so it
should be the right moment. You seem to understand both approaches
very well and it's likely that it will not take you too much time :-)

> Sorry for such a long email :-)

It was worth it, thanks !

Willy
* Re: [patch] CFS scheduler, -v8
  2007-05-02  2:57 ` Ting Yang
  2007-05-02  5:10   ` Willy Tarreau
@ 2007-05-02  5:30   ` William Lee Irwin III
  2007-05-02 10:05   ` Bill Huey
  ` (4 subsequent siblings)
  6 siblings, 0 replies; 75+ messages in thread
From: William Lee Irwin III @ 2007-05-02 5:30 UTC (permalink / raw)
To: Ting Yang; +Cc: Ingo Molnar, linux-kernel

On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote:
> Authors of this paper proposed a scheduler: Earlist Eligible Virtual
> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to
> track the execution of each running task. The only difference between
> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is
> _start-time_ fair. Scheduling based on deadline gives better reponse
> time bound and seems to more fair.
> In the following part of this email, I will try to explain the
> similarities and differences between EEVDF and CFS. Hopefully, this
> might provide you with some useful information w.r.t your current work
> on CFS.

Any chance you could write a patch to convert CFS to EEVDF? People may
have an easier time understanding code than theoretical explanations.
(I guess I could do it if sufficiently pressed.)

-- wli
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang 2007-05-02 5:10 ` Willy Tarreau 2007-05-02 5:30 ` William Lee Irwin III @ 2007-05-02 10:05 ` Bill Huey 2007-05-02 10:27 ` Ingo Molnar ` (3 subsequent siblings) 6 siblings, 0 replies; 75+ messages in thread From: Bill Huey @ 2007-05-02 10:05 UTC (permalink / raw) To: Ting Yang Cc: Ingo Molnar, linux-kernel, Con Kolivas, William Lee Irwin III, Bill Huey (hui) On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: > Based on my understanding, adopting something like EEVDF in CFS should > not be very difficult given their similarities, although I do not have > any idea on how this impacts the load balancing for SMP. Does this worth > a try? > > Sorry for such a long email :-) An excellent long email. Thanks. Have you looked at Con's SD ? What did you think about it analytically and do you think that these ideas could be incorporated into that scheduler ? bill ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (2 preceding siblings ...) 2007-05-02 10:05 ` Bill Huey @ 2007-05-02 10:27 ` Ingo Molnar 2007-05-02 17:36 ` Srivatsa Vaddagiri ` (2 subsequent siblings) 6 siblings, 0 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 10:27 UTC (permalink / raw) To: Ting Yang; +Cc: linux-kernel * Ting Yang <tingy@cs.umass.edu> wrote: > My name is Ting Yang, a graduate student from UMASS. I am currently > studying the linux scheduler and virtual memory manager to solve some > page swapping problems. I am very excited with the new scheduler CFS. > After I read through your code, I think that you might be interested > in reading this paper: thanks for your detailed analysis - it was very interesting! > Based on my understanding, adopting something like EEVDF in CFS > should not be very difficult given their similarities, although I do > not have any idea on how this impacts the load balancing for SMP. Does > this worth a try? It would definitely be interesting to try! I don't think it should negatively impact load balancing on SMP. The current fork-time behavior of CFS is really just a first-approximation thing, and what you propose seems to make more sense to me too because it preserves the fluidity of fairness. (I'd probably apply your patch even if there was no directly measurable impact on workloads, because the more natural approaches tend to be more maintainable in the long run.) So by all means, please feel free to do a patch for this. > Sorry for such a long email :-) it made a lot of sense and was very useful :-) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (3 preceding siblings ...) 2007-05-02 10:27 ` Ingo Molnar @ 2007-05-02 17:36 ` Srivatsa Vaddagiri 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 23:41 ` Ting Yang 2007-05-02 18:42 ` Li, Tong N 2007-05-03 8:50 ` Ingo Molnar 6 siblings, 2 replies; 75+ messages in thread From: Srivatsa Vaddagiri @ 2007-05-02 17:36 UTC (permalink / raw) To: Ting Yang; +Cc: Ingo Molnar, linux-kernel On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: > "A Proportional Share REsource Allocation Algorithm for Real-Time, > Time-Shared Systems", by Ion Stoica. You can find the paper here: > http://citeseer.ist.psu.edu/37752.html Good paper ..thanks for the pointer. I briefly went thr' the paper and my impression is it expects each task to specify the length of each new request it initiates. Is that correct? If we have to apply EEVDF to SCHED_NORMAL task scheduling under CFS, how would we calculate that "length of each new request" (which is required before we calculate its virtual deadline)? > EXAMPLE: assume the system runs at 1000 tick/second, i.e. 1ms a tick, > and the granularity of pre-exemption for CFS is 5 virtual ticks (the > current setting). If, at time t=0, we start 2 tasks: p1 and p2, both > have nice value 0 (weight 1024), and rq->fair_clock is initialized to 0. > Now we have: > p1->fair_key = p2->fair_key = rq->fair_clock = 0. > CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick > (1ms later) t=1, we have: > rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0. > Suppose, a new task p3 starts with nice value -10 at this moment, that > is p3->fair_key=1/2. In this case, CFS will not schedule p3 for > execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which > translates to about 10ms later in this setting), _regardless_ the > priority (weight) of p3. There is also p->wait_runtime which is taken into account when calculating p->fair_key. 
So if p3 had been waiting in the runqueue for long before, it can get to run quicker than 10ms later. -- Regards, vatsa ^ permalink raw reply [flat|nested] 75+ messages in thread
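The tick arithmetic in the quoted example can be checked with a toy model of the fair clock (weights normalized so that nice-0 is 1.0; this is a sketch of the accounting, not the kernel's fixed-point code):

```python
# Toy model of the fair_clock / fair_key arithmetic from the example:
# rq->fair_clock advances by dt / (sum of weights), while the running
# task's fair_key advances by dt / (its own weight).
def tick(fair_clock, weights, keys, running, dt=1.0):
    fair_clock += dt / sum(weights.values())  # rq->fair_clock advance
    keys[running] += dt / weights[running]    # running task's key advance
    return fair_clock

weights = {"p1": 1.0, "p2": 1.0}   # two nice-0 tasks
keys = {"p1": 0.0, "p2": 0.0}
fair_clock = 0.0

fair_clock = tick(fair_clock, weights, keys, running="p1")  # one 1 ms tick
# matches the example: rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0

# p3 now starts with nice -10 (a much higher weight), but its initial key
# is just the current fair_clock -- the key carries no weight information:
keys["p3"] = fair_clock
```

This reproduces the numbers quoted above and illustrates Ting's point that a new task's fair_key is independent of its weight.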
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:36 ` Srivatsa Vaddagiri @ 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 18:15 ` Ingo Molnar ` (2 more replies) 2007-05-02 23:41 ` Ting Yang 1 sibling, 3 replies; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 17:48 UTC (permalink / raw) To: Srivatsa Vaddagiri; +Cc: Ting Yang, Ingo Molnar, linux-kernel On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: >> "A Proportional Share REsource Allocation Algorithm for Real-Time, >> Time-Shared Systems", by Ion Stoica. You can find the paper here: >> http://citeseer.ist.psu.edu/37752.html On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote: > Good paper ..thanks for the pointer. > I briefly went thr' the paper and my impression is it expect each task > to specify the length of each new request it initiates. Is that correct? > If we have to apply EEVDF to SCHED_NORMAL task scheduling under CFS, how > would we calculate that "length of each new request" (which is reqd > before we calculate its virtual deadline)? l_i and w_i are both functions of the priority. You essentially arrange l_i to express QoS wrt. latency, and w_i to express QoS wrt. bandwidth. On Tue, May 01, 2007 at 10:57:14PM -0400, Ting Yang wrote: >> EXAMPLE: assume the system runs at 1000 tick/second, i.e. 1ms a tick, >> and the granularity of pre-exemption for CFS is 5 virtual ticks (the >> current setting). If, at time t=0, we start 2 tasks: p1 and p2, both >> have nice value 0 (weight 1024), and rq->fair_clock is initialized to 0. >> Now we have: >> p1->fair_key = p2->fair_key = rq->fair_clock = 0. >> CFS breaks the tie arbitrarily, say it executes p1. After 1 system tick >> (1ms later) t=1, we have: >> rq->fair_clock = 1/2, p1->fair_key = 1, p2->fair_key = 0. >> Suppose, a new task p3 starts with nice value -10 at this moment, that >> is p3->fair_key=1/2. 
In this case, CFS will not schedule p3 for >> execution until the fair_keys of p1 and p2 go beyond 5+1/2 (which >> translates to about 10ms later in this setting), _regardless_ the >> priority (weight) of p3. On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote: > There is also p->wait_runtime which is taken into account when > calculating p->fair_key. So if p3 had waiting in runqueue for long > before, it can get to run quicker than 10ms later. Virtual time is time from the task's point of view, which it has spent executing. ->wait_runtime is a device to subtract out time spent on the runqueue but not running from what would otherwise be virtual time to express lag, whether deliberately or coincidentally. ->wait_runtime would not be useful for EEVDF AFAICT, though it may be interesting to report. -- wli ^ permalink raw reply [flat|nested] 75+ messages in thread
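The l_i/w_i split described above (w_i for bandwidth QoS, l_i for latency QoS) can be illustrated with a hypothetical nice-to-parameters mapping; the constants below are invented for illustration and are not taken from any kernel:

```python
# Hypothetical nice -> (w_i, l_i) mapping in the spirit described above:
# weight w_i (bandwidth QoS) grows as nice decreases, while the
# timeslice l_i (latency QoS) shrinks, so high-priority tasks get more
# CPU in smaller, more frequent chunks.
def qos_params(nice: int):
    w = 1.25 ** (-nice)      # ~25% more bandwidth per nice level (invented)
    l = 4.0 * 1.25 ** nice   # ms; smaller slice for higher priority (invented)
    return w, l

params = {n: qos_params(n) for n in (-10, 0, 10)}
# weight decreases monotonically with nice, timeslice increases with nice
```

The exact constants don't matter here; the point is that the two QoS axes can be tuned independently through the priority mapping.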
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:48 ` William Lee Irwin III @ 2007-05-02 18:15 ` Ingo Molnar 2007-05-02 18:56 ` William Lee Irwin III 2007-05-03 2:48 ` Ting Yang 2007-05-03 3:18 ` Ting Yang 2 siblings, 1 reply; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 18:15 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel * William Lee Irwin III <wli@holomorphy.com> wrote: > > There is also p->wait_runtime which is taken into account when > > calculating p->fair_key. So if p3 had waiting in runqueue for long > > before, it can get to run quicker than 10ms later. > > Virtual time is time from the task's point of view, which it has spent > executing. ->wait_runtime is a device to subtract out time spent on > the runqueue but not running from what would otherwise be virtual time > to express lag, whether deliberately or coincidentally. [...] CFS is in fact _built around_ the ->wait_runtime metric (which, as its name suggests already, expresses the precise lag a task observes relative to 'ideal' fair execution), so what exactly makes you suspect that this property of the ->wait_runtime metric might be 'coincidental'? ;-) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:15 ` Ingo Molnar @ 2007-05-02 18:56 ` William Lee Irwin III 2007-05-02 19:12 ` Ingo Molnar 0 siblings, 1 reply; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 18:56 UTC (permalink / raw) To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel * William Lee Irwin III <wli@holomorphy.com> wrote: >> Virtual time is time from the task's point of view, which it has spent >> executing. ->wait_runtime is a device to subtract out time spent on >> the runqueue but not running from what would otherwise be virtual time >> to express lag, whether deliberately or coincidentally. [...] On Wed, May 02, 2007 at 08:15:33PM +0200, Ingo Molnar wrote: > CFS is in fact _built around_ the ->wait_runtime metric (which, as its > name suggests already, expresses the precise lag a task observes > relative to 'ideal' fair execution), so what exactly makes you suspect > that this property of the ->wait_runtime metric might be 'coincidental'? > ;-) The coincidental aspect would be that at the time it was written, the formal notion of lag was not being used particularly with respect to priorities and load weights. The documentation doesn't describe it in those terms, and I can't read minds, so I refrained from guessing. Things are moving in good directions on all this as far as I'm concerned. Moving according to Ting Yang's analysis should wrap up the soundness concerns about intra-queue policy I've had. OTOH load balancing I know much less about (not that I was ever any sort of an expert on single queue affairs). That remains a very deep concern, as load balancing is where most of the enterprise performance improvements and degradations occur. -- wli ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:56 ` William Lee Irwin III @ 2007-05-02 19:12 ` Ingo Molnar 2007-05-02 19:42 ` William Lee Irwin III 0 siblings, 1 reply; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 19:12 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel * William Lee Irwin III <wli@holomorphy.com> wrote: > The coincidental aspect would be that at the time it was written, the > formal notion of lag was not being used particularly with respect to > priorities and load weights. [...] (nice levels for SCHED_OTHER are 'just' an add-on concept to the core of CFS, in fact i had two wildly different approaches that both did the trick for users, so i fail to see the relevance of priorities to the core concept of measuring how much a task is waiting to get on the runqueue via the 'fair clock' ... but lets move on.) > Things are moving in good directions on all this as far as I'm > concerned. Moving according to Ting Yang's analysis should wrap up the > soundness concerns about intra-queue policy I've had. OTOH load > balancing I know much less about (not that I was ever any sort of an > expert on single queue affairs). [...] the whole move to ->load_weight based calculations was to make CFS integrate better with load-balancing and to bring the smpnice infrastructure even more into the scheduler mainstream. [ There's a small smpnice related buglet i fixed in -v9-to-be (based on Balbir Singh's feedback), but otherwise it behaves quite well on SMP and that's not a big surprise: i left the load-balancer largely intact. ] Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 19:12 ` Ingo Molnar @ 2007-05-02 19:42 ` William Lee Irwin III 0 siblings, 0 replies; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 19:42 UTC (permalink / raw) To: Ingo Molnar; +Cc: Srivatsa Vaddagiri, Ting Yang, linux-kernel, Tong N. Li * William Lee Irwin III <wli@holomorphy.com> wrote: >> Things are moving in good directions on all this as far as I'm >> concerned. Moving according to Ting Yang's analysis should wrap up the >> soundness concerns about intra-queue policy I've had. OTOH load >> balancing I know much less about (not that I was ever any sort of an >> expert on single queue affairs). [...] On Wed, May 02, 2007 at 09:12:35PM +0200, Ingo Molnar wrote: > the whole move to ->load_weight based calculations was to make CFS > integrate better with load-balancing and to bring the smpnice > infrastructure even more into the scheduler mainstream. [ There's a > small smpnice related buglet i fixed in -v9-to-be (based on Balbir > Singh's feedback), but otherwise it behaves quite well on SMP and that's > not a big surprise: i left the load-balancer largely intact. ] Despite the original motive, the ->load_weight -based calculations largely resolved my concerns about intra-queue prioritization. They were the change to the ->fair_key computation I wanted to see, which were still not enough because of the O(rq->nr_running) lag issue due to the differences from EEVDF, but I wasn't entirely aware of that, only suspicious that some issue remained. Load balancing has non-negligible impacts on fairness but I'm almost entirely ignorant of the aspects of the theory relating to it. More knowledgeable people, e.g. Tong Li and Ting Yang, need to take over reviewing here much as they did on the intra-queue front, where I only knew enough to drop hints and not to make more useful suggestions. O(1) vs. 
O(lg(n)) priority queues are not what matter here (well, obviously O(n) priority queues would break), but rather O(1) vs. O(n) lag. -- wli ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 18:15 ` Ingo Molnar @ 2007-05-03 2:48 ` Ting Yang 2007-05-03 3:18 ` Ting Yang 2 siblings, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 2:48 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ingo Molnar, linux-kernel Hi, As encouraged by some of you, I have started implementing EEVDF. However, I am quite new in this area, and may not be experienced enough to get it through quickly. The main problem I am facing now is how to treat the semantics of yield() and yield_to(). I probably will throw a lot of questions along the way of my implementation. Also I found my previous email was not clear enough in describing the properties of CFS and EEVDF and caused some confusion, and there were some mistakes too. In this email, I will try to make up for that. *** Let's start from CFS: For simplicity, let's assume that CFS preempts the current task p1 with another task p2, when p1->key - p2->key > 1, and the virtual time rq->fair_clock is initialized to be 0. Suppose, at time t = 0, we start n+1 tasks that run long enough. Task p1 has weight n and all other tasks have weight 1. It is clear that, at time t=0, p_1->key = p_2->key = ... = p_(n+1)->key = rq->fair_clock = 0 Since all tasks have the same key, CFS breaks the ties arbitrarily, which leads to many possibilities. Let's consider 2 of them: _Case One:_ p1, which has weight n, executes first: t = 1: rq->fair_clock = 1/2n, p1->key = 1/n // others are not changed. t = 2: rq->fair_clock = 2/2n, p1->key = 2/n ... t = n: rq->fair_clock = n/2n, p1->key = n/n = 1 Only after p1 executes n ticks will the scheduler pick another task for execution. Between time [0, n) the amount of actual work done by p1 is n. The amount of work that should be done in the ideal fluid-flow system is n * n/2n = n/2. Therefore the lag is n/2 - n = -n/2; a negative value means p1 runs ahead of the ideal case. As we can see this lag is O(n). 
_Case Two:_ the scheduler executes the tasks in the order p2, p3, ..., p_(n+1), p1 t = 1: rq->fair_clock = 1/2n, p2->key = 1; // others are not changed t = 2: rq->fair_clock = 2/2n, p3->key = 1; .... t = n: rq->fair_clock = n/2n, p_(n+1)->key = 1; Then the scheduler picks p1 (weight n) for execution. Between time [0, n) the amount of actual work done by p1 is 0, and the ideal amount is n/2. Therefore the lag is n/2 - 0 = n/2; a positive value means p1 falls behind the ideal case. The lag here for p1 is also O(n). As I said in the previous email, p->fair_key only carries information about the past execution of a task and reflects a fair start point. It carries no information about weight. *** Now, let's look at EEVDF. I have to say that I missed a very important concept in EEVDF which led to confusion here. EEVDF stands for Earliest _Eligible_ Virtual Deadline First, and I did not explain what _eligible_ means. EEVDF maintains a virtual start time ve_i and virtual deadline vd_i for each task p_i, as well as a virtual time vt. A newly started/woken task has its ve_i initialized to be the current virtual time. Once a timeslice l_i amount of work is done, the new virtual start time is set to be the previous virtual deadline, and then the virtual deadline vd_i is recalculated. A task is eligible if and only if ve_i <= current virtual time vt. EEVDF, at every tick, always picks the eligible task which has the earliest virtual deadline for execution. Let's see how it works using a similar example as for CFS above. Suppose, at time t = 0, we start n+1 tasks. p1 has weight n, and all others have weight 1. For simplicity, we assume all tasks use timeslice l_i = 1, and virtual time vt is initialized to be 0. - at time t = 0, we have vt = 0; ve_1 = 0, vd_1 = ve_1 + l_1/w_1 = 1/n ve_2 = 0, vd_2 = ve_2 + l_2/w_2 = 1 ... ve_(n+1) = 0, vd_(n+1) = ve_(n+1) + l_(n+1)/w_(n+1) = 1; Since p1 is eligible and has the earliest deadline 1/n, the scheduler will execute it first. 
(Here, the weight encoded in the deadline plays an important role, and allows higher weight tasks to be executed first.) - at time t = 1: vt = 1/2n, ve_1 = 1/n (previous vd_1), vd_1 = ve_1 + 1/n = 2/n Since ve_1 > vt, p1 is _not_ eligible. EEVDF picks another task for execution by breaking the tie, say it executes p2. - at time t = 2: vt = 2/2n = 1/n, ve_1 = 1/n, vd_1 = 2/n ve_2 = 1, vd_2 = ve_2 + 1/1 = 2 // this makes p2 not eligible Since vt = ve_1, p1 becomes eligible again and has the earliest deadline 2/n, so it will be scheduled for execution. As EEVDF repeats, it gives a schedule like p1, p2, p1, p3, p1, p4, p1, ... (one entry per tick). As you can see, now p1 never falls behind or runs ahead of the ideal case by more than 1. Now, let's check how timeslice l_i impacts the system. Suppose we change the timeslice of p1 from 1 to 2, and keep the others unchanged. EEVDF gives a schedule like: p1, p1, p2, p3, p1, p1, p4, p5, p1, p1, ... (one entry per tick) Similarly, if the timeslice of p1 is set to 3, EEVDF gives p1, p1, p1, p2, p3, p4, p1, p1, p1, ... (one entry per tick) As the timeslice of p1 increases, the system checks for reschedule less frequently, which makes the lag of p1 larger, but the lag will not exceed the maximum timeslice used, as long as that is a fixed constant. On the other hand, increasing the timeslice of other tasks has no effect on when p1 is scheduled. (You can try the algorithm yourself :-)) In CFS, a task has to increase p->fair_key by a certain amount before the scheduler considers preempting it. Higher weight leads to less progress in p->fair_key, and thus effectively a larger timeslice. Suppose the preempt granularity is 5 virtual ticks; then a task of weight 1 needs to run 5 ticks, weight 2 needs 10 ticks, weight 10 needs 50 ticks. The effect is that the timeslice increases linearly w.r.t weight, and causes the O(n) lag. 
In fact, if we use timeslice n for p1 in the above EEVDF example, it behaves exactly like CFS. Now, let me fix an incorrect statement about EEVDF's lag bound in my previous email: Under EEVDF, a task's lag (the difference between the amount of work that should be done in the ideal fluid-flow system and the actual amount of work done) is bounded by the maximum timeslice used (not occasionally, as I said in my previous email). This actually means the maximum timeslice used controls the total responsiveness of the system. Also, the combination of high weight and smaller timeslice will give a better response guarantee for those bursty response-time sensitive tasks. Sorry for the late reply. I had an eye exam today and got my eyes dilated, which forced me to stay away from my computer for a while :-) Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
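The EEVDF rules described above are simple enough to simulate directly. The sketch below uses n = 3 (p1 of weight 3 plus three weight-1 tasks, all with timeslice 1) and reproduces both the p1, p2, p1, p3, p1, p4 interleaving and the bounded lag; tie-breaking by list order is an arbitrary choice, which the algorithm permits:

```python
# Minimal EEVDF simulation: a task is eligible iff ve_i <= vt, and
# among eligible tasks the one with the earliest virtual deadline runs.
from fractions import Fraction as F

class Task:
    def __init__(self, name, weight, l=1):
        self.name, self.w, self.l = name, weight, l
        self.ve, self.vd = F(0), F(l, weight)   # ve_i = 0, vd_i = ve_i + l_i/w_i
        self.done = 0                           # ticks of actual work completed

    def account(self):
        self.done += 1
        self.ve = self.vd                       # next request starts at old deadline
        self.vd = self.ve + F(self.l, self.w)

tasks = [Task("p1", 3), Task("p2", 1), Task("p3", 1), Task("p4", 1)]
W = sum(t.w for t in tasks)                     # total weight = 6
vt = F(0)
schedule, max_lag = [], F(0)

for t in range(6):
    eligible = [x for x in tasks if x.ve <= vt]
    cur = min(eligible, key=lambda x: x.vd)     # min() is stable: list order breaks ties
    schedule.append(cur.name)
    cur.account()
    vt += F(1, W)                               # virtual time: one tick / total weight
    # lag of p1 = ideal fluid-flow work minus actual work, after t+1 ticks
    ideal = F(tasks[0].w, W) * (t + 1)
    max_lag = max(max_lag, abs(ideal - tasks[0].done))

# p1 interleaves with the weight-1 tasks instead of running in one burst,
# and its lag never exceeds the timeslice (here it peaks at 1/2).
```

Changing p1's timeslice to n in this model collapses the interleaving back into the CFS-like burst described above.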
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:48 ` William Lee Irwin III 2007-05-02 18:15 ` Ingo Molnar 2007-05-03 2:48 ` Ting Yang @ 2007-05-03 3:18 ` Ting Yang 2007-05-03 10:19 ` Bill Huey 2 siblings, 1 reply; 75+ messages in thread From: Ting Yang @ 2007-05-03 3:18 UTC (permalink / raw) To: William Lee Irwin III; +Cc: Srivatsa Vaddagiri, Ingo Molnar, linux-kernel > On Wed, May 02, 2007 at 11:06:34PM +0530, Srivatsa Vaddagiri wrote: > >> There is also p->wait_runtime which is taken into account when >> calculating p->fair_key. So if p3 had waiting in runqueue for long >> before, it can get to run quicker than 10ms later. >> > > Virtual time is time from the task's point of view, which it has spent > executing. ->wait_runtime is a device to subtract out time spent on the > runqueue but not running from what would otherwise be virtual time to > express lag, whether deliberately or coincidentally. ->wait_runtime > would not be useful for EEVDF AFAICT, though it may be interesting to > report. I just want to point out that ->wait_runtime, in fact, stores the lag of each task in CFS, except that it is also used by other things, and occasionally tweaked (heuristically ?). Under normal circumstances the sum of the lags of all active tasks in such a system should be a constant 0. The lag information is equally important to EEVDF, when some tasks leave the system (become inactive) carrying a certain amount of lag. The key point here is that we have to spread the lag (either negative or positive) to all remaining tasks, so that the fairness of the system is preserved. I think the CFS implementation does not seem to handle this properly. I am running out of time today :-( I will write an email about CFS -v8 tomorrow, describing 2 issues in CFS I found related to this. Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
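The lag bookkeeping argued above can be put into a small sketch: the lags of active tasks sum to zero, and when a task leaves carrying lag, spreading that lag over the remaining tasks in proportion to their weights preserves the invariant. This is purely illustrative, not CFS code:

```python
# Illustration of the zero-sum lag invariant: when a task leaves the
# system carrying lag, the carried lag is redistributed to the
# remaining tasks in proportion to their weights, keeping sum(lag) == 0.
from fractions import Fraction as F

def redistribute(lags, weights, leaver):
    carried = lags.pop(leaver)
    remaining = {k: w for k, w in weights.items() if k != leaver}
    total_w = sum(remaining.values())
    for k, w in remaining.items():
        lags[k] += carried * F(w, total_w)   # weight-proportional share
    return lags

lags    = {"p1": F(-1, 2), "p2": F(1, 4), "p3": F(1, 4)}  # sums to 0
weights = {"p1": 2, "p2": 1, "p3": 1}
lags = redistribute(lags, weights, "p1")     # p1 exits with negative lag
```

If the carried lag were simply dropped instead, the remaining tasks' lags would no longer sum to zero and total fairness would drift, which is the failure mode described above.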
* Re: [patch] CFS scheduler, -v8 2007-05-03 3:18 ` Ting Yang @ 2007-05-03 10:19 ` Bill Huey 0 siblings, 0 replies; 75+ messages in thread From: Bill Huey @ 2007-05-03 10:19 UTC (permalink / raw) To: Ting Yang Cc: William Lee Irwin III, Srivatsa Vaddagiri, Ingo Molnar, linux-kernel, Bill Huey (hui) On Wed, May 02, 2007 at 11:18:45PM -0400, Ting Yang wrote: > I just want to point out that ->wait_runtime, in fact, stores the lag of > each task in CFS, except that it is also used by other things, and > occasionally tweaked (heuristically ?). Under normal cases the sum of > lags of all active tasks in such a system, should be a constant 0. The > lag information is equally important to EEVDF, when some tasks leave the > system (becomes inactive) carrying certain amount of lag. The key point > here is that we have to spread the lag (either negative or positive) to > all remaining task, so that the fairness of the system is preserved. I > thinks CFS implementation does not seems to handle this properly. > > I am running out time today :-( I will write an email about CFS -v8 > tomorrow, describing 2 issues in CFS I found related to this. Interesting. I haven't looked at the code carefully but that wouldn't surprise me if this was the case and it led to odd corner cases. I'm eagerly awaiting your analysis and explanation. bill ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 17:36 ` Srivatsa Vaddagiri 2007-05-02 17:48 ` William Lee Irwin III @ 2007-05-02 23:41 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-02 23:41 UTC (permalink / raw) To: vatsa; +Cc: Ingo Molnar, linux-kernel Srivatsa Vaddagiri wrote: > I briefly went thr' the paper and my impression is it expect each task > to specify the length of each new request it initiates. Is that correct? > No, the timeslice l_i here serves as a granularity control w.r.t responsiveness (or latency, depending on how you interpret it). As wli said, it can be expressed as a function of the priority, as we do for weight now. It is not related to the length of each new request. A request may be 1 second long, but the scheduler may still process it using a 10ms timeslice. A smaller timeslice leads to more accuracy, i.e. closer to the ideal case. However, the maximum timeslice l_i used by all active tasks determines the total responsiveness of the system, which I will explain in detail later. > There is also p->wait_runtime which is taken into account when > calculating p->fair_key. So if p3 had waiting in runqueue for long > before, it can get to run quicker than 10ms later. Consider if p3 is a newly started task or a woken-up task that carries no p->wait_runtime. Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (4 preceding siblings ...) 2007-05-02 17:36 ` Srivatsa Vaddagiri @ 2007-05-02 18:42 ` Li, Tong N 2007-05-02 19:10 ` William Lee Irwin III 2007-05-03 3:07 ` Ting Yang 2007-05-03 8:50 ` Ingo Molnar 6 siblings, 2 replies; 75+ messages in thread From: Li, Tong N @ 2007-05-02 18:42 UTC (permalink / raw) To: tingy; +Cc: Ingo Molnar, linux-kernel > Based on my understanding, adopting something like EEVDF in CFS should > not be very difficult given their similarities, although I do not have > any idea on how this impacts the load balancing for SMP. Does this worth > a try? > > Sorry for such a long email :-) Thanks for the excellent explanation. I think EEVDF and many algs alike assume global ordering of all tasks in the system (based on virtual time), whereas CFS does so locally on each processor and relies on load balancing to achieve fairness across processors. It'd achieve strong fairness locally, but I'm not sure about its global fairness properties in an MP environment. If ideally the total load weight on each processor is always the same, then local fairness would imply global fairness, but this is a bin packing problem and is intractable ... tong ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:42 ` Li, Tong N @ 2007-05-02 19:10 ` William Lee Irwin III 2007-05-03 3:07 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: William Lee Irwin III @ 2007-05-02 19:10 UTC (permalink / raw) To: Li, Tong N; +Cc: tingy, Ingo Molnar, linux-kernel At some point in the past, Ting Yang wrote: >> Based on my understanding, adopting something like EEVDF in CFS should >> not be very difficult given their similarities, although I do not have >> any idea on how this impacts the load balancing for SMP. Does this worth >> a try? >> Sorry for such a long email :-) On Wed, May 02, 2007 at 11:42:20AM -0700, Li, Tong N wrote: > Thanks for the excellent explanation. I think EEVDF and many algs alike > assume global ordering of all tasks in the system (based on virtual > time), whereas CFS does so locally on each processor and relies on load > balancing to achieve fairness across processors. It'd achieve strong > fairness locally, but I'm not sure about its global fairness properties > in an MP environment. If ideally the total load weight on each processor > is always the same, then local fairness would imply global fairness, but > this is a bin packing problem and is intractable ... It's sort of obvious how to approximate it, but not entirely obvious whether a given approximation actually suffices. More help with the theoretical aspects is needed. - wli ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 18:42 ` Li, Tong N 2007-05-02 19:10 ` William Lee Irwin III @ 2007-05-03 3:07 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 3:07 UTC (permalink / raw) To: Li, Tong N; +Cc: Ingo Molnar, linux-kernel, William Lee Irwin III Li, Tong N wrote: > Thanks for the excellent explanation. I think EEVDF and many algs alike > assume global ordering of all tasks in the system (based on virtual > time), whereas CFS does so locally on each processor and relies on load > balancing to achieve fairness across processors. It'd achieve strong > fairness locally, but I'm not sure about its global fairness properties > in an MP environment. If ideally the total load weight on each processor > is always the same, then local fairness would imply global fairness, but > this is a bin packing problem and is intractable ... First, I am not assuming a global ordering of all tasks. As in the current implementation, EEVDF should maintain virtual time locally for each CPU. EEVDF is a proportional time share scheduler, therefore the relative weight and actual cpu share for each task vary when tasks join and leave. There is no bin-packing problem for such systems. I understand that the bin-packing problem does exist in the real-time world. Suppose a system has 2 cpus and 3 tasks, all of which need to finish 30ms of work within a window of 50ms. Any 2 of them together exceed the bandwidth of one cpu. There is a bin-packing problem, unless the system is clever enough to break one of them down into 2 requests of 15ms/25ms, and execute them on different cpus at different times without overlap, which is quite difficult :-) In the proportional world, weights and cpu shares are scaled to fit the bandwidth of a cpu. Therefore putting 2 of them on one cpu is fine, and the fairness for each cpu is preserved. 
On the other hand, moving one task back and forth between 2 cpus does give better throughput and better global fairness. I have not dug into the SMP load balancing algorithms yet, so I leave that aside for now; first things first :-) Thanks! Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
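The real-time bin-packing example above can be checked in numbers: three tasks each needing 30 ms of work per 50 ms window, on 2 CPUs. Each task fits a CPU and the total demand fits the machine, yet no assignment of whole tasks works:

```python
# Numeric check of the 2-CPU / 3-task example: per-task utilization is
# 30/50 = 0.6, so total demand (1.8) fits 2 CPUs and each task fits one
# CPU, but any two whole tasks on the same CPU need 1.2 CPUs -- the
# bin-packing failure described above.
from itertools import combinations

util = [30 / 50] * 3                       # 0.6 each
fits_machine = sum(util) <= 2.0            # total demand fits 2 CPUs
fits_one_cpu = all(u <= 1.0 for u in util) # each task fits one CPU
# ... yet every pairing of two whole tasks overloads a single CPU:
overloaded = all(a + b > 1.0 for a, b in combinations(util, 2))
# Proportional share sidesteps this: two 0.6-weight tasks on one CPU
# simply get scaled down to 0.5 each -- per-CPU fairness is preserved,
# though the 50 ms deadlines are not.
```

This is exactly the distinction drawn above between real-time guarantees (where bin packing bites) and proportional sharing (where shares scale instead).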
* Re: [patch] CFS scheduler, -v8 2007-05-02 2:57 ` Ting Yang ` (5 preceding siblings ...) 2007-05-02 18:42 ` Li, Tong N @ 2007-05-03 8:50 ` Ingo Molnar 2007-05-03 14:26 ` Srivatsa Vaddagiri 2007-05-03 15:02 ` Ting Yang 6 siblings, 2 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-03 8:50 UTC (permalink / raw) To: Ting Yang; +Cc: linux-kernel * Ting Yang <tingy@cs.umass.edu> wrote: > Authors of this paper proposed a scheduler: Earliest Eligible Virtual > Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to > track the execution of each running task. The only difference between > EEVDF and CFS is that EEVDF tries to be _deadline_ fair while CFS is > _start-time_ fair. [...] Well, this is a difference but note that it's far from being the 'only difference' between CFS and EEVDF: - in CFS you have to "earn" your right to be on the CPU, while EEVDF gives out timeslices (quanta) - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they know the length of work units - while CFS does not need any knowledge about 'future work', it measures 'past behavior' and makes its decisions based on that. So CFS is purely 'history-based'. - thus in CFS there's no concept of "deadline" either (the 'D' from EEVDF). - EEVDF seems to be calculating timeslices in units of milliseconds, while CFS follows a very strict 'precise' accounting scheme on the nanoseconds scale. - the EEVDF paper is also silent on SMP issues. - it seems EEVDF never existed as a kernel scheduler, it was a user-space prototype under FreeBSD with simulated workloads. (have they released that code at all?). 
The main common ground seems to be that both CFS and EEVDF share the view that the central metric is 'virtual time' proportional to the load of the CPU (called the 'fair clock' in CFS) - but even for this the details of the actual mechanism differ: EEVDF uses 1/N while CFS (since -v8) uses a precise, smoothed and weighted load average that is close to (and reuses portions of) Peter Williams's load metric used in smp-nice. The EEVDF mechanism could perhaps be more appropriate for real-time systems (the main target of their paper), while the CFS one i believe is more appropriate for general purpose workloads. So i'd say there's more in common between SD and CFS than between EEVDF and CFS. So ... it would certainly be interesting to try the EEVDF paper based on CFS (or whatever way you'd like to try it) and turn the EEVDF paper into a real kernel scheduler - the two mechanisms are quite dissimilar and they could behave wildly differently on various workloads. Depending on test results we could use bits of EEVDF's approaches in CFS too, if it manages to out-schedule CFS :-) (your observation about CFS's fork handling is correct nevertheless!) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
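For readers comparing the two designs, the EEVDF selection rule itself is compact. Below is a minimal userspace sketch with an invented task record (the paper's virtual deadline is vd = ve + r/w for timeslice r and weight w); this is an illustration of the paper's rule, not CFS or kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record for illustrating EEVDF; the struct and field
 * names are made up.  ve is the virtual time at which the task becomes
 * eligible, vd its virtual deadline (ve + timeslice/weight). */
struct eevdf_task {
	double ve;	/* eligible virtual time */
	double vd;	/* virtual deadline */
};

/* Earliest Eligible Virtual Deadline First: among the tasks that are
 * already eligible at virtual time V, pick the one with the earliest
 * virtual deadline.  Returns NULL if nothing is eligible yet. */
static struct eevdf_task *pick_eevdf(struct eevdf_task *t, int n, double V)
{
	struct eevdf_task *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (t[i].ve > V)	/* not yet eligible */
			continue;
		if (!best || t[i].vd < best->vd)
			best = &t[i];
	}
	return best;
}
```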
* Re: [patch] CFS scheduler, -v8 2007-05-03 8:50 ` Ingo Molnar @ 2007-05-03 14:26 ` Srivatsa Vaddagiri 2007-05-03 15:19 ` Ting Yang 2007-05-03 15:02 ` Ting Yang 1 sibling, 1 reply; 75+ messages in thread From: Srivatsa Vaddagiri @ 2007-05-03 14:26 UTC (permalink / raw) To: Ingo Molnar; +Cc: Ting Yang, linux-kernel On Thu, May 03, 2007 at 10:50:15AM +0200, Ingo Molnar wrote: > - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they > know the length of work units This is what I was thinking when I wrote earlier that EEVDF expects each task to specify the "length of each new request" (http://lkml.org/lkml/2007/5/2/339). The other observation that I have of EEVDF is that it tries to be fair in the virtual time scale (each client will get 'wi' real units in 1 virtual unit), whereas sometimes fairness in the real-time scale also matters. For ex: a multi-media app would call the scheduler fair if it recvs at least 1 ms of cpu time every 10 *real* milliseconds (frame-time). A rogue user (or workload) that does a fork bomb may skew this fairness for that multi-media app in the real-time scale under EEVDF? -- Regards, vatsa ^ permalink raw reply [flat|nested] 75+ messages in thread
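The fork-bomb concern above reduces to simple weight arithmetic: under proportional sharing a task's real-time share is w_i / Σw, so anything that inflates Σw dilutes a fixed real-time requirement. A toy illustration with invented numbers:

```c
#include <assert.h>

/* A weight-proportional scheduler gives task i the real-time fraction
 * w_i / total_weight.  E.g. a media task needing 1ms per 10ms frame
 * needs a 0.10 share; a fork bomb that inflates total_weight pushes
 * its share below that, even though virtual-time fairness still holds. */
static double real_share(double wi, double total_weight)
{
	return wi / total_weight;
}
```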
* Re: [patch] CFS scheduler, -v8 2007-05-03 14:26 ` Srivatsa Vaddagiri @ 2007-05-03 15:19 ` Ting Yang 0 siblings, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 15:19 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Srivatsa Vaddagiri wrote: > On Thu, May 03, 2007 at 10:50:15AM +0200, Ingo Molnar wrote: > >> - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they >> know the length of work units >> > > This is what I was thinking when I wrote earlier that EEVDF expects each > task will specify "length of each new request" > (http://lkml.org/lkml/2007/5/2/339). >
This is not quite true based on my understanding of EEVDF; please see the email I just sent to Ingo for an explanation. > The other observation that I have of EEVDF is that it tries to be fair > in the virtual time scale (each client will get 'wi' real units in 1 > virtual unit), whereas sometimes fairness in real-time scale also > matters? > For ex: a multi-media app would call scheduler fair to it, it it recvs > atleast 1 ms cpu time every 10 *real* milleseconds (frame-time). A rogue > user (or workload) that does a fork bomb may skew this fairness for that > multi-media app in real-time scale under EEVDF? >
First of all, CFS does not seem to address this issue either. This is a typical real-time or soft real-time question: not only does the bandwidth of a task have to be fixed, i.e. 10% of cpu bandwidth (which proportional-share systems like CFS and EEVDF do not do), but the work also needs to satisfy a deadline. In both CFS and EEVDF, the scheduler would have to keep tweaking weights to give an application a fixed bandwidth. The authors of EEVDF claim this can be done, but it was never implemented :-( Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-03 8:50 ` Ingo Molnar 2007-05-03 14:26 ` Srivatsa Vaddagiri @ 2007-05-03 15:02 ` Ting Yang 1 sibling, 0 replies; 75+ messages in thread From: Ting Yang @ 2007-05-03 15:02 UTC (permalink / raw) To: Ingo Molnar; +Cc: linux-kernel Ingo Molnar wrote: > * Ting Yang <tingy@cs.umass.edu> wrote: > > >> Authors of this paper proposed a scheduler: Earliest Eligible Virtual >> Deadline First (EEVDF). EEVDF uses exactly the same method as CFS to >> track the execution of each running task. The only difference between >> EEVDF and CFS is that EEVDF tries to _deadline_ fair while CFS is >> _start-time_ fair. [...] >> > > Well, this is a difference but note that it's far from being the 'only > difference' between CFS and EEVDF: > >
I re-ordered the comments a little bit to make my discussion easier: > - in CFS you have to "earn" your right to be on the CPU, while EEVDF > gives out timeslices (quanta) > > - thus in CFS there's no concept of "deadline" either (the 'D' from > EEVDF). >
These two are needed to implement the deadline-fair policy, and are therefore covered by the difference I mentioned. Because the authors use different terms, the paper can cause some confusion; 3 important concepts are needed here: - the scheduling quantum *q* in the paper: it refers to the minimum unit in which a scheduler can schedule tasks. With 1000 system ticks per second, q is 1ms. With 100 system ticks per second (old linux), q is 10ms. - the length of a request submitted by a task: it refers to the amount of work needed to process a request. EEVDF does _not_ need this information for scheduling at all. - the timeslice used to process requests, denoted as *r* in the paper: it refers to the unit that the scheduler uses to process a request. If the length of a request is larger than the timeslice r, the request is effectively divided into pieces of length r. The scheduler can use a suitable timeslice at will to handle a process.
Using the timeslice, it can estimate the deadline based on the weight of a task, which tells how urgent the task is, and therefore removes the potential O(n) (but it comes with a cost: more context switches). > - EEVDF concentrates on real-time (SCHED_RR-alike) workloads where they > know the length of work units That is _not_ really the case, although the authors of this paper used a quite aggressive title and claimed so. Look carefully at Theorem 3 and its Corollary on pg 16; it says: if no request of client k is larger than a time quantum, then at any time t the lag is bounded by -q < lag_k(t) < q It gives three facts: 1. For the system to be hard real-time, the request length (note: not the quantum q or the timeslice r) of any task at any time must be less than the scheduler quantum, which in modern Linux is 1ms. This makes EEVDF effectively useless for hard real-time systems, where applications cannot miss deadlines. 2. For the system to be soft real-time: EEVDF first needs to fix the bandwidth of a task, so that it does not vary when tasks join and leave. Then EEVDF can give a soft real-time guarantee with a bounded delay of max(timeslice r). The authors mentioned this, but unfortunately the feature was never actually implemented. 3. EEVDF does give a proportional time-share system in which no task running on it is delayed longer than max(timeslice r) w.r.t. the ideal fluid-flow system. The authors of this paper only developed a framework that could possibly be used by real-time systems (either hard or soft); their main attempt was to glue real-time and proportional time-share together. The real-time part of this paper does not really ring a bell for me. However, isn't fact 3 listed above exactly what we need? > - while CFS does not need any knowledge > about 'future work', it measures 'past behavior' and makes its > decisions based on that. So CFS is purely 'history-based'. 
>
As I explained above, EEVDF does not need any future knowledge; instead it uses a unit (the timeslice) to process requests from tasks, predicting the deadline by which this unit has to be finished. There is a firm theoretical basis behind this "guess", which did not appear in the paper: as long as the total bandwidth needed by all active tasks does not exceed the bandwidth of the CPU, the predicted virtual deadline, when mapped back to a real time t in the system, is exactly the deadline for the chosen unit in the ideal system. Since EEVDF is proportional-share, the bandwidth of each task is scaled based on its weight, and the sum of them does not exceed the cpu capacity. Therefore EEVDF's "guess" is always right :-) Predicting the future with the weight taken into account makes it choose better tasks to execute. A smaller timeslice makes it stick closer to the ideal system, at the cost of more context switches. > - EEVDF seems to be calculating timeslices in units of milliseconds, > while CFS follows a very strict 'precise' accounting scheme on the > nanoseconds scale. >
This is just an implementation issue, right? It is possible to implement EEVDF at a finer granularity. > - the EEVDF paper is also silent on SMP issues. >
Yes, you are right. I do not have much knowledge about load balancing. > - it seems EEVDF never existed as a kernel scheduler, it was a > user-space prototype under FreeBSD with simulated workloads. (have > they released that code at all?). >
They never released the code (quite typical behavior of systems researchers :-)), and that is why I asked if it was worth a try. 
> The main common ground seems to be that both CFS and EEVDF share the > view that the central metric is 'virtual time' proportional to the load > of the CPU (called the 'fair clock' in CFS) - but even for this the > details of the actual mechanism differ: EEVDF uses 1/N while CFS (since > -v8) uses a precise, smoothed and weighted load average that is close to > (and reuses portions of) Peter Williams's load metric used in smp-nice. > > Thanks a lot for giving your opinions, very nice discussing with you. Ting ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-01 21:22 Ingo Molnar 2007-05-02 2:57 ` Ting Yang @ 2007-05-02 6:37 ` Mike Galbraith 2007-05-02 6:45 ` Ingo Molnar 2007-05-02 8:03 ` Gene Heskett 2007-05-02 7:59 ` Mike Galbraith ` (3 subsequent siblings) 5 siblings, 2 replies; 75+ messages in thread From: Mike Galbraith @ 2007-05-02 6:37 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: > i'm pleased to announce release -v8 of the CFS scheduler patchset. (The > main goal of CFS is to implement "desktop scheduling" with as high > quality as technically possible.) ... > As usual, any sort of feedback, bugreport, fix and suggestion is more > than welcome, Greetings, I noticed a (harmless) bounds warning triggered by the reduction in size of array->bitmap. Patchlet below. -Mike

  CC      kernel/sched.o
kernel/sched_rt.c: In function ‘load_balance_start_rt’:
include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds
kernel/sched_rt.c: In function ‘pick_next_task_rt’:
include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds

--- linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h.org	2007-05-02 07:16:47.000000000 +0200
+++ linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h	2007-05-02 07:20:45.000000000 +0200
@@ -6,28 +6,23 @@
 
 /*
  * Every architecture must define this function. It's the fastest
- * way of searching a 140-bit bitmap where the first 100 bits are
- * unlikely to be set. It's guaranteed that at least one of the 140
- * bits is cleared.
+ * way of searching a 100-bit bitmap. It's guaranteed that at least
+ * one of the 100 bits is cleared.
  */
 static inline int sched_find_first_bit(const unsigned long *b)
 {
 #if BITS_PER_LONG == 64
-	if (unlikely(b[0]))
+	if (b[0])
 		return __ffs(b[0]);
-	if (likely(b[1]))
-		return __ffs(b[1]) + 64;
-	return __ffs(b[2]) + 128;
+	return __ffs(b[1]) + 64;
 #elif BITS_PER_LONG == 32
-	if (unlikely(b[0]))
+	if (b[0])
 		return __ffs(b[0]);
-	if (unlikely(b[1]))
+	if (b[1])
 		return __ffs(b[1]) + 32;
-	if (unlikely(b[2]))
+	if (b[2])
 		return __ffs(b[2]) + 64;
-	if (b[3])
-		return __ffs(b[3]) + 96;
-	return __ffs(b[4]) + 128;
+	return __ffs(b[3]) + 96;
 #else
 #error BITS_PER_LONG not defined
 #endif

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 6:37 ` Mike Galbraith @ 2007-05-02 6:45 ` Ingo Molnar 2007-05-02 8:03 ` Gene Heskett 1 sibling, 0 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 6:45 UTC (permalink / raw) To: Mike Galbraith Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh * Mike Galbraith <efault@gmx.de> wrote: > > As usual, any sort of feedback, bugreport, fix and suggestion is > > more than welcome, > > Greetings, > > I noticed a (harmless) bounds warning triggered by the reduction in > size of array->bitmap. Patchlet below. thanks, applied! Your patch should also speed up task selection of RT tasks a bit. (the patch removes ~40 bytes of code). And on 64-bit we now fit into 2x 64-bit bitmap and thus are now down to two Find-First-Set instructions. Nice :) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
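To make the "two Find-First-Set instructions" point concrete, here is a userspace analogue of the patched 64-bit branch — illustrative only, assuming a 64-bit unsigned long and using GCC's __builtin_ctzl as a stand-in for the kernel's __ffs:

```c
#include <assert.h>

/* With the bitmap reduced from 140 to 100 bits, priority p lives in
 * word p / 64 at bit p % 64, so two 64-bit words cover the whole range
 * (e.g. prio 99 -> word 1, bit 35). */
#define PRIO_WORD(p) ((p) / 64)
#define PRIO_BIT(p)  ((p) % 64)

/* Userspace sketch of the patched sched_find_first_bit() for
 * BITS_PER_LONG == 64: at most two find-first-set operations. */
static int find_first_prio(const unsigned long b[2])
{
	if (b[0])
		return __builtin_ctzl(b[0]);
	return __builtin_ctzl(b[1]) + 64;
}
```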
* Re: [patch] CFS scheduler, -v8 2007-05-02 6:37 ` Mike Galbraith 2007-05-02 6:45 ` Ingo Molnar @ 2007-05-02 8:03 ` Gene Heskett 2007-05-02 8:12 ` Mike Galbraith 2007-05-02 8:13 ` Ingo Molnar 1 sibling, 2 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:03 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Mike Galbraith wrote: >On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: >> i'm pleased to announce release -v8 of the CFS scheduler patchset. (The >> main goal of CFS is to implement "desktop scheduling" with as high >> quality as technically possible.) > >... > >> As usual, any sort of feedback, bugreport, fix and suggestion is more >> than welcome, > >Greetings, > >I noticed a (harmless) bounds warning triggered by the reduction in size >of array->bitmap. Patchlet below. > > -Mike I just checked my logs, and it appears my workload didn't trigger this one Mike. And so far, v8 is working great here. And that great is in my best "Tony the Tiger" voice, stolen shamelessly from the breakfast cereal tv commercial of 30+ years ago. :) Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100 is the best of the breed. I'll give this one a 100 till something better shows up. 
>  CC      kernel/sched.o
>kernel/sched_rt.c: In function ‘load_balance_start_rt’:
>include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds
>kernel/sched_rt.c: In function ‘pick_next_task_rt’:
>include/asm-generic/bitops/sched.h:30: warning: array subscript is above array bounds
>
>--- linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h.org	2007-05-02 07:16:47.000000000 +0200
>+++ linux-2.6.21-cfs.v8/include/asm-generic/bitops/sched.h	2007-05-02 07:20:45.000000000 +0200
>@@ -6,28 +6,23 @@
> 
> /*
>  * Every architecture must define this function. It's the fastest
>- * way of searching a 140-bit bitmap where the first 100 bits are
>- * unlikely to be set. It's guaranteed that at least one of the 140
>- * bits is cleared.
>+ * way of searching a 100-bit bitmap. It's guaranteed that at least
>+ * one of the 100 bits is cleared.
>  */
> static inline int sched_find_first_bit(const unsigned long *b)
> {
> #if BITS_PER_LONG == 64
>-	if (unlikely(b[0]))
>+	if (b[0])
> 		return __ffs(b[0]);
>-	if (likely(b[1]))
>-		return __ffs(b[1]) + 64;
>-	return __ffs(b[2]) + 128;
>+	return __ffs(b[1]) + 64;
> #elif BITS_PER_LONG == 32
>-	if (unlikely(b[0]))
>+	if (b[0])
> 		return __ffs(b[0]);
>-	if (unlikely(b[1]))
>+	if (b[1])
> 		return __ffs(b[1]) + 32;
>-	if (unlikely(b[2]))
>+	if (b[2])
> 		return __ffs(b[2]) + 64;
>-	if (b[3])
>-		return __ffs(b[3]) + 96;
>-	return __ffs(b[4]) + 128;
>+	return __ffs(b[3]) + 96;
> #else
> #error BITS_PER_LONG not defined
> #endif

-- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) I met my latest girl friend in a department store. She was looking at clothes, and I was putting Slinkys on the escalators. -- Steven Wright

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:03 ` Gene Heskett @ 2007-05-02 8:12 ` Mike Galbraith 2007-05-02 8:48 ` Gene Heskett 2007-05-02 8:13 ` Ingo Molnar 1 sibling, 1 reply; 75+ messages in thread From: Mike Galbraith @ 2007-05-02 8:12 UTC (permalink / raw) To: Gene Heskett Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wed, 2007-05-02 at 04:03 -0400, Gene Heskett wrote: > I just checked my logs, and it appears my workload didn't trigger this one > Mike. It's just a build time compiler warning. > Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100 is > the best of the breed. I'll give this one a 100 till something better shows > up. Ditto. (so far... ya never know;) -Mike ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:12 ` Mike Galbraith @ 2007-05-02 8:48 ` Gene Heskett 0 siblings, 0 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:48 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Mike Galbraith wrote: >On Wed, 2007-05-02 at 04:03 -0400, Gene Heskett wrote: >> I just checked my logs, and it appears my workload didn't trigger this one >> Mike. > >It's just a build time compiler warning. Duh. I have a couple of pages of "may be used uninitialized" warnings. Including one in serial.c for the raw channel. And I have a bit of trouble there too. Related? Dunno. >> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and 100 >> is the best of the breed. I'll give this one a 100 till something better >> shows up. > >Ditto. (so far... ya never know;) > > -Mike Yup, this is sweet so far. :-) -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The vulcan-death-grip ping has been applied. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:03 ` Gene Heskett 2007-05-02 8:12 ` Mike Galbraith @ 2007-05-02 8:13 ` Ingo Molnar 2007-05-02 8:51 ` Gene Heskett 1 sibling, 1 reply; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 8:13 UTC (permalink / raw) To: Gene Heskett Cc: Mike Galbraith, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh * Gene Heskett <gene.heskett@gmail.com> wrote: > > I noticed a (harmless) bounds warning triggered by the reduction in > > size of array->bitmap. Patchlet below. > > I just checked my logs, and it appears my workload didn't trigger this > one Mike. [...] yeah: this is a build-time warning and it needs a newer/smarter gcc to notice that provably redundant piece of code. It's a harmless thing - but nevertheless Mike's fix is a nice little micro-optimization as well: it always bothered me a bit that at 140 priority levels we were _just_ past the 128 bits boundary by 12 bits. Now on 64-bit boxes it's just two 64-bit words to cover all 100 priority levels of RT tasks. > [...] And so far, v8 is working great here. And that great is in my > best "Tony the Tiger" voice, stolen shamelessly from the breakfast > cereal tv commercial of 30+ years ago. :) heh :-) > Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and > 100 is the best of the breed. I'll give this one a 100 till something > better shows up. nice - and you arent even using any OpenGL games ;) The 0-100 rating is really useful to me so that i can see the impact of regressions (if any) and it's also one single number representing the subjective impression - that way it's easier to keep tab of things. btw., do you still renice kmail slightly, or does it now work out of box with default nice 0? Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 8:13 ` Ingo Molnar @ 2007-05-02 8:51 ` Gene Heskett 0 siblings, 0 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:51 UTC (permalink / raw) To: Ingo Molnar Cc: Mike Galbraith, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Ingo Molnar wrote: >* Gene Heskett <gene.heskett@gmail.com> wrote: >> > I noticed a (harmless) bounds warning triggered by the reduction in >> > size of array->bitmap. Patchlet below. >> >> I just checked my logs, and it appears my workload didn't trigger this >> one Mike. [...] > >yeah: this is a build-time warning and it needs a newer/smarter gcc to >notice that provably redundant piece of code. It's a harmless thing - >but nevertheless Mike's fix is a nice little micro-optimization as well: >it always bothered me a bit that at 140 priority levels we were _just_ >past the 128 bits boundary by 12 bits. Now on 64-bit boxes it's just two >64-bit words to cover all 100 priority levels of RT tasks. > >> [...] And so far, v8 is working great here. And that great is in my >> best "Tony the Tiger" voice, stolen shamelessly from the breakfast >> cereal tv commercial of 30+ years ago. :) > >heh :-) > >> Ingo asked for a 0-100 rating, where 0 is mainline as I recall it, and >> 100 is the best of the breed. I'll give this one a 100 till something >> better shows up. > >nice - and you arent even using any OpenGL games ;) > >The 0-100 rating is really useful to me so that i can see the impact of >regressions (if any) and it's also one single number representing the >subjective impression - that way it's easier to keep tab of things. > >btw., do you still renice kmail slightly, or does it now work out of box >with default nice 0? > > Ingo For this last couple of boots, its "right out of the box" and isn't getting under my skin. 
A make -j4 didn't bother it either. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) The goys have proven the following theorem... -- Physicist John von Neumann, at the start of a classroom lecture. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-01 21:22 Ingo Molnar 2007-05-02 2:57 ` Ting Yang 2007-05-02 6:37 ` Mike Galbraith @ 2007-05-02 7:59 ` Mike Galbraith 2007-05-02 8:11 ` Gene Heskett 2007-05-02 10:40 ` Ingo Molnar 2007-05-02 9:08 ` Balbir Singh ` (2 subsequent siblings) 5 siblings, 2 replies; 75+ messages in thread From: Mike Galbraith @ 2007-05-02 7:59 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: > - interactivity: precise load calculation and load smoothing This seems to help quite a bit.

(5 second top sample)

 2636 root      15  -5 19148  15m 5324 R   73  1.5   1:42.29 0 amarok_libvisua
 5440 root      20   0  320m  36m 8388 S   18  3.6   3:28.55 1 Xorg
 4621 root      20   0 22776  18m 4168 R   12  1.8   0:00.63 1 cc1
 4616 root      20   0 19456  13m 2200 R    9  1.3   0:00.43 0 cc1

I no longer have to renice both X and Gforce to achieve a perfect display when they are sharing my box with a make -j2. X is displaying everything it's being fed beautifully with no help. I have to renice Gforce (amarok_libvisual), but given its very heavy CPU usage, that seems perfectly fine. No regressions noticed so far. Box is _very_ responsive under load, seemingly even more so than with previous releases. That is purely subjective, but the first impression was very distinct. -Mike ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 7:59 ` Mike Galbraith @ 2007-05-02 8:11 ` Gene Heskett 2007-05-02 10:40 ` Ingo Molnar 1 sibling, 0 replies; 75+ messages in thread From: Gene Heskett @ 2007-05-02 8:11 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Mark Lord, Zach Carter, buddabrod, Balbir Singh On Wednesday 02 May 2007, Mike Galbraith wrote: >On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: >> - interactivity: precise load calculation and load smoothing > >This seems to help quite a bit. > >(5 second top sample) > > 2636 root 15 -5 19148 15m 5324 R 73 1.5 1:42.29 0 > amarok_libvisua 5440 root 20 0 320m 36m 8388 S 18 3.6 3:28.55 > 1 Xorg 4621 root 20 0 22776 18m 4168 R 12 1.8 0:00.63 1 cc1 > 4616 root 20 0 19456 13m 2200 R 9 1.3 0:00.43 0 cc1 > >I no longer have to renice both X and Gforce to achieve a perfect >display when they are sharing my box with a make -j2. X is displaying >everything it's being fed beautifully with no help. I have to renice >Gforce (amarok_libvisual), but given it's very heavy CPU usage, that >seems perfectly fine. > >No regressions noticed so far. Box is _very_ responsive under load, >seemingly even more so than with previous releases. That is purely >subjective, but the first impression was very distinct. > > -Mike And I have a make -j4 running to apply your patch Mike, and can't tell it from here, no lag. This is another keeper folks. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) It takes less time to do a thing right than it does to explain why you did it wrong. -- H.W. Longfellow ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-02 7:59 ` Mike Galbraith 2007-05-02 8:11 ` Gene Heskett @ 2007-05-02 10:40 ` Ingo Molnar 1 sibling, 0 replies; 75+ messages in thread From: Ingo Molnar @ 2007-05-02 10:40 UTC (permalink / raw) To: Mike Galbraith Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh * Mike Galbraith <efault@gmx.de> wrote: > On Tue, 2007-05-01 at 23:22 +0200, Ingo Molnar wrote: > > - interactivity: precise load calculation and load smoothing > > This seems to help quite a bit. great :) > (5 second top sample) > > 2636 root 15 -5 19148 15m 5324 R 73 1.5 1:42.29 0 amarok_libvisua > 5440 root 20 0 320m 36m 8388 S 18 3.6 3:28.55 1 Xorg > 4621 root 20 0 22776 18m 4168 R 12 1.8 0:00.63 1 cc1 > 4616 root 20 0 19456 13m 2200 R 9 1.3 0:00.43 0 cc1 > > I no longer have to renice both X and Gforce to achieve a perfect > display when they are sharing my box with a make -j2. X is displaying > everything it's being fed beautifully with no help. I have to renice > Gforce (amarok_libvisual), but given it's very heavy CPU usage, that > seems perfectly fine. ah! Besides OpenGL behavior and app startup performance i didnt originally have Xorg in mind with this change, but thinking about it, precise load calculations and load smoothing does have a positive effect on 'coupled' workloads where under the previous variant of CFS's load calculation one task component of the workload could become 'invisible' to another task and hence cause macro-scale scheduling artifacts not expected by humans. With smoothing these are dealt with more consistently. Xorg can be a quite strongly coupled workload. > No regressions noticed so far. Box is _very_ responsive under load, > seemingly even more so than with previous releases. That is purely > subjective, but the first impression was very distinct. 
yeah, make -jN workloads (and any mixture of non-identical scheduling patterns, which most real workloads consist of) should be handled more consistently too by -v8. so your workload (and Gene's workload) were the ones in fact that i had hoped not to _hurt_ with -v8 (neither of you being the hardcore gamer type ;), and in reality -v8 ended up helping them too. These sorts of side-effects are always welcome ;-) Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
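The "load smoothing" discussed in this exchange can be illustrated with a generic exponentially-decaying average — a toy sketch with an invented decay factor, not CFS's actual code:

```c
#include <assert.h>

/* Fold the current instantaneous runqueue load into a running average
 * each tick: new = (3*old + current) / 4.  Short spikes are damped,
 * while sustained load converges to its true value, so one task's load
 * cannot briefly become 'invisible' to another between samples.  The
 * 3/4 decay factor is illustrative only. */
static unsigned long smooth_load(unsigned long avg, unsigned long cur)
{
	return (3 * avg + cur) / 4;
}
```

A spike from 0 to 400 registers as only 100 after one tick, while a steady 400 stays at 400.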
* Re: [patch] CFS scheduler, -v8 2007-05-01 21:22 Ingo Molnar ` (2 preceding siblings ...) 2007-05-02 7:59 ` Mike Galbraith @ 2007-05-02 9:08 ` Balbir Singh 2007-05-02 10:05 ` Ingo Molnar 2007-05-02 12:58 ` Mark Lord 2007-05-02 12:58 ` Vegard Nossum 5 siblings, 1 reply; 75+ messages in thread From: Balbir Singh @ 2007-05-02 9:08 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod, Balbir Singh Ingo Molnar wrote: > Changes since -v7: > > - powerpc debug output and build warning fixes (Balbir Singh) > > - documentation fixes (Zach Carter) > > - interactivity: precise load calculation and load smoothing > > As usual, any sort of feedback, bugreport, fix and suggestion is more > than welcome, > > Ingo Hi, Ingo, I would like to report, what I think is a regression with -v8. With -v7 I would run the n/n+1 test. Basically on a system with n cpus, I would run n+1 tasks and see how their load is distributed. I usually find that the last two tasks would get stuck on one CPU on the system and would get half the cpu time as their other peers. I think this issue has been around for long even before CFS. But while I was investigating that, I found that with -v8, all the n+1 tasks are stuck on the same cpu. 
Output of /proc/sched_debug

# cat /proc/sched_debug
Sched Debug Version: v0.02
now at 1507287574145 nsecs

cpu: 0
  .nr_running            : 0
  .raw_weighted_load     : 0
  .nr_switches           : 111130
  .nr_load_updates       : 376821
  .nr_uninterruptible    : 18446744073709551550
  .next_balance          : 4295269119
  .curr->pid             : 0
  .clock                 : 7431167968202137
  .prev_clock_raw        : 7431167968202136
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 26969582038
  .prev_fair_clock       : 26969539422
  .exec_clock            : 9881536864
  .prev_exec_clock       : 9881494248
  .wait_runtime          : 116431647
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------

cpu: 1
  .nr_running            : 0
  .raw_weighted_load     : 0
  .nr_switches           : 56374
  .nr_load_updates       : 376767
  .nr_uninterruptible    : 156
  .next_balance          : 4295269118
  .curr->pid             : 0
  .clock                 : 7431167857161633
  .prev_clock_raw        : 7431167857161632
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 34038615236
  .prev_fair_clock       : 34038615236
  .exec_clock            : 18764126904
  .prev_exec_clock       : 18764126904
  .wait_runtime          : 132146856
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------

cpu: 2
  .nr_running            : 5
  .raw_weighted_load     : 5120
  .nr_switches           : 140351
  .nr_load_updates       : 376767
  .nr_uninterruptible    : 18446744073709551559
  .next_balance          : 4295269128
  .curr->pid             : 6462
  .clock                 : 7431167968695481
  .prev_clock_raw        : 7431167968695480
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 178895812434
  .prev_fair_clock       : 178895727748
  .exec_clock            : 858569069824
  .prev_exec_clock       : 858568528616
  .wait_runtime          : 2643237421
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------
 R bash  6462  178897659138  1846704  -1846958  19646  120  -178895812434  169799117688  135410790136
   bash  6461  178897934427  2121993  -7673376  19538  120  -5551118       169989747968  135499300276
   bash  6460  178898353788  2541354  -6492732  19608  120  -3951111       170136703840  135648219117
   bash  6459  178899921997  4109563  -6460948  19747  120  -2351093       170559324432  135812802778
   bash  6458  178901052918  5240484  -5991881  19756  120  -751111        171257975848  135805570391

cpu: 3
  .nr_running            : 1
  .raw_weighted_load     : 1024
  .nr_switches           : 43253
  .nr_load_updates       : 376767
  .nr_uninterruptible    : 18446744073709551583
  .next_balance          : 4295269180
  .curr->pid             : 7524
  .clock                 : 7431167970150081
  .prev_clock_raw        : 7431167970150080
  .clock_warps           : 0
  .clock_unstable_events : 0
  .clock_max_delta       : 0
  .fair_clock            : 24318712701
  .prev_fair_clock       : 24318712701
  .exec_clock            : 20098322728
  .prev_exec_clock       : 20098322728
  .wait_runtime          : 178370619
  .cpu_load[0]           : 0
  .cpu_load[1]           : 0
  .cpu_load[2]           : 0
  .cpu_load[3]           : 0
  .cpu_load[4]           : 0

runnable tasks:
 task   PID      tree-key        delta     waiting  switches  prio   wstart-fair       sum-exec       sum-wait
--------------------------------------------------------------------------------------------------------------
 R cat   7524   24318779730    67029    -67029         3  120  -24318712701        1661560          2277

Output of top

 6459 root      20   0  4912  792  252 R   20  0.0   8:29.33 bash
 6458 root      20   0  4912  792  252 R   20  0.0   8:29.90 bash
 6460 root      20   0  4912  792  252 R   20  0.0   8:28.94 bash
 6461 root      20   0  4912  792  252 R   20  0.0   8:28.88 bash
 6462 root      20   0  4912  792  252 R   20  0.0   8:28.54 bash

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02  9:08 ` Balbir Singh
@ 2007-05-02 10:05   ` Ingo Molnar
  2007-05-02 10:59     ` Balbir Singh
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2007-05-02 10:05 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Zach Carter, buddabrod

* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> With -v7 I would run the n/n+1 test. Basically on a system with n
> cpus, I would run n+1 tasks and see how their load is distributed. I
> usually find that the last two tasks would get stuck on one CPU on the
> system and would get half the cpu time as their other peers. I think
> this issue has been around for long even before CFS. But while I was
> investigating that, I found that with -v8, all the n+1 tasks are stuck
> on the same cpu.

i believe this problem is specific to powerpc - load is distributed fine
on i686/x86_64, and your sched_debug shows a cpu_load[0] == 0 on CPU#2
which is 'impossible'. (I sent a few suggestions off-Cc about how to
debug this.)

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 10:05 ` Ingo Molnar
@ 2007-05-02 10:59   ` Balbir Singh
  2007-05-02 11:17     ` Ingo Molnar
  0 siblings, 1 reply; 75+ messages in thread
From: Balbir Singh @ 2007-05-02 10:59 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Zach Carter, buddabrod

Ingo Molnar wrote:
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> With -v7 I would run the n/n+1 test. Basically on a system with n
>> cpus, I would run n+1 tasks and see how their load is distributed. I
>> usually find that the last two tasks would get stuck on one CPU on the
>> system and would get half the cpu time as their other peers. I think
>> this issue has been around for long even before CFS. But while I was
>> investigating that, I found that with -v8, all the n+1 tasks are stuck
>> on the same cpu.
>
> i believe this problem is specific to powerpc - load is distributed fine
> on i686/x86_64 and your sched_debug shows a cpu_load[0] == 0 on CPU#2
> which is 'impossible'. (I sent a few suggestions off-Cc about how to
> debug this.)
>
> 	Ingo

Hi, Ingo

The suggestions helped; here is a fix, tested on PowerPC only.

Patch and Description
=====================

Load balancing on PowerPC is broken. Running 5 tasks on a 4 cpu system
results in all 5 tasks running on the same CPU. Based on Ingo's
feedback, I instrumented and debugged update_load_fair().

The problem is with comparing s64 values against (s64)ULONG_MAX, which
evaluates to -1. We check if exec_delta64 and fair_delta64 are greater
than (s64)ULONG_MAX (-1); if so, we assign (s64)ULONG_MAX to the
respective values.

The fix is to compare these values against (s64)LONG_MAX and assign
(s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
than (s64)LONG_MAX.

Tested on PowerPC; the regression is gone, tasks are load balanced as
they were in v7.

Output of top

 5614 root      20   0  4912  784  252 R   52  0.0   3:27.49 3 bash
 5620 root      20   0  4912  784  252 R   47  0.0   3:07.38 2 bash
 5617 root      20   0  4912  784  252 R   47  0.0   3:08.18 0 bash
 5624 root      20   0  4912  784  252 R   26  0.0   1:42.97 1 bash
 5621 root      20   0  4912  784  252 R   26  0.0   1:43.14 1 bash

Tasks 5624 and 5621 getting half of their peer values is a separate
issue altogether.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 kernel/sched.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff -puN kernel/sched.c~cfs-fix-load-balancing-arith kernel/sched.c
--- linux-2.6.21/kernel/sched.c~cfs-fix-load-balancing-arith	2007-05-02 16:16:20.000000000 +0530
+++ linux-2.6.21-balbir/kernel/sched.c	2007-05-02 16:16:47.000000000 +0530
@@ -1533,19 +1533,19 @@ static void update_load_fair(struct rq *
 	this_rq->prev_exec_clock = this_rq->exec_clock;
 	WARN_ON_ONCE(exec_delta64 <= 0);
 
-	if (fair_delta64 > (s64)ULONG_MAX)
-		fair_delta64 = (s64)ULONG_MAX;
+	if (fair_delta64 > (s64)LONG_MAX)
+		fair_delta64 = (s64)LONG_MAX;
 	fair_delta = (unsigned long)fair_delta64;
 
-	if (exec_delta64 > (s64)ULONG_MAX)
-		exec_delta64 = (s64)ULONG_MAX;
+	if (exec_delta64 > (s64)LONG_MAX)
+		exec_delta64 = (s64)LONG_MAX;
 	exec_delta = (unsigned long)exec_delta64;
 
 	if (exec_delta > TICK_NSEC)
 		exec_delta = TICK_NSEC;
 	idle_delta = TICK_NSEC - exec_delta;
 
-	tmp = SCHED_LOAD_SCALE * exec_delta / fair_delta;
+	tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta;
 	tmp64 = (u64)tmp * (u64)exec_delta;
 	do_div(tmp64, TICK_NSEC);
 	this_load = (unsigned long)tmp64;
_

--
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 10:59 ` Balbir Singh
@ 2007-05-02 11:17   ` Ingo Molnar
  2007-05-05  8:31     ` Esben Nielsen
  0 siblings, 1 reply; 75+ messages in thread
From: Ingo Molnar @ 2007-05-02 11:17 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Zach Carter, buddabrod

* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> The problem is with comparing s64 values against (s64)ULONG_MAX, which
> evaluates to -1. We check if exec_delta64 and fair_delta64 are
> greater than (s64)ULONG_MAX (-1); if so, we assign (s64)ULONG_MAX to
> the respective values.

ah, indeed ...

> The fix is to compare these values against (s64)LONG_MAX and assign
> (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
> than (s64)LONG_MAX.
>
> Tested on PowerPC; the regression is gone, tasks are load balanced as
> they were in v7.

thanks, applied!

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 11:17 ` Ingo Molnar
@ 2007-05-05  8:31   ` Esben Nielsen
  2007-05-05 17:44     ` Linus Torvalds
  0 siblings, 1 reply; 75+ messages in thread
From: Esben Nielsen @ 2007-05-05 8:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Balbir Singh, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Wed, 2 May 2007, Ingo Molnar wrote:

> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>
>> The problem is with comparing s64 values against (s64)ULONG_MAX, which
>> evaluates to -1. We check if exec_delta64 and fair_delta64 are
>> greater than (s64)ULONG_MAX (-1), if so we assign (s64)ULONG_MAX to
>> the respective values.
>
> ah, indeed ...
>
>> The fix is to compare these values against (s64)LONG_MAX and assign
>> (s64)LONG_MAX to exec_delta64 and fair_delta64 if they are greater
>> than (s64)LONG_MAX.
>>
>> Tested on PowerPC, the regression is gone, tasks are load balanced as
>> they were in v7.
>
> thanks, applied!
>
> 	Ingo

I have been wondering why you use unsigned for timers anyway. It is also
like that in hrtimers. Why not use signed and avoid (almost) all worries
about wrap-around issues? The trick is that when every
	a < b
is replaced by
	a - b < 0
the code will work on all two's-complement machines even if the (signed!)
integers a and b wrap around.

In both the hrtimer and CFS patch, 32 bit timers could be used internally
on 32 bit architectures to avoid expensive 64 bit integer calculations.
The only constraint is that timeouts can not be bigger than 2^31 in the
chosen units.

I have successfully coded a (much more primitive) hrtimer system for
another OS on both ARM and PPC using the above trick in my former job. On
both I used the machine's internal clock as the internal representation
of time and only scaled to a struct timespec in the user interface.

	Esben

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-05  8:31 ` Esben Nielsen
@ 2007-05-05 17:44   ` Linus Torvalds
  2007-05-06  8:29     ` Ingo Molnar
  2007-05-07 11:09     ` Esben Nielsen
  1 sibling, 2 replies; 75+ messages in thread
From: Linus Torvalds @ 2007-05-05 17:44 UTC (permalink / raw)
To: Esben Nielsen
Cc: Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Sat, 5 May 2007, Esben Nielsen wrote:
>
> I have been wondering why you use unsigned for timers anyway. It is also
> like that in hrtimers. Why not use signed and avoid (almost) all worries
> about wrap-around issues? The trick is that when every
> 	a < b
> is replaced by
> 	a - b < 0
> the code will work on all two's-complement machines even if the (signed!)
> integers a and b wrap around.

No. BOTH of the above are buggy.

The C language definition doesn't allow signed integers to wrap (ie it's
undefined behaviour), so "a-b < 0" can be rewritten by the compiler as a
simple signed "a < b".

And the unsigned (or signed) "a < b" is just broken wrt any kind of
wrap-around (whether wrapping around zero or the sign bit).

So the _only_ valid way to handle timers is to
 - either not allow wrapping at all (in which case "unsigned" is better,
   since it is bigger)
 - or use wrapping explicitly, and use unsigned arithmetic (which is
   well-defined in C) and do something like "(long)(a-b) > 0".

Notice? The signed variant is basically _never_ correct.

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-05 17:44 ` Linus Torvalds
@ 2007-05-06  8:29   ` Ingo Molnar
  2007-05-06  8:36     ` Willy Tarreau
  2007-05-06 17:45     ` Linus Torvalds
  1 sibling, 2 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-06 8:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Esben Nielsen, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So the _only_ valid way to handle timers is to
>  - either not allow wrapping at all (in which case "unsigned" is better,
>    since it is bigger)
>  - or use wrapping explicitly, and use unsigned arithmetic (which is
>    well-defined in C) and do something like "(long)(a-b) > 0".

hm, there is a corner-case in CFS where a fix like this is necessary.

CFS uses 64-bit values for almost everything, and the majority of values
are of 'relative' nature with no danger of overflow. (They are signed
because they are relative values that center around zero and can be
negative or positive.)

there is one value that is of 'absolute' nature though: the 'fair clock'
(which is the same as the normal system clock when load is 1.0, and it
slows down in a load-proportional way as load increases), which has
units of nanoseconds - and the 'keys' in the rbtree are derived from
such clock values.

That particular clock could produce an s64 overflow at around 292 years
after bootup (a bit later if the system is also used for something in
that timeframe). While i dont think that's realistic to worry about, the
fix below should solve that theoretical problem too.

	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -60,7 +60,7 @@ static inline void __enqueue_task_fair(s
 		 * We dont care about collisions. Nodes with
 		 * the same key stay together.
 		 */
-		if (key < entry->fair_key) {
+		if ((s64)(entry->fair_key - key) > 0) {
 			link = &parent->rb_left;
 		} else {
 			link = &parent->rb_right;

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06  8:29 ` Ingo Molnar
@ 2007-05-06  8:36   ` Willy Tarreau
  2007-05-06  8:52     ` Ingo Molnar
  0 siblings, 1 reply; 75+ messages in thread
From: Willy Tarreau @ 2007-05-06 8:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Linus Torvalds, Esben Nielsen, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

Hi Ingo,

On Sun, May 06, 2007 at 10:29:11AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So the _only_ valid way to handle timers is to
> >  - either not allow wrapping at all (in which case "unsigned" is better,
> >    since it is bigger)
> >  - or use wrapping explicitly, and use unsigned arithmetic (which is
> >    well-defined in C) and do something like "(long)(a-b) > 0".
>
> hm, there is a corner-case in CFS where a fix like this is necessary.
>
> CFS uses 64-bit values for almost everything, and the majority of values
> are of 'relative' nature with no danger of overflow. (They are signed
> because they are relative values that center around zero and can be
> negative or positive.)
(...)
> -		if (key < entry->fair_key) {
> +		if ((s64)(entry->fair_key - key) > 0) {

Just a hint: while your code here is correct, it is a good practise to
check against < 0 instead, so that if for any reason you sometimes
forget to cast to signed, the compiler will emit a warning stating that
the condition is always false. This would simply become :

-		if (key < entry->fair_key) {
+		if ((s64)(key - entry->fair_key) < 0) {

Just my .02 euros :-)

Willy

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06  8:36 ` Willy Tarreau
@ 2007-05-06  8:52   ` Ingo Molnar
  0 siblings, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-06 8:52 UTC (permalink / raw)
To: Willy Tarreau
Cc: Linus Torvalds, Esben Nielsen, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

* Willy Tarreau <w@1wt.eu> wrote:

> Just a hint: while your code here is correct, it is a good practise to
> check against < 0 instead, so that if for any reason you sometimes
> forget to cast to signed, the compiler will emit a warning stating
> that the condition is always false. This would simply become :
>
> -		if (key < entry->fair_key) {
> +		if ((s64)(key - entry->fair_key) < 0) {

done :-) (Btw., the key is still s64 so the cast is technically
unnecessary - but now the code would be correct with the key being u64
too.)

	Ingo

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -60,7 +60,7 @@ static inline void __enqueue_task_fair(s
 		 * We dont care about collisions. Nodes with
 		 * the same key stay together.
 		 */
-		if (key < entry->fair_key) {
+		if ((s64)(key - entry->fair_key) < 0) {
 			link = &parent->rb_left;
 		} else {
 			link = &parent->rb_right;

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06  8:29 ` Ingo Molnar
  2007-05-06  8:36 ` Willy Tarreau
@ 2007-05-06 17:45   ` Linus Torvalds
  2007-05-07 11:30     ` Esben Nielsen
  1 sibling, 1 reply; 75+ messages in thread
From: Linus Torvalds @ 2007-05-06 17:45 UTC (permalink / raw)
To: Ingo Molnar
Cc: Esben Nielsen, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Sun, 6 May 2007, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So the _only_ valid way to handle timers is to
> >  - either not allow wrapping at all (in which case "unsigned" is better,
> >    since it is bigger)
> >  - or use wrapping explicitly, and use unsigned arithmetic (which is
> >    well-defined in C) and do something like "(long)(a-b) > 0".
>
> hm, there is a corner-case in CFS where a fix like this is necessary.
>
> CFS uses 64-bit values for almost everything, and the majority of values
> are of 'relative' nature with no danger of overflow. (They are signed
> because they are relative values that center around zero and can be
> negative or positive.)

Well, I'd like to just worry about that for a while.

You say there is "no danger of overflow", and I mostly agree that once
we're talking about 64-bit values, the overflow issue simply doesn't
exist, and furthermore the difference between 63 and 64 bits is not
really relevant, so there's no major reason to actively avoid signed
entries.

So in that sense, it all sounds perfectly sane. And I'm definitely not
sure your "292 years after bootup" worry is really worth even
considering.

When we're really so well off that we expect the hardware and software
stack to be stable over a hundred years, I'd start to think about issues
like that; in the meantime, to me worrying about those kinds of issues
just means that you're worrying about the wrong things.

BUT.

There's a fundamental reason relative timestamps are difficult and
almost always have overflow issues: the "long long in the future" case
as an approximation of "infinite timeout" is almost always relevant.

So rather than worry about the system staying up 292 years, I'd worry
about whether people pass in big numbers (like some MAX_S64
approximation) as an approximation for "infinite", and once you have
things like that, the "64 bits never overflows" argument is totally
bogus.

There's a damn good reason for using only *absolute* time. The whole
"signed values of relative time" may _sound_ good, but it really sucks
in subtle and horrible ways!

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-06 17:45 ` Linus Torvalds
@ 2007-05-07 11:30   ` Esben Nielsen
  2007-05-07 15:55     ` Ingo Molnar
  2007-05-07 16:11     ` Linus Torvalds
  2007-05-08  0:35     ` Peter Williams
  0 siblings, 3 replies; 75+ messages in thread
From: Esben Nielsen @ 2007-05-07 11:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Esben Nielsen, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
	Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Sun, 6 May 2007, Linus Torvalds wrote:

> On Sun, 6 May 2007, Ingo Molnar wrote:
>>
>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>> So the _only_ valid way to handle timers is to
>>>  - either not allow wrapping at all (in which case "unsigned" is better,
>>>    since it is bigger)
>>>  - or use wrapping explicitly, and use unsigned arithmetic (which is
>>>    well-defined in C) and do something like "(long)(a-b) > 0".
>>
>> hm, there is a corner-case in CFS where a fix like this is necessary.
>>
>> CFS uses 64-bit values for almost everything, and the majority of values
>> are of 'relative' nature with no danger of overflow. (They are signed
>> because they are relative values that center around zero and can be
>> negative or positive.)
>
> Well, I'd like to just worry about that for a while.
>
> You say there is "no danger of overflow", and I mostly agree that once
> we're talking about 64-bit values, the overflow issue simply doesn't
> exist, and furthermore the difference between 63 and 64 bits is not
> really relevant, so there's no major reason to actively avoid signed
> entries.
>
> So in that sense, it all sounds perfectly sane. And I'm definitely not
> sure your "292 years after bootup" worry is really worth even
> considering.

I would hate to tell mission control for Mankind's first mission to
another star to reboot every 200 years because "there is no need to
worry about it."

As a matter of principle an OS should never need a reboot (with
exception for upgrading). If you say you have to reboot every 200 years,
why not every 100? Every 50? .... Every 45 days (you know what I am
referring to :-) ?

> When we're really so well off that we expect the hardware and software
> stack to be stable over a hundred years, I'd start to think about issues
> like that; in the meantime, to me worrying about those kinds of issues
> just means that you're worrying about the wrong things.
>
> BUT.
>
> There's a fundamental reason relative timestamps are difficult and
> almost always have overflow issues: the "long long in the future" case
> as an approximation of "infinite timeout" is almost always relevant.
>
> So rather than worry about the system staying up 292 years, I'd worry
> about whether people pass in big numbers (like some MAX_S64
> approximation) as an approximation for "infinite", and once you have
> things like that, the "64 bits never overflows" argument is totally
> bogus.
>
> There's a damn good reason for using only *absolute* time. The whole
> "signed values of relative time" may _sound_ good, but it really sucks
> in subtle and horrible ways!

I think you are wrong here. The only place you need absolute time is
for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
representation when you could have used a 32 bit one. With a 32 bit
implementation you are forced to handle the corner cases with wrap
around and too-big arguments up front. With a 64 bit one you hide those
problems.

I think CFS would be best off using a 32 bit timer counting in
microseconds. That would wrap around in 72 minutes. But as the timers
are relative, you will never be able to specify a timer larger than 36
minutes in the future. 36 minutes is ridiculously long for a scheduler,
though, and a simple test limiting time values to that bound would not
break anything.

	Esben

> 		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 11:30 ` Esben Nielsen
@ 2007-05-07 15:55   ` Ingo Molnar
  0 siblings, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-07 15:55 UTC (permalink / raw)
To: Esben Nielsen
Cc: Linus Torvalds, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

* Esben Nielsen <nielsen.esben@googlemail.com> wrote:

> I would hate to tell mission control for Mankind's first mission to
> another star to reboot every 200 years because "there is no need to
> worry about it."

ok, i need no other argument - fixed :-) (btw., if anyone from the
planning committee of Mankind's first mission to another star has any
other particular wish about CFS's design, i'm all ears and it's going to
get the highest possible priority!)

i really mean it! (And i guess you discovered my weak spot. Darn :)
Hopefully they wont ask for STREAMS support, that would put us into a
_really_ difficult position.)

	Ingo

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 11:30 ` Esben Nielsen
  2007-05-07 15:55 ` Ingo Molnar
@ 2007-05-07 16:11   ` Linus Torvalds
  2007-05-08  0:35     ` Peter Williams
  0 siblings, 0 replies; 75+ messages in thread
From: Linus Torvalds @ 2007-05-07 16:11 UTC (permalink / raw)
To: Esben Nielsen
Cc: Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Mon, 7 May 2007, Esben Nielsen wrote:
>
> I would hate to tell mission control for Mankind's first mission to another
> star to reboot every 200 years because "there is no need to worry about it."

I'd like to put it another way:

 - if your mission to another star *depends* on every single piece of
   complex equipment staying up with zero reboots for 200+ years, you
   have some serious technology problems.

In other words: your argument is populist, and totally silly. Trust me,
if you are going to another star, you'd better have the capabilities to
handle bugs. You'd better have multiple fail-over etc.

A notion of "robustness" cannot and must not hinge on "no bugs". That's
unrealistic.

> As a matter of principle an OS should never need a reboot (with exception for
> upgrading). If you say you have to reboot every 200 years, why not every 100?
> Every 50? .... Every 45 days (you know what I am referring to :-) ?

There's something of a difference between 45 days and 292 years. I'm not
saying we can't get there, but worrying about it in the current state is
just stupid. It's not just looking at the trees and not seeing the
forest, it's looking at one *leaf*, and missing the forest.

And quite frankly, if you work for NASA and are aiming for the stars,
you'd better not be that person who looks at one leaf and cannot see the
forest around you. That's the kind of thing that makes you miss the
difference between miles and kilometers.

		Linus

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 11:30 ` Esben Nielsen
  2007-05-07 15:55 ` Ingo Molnar
  2007-05-07 16:11 ` Linus Torvalds
@ 2007-05-08  0:35   ` Peter Williams
  2007-05-08  9:05     ` Esben Nielsen
  2007-05-10 13:09     ` Pavel Machek
  2 siblings, 2 replies; 75+ messages in thread
From: Peter Williams @ 2007-05-08 0:35 UTC (permalink / raw)
To: Esben Nielsen
Cc: Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel,
	Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
	Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord, Zach Carter, buddabrod

Esben Nielsen wrote:
>
> On Sun, 6 May 2007, Linus Torvalds wrote:
>
>> On Sun, 6 May 2007, Ingo Molnar wrote:
>>>
>>> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>>
>>>> So the _only_ valid way to handle timers is to
>>>>  - either not allow wrapping at all (in which case "unsigned" is
>>>>    better, since it is bigger)
>>>>  - or use wrapping explicitly, and use unsigned arithmetic (which is
>>>>    well-defined in C) and do something like "(long)(a-b) > 0".
>>>
>>> hm, there is a corner-case in CFS where a fix like this is necessary.
>>>
>>> CFS uses 64-bit values for almost everything, and the majority of
>>> values are of 'relative' nature with no danger of overflow. (They are
>>> signed because they are relative values that center around zero and
>>> can be negative or positive.)
>>
>> Well, I'd like to just worry about that for a while.
>>
>> You say there is "no danger of overflow", and I mostly agree that once
>> we're talking about 64-bit values, the overflow issue simply doesn't
>> exist, and furthermore the difference between 63 and 64 bits is not
>> really relevant, so there's no major reason to actively avoid signed
>> entries.
>>
>> So in that sense, it all sounds perfectly sane. And I'm definitely not
>> sure your "292 years after bootup" worry is really worth even
>> considering.
>
> I would hate to tell mission control for Mankind's first mission to
> another star to reboot every 200 years because "there is no need to
> worry about it."
>
> As a matter of principle an OS should never need a reboot (with
> exception for upgrading). If you say you have to reboot every 200
> years, why not every 100? Every 50? .... Every 45 days (you know what
> I am referring to :-) ?

There's always going to be an upper limit on the representation of time.
At least until we figure out how to implement infinity properly.

>> When we're really so well off that we expect the hardware and software
>> stack to be stable over a hundred years, I'd start to think about
>> issues like that; in the meantime, to me worrying about those kinds of
>> issues just means that you're worrying about the wrong things.
>>
>> BUT.
>>
>> There's a fundamental reason relative timestamps are difficult and
>> almost always have overflow issues: the "long long in the future" case
>> as an approximation of "infinite timeout" is almost always relevant.
>>
>> So rather than worry about the system staying up 292 years, I'd worry
>> about whether people pass in big numbers (like some MAX_S64
>> approximation) as an approximation for "infinite", and once you have
>> things like that, the "64 bits never overflows" argument is totally
>> bogus.
>>
>> There's a damn good reason for using only *absolute* time. The whole
>> "signed values of relative time" may _sound_ good, but it really sucks
>> in subtle and horrible ways!
>
> I think you are wrong here. The only place you need absolute time is
> for the clock (CLOCK_REALTIME). You waste CPU using a 64 bit
> representation when you could have used a 32 bit one. With a 32 bit
> implementation you are forced to handle the corner cases with wrap
> around and too-big arguments up front. With a 64 bit one you hide
> those problems.

As does the other method. A 32 bit signed offset with a 32 bit base is
just a crude version of 64 bit absolute time.

> I think CFS would be best off using a 32 bit timer counting in
> microseconds. That would wrap around in 72 minutes. But as the timers
> are relative, you will never be able to specify a timer larger than 36
> minutes in the future. 36 minutes is ridiculously long for a scheduler,
> though, and a simple test limiting time values to that bound would not
> break anything.

Except if you're measuring sleep times. I think that you'll find lots
of tasks sleep for more than 72 minutes.

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-08 0:35 ` Peter Williams @ 2007-05-08 9:05 ` Esben Nielsen 2007-05-09 0:01 ` Peter Williams 2007-05-10 13:09 ` Pavel Machek 1 sibling, 1 reply; 75+ messages in thread From: Esben Nielsen @ 2007-05-08 9:05 UTC (permalink / raw) To: Peter Williams Cc: Esben Nielsen, Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Tue, 8 May 2007, Peter Williams wrote: > Esben Nielsen wrote: >> >> >> On Sun, 6 May 2007, Linus Torvalds wrote: >> >> > >> > >> > On Sun, 6 May 2007, Ingo Molnar wrote: >> > > >> > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: >> > > >> > > > So the _only_ valid way to handle timers is to >> > > > - either not allow wrapping at all (in which case "unsigned" is >> > > > better, >> > > > since it is bigger) >> > > > - or use wrapping explicitly, and use unsigned arithmetic (which is >> > > > well-defined in C) and do something like "(long)(a-b) > 0". >> > > >> > > hm, there is a corner-case in CFS where a fix like this is necessary. >> > > >> > > CFS uses 64-bit values for almost everything, and the majority of >> > > values >> > > are of 'relative' nature with no danger of overflow. (They are signed >> > > because they are relative values that center around zero and can be >> > > negative or positive.) >> > >> > Well, I'd like to just worry about that for a while. >> > >> > You say there is "no danger of overflow", and I mostly agree that once >> > we're talking about 64-bit values, the overflow issue simply doesn't >> > exist, and furthermore the difference between 63 and 64 bits is not >> > really >> > relevant, so there's no major reason to actively avoid signed entries. >> > >> > So in that sense, it all sounds perfectly sane. 
And I'm definitely not >> > sure your "292 years after bootup" worry is really worth even >> > considering. >> > >> >> I would hate to tell mission control for Mankind's first mission to >> another >> star to reboot every 200 years because "there is no need to worry about >> it." >> >> As a matter of principle an OS should never need a reboot (with exception >> for upgrading). If you say you have to reboot every 200 years, why not >> every 100? Every 50? .... Every 45 days (you know what I am referring to >> :-) ? > > There's always going to be an upper limit on the representation of time. > At least until we figure out how to implement infinity properly. Well you need infinite memory for that :-) But that is my point: Why go into the problem of storing absolute time when you can use relative time? > >> >> > When we're really so well off that we expect the hardware and software >> > stack to be stable over a hundred years, I'd start to think about issues >> > like that, in the meantime, to me worrying about those kinds of issues >> > just means that you're worrying about the wrong things. >> > >> > BUT. >> > >> > There's a fundamental reason relative timestamps are difficult and >> > almost >> > always have overflow issues: the "long long in the future" case as an >> > approximation of "infinite timeout" is almost always relevant. >> > >> > So rather than worry about the system staying up 292 years, I'd worry >> > about whether people pass in big numbers (like some MAX_S64 >> > approximation) >> > as an approximation for "infinite", and once you have things like that, >> > the "64 bits never overflows" argument is totally bogus. >> > >> > There's a damn good reason for using only *absolute* time. The whole >> > "signed values of relative time" may _sound_ good, but it really sucks >> > in >> > subtle and horrible ways! >> > >> >> I think you are wrong here. The only place you need absolute time is a for >> the clock (CLOCK_REALTIME). 
You waste CPU using a 64 bit >> representation when you could have used a 32 bit. With a 32 bit >> implementation you are forced to handle the corner cases with wrap around >> and too big arguments up front. With a 64 bit you hide those problems. > > As does the other method. A 32 bit signed offset with a 32 bit base is just > a crude version of 64 bit absolute time. 64 bit is also relative - just over a much longer period. 32 bit signed offset is relative - and you know it. But with 64 people think it is absolute and put in large values as Linus said above. With 32 bit future developers will know it is relative and code for it. And they will get their corner cases tested, because the code soon will run into those corners. > >> >> I think CFS would be best off using a 32 bit timer counting in micro >> seconds. That would wrap around in 72 minutes. But as the timers are >> relative you will never be able to specify a timer larger than 36 minutes >> in the future. But 36 minutes is ridiculously long for a scheduler and a >> simple test limiting time values to that value would not break anything. > > Except if you're measuring sleep times. I think that you'll find lots of > tasks sleep for more than 72 minutes. I don't think those large values will be relevant. You can easily cut off sleep times at 30 min or even 1 min. But you need to detect that the task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do that with a 32 bit timer alone. In that case you need to use a (at least) 64 bit timer - which is needed in the OS anyways. But not internally in the wait queue, where the repeated calculations are. Esben > > Peter > -- > Peter Williams pwil3058@bigpond.net.au > > "Learning, n. The kind of ignorance distinguishing the studious." > -- Ambrose Bierce > > ^ permalink raw reply [flat|nested] 75+ messages in thread
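[Editorial note: the wrap arithmetic being debated above can be sketched in plain C. This is an illustration with an invented helper name, not CFS code: a free-running 32-bit microsecond counter wraps every 2^32 us (about 71.6 minutes), and the usual wrap-safe "later than" test is only valid while the true distance between two stamps is under half the range (2^31 us, about 35.8 minutes), which is the 36-minute limit Esben refers to.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only.  Wrap-safe "a is later than b" for a free-running
 * 32-bit microsecond counter: the unsigned subtraction is well-defined
 * modulo 2^32, and we test the sign bit of the distance.  Correct only
 * while the real distance is below 2^31 us (~35.8 minutes). */
static int usec_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}
```

Note how the test keeps working across the wrap point, but silently inverts once the distance exceeds half the counter range, which is why a 32-bit scheme must clamp deltas (e.g. sleep times) well below that limit.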
* Re: [patch] CFS scheduler, -v8 2007-05-08 9:05 ` Esben Nielsen @ 2007-05-09 0:01 ` Peter Williams 0 siblings, 0 replies; 75+ messages in thread From: Peter Williams @ 2007-05-09 0:01 UTC (permalink / raw) To: Esben Nielsen Cc: Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod Esben Nielsen wrote: > > > On Tue, 8 May 2007, Peter Williams wrote: > >> Esben Nielsen wrote: >>> >>> >>> On Sun, 6 May 2007, Linus Torvalds wrote: >>> >>> > > > On Sun, 6 May 2007, Ingo Molnar wrote: >>> > > > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: >>> > > > > > So the _only_ valid way to handle timers is to >>> > > > - either not allow wrapping at all (in which case "unsigned" >>> is > > > better, >>> > > > since it is bigger) >>> > > > - or use wrapping explicitly, and use unsigned arithmetic >>> (which is >>> > > > well-defined in C) and do something like "(long)(a-b) > 0". >>> > > > > hm, there is a corner-case in CFS where a fix like this is >>> necessary. >>> > > > > CFS uses 64-bit values for almost everything, and the >>> majority of > > values >>> > > are of 'relative' nature with no danger of overflow. (They are >>> signed >>> > > because they are relative values that center around zero and can be >>> > > negative or positive.) >>> > > Well, I'd like to just worry about that for a while. >>> > > You say there is "no danger of overflow", and I mostly agree >>> that once >>> > we're talking about 64-bit values, the overflow issue simply doesn't >>> > exist, and furthermore the difference between 63 and 64 bits is >>> not > really >>> > relevant, so there's no major reason to actively avoid signed >>> entries. >>> > > So in that sense, it all sounds perfectly sane. And I'm >>> definitely not >>> > sure your "292 years after bootup" worry is really worth even > >>> considering. 
>>> > >>> >>> I would hate to tell mission control for Mankind's first mission to >>> another >>> star to reboot every 200 years because "there is no need to worry about >>> it." >>> >>> As a matter of principle an OS should never need a reboot (with >>> exception >>> for upgrading). If you say you have to reboot every 200 years, why not >>> every 100? Every 50? .... Every 45 days (you know what I am >>> referring to >>> :-) ? >> >> There's always going to be an upper limit on the representation of time. >> At least until we figure out how to implement infinity properly. > > Well you need infinite memory for that :-) > But that is my point: Why go into the problem of storing absolute time > when you can use relative time? I'd reverse that and say "Why go to the bother of using relative time when you can use absolute time?". Absolute time being time since boot, of course. > > >> >>> >>> > When we're really so well off that we expect the hardware and >>> software >>> > stack to be stable over a hundred years, I'd start to think about >>> issues >>> > like that, in the meantime, to me worrying about those kinds of >>> issues >>> > just means that you're worrying about the wrong things. >>> > > BUT. >>> > > There's a fundamental reason relative timestamps are difficult >>> and > almost >>> > always have overflow issues: the "long long in the future" case as an >>> > approximation of "infinite timeout" is almost always relevant. >>> > > So rather than worry about the system staying up 292 years, I'd >>> worry >>> > about whether people pass in big numbers (like some MAX_S64 > >>> approximation) >>> > as an approximation for "infinite", and once you have things like >>> that, >>> > the "64 bits never overflows" argument is totally bogus. >>> > > There's a damn good reason for using only *absolute* time. The >>> whole >>> > "signed values of relative time" may _sound_ good, but it really >>> sucks > in >>> > subtle and horrible ways! 
>>> > >>> >>> I think you are wrong here. The only place you need absolute time is >>> for >>> the clock (CLOCK_REALTIME). You waste CPU using a 64 bit >>> representation when you could have used a 32 bit. With a 32 bit >>> implementation you are forced to handle the corner cases with wrap >>> around >>> and too big arguments up front. With a 64 bit you hide those problems. >> >> As does the other method. A 32 bit signed offset with a 32 bit base >> is just a crude version of 64 bit absolute time. > > 64 bit is also relative - just over a much longer period. Yes, relative to boot. > 32 bit signed offset is relative - and you know it. But with 64 people > think it is absolute and put in large values as Linus said above. What people? Who gets to feed times into the scheduler? Isn't it just using the time as determined by the system? > With > 32 bit future developers will know it is relative and code for it. And > they will get their corner cases tested, because the code soon will run > into those corners. > >> >>> >>> I think CFS would be best off using a 32 bit timer counting in micro >>> seconds. That would wrap around in 72 minutes. But as the timers are >>> relative you will never be able to specify a timer larger than 36 >>> minutes >>> in the future. But 36 minutes is ridiculously long for a scheduler >>> and a >>> simple test limiting time values to that value would not break >>> anything. >> >> Except if you're measuring sleep times. I think that you'll find lots >> of tasks sleep for more than 72 minutes. > > I don't think those large values will be relevant. You can easily cut > off sleep times at 30 min or even 1 min. The aim is to make the code as simple as possible, not to add this kind of rubbish, and 1 minute would be far too low. > But you need to detect that the > task has indeed been sleeping 2^32+1 usec and not 1 usec. You can't do that > with a 32 bit timer alone.
In that case you need to use a (at least) 64 bit > timer - which is needed in the OS anyways. But not internally in the > wait queue, where the repeated calculations are. > Peter -- Peter Williams pwil3058@bigpond.net.au "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-08 0:35 ` Peter Williams 2007-05-08 9:05 ` Esben Nielsen @ 2007-05-10 13:09 ` Pavel Machek 2007-05-11 16:50 ` Linus Torvalds 1 sibling, 1 reply; 75+ messages in thread From: Pavel Machek @ 2007-05-10 13:09 UTC (permalink / raw) To: Peter Williams Cc: Esben Nielsen, Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod Hi! > >>You say there is "no danger of overflow", and I mostly > >>agree that once > >>we're talking about 64-bit values, the overflow issue > >>simply doesn't > >>exist, and furthermore the difference between 63 and > >>64 bits is not really > >>relevant, so there's no major reason to actively avoid > >>signed entries. > >> > >>So in that sense, it all sounds perfectly sane. And > >>I'm definitely not > >>sure your "292 years after bootup" worry is really > >>worth even considering. > >> > > > >I would hate to tell mission control for Mankind's > >first mission to another > >star to reboot every 200 years because "there is no > >need to worry about it." > > > >As a matter of principle an OS should never need a > >reboot (with exception for upgrading). If you say you > >have to reboot every 200 years, why not every 100? > >Every 50? .... Every 45 days (you know what I am > >referring to :-) ? > > There's always going to be an upper limit on the > representation of time. At least until we figure out > how to implement infinity properly. There's also upper limit on life time of this universe. 1000 bits is certainly enough to represent that in u-seconds. Also notice that current cpus were not designed to work 300 years. When we have hw designed for 50 years+, we can start to worry. 
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-10 13:09 ` Pavel Machek @ 2007-05-11 16:50 ` Linus Torvalds 2007-05-11 19:18 ` Pavel Machek 0 siblings, 1 reply; 75+ messages in thread From: Linus Torvalds @ 2007-05-11 16:50 UTC (permalink / raw) To: Pavel Machek Cc: Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod [-- Attachment #1: Type: TEXT/PLAIN, Size: 1587 bytes --] On Thu, 10 May 2007, Pavel Machek wrote: > > Also notice that current cpus were not designed to work 300 years. > When we have hw designed for 50 years+, we can start to worry. Indeed. CPU manufacturers don't seem to talk about it very much, and searching for it with google on intel.com comes up with The failure rate and Mean Time Between Failure (MTBF) data is not currently available on our website. You may contact Intel® Customer Support for this information. which seems to be just a fancy way of saying "we don't actually want to talk about it". Probably not because it's actually all that bad, but simply because people don't think about it, and there's no reason a CPU manufacturer would *want* people to think about it. But if you wondered why server CPU's usually run at a lower frequency, it's because of MTBF issues. I think a desktop CPU is usually specced to run for 5 years (and that's expecting that it's turned off or at least idle much of the time), while a server CPU is expected to last longer and be active a much bigger percentage of time. ("Active" == "heat" == "more damage due to atom migration etc". Which is part of why you're not supposed to overclock stuff: it may well work well for you, but for all you know it will cut your expected CPU life by 90%). Of course, all the other components have a MTBF too (and I'd generally expect a power supply component to fail long before the CPU does). 
And yes, some machines will last for decades. But I suspect we've all heard of machines that just don't work reliably any more. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-11 16:50 ` Linus Torvalds @ 2007-05-11 19:18 ` Pavel Machek 2007-05-11 19:37 ` Willy Tarreau 0 siblings, 1 reply; 75+ messages in thread From: Pavel Machek @ 2007-05-11 19:18 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod Hi! > > Also notice that current cpus were not designed to work 300 years. > > When we have hw designed for 50 years+, we can start to worry. > > Indeed. CPU manufacturers don't seem to talk about it very much, and > searching for it with google on intel.com comes up with > > The failure rate and Mean Time Between Failure (MTBF) data is not > currently available on our website. You may contact Intel? > Customer Support for this information. > > which seems to be just a fancy way of saying "we don't actually want to > talk about it". Probably not because it's actually all that bad, but > simply because people don't think about it, and there's no reason a CPU > manufacturer would *want* people to think about it. > > But if you wondered why server CPU's usually run at a lower frequency, > it's because of MTBF issues. I think a desktop CPU is usually specced to > run for 5 years (and that's expecting that it's turned off or at least > idle much of the time), while a server CPU is expected to last longer and > be active a much bigger percentage of time. > > ("Active" == "heat" == "more damage due to atom migration etc". Which is > part of why you're not supposed to overclock stuff: it may well work well > for you, but for all you know it will cut your expected CPU life by 90%). Actually, when I talked with AMD, they told me that cpus should last 10 years *at their max specced temperature*... which is 95Celsius. So overclocking is not that evil, according to my info. 
(That would mean way more than 10 years if you use your cpu 'normally'.) But I guess capacitors from cpu power supply will hate you running cpu at 95C... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-11 19:18 ` Pavel Machek @ 2007-05-11 19:37 ` Willy Tarreau 2007-05-11 20:53 ` Kevin Bowling 0 siblings, 1 reply; 75+ messages in thread From: Willy Tarreau @ 2007-05-11 19:37 UTC (permalink / raw) To: Pavel Machek Cc: Linus Torvalds, Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Fri, May 11, 2007 at 07:18:25PM +0000, Pavel Machek wrote: > Hi! > > > > Also notice that current cpus were not designed to work 300 years. > > > When we have hw designed for 50 years+, we can start to worry. > > > > Indeed. CPU manufacturers don't seem to talk about it very much, and > > searching for it with google on intel.com comes up with > > > > The failure rate and Mean Time Between Failure (MTBF) data is not > > currently available on our website. You may contact Intel? > > Customer Support for this information. > > > > which seems to be just a fancy way of saying "we don't actually want to > > talk about it". Probably not because it's actually all that bad, but > > simply because people don't think about it, and there's no reason a CPU > > manufacturer would *want* people to think about it. > > > > But if you wondered why server CPU's usually run at a lower frequency, > > it's because of MTBF issues. I think a desktop CPU is usually specced to > > run for 5 years (and that's expecting that it's turned off or at least > > idle much of the time), while a server CPU is expected to last longer and > > be active a much bigger percentage of time. > > > > ("Active" == "heat" == "more damage due to atom migration etc". Which is > > part of why you're not supposed to overclock stuff: it may well work well > > for you, but for all you know it will cut your expected CPU life by 90%). 
> > Actually, when I talked with AMD, they told me that cpus should last > 10 years *at their max specced temperature*... which is 95Celsius. So > overclocking is not that evil, according to my info. I tend to believe that. I've slowed down the FANs on my dual-athlon XP to silence them, and I found the system to be (apparently) stable till 96 deg Celsius with FANs unplugged. So I regulated them in order to maintain a temperature below 90 deg for a safety margin, and they seem to be as happy as my ears. They've been like that for the last 3-4 years, I don't remember. On a related note, older technology is less sensitive. My old VAX VLC4000 from 1991 which received a small heatsink in exchange for its noisy fans is still doing well in a closed place. I'm more worried for the EPROMs which store the boot code. Also, I think that the RS6000 processed at half-micron which run the Mars rovers might run for hundreds of years at 30 MHz. > (That would mean way more than 10 years if you use your cpu > 'normally'.) > > But I guess capacitors from cpu power supply will hate you running cpu > at 95C... even at 60, many of them die within a few years. But I think we're "slightly" off-topic now... > Pavel Cheers Willy ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-11 19:37 ` Willy Tarreau @ 2007-05-11 20:53 ` Kevin Bowling 0 siblings, 0 replies; 75+ messages in thread From: Kevin Bowling @ 2007-05-11 20:53 UTC (permalink / raw) To: Willy Tarreau Cc: Pavel Machek, Linus Torvalds, Peter Williams, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner, caglar, Gene Heskett, Mark Lord, Zach Carter, buddabrod On 5/11/07, Willy Tarreau <w@1wt.eu> wrote: > On Fri, May 11, 2007 at 07:18:25PM +0000, Pavel Machek wrote: > > Hi! > > > > > > Also notice that current cpus were not designed to work 300 years. > > > > When we have hw designed for 50 years+, we can start to worry. > > On a related note, older technology is less sensible. My old VAX VLC4000 > from 1991 which received a small heatsink in exchange for its noisy fans > is still doing well in an closed place. I'm more worried for the EPROMs > which store the boot code. Also, I think that the RS6000 processed at > half-micron which run the Mars rovers might run for hundreds of years > at 30 MHz. Not your typical RS/6000. http://en.wikipedia.org/wiki/RAD6000 Without a doubt, decently built computers will easily last at least 10 years. From what I can tell, the issue here was simply OS runtime? ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-05 17:44 ` Linus Torvalds 2007-05-06 8:29 ` Ingo Molnar @ 2007-05-07 11:09 ` Esben Nielsen 2007-05-07 16:28 ` Linus Torvalds 1 sibling, 1 reply; 75+ messages in thread From: Esben Nielsen @ 2007-05-07 11:09 UTC (permalink / raw) To: Linus Torvalds Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Sat, 5 May 2007, Linus Torvalds wrote: > > > On Sat, 5 May 2007, Esben Nielsen wrote: >> >> I have been wondering why you use unsigned for timers anyway. It is also like >> that in hrtimers. Why not use signed and avoid (almost) all worries about wrap >> around issues. The trick is that when all >> a < b >> is replaced by >> a - b < 0 >> the code will work on all two's-complement machines even if the (signed!) integers >> a and b wrap around. > > No. BOTH of the above are buggy. > > The C language definition doesn't allow signed integers to wrap (ie it's > undefined behaviour), so "a-b < 0" can be rewritten by the compiler as a > simple signed "a < b". > > And the unsigned (or signed) "a < b" is just broken wrt any kind of > wrap-around (whether wrapping around zero or the sign bit). > > So the _only_ valid way to handle timers is to > - either not allow wrapping at all (in which case "unsigned" is better, > since it is bigger) > - or use wrapping explicitly, and use unsigned arithmetic (which is > well-defined in C) and do something like "(long)(a-b) > 0". > > Notice? The signed variant is basically _never_ correct. > What is (long)(a-b) ? I have tried to look it up in the C99 standard but I can't find it. Maybe it is in the referred LIA-1 standard, which I can't find with google. I think the best would be to use "a-b > ULONG_MAX/2" when you mean "a<b" as that should be completely portable. 
According to C99 Appendix H2.2 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf) an implementation can choose to do modulo signed integers as it is mandatory for unsigned integers. If an implementation has chosen to do that, it must be a bug to do the "a-b < 0" -> "a<b" optimization. I have never experienced a compiler/architecture combination _not_ doing wrapped signed integers. Esben > Linus > ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-07 11:09 ` Esben Nielsen @ 2007-05-07 16:28 ` Linus Torvalds 2007-05-07 18:39 ` Johannes Stezenbach 2007-05-08 5:36 ` Matt Mackall 0 siblings, 2 replies; 75+ messages in thread From: Linus Torvalds @ 2007-05-07 16:28 UTC (permalink / raw) To: Esben Nielsen Cc: Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, 7 May 2007, Esben Nielsen wrote: > > What is (long)(a-b) ? I have tried to look it up in the C99 standard but I > can't find it. Maybe it is in the referred LIA-1 standard, which I can't find > with google. I don't worry about non-2's-complement machines (they don't exist, and likely won't exist in the future either). So I worry about compilers rewriting my code. So "(long)(a-b) < 0" (with "a" and "b" being unsigned long) is basically a portable way of testing the high bit of the result. > I think the best would be to use "a-b > ULONG_MAX/2" when you mean "a<b" as > that should be completely portable. That certainly works too, but the difference is irrelevant, since Linux is unlikely to work on insane machines anyway (ie we do make a lot of other assumptions about the architecture, being two's-complement is the least of those). So you basically shouldn't worry about hardware: everybody is pretty much the same. You should worry about *compilers* - that's where the differences show up. So "(long)(a-b)" may be "implementation defined" (but since implementations are all 2's complement, we don't care), but a signed "(a-b)" that over/underflows is *undefined*, and that is much worse because it means that the compiler can do some funky stuff, and _that_ is a real practical worry. And no, I also don't worry about porting Linux to 18-bit machines, or to ternary CPU's. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
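[Editorial note: the idiom Linus describes can be exercised directly. The helper name below is invented for illustration; the point is that the subtraction is done in *unsigned* arithmetic, which C defines to wrap, and only the sign bit of the result is then inspected -- assuming a two's-complement machine, as he does.]

```c
#include <assert.h>
#include <limits.h>

/* Illustrative only.  "a is before b" on a wrapping unsigned counter:
 * the unsigned subtraction a - b is well-defined modulo ULONG_MAX+1,
 * and the cast to long lets us test the sign bit of the distance. */
static int stamp_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}
```

A naive `a < b` on the same values gives the wrong answer as soon as one stamp has wrapped past zero and the other has not, which is exactly the case the cast-and-sign-test handles.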
* Re: [patch] CFS scheduler, -v8 2007-05-07 16:28 ` Linus Torvalds @ 2007-05-07 18:39 ` Johannes Stezenbach 2007-05-07 18:55 ` Linus Torvalds 2007-05-08 7:34 ` Esben Nielsen 0 siblings, 2 replies; 75+ messages in thread From: Johannes Stezenbach @ 2007-05-07 18:39 UTC (permalink / raw) To: Linus Torvalds Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, May 07, 2007, Linus Torvalds wrote: > On Mon, 7 May 2007, Esben Nielsen wrote: > > > > What is (long)(a-b) ? I have tried to look it up in the C99 standard but I > > can't find it. Maybe it is in the referred LIA-1 standard, which I can't find > > with google. C99 defines unsigned overflow semantics, but it doesn't say anything about signed overflow, thus it's undefined -- and you have a hard time finding it out. However, I have no clue *why* it's undefined and not implementation defined. Does someone know? > I don't worry about non-2's-complement machines (they don't exist, and > likely won't exist in the future either). I think DSPs can do saturating arithmetic (clamp to min/max values instead of wrap around). Not that it matters for Linux... > So I worry about compilers rewriting my code. gcc has -fwrapv and -ftrapv to change signed integer overflow behaviour. 
One baffling example where gcc rewrites code is when conditionals depend on signed integer overflow:

$ cat xx.c
#include <assert.h>

int foo(int a)
{
	assert(a + 100 > a);
	return a;
}

int bar(int a)
{
	if (a + 100 > a)
		a += 100;
	return a;
}
$ gcc -Wall -Wextra -fomit-frame-pointer -c xx.c
$ objdump -dr xx.o

xx.o:     file format elf32-i386

Disassembly of section .text:

00000000 <foo>:
   0:	8b 44 24 04	mov    0x4(%esp),%eax
   4:	c3       	ret

00000005 <bar>:
   5:	83 44 24 04 64	addl   $0x64,0x4(%esp)
   a:	8b 44 24 04	mov    0x4(%esp),%eax
   e:	c3       	ret

The assert and the condition were just dropped by gcc -- without any warning.

gcc-4.2 will add -fstrict-overflow and -Wstrict-overflow.
http://gcc.gnu.org/gcc-4.2/changes.html

Johannes
^ permalink raw reply [flat|nested] 75+ messages in thread
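[Editorial note: an illustrative counterpart, not from the original mail. The unsigned version of bar() behaves differently, because unsigned overflow *is* defined in C, so the compiler must preserve the check instead of folding it to true. Function name invented for the sketch.]

```c
#include <assert.h>
#include <limits.h>

/* Unlike the signed bar() above, this check cannot be optimized away:
 * a + 100u wraps modulo UINT_MAX+1 by definition, so the condition is
 * genuinely false near the top of the range. */
static unsigned int bar_u(unsigned int a)
{
    if (a + 100u > a)   /* false exactly when a + 100 wraps */
        a += 100u;
    return a;
}
```

So `bar_u(UINT_MAX)` returns UINT_MAX unchanged, while the signed bar() above was compiled to an unconditional add.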
* Re: [patch] CFS scheduler, -v8 2007-05-07 18:39 ` Johannes Stezenbach @ 2007-05-07 18:55 ` Linus Torvalds 0 siblings, 0 replies; 75+ messages in thread From: Linus Torvalds @ 2007-05-07 18:55 UTC (permalink / raw) To: Johannes Stezenbach Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, 7 May 2007, Johannes Stezenbach wrote: > > One baffling example where gcc rewrites code is when > conditionals depend on signed integer overflow: Yes. This is one of my favourite beefs with gcc. Some of the optimization decisions seem to make no sense. Your example is a good one, but my private beef has been in alias handling. Alias analysis is an important part of optimization, and there's two kinds: the static (and exact, aka "safe") kind that you can do regardless of any language definitions, because you *know* that you aren't actually changing behaviour, and the additional type-based heuristics that the C language allows. So which ones would you expect a compiler to consider more important? And which one do you think gcc will use? Right. You can have static analysis that *guarantees* that two objects alias, but if gcc determines that they have different types and thus might not alias, it decides to use the heuristic instead of the firm knowledge, and generate code that doesn't work. "Because the language definition allows it". Oh well. Linus ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8 2007-05-07 18:39 ` Johannes Stezenbach 2007-05-07 18:55 ` Linus Torvalds @ 2007-05-08 7:34 ` Esben Nielsen 2007-05-08 9:54 ` Johannes Stezenbach 1 sibling, 1 reply; 75+ messages in thread From: Esben Nielsen @ 2007-05-08 7:34 UTC (permalink / raw) To: Johannes Stezenbach Cc: Linus Torvalds, Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod On Mon, 7 May 2007, Johannes Stezenbach wrote: > On Mon, May 07, 2007, Linus Torvalds wrote: >> On Mon, 7 May 2007, Esben Nielsen wrote: >>> >>> What is (long)(a-b) ? I have tried to look it up in the C99 standard but I >>> can't find it. Maybe it is in the referred LIA-1 standard, which I can't find >>> with google. > > C99 defines unsigned overflow semantics, but it doesn't say anything > about signed overflow, thus it's undefined -- and you have a hard > time finding it out. > > However, I have no clue *why* it's undefined and not > implementation defined. Does someone know? > >> I don't worry about non-2's-complement machines (they don't exist, and >> likely won't exist in the future either). > > I think DSPs can do saturating arithmetic (clamp to min/max > values instead of wrap around). Not that it matters for Linux... > >> So I worry about compilers rewriting my code. > > gcc has -fwrapv and -ftrapv to change signed integer overflow > behaviour. 
> One baffling example where gcc rewrites code is when
> conditionals depend on signed integer overflow:
>
> $ cat xx.c
> #include <assert.h>
>
> int foo(int a)
> {
> 	assert(a + 100 > a);
> 	return a;
> }
>
> int bar(int a)
> {
> 	if (a + 100 > a)
> 		a += 100;
> 	return a;
> }
> $ gcc -Wall -Wextra -fomit-frame-pointer -c xx.c
> $ objdump -dr xx.o
>
> xx.o:     file format elf32-i386
>
> Disassembly of section .text:
>
> 00000000 <foo>:
>    0:	8b 44 24 04	mov    0x4(%esp),%eax
>    4:	c3       	ret
>
> 00000005 <bar>:
>    5:	83 44 24 04 64	addl   $0x64,0x4(%esp)
>    a:	8b 44 24 04	mov    0x4(%esp),%eax
>    e:	c3       	ret
>
> The assert and the condition were just dropped
> by gcc -- without any warning.
>
> gcc-4.2 will add -fstrict-overflow and -Wstrict-overflow.
> http://gcc.gnu.org/gcc-4.2/changes.html
>
> Johannes

This is contrary to C99 standard annex H2.2 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):

"An implementation that defines signed integer types as also being modulo need not detect integer overflow, in which case, only integer divide-by-zero need be detected."

So if it doesn't properly define wrapping it has to detect integer overflow, right? gcc does neither with that optimization :-(

Esben
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-08  7:34             ` Esben Nielsen
@ 2007-05-08  9:54               ` Johannes Stezenbach
  2007-05-08 10:27                 ` Esben Nielsen
  0 siblings, 1 reply; 75+ messages in thread
From: Johannes Stezenbach @ 2007-05-08  9:54 UTC (permalink / raw)
To: Esben Nielsen
Cc: Linus Torvalds, Ingo Molnar, Balbir Singh, linux-kernel,
    Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
    Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
    Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Tue, May 08, 2007, Esben Nielsen wrote:
>
> This is contrary to C99 standard annex H2.2
> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):
>
> "An implementation that defines signed integer types as also being
> modulo need not detect integer overflow, in which case, only integer
> divide-by-zero need be detected."
>
> So if it doesn't properly define wrapping it has to detect integer
> overflow, right?

No. Annex H (informative!) only talks about LIA-1 conformance.

C99 isn't LIA-1 conformant. H2.2 describes what an implementation
might do to make signed integers LIA-1 compatible, which is what
gcc does with -fwrapv or -ftrapv.

At least that's how I understand it; the C99 standard seems to have
been written with the "it was hard to write, so it should be hard to
read" mindset. :-/

I still don't know _why_ signed integer overflow behaviour isn't
defined in C. It just goes against everyone's expectation and thus
causes bugs.

Johannes

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-08  9:54               ` Johannes Stezenbach
@ 2007-05-08 10:27                 ` Esben Nielsen
  0 siblings, 0 replies; 75+ messages in thread
From: Esben Nielsen @ 2007-05-08 10:27 UTC (permalink / raw)
To: Johannes Stezenbach
Cc: Esben Nielsen, Linus Torvalds, Ingo Molnar, Balbir Singh,
    linux-kernel, Andrew Morton, Con Kolivas, Nick Piggin,
    Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner,
    caglar, Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter,
    buddabrod

On Tue, 8 May 2007, Johannes Stezenbach wrote:

> On Tue, May 08, 2007, Esben Nielsen wrote:
>>
>> This is contrary to C99 standard annex H2.2
>> (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf):
>>
>> "An implementation that defines signed integer types as also being
>> modulo need not detect integer overflow, in which case, only integer
>> divide-by-zero need be detected."
>>
>> So if it doesn't properly define wrapping it has to detect integer
>> overflow, right?
>
> No. Annex H (informative!) only talks about LIA-1 conformance.
>
> C99 isn't LIA-1 conformant. H2.2 describes what an implementation
> might do to make signed integers LIA-1 compatible.

"The signed C integer types int, long int, long long int, and the
corresponding unsigned types are compatible with LIA-1."

I read this as: any C99 implementation must be compatible. I would
like to see LIA-1 to check.

> , which is what gcc does with -fwrapv or -ftrapv.

Yes, either/or: either wrap or trap.

> At least that's how I understand it; the C99 standard seems to have
> been written with the "it was hard to write, so it should be hard to
> read" mindset. :-/
>
> I still don't know _why_ signed integer overflow behaviour isn't
> defined in C. It just goes against everyone's expectation and thus
> causes bugs.

Because it is hard to make wrapping work on non-two's-complement
architectures. There it is easier to trap.

Esben

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-07 16:28           ` Linus Torvalds
  2007-05-07 18:39             ` Johannes Stezenbach
@ 2007-05-08  5:36             ` Matt Mackall
  1 sibling, 0 replies; 75+ messages in thread
From: Matt Mackall @ 2007-05-08  5:36 UTC (permalink / raw)
To: Linus Torvalds
Cc: Esben Nielsen, Ingo Molnar, Balbir Singh, linux-kernel,
    Andrew Morton, Con Kolivas, Nick Piggin, Mike Galbraith,
    Arjan van de Ven, Peter Williams, Thomas Gleixner, caglar,
    Willy Tarreau, Gene Heskett, Mark Lord, Zach Carter, buddabrod

On Mon, May 07, 2007 at 09:28:32AM -0700, Linus Torvalds wrote:
>
> On Mon, 7 May 2007, Esben Nielsen wrote:
>>
>> What is (long)(a-b)? I have tried to look it up in the C99 standard
>> but I can't find it. Maybe it is in the referred LIA-1 standard,
>> which I can't find with google.
>
> I don't worry about non-2's-complement machines (they don't exist, and
> likely won't exist in the future either).

They do exist. And they run Linux. SLES9 in fact.

http://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#UNISYS_ClearPath_IX_series
http://www.unisys.com/eprise/main/admin/corporate/doc/ClearPath_Plus_Dorado_Model_390_Server_Specification_Sheet.pdf

That machine is a direct descendant of the Univac 1101 from 1950 and
is still software-compatible with 1107s from 1962. (Granted, they only
run Linux on the x86 side.)

--
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-01 21:22 Ingo Molnar
                   ` (3 preceding siblings ...)
  2007-05-02  9:08 ` Balbir Singh
@ 2007-05-02 12:58 ` Mark Lord
  2007-05-02 12:58 ` Vegard Nossum
  5 siblings, 0 replies; 75+ messages in thread
From: Mark Lord @ 2007-05-02 12:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
    Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
    Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Zach Carter,
    buddabrod, Balbir Singh

Ingo Molnar wrote:
> i'm pleased to announce release -v8 of the CFS scheduler patchset.
> (The main goal of CFS is to implement "desktop scheduling" with as
> high quality as technically possible.)
>
> The CFS patch against v2.6.21.1 (or against v2.6.20.10) can be
> downloaded from the usual place:
>
>     http://people.redhat.com/mingo/cfs-scheduler/

Excellent. I've switched machines here, so my "old" single-core 2GHz
notebook is no longer being used as much. The new machine is a more or
less identical notebook, but with a 2.1GHz Core2Duo CPU.

And.. man, the mainline scheduler really sucks eggs on the dual-core!
On single core it was nearly always "nice" enough for me, but wow..
what a degradation with the extra CPU.

So, CFS is now a permanent add-on for kernels I use here on either
machine, and I also can say that it works
GGGGGGGRRRRRRRRRRREEEEEEEEEAAAAAAAAAATTTTTTTT!!!

(please excuse the North American excuse for a "cultural" reference). ;)

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-01 21:22 Ingo Molnar
                   ` (4 preceding siblings ...)
  2007-05-02 12:58 ` Mark Lord
@ 2007-05-02 12:58 ` Vegard Nossum
  2007-05-02 16:41   ` Ingo Molnar
  5 siblings, 1 reply; 75+ messages in thread
From: Vegard Nossum @ 2007-05-02 12:58 UTC (permalink / raw)
To: Ingo Molnar; +Cc: linux-kernel

On Tue, May 1, 2007 11:22 pm, Ingo Molnar wrote:
> As usual, any sort of feedback, bugreport, fix and suggestion is more
> than welcome,

Hi,

The sys_sched_yield_to() syscall is not callable from userspace on
i386 because it is not part of the syscall table
(arch/i386/kernel/syscall_table.S). This causes sysenter_entry
(arch/i386/kernel/entry.S) to use the wrong count for nr_syscalls (320
instead of 321) and return with -ENOSYS.

Vegard

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [patch] CFS scheduler, -v8
  2007-05-02 12:58 ` Vegard Nossum
@ 2007-05-02 16:41   ` Ingo Molnar
  0 siblings, 0 replies; 75+ messages in thread
From: Ingo Molnar @ 2007-05-02 16:41 UTC (permalink / raw)
To: Vegard Nossum; +Cc: linux-kernel

* Vegard Nossum <vegard@peltkore.net> wrote:

> The sys_sched_yield_to() is not callable from userspace on i386
> because it is not part of the syscall table
> (arch/i386/kernel/syscall_table.S). This causes sysenter_entry
> (arch/i386/kernel/entry.S) to use the wrong count for nr_syscalls
> (320 instead of 321) and return with -ENOSYS.

oops, indeed - the patch below should fix this. (x86 should really
adopt the nice x86_64 technique of building the syscall table out of
the unistd.h enumeration definitions.)

	Ingo

Index: linux/arch/i386/kernel/syscall_table.S
===================================================================
--- linux.orig/arch/i386/kernel/syscall_table.S
+++ linux/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
 	.long sys_move_pages
 	.long sys_getcpu
 	.long sys_epoll_pwait
+	.long sys_sched_yield_to	/* 320 */

^ permalink raw reply	[flat|nested] 75+ messages in thread