public inbox for linux-kernel@vger.kernel.org
* [patch] CFS scheduler, -v12
@ 2007-05-13 15:38 Ingo Molnar
  2007-05-16  2:04 ` Peter Williams
  2007-05-18  0:18 ` Bill Huey
  0 siblings, 2 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-05-13 15:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Andrew Morton, Con Kolivas, Nick Piggin,
	Mike Galbraith, Arjan van de Ven, Peter Williams, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Mark Lord


i'm pleased to announce release -v12 of the CFS scheduler patchset.

The CFS patch against v2.6.22-rc1, v2.6.21.1 or v2.6.20.10 can be 
downloaded from the usual place:
  
    http://people.redhat.com/mingo/cfs-scheduler/

-v12 fixes the '3D bug' that caused trivial latencies in 3D games: it 
turns out that the problem did not result from any core property of 
CFS; it was caused by 3D userspace growing dependent on the current 
inefficiency of the vanilla scheduler's sys_sched_yield() 
implementation, and CFS's "make yield work well" changes broke it.

Even a simple 3D app like glxgears does a sys_sched_yield() for every 
frame it generates (!) on certain 3D cards, which in essence punishes 
any scheduler that implements sys_sched_yield() in a sane manner. This 
interaction of CFS's yield implementation with this user-space bug could 
be the main reason why some testers reported SD to be handling 3D games 
better than CFS. (SD uses a yield implementation similar to the vanilla 
scheduler.)
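
The per-frame yield pattern described above can be sketched as follows 
(a toy illustration only; `render_frame` is a hypothetical stand-in for 
the driver's frame submission, not actual glxgears or driver code):

```python
import os

def render_frame(n):
    # Hypothetical stand-in for submitting one frame to the GPU driver.
    return n

def frame_loop(frames):
    # The pattern described above: userspace calls sched_yield() once
    # per generated frame, implicitly relying on the vanilla
    # scheduler's cheap sys_sched_yield() implementation.
    rendered = []
    for n in range(frames):
        rendered.append(render_frame(n))
        os.sched_yield()  # one yield per frame (!)
    return rendered

print(len(frame_loop(100)))  # → 100
```

A scheduler that implements yield "properly" (queueing the yielder 
behind all runnable peers) makes every iteration of such a loop pay a 
full requeue, which is the punishment described above.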

So i've added a yield workaround to -v12, which makes it work similar to 
how the vanilla scheduler and SD do it. (Xorg has been notified and 
this bug should be fixed there too. This took some time to debug because 
the 3D driver i'm using for testing does not use sys_sched_yield().) The 
workaround is activated by default so -v12 should work 'out of the box'.

Mike Galbraith has fixed a bug related to nice levels - the fix should 
make negative nice levels more potent again.

Changes since -v10:

 - nice level calculation fixes (Mike Galbraith)

 - load-balancing improvements (this should fix the SMP performance 
   problem reported by Michael Gerdau)

 - remove the sched_sleep_history_max tunable.

 - more debugging fields.

 - various cleanups, fixlets and code reorganization

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome,

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-13 15:38 [patch] CFS scheduler, -v12 Ingo Molnar
@ 2007-05-16  2:04 ` Peter Williams
  2007-05-16  8:08   ` Ingo Molnar
       [not found]   ` <20070516063625.GA9058@elte.hu>
  2007-05-18  0:18 ` Bill Huey
  1 sibling, 2 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-16  2:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Mark Lord

Ingo Molnar wrote:
> i'm pleased to announce release -v12 of the CFS scheduler patchset.
> 
> The CFS patch against v2.6.22-rc1, v2.6.21.1 or v2.6.20.10 can be 
> downloaded from the usual place:
>   
>     http://people.redhat.com/mingo/cfs-scheduler/
> 
> -v12 fixes the '3D bug' that caused trivial latencies in 3D games: it 
> turns out that the problem did not result from any core property of 
> CFS; it was caused by 3D userspace growing dependent on the current 
> inefficiency of the vanilla scheduler's sys_sched_yield() 
> implementation, and CFS's "make yield work well" changes broke it.
> 
> Even a simple 3D app like glxgears does a sys_sched_yield() for every 
> frame it generates (!) on certain 3D cards, which in essence punishes 
> any scheduler that implements sys_sched_yield() in a sane manner. This 
> interaction of CFS's yield implementation with this user-space bug could 
> be the main reason why some testers reported SD to be handling 3D games 
> better than CFS. (SD uses a yield implementation similar to the vanilla 
> scheduler.)
> 
> So i've added a yield workaround to -v12, which makes it work similar to 
> how the vanilla scheduler and SD do it. (Xorg has been notified and 
> this bug should be fixed there too. This took some time to debug because 
> the 3D driver i'm using for testing does not use sys_sched_yield().) The 
> workaround is activated by default so -v12 should work 'out of the box'.
> 
> Mike Galbraith has fixed a bug related to nice levels - the fix should 
> make negative nice levels more potent again.
> 
> Changes since -v10:
> 
>  - nice level calculation fixes (Mike Galbraith)
> 
>  - load-balancing improvements (this should fix the SMP performance 
>    problem reported by Michael Gerdau)
> 
>  - remove the sched_sleep_history_max tunable.
> 
>  - more debugging fields.
> 
>  - various cleanups, fixlets and code reorganization
> 
> As usual, any sort of feedback, bugreport, fix and suggestion is more 
> than welcome,

Load balancing appears to be badly broken in this version.  When I 
started 4 hard spinners on my 2 CPU machine one ended up on one CPU and 
the other 3 on the other CPU and they stayed there.
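
A minimal reproduction of this kind of test might look like the sketch 
below (illustrative only, not the program actually used; it assumes a 
Linux /proc filesystem, where field 39 of /proc/<pid>/stat is the CPU 
the task last ran on):

```python
import os, time
from multiprocessing import Process

def last_cpu(pid):
    # Field 39 of /proc/<pid>/stat is the CPU this task last ran on.
    with open(f"/proc/{pid}/stat") as f:
        rest = f.read().rsplit(")", 1)[1].split()
    return int(rest[36])

def spin(seconds):
    # A "hard spinner": pure CPU burn, no sleeping.
    end = time.time() + seconds
    while time.time() < end:
        pass

if __name__ == "__main__":
    procs = [Process(target=spin, args=(2,)) for _ in range(4)]
    for p in procs:
        p.start()
    time.sleep(1)
    # With working load balancing on a 2-CPU box this should tend
    # toward a 2/2 split rather than a persistent 1/3 split.
    print(sorted(last_cpu(p.pid) for p in procs))
    for p in procs:
        p.join()
```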

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-16  2:04 ` Peter Williams
@ 2007-05-16  8:08   ` Ingo Molnar
  2007-05-16 23:42     ` Peter Williams
       [not found]   ` <20070516063625.GA9058@elte.hu>
  1 sibling, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-05-16  8:08 UTC (permalink / raw)
  To: Peter Williams
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Mark Lord


* Peter Williams <pwil3058@bigpond.net.au> wrote:

> >As usual, any sort of feedback, bugreport, fix and suggestion is more 
> >than welcome,
> 
> Load balancing appears to be badly broken in this version.  When I 
> started 4 hard spinners on my 2 CPU machine one ended up on one CPU 
> and the other 3 on the other CPU and they stayed there.

hm, i cannot reproduce this on 4 different SMP boxen, trying various 
combinations of SCHED_SMT/MC and other .config options that might make a 
difference to balancing. Could you send me your .config?

	Ingo


* Re: [patch] CFS scheduler, -v12
  2007-05-16  8:08   ` Ingo Molnar
@ 2007-05-16 23:42     ` Peter Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-16 23:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Thomas Gleixner,
	caglar, Willy Tarreau, Gene Heskett, Mark Lord

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>>> As usual, any sort of feedback, bugreport, fix and suggestion is more 
>>> than welcome,
>> Load balancing appears to be badly broken in this version.  When I 
>> started 4 hard spinners on my 2 CPU machine one ended up on one CPU 
>> and the other 3 on the other CPU and they stayed there.
> 
> hm, i cannot reproduce this on 4 different SMP boxen, trying various 
> combinations of SCHED_SMT/MC

You may need to try more than once.  Testing load balancing can be a 
pain as there's always a possibility you'll get a good result just by 
chance: you need a bunch of good results to say it's OK but only one 
bad result to say it's broken.

> and other .config options that might make a 
> difference to balancing. Could you send me your .config?

Sent separately.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
       [not found]   ` <20070516063625.GA9058@elte.hu>
@ 2007-05-17 23:45     ` Peter Williams
       [not found]       ` <20070518071325.GB28702@elte.hu>
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Williams @ 2007-05-17 23:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>> Load balancing appears to be badly broken in this version.  When I 
>> started 4 hard spinners on my 2 CPU machine one ended up on one CPU 
>> and the other 3 on the other CPU and they stayed there.
> 
> could you try to debug this a bit more?

I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1 
with and without CFS; and the problem is always present.  It's not 
"nice" related as all four tasks are run at nice == 0.

It's possible that this problem has been in the kernel for a while 
without being noticed as, even with totally random allocation of tasks 
to CPUs (without any attempt to balance), there's a quite high 
probability of the desirable 2/2 split occurring.  So one needs to 
repeat the test several times to have reasonable assurance that the 
problem is not present.  I.e. this has the characteristics of an 
intermittent bug, with all the debugging problems that introduces.

The probabilities for the 3 split possibilities for random allocation are:

	2/2 (the desired outcome) is 3/8 likely,
	1/3 is 4/8 likely, and
	0/4 is 1/8 likely.
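
These figures can be checked by enumerating all 2^4 random assignments 
of 4 tasks to 2 CPUs:

```python
from fractions import Fraction
from itertools import product

def split_probabilities(tasks=4, cpus=2):
    # Enumerate every equally likely assignment of tasks to CPUs
    # and tally the resulting sorted (min, max) split per CPU.
    counts = {}
    total = 0
    for assign in product(range(cpus), repeat=tasks):
        split = tuple(sorted(assign.count(c) for c in range(cpus)))
        counts[split] = counts.get(split, 0) + 1
        total += 1
    return {s: Fraction(n, total) for s, n in counts.items()}

probs = split_probabilities()
print(probs[(2, 2)], probs[(1, 3)], probs[(0, 4)])
# → 3/8 1/2 1/8   (1/2 == 4/8, matching the list above)
```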

I'm pretty sure that this problem wasn't present when smpnice went into 
the kernel which is the last time I did a lot of load balance testing.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-13 15:38 [patch] CFS scheduler, -v12 Ingo Molnar
  2007-05-16  2:04 ` Peter Williams
@ 2007-05-18  0:18 ` Bill Huey
  2007-05-18  1:01   ` Bill Huey
                     ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Bill Huey @ 2007-05-18  0:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	Bill Huey (hui), William Lee Irwin III

On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> Even a simple 3D app like glxgears does a sys_sched_yield() for every 
> frame it generates (!) on certain 3D cards, which in essence punishes 
> any scheduler that implements sys_sched_yield() in a sane manner. This 
> interaction of CFS's yield implementation with this user-space bug could 
> be the main reason why some testers reported SD to be handling 3D games 
> better than CFS. (SD uses a yield implementation similar to the vanilla 
> scheduler.)
> 
> So i've added a yield workaround to -v12, which makes it work similar to 
> how the vanilla scheduler and SD do it. (Xorg has been notified and 
> this bug should be fixed there too. This took some time to debug because 
> the 3D driver i'm using for testing does not use sys_sched_yield().) The 
> workaround is activated by default so -v12 should work 'out of the box'.

This is an incorrect analysis. OpenGL has the ability to "yield" after
every frame specifically for the SGI IRIX (React/Pro) frame scheduler
(driven by the system vertical retrace interrupt) so that it can free
up CPU resources for other tasks to run. The problem here is that the
yield behavior is treated generically instead of specifically for a
particular proportional-share scheduler policy.

The correct solution is for the app to use a directed yield and a policy
that can directly support it so that OpenGL can guarantee a frame rate
governed by the CPU bandwidth allocated by the scheduler.

Will is working on such a mechanism now.
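
The directed-yield idea can be pictured with a toy model (purely 
illustrative; this is not Will's mechanism nor any real kernel 
interface): instead of dropping to the back of the queue, the yielder 
donates its remaining timeslice to a chosen task.

```python
def directed_yield(slices, src, dst):
    # Toy model: transfer the remaining timeslice of task `src`
    # to task `dst` instead of simply giving it up. Under a
    # proportional-share policy this lets a producer (e.g. a GL
    # client) direct its bandwidth to a consumer (e.g. the server).
    slices[dst] += slices[src]
    slices[src] = 0
    return slices

slices = {"glxgears": 4, "Xorg": 6}
print(directed_yield(slices, "glxgears", "Xorg"))
# → {'glxgears': 0, 'Xorg': 10}
```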

bill



* Re: [patch] CFS scheduler, -v12
  2007-05-18  0:18 ` Bill Huey
@ 2007-05-18  1:01   ` Bill Huey
  2007-05-18  4:13   ` William Lee Irwin III
  2007-05-18  7:31   ` Ingo Molnar
  2 siblings, 0 replies; 35+ messages in thread
From: Bill Huey @ 2007-05-18  1:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	William Lee Irwin III, Bill Huey (hui)

On Thu, May 17, 2007 at 05:18:41PM -0700, Bill Huey wrote:
> On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> > Even a simple 3D app like glxgears does a sys_sched_yield() for every 
> > frame it generates (!) on certain 3D cards, which in essence punishes 
> > any scheduler that implements sys_sched_yield() in a sane manner. This 
> > interaction of CFS's yield implementation with this user-space bug could 
> > be the main reason why some testers reported SD to be handling 3D games 
> > better than CFS. (SD uses a yield implementation similar to the vanilla 
> > scheduler.)
> > 
> > So i've added a yield workaround to -v12, which makes it work similar to 
> > how the vanilla scheduler and SD do it. (Xorg has been notified and 
> > this bug should be fixed there too. This took some time to debug because 
> > the 3D driver i'm using for testing does not use sys_sched_yield().) The 
> > workaround is activated by default so -v12 should work 'out of the box'.
> 
> This is an incorrect analysis. OpenGL has the ability to "yield" after
> every frame specifically for SGI IRIX (React/Pro) frame scheduler (driven
> by the system vertical retrace interrupt) so that it can free up CPU
> resources for other tasks to run. The problem here is that the yield
> behavior is treated generally instead of specifically to a particular
> proportion scheduler policy.
> 
> The correct solution is for the app to use a directed yield and a policy
> that can directly support it so that OpenGL can guarantee a frame rate
> governed by CPU bandwidth allocated by the scheduler.
> 
> Will is working on such a mechanism now.

Follow up:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Developer/books/REACT_PG/sgi_html/ch04.html

bill



* Re: [patch] CFS scheduler, -v12
  2007-05-18  0:18 ` Bill Huey
  2007-05-18  1:01   ` Bill Huey
@ 2007-05-18  4:13   ` William Lee Irwin III
  2007-05-18  7:31   ` Ingo Molnar
  2 siblings, 0 replies; 35+ messages in thread
From: William Lee Irwin III @ 2007-05-18  4:13 UTC (permalink / raw)
  To: Bill Huey
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Con Kolivas, Nick Piggin, Mike Galbraith, Arjan van de Ven,
	Peter Williams, Thomas Gleixner, caglar, Willy Tarreau,
	Gene Heskett, Mark Lord

On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
>> So i've added a yield workaround to -v12, which makes it work similar to 
>> how the vanilla scheduler and SD do it. (Xorg has been notified and 
>> this bug should be fixed there too. This took some time to debug because 
>> the 3D driver i'm using for testing does not use sys_sched_yield().) The 
>> workaround is activated by default so -v12 should work 'out of the box'.

On Thu, May 17, 2007 at 05:18:41PM -0700, Bill Huey wrote:
> This is an incorrect analysis. OpenGL has the ability to "yield" after
> every frame specifically for SGI IRIX (React/Pro) frame scheduler (driven
> by the system vertical retrace interrupt) so that it can free up CPU
> resources for other tasks to run. The problem here is that the yield
> behavior is treated generally instead of specifically to a particular
> proportion scheduler policy.
> The correct solution is for the app to use a directed yield and a policy
> that can directly support it so that OpenGL can guarantee a frame rate
> governed by CPU bandwidth allocated by the scheduler.
> Will is working on such a mechanism now.

What? AFAIK the CFS patches already implement directed yields.


-- wli


* Re: [patch] CFS scheduler, -v12
  2007-05-18  0:18 ` Bill Huey
  2007-05-18  1:01   ` Bill Huey
  2007-05-18  4:13   ` William Lee Irwin III
@ 2007-05-18  7:31   ` Ingo Molnar
  2 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-05-18  7:31 UTC (permalink / raw)
  To: Bill Huey
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Con Kolivas,
	Nick Piggin, Mike Galbraith, Arjan van de Ven, Peter Williams,
	Thomas Gleixner, caglar, Willy Tarreau, Gene Heskett, Mark Lord,
	William Lee Irwin III


* Bill Huey <billh@gnuppy.monkey.org> wrote:

> On Sun, May 13, 2007 at 05:38:53PM +0200, Ingo Molnar wrote:
> > Even a simple 3D app like glxgears does a sys_sched_yield() for 
> > every frame it generates (!) on certain 3D cards, which in essence 
> > punishes any scheduler that implements sys_sched_yield() in a sane 
> > manner. This interaction of CFS's yield implementation with this 
> > user-space bug could be the main reason why some testers reported SD 
> > to be handling 3D games better than CFS. (SD uses a yield 
> > implementation similar to the vanilla scheduler.)
> > 
> > So i've added a yield workaround to -v12, which makes it work 
> > similar to how the vanilla scheduler and SD do it. (Xorg has been 
> > notified and this bug should be fixed there too. This took some time 
> > to debug because the 3D driver i'm using for testing does not use 
> > sys_sched_yield().) The workaround is activated by default so -v12 
> > should work 'out of the box'.
> 
> This is an incorrect analysis. [...]

i'm puzzled, incorrect in specifically what way?

> [...] OpenGL has the ability to "yield" after every frame specifically 
> for SGI IRIX (React/Pro) frame scheduler (driven by the system 
> vertical retrace interrupt) so that it can free up CPU resources for 
> other tasks to run. [...]

what you say makes no sense to me. The majority of Linux 3D apps are 
already driven by the vertical retrace interrupt and properly 'yield the 
CPU' if they wish, but this has nothing to do with sys_sched_yield().

> The correct solution is for the app to use a directed yield and a 
> policy that can directly support it so that OpenGL can guaratee a 
> frame rate governed by CPU bandwidth allocated by the scheduler.
> 
> Will is working on such a mechanism now.

i'm even more puzzled. I've added sched_yield_to() to CFS -v6 and it's 
been part of CFS since then. I'm curious, on what mechanism is Will 
working and have any patches been sent to lkml for discussion?

	Ingo


* Re: [patch] CFS scheduler, -v12
       [not found]       ` <20070518071325.GB28702@elte.hu>
@ 2007-05-18 13:11         ` Peter Williams
  2007-05-18 13:26           ` Peter Williams
                             ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-18 13:11 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List

Ingo Molnar wrote:
> * Peter Williams <pwil3058@bigpond.net.au> wrote:
> 
>> I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1 
>> with and without CFS; and the problem is always present.  It's not 
>> "nice" related as all four tasks are run at nice == 0.
> 
> could you try -v13 and did this behavior get better in any way?

It's still there but I've got a theory about what the problem is, which 
is supported by some other tests I've done.

What I'd forgotten is that I had gkrellm running as well as top (to 
observe which CPU tasks were on) at the same time as the spinners were 
running.  This meant that between them top, gkrellm and X were using 
about 2% of the CPU -- not much but enough to make it possible that at 
least one of them was running when the load balancer was trying to do 
its thing.

This raises two possibilities: 1. the system looked balanced and 2. the 
system didn't look balanced but one of top, gkrellm or X was moved 
instead of one of the spinners.

If it's 1 then there's not much we can do about it except say that it 
only happens in these strange circumstances.  If it's 2 then we may have 
to modify the way move_tasks() selects which tasks to move (if we think 
that the circumstances warrant it -- I'm not sure that this is the case).

To examine these possibilities I tried two variations of the test.

a. run the spinners at nice == -10 instead of nice == 0.  When I did 
this the load balancing was perfect on 10 consecutive runs which 
according to my calculations makes it 99.9999997% certain that this 
didn't happen by chance.  This supports theory 2 above.

b. run the tests without gkrellm running but use nice == 0 for the 
spinners.  When I did this the load balancing was mostly perfect but was 
quite volatile (switching between a 2/2 and 1/3 allocation of spinners 
to CPUs) but the %CPU allocation was quite good with the spinners all 
getting approximately 49% of a CPU each.  This also supports theory 2 
above and gives weak support to theory 1 above.

This leaves the question of what to do about it.  Given that most CPU 
intensive tasks on a real system probably only run for a few tens of 
milliseconds, it probably won't matter much in practice, except that a 
malicious user could exploit it to disrupt the system.

So my opinion is that we probably do need to do something about it but 
that it's not urgent.

One thing that might work is to jitter the load balancing interval a 
bit.  The reason I say this is that one of the characteristics of top 
and gkrellm is that they run at a more or less constant interval (and, 
in this case, X would also be following this pattern as it's doing 
screen updates for top and gkrellm) and this means that it's possible 
for the load balancing interval to synchronize with their intervals, 
which in turn causes the observed problem.  A jittered load balancing 
interval should break the synchronization.  This would certainly be 
simpler than trying to change the move_tasks() logic for selecting 
which tasks to move.
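
The jittering idea might be sketched like this (illustrative only; the 
base interval and the ±10% range are made-up values, not from any 
patch):

```python
import random

def jittered_interval(balance_interval, jitter=0.10):
    # Spread each load-balancing tick by up to ±jitter so the
    # balancer cannot stay phase-locked with periodic tasks such
    # as top/gkrellm and their X screen updates.
    return balance_interval * (1.0 + random.uniform(-jitter, jitter))

intervals = [jittered_interval(100.0) for _ in range(5)]
print(all(90.0 <= i <= 110.0 for i in intervals))  # → True
```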

What do you think?
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-18 13:11         ` Peter Williams
@ 2007-05-18 13:26           ` Peter Williams
  2007-05-19 13:27           ` Dmitry Adamushko
  2007-05-21 15:25           ` Dmitry Adamushko
  2 siblings, 0 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-18 13:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel Mailing List

Peter Williams wrote:
> Ingo Molnar wrote:
>> * Peter Williams <pwil3058@bigpond.net.au> wrote:
>>
>>> I've now done this test on a number of kernels: 2.6.21 and 2.6.22-rc1 
>>> with and without CFS; and the problem is always present.  It's not 
>>> "nice" related as all four tasks are run at nice == 0.
>>
>> could you try -v13 and did this behavior get better in any way?
> 
> It's still there but I've got a theory about what the problem is, which 
> is supported by some other tests I've done.
> 
> What I'd forgotten is that I had gkrellm running as well as top (to 
> observe which CPU tasks were on) at the same time as the spinners were 
> running.  This meant that between them top, gkrellm and X were using 
> about 2% of the CPU -- not much but enough to make it possible that at 
> least one of them was running when the load balancer was trying to do 
> its thing.
> 
> This raises two possibilities: 1. the system looked balanced and 2. the 
> system didn't look balanced but one of  top, gkrellm or X was moved 
> instead of one of the spinners.
> 
> If it's 1 then there's not much we can do about it except say that it 
> only happens in these strange circumstances.  If it's 2 then we may have 
> to modify the way move_tasks() selects which tasks to move (if we think 
> that the circumstances warrant it -- I'm not sure that this is the case).
> 
> To examine these possibilities I tried two variations of the test.
> 
> a. run the spinners at nice == -10 instead of nice == 0.  When I did 
> this the load balancing was perfect on 10 consecutive runs which 
> according to my calculations makes it 99.9999997% certain that this 
> didn't happen by chance.  This supports theory 2 above.
> 
> b. run the tests without gkrellm running but use nice == 0 for the 
> spinners.  When I did this the load balancing was mostly perfect but was 
> quite volatile (switching between a 2/2 and 1/3 allocation of spinners 
> to CPUs) but the %CPU allocation was quite good with the spinners all 
> getting approximately 49% of a CPU each.  This also supports theory 2 
> above and gives weak support to theory 1 above.
> 
> This leaves the question of what to do about it.  Given that most CPU 
> intensive tasks on a real system probably only run for a few tens of 
> milliseconds it probably won't matter much on a real system except that 
> a malicious user could exploit it to disrupt a system.
> 
> So my opinion is that we probably do need to do something about it but 
> that it's not urgent.
> 
> One thing that might work is to jitter the load balancing interval a 
> bit.  The reason I say this is that one of the characteristics of top 
> and gkrellm is that they run at a more or less constant interval (and, 
> in this case, X would also be following this pattern as it's doing 
> screen updates for top and gkrellm) and this means that it's possible 
> for the load balancing interval to synchronize with their intervals 
> which in turn causes the observed problem.  A jittered load balancing 
> interval should break the synchronization.  This would certainly be 
> simpler than trying to change the move_tasks() logic for selecting which 
> tasks to move.

I should have added that the reason I think this mooted synchronization 
is the cause of the problem is that I can think of no other way that 
tasks with such low activity (2% between the 3 of them) could make the 
imbalance in the spinner-to-CPU allocation so persistent.

> 
> What do you think?


Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-18 13:11         ` Peter Williams
  2007-05-18 13:26           ` Peter Williams
@ 2007-05-19 13:27           ` Dmitry Adamushko
  2007-05-20  1:41             ` Peter Williams
  2007-05-21  8:29             ` William Lee Irwin III
  2007-05-21 15:25           ` Dmitry Adamushko
  2 siblings, 2 replies; 35+ messages in thread
From: Dmitry Adamushko @ 2007-05-19 13:27 UTC (permalink / raw)
  To: Peter Williams; +Cc: Ingo Molnar, Linux Kernel

On 18/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> [...]
> One thing that might work is to jitter the load balancing interval a
> bit.  The reason I say this is that one of the characteristics of top
> and gkrellm is that they run at a more or less constant interval (and,
> in this case, X would also be following this pattern as it's doing
> screen updates for top and gkrellm) and this means that it's possible
> for the load balancing interval to synchronize with their intervals
> which in turn causes the observed problem.  A jittered load balancing
> interval should break the synchronization.  This would certainly be
> simpler than trying to change the move_tasks() logic for selecting which
> tasks to move.

Just another (quick) idea. Say, the load balancer would consider not
only p->load_weight but also something like Tw(task) =
(time_spent_on_runqueue / total_task's_runtime) * some_scale_constant
as an additional "load" component (OTOH, when a task starts, it takes
some time for this parameter to become meaningful). I guess, it could
address the scenarios you have described (but maybe break some others
as well :) ...
Any hints on why it's stupid?
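
Written out as code, the proposed Tw component looks something like the 
sketch below (the scale constant and the integer arithmetic are 
assumptions, not part of the proposal):

```python
def extra_load(time_on_rq, total_runtime, scale=1024):
    # Tw(task) = (time_spent_on_runqueue / total_task_runtime) * scale.
    # A task that spends most of its life runnable-but-waiting would
    # contribute more "load" than its weight alone suggests.
    if total_runtime == 0:
        return 0  # parameter not yet meaningful for a brand-new task
    return time_on_rq * scale // total_runtime

print(extra_load(50, 100))  # → 512
print(extra_load(2, 100))   # → 20
```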


>
> Peter
> --
> Peter Williams                                   pwil3058@bigpond.net.au

-- 
Best regards,
Dmitry Adamushko


* Re: [patch] CFS scheduler, -v12
  2007-05-19 13:27           ` Dmitry Adamushko
@ 2007-05-20  1:41             ` Peter Williams
  2007-05-21  8:29             ` William Lee Irwin III
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-20  1:41 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Ingo Molnar, Linux Kernel

Dmitry Adamushko wrote:
> On 18/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> [...]
>> One thing that might work is to jitter the load balancing interval a
>> bit.  The reason I say this is that one of the characteristics of top
>> and gkrellm is that they run at a more or less constant interval (and,
>> in this case, X would also be following this pattern as it's doing
>> screen updates for top and gkrellm) and this means that it's possible
>> for the load balancing interval to synchronize with their intervals
>> which in turn causes the observed problem.  A jittered load balancing
>> interval should break the synchronization.  This would certainly be
>> simpler than trying to change the move_tasks() logic for selecting which
>> tasks to move.
> 
> Just another (quick) idea. Say, the load balancer would consider not
> only p->load_weight but also something like Tw(task) =
> (time_spent_on_runqueue / total_task's_runtime) * some_scale_constant
> as an additional "load" component (OTOH, when a task starts, it takes
> some time for this parameter to become meaningful). I guess, it could
> address the scenarios you have described (but maybe break some others
> as well :) ...
> Any hints on why it's stupid?

Well, that is the kind of thing I was hoping to avoid for reasons of 
complexity.  I think that the actual implementation would be more 
complex than it sounds and would possibly require multiple passes down 
the list of movable tasks, which would be bad for overhead.

Basically, I don't think that the problem is serious enough to warrant a 
complex solution.  But I may be wrong about how complex the 
implementation would be.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-19 13:27           ` Dmitry Adamushko
  2007-05-20  1:41             ` Peter Williams
@ 2007-05-21  8:29             ` William Lee Irwin III
  2007-05-21  8:57               ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread
From: William Lee Irwin III @ 2007-05-21  8:29 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Peter Williams, Ingo Molnar, Linux Kernel

On Sat, May 19, 2007 at 03:27:54PM +0200, Dmitry Adamushko wrote:
> Just an(quick) another idea. Say, the load balancer would consider not
> only p->load_weight but also something like Tw(task) =
> (time_spent_on_runqueue / total_task's_runtime) * some_scale_constant
> as an additional "load" component (OTOH, when a task starts, it takes
> some time for this parameter to become meaningful). I guess, it could
> address the scenarios your have described (but maybe break some others
> as well :) ...
> Any hints on why it's stupid?

I guess I'll take time out from coding to chime in.

cfs should probably consider aggregate lag as opposed to aggregate
weighted load. Mainline's convergence to proper CPU bandwidth
distributions on SMP (e.g. N+1 tasks of equal nice on N cpus) is
incredibly slow and probably also fragile in the presence of arrivals
and departures partly because of this. Tong Li's DWRR repairs the
deficit in mainline by synchronizing epochs or otherwise bounding epoch
dispersion. This doesn't directly translate to cfs. In cfs each cpu
should probably try to figure out whether its aggregate lag (e.g. via
minimax) is above or below average, and push to or pull from the other
half accordingly.


-- wli


* Re: [patch] CFS scheduler, -v12
  2007-05-21  8:29             ` William Lee Irwin III
@ 2007-05-21  8:57               ` Ingo Molnar
  2007-05-21 12:08                 ` William Lee Irwin III
  2007-05-22 16:48                 ` Chris Friesen
  0 siblings, 2 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-05-21  8:57 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: Dmitry Adamushko, Peter Williams, Linux Kernel


* William Lee Irwin III <wli@holomorphy.com> wrote:

> cfs should probably consider aggregate lag as opposed to aggregate 
> weighted load. Mainline's convergence to proper CPU bandwidth 
> distributions on SMP (e.g. N+1 tasks of equal nice on N cpus) is 
> incredibly slow and probably also fragile in the presence of arrivals 
> and departures partly because of this. [...]

hm, have you actually tested CFS before coming to this conclusion?

CFS is fair even on SMP. Consider for example the worst-case 
3-tasks-on-2-CPUs workload on a 2-CPU box:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2658 mingo     20   0  1580  248  200 R   67  0.0   0:56.30 loop
 2656 mingo     20   0  1580  252  200 R   66  0.0   0:55.55 loop
 2657 mingo     20   0  1576  248  200 R   66  0.0   0:55.24 loop

66% of CPU time for each task. The 'TIME+' column shows a 2% spread 
between the slowest and the fastest loop after just 1 minute of runtime 
(and the spread gets narrower with time). Mainline does a 50% / 50% / 
100% split:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3121 mingo     25   0  1584  252  204 R  100  0.0   0:13.11 loop
 3120 mingo     25   0  1584  256  204 R   50  0.0   0:06.68 loop
 3119 mingo     25   0  1584  252  204 R   50  0.0   0:06.64 loop

and i fixed that in CFS.

or consider a sleepy workload like massive_intr, 3-tasks-on-2-CPUs:

  europe:~> head -1 /proc/interrupts
             CPU0       CPU1

  europe:~> ./massive_intr 3 10
  002623  00000722
  002621  00000720
  002622  00000721

Or a 5-tasks-on-2-CPUs workload:

  europe:~> ./massive_intr 5 50
  002649  00002519
  002653  00002492
  002651  00002478
  002652  00002510
  002650  00002478

that's around 1% of spread.

load-balancing is a performance vs. fairness tradeoff so we won't be able 
to make it precisely fair because that's hideously expensive on SMP 
(barring someone showing a working patch, of course) - but in CFS i got 
quite close to having it very fair in practice.

> [...] Tong Li's DWRR repairs the deficit in mainline by synchronizing 
> epochs or otherwise bounding epoch dispersion. This doesn't directly 
> translate to cfs. In cfs cpu should probably try to figure out if its 
> aggregate lag (e.g. via minimax) is above or below average, and push 
> to or pull from the other half accordingly.

i'd first like to see a demonstration of a problem to solve, before 
thinking about more complex solutions ;-)

	Ingo


* Re: [patch] CFS scheduler, -v12
  2007-05-21  8:57               ` Ingo Molnar
@ 2007-05-21 12:08                 ` William Lee Irwin III
  2007-05-22 16:48                 ` Chris Friesen
  1 sibling, 0 replies; 35+ messages in thread
From: William Lee Irwin III @ 2007-05-21 12:08 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Dmitry Adamushko, Peter Williams, Linux Kernel

* William Lee Irwin III <wli@holomorphy.com> wrote:
>> cfs should probably consider aggregate lag as opposed to aggregate 
>> weighted load. Mainline's convergence to proper CPU bandwidth 
>> distributions on SMP (e.g. N+1 tasks of equal nice on N cpus) is 
>> incredibly slow and probably also fragile in the presence of arrivals 
>> and departures partly because of this. [...]

On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> hm, have you actually tested CFS before coming to this conclusion?
> CFS is fair even on SMP. Consider for example the worst-case 

No. It's mostly a response to Dmitry's suggestion. I've done all of
the benchmark/testcase-writing on mainline.


On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> 3-tasks-on-2-CPUs workload on a 2-CPU box:
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2658 mingo     20   0  1580  248  200 R   67  0.0   0:56.30 loop
>  2656 mingo     20   0  1580  252  200 R   66  0.0   0:55.55 loop
>  2657 mingo     20   0  1576  248  200 R   66  0.0   0:55.24 loop

This looks like you've repaired the slow convergence issue mainline
has by other means.


On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> 66% of CPU time for each task. The 'TIME+' column shows a 2% spread 
> between the slowest and the fastest loop after just 1 minute of runtime 
> (and the spread gets narrower with time). Mainline does a 50% / 50% / 
> 100% split:
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  3121 mingo     25   0  1584  252  204 R  100  0.0   0:13.11 loop
>  3120 mingo     25   0  1584  256  204 R   50  0.0   0:06.68 loop
>  3119 mingo     25   0  1584  252  204 R   50  0.0   0:06.64 loop
> and i fixed that in CFS.

I found that mainline actually converges to the evenly-split shares of
CPU bandwidth, albeit incredibly slowly. Something like an hour is needed.


On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> or consider a sleepy workload like massive_intr, 3-tasks-on-2-CPUs:
>   europe:~> head -1 /proc/interrupts
>              CPU0       CPU1
>   europe:~> ./massive_intr 3 10
>   002623  00000722
>   002621  00000720
>   002622  00000721
> Or a 5-tasks-on-2-CPUs workload:
>   europe:~> ./massive_intr 5 50
>   002649  00002519
>   002653  00002492
>   002651  00002478
>   002652  00002510
>   002650  00002478
> that's around 1% of spread.
> load-balancing is a performance vs. fairness tradeoff so we won't be able 
> to make it precisely fair because that's hideously expensive on SMP 
> (barring someone showing a working patch of course) - but in CFS i got 
> quite close to having it very fair in practice.

This is close enough to Libenzi's load generator to mean this particular
issue is almost certainly fixed.


* William Lee Irwin III <wli@holomorphy.com> wrote:
>> [...] Tong Li's DWRR repairs the deficit in mainline by synchronizing 
>> epochs or otherwise bounding epoch dispersion. This doesn't directly 
>> translate to cfs. In cfs cpu should probably try to figure out if its 
>> aggregate lag (e.g. via minimax) is above or below average, and push 
>> to or pull from the other half accordingly.

On Mon, May 21, 2007 at 10:57:03AM +0200, Ingo Molnar wrote:
> i'd first like to see a demonstration of a problem to solve, before 
> thinking about more complex solutions ;-)

I have other, more difficult to pass testcases. I'm giving up on ipopt
for the quadratic program associated with the \ell^\infty norm and just
pushing out the least squares solution since LAPACK is standard enough
for most people to have or easily obtain.

A quick and dirty approximation is to run one task at each nice level in
a range of nice levels and see if the proportions of CPU bandwidth come
out the same on SMP as UP and how quickly they converge. The testcase
is more comprehensive than that, but it's an easy enough check to see
if there are any issues in this area.


-- wli


* Re: [patch] CFS scheduler, -v12
  2007-05-18 13:11         ` Peter Williams
  2007-05-18 13:26           ` Peter Williams
  2007-05-19 13:27           ` Dmitry Adamushko
@ 2007-05-21 15:25           ` Dmitry Adamushko
  2007-05-21 23:51             ` Peter Williams
  2 siblings, 1 reply; 35+ messages in thread
From: Dmitry Adamushko @ 2007-05-21 15:25 UTC (permalink / raw)
  To: Peter Williams; +Cc: Ingo Molnar, Linux Kernel

On 18/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
[...]
> One thing that might work is to jitter the load balancing interval a
> bit.  The reason I say this is that one of the characteristics of top
> and gkrellm is that they run at a more or less constant interval (and,
> in this case, X would also be following this pattern as it's doing
> screen updates for top and gkrellm) and this means that it's possible
> for the load balancing interval to synchronize with their intervals
> which in turn causes the observed problem.

Hum.. I guess a 0/4 scenario wouldn't fit well into this explanation..
all 4 spinners "tend" to be on CPU0 (and, as I understand it, each gets
~25%?), so there must be plenty of moments for
*idle_balance()* to be called on CPU1 - as gkrellm, top and X together
consume just a few % of CPU. Hence, we should not be that dependent
on the load balancing interval here..

(unlikely conspiracy theory) - idle_balance() and load_balance() (the
latter depends on the load balancing interval, which can be in
sync with top/gkrellm activities as you suggest) always move either
top or gkrellm between themselves.. esp. if X is reniced (so it gets
additional "weight") and happens to be active (on CPU1) when
load_balance() (kicked from scheduler_tick()) runs..

p.s. these are mainly theoretical speculations.. I recently started
looking at the load-balancing code (unfortunately, I don't have an SMP
machine which I can upgrade to the recent kernel) and so far for me it's
mainly about making sure I see things sanely.


-- 
Best regards,
Dmitry Adamushko


* Re: [patch] CFS scheduler, -v12
  2007-05-21 15:25           ` Dmitry Adamushko
@ 2007-05-21 23:51             ` Peter Williams
  2007-05-22  4:47               ` Peter Williams
  2007-05-22 11:52               ` Dmitry Adamushko
  0 siblings, 2 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-21 23:51 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Ingo Molnar, Linux Kernel

Dmitry Adamushko wrote:
> On 18/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> [...]
>> One thing that might work is to jitter the load balancing interval a
>> bit.  The reason I say this is that one of the characteristics of top
>> and gkrellm is that they run at a more or less constant interval (and,
>> in this case, X would also be following this pattern as it's doing
>> screen updates for top and gkrellm) and this means that it's possible
>> for the load balancing interval to synchronize with their intervals
>> which in turn causes the observed problem.
> 
> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..

No, and I haven't seen one.

> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
> ~25% approx.?), so there must be plenty of moments for
> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
> together just a few % of CPU. Hence, we should not be that dependent
> on the load balancing interval here..

The split that I see is 3/1 and neither CPU seems to be favoured with 
respect to getting the majority.  However, top, gkrellm and X seem to be 
always on the CPU with the single spinner.  The CPU% reported by top is 
approx. 33%, 33%, 33% and 100% for the spinners.

If I renice the spinners to -10 (so that their load weights dominate the 
run queue load calculations) the problem goes away and the spinner to 
CPU allocation is 2/2 and top reports them all getting approx. 50% each.

It's also worth noting that I've had tests where the allocation started 
out 2/2 and the system changed it to 3/1 where it stabilized.  So it's 
not just a case of bad luck with the initial CPU allocation when the 
tasks start and the load balancing failing to fix it (which was one of 
my earlier theories).

> 
> (unlikely conspiracy theory)

It's not a conspiracy.  It's just dumb luck. :-)

> - idle_balance() and load_balance() (the
> later is dependent on the load balancing interval which can be in
> sync. with top/gkerllm activities as you suggest) move always either
> top or gkerllm between themselves.. esp. if X is reniced (so it gets
> additional "weight") and happens to be active (on CPU1) when
> load_balance() (kicked from scheduler_tick()) runs..
> 
> p.s. it's mainly theoretical specualtions.. I recently started looking
> at the load-balancing code (unfortunatelly, don't have an SMP machine
> which I can upgrade to the recent kernel) and so far for me it's
> mainly about getting sure I see things sanely.

I'm playing with some jitter experiments at the moment.  The amount of 
jitter needs to be small (a few tenths of a second), as the 
synchronization (if it's happening) is happening at the seconds level: 
the intervals for top and gkrellm will be in the 1 to 5 second range (I 
guess -- I haven't checked) and the load balancing is every 60 seconds.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-21 23:51             ` Peter Williams
@ 2007-05-22  4:47               ` Peter Williams
  2007-05-22 12:03                 ` Peter Williams
  2007-05-22 11:52               ` Dmitry Adamushko
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Williams @ 2007-05-22  4:47 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Ingo Molnar, Linux Kernel

Peter Williams wrote:
> Dmitry Adamushko wrote:
>> On 18/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> [...]
>>> One thing that might work is to jitter the load balancing interval a
>>> bit.  The reason I say this is that one of the characteristics of top
>>> and gkrellm is that they run at a more or less constant interval (and,
>>> in this case, X would also be following this pattern as it's doing
>>> screen updates for top and gkrellm) and this means that it's possible
>>> for the load balancing interval to synchronize with their intervals
>>> which in turn causes the observed problem.
>>
>> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
> 
> No, and I haven't seen one.
> 
>> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
>> ~25% approx.?), so there must be plenty of moments for
>> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
>> together just a few % of CPU. Hence, we should not be that dependent
>> on the load balancing interval here..
> 
> The split that I see is 3/1 and neither CPU seems to be favoured with 
> respect to getting the majority.  However, top, gkrellm and X seem to be 
> always on the CPU with the single spinner.  The CPU% reported by top is 
> approx. 33%, 33%, 33% and 100% for the spinners.
> 
> If I renice the spinners to -10 (so that their load weights dominate 
> run queue load calculations) the problem goes away and the spinner to 
> CPU allocation is 2/2 and top reports them all getting approx. 50% each.

For no good reason other than curiosity, I tried a variation of this 
experiment where I reniced the spinners to 10 instead of -10 and, to my 
surprise, they were allocated 2/2 to the CPUs on average.  I say on 
average because the allocations were a little more volatile and 
occasionally 0/4 splits would occur but these would last for less than 
one top cycle before the 2/2 was re-established.  The quickness of these 
recoveries would indicate that it was most likely the idle balance 
mechanism that restored the balance.

This may point the finger at the tick based load balance mechanism being 
too conservative about when it decides whether tasks need to be moved.  In 
the case where the spinners are at nice == 0, the idle balance mechanism 
never comes into play as the 0/4 split is never seen so only the tick 
based mechanism is in force in this case and this is where the anomalies 
are seen.

This tick-rebalance-only situation also holds for the nice == -10 case, 
but there the high load weights of the spinners overcome the tick based 
load balancing mechanism's conservatism: e.g. the difference in queue 
loads for a 1/3 split is equivalent to the difference that would be 
generated by an imbalance of about 18 nice == 0 spinners, i.e. too big 
to be ignored.

The evidence seems to indicate that IF a rebalance operation gets 
initiated then the right amount of load will get moved.

This new evidence weakens (but does not totally destroy) my 
synchronization (a.k.a. conspiracy) theory.

Peter
PS As the total load weight of 4 nice == 10 tasks is only about 40% of 
the load weight of a single nice == 0 task, the occasional 0/4 split in 
the nice == 10 case is not unexpected, as 0/4 would be the desirable 
allocation if there were exactly one other running task at nice == 0.
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-21 23:51             ` Peter Williams
  2007-05-22  4:47               ` Peter Williams
@ 2007-05-22 11:52               ` Dmitry Adamushko
  2007-05-23  0:10                 ` Peter Williams
  1 sibling, 1 reply; 35+ messages in thread
From: Dmitry Adamushko @ 2007-05-22 11:52 UTC (permalink / raw)
  To: Peter Williams; +Cc: Ingo Molnar, Linux Kernel

On 22/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
> > [...]
> > Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>
> No, and I haven't seen one.

Well, I just took one of your calculated probabilities as something
you have really observed - (*) below.

"The probabilities for the 3 split possibilities for random allocation are:

       2/2 (the desired outcome) is 3/8 likely,
       1/3 is 4/8 likely, and
       0/4 is 1/8 likely.            <-------------------------- (*)
"

> The split that I see is 3/1 and neither CPU seems to be favoured with
> respect to getting the majority.  However, top, gkrellm and X seem to be
> always on the CPU with the single spinner.  The CPU% reported by top is
> approx. 33%, 33%, 33% and 100% for the spinners.

Yes. That said, idle_balance() is out of work in this case.

> If I renice the spinners to -10 (so that their load weights dominate the
> run queue load calculations) the problem goes away and the spinner to
> CPU allocation is 2/2 and top reports them all getting approx. 50% each.

I wonder what would happen if X gets reniced to -10 instead (and
spinners are at 0).. I guess, something I described in my previous
mail (and dubbed "unlikely conspiracy" :) could happen, i.e. 0/4 and
then idle_balance() comes into play..

ok, I see. You have probably achieved a similar effect with the
spinners being reniced to 10 (but here both "X" and "top" gain
additional "weight" wrt the load balancing).

> I'm playing with some jitter experiments at the moment.  The amount of
> jitter needs to be small (a few tenths of a second) as the
> synchronization (if it's happening) is happening at the seconds level as
> the intervals for top and gkrellm will be in the 1 to 5 second range (I
> guess -- I haven't checked) and the load balancing is every 60 seconds.

Hum.. the "every 60 seconds" part puzzles me quite a bit. Looking at
the run_rebalance_domain(), I'd say that it's normally overwritten by
the following code

               if (time_after(next_balance, sd->last_balance + interval))
                        next_balance = sd->last_balance + interval;

the "interval" seems to be *normally* shorter than "60*HZ" (according
to the default params in topology.h).. moreover, in the case of CFS

                if (interval > HZ*NR_CPUS/10)
                        interval = HZ*NR_CPUS/10;

so it can't be > 0.2*HZ in your case (== once in 200 ms at max with
HZ=1000).. am I missing something? TIA


>
> Peter

-- 
Best regards,
Dmitry Adamushko


* Re: [patch] CFS scheduler, -v12
  2007-05-22  4:47               ` Peter Williams
@ 2007-05-22 12:03                 ` Peter Williams
  2007-05-24  7:43                   ` Peter Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Williams @ 2007-05-22 12:03 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Dmitry Adamushko, Linux Kernel

Peter Williams wrote:
> Peter Williams wrote:
>> Dmitry Adamushko wrote:
>>> On 18/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>>> [...]
>>>> One thing that might work is to jitter the load balancing interval a
>>>> bit.  The reason I say this is that one of the characteristics of top
>>>> and gkrellm is that they run at a more or less constant interval (and,
>>>> in this case, X would also be following this pattern as it's doing
>>>> screen updates for top and gkrellm) and this means that it's possible
>>>> for the load balancing interval to synchronize with their intervals
>>>> which in turn causes the observed problem.
>>>
>>> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>>
>> No, and I haven't seen one.
>>
>>> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
>>> ~25% approx.?), so there must be plenty of moments for
>>> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
>>> together just a few % of CPU. Hence, we should not be that dependent
>>> on the load balancing interval here..
>>
>> The split that I see is 3/1 and neither CPU seems to be favoured with 
>> respect to getting the majority.  However, top, gkrellm and X seem to 
>> be always on the CPU with the single spinner.  The CPU% reported by 
>> top is approx. 33%, 33%, 33% and 100% for the spinners.
>>
>> If I renice the spinners to -10 (so that their load weights dominate 
>> the run queue load calculations) the problem goes away and the spinner 
>> to CPU allocation is 2/2 and top reports them all getting approx. 50% 
>> each.
> 
> For no good reason other than curiosity, I tried a variation of this 
> experiment where I reniced the spinners to 10 instead of -10 and, to my 
> surprise, they were allocated 2/2 to the CPUs on average.  I say on 
> average because the allocations were a little more volatile and 
> occasionally 0/4 splits would occur but these would last for less than 
> one top cycle before the 2/2 was re-established.  The quickness of these 
> recoveries would indicate that it was most likely the idle balance 
> mechanism that restored the balance.
> 
> This may point the finger at the tick based load balance mechanism being 
> too conservative

The relevant code, find_busiest_group() and find_busiest_queue(), has a 
lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
as these macros were defined in the kernels I was testing with, I built 
a kernel with these macros undefined and reran my tests.  The 
problems/anomalies were not present in 10 consecutive tests on this new 
kernel.  Even better, on the few occasions that a 3/1 split did occur it 
was quickly corrected to 2/2, and top was reporting approx. 49% of CPU for 
all spinners throughout each of the ten tests.

So all that is required now is an analysis of the code inside the ifdefs 
to see why it is causing a problem.

> in when it decides whether tasks need to be moved.  In 
> the case where the spinners are at nice == 0, the idle balance mechanism 
> never comes into play as the 0/4 split is never seen so only the tick 
> based mechanism is in force in this case and this is where the anomalies 
> are seen.
> 
> This tick-rebalance-only situation also holds for the nice == -10 case, 
> but there the high load weights of the spinners overcome the tick based 
> load balancing mechanism's conservatism: e.g. the difference in queue 
> loads for a 1/3 split is equivalent to the difference that would be 
> generated by an imbalance of about 18 nice == 0 spinners, i.e. too big 
> to be ignored.
> 
> The evidence seems to indicate that IF a rebalance operation gets 
> initiated then the right amount of load will get moved.
> 
> This new evidence weakens (but does not totally destroy) my 
> synchronization (a.k.a. conspiracy) theory.

My synchronization theory is now dead.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


* Re: [patch] CFS scheduler, -v12
  2007-05-21  8:57               ` Ingo Molnar
  2007-05-21 12:08                 ` William Lee Irwin III
@ 2007-05-22 16:48                 ` Chris Friesen
  2007-05-22 20:15                   ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread
From: Chris Friesen @ 2007-05-22 16:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: William Lee Irwin III, Dmitry Adamushko, Peter Williams,
	Linux Kernel

Ingo Molnar wrote:

> CFS is fair even on SMP. Consider for example the worst-case 
> 3-tasks-on-2-CPUs workload on a 2-CPU box:
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2658 mingo     20   0  1580  248  200 R   67  0.0   0:56.30 loop
>  2656 mingo     20   0  1580  252  200 R   66  0.0   0:55.55 loop
>  2657 mingo     20   0  1576  248  200 R   66  0.0   0:55.24 loop
> 
> 66% of CPU time for each task. The 'TIME+' column shows a 2% spread 
> between the slowest and the fastest loop after just 1 minute of runtime 
> (and the spread gets narrower with time).

Is there a way in CFS to tune the amount of time over which the load 
balancer is fair?  (Of course there would be some overhead involved.)

Chris


* Re: [patch] CFS scheduler, -v12
  2007-05-22 16:48                 ` Chris Friesen
@ 2007-05-22 20:15                   ` Ingo Molnar
  2007-05-22 20:49                     ` Chris Friesen
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2007-05-22 20:15 UTC (permalink / raw)
  To: Chris Friesen
  Cc: William Lee Irwin III, Dmitry Adamushko, Peter Williams,
	Linux Kernel


* Chris Friesen <cfriesen@nortel.com> wrote:

> Ingo Molnar wrote:
> 
> >CFS is fair even on SMP. Consider for example the worst-case 
> >3-tasks-on-2-CPUs workload on a 2-CPU box:
> >
> >  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > 2658 mingo     20   0  1580  248  200 R   67  0.0   0:56.30 loop
> > 2656 mingo     20   0  1580  252  200 R   66  0.0   0:55.55 loop
> > 2657 mingo     20   0  1576  248  200 R   66  0.0   0:55.24 loop
> >
> >66% of CPU time for each task. The 'TIME+' column shows a 2% spread 
> >between the slowest and the fastest loop after just 1 minute of runtime 
> >(and the spread gets narrower with time).
> 
> Is there a way in CFS to tune the amount of time over which the load 
> balancer is fair?  (Of course there would be some overhead involved.)

it should be fair pretty fast (see the 10-second run of massive_intr) - 
so it's not 1 minute (if you were worried about that).

	Ingo


* Re: [patch] CFS scheduler, -v12
  2007-05-22 20:15                   ` Ingo Molnar
@ 2007-05-22 20:49                     ` Chris Friesen
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Friesen @ 2007-05-22 20:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: William Lee Irwin III, Dmitry Adamushko, Peter Williams,
	Linux Kernel

Ingo Molnar wrote:
> * Chris Friesen <cfriesen@nortel.com> wrote:

>>Is there a way in CFS to tune the amount of time over which the load 
>>balancer is fair?  (Of course there would be some overhead involved.)

> it should be fair pretty fast (see the 10 seconds run of massive_intr) - 
> so it's not 1 minute (if you were worried about that).

Good to know.. that's exactly what I was worried about.  I work with guys 
who really want predictability above all else, then fairness, and only 
then performance -- if we can't guarantee a given level of performance for 
five 9s then it's useless to us.

Chris


* Re: [patch] CFS scheduler, -v12
  2007-05-22 11:52               ` Dmitry Adamushko
@ 2007-05-23  0:10                 ` Peter Williams
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-23  0:10 UTC (permalink / raw)
  To: Dmitry Adamushko; +Cc: Ingo Molnar, Linux Kernel

Dmitry Adamushko wrote:
> On 22/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>> > [...]
>> > Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>>
>> No, and I haven't seen one.
> 
> Well, I just took one of your calculated probabilities as something
> you have really observed - (*) below.
> 
> "The probabilities for the 3 split possibilities for random allocation are:
> 
>       2/2 (the desired outcome) is 3/8 likely,
>       1/3 is 4/8 likely, and
>       0/4 is 1/8 likely.            <-------------------------- (*)
> "

These are the theoretical probabilities for the outcomes based on the 
random allocation of 4 tasks to 2 CPUs.  There are, in fact, 16 
different ways that 4 tasks can be assigned to 2 CPUs.  6 of these 
result in a 2/2 split, 8 in a 1/3 split and 2 in a 0/4 split.

> 
>> The split that I see is 3/1 and neither CPU seems to be favoured with
>> respect to getting the majority.  However, top, gkrellm and X seem to be
>> always on the CPU with the single spinner.  The CPU% reported by top is
>> approx. 33%, 33%, 33% and 100% for the spinners.
> 
> Yes. That said, idle_balance() is out of work in this case.

Which is why I reported the problem.

> 
>> If I renice the spinners to -10 (so that their load weights dominate the
>> run queue load calculations) the problem goes away and the spinner to
>> CPU allocation is 2/2 and top reports them all getting approx. 50% each.
> 
> I wonder what would happen if X gets reniced to -10 instead (and
> spinners are at 0).. I guess, something I described in my previous
> mail (and dubbed "unlikely conspiracy" :) could happen, i.e. 0/4 and
> then idle_balance() comes into play..

Probably the same as I observed but it's easier to renice the spinners.

I see the 0/4 split for brief moments if I renice the spinners to 10 
instead of -10 but the idle balancer quickly restores it to 2/2.

> 
> ok, I see. You have probably achieved a similar effect with the
> spinners being reniced to 10 (but here both "X" and "top" gain
> additional "weight" wrt the load balancing).
> 
>> I'm playing with some jitter experiments at the moment.  The amount of
>> jitter needs to be small (a few tenths of a second) as the
>> synchronization (if it's happening) is happening at the seconds level as
>> the intervals for top and gkrellm will be in the 1 to 5 second range (I
>> guess -- I haven't checked) and the load balancing is every 60 seconds.
> 
> Hum.. the "every 60 seconds" part puzzles me quite a bit. Looking at
> the run_rebalance_domain(), I'd say that it's normally overwritten by
> the following code
> 
>               if (time_after(next_balance, sd->last_balance + interval))
>                        next_balance = sd->last_balance + interval;
> 
> the "interval" seems to be *normally* shorter than "60*HZ" (according
> to the default params in topology.h).. moreover, in case of the CFS
> 
>                if (interval > HZ*NR_CPUS/10)
>                        interval = HZ*NR_CPUS/10;
> 
> so it can't be > 0.2 HZ in your case (== once in 200 ms at max with
> HZ=1000).. am I missing something? TIA

No, I did.

But it's all academic as my synchronization theory is now dead -- see 
separate e-mail.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-22 12:03                 ` Peter Williams
@ 2007-05-24  7:43                   ` Peter Williams
  2007-05-24 16:45                     ` Siddha, Suresh B
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Williams @ 2007-05-24  7:43 UTC (permalink / raw)
  To: Ingo Molnar, colpatch, Siddha, Suresh B, Nick Piggin, Con Kolivas,
	Christoph Lameter
  Cc: Dmitry Adamushko, Linux Kernel

Peter Williams wrote:
> Peter Williams wrote:
>> Peter Williams wrote:
>>> Dmitry Adamushko wrote:
>>>> On 18/05/07, Peter Williams <pwil3058@bigpond.net.au> wrote:
>>>> [...]
>>>>> One thing that might work is to jitter the load balancing interval a
>>>>> bit.  The reason I say this is that one of the characteristics of top
>>>>> and gkrellm is that they run at a more or less constant interval (and,
>>>>> in this case, X would also be following this pattern as it's doing
>>>>> screen updates for top and gkrellm) and this means that it's possible
>>>>> for the load balancing interval to synchronize with their intervals
>>>>> which in turn causes the observed problem.
>>>>
>>>> Hum.. I guess, a 0/4 scenario wouldn't fit well in this explanation..
>>>
>>> No, and I haven't seen one.
>>>
>>>> all 4 spinners "tend" to be on CPU0 (and as I understand each gets
>>>> ~25% approx.?), so there must be plenty of moments for
>>>> *idle_balance()* to be called on CPU1 - as gkrellm, top and X consume
>>>> together just a few % of CPU. Hence, we should not be that dependent
>>>> on the load balancing interval here..
>>>
>>> The split that I see is 3/1 and neither CPU seems to be favoured with 
>>> respect to getting the majority.  However, top, gkrellm and X seem to 
>>> be always on the CPU with the single spinner.  The CPU% reported by 
>>> top is approx. 33%, 33%, 33% and 100% for the spinners.
>>>
>>> If I renice the spinners to -10 (so that their load weights dominate
>>> the run queue load calculations) the problem goes away and the 
>>> spinner to CPU allocation is 2/2 and top reports them all getting 
>>> approx. 50% each.
>>
>> For no good reason other than curiosity, I tried a variation of this 
>> experiment where I reniced the spinners to 10 instead of -10 and, to 
>> my surprise, they were allocated 2/2 to the CPUs on average.  I say on 
>> average because the allocations were a little more volatile and 
>> occasionally 0/4 splits would occur but these would last for less than 
>> one top cycle before the 2/2 was re-established.  The quickness of 
>> these recoveries would indicate that it was most likely the idle 
>> balance mechanism that restored the balance.
>>
>> This may point the finger at the tick based load balance mechanism 
>> being too conservative
> 
> The relevant code, find_busiest_group() and find_busiest_queue(), has a 
> lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
> as these macros were defined in the kernels I was testing with, I built 
> a kernel with these macros undefined and reran my tests.  The 
> problems/anomalies were not present in 10 consecutive tests on this new 
> kernel.  Even better on the few occasions that a 3/1 split did occur it 
> was quickly corrected to 2/2 and top was reporting approx 49% of CPU for 
> all spinners throughout each of the ten tests.
> 
> So all that is required now is an analysis of the code inside the ifdefs 
> to see why it is causing a problem.

Further testing indicates that CONFIG_SCHED_MC is not implicated and
it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
code in find_busiest_group() as it is common to both macros.

I think this makes the scheduling domain parameter values the most
likely cause of the problem.  I'm not very familiar with this code so 
I've added those who've modified this code in the last year or so to the 
address list of this e-mail.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-24  7:43                   ` Peter Williams
@ 2007-05-24 16:45                     ` Siddha, Suresh B
  2007-05-24 23:23                       ` Peter Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Siddha, Suresh B @ 2007-05-24 16:45 UTC (permalink / raw)
  To: Peter Williams
  Cc: Ingo Molnar, colpatch, Siddha, Suresh B, Nick Piggin, Con Kolivas,
	Christoph Lameter, Dmitry Adamushko, Linux Kernel

On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
>Peter Williams wrote:
>> The relevant code, find_busiest_group() and find_busiest_queue(), has a 
>> lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
>> as these macros were defined in the kernels I was testing with, I built 
>> a kernel with these macros undefined and reran my tests.  The 
>> problems/anomalies were not present in 10 consecutive tests on this new 
>> kernel.  Even better on the few occasions that a 3/1 split did occur it 
>> was quickly corrected to 2/2 and top was reporting approx 49% of CPU for 
>> all spinners throughout each of the ten tests.
>> 
>> So all that is required now is an analysis of the code inside the ifdefs 
>> to see why it is causing a problem.
>
>
>Further testing indicates that CONFIG_SCHED_MC is not implicated and
>it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
>code in find_busiest_group() as it is common to both macros.
>
>I think this makes the scheduling domain parameter values the most
>likely cause of the problem.  I'm not very familiar with this code so 
>I've added those who've modified this code in the last year or 
>so to the 
>address of this e-mail.

What platform is this? I remember you mentioned it's a 2-CPU box. Is it
dual core or dual package or one with HT?

thanks,
suresh

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-24 16:45                     ` Siddha, Suresh B
@ 2007-05-24 23:23                       ` Peter Williams
  2007-05-29 20:45                         ` Siddha, Suresh B
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Williams @ 2007-05-24 23:23 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, colpatch, Nick Piggin, Con Kolivas,
	Christoph Lameter, Dmitry Adamushko, Linux Kernel

Siddha, Suresh B wrote:
> On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
>> Peter Williams wrote:
>>> The relevant code, find_busiest_group() and find_busiest_queue(), has a 
>>> lot of code that is ifdefed by CONFIG_SCHED_MC and CONFIG_SCHED_SMT and, 
>>> as these macros were defined in the kernels I was testing with, I built 
>>> a kernel with these macros undefined and reran my tests.  The 
>>> problems/anomalies were not present in 10 consecutive tests on this new 
>>> kernel.  Even better on the few occasions that a 3/1 split did occur it 
>>> was quickly corrected to 2/2 and top was reporting approx 49% of CPU for 
>>> all spinners throughout each of the ten tests.
>>>
>>> So all that is required now is an analysis of the code inside the ifdefs 
>>> to see why it is causing a problem.
>>
>> Further testing indicates that CONFIG_SCHED_MC is not implicated and
>> it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
>> code in find_busiest_group() as it is common to both macros.
>>
>> I think this makes the scheduling domain parameter values the most
>> likely cause of the problem.  I'm not very familiar with this code so 
>> I've added those who've modified this code in the last year or 
>> so to the 
>> address of this e-mail.
> 
> What platform is this? I remember you mentioned its a 2 cpu box. Is it
> dual core or dual package or one with HT?

It's a single-CPU HT box, i.e. 2 virtual CPUs.  "cat /proc/cpuinfo" produces:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 4
cpu MHz         : 3201.145
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 6403.97
clflush size    : 64

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 3
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 4
cpu MHz         : 3201.145
cache size      : 1024 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
constant_tsc pni monitor ds_cpl cid xtpr
bogomips        : 6400.92
clflush size    : 64


Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-24 23:23                       ` Peter Williams
@ 2007-05-29 20:45                         ` Siddha, Suresh B
  2007-05-29 23:54                           ` Peter Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Siddha, Suresh B @ 2007-05-29 20:45 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Ingo Molnar, colpatch, Nick Piggin, Con Kolivas,
	Christoph Lameter, Dmitry Adamushko, Linux Kernel

On Thu, May 24, 2007 at 04:23:19PM -0700, Peter Williams wrote:
> Siddha, Suresh B wrote:
> > On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
> > >
> > > Further testing indicates that CONFIG_SCHED_MC is not implicated and
> > > it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
> > > code in find_busiest_group() as it is common to both macros.
> > >
> > > I think this makes the scheduling domain parameter values the most
> > > likely cause of the problem.  I'm not very familiar with this code so
> > > I've added those who've modified this code in the last year or
> > > so to the
> > > address of this e-mail.
> >
> > What platform is this? I remember you mentioned its a 2 cpu box. Is it
> > dual core or dual package or one with HT?
> 
> It's a single CPU HT box i.e. 2 virtual CPUs.  "cat /proc/cpuinfo"
> produces:

Peter, I tried on a similar box and couldn't reproduce this problem
with an x86_64 2.6.22-rc3 kernel using defconfig (which has SCHED_SMT
turned on).  I am using top and just the spinners.  I don't have gkrellm
running; is that required to reproduce the issue?

I tried a number of times, and also in runlevels 3 and 5 (with top
running in an xterm in the case of runlevel 5).

In runlevel 5, occasionally for one refresh of the top screen, I see
three spinners on one CPU and one spinner on the other (with X or some
other app also on the CPU with the single spinner).  But it balances
nicely by the immediately following refresh of the top screen.

I tried with various refresh rates of top too.  Do you see the issue
at runlevel 3 too?

thanks,
suresh

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-29 20:45                         ` Siddha, Suresh B
@ 2007-05-29 23:54                           ` Peter Williams
  2007-05-30  0:50                             ` Siddha, Suresh B
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Williams @ 2007-05-29 23:54 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, colpatch, Nick Piggin, Con Kolivas,
	Christoph Lameter, Dmitry Adamushko, Linux Kernel

Siddha, Suresh B wrote:
> On Thu, May 24, 2007 at 04:23:19PM -0700, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> On Thu, May 24, 2007 at 12:43:58AM -0700, Peter Williams wrote:
>>>> Further testing indicates that CONFIG_SCHED_MC is not implicated and
>>>> it's CONFIG_SCHED_SMT that's causing the problem.  This rules out the
>>>> code in find_busiest_group() as it is common to both macros.
>>>>
>>>> I think this makes the scheduling domain parameter values the most
>>>> likely cause of the problem.  I'm not very familiar with this code so
>>>> I've added those who've modified this code in the last year or
>>>> so to the
>>>> address of this e-mail.
>>> What platform is this? I remember you mentioned its a 2 cpu box. Is it
>>> dual core or dual package or one with HT?
>> It's a single CPU HT box i.e. 2 virtual CPUs.  "cat /proc/cpuinfo"
>> produces:
> 
> Peter, I tried on a similar box and couldn't reproduce this problem
> with x86_64

Mine's a 32 bit machine.

> 2.6.22-rc3 kernel

I haven't tried rc3 yet.

> and using defconfig(has SCHED_SMT turned on).
> I am using top and just the spinners.  I don't have gkrellm running, is that
> required to reproduce the issue?

Not necessarily.  But you may need to do a number of trials as sheer 
chance plays a part.

> 
> I tried number of times and also in runlevels 3,5(with top running
> in a xterm incase of runlevel 5).

I've always done it in runlevel 5 using gnome-terminal.  I use 10 
consecutive trials without seeing the problem as an indication of its 
absence, but will cut that short if I see a 3/1 split which quickly 
recovers (see below).

> 
> In runlevel 5, occasionally for one refresh screen of top, I see three
> spinners on one cpu and one spinner on other(with X or someother app
> also on the cpu with one spinner). But it balances nicely for the
> immd next refresh of the top screen.

Yes, that (the fact that it recovers quickly) confirms that the problem 
isn't present on your system.  If load balancing occurs when tasks other 
than the spinners are actually running, a 1/3 split for the spinners is 
a reasonable outcome, so seeing the occasional 1/3 split is OK; but it 
should return to 2/2 as soon as the other tasks sleep.

When I'm doing my tests (for the various combinations of macros) I 
always count a case where I see a 3/1 split that quickly recovers as 
proof that this problem isn't present for that case and cease testing.

> 
> I tried with various refresh rates of top too.. Do you see the issue
> at runlevel 3 too?

I haven't tried that.

Do your spinners ever relinquish the CPU voluntarily?

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-29 23:54                           ` Peter Williams
@ 2007-05-30  0:50                             ` Siddha, Suresh B
  2007-05-30  2:18                               ` Peter Williams
  0 siblings, 1 reply; 35+ messages in thread
From: Siddha, Suresh B @ 2007-05-30  0:50 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Ingo Molnar, colpatch, Nick Piggin, Con Kolivas,
	Christoph Lameter, Dmitry Adamushko, Linux Kernel

On Tue, May 29, 2007 at 04:54:29PM -0700, Peter Williams wrote:
> > I tried with various refresh rates of top too.. Do you see the issue
> > at runlevel 3 too?
> 
> I haven't tried that.
> 
> Do your spinners ever relinquish the CPU voluntarily?

Nope. Simple and plain while(1); 's

I can try 32-bit kernel to check.

thanks,
suresh

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-30  0:50                             ` Siddha, Suresh B
@ 2007-05-30  2:18                               ` Peter Williams
  2007-05-30  4:42                                 ` Siddha, Suresh B
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Williams @ 2007-05-30  2:18 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, Nick Piggin, Con Kolivas, Christoph Lameter,
	Dmitry Adamushko, Linux Kernel

Siddha, Suresh B wrote:
> On Tue, May 29, 2007 at 04:54:29PM -0700, Peter Williams wrote:
>>> I tried with various refresh rates of top too.. Do you see the issue
>>> at runlevel 3 too?
>> I haven't tried that.
>>
>> Do your spinners ever relinquish the CPU voluntarily?
> 
> Nope. Simple and plain while(1); 's
> 
> I can try 32-bit kernel to check.

Don't bother.  I just checked 2.6.22-rc3 and the problem is not present, 
which means something between rc2 and rc3 has fixed the problem.  I hate 
it when problems (appear to) fix themselves as it usually means they're 
just hiding.

I didn't see any patches between rc2 and rc3 that were likely to have 
fixed this (but that doesn't mean there wasn't one).  I'm wondering 
whether I should do a git bisect to see if I can find where it got fixed.

Could you see if you can reproduce it on 2.6.22-rc2?

Thanks
Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-30  2:18                               ` Peter Williams
@ 2007-05-30  4:42                                 ` Siddha, Suresh B
  2007-05-30  6:28                                   ` Peter Williams
  2007-05-31  1:49                                   ` Peter Williams
  0 siblings, 2 replies; 35+ messages in thread
From: Siddha, Suresh B @ 2007-05-30  4:42 UTC (permalink / raw)
  To: Peter Williams
  Cc: Siddha, Suresh B, Ingo Molnar, Nick Piggin, Con Kolivas,
	Christoph Lameter, Dmitry Adamushko, Linux Kernel

On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:
> Siddha, Suresh B wrote:
> > I can try 32-bit kernel to check.
> 
> Don't bother.  I just checked 2.6.22-rc3 and the problem is not present
> which means something between rc2 and rc3 has fixed the problem.  I hate
> it when problems (appear to) fix themselves as it usually means they're
> just hiding.
> 
> I didn't see any patches between rc2 and rc3 that were likely to have
> fixed this (but doesn't mean there wasn't one).  I'm wondering whether I
> should do a git bisect to see if I can find where it got fixed?
> 
> Could you see if you can reproduce it on 2.6.22-rc2?

No. I just tried the 2.6.22-rc2 64-bit version at runlevel 3 on my
remote system at the office. 15 attempts didn't show the issue.

Sure that nothing changed in your test setup?

More experiments tomorrow morning.

thanks,
suresh

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-30  4:42                                 ` Siddha, Suresh B
@ 2007-05-30  6:28                                   ` Peter Williams
  2007-05-31  1:49                                   ` Peter Williams
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-30  6:28 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, Nick Piggin, Con Kolivas, Christoph Lameter,
	Dmitry Adamushko, Linux Kernel

Siddha, Suresh B wrote:
> On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> I can try 32-bit kernel to check.
>> Don't bother.  I just checked 2.6.22-rc3 and the problem is not present
>> which means something between rc2 and rc3 has fixed the problem.  I hate
>> it when problems (appear to) fix themselves as it usually means they're
>> just hiding.
>>
>> I didn't see any patches between rc2 and rc3 that were likely to have
>> fixed this (but doesn't mean there wasn't one).  I'm wondering whether I
>> should do a git bisect to see if I can find where it got fixed?
>>
>> Could you see if you can reproduce it on 2.6.22-rc2?
> 
> No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote
> system at office. 15 attempts didn't show the issue.
> 
> Sure that nothing changed in your test setup?
> 

I just rechecked with an old kernel and the problem was still there.

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch] CFS scheduler, -v12
  2007-05-30  4:42                                 ` Siddha, Suresh B
  2007-05-30  6:28                                   ` Peter Williams
@ 2007-05-31  1:49                                   ` Peter Williams
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Williams @ 2007-05-31  1:49 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Ingo Molnar, Nick Piggin, Con Kolivas, Christoph Lameter,
	Dmitry Adamushko, Linux Kernel

Siddha, Suresh B wrote:
> On Tue, May 29, 2007 at 07:18:18PM -0700, Peter Williams wrote:
>> Siddha, Suresh B wrote:
>>> I can try 32-bit kernel to check.
>> Don't bother.  I just checked 2.6.22-rc3 and the problem is not present
>> which means something between rc2 and rc3 has fixed the problem.  I hate
>> it when problems (appear to) fix themselves as it usually means they're
>> just hiding.
>>
>> I didn't see any patches between rc2 and rc3 that were likely to have
>> fixed this (but doesn't mean there wasn't one).  I'm wondering whether I
>> should do a git bisect to see if I can find where it got fixed?
>>
>> Could you see if you can reproduce it on 2.6.22-rc2?
> 
> No. Just tried 2.6.22-rc2 64-bit version at runlevel 3 on my remote
> system at office. 15 attempts didn't show the issue.
> 
> Sure that nothing changed in your test setup?
> 
> More experiments tomorrow morning..

I've finished bisecting and the patch at which things appear to improve 
is cd5477911fc9f5cc64678e2b95cdd606c59a11b5, which is in the middle of a 
bunch of patches reorganizing the link phase of the build.  The patch 
description is:

kbuild: add "Section mismatch" warning whitelist for powerpc
author	Li Yang <leoli@freescale.com>
	Mon, 14 May 2007 10:04:28 +0000 (18:04 +0800)
committer	Sam Ravnborg <sam@ravnborg.org>
	Sat, 19 May 2007 07:11:57 +0000 (09:11 +0200)
commit	cd5477911fc9f5cc64678e2b95cdd606c59a11b5
tree	d893f07b0040d36dfc60040dc695384e9afcf103
parent	f892b7d480eec809a5dfbd6e65742b3f3155e50e

This patch fixes the following class of "Section mismatch" warnings when
building powerpc platforms.

WARNING: arch/powerpc/kernel/built-in.o - Section mismatch: reference to 
.init.data:.got2 from prom_entry (offset 0x0)
WARNING: arch/powerpc/platforms/built-in.o - Section mismatch: reference 
to .init.text:mpc8313_rdb_probe from .machine.desc after 
'mach_mpc8313_rdb' (at offset 0x4)
....

Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>

scripts/mod/modpost.c

Peter
-- 
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
  -- Ambrose Bierce

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2007-05-31  1:50 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-13 15:38 [patch] CFS scheduler, -v12 Ingo Molnar
2007-05-16  2:04 ` Peter Williams
2007-05-16  8:08   ` Ingo Molnar
2007-05-16 23:42     ` Peter Williams
     [not found]   ` <20070516063625.GA9058@elte.hu>
2007-05-17 23:45     ` Peter Williams
     [not found]       ` <20070518071325.GB28702@elte.hu>
2007-05-18 13:11         ` Peter Williams
2007-05-18 13:26           ` Peter Williams
2007-05-19 13:27           ` Dmitry Adamushko
2007-05-20  1:41             ` Peter Williams
2007-05-21  8:29             ` William Lee Irwin III
2007-05-21  8:57               ` Ingo Molnar
2007-05-21 12:08                 ` William Lee Irwin III
2007-05-22 16:48                 ` Chris Friesen
2007-05-22 20:15                   ` Ingo Molnar
2007-05-22 20:49                     ` Chris Friesen
2007-05-21 15:25           ` Dmitry Adamushko
2007-05-21 23:51             ` Peter Williams
2007-05-22  4:47               ` Peter Williams
2007-05-22 12:03                 ` Peter Williams
2007-05-24  7:43                   ` Peter Williams
2007-05-24 16:45                     ` Siddha, Suresh B
2007-05-24 23:23                       ` Peter Williams
2007-05-29 20:45                         ` Siddha, Suresh B
2007-05-29 23:54                           ` Peter Williams
2007-05-30  0:50                             ` Siddha, Suresh B
2007-05-30  2:18                               ` Peter Williams
2007-05-30  4:42                                 ` Siddha, Suresh B
2007-05-30  6:28                                   ` Peter Williams
2007-05-31  1:49                                   ` Peter Williams
2007-05-22 11:52               ` Dmitry Adamushko
2007-05-23  0:10                 ` Peter Williams
2007-05-18  0:18 ` Bill Huey
2007-05-18  1:01   ` Bill Huey
2007-05-18  4:13   ` William Lee Irwin III
2007-05-18  7:31   ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox