public inbox for linux-kernel@vger.kernel.org
* fair group scheduler not so fair?
@ 2008-05-21 23:59 Chris Friesen
  2008-05-22  6:56 ` Peter Zijlstra
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Chris Friesen @ 2008-05-21 23:59 UTC (permalink / raw)
  To: linux-kernel, vatsa, mingo, a.p.zijlstra, pj

I just downloaded the current git head and started playing with the fair 
group scheduler.  (This is on a dual cpu Mac G5.)

I created two groups, "a" and "b".  Each of them was left with the 
default share of 1024.

I created three cpu hogs by doing "cat /dev/zero > /dev/null".  One hog 
(pid 2435) was put into group "a", while the other two were put into 
group "b".

After giving them time to settle down, "top" showed the following:

2438 cfriesen  20   0  3800  392  336 R 99.5  0.0   4:02.82 cat
2435 cfriesen  20   0  3800  392  336 R 65.9  0.0   3:30.94 cat
2437 cfriesen  20   0  3800  392  336 R 34.3  0.0   3:14.89 cat



Where pid 2435 should have gotten a whole cpu's worth of time, it actually 
got only 66% of a cpu. Is this expected behaviour?



I then redid the test with two hogs in one group and three hogs in the 
other group.  Unfortunately, the cpu shares were not equally distributed 
within each group.  Using a 10-sec interval in "top", I got the following:


2522 cfriesen  20   0  3800  392  336 R 52.2  0.0   1:33.38 cat
2523 cfriesen  20   0  3800  392  336 R 48.9  0.0   1:37.85 cat
2524 cfriesen  20   0  3800  392  336 R 37.0  0.0   1:23.22 cat
2525 cfriesen  20   0  3800  392  336 R 32.6  0.0   1:22.62 cat
2559 cfriesen  20   0  3800  392  336 R 28.7  0.0   0:24.30 cat


Do we expect to see upwards of 9% relative unfairness between processes 
within a class?

I tried messing with the tuneables in /proc/sys/kernel 
(sched_latency_ns, sched_migration_cost, sched_min_granularity_ns) but 
was unable to significantly improve these results.

Any pointers would be appreciated.

Thanks,

Chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
@ 2008-05-22  6:56 ` Peter Zijlstra
  2008-05-22 20:02   ` Chris Friesen
  2008-05-27 17:15 ` Srivatsa Vaddagiri
  2008-05-27 17:28 ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-22  6:56 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, vatsa, mingo, pj

On Wed, 2008-05-21 at 17:59 -0600, Chris Friesen wrote:
> I just downloaded the current git head and started playing with the fair 
> group scheduler.  (This is on a dual cpu Mac G5.)
> 
> I created two groups, "a" and "b".  Each of them was left with the 
> default share of 1024.
> 
> I created three cpu hogs by doing "cat /dev/zero > /dev/null".  One hog 
> (pid 2435) was put into group "a", while the other two were put into 
> group "b".
> 
> After giving them time to settle down, "top" showed the following:
> 
> 2438 cfriesen  20   0  3800  392  336 R 99.5  0.0   4:02.82 cat
> 2435 cfriesen  20   0  3800  392  336 R 65.9  0.0   3:30.94 cat
> 2437 cfriesen  20   0  3800  392  336 R 34.3  0.0   3:14.89 cat
> 
> 
> 
> Where pid 2435 should have gotten a whole cpu worth of time, it actually 
> only got 66% of a cpu. Is this expected behaviour?
> 
> 
> 
> I then redid the test with two hogs in one group and three hogs in the 
> other group.  Unfortunately, the cpu shares were not equally distributed 
> within each group.  Using a 10-sec interval in "top", I got the following:
> 
> 
> 2522 cfriesen  20   0  3800  392  336 R 52.2  0.0   1:33.38 cat
> 2523 cfriesen  20   0  3800  392  336 R 48.9  0.0   1:37.85 cat
> 2524 cfriesen  20   0  3800  392  336 R 37.0  0.0   1:23.22 cat
> 2525 cfriesen  20   0  3800  392  336 R 32.6  0.0   1:22.62 cat
> 2559 cfriesen  20   0  3800  392  336 R 28.7  0.0   0:24.30 cat
> 
> 
> Do we expect to see upwards of 9% relative unfairness between processes 
> within a class?
> 
> I tried messing with the tuneables in /proc/sys/kernel 
> (sched_latency_ns, sched_migration_cost, sched_min_granularity_ns) but 
> was unable to significantly improve these results.
> 
> Any pointers would be appreciated.

What you're testing is SMP fairness of group scheduling, and that code is
somewhat fresh (with known issues, performance being number one amongst
them), but it's quite possible it has some other issues as well.

Could you see if the patches found here:

 http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/

make any difference for you?



* Re: fair group scheduler not so fair?
  2008-05-22  6:56 ` Peter Zijlstra
@ 2008-05-22 20:02   ` Chris Friesen
  2008-05-22 20:07     ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Friesen @ 2008-05-22 20:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, vatsa, mingo, pj

Peter Zijlstra wrote:

> Could you see if the patches found here:
> 
>  http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/
> 
> make any difference for you?

Not much difference.  In the following case pid 2438 is in group "a" and
pids 2439/2440 are in group "b".  Pid 2438 still gets stuck with only 66%.

  2439 cfriesen  20   0  3800  392  336 R 99.7  0.0   3:17.37 cat
  2438 cfriesen  20   0  3800  392  336 R 66.2  0.0   2:33.63 cat
  2440 cfriesen  20   0  3800  392  336 R 33.6  0.0   1:47.53 cat



With 3 tasks in group a, 2 in group b, it's still pretty poor:

  2514 cfriesen  20   0  3800  392  336 R 52.5  0.0   0:48.11 cat
  2515 cfriesen  20   0  3800  392  336 R 50.2  0.0   0:42.53 cat
  2439 cfriesen  20   0  3800  392  336 R 35.4  0.0   4:37.07 cat
  2438 cfriesen  20   0  3800  392  336 R 33.3  0.0   3:34.97 cat
  2440 cfriesen  20   0  3800  392  336 R 28.3  0.0   2:26.17 cat




If I boot with "nosmp" it behaves more or less as expected:

3 tasks in default:
  2427 cfriesen  20   0  3800  392  336 R 33.7  0.0   0:36.54 cat
  2429 cfriesen  20   0  3800  392  336 R 33.5  0.0   0:35.63 cat
  2428 cfriesen  20   0  3800  392  336 R 32.9  0.0   0:35.84 cat


1 task in a, 2 in b:
  2427 cfriesen  20   0  3800  392  336 R 49.8  0.0   1:45.74 cat
  2428 cfriesen  20   0  3800  392  336 R 25.0  0.0   1:38.65 cat
  2429 cfriesen  20   0  3800  392  336 R 25.0  0.0   1:38.18 cat


3 tasks in a, 2 in b:
  2521 cfriesen  20   0  3800  392  336 R 25.2  0.0   0:08.52 cat
  2522 cfriesen  20   0  3800  392  336 R 25.2  0.0   0:08.23 cat
  2427 cfriesen  20   0  3800  392  336 R 16.6  0.0   1:59.39 cat
  2429 cfriesen  20   0  3800  392  336 R 16.6  0.0   1:47.63 cat
  2428 cfriesen  20   0  3800  392  336 R 16.4  0.0   1:48.65 cat



I haven't really dug into the scheduler yet (although that's next), but 
based on these results it doesn't really look like the load balancer is 
properly group-aware.

Chris


* Re: fair group scheduler not so fair?
  2008-05-22 20:02   ` Chris Friesen
@ 2008-05-22 20:07     ` Peter Zijlstra
  2008-05-22 20:18       ` Li, Tong N
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-22 20:07 UTC (permalink / raw)
  To: Chris Friesen; +Cc: linux-kernel, vatsa, mingo, pj

On Thu, 2008-05-22 at 14:02 -0600, Chris Friesen wrote:

> I haven't really dug into the scheduler yet (although that's next), but 
> based on these results it doesn't really look like the load balancer is 
> properly group-aware.

I'm trying, but it's turning out to be rather more difficult than
expected. Any help here would be much appreciated.



* RE: fair group scheduler not so fair?
  2008-05-22 20:07     ` Peter Zijlstra
@ 2008-05-22 20:18       ` Li, Tong N
  2008-05-22 21:13         ` Peter Zijlstra
  2008-05-23  9:42         ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 26+ messages in thread
From: Li, Tong N @ 2008-05-22 20:18 UTC (permalink / raw)
  To: Peter Zijlstra, Chris Friesen; +Cc: linux-kernel, vatsa, mingo, pj

Peter,

I didn't look at your patches, but I thought you were flattening group
weights down to task-level so that the scheduler only looks at per-task
weights. That'd make group fairness as good as task fairness gets. Is
this still the case?

  tong

>-----Original Message-----
>From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Peter Zijlstra
>Sent: Thursday, May 22, 2008 1:08 PM
>To: Chris Friesen
>Cc: linux-kernel@vger.kernel.org; vatsa@linux.vnet.ibm.com; mingo@elte.hu; pj@sgi.com
>Subject: Re: fair group scheduler not so fair?
>
>On Thu, 2008-05-22 at 14:02 -0600, Chris Friesen wrote:
>
>> I haven't really dug into the scheduler yet (although that's next), but
>> based on these results it doesn't really look like the load balancer is
>> properly group-aware.
>
>I'm trying, but it's turning out to be rather more difficult than
>expected. Any help here would be much appreciated.


* RE: fair group scheduler not so fair?
  2008-05-22 20:18       ` Li, Tong N
@ 2008-05-22 21:13         ` Peter Zijlstra
  2008-05-23  0:17           ` Chris Friesen
  2008-05-23  9:42         ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-22 21:13 UTC (permalink / raw)
  To: Li, Tong N; +Cc: Chris Friesen, linux-kernel, vatsa, mingo, pj

On Thu, 2008-05-22 at 13:18 -0700, Li, Tong N wrote:
> Peter,
> 
> I didn't look at your patches, but I thought you were flattening group
> weights down to task-level so that the scheduler only looks at per-task
> weights. That'd make group fairness as good as task fairness gets. Is
> this still the case?

We still have hierarchical runqueues; getting rid of those is another
tree I'm working on, which has an EEVDF-based rq scheduler.

For load balancing purposes we are indeed projecting everything to a
flat level.

A rather quick description of what we do:


We'll have:

  task-weight     - the weight for a task
  group-weight    - the weight for a group (same units as for tasks)
  group-shares    - the weight for a group on a particular cpu
  runqueue-weight - the sum of weights

we compute group-shares as:

  s_(i,g) = W_g * rw_(i,g) / \Sum_j rw_(j,g)
  
s_(i,g)  := group g's shares for cpu i
W_g      := group g's weight
rw_(i,g) := group g's runqueue weight for cpu i

(all for a given group)

We compute these shares while walking the task group tree bottom up,
since the shares for a child's group will affect the runqueue weight for
its parent.

We then select the busiest runqueue from the available set solely based
on top level runqueue weight (since that accurately reflects all the
child group weights after updating the shares).

We compute an imbalance between this rq and the busiest rq in top-level
weight units.

Then, for this busiest cpu we compute the hierarchical load for each
group:

 h_load_(i,g) = rw_(i,0) \Prod_{l=1} s_(i,l)/rw_(i,{l-1})

Where l iterates over the tree levels (not the cpus).

h_load represents the full weight of the group as seen from the top
level. This is used to convert the weight of each moved task to top
weight, and we'll keep on moving tasks until the imbalance is satisfied.


Given the following:

      root
     / | \
   _A_ 1  2
  /| |\
 3 4 5 B
      / \
     6   7

     CPU0            CPU1
     root            root
     /  \            /  \
    A    1          A    2
   / \             / \
  4   B           3   5
     / \
    6   7


Numerical examples given the above scenario, assuming everybody's
weight is 1024:

 s_(0,A) = s_(1,A) = 512
 s_(0,B) = 1024, s_(1,B) = 0

 rw_(0,A) = rw(1,A) = 2048
 rw_(0,B) = 2048, rw_(1,B) = 0

 h_load_(0,A) = h_load_(1,A) = 512
 h_load_(0,B) = 256, h_load(1,B) = 0






* Re: fair group scheduler not so fair?
  2008-05-22 21:13         ` Peter Zijlstra
@ 2008-05-23  0:17           ` Chris Friesen
  2008-05-23  7:44             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Friesen @ 2008-05-23  0:17 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Li, Tong N, linux-kernel, vatsa, mingo, pj

Peter Zijlstra wrote:

> Given the following:
> 
>       root
>      / | \
>    _A_ 1  2
>   /| |\
>  3 4 5 B
>       / \
>      6   7
> 
>      CPU0            CPU1
>      root            root
>      /  \            /  \
>     A    1          A    2
>    / \             / \
>   4   B           3   5
>      / \
>     6   7

How do you move specific groups to different cpus.  Is this simply using 
cpusets?

> 
> Numerical examples given the above scenario, assuming everybody's
> weight is 1024:

>  s_(0,A) = s_(1,A) = 512

Just to make sure I understand what's going on...this is half of 1024 
because it shows up on both cpus?

>  s_(0,B) = 1024, s_(1,B) = 0

This gets the full 1024 because it's only on one cpu.

>  rw_(0,A) = rw(1,A) = 2048
>  rw_(0,B) = 2048, rw_(1,B) = 0

How do we get 2048?  Shouldn't this be 1024?

>  h_load_(0,A) = h_load_(1,A) = 512
>  h_load_(0,B) = 256, h_load(1,B) = 0

At this point the numbers make sense, but I'm not sure how the formula 
for h_load_ works given that I'm not sure what's going on for rw_.

Chris


* Re: fair group scheduler not so fair?
  2008-05-23  0:17           ` Chris Friesen
@ 2008-05-23  7:44             ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-23  7:44 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Peter Zijlstra, Li, Tong N, linux-kernel, mingo, pj

On Thu, May 22, 2008 at 06:17:37PM -0600, Chris Friesen wrote:
> Peter Zijlstra wrote:
>
>> Given the following:
>>       root
>>      / | \
>>    _A_ 1  2
>>   /| |\
>>  3 4 5 B
>>       / \
>>      6   7
>>      CPU0            CPU1
>>      root            root
>>      /  \            /  \
>>     A    1          A    2
>>    / \             / \
>>   4   B           3   5
>>      / \
>>     6   7
>
> How do you move specific groups to different cpus.  Is this simply using 
> cpusets?

No. Moving groups to different cpus is just a group-aware extension to
move_tasks() that is invoked as part of regular load balance operation.
move_tasks()->sched_fair_class.load_balance() has been modified to
understand how much various task-groups at various levels (ex: A at level 1,
B at level 2 etc) contribute to cpu load. It moves tasks between cpus
using this knowledge.

For example: if we were to consider all tasks shown above to be on the same
cpu, CPU0, this is how it would look:

	CPU0		CPU1
       root		root
      / | \
     A  1  2
   /| |\
  3 4 5 B
       / \
      6   7

Then cpu0 load = weight of A + weight of 1 + weight of 2
	       = 1024 + 1024 + 1024 = 3072

while cpu1 load = 0

load to be moved to cut down this imbalance = 3072/2 = 1536

move_tasks() running on CPU1 would iteratively try to pull tasks such
that the total weight moved is <= 1536.

	Task moved		Total Weight moved
	---------		------------
	    2			     1024
	    3			     1024 + 256 = 1280
	    5			     1280 + 256 = 1536

resulting in:

      CPU0            CPU1
      root            root
      /  \            /  \
     A    1          A    2
    / \             / \
   4   B           3   5
      / \
     6   7

>> Numerical examples given the above scenario, assuming everybody's
>> weight is 1024:
>
>>  s_(0,A) = s_(1,A) = 512
>
> Just to make sure I understand what's going on...this is half of 1024 
> because it shows up on both cpus?

Not exactly. As Peter put it:

  s_(i,g) = W_g * rw_(i,g) / \Sum_j rw_(j,g)

In this case, 

  s_(0,A) = W_A * rw_(0, A) / \Sum_j rw_(j, A)

W_A = shares given to A by admin = 1024

rw_(0,A) = Weight of 4 + Weight of B = 1024 + 1024 = 2048
rw_(1,A) = Weight of 3 + Weight of 5 = 1024 + 1024 = 2048
\Sum_j rw_(j, A) = 4096

So,

  s_(0,A) = 1024 *2048 / 4096 = 512


>>  s_(0,B) = 1024, s_(1,B) = 0
>
> This gets the full 1024 because it's only on one cpu.

Not exactly. rw_(0, B) = \Sum_j rw_(j, B) and that's why s_(0,B) = 1024

>>  rw_(0,A) = rw(1,A) = 2048
>>  rw_(0,B) = 2048, rw_(1,B) = 0
>
> How do we get 2048?  Shouldn't this be 1024?

Hope this is clarified from above.

>>  h_load_(0,A) = h_load_(1,A) = 512
>>  h_load_(0,B) = 256, h_load(1,B) = 0
>
> At this point the numbers make sense, but I'm not sure how the formula for 
> h_load_ works given that I'm not sure what's going on for rw_.

-- 
Regards,
vatsa


* Re: fair group scheduler not so fair?
  2008-05-23  9:42         ` Srivatsa Vaddagiri
@ 2008-05-23  9:39           ` Peter Zijlstra
  2008-05-23 10:19             ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-23  9:39 UTC (permalink / raw)
  To: vatsa; +Cc: Li, Tong N, Chris Friesen, linux-kernel, mingo, pj

On Fri, 2008-05-23 at 15:12 +0530, Srivatsa Vaddagiri wrote:
> On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> > Peter,
> > 
> > I didn't look at your patches, but I thought you were flattening group
> > weights down to task-level so that the scheduler only looks at per-task
> > weights.
> 
> Wouldn't that require task weight readjustment upon every fork/exit?

If you were to do that - yes that would get you into some very serious
trouble.

The route I've chosen is to basically recompute it every time I need the
weight. So every time I use a weight, I do:

  \Prod_{l=1} w_l/rw_{l-1}

Not doing that will get you O(n) recomputes in all sorts of situations.




* Re: fair group scheduler not so fair?
  2008-05-22 20:18       ` Li, Tong N
  2008-05-22 21:13         ` Peter Zijlstra
@ 2008-05-23  9:42         ` Srivatsa Vaddagiri
  2008-05-23  9:39           ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-23  9:42 UTC (permalink / raw)
  To: Li, Tong N; +Cc: Peter Zijlstra, Chris Friesen, linux-kernel, mingo, pj

On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> Peter,
> 
> I didn't look at your patches, but I thought you were flattening group
> weights down to task-level so that the scheduler only looks at per-task
> weights.

Wouldn't that require task weight readjustment upon every fork/exit?

> That'd make group fairness as good as task fairness gets. Is
> this still the case?

-- 
Regards,
vatsa


* Re: fair group scheduler not so fair?
  2008-05-23 10:19             ` Srivatsa Vaddagiri
@ 2008-05-23 10:16               ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-23 10:16 UTC (permalink / raw)
  To: vatsa; +Cc: Li, Tong N, Chris Friesen, linux-kernel, mingo, pj

On Fri, 2008-05-23 at 15:49 +0530, Srivatsa Vaddagiri wrote:
> On Fri, May 23, 2008 at 11:39:21AM +0200, Peter Zijlstra wrote:
> > On Fri, 2008-05-23 at 15:12 +0530, Srivatsa Vaddagiri wrote:
> > > On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> > > > Peter,
> > > > 
> > > > I didn't look at your patches, but I thought you were flattening group
> > > > weights down to task-level so that the scheduler only looks at per-task
> > > > weights.
> > > 
> > > Wouldn't that require task weight readjustment upon every fork/exit?
> > 
> > If you were to do that - yes that would get you into some very serious
> > trouble.
> > 
> > The route I've chosen is to basically recompute it every time I need the
> > weight. So every time I use a weight, I do:
> 
> and which are those points when "you need the weight"?

E.g., when computing the vruntime gain for this task, or when calculating
the slice.

Also when load-balancing, we do a similar thing through the h_load
stuff.

> Basically here's what I had in mind:
> 
> 	Group A shares = 1024
> 	# of tasks in group A = 1 (T0)
> 
> So T0 weight can be 1024.
> 
> 	T0 now forks 1000 children. Ideally now,
> 		T0.weight = T1.weight = .... = T999.weight = 1024/1000
> 
> If we don't change each task's weight like this, then group A will
> cumulatively get more share than it deserves.
> 
> Are you saying you will change each task's weight lazily? If so how?

No, we leave it alone, but calculate the effective weight as seen from
the top every time we apply a task's weight to compute variables.

Basically all the calc_delta_*() stuff and the load balancer stuff.



* Re: fair group scheduler not so fair?
  2008-05-23  9:39           ` Peter Zijlstra
@ 2008-05-23 10:19             ` Srivatsa Vaddagiri
  2008-05-23 10:16               ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-23 10:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Li, Tong N, Chris Friesen, linux-kernel, mingo, pj

On Fri, May 23, 2008 at 11:39:21AM +0200, Peter Zijlstra wrote:
> On Fri, 2008-05-23 at 15:12 +0530, Srivatsa Vaddagiri wrote:
> > On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> > > Peter,
> > > 
> > > I didn't look at your patches, but I thought you were flattening group
> > > weights down to task-level so that the scheduler only looks at per-task
> > > weights.
> > 
> > Wouldn't that require task weight readjustment upon every fork/exit?
> 
> If you were to do that - yes that would get you into some very serious
> trouble.
> 
> The route I've chosen is to basically recompute it every time I need the
> weight. So every time I use a weight, I do:

and which are those points when "you need the weight"?

Basically here's what I had in mind:

	Group A shares = 1024
	# of tasks in group A = 1 (T0)

So T0 weight can be 1024.

	T0 now forks 1000 children. Ideally now,
		T0.weight = T1.weight = .... = T999.weight = 1024/1000

If we don't change each task's weight like this, then group A will
cumulatively get more share than it deserves.

Are you saying you will change each task's weight lazily? If so how?


-- 
Regards,
vatsa


* Re: fair group scheduler not so fair?
  2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
  2008-05-22  6:56 ` Peter Zijlstra
@ 2008-05-27 17:15 ` Srivatsa Vaddagiri
  2008-05-27 18:13   ` Chris Friesen
  2008-05-27 17:28 ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-27 17:15 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

On Wed, May 21, 2008 at 05:59:22PM -0600, Chris Friesen wrote:
> I just downloaded the current git head and started playing with the fair 
> group scheduler.  (This is on a dual cpu Mac G5.)
>
> I created two groups, "a" and "b".  Each of them was left with the default 
> share of 1024.
>
> I created three cpu hogs by doing "cat /dev/zero > /dev/null".  One hog 
> (pid 2435) was put into group "a", while the other two were put into group 
> "b".
>
> After giving them time to settle down, "top" showed the following:
>
> 2438 cfriesen  20   0  3800  392  336 R 99.5  0.0   4:02.82 cat 
> 2435 cfriesen  20   0  3800  392  336 R 65.9  0.0   3:30.94 cat 
> 2437 cfriesen  20   0  3800  392  336 R 34.3  0.0   3:14.89 cat 
>
>
> Where pid 2435 should have gotten a whole cpu worth of time, it actually 
> only got 66% of a cpu. Is this expected behaviour?

Definitely not expected behavior, and I think I understand why this is
happening.

But first, note that groups "a" and "b" share bandwidth with all tasks
in /dev/cgroup/tasks. Let's say that /dev/cgroup/tasks has T0-T1,
/dev/cgroup/a/tasks has TA1, while /dev/cgroup/b/tasks has TB1 (all tasks
of weight 1024).

Then TA1 is expected to get 1/(1+1+2) = 25% bandwidth

Similarly T0, T1, TB1 all get 25% bandwidth.

IOW, Groups "a" and "b" are peers of each task in /dev/cgroup/tasks.

Having said that, here's what I do for my testing:

	# mkdir /cgroup
	# mount -t cgroup -ocpu none /cgroup
	# cd /cgroup

	# #Move all tasks to 'sys' group and give it low shares
	# mkdir sys
	# cd sys
	# for i in `cat ../tasks`
	  do
		echo $i > tasks
	  done
	# echo 100 > cpu.shares

	# mkdir a
	# mkdir b

	# echo <pid> > a/tasks
	..

Now, why did Group "a" get less than what it deserved? Here's what was
happening:

	CPU0		CPU1

	  a0		b0
	  b1

cpu0.load = 1024 (Grp a load) + 512 (Grp b load)
cpu1.load = 512 (Grp b load)

imbalance = 1024

max_load_move = 512 (to equalize load)

load_balance_fair() is invoked on CPU1 with this max_load_move target of 512. 
Ideally it can move b1 to CPU1, which would attain perfect balance. This
does not happen because:

load_balance_fair() iterates through the task groups in the order they
were created. So it first examines what tasks it can pull from Group "a".

	It invokes __load_balance_fair() to see if it can pull any tasks
	worth max weight 512 (rem_load). Ideally, since a0's weight is
	1024, it should not pull a0. However, balance_tasks() is eager
	to pull at least one task (because of SCHED_LOAD_SCALE_FUZZ) and
	ends up pulling a0. This results in more load being moved (1024)
	than the required target.

Next, when CPU0 tries pulling load of 512, it ends up pulling a0 again.

Thus a0 ping-pongs between the two CPUs.


The following experimental patch (on top of 2.6.26-rc3 +
http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/) seems 
to fix the problem.

Note that this works only when /dev/cgroup/sys/cpu.shares = 100 (or some
similarly low number). Otherwise top (or whatever command you run to observe
load distribution) contributes some load to the /dev/cgroup/sys group, which
skews the results. IMHO, find_busiest_group() needs to use cpu utilization,
rather than task/group load, as the metric for balancing across CPUs.

Can you check if this makes a difference for you as well?


Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
 include/linux/sched.h |    4 ++++
 init/Kconfig          |    2 +-
 kernel/sched.c        |    5 ++++-
 kernel/sched_debug.c  |    2 +-
 4 files changed, 10 insertions(+), 3 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -698,7 +698,11 @@ enum cpu_idle_type {
 #define SCHED_LOAD_SHIFT	10
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define SCHED_LOAD_SCALE_FUZZ	0
+#else
 #define SCHED_LOAD_SCALE_FUZZ	SCHED_LOAD_SCALE
+#endif
 
 #ifdef CONFIG_SMP
 #define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
 	  See Documentation/sched-rt-group.txt for more information.
 
 choice
-	depends on GROUP_SCHED
+	depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
 	prompt "Basis for grouping tasks"
 	default USER_SCHED
 
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int 
 	unsigned long shares = 0;
 	int i;
 
+	if (!tg->parent)
+		return;
+
 	for_each_cpu_mask(i, sd->span) {
 		rq_weight += tg->cfs_rq[i]->load.weight;
 		shares += tg->cfs_rq[i]->shares;
@@ -2919,7 +2922,7 @@ next:
 	 * skip a task if it will be the highest priority task (i.e. smallest
 	 * prio value) on its new queue regardless of its load weight
 	 */
-	skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+	skip_for_load = (p->se.load.weight >> 1) >= rem_load_move +
 							 SCHED_LOAD_SCALE_FUZZ;
 	if ((skip_for_load && p->prio >= *this_best_prio) ||
 	    !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
 	struct sched_entity *last;
 	unsigned long flags;
 
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
 #else
 	char path[128] = "";


>
>
>
> I then redid the test with two hogs in one group and three hogs in the 
> other group.  Unfortunately, the cpu shares were not equally distributed 
> within each group.  Using a 10-sec interval in "top", I got the following:
>
>
> 2522 cfriesen  20   0  3800  392  336 R 52.2  0.0   1:33.38 cat 
> 2523 cfriesen  20   0  3800  392  336 R 48.9  0.0   1:37.85 cat 
> 2524 cfriesen  20   0  3800  392  336 R 37.0  0.0   1:23.22 cat 
> 2525 cfriesen  20   0  3800  392  336 R 32.6  0.0   1:22.62 cat 
> 2559 cfriesen  20   0  3800  392  336 R 28.7  0.0   0:24.30 cat 
>
> Do we expect to see upwards of 9% relative unfairness between processes 
> within a class?
>
> I tried messing with the tuneables in /proc/sys/kernel (sched_latency_ns, 
> sched_migration_cost, sched_min_granularity_ns) but was unable to 
> significantly improve these results.
>
> Any pointers would be appreciated.
>
> Thanks,
>
> Chris

-- 
Regards,
vatsa


* Re: fair group scheduler not so fair?
  2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
  2008-05-22  6:56 ` Peter Zijlstra
  2008-05-27 17:15 ` Srivatsa Vaddagiri
@ 2008-05-27 17:28 ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-27 17:28 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, dhaval, Balbir Singh,
	aneesh.kumar

On Wed, May 21, 2008 at 05:59:22PM -0600, Chris Friesen wrote:
> I then redid the test with two hogs in one group and three hogs in the 
> other group.  Unfortunately, the cpu shares were not equally distributed 
> within each group.  Using a 10-sec interval in "top", I got the following:

I ran with this combination (2 in Group a and 3 in Group b) on top of the 
experimental patch I sent and here's what I get:

 4350 root      20   0  1384  228  176 R 53.8  0.0  52:27.54  1 hoga
 4542 root      20   0  1384  228  176 R 49.3  0.0   3:39.76  0 hoga
 4352 root      20   0  1384  232  176 R 36.0  0.0  26:53.50  1 hogb
 4351 root      20   0  1384  228  176 R 32.0  0.0  26:47.54  0 hogb
 4543 root      20   0  1384  232  176 R 29.0  0.0   2:03.62  0 hogb

Note that fairness (using the load-balance approach we have currently) works
over a long window; usually I observe with "top -d30". The higher the
asymmetry of the task-load distribution, the longer it takes to converge to
fairness.

-- 
Regards,
vatsa


* Re: fair group scheduler not so fair?
  2008-05-27 17:15 ` Srivatsa Vaddagiri
@ 2008-05-27 18:13   ` Chris Friesen
  2008-05-28 16:33     ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Friesen @ 2008-05-27 18:13 UTC (permalink / raw)
  To: vatsa
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

Srivatsa Vaddagiri wrote:

> But first, note that Groups "a" and "b" share bandwidth with all tasks
> in /dev/cgroup/tasks.

Ah, good point.  I've switched over to your group setup for testing.

> The following experimental patch (on top of 2.6.26-rc3 +
> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/) seems 
> to fix the problem.

> Can you check if this makes a difference for you as well?

Initially it looked promising.  I put pid 2498 in group A, and pids 2499 
and 2500 in group B.  2498 got basically a full cpu, and the other two 
got 50% each.

However, I then moved pid 2499 from group B to group A, and the system 
got stuck in the following behaviour:

2498 cfriesen  20   0  3800  392  336 R 99.7  0.0   3:00.22 cat
2500 cfriesen  20   0  3800  392  336 R 66.7  0.0   1:39.10 cat
2499 cfriesen  20   0  3800  392  336 R 33.0  0.0   1:24.31 cat

I reproduced this a number of times.

Chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-27 18:13   ` Chris Friesen
@ 2008-05-28 16:33     ` Srivatsa Vaddagiri
  2008-05-28 18:35       ` Chris Friesen
  0 siblings, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-28 16:33 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

On Tue, May 27, 2008 at 12:13:46PM -0600, Chris Friesen wrote:
>> Can you check if this makes a difference for you as well?
>
> Initially it looked promising.  I put pid 2498 in group A, and pids 2499 
> and 2500 in group B.  2498 got basically a full cpu, and the other two got 
> 50% each.
>
> However, I then moved pid 2499 from group B to group A, and the system got 
> stuck in the following behaviour:
>
> 2498 cfriesen  20   0  3800  392  336 R 99.7  0.0   3:00.22 cat
> 2500 cfriesen  20   0  3800  392  336 R 66.7  0.0   1:39.10 cat
> 2499 cfriesen  20   0  3800  392  336 R 33.0  0.0   1:24.31 cat
>
> I reproduced this a number of times.

Thanks for trying this combination. I discovered a task-leak in this loop
(__load_balance_iterator):

        /* Skip over entities that are not tasks */
        do {
                se = list_entry(next, struct sched_entity, group_node);
                next = next->next;
        } while (next != &cfs_rq->tasks && !entity_is_task(se));

        if (next == &cfs_rq->tasks)
                return NULL;

We seem to be skipping the last element in the task list always. In your
case, the lone task in Group a/b is always skipped because of this.

The following hunk seems to fix this:

@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
 		next = next->next;
 	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	if (next == &cfs_rq->tasks)
-		return NULL;
-
 	cfs_rq->balance_iterator = next;
 
 	if (entity_is_task(se))


Updated patch (on top of 2.6.26-rc3 +
http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
below.  Pls let me know how it fares!


---
 include/linux/sched.h |    4 ++++
 init/Kconfig          |    2 +-
 kernel/sched.c        |    5 ++++-
 kernel/sched_debug.c  |    3 ++-
 kernel/sched_fair.c   |    3 ---
 5 files changed, 11 insertions(+), 6 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -698,7 +698,11 @@ enum cpu_idle_type {
 #define SCHED_LOAD_SHIFT	10
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define SCHED_LOAD_SCALE_FUZZ	0
+#else
 #define SCHED_LOAD_SCALE_FUZZ	SCHED_LOAD_SCALE
+#endif
 
 #ifdef CONFIG_SMP
 #define SD_LOAD_BALANCE		1	/* Do load balancing on this domain. */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
 	  See Documentation/sched-rt-group.txt for more information.
 
 choice
-	depends on GROUP_SCHED
+	depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
 	prompt "Basis for grouping tasks"
 	default USER_SCHED
 
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int 
 	unsigned long shares = 0;
 	int i;
 
+	if (!tg->parent)
+		return;
+
 	for_each_cpu_mask(i, sd->span) {
 		rq_weight += tg->cfs_rq[i]->load.weight;
 		shares += tg->cfs_rq[i]->shares;
@@ -2919,7 +2922,7 @@ next:
 	 * skip a task if it will be the highest priority task (i.e. smallest
 	 * prio value) on its new queue regardless of its load weight
 	 */
-	skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+	skip_for_load = (p->se.load.weight >> 1) >= rem_load_move +
 							 SCHED_LOAD_SCALE_FUZZ;
 	if ((skip_for_load && p->prio >= *this_best_prio) ||
 	    !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
 	struct sched_entity *last;
 	unsigned long flags;
 
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
 #else
 	char path[128] = "";
@@ -170,6 +170,7 @@ void print_cfs_rq(struct seq_file *m, in
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
+	SEQ_printf(m, "  .%-30s: %lu\n", "h_load", cfs_rq->h_load);
 #endif
 #endif
 }
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
 		next = next->next;
 	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	if (next == &cfs_rq->tasks)
-		return NULL;
-
 	cfs_rq->balance_iterator = next;
 
 	if (entity_is_task(se))


-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-28 16:33     ` Srivatsa Vaddagiri
@ 2008-05-28 18:35       ` Chris Friesen
  2008-05-28 18:47         ` Dhaval Giani
                           ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Chris Friesen @ 2008-05-28 18:35 UTC (permalink / raw)
  To: vatsa
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

Srivatsa Vaddagiri wrote:

> We seem to be skipping the last element in the task list always. In your
> case, the lone task in Group a/b is always skipped because of this.

> Updated patch (on top of 2.6.26-rc3 +
> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
> below.  Pls let me know how it fares!

Looking much better, but still some fairness issues with more complex 
setups.

pid 2477 in A, others in B
2477	99.5%
2478	49.9%
2479	49.9%

move 2478 to A
2479	99.9%
2477	49.9%
2478	49.9%

So far so good.  I then created C, and moved 2478 to it.  A 3-second 
"top" gave almost a 15% error from the desired behaviour for one group:

2479	76.2%
2477	72.2%
2478	51.0%


A 10-sec average was better, but we still see errors of 6%:
2478	72.8%
2477	64.0%
2479	63.2%


I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C.  A 
10-second "top" gave errors of up to 6.5%:
2500	60.1%
2491	37.5%
2492	37.4%
2489	25.0%
2488	19.9%
2490	19.9%

a re-test gave errors of up to 8.1%:

2534	74.8%
2533	30.1%
2532	30.0%
2529	25.0%
2530	20.0%
2531	20.0%

Another retest gave perfect results initially:

2559	66.5%
2560	33.4%
2561	33.3%
2564	22.3%
2562	22.2%
2563	22.1%

but moving 2564 from group A to C and then back to A disturbed the 
perfect division of time and resulted in almost the same utilization 
pattern as above:

2559	74.9%
2560	30.0%
2561	29.6%
2564	25.3%
2562	20.0%
2563	20.0%

It looks like perfect balancing is a metastable state where it can stay 
happily for some time, but any small disturbance may be enough to kick 
it over into a more stable but incorrect state.  Once we get into such 
an incorrect division of time, it appears very difficult to return to 
perfect balancing.

Chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-28 18:35       ` Chris Friesen
@ 2008-05-28 18:47         ` Dhaval Giani
  2008-05-29  2:50         ` Srivatsa Vaddagiri
  2008-05-29 16:46         ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 26+ messages in thread
From: Dhaval Giani @ 2008-05-28 18:47 UTC (permalink / raw)
  To: Chris Friesen
  Cc: vatsa, linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh,
	aneesh.kumar

On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
>
>> We seem to be skipping the last element in the task list always. In your
>> case, the lone task in Group a/b is always skipped because of this.
>
>> Updated patch (on top of 2.6.26-rc3 +
>> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
>> below.  Pls let me know how it fares!
>
> Looking much better, but still some fairness issues with more complex 
> setups.
>
> pid 2477 in A, others in B
> 2477	99.5%
> 2478	49.9%
> 2479	49.9%
>
> move 2478 to A
> 2479	99.9%
> 2477	49.9%
> 2478	49.9%
>
> So far so good.  I then created C, and moved 2478 to it.  A 3-second "top" 
> gave almost a 15% error from the desired behaviour for one group:
>
> 2479	76.2%
> 2477	72.2%
> 2478	51.0%
>
>
> A 10-sec average was better, but we still see errors of 6%:

So it is converging to a fair state. How does it look
across, say, 20 or 30 seconds on your side?

> 2478	72.8%
> 2477	64.0%
> 2479	63.2%
>
>
> I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C.  A 
> 10-second "top" gave errors of up to 6.5%:
> 2500	60.1%
> 2491	37.5%
> 2492	37.4%
> 2489	25.0%
> 2488	19.9%
> 2490	19.9%
>
> a re-test gave errors of up to 8.1%:
>
> 2534	74.8%
> 2533	30.1%
> 2532	30.0%
> 2529	25.0%
> 2530	20.0%
> 2531	20.0%
>
> Another retest gave perfect results initially:
>
> 2559	66.5%
> 2560	33.4%
> 2561	33.3%
> 2564	22.3%
> 2562	22.2%
> 2563	22.1%
>
> but moving 2564 from group A to C and then back to A disturbed the perfect 
> division of time and resulted in almost the same utilization pattern as 
> above:
>
> 2559	74.9%
> 2560	30.0%
> 2561	29.6%
> 2564	25.3%
> 2562	20.0%
> 2563	20.0%
>

This is over a longer duration or a 10 second duration?

-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-28 18:35       ` Chris Friesen
  2008-05-28 18:47         ` Dhaval Giani
@ 2008-05-29  2:50         ` Srivatsa Vaddagiri
  2008-05-29 16:46         ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-29  2:50 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
>
>> We seem to be skipping the last element in the task list always. In your
>> case, the lone task in Group a/b is always skipped because of this.
>
>> Updated patch (on top of 2.6.26-rc3 +
>> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
>> below.  Pls let me know how it fares!
>
> Looking much better, but still some fairness issues with more complex 
> setups.
>
[snip]

> A 10-sec average was better, but we still see errors of 6%:
> 2478	72.8%
> 2477	64.0%
> 2479	63.2%

Yes, I have observed that the delta from ideal fairness is much more than
desirable (although I used to get much better fairness with
2.6.25-rc1; let me verify and get back).

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-28 18:35       ` Chris Friesen
  2008-05-28 18:47         ` Dhaval Giani
  2008-05-29  2:50         ` Srivatsa Vaddagiri
@ 2008-05-29 16:46         ` Srivatsa Vaddagiri
  2008-05-29 16:47           ` Srivatsa Vaddagiri
  2008-05-29 21:30           ` Chris Friesen
  2 siblings, 2 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-29 16:46 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Looking much better, but still some fairness issues with more complex 
> setups.
>
> pid 2477 in A, others in B
> 2477	99.5%
> 2478	49.9%
> 2479	49.9%
>
> move 2478 to A
> 2479	99.9%
> 2477	49.9%
> 2478	49.9%
>
> So far so good.  I then created C, and moved 2478 to it.  A 3-second "top" 
> gave almost a 15% error from the desired behaviour for one group:
>
> 2479	76.2%
> 2477	72.2%
> 2478	51.0%
>
>
> A 10-sec average was better, but we still see errors of 6%:
> 2478	72.8%
> 2477	64.0%
> 2479	63.2%

Found couple of issues:

	1. A minor bug in load_balance_fair() in calculation of moved_load:

		moved_load /= busiest_cfs_rq->load.weight + 1;

	   In place of busiest_cfs_rq->load.weight, the load before
	   moving tasks needs to be used. Fix in the updated patch below.

	2. We walk task groups sequentially in load_balance_fair() w/o
	   necessarily looking for the best group. This results in
	   load_balance_fair() picking a non-optimal group/tasks
	   to pull. I have a hack below (strict = 1/0) to rectify this
	   problem, but we need a better algo to pick the best group
	   from which to pull tasks.

	3. sd->imbalance_pct (default = 125) specifies how much imbalance we 
	   tolerate. The lower the value, the better the fairness. To check 
	   this, I changed the default to 105, which is giving me better
	   results.

With the updated patch and imbalance_pct = 105, here's how my 60-sec avg
looks:

 4353 root      20   0  1384  232  176 R 67.0  0.0   2:47.23  1 hoga
 4354 root      20   0  1384  228  176 R 66.5  0.0   2:44.65  1 hogb
 4355 root      20   0  1384  228  176 R 66.3  0.0   2:28.18  0 hogb

Error is < 1%

> I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C.  A 
> 10-second "top" gave errors of up to 6.5%:
> 2500	60.1%
> 2491	37.5%
> 2492	37.4%
> 2489	25.0%
> 2488	19.9%
> 2490	19.9%
>
> a re-test gave errors of up to 8.1%:
>
> 2534	74.8%
> 2533	30.1%
> 2532	30.0%
> 2529	25.0%
> 2530	20.0%
> 2531	20.0%
>
> Another retest gave perfect results initially:
>
> 2559	66.5%
> 2560	33.4%
> 2561	33.3%
> 2564	22.3%
> 2562	22.2%
> 2563	22.1%
>
> but moving 2564 from group A to C and then back to A disturbed the perfect 
> division of time and resulted in almost the same utilization pattern as 
> above:
>
> 2559	74.9%
> 2560	30.0%
> 2561	29.6%
> 2564	25.3%
> 2562	20.0%
> 2563	20.0%

Again with the updated patch and changed imbalance_pct, here's what I see:

 4458 root      20   0  1384  232  176 R 66.3  0.0   2:11.04  0 hogc
 4457 root      20   0  1384  232  176 R 33.7  0.0   1:06.19  0 hogb
 4456 root      20   0  1384  232  176 R 33.4  0.0   1:06.59  0 hogb
 4455 root      20   0  1384  228  176 R 22.5  0.0   0:44.09  0 hoga
 4453 root      20   0  1384  232  176 R 22.3  0.0   0:44.10  1 hoga
 4454 root      20   0  1384  228  176 R 22.2  0.0   0:43.94  1 hoga

(Error < 1%)

In summary, can you do this before running your tests:

1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
(http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)

2. Setup test env as below:

	# mkdir /cgroup
	# mount -t cgroup -ocpu none /cgroup
	# cd /cgroup

	# #Move all tasks to 'sys' group and give it low shares
	# mkdir sys
	# cd sys
	# for i in `cat ../tasks`
	  do
		echo $i > tasks
	  done
	# echo 100 > cpu.shares

	# cd /proc/sys/kernel/sched_domain
	# for i in `find . -name imbalance_pct`; do echo 105 > $i; done


---
 init/Kconfig         |    2 +-
 kernel/sched.c       |   12 +++++++++---
 kernel/sched_debug.c |    3 ++-
 kernel/sched_fair.c  |   26 +++++++++++++++++---------
 4 files changed, 29 insertions(+), 14 deletions(-)

Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
 	  See Documentation/sched-rt-group.txt for more information.
 
 choice
-	depends on GROUP_SCHED
+	depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
 	prompt "Basis for grouping tasks"
 	default USER_SCHED
 
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1398,7 +1398,7 @@ static unsigned long
 balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	      unsigned long max_load_move, struct sched_domain *sd,
 	      enum cpu_idle_type idle, int *all_pinned,
-	      int *this_best_prio, struct rq_iterator *iterator);
+	      int *this_best_prio, struct rq_iterator *iterator, int strict);
 
 static int
 iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int 
 	unsigned long shares = 0;
 	int i;
 
+	if (!tg->parent)
+		return;
+
 	for_each_cpu_mask(i, sd->span) {
 		rq_weight += tg->cfs_rq[i]->load.weight;
 		shares += tg->cfs_rq[i]->shares;
@@ -2896,7 +2899,7 @@ static unsigned long
 balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	      unsigned long max_load_move, struct sched_domain *sd,
 	      enum cpu_idle_type idle, int *all_pinned,
-	      int *this_best_prio, struct rq_iterator *iterator)
+	      int *this_best_prio, struct rq_iterator *iterator, int strict)
 {
 	int loops = 0, pulled = 0, pinned = 0, skip_for_load;
 	struct task_struct *p;
@@ -2919,7 +2922,10 @@ next:
 	 * skip a task if it will be the highest priority task (i.e. smallest
 	 * prio value) on its new queue regardless of its load weight
 	 */
-	skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+	if (strict)
+		skip_for_load = p->se.load.weight >= rem_load_move;
+	else
+		skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
 							 SCHED_LOAD_SCALE_FUZZ;
 	if ((skip_for_load && p->prio >= *this_best_prio) ||
 	    !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
 	struct sched_entity *last;
 	unsigned long flags;
 
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
 #else
 	char path[128] = "";
@@ -170,6 +170,7 @@ void print_cfs_rq(struct seq_file *m, in
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
+	SEQ_printf(m, "  .%-30s: %lu\n", "h_load", cfs_rq->h_load);
 #endif
 #endif
 }
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
 		next = next->next;
 	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	if (next == &cfs_rq->tasks)
-		return NULL;
-
 	cfs_rq->balance_iterator = next;
 
 	if (entity_is_task(se))
@@ -1415,7 +1412,7 @@ static unsigned long
 __load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		unsigned long max_load_move, struct sched_domain *sd,
 		enum cpu_idle_type idle, int *all_pinned, int *this_best_prio,
-		struct cfs_rq *cfs_rq)
+		struct cfs_rq *cfs_rq, int strict)
 {
 	struct rq_iterator cfs_rq_iterator;
 
@@ -1425,10 +1422,11 @@ __load_balance_fair(struct rq *this_rq, 
 
 	return balance_tasks(this_rq, this_cpu, busiest,
 			max_load_move, sd, idle, all_pinned,
-			this_best_prio, &cfs_rq_iterator);
+			this_best_prio, &cfs_rq_iterator, strict);
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
@@ -1438,13 +1436,17 @@ load_balance_fair(struct rq *this_rq, in
 	long rem_load_move = max_load_move;
 	int busiest_cpu = cpu_of(busiest);
 	struct task_group *tg;
+	int strict = 1;
 
 	update_h_load(cpu_of(busiest));
 
 	rcu_read_lock();
+
+retry:
 	list_for_each_entry(tg, &task_groups, list) {
 		struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
 		long rem_load, moved_load;
+		unsigned long busiest_cfs_rq_load;
 
 		/*
 		 * empty group
@@ -1452,25 +1454,31 @@ load_balance_fair(struct rq *this_rq, in
 		if (!busiest_cfs_rq->task_weight)
 			continue;
 
-		rem_load = rem_load_move * busiest_cfs_rq->load.weight;
+		busiest_cfs_rq_load = busiest_cfs_rq->load.weight;
+		rem_load = rem_load_move * busiest_cfs_rq_load;
 		rem_load /= busiest_cfs_rq->h_load + 1;
 
 		moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
 				rem_load, sd, idle, all_pinned, this_best_prio,
-				tg->cfs_rq[busiest_cpu]);
+				tg->cfs_rq[busiest_cpu], strict);
 
 		if (!moved_load)
 			continue;
 
 		moved_load *= busiest_cfs_rq->h_load;
-		moved_load /= busiest_cfs_rq->load.weight + 1;
+		moved_load /= busiest_cfs_rq_load + 1;
 
 		rem_load_move -= moved_load;
 		if (rem_load_move < 0)
 			break;
 	}
+
+	if (rem_load_move && strict--)
+		goto retry;
+
 	rcu_read_unlock();
 
+
 	return max_load_move - rem_load_move;
 }
 #else
@@ -1482,7 +1490,7 @@ load_balance_fair(struct rq *this_rq, in
 {
 	return __load_balance_fair(this_rq, this_cpu, busiest,
 			max_load_move, sd, idle, all_pinned,
-			this_best_prio, &busiest->cfs);
+			this_best_prio, &busiest->cfs, 0);
 }
 #endif
 


-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-29 16:46         ` Srivatsa Vaddagiri
@ 2008-05-29 16:47           ` Srivatsa Vaddagiri
  2008-05-29 21:30           ` Chris Friesen
  1 sibling, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-29 16:47 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

On Thu, May 29, 2008 at 10:16:07PM +0530, Srivatsa Vaddagiri wrote:
> In summary, can you do this before running your tests:
> 
> 1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
> (http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)

It's quite a while since I downloaded Peter's patches and I hope they haven't
changed! Once the results start looking good, I will request Peter to
roll it up in his patch stack.

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-29 16:46         ` Srivatsa Vaddagiri
  2008-05-29 16:47           ` Srivatsa Vaddagiri
@ 2008-05-29 21:30           ` Chris Friesen
  2008-05-30  6:43             ` Dhaval Giani
  2008-05-30 11:36             ` Srivatsa Vaddagiri
  1 sibling, 2 replies; 26+ messages in thread
From: Chris Friesen @ 2008-05-29 21:30 UTC (permalink / raw)
  To: vatsa
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

Srivatsa Vaddagiri wrote:

> In summary, can you do this before running your tests:
> 
> 1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
> (http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)

I updated with the old set of patches you sent me, plus your patch.

> 2. Setup test env as below:

Done.

Overall the group scheduler results look better, but I'm seeing an odd 
scenario within a single group where sometimes I get a 67/67/66 
breakdown but sometimes it gives 100/50/50.

Also, although the long-term results are good, the shorter-term fairness 
isn't great.  Is there a tuneable that would allow for a tradeoff 
between performance and fairness?  I have people that are looking for 
within 4% fairness over a 1sec interval.


Initially I tried a simple setup with three hogs all in the default 
"sys" group.  Over multiple retries using 10-sec intervals, sometimes it 
gave roughly 67% for each task, other times it settled into a 100/50/50 
split that remained stable over time.

3 tasks in sys
  2471 cfriesen  20   0  3800  392  336 R 99.9  0.0   0:29.97 cat
  2470 cfriesen  20   0  3800  392  336 R 50.3  0.0   0:17.83 cat
  2469 cfriesen  20   0  3800  392  336 R 49.6  0.0   0:17.96 cat

retry
  2475 cfriesen  20   0  3800  392  336 R 68.3  0.0   0:28.46 cat
  2476 cfriesen  20   0  3800  392  336 R 67.3  0.0   0:28.24 cat
  2474 cfriesen  20   0  3800  392  336 R 64.3  0.0   0:28.73 cat

  2476 cfriesen  20   0  3800  392  336 R 67.1  0.0   0:41.79 cat
  2474 cfriesen  20   0  3800  392  336 R 66.6  0.0   0:41.96 cat
  2475 cfriesen  20   0  3800  392  336 R 66.1  0.0   0:41.67 cat

retry
  2490 cfriesen  20   0  3800  392  336 R 99.7  0.0   0:22.23 cat
  2489 cfriesen  20   0  3800  392  336 R 49.9  0.0   0:21.02 cat
  2491 cfriesen  20   0  3800  392  336 R 49.9  0.0   0:13.94 cat


With three groups, one task in each, I tried both 10 and 60 second 
intervals.  The longer interval looked better but was still up to 0.8% off:
10-sec
  2490 cfriesen  20   0  3800  392  336 R 68.9  0.0   1:35.13 cat
  2491 cfriesen  20   0  3800  392  336 R 65.8  0.0   1:04.65 cat
  2489 cfriesen  20   0  3800  392  336 R 64.5  0.0   1:26.48 cat

60-sec
  2490 cfriesen  20   0  3800  392  336 R 67.5  0.0   3:19.85 cat
  2491 cfriesen  20   0  3800  392  336 R 66.3  0.0   2:48.93 cat
  2489 cfriesen  20   0  3800  392  336 R 66.2  0.0   3:10.86 cat


Finally, a more complicated scenario: three tasks in A, two in B, and 
one in C.  The 60-sec trial was up to 0.8% off, while a 3-second trial 
(just for fun) was 8.5% off.

60-sec
2491 cfriesen  20   0  3800  392  336 R 65.9  0.0   5:06.69 cat
  2499 cfriesen  20   0  3800  392  336 R 33.6  0.0   0:55.35 cat
  2490 cfriesen  20   0  3800  392  336 R 33.5  0.0   4:47.94 cat
  2497 cfriesen  20   0  3800  392  336 R 22.6  0.0   0:38.76 cat
  2489 cfriesen  20   0  3800  392  336 R 22.2  0.0   4:28.03 cat
  2498 cfriesen  20   0  3800  392  336 R 22.2  0.0   0:35.13 cat

3-sec
2491 cfriesen  20   0  3800  392  336 R 58.2  0.0  13:29.60 cat
  2490 cfriesen  20   0  3800  392  336 R 34.8  0.0   9:07.73 cat
  2499 cfriesen  20   0  3800  392  336 R 31.0  0.0   5:15.69 cat
  2497 cfriesen  20   0  3800  392  336 R 29.4  0.0   3:37.25 cat
  2489 cfriesen  20   0  3800  392  336 R 23.3  0.0   7:26.25 cat
  2498 cfriesen  20   0  3800  392  336 R 23.0  0.0   3:33.24 cat


Chris

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-29 21:30           ` Chris Friesen
@ 2008-05-30  6:43             ` Dhaval Giani
  2008-05-30 10:21               ` Srivatsa Vaddagiri
  2008-05-30 11:36             ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 26+ messages in thread
From: Dhaval Giani @ 2008-05-30  6:43 UTC (permalink / raw)
  To: Chris Friesen
  Cc: vatsa, linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh,
	aneesh.kumar

> Also, although the long-term results are good, the shorter-term fairness 
> isn't great.  Is there a tuneable that would allow for a tradeoff between 
> performance and fairness?  I have people that are looking for within 4% 
> fairness over a 1sec interval.
>

How fair does SMP fairness look for a !group scenario? I don't expect
group scheduling to be able to do much better.

-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-30  6:43             ` Dhaval Giani
@ 2008-05-30 10:21               ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-30 10:21 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Chris Friesen, linux-kernel, mingo, a.p.zijlstra, pj,
	Balbir Singh, aneesh.kumar

On Fri, May 30, 2008 at 12:13:24PM +0530, Dhaval Giani wrote:
> > Also, although the long-term results are good, the shorter-term fairness 
> > isn't great.  Is there a tuneable that would allow for a tradeoff between 
> > performance and fairness?  I have people that are looking for within 4% 
> > fairness over a 1sec interval.
> >
> 
> How fair does SMP fairness look for a !group scenario? I don't expect
> group scheduling to be able to do much better.

Just tested this combo for the !group case:

	1 nice0 (weight = 1024)
	2 nice3 (each weight = 526)
	3 nice5 (each weight = 335)

You'd expect nice0 to get (on a 2 cpu system):

	2 * 1024 / (1024 + 2*526 + 3*335) = 66.47

This is what I see over a 10sec interval (error = 6%):

 4386 root      20   0  1384  228  176 R 60.4  0.0   3:06.75  1 nice0
 4387 root      23   3  1384  232  176 R 37.9  0.0   1:57.03  0 nice3
 4388 root      23   3  1384  228  176 R 37.9  0.0   1:57.24  0 nice3
 4390 root      25   5  1384  228  176 R 24.1  0.0   1:14.62  0 nice5
 4391 root      25   5  1384  228  176 R 19.8  0.0   1:01.26  1 nice5
 4389 root      25   5  1384  228  176 R 19.7  0.0   1:01.12  1 nice5

Over 120sec interval (error still as high as 6%):

 4386 root      20   0  1384  228  176 R 60.4  0.0   6:13.95  1 nice0
 4388 root      23   3  1384  228  176 R 37.9  0.0   3:54.69  0 nice3
 4387 root      23   3  1384  232  176 R 37.9  0.0   3:54.44  0 nice3
 4390 root      25   5  1384  228  176 R 24.2  0.0   2:29.45  0 nice5
 4391 root      25   5  1384  228  176 R 19.8  0.0   2:02.56  1 nice5
 4389 root      25   5  1384  228  176 R 19.8  0.0   2:02.44  1 nice5

The high error could be because of interference from other tasks. Anyway, I 
don't think the !group case is better at achieving fairness over shorter
intervals.


-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: fair group scheduler not so fair?
  2008-05-29 21:30           ` Chris Friesen
  2008-05-30  6:43             ` Dhaval Giani
@ 2008-05-30 11:36             ` Srivatsa Vaddagiri
  2008-06-02 20:03               ` Chris Friesen
  1 sibling, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-30 11:36 UTC (permalink / raw)
  To: Chris Friesen
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

On Thu, May 29, 2008 at 03:30:37PM -0600, Chris Friesen wrote:
> Overall the group scheduler results look better, but I'm seeing an odd 
> scenario within a single group where sometimes I get a 67/67/66 breakdown 
> but sometimes it gives 100/50/50.

Hmm.. I can't recreate this 100/50/50 situation (I tried about 10 times).

> Also, although the long-term results are good, the shorter-term fairness 
> isn't great.  Is there a tuneable that would allow for a tradeoff between 
> performance and fairness?

The tuneables I can think of are:

- HZ (the higher the better)
- min/max_interval and imbalance_pct for each domain (the lower the better)

> I have people that are looking for within 4% fairness over a 1sec interval.

That seems to be pretty difficult to achieve with the per-cpu runqueue
and smpnice based load balancing approach we have now.

> Initially I tried a simple setup with three hogs all in the default "sys" 
> group.  Over multiple retries using 10-sec intervals, sometimes it gave 
> roughly 67% for each task, other times it settled into a 100/50/50 split 
> that remained stable over time.

Was this with imbalance_pct set to 105? Does it make any difference if
you change imbalance_pct to say 102?

> 3 tasks in sys
>  2471 cfriesen  20   0  3800  392  336 R 99.9  0.0   0:29.97 cat
>  2470 cfriesen  20   0  3800  392  336 R 50.3  0.0   0:17.83 cat
>  2469 cfriesen  20   0  3800  392  336 R 49.6  0.0   0:17.96 cat
>
> retry
>  2475 cfriesen  20   0  3800  392  336 R 68.3  0.0   0:28.46 cat
>  2476 cfriesen  20   0  3800  392  336 R 67.3  0.0   0:28.24 cat
>  2474 cfriesen  20   0  3800  392  336 R 64.3  0.0   0:28.73 cat
>
>  2476 cfriesen  20   0  3800  392  336 R 67.1  0.0   0:41.79 cat
>  2474 cfriesen  20   0  3800  392  336 R 66.6  0.0   0:41.96 cat
>  2475 cfriesen  20   0  3800  392  336 R 66.1  0.0   0:41.67 cat
>
> retry
>  2490 cfriesen  20   0  3800  392  336 R 99.7  0.0   0:22.23 cat
>  2489 cfriesen  20   0  3800  392  336 R 49.9  0.0   0:21.02 cat
>  2491 cfriesen  20   0  3800  392  336 R 49.9  0.0   0:13.94 cat
>
>
> With three groups, one task in each, I tried both 10 and 60 second 
> intervals.  The longer interval looked better but was still up to 0.8% off:

I honestly don't know if we can do better than 0.8%! In any case, I'd
expect that it would require more drastic changes.

> 10-sec
>  2490 cfriesen  20   0  3800  392  336 R 68.9  0.0   1:35.13 cat
>  2491 cfriesen  20   0  3800  392  336 R 65.8  0.0   1:04.65 cat
>  2489 cfriesen  20   0  3800  392  336 R 64.5  0.0   1:26.48 cat
>
> 60-sec
>  2490 cfriesen  20   0  3800  392  336 R 67.5  0.0   3:19.85 cat
>  2491 cfriesen  20   0  3800  392  336 R 66.3  0.0   2:48.93 cat
>  2489 cfriesen  20   0  3800  392  336 R 66.2  0.0   3:10.86 cat
>
>
> Finally, a more complicated scenario.  three tasks in A, two in B, and one 
> in C.  The 60-sec trial was up to 0.8 off, while a 3-second trial (just for 
> fun) was 8.5% off.
>
> 60-sec
> 2491 cfriesen  20   0  3800  392  336 R 65.9  0.0   5:06.69 cat
>  2499 cfriesen  20   0  3800  392  336 R 33.6  0.0   0:55.35 cat
>  2490 cfriesen  20   0  3800  392  336 R 33.5  0.0   4:47.94 cat
>  2497 cfriesen  20   0  3800  392  336 R 22.6  0.0   0:38.76 cat
>  2489 cfriesen  20   0  3800  392  336 R 22.2  0.0   4:28.03 cat
>  2498 cfriesen  20   0  3800  392  336 R 22.2  0.0   0:35.13 cat
>
> 3-sec
> 2491 cfriesen  20   0  3800  392  336 R 58.2  0.0  13:29.60 cat
>  2490 cfriesen  20   0  3800  392  336 R 34.8  0.0   9:07.73 cat
>  2499 cfriesen  20   0  3800  392  336 R 31.0  0.0   5:15.69 cat
>  2497 cfriesen  20   0  3800  392  336 R 29.4  0.0   3:37.25 cat
>  2489 cfriesen  20   0  3800  392  336 R 23.3  0.0   7:26.25 cat
>  2498 cfriesen  20   0  3800  392  336 R 23.0  0.0   3:33.24 cat

I ran with this configuration:

	HZ = 1000
	min/max_interval = 1
	imbalance_pct = 102

My 10-sec fairness numbers look like this (error = 1.5%):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  #C COMMAND
 4549 root      20   0  1384  228  176 R 65.2  0.0   0:36.02  0 hogc
 4547 root      20   0  1384  228  176 R 32.8  0.0   0:17.87  0 hogb
 4548 root      20   0  1384  228  176 R 32.6  0.0   0:18.28  1 hogb
 4546 root      20   0  1384  232  176 R 22.9  0.0   0:11.82  1 hoga
 4545 root      20   0  1384  228  176 R 22.3  0.0   0:11.74  1 hoga
 4544 root      20   0  1384  232  176 R 22.1  0.0   0:11.93  1 hoga

3-sec fairness (error = 2.3%, though it sometimes went up to 6.7%):

 4549 root      20   0  1384  228  176 R 69.0  0.0   1:33.56  1 hogc
 4548 root      20   0  1384  228  176 R 32.7  0.0   0:46.74  1 hogb
 4547 root      20   0  1384  228  176 R 29.3  0.0   0:47.16  0 hogb
 4546 root      20   0  1384  232  176 R 22.3  0.0   0:30.80  0 hoga
 4544 root      20   0  1384  232  176 R 20.3  0.0   0:30.95  0 hoga
 4545 root      20   0  1384  228  176 R 19.4  0.0   0:31.17  0 hoga
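The error figures above can be reproduced from the top output. The exact
formula I used is not spelled out in the mail, so the sketch below assumes
"error" means the worst-case deviation of any task's observed %CPU from its
ideal share (2 CPUs split equally among groups a/b/c, then evenly among each
group's tasks):

```python
# Ideal per-task shares: 2 CPUs, 3 equal-weight groups, so each group
# gets 2/3 of a CPU, divided evenly among its tasks.
ncpus = 2
groups = {"hoga": 3, "hogb": 2, "hogc": 1}
ideal = {g: 100.0 * ncpus / len(groups) / n for g, n in groups.items()}
# ideal: hoga ~22.2%, hogb ~33.3%, hogc ~66.7% per task

# Observed %CPU from the 10-sec top sample above.
observed = {"hogc": [65.2], "hogb": [32.8, 32.6], "hoga": [22.9, 22.3, 22.1]}

# Worst-case absolute deviation from ideal, in percentage points of a CPU.
err = max(abs(pct - ideal[g]) for g, vals in observed.items() for pct in vals)
print("max deviation: %.1f%%" % err)  # prints: max deviation: 1.5%
```

Under that definition the 10-sec sample comes out to 1.5%, matching the
figure quoted above.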

-- 
Regards,
vatsa


* Re: fair group scheduler not so fair?
  2008-05-30 11:36             ` Srivatsa Vaddagiri
@ 2008-06-02 20:03               ` Chris Friesen
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Friesen @ 2008-06-02 20:03 UTC (permalink / raw)
  To: vatsa
  Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
	dhaval

Srivatsa Vaddagiri wrote:

> That seems to be pretty difficult to achieve with the per-cpu runqueue
> and smpnice based load balancing approach we have now.

Okay, thanks.

>>Initially I tried a simple setup with three hogs all in the default "sys" 
>>group.  Over multiple retries using 10-sec intervals, sometimes it gave 
>>roughly 67% for each task, other times it settled into a 100/50/50 split 
>>that remained stable over time.

> Was this with imbalance_pct set to 105? Does it make any difference if
> you change imbalance_pct to say 102?

It was set to 105 initially.  I later reproduced the problem with 102. 
For example, the following was with 102, with three tasks created in the 
sys class.  Based on the runtime, pid 2499 has been getting a cpu all to 
itself for over a minute.

  2499 cfriesen  20   0  3800  392  336 R 99.8  0.0   1:05.85 cat
  2496 cfriesen  20   0  3800  392  336 R 50.0  0.0   0:32.95 cat
  2498 cfriesen  20   0  3800  392  336 R 50.0  0.0   0:32.97 cat

The next run was much better, with sub-second fairness after a minute.

  2505 cfriesen  20   0  3800  392  336 R 68.2  0.0   1:00.32 cat
  2506 cfriesen  20   0  3800  392  336 R 66.9  0.0   0:59.85 cat
  2503 cfriesen  20   0  3800  392  336 R 64.2  0.0   1:00.21 cat

The lack of predictability is disturbing, as it implies some sensitivity 
to the specific test conditions.

>>With three groups, one task in each, I tried both 10 and 60 second 
>>intervals.  The longer interval looked better but was still up to 0.8% off:
> 
> 
> I honestly don't know if we can do better than 0.8%! In any case, I'd
> expect that it would require more drastic changes.

No problem.  It's still far superior to the SMP performance of CKRM,
which is what we're currently using (although heavily modified).

Chris



end of thread, other threads:[~2008-06-02 20:03 UTC | newest]

Thread overview: 26+ messages
2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
2008-05-22  6:56 ` Peter Zijlstra
2008-05-22 20:02   ` Chris Friesen
2008-05-22 20:07     ` Peter Zijlstra
2008-05-22 20:18       ` Li, Tong N
2008-05-22 21:13         ` Peter Zijlstra
2008-05-23  0:17           ` Chris Friesen
2008-05-23  7:44             ` Srivatsa Vaddagiri
2008-05-23  9:42         ` Srivatsa Vaddagiri
2008-05-23  9:39           ` Peter Zijlstra
2008-05-23 10:19             ` Srivatsa Vaddagiri
2008-05-23 10:16               ` Peter Zijlstra
2008-05-27 17:15 ` Srivatsa Vaddagiri
2008-05-27 18:13   ` Chris Friesen
2008-05-28 16:33     ` Srivatsa Vaddagiri
2008-05-28 18:35       ` Chris Friesen
2008-05-28 18:47         ` Dhaval Giani
2008-05-29  2:50         ` Srivatsa Vaddagiri
2008-05-29 16:46         ` Srivatsa Vaddagiri
2008-05-29 16:47           ` Srivatsa Vaddagiri
2008-05-29 21:30           ` Chris Friesen
2008-05-30  6:43             ` Dhaval Giani
2008-05-30 10:21               ` Srivatsa Vaddagiri
2008-05-30 11:36             ` Srivatsa Vaddagiri
2008-06-02 20:03               ` Chris Friesen
2008-05-27 17:28 ` Srivatsa Vaddagiri
