* fair group scheduler not so fair?
@ 2008-05-21 23:59 Chris Friesen
2008-05-22 6:56 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Chris Friesen @ 2008-05-21 23:59 UTC (permalink / raw)
To: linux-kernel, vatsa, mingo, a.p.zijlstra, pj
I just downloaded the current git head and started playing with the fair
group scheduler. (This is on a dual cpu Mac G5.)
I created two groups, "a" and "b". Each of them was left with the
default share of 1024.
I created three cpu hogs by doing "cat /dev/zero > /dev/null". One hog
(pid 2435) was put into group "a", while the other two were put into
group "b".
After giving them time to settle down, "top" showed the following:
2438 cfriesen 20 0 3800 392 336 R 99.5 0.0 4:02.82 cat
2435 cfriesen 20 0 3800 392 336 R 65.9 0.0 3:30.94 cat
2437 cfriesen 20 0 3800 392 336 R 34.3 0.0 3:14.89 cat
Where pid 2435 should have gotten a whole cpu worth of time, it actually
only got 66% of a cpu. Is this expected behaviour?
I then redid the test with two hogs in one group and three hogs in the
other group. Unfortunately, the cpu shares were not equally distributed
within each group. Using a 10-sec interval in "top", I got the following:
2522 cfriesen 20 0 3800 392 336 R 52.2 0.0 1:33.38 cat
2523 cfriesen 20 0 3800 392 336 R 48.9 0.0 1:37.85 cat
2524 cfriesen 20 0 3800 392 336 R 37.0 0.0 1:23.22 cat
2525 cfriesen 20 0 3800 392 336 R 32.6 0.0 1:22.62 cat
2559 cfriesen 20 0 3800 392 336 R 28.7 0.0 0:24.30 cat
Do we expect to see upwards of 9% relative unfairness between processes
within a class?
I tried messing with the tuneables in /proc/sys/kernel
(sched_latency_ns, sched_migration_cost, sched_min_granularity_ns) but
was unable to significantly improve these results.
Any pointers would be appreciated.
Thanks,
Chris
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: fair group scheduler not so fair?
2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
@ 2008-05-22 6:56 ` Peter Zijlstra
2008-05-22 20:02 ` Chris Friesen
2008-05-27 17:15 ` Srivatsa Vaddagiri
2008-05-27 17:28 ` Srivatsa Vaddagiri
2 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-22 6:56 UTC (permalink / raw)
To: Chris Friesen; +Cc: linux-kernel, vatsa, mingo, pj
On Wed, 2008-05-21 at 17:59 -0600, Chris Friesen wrote:
> I just downloaded the current git head and started playing with the fair
> group scheduler. (This is on a dual cpu Mac G5.)
>
> I created two groups, "a" and "b". Each of them was left with the
> default share of 1024.
>
> I created three cpu hogs by doing "cat /dev/zero > /dev/null". One hog
> (pid 2435) was put into group "a", while the other two were put into
> group "b".
>
> After giving them time to settle down, "top" showed the following:
>
> 2438 cfriesen 20 0 3800 392 336 R 99.5 0.0 4:02.82 cat
>
> 2435 cfriesen 20 0 3800 392 336 R 65.9 0.0 3:30.94 cat
>
> 2437 cfriesen 20 0 3800 392 336 R 34.3 0.0 3:14.89 cat
>
>
>
> Where pid 2435 should have gotten a whole cpu worth of time, it actually
> only got 66% of a cpu. Is this expected behaviour?
>
>
>
> I then redid the test with two hogs in one group and three hogs in the
> other group. Unfortunately, the cpu shares were not equally distributed
> within each group. Using a 10-sec interval in "top", I got the following:
>
>
> 2522 cfriesen 20 0 3800 392 336 R 52.2 0.0 1:33.38 cat
>
> 2523 cfriesen 20 0 3800 392 336 R 48.9 0.0 1:37.85 cat
>
> 2524 cfriesen 20 0 3800 392 336 R 37.0 0.0 1:23.22 cat
>
> 2525 cfriesen 20 0 3800 392 336 R 32.6 0.0 1:22.62 cat
>
> 2559 cfriesen 20 0 3800 392 336 R 28.7 0.0 0:24.30 cat
>
>
> Do we expect to see upwards of 9% relative unfairness between processes
> within a class?
>
> I tried messing with the tuneables in /proc/sys/kernel
> (sched_latency_ns, sched_migration_cost, sched_min_granularity_ns) but
> was unable to significantly improve these results.
>
> Any pointers would be appreciated.
What you're testing is SMP fairness of group scheduling, and that code is
somewhat fresh (and has known issues - performance being number one amongst
them), but it's quite possible it has some other issues as well.
Could you see if the patches found here:
http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/
make any difference for you?
* Re: fair group scheduler not so fair?
2008-05-22 6:56 ` Peter Zijlstra
@ 2008-05-22 20:02 ` Chris Friesen
2008-05-22 20:07 ` Peter Zijlstra
0 siblings, 1 reply; 26+ messages in thread
From: Chris Friesen @ 2008-05-22 20:02 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, vatsa, mingo, pj
Peter Zijlstra wrote:
> Could you see if the patches found here:
>
> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/
>
> make any difference for you?
Not much difference. In the following case pid 2438 is in group "a" and
pids 2439/2440 are in group "b". Pid 2438 still gets stuck with only 66%.
2439 cfriesen 20 0 3800 392 336 R 99.7 0.0 3:17.37 cat
2438 cfriesen 20 0 3800 392 336 R 66.2 0.0 2:33.63 cat
2440 cfriesen 20 0 3800 392 336 R 33.6 0.0 1:47.53 cat
With 3 tasks in group a, 2 in group b, it's still pretty poor:
2514 cfriesen 20 0 3800 392 336 R 52.5 0.0 0:48.11 cat
2515 cfriesen 20 0 3800 392 336 R 50.2 0.0 0:42.53 cat
2439 cfriesen 20 0 3800 392 336 R 35.4 0.0 4:37.07 cat
2438 cfriesen 20 0 3800 392 336 R 33.3 0.0 3:34.97 cat
2440 cfriesen 20 0 3800 392 336 R 28.3 0.0 2:26.17 cat
If I boot with "nosmp" it behaves more or less as expected:
3 tasks in default:
2427 cfriesen 20 0 3800 392 336 R 33.7 0.0 0:36.54 cat
2429 cfriesen 20 0 3800 392 336 R 33.5 0.0 0:35.63 cat
2428 cfriesen 20 0 3800 392 336 R 32.9 0.0 0:35.84 cat
1 task in a, 2 in b:
2427 cfriesen 20 0 3800 392 336 R 49.8 0.0 1:45.74 cat
2428 cfriesen 20 0 3800 392 336 R 25.0 0.0 1:38.65 cat
2429 cfriesen 20 0 3800 392 336 R 25.0 0.0 1:38.18 cat
3 tasks in a, 2 in b:
2521 cfriesen 20 0 3800 392 336 R 25.2 0.0 0:08.52 cat
2522 cfriesen 20 0 3800 392 336 R 25.2 0.0 0:08.23 cat
2427 cfriesen 20 0 3800 392 336 R 16.6 0.0 1:59.39 cat
2429 cfriesen 20 0 3800 392 336 R 16.6 0.0 1:47.63 cat
2428 cfriesen 20 0 3800 392 336 R 16.4 0.0 1:48.65 cat
I haven't really dug into the scheduler yet (although that's next), but
based on these results it doesn't really look like the load balancer is
properly group-aware.
Chris
* Re: fair group scheduler not so fair?
2008-05-22 20:02 ` Chris Friesen
@ 2008-05-22 20:07 ` Peter Zijlstra
2008-05-22 20:18 ` Li, Tong N
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-22 20:07 UTC (permalink / raw)
To: Chris Friesen; +Cc: linux-kernel, vatsa, mingo, pj
On Thu, 2008-05-22 at 14:02 -0600, Chris Friesen wrote:
> I haven't really dug into the scheduler yet (although that's next), but
> based on these results it doesn't really look like the load balancer is
> properly group-aware.
I'm trying, but it's turning out to be rather more difficult than
expected. Any help here would be much appreciated.
* RE: fair group scheduler not so fair?
2008-05-22 20:07 ` Peter Zijlstra
@ 2008-05-22 20:18 ` Li, Tong N
2008-05-22 21:13 ` Peter Zijlstra
2008-05-23 9:42 ` Srivatsa Vaddagiri
0 siblings, 2 replies; 26+ messages in thread
From: Li, Tong N @ 2008-05-22 20:18 UTC (permalink / raw)
To: Peter Zijlstra, Chris Friesen; +Cc: linux-kernel, vatsa, mingo, pj
Peter,
I didn't look at your patches, but I thought you were flattening group
weights down to task-level so that the scheduler only looks at per-task
weights. That'd make group fairness as good as task fairness gets. Is
this still the case?
tong
>-----Original Message-----
>From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Peter Zijlstra
>Sent: Thursday, May 22, 2008 1:08 PM
>To: Chris Friesen
>Cc: linux-kernel@vger.kernel.org; vatsa@linux.vnet.ibm.com; mingo@elte.hu; pj@sgi.com
>Subject: Re: fair group scheduler not so fair?
>
>On Thu, 2008-05-22 at 14:02 -0600, Chris Friesen wrote:
>
>> I haven't really dug into the scheduler yet (although that's next), but
>> based on these results it doesn't really look like the load balancer is
>> properly group-aware.
>
>I'm trying, but it's turning out to be rather more difficult than
>expected. Any help here would be much appreciated.
>
* RE: fair group scheduler not so fair?
2008-05-22 20:18 ` Li, Tong N
@ 2008-05-22 21:13 ` Peter Zijlstra
2008-05-23 0:17 ` Chris Friesen
2008-05-23 9:42 ` Srivatsa Vaddagiri
1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-22 21:13 UTC (permalink / raw)
To: Li, Tong N; +Cc: Chris Friesen, linux-kernel, vatsa, mingo, pj
On Thu, 2008-05-22 at 13:18 -0700, Li, Tong N wrote:
> Peter,
>
> I didn't look at your patches, but I thought you were flattening group
> weights down to task-level so that the scheduler only looks at per-task
> weights. That'd make group fairness as good as task fairness gets. Is
> this still the case?
We still have hierarchical runqueues - getting rid of those is another
tree I'm working on; it has an EEVDF-based rq scheduler.
For load balancing purposes we are indeed projecting everything onto a
flat level.
A rather quick description of what we do. We have:

  task-weight     - the weight of a task
  group-weight    - the weight of a group (same units as for tasks)
  group-shares    - the weight of a group on a particular cpu
  runqueue-weight - the sum of the weights queued on a runqueue
we compute group-shares as:
s_(i,g) = W_g * rw_(i,g) / \Sum_j rw_(j,g)
s_(i,g) := group g's shares for cpu i
W_g := group g's weight
rw_(i,g) := group g's runqueue weight for cpu i
(all for a given group)
We compute these shares while walking the task-group tree bottom up,
since the shares of a child group affect the runqueue weight of its
parent.
We then select the busiest runqueue from the available set based solely
on the top-level runqueue weight (since that accurately reflects all the
child group weights after updating the shares).
We compute the imbalance between this rq and the busiest rq in top-level
weight units.
Then, for this busiest cpu we compute the hierarchical load for each
group:
h_load_(i,g) = rw_(i,0) \Prod_{l=1} s_(i,l)/rw_(i,{l-1})
Where l iterates over the tree levels (not the cpus).
h_load represents the full weight of the group as seen from the top
level. This is used to convert the weight of each moved task to top
weight, and we'll keep on moving tasks until the imbalance is satisfied.
Given the following:

        root
       / | \
     _A_ 1  2
     /| |\
    3 4 5 B
         / \
        6   7

     CPU0            CPU1
     root            root
    /    \          /    \
   A      1        A      2
  / \              / \
 4   B            3   5
    / \
   6   7
Numerical examples given the above scenario, assuming everybody's
weight is 1024:
s_(0,A) = s_(1,A) = 512
s_(0,B) = 1024, s_(1,B) = 0
rw_(0,A) = rw(1,A) = 2048
rw_(0,B) = 2048, rw_(1,B) = 0
h_load_(0,A) = h_load_(1,A) = 512
h_load_(0,B) = 256, h_load(1,B) = 0
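The numbers above can be checked with a short sketch (plain Python, not kernel code; integer division stands in for the kernel's fixed-point arithmetic, and the topology and 1024-everywhere weights are taken from the example):

```python
W = 1024  # everybody's weight in the example

def shares(group_weight, rw_per_cpu):
    """s_(i,g) = W_g * rw_(i,g) / Sum_j rw_(j,g)"""
    total = sum(rw_per_cpu)
    return [group_weight * rw // total for rw in rw_per_cpu]

# Bottom up: B's shares feed into A's runqueue weight.
rw_B = [W + W, 0]               # CPU0 runs tasks 6 and 7; B is absent on CPU1
s_B = shares(W, rw_B)

rw_A = [W + s_B[0], W + W]      # CPU0: task 4 + B's shares; CPU1: tasks 3 and 5
s_A = shares(W, rw_A)

rw_root = [s_A[0] + W, s_A[1] + W]   # CPU0: A + task 1; CPU1: A + task 2

# h_load_(i,g) = rw_(i,0) * Prod_l s_(i,l) / rw_(i,l-1)
h_load_A = [rw_root[i] * s_A[i] // rw_root[i] for i in range(2)]
h_load_B = [h_load_A[i] * s_B[i] // rw_A[i] for i in range(2)]

print(s_A, s_B)           # [512, 512] [1024, 0]
print(h_load_A, h_load_B) # [512, 512] [256, 0]
```

This reproduces every figure in the example, including h_load_(0,B) = 512 * 1024 / 2048 = 256.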
* Re: fair group scheduler not so fair?
2008-05-22 21:13 ` Peter Zijlstra
@ 2008-05-23 0:17 ` Chris Friesen
2008-05-23 7:44 ` Srivatsa Vaddagiri
0 siblings, 1 reply; 26+ messages in thread
From: Chris Friesen @ 2008-05-23 0:17 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Li, Tong N, linux-kernel, vatsa, mingo, pj
Peter Zijlstra wrote:
> Given the following:
>
>         root
>        / | \
>      _A_ 1  2
>      /| |\
>     3 4 5 B
>          / \
>         6   7
>
>      CPU0            CPU1
>      root            root
>     /    \          /    \
>    A      1        A      2
>   / \              / \
>  4   B            3   5
>     / \
>    6   7
How do you move specific groups to different cpus. Is this simply using
cpusets?
>
> Numerical examples given the above scenario, assuming everybody's
> weight is 1024:
> s_(0,A) = s_(1,A) = 512
Just to make sure I understand what's going on...this is half of 1024
because it shows up on both cpus?
> s_(0,B) = 1024, s_(1,B) = 0
This gets the full 1024 because it's only on one cpu.
> rw_(0,A) = rw(1,A) = 2048
> rw_(0,B) = 2048, rw_(1,B) = 0
How do we get 2048? Shouldn't this be 1024?
> h_load_(0,A) = h_load_(1,A) = 512
> h_load_(0,B) = 256, h_load(1,B) = 0
At this point the numbers make sense, but I'm not sure how the formula
for h_load_ works given that I'm not sure what's going on for rw_.
Chris
* Re: fair group scheduler not so fair?
2008-05-23 0:17 ` Chris Friesen
@ 2008-05-23 7:44 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-23 7:44 UTC (permalink / raw)
To: Chris Friesen; +Cc: Peter Zijlstra, Li, Tong N, linux-kernel, mingo, pj
On Thu, May 22, 2008 at 06:17:37PM -0600, Chris Friesen wrote:
> Peter Zijlstra wrote:
>
>> Given the following:
>>
>>         root
>>        / | \
>>      _A_ 1  2
>>      /| |\
>>     3 4 5 B
>>          / \
>>         6   7
>>
>>      CPU0            CPU1
>>      root            root
>>     /    \          /    \
>>    A      1        A      2
>>   / \              / \
>>  4   B            3   5
>>     / \
>>    6   7
>
> How do you move specific groups to different cpus. Is this simply using
> cpusets?
No. Moving groups to different cpus is just a group-aware extension to
move_tasks() that is invoked as part of the regular load-balance operation.
move_tasks()->sched_fair_class.load_balance() has been modified to
understand how much the various task groups at various levels (e.g. A at
level 1, B at level 2, etc.) contribute to cpu load. It moves tasks
between cpus using this knowledge.
For example, if we were to consider all the tasks shown above to be on the
same cpu, CPU0, this is how it would look:
     CPU0            CPU1
     root            root
    / | \
   A  1  2
   /| |\
  3 4 5 B
       / \
      6   7
Then cpu0 load = weight of A + weight of 1 + weight of 2
               = 1024 + 1024 + 1024 = 3072
while cpu1 load = 0.
Load to be moved to cut down this imbalance = 3072/2 = 1536.
move_tasks() running on CPU1 would then iteratively try to pull tasks such
that the total weight moved is <= 1536:
  Task moved    Total weight moved
  ----------    ------------------
  2             1024
  3             1024 + 256 = 1280
  5             1280 + 256 = 1536
resulting in:

     CPU0            CPU1
     root            root
    /    \          /    \
   A      1        A      2
  / \              / \
 4   B            3   5
    / \
   6   7
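The iterative pull above can be sketched as follows (an idealized model, not the kernel's move_tasks(); task weights are expressed in top-level units: task 2 sits at the root so it counts 1024, tasks 3/4/5 inside A contribute 1024 * 1024/4096 = 256 each, and 6/7 inside B contribute 128 each; the visiting order is assumed to match the table above):

```python
target = 1536   # half of cpu0's load, per the imbalance computation above
candidates = [("2", 1024), ("3", 256), ("5", 256), ("4", 256),
              ("6", 128), ("7", 128)]

moved, pulled = 0, []
for name, weight in candidates:
    if moved + weight > target:
        continue            # pulling this task would overshoot the target
    pulled.append(name)
    moved += weight
    if moved == target:
        break               # imbalance fully satisfied

print(pulled, moved)        # ['2', '3', '5'] 1536
```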
>> Numerical examples given the above scenario, assuming everybody's
>> weight is 1024:
>
>> s_(0,A) = s_(1,A) = 512
>
> Just to make sure I understand what's going on...this is half of 1024
> because it shows up on both cpus?
Not exactly. As Peter put it:

    s_(i,g) = W_g * rw_(i,g) / \Sum_j rw_(j,g)

In this case,

    s_(0,A) = W_A * rw_(0,A) / \Sum_j rw_(j,A)

    W_A = shares given to A by the admin = 1024
    rw_(0,A) = weight of 4 + weight of B = 1024 + 1024 = 2048
    rw_(1,A) = weight of 3 + weight of 5 = 1024 + 1024 = 2048
    \Sum_j rw_(j,A) = 4096

So,

    s_(0,A) = 1024 * 2048 / 4096 = 512
>> s_(0,B) = 1024, s_(1,B) = 0
>
> This gets the full 1024 because it's only on one cpu.
Not exactly. rw_(0, B) = \Sum_j rw_(j, B) and that's why s_(0,B) = 1024
>> rw_(0,A) = rw(1,A) = 2048
>> rw_(0,B) = 2048, rw_(1,B) = 0
>
> How do we get 2048? Shouldn't this be 1024?
Hope this is clarified from above.
>> h_load_(0,A) = h_load_(1,A) = 512
>> h_load_(0,B) = 256, h_load(1,B) = 0
>
> At this point the numbers make sense, but I'm not sure how the formula for
> h_load_ works given that I'm not sure what's going on for rw_.
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-23 9:42 ` Srivatsa Vaddagiri
@ 2008-05-23 9:39 ` Peter Zijlstra
2008-05-23 10:19 ` Srivatsa Vaddagiri
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-23 9:39 UTC (permalink / raw)
To: vatsa; +Cc: Li, Tong N, Chris Friesen, linux-kernel, mingo, pj
On Fri, 2008-05-23 at 15:12 +0530, Srivatsa Vaddagiri wrote:
> On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> > Peter,
> >
> > I didn't look at your patches, but I thought you were flattening group
> > weights down to task-level so that the scheduler only looks at per-task
> > weights.
>
> Wouldn't that require task weight readjustment upon every fork/exit?
If you were to do that - yes that would get you into some very serious
trouble.
The route I've chosen is to basically recompute it every time I need the
weight. So every time I use a weight, I do:
\Prod_{l=1} w_l/rw_{l-1}
Not doing that would get you O(n) recomputes in all sorts of situations.
* Re: fair group scheduler not so fair?
2008-05-22 20:18 ` Li, Tong N
2008-05-22 21:13 ` Peter Zijlstra
@ 2008-05-23 9:42 ` Srivatsa Vaddagiri
2008-05-23 9:39 ` Peter Zijlstra
1 sibling, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-23 9:42 UTC (permalink / raw)
To: Li, Tong N; +Cc: Peter Zijlstra, Chris Friesen, linux-kernel, mingo, pj
On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> Peter,
>
> I didn't look at your patches, but I thought you were flattening group
> weights down to task-level so that the scheduler only looks at per-task
> weights.
Wouldn't that require task weight readjustment upon every fork/exit?
> That'd make group fairness as good as task fairness gets. Is
> this still the case?
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-23 10:19 ` Srivatsa Vaddagiri
@ 2008-05-23 10:16 ` Peter Zijlstra
0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2008-05-23 10:16 UTC (permalink / raw)
To: vatsa; +Cc: Li, Tong N, Chris Friesen, linux-kernel, mingo, pj
On Fri, 2008-05-23 at 15:49 +0530, Srivatsa Vaddagiri wrote:
> On Fri, May 23, 2008 at 11:39:21AM +0200, Peter Zijlstra wrote:
> > On Fri, 2008-05-23 at 15:12 +0530, Srivatsa Vaddagiri wrote:
> > > On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> > > > Peter,
> > > >
> > > > I didn't look at your patches, but I thought you were flattening group
> > > > weights down to task-level so that the scheduler only looks at per-task
> > > > weights.
> > >
> > > Wouldn't that require task weight readjustment upon every fork/exit?
> >
> > If you were to do that - yes that would get you into some very serious
> > trouble.
> >
> > The route I've chosen is to basically recompute it every time I need the
> > weight. So every time I use a weight, I do:
>
> And which are those points where "you need the weight"?
E.g. when computing the vruntime gain for this task, or when calculating
the slice.
Also when load-balancing, we do a similar thing through the h_load
stuff.
> Basically here's what I had in mind:
>
> Group A shares = 1024
> # of tasks in group A = 1 (T0)
>
> So T0 weight can be 1024.
>
> T0 now forks 1000 children. Ideally now,
> T0.weight = T1.weight = .... = T999.weight = 1024/1000
>
> If we don't change each task's weight like this, then group A will
> cumulatively get more share than it deserves.
>
> Are you saying you will change each task's weight lazily? If so how?
No, we leave it alone, but calculate the effective weight as seen from
the top every time we apply a task's weight to compute variables.
Basically all the calc_delta_*() stuff and the load balancer stuff.
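The lazy computation can be sketched like this (an illustrative model, not the kernel code): the stored task weight is left alone, and is scaled through the group hierarchy each time it is used, per the \Prod_{l=1} w_l/rw_{l-1} expression above.

```python
def effective_weight(task_weight, path):
    """path: (group_shares, group_rq_weight) pairs from root to the task."""
    w = task_weight
    for shares, rq_weight in path:
        w = w * shares / rq_weight
    return w

# Group A has shares 1024 and 1000 runnable tasks of weight 1024 each,
# as in the quoted fork example: no per-task weight rewrite is needed.
nr_tasks = 1000
eff = effective_weight(1024, [(1024, nr_tasks * 1024)])
print(eff)   # 1.024, i.e. 1024/1000
```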
* Re: fair group scheduler not so fair?
2008-05-23 9:39 ` Peter Zijlstra
@ 2008-05-23 10:19 ` Srivatsa Vaddagiri
2008-05-23 10:16 ` Peter Zijlstra
0 siblings, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-23 10:19 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Li, Tong N, Chris Friesen, linux-kernel, mingo, pj
On Fri, May 23, 2008 at 11:39:21AM +0200, Peter Zijlstra wrote:
> On Fri, 2008-05-23 at 15:12 +0530, Srivatsa Vaddagiri wrote:
> > On Thu, May 22, 2008 at 01:18:33PM -0700, Li, Tong N wrote:
> > > Peter,
> > >
> > > I didn't look at your patches, but I thought you were flattening group
> > > weights down to task-level so that the scheduler only looks at per-task
> > > weights.
> >
> > Wouldn't that require task weight readjustment upon every fork/exit?
>
> If you were to do that - yes that would get you into some very serious
> trouble.
>
> The route I've chosen is to basically recompute it every time I need the
> weight. So every time I use a weight, I do:
And which are those points where "you need the weight"?
Basically here's what I had in mind:
Group A shares = 1024
# of tasks in group A = 1 (T0)
So T0 weight can be 1024.
T0 now forks 1000 children. Ideally now,
T0.weight = T1.weight = .... = T999.weight = 1024/1000
If we don't change each task's weight like this, then group A will
cumulatively get more share than it deserves.
Are you saying you will change each task's weight lazily? If so how?
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
2008-05-22 6:56 ` Peter Zijlstra
@ 2008-05-27 17:15 ` Srivatsa Vaddagiri
2008-05-27 18:13 ` Chris Friesen
2008-05-27 17:28 ` Srivatsa Vaddagiri
2 siblings, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-27 17:15 UTC (permalink / raw)
To: Chris Friesen
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
On Wed, May 21, 2008 at 05:59:22PM -0600, Chris Friesen wrote:
> I just downloaded the current git head and started playing with the fair
> group scheduler. (This is on a dual cpu Mac G5.)
>
> I created two groups, "a" and "b". Each of them was left with the default
> share of 1024.
>
> I created three cpu hogs by doing "cat /dev/zero > /dev/null". One hog
> (pid 2435) was put into group "a", while the other two were put into group
> "b".
>
> After giving them time to settle down, "top" showed the following:
>
> 2438 cfriesen 20 0 3800 392 336 R 99.5 0.0 4:02.82 cat
> 2435 cfriesen 20 0 3800 392 336 R 65.9 0.0 3:30.94 cat
> 2437 cfriesen 20 0 3800 392 336 R 34.3 0.0 3:14.89 cat
>
>
> Where pid 2435 should have gotten a whole cpu worth of time, it actually
> only got 66% of a cpu. Is this expected behaviour?
Definitely not expected behavior, and I think I understand why this is
happening.
But first, note that groups "a" and "b" share bandwidth with all tasks
in /dev/cgroup/tasks. Let's say that /dev/cgroup/tasks has T0-T1,
/dev/cgroup/a/tasks has TA1, while /dev/cgroup/b/tasks has
TB1 (all tasks of weight 1024).
Then TA1 is expected to get 1/(1+1+2) = 25% bandwidth.
Similarly T0, T1 and TB1 all get 25% bandwidth.
IOW, groups "a" and "b" are peers of each task in /dev/cgroup/tasks.
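The peer relationship can be sketched numerically (idealized weights, not kernel code): groups "a" and "b" compete at the same level as the individual tasks T0 and T1 left in /dev/cgroup/tasks.

```python
def bandwidth(entities):
    """entities: {name: weight} competing at one level; returns fractions."""
    total = sum(entities.values())
    return {name: w / total for name, w in entities.items()}

top = bandwidth({"T0": 1024, "T1": 1024, "a": 1024, "b": 1024})
print(top["a"])   # 0.25 -- TA1, alone in group "a", is expected to get 25%
```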
Having said that, here's what I do for my testing:
# mkdir /cgroup
# mount -t cgroup -ocpu none /cgroup
# cd /cgroup
# #Move all tasks to 'sys' group and give it low shares
# mkdir sys
# cd sys
# for i in `cat ../tasks`; do echo $i > tasks; done
# echo 100 > cpu.shares
# mkdir a
# mkdir b
# echo <pid> > a/tasks
..
Now, why did Group "a" get less than what it deserved? Here's what was
happening:

    CPU0        CPU1
    a0          b0
    b1

cpu0.load = 1024 (Grp a load) + 512 (Grp b load)
cpu1.load = 512 (Grp b load)
imbalance = 1024
max_load_move = 512 (to equalize load)

load_balance_fair() is invoked on CPU1 with this max_load_move target of 512.
Ideally it can move b1 to CPU1, which would attain perfect balance. This
does not happen because:

    load_balance_fair() iterates through the task list in the order the
    tasks were created. So it first examines what tasks it can pull from
    Group "a". It invokes __load_balance_fair() to see if it can pull any
    tasks worth a max weight of 512 (rem_load). Ideally, since a0's weight
    is 1024, it should not pull a0. However, balance_tasks() is eager to
    pull at least one task (because of SCHED_LOAD_SCALE_FUZZ) and ends up
    pulling a0. This results in more load being moved (1024) than the
    required target.

Next, when CPU0 tries pulling a load of 512, it ends up pulling a0 again.
Thus a0 ping-pongs between the two CPUs.
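The skip_for_load test in balance_tasks() can be modeled to show why a0 gets pulled (a sketch, not the kernel code): with the default fuzz of SCHED_LOAD_SCALE, a 1024-weight task is not skipped even when only 512 of load remains to be moved; the patched variant (fuzz 0, ">=") skips it.

```python
SCHED_LOAD_SCALE = 1 << 10

def skip_for_load(weight, rem_load_move, fuzz, ge):
    # mirrors: (p->se.load.weight >> 1) >(=) rem_load_move + SCHED_LOAD_SCALE_FUZZ
    half = weight >> 1
    return half >= rem_load_move + fuzz if ge else half > rem_load_move + fuzz

before = skip_for_load(1024, 512, SCHED_LOAD_SCALE, ge=False)  # False: a0 pulled
after  = skip_for_load(1024, 512, 0, ge=True)                  # True: a0 skipped
print(before, after)
```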
The following experimental patch (on top of 2.6.26-rc3 +
http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/) seems
to fix the problem.
Note that this works only when /dev/cgroup/sys/cpu.shares = 100 (or some
other low number). Otherwise top (or whatever command you run to observe the
load distribution) contributes some load to the /dev/cgroup/sys group, which
skews the results. IMHO, find_busiest_group() needs to use cpu utilization
(rather than task/group load) as the metric to balance across CPUs.
Can you check if this makes a difference for you as well?
Not-yet-Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
include/linux/sched.h | 4 ++++
init/Kconfig | 2 +-
kernel/sched.c | 5 ++++-
kernel/sched_debug.c | 2 +-
4 files changed, 10 insertions(+), 3 deletions(-)
Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -698,7 +698,11 @@ enum cpu_idle_type {
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define SCHED_LOAD_SCALE_FUZZ 0
+#else
#define SCHED_LOAD_SCALE_FUZZ SCHED_LOAD_SCALE
+#endif
#ifdef CONFIG_SMP
#define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
See Documentation/sched-rt-group.txt for more information.
choice
- depends on GROUP_SCHED
+ depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
prompt "Basis for grouping tasks"
default USER_SCHED
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int
unsigned long shares = 0;
int i;
+ if (!tg->parent)
+ return;
+
for_each_cpu_mask(i, sd->span) {
rq_weight += tg->cfs_rq[i]->load.weight;
shares += tg->cfs_rq[i]->shares;
@@ -2919,7 +2922,7 @@ next:
* skip a task if it will be the highest priority task (i.e. smallest
* prio value) on its new queue regardless of its load weight
*/
- skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+ skip_for_load = (p->se.load.weight >> 1) >= rem_load_move +
SCHED_LOAD_SCALE_FUZZ;
if ((skip_for_load && p->prio >= *this_best_prio) ||
!can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
struct sched_entity *last;
unsigned long flags;
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
#else
char path[128] = "";
>
>
>
> I then redid the test with two hogs in one group and three hogs in the
> other group. Unfortunately, the cpu shares were not equally distributed
> within each group. Using a 10-sec interval in "top", I got the following:
>
>
> 2522 cfriesen 20 0 3800 392 336 R 52.2 0.0 1:33.38 cat
> 2523 cfriesen 20 0 3800 392 336 R 48.9 0.0 1:37.85 cat
> 2524 cfriesen 20 0 3800 392 336 R 37.0 0.0 1:23.22 cat
> 2525 cfriesen 20 0 3800 392 336 R 32.6 0.0 1:22.62 cat
> 2559 cfriesen 20 0 3800 392 336 R 28.7 0.0 0:24.30 cat
>
> Do we expect to see upwards of 9% relative unfairness between processes
> within a class?
>
> I tried messing with the tuneables in /proc/sys/kernel (sched_latency_ns,
> sched_migration_cost, sched_min_granularity_ns) but was unable to
> significantly improve these results.
>
> Any pointers would be appreciated.
>
> Thanks,
>
> Chris
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
2008-05-22 6:56 ` Peter Zijlstra
2008-05-27 17:15 ` Srivatsa Vaddagiri
@ 2008-05-27 17:28 ` Srivatsa Vaddagiri
2 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-27 17:28 UTC (permalink / raw)
To: Chris Friesen
Cc: linux-kernel, mingo, a.p.zijlstra, pj, dhaval, Balbir Singh,
aneesh.kumar
On Wed, May 21, 2008 at 05:59:22PM -0600, Chris Friesen wrote:
> I then redid the test with two hogs in one group and three hogs in the
> other group. Unfortunately, the cpu shares were not equally distributed
> within each group. Using a 10-sec interval in "top", I got the following:
I ran with this combination (2 in Group a and 3 in Group b) on top of the
experimental patch I sent and here's what I get:
4350 root 20 0 1384 228 176 R 53.8 0.0 52:27.54 1 hoga
4542 root 20 0 1384 228 176 R 49.3 0.0 3:39.76 0 hoga
4352 root 20 0 1384 232 176 R 36.0 0.0 26:53.50 1 hogb
4351 root 20 0 1384 228 176 R 32.0 0.0 26:47.54 0 hogb
4543 root 20 0 1384 232 176 R 29.0 0.0 2:03.62 0 hogb
Note that fairness (using the load-balance approach we have currently) works
over a long window. Usually I observe with "top -d30". The higher the
asymmetry of the task-load distribution, the longer it takes to converge to
fairness.
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-27 17:15 ` Srivatsa Vaddagiri
@ 2008-05-27 18:13 ` Chris Friesen
2008-05-28 16:33 ` Srivatsa Vaddagiri
0 siblings, 1 reply; 26+ messages in thread
From: Chris Friesen @ 2008-05-27 18:13 UTC (permalink / raw)
To: vatsa
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
Srivatsa Vaddagiri wrote:
> But first, note that Groups "a" and "b" share bandwidth with all tasks
> in /dev/cgroup/tasks.
Ah, good point. I've switched over to your group setup for testing.
> The following experimental patch (on top of 2.6.26-rc3 +
> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/) seems
> to fix the problem.
> Can you check if this makes a difference for you as well?
Initially it looked promising. I put pid 2498 in group A, and pids 2499
and 2500 in group B. 2498 got basically a full cpu, and the other two
got 50% each.
However, I then moved pid 2499 from group B to group A, and the system
got stuck in the following behaviour:
2498 cfriesen 20 0 3800 392 336 R 99.7 0.0 3:00.22 cat
2500 cfriesen 20 0 3800 392 336 R 66.7 0.0 1:39.10 cat
2499 cfriesen 20 0 3800 392 336 R 33.0 0.0 1:24.31 cat
I reproduced this a number of times.
Chris
* Re: fair group scheduler not so fair?
2008-05-27 18:13 ` Chris Friesen
@ 2008-05-28 16:33 ` Srivatsa Vaddagiri
2008-05-28 18:35 ` Chris Friesen
0 siblings, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-28 16:33 UTC (permalink / raw)
To: Chris Friesen
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
On Tue, May 27, 2008 at 12:13:46PM -0600, Chris Friesen wrote:
>> Can you check if this makes a difference for you as well?
>
> Initially it looked promising. I put pid 2498 in group A, and pids 2499
> and 2500 in group B. 2498 got basically a full cpu, and the other two got
> 50% each.
>
> However, I then moved pid 2499 from group B to group A, and the system got
> stuck in the following behaviour:
>
> 2498 cfriesen 20 0 3800 392 336 R 99.7 0.0 3:00.22 cat
> 2500 cfriesen 20 0 3800 392 336 R 66.7 0.0 1:39.10 cat
> 2499 cfriesen 20 0 3800 392 336 R 33.0 0.0 1:24.31 cat
>
> I reproduced this a number of times.
Thanks for trying this combination. I discovered a task leak in this loop
(__load_balance_iterator):

	/* Skip over entities that are not tasks */
	do {
		se = list_entry(next, struct sched_entity, group_node);
		next = next->next;
	} while (next != &cfs_rq->tasks && !entity_is_task(se));

	if (next == &cfs_rq->tasks)
		return NULL;

We always seem to skip the last element in the task list. In your case, the
lone task in group a/b is always skipped because of this.
The following hunk seems to fix this:

@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
 		next = next->next;
 	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	if (next == &cfs_rq->tasks)
-		return NULL;
-
 	cfs_rq->balance_iterator = next;
 
 	if (entity_is_task(se))
Updated patch (on top of 2.6.26-rc3 +
http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
below. Pls let me know how it fares!
---
include/linux/sched.h | 4 ++++
init/Kconfig | 2 +-
kernel/sched.c | 5 ++++-
kernel/sched_debug.c | 3 ++-
kernel/sched_fair.c | 3 ---
5 files changed, 11 insertions(+), 6 deletions(-)
Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -698,7 +698,11 @@ enum cpu_idle_type {
#define SCHED_LOAD_SHIFT 10
#define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT)
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define SCHED_LOAD_SCALE_FUZZ 0
+#else
#define SCHED_LOAD_SCALE_FUZZ SCHED_LOAD_SCALE
+#endif
#ifdef CONFIG_SMP
#define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
See Documentation/sched-rt-group.txt for more information.
choice
- depends on GROUP_SCHED
+ depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
prompt "Basis for grouping tasks"
default USER_SCHED
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int
unsigned long shares = 0;
int i;
+ if (!tg->parent)
+ return;
+
for_each_cpu_mask(i, sd->span) {
rq_weight += tg->cfs_rq[i]->load.weight;
shares += tg->cfs_rq[i]->shares;
@@ -2919,7 +2922,7 @@ next:
* skip a task if it will be the highest priority task (i.e. smallest
* prio value) on its new queue regardless of its load weight
*/
- skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+ skip_for_load = (p->se.load.weight >> 1) >= rem_load_move +
SCHED_LOAD_SCALE_FUZZ;
if ((skip_for_load && p->prio >= *this_best_prio) ||
!can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
struct sched_entity *last;
unsigned long flags;
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
#else
char path[128] = "";
@@ -170,6 +170,7 @@ void print_cfs_rq(struct seq_file *m, in
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
+ SEQ_printf(m, " .%-30s: %lu\n", "h_load", cfs_rq->h_load);
#endif
#endif
}
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
next = next->next;
} while (next != &cfs_rq->tasks && !entity_is_task(se));
- if (next == &cfs_rq->tasks)
- return NULL;
-
cfs_rq->balance_iterator = next;
if (entity_is_task(se))
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-28 16:33 ` Srivatsa Vaddagiri
@ 2008-05-28 18:35 ` Chris Friesen
2008-05-28 18:47 ` Dhaval Giani
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Chris Friesen @ 2008-05-28 18:35 UTC (permalink / raw)
To: vatsa
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
Srivatsa Vaddagiri wrote:
> We seem to be skipping the last element in the task list always. In your
> case, the lone task in Group a/b is always skipped because of this.
> Updated patch (on top of 2.6.26-rc3 +
> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
> below. Pls let me know how it fares!
Looking much better, but still some fairness issues with more complex
setups.
pid 2477 in A, others in B
2477 99.5%
2478 49.9%
2479 49.9%
move 2478 to A
2479 99.9%
2477 49.9%
2478 49.9%
So far so good. I then created C, and moved 2478 to it. A 3-second
"top" gave almost a 15% error from the desired behaviour for one group:
2479 76.2%
2477 72.2%
2478 51.0%
A 10-sec average was better, but we still see errors of 6%:
2478 72.8%
2477 64.0%
2479 63.2%
I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C. A
10-second "top" gave errors of up to 6.5%:
2500 60.1%
2491 37.5%
2492 37.4%
2489 25.0%
2488 19.9%
2490 19.9%
a re-test gave errors of up to 8.1%:
2534 74.8%
2533 30.1%
2532 30.0%
2529 25.0%
2530 20.0%
2531 20.0%
Another retest gave perfect results initially:
2559 66.5%
2560 33.4%
2561 33.3%
2564 22.3%
2562 22.2%
2563 22.1%
but moving 2564 from group A to C and then back to A disturbed the
perfect division of time and resulted in almost the same utilization
pattern as above:
2559 74.9%
2560 30.0%
2561 29.6%
2564 25.3%
2562 20.0%
2563 20.0%
It looks like perfect balancing is a metastable state where it can stay
happily for some time, but any small disturbance may be enough to kick
it over into a more stable but incorrect state. Once we get into such
an incorrect division of time, it appears very difficult to return to
perfect balancing.
Chris
* Re: fair group scheduler not so fair?
2008-05-28 18:35 ` Chris Friesen
@ 2008-05-28 18:47 ` Dhaval Giani
2008-05-29 2:50 ` Srivatsa Vaddagiri
2008-05-29 16:46 ` Srivatsa Vaddagiri
2 siblings, 0 replies; 26+ messages in thread
From: Dhaval Giani @ 2008-05-28 18:47 UTC (permalink / raw)
To: Chris Friesen
Cc: vatsa, linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh,
aneesh.kumar
On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
>
>> We seem to be skipping the last element in the task list always. In your
>> case, the lone task in Group a/b is always skipped because of this.
>
>> Updated patch (on top of 2.6.26-rc3 +
>> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
>> below. Pls let me know how it fares!
>
> Looking much better, but still some fairness issues with more complex
> setups.
>
> pid 2477 in A, others in B
> 2477 99.5%
> 2478 49.9%
> 2479 49.9%
>
> move 2478 to A
> 2479 99.9%
> 2477 49.9%
> 2478 49.9%
>
> So far so good. I then created C, and moved 2478 to it. A 3-second "top"
> gave almost a 15% error from the desired behaviour for one group:
>
> 2479 76.2%
> 2477 72.2%
> 2478 51.0%
>
>
> A 10-sec average was better, but we still see errors of 6%:
So it is converging to a fair state. How does it look
across, say, 20 or 30 seconds on your side?
> 2478 72.8%
> 2477 64.0%
> 2479 63.2%
>
>
> I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C. A
> 10-second "top" gave errors of up to 6.5%:
> 2500 60.1%
> 2491 37.5%
> 2492 37.4%
> 2489 25.0%
> 2488 19.9%
> 2490 19.9%
>
> a re-test gave errors of up to 8.1%:
>
> 2534 74.8%
> 2533 30.1%
> 2532 30.0%
> 2529 25.0%
> 2530 20.0%
> 2531 20.0%
>
> Another retest gave perfect results initially:
>
> 2559 66.5%
> 2560 33.4%
> 2561 33.3%
> 2564 22.3%
> 2562 22.2%
> 2563 22.1%
>
> but moving 2564 from group A to C and then back to A disturbed the perfect
> division of time and resulted in almost the same utilization pattern as
> above:
>
> 2559 74.9%
> 2560 30.0%
> 2561 29.6%
> 2564 25.3%
> 2562 20.0%
> 2563 20.0%
>
Is this over a longer duration or a 10-second duration?
--
regards,
Dhaval
* Re: fair group scheduler not so fair?
2008-05-28 18:35 ` Chris Friesen
2008-05-28 18:47 ` Dhaval Giani
@ 2008-05-29 2:50 ` Srivatsa Vaddagiri
2008-05-29 16:46 ` Srivatsa Vaddagiri
2 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-29 2:50 UTC (permalink / raw)
To: Chris Friesen
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
>
>> We seem to be skipping the last element in the task list always. In your
>> case, the lone task in Group a/b is always skipped because of this.
>
>> Updated patch (on top of 2.6.26-rc3 +
>> http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
>> below. Pls let me know how it fares!
>
> Looking much better, but still some fairness issues with more complex
> setups.
>
[snip]
> A 10-sec average was better, but we still see errors of 6%:
> 2478 72.8%
> 2477 64.0%
> 2479 63.2%
Yes, I have observed that the delta from ideal fairness is much more than
desirable (although I used to get much better fairness with
2.6.25-rc1; let me verify and get back).
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-28 18:35 ` Chris Friesen
2008-05-28 18:47 ` Dhaval Giani
2008-05-29 2:50 ` Srivatsa Vaddagiri
@ 2008-05-29 16:46 ` Srivatsa Vaddagiri
2008-05-29 16:47 ` Srivatsa Vaddagiri
2008-05-29 21:30 ` Chris Friesen
2 siblings, 2 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-29 16:46 UTC (permalink / raw)
To: Chris Friesen
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Looking much better, but still some fairness issues with more complex
> setups.
>
> pid 2477 in A, others in B
> 2477 99.5%
> 2478 49.9%
> 2479 49.9%
>
> move 2478 to A
> 2479 99.9%
> 2477 49.9%
> 2478 49.9%
>
> So far so good. I then created C, and moved 2478 to it. A 3-second "top"
> gave almost a 15% error from the desired behaviour for one group:
>
> 2479 76.2%
> 2477 72.2%
> 2478 51.0%
>
>
> A 10-sec average was better, but we still see errors of 6%:
> 2478 72.8%
> 2477 64.0%
> 2479 63.2%
Found a couple of issues:
1. A minor bug in load_balance_fair() in the calculation of moved_load:
moved_load /= busiest_cfs_rq->load.weight + 1;
In place of busiest_cfs_rq->load.weight, the load before
moving tasks needs to be used. Fix in the updated patch below.
2. We walk task groups sequentially in load_balance_fair() without
necessarily looking for the best group. This results in
load_balance_fair() picking a non-optimal group/tasks
to pull. I have a hack below (strict = 1/0) to rectify this
problem, but we need a better algorithm to pick the best group
from which to pull tasks.
3. sd->imbalance_pct (default = 125) specifies how much imbalance we
tolerate. The lower the value, the better the fairness. To check
this, I changed the default to 105, which is giving me better
results.
With the updated patch and imbalance_pct = 105, here's how my 60-sec avg
looks:
4353 root 20 0 1384 232 176 R 67.0 0.0 2:47.23 1 hoga
4354 root 20 0 1384 228 176 R 66.5 0.0 2:44.65 1 hogb
4355 root 20 0 1384 228 176 R 66.3 0.0 2:28.18 0 hogb
Error is < 1%
> I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C. A
> 10-second "top" gave errors of up to 6.5%:
> 2500 60.1%
> 2491 37.5%
> 2492 37.4%
> 2489 25.0%
> 2488 19.9%
> 2490 19.9%
>
> a re-test gave errors of up to 8.1%:
>
> 2534 74.8%
> 2533 30.1%
> 2532 30.0%
> 2529 25.0%
> 2530 20.0%
> 2531 20.0%
>
> Another retest gave perfect results initially:
>
> 2559 66.5%
> 2560 33.4%
> 2561 33.3%
> 2564 22.3%
> 2562 22.2%
> 2563 22.1%
>
> but moving 2564 from group A to C and then back to A disturbed the perfect
> division of time and resulted in almost the same utilization pattern as
> above:
>
> 2559 74.9%
> 2560 30.0%
> 2561 29.6%
> 2564 25.3%
> 2562 20.0%
> 2563 20.0%
Again with the updated patch and changed imbalance_pct, here's what I see:
4458 root 20 0 1384 232 176 R 66.3 0.0 2:11.04 0 hogc
4457 root 20 0 1384 232 176 R 33.7 0.0 1:06.19 0 hogb
4456 root 20 0 1384 232 176 R 33.4 0.0 1:06.59 0 hogb
4455 root 20 0 1384 228 176 R 22.5 0.0 0:44.09 0 hoga
4453 root 20 0 1384 232 176 R 22.3 0.0 0:44.10 1 hoga
4454 root 20 0 1384 228 176 R 22.2 0.0 0:43.94 1 hoga
(Error < 1%)
In summary, can you do this before running your tests:
1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
(http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
2. Setup test env as below:
# mkdir /cgroup
# mount -t cgroup -ocpu none /cgroup
# cd /cgroup
# #Move all tasks to 'sys' group and give it low shares
# mkdir sys
# cd sys
# for i in `cat ../tasks`
do
echo $i > tasks
done
# echo 100 > cpu.shares
# cd /proc/sys/kernel/sched_domain
# for i in `find . -name imbalance_pct`; do echo 105 > $i; done
---
init/Kconfig | 2 +-
kernel/sched.c | 12 +++++++++---
kernel/sched_debug.c | 3 ++-
kernel/sched_fair.c | 26 +++++++++++++++++---------
4 files changed, 29 insertions(+), 14 deletions(-)
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
See Documentation/sched-rt-group.txt for more information.
choice
- depends on GROUP_SCHED
+ depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
prompt "Basis for grouping tasks"
default USER_SCHED
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1398,7 +1398,7 @@ static unsigned long
balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned,
- int *this_best_prio, struct rq_iterator *iterator);
+ int *this_best_prio, struct rq_iterator *iterator, int strict);
static int
iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int
unsigned long shares = 0;
int i;
+ if (!tg->parent)
+ return;
+
for_each_cpu_mask(i, sd->span) {
rq_weight += tg->cfs_rq[i]->load.weight;
shares += tg->cfs_rq[i]->shares;
@@ -2896,7 +2899,7 @@ static unsigned long
balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned,
- int *this_best_prio, struct rq_iterator *iterator)
+ int *this_best_prio, struct rq_iterator *iterator, int strict)
{
int loops = 0, pulled = 0, pinned = 0, skip_for_load;
struct task_struct *p;
@@ -2919,7 +2922,10 @@ next:
* skip a task if it will be the highest priority task (i.e. smallest
* prio value) on its new queue regardless of its load weight
*/
- skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+ if (strict)
+ skip_for_load = p->se.load.weight >= rem_load_move;
+ else
+ skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
SCHED_LOAD_SCALE_FUZZ;
if ((skip_for_load && p->prio >= *this_best_prio) ||
!can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
struct sched_entity *last;
unsigned long flags;
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
#else
char path[128] = "";
@@ -170,6 +170,7 @@ void print_cfs_rq(struct seq_file *m, in
#ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_SMP
SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
+ SEQ_printf(m, " .%-30s: %lu\n", "h_load", cfs_rq->h_load);
#endif
#endif
}
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
next = next->next;
} while (next != &cfs_rq->tasks && !entity_is_task(se));
- if (next == &cfs_rq->tasks)
- return NULL;
-
cfs_rq->balance_iterator = next;
if (entity_is_task(se))
@@ -1415,7 +1412,7 @@ static unsigned long
__load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move, struct sched_domain *sd,
enum cpu_idle_type idle, int *all_pinned, int *this_best_prio,
- struct cfs_rq *cfs_rq)
+ struct cfs_rq *cfs_rq, int strict)
{
struct rq_iterator cfs_rq_iterator;
@@ -1425,10 +1422,11 @@ __load_balance_fair(struct rq *this_rq,
return balance_tasks(this_rq, this_cpu, busiest,
max_load_move, sd, idle, all_pinned,
- this_best_prio, &cfs_rq_iterator);
+ this_best_prio, &cfs_rq_iterator, strict);
}
#ifdef CONFIG_FAIR_GROUP_SCHED
+
static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
@@ -1438,13 +1436,17 @@ load_balance_fair(struct rq *this_rq, in
long rem_load_move = max_load_move;
int busiest_cpu = cpu_of(busiest);
struct task_group *tg;
+ int strict = 1;
update_h_load(cpu_of(busiest));
rcu_read_lock();
+
+retry:
list_for_each_entry(tg, &task_groups, list) {
struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
long rem_load, moved_load;
+ unsigned long busiest_cfs_rq_load;
/*
* empty group
@@ -1452,25 +1454,31 @@ load_balance_fair(struct rq *this_rq, in
if (!busiest_cfs_rq->task_weight)
continue;
- rem_load = rem_load_move * busiest_cfs_rq->load.weight;
+ busiest_cfs_rq_load = busiest_cfs_rq->load.weight;
+ rem_load = rem_load_move * busiest_cfs_rq_load;
rem_load /= busiest_cfs_rq->h_load + 1;
moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
rem_load, sd, idle, all_pinned, this_best_prio,
- tg->cfs_rq[busiest_cpu]);
+ tg->cfs_rq[busiest_cpu], strict);
if (!moved_load)
continue;
moved_load *= busiest_cfs_rq->h_load;
- moved_load /= busiest_cfs_rq->load.weight + 1;
+ moved_load /= busiest_cfs_rq_load + 1;
rem_load_move -= moved_load;
if (rem_load_move < 0)
break;
}
+
+ if (rem_load_move && strict--)
+ goto retry;
+
rcu_read_unlock();
+
return max_load_move - rem_load_move;
}
#else
@@ -1482,7 +1490,7 @@ load_balance_fair(struct rq *this_rq, in
{
return __load_balance_fair(this_rq, this_cpu, busiest,
max_load_move, sd, idle, all_pinned,
- this_best_prio, &busiest->cfs);
+ this_best_prio, &busiest->cfs, 0);
}
#endif
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-29 16:46 ` Srivatsa Vaddagiri
@ 2008-05-29 16:47 ` Srivatsa Vaddagiri
2008-05-29 21:30 ` Chris Friesen
1 sibling, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-29 16:47 UTC (permalink / raw)
To: Chris Friesen
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
On Thu, May 29, 2008 at 10:16:07PM +0530, Srivatsa Vaddagiri wrote:
> In summary, can you do this before running your tests:
>
> 1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
> (http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
It's been quite a while since I downloaded Peter's patches, and I hope they
haven't changed! Once the results start looking good, I will request Peter to
roll this up in his patch stack.
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-29 16:46 ` Srivatsa Vaddagiri
2008-05-29 16:47 ` Srivatsa Vaddagiri
@ 2008-05-29 21:30 ` Chris Friesen
2008-05-30 6:43 ` Dhaval Giani
2008-05-30 11:36 ` Srivatsa Vaddagiri
1 sibling, 2 replies; 26+ messages in thread
From: Chris Friesen @ 2008-05-29 21:30 UTC (permalink / raw)
To: vatsa
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
Srivatsa Vaddagiri wrote:
> In summary, can you do this before running your tests:
>
> 1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
> (http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)
I updated with the old set of patches you sent me, plus your patch.
> 2. Setup test env as below:
Done.
Overall the group scheduler results look better, but I'm seeing an odd
scenario within a single group where sometimes I get a 67/67/66
breakdown but sometimes it gives 100/50/50.
Also, although the long-term results are good, the shorter-term fairness
isn't great. Is there a tuneable that would allow for a tradeoff
between performance and fairness? I have people that are looking for
within 4% fairness over a 1sec interval.
Initially I tried a simple setup with three hogs all in the default
"sys" group. Over multiple retries using 10-sec intervals, sometimes it
gave roughly 67% for each task, other times it settled into a 100/50/50
split that remained stable over time.
3 tasks in sys
2471 cfriesen 20 0 3800 392 336 R 99.9 0.0 0:29.97 cat
2470 cfriesen 20 0 3800 392 336 R 50.3 0.0 0:17.83 cat
2469 cfriesen 20 0 3800 392 336 R 49.6 0.0 0:17.96 cat
retry
2475 cfriesen 20 0 3800 392 336 R 68.3 0.0 0:28.46 cat
2476 cfriesen 20 0 3800 392 336 R 67.3 0.0 0:28.24 cat
2474 cfriesen 20 0 3800 392 336 R 64.3 0.0 0:28.73 cat
2476 cfriesen 20 0 3800 392 336 R 67.1 0.0 0:41.79 cat
2474 cfriesen 20 0 3800 392 336 R 66.6 0.0 0:41.96 cat
2475 cfriesen 20 0 3800 392 336 R 66.1 0.0 0:41.67 cat
retry
2490 cfriesen 20 0 3800 392 336 R 99.7 0.0 0:22.23 cat
2489 cfriesen 20 0 3800 392 336 R 49.9 0.0 0:21.02 cat
2491 cfriesen 20 0 3800 392 336 R 49.9 0.0 0:13.94 cat
With three groups, one task in each, I tried both 10 and 60 second
intervals. The longer interval looked better but was still up to 0.8% off:
10-sec
2490 cfriesen 20 0 3800 392 336 R 68.9 0.0 1:35.13 cat
2491 cfriesen 20 0 3800 392 336 R 65.8 0.0 1:04.65 cat
2489 cfriesen 20 0 3800 392 336 R 64.5 0.0 1:26.48 cat
60-sec
2490 cfriesen 20 0 3800 392 336 R 67.5 0.0 3:19.85 cat
2491 cfriesen 20 0 3800 392 336 R 66.3 0.0 2:48.93 cat
2489 cfriesen 20 0 3800 392 336 R 66.2 0.0 3:10.86 cat
Finally, a more complicated scenario: three tasks in A, two in B, and
one in C. The 60-sec trial was up to 0.8% off, while a 3-second trial
(just for fun) was 8.5% off.
60-sec
2491 cfriesen 20 0 3800 392 336 R 65.9 0.0 5:06.69 cat
2499 cfriesen 20 0 3800 392 336 R 33.6 0.0 0:55.35 cat
2490 cfriesen 20 0 3800 392 336 R 33.5 0.0 4:47.94 cat
2497 cfriesen 20 0 3800 392 336 R 22.6 0.0 0:38.76 cat
2489 cfriesen 20 0 3800 392 336 R 22.2 0.0 4:28.03 cat
2498 cfriesen 20 0 3800 392 336 R 22.2 0.0 0:35.13 cat
3-sec
2491 cfriesen 20 0 3800 392 336 R 58.2 0.0 13:29.60 cat
2490 cfriesen 20 0 3800 392 336 R 34.8 0.0 9:07.73 cat
2499 cfriesen 20 0 3800 392 336 R 31.0 0.0 5:15.69 cat
2497 cfriesen 20 0 3800 392 336 R 29.4 0.0 3:37.25 cat
2489 cfriesen 20 0 3800 392 336 R 23.3 0.0 7:26.25 cat
2498 cfriesen 20 0 3800 392 336 R 23.0 0.0 3:33.24 cat
Chris
* Re: fair group scheduler not so fair?
2008-05-29 21:30 ` Chris Friesen
@ 2008-05-30 6:43 ` Dhaval Giani
2008-05-30 10:21 ` Srivatsa Vaddagiri
2008-05-30 11:36 ` Srivatsa Vaddagiri
1 sibling, 1 reply; 26+ messages in thread
From: Dhaval Giani @ 2008-05-30 6:43 UTC (permalink / raw)
To: Chris Friesen
Cc: vatsa, linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh,
aneesh.kumar
> Also, although the long-term results are good, the shorter-term fairness
> isn't great. Is there a tuneable that would allow for a tradeoff between
> performance and fairness? I have people that are looking for within 4%
> fairness over a 1sec interval.
>
How does SMP fairness look for a !group scenario? I don't expect
group scheduling would be able to do much better.
--
regards,
Dhaval
* Re: fair group scheduler not so fair?
2008-05-30 6:43 ` Dhaval Giani
@ 2008-05-30 10:21 ` Srivatsa Vaddagiri
0 siblings, 0 replies; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-30 10:21 UTC (permalink / raw)
To: Dhaval Giani
Cc: Chris Friesen, linux-kernel, mingo, a.p.zijlstra, pj,
Balbir Singh, aneesh.kumar
On Fri, May 30, 2008 at 12:13:24PM +0530, Dhaval Giani wrote:
> > Also, although the long-term results are good, the shorter-term fairness
> > isn't great. Is there a tuneable that would allow for a tradeoff between
> > performance and fairness? I have people that are looking for within 4%
> > fairness over a 1sec interval.
> >
>
> How does SMP fairness look for a !group scenario? I don't expect
> group scheduling would be able to do much better.
Just tested this combo for !group case:
1 nice0 (weight = 1024)
2 nice3 (each weight = 526)
3 nice5 (each weight = 335)
You'd expect nice0 to get (on a 2 cpu system):
2 * 1024 / (1024 + 2*526 + 3*335) = 66.47
This is what I see over a 10sec interval (error = 6%):
4386 root 20 0 1384 228 176 R 60.4 0.0 3:06.75 1 nice0
4387 root 23 3 1384 232 176 R 37.9 0.0 1:57.03 0 nice3
4388 root 23 3 1384 228 176 R 37.9 0.0 1:57.24 0 nice3
4390 root 25 5 1384 228 176 R 24.1 0.0 1:14.62 0 nice5
4391 root 25 5 1384 228 176 R 19.8 0.0 1:01.26 1 nice5
4389 root 25 5 1384 228 176 R 19.7 0.0 1:01.12 1 nice5
Over 120sec interval (error still as high as 6%):
4386 root 20 0 1384 228 176 R 60.4 0.0 6:13.95 1 nice0
4388 root 23 3 1384 228 176 R 37.9 0.0 3:54.69 0 nice3
4387 root 23 3 1384 232 176 R 37.9 0.0 3:54.44 0 nice3
4390 root 25 5 1384 228 176 R 24.2 0.0 2:29.45 0 nice5
4391 root 25 5 1384 228 176 R 19.8 0.0 2:02.56 1 nice5
4389 root 25 5 1384 228 176 R 19.8 0.0 2:02.44 1 nice5
The high error could be because of interference from other tasks. Anyway, I
don't think the !group case is better at achieving fairness over shorter
intervals.
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-29 21:30 ` Chris Friesen
2008-05-30 6:43 ` Dhaval Giani
@ 2008-05-30 11:36 ` Srivatsa Vaddagiri
2008-06-02 20:03 ` Chris Friesen
1 sibling, 1 reply; 26+ messages in thread
From: Srivatsa Vaddagiri @ 2008-05-30 11:36 UTC (permalink / raw)
To: Chris Friesen
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
On Thu, May 29, 2008 at 03:30:37PM -0600, Chris Friesen wrote:
> Overall the group scheduler results look better, but I'm seeing an odd
> scenario within a single group where sometimes I get a 67/67/66 breakdown
> but sometimes it gives 100/50/50.
Hmm ... I can't recreate this 100/50/50 situation (tried about 10 times).
> Also, although the long-term results are good, the shorter-term fairness
> isn't great. Is there a tuneable that would allow for a tradeoff between
> performance and fairness?
The tuneables I can think of are:
- HZ (the higher, the better)
- min/max_interval and imbalance_pct for each domain (the lower, the better)
> I have people that are looking for within 4% fairness over a 1sec interval.
That seems to be pretty difficult to achieve with the per-cpu runqueue
and smpnice based load balancing approach we have now.
> Initially I tried a simple setup with three hogs all in the default "sys"
> group. Over multiple retries using 10-sec intervals, sometimes it gave
> roughly 67% for each task, other times it settled into a 100/50/50 split
> that remained stable over time.
Was this with imbalance_pct set to 105? Does it make any difference if
you change imbalance_pct to say 102?
> 3 tasks in sys
> 2471 cfriesen 20 0 3800 392 336 R 99.9 0.0 0:29.97 cat
> 2470 cfriesen 20 0 3800 392 336 R 50.3 0.0 0:17.83 cat
> 2469 cfriesen 20 0 3800 392 336 R 49.6 0.0 0:17.96 cat
>
> retry
> 2475 cfriesen 20 0 3800 392 336 R 68.3 0.0 0:28.46 cat
> 2476 cfriesen 20 0 3800 392 336 R 67.3 0.0 0:28.24 cat
> 2474 cfriesen 20 0 3800 392 336 R 64.3 0.0 0:28.73 cat
>
> 2476 cfriesen 20 0 3800 392 336 R 67.1 0.0 0:41.79 cat
> 2474 cfriesen 20 0 3800 392 336 R 66.6 0.0 0:41.96 cat
> 2475 cfriesen 20 0 3800 392 336 R 66.1 0.0 0:41.67 cat
>
> retry
> 2490 cfriesen 20 0 3800 392 336 R 99.7 0.0 0:22.23 cat
> 2489 cfriesen 20 0 3800 392 336 R 49.9 0.0 0:21.02 cat
> 2491 cfriesen 20 0 3800 392 336 R 49.9 0.0 0:13.94 cat
>
>
> With three groups, one task in each, I tried both 10 and 60 second
> intervals. The longer interval looked better but was still up to 0.8% off:
I honestly don't know if we can do better than 0.8%! In any case, I'd
expect that it would require more drastic changes.
> 10-sec
> 2490 cfriesen 20 0 3800 392 336 R 68.9 0.0 1:35.13 cat
> 2491 cfriesen 20 0 3800 392 336 R 65.8 0.0 1:04.65 cat
> 2489 cfriesen 20 0 3800 392 336 R 64.5 0.0 1:26.48 cat
>
> 60-sec
> 2490 cfriesen 20 0 3800 392 336 R 67.5 0.0 3:19.85 cat
> 2491 cfriesen 20 0 3800 392 336 R 66.3 0.0 2:48.93 cat
> 2489 cfriesen 20 0 3800 392 336 R 66.2 0.0 3:10.86 cat
>
>
> Finally, a more complicated scenario. three tasks in A, two in B, and one
> in C. The 60-sec trial was up to 0.8 off, while a 3-second trial (just for
> fun) was 8.5% off.
>
> 60-sec
> 2491 cfriesen 20 0 3800 392 336 R 65.9 0.0 5:06.69 cat
> 2499 cfriesen 20 0 3800 392 336 R 33.6 0.0 0:55.35 cat
> 2490 cfriesen 20 0 3800 392 336 R 33.5 0.0 4:47.94 cat
> 2497 cfriesen 20 0 3800 392 336 R 22.6 0.0 0:38.76 cat
> 2489 cfriesen 20 0 3800 392 336 R 22.2 0.0 4:28.03 cat
> 2498 cfriesen 20 0 3800 392 336 R 22.2 0.0 0:35.13 cat
>
> 3-sec
> 2491 cfriesen 20 0 3800 392 336 R 58.2 0.0 13:29.60 cat
> 2490 cfriesen 20 0 3800 392 336 R 34.8 0.0 9:07.73 cat
> 2499 cfriesen 20 0 3800 392 336 R 31.0 0.0 5:15.69 cat
> 2497 cfriesen 20 0 3800 392 336 R 29.4 0.0 3:37.25 cat
> 2489 cfriesen 20 0 3800 392 336 R 23.3 0.0 7:26.25 cat
> 2498 cfriesen 20 0 3800 392 336 R 23.0 0.0 3:33.24 cat
I ran with this configuration:
HZ = 1000,
min/max_interval = 1
imbalance_pct = 102
My 10-sec fairness looks like below (Error = 1.5%):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ #C COMMAND
4549 root 20 0 1384 228 176 R 65.2 0.0 0:36.02 0 hogc
4547 root 20 0 1384 228 176 R 32.8 0.0 0:17.87 0 hogb
4548 root 20 0 1384 228 176 R 32.6 0.0 0:18.28 1 hogb
4546 root 20 0 1384 232 176 R 22.9 0.0 0:11.82 1 hoga
4545 root 20 0 1384 228 176 R 22.3 0.0 0:11.74 1 hoga
4544 root 20 0 1384 232 176 R 22.1 0.0 0:11.93 1 hoga
3-sec fairness (error = 2.3%; sometimes went up to 6.7%)
4549 root 20 0 1384 228 176 R 69.0 0.0 1:33.56 1 hogc
4548 root 20 0 1384 228 176 R 32.7 0.0 0:46.74 1 hogb
4547 root 20 0 1384 228 176 R 29.3 0.0 0:47.16 0 hogb
4546 root 20 0 1384 232 176 R 22.3 0.0 0:30.80 0 hoga
4544 root 20 0 1384 232 176 R 20.3 0.0 0:30.95 0 hoga
4545 root 20 0 1384 228 176 R 19.4 0.0 0:31.17 0 hoga
--
Regards,
vatsa
* Re: fair group scheduler not so fair?
2008-05-30 11:36 ` Srivatsa Vaddagiri
@ 2008-06-02 20:03 ` Chris Friesen
0 siblings, 0 replies; 26+ messages in thread
From: Chris Friesen @ 2008-06-02 20:03 UTC (permalink / raw)
To: vatsa
Cc: linux-kernel, mingo, a.p.zijlstra, pj, Balbir Singh, aneesh.kumar,
dhaval
Srivatsa Vaddagiri wrote:
> That seems to be pretty difficult to achieve with the per-cpu runqueue
> and smpnice based load balancing approach we have now.
Okay, thanks.
>>Initially I tried a simple setup with three hogs all in the default "sys"
>>group. Over multiple retries using 10-sec intervals, sometimes it gave
>>roughly 67% for each task, other times it settled into a 100/50/50 split
>>that remained stable over time.
> Was this with imbalance_pct set to 105? Does it make any difference if
> you change imbalance_pct to say 102?
It was set to 105 initially. I later reproduced the problem with 102.
For example, the following was with 102, with three tasks created in the
sys class. Based on the runtime, pid 2499 has been getting a cpu all to
itself for over a minute.
2499 cfriesen 20 0 3800 392 336 R 99.8 0.0 1:05.85 cat
2496 cfriesen 20 0 3800 392 336 R 50.0 0.0 0:32.95 cat
2498 cfriesen 20 0 3800 392 336 R 50.0 0.0 0:32.97 cat
The next run was much better, with sub-second fairness after a minute.
2505 cfriesen 20 0 3800 392 336 R 68.2 0.0 1:00.32 cat
2506 cfriesen 20 0 3800 392 336 R 66.9 0.0 0:59.85 cat
2503 cfriesen 20 0 3800 392 336 R 64.2 0.0 1:00.21 cat
The lack of predictability is disturbing, as it implies some sensitivity
to the specific test conditions.
>>With three groups, one task in each, I tried both 10 and 60 second
>>intervals. The longer interval looked better but was still up to 0.8% off:
>
>
> I honestly don't know if we can do better than 0.8%! In any case, I'd
> expect that it would require more drastic changes.
No problem. It's still far superior to the SMP performance of CKRM,
which is what we're currently using (although heavily modified).
Chris
Thread overview: 26+ messages
2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
2008-05-22 6:56 ` Peter Zijlstra
2008-05-22 20:02 ` Chris Friesen
2008-05-22 20:07 ` Peter Zijlstra
2008-05-22 20:18 ` Li, Tong N
2008-05-22 21:13 ` Peter Zijlstra
2008-05-23 0:17 ` Chris Friesen
2008-05-23 7:44 ` Srivatsa Vaddagiri
2008-05-23 9:42 ` Srivatsa Vaddagiri
2008-05-23 9:39 ` Peter Zijlstra
2008-05-23 10:19 ` Srivatsa Vaddagiri
2008-05-23 10:16 ` Peter Zijlstra
2008-05-27 17:15 ` Srivatsa Vaddagiri
2008-05-27 18:13 ` Chris Friesen
2008-05-28 16:33 ` Srivatsa Vaddagiri
2008-05-28 18:35 ` Chris Friesen
2008-05-28 18:47 ` Dhaval Giani
2008-05-29 2:50 ` Srivatsa Vaddagiri
2008-05-29 16:46 ` Srivatsa Vaddagiri
2008-05-29 16:47 ` Srivatsa Vaddagiri
2008-05-29 21:30 ` Chris Friesen
2008-05-30 6:43 ` Dhaval Giani
2008-05-30 10:21 ` Srivatsa Vaddagiri
2008-05-30 11:36 ` Srivatsa Vaddagiri
2008-06-02 20:03 ` Chris Friesen
2008-05-27 17:28 ` Srivatsa Vaddagiri