* top-down balance purpose discussion -- resend
From: Alex Shi @ 2014-01-21 14:04 UTC
To: Paul E. McKenney, Peter Zijlstra, Ingo Molnar, Mike Galbraith,
Daniel Lezcano, Vincent Guittot, Morten Rasmussen, Amit Kucheria,
tglx@linutronix.de, linux-kernel@vger.kernel.org, Linus Torvalds
The current scheduler load balance works in bottom-up mode: each CPU
needs to initiate the balance by itself.
1, Consider a typical system with smt/core/cpu/numa, i.e. 4 levels of
scheduler domains. If there are just 2 tasks in the whole system, both
running on cpu0, the current load balance needs to pull one task to
another smt sibling in the smt domain, then to another core, then to
another cpu, and finally to another numa node. In total it takes 4
task movements to balance the system.
Generally, the task moving complexity is
O(nm log n), n := nr_cpus, m := nr_tasks
There is an excellent summary and explanation of this in
kernel/sched/fair.c:4605.
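One hedged way to read that figure (my own back-of-the-envelope
illustration, not taken from the fair.c comment): every task that has
to move may hop once per domain level, there are about log2(n) levels,
and each hop is preceded by a load scan touching up to n cpus:
moves ~= m * log2(n)
work  ~= n * m * log2(n)
E.g. with n = 16 cpus (4 levels: smt/core/cpu/numa) and m = 2 tasks,
that is up to 2 * 4 = 8 movements in the worst case.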
Another weakness of the current LB is that every cpu needs to fetch the
other cpus' load info repeatedly and figure out the busiest sched
group/queue on every sched domain level. That often just wastes time,
since it may not lead to any task movement. One of the reasons is that
a cpu can only pull tasks, not push them.
2, Consider the huge cost of task moving: context switches, tlb/cache
refills, and the useless remote cpu load info fetching. If we had a
better load balance solution that reduced the number of movements to
O(m), m := nr_tasks
it would be a great win on performance. In the above example, we could
move the task from cpu0 directly to another numa node: only 1 task
movement, saving 3 context switches and tlb/cache refills.
To get there, a rough idea is to change the load balance behaviour to
top-down mode, with the load balancing done by only one cpu. Each cpu
reports its own load status in per-cpu memory, and a baby-sitter in the
system collects these cpus' load info, decides centrally how to move
tasks, and finally sends an IPI to each hungry cpu to let it pull its
load quota from the appointed cpus.
In the above example, the baby-sitter would fetch each cpu's load info,
then send a pull-task IPI to let a cpu in another numa node pull one
task from cpu0. The task pulling itself still involves just 2 cpus, so
we can reuse the move_tasks() function. A rough sketch follows below.
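A minimal sketch of this baby-sitter scheme, assuming made-up names
throughout (cpu_load_export, report_self_load(), babysitter_balance()
and send_pull_task_ipi() are all hypothetical, not existing kernel
interfaces):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/cache.h>
#include <linux/kernel.h>

/* Hypothetical sketch only, nothing below exists in the kernel. */
struct cpu_load_export {
	unsigned long load;		/* the owning cpu's reported load */
} ____cacheline_aligned;

static DEFINE_PER_CPU(struct cpu_load_export, load_export);

/* Each cpu writes only its own cache line, never a remote one. */
static void report_self_load(unsigned long load)
{
	this_cpu_write(load_export.load, load);
}

/*
 * The baby-sitter runs on a single cpu: scan every exported load,
 * then tell the idlest cpu to pull directly from the busiest one.
 */
static void babysitter_balance(void)
{
	unsigned long max = 0, min = ULONG_MAX;
	int cpu, busiest = -1, idlest = -1;

	for_each_online_cpu(cpu) {
		unsigned long l = per_cpu(load_export, cpu).load;

		if (l > max) {
			max = l;
			busiest = cpu;
		}
		if (l < min) {
			min = l;
			idlest = cpu;
		}
	}

	/* One IPI, one pull: the 2 cpus involved can reuse move_tasks(). */
	if (busiest >= 0 && idlest >= 0 && busiest != idlest && max > min)
		send_pull_task_ipi(idlest, busiest);	/* hypothetical IPI */
}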
3, From a power saving POV, top-down can see the whole cpu power
topology before balancing. So we can start the balance with the power
considerations first, instead of aborting the balance at the last
minute after having fetched every remote cpu's info.
4, One concern with top-down mode is that on a large system we need
remote access to top level domain cpu info while each cpu keeps
updating its own load. That may cause cache bouncing. (The current LB
also has this problem, but it uses a long load balance interval to
reduce the number of LB passes.)
Paul has given lots of excellent suggestions to resolve/relieve this
issue. The following are his suggestions:
===
Might be -- key thing is to measure it on a big system. If it is a
problem, here are some workarounds with varying degrees of sanity:
1. Reduce cache misses by keeping three values:
a. Current load value.
b. Last exported value. (Should be in the same cache line
as a.)
c. Exported value. (Should be in a separate cache line.)
Then avoid writing to c unless a has deviated sufficiently from b.
If the load values stay constant for long time periods, this
can reduce the number of cache misses. On the other hand,
it introduces an extra compare and branch -- though this should
not be a problem if the load value is computed relatively
infrequently.
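To make suggestion 1 concrete, a hedged sketch (the field names and
the threshold are mine, not Paul's):

#include <linux/cache.h>
#include <linux/compiler.h>

/* Sketch only: a. and b. share the owner's private cache line, while
 * c. sits alone on the one line that remote readers touch.
 */
struct cpu_load {
	unsigned long cur;		/* a. current load value       */
	unsigned long last_exported;	/* b. last value written to c. */
	unsigned long exported ____cacheline_aligned;	/* c. shared   */
};

#define EXPORT_DELTA	16	/* arbitrary "sufficient deviation" */

static void maybe_export(struct cpu_load *l)
{
	unsigned long d = l->cur > l->last_exported ?
			  l->cur - l->last_exported :
			  l->last_exported - l->cur;

	/* The extra compare and branch: store to the shared line only
	 * when a has drifted far enough from b.
	 */
	if (d >= EXPORT_DELTA) {
		l->last_exported = l->cur;
		WRITE_ONCE(l->exported, l->cur);
	}
}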
2. As above, but provide additional information to allow the
other CPU to extrapolate values. For example, if a CPU goes
idle, some of its values will decay exponentially. The
remote CPU could compute the decay rate, removing the need
for the subject CPU to wake up to recompute its value.
(Maybe you already do this?)
Similarly if a CPU remains CPU-bound with a given runqueue
load for some time.
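A sketch of such extrapolation (the struct and the ~33ms half-life are
illustrative assumptions, loosely inspired by the scheduler's decaying
load averages, not an existing interface):

#include <linux/compiler.h>
#include <linux/types.h>

struct cpu_load_snap {
	unsigned long exported_load;	/* value at export time */
	u64 export_ns;			/* when it was exported */
	bool idle;			/* decaying while idle? */
};

static unsigned long extrapolate_load(const struct cpu_load_snap *s,
				      u64 now_ns)
{
	unsigned long load = READ_ONCE(s->exported_load);
	u64 periods;

	if (!s->idle)
		return load;	/* cpu-bound: assume roughly constant */

	/* Halve the load per ~33ms since export; the idle cpu never
	 * needs to wake up just to refresh its exported value.
	 */
	periods = (now_ns - s->export_ns) >> 25;	/* 2^25 ns ~= 33ms */
	return periods < BITS_PER_LONG ? load >> periods : 0;
}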
3. Allow the exported values to become inaccurate, and resample
the actual values remotely if extrapolated values indicate
that action is warranted.
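Building on the extrapolation helper above, suggestion 3 might look
like this (the per-cpu variables and the threshold are again made up):

static DEFINE_PER_CPU(struct cpu_load_snap, load_snap);
static DEFINE_PER_CPU(unsigned long, actual_load);

#define ACTION_THRESHOLD	64	/* arbitrary */

static unsigned long remote_load(int cpu, u64 now_ns)
{
	unsigned long est = extrapolate_load(&per_cpu(load_snap, cpu),
					     now_ns);

	if (est < ACTION_THRESHOLD)
		return est;	/* stale-but-cheap estimate is enough */

	/* The estimate says action may be warranted: pay one cache miss
	 * to resample the cpu's real, frequently-written load value.
	 */
	return READ_ONCE(per_cpu(actual_load, cpu));
}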
There are probably other approaches. I am being quite general here
because I don't have the full picture of the scheduler statistics
in my head. It is likely possible to obtain a much better approach
by considering the scheduler's specifics.
--
Thanks
Alex
* Re: top-down balance purpose discussion -- resend
From: Peter Zijlstra @ 2014-01-21 14:57 UTC
To: Alex Shi
Cc: Paul E. McKenney, Ingo Molnar, Mike Galbraith, Daniel Lezcano,
Vincent Guittot, Morten Rasmussen, Amit Kucheria,
tglx@linutronix.de, linux-kernel@vger.kernel.org, Linus Torvalds
On Tue, Jan 21, 2014 at 10:04:26PM +0800, Alex Shi wrote:
>
> The current scheduler load balance works in bottom-up mode: each CPU
> needs to initiate the balance by itself.
>
> 1, Consider a typical system with smt/core/cpu/numa, i.e. 4 levels of
> scheduler domains. If there are just 2 tasks in the whole system, both
> running on cpu0, the current load balance needs to pull one task to
> another smt sibling in the smt domain, then to another core, then to
> another cpu, and finally to another numa node. In total it takes 4
> task movements to balance the system.
Except the idle load balancer, and esp. the newidle, can totally bypass
this.
If you do the packing right in the newidle pass, you'd get there in 1
step.
> Generally, the task moving complexity is
> O(nm log n), n := nr_cpus, m := nr_tasks
>
> There is an excellent summary and explanation of this in
> kernel/sched/fair.c:4605.
Which is a perfectly fine scheme for a busy system.
> Another weakness of the current LB is that every cpu needs to fetch the
> other cpus' load info repeatedly and figure out the busiest sched
> group/queue on every sched domain level. That often just wastes time,
> since it may not lead to any task movement. One of the reasons is that
> a cpu can only pull tasks, not push them.
This doesn't make sense.. and in fact, we do a limited amount of 3rd
party movements.
Whatever you do, you have to repeat the information gathering anyhow,
because it constantly changes.
Trying to serialize that doesn't make any kind of sense. The only thing
you want is that the system converges.
Skipped the rest because it seems built on a fundament I don't agree
with. That 4-move thing is just silly for an idle system, and we
shouldn't do that.
I also very much do not want a single CPU balancing the entire system,
that's the antithesis of scalable.
* Re: top-down balance purpose discussion -- resend
From: Alex Shi @ 2014-01-22 7:40 UTC
To: Peter Zijlstra
Cc: Paul E. McKenney, Ingo Molnar, Mike Galbraith, Daniel Lezcano,
Vincent Guittot, Morten Rasmussen, Amit Kucheria,
tglx@linutronix.de, linux-kernel@vger.kernel.org, Linus Torvalds
On 01/21/2014 10:57 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 10:04:26PM +0800, Alex Shi wrote:
>>
>> The current scheduler load balance works in bottom-up mode: each CPU
>> needs to initiate the balance by itself.
>>
>> 1, Consider a typical system with smt/core/cpu/numa, i.e. 4 levels of
>> scheduler domains. If there are just 2 tasks in the whole system, both
>> running on cpu0, the current load balance needs to pull one task to
>> another smt sibling in the smt domain, then to another core, then to
>> another cpu, and finally to another numa node. In total it takes 4
>> task movements to balance the system.
>
> Except the idle load balancer, and esp. the newidle, can totally bypass
> this.
>
> If you do the packing right in the newidle pass, you'd get there in 1
> step.
It puts huge pressure on me to argue with great experts like you. I am
waiting for, and very much appreciate, any comments and corrections. :)
Yes, newidle balance will relieve this, but it cannot eliminate it. If
the newidle happens in another numa group, it just needs 1 step. But if
it happens in another smt group, it still needs 4 steps. So generally,
we still need more steps before the system is well balanced.
In this example, if the newidle cpu is in the same smallest group,
maybe we should wake up the remotest cpu in the system/llc instead, to
avoid extra task movements in the near future, for best performance.
And for power saving, maybe we had better kick the task to the smallest
group, then let the remote cpu group stay idle.
But with the current newidle it's impossible to do this, because
newidle is also bottom-up.
>
>> Generally, the task moving complexity is
>> O(nm log n), n := nr_cpus, m := nr_tasks
>>
>> There is an excellent summary and explanation of this in
>> kernel/sched/fair.c:4605.
>
> Which is a perfectly fine scheme for a busy system.
>
>> Another weakness of the current LB is that every cpu needs to fetch the
>> other cpus' load info repeatedly and figure out the busiest sched
>> group/queue on every sched domain level. That often just wastes time,
>> since it may not lead to any task movement. One of the reasons is that
>> a cpu can only pull tasks, not push them.
>
> This doesn't make sense.. and in fact, we do a limited amount of 3rd
> party movements.
Yes, but the 3rd party movements are too limited, just for pinned tasks.
>
> Whatever you do, you have to repeat the information gathering anyhow,
> because it constantly changes.
>
Yes, it is fine to collect the load info once per balance pass. But if
the balancing cpu is the busiest cpu, the current balance still keeps
collecting every group's load info from bottom to top, and then does
nothing about this imbalanced system. That is bad.
> Trying to serialize that doesn't make any kind of sense. The only thing
> you want is that the system converges.
Sorry, would you give a bit more detail on why 'serializing' makes no
sense?
>
> Skipped the rest because it seems built on a fundament I don't agree
> with. That 4-move thing is just silly for an idle system, and we
> shouldn't do that.
>
> I also very much do not want a single CPU balancing the entire system,
> that's the antithesis of scalable.
Sorry. IMHO, a single cpu is able to handle the balancing of 1000 cpus.
And it is far more scalable than every cpu doing balance in the system,
since only one cpu needs to fetch the other cpus' load info.
BTW, there is currently no coordination among all the cpus' balancing.
That's a bit of a mess. For example, if 2 cpus in a small cpu group
both balance the whole system at the same time, each of them thinks its
own group is light and wants more load, so they have the chance to
over-pull load into their group. That is bad. A single balancer has no
such problem.
--
Thanks
Alex
* Re: top-down balance purpose discussion -- resend
From: Alex Shi @ 2014-01-24 7:29 UTC
To: Peter Zijlstra
Cc: Paul E. McKenney, Ingo Molnar, Mike Galbraith, Daniel Lezcano,
Vincent Guittot, Morten Rasmussen, Amit Kucheria,
tglx@linutronix.de, linux-kernel@vger.kernel.org, Linus Torvalds
Any more comments on this idea? :)
On 01/22/2014 03:40 PM, Alex Shi wrote:
> [ full quote of the previous mail snipped ]
--
Thanks
Alex