From: Alex Shi <alex.shi@linaro.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Ingo Molnar <mingo@kernel.org>, Mike Galbraith <efault@gmx.de>,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
Morten Rasmussen <morten.rasmussen@arm.com>,
Amit Kucheria <amit.kucheria@linaro.org>,
"tglx@linutronix.de" <tglx@linutronix.de>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: top-down balance purpose discussion -- resend
Date: Fri, 24 Jan 2014 15:29:59 +0800
Message-ID: <52E21677.9090907@linaro.org>
In-Reply-To: <52DF75FD.5080304@linaro.org>
Any more comments on this idea? :)
On 01/22/2014 03:40 PM, Alex Shi wrote:
> On 01/21/2014 10:57 PM, Peter Zijlstra wrote:
>> On Tue, Jan 21, 2014 at 10:04:26PM +0800, Alex Shi wrote:
>>>
>>> The current scheduler load balance works bottom-up: each CPU needs to
>>> initiate the balance by itself.
>>>
>>> 1. Take an integrated computer system with smt/core/cpu/numa, i.e. 4
>>> levels of scheduler domains. If there are just 2 tasks in the whole
>>> system and both run on cpu0, the current load balance needs to pull a
>>> task to another smt in the smt domain, then to another core, then to
>>> another cpu, and finally to another numa. In total it needs 4 task
>>> moves to balance the system.
>>
>> Except the idle load balancer, and esp. the newidle can totally by-pass
>> this.
>>
>> If you do the packing right in the newidle pass, you'd get there in 1
>> step.
>
> It puts huge pressure on me to argue with a great expert like you. I am
> waiting for, and would very much appreciate, any comments and
> corrections. :)
>
> Yes, a newidle balance will relieve this somewhat, but it cannot
> eliminate it. If a newidle happens in another numa group, it needs just
> 1 step. But if it happens in another smt group, it still needs 4 steps.
> So in general we still need one or more extra steps before the system
> is well balanced.
>
> In this example, if the newidle cpu is in the same smallest group, maybe
> we should wake up the remotest cpu in the system/llc, to avoid extra
> task moves in the near future and get the best performance.
> For power saving, maybe we had better kick the task to the smallest
> group and let the remote cpu groups stay idle.
> But the current newidle cannot do this, because newidle also works
> bottom-up.
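The step counts in the example above (1 step when the first pull comes from another numa group, 4 when it comes from an smt sibling) can be reproduced with a toy model. This is only a sketch under a strong assumption: after each pull, the next larger domain level still sees an imbalance and moves the task once more. The function name `moves_until_spread` is made up for illustration and is not a kernel symbol.

```python
# Toy model only: real load balancing in kernel/sched/fair.c is far more
# involved, and newidle/idle balancing may shortcut these steps.

# Domain levels from smallest to largest, as named in the example above.
LEVELS = ["smt", "core", "cpu", "numa"]


def moves_until_spread(first_pull_level):
    """Count task moves until the two tasks sit in different numa groups.

    first_pull_level is the smallest domain level that contains both cpu0
    and the cpu doing the first pull. Assumption: every larger level then
    needs one more move of its own.
    """
    start = LEVELS.index(first_pull_level)
    # One move at the first level, plus one per larger level.
    return len(LEVELS) - start


for level in LEVELS:
    print(f"first pull within {level:>4} domain: "
          f"{moves_until_spread(level)} move(s)")
```

Under this model a pull by an smt sibling costs 4 moves, while a pull from (or a direct placement onto) a cpu in another numa group costs 1.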
>>
>>> Generally, the task moving complexity is
>>> O(nm log n), n := nr_cpus, m := nr_tasks
>>>
>>> There is an excellent summary and explanation of this in
>>> kernel/sched/fair.c:4605
>>
>> Which is a perfectly fine scheme for a busy system.
>>
>>> Another weakness of the current LB is that every cpu needs to fetch the
>>> other cpus' load info repeatedly and try to figure out the busiest sched
>>> group/queue at every sched domain level. But this may just waste time,
>>> since it may not lead to any task move. One reason is that a cpu can
>>> only pull tasks, not push them.
>>
>> This doesn't make sense.. and in fact, we do a limited amount of 3rd
>> party movements.
>
> Yes, but the 3rd party movements are too limited, only for pinned tasks.
>>
>> Whatever you do, you have to repeat the information gathering anyhow,
>> because it constantly changes.
>>
>
> Yes, it is good to collect the load info once per balance pass. But if
> the balancing cpu is itself the busiest cpu, the current balance still
> keeps collecting every group's load info from the bottom up, and then
> does nothing about the imbalanced system. This is bad.
>
>> Trying to serialize that doesn't make any kind of sense. The only thing
>> you want is that the system converges.
>
> Sorry, could you give a bit more detail on why 'serialize' makes no sense?
>>
>> Skipped the rest because it seems built on a fundament I don't agree
>> with. That 4 move thing is just silly for an idle system, and we
>> shouldn't do that.
>>
>> I also very much do not want a single CPU balancing the entire system,
>> that's the anti-thesis of scalable.
>
> Sorry. IMHO, a single cpu can handle balancing for 1000 cpus. And that
> is far more scalable than having every cpu in the system do the
> balancing, since only one cpu needs to pick up the other cpus' load info.
>
> BTW, there is currently no coordination among the cpus' balancing.
> That's a bit of a mess. For example, if 2 cpus in a small cpu group both
> balance the whole system at the same time, both of them think their own
> group is lightly loaded and want more load, so they may over-pull load
> into their own group. That is bad. A single balancer has no such problem.
>
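The over-pull scenario described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: the two-group setup, the load numbers, the "pull half the gap" rule, and the function names. It only shows what happens if two balancers act on the same stale snapshot; it does not establish that the kernel actually behaves this way.

```python
# Toy model only; groups, loads and the pull rule are made up.


def pull_amount(own_load, busiest_load):
    """Load a balancer would pull to even out two groups (half the gap)."""
    return max(busiest_load - own_load, 0) // 2


def uncoordinated(light, busy, balancers):
    """Every balancer in the light group decides from the same snapshot."""
    decided = [pull_amount(light, busy) for _ in range(balancers)]
    for amount in decided:
        amount = min(amount, busy)  # cannot pull more than what is left
        light += amount
        busy -= amount
    return light, busy


def single_balancer(light, busy):
    """One balancer decides and moves load exactly once."""
    amount = pull_amount(light, busy)
    return light + amount, busy - amount


print(uncoordinated(2, 10, balancers=2))  # (10, 2): the imbalance flips
print(single_balancer(2, 10))             # (6, 6): balanced
```

With two uncoordinated balancers the light group ends up as overloaded as the busy group was, whereas the single balancer converges in one pass.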
--
Thanks
Alex