Re: [RFC] sched/topology: NUMA topology limitations

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Valentin Schneider <valentin.schneider@arm.com>
To: "Song Bao Hua \(Barry Song\)" <song.bao.hua@hisilicon.com>
Cc: "linux-kernel\@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	"vincent.guittot\@linaro.org" <vincent.guittot@linaro.org>,
	"dietmar.eggemann\@arm.com" <dietmar.eggemann@arm.com>,
	"morten.rasmussen\@arm.com" <morten.rasmussen@arm.com>,
	Linuxarm <linuxarm@huawei.com>
Subject: Re: [RFC] sched/topology: NUMA topology limitations
Date: Sat, 29 Aug 2020 13:33:22 +0100	[thread overview]
Message-ID: <jhjpn79ocml.mognet@arm.com> (raw)
In-Reply-To: <422fb2cfe1d24fca8efa74ba23d8754b@hisilicon.com>


Hi Barry,

Thanks for having a look!

On 29/08/20 06:32, Barry Song wrote:
>> If you boot the above with CONFIG_SCHED_DEBUG, you'll get:
>>
>> [    0.245896] CPU0 attaching sched-domain(s):
>> [    0.246133]  domain-0: span=0-1 level=NUMA
>> [    0.246592]   groups: 0:{ span=0 cap=1011 }, 1:{ span=1 cap=1008 }
>> [    0.246998]   domain-1: span=0-2 level=NUMA
>> [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3
>> mask=2 cap=3025 }
>> [    0.247454] ERROR: groups don't span domain->span
>
> Hi Valtentin,
> Thanks for your email. It seems it is quite clear. May I ask what is the real harm
> when group 2 is actually out of the span of diameter 2 here? What will happen when group2
> doesn't span cpu0_domain1->span?
> In domain-1, will scheduler fail to do load balance between group0 and group2?
>
>> [    0.247654]    domain-2: span=0-3 level=NUMA
>> [    0.247892]     groups: 0:{ span=0-2 mask=0 cap=3021 }, 3:{ span=1-3
>> mask=3 cap=3047 }
>
> Here domain-2 includes all span from 0 to 3, so that means we should still be able to do
> load balance in domain-2 between cpu0 and cpu3 even we fail in domain-1?
>

[...]

>> Implications of fixing this
>> ---------------------------
>>
>> Clearly the current sched_group setup for such topologies is not what we
>> want: this disturbs load balancing on the 'corrupted' domains.
>>
>> If we *do* want to support systems like this, then we have other problems to
>> solve. Currently, on the aforementioned QEMU setup, we have:
>>
>>   CPU0-domain1
>>     group0: span=0-2, mask=0
>>     group2: span=1-3, mask=2
>
> Your kernel log is:
> [    0.246998]   domain-1: span=0-2 level=NUMA
> [    0.247145]    groups: 0:{ span=0-1 mask=0 cap=2019 }, 2:{ span=1-3
> mask=2 cap=3025 }
>
> it seems group0 should be:
> group0: span=0-1, mask=0
>
> but not
> group0: span=0-2, mask=0 ?
>
> is it a typo here?
>

It is indeed, well spotted.

[...]

>
> What is the real harm you have seen in load balance? Have you even tried
> to make cpu0,cpu1,cpu2 busy and wake-up more tasks in cpu0? Will cpu3
> join to share the loading in your qemu topology?
>

(pasting the comment from your other email here)

> After second thought, would the harm be that scheduler should do load balance
> among cpu0, cpu1 and cpu2 in domain1, but it is actually doing load balance
> among all of cpu0, cpu1, cpu2 and cpu3 since cpu3 is incorrectly put in group2?
> So it is possible that scheduler will make wrong decision to put tasks in cpu3 while
> it actually should only begin to do that in domain2?

Ignoring corner cases where task affinity gets in the way, load balance
will always pull tasks to the local CPU (i.e. the CPU who's sched_domain we
are working on).

If we're balancing load for CPU0-domain1, we would be looking at which CPUs
in [0-2] (i.e. the domain's span) we could (if we should) pull tasks from
to migrate them over to CPU0.

We'll first try to figure out which sched_group has the more load (see
find_busiest_group() & friends), and that's where we may hit issues.

Consider a scenario where CPU3 is noticeably busier than the other
CPUs. We'll end up marking CPU0-domain1-group2 (1-3) as the busiest group,
and compute an imbalance (i.e. amount of load to pull) mostly based on the
status of CPU3.

We'll then go to find_busiest_queue(); the mask of CPUs we iterate over is
restricted by the sched_domain_span (i.e. doesn't include CPU3 here), so
we'll pull things from either CPU1 or CPU2 based on stats we built looking
at CPU3, which is bound to be pretty bogus.

To summarise: we won't pull from the "outsider" node(s) (i.e., nodes
included in the sched_groups but not covered by the sched_domain), but they
will influence the stats and heuristics of the load balance.

HTH

next prev parent reply	other threads:[~2020-08-29 12:33 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <6a5f06ff4ecb4f34bd7e9890dc07fb99@hisilicon.com>
2020-08-29  5:32 ` [RFC] sched/topology: NUMA topology limitations Song Bao Hua (Barry Song)
2020-08-29 12:33   ` Valentin Schneider [this message]
2020-08-31 10:45     ` Song Bao Hua (Barry Song)
2020-09-01  9:40       ` Valentin Schneider
2020-09-04  2:02         ` Song Bao Hua (Barry Song)
2020-08-29  5:42 ` Song Bao Hua (Barry Song)
2020-08-14 10:15 Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=jhjpn79ocml.mognet@arm.com \
    --to=valentin.schneider@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=mingo@kernel.org \
    --cc=morten.rasmussen@arm.com \
    --cc=peterz@infradead.org \
    --cc=song.bao.hua@hisilicon.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox