All of lore.kernel.org
 help / color / mirror / Atom feed
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Rik van Riel <riel@surriel.com>,
	linuxppc-dev@lists.ozlabs.org,
	Scott Cheloha <cheloha@linux.ibm.com>,
	Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Mel Gorman <mgorman@techsingularity.net>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>
Subject: Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
Date: Fri, 21 May 2021 08:08:02 +0530	[thread overview]
Message-ID: <20210521023802.GE2633526@linux.vnet.ibm.com> (raw)
In-Reply-To: <YKaw33d71FpHjGnR@hirez.programming.kicks-ass.net>

* Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:

> On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > Currently scheduler populates the distance map by looking at distance
> > of each node from all other nodes. This should work for most
> > architectures and platforms.
> > 
> > However there are some architectures like POWER that may not expose
> > the distance of nodes that are not yet onlined because those resources
> > are not yet allocated to the OS instance. Such architectures have
> > other means to provide valid distance data for the current platform.
> > 
> > For example distance info from numactl from a fully populated 8 node
> > system at boot may look like this.
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> > 
> > However the same system when only two nodes are online at boot, then the
> > numa topology will look like
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > It may be implementation dependent on what node_distance(0,3) where
> > node 0 is online and node 3 is offline. In POWER case, it returns
> > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > distance between nodes is 20. However that would not be true.
> > 
> > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > the max node distance would be 40.
> > 
> > To handle such scenarios, let scheduler allow architectures to populate
> > the distance map. Architectures that like to populate the distance map
> > can overload arch_populate_distance_map().
> 
> Why? Why can't your node_distance() DTRT? The arch interface is
> nr_node_ids and node_distance(), I don't see why we need something new
> and then replace one special use of it.
> 
> By virtue of you being able to actually implement this new hook, you
> supposedly can actually do node_distance() right too.

Since for an offline node, arch interface code doesn't have the info.
As far as I know/understand, in POWER, unless there is an active memory or
CPU that's getting onlined, arch can't fetch the correct node distance.

Taking the above example: node 3 is offline, then node_distance of (3,X)
where X is anything other than 3, is not reliable. The moment node 3 is
onlined, the node distance is reliable.

This problem will not happen even on POWER if all the nodes have either
memory or CPUs active at the time of boot.

-- 
Thanks and Regards
Srikar Dronamraju

WARNING: multiple messages have this Message-ID (diff)
From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Rik van Riel <riel@surriel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	linuxppc-dev@lists.ozlabs.org,
	Nathan Lynch <nathanl@linux.ibm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Scott Cheloha <cheloha@linux.ibm.com>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Subject: Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map
Date: Fri, 21 May 2021 08:08:02 +0530	[thread overview]
Message-ID: <20210521023802.GE2633526@linux.vnet.ibm.com> (raw)
In-Reply-To: <YKaw33d71FpHjGnR@hirez.programming.kicks-ass.net>

* Peter Zijlstra <peterz@infradead.org> [2021-05-20 20:56:31]:

> On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > Currently scheduler populates the distance map by looking at distance
> > of each node from all other nodes. This should work for most
> > architectures and platforms.
> > 
> > However there are some architectures like POWER that may not expose
> > the distance of nodes that are not yet onlined because those resources
> > are not yet allocated to the OS instance. Such architectures have
> > other means to provide valid distance data for the current platform.
> > 
> > For example distance info from numactl from a fully populated 8 node
> > system at boot may look like this.
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> > 
> > However the same system when only two nodes are online at boot, then the
> > numa topology will look like
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > It may be implementation dependent on what node_distance(0,3) where
> > node 0 is online and node 3 is offline. In POWER case, it returns
> > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > distance between nodes is 20. However that would not be true.
> > 
> > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > the max node distance would be 40.
> > 
> > To handle such scenarios, let scheduler allow architectures to populate
> > the distance map. Architectures that like to populate the distance map
> > can overload arch_populate_distance_map().
> 
> Why? Why can't your node_distance() DTRT? The arch interface is
> nr_node_ids and node_distance(), I don't see why we need something new
> and then replace one special use of it.
> 
> By virtue of you being able to actually implement this new hook, you
> supposedly can actually do node_distance() right too.

Since for an offline node, arch interface code doesn't have the info.
As far as I know/understand, in POWER, unless there is an active memory or
CPU that's getting onlined, arch can't fetch the correct node distance.

Taking the above example: node 3 is offline, then node_distance of (3,X)
where X is anything other than 3, is not reliable. The moment node 3 is
onlined, the node distance is reliable.

This problem will not happen even on POWER if all the nodes have either
memory or CPUs active at the time of boot.

-- 
Thanks and Regards
Srikar Dronamraju

  reply	other threads:[~2021-05-21  2:39 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-20 15:44 [PATCH 0/3] Skip numa distance for offline nodes Srikar Dronamraju
2021-05-20 15:44 ` Srikar Dronamraju
2021-05-20 15:44 ` [PATCH 1/3] sched/topology: Allow archs to populate distance map Srikar Dronamraju
2021-05-20 15:44   ` Srikar Dronamraju
2021-05-20 18:56   ` Peter Zijlstra
2021-05-20 18:56     ` Peter Zijlstra
2021-05-21  2:38     ` Srikar Dronamraju [this message]
2021-05-21  2:38       ` Srikar Dronamraju
2021-05-21  8:14       ` Peter Zijlstra
2021-05-21  8:14         ` Peter Zijlstra
2021-05-21  9:28         ` Srikar Dronamraju
2021-05-21  9:28           ` Srikar Dronamraju
2021-05-24 14:16           ` Valentin Schneider
2021-05-24 14:16             ` Valentin Schneider
2021-05-24 16:18             ` Srikar Dronamraju
2021-05-24 16:18               ` Srikar Dronamraju
2021-05-25 10:21               ` Valentin Schneider
2021-05-25 10:21                 ` Valentin Schneider
2021-05-25 11:32                 ` Srikar Dronamraju
2021-05-25 11:32                   ` Srikar Dronamraju
2021-05-28  5:21                 ` Srikar Dronamraju
2021-05-28  5:21                   ` Srikar Dronamraju
2021-05-28  8:43               ` Peter Zijlstra
2021-05-28  8:43                 ` Peter Zijlstra
2021-05-28 10:24                 ` Srikar Dronamraju
2021-05-28 10:24                   ` Srikar Dronamraju
2021-05-20 15:44 ` [PATCH 2/3] powerpc/numa: Populate distance map correctly Srikar Dronamraju
2021-05-20 15:44   ` Srikar Dronamraju
2021-05-24 14:16   ` Valentin Schneider
2021-05-24 14:16     ` Valentin Schneider
2021-05-24 14:50     ` Srikar Dronamraju
2021-05-24 14:50       ` Srikar Dronamraju
2021-05-20 15:44 ` [PATCH 3/3] sched/topology: Skip updating masks for non-online nodes Srikar Dronamraju
2021-05-20 15:44   ` Srikar Dronamraju

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210521023802.GE2633526@linux.vnet.ibm.com \
    --to=srikar@linux.vnet.ibm.com \
    --cc=Geetika.Moolchandani1@ibm.com \
    --cc=cheloha@linux.ibm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=ego@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mgorman@techsingularity.net \
    --cc=mingo@kernel.org \
    --cc=nathanl@linux.ibm.com \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=tglx@linutronix.de \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.