From: Peter Zijlstra <peterz@infradead.org>
To: Huang Ying <ying.huang@intel.com>
Cc: linux-kernel@vger.kernel.org,
Valentin Schneider <valentin.schneider@arm.com>,
Ingo Molnar <mingo@redhat.com>, Mel Gorman <mgorman@suse.de>,
Rik van Riel <riel@surriel.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Subject: Re: [PATCH -V3 1/2] NUMA balancing: fix NUMA topology for systems with CPU-less nodes
Date: Mon, 14 Feb 2022 16:05:44 +0100
Message-ID: <YgpvyE7oV1lZDRQL@hirez.programming.kicks-ass.net>
In-Reply-To: <20220214121553.582248-1-ying.huang@intel.com>
On Mon, Feb 14, 2022 at 08:15:52PM +0800, Huang Ying wrote:
> This isn't a practical problem yet, because the PMEM nodes (node 2
> and node 3 in the example system) are offlined by default during system
> boot. So init_numa_topology_type() called during system boot will
> ignore them and set sched_numa_topology_type to NUMA_DIRECT. And
> init_numa_topology_type() is only called at runtime when a CPU of a
> never-onlined-before node gets plugged in. And there's no CPU in the
> PMEM nodes. But it appears better to fix this to make the code more
> robust.

IIRC there are pre-existing issues with this; namely, the distance_map is
created for all nodes, online or not, so the levels and max_distance
include the PMEM nodes.

At the same time, init_numa_topology_type() uses those values, and the
only reason it 'worked' is that the combination of arguments fails to
hit any of the existing types and the function exits without setting a
type, defaulting to NUMA_DIRECT by 'accident' of that being type 0 and
bss/data being 0 initialized.
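
The 'accident' described above can be illustrated with a small userspace sketch. This is not the kernel code: toy_init_topology_type() and its parameters are hypothetical, and the order of the enumerators after NUMA_DIRECT is an assumption. The only point is that a file-scope variable with no initializer starts at 0, so a classifier that returns without assigning leaves it at whichever enumerator has value 0.

```c
/* Toy mirror of the topology types; only "NUMA_DIRECT == 0" comes from
 * the mail, the order of the other two is assumed. */
enum numa_topology_type { NUMA_DIRECT, NUMA_GLUELESS_MESH, NUMA_BACKPLANE };

/* No initializer: static storage is zero-initialized, i.e. NUMA_DIRECT. */
static enum numa_topology_type topology_type;

/* Hypothetical classifier: assigns a type only for shapes it recognises. */
static void toy_init_topology_type(int nr_levels, int recognised)
{
	if (nr_levels == 1) {
		topology_type = NUMA_DIRECT;
		return;
	}

	if (!recognised)
		return;	/* falls out without assigning anything */

	topology_type = NUMA_GLUELESS_MESH;
}
```

Calling toy_init_topology_type(3, 0) leaves topology_type at NUMA_DIRECT even though nothing ever chose that value.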

Also, Power (and possibly other architectures) already have CPU-less
nodes and suffer from similar issues.

Anyway, aside from this the patches look like they should do.

There are a few niggles, like using READ_ONCE() on sched_max_numa_distance
without using WRITE_ONCE() (see below) and having
sched_domains_numa_distance and sched_domains_numa_masks as separate RCU
variables (that could go sideways if there were a function using both;
afaict there isn't and I couldn't be bothered changing that, but it's
something to keep in mind).

I'll go queue these, thanks!
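
As a rough illustration of the READ_ONCE()/WRITE_ONCE() niggle, the sketch below pairs a marked store with a marked load on a shared int. The two macros are simplified userspace stand-ins (volatile casts, relying on the GCC/Clang __typeof__ extension), not the kernel definitions, and the toy_* helpers are hypothetical.

```c
/* Simplified stand-ins, NOT the kernel macros: force exactly one
 * access through a volatile-qualified lvalue. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

static int sched_max_numa_distance;

/* Writer side: publish the largest distance with a marked store. */
static void toy_update_max_distance(const int *distances, int nr_levels)
{
	WRITE_ONCE(sched_max_numa_distance, distances[nr_levels - 1]);
}

/* Reader side: take one snapshot and use only that local copy. */
static int toy_is_interesting(int dist)
{
	int max_dist = READ_ONCE(sched_max_numa_distance);

	return dist < max_dist;
}
```

Marking both sides documents that the variable is accessed concurrently and keeps the compiler from tearing or repeating either access.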
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1259,11 +1259,10 @@ static bool numa_is_active_node(int nid,
/* Handle placement on systems where not all nodes are directly connected. */
static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
- int maxdist, bool task)
+ int lim_dist, bool task)
{
unsigned long score = 0;
- int node;
- int sys_max_dist;
+ int node, max_dist;
/*
* All nodes are directly connected, and the same distance
@@ -1273,7 +1272,7 @@ static unsigned long score_nearby_nodes(
return 0;
/* sched_max_numa_distance may be changed in parallel. */
- sys_max_dist = READ_ONCE(sched_max_numa_distance);
+ max_dist = READ_ONCE(sched_max_numa_distance);
/*
* This code is called for each node, introducing N^2 complexity,
* which should be ok given the number of nodes rarely exceeds 8.
@@ -1286,7 +1285,7 @@ static unsigned long score_nearby_nodes(
* The furthest away nodes in the system are not interesting
* for placement; nid was already counted.
*/
- if (dist >= sys_max_dist || node == nid)
+ if (dist >= max_dist || node == nid)
continue;
/*
@@ -1296,8 +1295,7 @@ static unsigned long score_nearby_nodes(
* "hoplimit", only nodes closer by than "hoplimit" are part
* of each group. Skip other nodes.
*/
- if (sched_numa_topology_type == NUMA_BACKPLANE &&
- dist >= maxdist)
+ if (sched_numa_topology_type == NUMA_BACKPLANE && dist >= lim_dist)
continue;
/* Add up the faults from nearby nodes. */
@@ -1315,8 +1313,8 @@ static unsigned long score_nearby_nodes(
* This seems to result in good task placement.
*/
if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
- faults *= (sys_max_dist - dist);
- faults /= (sys_max_dist - LOCAL_DISTANCE);
+ faults *= (max_dist - dist);
+ faults /= (max_dist - LOCAL_DISTANCE);
}
score += faults;
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1927,7 +1927,7 @@ void sched_init_numa(int offline_node)
sched_domain_topology = tl;
sched_domains_numa_levels = nr_levels;
- sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
+ WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
init_numa_topology_type(offline_node);
}
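
The NUMA_GLUELESS_MESH scaling in the fair.c hunk can be worked through in isolation. A minimal sketch, assuming LOCAL_DISTANCE is 10 as in the kernel headers; toy_scale_faults() is a hypothetical helper that only reproduces the two arithmetic lines from the patch.

```c
#define LOCAL_DISTANCE 10	/* assumed to match the kernel's value */

/*
 * Weight a node's faults by how close it is: full weight at
 * LOCAL_DISTANCE, dropping linearly towards 0 at max_dist.
 * The caller must ensure LOCAL_DISTANCE <= dist < max_dist; in the
 * patched function, nodes with dist >= max_dist are skipped earlier.
 */
static unsigned long toy_scale_faults(unsigned long faults, int dist,
				      int max_dist)
{
	faults *= (unsigned long)(max_dist - dist);
	faults /= (unsigned long)(max_dist - LOCAL_DISTANCE);

	return faults;
}
```

With max_dist = 40, 300 faults on a node at distance 20 count as 300 * 20 / 30 = 200, at distance 30 as 100, and a node at LOCAL_DISTANCE keeps its full weight of 300.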