public inbox for stable@vger.kernel.org
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Valentin Schneider <valentin.schneider@arm.com>,
	Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH AUTOSEL 5.14 28/47] sched/topology: Skip updating masks for non-online nodes
Date: Sun,  5 Sep 2021 21:19:32 -0400
Message-ID: <20210906011951.928679-28-sashal@kernel.org>
In-Reply-To: <20210906011951.928679-1-sashal@kernel.org>

From: Valentin Schneider <valentin.schneider@arm.com>

[ Upstream commit 0083242c93759dde353a963a90cb351c5c283379 ]

The scheduler currently expects NUMA node distances to be stable from
init onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).

Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.

Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks
when it is first onlined post-init.

Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..4e8698e62f07 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1482,6 +1482,8 @@ int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
 int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
+
+static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
 
 /*
@@ -1833,6 +1835,16 @@ void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for_each_node(k) {
+				/*
+				 * Distance information can be unreliable for
+				 * offline nodes, defer building the node
+				 * masks to its bringup.
+				 * This relies on all unique distance values
+				 * still being visible at init time.
+				 */
+				if (!node_online(j))
+					continue;
+
 				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
 					sched_numa_warn("Node-distance not symmetric");
 
@@ -1886,6 +1898,53 @@ void sched_init_numa(void)
 	sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
 	init_numa_topology_type();
+
+	sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
+	if (!sched_numa_onlined_nodes)
+		return;
+
+	bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
+	for_each_online_node(i)
+		bitmap_set(sched_numa_onlined_nodes, i, 1);
+}
+
+static void __sched_domains_numa_masks_set(unsigned int node)
+{
+	int i, j;
+
+	/*
+	 * NUMA masks are not built for offline nodes in sched_init_numa().
+	 * Thus, when a CPU of a never-onlined-before node gets plugged in,
+	 * adding that new CPU to the right NUMA masks is not sufficient: the
+	 * masks of that CPU's node must also be updated.
+	 */
+	if (test_bit(node, sched_numa_onlined_nodes))
+		return;
+
+	bitmap_set(sched_numa_onlined_nodes, node, 1);
+
+	for (i = 0; i < sched_domains_numa_levels; i++) {
+		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_online(j) || node == j)
+				continue;
+
+			if (node_distance(j, node) > sched_domains_numa_distance[i])
+				continue;
+
+			/* Add remote nodes in our masks */
+			cpumask_or(sched_domains_numa_masks[i][node],
+				   sched_domains_numa_masks[i][node],
+				   sched_domains_numa_masks[0][j]);
+		}
+	}
+
+	/*
+	 * A new node has been brought up, potentially changing the topology
+	 * classification.
+	 *
+	 * Note that this is racy vs any use of sched_numa_topology_type :/
+	 */
+	init_numa_topology_type();
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
@@ -1893,8 +1952,14 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 	int node = cpu_to_node(cpu);
 	int i, j;
 
+	__sched_domains_numa_masks_set(node);
+
 	for (i = 0; i < sched_domains_numa_levels; i++) {
 		for (j = 0; j < nr_node_ids; j++) {
+			if (!node_online(j))
+				continue;
+
+			/* Set ourselves in the remote node's masks */
 			if (node_distance(j, node) <= sched_domains_numa_distance[i])
 				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
 		}
-- 
2.30.2
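
For readers who want to follow the mask bookkeeping outside the kernel, here
is a minimal userspace sketch of the scheme the patch describes: masks are
built at init only for online nodes, a bitmap tracks nodes that have been
onlined at least once, and a node's masks are built the first time one of its
CPUs comes up. Everything in it is a hypothetical stand-in, not kernel code:
the three-node distance table, the one-CPU-per-node layout (CPU id equals node
id), the 32-bit integers used in place of cpumasks, and the names init_masks(),
first_online() and cpu_up(). Locking and the topology-type refresh are left
out.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_NODES  3
#define NR_LEVELS 2

/* Hypothetical topology: one CPU per node, CPU id == node id. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 20 },
	{ 20, 10, 20 },
	{ 20, 20, 10 },
};
/* Unique distance values, sorted; one NUMA level per value. */
static const int level_distance[NR_LEVELS] = { 10, 20 };

static bool     online[NR_NODES];            /* stand-in for node_online() */
static uint32_t onlined_once;                /* stand-in for sched_numa_onlined_nodes */
static uint32_t masks[NR_LEVELS][NR_NODES];  /* stand-in for sched_domains_numa_masks */

/* Init-time build: masks of offline nodes are left empty. */
static void init_masks(void)
{
	onlined_once = 0;
	for (int i = 0; i < NR_LEVELS; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			masks[i][j] = 0;
			if (!online[j])
				continue;
			/* An offline node k contributes no CPUs in this model. */
			for (int k = 0; k < NR_NODES; k++)
				if (online[k] && distance[j][k] <= level_distance[i])
					masks[i][j] |= 1u << k;
		}
	}
	for (int j = 0; j < NR_NODES; j++)
		if (online[j])
			onlined_once |= 1u << j;
}

/* Build the masks of a node the first time it is onlined after init. */
static void first_online(int node)
{
	if (onlined_once & (1u << node))
		return;
	onlined_once |= 1u << node;

	for (int i = 0; i < NR_LEVELS; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			if (!online[j] || j == node)
				continue;
			if (distance[j][node] > level_distance[i])
				continue;
			/* Pull the remote node's local CPUs into our masks. */
			masks[i][node] |= masks[0][j];
		}
	}
}

/* CPU hotplug path: build the node's masks if needed, then add the CPU. */
static void cpu_up(int cpu)
{
	int node = cpu; /* one CPU per node in this model */

	online[node] = true; /* assumption: node is already online at this point */
	first_online(node);

	for (int i = 0; i < NR_LEVELS; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			if (!online[j])
				continue;
			/* Set ourselves in every close-enough node's masks. */
			if (distance[j][node] <= level_distance[i])
				masks[i][j] |= 1u << cpu;
		}
	}
}
```

With nodes 0 and 1 online at init and node 2 brought up later through
cpu_up(2), the level-1 mask of node 2 ends up covering all three CPUs, which
adding the new CPU to the existing masks alone would not achieve.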

