* [PATCH 1/2] powerpc: Fix the setup of CPU-to-Node mappings during CPU online
@ 2013-12-30 11:35 Srivatsa S. Bhat
2013-12-30 11:36 ` [PATCH 2/2] powerpc: Add debug checks to catch invalid cpu-to-node mappings Srivatsa S. Bhat
2014-01-06 16:04 ` [PATCH 1/2] powerpc: Fix the setup of CPU-to-Node mappings during CPU online Srivatsa S. Bhat
0 siblings, 2 replies; 3+ messages in thread
From: Srivatsa S. Bhat @ 2013-12-30 11:35 UTC (permalink / raw)
To: benh, paulus, nfont; +Cc: Srivatsa S. Bhat, maddy, linuxppc-dev, linux-kernel
On POWER platforms, the hypervisor can notify the guest kernel about dynamic
changes in the cpu-numa associativity (VPHN topology update). Hence the
cpu-to-node mappings that we got from the firmware during boot may no longer
be valid after such updates. This is handled using the arch_update_cpu_topology()
hook in the scheduler, and the sched-domains are rebuilt according to the new
mappings.
But unfortunately, at the moment, CPU hotplug ignores these updated mappings
and instead queries the firmware for the cpu-to-numa relationships and uses
them during CPU online. So the kernel can end up assigning wrong NUMA nodes
to CPUs during subsequent CPU hotplug online operations (after booting).
Further, a particularly problematic scenario can result from this bug:
On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
threads per core. The switch to Single-Threaded (ST) mode is performed by
offlining all except the first CPU thread in each core. Switching back to
SMT mode involves onlining those other threads back, in each core.
Now consider this scenario:
1. During boot, the kernel gets the cpu-to-node mappings from the firmware
and assigns the CPUs to NUMA nodes appropriately, during CPU online.
2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
communicates this update to the kernel. The kernel in turn updates its
cpu-to-node associations and rebuilds its sched domains. Everything is
fine so far.
3. Now, the user switches the machine from SMT to ST mode (say, by running
ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
core.
4. The user then tries to switch back from ST to SMT mode (say, by running
ppc64_cpu --smt=4), and this involves onlining those threads back. Since
CPU hotplug ignores the new mappings, it queries the firmware and tries to
associate the newly onlined sibling threads to the old NUMA nodes. This
results in sibling threads within the same core getting associated with
different NUMA nodes, which is incorrect.
The scheduler's build-sched-domains code gets thoroughly confused by this,
enters an infinite loop, and causes soft-lockups, as explained in detail
in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings).
So to fix this, use the numa_cpu_lookup_table to remember the updated
cpu-to-node mappings, and use them during CPU hotplug online operations.
Further, we also need to ensure that all threads in a core are assigned to a
common NUMA node, irrespective of whether all those threads were online during
the topology update. To achieve this, we take care not to use cpu_sibling_mask()
since it is not hotplug invariant. Instead, we use cpu_first_thread_sibling()
and set up the mappings manually using the 'threads_per_core' value for that
particular platform. This helps us ensure that we don't hit this bug with any
combination of CPU hotplug and SMT mode switching.
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/topology.h | 10 +++++
arch/powerpc/mm/numa.c | 70 ++++++++++++++++++++++++++++++++++-
2 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 89e3ef2..d0b5fca 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -22,7 +22,15 @@ struct device_node;
static inline int cpu_to_node(int cpu)
{
- return numa_cpu_lookup_table[cpu];
+ int nid;
+
+ nid = numa_cpu_lookup_table[cpu];
+
+ /*
+ * During early boot, the numa-cpu lookup table might not have been
+ * setup for all CPUs yet. In such cases, default to node 0.
+ */
+ return (nid < 0) ? 0 : nid;
}
#define parent_node(node) (node)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 078d3e0..6847d50 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -31,6 +31,8 @@
#include <asm/sparsemem.h>
#include <asm/prom.h>
#include <asm/smp.h>
+#include <asm/cputhreads.h>
+#include <asm/topology.h>
#include <asm/firmware.h>
#include <asm/paca.h>
#include <asm/hvcall.h>
@@ -152,9 +154,22 @@ static void __init get_node_active_region(unsigned long pfn,
}
}
-static void map_cpu_to_node(int cpu, int node)
+static void reset_numa_cpu_lookup_table(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu)
+ numa_cpu_lookup_table[cpu] = -1;
+}
+
+static void update_numa_cpu_lookup_table(unsigned int cpu, int node)
{
numa_cpu_lookup_table[cpu] = node;
+}
+
+static void map_cpu_to_node(int cpu, int node)
+{
+ update_numa_cpu_lookup_table(cpu, node);
dbg("adding cpu %d to node %d\n", cpu, node);
@@ -522,11 +537,24 @@ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
*/
static int numa_setup_cpu(unsigned long lcpu)
{
- int nid = 0;
- struct device_node *cpu = of_get_cpu_node(lcpu, NULL);
+ int nid;
+ struct device_node *cpu;
+
+ /*
+ * If a valid cpu-to-node mapping is already available, use it
+ * directly instead of querying the firmware, since it represents
+ * the most recent mapping notified to us by the platform (eg: VPHN).
+ */
+ if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+ map_cpu_to_node(lcpu, nid);
+ return nid;
+ }
+
+ cpu = of_get_cpu_node(lcpu, NULL);
if (!cpu) {
WARN_ON(1);
+ nid = 0;
goto out;
}
@@ -1067,6 +1095,7 @@ void __init do_init_bootmem(void)
*/
setup_node_to_cpumask_map();
+ reset_numa_cpu_lookup_table();
register_cpu_notifier(&ppc64_numa_nb);
cpu_numa_callback(&ppc64_numa_nb, CPU_UP_PREPARE,
(void *)(unsigned long)boot_cpuid);
@@ -1445,6 +1474,33 @@ static int update_cpu_topology(void *data)
return 0;
}
+static int update_lookup_table(void *data)
+{
+ struct topology_update_data *update;
+
+ if (!data)
+ return -EINVAL;
+
+ /*
+ * Upon topology update, the numa-cpu lookup table needs to be updated
+ * for all threads in the core, including offline CPUs, to ensure that
+ * future hotplug operations respect the cpu-to-node associativity
+ * properly.
+ */
+ for (update = data; update; update = update->next) {
+ int nid, base, j;
+
+ nid = update->new_nid;
+ base = cpu_first_thread_sibling(update->cpu);
+
+ for (j = 0; j < threads_per_core; j++) {
+ update_numa_cpu_lookup_table(base + j, nid);
+ }
+ }
+
+ return 0;
+}
+
/*
* Update the node maps and sysfs entries for each cpu whose home node
* has changed. Returns 1 when the topology has changed, and 0 otherwise.
@@ -1513,6 +1569,14 @@ int arch_update_cpu_topology(void)
stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
+ /*
+ * Update the numa-cpu lookup table with the new mappings, even for
+ * offline CPUs. It is best to perform this update from the stop-
+ * machine context.
+ */
+ stop_machine(update_lookup_table, &updates[0],
+ cpumask_of(raw_smp_processor_id()));
+
for (ud = &updates[0]; ud; ud = ud->next) {
unregister_cpu_under_node(ud->cpu, ud->old_nid);
register_cpu_under_node(ud->cpu, ud->new_nid);
* [PATCH 2/2] powerpc: Add debug checks to catch invalid cpu-to-node mappings
2013-12-30 11:35 [PATCH 1/2] powerpc: Fix the setup of CPU-to-Node mappings during CPU online Srivatsa S. Bhat
@ 2013-12-30 11:36 ` Srivatsa S. Bhat
2014-01-06 16:04 ` [PATCH 1/2] powerpc: Fix the setup of CPU-to-Node mappings during CPU online Srivatsa S. Bhat
1 sibling, 0 replies; 3+ messages in thread
From: Srivatsa S. Bhat @ 2013-12-30 11:36 UTC (permalink / raw)
To: benh, paulus, nfont; +Cc: Srivatsa S. Bhat, maddy, linuxppc-dev, linux-kernel
There have been some weird bugs in the past where the kernel tried to associate
threads of the same core to different NUMA nodes, and things went haywire after
that point (as expected).
But unfortunately, root-causing such issues has been quite challenging, due to
the lack of appropriate debug checks in the kernel. These bugs usually lead to
some odd soft-lockups in the scheduler's build-sched-domain code in the CPU
hotplug path, which makes it very hard to trace it back to the incorrect
cpu-to-node mappings.
So add appropriate debug checks to catch such invalid cpu-to-node mappings
as early as possible.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
arch/powerpc/mm/numa.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6847d50..4f50c6a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -570,16 +570,38 @@ out:
return nid;
}
+static void verify_cpu_node_mapping(int cpu, int node)
+{
+ int base, sibling, i;
+
+ /* Verify that all the threads in the core belong to the same node */
+ base = cpu_first_thread_sibling(cpu);
+
+ for (i = 0; i < threads_per_core; i++) {
+ sibling = base + i;
+
+ if (sibling == cpu || cpu_is_offline(sibling))
+ continue;
+
+ if (cpu_to_node(sibling) != node) {
+ WARN(1, "CPU thread siblings %d and %d don't belong"
+ " to the same node!\n", cpu, sibling);
+ break;
+ }
+ }
+}
+
static int cpu_numa_callback(struct notifier_block *nfb, unsigned long action,
void *hcpu)
{
unsigned long lcpu = (unsigned long)hcpu;
- int ret = NOTIFY_DONE;
+ int ret = NOTIFY_DONE, nid;
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- numa_setup_cpu(lcpu);
+ nid = numa_setup_cpu(lcpu);
+ verify_cpu_node_mapping((int)lcpu, nid);
ret = NOTIFY_OK;
break;
#ifdef CONFIG_HOTPLUG_CPU
* Re: [PATCH 1/2] powerpc: Fix the setup of CPU-to-Node mappings during CPU online
2013-12-30 11:35 [PATCH 1/2] powerpc: Fix the setup of CPU-to-Node mappings during CPU online Srivatsa S. Bhat
2013-12-30 11:36 ` [PATCH 2/2] powerpc: Add debug checks to catch invalid cpu-to-node mappings Srivatsa S. Bhat
@ 2014-01-06 16:04 ` Srivatsa S. Bhat
1 sibling, 0 replies; 3+ messages in thread
From: Srivatsa S. Bhat @ 2014-01-06 16:04 UTC (permalink / raw)
To: benh, paulus, nfont; +Cc: maddy, linuxppc-dev, linux-kernel
On 12/30/2013 05:05 PM, Srivatsa S. Bhat wrote:
> [...]
Any thoughts about these patches?
Regards,
Srivatsa S. Bhat