LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
powerpc/smp: Dont assume l2-cache to be superset of sibling
	Set cpumask after verifying l2-cache. (Gautham)

 arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 72f16dc0cb26..57468877499a 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 	if (!l2_cache)
 		return false;
 
+	cpumask_set_cpu(cpu, mask_fn(cpu));
 	for_each_cpu(i, cpu_online_mask) {
 		/*
 		 * when updating the marks the current CPU has not been marked
@@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
 	 * add it to it's own thread sibling mask.
 	 */
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++)
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_sibling_mask);
 
 	add_cpu_to_smallcore_masks(cpu);
-	/*
-	 * Copy the thread sibling mask into the cache sibling mask
-	 * and mark any CPUs that share an L2 with this CPU.
-	 */
-	for_each_cpu(i, cpu_sibling_mask(cpu))
-		set_cpus_related(cpu, i, cpu_l2_cache_mask);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
-	/*
-	 * Copy the cache sibling mask into core sibling mask and mark
-	 * any CPUs on the same chip as this CPU.
-	 */
-	for_each_cpu(i, cpu_l2_cache_mask(cpu))
-		set_cpus_related(cpu, i, cpu_core_mask);
+	if (pkg_id == -1) {
+		struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+		/*
+		 * Copy the sibling mask into core sibling mask and
+		 * mark any CPUs on the same chip as this CPU.
+		 */
+		if (shared_caches)
+			mask = cpu_l2_cache_mask;
+
+		for_each_cpu(i, mask(cpu))
+			set_cpus_related(cpu, i, cpu_core_mask);
 
-	if (pkg_id == -1)
 		return;
+	}
 
 	for_each_cpu(i, cpu_online_mask)
 		if (get_physical_package_id(i) == pkg_id)
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Currently "CACHE" domain happens to be the 2nd sched domain as per
powerpc_topology. This domain will collapse if cpumask of l2-cache is
same as SMT domain. However we could generalize this domain such that it
could mean either be a "CACHE" domain or a "BIGCORE" domain.

While setting up the "CACHE" domain, check if shared_cache is already
set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
powerpc/smp: Generalize 2nd sched domain
	Moved shared_cache topology fixup to fixup_topology (Gautham)

 arch/powerpc/kernel/smp.c | 49 ++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 57468877499a..933ebdf97432 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 EXPORT_SYMBOL_GPL(has_big_cores);
 
+enum {
+#ifdef CONFIG_SCHED_SMT
+	smt_idx,
+#endif
+	bigcore_idx,
+	die_idx,
+};
+
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
 struct thread_groups {
@@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	if (shared_caches)
-		return cpu_l2_cache_mask(cpu);
-
-	if (has_big_cores)
-		return cpu_smallcore_mask(cpu);
-
-	return per_cpu(cpu_sibling_map, cpu);
+	return per_cpu(cpu_l2_cache_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
+static const struct cpumask *cpu_bigcore_mask(int cpu)
+{
+	return per_cpu(cpu_sibling_map, cpu);
+}
+
 static struct sched_domain_topology_level powerpc_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -1313,7 +1320,6 @@ static void add_cpu_to_masks(int cpu)
 void start_secondary(void *unused)
 {
 	unsigned int cpu = smp_processor_id();
-	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
 
 	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
@@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
 	/* Update topology CPU masks */
 	add_cpu_to_masks(cpu);
 
-	if (has_big_cores)
-		sibling_mask = cpu_smallcore_mask;
 	/*
 	 * Check for any shared caches. Note that this must be done on a
 	 * per-core basis because one core in the pair might be disabled.
 	 */
-	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
-		shared_caches = true;
+	if (!shared_caches) {
+		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+		struct cpumask *mask = cpu_l2_cache_mask(cpu);
+
+		if (has_big_cores)
+			sibling_mask = cpu_smallcore_mask;
+
+		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
+			shared_caches = true;
+	}
 
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
@@ -1374,10 +1386,19 @@ int setup_profiling_timer(unsigned int multiplier)
 
 static void fixup_topology(void)
 {
+	if (shared_caches) {
+		pr_info("Using shared cache scheduler topology\n");
+		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
+#ifdef CONFIG_SCHED_DEBUG
+		powerpc_topology[bigcore_idx].name = "CACHE";
+#endif
+		powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
+	}
+
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[0].mask = smallcore_smt_mask;
+		powerpc_topology[smt_idx].mask = smallcore_smt_mask;
 	}
 #endif
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 07/10] Powerpc/numa: Detect support for coregroup
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
Powerpc/numa: Detect support for coregroup
	Explained Coregroup in commit msg (Michael Ellerman)

 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/smp.c      |  1 +
 arch/powerpc/mm/numa.c         | 34 +++++++++++++++++++++-------------
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 49a25e2400f2..5bdc17a7049f 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -28,6 +28,7 @@
 extern int boot_cpuid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
+extern bool coregroup_enabled;
 
 extern void cpu_die(void);
 extern int cpu_to_chip_id(int cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 933ebdf97432..320e36a0ec0b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 
 struct task_struct *secondary_current;
 bool has_big_cores;
+bool coregroup_enabled;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index bc5b2e8112c8..3248160c0327 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -886,7 +886,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 static void __init find_possible_nodes(void)
 {
 	struct device_node *rtas;
-	u32 numnodes, i;
+	const __be32 *domains;
+	int prop_length, max_nodes;
+	u32 i;
 
 	if (!numa_enabled)
 		return;
@@ -895,25 +897,31 @@ static void __init find_possible_nodes(void)
 	if (!rtas)
 		return;
 
-	if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
-				min_common_depth, &numnodes)) {
-		/*
-		 * ibm,current-associativity-domains is a fairly recent
-		 * property. If it doesn't exist, then fallback on
-		 * ibm,max-associativity-domains. Current denotes what the
-		 * platform can support compared to max which denotes what the
-		 * Hypervisor can support.
-		 */
-		if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
-				min_common_depth, &numnodes))
+	/*
+	 * ibm,current-associativity-domains is a fairly recent property. If
+	 * it doesn't exist, then fallback on ibm,max-associativity-domains.
+	 * Current denotes what the platform can support compared to max
+	 * which denotes what the Hypervisor can support.
+	 */
+	domains = of_get_property(rtas, "ibm,current-associativity-domains",
+					&prop_length);
+	if (!domains) {
+		domains = of_get_property(rtas, "ibm,max-associativity-domains",
+					&prop_length);
+		if (!domains)
 			goto out;
 	}
 
-	for (i = 0; i < numnodes; i++) {
+	max_nodes = of_read_number(&domains[min_common_depth], 1);
+	for (i = 0; i < max_nodes; i++) {
 		if (!node_possible(i))
 			node_set(i, node_possible_map);
 	}
 
+	prop_length /= sizeof(int);
+	if (prop_length > min_common_depth + 2)
+		coregroup_enabled = 1;
+
 out:
 	of_node_put(rtas);
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 08/10] powerpc/smp: Allocate cpumask only after searching thread group
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

If allocated earlier and the search fails, then cpumask need to be
freed. However cpu_l1_cache_map can be allocated after we search thread
group.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 320e36a0ec0b..97b762a1944a 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
 	if (err)
 		goto out;
 
-	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
-				GFP_KERNEL,
-				cpu_to_node(cpu));
-
 	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
 
 	if (unlikely(cpu_group_start == -1)) {
@@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
 		goto out;
 	}
 
+	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+				GFP_KERNEL, cpu_to_node(cpu));
+
 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
 		int i_group_start = get_cpu_thread_group_start(i, &tg);
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 09/10] Powerpc/smp: Create coregroup domain
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
Powerpc/smp: Create coregroup domain
	Moved coregroup topology fixup to fixup_topology (Gautham)

 arch/powerpc/include/asm/topology.h | 10 ++++++++
 arch/powerpc/kernel/smp.c           | 38 +++++++++++++++++++++++++++++
 arch/powerpc/mm/numa.c              |  5 ++++
 3 files changed, 53 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..6609174918ab 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
 extern int find_and_online_cpu_nid(int cpu);
+extern int cpu_to_coregroup_id(int cpu);
 #else
 static inline int find_and_online_cpu_nid(int cpu)
 {
 	return 0;
 }
 
+static inline int cpu_to_coregroup_id(int cpu)
+{
+#ifdef CONFIG_SMP
+	return cpu_to_core_id(cpu);
+#else
+	return 0;
+#endif
+}
+
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include <asm-generic/topology.h>
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 97b762a1944a..a7e1366b7fd3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
+DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
 
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
@@ -91,6 +92,7 @@ enum {
 	smt_idx,
 #endif
 	bigcore_idx,
+	mc_idx,
 	die_idx,
 };
 
@@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
+static struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return per_cpu(cpu_coregroup_map, cpu);
+}
+
+static bool has_coregroup_support(void)
+{
+	return coregroup_enabled;
+}
+
+static const struct cpumask *cpu_mc_mask(int cpu)
+{
+	return cpu_coregroup_mask(cpu);
+}
+
 static const struct cpumask *cpu_bigcore_mask(int cpu)
 {
 	return per_cpu(cpu_sibling_map, cpu);
@@ -879,6 +896,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
+	{ cpu_mc_mask, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -927,6 +945,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, node);
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, node);
+		if (has_coregroup_support())
+			zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
+						GFP_KERNEL, node);
+
 #ifdef CONFIG_NEED_MULTIPLE_NODES
 		/*
 		 * numa_node_id() works after this.
@@ -944,6 +966,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
 	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
 
+	if (has_coregroup_support())
+		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
+
 	init_big_cores();
 	if (has_big_cores) {
 		cpumask_set_cpu(boot_cpuid,
@@ -1235,6 +1260,8 @@ static void remove_cpu_from_masks(int cpu)
 		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
 		if (has_big_cores)
 			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
+		if (has_coregroup_support())
+			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
 	}
 }
 #endif
@@ -1295,6 +1322,14 @@ static void add_cpu_to_masks(int cpu)
 	add_cpu_to_smallcore_masks(cpu);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
+	if (has_coregroup_support()) {
+		cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
+		for_each_cpu(i, cpu_online_mask) {
+			if (cpu_to_coregroup_id(cpu) == cpu_to_coregroup_id(i))
+				set_cpus_related(cpu, i, cpu_coregroup_mask);
+		}
+	}
+
 	if (pkg_id == -1) {
 		struct cpumask *(*mask)(int) = cpu_sibling_mask;
 
@@ -1386,6 +1421,9 @@ int setup_profiling_timer(unsigned int multiplier)
 
 static void fixup_topology(void)
 {
+	if (!has_coregroup_support())
+		powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
+
 	if (shared_caches) {
 		pr_info("Using shared cache scheduler topology\n");
 		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3248160c0327..ef8aa580da21 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1216,6 +1216,11 @@ int find_and_online_cpu_nid(int cpu)
 	return new_nid;
 }
 
+int cpu_to_coregroup_id(int cpu)
+{
+	return cpu_to_core_id(cpu);
+}
+
 static int topology_update_init(void)
 {
 	topology_inited = 1;
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 10/10] powerpc/smp: Implement cpu_to_coregroup_id
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Lookup the coregroup id from the associativity array.

If unable to detect the coregroup id, fallback on the core id.
This way, ensure sched_domain degenerates and an extra sched domain is
not created.

Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
don't need to find the primary domain again.

If the device-tree mentions more than one coregroup, then kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
powerpc/smp: Implement cpu_to_coregroup_id
	Move coregroup_enabled before getting associativity (Gautham)

 arch/powerpc/mm/numa.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index ef8aa580da21..ae57b68beaee 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
 
 int cpu_to_coregroup_id(int cpu)
 {
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+	int index;
+
+	if (cpu < 0 || cpu > nr_cpu_ids)
+		return -1;
+
+	if (!coregroup_enabled)
+		goto out;
+
+	if (!firmware_has_feature(FW_FEATURE_VPHN))
+		goto out;
+
+	if (vphn_get_associativity(cpu, associativity))
+		goto out;
+
+	index = of_read_number(associativity, 1);
+	if (index > min_common_depth + 1)
+		return of_read_number(&associativity[index - 1], 1);
+
+out:
 	return cpu_to_core_id(cpu);
 }
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 03/10] powerpc/smp: Move powerpc_topology above
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Just moving the powerpc_topology description above.
This will help in using functions in this file and avoid declarations.

No other functional changes

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 116 +++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 0e0b118d9b6e..1ce95da00cb6 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -818,6 +818,64 @@ static int init_cpu_l1_cache_map(int cpu)
 	return err;
 }
 
+static bool shared_caches;
+
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymmetric SMT dependency */
+static int powerpc_smt_flags(void)
+{
+	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
+
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		flags |= SD_ASYM_PACKING;
+	}
+	return flags;
+}
+#endif
+
+/*
+ * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
+ * This topology makes it *much* cheaper to migrate tasks between adjacent cores
+ * since the migrated task remains cache hot. We want to take advantage of this
+ * at the scheduler level so an extra topology level is required.
+ */
+static int powerpc_shared_cache_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+
+/*
+ * We can't just pass cpu_l2_cache_mask() directly because
+ * returns a non-const pointer and the compiler barfs on that.
+ */
+static const struct cpumask *shared_cache_mask(int cpu)
+{
+	if (shared_caches)
+		return cpu_l2_cache_mask(cpu);
+
+	if (has_big_cores)
+		return cpu_smallcore_mask(cpu);
+
+	return per_cpu(cpu_sibling_map, cpu);
+}
+
+#ifdef CONFIG_SCHED_SMT
+static const struct cpumask *smallcore_smt_mask(int cpu)
+{
+	return cpu_smallcore_mask(cpu);
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 static int init_big_cores(void)
 {
 	int cpu;
@@ -1249,8 +1307,6 @@ static void add_cpu_to_masks(int cpu)
 			set_cpus_related(cpu, i, cpu_core_mask);
 }
 
-static bool shared_caches;
-
 /* Activate a secondary processor. */
 void start_secondary(void *unused)
 {
@@ -1314,62 +1370,6 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
-#ifdef CONFIG_SCHED_SMT
-/* cpumask of CPUs with asymmetric SMT dependency */
-static int powerpc_smt_flags(void)
-{
-	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
-
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		flags |= SD_ASYM_PACKING;
-	}
-	return flags;
-}
-#endif
-
-/*
- * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
- * This topology makes it *much* cheaper to migrate tasks between adjacent cores
- * since the migrated task remains cache hot. We want to take advantage of this
- * at the scheduler level so an extra topology level is required.
- */
-static int powerpc_shared_cache_flags(void)
-{
-	return SD_SHARE_PKG_RESOURCES;
-}
-
-/*
- * We can't just pass cpu_l2_cache_mask() directly because
- * returns a non-const pointer and the compiler barfs on that.
- */
-static const struct cpumask *shared_cache_mask(int cpu)
-{
-	if (shared_caches)
-		return cpu_l2_cache_mask(cpu);
-
-	if (has_big_cores)
-		return cpu_smallcore_mask(cpu);
-
-	return per_cpu(cpu_sibling_map, cpu);
-}
-
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *smallcore_smt_mask(int cpu)
-{
-	return cpu_smallcore_mask(cpu);
-}
-#endif
-
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
-	{ NULL, },
-};
-
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	/*
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
From: Pratik Sampat @ 2020-07-21 11:56 UTC (permalink / raw)
  To: ego
  Cc: linux-pm, daniel.lezcano, rjw, linuxppc-dev, npiggin, paulus,
	linux-kselftest, shuah, srivatsa, linux-kernel
In-Reply-To: <20200720055242.GB31497@in.ibm.com>

Hi Gautham, Thanks for the review.


On 20/07/20 11:22 am, Gautham R Shenoy wrote:
> Hi Pratik,
>
>
> On Fri, Jul 17, 2020 at 02:48:01PM +0530, Pratik Rajesh Sampat wrote:
>> This patch adds support to trace IPI based and timer based wakeup
>> latency from idle states
>>
>> Latches onto the test-cpuidle_latency kernel module using the debugfs
>> interface to send IPIs or schedule a timer based event, which in-turn
>> populates the debugfs with the latency measurements.
>>
>> Currently for the IPI and timer tests; first disable all idle states
>> and then test for latency measurements incrementally enabling each state
>>
>> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
> A few comments below.
>
>> ---
>>   tools/testing/selftests/Makefile           |   1 +
>>   tools/testing/selftests/cpuidle/Makefile   |   6 +
>>   tools/testing/selftests/cpuidle/cpuidle.sh | 257 +++++++++++++++++++++
>>   tools/testing/selftests/cpuidle/settings   |   1 +
>>   4 files changed, 265 insertions(+)
>>   create mode 100644 tools/testing/selftests/cpuidle/Makefile
>>   create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
>>   create mode 100644 tools/testing/selftests/cpuidle/settings
>>
> [..skip..]
>
>> +
>> +ins_mod()
>> +{
>> +	if [ ! -f "$MODULE" ]; then
>> +		printf "$MODULE module does not exist. Exitting\n"
> If the module has been compiled into the kernel (due to a
> localyesconfig, for instance), then it is unlikely that we will find
> it in /lib/modules. Perhaps you want to check if the debugfs
> directories created by the module exist, and if so, print a message
> saying that the modules is already loaded or some such?
>
That's a good idea. I can can grep for this module within /proc/modules
and not insert it, if it is already there

>> +		exit $ksft_skip
>> +	fi
>> +	printf "Inserting $MODULE module\n\n"
>> +	insmod $MODULE
>> +	if [ $? != 0 ]; then
>> +		printf "Insmod $MODULE failed\n"
>> +		exit $ksft_skip
>> +	fi
>> +}
>> +
>> +compute_average()
>> +{
>> +	arr=("$@")
>> +	sum=0
>> +	size=${#arr[@]}
>> +	for i in "${arr[@]}"
>> +	do
>> +		sum=$((sum + i))
>> +	done
>> +	avg=$((sum/size))
> It would be good to assert that "size" isn't 0 here.

Sure

>> +}
>> +
>> +# Disable all stop states
>> +disable_idle()
>> +{
>> +	for ((cpu=0; cpu<NUM_CPUS; cpu++))
>> +	do
>> +		for ((state=0; state<NUM_STATES; state++))
>> +		do
>> +			echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable
> So, on offlined CPUs, we won't see
> /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state directory. You
> should probably perform this operation only on online CPUs.

Right. I should make CPU operations only on online CPUs all over the script

[..snip..]

Thanks
Pratik


^ permalink raw reply

* Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
From: Michael Ellerman @ 2020-07-21 12:25 UTC (permalink / raw)
  To: bharata; +Cc: Nathan Lynch, Aneesh Kumar K.V, linuxppc-dev
In-Reply-To: <20200721032959.GN7902@in.ibm.com>

Bharata B Rao <bharata@linux.ibm.com> writes:
> On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote:
>> Nathan Lynch <nathanl@linux.ibm.com> writes:
>> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> >> This is the next version of the fixes for memory unplug on radix.
>> >> The issues and the fix are described in the actual patches.
>> >
>> > I guess this isn't actually causing problems at runtime right now, but I
>> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and
>> > arch_remove_memory(), which ought to be mmu-agnostic:
>> >
>> > int __ref arch_add_memory(int nid, u64 start, u64 size,
>> > 			  struct mhp_params *params)
>> > {
>> > 	unsigned long start_pfn = start >> PAGE_SHIFT;
>> > 	unsigned long nr_pages = size >> PAGE_SHIFT;
>> > 	int rc;
>> >
>> > 	resize_hpt_for_hotplug(memblock_phys_mem_size());
>> >
>> > 	start = (unsigned long)__va(start);
>> > 	rc = create_section_mapping(start, start + size, nid,
>> > 				    params->pgprot);
>> > ...
>> 
>> Hmm well spotted.
>> 
>> That does return early if the ops are not setup:
>> 
>> int resize_hpt_for_hotplug(unsigned long new_mem_size)
>> {
>> 	unsigned target_hpt_shift;
>> 
>> 	if (!mmu_hash_ops.resize_hpt)
>> 		return 0;
>> 
>> 
>> And:
>> 
>> void __init hpte_init_pseries(void)
>> {
>> 	...
>> 	if (firmware_has_feature(FW_FEATURE_HPT_RESIZE))
>> 		mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt;
>> 
>> And that comes in via ibm,hypertas-functions:
>> 
>> 	{FW_FEATURE_HPT_RESIZE,		"hcall-hpt-resize"},
>> 
>> 
>> But firmware is not necessarily going to add/remove that call based on
>> whether we're using hash/radix.
>
> Correct but hpte_init_pseries() will not be called for radix guests.

Yeah, duh. You'd think the function name would have been a sufficient
clue for me :)

>> So I think a follow-up patch is needed to make this more robust.
>> 
>> Aneesh/Bharata what platform did you test this series on? I'm curious
>> how this didn't break.
>
> I have tested memory hotplug/unplug for radix guest on zz platform and
> sanity-tested this for hash guest on P8.
>
> As noted above, mmu_hash_ops.resize_hpt will not be set for radix
> guest and hence we won't see any breakage.

OK.

That's probably fine as it is then. Or maybe just a comment in
resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if
we're using radix.

cheers

^ permalink raw reply

* [PATCH v3 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states
From: Pratik Rajesh Sampat @ 2020-07-21 12:42 UTC (permalink / raw)
  To: rjw, daniel.lezcano, mpe, benh, paulus, srivatsa, shuah, npiggin,
	ego, svaidy, pratik.r.sampat, psampat, linux-pm, linuxppc-dev,
	linux-kernel, linux-kselftest
In-Reply-To: <20200721124300.65615-1-psampat@linux.ibm.com>

Fire directed smp_call_function_single IPIs from a specified source
CPU to the specified target CPU to reduce the noise we have to wade
through in the trace log.
The module is based on the idea written by Srivatsa Bhat and maintained
by Vaidyanathan Srinivasan internally.

Queue HR timer and measure jitter. Wakeup latency measurement for idle
states using hrtimer.  Echo a value in ns to timer_test_function and
watch trace. A HRtimer will be queued and when it fires the expected
wakeup vs actual wakeup is computes and delay printed in ns.

Implemented as a module which utilizes debugfs so that it can be
integrated with selftests.

To include the module, check option and include as module
kernel hacking -> Cpuidle latency selftests

[srivatsa.bhat@linux.vnet.ibm.com: Initial implementation in
 cpidle/sysfs]

[svaidy@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
 and fix some of the time calculation]

[ego@linux.vnet.ibm.com: Fix some whitespace and tab errors and
 increase the resolution of IPI wakeup]

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 drivers/cpuidle/Makefile               |   1 +
 drivers/cpuidle/test-cpuidle_latency.c | 150 +++++++++++++++++++++++++
 lib/Kconfig.debug                      |  10 ++
 3 files changed, 161 insertions(+)
 create mode 100644 drivers/cpuidle/test-cpuidle_latency.c

diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
index f07800cbb43f..2ae05968078c 100644
--- a/drivers/cpuidle/Makefile
+++ b/drivers/cpuidle/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
 obj-$(CONFIG_DT_IDLE_STATES)		  += dt_idle_states.o
 obj-$(CONFIG_ARCH_HAS_CPU_RELAX)	  += poll_state.o
 obj-$(CONFIG_HALTPOLL_CPUIDLE)		  += cpuidle-haltpoll.o
+obj-$(CONFIG_IDLE_LATENCY_SELFTEST)	  += test-cpuidle_latency.o
 
 ##################################################################################
 # ARM SoC drivers
diff --git a/drivers/cpuidle/test-cpuidle_latency.c b/drivers/cpuidle/test-cpuidle_latency.c
new file mode 100644
index 000000000000..61574665e972
--- /dev/null
+++ b/drivers/cpuidle/test-cpuidle_latency.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Module-based API test facility for cpuidle latency using IPIs and timers
+ */
+
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+/* IPI based wakeup latencies */
+struct latency {
+	unsigned int src_cpu;
+	unsigned int dest_cpu;
+	ktime_t time_start;
+	ktime_t time_end;
+	u64 latency_ns;
+} ipi_wakeup;
+
+static void measure_latency(void *info)
+{
+	struct latency *v;
+	ktime_t time_diff;
+
+	v = (struct latency *)info;
+	v->time_end = ktime_get();
+	time_diff = ktime_sub(v->time_end, v->time_start);
+	v->latency_ns = ktime_to_ns(time_diff);
+}
+
+void run_smp_call_function_test(unsigned int cpu)
+{
+	ipi_wakeup.src_cpu = smp_processor_id();
+	ipi_wakeup.dest_cpu = cpu;
+	ipi_wakeup.time_start = ktime_get();
+	smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
+}
+
+/* Timer based wakeup latencies */
+struct timer_data {
+	unsigned int src_cpu;
+	u64 timeout;
+	ktime_t time_start;
+	ktime_t time_end;
+	struct hrtimer timer;
+	u64 timeout_diff_ns;
+} timer_wakeup;
+
+static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
+{
+	struct timer_data *w;
+	ktime_t time_diff;
+
+	w = container_of(hrtimer, struct timer_data, timer);
+	w->time_end = ktime_get();
+
+	time_diff = ktime_sub(w->time_end, w->time_start);
+	time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
+	w->timeout_diff_ns = ktime_to_ns(time_diff);
+	return HRTIMER_NORESTART;
+}
+
+static void run_timer_test(unsigned int ns)
+{
+	hrtimer_init(&timer_wakeup.timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL);
+	timer_wakeup.timer.function = timer_called;
+	timer_wakeup.time_start = ktime_get();
+	timer_wakeup.src_cpu = smp_processor_id();
+	timer_wakeup.timeout = ns;
+
+	hrtimer_start(&timer_wakeup.timer, ns_to_ktime(ns),
+		      HRTIMER_MODE_REL_PINNED);
+}
+
+static struct dentry *dir;
+
+static int cpu_read_op(void *data, u64 *value)
+{
+	*value = ipi_wakeup.dest_cpu;
+	return 0;
+}
+
+static int cpu_write_op(void *data, u64 value)
+{
+	run_smp_call_function_test(value);
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(ipi_ops, cpu_read_op, cpu_write_op, "%llu\n");
+
+static int timeout_read_op(void *data, u64 *value)
+{
+	*value = timer_wakeup.timeout;
+	return 0;
+}
+
+static int timeout_write_op(void *data, u64 value)
+{
+	run_timer_test(value);
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(timeout_ops, timeout_read_op, timeout_write_op, "%llu\n");
+
+static int __init latency_init(void)
+{
+	struct dentry *temp;
+
+	dir = debugfs_create_dir("latency_test", 0);
+	if (!dir) {
+		pr_alert("latency_test: failed to create /sys/kernel/debug/latency_test\n");
+		return -1;
+	}
+	temp = debugfs_create_file("ipi_cpu_dest",
+				   0666,
+				   dir,
+				   NULL,
+				   &ipi_ops);
+	if (!temp) {
+		pr_alert("latency_test: failed to create /sys/kernel/debug/ipi_cpu_dest\n");
+		return -1;
+	}
+	debugfs_create_u64("ipi_latency_ns", 0444, dir, &ipi_wakeup.latency_ns);
+	debugfs_create_u32("ipi_cpu_src", 0444, dir, &ipi_wakeup.src_cpu);
+
+	temp = debugfs_create_file("timeout_expected_ns",
+				   0666,
+				   dir,
+				   NULL,
+				   &timeout_ops);
+	if (!temp) {
+		pr_alert("latency_test: failed to create /sys/kernel/debug/timeout_expected_ns\n");
+		return -1;
+	}
+	debugfs_create_u64("timeout_diff_ns", 0444, dir, &timer_wakeup.timeout_diff_ns);
+	debugfs_create_u32("timeout_cpu_src", 0444, dir, &timer_wakeup.src_cpu);
+	pr_info("Latency Test module loaded\n");
+	return 0;
+}
+
+static void __exit latency_cleanup(void)
+{
+	pr_info("Cleaning up Latency Test module.\n");
+	debugfs_remove_recursive(dir);
+}
+
+module_init(latency_init);
+module_exit(latency_cleanup);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("IBM Corporation");
+MODULE_DESCRIPTION("Measuring idle latency for IPIs and Timers");
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d74ac0fd6b2d..e2283790245a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1375,6 +1375,16 @@ config DEBUG_KOBJECT
 	  If you say Y here, some extra kobject debugging messages will be sent
 	  to the syslog.
 
+config IDLE_LATENCY_SELFTEST
+	tristate "Cpuidle latency selftests"
+	depends on CPU_IDLE
+	help
+	  This option provides a kernel module that runs tests using the IPI and
+	  timers to measure latency.
+
+	  Say M if you want these self tests to build as a module.
+	  Say N if you are unsure.
+
 config DEBUG_KOBJECT_RELEASE
 	bool "kobject release debugging"
 	depends on DEBUG_OBJECTS_TIMERS
-- 
2.25.4


^ permalink raw reply related

* [PATCH v3 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
From: Pratik Rajesh Sampat @ 2020-07-21 12:43 UTC (permalink / raw)
  To: rjw, daniel.lezcano, mpe, benh, paulus, srivatsa, shuah, npiggin,
	ego, svaidy, pratik.r.sampat, psampat, linux-pm, linuxppc-dev,
	linux-kernel, linux-kselftest
In-Reply-To: <20200721124300.65615-1-psampat@linux.ibm.com>

This patch adds support to trace IPI based and timer based wakeup
latency from idle states

Latches onto the test-cpuidle_latency kernel module using the debugfs
interface to send IPIs or schedule a timer based event, which in-turn
populates the debugfs with the latency measurements.

Currently for the IPI and timer tests; first disable all idle states
and then test for latency measurements incrementally enabling each state

Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
---
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/cpuidle/Makefile   |   6 +
 tools/testing/selftests/cpuidle/cpuidle.sh | 310 +++++++++++++++++++++
 tools/testing/selftests/cpuidle/settings   |   1 +
 4 files changed, 318 insertions(+)
 create mode 100644 tools/testing/selftests/cpuidle/Makefile
 create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
 create mode 100644 tools/testing/selftests/cpuidle/settings

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1195bd85af38..ab6cf51f3518 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -7,6 +7,7 @@ TARGETS += capabilities
 TARGETS += cgroup
 TARGETS += clone3
 TARGETS += cpufreq
+TARGETS += cpuidle
 TARGETS += cpu-hotplug
 TARGETS += drivers/dma-buf
 TARGETS += efivarfs
diff --git a/tools/testing/selftests/cpuidle/Makefile b/tools/testing/selftests/cpuidle/Makefile
new file mode 100644
index 000000000000..72fd5d2e974d
--- /dev/null
+++ b/tools/testing/selftests/cpuidle/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+all:
+
+TEST_PROGS := cpuidle.sh
+
+include ../lib.mk
diff --git a/tools/testing/selftests/cpuidle/cpuidle.sh b/tools/testing/selftests/cpuidle/cpuidle.sh
new file mode 100755
index 000000000000..19cc24ccd4af
--- /dev/null
+++ b/tools/testing/selftests/cpuidle/cpuidle.sh
@@ -0,0 +1,310 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+LOG=cpuidle.log
+MODULE=/lib/modules/$(uname -r)/kernel/drivers/cpuidle/test-cpuidle_latency.ko
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+helpme()
+{
+	printf "Usage: $0 [-h] [-todg args]
+	[-h <help>]
+	[-m <location of the module>]
+	[-o <location of the output>]
+	\n"
+	exit 2
+}
+
+parse_arguments()
+{
+	while getopts ht:m:o: arg
+	do
+		case $arg in
+			h) # --help
+				helpme
+				;;
+			m) # --mod-file
+				MODULE=$OPTARG
+				;;
+			o) # output log files
+				LOG=$OPTARG
+				;;
+			\?)
+				helpme
+				;;
+		esac
+	done
+}
+
+ins_mod()
+{
+	if [ ! -f "$MODULE" ]; then
+		printf "$MODULE module does not exist. Exitting\n"
+		exit $ksft_skip
+	fi
+	# Check if the module is already loaded
+	grep "test_cpuidle_latency" /proc/modules
+	if [ $? == 0 ]; then
+		printf "Module: $MODULE already loaded\n\n"
+		return 0
+	fi
+	printf "Inserting $MODULE module\n\n"
+	insmod $MODULE
+	if [ $? != 0 ]; then
+		printf "Insmod $MODULE failed\n"
+		exit $ksft_skip
+	fi
+}
+
+compute_average()
+{
+	arr=("$@")
+	sum=0
+	size=${#arr[@]}
+	if [ $size == 0 ]; then
+		avg=0
+		return 1
+	fi
+	for i in "${arr[@]}"
+	do
+		sum=$((sum + i))
+	done
+	avg=$((sum/size))
+}
+
+# Disable all stop states
+disable_idle()
+{
+	for ((cpu=0; cpu<NUM_CPUS; cpu++))
+	do
+		local cpu_status=$(cpu_is_online $cpu)
+		if [ $cpu_status == 0 ]; then
+			continue
+		fi
+		for ((state=0; state<NUM_STATES; state++))
+		do
+			echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable
+		done
+	done
+}
+
+# Perform operation on each CPU for the given state
+# $1 - Operation: enable (0) / disable (1)
+# $2 - State to enable
+op_state()
+{
+	for ((cpu=0; cpu<NUM_CPUS; cpu++))
+	do
+		local cpu_status=$(cpu_is_online $cpu)
+		if [ $cpu_status == 0 ]; then
+			continue
+		fi
+		echo $1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$2/disable
+	done
+}
+
+cpuidle_enable_state()
+{
+	state=$1
+	op_state 0 $state
+}
+
+cpuidle_disable_state()
+{
+	state=$1
+	op_state 1 $state
+}
+
+cpu_is_online()
+{
+	cpu=$1
+	status=$(cat /sys/devices/system/cpu/cpu$cpu/online)
+	echo $status
+}
+
+# Extract latency in microseconds and convert to nanoseconds
+extract_latency()
+{
+	for ((state=0; state<NUM_STATES; state++))
+	do
+		latency=$(($(cat /sys/devices/system/cpu/cpu0/cpuidle/state$state/latency) * 1000))
+		latency_arr+=($latency)
+	done
+}
+
+# Run the IPI test
+# $1 run for baseline - busy cpu or regular environment
+# $2 destination cpu
+ipi_test_once()
+{
+	dest_cpu=$2
+	if [ "$1" = "baseline" ]; then
+		# Keep the CPU busy
+		taskset -c $dest_cpu cat /dev/random > /dev/null &
+		task_pid=$!
+		# Wait for the workload to achieve 100% CPU usage
+		sleep 1
+	fi
+	taskset 0x1 echo $dest_cpu > /sys/kernel/debug/latency_test/ipi_cpu_dest
+	ipi_latency=$(cat /sys/kernel/debug/latency_test/ipi_latency_ns)
+	src_cpu=$(cat /sys/kernel/debug/latency_test/ipi_cpu_src)
+	if [ "$1" = "baseline" ]; then
+		kill $task_pid
+		wait $task_pid 2>/dev/null
+	fi
+}
+
+# Incrementally Enable idle states one by one and compute the latency
+run_ipi_tests()
+{
+	extract_latency
+	disable_idle
+	declare -a avg_arr
+	echo -e "--IPI Latency Test---" >> $LOG
+
+	echo -e "--Baseline IPI Latency measurement: CPU Busy--" >> $LOG
+	printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" >> $LOG
+	for ((cpu=0; cpu<NUM_CPUS; cpu++))
+	do
+		local cpu_status=$(cpu_is_online $cpu)
+		if [ $cpu_status == 0 ]; then
+			continue
+		fi
+		ipi_test_once "baseline" $cpu
+		printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> $LOG
+		avg_arr+=($ipi_latency)
+	done
+	compute_average "${avg_arr[@]}"
+	echo -e "Baseline Average IPI latency(ns): $avg" >> $LOG
+
+	for ((state=0; state<NUM_STATES; state++))
+	do
+		unset avg_arr
+		echo -e "---Enabling state: $state---" >> $LOG
+		cpuidle_enable_state $state
+		printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" >> $LOG
+		for ((cpu=0; cpu<NUM_CPUS; cpu++))
+		do
+			local cpu_status=$(cpu_is_online $cpu)
+			if [ $cpu_status == 0 ]; then
+				continue
+			fi
+			# Running IPI test and logging results
+			sleep 1
+			ipi_test_once "test" $cpu
+			printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> $LOG
+			avg_arr+=($ipi_latency)
+		done
+		compute_average "${avg_arr[@]}"
+		echo -e "Expected IPI latency(ns): ${latency_arr[$state]}" >> $LOG
+		echo -e "Observed Average IPI latency(ns): $avg" >> $LOG
+		cpuidle_disable_state $state
+	done
+}
+
+# Extract the residency in microseconds and convert to nanoseconds.
+# Add 100 ns so that the timer stays for a little longer than the residency
+extract_residency()
+{
+	for ((state=0; state<NUM_STATES; state++))
+	do
+		residency=$(($(cat /sys/devices/system/cpu/cpu0/cpuidle/state$state/residency) * 1000 + 200))
+		residency_arr+=($residency)
+	done
+}
+
+# Run the Timeout test
+# $1 run for baseline - busy cpu or regular environment
+# $2 destination cpu
+# $3 timeout
+timeout_test_once()
+{
+	dest_cpu=$2
+	if [ "$1" = "baseline" ]; then
+		# Keep the CPU busy
+		taskset -c $dest_cpu cat /dev/random > /dev/null &
+		task_pid=$!
+		# Wait for the workload to achieve 100% CPU usage
+		sleep 1
+	fi
+	taskset -c $dest_cpu echo $3 > /sys/kernel/debug/latency_test/timeout_expected_ns
+	# Wait for the result to populate
+	sleep 0.1
+	timeout_diff=$(cat /sys/kernel/debug/latency_test/timeout_diff_ns)
+	src_cpu=$(cat /sys/kernel/debug/latency_test/timeout_cpu_src)
+	if [ "$1" = "baseline" ]; then
+		kill $task_pid
+		wait $task_pid 2>/dev/null
+	fi
+}
+
+run_timeout_tests()
+{
+	extract_residency
+	disable_idle
+	declare -a avg_arr
+	echo -e "\n--Timeout Latency Test--" >> $LOG
+
+	echo -e "--Baseline Timeout Latency measurement: CPU Busy--" >> $LOG
+	printf "%s %10s %10s\n" "Wakeup_src" "Baseline_delay(ns)">> $LOG
+	for ((cpu=0; cpu<NUM_CPUS; cpu++))
+	do
+		local cpu_status=$(cpu_is_online $cpu)
+		if [ $cpu_status == 0 ]; then
+			continue
+		fi
+		timeout_test_once "baseline" $cpu ${residency_arr[0]}
+		printf "%-3s %13s\n" $src_cpu $timeout_diff >> $LOG
+		avg_arr+=($timeout_diff)
+	done
+	compute_average "${avg_arr[@]}"
+	echo -e "Baseline Average timeout diff(ns): $avg" >> $LOG
+
+	for ((state=0; state<NUM_STATES; state++))
+	do
+		unset avg_arr
+		echo -e "---Enabling state: $state---" >> $LOG
+		cpuidle_enable_state $state
+		printf "%s %10s %10s\n" "Wakeup_src" "Baseline_delay(ns)" "Delay(ns)" >> $LOG
+		for ((cpu=0; cpu<NUM_CPUS; cpu++))
+		do
+			local cpu_status=$(cpu_is_online $cpu)
+			if [ $cpu_status == 0 ]; then
+				continue
+			fi
+			timeout_test_once "test" $cpu ${residency_arr[$state]}
+			printf "%-3s %13s %18s\n" $src_cpu $baseline_timeout_diff $timeout_diff >> $LOG
+			avg_arr+=($timeout_diff)
+		done
+		compute_average "${avg_arr[@]}"
+		echo -e "Expected timeout(ns): ${residency_arr[$state]}" >> $LOG
+		echo -e "Observed Average timeout diff(ns): $avg" >> $LOG
+		cpuidle_disable_state $state
+	done
+}
+
+declare -a residency_arr
+declare -a latency_arr
+
+# Parse arguments
+parse_arguments $@
+
+rm -f $LOG
+touch $LOG
+NUM_CPUS=$(nproc --all)
+NUM_STATES=$(ls -1 /sys/devices/system/cpu/cpu0/cpuidle/ | wc -l)
+
+# Insert the module
+ins_mod $MODULE
+
+printf "Started IPI latency tests\n"
+run_ipi_tests
+
+printf "Started Timer latency tests\n"
+run_timeout_tests
+
+printf "Removing $MODULE module\n"
+printf "Output logged at: $LOG\n"
+rmmod $MODULE
diff --git a/tools/testing/selftests/cpuidle/settings b/tools/testing/selftests/cpuidle/settings
new file mode 100644
index 000000000000..e7b9417537fb
--- /dev/null
+++ b/tools/testing/selftests/cpuidle/settings
@@ -0,0 +1 @@
+timeout=0
-- 
2.25.4


^ permalink raw reply related

* [PATCH v3 0/2] Selftest for cpuidle latency measurement
From: Pratik Rajesh Sampat @ 2020-07-21 12:42 UTC (permalink / raw)
  To: rjw, daniel.lezcano, mpe, benh, paulus, srivatsa, shuah, npiggin,
	ego, svaidy, pratik.r.sampat, psampat, linux-pm, linuxppc-dev,
	linux-kernel, linux-kselftest

v2: https://lkml.org/lkml/2020/7/17/369
Changelog v2-->v3
Based on comments from Gautham R. Shenoy adding the following in the
selftest,
1. Grepping modules to determine if already loaded
2. Wrapper to enable/disable states
3. Preventing any operation/test on offlined CPUs 
---

The patch series introduces a mechanism to measure wakeup latency for
IPI and timer based interrupts
The motivation behind this series is to find significant deviations
behind advertised latency and resisdency values

To achieve this, we introduce a kernel module and expose its control
knobs through the debugfs interface that the selftests can engage with.

The kernel module provides the following interfaces within
/sys/kernel/debug/latency_test/ for,
1. IPI test:
  ipi_cpu_dest   # Destination CPU for the IPI
  ipi_cpu_src    # Origin of the IPI
  ipi_latency_ns # Measured latency time in ns
2. Timeout test:
  timeout_cpu_src     # CPU on which the timer to be queued
  timeout_expected_ns # Timer duration
  timeout_diff_ns     # Difference of actual duration vs expected timer
To include the module, check option and include as module
kernel hacking -> Cpuidle latency selftests

The selftest inserts the module, disables all the idle states and
enables them one by one testing the following:
1. Keeping source CPU constant, iterates through all the CPUS measuring
   IPI latency for baseline (CPU is busy with
   "cat /dev/random > /dev/null" workload) and the when the CPU is
   allowed to be at rest
2. Iterating through all the CPUs, sending expected timer durations to
   be equivalent to the residency of the the deepest idle state
   enabled and extracting the difference in time between the time of
   wakeup and the expected timer duration

Usage
-----
Can be used in conjuction to the rest of the selftests.
Default Output location in: tools/testing/cpuidle/cpuidle.log

To run this test specifically:
$ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

There are a few optinal arguments too that the script can take
	[-h <help>]
	[-m <location of the module>]
	[-o <location of the output>]

Sample output snippet
---------------------
--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU IPI_Latency(ns)
...
0            8         1996
0            9         2125
0           10         1264
0           11         1788
0           12         2045
Baseline Average IPI latency(ns): 1843
---Enabling state: 5---
SRC_CPU   DEST_CPU IPI_Latency(ns)
0            8       621719
0            9       624752
0           10       622218
0           11       623968
0           12       621303
Expected IPI latency(ns): 100000
Observed Average IPI latency(ns): 622792

--Timeout Latency Test--
--Baseline Timeout Latency measurement: CPU Busy--
Wakeup_src Baseline_delay(ns) 
...
8            2249
9            2226
10           2211
11           2183
12           2263
Baseline Average timeout diff(ns): 2226
---Enabling state: 5---
8           10749                   
9           10911                   
10          10912                   
11          12100                   
12          73276                   
Expected timeout(ns): 10000200
Observed Average timeout diff(ns): 23589

Pratik Rajesh Sampat (2):
  cpuidle: Trace IPI based and timer based wakeup latency from idle
    states
  selftest/cpuidle: Add support for cpuidle latency measurement

 drivers/cpuidle/Makefile                   |   1 +
 drivers/cpuidle/test-cpuidle_latency.c     | 150 ++++++++++
 lib/Kconfig.debug                          |  10 +
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/cpuidle/Makefile   |   6 +
 tools/testing/selftests/cpuidle/cpuidle.sh | 310 +++++++++++++++++++++
 tools/testing/selftests/cpuidle/settings   |   1 +
 7 files changed, 479 insertions(+)
 create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
 create mode 100644 tools/testing/selftests/cpuidle/Makefile
 create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
 create mode 100644 tools/testing/selftests/cpuidle/settings

-- 
2.25.4


^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Mathieu Desnoyers @ 2020-07-21 13:11 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, Peter Zijlstra, x86,
	linux-kernel, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <1595324577.x3bf55tpgu.astroid@bobo.none>

----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote:

> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
[...]
> 
> Yeah you're probably right in this case I think. Quite likely most kernel
> tasks that asynchronously write to user memory would at least have some
> kind of producer-consumer barriers.
> 
> But is that restriction of all async modifications documented and enforced
> anywhere?
> 
>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>> still ordering requirement there for membarrier,
>> 
>> Please provide an example case with memory accesses via kthread_use_mm where
>> ordering matters to support your concern.
> 
> I think the concern Andy raised with io_uring was less a specific
> problem he saw and more a general concern that we have these memory
> accesses which are not synchronized with membarrier.
> 
>>> so I really think
>>> it's a fragile interface with no real way for the user to know how
>>> kernel threads may use its mm for any particular reason, so membarrier
>>> should synchronize all possible kernel users as well.
>> 
>> I strongly doubt so, but perhaps something should be clarified in the
>> documentation
>> if you have that feeling.
> 
> I'd rather go the other way and say if you have reasoning or numbers for
> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
> then we could think about keeping this subtlety with appropriate
> documentation added, otherwise we can just kill it and remove all doubt.
> 
> That being said, the x86 sync core gap that I imagined could be fixed
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

Today, the membarrier global expedited command explicitly skips kernel threads,
but it happens that membarrier private expedited considers those with the
same mm as target for the IPI.

So we already implement a semantic which differs between private and global
expedited membarriers. This can be explained in part by the fact that
kthread_use_mm was introduced after 4.16, where the most recent membarrier
commands where introduced. It seems that the effect on membarrier was not
considered when kthread_use_mm was introduced.

Looking at membarrier(2) documentation, it states that IPIs are only sent to
threads belonging to the same process as the calling thread. If my understanding
of the notion of process is correct, this should rule out sending the IPI to
kernel threads, given they are not "part" of the same process, only borrowing
the mm. But I agree that the distinction is moot, and should be clarified.

Without a clear use-case to justify adding a constraint on membarrier, I am
tempted to simply clarify documentation of current membarrier commands,
stating clearly that they are not guaranteed to affect kernel threads. Then,
if we have a compelling use-case to implement a different behavior which covers
kthreads, this could be added consistently across membarrier commands with a
flag (or by adding new commands).

Does this approach make sense ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v4 09/10] powerpc/watchpoint: Return available watchpoints dynamically
From: Ravi Bangoria @ 2020-07-21 13:33 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, Jordan Niethe,
	oleg, Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa,
	fweisbec, pedromfc, naveen.n.rao, linuxppc-dev, mingo,
	Ravi Bangoria
In-Reply-To: <87k0yxrtex.fsf@mpe.ellerman.id.au>



On 7/21/20 5:06 PM, Michael Ellerman wrote:
> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>> On 7/20/20 9:12 AM, Jordan Niethe wrote:
>>> On Fri, Jul 17, 2020 at 2:11 PM Ravi Bangoria
>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>
>>>> So far Book3S Powerpc supported only one watchpoint. Power10 is
>>>> introducing 2nd DAWR. Enable 2nd DAWR support for Power10.
>>>> Availability of 2nd DAWR will depend on CPU_FTR_DAWR1.
>>>>
>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>> ---
>>>>    arch/powerpc/include/asm/cputable.h      | 4 +++-
>>>>    arch/powerpc/include/asm/hw_breakpoint.h | 5 +++--
>>>>    2 files changed, 6 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>> index 3445c86e1f6f..36a0851a7a9b 100644
>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>> @@ -633,7 +633,9 @@ enum {
>>>>     * Maximum number of hw breakpoint supported on powerpc. Number of
>>>>     * breakpoints supported by actual hw might be less than this.
>>>>     */
>>>> -#define HBP_NUM_MAX    1
>>>> +#define HBP_NUM_MAX    2
>>>> +#define HBP_NUM_ONE    1
>>>> +#define HBP_NUM_TWO    2
> 
>>> I wonder if these defines are necessary - has it any advantage over
>>> just using the literal?
>>
>> No, not really. Initially I had something like:
>>
>> #define HBP_NUM_MAX    2
>> #define HBP_NUM_P8_P9  1
>> #define HBP_NUM_P10    2
>>
>> But then I thought it's also not right. So I made it _ONE and _TWO.
>> Now the function that decides nr watchpoints dynamically (nr_wp_slots)
>> is in different file, I thought to keep it like this so it would be
>> easier to figure out why _MAX is 2.
> 
> I don't think it makes anything clearer.
> 
> I had to stare at it thinking there was some sort of mapping or
> indirection going on, before I realised it's just literally the number
> of breakpoints.
> 
> So please just do:
> 
> static inline int nr_wp_slots(void)
> {
>         return cpu_has_feature(CPU_FTR_DAWR1) ? 2 : 1;
> }
> 
> If you think HBP_NUM_MAX needs explanation then do that with a comment,
> it can refer to nr_wp_slots() if that's helpful.

Agreed. By adding a comment, we can remove those macros. Will change it.

Thanks,
Ravi

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Ravi Bangoria @ 2020-07-21 13:42 UTC (permalink / raw)
  To: Michael Ellerman, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <87mu3trtri.fsf@mpe.ellerman.id.au>



On 7/21/20 4:59 PM, Michael Ellerman wrote:
> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>> On 7/17/20 11:14 AM, Jordan Niethe wrote:
>>> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>
>>>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>>>> 2nd DAWR is supported, otherwise not.
>>>>
>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>> ---
>>>>    arch/powerpc/include/asm/cputable.h | 7 +++++--
>>>>    arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>>>    2 files changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>> index e506d429b1af..3445c86e1f6f 100644
>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>>>    #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>>>    #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>>>    #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>>>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>>>
>>>>    #ifndef __ASSEMBLY__
>>>>
>>>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>               (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>>>                CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>> +            CPU_FTR_DAWR1)
>>>>    #else
>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>               (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>>>                CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>>>                CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>>>                CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>> +            CPU_FTR_DAWR1)
> 
>>> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
>>> into CPU_FTRS_POWER10?
>>> Then it will be picked up by CPU_FTRS_POSSIBLE.
>>
>> I remember a discussion about this with Mikey and we decided to do it
>> this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
>> CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
>> including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
>> even when device-tree property is not present or pa-feature bit it not set,
>> because we do:
>>
>>         {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
>>                 .pvr_mask               = 0xffffffff,
>>                 .pvr_value              = 0x0f000006,
>>                 .cpu_name               = "POWER10 (architected)",
>>                 .cpu_features           = CPU_FTRS_POWER10,
> 
> The pa-features logic will turn it off if the feature bit is not set.
> 
> So you should be able to put it in CPU_FTRS_POWER10.
> 
> See for example CPU_FTR_NOEXECUTE.

Ah ok. scan_features() clears the feature if the bit is not set in
pa-features. So it should work find for powervm. I'll verify the same
thing happens in case of baremetal where we use cpu-features not
pa-features. If it works in baremetal as well, will put it in
CPU_FTRS_POWER10.

Thanks for the clarification,
Ravi

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Michael Ellerman @ 2020-07-21 14:07 UTC (permalink / raw)
  To: Ravi Bangoria, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <62daa2d1-4e11-dcc1-cb1d-805ee4a156e0@linux.ibm.com>

Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
> On 7/21/20 4:59 PM, Michael Ellerman wrote:
>> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>>> On 7/17/20 11:14 AM, Jordan Niethe wrote:
>>>> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
>>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>>
>>>>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>>>>> 2nd DAWR is supported, otherwise not.
>>>>>
>>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>>> ---
>>>>>    arch/powerpc/include/asm/cputable.h | 7 +++++--
>>>>>    arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>>>>    2 files changed, 12 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>>> index e506d429b1af..3445c86e1f6f 100644
>>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>    #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>>>>    #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>>>>    #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>>>>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>>>>
>>>>>    #ifndef __ASSEMBLY__
>>>>>
>>>>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>>               (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>>>>                CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>> +            CPU_FTR_DAWR1)
>>>>>    #else
>>>>>    #define CPU_FTRS_POSSIBLE      \
>>>>>               (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>>>>                CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>>>>                CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>>>>                CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>> +            CPU_FTR_DAWR1)
>> 
>>>> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
>>>> into CPU_FTRS_POWER10?
>>>> Then it will be picked up by CPU_FTRS_POSSIBLE.
>>>
>>> I remember a discussion about this with Mikey and we decided to do it
>>> this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
>>> CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
>>> including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
>>> even when device-tree property is not present or pa-feature bit it not set,
>>> because we do:
>>>
>>>         {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
>>>                 .pvr_mask               = 0xffffffff,
>>>                 .pvr_value              = 0x0f000006,
>>>                 .cpu_name               = "POWER10 (architected)",
>>>                 .cpu_features           = CPU_FTRS_POWER10,
>> 
>> The pa-features logic will turn it off if the feature bit is not set.
>> 
>> So you should be able to put it in CPU_FTRS_POWER10.
>> 
>> See for example CPU_FTR_NOEXECUTE.
>
> Ah ok. scan_features() clears the feature if the bit is not set in
> pa-features. So it should work find for powervm. I'll verify the same
> thing happens in case of baremetal where we use cpu-features not
> pa-features. If it works in baremetal as well, will put it in
> CPU_FTRS_POWER10.

When we use DT CPU features we don't use CPU_FTRS_POWER10 at all.

We construct a cpu_spec from scratch with just the base set of features:

static struct cpu_spec __initdata base_cpu_spec = {
	.cpu_name		= NULL,
	.cpu_features		= CPU_FTRS_DT_CPU_BASE,


And then individual features are enabled via the device tree flags.

cheers

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Ravi Bangoria @ 2020-07-21 14:16 UTC (permalink / raw)
  To: Michael Ellerman, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <87d04prmgc.fsf@mpe.ellerman.id.au>



On 7/21/20 7:37 PM, Michael Ellerman wrote:
> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>> On 7/21/20 4:59 PM, Michael Ellerman wrote:
>>> Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
>>>> On 7/17/20 11:14 AM, Jordan Niethe wrote:
>>>>> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
>>>>> <ravi.bangoria@linux.ibm.com> wrote:
>>>>>>
>>>>>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>>>>>> 2nd DAWR is supported, otherwise not.
>>>>>>
>>>>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>>>>> ---
>>>>>>     arch/powerpc/include/asm/cputable.h | 7 +++++--
>>>>>>     arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>>>>>     2 files changed, 12 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>>>>> index e506d429b1af..3445c86e1f6f 100644
>>>>>> --- a/arch/powerpc/include/asm/cputable.h
>>>>>> +++ b/arch/powerpc/include/asm/cputable.h
>>>>>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>>     #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>>>>>     #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>>>>>     #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>>>>>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>>>>>
>>>>>>     #ifndef __ASSEMBLY__
>>>>>>
>>>>>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>>>>>     #define CPU_FTRS_POSSIBLE      \
>>>>>>                (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>>>>>                 CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>>> +            CPU_FTR_DAWR1)
>>>>>>     #else
>>>>>>     #define CPU_FTRS_POSSIBLE      \
>>>>>>                (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>>>>>                 CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>>>>>                 CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>>>>>                 CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>>>>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>>>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>>>>> +            CPU_FTR_DAWR1)
>>>
>>>>> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
>>>>> into CPU_FTRS_POWER10?
>>>>> Then it will be picked up by CPU_FTRS_POSSIBLE.
>>>>
>>>> I remember a discussion about this with Mikey and we decided to do it
>>>> this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
>>>> CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
>>>> including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
>>>> even when device-tree property is not present or pa-feature bit it not set,
>>>> because we do:
>>>>
>>>>          {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
>>>>                  .pvr_mask               = 0xffffffff,
>>>>                  .pvr_value              = 0x0f000006,
>>>>                  .cpu_name               = "POWER10 (architected)",
>>>>                  .cpu_features           = CPU_FTRS_POWER10,
>>>
>>> The pa-features logic will turn it off if the feature bit is not set.
>>>
>>> So you should be able to put it in CPU_FTRS_POWER10.
>>>
>>> See for example CPU_FTR_NOEXECUTE.
>>
>> Ah ok. scan_features() clears the feature if the bit is not set in
>> pa-features. So it should work find for powervm. I'll verify the same
>> thing happens in case of baremetal where we use cpu-features not
>> pa-features. If it works in baremetal as well, will put it in
>> CPU_FTRS_POWER10.
> 
> When we use DT CPU features we don't use CPU_FTRS_POWER10 at all.
> 
> We construct a cpu_spec from scratch with just the base set of features:
> 
> static struct cpu_spec __initdata base_cpu_spec = {
> 	.cpu_name		= NULL,
> 	.cpu_features		= CPU_FTRS_DT_CPU_BASE,
> 
> 
> And then individual features are enabled via the device tree flags.

Ah good. I was under a wrong impression that we use cpu_specs[] for all
the cases. Thanks mpe for explaining in detail :)

Ravi

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Nicholas Piggin @ 2020-07-21 14:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, Peter Zijlstra, x86,
	linux-kernel, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <470490605.22057.1595337118562.JavaMail.zimbra@efficios.com>

Excerpts from Mathieu Desnoyers's message of July 21, 2020 11:11 pm:
> ----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote:
> 
>> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
> [...]
>> 
>> Yeah you're probably right in this case I think. Quite likely most kernel
>> tasks that asynchronously write to user memory would at least have some
>> kind of producer-consumer barriers.
>> 
>> But is that restriction of all async modifications documented and enforced
>> anywhere?
>> 
>>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>>> still ordering requirement there for membarrier,
>>> 
>>> Please provide an example case with memory accesses via kthread_use_mm where
>>> ordering matters to support your concern.
>> 
>> I think the concern Andy raised with io_uring was less a specific
>> problem he saw and more a general concern that we have these memory
>> accesses which are not synchronized with membarrier.
>> 
>>>> so I really think
>>>> it's a fragile interface with no real way for the user to know how
>>>> kernel threads may use its mm for any particular reason, so membarrier
>>>> should synchronize all possible kernel users as well.
>>> 
>>> I strongly doubt so, but perhaps something should be clarified in the
>>> documentation
>>> if you have that feeling.
>> 
>> I'd rather go the other way and say if you have reasoning or numbers for
>> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
>> then we could think about keeping this subtlety with appropriate
>> documentation added, otherwise we can just kill it and remove all doubt.
>> 
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
> 
> Today, the membarrier global expedited command explicitly skips kernel threads,
> but it happens that membarrier private expedited considers those with the
> same mm as target for the IPI.
> 
> So we already implement a semantic which differs between private and global
> expedited membarriers.

Which is not a good thing.

> This can be explained in part by the fact that
> kthread_use_mm was introduced after 4.16, where the most recent membarrier
> commands where introduced. It seems that the effect on membarrier was not
> considered when kthread_use_mm was introduced.

No it was just renamed, it used to be called use_mm and has been in the 
kernel for ~ever.

That you hadn't considered this is actually weight for my point, which 
is that there's so much subtle behaviour that's easy to miss we're 
better off with simpler and fewer special cases until it's proven 
they're needed. Not the other way around.

> 
> Looking at membarrier(2) documentation, it states that IPIs are only sent to
> threads belonging to the same process as the calling thread. If my understanding
> of the notion of process is correct, this should rule out sending the IPI to
> kernel threads, given they are not "part" of the same process, only borrowing
> the mm. But I agree that the distinction is moot, and should be clarified.

It does if you read it in a user-hostile legalistic way. The reality is 
userspace shouldn't and can't know about how the kernel might implement 
functionality.

> Without a clear use-case to justify adding a constraint on membarrier, I am
> tempted to simply clarify documentation of current membarrier commands,
> stating clearly that they are not guaranteed to affect kernel threads. Then,
> if we have a compelling use-case to implement a different behavior which covers
> kthreads, this could be added consistently across membarrier commands with a
> flag (or by adding new commands).
> 
> Does this approach make sense ?

The other position is without a clear use case for PF_KTHREAD, seeing as 
async kernel accesses had not been considered before now, we limit the 
optimision to only skipping the idle thread. I think that makes more 
sense (unless you have a reason for PF_KTHREAD but it doesn't seem like 
there is much of one).

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
From: Waiman Long @ 2020-07-21 14:36 UTC (permalink / raw)
  To: Nicholas Piggin, Peter Zijlstra
  Cc: linux-arch, Will Deacon, Boqun Feng, linux-kernel, kvm-ppc,
	virtualization, Ingo Molnar, linuxppc-dev
In-Reply-To: <1595327263.lk78cqolxm.astroid@bobo.none>

On 7/21/20 7:08 AM, Nicholas Piggin wrote:
> diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
> index b752d34517b3..26d8766a1106 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
>   
>   #else
>   extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
> +extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
>   #endif
>   
>   static __always_inline void queued_spin_lock(struct qspinlock *lock)
>   {
> -	u32 val = 0;
> -
> -	if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
> +	atomic_t *a = &lock->val;
> +	u32 val;
> +
> +again:
> +	asm volatile(
> +"1:\t"	PPC_LWARX(%0,0,%1,1) "	# queued_spin_lock			\n"
> +	: "=&r" (val)
> +	: "r" (&a->counter)
> +	: "memory");
> +
> +	if (likely(val == 0)) {
> +		asm_volatile_goto(
> +	"	stwcx.	%0,0,%1							\n"
> +	"	bne-	%l[again]						\n"
> +	"\t"	PPC_ACQUIRE_BARRIER "						\n"
> +		:
> +		: "r"(_Q_LOCKED_VAL), "r" (&a->counter)
> +		: "cr0", "memory"
> +		: again );
>   		return;
> -
> -	queued_spin_lock_slowpath(lock, val);
> +	}
> +
> +	if (likely(val == _Q_LOCKED_VAL)) {
> +		asm_volatile_goto(
> +	"	stwcx.	%0,0,%1							\n"
> +	"	bne-	%l[again]						\n"
> +		:
> +		: "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter)
> +		: "cr0", "memory"
> +		: again );
> +
> +		atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK));
> +//		clear_pending_set_locked(lock);
> +		WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
> +//		lockevent_inc(lock_pending);
> +		return;
> +	}
> +
> +	if (val == _Q_PENDING_VAL) {
> +		int cnt = _Q_PENDING_LOOPS;
> +		val = atomic_cond_read_relaxed(a,
> +					       (VAL != _Q_PENDING_VAL) || !cnt--);
> +		if (!(val & ~_Q_LOCKED_MASK))
> +			goto again;
> +        }
> +	queued_spin_lock_slowpath_queue(lock);
>   }
>   #define queued_spin_lock queued_spin_lock
>   

I am fine with the arch code override some part of the generic code.


> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index b9515fcc9b29..ebcc6f5d99d5 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -287,10 +287,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>   
>   #ifdef CONFIG_PARAVIRT_SPINLOCKS
>   #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
> +#define queued_spin_lock_slowpath_queue	native_queued_spin_lock_slowpath_queue
>   #endif
>   
>   #endif /* _GEN_PV_LOCK_SLOWPATH */
>   
> +void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
> +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock);
> +
>   /**
>    * queued_spin_lock_slowpath - acquire the queued spinlock
>    * @lock: Pointer to queued spinlock structure
> @@ -314,12 +318,6 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
>    */
>   void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   {
> -	struct mcs_spinlock *prev, *next, *node;
> -	u32 old, tail;
> -	int idx;
> -
> -	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> -
>   	if (pv_enabled())
>   		goto pv_queue;
>   
> @@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   queue:
>   	lockevent_inc(lock_slowpath);
>   pv_queue:
> +	__queued_spin_lock_slowpath_queue(lock);
> +}
> +EXPORT_SYMBOL(queued_spin_lock_slowpath);
> +
> +void queued_spin_lock_slowpath_queue(struct qspinlock *lock)
> +{
> +	lockevent_inc(lock_slowpath);
> +	__queued_spin_lock_slowpath_queue(lock);
> +}
> +EXPORT_SYMBOL(queued_spin_lock_slowpath_queue);
> +
> +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock)
> +{
> +	struct mcs_spinlock *prev, *next, *node;
> +	u32 old, tail;
> +	u32 val;
> +	int idx;
> +
> +	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> +
>   	node = this_cpu_ptr(&qnodes[0].mcs);
>   	idx = node->count++;
>   	tail = encode_tail(smp_processor_id(), idx);
> @@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>   	 */
>   	__this_cpu_dec(qnodes[0].mcs.count);
>   }
> -EXPORT_SYMBOL(queued_spin_lock_slowpath);
>   
>   /*
>    * Generate the paravirt code for queued_spin_unlock_slowpath().
>
I would prefer to extract out the pending bit handling code out into a 
separate helper function which can be overridden by the arch code 
instead of breaking the slowpath into 2 pieces.

Cheers,
Longman


^ permalink raw reply

* Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
From: Nicholas Piggin @ 2020-07-21 14:37 UTC (permalink / raw)
  To: benh, ego, linux-kernel, linuxppc-dev, mikey, mpe, paulus,
	pratik.r.sampat, Pratik Sampat, svaidy
In-Reply-To: <81dcf34e-870d-b3a1-7876-a6a2f0b37d1f@linux.ibm.com>

Excerpts from Pratik Sampat's message of July 21, 2020 8:29 pm:
> 
> 
> On 20/07/20 5:27 am, Nicholas Piggin wrote:
>> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
>>> Replace the variable name from using "pnv_first_spr_loss_level" to
>>> "pnv_first_fullstate_loss_level".
>>>
>>> As pnv_first_spr_loss_level is supposed to be the earliest state that
>>> has OPAL_PM_LOSE_FULL_CONTEXT set, however as shallow states too loose
>>> SPR values, render an incorrect terminology.
>> It also doesn't lose "full" state at this loss level though. From the
>> architecture it could be called "hv state loss level", but in POWER10
>> even that is not strictly true.
>>
> Right. Just discovered that deep stop states won't loose full state
> P10 onwards.
> Would it better if we rename it as "pnv_all_spr_loss_state" instead
> so that it stays generic enough while being semantically coherent?

It doesn't lose all SPRs. It does physically, but for Linux it appears 
at least timebase SPRs are retained and that's mostly how it's 
documented.

Maybe there's no really good name for it, but we do call it "deep" stop 
in other places, you could call it deep_spr_loss maybe. I don't mind too 
much though, whatever Gautham is happy with.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
From: Gautham R Shenoy @ 2020-07-21 14:55 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: ego, mikey, pratik.r.sampat, linux-kernel, Pratik Sampat, paulus,
	linuxppc-dev
In-Reply-To: <1595341835.8ad8mjl9hm.astroid@bobo.none>

Hi,

On Wed, Jul 22, 2020 at 12:37:41AM +1000, Nicholas Piggin wrote:
> Excerpts from Pratik Sampat's message of July 21, 2020 8:29 pm:
> > 
> > 
> > On 20/07/20 5:27 am, Nicholas Piggin wrote:
> >> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
> >>> Replace the variable name from using "pnv_first_spr_loss_level" to
> >>> "pnv_first_fullstate_loss_level".
> >>>
> >>> As pnv_first_spr_loss_level is supposed to be the earliest state that
> >>> has OPAL_PM_LOSE_FULL_CONTEXT set, however as shallow states too loose
> >>> SPR values, render an incorrect terminology.
> >> It also doesn't lose "full" state at this loss level though. From the
> >> architecture it could be called "hv state loss level", but in POWER10
> >> even that is not strictly true.
> >>
> > Right. Just discovered that deep stop states won't loose full state
> > P10 onwards.
> > Would it better if we rename it as "pnv_all_spr_loss_state" instead
> > so that it stays generic enough while being semantically coherent?
> 
> It doesn't lose all SPRs. It does physically, but for Linux it appears 
> at least timebase SPRs are retained and that's mostly how it's 
> documented.
> 
> Maybe there's no really good name for it, but we do call it "deep" stop 
> in other places, you could call it deep_spr_loss maybe. I don't mind too 
> much though, whatever Gautham is happy with.

Nick's suggestion is fine by me. We can call it deep_spr_loss_state.

> 
> Thanks,
> Nick

--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH v3 0/2] Selftest for cpuidle latency measurement
From: Daniel Lezcano @ 2020-07-21 14:57 UTC (permalink / raw)
  To: Pratik Rajesh Sampat, rjw, mpe, benh, paulus, srivatsa, shuah,
	npiggin, ego, svaidy, pratik.r.sampat, linux-pm, linuxppc-dev,
	linux-kernel, linux-kselftest
In-Reply-To: <20200721124300.65615-1-psampat@linux.ibm.com>

On 21/07/2020 14:42, Pratik Rajesh Sampat wrote:
> v2: https://lkml.org/lkml/2020/7/17/369
> Changelog v2-->v3
> Based on comments from Gautham R. Shenoy adding the following in the
> selftest,
> 1. Grepping modules to determine if already loaded
> 2. Wrapper to enable/disable states
> 3. Preventing any operation/test on offlined CPUs 
> ---
> 
> The patch series introduces a mechanism to measure wakeup latency for
> IPI and timer based interrupts
> The motivation behind this series is to find significant deviations
> behind advertised latency and resisdency values

Why do you want to measure for the timer and the IPI ? Whatever the
source of the wakeup, the exit latency remains the same, no ?

Is all this kernel-ish code really needed ?

What about using a highres periodic timer and make it expires every eg.
50ms x 2400, so it is 120 secondes and measure the deviation. Repeat the
operation for each idle states.

And in order to make it as much accurate as possible, set the program
affinity on a CPU and isolate this one by preventing other processes to
be scheduled on and migrate the interrupts on the other CPUs.

That will be all userspace code, no?

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply

* Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Nicholas Piggin @ 2020-07-21 15:00 UTC (permalink / raw)
  To: kvm-ppc, linuxppc-dev, Ram Pai
  Cc: sukadev, aik, bharata, sathnaga, ldufour, bauerman, david
In-Reply-To: <1594888333-9370-1-git-send-email-linuxram@us.ibm.com>

Excerpts from Ram Pai's message of July 16, 2020 6:32 pm:
> An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> fault is delivered to the ultravisor.

Why not a ucall if you're paraultravizing it anyway?

> 
> Unfortunately the Ultravisor has no correct-way to fetch the faulting
> instruction. The PEF architecture does not allow Ultravisor to enable MMU
> translation. Walking the two level page table to read the instruction can race
> with other vcpus modifying the SVM's process scoped page table.
> 
> This problem can be correctly solved with some help from the kernel.
> 
> Capture the faulting instruction in SPRG0 register, before executing the
> faulting instruction. This enables the ultravisor to easily procure the
> faulting instruction and emulate it.
> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/include/asm/io.h | 85 ++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 75 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index 635969b..7ef663d 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -35,6 +35,7 @@
>  #include <asm/mmu.h>
>  #include <asm/ppc_asm.h>
>  #include <asm/pgtable.h>
> +#include <asm/svm.h>
>  
>  #define SIO_CONFIG_RA	0x398
>  #define SIO_CONFIG_RD	0x399
> @@ -105,34 +106,98 @@
>  static inline u##size name(const volatile u##size __iomem *addr)	\
>  {									\
>  	u##size ret;							\
> -	__asm__ __volatile__("sync;"#insn" %0,%y1;twi 0,%0,0;isync"	\
> -		: "=r" (ret) : "Z" (*addr) : "memory");			\
> +	if (is_secure_guest()) {					\
> +		__asm__ __volatile__("mfsprg0 %3;"			\
> +				"lnia %2;"				\
> +				"ld %2,12(%2);"				\
> +				"mtsprg0 %2;"				\
> +				"sync;"					\
> +				#insn" %0,%y1;"				\
> +				"twi 0,%0,0;"				\
> +				"isync;"				\
> +				"mtsprg0 %3"				\

We prefer to use mtspr in new code, and the nia offset should be 
calculated with a label I think "(1f - .)(%2)" should work.

SPRG usage is documented in arch/powerpc/include/asm/reg.h if this 
goes past RFC stage. Looks like SPRG0 probably could be used for this.

Thanks,
Nick

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: peterz @ 2020-07-21 15:06 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, x86, linux-kernel,
	Andy Lutomirski, linux-mm, Mathieu Desnoyers, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <1595324577.x3bf55tpgu.astroid@bobo.none>

On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:

> That being said, the x86 sync core gap that I imagined could be fixed 
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

So I've been trying to figure out where that PF_KTHREAD comes from,
commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.

So the first version:

  https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com

appears to unconditionally send the IPI and checks p->mm in the IPI
context, but then v2:

  https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com

has the current code. But I've been unable to find the reason the
'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

The comment doesn't really help either; sure we have the whole lazy mm
thing, but that's ->active_mm, not ->mm.

Possibly it is because {,un}use_mm() do not have sufficient barriers to
make the remote p->mm test work? Or were we over-eager with the !p->mm
doesn't imply kthread 'cleanups' at the time?

Also, I just realized, I still have a fix for use_mm() now
kthread_use_mm() that seems to have been lost.

^ permalink raw reply

* [PATCH -next] PCI: rpadlpar: Make some functions static
From: Wei Yongjun @ 2020-07-21 15:17 UTC (permalink / raw)
  To: Hulk Robot, Michael Ellerman, Tyrel Datwyler, Bjorn Helgaas
  Cc: linux-pci, Wei Yongjun, linuxppc-dev

The sparse tool report build warnings as follows:

drivers/pci/hotplug/rpadlpar_core.c:355:5: warning:
 symbol 'dlpar_remove_pci_slot' was not declared. Should it be static?
drivers/pci/hotplug/rpadlpar_core.c:461:12: warning:
 symbol 'rpadlpar_io_init' was not declared. Should it be static?
drivers/pci/hotplug/rpadlpar_core.c:473:6: warning:
 symbol 'rpadlpar_io_exit' was not declared. Should it be static?

Those functions are not used outside of this file, so marks them
static.
Also mark rpadlpar_io_exit() as __exit.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
---
 drivers/pci/hotplug/rpadlpar_core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
index c5eb509c72f0..f979b7098acf 100644
--- a/drivers/pci/hotplug/rpadlpar_core.c
+++ b/drivers/pci/hotplug/rpadlpar_core.c
@@ -352,7 +352,7 @@ static int dlpar_remove_vio_slot(char *drc_name, struct device_node *dn)
  * -ENODEV		Not a valid drc_name
  * -EIO			Internal PCI Error
  */
-int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
+static int dlpar_remove_pci_slot(char *drc_name, struct device_node *dn)
 {
 	struct pci_bus *bus;
 	struct slot *slot;
@@ -458,7 +458,7 @@ static inline int is_dlpar_capable(void)
 	return (int) (rc != RTAS_UNKNOWN_SERVICE);
 }
 
-int __init rpadlpar_io_init(void)
+static int __init rpadlpar_io_init(void)
 {
 
 	if (!is_dlpar_capable()) {
@@ -470,7 +470,7 @@ int __init rpadlpar_io_init(void)
 	return dlpar_sysfs_init();
 }
 
-void rpadlpar_io_exit(void)
+static void __exit rpadlpar_io_exit(void)
 {
 	dlpar_sysfs_exit();
 }


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox