LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
powerpc/smp: Dont assume l2-cache to be superset of sibling
	Set cpumask after verifying l2-cache. (Gautham)

 arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 72f16dc0cb26..57468877499a 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 	if (!l2_cache)
 		return false;
 
+	cpumask_set_cpu(cpu, mask_fn(cpu));
 	for_each_cpu(i, cpu_online_mask) {
 		/*
 		 * when updating the marks the current CPU has not been marked
@@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
 	 * add it to it's own thread sibling mask.
 	 */
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++)
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_sibling_mask);
 
 	add_cpu_to_smallcore_masks(cpu);
-	/*
-	 * Copy the thread sibling mask into the cache sibling mask
-	 * and mark any CPUs that share an L2 with this CPU.
-	 */
-	for_each_cpu(i, cpu_sibling_mask(cpu))
-		set_cpus_related(cpu, i, cpu_l2_cache_mask);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
-	/*
-	 * Copy the cache sibling mask into core sibling mask and mark
-	 * any CPUs on the same chip as this CPU.
-	 */
-	for_each_cpu(i, cpu_l2_cache_mask(cpu))
-		set_cpus_related(cpu, i, cpu_core_mask);
+	if (pkg_id == -1) {
+		struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+		/*
+		 * Copy the sibling mask into core sibling mask and
+		 * mark any CPUs on the same chip as this CPU.
+		 */
+		if (shared_caches)
+			mask = cpu_l2_cache_mask;
+
+		for_each_cpu(i, mask(cpu))
+			set_cpus_related(cpu, i, cpu_core_mask);
 
-	if (pkg_id == -1)
 		return;
+	}
 
 	for_each_cpu(i, cpu_online_mask)
 		if (get_physical_package_id(i) == pkg_id)
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 04/10] powerpc/smp: Enable small core scheduling sooner
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

Enable small core scheduling as soon as we detect that we are in a
system that supports thread group. Doing so would avoid a redundant
check.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
powerpc/smp: Enable small core scheduling sooner
	Restored the previous info msg (Jordan)
	Moved big core topology fixup to fixup_topology (Gautham)

 arch/powerpc/kernel/smp.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 1ce95da00cb6..72f16dc0cb26 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1370,6 +1370,16 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
+static void fixup_topology(void)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (has_big_cores) {
+		pr_info("Big cores detected but using small core scheduling\n");
+		powerpc_topology[0].mask = smallcore_smt_mask;
+	}
+#endif
+}
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	/*
@@ -1383,12 +1393,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-#ifdef CONFIG_SCHED_SMT
-	if (has_big_cores) {
-		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[0].mask = smallcore_smt_mask;
-	}
-#endif
+	fixup_topology();
 	set_sched_topology(powerpc_topology);
 }
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 02/10] powerpc/smp: Merge Power9 topology with Power topology
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

A new sched_domain_topology_level was added just for Power9. However the
same can be achieved by merging powerpc_topology with power9_topology
and makes the code more simpler especially when adding a new sched
domain.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
powerpc/smp: Merge Power9 topology with Power topology
	Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
	since cpu_smt_mask is only defined under CONFIG_SCHED_SMT

 arch/powerpc/kernel/smp.c | 33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 680c0edcc59d..0e0b118d9b6e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1315,7 +1315,7 @@ int setup_profiling_timer(unsigned int multiplier)
 }
 
 #ifdef CONFIG_SCHED_SMT
-/* cpumask of CPUs with asymetric SMT dependancy */
+/* cpumask of CPUs with asymmetric SMT dependency */
 static int powerpc_smt_flags(void)
 {
 	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
@@ -1328,14 +1328,6 @@ static int powerpc_smt_flags(void)
 }
 #endif
 
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
-	{ NULL, },
-};
-
 /*
  * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
  * This topology makes it *much* cheaper to migrate tasks between adjacent cores
@@ -1353,7 +1345,13 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	return cpu_l2_cache_mask(cpu);
+	if (shared_caches)
+		return cpu_l2_cache_mask(cpu);
+
+	if (has_big_cores)
+		return cpu_smallcore_mask(cpu);
+
+	return per_cpu(cpu_sibling_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -1363,7 +1361,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
-static struct sched_domain_topology_level power9_topology[] = {
+static struct sched_domain_topology_level powerpc_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
@@ -1388,21 +1386,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
-		power9_topology[0].mask = smallcore_smt_mask;
 		powerpc_topology[0].mask = smallcore_smt_mask;
 	}
 #endif
-	/*
-	 * If any CPU detects that it's sharing a cache with another CPU then
-	 * use the deeper topology that is aware of this sharing.
-	 */
-	if (shared_caches) {
-		pr_info("Using shared cache scheduler topology\n");
-		set_sched_topology(power9_topology);
-	} else {
-		pr_info("Using standard scheduler topology\n");
-		set_sched_topology(powerpc_topology);
-	}
+	set_sched_topology(powerpc_topology);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 01/10] powerpc/smp: Cache node for reuse
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-1-srikar@linux.vnet.ibm.com>

While cpu_to_node is inline function with access to per_cpu variable.
However when using repeatedly, it may be cleaner to cache it in a local
variable.

Also fix a build error in a some weird config.
"error: _numa_cpu_lookup_table_ undeclared"

No functional change

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 73199470c265..680c0edcc59d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -843,7 +843,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 
 	DBG("smp_prepare_cpus\n");
 
-	/* 
+	/*
 	 * setup_cpu may need to be called on the boot cpu. We havent
 	 * spun any cpus up but lets be paranoid.
 	 */
@@ -854,20 +854,24 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	cpu_callin_map[boot_cpuid] = 1;
 
 	for_each_possible_cpu(cpu) {
+		int node = cpu_to_node(cpu);
+
 		zalloc_cpumask_var_node(&per_cpu(cpu_sibling_map, cpu),
-					GFP_KERNEL, cpu_to_node(cpu));
+					GFP_KERNEL, node);
 		zalloc_cpumask_var_node(&per_cpu(cpu_l2_cache_map, cpu),
-					GFP_KERNEL, cpu_to_node(cpu));
+					GFP_KERNEL, node);
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
-					GFP_KERNEL, cpu_to_node(cpu));
+					GFP_KERNEL, node);
+#ifdef CONFIG_NEED_MULTIPLE_NODES
 		/*
 		 * numa_node_id() works after this.
 		 */
 		if (cpu_present(cpu)) {
-			set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
-			set_cpu_numa_mem(cpu,
-				local_memory_node(numa_cpu_lookup_table[cpu]));
+			node = numa_cpu_lookup_table[cpu];
+			set_cpu_numa_node(cpu, node);
+			set_cpu_numa_mem(cpu, local_memory_node(node));
 		}
+#endif
 	}
 
 	/* Init the cpumasks so the boot CPU is related to itself */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 00/10] Coregroup support on Powerpc
From: Srikar Dronamraju @ 2020-07-21 11:38 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar

Changelog v1 -> v2:
v1: https://lore.kernel.org/linuxppc-dev/20200714043624.5648-1-srikar@linux.vnet.ibm.com/t/#u

powerpc/smp: Merge Power9 topology with Power topology
	Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
	since cpu_smt_mask is only defined under CONFIG_SCHED_SMT

powerpc/smp: Enable small core scheduling sooner
	Restored the previous info msg (Jordan)
	Moved big core topology fixup to fixup_topology (Gautham)

powerpc/smp: Dont assume l2-cache to be superset of sibling
	Set cpumask after verifying l2-cache. (Gautham)

powerpc/smp: Generalize 2nd sched domain
	Moved shared_cache topology fixup to fixup_topology (Gautham)

Powerpc/numa: Detect support for coregroup
	Explained Coregroup in commit msg (Michael Ellerman)

Powerpc/smp: Create coregroup domain
	Moved coregroup topology fixup to fixup_topology (Gautham)

powerpc/smp: Implement cpu_to_coregroup_id
	Move coregroup_enabled before getting associativity (Gautham)

powerpc/smp: Provide an ability to disable coregroup
	Patch dropped (Michael Ellerman)

Cleanup of existing powerpc topologies and add coregroup support on
Powerpc. Coregroup is a group of (subset of) cores of a DIE that share
a resource.

Patch 7 of this patch series: "Powerpc/numa: Detect support for coregroup"
depends on
https://lore.kernel.org/linuxppc-dev/20200707140644.7241-1-srikar@linux.vnet.ibm.com/t/#u
However it should be easy to rebase the patch without the above patch.

This patch series is based on top of current powerpc/next tree + the
above patch.

On Power 8 Systems
------------------
$ tail /proc/cpuinfo
processor	: 255
cpu		: POWER8 (architected), altivec supported
clock		: 3724.000000MHz
revision	: 2.1 (pvr 004b 0201)

timebase	: 512000000
platform	: pSeries
model		: IBM,8408-E8E
machine		: CHRP IBM,8408-E8E
MMU		: Hash

Before the patchset
-------------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
DIE
NUMA
NUMA
$ head /proc/schedstat
version 15
timestamp 4295534931
cpu0 0 0 0 0 0 0 41389823338 17682779896 14117
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 27087859050 152273672 10396
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After the patchset
------------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
DIE
NUMA
NUMA
$ head /proc/schedstat
version 15
timestamp 4295534931
cpu0 0 0 0 0 0 0 41389823338 17682779896 14117
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 27087859050 152273672 10396
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

On Power 9 (with device-tree enablement to show coregroups).
(hunks for mimicing a coregroup was posted at
https://lore.kernel.org/linuxppc-dev/20200714043624.5648-1-srikar@linux.vnet.ibm.com/t/#m2cb09bb11c7a93257d6123d1d27edb8212f8af21)
-----------------------------------------------------------
$ tail /proc/cpuinfo
processor	: 127
cpu		: POWER9 (architected), altivec supported
clock		: 3000.000000MHz
revision	: 2.2 (pvr 004e 0202)

timebase	: 512000000
platform	: pSeries
model		: IBM,9008-22L
machine		: CHRP IBM,9008-22L
MMU		: Hash

Before patchset
--------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After patchset
--------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
MC
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>

Srikar Dronamraju (10):
  powerpc/smp: Cache node for reuse
  powerpc/smp: Merge Power9 topology with Power topology
  powerpc/smp: Move powerpc_topology above
  powerpc/smp: Enable small core scheduling sooner
  powerpc/smp: Dont assume l2-cache to be superset of sibling
  powerpc/smp: Generalize 2nd sched domain
  Powerpc/numa: Detect support for coregroup
  powerpc/smp: Allocate cpumask only after searching thread group
  Powerpc/smp: Create coregroup domain
  powerpc/smp: Implement cpu_to_coregroup_id

 arch/powerpc/include/asm/smp.h      |   1 +
 arch/powerpc/include/asm/topology.h |  10 ++
 arch/powerpc/kernel/smp.c           | 255 +++++++++++++++++-----------
 arch/powerpc/mm/numa.c              |  59 +++++--
 4 files changed, 213 insertions(+), 112 deletions(-)

-- 
2.17.1


^ permalink raw reply

* Re: [PATCH v4 09/10] powerpc/watchpoint: Return available watchpoints dynamically
From: Michael Ellerman @ 2020-07-21 11:36 UTC (permalink / raw)
  To: Ravi Bangoria, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <ccfcf488-0ec9-1737-8368-a848de1d72d1@linux.ibm.com>

Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
> On 7/20/20 9:12 AM, Jordan Niethe wrote:
>> On Fri, Jul 17, 2020 at 2:11 PM Ravi Bangoria
>> <ravi.bangoria@linux.ibm.com> wrote:
>>>
>>> So far Book3S Powerpc supported only one watchpoint. Power10 is
>>> introducing 2nd DAWR. Enable 2nd DAWR support for Power10.
>>> Availability of 2nd DAWR will depend on CPU_FTR_DAWR1.
>>>
>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>> ---
>>>   arch/powerpc/include/asm/cputable.h      | 4 +++-
>>>   arch/powerpc/include/asm/hw_breakpoint.h | 5 +++--
>>>   2 files changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>> index 3445c86e1f6f..36a0851a7a9b 100644
>>> --- a/arch/powerpc/include/asm/cputable.h
>>> +++ b/arch/powerpc/include/asm/cputable.h
>>> @@ -633,7 +633,9 @@ enum {
>>>    * Maximum number of hw breakpoint supported on powerpc. Number of
>>>    * breakpoints supported by actual hw might be less than this.
>>>    */
>>> -#define HBP_NUM_MAX    1
>>> +#define HBP_NUM_MAX    2
>>> +#define HBP_NUM_ONE    1
>>> +#define HBP_NUM_TWO    2

>> I wonder if these defines are necessary - has it any advantage over
>> just using the literal?
>
> No, not really. Initially I had something like:
>
> #define HBP_NUM_MAX    2
> #define HBP_NUM_P8_P9  1
> #define HBP_NUM_P10    2
>
> But then I thought it's also not right. So I made it _ONE and _TWO.
> Now the function that decides nr watchpoints dynamically (nr_wp_slots)
> is in different file, I thought to keep it like this so it would be
> easier to figure out why _MAX is 2.

I don't think it makes anything clearer.

I had to stare at it thinking there was some sort of mapping or
indirection going on, before I realised it's just literally the number
of breakpoints.

So please just do:

static inline int nr_wp_slots(void)
{
       return cpu_has_feature(CPU_FTR_DAWR1) ? 2 : 1;
}

If you think HBP_NUM_MAX needs explanation then do that with a comment,
it can refer to nr_wp_slots() if that's helpful.

cheers

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Michael Ellerman @ 2020-07-21 11:29 UTC (permalink / raw)
  To: Ravi Bangoria, Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, fweisbec, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, pedromfc,
	naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <c34b1a66-2db6-c97a-1782-0d473c758502@linux.ibm.com>

Ravi Bangoria <ravi.bangoria@linux.ibm.com> writes:
> On 7/17/20 11:14 AM, Jordan Niethe wrote:
>> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
>> <ravi.bangoria@linux.ibm.com> wrote:
>>>
>>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>>> 2nd DAWR is supported, otherwise not.
>>>
>>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>>> ---
>>>   arch/powerpc/include/asm/cputable.h | 7 +++++--
>>>   arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>>   2 files changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>>> index e506d429b1af..3445c86e1f6f 100644
>>> --- a/arch/powerpc/include/asm/cputable.h
>>> +++ b/arch/powerpc/include/asm/cputable.h
>>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>>   #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>>   #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>>   #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>>
>>>   #ifndef __ASSEMBLY__
>>>
>>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>>   #define CPU_FTRS_POSSIBLE      \
>>>              (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>>               CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>> +            CPU_FTR_DAWR1)
>>>   #else
>>>   #define CPU_FTRS_POSSIBLE      \
>>>              (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>>               CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>>               CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>>               CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>>> +            CPU_FTR_DAWR1)

>> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
>> into CPU_FTRS_POWER10?
>> Then it will be picked up by CPU_FTRS_POSSIBLE.
>
> I remember a discussion about this with Mikey and we decided to do it
> this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
> CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
> including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
> even when device-tree property is not present or pa-feature bit it not set,
> because we do:
>
>        {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
>                .pvr_mask               = 0xffffffff,
>                .pvr_value              = 0x0f000006,
>                .cpu_name               = "POWER10 (architected)",
>                .cpu_features           = CPU_FTRS_POWER10,

The pa-features logic will turn it off if the feature bit is not set.

So you should be able to put it in CPU_FTRS_POWER10.

See for example CPU_FTR_NOEXECUTE.

cheers

^ permalink raw reply

* Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
From: Nicholas Piggin @ 2020-07-21 11:20 UTC (permalink / raw)
  To: Waiman Long, Peter Zijlstra
  Cc: linux-arch, Will Deacon, Boqun Feng, linux-kernel, kvm-ppc,
	virtualization, Ingo Molnar, linuxppc-dev
In-Reply-To: <20200709083113.GI597537@hirez.programming.kicks-ass.net>

Excerpts from Peter Zijlstra's message of July 9, 2020 6:31 pm:
> On Wed, Jul 08, 2020 at 07:54:34PM -0400, Waiman Long wrote:
>> On 7/8/20 4:41 AM, Peter Zijlstra wrote:
>> > On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
>> > > Yes, powerpc could certainly get more performance out of the slow
>> > > paths, and then there are a few parameters to tune.
>> > Can you clarify? The slow path is already in use on ARM64 which is weak,
>> > so I doubt there's superfluous serialization present. And Will spend a
>> > fair amount of time on making that thing guarantee forward progressm, so
>> > there just isn't too much room to play.
>> > 
>> > > We don't have a good alternate patching for function calls yet, but
>> > > that would be something to do for native vs pv.
>> > Going by your jump_label implementation, support for static_call should
>> > be fairly straight forward too, no?
>> > 
>> >    https://lkml.kernel.org/r/20200624153024.794671356@infradead.org
>> > 
>> Speaking of static_call, I am also looking forward to it. Do you have an
>> idea when that will be merged?
> 
> 0day had one crash on the last round, I think Steve send a fix for that
> last night and I'll go look at it.
> 
> That said, the last posting got 0 feedback, so either everybody is
> really happy with it, or not interested. So let us know in the thread,
> with some review feedback.
> 
> Once I get through enough of the inbox to actually find the fix and test
> it, I'll also update the thread, and maybe threaten to merge it if
> everybody stays silent :-)

I'd like to use it in powerpc. We have code now for example that patches 
a branch immediately at the top of memcpy which branches to a different 
version of the function. pv queued spinlock selection obviously, and
there's a bunch of platform ops struct things that get filled in at boot 
time, etc.

So +1 here if you can get them through. I'm not 100% sure we can do
it with existing toolchain and no ugly hacks, but there's no way to
structure things that can get around that AFAIKS. We'd eventually
use it though, I'd say.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
From: Nicholas Piggin @ 2020-07-21 11:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, Will Deacon, Boqun Feng, linux-kernel, kvm-ppc,
	virtualization, Ingo Molnar, Waiman Long, linuxppc-dev
In-Reply-To: <20200708084106.GE597537@hirez.programming.kicks-ass.net>

Excerpts from Peter Zijlstra's message of July 8, 2020 6:41 pm:
> On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
>> Yes, powerpc could certainly get more performance out of the slow
>> paths, and then there are a few parameters to tune.
> 

Sorry for the delay, got bogged down and distracted by other things :(

> Can you clarify? The slow path is already in use on ARM64 which is weak,
> so I doubt there's superfluous serialization present. And Will spend a
> fair amount of time on making that thing guarantee forward progressm, so
> there just isn't too much room to play.

Sure, the way the pending not-queued slowpath (which I guess is the
medium-path) is implemented is just poorly structured for LL/SC. It
has one more atomic than necessary (queued_fetch_set_pending_acquire),
and a lot of branches in suboptimal order.

Attached patch (completely untested just compiled and looked at asm
so far) is a way we can fix this on powerpc I think. It's actually
very little generic code change which is good, duplicated medium-path
logic unfortunately but that's no worse than something like x86
really.

>> We don't have a good alternate patching for function calls yet, but
>> that would be something to do for native vs pv.
> 
> Going by your jump_label implementation, support for static_call should
> be fairly straight forward too, no?
> 
>   https://lkml.kernel.org/r/20200624153024.794671356@infradead.org

Nice, yeah it should be. I've wanted this for ages!

powerpc is kind of annoying to implement that with limited call range,
Hmm, not sure if we'd need a new linker feature to support it. We'd
provide call site patch space for indirect branches for those out of
range of direct call, so that should work fine. The trick would be 
patching in the TOC lookup for the function... should be doable somehow.

Thanks,
Nick

---

diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b752d34517b3..26d8766a1106 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 
 #else
 extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
 #endif
 
 static __always_inline void queued_spin_lock(struct qspinlock *lock)
 {
-	u32 val = 0;
-
-	if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
+	atomic_t *a = &lock->val;
+	u32 val;
+
+again:
+	asm volatile(
+"1:\t"	PPC_LWARX(%0,0,%1,1) "	# queued_spin_lock			\n"
+	: "=&r" (val)
+	: "r" (&a->counter)
+	: "memory");
+
+	if (likely(val == 0)) {
+		asm_volatile_goto(
+	"	stwcx.	%0,0,%1							\n"
+	"	bne-	%l[again]						\n"
+	"\t"	PPC_ACQUIRE_BARRIER "						\n"
+		:
+		: "r"(_Q_LOCKED_VAL), "r" (&a->counter)
+		: "cr0", "memory"
+		: again );
 		return;
-
-	queued_spin_lock_slowpath(lock, val);
+	}
+
+	if (likely(val == _Q_LOCKED_VAL)) {
+		asm_volatile_goto(
+	"	stwcx.	%0,0,%1							\n"
+	"	bne-	%l[again]						\n"
+		:
+		: "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter)
+		: "cr0", "memory"
+		: again );
+
+		atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK));
+//		clear_pending_set_locked(lock);
+		WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+//		lockevent_inc(lock_pending);
+		return;
+	}
+
+	if (val == _Q_PENDING_VAL) {
+		int cnt = _Q_PENDING_LOOPS;
+		val = atomic_cond_read_relaxed(a,
+					       (VAL != _Q_PENDING_VAL) || !cnt--);
+		if (!(val & ~_Q_LOCKED_MASK))
+			goto again;
+        }
+	queued_spin_lock_slowpath_queue(lock);
 }
 #define queued_spin_lock queued_spin_lock
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b9515fcc9b29..ebcc6f5d99d5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -287,10 +287,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 #define queued_spin_lock_slowpath	native_queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath_queue	native_queued_spin_lock_slowpath_queue
 #endif
 
 #endif /* _GEN_PV_LOCK_SLOWPATH */
 
+void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
+static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock);
+
 /**
  * queued_spin_lock_slowpath - acquire the queued spinlock
  * @lock: Pointer to queued spinlock structure
@@ -314,12 +318,6 @@ static __always_inline u32  __pv_wait_head_or_lock(struct qspinlock *lock,
  */
 void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
-	struct mcs_spinlock *prev, *next, *node;
-	u32 old, tail;
-	int idx;
-
-	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
-
 	if (pv_enabled())
 		goto pv_queue;
 
@@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 queue:
 	lockevent_inc(lock_slowpath);
 pv_queue:
+	__queued_spin_lock_slowpath_queue(lock);
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath);
+
+void queued_spin_lock_slowpath_queue(struct qspinlock *lock)
+{
+	lockevent_inc(lock_slowpath);
+	__queued_spin_lock_slowpath_queue(lock);
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath_queue);
+
+static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock)
+{
+	struct mcs_spinlock *prev, *next, *node;
+	u32 old, tail;
+	u32 val;
+	int idx;
+
+	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
 	node = this_cpu_ptr(&qnodes[0].mcs);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);
@@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 	 */
 	__this_cpu_dec(qnodes[0].mcs.count);
 }
-EXPORT_SYMBOL(queued_spin_lock_slowpath);
 
 /*
  * Generate the paravirt code for queued_spin_unlock_slowpath().

^ permalink raw reply related

* [PATCH v2 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping
From: Laurent Dufour @ 2020-07-21 10:42 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, mpe, paulus
  Cc: sukadev, linuxram, bauerman, bharata
In-Reply-To: <20200721104202.15727-1-ldufour@linux.ibm.com>

When a secure memslot is dropped, all the pages backed in the secure device
(aka really backed by secure memory by the Ultravisor) should be paged out
to a normal page. Previously, this was achieved by triggering the page
fault mechanism which is calling kvmppc_svm_page_out() on each pages.

This can't work when hot unplugging a memory slot because the memory slot
is flagged as invalid and gfn_to_pfn() is then not trying to access the
page, so the page fault mechanism is not triggered.

Since the final goal is to make a call to kvmppc_svm_page_out() it seems
simpler to directly calling it instead of triggering such a mechanism. This
way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
memslot.

Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
the call to __kvmppc_svm_page_out() is made.
As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
addition, the mmap_sem is help in read mode during that time, not in write
mode since the virual memory layout is not impacted, and
kvm->arch.uvmem_lock prevents concurrent operation on the secure device.

Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 54 ++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 5a4b02d3f651..ba5c7c77cc3a 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -624,35 +624,55 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
  * fault on them, do fault time migration to replace the device PTEs in
  * QEMU page table with normal PTEs from newly allocated pages.
  */
-void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
 			     struct kvm *kvm, bool skip_page_out)
 {
 	int i;
 	struct kvmppc_uvmem_page_pvt *pvt;
-	unsigned long pfn, uvmem_pfn;
-	unsigned long gfn = free->base_gfn;
+	struct page *uvmem_page;
+	struct vm_area_struct *vma = NULL;
+	unsigned long uvmem_pfn, gfn;
+	unsigned long addr, end;
+
+	mmap_read_lock(kvm->mm);
+
+	addr = slot->userspace_addr;
+	end = addr + (slot->npages * PAGE_SIZE);
 
-	for (i = free->npages; i; --i, ++gfn) {
-		struct page *uvmem_page;
+	gfn = slot->base_gfn;
+	for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
+
+		/* Fetch the VMA if addr is not in the latest fetched one */
+		if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
+			vma = find_vma_intersection(kvm->mm, addr, end);
+			if (!vma ||
+			    vma->vm_start > addr || vma->vm_end < end) {
+				pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
+				break;
+			}
+		}
 
 		mutex_lock(&kvm->arch.uvmem_lock);
-		if (!kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+
+		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+			uvmem_page = pfn_to_page(uvmem_pfn);
+			pvt = uvmem_page->zone_device_data;
+			pvt->skip_page_out = skip_page_out;
+			pvt->remove_gfn = true;
+
+			if (__kvmppc_svm_page_out(vma, addr, addr + PAGE_SIZE,
+						  PAGE_SHIFT, kvm, pvt->gpa))
+				pr_err("Can't page out gpa:0x%lx addr:0x%lx\n",
+				       pvt->gpa, addr);
+		} else {
+			/* Remove the shared flag if any */
 			kvmppc_gfn_remove(gfn, kvm);
-			mutex_unlock(&kvm->arch.uvmem_lock);
-			continue;
 		}
 
-		uvmem_page = pfn_to_page(uvmem_pfn);
-		pvt = uvmem_page->zone_device_data;
-		pvt->skip_page_out = skip_page_out;
-		pvt->remove_gfn = true;
 		mutex_unlock(&kvm->arch.uvmem_lock);
-
-		pfn = gfn_to_pfn(kvm, gfn);
-		if (is_error_noslot_pfn(pfn))
-			continue;
-		kvm_release_pfn_clean(pfn);
 	}
+
+	mmap_read_unlock(kvm->mm);
 }
 
 unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
-- 
2.27.0


^ permalink raw reply related

* [PATCH v2 1/2] KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
From: Laurent Dufour @ 2020-07-21 10:42 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, mpe, paulus
  Cc: sukadev, linuxram, bauerman, bharata
In-Reply-To: <20200721104202.15727-1-ldufour@linux.ibm.com>

kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
so move it upper in this file.

Furthermore it will be interesting to call this function when already
holding the kvm->arch.uvmem_lock, so prefix the original function with __
and remove the locking in it, and introduce a wrapper which call that
function with the lock held.

There is no functional change.

Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 166 ++++++++++++++++-------------
 1 file changed, 90 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a2b4d259f8b0..5a4b02d3f651 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -526,6 +526,96 @@ unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
 	return ret;
 }
 
+/*
+ * Provision a new page on HV side and copy over the contents
+ * from secure memory using UV_PAGE_OUT uvcall.
+ * Caller must held kvm->arch.uvmem_lock.
+ */
+static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
+		unsigned long start,
+		unsigned long end, unsigned long page_shift,
+		struct kvm *kvm, unsigned long gpa)
+{
+	unsigned long src_pfn, dst_pfn = 0;
+	struct migrate_vma mig;
+	struct page *dpage, *spage;
+	struct kvmppc_uvmem_page_pvt *pvt;
+	unsigned long pfn;
+	int ret = U_SUCCESS;
+
+	memset(&mig, 0, sizeof(mig));
+	mig.vma = vma;
+	mig.start = start;
+	mig.end = end;
+	mig.src = &src_pfn;
+	mig.dst = &dst_pfn;
+	mig.src_owner = &kvmppc_uvmem_pgmap;
+
+	/* The requested page is already paged-out, nothing to do */
+	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
+		return ret;
+
+	ret = migrate_vma_setup(&mig);
+	if (ret)
+		return -1;
+
+	spage = migrate_pfn_to_page(*mig.src);
+	if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
+		goto out_finalize;
+
+	if (!is_zone_device_page(spage))
+		goto out_finalize;
+
+	dpage = alloc_page_vma(GFP_HIGHUSER, vma, start);
+	if (!dpage) {
+		ret = -1;
+		goto out_finalize;
+	}
+
+	lock_page(dpage);
+	pvt = spage->zone_device_data;
+	pfn = page_to_pfn(dpage);
+
+	/*
+	 * This function is used in two cases:
+	 * - When HV touches a secure page, for which we do UV_PAGE_OUT
+	 * - When a secure page is converted to shared page, we *get*
+	 *   the page to essentially unmap the device page. In this
+	 *   case we skip page-out.
+	 */
+	if (!pvt->skip_page_out)
+		ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
+				  gpa, 0, page_shift);
+
+	if (ret == U_SUCCESS)
+		*mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+	else {
+		unlock_page(dpage);
+		__free_page(dpage);
+		goto out_finalize;
+	}
+
+	migrate_vma_pages(&mig);
+
+out_finalize:
+	migrate_vma_finalize(&mig);
+	return ret;
+}
+
+static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
+				      unsigned long start, unsigned long end,
+				      unsigned long page_shift,
+				      struct kvm *kvm, unsigned long gpa)
+{
+	int ret;
+
+	mutex_lock(&kvm->arch.uvmem_lock);
+	ret = __kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa);
+	mutex_unlock(&kvm->arch.uvmem_lock);
+
+	return ret;
+}
+
 /*
  * Drop device pages that we maintain for the secure guest
  *
@@ -898,82 +988,6 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 	return ret;
 }
 
-/*
- * Provision a new page on HV side and copy over the contents
- * from secure memory using UV_PAGE_OUT uvcall.
- */
-static int kvmppc_svm_page_out(struct vm_area_struct *vma,
-		unsigned long start,
-		unsigned long end, unsigned long page_shift,
-		struct kvm *kvm, unsigned long gpa)
-{
-	unsigned long src_pfn, dst_pfn = 0;
-	struct migrate_vma mig;
-	struct page *dpage, *spage;
-	struct kvmppc_uvmem_page_pvt *pvt;
-	unsigned long pfn;
-	int ret = U_SUCCESS;
-
-	memset(&mig, 0, sizeof(mig));
-	mig.vma = vma;
-	mig.start = start;
-	mig.end = end;
-	mig.src = &src_pfn;
-	mig.dst = &dst_pfn;
-	mig.src_owner = &kvmppc_uvmem_pgmap;
-
-	mutex_lock(&kvm->arch.uvmem_lock);
-	/* The requested page is already paged-out, nothing to do */
-	if (!kvmppc_gfn_is_uvmem_pfn(gpa >> page_shift, kvm, NULL))
-		goto out;
-
-	ret = migrate_vma_setup(&mig);
-	if (ret)
-		goto out;
-
-	spage = migrate_pfn_to_page(*mig.src);
-	if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
-		goto out_finalize;
-
-	if (!is_zone_device_page(spage))
-		goto out_finalize;
-
-	dpage = alloc_page_vma(GFP_HIGHUSER, vma, start);
-	if (!dpage) {
-		ret = -1;
-		goto out_finalize;
-	}
-
-	lock_page(dpage);
-	pvt = spage->zone_device_data;
-	pfn = page_to_pfn(dpage);
-
-	/*
-	 * This function is used in two cases:
-	 * - When HV touches a secure page, for which we do UV_PAGE_OUT
-	 * - When a secure page is converted to shared page, we *get*
-	 *   the page to essentially unmap the device page. In this
-	 *   case we skip page-out.
-	 */
-	if (!pvt->skip_page_out)
-		ret = uv_page_out(kvm->arch.lpid, pfn << page_shift,
-				  gpa, 0, page_shift);
-
-	if (ret == U_SUCCESS)
-		*mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
-	else {
-		unlock_page(dpage);
-		__free_page(dpage);
-		goto out_finalize;
-	}
-
-	migrate_vma_pages(&mig);
-out_finalize:
-	migrate_vma_finalize(&mig);
-out:
-	mutex_unlock(&kvm->arch.uvmem_lock);
-	return ret;
-}
 
 /*
  * Fault handler callback that gets called when HV touches any page that
-- 
2.27.0


^ permalink raw reply related

* [PATCH v2 0/2] Rework secure memslot dropping
From: Laurent Dufour @ 2020-07-21 10:42 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, mpe, paulus
  Cc: sukadev, linuxram, bauerman, bharata

When doing memory hotplug on a secure VM, the secure pages are not well
cleaned from the secure device when dropping the memslot.  This silent
error, is then preventing the SVM to reboot properly after the following
sequence of commands are run in the Qemu monitor:

device_add pc-dimm,id=dimm1,memdev=mem1
device_del dimm1
device_add pc-dimm,id=dimm1,memdev=mem1

At reboot time, when the kernel is booting again and switching to the
secure mode, the page_in is failing for the pages in the memslot because
the cleanup was not done properly, because the memslot is flagged as
invalid during the hot unplug and thus the page fault mechanism is not
triggered.

To prevent that during the memslot dropping, instead of belonging on the
page fault mechanism to trigger the page out of the secured pages, it seems
simpler to directly call the function doing the page out. This way the
state of the memslot is not interfering on the page out process.

This series applies on top of the Ram's one titled:
"[v4 0/5] Migrate non-migrated pages of a SVM."
https://lore.kernel.org/linuxppc-dev/1594972827-13928-1-git-send-email-linuxram@us.ibm.com/

Changes since V1:
 - Rebase on top of Ram's V4 series
 - Address Bharata's comment to use mmap_read_*lock().

Laurent Dufour (2):
  KVM: PPC: Book3S HV: move kvmppc_svm_page_out up
  KVM: PPC: Book3S HV: rework secure mem slot dropping

 arch/powerpc/kvm/book3s_hv_uvmem.c | 220 +++++++++++++++++------------
 1 file changed, 127 insertions(+), 93 deletions(-)

-- 
2.27.0

^ permalink raw reply

* Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
From: Pratik Sampat @ 2020-07-21 10:29 UTC (permalink / raw)
  To: Nicholas Piggin, benh, ego, linux-kernel, linuxppc-dev, mikey,
	mpe, paulus, pratik.r.sampat, svaidy
In-Reply-To: <1595202681.bt4670u7q7.astroid@bobo.none>



On 20/07/20 5:27 am, Nicholas Piggin wrote:
> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
>> Replace the variable name from using "pnv_first_spr_loss_level" to
>> "pnv_first_fullstate_loss_level".
>>
>> As pnv_first_spr_loss_level is supposed to be the earliest state that
>> has OPAL_PM_LOSE_FULL_CONTEXT set, however as shallow states too loose
>> SPR values, render an incorrect terminology.
> It also doesn't lose "full" state at this loss level though. From the
> architecture it could be called "hv state loss level", but in POWER10
> even that is not strictly true.
>
Right. Just discovered that deep stop states won't loose full state
P10 onwards.
Would it better if we rename it as "pnv_all_spr_loss_state" instead
so that it stays generic enough while being semantically coherent?

Thanks
Pratik

>> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
>> ---
>>   arch/powerpc/platforms/powernv/idle.c | 18 +++++++++---------
>>   1 file changed, 9 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
>> index f62904f70fc6..d439e11af101 100644
>> --- a/arch/powerpc/platforms/powernv/idle.c
>> +++ b/arch/powerpc/platforms/powernv/idle.c
>> @@ -48,7 +48,7 @@ static bool default_stop_found;
>>    * First stop state levels when SPR and TB loss can occur.
>>    */
>>   static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
>> -static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
>> +static u64 pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
>>   
>>   /*
>>    * psscr value and mask of the deepest stop idle state.
>> @@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>>   		  */
>>   		mmcr0		= mfspr(SPRN_MMCR0);
>>   	}
>> -	if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
>> +	if ((psscr & PSSCR_RL_MASK) >= pnv_first_fullstate_loss_level) {
>>   		sprs.lpcr	= mfspr(SPRN_LPCR);
>>   		sprs.hfscr	= mfspr(SPRN_HFSCR);
>>   		sprs.fscr	= mfspr(SPRN_FSCR);
>> @@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>>   	 * just always test PSSCR for SPR/TB state loss.
>>   	 */
>>   	pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT;
>> -	if (likely(pls < pnv_first_spr_loss_level)) {
>> +	if (likely(pls < pnv_first_fullstate_loss_level)) {
>>   		if (sprs_saved)
>>   			atomic_stop_thread_idle();
>>   		goto out;
>> @@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void)
>>   	 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
>>   	 */
>>   	pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
>> -	pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
>> +	pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
>>   	for (i = 0; i < nr_pnv_idle_states; i++) {
>>   		int err;
>>   		struct pnv_idle_states_t *state = &pnv_idle_states[i];
>> @@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void)
>>   			pnv_first_tb_loss_level = psscr_rl;
>>   
>>   		if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
>> -		     (pnv_first_spr_loss_level > psscr_rl))
>> -			pnv_first_spr_loss_level = psscr_rl;
>> +		     (pnv_first_fullstate_loss_level > psscr_rl))
>> +			pnv_first_fullstate_loss_level = psscr_rl;
>>   
>>   		/*
>>   		 * The idle code does not deal with TB loss occurring
>> @@ -1111,8 +1111,8 @@ static void __init pnv_power9_idle_init(void)
>>   		 * compatibility.
>>   		 */
>>   		if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
>> -		     (pnv_first_spr_loss_level > psscr_rl))
>> -			pnv_first_spr_loss_level = psscr_rl;
>> +		     (pnv_first_fullstate_loss_level > psscr_rl))
>> +			pnv_first_fullstate_loss_level = psscr_rl;
>>   
>>   		err = validate_psscr_val_mask(&state->psscr_val,
>>   					      &state->psscr_mask,
>> @@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void)
>>   	}
>>   
>>   	pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%llx\n",
>> -		pnv_first_spr_loss_level);
>> +		pnv_first_fullstate_loss_level);
>>   
>>   	pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%llx\n",
>>   		pnv_first_tb_loss_level);
>> -- 
>> 2.25.4
>>
>>


^ permalink raw reply

* Re: [PATCH v3 1/3] powerpc/powernv/idle: Replace CPU features checks with PVR checks
From: Pratik Sampat @ 2020-07-21 10:24 UTC (permalink / raw)
  To: Nicholas Piggin, benh, ego, linux-kernel, linuxppc-dev, mikey,
	mpe, paulus, pratik.r.sampat, svaidy
In-Reply-To: <1595203067.oropk0x5c8.astroid@bobo.none>



On 20/07/20 5:30 am, Nicholas Piggin wrote:
> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
>> As the idle framework's architecture is incomplete, hence instead of
>> checking for just the processor type advertised in the device tree CPU
>> features; check for the Processor Version Register (PVR) so that finer
>> granularity can be leveraged while making processor checks.
>>
>> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
>> ---
>>   arch/powerpc/platforms/powernv/idle.c | 14 +++++++-------
>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
>> index 2dd467383a88..f62904f70fc6 100644
>> --- a/arch/powerpc/platforms/powernv/idle.c
>> +++ b/arch/powerpc/platforms/powernv/idle.c
>> @@ -92,7 +92,7 @@ static int pnv_save_sprs_for_deep_states(void)
>>   		if (rc != 0)
>>   			return rc;
>>   
>> -		if (cpu_has_feature(CPU_FTR_ARCH_300)) {
>> +		if (pvr_version_is(PVR_POWER9)) {
>>   			rc = opal_slw_set_reg(pir, P9_STOP_SPR_MSR, msr_val);
>>   			if (rc)
>>   				return rc;
>> @@ -116,7 +116,7 @@ static int pnv_save_sprs_for_deep_states(void)
>>   				return rc;
>>   
>>   			/* Only p8 needs to set extra HID regiters */
>> -			if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
>> +			if (!pvr_version_is(PVR_POWER9)) {
>>   
>>   				rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
>>   				if (rc != 0)
> What I think you should do is keep using CPU_FTR_ARCH_300 for this stuff
> which is written for power9 and we know is running on power9, because
> that's a faster test (static branch and does not have to read PVR. And
> then...
>
>> @@ -1205,7 +1205,7 @@ static void __init pnv_probe_idle_states(void)
>>   		return;
>>   	}
>>   
>> -	if (cpu_has_feature(CPU_FTR_ARCH_300))
>> +	if (pvr_version_is(PVR_POWER9))
>>   		pnv_power9_idle_init();
>>   
>>   	for (i = 0; i < nr_pnv_idle_states; i++)
> Here is where you would put the version check. Once we have code that
> can also handle P10 (either by testing CPU_FTR_ARCH_31, or by adding
> an entirely new power10 idle function), then you can add the P10 version
> check here.

Sure, it makes sense to make this check on the top level function and
retain CPU_FTR_ARCH_300 lower in the calls for speed.
I'll make that change.

Thanks
Pratik

> Thanks,
> Nick
>


^ permalink raw reply

* Re: [PATCH v3] powerpc/pseries: Avoid using addr_to_pfn in realmode
From: Nicholas Piggin @ 2020-07-21 10:08 UTC (permalink / raw)
  To: Ganesh Goudar, linuxppc-dev, mpe; +Cc: aneesh.kumar, mahesh
In-Reply-To: <20200720080335.21049-1-ganeshgr@linux.ibm.com>

Excerpts from Ganesh Goudar's message of July 20, 2020 6:03 pm:
> When an UE or memory error exception is encountered the MCE handler
> tries to find the pfn using addr_to_pfn() which takes effective
> address as an argument, later pfn is used to poison the page where
> memory error occurred, recent rework in this area made addr_to_pfn
> to run in realmode, which can be fatal as it may try to access
> memory outside RMO region.
> 
> To fix this have separate functions for realmode and virtual mode
> handling and let addr_to_pfn to run in virtual mode.

You didn't really explain what you moved around. You added some
helper functions, but what does it actually do differently now? Can you 
explain that in the changelog?

Thanks,
Nick

> 
> Without this fix following kernel crash is seen on hitting UE.
> 
> [  485.128036] Oops: Kernel access of bad area, sig: 11 [#1]
> [  485.128040] LE SMP NR_CPUS=2048 NUMA pSeries
> [  485.128047] Modules linked in:
> [  485.128067] CPU: 15 PID: 6536 Comm: insmod Kdump: loaded Tainted: G OE 5.7.0 #22
> [  485.128074] NIP:  c00000000009b24c LR: c0000000000398d8 CTR: c000000000cd57c0
> [  485.128078] REGS: c000000003f1f970 TRAP: 0300   Tainted: G OE (5.7.0)
> [  485.128082] MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 28008284  XER: 00000001
> [  485.128088] CFAR: c00000000009b190 DAR: c0000001fab00000 DSISR: 40000000 IRQMASK: 1
> [  485.128088] GPR00: 0000000000000001 c000000003f1fbf0 c000000001634300 0000b0fa01000000
> [  485.128088] GPR04: d000000002220000 0000000000000000 00000000fab00000 0000000000000022
> [  485.128088] GPR08: c0000001fab00000 0000000000000000 c0000001fab00000 c000000003f1fc14
> [  485.128088] GPR12: 0000000000000008 c000000003ff5880 d000000002100008 0000000000000000
> [  485.128088] GPR16: 000000000000ff20 000000000000fff1 000000000000fff2 d0000000021a1100
> [  485.128088] GPR20: d000000002200000 c00000015c893c50 c000000000d49b28 c00000015c893c50
> [  485.128088] GPR24: d0000000021a0d08 c0000000014e5da8 d0000000021a0818 000000000000000a
> [  485.128088] GPR28: 0000000000000008 000000000000000a c0000000017e2970 000000000000000a
> [  485.128125] NIP [c00000000009b24c] __find_linux_pte+0x11c/0x310
> [  485.128130] LR [c0000000000398d8] addr_to_pfn+0x138/0x170
> [  485.128133] Call Trace:
> [  485.128135] Instruction dump:
> [  485.128138] 3929ffff 7d4a3378 7c883c36 7d2907b4 794a1564 7d294038 794af082 3900ffff
> [  485.128144] 79291f24 790af00e 78e70020 7d095214 <7c69502a> 2fa30000 419e011c 70690040
> [  485.128152] ---[ end trace d34b27e29ae0e340 ]---
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
> ---
> V2: Leave bare metal code and save_mce_event as is.
> 
> V3: Have separate functions for realmode and virtual mode handling.
> ---
>  arch/powerpc/platforms/pseries/ras.c | 119 ++++++++++++++++-----------
>  1 file changed, 70 insertions(+), 49 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index f3736fcd98fc..32fe3fad86b8 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -522,18 +522,55 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>  	return 0; /* need to perform reset */
>  }
>  
> +static int mce_handle_err_realmode(int disposition, u8 error_type)
> +{
> +#ifdef CONFIG_PPC_BOOK3S_64
> +	if (disposition == RTAS_DISP_NOT_RECOVERED) {
> +		switch (error_type) {
> +		case	MC_ERROR_TYPE_SLB:
> +		case	MC_ERROR_TYPE_ERAT:
> +			/*
> +			 * Store the old slb content in paca before flushing.
> +			 * Print this when we go to virtual mode.
> +			 * There are chances that we may hit MCE again if there
> +			 * is a parity error on the SLB entry we trying to read
> +			 * for saving. Hence limit the slb saving to single
> +			 * level of recursion.
> +			 */
> +			if (local_paca->in_mce == 1)
> +				slb_save_contents(local_paca->mce_faulty_slbs);
> +			flush_and_reload_slb();
> +			disposition = RTAS_DISP_FULLY_RECOVERED;
> +			break;
> +		default:
> +			break;
> +		}
> +	} else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
> +		/* Platform corrected itself but could be degraded */
> +		pr_err("MCE: limited recovery, system may be degraded\n");
> +		disposition = RTAS_DISP_FULLY_RECOVERED;
> +	}
> +#endif
> +	return disposition;
> +}
>  
> -static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
> +static int mce_handle_err_virtmode(struct pt_regs *regs,
> +				   struct rtas_error_log *errp,
> +				   struct pseries_mc_errorlog *mce_log,
> +				   int disposition)
>  {
>  	struct mce_error_info mce_err = { 0 };
> -	unsigned long eaddr = 0, paddr = 0;
> -	struct pseries_errorlog *pseries_log;
> -	struct pseries_mc_errorlog *mce_log;
> -	int disposition = rtas_error_disposition(errp);
>  	int initiator = rtas_error_initiator(errp);
>  	int severity = rtas_error_severity(errp);
> +	unsigned long eaddr = 0, paddr = 0;
>  	u8 error_type, err_sub_type;
>  
> +	if (!mce_log)
> +		goto out;
> +
> +	error_type = mce_log->error_type;
> +	err_sub_type = rtas_mc_error_sub_type(mce_log);
> +
>  	if (initiator == RTAS_INITIATOR_UNKNOWN)
>  		mce_err.initiator = MCE_INITIATOR_UNKNOWN;
>  	else if (initiator == RTAS_INITIATOR_CPU)
> @@ -572,18 +609,7 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>  	mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
>  	mce_err.error_class = MCE_ECLASS_UNKNOWN;
>  
> -	if (!rtas_error_extended(errp))
> -		goto out;
> -
> -	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
> -	if (pseries_log == NULL)
> -		goto out;
> -
> -	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
> -	error_type = mce_log->error_type;
> -	err_sub_type = rtas_mc_error_sub_type(mce_log);
> -
> -	switch (mce_log->error_type) {
> +	switch (error_type) {
>  	case MC_ERROR_TYPE_UE:
>  		mce_err.error_type = MCE_ERROR_TYPE_UE;
>  		mce_common_process_ue(regs, &mce_err);
> @@ -683,37 +709,32 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>  		mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
>  		break;
>  	}
> +out:
> +	save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
> +		       &mce_err, regs->nip, eaddr, paddr);
> +	return disposition;
> +}
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> -	if (disposition == RTAS_DISP_NOT_RECOVERED) {
> -		switch (error_type) {
> -		case	MC_ERROR_TYPE_SLB:
> -		case	MC_ERROR_TYPE_ERAT:
> -			/*
> -			 * Store the old slb content in paca before flushing.
> -			 * Print this when we go to virtual mode.
> -			 * There are chances that we may hit MCE again if there
> -			 * is a parity error on the SLB entry we trying to read
> -			 * for saving. Hence limit the slb saving to single
> -			 * level of recursion.
> -			 */
> -			if (local_paca->in_mce == 1)
> -				slb_save_contents(local_paca->mce_faulty_slbs);
> -			flush_and_reload_slb();
> -			disposition = RTAS_DISP_FULLY_RECOVERED;
> -			break;
> -		default:
> -			break;
> -		}
> -	} else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
> -		/* Platform corrected itself but could be degraded */
> -		printk(KERN_ERR "MCE: limited recovery, system may "
> -		       "be degraded\n");
> -		disposition = RTAS_DISP_FULLY_RECOVERED;
> -	}
> -#endif
> +static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
> +{
> +	struct pseries_errorlog *pseries_log;
> +	struct pseries_mc_errorlog *mce_log = NULL;
> +	int disposition = rtas_error_disposition(errp);
> +	u8 error_type, err_sub_type;
> +
> +	if (!rtas_error_extended(errp))
> +		goto out;
> +
> +	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
> +	if (!pseries_log)
> +		goto out;
> +
> +	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
> +	error_type = mce_log->error_type;
> +	err_sub_type = rtas_mc_error_sub_type(mce_log);
> +
> +	disposition = mce_handle_err_realmode(disposition, error_type);
>  
> -out:
>  	/*
>  	 * Enable translation as we will be accessing per-cpu variables
>  	 * in save_mce_event() which may fall outside RMO region, also
> @@ -724,10 +745,10 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>  	 * Note: All the realmode handling like flushing SLB entries for
>  	 *       SLB multihit is done by now.
>  	 */
> +out:
>  	mtmsr(mfmsr() | MSR_IR | MSR_DR);
> -	save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
> -			&mce_err, regs->nip, eaddr, paddr);
> -
> +	disposition = mce_handle_err_virtmode(regs, errp, mce_log,
> +					      disposition);
>  	return disposition;
>  }
>  
> -- 
> 2.17.2
> 
> 

^ permalink raw reply

* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
From: Nicholas Piggin @ 2020-07-21 10:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jens Axboe, linux-arch, Arnd Bergmann, Peter Zijlstra, x86,
	linux-kernel, Andy Lutomirski, linux-mm, Andy Lutomirski,
	linuxppc-dev
In-Reply-To: <2055788870.20749.1595263590675.JavaMail.zimbra@efficios.com>

Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
> ----- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npiggin@gmail.com wrote:
> 
>> Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
>>> ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote:
>>> [...]
>>>> 
>>>> membarrier does replace barrier instructions on remote CPUs, which do
>>>> order accesses performed by the kernel on the user address space. So
>>>> membarrier should too I guess.
>>>> 
>>>> Normal process context accesses like read(2) will do so because they
>>>> don't get filtered out from IPIs, but kernel threads using the mm may
>>>> not.
>>> 
>>> But it should not be an issue, because membarrier's ordering is only with
>>> respect
>>> to submit and completion of io_uring requests, which are performed through
>>> system calls from the context of user-space threads, which are called from the
>>> right mm.
>> 
>> Is that true? Can io completions be written into an address space via a
>> kernel thread? I don't know the io_uring code well but it looks like
>> that's asynchonously using the user mm context.
> 
> Indeed, the io completion appears to be signaled asynchronously between kernel
> and user-space.

Yep, many other places do similar with use_mm.

[snip]

> So as far as membarrier memory ordering dependencies are concerned, it relies
> on the store-release/load-acquire dependency chain in the completion queue to
> order against anything that was done prior to the completed requests.
> 
> What is in-flight while the requests are being serviced provides no memory
> ordering guarantee whatsoever.

Yeah you're probably right in this case I think. Quite likely most kernel 
tasks that asynchronously write to user memory would at least have some 
kind of producer-consumer barriers.

But is that restriction of all async modifications documented and enforced
anywhere?

>> How about other memory accesses via kthread_use_mm? Presumably there is
>> still ordering requirement there for membarrier,
> 
> Please provide an example case with memory accesses via kthread_use_mm where
> ordering matters to support your concern.

I think the concern Andy raised with io_uring was less a specific 
problem he saw and more a general concern that we have these memory 
accesses which are not synchronized with membarrier.

>> so I really think
>> it's a fragile interface with no real way for the user to know how
>> kernel threads may use its mm for any particular reason, so membarrier
>> should synchronize all possible kernel users as well.
> 
> I strongly doubt so, but perhaps something should be clarified in the documentation
> if you have that feeling.

I'd rather go the other way and say if you have reasoning or numbers for 
why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
then we could think about keeping this subtlety with appropriate 
documentation added, otherwise we can just kill it and remove all doubt.

That being said, the x86 sync core gap that I imagined could be fixed 
by changing to rq->curr == rq->idle test does not actually exist because
the global membarrier does not have a sync core option. So fixing the
exit_lazy_tlb points that this series does *should* fix that. So
PF_KTHREAD may be less problematic than I thought from implementation
point of view, only semantics.

Thanks,
Nick

^ permalink raw reply

* [PATCH trivial] ppc64/mm: remove comment that is no longer valid
From: Santosh Sivaraj @ 2020-07-21  9:19 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Santosh Sivaraj

hash_low_64.S was removed in [1] and since flush_hash_page is not called
from any assembly routine.

[1]: commit a43c0eb8364c0 ("powerpc/mm: Convert 4k insert from asm to C")

Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
---
 arch/powerpc/mm/book3s64/hash_utils.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 468169e33c86f..90ee0be3281a9 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1706,10 +1706,6 @@ unsigned long pte_get_hash_gslot(unsigned long vpn, unsigned long shift,
 	return gslot;
 }
 
-/*
- * WARNING: This is called from hash_low_64.S, if you change this prototype,
- *          do not forget to update the assembly call site !
- */
 void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, int ssize,
 		     unsigned long flags)
 {
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH v4 09/10] powerpc/watchpoint: Return available watchpoints dynamically
From: Ravi Bangoria @ 2020-07-21  8:15 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: Christophe Leroy, miltonm, mikey, Ravi Bangoria, peterz, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, fweisbec,
	pedromfc, naveen.n.rao, linuxppc-dev, mingo
In-Reply-To: <CACzsE9q5YtT_bXOpw9cri_UCxziW_FRbCpcViANaZwui0hjDqw@mail.gmail.com>


>>>> @@ -46,7 +47,7 @@ struct arch_hw_breakpoint {
>>>>
>>>>    static inline int nr_wp_slots(void)
>>>>    {
>>>> -       return HBP_NUM_MAX;
>>>> +       return cpu_has_feature(CPU_FTR_DAWR1) ? HBP_NUM_TWO : HBP_NUM_ONE;
>>> So it'd be something like:
>>> +       return cpu_has_feature(CPU_FTR_DAWR1) ? HBP_NUM_MAX : 1;
>>> But thinking that there might be more slots added in the future, it
>>> may be better to make the number of slots a variable that is set
>>> during the init and then have this function return that.
>>
>> Not sure I follow. What do you mean by setting number of slots a
>> variable that is set during the init?
> Sorry I was unclear there.
> I was just looking and saw arm also has a variable number of hw breakpoints.
> If we did something like how they handle it, it might look something like:
> 
> static int num_wp_slots __ro_after_init;
> 
> int nr_wp_slots(void) {
>      return num_wp_slots;
> }
> 
> static int __init arch_hw_breakpoint_init(void) {
>      num_wp_slots = work out how many wp_slots
> }
> arch_initcall(arch_hw_breakpoint_init);
> 
> Then we wouldn't have to calculate everytime nr_wp_slots() is called.
> In the future if more wp's are added nr_wp_slots() will get more complicated.
> But just an idea, feel free to ignore.

Ok I got the idea. But ARM arch_hw_breakpoint_init() is much more complex
compared to our nr_wp_slots(). I don't see any benefit by making our code
like ARM.

Thanks for the idea though :)
Ravi

^ permalink raw reply

* Re: [PATCH v4 05/10] powerpc/dt_cpu_ftrs: Add feature for 2nd DAWR
From: Ravi Bangoria @ 2020-07-21  7:51 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: Christophe Leroy, apopple, mikey, miltonm, peterz, oleg,
	Nicholas Piggin, linux-kernel, Paul Mackerras, jolsa, fweisbec,
	pedromfc, naveen.n.rao, linuxppc-dev, mingo, Ravi Bangoria
In-Reply-To: <CACzsE9oE+OMnWEXvbZZbq35YzpSzCbBHWEJcjtCgkcq-YrABng@mail.gmail.com>



On 7/17/20 11:14 AM, Jordan Niethe wrote:
> On Fri, Jul 17, 2020 at 2:10 PM Ravi Bangoria
> <ravi.bangoria@linux.ibm.com> wrote:
>>
>> Add new device-tree feature for 2nd DAWR. If this feature is present,
>> 2nd DAWR is supported, otherwise not.
>>
>> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
>> ---
>>   arch/powerpc/include/asm/cputable.h | 7 +++++--
>>   arch/powerpc/kernel/dt_cpu_ftrs.c   | 7 +++++++
>>   2 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
>> index e506d429b1af..3445c86e1f6f 100644
>> --- a/arch/powerpc/include/asm/cputable.h
>> +++ b/arch/powerpc/include/asm/cputable.h
>> @@ -214,6 +214,7 @@ static inline void cpu_feature_keys_init(void) { }
>>   #define CPU_FTR_P9_TLBIE_ERAT_BUG      LONG_ASM_CONST(0x0001000000000000)
>>   #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002000000000000)
>>   #define CPU_FTR_ARCH_31                        LONG_ASM_CONST(0x0004000000000000)
>> +#define CPU_FTR_DAWR1                  LONG_ASM_CONST(0x0008000000000000)
>>
>>   #ifndef __ASSEMBLY__
>>
>> @@ -497,14 +498,16 @@ static inline void cpu_feature_keys_init(void) { }
>>   #define CPU_FTRS_POSSIBLE      \
>>              (CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | CPU_FTRS_POWER8 | \
>>               CPU_FTR_ALTIVEC_COMP | CPU_FTR_VSX_COMP | CPU_FTRS_POWER9 | \
>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>> +            CPU_FTR_DAWR1)
>>   #else
>>   #define CPU_FTRS_POSSIBLE      \
>>              (CPU_FTRS_PPC970 | CPU_FTRS_POWER5 | \
>>               CPU_FTRS_POWER6 | CPU_FTRS_POWER7 | CPU_FTRS_POWER8E | \
>>               CPU_FTRS_POWER8 | CPU_FTRS_CELL | CPU_FTRS_PA6T | \
>>               CPU_FTR_VSX_COMP | CPU_FTR_ALTIVEC_COMP | CPU_FTRS_POWER9 | \
>> -            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10)
>> +            CPU_FTRS_POWER9_DD2_1 | CPU_FTRS_POWER9_DD2_2 | CPU_FTRS_POWER10 | \
>> +            CPU_FTR_DAWR1)
> Instead of putting CPU_FTR_DAWR1 into CPU_FTRS_POSSIBLE should it go
> into CPU_FTRS_POWER10?
> Then it will be picked up by CPU_FTRS_POSSIBLE.

I remember a discussion about this with Mikey and we decided to do it
this way. Obviously, the purpose is to make CPU_FTR_DAWR1 independent of
CPU_FTRS_POWER10 because DAWR1 is an optional feature in p10. I fear
including CPU_FTR_DAWR1 in CPU_FTRS_POWER10 can make it forcefully enabled
even when device-tree property is not present or pa-feature bit it not set,
because we do:

       {       /* 3.1-compliant processor, i.e. Power10 "architected" mode */
               .pvr_mask               = 0xffffffff,
               .pvr_value              = 0x0f000006,
               .cpu_name               = "POWER10 (architected)",
               .cpu_features           = CPU_FTRS_POWER10,

>>   #endif /* CONFIG_CPU_LITTLE_ENDIAN */
>>   #endif
>>   #else
>> diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
>> index ac650c233cd9..c78cd3596ec4 100644
>> --- a/arch/powerpc/kernel/dt_cpu_ftrs.c
>> +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
>> @@ -574,6 +574,12 @@ static int __init feat_enable_mma(struct dt_cpu_feature *f)
>>          return 1;
>>   }
>>
>> +static int __init feat_enable_debug_facilities_v31(struct dt_cpu_feature *f)
>> +{
>> +       cur_cpu_spec->cpu_features |= CPU_FTR_DAWR1;
>> +       return 1;
>> +}
>> +
>>   struct dt_cpu_feature_match {
>>          const char *name;
>>          int (*enable)(struct dt_cpu_feature *f);
>> @@ -649,6 +655,7 @@ static struct dt_cpu_feature_match __initdata
>>          {"wait-v3", feat_enable, 0},
>>          {"prefix-instructions", feat_enable, 0},
>>          {"matrix-multiply-assist", feat_enable_mma, 0},
>> +       {"debug-facilities-v31", feat_enable_debug_facilities_v31, 0},
> Since all feat_enable_debug_facilities_v31() does is set
> CPU_FTR_DAWR1, if you just have:
> {"debug-facilities-v31", feat_enable, CPU_FTR_DAWR1},
> I think cpufeatures_process_feature() should set it in for you at this point:
>              if (m->enable(f)) {
>                  cur_cpu_spec->cpu_features |= m->cpu_ftr_bit_mask;
>                  break;
>              }

Yes, that seems a better option.

Thanks,
Ravi

^ permalink raw reply

* Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Laurent Dufour @ 2020-07-21  7:22 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: aik, Ram Pai, kvm-ppc, bharata, sathnaga, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <20200720202452.GN30544@gate.crashing.org>

Le 20/07/2020 à 22:24, Segher Boessenkool a écrit :
> On Mon, Jul 20, 2020 at 03:10:41PM -0500, Segher Boessenkool wrote:
>> On Mon, Jul 20, 2020 at 11:39:56AM +0200, Laurent Dufour wrote:
>>> Le 16/07/2020 à 10:32, Ram Pai a écrit :
>>>> +	if (is_secure_guest()) {					\
>>>> +		__asm__ __volatile__("mfsprg0 %3;"			\
>>>> +				"lnia %2;"				\
>>>> +				"ld %2,12(%2);"				\
>>>> +				"mtsprg0 %2;"				\
>>>> +				"sync;"					\
>>>> +				#insn" %0,%y1;"				\
>>>> +				"twi 0,%0,0;"				\
>>>> +				"isync;"				\
>>>> +				"mtsprg0 %3"				\
>>>> +			: "=r" (ret)					\
>>>> +			: "Z" (*addr), "r" (0), "r" (0)			\
>>>
>>> I'm wondering if SPRG0 is restored to its original value.
>>> You're using the same register (r0) for parameters 2 and 3, so when doing
>>> lnia %2, you're overwriting the SPRG0 value you saved in r0 just earlier.
>>
>> It is putting the value 0 in the registers the compiler chooses for
>> operands 2 and 3.  But operand 3 is written, while the asm says it is an
>> input.  It needs an earlyclobber as well.

Oh nice, I was not aware that compiler may choose registers this way.
Good to know, thanks for the explanation.

>>> It may be clearer to use explicit registers for %2 and %3 and to mark them
>>> as modified for the compiler.
>>
>> That is not a good idea, imnsho.
> 
> (The explicit register number part, I mean; operand 2 should be an
> output as well, yes.)

Sure if the compiler can choose the registers that's far better.

Cheers,
Laurent.

^ permalink raw reply

* Re: [PATCH 09/20] Documentation: i2c: eliminate duplicated word
From: Wolfram Sang @ 2020-07-21  6:27 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: kvm, linux-doc, David Airlie, kgdb-bugreport, linux-fpga,
	Liviu Dudau, dri-devel, Douglas Anderson, Paul Cercueil, keyrings,
	Paul Mackerras, linux-i2c, Pavel Machek, Srinivas Pandruvada,
	Mihail Atanassov, linux-leds, linux-s390, Daniel Thompson,
	linux-scsi, Jonathan Corbet, Masahiro Yamada, Matthew Wilcox,
	Halil Pasic, Jarkko Sakkinen, James Wang, linux-input,
	Mali DP Maintainers, Derek Kiernan, linux-mips, Dragan Cvetic,
	Wu Hao, Tony Krowiak, linux-kbuild, James E.J. Bottomley,
	Jiri Kosina, Hannes Reinecke, linux-block, Thomas Bogendoerfer,
	Jacek Anaszewski, linux-mm, Dan Williams, Andrew Morton,
	Mimi Zohar, Jens Axboe, Michal Marek, Martin K. Petersen,
	Pierre Morel, linux-kernel, Daniel Vetter, Jason Wessel,
	Paolo Bonzini, linux-integrity, linuxppc-dev, Mike Rapoport,
	Dan Murphy
In-Reply-To: <20200707180414.10467-10-rdunlap@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 341 bytes --]

On Tue, Jul 07, 2020 at 11:04:03AM -0700, Randy Dunlap wrote:
> Drop doubled word "new".
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: linux-doc@vger.kernel.org
> Cc: Wolfram Sang <wsa@kernel.org>
> Cc: linux-i2c@vger.kernel.org

Reviewed-by: Wolfram Sang <wsa@kernel.org>


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [v3 15/15] tools/perf: Add perf tools support for extended regs in power10
From: kajoljain @ 2020-07-21  6:04 UTC (permalink / raw)
  To: Athira Rajeev, mpe, acme, jolsa
  Cc: ego, mikey, maddy, kvm, kvm-ppc, svaidyan, linuxppc-dev
In-Reply-To: <1594996707-3727-16-git-send-email-atrajeev@linux.vnet.ibm.com>



On 7/17/20 8:08 PM, Athira Rajeev wrote:
> Added support for supported regs which are new in power10
> ( MMCR3, SIER2, SIER3 ) to sample_reg_mask in the tool side
> to use with `-I?` option. Also added PVR check to send extended
> mask for power10 at kernel while capturing extended regs in
> each sample.
> 
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>  tools/arch/powerpc/include/uapi/asm/perf_regs.h | 6 ++++++
>  tools/perf/arch/powerpc/include/perf_regs.h     | 3 +++
>  tools/perf/arch/powerpc/util/perf_regs.c        | 6 ++++++
>  3 files changed, 15 insertions(+)
> 


Reviewed-by: Kajol Jain <kjain@linux.ibm.com>

Thanks,
Kajol Jain

> diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> index 225c64c..bdf5f10 100644
> --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> @@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
>  	PERF_REG_POWERPC_MMCR0,
>  	PERF_REG_POWERPC_MMCR1,
>  	PERF_REG_POWERPC_MMCR2,
> +	PERF_REG_POWERPC_MMCR3,
> +	PERF_REG_POWERPC_SIER2,
> +	PERF_REG_POWERPC_SIER3,
>  	/* Max regs without the extended regs */
>  	PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>  };
> @@ -60,6 +63,9 @@ enum perf_event_powerpc_regs {
>  
>  /* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
>  #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) - PERF_REG_PMU_MASK)
> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
> +#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) - PERF_REG_PMU_MASK)
>  
>  #define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
> +#define PERF_REG_MAX_ISA_31    (PERF_REG_POWERPC_SIER3 + 1)
>  #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
> diff --git a/tools/perf/arch/powerpc/include/perf_regs.h b/tools/perf/arch/powerpc/include/perf_regs.h
> index 46ed00d..63f3ac9 100644
> --- a/tools/perf/arch/powerpc/include/perf_regs.h
> +++ b/tools/perf/arch/powerpc/include/perf_regs.h
> @@ -68,6 +68,9 @@
>  	[PERF_REG_POWERPC_MMCR0] = "mmcr0",
>  	[PERF_REG_POWERPC_MMCR1] = "mmcr1",
>  	[PERF_REG_POWERPC_MMCR2] = "mmcr2",
> +	[PERF_REG_POWERPC_MMCR3] = "mmcr3",
> +	[PERF_REG_POWERPC_SIER2] = "sier2",
> +	[PERF_REG_POWERPC_SIER3] = "sier3",
>  };
>  
>  static inline const char *perf_reg_name(int id)
> diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c
> index d64ba0c..2b6d470 100644
> --- a/tools/perf/arch/powerpc/util/perf_regs.c
> +++ b/tools/perf/arch/powerpc/util/perf_regs.c
> @@ -14,6 +14,7 @@
>  #include <linux/kernel.h>
>  
>  #define PVR_POWER9		0x004E
> +#define PVR_POWER10		0x0080
>  
>  const struct sample_reg sample_reg_masks[] = {
>  	SMPL_REG(r0, PERF_REG_POWERPC_R0),
> @@ -64,6 +65,9 @@
>  	SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
>  	SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
>  	SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
> +	SMPL_REG(mmcr3, PERF_REG_POWERPC_MMCR3),
> +	SMPL_REG(sier2, PERF_REG_POWERPC_SIER2),
> +	SMPL_REG(sier3, PERF_REG_POWERPC_SIER3),
>  	SMPL_REG_END
>  };
>  
> @@ -194,6 +198,8 @@ uint64_t arch__intr_reg_mask(void)
>  	version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
>  	if (version == PVR_POWER9)
>  		extended_mask = PERF_REG_PMU_MASK_300;
> +	else if (version == PVR_POWER10)
> +		extended_mask = PERF_REG_PMU_MASK_31;
>  	else
>  		return mask;
>  
> 

^ permalink raw reply

* Re: [v3 14/15] powerpc/perf: Add extended regs support for power10 platform
From: kajoljain @ 2020-07-21  6:03 UTC (permalink / raw)
  To: Athira Rajeev, mpe, acme, jolsa
  Cc: ego, mikey, maddy, kvm, kvm-ppc, svaidyan, linuxppc-dev
In-Reply-To: <1594996707-3727-15-git-send-email-atrajeev@linux.vnet.ibm.com>



On 7/17/20 8:08 PM, Athira Rajeev wrote:
> Include capability flag `PERF_PMU_CAP_EXTENDED_REGS` for power10
> and expose MMCR3, SIER2, SIER3 registers as part of extended regs.
> Also introduce `PERF_REG_PMU_MASK_31` to define extended mask
> value at runtime for power10
> 
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> [Fix build failure on PPC32 platform]
> Suggested-by: Ryan Grimm <grimm@linux.ibm.com>
> Reported-by: kernel test robot <lkp@intel.com>
> ---
>  arch/powerpc/include/uapi/asm/perf_regs.h |  6 ++++++
>  arch/powerpc/perf/perf_regs.c             | 12 +++++++++++-
>  arch/powerpc/perf/power10-pmu.c           |  6 ++++++
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 


Reviewed-by: Kajol Jain <kjain@linux.ibm.com>

Thanks,
Kajol Jain

> diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h b/arch/powerpc/include/uapi/asm/perf_regs.h
> index 225c64c..bdf5f10 100644
> --- a/arch/powerpc/include/uapi/asm/perf_regs.h
> +++ b/arch/powerpc/include/uapi/asm/perf_regs.h
> @@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
>  	PERF_REG_POWERPC_MMCR0,
>  	PERF_REG_POWERPC_MMCR1,
>  	PERF_REG_POWERPC_MMCR2,
> +	PERF_REG_POWERPC_MMCR3,
> +	PERF_REG_POWERPC_SIER2,
> +	PERF_REG_POWERPC_SIER3,
>  	/* Max regs without the extended regs */
>  	PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>  };
> @@ -60,6 +63,9 @@ enum perf_event_powerpc_regs {
>  
>  /* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
>  #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) - PERF_REG_PMU_MASK)
> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
> +#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 1) - PERF_REG_PMU_MASK)
>  
>  #define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
> +#define PERF_REG_MAX_ISA_31    (PERF_REG_POWERPC_SIER3 + 1)
>  #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
> diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
> index b0cf68f..11b90d5 100644
> --- a/arch/powerpc/perf/perf_regs.c
> +++ b/arch/powerpc/perf/perf_regs.c
> @@ -81,6 +81,14 @@ static u64 get_ext_regs_value(int idx)
>  		return mfspr(SPRN_MMCR1);
>  	case PERF_REG_POWERPC_MMCR2:
>  		return mfspr(SPRN_MMCR2);
> +#ifdef CONFIG_PPC64
> +	case PERF_REG_POWERPC_MMCR3:
> +		return mfspr(SPRN_MMCR3);
> +	case PERF_REG_POWERPC_SIER2:
> +		return mfspr(SPRN_SIER2);
> +	case PERF_REG_POWERPC_SIER3:
> +		return mfspr(SPRN_SIER3);
> +#endif
>  	default: return 0;
>  	}
>  }
> @@ -89,7 +97,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>  {
>  	u64 PERF_REG_EXTENDED_MAX;
>  
> -	if (cpu_has_feature(CPU_FTR_ARCH_300))
> +	if (cpu_has_feature(CPU_FTR_ARCH_31))
> +		PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_31;
> +	else if (cpu_has_feature(CPU_FTR_ARCH_300))
>  		PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_300;
>  
>  	if (idx == PERF_REG_POWERPC_SIER &&
> diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
> index b02aabb..f066ed9 100644
> --- a/arch/powerpc/perf/power10-pmu.c
> +++ b/arch/powerpc/perf/power10-pmu.c
> @@ -87,6 +87,8 @@
>  #define POWER10_MMCRA_IFM3		0x00000000C0000000UL
>  #define POWER10_MMCRA_BHRB_MASK		0x00000000C0000000UL
>  
> +extern u64 PERF_REG_EXTENDED_MASK;
> +
>  /* Table of alternatives, sorted by column 0 */
>  static const unsigned int power10_event_alternatives[][MAX_ALT] = {
>  	{ PM_RUN_CYC_ALT,		PM_RUN_CYC },
> @@ -397,6 +399,7 @@ static void power10_config_bhrb(u64 pmu_bhrb_filter)
>  	.cache_events		= &power10_cache_events,
>  	.attr_groups		= power10_pmu_attr_groups,
>  	.bhrb_nr		= 32,
> +	.capabilities           = PERF_PMU_CAP_EXTENDED_REGS,
>  };
>  
>  int init_power10_pmu(void)
> @@ -408,6 +411,9 @@ int init_power10_pmu(void)
>  	    strcmp(cur_cpu_spec->oprofile_cpu_type, "ppc64/power10"))
>  		return -ENODEV;
>  
> +	/* Set the PERF_REG_EXTENDED_MASK here */
> +	PERF_REG_EXTENDED_MASK = PERF_REG_PMU_MASK_31;
> +
>  	rc = register_power_pmu(&power10_pmu);
>  	if (rc)
>  		return rc;
> 

^ permalink raw reply

* Re: [v3 13/15] tools/perf: Add perf tools support for extended register capability in powerpc
From: kajoljain @ 2020-07-21  6:03 UTC (permalink / raw)
  To: Athira Rajeev, mpe, acme, jolsa
  Cc: ego, mikey, maddy, kvm, kvm-ppc, svaidyan, linuxppc-dev
In-Reply-To: <1594996707-3727-14-git-send-email-atrajeev@linux.vnet.ibm.com>



On 7/17/20 8:08 PM, Athira Rajeev wrote:
> From: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> 
> Add extended regs to sample_reg_mask in the tool side to use
> with `-I?` option. Perf tools side uses extended mask to display
> the platform supported register names (with -I? option) to the user
> and also send this mask to the kernel to capture the extended registers
> in each sample. Hence decide the mask value based on the processor
> version.
> 
> Currently definitions for `mfspr`, `SPRN_PVR` are part of
> `arch/powerpc/util/header.c`. Move this to a header file so that
> these definitions can be re-used in other source files as well.
> 
> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> [Decide extended mask at run time based on platform]
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
> ---
>  tools/arch/powerpc/include/uapi/asm/perf_regs.h | 14 ++++++-
>  tools/perf/arch/powerpc/include/perf_regs.h     |  5 ++-
>  tools/perf/arch/powerpc/util/header.c           |  9 +----
>  tools/perf/arch/powerpc/util/perf_regs.c        | 49 +++++++++++++++++++++++++
>  tools/perf/arch/powerpc/util/utils_header.h     | 15 ++++++++
>  5 files changed, 82 insertions(+), 10 deletions(-)
>  create mode 100644 tools/perf/arch/powerpc/util/utils_header.h
> 


Reviewed-by: Kajol Jain <kjain@linux.ibm.com>

Thanks,
Kajol Jain

> diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> index f599064..225c64c 100644
> --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
> @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
>  	PERF_REG_POWERPC_DSISR,
>  	PERF_REG_POWERPC_SIER,
>  	PERF_REG_POWERPC_MMCRA,
> -	PERF_REG_POWERPC_MAX,
> +	/* Extended registers */
> +	PERF_REG_POWERPC_MMCR0,
> +	PERF_REG_POWERPC_MMCR1,
> +	PERF_REG_POWERPC_MMCR2,
> +	/* Max regs without the extended regs */
> +	PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>  };
> +
> +#define PERF_REG_PMU_MASK	((1ULL << PERF_REG_POWERPC_MAX) - 1)
> +
> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
> +#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) - PERF_REG_PMU_MASK)
> +
> +#define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
>  #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
> diff --git a/tools/perf/arch/powerpc/include/perf_regs.h b/tools/perf/arch/powerpc/include/perf_regs.h
> index e18a355..46ed00d 100644
> --- a/tools/perf/arch/powerpc/include/perf_regs.h
> +++ b/tools/perf/arch/powerpc/include/perf_regs.h
> @@ -64,7 +64,10 @@
>  	[PERF_REG_POWERPC_DAR] = "dar",
>  	[PERF_REG_POWERPC_DSISR] = "dsisr",
>  	[PERF_REG_POWERPC_SIER] = "sier",
> -	[PERF_REG_POWERPC_MMCRA] = "mmcra"
> +	[PERF_REG_POWERPC_MMCRA] = "mmcra",
> +	[PERF_REG_POWERPC_MMCR0] = "mmcr0",
> +	[PERF_REG_POWERPC_MMCR1] = "mmcr1",
> +	[PERF_REG_POWERPC_MMCR2] = "mmcr2",
>  };
>  
>  static inline const char *perf_reg_name(int id)
> diff --git a/tools/perf/arch/powerpc/util/header.c b/tools/perf/arch/powerpc/util/header.c
> index d487007..1a95017 100644
> --- a/tools/perf/arch/powerpc/util/header.c
> +++ b/tools/perf/arch/powerpc/util/header.c
> @@ -7,17 +7,10 @@
>  #include <string.h>
>  #include <linux/stringify.h>
>  #include "header.h"
> +#include "utils_header.h"
>  #include "metricgroup.h"
>  #include <api/fs/fs.h>
>  
> -#define mfspr(rn)       ({unsigned long rval; \
> -			 asm volatile("mfspr %0," __stringify(rn) \
> -				      : "=r" (rval)); rval; })
> -
> -#define SPRN_PVR        0x11F	/* Processor Version Register */
> -#define PVR_VER(pvr)    (((pvr) >>  16) & 0xFFFF) /* Version field */
> -#define PVR_REV(pvr)    (((pvr) >>   0) & 0xFFFF) /* Revison field */
> -
>  int
>  get_cpuid(char *buffer, size_t sz)
>  {
> diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c
> index 0a52429..d64ba0c 100644
> --- a/tools/perf/arch/powerpc/util/perf_regs.c
> +++ b/tools/perf/arch/powerpc/util/perf_regs.c
> @@ -6,9 +6,15 @@
>  
>  #include "../../../util/perf_regs.h"
>  #include "../../../util/debug.h"
> +#include "../../../util/event.h"
> +#include "../../../util/header.h"
> +#include "../../../perf-sys.h"
> +#include "utils_header.h"
>  
>  #include <linux/kernel.h>
>  
> +#define PVR_POWER9		0x004E
> +
>  const struct sample_reg sample_reg_masks[] = {
>  	SMPL_REG(r0, PERF_REG_POWERPC_R0),
>  	SMPL_REG(r1, PERF_REG_POWERPC_R1),
> @@ -55,6 +61,9 @@
>  	SMPL_REG(dsisr, PERF_REG_POWERPC_DSISR),
>  	SMPL_REG(sier, PERF_REG_POWERPC_SIER),
>  	SMPL_REG(mmcra, PERF_REG_POWERPC_MMCRA),
> +	SMPL_REG(mmcr0, PERF_REG_POWERPC_MMCR0),
> +	SMPL_REG(mmcr1, PERF_REG_POWERPC_MMCR1),
> +	SMPL_REG(mmcr2, PERF_REG_POWERPC_MMCR2),
>  	SMPL_REG_END
>  };
>  
> @@ -163,3 +172,43 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
>  
>  	return SDT_ARG_VALID;
>  }
> +
> +uint64_t arch__intr_reg_mask(void)
> +{
> +	struct perf_event_attr attr = {
> +		.type                   = PERF_TYPE_HARDWARE,
> +		.config                 = PERF_COUNT_HW_CPU_CYCLES,
> +		.sample_type            = PERF_SAMPLE_REGS_INTR,
> +		.precise_ip             = 1,
> +		.disabled               = 1,
> +		.exclude_kernel         = 1,
> +	};
> +	int fd;
> +	u32 version;
> +	u64 extended_mask = 0, mask = PERF_REGS_MASK;
> +
> +	/*
> +	 * Get the PVR value to set the extended
> +	 * mask specific to platform.
> +	 */
> +	version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
> +	if (version == PVR_POWER9)
> +		extended_mask = PERF_REG_PMU_MASK_300;
> +	else
> +		return mask;
> +
> +	attr.sample_regs_intr = extended_mask;
> +	attr.sample_period = 1;
> +	event_attr_init(&attr);
> +
> +	/*
> +	 * check if the pmu supports perf extended regs, before
> +	 * returning the register mask to sample.
> +	 */
> +	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
> +	if (fd != -1) {
> +		close(fd);
> +		mask |= extended_mask;
> +	}
> +	return mask;
> +}
> diff --git a/tools/perf/arch/powerpc/util/utils_header.h b/tools/perf/arch/powerpc/util/utils_header.h
> new file mode 100644
> index 0000000..5788eb1
> --- /dev/null
> +++ b/tools/perf/arch/powerpc/util/utils_header.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __PERF_UTIL_HEADER_H
> +#define __PERF_UTIL_HEADER_H
> +
> +#include <linux/stringify.h>
> +
> +#define mfspr(rn)       ({unsigned long rval; \
> +			asm volatile("mfspr %0," __stringify(rn) \
> +				: "=r" (rval)); rval; })
> +
> +#define SPRN_PVR        0x11F   /* Processor Version Register */
> +#define PVR_VER(pvr)    (((pvr) >>  16) & 0xFFFF) /* Version field */
> +#define PVR_REV(pvr)    (((pvr) >>   0) & 0xFFFF) /* Revison field */
> +
> +#endif /* __PERF_UTIL_HEADER_H */
> 

^ permalink raw reply

* Re: [v3 12/15] powerpc/perf: Add support for outputting extended regs in perf intr_regs
From: kajoljain @ 2020-07-21  6:02 UTC (permalink / raw)
  To: Athira Rajeev, mpe, acme, jolsa
  Cc: ego, mikey, maddy, kvm, kvm-ppc, svaidyan, linuxppc-dev
In-Reply-To: <1594996707-3727-13-git-send-email-atrajeev@linux.vnet.ibm.com>



On 7/17/20 8:08 PM, Athira Rajeev wrote:
> From: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> 
> Add support for perf extended register capability in powerpc.
> The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to indicate the
> PMU which support extended registers. The generic code define the mask
> of extended registers as 0 for non supported architectures.
> 
> Patch adds extended regs support for power9 platform by
> exposing MMCR0, MMCR1 and MMCR2 registers.
> 
> REG_RESERVED mask needs update to include extended regs.
> `PERF_REG_EXTENDED_MASK`, contains mask value of the supported registers,
> is defined at runtime in the kernel based on platform since the supported
> registers may differ from one processor version to another and hence the
> MASK value.
> 
> with patch
> ----------
> 
> available registers: r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11
> r12 r13 r14 r15 r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26
> r27 r28 r29 r30 r31 nip msr orig_r3 ctr link xer ccr softe
> trap dar dsisr sier mmcra mmcr0 mmcr1 mmcr2
> 
> PERF_RECORD_SAMPLE(IP, 0x1): 4784/4784: 0 period: 1 addr: 0
> ... intr regs: mask 0xffffffffffff ABI 64-bit
> .... r0    0xc00000000012b77c
> .... r1    0xc000003fe5e03930
> .... r2    0xc000000001b0e000
> .... r3    0xc000003fdcddf800
> .... r4    0xc000003fc7880000
> .... r5    0x9c422724be
> .... r6    0xc000003fe5e03908
> .... r7    0xffffff63bddc8706
> .... r8    0x9e4
> .... r9    0x0
> .... r10   0x1
> .... r11   0x0
> .... r12   0xc0000000001299c0
> .... r13   0xc000003ffffc4800
> .... r14   0x0
> .... r15   0x7fffdd8b8b00
> .... r16   0x0
> .... r17   0x7fffdd8be6b8
> .... r18   0x7e7076607730
> .... r19   0x2f
> .... r20   0xc00000001fc26c68
> .... r21   0xc0002041e4227e00
> .... r22   0xc00000002018fb60
> .... r23   0x1
> .... r24   0xc000003ffec4d900
> .... r25   0x80000000
> .... r26   0x0
> .... r27   0x1
> .... r28   0x1
> .... r29   0xc000000001be1260
> .... r30   0x6008010
> .... r31   0xc000003ffebb7218
> .... nip   0xc00000000012b910
> .... msr   0x9000000000009033
> .... orig_r3 0xc00000000012b86c
> .... ctr   0xc0000000001299c0
> .... link  0xc00000000012b77c
> .... xer   0x0
> .... ccr   0x28002222
> .... softe 0x1
> .... trap  0xf00
> .... dar   0x0
> .... dsisr 0x80000000000
> .... sier  0x0
> .... mmcra 0x80000000000
> .... mmcr0 0x82008090
> .... mmcr1 0x1e000000
> .... mmcr2 0x0
>  ... thread: perf:4784
> 
> Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
> [Defined PERF_REG_EXTENDED_MASK at run time to add support for different platforms ]
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
> ---

Patch looks good to me.

Reviewed-by: Kajol Jain <kjain@linux.ibm.com>

Thanks,
Kajol Jain

>  arch/powerpc/include/asm/perf_event_server.h |  8 +++++++
>  arch/powerpc/include/uapi/asm/perf_regs.h    | 14 +++++++++++-
>  arch/powerpc/perf/core-book3s.c              |  1 +
>  arch/powerpc/perf/perf_regs.c                | 34 +++++++++++++++++++++++++---
>  arch/powerpc/perf/power9-pmu.c               |  6 +++++
>  5 files changed, 59 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
> index 832450a..bf85d1a 100644
> --- a/arch/powerpc/include/asm/perf_event_server.h
> +++ b/arch/powerpc/include/asm/perf_event_server.h
> @@ -15,6 +15,9 @@
>  #define MAX_EVENT_ALTERNATIVES	8
>  #define MAX_LIMITED_HWCOUNTERS	2
>  
> +extern u64 PERF_REG_EXTENDED_MASK;
> +#define PERF_REG_EXTENDED_MASK	PERF_REG_EXTENDED_MASK
> +
>  struct perf_event;
>  
>  struct mmcr_regs {
> @@ -62,6 +65,11 @@ struct power_pmu {
>  	int 		*blacklist_ev;
>  	/* BHRB entries in the PMU */
>  	int		bhrb_nr;
> +	/*
> +	 * set this flag with `PERF_PMU_CAP_EXTENDED_REGS` if
> +	 * the pmu supports extended perf regs capability
> +	 */
> +	int		capabilities;
>  };
>  
>  /*
> diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h b/arch/powerpc/include/uapi/asm/perf_regs.h
> index f599064..225c64c 100644
> --- a/arch/powerpc/include/uapi/asm/perf_regs.h
> +++ b/arch/powerpc/include/uapi/asm/perf_regs.h
> @@ -48,6 +48,18 @@ enum perf_event_powerpc_regs {
>  	PERF_REG_POWERPC_DSISR,
>  	PERF_REG_POWERPC_SIER,
>  	PERF_REG_POWERPC_MMCRA,
> -	PERF_REG_POWERPC_MAX,
> +	/* Extended registers */
> +	PERF_REG_POWERPC_MMCR0,
> +	PERF_REG_POWERPC_MMCR1,
> +	PERF_REG_POWERPC_MMCR2,
> +	/* Max regs without the extended regs */
> +	PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>  };
> +
> +#define PERF_REG_PMU_MASK	((1ULL << PERF_REG_POWERPC_MAX) - 1)
> +
> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
> +#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 1) - PERF_REG_PMU_MASK)
> +
> +#define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
>  #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 31c0535..d5a9529 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2316,6 +2316,7 @@ int register_power_pmu(struct power_pmu *pmu)
>  		pmu->name);
>  
>  	power_pmu.attr_groups = ppmu->attr_groups;
> +	power_pmu.capabilities |= (ppmu->capabilities & PERF_PMU_CAP_EXTENDED_REGS);
>  
>  #ifdef MSR_HV
>  	/*
> diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
> index a213a0a..b0cf68f 100644
> --- a/arch/powerpc/perf/perf_regs.c
> +++ b/arch/powerpc/perf/perf_regs.c
> @@ -13,9 +13,11 @@
>  #include <asm/ptrace.h>
>  #include <asm/perf_regs.h>
>  
> +u64 PERF_REG_EXTENDED_MASK;
> +
>  #define PT_REGS_OFFSET(id, r) [id] = offsetof(struct pt_regs, r)
>  
> -#define REG_RESERVED (~((1ULL << PERF_REG_POWERPC_MAX) - 1))
> +#define REG_RESERVED (~(PERF_REG_EXTENDED_MASK | PERF_REG_PMU_MASK))
>  
>  static unsigned int pt_regs_offset[PERF_REG_POWERPC_MAX] = {
>  	PT_REGS_OFFSET(PERF_REG_POWERPC_R0,  gpr[0]),
> @@ -69,10 +71,26 @@
>  	PT_REGS_OFFSET(PERF_REG_POWERPC_MMCRA, dsisr),
>  };
>  
> +/* Function to return the extended register values */
> +static u64 get_ext_regs_value(int idx)
> +{
> +	switch (idx) {
> +	case PERF_REG_POWERPC_MMCR0:
> +		return mfspr(SPRN_MMCR0);
> +	case PERF_REG_POWERPC_MMCR1:
> +		return mfspr(SPRN_MMCR1);
> +	case PERF_REG_POWERPC_MMCR2:
> +		return mfspr(SPRN_MMCR2);
> +	default: return 0;
> +	}
> +}
> +
>  u64 perf_reg_value(struct pt_regs *regs, int idx)
>  {
> -	if (WARN_ON_ONCE(idx >= PERF_REG_POWERPC_MAX))
> -		return 0;
> +	u64 PERF_REG_EXTENDED_MAX;
> +
> +	if (cpu_has_feature(CPU_FTR_ARCH_300))
> +		PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_300;
>  
>  	if (idx == PERF_REG_POWERPC_SIER &&
>  	   (IS_ENABLED(CONFIG_FSL_EMB_PERF_EVENT) ||
> @@ -85,6 +103,16 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>  	    IS_ENABLED(CONFIG_PPC32)))
>  		return 0;
>  
> +	if (idx >= PERF_REG_POWERPC_MAX && idx < PERF_REG_EXTENDED_MAX)
> +		return get_ext_regs_value(idx);
> +
> +	/*
> +	 * If the idx is referring to value beyond the
> +	 * supported registers, return 0 with a warning
> +	 */
> +	if (WARN_ON_ONCE(idx >= PERF_REG_EXTENDED_MAX))
> +		return 0;
> +
>  	return regs_get_register(regs, pt_regs_offset[idx]);
>  }
>  
> diff --git a/arch/powerpc/perf/power9-pmu.c b/arch/powerpc/perf/power9-pmu.c
> index 05dae38..2a57e93 100644
> --- a/arch/powerpc/perf/power9-pmu.c
> +++ b/arch/powerpc/perf/power9-pmu.c
> @@ -90,6 +90,8 @@ enum {
>  #define POWER9_MMCRA_IFM3		0x00000000C0000000UL
>  #define POWER9_MMCRA_BHRB_MASK		0x00000000C0000000UL
>  
> +extern u64 PERF_REG_EXTENDED_MASK;
> +
>  /* Nasty Power9 specific hack */
>  #define PVR_POWER9_CUMULUS		0x00002000
>  
> @@ -434,6 +436,7 @@ static void power9_config_bhrb(u64 pmu_bhrb_filter)
>  	.cache_events		= &power9_cache_events,
>  	.attr_groups		= power9_pmu_attr_groups,
>  	.bhrb_nr		= 32,
> +	.capabilities           = PERF_PMU_CAP_EXTENDED_REGS,
>  };
>  
>  int init_power9_pmu(void)
> @@ -457,6 +460,9 @@ int init_power9_pmu(void)
>  		}
>  	}
>  
> +	/* Set the PERF_REG_EXTENDED_MASK here */
> +	PERF_REG_EXTENDED_MASK = PERF_REG_PMU_MASK_300;
> +
>  	rc = register_power_pmu(&power9_pmu);
>  	if (rc)
>  		return rc;
> 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox