LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5 06/10] powerpc/smp: Optimize start_secondary
From: Srikar Dronamraju @ 2020-08-10  7:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200810071834.92514-1-srikar@linux.vnet.ibm.com>

In start_secondary, even if shared_cache was already set, system does a
redundant match for cpumask. This redundant check can be removed by
checking if shared_cache is already set.

While here, localize the sibling_mask variable to within the if
condition.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4 ->v5:
	Retain cache domain, no need for generalization
		 (Michael Ellerman, Peter Zijlstra,
		 Valentin Schneider, Gautham R. Shenoy)

Changelog v1 -> v2:
	Moved shared_cache topology fixup to fixup_topology (Gautham)

 arch/powerpc/kernel/smp.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 0c960ce3be42..91cf5d05e7ec 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -851,7 +851,7 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	return cpu_l2_cache_mask(cpu);
+	return per_cpu(cpu_l2_cache_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -1305,7 +1305,6 @@ static void add_cpu_to_masks(int cpu)
 void start_secondary(void *unused)
 {
 	unsigned int cpu = smp_processor_id();
-	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
 
 	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
@@ -1331,14 +1330,20 @@ void start_secondary(void *unused)
 	/* Update topology CPU masks */
 	add_cpu_to_masks(cpu);
 
-	if (has_big_cores)
-		sibling_mask = cpu_smallcore_mask;
 	/*
 	 * Check for any shared caches. Note that this must be done on a
 	 * per-core basis because one core in the pair might be disabled.
 	 */
-	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
-		shared_caches = true;
+	if (!shared_caches) {
+		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+		struct cpumask *mask = cpu_l2_cache_mask(cpu);
+
+		if (has_big_cores)
+			sibling_mask = cpu_smallcore_mask;
+
+		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
+			shared_caches = true;
+	}
 
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-- 
2.18.2


^ permalink raw reply related

* [PATCH v5 04/10] powerpc/smp: Move topology fixups into a new function
From: Srikar Dronamraju @ 2020-08-10  7:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200810071834.92514-1-srikar@linux.vnet.ibm.com>

Move topology fixup based on the platform attributes into its own
function which is called just before set_sched_topology.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2 -> v3:
	Rewrote changelog (Gautham)
	Renamed to powerpc/smp: Move topology fixups into  a new function

 arch/powerpc/kernel/smp.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 39224a042468..b13161a5ffc3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1362,6 +1362,16 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
+static void fixup_topology(void)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (has_big_cores) {
+		pr_info("Big cores detected but using small core scheduling\n");
+		powerpc_topology[0].mask = smallcore_smt_mask;
+	}
+#endif
+}
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	/*
@@ -1375,12 +1385,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-#ifdef CONFIG_SCHED_SMT
-	if (has_big_cores) {
-		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[0].mask = smallcore_smt_mask;
-	}
-#endif
+	fixup_topology();
 	set_sched_topology(powerpc_topology);
 }
 
-- 
2.18.2


^ permalink raw reply related

* [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-08-10  7:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200810071834.92514-1-srikar@linux.vnet.ibm.com>

Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Set cpumask after verifying l2-cache. (Gautham)

 arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b13161a5ffc3..0c960ce3be42 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1188,6 +1188,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 	if (!l2_cache)
 		return false;
 
+	cpumask_set_cpu(cpu, mask_fn(cpu));
 	for_each_cpu(i, cpu_online_mask) {
 		/*
 		 * when updating the marks the current CPU has not been marked
@@ -1270,29 +1271,30 @@ static void add_cpu_to_masks(int cpu)
 	 * add it to it's own thread sibling mask.
 	 */
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++)
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_sibling_mask);
 
 	add_cpu_to_smallcore_masks(cpu);
-	/*
-	 * Copy the thread sibling mask into the cache sibling mask
-	 * and mark any CPUs that share an L2 with this CPU.
-	 */
-	for_each_cpu(i, cpu_sibling_mask(cpu))
-		set_cpus_related(cpu, i, cpu_l2_cache_mask);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
-	/*
-	 * Copy the cache sibling mask into core sibling mask and mark
-	 * any CPUs on the same chip as this CPU.
-	 */
-	for_each_cpu(i, cpu_l2_cache_mask(cpu))
-		set_cpus_related(cpu, i, cpu_core_mask);
+	if (pkg_id == -1) {
+		struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+		/*
+		 * Copy the sibling mask into core sibling mask and
+		 * mark any CPUs on the same chip as this CPU.
+		 */
+		if (shared_caches)
+			mask = cpu_l2_cache_mask;
+
+		for_each_cpu(i, mask(cpu))
+			set_cpus_related(cpu, i, cpu_core_mask);
 
-	if (pkg_id == -1)
 		return;
+	}
 
 	for_each_cpu(i, cpu_online_mask)
 		if (get_physical_package_id(i) == pkg_id)
-- 
2.18.2


^ permalink raw reply related

* [PATCH v5 03/10] powerpc/smp: Move powerpc_topology above
From: Srikar Dronamraju @ 2020-08-10  7:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <20200810071834.92514-1-srikar@linux.vnet.ibm.com>

Just moving the powerpc_topology description above.
This will help in using functions in this file and avoid declarations.

No other functional changes

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 104 +++++++++++++++++++-------------------
 1 file changed, 52 insertions(+), 52 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 08da765b91f1..39224a042468 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -818,6 +818,58 @@ static int init_cpu_l1_cache_map(int cpu)
 	return err;
 }
 
+static bool shared_caches;
+
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymmetric SMT dependency */
+static int powerpc_smt_flags(void)
+{
+	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
+
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		flags |= SD_ASYM_PACKING;
+	}
+	return flags;
+}
+#endif
+
+/*
+ * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
+ * This topology makes it *much* cheaper to migrate tasks between adjacent cores
+ * since the migrated task remains cache hot. We want to take advantage of this
+ * at the scheduler level so an extra topology level is required.
+ */
+static int powerpc_shared_cache_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+
+/*
+ * We can't just pass cpu_l2_cache_mask() directly because
+ * returns a non-const pointer and the compiler barfs on that.
+ */
+static const struct cpumask *shared_cache_mask(int cpu)
+{
+	return cpu_l2_cache_mask(cpu);
+}
+
+#ifdef CONFIG_SCHED_SMT
+static const struct cpumask *smallcore_smt_mask(int cpu)
+{
+	return cpu_smallcore_mask(cpu);
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 static int init_big_cores(void)
 {
 	int cpu;
@@ -1247,8 +1299,6 @@ static void add_cpu_to_masks(int cpu)
 			set_cpus_related(cpu, i, cpu_core_mask);
 }
 
-static bool shared_caches;
-
 /* Activate a secondary processor. */
 void start_secondary(void *unused)
 {
@@ -1312,56 +1362,6 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
-#ifdef CONFIG_SCHED_SMT
-/* cpumask of CPUs with asymmetric SMT dependency */
-static int powerpc_smt_flags(void)
-{
-	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
-
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		flags |= SD_ASYM_PACKING;
-	}
-	return flags;
-}
-#endif
-
-/*
- * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
- * This topology makes it *much* cheaper to migrate tasks between adjacent cores
- * since the migrated task remains cache hot. We want to take advantage of this
- * at the scheduler level so an extra topology level is required.
- */
-static int powerpc_shared_cache_flags(void)
-{
-	return SD_SHARE_PKG_RESOURCES;
-}
-
-/*
- * We can't just pass cpu_l2_cache_mask() directly because
- * returns a non-const pointer and the compiler barfs on that.
- */
-static const struct cpumask *shared_cache_mask(int cpu)
-{
-	return cpu_l2_cache_mask(cpu);
-}
-
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *smallcore_smt_mask(int cpu)
-{
-	return cpu_smallcore_mask(cpu);
-}
-#endif
-
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
-	{ NULL, },
-};
-
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	/*
-- 
2.18.2


^ permalink raw reply related

* [PATCH v5 00/10] Coregroup support on Powerpc
From: Srikar Dronamraju @ 2020-08-10  7:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Peter Zijlstra,
	Jordan Niethe, Anton Blanchard, LKML, Valentin Schneider,
	Nick Piggin, linuxppc-dev, Ingo Molnar

Changelog v4->v5:
v4: http://lore.kernel.org/lkml/20200727053230.19753-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v4 ->v5:
powerpc/smp: Optimize start_secondary
	Retain cache domain, no need for generalization
		 (Michael Ellerman, Peter Zijlstra,
		 Valentin Schneider, Gautham R. Shenoy)

powerpc/numa: Detect support for coregroup
	Updated commit msg with current abstract nature of the coregroups
						(Michael Ellerman)
powerpc/smp: Allocate cpumask only after searching thread group
	Updated commit msg on why cpumask need not be freed.
						(Michael Ellerman)

powerpc/smp: Create coregroup domain
	Updated commit msg to specify actual implementation of
	cpu_to_coregroup_id is in a subsequent patch (Michael Ellerman)

Changelog v3 ->v4:
v3: https://lore.kernel.org/lkml/20200723085116.4731-1-srikar@linux.vnet.ibm.com/t/#u

powerpc/smp: Create coregroup domain
	if coregroup_support doesn't exist, update MC mask to the next
	smaller domain mask.

Changelog v2 -> v3:
v2: https://lore.kernel.org/linuxppc-dev/20200721113814.32284-1-srikar@linux.vnet.ibm.com/t/#u

powerpc/smp: Cache node for reuse
	Removed node caching part. Rewrote the Commit msg (Michael Ellerman)
	Renamed to powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES

powerpc/smp: Enable small core scheduling sooner
	Rewrote changelog (Gautham)
	Renamed to powerpc/smp: Move topology fixups into  a new function

powerpc/smp: Create coregroup domain
	Add optimization for mask updation under coregroup_support

Changelog v1 -> v2:
v1: https://lore.kernel.org/linuxppc-dev/20200714043624.5648-1-srikar@linux.vnet.ibm.com/t/#u

powerpc/smp: Merge Power9 topology with Power topology
	Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
	since cpu_smt_mask is only defined under CONFIG_SCHED_SMT

powerpc/smp: Enable small core scheduling sooner
	Restored the previous info msg (Jordan)
	Moved big core topology fixup to fixup_topology (Gautham)

powerpc/smp: Dont assume l2-cache to be superset of sibling
	Set cpumask after verifying l2-cache. (Gautham)

powerpc/smp: Generalize 2nd sched domain
	Moved shared_cache topology fixup to fixup_topology (Gautham)

Powerpc/numa: Detect support for coregroup
	Explained Coregroup in commit msg (Michael Ellerman)

Powerpc/smp: Create coregroup domain
	Moved coregroup topology fixup to fixup_topology (Gautham)

powerpc/smp: Implement cpu_to_coregroup_id
	Move coregroup_enabled before getting associativity (Gautham)

powerpc/smp: Provide an ability to disable coregroup
	Patch dropped (Michael Ellerman)

Cleanup of existing powerpc topologies and add coregroup support on
Powerpc. Coregroup is a group of (subset of) cores of a DIE that share
a resource.

Patch 7 of this patch series: "Powerpc/numa: Detect support for coregroup"
depends on
https://lore.kernel.org/linuxppc-dev/20200707140644.7241-1-srikar@linux.vnet.ibm.com/t/#u
However it should be easy to rebase the patch without the above patch.

This patch series is based on top of current powerpc/next tree + the
above patch.

Summary of some of the testing done with coregroup patchset.
It includes ebizzy, schbench, perf bench sched pipe and topology verification.
One the left side are results from powerpc/next tree and on the right are the
results with the patchset applied.  Topological verification clearly shows that
there is no change in topology with and without the patches on all the 3 class
of systems that were tested.

On PowerPc/Next                                                            On Powerpc/next + Coregroup Support v5 patchset

Power 9 PowerNV (2 Node/ 160 Cpu System)
---------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
  N      Min       Max    Median       Avg        Stddev                  N      Min       Max    Median       Avg      Stddev
100   993884   1276090   1173476   1165914     54867.201                100   910470   1279820   1171095   1162091    67363.28

schbench (latency hence lower is better)
Latency percentiles (usec)                                              Latency percentiles (usec)
        50.0th: 455                                                             50.0th: 454
        75.0th: 533                                                             75.0th: 543
        90.0th: 683                                                             90.0th: 701
        95.0th: 743                                                             95.0th: 737
        *99.0th: 815                                                            *99.0th: 805
        99.5th: 839                                                             99.5th: 835
        99.9th: 913                                                             99.9th: 893
        min=0, max=1011                                                         min=0, max=2833

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:                                       # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes                # Executed 1000000 pipe operations between two processes

     Total time: 6.083 [sec]                                                 Total time: 6.303 [sec]

       6.083576 usecs/op                                                       6.303318 usecs/op
         164377 ops/sec                                                          158646 ops/sec


Power 9 LPAR (2 Node/ 128 Cpu System)
---------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
  N       Min       Max    Median         Avg      Stddev                 N       Min       Max    Median         Avg      Stddev
100   1058029   1295393   1200414   1188306.7   56786.538               100    943264   1287619   1180522   1168473.2   64469.955

schbench (latency hence lower is better)
Latency percentiles (usec)                                                Latency percentiles (usec)
        50.0000th: 34                                                             50.0000th: 39
        75.0000th: 46                                                             75.0000th: 52
        90.0000th: 53                                                             90.0000th: 68
        95.0000th: 56                                                             95.0000th: 77
        *99.0000th: 61                                                            *99.0000th: 89
        99.5000th: 63                                                             99.5000th: 94
        99.9000th: 81                                                             99.9000th: 169
        min=0, max=8405                                                           min=0, max=23674

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:                                        # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes                 # Executed 1000000 pipe operations between two processes

     Total time: 8.768 [sec]                                                      Total time: 5.217 [sec]

       8.768400 usecs/op                                                            5.217625 usecs/op
         114045 ops/sec                                                               191658 ops/sec

Power 8 LPAR (8 Node/ 256 Cpu System)
---------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
  N       Min       Max    Median         Avg      Stddev               N      Min      Max   Median        Avg     Stddev
100   1267615   1965234   1707423   1689137.6   144363.29             100  1175357  1924262  1691104  1664792.1   145876.4

schbench (latency hence lower is better)
Latency percentiles (usec)                                             Latency percentiles (usec)
        50.0th: 37                                                             50.0th: 36
        75.0th: 51                                                             75.0th: 48
        90.0th: 59                                                             90.0th: 55
        95.0th: 63                                                             95.0th: 59
        *99.0th: 71                                                            *99.0th: 67
        99.5th: 75                                                             99.5th: 72
        99.9th: 105                                                            99.9th: 170
        min=0, max=18560                                                       min=0, max=27031

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:                                       # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes               # Executed 1000000 pipe operations between two processes

     Total time: 6.013 [sec]                                                 Total time: 5.930 [sec]

       6.013963 usecs/op                                                        5.930724 usecs/op
         166279 ops/sec                                                           168613 ops/sec

Topology verification on Power9
Power9/ PowerNV / SMT4

tail -f /proc/cpuinfo
---------------------
cpu                : POWER9, altivec supported
clock                : 3600.000000MHz
revision        : 2.2 (pvr 004e 1202)

timebase        : 512000000
platform        : PowerNV
model                : 9006-22P
machine                : PowerNV 9006-22P
firmware        : OPAL
MMU                : Radix

On PowerPc/Next                                                            On Powerpc/next + Coregroup Support v5 patchset
lscpu                                                                      lscpu
------                                                                     ------
Architecture:        ppc64le                                               Architecture:        ppc64le
Byte Order:          Little Endian                                         Byte Order:          Little Endian
CPU(s):              160                                                   CPU(s):              160
On-line CPU(s) list: 0-159                                                 On-line CPU(s) list: 0-159
Thread(s) per core:  4                                                     Thread(s) per core:  4
Core(s) per socket:  20                                                    Core(s) per socket:  20
Socket(s):           2                                                     Socket(s):           2
NUMA node(s):        2                                                     NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)                                   Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported                             Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000                                             CPU max MHz:         3800.0000
CPU min MHz:         2166.0000                                             CPU min MHz:         2166.0000
L1d cache:           32K                                                   L1d cache:           32K
L1i cache:           32K                                                   L1i cache:           32K
L2 cache:            512K                                                  L2 cache:            512K
L3 cache:            10240K                                                L3 cache:            10240K
NUMA node0 CPU(s):   0-79                                                  NUMA node0 CPU(s):   0-79
NUMA node8 CPU(s):   80-159                                                NUMA node8 CPU(s):   80-159

grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                     grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
-----------------------------------------------------                      -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT                        /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT
/proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE                      /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE                        /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA                       /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA

grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags                    grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------                     ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391                      /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327                      /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071                      /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801                     /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801


On PowerPc/Next
head /proc/schedstat
--------------------
version 15
timestamp 4295043536
cpu0 0 0 0 0 0 0 9597119314 2408913694 11897
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 4941435230 11106132 1583
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

On Powerpc/next + Coregroup Support v5 patchset
head /proc/schedstat
--------------------
version 15
timestamp 4296311826
cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


Post sudo ppc64_cpu --smt=1                                                     Post sudo ppc64_cpu --smt=1
---------------------                                                           ---------------------
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                          grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
-----------------------------------------------------                           -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE                           /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE                             /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA                            /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA

grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags                         grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------                          ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327                           /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071                           /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801                          /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801


On Powerpc/next
head /proc/schedstat
--------------------
version 15
timestamp 4295046242
cpu0 0 0 0 0 0 0 10978610020 2658997390 13068
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 5408663896 95701034 7697
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

On Powerpc/next + Coregroup Support v5 patchset
head /proc/schedstat
--------------------
version 15
timestamp 4296314905
cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


Similar verification was done on Power 8 (8 Node 256 CPU LPAR) and Power 9 (2
node 128 Cpu LPAR) and they showed the topology before and after the patch to be
identical. If Interested, I could provide the same.

On Power 9 (with device-tree enablement to show coregroups).
(hunks for mimicing a coregroup was posted at
https://lore.kernel.org/linuxppc-dev/20200714043624.5648-1-srikar@linux.vnet.ibm.com/t/#m2cb09bb11c7a93257d6123d1d27edb8212f8af21)
-----------------------------------------------------------
$ tail /proc/cpuinfo
processor	: 127
cpu		: POWER9 (architected), altivec supported
clock		: 3000.000000MHz
revision	: 2.2 (pvr 004e 0202)

timebase	: 512000000
platform	: pSeries
model		: IBM,9008-22L
machine		: CHRP IBM,9008-22L
MMU		: Hash

Before patchset
--------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After patchset
--------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
MC
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>

Srikar Dronamraju (10):
  powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
  powerpc/smp: Merge Power9 topology with Power topology
  powerpc/smp: Move powerpc_topology above
  powerpc/smp: Move topology fixups into  a new function
  powerpc/smp: Dont assume l2-cache to be superset of sibling
  powerpc/smp: Optimize start_secondary
  powerpc/numa: Detect support for coregroup
  powerpc/smp: Allocate cpumask only after searching thread group
  powerpc/smp: Create coregroup domain
  powerpc/smp: Implement cpu_to_coregroup_id

 arch/powerpc/include/asm/smp.h      |   1 +
 arch/powerpc/include/asm/topology.h |  10 ++
 arch/powerpc/kernel/smp.c           | 235 +++++++++++++++++-----------
 arch/powerpc/mm/numa.c              |  59 +++++--
 4 files changed, 198 insertions(+), 107 deletions(-)

-- 
2.18.2


^ permalink raw reply

* Re: [PATCH] powerpc/rtas: Restrict RTAS requests from userspace
From: Michael Ellerman @ 2020-08-10  6:40 UTC (permalink / raw)
  To: Andrew Donnellan, linuxppc-dev; +Cc: nathanl, leobras.c, Daniel Axtens
In-Reply-To: <20200702161932.18176-1-ajd@linux.ibm.com>

Hi ajd,

Thanks for taking care of this.

I was going to merge this as-is, but given it's fixing a long standing
issue there's not really a big rush. So a few comments below.

Andrew Donnellan <ajd@linux.ibm.com> writes:
> A number of userspace utilities depend on making calls to RTAS to retrieve
> information and update various things.
>
> The existing API through which we expose RTAS to userspace exposes more
> RTAS functionality than we actually need, through the sys_rtas syscall,
> which allows root (or anyone with CAP_SYS_ADMIN) to make any RTAS call they
> want with arbitrary arguments.
>
> Many RTAS calls take the address of a buffer as an argument, and it's up to
> the caller to specify the physical address of the buffer as an argument. We
> allocate a buffer (the "RMO buffer") in the Real Memory Area that RTAS can
> access, and then expose the physical address and size of this buffer in
> /proc/powerpc/rtas/rmo_buffer. Userspace is expected to read this address,
> poke at the buffer using /dev/mem, and pass an address in the RMO buffer to
> the RTAS call.
>
> However, there's nothing stopping the caller from specifying whatever
> address they want in the RTAS call, and it's easy to construct a series of
> RTAS calls that can overwrite arbitrary bytes (even without /dev/mem
> access).
>
> Additionally, there are some RTAS calls that do potentially dangerous
> things and for which there are no legitimate userspace use cases.
>
> In the past, this would not have been a particularly big deal as it was
> assumed that root could modify all system state freely, but with Secure
> Boot and lockdown we need to care about this.
>
> We can't fundamentally change the ABI at this point, however we can address
> this by implementing a filter that checks RTAS calls against a list
> of permitted calls and forces the caller to use addresses within the RMO
> buffer.
>
> The list is based off the list of calls that are used by the librtas
> userspace library, and has been tested with a number of existing userspace
> RTAS utilities. For compatibility with any applications we are not aware of
> that require other calls, the filter can be turned off at build time.
>
> Reported-by: Daniel Axtens <dja@axtens.net>
> Cc: stable@vger.kernel.org
> Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
> ---
>  arch/powerpc/Kconfig       |  13 +++
>  arch/powerpc/kernel/rtas.c | 198 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 211 insertions(+)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 9fa23eb320ff..0e2dfe497357 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -973,6 +973,19 @@ config PPC_SECVAR_SYSFS
>  	  read/write operations on these variables. Say Y if you have
>  	  secure boot enabled and want to expose variables to userspace.
>  
> +config PPC_RTAS_FILTER
> +	bool "Enable filtering of RTAS syscalls"
> +	default y
> +	depends on PPC_RTAS
> +	help
> +	  The RTAS syscall API has security issues that could be used to
> +	  compromise system integrity. This option enforces restrictions on the
> +	  RTAS calls and arguments passed by userspace programs to mitigate
> +	  these issues.
> +
> +	  Say Y unless you know what you are doing and the filter is causing
> +	  problems for you.
> +
>  endmenu
>  
>  config ISA_DMA_API
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index a09eba03f180..ec1cae52d8bd 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -324,6 +324,23 @@ int rtas_token(const char *service)
>  }
>  EXPORT_SYMBOL(rtas_token);
>  
> +#ifdef CONFIG_PPC_RTAS_FILTER
> +

I think this could be combined with the #ifdef block below?

> +static char *rtas_token_name(int token)
> +{
> +	struct property *prop;
> +
> +	for_each_property_of_node(rtas.dev, prop) {
> +		const __be32 *tokp = prop->value;
> +
> +		if (tokp && be32_to_cpu(*tokp) == token)
> +			return prop->name;
> +	}
> +	return NULL;
> +}
> +
> +#endif /* CONFIG_PPC_RTAS_FILTER */
> +
>  int rtas_service_present(const char *service)
>  {
>  	return rtas_token(service) != RTAS_UNKNOWN_SERVICE;
> @@ -1110,6 +1127,184 @@ struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
>  	return NULL;
>  }
>  
> +#ifdef CONFIG_PPC_RTAS_FILTER
> +
> +/*
> + * The sys_rtas syscall, as originally designed, allows root to pass
> + * arbitrary physical addresses to RTAS calls. A number of RTAS calls
> + * can be abused to write to arbitrary memory and do other things that
> + * are potentially harmful to system integrity, and thus should only
> + * be used inside the kernel and not exposed to userspace.
> + *
> + * All known legitimate users of the sys_rtas syscall will only ever
> + * pass addresses that fall within the RMO buffer, and use a known
> + * subset of RTAS calls.
> + *
> + * Accordingly, we filter RTAS requests to check that the call is
> + * permitted, and that provided pointers fall within the RMO buffer.
> + * The rtas_filters list contains an entry for each permitted call,
> + * with the indexes of the parameters which are expected to contain
> + * addresses and sizes of buffers allocated inside the RMO buffer.
> + */
> +struct rtas_filter {
> +	const char name[32];

Using a const char * for the name would be more typical, meaning the
strings would end up in .rodata, and could be merged with other uses of
the same strings.

> +
> +	/* Indexes into the args buffer, -1 if not used */
> +	int rmo_buf_idx1;
> +	int rmo_size_idx1;
> +	int rmo_buf_idx2;
> +	int rmo_size_idx2;

The "rmo" prefix is probably unnecessary?

> +};
> +
> +struct rtas_filter rtas_filters[] = {

Should be static, and __ro_after_init ?

> +	{ "ibm,activate-firmware", -1, -1, -1, -1 },

Would it be worth making the indices 1-based, allowing 0 to be the
unused value, meaning you only have to initialise the used fields?

It would require adjusting them before use, but there's only 4 places
they're used, and you could probably use a macro to do the - 1.

> +	{ "ibm,configure-connector", 0, -1, 1, -1 },	/* Special cased, size 4096 */

Does it make sense to put the hard coded sizes in the structure as well?

eg. fixed_size1 = 4096,

I think that would avoid the need for any strcmps in the code.

> +	{ "display-character", -1, -1, -1, -1 },
> +	{ "ibm,display-message", 0, -1, -1, -1 },
> +	{ "ibm,errinjct", 2, -1, -1, -1 },		/* Fixed size of 1024 */
> +	{ "ibm,close-errinjct", -1, -1, -1, -1 },
> +	{ "ibm,open-errinct", -1, -1, -1, -1 },
> +	{ "ibm,get-config-addr-info2", -1, -1, -1, -1 },
> +	{ "ibm,get-dynamic-sensor-state", 1, -1, -1, -1 },
> +	{ "ibm,get-indices", 2, 3, -1, -1 },
> +	{ "get-power-level", -1, -1, -1, -1 },
> +	{ "get-sensor-state", -1, -1, -1, -1 },
> +	{ "ibm,get-system-parameter", 1, 2, -1, -1 },
> +	{ "get-time-of-day", -1, -1, -1, -1 },
> +	{ "ibm,get-vpd", 0, -1, 1, 2 },
> +	{ "ibm,lpar-perftools", 2, 3, -1, -1 },
> +	{ "ibm,platform-dump", 4, 5, -1, -1 },
> +	{ "ibm,read-slot-reset-state", -1, -1, -1, -1 },
> +	{ "ibm,scan-log-dump", 0, 1, -1, -1 },
> +	{ "ibm,set-dynamic-indicator", 2, -1, -1, -1 },
> +	{ "ibm,set-eeh-option", -1, -1, -1, -1 },
> +	{ "set-indicator", -1, -1, -1, -1 },
> +	{ "set-power-level", -1, -1, -1, -1 },
> +	{ "set-time-for-power-on", -1, -1, -1, -1 },
> +	{ "ibm,set-system-parameter", 1, -1, -1, -1 },
> +	{ "set-time-of-day", -1, -1, -1, -1 },
> +	{ "ibm,suspend-me", -1, -1, -1, -1 },
> +	{ "ibm,update-nodes", 0, -1, -1, -1 },		/* Fixed size of 4096 */
> +	{ "ibm,update-properties", 0, -1, -1, -1 },	/* Fixed size of 4096 */
> +	{ "ibm,physical-attestation", 0, 1, -1, -1 },
> +};
> +
> +static void dump_rtas_params(int token, int nargs, int nret,
> +			     struct rtas_args *args)
> +{
> +	int i;
> +	char *token_name = rtas_token_name(token);
> +
> +	pr_err_ratelimited("sys_rtas: token=0x%x (%s), nargs=%d, nret=%d (called by %s)\n",
> +			   token, token_name ? token_name : "unknown", nargs,
> +			   nret, current->comm);
> +	pr_err_ratelimited("sys_rtas: args: ");
> +
> +	for (i = 0; i < nargs; i++) {
> +		u32 arg = be32_to_cpu(args->args[i]);
> +
> +		pr_cont("%08x ", arg);
> +		if (arg >= rtas_rmo_buf &&
> +		    arg < (rtas_rmo_buf + RTAS_RMOBUF_MAX))
> +			pr_cont("(buf+0x%lx) ", arg - rtas_rmo_buf);
> +	}

This can leak the location of the RMO buf into dmesg. I know it's
visible via /proc, but the /proc file is 0400.

So I think it's probably safer if we just don't dump the args, or their
relation to the RMO buf.

> +
> +	pr_cont("\n");
> +}
> +
> +static bool in_rmo_buf(u32 base, u32 end)
> +{
> +	return base >= rtas_rmo_buf &&
> +		base < (rtas_rmo_buf + RTAS_RMOBUF_MAX) &&
> +		base <= end &&
> +		end >= rtas_rmo_buf &&
> +		end < (rtas_rmo_buf + RTAS_RMOBUF_MAX);
> +}
> +
> +static bool block_rtas_call(int token, int nargs,
> +			    struct rtas_args *args)
> +{
> +	int i;
> +	const char *reason;
> +	char *token_name = rtas_token_name(token);

This code isn't particularly performance critical, but I think it would
be cleaner to do the token lookup once at init time, and store the token
in the filter array?

Then this code would only be doing token comparisons.

> +
> +	if (!token_name)
> +		goto err_notpermitted;
> +
> +	for (i = 0; i < ARRAY_SIZE(rtas_filters); i++) {
> +		struct rtas_filter *f = &rtas_filters[i];
> +		u32 base, size, end;
> +
> +		if (strcmp(token_name, f->name))
> +			continue;
> +
> +		if (f->rmo_buf_idx1 != -1) {
> +			base = be32_to_cpu(args->args[f->rmo_buf_idx1]);
> +			if (f->rmo_size_idx1 != -1)
> +				size = be32_to_cpu(args->args[f->rmo_size_idx1]);
> +			else if (!strcmp(token_name, "ibm,errinjct"))
> +				size = 1024;
> +			else if (!strcmp(token_name, "ibm,update-nodes") ||
> +				 !strcmp(token_name, "ibm,update-properties") ||
> +				 !strcmp(token_name, "ibm,configure-connector"))
> +				size = 4096;
> +			else
> +				size = 1;
> +
> +			end = base + size - 1;
> +			if (!in_rmo_buf(base, end)) {
> +				reason = "address pair 1 out of range";

I don't think we need to give the user this much detail about what they
did wrong, all cases can just print "call not permitted" IMO.

> +				goto err;
> +			}
> +		}
> +
> +		if (f->rmo_buf_idx2 != -1) {
> +			base = be32_to_cpu(args->args[f->rmo_buf_idx2]);
> +			if (f->rmo_size_idx2 != -1)
> +				size = be32_to_cpu(args->args[f->rmo_size_idx2]);
> +			else if (!strcmp(token_name, "ibm,configure-connector"))
> +				size = 4096;
> +			else
> +				size = 1;
> +			end = base + size - 1;
> +
> +			/*
> +			 * Special case for ibm,configure-connector where the
> +			 * address can be 0
> +			 */
> +			if (!strcmp(token_name, "ibm,configure-connector") &&
> +			    base == 0)
> +				return false;
> +
> +			if (!in_rmo_buf(base, end)) {
> +				reason = "address pair 2 out of range";
> +				goto err;
> +			}
> +		}
> +
> +		return false;
> +	}
> +
> +err_notpermitted:
> +	reason = "call not permitted";
> +
> +err:
> +	pr_err_ratelimited("sys_rtas: RTAS call blocked - exploit attempt? (%s)\n",
> +			   reason);
> +	dump_rtas_params(token, nargs, 0, args);
> +	return true;
> +}
> +
> +#else
> +
> +static bool block_rtas_call(int token, int nargs,
> +			    struct rtas_args *args)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_PPC_RTAS_FILTER */
> +
>  /* We assume to be passed big endian arguments */
>  SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>  {
> @@ -1147,6 +1342,9 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
>  	args.rets = &args.args[nargs];
>  	memset(args.rets, 0, nret * sizeof(rtas_arg_t));
>  
> +	if (block_rtas_call(token, nargs, &args))
> +		return -EINVAL;
> +
>  	/* Need to handle ibm,suspend_me call specially */
>  	if (token == ibm_suspend_me_token) {
>  

cheers

^ permalink raw reply

* Re: linux-next: manual merge of the set_fs tree with the powerpc tree
From: Stephen Rothwell @ 2020-08-10  6:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nathan Lynch, Linux Next Mailing List, PowerPC,
	Linux Kernel Mailing List
In-Reply-To: <20200810061106.GA29318@lst.de>

[-- Attachment #1: Type: text/plain, Size: 257 bytes --]

Hi Christoph,

On Mon, 10 Aug 2020 08:11:06 +0200 Christoph Hellwig <hch@lst.de> wrote:
>
> please drop my set_fs tree from linux-next.  It is not going to be
> merged for 5.9 in this form.

OK, done from tomorrow.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: linux-next: manual merge of the set_fs tree with the powerpc tree
From: Christoph Hellwig @ 2020-08-10  6:11 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Nathan Lynch, Linux Kernel Mailing List, Linux Next Mailing List,
	PowerPC, Christoph Hellwig
In-Reply-To: <20200809185726.5d8e5f55@canb.auug.org.au>

Hi Stephen,

please drop my set_fs tree from linux-next.  It is not going to be
merged for 5.9 in this form.

Thanks!

^ permalink raw reply

* Re: [PATCH v3 5/8] mm: HUGE_VMAP arch support cleanup
From: kernel test robot @ 2020-08-10  3:58 UTC (permalink / raw)
  To: Nicholas Piggin, linux-mm
  Cc: linux-arch, kbuild-all, Will Deacon, Catalin Marinas,
	linux-kernel, Nicholas Piggin, Ingo Molnar, Thomas Gleixner,
	linuxppc-dev, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-6-npiggin@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3957 bytes --]

Hi Nicholas,

I love your patch! Yet something to improve:

[auto build test ERROR on hnaz-linux-mm/master]
[also build test ERROR on arm64/for-next/core powerpc/next tip/x86/mm linus/master v5.8 next-20200807]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Nicholas-Piggin/huge-vmalloc-mappings/20200810-103105
base:   https://github.com/hnaz/linux-mm master
config: i386-randconfig-s002-20200810 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.2-118-ge1578773-dirty
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/x86/mm/ioremap.c:484:6: error: redefinition of 'arch_vmap_p4d_supported'
     484 | bool arch_vmap_p4d_supported(pgprot_t prot)
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/asm-generic/io.h:911,
                    from arch/x86/include/asm/io.h:375,
                    from arch/x86/include/asm/dma.h:13,
                    from include/linux/memblock.h:14,
                    from arch/x86/mm/ioremap.c:10:
   include/linux/vmalloc.h:92:20: note: previous definition of 'arch_vmap_p4d_supported' was here
      92 | static inline bool arch_vmap_p4d_supported(pgprot_t prot) { return false; }
         |                    ^~~~~~~~~~~~~~~~~~~~~~~
>> arch/x86/mm/ioremap.c:489:6: error: redefinition of 'arch_vmap_pud_supported'
     489 | bool arch_vmap_pud_supported(pgprot_t prot)
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/asm-generic/io.h:911,
                    from arch/x86/include/asm/io.h:375,
                    from arch/x86/include/asm/dma.h:13,
                    from include/linux/memblock.h:14,
                    from arch/x86/mm/ioremap.c:10:
   include/linux/vmalloc.h:93:20: note: previous definition of 'arch_vmap_pud_supported' was here
      93 | static inline bool arch_vmap_pud_supported(pgprot_t prot) { return false; }
         |                    ^~~~~~~~~~~~~~~~~~~~~~~
>> arch/x86/mm/ioremap.c:498:6: error: redefinition of 'arch_vmap_pmd_supported'
     498 | bool arch_vmap_pmd_supported(pgprot_t prot)
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   In file included from include/asm-generic/io.h:911,
                    from arch/x86/include/asm/io.h:375,
                    from arch/x86/include/asm/dma.h:13,
                    from include/linux/memblock.h:14,
                    from arch/x86/mm/ioremap.c:10:
   include/linux/vmalloc.h:94:20: note: previous definition of 'arch_vmap_pmd_supported' was here
      94 | static inline bool arch_vmap_pmd_supported(pgprot_t prot) { return false; }
         |                    ^~~~~~~~~~~~~~~~~~~~~~~
   arch/x86/mm/ioremap.c:737:17: warning: no previous prototype for 'early_memremap_pgprot_adjust' [-Wmissing-prototypes]
     737 | pgprot_t __init early_memremap_pgprot_adjust(resource_size_t phys_addr,
         |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

vim +/arch_vmap_p4d_supported +484 arch/x86/mm/ioremap.c

   483	
 > 484	bool arch_vmap_p4d_supported(pgprot_t prot)
   485	{
   486		return false;
   487	}
   488	
 > 489	bool arch_vmap_pud_supported(pgprot_t prot)
   490	{
   491	#ifdef CONFIG_X86_64
   492		return boot_cpu_has(X86_FEATURE_GBPAGES);
   493	#else
   494		return false;
   495	#endif
   496	}
   497	
 > 498	bool arch_vmap_pmd_supported(pgprot_t prot)
   499	{
   500		return boot_cpu_has(X86_FEATURE_PSE);
   501	}
   502	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35543 bytes --]

^ permalink raw reply

* [PATCH v3 8/8] mm/vmalloc: Hugepage vmalloc mappings
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps,
vmalloc will attempt to allocate PMD-sized pages first, before falling
back to small pages.

Allocations which use something other than PAGE_KERNEL protections are
not permitted to use huge pages yet, not all callers expect this (e.g.,
module allocations vs strict module rwx).

This reduces TLB misses by nearly 30x on a `git diff` workload on a
2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.

This can result in more internal fragmentation and memory overhead for a
given allocation, an option nohugevmap is added to disable at boot.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 .../admin-guide/kernel-parameters.txt         |   2 +
 include/linux/vmalloc.h                       |   1 +
 mm/vmalloc.c                                  | 174 +++++++++++++-----
 3 files changed, 135 insertions(+), 42 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 98ea67f27809..eaef176c597f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3190,6 +3190,8 @@
 
 	nohugeiomap	[KNL,x86,PPC] Disable kernel huge I/O mappings.
 
+	nohugevmap	[KNL,x86,PPC] Disable kernel huge vmalloc mappings.
+
 	nosmt		[KNL,S390] Disable symmetric multithreading (SMT).
 			Equivalent to smt=1.
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index e3590e93bfff..8f25dbaca0a1 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -58,6 +58,7 @@ struct vm_struct {
 	unsigned long		size;
 	unsigned long		flags;
 	struct page		**pages;
+	unsigned int		page_order;
 	unsigned int		nr_pages;
 	phys_addr_t		phys_addr;
 	const void		*caller;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4e5cb7c7f780..a7728c7086bc 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -45,6 +45,19 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+static bool __ro_after_init vmap_allow_huge = true;
+
+static int __init set_nohugevmap(char *str)
+{
+	vmap_allow_huge = false;
+	return 0;
+}
+early_param("nohugevmap", set_nohugevmap);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static const bool vmap_allow_huge = false;
+#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
 bool is_vmalloc_addr(const void *x)
 {
 	unsigned long addr = (unsigned long)x;
@@ -468,31 +481,12 @@ static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long en
 	return 0;
 }
 
-/**
- * map_kernel_range_noflush - map kernel VM area with the specified pages
- * @addr: start of the VM area to map
- * @size: size of the VM area to map
- * @prot: page protection flags to use
- * @pages: pages to map
- *
- * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size specify should
- * have been allocated using get_vm_area() and its friends.
- *
- * NOTE:
- * This function does NOT do any cache flushing.  The caller is responsible for
- * calling flush_cache_vmap() on to-be-mapped areas before calling this
- * function.
- *
- * RETURNS:
- * 0 on success, -errno on failure.
- */
-int map_kernel_range_noflush(unsigned long addr, unsigned long size,
-			     pgprot_t prot, struct page **pages)
+static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages)
 {
 	unsigned long start = addr;
-	unsigned long end = addr + size;
-	unsigned long next;
 	pgd_t *pgd;
+	unsigned long next;
 	int err = 0;
 	int nr = 0;
 	pgtbl_mod_mask mask = 0;
@@ -514,6 +508,65 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 	return 0;
 }
 
+static int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+	WARN_ON(page_shift < PAGE_SHIFT);
+
+	if (page_shift == PAGE_SHIFT) {
+		return vmap_small_pages_range_noflush(addr, end, prot, pages);
+	} else {
+		unsigned int i, nr = (end - addr) >> page_shift;
+
+		for (i = 0; i < nr; i++) {
+			int err;
+
+			err = vmap_range_noflush(addr, addr + (1UL << page_shift),
+						__pa(page_address(pages[i])), prot, page_shift);
+			if (err)
+				return err;
+
+			addr += 1UL << page_shift;
+		}
+
+		return 0;
+	}
+}
+
+static int vmap_pages_range(unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, unsigned int page_shift)
+{
+	int err;
+
+	err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
+	flush_cache_vmap(addr, end);
+	return err;
+}
+
+/**
+ * map_kernel_range_noflush - map kernel VM area with the specified pages
+ * @addr: start of the VM area to map
+ * @size: size of the VM area to map
+ * @prot: page protection flags to use
+ * @pages: pages to map
+ *
+ * Map PFN_UP(@size) pages at @addr.  The VM area @addr and @size specify should
+ * have been allocated using get_vm_area() and its friends.
+ *
+ * NOTE:
+ * This function does NOT do any cache flushing.  The caller is responsible for
+ * calling flush_cache_vmap() on to-be-mapped areas before calling this
+ * function.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int map_kernel_range_noflush(unsigned long addr, unsigned long size,
+			     pgprot_t prot, struct page **pages)
+{
+	return vmap_pages_range_noflush(addr, addr + size, prot, pages, PAGE_SHIFT);
+}
+
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
 		struct page **pages)
 {
@@ -2274,9 +2327,11 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (unlikely(!size))
 		return NULL;
 
-	if (flags & VM_IOREMAP)
-		align = 1ul << clamp_t(int, get_count_order_long(size),
-				       PAGE_SHIFT, IOREMAP_MAX_ORDER);
+	if (flags & VM_IOREMAP) {
+		align = max(align,
+			    1ul << clamp_t(int, get_count_order_long(size),
+					   PAGE_SHIFT, IOREMAP_MAX_ORDER));
+	}
 
 	area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
 	if (unlikely(!area))
@@ -2475,7 +2530,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
 			struct page *page = area->pages[i];
 
 			BUG_ON(!page);
-			__free_pages(page, 0);
+			__free_pages(page, area->page_order);
+			i += 1 << area->page_order;
 		}
 		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
 
@@ -2614,9 +2670,12 @@ void *vmap(struct page **pages, unsigned int count,
 EXPORT_SYMBOL(vmap);
 
 static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
-				 pgprot_t prot, int node)
+				 pgprot_t prot, unsigned int page_shift, int node)
 {
 	struct page **pages;
+	unsigned long addr = (unsigned long)area->addr;
+	unsigned long size = get_vm_area_size(area);
+	unsigned int page_order = page_shift - PAGE_SHIFT;
 	unsigned int nr_pages, array_size, i;
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
@@ -2624,7 +2683,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 					0 :
 					__GFP_HIGHMEM;
 
-	nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
+	nr_pages = size >> PAGE_SHIFT;
 	array_size = (nr_pages * sizeof(struct page *));
 
 	/* Please note that the recursion is strictly bounded. */
@@ -2643,29 +2702,29 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 
 	area->pages = pages;
 	area->nr_pages = nr_pages;
+	area->page_order = page_order;
 
-	for (i = 0; i < area->nr_pages; i++) {
+	for (i = 0; i < area->nr_pages; i += 1 << page_order) {
 		struct page *page;
+		int p;
 
-		if (node == NUMA_NO_NODE)
-			page = alloc_page(alloc_mask|highmem_mask);
-		else
-			page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
-
+		page = alloc_pages_node(node, alloc_mask|highmem_mask, page_order);
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vunmap() */
 			area->nr_pages = i;
 			atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
 			goto fail;
 		}
-		area->pages[i] = page;
+
+		for (p = 0; p < (1 << page_order); p++)
+			area->pages[i + p] = page + p;
+
 		if (gfpflags_allow_blocking(gfp_mask))
 			cond_resched();
 	}
 	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
 
-	if (map_kernel_range((unsigned long)area->addr, get_vm_area_size(area),
-			prot, pages) < 0)
+	if (vmap_pages_range(addr, addr + size, prot, pages, page_shift) < 0)
 		goto fail;
 
 	return area->addr;
@@ -2701,22 +2760,45 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller)
 {
-	struct vm_struct *area;
+	struct vm_struct *area = NULL;
 	void *addr;
 	unsigned long real_size = size;
+	unsigned long real_align = align;
+	unsigned int shift = PAGE_SHIFT;
 
 	size = PAGE_ALIGN(size);
 	if (!size || (size >> PAGE_SHIFT) > totalram_pages())
 		goto fail;
 
-	area = __get_vm_area_node(real_size, align, VM_ALLOC | VM_UNINITIALIZED |
+	if (vmap_allow_huge && (pgprot_val(prot) == pgprot_val(PAGE_KERNEL))) {
+		unsigned long size_per_node;
+
+		/*
+		 * Try huge pages. Only try for PAGE_KERNEL allocations,
+		 * others like modules don't yet expect huge pages in
+		 * their allocations due to apply_to_page_range not
+		 * supporting them.
+		 */
+
+		size_per_node = size;
+		if (node == NUMA_NO_NODE)
+			size_per_node /= num_online_nodes();
+		if (size_per_node >= PMD_SIZE)
+			shift = PMD_SHIFT;
+	}
+
+again:
+	align = max(real_align, 1UL << shift);
+	size = ALIGN(real_size, align);
+
+	area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
 				vm_flags, start, end, node, gfp_mask, caller);
 	if (!area)
 		goto fail;
 
-	addr = __vmalloc_area_node(area, gfp_mask, prot, node);
+	addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
 	if (!addr)
-		return NULL;
+		goto fail;
 
 	/*
 	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
@@ -2730,8 +2812,16 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	return addr;
 
 fail:
-	warn_alloc(gfp_mask, NULL,
+	if (shift > PAGE_SHIFT) {
+		shift = PAGE_SHIFT;
+		goto again;
+	}
+
+	if (!area) {
+		/* Warn for area allocation, page allocations already warn */
+		warn_alloc(gfp_mask, NULL,
 			  "vmalloc: allocation failure: %lu bytes", real_size);
+	}
 	return NULL;
 }
 
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 7/8] mm/vmalloc: add vmap_range_noflush variant
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

As a side-effect, the order of flush_cache_vmap() and
arch_sync_kernel_mappings() calls are switched, but that now matches
the other callers in this file.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/vmalloc.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 129f10545bb1..4e5cb7c7f780 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -234,8 +234,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	return 0;
 }
 
-int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-			unsigned int max_page_shift)
+static int vmap_range_noflush(unsigned long addr, unsigned long end, phys_addr_t phys_addr,
+			pgprot_t prot, unsigned int max_page_shift)
 {
 	pgd_t *pgd;
 	unsigned long start;
@@ -255,14 +255,23 @@ int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgp
 			break;
 	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
 
-	flush_cache_vmap(start, end);
-
 	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
 		arch_sync_kernel_mappings(start, end);
 
 	return err;
 }
 
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+			unsigned int max_page_shift)
+{
+	int err;
+
+	err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+	flush_cache_vmap(addr, end);
+
+	return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     pgtbl_mod_mask *mask)
 {
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

This is a generic kernel virtual memory mapper, not specific to ioremap.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 include/linux/vmalloc.h |   2 +
 mm/ioremap.c            | 192 ----------------------------------------
 mm/vmalloc.c            | 191 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 193 insertions(+), 192 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 787d77ad7536..e3590e93bfff 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 #ifdef CONFIG_MMU
+extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+			unsigned int max_page_shift);
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
 				    pgprot_t prot, struct page **pages);
 int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,
diff --git a/mm/ioremap.c b/mm/ioremap.c
index b0032dbadaf7..cdda0e022740 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,198 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap);
 static const bool iomap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
-{
-	pte_t *pte;
-	u64 pfn;
-
-	pfn = phys_addr >> PAGE_SHIFT;
-	pte = pte_alloc_kernel_track(pmd, addr, mask);
-	if (!pte)
-		return -ENOMEM;
-	do {
-		BUG_ON(!pte_none(*pte));
-		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
-		pfn++;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
-	*mask |= PGTBL_PTE_MODIFIED;
-	return 0;
-}
-
-static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
-	if (max_page_shift < PMD_SHIFT)
-		return 0;
-
-	if (!arch_vmap_pmd_supported(prot))
-		return 0;
-
-	if ((end - addr) != PMD_SIZE)
-		return 0;
-
-	if (!IS_ALIGNED(addr, PMD_SIZE))
-		return 0;
-
-	if (!IS_ALIGNED(phys_addr, PMD_SIZE))
-		return 0;
-
-	if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
-		return 0;
-
-	return pmd_set_huge(pmd, phys_addr, prot);
-}
-
-static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
-			pgtbl_mod_mask *mask)
-{
-	pmd_t *pmd;
-	unsigned long next;
-
-	pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
-	if (!pmd)
-		return -ENOMEM;
-	do {
-		next = pmd_addr_end(addr, end);
-
-		if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
-			*mask |= PGTBL_PMD_MODIFIED;
-			continue;
-		}
-
-		if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
-			return -ENOMEM;
-	} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
-	return 0;
-}
-
-static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
-	if (max_page_shift < PUD_SHIFT)
-		return 0;
-
-	if (!arch_vmap_pud_supported(prot))
-		return 0;
-
-	if ((end - addr) != PUD_SIZE)
-		return 0;
-
-	if (!IS_ALIGNED(addr, PUD_SIZE))
-		return 0;
-
-	if (!IS_ALIGNED(phys_addr, PUD_SIZE))
-		return 0;
-
-	if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
-		return 0;
-
-	return pud_set_huge(pud, phys_addr, prot);
-}
-
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
-			pgtbl_mod_mask *mask)
-{
-	pud_t *pud;
-	unsigned long next;
-
-	pud = pud_alloc_track(&init_mm, p4d, addr, mask);
-	if (!pud)
-		return -ENOMEM;
-	do {
-		next = pud_addr_end(addr, end);
-
-		if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
-			*mask |= PGTBL_PUD_MODIFIED;
-			continue;
-		}
-
-		if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
-			return -ENOMEM;
-	} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
-	return 0;
-}
-
-static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
-{
-	if (max_page_shift < P4D_SHIFT)
-		return 0;
-
-	if (!arch_vmap_p4d_supported(prot))
-		return 0;
-
-	if ((end - addr) != P4D_SIZE)
-		return 0;
-
-	if (!IS_ALIGNED(addr, P4D_SIZE))
-		return 0;
-
-	if (!IS_ALIGNED(phys_addr, P4D_SIZE))
-		return 0;
-
-	if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
-		return 0;
-
-	return p4d_set_huge(p4d, phys_addr, prot);
-}
-
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
-			pgtbl_mod_mask *mask)
-{
-	p4d_t *p4d;
-	unsigned long next;
-
-	p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
-	if (!p4d)
-		return -ENOMEM;
-	do {
-		next = p4d_addr_end(addr, end);
-
-		if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
-			*mask |= PGTBL_P4D_MODIFIED;
-			continue;
-		}
-
-		if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
-			return -ENOMEM;
-	} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
-	return 0;
-}
-
-static int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-			unsigned int max_page_shift)
-{
-	pgd_t *pgd;
-	unsigned long start;
-	unsigned long next;
-	int err;
-	pgtbl_mod_mask mask = 0;
-
-	might_sleep();
-	BUG_ON(addr >= end);
-
-	start = addr;
-	pgd = pgd_offset_k(addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
-		if (err)
-			break;
-	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
-
-	flush_cache_vmap(start, end);
-
-	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
-		arch_sync_kernel_mappings(start, end);
-
-	return err;
-}
-
 int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
 {
 	unsigned int max_page_shift = PAGE_SHIFT;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3a1e45fd1626..129f10545bb1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -71,6 +71,197 @@ static void free_work(struct work_struct *w)
 }
 
 /*** Page table manipulation functions ***/
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+{
+	pte_t *pte;
+	u64 pfn;
+
+	pfn = phys_addr >> PAGE_SHIFT;
+	pte = pte_alloc_kernel_track(pmd, addr, mask);
+	if (!pte)
+		return -ENOMEM;
+	do {
+		BUG_ON(!pte_none(*pte));
+		set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
+		pfn++;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	*mask |= PGTBL_PTE_MODIFIED;
+	return 0;
+}
+
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+	if (max_page_shift < PMD_SHIFT)
+		return 0;
+
+	if (!arch_vmap_pmd_supported(prot))
+		return 0;
+
+	if ((end - addr) != PMD_SIZE)
+		return 0;
+
+	if (!IS_ALIGNED(addr, PMD_SIZE))
+		return 0;
+
+	if (!IS_ALIGNED(phys_addr, PMD_SIZE))
+		return 0;
+
+	if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
+		return 0;
+
+	return pmd_set_huge(pmd, phys_addr, prot);
+}
+
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+			pgtbl_mod_mask *mask)
+{
+	pmd_t *pmd;
+	unsigned long next;
+
+	pmd = pmd_alloc_track(&init_mm, pud, addr, mask);
+	if (!pmd)
+		return -ENOMEM;
+	do {
+		next = pmd_addr_end(addr, end);
+
+		if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
+			*mask |= PGTBL_PMD_MODIFIED;
+			continue;
+		}
+
+		if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+			return -ENOMEM;
+	} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
+	return 0;
+}
+
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+	if (max_page_shift < PUD_SHIFT)
+		return 0;
+
+	if (!arch_vmap_pud_supported(prot))
+		return 0;
+
+	if ((end - addr) != PUD_SIZE)
+		return 0;
+
+	if (!IS_ALIGNED(addr, PUD_SIZE))
+		return 0;
+
+	if (!IS_ALIGNED(phys_addr, PUD_SIZE))
+		return 0;
+
+	if (pud_present(*pud) && !pud_free_pmd_page(pud, addr))
+		return 0;
+
+	return pud_set_huge(pud, phys_addr, prot);
+}
+
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+			pgtbl_mod_mask *mask)
+{
+	pud_t *pud;
+	unsigned long next;
+
+	pud = pud_alloc_track(&init_mm, p4d, addr, mask);
+	if (!pud)
+		return -ENOMEM;
+	do {
+		next = pud_addr_end(addr, end);
+
+		if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
+			*mask |= PGTBL_PUD_MODIFIED;
+			continue;
+		}
+
+		if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
+			return -ENOMEM;
+	} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
+	return 0;
+}
+
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
+{
+	if (max_page_shift < P4D_SHIFT)
+		return 0;
+
+	if (!arch_vmap_p4d_supported(prot))
+		return 0;
+
+	if ((end - addr) != P4D_SIZE)
+		return 0;
+
+	if (!IS_ALIGNED(addr, P4D_SIZE))
+		return 0;
+
+	if (!IS_ALIGNED(phys_addr, P4D_SIZE))
+		return 0;
+
+	if (p4d_present(*p4d) && !p4d_free_pud_page(p4d, addr))
+		return 0;
+
+	return p4d_set_huge(p4d, phys_addr, prot);
+}
+
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+			pgtbl_mod_mask *mask)
+{
+	p4d_t *p4d;
+	unsigned long next;
+
+	p4d = p4d_alloc_track(&init_mm, pgd, addr, mask);
+	if (!p4d)
+		return -ENOMEM;
+	do {
+		next = p4d_addr_end(addr, end);
+
+		if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
+			*mask |= PGTBL_P4D_MODIFIED;
+			continue;
+		}
+
+		if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
+			return -ENOMEM;
+	} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
+	return 0;
+}
+
+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+			unsigned int max_page_shift)
+{
+	pgd_t *pgd;
+	unsigned long start;
+	unsigned long next;
+	int err;
+	pgtbl_mod_mask mask = 0;
+
+	might_sleep();
+	BUG_ON(addr >= end);
+
+	start = addr;
+	pgd = pgd_offset_k(addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
+		if (err)
+			break;
+	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
+
+	flush_cache_vmap(start, end);
+
+	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
+		arch_sync_kernel_mappings(start, end);
+
+	return err;
+}
 
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     pgtbl_mod_mask *mask)
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 5/8] mm: HUGE_VMAP arch support cleanup
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

This changes the awkward approach where architectures provide init
functions to determine which levels they can provide large mappings for,
to one where the arch is queried for each call.

This removes code and indirection, and allows constant-folding of dead
code for unsupported levels.

This also adds a prot argument to the arch query. This is unused
currently but could help with some architectures (e.g., some powerpc
processors can't map uncacheable memory with large pages).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/arm64/mm/mmu.c                      | 10 +--
 arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +-
 arch/x86/mm/ioremap.c                    | 10 +--
 include/linux/io.h                       |  9 ---
 include/linux/vmalloc.h                  | 10 +++
 init/main.c                              |  1 -
 mm/ioremap.c                             | 96 +++++++++++-------------
 7 files changed, 67 insertions(+), 77 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 75df62fea1b6..184a6e917417 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1304,12 +1304,12 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
 	return dt_virt;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-	return 0;
+	return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
 	/*
 	 * Only 4k granule supports level 1 block mappings.
@@ -1319,9 +1319,9 @@ int __init arch_ioremap_pud_supported(void)
 	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
-	/* See arch_ioremap_pud_supported() */
+	/* See arch_vmap_pud_supported() */
 	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
 }
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..eca83a50bf2e 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1134,13 +1134,13 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
 	set_pte_at(mm, addr, ptep, pte);
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
 	/* HPT does not cope with large pages in the vmalloc area */
 	return radix_enabled();
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
 	return radix_enabled();
 }
@@ -1234,7 +1234,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 	return 1;
 }
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-	return 0;
+	return false;
 }
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 84d85dbd1dad..159bfca757b9 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -481,21 +481,21 @@ void iounmap(volatile void __iomem *addr)
 }
 EXPORT_SYMBOL(iounmap);
 
-int __init arch_ioremap_p4d_supported(void)
+bool arch_vmap_p4d_supported(pgprot_t prot)
 {
-	return 0;
+	return false;
 }
 
-int __init arch_ioremap_pud_supported(void)
+bool arch_vmap_pud_supported(pgprot_t prot)
 {
 #ifdef CONFIG_X86_64
 	return boot_cpu_has(X86_FEATURE_GBPAGES);
 #else
-	return 0;
+	return false;
 #endif
 }
 
-int __init arch_ioremap_pmd_supported(void)
+bool arch_vmap_pmd_supported(pgprot_t prot)
 {
 	return boot_cpu_has(X86_FEATURE_PSE);
 }
diff --git a/include/linux/io.h b/include/linux/io.h
index 8394c56babc2..f1effd4d7a3c 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-void __init ioremap_huge_init(void);
-int arch_ioremap_p4d_supported(void);
-int arch_ioremap_pud_supported(void);
-int arch_ioremap_pmd_supported(void);
-#else
-static inline void ioremap_huge_init(void) { }
-#endif
-
 /*
  * Managed iomap interface
  */
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 0221f852a7e1..787d77ad7536 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -84,6 +84,16 @@ struct vmap_area {
 	};
 };
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+bool arch_vmap_p4d_supported(pgprot_t prot);
+bool arch_vmap_pud_supported(pgprot_t prot);
+bool arch_vmap_pmd_supported(pgprot_t prot);
+#else
+static inline bool arch_vmap_p4d_supported(pgprot_t prot) { return false; }
+static inline bool arch_vmap_pud_supported(pgprot_t prot) { return false; }
+static inline bool arch_vmap_pmd_supported(pgprot_t prot) { return false; }
+#endif
+
 /*
  *	Highlevel APIs for driver use
  */
diff --git a/init/main.c b/init/main.c
index ae78fb68d231..1c89aa127b8f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -820,7 +820,6 @@ static void __init mm_init(void)
 	pgtable_init();
 	debug_objects_mem_init();
 	vmalloc_init();
-	ioremap_huge_init();
 	/* Should be run before the first non-init thread is created */
 	init_espfix_bsp();
 	/* Should be run after espfix64 is set up. */
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 6016ae3227ad..b0032dbadaf7 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -16,49 +16,16 @@
 #include "pgalloc-track.h"
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static int __read_mostly ioremap_p4d_capable;
-static int __read_mostly ioremap_pud_capable;
-static int __read_mostly ioremap_pmd_capable;
-static int __read_mostly ioremap_huge_disabled;
+static bool __ro_after_init iomap_allow_huge = true;
 
 static int __init set_nohugeiomap(char *str)
 {
-	ioremap_huge_disabled = 1;
+	iomap_allow_huge = false;
 	return 0;
 }
 early_param("nohugeiomap", set_nohugeiomap);
-
-void __init ioremap_huge_init(void)
-{
-	if (!ioremap_huge_disabled) {
-		if (arch_ioremap_p4d_supported())
-			ioremap_p4d_capable = 1;
-		if (arch_ioremap_pud_supported())
-			ioremap_pud_capable = 1;
-		if (arch_ioremap_pmd_supported())
-			ioremap_pmd_capable = 1;
-	}
-}
-
-static inline int ioremap_p4d_enabled(void)
-{
-	return ioremap_p4d_capable;
-}
-
-static inline int ioremap_pud_enabled(void)
-{
-	return ioremap_pud_capable;
-}
-
-static inline int ioremap_pmd_enabled(void)
-{
-	return ioremap_pmd_capable;
-}
-
-#else	/* !CONFIG_HAVE_ARCH_HUGE_VMAP */
-static inline int ioremap_p4d_enabled(void) { return 0; }
-static inline int ioremap_pud_enabled(void) { return 0; }
-static inline int ioremap_pmd_enabled(void) { return 0; }
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static const bool iomap_allow_huge = false;
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
@@ -81,9 +48,12 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot)
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
 {
-	if (!ioremap_pmd_enabled())
+	if (max_page_shift < PMD_SHIFT)
+		return 0;
+
+	if (!arch_vmap_pmd_supported(prot))
 		return 0;
 
 	if ((end - addr) != PMD_SIZE)
@@ -102,7 +72,8 @@ static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+			pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -113,7 +84,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	do {
 		next = pmd_addr_end(addr, end);
 
-		if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+		if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) {
 			*mask |= PGTBL_PMD_MODIFIED;
 			continue;
 		}
@@ -125,9 +96,12 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 }
 
 static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot)
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
 {
-	if (!ioremap_pud_enabled())
+	if (max_page_shift < PUD_SHIFT)
+		return 0;
+
+	if (!arch_vmap_pud_supported(prot))
 		return 0;
 
 	if ((end - addr) != PUD_SIZE)
@@ -146,7 +120,8 @@ static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
 }
 
 static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+			pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -157,21 +132,24 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
 		next = pud_addr_end(addr, end);
 
-		if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+		if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) {
 			*mask |= PGTBL_PUD_MODIFIED;
 			continue;
 		}
 
-		if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+		if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask))
 			return -ENOMEM;
 	} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot)
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
 {
-	if (!ioremap_p4d_enabled())
+	if (max_page_shift < P4D_SHIFT)
+		return 0;
+
+	if (!arch_vmap_p4d_supported(prot))
 		return 0;
 
 	if ((end - addr) != P4D_SIZE)
@@ -190,7 +168,8 @@ static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
 }
 
 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
+			phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift,
+			pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -201,18 +180,19 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	do {
 		next = p4d_addr_end(addr, end);
 
-		if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+		if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot, max_page_shift)) {
 			*mask |= PGTBL_P4D_MODIFIED;
 			continue;
 		}
 
-		if (vmap_pud_range(p4d, addr, next, phys_addr, prot, mask))
+		if (vmap_pud_range(p4d, addr, next, phys_addr, prot, max_page_shift, mask))
 			return -ENOMEM;
 	} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
 }
 
-int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+static int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+			unsigned int max_page_shift)
 {
 	pgd_t *pgd;
 	unsigned long start;
@@ -227,7 +207,7 @@ int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_a
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, &mask);
+		err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, max_page_shift, &mask);
 		if (err)
 			break;
 	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
@@ -240,6 +220,16 @@ int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_a
 	return err;
 }
 
+int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+{
+	unsigned int max_page_shift = PAGE_SHIFT;
+
+	if (iomap_allow_huge)
+		max_page_shift = P4D_SHIFT;
+
+	return vmap_range(addr, end, phys_addr, prot, max_page_shift);
+}
+
 #ifdef CONFIG_GENERIC_IOREMAP
 void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot)
 {
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

This will be moved to mm/ and used as a generic kernel virtual mapping
function, so re-name it in preparation.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/ioremap.c | 55 ++++++++++++++++++++++------------------------------
 1 file changed, 23 insertions(+), 32 deletions(-)

diff --git a/mm/ioremap.c b/mm/ioremap.c
index 5fa1ab41d152..6016ae3227ad 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; }
 static inline int ioremap_pmd_enabled(void) { return 0; }
 #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
-static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-		pgtbl_mod_mask *mask)
+static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 	u64 pfn;
@@ -81,9 +80,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
-				unsigned long end, phys_addr_t phys_addr,
-				pgprot_t prot)
+static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot)
 {
 	if (!ioremap_pmd_enabled())
 		return 0;
@@ -103,9 +101,8 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr,
 	return pmd_set_huge(pmd, phys_addr, prot);
 }
 
-static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-		pgtbl_mod_mask *mask)
+static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -116,20 +113,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	do {
 		next = pmd_addr_end(addr, end);
 
-		if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
+		if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PMD_MODIFIED;
 			continue;
 		}
 
-		if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask))
+		if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask))
 			return -ENOMEM;
 	} while (pmd++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
 }
 
-static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
-				unsigned long end, phys_addr_t phys_addr,
-				pgprot_t prot)
+static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot)
 {
 	if (!ioremap_pud_enabled())
 		return 0;
@@ -149,9 +145,8 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr,
 	return pud_set_huge(pud, phys_addr, prot);
 }
 
-static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-		pgtbl_mod_mask *mask)
+static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -162,20 +157,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	do {
 		next = pud_addr_end(addr, end);
 
-		if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
+		if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_PUD_MODIFIED;
 			continue;
 		}
 
-		if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask))
+		if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask))
 			return -ENOMEM;
 	} while (pud++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
 }
 
-static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
-				unsigned long end, phys_addr_t phys_addr,
-				pgprot_t prot)
+static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot)
 {
 	if (!ioremap_p4d_enabled())
 		return 0;
@@ -195,9 +189,8 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr,
 	return p4d_set_huge(p4d, phys_addr, prot);
 }
 
-static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-		pgtbl_mod_mask *mask)
+static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+			phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -208,19 +201,18 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	do {
 		next = p4d_addr_end(addr, end);
 
-		if (ioremap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
+		if (vmap_try_huge_p4d(p4d, addr, next, phys_addr, prot)) {
 			*mask |= PGTBL_P4D_MODIFIED;
 			continue;
 		}
 
-		if (ioremap_pud_range(p4d, addr, next, phys_addr, prot, mask))
+		if (vmap_pud_range(p4d, addr, next, phys_addr, prot, mask))
 			return -ENOMEM;
 	} while (p4d++, phys_addr += (next - addr), addr = next, addr != end);
 	return 0;
 }
 
-int ioremap_page_range(unsigned long addr,
-		       unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
+int ioremap_page_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
 {
 	pgd_t *pgd;
 	unsigned long start;
@@ -235,8 +227,7 @@ int ioremap_page_range(unsigned long addr,
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = ioremap_p4d_range(pgd, addr, next, phys_addr, prot,
-					&mask);
+		err = vmap_p4d_range(pgd, addr, next, phys_addr, prot, &mask);
 		if (err)
 			break;
 	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);
@@ -272,7 +263,7 @@ void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot)
 		return NULL;
 	vaddr = (unsigned long)area->addr;
 
-	if (ioremap_page_range(vaddr, vaddr + size, addr, __pgprot(prot))) {
+	if (vmap_page_range(vaddr, vaddr + size, addr, __pgprot(prot))) {
 		free_vm_area(area);
 		return NULL;
 	}
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

The vmalloc mapper operates on a struct page * array rather than a
linear physical address, re-name it to make this distinction clear.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/vmalloc.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 49f225b0f855..3a1e45fd1626 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
 		arch_sync_kernel_mappings(start, end);
 }
 
-static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
 	pte_t *pte;
 
@@ -218,9 +217,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -230,15 +228,14 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask))
+		if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
-static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -248,15 +245,14 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask))
+		if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
 
-static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
-		pgtbl_mod_mask *mask)
+static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
+		pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -266,7 +262,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask))
+		if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
@@ -307,7 +303,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 		next = pgd_addr_end(addr, end);
 		if (pgd_bad(*pgd))
 			mask |= PGTBL_PGD_MODIFIED;
-		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
+		err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask);
 		if (err)
 			return err;
 	} while (pgd++, addr = next, addr != end);
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 2/8] mm: apply_to_pte_range warn and fail if a large pte is encountered
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/memory.c | 60 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..1d5f3093c249 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2260,13 +2260,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 	}
 	do {
 		next = pmd_addr_end(addr, end);
-		if (create || !pmd_none_or_clear_bad(pmd)) {
-			err = apply_to_pte_range(mm, pmd, addr, next, fn, data,
-						 create);
-			if (err)
-				break;
+		if (pmd_none(*pmd) && !create)
+			continue;
+		if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+			return -EINVAL;
+		if (WARN_ON_ONCE(pmd_bad(*pmd))) {
+			if (!create)
+				continue;
+			pmd_clear_bad(pmd);
 		}
+		err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create);
+		if (err)
+			break;
 	} while (pmd++, addr = next, addr != end);
+
 	return err;
 }
 
@@ -2287,13 +2294,20 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	}
 	do {
 		next = pud_addr_end(addr, end);
-		if (create || !pud_none_or_clear_bad(pud)) {
-			err = apply_to_pmd_range(mm, pud, addr, next, fn, data,
-						 create);
-			if (err)
-				break;
+		if (pud_none(*pud) && !create)
+			continue;
+		if (WARN_ON_ONCE(pud_leaf(*pud)))
+			return -EINVAL;
+		if (WARN_ON_ONCE(pud_bad(*pud))) {
+			if (!create)
+				continue;
+			pud_clear_bad(pud);
 		}
+		err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create);
+		if (err)
+			break;
 	} while (pud++, addr = next, addr != end);
+
 	return err;
 }
 
@@ -2314,13 +2328,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	}
 	do {
 		next = p4d_addr_end(addr, end);
-		if (create || !p4d_none_or_clear_bad(p4d)) {
-			err = apply_to_pud_range(mm, p4d, addr, next, fn, data,
-						 create);
-			if (err)
-				break;
+		if (p4d_none(*p4d) && !create)
+			continue;
+		if (WARN_ON_ONCE(p4d_leaf(*p4d)))
+			return -EINVAL;
+		if (WARN_ON_ONCE(p4d_bad(*p4d))) {
+			if (!create)
+				continue;
+			p4d_clear_bad(p4d);
 		}
+		err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create);
+		if (err)
+			break;
 	} while (p4d++, addr = next, addr != end);
+
 	return err;
 }
 
@@ -2339,8 +2360,15 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 	pgd = pgd_offset(mm, addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		if (!create && pgd_none_or_clear_bad(pgd))
+		if (pgd_none(*pgd) && !create)
 			continue;
+		if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+			return -EINVAL;
+		if (WARN_ON_ONCE(pgd_bad(*pgd))) {
+			if (!create)
+				continue;
+			pgd_clear_bad(pgd);
+		}
 		err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create);
 		if (err)
 			break;
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 1/8] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel
In-Reply-To: <20200810022732.1150009-1-npiggin@gmail.com>

vmalloc_to_page returns NULL for addresses mapped by larger pages[*].
Whether or not a vmap is huge depends on the architecture details,
alignments, boot options, etc., which the caller can not be expected
to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page.

This change teaches vmalloc_to_page about larger pages, and returns
the struct page that corresponds to the offset within the large page.
This makes the API agnostic to mapping implementation details.

[*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap:
    fail gracefully on unexpected huge vmap mappings")

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/vmalloc.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b482d240f9a2..49f225b0f855 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -38,6 +38,7 @@
 #include <linux/overflow.h>
 
 #include <linux/uaccess.h>
+#include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -343,7 +344,9 @@ int is_vmalloc_or_module_addr(const void *x)
 }
 
 /*
- * Walk a vmap address to the struct page it maps.
+ * Walk a vmap address to the struct page it maps. Huge vmap mappings will
+ * return the tail page that corresponds to the base page address, which
+ * matches small vmap mappings.
  */
 struct page *vmalloc_to_page(const void *vmalloc_addr)
 {
@@ -363,25 +366,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 
 	if (pgd_none(*pgd))
 		return NULL;
+	if (WARN_ON_ONCE(pgd_leaf(*pgd)))
+		return NULL; /* XXX: no allowance for huge pgd */
+	if (WARN_ON_ONCE(pgd_bad(*pgd)))
+		return NULL;
+
 	p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d))
 		return NULL;
-	pud = pud_offset(p4d, addr);
+	if (p4d_leaf(*p4d))
+		return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT);
+	if (WARN_ON_ONCE(p4d_bad(*p4d)))
+		return NULL;
 
-	/*
-	 * Don't dereference bad PUD or PMD (below) entries. This will also
-	 * identify huge mappings, which we may encounter on architectures
-	 * that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be
-	 * identified as vmalloc addresses by is_vmalloc_addr(), but are
-	 * not [unambiguously] associated with a struct page, so there is
-	 * no correct value to return for them.
-	 */
-	WARN_ON_ONCE(pud_bad(*pud));
-	if (pud_none(*pud) || pud_bad(*pud))
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return NULL;
+	if (pud_leaf(*pud))
+		return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+	if (WARN_ON_ONCE(pud_bad(*pud)))
 		return NULL;
+
 	pmd = pmd_offset(pud, addr);
-	WARN_ON_ONCE(pmd_bad(*pmd));
-	if (pmd_none(*pmd) || pmd_bad(*pmd))
+	if (pmd_none(*pmd))
+		return NULL;
+	if (pmd_leaf(*pmd))
+		return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	if (WARN_ON_ONCE(pmd_bad(*pmd)))
 		return NULL;
 
 	ptep = pte_offset_map(pmd, addr);
@@ -389,6 +400,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 	if (pte_present(pte))
 		page = pte_page(pte);
 	pte_unmap(ptep);
+
 	return page;
 }
 EXPORT_SYMBOL(vmalloc_to_page);
-- 
2.23.0


^ permalink raw reply related

* [PATCH v3 0/8] huge vmalloc mappings
From: Nicholas Piggin @ 2020-08-10  2:27 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-arch, Zefan Li, Catalin Marinas, x86, linuxppc-dev,
	Nicholas Piggin, linux-kernel, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Thomas Gleixner, Will Deacon, linux-arm-kernel

Not tested on x86 or arm64, would appreciate a quick test there so I can
ask Andrew to put it in -mm. Other option is I can disable huge vmallocs
for them for the time being.

Since v2:
- Rebased on vmalloc cleanups, split series into simpler pieces.
- Fixed several compile errors and warnings
- Keep the page array and accounting in small page units because
  struct vm_struct is an interface (this should fix x86 vmap stack debug
  assert). [Thanks Zefan]

Nicholas Piggin (8):
  mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
  mm: apply_to_pte_range warn and fail if a large pte is encountered
  mm/vmalloc: rename vmap_*_range vmap_pages_*_range
  lib/ioremap: rename ioremap_*_range to vmap_*_range
  mm: HUGE_VMAP arch support cleanup
  mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
  mm/vmalloc: add vmap_range_noflush variant
  mm/vmalloc: Hugepage vmalloc mappings

 .../admin-guide/kernel-parameters.txt         |   2 +
 arch/arm64/mm/mmu.c                           |  10 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c      |   8 +-
 arch/x86/mm/ioremap.c                         |  10 +-
 include/linux/io.h                            |   9 -
 include/linux/vmalloc.h                       |  13 +
 init/main.c                                   |   1 -
 mm/ioremap.c                                  | 231 +--------
 mm/memory.c                                   |  60 ++-
 mm/vmalloc.c                                  | 442 +++++++++++++++---
 10 files changed, 453 insertions(+), 333 deletions(-)

-- 
2.23.0


^ permalink raw reply

* Re: [PATCH] powerpc/legacy_serial: Use early_ioremap()
From: Chris Packham @ 2020-08-10  2:01 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: Hamish Martin, linuxppc-dev@lists.ozlabs.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <d94f74c57112e002959143bb5ccdcd9be80ed8d2.camel@alliedtelesis.co.nz>


On 24/03/20 10:54 am, Chris Packham wrote:
> Hi Christophe,
>
> On Wed, 2020-02-05 at 12:03 +0000, Christophe Leroy wrote:
>> [    0.000000] ioremap() called early from
>> find_legacy_serial_ports+0x3cc/0x474. Use early_ioremap() instead
>>
> I was just about to dig into this error message and found you patch. I
> applied it to a v5.5 base.
>
>> find_legacy_serial_ports() is called early from setup_arch(), before
>> paging_init(). vmalloc is not available yet, ioremap shouldn't be
>> used that early.
>>
>> Use early_ioremap() and switch to a regular ioremap() later.
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> On my system (Freescale T2080 SOC) this seems to cause a crash/hang in
> early boot. Unfortunately because this is affecting the boot console I
> don't get any earlyprintk output.

I've been doing a bit more digging into why Christophe's patch didn't 
work for me. I noticed the powerpc specific early_ioremap_range() 
returns addresses around ioremap_bot. Yet the generic early_ioremap() 
uses addresses around FIXADDR_TOP. If I try the following hack I can 
make Christophe's patch work

diff --git a/arch/powerpc/include/asm/fixmap.h 
b/arch/powerpc/include/asm/fixmap.h
index 2ef155a3c821..7bc2f3f73c8b 100644
--- a/arch/powerpc/include/asm/fixmap.h
+++ b/arch/powerpc/include/asm/fixmap.h
@@ -27,7 +27,7 @@
  #include <asm/kasan.h>
  #define FIXADDR_TOP    (KASAN_SHADOW_START - PAGE_SIZE)
  #else
-#define FIXADDR_TOP    ((unsigned long)(-PAGE_SIZE))
+#define FIXADDR_TOP    (IOREMAP_END - PAGE_SIZE)
  #endif

  /*

I'll admit to being out of my depth. It seems that the generic 
early_ioremap() is not quite correctly plumbed in for powerpc.

>> ---
>>   arch/powerpc/kernel/legacy_serial.c | 33 +++++++++++++++++++++++++
>> ----
>>   1 file changed, 29 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/legacy_serial.c
>> b/arch/powerpc/kernel/legacy_serial.c
>> index f061e06e9f51..8b2c1a8553a0 100644
>> --- a/arch/powerpc/kernel/legacy_serial.c
>> +++ b/arch/powerpc/kernel/legacy_serial.c
>> @@ -15,6 +15,7 @@
>>   #include <asm/udbg.h>
>>   #include <asm/pci-bridge.h>
>>   #include <asm/ppc-pci.h>
>> +#include <asm/early_ioremap.h>
>>   
>>   #undef DEBUG
>>   
>> @@ -34,6 +35,7 @@ static struct legacy_serial_info {
>>   	unsigned int			clock;
>>   	int				irq_check_parent;
>>   	phys_addr_t			taddr;
>> +	void __iomem			*early_addr;
>>   } legacy_serial_infos[MAX_LEGACY_SERIAL_PORTS];
>>   
>>   static const struct of_device_id legacy_serial_parents[] __initconst
>> = {
>> @@ -325,17 +327,16 @@ static void __init
>> setup_legacy_serial_console(int console)
>>   {
>>   	struct legacy_serial_info *info =
>> &legacy_serial_infos[console];
>>   	struct plat_serial8250_port *port =
>> &legacy_serial_ports[console];
>> -	void __iomem *addr;
>>   	unsigned int stride;
>>   
>>   	stride = 1 << port->regshift;
>>   
>>   	/* Check if a translated MMIO address has been found */
>>   	if (info->taddr) {
>> -		addr = ioremap(info->taddr, 0x1000);
>> -		if (addr == NULL)
>> +		info->early_addr = early_ioremap(info->taddr, 0x1000);
>> +		if (info->early_addr == NULL)
>>   			return;
>> -		udbg_uart_init_mmio(addr, stride);
>> +		udbg_uart_init_mmio(info->early_addr, stride);
>>   	} else {
>>   		/* Check if it's PIO and we support untranslated PIO */
>>   		if (port->iotype == UPIO_PORT && isa_io_special)
>> @@ -353,6 +354,30 @@ static void __init
>> setup_legacy_serial_console(int console)
>>   	udbg_uart_setup(info->speed, info->clock);
>>   }
>>   
>> +static int __init ioremap_legacy_serial_console(void)
>> +{
>> +	struct legacy_serial_info *info =
>> &legacy_serial_infos[legacy_serial_console];
>> +	struct plat_serial8250_port *port =
>> &legacy_serial_ports[legacy_serial_console];
>> +	void __iomem *vaddr;
>> +
>> +	if (legacy_serial_console < 0)
>> +		return 0;
>> +
>> +	if (!info->early_addr)
>> +		return 0;
>> +
>> +	vaddr = ioremap(info->taddr, 0x1000);
>> +	if (WARN_ON(!vaddr))
>> +		return -ENOMEM;
>> +
>> +	udbg_uart_init_mmio(vaddr, 1 << port->regshift);
>> +	early_iounmap(info->early_addr, 0x1000);
>> +	info->early_addr = NULL;
>> +
>> +	return 0;
>> +}
>> +early_initcall(ioremap_legacy_serial_console);
>> +
>>   /*
>>    * This is called very early, as part of setup_system() or
>> eventually
>>    * setup_arch(), basically before anything else in this file. This
>> function

^ permalink raw reply related

* Re: [PATCH 2/9] macintosh/via-macii: Poll the device most likely to respond
From: Guenter Roeck @ 2020-08-10  1:16 UTC (permalink / raw)
  To: Finn Thain
  Cc: Laurent Vivier, Mark Cave-Ayland, linux-kernel, linux-m68k,
	Geert Uytterhoeven, linuxppc-dev, Joshua Thompson
In-Reply-To: <alpine.LNX.2.23.453.2008100844450.8@nippy.intranet>

Hi,

On 8/9/20 3:58 PM, Finn Thain wrote:
> On Sun, 9 Aug 2020, Guenter Roeck wrote:
> 
>> Hi,
>>
>> On Sun, Jun 28, 2020 at 02:23:12PM +1000, Finn Thain wrote:
>>> Poll the most recently polled device by default, rather than the lowest
>>> device address that happens to be enabled in autopoll_devs. This improves
>>> input latency. Re-use macii_queue_poll() rather than duplicate that logic.
>>> This eliminates a static struct and function.
>>>
>>> Fixes: d95fd5fce88f0 ("m68k: Mac II ADB fixes") # v5.0+
>>> Tested-by: Stan Johnson <userm57@yahoo.com>
>>> Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
>>
>> With this patch applied, the qemu "q800" emulation no longer works and 
>> is stuck in early boot. Any idea why that might be the case, and/or how 
>> to debug it ?
>>
> 
> The problem you're seeing was mentioned in the cover letter,
> https://lore.kernel.org/linux-m68k/cover.1593318192.git.fthain@telegraphics.com.au/
> 
> Since this series was merged, Linus' tree is no longer compatible with 
> long-standing QEMU bugs.
> 
> The best way to fix this is to upgrade QEMU (latest is 5.1.0-rc3). Or use 
> the serial console instead of the framebuffer console.
> 

I have no problem with that. Actually, I had checked the qemu commit log,
 but somehow I had missed missed the commits there.

> I regret the inconvenience but the alternative was worse: adding code to 
> Linux to get compatibility with QEMU bugs (which were added to QEMU due to 
> Linux bugs).
> 
> My main concern is correct operation on actual hardware, as always. But 
> some QEMU developers are working on support for operating systems besides 
> Linux.
> 
> Therefore, I believe that both QEMU and Linux should aim for compatibility 
> with actual hardware and not bug compatibility with each other.
> 
I absolutely agree.

I repeated the test on the mainline kernel with qemu v5.1-rc3, and it works.
I also made sure that older versions of Linux still work with the qemu
v5.1.0-rc3. So everything is good, and sorry for the noise.

Thanks,
Guenter

^ permalink raw reply

* Re: [PATCH 1/9] macintosh/via-macii: Access autopoll_devs when inside lock
From: Finn Thain @ 2020-08-09 23:15 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Laurent Vivier, Mark Cave-Ayland, linux-kernel, linux-m68k,
	Geert Uytterhoeven, linuxppc-dev, Joshua Thompson
In-Reply-To: <20200809190138.GA133890@roeck-us.net>

On Sun, 9 Aug 2020, Guenter Roeck wrote:

> Hi,
> 
> On Sun, Jun 28, 2020 at 02:23:12PM +1000, Finn Thain wrote:
> > The interrupt handler should be excluded when accessing the 
> > autopoll_devs variable.
> > 
> 
> I am quite baffled by this patch. Other than adding an unnecessary lock 
> / unlock sequence,

The new lock/unlock sequence means that the expression (autopoll_devs && 
!current_req) can be understood to be atomic. That makes it easier for me 
to follow (being that both variables are shared state).

> accessing a variable (which is derived from another variable) from 
> inside or outside a lock does not make a difference. If autopoll_devs = 
> devs & 0xfffe is 0 inside the lock, it will just as much be 0 outside 
> the lock, and vice versa.
> 
> Can you explain this in some more detail ? Not that is matters much 
> since the change already made it into mainline, but I would like to 
> understand what if anything I am missing here.
> 

I think the new code is more readable and is obviously free of race 
conditions. It's not obvious to me why the old code was free of race 
conditions but if you can easily establish that by inspection then you are 
a better auditor than I am. Regardless, I'll stick with "Keep It Simple, 
Stupid".

^ permalink raw reply

* Re: [PATCH 2/9] macintosh/via-macii: Poll the device most likely to respond
From: Finn Thain @ 2020-08-09 22:58 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Laurent Vivier, Mark Cave-Ayland, linux-kernel, linux-m68k,
	Geert Uytterhoeven, linuxppc-dev, Joshua Thompson
In-Reply-To: <20200809185541.GA133779@roeck-us.net>

On Sun, 9 Aug 2020, Guenter Roeck wrote:

> Hi,
> 
> On Sun, Jun 28, 2020 at 02:23:12PM +1000, Finn Thain wrote:
> > Poll the most recently polled device by default, rather than the lowest
> > device address that happens to be enabled in autopoll_devs. This improves
> > input latency. Re-use macii_queue_poll() rather than duplicate that logic.
> > This eliminates a static struct and function.
> > 
> > Fixes: d95fd5fce88f0 ("m68k: Mac II ADB fixes") # v5.0+
> > Tested-by: Stan Johnson <userm57@yahoo.com>
> > Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
> 
> With this patch applied, the qemu "q800" emulation no longer works and 
> is stuck in early boot. Any idea why that might be the case, and/or how 
> to debug it ?
> 

The problem you're seeing was mentioned in the cover letter,
https://lore.kernel.org/linux-m68k/cover.1593318192.git.fthain@telegraphics.com.au/

Since this series was merged, Linus' tree is no longer compatible with 
long-standing QEMU bugs.

The best way to fix this is to upgrade QEMU (latest is 5.1.0-rc3). Or use 
the serial console instead of the framebuffer console.

I regret the inconvenience but the alternative was worse: adding code to 
Linux to get compatibility with QEMU bugs (which were added to QEMU due to 
Linux bugs).

My main concern is correct operation on actual hardware, as always. But 
some QEMU developers are working on support for operating systems besides 
Linux.

Therefore, I believe that both QEMU and Linux should aim for compatibility 
with actual hardware and not bug compatibility with each other.

^ permalink raw reply

* Re: [PATCH 1/9] macintosh/via-macii: Access autopoll_devs when inside lock
From: Guenter Roeck @ 2020-08-09 19:01 UTC (permalink / raw)
  To: Finn Thain
  Cc: Laurent Vivier, Mark Cave-Ayland, linux-kernel, linux-m68k,
	Geert Uytterhoeven, linuxppc-dev, Joshua Thompson
In-Reply-To: <5952dd8a9bc9de90f1acc4790c51dd42b4c98065.1593318192.git.fthain@telegraphics.com.au>

Hi,

On Sun, Jun 28, 2020 at 02:23:12PM +1000, Finn Thain wrote:
> The interrupt handler should be excluded when accessing the autopoll_devs
> variable.
> 

I am quite baffled by this patch. Other than adding an unnecessary lock /
unlock sequence, accessing a variable (which is derived from another
variable) from inside or outside a lock does not make a difference.
If autopoll_devs = devs & 0xfffe is 0 inside the lock, it will just
as much be 0 outside the lock, and vice versa.

Can you explain this in some more detail ? Not that is matters much since
the change already made it into mainline, but I would like to understand
what if anything I am missing here.

Thanks,
Guenter

> Fixes: d95fd5fce88f0 ("m68k: Mac II ADB fixes") # v5.0+
> Tested-by: Stan Johnson <userm57@yahoo.com>
> Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
> ---
>  drivers/macintosh/via-macii.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/macintosh/via-macii.c b/drivers/macintosh/via-macii.c
> index ac824d7b2dcfc..6aa903529570d 100644
> --- a/drivers/macintosh/via-macii.c
> +++ b/drivers/macintosh/via-macii.c
> @@ -270,15 +270,12 @@ static int macii_autopoll(int devs)
>  	unsigned long flags;
>  	int err = 0;
>  
> +	local_irq_save(flags);
> +
>  	/* bit 1 == device 1, and so on. */
>  	autopoll_devs = devs & 0xFFFE;
>  
> -	if (!autopoll_devs)
> -		return 0;
> -
> -	local_irq_save(flags);
> -
> -	if (current_req == NULL) {
> +	if (autopoll_devs && !current_req) {
>  		/* Send a Talk Reg 0. The controller will repeatedly transmit
>  		 * this as long as it is idle.
>  		 */

^ permalink raw reply

* Re: [PATCH 2/9] macintosh/via-macii: Poll the device most likely to respond
From: Guenter Roeck @ 2020-08-09 18:55 UTC (permalink / raw)
  To: Finn Thain
  Cc: Laurent Vivier, Mark Cave-Ayland, linux-kernel, linux-m68k,
	Geert Uytterhoeven, linuxppc-dev, Joshua Thompson
In-Reply-To: <5836f80886ebcfbe5be5fb7e0dc49feed6469712.1593318192.git.fthain@telegraphics.com.au>

Hi,

On Sun, Jun 28, 2020 at 02:23:12PM +1000, Finn Thain wrote:
> Poll the most recently polled device by default, rather than the lowest
> device address that happens to be enabled in autopoll_devs. This improves
> input latency. Re-use macii_queue_poll() rather than duplicate that logic.
> This eliminates a static struct and function.
> 
> Fixes: d95fd5fce88f0 ("m68k: Mac II ADB fixes") # v5.0+
> Tested-by: Stan Johnson <userm57@yahoo.com>
> Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

With this patch applied, the qemu "q800" emulation no longer works and is stuck
in early boot. Any idea why that might be the case, and/or how to debug it ?

Thanks,
Guenter

> ---
>  drivers/macintosh/via-macii.c | 99 +++++++++++++++++++----------------
>  1 file changed, 53 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/macintosh/via-macii.c b/drivers/macintosh/via-macii.c
> index 6aa903529570d..d4f1a65c5f1fd 100644
> --- a/drivers/macintosh/via-macii.c
> +++ b/drivers/macintosh/via-macii.c
> @@ -77,6 +77,10 @@ static volatile unsigned char *via;
>  #define ST_ODD		0x20		/* ADB state: odd data byte */
>  #define ST_IDLE		0x30		/* ADB state: idle, nothing to send */
>  
> +/* ADB command byte structure */
> +#define ADDR_MASK	0xF0
> +#define CMD_MASK	0x0F
> +
>  static int macii_init_via(void);
>  static void macii_start(void);
>  static irqreturn_t macii_interrupt(int irq, void *arg);
> @@ -117,7 +121,8 @@ static int reply_len; /* number of bytes received in reply_buf or req->reply */
>  static int status;          /* VIA's ADB status bits captured upon interrupt */
>  static int last_status;              /* status bits as at previous interrupt */
>  static int srq_asserted;     /* have to poll for the device that asserted it */
> -static int command_byte;         /* the most recent command byte transmitted */
> +static u8 last_cmd;              /* the most recent command byte transmitted */
> +static u8 last_poll_cmd; /* the most recent Talk R0 command byte transmitted */
>  static int autopoll_devs;      /* bits set are device addresses to be polled */
>  
>  /* Check for MacII style ADB */
> @@ -179,35 +184,49 @@ static int macii_init_via(void)
>  /* Send an ADB poll (Talk Register 0 command prepended to the request queue) */
>  static void macii_queue_poll(void)
>  {
> -	/* No point polling the active device as it will never assert SRQ, so
> -	 * poll the next device in the autopoll list. This could leave us
> -	 * stuck in a polling loop if an unprobed device is asserting SRQ.
> -	 * In theory, that could only happen if a device was plugged in after
> -	 * probing started. Unplugging it again will break the cycle.
> -	 * (Simply polling the next higher device often ends up polling almost
> -	 * every device (after wrapping around), which takes too long.)
> -	 */
> -	int device_mask;
> -	int next_device;
>  	static struct adb_request req;
> +	unsigned char poll_command;
> +	unsigned int poll_addr;
>  
> +	/* This only polls devices in the autopoll list, which assumes that
> +	 * unprobed devices never assert SRQ. That could happen if a device was
> +	 * plugged in after the adb bus scan. Unplugging it again will resolve
> +	 * the problem. This behaviour is similar to MacOS.
> +	 */
>  	if (!autopoll_devs)
>  		return;
>  
> -	device_mask = (1 << (((command_byte & 0xF0) >> 4) + 1)) - 1;
> -	if (autopoll_devs & ~device_mask)
> -		next_device = ffs(autopoll_devs & ~device_mask) - 1;
> -	else
> -		next_device = ffs(autopoll_devs) - 1;
> +	/* The device most recently polled may not be the best device to poll
> +	 * right now. Some other device(s) may have signalled SRQ (the active
> +	 * device won't do that). Or the autopoll list may have been changed.
> +	 * Try polling the next higher address.
> +	 */
> +	poll_addr = (last_poll_cmd & ADDR_MASK) >> 4;
> +	if ((srq_asserted && last_cmd == last_poll_cmd) ||
> +	    !(autopoll_devs & (1 << poll_addr))) {
> +		unsigned int higher_devs;
> +
> +		higher_devs = autopoll_devs & -(1 << (poll_addr + 1));
> +		poll_addr = ffs(higher_devs ? higher_devs : autopoll_devs) - 1;
> +	}
>  
> -	adb_request(&req, NULL, ADBREQ_NOSEND, 1, ADB_READREG(next_device, 0));
> +	/* Send a Talk Register 0 command */
> +	poll_command = ADB_READREG(poll_addr, 0);
> +
> +	/* No need to repeat this Talk command. The transceiver will do that
> +	 * as long as it is idle.
> +	 */
> +	if (poll_command == last_cmd)
> +		return;
> +
> +	adb_request(&req, NULL, ADBREQ_NOSEND, 1, poll_command);
>  
>  	req.sent = 0;
>  	req.complete = 0;
>  	req.reply_len = 0;
>  	req.next = current_req;
>  
> -	if (current_req != NULL) {
> +	if (WARN_ON(current_req)) {
>  		current_req = &req;
>  	} else {
>  		current_req = &req;
> @@ -266,37 +285,22 @@ static int macii_write(struct adb_request *req)
>  /* Start auto-polling */
>  static int macii_autopoll(int devs)
>  {
> -	static struct adb_request req;
>  	unsigned long flags;
> -	int err = 0;
>  
>  	local_irq_save(flags);
>  
>  	/* bit 1 == device 1, and so on. */
>  	autopoll_devs = devs & 0xFFFE;
>  
> -	if (autopoll_devs && !current_req) {
> -		/* Send a Talk Reg 0. The controller will repeatedly transmit
> -		 * this as long as it is idle.
> -		 */
> -		adb_request(&req, NULL, ADBREQ_NOSEND, 1,
> -		            ADB_READREG(ffs(autopoll_devs) - 1, 0));
> -		err = macii_write(&req);
> +	if (!current_req) {
> +		macii_queue_poll();
> +		if (current_req && macii_state == idle)
> +			macii_start();
>  	}
>  
>  	local_irq_restore(flags);
> -	return err;
> -}
>  
> -static inline int need_autopoll(void)
> -{
> -	/* Was the last command Talk Reg 0
> -	 * and is the target on the autopoll list?
> -	 */
> -	if ((command_byte & 0x0F) == 0x0C &&
> -	    ((1 << ((command_byte & 0xF0) >> 4)) & autopoll_devs))
> -		return 0;
> -	return 1;
> +	return 0;
>  }
>  
>  /* Prod the chip without interrupts */
> @@ -333,7 +337,12 @@ static void macii_start(void)
>  	 */
>  
>  	/* store command byte */
> -	command_byte = req->data[1];
> +	last_cmd = req->data[1];
> +
> +	/* If this is a Talk Register 0 command, store the command byte */
> +	if ((last_cmd & CMD_MASK) == ADB_READREG(0, 0))
> +		last_poll_cmd = last_cmd;
> +
>  	/* Output mode */
>  	via[ACR] |= SR_OUT;
>  	/* Load data */
> @@ -424,10 +433,11 @@ static irqreturn_t macii_interrupt(int irq, void *arg)
>  				if (req->done)
>  					(*req->done)(req);
>  
> -				if (current_req)
> +				if (!current_req)
> +					macii_queue_poll();
> +
> +				if (current_req && macii_state == idle)
>  					macii_start();
> -				else if (need_autopoll())
> -					macii_autopoll(autopoll_devs);
>  			}
>  
>  			if (macii_state == idle) {
> @@ -507,14 +517,11 @@ static irqreturn_t macii_interrupt(int irq, void *arg)
>  
>  		macii_state = idle;
>  
> -		/* SRQ seen before, initiate poll now */
> -		if (srq_asserted)
> +		if (!current_req)
>  			macii_queue_poll();
>  
>  		if (current_req)
>  			macii_start();
> -		else if (need_autopoll())
> -			macii_autopoll(autopoll_devs);
>  
>  		if (macii_state == idle)
>  			via[B] = (via[B] & ~ST_MASK) | ST_IDLE;
> -- 
> 2.26.2
> 

^ permalink raw reply

* Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
From: Aneesh Kumar K.V @ 2020-08-09 18:40 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: linuxppc-dev, Srikar Dronamraju
In-Reply-To: <6d880a50-09c4-d591-c88c-09fd77512ad3@linux.ibm.com>

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> On 8/8/20 2:15 AM, Nathan Lynch wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>> On 8/7/20 9:54 AM, Nathan Lynch wrote:
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>>>> index e437a9ac4956..6c659aada55b 100644
>>>>> --- a/arch/powerpc/mm/numa.c
>>>>> +++ b/arch/powerpc/mm/numa.c
>>>>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>>>>>    	}
>>>>>    }
>>>>>    
>>>>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
>>>>
>>>> It's odd to me to use MAX_NUMNODES for this array when it's going to be
>>>> indexed not by Linux's logical node IDs but by the platform-provided
>>>> domain number, which has no relation to MAX_NUMNODES.
>>>
>>>
>>> I didn't want to dynamically allocate this. We could fetch
>>> "ibm,max-associativity-domains" to find the size for that. The current
>>> code do assume  firmware group id to not exceed MAX_NUMNODES. Hence kept
>>> the array size to be MAX_NUMNODEs. I do agree that it is confusing. May
>>> be we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES?
>> 
>> Well, consider:
>> 
>> - ibm,max-associativity-domains can change at runtime with LPM. This
>>    doesn't happen in practice yet, but we should probably start thinking
>>    about how to support that.
>> - The domain numbering isn't clearly specified to have any particular
>>    properties such as beginning at zero or a contiguous range.
>> 
>> While the current code likely contains assumptions contrary to these
>> points, a change such as this is an opportunity to think about whether
>> those assumptions can be reduced or removed. In particular I think it
>> would be good to gracefully degrade when the number of NUMA affinity
>> domains can exceed MAX_NUMNODES. Using the platform-supplied domain
>> numbers to directly index Linux data structures will make that
>> impossible.
>> 
>> So, maybe genradix or even xarray wouldn't actually be overengineering
>> here.
>> 
>
> One of the challenges with such a data structure is that we initialize 
> the nid_map before the slab is available. This means a memblock based 
> allocation and we would end up implementing such a sparse data structure 
> ourselves here.
>
> As you mentioned above, since we know that hypervisor as of now limits 
> the max affinity domain id below ibm,max-associativity-domains we are 
> good with an array-like nid_map we have here. This keeps the code simpler.
>
> This will also allow us to switch to a more sparse data structure as you 
> requested here in the future because the main change that is pushed in 
> this series is the usage of firmare_group_id_to_nid(). The details of 
> the data structure we use to keep track of that mapping are pretty much 
> internal to that function.

How about this? This makes it not a direct index. But it do limit the
search to max numa node on the system. 

static int domain_id_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  -1 };

static int __affinity_domain_to_nid(int domain_id, int max_nid)
{
	int i;

	for (i = 0; i < max_nid; i++) {
		if (domain_id_map[i] == domain_id)
			return i;
	}
	return NUMA_NO_NODE;
}

int affinity_domain_to_nid(struct affinity_domain *domain)
{
	int nid, domain_id;
	static int last_nid = 0;
	static DEFINE_SPINLOCK(node_id_lock);

	domain_id = domain->id;
	/*
	 * For PowerNV we don't change the node id. This helps to avoid
	 * confusion w.r.t the expected node ids. On pseries, node numbers
	 * are virtualized. Hence do logical node id for pseries.
	 */
	if (!firmware_has_feature(FW_FEATURE_LPAR))
		return domain_id;

	if (domain_id ==  -1 || last_nid == MAX_NUMNODES)
		return NUMA_NO_NODE;

	nid = __affinity_domain_to_nid(domain_id, last_nid);

	if (nid == NUMA_NO_NODE) {
		spin_lock(&node_id_lock);
		/*  recheck with lock held */
		nid = __affinity_domain_to_nid(domain_id, last_nid);
		if (nid == NUMA_NO_NODE) {
			nid = last_nid++;
			domain_id_map[nid] = domain_id;
		}
		spin_unlock(&node_id_lock);
	}

	return nid;
}

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox