linux-arm-kernel.lists.infradead.org archive mirror
* [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
@ 2025-03-04 22:19 Yang Shi
  2025-03-04 22:19 ` [v3 PATCH 1/6] arm64: Add BBM Level 2 cpu feature Yang Shi
                   ` (6 more replies)
  0 siblings, 7 replies; 49+ messages in thread
From: Yang Shi @ 2025-03-04 22:19 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel


Changelog
=========
v3:
  * Rebased to v6.14-rc4.
  * Based on Miko's BBML2 cpufeature patch (https://lore.kernel.org/linux-arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
    Also included in this series in order to have the complete patchset.
  * Enhanced __create_pgd_mapping() to handle split as well per Ryan.
  * Supported CONT mappings per Ryan.
  * Supported asymmetric systems by splitting the kernel linear mapping if
    such a system is detected, per Ryan. I don't have such a system to test
    on, so testing was done by hacking the kernel to call linear mapping
    repainting unconditionally. The linear mapping doesn't have any block or
    cont mappings after booting.

RFC v2:
  * Used an allowlist to advertise BBM level 2 on the CPUs which can handle
    TLB conflicts gracefully, per Will Deacon
  * Rebased onto v6.13-rc5
  * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-yang@os.amperecomputing.com/

RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang@os.amperecomputing.com/

Description
===========
When rodata=full, the kernel linear mapping is mapped at PTE level due to
arm64's break-before-make rule.

A number of problems arise when the kernel linear map uses PTE entries:
  - performance degradation
  - more TLB pressure
  - memory waste for kernel page tables

These issues can be avoided by specifying rodata=on on the kernel command
line, but this disables the alias checks on page table permissions and
therefore compromises security somewhat.

With FEAT_BBM level 2 support it is no longer necessary to invalidate the
page table entry when changing page sizes.  This allows the kernel to
split large mappings after boot is complete.

This series adds support for splitting large mappings when FEAT_BBM level 2
is available and rodata=full is used. This functionality will be used
when modifying page permissions for individual page frames.
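
For example, patch #4 wires this split into the linear-alias permission
paths; below is an adapted sketch of its set_direct_map_invalid_noflush()
change (declarations condensed):

	unsigned long l_start = (unsigned long)page_address(page);

	/* Split any covering block/cont mapping down to PTEs first... */
	ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
	if (WARN_ON_ONCE(ret))
		return ret;

	/* ...then the per-page permission change proceeds as before. */
	return apply_to_page_range(&init_mm, l_start, PAGE_SIZE,
				   change_page_range, &data);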

Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
only.

If the system is asymmetric, the kernel linear mapping may be repainted once
the BBML2 capability is finalized on all CPUs.  See patch #6 for more details.

We saw significant performance increases in some benchmarks with
rodata=full without compromising the security features of the kernel.

Testing
=======
The tests were done on an AmpereOne machine (192 cores, 1P) with 256GB memory,
4K page size and 48-bit VA.

Functional tests (4K/16K/64K page size)
  - Kernel boot.  The kernel needs to change linear mapping permissions at
    boot; if the patches were broken, the kernel typically failed to boot.
  - Module stress from stress-ng.  Loading kernel modules changes permissions
    for the linear mapping.
  - A test kernel module which allocates 80% of total memory via vmalloc(),
    changes the vmalloc area permission to RO (which also changes the linear
    mapping permission to RO), then changes it back before vfree().  Then
    launch a VM which consumes almost all physical memory.  A minimal sketch
    of such a module is shown after this list.
  - VM with the patchset applied to the guest kernel too.
  - Kernel build in a VM whose guest kernel has this series applied.
  - rodata=on.  Make sure the other rodata modes are not broken.
  - Boot on a machine which doesn't support BBML2.
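
A minimal sketch of the vmalloc test module mentioned above (illustrative
only, not the exact module used for testing; the sizing and names are
hypothetical):

	#include <linux/module.h>
	#include <linux/mm.h>
	#include <linux/vmalloc.h>
	#include <linux/set_memory.h>

	static void *buf;
	static unsigned long nr_pages;

	static int __init split_test_init(void)
	{
		/* Roughly 80% of total memory, in pages */
		nr_pages = totalram_pages() * 8 / 10;
		buf = vmalloc(nr_pages << PAGE_SHIFT);
		if (!buf)
			return -ENOMEM;

		/*
		 * With rodata=full, making the vmalloc alias RO also makes
		 * the linear map alias RO page by page, which exercises the
		 * split path; restore RW before freeing.
		 */
		set_memory_ro((unsigned long)buf, nr_pages);
		set_memory_rw((unsigned long)buf, nr_pages);
		return 0;
	}

	static void __exit split_test_exit(void)
	{
		vfree(buf);
	}

	module_init(split_test_init);
	module_exit(split_test_exit);
	MODULE_LICENSE("GPL");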

Performance
===========
Memory consumption
Before:
MemTotal:       258988984 kB
MemFree:        254821700 kB

After:
MemTotal:       259505132 kB
MemFree:        255410264 kB

Around 500MB more memory is free to use.  The larger the machine, the
more memory saved.
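
For reference, the deltas above work out to:
  MemTotal: 259505132 kB - 258988984 kB = 516148 kB (~504 MB)
  MemFree:  255410264 kB - 254821700 kB = 588564 kB (~575 MB)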

Performance benchmarking
* Memcached
We saw performance degradation when running the Memcached benchmark with
rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
With this patchset, ops/sec increased by around 3.5% and P99 latency
dropped by around 9.6%.
The gain mainly came from reduced kernel TLB misses.  The kernel TLB
MPKI is reduced by 28.5%.

The benchmark data is now on par with rodata=on too.

* Disk encryption (dm-crypt) benchmark
Ran the fio benchmark with the command below on a 128G ramdisk (ext4) with
disk encryption (dm-crypt):
fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
    --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
    --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
    --name=iops-test-job --eta-newline=1 --size 100G

IOPS increased by 90% - 150% (the variance is high, but the worst result
with the patches is still around 90% better than the best result without
them).  Bandwidth increased and the average completion latency dropped
proportionally.

* Sequential file read
Read a 100G file sequentially on XFS (xfs_io read with the page cache
populated).  Bandwidth increased by 150%.


Mikołaj Lenczewski (1):
      arm64: Add BBM Level 2 cpu feature

Yang Shi (5):
      arm64: cpufeature: add AmpereOne to BBML2 allow list
      arm64: mm: make __create_pgd_mapping() and helpers non-void
      arm64: mm: support large block mapping when rodata=full
      arm64: mm: support split CONT mappings
      arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs

 arch/arm64/Kconfig                  |  11 +++++
 arch/arm64/include/asm/cpucaps.h    |   2 +
 arch/arm64/include/asm/cpufeature.h |  15 ++++++
 arch/arm64/include/asm/mmu.h        |   4 ++
 arch/arm64/include/asm/pgtable.h    |  12 ++++-
 arch/arm64/kernel/cpufeature.c      |  95 +++++++++++++++++++++++++++++++++++++
 arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
 arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
 arch/arm64/tools/cpucaps            |   1 +
 9 files changed, 518 insertions(+), 56 deletions(-)



^ permalink raw reply	[flat|nested] 49+ messages in thread

* [v3 PATCH 1/6] arm64: Add BBM Level 2 cpu feature
  2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
@ 2025-03-04 22:19 ` Yang Shi
  2025-03-04 22:19 ` [v3 PATCH 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list Yang Shi
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 49+ messages in thread
From: Yang Shi @ 2025-03-04 22:19 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

From: Mikołaj Lenczewski <miko.lenczewski@arm.com>

The Break-Before-Make cpu feature supports multiple levels (levels 0-2),
and this commit adds a dedicated BBML2 cpufeature that support for level 2
can be tested against.

This is a system feature as we might have a big.LITTLE architecture
where some cores support BBML2 and some don't, but we want all cores to
be available and BBM to default to level 0 (as opposed to having cores
without BBML2 not coming online).

To support BBML2 in as wide a range of contexts as we can, we want not
only the architectural guarantees that BBML2 makes, but additionally
want BBML2 to not create TLB conflict aborts. Not causing aborts avoids
us having to prove that no recursive faults can be induced in any path
that uses BBML2, allowing its use for arbitrary kernel mappings.
Support detection of such CPUs.

Signed-off-by: Mikołaj Lenczewski <miko.lenczewski@arm.com>
---
 arch/arm64/Kconfig                  | 11 +++++
 arch/arm64/include/asm/cpucaps.h    |  2 +
 arch/arm64/include/asm/cpufeature.h |  5 +++
 arch/arm64/kernel/cpufeature.c      | 69 +++++++++++++++++++++++++++++
 arch/arm64/tools/cpucaps            |  1 +
 5 files changed, 88 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 940343beb3d4..49deda2b22ae 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2057,6 +2057,17 @@ config ARM64_TLB_RANGE
 	  The feature introduces new assembly instructions, and they were
 	  support when binutils >= 2.30.
 
+config ARM64_BBML2_NOABORT
+	bool "Enable support for Break-Before-Make Level 2 detection and usage"
+	default y
+	help
+	  FEAT_BBM provides detection of support levels for break-before-make
+	  sequences. If BBM level 2 is supported, some TLB maintenance requirements
+	  can be relaxed to improve performance. We additionally require the
+	  property that the implementation cannot ever raise TLB Conflict Aborts.
+	  Selecting N causes the kernel to fall back to BBM level 0 behaviour
+	  even if the system supports BBM level 2.
+
 endmenu # "ARMv8.4 architectural features"
 
 menu "ARMv8.5 architectural features"
diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 0b5ca6e0eb09..2d6db33d4e45 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -23,6 +23,8 @@ cpucap_is_possible(const unsigned int cap)
 		return IS_ENABLED(CONFIG_ARM64_PAN);
 	case ARM64_HAS_EPAN:
 		return IS_ENABLED(CONFIG_ARM64_EPAN);
+	case ARM64_HAS_BBML2_NOABORT:
+		return IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT);
 	case ARM64_SVE:
 		return IS_ENABLED(CONFIG_ARM64_SVE);
 	case ARM64_SME:
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index e0e4478f5fb5..108ef3fbbc00 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -866,6 +866,11 @@ static __always_inline bool system_supports_mpam_hcr(void)
 	return alternative_has_cap_unlikely(ARM64_MPAM_HCR);
 }
 
+static inline bool system_supports_bbml2_noabort(void)
+{
+	return alternative_has_cap_unlikely(ARM64_HAS_BBML2_NOABORT);
+}
+
 int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
 bool try_emulate_mrs(struct pt_regs *regs, u32 isn);
 
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index d561cf3b8ac7..7934c6dd493e 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2176,6 +2176,68 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
 	return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
 }
 
+static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
+{
+	/* We want to allow usage of bbml2 in as wide a range of kernel contexts
+	 * as possible. This list is therefore an allow-list of known-good
+	 * implementations that both support bbml2 and additionally, fulfill the
+	 * extra constraint of never generating TLB conflict aborts when using
+	 * the relaxed bbml2 semantics (such aborts make use of bbml2 in certain
+	 * kernel contexts difficult to prove safe against recursive aborts).
+	 *
+	 * Note that implementations can only be considered "known-good" if their
+	 * implementors attest to the fact that the implementation never raises
+	 * TLBI conflict aborts for bbml2 mapping granularity changes.
+	 */
+	static const struct midr_range supports_bbml2_noabort_list[] = {
+		MIDR_REV_RANGE(MIDR_CORTEX_X4, 0, 3, 0xf),
+		MIDR_REV_RANGE(MIDR_NEOVERSE_V3, 0, 2, 0xf),
+		{}
+	};
+
+	return is_midr_in_range_list(cpu_midr, supports_bbml2_noabort_list);
+}
+
+static inline unsigned int __cpu_read_midr(int cpu)
+{
+	WARN_ON_ONCE(!cpu_online(cpu));
+
+	return per_cpu(cpu_data, cpu).reg_midr;
+}
+
+static bool has_bbml2_noabort(const struct arm64_cpu_capabilities *caps, int scope)
+{
+	if (!IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT))
+		return false;
+
+	if (scope & SCOPE_SYSTEM) {
+		int cpu;
+
+		/* We are a boot CPU, and must verify that all enumerated boot
+		 * CPUs have MIDR values within our allowlist. Otherwise, we do
+		 * not allow the BBML2 feature, to avoid potential faults when
+		 * CPUs lacking BBML2 access memory regions using BBML2
+		 * semantics.
+		 */
+		for_each_online_cpu(cpu) {
+			if (!cpu_has_bbml2_noabort(__cpu_read_midr(cpu)))
+				return false;
+		}
+
+		return true;
+	} else if (scope & SCOPE_LOCAL_CPU) {
+		/* We are a hot-plugged CPU, so only need to check our MIDR.
+		 * If we have the correct MIDR, but the kernel booted on an
+		 * insufficient CPU, we will not use BBML2 (this is safe). If
+		 * we have an incorrect MIDR, but the kernel booted on a
+		 * sufficient CPU, we will not bring up this CPU.
+		 */
+		return cpu_has_bbml2_noabort(read_cpuid_id());
+	}
+
+	return false;
+}
+
 #ifdef CONFIG_ARM64_PAN
 static void cpu_enable_pan(const struct arm64_cpu_capabilities *__unused)
 {
@@ -2926,6 +2988,13 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.matches = has_cpuid_feature,
 		ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, EVT, IMP)
 	},
+	{
+		.desc = "BBM Level 2 without conflict abort",
+		.capability = ARM64_HAS_BBML2_NOABORT,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_bbml2_noabort,
+		ARM64_CPUID_FIELDS(ID_AA64MMFR2_EL1, BBM, 2)
+	},
 	{
 		.desc = "52-bit Virtual Addressing for KVM (LPA2)",
 		.capability = ARM64_HAS_LPA2,
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 1e65f2fb45bd..b03a375e5507 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -14,6 +14,7 @@ HAS_ADDRESS_AUTH_ARCH_QARMA5
 HAS_ADDRESS_AUTH_IMP_DEF
 HAS_AMU_EXTN
 HAS_ARMv8_4_TTL
+HAS_BBML2_NOABORT
 HAS_CACHE_DIC
 HAS_CACHE_IDC
 HAS_CNP
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [v3 PATCH 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list
  2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
  2025-03-04 22:19 ` [v3 PATCH 1/6] arm64: Add BBM Level 2 cpu feature Yang Shi
@ 2025-03-04 22:19 ` Yang Shi
  2025-03-14 10:58   ` Ryan Roberts
  2025-03-04 22:19 ` [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void Yang Shi
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-03-04 22:19 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

AmpereOne supports BBML2 without conflict aborts, so add it to the allow list.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/kernel/cpufeature.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 7934c6dd493e..bf3df8407ca3 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2192,6 +2192,8 @@ static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
 	static const struct midr_range supports_bbml2_noabort_list[] = {
 		MIDR_REV_RANGE(MIDR_CORTEX_X4, 0, 3, 0xf),
 		MIDR_REV_RANGE(MIDR_NEOVERSE_V3, 0, 2, 0xf),
+		MIDR_ALL_VERSIONS(MIDR_AMPERE1),
+		MIDR_ALL_VERSIONS(MIDR_AMPERE1A),
 		{}
 	};
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void
  2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
  2025-03-04 22:19 ` [v3 PATCH 1/6] arm64: Add BBM Level 2 cpu feature Yang Shi
  2025-03-04 22:19 ` [v3 PATCH 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list Yang Shi
@ 2025-03-04 22:19 ` Yang Shi
  2025-03-14 11:51   ` Ryan Roberts
  2025-03-04 22:19 ` [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full Yang Shi
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-03-04 22:19 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

A later patch will enhance __create_pgd_mapping() and related helpers to
split the kernel linear mapping, which requires them to return a value.
So make __create_pgd_mapping() and its helpers non-void functions.

Also move the BUG_ON() out of the page table alloc helper, since a failure
to split the kernel linear mapping is not fatal and can be handled by the
callers in the later patch.  Keep a BUG_ON() after
__create_pgd_mapping_locked() returns so the current callers' behavior
stays intact.

Suggested-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/mm/mmu.c | 127 ++++++++++++++++++++++++++++++--------------
 1 file changed, 86 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index b4df5bc5b1b8..dccf0877285b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -189,11 +189,11 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 	} while (ptep++, addr += PAGE_SIZE, addr != end);
 }
 
-static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
-				unsigned long end, phys_addr_t phys,
-				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int),
-				int flags)
+static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
+			       unsigned long end, phys_addr_t phys,
+			       pgprot_t prot,
+			       phys_addr_t (*pgtable_alloc)(int),
+			       int flags)
 {
 	unsigned long next;
 	pmd_t pmd = READ_ONCE(*pmdp);
@@ -208,6 +208,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 			pmdval |= PMD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		pte_phys = pgtable_alloc(PAGE_SHIFT);
+		if (!pte_phys)
+			return -ENOMEM;
 		ptep = pte_set_fixmap(pte_phys);
 		init_clear_pgtable(ptep);
 		ptep += pte_index(addr);
@@ -239,13 +241,16 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	 * walker.
 	 */
 	pte_clear_fixmap();
+
+	return 0;
 }
 
-static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
-		     phys_addr_t phys, pgprot_t prot,
-		     phys_addr_t (*pgtable_alloc)(int), int flags)
+static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
+		    phys_addr_t phys, pgprot_t prot,
+		    phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long next;
+	int ret = 0;
 
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
@@ -264,22 +269,27 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
 						      READ_ONCE(pmd_val(*pmdp))));
 		} else {
-			alloc_init_cont_pte(pmdp, addr, next, phys, prot,
+			ret = alloc_init_cont_pte(pmdp, addr, next, phys, prot,
 					    pgtable_alloc, flags);
+			if (ret)
+				break;
 
 			BUG_ON(pmd_val(old_pmd) != 0 &&
 			       pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
 		}
 		phys += next - addr;
 	} while (pmdp++, addr = next, addr != end);
+
+	return ret;
 }
 
-static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
-				unsigned long end, phys_addr_t phys,
-				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int), int flags)
+static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
+			       unsigned long end, phys_addr_t phys,
+			       pgprot_t prot,
+			       phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long next;
+	int ret = 0;
 	pud_t pud = READ_ONCE(*pudp);
 	pmd_t *pmdp;
 
@@ -295,6 +305,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 			pudval |= PUD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		pmd_phys = pgtable_alloc(PMD_SHIFT);
+		if (!pmd_phys)
+			return -ENOMEM;
 		pmdp = pmd_set_fixmap(pmd_phys);
 		init_clear_pgtable(pmdp);
 		pmdp += pmd_index(addr);
@@ -314,21 +326,26 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
+		ret = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
+		if (ret)
+			break;
 
 		pmdp += pmd_index(next) - pmd_index(addr);
 		phys += next - addr;
 	} while (addr = next, addr != end);
 
 	pmd_clear_fixmap();
+
+	return ret;
 }
 
-static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
-			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
-			   int flags)
+static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
+			  phys_addr_t phys, pgprot_t prot,
+			  phys_addr_t (*pgtable_alloc)(int),
+			  int flags)
 {
 	unsigned long next;
+	int ret = 0;
 	p4d_t p4d = READ_ONCE(*p4dp);
 	pud_t *pudp;
 
@@ -340,6 +357,8 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 			p4dval |= P4D_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		pud_phys = pgtable_alloc(PUD_SHIFT);
+		if (!pud_phys)
+			return -ENOMEM;
 		pudp = pud_set_fixmap(pud_phys);
 		init_clear_pgtable(pudp);
 		pudp += pud_index(addr);
@@ -369,8 +388,10 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 			BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
 						      READ_ONCE(pud_val(*pudp))));
 		} else {
-			alloc_init_cont_pmd(pudp, addr, next, phys, prot,
+			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
 					    pgtable_alloc, flags);
+			if (ret)
+				break;
 
 			BUG_ON(pud_val(old_pud) != 0 &&
 			       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
@@ -379,14 +400,17 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 	} while (pudp++, addr = next, addr != end);
 
 	pud_clear_fixmap();
+
+	return ret;
 }
 
-static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
-			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
-			   int flags)
+static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
+			  phys_addr_t phys, pgprot_t prot,
+			  phys_addr_t (*pgtable_alloc)(int),
+			  int flags)
 {
 	unsigned long next;
+	int ret = 0;
 	pgd_t pgd = READ_ONCE(*pgdp);
 	p4d_t *p4dp;
 
@@ -398,6 +422,8 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 			pgdval |= PGD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
 		p4d_phys = pgtable_alloc(P4D_SHIFT);
+		if (!p4d_phys)
+			return -ENOMEM;
 		p4dp = p4d_set_fixmap(p4d_phys);
 		init_clear_pgtable(p4dp);
 		p4dp += p4d_index(addr);
@@ -412,8 +438,10 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 
 		next = p4d_addr_end(addr, end);
 
-		alloc_init_pud(p4dp, addr, next, phys, prot,
+		ret = alloc_init_pud(p4dp, addr, next, phys, prot,
 			       pgtable_alloc, flags);
+		if (ret)
+			break;
 
 		BUG_ON(p4d_val(old_p4d) != 0 &&
 		       p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
@@ -422,23 +450,26 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	} while (p4dp++, addr = next, addr != end);
 
 	p4d_clear_fixmap();
+
+	return ret;
 }
 
-static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
-					unsigned long virt, phys_addr_t size,
-					pgprot_t prot,
-					phys_addr_t (*pgtable_alloc)(int),
-					int flags)
+static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
+				       unsigned long virt, phys_addr_t size,
+				       pgprot_t prot,
+				       phys_addr_t (*pgtable_alloc)(int),
+				       int flags)
 {
 	unsigned long addr, end, next;
 	pgd_t *pgdp = pgd_offset_pgd(pgdir, virt);
+	int ret = 0;
 
 	/*
 	 * If the virtual and physical address don't have the same offset
 	 * within a page, we cannot map the region as the caller expects.
 	 */
 	if (WARN_ON((phys ^ virt) & ~PAGE_MASK))
-		return;
+		return -EINVAL;
 
 	phys &= PAGE_MASK;
 	addr = virt & PAGE_MASK;
@@ -446,29 +477,38 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 
 	do {
 		next = pgd_addr_end(addr, end);
-		alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
+		ret = alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
+		if (ret)
+			break;
 		phys += next - addr;
 	} while (pgdp++, addr = next, addr != end);
+
+	return ret;
 }
 
-static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
-				 unsigned long virt, phys_addr_t size,
-				 pgprot_t prot,
-				 phys_addr_t (*pgtable_alloc)(int),
-				 int flags)
+static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
+				unsigned long virt, phys_addr_t size,
+				pgprot_t prot,
+				phys_addr_t (*pgtable_alloc)(int),
+				int flags)
 {
+	int ret;
+
 	mutex_lock(&fixmap_lock);
-	__create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
+	ret = __create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
 				    pgtable_alloc, flags);
+	BUG_ON(ret);
 	mutex_unlock(&fixmap_lock);
+
+	return ret;
 }
 
 #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
 extern __alias(__create_pgd_mapping_locked)
-void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
-			     phys_addr_t size, pgprot_t prot,
-			     phys_addr_t (*pgtable_alloc)(int), int flags);
+int create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
+			    phys_addr_t size, pgprot_t prot,
+			    phys_addr_t (*pgtable_alloc)(int), int flags);
 #endif
 
 static phys_addr_t __pgd_pgtable_alloc(int shift)
@@ -476,13 +516,17 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
 	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
 	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
 
-	BUG_ON(!ptr);
+	if (!ptr)
+		return 0;
+
 	return __pa(ptr);
 }
 
 static phys_addr_t pgd_pgtable_alloc(int shift)
 {
 	phys_addr_t pa = __pgd_pgtable_alloc(shift);
+	if (!pa)
+		goto out;
 	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
 
 	/*
@@ -498,6 +542,7 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 	else if (shift == PMD_SHIFT)
 		BUG_ON(!pagetable_pmd_ctor(ptdesc));
 
+out:
 	return pa;
 }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full
  2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
                   ` (2 preceding siblings ...)
  2025-03-04 22:19 ` [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void Yang Shi
@ 2025-03-04 22:19 ` Yang Shi
  2025-03-08  1:53   ` kernel test robot
  2025-03-14 13:29   ` Ryan Roberts
  2025-03-04 22:19 ` [v3 PATCH 5/6] arm64: mm: support split CONT mappings Yang Shi
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 49+ messages in thread
From: Yang Shi @ 2025-03-04 22:19 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

When rodata=full is specified, the kernel linear mapping has to be mapped
at PTE level since large block mappings can't be split due to the
break-before-make rule on ARM64.

This results in a couple of problems:
  - performance degradation
  - more TLB pressure
  - memory waste for kernel page table

With FEAT_BBM level 2 support, splitting a large block mapping into
smaller ones no longer requires making the page table entry invalid.
This allows the kernel to split large block mappings on the fly.

Add kernel page table split support and use large block mappings by
default when FEAT_BBM level 2 is supported and rodata=full.  When
changing permissions for the kernel linear mapping, the page table will
be split to PTE level.

Machines without FEAT_BBM level 2 fall back to a PTE-mapped kernel
linear mapping when rodata=full.

With this we saw a significant performance boost in some benchmarks and
much less memory consumption on an AmpereOne machine (192 cores, 1P)
with 256GB memory.

* Memory use after boot
Before:
MemTotal:       258988984 kB
MemFree:        254821700 kB

After:
MemTotal:       259505132 kB
MemFree:        255410264 kB

Around 500MB more memory is free to use.  The larger the machine, the
more memory saved.

* Memcached
We saw performance degradation when running the Memcached benchmark with
rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
With this patchset, ops/sec increased by around 3.5% and P99 latency
dropped by around 9.6%.
The gain mainly came from reduced kernel TLB misses.  The kernel TLB
MPKI is reduced by 28.5%.

The benchmark data is now on par with rodata=on too.

* Disk encryption (dm-crypt) benchmark
Ran the fio benchmark with the command below on a 128G ramdisk (ext4) with
disk encryption (dm-crypt):
fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
    --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
    --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
    --name=iops-test-job --eta-newline=1 --size 100G

IOPS increased by 90% - 150% (the variance is high, but the worst result
with the patches is still around 90% better than the best result without
them).  Bandwidth increased and the average completion latency dropped
proportionally.

* Sequential file read
Read a 100G file sequentially on XFS (xfs_io read with the page cache
populated).  Bandwidth increased by 150%.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/cpufeature.h |  10 ++
 arch/arm64/include/asm/mmu.h        |   1 +
 arch/arm64/include/asm/pgtable.h    |   7 +-
 arch/arm64/kernel/cpufeature.c      |   2 +-
 arch/arm64/mm/mmu.c                 | 169 +++++++++++++++++++++++++++-
 arch/arm64/mm/pageattr.c            |  35 +++++-
 6 files changed, 211 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 108ef3fbbc00..e24edc32b0bd 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -871,6 +871,16 @@ static inline bool system_supports_bbml2_noabort(void)
 	return alternative_has_cap_unlikely(ARM64_HAS_BBML2_NOABORT);
 }
 
+bool cpu_has_bbml2_noabort(unsigned int cpu_midr);
+/*
+ * Called at early boot stage on boot CPU before cpu info and cpu feature
+ * are ready.
+ */
+static inline bool bbml2_noabort_available(void)
+{
+	return cpu_has_bbml2_noabort(read_cpuid_id());
+}
+
 int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
 bool try_emulate_mrs(struct pt_regs *regs, u32 isn);
 
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 662471cfc536..d658a33df266 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 			       pgprot_t prot, bool page_mappings_only);
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
 extern void mark_linear_text_alias_ro(void);
+extern int split_linear_mapping(unsigned long start, unsigned long end);
 
 /*
  * This check is triggered during the early boot before the cpufeature
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b2a2ad1b9e8..ed2fc1dcf7ae 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -749,7 +749,7 @@ static inline bool in_swapper_pgdir(void *addr)
 	        ((unsigned long)swapper_pg_dir & PAGE_MASK);
 }
 
-static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
+static inline void __set_pmd_nosync(pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef __PAGETABLE_PMD_FOLDED
 	if (in_swapper_pgdir(pmdp)) {
@@ -759,6 +759,11 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 #endif /* __PAGETABLE_PMD_FOLDED */
 
 	WRITE_ONCE(*pmdp, pmd);
+}
+
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+	__set_pmd_nosync(pmdp, pmd);
 
 	if (pmd_valid(pmd)) {
 		dsb(ishst);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index bf3df8407ca3..d39637d5aeab 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2176,7 +2176,7 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
 	return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
 }
 
-static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
+bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
 {
 	/* We want to allow usage of bbml2 in as wide a range of kernel contexts
 	 * as possible. This list is therefore an allow-list of known-good
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index dccf0877285b..ad0f1cc55e3a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -45,6 +45,7 @@
 #define NO_BLOCK_MAPPINGS	BIT(0)
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
+#define SPLIT_MAPPINGS		BIT(3)
 
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
@@ -166,6 +167,73 @@ static void init_clear_pgtable(void *table)
 	dsb(ishst);
 }
 
+static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
+		     phys_addr_t (*pgtable_alloc)(int))
+{
+	unsigned long pfn;
+	pgprot_t prot;
+	phys_addr_t pte_phys;
+	pte_t *ptep;
+
+	if (!pmd_leaf(pmdval))
+		return 0;
+
+	pfn = pmd_pfn(pmdval);
+	prot = pmd_pgprot(pmdval);
+
+	pte_phys = pgtable_alloc(PAGE_SHIFT);
+	if (!pte_phys)
+		return -ENOMEM;
+
+	ptep = (pte_t *)phys_to_virt(pte_phys);
+	init_clear_pgtable(ptep);
+	prot = __pgprot(pgprot_val(prot) | PTE_TYPE_PAGE);
+	for (int i = 0; i < PTRS_PER_PTE; i++, ptep++)
+		__set_pte_nosync(ptep, pfn_pte(pfn + i, prot));
+
+	dsb(ishst);
+
+	set_pmd(pmdp, pfn_pmd(__phys_to_pfn(pte_phys),
+		__pgprot(PMD_TYPE_TABLE)));
+
+	return 0;
+}
+
+static int split_pud(pud_t *pudp, pud_t pudval,
+		     phys_addr_t (*pgtable_alloc)(int))
+{
+	unsigned long pfn;
+	pgprot_t prot;
+	pmd_t *pmdp;
+	phys_addr_t pmd_phys;
+	unsigned int step;
+
+	if (!pud_leaf(pudval))
+		return 0;
+
+	pfn = pud_pfn(pudval);
+	prot = pud_pgprot(pudval);
+	step = PMD_SIZE >> PAGE_SHIFT;
+
+	pmd_phys = pgtable_alloc(PMD_SHIFT);
+	if (!pmd_phys)
+		return -ENOMEM;
+
+	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
+	init_clear_pgtable(pmdp);
+	for (int i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
+		__set_pmd_nosync(pmdp, pfn_pmd(pfn, prot));
+		pfn += step;
+	}
+
+	dsb(ishst);
+
+	set_pud(pudp, pfn_pud(__phys_to_pfn(pmd_phys),
+		__pgprot(PUD_TYPE_TABLE)));
+
+	return 0;
+}
+
 static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 		     phys_addr_t phys, pgprot_t prot)
 {
@@ -251,12 +319,21 @@ static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 {
 	unsigned long next;
 	int ret = 0;
+	bool split = flags & SPLIT_MAPPINGS;
 
 	do {
 		pmd_t old_pmd = READ_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
 
+		if (split) {
+			ret = split_pmd(pmdp, old_pmd, pgtable_alloc);
+			if (ret)
+				break;
+
+			continue;
+		}
+
 		/* try section mapping first */
 		if (((addr | next | phys) & ~PMD_MASK) == 0 &&
 		    (flags & NO_BLOCK_MAPPINGS) == 0) {
@@ -292,11 +369,19 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 	int ret = 0;
 	pud_t pud = READ_ONCE(*pudp);
 	pmd_t *pmdp;
+	bool split = flags & SPLIT_MAPPINGS;
 
 	/*
 	 * Check for initial section mappings in the pgd/pud.
 	 */
 	BUG_ON(pud_sect(pud));
+
+	if (split) {
+		BUG_ON(pud_none(pud));
+		pmdp = pmd_offset(pudp, addr);
+		goto split_pgtable;
+	}
+
 	if (pud_none(pud)) {
 		pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
 		phys_addr_t pmd_phys;
@@ -316,6 +401,7 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		pmdp = pmd_set_fixmap_offset(pudp, addr);
 	}
 
+split_pgtable:
 	do {
 		pgprot_t __prot = prot;
 
@@ -334,7 +420,8 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		phys += next - addr;
 	} while (addr = next, addr != end);
 
-	pmd_clear_fixmap();
+	if (!split)
+		pmd_clear_fixmap();
 
 	return ret;
 }
@@ -348,6 +435,13 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 	int ret = 0;
 	p4d_t p4d = READ_ONCE(*p4dp);
 	pud_t *pudp;
+	bool split = flags & SPLIT_MAPPINGS;
+
+	if (split) {
+		BUG_ON(p4d_none(p4d));
+		pudp = pud_offset(p4dp, addr);
+		goto split_pgtable;
+	}
 
 	if (p4d_none(p4d)) {
 		p4dval_t p4dval = P4D_TYPE_TABLE | P4D_TABLE_UXN | P4D_TABLE_AF;
@@ -368,11 +462,25 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 		pudp = pud_set_fixmap_offset(p4dp, addr);
 	}
 
+split_pgtable:
 	do {
 		pud_t old_pud = READ_ONCE(*pudp);
 
 		next = pud_addr_end(addr, end);
 
+		if (split) {
+			ret = split_pud(pudp, old_pud, pgtable_alloc);
+			if (ret)
+				break;
+
+			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
+						  pgtable_alloc, flags);
+			if (ret)
+				break;
+
+			continue;
+		}
+
 		/*
 		 * For 4K granule only, attempt to put down a 1GB block
 		 */
@@ -399,7 +507,8 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 		phys += next - addr;
 	} while (pudp++, addr = next, addr != end);
 
-	pud_clear_fixmap();
+	if (!split)
+		pud_clear_fixmap();
 
 	return ret;
 }
@@ -413,6 +522,13 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 	int ret = 0;
 	pgd_t pgd = READ_ONCE(*pgdp);
 	p4d_t *p4dp;
+	bool split = flags & SPLIT_MAPPINGS;
+
+	if (split) {
+		BUG_ON(pgd_none(pgd));
+		p4dp = p4d_offset(pgdp, addr);
+		goto split_pgtable;
+	}
 
 	if (pgd_none(pgd)) {
 		pgdval_t pgdval = PGD_TYPE_TABLE | PGD_TABLE_UXN | PGD_TABLE_AF;
@@ -433,6 +549,7 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 		p4dp = p4d_set_fixmap_offset(pgdp, addr);
 	}
 
+split_pgtable:
 	do {
 		p4d_t old_p4d = READ_ONCE(*p4dp);
 
@@ -449,7 +566,8 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 		phys += next - addr;
 	} while (p4dp++, addr = next, addr != end);
 
-	p4d_clear_fixmap();
+	if (!split)
+		p4d_clear_fixmap();
 
 	return ret;
 }
@@ -546,6 +664,23 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 	return pa;
 }
 
+int split_linear_mapping(unsigned long start, unsigned long end)
+{
+	int ret = 0;
+
+	if (!system_supports_bbml2_noabort())
+		return 0;
+
+	mmap_write_lock(&init_mm);
+	ret = __create_pgd_mapping_locked(init_mm.pgd, virt_to_phys((void *)start),
+					  start, (end - start), __pgprot(0),
+					  __pgd_pgtable_alloc, SPLIT_MAPPINGS);
+	mmap_write_unlock(&init_mm);
+	flush_tlb_kernel_range(start, end);
+
+	return ret;
+}
+
 /*
  * This function can only be used to modify existing table entries,
  * without allocating new levels of table. Note that this permits the
@@ -665,6 +800,24 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
 
 #endif /* CONFIG_KFENCE */
 
+static inline bool force_pte_mapping(void)
+{
+	/*
+	 * Can't use the cpufeature API to determine whether BBML2 is
+	 * supported or not since cpufeatures have not been finalized yet.
+	 *
+	 * Checking the boot CPU only for now.  If the boot CPU has
+	 * BBML2, paint linear mapping with block mapping.  If it turns
+	 * out the secondary CPUs don't support BBML2 once cpufeature is
+	 * finalized, the linear mapping will be repainted with PTE
+	 * mapping.
+	 */
+	return (rodata_full && !bbml2_noabort_available()) ||
+		debug_pagealloc_enabled() ||
+		arm64_kfence_can_set_direct_map() ||
+		is_realm_world();
+}
+
 static void __init map_mem(pgd_t *pgdp)
 {
 	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
@@ -690,9 +843,12 @@ static void __init map_mem(pgd_t *pgdp)
 
 	early_kfence_pool = arm64_kfence_alloc_pool();
 
-	if (can_set_direct_map())
+	if (force_pte_mapping())
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
+	if (rodata_full)
+		flags |= NO_CONT_MAPPINGS;
+
 	/*
 	 * Take care not to create a writable alias for the
 	 * read-only text and rodata sections of the kernel image.
@@ -1388,9 +1544,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
 
 	VM_BUG_ON(!mhp_range_allowed(start, size, true));
 
-	if (can_set_direct_map())
+	if (force_pte_mapping())
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
+	if (rodata_full)
+		flags |= NO_CONT_MAPPINGS;
+
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, params->pgprot, __pgd_pgtable_alloc,
 			     flags);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 39fd1f7ff02a..5d42d87ea7e1 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -10,6 +10,7 @@
 #include <linux/vmalloc.h>
 
 #include <asm/cacheflush.h>
+#include <asm/mmu.h>
 #include <asm/pgtable-prot.h>
 #include <asm/set_memory.h>
 #include <asm/tlbflush.h>
@@ -80,8 +81,9 @@ static int change_memory_common(unsigned long addr, int numpages,
 	unsigned long start = addr;
 	unsigned long size = PAGE_SIZE * numpages;
 	unsigned long end = start + size;
+	unsigned long l_start;
 	struct vm_struct *area;
-	int i;
+	int i, ret;
 
 	if (!PAGE_ALIGNED(addr)) {
 		start &= PAGE_MASK;
@@ -118,7 +120,12 @@ static int change_memory_common(unsigned long addr, int numpages,
 	if (rodata_full && (pgprot_val(set_mask) == PTE_RDONLY ||
 			    pgprot_val(clear_mask) == PTE_RDONLY)) {
 		for (i = 0; i < area->nr_pages; i++) {
-			__change_memory_common((u64)page_address(area->pages[i]),
+			l_start = (u64)page_address(area->pages[i]);
+			ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
+			if (WARN_ON_ONCE(ret))
+				return ret;
+
+			__change_memory_common(l_start,
 					       PAGE_SIZE, set_mask, clear_mask);
 		}
 	}
@@ -174,6 +181,9 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
 
 int set_direct_map_invalid_noflush(struct page *page)
 {
+	unsigned long l_start;
+	int ret;
+
 	struct page_change_data data = {
 		.set_mask = __pgprot(0),
 		.clear_mask = __pgprot(PTE_VALID),
@@ -182,13 +192,21 @@ int set_direct_map_invalid_noflush(struct page *page)
 	if (!can_set_direct_map())
 		return 0;
 
+	l_start = (unsigned long)page_address(page);
+	ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
 	return apply_to_page_range(&init_mm,
-				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+				   l_start, PAGE_SIZE, change_page_range,
+				   &data);
 }
 
 int set_direct_map_default_noflush(struct page *page)
 {
+	unsigned long l_start;
+	int ret;
+
 	struct page_change_data data = {
 		.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
 		.clear_mask = __pgprot(PTE_RDONLY),
@@ -197,9 +215,14 @@ int set_direct_map_default_noflush(struct page *page)
 	if (!can_set_direct_map())
 		return 0;
 
+	l_start = (unsigned long)page_address(page);
+	ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
 	return apply_to_page_range(&init_mm,
-				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+				   l_start, PAGE_SIZE, change_page_range,
+				   &data);
 }
 
 static int __set_memory_enc_dec(unsigned long addr,
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [v3 PATCH 5/6] arm64: mm: support split CONT mappings
  2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
                   ` (3 preceding siblings ...)
  2025-03-04 22:19 ` [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full Yang Shi
@ 2025-03-04 22:19 ` Yang Shi
  2025-03-14 13:33   ` Ryan Roberts
  2025-03-04 22:19 ` [v3 PATCH 6/6] arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs Yang Shi
  2025-03-13 17:28 ` [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
  6 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-03-04 22:19 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

Add support for splitting CONT mappings in order to allow CONT mappings
for the direct map.  This should help reduce TLB pressure further.

When splitting a PUD, all resulting PMDs will have the CONT bit set since
the leaf PUD must be naturally aligned.  When splitting a PMD, all
resulting PTEs will have the CONT bit set since the leaf PMD must be
naturally aligned too, but the PMDs in the CONT range of the split PMD
will have their CONT bit cleared.  CONT PTEs are split by clearing the
CONT bit for all PTEs in the range.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/pgtable.h |  5 ++
 arch/arm64/mm/mmu.c              | 82 ++++++++++++++++++++++++++------
 arch/arm64/mm/pageattr.c         |  2 +
 3 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ed2fc1dcf7ae..3c6ef47f5813 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -290,6 +290,11 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
 	return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
 }
 
+static inline pmd_t pmd_mknoncont(pmd_t pmd)
+{
+	return __pmd(pmd_val(pmd) & ~PMD_SECT_CONT);
+}
+
 static inline pte_t pte_mkdevmap(pte_t pte)
 {
 	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ad0f1cc55e3a..d4dfeabc80e9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -167,19 +167,36 @@ static void init_clear_pgtable(void *table)
 	dsb(ishst);
 }
 
+static void split_cont_pte(pte_t *ptep)
+{
+	pte_t *_ptep = PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
+	pte_t _pte;
+	for (int i = 0; i < CONT_PTES; i++, _ptep++) {
+		_pte = READ_ONCE(*_ptep);
+		_pte = pte_mknoncont(_pte);
+		__set_pte_nosync(_ptep, _pte);
+	}
+
+	dsb(ishst);
+	isb();
+}
+
 static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
-		     phys_addr_t (*pgtable_alloc)(int))
+		     phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long pfn;
 	pgprot_t prot;
 	phys_addr_t pte_phys;
 	pte_t *ptep;
+	bool cont;
+	int i;
 
 	if (!pmd_leaf(pmdval))
 		return 0;
 
 	pfn = pmd_pfn(pmdval);
 	prot = pmd_pgprot(pmdval);
+	cont = pgprot_val(prot) & PTE_CONT;
 
 	pte_phys = pgtable_alloc(PAGE_SHIFT);
 	if (!pte_phys)
@@ -188,11 +205,27 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
 	ptep = (pte_t *)phys_to_virt(pte_phys);
 	init_clear_pgtable(ptep);
 	prot = __pgprot(pgprot_val(prot) | PTE_TYPE_PAGE);
-	for (int i = 0; i < PTRS_PER_PTE; i++, ptep++)
+
+	/* It must be naturally aligned if PMD is leaf */
+	if ((flags & NO_CONT_MAPPINGS) == 0)
+		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
+
+	for (i = 0; i < PTRS_PER_PTE; i++, ptep++)
 		__set_pte_nosync(ptep, pfn_pte(pfn + i, prot));
 
 	dsb(ishst);
 
+	/* Clear CONT bit for the PMDs in the range */
+	if (cont) {
+		pmd_t *_pmdp, _pmd;
+		_pmdp = PTR_ALIGN_DOWN(pmdp, sizeof(*pmdp) * CONT_PMDS);
+		for (i = 0; i < CONT_PMDS; i++, _pmdp++) {
+			_pmd = READ_ONCE(*_pmdp);
+			_pmd = pmd_mknoncont(_pmd);
+			set_pmd(_pmdp, _pmd);
+		}
+	}
+
 	set_pmd(pmdp, pfn_pmd(__phys_to_pfn(pte_phys),
 		__pgprot(PMD_TYPE_TABLE)));
 
@@ -200,7 +233,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
 }
 
 static int split_pud(pud_t *pudp, pud_t pudval,
-		     phys_addr_t (*pgtable_alloc)(int))
+		     phys_addr_t (*pgtable_alloc)(int), int flags)
 {
 	unsigned long pfn;
 	pgprot_t prot;
@@ -221,6 +254,11 @@ static int split_pud(pud_t *pudp, pud_t pudval,
 
 	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
 	init_clear_pgtable(pmdp);
+
+	/* It must be naturally aligned if PUD is leaf */
+	if ((flags & NO_CONT_MAPPINGS) == 0)
+		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
+
 	for (int i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
 		__set_pmd_nosync(pmdp, pfn_pmd(pfn, prot));
 		pfn += step;
@@ -235,11 +273,18 @@ static int split_pud(pud_t *pudp, pud_t pudval,
 }
 
 static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
-		     phys_addr_t phys, pgprot_t prot)
+		     phys_addr_t phys, pgprot_t prot, int flags)
 {
 	do {
 		pte_t old_pte = __ptep_get(ptep);
 
+		if (flags & SPLIT_MAPPINGS) {
+			if (pte_cont(old_pte))
+				split_cont_pte(ptep);
+
+			continue;
+		}
+
 		/*
 		 * Required barriers to make this visible to the table walker
 		 * are deferred to the end of alloc_init_cont_pte().
@@ -266,8 +311,16 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	unsigned long next;
 	pmd_t pmd = READ_ONCE(*pmdp);
 	pte_t *ptep;
+	bool split = flags & SPLIT_MAPPINGS;
 
 	BUG_ON(pmd_sect(pmd));
+
+	if (split) {
+		BUG_ON(pmd_none(pmd));
+		ptep = pte_offset_kernel(pmdp, addr);
+		goto split_pgtable;
+	}
+
 	if (pmd_none(pmd)) {
 		pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
 		phys_addr_t pte_phys;
@@ -287,6 +340,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		ptep = pte_set_fixmap_offset(pmdp, addr);
 	}
 
+split_pgtable:
 	do {
 		pgprot_t __prot = prot;
 
@@ -297,7 +351,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pte(ptep, addr, next, phys, __prot);
+		init_pte(ptep, addr, next, phys, __prot, flags);
 
 		ptep += pte_index(next) - pte_index(addr);
 		phys += next - addr;
@@ -308,7 +362,8 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 	 * ensure that all previous pgtable writes are visible to the table
 	 * walker.
 	 */
-	pte_clear_fixmap();
+	if (!split)
+		pte_clear_fixmap();
 
 	return 0;
 }
@@ -327,7 +382,12 @@ static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		next = pmd_addr_end(addr, end);
 
 		if (split) {
-			ret = split_pmd(pmdp, old_pmd, pgtable_alloc);
+			ret = split_pmd(pmdp, old_pmd, pgtable_alloc, flags);
+			if (ret)
+				break;
+
+			ret = alloc_init_cont_pte(pmdp, addr, next, phys, prot,
+						  pgtable_alloc, flags);
 			if (ret)
 				break;
 
@@ -469,7 +529,7 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 		next = pud_addr_end(addr, end);
 
 		if (split) {
-			ret = split_pud(pudp, old_pud, pgtable_alloc);
+			ret = split_pud(pudp, old_pud, pgtable_alloc, flags);
 			if (ret)
 				break;
 
@@ -846,9 +906,6 @@ static void __init map_mem(pgd_t *pgdp)
 	if (force_pte_mapping())
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
-	if (rodata_full)
-		flags |= NO_CONT_MAPPINGS;
-
 	/*
 	 * Take care not to create a writable alias for the
 	 * read-only text and rodata sections of the kernel image.
@@ -1547,9 +1604,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	if (force_pte_mapping())
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
-	if (rodata_full)
-		flags |= NO_CONT_MAPPINGS;
-
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
 			     size, params->pgprot, __pgd_pgtable_alloc,
 			     flags);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 5d42d87ea7e1..25c068712cb5 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -43,6 +43,8 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 	struct page_change_data *cdata = data;
 	pte_t pte = __ptep_get(ptep);
 
+	BUG_ON(pte_cont(pte));
+
 	pte = clear_pte_bit(pte, cdata->clear_mask);
 	pte = set_pte_bit(pte, cdata->set_mask);
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [v3 PATCH 6/6] arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs
  2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
                   ` (4 preceding siblings ...)
  2025-03-04 22:19 ` [v3 PATCH 5/6] arm64: mm: support split CONT mappings Yang Shi
@ 2025-03-04 22:19 ` Yang Shi
  2025-03-13 17:28 ` [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
  6 siblings, 0 replies; 49+ messages in thread
From: Yang Shi @ 2025-03-04 22:19 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

The kernel linear mapping is painted at a very early stage of system boot,
before the cpufeatures have been finalized.  So the linear mapping is
determined by the capability of the boot CPU: if the boot CPU supports
BBML2, large block mappings will be used for the linear mapping.

But the secondary CPUs may not support BBML2, so once the cpufeatures are
finalized on all CPUs, repaint the linear mapping if large block mappings
were used and the secondary CPUs don't support BBML2.

If the boot CPU doesn't support BBML2, or the secondary CPUs have the
same BBML2 capability as the boot CPU, repainting the linear mapping
is not needed.

Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 arch/arm64/include/asm/mmu.h   |  3 +++
 arch/arm64/kernel/cpufeature.c | 24 +++++++++++++++++++
 arch/arm64/mm/mmu.c            | 43 +++++++++++++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index d658a33df266..181649424317 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -56,6 +56,8 @@ typedef struct {
  */
 #define ASID(mm)	(atomic64_read(&(mm)->context.id) & 0xffff)
 
+extern bool block_mapping;
+
 static inline bool arm64_kernel_unmapped_at_el0(void)
 {
 	return alternative_has_cap_unlikely(ARM64_UNMAP_KERNEL_AT_EL0);
@@ -72,6 +74,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
 extern void mark_linear_text_alias_ro(void);
 extern int split_linear_mapping(unsigned long start, unsigned long end);
+extern int __repaint_linear_mappings(void *__unused);
 
 /*
  * This check is triggered during the early boot before the cpufeature
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index d39637d5aeab..ffb797bc2dba 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -85,6 +85,7 @@
 #include <asm/insn.h>
 #include <asm/kvm_host.h>
 #include <asm/mmu_context.h>
+#include <asm/mmu.h>
 #include <asm/mte.h>
 #include <asm/processor.h>
 #include <asm/smp.h>
@@ -1972,6 +1973,28 @@ static int __init __kpti_install_ng_mappings(void *__unused)
 	return 0;
 }
 
+static void __init repaint_linear_mappings(void)
+{
+	struct cpumask bbml2_cpus;
+
+	if (!block_mapping)
+		return;
+
+	if (!rodata_full)
+		return;
+
+	if (system_supports_bbml2_noabort())
+		return;
+
+	/*
+	 * Need to guarantee repainting linear mapping is called on the
+	 * boot CPU since boot CPU supports BBML2.
+	 */
+	cpumask_clear(&bbml2_cpus);
+	cpumask_set_cpu(smp_processor_id(), &bbml2_cpus);
+	stop_machine(__repaint_linear_mappings, NULL, &bbml2_cpus);
+}
+
 static void __init kpti_install_ng_mappings(void)
 {
 	/* Check whether KPTI is going to be used */
@@ -3814,6 +3837,7 @@ void __init setup_system_features(void)
 {
 	setup_system_capabilities();
 
+	repaint_linear_mappings();
 	kpti_install_ng_mappings();
 
 	sve_setup();
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d4dfeabc80e9..015b30567ad1 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -209,6 +209,8 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
 	/* It must be naturally aligned if PMD is leaf */
 	if ((flags & NO_CONT_MAPPINGS) == 0)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
+	else
+		prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 
 	for (i = 0; i < PTRS_PER_PTE; i++, ptep++)
 		__set_pte_nosync(ptep, pfn_pte(pfn + i, prot));
@@ -258,6 +260,8 @@ static int split_pud(pud_t *pudp, pud_t pudval,
 	/* It must be naturally aligned if PUD is leaf */
 	if ((flags & NO_CONT_MAPPINGS) == 0)
 		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
+	else
+		prot = __pgprot(pgprot_val(prot) & ~PTE_CONT);
 
 	for (int i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
 		__set_pmd_nosync(pmdp, pfn_pmd(pfn, prot));
@@ -806,6 +810,37 @@ void __init mark_linear_text_alias_ro(void)
 			    PAGE_KERNEL_RO);
 }
 
+int __init __repaint_linear_mappings(void *__unused)
+{
+	phys_addr_t kernel_start = __pa_symbol(_stext);
+	phys_addr_t kernel_end = __pa_symbol(__init_begin);
+	phys_addr_t start, end;
+	unsigned long vstart, vend;
+	u64 i;
+	int ret;
+
+	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
+	/* Split the whole linear mapping */
+	for_each_mem_range(i, &start, &end) {
+		if (start >= end)
+			return -EINVAL;
+
+		vstart = __phys_to_virt(start);
+		vend = __phys_to_virt(end);
+		ret = __create_pgd_mapping_locked(init_mm.pgd, start,
+					vstart, (end - start), __pgprot(0),
+					__pgd_pgtable_alloc,
+					NO_CONT_MAPPINGS | SPLIT_MAPPINGS);
+		if (ret)
+			panic("Failed to split linear mappings\n");
+
+		flush_tlb_kernel_range(vstart, vend);
+	}
+	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
+
+	return 0;
+}
+
 #ifdef CONFIG_KFENCE
 
 bool __ro_after_init kfence_early_init = !!CONFIG_KFENCE_SAMPLE_INTERVAL;
@@ -860,6 +895,8 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
 
 #endif /* CONFIG_KFENCE */
 
+bool block_mapping;
+
 static inline bool force_pte_mapping(void)
 {
 	/*
@@ -888,6 +925,8 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = NO_EXEC_MAPPINGS;
 	u64 i;
 
+	block_mapping = true;
+
 	/*
 	 * Setting hierarchical PXNTable attributes on table entries covering
 	 * the linear region is only possible if it is guaranteed that no table
@@ -903,8 +942,10 @@ static void __init map_mem(pgd_t *pgdp)
 
 	early_kfence_pool = arm64_kfence_alloc_pool();
 
-	if (force_pte_mapping())
+	if (force_pte_mapping()) {
+		block_mapping = false;
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+	}
 
 	/*
 	 * Take care not to create a writable alias for the
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full
  2025-03-04 22:19 ` [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full Yang Shi
@ 2025-03-08  1:53   ` kernel test robot
  2025-03-14 13:29   ` Ryan Roberts
  1 sibling, 0 replies; 49+ messages in thread
From: kernel test robot @ 2025-03-08  1:53 UTC (permalink / raw)
  To: Yang Shi, ryan.roberts, will, catalin.marinas, Miko.Lenczewski,
	scott, cl
  Cc: oe-kbuild-all, linux-arm-kernel, linux-kernel

Hi Yang,

kernel test robot noticed the following build warnings:

[auto build test WARNING on arm64/for-next/core]
[also build test WARNING on linus/master v6.14-rc5 next-20250307]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yang-Shi/arm64-Add-BBM-Level-2-cpu-feature/20250305-062252
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link:    https://lore.kernel.org/r/20250304222018.615808-5-yang%40os.amperecomputing.com
patch subject: [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full
config: arm64-randconfig-002-20250308 (https://download.01.org/0day-ci/archive/20250308/202503080930.7ZetfmFz-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250308/202503080930.7ZetfmFz-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503080930.7ZetfmFz-lkp@intel.com/

All warnings (new ones prefixed by >>):

   arch/arm64/mm/mmu.c: In function 'alloc_init_pud':
>> arch/arm64/mm/mmu.c:511:35: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
     511 |                 pud_clear_fixmap();
         |                                   ^
   arch/arm64/mm/mmu.c: In function 'alloc_init_p4d':
   arch/arm64/mm/mmu.c:570:35: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
     570 |                 p4d_clear_fixmap();
         |                                   ^
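
The warning is presumably because pud_clear_fixmap()/p4d_clear_fixmap() expand
to nothing with the folded page table levels in this config, leaving the new
"if (!split)" with an empty body. Bracing the body would be enough to keep
-Wempty-body quiet, e.g. (sketch):

	if (!split) {
		pud_clear_fixmap();
	}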


vim +/if +511 arch/arm64/mm/mmu.c

d27cfa1fc823d3 Ard Biesheuvel    2017-03-09  428  
2451145c9a60e0 Yang Shi          2025-03-04  429  static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
da141706aea52c Laura Abbott      2015-01-21  430  			  phys_addr_t phys, pgprot_t prot,
90292aca9854a2 Yu Zhao           2019-03-11  431  			  phys_addr_t (*pgtable_alloc)(int),
c0951366d4b7e0 Ard Biesheuvel    2017-03-09  432  			  int flags)
c1cc1552616d0f Catalin Marinas   2012-03-05  433  {
c1cc1552616d0f Catalin Marinas   2012-03-05  434  	unsigned long next;
2451145c9a60e0 Yang Shi          2025-03-04  435  	int ret = 0;
e9f6376858b979 Mike Rapoport     2020-06-04  436  	p4d_t p4d = READ_ONCE(*p4dp);
6ed8a3a094b43a Ard Biesheuvel    2024-02-14  437  	pud_t *pudp;
6fad683b9a5c21 Yang Shi          2025-03-04  438  	bool split = flags & SPLIT_MAPPINGS;
6fad683b9a5c21 Yang Shi          2025-03-04  439  
6fad683b9a5c21 Yang Shi          2025-03-04  440  	if (split) {
6fad683b9a5c21 Yang Shi          2025-03-04  441  		BUG_ON(p4d_none(p4d));
6fad683b9a5c21 Yang Shi          2025-03-04  442  		pudp = pud_offset(p4dp, addr);
6fad683b9a5c21 Yang Shi          2025-03-04  443  		goto split_pgtable;
6fad683b9a5c21 Yang Shi          2025-03-04  444  	}
c1cc1552616d0f Catalin Marinas   2012-03-05  445  
e9f6376858b979 Mike Rapoport     2020-06-04  446  	if (p4d_none(p4d)) {
efe72541355d4d Yicong Yang       2024-11-02  447  		p4dval_t p4dval = P4D_TYPE_TABLE | P4D_TABLE_UXN | P4D_TABLE_AF;
132233a759580f Laura Abbott      2016-02-05  448  		phys_addr_t pud_phys;
87143f404f338d Ard Biesheuvel    2021-03-10  449  
87143f404f338d Ard Biesheuvel    2021-03-10  450  		if (flags & NO_EXEC_MAPPINGS)
87143f404f338d Ard Biesheuvel    2021-03-10  451  			p4dval |= P4D_TABLE_PXN;
132233a759580f Laura Abbott      2016-02-05  452  		BUG_ON(!pgtable_alloc);
90292aca9854a2 Yu Zhao           2019-03-11  453  		pud_phys = pgtable_alloc(PUD_SHIFT);
2451145c9a60e0 Yang Shi          2025-03-04  454  		if (!pud_phys)
2451145c9a60e0 Yang Shi          2025-03-04  455  			return -ENOMEM;
0e9df1c905d829 Ryan Roberts      2024-04-12  456  		pudp = pud_set_fixmap(pud_phys);
0e9df1c905d829 Ryan Roberts      2024-04-12  457  		init_clear_pgtable(pudp);
0e9df1c905d829 Ryan Roberts      2024-04-12  458  		pudp += pud_index(addr);
87143f404f338d Ard Biesheuvel    2021-03-10  459  		__p4d_populate(p4dp, pud_phys, p4dval);
0e9df1c905d829 Ryan Roberts      2024-04-12  460  	} else {
e9f6376858b979 Mike Rapoport     2020-06-04  461  		BUG_ON(p4d_bad(p4d));
e9f6376858b979 Mike Rapoport     2020-06-04  462  		pudp = pud_set_fixmap_offset(p4dp, addr);
0e9df1c905d829 Ryan Roberts      2024-04-12  463  	}
0e9df1c905d829 Ryan Roberts      2024-04-12  464  
6fad683b9a5c21 Yang Shi          2025-03-04  465  split_pgtable:
c1cc1552616d0f Catalin Marinas   2012-03-05  466  	do {
20a004e7b017cc Will Deacon       2018-02-15  467  		pud_t old_pud = READ_ONCE(*pudp);
e98216b52176ba Ard Biesheuvel    2016-10-21  468  
c1cc1552616d0f Catalin Marinas   2012-03-05  469  		next = pud_addr_end(addr, end);
206a2a73a62d37 Steve Capper      2014-05-06  470  
6fad683b9a5c21 Yang Shi          2025-03-04  471  		if (split) {
6fad683b9a5c21 Yang Shi          2025-03-04  472  			ret = split_pud(pudp, old_pud, pgtable_alloc);
6fad683b9a5c21 Yang Shi          2025-03-04  473  			if (ret)
6fad683b9a5c21 Yang Shi          2025-03-04  474  				break;
6fad683b9a5c21 Yang Shi          2025-03-04  475  
6fad683b9a5c21 Yang Shi          2025-03-04  476  			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
6fad683b9a5c21 Yang Shi          2025-03-04  477  						  pgtable_alloc, flags);
6fad683b9a5c21 Yang Shi          2025-03-04  478  			if (ret)
6fad683b9a5c21 Yang Shi          2025-03-04  479  				break;
6fad683b9a5c21 Yang Shi          2025-03-04  480  
6fad683b9a5c21 Yang Shi          2025-03-04  481  			continue;
6fad683b9a5c21 Yang Shi          2025-03-04  482  		}
6fad683b9a5c21 Yang Shi          2025-03-04  483  
206a2a73a62d37 Steve Capper      2014-05-06  484  		/*
206a2a73a62d37 Steve Capper      2014-05-06  485  		 * For 4K granule only, attempt to put down a 1GB block
206a2a73a62d37 Steve Capper      2014-05-06  486  		 */
1310222c276b79 Anshuman Khandual 2022-02-16  487  		if (pud_sect_supported() &&
1310222c276b79 Anshuman Khandual 2022-02-16  488  		   ((addr | next | phys) & ~PUD_MASK) == 0 &&
c0951366d4b7e0 Ard Biesheuvel    2017-03-09  489  		    (flags & NO_BLOCK_MAPPINGS) == 0) {
20a004e7b017cc Will Deacon       2018-02-15  490  			pud_set_huge(pudp, phys, prot);
206a2a73a62d37 Steve Capper      2014-05-06  491  
206a2a73a62d37 Steve Capper      2014-05-06  492  			/*
e98216b52176ba Ard Biesheuvel    2016-10-21  493  			 * After the PUD entry has been populated once, we
e98216b52176ba Ard Biesheuvel    2016-10-21  494  			 * only allow updates to the permission attributes.
206a2a73a62d37 Steve Capper      2014-05-06  495  			 */
e98216b52176ba Ard Biesheuvel    2016-10-21  496  			BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
20a004e7b017cc Will Deacon       2018-02-15  497  						      READ_ONCE(pud_val(*pudp))));
206a2a73a62d37 Steve Capper      2014-05-06  498  		} else {
2451145c9a60e0 Yang Shi          2025-03-04  499  			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
c0951366d4b7e0 Ard Biesheuvel    2017-03-09  500  					    pgtable_alloc, flags);
2451145c9a60e0 Yang Shi          2025-03-04  501  			if (ret)
2451145c9a60e0 Yang Shi          2025-03-04  502  				break;
e98216b52176ba Ard Biesheuvel    2016-10-21  503  
e98216b52176ba Ard Biesheuvel    2016-10-21  504  			BUG_ON(pud_val(old_pud) != 0 &&
20a004e7b017cc Will Deacon       2018-02-15  505  			       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
206a2a73a62d37 Steve Capper      2014-05-06  506  		}
c1cc1552616d0f Catalin Marinas   2012-03-05  507  		phys += next - addr;
20a004e7b017cc Will Deacon       2018-02-15  508  	} while (pudp++, addr = next, addr != end);
f4710445458c0a Mark Rutland      2016-01-25  509  
6fad683b9a5c21 Yang Shi          2025-03-04  510  	if (!split)
f4710445458c0a Mark Rutland      2016-01-25 @511  		pud_clear_fixmap();
2451145c9a60e0 Yang Shi          2025-03-04  512  
2451145c9a60e0 Yang Shi          2025-03-04  513  	return ret;
c1cc1552616d0f Catalin Marinas   2012-03-05  514  }
c1cc1552616d0f Catalin Marinas   2012-03-05  515  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
                   ` (5 preceding siblings ...)
  2025-03-04 22:19 ` [v3 PATCH 6/6] arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs Yang Shi
@ 2025-03-13 17:28 ` Yang Shi
  2025-03-13 17:36   ` Ryan Roberts
  6 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-03-13 17:28 UTC (permalink / raw)
  To: ryan.roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

Hi Ryan,

I saw Miko posted a new spin of his patches. There are some slight
changes that have an impact on my patches (basically checking the new boot
parameter). Do you prefer that I rebase my patches on top of his new spin
right now and restart review from the new spin, or review the current
patches first, then address the new review comments and rebase onto Miko's
new spin together?

Thanks,
Yang


On 3/4/25 2:19 PM, Yang Shi wrote:
> Changelog
> =========
> v3:
>    * Rebased to v6.14-rc4.
>    * Based on Miko's BBML2 cpufeature patch (https://lore.kernel.org/linux-arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>      Also included in this series in order to have the complete patchset.
>    * Enhanced __create_pgd_mapping() to handle split as well per Ryan.
>    * Supported CONT mappings per Ryan.
>    * Supported asymmetric system by splitting kernel linear mapping if such
>      system is detected per Ryan. I don't have such system to test, so the
>      testing is done by hacking kernel to call linear mapping repainting
>      unconditionally. The linear mapping doesn't have any block and cont
>      mappings after booting.
>
> RFC v2:
>    * Used allowlist to advertise BBM lv2 on the CPUs which can handle TLB
>      conflict gracefully per Will Deacon
>    * Rebased onto v6.13-rc5
>    * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-yang@os.amperecomputing.com/
>
> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang@os.amperecomputing.com/
>
> Description
> ===========
> When rodata=full kernel linear mapping is mapped by PTE due to arm's
> break-before-make rule.
>
> A number of performance issues arise when the kernel linear map is using
> PTE entries due to arm's break-before-make rule:
>    - performance degradation
>    - more TLB pressure
>    - memory waste for kernel page table
>
> These issues can be avoided by specifying rodata=on the kernel command
> line but this disables the alias checks on page table permissions and
> therefore compromises security somewhat.
>
> With FEAT_BBM level 2 support it is no longer necessary to invalidate the
> page table entry when changing page sizes.  This allows the kernel to
> split large mappings after boot is complete.
>
> This patch adds support for splitting large mappings when FEAT_BBM level 2
> is available and rodata=full is used. This functionality will be used
> when modifying page permissions for individual page frames.
>
> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
> only.
>
> If the system is asymmetric, the kernel linear mapping may be repainted once
> the BBML2 capability is finalized on all CPUs.  See patch #6 for more details.
>
> We saw significant performance increases in some benchmarks with
> rodata=full without compromising the security features of the kernel.
>
> Testing
> =======
> The test was done on AmpereOne machine (192 cores, 1P) with 256GB memory and
> 4K page size + 48 bit VA.
>
> Function test (4K/16K/64K page size)
>    - Kernel boot.  Kernel needs change kernel linear mapping permission at
>      boot stage, if the patch didn't work, kernel typically didn't boot.
>    - Module stress from stress-ng. Kernel module load change permission for
>      linear mapping.
>    - A test kernel module which allocates 80% of total memory via vmalloc(),
>      then change the vmalloc area permission to RO, this also change linear
>      mapping permission to RO, then change it back before vfree(). Then launch
>      a VM which consumes almost all physical memory.
>    - VM with the patchset applied in guest kernel too.
>    - Kernel build in VM with guest kernel which has this series applied.
>    - rodata=on. Make sure other rodata mode is not broken.
>    - Boot on the machine which doesn't support BBML2.
>
> Performance
> ===========
> Memory consumption
> Before:
> MemTotal:       258988984 kB
> MemFree:        254821700 kB
>
> After:
> MemTotal:       259505132 kB
> MemFree:        255410264 kB
>
> Around 500MB more memory are free to use.  The larger the machine, the
> more memory saved.
>
> Performance benchmarking
> * Memcached
> We saw performance degradation when running Memcached benchmark with
> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
> With this patchset we saw ops/sec is increased by around 3.5%, P99
> latency is reduced by around 9.6%.
> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
> MPKI is reduced by 28.5%.
>
> The benchmark data is now on par with rodata=on too.
>
> * Disk encryption (dm-crypt) benchmark
> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
> encryption (by dm-crypt).
> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>      --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>      --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>      --name=iops-test-job --eta-newline=1 --size 100G
>
> The IOPS is increased by 90% - 150% (the variance is high, but the worst
> number of good case is around 90% more than the best number of bad case).
> The bandwidth is increased and the avg clat is reduced proportionally.
>
> * Sequential file read
> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
> The bandwidth is increased by 150%.
>
>
> Mikołaj Lenczewski (1):
>        arm64: Add BBM Level 2 cpu feature
>
> Yang Shi (5):
>        arm64: cpufeature: add AmpereOne to BBML2 allow list
>        arm64: mm: make __create_pgd_mapping() and helpers non-void
>        arm64: mm: support large block mapping when rodata=full
>        arm64: mm: support split CONT mappings
>        arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs
>
>   arch/arm64/Kconfig                  |  11 +++++
>   arch/arm64/include/asm/cpucaps.h    |   2 +
>   arch/arm64/include/asm/cpufeature.h |  15 ++++++
>   arch/arm64/include/asm/mmu.h        |   4 ++
>   arch/arm64/include/asm/pgtable.h    |  12 ++++-
>   arch/arm64/kernel/cpufeature.c      |  95 +++++++++++++++++++++++++++++++++++++
>   arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>   arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>   arch/arm64/tools/cpucaps            |   1 +
>   9 files changed, 518 insertions(+), 56 deletions(-)
>
>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-03-13 17:28 ` [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
@ 2025-03-13 17:36   ` Ryan Roberts
  2025-03-13 17:40     ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-03-13 17:36 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

On 13/03/2025 17:28, Yang Shi wrote:
> Hi Ryan,
> 
> I saw Miko posted a new spin of his patches. There are some slight changes that
> have an impact on my patches (basically checking the new boot parameter). Do you
> prefer that I rebase my patches on top of his new spin right now and restart
> review from the new spin, or review the current patches first, then address the
> new review comments and rebase onto Miko's new spin together?

Hi Yang,

Sorry I haven't got to reviewing this version yet, it's in my queue!

I'm happy to review against v3 as it is. I'm familiar with Miko's series and am
not too bothered about the integration with that; I think it's pretty
straightforward. I'm more interested in how you are handling the splitting, which I
think is the bulk of the effort.

I'm hoping to get to this next week before heading out to LSF/MM the following
week (might I see you there?)

Thanks,
Ryan


> 
> Thanks,
> Yang
> 
> 
> On 3/4/25 2:19 PM, Yang Shi wrote:
>> Changelog
>> =========
>> v3:
>>    * Rebased to v6.14-rc4.
>>    * Based on Miko's BBML2 cpufeature patch (https://lore.kernel.org/linux-
>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>      Also included in this series in order to have the complete patchset.
>>    * Enhanced __create_pgd_mapping() to handle split as well per Ryan.
>>    * Supported CONT mappings per Ryan.
>>    * Supported asymmetric system by splitting kernel linear mapping if such
>>      system is detected per Ryan. I don't have such system to test, so the
>>      testing is done by hacking kernel to call linear mapping repainting
>>      unconditionally. The linear mapping doesn't have any block and cont
>>      mappings after booting.
>>
>> RFC v2:
>>    * Used allowlist to advertise BBM lv2 on the CPUs which can handle TLB
>>      conflict gracefully per Will Deacon
>>    * Rebased onto v6.13-rc5
>>    * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-
>> yang@os.amperecomputing.com/
>>
>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>> yang@os.amperecomputing.com/
>>
>> Description
>> ===========
>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>> break-before-make rule.
>>
>> A number of performance issues arise when the kernel linear map is using
>> PTE entries due to arm's break-before-make rule:
>>    - performance degradation
>>    - more TLB pressure
>>    - memory waste for kernel page table
>>
>> These issues can be avoided by specifying rodata=on the kernel command
>> line but this disables the alias checks on page table permissions and
>> therefore compromises security somewhat.
>>
>> With FEAT_BBM level 2 support it is no longer necessary to invalidate the
>> page table entry when changing page sizes.  This allows the kernel to
>> split large mappings after boot is complete.
>>
>> This patch adds support for splitting large mappings when FEAT_BBM level 2
>> is available and rodata=full is used. This functionality will be used
>> when modifying page permissions for individual page frames.
>>
>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>> only.
>>
>> If the system is asymmetric, the kernel linear mapping may be repainted once
>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more details.
>>
>> We saw significant performance increases in some benchmarks with
>> rodata=full without compromising the security features of the kernel.
>>
>> Testing
>> =======
>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB memory and
>> 4K page size + 48 bit VA.
>>
>> Function test (4K/16K/64K page size)
>>    - Kernel boot.  Kernel needs change kernel linear mapping permission at
>>      boot stage, if the patch didn't work, kernel typically didn't boot.
>>    - Module stress from stress-ng. Kernel module load change permission for
>>      linear mapping.
>>    - A test kernel module which allocates 80% of total memory via vmalloc(),
>>      then change the vmalloc area permission to RO, this also change linear
>>      mapping permission to RO, then change it back before vfree(). Then launch
>>      a VM which consumes almost all physical memory.
>>    - VM with the patchset applied in guest kernel too.
>>    - Kernel build in VM with guest kernel which has this series applied.
>>    - rodata=on. Make sure other rodata mode is not broken.
>>    - Boot on the machine which doesn't support BBML2.
>>
>> Performance
>> ===========
>> Memory consumption
>> Before:
>> MemTotal:       258988984 kB
>> MemFree:        254821700 kB
>>
>> After:
>> MemTotal:       259505132 kB
>> MemFree:        255410264 kB
>>
>> Around 500MB more memory are free to use.  The larger the machine, the
>> more memory saved.
>>
>> Performance benchmarking
>> * Memcached
>> We saw performance degradation when running Memcached benchmark with
>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>> latency is reduced by around 9.6%.
>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>> MPKI is reduced by 28.5%.
>>
>> The benchmark data is now on par with rodata=on too.
>>
>> * Disk encryption (dm-crypt) benchmark
>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
>> encryption (by dm-crypt).
>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>>      --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>>      --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>>      --name=iops-test-job --eta-newline=1 --size 100G
>>
>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>> number of good case is around 90% more than the best number of bad case).
>> The bandwidth is increased and the avg clat is reduced proportionally.
>>
>> * Sequential file read
>> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
>> The bandwidth is increased by 150%.
>>
>>
>> Mikołaj Lenczewski (1):
>>        arm64: Add BBM Level 2 cpu feature
>>
>> Yang Shi (5):
>>        arm64: cpufeature: add AmpereOne to BBML2 allow list
>>        arm64: mm: make __create_pgd_mapping() and helpers non-void
>>        arm64: mm: support large block mapping when rodata=full
>>        arm64: mm: support split CONT mappings
>>        arm64: mm: split linear mapping if BBML2 is not supported on secondary
>> CPUs
>>
>>   arch/arm64/Kconfig                  |  11 +++++
>>   arch/arm64/include/asm/cpucaps.h    |   2 +
>>   arch/arm64/include/asm/cpufeature.h |  15 ++++++
>>   arch/arm64/include/asm/mmu.h        |   4 ++
>>   arch/arm64/include/asm/pgtable.h    |  12 ++++-
>>   arch/arm64/kernel/cpufeature.c      |  95 +++++++++++++++++++++++++++++++++++++
>>   arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> ++++++++++++++++++++++-------------------
>>   arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>>   arch/arm64/tools/cpucaps            |   1 +
>>   9 files changed, 518 insertions(+), 56 deletions(-)
>>
>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-03-13 17:36   ` Ryan Roberts
@ 2025-03-13 17:40     ` Yang Shi
  2025-04-10 22:00       ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-03-13 17:40 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel



On 3/13/25 10:36 AM, Ryan Roberts wrote:
> On 13/03/2025 17:28, Yang Shi wrote:
>> Hi Ryan,
>>
>> I saw Miko posted a new spin of his patches. There are some slight changes that
>> have an impact on my patches (basically checking the new boot parameter). Do you
>> prefer that I rebase my patches on top of his new spin right now and restart
>> review from the new spin, or review the current patches first, then address the
>> new review comments and rebase onto Miko's new spin together?
> Hi Yang,
>
> Sorry I haven't got to reviewing this version yet, it's in my queue!
>
> I'm happy to review against v3 as it is. I'm familiar with Miko's series and am
> not too bothered about the integration with that; I think it's pretty
> straightforward. I'm more interested in how you are handling the splitting, which I
> think is the bulk of the effort.

Yeah, sure, thank you.

>
> I'm hoping to get to this next week before heading out to LSF/MM the following
> week (might I see you there?)

Unfortunately I can't make it this year. Have fun!

Thanks,
Yang

>
> Thanks,
> Ryan
>
>
>> Thanks,
>> Yang
>>
>>
>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>> Changelog
>>> =========
>>> v3:
>>>     * Rebased to v6.14-rc4.
>>>     * Based on Miko's BBML2 cpufeature patch (https://lore.kernel.org/linux-
>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>       Also included in this series in order to have the complete patchset.
>>>     * Enhanced __create_pgd_mapping() to handle split as well per Ryan.
>>>     * Supported CONT mappings per Ryan.
>>>     * Supported asymmetric system by splitting kernel linear mapping if such
>>>       system is detected per Ryan. I don't have such system to test, so the
>>>       testing is done by hacking kernel to call linear mapping repainting
>>>       unconditionally. The linear mapping doesn't have any block and cont
>>>       mappings after booting.
>>>
>>> RFC v2:
>>>     * Used allowlist to advertise BBM lv2 on the CPUs which can handle TLB
>>>       conflict gracefully per Will Deacon
>>>     * Rebased onto v6.13-rc5
>>>     * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-
>>> yang@os.amperecomputing.com/
>>>
>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>> yang@os.amperecomputing.com/
>>>
>>> Description
>>> ===========
>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>> break-before-make rule.
>>>
>>> A number of performance issues arise when the kernel linear map is using
>>> PTE entries due to arm's break-before-make rule:
>>>     - performance degradation
>>>     - more TLB pressure
>>>     - memory waste for kernel page table
>>>
>>> These issues can be avoided by specifying rodata=on the kernel command
>>> line but this disables the alias checks on page table permissions and
>>> therefore compromises security somewhat.
>>>
>>> With FEAT_BBM level 2 support it is no longer necessary to invalidate the
>>> page table entry when changing page sizes.  This allows the kernel to
>>> split large mappings after boot is complete.
>>>
>>> This patch adds support for splitting large mappings when FEAT_BBM level 2
>>> is available and rodata=full is used. This functionality will be used
>>> when modifying page permissions for individual page frames.
>>>
>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>> only.
>>>
>>> If the system is asymmetric, the kernel linear mapping may be repainted once
>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more details.
>>>
>>> We saw significant performance increases in some benchmarks with
>>> rodata=full without compromising the security features of the kernel.
>>>
>>> Testing
>>> =======
>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB memory and
>>> 4K page size + 48 bit VA.
>>>
>>> Function test (4K/16K/64K page size)
>>>     - Kernel boot.  Kernel needs change kernel linear mapping permission at
>>>       boot stage, if the patch didn't work, kernel typically didn't boot.
>>>     - Module stress from stress-ng. Kernel module load change permission for
>>>       linear mapping.
>>>     - A test kernel module which allocates 80% of total memory via vmalloc(),
>>>       then change the vmalloc area permission to RO, this also change linear
>>>       mapping permission to RO, then change it back before vfree(). Then launch
>>>       a VM which consumes almost all physical memory.
>>>     - VM with the patchset applied in guest kernel too.
>>>     - Kernel build in VM with guest kernel which has this series applied.
>>>     - rodata=on. Make sure other rodata mode is not broken.
>>>     - Boot on the machine which doesn't support BBML2.
>>>
>>> Performance
>>> ===========
>>> Memory consumption
>>> Before:
>>> MemTotal:       258988984 kB
>>> MemFree:        254821700 kB
>>>
>>> After:
>>> MemTotal:       259505132 kB
>>> MemFree:        255410264 kB
>>>
>>> Around 500MB more memory are free to use.  The larger the machine, the
>>> more memory saved.
>>>
>>> Performance benchmarking
>>> * Memcached
>>> We saw performance degradation when running Memcached benchmark with
>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>> latency is reduced by around 9.6%.
>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>> MPKI is reduced by 28.5%.
>>>
>>> The benchmark data is now on par with rodata=on too.
>>>
>>> * Disk encryption (dm-crypt) benchmark
>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
>>> encryption (by dm-crypt).
>>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>>>       --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>>>       --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>>>       --name=iops-test-job --eta-newline=1 --size 100G
>>>
>>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>>> number of good case is around 90% more than the best number of bad case).
>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>
>>> * Sequential file read
>>> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
>>> The bandwidth is increased by 150%.
>>>
>>>
>>> Mikołaj Lenczewski (1):
>>>         arm64: Add BBM Level 2 cpu feature
>>>
>>> Yang Shi (5):
>>>         arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>         arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>         arm64: mm: support large block mapping when rodata=full
>>>         arm64: mm: support split CONT mappings
>>>         arm64: mm: split linear mapping if BBML2 is not supported on secondary
>>> CPUs
>>>
>>>    arch/arm64/Kconfig                  |  11 +++++
>>>    arch/arm64/include/asm/cpucaps.h    |   2 +
>>>    arch/arm64/include/asm/cpufeature.h |  15 ++++++
>>>    arch/arm64/include/asm/mmu.h        |   4 ++
>>>    arch/arm64/include/asm/pgtable.h    |  12 ++++-
>>>    arch/arm64/kernel/cpufeature.c      |  95 +++++++++++++++++++++++++++++++++++++
>>>    arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> ++++++++++++++++++++++-------------------
>>>    arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>>>    arch/arm64/tools/cpucaps            |   1 +
>>>    9 files changed, 518 insertions(+), 56 deletions(-)
>>>
>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list
  2025-03-04 22:19 ` [v3 PATCH 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list Yang Shi
@ 2025-03-14 10:58   ` Ryan Roberts
  2025-03-17 17:50     ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-03-14 10:58 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

On 04/03/2025 22:19, Yang Shi wrote:
> AmpereOne supports BBML2 without conflict abort, add to the allow list.
> 
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  arch/arm64/kernel/cpufeature.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index 7934c6dd493e..bf3df8407ca3 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -2192,6 +2192,8 @@ static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>  	static const struct midr_range supports_bbml2_noabort_list[] = {
>  		MIDR_REV_RANGE(MIDR_CORTEX_X4, 0, 3, 0xf),
>  		MIDR_REV_RANGE(MIDR_NEOVERSE_V3, 0, 2, 0xf),
> +		MIDR_ALL_VERSIONS(MIDR_AMPERE1),
> +		MIDR_ALL_VERSIONS(MIDR_AMPERE1A),
>  		{}
>  	};
>  

Miko's series will move back to additionally checking MMFR2.BBM, so you will
need to add an erratum workaround for these CPUs to set MMFR2.BBM=2 in the
per-cpu "sanitised" feature register. See:

https://lore.kernel.org/linux-arm-kernel/86ecyzorb7.wl-maz@kernel.org/
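
Purely for illustration (the capability name below is made up, and how the
cpu_enable hook would force BBM=2 into the sanitised copy of ID_AA64MMFR2_EL1
is left open), the erratum side would follow the usual MIDR-list pattern in
cpu_errata.c:

static const struct midr_range ampere_bbml2_noabort_cpus[] = {
	MIDR_ALL_VERSIONS(MIDR_AMPERE1),
	MIDR_ALL_VERSIONS(MIDR_AMPERE1A),
	{}
};

	/* entry in the arm64_errata[] list */
	{
		.desc = "AmpereOne: BBM level 2 without conflict abort",
		.capability = ARM64_WORKAROUND_AMPERE_BBML2_NOABORT,	/* made up */
		ERRATA_MIDR_RANGE_LIST(ampere_bbml2_noabort_cpus),
	},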

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void
  2025-03-04 22:19 ` [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void Yang Shi
@ 2025-03-14 11:51   ` Ryan Roberts
  2025-03-17 17:53     ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-03-14 11:51 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

On 04/03/2025 22:19, Yang Shi wrote:
> The later patch will enhance __create_pgd_mapping() and related helpers
> to split kernel linear mapping, it requires have return value.  So make
> __create_pgd_mapping() and helpers non-void functions.
> 
> And move the BUG_ON() out of page table alloc helper since failing
> splitting kernel linear mapping is not fatal and can be handled by the
> callers in the later patch.  Have BUG_ON() after
> __create_pgd_mapping_locked() returns to keep the current callers behavior
> intact.
> 
> Suggested-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  arch/arm64/mm/mmu.c | 127 ++++++++++++++++++++++++++++++--------------
>  1 file changed, 86 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index b4df5bc5b1b8..dccf0877285b 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -189,11 +189,11 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>  	} while (ptep++, addr += PAGE_SIZE, addr != end);
>  }
>  
> -static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> -				unsigned long end, phys_addr_t phys,
> -				pgprot_t prot,
> -				phys_addr_t (*pgtable_alloc)(int),
> -				int flags)
> +static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
> +			       unsigned long end, phys_addr_t phys,
> +			       pgprot_t prot,
> +			       phys_addr_t (*pgtable_alloc)(int),
> +			       int flags)
>  {
>  	unsigned long next;
>  	pmd_t pmd = READ_ONCE(*pmdp);
> @@ -208,6 +208,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  			pmdval |= PMD_TABLE_PXN;
>  		BUG_ON(!pgtable_alloc);
>  		pte_phys = pgtable_alloc(PAGE_SHIFT);
> +		if (!pte_phys)
> +			return -ENOMEM;

nit: personally I'd prefer to see a "goto out" and funnel all to a single return
statement. You do that in some functions (via loop break), but it would be
cleaner to be consistent.

If pgtable_alloc() is modified to return int (see my comment at the bottom),
this becomes:

ret = pgtable_alloc(PAGE_SHIFT, &pte_phys);
if (ret)
	goto out;


>  		ptep = pte_set_fixmap(pte_phys);
>  		init_clear_pgtable(ptep);
>  		ptep += pte_index(addr);
> @@ -239,13 +241,16 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  	 * walker.
>  	 */
>  	pte_clear_fixmap();
> +
> +	return 0;
>  }
>  
> -static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
> -		     phys_addr_t phys, pgprot_t prot,
> -		     phys_addr_t (*pgtable_alloc)(int), int flags)
> +static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
> +		    phys_addr_t phys, pgprot_t prot,
> +		    phys_addr_t (*pgtable_alloc)(int), int flags)
>  {
>  	unsigned long next;
> +	int ret = 0;
>  
>  	do {
>  		pmd_t old_pmd = READ_ONCE(*pmdp);
> @@ -264,22 +269,27 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  			BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
>  						      READ_ONCE(pmd_val(*pmdp))));
>  		} else {
> -			alloc_init_cont_pte(pmdp, addr, next, phys, prot,
> +			ret = alloc_init_cont_pte(pmdp, addr, next, phys, prot,
>  					    pgtable_alloc, flags);
> +			if (ret)
> +				break;
>  
>  			BUG_ON(pmd_val(old_pmd) != 0 &&
>  			       pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
>  		}
>  		phys += next - addr;
>  	} while (pmdp++, addr = next, addr != end);
> +
> +	return ret;
>  }
>  
> -static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> -				unsigned long end, phys_addr_t phys,
> -				pgprot_t prot,
> -				phys_addr_t (*pgtable_alloc)(int), int flags)
> +static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
> +			       unsigned long end, phys_addr_t phys,
> +			       pgprot_t prot,
> +			       phys_addr_t (*pgtable_alloc)(int), int flags)
>  {
>  	unsigned long next;
> +	int ret = 0;
>  	pud_t pud = READ_ONCE(*pudp);
>  	pmd_t *pmdp;
>  
> @@ -295,6 +305,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  			pudval |= PUD_TABLE_PXN;
>  		BUG_ON(!pgtable_alloc);
>  		pmd_phys = pgtable_alloc(PMD_SHIFT);
> +		if (!pmd_phys)
> +			return -ENOMEM;
>  		pmdp = pmd_set_fixmap(pmd_phys);
>  		init_clear_pgtable(pmdp);
>  		pmdp += pmd_index(addr);
> @@ -314,21 +326,26 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  		    (flags & NO_CONT_MAPPINGS) == 0)
>  			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>  
> -		init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
> +		ret = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
> +		if (ret)
> +			break;
>  
>  		pmdp += pmd_index(next) - pmd_index(addr);
>  		phys += next - addr;
>  	} while (addr = next, addr != end);
>  
>  	pmd_clear_fixmap();
> +
> +	return ret;
>  }
>  
> -static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
> -			   phys_addr_t phys, pgprot_t prot,
> -			   phys_addr_t (*pgtable_alloc)(int),
> -			   int flags)
> +static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
> +			  phys_addr_t phys, pgprot_t prot,
> +			  phys_addr_t (*pgtable_alloc)(int),
> +			  int flags)
>  {
>  	unsigned long next;
> +	int ret = 0;
>  	p4d_t p4d = READ_ONCE(*p4dp);
>  	pud_t *pudp;
>  
> @@ -340,6 +357,8 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>  			p4dval |= P4D_TABLE_PXN;
>  		BUG_ON(!pgtable_alloc);
>  		pud_phys = pgtable_alloc(PUD_SHIFT);
> +		if (!pud_phys)
> +			return -ENOMEM;
>  		pudp = pud_set_fixmap(pud_phys);
>  		init_clear_pgtable(pudp);
>  		pudp += pud_index(addr);
> @@ -369,8 +388,10 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>  			BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
>  						      READ_ONCE(pud_val(*pudp))));
>  		} else {
> -			alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> +			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>  					    pgtable_alloc, flags);
> +			if (ret)
> +				break;
>  
>  			BUG_ON(pud_val(old_pud) != 0 &&
>  			       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
> @@ -379,14 +400,17 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>  	} while (pudp++, addr = next, addr != end);
>  
>  	pud_clear_fixmap();
> +
> +	return ret;
>  }
>  
> -static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
> -			   phys_addr_t phys, pgprot_t prot,
> -			   phys_addr_t (*pgtable_alloc)(int),
> -			   int flags)
> +static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
> +			  phys_addr_t phys, pgprot_t prot,
> +			  phys_addr_t (*pgtable_alloc)(int),
> +			  int flags)
>  {
>  	unsigned long next;
> +	int ret = 0;
>  	pgd_t pgd = READ_ONCE(*pgdp);
>  	p4d_t *p4dp;
>  
> @@ -398,6 +422,8 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>  			pgdval |= PGD_TABLE_PXN;
>  		BUG_ON(!pgtable_alloc);
>  		p4d_phys = pgtable_alloc(P4D_SHIFT);
> +		if (!p4d_phys)
> +			return -ENOMEM;
>  		p4dp = p4d_set_fixmap(p4d_phys);
>  		init_clear_pgtable(p4dp);
>  		p4dp += p4d_index(addr);
> @@ -412,8 +438,10 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>  
>  		next = p4d_addr_end(addr, end);
>  
> -		alloc_init_pud(p4dp, addr, next, phys, prot,
> +		ret = alloc_init_pud(p4dp, addr, next, phys, prot,
>  			       pgtable_alloc, flags);
> +		if (ret)
> +			break;
>  
>  		BUG_ON(p4d_val(old_p4d) != 0 &&
>  		       p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
> @@ -422,23 +450,26 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>  	} while (p4dp++, addr = next, addr != end);
>  
>  	p4d_clear_fixmap();
> +
> +	return ret;
>  }
>  
> -static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
> -					unsigned long virt, phys_addr_t size,
> -					pgprot_t prot,
> -					phys_addr_t (*pgtable_alloc)(int),
> -					int flags)
> +static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
> +				       unsigned long virt, phys_addr_t size,
> +				       pgprot_t prot,
> +				       phys_addr_t (*pgtable_alloc)(int),
> +				       int flags)
>  {
>  	unsigned long addr, end, next;
>  	pgd_t *pgdp = pgd_offset_pgd(pgdir, virt);
> +	int ret = 0;
>  
>  	/*
>  	 * If the virtual and physical address don't have the same offset
>  	 * within a page, we cannot map the region as the caller expects.
>  	 */
>  	if (WARN_ON((phys ^ virt) & ~PAGE_MASK))
> -		return;
> +		return -EINVAL;
>  
>  	phys &= PAGE_MASK;
>  	addr = virt & PAGE_MASK;
> @@ -446,29 +477,38 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>  
>  	do {
>  		next = pgd_addr_end(addr, end);
> -		alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
> +		ret = alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
>  			       flags);
> +		if (ret)
> +			break;
>  		phys += next - addr;
>  	} while (pgdp++, addr = next, addr != end);
> +
> +	return ret;
>  }
>  
> -static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> -				 unsigned long virt, phys_addr_t size,
> -				 pgprot_t prot,
> -				 phys_addr_t (*pgtable_alloc)(int),
> -				 int flags)
> +static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> +				unsigned long virt, phys_addr_t size,
> +				pgprot_t prot,
> +				phys_addr_t (*pgtable_alloc)(int),
> +				int flags)
>  {
> +	int ret;
> +
>  	mutex_lock(&fixmap_lock);
> -	__create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
> +	ret = __create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
>  				    pgtable_alloc, flags);
> +	BUG_ON(ret);

This function now returns an error, but also BUGs on ret!=0. For this patch, I'd
suggest keeping this function as void.

But I believe there is a pre-existing bug in arch_add_memory(). That's called at
runtime so if __create_pgd_mapping() fails and BUGs, it will take down a running
system.

With this foundational patch, we can fix that with an additional patch to pass
along the error code instead of BUGing in that case. arch_add_memory() would
need to unwind whatever __create_pgd_mapping() managed to do before the memory
allocation failure (presumably unmapping and freeing any allocated tables). I'm
happy to do this as a follow-up patch.
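
Roughly the shape I have in mind for that follow-up (sketch only; whether
__remove_pgd_mapping() is enough to tear down a partially created mapping
would need checking):

	ret = __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
				   size, params->pgprot, __pgd_pgtable_alloc,
				   flags);
	if (ret) {
		/* unwind whatever got mapped before the allocation failure */
		__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
		return ret;
	}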

>  	mutex_unlock(&fixmap_lock);
> +
> +	return ret;
>  }
>  
>  #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>  extern __alias(__create_pgd_mapping_locked)
> -void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
> -			     phys_addr_t size, pgprot_t prot,
> -			     phys_addr_t (*pgtable_alloc)(int), int flags);
> +int create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
> +			    phys_addr_t size, pgprot_t prot,
> +			    phys_addr_t (*pgtable_alloc)(int), int flags);

create_kpti_ng_temp_pgd() now returns an error instead of BUGing on allocation
failure, but I don't see a change to handle that error. You'll want to update
__kpti_install_ng_mappings() to BUG on error.
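
i.e. something like this at the existing call site (sketch; argument list
approximate):

	int err;

	err = create_kpti_ng_temp_pgd(kpti_ng_temp_pgd, __pa(alloc), KPTI_NG_TEMP_VA,
				      PAGE_SIZE, PAGE_KERNEL, kpti_ng_pgd_alloc, 0);
	BUG_ON(err);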

>  #endif
>  
>  static phys_addr_t __pgd_pgtable_alloc(int shift)
> @@ -476,13 +516,17 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
>  	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>  	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
>  
> -	BUG_ON(!ptr);
> +	if (!ptr)
> +		return 0;

0 is a valid (though unlikely) physical address. I guess you could technically
encode it like ERR_PTR(), but since you are returning a phys_addr_t and not a
pointer, perhaps it would be clearer to make this return int and accept a
pointer to a phys_addr_t, which it would populate on success?
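
For illustration, something like this is what I mean (untested sketch):

static int __pgd_pgtable_alloc(int shift, phys_addr_t *pa)
{
	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);

	if (!ptr)
		return -ENOMEM;

	*pa = __pa(ptr);
	return 0;
}

with the pgtable_alloc function pointer type changing to match, and callers
doing "ret = pgtable_alloc(PAGE_SHIFT, &pte_phys);" as in the snippet further
up.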

> +
>  	return __pa(ptr);
>  }
>  
>  static phys_addr_t pgd_pgtable_alloc(int shift)
>  {
>  	phys_addr_t pa = __pgd_pgtable_alloc(shift);
> +	if (!pa)
> +		goto out;

This would obviously need to be fixed up as per above.

>  	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
>  
>  	/*
> @@ -498,6 +542,7 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
>  	else if (shift == PMD_SHIFT)
>  		BUG_ON(!pagetable_pmd_ctor(ptdesc));
>  
> +out:
>  	return pa;
>  }
>  

You have left early_pgtable_alloc() to panic() on allocation failure. Given we
can now unwind the stack with an error code, I think it would be more consistent
to also allow early_pgtable_alloc() to return an error.
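
e.g. (sketch, following the same int + out-parameter shape as suggested for
__pgd_pgtable_alloc(); the memblock call is just indicative):

static int __init early_pgtable_alloc(int shift, phys_addr_t *pa)
{
	phys_addr_t phys = memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);

	if (!phys)
		return -ENOMEM;

	*pa = phys;
	return 0;
}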

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full
  2025-03-04 22:19 ` [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full Yang Shi
  2025-03-08  1:53   ` kernel test robot
@ 2025-03-14 13:29   ` Ryan Roberts
  2025-03-17 17:57     ` Yang Shi
  1 sibling, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-03-14 13:29 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

On 04/03/2025 22:19, Yang Shi wrote:
> When rodata=full is specified, kernel linear mapping has to be mapped at
> PTE level since large page table can't be split due to break-before-make
> rule on ARM64.
> 
> This resulted in a couple of problems:
>   - performance degradation
>   - more TLB pressure
>   - memory waste for kernel page table
> 
> With FEAT_BBM level 2 support, splitting large block page table to
> smaller ones doesn't need to make the page table entry invalid anymore.
> This allows kernel split large block mapping on the fly.
> 
> Add kernel page table split support and use large block mapping by
> default when FEAT_BBM level 2 is supported for rodata=full.  When
> changing permissions for kernel linear mapping, the page table will be
> split to PTE level.
> 
> The machine without FEAT_BBM level 2 will fallback to have kernel linear
> mapping PTE-mapped when rodata=full.
> 
> With this we saw significant performance boost with some benchmarks and
> much less memory consumption on my AmpereOne machine (192 cores, 1P) with
> 256GB memory.
> 
> * Memory use after boot
> Before:
> MemTotal:       258988984 kB
> MemFree:        254821700 kB
> 
> After:
> MemTotal:       259505132 kB
> MemFree:        255410264 kB
> 
> Around 500MB more memory are free to use.  The larger the machine, the
> more memory saved.
> 
> * Memcached
> We saw performance degradation when running Memcached benchmark with
> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
> With this patchset we saw ops/sec is increased by around 3.5%, P99
> latency is reduced by around 9.6%.
> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
> MPKI is reduced by 28.5%.
> 
> The benchmark data is now on par with rodata=on too.
> 
> * Disk encryption (dm-crypt) benchmark
> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
> encryption (by dm-crypt).
> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>     --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>     --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>     --name=iops-test-job --eta-newline=1 --size 100G
> 
> The IOPS is increased by 90% - 150% (the variance is high, but the worst
> number of good case is around 90% more than the best number of bad case).
> The bandwidth is increased and the avg clat is reduced proportionally.
> 
> * Sequential file read
> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
> The bandwidth is increased by 150%.
> 
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  arch/arm64/include/asm/cpufeature.h |  10 ++
>  arch/arm64/include/asm/mmu.h        |   1 +
>  arch/arm64/include/asm/pgtable.h    |   7 +-
>  arch/arm64/kernel/cpufeature.c      |   2 +-
>  arch/arm64/mm/mmu.c                 | 169 +++++++++++++++++++++++++++-
>  arch/arm64/mm/pageattr.c            |  35 +++++-
>  6 files changed, 211 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 108ef3fbbc00..e24edc32b0bd 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -871,6 +871,16 @@ static inline bool system_supports_bbml2_noabort(void)
>  	return alternative_has_cap_unlikely(ARM64_HAS_BBML2_NOABORT);
>  }
>  
> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr);
> +/*
> + * Called at early boot stage on boot CPU before cpu info and cpu feature
> + * are ready.
> + */
> +static inline bool bbml2_noabort_available(void)
> +{
> +	return cpu_has_bbml2_noabort(read_cpuid_id());

You'll want to incorporate the IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT) and
arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_NOBBML2) checks from
Miko's new series to avoid block mappings when BBML2 is disabled. (that second
check will change a bit based on Maz's feedback against Miko's v3).

Hopefully we can factor this out into a common helper that Miko's stuff can use too?
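
i.e. roughly (sketch; the exact form of the override check may change with the
feedback on Miko's v3):

static inline bool bbml2_noabort_available(void)
{
	if (!IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT))
		return false;

	if (arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_NOBBML2))
		return false;

	return cpu_has_bbml2_noabort(read_cpuid_id());
}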

> +}
> +
>  int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
>  bool try_emulate_mrs(struct pt_regs *regs, u32 isn);
>  
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 662471cfc536..d658a33df266 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>  			       pgprot_t prot, bool page_mappings_only);
>  extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>  extern void mark_linear_text_alias_ro(void);
> +extern int split_linear_mapping(unsigned long start, unsigned long end);
>  
>  /*
>   * This check is triggered during the early boot before the cpufeature
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0b2a2ad1b9e8..ed2fc1dcf7ae 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -749,7 +749,7 @@ static inline bool in_swapper_pgdir(void *addr)
>  	        ((unsigned long)swapper_pg_dir & PAGE_MASK);
>  }
>  
> -static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
> +static inline void __set_pmd_nosync(pmd_t *pmdp, pmd_t pmd)
>  {
>  #ifdef __PAGETABLE_PMD_FOLDED
>  	if (in_swapper_pgdir(pmdp)) {
> @@ -759,6 +759,11 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>  #endif /* __PAGETABLE_PMD_FOLDED */
>  
>  	WRITE_ONCE(*pmdp, pmd);
> +}
> +
> +static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
> +{
> +	__set_pmd_nosync(pmdp, pmd);
>  
>  	if (pmd_valid(pmd)) {
>  		dsb(ishst);
> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> index bf3df8407ca3..d39637d5aeab 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -2176,7 +2176,7 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
>  	return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
>  }
>  
> -static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>  {
>  	/* We want to allow usage of bbml2 in as wide a range of kernel contexts
>  	 * as possible. This list is therefore an allow-list of known-good
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index dccf0877285b..ad0f1cc55e3a 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -45,6 +45,7 @@
>  #define NO_BLOCK_MAPPINGS	BIT(0)
>  #define NO_CONT_MAPPINGS	BIT(1)
>  #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
> +#define SPLIT_MAPPINGS		BIT(3)
>  
>  u64 kimage_voffset __ro_after_init;
>  EXPORT_SYMBOL(kimage_voffset);
> @@ -166,6 +167,73 @@ static void init_clear_pgtable(void *table)
>  	dsb(ishst);
>  }
>  
> +static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
> +		     phys_addr_t (*pgtable_alloc)(int))
> +{
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	phys_addr_t pte_phys;
> +	pte_t *ptep;
> +
> +	if (!pmd_leaf(pmdval))
> +		return 0;
> +
> +	pfn = pmd_pfn(pmdval);
> +	prot = pmd_pgprot(pmdval);
> +
> +	pte_phys = pgtable_alloc(PAGE_SHIFT);
> +	if (!pte_phys)
> +		return -ENOMEM;
> +
> +	ptep = (pte_t *)phys_to_virt(pte_phys);
> +	init_clear_pgtable(ptep);

No need for this; you're about to fill the table with ptes, so clearing it is a
waste of time.

> +	prot = __pgprot(pgprot_val(prot) | PTE_TYPE_PAGE);

This happens to work for D64 pgtables because of the way the bits are arranged.
But it won't work for D128 (when we get there). We are in the process of
cleaning up the code base to make it D128 ready. So let's fix this now:

	prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PTE_TYPE_PAGE);

nit: I'd move this up, next to the "prot = pmd_pgprot(pmdval);" line.

> +	for (int i = 0; i < PTRS_PER_PTE; i++, ptep++)
> +		__set_pte_nosync(ptep, pfn_pte(pfn + i, prot));

nit: you're incrementing ptep but adding i to pfn. Why not just increment pfn too?

> +
> +	dsb(ishst);
> +
> +	set_pmd(pmdp, pfn_pmd(__phys_to_pfn(pte_phys),
> +		__pgprot(PMD_TYPE_TABLE)));

You're missing some required pgprot flags and it would be better to follow what
alloc_init_cont_pte() does in general. Something like:

	pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
	if (flags & NO_EXEC_MAPPINGS)
		pmdval |= PMD_TABLE_PXN;
	__pmd_populate(pmdp, pte_phys, pmdval);

> +
> +	return 0;
> +}
> +
> +static int split_pud(pud_t *pudp, pud_t pudval,
> +		     phys_addr_t (*pgtable_alloc)(int))

All the same comments for split_pmd() apply here too.

> +{
> +	unsigned long pfn;
> +	pgprot_t prot;
> +	pmd_t *pmdp;
> +	phys_addr_t pmd_phys;
> +	unsigned int step;
> +
> +	if (!pud_leaf(pudval))
> +		return 0;
> +
> +	pfn = pud_pfn(pudval);
> +	prot = pud_pgprot(pudval);
> +	step = PMD_SIZE >> PAGE_SHIFT;
> +
> +	pmd_phys = pgtable_alloc(PMD_SHIFT);
> +	if (!pmd_phys)
> +		return -ENOMEM;
> +
> +	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
> +	init_clear_pgtable(pmdp);
> +	for (int i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
> +		__set_pmd_nosync(pmdp, pfn_pmd(pfn, prot));
> +		pfn += step;
> +	}
> +
> +	dsb(ishst);
> +
> +	set_pud(pudp, pfn_pud(__phys_to_pfn(pmd_phys),
> +		__pgprot(PUD_TYPE_TABLE)));
> +
> +	return 0;
> +}
> +
>  static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>  		     phys_addr_t phys, pgprot_t prot)
>  {
> @@ -251,12 +319,21 @@ static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  {
>  	unsigned long next;
>  	int ret = 0;
> +	bool split = flags & SPLIT_MAPPINGS;
>  
>  	do {
>  		pmd_t old_pmd = READ_ONCE(*pmdp);
>  
>  		next = pmd_addr_end(addr, end);
>  
> +		if (split) {

I think this should be:

		if (flags & SPLIT_MAPPINGS &&
		    pmd_leaf(old_pmd) &&
		    next < addr + PMD_SIZE) {

So we only attempt a split if it's a leaf and the leaf is not fully contained by
the range. Your current code is always splitting even if the block mapping is
fully contained which seems a waste. And if the pmd is not a leaf (either not
present or a table) split_pmd will currently do nothing and return 0, so there
is no opportunity to install mappings or visit the ptes.

> +			ret = split_pmd(pmdp, old_pmd, pgtable_alloc);

But... do we need the special split_pmd() and split_pud() functions at all?
Can't we just allocate a new table here, then let the existing code populate it,
then replace the block mapping with the table mapping? Same goes for huge puds.
If you take this approach, I think a lot of the code below will significantly
simplify.
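
To be concrete, I'm imagining something vaguely like the below, open coded in
init_pmd() (completely untested; it reuses the names from your patch, folds in
a leaf/containment check like the one above, and applies the prot/type-bit
fixup from my earlier comment):

	if ((flags & SPLIT_MAPPINGS) && pmd_leaf(old_pmd) &&
	    ((addr | next) & ~PMD_MASK)) {
		unsigned long blk_start = addr & PMD_MASK;
		pgprot_t blk_prot = pmd_pgprot(old_pmd);
		pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
		phys_addr_t pte_phys;
		pte_t *ptep;

		pte_phys = pgtable_alloc(PAGE_SHIFT);
		if (!pte_phys)
			return -ENOMEM;

		if (flags & NO_EXEC_MAPPINGS)
			pmdval |= PMD_TABLE_PXN;

		blk_prot = __pgprot((pgprot_val(blk_prot) & ~PMD_TYPE_MASK) |
				    PTE_TYPE_PAGE);

		/* Rebuild the whole block at pte granularity... */
		ptep = (pte_t *)phys_to_virt(pte_phys);
		init_clear_pgtable(ptep);	/* may still be needed: init_pte() looks at the old entries */
		init_pte(ptep, blk_start, blk_start + PMD_SIZE,
			 __pfn_to_phys(pmd_pfn(old_pmd)), blk_prot);
		dsb(ishst);	/* ptes visible before plumbing in the table */

		/* ...then swap the block for the table; BBML2 makes this safe */
		__pmd_populate(pmdp, pte_phys, pmdval);
	}

But I haven't prototyped it, so treat it as a sketch of the shape rather than
working code.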

> +			if (ret)
> +				break;
> +
> +			continue;
> +		}
> +
>  		/* try section mapping first */
>  		if (((addr | next | phys) & ~PMD_MASK) == 0 &&
>  		    (flags & NO_BLOCK_MAPPINGS) == 0) {

You'll want to modify this last bit to avoid setting up a block mapping if we
are trying to split?

		    (flags & (NO_BLOCK_MAPPINGS | SPLIT_MAPPINGS)) == 0) {

Or perhaps it's an error to call this without NO_BLOCK_MAPPINGS if
SPLIT_MAPPINGS is specified? Or perhaps we don't even need SPLIT_MAPPINGS, and
NO_BLOCK_MAPPINGS means we will split if we find any block mappings? (similarly
for NO_CONT_MAPPINGS)?

> @@ -292,11 +369,19 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  	int ret = 0;
>  	pud_t pud = READ_ONCE(*pudp);
>  	pmd_t *pmdp;
> +	bool split = flags & SPLIT_MAPPINGS;
>  
>  	/*
>  	 * Check for initial section mappings in the pgd/pud.
>  	 */
>  	BUG_ON(pud_sect(pud));
> +
> +	if (split) {
> +		BUG_ON(pud_none(pud));
> +		pmdp = pmd_offset(pudp, addr);
> +		goto split_pgtable;
> +	}
> +
>  	if (pud_none(pud)) {
>  		pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
>  		phys_addr_t pmd_phys;
> @@ -316,6 +401,7 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  		pmdp = pmd_set_fixmap_offset(pudp, addr);
>  	}
>  
> +split_pgtable:
>  	do {
>  		pgprot_t __prot = prot;
>  
> @@ -334,7 +420,8 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>  		phys += next - addr;
>  	} while (addr = next, addr != end);
>  
> -	pmd_clear_fixmap();
> +	if (!split)
> +		pmd_clear_fixmap();
>  
>  	return ret;
>  }
> @@ -348,6 +435,13 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>  	int ret = 0;
>  	p4d_t p4d = READ_ONCE(*p4dp);
>  	pud_t *pudp;
> +	bool split = flags & SPLIT_MAPPINGS;
> +
> +	if (split) {
> +		BUG_ON(p4d_none(p4d));
> +		pudp = pud_offset(p4dp, addr);
> +		goto split_pgtable;
> +	}
>  
>  	if (p4d_none(p4d)) {
>  		p4dval_t p4dval = P4D_TYPE_TABLE | P4D_TABLE_UXN | P4D_TABLE_AF;
> @@ -368,11 +462,25 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>  		pudp = pud_set_fixmap_offset(p4dp, addr);
>  	}
>  
> +split_pgtable:
>  	do {
>  		pud_t old_pud = READ_ONCE(*pudp);
>  
>  		next = pud_addr_end(addr, end);
>  
> +		if (split) {
> +			ret = split_pud(pudp, old_pud, pgtable_alloc);
> +			if (ret)
> +				break;
> +
> +			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
> +						  pgtable_alloc, flags);
> +			if (ret)
> +				break;
> +
> +			continue;
> +		}
> +
>  		/*
>  		 * For 4K granule only, attempt to put down a 1GB block
>  		 */
> @@ -399,7 +507,8 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>  		phys += next - addr;
>  	} while (pudp++, addr = next, addr != end);
>  
> -	pud_clear_fixmap();
> +	if (!split)
> +		pud_clear_fixmap();
>  
>  	return ret;
>  }
> @@ -413,6 +522,13 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>  	int ret = 0;
>  	pgd_t pgd = READ_ONCE(*pgdp);
>  	p4d_t *p4dp;
> +	bool split = flags & SPLIT_MAPPINGS;
> +
> +	if (split) {
> +		BUG_ON(pgd_none(pgd));
> +		p4dp = p4d_offset(pgdp, addr);
> +		goto split_pgtable;
> +	}
>  
>  	if (pgd_none(pgd)) {
>  		pgdval_t pgdval = PGD_TYPE_TABLE | PGD_TABLE_UXN | PGD_TABLE_AF;
> @@ -433,6 +549,7 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>  		p4dp = p4d_set_fixmap_offset(pgdp, addr);
>  	}
>  
> +split_pgtable:
>  	do {
>  		p4d_t old_p4d = READ_ONCE(*p4dp);
>  
> @@ -449,7 +566,8 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>  		phys += next - addr;
>  	} while (p4dp++, addr = next, addr != end);
>  
> -	p4d_clear_fixmap();
> +	if (!split)
> +		p4d_clear_fixmap();
>  
>  	return ret;
>  }
> @@ -546,6 +664,23 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
>  	return pa;
>  }
>  
> +int split_linear_mapping(unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +
> +	if (!system_supports_bbml2_noabort())
> +		return 0;
> +
> +	mmap_write_lock(&init_mm);
> +	ret = __create_pgd_mapping_locked(init_mm.pgd, virt_to_phys((void *)start),
> +					  start, (end - start), __pgprot(0),
> +					  __pgd_pgtable_alloc, SPLIT_MAPPINGS);
> +	mmap_write_unlock(&init_mm);
> +	flush_tlb_kernel_range(start, end);
> +
> +	return ret;
> +}
> +
>  /*
>   * This function can only be used to modify existing table entries,
>   * without allocating new levels of table. Note that this permits the
> @@ -665,6 +800,24 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
>  
>  #endif /* CONFIG_KFENCE */
>  
> +static inline bool force_pte_mapping(void)
> +{
> +	/*
> +	 * Can't use cpufeature API to determine whether BBML2 supported
> +	 * or not since cpufeature have not been finalized yet.
> +	 *
> +	 * Checking the boot CPU only for now.  If the boot CPU has
> +	 * BBML2, paint linear mapping with block mapping.  If it turns
> +	 * out the secondary CPUs don't support BBML2 once cpufeature is
> +	 * fininalized, the linear mapping will be repainted with PTE
> +	 * finalized, the linear mapping will be repainted with PTE
> +	 */
> +	return (rodata_full && !bbml2_noabort_available()) ||
> +		debug_pagealloc_enabled() ||
> +		arm64_kfence_can_set_direct_map() ||
> +		is_realm_world();
> +}
> +
>  static void __init map_mem(pgd_t *pgdp)
>  {
>  	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
> @@ -690,9 +843,12 @@ static void __init map_mem(pgd_t *pgdp)
>  
>  	early_kfence_pool = arm64_kfence_alloc_pool();
>  
> -	if (can_set_direct_map())
> +	if (force_pte_mapping())
>  		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  
> +	if (rodata_full)
> +		flags |= NO_CONT_MAPPINGS;
> +
>  	/*
>  	 * Take care not to create a writable alias for the
>  	 * read-only text and rodata sections of the kernel image.
> @@ -1388,9 +1544,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  
>  	VM_BUG_ON(!mhp_range_allowed(start, size, true));
>  
> -	if (can_set_direct_map())
> +	if (force_pte_mapping())
>  		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  
> +	if (rodata_full)
> +		flags |= NO_CONT_MAPPINGS;
> +
>  	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>  			     size, params->pgprot, __pgd_pgtable_alloc,
>  			     flags);
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 39fd1f7ff02a..5d42d87ea7e1 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -10,6 +10,7 @@
>  #include <linux/vmalloc.h>
>  
>  #include <asm/cacheflush.h>
> +#include <asm/mmu.h>
>  #include <asm/pgtable-prot.h>
>  #include <asm/set_memory.h>
>  #include <asm/tlbflush.h>
> @@ -80,8 +81,9 @@ static int change_memory_common(unsigned long addr, int numpages,
>  	unsigned long start = addr;
>  	unsigned long size = PAGE_SIZE * numpages;
>  	unsigned long end = start + size;
> +	unsigned long l_start;
>  	struct vm_struct *area;
> -	int i;
> +	int i, ret;
>  
>  	if (!PAGE_ALIGNED(addr)) {
>  		start &= PAGE_MASK;
> @@ -118,7 +120,12 @@ static int change_memory_common(unsigned long addr, int numpages,
>  	if (rodata_full && (pgprot_val(set_mask) == PTE_RDONLY ||
>  			    pgprot_val(clear_mask) == PTE_RDONLY)) {
>  		for (i = 0; i < area->nr_pages; i++) {
> -			__change_memory_common((u64)page_address(area->pages[i]),
> +			l_start = (u64)page_address(area->pages[i]);
> +			ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);

This isn't quite aligned with how I was thinking about it. You still have 2
passes here; one to split the range to base pages, then another to modify the
permissions.

I was thinking we could use the table walker in mmu.c to achieve 2 benefits:

  - Do both operations in a single pass (a bit like how calling
update_mapping_prot() will update the protections on an existing mapping, and
the table walker will split when it comes across a huge page)

  - Only split when needed; if the whole huge page is contained within the
range, then there is no need to split in the first place.

We could then split vmalloc regions for free using this infrastructure too.

Although there is a wrinkle that the mmu.c table walker only accepts a pgprot
and can't currently handle a set_mask/clear_mask. I guess that could be added,
but it starts to get a bit busy. I think this generic infra would be useful
though. What do you think?
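
For the sake of discussion, the rough shape I have in mind is something like
this (names entirely invented, and glossing over how it would plumb into
__create_pgd_mapping_locked()):

	/*
	 * Walk [start, end): split any block/contig mapping that straddles
	 * the start or end boundary, and apply set_mask/clear_mask to every
	 * leaf entry that falls inside the range, all in a single pass.
	 */
	int update_kernel_range_prot(unsigned long start, unsigned long end,
				     pgprot_t set_mask, pgprot_t clear_mask);

	/* which change_memory_common() could then call per linear alias: */
	l_start = (u64)page_address(area->pages[i]);
	ret = update_kernel_range_prot(l_start, l_start + PAGE_SIZE,
				       set_mask, clear_mask);

All very hand-wavy, of course.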

[...]

Thanks,
Ryan




* Re: [v3 PATCH 5/6] arm64: mm: support split CONT mappings
  2025-03-04 22:19 ` [v3 PATCH 5/6] arm64: mm: support split CONT mappings Yang Shi
@ 2025-03-14 13:33   ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2025-03-14 13:33 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

On 04/03/2025 22:19, Yang Shi wrote:
> Add split CONT mappings support in order to support CONT mappings for
> direct map.  This should help reduce TLB pressure further.
> 
> When splitting PUD, all PMDs will have CONT bit set since the leaf PUD
> must be naturally aligned.  When splitting PMD, all PTEs will have CONT
> bit set since the leaf PMD must be naturally aligned too, but the PMDs
> in the cont range of the split PMD will have the CONT bit cleared.  CONT
> PTEs are split by clearing the CONT bit for all PTEs in the range.

My expectation is that this patch is not needed if you take the approach of
reusing the existing code to generate the new lower parts of the hierarchy as
suggested in the previous patch.

Thanks,
Ryan

> 
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> ---
>  arch/arm64/include/asm/pgtable.h |  5 ++
>  arch/arm64/mm/mmu.c              | 82 ++++++++++++++++++++++++++------
>  arch/arm64/mm/pageattr.c         |  2 +
>  3 files changed, 75 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index ed2fc1dcf7ae..3c6ef47f5813 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -290,6 +290,11 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
>  	return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
>  }
>  
> +static inline pmd_t pmd_mknoncont(pmd_t pmd)
> +{
> +	return __pmd(pmd_val(pmd) & ~PMD_SECT_CONT);
> +}
> +
>  static inline pte_t pte_mkdevmap(pte_t pte)
>  {
>  	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ad0f1cc55e3a..d4dfeabc80e9 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -167,19 +167,36 @@ static void init_clear_pgtable(void *table)
>  	dsb(ishst);
>  }
>  
> +static void split_cont_pte(pte_t *ptep)
> +{
> +	pte_t *_ptep = PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> +	pte_t _pte;
> +	for (int i = 0; i < CONT_PTES; i++, _ptep++) {
> +		_pte = READ_ONCE(*_ptep);
> +		_pte = pte_mknoncont(_pte);
> +		__set_pte_nosync(_ptep, _pte);
> +	}
> +
> +	dsb(ishst);
> +	isb();
> +}
> +
>  static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
> -		     phys_addr_t (*pgtable_alloc)(int))
> +		     phys_addr_t (*pgtable_alloc)(int), int flags)
>  {
>  	unsigned long pfn;
>  	pgprot_t prot;
>  	phys_addr_t pte_phys;
>  	pte_t *ptep;
> +	bool cont;
> +	int i;
>  
>  	if (!pmd_leaf(pmdval))
>  		return 0;
>  
>  	pfn = pmd_pfn(pmdval);
>  	prot = pmd_pgprot(pmdval);
> +	cont = pgprot_val(prot) & PTE_CONT;
>  
>  	pte_phys = pgtable_alloc(PAGE_SHIFT);
>  	if (!pte_phys)
> @@ -188,11 +205,27 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
>  	ptep = (pte_t *)phys_to_virt(pte_phys);
>  	init_clear_pgtable(ptep);
>  	prot = __pgprot(pgprot_val(prot) | PTE_TYPE_PAGE);
> -	for (int i = 0; i < PTRS_PER_PTE; i++, ptep++)
> +
> +	/* It must be naturally aligned if PMD is leaf */
> +	if ((flags & NO_CONT_MAPPINGS) == 0)
> +		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++, ptep++)
>  		__set_pte_nosync(ptep, pfn_pte(pfn + i, prot));
>  
>  	dsb(ishst);
>  
> +	/* Clear CONT bit for the PMDs in the range */
> +	if (cont) {
> +		pmd_t *_pmdp, _pmd;
> +		_pmdp = PTR_ALIGN_DOWN(pmdp, sizeof(*pmdp) * CONT_PMDS);
> +		for (i = 0; i < CONT_PMDS; i++, _pmdp++) {
> +			_pmd = READ_ONCE(*_pmdp);
> +			_pmd = pmd_mknoncont(_pmd);
> +			set_pmd(_pmdp, _pmd);
> +		}
> +	}
> +
>  	set_pmd(pmdp, pfn_pmd(__phys_to_pfn(pte_phys),
>  		__pgprot(PMD_TYPE_TABLE)));
>  
> @@ -200,7 +233,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
>  }
>  
>  static int split_pud(pud_t *pudp, pud_t pudval,
> -		     phys_addr_t (*pgtable_alloc)(int))
> +		     phys_addr_t (*pgtable_alloc)(int), int flags)
>  {
>  	unsigned long pfn;
>  	pgprot_t prot;
> @@ -221,6 +254,11 @@ static int split_pud(pud_t *pudp, pud_t pudval,
>  
>  	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
>  	init_clear_pgtable(pmdp);
> +
> +	/* It must be naturally aligned if PUD is leaf */
> +	if ((flags & NO_CONT_MAPPINGS) == 0)
> +		prot = __pgprot(pgprot_val(prot) | PTE_CONT);
> +
>  	for (int i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
>  		__set_pmd_nosync(pmdp, pfn_pmd(pfn, prot));
>  		pfn += step;
> @@ -235,11 +273,18 @@ static int split_pud(pud_t *pudp, pud_t pudval,
>  }
>  
>  static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
> -		     phys_addr_t phys, pgprot_t prot)
> +		     phys_addr_t phys, pgprot_t prot, int flags)
>  {
>  	do {
>  		pte_t old_pte = __ptep_get(ptep);
>  
> +		if (flags & SPLIT_MAPPINGS) {
> +			if (pte_cont(old_pte))
> +				split_cont_pte(ptep);
> +
> +			continue;
> +		}
> +
>  		/*
>  		 * Required barriers to make this visible to the table walker
>  		 * are deferred to the end of alloc_init_cont_pte().
> @@ -266,8 +311,16 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  	unsigned long next;
>  	pmd_t pmd = READ_ONCE(*pmdp);
>  	pte_t *ptep;
> +	bool split = flags & SPLIT_MAPPINGS;
>  
>  	BUG_ON(pmd_sect(pmd));
> +
> +	if (split) {
> +		BUG_ON(pmd_none(pmd));
> +		ptep = pte_offset_kernel(pmdp, addr);
> +		goto split_pgtable;
> +	}
> +
>  	if (pmd_none(pmd)) {
>  		pmdval_t pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
>  		phys_addr_t pte_phys;
> @@ -287,6 +340,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  		ptep = pte_set_fixmap_offset(pmdp, addr);
>  	}
>  
> +split_pgtable:
>  	do {
>  		pgprot_t __prot = prot;
>  
> @@ -297,7 +351,7 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  		    (flags & NO_CONT_MAPPINGS) == 0)
>  			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>  
> -		init_pte(ptep, addr, next, phys, __prot);
> +		init_pte(ptep, addr, next, phys, __prot, flags);
>  
>  		ptep += pte_index(next) - pte_index(addr);
>  		phys += next - addr;
> @@ -308,7 +362,8 @@ static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>  	 * ensure that all previous pgtable writes are visible to the table
>  	 * walker.
>  	 */
> -	pte_clear_fixmap();
> +	if (!split)
> +		pte_clear_fixmap();
>  
>  	return 0;
>  }
> @@ -327,7 +382,12 @@ static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>  		next = pmd_addr_end(addr, end);
>  
>  		if (split) {
> -			ret = split_pmd(pmdp, old_pmd, pgtable_alloc);
> +			ret = split_pmd(pmdp, old_pmd, pgtable_alloc, flags);
> +			if (ret)
> +				break;
> +
> +			ret = alloc_init_cont_pte(pmdp, addr, next, phys, prot,
> +						  pgtable_alloc, flags);
>  			if (ret)
>  				break;
>  
> @@ -469,7 +529,7 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>  		next = pud_addr_end(addr, end);
>  
>  		if (split) {
> -			ret = split_pud(pudp, old_pud, pgtable_alloc);
> +			ret = split_pud(pudp, old_pud, pgtable_alloc, flags);
>  			if (ret)
>  				break;
>  
> @@ -846,9 +906,6 @@ static void __init map_mem(pgd_t *pgdp)
>  	if (force_pte_mapping())
>  		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  
> -	if (rodata_full)
> -		flags |= NO_CONT_MAPPINGS;
> -
>  	/*
>  	 * Take care not to create a writable alias for the
>  	 * read-only text and rodata sections of the kernel image.
> @@ -1547,9 +1604,6 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	if (force_pte_mapping())
>  		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>  
> -	if (rodata_full)
> -		flags |= NO_CONT_MAPPINGS;
> -
>  	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>  			     size, params->pgprot, __pgd_pgtable_alloc,
>  			     flags);
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 5d42d87ea7e1..25c068712cb5 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -43,6 +43,8 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>  	struct page_change_data *cdata = data;
>  	pte_t pte = __ptep_get(ptep);
>  
> +	BUG_ON(pte_cont(pte));
> +
>  	pte = clear_pte_bit(pte, cdata->clear_mask);
>  	pte = set_pte_bit(pte, cdata->set_mask);
>  




* Re: [v3 PATCH 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list
  2025-03-14 10:58   ` Ryan Roberts
@ 2025-03-17 17:50     ` Yang Shi
  0 siblings, 0 replies; 49+ messages in thread
From: Yang Shi @ 2025-03-17 17:50 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel



On 3/14/25 3:58 AM, Ryan Roberts wrote:
> On 04/03/2025 22:19, Yang Shi wrote:
>> AmpereOne supports BBML2 without conflict abort, add to the allow list.
>>
>> Signed-off-by: Yang Shi<yang@os.amperecomputing.com>
>> ---
>>   arch/arm64/kernel/cpufeature.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index 7934c6dd493e..bf3df8407ca3 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -2192,6 +2192,8 @@ static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>>   	static const struct midr_range supports_bbml2_noabort_list[] = {
>>   		MIDR_REV_RANGE(MIDR_CORTEX_X4, 0, 3, 0xf),
>>   		MIDR_REV_RANGE(MIDR_NEOVERSE_V3, 0, 2, 0xf),
>> +		MIDR_ALL_VERSIONS(MIDR_AMPERE1),
>> +		MIDR_ALL_VERSIONS(MIDR_AMPERE1A),
>>   		{}
>>   	};
>>   
> Miko's series will move back to additionally checking MMFR2.BBM, so you will
> need to add an erratum workaround for these CPUs to set MMFR2.BBM=2 in the
> per-cpu "sanitised" feature register. See:
>
> https://lore.kernel.org/linux-arm-kernel/86ecyzorb7.wl-maz@kernel.org/

Thank you. I will talk to our architect to see how we should handle 
this. This should not block the page table split work.

Yang

> Thanks,
> Ryan
>




* Re: [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void
  2025-03-14 11:51   ` Ryan Roberts
@ 2025-03-17 17:53     ` Yang Shi
  2025-05-07  8:18       ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-03-17 17:53 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel



On 3/14/25 4:51 AM, Ryan Roberts wrote:
> On 04/03/2025 22:19, Yang Shi wrote:
>> The later patch will enhance __create_pgd_mapping() and related helpers
>> to split the kernel linear mapping, which requires a return value.  So make
>> __create_pgd_mapping() and helpers non-void functions.
>>
>> And move the BUG_ON() out of page table alloc helper since failing
>> splitting kernel linear mapping is not fatal and can be handled by the
>> callers in the later patch.  Have BUG_ON() after
>> __create_pgd_mapping_locked() returns to keep the current callers behavior
>> intact.
>>
>> Suggested-by: Ryan Roberts<ryan.roberts@arm.com>
>> Signed-off-by: Yang Shi<yang@os.amperecomputing.com>
>> ---
>>   arch/arm64/mm/mmu.c | 127 ++++++++++++++++++++++++++++++--------------
>>   1 file changed, 86 insertions(+), 41 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index b4df5bc5b1b8..dccf0877285b 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -189,11 +189,11 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>>   	} while (ptep++, addr += PAGE_SIZE, addr != end);
>>   }
>>   
>> -static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> -				unsigned long end, phys_addr_t phys,
>> -				pgprot_t prot,
>> -				phys_addr_t (*pgtable_alloc)(int),
>> -				int flags)
>> +static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>> +			       unsigned long end, phys_addr_t phys,
>> +			       pgprot_t prot,
>> +			       phys_addr_t (*pgtable_alloc)(int),
>> +			       int flags)
>>   {
>>   	unsigned long next;
>>   	pmd_t pmd = READ_ONCE(*pmdp);
>> @@ -208,6 +208,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>   			pmdval |= PMD_TABLE_PXN;
>>   		BUG_ON(!pgtable_alloc);
>>   		pte_phys = pgtable_alloc(PAGE_SHIFT);
>> +		if (!pte_phys)
>> +			return -ENOMEM;
> nit: personally I'd prefer to see a "goto out" and funnel all to a single return
> statement. You do that in some functions (via loop break), but would be cleaner
> if consistent.
>
> If pgtable_alloc() is modified to return int (see my comment at the bottom),
> this becomes:
>
> ret = pgtable_alloc(PAGE_SHIFT, &pte_phys);
> if (ret)
> 	goto out;

OK

>>   		ptep = pte_set_fixmap(pte_phys);
>>   		init_clear_pgtable(ptep);
>>   		ptep += pte_index(addr);
>> @@ -239,13 +241,16 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>   	 * walker.
>>   	 */
>>   	pte_clear_fixmap();
>> +
>> +	return 0;
>>   }
>>   
>> -static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> -		     phys_addr_t phys, pgprot_t prot,
>> -		     phys_addr_t (*pgtable_alloc)(int), int flags)
>> +static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>> +		    phys_addr_t phys, pgprot_t prot,
>> +		    phys_addr_t (*pgtable_alloc)(int), int flags)
>>   {
>>   	unsigned long next;
>> +	int ret = 0;
>>   
>>   	do {
>>   		pmd_t old_pmd = READ_ONCE(*pmdp);
>> @@ -264,22 +269,27 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>   			BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
>>   						      READ_ONCE(pmd_val(*pmdp))));
>>   		} else {
>> -			alloc_init_cont_pte(pmdp, addr, next, phys, prot,
>> +			ret = alloc_init_cont_pte(pmdp, addr, next, phys, prot,
>>   					    pgtable_alloc, flags);
>> +			if (ret)
>> +				break;
>>   
>>   			BUG_ON(pmd_val(old_pmd) != 0 &&
>>   			       pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
>>   		}
>>   		phys += next - addr;
>>   	} while (pmdp++, addr = next, addr != end);
>> +
>> +	return ret;
>>   }
>>   
>> -static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>> -				unsigned long end, phys_addr_t phys,
>> -				pgprot_t prot,
>> -				phys_addr_t (*pgtable_alloc)(int), int flags)
>> +static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>> +			       unsigned long end, phys_addr_t phys,
>> +			       pgprot_t prot,
>> +			       phys_addr_t (*pgtable_alloc)(int), int flags)
>>   {
>>   	unsigned long next;
>> +	int ret = 0;
>>   	pud_t pud = READ_ONCE(*pudp);
>>   	pmd_t *pmdp;
>>   
>> @@ -295,6 +305,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>   			pudval |= PUD_TABLE_PXN;
>>   		BUG_ON(!pgtable_alloc);
>>   		pmd_phys = pgtable_alloc(PMD_SHIFT);
>> +		if (!pmd_phys)
>> +			return -ENOMEM;
>>   		pmdp = pmd_set_fixmap(pmd_phys);
>>   		init_clear_pgtable(pmdp);
>>   		pmdp += pmd_index(addr);
>> @@ -314,21 +326,26 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>   		    (flags & NO_CONT_MAPPINGS) == 0)
>>   			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>   
>> -		init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>> +		ret = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>> +		if (ret)
>> +			break;
>>   
>>   		pmdp += pmd_index(next) - pmd_index(addr);
>>   		phys += next - addr;
>>   	} while (addr = next, addr != end);
>>   
>>   	pmd_clear_fixmap();
>> +
>> +	return ret;
>>   }
>>   
>> -static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>> -			   phys_addr_t phys, pgprot_t prot,
>> -			   phys_addr_t (*pgtable_alloc)(int),
>> -			   int flags)
>> +static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>> +			  phys_addr_t phys, pgprot_t prot,
>> +			  phys_addr_t (*pgtable_alloc)(int),
>> +			  int flags)
>>   {
>>   	unsigned long next;
>> +	int ret = 0;
>>   	p4d_t p4d = READ_ONCE(*p4dp);
>>   	pud_t *pudp;
>>   
>> @@ -340,6 +357,8 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>   			p4dval |= P4D_TABLE_PXN;
>>   		BUG_ON(!pgtable_alloc);
>>   		pud_phys = pgtable_alloc(PUD_SHIFT);
>> +		if (!pud_phys)
>> +			return -ENOMEM;
>>   		pudp = pud_set_fixmap(pud_phys);
>>   		init_clear_pgtable(pudp);
>>   		pudp += pud_index(addr);
>> @@ -369,8 +388,10 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>   			BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
>>   						      READ_ONCE(pud_val(*pudp))));
>>   		} else {
>> -			alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>> +			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>>   					    pgtable_alloc, flags);
>> +			if (ret)
>> +				break;
>>   
>>   			BUG_ON(pud_val(old_pud) != 0 &&
>>   			       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
>> @@ -379,14 +400,17 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>   	} while (pudp++, addr = next, addr != end);
>>   
>>   	pud_clear_fixmap();
>> +
>> +	return ret;
>>   }
>>   
>> -static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>> -			   phys_addr_t phys, pgprot_t prot,
>> -			   phys_addr_t (*pgtable_alloc)(int),
>> -			   int flags)
>> +static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>> +			  phys_addr_t phys, pgprot_t prot,
>> +			  phys_addr_t (*pgtable_alloc)(int),
>> +			  int flags)
>>   {
>>   	unsigned long next;
>> +	int ret = 0;
>>   	pgd_t pgd = READ_ONCE(*pgdp);
>>   	p4d_t *p4dp;
>>   
>> @@ -398,6 +422,8 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>   			pgdval |= PGD_TABLE_PXN;
>>   		BUG_ON(!pgtable_alloc);
>>   		p4d_phys = pgtable_alloc(P4D_SHIFT);
>> +		if (!p4d_phys)
>> +			return -ENOMEM;
>>   		p4dp = p4d_set_fixmap(p4d_phys);
>>   		init_clear_pgtable(p4dp);
>>   		p4dp += p4d_index(addr);
>> @@ -412,8 +438,10 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>   
>>   		next = p4d_addr_end(addr, end);
>>   
>> -		alloc_init_pud(p4dp, addr, next, phys, prot,
>> +		ret = alloc_init_pud(p4dp, addr, next, phys, prot,
>>   			       pgtable_alloc, flags);
>> +		if (ret)
>> +			break;
>>   
>>   		BUG_ON(p4d_val(old_p4d) != 0 &&
>>   		       p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
>> @@ -422,23 +450,26 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>   	} while (p4dp++, addr = next, addr != end);
>>   
>>   	p4d_clear_fixmap();
>> +
>> +	return ret;
>>   }
>>   
>> -static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>> -					unsigned long virt, phys_addr_t size,
>> -					pgprot_t prot,
>> -					phys_addr_t (*pgtable_alloc)(int),
>> -					int flags)
>> +static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>> +				       unsigned long virt, phys_addr_t size,
>> +				       pgprot_t prot,
>> +				       phys_addr_t (*pgtable_alloc)(int),
>> +				       int flags)
>>   {
>>   	unsigned long addr, end, next;
>>   	pgd_t *pgdp = pgd_offset_pgd(pgdir, virt);
>> +	int ret = 0;
>>   
>>   	/*
>>   	 * If the virtual and physical address don't have the same offset
>>   	 * within a page, we cannot map the region as the caller expects.
>>   	 */
>>   	if (WARN_ON((phys ^ virt) & ~PAGE_MASK))
>> -		return;
>> +		return -EINVAL;
>>   
>>   	phys &= PAGE_MASK;
>>   	addr = virt & PAGE_MASK;
>> @@ -446,29 +477,38 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>>   
>>   	do {
>>   		next = pgd_addr_end(addr, end);
>> -		alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
>> +		ret = alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
>>   			       flags);
>> +		if (ret)
>> +			break;
>>   		phys += next - addr;
>>   	} while (pgdp++, addr = next, addr != end);
>> +
>> +	return ret;
>>   }
>>   
>> -static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>> -				 unsigned long virt, phys_addr_t size,
>> -				 pgprot_t prot,
>> -				 phys_addr_t (*pgtable_alloc)(int),
>> -				 int flags)
>> +static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>> +				unsigned long virt, phys_addr_t size,
>> +				pgprot_t prot,
>> +				phys_addr_t (*pgtable_alloc)(int),
>> +				int flags)
>>   {
>> +	int ret;
>> +
>>   	mutex_lock(&fixmap_lock);
>> -	__create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
>> +	ret = __create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
>>   				    pgtable_alloc, flags);
>> +	BUG_ON(ret);
> This function now returns an error, but also BUGs on ret!=0. For this patch, I'd
> suggest keeping this function as void.

You mean __create_pgd_mapping(), right?

> But I believe there is a pre-existing bug in arch_add_memory(). That's called at
> runtime so if __create_pgd_mapping() fails and BUGs, it will take down a running
> system.

Yes, it is the current behavior.

> With this foundational patch, we can fix that with an additional patch to pass
> along the error code instead of BUGing in that case. arch_add_memory() would
> need to unwind whatever __create_pgd_mapping() managed to do before the memory
> allocation failure (presumably unmapping and freeing any allocated tables). I'm
> happy to do this as a follow up patch.

Yes, the allocated page tables need to be freed. Thank you for taking it.
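
Presumably the arch_add_memory() side would then end up looking roughly like
the below (just a sketch, and assuming __remove_pgd_mapping() copes with a
partially created range):

	ret = __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
				   size, params->pgprot, __pgd_pgtable_alloc,
				   flags);
	if (ret) {
		__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
		return ret;
	}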

>>   	mutex_unlock(&fixmap_lock);
>> +
>> +	return ret;
>>   }
>>   
>>   #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>>   extern __alias(__create_pgd_mapping_locked)
>> -void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
>> -			     phys_addr_t size, pgprot_t prot,
>> -			     phys_addr_t (*pgtable_alloc)(int), int flags);
>> +int create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
>> +			    phys_addr_t size, pgprot_t prot,
>> +			    phys_addr_t (*pgtable_alloc)(int), int flags);
> create_kpti_ng_temp_pgd() now returns error instead of BUGing on allocation
> failure, but I don't see a change to handle that error. You'll want to update
> __kpti_install_ng_mappings() to BUG on error.

Yes, I missed that. It should BUG on error.

>>   #endif
>>   
>>   static phys_addr_t __pgd_pgtable_alloc(int shift)
>> @@ -476,13 +516,17 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
>>   	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>>   	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
>>   
>> -	BUG_ON(!ptr);
>> +	if (!ptr)
>> +		return 0;
> 0 is a valid (though unlikely) physical address. I guess you could technically
> encode like ERR_PTR(), but since you are returning phys_addr_t and not a
> pointer, then perhaps it will be clearer to make this return int and accept a
> pointer to a phys_addr_t, which it will populate on success?

Actually I did something similar in the first place, but just returned
the virt address and checked for NULL. That made the code a little more
messy since we need to convert the virt address to a phys address
(__create_pgd_mapping() and the helpers require phys addresses), and it
changed the function definitions.

But I noticed 0 should not be a valid phys address if I remember
correctly. I also noticed early_pgtable_alloc() calls
memblock_phys_alloc_range(), which returns 0 on failure. If 0 were a
valid phys address, it should not do that, right? And I also noticed the
memblock range 0 - memstart_addr is actually removed from memblock (see
arm64_memblock_init()), so IIUC 0 should not be a valid phys address. So
the patch ended up being as is.

If this assumption doesn't stand, I think your suggestion makes sense.
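
Just to make sure I follow, you mean something along these lines (sketch only,
the pgtable_alloc callers would obviously need updating to match)?

	static int __pgd_pgtable_alloc(int shift, phys_addr_t *pa)
	{
		/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
		void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);

		if (!ptr)
			return -ENOMEM;

		*pa = __pa(ptr);
		return 0;
	}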

>> +
>>   	return __pa(ptr);
>>   }
>>   
>>   static phys_addr_t pgd_pgtable_alloc(int shift)
>>   {
>>   	phys_addr_t pa = __pgd_pgtable_alloc(shift);
>> +	if (!pa)
>> +		goto out;
> This would obviously need to be fixed up as per above.
>
>>   	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
>>   
>>   	/*
>> @@ -498,6 +542,7 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
>>   	else if (shift == PMD_SHIFT)
>>   		BUG_ON(!pagetable_pmd_ctor(ptdesc));
>>   
>> +out:
>>   	return pa;
>>   }
>>   
> You have left early_pgtable_alloc() to panic() on allocation failure. Given we
> can now unwind the stack with error code, I think it would be more consistent to
> also allow early_pgtable_alloc() to return error.

early_pgtable_alloc() is just used for painting the linear mapping at the
early boot stage; if it fails, I don't think unwinding the stack is
feasible or worth it. Did I miss something?

Thanks,
Yang

> Thanks,
> Ryan
>




* Re: [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full
  2025-03-14 13:29   ` Ryan Roberts
@ 2025-03-17 17:57     ` Yang Shi
  0 siblings, 0 replies; 49+ messages in thread
From: Yang Shi @ 2025-03-17 17:57 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel



On 3/14/25 6:29 AM, Ryan Roberts wrote:
> On 04/03/2025 22:19, Yang Shi wrote:
>> When rodata=full is specified, the kernel linear mapping has to be mapped
>> at PTE level since large block mappings can't be split due to the
>> break-before-make rule on ARM64.
>>
>> This resulted in a couple of problems:
>>    - performance degradation
>>    - more TLB pressure
>>    - memory waste for kernel page table
>>
>> With FEAT_BBM level 2 support, splitting large block page table to
>> smaller ones doesn't need to make the page table entry invalid anymore.
>> This allows the kernel to split large block mappings on the fly.
>>
>> Add kernel page table split support and use large block mapping by
>> default when FEAT_BBM level 2 is supported for rodata=full.  When
>> changing permissions for kernel linear mapping, the page table will be
>> split to PTE level.
>>
>> Machines without FEAT_BBM level 2 will fall back to a PTE-mapped kernel
>> linear mapping when rodata=full.
>>
>> With this we saw significant performance boost with some benchmarks and
>> much less memory consumption on my AmpereOne machine (192 cores, 1P) with
>> 256GB memory.
>>
>> * Memory use after boot
>> Before:
>> MemTotal:       258988984 kB
>> MemFree:        254821700 kB
>>
>> After:
>> MemTotal:       259505132 kB
>> MemFree:        255410264 kB
>>
>> Around 500MB more memory are free to use.  The larger the machine, the
>> more memory saved.
>>
>> * Memcached
>> We saw performance degradation when running Memcached benchmark with
>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>> latency is reduced by around 9.6%.
>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>> MPKI is reduced by 28.5%.
>>
>> The benchmark data is now on par with rodata=on too.
>>
>> * Disk encryption (dm-crypt) benchmark
>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
>> encryption (by dm-crypt).
>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>>      --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>>      --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>>      --name=iops-test-job --eta-newline=1 --size 100G
>>
>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>> number of good case is around 90% more than the best number of bad case).
>> The bandwidth is increased and the avg clat is reduced proportionally.
>>
>> * Sequential file read
>> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
>> The bandwidth is increased by 150%.
>>
>> Signed-off-by: Yang Shi<yang@os.amperecomputing.com>
>> ---
>>   arch/arm64/include/asm/cpufeature.h |  10 ++
>>   arch/arm64/include/asm/mmu.h        |   1 +
>>   arch/arm64/include/asm/pgtable.h    |   7 +-
>>   arch/arm64/kernel/cpufeature.c      |   2 +-
>>   arch/arm64/mm/mmu.c                 | 169 +++++++++++++++++++++++++++-
>>   arch/arm64/mm/pageattr.c            |  35 +++++-
>>   6 files changed, 211 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
>> index 108ef3fbbc00..e24edc32b0bd 100644
>> --- a/arch/arm64/include/asm/cpufeature.h
>> +++ b/arch/arm64/include/asm/cpufeature.h
>> @@ -871,6 +871,16 @@ static inline bool system_supports_bbml2_noabort(void)
>>   	return alternative_has_cap_unlikely(ARM64_HAS_BBML2_NOABORT);
>>   }
>>   
>> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr);
>> +/*
>> + * Called at early boot stage on boot CPU before cpu info and cpu feature
>> + * are ready.
>> + */
>> +static inline bool bbml2_noabort_available(void)
>> +{
>> +	return cpu_has_bbml2_noabort(read_cpuid_id());
> You'll want to incorporate the IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT) and
> arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_NOBBML2) checks from
> Miko's new series to avoid block mappings when BBML2 is disabled. (that second
> check will change a bit based on Maz's feedback against Miko's v3).

Sure

> Hopefully we can factor out into a common helper that is used by Miko's stuff too?

I think checking the kernel config and 
arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_NOBBML2) can be 
consolidated into a helper?
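
Something like the below perhaps (the name is just a placeholder, and modulo
however the override check ends up looking after Maz's feedback):

	static inline bool bbml2_noabort_allowed(void)
	{
		return IS_ENABLED(CONFIG_ARM64_BBML2_NOABORT) &&
		       !arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_NOBBML2);
	}

Then both bbml2_noabort_available() and the cpufeature path could use it.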

>> +}
>> +
>>   int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
>>   bool try_emulate_mrs(struct pt_regs *regs, u32 isn);
>>   
>> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
>> index 662471cfc536..d658a33df266 100644
>> --- a/arch/arm64/include/asm/mmu.h
>> +++ b/arch/arm64/include/asm/mmu.h
>> @@ -71,6 +71,7 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
>>   			       pgprot_t prot, bool page_mappings_only);
>>   extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
>>   extern void mark_linear_text_alias_ro(void);
>> +extern int split_linear_mapping(unsigned long start, unsigned long end);
>>   
>>   /*
>>    * This check is triggered during the early boot before the cpufeature
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0b2a2ad1b9e8..ed2fc1dcf7ae 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -749,7 +749,7 @@ static inline bool in_swapper_pgdir(void *addr)
>>   	        ((unsigned long)swapper_pg_dir & PAGE_MASK);
>>   }
>>   
>> -static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> +static inline void __set_pmd_nosync(pmd_t *pmdp, pmd_t pmd)
>>   {
>>   #ifdef __PAGETABLE_PMD_FOLDED
>>   	if (in_swapper_pgdir(pmdp)) {
>> @@ -759,6 +759,11 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>   #endif /* __PAGETABLE_PMD_FOLDED */
>>   
>>   	WRITE_ONCE(*pmdp, pmd);
>> +}
>> +
>> +static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>> +{
>> +	__set_pmd_nosync(pmdp, pmd);
>>   
>>   	if (pmd_valid(pmd)) {
>>   		dsb(ishst);
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index bf3df8407ca3..d39637d5aeab 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -2176,7 +2176,7 @@ static bool hvhe_possible(const struct arm64_cpu_capabilities *entry,
>>   	return arm64_test_sw_feature_override(ARM64_SW_FEATURE_OVERRIDE_HVHE);
>>   }
>>   
>> -static bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>> +bool cpu_has_bbml2_noabort(unsigned int cpu_midr)
>>   {
>>   	/* We want to allow usage of bbml2 in as wide a range of kernel contexts
>>   	 * as possible. This list is therefore an allow-list of known-good
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index dccf0877285b..ad0f1cc55e3a 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -45,6 +45,7 @@
>>   #define NO_BLOCK_MAPPINGS	BIT(0)
>>   #define NO_CONT_MAPPINGS	BIT(1)
>>   #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>> +#define SPLIT_MAPPINGS		BIT(3)
>>   
>>   u64 kimage_voffset __ro_after_init;
>>   EXPORT_SYMBOL(kimage_voffset);
>> @@ -166,6 +167,73 @@ static void init_clear_pgtable(void *table)
>>   	dsb(ishst);
>>   }
>>   
>> +static int split_pmd(pmd_t *pmdp, pmd_t pmdval,
>> +		     phys_addr_t (*pgtable_alloc)(int))
>> +{
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	phys_addr_t pte_phys;
>> +	pte_t *ptep;
>> +
>> +	if (!pmd_leaf(pmdval))
>> +		return 0;
>> +
>> +	pfn = pmd_pfn(pmdval);
>> +	prot = pmd_pgprot(pmdval);
>> +
>> +	pte_phys = pgtable_alloc(PAGE_SHIFT);
>> +	if (!pte_phys)
>> +		return -ENOMEM;
>> +
>> +	ptep = (pte_t *)phys_to_virt(pte_phys);
>> +	init_clear_pgtable(ptep);
> No need for this, you're about to fill the table with ptes so clearing it is a
> waste of time.

OK

>> +	prot = __pgprot(pgprot_val(prot) | PTE_TYPE_PAGE);
> This happens to work for D64 pgtables because of the way the bits are arranged.
> But it won't work for D128 (when we get there). We are in the process of
> cleaning up the code base to make it D128 ready. So let's fix this now:
>
> 	prot = __pgprot((pgprot_val(prot) & ~PMD_TYPE_MASK) | PTE_TYPE_PAGE);
>
> nit: I'd move this up, next to the "prot = pmd_pgprot(pmdval);" line.

OK

>> +	for (int i = 0; i < PTRS_PER_PTE; i++, ptep++)
>> +		__set_pte_nosync(ptep, pfn_pte(pfn + i, prot));
> nit: you're incrementing ptep but adding i to pfn. Why not just increment pfn too?

Sure, pfn++ works too.

>> +
>> +	dsb(ishst);
>> +
>> +	set_pmd(pmdp, pfn_pmd(__phys_to_pfn(pte_phys),
>> +		__pgprot(PMD_TYPE_TABLE)));
> You're missing some required pgprot flags and it would be better to follow what
> alloc_init_cont_pte() does in general. Something like:
>
> 	pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
> 	if (flags & NO_EXEC_MAPPINGS)
> 		pmdval |= PMD_TABLE_PXN;
> 	__pmd_populate(pmdp, pte_phys, pmdval);

Sure

>> +
>> +	return 0;
>> +}
>> +
>> +static int split_pud(pud_t *pudp, pud_t pudval,
>> +		     phys_addr_t (*pgtable_alloc)(int))
> All the same comments for split_pmd() apply here too.
>
>> +{
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +	pmd_t *pmdp;
>> +	phys_addr_t pmd_phys;
>> +	unsigned int step;
>> +
>> +	if (!pud_leaf(pudval))
>> +		return 0;
>> +
>> +	pfn = pud_pfn(pudval);
>> +	prot = pud_pgprot(pudval);
>> +	step = PMD_SIZE >> PAGE_SHIFT;
>> +
>> +	pmd_phys = pgtable_alloc(PMD_SHIFT);
>> +	if (!pmd_phys)
>> +		return -ENOMEM;
>> +
>> +	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
>> +	init_clear_pgtable(pmdp);
>> +	for (int i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
>> +		__set_pmd_nosync(pmdp, pfn_pmd(pfn, prot));
>> +		pfn += step;
>> +	}
>> +
>> +	dsb(ishst);
>> +
>> +	set_pud(pudp, pfn_pud(__phys_to_pfn(pmd_phys),
>> +		__pgprot(PUD_TYPE_TABLE)));
>> +
>> +	return 0;
>> +}
>> +
>>   static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
>>   		     phys_addr_t phys, pgprot_t prot)
>>   {
>> @@ -251,12 +319,21 @@ static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>   {
>>   	unsigned long next;
>>   	int ret = 0;
>> +	bool split = flags & SPLIT_MAPPINGS;
>>   
>>   	do {
>>   		pmd_t old_pmd = READ_ONCE(*pmdp);
>>   
>>   		next = pmd_addr_end(addr, end);
>>   
>> +		if (split) {
> I think this should be:
>
> 		if (flags & SPLIT_MAPPINGS &&
> 		    pmd_leaf(old_pmd) &&
> 		    next < addr + PMD_SIZE) {
>
> So we only attempt a split if it's a leaf and the leaf is not fully contained by
> the range. Your current code is always splitting even if the block mapping is
> fully contained which seems a waste. And if the pmd is not a leaf (either not
> present or a table) split_pmd will currently do nothing and return 0, so there
> is no opportunity to install mappings or visit the ptes.

Yes, it splits the PMD even though the block mapping is fully contained.
That is because the current user (change_memory_common()) just manipulates
page permissions at PAGE_SIZE granularity IIRC. But I agree with you that
not splitting when the block is fully contained is better and more
flexible. We don't have to change the code if change_memory_common() is
enhanced to handle contiguous pages. However, the related code would be
untested since there is no use case at the moment.

If the PMD is non-leaf it will do nothing because this patch doesn't
handle CONT_PTE; if the PMD is a table it already points to a PTE table,
so we don't need to do anything. The later patch handles CONT_PTE.

>> +			ret = split_pmd(pmdp, old_pmd, pgtable_alloc);
> But... do we need the special split_pmd() and split_pud() functions at all?
> Can't we just allocate a new table here, then let the existing code populate it,
> then replace the block mapping with the table mapping? Same goes for huge puds.
> If you take this approach, I think a lot of the code below will significantly
> simplify.

Actually I thought about this. The existing code populates the page table
for the range size@addr; if the size is, for example, PUD size, the
existing code can populate the page table as you suggested. But as I
mentioned above, change_memory_common() is called at PAGE_SIZE
granularity. If we just allocate a page table and then let the existing
code populate it, we will end up populating just one PMD and PTE entry
for the specified address. For example, when a module is loaded, its text
segment may just use one page, so the kernel just needs to change the
permission for that page.

So we still need to populate the other PMD entries and PTE entries besides
the specified address. That is what needs the most code in split_pud() and
split_pmd().

To make your suggestion work, I think we can set the addr and end used by
the walker to the start and end boundary of the PUD (P4D doesn't support
block mapping yet) respectively. For example:

@@ -441,8 +441,14 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
                 return;

         phys &= PAGE_MASK;
-       addr = virt & PAGE_MASK;
-       end = PAGE_ALIGN(virt + size);
+       if (split) {
+               addr = virt & PAGE_MASK;
+               end = PAGE_ALIGN(virt + size);
+       } else {
+               addr = start_pud_boundary;
+               end = end_pud_boundary;
+               phys = __pa(start_pud_boundary);
+       }

         do {
                 next = pgd_addr_end(addr, end);

But we may need to add a dedicated parameter for the start boundary of
the page table if we want to do the split and the permission change in
one pass as you suggested below, since we need to know which PTE
permissions need to be changed. However, this may make detecting a fully
contained range harder; the range passed in from the caller needs to be
preserved so that we can know which PUD or PMD permissions need to be
changed. CONT mappings will make it more complicated.

So it sounds like we need many more parameters. We may need to put all
the parameters into a struct, for example, something like below off the
top of my head:

struct walk_param {
     unsigned long start;
     unsigned long end;
     unsigned long addr;
     unsigned long orig_start;
     unsigned long orig_end;
     pgprot_t clear_prot;
     pgprot_t set_prot;
     pgprot_t prot;
};

So I'm not sure whether the code can be significantly simplified or not.

>> +			if (ret)
>> +				break;
>> +
>> +			continue;
>> +		}
>> +
>>   		/* try section mapping first */
>>   		if (((addr | next | phys) & ~PMD_MASK) == 0 &&
>>   		    (flags & NO_BLOCK_MAPPINGS) == 0) {
> You'll want to modify this last bit to avoid setting up a block mapping if we
> are trying to split?
>
> 		    (flags & (NO_BLOCK_MAPPINGS | SPLIT_MAPPINGS)) == 0) {

The specified address can't have a block mapping, but the surrounding
addresses can. For example, when splitting a PUD, the PMD containing
the specified address will be a table, but all of the other 511 PMDs can
still be block mappings.

> Or perhaps it's an error to call this without NO_BLOCK_MAPPINGS if
> SPLIT_MAPPINGS is specified? Or perhaps we don't even need SPLIT_MAPPINGS, and
> NO_BLOCK_MAPPINGS means we will split if we find any block mappings? (similarly
> for NO_CONT_MAPPINGS)?

As I said above we still can have block mappings, so using 
NO_BLOCK_MAPPINGS may cause some confusion?

>> @@ -292,11 +369,19 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>   	int ret = 0;
>>   	pud_t pud = READ_ONCE(*pudp);
>>   	pmd_t *pmdp;
>> +	bool split = flags & SPLIT_MAPPINGS;
>>   
>>   	/*
>>   	 * Check for initial section mappings in the pgd/pud.
>>   	 */
>>   	BUG_ON(pud_sect(pud));
>> +
>> +	if (split) {
>> +		BUG_ON(pud_none(pud));
>> +		pmdp = pmd_offset(pudp, addr);
>> +		goto split_pgtable;
>> +	}
>> +
>>   	if (pud_none(pud)) {
>>   		pudval_t pudval = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
>>   		phys_addr_t pmd_phys;
>> @@ -316,6 +401,7 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>   		pmdp = pmd_set_fixmap_offset(pudp, addr);
>>   	}
>>   
>> +split_pgtable:
>>   	do {
>>   		pgprot_t __prot = prot;
>>   
>> @@ -334,7 +420,8 @@ static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>   		phys += next - addr;
>>   	} while (addr = next, addr != end);
>>   
>> -	pmd_clear_fixmap();
>> +	if (!split)
>> +		pmd_clear_fixmap();
>>   
>>   	return ret;
>>   }
>> @@ -348,6 +435,13 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>   	int ret = 0;
>>   	p4d_t p4d = READ_ONCE(*p4dp);
>>   	pud_t *pudp;
>> +	bool split = flags & SPLIT_MAPPINGS;
>> +
>> +	if (split) {
>> +		BUG_ON(p4d_none(p4d));
>> +		pudp = pud_offset(p4dp, addr);
>> +		goto split_pgtable;
>> +	}
>>   
>>   	if (p4d_none(p4d)) {
>>   		p4dval_t p4dval = P4D_TYPE_TABLE | P4D_TABLE_UXN | P4D_TABLE_AF;
>> @@ -368,11 +462,25 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>   		pudp = pud_set_fixmap_offset(p4dp, addr);
>>   	}
>>   
>> +split_pgtable:
>>   	do {
>>   		pud_t old_pud = READ_ONCE(*pudp);
>>   
>>   		next = pud_addr_end(addr, end);
>>   
>> +		if (split) {
>> +			ret = split_pud(pudp, old_pud, pgtable_alloc);
>> +			if (ret)
>> +				break;
>> +
>> +			ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>> +						  pgtable_alloc, flags);
>> +			if (ret)
>> +				break;
>> +
>> +			continue;
>> +		}
>> +
>>   		/*
>>   		 * For 4K granule only, attempt to put down a 1GB block
>>   		 */
>> @@ -399,7 +507,8 @@ static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>   		phys += next - addr;
>>   	} while (pudp++, addr = next, addr != end);
>>   
>> -	pud_clear_fixmap();
>> +	if (!split)
>> +		pud_clear_fixmap();
>>   
>>   	return ret;
>>   }
>> @@ -413,6 +522,13 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>   	int ret = 0;
>>   	pgd_t pgd = READ_ONCE(*pgdp);
>>   	p4d_t *p4dp;
>> +	bool split = flags & SPLIT_MAPPINGS;
>> +
>> +	if (split) {
>> +		BUG_ON(pgd_none(pgd));
>> +		p4dp = p4d_offset(pgdp, addr);
>> +		goto split_pgtable;
>> +	}
>>   
>>   	if (pgd_none(pgd)) {
>>   		pgdval_t pgdval = PGD_TYPE_TABLE | PGD_TABLE_UXN | PGD_TABLE_AF;
>> @@ -433,6 +549,7 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>   		p4dp = p4d_set_fixmap_offset(pgdp, addr);
>>   	}
>>   
>> +split_pgtable:
>>   	do {
>>   		p4d_t old_p4d = READ_ONCE(*p4dp);
>>   
>> @@ -449,7 +566,8 @@ static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>   		phys += next - addr;
>>   	} while (p4dp++, addr = next, addr != end);
>>   
>> -	p4d_clear_fixmap();
>> +	if (!split)
>> +		p4d_clear_fixmap();
>>   
>>   	return ret;
>>   }
>> @@ -546,6 +664,23 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
>>   	return pa;
>>   }
>>   
>> +int split_linear_mapping(unsigned long start, unsigned long end)
>> +{
>> +	int ret = 0;
>> +
>> +	if (!system_supports_bbml2_noabort())
>> +		return 0;
>> +
>> +	mmap_write_lock(&init_mm);
>> +	ret = __create_pgd_mapping_locked(init_mm.pgd, virt_to_phys((void *)start),
>> +					  start, (end - start), __pgprot(0),
>> +					  __pgd_pgtable_alloc, SPLIT_MAPPINGS);
>> +	mmap_write_unlock(&init_mm);
>> +	flush_tlb_kernel_range(start, end);
>> +
>> +	return ret;
>> +}
>> +
>>   /*
>>    * This function can only be used to modify existing table entries,
>>    * without allocating new levels of table. Note that this permits the
>> @@ -665,6 +800,24 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
>>   
>>   #endif /* CONFIG_KFENCE */
>>   
>> +static inline bool force_pte_mapping(void)
>> +{
>> +	/*
>> +	 * Can't use cpufeature API to determine whether BBML2 supported
>> +	 * or not since cpufeature have not been finalized yet.
>> +	 *
>> +	 * Checking the boot CPU only for now.  If the boot CPU has
>> +	 * BBML2, paint linear mapping with block mapping.  If it turns
>> +	 * out the secondary CPUs don't support BBML2 once cpufeature is
>> +	 * fininalized, the linear mapping will be repainted with PTE
>> +	 * finalized, the linear mapping will be repainted with PTE
>> +	 */
>> +	return (rodata_full && !bbml2_noabort_available()) ||
>> +		debug_pagealloc_enabled() ||
>> +		arm64_kfence_can_set_direct_map() ||
>> +		is_realm_world();
>> +}
>> +
>>   static void __init map_mem(pgd_t *pgdp)
>>   {
>>   	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
>> @@ -690,9 +843,12 @@ static void __init map_mem(pgd_t *pgdp)
>>   
>>   	early_kfence_pool = arm64_kfence_alloc_pool();
>>   
>> -	if (can_set_direct_map())
>> +	if (force_pte_mapping())
>>   		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>   
>> +	if (rodata_full)
>> +		flags |= NO_CONT_MAPPINGS;
>> +
>>   	/*
>>   	 * Take care not to create a writable alias for the
>>   	 * read-only text and rodata sections of the kernel image.
>> @@ -1388,9 +1544,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>   
>>   	VM_BUG_ON(!mhp_range_allowed(start, size, true));
>>   
>> -	if (can_set_direct_map())
>> +	if (force_pte_mapping())
>>   		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>   
>> +	if (rodata_full)
>> +		flags |= NO_CONT_MAPPINGS;
>> +
>>   	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>>   			     size, params->pgprot, __pgd_pgtable_alloc,
>>   			     flags);
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 39fd1f7ff02a..5d42d87ea7e1 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -10,6 +10,7 @@
>>   #include <linux/vmalloc.h>
>>   
>>   #include <asm/cacheflush.h>
>> +#include <asm/mmu.h>
>>   #include <asm/pgtable-prot.h>
>>   #include <asm/set_memory.h>
>>   #include <asm/tlbflush.h>
>> @@ -80,8 +81,9 @@ static int change_memory_common(unsigned long addr, int numpages,
>>   	unsigned long start = addr;
>>   	unsigned long size = PAGE_SIZE * numpages;
>>   	unsigned long end = start + size;
>> +	unsigned long l_start;
>>   	struct vm_struct *area;
>> -	int i;
>> +	int i, ret;
>>   
>>   	if (!PAGE_ALIGNED(addr)) {
>>   		start &= PAGE_MASK;
>> @@ -118,7 +120,12 @@ static int change_memory_common(unsigned long addr, int numpages,
>>   	if (rodata_full && (pgprot_val(set_mask) == PTE_RDONLY ||
>>   			    pgprot_val(clear_mask) == PTE_RDONLY)) {
>>   		for (i = 0; i < area->nr_pages; i++) {
>> -			__change_memory_common((u64)page_address(area->pages[i]),
>> +			l_start = (u64)page_address(area->pages[i]);
>> +			ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
> This isn't quite aligned with how I was thinking about it. You still have 2
> passes here; one to split the range to base pages, then another to modify the
> permissions.
>
> I was thinking we could use the table walker in mmu.c to achieve 2 benefits:
>
>    - Do both operations in a single pass (a bit like how calling
> update_mapping_prot() will update the protections on an existing mapping, and
> the table walker will split when it comes across a huge page)
>
>    - Only split when needed; if the whole huge page is contained within the
> range, then there is no need to split in the first place.
>
> We could then split vmalloc regions for free using this infrastructure too.
>
> Although there is a wrinkle that the mmu.c table walker only accepts a pgprot
> and can't currently handle a set_mask/clear_mask. I guess that could be added,
> but it starts to get a bit busy. I think this generic infra would be useful
> though. What do you think?

Yes, we need to add another pgprot parameter (maybe more) to tell the 
walker what is going to be set and what is going to be cleared. I agree 
the generic infra is useful. Would you prefer to implement it in this 
patchset or in a separate follow-up patchset?
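
Roughly something like this (a sketch only; the function name and exact 
signature are made up for illustration, not from any posted patch):

	/*
	 * Walk [start, start + size) of the kernel page table, split any
	 * block/cont entry straddling the range boundaries, and apply
	 * set_mask/clear_mask to every leaf entry inside the range.
	 */
	int update_kernel_range_prot(unsigned long start, unsigned long size,
				     pgprot_t set_mask, pgprot_t clear_mask);

Then change_memory_common() could make one call per linear map alias 
instead of splitting first and fixing up permissions in a second pass:

	l_start = (u64)page_address(area->pages[i]);
	ret = update_kernel_range_prot(l_start, PAGE_SIZE,
				       set_mask, clear_mask);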

Thanks,
Yang

> [...]
>
> Thanks,
> Ryan
>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-03-13 17:40     ` Yang Shi
@ 2025-04-10 22:00       ` Yang Shi
  2025-04-14 13:03         ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-04-10 22:00 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

Hi Ryan,

I know you may have a lot of things to follow up on after LSF/MM. Just a 
gentle ping; hopefully we can resume the review soon.

Thanks,
Yang


On 3/13/25 10:40 AM, Yang Shi wrote:
>
>
> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>> On 13/03/2025 17:28, Yang Shi wrote:
>>> Hi Ryan,
>>>
>>> I saw Miko posted a new spin of his patches. There are some slight 
>>> changes that
>>> have impact to my patches (basically check the new boot parameter). 
>>> Do you
>>> prefer I rebase my patches on top of his new spin right now then 
>>> restart review
>>> from the new spin or review the current patches then solve the new 
>>> review
>>> comments and rebase to Miko's new spin together?
>> Hi Yang,
>>
>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>
>> I'm happy to review against v3 as it is. I'm familiar with Miko's 
>> series and am
>> not too bothered about the integration with that; I think it's pretty 
>> straight
>> forward. I'm more interested in how you are handling the splitting, 
>> which I
>> think is the bulk of the effort.
>
> Yeah, sure, thank you.
>
>>
>> I'm hoping to get to this next week before heading out to LSF/MM the 
>> following
>> week (might I see you there?)
>
> Unfortunately I can't make it this year. Have a fun!
>
> Thanks,
> Yang
>
>>
>> Thanks,
>> Ryan
>>
>>
>>> Thanks,
>>> Yang
>>>
>>>
>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>> Changelog
>>>> =========
>>>> v3:
>>>>     * Rebased to v6.14-rc4.
>>>>     * Based on Miko's BBML2 cpufeature patch 
>>>> (https://lore.kernel.org/linux-
>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>       Also included in this series in order to have the complete 
>>>> patchset.
>>>>     * Enhanced __create_pgd_mapping() to handle split as well per 
>>>> Ryan.
>>>>     * Supported CONT mappings per Ryan.
>>>>     * Supported asymmetric system by splitting kernel linear 
>>>> mapping if such
>>>>       system is detected per Ryan. I don't have such system to 
>>>> test, so the
>>>>       testing is done by hacking kernel to call linear mapping 
>>>> repainting
>>>>       unconditionally. The linear mapping doesn't have any block 
>>>> and cont
>>>>       mappings after booting.
>>>>
>>>> RFC v2:
>>>>     * Used allowlist to advertise BBM lv2 on the CPUs which can 
>>>> handle TLB
>>>>       conflict gracefully per Will Deacon
>>>>     * Rebased onto v6.13-rc5
>>>>     * 
>>>> https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-
>>>> yang@os.amperecomputing.com/
>>>>
>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>> yang@os.amperecomputing.com/
>>>>
>>>> Description
>>>> ===========
>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>> break-before-make rule.
>>>>
>>>> A number of performance issues arise when the kernel linear map is 
>>>> using
>>>> PTE entries due to arm's break-before-make rule:
>>>>     - performance degradation
>>>>     - more TLB pressure
>>>>     - memory waste for kernel page table
>>>>
>>>> These issues can be avoided by specifying rodata=on the kernel command
>>>> line but this disables the alias checks on page table permissions and
>>>> therefore compromises security somewhat.
>>>>
>>>> With FEAT_BBM level 2 support it is no longer necessary to 
>>>> invalidate the
>>>> page table entry when changing page sizes.  This allows the kernel to
>>>> split large mappings after boot is complete.
>>>>
>>>> This patch adds support for splitting large mappings when FEAT_BBM 
>>>> level 2
>>>> is available and rodata=full is used. This functionality will be used
>>>> when modifying page permissions for individual page frames.
>>>>
>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>>> only.
>>>>
>>>> If the system is asymmetric, the kernel linear mapping may be 
>>>> repainted once
>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for 
>>>> more details.
>>>>
>>>> We saw significant performance increases in some benchmarks with
>>>> rodata=full without compromising the security features of the kernel.
>>>>
>>>> Testing
>>>> =======
>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB 
>>>> memory and
>>>> 4K page size + 48 bit VA.
>>>>
>>>> Function test (4K/16K/64K page size)
>>>>     - Kernel boot.  Kernel needs change kernel linear mapping 
>>>> permission at
>>>>       boot stage, if the patch didn't work, kernel typically didn't 
>>>> boot.
>>>>     - Module stress from stress-ng. Kernel module load change 
>>>> permission for
>>>>       linear mapping.
>>>>     - A test kernel module which allocates 80% of total memory via 
>>>> vmalloc(),
>>>>       then change the vmalloc area permission to RO, this also 
>>>> change linear
>>>>       mapping permission to RO, then change it back before vfree(). 
>>>> Then launch
>>>>       a VM which consumes almost all physical memory.
>>>>     - VM with the patchset applied in guest kernel too.
>>>>     - Kernel build in VM with guest kernel which has this series 
>>>> applied.
>>>>     - rodata=on. Make sure other rodata mode is not broken.
>>>>     - Boot on the machine which doesn't support BBML2.
>>>>
>>>> Performance
>>>> ===========
>>>> Memory consumption
>>>> Before:
>>>> MemTotal:       258988984 kB
>>>> MemFree:        254821700 kB
>>>>
>>>> After:
>>>> MemTotal:       259505132 kB
>>>> MemFree:        255410264 kB
>>>>
>>>> Around 500MB more memory are free to use.  The larger the machine, the
>>>> more memory saved.
>>>>
>>>> Performance benchmarking
>>>> * Memcached
>>>> We saw performance degradation when running Memcached benchmark with
>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB 
>>>> pressure.
>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>> latency is reduced by around 9.6%.
>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>> MPKI is reduced by 28.5%.
>>>>
>>>> The benchmark data is now on par with rodata=on too.
>>>>
>>>> * Disk encryption (dm-crypt) benchmark
>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) 
>>>> with disk
>>>> encryption (by dm-crypt).
>>>> fio --directory=/data --random_generator=lfsr --norandommap 
>>>> --randrepeat 1 \
>>>>       --status-interval=999 --rw=write --bs=4k --loops=1 
>>>> --ioengine=sync \
>>>>       --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting 
>>>> --thread \
>>>>       --name=iops-test-job --eta-newline=1 --size 100G
>>>>
>>>> The IOPS is increased by 90% - 150% (the variance is high, but the 
>>>> worst
>>>> number of good case is around 90% more than the best number of bad 
>>>> case).
>>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>>
>>>> * Sequential file read
>>>> Read 100G file sequentially on XFS (xfs_io read with page cache 
>>>> populated).
>>>> The bandwidth is increased by 150%.
>>>>
>>>>
>>>> Mikołaj Lenczewski (1):
>>>>         arm64: Add BBM Level 2 cpu feature
>>>>
>>>> Yang Shi (5):
>>>>         arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>         arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>>         arm64: mm: support large block mapping when rodata=full
>>>>         arm64: mm: support split CONT mappings
>>>>         arm64: mm: split linear mapping if BBML2 is not supported 
>>>> on secondary
>>>> CPUs
>>>>
>>>>    arch/arm64/Kconfig                  |  11 +++++
>>>>    arch/arm64/include/asm/cpucaps.h    |   2 +
>>>>    arch/arm64/include/asm/cpufeature.h |  15 ++++++
>>>>    arch/arm64/include/asm/mmu.h        |   4 ++
>>>>    arch/arm64/include/asm/pgtable.h    |  12 ++++-
>>>>    arch/arm64/kernel/cpufeature.c      |  95 
>>>> +++++++++++++++++++++++++++++++++++++
>>>>    arch/arm64/mm/mmu.c                 | 397 
>>>> ++++++++++++++++++++++++++++++++++
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
>>>>
>>>> ++++++++++++++++++++++-------------------
>>>>    arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>>>>    arch/arm64/tools/cpucaps            |   1 +
>>>>    9 files changed, 518 insertions(+), 56 deletions(-)
>>>>
>>>>
>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-04-10 22:00       ` Yang Shi
@ 2025-04-14 13:03         ` Ryan Roberts
  2025-04-14 21:24           ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-04-14 13:03 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

On 10/04/2025 23:00, Yang Shi wrote:
> Hi Ryan,
> 
> I know you may have a lot of things to follow up after LSF/MM. Just gently ping,
> hopefully we can resume the review soon.

Hi, I'm out on holiday at the moment, returning on the 22nd April. But I'm very
keen to move this series forward so will come back to you next week. (although
TBH, I thought I was waiting for you to respond to me... :-| )

FWIW, having thought about it a bit more, I think some of the suggestions I
previously made may not have been quite right, but I'll elaborate next week. I'm
keen to build a pgtable splitting primitive here that we can reuse with vmalloc
as well to enable huge mappings by default with vmalloc too.

Thanks,
Ryan

> 
> Thanks,
> Yang
> 
> 
> On 3/13/25 10:40 AM, Yang Shi wrote:
>>
>>
>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>> Hi Ryan,
>>>>
>>>> I saw Miko posted a new spin of his patches. There are some slight changes that
>>>> have impact to my patches (basically check the new boot parameter). Do you
>>>> prefer I rebase my patches on top of his new spin right now then restart review
>>>> from the new spin or review the current patches then solve the new review
>>>> comments and rebase to Miko's new spin together?
>>> Hi Yang,
>>>
>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>
>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series and am
>>> not too bothered about the integration with that; I think it's pretty straight
>>> forward. I'm more interested in how you are handling the splitting, which I
>>> think is the bulk of the effort.
>>
>> Yeah, sure, thank you.
>>
>>>
>>> I'm hoping to get to this next week before heading out to LSF/MM the following
>>> week (might I see you there?)
>>
>> Unfortunately I can't make it this year. Have a fun!
>>
>> Thanks,
>> Yang
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>
>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>> [...]
>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-04-14 13:03         ` Ryan Roberts
@ 2025-04-14 21:24           ` Yang Shi
  2025-05-02 11:51             ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-04-14 21:24 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel



On 4/14/25 6:03 AM, Ryan Roberts wrote:
> On 10/04/2025 23:00, Yang Shi wrote:
>> Hi Ryan,
>>
>> I know you may have a lot of things to follow up after LSF/MM. Just gently ping,
>> hopefully we can resume the review soon.
> Hi, I'm out on holiday at the moment, returning on the 22nd April. But I'm very
> keen to move this series forward so will come back to you next week. (although
> TBH, I thought I was waiting for you to respond to me... :-| )
>
> FWIW, having thought about it a bit more, I think some of the suggestions I
> previously made may not have been quite right, but I'll elaborate next week. I'm
> keen to build a pgtable splitting primitive here that we can reuse with vmalloc
> as well to enable huge mappings by default with vmalloc too.

Sounds good. I think the patches can support splitting vmalloc page 
tables too. Anyway, we can discuss more after you are back. Enjoy your 
holiday.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>
>>
>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>
>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> I saw Miko posted a new spin of his patches. There are some slight changes that
>>>>> have impact to my patches (basically check the new boot parameter). Do you
>>>>> prefer I rebase my patches on top of his new spin right now then restart review
>>>>> from the new spin or review the current patches then solve the new review
>>>>> comments and rebase to Miko's new spin together?
>>>> Hi Yang,
>>>>
>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>
>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series and am
>>>> not too bothered about the integration with that; I think it's pretty straight
>>>> forward. I'm more interested in how you are handling the splitting, which I
>>>> think is the bulk of the effort.
>>> Yeah, sure, thank you.
>>>
>>>> I'm hoping to get to this next week before heading out to LSF/MM the following
>>>> week (might I see you there?)
>>> Unfortunately I can't make it this year. Have a fun!
>>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>
>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>> [...]



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-04-14 21:24           ` Yang Shi
@ 2025-05-02 11:51             ` Ryan Roberts
  2025-05-05 21:39               ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-02 11:51 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 14/04/2025 22:24, Yang Shi wrote:
> 
> 
> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>> On 10/04/2025 23:00, Yang Shi wrote:
>>> Hi Ryan,
>>>
>>> I know you may have a lot of things to follow up after LSF/MM. Just gently ping,
>>> hopefully we can resume the review soon.
>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But I'm very
>> keen to move this series forward so will come back to you next week. (although
>> TBH, I thought I was waiting for you to respond to me... :-| )
>>
>> FWIW, having thought about it a bit more, I think some of the suggestions I
>> previously made may not have been quite right, but I'll elaborate next week. I'm
>> keen to build a pgtable splitting primitive here that we can reuse with vmalloc
>> as well to enable huge mappings by default with vmalloc too.
> 
> Sounds good. I think the patches can support splitting vmalloc page table too.
> Anyway we can discuss more after you are back. Enjoy your holiday.

Hi Yang,

Sorry I've taken so long to get back to you. Here's what I'm currently thinking:
I'd eventually like to get to the point where the linear map and most vmalloc
memory is mapped using the largest possible mapping granularity (i.e. block
mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).

vmalloc has history with trying to do huge mappings by default; it ended up
having to be turned into an opt-in feature (instead of the original opt-out
approach) because there were problems with some parts of the kernel expecting
page mappings. I think we might be able to overcome those issues on arm64 with
BBML2.

arm64 can already support vmalloc PUD and PMD block mappings, and I have a
series (that should make v6.16) that enables contiguous PTE mappings in vmalloc
too. But these are currently limited to when VM_ALLOW_HUGE is specified. To be
able to use that by default, we need to be able to change permissions on
sub-regions of an allocation, which is where BBML2 and your series come in.
(there may be other things we need to solve as well; TBD).

I think the key thing we need is a function that takes a page-aligned kernel
VA, walks to the leaf entry for that VA and, if the VA is in the middle of
the leaf entry, splits it so that the VA is now on a boundary. This will
work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries. The
function can assume BBML2 is present, and it will return 0 on success, -EINVAL
if the VA is not mapped, or -ENOMEM if it couldn't allocate a pgtable to perform
the split.
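
For example (the name and exact prototype here are purely illustrative):

	/*
	 * Ensure there is a leaf mapping boundary at the page-aligned kernel
	 * VA 'addr': if addr falls in the middle of a PUD/PMD block entry or
	 * a contiguous PMD/PTE entry, split that entry. Assumes BBML2.
	 *
	 * Returns 0 on success, -EINVAL if addr is not mapped, or -ENOMEM if
	 * a pagetable allocation needed for the split fails.
	 */
	int split_kernel_leaf_mapping(unsigned long addr);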

Then we can use that primitive on the start and end address of any range for
which we need exact mapping boundaries (e.g. when changing permissions on part
of linear map or vmalloc allocation, when freeing part of a vmalloc allocation,
etc). This way we only split enough to ensure the boundaries are precise, and
keep larger mappings inside the range.
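
A caller that needs exact boundaries would then do something like this
(sketch only, reusing the illustrative primitive above together with the
existing __change_memory_common() signature):

	ret = split_kernel_leaf_mapping(start);
	if (!ret)
		ret = split_kernel_leaf_mapping(end);
	if (!ret)
		ret = __change_memory_common(start, end - start,
					     set_mask, clear_mask);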

Next we need to reimplement __change_memory_common() to not use
apply_to_page_range(), because that assumes page mappings only. Dev Jain has
been working on a series that converts this to use walk_page_range_novma() so
that we can change permissions on the block/contig entries too. That's not
posted publicly yet, but it's not huge so I'll ask if he is comfortable with
posting an RFC early next week.
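
To be clear, the below isn't Dev's actual series; it's just a sketch of the
rough shape I have in mind, reusing the existing struct page_change_data
from pageattr.c (a pud_entry/pmd_entry callback would do the same bit
manipulation on block entries):

	static int change_pte_entry(pte_t *ptep, unsigned long addr,
				    unsigned long next, struct mm_walk *walk)
	{
		struct page_change_data *cdata = walk->private;
		pte_t pte = __ptep_get(ptep);

		pte = clear_pte_bit(pte, cdata->clear_mask);
		pte = set_pte_bit(pte, cdata->set_mask);
		__set_pte(ptep, pte);
		return 0;
	}

	static const struct mm_walk_ops change_prot_ops = {
		.pte_entry	= change_pte_entry,
	};

	/* in __change_memory_common(), with init_mm's mmap lock held */
	struct page_change_data data = {
		.set_mask	= set_mask,
		.clear_mask	= clear_mask,
	};

	ret = walk_page_range_novma(&init_mm, start, start + size,
				    &change_prot_ops, NULL, &data);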

You'll still need to repaint the whole linear map with page mappings for the
!BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially with
minor modifications?) can do that repainting on the live mappings; similar to
how you are doing it in v3.

Miko's BBML2 series should hopefully get imminently queued for v6.16.

So in summary, what I'm asking for your large block mapping the linear map
series is:
  - Paint linear map using blocks/contig if boot CPU supports BBML2
  - Repaint linear map using page mappings if secondary CPUs don't support BBML2
  - Integrate Dev's __change_memory_common() series
  - Create primitive to ensure mapping entry boundary at a given page-aligned VA
  - Use primitive when changing permissions on linear map region

This will be mergable on its own, but will also provide a great starting base
for adding huge-vmalloc-by-default.

What do you think?

Thanks,
Ryan


> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Yang
>>>
>>>
>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>
>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> I saw Miko posted a new spin of his patches. There are some slight changes
>>>>>> that
>>>>>> have impact to my patches (basically check the new boot parameter). Do you
>>>>>> prefer I rebase my patches on top of his new spin right now then restart
>>>>>> review
>>>>>> from the new spin or review the current patches then solve the new review
>>>>>> comments and rebase to Miko's new spin together?
>>>>> Hi Yang,
>>>>>
>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>
>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series
>>>>> and am
>>>>> not too bothered about the integration with that; I think it's pretty straight
>>>>> forward. I'm more interested in how you are handling the splitting, which I
>>>>> think is the bulk of the effort.
>>>> Yeah, sure, thank you.
>>>>
>>>>> I'm hoping to get to this next week before heading out to LSF/MM the following
>>>>> week (might I see you there?)
>>>> Unfortunately I can't make it this year. Have a fun!
>>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>
>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>> [...]
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-02 11:51             ` Ryan Roberts
@ 2025-05-05 21:39               ` Yang Shi
  2025-05-07  7:58                 ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-05 21:39 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/2/25 4:51 AM, Ryan Roberts wrote:
> On 14/04/2025 22:24, Yang Shi wrote:
>>
>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>> Hi Ryan,
>>>>
>>>> I know you may have a lot of things to follow up after LSF/MM. Just gently ping,
>>>> hopefully we can resume the review soon.
>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But I'm very
>>> keen to move this series forward so will come back to you next week. (although
>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>
>>> FWIW, having thought about it a bit more, I think some of the suggestions I
>>> previously made may not have been quite right, but I'll elaborate next week. I'm
>>> keen to build a pgtable splitting primitive here that we can reuse with vmalloc
>>> as well to enable huge mappings by default with vmalloc too.
>> Sounds good. I think the patches can support splitting vmalloc page table too.
>> Anyway we can discuss more after you are back. Enjoy your holiday.
> Hi Yang,
>
> Sorry I've taken so long to get back to you. Here's what I'm currently thinking:
> I'd eventually like to get to the point where the linear map and most vmalloc
> memory is mapped using the largest possible mapping granularity (i.e. block
> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>
> vmalloc has history with trying to do huge mappings by default; it ended up
> having to be turned into an opt-in feature (instead of the original opt-out
> approach) because there were problems with some parts of the kernel expecting
> page mappings. I think we might be able to overcome those issues on arm64 with
> BBML2.
>
> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
> series (that should make v6.16) that enables contiguous PTE mappings in vmalloc
> too. But these are currently limited to when VM_ALLOW_HUGE is specified. To be
> able to use that by default, we need to be able to change permissions on
> sub-regions of an allocation, which is where BBML2 and your series come in.
> (there may be other things we need to solve as well; TBD).
>
> I think the key thing we need is a function that can take a page-aligned kernel
> VA, will walk to the leaf entry for that VA and if the VA is in the middle of
> the leaf entry, it will split it so that the VA is now on a boundary. This will
> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries. The
> function can assume BBML2 is present. And it will return 0 on success, -EINVAL
> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to perform
> the split.

OK, the v3 patches already handle page table allocation failure by 
returning -ENOMEM, and BUG_ON if the VA is not mapped, because the kernel 
assumes the linear mapping is always present. It is easy to return 
-EINVAL instead of BUG_ON. However, I'm wondering what use cases you are 
thinking about. Could splitting a vmalloc area run into an unmapped VA?

>
> Then we can use that primitive on the start and end address of any range for
> which we need exact mapping boundaries (e.g. when changing permissions on part
> of linear map or vmalloc allocation, when freeing part of a vmalloc allocation,
> etc). This way we only split enough to ensure the boundaries are precise, and
> keep larger mappings inside the range.

Yeah, makes sense to me.

>
> Next we need to reimplement __change_memory_common() to not use
> apply_to_page_range(), because that assumes page mappings only. Dev Jain has
> been working on a series that converts this to use walk_page_range_novma() so
> that we can change permissions on the block/contig entries too. That's not
> posted publicly yet, but it's not huge so I'll ask if he is comfortable with
> posting an RFC early next week.

OK, so the new __change_memory_common() will change the permissions in 
the page table, right? If I remember correctly, you suggested changing 
permissions in __create_pgd_mapping_locked() for v3. So I can disregard 
that suggestion?

The current code assumes the address range passed in by 
change_memory_common() is *NOT* physically contiguous, so 
__change_memory_common() handles page table permissions on a per-page 
basis. I assume Dev's patches will handle this, so my patch can safely 
assume the linear mapping address range being split is physically 
contiguous too; otherwise I can't keep large mappings inside the range. 
Splitting a vmalloc area doesn't need to worry about this.

>
> You'll still need to repaint the whole linear map with page mappings for the
> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially with
> minor modifications?) can do that repainting on the live mappings; similar to
> how you are doing it in v3.

Yes, when repainting I need to split the page table all the way down to 
PTE level. Off the top of my head, a simple flag should be enough to tell 
__create_pgd_mapping_locked() to do the right thing.
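
For example (SPLIT_TO_PTES below is a made-up flag name just to 
illustrate the idea; it is not in the v3 patches):

	/* hypothetical flag: force the split all the way down to PTEs */
	#define SPLIT_TO_PTES	BIT(4)

	ret = __create_pgd_mapping_locked(init_mm.pgd,
					  virt_to_phys((void *)start),
					  start, end - start, __pgprot(0),
					  __pgd_pgtable_alloc,
					  SPLIT_MAPPINGS | SPLIT_TO_PTES);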

>
> Miko's BBML2 series should hopefully get imminently queued for v6.16.

Great! Anyway, my series is based on his patch advertising BBML2.

>
> So in summary, what I'm asking for your large block mapping the linear map
> series is:
>    - Paint linear map using blocks/contig if boot CPU supports BBML2
>    - Repaint linear map using page mappings if secondary CPUs don't support BBML2

OK, I just need to add a simple tweak to v3 to split down to PTE level.

>    - Integrate Dev's __change_memory_common() series

OK, I think I have to do my patches on top of it, because Dev's patch 
needs to guarantee the linear mapping address range is physically 
contiguous.

>    - Create primitive to ensure mapping entry boundary at a given page-aligned VA
>    - Use primitive when changing permissions on linear map region

Sure.

>
> This will be mergable on its own, but will also provide a great starting base
> for adding huge-vmalloc-by-default.
>
> What do you think?

Definitely makes sense to me.

If I remember correctly, we still have some unresolved comments/questions 
for v3 in my replies on March 17, particularly:
https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-b344-9401fa4c0feb@os.amperecomputing.com/

Thanks,
Yang

>
> Thanks,
> Ryan
>
>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>
>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> I saw Miko posted a new spin of his patches. There are some slight changes
>>>>>>> that
>>>>>>> have impact to my patches (basically check the new boot parameter). Do you
>>>>>>> prefer I rebase my patches on top of his new spin right now then restart
>>>>>>> review
>>>>>>> from the new spin or review the current patches then solve the new review
>>>>>>> comments and rebase to Miko's new spin together?
>>>>>> Hi Yang,
>>>>>>
>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>
>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series
>>>>>> and am
>>>>>> not too bothered about the integration with that; I think it's pretty straight
>>>>>> forward. I'm more interested in how you are handling the splitting, which I
>>>>>> think is the bulk of the effort.
>>>>> Yeah, sure, thank you.
>>>>>
>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the following
>>>>>> week (might I see you there?)
>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>
>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>> [...]



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-05 21:39               ` Yang Shi
@ 2025-05-07  7:58                 ` Ryan Roberts
  2025-05-07 21:16                   ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-07  7:58 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 05/05/2025 22:39, Yang Shi wrote:
> 
> 
> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>> On 14/04/2025 22:24, Yang Shi wrote:
>>>
>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> I know you may have a lot of things to follow up after LSF/MM. Just gently
>>>>> ping,
>>>>> hopefully we can resume the review soon.
>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But I'm very
>>>> keen to move this series forward so will come back to you next week. (although
>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>
>>>> FWIW, having thought about it a bit more, I think some of the suggestions I
>>>> previously made may not have been quite right, but I'll elaborate next week.
>>>> I'm
>>>> keen to build a pgtable splitting primitive here that we can reuse with vmalloc
>>>> as well to enable huge mappings by default with vmalloc too.
>>> Sounds good. I think the patches can support splitting vmalloc page table too.
>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>> Hi Yang,
>>
>> Sorry I've taken so long to get back to you. Here's what I'm currently thinking:
>> I'd eventually like to get to the point where the linear map and most vmalloc
>> memory is mapped using the largest possible mapping granularity (i.e. block
>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>
>> vmalloc has history with trying to do huge mappings by default; it ended up
>> having to be turned into an opt-in feature (instead of the original opt-out
>> approach) because there were problems with some parts of the kernel expecting
>> page mappings. I think we might be able to overcome those issues on arm64 with
>> BBML2.
>>
>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>> series (that should make v6.16) that enables contiguous PTE mappings in vmalloc
>> too. But these are currently limited to when VM_ALLOW_HUGE is specified. To be
>> able to use that by default, we need to be able to change permissions on
>> sub-regions of an allocation, which is where BBML2 and your series come in.
>> (there may be other things we need to solve as well; TBD).
>>
>> I think the key thing we need is a function that can take a page-aligned kernel
>> VA, will walk to the leaf entry for that VA and if the VA is in the middle of
>> the leaf entry, it will split it so that the VA is now on a boundary. This will
>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries. The
>> function can assume BBML2 is present. And it will return 0 on success, -EINVAL
>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to perform
>> the split.
> 
> OK, the v3 patches already handled page table allocation failure with returning
> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear mapping
> should be always present. It is easy to return -EINVAL instead of BUG_ON.
> However I'm wondering what usecases you are thinking about? Splitting vmalloc
> area may run into unmapped VA?

I don't think BUG_ON is the right behaviour; crashing the kernel should be
discouraged. I think even for vmalloc under correct conditions we shouldn't see
any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
vunmap_pmd_range() which skips the pmd if it is none.
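
To make the shape of that primitive concrete, here is a rough sketch of the
walk-and-split flow described above; split_kernel_leaf_mapping(), split_pud()
and split_pmd() are made-up names for illustration, not code from the series:

static int split_kernel_leaf_mapping(unsigned long addr)
{
    pgd_t *pgdp = pgd_offset_k(addr);
    p4d_t *p4dp;
    pud_t *pudp, pud;
    pmd_t *pmdp, pmd;

    if (pgd_none(READ_ONCE(*pgdp)))
        return -EINVAL;

    p4dp = p4d_offset(pgdp, addr);
    if (p4d_none(READ_ONCE(*p4dp)))
        return -EINVAL;

    pudp = pud_offset(p4dp, addr);
    pud = READ_ONCE(*pudp);
    if (pud_none(pud))
        return -EINVAL;
    if (pud_leaf(pud)) {
        /* addr already on a PUD boundary: nothing to split */
        if (IS_ALIGNED(addr, PUD_SIZE))
            return 0;
        /* replace the PUD block with a table of PMDs (live, relying on BBML2) */
        if (split_pud(pudp, addr))          /* hypothetical helper */
            return -ENOMEM;
    }

    pmdp = pmd_offset(pudp, addr);
    pmd = READ_ONCE(*pmdp);
    if (pmd_none(pmd))
        return -EINVAL;
    if (pmd_leaf(pmd)) {
        /* addr already on a PMD boundary: nothing to split */
        if (IS_ALIGNED(addr, PMD_SIZE))
            return 0;
        /* replace the PMD block with a table of PTEs (live, relying on BBML2) */
        if (split_pmd(pmdp, addr))          /* hypothetical helper */
            return -ENOMEM;
    }

    /* contiguous PMD/PTE runs would be handled analogously, by clearing
     * PTE_CONT on the run that covers addr */
    return 0;
}

The -EINVAL returns correspond exactly to the "unmapped VA" case above; it is
then up to the caller to decide whether that is fatal.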

> 
>>
>> Then we can use that primitive on the start and end address of any range for
>> which we need exact mapping boundaries (e.g. when changing permissions on part
>> of linear map or vmalloc allocation, when freeing part of a vmalloc allocation,
>> etc). This way we only split enough to ensure the boundaries are precise, and
>> keep larger mappings inside the range.
> 
> Yeah, makes sense to me.
> 
>>
>> Next we need to reimplement __change_memory_common() to not use
>> apply_to_page_range(), because that assumes page mappings only. Dev Jain has
>> been working on a series that converts this to use walk_page_range_novma() so
>> that we can change permissions on the block/contig entries too. That's not
>> posted publicly yet, but it's not huge so I'll ask if he is comfortable with
>> posting an RFC early next week.
> 
> OK, so the new __change_memory_common() will change the permission of page
> table, right? 

It will change permissions of all the leaf entries in the range of VAs it is
passed. Currently it assumes that all the leaf entries are PTEs. But we will
generalize to support all the other types of leaf entries too.

> If I remember correctly, you suggested change permissions in
> __create_pgd_mapping_locked() for v3. So I can disregard it?

Yes I did. I think this made sense (in my head at least) because in the context
of the linear map, all the PFNs are contiguous so it kind-of makes sense to
reuse that infrastructure. But it doesn't generalize to vmalloc because vmalloc
PFNs are not contiguous. So for that reason, I think it's preferable to have an
independent capability.

> 
> The current code assumes the address range passed in by change_memory_common()
> is *NOT* physically contiguous so __change_memory_common() handles page table
> permission on page basis. I'm supposed Dev's patches will handle this then my
> patch can safely assume the linear mapping address range for splitting is
> physically contiguous too otherwise I can't keep large mappings inside the
> range. Splitting vmalloc area doesn't need to worry about this.

I'm not sure I fully understand the point you're making here...

Dev's series aims to use walk_page_range_novma() similar to riscv's
implementation so that it can walk a VA range and update the permissions on each
leaf entry it visits, regardless of which level the leaf entry is at. This
doesn't make any assumption of the physical contiguity of neighbouring leaf
entries in the page table.

So if we are changing permissions on the linear map, we have a range of VAs to
walk and convert all the leaf entries, regardless of their size. The same goes
for vmalloc... But for vmalloc, we will also want to change the underlying
permissions in the linear map, so we will have to figure out the contiguous
pieces of the linear map and call __change_memory_common() for each; there is
definitely some detail to work out there!
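
As a rough sketch of the shape (loosely modelled on the riscv pageattr
approach; this is not Dev's actual series), the walker carries the set/clear
masks in the private pointer and updates whichever leaf it finds:

#include <linux/pagewalk.h>
#include <linux/pgtable.h>

struct pageattr_masks {
    pgprot_t set_mask;
    pgprot_t clear_mask;
};

static int pageattr_pmd_entry(pmd_t *pmdp, unsigned long addr,
                              unsigned long next, struct mm_walk *walk)
{
    struct pageattr_masks *masks = walk->private;
    pmd_t pmd = READ_ONCE(*pmdp);

    if (pmd_leaf(pmd)) {
        unsigned long val = pmd_val(pmd);

        val &= ~pgprot_val(masks->clear_mask);
        val |= pgprot_val(masks->set_mask);
        set_pmd(pmdp, __pmd(val));
    }
    return 0;
}

/* pud_entry and pte_entry callbacks would do the same at their levels */

static const struct mm_walk_ops pageattr_ops = {
    .pmd_entry = pageattr_pmd_entry,
};

/* called as, e.g.:
 *    mmap_write_lock(&init_mm);
 *    walk_page_range_novma(&init_mm, start, start + size,
 *                          &pageattr_ops, NULL, &masks);
 *    mmap_write_unlock(&init_mm);
 * followed by a TLB flush of the range, as today.
 */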

> 
>>
>> You'll still need to repaint the whole linear map with page mappings for the
>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially with
>> minor modifications?) can do that repainting on the live mappings; similar to
>> how you are doing it in v3.
> 
> Yes, when repainting I need to split the page table all the way down to PTE
> level. A simple flag should be good enough to tell __create_pgd_mapping_locked()
> do the right thing off the top of my head.

Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and NO_CONT_MAPPINGS
flags? For example, if you find a leaf mapping and NO_BLOCK_MAPPINGS is set,
then you need to split it?
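
Something like this inside the pmd-level walker, for example (split_pmd() is
again a hypothetical helper, not code from the series):

static int maybe_split_pmd(pmd_t *pmdp, unsigned long addr, int flags)
{
    pmd_t pmd = READ_ONCE(*pmdp);

    /* Already a table (or empty): nothing to repaint at this level. */
    if (!pmd_leaf(pmd))
        return 0;

    /* Block mappings still permitted: keep the existing block entry. */
    if (!(flags & NO_BLOCK_MAPPINGS))
        return 0;

    /*
     * Repainting: the existing PMD block must become a table of PTEs.
     * With BBML2 this can be done on the live mapping without
     * break-before-make.
     */
    return split_pmd(pmdp, addr);
}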

> 
>>
>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
> 
> Great! Anyway my series is based on his advertising BBML2 patch.
> 
>>
>> So in summary, what I'm asking for your large block mapping the linear map
>> series is:
>>    - Paint linear map using blocks/contig if boot CPU supports BBML2
>>    - Repaint linear map using page mappings if secondary CPUs don't support BBML2
> 
> OK, I just need to add some simple tweak to split down to PTE level to v3.
> 
>>    - Integrate Dev's __change_memory_common() series
> 
> OK, I think I have to do my patches on top of it. Because Dev's patch need
> guarantee the linear mapping address range is physically contiguous.
> 
>>    - Create primitive to ensure mapping entry boundary at a given page-aligned VA
>>    - Use primitive when changing permissions on linear map region
> 
> Sure.
> 
>>
>> This will be mergable on its own, but will also provide a great starting base
>> for adding huge-vmalloc-by-default.
>>
>> What do you think?
> 
> Definitely makes sense to me.
> 
> If I remember correctly, we still have some unsolved comments/questions for v3
> in my replies on March 17, particularly:
> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
> b344-9401fa4c0feb@os.amperecomputing.com/

Ahh sorry about that. I'll take a look now...

Thanks,
Ryan

> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>
>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight changes
>>>>>>>> that
>>>>>>>> have impact to my patches (basically check the new boot parameter). Do you
>>>>>>>> prefer I rebase my patches on top of his new spin right now then restart
>>>>>>>> review
>>>>>>>> from the new spin or review the current patches then solve the new review
>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>> Hi Yang,
>>>>>>>
>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>
>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series
>>>>>>> and am
>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>> straight
>>>>>>> forward. I'm more interested in how you are handling the splitting, which I
>>>>>>> think is the bulk of the effort.
>>>>>> Yeah, sure, thank you.
>>>>>>
>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>> following
>>>>>>> week (might I see you there?)
>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>> Changelog
>>>>>>>>> =========
>>>>>>>>> v3:
>>>>>>>>>       * Rebased to v6.14-rc4.
>>>>>>>>>       * Based on Miko's BBML2 cpufeature patch (https://lore.kernel.org/
>>>>>>>>> linux-
>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>         Also included in this series in order to have the complete
>>>>>>>>> patchset.
>>>>>>>>>       * Enhanced __create_pgd_mapping() to handle split as well per Ryan.
>>>>>>>>>       * Supported CONT mappings per Ryan.
>>>>>>>>>       * Supported asymmetric system by splitting kernel linear mapping if
>>>>>>>>> such
>>>>>>>>>         system is detected per Ryan. I don't have such system to test,
>>>>>>>>> so the
>>>>>>>>>         testing is done by hacking kernel to call linear mapping
>>>>>>>>> repainting
>>>>>>>>>         unconditionally. The linear mapping doesn't have any block and
>>>>>>>>> cont
>>>>>>>>>         mappings after booting.
>>>>>>>>>
>>>>>>>>> RFC v2:
>>>>>>>>>       * Used allowlist to advertise BBM lv2 on the CPUs which can
>>>>>>>>> handle TLB
>>>>>>>>>         conflict gracefully per Will Deacon
>>>>>>>>>       * Rebased onto v6.13-rc5
>>>>>>>>>       * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-
>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>
>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>
>>>>>>>>> Description
>>>>>>>>> ===========
>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>> break-before-make rule.
>>>>>>>>>
>>>>>>>>> A number of performance issues arise when the kernel linear map is using
>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>       - performance degradation
>>>>>>>>>       - more TLB pressure
>>>>>>>>>       - memory waste for kernel page table
>>>>>>>>>
>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel command
>>>>>>>>> line but this disables the alias checks on page table permissions and
>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>
>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to invalidate the
>>>>>>>>> page table entry when changing page sizes.  This allows the kernel to
>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>
>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM level 2
>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>
>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>>>>>>>> only.
>>>>>>>>>
>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be repainted
>>>>>>>>> once
>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more
>>>>>>>>> details.
>>>>>>>>>
>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>> rodata=full without compromising the security features of the kernel.
>>>>>>>>>
>>>>>>>>> Testing
>>>>>>>>> =======
>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>> memory and
>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>
>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>       - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>> permission at
>>>>>>>>>         boot stage, if the patch didn't work, kernel typically didn't
>>>>>>>>> boot.
>>>>>>>>>       - Module stress from stress-ng. Kernel module load change permission
>>>>>>>>> for
>>>>>>>>>         linear mapping.
>>>>>>>>>       - A test kernel module which allocates 80% of total memory via
>>>>>>>>> vmalloc(),
>>>>>>>>>         then change the vmalloc area permission to RO, this also change
>>>>>>>>> linear
>>>>>>>>>         mapping permission to RO, then change it back before vfree(). Then
>>>>>>>>> launch
>>>>>>>>>         a VM which consumes almost all physical memory.
>>>>>>>>>       - VM with the patchset applied in guest kernel too.
>>>>>>>>>       - Kernel build in VM with guest kernel which has this series
>>>>>>>>> applied.
>>>>>>>>>       - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>       - Boot on the machine which doesn't support BBML2.
>>>>>>>>>
>>>>>>>>> Performance
>>>>>>>>> ===========
>>>>>>>>> Memory consumption
>>>>>>>>> Before:
>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>
>>>>>>>>> After:
>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>
>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine, the
>>>>>>>>> more memory saved.
>>>>>>>>>
>>>>>>>>> Performance benchmarking
>>>>>>>>> * Memcached
>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>
>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>
>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
>>>>>>>>> disk
>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>> randrepeat 1 \
>>>>>>>>>         --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>> ioengine=sync \
>>>>>>>>>         --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --
>>>>>>>>> thread \
>>>>>>>>>         --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>
>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>>>>>>>>> number of good case is around 90% more than the best number of bad case).
>>>>>>>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>>>>>>>
>>>>>>>>> * Sequential file read
>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>> populated).
>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>           arm64: Add BBM Level 2 cpu feature
>>>>>>>>>
>>>>>>>>> Yang Shi (5):
>>>>>>>>>           arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>           arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>>>>>>>           arm64: mm: support large block mapping when rodata=full
>>>>>>>>>           arm64: mm: support split CONT mappings
>>>>>>>>>           arm64: mm: split linear mapping if BBML2 is not supported on
>>>>>>>>> secondary
>>>>>>>>> CPUs
>>>>>>>>>
>>>>>>>>>      arch/arm64/Kconfig                  |  11 +++++
>>>>>>>>>      arch/arm64/include/asm/cpucaps.h    |   2 +
>>>>>>>>>      arch/arm64/include/asm/cpufeature.h |  15 ++++++
>>>>>>>>>      arch/arm64/include/asm/mmu.h        |   4 ++
>>>>>>>>>      arch/arm64/include/asm/pgtable.h    |  12 ++++-
>>>>>>>>>      arch/arm64/kernel/cpufeature.c      |  95 ++++++++++++++++++++++++
>>>>>>>>> ++++++
>>>>>>>>> +++++++
>>>>>>>>>      arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++
>>>>>>>>> ++++++
>>>>>>>>> ++++
>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>> +++++
>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>      arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>>>>>>>>>      arch/arm64/tools/cpucaps            |   1 +
>>>>>>>>>      9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>
>>>>>>>>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void
  2025-03-17 17:53     ` Yang Shi
@ 2025-05-07  8:18       ` Ryan Roberts
  2025-05-07 22:19         ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-07  8:18 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel

On 17/03/2025 17:53, Yang Shi wrote:
> 
> 
> On 3/14/25 4:51 AM, Ryan Roberts wrote:
>> On 04/03/2025 22:19, Yang Shi wrote:
>>> The later patch will enhance __create_pgd_mapping() and related helpers
>>> to split kernel linear mapping, it requires have return value.  So make
>>> __create_pgd_mapping() and helpers non-void functions.
>>>
>>> And move the BUG_ON() out of page table alloc helper since failing
>>> splitting kernel linear mapping is not fatal and can be handled by the
>>> callers in the later patch.  Have BUG_ON() after
>>> __create_pgd_mapping_locked() returns to keep the current callers behavior
>>> intact.
>>>
>>> Suggested-by: Ryan Roberts<ryan.roberts@arm.com>
>>> Signed-off-by: Yang Shi<yang@os.amperecomputing.com>
>>> ---
>>>   arch/arm64/mm/mmu.c | 127 ++++++++++++++++++++++++++++++--------------
>>>   1 file changed, 86 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index b4df5bc5b1b8..dccf0877285b 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -189,11 +189,11 @@ static void init_pte(pte_t *ptep, unsigned long addr,
>>> unsigned long end,
>>>       } while (ptep++, addr += PAGE_SIZE, addr != end);
>>>   }
>>>   -static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>> -                unsigned long end, phys_addr_t phys,
>>> -                pgprot_t prot,
>>> -                phys_addr_t (*pgtable_alloc)(int),
>>> -                int flags)
>>> +static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>> +                   unsigned long end, phys_addr_t phys,
>>> +                   pgprot_t prot,
>>> +                   phys_addr_t (*pgtable_alloc)(int),
>>> +                   int flags)
>>>   {
>>>       unsigned long next;
>>>       pmd_t pmd = READ_ONCE(*pmdp);
>>> @@ -208,6 +208,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned
>>> long addr,
>>>               pmdval |= PMD_TABLE_PXN;
>>>           BUG_ON(!pgtable_alloc);
>>>           pte_phys = pgtable_alloc(PAGE_SHIFT);
>>> +        if (!pte_phys)
>>> +            return -ENOMEM;
>> nit: personally I'd prefer to see a "goto out" and funnel all to a single return
>> statement. You do that in some functions (via loop break), but would be cleaner
>> if consistent.
>>
>> If pgtable_alloc() is modified to return int (see my comment at the bottom),
>> this becomes:
>>
>> ret = pgtable_alloc(PAGE_SHIFT, &pte_phys);
>> if (ret)
>>     goto out;
> 
> OK
> 
>>>           ptep = pte_set_fixmap(pte_phys);
>>>           init_clear_pgtable(ptep);
>>>           ptep += pte_index(addr);
>>> @@ -239,13 +241,16 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned
>>> long addr,
>>>        * walker.
>>>        */
>>>       pte_clear_fixmap();
>>> +
>>> +    return 0;
>>>   }
>>>   -static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>> -             phys_addr_t phys, pgprot_t prot,
>>> -             phys_addr_t (*pgtable_alloc)(int), int flags)
>>> +static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>> +            phys_addr_t phys, pgprot_t prot,
>>> +            phys_addr_t (*pgtable_alloc)(int), int flags)
>>>   {
>>>       unsigned long next;
>>> +    int ret = 0;
>>>         do {
>>>           pmd_t old_pmd = READ_ONCE(*pmdp);
>>> @@ -264,22 +269,27 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr,
>>> unsigned long end,
>>>               BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
>>>                                 READ_ONCE(pmd_val(*pmdp))));
>>>           } else {
>>> -            alloc_init_cont_pte(pmdp, addr, next, phys, prot,
>>> +            ret = alloc_init_cont_pte(pmdp, addr, next, phys, prot,
>>>                           pgtable_alloc, flags);
>>> +            if (ret)
>>> +                break;
>>>                 BUG_ON(pmd_val(old_pmd) != 0 &&
>>>                      pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
>>>           }
>>>           phys += next - addr;
>>>       } while (pmdp++, addr = next, addr != end);
>>> +
>>> +    return ret;
>>>   }
>>>   -static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>> -                unsigned long end, phys_addr_t phys,
>>> -                pgprot_t prot,
>>> -                phys_addr_t (*pgtable_alloc)(int), int flags)
>>> +static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>> +                   unsigned long end, phys_addr_t phys,
>>> +                   pgprot_t prot,
>>> +                   phys_addr_t (*pgtable_alloc)(int), int flags)
>>>   {
>>>       unsigned long next;
>>> +    int ret = 0;
>>>       pud_t pud = READ_ONCE(*pudp);
>>>       pmd_t *pmdp;
>>>   @@ -295,6 +305,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned
>>> long addr,
>>>               pudval |= PUD_TABLE_PXN;
>>>           BUG_ON(!pgtable_alloc);
>>>           pmd_phys = pgtable_alloc(PMD_SHIFT);
>>> +        if (!pmd_phys)
>>> +            return -ENOMEM;
>>>           pmdp = pmd_set_fixmap(pmd_phys);
>>>           init_clear_pgtable(pmdp);
>>>           pmdp += pmd_index(addr);
>>> @@ -314,21 +326,26 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned
>>> long addr,
>>>               (flags & NO_CONT_MAPPINGS) == 0)
>>>               __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>>   -        init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>>> +        ret = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>>> +        if (ret)
>>> +            break;
>>>             pmdp += pmd_index(next) - pmd_index(addr);
>>>           phys += next - addr;
>>>       } while (addr = next, addr != end);
>>>         pmd_clear_fixmap();
>>> +
>>> +    return ret;
>>>   }
>>>   -static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long
>>> end,
>>> -               phys_addr_t phys, pgprot_t prot,
>>> -               phys_addr_t (*pgtable_alloc)(int),
>>> -               int flags)
>>> +static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>> +              phys_addr_t phys, pgprot_t prot,
>>> +              phys_addr_t (*pgtable_alloc)(int),
>>> +              int flags)
>>>   {
>>>       unsigned long next;
>>> +    int ret = 0;
>>>       p4d_t p4d = READ_ONCE(*p4dp);
>>>       pud_t *pudp;
>>>   @@ -340,6 +357,8 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
>>> addr, unsigned long end,
>>>               p4dval |= P4D_TABLE_PXN;
>>>           BUG_ON(!pgtable_alloc);
>>>           pud_phys = pgtable_alloc(PUD_SHIFT);
>>> +        if (!pud_phys)
>>> +            return -ENOMEM;
>>>           pudp = pud_set_fixmap(pud_phys);
>>>           init_clear_pgtable(pudp);
>>>           pudp += pud_index(addr);
>>> @@ -369,8 +388,10 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
>>> addr, unsigned long end,
>>>               BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
>>>                                 READ_ONCE(pud_val(*pudp))));
>>>           } else {
>>> -            alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>>> +            ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>>>                           pgtable_alloc, flags);
>>> +            if (ret)
>>> +                break;
>>>                 BUG_ON(pud_val(old_pud) != 0 &&
>>>                      pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
>>> @@ -379,14 +400,17 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
>>> addr, unsigned long end,
>>>       } while (pudp++, addr = next, addr != end);
>>>         pud_clear_fixmap();
>>> +
>>> +    return ret;
>>>   }
>>>   -static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long
>>> end,
>>> -               phys_addr_t phys, pgprot_t prot,
>>> -               phys_addr_t (*pgtable_alloc)(int),
>>> -               int flags)
>>> +static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>> +              phys_addr_t phys, pgprot_t prot,
>>> +              phys_addr_t (*pgtable_alloc)(int),
>>> +              int flags)
>>>   {
>>>       unsigned long next;
>>> +    int ret = 0;
>>>       pgd_t pgd = READ_ONCE(*pgdp);
>>>       p4d_t *p4dp;
>>>   @@ -398,6 +422,8 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long
>>> addr, unsigned long end,
>>>               pgdval |= PGD_TABLE_PXN;
>>>           BUG_ON(!pgtable_alloc);
>>>           p4d_phys = pgtable_alloc(P4D_SHIFT);
>>> +        if (!p4d_phys)
>>> +            return -ENOMEM;
>>>           p4dp = p4d_set_fixmap(p4d_phys);
>>>           init_clear_pgtable(p4dp);
>>>           p4dp += p4d_index(addr);
>>> @@ -412,8 +438,10 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long
>>> addr, unsigned long end,
>>>             next = p4d_addr_end(addr, end);
>>>   -        alloc_init_pud(p4dp, addr, next, phys, prot,
>>> +        ret = alloc_init_pud(p4dp, addr, next, phys, prot,
>>>                      pgtable_alloc, flags);
>>> +        if (ret)
>>> +            break;
>>>             BUG_ON(p4d_val(old_p4d) != 0 &&
>>>                  p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
>>> @@ -422,23 +450,26 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long
>>> addr, unsigned long end,
>>>       } while (p4dp++, addr = next, addr != end);
>>>         p4d_clear_fixmap();
>>> +
>>> +    return ret;
>>>   }
>>>   -static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>>> -                    unsigned long virt, phys_addr_t size,
>>> -                    pgprot_t prot,
>>> -                    phys_addr_t (*pgtable_alloc)(int),
>>> -                    int flags)
>>> +static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>>> +                       unsigned long virt, phys_addr_t size,
>>> +                       pgprot_t prot,
>>> +                       phys_addr_t (*pgtable_alloc)(int),
>>> +                       int flags)
>>>   {
>>>       unsigned long addr, end, next;
>>>       pgd_t *pgdp = pgd_offset_pgd(pgdir, virt);
>>> +    int ret = 0;
>>>         /*
>>>        * If the virtual and physical address don't have the same offset
>>>        * within a page, we cannot map the region as the caller expects.
>>>        */
>>>       if (WARN_ON((phys ^ virt) & ~PAGE_MASK))
>>> -        return;
>>> +        return -EINVAL;
>>>         phys &= PAGE_MASK;
>>>       addr = virt & PAGE_MASK;
>>> @@ -446,29 +477,38 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir,
>>> phys_addr_t phys,
>>>         do {
>>>           next = pgd_addr_end(addr, end);
>>> -        alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
>>> +        ret = alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
>>>                      flags);
>>> +        if (ret)
>>> +            break;
>>>           phys += next - addr;
>>>       } while (pgdp++, addr = next, addr != end);
>>> +
>>> +    return ret;
>>>   }
>>>   -static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>>> -                 unsigned long virt, phys_addr_t size,
>>> -                 pgprot_t prot,
>>> -                 phys_addr_t (*pgtable_alloc)(int),
>>> -                 int flags)
>>> +static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>>> +                unsigned long virt, phys_addr_t size,
>>> +                pgprot_t prot,
>>> +                phys_addr_t (*pgtable_alloc)(int),
>>> +                int flags)
>>>   {
>>> +    int ret;
>>> +
>>>       mutex_lock(&fixmap_lock);
>>> -    __create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
>>> +    ret = __create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
>>>                       pgtable_alloc, flags);
>>> +    BUG_ON(ret);
>> This function now returns an error, but also BUGs on ret!=0. For this patch, I'd
>> suggest keeping this function as void.
> 
> You mean __create_pgd_mapping(), right?

Yes.

> 
>> But I believe there is a pre-existing bug in arch_add_memory(). That's called at
>> runtime so if __create_pgd_mapping() fails and BUGs, it will take down a running
>> system.
> 
> Yes, it is the current behavior.
> 
>> With this foundational patch, we can fix that with an additional patch to pass
>> along the error code instead of BUGing in that case. arch_add_memory() would
>> need to unwind whatever __create_pgd_mapping() managed to do before the memory
>> allocation failure (presumably unmapping and freeing any allocated tables). I'm
>> happy to do this as a follow up patch.
> 
> Yes, the allocated page tables need to be freed. Thank you for taking it.

Given the conversation in the other thread about generalizing to also eventually
support vmalloc, I'm not sure you need to be able to return errors from this
walker for your usage now? I think you will only use this walker for the case
where you need to repaint to page mappings after determining that a secondary
CPU does not support BBML2? If that fails, the system is dead anyway, so
continuing to BUG() is probably acceptable?

So perhaps you could drop this patch from your series? If so, then I'll reuse
the patch to fix the theoretical hotplug bug (when I get to it) and will keep
your authorship.

> 
>>>       mutex_unlock(&fixmap_lock);
>>> +
>>> +    return ret;
>>>   }
>>>     #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>>>   extern __alias(__create_pgd_mapping_locked)
>>> -void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long
>>> virt,
>>> -                 phys_addr_t size, pgprot_t prot,
>>> -                 phys_addr_t (*pgtable_alloc)(int), int flags);
>>> +int create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
>>> +                phys_addr_t size, pgprot_t prot,
>>> +                phys_addr_t (*pgtable_alloc)(int), int flags);
>> create_kpti_ng_temp_pgd() now returns error instead of BUGing on allocation
>> failure, but I don't see a change to handle that error. You'll want to update
>> __kpti_install_ng_mappings() to BUG on error.
> 
> Yes, I missed that. It should BUG on error.
> 
>>>   #endif
>>>     static phys_addr_t __pgd_pgtable_alloc(int shift)
>>> @@ -476,13 +516,17 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
>>>       /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>>>       void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
>>>   -    BUG_ON(!ptr);
>>> +    if (!ptr)
>>> +        return 0;
>> 0 is a valid (though unlikely) physical address. I guess you could technically
>> encode like ERR_PTR(), but since you are returning phys_addr_t and not a
>> pointer, then perhaps it will be clearer to make this return int and accept a
>> pointer to a phys_addr_t, which it will populate on success?
> 
> Actually I did something similar in the first place, but just returned the virt
> address. Then did something if it returns NULL. That made the code a little more
> messy since we need convert the virt address to phys address because
> __create_pgd_mapping() and the helpers require phys address, and changed the
> functions definition.
> 
> But I noticed 0 should be not a valid phys address if I remember correctly. 

0 is definitely a valid physical address. We even have examples of real Arm
boards that have RAM at physical address 0. See [1].

[1] https://lore.kernel.org/lkml/ad8ed3ba-12e8-3031-7c66-035b6d9ad6cd@arm.com/

> I
> also noticed early_pgtable_alloc() calls memblock_phys_alloc_range(), it returns
> 0 on failure. If 0 is valid phys address, then it should not do that, right? 

Well perhaps memblock will just refuse to give you RAM at address 0. That's a
bad design choice in my opinion. But the buddy will definitely give out page 0
if it is RAM. -1 would be a better choice for an error sentinel.

> And
> I also noticed the memblock range 0 - memstart_addr is actually removed from
> memblock (see arm64_memblock_init()), so IIUC 0 should be not valid phys
> address. So the patch ended up being as is.

But memstart_addr could be 0, so in that case you don't actually remove anything?

> 
> If this assumption doesn't stand, I think your suggestion makes sense.

Perhaps the simpler approach is to return -1 on error. That's never going to be
valid because the maximum number of address bits on the physical bus is 56.
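
For example, against the hunk quoted above, the -1 variant would look roughly
like this (the out-parameter variant from the earlier comment would work just
as well):

static phys_addr_t __pgd_pgtable_alloc(int shift)
{
    /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
    void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);

    if (!ptr)
        return -1;    /* ~0 can never be a valid PA: at most 56 address bits */

    return __pa(ptr);
}

    /* ...and at the call sites: */
    pte_phys = pgtable_alloc(PAGE_SHIFT);
    if (pte_phys == (phys_addr_t)-1)
        return -ENOMEM;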

> 
>>> +
>>>       return __pa(ptr);
>>>   }
>>>     static phys_addr_t pgd_pgtable_alloc(int shift)
>>>   {
>>>       phys_addr_t pa = __pgd_pgtable_alloc(shift);
>>> +    if (!pa)
>>> +        goto out;
>> This would obviously need to be fixed up as per above.
>>
>>>       struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
>>>         /*
>>> @@ -498,6 +542,7 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
>>>       else if (shift == PMD_SHIFT)
>>>           BUG_ON(!pagetable_pmd_ctor(ptdesc));
>>>   +out:
>>>       return pa;
>>>   }
>>>   
>> You have left early_pgtable_alloc() to panic() on allocation failure. Given we
>> can now unwind the stack with error code, I think it would be more consistent to
>> also allow early_pgtable_alloc() to return error.
> 
> The early_pgtable_alloc() is just used for painting linear mapping at early boot
> stage, if it fails I don't think unwinding the stack is feasible and worth it.
> Did I miss something?

Personally I'd just prefer it all to be consistent. But I agree there is no big
benefit. Anyway, like I said above, I'm not sure you need to worry about
unwinding the stack at all given the approach we agreed in the other thread?

Thanks,
Ryan

> 
> Thanks,
> Yang
> 
>> Thanks,
>> Ryan
>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-07  7:58                 ` Ryan Roberts
@ 2025-05-07 21:16                   ` Yang Shi
  2025-05-28  0:00                     ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-07 21:16 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/7/25 12:58 AM, Ryan Roberts wrote:
> On 05/05/2025 22:39, Yang Shi wrote:
>>
>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just gently
>>>>>> ping,
>>>>>> hopefully we can resume the review soon.
>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But I'm very
>>>>> keen to move this series forward so will come back to you next week. (although
>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>
>>>>> FWIW, having thought about it a bit more, I think some of the suggestions I
>>>>> previously made may not have been quite right, but I'll elaborate next week.
>>>>> I'm
>>>>> keen to build a pgtable splitting primitive here that we can reuse with vmalloc
>>>>> as well to enable huge mappings by default with vmalloc too.
>>>> Sounds good. I think the patches can support splitting vmalloc page table too.
>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>> Hi Yang,
>>>
>>> Sorry I've taken so long to get back to you. Here's what I'm currently thinking:
>>> I'd eventually like to get to the point where the linear map and most vmalloc
>>> memory is mapped using the largest possible mapping granularity (i.e. block
>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>
>>> vmalloc has history with trying to do huge mappings by default; it ended up
>>> having to be turned into an opt-in feature (instead of the original opt-out
>>> approach) because there were problems with some parts of the kernel expecting
>>> page mappings. I think we might be able to overcome those issues on arm64 with
>>> BBML2.
>>>
>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>> series (that should make v6.16) that enables contiguous PTE mappings in vmalloc
>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified. To be
>>> able to use that by default, we need to be able to change permissions on
>>> sub-regions of an allocation, which is where BBML2 and your series come in.
>>> (there may be other things we need to solve as well; TBD).
>>>
>>> I think the key thing we need is a function that can take a page-aligned kernel
>>> VA, will walk to the leaf entry for that VA and if the VA is in the middle of
>>> the leaf entry, it will split it so that the VA is now on a boundary. This will
>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries. The
>>> function can assume BBML2 is present. And it will return 0 on success, -EINVAL
>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to perform
>>> the split.
>> OK, the v3 patches already handled page table allocation failure with returning
>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear mapping
>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>> However I'm wondering what usecases you are thinking about? Splitting vmalloc
>> area may run into unmapped VA?
> I don't think BUG_ON is the right behaviour; crashing the kernel should be
> discouraged. I think even for vmalloc under correct conditions we shouldn't see
> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
> vunmap_pmd_range() which skips the pmd if its none.
>
>>> Then we can use that primitive on the start and end address of any range for
>>> which we need exact mapping boundaries (e.g. when changing permissions on part
>>> of linear map or vmalloc allocation, when freeing part of a vmalloc allocation,
>>> etc). This way we only split enough to ensure the boundaries are precise, and
>>> keep larger mappings inside the range.
>> Yeah, makes sense to me.
>>
>>> Next we need to reimplement __change_memory_common() to not use
>>> apply_to_page_range(), because that assumes page mappings only. Dev Jain has
>>> been working on a series that converts this to use walk_page_range_novma() so
>>> that we can change permissions on the block/contig entries too. That's not
>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable with
>>> posting an RFC early next week.
>> OK, so the new __change_memory_common() will change the permission of page
>> table, right?
> It will change permissions of all the leaf entries in the range of VAs it is
> passed. Currently it assumes that all the leaf entries are PTEs. But we will
> generalize to support all the other types of leaf entries too.,
>
>> If I remember correctly, you suggested change permissions in
>> __create_pgd_mapping_locked() for v3. So I can disregard it?
> Yes I did. I think this made sense (in my head at least) because in the context
> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
> reuse that infrastructure. But it doesn't generalize to vmalloc because vmalloc
> PFNs are not contiguous. So for that reason, I think it's preferable to have an
> independent capability.

OK, sounds good to me.

>
>> The current code assumes the address range passed in by change_memory_common()
>> is *NOT* physically contiguous so __change_memory_common() handles page table
>> permission on page basis. I'm supposed Dev's patches will handle this then my
>> patch can safely assume the linear mapping address range for splitting is
>> physically contiguous too otherwise I can't keep large mappings inside the
>> range. Splitting vmalloc area doesn't need to worry about this.
> I'm not sure I fully understand the point you're making here...
>
> Dev's series aims to use walk_page_range_novma() similar to riscv's
> implementation so that it can walk a VA range and update the permissions on each
> leaf entry it visits, regadless of which level the leaf entry is at. This
> doesn't make any assumption of the physical contiguity of neighbouring leaf
> entries in the page table.
>
> So if we are changing permissions on the linear map, we have a range of VAs to
> walk and convert all the leaf entries, regardless of their size. The same goes
> for vmalloc... But for vmalloc, we will also want to change the underlying
> permissions in the linear map, so we will have to figure out the contiguous
> pieces of the linear map and call __change_memory_common() for each; there is
> definitely some detail to work out there!

Yes, this is my point. When changing the underlying linear map permissions
for vmalloc, the linear map addresses may not be contiguous. This is why
change_memory_common() calls __change_memory_common() on a per-page basis.

But thinking about it further, how Dev's patches work should have no impact
on how I implement the split primitive. It should be the caller's
responsibility to make sure __create_pgd_mapping_locked() is called on a
contiguous linear map address range.
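
For the vmalloc-backed case, something like the sketch below could batch the
runs that do happen to be physically contiguous, so the linear map alias is
updated once per run instead of once per page (illustrative only; it assumes
physically adjacent pages have adjacent struct pages):

static int change_linear_alias(struct vm_struct *area,
                               pgprot_t set_mask, pgprot_t clear_mask)
{
    unsigned int i = 0;
    int ret;

    while (i < area->nr_pages) {
        struct page *start = area->pages[i];
        unsigned int n = 1;

        /* extend the run while the backing pages stay contiguous */
        while (i + n < area->nr_pages &&
               area->pages[i + n] == start + n)
            n++;

        ret = __change_memory_common((unsigned long)page_address(start),
                                     n * PAGE_SIZE, set_mask, clear_mask);
        if (ret)
            return ret;

        i += n;
    }

    return 0;
}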

>
>>> You'll still need to repaint the whole linear map with page mappings for the
>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially with
>>> minor modifications?) can do that repainting on the live mappings; similar to
>>> how you are doing it in v3.
>> Yes, when repainting I need to split the page table all the way down to PTE
>> level. A simple flag should be good enough to tell __create_pgd_mapping_locked()
>> do the right thing off the top of my head.
> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and NO_CONT_MAPPINGS
> flags? For example, if you are find a leaf mapping and NO_BLOCK_MAPPINGS is set,
> then you need to split it?

Yeah, sounds feasible. Anyway I will figure it out.

>
>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>> Great! Anyway my series is based on his advertising BBML2 patch.
>>
>>> So in summary, what I'm asking for your large block mapping the linear map
>>> series is:
>>>     - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>     - Repaint linear map using page mappings if secondary CPUs don't support BBML2
>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>
>>>     - Integrate Dev's __change_memory_common() series
>> OK, I think I have to do my patches on top of it. Because Dev's patch need
>> guarantee the linear mapping address range is physically contiguous.
>>
>>>     - Create primitive to ensure mapping entry boundary at a given page-aligned VA
>>>     - Use primitive when changing permissions on linear map region
>> Sure.
>>
>>> This will be mergable on its own, but will also provide a great starting base
>>> for adding huge-vmalloc-by-default.
>>>
>>> What do you think?
>> Definitely makes sense to me.
>>
>> If I remember correctly, we still have some unsolved comments/questions for v3
>> in my replies on March 17, particularly:
>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>> b344-9401fa4c0feb@os.amperecomputing.com/
> Ahh sorry about that. I'll take a look now...

No problem.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>
>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight changes
>>>>>>>>> that
>>>>>>>>> have impact to my patches (basically check the new boot parameter). Do you
>>>>>>>>> prefer I rebase my patches on top of his new spin right now then restart
>>>>>>>>> review
>>>>>>>>> from the new spin or review the current patches then solve the new review
>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>> Hi Yang,
>>>>>>>>
>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>
>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series
>>>>>>>> and am
>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>> straight
>>>>>>>> forward. I'm more interested in how you are handling the splitting, which I
>>>>>>>> think is the bulk of the effort.
>>>>>>> Yeah, sure, thank you.
>>>>>>>
>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>> following
>>>>>>>> week (might I see you there?)
>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>> Changelog
>>>>>>>>>> =========
>>>>>>>>>> v3:
>>>>>>>>>>        * Rebased to v6.14-rc4.
>>>>>>>>>>        * Based on Miko's BBML2 cpufeature patch (https://lore.kernel.org/
>>>>>>>>>> linux-
>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>          Also included in this series in order to have the complete
>>>>>>>>>> patchset.
>>>>>>>>>>        * Enhanced __create_pgd_mapping() to handle split as well per Ryan.
>>>>>>>>>>        * Supported CONT mappings per Ryan.
>>>>>>>>>>        * Supported asymmetric system by splitting kernel linear mapping if
>>>>>>>>>> such
>>>>>>>>>>          system is detected per Ryan. I don't have such system to test,
>>>>>>>>>> so the
>>>>>>>>>>          testing is done by hacking kernel to call linear mapping
>>>>>>>>>> repainting
>>>>>>>>>>          unconditionally. The linear mapping doesn't have any block and
>>>>>>>>>> cont
>>>>>>>>>>          mappings after booting.
>>>>>>>>>>
>>>>>>>>>> RFC v2:
>>>>>>>>>>        * Used allowlist to advertise BBM lv2 on the CPUs which can
>>>>>>>>>> handle TLB
>>>>>>>>>>          conflict gracefully per Will Deacon
>>>>>>>>>>        * Rebased onto v6.13-rc5
>>>>>>>>>>        * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-
>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>
>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>
>>>>>>>>>> Description
>>>>>>>>>> ===========
>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>>> break-before-make rule.
>>>>>>>>>>
>>>>>>>>>> A number of performance issues arise when the kernel linear map is using
>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>        - performance degradation
>>>>>>>>>>        - more TLB pressure
>>>>>>>>>>        - memory waste for kernel page table
>>>>>>>>>>
>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel command
>>>>>>>>>> line but this disables the alias checks on page table permissions and
>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>
>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to invalidate the
>>>>>>>>>> page table entry when changing page sizes.  This allows the kernel to
>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>
>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM level 2
>>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>
>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>>>>>>>>> only.
>>>>>>>>>>
>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be repainted
>>>>>>>>>> once
>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more
>>>>>>>>>> details.
>>>>>>>>>>
>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>> rodata=full without compromising the security features of the kernel.
>>>>>>>>>>
>>>>>>>>>> Testing
>>>>>>>>>> =======
>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>> memory and
>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>
>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>        - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>> permission at
>>>>>>>>>>          boot stage, if the patch didn't work, kernel typically didn't
>>>>>>>>>> boot.
>>>>>>>>>>        - Module stress from stress-ng. Kernel module load change permission
>>>>>>>>>> for
>>>>>>>>>>          linear mapping.
>>>>>>>>>>        - A test kernel module which allocates 80% of total memory via
>>>>>>>>>> vmalloc(),
>>>>>>>>>>          then change the vmalloc area permission to RO, this also change
>>>>>>>>>> linear
>>>>>>>>>>          mapping permission to RO, then change it back before vfree(). Then
>>>>>>>>>> launch
>>>>>>>>>>          a VM which consumes almost all physical memory.
>>>>>>>>>>        - VM with the patchset applied in guest kernel too.
>>>>>>>>>>        - Kernel build in VM with guest kernel which has this series
>>>>>>>>>> applied.
>>>>>>>>>>        - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>        - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>
>>>>>>>>>> Performance
>>>>>>>>>> ===========
>>>>>>>>>> Memory consumption
>>>>>>>>>> Before:
>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>
>>>>>>>>>> After:
>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>
>>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine, the
>>>>>>>>>> more memory saved.
>>>>>>>>>>
>>>>>>>>>> Performance benchmarking
>>>>>>>>>> * Memcached
>>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>
>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>
>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
>>>>>>>>>> disk
>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>          --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>> ioengine=sync \
>>>>>>>>>>          --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --
>>>>>>>>>> thread \
>>>>>>>>>>          --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>
>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>>>>>>>>>> number of good case is around 90% more than the best number of bad case).
>>>>>>>>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>>>>>>>>
>>>>>>>>>> * Sequential file read
>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>> populated).
>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>            arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>
>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>            arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>            arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>>>>>>>>            arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>            arm64: mm: support split CONT mappings
>>>>>>>>>>            arm64: mm: split linear mapping if BBML2 is not supported on
>>>>>>>>>> secondary
>>>>>>>>>> CPUs
>>>>>>>>>>
>>>>>>>>>>       arch/arm64/Kconfig                  |  11 +++++
>>>>>>>>>>       arch/arm64/include/asm/cpucaps.h    |   2 +
>>>>>>>>>>       arch/arm64/include/asm/cpufeature.h |  15 ++++++
>>>>>>>>>>       arch/arm64/include/asm/mmu.h        |   4 ++
>>>>>>>>>>       arch/arm64/include/asm/pgtable.h    |  12 ++++-
>>>>>>>>>>       arch/arm64/kernel/cpufeature.c      |  95 ++++++++++++++++++++++++
>>>>>>>>>> ++++++
>>>>>>>>>> +++++++
>>>>>>>>>>       arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++
>>>>>>>>>> ++++++
>>>>>>>>>> ++++
>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>> +++++
>>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>>       arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>>>>>>>>>>       arch/arm64/tools/cpucaps            |   1 +
>>>>>>>>>>       9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>
>>>>>>>>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void
  2025-05-07  8:18       ` Ryan Roberts
@ 2025-05-07 22:19         ` Yang Shi
  0 siblings, 0 replies; 49+ messages in thread
From: Yang Shi @ 2025-05-07 22:19 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel



On 5/7/25 1:18 AM, Ryan Roberts wrote:
> On 17/03/2025 17:53, Yang Shi wrote:
>>
>> On 3/14/25 4:51 AM, Ryan Roberts wrote:
>>> On 04/03/2025 22:19, Yang Shi wrote:
>>>> The later patch will enhance __create_pgd_mapping() and related helpers
>>>> to split kernel linear mapping, it requires have return value.  So make
>>>> __create_pgd_mapping() and helpers non-void functions.
>>>>
>>>> And move the BUG_ON() out of page table alloc helper since failing
>>>> splitting kernel linear mapping is not fatal and can be handled by the
>>>> callers in the later patch.  Have BUG_ON() after
>>>> __create_pgd_mapping_locked() returns to keep the current callers behavior
>>>> intact.
>>>>
>>>> Suggested-by: Ryan Roberts<ryan.roberts@arm.com>
>>>> Signed-off-by: Yang Shi<yang@os.amperecomputing.com>
>>>> ---
>>>>    arch/arm64/mm/mmu.c | 127 ++++++++++++++++++++++++++++++--------------
>>>>    1 file changed, 86 insertions(+), 41 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>> index b4df5bc5b1b8..dccf0877285b 100644
>>>> --- a/arch/arm64/mm/mmu.c
>>>> +++ b/arch/arm64/mm/mmu.c
>>>> @@ -189,11 +189,11 @@ static void init_pte(pte_t *ptep, unsigned long addr,
>>>> unsigned long end,
>>>>        } while (ptep++, addr += PAGE_SIZE, addr != end);
>>>>    }
>>>>    -static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>>> -                unsigned long end, phys_addr_t phys,
>>>> -                pgprot_t prot,
>>>> -                phys_addr_t (*pgtable_alloc)(int),
>>>> -                int flags)
>>>> +static int alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
>>>> +                   unsigned long end, phys_addr_t phys,
>>>> +                   pgprot_t prot,
>>>> +                   phys_addr_t (*pgtable_alloc)(int),
>>>> +                   int flags)
>>>>    {
>>>>        unsigned long next;
>>>>        pmd_t pmd = READ_ONCE(*pmdp);
>>>> @@ -208,6 +208,8 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned
>>>> long addr,
>>>>                pmdval |= PMD_TABLE_PXN;
>>>>            BUG_ON(!pgtable_alloc);
>>>>            pte_phys = pgtable_alloc(PAGE_SHIFT);
>>>> +        if (!pte_phys)
>>>> +            return -ENOMEM;
>>> nit: personally I'd prefer to see a "goto out" and funnel all to a single return
>>> statement. You do that in some functions (via loop break), but would be cleaner
>>> if consistent.
>>>
>>> If pgtable_alloc() is modified to return int (see my comment at the bottom),
>>> this becomes:
>>>
>>> ret = pgtable_alloc(PAGE_SHIFT, &pte_phys);
>>> if (ret)
>>>      goto out;
>> OK
>>
>>>>            ptep = pte_set_fixmap(pte_phys);
>>>>            init_clear_pgtable(ptep);
>>>>            ptep += pte_index(addr);
>>>> @@ -239,13 +241,16 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned
>>>> long addr,
>>>>         * walker.
>>>>         */
>>>>        pte_clear_fixmap();
>>>> +
>>>> +    return 0;
>>>>    }
>>>>    -static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>>> -             phys_addr_t phys, pgprot_t prot,
>>>> -             phys_addr_t (*pgtable_alloc)(int), int flags)
>>>> +static int init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
>>>> +            phys_addr_t phys, pgprot_t prot,
>>>> +            phys_addr_t (*pgtable_alloc)(int), int flags)
>>>>    {
>>>>        unsigned long next;
>>>> +    int ret = 0;
>>>>          do {
>>>>            pmd_t old_pmd = READ_ONCE(*pmdp);
>>>> @@ -264,22 +269,27 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr,
>>>> unsigned long end,
>>>>                BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
>>>>                                  READ_ONCE(pmd_val(*pmdp))));
>>>>            } else {
>>>> -            alloc_init_cont_pte(pmdp, addr, next, phys, prot,
>>>> +            ret = alloc_init_cont_pte(pmdp, addr, next, phys, prot,
>>>>                            pgtable_alloc, flags);
>>>> +            if (ret)
>>>> +                break;
>>>>                  BUG_ON(pmd_val(old_pmd) != 0 &&
>>>>                       pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
>>>>            }
>>>>            phys += next - addr;
>>>>        } while (pmdp++, addr = next, addr != end);
>>>> +
>>>> +    return ret;
>>>>    }
>>>>    -static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>>> -                unsigned long end, phys_addr_t phys,
>>>> -                pgprot_t prot,
>>>> -                phys_addr_t (*pgtable_alloc)(int), int flags)
>>>> +static int alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
>>>> +                   unsigned long end, phys_addr_t phys,
>>>> +                   pgprot_t prot,
>>>> +                   phys_addr_t (*pgtable_alloc)(int), int flags)
>>>>    {
>>>>        unsigned long next;
>>>> +    int ret = 0;
>>>>        pud_t pud = READ_ONCE(*pudp);
>>>>        pmd_t *pmdp;
>>>>    @@ -295,6 +305,8 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned
>>>> long addr,
>>>>                pudval |= PUD_TABLE_PXN;
>>>>            BUG_ON(!pgtable_alloc);
>>>>            pmd_phys = pgtable_alloc(PMD_SHIFT);
>>>> +        if (!pmd_phys)
>>>> +            return -ENOMEM;
>>>>            pmdp = pmd_set_fixmap(pmd_phys);
>>>>            init_clear_pgtable(pmdp);
>>>>            pmdp += pmd_index(addr);
>>>> @@ -314,21 +326,26 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned
>>>> long addr,
>>>>                (flags & NO_CONT_MAPPINGS) == 0)
>>>>                __prot = __pgprot(pgprot_val(prot) | PTE_CONT);
>>>>    -        init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>>>> +        ret = init_pmd(pmdp, addr, next, phys, __prot, pgtable_alloc, flags);
>>>> +        if (ret)
>>>> +            break;
>>>>              pmdp += pmd_index(next) - pmd_index(addr);
>>>>            phys += next - addr;
>>>>        } while (addr = next, addr != end);
>>>>          pmd_clear_fixmap();
>>>> +
>>>> +    return ret;
>>>>    }
>>>>    -static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long
>>>> end,
>>>> -               phys_addr_t phys, pgprot_t prot,
>>>> -               phys_addr_t (*pgtable_alloc)(int),
>>>> -               int flags)
>>>> +static int alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
>>>> +              phys_addr_t phys, pgprot_t prot,
>>>> +              phys_addr_t (*pgtable_alloc)(int),
>>>> +              int flags)
>>>>    {
>>>>        unsigned long next;
>>>> +    int ret = 0;
>>>>        p4d_t p4d = READ_ONCE(*p4dp);
>>>>        pud_t *pudp;
>>>>    @@ -340,6 +357,8 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
>>>> addr, unsigned long end,
>>>>                p4dval |= P4D_TABLE_PXN;
>>>>            BUG_ON(!pgtable_alloc);
>>>>            pud_phys = pgtable_alloc(PUD_SHIFT);
>>>> +        if (!pud_phys)
>>>> +            return -ENOMEM;
>>>>            pudp = pud_set_fixmap(pud_phys);
>>>>            init_clear_pgtable(pudp);
>>>>            pudp += pud_index(addr);
>>>> @@ -369,8 +388,10 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
>>>> addr, unsigned long end,
>>>>                BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
>>>>                                  READ_ONCE(pud_val(*pudp))));
>>>>            } else {
>>>> -            alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>>>> +            ret = alloc_init_cont_pmd(pudp, addr, next, phys, prot,
>>>>                            pgtable_alloc, flags);
>>>> +            if (ret)
>>>> +                break;
>>>>                  BUG_ON(pud_val(old_pud) != 0 &&
>>>>                       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
>>>> @@ -379,14 +400,17 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long
>>>> addr, unsigned long end,
>>>>        } while (pudp++, addr = next, addr != end);
>>>>          pud_clear_fixmap();
>>>> +
>>>> +    return ret;
>>>>    }
>>>>    -static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long
>>>> end,
>>>> -               phys_addr_t phys, pgprot_t prot,
>>>> -               phys_addr_t (*pgtable_alloc)(int),
>>>> -               int flags)
>>>> +static int alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
>>>> +              phys_addr_t phys, pgprot_t prot,
>>>> +              phys_addr_t (*pgtable_alloc)(int),
>>>> +              int flags)
>>>>    {
>>>>        unsigned long next;
>>>> +    int ret = 0;
>>>>        pgd_t pgd = READ_ONCE(*pgdp);
>>>>        p4d_t *p4dp;
>>>>    @@ -398,6 +422,8 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long
>>>> addr, unsigned long end,
>>>>                pgdval |= PGD_TABLE_PXN;
>>>>            BUG_ON(!pgtable_alloc);
>>>>            p4d_phys = pgtable_alloc(P4D_SHIFT);
>>>> +        if (!p4d_phys)
>>>> +            return -ENOMEM;
>>>>            p4dp = p4d_set_fixmap(p4d_phys);
>>>>            init_clear_pgtable(p4dp);
>>>>            p4dp += p4d_index(addr);
>>>> @@ -412,8 +438,10 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long
>>>> addr, unsigned long end,
>>>>              next = p4d_addr_end(addr, end);
>>>>    -        alloc_init_pud(p4dp, addr, next, phys, prot,
>>>> +        ret = alloc_init_pud(p4dp, addr, next, phys, prot,
>>>>                       pgtable_alloc, flags);
>>>> +        if (ret)
>>>> +            break;
>>>>              BUG_ON(p4d_val(old_p4d) != 0 &&
>>>>                   p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
>>>> @@ -422,23 +450,26 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long
>>>> addr, unsigned long end,
>>>>        } while (p4dp++, addr = next, addr != end);
>>>>          p4d_clear_fixmap();
>>>> +
>>>> +    return ret;
>>>>    }
>>>>    -static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>>>> -                    unsigned long virt, phys_addr_t size,
>>>> -                    pgprot_t prot,
>>>> -                    phys_addr_t (*pgtable_alloc)(int),
>>>> -                    int flags)
>>>> +static int __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
>>>> +                       unsigned long virt, phys_addr_t size,
>>>> +                       pgprot_t prot,
>>>> +                       phys_addr_t (*pgtable_alloc)(int),
>>>> +                       int flags)
>>>>    {
>>>>        unsigned long addr, end, next;
>>>>        pgd_t *pgdp = pgd_offset_pgd(pgdir, virt);
>>>> +    int ret = 0;
>>>>          /*
>>>>         * If the virtual and physical address don't have the same offset
>>>>         * within a page, we cannot map the region as the caller expects.
>>>>         */
>>>>        if (WARN_ON((phys ^ virt) & ~PAGE_MASK))
>>>> -        return;
>>>> +        return -EINVAL;
>>>>          phys &= PAGE_MASK;
>>>>        addr = virt & PAGE_MASK;
>>>> @@ -446,29 +477,38 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir,
>>>> phys_addr_t phys,
>>>>          do {
>>>>            next = pgd_addr_end(addr, end);
>>>> -        alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
>>>> +        ret = alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
>>>>                       flags);
>>>> +        if (ret)
>>>> +            break;
>>>>            phys += next - addr;
>>>>        } while (pgdp++, addr = next, addr != end);
>>>> +
>>>> +    return ret;
>>>>    }
>>>>    -static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>>>> -                 unsigned long virt, phys_addr_t size,
>>>> -                 pgprot_t prot,
>>>> -                 phys_addr_t (*pgtable_alloc)(int),
>>>> -                 int flags)
>>>> +static int __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>>>> +                unsigned long virt, phys_addr_t size,
>>>> +                pgprot_t prot,
>>>> +                phys_addr_t (*pgtable_alloc)(int),
>>>> +                int flags)
>>>>    {
>>>> +    int ret;
>>>> +
>>>>        mutex_lock(&fixmap_lock);
>>>> -    __create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
>>>> +    ret = __create_pgd_mapping_locked(pgdir, phys, virt, size, prot,
>>>>                        pgtable_alloc, flags);
>>>> +    BUG_ON(ret);
>>> This function now returns an error, but also BUGs on ret!=0. For this patch, I'd
>>> suggest keeping this function as void.
>> You mean __create_pgd_mapping(), right?
> Yes.
>
>>> But I believe there is a pre-existing bug in arch_add_memory(). That's called at
>>> runtime so if __create_pgd_mapping() fails and BUGs, it will take down a running
>>> system.
>> Yes, it is the current behavior.
>>
>>> With this foundational patch, we can fix that with an additional patch to pass
>>> along the error code instead of BUGing in that case. arch_add_memory() would
>>> need to unwind whatever __create_pgd_mapping() managed to do before the memory
>>> allocation failure (presumably unmapping and freeing any allocated tables). I'm
>>> happy to do this as a follow up patch.
>> Yes, the allocated page tables need to be freed. Thank you for taking it.
> Given the conversation in the other thread about generalizing to also eventually
> support vmalloc, I'm not sure you need to be able to return errors from this
> walker for your usage now? I think you will only use this walker for the case
> where you need to repaint to page mappings after determining that a secondary
> CPU does not support BBML2? If that fails, the system is dead anyway, so
> continuing to BUG() is probably acceptable?
>
> So perhaps you could drop this patch from your series? If so, then I'll reuse
> the patch to fix the theoretical hotplug bug (when I get to it) and will keep
> your authorship.

For repainting the linear map, I agree it should just BUG() if it 
fails. But the walker is also used to change linear map permissions, 
e.g. for vmalloc. There BUG() is not the expected behavior; it should 
return an error so the caller (i.e. module loading) can see the 
failure and propagate the error back to userspace.
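
Something like the below, just to illustrate the two cases (a rough 
sketch with made-up helper names such as split_linear_map(); not the 
actual patch):

/*
 * Sketch only.  split_linear_map()/split_linear_map_to_ptes() stand in
 * for whatever helper ends up wrapping __create_pgd_mapping() for the
 * split; they are not real functions in this series.
 */

/* Repainting on an asymmetric system: no sensible way to recover. */
static void repaint_linear_map(void)
{
        int ret = split_linear_map_to_ptes();

        BUG_ON(ret);            /* the system is unusable anyway */
}

/* Changing permissions at runtime (e.g. module load): report failure. */
static int set_linear_map_ro(unsigned long addr, int numpages)
{
        int ret = split_linear_map(addr, numpages * PAGE_SIZE);

        if (ret)
                return ret;     /* e.g. -ENOMEM, propagated to the caller */

        return __change_memory_common(addr, numpages * PAGE_SIZE,
                                      __pgprot(PTE_RDONLY),
                                      __pgprot(PTE_WRITE));
}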

>
>>>>        mutex_unlock(&fixmap_lock);
>>>> +
>>>> +    return ret;
>>>>    }
>>>>      #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
>>>>    extern __alias(__create_pgd_mapping_locked)
>>>> -void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long
>>>> virt,
>>>> -                 phys_addr_t size, pgprot_t prot,
>>>> -                 phys_addr_t (*pgtable_alloc)(int), int flags);
>>>> +int create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
>>>> +                phys_addr_t size, pgprot_t prot,
>>>> +                phys_addr_t (*pgtable_alloc)(int), int flags);
>>> create_kpti_ng_temp_pgd() now returns error instead of BUGing on allocation
>>> failure, but I don't see a change to handle that error. You'll want to update
>>> __kpti_install_ng_mappings() to BUG on error.
>> Yes, I missed that. It should BUG on error.
>>
>>>>    #endif
>>>>      static phys_addr_t __pgd_pgtable_alloc(int shift)
>>>> @@ -476,13 +516,17 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
>>>>        /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>>>>        void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
>>>>    -    BUG_ON(!ptr);
>>>> +    if (!ptr)
>>>> +        return 0;
>>> 0 is a valid (though unlikely) physical address. I guess you could technically
>>> encode like ERR_PTR(), but since you are returning phys_addr_t and not a
>>> pointer, then perhaps it will be clearer to make this return int and accept a
>>> pointer to a phys_addr_t, which it will populate on success?
>> Actually I did something similar in the first place, but just returned the virt
>> address. Then did something if it returns NULL. That made the code a little more
>> messy since we need convert the virt address to phys address because
>> __create_pgd_mapping() and the helpers require phys address, and changed the
>> functions definition.
>>
>> But I noticed 0 should be not a valid phys address if I remember correctly.
> 0 is definitely a valid physical address. We even have examples of real Arm
> boards that have RAM at physical address 0. See [1].
>
> [1] https://lore.kernel.org/lkml/ad8ed3ba-12e8-3031-7c66-035b6d9ad6cd@arm.com/
>
>> I
>> also noticed early_pgtable_alloc() calls memblock_phys_alloc_range(), it returns
>> 0 on failure. If 0 is valid phys address, then it should not do that, right?
> Well perhaps memblock will just refuse to give you RAM at address 0. That's a
> bad design choice in my opinion. But the buddy will definitely give out page 0
> if it is RAM. -1 would be a better choice for an error sentinel.

If 0 is not valid in memblock, can it be valid in the buddy allocator? 
If I remember correctly, only the valid memblock memory is released to 
the buddy allocator. But I'm not 100% sure.

>
>> And
>> I also noticed the memblock range 0 - memstart_addr is actually removed from
>> memblock (see arm64_memblock_init()), so IIUC 0 should be not valid phys
>> address. So the patch ended up being as is.
> But memstart_addr could be 0, so in that case you don't actually remove anything?

Yeah, it could be 0.

>
>> If this assumption doesn't stand, I think your suggestion makes sense.
> Perhaps the simpler approach is to return -1 on error. That's never going to be
> valid because the maximum number of address bits on the physical bus is 56.

Sounds fine to me. We just need to know whether the allocation failed 
or not; an unambiguous error value is good enough.
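
For example (sketch only; INVALID_PHYS_ADDR is a made-up name, not 
something defined in the current patch):

#define INVALID_PHYS_ADDR       ((phys_addr_t)-1)

static phys_addr_t __pgd_pgtable_alloc(int shift)
{
        /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
        void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);

        if (!ptr)
                return INVALID_PHYS_ADDR;   /* all-ones is never a valid PA */

        return __pa(ptr);
}

The callers would then compare against INVALID_PHYS_ADDR instead of 0 
(or check for an error code if we go with the int return plus 
phys_addr_t out parameter you suggested).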

>
>>>> +
>>>>        return __pa(ptr);
>>>>    }
>>>>      static phys_addr_t pgd_pgtable_alloc(int shift)
>>>>    {
>>>>        phys_addr_t pa = __pgd_pgtable_alloc(shift);
>>>> +    if (!pa)
>>>> +        goto out;
>>> This would obviously need to be fixed up as per above.
>>>
>>>>        struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
>>>>          /*
>>>> @@ -498,6 +542,7 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
>>>>        else if (shift == PMD_SHIFT)
>>>>            BUG_ON(!pagetable_pmd_ctor(ptdesc));
>>>>    +out:
>>>>        return pa;
>>>>    }
>>>>    
>>> You have left early_pgtable_alloc() to panic() on allocation failure. Given we
>>> can now unwind the stack with error code, I think it would be more consistent to
>>> also allow early_pgtable_alloc() to return error.
>> The early_pgtable_alloc() is just used for painting linear mapping at early boot
>> stage, if it fails I don't think unwinding the stack is feasible and worth it.
>> Did I miss something?
> Personally I'd just prefer it all to be consistent. But I agree there is no big
> benefit. Anyway, like I said above, I'm not sure you need to worry about
> unwinding the stack at all given the approach we agreed in the other thread?

As I said above, I need to propagate the error code back to userspace. 
For example, when loading a module, if changing the linear map 
permission fails because a page table allocation fails while splitting 
the page table, -ENOMEM needs to be propagated to insmod so that insmod 
reports the failure to the user.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-07 21:16                   ` Yang Shi
@ 2025-05-28  0:00                     ` Yang Shi
  2025-05-28  3:47                       ` Dev Jain
  2025-05-28 13:13                       ` Ryan Roberts
  0 siblings, 2 replies; 49+ messages in thread
From: Yang Shi @ 2025-05-28  0:00 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

Hi Ryan,

I have a new spin ready in my local tree on top of v6.15-rc4. I noticed 
there were some more comments on Miko's BBML2 patch, so it looks like a 
new spin of that series is needed. But AFAICT there should be no 
significant change to how I advertise AmpereOne BBML2 in my patches: we 
will keep using the MIDR allow list to check whether BBML2 can be 
advertised, and the erratum still seems to be needed to fix up the 
AA64MMFR2 BBML2 bits for AmpereOne, IIUC.
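
Roughly along these lines (illustrative sketch only; the real detection 
lives in Miko's cpufeature patch plus the AmpereOne erratum fixup, and 
the names below are made up):

static const struct midr_range bbml2_noabort_cpus[] = {
        MIDR_ALL_VERSIONS(MIDR_AMPERE1),
        {},
};

static bool midr_allows_bbml2(u32 midr)
{
        const struct midr_range *r;

        /*
         * Only advertise BBML2 on CPUs known to handle TLB conflicts
         * gracefully, regardless of what ID_AA64MMFR2_EL1 claims.
         */
        for (r = bbml2_noabort_cpus; r->model; r++)
                if (midr_is_cpu_model_range(midr, r->model,
                                            r->rv_min, r->rv_max))
                        return true;

        return false;
}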

You also mentioned Dev was working on patches to have 
__change_memory_common() apply permission changes on a contiguous range 
instead of on a per-page basis (the status quo). But I have not seen 
the patches on the mailing list yet. However, I don't think this will 
result in any significant change to my patches either, particularly the 
split primitive and the linear map repainting.

So I plan to post the v4 patches to the mailing list, and we can focus 
the review on the split primitive and the linear map repainting. Does 
that sound good to you?

Thanks,
Yang


On 5/7/25 2:16 PM, Yang Shi wrote:
>
>
> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>> On 05/05/2025 22:39, Yang Shi wrote:
>>>
>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> I know you may have a lot of things to follow up after LSF/MM. 
>>>>>>> Just gently
>>>>>>> ping,
>>>>>>> hopefully we can resume the review soon.
>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd 
>>>>>> April. But I'm very
>>>>>> keen to move this series forward so will come back to you next 
>>>>>> week. (although
>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>
>>>>>> FWIW, having thought about it a bit more, I think some of the 
>>>>>> suggestions I
>>>>>> previously made may not have been quite right, but I'll elaborate 
>>>>>> next week.
>>>>>> I'm
>>>>>> keen to build a pgtable splitting primitive here that we can 
>>>>>> reuse with vmalloc
>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>> Sounds good. I think the patches can support splitting vmalloc 
>>>>> page table too.
>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>> Hi Yang,
>>>>
>>>> Sorry I've taken so long to get back to you. Here's what I'm 
>>>> currently thinking:
>>>> I'd eventually like to get to the point where the linear map and 
>>>> most vmalloc
>>>> memory is mapped using the largest possible mapping granularity 
>>>> (i.e. block
>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>
>>>> vmalloc has history with trying to do huge mappings by default; it 
>>>> ended up
>>>> having to be turned into an opt-in feature (instead of the original 
>>>> opt-out
>>>> approach) because there were problems with some parts of the kernel 
>>>> expecting
>>>> page mappings. I think we might be able to overcome those issues on 
>>>> arm64 with
>>>> BBML2.
>>>>
>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I 
>>>> have a
>>>> series (that should make v6.16) that enables contiguous PTE 
>>>> mappings in vmalloc
>>>> too. But these are currently limited to when VM_ALLOW_HUGE is 
>>>> specified. To be
>>>> able to use that by default, we need to be able to change 
>>>> permissions on
>>>> sub-regions of an allocation, which is where BBML2 and your series 
>>>> come in.
>>>> (there may be other things we need to solve as well; TBD).
>>>>
>>>> I think the key thing we need is a function that can take a 
>>>> page-aligned kernel
>>>> VA, will walk to the leaf entry for that VA and if the VA is in the 
>>>> middle of
>>>> the leaf entry, it will split it so that the VA is now on a 
>>>> boundary. This will
>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE 
>>>> entries. The
>>>> function can assume BBML2 is present. And it will return 0 on 
>>>> success, -EINVAL
>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a 
>>>> pgtable to perform
>>>> the split.
>>> OK, the v3 patches already handled page table allocation failure 
>>> with returning
>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear 
>>> mapping
>>> should be always present. It is easy to return -EINVAL instead of 
>>> BUG_ON.
>>> However I'm wondering what usecases you are thinking about? 
>>> Splitting vmalloc
>>> area may run into unmapped VA?
>> I don't think BUG_ON is the right behaviour; crashing the kernel 
>> should be
>> discouraged. I think even for vmalloc under correct conditions we 
>> shouldn't see
>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>> vunmap_pmd_range() which skips the pmd if its none.
>>
>>>> Then we can use that primitive on the start and end address of any 
>>>> range for
>>>> which we need exact mapping boundaries (e.g. when changing 
>>>> permissions on part
>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc 
>>>> allocation,
>>>> etc). This way we only split enough to ensure the boundaries are 
>>>> precise, and
>>>> keep larger mappings inside the range.
>>> Yeah, makes sense to me.
>>>
>>>> Next we need to reimplement __change_memory_common() to not use
>>>> apply_to_page_range(), because that assumes page mappings only. Dev 
>>>> Jain has
>>>> been working on a series that converts this to use 
>>>> walk_page_range_novma() so
>>>> that we can change permissions on the block/contig entries too. 
>>>> That's not
>>>> posted publicly yet, but it's not huge so I'll ask if he is 
>>>> comfortable with
>>>> posting an RFC early next week.
>>> OK, so the new __change_memory_common() will change the permission 
>>> of page
>>> table, right?
>> It will change permissions of all the leaf entries in the range of 
>> VAs it is
>> passed. Currently it assumes that all the leaf entries are PTEs. But 
>> we will
>> generalize to support all the other types of leaf entries too.,
>>
>>> If I remember correctly, you suggested change permissions in
>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>> Yes I did. I think this made sense (in my head at least) because in 
>> the context
>> of the linear map, all the PFNs are contiguous so it kind-of makes 
>> sense to
>> reuse that infrastructure. But it doesn't generalize to vmalloc 
>> because vmalloc
>> PFNs are not contiguous. So for that reason, I think it's preferable 
>> to have an
>> independent capability.
>
> OK, sounds good to me.
>
>>
>>> The current code assumes the address range passed in by 
>>> change_memory_common()
>>> is *NOT* physically contiguous so __change_memory_common() handles 
>>> page table
>>> permission on page basis. I'm supposed Dev's patches will handle 
>>> this then my
>>> patch can safely assume the linear mapping address range for 
>>> splitting is
>>> physically contiguous too otherwise I can't keep large mappings 
>>> inside the
>>> range. Splitting vmalloc area doesn't need to worry about this.
>> I'm not sure I fully understand the point you're making here...
>>
>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>> implementation so that it can walk a VA range and update the 
>> permissions on each
>> leaf entry it visits, regadless of which level the leaf entry is at. 
>> This
>> doesn't make any assumption of the physical contiguity of 
>> neighbouring leaf
>> entries in the page table.
>>
>> So if we are changing permissions on the linear map, we have a range 
>> of VAs to
>> walk and convert all the leaf entries, regardless of their size. The 
>> same goes
>> for vmalloc... But for vmalloc, we will also want to change the 
>> underlying
>> permissions in the linear map, so we will have to figure out the 
>> contiguous
>> pieces of the linear map and call __change_memory_common() for each; 
>> there is
>> definitely some detail to work out there!
>
> Yes, this is my point. When changing underlying linear map permission 
> for vmalloc, the linear map address may be not contiguous. This is why 
> change_memory_common() calls __change_memory_common() on page basis.
>
> But how Dev's patch work should have no impact on how I implement the 
> split primitive by thinking it further. It should be the caller's 
> responsibility to make sure __create_pgd_mapping_locked() is called 
> for contiguous linear map address range.
>
>>
>>>> You'll still need to repaint the whole linear map with page 
>>>> mappings for the
>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() 
>>>> (potentially with
>>>> minor modifications?) can do that repainting on the live mappings; 
>>>> similar to
>>>> how you are doing it in v3.
>>> Yes, when repainting I need to split the page table all the way down 
>>> to PTE
>>> level. A simple flag should be good enough to tell 
>>> __create_pgd_mapping_locked()
>>> do the right thing off the top of my head.
>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and 
>> NO_CONT_MAPPINGS
>> flags? For example, if you are find a leaf mapping and 
>> NO_BLOCK_MAPPINGS is set,
>> then you need to split it?
>
> Yeah, sounds feasible. Anyway I will figure it out.
>
>>
>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>
>>>> So in summary, what I'm asking for your large block mapping the 
>>>> linear map
>>>> series is:
>>>>     - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>     - Repaint linear map using page mappings if secondary CPUs 
>>>> don't support BBML2
>>> OK, I just need to add some simple tweak to split down to PTE level 
>>> to v3.
>>>
>>>>     - Integrate Dev's __change_memory_common() series
>>> OK, I think I have to do my patches on top of it. Because Dev's 
>>> patch need
>>> guarantee the linear mapping address range is physically contiguous.
>>>
>>>>     - Create primitive to ensure mapping entry boundary at a given 
>>>> page-aligned VA
>>>>     - Use primitive when changing permissions on linear map region
>>> Sure.
>>>
>>>> This will be mergable on its own, but will also provide a great 
>>>> starting base
>>>> for adding huge-vmalloc-by-default.
>>>>
>>>> What do you think?
>>> Definitely makes sense to me.
>>>
>>> If I remember correctly, we still have some unsolved 
>>> comments/questions for v3
>>> in my replies on March 17, particularly:
>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>> b344-9401fa4c0feb@os.amperecomputing.com/
>> Ahh sorry about that. I'll take a look now...
>
> No problem.
>
> Thanks,
> Yang
>
>>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>
>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>>
>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some 
>>>>>>>>>> slight changes
>>>>>>>>>> that
>>>>>>>>>> have impact to my patches (basically check the new boot 
>>>>>>>>>> parameter). Do you
>>>>>>>>>> prefer I rebase my patches on top of his new spin right now 
>>>>>>>>>> then restart
>>>>>>>>>> review
>>>>>>>>>> from the new spin or review the current patches then solve 
>>>>>>>>>> the new review
>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>> Hi Yang,
>>>>>>>>>
>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my 
>>>>>>>>> queue!
>>>>>>>>>
>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with 
>>>>>>>>> Miko's series
>>>>>>>>> and am
>>>>>>>>> not too bothered about the integration with that; I think it's 
>>>>>>>>> pretty
>>>>>>>>> straight
>>>>>>>>> forward. I'm more interested in how you are handling the 
>>>>>>>>> splitting, which I
>>>>>>>>> think is the bulk of the effort.
>>>>>>>> Yeah, sure, thank you.
>>>>>>>>
>>>>>>>>> I'm hoping to get to this next week before heading out to 
>>>>>>>>> LSF/MM the
>>>>>>>>> following
>>>>>>>>> week (might I see you there?)
>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>> Changelog
>>>>>>>>>>> =========
>>>>>>>>>>> v3:
>>>>>>>>>>>        * Rebased to v6.14-rc4.
>>>>>>>>>>>        * Based on Miko's BBML2 cpufeature patch 
>>>>>>>>>>> (https://lore.kernel.org/
>>>>>>>>>>> linux-
>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>          Also included in this series in order to have the 
>>>>>>>>>>> complete
>>>>>>>>>>> patchset.
>>>>>>>>>>>        * Enhanced __create_pgd_mapping() to handle split as 
>>>>>>>>>>> well per Ryan.
>>>>>>>>>>>        * Supported CONT mappings per Ryan.
>>>>>>>>>>>        * Supported asymmetric system by splitting kernel 
>>>>>>>>>>> linear mapping if
>>>>>>>>>>> such
>>>>>>>>>>>          system is detected per Ryan. I don't have such 
>>>>>>>>>>> system to test,
>>>>>>>>>>> so the
>>>>>>>>>>>          testing is done by hacking kernel to call linear 
>>>>>>>>>>> mapping
>>>>>>>>>>> repainting
>>>>>>>>>>>          unconditionally. The linear mapping doesn't have 
>>>>>>>>>>> any block and
>>>>>>>>>>> cont
>>>>>>>>>>>          mappings after booting.
>>>>>>>>>>>
>>>>>>>>>>> RFC v2:
>>>>>>>>>>>        * Used allowlist to advertise BBM lv2 on the CPUs 
>>>>>>>>>>> which can
>>>>>>>>>>> handle TLB
>>>>>>>>>>>          conflict gracefully per Will Deacon
>>>>>>>>>>>        * Rebased onto v6.13-rc5
>>>>>>>>>>>        * 
>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1- 
>>>>>>>>>>>
>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>
>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>
>>>>>>>>>>> Description
>>>>>>>>>>> ===========
>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due 
>>>>>>>>>>> to arm's
>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>
>>>>>>>>>>> A number of performance issues arise when the kernel linear 
>>>>>>>>>>> map is using
>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>        - performance degradation
>>>>>>>>>>>        - more TLB pressure
>>>>>>>>>>>        - memory waste for kernel page table
>>>>>>>>>>>
>>>>>>>>>>> These issues can be avoided by specifying rodata=on the 
>>>>>>>>>>> kernel command
>>>>>>>>>>> line but this disables the alias checks on page table 
>>>>>>>>>>> permissions and
>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>
>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to 
>>>>>>>>>>> invalidate the
>>>>>>>>>>> page table entry when changing page sizes. This allows the 
>>>>>>>>>>> kernel to
>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>
>>>>>>>>>>> This patch adds support for splitting large mappings when 
>>>>>>>>>>> FEAT_BBM level 2
>>>>>>>>>>> is available and rodata=full is used. This functionality 
>>>>>>>>>>> will be used
>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>
>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map 
>>>>>>>>>>> using PTEs
>>>>>>>>>>> only.
>>>>>>>>>>>
>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may 
>>>>>>>>>>> be repainted
>>>>>>>>>>> once
>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 
>>>>>>>>>>> for more
>>>>>>>>>>> details.
>>>>>>>>>>>
>>>>>>>>>>> We saw significant performance increases in some benchmarks 
>>>>>>>>>>> with
>>>>>>>>>>> rodata=full without compromising the security features of 
>>>>>>>>>>> the kernel.
>>>>>>>>>>>
>>>>>>>>>>> Testing
>>>>>>>>>>> =======
>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 
>>>>>>>>>>> 256GB
>>>>>>>>>>> memory and
>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>
>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>        - Kernel boot.  Kernel needs change kernel linear 
>>>>>>>>>>> mapping
>>>>>>>>>>> permission at
>>>>>>>>>>>          boot stage, if the patch didn't work, kernel 
>>>>>>>>>>> typically didn't
>>>>>>>>>>> boot.
>>>>>>>>>>>        - Module stress from stress-ng. Kernel module load 
>>>>>>>>>>> change permission
>>>>>>>>>>> for
>>>>>>>>>>>          linear mapping.
>>>>>>>>>>>        - A test kernel module which allocates 80% of total 
>>>>>>>>>>> memory via
>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>          then change the vmalloc area permission to RO, this 
>>>>>>>>>>> also change
>>>>>>>>>>> linear
>>>>>>>>>>>          mapping permission to RO, then change it back 
>>>>>>>>>>> before vfree(). Then
>>>>>>>>>>> launch
>>>>>>>>>>>          a VM which consumes almost all physical memory.
>>>>>>>>>>>        - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>        - Kernel build in VM with guest kernel which has this 
>>>>>>>>>>> series
>>>>>>>>>>> applied.
>>>>>>>>>>>        - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>        - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>
>>>>>>>>>>> Performance
>>>>>>>>>>> ===========
>>>>>>>>>>> Memory consumption
>>>>>>>>>>> Before:
>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>
>>>>>>>>>>> After:
>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>
>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the 
>>>>>>>>>>> machine, the
>>>>>>>>>>> more memory saved.
>>>>>>>>>>>
>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>> * Memcached
>>>>>>>>>>> We saw performance degradation when running Memcached 
>>>>>>>>>>> benchmark with
>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel 
>>>>>>>>>>> TLB pressure.
>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 
>>>>>>>>>>> 3.5%, P99
>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The 
>>>>>>>>>>> kernel TLB
>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>
>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>
>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk 
>>>>>>>>>>> (ext4) with
>>>>>>>>>>> disk
>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>          --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>          --iodepth=1 --numjobs=1 --fsync_on_close=1 
>>>>>>>>>>> --group_reporting --
>>>>>>>>>>> thread \
>>>>>>>>>>>          --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>
>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, 
>>>>>>>>>>> but the worst
>>>>>>>>>>> number of good case is around 90% more than the best number 
>>>>>>>>>>> of bad case).
>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced 
>>>>>>>>>>> proportionally.
>>>>>>>>>>>
>>>>>>>>>>> * Sequential file read
>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>> populated).
>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>            arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>
>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>            arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>            arm64: mm: make __create_pgd_mapping() and 
>>>>>>>>>>> helpers non-void
>>>>>>>>>>>            arm64: mm: support large block mapping when 
>>>>>>>>>>> rodata=full
>>>>>>>>>>>            arm64: mm: support split CONT mappings
>>>>>>>>>>>            arm64: mm: split linear mapping if BBML2 is not 
>>>>>>>>>>> supported on
>>>>>>>>>>> secondary
>>>>>>>>>>> CPUs
>>>>>>>>>>>
>>>>>>>>>>>       arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>       arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>       arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>       arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>       arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>       arch/arm64/kernel/cpufeature.c      | 95 
>>>>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>>>> ++++++
>>>>>>>>>>> +++++++
>>>>>>>>>>>       arch/arm64/mm/mmu.c                 | 397 
>>>>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>>>> ++++++
>>>>>>>>>>> ++++
>>>>>>>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
>>>>>>>>>>>
>>>>>>>>>>> +++++
>>>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>>>       arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>       arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>       9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>>
>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-28  0:00                     ` Yang Shi
@ 2025-05-28  3:47                       ` Dev Jain
  2025-05-28 13:13                       ` Ryan Roberts
  1 sibling, 0 replies; 49+ messages in thread
From: Dev Jain @ 2025-05-28  3:47 UTC (permalink / raw)
  To: Yang Shi, Ryan Roberts, will, catalin.marinas, Miko.Lenczewski,
	scott, cl
  Cc: linux-arm-kernel, linux-kernel


On 28/05/25 5:30 am, Yang Shi wrote:
> Hi Ryan,
>
> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed 
> there were some more comments on Miko's BBML2 patch, it looks like a 
> new spin is needed. But AFAICT there should be no significant change 
> to how I advertise AmpereOne BBML2 in my patches. We will keep using 
> MIDR list to check whether BBML2 is advertised or not and the erratum 
> seems still be needed to fix up AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>
> You also mentioned Dev was working on patches to have 
> __change_memory_common() apply permission change on a contiguous range 
> instead of on page basis (the status quo). But I have not seen the 
> patches on mailing list yet. However I don't think this will result in 
> any significant change to my patches either, particularly the split 
> primitive and linear map repainting.


Hi, sorry for the delay. I have too much on my plate right now, so I 
will try to get that posting ready by the end of this week.


>
> So I plan to post v4 patches to the mailing list. We can focus on 
> reviewing the split primitive and linear map repainting. Does it sound 
> good to you?
>
> Thanks,
> Yang
>
>
> On 5/7/25 2:16 PM, Yang Shi wrote:
>>
>>
>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>
>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. 
>>>>>>>> Just gently
>>>>>>>> ping,
>>>>>>>> hopefully we can resume the review soon.
>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd 
>>>>>>> April. But I'm very
>>>>>>> keen to move this series forward so will come back to you next 
>>>>>>> week. (although
>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>
>>>>>>> FWIW, having thought about it a bit more, I think some of the 
>>>>>>> suggestions I
>>>>>>> previously made may not have been quite right, but I'll 
>>>>>>> elaborate next week.
>>>>>>> I'm
>>>>>>> keen to build a pgtable splitting primitive here that we can 
>>>>>>> reuse with vmalloc
>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>> Sounds good. I think the patches can support splitting vmalloc 
>>>>>> page table too.
>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>> Hi Yang,
>>>>>
>>>>> Sorry I've taken so long to get back to you. Here's what I'm 
>>>>> currently thinking:
>>>>> I'd eventually like to get to the point where the linear map and 
>>>>> most vmalloc
>>>>> memory is mapped using the largest possible mapping granularity 
>>>>> (i.e. block
>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>
>>>>> vmalloc has history with trying to do huge mappings by default; it 
>>>>> ended up
>>>>> having to be turned into an opt-in feature (instead of the 
>>>>> original opt-out
>>>>> approach) because there were problems with some parts of the 
>>>>> kernel expecting
>>>>> page mappings. I think we might be able to overcome those issues 
>>>>> on arm64 with
>>>>> BBML2.
>>>>>
>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and 
>>>>> I have a
>>>>> series (that should make v6.16) that enables contiguous PTE 
>>>>> mappings in vmalloc
>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is 
>>>>> specified. To be
>>>>> able to use that by default, we need to be able to change 
>>>>> permissions on
>>>>> sub-regions of an allocation, which is where BBML2 and your series 
>>>>> come in.
>>>>> (there may be other things we need to solve as well; TBD).
>>>>>
>>>>> I think the key thing we need is a function that can take a 
>>>>> page-aligned kernel
>>>>> VA, will walk to the leaf entry for that VA and if the VA is in 
>>>>> the middle of
>>>>> the leaf entry, it will split it so that the VA is now on a 
>>>>> boundary. This will
>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE 
>>>>> entries. The
>>>>> function can assume BBML2 is present. And it will return 0 on 
>>>>> success, -EINVAL
>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a 
>>>>> pgtable to perform
>>>>> the split.
>>>> OK, the v3 patches already handled page table allocation failure 
>>>> with returning
>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes 
>>>> linear mapping
>>>> should be always present. It is easy to return -EINVAL instead of 
>>>> BUG_ON.
>>>> However I'm wondering what usecases you are thinking about? 
>>>> Splitting vmalloc
>>>> area may run into unmapped VA?
>>> I don't think BUG_ON is the right behaviour; crashing the kernel 
>>> should be
>>> discouraged. I think even for vmalloc under correct conditions we 
>>> shouldn't see
>>> any unmapped VA. But vmalloc does handle it gracefully today; see 
>>> (e.g.)
>>> vunmap_pmd_range() which skips the pmd if its none.
>>>
>>>>> Then we can use that primitive on the start and end address of any 
>>>>> range for
>>>>> which we need exact mapping boundaries (e.g. when changing 
>>>>> permissions on part
>>>>> of linear map or vmalloc allocation, when freeing part of a 
>>>>> vmalloc allocation,
>>>>> etc). This way we only split enough to ensure the boundaries are 
>>>>> precise, and
>>>>> keep larger mappings inside the range.
>>>> Yeah, makes sense to me.
>>>>
>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>> apply_to_page_range(), because that assumes page mappings only. 
>>>>> Dev Jain has
>>>>> been working on a series that converts this to use 
>>>>> walk_page_range_novma() so
>>>>> that we can change permissions on the block/contig entries too. 
>>>>> That's not
>>>>> posted publicly yet, but it's not huge so I'll ask if he is 
>>>>> comfortable with
>>>>> posting an RFC early next week.
>>>> OK, so the new __change_memory_common() will change the permission 
>>>> of page
>>>> table, right?
>>> It will change permissions of all the leaf entries in the range of 
>>> VAs it is
>>> passed. Currently it assumes that all the leaf entries are PTEs. But 
>>> we will
>>> generalize to support all the other types of leaf entries too.,
>>>
>>>> If I remember correctly, you suggested change permissions in
>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>> Yes I did. I think this made sense (in my head at least) because in 
>>> the context
>>> of the linear map, all the PFNs are contiguous so it kind-of makes 
>>> sense to
>>> reuse that infrastructure. But it doesn't generalize to vmalloc 
>>> because vmalloc
>>> PFNs are not contiguous. So for that reason, I think it's preferable 
>>> to have an
>>> independent capability.
>>
>> OK, sounds good to me.
>>
>>>
>>>> The current code assumes the address range passed in by 
>>>> change_memory_common()
>>>> is *NOT* physically contiguous so __change_memory_common() handles 
>>>> page table
>>>> permission on page basis. I'm supposed Dev's patches will handle 
>>>> this then my
>>>> patch can safely assume the linear mapping address range for 
>>>> splitting is
>>>> physically contiguous too otherwise I can't keep large mappings 
>>>> inside the
>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>> I'm not sure I fully understand the point you're making here...
>>>
>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>> implementation so that it can walk a VA range and update the 
>>> permissions on each
>>> leaf entry it visits, regadless of which level the leaf entry is at. 
>>> This
>>> doesn't make any assumption of the physical contiguity of 
>>> neighbouring leaf
>>> entries in the page table.
>>>
>>> So if we are changing permissions on the linear map, we have a range 
>>> of VAs to
>>> walk and convert all the leaf entries, regardless of their size. The 
>>> same goes
>>> for vmalloc... But for vmalloc, we will also want to change the 
>>> underlying
>>> permissions in the linear map, so we will have to figure out the 
>>> contiguous
>>> pieces of the linear map and call __change_memory_common() for each; 
>>> there is
>>> definitely some detail to work out there!
>>
>> Yes, this is my point. When changing underlying linear map permission 
>> for vmalloc, the linear map address may be not contiguous. This is 
>> why change_memory_common() calls __change_memory_common() on page basis.
>>
>> But how Dev's patch work should have no impact on how I implement the 
>> split primitive by thinking it further. It should be the caller's 
>> responsibility to make sure __create_pgd_mapping_locked() is called 
>> for contiguous linear map address range.
>>
>>>
>>>>> You'll still need to repaint the whole linear map with page 
>>>>> mappings for the
>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() 
>>>>> (potentially with
>>>>> minor modifications?) can do that repainting on the live mappings; 
>>>>> similar to
>>>>> how you are doing it in v3.
>>>> Yes, when repainting I need to split the page table all the way 
>>>> down to PTE
>>>> level. A simple flag should be good enough to tell 
>>>> __create_pgd_mapping_locked()
>>>> do the right thing off the top of my head.
>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and 
>>> NO_CONT_MAPPINGS
>>> flags? For example, if you are find a leaf mapping and 
>>> NO_BLOCK_MAPPINGS is set,
>>> then you need to split it?
>>
>> Yeah, sounds feasible. Anyway I will figure it out.
>>
>>>
>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>
>>>>> So in summary, what I'm asking for your large block mapping the 
>>>>> linear map
>>>>> series is:
>>>>>     - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>     - Repaint linear map using page mappings if secondary CPUs 
>>>>> don't support BBML2
>>>> OK, I just need to add some simple tweak to split down to PTE level 
>>>> to v3.
>>>>
>>>>>     - Integrate Dev's __change_memory_common() series
>>>> OK, I think I have to do my patches on top of it. Because Dev's 
>>>> patch need
>>>> guarantee the linear mapping address range is physically contiguous.
>>>>
>>>>>     - Create primitive to ensure mapping entry boundary at a given 
>>>>> page-aligned VA
>>>>>     - Use primitive when changing permissions on linear map region
>>>> Sure.
>>>>
>>>>> This will be mergable on its own, but will also provide a great 
>>>>> starting base
>>>>> for adding huge-vmalloc-by-default.
>>>>>
>>>>> What do you think?
>>>> Definitely makes sense to me.
>>>>
>>>> If I remember correctly, we still have some unsolved 
>>>> comments/questions for v3
>>>> in my replies on March 17, particularly:
>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>> Ahh sorry about that. I'll take a look now...
>>
>> No problem.
>>
>> Thanks,
>> Yang
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some 
>>>>>>>>>>> slight changes
>>>>>>>>>>> that
>>>>>>>>>>> have impact to my patches (basically check the new boot 
>>>>>>>>>>> parameter). Do you
>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now 
>>>>>>>>>>> then restart
>>>>>>>>>>> review
>>>>>>>>>>> from the new spin or review the current patches then solve 
>>>>>>>>>>> the new review
>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>> Hi Yang,
>>>>>>>>>>
>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my 
>>>>>>>>>> queue!
>>>>>>>>>>
>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with 
>>>>>>>>>> Miko's series
>>>>>>>>>> and am
>>>>>>>>>> not too bothered about the integration with that; I think 
>>>>>>>>>> it's pretty
>>>>>>>>>> straight
>>>>>>>>>> forward. I'm more interested in how you are handling the 
>>>>>>>>>> splitting, which I
>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>
>>>>>>>>>> I'm hoping to get to this next week before heading out to 
>>>>>>>>>> LSF/MM the
>>>>>>>>>> following
>>>>>>>>>> week (might I see you there?)
>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>> Changelog
>>>>>>>>>>>> =========
>>>>>>>>>>>> v3:
>>>>>>>>>>>>        * Rebased to v6.14-rc4.
>>>>>>>>>>>>        * Based on Miko's BBML2 cpufeature patch
>>>>>>>>>>>> (https://lore.kernel.org/linux-arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>          Also included in this series in order to have the complete patchset.
>>>>>>>>>>>>        * Enhanced __create_pgd_mapping() to handle split as well per Ryan.
>>>>>>>>>>>>        * Supported CONT mappings per Ryan.
>>>>>>>>>>>>        * Supported asymmetric system by splitting kernel linear mapping if such
>>>>>>>>>>>>          system is detected per Ryan. I don't have such system to test, so the
>>>>>>>>>>>>          testing is done by hacking kernel to call linear mapping repainting
>>>>>>>>>>>>          unconditionally. The linear mapping doesn't have any block and cont
>>>>>>>>>>>>          mappings after booting.
>>>>>>>>>>>>
>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>        * Used allowlist to advertise BBM lv2 on the CPUs which can handle TLB
>>>>>>>>>>>>          conflict gracefully per Will Deacon
>>>>>>>>>>>>        * Rebased onto v6.13-rc5
>>>>>>>>>>>>        * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-yang@os.amperecomputing.com/
>>>>>>>>>>>>
>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang@os.amperecomputing.com/
>>>>>>>>>>>>
>>>>>>>>>>>> Description
>>>>>>>>>>>> ===========
>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>
>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is using
>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>        - performance degradation
>>>>>>>>>>>>        - more TLB pressure
>>>>>>>>>>>>        - memory waste for kernel page table
>>>>>>>>>>>>
>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel command
>>>>>>>>>>>> line but this disables the alias checks on page table permissions and
>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>
>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to invalidate the
>>>>>>>>>>>> page table entry when changing page sizes. This allows the kernel to
>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM level 2
>>>>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>
>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>>>>>>>>>>> only.
>>>>>>>>>>>>
>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be repainted once
>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more
>>>>>>>>>>>> details.
>>>>>>>>>>>>
>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>> rodata=full without compromising the security features of the kernel.
>>>>>>>>>>>>
>>>>>>>>>>>> Testing
>>>>>>>>>>>> =======
>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB memory and
>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>
>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>        - Kernel boot.  Kernel needs change kernel linear mapping permission at
>>>>>>>>>>>>          boot stage, if the patch didn't work, kernel typically didn't boot.
>>>>>>>>>>>>        - Module stress from stress-ng. Kernel module load change permission for
>>>>>>>>>>>>          linear mapping.
>>>>>>>>>>>>        - A test kernel module which allocates 80% of total memory via vmalloc(),
>>>>>>>>>>>>          then change the vmalloc area permission to RO, this also change linear
>>>>>>>>>>>>          mapping permission to RO, then change it back before vfree(). Then launch
>>>>>>>>>>>>          a VM which consumes almost all physical memory.
>>>>>>>>>>>>        - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>        - Kernel build in VM with guest kernel which has this series applied.
>>>>>>>>>>>>        - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>        - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>
>>>>>>>>>>>> Performance
>>>>>>>>>>>> ===========
>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>> Before:
>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>
>>>>>>>>>>>> After:
>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>
>>>>>>>>>>>> Around 500MB more memory are free to use. The larger the machine, the
>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>
>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>> * Memcached
>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>
>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>
>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>>>>>>>>>>>>          --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>>>>>>>>>>>>          --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>>>>>>>>>>>>          --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>
>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>>>>>>>>>>>> number of good case is around 90% more than the best number of bad case).
>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>>>>>>>>>>
>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>            arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>
>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>            arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>            arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>>>>>>>>>>            arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>>>            arm64: mm: support split CONT mappings
>>>>>>>>>>>>            arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs
>>>>>>>>>>>>
>>>>>>>>>>>>       arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>       arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>       arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>       arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>       arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>       arch/arm64/kernel/cpufeature.c      | 95 +++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>       arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>>>>>>>>>>>>       arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>       arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>       9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-28  0:00                     ` Yang Shi
  2025-05-28  3:47                       ` Dev Jain
@ 2025-05-28 13:13                       ` Ryan Roberts
  2025-05-28 15:18                         ` Yang Shi
  1 sibling, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-28 13:13 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 28/05/2025 01:00, Yang Shi wrote:
> Hi Ryan,
> 
> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
> were some more comments on Miko's BBML2 patch, it looks like a new spin is
> needed. But AFAICT there should be no significant change to how I advertise
> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
> BBML2 is advertised or not and the erratum seems still be needed to fix up
> AA64MMFR2 BBML2 bits for AmpereOne IIUC.

Yes, I agree this should not impact you too much.

> 
> You also mentioned Dev was working on patches to have __change_memory_common()
> apply permission change on a contiguous range instead of on page basis (the
> status quo). But I have not seen the patches on mailing list yet. However I
> don't think this will result in any significant change to my patches either,
> particularly the split primitive and linear map repainting.

I think you would need Dev's series to be able to apply the permission change
without splitting the whole range down to pte mappings? So I guess your change
must either implement something similar to what Dev is working on, or it splits
the entire range to ptes? If it's the latter, I'm not keen on that approach.

Regarding the linear map repainting, I had a chat with Catalin, and he reminded
me of a potential problem; if you are doing the repainting with the machine
stopped, you can't allocate memory at that point; it's possible a CPU was inside
the allocator when it stopped. And I think you need to allocate intermediate
pgtables, right? Do you have a solution to that problem? I guess one approach
would be to figure out how much memory you will need and pre-allocate it prior
to stopping the machine?
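
Something like the below is the kind of shape I have in mind (purely a sketch;
nr_repaint_pgtables() and linear_map_repaint() are made-up placeholders for
whatever the series ends up calling them):

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/stop_machine.h>

static unsigned long nr_repaint_pgtables(void);	/* made-up: worst-case count */
static int linear_map_repaint(void *data);	/* made-up: stop_machine callback */

/*
 * Pre-allocate every pgtable page the repaint could need, then stop the
 * machine; the callback consumes pages from the pool instead of calling
 * into the allocator while the other CPUs are stopped.
 */
static int repaint_linear_map_page_mappings(void)
{
	unsigned long nr = nr_repaint_pgtables();
	struct page **pool;
	unsigned long i;
	int ret = 0;

	pool = kvcalloc(nr, sizeof(*pool), GFP_KERNEL);
	if (!pool)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		pool[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
		if (!pool[i]) {
			ret = -ENOMEM;
			goto out;
		}
	}

	ret = stop_machine(linear_map_repaint, pool, cpu_online_mask);
out:
	/* assume the callback NULLs out the pool entries it consumed */
	for (i = 0; i < nr; i++)
		if (pool[i])
			__free_page(pool[i]);
	kvfree(pool);
	return ret;
}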

> 
> So I plan to post v4 patches to the mailing list. We can focus on reviewing the
> split primitive and linear map repainting. Does it sound good to you?

That works assuming you have a solution for the above.

Thanks,
Ryan

> 
> Thanks,
> Yang
> 
> 
> On 5/7/25 2:16 PM, Yang Shi wrote:
>>
>>
>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>
>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just gently
>>>>>>>> ping,
>>>>>>>> hopefully we can resume the review soon.
>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>> I'm very
>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>> (although
>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>
>>>>>>> FWIW, having thought about it a bit more, I think some of the suggestions I
>>>>>>> previously made may not have been quite right, but I'll elaborate next week.
>>>>>>> I'm
>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>> vmalloc
>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>> too.
>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>> Hi Yang,
>>>>>
>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>> thinking:
>>>>> I'd eventually like to get to the point where the linear map and most vmalloc
>>>>> memory is mapped using the largest possible mapping granularity (i.e. block
>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>
>>>>> vmalloc has history with trying to do huge mappings by default; it ended up
>>>>> having to be turned into an opt-in feature (instead of the original opt-out
>>>>> approach) because there were problems with some parts of the kernel expecting
>>>>> page mappings. I think we might be able to overcome those issues on arm64 with
>>>>> BBML2.
>>>>>
>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>> vmalloc
>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified. To be
>>>>> able to use that by default, we need to be able to change permissions on
>>>>> sub-regions of an allocation, which is where BBML2 and your series come in.
>>>>> (there may be other things we need to solve as well; TBD).
>>>>>
>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>> kernel
>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the middle of
>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>> will
>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries. The
>>>>> function can assume BBML2 is present. And it will return 0 on success, -EINVAL
>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>> perform
>>>>> the split.
>>>> OK, the v3 patches already handled page table allocation failure with returning
>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear mapping
>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>> However I'm wondering what usecases you are thinking about? Splitting vmalloc
>>>> area may run into unmapped VA?
>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>> discouraged. I think even for vmalloc under correct conditions we shouldn't see
>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>> vunmap_pmd_range() which skips the pmd if its none.
>>>
>>>>> Then we can use that primitive on the start and end address of any range for
>>>>> which we need exact mapping boundaries (e.g. when changing permissions on part
>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>> allocation,
>>>>> etc). This way we only split enough to ensure the boundaries are precise, and
>>>>> keep larger mappings inside the range.
>>>> Yeah, makes sense to me.
>>>>
>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>> apply_to_page_range(), because that assumes page mappings only. Dev Jain has
>>>>> been working on a series that converts this to use walk_page_range_novma() so
>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable with
>>>>> posting an RFC early next week.
>>>> OK, so the new __change_memory_common() will change the permission of page
>>>> table, right?
>>> It will change permissions of all the leaf entries in the range of VAs it is
>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>> generalize to support all the other types of leaf entries too.,
>>>
>>>> If I remember correctly, you suggested change permissions in
>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>> Yes I did. I think this made sense (in my head at least) because in the context
>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>> reuse that infrastructure. But it doesn't generalize to vmalloc because vmalloc
>>> PFNs are not contiguous. So for that reason, I think it's preferable to have an
>>> independent capability.
>>
>> OK, sounds good to me.
>>
>>>
>>>> The current code assumes the address range passed in by change_memory_common()
>>>> is *NOT* physically contiguous so __change_memory_common() handles page table
>>>> permission on page basis. I'm supposed Dev's patches will handle this then my
>>>> patch can safely assume the linear mapping address range for splitting is
>>>> physically contiguous too otherwise I can't keep large mappings inside the
>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>> I'm not sure I fully understand the point you're making here...
>>>
>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>> implementation so that it can walk a VA range and update the permissions on each
>>> leaf entry it visits, regadless of which level the leaf entry is at. This
>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>> entries in the page table.
>>>
>>> So if we are changing permissions on the linear map, we have a range of VAs to
>>> walk and convert all the leaf entries, regardless of their size. The same goes
>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>> permissions in the linear map, so we will have to figure out the contiguous
>>> pieces of the linear map and call __change_memory_common() for each; there is
>>> definitely some detail to work out there!
>>
>> Yes, this is my point. When changing underlying linear map permission for
>> vmalloc, the linear map address may be not contiguous. This is why
>> change_memory_common() calls __change_memory_common() on page basis.
>>
>> But how Dev's patch work should have no impact on how I implement the split
>> primitive by thinking it further. It should be the caller's responsibility to
>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>> address range.
>>
>>>
>>>>> You'll still need to repaint the whole linear map with page mappings for the
>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially
>>>>> with
>>>>> minor modifications?) can do that repainting on the live mappings; similar to
>>>>> how you are doing it in v3.
>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>> level. A simple flag should be good enough to tell
>>>> __create_pgd_mapping_locked()
>>>> do the right thing off the top of my head.
>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and NO_CONT_MAPPINGS
>>> flags? For example, if you are find a leaf mapping and NO_BLOCK_MAPPINGS is set,
>>> then you need to split it?
>>
>> Yeah, sounds feasible. Anyway I will figure it out.
>>
>>>
>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>
>>>>> So in summary, what I'm asking for your large block mapping the linear map
>>>>> series is:
>>>>>     - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>     - Repaint linear map using page mappings if secondary CPUs don't
>>>>> support BBML2
>>>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>>>
>>>>>     - Integrate Dev's __change_memory_common() series
>>>> OK, I think I have to do my patches on top of it. Because Dev's patch need
>>>> guarantee the linear mapping address range is physically contiguous.
>>>>
>>>>>     - Create primitive to ensure mapping entry boundary at a given page-
>>>>> aligned VA
>>>>>     - Use primitive when changing permissions on linear map region
>>>> Sure.
>>>>
>>>>> This will be mergable on its own, but will also provide a great starting base
>>>>> for adding huge-vmalloc-by-default.
>>>>>
>>>>> What do you think?
>>>> Definitely makes sense to me.
>>>>
>>>> If I remember correctly, we still have some unsolved comments/questions for v3
>>>> in my replies on March 17, particularly:
>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>> Ahh sorry about that. I'll take a look now...
>>
>> No problem.
>>
>> Thanks,
>> Yang
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>> changes
>>>>>>>>>>> that
>>>>>>>>>>> have impact to my patches (basically check the new boot parameter).
>>>>>>>>>>> Do you
>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then restart
>>>>>>>>>>> review
>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>> review
>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>> Hi Yang,
>>>>>>>>>>
>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>
>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series
>>>>>>>>>> and am
>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>> straight
>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>> which I
>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>
>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>> following
>>>>>>>>>> week (might I see you there?)
>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>> Changelog
>>>>>>>>>>>> =========
>>>>>>>>>>>> v3:
>>>>>>>>>>>>        * Rebased to v6.14-rc4.
>>>>>>>>>>>>        * Based on Miko's BBML2 cpufeature patch
>>>>>>>>>>>> (https://lore.kernel.org/linux-arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>          Also included in this series in order to have the complete
>>>>>>>>>>>> patchset.
>>>>>>>>>>>>        * Enhanced __create_pgd_mapping() to handle split as well per
>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>        * Supported CONT mappings per Ryan.
>>>>>>>>>>>>        * Supported asymmetric system by splitting kernel linear
>>>>>>>>>>>> mapping if
>>>>>>>>>>>> such
>>>>>>>>>>>>          system is detected per Ryan. I don't have such system to test,
>>>>>>>>>>>> so the
>>>>>>>>>>>>          testing is done by hacking kernel to call linear mapping
>>>>>>>>>>>> repainting
>>>>>>>>>>>>          unconditionally. The linear mapping doesn't have any block and
>>>>>>>>>>>> cont
>>>>>>>>>>>>          mappings after booting.
>>>>>>>>>>>>
>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>        * Used allowlist to advertise BBM lv2 on the CPUs which can
>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>          conflict gracefully per Will Deacon
>>>>>>>>>>>>        * Rebased onto v6.13-rc5
>>>>>>>>>>>>        * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-yang@os.amperecomputing.com/
>>>>>>>>>>>>
>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang@os.amperecomputing.com/
>>>>>>>>>>>>
>>>>>>>>>>>> Description
>>>>>>>>>>>> ===========
>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>
>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>> using
>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>        - performance degradation
>>>>>>>>>>>>        - more TLB pressure
>>>>>>>>>>>>        - memory waste for kernel page table
>>>>>>>>>>>>
>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel command
>>>>>>>>>>>> line but this disables the alias checks on page table permissions and
>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>
>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>> invalidate the
>>>>>>>>>>>> page table entry when changing page sizes. This allows the kernel to
>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM
>>>>>>>>>>>> level 2
>>>>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>
>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>>>>>>>>>>> only.
>>>>>>>>>>>>
>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be repainted
>>>>>>>>>>>> once
>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more
>>>>>>>>>>>> details.
>>>>>>>>>>>>
>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>> rodata=full without compromising the security features of the kernel.
>>>>>>>>>>>>
>>>>>>>>>>>> Testing
>>>>>>>>>>>> =======
>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>> memory and
>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>
>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>        - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>>>> permission at
>>>>>>>>>>>>          boot stage, if the patch didn't work, kernel typically didn't
>>>>>>>>>>>> boot.
>>>>>>>>>>>>        - Module stress from stress-ng. Kernel module load change
>>>>>>>>>>>> permission
>>>>>>>>>>>> for
>>>>>>>>>>>>          linear mapping.
>>>>>>>>>>>>        - A test kernel module which allocates 80% of total memory via
>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>          then change the vmalloc area permission to RO, this also
>>>>>>>>>>>> change
>>>>>>>>>>>> linear
>>>>>>>>>>>>          mapping permission to RO, then change it back before
>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>> launch
>>>>>>>>>>>>          a VM which consumes almost all physical memory.
>>>>>>>>>>>>        - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>        - Kernel build in VM with guest kernel which has this series
>>>>>>>>>>>> applied.
>>>>>>>>>>>>        - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>        - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>
>>>>>>>>>>>> Performance
>>>>>>>>>>>> ===========
>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>> Before:
>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>
>>>>>>>>>>>> After:
>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>
>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine, the
>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>
>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>> * Memcached
>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>> pressure.
>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>
>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>
>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
>>>>>>>>>>>> disk
>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>          --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>          --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>> thread \
>>>>>>>>>>>>          --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>
>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the
>>>>>>>>>>>> worst
>>>>>>>>>>>> number of good case is around 90% more than the best number of bad
>>>>>>>>>>>> case).
>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>>>>>>>>>>
>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>> populated).
>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>            arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>
>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>            arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>            arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>>>>>>>>>>            arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>>>            arm64: mm: support split CONT mappings
>>>>>>>>>>>>            arm64: mm: split linear mapping if BBML2 is not supported on
>>>>>>>>>>>> secondary
>>>>>>>>>>>> CPUs
>>>>>>>>>>>>
>>>>>>>>>>>>       arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>       arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>       arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>       arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>       arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>       arch/arm64/kernel/cpufeature.c      | 95 +++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>       arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>>>>>>>>>>>>       arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>       arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>       9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-28 13:13                       ` Ryan Roberts
@ 2025-05-28 15:18                         ` Yang Shi
  2025-05-28 17:12                           ` Yang Shi
  2025-05-29  7:36                           ` Ryan Roberts
  0 siblings, 2 replies; 49+ messages in thread
From: Yang Shi @ 2025-05-28 15:18 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/28/25 6:13 AM, Ryan Roberts wrote:
> On 28/05/2025 01:00, Yang Shi wrote:
>> Hi Ryan,
>>
>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>> were some more comments on Miko's BBML2 patch, it looks like a new spin is
>> needed. But AFAICT there should be no significant change to how I advertise
>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
>> BBML2 is advertised or not and the erratum seems still be needed to fix up
>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
> Yes, I agree this should not impact you too much.
>
>> You also mentioned Dev was working on patches to have __change_memory_common()
>> apply permission change on a contiguous range instead of on page basis (the
>> status quo). But I have not seen the patches on mailing list yet. However I
>> don't think this will result in any significant change to my patches either,
>> particularly the split primitive and linear map repainting.
> I think you would need Dev's series to be able to apply the permissions change
> without needing to split the whole range to pte mappings? So I guess your change
> must either be implementing something similar to what Dev is working on or you
> are splitting the entire range to ptes? If the latter, then I'm not keen on that
> approach.

I don't think Dev's series is a mandatory prerequisite for my patches.
IIUC, how the split primitive keeps a block mapping that is fully
contained in the range is independent from how the permission change is
applied to it. The new spin implements keeping fully contained block
mappings, as we discussed earlier. I suppose Dev's series just needs to
check whether the mapping is a block or not when applying the
permission change.

The flow conceptually looks like:

split_mapping(start, end)
apply_permission_change(start, end)

split_mapping() guarantees that block mappings fully contained in the
range between start and end are kept intact; that is my series's
responsibility. The current code calls apply_to_page_range() to apply
the permission change, and it does so on a per-PTE basis. So IIUC Dev's
series will modify it or provide a new API, which
__change_memory_common() will then call to change permissions. There is
some overlap between mine and Dev's, but I don't see a strong
dependency.
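
For example (just a sketch to show the division of work;
split_kernel_leaf_mapping() and update_range_prot() are placeholder
names here, not the final API):

static int split_kernel_leaf_mapping(unsigned long addr);	/* placeholder */
static int update_range_prot(unsigned long start, unsigned long end,
			     pgprot_t set_mask, pgprot_t clear_mask);	/* placeholder */

/*
 * The split primitive only guarantees that no block/cont entry straddles
 * start or end; leaf entries fully inside [start, end) keep their size,
 * and the permission walk updates each leaf in place whatever its level.
 */
static int change_kernel_range_permissions(unsigned long start,
					   unsigned long end,
					   pgprot_t set_mask,
					   pgprot_t clear_mask)
{
	int ret;

	/* my series: make start and end fall on leaf entry boundaries */
	ret = split_kernel_leaf_mapping(start);
	if (!ret)
		ret = split_kernel_leaf_mapping(end);
	if (ret)
		return ret;

	/*
	 * Dev's series (or apply_to_page_range() today, PTE-only): apply
	 * set/clear masks to every leaf entry in the range, whether it is
	 * a PUD/PMD block, a cont range or a single PTE.
	 */
	return update_range_prot(start, end, set_mask, clear_mask);
}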

>
> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
> me of a potential problem; if you are doing the repainting with the machine
> stopped, you can't allocate memory at that point; it's possible a CPU was inside
> the allocator when it stopped. And I think you need to allocate intermediate
> pgtables, right? Do you have a solution to that problem? I guess one approach
> would be to figure out how much memory you will need and pre-allocate prior to
> stoping the machine?

OK, I don't remember us discussing this problem before. I think we can
do something like what kpti does. When creating the linear map we know
how many PUD and PMD mappings are created; we can record that number,
and it tells us how many pages we need for repainting the linear map.
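
Something like the below bookkeeping is what I have in mind (sketch
only; the counter names are made up):

/* bumped while painting the linear map at boot */
static unsigned long nr_linear_pud_blocks;
static unsigned long nr_linear_pmd_blocks;

static unsigned long repaint_pgtables_needed(void)
{
	/*
	 * Splitting a PUD block needs one PMD table plus one PTE table per
	 * PMD entry; splitting a PMD block needs one PTE table.  Clearing
	 * the cont bit doesn't need any extra table, so this is an upper
	 * bound on the pages to pre-allocate.
	 */
	return nr_linear_pud_blocks * (1 + PTRS_PER_PMD) +
	       nr_linear_pmd_blocks;
}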

>
>> So I plan to post v4 patches to the mailing list. We can focus on reviewing the
>> split primitive and linear map repainting. Does it sound good to you?
> That works assuming you have a solution for the above.

I think the only missing part is preallocating page tables for 
repainting. I will add this, then post the new spin to the mailing list.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>
>>
>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>
>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>> Hi Ryan,
>>>>>>>>>
>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just gently
>>>>>>>>> ping,
>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>> I'm very
>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>> (although
>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>
>>>>>>>> FWIW, having thought about it a bit more, I think some of the suggestions I
>>>>>>>> previously made may not have been quite right, but I'll elaborate next week.
>>>>>>>> I'm
>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>>> vmalloc
>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>>> too.
>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>> Hi Yang,
>>>>>>
>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>> thinking:
>>>>>> I'd eventually like to get to the point where the linear map and most vmalloc
>>>>>> memory is mapped using the largest possible mapping granularity (i.e. block
>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>
>>>>>> vmalloc has history with trying to do huge mappings by default; it ended up
>>>>>> having to be turned into an opt-in feature (instead of the original opt-out
>>>>>> approach) because there were problems with some parts of the kernel expecting
>>>>>> page mappings. I think we might be able to overcome those issues on arm64 with
>>>>>> BBML2.
>>>>>>
>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>> vmalloc
>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified. To be
>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>> sub-regions of an allocation, which is where BBML2 and your series come in.
>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>
>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>> kernel
>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the middle of
>>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>>> will
>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries. The
>>>>>> function can assume BBML2 is present. And it will return 0 on success, -EINVAL
>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>> perform
>>>>>> the split.
>>>>> OK, the v3 patches already handled page table allocation failure with returning
>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear mapping
>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>> However I'm wondering what usecases you are thinking about? Splitting vmalloc
>>>>> area may run into unmapped VA?
>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>> discouraged. I think even for vmalloc under correct conditions we shouldn't see
>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>> vunmap_pmd_range() which skips the pmd if its none.
>>>>
>>>>>> Then we can use that primitive on the start and end address of any range for
>>>>>> which we need exact mapping boundaries (e.g. when changing permissions on part
>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>> allocation,
>>>>>> etc). This way we only split enough to ensure the boundaries are precise, and
>>>>>> keep larger mappings inside the range.
>>>>> Yeah, makes sense to me.
>>>>>
>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev Jain has
>>>>>> been working on a series that converts this to use walk_page_range_novma() so
>>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable with
>>>>>> posting an RFC early next week.
>>>>> OK, so the new __change_memory_common() will change the permission of page
>>>>> table, right?
>>>> It will change permissions of all the leaf entries in the range of VAs it is
>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>>> generalize to support all the other types of leaf entries too.,
>>>>
>>>>> If I remember correctly, you suggested change permissions in
>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>> Yes I did. I think this made sense (in my head at least) because in the context
>>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because vmalloc
>>>> PFNs are not contiguous. So for that reason, I think it's preferable to have an
>>>> independent capability.
>>> OK, sounds good to me.
>>>
>>>>> The current code assumes the address range passed in by change_memory_common()
>>>>> is *NOT* physically contiguous so __change_memory_common() handles page table
>>>>> permission on page basis. I'm supposed Dev's patches will handle this then my
>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>> physically contiguous too otherwise I can't keep large mappings inside the
>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>> I'm not sure I fully understand the point you're making here...
>>>>
>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>> implementation so that it can walk a VA range and update the permissions on each
>>>> leaf entry it visits, regadless of which level the leaf entry is at. This
>>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>>> entries in the page table.
>>>>
>>>> So if we are changing permissions on the linear map, we have a range of VAs to
>>>> walk and convert all the leaf entries, regardless of their size. The same goes
>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>> permissions in the linear map, so we will have to figure out the contiguous
>>>> pieces of the linear map and call __change_memory_common() for each; there is
>>>> definitely some detail to work out there!
>>> Yes, this is my point. When changing underlying linear map permission for
>>> vmalloc, the linear map address may be not contiguous. This is why
>>> change_memory_common() calls __change_memory_common() on page basis.
>>>
>>> But how Dev's patch work should have no impact on how I implement the split
>>> primitive by thinking it further. It should be the caller's responsibility to
>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>> address range.
>>>
>>>>>> You'll still need to repaint the whole linear map with page mappings for the
>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially
>>>>>> with
>>>>>> minor modifications?) can do that repainting on the live mappings; similar to
>>>>>> how you are doing it in v3.
>>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>>> level. A simple flag should be good enough to tell
>>>>> __create_pgd_mapping_locked()
>>>>> do the right thing off the top of my head.
>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and NO_CONT_MAPPINGS
>>>> flags? For example, if you are find a leaf mapping and NO_BLOCK_MAPPINGS is set,
>>>> then you need to split it?
>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>
>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>
>>>>>> So in summary, what I'm asking for your large block mapping the linear map
>>>>>> series is:
>>>>>>      - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>      - Repaint linear map using page mappings if secondary CPUs don't
>>>>>> support BBML2
>>>>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>>>>
>>>>>>      - Integrate Dev's __change_memory_common() series
>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch need
>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>
>>>>>>      - Create primitive to ensure mapping entry boundary at a given page-
>>>>>> aligned VA
>>>>>>      - Use primitive when changing permissions on linear map region
>>>>> Sure.
>>>>>
>>>>>> This will be mergable on its own, but will also provide a great starting base
>>>>>> for adding huge-vmalloc-by-default.
>>>>>>
>>>>>> What do you think?
>>>>> Definitely makes sense to me.
>>>>>
>>>>> If I remember correctly, we still have some unsolved comments/questions for v3
>>>>> in my replies on March 17, particularly:
>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>> Ahh sorry about that. I'll take a look now...
>>> No problem.
>>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>
>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>> changes
>>>>>>>>>>>> that
>>>>>>>>>>>> have impact to my patches (basically check the new boot parameter).
>>>>>>>>>>>> Do you
>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then restart
>>>>>>>>>>>> review
>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>> review
>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>
>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>>
>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's series
>>>>>>>>>>> and am
>>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>>> straight
>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>> which I
>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>
>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>>> following
>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Yang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>> =========
>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>         * Rebased to v6.14-rc4.
>>>>>>>>>>>>>         * Based on Miko's BBML2 cpufeature patch
>>>>>>>>>>>>> (https://lore.kernel.org/linux-arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>           Also included in this series in order to have the complete
>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>         * Enhanced __create_pgd_mapping() to handle split as well per
>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>         * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>         * Supported asymmetric system by splitting kernel linear
>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>> such
>>>>>>>>>>>>>           system is detected per Ryan. I don't have such system to test,
>>>>>>>>>>>>> so the
>>>>>>>>>>>>>           testing is done by hacking kernel to call linear mapping
>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>           unconditionally. The linear mapping doesn't have any block and
>>>>>>>>>>>>> cont
>>>>>>>>>>>>>           mappings after booting.
>>>>>>>>>>>>>
>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>         * Used allowlist to advertise BBM lv2 on the CPUs which can
>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>           conflict gracefully per Will Deacon
>>>>>>>>>>>>>         * Rebased onto v6.13-rc5
>>>>>>>>>>>>>         * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-yang@os.amperecomputing.com/
>>>>>>>>>>>>>
>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang@os.amperecomputing.com/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Description
>>>>>>>>>>>>> ===========
>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>>> using
>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>         - performance degradation
>>>>>>>>>>>>>         - more TLB pressure
>>>>>>>>>>>>>         - memory waste for kernel page table
>>>>>>>>>>>>>
>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel command
>>>>>>>>>>>>> line but this disables the alias checks on page table permissions and
>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>> page table entry when changing page sizes. This allows the kernel to
>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM
>>>>>>>>>>>>> level 2
>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using PTEs
>>>>>>>>>>>>> only.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be repainted
>>>>>>>>>>>>> once
>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more
>>>>>>>>>>>>> details.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>> rodata=full without compromising the security features of the kernel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Testing
>>>>>>>>>>>>> =======
>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>> memory and
>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>         - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>           boot stage, if the patch didn't work, kernel typically didn't
>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>         - Module stress from stress-ng. Kernel module load change
>>>>>>>>>>>>> permission
>>>>>>>>>>>>> for
>>>>>>>>>>>>>           linear mapping.
>>>>>>>>>>>>>         - A test kernel module which allocates 80% of total memory via
>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>           then change the vmalloc area permission to RO, this also
>>>>>>>>>>>>> change
>>>>>>>>>>>>> linear
>>>>>>>>>>>>>           mapping permission to RO, then change it back before
>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>> launch
>>>>>>>>>>>>>           a VM which consumes almost all physical memory.
>>>>>>>>>>>>>         - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>         - Kernel build in VM with guest kernel which has this series
>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>         - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>         - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Performance
>>>>>>>>>>>>> ===========
>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>> Before:
>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>
>>>>>>>>>>>>> After:
>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>
>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine, the
>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
>>>>>>>>>>>>> disk
>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>           --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>           --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>           --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>
>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the
>>>>>>>>>>>>> worst
>>>>>>>>>>>>> number of good case is around 90% more than the best number of bad
>>>>>>>>>>>>> case).
>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced proportionally.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>> populated).
>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>             arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>             arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>>             arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>>>>>>>>>>>             arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>>>>             arm64: mm: support split CONT mappings
>>>>>>>>>>>>>             arm64: mm: split linear mapping if BBML2 is not supported on
>>>>>>>>>>>>> secondary
>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>
>>>>>>>>>>>>>        arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>        arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>        arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>        arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>        arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>        arch/arm64/kernel/cpufeature.c      | 95 +++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>        arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>>>>>>>>>>>>>        arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>        arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>        9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-28 15:18                         ` Yang Shi
@ 2025-05-28 17:12                           ` Yang Shi
  2025-05-29  8:48                             ` Ryan Roberts
  2025-05-29  7:36                           ` Ryan Roberts
  1 sibling, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-28 17:12 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/28/25 8:18 AM, Yang Shi wrote:
>
>
> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>> On 28/05/2025 01:00, Yang Shi wrote:
>>> Hi Ryan,
>>>
>>> I got a new spin ready in my local tree on top of v6.15-rc4. I 
>>> noticed there
>>> were some more comments on Miko's BBML2 patch, it looks like a new 
>>> spin is
>>> needed. But AFAICT there should be no significant change to how I 
>>> advertise
>>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check 
>>> whether
>>> BBML2 is advertised or not and the erratum seems still be needed to 
>>> fix up
>>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>> Yes, I agree this should not impact you too much.
>>
>>> You also mentioned Dev was working on patches to have 
>>> __change_memory_common()
>>> apply permission change on a contiguous range instead of on page 
>>> basis (the
>>> status quo). But I have not seen the patches on mailing list yet. 
>>> However I
>>> don't think this will result in any significant change to my patches 
>>> either,
>>> particularly the split primitive and linear map repainting.
>> I think you would need Dev's series to be able to apply the 
>> permissions change
>> without needing to split the whole range to pte mappings? So I guess 
>> your change
>> must either be implementing something similar to what Dev is working 
>> on or you
>> are splitting the entire range to ptes? If the latter, then I'm not 
>> keen on that
>> approach.
>
> I don't think Dev's series is mandatory prerequisite for my patches. 
> IIUC how the split primitive keeps block mapping if it is fully 
> contained is independent from how to apply the permissions change on it.
> The new spin implemented keeping block mapping if it is fully 
> contained as we discussed earlier. I'm supposed Dev's series just need 
> to check whether the mapping is block or not when applying permission 
> change.
>
> The flow just looks like as below conceptually:
>
> split_mapping(start, end)
> apply_permission_change(start, end)
>
> The split_mapping() guarantees keep block mapping if it is fully 
> contained in the range between start and end, this is my series's 
> responsibility. I know the current code calls apply_to_page_range() to 
> apply permission change and it just does it on PTE basis. So IIUC 
> Dev's series will modify it or provide a new API, then 
> __change_memory_common() will call it to change permission. There 
> should be some overlap between mine and Dev's, but I don't see strong 
> dependency.
>
>>
>> Regarding the linear map repainting, I had a chat with Catalin, and 
>> he reminded
>> me of a potential problem; if you are doing the repainting with the 
>> machine
>> stopped, you can't allocate memory at that point; it's possible a CPU 
>> was inside
>> the allocator when it stopped. And I think you need to allocate 
>> intermediate
>> pgtables, right? Do you have a solution to that problem? I guess one 
>> approach
>> would be to figure out how much memory you will need and pre-allocate 
>> prior to
>> stoping the machine?
>
> OK, I don't remember we discussed this problem before. I think we can 
> do something like what kpti does. When creating the linear map we know 
> how many PUD and PMD mappings are created, we can record the number, 
> it will tell how many pages we need for repainting the linear map.

Looking at the kpti code further, it seems kpti also allocates memory
with the machine stopped, but it only does the allocation on CPU 0.
IIUC this guarantees the allocation will not run on a CPU which was
inside the allocator when it stopped, because CPU 0 is the one running
stop_machine().

My patch already guarantees repainting runs only on CPU 0 (the boot
CPU), because the boot CPU is the only one we can be sure supports
BBML2, so we can repaint the linear map safely. This also guarantees
the "stopped CPU was inside the allocator" problem won't happen, IIUC.
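
i.e. roughly (sketch; split_linear_map_to_ptes() stands in for the real
repaint routine):

#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/stop_machine.h>

static void split_linear_map_to_ptes(void);	/* placeholder */

/*
 * Mirror the kpti approach: all online CPUs are parked in stop_machine(),
 * but only CPU 0 (the boot CPU, which we know supports BBML2) walks and
 * repaints the linear map.
 */
static int __repaint_linear_map(void *unused)
{
	if (!smp_processor_id())
		split_linear_map_to_ptes();

	return 0;
}

static void repaint_linear_map(void)
{
	stop_machine(__repaint_linear_map, NULL, cpu_online_mask);
}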

Thanks,
Yang

>
>>
>>> So I plan to post v4 patches to the mailing list. We can focus on 
>>> reviewing the
>>> split primitive and linear map repainting. Does it sound good to you?
>> That works assuming you have a solution for the above.
>
> I think the only missing part is preallocating page tables for 
> repainting. I will add this, then post the new spin to the mailing list.
>
> Thanks,
> Yang
>
>>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Yang
>>>
>>>
>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>
>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>>
>>>>>>>>>> I know you may have a lot of things to follow up after 
>>>>>>>>>> LSF/MM. Just gently
>>>>>>>>>> ping,
>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd 
>>>>>>>>> April. But
>>>>>>>>> I'm very
>>>>>>>>> keen to move this series forward so will come back to you next 
>>>>>>>>> week.
>>>>>>>>> (although
>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>
>>>>>>>>> FWIW, having thought about it a bit more, I think some of the 
>>>>>>>>> suggestions I
>>>>>>>>> previously made may not have been quite right, but I'll 
>>>>>>>>> elaborate next week.
>>>>>>>>> I'm
>>>>>>>>> keen to build a pgtable splitting primitive here that we can 
>>>>>>>>> reuse with
>>>>>>>>> vmalloc
>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>> Sounds good. I think the patches can support splitting vmalloc 
>>>>>>>> page table
>>>>>>>> too.
>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>> Hi Yang,
>>>>>>>
>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm 
>>>>>>> currently
>>>>>>> thinking:
>>>>>>> I'd eventually like to get to the point where the linear map and 
>>>>>>> most vmalloc
>>>>>>> memory is mapped using the largest possible mapping granularity 
>>>>>>> (i.e. block
>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>
>>>>>>> vmalloc has history with trying to do huge mappings by default; 
>>>>>>> it ended up
>>>>>>> having to be turned into an opt-in feature (instead of the 
>>>>>>> original opt-out
>>>>>>> approach) because there were problems with some parts of the 
>>>>>>> kernel expecting
>>>>>>> page mappings. I think we might be able to overcome those issues 
>>>>>>> on arm64 with
>>>>>>> BBML2.
>>>>>>>
>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, 
>>>>>>> and I have a
>>>>>>> series (that should make v6.16) that enables contiguous PTE 
>>>>>>> mappings in
>>>>>>> vmalloc
>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is 
>>>>>>> specified. To be
>>>>>>> able to use that by default, we need to be able to change 
>>>>>>> permissions on
>>>>>>> sub-regions of an allocation, which is where BBML2 and your 
>>>>>>> series come in.
>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>
>>>>>>> I think the key thing we need is a function that can take a 
>>>>>>> page-aligned
>>>>>>> kernel
>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in 
>>>>>>> the middle of
>>>>>>> the leaf entry, it will split it so that the VA is now on a 
>>>>>>> boundary. This
>>>>>>> will
>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE 
>>>>>>> entries. The
>>>>>>> function can assume BBML2 is present. And it will return 0 on 
>>>>>>> success, -EINVAL
>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a 
>>>>>>> pgtable to
>>>>>>> perform
>>>>>>> the split.
>>>>>> OK, the v3 patches already handled page table allocation failure 
>>>>>> with returning
>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes 
>>>>>> linear mapping
>>>>>> should be always present. It is easy to return -EINVAL instead of 
>>>>>> BUG_ON.
>>>>>> However I'm wondering what usecases you are thinking about? 
>>>>>> Splitting vmalloc
>>>>>> area may run into unmapped VA?
>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel 
>>>>> should be
>>>>> discouraged. I think even for vmalloc under correct conditions we 
>>>>> shouldn't see
>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see 
>>>>> (e.g.)
>>>>> vunmap_pmd_range() which skips the pmd if its none.
>>>>>
>>>>>>> Then we can use that primitive on the start and end address of 
>>>>>>> any range for
>>>>>>> which we need exact mapping boundaries (e.g. when changing 
>>>>>>> permissions on part
>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>> allocation,
>>>>>>> etc). This way we only split enough to ensure the boundaries are 
>>>>>>> precise, and
>>>>>>> keep larger mappings inside the range.
>>>>>> Yeah, makes sense to me.
>>>>>>
>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>> apply_to_page_range(), because that assumes page mappings only. 
>>>>>>> Dev Jain has
>>>>>>> been working on a series that converts this to use 
>>>>>>> walk_page_range_novma() so
>>>>>>> that we can change permissions on the block/contig entries too. 
>>>>>>> That's not
>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is 
>>>>>>> comfortable with
>>>>>>> posting an RFC early next week.
>>>>>> OK, so the new __change_memory_common() will change the 
>>>>>> permission of page
>>>>>> table, right?
>>>>> It will change permissions of all the leaf entries in the range of 
>>>>> VAs it is
>>>>> passed. Currently it assumes that all the leaf entries are PTEs. 
>>>>> But we will
>>>>> generalize to support all the other types of leaf entries too.,
>>>>>
>>>>>> If I remember correctly, you suggested change permissions in
>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>> Yes I did. I think this made sense (in my head at least) because 
>>>>> in the context
>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes 
>>>>> sense to
>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc 
>>>>> because vmalloc
>>>>> PFNs are not contiguous. So for that reason, I think it's 
>>>>> preferable to have an
>>>>> independent capability.
>>>> OK, sounds good to me.
>>>>
>>>>>> The current code assumes the address range passed in by 
>>>>>> change_memory_common()
>>>>>> is *NOT* physically contiguous so __change_memory_common() 
>>>>>> handles page table
>>>>>> permission on page basis. I'm supposed Dev's patches will handle 
>>>>>> this then my
>>>>>> patch can safely assume the linear mapping address range for 
>>>>>> splitting is
>>>>>> physically contiguous too otherwise I can't keep large mappings 
>>>>>> inside the
>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>> I'm not sure I fully understand the point you're making here...
>>>>>
>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>> implementation so that it can walk a VA range and update the 
>>>>> permissions on each
>>>>> leaf entry it visits, regadless of which level the leaf entry is 
>>>>> at. This
>>>>> doesn't make any assumption of the physical contiguity of 
>>>>> neighbouring leaf
>>>>> entries in the page table.
>>>>>
>>>>> So if we are changing permissions on the linear map, we have a 
>>>>> range of VAs to
>>>>> walk and convert all the leaf entries, regardless of their size. 
>>>>> The same goes
>>>>> for vmalloc... But for vmalloc, we will also want to change the 
>>>>> underlying
>>>>> permissions in the linear map, so we will have to figure out the 
>>>>> contiguous
>>>>> pieces of the linear map and call __change_memory_common() for 
>>>>> each; there is
>>>>> definitely some detail to work out there!
>>>> Yes, this is my point. When changing underlying linear map 
>>>> permission for
>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>
>>>> But how Dev's patch work should have no impact on how I implement 
>>>> the split
>>>> primitive by thinking it further. It should be the caller's 
>>>> responsibility to
>>>> make sure __create_pgd_mapping_locked() is called for contiguous 
>>>> linear map
>>>> address range.
>>>>
>>>>>>> You'll still need to repaint the whole linear map with page 
>>>>>>> mappings for the
>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() 
>>>>>>> (potentially
>>>>>>> with
>>>>>>> minor modifications?) can do that repainting on the live 
>>>>>>> mappings; similar to
>>>>>>> how you are doing it in v3.
>>>>>> Yes, when repainting I need to split the page table all the way 
>>>>>> down to PTE
>>>>>> level. A simple flag should be good enough to tell
>>>>>> __create_pgd_mapping_locked()
>>>>>> do the right thing off the top of my head.
>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and 
>>>>> NO_CONT_MAPPINGS
>>>>> flags? For example, if you are find a leaf mapping and 
>>>>> NO_BLOCK_MAPPINGS is set,
>>>>> then you need to split it?
>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>
>>>>>>> Miko's BBML2 series should hopefully get imminently queued for 
>>>>>>> v6.16.
>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>
>>>>>>> So in summary, what I'm asking for your large block mapping the 
>>>>>>> linear map
>>>>>>> series is:
>>>>>>>      - Paint linear map using blocks/contig if boot CPU supports 
>>>>>>> BBML2
>>>>>>>      - Repaint linear map using page mappings if secondary CPUs 
>>>>>>> don't
>>>>>>> support BBML2
>>>>>> OK, I just need to add some simple tweak to split down to PTE 
>>>>>> level to v3.
>>>>>>
>>>>>>>      - Integrate Dev's __change_memory_common() series
>>>>>> OK, I think I have to do my patches on top of it. Because Dev's 
>>>>>> patch need
>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>
>>>>>>>      - Create primitive to ensure mapping entry boundary at a 
>>>>>>> given page-
>>>>>>> aligned VA
>>>>>>>      - Use primitive when changing permissions on linear map region
>>>>>> Sure.
>>>>>>
>>>>>>> This will be mergable on its own, but will also provide a great 
>>>>>>> starting base
>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>
>>>>>>> What do you think?
>>>>>> Definitely makes sense to me.
>>>>>>
>>>>>> If I remember correctly, we still have some unsolved 
>>>>>> comments/questions for v3
>>>>>> in my replies on March 17, particularly:
>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>> Ahh sorry about that. I'll take a look now...
>>>> No problem.
>>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are 
>>>>>>>>>>>>> some slight
>>>>>>>>>>>>> changes
>>>>>>>>>>>>> that
>>>>>>>>>>>>> have impact to my patches (basically check the new boot 
>>>>>>>>>>>>> parameter).
>>>>>>>>>>>>> Do you
>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right 
>>>>>>>>>>>>> now then restart
>>>>>>>>>>>>> review
>>>>>>>>>>>>> from the new spin or review the current patches then solve 
>>>>>>>>>>>>> the new
>>>>>>>>>>>>> review
>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in 
>>>>>>>>>>>> my queue!
>>>>>>>>>>>>
>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with 
>>>>>>>>>>>> Miko's series
>>>>>>>>>>>> and am
>>>>>>>>>>>> not too bothered about the integration with that; I think 
>>>>>>>>>>>> it's pretty
>>>>>>>>>>>> straight
>>>>>>>>>>>> forward. I'm more interested in how you are handling the 
>>>>>>>>>>>> splitting,
>>>>>>>>>>>> which I
>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>
>>>>>>>>>>>> I'm hoping to get to this next week before heading out to 
>>>>>>>>>>>> LSF/MM the
>>>>>>>>>>>> following
>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>         * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>         * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>           Also included in this series in order to have 
>>>>>>>>>>>>>> the complete
>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>         * Enhanced __create_pgd_mapping() to handle split 
>>>>>>>>>>>>>> as well per
>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>         * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>         * Supported asymmetric system by splitting kernel 
>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>           system is detected per Ryan. I don't have such 
>>>>>>>>>>>>>> system to test,
>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>           testing is done by hacking kernel to call 
>>>>>>>>>>>>>> linear mapping
>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>           unconditionally. The linear mapping doesn't 
>>>>>>>>>>>>>> have any block and
>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>           mappings after booting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>         * Used allowlist to advertise BBM lv2 on the CPUs 
>>>>>>>>>>>>>> which can
>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>           conflict gracefully per Will Deacon
>>>>>>>>>>>>>>         * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>         * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> RFC v1: 
>>>>>>>>>>>>>> https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE 
>>>>>>>>>>>>>> due to arm's
>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A number of performance issues arise when the kernel 
>>>>>>>>>>>>>> linear map is
>>>>>>>>>>>>>> using
>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>         - performance degradation
>>>>>>>>>>>>>>         - more TLB pressure
>>>>>>>>>>>>>>         - memory waste for kernel page table
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the 
>>>>>>>>>>>>>> kernel command
>>>>>>>>>>>>>> line but this disables the alias checks on page table 
>>>>>>>>>>>>>> permissions and
>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>> page table entry when changing page sizes. This allows 
>>>>>>>>>>>>>> the kernel to
>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This patch adds support for splitting large mappings when 
>>>>>>>>>>>>>> FEAT_BBM
>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>> is available and rodata=full is used. This functionality 
>>>>>>>>>>>>>> will be used
>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear 
>>>>>>>>>>>>>> map using PTEs
>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping 
>>>>>>>>>>>>>> may be repainted
>>>>>>>>>>>>>> once
>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch 
>>>>>>>>>>>>>> #6 for more
>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We saw significant performance increases in some 
>>>>>>>>>>>>>> benchmarks with
>>>>>>>>>>>>>> rodata=full without compromising the security features of 
>>>>>>>>>>>>>> the kernel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) 
>>>>>>>>>>>>>> with 256GB
>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>         - Kernel boot.  Kernel needs change kernel linear 
>>>>>>>>>>>>>> mapping
>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>           boot stage, if the patch didn't work, kernel 
>>>>>>>>>>>>>> typically didn't
>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>         - Module stress from stress-ng. Kernel module 
>>>>>>>>>>>>>> load change
>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>           linear mapping.
>>>>>>>>>>>>>>         - A test kernel module which allocates 80% of 
>>>>>>>>>>>>>> total memory via
>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>           then change the vmalloc area permission to RO, 
>>>>>>>>>>>>>> this also
>>>>>>>>>>>>>> change
>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>           mapping permission to RO, then change it back 
>>>>>>>>>>>>>> before
>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>           a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>         - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>         - Kernel build in VM with guest kernel which has 
>>>>>>>>>>>>>> this series
>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>         - rodata=on. Make sure other rodata mode is not 
>>>>>>>>>>>>>> broken.
>>>>>>>>>>>>>>         - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the 
>>>>>>>>>>>>>> machine, the
>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>> We saw performance degradation when running Memcached 
>>>>>>>>>>>>>> benchmark with
>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to 
>>>>>>>>>>>>>> kernel TLB
>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 
>>>>>>>>>>>>>> 3.5%, P99
>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The 
>>>>>>>>>>>>>> kernel TLB
>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G 
>>>>>>>>>>>>>> ramdisk (ext4) with
>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr 
>>>>>>>>>>>>>> --norandommap --
>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>           --status-interval=999 --rw=write --bs=4k 
>>>>>>>>>>>>>> --loops=1 --
>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>           --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>           --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is 
>>>>>>>>>>>>>> high, but the
>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>> number of good case is around 90% more than the best 
>>>>>>>>>>>>>> number of bad
>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced 
>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page 
>>>>>>>>>>>>>> cache
>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>             arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>             arm64: cpufeature: add AmpereOne to BBML2 
>>>>>>>>>>>>>> allow list
>>>>>>>>>>>>>>             arm64: mm: make __create_pgd_mapping() and 
>>>>>>>>>>>>>> helpers non-void
>>>>>>>>>>>>>>             arm64: mm: support large block mapping when 
>>>>>>>>>>>>>> rodata=full
>>>>>>>>>>>>>>             arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>             arm64: mm: split linear mapping if BBML2 is 
>>>>>>>>>>>>>> not supported on
>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>> arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>> arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c      | 95 
>>>>>>>>>>>>>> ++++++++++++++++++++++++
>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>> +++++++
>>>>>>>>>>>>>> arch/arm64/mm/mmu.c                 | 397 
>>>>>>>>>>>>>> ++++++++++++++++++++
>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>>>>>> arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>> arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>        9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-28 15:18                         ` Yang Shi
  2025-05-28 17:12                           ` Yang Shi
@ 2025-05-29  7:36                           ` Ryan Roberts
  2025-05-29 16:37                             ` Yang Shi
  1 sibling, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-29  7:36 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 28/05/2025 16:18, Yang Shi wrote:
> 
> 
> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>> On 28/05/2025 01:00, Yang Shi wrote:
>>> Hi Ryan,
>>>
>>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>>> were some more comments on Miko's BBML2 patch, it looks like a new spin is
>>> needed. But AFAICT there should be no significant change to how I advertise
>>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
>>> BBML2 is advertised or not and the erratum seems still be needed to fix up
>>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>> Yes, I agree this should not impact you too much.
>>
>>> You also mentioned Dev was working on patches to have __change_memory_common()
>>> apply permission change on a contiguous range instead of on page basis (the
>>> status quo). But I have not seen the patches on mailing list yet. However I
>>> don't think this will result in any significant change to my patches either,
>>> particularly the split primitive and linear map repainting.
>> I think you would need Dev's series to be able to apply the permissions change
>> without needing to split the whole range to pte mappings? So I guess your change
>> must either be implementing something similar to what Dev is working on or you
>> are splitting the entire range to ptes? If the latter, then I'm not keen on that
>> approach.
> 
> I don't think Dev's series is mandatory prerequisite for my patches. IIUC how
> the split primitive keeps block mapping if it is fully contained is independent
> from how to apply the permissions change on it.
> The new spin implemented keeping block mapping if it is fully contained as we
> discussed earlier. I'm supposed Dev's series just need to check whether the
> mapping is block or not when applying permission change.

The way I was thinking the split primitive would work, you would need Dev's
change as a prerequisite, so I suspect we both have a slightly different idea of
how this will work.

> 
> The flow just looks like as below conceptually:
> 
> split_mapping(start, end)
> apply_permission_change(start, end)

The flow I was thinking of would be this:

split_mapping(start)
split_mapping(end)
apply_permission_change(start, end)

split_mapping() takes a virtual address that is at least page aligned and when
it returns, ensures that the address is at the start of a leaf mapping. And it
will only break the leaf mappings down so that they are the maximum size that
can still meet the requirement.

As an example, let's suppose you initially start with a region that is composed
entirely of 2M mappings. Then you want to change permissions of a region [2052K,
6208K).

Before any splitting, you have:

  - 2M   x4: [0, 8192K)

Then you call split_mapping(start=2052K):

  - 2M   x1: [0, 2048K)
  - 4K  x16: [2048K, 2112K)  << start is the start of the second 4K leaf mapping
  - 64K x31: [2112K, 4096K)
  - 2M   x2: [4096K, 8192K)

Then you call split_mapping(end=6208K):

  - 2M   x1: [0, 2048K)
  - 4K  x16: [2048K, 2112K)
  - 64K x31: [2112K, 4096K)
  - 2M   x1: [4096K, 6144K)
  - 64K x32: [6144K, 8192K)  << end is the end of the first 64K leaf mapping

So then when you call apply_permission_change(start=2052K, end=6208K), the
following leaf mappings' permissions will be modified:

  - 4K  x15: [2052K, 2112K)
  - 64K x31: [2112K, 4096K)
  - 2M   x1: [4096K, 6144K)
  - 64K  x1: [6144K, 6208K)

Since there are block mappings in this range, Dev's change is required to change
the permissions.

This approach means that we only ever split the minimum required number of
mappings, and we only split them to the largest size that still satisfies the
alignment requirement.
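
In rough pseudo-C, the caller side would then be something like the below.
This is an illustrative sketch only; split_mapping() is the hypothetical
primitive described above and the signatures are made up:

static int change_kernel_range_perms(unsigned long start, unsigned long end,
                                     pgprot_t set_mask, pgprot_t clear_mask)
{
        int ret;

        /* Make sure start sits on a leaf mapping boundary. */
        ret = split_mapping(start);
        if (ret)
                return ret;

        /* Make sure end sits on a leaf mapping boundary. */
        ret = split_mapping(end);
        if (ret)
                return ret;

        /*
         * With the walk_page_range_novma() based version of
         * __change_memory_common(), this updates whatever leaf entries
         * remain in [start, end); block/contig mappings that are fully
         * contained keep their original size.
         */
        return __change_memory_common(start, end - start,
                                      set_mask, clear_mask);
}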

> 
> The split_mapping() guarantees keep block mapping if it is fully contained in
> the range between start and end, this is my series's responsibility. I know the
> current code calls apply_to_page_range() to apply permission change and it just
> does it on PTE basis. So IIUC Dev's series will modify it or provide a new API,
> then __change_memory_common() will call it to change permission. There should be
> some overlap between mine and Dev's, but I don't see strong dependency.

But if you have a block mapping in the region you are calling
__change_memory_common() on, today that will fail because it can only handle
page mappings.

> 
>>
>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>> me of a potential problem; if you are doing the repainting with the machine
>> stopped, you can't allocate memory at that point; it's possible a CPU was inside
>> the allocator when it stopped. And I think you need to allocate intermediate
>> pgtables, right? Do you have a solution to that problem? I guess one approach
>> would be to figure out how much memory you will need and pre-allocate prior to
>> stoping the machine?
> 
> OK, I don't remember we discussed this problem before. I think we can do
> something like what kpti does. When creating the linear map we know how many PUD
> and PMD mappings are created, we can record the number, it will tell how many
> pages we need for repainting the linear map.

I saw a separate reply you sent for this. I'll read that and respond in that
context.

Thanks,
Ryan

> 
>>
>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing the
>>> split primitive and linear map repainting. Does it sound good to you?
>> That works assuming you have a solution for the above.
> 
> I think the only missing part is preallocating page tables for repainting. I
> will add this, then post the new spin to the mailing list.
> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Yang
>>>
>>>
>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>
>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>>
>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>> gently
>>>>>>>>>> ping,
>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>> I'm very
>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>> (although
>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>
>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>> suggestions I
>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>> week.
>>>>>>>>> I'm
>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>>>> vmalloc
>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>>>> too.
>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>> Hi Yang,
>>>>>>>
>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>> thinking:
>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>> vmalloc
>>>>>>> memory is mapped using the largest possible mapping granularity (i.e. block
>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>
>>>>>>> vmalloc has history with trying to do huge mappings by default; it ended up
>>>>>>> having to be turned into an opt-in feature (instead of the original opt-out
>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>> expecting
>>>>>>> page mappings. I think we might be able to overcome those issues on arm64
>>>>>>> with
>>>>>>> BBML2.
>>>>>>>
>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>> vmalloc
>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>> To be
>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come in.
>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>
>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>> kernel
>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>> middle of
>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>>>> will
>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries.
>>>>>>> The
>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>> EINVAL
>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>> perform
>>>>>>> the split.
>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>> returning
>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear mapping
>>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>>> However I'm wondering what usecases you are thinking about? Splitting vmalloc
>>>>>> area may run into unmapped VA?
>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>> discouraged. I think even for vmalloc under correct conditions we shouldn't
>>>>> see
>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>> vunmap_pmd_range() which skips the pmd if its none.
>>>>>
>>>>>>> Then we can use that primitive on the start and end address of any range for
>>>>>>> which we need exact mapping boundaries (e.g. when changing permissions on
>>>>>>> part
>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>> allocation,
>>>>>>> etc). This way we only split enough to ensure the boundaries are precise,
>>>>>>> and
>>>>>>> keep larger mappings inside the range.
>>>>>> Yeah, makes sense to me.
>>>>>>
>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev Jain has
>>>>>>> been working on a series that converts this to use
>>>>>>> walk_page_range_novma() so
>>>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable with
>>>>>>> posting an RFC early next week.
>>>>>> OK, so the new __change_memory_common() will change the permission of page
>>>>>> table, right?
>>>>> It will change permissions of all the leaf entries in the range of VAs it is
>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>>>> generalize to support all the other types of leaf entries too.,
>>>>>
>>>>>> If I remember correctly, you suggested change permissions in
>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>> context
>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>> vmalloc
>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>> have an
>>>>> independent capability.
>>>> OK, sounds good to me.
>>>>
>>>>>> The current code assumes the address range passed in by
>>>>>> change_memory_common()
>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page table
>>>>>> permission on page basis. I'm supposed Dev's patches will handle this then my
>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>> physically contiguous too otherwise I can't keep large mappings inside the
>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>> I'm not sure I fully understand the point you're making here...
>>>>>
>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>> implementation so that it can walk a VA range and update the permissions on
>>>>> each
>>>>> leaf entry it visits, regadless of which level the leaf entry is at. This
>>>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>>>> entries in the page table.
>>>>>
>>>>> So if we are changing permissions on the linear map, we have a range of VAs to
>>>>> walk and convert all the leaf entries, regardless of their size. The same goes
>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>> permissions in the linear map, so we will have to figure out the contiguous
>>>>> pieces of the linear map and call __change_memory_common() for each; there is
>>>>> definitely some detail to work out there!
>>>> Yes, this is my point. When changing underlying linear map permission for
>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>
>>>> But how Dev's patch work should have no impact on how I implement the split
>>>> primitive by thinking it further. It should be the caller's responsibility to
>>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>>> address range.
>>>>
>>>>>>> You'll still need to repaint the whole linear map with page mappings for the
>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially
>>>>>>> with
>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>> similar to
>>>>>>> how you are doing it in v3.
>>>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>>>> level. A simple flag should be good enough to tell
>>>>>> __create_pgd_mapping_locked()
>>>>>> do the right thing off the top of my head.
>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>> NO_CONT_MAPPINGS
>>>>> flags? For example, if you are find a leaf mapping and NO_BLOCK_MAPPINGS is
>>>>> set,
>>>>> then you need to split it?
>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>
>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>
>>>>>>> So in summary, what I'm asking for your large block mapping the linear map
>>>>>>> series is:
>>>>>>>      - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>      - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>> support BBML2
>>>>>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>>>>>
>>>>>>>      - Integrate Dev's __change_memory_common() series
>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch need
>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>
>>>>>>>      - Create primitive to ensure mapping entry boundary at a given page-
>>>>>>> aligned VA
>>>>>>>      - Use primitive when changing permissions on linear map region
>>>>>> Sure.
>>>>>>
>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>> base
>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>
>>>>>>> What do you think?
>>>>>> Definitely makes sense to me.
>>>>>>
>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>> for v3
>>>>>> in my replies on March 17, particularly:
>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>> Ahh sorry about that. I'll take a look now...
>>>> No problem.
>>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>>> changes
>>>>>>>>>>>>> that
>>>>>>>>>>>>> have impact to my patches (basically check the new boot parameter).
>>>>>>>>>>>>> Do you
>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>> restart
>>>>>>>>>>>>> review
>>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>>> review
>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>>>
>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>> series
>>>>>>>>>>>> and am
>>>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>>>> straight
>>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>> which I
>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>
>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>>>> following
>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>         * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>         * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>           Also included in this series in order to have the complete
>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>         * Enhanced __create_pgd_mapping() to handle split as well per
>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>         * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>         * Supported asymmetric system by splitting kernel linear
>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>           system is detected per Ryan. I don't have such system to
>>>>>>>>>>>>>> test,
>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>           testing is done by hacking kernel to call linear mapping
>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>           unconditionally. The linear mapping doesn't have any
>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>           mappings after booting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>         * Used allowlist to advertise BBM lv2 on the CPUs which can
>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>           conflict gracefully per Will Deacon
>>>>>>>>>>>>>>         * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>         * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>>>> using
>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>         - performance degradation
>>>>>>>>>>>>>>         - more TLB pressure
>>>>>>>>>>>>>>         - memory waste for kernel page table
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel
>>>>>>>>>>>>>> command
>>>>>>>>>>>>>> line but this disables the alias checks on page table permissions and
>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the kernel to
>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM
>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be
>>>>>>>>>>>>>> repainted
>>>>>>>>>>>>>> once
>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more
>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>> rodata=full without compromising the security features of the kernel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>         - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>           boot stage, if the patch didn't work, kernel typically
>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>         - Module stress from stress-ng. Kernel module load change
>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>           linear mapping.
>>>>>>>>>>>>>>         - A test kernel module which allocates 80% of total memory
>>>>>>>>>>>>>> via
>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>           then change the vmalloc area permission to RO, this also
>>>>>>>>>>>>>> change
>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>           mapping permission to RO, then change it back before
>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>           a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>         - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>         - Kernel build in VM with guest kernel which has this series
>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>         - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>>         - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine,
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4)
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>           --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>           --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>           --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the
>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>> number of good case is around 90% more than the best number of bad
>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced
>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>             arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>             arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>>>             arm64: mm: make __create_pgd_mapping() and helpers
>>>>>>>>>>>>>> non-void
>>>>>>>>>>>>>>             arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>>>>>             arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>             arm64: mm: split linear mapping if BBML2 is not
>>>>>>>>>>>>>> supported on
>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>>        arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>>        arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>>        arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>>        arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>>        arch/arm64/kernel/cpufeature.c      | 95 ++++++++++++++++++
>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>> +++++++
>>>>>>>>>>>>>>        arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++
>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>>>>>>        arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>>        arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>        9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-28 17:12                           ` Yang Shi
@ 2025-05-29  8:48                             ` Ryan Roberts
  2025-05-29 15:33                               ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-29  8:48 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 28/05/2025 18:12, Yang Shi wrote:
> 
> 
> On 5/28/25 8:18 AM, Yang Shi wrote:
>>
>>
>> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>>> On 28/05/2025 01:00, Yang Shi wrote:
>>>> Hi Ryan,
>>>>
>>>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>>>> were some more comments on Miko's BBML2 patch, it looks like a new spin is
>>>> needed. But AFAICT there should be no significant change to how I advertise
>>>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
>>>> BBML2 is advertised or not and the erratum seems still be needed to fix up
>>>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>>> Yes, I agree this should not impact you too much.
>>>
>>>> You also mentioned Dev was working on patches to have __change_memory_common()
>>>> apply permission change on a contiguous range instead of on page basis (the
>>>> status quo). But I have not seen the patches on mailing list yet. However I
>>>> don't think this will result in any significant change to my patches either,
>>>> particularly the split primitive and linear map repainting.
>>> I think you would need Dev's series to be able to apply the permissions change
>>> without needing to split the whole range to pte mappings? So I guess your change
>>> must either be implementing something similar to what Dev is working on or you
>>> are splitting the entire range to ptes? If the latter, then I'm not keen on that
>>> approach.
>>
>> I don't think Dev's series is mandatory prerequisite for my patches. IIUC how
>> the split primitive keeps block mapping if it is fully contained is
>> independent from how to apply the permissions change on it.
>> The new spin implemented keeping block mapping if it is fully contained as we
>> discussed earlier. I'm supposed Dev's series just need to check whether the
>> mapping is block or not when applying permission change.
>>
>> The flow just looks like as below conceptually:
>>
>> split_mapping(start, end)
>> apply_permission_change(start, end)
>>
>> The split_mapping() guarantees keep block mapping if it is fully contained in
>> the range between start and end, this is my series's responsibility. I know
>> the current code calls apply_to_page_range() to apply permission change and it
>> just does it on PTE basis. So IIUC Dev's series will modify it or provide a
>> new API, then __change_memory_common() will call it to change permission.
>> There should be some overlap between mine and Dev's, but I don't see strong
>> dependency.
>>
>>>
>>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>>> me of a potential problem; if you are doing the repainting with the machine
>>> stopped, you can't allocate memory at that point; it's possible a CPU was inside
>>> the allocator when it stopped. And I think you need to allocate intermediate
>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>> would be to figure out how much memory you will need and pre-allocate prior to
>>> stoping the machine?
>>
>> OK, I don't remember we discussed this problem before. I think we can do
>> something like what kpti does. When creating the linear map we know how many
>> PUD and PMD mappings are created, we can record the number, it will tell how
>> many pages we need for repainting the linear map.
> 
> Looking at the kpti code further, it looks like kpti also allocates memory
> with the machine stopped, but it calls the memory allocator on CPU 0 only. 

Oh yes, I hadn't spotted that. It looks like a special case that may be ok for
kpti though; it's allocating a fairly small amount of memory (max levels=5 so
max order=3) and it's doing it with GFP_ATOMIC. So if my understanding of the
page allocator is correct, then this should be allocated from a per-cpu reserve?
Which means that it never needs to take a lock that other, stopped CPUs could be
holding. And GFP_ATOMIC guarantees that the thread will never sleep, which I
think is not allowed while the machine is stopped.

> IIUC this
> guarantees the allocation will not be done on a CPU which was inside the
> allocator when it stopped, because CPU 0 is the one running stop_machine().

My concern was a bit more general; if any other CPU was inside the allocator
holding a lock when the machine was stopped, and CPU 0 then comes along and
makes a call to the allocator that requires that lock, we have a deadlock.

All that said, looking at the stop_machine() docs, it says:

 * Description: This causes a thread to be scheduled on every cpu,
 * each of which disables interrupts.  The result is that no one is
 * holding a spinlock or inside any other preempt-disabled region when
 * @fn() runs.

So I think my deadlock concern was unfounded. I think as long as you can
guarantee that fn() won't try to sleep then you should be safe? So I guess
allocating from within fn() should be safe as long as you use GFP_ATOMIC?
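
i.e. something of this shape (untested sketch, just to illustrate the
constraint; none of this is from the actual patch):

#include <linux/gfp.h>
#include <linux/smp.h>
#include <linux/stop_machine.h>

static int repaint_linear_map_fn(void *unused)
{
        struct page *page;

        /* Only the boot CPU repaints; the other CPUs are just held here. */
        if (smp_processor_id())
                return 0;

        /* GFP_ATOMIC never sleeps; sleeping is not allowed in here. */
        page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
        if (!page)
                return -ENOMEM;

        /* ... use 'page' as the next level of table for the split ... */

        return 0;
}

static void repaint_linear_map(void)
{
        stop_machine(repaint_linear_map_fn, NULL, cpu_online_mask);
}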

Thanks,
Ryan

> 
> My patch already guarantees repainting runs on CPU 0 (the boot CPU) only,
> because we can only be sure that the boot CPU supports BBML2, so we can repaint
> the linear map safely there. This also guarantees the "stopped CPU was inside
> the allocator" problem won't happen IIUC.
> 
> Thanks,
> Yang
> 
>>
>>>
>>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing the
>>>> split primitive and linear map repainting. Does it sound good to you?
>>> That works assuming you have a solution for the above.
>>
>> I think the only missing part is preallocating page tables for repainting. I
>> will add this, then post the new spin to the mailing list.
>>
>> Thanks,
>> Yang
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>
>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>>
>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>>> gently
>>>>>>>>>>> ping,
>>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>>> I'm very
>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>> (although
>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>
>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>> suggestions I
>>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>>> week.
>>>>>>>>>> I'm
>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>>>>> vmalloc
>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>>>>> too.
>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>> Hi Yang,
>>>>>>>>
>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>> thinking:
>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>> vmalloc
>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e. block
>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>
>>>>>>>> vmalloc has history with trying to do huge mappings by default; it ended up
>>>>>>>> having to be turned into an opt-in feature (instead of the original opt-out
>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>> expecting
>>>>>>>> page mappings. I think we might be able to overcome those issues on
>>>>>>>> arm64 with
>>>>>>>> BBML2.
>>>>>>>>
>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>>> vmalloc
>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>>> To be
>>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come in.
>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>
>>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>>> kernel
>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>> middle of
>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>>>>> will
>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE
>>>>>>>> entries. The
>>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>>> EINVAL
>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>>> perform
>>>>>>>> the split.
>>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>>> returning
>>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear mapping
>>>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>>>> However I'm wondering what usecases you are thinking about? Splitting
>>>>>>> vmalloc
>>>>>>> area may run into unmapped VA?
>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>>> discouraged. I think even for vmalloc under correct conditions we
>>>>>> shouldn't see
>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>> vunmap_pmd_range() which skips the pmd if its none.
>>>>>>
>>>>>>>> Then we can use that primitive on the start and end address of any range
>>>>>>>> for
>>>>>>>> which we need exact mapping boundaries (e.g. when changing permissions
>>>>>>>> on part
>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>> allocation,
>>>>>>>> etc). This way we only split enough to ensure the boundaries are
>>>>>>>> precise, and
>>>>>>>> keep larger mappings inside the range.
>>>>>>> Yeah, makes sense to me.
>>>>>>>
>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev Jain
>>>>>>>> has
>>>>>>>> been working on a series that converts this to use
>>>>>>>> walk_page_range_novma() so
>>>>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable
>>>>>>>> with
>>>>>>>> posting an RFC early next week.
>>>>>>> OK, so the new __change_memory_common() will change the permission of page
>>>>>>> table, right?
>>>>>> It will change permissions of all the leaf entries in the range of VAs it is
>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>>>>> generalize to support all the other types of leaf entries too.,
>>>>>>
>>>>>>> If I remember correctly, you suggested change permissions in
>>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>> context
>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>> vmalloc
>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>> have an
>>>>>> independent capability.
>>>>> OK, sounds good to me.
>>>>>
>>>>>>> The current code assumes the address range passed in by
>>>>>>> change_memory_common()
>>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page
>>>>>>> table
>>>>>>> permission on page basis. I'm supposed Dev's patches will handle this
>>>>>>> then my
>>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>>> physically contiguous too otherwise I can't keep large mappings inside the
>>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>
>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>> implementation so that it can walk a VA range and update the permissions
>>>>>> on each
>>>>>> leaf entry it visits, regadless of which level the leaf entry is at. This
>>>>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>>>>> entries in the page table.
>>>>>>
>>>>>> So if we are changing permissions on the linear map, we have a range of
>>>>>> VAs to
>>>>>> walk and convert all the leaf entries, regardless of their size. The same
>>>>>> goes
>>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>>> permissions in the linear map, so we will have to figure out the contiguous
>>>>>> pieces of the linear map and call __change_memory_common() for each; there is
>>>>>> definitely some detail to work out there!
>>>>> Yes, this is my point. When changing underlying linear map permission for
>>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>>
>>>>> But how Dev's patch work should have no impact on how I implement the split
>>>>> primitive by thinking it further. It should be the caller's responsibility to
>>>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>>>> address range.
>>>>>
>>>>>>>> You'll still need to repaint the whole linear map with page mappings for
>>>>>>>> the
>>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially
>>>>>>>> with
>>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>>> similar to
>>>>>>>> how you are doing it in v3.
>>>>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>>>>> level. A simple flag should be good enough to tell
>>>>>>> __create_pgd_mapping_locked()
>>>>>>> do the right thing off the top of my head.
>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>> NO_CONT_MAPPINGS
>>>>>> flags? For example, if you are find a leaf mapping and NO_BLOCK_MAPPINGS
>>>>>> is set,
>>>>>> then you need to split it?
>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>
>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>>
>>>>>>>> So in summary, what I'm asking for your large block mapping the linear map
>>>>>>>> series is:
>>>>>>>>      - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>>      - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>>> support BBML2
>>>>>>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>>>>>>
>>>>>>>>      - Integrate Dev's __change_memory_common() series
>>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch need
>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>
>>>>>>>>      - Create primitive to ensure mapping entry boundary at a given page-
>>>>>>>> aligned VA
>>>>>>>>      - Use primitive when changing permissions on linear map region
>>>>>>> Sure.
>>>>>>>
>>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>>> base
>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>> Definitely makes sense to me.
>>>>>>>
>>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>>> for v3
>>>>>>> in my replies on March 17, particularly:
>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>>> Ahh sorry about that. I'll take a look now...
>>>>> No problem.
>>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> have impact to my patches (basically check the new boot parameter).
>>>>>>>>>>>>>> Do you
>>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>> review
>>>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>>>> review
>>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>> series
>>>>>>>>>>>>> and am
>>>>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>>>>> straight
>>>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>>> which I
>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>>>>> following
>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Yang
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>>         * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>>         * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>>           Also included in this series in order to have the complete
>>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>>         * Enhanced __create_pgd_mapping() to handle split as well
>>>>>>>>>>>>>>> per
>>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>>         * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>>         * Supported asymmetric system by splitting kernel linear
>>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>           system is detected per Ryan. I don't have such system
>>>>>>>>>>>>>>> to test,
>>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>>           testing is done by hacking kernel to call linear mapping
>>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>>           unconditionally. The linear mapping doesn't have any
>>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>>           mappings after booting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>>         * Used allowlist to advertise BBM lv2 on the CPUs which can
>>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>>           conflict gracefully per Will Deacon
>>>>>>>>>>>>>>>         * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>>         * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>>         - performance degradation
>>>>>>>>>>>>>>>         - more TLB pressure
>>>>>>>>>>>>>>>         - memory waste for kernel page table
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel
>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>> line but this disables the alias checks on page table permissions
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the kernel to
>>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM
>>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be
>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be
>>>>>>>>>>>>>>> repainted
>>>>>>>>>>>>>>> once
>>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>>> rodata=full without compromising the security features of the
>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>>         - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>>           boot stage, if the patch didn't work, kernel typically
>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>>         - Module stress from stress-ng. Kernel module load change
>>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>           linear mapping.
>>>>>>>>>>>>>>>         - A test kernel module which allocates 80% of total
>>>>>>>>>>>>>>> memory via
>>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>>           then change the vmalloc area permission to RO, this also
>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>           mapping permission to RO, then change it back before
>>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>>           a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>>         - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>>         - Kernel build in VM with guest kernel which has this series
>>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>>         - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>>>         - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the
>>>>>>>>>>>>>>> machine, the
>>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4)
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>>           --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>>           --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>>           --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the
>>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>>> number of good case is around 90% more than the best number of bad
>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced
>>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>>             arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>>             arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>>>>             arm64: mm: make __create_pgd_mapping() and helpers
>>>>>>>>>>>>>>> non-void
>>>>>>>>>>>>>>>             arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>>>>>>             arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>>             arm64: mm: split linear mapping if BBML2 is not
>>>>>>>>>>>>>>> supported on
>>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>>> arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>>> arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>>> arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>>> arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>>> arch/arm64/kernel/cpufeature.c      | 95 ++++++++++++++++++++++++
>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>> +++++++
>>>>>>>>>>>>>>> arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++
>>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>>>>>>> arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>>> arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>>        9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29  8:48                             ` Ryan Roberts
@ 2025-05-29 15:33                               ` Ryan Roberts
  2025-05-29 17:35                                 ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-29 15:33 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 29/05/2025 09:48, Ryan Roberts wrote:

[...]

>>>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>>>> me of a potential problem; if you are doing the repainting with the machine
>>>> stopped, you can't allocate memory at that point; it's possible a CPU was inside
>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>> would be to figure out how much memory you will need and pre-allocate prior to
>>>> stoping the machine?
>>>
>>> OK, I don't remember we discussed this problem before. I think we can do
>>> something like what kpti does. When creating the linear map we know how many
>>> PUD and PMD mappings are created, we can record the number, it will tell how
>>> many pages we need for repainting the linear map.
>>
>> Looking at the kpti code further, it looks like kpti also allocates memory with the
>> machine stopped, but it calls memory allocation on cpu 0 only. 
> 
> Oh yes, I hadn't spotted that. It looks like a special case that may be ok for
> kpti though; it's allocating a fairly small amount of memory (max levels=5 so
> max order=3) and it's doing it with GFP_ATOMIC. So if my understanding of the
> page allocator is correct, then this should be allocated from a per-cpu reserve?
> Which means that it never needs to take a lock that other, stopped CPUs could be
> holding. And GFP_ATOMIC guarantees that the thread will never sleep, which I
> think is not allowed while the machine is stopped.
> 
>> IIUC this
>> guarantees the code will not be called on a CPU which was inside the allocator
>> when it stopped because CPU 0 is running stop_machine().
> 
> My concern was a bit more general; if any other CPU was inside the allocator
> holding a lock when the machine was stopped, then if CPU 0 comes along and makes
> a call to the allocator that requires the lock, then we have a deadlock.
> 
> All that said, looking at the stop_machine() docs, it says:
> 
>  * Description: This causes a thread to be scheduled on every cpu,
>  * each of which disables interrupts.  The result is that no one is
>  * holding a spinlock or inside any other preempt-disabled region when
>  * @fn() runs.
> 
> So I think my deadlock concern was unfounded. I think as long as you can
> guarantee that fn() won't try to sleep then you should be safe? So I guess
> allocating from within fn() should be safe as long as you use GFP_ATOMIC?

I just had another conversation about this internally, and there is another
concern; we obviously don't want to modify the pgtables while other CPUs that
don't support BBML2 could be accessing them. Even in stop_machine() this may be
possible if the CPU stacks and task structure (for example) are allocated out of
the linear map.

So we need to be careful to follow the pattern used by kpti: all secondary CPUs
need to switch to the idmap (which is installed in TTBR0), then install the
reserved map in TTBR1, then wait for CPU 0 to repaint the linear map, then
switch TTBR1 back to swapper, and finally switch back out of the idmap.

Given CPU 0 supports BBML2, I think it can just update the linear map live,
without needing to do the idmap dance?
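
Very roughly, the flow I'm imagining is below (untested sketch;
repaint_linear_map() is hypothetical, and the "reserved map in TTBR1" step
is deliberately not shown - as with kpti that part would need an
idmap-resident helper):

static atomic_t repaint_done = ATOMIC_INIT(0);

static int repaint_fn(void *data)
{
        if (smp_processor_id() == 0) {
                /* Boot CPU supports BBML2: rewrite the live entries. */
                repaint_linear_map();                   /* hypothetical */
                atomic_set(&repaint_done, 1);
        } else {
                /*
                 * Secondaries must stop using swapper while CPU 0 rewrites
                 * it: run from the idmap, and (not shown) point TTBR1 at a
                 * reserved table until CPU 0 is done.
                 */
                cpu_install_idmap();
                while (!atomic_read(&repaint_done))
                        cpu_relax();
                cpu_uninstall_idmap();
        }
        return 0;
}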

Thanks,
Ryan


> 
> Thanks,
> Ryan
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29  7:36                           ` Ryan Roberts
@ 2025-05-29 16:37                             ` Yang Shi
  2025-05-29 17:01                               ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-29 16:37 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/29/25 12:36 AM, Ryan Roberts wrote:
> On 28/05/2025 16:18, Yang Shi wrote:
>>
>> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>>> On 28/05/2025 01:00, Yang Shi wrote:
>>>> Hi Ryan,
>>>>
>>>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>>>> were some more comments on Miko's BBML2 patch, it looks like a new spin is
>>>> needed. But AFAICT there should be no significant change to how I advertise
>>>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
>>>> BBML2 is advertised or not and the erratum seems still be needed to fix up
>>>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>>> Yes, I agree this should not impact you too much.
>>>
>>>> You also mentioned Dev was working on patches to have __change_memory_common()
>>>> apply permission change on a contiguous range instead of on page basis (the
>>>> status quo). But I have not seen the patches on mailing list yet. However I
>>>> don't think this will result in any significant change to my patches either,
>>>> particularly the split primitive and linear map repainting.
>>> I think you would need Dev's series to be able to apply the permissions change
>>> without needing to split the whole range to pte mappings? So I guess your change
>>> must either be implementing something similar to what Dev is working on or you
>>> are splitting the entire range to ptes? If the latter, then I'm not keen on that
>>> approach.
>> I don't think Dev's series is mandatory prerequisite for my patches. IIUC how
>> the split primitive keeps block mapping if it is fully contained is independent
>> from how to apply the permissions change on it.
>> The new spin implemented keeping block mapping if it is fully contained as we
>> discussed earlier. I'm supposed Dev's series just need to check whether the
>> mapping is block or not when applying permission change.
> The way I was thinking the split primitive would work, you would need Dev's
> change as a prerequisite, so I suspect we both have a slightly different idea of
> how this will work.
>
>> The flow just looks like as below conceptually:
>>
>> split_mapping(start, end)
>> apply_permission_change(start, end)
> The flow I was thinking of would be this:
>
> split_mapping(start)
> split_mapping(end)
> apply_permission_change(start, end)
>
> split_mapping() takes a virtual address that is at least page aligned and when
> it returns, ensures that the address is at the start of a leaf mapping. And it
> will only break the leaf mappings down so that they are the maximum size that
> can still meet the requirement.
>
> As an example, let's suppose you initially start with a region that is composed
> entirely of 2M mappings. Then you want to change permissions of a region [2052K,
> 6208K).
>
> Before any splitting, you have:
>
>    - 2M   x4: [0, 8192K)
>
> Then you call split_mapping(start=2052K):
>
>    - 2M   x1: [0, 2048K)
>    - 4K  x16: [2048K, 2112K)  << start is the start of the second 4K leaf mapping
>    - 64K x31: [2112K, 4096K)
>    - 2M:  x2: [4096K, 8192K)
>
> Then you call split_mapping(end=6208K):
>
>    - 2M   x1: [0, 2048K)
>    - 4K  x16: [2048K, 2112K)
>    - 64K x31: [2112K, 4096K)
>    - 2M:  x1: [4096K, 6144K)
>    - 64K x32: [6144K, 8192K)  << end is the end of the first 64K leaf mapping
>
> So then when you call apply_permission_change(start=2052K, end=6208K), the
> following leaf mappings' permissions will be modified:
>
>    - 4K  x15: [2052K, 2112K)
>    - 64K x31: [2112K, 4096K)
>    - 2M:  x1: [4096K, 6144K)
>    - 64K  x1: [6144K, 6208K)
>
> Since there are block mappings in this range, Dev's change is required to change
> the permissions.
>
> This approach means that we only ever split the minimum required number of
> mappings and we only split them to the largest size that still provides the
> alignment requirement.

I see your point. I believe we are on the same page: keep the block 
mappings in the range intact as much as we can. My implementation actually 
ends up with the same result as your example shows. I guess we just 
have different ideas about how to implement it.

However I do have a hard time understanding why we can't just use 
split_mapping(start, end). We can reuse some of the existing code easily 
with "end", because the existing code already calculates the page table 
(PUD/PMD/CONT PMD/CONT PTE) boundaries, so I reused it. Basically my 
implementation just skips to the next page table if:
   * the start address is at a page table boundary, and
   * "end" is greater than that page table boundary.

The logic may be a little bit convoluted, and I'm not sure I articulated 
it well. Anyway the code will explain everything.
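
Just to illustrate the idea, the PMD-level test is roughly the below
(simplified; split_pmd() is a placeholder name, and the PUD/cont levels,
locking and TLB maintenance are omitted):

static int split_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end)
{
        unsigned long next;
        int ret;

        do {
                next = pmd_addr_end(addr, end);

                if (pmd_leaf(pmdp_get(pmdp))) {
                        /* Block fully inside [addr, end): keep it whole. */
                        if ((addr & ~PMD_MASK) == 0 && next == addr + PMD_SIZE)
                                continue;

                        /* Partially covered: split it down to PTEs. */
                        ret = split_pmd(pmdp, addr);    /* placeholder */
                        if (ret)
                                return ret;
                }
        } while (pmdp++, addr = next, addr != end);

        return 0;
}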

>
>> The split_mapping() guarantees keep block mapping if it is fully contained in
>> the range between start and end, this is my series's responsibility. I know the
>> current code calls apply_to_page_range() to apply permission change and it just
>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new API,
>> then __change_memory_common() will call it to change permission. There should be
>> some overlap between mine and Dev's, but I don't see strong dependency.
> But if you have a block mapping in the region you are calling
> __change_memory_common() on, today that will fail because it can only handle
> page mappings.

IMHO letting __change_memory_common() operate on a contiguous address 
range is another story and should not be part of the split primitive.

For example, we need to use vmalloc_huge() instead of vmalloc() to 
allocate huge memory, then do:
split_mapping(start, start + HPAGE_PMD_SIZE);
change_permission(start, start + HPAGE_PMD_SIZE);

The split primitive will guarantee [start, start + HPAGE_PMD_SIZE) is 
kept as a PMD mapping so that change_permission() can change it on a PMD 
basis too.

But this requires other kernel subsystems, for example the module loader, 
to allocate huge memory with the proper APIs, such as vmalloc_huge().
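
As a concrete sketch of that flow (hypothetical; split_mapping() is the
primitive being discussed and set_memory_ro() stands in for
change_permission()):

static int make_huge_region_ro(void)
{
        /* PMD-sized allocation that vmalloc can map with a single PMD. */
        void *p = vmalloc_huge(PMD_SIZE, GFP_KERNEL);
        unsigned long va = (unsigned long)p;

        if (!p)
                return -ENOMEM;

        /* Both boundaries are PMD-aligned, so nothing actually splits. */
        split_mapping(va, va + PMD_SIZE);

        /* The permission change can then operate on one PMD leaf entry. */
        return set_memory_ro(va, PMD_SIZE / PAGE_SIZE);
}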

Thanks,
Yang

>
>>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>>> me of a potential problem; if you are doing the repainting with the machine
>>> stopped, you can't allocate memory at that point; it's possible a CPU was inside
>>> the allocator when it stopped. And I think you need to allocate intermediate
>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>> would be to figure out how much memory you will need and pre-allocate prior to
>>> stoping the machine?
>> OK, I don't remember we discussed this problem before. I think we can do
>> something like what kpti does. When creating the linear map we know how many PUD
>> and PMD mappings are created, we can record the number, it will tell how many
>> pages we need for repainting the linear map.
> I saw a separate reply you sent for this. I'll read that and respond in that
> context.
>
> Thanks,
> Ryan
>
>>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing the
>>>> split primitive and linear map repainting. Does it sound good to you?
>>> That works assuming you have a solution for the above.
>> I think the only missing part is preallocating page tables for repainting. I
>> will add this, then post the new spin to the mailing list.
>>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>
>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>
>>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>>> gently
>>>>>>>>>>> ping,
>>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>>> I'm very
>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>> (although
>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>
>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>> suggestions I
>>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>>> week.
>>>>>>>>>> I'm
>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>>>>> vmalloc
>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>>>>> too.
>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>> Hi Yang,
>>>>>>>>
>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>> thinking:
>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>> vmalloc
>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e. block
>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>
>>>>>>>> vmalloc has history with trying to do huge mappings by default; it ended up
>>>>>>>> having to be turned into an opt-in feature (instead of the original opt-out
>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>> expecting
>>>>>>>> page mappings. I think we might be able to overcome those issues on arm64
>>>>>>>> with
>>>>>>>> BBML2.
>>>>>>>>
>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>>> vmalloc
>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>>> To be
>>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come in.
>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>
>>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>>> kernel
>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>> middle of
>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>>>>> will
>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries.
>>>>>>>> The
>>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>>> EINVAL
>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>>> perform
>>>>>>>> the split.
>>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>>> returning
>>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear mapping
>>>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>>>> However I'm wondering what usecases you are thinking about? Splitting vmalloc
>>>>>>> area may run into unmapped VA?
>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>>> discouraged. I think even for vmalloc under correct conditions we shouldn't
>>>>>> see
>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>> vunmap_pmd_range() which skips the pmd if its none.
>>>>>>
>>>>>>>> Then we can use that primitive on the start and end address of any range for
>>>>>>>> which we need exact mapping boundaries (e.g. when changing permissions on
>>>>>>>> part
>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>> allocation,
>>>>>>>> etc). This way we only split enough to ensure the boundaries are precise,
>>>>>>>> and
>>>>>>>> keep larger mappings inside the range.
>>>>>>> Yeah, makes sense to me.
>>>>>>>
>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev Jain has
>>>>>>>> been working on a series that converts this to use
>>>>>>>> walk_page_range_novma() so
>>>>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable with
>>>>>>>> posting an RFC early next week.
>>>>>>> OK, so the new __change_memory_common() will change the permission of page
>>>>>>> table, right?
>>>>>> It will change permissions of all the leaf entries in the range of VAs it is
>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>>>>> generalize to support all the other types of leaf entries too.,
>>>>>>
>>>>>>> If I remember correctly, you suggested change permissions in
>>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>> context
>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>> vmalloc
>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>> have an
>>>>>> independent capability.
>>>>> OK, sounds good to me.
>>>>>
>>>>>>> The current code assumes the address range passed in by
>>>>>>> change_memory_common()
>>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page table
>>>>>>> permission on page basis. I'm supposed Dev's patches will handle this then my
>>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>>> physically contiguous too otherwise I can't keep large mappings inside the
>>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>
>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>> implementation so that it can walk a VA range and update the permissions on
>>>>>> each
>>>>>> leaf entry it visits, regadless of which level the leaf entry is at. This
>>>>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>>>>> entries in the page table.
>>>>>>
>>>>>> So if we are changing permissions on the linear map, we have a range of VAs to
>>>>>> walk and convert all the leaf entries, regardless of their size. The same goes
>>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>>> permissions in the linear map, so we will have to figure out the contiguous
>>>>>> pieces of the linear map and call __change_memory_common() for each; there is
>>>>>> definitely some detail to work out there!
>>>>> Yes, this is my point. When changing underlying linear map permission for
>>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>>
>>>>> But how Dev's patch work should have no impact on how I implement the split
>>>>> primitive by thinking it further. It should be the caller's responsibility to
>>>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>>>> address range.
>>>>>
>>>>>>>> You'll still need to repaint the whole linear map with page mappings for the
>>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially
>>>>>>>> with
>>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>>> similar to
>>>>>>>> how you are doing it in v3.
>>>>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>>>>> level. A simple flag should be good enough to tell
>>>>>>> __create_pgd_mapping_locked()
>>>>>>> do the right thing off the top of my head.
>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>> NO_CONT_MAPPINGS
>>>>>> flags? For example, if you are find a leaf mapping and NO_BLOCK_MAPPINGS is
>>>>>> set,
>>>>>> then you need to split it?
>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>
>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>>
>>>>>>>> So in summary, what I'm asking for your large block mapping the linear map
>>>>>>>> series is:
>>>>>>>>       - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>>       - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>>> support BBML2
>>>>>>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>>>>>>
>>>>>>>>       - Integrate Dev's __change_memory_common() series
>>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch need
>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>
>>>>>>>>       - Create primitive to ensure mapping entry boundary at a given page-
>>>>>>>> aligned VA
>>>>>>>>       - Use primitive when changing permissions on linear map region
>>>>>>> Sure.
>>>>>>>
>>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>>> base
>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>> Definitely makes sense to me.
>>>>>>>
>>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>>> for v3
>>>>>>> in my replies on March 17, particularly:
>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>>> Ahh sorry about that. I'll take a look now...
>>>>> No problem.
>>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> have impact to my patches (basically check the new boot parameter).
>>>>>>>>>>>>>> Do you
>>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>> review
>>>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>>>> review
>>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>> series
>>>>>>>>>>>>> and am
>>>>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>>>>> straight
>>>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>>> which I
>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>>>>> following
>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Yang
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>>          * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>>          * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>>            Also included in this series in order to have the complete
>>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>>          * Enhanced __create_pgd_mapping() to handle split as well per
>>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>>          * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>>          * Supported asymmetric system by splitting kernel linear
>>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>            system is detected per Ryan. I don't have such system to
>>>>>>>>>>>>>>> test,
>>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>>            testing is done by hacking kernel to call linear mapping
>>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>>            unconditionally. The linear mapping doesn't have any
>>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>>            mappings after booting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>>          * Used allowlist to advertise BBM lv2 on the CPUs which can
>>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>>            conflict gracefully per Will Deacon
>>>>>>>>>>>>>>>          * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>>          * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>>          - performance degradation
>>>>>>>>>>>>>>>          - more TLB pressure
>>>>>>>>>>>>>>>          - memory waste for kernel page table
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel
>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>> line but this disables the alias checks on page table permissions and
>>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the kernel to
>>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM
>>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be
>>>>>>>>>>>>>>> repainted
>>>>>>>>>>>>>>> once
>>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for more
>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>>> rodata=full without compromising the security features of the kernel.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>>          - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>>            boot stage, if the patch didn't work, kernel typically
>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>>          - Module stress from stress-ng. Kernel module load change
>>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>            linear mapping.
>>>>>>>>>>>>>>>          - A test kernel module which allocates 80% of total memory
>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>>            then change the vmalloc area permission to RO, this also
>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>            mapping permission to RO, then change it back before
>>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>>            a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>>          - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>>          - Kernel build in VM with guest kernel which has this series
>>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>>          - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>>>          - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine,
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark with
>>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel TLB
>>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4)
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>>            --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>>            --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>>            --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the
>>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>>> number of good case is around 90% more than the best number of bad
>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced
>>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>>              arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>>              arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>>>>              arm64: mm: make __create_pgd_mapping() and helpers
>>>>>>>>>>>>>>> non-void
>>>>>>>>>>>>>>>              arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>>>>>>              arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>>              arm64: mm: split linear mapping if BBML2 is not
>>>>>>>>>>>>>>> supported on
>>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>>>         arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>>>         arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>>>         arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>>>         arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>>>         arch/arm64/kernel/cpufeature.c      | 95 ++++++++++++++++++
>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>> +++++++
>>>>>>>>>>>>>>>         arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++
>>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>>>>>>>         arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>>>         arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>>         9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 16:37                             ` Yang Shi
@ 2025-05-29 17:01                               ` Ryan Roberts
  2025-05-29 17:50                                 ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-29 17:01 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 29/05/2025 17:37, Yang Shi wrote:
> 
> 
> On 5/29/25 12:36 AM, Ryan Roberts wrote:
>> On 28/05/2025 16:18, Yang Shi wrote:
>>>
>>> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>>>> On 28/05/2025 01:00, Yang Shi wrote:
>>>>> Hi Ryan,
>>>>>
>>>>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>>>>> were some more comments on Miko's BBML2 patch, it looks like a new spin is
>>>>> needed. But AFAICT there should be no significant change to how I advertise
>>>>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
>>>>> BBML2 is advertised or not and the erratum seems still be needed to fix up
>>>>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>>>> Yes, I agree this should not impact you too much.
>>>>
>>>>> You also mentioned Dev was working on patches to have __change_memory_common()
>>>>> apply permission change on a contiguous range instead of on page basis (the
>>>>> status quo). But I have not seen the patches on mailing list yet. However I
>>>>> don't think this will result in any significant change to my patches either,
>>>>> particularly the split primitive and linear map repainting.
>>>> I think you would need Dev's series to be able to apply the permissions change
>>>> without needing to split the whole range to pte mappings? So I guess your
>>>> change
>>>> must either be implementing something similar to what Dev is working on or you
>>>> are splitting the entire range to ptes? If the latter, then I'm not keen on
>>>> that
>>>> approach.
>>> I don't think Dev's series is mandatory prerequisite for my patches. IIUC how
>>> the split primitive keeps block mapping if it is fully contained is independent
>>> from how to apply the permissions change on it.
>>> The new spin implemented keeping block mapping if it is fully contained as we
>>> discussed earlier. I'm supposed Dev's series just need to check whether the
>>> mapping is block or not when applying permission change.
>> The way I was thinking the split primitive would work, you would need Dev's
>> change as a prerequisite, so I suspect we both have a slightly different idea of
>> how this will work.
>>
>>> The flow just looks like as below conceptually:
>>>
>>> split_mapping(start, end)
>>> apply_permission_change(start, end)
>> The flow I was thinking of would be this:
>>
>> split_mapping(start)
>> split_mapping(end)
>> apply_permission_change(start, end)
>>
>> split_mapping() takes a virtual address that is at least page aligned and when
>> it returns, ensures that the address is at the start of a leaf mapping. And it
>> will only break the leaf mappings down so that they are the maximum size that
>> can still meet the requirement.
>>
>> As an example, let's suppose you initially start with a region that is composed
>> entirely of 2M mappings. Then you want to change permissions of a region [2052K,
>> 6208K).
>>
>> Before any splitting, you have:
>>
>>    - 2M   x4: [0, 8192K)
>>
>> Then you call split_mapping(start=2052K):
>>
>>    - 2M   x1: [0, 2048K)
>>    - 4K  x16: [2048K, 2112K)  << start is the start of the second 4K leaf mapping
>>    - 64K x31: [2112K, 4096K)
>>    - 2M:  x2: [4096K, 8192K)
>>
>> Then you call split_mapping(end=6208K):
>>
>>    - 2M   x1: [0, 2048K)
>>    - 4K  x16: [2048K, 2112K)
>>    - 64K x31: [2112K, 4096K)
>>    - 2M:  x1: [4096K, 6144K)
>>    - 64K x32: [6144K, 8192K)  << end is the end of the first 64K leaf mapping
>>
>> So then when you call apply_permission_change(start=2052K, end=6208K), the
>> following leaf mappings' permissions will be modified:
>>
>>    - 4K  x15: [2052K, 2112K)
>>    - 64K x31: [2112K, 4096K)
>>    - 2M:  x1: [4096K, 6144K)
>>    - 64K  x1: [6144K, 6208K)
>>
>> Since there are block mappings in this range, Dev's change is required to change
>> the permissions.
>>
>> This approach means that we only ever split the minimum required number of
>> mappings and we only split them to the largest size that still provides the
>> alignment requirement.
> 
> I see your point. I believe we are on the same page: keep the block mappings in
> the range as much as we can. My implementation actually ends up having the
> same result as your example shows. I guess we just have different ideas about
> how to implement it.

OK great!

> 
> However I do have a hard time understanding why not just use split_mapping(start,
> end). 

I don't really understand why you need to pass a range here. It's not like we
want to visit every leaf mapping in the range. We just want to walk down through
the page tables until we get to a leaf mapping that contains the address, then
keep splitting and walking deeper until the address is the start of a leaf
mapping. That's my thinking anyway. But you're the one doing the actual work
here so you probably have better insight than me.
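
To make that concrete, the rough shape I have in mind is something like the
below, sketched at the PMD level only (split_pmd_to_ptes() is a made-up name,
not something from your series); the real walk would do the same at the
pud/cont-pmd/cont-pte levels:

static int ensure_leaf_boundary_pmd(pmd_t *pmdp, unsigned long addr)
{
	pmd_t pmd = pmdp_get(pmdp);

	/* already points to a table of ptes: nothing to do at this level */
	if (!pmd_leaf(pmd))
		return 0;

	/* addr is already on a pmd boundary: the block can stay */
	if ((addr & ~PMD_MASK) == 0)
		return 0;

	/* BBML2 lets us swap the block for a table of ptes in place */
	return split_pmd_to_ptes(pmdp, addr);	/* may return -ENOMEM */
}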

> We can reuse some of the existing code easily with "end". Because the
> existing code does calculate the page table (PUD/PMD/CONT PMD/CONT PTE)
> boundary, so I reused it. Basically my implementation just skips to the next page
> table if:
>   * The start address is at page table boundary, and
>   * The "end" is greater than page table boundary
> 
> The logic may be a little bit convoluted, not sure if I articulated myself or
> not. Anyway the code will explain everything.

OK I think I understand; I think you're saying that if you pass in end, there is
an optimization you can do for the case where end is contained within the same
(ultimate) leaf mapping as start to avoid rewalking the pgtables?

> 
>>
>>> The split_mapping() guarantees keeping the block mapping if it is fully contained in
>>> the range between start and end, this is my series's responsibility. I know the
>>> current code calls apply_to_page_range() to apply permission change and it just
>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new API,
>>> then __change_memory_common() will call it to change permission. There should be
>>> some overlap between mine and Dev's, but I don't see strong dependency.
>> But if you have a block mapping in the region you are calling
>> __change_memory_common() on, today that will fail because it can only handle
>> page mappings.
> 
> IMHO letting __change_memory_common() manipulate a contiguous address range is
> another story and should not be part of the split primitive.

I 100% agree that it should not be part of the split primitive.

But your series *depends* upon __change_memory_common() being able to change
permissions on block mappings. Today it can only change permissions on page
mappings.

Your original v1 series solved this by splitting *all* of the mappings in a
given range to page mappings before calling __change_memory_common(), right?

Remember it's not just vmalloc areas that are passed to
__change_memory_common(); virtually contiguous linear map regions can be passed
in as well. See (for example) set_direct_map_invalid_noflush(),
set_direct_map_default_noflush(), set_direct_map_valid_noflush(),
__kernel_map_pages(), realm_set_memory_encrypted(), realm_set_memory_decrypted().
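
For example, from memory (so modulo details), set_direct_map_invalid_noflush()
looks roughly like the below today; it goes through the pte-only
apply_to_page_range() path, which is exactly what stops working once block
mappings can appear in the linear map:

int set_direct_map_invalid_noflush(struct page *page)
{
	struct page_change_data data = {
		.set_mask = __pgprot(0),
		.clear_mask = __pgprot(PTE_VALID),
	};

	if (!can_set_direct_map())
		return 0;

	/* walks pte leaf entries only */
	return apply_to_page_range(&init_mm,
				   (unsigned long)page_address(page),
				   PAGE_SIZE, change_page_range, &data);
}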


> 
> For example, we need to use vmalloc_huge() instead of vmalloc() to allocate huge
> memory, then do:
> split_mapping(start, start+HPAGE_PMD_SIZE);
> change_permission(start, start+HPAGE_PMD_SIZE);
> 
> The split primitive will guarantee (start, start+HPAGE_PMD_SIZE) is kept as PMD
> mapping so that change_permission() can change it on PMD basis too.
> 
> But this requires other kernel subsystems, for example, module, to allocate huge
> memory with proper APIs, for example, vmalloc_huge().

The longer term plan is to have vmalloc() always allocate using the
VM_ALLOW_HUGE_VMAP flag on systems that support BBML2. So there will be no need
to migrate users to vmalloc_huge(). We will just detect if we can split live
mappings safely and use huge mappings in that case.
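
Conceptually something like the below; just a sketch, and the helper name is
an assumption about what the BBML2 cpufeature series ends up exposing:

static unsigned long vmalloc_default_vm_flags(void)
{
	/* assumed helper: all CPUs advertise BBML2 without TLB conflict aborts */
	if (system_supports_bbml2_noabort())
		return VM_ALLOW_HUGE_VMAP;

	return 0;
}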

Thanks,
Ryan

> 
> Thanks,
> Yang
> 
>>
>>>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>>>> me of a potential problem; if you are doing the repainting with the machine
>>>> stopped, you can't allocate memory at that point; it's possible a CPU was
>>>> inside
>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>> would be to figure out how much memory you will need and pre-allocate prior to
>>>> stopping the machine?
>>> OK, I don't remember we discussed this problem before. I think we can do
>>> something like what kpti does. When creating the linear map we know how many PUD
>>> and PMD mappings are created, we can record the number, it will tell how many
>>> pages we need for repainting the linear map.
>> I saw a separate reply you sent for this. I'll read that and respond in that
>> context.
>>
>> Thanks,
>> Ryan
>>
>>>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing
>>>>> the
>>>>> split primitive and linear map repainting. Does it sound good to you?
>>>> That works assuming you have a solution for the above.
>>> I think the only missing part is preallocating page tables for repainting. I
>>> will add this, then post the new spin to the mailing list.
>>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>
>>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>
>>>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>>>> gently
>>>>>>>>>>>> ping,
>>>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>>>> I'm very
>>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>>> (although
>>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>>
>>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>>> suggestions I
>>>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>>>> week.
>>>>>>>>>>> I'm
>>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>>>>>> vmalloc
>>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>>>>>> too.
>>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>>> Hi Yang,
>>>>>>>>>
>>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>>> thinking:
>>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>>> vmalloc
>>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e.
>>>>>>>>> block
>>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>>
>>>>>>>>> vmalloc has history with trying to do huge mappings by default; it
>>>>>>>>> ended up
>>>>>>>>> having to be turned into an opt-in feature (instead of the original
>>>>>>>>> opt-out
>>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>>> expecting
>>>>>>>>> page mappings. I think we might be able to overcome those issues on arm64
>>>>>>>>> with
>>>>>>>>> BBML2.
>>>>>>>>>
>>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>>>> vmalloc
>>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>>>> To be
>>>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come
>>>>>>>>> in.
>>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>>
>>>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>>>> kernel
>>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>>> middle of
>>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>>>>>> will
>>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries.
>>>>>>>>> The
>>>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>>>> EINVAL
>>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>>>> perform
>>>>>>>>> the split.
>>>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>>>> returning
>>>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear
>>>>>>>> mapping
>>>>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>>>>> However I'm wondering what usecases you are thinking about? Splitting
>>>>>>>> vmalloc
>>>>>>>> area may run into unmapped VA?
>>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>>>> discouraged. I think even for vmalloc under correct conditions we shouldn't
>>>>>>> see
>>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>>> vunmap_pmd_range() which skips the pmd if it's none.
>>>>>>>
>>>>>>>>> Then we can use that primitive on the start and end address of any
>>>>>>>>> range for
>>>>>>>>> which we need exact mapping boundaries (e.g. when changing permissions on
>>>>>>>>> part
>>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>>> allocation,
>>>>>>>>> etc). This way we only split enough to ensure the boundaries are precise,
>>>>>>>>> and
>>>>>>>>> keep larger mappings inside the range.
>>>>>>>> Yeah, makes sense to me.
>>>>>>>>
>>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev
>>>>>>>>> Jain has
>>>>>>>>> been working on a series that converts this to use
>>>>>>>>> walk_page_range_novma() so
>>>>>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable
>>>>>>>>> with
>>>>>>>>> posting an RFC early next week.
>>>>>>>> OK, so the new __change_memory_common() will change the permission of page
>>>>>>>> table, right?
>>>>>>> It will change permissions of all the leaf entries in the range of VAs it is
>>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>>>>>> generalize to support all the other types of leaf entries too.
>>>>>>>
>>>>>>>> If I remember correctly, you suggested change permissions in
>>>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>>> context
>>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>>> vmalloc
>>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>>> have an
>>>>>>> independent capability.
>>>>>> OK, sounds good to me.
>>>>>>
>>>>>>>> The current code assumes the address range passed in by
>>>>>>>> change_memory_common()
>>>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page
>>>>>>>> table
>>>>>>>> permission on a page basis. I suppose Dev's patches will handle this
>>>>>>>> then my
>>>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>>>> physically contiguous too otherwise I can't keep large mappings inside the
>>>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>>
>>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>>> implementation so that it can walk a VA range and update the permissions on
>>>>>>> each
>>>>>>> leaf entry it visits, regardless of which level the leaf entry is at. This
>>>>>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>>>>>> entries in the page table.
>>>>>>>
>>>>>>> So if we are changing permissions on the linear map, we have a range of
>>>>>>> VAs to
>>>>>>> walk and convert all the leaf entries, regardless of their size. The same
>>>>>>> goes
>>>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>>>> permissions in the linear map, so we will have to figure out the contiguous
>>>>>>> pieces of the linear map and call __change_memory_common() for each;
>>>>>>> there is
>>>>>>> definitely some detail to work out there!
>>>>>> Yes, this is my point. When changing underlying linear map permission for
>>>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>>>
>>>>>> But how Dev's patch work should have no impact on how I implement the split
>>>>>> primitive by thinking it further. It should be the caller's responsibility to
>>>>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>>>>> address range.
>>>>>>
>>>>>>>>> You'll still need to repaint the whole linear map with page mappings
>>>>>>>>> for the
>>>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked()
>>>>>>>>> (potentially
>>>>>>>>> with
>>>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>>>> similar to
>>>>>>>>> how you are doing it in v3.
>>>>>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>>>>>> level. A simple flag should be good enough to tell
>>>>>>>> __create_pgd_mapping_locked()
>>>>>>>> do the right thing off the top of my head.
>>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>>> NO_CONT_MAPPINGS
>>>>>>> flags? For example, if you find a leaf mapping and NO_BLOCK_MAPPINGS is
>>>>>>> set,
>>>>>>> then you need to split it?
>>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>>
>>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>>>
>>>>>>>>> So in summary, what I'm asking for your large block mapping the linear map
>>>>>>>>> series is:
>>>>>>>>>       - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>>>       - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>>>> support BBML2
>>>>>>>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>>>>>>>
>>>>>>>>>       - Integrate Dev's __change_memory_common() series
>>>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch needs to
>>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>>
>>>>>>>>>       - Create primitive to ensure mapping entry boundary at a given page-
>>>>>>>>> aligned VA
>>>>>>>>>       - Use primitive when changing permissions on linear map region
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>>>> base
>>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>> Definitely makes sense to me.
>>>>>>>>
>>>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>>>> for v3
>>>>>>>> in my replies on March 17, particularly:
>>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>>>> Ahh sorry about that. I'll take a look now...
>>>>>> No problem.
>>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Yang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> have impact to my patches (basically check the new boot parameter).
>>>>>>>>>>>>>>> Do you
>>>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>>> series
>>>>>>>>>>>>>> and am
>>>>>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>>>>>> straight
>>>>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>>>> which I
>>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>>>>>> following
>>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>>> Unfortunately I can't make it this year. Have fun!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>>>          * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>>>          * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>>>            Also included in this series in order to have the
>>>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>>>          * Enhanced __create_pgd_mapping() to handle split as
>>>>>>>>>>>>>>>> well per
>>>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>>>          * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>>>          * Supported asymmetric system by splitting kernel linear
>>>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>            system is detected per Ryan. I don't have such system to
>>>>>>>>>>>>>>>> test,
>>>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>>>            testing is done by hacking kernel to call linear mapping
>>>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>>>            unconditionally. The linear mapping doesn't have any
>>>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>>>            mappings after booting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>>>          * Used allowlist to advertise BBM lv2 on the CPUs which
>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>>>            conflict gracefully per Will Deacon
>>>>>>>>>>>>>>>>          * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>>>          * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to
>>>>>>>>>>>>>>>> arm's
>>>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>>>          - performance degradation
>>>>>>>>>>>>>>>>          - more TLB pressure
>>>>>>>>>>>>>>>>          - memory waste for kernel page table
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel
>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>> line but this disables the alias checks on page table
>>>>>>>>>>>>>>>> permissions and
>>>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the
>>>>>>>>>>>>>>>> kernel to
>>>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM
>>>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be
>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be
>>>>>>>>>>>>>>>> repainted
>>>>>>>>>>>>>>>> once
>>>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for
>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>>>> rodata=full without compromising the security features of the
>>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>>>          - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>>>            boot stage, if the patch didn't work, kernel typically
>>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>>>          - Module stress from stress-ng. Kernel module load change
>>>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>            linear mapping.
>>>>>>>>>>>>>>>>          - A test kernel module which allocates 80% of total memory
>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>>>            then change the vmalloc area permission to RO, this also
>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>>            mapping permission to RO, then change it back before
>>>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>>>            a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>>>          - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>>>          - Kernel build in VM with guest kernel which has this
>>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>>>          - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>>>>          - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine,
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark
>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel
>>>>>>>>>>>>>>>> TLB
>>>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4)
>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>>>            --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>>>            --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>>>            --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the
>>>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>>>> number of good case is around 90% more than the best number of bad
>>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced
>>>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>>>              arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>>>              arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>>>>>              arm64: mm: make __create_pgd_mapping() and helpers
>>>>>>>>>>>>>>>> non-void
>>>>>>>>>>>>>>>>              arm64: mm: support large block mapping when
>>>>>>>>>>>>>>>> rodata=full
>>>>>>>>>>>>>>>>              arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>>>              arm64: mm: split linear mapping if BBML2 is not
>>>>>>>>>>>>>>>> supported on
>>>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>         arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>>>>         arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>>>>         arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>>>>         arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>>>>         arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>>>>         arch/arm64/kernel/cpufeature.c      | 95 ++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>         arch/arm64/mm/mmu.c                 | 397 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>>>>>>>>>>>>>>>>         arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>>>>         arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>>>         9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 15:33                               ` Ryan Roberts
@ 2025-05-29 17:35                                 ` Yang Shi
  2025-05-29 18:30                                   ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-29 17:35 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/29/25 8:33 AM, Ryan Roberts wrote:
> On 29/05/2025 09:48, Ryan Roberts wrote:
>
> [...]
>
>>>>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>>>>> me of a potential problem; if you are doing the repainting with the machine
>>>>> stopped, you can't allocate memory at that point; it's possible a CPU was inside
>>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>>> would be to figure out how much memory you will need and pre-allocate prior to
>>>>> stopping the machine?
>>>> OK, I don't remember we discussed this problem before. I think we can do
>>>> something like what kpti does. When creating the linear map we know how many
>>>> PUD and PMD mappings are created, we can record the number, it will tell how
>>>> many pages we need for repainting the linear map.
>>> Looking the kpti code further, it looks like kpti also allocates memory with the
>>> machine stopped, but it calls memory allocation on cpu 0 only.
>> Oh yes, I hadn't spotted that. It looks like a special case that may be ok for
>> kpti though; it's allocating a fairly small amount of memory (max levels=5 so
>> max order=3) and it's doing it with GFP_ATOMIC. So if my understanding of the
>> page allocator is correct, then this should be allocated from a per-cpu reserve?
>> Which means that it never needs to take a lock that other, stopped CPUs could be
>> holding. And GFP_ATOMIC guarantees that the thread will never sleep, which I
>> think is not allowed while the machine is stopped.

The pcp should be set up by then, but I don't think it is actually 
populated until the first allocation happens IIRC.

>>
>>> IIUC this
>>> guarantees the code will not be called on a CPU which was inside the allocator
>>> when it stopped because CPU 0 is running stop_machine().
>> My concern was a bit more general; if any other CPU was inside the allocator
>> holding a lock when the machine was stopped, then if CPU 0 comes along and makes
>> a call to the allocator that requires the lock, then we have a deadlock.
>>
>> All that said, looking at the stop_machine() docs, it says:
>>
>>   * Description: This causes a thread to be scheduled on every cpu,
>>   * each of which disables interrupts.  The result is that no one is
>>   * holding a spinlock or inside any other preempt-disabled region when
>>   * @fn() runs.
>>
>> So I think my deadlock concern was unfounded. I think as long as you can
>> guarantee that fn() won't try to sleep then you should be safe? So I guess
>> allocating from within fn() should be safe as long as you use GFP_ATOMIC?

Yes, the deadlock should not be a concern.

The other comment also said:

  * On each target cpu, @fn is run in a process context with the highest priority
  * preempting any task on the cpu and monopolizing it.

Since fn is run in a process context, sleeping should be OK? A sleep 
could only happen when the allocation needs memory reclaim due to 
insufficient memory, for both the kpti and linear map repainting use 
cases. But I do agree GFP_ATOMIC is safer.
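
So the pgtable allocation used while repainting could be as simple as the
below sketch (error handling and any pre-allocated pool are omitted, and the
function name is made up):

static phys_addr_t repaint_pgtable_alloc(int shift)
{
	/* GFP_ATOMIC | __GFP_ZERO: no reclaim and no sleeping under stop_machine() */
	void *ptr = (void *)__get_free_page(GFP_ATOMIC | __GFP_ZERO);

	/* 0 stands in for "allocation failed" in this sketch; shift is ignored */
	return ptr ? __pa(ptr) : 0;
}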

> I just had another conversation about this internally, and there is another
> concern; we obviously don't want to modify the pgtables while other CPUs that
> don't support BBML2 could be accessing them. Even in stop_machine() this may be
> possible if the CPU stacks and task structure (for example) are allocated out of
> the linear map.
>
> So we need to be careful to follow the pattern used by kpti; all secondary CPUs
> need to switch to the idmap (which is installed in TTBR0) then install the
> reserved map in TTBR1, then wait for CPU 0 to repaint the linear map, then have
> the secondary CPUs switch TTBR1 back to swapper then switch back out of idmap.

So the below code should be ok?

cpu_install_idmap()
Busy loop to wait for cpu 0 done
cpu_uninstall_idmap()
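
i.e. something like the below sketch, with a made-up flag that cpu 0 sets once
the repaint is done (the reserved TTBR1 step you mentioned is left out here):

static atomic_t repaint_done = ATOMIC_INIT(0);

/* runs on each secondary CPU inside the stop_machine() callback */
static void wait_for_repaint(void)
{
	cpu_install_idmap();		/* stop walking swapper's linear map */

	while (!atomic_read(&repaint_done))
		cpu_relax();

	cpu_uninstall_idmap();		/* back onto the repainted swapper */
}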

>
> Given CPU 0 supports BBML2, I think it can just update the linear map live,
> without needing to do the idmap dance?

Yes, I think so too.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>
>> Thanks,
>> Ryan
>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 17:01                               ` Ryan Roberts
@ 2025-05-29 17:50                                 ` Yang Shi
  2025-05-29 18:34                                   ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-29 17:50 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/29/25 10:01 AM, Ryan Roberts wrote:
> On 29/05/2025 17:37, Yang Shi wrote:
>>
>> On 5/29/25 12:36 AM, Ryan Roberts wrote:
>>> On 28/05/2025 16:18, Yang Shi wrote:
>>>> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>>>>> On 28/05/2025 01:00, Yang Shi wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>>>>>> were some more comments on Miko's BBML2 patch, it looks like a new spin is
>>>>>> needed. But AFAICT there should be no significant change to how I advertise
>>>>>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
>>>>>> BBML2 is advertised or not and the erratum still seems to be needed to fix up
>>>>>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>>>>> Yes, I agree this should not impact you too much.
>>>>>
>>>>>> You also mentioned Dev was working on patches to have __change_memory_common()
>>>>>> apply permission change on a contiguous range instead of on page basis (the
>>>>>> status quo). But I have not seen the patches on mailing list yet. However I
>>>>>> don't think this will result in any significant change to my patches either,
>>>>>> particularly the split primitive and linear map repainting.
>>>>> I think you would need Dev's series to be able to apply the permissions change
>>>>> without needing to split the whole range to pte mappings? So I guess your
>>>>> change
>>>>> must either be implementing something similar to what Dev is working on or you
>>>>> are splitting the entire range to ptes? If the latter, then I'm not keen on
>>>>> that
>>>>> approach.
>>>> I don't think Dev's series is a mandatory prerequisite for my patches. IIUC how
>>>> the split primitive keeps block mapping if it is fully contained is independent
>>>> from how to apply the permissions change on it.
>>>> The new spin implemented keeping block mapping if it is fully contained as we
>>>> discussed earlier. I suppose Dev's series just needs to check whether the
>>>> mapping is block or not when applying permission change.
>>> The way I was thinking the split primitive would work, you would need Dev's
>>> change as a prerequisite, so I suspect we both have a slightly different idea of
>>> how this will work.
>>>
>>>> The flow just looks like as below conceptually:
>>>>
>>>> split_mapping(start, end)
>>>> apply_permission_change(start, end)
>>> The flow I was thinking of would be this:
>>>
>>> split_mapping(start)
>>> split_mapping(end)
>>> apply_permission_change(start, end)
>>>
>>> split_mapping() takes a virtual address that is at least page aligned and when
>>> it returns, ensures that the address is at the start of a leaf mapping. And it
>>> will only break the leaf mappings down so that they are the maximum size that
>>> can still meet the requirement.
>>>
>>> As an example, let's suppose you initially start with a region that is composed
>>> entirely of 2M mappings. Then you want to change permissions of a region [2052K,
>>> 6208K).
>>>
>>> Before any splitting, you have:
>>>
>>>     - 2M   x4: [0, 8192K)
>>>
>>> Then you call split_mapping(start=2052K):
>>>
>>>     - 2M   x1: [0, 2048K)
>>>     - 4K  x16: [2048K, 2112K)  << start is the start of the second 4K leaf mapping
>>>     - 64K x31: [2112K, 4096K)
>>>     - 2M:  x2: [4096K, 8192K)
>>>
>>> Then you call split_mapping(end=6208K):
>>>
>>>     - 2M   x1: [0, 2048K)
>>>     - 4K  x16: [2048K, 2112K)
>>>     - 64K x31: [2112K, 4096K)
>>>     - 2M:  x1: [4096K, 6144K)
>>>     - 64K x32: [6144K, 8192K)  << end is the end of the first 64K leaf mapping
>>>
>>> So then when you call apply_permission_change(start=2052K, end=6208K), the
>>> following leaf mappings' permissions will be modified:
>>>
>>>     - 4K  x15: [2052K, 2112K)
>>>     - 64K x31: [2112K, 4096K)
>>>     - 2M:  x1: [4096K, 6144K)
>>>     - 64K  x1: [6144K, 6208K)
>>>
>>> Since there are block mappings in this range, Dev's change is required to change
>>> the permissions.
>>>
>>> This approach means that we only ever split the minimum required number of
>>> mappings and we only split them to the largest size that still provides the
>>> alignment requirement.
>> I see your point. I believe we are on the same page: keep the block mappings in
>> the range as much as we can. My implementation actually ends up having the
>> same result as your example shows. I guess we just have different ideas about
>> how to implement it.
> OK great!
>
>> However I do have a hard time understanding why not just use split_mapping(start,
>> end).
> I don't really understand why you need to pass a range here. It's not like we
> want to visit every leaf mapping in the range. We just want to walk down through
> the page tables until we get to a leaf mapping that contains the address, then
> keep splitting and walking deeper until the address is the start of a leaf
> mapping. That's my thinking anyway. But you're the one doing the actual work
> here so you probably have better insight than me.

split_mapping(start, end) actually does the same thing, and we just need 
one call instead of two.

>
>> We can reuse some of the existing code easily with "end". Because the
>> existing code does calculate the page table (PUD/PMD/CONT PMD/CONT PTE)
>> boundary, so I reused it. Basically my implementation just skips to the next page
>> table if:
>>    * The start address is at page table boundary, and
>>    * The "end" is greater than page table boundary
>>
>> The logic may be a little bit convoluted, not sure if I articulated myself or
>> not. Anyway the code will explain everything.
> OK I think I understand; I think you're saying that if you pass in end, there is
> an optimization you can do for the case where end is contained within the same
> (ultimate) leaf mapping as start to avoid rewalking the pgtables?

Yes, because we know the "end" we can just skip over that page table 
entry and move on to the next one.

>
>>>> The split_mapping() guarantees keeping the block mapping if it is fully contained in
>>>> the range between start and end, this is my series's responsibility. I know the
>>>> current code calls apply_to_page_range() to apply permission change and it just
>>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new API,
>>>> then __change_memory_common() will call it to change permission. There should be
>>>> some overlap between mine and Dev's, but I don't see strong dependency.
>>> But if you have a block mapping in the region you are calling
>>> __change_memory_common() on, today that will fail because it can only handle
>>> page mappings.
>> IMHO letting __change_memory_common() manipulate a contiguous address range is
>> another story and should not be part of the split primitive.
> I 100% agree that it should not be part of the split primitive.
>
> But your series *depends* upon __change_memory_common() being able to change
> permissions on block mappings. Today it can only change permissions on page
> mappings.

I don't think the split primitive itself depends on it. Changing 
permissions on block mappings is just a user of the new split primitive 
IMHO; we just don't have a real user of that yet.

>
> Your original v1 series solved this by splitting *all* of the mappings in a
> given range to page mappings before calling __change_memory_common(), right?

Yes, but if the range is contiguous, the new split primitive doesn't 
have to split to page mappings.

>
> Remember it's not just vmalloc areas that are passed to
> __change_memory_common(); virtually contiguous linear map regions can be passed
> in as well. See (for example) set_direct_map_invalid_noflush(),
> set_direct_map_default_noflush(), set_direct_map_valid_noflush(),
> __kernel_map_pages(), realm_set_memory_encrypted(), realm_set_memory_decrypted().

Yes, no matter who the caller is, as long as it passes in a contiguous 
address range, the split primitive can keep block mappings.

>
>
>> For example, we need to use vmalloc_huge() instead of vmalloc() to allocate huge
>> memory, then do:
>> split_mapping(start, start+HPAGE_PMD_SIZE);
>> change_permission(start, start+HPAGE_PMD_SIZE);
>>
>> The split primitive will guarantee (start, start+HPAGE_PMD_SIZE) is kept as PMD
>> mapping so that change_permission() can change it on PMD basis too.
>>
>> But this requires other kernel subsystems, for example, module, to allocate huge
>> memory with proper APIs, for example, vmalloc_huge().
> The longer term plan is to have vmalloc() always allocate using the
> VM_ALLOW_HUGE_VMAP flag on systems that support BBML2. So there will be no need
> to migrate users to vmalloc_huge(). We will just detect if we can split live
> mappings safely and use huge mappings in that case.

Anyway, that would be a potential user of the new split primitive.

Thanks,
Yang

>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>
>>>>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>>>>> me of a potential problem; if you are doing the repainting with the machine
>>>>> stopped, you can't allocate memory at that point; it's possible a CPU was
>>>>> inside
>>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>>> would be to figure out how much memory you will need and pre-allocate prior to
>>>>> stopping the machine?
>>>> OK, I don't remember we discussed this problem before. I think we can do
>>>> something like what kpti does. When creating the linear map we know how many PUD
>>>> and PMD mappings are created, we can record the number, it will tell how many
>>>> pages we need for repainting the linear map.
>>> I saw a separate reply you sent for this. I'll read that and respond in that
>>> context.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing
>>>>>> the
>>>>>> split primitive and linear map repainting. Does it sound good to you?
>>>>> That works assuming you have a solution for the above.
>>>> I think the only missing part is preallocating page tables for repainting. I
>>>> will add this, then post the new spin to the mailing list.
>>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>
>>>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>>>>> gently
>>>>>>>>>>>>> ping,
>>>>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>>>>> I'm very
>>>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>>>> (although
>>>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>>>> suggestions I
>>>>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>>>>> week.
>>>>>>>>>>>> I'm
>>>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>>>>>>> vmalloc
>>>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>>>>>>> too.
>>>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>>>> Hi Yang,
>>>>>>>>>>
>>>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>>>> thinking:
>>>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>>>> vmalloc
>>>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e.
>>>>>>>>>> block
>>>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>>>
>>>>>>>>>> vmalloc has history with trying to do huge mappings by default; it
>>>>>>>>>> ended up
>>>>>>>>>> having to be turned into an opt-in feature (instead of the original
>>>>>>>>>> opt-out
>>>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>>>> expecting
>>>>>>>>>> page mappings. I think we might be able to overcome those issues on arm64
>>>>>>>>>> with
>>>>>>>>>> BBML2.
>>>>>>>>>>
>>>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>>>>> vmalloc
>>>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>>>>> To be
>>>>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come
>>>>>>>>>> in.
>>>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>>>
>>>>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>>>>> kernel
>>>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>>>> middle of
>>>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>>>>>>> will
>>>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries.
>>>>>>>>>> The
>>>>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>>>>> EINVAL
>>>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>>>>> perform
>>>>>>>>>> the split.
>>>>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>>>>> returning
>>>>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear
>>>>>>>>> mapping
>>>>>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>>>>>> However I'm wondering what usecases you are thinking about? Splitting
>>>>>>>>> vmalloc
>>>>>>>>> area may run into unmapped VA?
>>>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>>>>> discouraged. I think even for vmalloc under correct conditions we shouldn't
>>>>>>>> see
>>>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>>>> vunmap_pmd_range() which skips the pmd if it's none.
>>>>>>>>
>>>>>>>>>> Then we can use that primitive on the start and end address of any
>>>>>>>>>> range for
>>>>>>>>>> which we need exact mapping boundaries (e.g. when changing permissions on
>>>>>>>>>> part
>>>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>>>> allocation,
>>>>>>>>>> etc). This way we only split enough to ensure the boundaries are precise,
>>>>>>>>>> and
>>>>>>>>>> keep larger mappings inside the range.
>>>>>>>>> Yeah, makes sense to me.
>>>>>>>>>
>>>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev
>>>>>>>>>> Jain has
>>>>>>>>>> been working on a series that converts this to use
>>>>>>>>>> walk_page_range_novma() so
>>>>>>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable
>>>>>>>>>> with
>>>>>>>>>> posting an RFC early next week.
>>>>>>>>> OK, so the new __change_memory_common() will change the permission of page
>>>>>>>>> table, right?
>>>>>>>> It will change permissions of all the leaf entries in the range of VAs it is
>>>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>>>>>>> generalize to support all the other types of leaf entries too.
>>>>>>>>
>>>>>>>>> If I remember correctly, you suggested change permissions in
>>>>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>>>> context
>>>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>>>> vmalloc
>>>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>>>> have an
>>>>>>>> independent capability.
>>>>>>> OK, sounds good to me.
>>>>>>>
>>>>>>>>> The current code assumes the address range passed in by
>>>>>>>>> change_memory_common()
>>>>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page
>>>>>>>>> table
>>>>>>>>> permission on a page basis. I suppose Dev's patches will handle this
>>>>>>>>> then my
>>>>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>>>>> physically contiguous too otherwise I can't keep large mappings inside the
>>>>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>>>
>>>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>>>> implementation so that it can walk a VA range and update the permissions on
>>>>>>>> each
>>>>>>>> leaf entry it visits, regardless of which level the leaf entry is at. This
>>>>>>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>>>>>>> entries in the page table.
>>>>>>>>
>>>>>>>> So if we are changing permissions on the linear map, we have a range of
>>>>>>>> VAs to
>>>>>>>> walk and convert all the leaf entries, regardless of their size. The same
>>>>>>>> goes
>>>>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>>>>> permissions in the linear map, so we will have to figure out the contiguous
>>>>>>>> pieces of the linear map and call __change_memory_common() for each;
>>>>>>>> there is
>>>>>>>> definitely some detail to work out there!
>>>>>>> Yes, this is my point. When changing underlying linear map permission for
>>>>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>>>>
>>>>>>> But how Dev's patch work should have no impact on how I implement the split
>>>>>>> primitive by thinking it further. It should be the caller's responsibility to
>>>>>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>>>>>> address range.
>>>>>>>
>>>>>>>>>> You'll still need to repaint the whole linear map with page mappings
>>>>>>>>>> for the
>>>>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked()
>>>>>>>>>> (potentially
>>>>>>>>>> with
>>>>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>>>>> similar to
>>>>>>>>>> how you are doing it in v3.
>>>>>>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>>>>>>> level. A simple flag should be good enough to tell
>>>>>>>>> __create_pgd_mapping_locked()
>>>>>>>>> do the right thing off the top of my head.
>>>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>>>> NO_CONT_MAPPINGS
>>>>>>>> flags? For example, if you find a leaf mapping and NO_BLOCK_MAPPINGS is
>>>>>>>> set,
>>>>>>>> then you need to split it?
>>>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>>>
>>>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>>>>
>>>>>>>>>> So in summary, what I'm asking for your large block mapping the linear map
>>>>>>>>>> series is:
>>>>>>>>>>        - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>>>>        - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>>>>> support BBML2
>>>>>>>>> OK, I just need to add some simple tweak to split down to PTE level to v3.
>>>>>>>>>
>>>>>>>>>>        - Integrate Dev's __change_memory_common() series
>>>>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch needs to
>>>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>>>
>>>>>>>>>>        - Create primitive to ensure mapping entry boundary at a given page-
>>>>>>>>>> aligned VA
>>>>>>>>>>        - Use primitive when changing permissions on linear map region
>>>>>>>>> Sure.
>>>>>>>>>
>>>>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>>>>> base
>>>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>> Definitely makes sense to me.
>>>>>>>>>
>>>>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>>>>> for v3
>>>>>>>>> in my replies on March 17, particularly:
>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>>>>> Ahh sorry about that. I'll take a look now...
>>>>>>> No problem.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> have impact to my patches (basically check the new boot parameter).
>>>>>>>>>>>>>>>> Do you
>>>>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>> and am
>>>>>>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>>>>>>> straight
>>>>>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>>>>> which I
>>>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>>>> Unfortunately I can't make it this year. Have fun!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>>>>           * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>>>>           * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>>>>             Also included in this series in order to have the
>>>>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>>>>           * Enhanced __create_pgd_mapping() to handle split as
>>>>>>>>>>>>>>>>> well per
>>>>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>>>>           * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>>>>           * Supported asymmetric system by splitting kernel linear
>>>>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>             system is detected per Ryan. I don't have such system to
>>>>>>>>>>>>>>>>> test,
>>>>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>>>>             testing is done by hacking kernel to call linear mapping
>>>>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>>>>             unconditionally. The linear mapping doesn't have any
>>>>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>>>>             mappings after booting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>>>>           * Used allowlist to advertise BBM lv2 on the CPUs which
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>>>>             conflict gracefully per Will Deacon
>>>>>>>>>>>>>>>>>           * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>>>>           * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to
>>>>>>>>>>>>>>>>> arm's
>>>>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>>>>           - performance degradation
>>>>>>>>>>>>>>>>>           - more TLB pressure
>>>>>>>>>>>>>>>>>           - memory waste for kernel page table
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel
>>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>> line but this disables the alias checks on page table
>>>>>>>>>>>>>>>>> permissions and
>>>>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the
>>>>>>>>>>>>>>>>> kernel to
>>>>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This patch adds support for splitting large mappings when FEAT_BBM
>>>>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be
>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be
>>>>>>>>>>>>>>>>> repainted
>>>>>>>>>>>>>>>>> once
>>>>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for
>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>>>>> rodata=full without compromising the security features of the
>>>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>>>>           - Kernel boot.  Kernel needs change kernel linear mapping
>>>>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>>>>             boot stage, if the patch didn't work, kernel typically
>>>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>>>>           - Module stress from stress-ng. Kernel module load change
>>>>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>             linear mapping.
>>>>>>>>>>>>>>>>>           - A test kernel module which allocates 80% of total memory
>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>>>>             then change the vmalloc area permission to RO, this also
>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>>>             mapping permission to RO, then change it back before
>>>>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>>>>             a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>>>>           - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>>>>           - Kernel build in VM with guest kernel which has this
>>>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>>>>           - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>>>>>           - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the machine,
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark
>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%, P99
>>>>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel
>>>>>>>>>>>>>>>>> TLB
>>>>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4)
>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>>>>             --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>>>>             --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>>>>             --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the
>>>>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>>>>> number of good case is around 90% more than the best number of bad
>>>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced
>>>>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>>>>               arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>>>>               arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>>>>>>               arm64: mm: make __create_pgd_mapping() and helpers
>>>>>>>>>>>>>>>>> non-void
>>>>>>>>>>>>>>>>>               arm64: mm: support large block mapping when
>>>>>>>>>>>>>>>>> rodata=full
>>>>>>>>>>>>>>>>>               arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>>>>               arm64: mm: split linear mapping if BBML2 is not
>>>>>>>>>>>>>>>>> supported on
>>>>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>          arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>>>>>          arch/arm64/kernel/cpufeature.c      | 95 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>          arch/arm64/mm/mmu.c                 | 397 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>>>>>>>>>>>>>>>>>          arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>>>>>          arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>>>>          9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 17:35                                 ` Yang Shi
@ 2025-05-29 18:30                                   ` Ryan Roberts
  2025-05-29 19:52                                     ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-29 18:30 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 29/05/2025 18:35, Yang Shi wrote:
> 
> 
> On 5/29/25 8:33 AM, Ryan Roberts wrote:
>> On 29/05/2025 09:48, Ryan Roberts wrote:
>>
>> [...]
>>
>>>>>> Regarding the linear map repainting, I had a chat with Catalin, and he
>>>>>> reminded
>>>>>> me of a potential problem; if you are doing the repainting with the machine
>>>>>> stopped, you can't allocate memory at that point; it's possible a CPU was
>>>>>> inside
>>>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>>>> would be to figure out how much memory you will need and pre-allocate
>>>>>> prior to
>>>>>> stoping the machine?
>>>>> OK, I don't remember we discussed this problem before. I think we can do
>>>>> something like what kpti does. When creating the linear map we know how many
>>>>> PUD and PMD mappings are created, we can record the number, it will tell how
>>>>> many pages we need for repainting the linear map.
>>>> Looking at the kpti code further, it looks like kpti also allocates memory with
>>>> the
>>>> machine stopped, but it calls memory allocation on cpu 0 only.
>>> Oh yes, I hadn't spotted that. It looks like a special case that may be ok for
>>> kpti though; it's allocating a fairly small amount of memory (max levels=5 so
>>> max order=3) and it's doing it with GFP_ATOMIC. So if my understanding of the
>>> page allocator is correct, then this should be allocated from a per-cpu reserve?
>>> Which means that it never needs to take a lock that other, stopped CPUs could be
>>> holding. And GFP_ATOMIC guarantees that the thread will never sleep, which I
>>> think is not allowed while the machine is stopped.
> 
> The pcp should be set up by then, but I don't think it is actually populated
> until the first allocation happens IIRC.
> 
>>>
>>>> IIUC this
>>>> guarantees the code will not be called on a CPU which was inside the allocator
>>>> when it stopped because CPU 0 is running stop_machine().
>>> My concern was a bit more general; if any other CPU was inside the allocator
>>> holding a lock when the machine was stopped, then if CPU 0 comes along and makes
>>> a call to the allocator that requires the lock, then we have a deadlock.
>>>
>>> All that said, looking at the stop_machine() docs, it says:
>>>
>>>   * Description: This causes a thread to be scheduled on every cpu,
>>>   * each of which disables interrupts.  The result is that no one is
>>>   * holding a spinlock or inside any other preempt-disabled region when
>>>   * @fn() runs.
>>>
>>> So I think my deadlock concern was unfounded. I think as long as you can
>>> guarantee that fn() won't try to sleep then you should be safe? So I guess
>>> allocating from within fn() should be safe as long as you use GFP_ATOMIC?
> 
> Yes, the deadlock should not be a concern.
> 
> The other comment also said:
> 
>  * On each target cpu, @fn is run in a process context with the highest priority
>  * preempting any task on the cpu and monopolizing it.
> 
> Since fn is running in a process context, sleeping should be ok? Sleep
> could only happen when an allocation requires memory reclaim due to
> insufficient memory, as in the kpti and linear map repainting usecases. But I do
> agree GFP_ATOMIC is safer.

Interrupts are disabled so I can't imagine sleeping is a good idea...
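
(Purely for illustration, not code from the series: an allocation done
from inside the stop_machine() callback would have to look roughly like
the sketch below so that it can never sleep; the function name is made
up.)

static pmd_t *repaint_alloc_pmd_table(void)
{
        /*
         * GFP_ATOMIC never sleeps and may dip into the atomic reserves, so
         * no reclaim is triggered and no lock that a stopped CPU might be
         * holding is needed. The page comes back zeroed, i.e. the new
         * table starts out with all entries invalid.
         */
        return (pmd_t *)get_zeroed_page(GFP_ATOMIC);
}
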

> 
>> I just had another conversation about this internally, and there is another
>> concern; we obviously don't want to modify the pgtables while other CPUs that
>> don't support BBML2 could be accessing them. Even in stop_machine() this may be
>> possible if the CPU stacks and task structure (for example) are allocated out of
>> the linear map.
>>
>> So we need to be careful to follow the pattern used by kpti; all secondary CPUs
>> need to switch to the idmap (which is installed in TTBR0) then install the
>> reserved map in TTBR1, then wait for CPU 0 to repaint the linear map, then have
>> the secondary CPUs switch TTBR1 back to swapper then switch back out of idmap.
> 
> So the below code should be ok?
> 
> cpu_install_idmap()
> Busy loop to wait for cpu 0 done
> cpu_uninstall_idmap()

Once you have installed the idmap, you'll need to call a function by its PA so
you are actually executing out of the idmap. And you will need to be in assembly
so you don't need the stack, and you'll need to switch TTBR1 to the reserved
pgtable, so that the CPU has no access to the swapper pgtable (which CPU 0 is
able to modify).

You may well be able to reuse __idmap_kpti_secondary in proc.S, or lightly
refactor it to work for both the existing idmap_kpti_install_ng_mappings case,
and your case.

Thanks,
Ryan

> 
>>
>> Given CPU 0 supports BBML2, I think it can just update the linear map live,
>> without needing to do the idmap dance?
> 
> Yes, I think so too.
> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Ryan
>>
>>
>>> Thanks,
>>> Ryan
>>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 17:50                                 ` Yang Shi
@ 2025-05-29 18:34                                   ` Ryan Roberts
  2025-05-29 20:52                                     ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-29 18:34 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 29/05/2025 18:50, Yang Shi wrote:
> 
> 
> On 5/29/25 10:01 AM, Ryan Roberts wrote:
>> On 29/05/2025 17:37, Yang Shi wrote:
>>>
>>> On 5/29/25 12:36 AM, Ryan Roberts wrote:
>>>> On 28/05/2025 16:18, Yang Shi wrote:
>>>>> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>>>>>> On 28/05/2025 01:00, Yang Shi wrote:
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>>>>>>> were some more comments on Miko's BBML2 patch, it looks like a new spin is
>>>>>>> needed. But AFAICT there should be no significant change to how I advertise
>>>>>>> AmpereOne BBML2 in my patches. We will keep using MIDR list to check whether
>>>>>>> BBML2 is advertised or not and the erratum seems still be needed to fix up
>>>>>>> AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>>>>>> Yes, I agree this should not impact you too much.
>>>>>>
>>>>>>> You also mentioned Dev was working on patches to have
>>>>>>> __change_memory_common()
>>>>>>> apply permission change on a contiguous range instead of on page basis (the
>>>>>>> status quo). But I have not seen the patches on mailing list yet. However I
>>>>>>> don't think this will result in any significant change to my patches either,
>>>>>>> particularly the split primitive and linear map repainting.
>>>>>> I think you would need Dev's series to be able to apply the permissions
>>>>>> change
>>>>>> without needing to split the whole range to pte mappings? So I guess your
>>>>>> change
>>>>>> must either be implementing something similar to what Dev is working on or
>>>>>> you
>>>>>> are splitting the entire range to ptes? If the latter, then I'm not keen on
>>>>>> that
>>>>>> approach.
>>>>> I don't think Dev's series is a mandatory prerequisite for my patches. IIUC how
>>>>> the split primitive keeps block mapping if it is fully contained is
>>>>> independent
>>>>> from how to apply the permissions change on it.
>>>>> The new spin implemented keeping block mapping if it is fully contained as we
>>>>> discussed earlier. I suppose Dev's series just needs to check whether the
>>>>> mapping is block or not when applying permission change.
>>>> The way I was thinking the split primitive would work, you would need Dev's
>>>> change as a prerequisite, so I suspect we both have a slightly different
>>>> idea of
>>>> how this will work.
>>>>
>>>>> The flow just looks like as below conceptually:
>>>>>
>>>>> split_mapping(start, end)
>>>>> apply_permission_change(start, end)
>>>> The flow I was thinking of would be this:
>>>>
>>>> split_mapping(start)
>>>> split_mapping(end)
>>>> apply_permission_change(start, end)
>>>>
>>>> split_mapping() takes a virtual address that is at least page aligned and when
>>>> it returns, ensures that the address is at the start of a leaf mapping. And it
>>>> will only break the leaf mappings down so that they are the maximum size that
>>>> can still meet the requirement.
>>>>
>>>> As an example, let's suppose you initially start with a region that is composed
>>>> entirely of 2M mappings. Then you want to change permissions of a region
>>>> [2052K,
>>>> 6208K).
>>>>
>>>> Before any splitting, you have:
>>>>
>>>>     - 2M   x4: [0, 8192K)
>>>>
>>>> Then you call split_mapping(start=2052K):
>>>>
>>>>     - 2M   x1: [0, 2048K)
>>>>     - 4K  x16: [2048K, 2112K)  << start is the start of the second 4K leaf
>>>> mapping
>>>>     - 64K x31: [2112K, 4096K)
>>>>     - 2M:  x2: [4096K, 8192K)
>>>>
>>>> Then you call split_mapping(end=6208K):
>>>>
>>>>     - 2M   x1: [0, 2048K)
>>>>     - 4K  x16: [2048K, 2112K)
>>>>     - 64K x31: [2112K, 4096K)
>>>>     - 2M:  x1: [4096K, 6144K)
>>>>     - 64K x32: [6144K, 8192K)  << end is the end of the first 64K leaf mapping
>>>>
>>>> So then when you call apply_permission_change(start=2052K, end=6208K), the
>>>> following leaf mappings' permissions will be modified:
>>>>
>>>>     - 4K  x15: [2052K, 2112K)
>>>>     - 64K x31: [2112K, 4096K)
>>>>     - 2M:  x1: [4096K, 6144K)
>>>>     - 64K  x1: [6144K, 6208K)
>>>>
>>>> Since there are block mappings in this range, Dev's change is required to
>>>> change
>>>> the permissions.
>>>>
>>>> This approach means that we only ever split the minimum required number of
>>>> mappings and we only split them to the largest size that still provides the
>>>> alignment requirement.
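
(For illustration only, using the hypothetical split_mapping() name from
above and assuming __change_memory_common() has been taught to handle
block/contig leaf entries, the resulting call pattern would be roughly:)

static int change_range_permissions(unsigned long start, unsigned long end,
                                    pgprot_t set_mask, pgprot_t clear_mask)
{
        int ret;

        /* Ensure start and end each land on a leaf-mapping boundary */
        ret = split_mapping(start);
        if (ret)
                return ret;

        ret = split_mapping(end);
        if (ret)
                return ret;

        /*
         * Leaf entries fully contained in [start, end) keep their current
         * size; only the two boundaries were split above.
         */
        return __change_memory_common(start, end - start, set_mask, clear_mask);
}
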
>>> I see your point. I believe we are on the same page: keep the block mappings in
>>> the range as much as possible. My implementation actually ends up having the
>>> same result as your example shows. I guess we just have different ideas about
>>> how to implement it.
>> OK great!
>>
>>> However I do have a hard time understanding why not just use split_mapping(start,
>>> end).
>> I don't really understand why you need to pass a range here. It's not like we
>> want to visit every leaf mapping in the range. We just want to walk down through
>> the page tables until we get to a leaf mapping that contains the address, then
>> keep splitting and walking deeper until the address is the start of a leaf
>> mapping. That's my thinking anyway. But you're the one doing the actual work
>> here so you probably have better insight than me.
> 
> split_mapping(start, end) actually does the same thing, and we just need one
> call instead of two.
> 
>>
>>> We can reuse some of the existing code easily with "end", because the
>>> existing code already calculates the page table (PUD/PMD/CONT PMD/CONT PTE)
>>> boundary, so I reused it. Basically my implementation just skips to the next
>>> page table if:
>>>    * The start address is at a page table boundary, and
>>>    * The "end" is greater than the page table boundary
>>>
>>> The logic may be a little bit convoluted; I'm not sure I articulated it well.
>>> Anyway the code will explain everything.
>> OK I think I understand; I think you're saying that if you pass in end, there is
>> an optimization you can do for the case where end is contained within the same
>> (ultimate) leaf mapping as start to avoid rewalking the pgtables?
> 
> Yes, we can just skip that page table to the next one because we know the "end".
> 
>>
>>>>> The split_mapping() guarantees keep block mapping if it is fully contained in
>>>>> the range between start and end, this is my series's responsibility. I know
>>>>> the
>>>>> current code calls apply_to_page_range() to apply permission change and it
>>>>> just
>>>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new
>>>>> API,
>>>>> then __change_memory_common() will call it to change permission. There
>>>>> should be
>>>>> some overlap between mine and Dev's, but I don't see strong dependency.
>>>> But if you have a block mapping in the region you are calling
>>>> __change_memory_common() on, today that will fail because it can only handle
>>>> page mappings.
>>> IMHO letting __change_memory_common() manipulate on contiguous address range is
>>> another story and should be not a part of the split primitive.
>> I 100% agree that it should not be part of the split primitive.
>>
>> But your series *depends* upon __change_memory_common() being able to change
>> permissions on block mappings. Today it can only change permissions on page
>> mappings.
> 
> I don't think split primitive depends on it. Changing permission on block
> mappings is just the user of the new split primitive IMHO. We just have no real
> user right now.

But your series introduces a real user; after your series, the linear map is
block mapped.

Anyway, I think we are talking past eachother. Let's continue the conversation
in the context of your next version of the code.

> 
>>
>> Your original v1 series solved this by splitting *all* of the mappings in a
>> given range to page mappings before calling __change_memory_common(), right?
> 
> Yes, but if the range is contiguous, the new split primitive doesn't have to
> split to page mappings.
> 
>>
>> Remember it's not just vmalloc areas that are passed to
>> __change_memory_common(); virtually contiguous linear map regions can be passed
>> in as well. See (for example) set_direct_map_invalid_noflush(),
>> set_direct_map_default_noflush(), set_direct_map_valid_noflush(),
>> __kernel_map_pages(), realm_set_memory_encrypted(), realm_set_memory_decrypted().
> 
> Yes, no matter who the caller is, as long as the caller passes in contiguous
> address range, the split primitive can keep block mappings.
> 
>>
>>
>>> For example, we need to use vmalloc_huge() instead of vmalloc() to allocate huge
>>> memory, then does:
>>> split_mapping(start, start+HPAGE_PMD_SIZE);
>>> change_permission(start, start+HPAGE_PMD_SIZE);
>>>
>>> The split primitive will guarantee (start, start+HPAGE_PMD_SIZE) is kept as PMD
>>> mapping so that change_permission() can change it on PMD basis too.
>>>
>>> But this requires other kernel subsystems, for example, module, to allocate huge
>>> memory with proper APIs, for example, vmalloc_huge().
>> The longer term plan is to have vmalloc() always allocate using the
>> VM_ALLOW_HUGE_VMAP flag on systems that support BBML2. So there will be no need
>> to migrate users to vmalloc_huge(). We will just detect if we can split live
>> mappings safely and use huge mappings in that case.
> 
> Anyway this is the potential user of the new split primitive.
> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Ryan
>>
>>> Thanks,
>>> Yang
>>>
>>>>>> Regarding the linear map repainting, I had a chat with Catalin, and he
>>>>>> reminded
>>>>>> me of a potential problem; if you are doing the repainting with the machine
>>>>>> stopped, you can't allocate memory at that point; it's possible a CPU was
>>>>>> inside
>>>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>>>> would be to figure out how much memory you will need and pre-allocate
>>>>>> prior to
>>>>>> stoping the machine?
>>>>> OK, I don't remember we discussed this problem before. I think we can do
>>>>> something like what kpti does. When creating the linear map we know how
>>>>> many PUD
>>>>> and PMD mappings are created, we can record the number, it will tell how many
>>>>> pages we need for repainting the linear map.
>>>> I saw a separate reply you sent for this. I'll read that and respond in that
>>>> context.
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing
>>>>>>> the
>>>>>>> split primitive and linear map repainting. Does it sound good to you?
>>>>>> That works assuming you have a solution for the above.
>>>>> I think the only missing part is preallocating page tables for repainting. I
>>>>> will add this, then post the new spin to the mailing list.
>>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>
>>>>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>>>>>> gently
>>>>>>>>>>>>>> ping,
>>>>>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>>>>>> I'm very
>>>>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>>>>> (although
>>>>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>>>>
>>>>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>>>>> suggestions I
>>>>>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>>>>>> week.
>>>>>>>>>>>>> I'm
>>>>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse
>>>>>>>>>>>>> with
>>>>>>>>>>>>> vmalloc
>>>>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page
>>>>>>>>>>>> table
>>>>>>>>>>>> too.
>>>>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>
>>>>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>>>>> thinking:
>>>>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>>>>> vmalloc
>>>>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e.
>>>>>>>>>>> block
>>>>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>>>>
>>>>>>>>>>> vmalloc has history with trying to do huge mappings by default; it
>>>>>>>>>>> ended up
>>>>>>>>>>> having to be turned into an opt-in feature (instead of the original
>>>>>>>>>>> opt-out
>>>>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>>>>> expecting
>>>>>>>>>>> page mappings. I think we might be able to overcome those issues on
>>>>>>>>>>> arm64
>>>>>>>>>>> with
>>>>>>>>>>> BBML2.
>>>>>>>>>>>
>>>>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I
>>>>>>>>>>> have a
>>>>>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>>>>>> vmalloc
>>>>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>>>>>> To be
>>>>>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come
>>>>>>>>>>> in.
>>>>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>>>>
>>>>>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>>>>>> kernel
>>>>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>>>>> middle of
>>>>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary.
>>>>>>>>>>> This
>>>>>>>>>>> will
>>>>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE
>>>>>>>>>>> entries.
>>>>>>>>>>> The
>>>>>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>>>>>> EINVAL
>>>>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>>>>>> perform
>>>>>>>>>>> the split.
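
(As a pure illustration of that contract, such a primitive might be
declared along these lines; the name and exact signature are
placeholders:)

/**
 * split_mapping - ensure @addr sits on a leaf-mapping boundary (sketch)
 * @addr: page-aligned kernel VA
 *
 * Walk to the leaf entry covering @addr and, if @addr falls in the middle
 * of a PUD/PMD block or a contiguous PMD/PTE run, split only as much as
 * needed so that @addr becomes the start of a leaf mapping. BBML2 is
 * assumed to be present.
 *
 * Returns 0 on success, -EINVAL if @addr is not mapped, or -ENOMEM if a
 * pgtable page cannot be allocated for the split.
 */
int split_mapping(unsigned long addr);
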
>>>>>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>>>>>> returning
>>>>>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear
>>>>>>>>>> mapping
>>>>>>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>>>>>>> However I'm wondering what usecases you are thinking about? Splitting
>>>>>>>>>> vmalloc
>>>>>>>>>> area may run into unmapped VA?
>>>>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>>>>>> discouraged. I think even for vmalloc under correct conditions we
>>>>>>>>> shouldn't
>>>>>>>>> see
>>>>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>>>>> vunmap_pmd_range() which skips the pmd if it's none.
>>>>>>>>>
>>>>>>>>>>> Then we can use that primitive on the start and end address of any
>>>>>>>>>>> range for
>>>>>>>>>>> which we need exact mapping boundaries (e.g. when changing
>>>>>>>>>>> permissions on
>>>>>>>>>>> part
>>>>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>>>>> allocation,
>>>>>>>>>>> etc). This way we only split enough to ensure the boundaries are
>>>>>>>>>>> precise,
>>>>>>>>>>> and
>>>>>>>>>>> keep larger mappings inside the range.
>>>>>>>>>> Yeah, makes sense to me.
>>>>>>>>>>
>>>>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev
>>>>>>>>>>> Jain has
>>>>>>>>>>> been working on a series that converts this to use
>>>>>>>>>>> walk_page_range_novma() so
>>>>>>>>>>> that we can change permissions on the block/contig entries too.
>>>>>>>>>>> That's not
>>>>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable
>>>>>>>>>>> with
>>>>>>>>>>> posting an RFC early next week.
>>>>>>>>>> OK, so the new __change_memory_common() will change the permission of
>>>>>>>>>> page
>>>>>>>>>> table, right?
>>>>>>>>> It will change permissions of all the leaf entries in the range of VAs
>>>>>>>>> it is
>>>>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we
>>>>>>>>> will
>>>>>>>>> generalize to support all the other types of leaf entries too.,
>>>>>>>>>
>>>>>>>>>> If I remember correctly, you suggested change permissions in
>>>>>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>>>>> context
>>>>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes
>>>>>>>>> sense to
>>>>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>>>>> vmalloc
>>>>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>>>>> have an
>>>>>>>>> independent capability.
>>>>>>>> OK, sounds good to me.
>>>>>>>>
>>>>>>>>>> The current code assumes the address range passed in by
>>>>>>>>>> change_memory_common()
>>>>>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page
>>>>>>>>>> table
>>>>>>>>>> permission on page basis. I'm supposed Dev's patches will handle this
>>>>>>>>>> then my
>>>>>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>>>>>> physically contiguous too otherwise I can't keep large mappings inside
>>>>>>>>>> the
>>>>>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>>>>
>>>>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>>>>> implementation so that it can walk a VA range and update the
>>>>>>>>> permissions on
>>>>>>>>> each
>>>>>>>>> leaf entry it visits, regadless of which level the leaf entry is at. This
>>>>>>>>> doesn't make any assumption of the physical contiguity of neighbouring
>>>>>>>>> leaf
>>>>>>>>> entries in the page table.
>>>>>>>>>
>>>>>>>>> So if we are changing permissions on the linear map, we have a range of
>>>>>>>>> VAs to
>>>>>>>>> walk and convert all the leaf entries, regardless of their size. The same
>>>>>>>>> goes
>>>>>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>>>>>> permissions in the linear map, so we will have to figure out the
>>>>>>>>> contiguous
>>>>>>>>> pieces of the linear map and call __change_memory_common() for each;
>>>>>>>>> there is
>>>>>>>>> definitely some detail to work out there!
>>>>>>>> Yes, this is my point. When changing underlying linear map permission for
>>>>>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>>>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>>>>>
>>>>>>>> But how Dev's patch work should have no impact on how I implement the split
>>>>>>>> primitive by thinking it further. It should be the caller's
>>>>>>>> responsibility to
>>>>>>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>>>>>>> address range.
>>>>>>>>
>>>>>>>>>>> You'll still need to repaint the whole linear map with page mappings
>>>>>>>>>>> for the
>>>>>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked()
>>>>>>>>>>> (potentially
>>>>>>>>>>> with
>>>>>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>>>>>> similar to
>>>>>>>>>>> how you are doing it in v3.
>>>>>>>>>> Yes, when repainting I need to split the page table all the way down
>>>>>>>>>> to PTE
>>>>>>>>>> level. A simple flag should be good enough to tell
>>>>>>>>>> __create_pgd_mapping_locked()
>>>>>>>>>> do the right thing off the top of my head.
>>>>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>>>>> NO_CONT_MAPPINGS
>>>>>>>>> flags? For example, if you find a leaf mapping and
>>>>>>>>> NO_BLOCK_MAPPINGS is
>>>>>>>>> set,
>>>>>>>>> then you need to split it?
>>>>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>>>>
>>>>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>>>>>
>>>>>>>>>>> So in summary, what I'm asking for your large block mapping the
>>>>>>>>>>> linear map
>>>>>>>>>>> series is:
>>>>>>>>>>>        - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>>>>>        - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>>>>>> support BBML2
>>>>>>>>>> OK, I just need to add some simple tweak to split down to PTE level to
>>>>>>>>>> v3.
>>>>>>>>>>
>>>>>>>>>>>        - Integrate Dev's __change_memory_common() series
>>>>>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch
>>>>>>>>>> need
>>>>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>>>>
>>>>>>>>>>>        - Create primitive to ensure mapping entry boundary at a given
>>>>>>>>>>> page-
>>>>>>>>>>> aligned VA
>>>>>>>>>>>        - Use primitive when changing permissions on linear map region
>>>>>>>>>> Sure.
>>>>>>>>>>
>>>>>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>>>>>> base
>>>>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>> Definitely makes sense to me.
>>>>>>>>>>
>>>>>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>>>>>> for v3
>>>>>>>>>> in my replies on March 17, particularly:
>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>>>>>> Ahh sorry about that. I'll take a look now...
>>>>>>>> No problem.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Yang
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>> have impact to my patches (basically check the new boot
>>>>>>>>>>>>>>>>> parameter).
>>>>>>>>>>>>>>>>> Do you
>>>>>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my
>>>>>>>>>>>>>>>> queue!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>>> and am
>>>>>>>>>>>>>>>> not too bothered about the integration with that; I think it's
>>>>>>>>>>>>>>>> pretty
>>>>>>>>>>>>>>>> straight
>>>>>>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>>>>>> which I
>>>>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>>>>> Unfortunately I can't make it this year. Have fun!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>>>>>           * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>>>>>           * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>>>>>             Also included in this series in order to have the
>>>>>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>>>>>           * Enhanced __create_pgd_mapping() to handle split as
>>>>>>>>>>>>>>>>>> well per
>>>>>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>>>>>           * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>>>>>           * Supported asymmetric system by splitting kernel
>>>>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>>             system is detected per Ryan. I don't have such
>>>>>>>>>>>>>>>>>> system to
>>>>>>>>>>>>>>>>>> test,
>>>>>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>>>>>             testing is done by hacking kernel to call linear
>>>>>>>>>>>>>>>>>> mapping
>>>>>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>>>>>             unconditionally. The linear mapping doesn't have any
>>>>>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>>>>>             mappings after booting.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>>>>>           * Used allowlist to advertise BBM lv2 on the CPUs which
>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>>>>>             conflict gracefully per Will Deacon
>>>>>>>>>>>>>>>>>>           * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>>>>>           * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to
>>>>>>>>>>>>>>>>>> arm's
>>>>>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear
>>>>>>>>>>>>>>>>>> map is
>>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>>>>>           - performance degradation
>>>>>>>>>>>>>>>>>>           - more TLB pressure
>>>>>>>>>>>>>>>>>>           - memory waste for kernel page table
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel
>>>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>> line but this disables the alias checks on page table
>>>>>>>>>>>>>>>>>> permissions and
>>>>>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the
>>>>>>>>>>>>>>>>>> kernel to
>>>>>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This patch adds support for splitting large mappings when
>>>>>>>>>>>>>>>>>> FEAT_BBM
>>>>>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be
>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be
>>>>>>>>>>>>>>>>>> repainted
>>>>>>>>>>>>>>>>>> once
>>>>>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for
>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>>>>>> rodata=full without compromising the security features of the
>>>>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>>>>>           - Kernel boot.  Kernel needs change kernel linear
>>>>>>>>>>>>>>>>>> mapping
>>>>>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>>>>>             boot stage, if the patch didn't work, kernel
>>>>>>>>>>>>>>>>>> typically
>>>>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>>>>>           - Module stress from stress-ng. Kernel module load
>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>             linear mapping.
>>>>>>>>>>>>>>>>>>           - A test kernel module which allocates 80% of total
>>>>>>>>>>>>>>>>>> memory
>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>>>>>             then change the vmalloc area permission to RO,
>>>>>>>>>>>>>>>>>> this also
>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>>>>             mapping permission to RO, then change it back before
>>>>>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>>>>>             a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>>>>>           - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>>>>>           - Kernel build in VM with guest kernel which has this
>>>>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>>>>>           - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>>>>>>           - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the
>>>>>>>>>>>>>>>>>> machine,
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark
>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%,
>>>>>>>>>>>>>>>>>> P99
>>>>>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel
>>>>>>>>>>>>>>>>>> TLB
>>>>>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4)
>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>>>>>             --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>>>>>             --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>>>>>             --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>>>>>> number of good case is around 90% more than the best number of
>>>>>>>>>>>>>>>>>> bad
>>>>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced
>>>>>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>>>>>               arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>>>>>               arm64: cpufeature: add AmpereOne to BBML2 allow
>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>               arm64: mm: make __create_pgd_mapping() and helpers
>>>>>>>>>>>>>>>>>> non-void
>>>>>>>>>>>>>>>>>>               arm64: mm: support large block mapping when
>>>>>>>>>>>>>>>>>> rodata=full
>>>>>>>>>>>>>>>>>>               arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>>>>>               arm64: mm: split linear mapping if BBML2 is not
>>>>>>>>>>>>>>>>>> supported on
>>>>>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>          arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>>>>>>          arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>>>>>>          arch/arm64/kernel/cpufeature.c      | 95 +++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>>          arch/arm64/mm/mmu.c                 | 397 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------
>>>>>>>>>>>>>>>>>>          arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>>>>>>          arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>>>>>          9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 18:30                                   ` Ryan Roberts
@ 2025-05-29 19:52                                     ` Yang Shi
  2025-05-30  7:17                                       ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-29 19:52 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

>>> I just had another conversation about this internally, and there is another
>>> concern; we obviously don't want to modify the pgtables while other CPUs that
>>> don't support BBML2 could be accessing them. Even in stop_machine() this may be
>>> possible if the CPU stacks and task structure (for example) are allocated out of
>>> the linear map.
>>>
>>> So we need to be careful to follow the pattern used by kpti; all secondary CPUs
>>> need to switch to the idmap (which is installed in TTBR0) then install the
>>> reserved map in TTBR1, then wait for CPU 0 to repaint the linear map, then have
>>> the secondary CPUs switch TTBR1 back to swapper then switch back out of idmap.
>> So the below code should be ok?
>>
>> cpu_install_idmap()
>> Busy loop to wait for cpu 0 done
>> cpu_uninstall_idmap()
> Once you have installed the idmap, you'll need to call a function by its PA so
> you are actually executing out of the idmap. And you will need to be in assembly
> so you don't need the stack, and you'll need to switch TTBR1 to the reserved
> pgtable, so that the CPU has no access to the swapper pgtable (which CPU 0 is
> able to modify).
>
> You may well be able to reuse __idmap_kpti_secondary in proc.S, or lightly
> refactor it to work for both the existing idmap_kpti_install_ng_mappings case,
> and your case.

I'm wondering whether we really need the idmap for repainting. I think
repainting is different from kpti. We just split the linear map, which is
*not* used by the kernel itself; the mappings for the kernel itself are
intact, we don't touch them at all. So as long as CPU 0 does not repaint
the linear map until all the other CPUs are busy looping in the
stop_machine fn, we are fine.

We can have two flags to control it. The first one would be a cpu mask:
each secondary CPU sets its own bit to tell CPU 0 it is in the stop
machine fn (ready for repainting). The other flag is used by CPU 0 to
tell all secondary CPUs that repainting is done and they can resume. The
two flags need to live in the kernel data section instead of on the
stack.

The code of fn is in the kernel text section and the flags are in the
kernel data section, so I don't see why fn (just doing a simple busy
loop) on the secondary CPUs would need to access the linear map while the
linear map is being repainted. After repainting, the TLB will be flushed
before the secondary CPUs are allowed to resume, so any access to a
linear map address after that point should be safe too.

Does it sound reasonable to you? Did I miss something?
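
Just to illustrate what I have in mind, a very rough sketch of that
handshake (all names below are made up, this is not the actual patch)
could be something like:

static cpumask_t repaint_ready_mask;
static bool repaint_done;

static int repaint_linear_map_fn(void *unused)
{
        unsigned int cpu = smp_processor_id();

        /* every CPU reports that it is parked inside the callback */
        cpumask_set_cpu(cpu, &repaint_ready_mask);

        if (cpu != 0) {
                /* secondary CPUs: spin until CPU 0 has finished repainting */
                while (!READ_ONCE(repaint_done))
                        cpu_relax();
                return 0;
        }

        /* CPU 0: wait until every online CPU has checked in */
        while (!cpumask_equal(&repaint_ready_mask, cpu_online_mask))
                cpu_relax();

        /* made-up helper: split the linear map down to PTE level */
        repaint_linear_map_to_ptes();

        flush_tlb_all();
        WRITE_ONCE(repaint_done, true);

        return 0;
}

The callback would then be invoked via
stop_machine(repaint_linear_map_fn, NULL, cpu_online_mask), with both
flags living in kernel data rather than on any stack.
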

Thanks,
Yang

>
> Thanks,
> Ryan
>
>>> Given CPU 0 supports BBML2, I think it can just update the linear map live,
>>> without needing to do the idmap dance?
>> Yes, I think so too.
>>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>> Thanks,
>>>> Ryan
>>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 18:34                                   ` Ryan Roberts
@ 2025-05-29 20:52                                     ` Yang Shi
  2025-05-30  7:59                                       ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-29 20:52 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

>>>>>> The split_mapping() guarantees keep block mapping if it is fully contained in
>>>>>> the range between start and end, this is my series's responsibility. I know
>>>>>> the
>>>>>> current code calls apply_to_page_range() to apply permission change and it
>>>>>> just
>>>>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new
>>>>>> API,
>>>>>> then __change_memory_common() will call it to change permission. There
>>>>>> should be
>>>>>> some overlap between mine and Dev's, but I don't see strong dependency.
>>>>> But if you have a block mapping in the region you are calling
>>>>> __change_memory_common() on, today that will fail because it can only handle
>>>>> page mappings.
>>>> IMHO letting __change_memory_common() manipulate on contiguous address range is
>>>> another story and should be not a part of the split primitive.
>>> I 100% agree that it should not be part of the split primitive.
>>>
>>> But your series *depends* upon __change_memory_common() being able to change
>>> permissions on block mappings. Today it can only change permissions on page
>>> mappings.
>> I don't think split primitive depends on it. Changing permission on block
>> mappings is just the user of the new split primitive IMHO. We just have no real
>> user right now.
> But your series introduces a real user; after your series, the linear map is
> block mapped.

The users of the split primitive are the permission changers, for 
example, module, bpf, secret mem, etc.

>
> Anyway, I think we are talking past eachother. Let's continue the conversation
> in the context of your next version of the code.

Yeah, sure.

Thanks,
Yang

>
>>> Your original v1 series solved this by splitting *all* of the mappings in a
>>> given range to page mappings before calling __change_memory_common(), right?
>> Yes, but if the range is contiguous, the new split primitive doesn't have to
>> split to page mappings.
>>
>>> Remember it's not just vmalloc areas that are passed to
>>> __change_memory_common(); virtually contiguous linear map regions can be passed
>>> in as well. See (for example) set_direct_map_invalid_noflush(),
>>> set_direct_map_default_noflush(), set_direct_map_valid_noflush(),
>>> __kernel_map_pages(), realm_set_memory_encrypted(), realm_set_memory_decrypted().
>> Yes, no matter who the caller is, as long as the caller passes in contiguous
>> address range, the split primitive can keep block mappings.
>>
>>>
>>>> For example, we need to use vmalloc_huge() instead of vmalloc() to allocate huge
>>>> memory, then does:
>>>> split_mapping(start, start+HPAGE_PMD_SIZE);
>>>> change_permission(start, start+HPAGE_PMD_SIZE);
>>>>
>>>> The split primitive will guarantee (start, start+HPAGE_PMD_SIZE) is kept as PMD
>>>> mapping so that change_permission() can change it on PMD basis too.
>>>>
>>>> But this requires other kernel subsystems, for example, module, to allocate huge
>>>> memory with proper APIs, for example, vmalloc_huge().
>>> The longer term plan is to have vmalloc() always allocate using the
>>> VM_ALLOW_HUGE_VMAP flag on systems that support BBML2. So there will be no need
>>> to migrate users to vmalloc_huge(). We will just detect if we can split live
>>> mappings safely and use huge mappings in that case.
>> Anyway this is the potential user of the new split primitive.
>>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>>>> Regarding the linear map repainting, I had a chat with Catalin, and he
>>>>>>> reminded
>>>>>>> me of a potential problem; if you are doing the repainting with the machine
>>>>>>> stopped, you can't allocate memory at that point; it's possible a CPU was
>>>>>>> inside
>>>>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>>>>> would be to figure out how much memory you will need and pre-allocate
>>>>>>> prior to
>>>>>>> stoping the machine?
>>>>>> OK, I don't remember we discussed this problem before. I think we can do
>>>>>> something like what kpti does. When creating the linear map we know how
>>>>>> many PUD
>>>>>> and PMD mappings are created, we can record the number, it will tell how many
>>>>>> pages we need for repainting the linear map.
>>>>> I saw a separate reply you sent for this. I'll read that and respond in that
>>>>> context.
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing
>>>>>>>> the
>>>>>>>> split primitive and linear map repainting. Does it sound good to you?
>>>>>>> That works assuming you have a solution for the above.
>>>>>> I think the only missing part is preallocating page tables for repainting. I
>>>>>> will add this, then post the new spin to the mailing list.
>>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yang
>>>>>>>>
>>>>>>>>
>>>>>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>>>>>>> gently
>>>>>>>>>>>>>>> ping,
>>>>>>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>>>>>>> I'm very
>>>>>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>>>>>> (although
>>>>>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>>>>>> suggestions I
>>>>>>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>>>>>>> week.
>>>>>>>>>>>>>> I'm
>>>>>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>> vmalloc
>>>>>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page
>>>>>>>>>>>>> table
>>>>>>>>>>>>> too.
>>>>>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>>>>>> thinking:
>>>>>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>>>>>> vmalloc
>>>>>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e.
>>>>>>>>>>>> block
>>>>>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>>>>>
>>>>>>>>>>>> vmalloc has history with trying to do huge mappings by default; it
>>>>>>>>>>>> ended up
>>>>>>>>>>>> having to be turned into an opt-in feature (instead of the original
>>>>>>>>>>>> opt-out
>>>>>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>>>>>> expecting
>>>>>>>>>>>> page mappings. I think we might be able to overcome those issues on
>>>>>>>>>>>> arm64
>>>>>>>>>>>> with
>>>>>>>>>>>> BBML2.
>>>>>>>>>>>>
>>>>>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I
>>>>>>>>>>>> have a
>>>>>>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>>>>>>> vmalloc
>>>>>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>>>>>>> To be
>>>>>>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come
>>>>>>>>>>>> in.
>>>>>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>>>>>
>>>>>>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>>>>>>> kernel
>>>>>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>>>>>> middle of
>>>>>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary.
>>>>>>>>>>>> This
>>>>>>>>>>>> will
>>>>>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE
>>>>>>>>>>>> entries.
>>>>>>>>>>>> The
>>>>>>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>>>>>>> EINVAL
>>>>>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>>>>>>> perform
>>>>>>>>>>>> the split.
>>>>>>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>>>>>>> returning
>>>>>>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear
>>>>>>>>>>> mapping
>>>>>>>>>>> should be always present. It is easy to return -EINVAL instead of BUG_ON.
>>>>>>>>>>> However I'm wondering what usecases you are thinking about? Splitting
>>>>>>>>>>> vmalloc
>>>>>>>>>>> area may run into unmapped VA?
>>>>>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>>>>>>> discouraged. I think even for vmalloc under correct conditions we
>>>>>>>>>> shouldn't
>>>>>>>>>> see
>>>>>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>>>>>> vunmap_pmd_range() which skips the pmd if its none.
>>>>>>>>>>
>>>>>>>>>>>> Then we can use that primitive on the start and end address of any
>>>>>>>>>>>> range for
>>>>>>>>>>>> which we need exact mapping boundaries (e.g. when changing
>>>>>>>>>>>> permissions on
>>>>>>>>>>>> part
>>>>>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>>>>>> allocation,
>>>>>>>>>>>> etc). This way we only split enough to ensure the boundaries are
>>>>>>>>>>>> precise,
>>>>>>>>>>>> and
>>>>>>>>>>>> keep larger mappings inside the range.
>>>>>>>>>>> Yeah, makes sense to me.
>>>>>>>>>>>
>>>>>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev
>>>>>>>>>>>> Jain has
>>>>>>>>>>>> been working on a series that converts this to use
>>>>>>>>>>>> walk_page_range_novma() so
>>>>>>>>>>>> that we can change permissions on the block/contig entries too.
>>>>>>>>>>>> That's not
>>>>>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable
>>>>>>>>>>>> with
>>>>>>>>>>>> posting an RFC early next week.
>>>>>>>>>>> OK, so the new __change_memory_common() will change the permission of
>>>>>>>>>>> page
>>>>>>>>>>> table, right?
>>>>>>>>>> It will change permissions of all the leaf entries in the range of VAs
>>>>>>>>>> it is
>>>>>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we
>>>>>>>>>> will
>>>>>>>>>> generalize to support all the other types of leaf entries too.,
>>>>>>>>>>
>>>>>>>>>>> If I remember correctly, you suggested change permissions in
>>>>>>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>>>>>> context
>>>>>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes
>>>>>>>>>> sense to
>>>>>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>>>>>> vmalloc
>>>>>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>>>>>> have an
>>>>>>>>>> independent capability.
>>>>>>>>> OK, sounds good to me.
>>>>>>>>>
>>>>>>>>>>> The current code assumes the address range passed in by
>>>>>>>>>>> change_memory_common()
>>>>>>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page
>>>>>>>>>>> table
>>>>>>>>>>> permission on page basis. I'm supposed Dev's patches will handle this
>>>>>>>>>>> then my
>>>>>>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>>>>>>> physically contiguous too otherwise I can't keep large mappings inside
>>>>>>>>>>> the
>>>>>>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>>>>>
>>>>>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>>>>>> implementation so that it can walk a VA range and update the
>>>>>>>>>> permissions on
>>>>>>>>>> each
>>>>>>>>>> leaf entry it visits, regadless of which level the leaf entry is at. This
>>>>>>>>>> doesn't make any assumption of the physical contiguity of neighbouring
>>>>>>>>>> leaf
>>>>>>>>>> entries in the page table.
>>>>>>>>>>
>>>>>>>>>> So if we are changing permissions on the linear map, we have a range of
>>>>>>>>>> VAs to
>>>>>>>>>> walk and convert all the leaf entries, regardless of their size. The same
>>>>>>>>>> goes
>>>>>>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>>>>>>> permissions in the linear map, so we will have to figure out the
>>>>>>>>>> contiguous
>>>>>>>>>> pieces of the linear map and call __change_memory_common() for each;
>>>>>>>>>> there is
>>>>>>>>>> definitely some detail to work out there!
>>>>>>>>> Yes, this is my point. When changing underlying linear map permission for
>>>>>>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>>>>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>>>>>>
>>>>>>>>> But how Dev's patch work should have no impact on how I implement the split
>>>>>>>>> primitive by thinking it further. It should be the caller's
>>>>>>>>> responsibility to
>>>>>>>>> make sure __create_pgd_mapping_locked() is called for contiguous linear map
>>>>>>>>> address range.
>>>>>>>>>
>>>>>>>>>>>> You'll still need to repaint the whole linear map with page mappings
>>>>>>>>>>>> for the
>>>>>>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked()
>>>>>>>>>>>> (potentially
>>>>>>>>>>>> with
>>>>>>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>>>>>>> similar to
>>>>>>>>>>>> how you are doing it in v3.
>>>>>>>>>>> Yes, when repainting I need to split the page table all the way down
>>>>>>>>>>> to PTE
>>>>>>>>>>> level. A simple flag should be good enough to tell
>>>>>>>>>>> __create_pgd_mapping_locked()
>>>>>>>>>>> do the right thing off the top of my head.
>>>>>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>>>>>> NO_CONT_MAPPINGS
>>>>>>>>>> flags? For example, if you are find a leaf mapping and
>>>>>>>>>> NO_BLOCK_MAPPINGS is
>>>>>>>>>> set,
>>>>>>>>>> then you need to split it?
>>>>>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>>>>>
>>>>>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>>>>>>
>>>>>>>>>>>> So in summary, what I'm asking for your large block mapping the
>>>>>>>>>>>> linear map
>>>>>>>>>>>> series is:
>>>>>>>>>>>>         - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>>>>>>         - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>>>>>>> support BBML2
>>>>>>>>>>> OK, I just need to add some simple tweak to split down to PTE level to
>>>>>>>>>>> v3.
>>>>>>>>>>>
>>>>>>>>>>>>         - Integrate Dev's __change_memory_common() series
>>>>>>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch
>>>>>>>>>>> need
>>>>>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>>>>>
>>>>>>>>>>>>         - Create primitive to ensure mapping entry boundary at a given
>>>>>>>>>>>> page-
>>>>>>>>>>>> aligned VA
>>>>>>>>>>>>         - Use primitive when changing permissions on linear map region
>>>>>>>>>>> Sure.
>>>>>>>>>>>
>>>>>>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>>>>>>> base
>>>>>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>> Definitely makes sense to me.
>>>>>>>>>>>
>>>>>>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>>>>>>> for v3
>>>>>>>>>>> in my replies on March 17, particularly:
>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>>>>>>> Ahh sorry about that. I'll take a look now...
>>>>>>>>> No problem.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight
>>>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>> have impact to my patches (basically check the new boot
>>>>>>>>>>>>>>>>>> parameter).
>>>>>>>>>>>>>>>>>> Do you
>>>>>>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>>> from the new spin or review the current patches then solve the new
>>>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my
>>>>>>>>>>>>>>>>> queue!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>>>> and am
>>>>>>>>>>>>>>>>> not too bothered about the integration with that; I think it's
>>>>>>>>>>>>>>>>> pretty
>>>>>>>>>>>>>>>>> straight
>>>>>>>>>>>>>>>>> forward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>>>>>>> which I
>>>>>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>>>>>>            * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>>>>>>            * Based on Miko's BBML2 cpufeature patch (https://
>>>>>>>>>>>>>>>>>>> lore.kernel.org/
>>>>>>>>>>>>>>>>>>> linux-
>>>>>>>>>>>>>>>>>>> arm-kernel/20250228182403.6269-3-miko.lenczewski@arm.com/).
>>>>>>>>>>>>>>>>>>>              Also included in this series in order to have the
>>>>>>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>>>>>>> patchset.
>>>>>>>>>>>>>>>>>>>            * Enhanced __create_pgd_mapping() to handle split as
>>>>>>>>>>>>>>>>>>> well per
>>>>>>>>>>>>>>>>>>> Ryan.
>>>>>>>>>>>>>>>>>>>            * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>>>>>>            * Supported asymmetric system by splitting kernel
>>>>>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>>>>> mapping if
>>>>>>>>>>>>>>>>>>> such
>>>>>>>>>>>>>>>>>>>              system is detected per Ryan. I don't have such
>>>>>>>>>>>>>>>>>>> system to
>>>>>>>>>>>>>>>>>>> test,
>>>>>>>>>>>>>>>>>>> so the
>>>>>>>>>>>>>>>>>>>              testing is done by hacking kernel to call linear
>>>>>>>>>>>>>>>>>>> mapping
>>>>>>>>>>>>>>>>>>> repainting
>>>>>>>>>>>>>>>>>>>              unconditionally. The linear mapping doesn't have any
>>>>>>>>>>>>>>>>>>> block and
>>>>>>>>>>>>>>>>>>> cont
>>>>>>>>>>>>>>>>>>>              mappings after booting.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>>>>>>            * Used allowlist to advertise BBM lv2 on the CPUs which
>>>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>>>> handle TLB
>>>>>>>>>>>>>>>>>>>              conflict gracefully per Will Deacon
>>>>>>>>>>>>>>>>>>>            * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>>>>>>            * https://lore.kernel.org/linux-arm-
>>>>>>>>>>>>>>>>>>> kernel/20250103011822.1257189-1-
>>>>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-
>>>>>>>>>>>>>>>>>>> yang@os.amperecomputing.com/
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>>>> When rodata=full kernel linear mapping is mapped by PTE due to
>>>>>>>>>>>>>>>>>>> arm's
>>>>>>>>>>>>>>>>>>> break-before-make rule.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear
>>>>>>>>>>>>>>>>>>> map is
>>>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>>>>>>            - performance degradation
>>>>>>>>>>>>>>>>>>>            - more TLB pressure
>>>>>>>>>>>>>>>>>>>            - memory waste for kernel page table
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on the kernel
>>>>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>>> line but this disables the alias checks on page table
>>>>>>>>>>>>>>>>>>> permissions and
>>>>>>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the
>>>>>>>>>>>>>>>>>>> kernel to
>>>>>>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This patch adds support for splitting large mappings when
>>>>>>>>>>>>>>>>>>> FEAT_BBM
>>>>>>>>>>>>>>>>>>> level 2
>>>>>>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be
>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be
>>>>>>>>>>>>>>>>>>> repainted
>>>>>>>>>>>>>>>>>>> once
>>>>>>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs.  See patch #6 for
>>>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>> details.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>>>>>>> rodata=full without compromising the security features of the
>>>>>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>>>>>>            - Kernel boot.  Kernel needs change kernel linear
>>>>>>>>>>>>>>>>>>> mapping
>>>>>>>>>>>>>>>>>>> permission at
>>>>>>>>>>>>>>>>>>>              boot stage, if the patch didn't work, kernel
>>>>>>>>>>>>>>>>>>> typically
>>>>>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>>>>> boot.
>>>>>>>>>>>>>>>>>>>            - Module stress from stress-ng. Kernel module load
>>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>>> permission
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>              linear mapping.
>>>>>>>>>>>>>>>>>>>            - A test kernel module which allocates 80% of total
>>>>>>>>>>>>>>>>>>> memory
>>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>> vmalloc(),
>>>>>>>>>>>>>>>>>>>              then change the vmalloc area permission to RO,
>>>>>>>>>>>>>>>>>>> this also
>>>>>>>>>>>>>>>>>>> change
>>>>>>>>>>>>>>>>>>> linear
>>>>>>>>>>>>>>>>>>>              mapping permission to RO, then change it back before
>>>>>>>>>>>>>>>>>>> vfree(). Then
>>>>>>>>>>>>>>>>>>> launch
>>>>>>>>>>>>>>>>>>>              a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>>>>>>            - VM with the patchset applied in guest kernel too.
>>>>>>>>>>>>>>>>>>>            - Kernel build in VM with guest kernel which has this
>>>>>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>>>>>> applied.
>>>>>>>>>>>>>>>>>>>            - rodata=on. Make sure other rodata mode is not broken.
>>>>>>>>>>>>>>>>>>>            - Boot on the machine which doesn't support BBML2.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>>>>>>> MemTotal:       258988984 kB
>>>>>>>>>>>>>>>>>>> MemFree:        254821700 kB
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>>>>>>> MemTotal:       259505132 kB
>>>>>>>>>>>>>>>>>>> MemFree:        255410264 kB
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Around 500MB more memory are free to use.  The larger the
>>>>>>>>>>>>>>>>>>> machine,
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> more memory saved.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>>>>>>> We saw performance degradation when running Memcached benchmark
>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> rodata=full vs rodata=on.  Our profiling pointed to kernel TLB
>>>>>>>>>>>>>>>>>>> pressure.
>>>>>>>>>>>>>>>>>>> With this patchset we saw ops/sec is increased by around 3.5%,
>>>>>>>>>>>>>>>>>>> P99
>>>>>>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses.  The kernel
>>>>>>>>>>>>>>>>>>> TLB
>>>>>>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>>>>>>> Ran fio benchmark with the below command on a 128G ramdisk (ext4)
>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>>>>> encryption (by dm-crypt).
>>>>>>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --
>>>>>>>>>>>>>>>>>>> randrepeat 1 \
>>>>>>>>>>>>>>>>>>>              --status-interval=999 --rw=write --bs=4k --loops=1 --
>>>>>>>>>>>>>>>>>>> ioengine=sync \
>>>>>>>>>>>>>>>>>>>              --iodepth=1 --numjobs=1 --fsync_on_close=1 --
>>>>>>>>>>>>>>>>>>> group_reporting --
>>>>>>>>>>>>>>>>>>> thread \
>>>>>>>>>>>>>>>>>>>              --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> worst
>>>>>>>>>>>>>>>>>>> number of good case is around 90% more than the best number of
>>>>>>>>>>>>>>>>>>> bad
>>>>>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>>>>>> The bandwidth is increased and the avg clat is reduced
>>>>>>>>>>>>>>>>>>> proportionally.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>>>>>>> Read 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>>>>>>> populated).
>>>>>>>>>>>>>>>>>>> The bandwidth is increased by 150%.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>>>>>>                arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>>>>>>                arm64: cpufeature: add AmpereOne to BBML2 allow
>>>>>>>>>>>>>>>>>>> list
>>>>>>>>>>>>>>>>>>>                arm64: mm: make __create_pgd_mapping() and helpers
>>>>>>>>>>>>>>>>>>> non-void
>>>>>>>>>>>>>>>>>>>                arm64: mm: support large block mapping when
>>>>>>>>>>>>>>>>>>> rodata=full
>>>>>>>>>>>>>>>>>>>                arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>>>>>>                arm64: mm: split linear mapping if BBML2 is not
>>>>>>>>>>>>>>>>>>> supported on
>>>>>>>>>>>>>>>>>>> secondary
>>>>>>>>>>>>>>>>>>> CPUs
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>           arch/arm64/Kconfig                  | 11 +++++
>>>>>>>>>>>>>>>>>>>           arch/arm64/include/asm/cpucaps.h    | 2 +
>>>>>>>>>>>>>>>>>>>           arch/arm64/include/asm/cpufeature.h | 15 ++++++
>>>>>>>>>>>>>>>>>>>           arch/arm64/include/asm/mmu.h        | 4 ++
>>>>>>>>>>>>>>>>>>>           arch/arm64/include/asm/pgtable.h    | 12 ++++-
>>>>>>>>>>>>>>>>>>>           arch/arm64/kernel/cpufeature.c      | 95 ++++++++++++
>>>>>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>>>>>> +++++++
>>>>>>>>>>>>>>>>>>>           arch/arm64/mm/mmu.c                 | 397 ++++++++++++++
>>>>>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>>>>>>> ++++++
>>>>>>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>>> ++++
>>>>>>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>>>>>>> +++++
>>>>>>>>>>>>>>>>>>> ++++++++++++++++++++++-------------------
>>>>>>>>>>>>>>>>>>>           arch/arm64/mm/pageattr.c            | 37 ++++++++++++---
>>>>>>>>>>>>>>>>>>>           arch/arm64/tools/cpucaps            | 1 +
>>>>>>>>>>>>>>>>>>>           9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 19:52                                     ` Yang Shi
@ 2025-05-30  7:17                                       ` Ryan Roberts
  2025-05-30 21:21                                         ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-30  7:17 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 29/05/2025 20:52, Yang Shi wrote:
>>>> I just had another conversation about this internally, and there is another
>>>> concern; we obviously don't want to modify the pgtables while other CPUs that
>>>> don't support BBML2 could be accessing them. Even in stop_machine() this may be
>>>> possible if the CPU stacks and task structure (for example) are allocated
>>>> out of
>>>> the linear map.
>>>>
>>>> So we need to be careful to follow the pattern used by kpti; all secondary CPUs
>>>> need to switch to the idmap (which is installed in TTBR0) then install the
>>>> reserved map in TTBR1, then wait for CPU 0 to repaint the linear map, then have
>>>> the secondary CPUs switch TTBR1 back to swapper then switch back out of idmap.
>>> So the below code should be ok?
>>>
>>> cpu_install_idmap()
>>> Busy loop to wait for cpu 0 done
>>> cpu_uninstall_idmap()
>> Once you have installed the idmap, you'll need to call a function by its PA so
>> you are actually executing out of the idmap. And you will need to be in assembly
>> so you don't need the stack, and you'll need to switch TTBR1 to the reserved
>> pgtable, so that the CPU has no access to the swapper pgtable (which CPU 0 is
>> able to modify).
>>
>> You may well be able to reuse __idmap_kpti_secondary in proc.S, or lightly
>> refactor it to work for both the existing idmap_kpti_install_ng_mappings case,
>> and your case.
> 
> I'm wondering whether we really need idmap for repainting. I think repainting is
> different from kpti. We just split linear map which is *not* used by kernel
> itself, the mappings for kernel itself is intact, we don't touch it at all. So
> as long as CPU 0 will not repaint the linear map until all other CPUs busy
> looping in stop_machine fn, then we are fine.

But *how* are the other CPUs busy looping? Are they polling a variable? Where
does that variable live? The docs say that a high priority thread is run for
each CPU. So there at least needs to be a task struct and a stack. There are
some Kconfigs where the stack comes from the linear map, so if the variable that
it polls is on its stack (or even on CPU 0's stack) then that's a problem. If
the scheduler runs and accesses the task struct, which may be allocated from the
linear map (e.g. via kmalloc), that's a problem too.

The point is that you have to understand all the details of stop_machine() to be
confident that it is never accessing the linear map. And even if you can prove
that today, there is nothing stopping the implementation from changing in future.

But then you have non-architectural memory accesses too (i.e. speculative
accesses). It's possible that the CPU does a speculative load, which causes the
TLB to do a translation and cache a TLB entry for the linear map. Then CPU 0
changes the pgtable and you have broken the BBM requirements from the secondary
CPU's perspective.

So personally I think the only truly safe way to solve this is to switch the
secondary CPUs to the idmap, then install the reserved map in TTBR1. That way,
the secondary CPUs can't see the swapper pgtable at all and CPU 0 is free to do
what it likes.
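
In rough pseudo-code, the per-secondary-CPU sequence I'm imagining is the 
below (the real thing needs to be assembly, like __idmap_kpti_secondary, so 
it runs with no stack and no swapper access; "repaint_done" is just an 
illustrative flag name, polled via a mapping the secondary can still see):

#include <asm/mmu_context.h>

static void secondary_wait_for_repaint(void)
{
	cpu_install_idmap();		/* TTBR0 := idmap */
	/* ...branch to the physical alias of this wait loop... */
	cpu_set_reserved_ttbr1();	/* TTBR1 := reserved; swapper hidden */

	while (!READ_ONCE(repaint_done))	/* CPU 0 sets this after repaint + TLBI */
		cpu_relax();

	/* restore swapper_pg_dir in TTBR1, as __idmap_kpti_secondary does */
	cpu_uninstall_idmap();		/* back onto the normal kernel mappings */
}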

> 
> We can have two flags to control it. The first one should be a cpu mask, all
> secondary CPUs set its own mask bit to tell CPU 0 it is in stop machine fn
> (ready for repainting). The other flag is used by CPU 0 to tell all secondary
> CPUs repainting is done, please resume. We need have the two flags in kernel
> data section instead of stack.
> 
> The code of fn is in kernel text section, the flags are in kernel data section.
> I don't see how come fn (just doing simple busy loop) on secondary CPUs need to
> access linear map while repainting the linear map. After repainting the TLB will
> be flushed before letting secondary CPUs resume, so any access to linear map
> address after that point should be safe too.
> 
> Does it sound reasonable to you? Did I miss something?

I think the potential for speculative access is the problem. Personally, I would
follow the pattern laid out by kpti. Then you can more easily defend it by
pointing to an established pattern.

Thanks,
Ryan

> 
> Thanks,
> Yang
> 
>>
>> Thanks,
>> Ryan
>>
>>>> Given CPU 0 supports BBML2, I think it can just update the linear map live,
>>>> without needing to do the idmap dance?
>>> Yes, I think so too.
>>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-29 20:52                                     ` Yang Shi
@ 2025-05-30  7:59                                       ` Ryan Roberts
  2025-05-30 17:18                                         ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-05-30  7:59 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 29/05/2025 21:52, Yang Shi wrote:
>>>>>>> The split_mapping() guarantees keep block mapping if it is fully
>>>>>>> contained in
>>>>>>> the range between start and end, this is my series's responsibility. I know
>>>>>>> the
>>>>>>> current code calls apply_to_page_range() to apply permission change and it
>>>>>>> just
>>>>>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new
>>>>>>> API,
>>>>>>> then __change_memory_common() will call it to change permission. There
>>>>>>> should be
>>>>>>> some overlap between mine and Dev's, but I don't see strong dependency.
>>>>>> But if you have a block mapping in the region you are calling
>>>>>> __change_memory_common() on, today that will fail because it can only handle
>>>>>> page mappings.
>>>>> IMHO letting __change_memory_common() manipulate on contiguous address
>>>>> range is
>>>>> another story and should be not a part of the split primitive.
>>>> I 100% agree that it should not be part of the split primitive.
>>>>
>>>> But your series *depends* upon __change_memory_common() being able to change
>>>> permissions on block mappings. Today it can only change permissions on page
>>>> mappings.
>>> I don't think split primitive depends on it. Changing permission on block
>>> mappings is just the user of the new split primitive IMHO. We just have no real
>>> user right now.
>> But your series introduces a real user; after your series, the linear map is
>> block mapped.
> 
> The users of the split primitive are the permission changers, for example,
> module, bpf, secret mem, etc.

Ahh, perhaps this is the crux of our misunderstanding... In my model, the split
primitive is called from __change_memory_common() (or from other appropriate
functions in pageattr.c). It's an implementation detail for arm64 and is not
exposed to common code. arm64 knows that it can split live mappings in a
transparent way so it uses huge pages eagerly and splits on demand.
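
Concretely, something along these lines in pageattr.c is what I'm picturing 
(a sketch only; split_kernel_leaf_mapping() and update_leaf_prots() are 
placeholder names, and it assumes we only take this path when BBML2 makes 
live splitting safe):

static int __change_memory_common(unsigned long start, unsigned long size,
				  pgprot_t set_mask, pgprot_t clear_mask)
{
	int ret;

	/*
	 * Only split at the boundaries; block/contig entries fully inside
	 * [start, start + size) are left intact.
	 */
	ret = split_kernel_leaf_mapping(start);
	if (!ret)
		ret = split_kernel_leaf_mapping(start + size);
	if (ret)
		return ret;

	/* walk the VA range and update every leaf entry found (Dev's series) */
	return update_leaf_prots(start, size, set_mask, clear_mask);
}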

I personally wouldn't want to rely on the memory user knowing that it needs to
split the mappings...

> 
>>
>> Anyway, I think we are talking past eachother. Let's continue the conversation
>> in the context of your next version of the code.
> 
> Yeah, sure.
> 
> Thanks,
> Yang
> 
>>
>>>> Your original v1 series solved this by splitting *all* of the mappings in a
>>>> given range to page mappings before calling __change_memory_common(), right?
>>> Yes, but if the range is contiguous, the new split primitive doesn't have to
>>> split to page mappings.
>>>
>>>> Remember it's not just vmalloc areas that are passed to
>>>> __change_memory_common(); virtually contiguous linear map regions can be passed
>>>> in as well. See (for example) set_direct_map_invalid_noflush(),
>>>> set_direct_map_default_noflush(), set_direct_map_valid_noflush(),
>>>> __kernel_map_pages(), realm_set_memory_encrypted(),
>>>> realm_set_memory_decrypted().
>>> Yes, no matter who the caller is, as long as the caller passes in contiguous
>>> address range, the split primitive can keep block mappings.
>>>
>>>>
>>>>> For example, we need to use vmalloc_huge() instead of vmalloc() to allocate
>>>>> huge
>>>>> memory, then does:
>>>>> split_mapping(start, start+HPAGE_PMD_SIZE);
>>>>> change_permission(start, start+HPAGE_PMD_SIZE);
>>>>>
>>>>> The split primitive will guarantee (start, start+HPAGE_PMD_SIZE) is kept as
>>>>> PMD
>>>>> mapping so that change_permission() can change it on PMD basis too.
>>>>>
>>>>> But this requires other kernel subsystems, for example, module, to allocate
>>>>> huge
>>>>> memory with proper APIs, for example, vmalloc_huge().
>>>> The longer term plan is to have vmalloc() always allocate using the
>>>> VM_ALLOW_HUGE_VMAP flag on systems that support BBML2. So there will be no need
>>>> to migrate users to vmalloc_huge(). We will just detect if we can split live
>>>> mappings safely and use huge mappings in that case.
>>> Anyway this is the potential user of the new split primitive.
>>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> Thanks,
>>>>> Yang
>>>>>
>>>>>>>> Regarding the linear map repainting, I had a chat with Catalin, and he
>>>>>>>> reminded
>>>>>>>> me of a potential problem; if you are doing the repainting with the machine
>>>>>>>> stopped, you can't allocate memory at that point; it's possible a CPU was
>>>>>>>> inside
>>>>>>>> the allocator when it stopped. And I think you need to allocate
>>>>>>>> intermediate
>>>>>>>> pgtables, right? Do you have a solution to that problem? I guess one
>>>>>>>> approach
>>>>>>>> would be to figure out how much memory you will need and pre-allocate
>>>>>>>> prior to
>>>>>>>> stoping the machine?
>>>>>>> OK, I don't remember we discussed this problem before. I think we can do
>>>>>>> something like what kpti does. When creating the linear map we know how
>>>>>>> many PUD
>>>>>>> and PMD mappings are created, we can record the number, it will tell how
>>>>>>> many
>>>>>>> pages we need for repainting the linear map.
>>>>>> I saw a separate reply you sent for this. I'll read that and respond in that
>>>>>> context.
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>>>> So I plan to post v4 patches to the mailing list. We can focus on
>>>>>>>>> reviewing
>>>>>>>>> the
>>>>>>>>> split primitive and linear map repainting. Does it sound good to you?
>>>>>>>> That works assuming you have a solution for the above.
>>>>>>> I think the only missing part is preallocating page tables for repainting. I
>>>>>>> will add this, then post the new spin to the mailing list.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>>>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I know you may have a lot of things to follow up after LSF/MM. Just
>>>>>>>>>>>>>>>> gently
>>>>>>>>>>>>>>>> ping,
>>>>>>>>>>>>>>>> hopefully we can resume the review soon.
>>>>>>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd
>>>>>>>>>>>>>>> April. But
>>>>>>>>>>>>>>> I'm very
>>>>>>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>>>>>>> (although
>>>>>>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>>>>>>> suggestions I
>>>>>>>>>>>>>>> previously made may not have been quite right, but I'll elaborate
>>>>>>>>>>>>>>> next
>>>>>>>>>>>>>>> week.
>>>>>>>>>>>>>>> I'm
>>>>>>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse
>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>> vmalloc
>>>>>>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page
>>>>>>>>>>>>>> table
>>>>>>>>>>>>>> too.
>>>>>>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>>>>>>> thinking:
>>>>>>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>>>>>>> vmalloc
>>>>>>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e.
>>>>>>>>>>>>> block
>>>>>>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>>>>>>
>>>>>>>>>>>>> vmalloc has history with trying to do huge mappings by default; it
>>>>>>>>>>>>> ended up
>>>>>>>>>>>>> having to be turned into an opt-in feature (instead of the original
>>>>>>>>>>>>> opt-out
>>>>>>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>>>>>>> expecting
>>>>>>>>>>>>> page mappings. I think we might be able to overcome those issues on
>>>>>>>>>>>>> arm64
>>>>>>>>>>>>> with
>>>>>>>>>>>>> BBML2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I
>>>>>>>>>>>>> have a
>>>>>>>>>>>>> series (that should make v6.16) that enables contiguous PTE
>>>>>>>>>>>>> mappings in
>>>>>>>>>>>>> vmalloc
>>>>>>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is
>>>>>>>>>>>>> specified.
>>>>>>>>>>>>> To be
>>>>>>>>>>>>> able to use that by default, we need to be able to change
>>>>>>>>>>>>> permissions on
>>>>>>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series
>>>>>>>>>>>>> come
>>>>>>>>>>>>> in.
>>>>>>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the key thing we need is a function that can take a page-
>>>>>>>>>>>>> aligned
>>>>>>>>>>>>> kernel
>>>>>>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>>>>>>> middle of
>>>>>>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary.
>>>>>>>>>>>>> This
>>>>>>>>>>>>> will
>>>>>>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE
>>>>>>>>>>>>> entries.
>>>>>>>>>>>>> The
>>>>>>>>>>>>> function can assume BBML2 is present. And it will return 0 on
>>>>>>>>>>>>> success, -
>>>>>>>>>>>>> EINVAL
>>>>>>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a
>>>>>>>>>>>>> pgtable to
>>>>>>>>>>>>> perform
>>>>>>>>>>>>> the split.
>>>>>>>>>>>> OK, the v3 patches already handled page table allocation failure with
>>>>>>>>>>>> returning
>>>>>>>>>>>> -ENOMEM and BUG_ON if it is not mapped because kernel assumes linear
>>>>>>>>>>>> mapping
>>>>>>>>>>>> should be always present. It is easy to return -EINVAL instead of
>>>>>>>>>>>> BUG_ON.
>>>>>>>>>>>> However I'm wondering what usecases you are thinking about? Splitting
>>>>>>>>>>>> vmalloc
>>>>>>>>>>>> area may run into unmapped VA?
>>>>>>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel
>>>>>>>>>>> should be
>>>>>>>>>>> discouraged. I think even for vmalloc under correct conditions we
>>>>>>>>>>> shouldn't
>>>>>>>>>>> see
>>>>>>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>>>>>>> vunmap_pmd_range() which skips the pmd if its none.
>>>>>>>>>>>
>>>>>>>>>>>>> Then we can use that primitive on the start and end address of any
>>>>>>>>>>>>> range for
>>>>>>>>>>>>> which we need exact mapping boundaries (e.g. when changing
>>>>>>>>>>>>> permissions on
>>>>>>>>>>>>> part
>>>>>>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>>>>>>> allocation,
>>>>>>>>>>>>> etc). This way we only split enough to ensure the boundaries are
>>>>>>>>>>>>> precise,
>>>>>>>>>>>>> and
>>>>>>>>>>>>> keep larger mappings inside the range.
>>>>>>>>>>>> Yeah, makes sense to me.
>>>>>>>>>>>>
>>>>>>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev
>>>>>>>>>>>>> Jain has
>>>>>>>>>>>>> been working on a series that converts this to use
>>>>>>>>>>>>> walk_page_range_novma() so
>>>>>>>>>>>>> that we can change permissions on the block/contig entries too.
>>>>>>>>>>>>> That's not
>>>>>>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is
>>>>>>>>>>>>> comfortable
>>>>>>>>>>>>> with
>>>>>>>>>>>>> posting an RFC early next week.
>>>>>>>>>>>> OK, so the new __change_memory_common() will change the permission of
>>>>>>>>>>>> page
>>>>>>>>>>>> table, right?
>>>>>>>>>>> It will change permissions of all the leaf entries in the range of VAs
>>>>>>>>>>> it is
>>>>>>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we
>>>>>>>>>>> will
>>>>>>>>>>> generalize to support all the other types of leaf entries too.,
>>>>>>>>>>>
>>>>>>>>>>>> If I remember correctly, you suggested change permissions in
>>>>>>>>>>>> __create_pgd_mapping_locked() for v3. So I can disregard it?
>>>>>>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>>>>>>> context
>>>>>>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes
>>>>>>>>>>> sense to
>>>>>>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>>>>>>> vmalloc
>>>>>>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>>>>>>> have an
>>>>>>>>>>> independent capability.
>>>>>>>>>> OK, sounds good to me.
>>>>>>>>>>
>>>>>>>>>>>> The current code assumes the address range passed in by
>>>>>>>>>>>> change_memory_common()
>>>>>>>>>>>> is *NOT* physically contiguous so __change_memory_common() handles page
>>>>>>>>>>>> table
>>>>>>>>>>>> permission on page basis. I'm supposed Dev's patches will handle this
>>>>>>>>>>>> then my
>>>>>>>>>>>> patch can safely assume the linear mapping address range for
>>>>>>>>>>>> splitting is
>>>>>>>>>>>> physically contiguous too otherwise I can't keep large mappings inside
>>>>>>>>>>>> the
>>>>>>>>>>>> range. Splitting vmalloc area doesn't need to worry about this.
>>>>>>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>>>>>>
>>>>>>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>>>>>>> implementation so that it can walk a VA range and update the
>>>>>>>>>>> permissions on
>>>>>>>>>>> each
>>>>>>>>>>> leaf entry it visits, regadless of which level the leaf entry is at.
>>>>>>>>>>> This
>>>>>>>>>>> doesn't make any assumption of the physical contiguity of neighbouring
>>>>>>>>>>> leaf
>>>>>>>>>>> entries in the page table.
>>>>>>>>>>>
>>>>>>>>>>> So if we are changing permissions on the linear map, we have a range of
>>>>>>>>>>> VAs to
>>>>>>>>>>> walk and convert all the leaf entries, regardless of their size. The
>>>>>>>>>>> same
>>>>>>>>>>> goes
>>>>>>>>>>> for vmalloc... But for vmalloc, we will also want to change the
>>>>>>>>>>> underlying
>>>>>>>>>>> permissions in the linear map, so we will have to figure out the
>>>>>>>>>>> contiguous
>>>>>>>>>>> pieces of the linear map and call __change_memory_common() for each;
>>>>>>>>>>> there is
>>>>>>>>>>> definitely some detail to work out there!
>>>>>>>>>> Yes, this is my point. When changing underlying linear map permission for
>>>>>>>>>> vmalloc, the linear map address may be not contiguous. This is why
>>>>>>>>>> change_memory_common() calls __change_memory_common() on page basis.
>>>>>>>>>>
>>>>>>>>>> But how Dev's patch work should have no impact on how I implement the
>>>>>>>>>> split
>>>>>>>>>> primitive by thinking it further. It should be the caller's
>>>>>>>>>> responsibility to
>>>>>>>>>> make sure __create_pgd_mapping_locked() is called for contiguous
>>>>>>>>>> linear map
>>>>>>>>>> address range.
>>>>>>>>>>
>>>>>>>>>>>>> You'll still need to repaint the whole linear map with page mappings
>>>>>>>>>>>>> for the
>>>>>>>>>>>>> case !BBML2 case, but I'm hoping __create_pgd_mapping_locked()
>>>>>>>>>>>>> (potentially
>>>>>>>>>>>>> with
>>>>>>>>>>>>> minor modifications?) can do that repainting on the live mappings;
>>>>>>>>>>>>> similar to
>>>>>>>>>>>>> how you are doing it in v3.
>>>>>>>>>>>> Yes, when repainting I need to split the page table all the way down
>>>>>>>>>>>> to PTE
>>>>>>>>>>>> level. A simple flag should be good enough to tell
>>>>>>>>>>>> __create_pgd_mapping_locked()
>>>>>>>>>>>> do the right thing off the top of my head.
>>>>>>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>>>>>>> NO_CONT_MAPPINGS
>>>>>>>>>>> flags? For example, if you are find a leaf mapping and
>>>>>>>>>>> NO_BLOCK_MAPPINGS is
>>>>>>>>>>> set,
>>>>>>>>>>> then you need to split it?
>>>>>>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>>>>>>
>>>>>>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>>>>>>> Great! Anyway my series is based on his advertising BBML2 patch.
>>>>>>>>>>>>
>>>>>>>>>>>>> So in summary, what I'm asking for your large block mapping the
>>>>>>>>>>>>> linear map
>>>>>>>>>>>>> series is:
>>>>>>>>>>>>>         - Paint linear map using blocks/contig if boot CPU supports
>>>>>>>>>>>>> BBML2
>>>>>>>>>>>>>         - Repaint linear map using page mappings if secondary CPUs
>>>>>>>>>>>>> don't
>>>>>>>>>>>>> support BBML2
>>>>>>>>>>>> OK, I just need to add some simple tweak to split down to PTE level to
>>>>>>>>>>>> v3.
>>>>>>>>>>>>
>>>>>>>>>>>>>         - Integrate Dev's __change_memory_common() series
>>>>>>>>>>>> OK, I think I have to do my patches on top of it. Because Dev's patch
>>>>>>>>>>>> need
>>>>>>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>>>>>>
>>>>>>>>>>>>>         - Create primitive to ensure mapping entry boundary at a given
>>>>>>>>>>>>> page-
>>>>>>>>>>>>> aligned VA
>>>>>>>>>>>>>         - Use primitive when changing permissions on linear map region
>>>>>>>>>>>> Sure.
>>>>>>>>>>>>
>>>>>>>>>>>>> This will be mergable on its own, but will also provide a great
>>>>>>>>>>>>> starting
>>>>>>>>>>>>> base
>>>>>>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think?
>>>>>>>>>>>> Definitely makes sense to me.
>>>>>>>>>>>>
>>>>>>>>>>>> If I remember correctly, we still have some unsolved comments/questions
>>>>>>>>>>>> for v3
>>>>>>>>>>>> in my replies on March 17, particularly:
>>>>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-
>>>>>>>>>>>> b344-9401fa4c0feb@os.amperecomputing.com/
>>>>>>>>>>> Ahh sorry about that. I'll take a look now...
>>>>>>>>>> No problem.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Yang
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some
>>>>>>>>>>>>>>>>>>> slight
>>>>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>> have impact to my patches (basically check the new boot
>>>>>>>>>>>>>>>>>>> parameter).
>>>>>>>>>>>>>>>>>>> Do you
>>>>>>>>>>>>>>>>>>> prefer I rebase my patches on top of his new spin right now then
>>>>>>>>>>>>>>>>>>> restart
>>>>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>>>> from the new spin or review the current patches then solve
>>>>>>>>>>>>>>>>>>> the new
>>>>>>>>>>>>>>>>>>> review
>>>>>>>>>>>>>>>>>>> comments and rebase to Miko's new spin together?
>>>>>>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my
>>>>>>>>>>>>>>>>>> queue!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>>>>> and am
>>>>>>>>>>>>>>>>>> not too bothered about the integration with that; I think it's
>>>>>>>>>>>>>>>>>> pretty
>>>>>>>>>>>>>>>>>> straight
>>>>>>>>>>>>>>>>>> forward. I'm more interested in how you are handling the
>>>>>>>>>>>>>>>>>> splitting,
>>>>>>>>>>>>>>>>>> which I
>>>>>>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>>>>>>> Unfortunately I can't make it this year. Have a fun!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>>>>>>> Changelog



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-30  7:59                                       ` Ryan Roberts
@ 2025-05-30 17:18                                         ` Yang Shi
  2025-06-02 10:47                                           ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: Yang Shi @ 2025-05-30 17:18 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/30/25 12:59 AM, Ryan Roberts wrote:
> On 29/05/2025 21:52, Yang Shi wrote:
>>>>>>>> The split_mapping() guarantees keep block mapping if it is fully
>>>>>>>> contained in
>>>>>>>> the range between start and end, this is my series's responsibility. I know
>>>>>>>> the
>>>>>>>> current code calls apply_to_page_range() to apply permission change and it
>>>>>>>> just
>>>>>>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new
>>>>>>>> API,
>>>>>>>> then __change_memory_common() will call it to change permission. There
>>>>>>>> should be
>>>>>>>> some overlap between mine and Dev's, but I don't see strong dependency.
>>>>>>> But if you have a block mapping in the region you are calling
>>>>>>> __change_memory_common() on, today that will fail because it can only handle
>>>>>>> page mappings.
>>>>>> IMHO letting __change_memory_common() manipulate on contiguous address
>>>>>> range is
>>>>>> another story and should be not a part of the split primitive.
>>>>> I 100% agree that it should not be part of the split primitive.
>>>>>
>>>>> But your series *depends* upon __change_memory_common() being able to change
>>>>> permissions on block mappings. Today it can only change permissions on page
>>>>> mappings.
>>>> I don't think split primitive depends on it. Changing permission on block
>>>> mappings is just the user of the new split primitive IMHO. We just have no real
>>>> user right now.
>>> But your series introduces a real user; after your series, the linear map is
>>> block mapped.
>> The users of the split primitive are the permission changers, for example,
>> module, bpf, secret mem, etc.
> Ahh, perhaps this is the crux of our misunderstanding... In my model, the split
> primitive is called from __change_memory_common() (or from other appropriate
> functions in pageattr.c). It's an implementation detail for arm64 and is not
> exposed to common code. arm64 knows that it can split live mappings in a
> transparent way so it uses huge pages eagerly and splits on demand.
>
> I personally wouldn't want to be relying on the memory user knowing it needs to
> split the mappings...

We are actually on the same page...

For example, when loading a module, the kernel currently does:

vmalloc() // Allocate memory for module
module_enable_text_rox() // change permission to ROX for text section
     set_memory_x
         change_memory_common
             for every page in the vmalloc area
                 __change_memory_common(addr, PAGE_SIZE, ...) // page basis
                     split_mapping(addr, addr + PAGE_SIZE)
                     apply_to_page_range() // apply the new permission

__change_memory_common() has to be called on a per-page basis because we 
don't know whether the physical pages backing the vmalloc area are 
contiguous. So the split primitive is called per page too.
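
Just to make that concrete, here is a minimal user-space model of the 
"split on demand, then change permissions per page" flow. The 
split_mapping()/__change_memory_common() names simply follow the 
pseudocode above, and the toy 2MiB blocks over a small array stand in 
for the real pgtable walk; this is only a sketch, not the arm64 
implementation:

/* toy_split.c - user-space model of split-on-demand permission change. */
#include <stdio.h>
#include <stdbool.h>

#define PAGE_SIZE   4096UL
#define BLOCK_SIZE  (512 * PAGE_SIZE)       /* one PMD block: 2 MiB          */
#define NR_BLOCKS   4                       /* model 8 MiB of linear map     */
#define NR_PAGES    (NR_BLOCKS * 512)

static bool block_mapped[NR_BLOCKS];        /* true: whole block is one entry */
static int  block_perm[NR_BLOCKS];
static int  page_perm[NR_PAGES];            /* per-page perms once split      */

/* Split any block that overlaps [start, end) but is not fully contained in it. */
static void split_mapping(unsigned long start, unsigned long end)
{
        for (unsigned long b = 0; b < NR_BLOCKS; b++) {
                unsigned long bs = b * BLOCK_SIZE, be = bs + BLOCK_SIZE;

                if (!block_mapped[b] || be <= start || bs >= end)
                        continue;
                if (bs >= start && be <= end)
                        continue;           /* fully covered: keep the block */

                block_mapped[b] = false;    /* replace block by 512 page entries */
                for (unsigned long p = bs / PAGE_SIZE; p < be / PAGE_SIZE; p++)
                        page_perm[p] = block_perm[b];
        }
}

/* Permission change on [addr, addr + size), as in the module-load flow above. */
static void __change_memory_common(unsigned long addr, unsigned long size, int perm)
{
        split_mapping(addr, addr + size);
        for (unsigned long off = addr; off < addr + size; ) {
                unsigned long b = off / BLOCK_SIZE;

                if (block_mapped[b] && off == b * BLOCK_SIZE &&
                    size - (off - addr) >= BLOCK_SIZE) {
                        block_perm[b] = perm;   /* whole block stays one entry */
                        off += BLOCK_SIZE;
                } else {
                        page_perm[off / PAGE_SIZE] = perm;  /* "apply_to_page_range" */
                        off += PAGE_SIZE;
                }
        }
}

int main(void)
{
        for (int b = 0; b < NR_BLOCKS; b++)
                block_mapped[b] = true;     /* linear map starts fully block mapped */

        /* Change one 4 KiB page in the middle of block 1: only that block splits. */
        __change_memory_common(BLOCK_SIZE + 8 * PAGE_SIZE, PAGE_SIZE, 1);

        for (int b = 0; b < NR_BLOCKS; b++)
                printf("block %d: %s\n", b,
                       block_mapped[b] ? "block mapped" : "split to pages");
        return 0;
}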


So we need to do the following in order to keep large mappings:
check whether the vmalloc area is huge mapped (PMD/CONT PMD/CONT PTE) or not
if (it is huge mapped)
     __change_memory_common(addr, HUGE_SIZE, ...)
         split_mapping(addr, addr + HUGE_SIZE)
         change permission on (addr, addr + HUGE_SIZE)
else
     fallback to page basis


To have huge mappings for vmalloc, we need to use vmalloc_huge() or the 
new implementation you proposed to allocate the module memory in the 
first place. This is the "user" in my understanding.
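
And to show what that fast path buys, a tiny sketch of the caller-side 
choice (the vmalloc_step() helper and the granularity enum are made-up 
names for illustration, not kernel APIs):

/* step_size.c - if the vmalloc area itself is huge mapped, the linear-map
 * permission change can proceed in HUGE_SIZE steps, otherwise it must
 * fall back to PAGE_SIZE because the backing pages may be scattered.
 */
#include <stdio.h>

#define PAGE_SIZE      4096UL
#define CONT_PTE_SIZE  (16 * PAGE_SIZE)     /* 64 KiB with 4K pages */
#define PMD_SIZE       (512 * PAGE_SIZE)    /* 2 MiB  with 4K pages */

enum vmap_gran { VMAP_PTE, VMAP_CONT_PTE, VMAP_PMD };

static unsigned long vmalloc_step(enum vmap_gran g)
{
        switch (g) {
        case VMAP_PMD:      return PMD_SIZE;        /* physically contiguous 2 MiB  */
        case VMAP_CONT_PTE: return CONT_PTE_SIZE;   /* physically contiguous 64 KiB */
        default:            return PAGE_SIZE;       /* pages may be scattered       */
        }
}

int main(void)
{
        unsigned long area = 4 * PMD_SIZE;          /* an 8 MiB module area, say */
        unsigned long step = vmalloc_step(VMAP_PMD);
        unsigned long calls = 0;

        /* each iteration stands for one __change_memory_common(addr, step, ...) */
        for (unsigned long off = 0; off < area; off += step)
                calls++;

        printf("%lu calls of %lu KiB each instead of %lu page-sized calls\n",
               calls, step / 1024, area / PAGE_SIZE);
        return 0;
}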

Thanks,
Yang




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-30  7:17                                       ` Ryan Roberts
@ 2025-05-30 21:21                                         ` Yang Shi
  0 siblings, 0 replies; 49+ messages in thread
From: Yang Shi @ 2025-05-30 21:21 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 5/30/25 12:17 AM, Ryan Roberts wrote:
> On 29/05/2025 20:52, Yang Shi wrote:
>>>>> I just had another conversation about this internally, and there is another
>>>>> concern; we obviously don't want to modify the pgtables while other CPUs that
>>>>> don't support BBML2 could be accessing them. Even in stop_machine() this may be
>>>>> possible if the CPU stacks and task structure (for example) are allocated
>>>>> out of
>>>>> the linear map.
>>>>>
>>>>> So we need to be careful to follow the pattern used by kpti; all secondary CPUs
>>>>> need to switch to the idmap (which is installed in TTBR0) then install the
>>>>> reserved map in TTBR1, then wait for CPU 0 to repaint the linear map, then have
>>>>> the secondary CPUs switch TTBR1 back to swapper then switch back out of idmap.
>>>> So the below code should be ok?
>>>>
>>>> cpu_install_idmap()
>>>> Busy loop to wait for cpu 0 done
>>>> cpu_uninstall_idmap()
>>> Once you have installed the idmap, you'll need to call a function by its PA so
>>> you are actually executing out of the idmap. And you will need to be in assembly
>>> so you don't need the stack, and you'll need to switch TTBR1 to the reserved
>>> pgtable, so that the CPU has no access to the swapper pgtable (which CPU 0 is
>>> able to modify).
>>>
>>> You may well be able to reuse __idmap_kpti_secondary in proc.S, or lightly
>>> refactor it to work for both the existing idmap_kpti_install_ng_mappings case,
>>> and your case.
>> I'm wondering whether we really need idmap for repainting. I think repainting is
>> different from kpti. We just split linear map which is *not* used by kernel
>> itself, the mappings for kernel itself is intact, we don't touch it at all. So
>> as long as CPU 0 will not repaint the linear map until all other CPUs busy
>> looping in stop_machine fn, then we are fine.
> But *how* are the other CPUs busy looping? Are they polling a variable? Where
> does that variable live? The docs say that a high priority thread is run for
> each CPU. So there at least needs to be a task struct and a stack. There are
> some Kconfigs where the stack comes from the linear map, so if the variable that
> is polls is on its stack (or even on CPU 0's stack then that's a problem. If the
> scheduler runs and accesses the task struct which may be allocated from the
> linear map (e.g. via kmalloc), that's a problem.
>
> The point is that you have to understand all the details of stop_machine() to be
> confident that it is never accessing the linear map. And even if you can prove
> that today, there is nothing stopping from the implementation changing in future.
>
> But then you have non-architectural memory accesses too (i.e. speculative
> accesses). It's possible that the CPU does a speculative load, which causes the
> TLB to do a translation and cache a TLB entry to the linear map. Then CPU 0
> changes the pgtable and you have broken the BBM requirements from the secondary
> CPU's perspective.
>
> So personally I think the only truely safe way to solve this is to switch the
> secondary CPUs to the idmap, then install the reserved map in TTBR1. That way,
> the secondary CPUs can't see the swapper pgtable at all and CPU 0 is free to do
> what it likes.

OK, I agree it is safer to run the busy loop (waiting for the boot CPU to 
finish repainting) in the idmap address space.

IIUC I just need to map the flag polled by the secondary CPUs into the 
idmap so that both CPU 0 and the secondary CPUs can access it, and put 
the wait function in the .idmap.text section. I may not reuse the kpti 
code because this case is much simpler than kpti.
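
Roughly what I have in mind, as a user-space model of just the flag 
protocol (the real code would additionally switch the secondaries to the 
idmap and keep the spin loop in .idmap.text; all names here are 
illustrative, not the actual patch):

/* repaint_sync.c - secondaries report ready and spin on a "done" flag
 * while "CPU 0" repaints the linear map.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_SECONDARIES 3

static atomic_int nr_ready;          /* secondaries that reached the loop */
static atomic_int repaint_done;      /* set by "CPU 0" when repaint ends  */

static void *secondary_cpu(void *arg)
{
        long cpu = (long)arg;

        atomic_fetch_add(&nr_ready, 1);             /* tell CPU 0 we are parked */
        while (!atomic_load(&repaint_done))         /* busy-wait, touching no
                                                       linear-map data          */
                ;
        printf("cpu %ld: resuming after repaint\n", cpu);
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_SECONDARIES];

        for (long i = 0; i < NR_SECONDARIES; i++)
                pthread_create(&tid[i], NULL, secondary_cpu, (void *)(i + 1));

        while (atomic_load(&nr_ready) < NR_SECONDARIES)  /* wait until all parked */
                ;
        printf("cpu 0: repainting linear map\n");
        atomic_store(&repaint_done, 1);             /* release the secondaries */

        for (int i = 0; i < NR_SECONDARIES; i++)
                pthread_join(tid[i], NULL);
        return 0;
}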

Thanks,
Yang

>
>> We can have two flags to control it. The first one should be a cpu mask, all
>> secondary CPUs set its own mask bit to tell CPU 0 it is in stop machine fn
>> (ready for repainting). The other flag is used by CPU 0 to tell all secondary
>> CPUs repainting is done, please resume. We need have the two flags in kernel
>> data section instead of stack.
>>
>> The code of fn is in kernel text section, the flags are in kernel data section.
>> I don't see how come fn (just doing simple busy loop) on secondary CPUs need to
>> access linear map while repainting the linear map. After repainting the TLB will
>> be flushed before letting secondary CPUs resume, so any access to linear map
>> address after that point should be safe too.
>>
>> Does it sound reasonable to you? Did I miss something?
> I think the potential for speculative access is the problem. Personally, I would
> follow the pattern laid out by kpti. Then you can more easily defend it by
> pointing to an established pattern.
>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>
>>> Thanks,
>>> Ryan
>>>
>>>>> Given CPU 0 supports BBML2, I think it can just update the linear map live,
>>>>> without needing to do the idmap dance?
>>>> Yes, I think so too.
>>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-05-30 17:18                                         ` Yang Shi
@ 2025-06-02 10:47                                           ` Ryan Roberts
  2025-06-02 20:55                                             ` Yang Shi
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2025-06-02 10:47 UTC (permalink / raw)
  To: Yang Shi, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain

On 30/05/2025 18:18, Yang Shi wrote:
> 
> 
> On 5/30/25 12:59 AM, Ryan Roberts wrote:
>> On 29/05/2025 21:52, Yang Shi wrote:
>>>>>>>>> The split_mapping() guarantees keep block mapping if it is fully
>>>>>>>>> contained in
>>>>>>>>> the range between start and end, this is my series's responsibility. I
>>>>>>>>> know
>>>>>>>>> the
>>>>>>>>> current code calls apply_to_page_range() to apply permission change and it
>>>>>>>>> just
>>>>>>>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new
>>>>>>>>> API,
>>>>>>>>> then __change_memory_common() will call it to change permission. There
>>>>>>>>> should be
>>>>>>>>> some overlap between mine and Dev's, but I don't see strong dependency.
>>>>>>>> But if you have a block mapping in the region you are calling
>>>>>>>> __change_memory_common() on, today that will fail because it can only
>>>>>>>> handle
>>>>>>>> page mappings.
>>>>>>> IMHO letting __change_memory_common() manipulate on contiguous address
>>>>>>> range is
>>>>>>> another story and should be not a part of the split primitive.
>>>>>> I 100% agree that it should not be part of the split primitive.
>>>>>>
>>>>>> But your series *depends* upon __change_memory_common() being able to change
>>>>>> permissions on block mappings. Today it can only change permissions on page
>>>>>> mappings.
>>>>> I don't think split primitive depends on it. Changing permission on block
>>>>> mappings is just the user of the new split primitive IMHO. We just have no
>>>>> real
>>>>> user right now.
>>>> But your series introduces a real user; after your series, the linear map is
>>>> block mapped.
>>> The users of the split primitive are the permission changers, for example,
>>> module, bpf, secret mem, etc.
>> Ahh, perhaps this is the crux of our misunderstanding... In my model, the split
>> primitive is called from __change_memory_common() (or from other appropriate
>> functions in pageattr.c). It's an implementation detail for arm64 and is not
>> exposed to common code. arm64 knows that it can split live mappings in a
>> transparent way so it uses huge pages eagerly and splits on demand.
>>
>> I personally wouldn't want to be relying on the memory user knowing it needs to
>> split the mappings...
> 
> We are actually on the same page...
> 
> For example, when loading module, kernel currently does:
> 
> vmalloc() // Allocate memory for module
> module_enable_text_rox() // change permission to ROX for text section
>     set_memory_x
>         change_memory_common
>             for every page in the vmalloc area
>                 __change_memory_common(addr, PAGE_SIZE, ...) // page basis
>                     split_mapping(addr, addr + PAGE_SIZE)
>                     apply_to_page_range() // apply the new permission
> 
> __change_memory_common() has to be called on page basis because we don't know
> whether the pages for the vmalloc area are contiguous or not. So the split
> primitive is called on page basis.

Yes that makes sense for the case where we are setting permissions on a
virtually contiguous region of vmalloc space; in that case we must set
permissions on the linear map page-by-page. Agreed.

I was thinking of the cases where we are changing the permissions on a virtually
contiguous region of the *linear map*. Although looking again at the code, it
seems there aren't as many places as I thought that actually do this. I think
set_direct_map_valid_noflush() is the only one that will operate on multiple
pages of the linear map at a time. But this single case means that you could end
up wanting to change permissions on a large block mapping and therefore need
Dev's work, right?

Thanks,
Ryan

> 
> 
> So we need do the below in order to keep large mapping:
> check whether the vmalloc area is huge mapped (PMD/CONT PMD/CONT PTE) or not
> if (it is huge mapped)
>     __change_memory_common(addr, HUGE_SIZE, ...)
>         split_mapping(addr, addr + HUGE_SIZE)
>         change permission on (addr, addr + HUGE_SIZE)
> else
>     fallback to page basis
> 
> 
> To have huge mapping for vmalloc, we need use vmalloc_huge() or the new
> implementation proposed by you to allocate memory for module in the first place.
> This is the "user" in my understanding.
> 
> Thanks,
> Yang
> 
> 



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
  2025-06-02 10:47                                           ` Ryan Roberts
@ 2025-06-02 20:55                                             ` Yang Shi
  0 siblings, 0 replies; 49+ messages in thread
From: Yang Shi @ 2025-06-02 20:55 UTC (permalink / raw)
  To: Ryan Roberts, will, catalin.marinas, Miko.Lenczewski, scott, cl
  Cc: linux-arm-kernel, linux-kernel, Dev Jain



On 6/2/25 3:47 AM, Ryan Roberts wrote:
> On 30/05/2025 18:18, Yang Shi wrote:
>>
>> On 5/30/25 12:59 AM, Ryan Roberts wrote:
>>> On 29/05/2025 21:52, Yang Shi wrote:
>>>>>>>>>> The split_mapping() guarantees keep block mapping if it is fully
>>>>>>>>>> contained in
>>>>>>>>>> the range between start and end, this is my series's responsibility. I
>>>>>>>>>> know
>>>>>>>>>> the
>>>>>>>>>> current code calls apply_to_page_range() to apply permission change and it
>>>>>>>>>> just
>>>>>>>>>> does it on PTE basis. So IIUC Dev's series will modify it or provide a new
>>>>>>>>>> API,
>>>>>>>>>> then __change_memory_common() will call it to change permission. There
>>>>>>>>>> should be
>>>>>>>>>> some overlap between mine and Dev's, but I don't see strong dependency.
>>>>>>>>> But if you have a block mapping in the region you are calling
>>>>>>>>> __change_memory_common() on, today that will fail because it can only
>>>>>>>>> handle
>>>>>>>>> page mappings.
>>>>>>>> IMHO letting __change_memory_common() manipulate on contiguous address
>>>>>>>> range is
>>>>>>>> another story and should be not a part of the split primitive.
>>>>>>> I 100% agree that it should not be part of the split primitive.
>>>>>>>
>>>>>>> But your series *depends* upon __change_memory_common() being able to change
>>>>>>> permissions on block mappings. Today it can only change permissions on page
>>>>>>> mappings.
>>>>>> I don't think split primitive depends on it. Changing permission on block
>>>>>> mappings is just the user of the new split primitive IMHO. We just have no
>>>>>> real
>>>>>> user right now.
>>>>> But your series introduces a real user; after your series, the linear map is
>>>>> block mapped.
>>>> The users of the split primitive are the permission changers, for example,
>>>> module, bpf, secret mem, etc.
>>> Ahh, perhaps this is the crux of our misunderstanding... In my model, the split
>>> primitive is called from __change_memory_common() (or from other appropriate
>>> functions in pageattr.c). It's an implementation detail for arm64 and is not
>>> exposed to common code. arm64 knows that it can split live mappings in a
>>> transparent way so it uses huge pages eagerly and splits on demand.
>>>
>>> I personally wouldn't want to be relying on the memory user knowing it needs to
>>> split the mappings...
>> We are actually on the same page...
>>
>> For example, when loading module, kernel currently does:
>>
>> vmalloc() // Allocate memory for module
>> module_enable_text_rox() // change permission to ROX for text section
>>      set_memory_x
>>          change_memory_common
>>              for every page in the vmalloc area
>>                  __change_memory_common(addr, PAGE_SIZE, ...) // page basis
>>                      split_mapping(addr, addr + PAGE_SIZE)
>>                      apply_to_page_range() // apply the new permission
>>
>> __change_memory_common() has to be called on page basis because we don't know
>> whether the pages for the vmalloc area are contiguous or not. So the split
>> primitive is called on page basis.
> Yes that makes sense for the case where we are setting permissions on a
> virtually contiguous region of vmalloc space; in that case we must set
> permissions on the linear map page-by-page. Agreed.
>
> I was thinking of the cases where we are changing the permissions on a virtually
> contiguous region of the *linear map*. Although looking again at the code, it
> seems there aren't as many places as I thought that actually do this. I think
> set_direct_map_valid_noflush() is the only one that will operate on multiple
> pages of the linear map at a time. But this single case means that you could end
> up wanting to change permissions on a large block mapping and therefore need
> Dev's work, right?

Yes, set_direct_map_valid_noflush() may be called on multiple pages. But 
it was introduced by Mike Rapoport's "x86/module: use large ROX pages 
for text allocations" series, so only x86 uses it right now. The large 
execmem ROX cache is not supported on arm64 yet IIRC.

Thanks,
Yang


>
> Thanks,
> Ryan
>
>>
>> So we need do the below in order to keep large mapping:
>> check whether the vmalloc area is huge mapped (PMD/CONT PMD/CONT PTE) or not
>> if (it is huge mapped)
>>      __change_memory_common(addr, HUGE_SIZE, ...)
>>          split_mapping(addr, addr + HUGE_SIZE)
>>          change permission on (addr, addr + HUGE_SIZE)
>> else
>>      fallback to page basis
>>
>>
>> To have huge mapping for vmalloc, we need use vmalloc_huge() or the new
>> implementation proposed by you to allocate memory for module in the first place.
>> This is the "user" in my understanding.
>>
>> Thanks,
>> Yang
>>
>>



^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2025-06-02 20:57 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-04 22:19 [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
2025-03-04 22:19 ` [v3 PATCH 1/6] arm64: Add BBM Level 2 cpu feature Yang Shi
2025-03-04 22:19 ` [v3 PATCH 2/6] arm64: cpufeature: add AmpereOne to BBML2 allow list Yang Shi
2025-03-14 10:58   ` Ryan Roberts
2025-03-17 17:50     ` Yang Shi
2025-03-04 22:19 ` [v3 PATCH 3/6] arm64: mm: make __create_pgd_mapping() and helpers non-void Yang Shi
2025-03-14 11:51   ` Ryan Roberts
2025-03-17 17:53     ` Yang Shi
2025-05-07  8:18       ` Ryan Roberts
2025-05-07 22:19         ` Yang Shi
2025-03-04 22:19 ` [v3 PATCH 4/6] arm64: mm: support large block mapping when rodata=full Yang Shi
2025-03-08  1:53   ` kernel test robot
2025-03-14 13:29   ` Ryan Roberts
2025-03-17 17:57     ` Yang Shi
2025-03-04 22:19 ` [v3 PATCH 5/6] arm64: mm: support split CONT mappings Yang Shi
2025-03-14 13:33   ` Ryan Roberts
2025-03-04 22:19 ` [v3 PATCH 6/6] arm64: mm: split linear mapping if BBML2 is not supported on secondary CPUs Yang Shi
2025-03-13 17:28 ` [v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
2025-03-13 17:36   ` Ryan Roberts
2025-03-13 17:40     ` Yang Shi
2025-04-10 22:00       ` Yang Shi
2025-04-14 13:03         ` Ryan Roberts
2025-04-14 21:24           ` Yang Shi
2025-05-02 11:51             ` Ryan Roberts
2025-05-05 21:39               ` Yang Shi
2025-05-07  7:58                 ` Ryan Roberts
2025-05-07 21:16                   ` Yang Shi
2025-05-28  0:00                     ` Yang Shi
2025-05-28  3:47                       ` Dev Jain
2025-05-28 13:13                       ` Ryan Roberts
2025-05-28 15:18                         ` Yang Shi
2025-05-28 17:12                           ` Yang Shi
2025-05-29  8:48                             ` Ryan Roberts
2025-05-29 15:33                               ` Ryan Roberts
2025-05-29 17:35                                 ` Yang Shi
2025-05-29 18:30                                   ` Ryan Roberts
2025-05-29 19:52                                     ` Yang Shi
2025-05-30  7:17                                       ` Ryan Roberts
2025-05-30 21:21                                         ` Yang Shi
2025-05-29  7:36                           ` Ryan Roberts
2025-05-29 16:37                             ` Yang Shi
2025-05-29 17:01                               ` Ryan Roberts
2025-05-29 17:50                                 ` Yang Shi
2025-05-29 18:34                                   ` Ryan Roberts
2025-05-29 20:52                                     ` Yang Shi
2025-05-30  7:59                                       ` Ryan Roberts
2025-05-30 17:18                                         ` Yang Shi
2025-06-02 10:47                                           ` Ryan Roberts
2025-06-02 20:55                                             ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).