* [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
@ 2024-11-18 18:16 Yang Shi
2024-11-18 18:16 ` [PATCH 1/3] arm64: cpufeature: detect FEAT_BBM level 2 Yang Shi
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Yang Shi @ 2024-11-18 18:16 UTC (permalink / raw)
To: catalin.marinas, will; +Cc: cl, scott, yang, linux-arm-kernel, linux-kernel
When rodata=full is specified, the kernel linear mapping has to be mapped
at PTE level because of arm64's break-before-make rule.
This results in a couple of problems:
- performance degradation
- more TLB pressure
- memory waste for kernel page tables
There are workarounds to mitigate these problems, for example using
rodata=on, but this compromises the security protection.
With FEAT_BBM level 2 support, splitting a large block mapping into
smaller ones no longer requires making the page table entry invalid
first. This allows the kernel to split large block mappings on the fly.
Add kernel page table split support and use large block mappings by
default for rodata=full when FEAT_BBM level 2 is supported. When
permissions are changed on the kernel linear mapping, the affected
entries are split down to PTE level.
Machines without FEAT_BBM level 2 fall back to PTE-mapping the kernel
linear mapping when rodata=full.
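In outline, the permission-change path added by patch 2 first splits the
affected range, then applies the attribute change. A simplified sketch
(the real code is in the pageattr.c hunks of patch 2):

    l_start = (u64)page_address(page);
    /* Split the containing block mapping down to PTEs; this is a no-op
     * on machines without FEAT_BBM level 2. */
    ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
    if (!WARN_ON_ONCE(ret))
            __change_memory_common(l_start, PAGE_SIZE, set_mask, clear_mask);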
With this we saw a significant performance boost in some benchmarks
while keeping the rodata=full security protection.
The tests were done on an AmpereOne machine (192 cores, 1P) with 256GB
memory, 4K page size and 48-bit VA.
Functional tests (4K/16K/64K page size)
- Kernel boot. The kernel needs to change linear mapping permissions at
  boot; if the patch didn't work, the kernel typically didn't boot.
- Module stress from stress-ng. Loading a kernel module changes
  permissions for module sections.
- A test kernel module which allocates 80% of total memory via vmalloc(),
  changes the vmalloc area permissions to RO, then changes them back
  before vfree() (a minimal sketch follows this list). Then launch a VM
  which consumes almost all physical memory.
- A VM with the patchset applied in the guest kernel too.
- Kernel build in a VM with the patched guest kernel.
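A minimal sketch of the vmalloc permission-flip test module mentioned
above (illustrative only: the allocation size, names and module
boilerplate are made up here, this is not the actual test module):

    #include <linux/module.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>
    #include <linux/set_memory.h>

    static int __init perm_flip_init(void)
    {
            size_t size = 1UL << 30;     /* the real test used ~80% of RAM */
            void *buf = vmalloc(size);

            if (!buf)
                    return -ENOMEM;

            /* Flip the vmalloc area (and, with rodata=full, its linear map
             * alias) to read-only and back again before freeing it. */
            set_memory_ro((unsigned long)buf, size >> PAGE_SHIFT);
            set_memory_rw((unsigned long)buf, size >> PAGE_SHIFT);

            vfree(buf);
            return 0;
    }
    module_init(perm_flip_init);

    static void __exit perm_flip_exit(void) { }
    module_exit(perm_flip_exit);

    MODULE_LICENSE("GPL");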
Memory consumption
Before:
MemTotal: 258988984 kB
MemFree: 254821700 kB
After:
MemTotal: 259505132 kB
MemFree: 255410264 kB
Around 500MB more memory is free to use. (Mapping 256GB of linear map at
4K PTE granularity needs roughly 64M PTEs x 8 bytes, i.e. ~512MB of
last-level page tables, which is about what block mapping saves.) The
larger the machine, the more memory saved.
Performance benchmarking
* Memcached
We saw performance degradation when running the Memcached benchmark with
rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
With this patchset, ops/sec increased by around 3.5% and P99 latency
dropped by around 9.6%.
The gain mainly came from reduced kernel TLB misses: kernel TLB MPKI
dropped by 28.5%.
The results are now on par with rodata=on too.
* Disk encryption (dm-crypt) benchmark
Ran the fio benchmark with the command below on a 128G ramdisk (ext4)
with disk encryption (dm-crypt).
fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
--status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
--iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
--name=iops-test-job --eta-newline=1 --size 100G
IOPS increased by 90% - 150% (the variance is high, but the worst patched
run is around 90% better than the best unpatched run).
Bandwidth increased and the average completion latency dropped
proportionally.
* Sequential file read
Read a 100G file sequentially on XFS (xfs_io read with the page cache
populated). Bandwidth increased by 150%.
Yang Shi (3):
arm64: cpufeature: detect FEAT_BBM level 2
arm64: mm: support large block mapping when rodata=full
arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2
arch/arm64/include/asm/cpufeature.h | 24 ++++++++++++++++++
arch/arm64/include/asm/pgtable.h | 7 +++++-
arch/arm64/kernel/cpufeature.c | 11 ++++++++
arch/arm64/mm/mmu.c | 31 +++++++++++++++++++++--
arch/arm64/mm/pageattr.c | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
arch/arm64/tools/cpucaps | 1 +
6 files changed, 238 insertions(+), 9 deletions(-)
* [PATCH 1/3] arm64: cpufeature: detect FEAT_BBM level 2
2024-11-18 18:16 [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
@ 2024-11-18 18:16 ` Yang Shi
2024-11-18 18:16 ` [PATCH 2/3] arm64: mm: support large block mapping when rodata=full Yang Shi
` (3 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Yang Shi @ 2024-11-18 18:16 UTC (permalink / raw)
To: catalin.marinas, will; +Cc: cl, scott, yang, linux-arm-kernel, linux-kernel
FEAT_BBM level 2 makes it possible to split a large page table entry
without first invalidating it. The following patch will use it to improve
performance. Detect it in cpufeature and treat it as a BOOT_CPU feature
for now; if a late CPU conflicts with the boot CPU, the kernel will not
bring it up.
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
arch/arm64/include/asm/cpufeature.h | 15 +++++++++++++++
arch/arm64/kernel/cpufeature.c | 11 +++++++++++
arch/arm64/tools/cpucaps | 1 +
3 files changed, 27 insertions(+)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 3d261cc123c1..c7ca5f9f88bb 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -838,6 +838,21 @@ static inline bool system_supports_poe(void)
alternative_has_cap_unlikely(ARM64_HAS_S1POE);
}
+static inline bool system_supports_bbmlv2(void)
+{
+ return cpus_have_final_boot_cap(ARM64_HAS_BBMLV2);
+}
+
+static inline bool bbmlv2_available(void)
+{
+ u64 mmfr2;
+ u32 bbm;
+
+ mmfr2 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR2_EL1);
+ bbm = cpuid_feature_extract_unsigned_field(mmfr2, ID_AA64MMFR2_EL1_BBM_SHIFT);
+ return bbm == ID_AA64MMFR2_EL1_BBM_2;
+}
+
int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
bool try_emulate_mrs(struct pt_regs *regs, u32 isn);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 718728a85430..cb916747cd31 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1866,6 +1866,11 @@ static bool has_lpa2(const struct arm64_cpu_capabilities *entry, int scope)
}
#endif
+static bool has_bbmlv2(const struct arm64_cpu_capabilities *entry, int scope)
+{
+ return bbmlv2_available();
+}
+
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
#define KPTI_NG_TEMP_VA (-(1UL << PMD_SHIFT))
@@ -2890,6 +2895,12 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
ARM64_CPUID_FIELDS(ID_AA64MMFR3_EL1, S1POE, IMP)
},
#endif
+ {
+ .desc = "BBM Level 2",
+ .capability = ARM64_HAS_BBMLV2,
+ .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
+ .matches = has_bbmlv2,
+ },
{},
};
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index eedb5acc21ed..175b7eb42b0b 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -14,6 +14,7 @@ HAS_ADDRESS_AUTH_ARCH_QARMA5
HAS_ADDRESS_AUTH_IMP_DEF
HAS_AMU_EXTN
HAS_ARMv8_4_TTL
+HAS_BBMLV2
HAS_CACHE_DIC
HAS_CACHE_IDC
HAS_CNP
--
2.41.0
* [PATCH 2/3] arm64: mm: support large block mapping when rodata=full
2024-11-18 18:16 [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
2024-11-18 18:16 ` [PATCH 1/3] arm64: cpufeature: detect FEAT_BBM level 2 Yang Shi
@ 2024-11-18 18:16 ` Yang Shi
2024-11-18 18:16 ` [PATCH 3/3] arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2 Yang Shi
` (2 subsequent siblings)
4 siblings, 0 replies; 12+ messages in thread
From: Yang Shi @ 2024-11-18 18:16 UTC (permalink / raw)
To: catalin.marinas, will; +Cc: cl, scott, yang, linux-arm-kernel, linux-kernel
When rodata=full is specified, the kernel linear mapping has to be mapped
at PTE level since large block mappings can't be split later due to the
break-before-make rule on arm64.
This results in a couple of problems:
- performance degradation
- more TLB pressure
- memory waste for kernel page tables
With FEAT_BBM level 2 support, splitting a large block mapping into
smaller ones no longer requires making the page table entry invalid
first. This allows the kernel to split large block mappings on the fly.
Add kernel page table split support and use large block mappings by
default for rodata=full when FEAT_BBM level 2 is supported. When
permissions are changed on the kernel linear mapping, the affected
entries are split down to PTE level.
Machines without FEAT_BBM level 2 fall back to PTE-mapping the kernel
linear mapping when rodata=full.
With this we saw a significant performance boost in some benchmarks and
much lower memory consumption on my AmpereOne machine (192 cores, 1P)
with 256GB memory.
* Memory use after boot
Before:
MemTotal: 258988984 kB
MemFree: 254821700 kB
After:
MemTotal: 259505132 kB
MemFree: 255410264 kB
Around 500MB more memory is free to use. The larger the machine, the
more memory saved.
* Memcached
We saw performance degradation when running the Memcached benchmark with
rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
With this patchset, ops/sec increased by around 3.5% and P99 latency
dropped by around 9.6%.
The gain mainly came from reduced kernel TLB misses: kernel TLB MPKI
dropped by 28.5%.
The results are now on par with rodata=on too.
* Disk encryption (dm-crypt) benchmark
Ran the fio benchmark with the command below on a 128G ramdisk (ext4)
with disk encryption (dm-crypt).
fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
--status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
--iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
--name=iops-test-job --eta-newline=1 --size 100G
IOPS increased by 90% - 150% (the variance is high, but the worst patched
run is around 90% better than the best unpatched run).
Bandwidth increased and the average completion latency dropped
proportionally.
* Sequential file read
Read a 100G file sequentially on XFS (xfs_io read with the page cache
populated). Bandwidth increased by 150%.
Keep using PTE mappings when debug_pagealloc is enabled; supporting block
mappings there is not worth the complexity.
KFENCE can be converted to use block mappings later.
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
arch/arm64/include/asm/pgtable.h | 7 +-
arch/arm64/mm/mmu.c | 31 +++++-
arch/arm64/mm/pageattr.c | 173 +++++++++++++++++++++++++++++--
3 files changed, 202 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c329ea061dc9..473c133ce10c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -750,7 +750,7 @@ static inline bool in_swapper_pgdir(void *addr)
((unsigned long)swapper_pg_dir & PAGE_MASK);
}
-static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
+static inline void __set_pmd_nosync(pmd_t *pmdp, pmd_t pmd)
{
#ifdef __PAGETABLE_PMD_FOLDED
if (in_swapper_pgdir(pmdp)) {
@@ -760,6 +760,11 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
#endif /* __PAGETABLE_PMD_FOLDED */
WRITE_ONCE(*pmdp, pmd);
+}
+
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+ __set_pmd_nosync(pmdp, pmd);
if (pmd_valid(pmd)) {
dsb(ishst);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e55b02fbddc8..09ccb4f8964a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -620,6 +620,18 @@ static inline void arm64_kfence_map_pool(phys_addr_t kfence_pool, pgd_t *pgdp) {
#endif /* CONFIG_KFENCE */
+static inline bool force_pte_mapping(void)
+{
+ /*
+ * Can't use cpufeature API to determine whether BBM level 2
+ * is supported or not since cpufeatures have not been
+ * finalized yet.
+ */
+ return (rodata_full && !bbmlv2_available()) ||
+ debug_pagealloc_enabled() ||
+ arm64_kfence_can_set_direct_map();
+}
+
static void __init map_mem(pgd_t *pgdp)
{
static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
@@ -645,9 +657,21 @@ static void __init map_mem(pgd_t *pgdp)
early_kfence_pool = arm64_kfence_alloc_pool();
- if (can_set_direct_map())
+ if (force_pte_mapping())
flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+ /*
+ * With FEAT_BBM level 2 we can split large block mapping without
+ * making it invalid. So kernel linear mapping can be mapped with
+ * large block instead of PTE level.
+ *
+ * Need to break cont for CONT_MAPPINGS when changing permission,
+ * and need to inspect the adjacent page table entries to make
+ * them cont again later. That doesn't sound worth the complexity.
+ */
+ if (rodata_full)
+ flags |= NO_CONT_MAPPINGS;
+
/*
* Take care not to create a writable alias for the
* read-only text and rodata sections of the kernel image.
@@ -1342,9 +1366,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
VM_BUG_ON(!mhp_range_allowed(start, size, true));
- if (can_set_direct_map())
+ if (force_pte_mapping())
flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+ if (rodata_full)
+ flags |= NO_CONT_MAPPINGS;
+
__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
size, params->pgprot, __pgd_pgtable_alloc,
flags);
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 0e270a1c51e6..dbb2b709d184 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -45,6 +45,145 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
return 0;
}
+static int __split_linear_mapping_pmd(pud_t *pudp,
+ unsigned long vaddr, unsigned long end)
+{
+ pmd_t *pmdp;
+ unsigned long next;
+
+ pmdp = pmd_offset(pudp, vaddr);
+
+ do {
+ next = pmd_addr_end(vaddr, end);
+
+ if (pmd_leaf(pmdp_get(pmdp))) {
+ struct page *pte_page;
+ unsigned long pfn = pmd_pfn(pmdp_get(pmdp));
+ pgprot_t prot = pmd_pgprot(pmdp_get(pmdp));
+ pte_t *ptep_new;
+ int i;
+
+ pte_page = alloc_page(GFP_KERNEL);
+ if (!pte_page)
+ return -ENOMEM;
+
+ prot = __pgprot(pgprot_val(prot) | PTE_TYPE_PAGE);
+ ptep_new = (pte_t *)page_address(pte_page);
+ for (i = 0; i < PTRS_PER_PTE; ++i, ++ptep_new)
+ __set_pte_nosync(ptep_new,
+ pfn_pte(pfn + i, prot));
+
+ dsb(ishst);
+ isb();
+
+ set_pmd(pmdp, pfn_pmd(page_to_pfn(pte_page),
+ __pgprot(PMD_TYPE_TABLE)));
+ }
+ } while (pmdp++, vaddr = next, vaddr != end);
+
+ return 0;
+}
+
+static int __split_linear_mapping_pud(p4d_t *p4dp,
+ unsigned long vaddr, unsigned long end)
+{
+ pud_t *pudp;
+ unsigned long next;
+ int ret;
+
+ pudp = pud_offset(p4dp, vaddr);
+
+ do {
+ next = pud_addr_end(vaddr, end);
+
+ if (pud_leaf(pudp_get(pudp))) {
+ struct page *pmd_page;
+ unsigned long pfn = pud_pfn(pudp_get(pudp));
+ pgprot_t prot = pud_pgprot(pudp_get(pudp));
+ pmd_t *pmdp_new;
+ int i;
+ unsigned int step;
+
+ pmd_page = alloc_page(GFP_KERNEL);
+ if (!pmd_page)
+ return -ENOMEM;
+
+ pmdp_new = (pmd_t *)page_address(pmd_page);
+ for (i = 0; i < PTRS_PER_PMD; ++i, ++pmdp_new) {
+ step = (i * PMD_SIZE) >> PAGE_SHIFT;
+ __set_pmd_nosync(pmdp_new,
+ pfn_pmd(pfn + step, prot));
+ }
+
+ dsb(ishst);
+ isb();
+
+ set_pud(pudp, pfn_pud(page_to_pfn(pmd_page),
+ __pgprot(PUD_TYPE_TABLE)));
+ }
+
+ ret = __split_linear_mapping_pmd(pudp, vaddr, next);
+ if (ret)
+ return ret;
+ } while (pudp++, vaddr = next, vaddr != end);
+
+ return 0;
+}
+
+static int __split_linear_mapping_p4d(pgd_t *pgdp,
+ unsigned long vaddr, unsigned long end)
+{
+ p4d_t *p4dp;
+ unsigned long next;
+ int ret;
+
+ p4dp = p4d_offset(pgdp, vaddr);
+
+ do {
+ next = p4d_addr_end(vaddr, end);
+
+ ret = __split_linear_mapping_pud(p4dp, vaddr, next);
+ if (ret)
+ return ret;
+ } while (p4dp++, vaddr = next, vaddr != end);
+
+ return 0;
+}
+
+static int __split_linear_mapping_pgd(pgd_t *pgdp,
+ unsigned long vaddr,
+ unsigned long end)
+{
+ unsigned long next;
+ int ret = 0;
+
+ mmap_write_lock(&init_mm);
+
+ do {
+ next = pgd_addr_end(vaddr, end);
+ ret = __split_linear_mapping_p4d(pgdp, vaddr, next);
+ if (ret)
+ break;
+ } while (pgdp++, vaddr = next, vaddr != end);
+
+ mmap_write_unlock(&init_mm);
+
+ return ret;
+}
+
+static int split_linear_mapping(unsigned long start, unsigned long end)
+{
+ int ret;
+
+ if (!system_supports_bbmlv2())
+ return 0;
+
+ ret = __split_linear_mapping_pgd(pgd_offset_k(start), start, end);
+ flush_tlb_kernel_range(start, end);
+
+ return ret;
+}
+
/*
* This function assumes that the range is mapped with PAGE_SIZE pages.
*/
@@ -70,8 +209,9 @@ static int change_memory_common(unsigned long addr, int numpages,
unsigned long start = addr;
unsigned long size = PAGE_SIZE * numpages;
unsigned long end = start + size;
+ unsigned long l_start;
struct vm_struct *area;
- int i;
+ int i, ret;
if (!PAGE_ALIGNED(addr)) {
start &= PAGE_MASK;
@@ -108,7 +248,12 @@ static int change_memory_common(unsigned long addr, int numpages,
if (rodata_full && (pgprot_val(set_mask) == PTE_RDONLY ||
pgprot_val(clear_mask) == PTE_RDONLY)) {
for (i = 0; i < area->nr_pages; i++) {
- __change_memory_common((u64)page_address(area->pages[i]),
+ l_start = (u64)page_address(area->pages[i]);
+ ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
+ __change_memory_common(l_start,
PAGE_SIZE, set_mask, clear_mask);
}
}
@@ -164,6 +309,9 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
int set_direct_map_invalid_noflush(struct page *page)
{
+ unsigned long l_start;
+ int ret;
+
struct page_change_data data = {
.set_mask = __pgprot(0),
.clear_mask = __pgprot(PTE_VALID),
@@ -172,13 +320,21 @@ int set_direct_map_invalid_noflush(struct page *page)
if (!can_set_direct_map())
return 0;
+ l_start = (unsigned long)page_address(page);
+ ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
return apply_to_page_range(&init_mm,
- (unsigned long)page_address(page),
- PAGE_SIZE, change_page_range, &data);
+ l_start, PAGE_SIZE, change_page_range,
+ &data);
}
int set_direct_map_default_noflush(struct page *page)
{
+ unsigned long l_start;
+ int ret;
+
struct page_change_data data = {
.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
.clear_mask = __pgprot(PTE_RDONLY),
@@ -187,9 +343,14 @@ int set_direct_map_default_noflush(struct page *page)
if (!can_set_direct_map())
return 0;
+ l_start = (unsigned long)page_address(page);
+ ret = split_linear_mapping(l_start, l_start + PAGE_SIZE);
+ if (WARN_ON_ONCE(ret))
+ return ret;
+
return apply_to_page_range(&init_mm,
- (unsigned long)page_address(page),
- PAGE_SIZE, change_page_range, &data);
+ l_start, PAGE_SIZE, change_page_range,
+ &data);
}
#ifdef CONFIG_DEBUG_PAGEALLOC
--
2.41.0
* [PATCH 3/3] arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2
2024-11-18 18:16 [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
2024-11-18 18:16 ` [PATCH 1/3] arm64: cpufeature: detect FEAT_BBM level 2 Yang Shi
2024-11-18 18:16 ` [PATCH 2/3] arm64: mm: support large block mapping when rodata=full Yang Shi
@ 2024-11-18 18:16 ` Yang Shi
2024-11-18 18:33 ` Christoph Lameter (Ampere)
2024-12-02 23:39 ` [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
2024-12-10 11:31 ` Will Deacon
4 siblings, 1 reply; 12+ messages in thread
From: Yang Shi @ 2024-11-18 18:16 UTC (permalink / raw)
To: catalin.marinas, will; +Cc: cl, scott, yang, linux-arm-kernel, linux-kernel
FEAT_BBM level 2 is not advertised on AmpereOne because of a bug when
collapsing stage 2 mappings from smaller to larger translations. That
doesn't impact splitting stage 1 mappings (whether stage 2 is enabled or
not), so work around it by matching on the CPU ID (MIDR).
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
arch/arm64/include/asm/cpufeature.h | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index c7ca5f9f88bb..d9b20eb43d31 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -847,10 +847,19 @@ static inline bool bbmlv2_available(void)
{
u64 mmfr2;
u32 bbm;
+ static const struct midr_range ampereone[] = {
+ MIDR_ALL_VERSIONS(MIDR_AMPERE1),
+ MIDR_ALL_VERSIONS(MIDR_AMPERE1A),
+ {}
+ };
mmfr2 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR2_EL1);
bbm = cpuid_feature_extract_unsigned_field(mmfr2, ID_AA64MMFR2_EL1_BBM_SHIFT);
- return bbm == ID_AA64MMFR2_EL1_BBM_2;
+ if ((bbm == ID_AA64MMFR2_EL1_BBM_2) ||
+ is_midr_in_range_list(read_cpuid_id(), ampereone))
+ return true;
+
+ return false;
}
int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
--
2.41.0
* Re: [PATCH 3/3] arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2
2024-11-18 18:16 ` [PATCH 3/3] arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2 Yang Shi
@ 2024-11-18 18:33 ` Christoph Lameter (Ampere)
0 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-11-18 18:33 UTC (permalink / raw)
To: Yang Shi; +Cc: catalin.marinas, will, scott, linux-arm-kernel, linux-kernel
On Mon, 18 Nov 2024, Yang Shi wrote:
> FEAT_BBM level 2 is not advertised on AmpereOne because of a bug when
> collapsing stage 2 mappings from smaller to larger translations. That
> doesn't impact splitting stage 1 mappings (whether stage 2 is enabled or
> not), so workaround it by detecting CPUID.
It would be better to have a bbmlv2_split_available() function that only
checks for the splitting capability.
If more code is added that uses the so far unused collapsing features
also included in the BBML2 feature set, then that will break on AmpereOne.
bbmlv2_split_available() could call bbmlv2_available() and check the
AmpereOne erratum when it returns false.
Should work fine for now.
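A rough sketch of what that split-only helper could look like (purely
illustrative, reusing the MIDR list from patch 3; not from a posted
patch):

    static inline bool bbmlv2_split_available(void)
    {
            static const struct midr_range ampereone[] = {
                    MIDR_ALL_VERSIONS(MIDR_AMPERE1),
                    MIDR_ALL_VERSIONS(MIDR_AMPERE1A),
                    {}
            };

            /* Generic check: the ID register advertises BBM level 2. */
            if (bbmlv2_available())
                    return true;

            /* AmpereOne doesn't advertise BBM level 2, but splitting
             * stage 1 mappings is safe there, so allow it. */
            return is_midr_in_range_list(read_cpuid_id(), ampereone);
    }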
Reviewed-by: Christoph Lameter <cl@linux.com>
* Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
2024-11-18 18:16 [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
` (2 preceding siblings ...)
2024-11-18 18:16 ` [PATCH 3/3] arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2 Yang Shi
@ 2024-12-02 23:39 ` Yang Shi
2024-12-10 11:31 ` Will Deacon
4 siblings, 0 replies; 12+ messages in thread
From: Yang Shi @ 2024-12-02 23:39 UTC (permalink / raw)
To: catalin.marinas, will; +Cc: cl, scott, linux-arm-kernel, linux-kernel
Gentle ping...
Any comments on this RFC? Looking forward to discussing it.
Thanks,
Yang
On 11/18/24 10:16 AM, Yang Shi wrote:
> When rodata=full kernel linear mapping is mapped by PTE due to arm's
> break-before-make rule.
>
> This resulted in a couple of problems:
> - performance degradation
> - more TLB pressure
> - memory waste for kernel page table
>
> There are some workarounds to mitigate the problems, for example, using
> rodata=on, but this compromises the security measurement.
>
> With FEAT_BBM level 2 support, splitting large block page table to
> smaller ones doesn't need to make the page table entry invalid anymore.
> This allows kernel split large block mapping on the fly.
>
> Add kernel page table split support and use large block mapping by
> default when FEAT_BBM level 2 is supported for rodata=full. When
> changing permissions for kernel linear mapping, the page table will be
> split to PTE level.
>
> The machine without FEAT_BBM level 2 will fallback to have kernel linear
> mapping PTE-mapped when rodata=full.
>
> With this we saw significant performance boost with some benchmarks with
> keeping rodata=full security protection in the mean time.
>
> The test was done on AmpereOne machine (192 cores, 1P) with 256GB memory and
> 4K page size + 48 bit VA.
>
> Function test (4K/16K/64K page size)
> - Kernel boot. Kernel needs change kernel linear mapping permission at
> boot stage, if the patch didn't work, kernel typically didn't boot.
> - Module stress from stress-ng. Kernel module load change permission for
> module sections.
> - A test kernel module which allocates 80% of total memory via vmalloc(),
> then change the vmalloc area permission to RO, then change it back
> before vfree(). Then launch a VM which consumes almost all physical
> memory.
> - VM with the patchset applied in guest kernel too.
> - Kernel build in VM with patched guest kernel.
>
> Memory consumption
> Before:
> MemTotal: 258988984 kB
> MemFree: 254821700 kB
>
> After:
> MemTotal: 259505132 kB
> MemFree: 255410264 kB
>
> Around 500MB more memory are free to use. The larger the machine, the
> more memory saved.
>
> Performance benchmarking
> * Memcached
> We saw performance degradation when running Memcached benchmark with
> rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
> With this patchset we saw ops/sec is increased by around 3.5%, P99
> latency is reduced by around 9.6%.
> The gain mainly came from reduced kernel TLB misses. The kernel TLB
> MPKI is reduced by 28.5%.
>
> The benchmark data is now on par with rodata=on too.
>
> * Disk encryption (dm-crypt) benchmark
> Ran fio benchmark with the below command on a 128G ramdisk (ext4) with disk
> encryption (by dm-crypt).
> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
> --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
> --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
> --name=iops-test-job --eta-newline=1 --size 100G
>
> The IOPS is increased by 90% - 150% (the variance is high, but the worst
> number of good case is around 90% more than the best number of bad case).
> The bandwidth is increased and the avg clat is reduced proportionally.
>
> * Sequential file read
> Read 100G file sequentially on XFS (xfs_io read with page cache populated).
> The bandwidth is increased by 150%.
>
>
> Yang Shi (3):
> arm64: cpufeature: detect FEAT_BBM level 2
> arm64: mm: support large block mapping when rodata=full
> arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2
>
> arch/arm64/include/asm/cpufeature.h | 24 ++++++++++++++++++
> arch/arm64/include/asm/pgtable.h | 7 +++++-
> arch/arm64/kernel/cpufeature.c | 11 ++++++++
> arch/arm64/mm/mmu.c | 31 +++++++++++++++++++++--
> arch/arm64/mm/pageattr.c | 173 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> arch/arm64/tools/cpucaps | 1 +
> 6 files changed, 238 insertions(+), 9 deletions(-)
>
>
* Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
2024-11-18 18:16 [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
` (3 preceding siblings ...)
2024-12-02 23:39 ` [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
@ 2024-12-10 11:31 ` Will Deacon
2024-12-10 19:33 ` Yang Shi
` (2 more replies)
4 siblings, 3 replies; 12+ messages in thread
From: Will Deacon @ 2024-12-10 11:31 UTC (permalink / raw)
To: Yang Shi; +Cc: catalin.marinas, cl, scott, linux-arm-kernel, linux-kernel
On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
>
> When rodata=full kernel linear mapping is mapped by PTE due to arm's
> break-before-make rule.
>
> This resulted in a couple of problems:
> - performance degradation
> - more TLB pressure
> - memory waste for kernel page table
>
> There are some workarounds to mitigate the problems, for example, using
> rodata=on, but this compromises the security measurement.
>
> With FEAT_BBM level 2 support, splitting large block page table to
> smaller ones doesn't need to make the page table entry invalid anymore.
> This allows kernel split large block mapping on the fly.
I think you can still get TLB conflict aborts in this case, so this
doesn't work. Hopefully the architecture can strengthen this in the
future to give you what you need.
Will
* Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
2024-12-10 11:31 ` Will Deacon
@ 2024-12-10 19:33 ` Yang Shi
2024-12-11 22:30 ` Will Deacon
2024-12-11 17:24 ` Christoph Lameter (Ampere)
2025-01-02 12:13 ` Jonathan Cameron
2 siblings, 1 reply; 12+ messages in thread
From: Yang Shi @ 2024-12-10 19:33 UTC (permalink / raw)
To: Will Deacon; +Cc: catalin.marinas, cl, scott, linux-arm-kernel, linux-kernel
On 12/10/24 3:31 AM, Will Deacon wrote:
> On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>> break-before-make rule.
>>
>> This resulted in a couple of problems:
>> - performance degradation
>> - more TLB pressure
>> - memory waste for kernel page table
>>
>> There are some workarounds to mitigate the problems, for example, using
>> rodata=on, but this compromises the security measurement.
>>
>> With FEAT_BBM level 2 support, splitting large block page table to
>> smaller ones doesn't need to make the page table entry invalid anymore.
>> This allows kernel split large block mapping on the fly.
> I think you can still get TLB conflict aborts in this case, so this
> doesn't work. Hopefully the architecture can strengthen this in the
> future to give you what you need.
Hi Will,
Thanks for responding. This is a little bit surprising. I thought
FEAT_BBM level 2 could handle the TLB conflict gracefully; at least its
description made me assume so. And Catalin also mentioned that FEAT_BBM
level 2 could be used to split the vmemmap page table in the HVO patch
discussion (https://lore.kernel.org/all/Zo68DP6siXfb6ZBR@arm.com/).
It sounds a little contradictory that the TLB conflict can still happen
with FEAT_BBM level 2. It makes the benefit of FEAT_BBM level 2 much
smaller than expected.
Is it out of the question to handle the TLB conflict aborts? IIUC we
should just need to flush the TLB and then resume, and it shouldn't
require holding any locks either. A rough sketch of what I have in mind
is below.
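Something along these lines (purely illustrative and untested; the
handler name and how it gets hooked into the arm64 fault handling code
are hypothetical):

    /* Hypothetical handler for DFSC 0b110000 ("TLB conflict abort"):
     * invalidate the stale TLB entries and retry the faulting access. */
    static int do_tlb_conflict_abort(unsigned long far, unsigned long esr,
                                     struct pt_regs *regs)
    {
            local_flush_tlb_all();
            return 0;   /* handled, the faulting instruction is retried */
    }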
I chatted with our architects, and I was told the TLB conflict abort
doesn't happen on AmpereOne. Maybe this is why I didn't see the problem
when I tested the patches.
Thanks,
Yang
>
> Will
* Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
2024-12-10 11:31 ` Will Deacon
2024-12-10 19:33 ` Yang Shi
@ 2024-12-11 17:24 ` Christoph Lameter (Ampere)
2025-01-02 12:13 ` Jonathan Cameron
2 siblings, 0 replies; 12+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-12-11 17:24 UTC (permalink / raw)
To: Will Deacon
Cc: Yang Shi, catalin.marinas, scott, linux-arm-kernel, linux-kernel
On Tue, 10 Dec 2024, Will Deacon wrote:
> > With FEAT_BBM level 2 support, splitting large block page table to
> > smaller ones doesn't need to make the page table entry invalid anymore.
> > This allows kernel split large block mapping on the fly.
>
> I think you can still get TLB conflict aborts in this case, so this
> doesn't work. Hopefully the architecture can strengthen this in the
Which platforms get TLB conflicts? Ours does not. Is this an erratum on
some platforms?
* Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
2024-12-10 19:33 ` Yang Shi
@ 2024-12-11 22:30 ` Will Deacon
2024-12-12 0:05 ` Yang Shi
0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2024-12-11 22:30 UTC (permalink / raw)
To: Yang Shi; +Cc: catalin.marinas, cl, scott, linux-arm-kernel, linux-kernel
Hey,
On Tue, Dec 10, 2024 at 11:33:16AM -0800, Yang Shi wrote:
> On 12/10/24 3:31 AM, Will Deacon wrote:
> > On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
> > > When rodata=full kernel linear mapping is mapped by PTE due to arm's
> > > break-before-make rule.
> > >
> > > This resulted in a couple of problems:
> > > - performance degradation
> > > - more TLB pressure
> > > - memory waste for kernel page table
> > >
> > > There are some workarounds to mitigate the problems, for example, using
> > > rodata=on, but this compromises the security measurement.
> > >
> > > With FEAT_BBM level 2 support, splitting large block page table to
> > > smaller ones doesn't need to make the page table entry invalid anymore.
> > > This allows kernel split large block mapping on the fly.
> > I think you can still get TLB conflict aborts in this case, so this
> > doesn't work. Hopefully the architecture can strengthen this in the
> > future to give you what you need.
>
> Thanks for responding. This is a little bit surprising. I thought FEAT_BBM
> level 2 can handle the TLB conflict gracefully. At least its description
> made me assume so. And Catalin also mentioned FEAT_BBM level 2 can be used
> to split vmemmap page table in HVO patch discussion
> (https://lore.kernel.org/all/Zo68DP6siXfb6ZBR@arm.com/).
>
> It sounds a little bit contradicting if the TLB conflict still can happen
> with FEAT_BBM level 2. It makes the benefit of FEAT_BBM level 2 much less
> than expected.
You can read the Arm ARM just as badly as I can :)
| I_HYQMB
|
| If any level is supported and the TLB entries are not invalidated after
| the writes that modified the translation table entries are completed,
| then a TLB conflict abort can be generated because in a TLB there might
| be multiple translation table entries that all translate the same IA.
Note *any level*.
Furthermore:
| R_FWRMB
|
| If all of the following apply, then a TLB conflict abort is reported
| to EL2:
| * Level 1 or level 2 is supported.
| * Stage 2 translations are enabled in the current translation regime.
| * A TLB conflict abort is generated due to changing the block size or
| Contiguous bit.
I think this series is trying to handle some of this:
https://lore.kernel.org/r/20241211154611.40395-1-miko.lenczewski@arm.com
> Is it out of question to handle the TLB conflict aborts? IIUC we should just
> need flush TLB then resume, and it doesn't require to hold any locks as
> well.
See my reply here:
https://lore.kernel.org/r/20241211210243.GA17155@willie-the-truck
> And I chatted with our architects, I was told the TLB conflict abort doesn't
> happen on AmpereOne. Maybe this is why I didn't see the problem when I
> tested the patches.
I'm actually open to having an MIDR-based lookup for this if it's your
own micro-architecture.
Will
* Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
2024-12-11 22:30 ` Will Deacon
@ 2024-12-12 0:05 ` Yang Shi
0 siblings, 0 replies; 12+ messages in thread
From: Yang Shi @ 2024-12-12 0:05 UTC (permalink / raw)
To: Will Deacon; +Cc: catalin.marinas, cl, scott, linux-arm-kernel, linux-kernel
On 12/11/24 2:30 PM, Will Deacon wrote:
> Hey,
>
> On Tue, Dec 10, 2024 at 11:33:16AM -0800, Yang Shi wrote:
>> On 12/10/24 3:31 AM, Will Deacon wrote:
>>> On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
>>>> When rodata=full kernel linear mapping is mapped by PTE due to arm's
>>>> break-before-make rule.
>>>>
>>>> This resulted in a couple of problems:
>>>> - performance degradation
>>>> - more TLB pressure
>>>> - memory waste for kernel page table
>>>>
>>>> There are some workarounds to mitigate the problems, for example, using
>>>> rodata=on, but this compromises the security measurement.
>>>>
>>>> With FEAT_BBM level 2 support, splitting large block page table to
>>>> smaller ones doesn't need to make the page table entry invalid anymore.
>>>> This allows kernel split large block mapping on the fly.
>>> I think you can still get TLB conflict aborts in this case, so this
>>> doesn't work. Hopefully the architecture can strengthen this in the
>>> future to give you what you need.
>> Thanks for responding. This is a little bit surprising. I thought FEAT_BBM
>> level 2 can handle the TLB conflict gracefully. At least its description
>> made me assume so. And Catalin also mentioned FEAT_BBM level 2 can be used
>> to split vmemmap page table in HVO patch discussion
>> (https://lore.kernel.org/all/Zo68DP6siXfb6ZBR@arm.com/).
>>
>> It sounds a little bit contradicting if the TLB conflict still can happen
>> with FEAT_BBM level 2. It makes the benefit of FEAT_BBM level 2 much less
>> than expected.
> You can read the Arm ARM just as badly as I can :)
>
> | I_HYQMB
> |
> | If any level is supported and the TLB entries are not invalidated after
> | the writes that modified the translation table entries are completed,
> | then a TLB conflict abort can be generated because in a TLB there might
> | be multiple translation table entries that all translate the same IA.
>
> Note *any level*.
>
> Furthermore:
>
> | R_FWRMB
> |
> | If all of the following apply, then a TLB conflict abort is reported
> | to EL2:
> | * Level 1 or level 2 is supported.
> | * Stage 2 translations are enabled in the current translation regime.
> | * A TLB conflict abort is generated due to changing the block size or
> | Contiguous bit.
Thank you so much for pinpointing the document.
>
> I think this series is trying to handle some of this:
>
> https://lore.kernel.org/r/20241211154611.40395-1-miko.lenczewski@arm.com
Thanks for sharing the series; it is new to me. Yes, both are trying to
add BBML2 support to optimize some use cases.
>
>> Is it out of question to handle the TLB conflict aborts? IIUC we should just
>> need flush TLB then resume, and it doesn't require to hold any locks as
>> well.
> See my reply here:
>
> https://lore.kernel.org/r/20241211210243.GA17155@willie-the-truck
Yeah, it is hard to guarantee that a recursive TLB conflict abort never
happens.
>
>> And I chatted with our architects, I was told the TLB conflict abort doesn't
>> happen on AmpereOne. Maybe this is why I didn't see the problem when I
>> tested the patches.
> I'm actually open to having an MIDR-based lookup for this if its your own
> micro-architecture.
I think that actually makes our life easier. We can just enable BBML2 for
the CPUs which can handle the TLB conflict gracefully, so we don't have
to worry about handling the TLB conflict abort at all. I can implement
this in v2.
> Will
* Re: [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
2024-12-10 11:31 ` Will Deacon
2024-12-10 19:33 ` Yang Shi
2024-12-11 17:24 ` Christoph Lameter (Ampere)
@ 2025-01-02 12:13 ` Jonathan Cameron
2 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2025-01-02 12:13 UTC (permalink / raw)
To: Will Deacon
Cc: Yang Shi, catalin.marinas, cl, scott, linux-arm-kernel,
linux-kernel, yangyicong, guohanjun, wangkefeng.wang, liaochang1,
sunnanyong, linuxarm
On Tue, 10 Dec 2024 11:31:52 +0000
Will Deacon <will@kernel.org> wrote:
> On Mon, Nov 18, 2024 at 10:16:07AM -0800, Yang Shi wrote:
> >
> > When rodata=full kernel linear mapping is mapped by PTE due to arm's
> > break-before-make rule.
> >
> > This resulted in a couple of problems:
> > - performance degradation
> > - more TLB pressure
> > - memory waste for kernel page table
> >
> > There are some workarounds to mitigate the problems, for example, using
> > rodata=on, but this compromises the security measurement.
> >
> > With FEAT_BBM level 2 support, splitting large block page table to
> > smaller ones doesn't need to make the page table entry invalid anymore.
> > This allows kernel split large block mapping on the fly.
>
> I think you can still get TLB conflict aborts in this case, so this
> doesn't work. Hopefully the architecture can strengthen this in the
> future to give you what you need.
>
> Will
>
Hi All,
Given we have two threads on this topic, replying here as well...
Huawei has implementations that support BBML2, and might report TLB conflict
abort after changing block size directly until an appropriate TLB invalidation
instruction completes and this Implementation Choice is architecturally compliant.
I'm not trying to restrict potential solutions, but just making the point that we
will be interested in solutions that handle the conflict abort.
Jonathan
end of thread, other threads:[~2025-01-02 12:15 UTC | newest]
Thread overview: 12+ messages
2024-11-18 18:16 [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
2024-11-18 18:16 ` [PATCH 1/3] arm64: cpufeature: detect FEAT_BBM level 2 Yang Shi
2024-11-18 18:16 ` [PATCH 2/3] arm64: mm: support large block mapping when rodata=full Yang Shi
2024-11-18 18:16 ` [PATCH 3/3] arm64: cpufeature: workaround AmpereOne FEAT_BBM level 2 Yang Shi
2024-11-18 18:33 ` Christoph Lameter (Ampere)
2024-12-02 23:39 ` [RFC PATCH 0/3] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full Yang Shi
2024-12-10 11:31 ` Will Deacon
2024-12-10 19:33 ` Yang Shi
2024-12-11 22:30 ` Will Deacon
2024-12-12 0:05 ` Yang Shi
2024-12-11 17:24 ` Christoph Lameter (Ampere)
2025-01-02 12:13 ` Jonathan Cameron