* [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64
@ 2025-03-04 15:04 Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms Ryan Roberts
` (12 more replies)
0 siblings, 13 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Hi All,
This is v3 of a series to improve performance for hugetlb and vmalloc on arm64.
Although some of these patches are core-mm, advice from Andrew was to go via the
arm64 tree. Hopefully I can get some ACKs from mm folks.
The 2 key performance improvements are 1) enabling the use of contpte-mapped
blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
were already hooks for this (used by powerpc) but they required some tidying and
extending for arm64. And 2) batching up barriers when modifying the vmalloc
address space, for up to a 30% reduction in time taken in vmalloc().
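Conceptually (a simplified sketch based on patches 10 and 11, not verbatim
kernel code; the function name below is made up, the real loop lives in
vmap_pte_range()), the vmalloc pte loops end up shaped like this, with the
dsb/isb deferred until lazy mmu exit:

  static void example_vmap_ptes(pte_t *pte, unsigned long addr,
                                unsigned long end, unsigned long pfn,
                                pgprot_t prot)
  {
          arch_enter_lazy_mmu_mode();     /* arm64: defer pte barriers */
          do {
                  set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
                  pfn++;
          } while (pte++, addr += PAGE_SIZE, addr != end);
          arch_leave_lazy_mmu_mode();     /* one dsb(ishst) + isb() for all */
  }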
vmalloc() performance was measured using the test_vmalloc.ko module. Tested on
Apple M2 and Ampere Altra. Each test had loop count set to 500000 and the whole
test was repeated 10 times.
legend:
- p: nr_pages (pages to allocate)
- h: use_huge (vmalloc() vs vmalloc_huge())
- (I): statistically significant improvement (95% CI does not overlap)
- (R): statistically significant regression (95% CI does not overlap)
- measurements are times; smaller is better
+--------------------------------------------------+-------------+-------------+
| Benchmark | | |
| Result Class | Apple M2 | Ampere Alta |
+==================================================+=============+=============+
| micromm/vmalloc | | |
| fix_align_alloc_test: p:1, h:0 (usec) | (I) -11.53% | -2.57% |
| fix_size_alloc_test: p:1, h:0 (usec) | 2.14% | 1.79% |
| fix_size_alloc_test: p:4, h:0 (usec) | (I) -9.93% | (I) -4.80% |
| fix_size_alloc_test: p:16, h:0 (usec) | (I) -25.07% | (I) -14.24% |
| fix_size_alloc_test: p:16, h:1 (usec) | (I) -14.07% | (R) 7.93% |
| fix_size_alloc_test: p:64, h:0 (usec) | (I) -29.43% | (I) -19.30% |
| fix_size_alloc_test: p:64, h:1 (usec) | (I) -16.39% | (R) 6.71% |
| fix_size_alloc_test: p:256, h:0 (usec) | (I) -31.46% | (I) -20.60% |
| fix_size_alloc_test: p:256, h:1 (usec) | (I) -16.58% | (R) 6.70% |
| fix_size_alloc_test: p:512, h:0 (usec) | (I) -31.96% | (I) -20.04% |
| fix_size_alloc_test: p:512, h:1 (usec) | 2.30% | 0.71% |
| full_fit_alloc_test: p:1, h:0 (usec) | -2.94% | 1.77% |
| kvfree_rcu_1_arg_vmalloc_test: p:1, h:0 (usec) | -7.75% | 1.71% |
| kvfree_rcu_2_arg_vmalloc_test: p:1, h:0 (usec) | -9.07% | (R) 2.34% |
| long_busy_list_alloc_test: p:1, h:0 (usec) | (I) -29.18% | (I) -17.91% |
| pcpu_alloc_test: p:1, h:0 (usec) | -14.71% | -3.14% |
| random_size_align_alloc_test: p:1, h:0 (usec) | (I) -11.08% | (I) -4.62% |
| random_size_alloc_test: p:1, h:0 (usec) | (I) -30.25% | (I) -17.95% |
| vm_map_ram_test: p:1, h:0 (usec) | 5.06% | (R) 6.63% |
+--------------------------------------------------+-------------+-------------+
So there are some nice improvements, but also some regressions to explain:
fix_size_alloc_test with h:1 and p:16,64,256 regress by ~6% on Altra. The
regression is actually introduced by enabling contpte-mapped 64K blocks in
these tests, and it is reduced (from about 8%, if memory serves) by the
barrier batching. I don't have a definite conclusion on the root cause, but
I've ruled out differences in the mapping paths in vmalloc. I believe it is
likely due to the difference in the allocation path; 64K blocks are not
cached per-cpu so we have to go all the way to the buddy allocator. I'm not
sure why this doesn't show up on M2 though. Regardless, I'll assert that a
16x reduction in TLB pressure is worth a ~6% increase in the duration of
the vmalloc() allocation call.
Changes since v2 [2]
====================
- Removed the new arch_update_kernel_mappings_[begin|end]() API
- Switched to arch_[enter|leave]_lazy_mmu_mode() instead for barrier batching
- Removed the clean up that avoided barriers for invalid or user mappings
Changes since v1 [1]
====================
- Split out the fixes into their own series
- Added Rbs from Anshuman - Thanks!
- Added patch to clean up the methods by which huge_pte size is determined
- Added "#ifndef __PAGETABLE_PMD_FOLDED" around PUD_SIZE in
flush_hugetlb_tlb_range()
- Renamed ___set_ptes() -> set_ptes_anysz()
- Renamed ___ptep_get_and_clear() -> ptep_get_and_clear_anysz()
- Fixed typos in commit logs
- Refactored pXd_valid_not_user() for better reuse
- Removed TIF_KMAP_UPDATE_PENDING after concluding that a single flag is sufficient
- Concluded the extra isb() in __switch_to() is not required
- Only call arch_update_kernel_mappings_[begin|end]() for kernel mappings
Applies on top of v6.14-rc5, which already contains the fixes from [3]. All
mm selftests run and pass.
NOTE: It's possible that the changes in patch #10 make bugs I found in other
archs' lazy mmu implementations more likely to trigger. I've fixed all of
those bugs in the series at [4], which is now in mm-unstable, but some
coordination may be required when merging this.
[1] https://lore.kernel.org/all/20250205151003.88959-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250217140809.1702789-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250217140419.1702389-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
Thanks,
Ryan
Ryan Roberts (11):
arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
arm64: hugetlb: Refine tlb maintenance scope
mm/page_table_check: Batch-check pmds/puds just like ptes
arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
arm64/mm: Hoist barriers out of set_ptes_anysz() loop
mm/vmalloc: Warn on improper use of vunmap_range()
mm/vmalloc: Gracefully unmap huge ptes
arm64/mm: Support huge pte-mapped pages in vmap
mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
arm64/mm: Batch barriers when updating kernel mappings
arch/arm64/include/asm/hugetlb.h | 29 ++--
arch/arm64/include/asm/pgtable.h | 195 ++++++++++++++++++---------
arch/arm64/include/asm/thread_info.h | 2 +
arch/arm64/include/asm/vmalloc.h | 45 +++++++
arch/arm64/kernel/process.c | 9 +-
arch/arm64/mm/hugetlbpage.c | 72 ++++------
include/linux/page_table_check.h | 30 +++--
include/linux/vmalloc.h | 8 ++
mm/page_table_check.c | 34 +++--
mm/vmalloc.c | 40 +++++-
10 files changed, 315 insertions(+), 149 deletions(-)
--
2.43.0
* [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
2025-04-04 3:03 ` Anshuman Khandual
2025-03-04 15:04 ` [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
` (11 subsequent siblings)
12 siblings, 2 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Not all huge_pte helper APIs explicitly provide the size of the
huge_pte. So the helpers have to depend on various methods to determine
the size of the huge_pte. Some of these methods are dubious.
Let's clean up the code to use preferred methods and retire the dubious
ones. The options in order of preference:
- If size is provided as parameter, use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If vma is provided as a parameter, retrieve size via
huge_page_size(hstate_vma(vma)) and use it together with
num_contig_ptes(). This is explicit and works for both present and
non-present ptes.
- If the pte is present and contiguous, use find_num_contig() to walk
the pgtable to find the level and infer the number of ptes from
level. Only works for *present* ptes.
- If the pte is present and not contiguous, infer from this that only 1
pte needs to be operated on. This is ok if you don't care about the
absolute size and just want to know the number of ptes.
- NEVER rely on resolving the PFN of a present pte to a folio and
getting the folio's size. This is fragile at best, because there is
nothing to stop the core-mm from allocating a folio twice as big as
the huge_pte then mapping it across 2 consecutive huge_ptes. Or just
partially mapping it.
Where we require that the pte is present, add a warning if it is not.
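As a minimal sketch of that preference order (the function name below is
made up for illustration; the helpers are the existing arm64 ones):

  static int example_num_huge_ptes(struct vm_area_struct *vma,
                                   struct mm_struct *mm, unsigned long addr,
                                   pte_t *ptep, unsigned long sz,
                                   size_t *pgsize)
  {
          /* Preferred: explicit size; works for non-present ptes too. */
          if (sz)
                  return num_contig_ptes(sz, pgsize);

          /* Next: derive the size from the vma's hstate. */
          if (vma)
                  return num_contig_ptes(huge_page_size(hstate_vma(vma)),
                                         pgsize);

          /* Last resort: walk the pgtable; only valid for *present* ptes. */
          VM_WARN_ON(!pte_present(__ptep_get(ptep)));
          return find_num_contig(mm, addr, ptep, pgsize);
  }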
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/mm/hugetlbpage.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index b3a7fafe8892..6a2af9fb2566 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -129,7 +129,7 @@ pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
if (!pte_present(orig_pte) || !pte_cont(orig_pte))
return orig_pte;
- ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
+ ncontig = find_num_contig(mm, addr, ptep, &pgsize);
for (i = 0; i < ncontig; i++, ptep++) {
pte_t pte = __ptep_get(ptep);
@@ -438,16 +438,19 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
pgprot_t hugeprot;
pte_t orig_pte;
+ VM_WARN_ON(!pte_present(pte));
+
if (!pte_cont(pte))
return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
- ncontig = find_num_contig(mm, addr, ptep, &pgsize);
+ ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
dpfn = pgsize >> PAGE_SHIFT;
if (!__cont_access_flags_changed(ptep, pte, ncontig))
return 0;
orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
+ VM_WARN_ON(!pte_present(orig_pte));
/* Make sure we don't lose the dirty or young state */
if (pte_dirty(orig_pte))
@@ -472,7 +475,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
size_t pgsize;
pte_t pte;
- if (!pte_cont(__ptep_get(ptep))) {
+ pte = __ptep_get(ptep);
+ VM_WARN_ON(!pte_present(pte));
+
+ if (!pte_cont(pte)) {
__ptep_set_wrprotect(mm, addr, ptep);
return;
}
@@ -496,11 +502,15 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
size_t pgsize;
int ncontig;
+ pte_t pte;
+
+ pte = __ptep_get(ptep);
+ VM_WARN_ON(!pte_present(pte));
- if (!pte_cont(__ptep_get(ptep)))
+ if (!pte_cont(pte))
return ptep_clear_flush(vma, addr, ptep);
- ncontig = find_num_contig(mm, addr, ptep, &pgsize);
+ ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
return get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
}
--
2.43.0
* [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-04-03 20:47 ` Catalin Marinas
2025-04-04 3:50 ` Anshuman Khandual
2025-03-04 15:04 ` [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
` (10 subsequent siblings)
12 siblings, 2 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
When operating on contiguous blocks of ptes (or pmds) for some hugetlb
sizes, we must honour break-before-make requirements and clear down the
block to invalid state in the pgtable then invalidate the relevant tlb
entries before making the pgtable entries valid again.
However, the tlb maintenance is currently always done assuming the worst
case stride (PAGE_SIZE), last_level (false) and tlb_level
(TLBI_TTL_UNKNOWN). We can do much better with the hinting; in reality,
we know the stride from the huge_pte pgsize, we are always operating
only on the last level, and we always know the tlb_level, again based on
pgsize. So let's start providing these hints.
Additionally, avoid tlb maintenance in set_huge_pte_at().
Break-before-make is only required if we are transitioning the
contiguous pte block from valid -> valid. So let's elide the
clear-and-flush ("break") if the pte range was previously invalid.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/hugetlb.h | 29 +++++++++++++++++++----------
arch/arm64/mm/hugetlbpage.c | 9 ++++++---
2 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 07fbf5bf85a7..2a8155c4a882 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -69,29 +69,38 @@ extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
#include <asm-generic/hugetlb.h>
-#define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
-static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
- unsigned long start,
- unsigned long end)
+static inline void __flush_hugetlb_tlb_range(struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end,
+ unsigned long stride,
+ bool last_level)
{
- unsigned long stride = huge_page_size(hstate_vma(vma));
-
switch (stride) {
#ifndef __PAGETABLE_PMD_FOLDED
case PUD_SIZE:
- __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
+ __flush_tlb_range(vma, start, end, PUD_SIZE, last_level, 1);
break;
#endif
case CONT_PMD_SIZE:
case PMD_SIZE:
- __flush_tlb_range(vma, start, end, PMD_SIZE, false, 2);
+ __flush_tlb_range(vma, start, end, PMD_SIZE, last_level, 2);
break;
case CONT_PTE_SIZE:
- __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 3);
+ __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, 3);
break;
default:
- __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
+ __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, TLBI_TTL_UNKNOWN);
}
}
+#define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
+ unsigned long start,
+ unsigned long end)
+{
+ unsigned long stride = huge_page_size(hstate_vma(vma));
+
+ __flush_hugetlb_tlb_range(vma, start, end, stride, false);
+}
+
#endif /* __ASM_HUGETLB_H */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 6a2af9fb2566..065be8650aa5 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -183,8 +183,9 @@ static pte_t get_clear_contig_flush(struct mm_struct *mm,
{
pte_t orig_pte = get_clear_contig(mm, addr, ptep, pgsize, ncontig);
struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+ unsigned long end = addr + (pgsize * ncontig);
- flush_tlb_range(&vma, addr, addr + (pgsize * ncontig));
+ __flush_hugetlb_tlb_range(&vma, addr, end, pgsize, true);
return orig_pte;
}
@@ -209,7 +210,7 @@ static void clear_flush(struct mm_struct *mm,
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
__ptep_get_and_clear(mm, addr, ptep);
- flush_tlb_range(&vma, saddr, addr);
+ __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
@@ -238,7 +239,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
dpfn = pgsize >> PAGE_SHIFT;
hugeprot = pte_pgprot(pte);
- clear_flush(mm, addr, ptep, pgsize, ncontig);
+ /* Only need to "break" if transitioning valid -> valid. */
+ if (pte_valid(__ptep_get(ptep)))
+ clear_flush(mm, addr, ptep, pgsize, ncontig);
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
__set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
--
2.43.0
* [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-03-26 14:48 ` Pasha Tatashin
2025-04-03 20:46 ` Catalin Marinas
2025-03-04 15:04 ` [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
` (9 subsequent siblings)
12 siblings, 2 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Convert page_table_check_p[mu]d_set(...) to
page_table_check_p[mu]ds_set(..., nr) to allow checking a contiguous set
of pmds/puds in a single batch. We retain page_table_check_p[mu]d_set(...)
as macros that call new batch functions with nr=1 for compatibility.
arm64 is about to reorganise its pte/pmd/pud helpers to reuse more code
and to allow the implementation for huge_pte to more efficiently set
ptes/pmds/puds in batches. We need these batch-helpers to make the
refactoring possible.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/page_table_check.h | 30 +++++++++++++++++-----------
mm/page_table_check.c | 34 +++++++++++++++++++-------------
2 files changed, 38 insertions(+), 26 deletions(-)
diff --git a/include/linux/page_table_check.h b/include/linux/page_table_check.h
index 6722941c7cb8..289620d4aad3 100644
--- a/include/linux/page_table_check.h
+++ b/include/linux/page_table_check.h
@@ -19,8 +19,10 @@ void __page_table_check_pmd_clear(struct mm_struct *mm, pmd_t pmd);
void __page_table_check_pud_clear(struct mm_struct *mm, pud_t pud);
void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
unsigned int nr);
-void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd);
-void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud);
+void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
+ unsigned int nr);
+void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
+ unsigned int nr);
void __page_table_check_pte_clear_range(struct mm_struct *mm,
unsigned long addr,
pmd_t pmd);
@@ -74,22 +76,22 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
__page_table_check_ptes_set(mm, ptep, pte, nr);
}
-static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
- pmd_t pmd)
+static inline void page_table_check_pmds_set(struct mm_struct *mm,
+ pmd_t *pmdp, pmd_t pmd, unsigned int nr)
{
if (static_branch_likely(&page_table_check_disabled))
return;
- __page_table_check_pmd_set(mm, pmdp, pmd);
+ __page_table_check_pmds_set(mm, pmdp, pmd, nr);
}
-static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
- pud_t pud)
+static inline void page_table_check_puds_set(struct mm_struct *mm,
+ pud_t *pudp, pud_t pud, unsigned int nr)
{
if (static_branch_likely(&page_table_check_disabled))
return;
- __page_table_check_pud_set(mm, pudp, pud);
+ __page_table_check_puds_set(mm, pudp, pud, nr);
}
static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
@@ -129,13 +131,13 @@ static inline void page_table_check_ptes_set(struct mm_struct *mm,
{
}
-static inline void page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp,
- pmd_t pmd)
+static inline void page_table_check_pmds_set(struct mm_struct *mm,
+ pmd_t *pmdp, pmd_t pmd, unsigned int nr)
{
}
-static inline void page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp,
- pud_t pud)
+static inline void page_table_check_puds_set(struct mm_struct *mm,
+ pud_t *pudp, pud_t pud, unsigned int nr)
{
}
@@ -146,4 +148,8 @@ static inline void page_table_check_pte_clear_range(struct mm_struct *mm,
}
#endif /* CONFIG_PAGE_TABLE_CHECK */
+
+#define page_table_check_pmd_set(mm, pmdp, pmd) page_table_check_pmds_set(mm, pmdp, pmd, 1)
+#define page_table_check_pud_set(mm, pudp, pud) page_table_check_puds_set(mm, pudp, pud, 1)
+
#endif /* __LINUX_PAGE_TABLE_CHECK_H */
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 509c6ef8de40..dae4a7d776b3 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -234,33 +234,39 @@ static inline void page_table_check_pmd_flags(pmd_t pmd)
WARN_ON_ONCE(swap_cached_writable(pmd_to_swp_entry(pmd)));
}
-void __page_table_check_pmd_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd)
+void __page_table_check_pmds_set(struct mm_struct *mm, pmd_t *pmdp, pmd_t pmd,
+ unsigned int nr)
{
+ unsigned int i;
+ unsigned long stride = PMD_SIZE >> PAGE_SHIFT;
+
if (&init_mm == mm)
return;
page_table_check_pmd_flags(pmd);
- __page_table_check_pmd_clear(mm, *pmdp);
- if (pmd_user_accessible_page(pmd)) {
- page_table_check_set(pmd_pfn(pmd), PMD_SIZE >> PAGE_SHIFT,
- pmd_write(pmd));
- }
+ for (i = 0; i < nr; i++)
+ __page_table_check_pmd_clear(mm, *(pmdp + i));
+ if (pmd_user_accessible_page(pmd))
+ page_table_check_set(pmd_pfn(pmd), stride * nr, pmd_write(pmd));
}
-EXPORT_SYMBOL(__page_table_check_pmd_set);
+EXPORT_SYMBOL(__page_table_check_pmds_set);
-void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud)
+void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
+ unsigned int nr)
{
+ unsigned int i;
+ unsigned long stride = PUD_SIZE >> PAGE_SHIFT;
+
if (&init_mm == mm)
return;
- __page_table_check_pud_clear(mm, *pudp);
- if (pud_user_accessible_page(pud)) {
- page_table_check_set(pud_pfn(pud), PUD_SIZE >> PAGE_SHIFT,
- pud_write(pud));
- }
+ for (i = 0; i < nr; i++)
+ __page_table_check_pud_clear(mm, *(pudp + i));
+ if (pud_user_accessible_page(pud))
+ page_table_check_set(pud_pfn(pud), stride * nr, pud_write(pud));
}
-EXPORT_SYMBOL(__page_table_check_pud_set);
+EXPORT_SYMBOL(__page_table_check_puds_set);
void __page_table_check_pte_clear_range(struct mm_struct *mm,
unsigned long addr,
--
2.43.0
* [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (2 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-03-06 5:08 ` kernel test robot
2025-04-14 16:25 ` Catalin Marinas
2025-03-04 15:04 ` [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz() Ryan Roberts
` (8 subsequent siblings)
12 siblings, 2 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Refactor __set_ptes(), set_pmd_at() and set_pud_at() so that they are
all thin wrappers around a new common set_ptes_anysz(), which takes a
pgsize parameter. Additionally, refactor __ptep_get_and_clear() and
pmdp_huge_get_and_clear() to use a new common ptep_get_and_clear_anysz()
which also takes a pgsize parameter.
These changes will permit the huge_pte API to efficiently batch-set
pgtable entries and take advantage of the future barrier optimizations.
Additionally since the new *_anysz() helpers call the correct
page_table_check_*_set() API based on pgsize, this means that huge_ptes
will be able to get proper coverage. Currently the huge_pte API always
uses the pte API which assumes an entry only covers a single page.
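As a sketch of the resulting call shape (roughly how the hugetlb code in
the next patch ends up using these helpers; the non-present and
break-before-make handling is omitted here):

  static void example_set_huge_pte(struct mm_struct *mm, pte_t *ptep,
                                   pte_t pte, unsigned long sz)
  {
          size_t pgsize;
          int ncontig = num_contig_ptes(sz, &pgsize);

          /* One call covers the whole contiguous block of entries. */
          set_ptes_anysz(mm, ptep, pte, ncontig, pgsize);
  }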
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 108 +++++++++++++++++++------------
1 file changed, 67 insertions(+), 41 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0b2a2ad1b9e8..e255a36380dc 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -420,23 +420,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
}
-static inline void __set_ptes(struct mm_struct *mm,
- unsigned long __always_unused addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- page_table_check_ptes_set(mm, ptep, pte, nr);
- __sync_cache_and_tags(pte, nr);
-
- for (;;) {
- __check_safe_pte_update(mm, ptep, pte);
- __set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte = pte_advance_pfn(pte, 1);
- }
-}
-
/*
* Hugetlb definitions.
*/
@@ -641,30 +624,59 @@ static inline pgprot_t pud_pgprot(pud_t pud)
return __pgprot(pud_val(pfn_pud(pfn, __pgprot(0))) ^ pud_val(pud));
}
-static inline void __set_pte_at(struct mm_struct *mm,
- unsigned long __always_unused addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
+static inline void set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
+ unsigned int nr, unsigned long pgsize)
{
- __sync_cache_and_tags(pte, nr);
- __check_safe_pte_update(mm, ptep, pte);
- __set_pte(ptep, pte);
+ unsigned long stride = pgsize >> PAGE_SHIFT;
+
+ switch (pgsize) {
+ case PAGE_SIZE:
+ page_table_check_ptes_set(mm, ptep, pte, nr);
+ break;
+ case PMD_SIZE:
+ page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
+ break;
+ case PUD_SIZE:
+ page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
+ break;
+ default:
+ VM_WARN_ON(1);
+ }
+
+ __sync_cache_and_tags(pte, nr * stride);
+
+ for (;;) {
+ __check_safe_pte_update(mm, ptep, pte);
+ __set_pte(ptep, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte = pte_advance_pfn(pte, stride);
+ }
}
-static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
- pmd_t *pmdp, pmd_t pmd)
+static inline void __set_ptes(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
{
- page_table_check_pmd_set(mm, pmdp, pmd);
- return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd),
- PMD_SIZE >> PAGE_SHIFT);
+ set_ptes_anysz(mm, ptep, pte, nr, PAGE_SIZE);
}
-static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
- pud_t *pudp, pud_t pud)
+static inline void __set_pmds(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pmd_t *pmdp, pmd_t pmd, unsigned int nr)
+{
+ set_ptes_anysz(mm, (pte_t *)pmdp, pmd_pte(pmd), nr, PMD_SIZE);
+}
+#define set_pmd_at(mm, addr, pmdp, pmd) __set_pmds(mm, addr, pmdp, pmd, 1)
+
+static inline void __set_puds(struct mm_struct *mm,
+ unsigned long __always_unused addr,
+ pud_t *pudp, pud_t pud, unsigned int nr)
{
- page_table_check_pud_set(mm, pudp, pud);
- return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud),
- PUD_SIZE >> PAGE_SHIFT);
+ set_ptes_anysz(mm, (pte_t *)pudp, pud_pte(pud), nr, PUD_SIZE);
}
+#define set_pud_at(mm, addr, pudp, pud) __set_puds(mm, addr, pudp, pud, 1)
#define __p4d_to_phys(p4d) __pte_to_phys(p4d_pte(p4d))
#define __phys_to_p4d_val(phys) __phys_to_pte_val(phys)
@@ -1276,16 +1288,34 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
-static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
- unsigned long address, pte_t *ptep)
+static inline pte_t ptep_get_and_clear_anysz(struct mm_struct *mm, pte_t *ptep,
+ unsigned long pgsize)
{
pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
- page_table_check_pte_clear(mm, pte);
+ switch (pgsize) {
+ case PAGE_SIZE:
+ page_table_check_pte_clear(mm, pte);
+ break;
+ case PMD_SIZE:
+ page_table_check_pmd_clear(mm, pte_pmd(pte));
+ break;
+ case PUD_SIZE:
+ page_table_check_pud_clear(mm, pte_pud(pte));
+ break;
+ default:
+ VM_WARN_ON(1);
+ }
return pte;
}
+static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long address, pte_t *ptep)
+{
+ return ptep_get_and_clear_anysz(mm, ptep, PAGE_SIZE);
+}
+
static inline void __clear_full_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr, int full)
{
@@ -1322,11 +1352,7 @@ static inline pte_t __get_and_clear_full_ptes(struct mm_struct *mm,
static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long address, pmd_t *pmdp)
{
- pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
-
- page_table_check_pmd_clear(mm, pmd);
-
- return pmd;
+ return pte_pmd(ptep_get_and_clear_anysz(mm, (pte_t *)pmdp, PMD_SIZE));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.43.0
* [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (3 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-03-05 16:00 ` kernel test robot
2025-04-03 20:47 ` Catalin Marinas
2025-03-04 15:04 ` [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop Ryan Roberts
` (7 subsequent siblings)
12 siblings, 2 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Refactor the huge_pte helpers to use the new common set_ptes_anysz() and
ptep_get_and_clear_anysz() APIs.
This provides 2 benefits; First, when page_table_check=on, hugetlb is
now properly/fully checked. Previously only the first page of a hugetlb
folio was checked. Second, instead of having to call __set_ptes(nr=1)
for each pte in a loop, the whole contiguous batch can now be set in one
go, which enables some efficiencies and cleans up the code.
One detail to note is that huge_ptep_clear_flush() was previously
calling ptep_clear_flush() for a non-contiguous pte (i.e. a pud or pmd
block mapping). This has a couple of disadvantages; first
ptep_clear_flush() calls ptep_get_and_clear() which transparently
handles contpte. Given we only call it for non-contiguous ptes, it would
be safe, but a waste of effort. It's preferable to go straight to the layer
below. However, more problematic is that ptep_get_and_clear() is for
PAGE_SIZE entries so it calls page_table_check_pte_clear() and would not
clear the whole hugetlb folio. So let's stop special-casing the non-cont
case and just rely on get_clear_contig_flush() to do the right thing for
non-cont entries.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/mm/hugetlbpage.c | 52 +++++++------------------------------
1 file changed, 10 insertions(+), 42 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 065be8650aa5..efd18bd1eae3 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -159,12 +159,12 @@ static pte_t get_clear_contig(struct mm_struct *mm,
pte_t pte, tmp_pte;
bool present;
- pte = __ptep_get_and_clear(mm, addr, ptep);
+ pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
present = pte_present(pte);
while (--ncontig) {
ptep++;
addr += pgsize;
- tmp_pte = __ptep_get_and_clear(mm, addr, ptep);
+ tmp_pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
if (present) {
if (pte_dirty(tmp_pte))
pte = pte_mkdirty(pte);
@@ -208,7 +208,7 @@ static void clear_flush(struct mm_struct *mm,
unsigned long i, saddr = addr;
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
- __ptep_get_and_clear(mm, addr, ptep);
+ ptep_get_and_clear_anysz(mm, ptep, pgsize);
__flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
}
@@ -219,32 +219,20 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
size_t pgsize;
int i;
int ncontig;
- unsigned long pfn, dpfn;
- pgprot_t hugeprot;
ncontig = num_contig_ptes(sz, &pgsize);
if (!pte_present(pte)) {
for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
- __set_ptes(mm, addr, ptep, pte, 1);
+ set_ptes_anysz(mm, ptep, pte, 1, pgsize);
return;
}
- if (!pte_cont(pte)) {
- __set_ptes(mm, addr, ptep, pte, 1);
- return;
- }
-
- pfn = pte_pfn(pte);
- dpfn = pgsize >> PAGE_SHIFT;
- hugeprot = pte_pgprot(pte);
-
/* Only need to "break" if transitioning valid -> valid. */
- if (pte_valid(__ptep_get(ptep)))
+ if (pte_cont(pte) && pte_valid(__ptep_get(ptep)))
clear_flush(mm, addr, ptep, pgsize, ncontig);
- for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
+ set_ptes_anysz(mm, ptep, pte, ncontig, pgsize);
}
pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -434,11 +422,9 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep,
pte_t pte, int dirty)
{
- int ncontig, i;
+ int ncontig;
size_t pgsize = 0;
- unsigned long pfn = pte_pfn(pte), dpfn;
struct mm_struct *mm = vma->vm_mm;
- pgprot_t hugeprot;
pte_t orig_pte;
VM_WARN_ON(!pte_present(pte));
@@ -447,7 +433,6 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
- dpfn = pgsize >> PAGE_SHIFT;
if (!__cont_access_flags_changed(ptep, pte, ncontig))
return 0;
@@ -462,19 +447,14 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
if (pte_young(orig_pte))
pte = pte_mkyoung(pte);
- hugeprot = pte_pgprot(pte);
- for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
-
+ set_ptes_anysz(mm, ptep, pte, ncontig, pgsize);
return 1;
}
void huge_ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- unsigned long pfn, dpfn;
- pgprot_t hugeprot;
- int ncontig, i;
+ int ncontig;
size_t pgsize;
pte_t pte;
@@ -487,16 +467,11 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
}
ncontig = find_num_contig(mm, addr, ptep, &pgsize);
- dpfn = pgsize >> PAGE_SHIFT;
pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
pte = pte_wrprotect(pte);
- hugeprot = pte_pgprot(pte);
- pfn = pte_pfn(pte);
-
- for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
- __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
+ set_ptes_anysz(mm, ptep, pte, ncontig, pgsize);
}
pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
@@ -505,13 +480,6 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
size_t pgsize;
int ncontig;
- pte_t pte;
-
- pte = __ptep_get(ptep);
- VM_WARN_ON(!pte_present(pte));
-
- if (!pte_cont(pte))
- return ptep_clear_flush(vma, addr, ptep);
ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
return get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
--
2.43.0
* [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (4 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz() Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
2025-04-04 4:11 ` Anshuman Khandual
2025-03-04 15:04 ` [PATCH v3 07/11] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
` (6 subsequent siblings)
12 siblings, 2 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
set_ptes_anysz() previously called __set_pte() for each PTE in the
range, which would conditionally issue a DSB and ISB to make the new PTE
value immediately visible to the table walker if the new PTE was valid
and for kernel space.
We can do better than this; let's hoist those barriers out of the loop
so that they are only issued once at the end of the loop. This reduces
the barrier cost by a factor of the number of PTEs in the range (e.g.
16x for a 64K contpte block made up of 16 ptes with 4K pages).
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e255a36380dc..1898c3069c43 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -317,13 +317,11 @@ static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
WRITE_ONCE(*ptep, pte);
}
-static inline void __set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte_complete(pte_t pte)
{
- __set_pte_nosync(ptep, pte);
-
/*
* Only if the new pte is valid and kernel, otherwise TLB maintenance
- * or update_mmu_cache() have the necessary barriers.
+ * has the necessary barriers.
*/
if (pte_valid_not_user(pte)) {
dsb(ishst);
@@ -331,6 +329,12 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
}
}
+static inline void __set_pte(pte_t *ptep, pte_t pte)
+{
+ __set_pte_nosync(ptep, pte);
+ __set_pte_complete(pte);
+}
+
static inline pte_t __ptep_get(pte_t *ptep)
{
return READ_ONCE(*ptep);
@@ -647,12 +651,14 @@ static inline void set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
for (;;) {
__check_safe_pte_update(mm, ptep, pte);
- __set_pte(ptep, pte);
+ __set_pte_nosync(ptep, pte);
if (--nr == 0)
break;
ptep++;
pte = pte_advance_pfn(pte, stride);
}
+
+ __set_pte_complete(pte);
}
static inline void __set_ptes(struct mm_struct *mm,
--
2.43.0
* [PATCH v3 07/11] mm/vmalloc: Warn on improper use of vunmap_range()
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (5 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-03-27 13:05 ` Uladzislau Rezki
2025-03-04 15:04 ` [PATCH v3 08/11] mm/vmalloc: Gracefully unmap huge ptes Ryan Roberts
` (5 subsequent siblings)
12 siblings, 1 reply; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
A call to vmalloc_huge() may cause memory blocks to be mapped at pmd or
pud level. But it is possible to subsequently call vunmap_range() on a
sub-range of the mapped memory, which partially overlaps a pmd or pud.
In this case, vmalloc unmaps the entire pmd or pud so that the
non-overlapping portion is also unmapped. Clearly that would have a bad
outcome, but it's not something that any callers do today as far as I
can tell. So I guess it's just expected that callers will not do this.
However, it would be useful to know if this happened in future; let's
add a warning to cover the eventuality.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/vmalloc.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a6e7acebe9ad..fcdf67d5177a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -374,8 +374,10 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
if (cleared || pmd_bad(*pmd))
*mask |= PGTBL_PMD_MODIFIED;
- if (cleared)
+ if (cleared) {
+ WARN_ON(next - addr < PMD_SIZE);
continue;
+ }
if (pmd_none_or_clear_bad(pmd))
continue;
vunmap_pte_range(pmd, addr, next, mask);
@@ -399,8 +401,10 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
if (cleared || pud_bad(*pud))
*mask |= PGTBL_PUD_MODIFIED;
- if (cleared)
+ if (cleared) {
+ WARN_ON(next - addr < PUD_SIZE);
continue;
+ }
if (pud_none_or_clear_bad(pud))
continue;
vunmap_pmd_range(pud, addr, next, mask);
--
2.43.0
* [PATCH v3 08/11] mm/vmalloc: Gracefully unmap huge ptes
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (6 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 07/11] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 09/11] arm64/mm: Support huge pte-mapped pages in vmap Ryan Roberts
` (4 subsequent siblings)
12 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Commit f7ee1f13d606 ("mm/vmalloc: enable mapping of huge pages at pte
level in vmap") added its support by reusing the set_huge_pte_at() API,
which is otherwise only used for user mappings. But when unmapping those
huge ptes, it continued to call ptep_get_and_clear(), which is a
layering violation. To date, the only arch to implement this support is
powerpc and it all happens to work ok for it.
But arm64's implementation of ptep_get_and_clear() can not be safely
used to clear a previous set_huge_pte_at(). So let's introduce a new
arch opt-in function, arch_vmap_pte_range_unmap_size(), which can
provide the size of a (present) pte. Then we can call
huge_ptep_get_and_clear() to tear it down properly.
Note that if vunmap_range() is called with a range that starts in the
middle of a huge pte-mapped page, we must unmap the entire huge page so
the behaviour is consistent with pmd and pud block mappings. In this
case emit a warning just like we do for pmd/pud mappings.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
include/linux/vmalloc.h | 8 ++++++++
mm/vmalloc.c | 18 ++++++++++++++++--
2 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 31e9ffd936e3..16dd4cba64f2 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -113,6 +113,14 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, uns
}
#endif
+#ifndef arch_vmap_pte_range_unmap_size
+static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
+ pte_t *ptep)
+{
+ return PAGE_SIZE;
+}
+#endif
+
#ifndef arch_vmap_pte_supported_shift
static inline int arch_vmap_pte_supported_shift(unsigned long size)
{
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fcdf67d5177a..6111ce900ec4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -350,12 +350,26 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pgtbl_mod_mask *mask)
{
pte_t *pte;
+ pte_t ptent;
+ unsigned long size = PAGE_SIZE;
pte = pte_offset_kernel(pmd, addr);
do {
- pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte);
+#ifdef CONFIG_HUGETLB_PAGE
+ size = arch_vmap_pte_range_unmap_size(addr, pte);
+ if (size != PAGE_SIZE) {
+ if (WARN_ON(!IS_ALIGNED(addr, size))) {
+ addr = ALIGN_DOWN(addr, size);
+ pte = PTR_ALIGN_DOWN(pte, sizeof(*pte) * (size >> PAGE_SHIFT));
+ }
+ ptent = huge_ptep_get_and_clear(&init_mm, addr, pte, size);
+ if (WARN_ON(end - addr < size))
+ size = end - addr;
+ } else
+#endif
+ ptent = ptep_get_and_clear(&init_mm, addr, pte);
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
- } while (pte++, addr += PAGE_SIZE, addr != end);
+ } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
*mask |= PGTBL_PTE_MODIFIED;
}
--
2.43.0
* [PATCH v3 09/11] arm64/mm: Support huge pte-mapped pages in vmap
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (7 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 08/11] mm/vmalloc: Gracefully unmap huge ptes Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes Ryan Roberts
` (3 subsequent siblings)
12 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Implement the required arch functions to enable use of contpte in the
vmap when VM_ALLOW_HUGE_VMAP is specified. This speeds up vmap
operations due to only having to issue a DSB and ISB per contpte block
instead of per pte. It also reduces TLB pressure, since only a single TLB
entry is needed for the whole contpte block.
Since vmap uses set_huge_pte_at() to set the contpte, that API is now
used for kernel mappings for the first time. Although in the vmap case we
never expect it to be called to modify a valid mapping (so clear_flush()
should never be called), it's still wise to make it robust for the kernel
case, so amend the tlb flush to use the kernel variant when the mm is for
kernel space.
Tested with vmalloc performance selftests:
# kself/mm/test_vmalloc.sh \
    run_test_mask=1 \
    test_repeat_count=5 \
    nr_pages=256 \
    test_loop_count=100000 \
    use_huge=1
Duration reduced from 1274243 usec to 1083553 usec on Apple M2, a ~15%
reduction in time taken.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/vmalloc.h | 45 ++++++++++++++++++++++++++++++++
arch/arm64/mm/hugetlbpage.c | 5 +++-
2 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
index 38fafffe699f..12f534e8f3ed 100644
--- a/arch/arm64/include/asm/vmalloc.h
+++ b/arch/arm64/include/asm/vmalloc.h
@@ -23,6 +23,51 @@ static inline bool arch_vmap_pmd_supported(pgprot_t prot)
return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
}
+#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
+static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
+ unsigned long end, u64 pfn,
+ unsigned int max_page_shift)
+{
+ /*
+ * If the block is at least CONT_PTE_SIZE in size, and is naturally
+ * aligned in both virtual and physical space, then we can pte-map the
+ * block using the PTE_CONT bit for more efficient use of the TLB.
+ */
+ if (max_page_shift < CONT_PTE_SHIFT)
+ return PAGE_SIZE;
+
+ if (end - addr < CONT_PTE_SIZE)
+ return PAGE_SIZE;
+
+ if (!IS_ALIGNED(addr, CONT_PTE_SIZE))
+ return PAGE_SIZE;
+
+ if (!IS_ALIGNED(PFN_PHYS(pfn), CONT_PTE_SIZE))
+ return PAGE_SIZE;
+
+ return CONT_PTE_SIZE;
+}
+
+#define arch_vmap_pte_range_unmap_size arch_vmap_pte_range_unmap_size
+static inline unsigned long arch_vmap_pte_range_unmap_size(unsigned long addr,
+ pte_t *ptep)
+{
+ /*
+ * The caller handles alignment so it's sufficient just to check
+ * PTE_CONT.
+ */
+ return pte_valid_cont(__ptep_get(ptep)) ? CONT_PTE_SIZE : PAGE_SIZE;
+}
+
+#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift
+static inline int arch_vmap_pte_supported_shift(unsigned long size)
+{
+ if (size >= CONT_PTE_SIZE)
+ return CONT_PTE_SHIFT;
+
+ return PAGE_SHIFT;
+}
+
#endif
#define arch_vmap_pgprot_tagged arch_vmap_pgprot_tagged
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index efd18bd1eae3..c1cb13dd5e84 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -210,7 +210,10 @@ static void clear_flush(struct mm_struct *mm,
for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
ptep_get_and_clear_anysz(mm, ptep, pgsize);
- __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
+ if (mm == &init_mm)
+ flush_tlb_kernel_range(saddr, addr);
+ else
+ __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
--
2.43.0
* [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (8 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 09/11] arm64/mm: Support huge pte-mapped pages in vmap Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-03-27 13:06 ` Uladzislau Rezki
` (2 more replies)
2025-03-04 15:04 ` [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings Ryan Roberts
` (2 subsequent siblings)
12 siblings, 3 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Wrap vmalloc's pte table manipulation loops with
arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). This provides
the arch code with the opportunity to optimize the pte manipulations.
Note that vmap_pfn() already uses lazy mmu mode since it delegates to
apply_to_page_range() which enters lazy mmu mode for both user and
kernel mappings.
These hooks will shortly be used by arm64 to improve vmalloc
performance.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
mm/vmalloc.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6111ce900ec4..b63ca0b7dd40 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -104,6 +104,9 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte = pte_alloc_kernel_track(pmd, addr, mask);
if (!pte)
return -ENOMEM;
+
+ arch_enter_lazy_mmu_mode();
+
do {
if (unlikely(!pte_none(ptep_get(pte)))) {
if (pfn_valid(pfn)) {
@@ -127,6 +130,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
pfn++;
} while (pte += PFN_DOWN(size), addr += size, addr != end);
+
+ arch_leave_lazy_mmu_mode();
*mask |= PGTBL_PTE_MODIFIED;
return 0;
}
@@ -354,6 +359,8 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
unsigned long size = PAGE_SIZE;
pte = pte_offset_kernel(pmd, addr);
+ arch_enter_lazy_mmu_mode();
+
do {
#ifdef CONFIG_HUGETLB_PAGE
size = arch_vmap_pte_range_unmap_size(addr, pte);
@@ -370,6 +377,8 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
ptent = ptep_get_and_clear(&init_mm, addr, pte);
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
} while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
+
+ arch_leave_lazy_mmu_mode();
*mask |= PGTBL_PTE_MODIFIED;
}
@@ -515,6 +524,9 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
pte = pte_alloc_kernel_track(pmd, addr, mask);
if (!pte)
return -ENOMEM;
+
+ arch_enter_lazy_mmu_mode();
+
do {
struct page *page = pages[*nr];
@@ -528,6 +540,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
(*nr)++;
} while (pte++, addr += PAGE_SIZE, addr != end);
+
+ arch_leave_lazy_mmu_mode();
*mask |= PGTBL_PTE_MODIFIED;
return 0;
}
--
2.43.0
* [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (9 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes Ryan Roberts
@ 2025-03-04 15:04 ` Ryan Roberts
2025-04-04 6:02 ` Anshuman Khandual
2025-04-14 17:38 ` Catalin Marinas
2025-03-27 12:16 ` [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Uladzislau Rezki
2025-04-14 13:56 ` Ryan Roberts
12 siblings, 2 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-04 15:04 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: Ryan Roberts, linux-arm-kernel, linux-mm, linux-kernel
Because the kernel can't tolerate page faults for kernel mappings, when
setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
dsb(ishst) to ensure that the store to the pgtable is observed by the
table walker immediately. Additionally it emits an isb() to ensure that
any already speculatively determined invalid mapping fault gets
canceled.
We can improve the performance of vmalloc operations by batching these
barriers until the end of a set of entry updates.
arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() provide the
required hooks.
vmalloc() performance improves by up to 30% as a result.
Two new TIF_ flags are created: TIF_LAZY_MMU tells us if the task is in
the lazy mode and can therefore defer any barriers until exit from the
lazy mode. TIF_LAZY_MMU_PENDING is used to remember if any pte operation
was performed while in the lazy mode that required barriers. Then when
leaving lazy mode, if that flag is set, we emit the barriers.
Since arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() are used
for both user and kernel mappings, we need the second flag to avoid
emitting barriers unnecessarily if only user mappings were updated.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/include/asm/pgtable.h | 73 ++++++++++++++++++++++------
arch/arm64/include/asm/thread_info.h | 2 +
arch/arm64/kernel/process.c | 9 ++--
3 files changed, 64 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1898c3069c43..149df945c1ab 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -40,6 +40,55 @@
#include <linux/sched.h>
#include <linux/page_table_check.h>
+static inline void emit_pte_barriers(void)
+{
+ /*
+ * These barriers are emitted under certain conditions after a pte entry
+ * was modified (see e.g. __set_pte_complete()). The dsb makes the store
+ * visible to the table walker. The isb ensures that any previous
+ * speculative "invalid translation" marker that is in the CPU's
+ * pipeline gets cleared, so that any access to that address after
+ * setting the pte to valid won't cause a spurious fault. If the thread
+ * gets preempted after storing to the pgtable but before emitting these
+ * barriers, __switch_to() emits a dsb which ensure the walker gets to
+ * see the store. There is no guarrantee of an isb being issued though.
+ * This is safe because it will still get issued (albeit on a
+ * potentially different CPU) when the thread starts running again,
+ * before any access to the address.
+ */
+ dsb(ishst);
+ isb();
+}
+
+static inline void queue_pte_barriers(void)
+{
+ if (test_thread_flag(TIF_LAZY_MMU))
+ set_thread_flag(TIF_LAZY_MMU_PENDING);
+ else
+ emit_pte_barriers();
+}
+
+#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+static inline void arch_enter_lazy_mmu_mode(void)
+{
+ VM_WARN_ON(in_interrupt());
+ VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));
+
+ set_thread_flag(TIF_LAZY_MMU);
+}
+
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+ if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
+ emit_pte_barriers();
+}
+
+static inline void arch_leave_lazy_mmu_mode(void)
+{
+ arch_flush_lazy_mmu_mode();
+ clear_thread_flag(TIF_LAZY_MMU);
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
@@ -323,10 +372,8 @@ static inline void __set_pte_complete(pte_t pte)
* Only if the new pte is valid and kernel, otherwise TLB maintenance
* has the necessary barriers.
*/
- if (pte_valid_not_user(pte)) {
- dsb(ishst);
- isb();
- }
+ if (pte_valid_not_user(pte))
+ queue_pte_barriers();
}
static inline void __set_pte(pte_t *ptep, pte_t pte)
@@ -778,10 +825,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
WRITE_ONCE(*pmdp, pmd);
- if (pmd_valid(pmd)) {
- dsb(ishst);
- isb();
- }
+ if (pmd_valid(pmd))
+ queue_pte_barriers();
}
static inline void pmd_clear(pmd_t *pmdp)
@@ -845,10 +890,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
WRITE_ONCE(*pudp, pud);
- if (pud_valid(pud)) {
- dsb(ishst);
- isb();
- }
+ if (pud_valid(pud))
+ queue_pte_barriers();
}
static inline void pud_clear(pud_t *pudp)
@@ -925,8 +968,7 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
}
WRITE_ONCE(*p4dp, p4d);
- dsb(ishst);
- isb();
+ queue_pte_barriers();
}
static inline void p4d_clear(p4d_t *p4dp)
@@ -1052,8 +1094,7 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
}
WRITE_ONCE(*pgdp, pgd);
- dsb(ishst);
- isb();
+ queue_pte_barriers();
}
static inline void pgd_clear(pgd_t *pgdp)
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index 1114c1c3300a..1fdd74b7b831 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
#define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
#define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
#define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
+#define TIF_LAZY_MMU 31 /* Task in lazy mmu mode */
+#define TIF_LAZY_MMU_PENDING 32 /* Ops pending for lazy mmu mode exit */
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 42faebb7b712..45a55fe81788 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -680,10 +680,11 @@ struct task_struct *__switch_to(struct task_struct *prev,
gcs_thread_switch(next);
/*
- * Complete any pending TLB or cache maintenance on this CPU in case
- * the thread migrates to a different CPU.
- * This full barrier is also required by the membarrier system
- * call.
+ * Complete any pending TLB or cache maintenance on this CPU in case the
+ * thread migrates to a different CPU. This full barrier is also
+ * required by the membarrier system call. Additionally, it makes any
+ * in-progress pgtable writes visible to the table walker; see
+ * emit_pte_barriers().
*/
dsb(ish);
--
2.43.0
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
2025-03-04 15:04 ` [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz() Ryan Roberts
@ 2025-03-05 16:00 ` kernel test robot
2025-03-05 16:32 ` Ryan Roberts
2025-04-03 20:47 ` Catalin Marinas
1 sibling, 1 reply; 39+ messages in thread
From: kernel test robot @ 2025-03-05 16:00 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky
Cc: llvm, oe-kbuild-all, Linux Memory Management List, Ryan Roberts,
linux-arm-kernel, linux-kernel
Hi Ryan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on v6.14-rc5 next-20250305]
[cannot apply to arm64/for-next/core akpm-mm/mm-everything arm-perf/for-next/perf]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/arm64-hugetlb-Cleanup-huge_pte-size-discovery-mechanisms/20250304-230647
base: linus/master
patch link: https://lore.kernel.org/r/20250304150444.3788920-6-ryan.roberts%40arm.com
patch subject: [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
config: arm64-randconfig-003-20250305 (https://download.01.org/0day-ci/archive/20250305/202503052315.vk7m958M-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 14170b16028c087ca154878f5ed93d3089a965c6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250305/202503052315.vk7m958M-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503052315.vk7m958M-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from arch/arm64/mm/hugetlbpage.c:12:
In file included from include/linux/mm.h:2224:
include/linux/vmstat.h:504:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
504 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
505 | item];
| ~~~~
include/linux/vmstat.h:511:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
511 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~ ^
512 | NR_VM_NUMA_EVENT_ITEMS +
| ~~~~~~~~~~~~~~~~~~~~~~
>> arch/arm64/mm/hugetlbpage.c:154:23: warning: parameter 'addr' set but not used [-Wunused-but-set-parameter]
154 | unsigned long addr,
| ^
3 warnings generated.
vim +/addr +154 arch/arm64/mm/hugetlbpage.c
bc5dfb4fd7bd471 Baolin Wang 2022-05-16 144
d8bdcff2876424d Steve Capper 2017-08-22 145 /*
d8bdcff2876424d Steve Capper 2017-08-22 146 * Changing some bits of contiguous entries requires us to follow a
d8bdcff2876424d Steve Capper 2017-08-22 147 * Break-Before-Make approach, breaking the whole contiguous set
d8bdcff2876424d Steve Capper 2017-08-22 148 * before we can change any entries. See ARM DDI 0487A.k_iss10775,
d8bdcff2876424d Steve Capper 2017-08-22 149 * "Misprogramming of the Contiguous bit", page D4-1762.
d8bdcff2876424d Steve Capper 2017-08-22 150 *
d8bdcff2876424d Steve Capper 2017-08-22 151 * This helper performs the break step.
d8bdcff2876424d Steve Capper 2017-08-22 152 */
fb396bb459c1fa3 Anshuman Khandual 2022-05-10 153 static pte_t get_clear_contig(struct mm_struct *mm,
d8bdcff2876424d Steve Capper 2017-08-22 @154 unsigned long addr,
d8bdcff2876424d Steve Capper 2017-08-22 155 pte_t *ptep,
d8bdcff2876424d Steve Capper 2017-08-22 156 unsigned long pgsize,
d8bdcff2876424d Steve Capper 2017-08-22 157 unsigned long ncontig)
d8bdcff2876424d Steve Capper 2017-08-22 158 {
49c87f7677746f3 Ryan Roberts 2025-02-26 159 pte_t pte, tmp_pte;
49c87f7677746f3 Ryan Roberts 2025-02-26 160 bool present;
49c87f7677746f3 Ryan Roberts 2025-02-26 161
66251d3eadf78e2 Ryan Roberts 2025-03-04 162 pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
49c87f7677746f3 Ryan Roberts 2025-02-26 163 present = pte_present(pte);
49c87f7677746f3 Ryan Roberts 2025-02-26 164 while (--ncontig) {
49c87f7677746f3 Ryan Roberts 2025-02-26 165 ptep++;
49c87f7677746f3 Ryan Roberts 2025-02-26 166 addr += pgsize;
66251d3eadf78e2 Ryan Roberts 2025-03-04 167 tmp_pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
49c87f7677746f3 Ryan Roberts 2025-02-26 168 if (present) {
49c87f7677746f3 Ryan Roberts 2025-02-26 169 if (pte_dirty(tmp_pte))
49c87f7677746f3 Ryan Roberts 2025-02-26 170 pte = pte_mkdirty(pte);
49c87f7677746f3 Ryan Roberts 2025-02-26 171 if (pte_young(tmp_pte))
49c87f7677746f3 Ryan Roberts 2025-02-26 172 pte = pte_mkyoung(pte);
d8bdcff2876424d Steve Capper 2017-08-22 173 }
49c87f7677746f3 Ryan Roberts 2025-02-26 174 }
49c87f7677746f3 Ryan Roberts 2025-02-26 175 return pte;
d8bdcff2876424d Steve Capper 2017-08-22 176 }
d8bdcff2876424d Steve Capper 2017-08-22 177
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
2025-03-05 16:00 ` kernel test robot
@ 2025-03-05 16:32 ` Ryan Roberts
0 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-05 16:32 UTC (permalink / raw)
To: kernel test robot, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky
Cc: llvm, oe-kbuild-all, Linux Memory Management List,
linux-arm-kernel, linux-kernel
On 05/03/2025 16:00, kernel test robot wrote:
> Hi Ryan,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on linus/master]
> [also build test WARNING on v6.14-rc5 next-20250305]
> [cannot apply to arm64/for-next/core akpm-mm/mm-everything arm-perf/for-next/perf]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/arm64-hugetlb-Cleanup-huge_pte-size-discovery-mechanisms/20250304-230647
> base: linus/master
> patch link: https://lore.kernel.org/r/20250304150444.3788920-6-ryan.roberts%40arm.com
> patch subject: [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
> config: arm64-randconfig-003-20250305 (https://download.01.org/0day-ci/archive/20250305/202503052315.vk7m958M-lkp@intel.com/config)
> compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 14170b16028c087ca154878f5ed93d3089a965c6)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250305/202503052315.vk7m958M-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202503052315.vk7m958M-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
> In file included from arch/arm64/mm/hugetlbpage.c:12:
> In file included from include/linux/mm.h:2224:
> include/linux/vmstat.h:504:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
> 504 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
> | ~~~~~~~~~~~~~~~~~~~~~ ^
> 505 | item];
> | ~~~~
> include/linux/vmstat.h:511:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
> 511 | return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
> | ~~~~~~~~~~~~~~~~~~~~~ ^
> 512 | NR_VM_NUMA_EVENT_ITEMS +
> | ~~~~~~~~~~~~~~~~~~~~~~
>>> arch/arm64/mm/hugetlbpage.c:154:23: warning: parameter 'addr' set but not used [-Wunused-but-set-parameter]
> 154 | unsigned long addr,
> | ^
> 3 warnings generated.
>
>
> vim +/addr +154 arch/arm64/mm/hugetlbpage.c
>
> bc5dfb4fd7bd471 Baolin Wang 2022-05-16 144
> d8bdcff2876424d Steve Capper 2017-08-22 145 /*
> d8bdcff2876424d Steve Capper 2017-08-22 146 * Changing some bits of contiguous entries requires us to follow a
> d8bdcff2876424d Steve Capper 2017-08-22 147 * Break-Before-Make approach, breaking the whole contiguous set
> d8bdcff2876424d Steve Capper 2017-08-22 148 * before we can change any entries. See ARM DDI 0487A.k_iss10775,
> d8bdcff2876424d Steve Capper 2017-08-22 149 * "Misprogramming of the Contiguous bit", page D4-1762.
> d8bdcff2876424d Steve Capper 2017-08-22 150 *
> d8bdcff2876424d Steve Capper 2017-08-22 151 * This helper performs the break step.
> d8bdcff2876424d Steve Capper 2017-08-22 152 */
> fb396bb459c1fa3 Anshuman Khandual 2022-05-10 153 static pte_t get_clear_contig(struct mm_struct *mm,
> d8bdcff2876424d Steve Capper 2017-08-22 @154 unsigned long addr,
> d8bdcff2876424d Steve Capper 2017-08-22 155 pte_t *ptep,
> d8bdcff2876424d Steve Capper 2017-08-22 156 unsigned long pgsize,
> d8bdcff2876424d Steve Capper 2017-08-22 157 unsigned long ncontig)
> d8bdcff2876424d Steve Capper 2017-08-22 158 {
> 49c87f7677746f3 Ryan Roberts 2025-02-26 159 pte_t pte, tmp_pte;
> 49c87f7677746f3 Ryan Roberts 2025-02-26 160 bool present;
> 49c87f7677746f3 Ryan Roberts 2025-02-26 161
> 66251d3eadf78e2 Ryan Roberts 2025-03-04 162 pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
> 49c87f7677746f3 Ryan Roberts 2025-02-26 163 present = pte_present(pte);
> 49c87f7677746f3 Ryan Roberts 2025-02-26 164 while (--ncontig) {
> 49c87f7677746f3 Ryan Roberts 2025-02-26 165 ptep++;
> 49c87f7677746f3 Ryan Roberts 2025-02-26 166 addr += pgsize;
Ahh yes, thanks! Looks like this line can be removed since we no longer need the
address.
Catalin, I was optimistically hoping this might be the final version. If it is,
are you happy to fold this in? Or do you want me to re-spin regardless?
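For reference, a sketch of what the fixed-up loop might look like
(illustrative only; the actual fold-in or respin may differ):

	pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
	present = pte_present(pte);
	while (--ncontig) {
		ptep++;
		/* 'addr += pgsize' dropped; the address is no longer needed */
		tmp_pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
		if (present) {
			if (pte_dirty(tmp_pte))
				pte = pte_mkdirty(pte);
			if (pte_young(tmp_pte))
				pte = pte_mkyoung(pte);
		}
	}
	return pte;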
> 66251d3eadf78e2 Ryan Roberts 2025-03-04 167 tmp_pte = ptep_get_and_clear_anysz(mm, ptep, pgsize);
> 49c87f7677746f3 Ryan Roberts 2025-02-26 168 if (present) {
> 49c87f7677746f3 Ryan Roberts 2025-02-26 169 if (pte_dirty(tmp_pte))
> 49c87f7677746f3 Ryan Roberts 2025-02-26 170 pte = pte_mkdirty(pte);
> 49c87f7677746f3 Ryan Roberts 2025-02-26 171 if (pte_young(tmp_pte))
> 49c87f7677746f3 Ryan Roberts 2025-02-26 172 pte = pte_mkyoung(pte);
> d8bdcff2876424d Steve Capper 2017-08-22 173 }
> 49c87f7677746f3 Ryan Roberts 2025-02-26 174 }
> 49c87f7677746f3 Ryan Roberts 2025-02-26 175 return pte;
> d8bdcff2876424d Steve Capper 2017-08-22 176 }
> d8bdcff2876424d Steve Capper 2017-08-22 177
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-03-04 15:04 ` [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
@ 2025-03-06 5:08 ` kernel test robot
2025-03-06 11:54 ` Ryan Roberts
2025-04-14 16:25 ` Catalin Marinas
1 sibling, 1 reply; 39+ messages in thread
From: kernel test robot @ 2025-03-06 5:08 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky
Cc: llvm, oe-kbuild-all, Linux Memory Management List, Ryan Roberts,
linux-arm-kernel, linux-kernel
Hi Ryan,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.14-rc5 next-20250305]
[cannot apply to arm64/for-next/core akpm-mm/mm-everything arm-perf/for-next/perf]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/arm64-hugetlb-Cleanup-huge_pte-size-discovery-mechanisms/20250304-230647
base: linus/master
patch link: https://lore.kernel.org/r/20250304150444.3788920-5-ryan.roberts%40arm.com
patch subject: [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
config: arm64-randconfig-001-20250305 (https://download.01.org/0day-ci/archive/20250306/202503061237.QurSXHSC-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250306/202503061237.QurSXHSC-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503061237.QurSXHSC-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/arm64/kernel/asm-offsets.c:12:
In file included from include/linux/ftrace.h:10:
In file included from include/linux/trace_recursion.h:5:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/arm64/include/asm/hardirq.h:17:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/arm64/include/asm/io.h:12:
In file included from include/linux/pgtable.h:6:
>> arch/arm64/include/asm/pgtable.h:639:7: error: duplicate case value '536870912'
case PUD_SIZE:
^
include/asm-generic/pgtable-nopud.h:20:20: note: expanded from macro 'PUD_SIZE'
#define PUD_SIZE (1UL << PUD_SHIFT)
^
arch/arm64/include/asm/pgtable.h:636:7: note: previous case defined here
case PMD_SIZE:
^
include/asm-generic/pgtable-nopmd.h:22:20: note: expanded from macro 'PMD_SIZE'
#define PMD_SIZE (1UL << PMD_SHIFT)
^
In file included from arch/arm64/kernel/asm-offsets.c:12:
In file included from include/linux/ftrace.h:10:
In file included from include/linux/trace_recursion.h:5:
In file included from include/linux/interrupt.h:11:
In file included from include/linux/hardirq.h:11:
In file included from arch/arm64/include/asm/hardirq.h:17:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/arm64/include/asm/io.h:12:
In file included from include/linux/pgtable.h:6:
arch/arm64/include/asm/pgtable.h:1303:7: error: duplicate case value '536870912'
case PUD_SIZE:
^
include/asm-generic/pgtable-nopud.h:20:20: note: expanded from macro 'PUD_SIZE'
#define PUD_SIZE (1UL << PUD_SHIFT)
^
arch/arm64/include/asm/pgtable.h:1300:7: note: previous case defined here
case PMD_SIZE:
^
include/asm-generic/pgtable-nopmd.h:22:20: note: expanded from macro 'PMD_SIZE'
#define PMD_SIZE (1UL << PMD_SHIFT)
^
2 errors generated.
make[3]: *** [scripts/Makefile.build:102: arch/arm64/kernel/asm-offsets.s] Error 1 shuffle=4064171735
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1264: prepare0] Error 2 shuffle=4064171735
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=4064171735
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:251: __sub-make] Error 2 shuffle=4064171735
make: Target 'prepare' not remade because of errors.
vim +/536870912 +639 arch/arm64/include/asm/pgtable.h
626
627 static inline void set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
628 unsigned int nr, unsigned long pgsize)
629 {
630 unsigned long stride = pgsize >> PAGE_SHIFT;
631
632 switch (pgsize) {
633 case PAGE_SIZE:
634 page_table_check_ptes_set(mm, ptep, pte, nr);
635 break;
636 case PMD_SIZE:
637 page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
638 break;
> 639 case PUD_SIZE:
640 page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
641 break;
642 default:
643 VM_WARN_ON(1);
644 }
645
646 __sync_cache_and_tags(pte, nr * stride);
647
648 for (;;) {
649 __check_safe_pte_update(mm, ptep, pte);
650 __set_pte(ptep, pte);
651 if (--nr == 0)
652 break;
653 ptep++;
654 pte = pte_advance_pfn(pte, stride);
655 }
656 }
657
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-03-06 5:08 ` kernel test robot
@ 2025-03-06 11:54 ` Ryan Roberts
0 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-06 11:54 UTC (permalink / raw)
To: kernel test robot, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky
Cc: llvm, oe-kbuild-all, Linux Memory Management List,
linux-arm-kernel, linux-kernel
On 06/03/2025 05:08, kernel test robot wrote:
> Hi Ryan,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on linus/master]
> [also build test ERROR on v6.14-rc5 next-20250305]
> [cannot apply to arm64/for-next/core akpm-mm/mm-everything arm-perf/for-next/perf]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Ryan-Roberts/arm64-hugetlb-Cleanup-huge_pte-size-discovery-mechanisms/20250304-230647
> base: linus/master
> patch link: https://lore.kernel.org/r/20250304150444.3788920-5-ryan.roberts%40arm.com
> patch subject: [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
> config: arm64-randconfig-001-20250305 (https://download.01.org/0day-ci/archive/20250306/202503061237.QurSXHSC-lkp@intel.com/config)
> compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250306/202503061237.QurSXHSC-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202503061237.QurSXHSC-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> In file included from arch/arm64/kernel/asm-offsets.c:12:
> In file included from include/linux/ftrace.h:10:
> In file included from include/linux/trace_recursion.h:5:
> In file included from include/linux/interrupt.h:11:
> In file included from include/linux/hardirq.h:11:
> In file included from arch/arm64/include/asm/hardirq.h:17:
> In file included from include/asm-generic/hardirq.h:17:
> In file included from include/linux/irq.h:20:
> In file included from include/linux/io.h:14:
> In file included from arch/arm64/include/asm/io.h:12:
> In file included from include/linux/pgtable.h:6:
>>> arch/arm64/include/asm/pgtable.h:639:7: error: duplicate case value '536870912'
> case PUD_SIZE:
> ^
> include/asm-generic/pgtable-nopud.h:20:20: note: expanded from macro 'PUD_SIZE'
> #define PUD_SIZE (1UL << PUD_SHIFT)
> ^
> arch/arm64/include/asm/pgtable.h:636:7: note: previous case defined here
> case PMD_SIZE:
> ^
> include/asm-generic/pgtable-nopmd.h:22:20: note: expanded from macro 'PMD_SIZE'
> #define PMD_SIZE (1UL << PMD_SHIFT)
> ^
> In file included from arch/arm64/kernel/asm-offsets.c:12:
> In file included from include/linux/ftrace.h:10:
> In file included from include/linux/trace_recursion.h:5:
> In file included from include/linux/interrupt.h:11:
> In file included from include/linux/hardirq.h:11:
> In file included from arch/arm64/include/asm/hardirq.h:17:
> In file included from include/asm-generic/hardirq.h:17:
> In file included from include/linux/irq.h:20:
> In file included from include/linux/io.h:14:
> In file included from arch/arm64/include/asm/io.h:12:
> In file included from include/linux/pgtable.h:6:
> arch/arm64/include/asm/pgtable.h:1303:7: error: duplicate case value '536870912'
> case PUD_SIZE:
> ^
> include/asm-generic/pgtable-nopud.h:20:20: note: expanded from macro 'PUD_SIZE'
> #define PUD_SIZE (1UL << PUD_SHIFT)
> ^
> arch/arm64/include/asm/pgtable.h:1300:7: note: previous case defined here
> case PMD_SIZE:
> ^
> include/asm-generic/pgtable-nopmd.h:22:20: note: expanded from macro 'PMD_SIZE'
> #define PMD_SIZE (1UL << PMD_SHIFT)
> ^
> 2 errors generated.
> make[3]: *** [scripts/Makefile.build:102: arch/arm64/kernel/asm-offsets.s] Error 1 shuffle=4064171735
> make[3]: Target 'prepare' not remade because of errors.
> make[2]: *** [Makefile:1264: prepare0] Error 2 shuffle=4064171735
> make[2]: Target 'prepare' not remade because of errors.
> make[1]: *** [Makefile:251: __sub-make] Error 2 shuffle=4064171735
> make[1]: Target 'prepare' not remade because of errors.
> make: *** [Makefile:251: __sub-make] Error 2 shuffle=4064171735
> make: Target 'prepare' not remade because of errors.
>
>
> vim +/536870912 +639 arch/arm64/include/asm/pgtable.h
>
> 626
> 627 static inline void set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> 628 unsigned int nr, unsigned long pgsize)
> 629 {
> 630 unsigned long stride = pgsize >> PAGE_SHIFT;
> 631
> 632 switch (pgsize) {
> 633 case PAGE_SIZE:
> 634 page_table_check_ptes_set(mm, ptep, pte, nr);
> 635 break;
> 636 case PMD_SIZE:
> 637 page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
> 638 break;
> > 639 case PUD_SIZE:
> 640 page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
> 641 break;
Looks like this needs to be wrapped in `#ifndef __PAGETABLE_PMD_FOLDED`. This
failing config folds the PMD so PMD_SIZE and PUD_SIZE are the same.
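A minimal sketch of one way the guard could look (illustrative; the respin
may differ):

	switch (pgsize) {
	case PAGE_SIZE:
		page_table_check_ptes_set(mm, ptep, pte, nr);
		break;
	case PMD_SIZE:
		page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
		break;
#ifndef __PAGETABLE_PMD_FOLDED
	case PUD_SIZE:
		page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
		break;
#endif
	default:
		VM_WARN_ON(1);
	}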
Given there are now 2 kernel robot reports, I'll respin the series next week,
giving time for any interim review comments.
Thanks,
Ryan
> 642 default:
> 643 VM_WARN_ON(1);
> 644 }
> 645
> 646 __sync_cache_and_tags(pte, nr * stride);
> 647
> 648 for (;;) {
> 649 __check_safe_pte_update(mm, ptep, pte);
> 650 __set_pte(ptep, pte);
> 651 if (--nr == 0)
> 652 break;
> 653 ptep++;
> 654 pte = pte_advance_pfn(pte, stride);
> 655 }
> 656 }
> 657
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes
2025-03-04 15:04 ` [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
@ 2025-03-26 14:48 ` Pasha Tatashin
2025-03-26 14:54 ` Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
1 sibling, 1 reply; 39+ messages in thread
From: Pasha Tatashin @ 2025-03-26 14:48 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
> -void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud)
> +void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
> + unsigned int nr)
> {
> + unsigned int i;
> + unsigned long stride = PUD_SIZE >> PAGE_SHIFT;
nit: please order declarations from longest to shortest; it usually
helps with readability (here and in the pmd variant).
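For illustration, the reordered declarations would then read:

	unsigned long stride = PUD_SIZE >> PAGE_SHIFT;
	unsigned int i;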
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Pasha
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes
2025-03-26 14:48 ` Pasha Tatashin
@ 2025-03-26 14:54 ` Ryan Roberts
0 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-26 14:54 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Catalin Marinas, Will Deacon, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On 26/03/2025 10:48, Pasha Tatashin wrote:
>> -void __page_table_check_pud_set(struct mm_struct *mm, pud_t *pudp, pud_t pud)
>> +void __page_table_check_puds_set(struct mm_struct *mm, pud_t *pudp, pud_t pud,
>> + unsigned int nr)
>> {
>> + unsigned int i;
>> + unsigned long stride = PUD_SIZE >> PAGE_SHIFT;
>
> nit: please order declarations from longest to shortest, it usually
> helps with readability. (here and in pmd)
Noted, I'll fix this.
>
> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Thanks!
>
> Pasha
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (10 preceding siblings ...)
2025-03-04 15:04 ` [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings Ryan Roberts
@ 2025-03-27 12:16 ` Uladzislau Rezki
2025-03-27 13:46 ` Ryan Roberts
2025-04-14 13:56 ` Ryan Roberts
12 siblings, 1 reply; 39+ messages in thread
From: Uladzislau Rezki @ 2025-03-27 12:16 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky, linux-arm-kernel, linux-mm,
linux-kernel
On Tue, Mar 04, 2025 at 03:04:30PM +0000, Ryan Roberts wrote:
> Hi All,
>
> This is v3 of a series to improve performance for hugetlb and vmalloc on arm64.
> Although some of these patches are core-mm, advice from Andrew was to go via the
> arm64 tree. Hopefully I can get some ACKs from mm folks.
>
> The 2 key performance improvements are 1) enabling the use of contpte-mapped
> blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
> were already hooks for this (used by powerpc) but they required some tidying and
> extending for arm64. And 2) batching up barriers when modifying the vmalloc
> address space for upto 30% reduction in time taken in vmalloc().
>
> vmalloc() performance was measured using the test_vmalloc.ko module. Tested on
> Apple M2 and Ampere Altra. Each test had loop count set to 500000 and the whole
> test was repeated 10 times.
>
I will have a look and review just give me some time :)
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 07/11] mm/vmalloc: Warn on improper use of vunmap_range()
2025-03-04 15:04 ` [PATCH v3 07/11] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
@ 2025-03-27 13:05 ` Uladzislau Rezki
0 siblings, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2025-03-27 13:05 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky, linux-arm-kernel, linux-mm,
linux-kernel
On Tue, Mar 04, 2025 at 03:04:37PM +0000, Ryan Roberts wrote:
> A call to vmalloc_huge() may cause memory blocks to be mapped at pmd or
> pud level. But it is possible to subsequently call vunmap_range() on a
> sub-range of the mapped memory, which partially overlaps a pmd or pud.
> In this case, vmalloc unmaps the entire pmd or pud so that the
> non-overlapping portion is also unmapped. Clearly that would have a bad
> outcome, but it's not something that any callers do today as far as I
> can tell. So I guess it's just expected that callers will not do this.
>
> However, it would be useful to know if this happened in future; let's
> add a warning to cover the eventuality.
>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/vmalloc.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index a6e7acebe9ad..fcdf67d5177a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -374,8 +374,10 @@ static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> if (cleared || pmd_bad(*pmd))
> *mask |= PGTBL_PMD_MODIFIED;
>
> - if (cleared)
> + if (cleared) {
> + WARN_ON(next - addr < PMD_SIZE);
> continue;
> + }
> if (pmd_none_or_clear_bad(pmd))
> continue;
> vunmap_pte_range(pmd, addr, next, mask);
> @@ -399,8 +401,10 @@ static void vunmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
> if (cleared || pud_bad(*pud))
> *mask |= PGTBL_PUD_MODIFIED;
>
> - if (cleared)
> + if (cleared) {
> + WARN_ON(next - addr < PUD_SIZE);
> continue;
> + }
> if (pud_none_or_clear_bad(pud))
> continue;
> vunmap_pmd_range(pud, addr, next, mask);
> --
> 2.43.0
>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
2025-03-04 15:04 ` [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes Ryan Roberts
@ 2025-03-27 13:06 ` Uladzislau Rezki
2025-04-03 20:47 ` Catalin Marinas
2025-04-04 4:54 ` Anshuman Khandual
2 siblings, 0 replies; 39+ messages in thread
From: Uladzislau Rezki @ 2025-03-27 13:06 UTC (permalink / raw)
To: Ryan Roberts
Cc: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky, linux-arm-kernel, linux-mm,
linux-kernel
On Tue, Mar 04, 2025 at 03:04:40PM +0000, Ryan Roberts wrote:
> Wrap vmalloc's pte table manipulation loops with
> arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). This provides
> the arch code with the opportunity to optimize the pte manipulations.
>
> Note that vmap_pfn() already uses lazy mmu mode since it delegates to
> apply_to_page_range() which enters lazy mmu mode for both user and
> kernel mappings.
>
> These hooks will shortly be used by arm64 to improve vmalloc
> performance.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/vmalloc.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6111ce900ec4..b63ca0b7dd40 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -104,6 +104,9 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte = pte_alloc_kernel_track(pmd, addr, mask);
> if (!pte)
> return -ENOMEM;
> +
> + arch_enter_lazy_mmu_mode();
> +
> do {
> if (unlikely(!pte_none(ptep_get(pte)))) {
> if (pfn_valid(pfn)) {
> @@ -127,6 +130,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> pfn++;
> } while (pte += PFN_DOWN(size), addr += size, addr != end);
> +
> + arch_leave_lazy_mmu_mode();
> *mask |= PGTBL_PTE_MODIFIED;
> return 0;
> }
> @@ -354,6 +359,8 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> unsigned long size = PAGE_SIZE;
>
> pte = pte_offset_kernel(pmd, addr);
> + arch_enter_lazy_mmu_mode();
> +
> do {
> #ifdef CONFIG_HUGETLB_PAGE
> size = arch_vmap_pte_range_unmap_size(addr, pte);
> @@ -370,6 +377,8 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> ptent = ptep_get_and_clear(&init_mm, addr, pte);
> WARN_ON(!pte_none(ptent) && !pte_present(ptent));
> } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
> +
> + arch_leave_lazy_mmu_mode();
> *mask |= PGTBL_PTE_MODIFIED;
> }
>
> @@ -515,6 +524,9 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> pte = pte_alloc_kernel_track(pmd, addr, mask);
> if (!pte)
> return -ENOMEM;
> +
> + arch_enter_lazy_mmu_mode();
> +
> do {
> struct page *page = pages[*nr];
>
> @@ -528,6 +540,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> (*nr)++;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + arch_leave_lazy_mmu_mode();
> *mask |= PGTBL_PTE_MODIFIED;
> return 0;
> }
> --
> 2.43.0
>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64
2025-03-27 12:16 ` [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Uladzislau Rezki
@ 2025-03-27 13:46 ` Ryan Roberts
0 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-03-27 13:46 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On 27/03/2025 08:16, Uladzislau Rezki wrote:
> On Tue, Mar 04, 2025 at 03:04:30PM +0000, Ryan Roberts wrote:
>> Hi All,
>>
>> This is v3 of a series to improve performance for hugetlb and vmalloc on arm64.
>> Although some of these patches are core-mm, advice from Andrew was to go via the
>> arm64 tree. Hopefully I can get some ACKs from mm folks.
>>
>> The 2 key performance improvements are 1) enabling the use of contpte-mapped
>> blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
>> were already hooks for this (used by powerpc) but they required some tidying and
>> extending for arm64. And 2) batching up barriers when modifying the vmalloc
>> address space for upto 30% reduction in time taken in vmalloc().
>>
>> vmalloc() performance was measured using the test_vmalloc.ko module. Tested on
>> Apple M2 and Ampere Altra. Each test had loop count set to 500000 and the whole
>> test was repeated 10 times.
>>
> I will have a look and review just give me some time :)
Thanks for the reviews - appreciate it!
>
> --
> Uladzislau Rezki
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
2025-03-04 15:04 ` [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms Ryan Roberts
@ 2025-04-03 20:46 ` Catalin Marinas
2025-04-04 3:03 ` Anshuman Khandual
1 sibling, 0 replies; 39+ messages in thread
From: Catalin Marinas @ 2025-04-03 20:46 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:31PM +0000, Ryan Roberts wrote:
> Not all huge_pte helper APIs explicitly provide the size of the
> huge_pte. So the helpers have to depend on various methods to determine
> the size of the huge_pte. Some of these methods are dubious.
>
> Let's clean up the code to use preferred methods and retire the dubious
> ones. The options in order of preference:
>
> - If size is provided as parameter, use it together with
> num_contig_ptes(). This is explicit and works for both present and
> non-present ptes.
>
> - If vma is provided as a parameter, retrieve size via
> huge_page_size(hstate_vma(vma)) and use it together with
> num_contig_ptes(). This is explicit and works for both present and
> non-present ptes.
>
> - If the pte is present and contiguous, use find_num_contig() to walk
> the pgtable to find the level and infer the number of ptes from
> level. Only works for *present* ptes.
>
> - If the pte is present and not contiguous, you can infer from this
> that only 1 pte needs to be operated on. This is ok if you don't care
> about the absolute size, and just want to know the number of ptes.
>
> - NEVER rely on resolving the PFN of a present pte to a folio and
> getting the folio's size. This is fragile at best, because there is
> nothing to stop the core-mm from allocating a folio twice as big as
> the huge_pte then mapping it across 2 consecutive huge_ptes. Or just
> partially mapping it.
>
> Where we require that the pte is present, add warnings if not-present.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes
2025-03-04 15:04 ` [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
2025-03-26 14:48 ` Pasha Tatashin
@ 2025-04-03 20:46 ` Catalin Marinas
1 sibling, 0 replies; 39+ messages in thread
From: Catalin Marinas @ 2025-04-03 20:46 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:33PM +0000, Ryan Roberts wrote:
> Convert page_table_check_p[mu]d_set(...) to
> page_table_check_p[mu]ds_set(..., nr) to allow checking a contiguous set
> of pmds/puds in single batch. We retain page_table_check_p[mu]d_set(...)
> as macros that call new batch functions with nr=1 for compatibility.
>
> arm64 is about to reorganise its pte/pmd/pud helpers to reuse more code
> and to allow the implementation for huge_pte to more efficiently set
> ptes/pmds/puds in batches. We need these batch-helpers to make the
> refactoring possible.
>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop
2025-03-04 15:04 ` [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop Ryan Roberts
@ 2025-04-03 20:46 ` Catalin Marinas
2025-04-04 4:11 ` Anshuman Khandual
1 sibling, 0 replies; 39+ messages in thread
From: Catalin Marinas @ 2025-04-03 20:46 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:36PM +0000, Ryan Roberts wrote:
> set_ptes_anysz() previously called __set_pte() for each PTE in the
> range, which would conditionally issue a DSB and ISB to make the new PTE
> value immediately visible to the table walker if the new PTE was valid
> and for kernel space.
>
> We can do better than this; let's hoist those barriers out of the loop
> so that they are only issued once at the end of the loop. We then reduce
> the cost by the number of PTEs in the range.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e255a36380dc..1898c3069c43 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -317,13 +317,11 @@ static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
> WRITE_ONCE(*ptep, pte);
> }
>
> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> +static inline void __set_pte_complete(pte_t pte)
> {
> - __set_pte_nosync(ptep, pte);
> -
> /*
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> - * or update_mmu_cache() have the necessary barriers.
> + * has the necessary barriers.
Thanks for removing the stale comment.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope
2025-03-04 15:04 ` [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
@ 2025-04-03 20:47 ` Catalin Marinas
2025-04-04 3:50 ` Anshuman Khandual
1 sibling, 0 replies; 39+ messages in thread
From: Catalin Marinas @ 2025-04-03 20:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:32PM +0000, Ryan Roberts wrote:
> When operating on contiguous blocks of ptes (or pmds) for some hugetlb
> sizes, we must honour break-before-make requirements and clear down the
> block to invalid state in the pgtable then invalidate the relevant tlb
> entries before making the pgtable entries valid again.
>
> However, the tlb maintenance is currently always done assuming the worst
> case stride (PAGE_SIZE), last_level (false) and tlb_level
> (TLBI_TTL_UNKNOWN). We can do much better with the hinting; in reality,
> we know the stride from the huge_pte pgsize, we are always operating
> only on the last level, and we always know the tlb_level, again based on
> pgsize. So let's start providing these hints.
>
> Additionally, avoid tlb maintenance in set_huge_pte_at().
> Break-before-make is only required if we are transitioning the
> contiguous pte block from valid -> valid. So let's elide the
> clear-and-flush ("break") if the pte range was previously invalid.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
2025-03-04 15:04 ` [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz() Ryan Roberts
2025-03-05 16:00 ` kernel test robot
@ 2025-04-03 20:47 ` Catalin Marinas
1 sibling, 0 replies; 39+ messages in thread
From: Catalin Marinas @ 2025-04-03 20:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:35PM +0000, Ryan Roberts wrote:
> Refactor the huge_pte helpers to use the new common set_ptes_anysz() and
> ptep_get_and_clear_anysz() APIs.
Nitpick: maybe add underscores to this new API to suggest it's private.
Up to you.
> This provides 2 benefits; First, when page_table_check=on, hugetlb is
> now properly/fully checked. Previously only the first page of a hugetlb
> folio was checked. Second, instead of having to call __set_ptes(nr=1)
> for each pte in a loop, the whole contiguous batch can now be set in one
> go, which enables some efficiencies and cleans up the code.
>
> One detail to note is that huge_ptep_clear_flush() was previously
> calling ptep_clear_flush() for a non-contiguous pte (i.e. a pud or pmd
> block mapping). This has a couple of disadvantages; first
> ptep_clear_flush() calls ptep_get_and_clear() which transparently
> handles contpte. Given we only call it for non-contiguous ptes, it would be
> safe, but a waste of effort. It's preferable to go straight to the layer
> below. However, more problematic is that ptep_get_and_clear() is for
> PAGE_SIZE entries so it calls page_table_check_pte_clear() and would not
> clear the whole hugetlb folio. So let's stop special-casing the non-cont
> case and just rely on get_clear_contig_flush() to do the right thing for
> non-cont entries.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
2025-03-04 15:04 ` [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes Ryan Roberts
2025-03-27 13:06 ` Uladzislau Rezki
@ 2025-04-03 20:47 ` Catalin Marinas
2025-04-04 4:54 ` Anshuman Khandual
2 siblings, 0 replies; 39+ messages in thread
From: Catalin Marinas @ 2025-04-03 20:47 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:40PM +0000, Ryan Roberts wrote:
> Wrap vmalloc's pte table manipulation loops with
> arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). This provides
> the arch code with the opportunity to optimize the pte manipulations.
>
> Note that vmap_pfn() already uses lazy mmu mode since it delegates to
> apply_to_page_range() which enters lazy mmu mode for both user and
> kernel mappings.
>
> These hooks will shortly be used by arm64 to improve vmalloc
> performance.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
2025-03-04 15:04 ` [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
@ 2025-04-04 3:03 ` Anshuman Khandual
1 sibling, 0 replies; 39+ messages in thread
From: Anshuman Khandual @ 2025-04-04 3:03 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 3/4/25 20:34, Ryan Roberts wrote:
> Not all huge_pte helper APIs explicitly provide the size of the
> huge_pte. So the helpers have to depend on various methods to determine
> the size of the huge_pte. Some of these methods are dubious.
>
> Let's clean up the code to use preferred methods and retire the dubious
> ones. The options in order of preference:
>
> - If size is provided as parameter, use it together with
> num_contig_ptes(). This is explicit and works for both present and
> non-present ptes.
>
> - If vma is provided as a parameter, retrieve size via
> huge_page_size(hstate_vma(vma)) and use it together with
> num_contig_ptes(). This is explicit and works for both present and
> non-present ptes.
>
> - If the pte is present and contiguous, use find_num_contig() to walk
> the pgtable to find the level and infer the number of ptes from
> level. Only works for *present* ptes.
>
> - If the pte is present and not contiguous, you can infer from this
> that only 1 pte needs to be operated on. This is ok if you don't care
> about the absolute size, and just want to know the number of ptes.
>
> - NEVER rely on resolving the PFN of a present pte to a folio and
> getting the folio's size. This is fragile at best, because there is
> nothing to stop the core-mm from allocating a folio twice as big as
> the huge_pte then mapping it across 2 consecutive huge_ptes. Or just
> partially mapping it.
>
> Where we require that the pte is present, add warnings if not-present.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/mm/hugetlbpage.c | 20 +++++++++++++++-----
> 1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index b3a7fafe8892..6a2af9fb2566 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -129,7 +129,7 @@ pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> return orig_pte;
>
> - ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
> + ncontig = find_num_contig(mm, addr, ptep, &pgsize);
> for (i = 0; i < ncontig; i++, ptep++) {
> pte_t pte = __ptep_get(ptep);
>
> @@ -438,16 +438,19 @@ int huge_ptep_set_access_flags(struct vm_area_struct *vma,
> pgprot_t hugeprot;
> pte_t orig_pte;
>
> + VM_WARN_ON(!pte_present(pte));
> +
> if (!pte_cont(pte))
> return __ptep_set_access_flags(vma, addr, ptep, pte, dirty);
>
> - ncontig = find_num_contig(mm, addr, ptep, &pgsize);
> + ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
> dpfn = pgsize >> PAGE_SHIFT;
>
> if (!__cont_access_flags_changed(ptep, pte, ncontig))
> return 0;
>
> orig_pte = get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
> + VM_WARN_ON(!pte_present(orig_pte));
>
> /* Make sure we don't lose the dirty or young state */
> if (pte_dirty(orig_pte))
> @@ -472,7 +475,10 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
> size_t pgsize;
> pte_t pte;
>
> - if (!pte_cont(__ptep_get(ptep))) {
> + pte = __ptep_get(ptep);
> + VM_WARN_ON(!pte_present(pte));
> +
> + if (!pte_cont(pte)) {
> __ptep_set_wrprotect(mm, addr, ptep);
> return;
> }
> @@ -496,11 +502,15 @@ pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> size_t pgsize;
> int ncontig;
> + pte_t pte;
> +
> + pte = __ptep_get(ptep);
> + VM_WARN_ON(!pte_present(pte));
>
> - if (!pte_cont(__ptep_get(ptep)))
> + if (!pte_cont(pte))
> return ptep_clear_flush(vma, addr, ptep);
>
> - ncontig = find_num_contig(mm, addr, ptep, &pgsize);
> + ncontig = num_contig_ptes(huge_page_size(hstate_vma(vma)), &pgsize);
> return get_clear_contig_flush(mm, addr, ptep, pgsize, ncontig);
> }
>
Should a comment be added before each of the VM_WARN_ON() calls, explaining
why the page table entries need to be present before their contiguous
attribute is checked and find_num_contig() is subsequently called?
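For illustration, such a comment might read something like (sketch only):

	/* pte_cont() is only meaningful for present entries */
	VM_WARN_ON(!pte_present(pte));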
Regardless, LGTM.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope
2025-03-04 15:04 ` [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
2025-04-03 20:47 ` Catalin Marinas
@ 2025-04-04 3:50 ` Anshuman Khandual
1 sibling, 0 replies; 39+ messages in thread
From: Anshuman Khandual @ 2025-04-04 3:50 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 3/4/25 20:34, Ryan Roberts wrote:
> When operating on contiguous blocks of ptes (or pmds) for some hugetlb
> sizes, we must honour break-before-make requirements and clear down the
> block to invalid state in the pgtable then invalidate the relevant tlb
> entries before making the pgtable entries valid again.
>
> However, the tlb maintenance is currently always done assuming the worst
> case stride (PAGE_SIZE), last_level (false) and tlb_level
> (TLBI_TTL_UNKNOWN). We can do much better with the hinting; in reality,
> we know the stride from the huge_pte pgsize, we are always operating
> only on the last level, and we always know the tlb_level, again based on
> pgsize. So let's start providing these hints.
>
> Additionally, avoid tlb maintenance in set_huge_pte_at().
> Break-before-make is only required if we are transitioning the
> contiguous pte block from valid -> valid. So let's elide the
> clear-and-flush ("break") if the pte range was previously invalid.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/hugetlb.h | 29 +++++++++++++++++++----------
> arch/arm64/mm/hugetlbpage.c | 9 ++++++---
> 2 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
> index 07fbf5bf85a7..2a8155c4a882 100644
> --- a/arch/arm64/include/asm/hugetlb.h
> +++ b/arch/arm64/include/asm/hugetlb.h
> @@ -69,29 +69,38 @@ extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
>
> #include <asm-generic/hugetlb.h>
>
> -#define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
> -static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
> - unsigned long start,
> - unsigned long end)
> +static inline void __flush_hugetlb_tlb_range(struct vm_area_struct *vma,
> + unsigned long start,
> + unsigned long end,
> + unsigned long stride,
> + bool last_level)
> {
> - unsigned long stride = huge_page_size(hstate_vma(vma));
> -
> switch (stride) {
> #ifndef __PAGETABLE_PMD_FOLDED
> case PUD_SIZE:
> - __flush_tlb_range(vma, start, end, PUD_SIZE, false, 1);
> + __flush_tlb_range(vma, start, end, PUD_SIZE, last_level, 1);
> break;
> #endif
> case CONT_PMD_SIZE:
> case PMD_SIZE:
> - __flush_tlb_range(vma, start, end, PMD_SIZE, false, 2);
> + __flush_tlb_range(vma, start, end, PMD_SIZE, last_level, 2);
> break;
> case CONT_PTE_SIZE:
> - __flush_tlb_range(vma, start, end, PAGE_SIZE, false, 3);
> + __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, 3);
> break;
> default:
> - __flush_tlb_range(vma, start, end, PAGE_SIZE, false, TLBI_TTL_UNKNOWN);
> + __flush_tlb_range(vma, start, end, PAGE_SIZE, last_level, TLBI_TTL_UNKNOWN);
> }
> }
>
> +#define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
> +static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
> + unsigned long start,
> + unsigned long end)
> +{
> + unsigned long stride = huge_page_size(hstate_vma(vma));
> +
> + __flush_hugetlb_tlb_range(vma, start, end, stride, false);
> +}
> +
> #endif /* __ASM_HUGETLB_H */
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 6a2af9fb2566..065be8650aa5 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -183,8 +183,9 @@ static pte_t get_clear_contig_flush(struct mm_struct *mm,
> {
> pte_t orig_pte = get_clear_contig(mm, addr, ptep, pgsize, ncontig);
> struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> + unsigned long end = addr + (pgsize * ncontig);
>
> - flush_tlb_range(&vma, addr, addr + (pgsize * ncontig));
> + __flush_hugetlb_tlb_range(&vma, addr, end, pgsize, true);
> return orig_pte;
> }
>
> @@ -209,7 +210,7 @@ static void clear_flush(struct mm_struct *mm,
> for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
> __ptep_get_and_clear(mm, addr, ptep);
>
> - flush_tlb_range(&vma, saddr, addr);
> + __flush_hugetlb_tlb_range(&vma, saddr, addr, pgsize, true);
> }
>
> void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> @@ -238,7 +239,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> dpfn = pgsize >> PAGE_SHIFT;
> hugeprot = pte_pgprot(pte);
>
> - clear_flush(mm, addr, ptep, pgsize, ncontig);
> + /* Only need to "break" if transitioning valid -> valid. */
> + if (pte_valid(__ptep_get(ptep)))
> + clear_flush(mm, addr, ptep, pgsize, ncontig);
>
> for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
> __set_ptes(mm, addr, ptep, pfn_pte(pfn, hugeprot), 1);
LGTM
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
* Re: [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop
2025-03-04 15:04 ` [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
@ 2025-04-04 4:11 ` Anshuman Khandual
1 sibling, 0 replies; 39+ messages in thread
From: Anshuman Khandual @ 2025-04-04 4:11 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 3/4/25 20:34, Ryan Roberts wrote:
> set_ptes_anysz() previously called __set_pte() for each PTE in the
> range, which would conditionally issue a DSB and ISB to make the new PTE
> value immediately visible to the table walker if the new PTE was valid
> and for kernel space.
>
> We can do better than this; let's hoist those barriers out of the loop
> so that they are only issued once at the end of the loop. We then reduce
> the cost by the number of PTEs in the range.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e255a36380dc..1898c3069c43 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -317,13 +317,11 @@ static inline void __set_pte_nosync(pte_t *ptep, pte_t pte)
> WRITE_ONCE(*ptep, pte);
> }
>
> -static inline void __set_pte(pte_t *ptep, pte_t pte)
> +static inline void __set_pte_complete(pte_t pte)
> {
> - __set_pte_nosync(ptep, pte);
> -
> /*
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> - * or update_mmu_cache() have the necessary barriers.
> + * has the necessary barriers.
> */
> if (pte_valid_not_user(pte)) {
> dsb(ishst);
> @@ -331,6 +329,12 @@ static inline void __set_pte(pte_t *ptep, pte_t pte)
> }
> }
>
> +static inline void __set_pte(pte_t *ptep, pte_t pte)
> +{
> + __set_pte_nosync(ptep, pte);
> + __set_pte_complete(pte);
> +}
> +
> static inline pte_t __ptep_get(pte_t *ptep)
> {
> return READ_ONCE(*ptep);
> @@ -647,12 +651,14 @@ static inline void set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
>
> for (;;) {
> __check_safe_pte_update(mm, ptep, pte);
> - __set_pte(ptep, pte);
> + __set_pte_nosync(ptep, pte);
> if (--nr == 0)
> break;
> ptep++;
> pte = pte_advance_pfn(pte, stride);
> }
> +
> + __set_pte_complete(pte);
> }
>
> static inline void __set_ptes(struct mm_struct *mm,
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
* Re: [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
2025-03-04 15:04 ` [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes Ryan Roberts
2025-03-27 13:06 ` Uladzislau Rezki
2025-04-03 20:47 ` Catalin Marinas
@ 2025-04-04 4:54 ` Anshuman Khandual
2 siblings, 0 replies; 39+ messages in thread
From: Anshuman Khandual @ 2025-04-04 4:54 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 3/4/25 20:34, Ryan Roberts wrote:
> Wrap vmalloc's pte table manipulation loops with
> arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode(). This provides
> the arch code with the opportunity to optimize the pte manipulations.
>
> Note that vmap_pfn() already uses lazy mmu mode since it delegates to
> apply_to_page_range() which enters lazy mmu mode for both user and
> kernel mappings.
>
> These hooks will shortly be used by arm64 to improve vmalloc
> performance.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/vmalloc.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6111ce900ec4..b63ca0b7dd40 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -104,6 +104,9 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte = pte_alloc_kernel_track(pmd, addr, mask);
> if (!pte)
> return -ENOMEM;
> +
> + arch_enter_lazy_mmu_mode();
> +
> do {
> if (unlikely(!pte_none(ptep_get(pte)))) {
> if (pfn_valid(pfn)) {
> @@ -127,6 +130,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot));
> pfn++;
> } while (pte += PFN_DOWN(size), addr += size, addr != end);
> +
> + arch_leave_lazy_mmu_mode();
> *mask |= PGTBL_PTE_MODIFIED;
> return 0;
> }
> @@ -354,6 +359,8 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> unsigned long size = PAGE_SIZE;
>
> pte = pte_offset_kernel(pmd, addr);
> + arch_enter_lazy_mmu_mode();
> +
> do {
> #ifdef CONFIG_HUGETLB_PAGE
> size = arch_vmap_pte_range_unmap_size(addr, pte);
> @@ -370,6 +377,8 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> ptent = ptep_get_and_clear(&init_mm, addr, pte);
> WARN_ON(!pte_none(ptent) && !pte_present(ptent));
> } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
> +
> + arch_leave_lazy_mmu_mode();
> *mask |= PGTBL_PTE_MODIFIED;
> }
>
> @@ -515,6 +524,9 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> pte = pte_alloc_kernel_track(pmd, addr, mask);
> if (!pte)
> return -ENOMEM;
> +
> + arch_enter_lazy_mmu_mode();
> +
> do {
> struct page *page = pages[*nr];
>
> @@ -528,6 +540,8 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> set_pte_at(&init_mm, addr, pte, mk_pte(page, prot));
> (*nr)++;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + arch_leave_lazy_mmu_mode();
> *mask |= PGTBL_PTE_MODIFIED;
> return 0;
> }
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
* Re: [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings
2025-03-04 15:04 ` [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings Ryan Roberts
@ 2025-04-04 6:02 ` Anshuman Khandual
2025-04-14 17:38 ` Catalin Marinas
1 sibling, 0 replies; 39+ messages in thread
From: Anshuman Khandual @ 2025-04-04 6:02 UTC (permalink / raw)
To: Ryan Roberts, Catalin Marinas, Will Deacon, Pasha Tatashin,
Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
David Hildenbrand, Matthew Wilcox (Oracle), Mark Rutland,
Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
On 3/4/25 20:34, Ryan Roberts wrote:
> Because the kernel can't tolerate page faults for kernel mappings, when
> setting a valid, kernel space pte (or pmd/pud/p4d/pgd), it emits a
> dsb(ishst) to ensure that the store to the pgtable is observed by the
> table walker immediately. Additionally it emits an isb() to ensure that
> any already speculatively determined invalid mapping fault gets
> canceled.
>
> We can improve the performance of vmalloc operations by batching these
> barriers until the end of a set of entry updates.
> arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() provide the
> required hooks.
>
> vmalloc improves by up to 30% as a result.
>
> Two new TIF_ flags are created; TIF_LAZY_MMU tells us if the task is in
> the lazy mode and can therefore defer any barriers until exit from the
> lazy mode. TIF_LAZY_MMU_PENDING is used to remember if any pte operation
> was performed while in the lazy mode that required barriers. Then when
> leaving lazy mode, if that flag is set, we emit the barriers.
>
> Since arch_enter_lazy_mmu_mode() and arch_leave_lazy_mmu_mode() are used
> for both user and kernel mappings, we need the second flag to avoid
> emitting barriers unnecessarily if only user mappings were updated.
Agreed, and hence the additional TIF flag, i.e. TIF_LAZY_MMU_PENDING, is
justified.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 73 ++++++++++++++++++++++------
> arch/arm64/include/asm/thread_info.h | 2 +
> arch/arm64/kernel/process.c | 9 ++--
> 3 files changed, 64 insertions(+), 20 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 1898c3069c43..149df945c1ab 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -40,6 +40,55 @@
> #include <linux/sched.h>
> #include <linux/page_table_check.h>
>
> +static inline void emit_pte_barriers(void)
> +{
> + /*
> + * These barriers are emitted under certain conditions after a pte entry
> + * was modified (see e.g. __set_pte_complete()). The dsb makes the store
> + * visible to the table walker. The isb ensures that any previous
> + * speculative "invalid translation" marker that is in the CPU's
> + * pipeline gets cleared, so that any access to that address after
> + * setting the pte to valid won't cause a spurious fault. If the thread
> + * gets preempted after storing to the pgtable but before emitting these
> + * barriers, __switch_to() emits a dsb which ensure the walker gets to
> + * see the store. There is no guarrantee of an isb being issued though.
typo ^^^^^^^^
> + * This is safe because it will still get issued (albeit on a
> + * potentially different CPU) when the thread starts running again,
> + * before any access to the address.
> + */
> + dsb(ishst);
> + isb();
> +}
> +
> +static inline void queue_pte_barriers(void)
> +{
> + if (test_thread_flag(TIF_LAZY_MMU))
> + set_thread_flag(TIF_LAZY_MMU_PENDING);
> + else
> + emit_pte_barriers();
> +}
> +
> +#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +static inline void arch_enter_lazy_mmu_mode(void)
> +{
> + VM_WARN_ON(in_interrupt());
> + VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));
> +
> + set_thread_flag(TIF_LAZY_MMU);
> +}
> +
> +static inline void arch_flush_lazy_mmu_mode(void)
> +{
> + if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
> + emit_pte_barriers();
> +}
> +
> +static inline void arch_leave_lazy_mmu_mode(void)
> +{
> + arch_flush_lazy_mmu_mode();
> + clear_thread_flag(TIF_LAZY_MMU);
> +}
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>
> @@ -323,10 +372,8 @@ static inline void __set_pte_complete(pte_t pte)
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> * has the necessary barriers.
> */
> - if (pte_valid_not_user(pte)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pte_valid_not_user(pte))
> + queue_pte_barriers();
> }
>
> static inline void __set_pte(pte_t *ptep, pte_t pte)
> @@ -778,10 +825,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>
> WRITE_ONCE(*pmdp, pmd);
>
> - if (pmd_valid(pmd)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pmd_valid(pmd))
> + queue_pte_barriers();
> }
>
> static inline void pmd_clear(pmd_t *pmdp)
> @@ -845,10 +890,8 @@ static inline void set_pud(pud_t *pudp, pud_t pud)
>
> WRITE_ONCE(*pudp, pud);
>
> - if (pud_valid(pud)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pud_valid(pud))
> + queue_pte_barriers();
> }
>
> static inline void pud_clear(pud_t *pudp)
> @@ -925,8 +968,7 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> }
>
> WRITE_ONCE(*p4dp, p4d);
> - dsb(ishst);
> - isb();
> + queue_pte_barriers();
> }
>
> static inline void p4d_clear(p4d_t *p4dp)
> @@ -1052,8 +1094,7 @@ static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
> }
>
> WRITE_ONCE(*pgdp, pgd);
> - dsb(ishst);
> - isb();
> + queue_pte_barriers();
> }
>
> static inline void pgd_clear(pgd_t *pgdp)
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index 1114c1c3300a..1fdd74b7b831 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -82,6 +82,8 @@ void arch_setup_new_exec(void);
> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
> +#define TIF_LAZY_MMU 31 /* Task in lazy mmu mode */
> +#define TIF_LAZY_MMU_PENDING 32 /* Ops pending for lazy mmu mode exit */
>
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 42faebb7b712..45a55fe81788 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -680,10 +680,11 @@ struct task_struct *__switch_to(struct task_struct *prev,
> gcs_thread_switch(next);
>
> /*
> - * Complete any pending TLB or cache maintenance on this CPU in case
> - * the thread migrates to a different CPU.
> - * This full barrier is also required by the membarrier system
> - * call.
> + * Complete any pending TLB or cache maintenance on this CPU in case the
> + * thread migrates to a different CPU. This full barrier is also
> + * required by the membarrier system call. Additionally it makes any
> + * in-progress pgtable writes visible to the table walker; See
> + * emit_pte_barriers().
> */
> dsb(ish);
>
Otherwise, LGTM.
I will try and think through again whether these deferred syncs and flushes can
cause subtle problems elsewhere.
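For completeness, here is the mental model I am working from - a minimal usage
sketch (not from the series; example_map_kernel_range() is a made-up helper) of
how the barriers end up batched for a run of valid kernel ptes:

/*
 * Illustrative only: inside the lazy mmu section each __set_pte() queues
 * TIF_LAZY_MMU_PENDING instead of emitting dsb(ishst); isb(), and the
 * barriers are emitted at most once when leaving the section.
 */
static void example_map_kernel_range(pte_t *ptep, pte_t pte, unsigned int nr)
{
	unsigned int i;

	arch_enter_lazy_mmu_mode();		/* sets TIF_LAZY_MMU */

	for (i = 0; i < nr; i++) {
		__set_pte(ptep + i, pte);	/* defers the barriers */
		pte = pte_advance_pfn(pte, 1);
	}

	arch_leave_lazy_mmu_mode();		/* emits pending barriers once */
}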
* Re: [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
` (11 preceding siblings ...)
2025-03-27 12:16 ` [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Uladzislau Rezki
@ 2025-04-14 13:56 ` Ryan Roberts
12 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-04-14 13:56 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Pasha Tatashin, Andrew Morton,
Uladzislau Rezki, Christoph Hellwig, David Hildenbrand,
Matthew Wilcox (Oracle), Mark Rutland, Anshuman Khandual,
Alexandre Ghiti, Kevin Brodsky
Cc: linux-arm-kernel, linux-mm, linux-kernel
Hi Catalin,
On 04/03/2025 15:04, Ryan Roberts wrote:
> Hi All,
>
> This is v3 of a series to improve performance for hugetlb and vmalloc on arm64.
> Although some of these patches are core-mm, advice from Andrew was to go via the
> arm64 tree. Hopefully I can get some ACKs from mm folks.
>
> The 2 key performance improvements are 1) enabling the use of contpte-mapped
> blocks in the vmalloc space when appropriate (which reduces TLB pressure). There
> were already hooks for this (used by powerpc) but they required some tidying and
> extending for arm64. And 2) batching up barriers when modifying the vmalloc
> address space for upto 30% reduction in time taken in vmalloc().
Thanks for your reviews - I'm just trying to get my ducks lined up for when I'm
back in the office next week...
The last remaining 2 patches without R-b are patches #4 and #11 (both arm64). Do
you have any feedback on these? I have the series ready to repost with some
minor nits and build warnings addressed. I'm hoping we can get these last two
patches squared away, then I'll repost next week against -rc3.
Thanks,
Ryan
>
> vmalloc() performance was measured using the test_vmalloc.ko module. Tested on
> Apple M2 and Ampere Altra. Each test had loop count set to 500000 and the whole
> test was repeated 10 times.
>
> legend:
> - p: nr_pages (pages to allocate)
> - h: use_huge (vmalloc() vs vmalloc_huge())
> - (I): statistically significant improvement (95% CI does not overlap)
> - (R): statistically significant regression (95% CI does not overlap)
> - measurements are times; smaller is better
>
> +--------------------------------------------------+-------------+-------------+
> | Benchmark | | |
> | Result Class | Apple M2 | Ampere Alta |
> +==================================================+=============+=============+
> | micromm/vmalloc | | |
> | fix_align_alloc_test: p:1, h:0 (usec) | (I) -11.53% | -2.57% |
> | fix_size_alloc_test: p:1, h:0 (usec) | 2.14% | 1.79% |
> | fix_size_alloc_test: p:4, h:0 (usec) | (I) -9.93% | (I) -4.80% |
> | fix_size_alloc_test: p:16, h:0 (usec) | (I) -25.07% | (I) -14.24% |
> | fix_size_alloc_test: p:16, h:1 (usec) | (I) -14.07% | (R) 7.93% |
> | fix_size_alloc_test: p:64, h:0 (usec) | (I) -29.43% | (I) -19.30% |
> | fix_size_alloc_test: p:64, h:1 (usec) | (I) -16.39% | (R) 6.71% |
> | fix_size_alloc_test: p:256, h:0 (usec) | (I) -31.46% | (I) -20.60% |
> | fix_size_alloc_test: p:256, h:1 (usec) | (I) -16.58% | (R) 6.70% |
> | fix_size_alloc_test: p:512, h:0 (usec) | (I) -31.96% | (I) -20.04% |
> | fix_size_alloc_test: p:512, h:1 (usec) | 2.30% | 0.71% |
> | full_fit_alloc_test: p:1, h:0 (usec) | -2.94% | 1.77% |
> | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0 (usec) | -7.75% | 1.71% |
> | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0 (usec) | -9.07% | (R) 2.34% |
> | long_busy_list_alloc_test: p:1, h:0 (usec) | (I) -29.18% | (I) -17.91% |
> | pcpu_alloc_test: p:1, h:0 (usec) | -14.71% | -3.14% |
> | random_size_align_alloc_test: p:1, h:0 (usec) | (I) -11.08% | (I) -4.62% |
> | random_size_alloc_test: p:1, h:0 (usec) | (I) -30.25% | (I) -17.95% |
> | vm_map_ram_test: p:1, h:0 (usec) | 5.06% | (R) 6.63% |
> +--------------------------------------------------+-------------+-------------+
>
> So there are some nice improvements but also some regressions to explain:
>
> fix_size_alloc_test with h:1 and p:16,64,256 regress by ~6% on Altra. The
> regression is actually introduced by enabling contpte-mapped 64K blocks in these
> tests, and that regression is reduced (from about 8% if memory serves) by doing
> the barrier batching. I don't have a definite conclusion on the root cause, but
> I've ruled out the differences in the mapping paths in vmalloc. I strongly
> believe this is likely due to the difference in the allocation path; 64K blocks
> are not cached per-cpu so we have to go all the way to the buddy. I'm not sure
> why this doesn't show up on M2 though. Regardless, I'm going to assert that it's
> better to choose 16x reduction in TLB pressure vs 6% on the vmalloc allocation
> call duration.
>
> Changes since v2 [2]
> ====================
> - Removed the new arch_update_kernel_mappings_[begin|end]() API
> - Switches to arch_[enter|leave]_lazy_mmu_mode() instead for barrier batching
> - Removed clean up to avoid barriers for invalid or user mappings
>
> Changes since v1 [1]
> ====================
> - Split out the fixes into their own series
> - Added Rbs from Anshuman - Thanks!
> - Added patch to clean up the methods by which huge_pte size is determined
> - Added "#ifndef __PAGETABLE_PMD_FOLDED" around PUD_SIZE in
> flush_hugetlb_tlb_range()
> - Renamed ___set_ptes() -> set_ptes_anysz()
> - Renamed ___ptep_get_and_clear() -> ptep_get_and_clear_anysz()
> - Fixed typos in commit logs
> - Refactored pXd_valid_not_user() for better reuse
> - Removed TIF_KMAP_UPDATE_PENDING after concluding that a single flag is sufficient
> - Concluded the extra isb() in __switch_to() is not required
> - Only call arch_update_kernel_mappings_[begin|end]() for kernel mappings
>
> Applies on top of v6.14-rc5, which already contains the fixes from [3]. All
> mm selftests run and pass.
>
> NOTE: It's possible that the changes in patch #10 may cause bugs I found in other
> archs' lazy mmu implementations to become more likely to trigger. I've fixed all
> those bugs in the series at [4], which is now in mm-unstable. But some
> coordination when merging this may be required.
>
> [1] https://lore.kernel.org/all/20250205151003.88959-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20250217140809.1702789-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/all/20250217140419.1702389-1-ryan.roberts@arm.com/
> [4] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
>
> Thanks,
> Ryan
>
> Ryan Roberts (11):
> arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
> arm64: hugetlb: Refine tlb maintenance scope
> mm/page_table_check: Batch-check pmds/puds just like ptes
> arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
> arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz()
> arm64/mm: Hoist barriers out of set_ptes_anysz() loop
> mm/vmalloc: Warn on improper use of vunmap_range()
> mm/vmalloc: Gracefully unmap huge ptes
> arm64/mm: Support huge pte-mapped pages in vmap
> mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
> arm64/mm: Batch barriers when updating kernel mappings
>
> arch/arm64/include/asm/hugetlb.h | 29 ++--
> arch/arm64/include/asm/pgtable.h | 195 ++++++++++++++++++---------
> arch/arm64/include/asm/thread_info.h | 2 +
> arch/arm64/include/asm/vmalloc.h | 45 +++++++
> arch/arm64/kernel/process.c | 9 +-
> arch/arm64/mm/hugetlbpage.c | 72 ++++------
> include/linux/page_table_check.h | 30 +++--
> include/linux/vmalloc.h | 8 ++
> mm/page_table_check.c | 34 +++--
> mm/vmalloc.c | 40 +++++-
> 10 files changed, 315 insertions(+), 149 deletions(-)
>
> --
> 2.43.0
>
* Re: [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
2025-03-04 15:04 ` [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
2025-03-06 5:08 ` kernel test robot
@ 2025-04-14 16:25 ` Catalin Marinas
1 sibling, 0 replies; 39+ messages in thread
From: Catalin Marinas @ 2025-04-14 16:25 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:34PM +0000, Ryan Roberts wrote:
> +static inline void set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
> + unsigned int nr, unsigned long pgsize)
> {
> - __sync_cache_and_tags(pte, nr);
> - __check_safe_pte_update(mm, ptep, pte);
> - __set_pte(ptep, pte);
> + unsigned long stride = pgsize >> PAGE_SHIFT;
> +
> + switch (pgsize) {
> + case PAGE_SIZE:
> + page_table_check_ptes_set(mm, ptep, pte, nr);
> + break;
> + case PMD_SIZE:
> + page_table_check_pmds_set(mm, (pmd_t *)ptep, pte_pmd(pte), nr);
> + break;
> + case PUD_SIZE:
> + page_table_check_puds_set(mm, (pud_t *)ptep, pte_pud(pte), nr);
> + break;
> + default:
> + VM_WARN_ON(1);
> + }
> +
> + __sync_cache_and_tags(pte, nr * stride);
> +
> + for (;;) {
> + __check_safe_pte_update(mm, ptep, pte);
> + __set_pte(ptep, pte);
> + if (--nr == 0)
> + break;
> + ptep++;
> + pte = pte_advance_pfn(pte, stride);
> + }
> }
I thought I replied to this one but somehow failed to send. The only
comment I have is that I'd add a double underscore in front of the anysz
functions to imply they're a private API. Otherwise it looks fine.
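i.e. just the rename, with the signature otherwise unchanged (sketch):

static inline void __set_ptes_anysz(struct mm_struct *mm, pte_t *ptep, pte_t pte,
				    unsigned int nr, unsigned long pgsize);

and likewise __ptep_get_and_clear_anysz() for the clearing side.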
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings
2025-03-04 15:04 ` [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings Ryan Roberts
2025-04-04 6:02 ` Anshuman Khandual
@ 2025-04-14 17:38 ` Catalin Marinas
2025-04-14 18:28 ` Ryan Roberts
1 sibling, 1 reply; 39+ messages in thread
From: Catalin Marinas @ 2025-04-14 17:38 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Tue, Mar 04, 2025 at 03:04:41PM +0000, Ryan Roberts wrote:
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 1898c3069c43..149df945c1ab 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -40,6 +40,55 @@
> #include <linux/sched.h>
> #include <linux/page_table_check.h>
>
> +static inline void emit_pte_barriers(void)
> +{
> + /*
> + * These barriers are emitted under certain conditions after a pte entry
> + * was modified (see e.g. __set_pte_complete()). The dsb makes the store
> + * visible to the table walker. The isb ensures that any previous
> + * speculative "invalid translation" marker that is in the CPU's
> + * pipeline gets cleared, so that any access to that address after
> + * setting the pte to valid won't cause a spurious fault. If the thread
> + * gets preempted after storing to the pgtable but before emitting these
> + * barriers, __switch_to() emits a dsb which ensure the walker gets to
> + * see the store. There is no guarrantee of an isb being issued though.
> + * This is safe because it will still get issued (albeit on a
> + * potentially different CPU) when the thread starts running again,
> + * before any access to the address.
> + */
> + dsb(ishst);
> + isb();
> +}
> +
> +static inline void queue_pte_barriers(void)
> +{
> + if (test_thread_flag(TIF_LAZY_MMU))
> + set_thread_flag(TIF_LAZY_MMU_PENDING);
As we can have lots of calls here, it might be slightly cheaper to test
TIF_LAZY_MMU_PENDING and avoid setting it unnecessarily.
I haven't checked - does the compiler generate multiple mrs from sp_el0
for subsequent test_thread_flag()?
> + else
> + emit_pte_barriers();
> +}
> +
> +#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +static inline void arch_enter_lazy_mmu_mode(void)
> +{
> + VM_WARN_ON(in_interrupt());
> + VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));
> +
> + set_thread_flag(TIF_LAZY_MMU);
> +}
> +
> +static inline void arch_flush_lazy_mmu_mode(void)
> +{
> + if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
> + emit_pte_barriers();
> +}
> +
> +static inline void arch_leave_lazy_mmu_mode(void)
> +{
> + arch_flush_lazy_mmu_mode();
> + clear_thread_flag(TIF_LAZY_MMU);
> +}
> +
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>
> @@ -323,10 +372,8 @@ static inline void __set_pte_complete(pte_t pte)
> * Only if the new pte is valid and kernel, otherwise TLB maintenance
> * has the necessary barriers.
> */
> - if (pte_valid_not_user(pte)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pte_valid_not_user(pte))
> + queue_pte_barriers();
> }
I think this scheme works, I couldn't find a counter-example unless
__set_pte() gets called in an interrupt context. You could add
VM_WARN_ON(in_interrupt()) in queue_pte_barriers() as well.
With preemption, the newly mapped range shouldn't be used before
arch_flush_lazy_mmu_mode() is called, so it looks safe as well. I think
x86 uses a per-CPU variable to track this but per-thread is easier to
reason about if there's no nesting.
> static inline void __set_pte(pte_t *ptep, pte_t pte)
> @@ -778,10 +825,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>
> WRITE_ONCE(*pmdp, pmd);
>
> - if (pmd_valid(pmd)) {
> - dsb(ishst);
> - isb();
> - }
> + if (pmd_valid(pmd))
> + queue_pte_barriers();
> }
We discussed on a previous series - for pmd/pud we end up with barriers
even for user mappings but they are at a much coarser granularity (and I
wasn't keen on 'user' attributes for the table entries).
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
* Re: [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings
2025-04-14 17:38 ` Catalin Marinas
@ 2025-04-14 18:28 ` Ryan Roberts
2025-04-15 10:51 ` Catalin Marinas
0 siblings, 1 reply; 39+ messages in thread
From: Ryan Roberts @ 2025-04-14 18:28 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On 14/04/2025 18:38, Catalin Marinas wrote:
> On Tue, Mar 04, 2025 at 03:04:41PM +0000, Ryan Roberts wrote:
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 1898c3069c43..149df945c1ab 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -40,6 +40,55 @@
>> #include <linux/sched.h>
>> #include <linux/page_table_check.h>
>>
>> +static inline void emit_pte_barriers(void)
>> +{
>> + /*
>> + * These barriers are emitted under certain conditions after a pte entry
>> + * was modified (see e.g. __set_pte_complete()). The dsb makes the store
>> + * visible to the table walker. The isb ensures that any previous
>> + * speculative "invalid translation" marker that is in the CPU's
>> + * pipeline gets cleared, so that any access to that address after
>> + * setting the pte to valid won't cause a spurious fault. If the thread
>> + * gets preempted after storing to the pgtable but before emitting these
>> + * barriers, __switch_to() emits a dsb which ensure the walker gets to
>> + * see the store. There is no guarrantee of an isb being issued though.
>> + * This is safe because it will still get issued (albeit on a
>> + * potentially different CPU) when the thread starts running again,
>> + * before any access to the address.
>> + */
>> + dsb(ishst);
>> + isb();
>> +}
>> +
>> +static inline void queue_pte_barriers(void)
>> +{
>> + if (test_thread_flag(TIF_LAZY_MMU))
>> + set_thread_flag(TIF_LAZY_MMU_PENDING);
>
> As we can have lots of calls here, it might be slightly cheaper to test
> TIF_LAZY_MMU_PENDING and avoid setting it unnecessarily.
Yes, good point.
>
> I haven't checked - does the compiler generate multiple mrs from sp_el0
> for subsequent test_thread_flag()?
It emits a single mrs but it loads from the pointer twice. I think v3 is the version we want?
void TEST_queue_pte_barriers_v1(void)
{
	if (test_thread_flag(TIF_LAZY_MMU))
		set_thread_flag(TIF_LAZY_MMU_PENDING);
	else
		emit_pte_barriers();
}

void TEST_queue_pte_barriers_v2(void)
{
	if (test_thread_flag(TIF_LAZY_MMU) &&
	    !test_thread_flag(TIF_LAZY_MMU_PENDING))
		set_thread_flag(TIF_LAZY_MMU_PENDING);
	else
		emit_pte_barriers();
}

void TEST_queue_pte_barriers_v3(void)
{
	unsigned long flags = read_thread_flags();

	if ((flags & (_TIF_LAZY_MMU | _TIF_LAZY_MMU_PENDING)) == _TIF_LAZY_MMU)
		set_thread_flag(TIF_LAZY_MMU_PENDING);
	else
		emit_pte_barriers();
}
000000000000101c <TEST_queue_pte_barriers_v1>:
101c: d5384100 mrs x0, sp_el0
1020: f9400001 ldr x1, [x0]
1024: 37f80081 tbnz w1, #31, 1034 <TEST_queue_pte_barriers_v1+0x18>
1028: d5033a9f dsb ishst
102c: d5033fdf isb
1030: d65f03c0 ret
1034: 14000004 b 1044 <TEST_queue_pte_barriers_v1+0x28>
1038: d2c00021 mov x1, #0x100000000 // #4294967296
103c: f821301f stset x1, [x0]
1040: d65f03c0 ret
1044: f9800011 prfm pstl1strm, [x0]
1048: c85f7c01 ldxr x1, [x0]
104c: b2600021 orr x1, x1, #0x100000000
1050: c8027c01 stxr w2, x1, [x0]
1054: 35ffffa2 cbnz w2, 1048 <TEST_queue_pte_barriers_v1+0x2c>
1058: d65f03c0 ret
000000000000105c <TEST_queue_pte_barriers_v2>:
105c: d5384100 mrs x0, sp_el0
1060: f9400001 ldr x1, [x0]
1064: 37f80081 tbnz w1, #31, 1074 <TEST_queue_pte_barriers_v2+0x18>
1068: d5033a9f dsb ishst
106c: d5033fdf isb
1070: d65f03c0 ret
1074: f9400001 ldr x1, [x0]
1078: b707ff81 tbnz x1, #32, 1068 <TEST_queue_pte_barriers_v2+0xc>
107c: 14000004 b 108c <TEST_queue_pte_barriers_v2+0x30>
1080: d2c00021 mov x1, #0x100000000 // #4294967296
1084: f821301f stset x1, [x0]
1088: d65f03c0 ret
108c: f9800011 prfm pstl1strm, [x0]
1090: c85f7c01 ldxr x1, [x0]
1094: b2600021 orr x1, x1, #0x100000000
1098: c8027c01 stxr w2, x1, [x0]
109c: 35ffffa2 cbnz w2, 1090 <TEST_queue_pte_barriers_v2+0x34>
10a0: d65f03c0 ret
00000000000010a4 <TEST_queue_pte_barriers_v3>:
10a4: d5384101 mrs x1, sp_el0
10a8: f9400020 ldr x0, [x1]
10ac: d2b00002 mov x2, #0x80000000 // #2147483648
10b0: 92610400 and x0, x0, #0x180000000
10b4: eb02001f cmp x0, x2
10b8: 54000080 b.eq 10c8 <TEST_queue_pte_barriers_v3+0x24> // b.none
10bc: d5033a9f dsb ishst
10c0: d5033fdf isb
10c4: d65f03c0 ret
10c8: 14000004 b 10d8 <TEST_queue_pte_barriers_v3+0x34>
10cc: d2c00020 mov x0, #0x100000000 // #4294967296
10d0: f820303f stset x0, [x1]
10d4: d65f03c0 ret
10d8: f9800031 prfm pstl1strm, [x1]
10dc: c85f7c20 ldxr x0, [x1]
10e0: b2600000 orr x0, x0, #0x100000000
10e4: c8027c20 stxr w2, x0, [x1]
10e8: 35ffffa2 cbnz w2, 10dc <TEST_queue_pte_barriers_v3+0x38>
10ec: d65f03c0 ret
>
>> + else
>> + emit_pte_barriers();
>> +}
>> +
>> +#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>> +static inline void arch_enter_lazy_mmu_mode(void)
>> +{
>> + VM_WARN_ON(in_interrupt());
>> + VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));
>> +
>> + set_thread_flag(TIF_LAZY_MMU);
>> +}
>> +
>> +static inline void arch_flush_lazy_mmu_mode(void)
>> +{
>> + if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
>> + emit_pte_barriers();
>> +}
>> +
>> +static inline void arch_leave_lazy_mmu_mode(void)
>> +{
>> + arch_flush_lazy_mmu_mode();
>> + clear_thread_flag(TIF_LAZY_MMU);
>> +}
>> +
>> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>>
>> @@ -323,10 +372,8 @@ static inline void __set_pte_complete(pte_t pte)
>> * Only if the new pte is valid and kernel, otherwise TLB maintenance
>> * has the necessary barriers.
>> */
>> - if (pte_valid_not_user(pte)) {
>> - dsb(ishst);
>> - isb();
>> - }
>> + if (pte_valid_not_user(pte))
>> + queue_pte_barriers();
>> }
>
> I think this scheme works, I couldn't find a counter-example unless
> __set_pte() gets called in an interrupt context. You could add
> VM_WARN_ON(in_interrupt()) in queue_pte_barriers() as well.
>
> With preemption, the newly mapped range shouldn't be used before
> arch_flush_lazy_mmu_mode() is called, so it looks safe as well. I think
> x86 uses a per-CPU variable to track this but per-thread is easier to
> reason about if there's no nesting.
>
>> static inline void __set_pte(pte_t *ptep, pte_t pte)
>> @@ -778,10 +825,8 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>>
>> WRITE_ONCE(*pmdp, pmd);
>>
>> - if (pmd_valid(pmd)) {
>> - dsb(ishst);
>> - isb();
>> - }
>> + if (pmd_valid(pmd))
>> + queue_pte_barriers();
>> }
>
> We discussed on a previous series - for pmd/pud we end up with barriers
> even for user mappings but they are at a much coarser granularity (and I
> wasn't keen on 'user' attributes for the table entries).
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Thanks!
Ryan
* Re: [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings
2025-04-14 18:28 ` Ryan Roberts
@ 2025-04-15 10:51 ` Catalin Marinas
2025-04-15 17:28 ` Ryan Roberts
0 siblings, 1 reply; 39+ messages in thread
From: Catalin Marinas @ 2025-04-15 10:51 UTC (permalink / raw)
To: Ryan Roberts
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On Mon, Apr 14, 2025 at 07:28:46PM +0100, Ryan Roberts wrote:
> On 14/04/2025 18:38, Catalin Marinas wrote:
> > On Tue, Mar 04, 2025 at 03:04:41PM +0000, Ryan Roberts wrote:
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index 1898c3069c43..149df945c1ab 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -40,6 +40,55 @@
> >> #include <linux/sched.h>
> >> #include <linux/page_table_check.h>
> >>
> >> +static inline void emit_pte_barriers(void)
> >> +{
> >> + /*
> >> + * These barriers are emitted under certain conditions after a pte entry
> >> + * was modified (see e.g. __set_pte_complete()). The dsb makes the store
> >> + * visible to the table walker. The isb ensures that any previous
> >> + * speculative "invalid translation" marker that is in the CPU's
> >> + * pipeline gets cleared, so that any access to that address after
> >> + * setting the pte to valid won't cause a spurious fault. If the thread
> >> + * gets preempted after storing to the pgtable but before emitting these
> >> + * barriers, __switch_to() emits a dsb which ensure the walker gets to
> >> + * see the store. There is no guarrantee of an isb being issued though.
> >> + * This is safe because it will still get issued (albeit on a
> >> + * potentially different CPU) when the thread starts running again,
> >> + * before any access to the address.
> >> + */
> >> + dsb(ishst);
> >> + isb();
> >> +}
> >> +
> >> +static inline void queue_pte_barriers(void)
> >> +{
> >> + if (test_thread_flag(TIF_LAZY_MMU))
> >> + set_thread_flag(TIF_LAZY_MMU_PENDING);
> >
> > As we can have lots of calls here, it might be slightly cheaper to test
> > TIF_LAZY_MMU_PENDING and avoid setting it unnecessarily.
>
> Yes, good point.
>
> > I haven't checked - does the compiler generate multiple mrs from sp_el0
> > for subsequent test_thread_flag()?
>
> It emits a single mrs but it loads from the pointer twice.
It's not that bad if we only do the set_thread_flag() once.
> I think v3 is the version we want?
>
>
> void TEST_queue_pte_barriers_v1(void)
> {
> if (test_thread_flag(TIF_LAZY_MMU))
> set_thread_flag(TIF_LAZY_MMU_PENDING);
> else
> emit_pte_barriers();
> }
>
> void TEST_queue_pte_barriers_v2(void)
> {
> if (test_thread_flag(TIF_LAZY_MMU) &&
> !test_thread_flag(TIF_LAZY_MMU_PENDING))
> set_thread_flag(TIF_LAZY_MMU_PENDING);
> else
> emit_pte_barriers();
> }
>
> void TEST_queue_pte_barriers_v3(void)
> {
> unsigned long flags = read_thread_flags();
>
> if ((flags & (_TIF_LAZY_MMU | _TIF_LAZY_MMU_PENDING)) == _TIF_LAZY_MMU)
> set_thread_flag(TIF_LAZY_MMU_PENDING);
> else
> emit_pte_barriers();
> }
Doesn't v3 emit barriers once _TIF_LAZY_MMU_PENDING has been set? We
need something like:
	if (flags & _TIF_LAZY_MMU) {
		if (!(flags & _TIF_LAZY_MMU_PENDING))
			set_thread_flag(TIF_LAZY_MMU_PENDING);
	} else {
		emit_pte_barriers();
	}
--
Catalin
* Re: [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings
2025-04-15 10:51 ` Catalin Marinas
@ 2025-04-15 17:28 ` Ryan Roberts
0 siblings, 0 replies; 39+ messages in thread
From: Ryan Roberts @ 2025-04-15 17:28 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Pasha Tatashin, Andrew Morton, Uladzislau Rezki,
Christoph Hellwig, David Hildenbrand, Matthew Wilcox (Oracle),
Mark Rutland, Anshuman Khandual, Alexandre Ghiti, Kevin Brodsky,
linux-arm-kernel, linux-mm, linux-kernel
On 15/04/2025 11:51, Catalin Marinas wrote:
> On Mon, Apr 14, 2025 at 07:28:46PM +0100, Ryan Roberts wrote:
>> On 14/04/2025 18:38, Catalin Marinas wrote:
>>> On Tue, Mar 04, 2025 at 03:04:41PM +0000, Ryan Roberts wrote:
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 1898c3069c43..149df945c1ab 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -40,6 +40,55 @@
>>>> #include <linux/sched.h>
>>>> #include <linux/page_table_check.h>
>>>>
>>>> +static inline void emit_pte_barriers(void)
>>>> +{
>>>> + /*
>>>> + * These barriers are emitted under certain conditions after a pte entry
>>>> + * was modified (see e.g. __set_pte_complete()). The dsb makes the store
>>>> + * visible to the table walker. The isb ensures that any previous
>>>> + * speculative "invalid translation" marker that is in the CPU's
>>>> + * pipeline gets cleared, so that any access to that address after
>>>> + * setting the pte to valid won't cause a spurious fault. If the thread
>>>> + * gets preempted after storing to the pgtable but before emitting these
>>>> + * barriers, __switch_to() emits a dsb which ensure the walker gets to
>>>> + * see the store. There is no guarrantee of an isb being issued though.
>>>> + * This is safe because it will still get issued (albeit on a
>>>> + * potentially different CPU) when the thread starts running again,
>>>> + * before any access to the address.
>>>> + */
>>>> + dsb(ishst);
>>>> + isb();
>>>> +}
>>>> +
>>>> +static inline void queue_pte_barriers(void)
>>>> +{
>>>> + if (test_thread_flag(TIF_LAZY_MMU))
>>>> + set_thread_flag(TIF_LAZY_MMU_PENDING);
>>>
>>> As we can have lots of calls here, it might be slightly cheaper to test
>>> TIF_LAZY_MMU_PENDING and avoid setting it unnecessarily.
>>
>> Yes, good point.
>>
>>> I haven't checked - does the compiler generate multiple mrs from sp_el0
>>> for subsequent test_thread_flag()?
>>
>> It emits a single mrs but it loads from the pointer twice.
>
> It's not that bad if only do the set_thread_flag() once.
>
>> I think v3 is the version we want?
>>
>>
>> void TEST_queue_pte_barriers_v1(void)
>> {
>> if (test_thread_flag(TIF_LAZY_MMU))
>> set_thread_flag(TIF_LAZY_MMU_PENDING);
>> else
>> emit_pte_barriers();
>> }
>>
>> void TEST_queue_pte_barriers_v2(void)
>> {
>> if (test_thread_flag(TIF_LAZY_MMU) &&
>> !test_thread_flag(TIF_LAZY_MMU_PENDING))
>> set_thread_flag(TIF_LAZY_MMU_PENDING);
>> else
>> emit_pte_barriers();
>> }
>>
>> void TEST_queue_pte_barriers_v3(void)
>> {
>> unsigned long flags = read_thread_flags();
>>
>> if ((flags & (_TIF_LAZY_MMU | _TIF_LAZY_MMU_PENDING)) == _TIF_LAZY_MMU)
>> set_thread_flag(TIF_LAZY_MMU_PENDING);
>> else
>> emit_pte_barriers();
>> }
>
> Doesn't v3 emit barriers once _TIF_LAZY_MMU_PENDING has been set? We
> need something like:
>
> if (flags & _TIF_LAZY_MMU) {
> if (!(flags & _TIF_LAZY_MMU_PENDING))
> set_thread_flag(TIF_LAZY_MMU_PENDING);
> } else {
> emit_pte_barriers();
> }
Gah, yeah sorry, going too quickly. v2 is also logically incorrect.
Fixed versions:
void TEST_queue_pte_barriers_v2(void)
{
	if (test_thread_flag(TIF_LAZY_MMU)) {
		if (!test_thread_flag(TIF_LAZY_MMU_PENDING))
			set_thread_flag(TIF_LAZY_MMU_PENDING);
	} else {
		emit_pte_barriers();
	}
}

void TEST_queue_pte_barriers_v3(void)
{
	unsigned long flags = read_thread_flags();

	if (flags & BIT(TIF_LAZY_MMU)) {
		if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
			set_thread_flag(TIF_LAZY_MMU_PENDING);
	} else {
		emit_pte_barriers();
	}
}
000000000000105c <TEST_queue_pte_barriers_v2>:
105c: d5384100 mrs x0, sp_el0
1060: f9400001 ldr x1, [x0]
1064: 37f80081 tbnz w1, #31, 1074 <TEST_queue_pte_barriers_v2+0x18>
1068: d5033a9f dsb ishst
106c: d5033fdf isb
1070: d65f03c0 ret
1074: f9400001 ldr x1, [x0]
1078: b707ffc1 tbnz x1, #32, 1070 <TEST_queue_pte_barriers_v2+0x14>
107c: 14000004 b 108c <TEST_queue_pte_barriers_v2+0x30>
1080: d2c00021 mov x1, #0x100000000 // #4294967296
1084: f821301f stset x1, [x0]
1088: d65f03c0 ret
108c: f9800011 prfm pstl1strm, [x0]
1090: c85f7c01 ldxr x1, [x0]
1094: b2600021 orr x1, x1, #0x100000000
1098: c8027c01 stxr w2, x1, [x0]
109c: 35ffffa2 cbnz w2, 1090 <TEST_queue_pte_barriers_v2+0x34>
10a0: d65f03c0 ret
00000000000010a4 <TEST_queue_pte_barriers_v3>:
10a4: d5384101 mrs x1, sp_el0
10a8: f9400020 ldr x0, [x1]
10ac: 36f80060 tbz w0, #31, 10b8 <TEST_queue_pte_barriers_v3+0x14>
10b0: b60000a0 tbz x0, #32, 10c4 <TEST_queue_pte_barriers_v3+0x20>
10b4: d65f03c0 ret
10b8: d5033a9f dsb ishst
10bc: d5033fdf isb
10c0: d65f03c0 ret
10c4: 14000004 b 10d4 <TEST_queue_pte_barriers_v3+0x30>
10c8: d2c00020 mov x0, #0x100000000 // #4294967296
10cc: f820303f stset x0, [x1]
10d0: d65f03c0 ret
10d4: f9800031 prfm pstl1strm, [x1]
10d8: c85f7c20 ldxr x0, [x1]
10dc: b2600000 orr x0, x0, #0x100000000
10e0: c8027c20 stxr w2, x0, [x1]
10e4: 35ffffa2 cbnz w2, 10d8 <TEST_queue_pte_barriers_v3+0x34>
10e8: d65f03c0 ret
So v3 is the way to go, I think; it's a single mrs and a single ldr.
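Roughly, the helper I plan to post will look like this (sketch, folding in your
earlier VM_WARN_ON(in_interrupt()) suggestion; subject to a final build/test):

static inline void queue_pte_barriers(void)
{
	unsigned long flags = read_thread_flags();

	VM_WARN_ON(in_interrupt());

	if (flags & BIT(TIF_LAZY_MMU)) {
		if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
			set_thread_flag(TIF_LAZY_MMU_PENDING);
	} else {
		emit_pte_barriers();
	}
}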
I'll get this fixed up and posted early next week.
Thanks,
Ryan
Thread overview: 39+ messages
2025-03-04 15:04 [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 01/11] arm64: hugetlb: Cleanup huge_pte size discovery mechanisms Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
2025-04-04 3:03 ` Anshuman Khandual
2025-03-04 15:04 ` [PATCH v3 02/11] arm64: hugetlb: Refine tlb maintenance scope Ryan Roberts
2025-04-03 20:47 ` Catalin Marinas
2025-04-04 3:50 ` Anshuman Khandual
2025-03-04 15:04 ` [PATCH v3 03/11] mm/page_table_check: Batch-check pmds/puds just like ptes Ryan Roberts
2025-03-26 14:48 ` Pasha Tatashin
2025-03-26 14:54 ` Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
2025-03-04 15:04 ` [PATCH v3 04/11] arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear() Ryan Roberts
2025-03-06 5:08 ` kernel test robot
2025-03-06 11:54 ` Ryan Roberts
2025-04-14 16:25 ` Catalin Marinas
2025-03-04 15:04 ` [PATCH v3 05/11] arm64: hugetlb: Use set_ptes_anysz() and ptep_get_and_clear_anysz() Ryan Roberts
2025-03-05 16:00 ` kernel test robot
2025-03-05 16:32 ` Ryan Roberts
2025-04-03 20:47 ` Catalin Marinas
2025-03-04 15:04 ` [PATCH v3 06/11] arm64/mm: Hoist barriers out of set_ptes_anysz() loop Ryan Roberts
2025-04-03 20:46 ` Catalin Marinas
2025-04-04 4:11 ` Anshuman Khandual
2025-03-04 15:04 ` [PATCH v3 07/11] mm/vmalloc: Warn on improper use of vunmap_range() Ryan Roberts
2025-03-27 13:05 ` Uladzislau Rezki
2025-03-04 15:04 ` [PATCH v3 08/11] mm/vmalloc: Gracefully unmap huge ptes Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 09/11] arm64/mm: Support huge pte-mapped pages in vmap Ryan Roberts
2025-03-04 15:04 ` [PATCH v3 10/11] mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes Ryan Roberts
2025-03-27 13:06 ` Uladzislau Rezki
2025-04-03 20:47 ` Catalin Marinas
2025-04-04 4:54 ` Anshuman Khandual
2025-03-04 15:04 ` [PATCH v3 11/11] arm64/mm: Batch barriers when updating kernel mappings Ryan Roberts
2025-04-04 6:02 ` Anshuman Khandual
2025-04-14 17:38 ` Catalin Marinas
2025-04-14 18:28 ` Ryan Roberts
2025-04-15 10:51 ` Catalin Marinas
2025-04-15 17:28 ` Ryan Roberts
2025-03-27 12:16 ` [PATCH v3 00/11] Perf improvements for hugetlb and vmalloc on arm64 Uladzislau Rezki
2025-03-27 13:46 ` Ryan Roberts
2025-04-14 13:56 ` Ryan Roberts