linux-openrisc.vger.kernel.org archive mirror
* [PATCH 00/11] Always call constructor for kernel page tables
@ 2025-03-17 14:16 Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 01/11] mm: Pass mm down to pagetable_{pte,pmd}_ctor Kevin Brodsky
                   ` (11 more replies)
  0 siblings, 12 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

There has been much confusion around exactly when page table
constructors/destructors (pagetable_*_[cd]tor) are supposed to be
called. They were initially introduced for user PTEs only (to support
split page table locks), then at the PMD level for the same purpose.
Accounting was added later on, starting at the PTE level and then moving
to higher levels (PMD, PUD). Finally, with my earlier series "Account
page tables at all levels" [1], the ctor/dtor is run for all levels, all
the way to PGD.

I thought this was the end of the story, and it hopefully is for user
pgtables, but I was wrong as far as kernel pgtables are concerned. The
current situation there makes very little sense:

* At the PTE level, the ctor/dtor is not called (at least in the generic
  implementation). Specific helpers are used for kernel pgtables at this
  level (pte_{alloc,free}_kernel()) and those have never called the
  ctor/dtor, most likely because they were initially irrelevant in the
  kernel case.

* At all other levels, the ctor/dtor is normally called. This is
  potentially wasteful at the PMD level (more on that later).

This series aims to ensure that the ctor/dtor is always called for kernel
pgtables, as it already is for user pgtables. Besides consistency, the
main motivation is to guarantee that ctor/dtor hooks are systematically
called; this makes it possible to insert hooks to protect page tables [2],
for instance. There is however an extra challenge: split locks are not
used for kernel pgtables, and it would therefore be wasteful to
initialise them (ptlock_init()).

It is worth clarifying exactly when split locks are used. They clearly
are for user pgtables, but as illustrated in commit 61444cde9170 ("ARM:
8591/1: mm: use fully constructed struct pages for EFI pgd
allocations"), they also are for special page tables like efi_mm. The
one case where split locks are definitely unused is pgtables owned by
init_mm; this is consistent with the behaviour of apply_to_pte_range().

The approach chosen in this series is therefore to pass the mm
associated with the pgtables being constructed to
pagetable_{pte,pmd}_ctor() (patch 1), and skip ptlock_init() if
mm == &init_mm (patches 2 and 6). This makes it possible to call the PTE
ctor/dtor from pte_{alloc,free}_kernel() without unintended consequences
(patch 2). As a result the accounting functions are now called at
all levels for kernel pgtables, and split locks are never initialised.
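
To make the intended behaviour concrete, here is a minimal userspace
sketch of the PTE ctor logic after patches 1 and 2. It is a model, not
kernel code: struct mm_struct, struct ptdesc, init_mm and ptlock_init()
are reduced to stand-ins, and pagetable_account() is a hypothetical name
for the accounting normally done by __pagetable_ctor().

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct mm_struct { int unused; };
struct ptdesc {
	bool has_ptlock;	/* set once a split lock was initialised */
	bool accounted;		/* set once accounting was performed */
};

static struct mm_struct init_mm;	/* models the kernel's init_mm */
static int ptlock_inits;		/* counts modelled ptlock_init() calls */

/* Models ptlock_init(); the real one may allocate memory in some configs. */
static bool ptlock_init(struct ptdesc *ptdesc)
{
	ptdesc->has_ptlock = true;
	ptlock_inits++;
	return true;
}

/* Hypothetical stand-in for the accounting part of the ctor. */
static void pagetable_account(struct ptdesc *ptdesc)
{
	ptdesc->accounted = true;
}

/*
 * Models pagetable_pte_ctor() as changed by patch 2: the split lock is
 * skipped for init_mm only, while accounting always happens.
 */
static bool pagetable_pte_ctor(struct mm_struct *mm, struct ptdesc *ptdesc)
{
	if (mm != &init_mm && !ptlock_init(ptdesc))
		return false;
	pagetable_account(ptdesc);
	return true;
}
```

In this model, calling the ctor with &init_mm accounts the table without
touching the lock, whereas any other mm (including NULL, as passed by the
special allocators in patch 1) still gets a ptlock.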

In configurations where ptlocks are dynamically allocated (32-bit,
PREEMPT_RT, etc.) and ARCH_ENABLE_SPLIT_PMD_PTLOCK is selected, this
series results in the removal of a kmem_cache allocation for every
kernel PMD. Additionally, for certain architectures that do not use
<asm-generic/pgalloc.h> such as s390, the same optimisation occurs at
the PTE level.

---

Things get more complicated when it comes to special pgtable allocators
(patches 7-11). All architectures need such allocators to create initial
kernel pgtables; we are not concerned with those as the ctor cannot be
called so early in the boot sequence. However, those allocators may also
be used later in the boot sequence or during normal operations. There
are two main use-cases:

1. Mapping EFI memory: efi_mm (arm, arm64, riscv)
2. arch_add_memory(): init_mm

The ctor is already explicitly run (at the PTE/PMD level) in the first
case, as required for pgtables that are not associated with init_mm.
However the same allocators may also be used for the second use-case (or
others), and this is where it gets messy. Patch 1 calls the ctor with
NULL as mm in those situations, as the actual mm isn't available.
Practically this means that ptlocks will be unconditionally initialised.
This is fine on arm - create_mapping_late() is only used for the EFI
mapping. On arm64, __create_pgd_mapping() is also used by
arch_add_memory(); patches 7, 8 and 10 ensure that ctors are called at
all levels with the appropriate mm. The situation is similar on riscv,
but propagating the mm down to the ctor would require significant
refactoring. Since the ctors are already called unconditionally there,
this series leaves riscv no worse off - patch 9 adds comments to clarify
the situation.
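
As a rough illustration of what "calling the ctor with the appropriate
mm" can look like in a special allocator, the userspace sketch below
selects the mm through the allocation callback handed to the mapping
code. Every name in it (model_ctor(), alloc_for_init_mm(),
alloc_for_special_mm(), create_mapping(), enum pgtable_level and its
values) is hypothetical and chosen for illustration only; it is not the
code of patches 7-11.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct mm_struct { int unused; };
struct ptdesc { bool has_ptlock; };

static struct mm_struct init_mm;	/* models the kernel's init_mm */

/* Hypothetical level identifier passed to the allocator. */
enum pgtable_level { LEVEL_PTE, LEVEL_PMD, LEVEL_PUD };

/*
 * Models the ctor: split locks exist only at the PTE/PMD levels and are
 * skipped when the table belongs to init_mm.
 */
static void model_ctor(struct mm_struct *mm, enum pgtable_level level,
		       struct ptdesc *ptdesc)
{
	bool level_has_ptlock = (level == LEVEL_PTE || level == LEVEL_PMD);

	ptdesc->has_ptlock = level_has_ptlock && mm != &init_mm;
}

/* Hypothetical allocator callback type: one callback per kind of owner. */
typedef void (*pgtable_alloc_fn)(enum pgtable_level level,
				 struct ptdesc *ptdesc);

/* For arch_add_memory()-style callers: tables belong to init_mm. */
static void alloc_for_init_mm(enum pgtable_level level, struct ptdesc *ptdesc)
{
	model_ctor(&init_mm, level, ptdesc);
}

/* For EFI-style callers: the mm is not known here, so NULL is passed. */
static void alloc_for_special_mm(enum pgtable_level level,
				 struct ptdesc *ptdesc)
{
	model_ctor(NULL, level, ptdesc);
}

/* Models a mapping helper that needs a new PTE table. */
static void create_mapping(pgtable_alloc_fn alloc, struct ptdesc *pte_table)
{
	alloc(LEVEL_PTE, pte_table);
}
```

The design choice sketched here is that the caller, which knows which mm
it is populating, picks the callback, so the low-level mapping code does
not need an extra mm parameter.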

From a cursory look at other architectures implementing
arch_add_memory(), s390 and x86 may also need a similar treatment to add
constructor calls. This is to be taken care of in a future version or as
a follow-up.

---

The complications in those special pgtable allocators raise the question:
does it really make sense to treat efi_mm and init_mm differently in
e.g. apply_to_pte_range()? Maybe what we really need is a way to tell if
an mm corresponds to user memory or not, and never use split locks for
non-user mm's. Feedback and suggestions welcome!

- Kevin

[1] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/
[2] https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/
---
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-csky@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-openrisc@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
---
Kevin Brodsky (11):
  mm: Pass mm down to pagetable_{pte,pmd}_ctor
  mm: Call ctor/dtor for kernel PTEs
  m68k: mm: Call ctor/dtor for kernel PTEs
  powerpc: mm: Call ctor/dtor for kernel PTEs
  sparc64: mm: Call ctor/dtor for kernel PTEs
  mm: Skip ptlock_init() for kernel PMDs
  arm64: mm: Use enum to identify pgtable level instead of *_SHIFT
  arm64: mm: Always call PTE/PMD ctor in __create_pgd_mapping()
  riscv: mm: Clarify ctor mm argument in alloc_{pte,pmd}_late
  arm64: mm: Call PUD/P4D ctor in __create_pgd_mapping()
  riscv: mm: Call PUD/P4D ctor in special kernel pgtable alloc

 arch/arm/mm/mmu.c                        |  2 +-
 arch/arm64/mm/mmu.c                      | 91 ++++++++++++++----------
 arch/csky/include/asm/pgalloc.h          |  2 +-
 arch/loongarch/include/asm/pgalloc.h     |  2 +-
 arch/m68k/include/asm/mcf_pgalloc.h      |  8 ++-
 arch/m68k/include/asm/motorola_pgalloc.h | 10 +--
 arch/m68k/mm/motorola.c                  |  6 +-
 arch/microblaze/mm/pgtable.c             |  2 +-
 arch/mips/include/asm/pgalloc.h          |  2 +-
 arch/openrisc/mm/ioremap.c               |  2 +-
 arch/parisc/include/asm/pgalloc.h        |  2 +-
 arch/powerpc/mm/book3s64/pgtable.c       |  2 +-
 arch/powerpc/mm/pgtable-frag.c           | 30 ++++----
 arch/riscv/mm/init.c                     | 26 ++++---
 arch/s390/include/asm/pgalloc.h          |  2 +-
 arch/s390/mm/pgalloc.c                   |  2 +-
 arch/sparc/mm/init_64.c                  | 29 ++++----
 arch/sparc/mm/srmmu.c                    |  2 +-
 arch/x86/mm/pgtable.c                    |  2 +-
 include/asm-generic/pgalloc.h            | 11 ++-
 include/linux/mm.h                       | 10 +--
 21 files changed, 137 insertions(+), 108 deletions(-)


base-commit: 4701f33a10702d5fc577c32434eb62adde0a1ae1
-- 
2.47.0



* [PATCH 01/11] mm: Pass mm down to pagetable_{pte,pmd}_ctor
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 02/11] mm: Call ctor/dtor for kernel PTEs Kevin Brodsky
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

In preparation for calling constructors for all kernel page tables
while eliding unnecessary ptlock initialisation, let's pass down the
associated mm to the PTE/PMD level ctors. (These are the two levels
where ptlocks are used.)

In most cases the mm is already around at the point of calling the
ctor so we simply pass it down. This is however not the case for
special page table allocators:

* arch/arm/mm/mmu.c
* arch/arm64/mm/mmu.c
* arch/riscv/mm/init.c

In those cases, the page tables being allocated are either for
standard kernel memory (init_mm) or special page directories, which
may not be associated with any mm. For now let's pass NULL as mm; this
will be refined where possible in future patches.

No functional change in this patch.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm/mm/mmu.c                        |  2 +-
 arch/arm64/mm/mmu.c                      |  4 ++--
 arch/loongarch/include/asm/pgalloc.h     |  2 +-
 arch/m68k/include/asm/mcf_pgalloc.h      |  2 +-
 arch/m68k/include/asm/motorola_pgalloc.h | 10 +++++-----
 arch/m68k/mm/motorola.c                  |  6 +++---
 arch/mips/include/asm/pgalloc.h          |  2 +-
 arch/parisc/include/asm/pgalloc.h        |  2 +-
 arch/powerpc/mm/book3s64/pgtable.c       |  2 +-
 arch/powerpc/mm/pgtable-frag.c           |  2 +-
 arch/riscv/mm/init.c                     |  4 ++--
 arch/s390/include/asm/pgalloc.h          |  2 +-
 arch/s390/mm/pgalloc.c                   |  2 +-
 arch/sparc/mm/init_64.c                  |  2 +-
 arch/sparc/mm/srmmu.c                    |  2 +-
 arch/x86/mm/pgtable.c                    |  2 +-
 include/asm-generic/pgalloc.h            |  4 ++--
 include/linux/mm.h                       |  6 ++++--
 18 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index f02f872ea8a9..edb7f56b7c91 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -735,7 +735,7 @@ static void *__init late_alloc(unsigned long sz)
 	void *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM,
 			get_order(sz));
 
-	if (!ptdesc || !pagetable_pte_ctor(ptdesc))
+	if (!ptdesc || !pagetable_pte_ctor(NULL, ptdesc))
 		BUG();
 	return ptdesc_to_virt(ptdesc);
 }
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 1dfe1a8efdbe..437d4977bcf5 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -494,9 +494,9 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 	 * folded, and if so pagetable_pte_ctor() becomes nop.
 	 */
 	if (shift == PAGE_SHIFT)
-		BUG_ON(!pagetable_pte_ctor(ptdesc));
+		BUG_ON(!pagetable_pte_ctor(NULL, ptdesc));
 	else if (shift == PMD_SHIFT)
-		BUG_ON(!pagetable_pmd_ctor(ptdesc));
+		BUG_ON(!pagetable_pmd_ctor(NULL, ptdesc));
 
 	return pa;
 }
diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
index 7211dff8c969..57b7dbcee259 100644
--- a/arch/loongarch/include/asm/pgalloc.h
+++ b/arch/loongarch/include/asm/pgalloc.h
@@ -72,7 +72,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 	if (!ptdesc)
 		return NULL;
 
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h
index 4c648b51e7fd..465a71101b7d 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -48,7 +48,7 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/arch/m68k/include/asm/motorola_pgalloc.h b/arch/m68k/include/asm/motorola_pgalloc.h
index 5abe7da8ac5a..1091fb0affbe 100644
--- a/arch/m68k/include/asm/motorola_pgalloc.h
+++ b/arch/m68k/include/asm/motorola_pgalloc.h
@@ -15,7 +15,7 @@ enum m68k_table_types {
 };
 
 extern void init_pointer_table(void *table, int type);
-extern void *get_pointer_table(int type);
+extern void *get_pointer_table(struct mm_struct *mm, int type);
 extern int free_pointer_table(void *table, int type);
 
 /*
@@ -26,7 +26,7 @@ extern int free_pointer_table(void *table, int type);
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
-	return get_pointer_table(TABLE_PTE);
+	return get_pointer_table(mm, TABLE_PTE);
 }
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
@@ -36,7 +36,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-	return get_pointer_table(TABLE_PTE);
+	return get_pointer_table(mm, TABLE_PTE);
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
@@ -53,7 +53,7 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable,
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
-	return get_pointer_table(TABLE_PMD);
+	return get_pointer_table(mm, TABLE_PMD);
 }
 
 static inline int pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -75,7 +75,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return get_pointer_table(TABLE_PGD);
+	return get_pointer_table(mm, TABLE_PGD);
 }
 
 
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 73651e093c4d..6ab3ef39ba7a 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -139,7 +139,7 @@ void __init init_pointer_table(void *table, int type)
 	return;
 }
 
-void *get_pointer_table(int type)
+void *get_pointer_table(struct mm_struct *mm, int type)
 {
 	ptable_desc *dp = ptable_list[type].next;
 	unsigned int mask = list_empty(&ptable_list[type]) ? 0 : PD_MARKBITS(dp);
@@ -164,10 +164,10 @@ void *get_pointer_table(int type)
 			 * m68k doesn't have SPLIT_PTE_PTLOCKS for not having
 			 * SMP.
 			 */
-			pagetable_pte_ctor(virt_to_ptdesc(page));
+			pagetable_pte_ctor(mm, virt_to_ptdesc(page));
 			break;
 		case TABLE_PMD:
-			pagetable_pmd_ctor(virt_to_ptdesc(page));
+			pagetable_pmd_ctor(mm, virt_to_ptdesc(page));
 			break;
 		case TABLE_PGD:
 			pagetable_pgd_ctor(virt_to_ptdesc(page));
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index 26c7a6ede983..415237af4029 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -65,7 +65,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 	if (!ptdesc)
 		return NULL;
 
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h
index 2ca74a56415c..3b84ee93edaa 100644
--- a/arch/parisc/include/asm/pgalloc.h
+++ b/arch/parisc/include/asm/pgalloc.h
@@ -39,7 +39,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 	ptdesc = pagetable_alloc(gfp, PMD_TABLE_ORDER);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index ce64abea9e3e..83afc4e72088 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -423,7 +423,7 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 	ptdesc = pagetable_alloc(gfp, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 713268ccb1a0..387e9b1fe12c 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -61,7 +61,7 @@ static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel)
 		ptdesc = pagetable_alloc(PGALLOC_GFP | __GFP_ACCOUNT, 0);
 		if (!ptdesc)
 			return NULL;
-		if (!pagetable_pte_ctor(ptdesc)) {
+		if (!pagetable_pte_ctor(mm, ptdesc)) {
 			pagetable_free(ptdesc);
 			return NULL;
 		}
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 15b2eda4c364..703c3648cfa9 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -409,7 +409,7 @@ static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-	BUG_ON(!ptdesc || !pagetable_pte_ctor(ptdesc));
+	BUG_ON(!ptdesc || !pagetable_pte_ctor(NULL, ptdesc));
 	return __pa((pte_t *)ptdesc_address(ptdesc));
 }
 
@@ -489,7 +489,7 @@ static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-	BUG_ON(!ptdesc || !pagetable_pmd_ctor(ptdesc));
+	BUG_ON(!ptdesc || !pagetable_pmd_ctor(NULL, ptdesc));
 	return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
 
diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index b19b6ed2ab53..cd65b93d87e9 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -98,7 +98,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
 	if (!table)
 		return NULL;
 	crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
-	if (!pagetable_pmd_ctor(virt_to_ptdesc(table))) {
+	if (!pagetable_pmd_ctor(mm, virt_to_ptdesc(table))) {
 		crst_table_free(mm, table);
 		return NULL;
 	}
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 30387a6e98ff..35b0ab6fab24 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -170,7 +170,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 05882bca5b73..cd60a0a8ca0e 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2899,7 +2899,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
index dd32711022f5..f8fb4911d360 100644
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -350,7 +350,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 	page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
 	spin_lock(&mm->page_table_lock);
 	if (page_ref_inc_return(page) == 2 &&
-			!pagetable_pte_ctor(page_ptdesc(page))) {
+			!pagetable_pte_ctor(mm, page_ptdesc(page))) {
 		page_ref_dec(page);
 		ptep = NULL;
 	}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 1fef5ad32d5a..2d20f021c50b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -249,7 +249,7 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
 
 		if (!ptdesc)
 			failed = true;
-		if (ptdesc && !pagetable_pmd_ctor(ptdesc)) {
+		if (ptdesc && !pagetable_pmd_ctor(mm, ptdesc)) {
 			pagetable_free(ptdesc);
 			ptdesc = NULL;
 			failed = true;
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 892ece4558a2..e164ca66f0f6 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -70,7 +70,7 @@ static inline pgtable_t __pte_alloc_one_noprof(struct mm_struct *mm, gfp_t gfp)
 	ptdesc = pagetable_alloc_noprof(gfp, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pte_ctor(ptdesc)) {
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
@@ -137,7 +137,7 @@ static inline pmd_t *pmd_alloc_one_noprof(struct mm_struct *mm, unsigned long ad
 	ptdesc = pagetable_alloc_noprof(gfp, 0);
 	if (!ptdesc)
 		return NULL;
-	if (!pagetable_pmd_ctor(ptdesc)) {
+	if (!pagetable_pmd_ctor(mm, ptdesc)) {
 		pagetable_free(ptdesc);
 		return NULL;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8483e09aeb2c..d92c16f6cfa2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3015,7 +3015,8 @@ static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
 	pagetable_free(ptdesc);
 }
 
-static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pte_ctor(struct mm_struct *mm,
+				      struct ptdesc *ptdesc)
 {
 	if (!ptlock_init(ptdesc))
 		return false;
@@ -3121,7 +3122,8 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 	return ptl;
 }
 
-static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
+static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
+				      struct ptdesc *ptdesc)
 {
 	if (!pmd_ptlock_init(ptdesc))
 		return false;
-- 
2.47.0



* [PATCH 02/11] mm: Call ctor/dtor for kernel PTEs
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 01/11] mm: Pass mm down to pagetable_{pte,pmd}_ctor Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-24  8:37   ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 03/11] m68k: " Kevin Brodsky
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

Since [1], constructors/destructors are expected to be called for
all page table pages, at all levels and for both user and kernel
pgtables. There is however one glaring exception: kernel PTEs are
managed via separate helpers (pte_alloc_kernel/pte_free_kernel),
which do not call the [cd]tor, at least not in the generic
implementation.

The most obvious reason for this anomaly is that init_mm is
special-cased not to use split page table locks. As a result calling
ptlock_init() for PTEs associated with init_mm would be wasteful,
potentially resulting in dynamic memory allocation. However, pgtable
[cd]tors perform other actions - currently related to
accounting/statistics, and potentially more functionally significant
in the future.

Now that pagetable_pte_ctor() is passed the associated mm, we can
make it skip the call to ptlock_init() for init_mm; this allows us
to call the ctor from pte_alloc_one_kernel() too. This is matched by
a call to the pgtable destructor in pte_free_kernel(); no
special-casing is needed on that path, as ptlock_free() is already
called unconditionally. (ptlock_free() is a no-op unless a ptlock
was allocated for the given PTP.)

This patch ensures that all architectures that rely on
<asm-generic/pgalloc.h> call the [cd]tor for kernel PTEs.
pte_free_kernel() cannot be overridden so changing the generic
implementation is sufficient. pte_alloc_one_kernel() can be
overridden using __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL, and a few
architectures implement it by calling the page allocator directly.
We amend those so that they call the generic
__pte_alloc_one_kernel() instead, if possible, ensuring that the
ctor is called.

A few architectures do not use <asm-generic/pgalloc.h>; those will
be taken care of separately.

[1] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/csky/include/asm/pgalloc.h | 2 +-
 arch/microblaze/mm/pgtable.c    | 2 +-
 arch/openrisc/mm/ioremap.c      | 2 +-
 include/asm-generic/pgalloc.h   | 7 ++++++-
 include/linux/mm.h              | 2 +-
 5 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
index bf8400c28b5a..288dca0d160a 100644
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -29,7 +29,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 	pte_t *pte;
 	unsigned long i;
 
-	pte = (pte_t *) __get_free_page(GFP_KERNEL);
+	pte = __pte_alloc_one_kernel(mm);
 	if (!pte)
 		return NULL;
 
diff --git a/arch/microblaze/mm/pgtable.c b/arch/microblaze/mm/pgtable.c
index 9f73265aad4e..e96dd1b7aba4 100644
--- a/arch/microblaze/mm/pgtable.c
+++ b/arch/microblaze/mm/pgtable.c
@@ -245,7 +245,7 @@ unsigned long iopa(unsigned long addr)
 __ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
 	if (mem_init_done)
-		return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+		return __pte_alloc_one_kernel(mm);
 	else
 		return memblock_alloc_try_nid(PAGE_SIZE, PAGE_SIZE,
 					      MEMBLOCK_LOW_LIMIT,
diff --git a/arch/openrisc/mm/ioremap.c b/arch/openrisc/mm/ioremap.c
index 8e63e86251ca..3b352f97fecb 100644
--- a/arch/openrisc/mm/ioremap.c
+++ b/arch/openrisc/mm/ioremap.c
@@ -36,7 +36,7 @@ pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm)
 	pte_t *pte;
 
 	if (likely(mem_init_done)) {
-		pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
+		pte = __pte_alloc_one_kernel(mm);
 	} else {
 		pte = memblock_alloc_or_panic(PAGE_SIZE, PAGE_SIZE);
 	}
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index e164ca66f0f6..3c8ec3bfea44 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -23,6 +23,11 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
 
 	if (!ptdesc)
 		return NULL;
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
+
 	return ptdesc_address(ptdesc);
 }
 #define __pte_alloc_one_kernel(...)	alloc_hooks(__pte_alloc_one_kernel_noprof(__VA_ARGS__))
@@ -48,7 +53,7 @@ static inline pte_t *pte_alloc_one_kernel_noprof(struct mm_struct *mm)
  */
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-	pagetable_free(virt_to_ptdesc(pte));
+	pagetable_dtor_free(virt_to_ptdesc(pte));
 }
 
 /**
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d92c16f6cfa2..ee31ffd7ead2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3018,7 +3018,7 @@ static inline void pagetable_dtor_free(struct ptdesc *ptdesc)
 static inline bool pagetable_pte_ctor(struct mm_struct *mm,
 				      struct ptdesc *ptdesc)
 {
-	if (!ptlock_init(ptdesc))
+	if (mm != &init_mm && !ptlock_init(ptdesc))
 		return false;
 	__pagetable_ctor(ptdesc);
 	return true;
-- 
2.47.0



* [PATCH 03/11] m68k: mm: Call ctor/dtor for kernel PTEs
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 01/11] mm: Pass mm down to pagetable_{pte,pmd}_ctor Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 02/11] mm: Call ctor/dtor for kernel PTEs Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 04/11] powerpc: " Kevin Brodsky
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

The generic implementation of pte_{alloc_one,free}_kernel now calls
the [cd]tor. Align the m68k/ColdFire implementation of those
functions by calling the [cd]tor explicitly.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/m68k/include/asm/mcf_pgalloc.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h
index 465a71101b7d..fc5454d37da3 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -7,7 +7,7 @@
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-	pagetable_free(virt_to_ptdesc(pte));
+	pagetable_dtor_free(virt_to_ptdesc(pte));
 }
 
 extern const char bad_pmd_string[];
@@ -19,6 +19,10 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 
 	if (!ptdesc)
 		return NULL;
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
+	}
 
 	return ptdesc_address(ptdesc);
 }
-- 
2.47.0



* [PATCH 04/11] powerpc: mm: Call ctor/dtor for kernel PTEs
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (2 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 03/11] m68k: " Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 05/11] sparc64: " Kevin Brodsky
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

The generic implementation of pte_{alloc_one,free}_kernel now calls
the [cd]tor, without initialising the ptlock needlessly as
pagetable_pte_ctor() skips it for init_mm.

On powerpc, all functions related to PTE allocation are implemented
by common helpers, which are passed a boolean to differentiate user
from kernel pgtables. This patch aligns the powerpc implementation
with the generic one by calling pagetable_pte_[cd]tor()
unconditionally in those helpers.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
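Note for readers: the reworked branch in pte_fragment_free() relies on
short-circuit evaluation, so the active flag is only tested (and cleared)
for user page tables. The following is a minimal user-space sketch of that
decision only, not kernel code; every model_* name is a hypothetical
stand-in, and the real pte_free_now()/call_rcu() calls are reduced to an
enum value.

```c
#include <assert.h>
#include <stdbool.h>

/* Which of the two free paths the reworked condition selects. */
enum model_free_path { MODEL_FREE_NOW, MODEL_FREE_RCU };

/* Hypothetical stand-in for the folio's "active" flag. */
struct model_ptdesc {
	bool active;
};

/* Mimics a test-and-clear helper: return the old value, then clear it. */
static bool model_test_clear_active(struct model_ptdesc *pt)
{
	bool was_active = pt->active;

	pt->active = false;
	return was_active;
}

/*
 * Same shape as the new condition in the patch: kernel page tables
 * never reach the test-and-clear because "||" short-circuits.
 */
static enum model_free_path model_free_decision(struct model_ptdesc *pt,
						bool kernel)
{
	if (kernel || !model_test_clear_active(pt))
		return MODEL_FREE_NOW;
	return MODEL_FREE_RCU;
}
```

In this model a kernel table is always freed immediately and its flag is
left untouched, while a user table goes through the deferred path only if
the flag was set.
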
 arch/powerpc/mm/pgtable-frag.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 387e9b1fe12c..77e55eac16e4 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -56,19 +56,17 @@ static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel)
 {
 	void *ret = NULL;
 	struct ptdesc *ptdesc;
+	gfp_t gfp = PGALLOC_GFP;
 
-	if (!kernel) {
-		ptdesc = pagetable_alloc(PGALLOC_GFP | __GFP_ACCOUNT, 0);
-		if (!ptdesc)
-			return NULL;
-		if (!pagetable_pte_ctor(mm, ptdesc)) {
-			pagetable_free(ptdesc);
-			return NULL;
-		}
-	} else {
-		ptdesc = pagetable_alloc(PGALLOC_GFP, 0);
-		if (!ptdesc)
-			return NULL;
+	if (!kernel)
+		gfp |= __GFP_ACCOUNT;
+
+	ptdesc = pagetable_alloc(gfp, 0);
+	if (!ptdesc)
+		return NULL;
+	if (!pagetable_pte_ctor(mm, ptdesc)) {
+		pagetable_free(ptdesc);
+		return NULL;
 	}
 
 	atomic_set(&ptdesc->pt_frag_refcount, 1);
@@ -124,12 +122,10 @@ void pte_fragment_free(unsigned long *table, int kernel)
 
 	BUG_ON(atomic_read(&ptdesc->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&ptdesc->pt_frag_refcount)) {
-		if (kernel)
-			pagetable_free(ptdesc);
-		else if (folio_test_clear_active(ptdesc_folio(ptdesc)))
-			call_rcu(&ptdesc->pt_rcu_head, pte_free_now);
-		else
+		if (kernel || !folio_test_clear_active(ptdesc_folio(ptdesc)))
 			pte_free_now(&ptdesc->pt_rcu_head);
+		else
+			call_rcu(&ptdesc->pt_rcu_head, pte_free_now);
 	}
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 05/11] sparc64: mm: Call ctor/dtor for kernel PTEs
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (3 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 04/11] powerpc: " Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 06/11] mm: Skip ptlock_init() for kernel PMDs Kevin Brodsky
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

The generic implementation of pte_{alloc_one,free}_kernel now calls
the [cd]tor, without initialising the ptlock needlessly as
pagetable_pte_ctor() skips it for init_mm.

Align sparc64 with the generic implementation by ensuring
pagetable_pte_[cd]tor() are called for kernel PTEs. As a result
the kernel and user alloc/free functions have the same
implementation, and since pgtable_t is defined as pte_t *, we can
have both call a common helper.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/sparc/mm/init_64.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index cd60a0a8ca0e..0a1b2b1adad4 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2882,18 +2882,7 @@ void __flush_tlb_all(void)
 			     : : "r" (pstate));
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
-{
-	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	pte_t *pte = NULL;
-
-	if (page)
-		pte = (pte_t *) page_address(page);
-
-	return pte;
-}
-
-pgtable_t pte_alloc_one(struct mm_struct *mm)
+static pte_t *__pte_alloc_one(struct mm_struct *mm)
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0);
 
@@ -2906,9 +2895,14 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 	return ptdesc_address(ptdesc);
 }
 
-void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
-	free_page((unsigned long)pte);
+	return __pte_alloc_one(mm);
+}
+
+pgtable_t pte_alloc_one(struct mm_struct *mm)
+{
+	return __pte_alloc_one(mm);
 }
 
 static void __pte_free(pgtable_t pte)
@@ -2919,6 +2913,11 @@ static void __pte_free(pgtable_t pte)
 	pagetable_free(ptdesc);
 }
 
+void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+	__pte_free(pte);
+}
+
 void pte_free(struct mm_struct *mm, pgtable_t pte)
 {
 	__pte_free(pte);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 06/11] mm: Skip ptlock_init() for kernel PMDs
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (4 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 05/11] sparc64: " Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 07/11] arm64: mm: Use enum to identify pgtable level instead of *_SHIFT Kevin Brodsky
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

Split page table locks are not used for pgtables associated to
init_mm, at any level. pte_alloc_kernel() does not call
ptlock_init() as a result. There are however no separate alloc/free
functions for kernel PMDs, and pmd_ptlock_init() is called
unconditionally. When ALLOC_SPLIT_PTLOCKS is true (e.g. 32-bit
architectures or if CONFIG_PREEMPT_RT is selected), this results in
unnecessary dynamic memory allocation every time a kernel PMD is
allocated.

Now that pagetable_pmd_ctor() is passed the associated mm, we can
easily remove this overhead by skipping pmd_ptlock_init() if the
pgtable is associated to init_mm. No special-casing is needed on the
dtor path, as ptlock_free() is already called unconditionally for
all levels. (ptlock_free() is a no-op unless a ptlock was allocated
for the given PTP.)

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
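Note for readers: the effect described above can be sketched in user space
as follows. This is only an illustration under stated assumptions: the
model_* helpers and the reduced struct definitions are hypothetical
stand-ins, malloc() plays the role of the dynamic ptlock allocation that
happens when ALLOC_SPLIT_PTLOCKS is true, and the other ctor steps are
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Reduced stand-ins for the kernel types; not the real definitions. */
struct mm_struct { int unused; };
struct ptdesc { void *ptl; };

static struct mm_struct init_mm;
static int model_ptlock_allocs;	/* counts dynamic ptlock allocations */

/* Models pmd_ptlock_init() when the ptlock is dynamically allocated. */
static bool model_pmd_ptlock_init(struct ptdesc *ptdesc)
{
	ptdesc->ptl = malloc(sizeof(long));
	if (!ptdesc->ptl)
		return false;
	model_ptlock_allocs++;
	return true;
}

/* Models ptlock_free(): harmless if no ptlock was ever allocated. */
static void model_ptlock_free(struct ptdesc *ptdesc)
{
	free(ptdesc->ptl);	/* free(NULL) is a no-op */
	ptdesc->ptl = NULL;
}

/* Same condition as the patched ctor: skip the ptlock for init_mm. */
static bool model_pagetable_pmd_ctor(struct mm_struct *mm,
				     struct ptdesc *ptdesc)
{
	if (mm != &init_mm && !model_pmd_ptlock_init(ptdesc))
		return false;
	return true;
}
```

With this model, constructing a PMD for &init_mm performs no allocation,
constructing one for any other mm performs exactly one, and the free path
can stay unconditional in both cases.
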
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee31ffd7ead2..4759da9cd633 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3125,7 +3125,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
 				      struct ptdesc *ptdesc)
 {
-	if (!pmd_ptlock_init(ptdesc))
+	if (mm != &init_mm && !pmd_ptlock_init(ptdesc))
 		return false;
 	ptdesc_pmd_pts_init(ptdesc);
 	__pagetable_ctor(ptdesc);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 07/11] arm64: mm: Use enum to identify pgtable level instead of *_SHIFT
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (5 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 06/11] mm: Skip ptlock_init() for kernel PMDs Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 08/11] arm64: mm: Always call PTE/PMD ctor in __create_pgd_mapping() Kevin Brodsky
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

Commit 90292aca9854 ("arm64: mm: use appropriate ctors for page
tables") introduced pgtable ctor calls in pgd_pgtable_alloc(). To
identify the pgtable level and call the appropriate ctor, the
*_SHIFT value associated with the pgtable level is used. However,
those values do not unambiguously identify a level, because if a
given level is folded, the *_SHIFT value will be equal to that of
the upper level (e.g. PMD_SHIFT == PUD_SHIFT if PMD is folded).

As things stand, there is probably not much damage done by calling
the ctor for a different level, and ARCH_ENABLE_SPLIT_PMD_PTLOCK is
only selected if PMD isn't folded (so we don't needlessly initialise
pmd_ptlock). Still, this is pretty confusing, and it would get even
more confusing when adding ctor calls for the remaining levels.

Let's simplify all this by using an enum to identify the pgtable
level instead; this way folding becomes irrelevant. This is inspired
by one of the m68k pgtable allocators
(arch/m68k/include/asm/motorola_pgalloc.h).

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
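Note for readers: the ambiguity described above can be reproduced with a
small user-space sketch. The MODEL_*_SHIFT values are illustrative
assumptions for a configuration where the PMD is folded (they are not taken
from any particular arm64 config), and the model_* names are hypothetical.

```c
#include <assert.h>

enum pgtable_type {
	TABLE_PTE,
	TABLE_PMD,
	TABLE_PUD,
	TABLE_P4D,
};

/* Which ctor a dispatcher would end up calling. */
enum model_ctor { MODEL_CTOR_NONE, MODEL_CTOR_PTE, MODEL_CTOR_PMD };

/* Illustrative values only: the PMD is folded, so both shifts are equal. */
#define MODEL_PAGE_SHIFT	12
#define MODEL_PUD_SHIFT		30
#define MODEL_PMD_SHIFT		MODEL_PUD_SHIFT

/* Old scheme: the level is inferred from the shift value. */
static enum model_ctor model_ctor_from_shift(int shift)
{
	if (shift == MODEL_PAGE_SHIFT)
		return MODEL_CTOR_PTE;
	else if (shift == MODEL_PMD_SHIFT)
		return MODEL_CTOR_PMD;
	return MODEL_CTOR_NONE;
}

/* New scheme: the caller states the level explicitly. */
static enum model_ctor model_ctor_from_type(enum pgtable_type type)
{
	switch (type) {
	case TABLE_PTE:
		return MODEL_CTOR_PTE;
	case TABLE_PMD:
		return MODEL_CTOR_PMD;
	default:
		return MODEL_CTOR_NONE;
	}
}
```

In this model a PUD-level allocation identified by its shift is dispatched
to the PMD ctor, whereas the enum-based lookup is unaffected by folding.
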
 arch/arm64/mm/mmu.c | 54 +++++++++++++++++++++++++++------------------
 1 file changed, 33 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 437d4977bcf5..a7292ce9d7b8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -46,6 +46,13 @@
 #define NO_CONT_MAPPINGS	BIT(1)
 #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
 
+enum pgtable_type {
+	TABLE_PTE,
+	TABLE_PMD,
+	TABLE_PUD,
+	TABLE_P4D,
+};
+
 u64 kimage_voffset __ro_after_init;
 EXPORT_SYMBOL(kimage_voffset);
 
@@ -107,7 +114,7 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 }
 EXPORT_SYMBOL(phys_mem_access_prot);
 
-static phys_addr_t __init early_pgtable_alloc(int shift)
+static phys_addr_t __init early_pgtable_alloc(enum pgtable_type pgtable_type)
 {
 	phys_addr_t phys;
 
@@ -192,7 +199,7 @@ static void init_pte(pte_t *ptep, unsigned long addr, unsigned long end,
 static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int),
+				phys_addr_t (*pgtable_alloc)(enum pgtable_type),
 				int flags)
 {
 	unsigned long next;
@@ -207,7 +214,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 		if (flags & NO_EXEC_MAPPINGS)
 			pmdval |= PMD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
-		pte_phys = pgtable_alloc(PAGE_SHIFT);
+		pte_phys = pgtable_alloc(TABLE_PTE);
 		ptep = pte_set_fixmap(pte_phys);
 		init_clear_pgtable(ptep);
 		ptep += pte_index(addr);
@@ -243,7 +250,7 @@ static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 
 static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		     phys_addr_t phys, pgprot_t prot,
-		     phys_addr_t (*pgtable_alloc)(int), int flags)
+		     phys_addr_t (*pgtable_alloc)(enum pgtable_type), int flags)
 {
 	unsigned long next;
 
@@ -277,7 +284,8 @@ static void init_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end,
 static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
-				phys_addr_t (*pgtable_alloc)(int), int flags)
+				phys_addr_t (*pgtable_alloc)(enum pgtable_type),
+				int flags)
 {
 	unsigned long next;
 	pud_t pud = READ_ONCE(*pudp);
@@ -294,7 +302,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 		if (flags & NO_EXEC_MAPPINGS)
 			pudval |= PUD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
-		pmd_phys = pgtable_alloc(PMD_SHIFT);
+		pmd_phys = pgtable_alloc(TABLE_PMD);
 		pmdp = pmd_set_fixmap(pmd_phys);
 		init_clear_pgtable(pmdp);
 		pmdp += pmd_index(addr);
@@ -325,7 +333,7 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 
 static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
+			   phys_addr_t (*pgtable_alloc)(enum pgtable_type),
 			   int flags)
 {
 	unsigned long next;
@@ -339,7 +347,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 		if (flags & NO_EXEC_MAPPINGS)
 			p4dval |= P4D_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
-		pud_phys = pgtable_alloc(PUD_SHIFT);
+		pud_phys = pgtable_alloc(TABLE_PUD);
 		pudp = pud_set_fixmap(pud_phys);
 		init_clear_pgtable(pudp);
 		pudp += pud_index(addr);
@@ -383,7 +391,7 @@ static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
 
 static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 			   phys_addr_t phys, pgprot_t prot,
-			   phys_addr_t (*pgtable_alloc)(int),
+			   phys_addr_t (*pgtable_alloc)(enum pgtable_type),
 			   int flags)
 {
 	unsigned long next;
@@ -397,7 +405,7 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 		if (flags & NO_EXEC_MAPPINGS)
 			pgdval |= PGD_TABLE_PXN;
 		BUG_ON(!pgtable_alloc);
-		p4d_phys = pgtable_alloc(P4D_SHIFT);
+		p4d_phys = pgtable_alloc(TABLE_P4D);
 		p4dp = p4d_set_fixmap(p4d_phys);
 		init_clear_pgtable(p4dp);
 		p4dp += p4d_index(addr);
@@ -427,7 +435,7 @@ static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
 static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 					unsigned long virt, phys_addr_t size,
 					pgprot_t prot,
-					phys_addr_t (*pgtable_alloc)(int),
+					phys_addr_t (*pgtable_alloc)(enum pgtable_type),
 					int flags)
 {
 	unsigned long addr, end, next;
@@ -455,7 +463,7 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
 static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 				 unsigned long virt, phys_addr_t size,
 				 pgprot_t prot,
-				 phys_addr_t (*pgtable_alloc)(int),
+				 phys_addr_t (*pgtable_alloc)(enum pgtable_type),
 				 int flags)
 {
 	mutex_lock(&fixmap_lock);
@@ -468,10 +476,11 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 extern __alias(__create_pgd_mapping_locked)
 void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
 			     phys_addr_t size, pgprot_t prot,
-			     phys_addr_t (*pgtable_alloc)(int), int flags);
+			     phys_addr_t (*pgtable_alloc)(enum pgtable_type),
+			     int flags);
 #endif
 
-static phys_addr_t __pgd_pgtable_alloc(int shift)
+static phys_addr_t __pgd_pgtable_alloc(enum pgtable_type pgtable_type)
 {
 	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
 	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
@@ -480,23 +489,26 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
 	return __pa(ptr);
 }
 
-static phys_addr_t pgd_pgtable_alloc(int shift)
+static phys_addr_t pgd_pgtable_alloc(enum pgtable_type pgtable_type)
 {
-	phys_addr_t pa = __pgd_pgtable_alloc(shift);
+	phys_addr_t pa = __pgd_pgtable_alloc(pgtable_type);
 	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
 
 	/*
 	 * Call proper page table ctor in case later we need to
 	 * call core mm functions like apply_to_page_range() on
 	 * this pre-allocated page table.
-	 *
-	 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
-	 * folded, and if so pagetable_pte_ctor() becomes nop.
 	 */
-	if (shift == PAGE_SHIFT)
+	switch (pgtable_type) {
+	case TABLE_PTE:
 		BUG_ON(!pagetable_pte_ctor(NULL, ptdesc));
-	else if (shift == PMD_SHIFT)
+		break;
+	case TABLE_PMD:
 		BUG_ON(!pagetable_pmd_ctor(NULL, ptdesc));
+		break;
+	default:
+		break;
+	}
 
 	return pa;
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 08/11] arm64: mm: Always call PTE/PMD ctor in __create_pgd_mapping()
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (6 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 07/11] arm64: mm: Use enum to identify pgtable level instead of *_SHIFT Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 09/11] riscv: mm: Clarify ctor mm argument in alloc_{pte,pmd}_late Kevin Brodsky
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

TL;DR: always call the PTE/PMD ctor, passing the appropriate mm to
skip ptlock_init() if unneeded.

__create_pgd_mapping() is used for creating different kinds of
mappings, and may allocate page table pages if passed an allocator
callback. There are currently three such cases:

1. create_pgd_mapping(), which is used to create the EFI mapping
2. arch_add_memory()
3. map_entry_trampoline()

1. uses pgd_pgtable_alloc() as allocator callback, which calls the
PTE/PMD ctor, while 2. and 3. use __pgd_pgtable_alloc(), which does
not. The rationale is most likely that pgtables associated with
init_mm do not make use of split page table locks, and it is
therefore unnecessary to initialise them by calling the ctor. 2.
operates on swapper_pg_dir so the allocated pgtables are clearly
associated with init_mm; this is arguably the case for 3. too (the
trampoline mapping is never modified so ptlocks are anyway
irrelevant). 1. corresponds to efi_mm so ptlocks need to be
initialised in that case.

We are now moving towards calling the ctor for all page tables, even
those associated with init_mm. pagetable_{pte,pmd}_ctor() have
become aware of the associated mm so that the ptlock initialisation
can be skipped for init_mm. This patch therefore amends the
allocator callbacks so that the PTE/PMD ctor are always called, with
an appropriate mm pointer to avoid unnecessary ptlock overhead.

Modifying the prototype of the allocator callbacks to take the mm
and propagating that pointer all the way down would be pretty
invasive. Instead:

* __pgd_pgtable_alloc() (cases 2. and 3. above) is replaced with
  pgd_pgtable_alloc_init_mm(), resulting in the ctors being called
  with &init_mm. This is the main functional change in this patch;
  the ptlock still isn't initialised, but other ctor actions (e.g.
  accounting-related) are now carried out for those allocated
  pgtables.

* pgd_pgtable_alloc() (case 1. above) is replaced with
  pgd_pgtable_alloc_special_mm(), resulting in the ctors being
  called with NULL as mm. No functional change here; NULL
  essentially means "not init_mm", and the ptlock is still
  initialised.

__pgd_pgtable_alloc() is now the common implementation of those two
helpers. While at it we switch it to using pagetable_alloc() like
standard pgtable allocator functions and remove the comment
regarding ctor calls (ctors are now always expected to be called).

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
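Note for readers: the wrapper approach described above (keeping the
one-argument callback prototype and binding the mm in thin helpers) can be
sketched in user space as below. This is a simplified model, not the arm64
code: the model_* names, the dummy return value and the bookkeeping
variables are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum pgtable_type { TABLE_PTE, TABLE_PMD, TABLE_PUD, TABLE_P4D };

struct mm_struct { int unused; };	/* reduced stand-in */
static struct mm_struct init_mm;

typedef unsigned long model_phys_addr_t;
/* The callback prototype stays unchanged: it only receives the level. */
typedef model_phys_addr_t (*model_alloc_fn)(enum pgtable_type);

static const struct mm_struct *model_last_ctor_mm;	/* mm seen by the ctor */
static int model_ctor_calls;

/* Common implementation: the only place that knows about the mm. */
static model_phys_addr_t model_alloc_common(struct mm_struct *mm,
					    enum pgtable_type type)
{
	if (type == TABLE_PTE || type == TABLE_PMD) {
		model_last_ctor_mm = mm;	/* stands in for the ctor call */
		model_ctor_calls++;
	}
	return 0x1000;	/* dummy "physical address" */
}

/* Thin wrappers bind the mm so that callers need no extra parameter. */
static model_phys_addr_t model_alloc_init_mm(enum pgtable_type type)
{
	return model_alloc_common(&init_mm, type);
}

static model_phys_addr_t model_alloc_special_mm(enum pgtable_type type)
{
	return model_alloc_common(NULL, type);
}

/* Stand-in for a mapping routine that only sees the callback. */
static void model_create_mapping(model_alloc_fn alloc)
{
	alloc(TABLE_PMD);
	alloc(TABLE_PTE);
}
```

The mapping routine never sees the mm; the choice between &init_mm and
NULL is made once, at the call site, by picking a wrapper.
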
 arch/arm64/mm/mmu.c | 41 +++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a7292ce9d7b8..accb0a33c59f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -480,31 +480,22 @@ void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys, unsigned long virt,
 			     int flags);
 #endif
 
-static phys_addr_t __pgd_pgtable_alloc(enum pgtable_type pgtable_type)
+static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm,
+				       enum pgtable_type pgtable_type)
 {
 	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
-	void *ptr = (void *)__get_free_page(GFP_PGTABLE_KERNEL & ~__GFP_ZERO);
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_ZERO, 0);
+	phys_addr_t pa;
 
-	BUG_ON(!ptr);
-	return __pa(ptr);
-}
-
-static phys_addr_t pgd_pgtable_alloc(enum pgtable_type pgtable_type)
-{
-	phys_addr_t pa = __pgd_pgtable_alloc(pgtable_type);
-	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
+	BUG_ON(!ptdesc);
+	pa = page_to_phys(ptdesc_page(ptdesc));
 
-	/*
-	 * Call proper page table ctor in case later we need to
-	 * call core mm functions like apply_to_page_range() on
-	 * this pre-allocated page table.
-	 */
 	switch (pgtable_type) {
 	case TABLE_PTE:
-		BUG_ON(!pagetable_pte_ctor(NULL, ptdesc));
+		BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
 		break;
 	case TABLE_PMD:
-		BUG_ON(!pagetable_pmd_ctor(NULL, ptdesc));
+		BUG_ON(!pagetable_pmd_ctor(mm, ptdesc));
 		break;
 	default:
 		break;
@@ -513,6 +504,16 @@ static phys_addr_t pgd_pgtable_alloc(enum pgtable_type pgtable_type)
 	return pa;
 }
 
+static phys_addr_t pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
+{
+	return __pgd_pgtable_alloc(&init_mm, pgtable_type);
+}
+
+static phys_addr_t pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type)
+{
+	return __pgd_pgtable_alloc(NULL, pgtable_type);
+}
+
 /*
  * This function can only be used to modify existing table entries,
  * without allocating new levels of table. Note that this permits the
@@ -542,7 +543,7 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(mm->pgd, phys, virt, size, prot,
-			     pgd_pgtable_alloc, flags);
+			     pgd_pgtable_alloc_special_mm, flags);
 }
 
 static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
@@ -756,7 +757,7 @@ static int __init map_entry_trampoline(void)
 	memset(tramp_pg_dir, 0, PGD_SIZE);
 	__create_pgd_mapping(tramp_pg_dir, pa_start, TRAMP_VALIAS,
 			     entry_tramp_text_size(), prot,
-			     __pgd_pgtable_alloc, NO_BLOCK_MAPPINGS);
+			     pgd_pgtable_alloc_init_mm, NO_BLOCK_MAPPINGS);
 
 	/* Map both the text and data into the kernel page table */
 	for (i = 0; i < DIV_ROUND_UP(entry_tramp_text_size(), PAGE_SIZE); i++)
@@ -1362,7 +1363,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
-			     size, params->pgprot, __pgd_pgtable_alloc,
+			     size, params->pgprot, pgd_pgtable_alloc_init_mm,
 			     flags);
 
 	memblock_clear_nomap(start, size);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 09/11] riscv: mm: Clarify ctor mm argument in alloc_{pte,pmd}_late
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (7 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 08/11] arm64: mm: Always call PTE/PMD ctor in __create_pgd_mapping() Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:16 ` [PATCH 10/11] arm64: mm: Call PUD/P4D ctor in __create_pgd_mapping() Kevin Brodsky
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

pagetable_{pte,pmd}_ctor(mm, ptdesc) skip the ptlock initialisation
if mm is &init_mm. To avoid unnecessary overhead, it is therefore
preferable to pass the actual mm associated to the PTE/PMD.

Unfortunately, this proves challenging for alloc_{pte,pmd}_late() as
the associated mm is not available at the point where they are
called - in fact not even top-level functions like
create_pgd_mapping() are passed the mm. As a result they both call
the ctor with NULL as mm; this is safe but potentially wasteful.

This is not a new situation, but let's add a couple of comments to
clarify it.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/riscv/mm/init.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 703c3648cfa9..fb18940113f2 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -409,6 +409,11 @@ static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
+	/*
+	 * We do not know which mm the PTE page is associated to at this point.
+	 * Passing NULL to the ctor is the safe option, though it may result
+	 * in unnecessary work (e.g. initialising the ptlock for init_mm).
+	 */
 	BUG_ON(!ptdesc || !pagetable_pte_ctor(NULL, ptdesc));
 	return __pa((pte_t *)ptdesc_address(ptdesc));
 }
@@ -489,6 +494,7 @@ static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
 {
 	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
+	/* See comment in alloc_pte_late() regarding NULL passed the ctor */
 	BUG_ON(!ptdesc || !pagetable_pmd_ctor(NULL, ptdesc));
 	return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 10/11] arm64: mm: Call PUD/P4D ctor in __create_pgd_mapping()
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (8 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 09/11] riscv: mm: Clarify ctor mm argument in alloc_{pte,pmd}_late Kevin Brodsky
@ 2025-03-17 14:16 ` Kevin Brodsky
  2025-03-17 14:17 ` [PATCH 11/11] riscv: mm: Call PUD/P4D ctor in special kernel pgtable alloc Kevin Brodsky
  2025-03-17 15:30 ` [PATCH 00/11] Always call constructor for kernel page tables Ryan Roberts
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:16 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

Constructors for PUD/P4D-level pgtables were recently introduced.
They should be called for all pgtables; make sure they are called
for special kernel mappings created by __create_pgd_mapping() too.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/mm/mmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index accb0a33c59f..10bf39654e77 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -497,7 +497,11 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm,
 	case TABLE_PMD:
 		BUG_ON(!pagetable_pmd_ctor(mm, ptdesc));
 		break;
-	default:
+	case TABLE_PUD:
+		pagetable_pud_ctor(ptdesc);
+		break;
+	case TABLE_P4D:
+		pagetable_p4d_ctor(ptdesc);
 		break;
 	}
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 11/11] riscv: mm: Call PUD/P4D ctor in special kernel pgtable alloc
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (9 preceding siblings ...)
  2025-03-17 14:16 ` [PATCH 10/11] arm64: mm: Call PUD/P4D ctor in __create_pgd_mapping() Kevin Brodsky
@ 2025-03-17 14:17 ` Kevin Brodsky
  2025-03-17 15:30 ` [PATCH 00/11] Always call constructor for kernel page tables Ryan Roberts
  11 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-17 14:17 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Kevin Brodsky, Albert Ou, Andreas Larsson,
	Andrew Morton, Catalin Marinas, Dave Hansen, David S. Miller,
	Geert Uytterhoeven, Linus Walleij, Madhavan Srinivasan,
	Mark Rutland, Matthew Wilcox, Michael Ellerman,
	Mike Rapoport (IBM), Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Qi Zheng, Ryan Roberts, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

Constructors for PUD/P4D-level pgtables were recently introduced.
They should be called for all pgtables; make sure they are called
for special kernel mappings created by create_pgd_mapping() too.

While at it also switch to using pagetable_alloc() like
in alloc_{pte,pmd}_late().

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/riscv/mm/init.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fb18940113f2..dc2715f3fd00 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -557,11 +557,11 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
 
 static phys_addr_t __meminit alloc_pud_late(uintptr_t va)
 {
-	unsigned long vaddr;
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 
-	vaddr = __get_free_page(GFP_KERNEL);
-	BUG_ON(!vaddr);
-	return __pa(vaddr);
+	BUG_ON(!ptdesc);
+	pagetable_pud_ctor(ptdesc);
+	return __pa((pud_t *)ptdesc_address(ptdesc));
 }
 
 static p4d_t *__init get_p4d_virt_early(phys_addr_t pa)
@@ -595,11 +595,11 @@ static phys_addr_t __init alloc_p4d_fixmap(uintptr_t va)
 
 static phys_addr_t __meminit alloc_p4d_late(uintptr_t va)
 {
-	unsigned long vaddr;
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 
-	vaddr = __get_free_page(GFP_KERNEL);
-	BUG_ON(!vaddr);
-	return __pa(vaddr);
+	BUG_ON(!ptdesc);
+	pagetable_p4d_ctor(ptdesc);
+	return __pa((p4d_t *)ptdesc_address(ptdesc));
 }
 
 static void __meminit create_pud_mapping(pud_t *pudp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/11] Always call constructor for kernel page tables
  2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
                   ` (10 preceding siblings ...)
  2025-03-17 14:17 ` [PATCH 11/11] riscv: mm: Call PUD/P4D ctor in special kernel pgtable alloc Kevin Brodsky
@ 2025-03-17 15:30 ` Ryan Roberts
  2025-03-18 12:14   ` Kevin Brodsky
  11 siblings, 1 reply; 15+ messages in thread
From: Ryan Roberts @ 2025-03-17 15:30 UTC (permalink / raw)
  To: Kevin Brodsky, linux-mm
  Cc: linux-kernel, Albert Ou, Andreas Larsson, Andrew Morton,
	Catalin Marinas, Dave Hansen, David S. Miller, Geert Uytterhoeven,
	Linus Walleij, Madhavan Srinivasan, Mark Rutland, Matthew Wilcox,
	Michael Ellerman, Mike Rapoport (IBM), Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Qi Zheng, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

On 17/03/2025 14:16, Kevin Brodsky wrote:
> The complications in those special pgtable allocators beg the question:
> does it really make sense to treat efi_mm and init_mm differently in
> e.g. apply_to_pte_range()? Maybe what we really need is a way to tell if
> an mm corresponds to user memory or not, and never use split locks for
> non-user mm's. Feedback and suggestions welcome!

The difference in treatment is whether or not the ptl is taken, right? So the
real question is when calling apply_to_pte_range() for efi_mm, is there already
a higher level serialization mechanism that prevents racy accesses? For init_mm,
I think this is handled implicitly because there is no way for user space to
cause apply_to_pte_range() for an arbitrary piece of kernel memory. Although I
can't even see where apply_to_page_range() is called for efi_mm.
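
A minimal user-space sketch of the special-casing under discussion, assuming
the walker skips the page table lock only when the mm is init_mm. All types
and helpers here (ptl_model, walk_takes_ptl(), walk_pte_range_model()) are
hypothetical stand-ins, not kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's mm_struct. */
struct mm_struct { int unused; };

/* Hypothetical model of a page table lock: only counts acquisitions. */
struct ptl_model { int acquisitions; };

static struct mm_struct init_mm;
static struct mm_struct efi_mm;

/* Assumed policy: every mm except init_mm takes the ptl. */
static bool walk_takes_ptl(const struct mm_struct *mm)
{
	assert(mm != NULL);
	return mm != &init_mm;
}

/* Models a PTE-range walk; returns the number of entries visited. */
static int walk_pte_range_model(const struct mm_struct *mm,
				struct ptl_model *ptl, int nr_ptes)
{
	bool locked = walk_takes_ptl(mm);
	int visited = 0;

	if (locked)
		ptl->acquisitions++;	/* stands in for taking the ptl */
	for (int i = 0; i < nr_ptes; i++)
		visited++;		/* stands in for the per-PTE callback */
	/* the ptl would be released here when 'locked' is true */
	return visited;
}
```

Under this policy efi_mm is treated like a user mm and takes the lock, so its
page tables need a constructed ptl even though no concurrent access is
expected.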

FWIW, contpte.c has mm_is_user() which is used by arm64.

Thanks,
Ryan


* Re: [PATCH 00/11] Always call constructor for kernel page tables
  2025-03-17 15:30 ` [PATCH 00/11] Always call constructor for kernel page tables Ryan Roberts
@ 2025-03-18 12:14   ` Kevin Brodsky
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-18 12:14 UTC (permalink / raw)
  To: Ryan Roberts, linux-mm
  Cc: linux-kernel, Albert Ou, Andreas Larsson, Andrew Morton,
	Catalin Marinas, Dave Hansen, David S. Miller, Geert Uytterhoeven,
	Linus Walleij, Madhavan Srinivasan, Mark Rutland, Matthew Wilcox,
	Michael Ellerman, Mike Rapoport (IBM), Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Qi Zheng, Will Deacon, Yang Shi,
	linux-arch, linux-arm-kernel, linux-csky, linux-m68k,
	linux-openrisc, linux-riscv, linux-s390, linuxppc-dev, sparclinux

On 17/03/2025 16:30, Ryan Roberts wrote:
> On 17/03/2025 14:16, Kevin Brodsky wrote:
>> The complications in those special pgtable allocators beg the question:
>> does it really make sense to treat efi_mm and init_mm differently in
>> e.g. apply_to_pte_range()? Maybe what we really need is a way to tell if
>> an mm corresponds to user memory or not, and never use split locks for
>> non-user mm's. Feedback and suggestions welcome!
> The difference in treatment is whether or not the ptl is taken, right? So the
> real question is when calling apply_to_pte_range() for efi_mm, is there already
> a higher level serialization mechanism that prevents racy accesses? For init_mm,
> I think this is handled implicitly because there is no way for user space to
> cause apply_to_pte_range() for an arbitrary piece of kernel memory. Although I
> can't even see where apply_to_page_range() is called for efi_mm.

The commit I mentioned above, 61444cde9170 ("ARM: 8591/1: mm: use fully
constructed struct pages for EFI pgd allocations"), shows that
apply_to_page_range() is called from efi_set_mapping_permissions(), and
this indeed hasn't changed. It is itself called from efi_virtmap_init().
I would expect that no locking at all is necessary here, since the
mapping has just been created and surely isn't used yet. Now the
question is where exactly init_mm is special-cased in this manner. I can
see that walk_page_range() does something similar; there may be more
cases. And the other question is whether those functions are ever used
on special mm's, aside from efi_set_mapping_permissions().

> FWIW, contpte.c has mm_is_user() which is used by arm64.

Interesting! But not pretty, that's basically checking that the mm is
not &init_mm or &efi_mm... which wouldn't work for a generic
implementation. It feels like adding some attribute to mm_struct
wouldn't hurt. It looks like we've run out of MMF_* flags though :/
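
To illustrate the difference, here is a minimal user-space sketch comparing
the two approaches: an identity check against the known special mm's, and a
per-mm attribute. The kernel_pgtables field and both helper names are
hypothetical; no such field exists in mm_struct today:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's mm_struct. The 'kernel_pgtables'
 * field is a hypothetical attribute used for illustration only.
 */
struct mm_struct {
	bool kernel_pgtables;
};

static struct mm_struct init_mm = { .kernel_pgtables = true };
static struct mm_struct efi_mm  = { .kernel_pgtables = true };

/* Approach 1: compare against every special mm known at this point. */
static bool mm_is_user_by_identity(const struct mm_struct *mm)
{
	assert(mm != NULL);
	return mm != &init_mm && mm != &efi_mm;
}

/* Approach 2: ask the mm itself, via the hypothetical attribute. */
static bool mm_is_user_by_attr(const struct mm_struct *mm)
{
	assert(mm != NULL);
	return !mm->kernel_pgtables;
}
```

The identity check has to be updated for every new special mm, whereas the
attribute is set once by whoever creates the mm.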

- Kevin


* Re: [PATCH 02/11] mm: Call ctor/dtor for kernel PTEs
  2025-03-17 14:16 ` [PATCH 02/11] mm: Call ctor/dtor for kernel PTEs Kevin Brodsky
@ 2025-03-24  8:37   ` Kevin Brodsky
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Brodsky @ 2025-03-24  8:37 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Albert Ou, Andreas Larsson, Andrew Morton,
	Catalin Marinas, Dave Hansen, David S. Miller, Geert Uytterhoeven,
	Linus Walleij, Madhavan Srinivasan, Mark Rutland, Matthew Wilcox,
	Michael Ellerman, Mike Rapoport (IBM), Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Qi Zheng, Ryan Roberts,
	Will Deacon, Yang Shi, linux-arch, linux-arm-kernel, linux-csky,
	linux-m68k, linux-openrisc, linux-riscv, linux-s390, linuxppc-dev,
	sparclinux

On 17/03/2025 15:16, Kevin Brodsky wrote:
> diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
> index e164ca66f0f6..3c8ec3bfea44 100644
> --- a/include/asm-generic/pgalloc.h
> +++ b/include/asm-generic/pgalloc.h
> @@ -23,6 +23,11 @@ static inline pte_t *__pte_alloc_one_kernel_noprof(struct mm_struct *mm)
>  
>  	if (!ptdesc)
>  		return NULL;
> +	if (!pagetable_pte_ctor(mm, ptdesc)) {

As reported by the CI [1], this can cause trouble on x86 because dtor
calls are missing in pud_free_pmd_page() and pmd_free_pte_page(). Will
fix in the next version.
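
A minimal user-space sketch of why an unpaired ctor is a problem, assuming
the ctor/dtor pair only maintains an accounting counter (the real ones may do
more, such as setting up and tearing down the split lock). Every name below
is a hypothetical stand-in:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stands in for the page table accounting counter. */
static long nr_pagetable_model;

struct ptdesc_model { bool constructed; };

/* Allocation path: always runs the (modelled) ctor. */
static struct ptdesc_model *pgtable_alloc_model(void)
{
	struct ptdesc_model *pt = calloc(1, sizeof(*pt));

	if (!pt)
		return NULL;
	pt->constructed = true;
	nr_pagetable_model++;
	return pt;
}

/* Free path: 'call_dtor' = false models a path that forgets the dtor. */
static void pgtable_free_model(struct ptdesc_model *pt, bool call_dtor)
{
	assert(pt != NULL);
	if (call_dtor) {
		pt->constructed = false;
		nr_pagetable_model--;
	}
	free(pt);
}
```

Once the allocation path always runs the ctor, any free path that skips the
dtor leaves the counter permanently too high.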

- Kevin

[1] https://lore.kernel.org/oe-lkp/202503211612.e11bd73f-lkp@intel.com/



Thread overview: 15+ messages
2025-03-17 14:16 [PATCH 00/11] Always call constructor for kernel page tables Kevin Brodsky
2025-03-17 14:16 ` [PATCH 01/11] mm: Pass mm down to pagetable_{pte,pmd}_ctor Kevin Brodsky
2025-03-17 14:16 ` [PATCH 02/11] mm: Call ctor/dtor for kernel PTEs Kevin Brodsky
2025-03-24  8:37   ` Kevin Brodsky
2025-03-17 14:16 ` [PATCH 03/11] m68k: " Kevin Brodsky
2025-03-17 14:16 ` [PATCH 04/11] powerpc: " Kevin Brodsky
2025-03-17 14:16 ` [PATCH 05/11] sparc64: " Kevin Brodsky
2025-03-17 14:16 ` [PATCH 06/11] mm: Skip ptlock_init() for kernel PMDs Kevin Brodsky
2025-03-17 14:16 ` [PATCH 07/11] arm64: mm: Use enum to identify pgtable level instead of *_SHIFT Kevin Brodsky
2025-03-17 14:16 ` [PATCH 08/11] arm64: mm: Always call PTE/PMD ctor in __create_pgd_mapping() Kevin Brodsky
2025-03-17 14:16 ` [PATCH 09/11] riscv: mm: Clarify ctor mm argument in alloc_{pte,pmd}_late Kevin Brodsky
2025-03-17 14:16 ` [PATCH 10/11] arm64: mm: Call PUD/P4D ctor in __create_pgd_mapping() Kevin Brodsky
2025-03-17 14:17 ` [PATCH 11/11] riscv: mm: Call PUD/P4D ctor in special kernel pgtable alloc Kevin Brodsky
2025-03-17 15:30 ` [PATCH 00/11] Always call constructor for kernel page tables Ryan Roberts
2025-03-18 12:14   ` Kevin Brodsky
