[PATCH v2 0/2] fix wrong usage of memory allocation APIs under PREEMPT

linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64
@ 2025-12-17 18:20 Yeoreum Yun
  2025-12-17 18:20 ` [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping Yeoreum Yun
  2025-12-17 18:20 ` [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
  0 siblings, 2 replies; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-17 18:20 UTC (permalink / raw)
  To: catalin.marinas, will, ryan.roberts, akpm, david, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel, Yeoreum Yun

Under PREEMPT_RT, calling generic memory allocation/free APIs
(e.x) __get_free_pages(), pgtable_alloc(), free_pages() and etc
with preemption disabled is not allowed, but allow only nolock() APIs series
because it may acquire a spin lock that becomes sleepable on RT,
potentially causing a sleep during page allocation
(See Documentation/core-api/real-time/differences.rst, Memory allocation section).

However, In arm64, __linear_map_split_to_ptes() and
__kpti_install_ng_mappings() called by stopper thread via stop_machine()
use generic memory allocation/free APIs.

This patchset fixes this problem and based on v6.19-rc1

Patch History
==============
from v1 to v2:
  - drop pagetable_alloc_nolock()
  - following @Ryan Roberts suggestion.
  - https://lore.kernel.org/all/20251212161832.2067134-1-yeoreum.yun@arm.com/

 arch/arm64/mm/mmu.c | 235 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 183 insertions(+), 52 deletions(-)

--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping
  2025-12-17 18:20 [PATCH v2 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
@ 2025-12-17 18:20 ` Yeoreum Yun
  2025-12-18  8:23   ` David Hildenbrand (Red Hat)
  2025-12-18 14:22   ` Ryan Roberts
  2025-12-17 18:20 ` [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
  1 sibling, 2 replies; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-17 18:20 UTC (permalink / raw)
  To: catalin.marinas, will, ryan.roberts, akpm, david, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel, Yeoreum Yun

Current linear_map_split_to_ptes() allocate the pagetable
while split linear mapping to ptes by stop_machine() with GFP_ATOMIC.

This is fine for non-PREEMPR_RT case.
However It's a problem in PREEMPR_RT case since
generic memory allocation/free APIs (e.x) pgtable_alloc(), __get_free_pages and etc
couldn't be called in non-preemptible context except _nolock() APIs
since generic memory allocation/free APIs are *sleepable* for using *spin_lock()*

IOW, calling a pgtable_alloc() even with GFP_ATOMIC
doesn't allow in __linear_map_split_to_pte() executed by stopper thread
where preemption is disabled in PREEMPR_RT.

To address this, divide linear_map_maybe_split_ptes:
  - collect number of pages to require for spliting.
  - allocate the required number of pages for spliting.
  - with pre-allocate page, split the linear map.

Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 arch/arm64/mm/mmu.c | 213 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 170 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9ae7ce00a7ef..e4e6c7e0a016 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -527,18 +527,28 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 		panic("Failed to create page tables\n");
 }
 
-static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
-				       enum pgtable_type pgtable_type)
-{
-	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
-	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
-	phys_addr_t pa;
-
-	if (!ptdesc)
-		return INVALID_PHYS_ADDR;
+enum split_modes {
+	SPLIT_ALLOC,
+	SPLIT_PREALLOC,
+	SPLIT_COLLECT,
+};
 
-	pa = page_to_phys(ptdesc_page(ptdesc));
+struct split_args {
+	enum split_modes mode;
+	union {
+		struct {
+			unsigned long nr;
+			unsigned long i;
+			struct page **pages;
+		};
+		gfp_t gfp;
+	};
+};
 
+static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
+					       struct ptdesc *ptdesc,
+					       enum pgtable_type pgtable_type)
+{
 	switch (pgtable_type) {
 	case TABLE_PTE:
 		BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
@@ -554,19 +564,56 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
 		break;
 	}
 
+}
+
+static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
+				       enum pgtable_type pgtable_type)
+{
+	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
+	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
+	phys_addr_t pa;
+
+	if (!ptdesc)
+		return INVALID_PHYS_ADDR;
+
+	pa = page_to_phys(ptdesc_page(ptdesc));
+
+	__pgd_pgtable_init(mm, ptdesc, pgtable_type);
+
 	return pa;
 }
 
+static phys_addr_t __pgd_pgtable_prealloc(struct mm_struct *mm,
+					  struct split_args *split_args,
+					  enum pgtable_type pgtable_type)
+{
+	struct page *page;
+
+	BUG_ON(split_args->i >= split_args->nr);
+
+	page = split_args->pages[split_args->i++];
+	if (!page)
+		return INVALID_PHYS_ADDR;
+
+	__pgd_pgtable_init(mm, page_ptdesc(page), pgtable_type);
+
+	return page_to_phys(page);
+}
+
 static phys_addr_t
-pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
+pgd_pgtable_alloc_split(enum pgtable_type pgtable_type,
+		        struct split_args *split_args)
 {
-	return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
+	if (split_args->mode == SPLIT_ALLOC)
+		return __pgd_pgtable_alloc(&init_mm, split_args->gfp, pgtable_type);
+
+	return __pgd_pgtable_prealloc(&init_mm, split_args, pgtable_type);
 }
 
 static phys_addr_t __maybe_unused
 pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
 {
-	return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
+	return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
 }
 
 static phys_addr_t
@@ -584,7 +631,9 @@ static void split_contpte(pte_t *ptep)
 		__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
 }
 
-static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
+static int split_pmd(pmd_t *pmdp, pmd_t pmd,
+		     struct split_args *split_args,
+		     bool to_cont)
 {
 	pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
 	unsigned long pfn = pmd_pfn(pmd);
@@ -593,7 +642,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
 	pte_t *ptep;
 	int i;
 
-	pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
+	pte_phys = pgd_pgtable_alloc_split(TABLE_PTE, split_args);
 	if (pte_phys == INVALID_PHYS_ADDR)
 		return -ENOMEM;
 	ptep = (pte_t *)phys_to_virt(pte_phys);
@@ -628,7 +677,9 @@ static void split_contpmd(pmd_t *pmdp)
 		set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
 }
 
-static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
+static int split_pud(pud_t *pudp, pud_t pud,
+		    struct split_args *split_args,
+		    bool to_cont)
 {
 	pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
 	unsigned int step = PMD_SIZE >> PAGE_SHIFT;
@@ -638,7 +689,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
 	pmd_t *pmdp;
 	int i;
 
-	pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
+	pmd_phys = pgd_pgtable_alloc_split(TABLE_PMD, split_args);
 	if (pmd_phys == INVALID_PHYS_ADDR)
 		return -ENOMEM;
 	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
@@ -672,6 +723,10 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
 	pmd_t *pmdp, pmd;
 	pte_t *ptep, pte;
 	int ret = 0;
+	struct split_args split_args = {
+		.mode = SPLIT_ALLOC,
+		.gfp = GFP_PGTABLE_KERNEL,
+	};
 
 	/*
 	 * PGD: If addr is PGD aligned then addr already describes a leaf
@@ -707,7 +762,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
 	if (!pud_present(pud))
 		goto out;
 	if (pud_leaf(pud)) {
-		ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
+		ret = split_pud(pudp, pud, &split_args, true);
 		if (ret)
 			goto out;
 	}
@@ -732,7 +787,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
 		 */
 		if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
 			goto out;
-		ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
+		ret = split_pmd(pmdp, pmd, &split_args, true);
 		if (ret)
 			goto out;
 	}
@@ -831,12 +886,17 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
 static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
 				   unsigned long next, struct mm_walk *walk)
 {
-	gfp_t gfp = *(gfp_t *)walk->private;
+	struct split_args *split_args = (struct split_args *)walk->private;
 	pud_t pud = pudp_get(pudp);
 	int ret = 0;
 
-	if (pud_leaf(pud))
-		ret = split_pud(pudp, pud, gfp, false);
+	if (!pud_leaf(pud))
+		return 0;
+
+	if (split_args->mode == SPLIT_COLLECT)
+		split_args->nr++;
+	else
+		ret = split_pud(pudp, pud, split_args, false);
 
 	return ret;
 }
@@ -844,22 +904,29 @@ static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
 static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
 				   unsigned long next, struct mm_walk *walk)
 {
-	gfp_t gfp = *(gfp_t *)walk->private;
+	struct split_args *split_args = (struct split_args *)walk->private;
 	pmd_t pmd = pmdp_get(pmdp);
-	int ret = 0;
+	int ret;
 
-	if (pmd_leaf(pmd)) {
-		if (pmd_cont(pmd))
-			split_contpmd(pmdp);
-		ret = split_pmd(pmdp, pmd, gfp, false);
+	if (!pmd_leaf(pmd))
+		return 0;
 
-		/*
-		 * We have split the pmd directly to ptes so there is no need to
-		 * visit each pte to check if they are contpte.
-		 */
-		walk->action = ACTION_CONTINUE;
+	if (split_args->mode == SPLIT_COLLECT) {
+		split_args->nr++;
+		return 0;
 	}
 
+	if (pmd_cont(pmd))
+		split_contpmd(pmdp);
+	ret = split_pmd(pmdp, pmd, split_args, false);
+
+	/*
+	 * We have split the pmd directly to ptes so there is no need to
+	 * visit each pte to check if they are contpte.
+	 */
+	walk->action = ACTION_CONTINUE;
+
+
 	return ret;
 }
 
@@ -880,13 +947,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
 	.pte_entry	= split_to_ptes_pte_entry,
 };
 
-static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
+static int range_split_to_ptes(unsigned long start, unsigned long end,
+		 	       struct split_args *split_args)
 {
 	int ret;
 
 	arch_enter_lazy_mmu_mode();
 	ret = walk_kernel_page_table_range_lockless(start, end,
-					&split_to_ptes_ops, NULL, &gfp);
+						    &split_to_ptes_ops, NULL,
+						    split_args);
 	arch_leave_lazy_mmu_mode();
 
 	return ret;
@@ -903,7 +972,7 @@ static void __init init_idmap_kpti_bbml2_flag(void)
 	smp_mb();
 }
 
-static int __init linear_map_split_to_ptes(void *__unused)
+static int __init linear_map_split_to_ptes(void *data)
 {
 	/*
 	 * Repainting the linear map must be done by CPU0 (the boot CPU) because
@@ -911,6 +980,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
 	 * be held in a waiting area with the idmap active.
 	 */
 	if (!smp_processor_id()) {
+		struct split_args *split_args = data;
 		unsigned long lstart = _PAGE_OFFSET(vabits_actual);
 		unsigned long lend = PAGE_END;
 		unsigned long kstart = (unsigned long)lm_alias(_stext);
@@ -928,12 +998,13 @@ static int __init linear_map_split_to_ptes(void *__unused)
 		 * PTE. The kernel alias remains static throughout runtime so
 		 * can continue to be safely mapped with large mappings.
 		 */
-		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
+		ret = range_split_to_ptes(lstart, lend, split_args);
 		if (!ret)
-			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
+			ret = range_split_to_ptes(kstart, kend, split_args);
 		if (ret)
 			panic("Failed to split linear map\n");
-		flush_tlb_kernel_range(lstart, lend);
+		if (split_args->mode != SPLIT_COLLECT)
+			flush_tlb_kernel_range(lstart, lend);
 
 		/*
 		 * Relies on dsb in flush_tlb_kernel_range() to avoid reordering
@@ -963,10 +1034,61 @@ static int __init linear_map_split_to_ptes(void *__unused)
 
 void __init linear_map_maybe_split_to_ptes(void)
 {
-	if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
-		init_idmap_kpti_bbml2_flag();
-		stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
+	struct page **pages = NULL;
+	struct split_args split_args;
+	int order;
+	int nr_populated;
+	int err;
+
+	if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
+		return;
+
+	/*
+	 * Phase 1: Collect required extra pages to split.
+	 */
+	split_args.mode = SPLIT_COLLECT;
+	split_args.nr = 0;
+
+	init_idmap_kpti_bbml2_flag();
+	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
+
+	/*
+	 * Phase 2: Allocate necessary pages to split.
+	 */
+	if (split_args.nr == 0) {
+		err = 0;
+		split_args.mode = SPLIT_ALLOC;
+	} else {
+		err = -ENOMEM;
+		order = order_base_2(PAGE_ALIGN(split_args.nr *
+					sizeof(struct page *)) >> PAGE_SHIFT);
+
+		pages = (struct page **)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
+		if (!pages)
+			goto error;
+
+		nr_populated = alloc_pages_bulk(GFP_ATOMIC | __GFP_ZERO, split_args.nr, pages);
+		if (nr_populated < split_args.nr)
+			goto error;
+
+		err = 0;
+		split_args.mode = SPLIT_PREALLOC;
+		split_args.i = 0;
+		split_args.pages = pages;
 	}
+
+	/*
+	 * Phase 3: Split linear map.
+	 */
+	init_idmap_kpti_bbml2_flag();
+	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
+
+error:
+	if (pages)
+		free_pages((unsigned long)pages, order);
+
+	if (err)
+		panic("Failed to split linear map: %d\n", err);
 }
 
 /*
@@ -1087,6 +1209,11 @@ bool arch_kfence_init_pool(void)
 	unsigned long start = (unsigned long)__kfence_pool;
 	unsigned long end = start + KFENCE_POOL_SIZE;
 	int ret;
+	struct split_args split_args = {
+		.mode = SPLIT_ALLOC,
+		.gfp = GFP_PGTABLE_KERNEL,
+	};
+
 
 	/* Exit early if we know the linear map is already pte-mapped. */
 	if (!split_leaf_mapping_possible())
@@ -1097,7 +1224,7 @@ bool arch_kfence_init_pool(void)
 		return true;
 
 	mutex_lock(&pgtable_split_lock);
-	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
+	ret = range_split_to_ptes(start, end, &split_args);
 	mutex_unlock(&pgtable_split_lock);
 
 	/*
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-17 18:20 [PATCH v2 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
  2025-12-17 18:20 ` [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping Yeoreum Yun
@ 2025-12-17 18:20 ` Yeoreum Yun
  2025-12-17 19:03   ` Ryan Roberts
  1 sibling, 1 reply; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-17 18:20 UTC (permalink / raw)
  To: catalin.marinas, will, ryan.roberts, akpm, david, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel, Yeoreum Yun

The current __kpti_install_ng_mappings() allocates a temporary PGD
while installing the NG mapping for KPTI under stop_machine(),
using GFP_ATOMIC.

This is fine in the non-PREEMPT_RT case. However, it becomes a problem
under PREEMPT_RT because generic memory allocation/free APIs
(e.g., pgtable_alloc(), __get_free_pages(), etc.) cannot be invoked
in a non-preemptible context, except for the *_nolock() variants.
These generic allocators may sleep due to their use of spin_lock().

In other words, calling __get_free_pages(), even with GFP_ATOMIC,
is not allowed in __kpti_install_ng_mappings(), which is executed by
the stopper thread where preemption is disabled under PREEMPT_RT.

To address this, preallocate the page needed for the temporary PGD
before invoking __kpti_install_ng_mappings() via stop_machine().

Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 arch/arm64/mm/mmu.c | 22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e4e6c7e0a016..69d9651de0cd 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1360,7 +1360,7 @@ static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
 	return kpti_ng_temp_alloc;
 }
 
-static int __init __kpti_install_ng_mappings(void *__unused)
+static int __init __kpti_install_ng_mappings(void *data)
 {
 	typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
 	extern kpti_remap_fn idmap_kpti_install_ng_mappings;
@@ -1368,10 +1368,9 @@ static int __init __kpti_install_ng_mappings(void *__unused)
 
 	int cpu = smp_processor_id();
 	int levels = CONFIG_PGTABLE_LEVELS;
-	int order = order_base_2(levels);
 	u64 kpti_ng_temp_pgd_pa = 0;
 	pgd_t *kpti_ng_temp_pgd;
-	u64 alloc = 0;
+	u64 alloc = *(u64 *)data;
 
 	if (levels == 5 && !pgtable_l5_enabled())
 		levels = 4;
@@ -1382,8 +1381,6 @@ static int __init __kpti_install_ng_mappings(void *__unused)
 
 	if (!cpu) {
 		int ret;
-
-		alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
 		kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
 		kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
 
@@ -1414,16 +1411,17 @@ static int __init __kpti_install_ng_mappings(void *__unused)
 	remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
 	cpu_uninstall_idmap();
 
-	if (!cpu) {
-		free_pages(alloc, order);
+	if (!cpu)
 		arm64_use_ng_mappings = true;
-	}
 
 	return 0;
 }
 
 void __init kpti_install_ng_mappings(void)
 {
+	int order = order_base_2(CONFIG_PGTABLE_LEVELS);
+	u64 alloc;
+
 	/* Check whether KPTI is going to be used */
 	if (!arm64_kernel_unmapped_at_el0())
 		return;
@@ -1436,8 +1434,14 @@ void __init kpti_install_ng_mappings(void)
 	if (arm64_use_ng_mappings)
 		return;
 
+	alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
+	if (!alloc)
+		panic("Failed to alloc page tables\n");
+
 	init_idmap_kpti_bbml2_flag();
-	stop_machine(__kpti_install_ng_mappings, NULL, cpu_online_mask);
+	stop_machine(__kpti_install_ng_mappings, &alloc, cpu_online_mask);
+
+	free_pages(alloc, order);
 }
 
 static pgprot_t __init kernel_exec_prot(void)
-- 
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-17 18:20 ` [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
@ 2025-12-17 19:03   ` Ryan Roberts
  2025-12-17 19:09     ` Yeoreum Yun
  2025-12-18  7:51     ` Vlastimil Babka
  0 siblings, 2 replies; 19+ messages in thread
From: Ryan Roberts @ 2025-12-17 19:03 UTC (permalink / raw)
  To: Yeoreum Yun, catalin.marinas, will, akpm, david, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel

On 17/12/2025 18:20, Yeoreum Yun wrote:
> The current __kpti_install_ng_mappings() allocates a temporary PGD
> while installing the NG mapping for KPTI under stop_machine(),
> using GFP_ATOMIC.
> 
> This is fine in the non-PREEMPT_RT case. However, it becomes a problem
> under PREEMPT_RT because generic memory allocation/free APIs
> (e.g., pgtable_alloc(), __get_free_pages(), etc.) cannot be invoked
> in a non-preemptible context, except for the *_nolock() variants.
> These generic allocators may sleep due to their use of spin_lock().
> 
> In other words, calling __get_free_pages(), even with GFP_ATOMIC,
> is not allowed in __kpti_install_ng_mappings(), which is executed by
> the stopper thread where preemption is disabled under PREEMPT_RT.
> 
> To address this, preallocate the page needed for the temporary PGD
> before invoking __kpti_install_ng_mappings() via stop_machine().
> 
> Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled")
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
>  arch/arm64/mm/mmu.c | 22 +++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index e4e6c7e0a016..69d9651de0cd 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1360,7 +1360,7 @@ static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
>  	return kpti_ng_temp_alloc;
>  }
>  
> -static int __init __kpti_install_ng_mappings(void *__unused)
> +static int __init __kpti_install_ng_mappings(void *data)
>  {
>  	typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
>  	extern kpti_remap_fn idmap_kpti_install_ng_mappings;
> @@ -1368,10 +1368,9 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>  
>  	int cpu = smp_processor_id();
>  	int levels = CONFIG_PGTABLE_LEVELS;
> -	int order = order_base_2(levels);
>  	u64 kpti_ng_temp_pgd_pa = 0;
>  	pgd_t *kpti_ng_temp_pgd;
> -	u64 alloc = 0;
> +	u64 alloc = *(u64 *)data;
>  
>  	if (levels == 5 && !pgtable_l5_enabled())
>  		levels = 4;
> @@ -1382,8 +1381,6 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>  
>  	if (!cpu) {
>  		int ret;
> -
> -		alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>  		kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
>  		kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
>  
> @@ -1414,16 +1411,17 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>  	remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
>  	cpu_uninstall_idmap();
>  
> -	if (!cpu) {
> -		free_pages(alloc, order);
> +	if (!cpu)
>  		arm64_use_ng_mappings = true;
> -	}
>  
>  	return 0;
>  }
>  
>  void __init kpti_install_ng_mappings(void)
>  {
> +	int order = order_base_2(CONFIG_PGTABLE_LEVELS);
> +	u64 alloc;
> +
>  	/* Check whether KPTI is going to be used */
>  	if (!arm64_kernel_unmapped_at_el0())
>  		return;
> @@ -1436,8 +1434,14 @@ void __init kpti_install_ng_mappings(void)
>  	if (arm64_use_ng_mappings)
>  		return;
>  
> +	alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);

I don't think this requires GFP_ATOMIC now?

With that removed:

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> +	if (!alloc)
> +		panic("Failed to alloc page tables\n");
> +
>  	init_idmap_kpti_bbml2_flag();
> -	stop_machine(__kpti_install_ng_mappings, NULL, cpu_online_mask);
> +	stop_machine(__kpti_install_ng_mappings, &alloc, cpu_online_mask);
> +
> +	free_pages(alloc, order);
>  }
>  
>  static pgprot_t __init kernel_exec_prot(void)



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-17 19:03   ` Ryan Roberts
@ 2025-12-17 19:09     ` Yeoreum Yun
  2025-12-18  8:34       ` Ryan Roberts
  2025-12-18  7:51     ` Vlastimil Babka
  1 sibling, 1 reply; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-17 19:09 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
	dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
	lorenzo.stoakes, ardb, jackmanb, vbabka, linux-arm-kernel,
	linux-kernel, linux-rt-devel

Hi Ryanc,

[...]
> > -	if (!cpu) {
> > -		free_pages(alloc, order);
> > +	if (!cpu)
> >  		arm64_use_ng_mappings = true;
> > -	}
> >
> >  	return 0;
> >  }
> >
> >  void __init kpti_install_ng_mappings(void)
> >  {
> > +	int order = order_base_2(CONFIG_PGTABLE_LEVELS);
> > +	u64 alloc;
> > +
> >  	/* Check whether KPTI is going to be used */
> >  	if (!arm64_kernel_unmapped_at_el0())
> >  		return;
> > @@ -1436,8 +1434,14 @@ void __init kpti_install_ng_mappings(void)
> >  	if (arm64_use_ng_mappings)
> >  		return;
> >
> > +	alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>
> I don't think this requires GFP_ATOMIC now?
>
> With that removed:
>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

I think it would be better to use only __GFP_HIGH in here since
when kpti_install_ng_mappings() is called, "kswpd" doesn't created yet.
and to allocate page with assurance, It would be good to use
min_reserved to.

Am I missing something?

Thanks.

[...]

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-17 19:03   ` Ryan Roberts
  2025-12-17 19:09     ` Yeoreum Yun
@ 2025-12-18  7:51     ` Vlastimil Babka
  2025-12-18  8:32       ` Ryan Roberts
  1 sibling, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2025-12-18  7:51 UTC (permalink / raw)
  To: Ryan Roberts, Yeoreum Yun, catalin.marinas, will, akpm, david,
	kevin.brodsky, quic_zhenhuah, dev.jain, yang, chaitanyas.prakash,
	bigeasy, clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel

On 12/17/25 20:03, Ryan Roberts wrote:
> On 17/12/2025 18:20, Yeoreum Yun wrote:
>> The current __kpti_install_ng_mappings() allocates a temporary PGD
>> while installing the NG mapping for KPTI under stop_machine(),
>> using GFP_ATOMIC.
>> 
>> This is fine in the non-PREEMPT_RT case. However, it becomes a problem
>> under PREEMPT_RT because generic memory allocation/free APIs
>> (e.g., pgtable_alloc(), __get_free_pages(), etc.) cannot be invoked
>> in a non-preemptible context, except for the *_nolock() variants.
>> These generic allocators may sleep due to their use of spin_lock().
>> 
>> In other words, calling __get_free_pages(), even with GFP_ATOMIC,
>> is not allowed in __kpti_install_ng_mappings(), which is executed by
>> the stopper thread where preemption is disabled under PREEMPT_RT.
>> 
>> To address this, preallocate the page needed for the temporary PGD
>> before invoking __kpti_install_ng_mappings() via stop_machine().
>> 
>> Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled")
>> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
>> ---
>>  arch/arm64/mm/mmu.c | 22 +++++++++++++---------
>>  1 file changed, 13 insertions(+), 9 deletions(-)
>> 
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index e4e6c7e0a016..69d9651de0cd 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1360,7 +1360,7 @@ static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
>>  	return kpti_ng_temp_alloc;
>>  }
>>  
>> -static int __init __kpti_install_ng_mappings(void *__unused)
>> +static int __init __kpti_install_ng_mappings(void *data)
>>  {
>>  	typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
>>  	extern kpti_remap_fn idmap_kpti_install_ng_mappings;
>> @@ -1368,10 +1368,9 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>>  
>>  	int cpu = smp_processor_id();
>>  	int levels = CONFIG_PGTABLE_LEVELS;
>> -	int order = order_base_2(levels);
>>  	u64 kpti_ng_temp_pgd_pa = 0;
>>  	pgd_t *kpti_ng_temp_pgd;
>> -	u64 alloc = 0;
>> +	u64 alloc = *(u64 *)data;
>>  
>>  	if (levels == 5 && !pgtable_l5_enabled())
>>  		levels = 4;
>> @@ -1382,8 +1381,6 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>>  
>>  	if (!cpu) {
>>  		int ret;
>> -
>> -		alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>>  		kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
>>  		kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
>>  
>> @@ -1414,16 +1411,17 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>>  	remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
>>  	cpu_uninstall_idmap();
>>  
>> -	if (!cpu) {
>> -		free_pages(alloc, order);
>> +	if (!cpu)
>>  		arm64_use_ng_mappings = true;
>> -	}
>>  
>>  	return 0;
>>  }
>>  
>>  void __init kpti_install_ng_mappings(void)
>>  {
>> +	int order = order_base_2(CONFIG_PGTABLE_LEVELS);
>> +	u64 alloc;
>> +
>>  	/* Check whether KPTI is going to be used */
>>  	if (!arm64_kernel_unmapped_at_el0())
>>  		return;
>> @@ -1436,8 +1434,14 @@ void __init kpti_install_ng_mappings(void)
>>  	if (arm64_use_ng_mappings)
>>  		return;
>>  
>> +	alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> 
> I don't think this requires GFP_ATOMIC now?

Do you mean it's fine to use GFP_KERNEL now, or still not, and you mean that
GFP_NOWAIT is sufficient?

> With that removed:
> 
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> 
>> +	if (!alloc)
>> +		panic("Failed to alloc page tables\n");
>> +
>>  	init_idmap_kpti_bbml2_flag();
>> -	stop_machine(__kpti_install_ng_mappings, NULL, cpu_online_mask);
>> +	stop_machine(__kpti_install_ng_mappings, &alloc, cpu_online_mask);
>> +
>> +	free_pages(alloc, order);
>>  }
>>  
>>  static pgprot_t __init kernel_exec_prot(void)
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping
  2025-12-17 18:20 ` [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping Yeoreum Yun
@ 2025-12-18  8:23   ` David Hildenbrand (Red Hat)
  2025-12-18  9:08     ` Yeoreum Yun
  2025-12-18 14:22   ` Ryan Roberts
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  8:23 UTC (permalink / raw)
  To: Yeoreum Yun, catalin.marinas, will, ryan.roberts, akpm,
	kevin.brodsky, quic_zhenhuah, dev.jain, yang, chaitanyas.prakash,
	bigeasy, clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb,
	vbabka
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel

On 12/17/25 19:20, Yeoreum Yun wrote:
> Current linear_map_split_to_ptes() allocate the pagetable
> while split linear mapping to ptes by stop_machine() with GFP_ATOMIC.
> 
> This is fine for non-PREEMPR_RT case.
> However It's a problem in PREEMPR_RT case since
> generic memory allocation/free APIs (e.x) pgtable_alloc(), __get_free_pages and etc
> couldn't be called in non-preemptible context except _nolock() APIs
> since generic memory allocation/free APIs are *sleepable* for using *spin_lock()*
> 
> IOW, calling a pgtable_alloc() even with GFP_ATOMIC
> doesn't allow in __linear_map_split_to_pte() executed by stopper thread
> where preemption is disabled in PREEMPR_RT.
> 
> To address this, divide linear_map_maybe_split_ptes:
>    - collect number of pages to require for spliting.

COLLECT step

>    - allocate the required number of pages for spliting.

PREALLOC step

>    - with pre-allocate page, split the linear map.

SPLIT step

The steps you have below are confusing :)

> +enum split_modes {

Is it a "mode" or a "step" ?

enum split_step {
	SPLIT_STEP_COLLECT,
	...
};


But reading the code below I am lost how these modes apply here.

I guess what you mean is something like

enum split_mode {
	SPLIT_MODE_ALLOC, /* ... from the buddy. */
	SPLIT_MODE_PREALLOCATED, /* ... from preallocated buffer. */
	SPLIT_MODE_COLLECT, /* collect required page tables. */
};

But why is SPLIT_MODE_ALLOC vs. SPLIT_MODE_PREALLOCATED required?

Can't we just test if there is something in the preallocated buffer, and 
if not, simply allocate?

IOW, all you need is "collect vs. alloc", and the buffer information 
tells you how to allocate.

> +	SPLIT_ALLOC,
> +	SPLIT_PREALLOC,
> +	SPLIT_COLLECT,
> +};
>   
> -	pa = page_to_phys(ptdesc_page(ptdesc));
> +struct split_args {
> +	enum split_modes mode;
> +	union {
> +		struct {
> +			unsigned long nr;
> +			unsigned long i;
> +			struct page **pages;
> +		};
> +		gfp_t gfp;
> +	};
> +};
>   
> +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
> +					       struct ptdesc *ptdesc,
> +					       enum pgtable_type pgtable_type)
> +{
>   	switch (pgtable_type) {
>   	case TABLE_PTE:
>   		BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
> @@ -554,19 +564,56 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>   		break;
>   	}
>   
> +}
> +
> +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> +				       enum pgtable_type pgtable_type)
> +{
> +	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> +	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> +	phys_addr_t pa;
> +
> +	if (!ptdesc)
> +		return INVALID_PHYS_ADDR;
> +
> +	pa = page_to_phys(ptdesc_page(ptdesc));
> +
> +	__pgd_pgtable_init(mm, ptdesc, pgtable_type);
> +
>   	return pa;
>   }
>   
> +static phys_addr_t __pgd_pgtable_prealloc(struct mm_struct *mm,
> +					  struct split_args *split_args,
> +					  enum pgtable_type pgtable_type)

The function is called "_prealloc" but actually it simply grabs a 
preallocated page. Confusing.


Should this function be called something like 
"__pgd_pgtable_get_preallocated()". Not sure.

> +{
> +	struct page *page;
> +
> +	BUG_ON(split_args->i >= split_args->nr);

No BUG_ON().

> +
> +	page = split_args->pages[split_args->i++];
> +	if (!page)
> +		return INVALID_PHYS_ADDR;
> +
> +	__pgd_pgtable_init(mm, page_ptdesc(page), pgtable_type);
> +
> +	return page_to_phys(page);
> +}
> +
>   static phys_addr_t
> -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
> +pgd_pgtable_alloc_split(enum pgtable_type pgtable_type,
> +		        struct split_args *split_args)
>   {
> -	return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
> +	if (split_args->mode == SPLIT_ALLOC)
> +		return __pgd_pgtable_alloc(&init_mm, split_args->gfp, pgtable_type);
> +
> +	return __pgd_pgtable_prealloc(&init_mm, split_args, pgtable_type);
>   }
>   
>   static phys_addr_t __maybe_unused
>   pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
>   {
> -	return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
> +	return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
>   }
>   
>   static phys_addr_t
> @@ -584,7 +631,9 @@ static void split_contpte(pte_t *ptep)
>   		__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
>   }
>   
> -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
> +		     struct split_args *split_args,
> +		     bool to_cont)
>   {
>   	pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
>   	unsigned long pfn = pmd_pfn(pmd);
> @@ -593,7 +642,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>   	pte_t *ptep;
>   	int i;
>   
> -	pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
> +	pte_phys = pgd_pgtable_alloc_split(TABLE_PTE, split_args);
>   	if (pte_phys == INVALID_PHYS_ADDR)
>   		return -ENOMEM;
>   	ptep = (pte_t *)phys_to_virt(pte_phys);
> @@ -628,7 +677,9 @@ static void split_contpmd(pmd_t *pmdp)
>   		set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
>   }
>   
> -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> +static int split_pud(pud_t *pudp, pud_t pud,
> +		    struct split_args *split_args,
> +		    bool to_cont)
>   {
>   	pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
>   	unsigned int step = PMD_SIZE >> PAGE_SHIFT;
> @@ -638,7 +689,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>   	pmd_t *pmdp;
>   	int i;
>   
> -	pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
> +	pmd_phys = pgd_pgtable_alloc_split(TABLE_PMD, split_args);
>   	if (pmd_phys == INVALID_PHYS_ADDR)
>   		return -ENOMEM;
>   	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
> @@ -672,6 +723,10 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>   	pmd_t *pmdp, pmd;
>   	pte_t *ptep, pte;
>   	int ret = 0;
> +	struct split_args split_args = {
> +		.mode = SPLIT_ALLOC,
> +		.gfp = GFP_PGTABLE_KERNEL,
> +	};
>   
>   	/*
>   	 * PGD: If addr is PGD aligned then addr already describes a leaf
> @@ -707,7 +762,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>   	if (!pud_present(pud))
>   		goto out;
>   	if (pud_leaf(pud)) {
> -		ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
> +		ret = split_pud(pudp, pud, &split_args, true);
>   		if (ret)
>   			goto out;
>   	}
> @@ -732,7 +787,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>   		 */
>   		if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
>   			goto out;
> -		ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
> +		ret = split_pmd(pmdp, pmd, &split_args, true);
>   		if (ret)
>   			goto out;
>   	}
> @@ -831,12 +886,17 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>   static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>   				   unsigned long next, struct mm_walk *walk)
>   {
> -	gfp_t gfp = *(gfp_t *)walk->private;
> +	struct split_args *split_args = (struct split_args *)walk->private;
>   	pud_t pud = pudp_get(pudp);
>   	int ret = 0;
>   
> -	if (pud_leaf(pud))
> -		ret = split_pud(pudp, pud, gfp, false);
> +	if (!pud_leaf(pud))
> +		return 0;
> +
> +	if (split_args->mode == SPLIT_COLLECT)
> +		split_args->nr++;
> +	else
> +		ret = split_pud(pudp, pud, split_args, false);
>   
>   	return ret;
>   }
> @@ -844,22 +904,29 @@ static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>   static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>   				   unsigned long next, struct mm_walk *walk)
>   {
> -	gfp_t gfp = *(gfp_t *)walk->private;
> +	struct split_args *split_args = (struct split_args *)walk->private;
>   	pmd_t pmd = pmdp_get(pmdp);
> -	int ret = 0;
> +	int ret;
>   
> -	if (pmd_leaf(pmd)) {
> -		if (pmd_cont(pmd))
> -			split_contpmd(pmdp);
> -		ret = split_pmd(pmdp, pmd, gfp, false);
> +	if (!pmd_leaf(pmd))
> +		return 0;
>   
> -		/*
> -		 * We have split the pmd directly to ptes so there is no need to
> -		 * visit each pte to check if they are contpte.
> -		 */
> -		walk->action = ACTION_CONTINUE;
> +	if (split_args->mode == SPLIT_COLLECT) {
> +		split_args->nr++;
> +		return 0;
>   	}
>   
> +	if (pmd_cont(pmd))
> +		split_contpmd(pmdp);
> +	ret = split_pmd(pmdp, pmd, split_args, false);
> +
> +	/*
> +	 * We have split the pmd directly to ptes so there is no need to
> +	 * visit each pte to check if they are contpte.
> +	 */
> +	walk->action = ACTION_CONTINUE;
> +
> +
>   	return ret;
>   }
>   
> @@ -880,13 +947,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
>   	.pte_entry	= split_to_ptes_pte_entry,
>   };
>   
> -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> +static int range_split_to_ptes(unsigned long start, unsigned long end,
> +		 	       struct split_args *split_args)
>   {
>   	int ret;
>   
>   	arch_enter_lazy_mmu_mode();
>   	ret = walk_kernel_page_table_range_lockless(start, end,
> -					&split_to_ptes_ops, NULL, &gfp);
> +						    &split_to_ptes_ops, NULL,
> +						    split_args);
>   	arch_leave_lazy_mmu_mode();
>   
>   	return ret;
> @@ -903,7 +972,7 @@ static void __init init_idmap_kpti_bbml2_flag(void)
>   	smp_mb();
>   }
>   
> -static int __init linear_map_split_to_ptes(void *__unused)
> +static int __init linear_map_split_to_ptes(void *data)
>   {
>   	/*
>   	 * Repainting the linear map must be done by CPU0 (the boot CPU) because
> @@ -911,6 +980,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
>   	 * be held in a waiting area with the idmap active.
>   	 */
>   	if (!smp_processor_id()) {
> +		struct split_args *split_args = data;
>   		unsigned long lstart = _PAGE_OFFSET(vabits_actual);
>   		unsigned long lend = PAGE_END;
>   		unsigned long kstart = (unsigned long)lm_alias(_stext);
> @@ -928,12 +998,13 @@ static int __init linear_map_split_to_ptes(void *__unused)
>   		 * PTE. The kernel alias remains static throughout runtime so
>   		 * can continue to be safely mapped with large mappings.
>   		 */
> -		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> +		ret = range_split_to_ptes(lstart, lend, split_args);
>   		if (!ret)
> -			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> +			ret = range_split_to_ptes(kstart, kend, split_args);
>   		if (ret)
>   			panic("Failed to split linear map\n");
> -		flush_tlb_kernel_range(lstart, lend);
> +		if (split_args->mode != SPLIT_COLLECT)
> +			flush_tlb_kernel_range(lstart, lend);
>   
>   		/*
>   		 * Relies on dsb in flush_tlb_kernel_range() to avoid reordering
> @@ -963,10 +1034,61 @@ static int __init linear_map_split_to_ptes(void *__unused)
>   
>   void __init linear_map_maybe_split_to_ptes(void)
>   {
> -	if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
> -		init_idmap_kpti_bbml2_flag();
> -		stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> +	struct page **pages = NULL;
> +	struct split_args split_args;
> +	int order;
> +	int nr_populated;
> +	int err;
> +
> +	if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
> +		return;
> +
> +	/*
> +	 * Phase 1: Collect required extra pages to split.
> +	 */
> +	split_args.mode = SPLIT_COLLECT;
> +	split_args.nr = 0;
> +
> +	init_idmap_kpti_bbml2_flag();
> +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
> +
> +	/*
> +	 * Phase 2: Allocate necessary pages to split.
> +	 */
> +	if (split_args.nr == 0) {
> +		err = 0;
> +		split_args.mode = SPLIT_ALLOC;
> +	} else {
> +		err = -ENOMEM;
> +		order = order_base_2(PAGE_ALIGN(split_args.nr *
> +					sizeof(struct page *)) >> PAGE_SHIFT);
> +
> +		pages = (struct page **)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> +		if (!pages)
> +			goto error;
> +
> +		nr_populated = alloc_pages_bulk(GFP_ATOMIC | __GFP_ZERO, split_args.nr, pages);
> +		if (nr_populated < split_args.nr)
> +			goto error;

That is prone to cause trouble in the future once we allocate memdescs 
separately.

Can we simply call pagetable_alloc() in a loop here instead for the time 
being? A future-proof bulk allocation can be done later separately.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-18  7:51     ` Vlastimil Babka
@ 2025-12-18  8:32       ` Ryan Roberts
  0 siblings, 0 replies; 19+ messages in thread
From: Ryan Roberts @ 2025-12-18  8:32 UTC (permalink / raw)
  To: Vlastimil Babka, Yeoreum Yun, catalin.marinas, will, akpm, david,
	kevin.brodsky, quic_zhenhuah, dev.jain, yang, chaitanyas.prakash,
	bigeasy, clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel

On 18/12/2025 07:51, Vlastimil Babka wrote:
> On 12/17/25 20:03, Ryan Roberts wrote:
>> On 17/12/2025 18:20, Yeoreum Yun wrote:
>>> The current __kpti_install_ng_mappings() allocates a temporary PGD
>>> while installing the NG mapping for KPTI under stop_machine(),
>>> using GFP_ATOMIC.
>>>
>>> This is fine in the non-PREEMPT_RT case. However, it becomes a problem
>>> under PREEMPT_RT because generic memory allocation/free APIs
>>> (e.g., pgtable_alloc(), __get_free_pages(), etc.) cannot be invoked
>>> in a non-preemptible context, except for the *_nolock() variants.
>>> These generic allocators may sleep due to their use of spin_lock().
>>>
>>> In other words, calling __get_free_pages(), even with GFP_ATOMIC,
>>> is not allowed in __kpti_install_ng_mappings(), which is executed by
>>> the stopper thread where preemption is disabled under PREEMPT_RT.
>>>
>>> To address this, preallocate the page needed for the temporary PGD
>>> before invoking __kpti_install_ng_mappings() via stop_machine().
>>>
>>> Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled")
>>> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
>>> ---
>>>  arch/arm64/mm/mmu.c | 22 +++++++++++++---------
>>>  1 file changed, 13 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index e4e6c7e0a016..69d9651de0cd 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -1360,7 +1360,7 @@ static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
>>>  	return kpti_ng_temp_alloc;
>>>  }
>>>  
>>> -static int __init __kpti_install_ng_mappings(void *__unused)
>>> +static int __init __kpti_install_ng_mappings(void *data)
>>>  {
>>>  	typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
>>>  	extern kpti_remap_fn idmap_kpti_install_ng_mappings;
>>> @@ -1368,10 +1368,9 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>>>  
>>>  	int cpu = smp_processor_id();
>>>  	int levels = CONFIG_PGTABLE_LEVELS;
>>> -	int order = order_base_2(levels);
>>>  	u64 kpti_ng_temp_pgd_pa = 0;
>>>  	pgd_t *kpti_ng_temp_pgd;
>>> -	u64 alloc = 0;
>>> +	u64 alloc = *(u64 *)data;
>>>  
>>>  	if (levels == 5 && !pgtable_l5_enabled())
>>>  		levels = 4;
>>> @@ -1382,8 +1381,6 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>>>  
>>>  	if (!cpu) {
>>>  		int ret;
>>> -
>>> -		alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>>>  		kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
>>>  		kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
>>>  
>>> @@ -1414,16 +1411,17 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>>>  	remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
>>>  	cpu_uninstall_idmap();
>>>  
>>> -	if (!cpu) {
>>> -		free_pages(alloc, order);
>>> +	if (!cpu)
>>>  		arm64_use_ng_mappings = true;
>>> -	}
>>>  
>>>  	return 0;
>>>  }
>>>  
>>>  void __init kpti_install_ng_mappings(void)
>>>  {
>>> +	int order = order_base_2(CONFIG_PGTABLE_LEVELS);
>>> +	u64 alloc;
>>> +
>>>  	/* Check whether KPTI is going to be used */
>>>  	if (!arm64_kernel_unmapped_at_el0())
>>>  		return;
>>> @@ -1436,8 +1434,14 @@ void __init kpti_install_ng_mappings(void)
>>>  	if (arm64_use_ng_mappings)
>>>  		return;
>>>  
>>> +	alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>>
>> I don't think this requires GFP_ATOMIC now?
> 
> Do you mean it's fine to use GFP_KERNEL now, or still not, and you mean that
> GFP_NOWAIT is sufficient?

I mean it's fine to use "GFP_KERNEL | __GFP_ZERO"; this context can sleep so we
don't want to dip into the reserves - we can sleep and reclaim if needed.

> 
>> With that removed:
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
>>
>>> +	if (!alloc)
>>> +		panic("Failed to alloc page tables\n");
>>> +
>>>  	init_idmap_kpti_bbml2_flag();
>>> -	stop_machine(__kpti_install_ng_mappings, NULL, cpu_online_mask);
>>> +	stop_machine(__kpti_install_ng_mappings, &alloc, cpu_online_mask);
>>> +
>>> +	free_pages(alloc, order);
>>>  }
>>>  
>>>  static pgprot_t __init kernel_exec_prot(void)
>>
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-17 19:09     ` Yeoreum Yun
@ 2025-12-18  8:34       ` Ryan Roberts
  2025-12-18  8:37         ` Yeoreum Yun
  0 siblings, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-12-18  8:34 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
	dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
	lorenzo.stoakes, ardb, jackmanb, vbabka, linux-arm-kernel,
	linux-kernel, linux-rt-devel

On 17/12/2025 19:09, Yeoreum Yun wrote:
> Hi Ryanc,
> 
> [...]
>>> -	if (!cpu) {
>>> -		free_pages(alloc, order);
>>> +	if (!cpu)
>>>  		arm64_use_ng_mappings = true;
>>> -	}
>>>
>>>  	return 0;
>>>  }
>>>
>>>  void __init kpti_install_ng_mappings(void)
>>>  {
>>> +	int order = order_base_2(CONFIG_PGTABLE_LEVELS);
>>> +	u64 alloc;
>>> +
>>>  	/* Check whether KPTI is going to be used */
>>>  	if (!arm64_kernel_unmapped_at_el0())
>>>  		return;
>>> @@ -1436,8 +1434,14 @@ void __init kpti_install_ng_mappings(void)
>>>  	if (arm64_use_ng_mappings)
>>>  		return;
>>>
>>> +	alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>>
>> I don't think this requires GFP_ATOMIC now?
>>
>> With that removed:
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> I think it would be better to use only __GFP_HIGH in here since
> when kpti_install_ng_mappings() is called, "kswpd" doesn't created yet.
> and to allocate page with assurance, It would be good to use
> min_reserved to.
> 
> Am I missing something?

Personally I think we should just use "GFP_KERNEL | __GFP_ZERO". Anything else
would make this allocation look special, which it is not. If we fail to allocate
at this point in boot, we have bigger problems.

> 
> Thanks.
> 
> [...]
> 
> --
> Sincerely,
> Yeoreum Yun



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-18  8:34       ` Ryan Roberts
@ 2025-12-18  8:37         ` Yeoreum Yun
  2025-12-18  8:46           ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-18  8:37 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
	dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
	lorenzo.stoakes, ardb, jackmanb, vbabka, linux-arm-kernel,
	linux-kernel, linux-rt-devel

[...]
> > I think it would be better to use only __GFP_HIGH in here since
> > when kpti_install_ng_mappings() is called, "kswpd" doesn't created yet.
> > and to allocate page with assurance, It would be good to use
> > min_reserved to.
> >
> > Am I missing something?
>
> Personally I think we should just use "GFP_KERNEL | __GFP_ZERO". Anything else
> would make this allocation look special, which it is not. If we fail to allocate
> at this point in boot, we have bigger problems.

But I'm not sure *HOW effective* to use GFP_KERNEL in here.
Since it's before the any filesystem inited.
IOW, in this context, almost there would be no *page cache*
and I think it seems meaningless to use "GFP_KERNEL" and "direct
reclaim"

So to get success for allocation, __GFP_HIGH | _GFP_ZERO seems much
better.

Thanks


--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-18  8:37         ` Yeoreum Yun
@ 2025-12-18  8:46           ` David Hildenbrand (Red Hat)
  2025-12-18  9:31             ` Yeoreum Yun
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  8:46 UTC (permalink / raw)
  To: Yeoreum Yun, Ryan Roberts
  Cc: catalin.marinas, will, akpm, kevin.brodsky, quic_zhenhuah,
	dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
	lorenzo.stoakes, ardb, jackmanb, vbabka, linux-arm-kernel,
	linux-kernel, linux-rt-devel

On 12/18/25 09:37, Yeoreum Yun wrote:
> [...]
>>> I think it would be better to use only __GFP_HIGH in here since
>>> when kpti_install_ng_mappings() is called, "kswpd" doesn't created yet.
>>> and to allocate page with assurance, It would be good to use
>>> min_reserved to.
>>>
>>> Am I missing something?
>>
>> Personally I think we should just use "GFP_KERNEL | __GFP_ZERO". Anything else
>> would make this allocation look special, which it is not. If we fail to allocate
>> at this point in boot, we have bigger problems.
> 
> But I'm not sure *HOW effective* to use GFP_KERNEL in here.
> Since it's before the any filesystem inited.
> IOW, in this context, almost there would be no *page cache*
> and I think it seems meaningless to use "GFP_KERNEL" and "direct
> reclaim"
> 
> So to get success for allocation, __GFP_HIGH | _GFP_ZERO seems much
> better.

Unless there is a real reason to confuse readers why this is very 
special, just go with "GFP_KERNEL | __GFP_ZERO", really.

In particular if it doesn't matter in practice? Or does it and we are 
not getting your point?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping
  2025-12-18  8:23   ` David Hildenbrand (Red Hat)
@ 2025-12-18  9:08     ` Yeoreum Yun
  0 siblings, 0 replies; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-18  9:08 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: catalin.marinas, will, ryan.roberts, akpm, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
	linux-arm-kernel, linux-kernel, linux-rt-devel

Hi David,

> > Current linear_map_split_to_ptes() allocate the pagetable
> > while split linear mapping to ptes by stop_machine() with GFP_ATOMIC.
> >
> > This is fine for non-PREEMPR_RT case.
> > However It's a problem in PREEMPR_RT case since
> > generic memory allocation/free APIs (e.x) pgtable_alloc(), __get_free_pages and etc
> > couldn't be called in non-preemptible context except _nolock() APIs
> > since generic memory allocation/free APIs are *sleepable* for using *spin_lock()*
> >
> > IOW, calling a pgtable_alloc() even with GFP_ATOMIC
> > doesn't allow in __linear_map_split_to_pte() executed by stopper thread
> > where preemption is disabled in PREEMPR_RT.
> >
> > To address this, divide linear_map_maybe_split_ptes:
> >    - collect number of pages to require for spliting.
>
> COLLECT step
>
> >    - allocate the required number of pages for spliting.
>
> PREALLOC step
>
> >    - with pre-allocate page, split the linear map.
>
> SPLIT step
>
> The steps you have below are confusing :)
>
> > +enum split_modes {
>
> Is it a "mode" or a "step" ?
>
> enum split_step {
> 	SPLIT_STEP_COLLECT,
> 	...
> };
>
>
> But reading the code below I am lost how these modes apply here.
>
> I guess what you mean is something like
>
> enum split_mode {
> 	SPLIT_MODE_ALLOC, /* ... from the buddy. */
> 	SPLIT_MODE_PREALLOCATED, /* ... from preallocated buffer. */
> 	SPLIT_MODE_COLLECT, /* collect required page tables. */
> };

Sorry for *bad* naming from me.
Yes. your suggestion is much clearer. I'll change with this.

>
> But why is SPLIT_MODE_ALLOC vs. SPLIT_MODE_PREALLOCATED required?
>
> Can't we just test if there is something in the preallocated buffer, and if
> not, simply allocate?
>
> IOW, all you need is "collect vs. alloc", and the buffer information tells
> you how to allocate.

Agree. since it already check the "number of pgtable" before allocation
in linear_map_split_to_ptes(), That's would be better.

>
> > +	SPLIT_ALLOC,
> > +	SPLIT_PREALLOC,
> > +	SPLIT_COLLECT,
> > +};
> > -	pa = page_to_phys(ptdesc_page(ptdesc));
> > +struct split_args {
> > +	enum split_modes mode;
> > +	union {
> > +		struct {
> > +			unsigned long nr;
> > +			unsigned long i;
> > +			struct page **pages;
> > +		};
> > +		gfp_t gfp;
> > +	};
> > +};
> > +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
> > +					       struct ptdesc *ptdesc,
> > +					       enum pgtable_type pgtable_type)
> > +{
> >   	switch (pgtable_type) {
> >   	case TABLE_PTE:
> >   		BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
> > @@ -554,19 +564,56 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> >   		break;
> >   	}
> > +}
> > +
> > +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> > +				       enum pgtable_type pgtable_type)
> > +{
> > +	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> > +	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> > +	phys_addr_t pa;
> > +
> > +	if (!ptdesc)
> > +		return INVALID_PHYS_ADDR;
> > +
> > +	pa = page_to_phys(ptdesc_page(ptdesc));
> > +
> > +	__pgd_pgtable_init(mm, ptdesc, pgtable_type);
> > +
> >   	return pa;
> >   }
> > +static phys_addr_t __pgd_pgtable_prealloc(struct mm_struct *mm,
> > +					  struct split_args *split_args,
> > +					  enum pgtable_type pgtable_type)
>
> The function is called "_prealloc" but actually it simply grabs a
> preallocated page. Confusing.
>
>
> Should this function be called something like
> "__pgd_pgtable_get_preallocated()". Not sure.

Sorry to make a confuse. I think this name would be better.
I'll fix.


> > +{
> > +	struct page *page;
> > +
> > +	BUG_ON(split_args->i >= split_args->nr);
>
> No BUG_ON().

Okay. as keep only two mode:
    - COLLECT
    - ALLOC

I'll remove this.

> > +
> > +	page = split_args->pages[split_args->i++];
> > +	if (!page)
> > +		return INVALID_PHYS_ADDR;
> > +
> > +	__pgd_pgtable_init(mm, page_ptdesc(page), pgtable_type);
> > +
> > +	return page_to_phys(page);
> > +}
> > +
> >   static phys_addr_t
> > -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
> > +pgd_pgtable_alloc_split(enum pgtable_type pgtable_type,
> > +		        struct split_args *split_args)
> >   {
> > -	return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
> > +	if (split_args->mode == SPLIT_ALLOC)
> > +		return __pgd_pgtable_alloc(&init_mm, split_args->gfp, pgtable_type);
> > +
> > +	return __pgd_pgtable_prealloc(&init_mm, split_args, pgtable_type);
> >   }
> >   static phys_addr_t __maybe_unused
> >   pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
> >   {
> > -	return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
> > +	return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
> >   }
> >   static phys_addr_t
> > @@ -584,7 +631,9 @@ static void split_contpte(pte_t *ptep)
> >   		__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
> >   }
> > -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> > +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
> > +		     struct split_args *split_args,
> > +		     bool to_cont)
> >   {
> >   	pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
> >   	unsigned long pfn = pmd_pfn(pmd);
> > @@ -593,7 +642,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> >   	pte_t *ptep;
> >   	int i;
> > -	pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
> > +	pte_phys = pgd_pgtable_alloc_split(TABLE_PTE, split_args);
> >   	if (pte_phys == INVALID_PHYS_ADDR)
> >   		return -ENOMEM;
> >   	ptep = (pte_t *)phys_to_virt(pte_phys);
> > @@ -628,7 +677,9 @@ static void split_contpmd(pmd_t *pmdp)
> >   		set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
> >   }
> > -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> > +static int split_pud(pud_t *pudp, pud_t pud,
> > +		    struct split_args *split_args,
> > +		    bool to_cont)
> >   {
> >   	pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
> >   	unsigned int step = PMD_SIZE >> PAGE_SHIFT;
> > @@ -638,7 +689,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> >   	pmd_t *pmdp;
> >   	int i;
> > -	pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
> > +	pmd_phys = pgd_pgtable_alloc_split(TABLE_PMD, split_args);
> >   	if (pmd_phys == INVALID_PHYS_ADDR)
> >   		return -ENOMEM;
> >   	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
> > @@ -672,6 +723,10 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> >   	pmd_t *pmdp, pmd;
> >   	pte_t *ptep, pte;
> >   	int ret = 0;
> > +	struct split_args split_args = {
> > +		.mode = SPLIT_ALLOC,
> > +		.gfp = GFP_PGTABLE_KERNEL,
> > +	};
> >   	/*
> >   	 * PGD: If addr is PGD aligned then addr already describes a leaf
> > @@ -707,7 +762,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> >   	if (!pud_present(pud))
> >   		goto out;
> >   	if (pud_leaf(pud)) {
> > -		ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
> > +		ret = split_pud(pudp, pud, &split_args, true);
> >   		if (ret)
> >   			goto out;
> >   	}
> > @@ -732,7 +787,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> >   		 */
> >   		if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
> >   			goto out;
> > -		ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
> > +		ret = split_pmd(pmdp, pmd, &split_args, true);
> >   		if (ret)
> >   			goto out;
> >   	}
> > @@ -831,12 +886,17 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> >   static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> >   				   unsigned long next, struct mm_walk *walk)
> >   {
> > -	gfp_t gfp = *(gfp_t *)walk->private;
> > +	struct split_args *split_args = (struct split_args *)walk->private;
> >   	pud_t pud = pudp_get(pudp);
> >   	int ret = 0;
> > -	if (pud_leaf(pud))
> > -		ret = split_pud(pudp, pud, gfp, false);
> > +	if (!pud_leaf(pud))
> > +		return 0;
> > +
> > +	if (split_args->mode == SPLIT_COLLECT)
> > +		split_args->nr++;
> > +	else
> > +		ret = split_pud(pudp, pud, split_args, false);
> >   	return ret;
> >   }
> > @@ -844,22 +904,29 @@ static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> >   static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> >   				   unsigned long next, struct mm_walk *walk)
> >   {
> > -	gfp_t gfp = *(gfp_t *)walk->private;
> > +	struct split_args *split_args = (struct split_args *)walk->private;
> >   	pmd_t pmd = pmdp_get(pmdp);
> > -	int ret = 0;
> > +	int ret;
> > -	if (pmd_leaf(pmd)) {
> > -		if (pmd_cont(pmd))
> > -			split_contpmd(pmdp);
> > -		ret = split_pmd(pmdp, pmd, gfp, false);
> > +	if (!pmd_leaf(pmd))
> > +		return 0;
> > -		/*
> > -		 * We have split the pmd directly to ptes so there is no need to
> > -		 * visit each pte to check if they are contpte.
> > -		 */
> > -		walk->action = ACTION_CONTINUE;
> > +	if (split_args->mode == SPLIT_COLLECT) {
> > +		split_args->nr++;
> > +		return 0;
> >   	}
> > +	if (pmd_cont(pmd))
> > +		split_contpmd(pmdp);
> > +	ret = split_pmd(pmdp, pmd, split_args, false);
> > +
> > +	/*
> > +	 * We have split the pmd directly to ptes so there is no need to
> > +	 * visit each pte to check if they are contpte.
> > +	 */
> > +	walk->action = ACTION_CONTINUE;
> > +
> > +
> >   	return ret;
> >   }
> > @@ -880,13 +947,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
> >   	.pte_entry	= split_to_ptes_pte_entry,
> >   };
> > -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> > +static int range_split_to_ptes(unsigned long start, unsigned long end,
> > +		 	       struct split_args *split_args)
> >   {
> >   	int ret;
> >   	arch_enter_lazy_mmu_mode();
> >   	ret = walk_kernel_page_table_range_lockless(start, end,
> > -					&split_to_ptes_ops, NULL, &gfp);
> > +						    &split_to_ptes_ops, NULL,
> > +						    split_args);
> >   	arch_leave_lazy_mmu_mode();
> >   	return ret;
> > @@ -903,7 +972,7 @@ static void __init init_idmap_kpti_bbml2_flag(void)
> >   	smp_mb();
> >   }
> > -static int __init linear_map_split_to_ptes(void *__unused)
> > +static int __init linear_map_split_to_ptes(void *data)
> >   {
> >   	/*
> >   	 * Repainting the linear map must be done by CPU0 (the boot CPU) because
> > @@ -911,6 +980,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
> >   	 * be held in a waiting area with the idmap active.
> >   	 */
> >   	if (!smp_processor_id()) {
> > +		struct split_args *split_args = data;
> >   		unsigned long lstart = _PAGE_OFFSET(vabits_actual);
> >   		unsigned long lend = PAGE_END;
> >   		unsigned long kstart = (unsigned long)lm_alias(_stext);
> > @@ -928,12 +998,13 @@ static int __init linear_map_split_to_ptes(void *__unused)
> >   		 * PTE. The kernel alias remains static throughout runtime so
> >   		 * can continue to be safely mapped with large mappings.
> >   		 */
> > -		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> > +		ret = range_split_to_ptes(lstart, lend, split_args);
> >   		if (!ret)
> > -			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> > +			ret = range_split_to_ptes(kstart, kend, split_args);
> >   		if (ret)
> >   			panic("Failed to split linear map\n");
> > -		flush_tlb_kernel_range(lstart, lend);
> > +		if (split_args->mode != SPLIT_COLLECT)
> > +			flush_tlb_kernel_range(lstart, lend);
> >   		/*
> >   		 * Relies on dsb in flush_tlb_kernel_range() to avoid reordering
> > @@ -963,10 +1034,61 @@ static int __init linear_map_split_to_ptes(void *__unused)
> >   void __init linear_map_maybe_split_to_ptes(void)
> >   {
> > -	if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
> > -		init_idmap_kpti_bbml2_flag();
> > -		stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> > +	struct page **pages = NULL;
> > +	struct split_args split_args;
> > +	int order;
> > +	int nr_populated;
> > +	int err;
> > +
> > +	if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
> > +		return;
> > +
> > +	/*
> > +	 * Phase 1: Collect required extra pages to split.
> > +	 */
> > +	split_args.mode = SPLIT_COLLECT;
> > +	split_args.nr = 0;
> > +
> > +	init_idmap_kpti_bbml2_flag();
> > +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
> > +
> > +	/*
> > +	 * Phase 2: Allocate necessary pages to split.
> > +	 */
> > +	if (split_args.nr == 0) {
> > +		err = 0;
> > +		split_args.mode = SPLIT_ALLOC;
> > +	} else {
> > +		err = -ENOMEM;
> > +		order = order_base_2(PAGE_ALIGN(split_args.nr *
> > +					sizeof(struct page *)) >> PAGE_SHIFT);
> > +
> > +		pages = (struct page **)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> > +		if (!pages)
> > +			goto error;
> > +
> > +		nr_populated = alloc_pages_bulk(GFP_ATOMIC | __GFP_ZERO, split_args.nr, pages);
> > +		if (nr_populated < split_args.nr)
> > +			goto error;
>
> That is prone to cause trouble in the future once we allocate memdescs
> separately.
>
> Can we simply call pagetable_alloc() in a loop here instead for the time
> being? A future-proof bulk allocation can be done later separately.

I miss the future plan.
Yes, I'll change to pagetable_alloc().

Thanks!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-18  8:46           ` David Hildenbrand (Red Hat)
@ 2025-12-18  9:31             ` Yeoreum Yun
  2025-12-18  9:41               ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-18  9:31 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Ryan Roberts, catalin.marinas, will, akpm, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
	linux-arm-kernel, linux-kernel, linux-rt-devel

Hi David,

> On 12/18/25 09:37, Yeoreum Yun wrote:
> > [...]
> > > > I think it would be better to use only __GFP_HIGH in here since
> > > > when kpti_install_ng_mappings() is called, "kswpd" doesn't created yet.
> > > > and to allocate page with assurance, It would be good to use
> > > > min_reserved to.
> > > >
> > > > Am I missing something?
> > >
> > > Personally I think we should just use "GFP_KERNEL | __GFP_ZERO". Anything else
> > > would make this allocation look special, which it is not. If we fail to allocate
> > > at this point in boot, we have bigger problems.
> >
> > But I'm not sure *HOW effective* to use GFP_KERNEL in here.
> > Since it's before the any filesystem inited.
> > IOW, in this context, almost there would be no *page cache*
> > and I think it seems meaningless to use "GFP_KERNEL" and "direct
> > reclaim"
> >
> > So to get success for allocation, __GFP_HIGH | _GFP_ZERO seems much
> > better.
>
> Unless there is a real reason to confuse readers why this is very special,
> just go with "GFP_KERNEL | __GFP_ZERO", really.
>
> In particular if it doesn't matter in practice? Or does it and we are not
> getting your point?

My worries was
    - kpti_install_ng_mappings() is called while in "smp_init()" which is
      before creating the kswapd thread via module_init().
      Just wondered whether it allows to call wakeup_kswapd() before
      kswapd is created.

    - Similar reason kcompactd too.

    - Just wonder how much direct reclaim is effecitve since
      when kpti_install_ng_mappings() called before each
      filesystem initialised where not much of page cache in usage.

TBH (1) and (2) seems fine since each wakeup function checks
the waitqueue. but because of (3),
I think not GFP_KERNEL but __GFP_HIGH | __GFP_DIRECT_RECLAIM | __GFP_ZERO (?)

Am I missing?

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-18  9:31             ` Yeoreum Yun
@ 2025-12-18  9:41               ` David Hildenbrand (Red Hat)
  2025-12-18 10:07                 ` Yeoreum Yun
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  9:41 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Ryan Roberts, catalin.marinas, will, akpm, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
	linux-arm-kernel, linux-kernel, linux-rt-devel

On 12/18/25 10:31, Yeoreum Yun wrote:
> Hi David,
> 
>> On 12/18/25 09:37, Yeoreum Yun wrote:
>>> [...]
>>>>> I think it would be better to use only __GFP_HIGH in here since
>>>>> when kpti_install_ng_mappings() is called, "kswpd" doesn't created yet.
>>>>> and to allocate page with assurance, It would be good to use
>>>>> min_reserved to.
>>>>>
>>>>> Am I missing something?
>>>>
>>>> Personally I think we should just use "GFP_KERNEL | __GFP_ZERO". Anything else
>>>> would make this allocation look special, which it is not. If we fail to allocate
>>>> at this point in boot, we have bigger problems.
>>>
>>> But I'm not sure *HOW effective* to use GFP_KERNEL in here.
>>> Since it's before the any filesystem inited.
>>> IOW, in this context, almost there would be no *page cache*
>>> and I think it seems meaningless to use "GFP_KERNEL" and "direct
>>> reclaim"
>>>
>>> So to get success for allocation, __GFP_HIGH | _GFP_ZERO seems much
>>> better.
>>
>> Unless there is a real reason to confuse readers why this is very special,
>> just go with "GFP_KERNEL | __GFP_ZERO", really.
>>
>> In particular if it doesn't matter in practice? Or does it and we are not
>> getting your point?
> 
> My worries was
>      - kpti_install_ng_mappings() is called while in "smp_init()" which is
>        before creating the kswapd thread via module_init().
>        Just wondered whether it allows to call wakeup_kswapd() before
>        kswapd is created.

The buddy should really be able to deal with that, no?

> 
>      - Similar reason kcompactd too.

Same as well.

We cannot expect alloc API users to know about these hidden details to 
work around them :)

> 
>      - Just wonder how much direct reclaim is effecitve since
>        when kpti_install_ng_mappings() called before each
>        filesystem initialised where not much of page cache in usage.

Right, but do you really think we would ever trigger that path?

The default should always be GFP_KERNEL unless we have for very good 
reason special demands.

So, do you think in practice there is real value in NOT using GFP_KERNEL? :)

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
  2025-12-18  9:41               ` David Hildenbrand (Red Hat)
@ 2025-12-18 10:07                 ` Yeoreum Yun
  0 siblings, 0 replies; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-18 10:07 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Ryan Roberts, catalin.marinas, will, akpm, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
	linux-arm-kernel, linux-kernel, linux-rt-devel

Hi David,

> On 12/18/25 10:31, Yeoreum Yun wrote:
> > Hi David,
> >
> > > On 12/18/25 09:37, Yeoreum Yun wrote:
> > > > [...]
> > > > > > I think it would be better to use only __GFP_HIGH in here since
> > > > > > when kpti_install_ng_mappings() is called, "kswpd" doesn't created yet.
> > > > > > and to allocate page with assurance, It would be good to use
> > > > > > min_reserved to.
> > > > > >
> > > > > > Am I missing something?
> > > > >
> > > > > Personally I think we should just use "GFP_KERNEL | __GFP_ZERO". Anything else
> > > > > would make this allocation look special, which it is not. If we fail to allocate
> > > > > at this point in boot, we have bigger problems.
> > > >
> > > > But I'm not sure *HOW effective* to use GFP_KERNEL in here.
> > > > Since it's before the any filesystem inited.
> > > > IOW, in this context, almost there would be no *page cache*
> > > > and I think it seems meaningless to use "GFP_KERNEL" and "direct
> > > > reclaim"
> > > >
> > > > So to get success for allocation, __GFP_HIGH | _GFP_ZERO seems much
> > > > better.
> > >
> > > Unless there is a real reason to confuse readers why this is very special,
> > > just go with "GFP_KERNEL | __GFP_ZERO", really.
> > >
> > > In particular if it doesn't matter in practice? Or does it and we are not
> > > getting your point?
> >
> > My worries was
> >      - kpti_install_ng_mappings() is called while in "smp_init()" which is
> >        before creating the kswapd thread via module_init().
> >        Just wondered whether it allows to call wakeup_kswapd() before
> >        kswapd is created.
>
> The buddy should really be able to deal with that, no?
>
> >
> >      - Similar reason kcompactd too.
>
> Same as well.

Nope. buddy handles them. It was a past :)

>
> We cannot expect alloc API users to know about these hidden details to work
> around them :)
>
> >
> >      - Just wonder how much direct reclaim is effecitve since
> >        when kpti_install_ng_mappings() called before each
> >        filesystem initialised where not much of page cache in usage.
>
> Right, but do you really think we would ever trigger that path?
>
> The default should always be GFP_KERNEL unless we have for very good reason
> special demands.
>
> So, do you think in practice there is real value in NOT using GFP_KERNEL? :)

In the *practical view*. Absoultely NOT
since I don't think it reaches those codes.

Okay. I'll change with GFP_KERNEL | __GFP_ZERO.

Thanks!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping
  2025-12-17 18:20 ` [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping Yeoreum Yun
  2025-12-18  8:23   ` David Hildenbrand (Red Hat)
@ 2025-12-18 14:22   ` Ryan Roberts
  2025-12-18 15:01     ` Yeoreum Yun
  1 sibling, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-12-18 14:22 UTC (permalink / raw)
  To: Yeoreum Yun, catalin.marinas, will, akpm, david, kevin.brodsky,
	quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
	clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka
  Cc: linux-arm-kernel, linux-kernel, linux-rt-devel

On 17/12/2025 18:20, Yeoreum Yun wrote:
> Current linear_map_split_to_ptes() allocate the pagetable
> while split linear mapping to ptes by stop_machine() with GFP_ATOMIC.
> 
> This is fine for non-PREEMPR_RT case.
> However It's a problem in PREEMPR_RT case since
> generic memory allocation/free APIs (e.x) pgtable_alloc(), __get_free_pages and etc
> couldn't be called in non-preemptible context except _nolock() APIs
> since generic memory allocation/free APIs are *sleepable* for using *spin_lock()*
> 
> IOW, calling a pgtable_alloc() even with GFP_ATOMIC
> doesn't allow in __linear_map_split_to_pte() executed by stopper thread
> where preemption is disabled in PREEMPR_RT.
> 
> To address this, divide linear_map_maybe_split_ptes:
>   - collect number of pages to require for spliting.
>   - allocate the required number of pages for spliting.
>   - with pre-allocate page, split the linear map.

Thanks for working on this fix!

First some high level comments: I'm not a huge fan of the approach with the
different modes to modify behaviour.

For the first step of figuring out the number of pages required; there is no
need to be inside stop_machine() for that. Can should be able to just walk the
linear map without any locking since it is created once and remains static, with
the exception of hotplug, but that is not enabled yet.

I think it would be cleaner to just create separate walker callbacks rather than
repurpose the existing walker.

Then you could simply change the gfp flag to a callback function pointer and
pass the desired allocator as a function pointer. This would match the existing
patterns used in mmu.c today.

The preallocated page list could be stored in a global static variable and you
could create an allocation function that just strims a page from that list. See
kpti_ng_pgd_alloc() for an existing example. That assumes a contiguous block,
but you could generalize it to hold a list of struct pages and reuse it for both
cases?


> 
> Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
>  arch/arm64/mm/mmu.c | 213 +++++++++++++++++++++++++++++++++++---------
>  1 file changed, 170 insertions(+), 43 deletions(-)
> 
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 9ae7ce00a7ef..e4e6c7e0a016 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -527,18 +527,28 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>  		panic("Failed to create page tables\n");
>  }
>  
> -static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> -				       enum pgtable_type pgtable_type)
> -{
> -	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> -	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> -	phys_addr_t pa;
> -
> -	if (!ptdesc)
> -		return INVALID_PHYS_ADDR;
> +enum split_modes {
> +	SPLIT_ALLOC,
> +	SPLIT_PREALLOC,
> +	SPLIT_COLLECT,
> +};
>  
> -	pa = page_to_phys(ptdesc_page(ptdesc));
> +struct split_args {
> +	enum split_modes mode;
> +	union {
> +		struct {
> +			unsigned long nr;
> +			unsigned long i;
> +			struct page **pages;
> +		};
> +		gfp_t gfp;
> +	};
> +};
>  
> +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
> +					       struct ptdesc *ptdesc,
> +					       enum pgtable_type pgtable_type)
> +{
>  	switch (pgtable_type) {
>  	case TABLE_PTE:
>  		BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
> @@ -554,19 +564,56 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>  		break;
>  	}
>  
> +}
> +
> +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> +				       enum pgtable_type pgtable_type)
> +{
> +	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> +	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> +	phys_addr_t pa;
> +
> +	if (!ptdesc)
> +		return INVALID_PHYS_ADDR;
> +
> +	pa = page_to_phys(ptdesc_page(ptdesc));
> +
> +	__pgd_pgtable_init(mm, ptdesc, pgtable_type);
> +
>  	return pa;
>  }
>  
> +static phys_addr_t __pgd_pgtable_prealloc(struct mm_struct *mm,
> +					  struct split_args *split_args,
> +					  enum pgtable_type pgtable_type)
> +{
> +	struct page *page;
> +
> +	BUG_ON(split_args->i >= split_args->nr);
> +
> +	page = split_args->pages[split_args->i++];
> +	if (!page)
> +		return INVALID_PHYS_ADDR;
> +
> +	__pgd_pgtable_init(mm, page_ptdesc(page), pgtable_type);
> +
> +	return page_to_phys(page);
> +}
> +
>  static phys_addr_t
> -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
> +pgd_pgtable_alloc_split(enum pgtable_type pgtable_type,
> +		        struct split_args *split_args)
>  {
> -	return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
> +	if (split_args->mode == SPLIT_ALLOC)
> +		return __pgd_pgtable_alloc(&init_mm, split_args->gfp, pgtable_type);
> +
> +	return __pgd_pgtable_prealloc(&init_mm, split_args, pgtable_type);
>  }
>  
>  static phys_addr_t __maybe_unused
>  pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
>  {
> -	return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
> +	return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
>  }
>  
>  static phys_addr_t
> @@ -584,7 +631,9 @@ static void split_contpte(pte_t *ptep)
>  		__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
>  }
>  
> -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
> +		     struct split_args *split_args,
> +		     bool to_cont)
>  {
>  	pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
>  	unsigned long pfn = pmd_pfn(pmd);
> @@ -593,7 +642,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>  	pte_t *ptep;
>  	int i;
>  
> -	pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
> +	pte_phys = pgd_pgtable_alloc_split(TABLE_PTE, split_args);
>  	if (pte_phys == INVALID_PHYS_ADDR)
>  		return -ENOMEM;
>  	ptep = (pte_t *)phys_to_virt(pte_phys);
> @@ -628,7 +677,9 @@ static void split_contpmd(pmd_t *pmdp)
>  		set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
>  }
>  
> -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> +static int split_pud(pud_t *pudp, pud_t pud,
> +		    struct split_args *split_args,
> +		    bool to_cont)
>  {
>  	pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
>  	unsigned int step = PMD_SIZE >> PAGE_SHIFT;
> @@ -638,7 +689,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>  	pmd_t *pmdp;
>  	int i;
>  
> -	pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
> +	pmd_phys = pgd_pgtable_alloc_split(TABLE_PMD, split_args);
>  	if (pmd_phys == INVALID_PHYS_ADDR)
>  		return -ENOMEM;
>  	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
> @@ -672,6 +723,10 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>  	pmd_t *pmdp, pmd;
>  	pte_t *ptep, pte;
>  	int ret = 0;
> +	struct split_args split_args = {
> +		.mode = SPLIT_ALLOC,
> +		.gfp = GFP_PGTABLE_KERNEL,
> +	};
>  
>  	/*
>  	 * PGD: If addr is PGD aligned then addr already describes a leaf
> @@ -707,7 +762,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>  	if (!pud_present(pud))
>  		goto out;
>  	if (pud_leaf(pud)) {
> -		ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
> +		ret = split_pud(pudp, pud, &split_args, true);
>  		if (ret)
>  			goto out;
>  	}
> @@ -732,7 +787,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>  		 */
>  		if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
>  			goto out;
> -		ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
> +		ret = split_pmd(pmdp, pmd, &split_args, true);
>  		if (ret)
>  			goto out;
>  	}
> @@ -831,12 +886,17 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>  static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>  				   unsigned long next, struct mm_walk *walk)
>  {
> -	gfp_t gfp = *(gfp_t *)walk->private;
> +	struct split_args *split_args = (struct split_args *)walk->private;

no need for the casting here; it's a void *.

>  	pud_t pud = pudp_get(pudp);
>  	int ret = 0;
>  
> -	if (pud_leaf(pud))
> -		ret = split_pud(pudp, pud, gfp, false);
> +	if (!pud_leaf(pud))
> +		return 0;
> +
> +	if (split_args->mode == SPLIT_COLLECT)
> +		split_args->nr++;

bug: you're incrementing this for the pmd table that you would have inserted,
but because you're not actually splitting, split_to_ptes_pmd_entry() will never
be called for the 512 pte tables you also need. So you need:

		nr += 1 + PTRS_PER_PMD;

I'm guessing you didn't test this as the BUG would have triggered when you ran
out of preallocated memory?

> +	else
> +		ret = split_pud(pudp, pud, split_args, false);
>  
>  	return ret;
>  }
> @@ -844,22 +904,29 @@ static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>  static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>  				   unsigned long next, struct mm_walk *walk)
>  {
> -	gfp_t gfp = *(gfp_t *)walk->private;
> +	struct split_args *split_args = (struct split_args *)walk->private;
>  	pmd_t pmd = pmdp_get(pmdp);
> -	int ret = 0;
> +	int ret;
>  
> -	if (pmd_leaf(pmd)) {
> -		if (pmd_cont(pmd))
> -			split_contpmd(pmdp);
> -		ret = split_pmd(pmdp, pmd, gfp, false);
> +	if (!pmd_leaf(pmd))
> +		return 0;
>  
> -		/*
> -		 * We have split the pmd directly to ptes so there is no need to
> -		 * visit each pte to check if they are contpte.
> -		 */
> -		walk->action = ACTION_CONTINUE;
> +	if (split_args->mode == SPLIT_COLLECT) {
> +		split_args->nr++;
> +		return 0;
>  	}
>  
> +	if (pmd_cont(pmd))
> +		split_contpmd(pmdp);
> +	ret = split_pmd(pmdp, pmd, split_args, false);
> +
> +	/*
> +	 * We have split the pmd directly to ptes so there is no need to
> +	 * visit each pte to check if they are contpte.
> +	 */
> +	walk->action = ACTION_CONTINUE;
> +
> +
>  	return ret;
>  }
>  
> @@ -880,13 +947,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
>  	.pte_entry	= split_to_ptes_pte_entry,
>  };
>  
> -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> +static int range_split_to_ptes(unsigned long start, unsigned long end,
> +		 	       struct split_args *split_args)
>  {
>  	int ret;
>  
>  	arch_enter_lazy_mmu_mode();
>  	ret = walk_kernel_page_table_range_lockless(start, end,
> -					&split_to_ptes_ops, NULL, &gfp);
> +						    &split_to_ptes_ops, NULL,
> +						    split_args);
>  	arch_leave_lazy_mmu_mode();
>  
>  	return ret;
> @@ -903,7 +972,7 @@ static void __init init_idmap_kpti_bbml2_flag(void)
>  	smp_mb();
>  }
>  
> -static int __init linear_map_split_to_ptes(void *__unused)
> +static int __init linear_map_split_to_ptes(void *data)
>  {
>  	/*
>  	 * Repainting the linear map must be done by CPU0 (the boot CPU) because
> @@ -911,6 +980,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
>  	 * be held in a waiting area with the idmap active.
>  	 */
>  	if (!smp_processor_id()) {
> +		struct split_args *split_args = data;
>  		unsigned long lstart = _PAGE_OFFSET(vabits_actual);
>  		unsigned long lend = PAGE_END;
>  		unsigned long kstart = (unsigned long)lm_alias(_stext);
> @@ -928,12 +998,13 @@ static int __init linear_map_split_to_ptes(void *__unused)
>  		 * PTE. The kernel alias remains static throughout runtime so
>  		 * can continue to be safely mapped with large mappings.
>  		 */
> -		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> +		ret = range_split_to_ptes(lstart, lend, split_args);
>  		if (!ret)
> -			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> +			ret = range_split_to_ptes(kstart, kend, split_args);
>  		if (ret)
>  			panic("Failed to split linear map\n");
> -		flush_tlb_kernel_range(lstart, lend);
> +		if (split_args->mode != SPLIT_COLLECT)
> +			flush_tlb_kernel_range(lstart, lend);
>  
>  		/*
>  		 * Relies on dsb in flush_tlb_kernel_range() to avoid reordering
> @@ -963,10 +1034,61 @@ static int __init linear_map_split_to_ptes(void *__unused)
>  
>  void __init linear_map_maybe_split_to_ptes(void)
>  {
> -	if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
> -		init_idmap_kpti_bbml2_flag();
> -		stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> +	struct page **pages = NULL;
> +	struct split_args split_args;
> +	int order;
> +	int nr_populated;
> +	int err;
> +
> +	if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
> +		return;
> +
> +	/*
> +	 * Phase 1: Collect required extra pages to split.
> +	 */
> +	split_args.mode = SPLIT_COLLECT;
> +	split_args.nr = 0;
> +
> +	init_idmap_kpti_bbml2_flag();
> +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);

This doesn't require stop_machine().

> +
> +	/*
> +	 * Phase 2: Allocate necessary pages to split.
> +	 */
> +	if (split_args.nr == 0) {
> +		err = 0;
> +		split_args.mode = SPLIT_ALLOC;

How would we hit this condition? If everything is already split why not just
exit early?

> +	} else {
> +		err = -ENOMEM;
> +		order = order_base_2(PAGE_ALIGN(split_args.nr *
> +					sizeof(struct page *)) >> PAGE_SHIFT);
> +
> +		pages = (struct page **)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);

As per the other patch, I'd prefer to see GFP_KERNEL | __GFP_ZERO here.

Let's make this a list of struct ptdesc * though.


> +		if (!pages)
> +			goto error;
> +
> +		nr_populated = alloc_pages_bulk(GFP_ATOMIC | __GFP_ZERO, split_args.nr, pages);

Just GFP_KERNEL. No need for __GFP_ZERO as they will all be fully overwritten
with PTEs.

I'd prefer to use pagetable_alloc(GFP_KERNEL, 0) in a loop since that will
return a proper ptdesc.

> +		if (nr_populated < split_args.nr)
> +			goto error;
> +
> +		err = 0;
> +		split_args.mode = SPLIT_PREALLOC;
> +		split_args.i = 0;
> +		split_args.pages = pages;
>  	}
> +
> +	/*
> +	 * Phase 3: Split linear map.
> +	 */
> +	init_idmap_kpti_bbml2_flag();
> +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
> +
> +error:
> +	if (pages)
> +		free_pages((unsigned long)pages, order);

It's possible (but unlikely) that another CPU caused a split between figuring
out the required number of tables and actually doing the full split. In this
case, you would have preallocated pages left over that you will need to free here.

Thanks,
Ryan

> +
> +	if (err)
> +		panic("Failed to split linear map: %d\n", err);
>  }
>  
>  /*
> @@ -1087,6 +1209,11 @@ bool arch_kfence_init_pool(void)
>  	unsigned long start = (unsigned long)__kfence_pool;
>  	unsigned long end = start + KFENCE_POOL_SIZE;
>  	int ret;
> +	struct split_args split_args = {
> +		.mode = SPLIT_ALLOC,
> +		.gfp = GFP_PGTABLE_KERNEL,
> +	};
> +
>  
>  	/* Exit early if we know the linear map is already pte-mapped. */
>  	if (!split_leaf_mapping_possible())
> @@ -1097,7 +1224,7 @@ bool arch_kfence_init_pool(void)
>  		return true;
>  
>  	mutex_lock(&pgtable_split_lock);
> -	ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
> +	ret = range_split_to_ptes(start, end, &split_args);
>  	mutex_unlock(&pgtable_split_lock);
>  
>  	/*



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping
  2025-12-18 14:22   ` Ryan Roberts
@ 2025-12-18 15:01     ` Yeoreum Yun
  2025-12-18 15:42       ` Ryan Roberts
  0 siblings, 1 reply; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-18 15:01 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
	dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
	lorenzo.stoakes, ardb, jackmanb, vbabka, linux-arm-kernel,
	linux-kernel, linux-rt-devel

Hi Ryan,

> On 17/12/2025 18:20, Yeoreum Yun wrote:
> > Current linear_map_split_to_ptes() allocate the pagetable
> > while split linear mapping to ptes by stop_machine() with GFP_ATOMIC.
> >
> > This is fine for non-PREEMPR_RT case.
> > However It's a problem in PREEMPR_RT case since
> > generic memory allocation/free APIs (e.x) pgtable_alloc(), __get_free_pages and etc
> > couldn't be called in non-preemptible context except _nolock() APIs
> > since generic memory allocation/free APIs are *sleepable* for using *spin_lock()*
> >
> > IOW, calling a pgtable_alloc() even with GFP_ATOMIC
> > doesn't allow in __linear_map_split_to_pte() executed by stopper thread
> > where preemption is disabled in PREEMPR_RT.
> >
> > To address this, divide linear_map_maybe_split_ptes:
> >   - collect number of pages to require for spliting.
> >   - allocate the required number of pages for spliting.
> >   - with pre-allocate page, split the linear map.
>
> Thanks for working on this fix!
>
> First some high level comments: I'm not a huge fan of the approach with the
> different modes to modify behaviour.
>
> For the first step of figuring out the number of pages required; there is no
> need to be inside stop_machine() for that. Can should be able to just walk the
> linear map without any locking since it is created once and remains static, with
> the exception of hotplug, but that is not enabled yet.
>
> I think it would be cleaner to just create separate walker callbacks rather than
> repurpose the existing walker.

Okay. I'll repsin with separate walker.

>
> Then you could simply change the gfp flag to a callback function pointer and
> pass the desired allocator as a function pointer. This would match the existing
> patterns used in mmu.c today.
>
> The preallocated page list could be stored in a global static variable and you
> could create an allocation function that just strims a page from that list. See
> kpti_ng_pgd_alloc() for an existing example. That assumes a contiguous block,
> but you could generalize it to hold a list of struct pages and reuse it for both
> cases?

If we use walker callbacks, I think it not only pass the callback
function, but it need to pass the gfp flag too since split_xxx() is
called by another functions. (i.e) set_memory_xxx()

I think I can remove the mode with the suggestion of @David
but, I think it would be still use split_args with information of
gfp and preallocate buffer.
>
> >
> > Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > ---
> >  arch/arm64/mm/mmu.c | 213 +++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 170 insertions(+), 43 deletions(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 9ae7ce00a7ef..e4e6c7e0a016 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -527,18 +527,28 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> >  		panic("Failed to create page tables\n");
> >  }
> >
> > -static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> > -				       enum pgtable_type pgtable_type)
> > -{
> > -	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> > -	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> > -	phys_addr_t pa;
> > -
> > -	if (!ptdesc)
> > -		return INVALID_PHYS_ADDR;
> > +enum split_modes {
> > +	SPLIT_ALLOC,
> > +	SPLIT_PREALLOC,
> > +	SPLIT_COLLECT,
> > +};
> >
> > -	pa = page_to_phys(ptdesc_page(ptdesc));
> > +struct split_args {
> > +	enum split_modes mode;
> > +	union {
> > +		struct {
> > +			unsigned long nr;
> > +			unsigned long i;
> > +			struct page **pages;
> > +		};
> > +		gfp_t gfp;
> > +	};
> > +};
> >
> > +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
> > +					       struct ptdesc *ptdesc,
> > +					       enum pgtable_type pgtable_type)
> > +{
> >  	switch (pgtable_type) {
> >  	case TABLE_PTE:
> >  		BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
> > @@ -554,19 +564,56 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> >  		break;
> >  	}
> >
> > +}
> > +
> > +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> > +				       enum pgtable_type pgtable_type)
> > +{
> > +	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> > +	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> > +	phys_addr_t pa;
> > +
> > +	if (!ptdesc)
> > +		return INVALID_PHYS_ADDR;
> > +
> > +	pa = page_to_phys(ptdesc_page(ptdesc));
> > +
> > +	__pgd_pgtable_init(mm, ptdesc, pgtable_type);
> > +
> >  	return pa;
> >  }
> >
> > +static phys_addr_t __pgd_pgtable_prealloc(struct mm_struct *mm,
> > +					  struct split_args *split_args,
> > +					  enum pgtable_type pgtable_type)
> > +{
> > +	struct page *page;
> > +
> > +	BUG_ON(split_args->i >= split_args->nr);
> > +
> > +	page = split_args->pages[split_args->i++];
> > +	if (!page)
> > +		return INVALID_PHYS_ADDR;
> > +
> > +	__pgd_pgtable_init(mm, page_ptdesc(page), pgtable_type);
> > +
> > +	return page_to_phys(page);
> > +}
> > +
> >  static phys_addr_t
> > -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
> > +pgd_pgtable_alloc_split(enum pgtable_type pgtable_type,
> > +		        struct split_args *split_args)
> >  {
> > -	return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
> > +	if (split_args->mode == SPLIT_ALLOC)
> > +		return __pgd_pgtable_alloc(&init_mm, split_args->gfp, pgtable_type);
> > +
> > +	return __pgd_pgtable_prealloc(&init_mm, split_args, pgtable_type);
> >  }
> >
> >  static phys_addr_t __maybe_unused
> >  pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
> >  {
> > -	return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
> > +	return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
> >  }
> >
> >  static phys_addr_t
> > @@ -584,7 +631,9 @@ static void split_contpte(pte_t *ptep)
> >  		__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
> >  }
> >
> > -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> > +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
> > +		     struct split_args *split_args,
> > +		     bool to_cont)
> >  {
> >  	pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
> >  	unsigned long pfn = pmd_pfn(pmd);
> > @@ -593,7 +642,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> >  	pte_t *ptep;
> >  	int i;
> >
> > -	pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
> > +	pte_phys = pgd_pgtable_alloc_split(TABLE_PTE, split_args);
> >  	if (pte_phys == INVALID_PHYS_ADDR)
> >  		return -ENOMEM;
> >  	ptep = (pte_t *)phys_to_virt(pte_phys);
> > @@ -628,7 +677,9 @@ static void split_contpmd(pmd_t *pmdp)
> >  		set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
> >  }
> >
> > -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> > +static int split_pud(pud_t *pudp, pud_t pud,
> > +		    struct split_args *split_args,
> > +		    bool to_cont)
> >  {
> >  	pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
> >  	unsigned int step = PMD_SIZE >> PAGE_SHIFT;
> > @@ -638,7 +689,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> >  	pmd_t *pmdp;
> >  	int i;
> >
> > -	pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
> > +	pmd_phys = pgd_pgtable_alloc_split(TABLE_PMD, split_args);
> >  	if (pmd_phys == INVALID_PHYS_ADDR)
> >  		return -ENOMEM;
> >  	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
> > @@ -672,6 +723,10 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> >  	pmd_t *pmdp, pmd;
> >  	pte_t *ptep, pte;
> >  	int ret = 0;
> > +	struct split_args split_args = {
> > +		.mode = SPLIT_ALLOC,
> > +		.gfp = GFP_PGTABLE_KERNEL,
> > +	};
> >
> >  	/*
> >  	 * PGD: If addr is PGD aligned then addr already describes a leaf
> > @@ -707,7 +762,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> >  	if (!pud_present(pud))
> >  		goto out;
> >  	if (pud_leaf(pud)) {
> > -		ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
> > +		ret = split_pud(pudp, pud, &split_args, true);
> >  		if (ret)
> >  			goto out;
> >  	}
> > @@ -732,7 +787,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> >  		 */
> >  		if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
> >  			goto out;
> > -		ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
> > +		ret = split_pmd(pmdp, pmd, &split_args, true);
> >  		if (ret)
> >  			goto out;
> >  	}
> > @@ -831,12 +886,17 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> >  static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> >  				   unsigned long next, struct mm_walk *walk)
> >  {
> > -	gfp_t gfp = *(gfp_t *)walk->private;
> > +	struct split_args *split_args = (struct split_args *)walk->private;
>
> no need for the casting here; it's a void *.

Right. I'll remove it.

>
> >  	pud_t pud = pudp_get(pudp);
> >  	int ret = 0;
> >
> > -	if (pud_leaf(pud))
> > -		ret = split_pud(pudp, pud, gfp, false);
> > +	if (!pud_leaf(pud))
> > +		return 0;
> > +
> > +	if (split_args->mode == SPLIT_COLLECT)
> > +		split_args->nr++;
>
> bug: you're incrementing this for the pmd table that you would have inserted,
> but because you're not actually splitting, split_to_ptes_pmd_entry() will never
> be called for the 512 pte tables you also need. So you need:
>
> 		nr += 1 + PTRS_PER_PMD;
>
> I'm guessing you didn't test this as the BUG would have triggered when you ran
> out of preallocated memory?
>

Good catch! Yes. This is definitely bug.
The reason why I didn't hit the but there is no "block pud" in my enviroment.

> > +	else
> > +		ret = split_pud(pudp, pud, split_args, false);
> >
> >  	return ret;
> >  }
> > @@ -844,22 +904,29 @@ static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> >  static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> >  				   unsigned long next, struct mm_walk *walk)
> >  {
> > -	gfp_t gfp = *(gfp_t *)walk->private;
> > +	struct split_args *split_args = (struct split_args *)walk->private;
> >  	pmd_t pmd = pmdp_get(pmdp);
> > -	int ret = 0;
> > +	int ret;
> >
> > -	if (pmd_leaf(pmd)) {
> > -		if (pmd_cont(pmd))
> > -			split_contpmd(pmdp);
> > -		ret = split_pmd(pmdp, pmd, gfp, false);
> > +	if (!pmd_leaf(pmd))
> > +		return 0;
> >
> > -		/*
> > -		 * We have split the pmd directly to ptes so there is no need to
> > -		 * visit each pte to check if they are contpte.
> > -		 */
> > -		walk->action = ACTION_CONTINUE;
> > +	if (split_args->mode == SPLIT_COLLECT) {
> > +		split_args->nr++;
> > +		return 0;
> >  	}
> >
> > +	if (pmd_cont(pmd))
> > +		split_contpmd(pmdp);
> > +	ret = split_pmd(pmdp, pmd, split_args, false);
> > +
> > +	/*
> > +	 * We have split the pmd directly to ptes so there is no need to
> > +	 * visit each pte to check if they are contpte.
> > +	 */
> > +	walk->action = ACTION_CONTINUE;
> > +
> > +
> >  	return ret;
> >  }
> >
> > @@ -880,13 +947,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
> >  	.pte_entry	= split_to_ptes_pte_entry,
> >  };
> >
> > -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> > +static int range_split_to_ptes(unsigned long start, unsigned long end,
> > +		 	       struct split_args *split_args)
> >  {
> >  	int ret;
> >
> >  	arch_enter_lazy_mmu_mode();
> >  	ret = walk_kernel_page_table_range_lockless(start, end,
> > -					&split_to_ptes_ops, NULL, &gfp);
> > +						    &split_to_ptes_ops, NULL,
> > +						    split_args);
> >  	arch_leave_lazy_mmu_mode();
> >
> >  	return ret;
> > @@ -903,7 +972,7 @@ static void __init init_idmap_kpti_bbml2_flag(void)
> >  	smp_mb();
> >  }
> >
> > -static int __init linear_map_split_to_ptes(void *__unused)
> > +static int __init linear_map_split_to_ptes(void *data)
> >  {
> >  	/*
> >  	 * Repainting the linear map must be done by CPU0 (the boot CPU) because
> > @@ -911,6 +980,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
> >  	 * be held in a waiting area with the idmap active.
> >  	 */
> >  	if (!smp_processor_id()) {
> > +		struct split_args *split_args = data;
> >  		unsigned long lstart = _PAGE_OFFSET(vabits_actual);
> >  		unsigned long lend = PAGE_END;
> >  		unsigned long kstart = (unsigned long)lm_alias(_stext);
> > @@ -928,12 +998,13 @@ static int __init linear_map_split_to_ptes(void *__unused)
> >  		 * PTE. The kernel alias remains static throughout runtime so
> >  		 * can continue to be safely mapped with large mappings.
> >  		 */
> > -		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> > +		ret = range_split_to_ptes(lstart, lend, split_args);
> >  		if (!ret)
> > -			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> > +			ret = range_split_to_ptes(kstart, kend, split_args);
> >  		if (ret)
> >  			panic("Failed to split linear map\n");
> > -		flush_tlb_kernel_range(lstart, lend);
> > +		if (split_args->mode != SPLIT_COLLECT)
> > +			flush_tlb_kernel_range(lstart, lend);
> >
> >  		/*
> >  		 * Relies on dsb in flush_tlb_kernel_range() to avoid reordering
> > @@ -963,10 +1034,61 @@ static int __init linear_map_split_to_ptes(void *__unused)
> >
> >  void __init linear_map_maybe_split_to_ptes(void)
> >  {
> > -	if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
> > -		init_idmap_kpti_bbml2_flag();
> > -		stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> > +	struct page **pages = NULL;
> > +	struct split_args split_args;
> > +	int order;
> > +	int nr_populated;
> > +	int err;
> > +
> > +	if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
> > +		return;
> > +
> > +	/*
> > +	 * Phase 1: Collect required extra pages to split.
> > +	 */
> > +	split_args.mode = SPLIT_COLLECT;
> > +	split_args.nr = 0;
> > +
> > +	init_idmap_kpti_bbml2_flag();
> > +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
>
> This doesn't require stop_machine().

Yes. I'll separate with new walker as you suggest.

>
> > +
> > +	/*
> > +	 * Phase 2: Allocate necessary pages to split.
> > +	 */
> > +	if (split_args.nr == 0) {
> > +		err = 0;
> > +		split_args.mode = SPLIT_ALLOC;
>
> How would we hit this condition? If everything is already split why not just
> exit early?

I'm thinking about if all of pages map with CONTPTE only then
it need to be splited, but would you let me know whether It wouldn't be possible
for cross checking?

>
> > +	} else {
> > +		err = -ENOMEM;
> > +		order = order_base_2(PAGE_ALIGN(split_args.nr *
> > +					sizeof(struct page *)) >> PAGE_SHIFT);
> > +
> > +		pages = (struct page **)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>
> As per the other patch, I'd prefer to see GFP_KERNEL | __GFP_ZERO here.
>
> Let's make this a list of struct ptdesc * though.

Agree. I'll change it.

>
> > +		if (!pages)
> > +			goto error;
> > +
> > +		nr_populated = alloc_pages_bulk(GFP_ATOMIC | __GFP_ZERO, split_args.nr, pages);
>
> Just GFP_KERNEL. No need for __GFP_ZERO as they will all be fully overwritten
> with PTEs.
>
> I'd prefer to use pagetable_alloc(GFP_KERNEL, 0) in a loop since that will
> return a proper ptdesc.

Okay.

>
> > +		if (nr_populated < split_args.nr)
> > +			goto error;
> > +
> > +		err = 0;
> > +		split_args.mode = SPLIT_PREALLOC;
> > +		split_args.i = 0;
> > +		split_args.pages = pages;
> >  	}
> > +
> > +	/*
> > +	 * Phase 3: Split linear map.
> > +	 */
> > +	init_idmap_kpti_bbml2_flag();
> > +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
> > +
> > +error:
> > +	if (pages)
> > +		free_pages((unsigned long)pages, order);
>
> It's possible (but unlikely) that another CPU caused a split between figuring
> out the required number of tables and actually doing the full split. In this
> case, you would have preallocated pages left over that you will need to free here.

You're right. I've overlooked this case.
I'll handle it.


Thanks for your review!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping
  2025-12-18 15:01     ` Yeoreum Yun
@ 2025-12-18 15:42       ` Ryan Roberts
  2025-12-18 15:54         ` Yeoreum Yun
  0 siblings, 1 reply; 19+ messages in thread
From: Ryan Roberts @ 2025-12-18 15:42 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
	dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
	lorenzo.stoakes, ardb, jackmanb, vbabka, linux-arm-kernel,
	linux-kernel, linux-rt-devel

On 18/12/2025 15:01, Yeoreum Yun wrote:
> Hi Ryan,
> 
>> On 17/12/2025 18:20, Yeoreum Yun wrote:
>>> Current linear_map_split_to_ptes() allocate the pagetable
>>> while split linear mapping to ptes by stop_machine() with GFP_ATOMIC.
>>>
>>> This is fine for non-PREEMPR_RT case.
>>> However It's a problem in PREEMPR_RT case since
>>> generic memory allocation/free APIs (e.x) pgtable_alloc(), __get_free_pages and etc
>>> couldn't be called in non-preemptible context except _nolock() APIs
>>> since generic memory allocation/free APIs are *sleepable* for using *spin_lock()*
>>>
>>> IOW, calling a pgtable_alloc() even with GFP_ATOMIC
>>> doesn't allow in __linear_map_split_to_pte() executed by stopper thread
>>> where preemption is disabled in PREEMPR_RT.
>>>
>>> To address this, divide linear_map_maybe_split_ptes:
>>>   - collect number of pages to require for spliting.
>>>   - allocate the required number of pages for spliting.
>>>   - with pre-allocate page, split the linear map.
>>
>> Thanks for working on this fix!
>>
>> First some high level comments: I'm not a huge fan of the approach with the
>> different modes to modify behaviour.
>>
>> For the first step of figuring out the number of pages required; there is no
>> need to be inside stop_machine() for that. Can should be able to just walk the
>> linear map without any locking since it is created once and remains static, with
>> the exception of hotplug, but that is not enabled yet.
>>
>> I think it would be cleaner to just create separate walker callbacks rather than
>> repurpose the existing walker.
> 
> Okay. I'll repsin with separate walker.
> 
>>
>> Then you could simply change the gfp flag to a callback function pointer and
>> pass the desired allocator as a function pointer. This would match the existing
>> patterns used in mmu.c today.
>>
>> The preallocated page list could be stored in a global static variable and you
>> could create an allocation function that just strims a page from that list. See
>> kpti_ng_pgd_alloc() for an existing example. That assumes a contiguous block,
>> but you could generalize it to hold a list of struct pages and reuse it for both
>> cases?
> 
> If we use walker callbacks, I think it not only pass the callback
> function, but it need to pass the gfp flag too since split_xxx() is
> called by another functions. (i.e) set_memory_xxx()
> 
> I think I can remove the mode with the suggestion of @David
> but, I think it would be still use split_args with information of
> gfp and preallocate buffer.

I don't think that's necessary; all other users pass GFP_PGTABLE_KERNEL, so we
just need one allocator function that allocates with GFP_PGTABLE_KERNEL and
another allocator function that allocates from the prealloc'ed list. Then the
user passes the function they want to use.

That would work, I think?

>>
>>>
>>> Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
>>> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
>>> ---
>>>  arch/arm64/mm/mmu.c | 213 +++++++++++++++++++++++++++++++++++---------
>>>  1 file changed, 170 insertions(+), 43 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>> index 9ae7ce00a7ef..e4e6c7e0a016 100644
>>> --- a/arch/arm64/mm/mmu.c
>>> +++ b/arch/arm64/mm/mmu.c
>>> @@ -527,18 +527,28 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>>>  		panic("Failed to create page tables\n");
>>>  }
>>>
>>> -static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>>> -				       enum pgtable_type pgtable_type)
>>> -{
>>> -	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>>> -	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
>>> -	phys_addr_t pa;
>>> -
>>> -	if (!ptdesc)
>>> -		return INVALID_PHYS_ADDR;
>>> +enum split_modes {
>>> +	SPLIT_ALLOC,
>>> +	SPLIT_PREALLOC,
>>> +	SPLIT_COLLECT,
>>> +};
>>>
>>> -	pa = page_to_phys(ptdesc_page(ptdesc));
>>> +struct split_args {
>>> +	enum split_modes mode;
>>> +	union {
>>> +		struct {
>>> +			unsigned long nr;
>>> +			unsigned long i;
>>> +			struct page **pages;
>>> +		};
>>> +		gfp_t gfp;
>>> +	};
>>> +};
>>>
>>> +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
>>> +					       struct ptdesc *ptdesc,
>>> +					       enum pgtable_type pgtable_type)
>>> +{
>>>  	switch (pgtable_type) {
>>>  	case TABLE_PTE:
>>>  		BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
>>> @@ -554,19 +564,56 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>>>  		break;
>>>  	}
>>>
>>> +}
>>> +
>>> +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>>> +				       enum pgtable_type pgtable_type)
>>> +{
>>> +	/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>>> +	struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
>>> +	phys_addr_t pa;
>>> +
>>> +	if (!ptdesc)
>>> +		return INVALID_PHYS_ADDR;
>>> +
>>> +	pa = page_to_phys(ptdesc_page(ptdesc));
>>> +
>>> +	__pgd_pgtable_init(mm, ptdesc, pgtable_type);
>>> +
>>>  	return pa;
>>>  }
>>>
>>> +static phys_addr_t __pgd_pgtable_prealloc(struct mm_struct *mm,
>>> +					  struct split_args *split_args,
>>> +					  enum pgtable_type pgtable_type)
>>> +{
>>> +	struct page *page;
>>> +
>>> +	BUG_ON(split_args->i >= split_args->nr);
>>> +
>>> +	page = split_args->pages[split_args->i++];
>>> +	if (!page)
>>> +		return INVALID_PHYS_ADDR;
>>> +
>>> +	__pgd_pgtable_init(mm, page_ptdesc(page), pgtable_type);
>>> +
>>> +	return page_to_phys(page);
>>> +}
>>> +
>>>  static phys_addr_t
>>> -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
>>> +pgd_pgtable_alloc_split(enum pgtable_type pgtable_type,
>>> +		        struct split_args *split_args)
>>>  {
>>> -	return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
>>> +	if (split_args->mode == SPLIT_ALLOC)
>>> +		return __pgd_pgtable_alloc(&init_mm, split_args->gfp, pgtable_type);
>>> +
>>> +	return __pgd_pgtable_prealloc(&init_mm, split_args, pgtable_type);
>>>  }
>>>
>>>  static phys_addr_t __maybe_unused
>>>  pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
>>>  {
>>> -	return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
>>> +	return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
>>>  }
>>>
>>>  static phys_addr_t
>>> @@ -584,7 +631,9 @@ static void split_contpte(pte_t *ptep)
>>>  		__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
>>>  }
>>>
>>> -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>>> +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
>>> +		     struct split_args *split_args,
>>> +		     bool to_cont)
>>>  {
>>>  	pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
>>>  	unsigned long pfn = pmd_pfn(pmd);
>>> @@ -593,7 +642,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>>>  	pte_t *ptep;
>>>  	int i;
>>>
>>> -	pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
>>> +	pte_phys = pgd_pgtable_alloc_split(TABLE_PTE, split_args);
>>>  	if (pte_phys == INVALID_PHYS_ADDR)
>>>  		return -ENOMEM;
>>>  	ptep = (pte_t *)phys_to_virt(pte_phys);
>>> @@ -628,7 +677,9 @@ static void split_contpmd(pmd_t *pmdp)
>>>  		set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
>>>  }
>>>
>>> -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>>> +static int split_pud(pud_t *pudp, pud_t pud,
>>> +		    struct split_args *split_args,
>>> +		    bool to_cont)
>>>  {
>>>  	pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
>>>  	unsigned int step = PMD_SIZE >> PAGE_SHIFT;
>>> @@ -638,7 +689,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>>>  	pmd_t *pmdp;
>>>  	int i;
>>>
>>> -	pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
>>> +	pmd_phys = pgd_pgtable_alloc_split(TABLE_PMD, split_args);
>>>  	if (pmd_phys == INVALID_PHYS_ADDR)
>>>  		return -ENOMEM;
>>>  	pmdp = (pmd_t *)phys_to_virt(pmd_phys);
>>> @@ -672,6 +723,10 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>>>  	pmd_t *pmdp, pmd;
>>>  	pte_t *ptep, pte;
>>>  	int ret = 0;
>>> +	struct split_args split_args = {
>>> +		.mode = SPLIT_ALLOC,
>>> +		.gfp = GFP_PGTABLE_KERNEL,
>>> +	};
>>>
>>>  	/*
>>>  	 * PGD: If addr is PGD aligned then addr already describes a leaf
>>> @@ -707,7 +762,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>>>  	if (!pud_present(pud))
>>>  		goto out;
>>>  	if (pud_leaf(pud)) {
>>> -		ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
>>> +		ret = split_pud(pudp, pud, &split_args, true);
>>>  		if (ret)
>>>  			goto out;
>>>  	}
>>> @@ -732,7 +787,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>>>  		 */
>>>  		if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
>>>  			goto out;
>>> -		ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
>>> +		ret = split_pmd(pmdp, pmd, &split_args, true);
>>>  		if (ret)
>>>  			goto out;
>>>  	}
>>> @@ -831,12 +886,17 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>>>  static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>>>  				   unsigned long next, struct mm_walk *walk)
>>>  {
>>> -	gfp_t gfp = *(gfp_t *)walk->private;
>>> +	struct split_args *split_args = (struct split_args *)walk->private;
>>
>> no need for the casting here; it's a void *.
> 
> Right. I'll remove it.
> 
>>
>>>  	pud_t pud = pudp_get(pudp);
>>>  	int ret = 0;
>>>
>>> -	if (pud_leaf(pud))
>>> -		ret = split_pud(pudp, pud, gfp, false);
>>> +	if (!pud_leaf(pud))
>>> +		return 0;
>>> +
>>> +	if (split_args->mode == SPLIT_COLLECT)
>>> +		split_args->nr++;
>>
>> bug: you're incrementing this for the pmd table that you would have inserted,
>> but because you're not actually splitting, split_to_ptes_pmd_entry() will never
>> be called for the 512 pte tables you also need. So you need:
>>
>> 		nr += 1 + PTRS_PER_PMD;
>>
>> I'm guessing you didn't test this as the BUG would have triggered when you ran
>> out of preallocated memory?
>>
> 
> Good catch! Yes. This is definitely bug.
> The reason why I didn't hit the but there is no "block pud" in my enviroment.
> 
>>> +	else
>>> +		ret = split_pud(pudp, pud, split_args, false);
>>>
>>>  	return ret;
>>>  }
>>> @@ -844,22 +904,29 @@ static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>>>  static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>>>  				   unsigned long next, struct mm_walk *walk)
>>>  {
>>> -	gfp_t gfp = *(gfp_t *)walk->private;
>>> +	struct split_args *split_args = (struct split_args *)walk->private;
>>>  	pmd_t pmd = pmdp_get(pmdp);
>>> -	int ret = 0;
>>> +	int ret;
>>>
>>> -	if (pmd_leaf(pmd)) {
>>> -		if (pmd_cont(pmd))
>>> -			split_contpmd(pmdp);
>>> -		ret = split_pmd(pmdp, pmd, gfp, false);
>>> +	if (!pmd_leaf(pmd))
>>> +		return 0;
>>>
>>> -		/*
>>> -		 * We have split the pmd directly to ptes so there is no need to
>>> -		 * visit each pte to check if they are contpte.
>>> -		 */
>>> -		walk->action = ACTION_CONTINUE;
>>> +	if (split_args->mode == SPLIT_COLLECT) {
>>> +		split_args->nr++;
>>> +		return 0;
>>>  	}
>>>
>>> +	if (pmd_cont(pmd))
>>> +		split_contpmd(pmdp);
>>> +	ret = split_pmd(pmdp, pmd, split_args, false);
>>> +
>>> +	/*
>>> +	 * We have split the pmd directly to ptes so there is no need to
>>> +	 * visit each pte to check if they are contpte.
>>> +	 */
>>> +	walk->action = ACTION_CONTINUE;
>>> +
>>> +
>>>  	return ret;
>>>  }
>>>
>>> @@ -880,13 +947,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
>>>  	.pte_entry	= split_to_ptes_pte_entry,
>>>  };
>>>
>>> -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
>>> +static int range_split_to_ptes(unsigned long start, unsigned long end,
>>> +		 	       struct split_args *split_args)
>>>  {
>>>  	int ret;
>>>
>>>  	arch_enter_lazy_mmu_mode();
>>>  	ret = walk_kernel_page_table_range_lockless(start, end,
>>> -					&split_to_ptes_ops, NULL, &gfp);
>>> +						    &split_to_ptes_ops, NULL,
>>> +						    split_args);
>>>  	arch_leave_lazy_mmu_mode();
>>>
>>>  	return ret;
>>> @@ -903,7 +972,7 @@ static void __init init_idmap_kpti_bbml2_flag(void)
>>>  	smp_mb();
>>>  }
>>>
>>> -static int __init linear_map_split_to_ptes(void *__unused)
>>> +static int __init linear_map_split_to_ptes(void *data)
>>>  {
>>>  	/*
>>>  	 * Repainting the linear map must be done by CPU0 (the boot CPU) because
>>> @@ -911,6 +980,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
>>>  	 * be held in a waiting area with the idmap active.
>>>  	 */
>>>  	if (!smp_processor_id()) {
>>> +		struct split_args *split_args = data;
>>>  		unsigned long lstart = _PAGE_OFFSET(vabits_actual);
>>>  		unsigned long lend = PAGE_END;
>>>  		unsigned long kstart = (unsigned long)lm_alias(_stext);
>>> @@ -928,12 +998,13 @@ static int __init linear_map_split_to_ptes(void *__unused)
>>>  		 * PTE. The kernel alias remains static throughout runtime so
>>>  		 * can continue to be safely mapped with large mappings.
>>>  		 */
>>> -		ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
>>> +		ret = range_split_to_ptes(lstart, lend, split_args);
>>>  		if (!ret)
>>> -			ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
>>> +			ret = range_split_to_ptes(kstart, kend, split_args);
>>>  		if (ret)
>>>  			panic("Failed to split linear map\n");
>>> -		flush_tlb_kernel_range(lstart, lend);
>>> +		if (split_args->mode != SPLIT_COLLECT)
>>> +			flush_tlb_kernel_range(lstart, lend);
>>>
>>>  		/*
>>>  		 * Relies on dsb in flush_tlb_kernel_range() to avoid reordering
>>> @@ -963,10 +1034,61 @@ static int __init linear_map_split_to_ptes(void *__unused)
>>>
>>>  void __init linear_map_maybe_split_to_ptes(void)
>>>  {
>>> -	if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
>>> -		init_idmap_kpti_bbml2_flag();
>>> -		stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
>>> +	struct page **pages = NULL;
>>> +	struct split_args split_args;
>>> +	int order;
>>> +	int nr_populated;
>>> +	int err;
>>> +
>>> +	if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
>>> +		return;
>>> +
>>> +	/*
>>> +	 * Phase 1: Collect required extra pages to split.
>>> +	 */
>>> +	split_args.mode = SPLIT_COLLECT;
>>> +	split_args.nr = 0;
>>> +
>>> +	init_idmap_kpti_bbml2_flag();
>>> +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
>>
>> This doesn't require stop_machine().
> 
> Yes. I'll separate with new walker as you suggest.
> 
>>
>>> +
>>> +	/*
>>> +	 * Phase 2: Allocate necessary pages to split.
>>> +	 */
>>> +	if (split_args.nr == 0) {
>>> +		err = 0;
>>> +		split_args.mode = SPLIT_ALLOC;
>>
>> How would we hit this condition? If everything is already split why not just
>> exit early?
> 
> I'm thinking about if all of pages map with CONTPTE only then
> it need to be splited, but would you let me know whether It wouldn't be possible
> for cross checking?

Ahh, yes good point!

> 
>>
>>> +	} else {
>>> +		err = -ENOMEM;
>>> +		order = order_base_2(PAGE_ALIGN(split_args.nr *
>>> +					sizeof(struct page *)) >> PAGE_SHIFT);
>>> +
>>> +		pages = (struct page **)__get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
>>
>> As per the other patch, I'd prefer to see GFP_KERNEL | __GFP_ZERO here.
>>
>> Let's make this a list of struct ptdesc * though.
> 
> Agree. I'll change it.
> 
>>
>>> +		if (!pages)
>>> +			goto error;
>>> +
>>> +		nr_populated = alloc_pages_bulk(GFP_ATOMIC | __GFP_ZERO, split_args.nr, pages);
>>
>> Just GFP_KERNEL. No need for __GFP_ZERO as they will all be fully overwritten
>> with PTEs.
>>
>> I'd prefer to use pagetable_alloc(GFP_KERNEL, 0) in a loop since that will
>> return a proper ptdesc.
> 
> Okay.
> 
>>
>>> +		if (nr_populated < split_args.nr)
>>> +			goto error;
>>> +
>>> +		err = 0;
>>> +		split_args.mode = SPLIT_PREALLOC;
>>> +		split_args.i = 0;
>>> +		split_args.pages = pages;
>>>  	}
>>> +
>>> +	/*
>>> +	 * Phase 3: Split linear map.
>>> +	 */
>>> +	init_idmap_kpti_bbml2_flag();
>>> +	stop_machine(linear_map_split_to_ptes, &split_args, cpu_online_mask);
>>> +
>>> +error:
>>> +	if (pages)
>>> +		free_pages((unsigned long)pages, order);
>>
>> It's possible (but unlikely) that another CPU caused a split between figuring
>> out the required number of tables and actually doing the full split. In this
>> case, you would have preallocated pages left over that you will need to free here.
> 
> You're right. I've overlooked this case.
> I'll handle it.
> 
> 
> Thanks for your review!
> 
> --
> Sincerely,
> Yeoreum Yun



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping
  2025-12-18 15:42       ` Ryan Roberts
@ 2025-12-18 15:54         ` Yeoreum Yun
  0 siblings, 0 replies; 19+ messages in thread
From: Yeoreum Yun @ 2025-12-18 15:54 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
	dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
	lorenzo.stoakes, ardb, jackmanb, vbabka, linux-arm-kernel,
	linux-kernel, linux-rt-devel

> On 18/12/2025 15:01, Yeoreum Yun wrote:
> > Hi Ryan,
> >
> >> On 17/12/2025 18:20, Yeoreum Yun wrote:
> >>> Current linear_map_split_to_ptes() allocate the pagetable
> >>> while split linear mapping to ptes by stop_machine() with GFP_ATOMIC.
> >>>
> >>> This is fine for non-PREEMPR_RT case.
> >>> However It's a problem in PREEMPR_RT case since
> >>> generic memory allocation/free APIs (e.x) pgtable_alloc(), __get_free_pages and etc
> >>> couldn't be called in non-preemptible context except _nolock() APIs
> >>> since generic memory allocation/free APIs are *sleepable* for using *spin_lock()*
> >>>
> >>> IOW, calling a pgtable_alloc() even with GFP_ATOMIC
> >>> doesn't allow in __linear_map_split_to_pte() executed by stopper thread
> >>> where preemption is disabled in PREEMPR_RT.
> >>>
> >>> To address this, divide linear_map_maybe_split_ptes:
> >>>   - collect number of pages to require for spliting.
> >>>   - allocate the required number of pages for spliting.
> >>>   - with pre-allocate page, split the linear map.
> >>
> >> Thanks for working on this fix!
> >>
> >> First some high level comments: I'm not a huge fan of the approach with the
> >> different modes to modify behaviour.
> >>
> >> For the first step of figuring out the number of pages required; there is no
> >> need to be inside stop_machine() for that. Can should be able to just walk the
> >> linear map without any locking since it is created once and remains static, with
> >> the exception of hotplug, but that is not enabled yet.
> >>
> >> I think it would be cleaner to just create separate walker callbacks rather than
> >> repurpose the existing walker.
> >
> > Okay. I'll repsin with separate walker.
> >
> >>
> >> Then you could simply change the gfp flag to a callback function pointer and
> >> pass the desired allocator as a function pointer. This would match the existing
> >> patterns used in mmu.c today.
> >>
> >> The preallocated page list could be stored in a global static variable and you
> >> could create an allocation function that just strims a page from that list. See
> >> kpti_ng_pgd_alloc() for an existing example. That assumes a contiguous block,
> >> but you could generalize it to hold a list of struct pages and reuse it for both
> >> cases?
> >
> > If we use walker callbacks, I think it not only pass the callback
> > function, but it need to pass the gfp flag too since split_xxx() is
> > called by another functions. (i.e) set_memory_xxx()
> >
> > I think I can remove the mode with the suggestion of @David
> > but, I think it would be still use split_args with information of
> > gfp and preallocate buffer.
>
> I don't think that's necessary; all other users pass GFP_PGTABLE_KERNEL, so we
> just need one allocator function that allocates with GFP_PGTABLE_KERNEL and
> another allocator function that allocates from the prealloc'ed list. Then the
> user passes the function they want to use.
>
> That would work, I think?

True. I forgot I removed GFP_ATOMIC myself for pgtable alloc.
Okay. then I'll move the preallocated buffer information to global
and just pass function pointer then.

Thanks


--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-12-18 15:55 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-17 18:20 [PATCH v2 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
2025-12-17 18:20 ` [PATCH v2 1/2] arm64: mmu: don't allocate page while spliting linear mapping Yeoreum Yun
2025-12-18  8:23   ` David Hildenbrand (Red Hat)
2025-12-18  9:08     ` Yeoreum Yun
2025-12-18 14:22   ` Ryan Roberts
2025-12-18 15:01     ` Yeoreum Yun
2025-12-18 15:42       ` Ryan Roberts
2025-12-18 15:54         ` Yeoreum Yun
2025-12-17 18:20 ` [PATCH v2 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
2025-12-17 19:03   ` Ryan Roberts
2025-12-17 19:09     ` Yeoreum Yun
2025-12-18  8:34       ` Ryan Roberts
2025-12-18  8:37         ` Yeoreum Yun
2025-12-18  8:46           ` David Hildenbrand (Red Hat)
2025-12-18  9:31             ` Yeoreum Yun
2025-12-18  9:41               ` David Hildenbrand (Red Hat)
2025-12-18 10:07                 ` Yeoreum Yun
2025-12-18  7:51     ` Vlastimil Babka
2025-12-18  8:32       ` Ryan Roberts

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).