* [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64
@ 2025-12-18 19:47 Yeoreum Yun
2025-12-18 19:47 ` [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping Yeoreum Yun
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Yeoreum Yun @ 2025-12-18 19:47 UTC (permalink / raw)
To: catalin.marinas, will, ryan.roberts, akpm, david, kevin.brodsky,
quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel, Yeoreum Yun
Under PREEMPT_RT, calling generic memory allocation/free APIs
(e.g. __get_free_pages(), pgtable_alloc(), free_pages(), etc.)
with preemption disabled is not allowed; only the *_nolock() variants may be used.
This is because these APIs may acquire a spin lock, which becomes a sleeping lock on RT,
potentially causing a sleep during page allocation
(see Documentation/core-api/real-time/differences.rst, "Memory allocation" section).
However, on arm64, linear_map_split_to_ptes() and
__kpti_install_ng_mappings(), which are called by the stopper thread via
stop_machine(), use generic memory allocation/free APIs.
This patchset fixes this problem and is based on v6.19-rc1.
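As a minimal illustration of the pattern this series moves to (a sketch only;
demo() and demo_stopper_cb() are made-up names, not code from the patches):
allocate in a preemptible context, and let the stop_machine() callback only
consume what it was handed.

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/stop_machine.h>

/* Illustrative sketch only; not part of the patches. */
static int demo_stopper_cb(void *data)
{
        unsigned long page = *(unsigned long *)data;

        /*
         * Preemption is disabled here, so no __get_free_pages(), even
         * with GFP_ATOMIC, on PREEMPT_RT; only use what was preallocated
         * and passed in via @data.
         */
        return page ? 0 : -ENOMEM;
}

static void demo(void)
{
        /* Preemptible context: the allocator may take sleeping locks. */
        unsigned long page = __get_free_pages(GFP_KERNEL | __GFP_ZERO, 0);

        if (!page)
                return;

        stop_machine(demo_stopper_cb, &page, cpu_online_mask);
        free_pages(page, 0);
}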
Patch History
==============
from v2 to v3:
- remove split-mode and split_args;
  pass the proper function pointer while splitting.
- rename functions.
- https://lore.kernel.org/all/20251217182007.2345700-1-yeoreum.yun@arm.com/
from v1 to v2:
- drop pagetable_alloc_nolock()
- following Ryan Roberts' suggestion.
- https://lore.kernel.org/all/20251212161832.2067134-1-yeoreum.yun@arm.com/
Yeoreum Yun (2):
arm64: mmu: avoid allocating pages while splitting the linear mapping
arm64: mmu: avoid allocating pages while installing ng-mapping for
KPTI
arch/arm64/mm/mmu.c | 254 ++++++++++++++++++++++++++++++++++----------
1 file changed, 197 insertions(+), 57 deletions(-)
--
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping
2025-12-18 19:47 [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
@ 2025-12-18 19:47 ` Yeoreum Yun
2026-01-02 11:03 ` Ryan Roberts
2025-12-18 19:47 ` [PATCH v3 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
2025-12-31 10:07 ` [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
2 siblings, 1 reply; 11+ messages in thread
From: Yeoreum Yun @ 2025-12-18 19:47 UTC (permalink / raw)
To: catalin.marinas, will, ryan.roberts, akpm, david, kevin.brodsky,
quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel, Yeoreum Yun
linear_map_split_to_ptes() currently allocates page tables while
splitting the linear mapping into PTEs under stop_machine() using GFP_ATOMIC.
This is fine for non-PREEMPT_RT configurations.
However, it becomes problematic on PREEMPT_RT, because
generic memory allocation/free APIs (e.g. pgtable_alloc(), __get_free_pages(), etc.)
cannot be called from a non-preemptible context, except for the _nolock() variants.
This is because the generic allocation/free paths may sleep:
they rely on spin_lock(), which becomes a sleeping lock on PREEMPT_RT.
In other words, even calling pgtable_alloc() with GFP_ATOMIC is not permitted
in linear_map_split_to_ptes() when it is executed by the stopper thread,
where preemption is disabled on PREEMPT_RT.
To address this, the required number of page tables is first counted
and preallocated, and the preallocated page tables are then consumed
when splitting the linear mapping in linear_map_split_to_ptes().
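In outline, the flow added by this patch is (a simplified sketch of the code
in the diff below; error handling omitted):

        /*
         * 1. Preemptible context: count the block mappings to split and
         *    preallocate that many page tables into split_pgtables[].
         */
        linear_map_prealloc_split_pgtables();

        /*
         * 2. Stopper thread, preemption disabled: no allocation, only
         *    consume split_pgtables[split_pgtables_idx++] through
         *    pgd_pgtable_get_preallocated().
         */
        stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);

        /* 3. Preemptible again: free any preallocated tables left unused. */
        linear_map_free_split_pgtables();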
Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
arch/arm64/mm/mmu.c | 232 +++++++++++++++++++++++++++++++++++---------
1 file changed, 184 insertions(+), 48 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9ae7ce00a7ef..96a9fa505e71 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -527,18 +527,15 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
panic("Failed to create page tables\n");
}
-static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
- enum pgtable_type pgtable_type)
-{
- /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
- struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
- phys_addr_t pa;
-
- if (!ptdesc)
- return INVALID_PHYS_ADDR;
-
- pa = page_to_phys(ptdesc_page(ptdesc));
+static struct ptdesc **split_pgtables;
+static int split_pgtables_order;
+static unsigned long split_pgtables_count;
+static unsigned long split_pgtables_idx;
+static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
+ struct ptdesc *ptdesc,
+ enum pgtable_type pgtable_type)
+{
switch (pgtable_type) {
case TABLE_PTE:
BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
@@ -554,19 +551,28 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
break;
}
- return pa;
}
-static phys_addr_t
-pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
+static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
+ enum pgtable_type pgtable_type)
{
- return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
+ /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
+ struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
+
+ if (!ptdesc)
+ return INVALID_PHYS_ADDR;
+
+ __pgd_pgtable_init(mm, ptdesc, pgtable_type);
+
+ return page_to_phys(ptdesc_page(ptdesc));
}
+typedef phys_addr_t (split_pgtable_alloc_fn)(enum pgtable_type);
+
static phys_addr_t __maybe_unused
pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
{
- return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
+ return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
}
static phys_addr_t
@@ -575,6 +581,23 @@ pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type)
return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_type);
}
+static phys_addr_t
+pgd_pgtable_get_preallocated(enum pgtable_type pgtable_type)
+{
+ struct ptdesc *ptdesc;
+
+ if (WARN_ON(split_pgtables_idx >= split_pgtables_count))
+ return INVALID_PHYS_ADDR;
+
+ ptdesc = split_pgtables[split_pgtables_idx++];
+ if (!ptdesc)
+ return INVALID_PHYS_ADDR;
+
+ __pgd_pgtable_init(&init_mm, ptdesc, pgtable_type);
+
+ return page_to_phys(ptdesc_page(ptdesc));
+}
+
static void split_contpte(pte_t *ptep)
{
int i;
@@ -584,7 +607,9 @@ static void split_contpte(pte_t *ptep)
__set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
}
-static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
+static int split_pmd(pmd_t *pmdp, pmd_t pmd,
+ split_pgtable_alloc_fn *pgtable_alloc_fn,
+ bool to_cont)
{
pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
unsigned long pfn = pmd_pfn(pmd);
@@ -593,7 +618,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
pte_t *ptep;
int i;
- pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
+ pte_phys = pgtable_alloc_fn(TABLE_PTE);
if (pte_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
ptep = (pte_t *)phys_to_virt(pte_phys);
@@ -628,7 +653,9 @@ static void split_contpmd(pmd_t *pmdp)
set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
}
-static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
+static int split_pud(pud_t *pudp, pud_t pud,
+ split_pgtable_alloc_fn *pgtable_alloc_fn,
+ bool to_cont)
{
pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
unsigned int step = PMD_SIZE >> PAGE_SHIFT;
@@ -638,7 +665,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
pmd_t *pmdp;
int i;
- pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
+ pmd_phys = pgtable_alloc_fn(TABLE_PMD);
if (pmd_phys == INVALID_PHYS_ADDR)
return -ENOMEM;
pmdp = (pmd_t *)phys_to_virt(pmd_phys);
@@ -707,7 +734,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
if (!pud_present(pud))
goto out;
if (pud_leaf(pud)) {
- ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
+ ret = split_pud(pudp, pud, pgd_pgtable_alloc_init_mm, true);
if (ret)
goto out;
}
@@ -732,7 +759,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
*/
if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
goto out;
- ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
+ ret = split_pmd(pmdp, pmd, pgd_pgtable_alloc_init_mm, true);
if (ret)
goto out;
}
@@ -831,34 +858,35 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
- gfp_t gfp = *(gfp_t *)walk->private;
+ split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
pud_t pud = pudp_get(pudp);
- int ret = 0;
- if (pud_leaf(pud))
- ret = split_pud(pudp, pud, gfp, false);
+ if (!pud_leaf(pud))
+ return 0;
- return ret;
+ return split_pud(pudp, pud, pgtable_alloc_fn, false);
}
static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
- gfp_t gfp = *(gfp_t *)walk->private;
+ split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
pmd_t pmd = pmdp_get(pmdp);
- int ret = 0;
+ int ret;
- if (pmd_leaf(pmd)) {
- if (pmd_cont(pmd))
- split_contpmd(pmdp);
- ret = split_pmd(pmdp, pmd, gfp, false);
+ if (!pmd_leaf(pmd))
+ return 0;
- /*
- * We have split the pmd directly to ptes so there is no need to
- * visit each pte to check if they are contpte.
- */
- walk->action = ACTION_CONTINUE;
- }
+ if (pmd_cont(pmd))
+ split_contpmd(pmdp);
+
+ ret = split_pmd(pmdp, pmd, pgtable_alloc_fn, false);
+
+ /*
+ * We have split the pmd directly to ptes so there is no need to
+ * visit each pte to check if they are contpte.
+ */
+ walk->action = ACTION_CONTINUE;
return ret;
}
@@ -880,13 +908,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
.pte_entry = split_to_ptes_pte_entry,
};
-static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
+static int range_split_to_ptes(unsigned long start, unsigned long end,
+ split_pgtable_alloc_fn *pgtable_alloc_fn)
{
int ret;
arch_enter_lazy_mmu_mode();
ret = walk_kernel_page_table_range_lockless(start, end,
- &split_to_ptes_ops, NULL, &gfp);
+ &split_to_ptes_ops, NULL,
+ pgtable_alloc_fn);
arch_leave_lazy_mmu_mode();
return ret;
@@ -903,6 +933,105 @@ static void __init init_idmap_kpti_bbml2_flag(void)
smp_mb();
}
+static int __init
+collect_to_split_pud_entry(pud_t *pudp, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pud_t pud = pudp_get(pudp);
+
+ if (pud_leaf(pud))
+ split_pgtables_count += 1 + PTRS_PER_PMD;
+
+ return 0;
+}
+
+static int __init
+collect_to_split_pmd_entry(pmd_t *pmdp, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pmd_t pmd = pmdp_get(pmdp);
+
+ if (pmd_leaf(pmd))
+ split_pgtables_count++;
+
+ walk->action = ACTION_CONTINUE;
+
+ return 0;
+}
+
+static void __init linear_map_free_split_pgtables(void)
+{
+ int i;
+
+ if (!split_pgtables_count || !split_pgtables)
+ goto skip_free;
+
+ for (i = split_pgtables_idx; i < split_pgtables_count; i++) {
+ if (split_pgtables[i])
+ pagetable_free(split_pgtables[i]);
+ }
+
+ free_pages((unsigned long)split_pgtables, split_pgtables_order);
+
+skip_free:
+ split_pgtables = NULL;
+ split_pgtables_count = 0;
+ split_pgtables_idx = 0;
+ split_pgtables_order = 0;
+}
+
+static int __init linear_map_prealloc_split_pgtables(void)
+{
+ int ret, i;
+ unsigned long lstart = _PAGE_OFFSET(vabits_actual);
+ unsigned long lend = PAGE_END;
+ unsigned long kstart = (unsigned long)lm_alias(_stext);
+ unsigned long kend = (unsigned long)lm_alias(__init_begin);
+
+ const struct mm_walk_ops collect_to_split_ops = {
+ .pud_entry = collect_to_split_pud_entry,
+ .pmd_entry = collect_to_split_pmd_entry
+ };
+
+ split_pgtables_idx = 0;
+ split_pgtables_count = 0;
+
+ ret = walk_kernel_page_table_range_lockless(lstart, kstart,
+ &collect_to_split_ops,
+ NULL, NULL);
+ if (!ret)
+ ret = walk_kernel_page_table_range_lockless(kend, lend,
+ &collect_to_split_ops,
+ NULL, NULL);
+ if (ret || !split_pgtables_count)
+ goto error;
+
+ ret = -ENOMEM;
+
+ split_pgtables_order =
+ order_base_2(PAGE_ALIGN(split_pgtables_count *
+ sizeof(struct ptdesc *)) >> PAGE_SHIFT);
+
+ split_pgtables = (struct ptdesc **) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ split_pgtables_order);
+ if (!split_pgtables)
+ goto error;
+
+ for (i = 0; i < split_pgtables_count; i++) {
+ split_pgtables[i] = pagetable_alloc(GFP_KERNEL, 0);
+ if (!split_pgtables[i])
+ goto error;
+ }
+
+ ret = 0;
+
+error:
+ if (ret)
+ linear_map_free_split_pgtables();
+
+ return ret;
+}
+
static int __init linear_map_split_to_ptes(void *__unused)
{
/*
@@ -928,9 +1057,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
* PTE. The kernel alias remains static throughout runtime so
* can continue to be safely mapped with large mappings.
*/
- ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
+ ret = range_split_to_ptes(lstart, kstart, pgd_pgtable_get_preallocated);
if (!ret)
- ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
+ ret = range_split_to_ptes(kend, lend, pgd_pgtable_get_preallocated);
if (ret)
panic("Failed to split linear map\n");
flush_tlb_kernel_range(lstart, lend);
@@ -963,10 +1092,16 @@ static int __init linear_map_split_to_ptes(void *__unused)
void __init linear_map_maybe_split_to_ptes(void)
{
- if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
- init_idmap_kpti_bbml2_flag();
- stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
- }
+ if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
+ return;
+
+ if (linear_map_prealloc_split_pgtables())
+ panic("Failed to split linear map\n");
+
+ init_idmap_kpti_bbml2_flag();
+ stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
+
+ linear_map_free_split_pgtables();
}
/*
@@ -1088,6 +1223,7 @@ bool arch_kfence_init_pool(void)
unsigned long end = start + KFENCE_POOL_SIZE;
int ret;
+
/* Exit early if we know the linear map is already pte-mapped. */
if (!split_leaf_mapping_possible())
return true;
@@ -1097,7 +1233,7 @@ bool arch_kfence_init_pool(void)
return true;
mutex_lock(&pgtable_split_lock);
- ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
+ ret = range_split_to_ptes(start, end, pgd_pgtable_alloc_init_mm);
mutex_unlock(&pgtable_split_lock);
/*
--
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v3 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
2025-12-18 19:47 [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
2025-12-18 19:47 ` [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping Yeoreum Yun
@ 2025-12-18 19:47 ` Yeoreum Yun
2026-01-02 11:10 ` Ryan Roberts
2025-12-31 10:07 ` [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
2 siblings, 1 reply; 11+ messages in thread
From: Yeoreum Yun @ 2025-12-18 19:47 UTC (permalink / raw)
To: catalin.marinas, will, ryan.roberts, akpm, david, kevin.brodsky,
quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel, Yeoreum Yun
The current __kpti_install_ng_mappings() allocates a temporary PGD
while installing the NG mapping for KPTI under stop_machine(),
using GFP_ATOMIC.
This is fine in the non-PREEMPT_RT case. However, it becomes a problem
under PREEMPT_RT because generic memory allocation/free APIs
(e.g., pgtable_alloc(), __get_free_pages(), etc.) cannot be invoked
in a non-preemptible context, except for the *_nolock() variants.
These generic allocators may sleep because they rely on spin_lock(),
which becomes a sleeping lock on PREEMPT_RT.
In other words, calling __get_free_pages(), even with GFP_ATOMIC,
is not allowed in __kpti_install_ng_mappings(), which is executed by
the stopper thread where preemption is disabled under PREEMPT_RT.
To address this, preallocate the pages needed for the temporary page tables
(including the temporary PGD) before invoking __kpti_install_ng_mappings()
via stop_machine(), and hand the allocation to it via the stop_machine()
data argument.
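In outline (a simplified sketch of the diff below; error handling omitted):

        int order = order_base_2(CONFIG_PGTABLE_LEVELS);
        u64 alloc;

        /* Preemptible context: preallocate the temporary tables. */
        alloc = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);

        /* The stopper thread only consumes the allocation passed via data. */
        stop_machine(__kpti_install_ng_mappings, &alloc, cpu_online_mask);

        /* Preemptible again: release the temporary tables. */
        free_pages(alloc, order);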
Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
---
arch/arm64/mm/mmu.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 96a9fa505e71..9ad9612728e6 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1369,7 +1369,7 @@ static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
return kpti_ng_temp_alloc;
}
-static int __init __kpti_install_ng_mappings(void *__unused)
+static int __init __kpti_install_ng_mappings(void *data)
{
typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
extern kpti_remap_fn idmap_kpti_install_ng_mappings;
@@ -1377,10 +1377,9 @@ static int __init __kpti_install_ng_mappings(void *__unused)
int cpu = smp_processor_id();
int levels = CONFIG_PGTABLE_LEVELS;
- int order = order_base_2(levels);
u64 kpti_ng_temp_pgd_pa = 0;
pgd_t *kpti_ng_temp_pgd;
- u64 alloc = 0;
+ u64 alloc = *(u64 *)data;
if (levels == 5 && !pgtable_l5_enabled())
levels = 4;
@@ -1391,8 +1390,6 @@ static int __init __kpti_install_ng_mappings(void *__unused)
if (!cpu) {
int ret;
-
- alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
@@ -1423,16 +1420,17 @@ static int __init __kpti_install_ng_mappings(void *__unused)
remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
cpu_uninstall_idmap();
- if (!cpu) {
- free_pages(alloc, order);
+ if (!cpu)
arm64_use_ng_mappings = true;
- }
return 0;
}
void __init kpti_install_ng_mappings(void)
{
+ int order = order_base_2(CONFIG_PGTABLE_LEVELS);
+ u64 alloc;
+
/* Check whether KPTI is going to be used */
if (!arm64_kernel_unmapped_at_el0())
return;
@@ -1445,8 +1443,14 @@ void __init kpti_install_ng_mappings(void)
if (arm64_use_ng_mappings)
return;
+ alloc = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+ if (!alloc)
+ panic("Failed to alloc page tables\n");
+
init_idmap_kpti_bbml2_flag();
- stop_machine(__kpti_install_ng_mappings, NULL, cpu_online_mask);
+ stop_machine(__kpti_install_ng_mappings, &alloc, cpu_online_mask);
+
+ free_pages(alloc, order);
}
static pgprot_t __init kernel_exec_prot(void)
--
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64
2025-12-18 19:47 [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
2025-12-18 19:47 ` [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping Yeoreum Yun
2025-12-18 19:47 ` [PATCH v3 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
@ 2025-12-31 10:07 ` Yeoreum Yun
2025-12-31 12:34 ` David Hildenbrand (Red Hat)
2 siblings, 1 reply; 11+ messages in thread
From: Yeoreum Yun @ 2025-12-31 10:07 UTC (permalink / raw)
To: catalin.marinas, will, ryan.roberts, akpm, david, kevin.brodsky,
quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel
Gentle ping in case this has been forgotten.
On Thu, Dec 18, 2025 at 07:47:48PM +0000, Yeoreum Yun wrote:
> Under PREEMPT_RT, calling generic memory allocation/free APIs
> (e.x) __get_free_pages(), pgtable_alloc(), free_pages() and etc
> with preemption disabled is not allowed, but allow only nolock() APIs series
> because it may acquire a spin lock that becomes sleepable on RT,
> potentially causing a sleep during page allocation
> (See Documentation/core-api/real-time/differences.rst, Memory allocation section).
>
> However, In arm64, __linear_map_split_to_ptes() and
> __kpti_install_ng_mappings() called by stopper thread via stop_machine()
> use generic memory allocation/free APIs.
>
> This patchset fixes this problem and based on v6.19-rc1
>
> Patch History
> ==============
> from v2 to v3:
> - remove split-mode and split_args.
> pass proper function pointer while spliting.
> - rename function name.
> - https://lore.kernel.org/all/20251217182007.2345700-1-yeoreum.yun@arm.com/
>
> from v1 to v2:
> - drop pagetable_alloc_nolock()
> - following @Ryan Roberts suggestion.
> - https://lore.kernel.org/all/20251212161832.2067134-1-yeoreum.yun@arm.com/
>
>
> *** BLURB HERE ***
>
> Yeoreum Yun (2):
> arm64: mmu: avoid allocating pages while splitting the linear mapping
> arm64: mmu: avoid allocating pages while installing ng-mapping for
> KPTI
>
> arch/arm64/mm/mmu.c | 254 ++++++++++++++++++++++++++++++++++----------
> 1 file changed, 197 insertions(+), 57 deletions(-)
>
> --
> LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
>
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64
2025-12-31 10:07 ` [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
@ 2025-12-31 12:34 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-31 12:34 UTC (permalink / raw)
To: Yeoreum Yun, catalin.marinas, will, ryan.roberts, akpm,
kevin.brodsky, quic_zhenhuah, dev.jain, yang, chaitanyas.prakash,
bigeasy, clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb,
vbabka, mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel
On 12/31/25 11:07, Yeoreum Yun wrote:
> Gentle ping in case of forgotten.
A lot of people (including me, wait, why am I reading mails ;) ) are
currently out, and will likely be out for the remainder of the week.
I'd assume people will look into it next week.
--
Cheers
David
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping
2025-12-18 19:47 ` [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping Yeoreum Yun
@ 2026-01-02 11:03 ` Ryan Roberts
2026-01-02 11:09 ` Ryan Roberts
2026-01-02 12:21 ` Yeoreum Yun
0 siblings, 2 replies; 11+ messages in thread
From: Ryan Roberts @ 2026-01-02 11:03 UTC (permalink / raw)
To: Yeoreum Yun, catalin.marinas, will, akpm, david, kevin.brodsky,
quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel
On 18/12/2025 19:47, Yeoreum Yun wrote:
> linear_map_split_to_ptes() currently allocates page tables while
> splitting the linear mapping into PTEs under stop_machine() using GFP_ATOMIC.
>
> This is fine for non-PREEMPT_RT configurations.
> However, it becomes problematic on PREEMPT_RT, because
> generic memory allocation/free APIs (e.g. pgtable_alloc(), __get_free_pages(), etc.)
> cannot be called from a non-preemptible context, except for the _nolock() variants.
> This is because generic memory allocation/free paths are sleepable,
> as they rely on spin_lock(), which becomes sleepable on PREEMPT_RT.
>
> In other words, even calling pgtable_alloc() with GFP_ATOMIC is not permitted
> in __linear_map_split_to_pte() when it is executed by the stopper thread,
> where preemption is disabled on PREEMPT_RT.
>
> To address this, the required number of page tables is first collected
> and preallocated, and the preallocated page tables are then used
> when splitting the linear mapping in __linear_map_split_to_pte().
>
> Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
> arch/arm64/mm/mmu.c | 232 +++++++++++++++++++++++++++++++++++---------
> 1 file changed, 184 insertions(+), 48 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 9ae7ce00a7ef..96a9fa505e71 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -527,18 +527,15 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> panic("Failed to create page tables\n");
> }
>
> -static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> - enum pgtable_type pgtable_type)
> -{
> - /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> - struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> - phys_addr_t pa;
> -
> - if (!ptdesc)
> - return INVALID_PHYS_ADDR;
> -
> - pa = page_to_phys(ptdesc_page(ptdesc));
> +static struct ptdesc **split_pgtables;
> +static int split_pgtables_order;
> +static unsigned long split_pgtables_count;
> +static unsigned long split_pgtables_idx;
>
> +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
> + struct ptdesc *ptdesc,
> + enum pgtable_type pgtable_type)
> +{
> switch (pgtable_type) {
> case TABLE_PTE:
> BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
> @@ -554,19 +551,28 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> break;
> }
>
nit: no need for this empty line
> - return pa;
> }
>
> -static phys_addr_t
> -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
> +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
nit: all remaining callers pass gfp=GFP_PGTABLE_KERNEL so you could remove the
param?
> + enum pgtable_type pgtable_type)
> {
> - return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
> + /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> + struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> +
> + if (!ptdesc)
> + return INVALID_PHYS_ADDR;
> +
> + __pgd_pgtable_init(mm, ptdesc, pgtable_type);
> +
> + return page_to_phys(ptdesc_page(ptdesc));
> }
>
> +typedef phys_addr_t (split_pgtable_alloc_fn)(enum pgtable_type);
This type is used more generally than just for splitting. Perhaps simply call it
"pgtable_alloc_fn"?
We also pass this type around in __create_pgd_mapping() and friends; perhaps we
should have a preparatory patch to define this type and consistently use it
everywhere instead of passing around "phys_addr_t (*pgtable_alloc)(enum
pgtable_type)"?
> +
> static phys_addr_t __maybe_unused
This is no longer __maybe_unused; you can drop the decorator.
> pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
> {
> - return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
> + return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
> }
>
> static phys_addr_t
> @@ -575,6 +581,23 @@ pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type)
> return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_type);
> }
>
> +static phys_addr_t
> +pgd_pgtable_get_preallocated(enum pgtable_type pgtable_type)
> +{
> + struct ptdesc *ptdesc;
> +
> + if (WARN_ON(split_pgtables_idx >= split_pgtables_count))
> + return INVALID_PHYS_ADDR;
> +
> + ptdesc = split_pgtables[split_pgtables_idx++];
> + if (!ptdesc)
> + return INVALID_PHYS_ADDR;
> +
> + __pgd_pgtable_init(&init_mm, ptdesc, pgtable_type);
> +
> + return page_to_phys(ptdesc_page(ptdesc));
> +}
> +
> static void split_contpte(pte_t *ptep)
> {
> int i;
> @@ -584,7 +607,9 @@ static void split_contpte(pte_t *ptep)
> __set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
> }
>
> -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
> + split_pgtable_alloc_fn *pgtable_alloc_fn,
nit: I believe the * has no effect when passing function pointers and the usual
convention in Linux is to not use the *. Existing functions are also calling it
"pgtable_alloc" so perhaps this is a bit more consistent:
pgtable_alloc_fn pgtable_alloc
(same nitty comment for all uses below :) )
> + bool to_cont)
> {
> pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
> unsigned long pfn = pmd_pfn(pmd);
> @@ -593,7 +618,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> pte_t *ptep;
> int i;
>
> - pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
> + pte_phys = pgtable_alloc_fn(TABLE_PTE);
> if (pte_phys == INVALID_PHYS_ADDR)
> return -ENOMEM;
> ptep = (pte_t *)phys_to_virt(pte_phys);
> @@ -628,7 +653,9 @@ static void split_contpmd(pmd_t *pmdp)
> set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
> }
>
> -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> +static int split_pud(pud_t *pudp, pud_t pud,
> + split_pgtable_alloc_fn *pgtable_alloc_fn,
> + bool to_cont)
> {
> pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
> unsigned int step = PMD_SIZE >> PAGE_SHIFT;
> @@ -638,7 +665,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> pmd_t *pmdp;
> int i;
>
> - pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
> + pmd_phys = pgtable_alloc_fn(TABLE_PMD);
> if (pmd_phys == INVALID_PHYS_ADDR)
> return -ENOMEM;
> pmdp = (pmd_t *)phys_to_virt(pmd_phys);
> @@ -707,7 +734,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> if (!pud_present(pud))
> goto out;
> if (pud_leaf(pud)) {
> - ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
> + ret = split_pud(pudp, pud, pgd_pgtable_alloc_init_mm, true);
> if (ret)
> goto out;
> }
> @@ -732,7 +759,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> */
> if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
> goto out;
> - ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
> + ret = split_pmd(pmdp, pmd, pgd_pgtable_alloc_init_mm, true);
> if (ret)
> goto out;
> }
> @@ -831,34 +858,35 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> unsigned long next, struct mm_walk *walk)
> {
> - gfp_t gfp = *(gfp_t *)walk->private;
> + split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
> pud_t pud = pudp_get(pudp);
> - int ret = 0;
>
> - if (pud_leaf(pud))
> - ret = split_pud(pudp, pud, gfp, false);
> + if (!pud_leaf(pud))
> + return 0;
>
> - return ret;
> + return split_pud(pudp, pud, pgtable_alloc_fn, false);
why are you changing the layout of this function? Seems like unnecessary churn.
Just pass pgtable_alloc to split_pud() instead of gfp.
> }
>
> static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> unsigned long next, struct mm_walk *walk)
> {
> - gfp_t gfp = *(gfp_t *)walk->private;
> + split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
> pmd_t pmd = pmdp_get(pmdp);
> - int ret = 0;
> + int ret;
>
> - if (pmd_leaf(pmd)) {
> - if (pmd_cont(pmd))
> - split_contpmd(pmdp);
> - ret = split_pmd(pmdp, pmd, gfp, false);
> + if (!pmd_leaf(pmd))
> + return 0;
>
> - /*
> - * We have split the pmd directly to ptes so there is no need to
> - * visit each pte to check if they are contpte.
> - */
> - walk->action = ACTION_CONTINUE;
> - }
> + if (pmd_cont(pmd))
> + split_contpmd(pmdp);
> +
> + ret = split_pmd(pmdp, pmd, pgtable_alloc_fn, false);
> +
> + /*
> + * We have split the pmd directly to ptes so there is no need to
> + * visit each pte to check if they are contpte.
> + */
> + walk->action = ACTION_CONTINUE;
Same comment; no need to change the layout of the function.
>
> return ret;
> }
> @@ -880,13 +908,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
> .pte_entry = split_to_ptes_pte_entry,
> };
>
> -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> +static int range_split_to_ptes(unsigned long start, unsigned long end,
> + split_pgtable_alloc_fn *pgtable_alloc_fn)
> {
> int ret;
>
> arch_enter_lazy_mmu_mode();
> ret = walk_kernel_page_table_range_lockless(start, end,
> - &split_to_ptes_ops, NULL, &gfp);
> + &split_to_ptes_ops, NULL,
> + pgtable_alloc_fn);
> arch_leave_lazy_mmu_mode();
>
> return ret;
> @@ -903,6 +933,105 @@ static void __init init_idmap_kpti_bbml2_flag(void)
> smp_mb();
> }
>
> +static int __init
> +collect_to_split_pud_entry(pud_t *pudp, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pud_t pud = pudp_get(pudp);
> +
> + if (pud_leaf(pud))
> + split_pgtables_count += 1 + PTRS_PER_PMD;
I think you probably want:
walk->action = ACTION_CONTINUE;
Likely you will end up with the same behaviour regardless. But you should at
least be consistent with collect_to_split_pmd_entry().
> +
> + return 0;
> +}
> +
> +static int __init
> +collect_to_split_pmd_entry(pmd_t *pmdp, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pmd_t pmd = pmdp_get(pmdp);
> +
> + if (pmd_leaf(pmd))
> + split_pgtables_count++;
> +
> + walk->action = ACTION_CONTINUE;
> +
> + return 0;
> +}
> +
> +static void __init linear_map_free_split_pgtables(void)
> +{
> + int i;
> +
> + if (!split_pgtables_count || !split_pgtables)
> + goto skip_free;
> +
> + for (i = split_pgtables_idx; i < split_pgtables_count; i++) {
> + if (split_pgtables[i])
> + pagetable_free(split_pgtables[i]);
> + }
> +
> + free_pages((unsigned long)split_pgtables, split_pgtables_order);
> +
> +skip_free:
> + split_pgtables = NULL;
> + split_pgtables_count = 0;
> + split_pgtables_idx = 0;
> + split_pgtables_order = 0;
> +}
> +
> +static int __init linear_map_prealloc_split_pgtables(void)
> +{
> + int ret, i;
> + unsigned long lstart = _PAGE_OFFSET(vabits_actual);
> + unsigned long lend = PAGE_END;
> + unsigned long kstart = (unsigned long)lm_alias(_stext);
> + unsigned long kend = (unsigned long)lm_alias(__init_begin);
> +
> + const struct mm_walk_ops collect_to_split_ops = {
> + .pud_entry = collect_to_split_pud_entry,
> + .pmd_entry = collect_to_split_pmd_entry
> + };
> +
> + split_pgtables_idx = 0;
> + split_pgtables_count = 0;
> +
> + ret = walk_kernel_page_table_range_lockless(lstart, kstart,
> + &collect_to_split_ops,
> + NULL, NULL);
> + if (!ret)
> + ret = walk_kernel_page_table_range_lockless(kend, lend,
> + &collect_to_split_ops,
> + NULL, NULL);
> + if (ret || !split_pgtables_count)
> + goto error;
> +
> + ret = -ENOMEM;
> +
> + split_pgtables_order =
> + order_base_2(PAGE_ALIGN(split_pgtables_count *
> + sizeof(struct ptdesc *)) >> PAGE_SHIFT);
Wouldn't this be simpler?
split_pgtables_order = get_order(split_pgtables_count *
sizeof(struct ptdesc *));
> +
> + split_pgtables = (struct ptdesc **) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
> + split_pgtables_order);
Do you really need the cast? (I'm not sure).
> + if (!split_pgtables)
> + goto error;
> +
> + for (i = 0; i < split_pgtables_count; i++) {
> + split_pgtables[i] = pagetable_alloc(GFP_KERNEL, 0);
For consistency with other code, perhaps spell it out?:
/* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
gfp_t gfp = GFP_PGTABLE_KERNEL & ~__GFP_ZERO;
for (i = 0; i < split_pgtables_count; i++) {
split_pgtables[i] = pagetable_alloc(gfp, 0);
> + if (!split_pgtables[i])
> + goto error;
> + }
> +
> + ret = 0;
> +
> +error:
> + if (ret)
> + linear_map_free_split_pgtables();
> +
> + return ret;
> +}
I wonder if there is value in generalizing this a bit to separate out the
determination of the number of pages from the actual pre-allocation and free
functions? Then you have a reusable pre-allocation function that you could also
use for the KPTI case instead of it having yet another private pre-allocator?
> +
> static int __init linear_map_split_to_ptes(void *__unused)
> {
> /*
> @@ -928,9 +1057,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
> * PTE. The kernel alias remains static throughout runtime so
> * can continue to be safely mapped with large mappings.
> */
> - ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> + ret = range_split_to_ptes(lstart, kstart, pgd_pgtable_get_preallocated);
> if (!ret)
> - ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> + ret = range_split_to_ptes(kend, lend, pgd_pgtable_get_preallocated);
> if (ret)
> panic("Failed to split linear map\n");
> flush_tlb_kernel_range(lstart, lend);
> @@ -963,10 +1092,16 @@ static int __init linear_map_split_to_ptes(void *__unused)
>
> void __init linear_map_maybe_split_to_ptes(void)
> {
> - if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
> - init_idmap_kpti_bbml2_flag();
> - stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> - }
> + if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
> + return;
> +
> + if (linear_map_prealloc_split_pgtables())
> + panic("Failed to split linear map\n");
> +
> + init_idmap_kpti_bbml2_flag();
> + stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> +
> + linear_map_free_split_pgtables();
> }
>
> /*
> @@ -1088,6 +1223,7 @@ bool arch_kfence_init_pool(void)
> unsigned long end = start + KFENCE_POOL_SIZE;
> int ret;
>
> +
nit: Remove extra empty line.
This is looking much cleaner now; nearly there!
Thanks,
Ryan
> /* Exit early if we know the linear map is already pte-mapped. */
> if (!split_leaf_mapping_possible())
> return true;
> @@ -1097,7 +1233,7 @@ bool arch_kfence_init_pool(void)
> return true;
>
> mutex_lock(&pgtable_split_lock);
> - ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
> + ret = range_split_to_ptes(start, end, pgd_pgtable_alloc_init_mm);
> mutex_unlock(&pgtable_split_lock);
>
> /*
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping
2026-01-02 11:03 ` Ryan Roberts
@ 2026-01-02 11:09 ` Ryan Roberts
2026-01-02 12:21 ` Yeoreum Yun
1 sibling, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2026-01-02 11:09 UTC (permalink / raw)
To: Yeoreum Yun, catalin.marinas, will, akpm, david, kevin.brodsky,
quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel
On 02/01/2026 11:03, Ryan Roberts wrote:
> On 18/12/2025 19:47, Yeoreum Yun wrote:
>> linear_map_split_to_ptes() currently allocates page tables while
>> splitting the linear mapping into PTEs under stop_machine() using GFP_ATOMIC.
>>
>> This is fine for non-PREEMPT_RT configurations.
>> However, it becomes problematic on PREEMPT_RT, because
>> generic memory allocation/free APIs (e.g. pgtable_alloc(), __get_free_pages(), etc.)
>> cannot be called from a non-preemptible context, except for the _nolock() variants.
>> This is because generic memory allocation/free paths are sleepable,
>> as they rely on spin_lock(), which becomes sleepable on PREEMPT_RT.
>>
>> In other words, even calling pgtable_alloc() with GFP_ATOMIC is not permitted
>> in __linear_map_split_to_pte() when it is executed by the stopper thread,
>> where preemption is disabled on PREEMPT_RT.
>>
>> To address this, the required number of page tables is first collected
>> and preallocated, and the preallocated page tables are then used
>> when splitting the linear mapping in __linear_map_split_to_pte().
>>
>> Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
>> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
>> ---
>> arch/arm64/mm/mmu.c | 232 +++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 184 insertions(+), 48 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 9ae7ce00a7ef..96a9fa505e71 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -527,18 +527,15 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
>> panic("Failed to create page tables\n");
>> }
>>
>> -static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>> - enum pgtable_type pgtable_type)
>> -{
>> - /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>> - struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
>> - phys_addr_t pa;
>> -
>> - if (!ptdesc)
>> - return INVALID_PHYS_ADDR;
>> -
>> - pa = page_to_phys(ptdesc_page(ptdesc));
>> +static struct ptdesc **split_pgtables;
>> +static int split_pgtables_order;
>> +static unsigned long split_pgtables_count;
>> +static unsigned long split_pgtables_idx;
>>
>> +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
>> + struct ptdesc *ptdesc,
>> + enum pgtable_type pgtable_type)
>> +{
>> switch (pgtable_type) {
>> case TABLE_PTE:
>> BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
>> @@ -554,19 +551,28 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>> break;
>> }
>>
>
> nit: no need for this empty line
>
>> - return pa;
>> }
>>
>> -static phys_addr_t
>> -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
>> +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>
> nit: all remaining callers pass gfp=GFP_PGTABLE_KERNEL so you could remove the
> param?
>
>> + enum pgtable_type pgtable_type)
>> {
>> - return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
>> + /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
>> + struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
>> +
>> + if (!ptdesc)
>> + return INVALID_PHYS_ADDR;
>> +
>> + __pgd_pgtable_init(mm, ptdesc, pgtable_type);
>> +
>> + return page_to_phys(ptdesc_page(ptdesc));
>> }
>>
>> +typedef phys_addr_t (split_pgtable_alloc_fn)(enum pgtable_type);
>
> This type is used more generally than just for splitting. Perhaps simply call it
> "pgtable_alloc_fn"?
>
> We also pass this type around in __create_pgd_mapping() and friends; perhaps we
> should have a preparatory patch to define this type and consistently use it
> everywhere instead of passing around "phys_addr_t (*pgtable_alloc)(enum
> pgtable_type)"?
>
>> +
>> static phys_addr_t __maybe_unused
>
> This is no longer __maybe_unused; you can drop the decorator.
>
>> pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
>> {
>> - return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
>> + return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
>> }
>>
>> static phys_addr_t
>> @@ -575,6 +581,23 @@ pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type)
>> return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_type);
>> }
>>
>> +static phys_addr_t
>> +pgd_pgtable_get_preallocated(enum pgtable_type pgtable_type)
>> +{
>> + struct ptdesc *ptdesc;
>> +
>> + if (WARN_ON(split_pgtables_idx >= split_pgtables_count))
>> + return INVALID_PHYS_ADDR;
>> +
>> + ptdesc = split_pgtables[split_pgtables_idx++];
>> + if (!ptdesc)
>> + return INVALID_PHYS_ADDR;
>> +
>> + __pgd_pgtable_init(&init_mm, ptdesc, pgtable_type);
>> +
>> + return page_to_phys(ptdesc_page(ptdesc));
>> +}
>> +
>> static void split_contpte(pte_t *ptep)
>> {
>> int i;
>> @@ -584,7 +607,9 @@ static void split_contpte(pte_t *ptep)
>> __set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
>> }
>>
>> -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>> +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
>> + split_pgtable_alloc_fn *pgtable_alloc_fn,
>
> nit: I believe the * has no effect when passing function pointers and the usual
> convention in Linux is to not use the *. Existing functions are also calling it
> "pgtable_alloc" so perhaps this is a bit more consistent:
>
> pgtable_alloc_fn pgtable_alloc
>
> (same nitty comment for all uses below :) )
>
>> + bool to_cont)
>> {
>> pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
>> unsigned long pfn = pmd_pfn(pmd);
>> @@ -593,7 +618,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
>> pte_t *ptep;
>> int i;
>>
>> - pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
>> + pte_phys = pgtable_alloc_fn(TABLE_PTE);
>> if (pte_phys == INVALID_PHYS_ADDR)
>> return -ENOMEM;
>> ptep = (pte_t *)phys_to_virt(pte_phys);
>> @@ -628,7 +653,9 @@ static void split_contpmd(pmd_t *pmdp)
>> set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
>> }
>>
>> -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>> +static int split_pud(pud_t *pudp, pud_t pud,
>> + split_pgtable_alloc_fn *pgtable_alloc_fn,
>> + bool to_cont)
>> {
>> pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
>> unsigned int step = PMD_SIZE >> PAGE_SHIFT;
>> @@ -638,7 +665,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
>> pmd_t *pmdp;
>> int i;
>>
>> - pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
>> + pmd_phys = pgtable_alloc_fn(TABLE_PMD);
>> if (pmd_phys == INVALID_PHYS_ADDR)
>> return -ENOMEM;
>> pmdp = (pmd_t *)phys_to_virt(pmd_phys);
>> @@ -707,7 +734,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>> if (!pud_present(pud))
>> goto out;
>> if (pud_leaf(pud)) {
>> - ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
>> + ret = split_pud(pudp, pud, pgd_pgtable_alloc_init_mm, true);
>> if (ret)
>> goto out;
>> }
>> @@ -732,7 +759,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
>> */
>> if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
>> goto out;
>> - ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
>> + ret = split_pmd(pmdp, pmd, pgd_pgtable_alloc_init_mm, true);
>> if (ret)
>> goto out;
>> }
>> @@ -831,34 +858,35 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
>> static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
>> unsigned long next, struct mm_walk *walk)
>> {
>> - gfp_t gfp = *(gfp_t *)walk->private;
>> + split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
>> pud_t pud = pudp_get(pudp);
>> - int ret = 0;
>>
>> - if (pud_leaf(pud))
>> - ret = split_pud(pudp, pud, gfp, false);
>> + if (!pud_leaf(pud))
>> + return 0;
>>
>> - return ret;
>> + return split_pud(pudp, pud, pgtable_alloc_fn, false);
>
> why are you changing the layout of this function? Seems like unneccessary churn.
> Just pass pgtable_alloc to split_pud() instead of gfp.
>
>> }
>>
>> static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
>> unsigned long next, struct mm_walk *walk)
>> {
>> - gfp_t gfp = *(gfp_t *)walk->private;
>> + split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
>> pmd_t pmd = pmdp_get(pmdp);
>> - int ret = 0;
>> + int ret;
>>
>> - if (pmd_leaf(pmd)) {
>> - if (pmd_cont(pmd))
>> - split_contpmd(pmdp);
>> - ret = split_pmd(pmdp, pmd, gfp, false);
>> + if (!pmd_leaf(pmd))
>> + return 0;
>>
>> - /*
>> - * We have split the pmd directly to ptes so there is no need to
>> - * visit each pte to check if they are contpte.
>> - */
>> - walk->action = ACTION_CONTINUE;
>> - }
>> + if (pmd_cont(pmd))
>> + split_contpmd(pmdp);
>> +
>> + ret = split_pmd(pmdp, pmd, pgtable_alloc_fn, false);
>> +
>> + /*
>> + * We have split the pmd directly to ptes so there is no need to
>> + * visit each pte to check if they are contpte.
>> + */
>> + walk->action = ACTION_CONTINUE;
>
> Same comment; no need to change the layout of the function.
>
>>
>> return ret;
>> }
>> @@ -880,13 +908,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
>> .pte_entry = split_to_ptes_pte_entry,
>> };
>>
>> -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
>> +static int range_split_to_ptes(unsigned long start, unsigned long end,
>> + split_pgtable_alloc_fn *pgtable_alloc_fn)
>> {
>> int ret;
>>
>> arch_enter_lazy_mmu_mode();
>> ret = walk_kernel_page_table_range_lockless(start, end,
>> - &split_to_ptes_ops, NULL, &gfp);
>> + &split_to_ptes_ops, NULL,
>> + pgtable_alloc_fn);
>> arch_leave_lazy_mmu_mode();
>>
>> return ret;
>> @@ -903,6 +933,105 @@ static void __init init_idmap_kpti_bbml2_flag(void)
>> smp_mb();
>> }
>>
>> +static int __init
>> +collect_to_split_pud_entry(pud_t *pudp, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pud_t pud = pudp_get(pudp);
>> +
>> + if (pud_leaf(pud))
>> + split_pgtables_count += 1 + PTRS_PER_PMD;
>
> I think you probably want:
>
> walk->action = ACTION_CONTINUE;
>
> Likely you will end up with the same behaviour regardless. But you should at
> least we consistent with collect_to_split_pmd_entry().
>
>> +
>> + return 0;
>> +}
>> +
>> +static int __init
>> +collect_to_split_pmd_entry(pmd_t *pmdp, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pmd_t pmd = pmdp_get(pmdp);
>> +
>> + if (pmd_leaf(pmd))
>> + split_pgtables_count++;
>> +
>> + walk->action = ACTION_CONTINUE;
>> +
>> + return 0;
>> +}
>> +
>> +static void __init linear_map_free_split_pgtables(void)
>> +{
>> + int i;
>> +
>> + if (!split_pgtables_count || !split_pgtables)
>> + goto skip_free;
>> +
>> + for (i = split_pgtables_idx; i < split_pgtables_count; i++) {
>> + if (split_pgtables[i])
>> + pagetable_free(split_pgtables[i]);
>> + }
>> +
>> + free_pages((unsigned long)split_pgtables, split_pgtables_order);
>> +
>> +skip_free:
>> + split_pgtables = NULL;
>> + split_pgtables_count = 0;
>> + split_pgtables_idx = 0;
>> + split_pgtables_order = 0;
>> +}
>> +
>> +static int __init linear_map_prealloc_split_pgtables(void)
>> +{
>> + int ret, i;
>> + unsigned long lstart = _PAGE_OFFSET(vabits_actual);
>> + unsigned long lend = PAGE_END;
>> + unsigned long kstart = (unsigned long)lm_alias(_stext);
>> + unsigned long kend = (unsigned long)lm_alias(__init_begin);
>> +
>> + const struct mm_walk_ops collect_to_split_ops = {
>> + .pud_entry = collect_to_split_pud_entry,
>> + .pmd_entry = collect_to_split_pmd_entry
>> + };
>> +
>> + split_pgtables_idx = 0;
>> + split_pgtables_count = 0;
>> +
>> + ret = walk_kernel_page_table_range_lockless(lstart, kstart,
>> + &collect_to_split_ops,
>> + NULL, NULL);
>> + if (!ret)
>> + ret = walk_kernel_page_table_range_lockless(kend, lend,
>> + &collect_to_split_ops,
>> + NULL, NULL);
>> + if (ret || !split_pgtables_count)
>> + goto error;
>> +
>> + ret = -ENOMEM;
>> +
>> + split_pgtables_order =
>> + order_base_2(PAGE_ALIGN(split_pgtables_count *
>> + sizeof(struct ptdesc *)) >> PAGE_SHIFT);
>
> Wouldn't this be simpler?
>
> split_pgtables_order = get_order(split_pgtables_count *
> sizeof(struct ptdesc *));
>
>> +
>> + split_pgtables = (struct ptdesc **) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
>> + split_pgtables_order);
>
> Do you really need the cast? (I'm not sure).
>
>> + if (!split_pgtables)
>> + goto error;
>> +
>> + for (i = 0; i < split_pgtables_count; i++) {
>> + split_pgtables[i] = pagetable_alloc(GFP_KERNEL, 0);
>
> For consistency with other code, perhaps spell it out?:
>
> /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> gfp_t gfp = GFP_PGTABLE_KERNEL & ~__GFP_ZERO;
>
> for (i = 0; i < split_pgtables_count; i++) {
> split_pgtables[i] = pagetable_alloc(gfp, 0);
>
>> + if (!split_pgtables[i])
>> + goto error;
>> + }
>> +
>> + ret = 0;
>> +
>> +error:
>> + if (ret)
>> + linear_map_free_split_pgtables();
>> +
>> + return ret;
>> +}
>
> I wonder if there is value in generalizing this a bit to separate out the
> determination of the number of pages from the actual pre-allocation and free
> functions? Then you have a reusable pre-allocation function that you could also
> use for the KPTI case instead of having it have yet another private pre-allocator?
Actually ignore this comment; looks like the KPTI case requires the allocation
to be contiguous because it needs to map it into a temp pgtable. So there is a
valid reason for the 2 pre-allocators to be different.
>
>> +
>> static int __init linear_map_split_to_ptes(void *__unused)
>> {
>> /*
>> @@ -928,9 +1057,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
>> * PTE. The kernel alias remains static throughout runtime so
>> * can continue to be safely mapped with large mappings.
>> */
>> - ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
>> + ret = range_split_to_ptes(lstart, kstart, pgd_pgtable_get_preallocated);
>> if (!ret)
>> - ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
>> + ret = range_split_to_ptes(kend, lend, pgd_pgtable_get_preallocated);
>> if (ret)
>> panic("Failed to split linear map\n");
>> flush_tlb_kernel_range(lstart, lend);
>> @@ -963,10 +1092,16 @@ static int __init linear_map_split_to_ptes(void *__unused)
>>
>> void __init linear_map_maybe_split_to_ptes(void)
>> {
>> - if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
>> - init_idmap_kpti_bbml2_flag();
>> - stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
>> - }
>> + if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
>> + return;
>> +
>> + if (linear_map_prealloc_split_pgtables())
>> + panic("Failed to split linear map\n");
>> +
>> + init_idmap_kpti_bbml2_flag();
>> + stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
>> +
>> + linear_map_free_split_pgtables();
>> }
>>
>> /*
>> @@ -1088,6 +1223,7 @@ bool arch_kfence_init_pool(void)
>> unsigned long end = start + KFENCE_POOL_SIZE;
>> int ret;
>>
>> +
>
> nit: Remove extra empty line.
>
> This is looking much cleaner now; nearly there!
>
> Thanks,
> Ryan
>
>
>> /* Exit early if we know the linear map is already pte-mapped. */
>> if (!split_leaf_mapping_possible())
>> return true;
>> @@ -1097,7 +1233,7 @@ bool arch_kfence_init_pool(void)
>> return true;
>>
>> mutex_lock(&pgtable_split_lock);
>> - ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
>> + ret = range_split_to_ptes(start, end, pgd_pgtable_alloc_init_mm);
>> mutex_unlock(&pgtable_split_lock);
>>
>> /*
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
2025-12-18 19:47 ` [PATCH v3 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
@ 2026-01-02 11:10 ` Ryan Roberts
2026-01-02 14:48 ` Yeoreum Yun
0 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2026-01-02 11:10 UTC (permalink / raw)
To: Yeoreum Yun, catalin.marinas, will, akpm, david, kevin.brodsky,
quic_zhenhuah, dev.jain, yang, chaitanyas.prakash, bigeasy,
clrkwllms, rostedt, lorenzo.stoakes, ardb, jackmanb, vbabka,
mhocko
Cc: linux-arm-kernel, linux-kernel, linux-rt-devel
On 18/12/2025 19:47, Yeoreum Yun wrote:
> The current __kpti_install_ng_mappings() allocates a temporary PGD
> while installing the NG mapping for KPTI under stop_machine(),
> using GFP_ATOMIC.
>
> This is fine in the non-PREEMPT_RT case. However, it becomes a problem
> under PREEMPT_RT because generic memory allocation/free APIs
> (e.g., pgtable_alloc(), __get_free_pages(), etc.) cannot be invoked
> in a non-preemptible context, except for the *_nolock() variants.
> These generic allocators may sleep due to their use of spin_lock().
>
> In other words, calling __get_free_pages(), even with GFP_ATOMIC,
> is not allowed in __kpti_install_ng_mappings(), which is executed by
> the stopper thread where preemption is disabled under PREEMPT_RT.
>
> To address this, preallocate the page needed for the temporary PGD
> before invoking __kpti_install_ng_mappings() via stop_machine().
>
> Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled")
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> arch/arm64/mm/mmu.c | 22 +++++++++++++---------
> 1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 96a9fa505e71..9ad9612728e6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1369,7 +1369,7 @@ static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
> return kpti_ng_temp_alloc;
> }
>
> -static int __init __kpti_install_ng_mappings(void *__unused)
> +static int __init __kpti_install_ng_mappings(void *data)
> {
> typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
> extern kpti_remap_fn idmap_kpti_install_ng_mappings;
> @@ -1377,10 +1377,9 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>
> int cpu = smp_processor_id();
> int levels = CONFIG_PGTABLE_LEVELS;
> - int order = order_base_2(levels);
> u64 kpti_ng_temp_pgd_pa = 0;
> pgd_t *kpti_ng_temp_pgd;
> - u64 alloc = 0;
> + u64 alloc = *(u64 *)data;
>
> if (levels == 5 && !pgtable_l5_enabled())
> levels = 4;
> @@ -1391,8 +1390,6 @@ static int __init __kpti_install_ng_mappings(void *__unused)
>
> if (!cpu) {
> int ret;
> -
> - alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
> kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
>
> @@ -1423,16 +1420,17 @@ static int __init __kpti_install_ng_mappings(void *__unused)
> remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
> cpu_uninstall_idmap();
>
> - if (!cpu) {
> - free_pages(alloc, order);
> + if (!cpu)
> arm64_use_ng_mappings = true;
> - }
>
> return 0;
> }
>
> void __init kpti_install_ng_mappings(void)
> {
> + int order = order_base_2(CONFIG_PGTABLE_LEVELS);
> + u64 alloc;
> +
nit: Restore the blank line between the variable definitions and the logic.
But you already have my R-b :)
> /* Check whether KPTI is going to be used */
> if (!arm64_kernel_unmapped_at_el0())
> return;
> @@ -1445,8 +1443,14 @@ void __init kpti_install_ng_mappings(void)
> if (arm64_use_ng_mappings)
> return;
>
> + alloc = __get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> + if (!alloc)
> + panic("Failed to alloc page tables\n");
> +
> init_idmap_kpti_bbml2_flag();
> - stop_machine(__kpti_install_ng_mappings, NULL, cpu_online_mask);
> + stop_machine(__kpti_install_ng_mappings, &alloc, cpu_online_mask);
> +
> + free_pages(alloc, order);
> }
>
> static pgprot_t __init kernel_exec_prot(void)
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping
2026-01-02 11:03 ` Ryan Roberts
2026-01-02 11:09 ` Ryan Roberts
@ 2026-01-02 12:21 ` Yeoreum Yun
2026-01-02 12:35 ` Ryan Roberts
1 sibling, 1 reply; 11+ messages in thread
From: Yeoreum Yun @ 2026-01-02 12:21 UTC (permalink / raw)
To: Ryan Roberts
Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
lorenzo.stoakes, ardb, jackmanb, vbabka, mhocko, linux-arm-kernel,
linux-kernel, linux-rt-devel
Hi Ryan,
Thanks for your review :)
> > linear_map_split_to_ptes() currently allocates page tables while
> > splitting the linear mapping into PTEs under stop_machine() using GFP_ATOMIC.
> >
> > This is fine for non-PREEMPT_RT configurations.
> > However, it becomes problematic on PREEMPT_RT, because
> > generic memory allocation/free APIs (e.g. pgtable_alloc(), __get_free_pages(), etc.)
> > cannot be called from a non-preemptible context, except for the _nolock() variants.
> > This is because generic memory allocation/free paths are sleepable,
> > as they rely on spin_lock(), which becomes sleepable on PREEMPT_RT.
> >
> > In other words, even calling pgtable_alloc() with GFP_ATOMIC is not permitted
> > in __linear_map_split_to_pte() when it is executed by the stopper thread,
> > where preemption is disabled on PREEMPT_RT.
> >
> > To address this, the required number of page tables is first collected
> > and preallocated, and the preallocated page tables are then used
> > when splitting the linear mapping in __linear_map_split_to_pte().
> >
> > Fixes: 3df6979d222b ("arm64: mm: split linear mapping if BBML2 unsupported on secondary CPUs")
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > ---
> > arch/arm64/mm/mmu.c | 232 +++++++++++++++++++++++++++++++++++---------
> > 1 file changed, 184 insertions(+), 48 deletions(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 9ae7ce00a7ef..96a9fa505e71 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -527,18 +527,15 @@ static void early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> > panic("Failed to create page tables\n");
> > }
> >
> > -static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> > - enum pgtable_type pgtable_type)
> > -{
> > - /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> > - struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> > - phys_addr_t pa;
> > -
> > - if (!ptdesc)
> > - return INVALID_PHYS_ADDR;
> > -
> > - pa = page_to_phys(ptdesc_page(ptdesc));
> > +static struct ptdesc **split_pgtables;
> > +static int split_pgtables_order;
> > +static unsigned long split_pgtables_count;
> > +static unsigned long split_pgtables_idx;
> >
> > +static __always_inline void __pgd_pgtable_init(struct mm_struct *mm,
> > + struct ptdesc *ptdesc,
> > + enum pgtable_type pgtable_type)
> > +{
> > switch (pgtable_type) {
> > case TABLE_PTE:
> > BUG_ON(!pagetable_pte_ctor(mm, ptdesc));
> > @@ -554,19 +551,28 @@ static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> > break;
> > }
> >
>
> nit: no need for this empty line
Okay. I'll remove it.
>
> > - return pa;
> > }
> >
> > -static phys_addr_t
> > -pgd_pgtable_alloc_init_mm_gfp(enum pgtable_type pgtable_type, gfp_t gfp)
> > +static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
>
> nit: all remaining callers pass gfp=GFP_PGTABLE_KERNEL so you could remove the
> param?
Agree. I'll remove it.
>
> > + enum pgtable_type pgtable_type)
> > {
> > - return __pgd_pgtable_alloc(&init_mm, gfp, pgtable_type);
> > + /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> > + struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> > +
> > + if (!ptdesc)
> > + return INVALID_PHYS_ADDR;
> > +
> > + __pgd_pgtable_init(mm, ptdesc, pgtable_type);
> > +
> > + return page_to_phys(ptdesc_page(ptdesc));
> > }
> >
> > +typedef phys_addr_t (split_pgtable_alloc_fn)(enum pgtable_type);
>
> This type is used more generally than just for splitting. Perhaps simply call it
> "pgtable_alloc_fn"?
>
> We also pass this type around in __create_pgd_mapping() and friends; perhaps we
> should have a preparatory patch to define this type and consistently use it
> everywhere instead of passing around "phys_addr_t (*pgtable_alloc)(enum
> pgtable_type)"?
Oh, I missed that __create_pgd_mapping() uses the same callback type.
I'll follow your suggestion. Thanks!
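Something along these lines, I guess (just a sketch of the idea, not the
final patch; the __create_pgd_mapping() prototype below is abbreviated
from memory):

	typedef phys_addr_t (pgtable_alloc_fn)(enum pgtable_type);

	/* The split helpers would take the callback instead of a gfp_t ... */
	static int split_pmd(pmd_t *pmdp, pmd_t pmd,
			     pgtable_alloc_fn pgtable_alloc, bool to_cont);

	/* ... and __create_pgd_mapping() and friends would use the same type. */
	static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
					 unsigned long virt, phys_addr_t size,
					 pgprot_t prot,
					 pgtable_alloc_fn pgtable_alloc,
					 int flags);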
>
> > +
> > static phys_addr_t __maybe_unused
>
> This is no longer __maybe_unused; you can drop the decorator.
Good point. Thanks!
>
> > pgd_pgtable_alloc_init_mm(enum pgtable_type pgtable_type)
> > {
> > - return pgd_pgtable_alloc_init_mm_gfp(pgtable_type, GFP_PGTABLE_KERNEL);
> > + return __pgd_pgtable_alloc(&init_mm, GFP_PGTABLE_KERNEL, pgtable_type);
> > }
> >
> > static phys_addr_t
> > @@ -575,6 +581,23 @@ pgd_pgtable_alloc_special_mm(enum pgtable_type pgtable_type)
> > return __pgd_pgtable_alloc(NULL, GFP_PGTABLE_KERNEL, pgtable_type);
> > }
> >
> > +static phys_addr_t
> > +pgd_pgtable_get_preallocated(enum pgtable_type pgtable_type)
> > +{
> > + struct ptdesc *ptdesc;
> > +
> > + if (WARN_ON(split_pgtables_idx >= split_pgtables_count))
> > + return INVALID_PHYS_ADDR;
> > +
> > + ptdesc = split_pgtables[split_pgtables_idx++];
> > + if (!ptdesc)
> > + return INVALID_PHYS_ADDR;
> > +
> > + __pgd_pgtable_init(&init_mm, ptdesc, pgtable_type);
> > +
> > + return page_to_phys(ptdesc_page(ptdesc));
> > +}
> > +
> > static void split_contpte(pte_t *ptep)
> > {
> > int i;
> > @@ -584,7 +607,9 @@ static void split_contpte(pte_t *ptep)
> > __set_pte(ptep, pte_mknoncont(__ptep_get(ptep)));
> > }
> >
> > -static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> > +static int split_pmd(pmd_t *pmdp, pmd_t pmd,
> > + split_pgtable_alloc_fn *pgtable_alloc_fn,
>
> nit: I believe the * has no effect when passing function pointers and the usual
> convention in Linux is to not use the *. Existing functions are also calling it
> "pgtable_alloc" so perhaps this is a bit more consistent:
>
> pgtable_alloc_fn pgtable_alloc
>
> (same nitty comment for all uses below :) )
You're right. It's just my *bad* habit. I'll remove it.
>
> > + bool to_cont)
> > {
> > pmdval_t tableprot = PMD_TYPE_TABLE | PMD_TABLE_UXN | PMD_TABLE_AF;
> > unsigned long pfn = pmd_pfn(pmd);
> > @@ -593,7 +618,7 @@ static int split_pmd(pmd_t *pmdp, pmd_t pmd, gfp_t gfp, bool to_cont)
> > pte_t *ptep;
> > int i;
> >
> > - pte_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PTE, gfp);
> > + pte_phys = pgtable_alloc_fn(TABLE_PTE);
> > if (pte_phys == INVALID_PHYS_ADDR)
> > return -ENOMEM;
> > ptep = (pte_t *)phys_to_virt(pte_phys);
> > @@ -628,7 +653,9 @@ static void split_contpmd(pmd_t *pmdp)
> > set_pmd(pmdp, pmd_mknoncont(pmdp_get(pmdp)));
> > }
> >
> > -static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> > +static int split_pud(pud_t *pudp, pud_t pud,
> > + split_pgtable_alloc_fn *pgtable_alloc_fn,
> > + bool to_cont)
> > {
> > pudval_t tableprot = PUD_TYPE_TABLE | PUD_TABLE_UXN | PUD_TABLE_AF;
> > unsigned int step = PMD_SIZE >> PAGE_SHIFT;
> > @@ -638,7 +665,7 @@ static int split_pud(pud_t *pudp, pud_t pud, gfp_t gfp, bool to_cont)
> > pmd_t *pmdp;
> > int i;
> >
> > - pmd_phys = pgd_pgtable_alloc_init_mm_gfp(TABLE_PMD, gfp);
> > + pmd_phys = pgtable_alloc_fn(TABLE_PMD);
> > if (pmd_phys == INVALID_PHYS_ADDR)
> > return -ENOMEM;
> > pmdp = (pmd_t *)phys_to_virt(pmd_phys);
> > @@ -707,7 +734,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> > if (!pud_present(pud))
> > goto out;
> > if (pud_leaf(pud)) {
> > - ret = split_pud(pudp, pud, GFP_PGTABLE_KERNEL, true);
> > + ret = split_pud(pudp, pud, pgd_pgtable_alloc_init_mm, true);
> > if (ret)
> > goto out;
> > }
> > @@ -732,7 +759,7 @@ static int split_kernel_leaf_mapping_locked(unsigned long addr)
> > */
> > if (ALIGN_DOWN(addr, PMD_SIZE) == addr)
> > goto out;
> > - ret = split_pmd(pmdp, pmd, GFP_PGTABLE_KERNEL, true);
> > + ret = split_pmd(pmdp, pmd, pgd_pgtable_alloc_init_mm, true);
> > if (ret)
> > goto out;
> > }
> > @@ -831,34 +858,35 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> > static int split_to_ptes_pud_entry(pud_t *pudp, unsigned long addr,
> > unsigned long next, struct mm_walk *walk)
> > {
> > - gfp_t gfp = *(gfp_t *)walk->private;
> > + split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
> > pud_t pud = pudp_get(pudp);
> > - int ret = 0;
> >
> > - if (pud_leaf(pud))
> > - ret = split_pud(pudp, pud, gfp, false);
> > + if (!pud_leaf(pud))
> > + return 0;
> >
> > - return ret;
> > + return split_pud(pudp, pud, pgtable_alloc_fn, false);
>
> Why are you changing the layout of this function? It seems like unnecessary churn.
> Just pass pgtable_alloc to split_pud() instead of gfp.
>
> > }
> >
> > static int split_to_ptes_pmd_entry(pmd_t *pmdp, unsigned long addr,
> > unsigned long next, struct mm_walk *walk)
> > {
> > - gfp_t gfp = *(gfp_t *)walk->private;
> > + split_pgtable_alloc_fn *pgtable_alloc_fn = walk->private;
> > pmd_t pmd = pmdp_get(pmdp);
> > - int ret = 0;
> > + int ret;
> >
> > - if (pmd_leaf(pmd)) {
> > - if (pmd_cont(pmd))
> > - split_contpmd(pmdp);
> > - ret = split_pmd(pmdp, pmd, gfp, false);
> > + if (!pmd_leaf(pmd))
> > + return 0;
> >
> > - /*
> > - * We have split the pmd directly to ptes so there is no need to
> > - * visit each pte to check if they are contpte.
> > - */
> > - walk->action = ACTION_CONTINUE;
> > - }
> > + if (pmd_cont(pmd))
> > + split_contpmd(pmdp);
> > +
> > + ret = split_pmd(pmdp, pmd, pgtable_alloc_fn, false);
> > +
> > + /*
> > + * We have split the pmd directly to ptes so there is no need to
> > + * visit each pte to check if they are contpte.
> > + */
> > + walk->action = ACTION_CONTINUE;
>
> Same comment; no need to change the layout of the function.
Okay. I'll keep the original layout.
>
> >
> > return ret;
> > }
> > @@ -880,13 +908,15 @@ static const struct mm_walk_ops split_to_ptes_ops = {
> > .pte_entry = split_to_ptes_pte_entry,
> > };
> >
> > -static int range_split_to_ptes(unsigned long start, unsigned long end, gfp_t gfp)
> > +static int range_split_to_ptes(unsigned long start, unsigned long end,
> > + split_pgtable_alloc_fn *pgtable_alloc_fn)
> > {
> > int ret;
> >
> > arch_enter_lazy_mmu_mode();
> > ret = walk_kernel_page_table_range_lockless(start, end,
> > - &split_to_ptes_ops, NULL, &gfp);
> > + &split_to_ptes_ops, NULL,
> > + pgtable_alloc_fn);
> > arch_leave_lazy_mmu_mode();
> >
> > return ret;
> > @@ -903,6 +933,105 @@ static void __init init_idmap_kpti_bbml2_flag(void)
> > smp_mb();
> > }
> >
> > +static int __init
> > +collect_to_split_pud_entry(pud_t *pudp, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk)
> > +{
> > + pud_t pud = pudp_get(pudp);
> > +
> > + if (pud_leaf(pud))
> > + split_pgtables_count += 1 + PTRS_PER_PMD;
>
> I think you probably want:
>
> walk->action = ACTION_CONTINUE;
>
> Likely you will end up with the same behaviour regardless. But you should at
> least be consistent with collect_to_split_pmd_entry().
My fault for missing this. Thanks for catching it :)
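One way this could look (sketch only, setting the action in the leaf
case, since a leaf pud has nothing below to descend into):

	static int __init
	collect_to_split_pud_entry(pud_t *pudp, unsigned long addr,
				   unsigned long next, struct mm_walk *walk)
	{
		pud_t pud = pudp_get(pudp);

		if (pud_leaf(pud)) {
			/* One pmd table, plus one pte table per pmd entry. */
			split_pgtables_count += 1 + PTRS_PER_PMD;
			/* Nothing below a leaf pud to descend into. */
			walk->action = ACTION_CONTINUE;
		}

		return 0;
	}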
>
> > +
> > + return 0;
> > +}
> > +
> > +static int __init
> > +collect_to_split_pmd_entry(pmd_t *pmdp, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk)
> > +{
> > + pmd_t pmd = pmdp_get(pmdp);
> > +
> > + if (pmd_leaf(pmd))
> > + split_pgtables_count++;
> > +
> > + walk->action = ACTION_CONTINUE;
> > +
> > + return 0;
> > +}
> > +
> > +static void __init linear_map_free_split_pgtables(void)
> > +{
> > + int i;
> > +
> > + if (!split_pgtables_count || !split_pgtables)
> > + goto skip_free;
> > +
> > + for (i = split_pgtables_idx; i < split_pgtables_count; i++) {
> > + if (split_pgtables[i])
> > + pagetable_free(split_pgtables[i]);
> > + }
> > +
> > + free_pages((unsigned long)split_pgtables, split_pgtables_order);
> > +
> > +skip_free:
> > + split_pgtables = NULL;
> > + split_pgtables_count = 0;
> > + split_pgtables_idx = 0;
> > + split_pgtables_order = 0;
> > +}
> > +
> > +static int __init linear_map_prealloc_split_pgtables(void)
> > +{
> > + int ret, i;
> > + unsigned long lstart = _PAGE_OFFSET(vabits_actual);
> > + unsigned long lend = PAGE_END;
> > + unsigned long kstart = (unsigned long)lm_alias(_stext);
> > + unsigned long kend = (unsigned long)lm_alias(__init_begin);
> > +
> > + const struct mm_walk_ops collect_to_split_ops = {
> > + .pud_entry = collect_to_split_pud_entry,
> > + .pmd_entry = collect_to_split_pmd_entry
> > + };
> > +
> > + split_pgtables_idx = 0;
> > + split_pgtables_count = 0;
> > +
> > + ret = walk_kernel_page_table_range_lockless(lstart, kstart,
> > + &collect_to_split_ops,
> > + NULL, NULL);
> > + if (!ret)
> > + ret = walk_kernel_page_table_range_lockless(kend, lend,
> > + &collect_to_split_ops,
> > + NULL, NULL);
> > + if (ret || !split_pgtables_count)
> > + goto error;
> > +
> > + ret = -ENOMEM;
> > +
> > + split_pgtables_order =
> > + order_base_2(PAGE_ALIGN(split_pgtables_count *
> > + sizeof(struct ptdesc *)) >> PAGE_SHIFT);
>
> Wouldn't this be simpler?
>
> split_pgtables_order = get_order(split_pgtables_count *
> sizeof(struct ptdesc *));
>
Yes, that would be simpler. But I think we can use kvmalloc() for
split_pgtables, since linear_map_prealloc_split_pgtables() is called
after mm_core_init(). So the order calculation could be dropped
entirely, or am I missing something?
> > +
> > + split_pgtables = (struct ptdesc **) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
> > + split_pgtables_order);
>
> Do you really need the cast? (I'm not sure).
Since __get_free_pages() returns "unsigned long", the cast is required.
But as I said above, I think it could be replaced with kvmalloc()
for split_pgtables.
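i.e. something like this (sketch only, untested), which would also drop
split_pgtables_order and the cast:

	/* In linear_map_prealloc_split_pgtables(): */
	split_pgtables = kvcalloc(split_pgtables_count,
				  sizeof(struct ptdesc *), GFP_KERNEL);
	if (!split_pgtables)
		goto error;

	/* ... and in linear_map_free_split_pgtables(): */
	kvfree(split_pgtables);
	split_pgtables = NULL;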
>
> > + if (!split_pgtables)
> > + goto error;
> > +
> > + for (i = 0; i < split_pgtables_count; i++) {
> > + split_pgtables[i] = pagetable_alloc(GFP_KERNEL, 0);
>
> For consistency with other code, perhaps spell it out?:
>
> /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> gfp_t gfp = GFP_PGTABLE_KERNEL & ~__GFP_ZERO;
Using a gfp_t gfp variable is fine, but the comment should be changed,
since init_clear_pgtable() isn't used on the split path; instead,
splitting fills each new page table with entries for the previously
mapped range. I'll use gfp with the proper comment. Thanks!
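Roughly like this (sketch of the adjusted comment):

	/*
	 * No need for __GFP_ZERO: the split path doesn't go through
	 * init_clear_pgtable(); it fully repopulates each new table with
	 * entries covering the range the old leaf mapping used to cover.
	 */
	gfp_t gfp = GFP_PGTABLE_KERNEL & ~__GFP_ZERO;

	for (i = 0; i < split_pgtables_count; i++) {
		split_pgtables[i] = pagetable_alloc(gfp, 0);
		if (!split_pgtables[i])
			goto error;
	}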
>
> for (i = 0; i < split_pgtables_count; i++) {
> split_pgtables[i] = pagetable_alloc(gfp, 0);
>
> > + if (!split_pgtables[i])
> > + goto error;
> > + }
> > +
> > + ret = 0;
> > +
> > +error:
> > + if (ret)
> > + linear_map_free_split_pgtables();
> > +
> > + return ret;
> > +}
>
> I wonder if there is value in generalizing this a bit to separate out the
> determination of the number of pages from the actual pre-allocation and free
> functions? Then you have a reusable pre-allocation function that you could also
> use for the KPTI case instead of having yet another private pre-allocator?
I saw your later comment on this :D. I'll drop this one, then.
>
> > +
> > static int __init linear_map_split_to_ptes(void *__unused)
> > {
> > /*
> > @@ -928,9 +1057,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
> > * PTE. The kernel alias remains static throughout runtime so
> > * can continue to be safely mapped with large mappings.
> > */
> > - ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> > + ret = range_split_to_ptes(lstart, kstart, pgd_pgtable_get_preallocated);
> > if (!ret)
> > - ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> > + ret = range_split_to_ptes(kend, lend, pgd_pgtable_get_preallocated);
> > if (ret)
> > panic("Failed to split linear map\n");
> > flush_tlb_kernel_range(lstart, lend);
> > @@ -963,10 +1092,16 @@ static int __init linear_map_split_to_ptes(void *__unused)
> >
> > void __init linear_map_maybe_split_to_ptes(void)
> > {
> > - if (linear_map_requires_bbml2 && !system_supports_bbml2_noabort()) {
> > - init_idmap_kpti_bbml2_flag();
> > - stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> > - }
> > + if (!linear_map_requires_bbml2 || system_supports_bbml2_noabort())
> > + return;
> > +
> > + if (linear_map_prealloc_split_pgtables())
> > + panic("Failed to split linear map\n");
> > +
> > + init_idmap_kpti_bbml2_flag();
> > + stop_machine(linear_map_split_to_ptes, NULL, cpu_online_mask);
> > +
> > + linear_map_free_split_pgtables();
> > }
> >
> > /*
> > @@ -1088,6 +1223,7 @@ bool arch_kfence_init_pool(void)
> > unsigned long end = start + KFENCE_POOL_SIZE;
> > int ret;
> >
> > +
>
> nit: Remove extra empty line.
Okay. I'll remove it.
>
> This is looking much cleaner now; nearly there!
Thanks for your detailed review!
>
> Thanks,
> Ryan
>
>
> > /* Exit early if we know the linear map is already pte-mapped. */
> > if (!split_leaf_mapping_possible())
> > return true;
> > @@ -1097,7 +1233,7 @@ bool arch_kfence_init_pool(void)
> > return true;
> >
> > mutex_lock(&pgtable_split_lock);
> > - ret = range_split_to_ptes(start, end, GFP_PGTABLE_KERNEL);
> > + ret = range_split_to_ptes(start, end, pgd_pgtable_alloc_init_mm);
> > mutex_unlock(&pgtable_split_lock);
> >
> > /*
>
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping
2026-01-02 12:21 ` Yeoreum Yun
@ 2026-01-02 12:35 ` Ryan Roberts
0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2026-01-02 12:35 UTC (permalink / raw)
To: Yeoreum Yun
Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
lorenzo.stoakes, ardb, jackmanb, vbabka, mhocko, linux-arm-kernel,
linux-kernel, linux-rt-devel
On 02/01/2026 12:21, Yeoreum Yun wrote:
> Hi Ryan,
>
> Thanks for your review :)
>
[...]
>>> + split_pgtables_order =
>>> + order_base_2(PAGE_ALIGN(split_pgtables_count *
>>> + sizeof(struct ptdesc *)) >> PAGE_SHIFT);
>>
>> Wouldn't this be simpler?
>>
>> split_pgtables_order = get_order(split_pgtables_count *
>> sizeof(struct ptdesc *));
>>
>
> Yes, that would be simpler. But I think we can use kvmalloc() for
> split_pgtables, since linear_map_prealloc_split_pgtables() is called
> after mm_core_init(). So the order calculation could be dropped
> entirely, or am I missing something?
>
If kvmalloc is usable at this point, I agree that would be much better.
Thanks,
Ryan
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v3 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI
2026-01-02 11:10 ` Ryan Roberts
@ 2026-01-02 14:48 ` Yeoreum Yun
0 siblings, 0 replies; 11+ messages in thread
From: Yeoreum Yun @ 2026-01-02 14:48 UTC (permalink / raw)
To: Ryan Roberts
Cc: catalin.marinas, will, akpm, david, kevin.brodsky, quic_zhenhuah,
dev.jain, yang, chaitanyas.prakash, bigeasy, clrkwllms, rostedt,
lorenzo.stoakes, ardb, jackmanb, vbabka, mhocko, linux-arm-kernel,
linux-kernel, linux-rt-devel
Hi Ryan,
> On 18/12/2025 19:47, Yeoreum Yun wrote:
> > The current __kpti_install_ng_mappings() allocates a temporary PGD
> > while installing the NG mapping for KPTI under stop_machine(),
> > using GFP_ATOMIC.
> >
> > This is fine in the non-PREEMPT_RT case. However, it becomes a problem
> > under PREEMPT_RT because generic memory allocation/free APIs
> > (e.g., pgtable_alloc(), __get_free_pages(), etc.) cannot be invoked
> > in a non-preemptible context, except for the *_nolock() variants.
> > These generic allocators may sleep due to their use of spin_lock().
> >
> > In other words, calling __get_free_pages(), even with GFP_ATOMIC,
> > is not allowed in __kpti_install_ng_mappings(), which is executed by
> > the stopper thread where preemption is disabled under PREEMPT_RT.
> >
> > To address this, preallocate the page needed for the temporary PGD
> > before invoking __kpti_install_ng_mappings() via stop_machine().
> >
> > Fixes: 47546a1912fc ("arm64: mm: install KPTI nG mappings with MMU enabled")
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> > ---
> > arch/arm64/mm/mmu.c | 22 +++++++++++++---------
> > 1 file changed, 13 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 96a9fa505e71..9ad9612728e6 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -1369,7 +1369,7 @@ static phys_addr_t __init kpti_ng_pgd_alloc(enum pgtable_type type)
> > return kpti_ng_temp_alloc;
> > }
> >
> > -static int __init __kpti_install_ng_mappings(void *__unused)
> > +static int __init __kpti_install_ng_mappings(void *data)
> > {
> > typedef void (kpti_remap_fn)(int, int, phys_addr_t, unsigned long);
> > extern kpti_remap_fn idmap_kpti_install_ng_mappings;
> > @@ -1377,10 +1377,9 @@ static int __init __kpti_install_ng_mappings(void *__unused)
> >
> > int cpu = smp_processor_id();
> > int levels = CONFIG_PGTABLE_LEVELS;
> > - int order = order_base_2(levels);
> > u64 kpti_ng_temp_pgd_pa = 0;
> > pgd_t *kpti_ng_temp_pgd;
> > - u64 alloc = 0;
> > + u64 alloc = *(u64 *)data;
> >
> > if (levels == 5 && !pgtable_l5_enabled())
> > levels = 4;
> > @@ -1391,8 +1390,6 @@ static int __init __kpti_install_ng_mappings(void *__unused)
> >
> > if (!cpu) {
> > int ret;
> > -
> > - alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> > kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
> > kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
> >
> > @@ -1423,16 +1420,17 @@ static int __init __kpti_install_ng_mappings(void *__unused)
> > remap_fn(cpu, num_online_cpus(), kpti_ng_temp_pgd_pa, KPTI_NG_TEMP_VA);
> > cpu_uninstall_idmap();
> >
> > - if (!cpu) {
> > - free_pages(alloc, order);
> > + if (!cpu)
> > arm64_use_ng_mappings = true;
> > - }
> >
> > return 0;
> > }
> >
> > void __init kpti_install_ng_mappings(void)
> > {
> > + int order = order_base_2(CONFIG_PGTABLE_LEVELS);
> > + u64 alloc;
> > +
>
> nit: Restore the blank line between the variable definitions and the logic.
Okay. I'll restore that line.
Thanks!
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2026-01-02 14:49 UTC | newest]
Thread overview: 11+ messages
2025-12-18 19:47 [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
2025-12-18 19:47 ` [PATCH v3 1/2] arm64: mmu: avoid allocating pages while splitting the linear mapping Yeoreum Yun
2026-01-02 11:03 ` Ryan Roberts
2026-01-02 11:09 ` Ryan Roberts
2026-01-02 12:21 ` Yeoreum Yun
2026-01-02 12:35 ` Ryan Roberts
2025-12-18 19:47 ` [PATCH v3 2/2] arm64: mmu: avoid allocating pages while installing ng-mapping for KPTI Yeoreum Yun
2026-01-02 11:10 ` Ryan Roberts
2026-01-02 14:48 ` Yeoreum Yun
2025-12-31 10:07 ` [PATCH v3 0/2] fix wrong usage of memory allocation APIs under PREEMPT_RT in arm64 Yeoreum Yun
2025-12-31 12:34 ` David Hildenbrand (Red Hat)