* [PATCH 0/2] introduce pagetable_alloc_nolock()
@ 2025-12-12 16:18 Yeoreum Yun
2025-12-12 16:18 ` [PATCH 1/2] mm: " Yeoreum Yun
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-12 16:18 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, ryan.roberts, kevin.brodsky, dev.jain,
yang
Cc: linux-mm, linux-kernel, bpf, linux-rt-devel, linux-arm-kernel,
Yeoreum Yun
Some architectures invoke pagetable_alloc() or __get_free_pages()
with preemption disabled.
For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
while spliting block entry to ptes and __kpti_install_ng_mappings()
calls __get_free_pages() to create kpti pagetable.
Under PREEMPT_RT, calling pagetable_alloc() with
preemption disabled is not allowed, because it may acquire
a spin lock that becomes sleepable on RT, potentially
causing a sleep during page allocation.
Since above two functions is called as callback of stop_machine()
where its callback is called in preemption disabled,
They could make a potential problem. (sleeping in preemption disabled).
To address this, introduce pagetable_alloc_nolock() API.
Yeoreum Yun (2):
mm: introduce pagetable_alloc_nolock()
arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
arch/arm64/mm/mmu.c | 23 ++++++++++++++++++-----
include/linux/mm.h | 18 ++++++++++++++++++
kernel/bpf/stream.c | 2 +-
kernel/bpf/syscall.c | 2 +-
mm/page_alloc.c | 10 +++-------
5 files changed, 41 insertions(+), 14 deletions(-)
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 1/2] mm: introduce pagetable_alloc_nolock()
2025-12-12 16:18 [PATCH 0/2] introduce pagetable_alloc_nolock() Yeoreum Yun
@ 2025-12-12 16:18 ` Yeoreum Yun
2025-12-12 16:18 ` [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine() Yeoreum Yun
2025-12-16 15:11 ` [PATCH 0/2] introduce pagetable_alloc_nolock() Ryan Roberts
2 siblings, 0 replies; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-12 16:18 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, ryan.roberts, kevin.brodsky, dev.jain,
yang
Cc: linux-mm, linux-kernel, bpf, linux-rt-devel, linux-arm-kernel,
Yeoreum Yun
Some architectures invoke pagetable_alloc() with preemption disabled
(e.g., arm64’s linear_map_split_to_ptes()).
Under PREEMPT_RT, calling pagetable_alloc() with
preemption disabled is not allowed, because it may acquire
a spin lock that becomes sleepable on RT, potentially
causing a sleep during page allocation.
To address this, introduce a pagetable_alloc_nolock() API and
permit two additional GFP flags for alloc_pages_nolock() — __GFP_HIGH and __GFP_ZERO.
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
include/linux/mm.h | 18 ++++++++++++++++++
kernel/bpf/stream.c | 2 +-
kernel/bpf/syscall.c | 2 +-
mm/page_alloc.c | 10 +++-------
4 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7c79b3369b82..11a27f60838b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2990,6 +2990,24 @@ static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int orde
}
#define pagetable_alloc(...) alloc_hooks(pagetable_alloc_noprof(__VA_ARGS__))
+/**
+ * pagetable_alloc_nolock - opportunistic reetentrant pagetables allocation
+ * from any context
+ * @gfp: GFP flags. Only __GFP_ZERO, __GFP_HIGH, __GFP_ACCOUNT allowed.
+ * @order: desired pagetable order
+ *
+ * opportunistic reetentrant version of pagetable_alloc().
+ *
+ * Return: The ptdesc describing the allocated page tables.
+ */
+static inline struct ptdesc *pagetable_alloc_nolock_noprof(gfp_t gfp, unsigned int order)
+{
+ struct page *page = alloc_pages_nolock_noprof(gfp, NUMA_NO_NODE, order);
+
+ return page_ptdesc(page);
+}
+#define pagetable_alloc_nolock(...) alloc_hooks(pagetable_alloc_nolock_noprof(__VA_ARGS__))
+
/**
* pagetable_free - Free pagetables
* @pt: The page table descriptor
diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c
index ff16c631951b..3c80c8007d91 100644
--- a/kernel/bpf/stream.c
+++ b/kernel/bpf/stream.c
@@ -83,7 +83,7 @@ static struct bpf_stream_page *bpf_stream_page_replace(void)
struct bpf_stream_page *stream_page, *old_stream_page;
struct page *page;
- page = alloc_pages_nolock(/* Don't account */ 0, NUMA_NO_NODE, 0);
+ page = alloc_pages_nolock(/* Don't account */ __GFP_ZERO, NUMA_NO_NODE, 0);
if (!page)
return NULL;
stream_page = page_address(page);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8a129746bd6c..cbc0f8d0c18b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -598,7 +598,7 @@ static bool can_alloc_pages(void)
static struct page *__bpf_alloc_page(int nid)
{
if (!can_alloc_pages())
- return alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);
+ return alloc_pages_nolock(__GFP_ZERO | __GFP_ACCOUNT, nid, 0);
return alloc_pages_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed82ee55e66a..88a920dc1e9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7542,21 +7542,17 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
* various contexts. We cannot use printk_deferred_enter() to mitigate,
* since the running context is unknown.
*
- * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
- * is safe in any context. Also zeroing the page is mandatory for
- * BPF use cases.
- *
* Though __GFP_NOMEMALLOC is not checked in the code path below,
* specify it here to highlight that alloc_pages_nolock()
* doesn't want to deplete reserves.
*/
- gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | __GFP_COMP
| gfp_flags;
unsigned int alloc_flags = ALLOC_TRYLOCK;
struct alloc_context ac = { };
struct page *page;
- VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
+ VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_HIGH | __GFP_ZERO | __GFP_ACCOUNT));
/*
* In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
* unsafe in NMI. If spin_trylock() is called from hard IRQ the current
@@ -7602,7 +7598,7 @@ struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned
}
/**
* alloc_pages_nolock - opportunistic reentrant allocation from any context
- * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
+ * @gfp_flags: GFP flags. Only __GFP_ZERO, __GFP_HIGH, __GFP_ACCOUNT allowed.
* @nid: node to allocate from
* @order: allocation order size
*
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-12 16:18 [PATCH 0/2] introduce pagetable_alloc_nolock() Yeoreum Yun
2025-12-12 16:18 ` [PATCH 1/2] mm: " Yeoreum Yun
@ 2025-12-12 16:18 ` Yeoreum Yun
2025-12-13 7:05 ` Brendan Jackman
2025-12-18 9:30 ` Michal Hocko
2025-12-16 15:11 ` [PATCH 0/2] introduce pagetable_alloc_nolock() Ryan Roberts
2 siblings, 2 replies; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-12 16:18 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, ryan.roberts, kevin.brodsky, dev.jain,
yang
Cc: linux-mm, linux-kernel, bpf, linux-rt-devel, linux-arm-kernel,
Yeoreum Yun
linear_map_split_to_ptes() and __kpti_install_ng_mappings()
are called as callback of stop_machine().
That means these functions context are preemption disabled.
Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
__get_free_pages() couldn't be called in this context
since spin lock that becomes sleepable on RT,
potentially causing a sleep during page allocation.
To address this, pagetable_alloc_nolock().
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
arch/arm64/mm/mmu.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 2ba01dc8ef82..0e98606d8c4c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -475,10 +475,15 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
enum pgtable_type pgtable_type)
{
- /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
- struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
+ struct ptdesc *ptdesc;
phys_addr_t pa;
+ /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
+ if (gfpflags_allow_spinning(gfp))
+ ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
+ else
+ ptdesc = pagetable_alloc_nolock(gfp & ~__GFP_ZERO, 0);
+
if (!ptdesc)
return INVALID_PHYS_ADDR;
@@ -869,6 +874,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
unsigned long kstart = (unsigned long)lm_alias(_stext);
unsigned long kend = (unsigned long)lm_alias(__init_begin);
int ret;
+ gfp_t gfp = IS_ENABLED(CONFIG_PREEMPT_RT) ? __GFP_HIGH : GFP_ATOMIC;
/*
* Wait for all secondary CPUs to be put into the waiting area.
@@ -881,9 +887,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
* PTE. The kernel alias remains static throughout runtime so
* can continue to be safely mapped with large mappings.
*/
- ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
+ ret = range_split_to_ptes(lstart, kstart, gfp);
if (!ret)
- ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
+ ret = range_split_to_ptes(kend, lend, gfp);
if (ret)
panic("Failed to split linear map\n");
flush_tlb_kernel_range(lstart, lend);
@@ -1207,7 +1213,14 @@ static int __init __kpti_install_ng_mappings(void *__unused)
remap_fn = (void *)__pa_symbol(idmap_kpti_install_ng_mappings);
if (!cpu) {
- alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ alloc = (u64) pagetable_alloc_nolock(__GFP_HIGH | __GFP_ZERO, order);
+ else
+ alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
+
+ if (!alloc)
+ panic("Failed to alloc kpti_ng_pgd\n");
+
kpti_ng_temp_pgd = (pgd_t *)(alloc + (levels - 1) * PAGE_SIZE);
kpti_ng_temp_alloc = kpti_ng_temp_pgd_pa = __pa(kpti_ng_temp_pgd);
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-12 16:18 ` [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine() Yeoreum Yun
@ 2025-12-13 7:05 ` Brendan Jackman
2025-12-14 9:13 ` Yeoreum Yun
2025-12-18 9:30 ` Michal Hocko
1 sibling, 1 reply; 35+ messages in thread
From: Brendan Jackman @ 2025-12-13 7:05 UTC (permalink / raw)
To: Yeoreum Yun
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Sat, 13 Dec 2025 at 01:18, Yeoreum Yun <yeoreum.yun@arm.com> wrote:
>
> linear_map_split_to_ptes() and __kpti_install_ng_mappings()
> are called as callback of stop_machine().
> That means these functions context are preemption disabled.
>
> Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
> __get_free_pages() couldn't be called in this context
> since spin lock that becomes sleepable on RT,
> potentially causing a sleep during page allocation.
>
> To address this, pagetable_alloc_nolock().
>
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
> arch/arm64/mm/mmu.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 2ba01dc8ef82..0e98606d8c4c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -475,10 +475,15 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> enum pgtable_type pgtable_type)
> {
> - /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> - struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> + struct ptdesc *ptdesc;
> phys_addr_t pa;
>
> + /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> + if (gfpflags_allow_spinning(gfp))
> + ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> + else
> + ptdesc = pagetable_alloc_nolock(gfp & ~__GFP_ZERO, 0);
> +
> if (!ptdesc)
> return INVALID_PHYS_ADDR;
>
> @@ -869,6 +874,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
> unsigned long kstart = (unsigned long)lm_alias(_stext);
> unsigned long kend = (unsigned long)lm_alias(__init_begin);
> int ret;
> + gfp_t gfp = IS_ENABLED(CONFIG_PREEMPT_RT) ? __GFP_HIGH : GFP_ATOMIC;
>
> /*
> * Wait for all secondary CPUs to be put into the waiting area.
> @@ -881,9 +887,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
> * PTE. The kernel alias remains static throughout runtime so
> * can continue to be safely mapped with large mappings.
> */
> - ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> + ret = range_split_to_ptes(lstart, kstart, gfp);
> if (!ret)
> - ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> + ret = range_split_to_ptes(kend, lend, gfp);
> if (ret)
> panic("Failed to split linear map\n");
> flush_tlb_kernel_range(lstart, lend);
> @@ -1207,7 +1213,14 @@ static int __init __kpti_install_ng_mappings(void *__unused)
> remap_fn = (void *)__pa_symbol(idmap_kpti_install_ng_mappings);
>
> if (!cpu) {
> - alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> + if (IS_ENABLED(CONFIG_PREEMPT_RT))
> + alloc = (u64) pagetable_alloc_nolock(__GFP_HIGH | __GFP_ZERO, order);
> + else
> + alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> +
> + if (!alloc)
> + panic("Failed to alloc kpti_ng_pgd\n");
> +
I don't have the context on what this code is doing so take this with
a grain of salt, but...
The point of the _nolock alloc is to give the allocator an excuse to
fail. Panicking on that failure doesn't seem like a great idea to me?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-13 7:05 ` Brendan Jackman
@ 2025-12-14 9:13 ` Yeoreum Yun
2025-12-15 9:22 ` Brendan Jackman
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-14 9:13 UTC (permalink / raw)
To: Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
Hi Brendan,
> On Sat, 13 Dec 2025 at 01:18, Yeoreum Yun <yeoreum.yun@arm.com> wrote:
> >
> > linear_map_split_to_ptes() and __kpti_install_ng_mappings()
> > are called as callback of stop_machine().
> > That means these functions context are preemption disabled.
> >
> > Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
> > __get_free_pages() couldn't be called in this context
> > since spin lock that becomes sleepable on RT,
> > potentially causing a sleep during page allocation.
> >
> > To address this, pagetable_alloc_nolock().
> >
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > ---
> > arch/arm64/mm/mmu.c | 23 ++++++++++++++++++-----
> > 1 file changed, 18 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 2ba01dc8ef82..0e98606d8c4c 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -475,10 +475,15 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
> > static phys_addr_t __pgd_pgtable_alloc(struct mm_struct *mm, gfp_t gfp,
> > enum pgtable_type pgtable_type)
> > {
> > - /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> > - struct ptdesc *ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> > + struct ptdesc *ptdesc;
> > phys_addr_t pa;
> >
> > + /* Page is zeroed by init_clear_pgtable() so don't duplicate effort. */
> > + if (gfpflags_allow_spinning(gfp))
> > + ptdesc = pagetable_alloc(gfp & ~__GFP_ZERO, 0);
> > + else
> > + ptdesc = pagetable_alloc_nolock(gfp & ~__GFP_ZERO, 0);
> > +
> > if (!ptdesc)
> > return INVALID_PHYS_ADDR;
> >
> > @@ -869,6 +874,7 @@ static int __init linear_map_split_to_ptes(void *__unused)
> > unsigned long kstart = (unsigned long)lm_alias(_stext);
> > unsigned long kend = (unsigned long)lm_alias(__init_begin);
> > int ret;
> > + gfp_t gfp = IS_ENABLED(CONFIG_PREEMPT_RT) ? __GFP_HIGH : GFP_ATOMIC;
> >
> > /*
> > * Wait for all secondary CPUs to be put into the waiting area.
> > @@ -881,9 +887,9 @@ static int __init linear_map_split_to_ptes(void *__unused)
> > * PTE. The kernel alias remains static throughout runtime so
> > * can continue to be safely mapped with large mappings.
> > */
> > - ret = range_split_to_ptes(lstart, kstart, GFP_ATOMIC);
> > + ret = range_split_to_ptes(lstart, kstart, gfp);
> > if (!ret)
> > - ret = range_split_to_ptes(kend, lend, GFP_ATOMIC);
> > + ret = range_split_to_ptes(kend, lend, gfp);
> > if (ret)
> > panic("Failed to split linear map\n");
> > flush_tlb_kernel_range(lstart, lend);
> > @@ -1207,7 +1213,14 @@ static int __init __kpti_install_ng_mappings(void *__unused)
> > remap_fn = (void *)__pa_symbol(idmap_kpti_install_ng_mappings);
> >
> > if (!cpu) {
> > - alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > + alloc = (u64) pagetable_alloc_nolock(__GFP_HIGH | __GFP_ZERO, order);
> > + else
> > + alloc = __get_free_pages(GFP_ATOMIC | __GFP_ZERO, order);
> > +
> > + if (!alloc)
> > + panic("Failed to alloc kpti_ng_pgd\n");
> > +
>
> I don't have the context on what this code is doing so take this with
> a grain of salt, but...
>
> The point of the _nolock alloc is to give the allocator an excuse to
> fail. Panicking on that failure doesn't seem like a great idea to me?
I thought first whether it changes to "static" memory area to handle
this in PREEMPT_RT.
But since this function is called while smp_cpus_done().
So, I think it's fine since there wouldn't be a contention for
memory allocation in this phase.
Thanks.
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-14 9:13 ` Yeoreum Yun
@ 2025-12-15 9:22 ` Brendan Jackman
2025-12-15 9:34 ` Yeoreum Yun
0 siblings, 1 reply; 35+ messages in thread
From: Brendan Jackman @ 2025-12-15 9:22 UTC (permalink / raw)
To: Yeoreum Yun, Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Sun Dec 14, 2025 at 9:13 AM UTC, Yeoreum Yun wrote:
>> I don't have the context on what this code is doing so take this with
>> a grain of salt, but...
>>
>> The point of the _nolock alloc is to give the allocator an excuse to
>> fail. Panicking on that failure doesn't seem like a great idea to me?
>
> I thought first whether it changes to "static" memory area to handle
> this in PREEMPT_RT.
> But since this function is called while smp_cpus_done().
> So, I think it's fine since there wouldn't be a contention for
> memory allocation in this phase.
Then shouldn't it use _nolock unconditionally?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-15 9:22 ` Brendan Jackman
@ 2025-12-15 9:34 ` Yeoreum Yun
2025-12-15 9:55 ` Brendan Jackman
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-15 9:34 UTC (permalink / raw)
To: Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
Hi Brendan,
> On Sun Dec 14, 2025 at 9:13 AM UTC, Yeoreum Yun wrote:
> >> I don't have the context on what this code is doing so take this with
> >> a grain of salt, but...
> >>
> >> The point of the _nolock alloc is to give the allocator an excuse to
> >> fail. Panicking on that failure doesn't seem like a great idea to me?
> >
> > I thought first whether it changes to "static" memory area to handle
> > this in PREEMPT_RT.
> > But since this function is called while smp_cpus_done().
> > So, I think it's fine since there wouldn't be a contention for
> > memory allocation in this phase.
>
> Then shouldn't it use _nolock unconditionally?
As you pointed out, I think it should be fine even in the !PREEMPT_RT case.
However, in case I missed something or if my understanding is incorrect,
I applied it only to the PREEMPT_RT case for now.
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-15 9:34 ` Yeoreum Yun
@ 2025-12-15 9:55 ` Brendan Jackman
2025-12-15 10:06 ` Yeoreum Yun
0 siblings, 1 reply; 35+ messages in thread
From: Brendan Jackman @ 2025-12-15 9:55 UTC (permalink / raw)
To: Yeoreum Yun, Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Mon Dec 15, 2025 at 9:34 AM UTC, Yeoreum Yun wrote:
> Hi Brendan,
>> On Sun Dec 14, 2025 at 9:13 AM UTC, Yeoreum Yun wrote:
>> >> I don't have the context on what this code is doing so take this with
>> >> a grain of salt, but...
>> >>
>> >> The point of the _nolock alloc is to give the allocator an excuse to
>> >> fail. Panicking on that failure doesn't seem like a great idea to me?
>> >
>> > I thought first whether it changes to "static" memory area to handle
>> > this in PREEMPT_RT.
>> > But since this function is called while smp_cpus_done().
>> > So, I think it's fine since there wouldn't be a contention for
>> > memory allocation in this phase.
>>
>> Then shouldn't it use _nolock unconditionally?
>
> As you pointed out, I think it should be fine even in the !PREEMPT_RT case.
> However, in case I missed something or if my understanding is incorrect,
> I applied it only to the PREEMPT_RT case for now.
Hmm, I don't think "this code might be broken so let's cage it behind a
conditional" is a good strategy.
1. It bloats the codebase.
2. It's confusing to readers, now you have to try an understand why this
conditional is here, which is a doomed effort. This could be
mitigated with comments but, see point 1.
3. It expands the testing matrix. So now we have code that we aren't
really sure is correct, AND it gets less test coverage.
Overall I am feeling a bit uncomfortable about this use of _nolock, but
I am also feeling pretty ignorant about PREEMPT_RT and also about this
arm64 code, so I am hesitant to suggest alternatives, I hope someone
else can offer some input here...
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-15 9:55 ` Brendan Jackman
@ 2025-12-15 10:06 ` Yeoreum Yun
2025-12-16 10:10 ` Brendan Jackman
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-15 10:06 UTC (permalink / raw)
To: Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
> On Mon Dec 15, 2025 at 9:34 AM UTC, Yeoreum Yun wrote:
> > Hi Brendan,
> >> On Sun Dec 14, 2025 at 9:13 AM UTC, Yeoreum Yun wrote:
> >> >> I don't have the context on what this code is doing so take this with
> >> >> a grain of salt, but...
> >> >>
> >> >> The point of the _nolock alloc is to give the allocator an excuse to
> >> >> fail. Panicking on that failure doesn't seem like a great idea to me?
> >> >
> >> > I thought first whether it changes to "static" memory area to handle
> >> > this in PREEMPT_RT.
> >> > But since this function is called while smp_cpus_done().
> >> > So, I think it's fine since there wouldn't be a contention for
> >> > memory allocation in this phase.
> >>
> >> Then shouldn't it use _nolock unconditionally?
> >
> > As you pointed out, I think it should be fine even in the !PREEMPT_RT case.
> > However, in case I missed something or if my understanding is incorrect,
> > I applied it only to the PREEMPT_RT case for now.
>
> Hmm, I don't think "this code might be broken so let's cage it behind a
> conditional" is a good strategy.
>
> 1. It bloats the codebase.
>
> 2. It's confusing to readers, now you have to try an understand why this
> conditional is here, which is a doomed effort. This could be
> mitigated with comments but, see point 1.
>
> 3. It expands the testing matrix. So now we have code that we aren't
> really sure is correct, AND it gets less test coverage.
>
> Overall I am feeling a bit uncomfortable about this use of _nolock, but
> I am also feeling pretty ignorant about PREEMPT_RT and also about this
> arm64 code, so I am hesitant to suggest alternatives, I hope someone
> else can offer some input here...
I understand. However, as I mentioned earlier,
my main intention was to hear opinions specifically about memory contention.
That said, if there is no memory contention,
I don’t think using the _nolock API is necessarily a bad approach.
In fact, I believe a bigger issue is that, under PREEMPT_RT,
code that uses the regular memory allocation APIs may give users the false impression
that those APIs are “safe to use,” even though they are not.
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-15 10:06 ` Yeoreum Yun
@ 2025-12-16 10:10 ` Brendan Jackman
2025-12-16 11:03 ` Yeoreum Yun
0 siblings, 1 reply; 35+ messages in thread
From: Brendan Jackman @ 2025-12-16 10:10 UTC (permalink / raw)
To: Yeoreum Yun, Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Mon Dec 15, 2025 at 10:06 AM UTC, Yeoreum Yun wrote:
[snip]
>> Overall I am feeling a bit uncomfortable about this use of _nolock, but
>> I am also feeling pretty ignorant about PREEMPT_RT and also about this
>> arm64 code, so I am hesitant to suggest alternatives, I hope someone
>> else can offer some input here...
>
> I understand. However, as I mentioned earlier,
> my main intention was to hear opinions specifically about memory contention.
>
> That said, if there is no memory contention,
> I don’t think using the _nolock API is necessarily a bad approach.
> In fact, I believe a bigger issue is that, under PREEMPT_RT,
> code that uses the regular memory allocation APIs may give users the false impression
> that those APIs are “safe to use,” even though they are not.
Yeah, I share this concern. I would bet I have written code that's
broken under PREEMPT_RT (luckily only in Google's kernel fork). The
comment for GFP_ATOMIC says:
* %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
* watermark is applied to allow access to "atomic reserves".
* The current implementation doesn't support NMI and few other strict
* non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
It kinda sounds like it's supposed to be OK to use GFP_ATOMIC in a
normal preempt_disable() context. So do you know exactly why it's
invalid to use it in this stop_machine() context here? Maybe we need to
update this comment. Or, maybe actually we need to fix the allocator so
that GFP_ATOMIC allocs are safe in this context?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-16 10:10 ` Brendan Jackman
@ 2025-12-16 11:03 ` Yeoreum Yun
2025-12-16 11:26 ` Brendan Jackman
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-16 11:03 UTC (permalink / raw)
To: Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
Hi Brendan,
> On Mon Dec 15, 2025 at 10:06 AM UTC, Yeoreum Yun wrote:
> [snip]
> >> Overall I am feeling a bit uncomfortable about this use of _nolock, but
> >> I am also feeling pretty ignorant about PREEMPT_RT and also about this
> >> arm64 code, so I am hesitant to suggest alternatives, I hope someone
> >> else can offer some input here...
> >
> > I understand. However, as I mentioned earlier,
> > my main intention was to hear opinions specifically about memory contention.
> >
> > That said, if there is no memory contention,
> > I don’t think using the _nolock API is necessarily a bad approach.
>
>
> > In fact, I believe a bigger issue is that, under PREEMPT_RT,
> > code that uses the regular memory allocation APIs may give users the false impression
> > that those APIs are “safe to use,” even though they are not.
>
> Yeah, I share this concern. I would bet I have written code that's
> broken under PREEMPT_RT (luckily only in Google's kernel fork). The
> comment for GFP_ATOMIC says:
>
> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> * watermark is applied to allow access to "atomic reserves".
> * The current implementation doesn't support NMI and few other strict
> * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
>
> It kinda sounds like it's supposed to be OK to use GFP_ATOMIC in a
> normal preempt_disable() context. So do you know exactly why it's
> invalid to use it in this stop_machine() context here? Maybe we need to
> update this comment.
In non-PREEMPT_RT configurations, this is fine to use.
However, in PREEMPT_RT, it should not be used because
spin_lock becomes a sleepable lock backed by an rt-mutex.
From Documentation/locking/locktypes.rst:
The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping.
As you know, all locks related to memory allocation
(e.g., zone_lock, PCP locks, etc.) use spin_lock,
which becomes sleepable under PREEMPT_RT.
The callback of stop_machine() is executed in a preemption-disabled context
(see cpu_stopper_thread()). In this context, if it fails to acquire a spinlock
during memory allocation,
the task would be able to go to sleep while preemption is disabled,
which is an obviously problematic situation.
> Or maybe actually we need to fix the allocator
> so that GFP_ATOMIC allocs are safe in this context?
I don’t think so, because GFP_ATOMIC can still be used by
IRQ threads in PREEMPT_RT. (If an IRQ handler is non-threaded
due to IRQF_NO_THREAD, then it cannot use it.)
Although the root cause appears to be that the memory allocator uses spin_lock(),
I believe the real issue is where such allocations are allowed to be used.
If we changed the lock to raw_spin_lock(),
this would introduce significant latency during memory allocation.
So I believe this is why the memory allocator continues to use spin_lock().
IOW, what I really want to ask is whether
general memory allocation/free operations are
permissible in a stop_machine() context
(I think nolock() can be allowable only).
Thanks
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-16 11:03 ` Yeoreum Yun
@ 2025-12-16 11:26 ` Brendan Jackman
2025-12-16 12:01 ` Yeoreum Yun
0 siblings, 1 reply; 35+ messages in thread
From: Brendan Jackman @ 2025-12-16 11:26 UTC (permalink / raw)
To: Yeoreum Yun, Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Tue Dec 16, 2025 at 11:03 AM UTC, Yeoreum Yun wrote:
> Hi Brendan,
>
>> On Mon Dec 15, 2025 at 10:06 AM UTC, Yeoreum Yun wrote:
>> [snip]
>> >> Overall I am feeling a bit uncomfortable about this use of _nolock, but
>> >> I am also feeling pretty ignorant about PREEMPT_RT and also about this
>> >> arm64 code, so I am hesitant to suggest alternatives, I hope someone
>> >> else can offer some input here...
>> >
>> > I understand. However, as I mentioned earlier,
>> > my main intention was to hear opinions specifically about memory contention.
>> >
>> > That said, if there is no memory contention,
>> > I don’t think using the _nolock API is necessarily a bad approach.
>>
>>
>> > In fact, I believe a bigger issue is that, under PREEMPT_RT,
>> > code that uses the regular memory allocation APIs may give users the false impression
>> > that those APIs are “safe to use,” even though they are not.
>>
>> Yeah, I share this concern. I would bet I have written code that's
>> broken under PREEMPT_RT (luckily only in Google's kernel fork). The
>> comment for GFP_ATOMIC says:
>>
>> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
>> * watermark is applied to allow access to "atomic reserves".
>> * The current implementation doesn't support NMI and few other strict
>> * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
>>
>> It kinda sounds like it's supposed to be OK to use GFP_ATOMIC in a
>> normal preempt_disable() context. So do you know exactly why it's
>> invalid to use it in this stop_machine() context here? Maybe we need to
>> update this comment.
>
> In non-PREEMPT_RT configurations, this is fine to use.
> However, in PREEMPT_RT, it should not be used because
> spin_lock becomes a sleepable lock backed by an rt-mutex.
>
> From Documentation/locking/locktypes.rst:
>
> The fact that PREEMPT_RT changes the lock category of spinlock_t and
> rwlock_t from spinning to sleeping.
>
> As you know, all locks related to memory allocation
> (e.g., zone_lock, PCP locks, etc.) use spin_lock,
> which becomes sleepable under PREEMPT_RT.
>
> The callback of stop_machine() is executed in a preemption-disabled context
> (see cpu_stopper_thread()). In this context, if it fails to acquire a spinlock
> during memory allocation,
> the task would be able to go to sleep while preemption is disabled,
> which is an obviously problematic situation.
But this is what I mean, doesn't this sound like the GFP_ATOMIC comment
I quoted is wrong (or at least, it implies things which are wrong)? The
comment refers specifically to raw_spin_lock() and "strict
non-preemptive contexts". Which sounds like it is being written with
PREEMPT_RT in mind. But that doesn't really match what you've said.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-16 11:26 ` Brendan Jackman
@ 2025-12-16 12:01 ` Yeoreum Yun
2025-12-16 12:39 ` Brendan Jackman
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-16 12:01 UTC (permalink / raw)
To: Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
> On Tue Dec 16, 2025 at 11:03 AM UTC, Yeoreum Yun wrote:
> > Hi Brendan,
> >
> >> On Mon Dec 15, 2025 at 10:06 AM UTC, Yeoreum Yun wrote:
> >> [snip]
> >> >> Overall I am feeling a bit uncomfortable about this use of _nolock, but
> >> >> I am also feeling pretty ignorant about PREEMPT_RT and also about this
> >> >> arm64 code, so I am hesitant to suggest alternatives, I hope someone
> >> >> else can offer some input here...
> >> >
> >> > I understand. However, as I mentioned earlier,
> >> > my main intention was to hear opinions specifically about memory contention.
> >> >
> >> > That said, if there is no memory contention,
> >> > I don’t think using the _nolock API is necessarily a bad approach.
> >>
> >>
> >> > In fact, I believe a bigger issue is that, under PREEMPT_RT,
> >> > code that uses the regular memory allocation APIs may give users the false impression
> >> > that those APIs are “safe to use,” even though they are not.
> >>
> >> Yeah, I share this concern. I would bet I have written code that's
> >> broken under PREEMPT_RT (luckily only in Google's kernel fork). The
> >> comment for GFP_ATOMIC says:
> >>
> >> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> >> * watermark is applied to allow access to "atomic reserves".
> >> * The current implementation doesn't support NMI and few other strict
> >> * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
> >>
> >> It kinda sounds like it's supposed to be OK to use GFP_ATOMIC in a
> >> normal preempt_disable() context. So do you know exactly why it's
> >> invalid to use it in this stop_machine() context here? Maybe we need to
> >> update this comment.
> >
> > In non-PREEMPT_RT configurations, this is fine to use.
> > However, in PREEMPT_RT, it should not be used because
> > spin_lock becomes a sleepable lock backed by an rt-mutex.
> >
> > From Documentation/locking/locktypes.rst:
> >
> > The fact that PREEMPT_RT changes the lock category of spinlock_t and
> > rwlock_t from spinning to sleeping.
> >
> > As you know, all locks related to memory allocation
> > (e.g., zone_lock, PCP locks, etc.) use spin_lock,
> > which becomes sleepable under PREEMPT_RT.
> >
> > The callback of stop_machine() is executed in a preemption-disabled context
> > (see cpu_stopper_thread()). In this context, if it fails to acquire a spinlock
> > during memory allocation,
> > the task would be able to go to sleep while preemption is disabled,
> > which is an obviously problematic situation.
>
> But this is what I mean, doesn't this sound like the GFP_ATOMIC comment
> I quoted is wrong (or at least, it implies things which are wrong)? The
> comment refers specifically to raw_spin_lock() and "strict
> non-preemptive contexts". Which sounds like it is being written with
> PREEMPT_RT in mind. But that doesn't really match what you've said.
No. I think the comment of GFP_ATOMIC is right.
It definitely said:
The current implementation *doesn't support* NMI and few other strict
*non-preemptive contexts (e.g. raw_spin_lock)*.
The reason It couldn't be support GFP_ATOMIC in raw_spin_lock() context
in PREEMPT_RT since critical section protected by raw_spin_lock()
is non-preemptive (preemption disabled).
This is the same reason "GFP_ATOMIC" cannot be used in the
stop_machine().
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-16 12:01 ` Yeoreum Yun
@ 2025-12-16 12:39 ` Brendan Jackman
2025-12-16 13:25 ` Yeoreum Yun
0 siblings, 1 reply; 35+ messages in thread
From: Brendan Jackman @ 2025-12-16 12:39 UTC (permalink / raw)
To: Yeoreum Yun, Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Tue Dec 16, 2025 at 12:01 PM UTC, Yeoreum Yun wrote:
>> On Tue Dec 16, 2025 at 11:03 AM UTC, Yeoreum Yun wrote:
>> > Hi Brendan,
>> >
>> >> On Mon Dec 15, 2025 at 10:06 AM UTC, Yeoreum Yun wrote:
>> >> [snip]
>> >> >> Overall I am feeling a bit uncomfortable about this use of _nolock, but
>> >> >> I am also feeling pretty ignorant about PREEMPT_RT and also about this
>> >> >> arm64 code, so I am hesitant to suggest alternatives, I hope someone
>> >> >> else can offer some input here...
>> >> >
>> >> > I understand. However, as I mentioned earlier,
>> >> > my main intention was to hear opinions specifically about memory contention.
>> >> >
>> >> > That said, if there is no memory contention,
>> >> > I don’t think using the _nolock API is necessarily a bad approach.
>> >>
>> >>
>> >> > In fact, I believe a bigger issue is that, under PREEMPT_RT,
>> >> > code that uses the regular memory allocation APIs may give users the false impression
>> >> > that those APIs are “safe to use,” even though they are not.
>> >>
>> >> Yeah, I share this concern. I would bet I have written code that's
>> >> broken under PREEMPT_RT (luckily only in Google's kernel fork). The
>> >> comment for GFP_ATOMIC says:
>> >>
>> >> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
>> >> * watermark is applied to allow access to "atomic reserves".
>> >> * The current implementation doesn't support NMI and few other strict
>> >> * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
>> >>
>> >> It kinda sounds like it's supposed to be OK to use GFP_ATOMIC in a
>> >> normal preempt_disable() context. So do you know exactly why it's
>> >> invalid to use it in this stop_machine() context here? Maybe we need to
>> >> update this comment.
>> >
>> > In non-PREEMPT_RT configurations, this is fine to use.
>> > However, in PREEMPT_RT, it should not be used because
>> > spin_lock becomes a sleepable lock backed by an rt-mutex.
>> >
>> > From Documentation/locking/locktypes.rst:
>> >
>> > The fact that PREEMPT_RT changes the lock category of spinlock_t and
>> > rwlock_t from spinning to sleeping.
>> >
>> > As you know, all locks related to memory allocation
>> > (e.g., zone_lock, PCP locks, etc.) use spin_lock,
>> > which becomes sleepable under PREEMPT_RT.
>> >
>> > The callback of stop_machine() is executed in a preemption-disabled context
>> > (see cpu_stopper_thread()). In this context, if it fails to acquire a spinlock
>> > during memory allocation,
>> > the task would be able to go to sleep while preemption is disabled,
>> > which is an obviously problematic situation.
>>
>> But this is what I mean, doesn't this sound like the GFP_ATOMIC comment
>> I quoted is wrong (or at least, it implies things which are wrong)? The
>> comment refers specifically to raw_spin_lock() and "strict
>> non-preemptive contexts". Which sounds like it is being written with
>> PREEMPT_RT in mind. But that doesn't really match what you've said.
>
> No. I think the comment of GFP_ATOMIC is right.
> It definitely said:
> The current implementation *doesn't support* NMI and few other strict
> *non-preemptive contexts (e.g. raw_spin_lock)*.
But this phrasing sounds like there are other non-preemptive contexts
that it _does_ support. I would definitely read this as implying that
plain old preempt_disable() is OK. I don't understand what those "few
other strict contexts" are, nor why the stop_machine() context is
included in them.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-16 12:39 ` Brendan Jackman
@ 2025-12-16 13:25 ` Yeoreum Yun
0 siblings, 0 replies; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-16 13:25 UTC (permalink / raw)
To: Brendan Jackman
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
hannes, ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
> On Tue Dec 16, 2025 at 12:01 PM UTC, Yeoreum Yun wrote:
> >> On Tue Dec 16, 2025 at 11:03 AM UTC, Yeoreum Yun wrote:
> >> > Hi Brendan,
> >> >
> >> >> On Mon Dec 15, 2025 at 10:06 AM UTC, Yeoreum Yun wrote:
> >> >> [snip]
> >> >> >> Overall I am feeling a bit uncomfortable about this use of _nolock, but
> >> >> >> I am also feeling pretty ignorant about PREEMPT_RT and also about this
> >> >> >> arm64 code, so I am hesitant to suggest alternatives, I hope someone
> >> >> >> else can offer some input here...
> >> >> >
> >> >> > I understand. However, as I mentioned earlier,
> >> >> > my main intention was to hear opinions specifically about memory contention.
> >> >> >
> >> >> > That said, if there is no memory contention,
> >> >> > I don’t think using the _nolock API is necessarily a bad approach.
> >> >>
> >> >>
> >> >> > In fact, I believe a bigger issue is that, under PREEMPT_RT,
> >> >> > code that uses the regular memory allocation APIs may give users the false impression
> >> >> > that those APIs are “safe to use,” even though they are not.
> >> >>
> >> >> Yeah, I share this concern. I would bet I have written code that's
> >> >> broken under PREEMPT_RT (luckily only in Google's kernel fork). The
> >> >> comment for GFP_ATOMIC says:
> >> >>
> >> >> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> >> >> * watermark is applied to allow access to "atomic reserves".
> >> >> * The current implementation doesn't support NMI and few other strict
> >> >> * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
> >> >>
> >> >> It kinda sounds like it's supposed to be OK to use GFP_ATOMIC in a
> >> >> normal preempt_disable() context. So do you know exactly why it's
> >> >> invalid to use it in this stop_machine() context here? Maybe we need to
> >> >> update this comment.
> >> >
> >> > In non-PREEMPT_RT configurations, this is fine to use.
> >> > However, in PREEMPT_RT, it should not be used because
> >> > spin_lock becomes a sleepable lock backed by an rt-mutex.
> >> >
> >> > From Documentation/locking/locktypes.rst:
> >> >
> >> > The fact that PREEMPT_RT changes the lock category of spinlock_t and
> >> > rwlock_t from spinning to sleeping.
> >> >
> >> > As you know, all locks related to memory allocation
> >> > (e.g., zone_lock, PCP locks, etc.) use spin_lock,
> >> > which becomes sleepable under PREEMPT_RT.
> >> >
> >> > The callback of stop_machine() is executed in a preemption-disabled context
> >> > (see cpu_stopper_thread()). In this context, if it fails to acquire a spinlock
> >> > during memory allocation,
> >> > the task would be able to go to sleep while preemption is disabled,
> >> > which is an obviously problematic situation.
> >>
> >> But this is what I mean, doesn't this sound like the GFP_ATOMIC comment
> >> I quoted is wrong (or at least, it implies things which are wrong)? The
> >> comment refers specifically to raw_spin_lock() and "strict
> >> non-preemptive contexts". Which sounds like it is being written with
> >> PREEMPT_RT in mind. But that doesn't really match what you've said.
> >
> > No. I think the comment of GFP_ATOMIC is right.
> > It definitely said:
> > The current implementation *doesn't support* NMI and few other strict
> > *non-preemptive contexts (e.g. raw_spin_lock)*.
>
> But this phrasing sounds like there are other non-preemptive contexts
> that it _does_ support. I would definitely read this as implying that
> plain old preempt_disable() is OK. I don't understand what those "few
> other strict contexts" are, nor why the stop_machine() context is
> included in them.
I think this phrasing seems to consider non-preeptive case for
the priority or schedule policy but still make me confused too.
But What I worth to say the stop_machine() -- exactly the callback
context (stopper thread context) by stop_machine() is
the same for raw_spin_lock() case where
explictly disable preemption by calling preempt_disable().
The reason why raw_spin_lock() context couldn't call the GFP_ATOMIC
since it explicitly disable preemption by calling preempt_disable().
stop_machine() callback context -- stopper thread's context is also the same.
when it calls the callback by stopper (see cpu_stopper_thread()):
...
preempt_count_inc();
ret = fn(arg);
...
preempt_count_dec();
...
preemption is explicitly disabled like raw_spin_lock()
So it seems to include in "few strict non-preemptive context".
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-12 16:18 [PATCH 0/2] introduce pagetable_alloc_nolock() Yeoreum Yun
2025-12-12 16:18 ` [PATCH 1/2] mm: " Yeoreum Yun
2025-12-12 16:18 ` [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine() Yeoreum Yun
@ 2025-12-16 15:11 ` Ryan Roberts
2025-12-16 16:52 ` Yeoreum Yun
2 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-16 15:11 UTC (permalink / raw)
To: Yeoreum Yun, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, ast, daniel, andrii, martin.lau, eddyz87,
song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang
Cc: linux-mm, linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On 12/12/2025 16:18, Yeoreum Yun wrote:
> Some architectures invoke pagetable_alloc() or __get_free_pages()
> with preemption disabled.
> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
> while spliting block entry to ptes and __kpti_install_ng_mappings()
> calls __get_free_pages() to create kpti pagetable.
>
> Under PREEMPT_RT, calling pagetable_alloc() with
> preemption disabled is not allowed, because it may acquire
> a spin lock that becomes sleepable on RT, potentially
> causing a sleep during page allocation.
>
> Since above two functions is called as callback of stop_machine()
> where its callback is called in preemption disabled,
> They could make a potential problem. (sleeping in preemption disabled).
>
> To address this, introduce pagetable_alloc_nolock() API.
I don't really understand what the problem is that you're trying to fix. As I
see it, there are 2 call sites in arm64 arch code that are calling into the page
allocator from stop_machine() - one via via pagetable_alloc() and another via
__get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
understanding that the page allocator would ensure it never sleeps when
GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
What is the actual symptom you are seeing?
If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
then isn't that a bug in the page allocator? I'm not sure why you would change
the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
Thanks,
Ryan
>
> Yeoreum Yun (2):
> mm: introduce pagetable_alloc_nolock()
> arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
>
> arch/arm64/mm/mmu.c | 23 ++++++++++++++++++-----
> include/linux/mm.h | 18 ++++++++++++++++++
> kernel/bpf/stream.c | 2 +-
> kernel/bpf/syscall.c | 2 +-
> mm/page_alloc.c | 10 +++-------
> 5 files changed, 41 insertions(+), 14 deletions(-)
>
> --
> LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-16 15:11 ` [PATCH 0/2] introduce pagetable_alloc_nolock() Ryan Roberts
@ 2025-12-16 16:52 ` Yeoreum Yun
2025-12-17 9:34 ` Ryan Roberts
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-16 16:52 UTC (permalink / raw)
To: Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
Hi Ryan,
> On 12/12/2025 16:18, Yeoreum Yun wrote:
> > Some architectures invoke pagetable_alloc() or __get_free_pages()
> > with preemption disabled.
> > For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
> > while spliting block entry to ptes and __kpti_install_ng_mappings()
> > calls __get_free_pages() to create kpti pagetable.
> >
> > Under PREEMPT_RT, calling pagetable_alloc() with
> > preemption disabled is not allowed, because it may acquire
> > a spin lock that becomes sleepable on RT, potentially
> > causing a sleep during page allocation.
> >
> > Since above two functions is called as callback of stop_machine()
> > where its callback is called in preemption disabled,
> > They could make a potential problem. (sleeping in preemption disabled).
> >
> > To address this, introduce pagetable_alloc_nolock() API.
>
> I don't really understand what the problem is that you're trying to fix. As I
> see it, there are 2 call sites in arm64 arch code that are calling into the page
> allocator from stop_machine() - one via via pagetable_alloc() and another via
> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
> understanding that the page allocator would ensure it never sleeps when
> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
Although GFP_ATOMIC is specify, it only affects of "water mark" of the
page with __GFP_HIGH. and to get a page, it must grab the lock --
zone->lock or pcp_lock in the rmqueue().
This zone->lock and pcp_lock is spin_lock and it's a sleepable in
PREEMPT_RT that's why the memory allocation/free using general API
except nolock() version couldn't be called since
if "contention" happens they'll sleep while waiting to get the lock.
The reason why "nolock()" can use, it always uses "trylock" with
ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
PREEMPT_RT.
>
> What is the actual symptom you are seeing?
Since the place where called while smp_cpus_done() and there seems no
contention, there seems no problem. However as I mention in another
thread
(https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
This gives a the false impression --
GFP_ATOMIC are “safe to use in preemption disabled”
even though they are not in PREEMPT_RT case, I've changed it.
>
> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
> then isn't that a bug in the page allocator? I'm not sure why you would change
> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
It doesn't ignore the GFP_ATOMIC feature:
- __GFP_HIGH: use water mark till min reserved
- __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
But, it's a restriction -- "page allocation / free" API cannot be called
in preempt-disabled context at PREEMPT_RT.
That's why I think it's wrong usage not a page allocator bug.
[...]
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-16 16:52 ` Yeoreum Yun
@ 2025-12-17 9:34 ` Ryan Roberts
2025-12-17 10:48 ` Yeoreum Yun
0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-17 9:34 UTC (permalink / raw)
To: Yeoreum Yun
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On 16/12/2025 16:52, Yeoreum Yun wrote:
> Hi Ryan,
>
>> On 12/12/2025 16:18, Yeoreum Yun wrote:
>>> Some architectures invoke pagetable_alloc() or __get_free_pages()
>>> with preemption disabled.
>>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
>>> while spliting block entry to ptes and __kpti_install_ng_mappings()
>>> calls __get_free_pages() to create kpti pagetable.
>>>
>>> Under PREEMPT_RT, calling pagetable_alloc() with
>>> preemption disabled is not allowed, because it may acquire
>>> a spin lock that becomes sleepable on RT, potentially
>>> causing a sleep during page allocation.
>>>
>>> Since above two functions is called as callback of stop_machine()
>>> where its callback is called in preemption disabled,
>>> They could make a potential problem. (sleeping in preemption disabled).
>>>
>>> To address this, introduce pagetable_alloc_nolock() API.
>>
>> I don't really understand what the problem is that you're trying to fix. As I
>> see it, there are 2 call sites in arm64 arch code that are calling into the page
>> allocator from stop_machine() - one via via pagetable_alloc() and another via
>> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
>> understanding that the page allocator would ensure it never sleeps when
>> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
>
> Although GFP_ATOMIC is specify, it only affects of "water mark" of the
> page with __GFP_HIGH. and to get a page, it must grab the lock --
> zone->lock or pcp_lock in the rmqueue().
>
> This zone->lock and pcp_lock is spin_lock and it's a sleepable in
> PREEMPT_RT that's why the memory allocation/free using general API
> except nolock() version couldn't be called since
> if "contention" happens they'll sleep while waiting to get the lock.
>
> The reason why "nolock()" can use, it always uses "trylock" with
> ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
> PREEMPT_RT.
>
>>
>> What is the actual symptom you are seeing?
>
> Since the place where called while smp_cpus_done() and there seems no
> contention, there seems no problem. However as I mention in another
> thread
> (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
> This gives a the false impression --
> GFP_ATOMIC are “safe to use in preemption disabled”
> even though they are not in PREEMPT_RT case, I've changed it.
>
>>
>> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
>> then isn't that a bug in the page allocator? I'm not sure why you would change
>> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
>
> It doesn't ignore the GFP_ATOMIC feature:
> - __GFP_HIGH: use water mark till min reserved
> - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
>
> But, it's a restriction -- "page allocation / free" API cannot be called
> in preempt-disabled context at PREEMPT_RT.
>
> That's why I think it's wrong usage not a page allocator bug.
I've taken a look at this and I agree with your analysis. Thanks for explaining.
Looking at other stop_machine() callbacks, there are some that call printk() and
I would assume that spinlocks could be taken there which may present the same
kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
to allocate memory though.
Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
- Call the nolock variant (as you are doing). But that would just convert a
deadlock to a panic; if the lock is held when stop_machine() runs, without your
change, we now have a deadlock due to waiting on the lock inside stop_machine().
With your change, we notice the lock is already taken and panic. I guess it is
marginally better, but not by much. Certainly I would just _always_ call the
nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
the lock is guarranteed to be free so nolock will always succeed.
- Preallocate the memory before entering stop_machine(). I think this would be
much more robust. For kpti_install_ng_mappings() I think you could hoist the
allocation/free out of stop_machine() and pass the pointer in pretty easily. For
linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
pgtable to figure out how much to preallocate, allocate it, then set it up as a
special allocator, wrapped by an allocation function and modify the callchain to
take a callback function instead of gfp flags.
What do you think?
Thanks,
Ryan
>
> [...]
>
> --
> Sincerely,
> Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 9:34 ` Ryan Roberts
@ 2025-12-17 10:48 ` Yeoreum Yun
2025-12-17 12:04 ` Ryan Roberts
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-17 10:48 UTC (permalink / raw)
To: Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
Hi Ryan,
> On 16/12/2025 16:52, Yeoreum Yun wrote:
> > Hi Ryan,
> >
> >> On 12/12/2025 16:18, Yeoreum Yun wrote:
> >>> Some architectures invoke pagetable_alloc() or __get_free_pages()
> >>> with preemption disabled.
> >>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
> >>> while spliting block entry to ptes and __kpti_install_ng_mappings()
> >>> calls __get_free_pages() to create kpti pagetable.
> >>>
> >>> Under PREEMPT_RT, calling pagetable_alloc() with
> >>> preemption disabled is not allowed, because it may acquire
> >>> a spin lock that becomes sleepable on RT, potentially
> >>> causing a sleep during page allocation.
> >>>
> >>> Since above two functions is called as callback of stop_machine()
> >>> where its callback is called in preemption disabled,
> >>> They could make a potential problem. (sleeping in preemption disabled).
> >>>
> >>> To address this, introduce pagetable_alloc_nolock() API.
> >>
> >> I don't really understand what the problem is that you're trying to fix. As I
> >> see it, there are 2 call sites in arm64 arch code that are calling into the page
> >> allocator from stop_machine() - one via via pagetable_alloc() and another via
> >> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
> >> understanding that the page allocator would ensure it never sleeps when
> >> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
> >
> > Although GFP_ATOMIC is specify, it only affects of "water mark" of the
> > page with __GFP_HIGH. and to get a page, it must grab the lock --
> > zone->lock or pcp_lock in the rmqueue().
> >
> > This zone->lock and pcp_lock is spin_lock and it's a sleepable in
> > PREEMPT_RT that's why the memory allocation/free using general API
> > except nolock() version couldn't be called since
> > if "contention" happens they'll sleep while waiting to get the lock.
> >
> > The reason why "nolock()" can use, it always uses "trylock" with
> > ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
> > PREEMPT_RT.
> >
> >>
> >> What is the actual symptom you are seeing?
> >
> > Since the place where called while smp_cpus_done() and there seems no
> > contention, there seems no problem. However as I mention in another
> > thread
> > (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
> > This gives a the false impression --
> > GFP_ATOMIC are “safe to use in preemption disabled”
> > even though they are not in PREEMPT_RT case, I've changed it.
> >
> >>
> >> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
> >> then isn't that a bug in the page allocator? I'm not sure why you would change
> >> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
> >
> > It doesn't ignore the GFP_ATOMIC feature:
> > - __GFP_HIGH: use water mark till min reserved
> > - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
> >
> > But, it's a restriction -- "page allocation / free" API cannot be called
> > in preempt-disabled context at PREEMPT_RT.
> >
> > That's why I think it's wrong usage not a page allocator bug.
>
> I've taken a look at this and I agree with your analysis. Thanks for explaining.
>
> Looking at other stop_machine() callbacks, there are some that call printk() and
> I would assume that spinlocks could be taken there which may present the same
> kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
> to allocate memory though.
IIRC, there was a problem related for printk while try to grab
pl011_console related lock (spin_lock) while holding
console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
[ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
[ 230.381479] Hardware name: linux,dummy-virt (DT)
[ 230.381565] Call trace:
[ 230.381607] dump_backtrace+0x318/0x348
[ 230.381727] show_stack+0x4c/0x80
[ 230.381875] dump_stack_lvl+0x214/0x328
[ 230.382159] dump_stack+0x3c/0x58
[ 230.382456] __lock_acquire+0x4398/0x4720
[ 230.382683] lock_acquire+0x648/0xb70
[ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
[ 230.383121] pl011_console_write+0x240/0x8a0
[ 230.383356] console_flush_all+0x708/0x1368
[ 230.383571] console_unlock+0x180/0x440
[ 230.383742] vprintk_emit+0x1f8/0x9d0
[ 230.383832] vprintk_default+0x64/0x90
[ 230.383914] vprintk+0x2d0/0x400
[ 230.383971] _printk+0xdc/0x128
[ 230.384229] hrtimer_interrupt+0x8f0/0x920
[ 230.384414] arch_timer_handler_virt+0xc0/0x100
[ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
[ 230.385053] generic_handle_domain_irq+0xc0/0x120
[ 230.385367] gic_handle_irq+0x88/0x360
[ 230.385559] call_on_irq_stack+0x24/0x70
[ 230.385801] do_interrupt_handler+0xf8/0x200
[ 230.386092] el1_interrupt+0x68/0xc0
[ 230.386434] el1h_64_irq_handler+0x18/0x28
[ 230.386716] el1h_64_irq+0x64/0x68
[ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
[ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
[ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
[ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
[ 230.387822] folio_prealloc+0x5c/0x280
[ 230.388008] do_wp_page+0xc30/0x3bc0
[ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
[ 230.388448] handle_mm_fault+0x194/0x8a8
[ 230.388676] do_page_fault+0x6bc/0x1030
[ 230.388924] do_mem_abort+0x8c/0x240
[ 230.389056] el0_da+0xf0/0x3f8
[ 230.389178] el0t_64_sync_handler+0xb4/0x130
[ 230.389452] el0t_64_sync+0x190/0x198
But this problem is gone when I try with some of patches in rt-tree
related for printk which are merged in current tree
(https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
So I think printk() wouldn't be a problem.
>
> Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
>
> - Call the nolock variant (as you are doing). But that would just convert a
> deadlock to a panic; if the lock is held when stop_machine() runs, without your
> change, we now have a deadlock due to waiting on the lock inside stop_machine().
> With your change, we notice the lock is already taken and panic. I guess it is
> marginally better, but not by much. Certainly I would just _always_ call the
> nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
> the lock is guarranteed to be free so nolock will always succeed.
>
> - Preallocate the memory before entering stop_machine(). I think this would be
> much more robust. For kpti_install_ng_mappings() I think you could hoist the
> allocation/free out of stop_machine() and pass the pointer in pretty easily. For
> linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
> pgtable to figure out how much to preallocate, allocate it, then set it up as a
> special allocator, wrapped by an allocation function and modify the callchain to
> take a callback function instead of gfp flags.
>
> What do you think?
Definitely, second suggestoin is much better.
My question is whether *memory contention* really happen in the point
both functions are called.
Above two functions are called as last step of "smp_init()" -- smp_cpus_done().
If we can be sure, I think we don't need to go to complex way and
I believe the reason why we couldn't find out this problem,
even using GFP_ATOMIC in PREEMPT_RT since there was *no contection*
in this time of both functions are called.
That's why I first try with the "simple way".
What do you think?
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 10:48 ` Yeoreum Yun
@ 2025-12-17 12:04 ` Ryan Roberts
2025-12-17 12:52 ` Yeoreum Yun
2025-12-23 22:59 ` Yang Shi
0 siblings, 2 replies; 35+ messages in thread
From: Ryan Roberts @ 2025-12-17 12:04 UTC (permalink / raw)
To: Yeoreum Yun
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On 17/12/2025 10:48, Yeoreum Yun wrote:
> Hi Ryan,
>
>> On 16/12/2025 16:52, Yeoreum Yun wrote:
>>> Hi Ryan,
>>>
>>>> On 12/12/2025 16:18, Yeoreum Yun wrote:
>>>>> Some architectures invoke pagetable_alloc() or __get_free_pages()
>>>>> with preemption disabled.
>>>>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
>>>>> while spliting block entry to ptes and __kpti_install_ng_mappings()
>>>>> calls __get_free_pages() to create kpti pagetable.
>>>>>
>>>>> Under PREEMPT_RT, calling pagetable_alloc() with
>>>>> preemption disabled is not allowed, because it may acquire
>>>>> a spin lock that becomes sleepable on RT, potentially
>>>>> causing a sleep during page allocation.
>>>>>
>>>>> Since above two functions is called as callback of stop_machine()
>>>>> where its callback is called in preemption disabled,
>>>>> They could make a potential problem. (sleeping in preemption disabled).
>>>>>
>>>>> To address this, introduce pagetable_alloc_nolock() API.
>>>>
>>>> I don't really understand what the problem is that you're trying to fix. As I
>>>> see it, there are 2 call sites in arm64 arch code that are calling into the page
>>>> allocator from stop_machine() - one via via pagetable_alloc() and another via
>>>> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
>>>> understanding that the page allocator would ensure it never sleeps when
>>>> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
>>>
>>> Although GFP_ATOMIC is specify, it only affects of "water mark" of the
>>> page with __GFP_HIGH. and to get a page, it must grab the lock --
>>> zone->lock or pcp_lock in the rmqueue().
>>>
>>> This zone->lock and pcp_lock is spin_lock and it's a sleepable in
>>> PREEMPT_RT that's why the memory allocation/free using general API
>>> except nolock() version couldn't be called since
>>> if "contention" happens they'll sleep while waiting to get the lock.
>>>
>>> The reason why "nolock()" can use, it always uses "trylock" with
>>> ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
>>> PREEMPT_RT.
>>>
>>>>
>>>> What is the actual symptom you are seeing?
>>>
>>> Since the place where called while smp_cpus_done() and there seems no
>>> contention, there seems no problem. However as I mention in another
>>> thread
>>> (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
>>> This gives a the false impression --
>>> GFP_ATOMIC are “safe to use in preemption disabled”
>>> even though they are not in PREEMPT_RT case, I've changed it.
>>>
>>>>
>>>> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
>>>> then isn't that a bug in the page allocator? I'm not sure why you would change
>>>> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
>>>
>>> It doesn't ignore the GFP_ATOMIC feature:
>>> - __GFP_HIGH: use water mark till min reserved
>>> - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
>>>
>>> But, it's a restriction -- "page allocation / free" API cannot be called
>>> in preempt-disabled context at PREEMPT_RT.
>>>
>>> That's why I think it's wrong usage not a page allocator bug.
>>
>> I've taken a look at this and I agree with your analysis. Thanks for explaining.
>>
>> Looking at other stop_machine() callbacks, there are some that call printk() and
>> I would assume that spinlocks could be taken there which may present the same
>> kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
>> to allocate memory though.
>
> IIRC, there was a problem related for printk while try to grab
> pl011_console related lock (spin_lock) while holding
> console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
>
> [ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
> [ 230.381479] Hardware name: linux,dummy-virt (DT)
> [ 230.381565] Call trace:
> [ 230.381607] dump_backtrace+0x318/0x348
> [ 230.381727] show_stack+0x4c/0x80
> [ 230.381875] dump_stack_lvl+0x214/0x328
> [ 230.382159] dump_stack+0x3c/0x58
> [ 230.382456] __lock_acquire+0x4398/0x4720
> [ 230.382683] lock_acquire+0x648/0xb70
> [ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
> [ 230.383121] pl011_console_write+0x240/0x8a0
> [ 230.383356] console_flush_all+0x708/0x1368
> [ 230.383571] console_unlock+0x180/0x440
> [ 230.383742] vprintk_emit+0x1f8/0x9d0
> [ 230.383832] vprintk_default+0x64/0x90
> [ 230.383914] vprintk+0x2d0/0x400
> [ 230.383971] _printk+0xdc/0x128
> [ 230.384229] hrtimer_interrupt+0x8f0/0x920
> [ 230.384414] arch_timer_handler_virt+0xc0/0x100
> [ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
> [ 230.385053] generic_handle_domain_irq+0xc0/0x120
> [ 230.385367] gic_handle_irq+0x88/0x360
> [ 230.385559] call_on_irq_stack+0x24/0x70
> [ 230.385801] do_interrupt_handler+0xf8/0x200
> [ 230.386092] el1_interrupt+0x68/0xc0
> [ 230.386434] el1h_64_irq_handler+0x18/0x28
> [ 230.386716] el1h_64_irq+0x64/0x68
> [ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
> [ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
> [ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
> [ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
> [ 230.387822] folio_prealloc+0x5c/0x280
> [ 230.388008] do_wp_page+0xc30/0x3bc0
> [ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
> [ 230.388448] handle_mm_fault+0x194/0x8a8
> [ 230.388676] do_page_fault+0x6bc/0x1030
> [ 230.388924] do_mem_abort+0x8c/0x240
> [ 230.389056] el0_da+0xf0/0x3f8
> [ 230.389178] el0t_64_sync_handler+0xb4/0x130
> [ 230.389452] el0t_64_sync+0x190/0x198
>
> But this problem is gone when I try with some of patches in rt-tree
> related for printk which are merged in current tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
>
> So I think printk() wouldn't be a problem.
>
>>
>> Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
>>
>> - Call the nolock variant (as you are doing). But that would just convert a
>> deadlock to a panic; if the lock is held when stop_machine() runs, without your
>> change, we now have a deadlock due to waiting on the lock inside stop_machine().
>> With your change, we notice the lock is already taken and panic. I guess it is
>> marginally better, but not by much. Certainly I would just _always_ call the
>> nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
>> the lock is guarranteed to be free so nolock will always succeed.
>>
>> - Preallocate the memory before entering stop_machine(). I think this would be
>> much more robust. For kpti_install_ng_mappings() I think you could hoist the
>> allocation/free out of stop_machine() and pass the pointer in pretty easily. For
>> linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
>> pgtable to figure out how much to preallocate, allocate it, then set it up as a
>> special allocator, wrapped by an allocation function and modify the callchain to
>> take a callback function instead of gfp flags.
>>
>> What do you think?
>
> Definitely, second suggestoin is much better.
> My question is whether *memory contention* really happen in the point
> both functions are called.
My guess would be that it's unlikely, but not impossible. The secondary CPUs are
up, and presumably running their idle thread. I think various power management
things can be plugged into the idle thread; if so, then I guess it's possible
that the CPU could be running some hook as part of a power state transition, and
that could be dynamically allocating memory? That's all just a guess though; I
don't know the details of that part of the system.
>
> Above two functions are called as last step of "smp_init()" -- smp_cpus_done().
> If we can be sure, I think we don't need to go to complex way and
> I believe the reason why we couldn't find out this problem,
> even using GFP_ATOMIC in PREEMPT_RT since there was *no contection*
> in this time of both functions are called.
> > That's why I first try with the "simple way".
>
> What do you think?
As far as linear_map_split_to_ptes() is concerned, it was implemented under the
impression that doing allocation with GFP_ATOMIC was safe, even in
stop_machine(). Given that's an incorrect assumption, I think we should fix it
to pre-allocate outside of stop_machine() regardless of the likelihood of
actually hitting the race.
Thanks,
Ryan
>
> --
> Sincerely,
> Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 12:04 ` Ryan Roberts
@ 2025-12-17 12:52 ` Yeoreum Yun
2025-12-17 13:15 ` Vlastimil Babka
2025-12-23 22:59 ` Yang Shi
1 sibling, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-17 12:52 UTC (permalink / raw)
To: Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
> On 17/12/2025 10:48, Yeoreum Yun wrote:
> > Hi Ryan,
> >
> >> On 16/12/2025 16:52, Yeoreum Yun wrote:
> >>> Hi Ryan,
> >>>
> >>>> On 12/12/2025 16:18, Yeoreum Yun wrote:
> >>>>> Some architectures invoke pagetable_alloc() or __get_free_pages()
> >>>>> with preemption disabled.
> >>>>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
> >>>>> while spliting block entry to ptes and __kpti_install_ng_mappings()
> >>>>> calls __get_free_pages() to create kpti pagetable.
> >>>>>
> >>>>> Under PREEMPT_RT, calling pagetable_alloc() with
> >>>>> preemption disabled is not allowed, because it may acquire
> >>>>> a spin lock that becomes sleepable on RT, potentially
> >>>>> causing a sleep during page allocation.
> >>>>>
> >>>>> Since above two functions is called as callback of stop_machine()
> >>>>> where its callback is called in preemption disabled,
> >>>>> They could make a potential problem. (sleeping in preemption disabled).
> >>>>>
> >>>>> To address this, introduce pagetable_alloc_nolock() API.
> >>>>
> >>>> I don't really understand what the problem is that you're trying to fix. As I
> >>>> see it, there are 2 call sites in arm64 arch code that are calling into the page
> >>>> allocator from stop_machine() - one via via pagetable_alloc() and another via
> >>>> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
> >>>> understanding that the page allocator would ensure it never sleeps when
> >>>> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
> >>>
> >>> Although GFP_ATOMIC is specify, it only affects of "water mark" of the
> >>> page with __GFP_HIGH. and to get a page, it must grab the lock --
> >>> zone->lock or pcp_lock in the rmqueue().
> >>>
> >>> This zone->lock and pcp_lock is spin_lock and it's a sleepable in
> >>> PREEMPT_RT that's why the memory allocation/free using general API
> >>> except nolock() version couldn't be called since
> >>> if "contention" happens they'll sleep while waiting to get the lock.
> >>>
> >>> The reason why "nolock()" can use, it always uses "trylock" with
> >>> ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
> >>> PREEMPT_RT.
> >>>
> >>>>
> >>>> What is the actual symptom you are seeing?
> >>>
> >>> Since the place where called while smp_cpus_done() and there seems no
> >>> contention, there seems no problem. However as I mention in another
> >>> thread
> >>> (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
> >>> This gives a the false impression --
> >>> GFP_ATOMIC are “safe to use in preemption disabled”
> >>> even though they are not in PREEMPT_RT case, I've changed it.
> >>>
> >>>>
> >>>> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
> >>>> then isn't that a bug in the page allocator? I'm not sure why you would change
> >>>> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
> >>>
> >>> It doesn't ignore the GFP_ATOMIC feature:
> >>> - __GFP_HIGH: use water mark till min reserved
> >>> - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
> >>>
> >>> But, it's a restriction -- "page allocation / free" API cannot be called
> >>> in preempt-disabled context at PREEMPT_RT.
> >>>
> >>> That's why I think it's wrong usage not a page allocator bug.
> >>
> >> I've taken a look at this and I agree with your analysis. Thanks for explaining.
> >>
> >> Looking at other stop_machine() callbacks, there are some that call printk() and
> >> I would assume that spinlocks could be taken there which may present the same
> >> kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
> >> to allocate memory though.
> >
> > IIRC, there was a problem related for printk while try to grab
> > pl011_console related lock (spin_lock) while holding
> > console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
> >
> > [ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
> > [ 230.381479] Hardware name: linux,dummy-virt (DT)
> > [ 230.381565] Call trace:
> > [ 230.381607] dump_backtrace+0x318/0x348
> > [ 230.381727] show_stack+0x4c/0x80
> > [ 230.381875] dump_stack_lvl+0x214/0x328
> > [ 230.382159] dump_stack+0x3c/0x58
> > [ 230.382456] __lock_acquire+0x4398/0x4720
> > [ 230.382683] lock_acquire+0x648/0xb70
> > [ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
> > [ 230.383121] pl011_console_write+0x240/0x8a0
> > [ 230.383356] console_flush_all+0x708/0x1368
> > [ 230.383571] console_unlock+0x180/0x440
> > [ 230.383742] vprintk_emit+0x1f8/0x9d0
> > [ 230.383832] vprintk_default+0x64/0x90
> > [ 230.383914] vprintk+0x2d0/0x400
> > [ 230.383971] _printk+0xdc/0x128
> > [ 230.384229] hrtimer_interrupt+0x8f0/0x920
> > [ 230.384414] arch_timer_handler_virt+0xc0/0x100
> > [ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
> > [ 230.385053] generic_handle_domain_irq+0xc0/0x120
> > [ 230.385367] gic_handle_irq+0x88/0x360
> > [ 230.385559] call_on_irq_stack+0x24/0x70
> > [ 230.385801] do_interrupt_handler+0xf8/0x200
> > [ 230.386092] el1_interrupt+0x68/0xc0
> > [ 230.386434] el1h_64_irq_handler+0x18/0x28
> > [ 230.386716] el1h_64_irq+0x64/0x68
> > [ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
> > [ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
> > [ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
> > [ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
> > [ 230.387822] folio_prealloc+0x5c/0x280
> > [ 230.388008] do_wp_page+0xc30/0x3bc0
> > [ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
> > [ 230.388448] handle_mm_fault+0x194/0x8a8
> > [ 230.388676] do_page_fault+0x6bc/0x1030
> > [ 230.388924] do_mem_abort+0x8c/0x240
> > [ 230.389056] el0_da+0xf0/0x3f8
> > [ 230.389178] el0t_64_sync_handler+0xb4/0x130
> > [ 230.389452] el0t_64_sync+0x190/0x198
> >
> > But this problem is gone when I try with some of patches in rt-tree
> > related for printk which are merged in current tree
> > (https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
> >
> > So I think printk() wouldn't be a problem.
> >
> >>
> >> Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
> >>
> >> - Call the nolock variant (as you are doing). But that would just convert a
> >> deadlock to a panic; if the lock is held when stop_machine() runs, without your
> >> change, we now have a deadlock due to waiting on the lock inside stop_machine().
> >> With your change, we notice the lock is already taken and panic. I guess it is
> >> marginally better, but not by much. Certainly I would just _always_ call the
> >> nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
> >> the lock is guarranteed to be free so nolock will always succeed.
> >>
> >> - Preallocate the memory before entering stop_machine(). I think this would be
> >> much more robust. For kpti_install_ng_mappings() I think you could hoist the
> >> allocation/free out of stop_machine() and pass the pointer in pretty easily. For
> >> linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
> >> pgtable to figure out how much to preallocate, allocate it, then set it up as a
> >> special allocator, wrapped by an allocation function and modify the callchain to
> >> take a callback function instead of gfp flags.
> >>
> >> What do you think?
> >
> > Definitely, second suggestoin is much better.
> > My question is whether *memory contention* really happen in the point
> > both functions are called.
>
> My guess would be that it's unlikely, but not impossible. The secondary CPUs are
> up, and presumably running their idle thread. I think various power management
> things can be plugged into the idle thread; if so, then I guess it's possible
> that the CPU could be running some hook as part of a power state transition, and
> that could be dynamically allocating memory? That's all just a guess though; I
> don't know the details of that part of the system.
>
> >
> > Above two functions are called as last step of "smp_init()" -- smp_cpus_done().
> > If we can be sure, I think we don't need to go to complex way and
> > I believe the reason why we couldn't find out this problem,
> > even using GFP_ATOMIC in PREEMPT_RT since there was *no contection*
> > in this time of both functions are called.
> > > That's why I first try with the "simple way".
> >
> > What do you think?
>
> As far as linear_map_split_to_ptes() is concerned, it was implemented under the
> impression that doing allocation with GFP_ATOMIC was safe, even in
> stop_machine(). Given that's an incorrect assumption, I think we should fix it
> to pre-allocate outside of stop_machine() regardless of the likelihood of
> actually hitting the race.
>
Yeap. It’s better to be certain than uncertain. Thanks for checking.
I'll repsin with the preallocate way.
Thanks!
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 12:52 ` Yeoreum Yun
@ 2025-12-17 13:15 ` Vlastimil Babka
2025-12-17 13:35 ` Brendan Jackman
0 siblings, 1 reply; 35+ messages in thread
From: Vlastimil Babka @ 2025-12-17 13:15 UTC (permalink / raw)
To: Yeoreum Yun, Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb, mhocko,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, jackmanb, hannes,
ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
kevin.brodsky, dev.jain, yang, linux-mm, linux-kernel, bpf,
linux-rt-devel, linux-arm-kernel
On 12/17/25 13:52, Yeoreum Yun wrote:
>> On 17/12/2025 10:48, Yeoreum Yun wrote:
>> > Hi Ryan,
>> >
>> >> On 16/12/2025 16:52, Yeoreum Yun wrote:
>> >>> Hi Ryan,
>> >>>
>> >>>> On 12/12/2025 16:18, Yeoreum Yun wrote:
>> >>>>> Some architectures invoke pagetable_alloc() or __get_free_pages()
>> >>>>> with preemption disabled.
>> >>>>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
>> >>>>> while spliting block entry to ptes and __kpti_install_ng_mappings()
>> >>>>> calls __get_free_pages() to create kpti pagetable.
>> >>>>>
>> >>>>> Under PREEMPT_RT, calling pagetable_alloc() with
>> >>>>> preemption disabled is not allowed, because it may acquire
>> >>>>> a spin lock that becomes sleepable on RT, potentially
>> >>>>> causing a sleep during page allocation.
>> >>>>>
>> >>>>> Since above two functions is called as callback of stop_machine()
>> >>>>> where its callback is called in preemption disabled,
>> >>>>> They could make a potential problem. (sleeping in preemption disabled).
>> >>>>>
>> >>>>> To address this, introduce pagetable_alloc_nolock() API.
>> >>>>
>> >>>> I don't really understand what the problem is that you're trying to fix. As I
>> >>>> see it, there are 2 call sites in arm64 arch code that are calling into the page
>> >>>> allocator from stop_machine() - one via via pagetable_alloc() and another via
>> >>>> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
>> >>>> understanding that the page allocator would ensure it never sleeps when
>> >>>> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
>> >>>
>> >>> Although GFP_ATOMIC is specify, it only affects of "water mark" of the
>> >>> page with __GFP_HIGH. and to get a page, it must grab the lock --
>> >>> zone->lock or pcp_lock in the rmqueue().
>> >>>
>> >>> This zone->lock and pcp_lock is spin_lock and it's a sleepable in
>> >>> PREEMPT_RT that's why the memory allocation/free using general API
>> >>> except nolock() version couldn't be called since
>> >>> if "contention" happens they'll sleep while waiting to get the lock.
>> >>>
>> >>> The reason why "nolock()" can use, it always uses "trylock" with
>> >>> ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
>> >>> PREEMPT_RT.
>> >>>
>> >>>>
>> >>>> What is the actual symptom you are seeing?
>> >>>
>> >>> Since the place where called while smp_cpus_done() and there seems no
>> >>> contention, there seems no problem. However as I mention in another
>> >>> thread
>> >>> (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
>> >>> This gives a the false impression --
>> >>> GFP_ATOMIC are “safe to use in preemption disabled”
>> >>> even though they are not in PREEMPT_RT case, I've changed it.
>> >>>
>> >>>>
>> >>>> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
>> >>>> then isn't that a bug in the page allocator? I'm not sure why you would change
>> >>>> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
>> >>>
>> >>> It doesn't ignore the GFP_ATOMIC feature:
>> >>> - __GFP_HIGH: use water mark till min reserved
>> >>> - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
>> >>>
>> >>> But, it's a restriction -- "page allocation / free" API cannot be called
>> >>> in preempt-disabled context at PREEMPT_RT.
>> >>>
>> >>> That's why I think it's wrong usage not a page allocator bug.
>> >>
>> >> I've taken a look at this and I agree with your analysis. Thanks for explaining.
>> >>
>> >> Looking at other stop_machine() callbacks, there are some that call printk() and
>> >> I would assume that spinlocks could be taken there which may present the same
>> >> kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
>> >> to allocate memory though.
>> >
>> > IIRC, there was a problem related for printk while try to grab
>> > pl011_console related lock (spin_lock) while holding
>> > console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
>> >
>> > [ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
>> > [ 230.381479] Hardware name: linux,dummy-virt (DT)
>> > [ 230.381565] Call trace:
>> > [ 230.381607] dump_backtrace+0x318/0x348
>> > [ 230.381727] show_stack+0x4c/0x80
>> > [ 230.381875] dump_stack_lvl+0x214/0x328
>> > [ 230.382159] dump_stack+0x3c/0x58
>> > [ 230.382456] __lock_acquire+0x4398/0x4720
>> > [ 230.382683] lock_acquire+0x648/0xb70
>> > [ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
>> > [ 230.383121] pl011_console_write+0x240/0x8a0
>> > [ 230.383356] console_flush_all+0x708/0x1368
>> > [ 230.383571] console_unlock+0x180/0x440
>> > [ 230.383742] vprintk_emit+0x1f8/0x9d0
>> > [ 230.383832] vprintk_default+0x64/0x90
>> > [ 230.383914] vprintk+0x2d0/0x400
>> > [ 230.383971] _printk+0xdc/0x128
>> > [ 230.384229] hrtimer_interrupt+0x8f0/0x920
>> > [ 230.384414] arch_timer_handler_virt+0xc0/0x100
>> > [ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
>> > [ 230.385053] generic_handle_domain_irq+0xc0/0x120
>> > [ 230.385367] gic_handle_irq+0x88/0x360
>> > [ 230.385559] call_on_irq_stack+0x24/0x70
>> > [ 230.385801] do_interrupt_handler+0xf8/0x200
>> > [ 230.386092] el1_interrupt+0x68/0xc0
>> > [ 230.386434] el1h_64_irq_handler+0x18/0x28
>> > [ 230.386716] el1h_64_irq+0x64/0x68
>> > [ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
>> > [ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
>> > [ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
>> > [ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
>> > [ 230.387822] folio_prealloc+0x5c/0x280
>> > [ 230.388008] do_wp_page+0xc30/0x3bc0
>> > [ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
>> > [ 230.388448] handle_mm_fault+0x194/0x8a8
>> > [ 230.388676] do_page_fault+0x6bc/0x1030
>> > [ 230.388924] do_mem_abort+0x8c/0x240
>> > [ 230.389056] el0_da+0xf0/0x3f8
>> > [ 230.389178] el0t_64_sync_handler+0xb4/0x130
>> > [ 230.389452] el0t_64_sync+0x190/0x198
>> >
>> > But this problem is gone when I try with some of patches in rt-tree
>> > related for printk which are merged in current tree
>> > (https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
>> >
>> > So I think printk() wouldn't be a problem.
>> >
>> >>
>> >> Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
>> >>
>> >> - Call the nolock variant (as you are doing). But that would just convert a
>> >> deadlock to a panic; if the lock is held when stop_machine() runs, without your
>> >> change, we now have a deadlock due to waiting on the lock inside stop_machine().
>> >> With your change, we notice the lock is already taken and panic. I guess it is
>> >> marginally better, but not by much. Certainly I would just _always_ call the
>> >> nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
>> >> the lock is guarranteed to be free so nolock will always succeed.
>> >>
>> >> - Preallocate the memory before entering stop_machine(). I think this would be
>> >> much more robust. For kpti_install_ng_mappings() I think you could hoist the
>> >> allocation/free out of stop_machine() and pass the pointer in pretty easily. For
>> >> linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
>> >> pgtable to figure out how much to preallocate, allocate it, then set it up as a
>> >> special allocator, wrapped by an allocation function and modify the callchain to
>> >> take a callback function instead of gfp flags.
>> >>
>> >> What do you think?
>> >
>> > Definitely, second suggestoin is much better.
>> > My question is whether *memory contention* really happen in the point
>> > both functions are called.
>>
>> My guess would be that it's unlikely, but not impossible. The secondary CPUs are
>> up, and presumably running their idle thread. I think various power management
>> things can be plugged into the idle thread; if so, then I guess it's possible
>> that the CPU could be running some hook as part of a power state transition, and
>> that could be dynamically allocating memory? That's all just a guess though; I
>> don't know the details of that part of the system.
>>
>> >
>> > Above two functions are called as last step of "smp_init()" -- smp_cpus_done().
>> > If we can be sure, I think we don't need to go to complex way and
>> > I believe the reason why we couldn't find out this problem,
>> > even using GFP_ATOMIC in PREEMPT_RT since there was *no contection*
>> > in this time of both functions are called.
>> > > That's why I first try with the "simple way".
>> >
>> > What do you think?
>>
>> As far as linear_map_split_to_ptes() is concerned, it was implemented under the
>> impression that doing allocation with GFP_ATOMIC was safe, even in
>> stop_machine(). Given that's an incorrect assumption, I think we should fix it
>> to pre-allocate outside of stop_machine() regardless of the likelihood of
>> actually hitting the race.
>>
>
> Yeap. It’s better to be certain than uncertain. Thanks for checking.
> I'll repsin with the preallocate way.
Note this is explained in Documentation/core-api/real-time/differences.rst:
Memory allocation
-----------------
The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
necessary to use GFP_ATOMIC when allocating memory from interrupt context or
from sections where preemption is disabled. This is because the allocator must
not sleep in these contexts waiting for memory to become available.
However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping is
allowed.
What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.
This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 13:15 ` Vlastimil Babka
@ 2025-12-17 13:35 ` Brendan Jackman
2025-12-17 13:56 ` Yeoreum Yun
2025-12-17 15:10 ` Vlastimil Babka
0 siblings, 2 replies; 35+ messages in thread
From: Brendan Jackman @ 2025-12-17 13:35 UTC (permalink / raw)
To: Vlastimil Babka, Yeoreum Yun, Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb, mhocko,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, jackmanb, hannes,
ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
kevin.brodsky, dev.jain, yang, linux-mm, linux-kernel, bpf,
linux-rt-devel, linux-arm-kernel
On Wed Dec 17, 2025 at 1:15 PM UTC, Vlastimil Babka wrote:
> On 12/17/25 13:52, Yeoreum Yun wrote:
>>> On 17/12/2025 10:48, Yeoreum Yun wrote:
>>> > Hi Ryan,
>>> >
>>> >> On 16/12/2025 16:52, Yeoreum Yun wrote:
>>> >>> Hi Ryan,
>>> >>>
>>> >>>> On 12/12/2025 16:18, Yeoreum Yun wrote:
>>> >>>>> Some architectures invoke pagetable_alloc() or __get_free_pages()
>>> >>>>> with preemption disabled.
>>> >>>>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
>>> >>>>> while spliting block entry to ptes and __kpti_install_ng_mappings()
>>> >>>>> calls __get_free_pages() to create kpti pagetable.
>>> >>>>>
>>> >>>>> Under PREEMPT_RT, calling pagetable_alloc() with
>>> >>>>> preemption disabled is not allowed, because it may acquire
>>> >>>>> a spin lock that becomes sleepable on RT, potentially
>>> >>>>> causing a sleep during page allocation.
>>> >>>>>
>>> >>>>> Since above two functions is called as callback of stop_machine()
>>> >>>>> where its callback is called in preemption disabled,
>>> >>>>> They could make a potential problem. (sleeping in preemption disabled).
>>> >>>>>
>>> >>>>> To address this, introduce pagetable_alloc_nolock() API.
>>> >>>>
>>> >>>> I don't really understand what the problem is that you're trying to fix. As I
>>> >>>> see it, there are 2 call sites in arm64 arch code that are calling into the page
>>> >>>> allocator from stop_machine() - one via via pagetable_alloc() and another via
>>> >>>> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
>>> >>>> understanding that the page allocator would ensure it never sleeps when
>>> >>>> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
>>> >>>
>>> >>> Although GFP_ATOMIC is specify, it only affects of "water mark" of the
>>> >>> page with __GFP_HIGH. and to get a page, it must grab the lock --
>>> >>> zone->lock or pcp_lock in the rmqueue().
>>> >>>
>>> >>> This zone->lock and pcp_lock is spin_lock and it's a sleepable in
>>> >>> PREEMPT_RT that's why the memory allocation/free using general API
>>> >>> except nolock() version couldn't be called since
>>> >>> if "contention" happens they'll sleep while waiting to get the lock.
>>> >>>
>>> >>> The reason why "nolock()" can use, it always uses "trylock" with
>>> >>> ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
>>> >>> PREEMPT_RT.
>>> >>>
>>> >>>>
>>> >>>> What is the actual symptom you are seeing?
>>> >>>
>>> >>> Since the place where called while smp_cpus_done() and there seems no
>>> >>> contention, there seems no problem. However as I mention in another
>>> >>> thread
>>> >>> (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
>>> >>> This gives a the false impression --
>>> >>> GFP_ATOMIC are “safe to use in preemption disabled”
>>> >>> even though they are not in PREEMPT_RT case, I've changed it.
>>> >>>
>>> >>>>
>>> >>>> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
>>> >>>> then isn't that a bug in the page allocator? I'm not sure why you would change
>>> >>>> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
>>> >>>
>>> >>> It doesn't ignore the GFP_ATOMIC feature:
>>> >>> - __GFP_HIGH: use water mark till min reserved
>>> >>> - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
>>> >>>
>>> >>> But, it's a restriction -- "page allocation / free" API cannot be called
>>> >>> in preempt-disabled context at PREEMPT_RT.
>>> >>>
>>> >>> That's why I think it's wrong usage not a page allocator bug.
>>> >>
>>> >> I've taken a look at this and I agree with your analysis. Thanks for explaining.
>>> >>
>>> >> Looking at other stop_machine() callbacks, there are some that call printk() and
>>> >> I would assume that spinlocks could be taken there which may present the same
>>> >> kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
>>> >> to allocate memory though.
>>> >
>>> > IIRC, there was a problem related for printk while try to grab
>>> > pl011_console related lock (spin_lock) while holding
>>> > console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
>>> >
>>> > [ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
>>> > [ 230.381479] Hardware name: linux,dummy-virt (DT)
>>> > [ 230.381565] Call trace:
>>> > [ 230.381607] dump_backtrace+0x318/0x348
>>> > [ 230.381727] show_stack+0x4c/0x80
>>> > [ 230.381875] dump_stack_lvl+0x214/0x328
>>> > [ 230.382159] dump_stack+0x3c/0x58
>>> > [ 230.382456] __lock_acquire+0x4398/0x4720
>>> > [ 230.382683] lock_acquire+0x648/0xb70
>>> > [ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
>>> > [ 230.383121] pl011_console_write+0x240/0x8a0
>>> > [ 230.383356] console_flush_all+0x708/0x1368
>>> > [ 230.383571] console_unlock+0x180/0x440
>>> > [ 230.383742] vprintk_emit+0x1f8/0x9d0
>>> > [ 230.383832] vprintk_default+0x64/0x90
>>> > [ 230.383914] vprintk+0x2d0/0x400
>>> > [ 230.383971] _printk+0xdc/0x128
>>> > [ 230.384229] hrtimer_interrupt+0x8f0/0x920
>>> > [ 230.384414] arch_timer_handler_virt+0xc0/0x100
>>> > [ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
>>> > [ 230.385053] generic_handle_domain_irq+0xc0/0x120
>>> > [ 230.385367] gic_handle_irq+0x88/0x360
>>> > [ 230.385559] call_on_irq_stack+0x24/0x70
>>> > [ 230.385801] do_interrupt_handler+0xf8/0x200
>>> > [ 230.386092] el1_interrupt+0x68/0xc0
>>> > [ 230.386434] el1h_64_irq_handler+0x18/0x28
>>> > [ 230.386716] el1h_64_irq+0x64/0x68
>>> > [ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
>>> > [ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
>>> > [ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
>>> > [ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
>>> > [ 230.387822] folio_prealloc+0x5c/0x280
>>> > [ 230.388008] do_wp_page+0xc30/0x3bc0
>>> > [ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
>>> > [ 230.388448] handle_mm_fault+0x194/0x8a8
>>> > [ 230.388676] do_page_fault+0x6bc/0x1030
>>> > [ 230.388924] do_mem_abort+0x8c/0x240
>>> > [ 230.389056] el0_da+0xf0/0x3f8
>>> > [ 230.389178] el0t_64_sync_handler+0xb4/0x130
>>> > [ 230.389452] el0t_64_sync+0x190/0x198
>>> >
>>> > But this problem is gone when I try with some of patches in rt-tree
>>> > related for printk which are merged in current tree
>>> > (https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
>>> >
>>> > So I think printk() wouldn't be a problem.
>>> >
>>> >>
>>> >> Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
>>> >>
>>> >> - Call the nolock variant (as you are doing). But that would just convert a
>>> >> deadlock to a panic; if the lock is held when stop_machine() runs, without your
>>> >> change, we now have a deadlock due to waiting on the lock inside stop_machine().
>>> >> With your change, we notice the lock is already taken and panic. I guess it is
>>> >> marginally better, but not by much. Certainly I would just _always_ call the
>>> >> nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
>>> >> the lock is guarranteed to be free so nolock will always succeed.
>>> >>
>>> >> - Preallocate the memory before entering stop_machine(). I think this would be
>>> >> much more robust. For kpti_install_ng_mappings() I think you could hoist the
>>> >> allocation/free out of stop_machine() and pass the pointer in pretty easily. For
>>> >> linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
>>> >> pgtable to figure out how much to preallocate, allocate it, then set it up as a
>>> >> special allocator, wrapped by an allocation function and modify the callchain to
>>> >> take a callback function instead of gfp flags.
>>> >>
>>> >> What do you think?
>>> >
>>> > Definitely, second suggestoin is much better.
>>> > My question is whether *memory contention* really happen in the point
>>> > both functions are called.
>>>
>>> My guess would be that it's unlikely, but not impossible. The secondary CPUs are
>>> up, and presumably running their idle thread. I think various power management
>>> things can be plugged into the idle thread; if so, then I guess it's possible
>>> that the CPU could be running some hook as part of a power state transition, and
>>> that could be dynamically allocating memory? That's all just a guess though; I
>>> don't know the details of that part of the system.
>>>
>>> >
>>> > Above two functions are called as last step of "smp_init()" -- smp_cpus_done().
>>> > If we can be sure, I think we don't need to go to complex way and
>>> > I believe the reason why we couldn't find out this problem,
>>> > even using GFP_ATOMIC in PREEMPT_RT since there was *no contection*
>>> > in this time of both functions are called.
>>> > > That's why I first try with the "simple way".
>>> >
>>> > What do you think?
>>>
>>> As far as linear_map_split_to_ptes() is concerned, it was implemented under the
>>> impression that doing allocation with GFP_ATOMIC was safe, even in
>>> stop_machine(). Given that's an incorrect assumption, I think we should fix it
>>> to pre-allocate outside of stop_machine() regardless of the likelihood of
>>> actually hitting the race.
>>>
>>
>> Yeap. It’s better to be certain than uncertain. Thanks for checking.
>> I'll repsin with the preallocate way.
>
> Note this is explained in Documentation/core-api/real-time/differences.rst:
>
> Memory allocation
> -----------------
>
> The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
> gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
> necessary to use GFP_ATOMIC when allocating memory from interrupt context or
> from sections where preemption is disabled. This is because the allocator must
> not sleep in these contexts waiting for memory to become available.
>
> However, this approach does not work on PREEMPT_RT kernels. The memory
> allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
> acquired when preemption is disabled. Fortunately, this is generally not a
> problem, because PREEMPT_RT moves most contexts that would traditionally run
> with preemption or interrupts disabled into threaded context, where sleeping is
> allowed.
>
> What remains problematic is code that explicitly disables preemption or
> interrupts. In such cases, memory allocation must be performed outside the
> critical section.
>
> This restriction also applies to memory deallocation routines such as kfree()
> and free_pages(), which may also involve internal locking and must not be
> called from non-preemptible contexts.
Oh, thanks for pointing to that, I had never read that before (oops).
Shall we point to this from the doc-comment? Something like the below.
BTW, Yeorum, assuming you care about PREEMPT_RT, maybe you can get
Sparse to find some other bugs of this nature? Or if not, plain old
Coccinelle would probably find a few.
---
From 4c6b4d4cb08aee9559d02a348b9ecf799142c96f Mon Sep 17 00:00:00 2001
From: Brendan Jackman <jackmanb@google.com>
Date: Wed, 17 Dec 2025 13:26:28 +0000
Subject: [PATCH] mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment
The current description of contexts where it's invalid to make
GFP_ATOMIC and GFP_NOWAIT calls is rather vague.
Replace this with a direct description of the actual contexts of concern
and refer to the RT docs where this is explained more discursively.
While rejigging this prose, also move the documentation of GFP_NOWAIT to
the GFP_NOWAIT section.
Link: https://lore.kernel.org/all/d912480a-5229-4efe-9336-b31acded30f5@suse.cz/
Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
include/linux/gfp_types.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 3de43b12209ee..07a378542caf2 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -309,8 +309,10 @@ enum {
*
* %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
* watermark is applied to allow access to "atomic reserves".
- * The current implementation doesn't support NMI and few other strict
- * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
+ * The current implementation doesn't support NMI, nor contexts that disable
+ * preemption under PREEMPT_RT. This includes raw_spin_lock() and plain
+ * preempt_disable() - see Documentation/core-api/real-time/differences.rst for
+ * more info.
*
* %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
* %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
@@ -321,6 +323,7 @@ enum {
* %GFP_NOWAIT is for kernel allocations that should not stall for direct
* reclaim, start physical IO or use any filesystem callback. It is very
* likely to fail to allocate memory, even for very small allocations.
+ * The same restrictions on calling contexts apply as for %GFP_ATOMIC.
*
* %GFP_NOIO will use direct reclaim to discard clean pages or slab pages
* that do not require the starting of any physical IO.
--
2.50.1
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 13:35 ` Brendan Jackman
@ 2025-12-17 13:56 ` Yeoreum Yun
2025-12-17 15:10 ` Vlastimil Babka
1 sibling, 0 replies; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-17 13:56 UTC (permalink / raw)
To: Brendan Jackman
Cc: Vlastimil Babka, Ryan Roberts, akpm, david, lorenzo.stoakes,
Liam.Howlett, rppt, surenb, mhocko, ast, daniel, andrii,
martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
sdf, haoluo, jolsa, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
[...]
> > Note this is explained in Documentation/core-api/real-time/differences.rst:
> >
> > Memory allocation
> > -----------------
> >
> > The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
> > gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
> > necessary to use GFP_ATOMIC when allocating memory from interrupt context or
> > from sections where preemption is disabled. This is because the allocator must
> > not sleep in these contexts waiting for memory to become available.
> >
> > However, this approach does not work on PREEMPT_RT kernels. The memory
> > allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
> > acquired when preemption is disabled. Fortunately, this is generally not a
> > problem, because PREEMPT_RT moves most contexts that would traditionally run
> > with preemption or interrupts disabled into threaded context, where sleeping is
> > allowed.
> >
> > What remains problematic is code that explicitly disables preemption or
> > interrupts. In such cases, memory allocation must be performed outside the
> > critical section.
> >
> > This restriction also applies to memory deallocation routines such as kfree()
> > and free_pages(), which may also involve internal locking and must not be
> > called from non-preemptible contexts.
>
> Oh, thanks for pointing to that, I had never read that before (oops).
>
> Shall we point to this from the doc-comment? Something like the below.
>
> BTW, Yeorum, assuming you care about PREEMPT_RT, maybe you can get
> Sparse to find some other bugs of this nature? Or if not, plain old
> Coccinelle would probably find a few.
That's good idea. I'll try to sparse later.
Although this is a slightly different topic, based on Ryan’s suggestion,
I plan to address this misuse on arm64 by switching to pre-allocated pages.
As a result, I will remove the pgtable_alloc_nolock() interface.
> From 4c6b4d4cb08aee9559d02a348b9ecf799142c96f Mon Sep 17 00:00:00 2001
> From: Brendan Jackman <jackmanb@google.com>
> Date: Wed, 17 Dec 2025 13:26:28 +0000
> Subject: [PATCH] mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment
>
> The current description of contexts where it's invalid to make
> GFP_ATOMIC and GFP_NOWAIT calls is rather vague.
>
> Replace this with a direct description of the actual contexts of concern
> and refer to the RT docs where this is explained more discursively.
>
> While rejigging this prose, also move the documentation of GFP_NOWAIT to
> the GFP_NOWAIT section.
>
> Link: https://lore.kernel.org/all/d912480a-5229-4efe-9336-b31acded30f5@suse.cz/
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
> ---
> include/linux/gfp_types.h | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index 3de43b12209ee..07a378542caf2 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -309,8 +309,10 @@ enum {
> *
> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> * watermark is applied to allow access to "atomic reserves".
> - * The current implementation doesn't support NMI and few other strict
> - * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
> + * The current implementation doesn't support NMI, nor contexts that disable
> + * preemption under PREEMPT_RT. This includes raw_spin_lock() and plain
> + * preempt_disable() - see Documentation/core-api/real-time/differences.rst for
> + * more info.
> *
> * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> @@ -321,6 +323,7 @@ enum {
> * %GFP_NOWAIT is for kernel allocations that should not stall for direct
> * reclaim, start physical IO or use any filesystem callback. It is very
> * likely to fail to allocate memory, even for very small allocations.
> + * The same restrictions on calling contexts apply as for %GFP_ATOMIC.
> *
> * %GFP_NOIO will use direct reclaim to discard clean pages or slab pages
> * that do not require the starting of any physical IO.
> --
> 2.50.1
This patch looks good to me. Feel free to add:
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 13:35 ` Brendan Jackman
2025-12-17 13:56 ` Yeoreum Yun
@ 2025-12-17 15:10 ` Vlastimil Babka
2025-12-17 17:19 ` Brendan Jackman
2025-12-18 7:52 ` David Hildenbrand (Red Hat)
1 sibling, 2 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-12-17 15:10 UTC (permalink / raw)
To: Brendan Jackman, Yeoreum Yun, Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb, mhocko,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, hannes, ziy, bigeasy,
clrkwllms, rostedt, catalin.marinas, will, kevin.brodsky,
dev.jain, yang, linux-mm, linux-kernel, bpf, linux-rt-devel,
linux-arm-kernel
On 12/17/25 14:35, Brendan Jackman wrote:
> On Wed Dec 17, 2025 at 1:15 PM UTC, Vlastimil Babka wrote:
>> On 12/17/25 13:52, Yeoreum Yun wrote:
>>>> On 17/12/2025 10:48, Yeoreum Yun wrote:
>>>> > Hi Ryan,
>>>> >
>>>> >> On 16/12/2025 16:52, Yeoreum Yun wrote:
>>>> >>> Hi Ryan,
>>>> >>>
>>>> >>>> On 12/12/2025 16:18, Yeoreum Yun wrote:
>>>> >>>>> Some architectures invoke pagetable_alloc() or __get_free_pages()
>>>> >>>>> with preemption disabled.
>>>> >>>>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
>>>> >>>>> while spliting block entry to ptes and __kpti_install_ng_mappings()
>>>> >>>>> calls __get_free_pages() to create kpti pagetable.
>>>> >>>>>
>>>> >>>>> Under PREEMPT_RT, calling pagetable_alloc() with
>>>> >>>>> preemption disabled is not allowed, because it may acquire
>>>> >>>>> a spin lock that becomes sleepable on RT, potentially
>>>> >>>>> causing a sleep during page allocation.
>>>> >>>>>
>>>> >>>>> Since above two functions is called as callback of stop_machine()
>>>> >>>>> where its callback is called in preemption disabled,
>>>> >>>>> They could make a potential problem. (sleeping in preemption disabled).
>>>> >>>>>
>>>> >>>>> To address this, introduce pagetable_alloc_nolock() API.
>>>> >>>>
>>>> >>>> I don't really understand what the problem is that you're trying to fix. As I
>>>> >>>> see it, there are 2 call sites in arm64 arch code that are calling into the page
>>>> >>>> allocator from stop_machine() - one via via pagetable_alloc() and another via
>>>> >>>> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
>>>> >>>> understanding that the page allocator would ensure it never sleeps when
>>>> >>>> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
>>>> >>>
>>>> >>> Although GFP_ATOMIC is specify, it only affects of "water mark" of the
>>>> >>> page with __GFP_HIGH. and to get a page, it must grab the lock --
>>>> >>> zone->lock or pcp_lock in the rmqueue().
>>>> >>>
>>>> >>> This zone->lock and pcp_lock is spin_lock and it's a sleepable in
>>>> >>> PREEMPT_RT that's why the memory allocation/free using general API
>>>> >>> except nolock() version couldn't be called since
>>>> >>> if "contention" happens they'll sleep while waiting to get the lock.
>>>> >>>
>>>> >>> The reason why "nolock()" can use, it always uses "trylock" with
>>>> >>> ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
>>>> >>> PREEMPT_RT.
>>>> >>>
>>>> >>>>
>>>> >>>> What is the actual symptom you are seeing?
>>>> >>>
>>>> >>> Since the place where called while smp_cpus_done() and there seems no
>>>> >>> contention, there seems no problem. However as I mention in another
>>>> >>> thread
>>>> >>> (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
>>>> >>> This gives a the false impression --
>>>> >>> GFP_ATOMIC are “safe to use in preemption disabled”
>>>> >>> even though they are not in PREEMPT_RT case, I've changed it.
>>>> >>>
>>>> >>>>
>>>> >>>> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
>>>> >>>> then isn't that a bug in the page allocator? I'm not sure why you would change
>>>> >>>> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
>>>> >>>
>>>> >>> It doesn't ignore the GFP_ATOMIC feature:
>>>> >>> - __GFP_HIGH: use water mark till min reserved
>>>> >>> - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
>>>> >>>
>>>> >>> But, it's a restriction -- "page allocation / free" API cannot be called
>>>> >>> in preempt-disabled context at PREEMPT_RT.
>>>> >>>
>>>> >>> That's why I think it's wrong usage not a page allocator bug.
>>>> >>
>>>> >> I've taken a look at this and I agree with your analysis. Thanks for explaining.
>>>> >>
>>>> >> Looking at other stop_machine() callbacks, there are some that call printk() and
>>>> >> I would assume that spinlocks could be taken there which may present the same
>>>> >> kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
>>>> >> to allocate memory though.
>>>> >
>>>> > IIRC, there was a problem related for printk while try to grab
>>>> > pl011_console related lock (spin_lock) while holding
>>>> > console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
>>>> >
>>>> > [ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
>>>> > [ 230.381479] Hardware name: linux,dummy-virt (DT)
>>>> > [ 230.381565] Call trace:
>>>> > [ 230.381607] dump_backtrace+0x318/0x348
>>>> > [ 230.381727] show_stack+0x4c/0x80
>>>> > [ 230.381875] dump_stack_lvl+0x214/0x328
>>>> > [ 230.382159] dump_stack+0x3c/0x58
>>>> > [ 230.382456] __lock_acquire+0x4398/0x4720
>>>> > [ 230.382683] lock_acquire+0x648/0xb70
>>>> > [ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
>>>> > [ 230.383121] pl011_console_write+0x240/0x8a0
>>>> > [ 230.383356] console_flush_all+0x708/0x1368
>>>> > [ 230.383571] console_unlock+0x180/0x440
>>>> > [ 230.383742] vprintk_emit+0x1f8/0x9d0
>>>> > [ 230.383832] vprintk_default+0x64/0x90
>>>> > [ 230.383914] vprintk+0x2d0/0x400
>>>> > [ 230.383971] _printk+0xdc/0x128
>>>> > [ 230.384229] hrtimer_interrupt+0x8f0/0x920
>>>> > [ 230.384414] arch_timer_handler_virt+0xc0/0x100
>>>> > [ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
>>>> > [ 230.385053] generic_handle_domain_irq+0xc0/0x120
>>>> > [ 230.385367] gic_handle_irq+0x88/0x360
>>>> > [ 230.385559] call_on_irq_stack+0x24/0x70
>>>> > [ 230.385801] do_interrupt_handler+0xf8/0x200
>>>> > [ 230.386092] el1_interrupt+0x68/0xc0
>>>> > [ 230.386434] el1h_64_irq_handler+0x18/0x28
>>>> > [ 230.386716] el1h_64_irq+0x64/0x68
>>>> > [ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
>>>> > [ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
>>>> > [ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
>>>> > [ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
>>>> > [ 230.387822] folio_prealloc+0x5c/0x280
>>>> > [ 230.388008] do_wp_page+0xc30/0x3bc0
>>>> > [ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
>>>> > [ 230.388448] handle_mm_fault+0x194/0x8a8
>>>> > [ 230.388676] do_page_fault+0x6bc/0x1030
>>>> > [ 230.388924] do_mem_abort+0x8c/0x240
>>>> > [ 230.389056] el0_da+0xf0/0x3f8
>>>> > [ 230.389178] el0t_64_sync_handler+0xb4/0x130
>>>> > [ 230.389452] el0t_64_sync+0x190/0x198
>>>> >
>>>> > But this problem is gone when I try with some of patches in rt-tree
>>>> > related for printk which are merged in current tree
>>>> > (https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
>>>> >
>>>> > So I think printk() wouldn't be a problem.
>>>> >
>>>> >>
>>>> >> Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
>>>> >>
>>>> >> - Call the nolock variant (as you are doing). But that would just convert a
>>>> >> deadlock to a panic; if the lock is held when stop_machine() runs, without your
>>>> >> change, we now have a deadlock due to waiting on the lock inside stop_machine().
>>>> >> With your change, we notice the lock is already taken and panic. I guess it is
>>>> >> marginally better, but not by much. Certainly I would just _always_ call the
>>>> >> nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
>>>> >> the lock is guarranteed to be free so nolock will always succeed.
>>>> >>
>>>> >> - Preallocate the memory before entering stop_machine(). I think this would be
>>>> >> much more robust. For kpti_install_ng_mappings() I think you could hoist the
>>>> >> allocation/free out of stop_machine() and pass the pointer in pretty easily. For
>>>> >> linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
>>>> >> pgtable to figure out how much to preallocate, allocate it, then set it up as a
>>>> >> special allocator, wrapped by an allocation function and modify the callchain to
>>>> >> take a callback function instead of gfp flags.
>>>> >>
>>>> >> What do you think?
>>>> >
>>>> > Definitely, second suggestoin is much better.
>>>> > My question is whether *memory contention* really happen in the point
>>>> > both functions are called.
>>>>
>>>> My guess would be that it's unlikely, but not impossible. The secondary CPUs are
>>>> up, and presumably running their idle thread. I think various power management
>>>> things can be plugged into the idle thread; if so, then I guess it's possible
>>>> that the CPU could be running some hook as part of a power state transition, and
>>>> that could be dynamically allocating memory? That's all just a guess though; I
>>>> don't know the details of that part of the system.
>>>>
>>>> >
>>>> > Above two functions are called as last step of "smp_init()" -- smp_cpus_done().
>>>> > If we can be sure, I think we don't need to go to complex way and
>>>> > I believe the reason why we couldn't find out this problem,
>>>> > even using GFP_ATOMIC in PREEMPT_RT since there was *no contection*
>>>> > in this time of both functions are called.
>>>> > > That's why I first try with the "simple way".
>>>> >
>>>> > What do you think?
>>>>
>>>> As far as linear_map_split_to_ptes() is concerned, it was implemented under the
>>>> impression that doing allocation with GFP_ATOMIC was safe, even in
>>>> stop_machine(). Given that's an incorrect assumption, I think we should fix it
>>>> to pre-allocate outside of stop_machine() regardless of the likelihood of
>>>> actually hitting the race.
>>>>
>>>
>>> Yeap. It’s better to be certain than uncertain. Thanks for checking.
>>> I'll repsin with the preallocate way.
>>
>> Note this is explained in Documentation/core-api/real-time/differences.rst:
>>
>> Memory allocation
>> -----------------
>>
>> The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
>> gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
>> necessary to use GFP_ATOMIC when allocating memory from interrupt context or
>> from sections where preemption is disabled. This is because the allocator must
>> not sleep in these contexts waiting for memory to become available.
>>
>> However, this approach does not work on PREEMPT_RT kernels. The memory
>> allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
>> acquired when preemption is disabled. Fortunately, this is generally not a
>> problem, because PREEMPT_RT moves most contexts that would traditionally run
>> with preemption or interrupts disabled into threaded context, where sleeping is
>> allowed.
>>
>> What remains problematic is code that explicitly disables preemption or
>> interrupts. In such cases, memory allocation must be performed outside the
>> critical section.
>>
>> This restriction also applies to memory deallocation routines such as kfree()
>> and free_pages(), which may also involve internal locking and must not be
>> called from non-preemptible contexts.
>
> Oh, thanks for pointing to that, I had never read that before (oops).
>
> Shall we point to this from the doc-comment? Something like the below.
>
> BTW, Yeorum, assuming you care about PREEMPT_RT, maybe you can get
> Sparse to find some other bugs of this nature? Or if not, plain old
> Coccinelle would probably find a few.
>
> ---
>
> From 4c6b4d4cb08aee9559d02a348b9ecf799142c96f Mon Sep 17 00:00:00 2001
> From: Brendan Jackman <jackmanb@google.com>
> Date: Wed, 17 Dec 2025 13:26:28 +0000
> Subject: [PATCH] mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment
>
> The current description of contexts where it's invalid to make
> GFP_ATOMIC and GFP_NOWAIT calls is rather vague.
>
> Replace this with a direct description of the actual contexts of concern
> and refer to the RT docs where this is explained more discursively.
>
> While rejigging this prose, also move the documentation of GFP_NOWAIT to
> the GFP_NOWAIT section.
There doesn't seem to be any move?
>
> Link: https://lore.kernel.org/all/d912480a-5229-4efe-9336-b31acded30f5@suse.cz/
> Signed-off-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Nit below:
> ---
> include/linux/gfp_types.h | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index 3de43b12209ee..07a378542caf2 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -309,8 +309,10 @@ enum {
> *
> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> * watermark is applied to allow access to "atomic reserves".
> - * The current implementation doesn't support NMI and few other strict
> - * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
> + * The current implementation doesn't support NMI, nor contexts that disable
> + * preemption under PREEMPT_RT. This includes raw_spin_lock() and plain
> + * preempt_disable() - see Documentation/core-api/real-time/differences.rst for
> + * more info.
Can we reference the "Memory allocation" section directly?
> *
> * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> @@ -321,6 +323,7 @@ enum {
> * %GFP_NOWAIT is for kernel allocations that should not stall for direct
> * reclaim, start physical IO or use any filesystem callback. It is very
> * likely to fail to allocate memory, even for very small allocations.
> + * The same restrictions on calling contexts apply as for %GFP_ATOMIC.
> *
> * %GFP_NOIO will use direct reclaim to discard clean pages or slab pages
> * that do not require the starting of any physical IO.
> --
> 2.50.1
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 15:10 ` Vlastimil Babka
@ 2025-12-17 17:19 ` Brendan Jackman
2025-12-18 7:47 ` Vlastimil Babka
2025-12-18 7:52 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 35+ messages in thread
From: Brendan Jackman @ 2025-12-17 17:19 UTC (permalink / raw)
To: Vlastimil Babka, Brendan Jackman, Yeoreum Yun, Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb, mhocko,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, hannes, ziy, bigeasy,
clrkwllms, rostedt, catalin.marinas, will, kevin.brodsky,
dev.jain, yang, linux-mm, linux-kernel, bpf, linux-rt-devel,
linux-arm-kernel
>> From 4c6b4d4cb08aee9559d02a348b9ecf799142c96f Mon Sep 17 00:00:00 2001
>> From: Brendan Jackman <jackmanb@google.com>
>> Date: Wed, 17 Dec 2025 13:26:28 +0000
>> Subject: [PATCH] mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment
>>
>> The current description of contexts where it's invalid to make
>> GFP_ATOMIC and GFP_NOWAIT calls is rather vague.
>>
>> Replace this with a direct description of the actual contexts of concern
>> and refer to the RT docs where this is explained more discursively.
>>
>> While rejigging this prose, also move the documentation of GFP_NOWAIT to
>> the GFP_NOWAIT section.
>
> There doesn't seem to be any move?
This is referring to [0] and [1].
>> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
>> index 3de43b12209ee..07a378542caf2 100644
>> --- a/include/linux/gfp_types.h
>> +++ b/include/linux/gfp_types.h
>> @@ -309,8 +309,10 @@ enum {
>> *
>> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
>> * watermark is applied to allow access to "atomic reserves".
>> - * The current implementation doesn't support NMI and few other strict
>> - * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
[0] ^^^^^^^^^^^^^^^^^^^^^^^^^^
>> + * The current implementation doesn't support NMI, nor contexts that disable
>> + * preemption under PREEMPT_RT. This includes raw_spin_lock() and plain
>> + * preempt_disable() - see Documentation/core-api/real-time/differences.rst for
>> + * more info.
>
> Can we reference the "Memory allocation" section directly?
Yeah good point. I will send this as a standalone [PATCH] mail tomorrow.
>> *
>> * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
>> * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
>> @@ -321,6 +323,7 @@ enum {
>> * %GFP_NOWAIT is for kernel allocations that should not stall for direct
>> * reclaim, start physical IO or use any filesystem callback. It is very
>> * likely to fail to allocate memory, even for very small allocations.
>> + * The same restrictions on calling contexts apply as for %GFP_ATOMIC.
[1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 17:19 ` Brendan Jackman
@ 2025-12-18 7:47 ` Vlastimil Babka
0 siblings, 0 replies; 35+ messages in thread
From: Vlastimil Babka @ 2025-12-18 7:47 UTC (permalink / raw)
To: Brendan Jackman, Yeoreum Yun, Ryan Roberts
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, rppt, surenb, mhocko,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, hannes, ziy, bigeasy,
clrkwllms, rostedt, catalin.marinas, will, kevin.brodsky,
dev.jain, yang, linux-mm, linux-kernel, bpf, linux-rt-devel,
linux-arm-kernel
On 12/17/25 18:19, Brendan Jackman wrote:
>>> From 4c6b4d4cb08aee9559d02a348b9ecf799142c96f Mon Sep 17 00:00:00 2001
>>> From: Brendan Jackman <jackmanb@google.com>
>>> Date: Wed, 17 Dec 2025 13:26:28 +0000
>>> Subject: [PATCH] mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment
>>>
>>> The current description of contexts where it's invalid to make
>>> GFP_ATOMIC and GFP_NOWAIT calls is rather vague.
>>>
>>> Replace this with a direct description of the actual contexts of concern
>>> and refer to the RT docs where this is explained more discursively.
>>>
>>> While rejigging this prose, also move the documentation of GFP_NOWAIT to
>>> the GFP_NOWAIT section.
>>
>> There doesn't seem to be any move?
>
> This is referring to [0] and [1].
Oh I missed [0]. Thanks!
>>> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
>>> index 3de43b12209ee..07a378542caf2 100644
>>> --- a/include/linux/gfp_types.h
>>> +++ b/include/linux/gfp_types.h
>>> @@ -309,8 +309,10 @@ enum {
>>> *
>>> * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
>>> * watermark is applied to allow access to "atomic reserves".
>>> - * The current implementation doesn't support NMI and few other strict
>>> - * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
> [0] ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> + * The current implementation doesn't support NMI, nor contexts that disable
>>> + * preemption under PREEMPT_RT. This includes raw_spin_lock() and plain
>>> + * preempt_disable() - see Documentation/core-api/real-time/differences.rst for
>>> + * more info.
>>
>> Can we reference the "Memory allocation" section directly?
>
> Yeah good point. I will send this as a standalone [PATCH] mail tomorrow.
>
>>> *
>>> * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
>>> * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
>>> @@ -321,6 +323,7 @@ enum {
>>> * %GFP_NOWAIT is for kernel allocations that should not stall for direct
>>> * reclaim, start physical IO or use any filesystem callback. It is very
>>> * likely to fail to allocate memory, even for very small allocations.
>>> + * The same restrictions on calling contexts apply as for %GFP_ATOMIC.
> [1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 15:10 ` Vlastimil Babka
2025-12-17 17:19 ` Brendan Jackman
@ 2025-12-18 7:52 ` David Hildenbrand (Red Hat)
1 sibling, 0 replies; 35+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 7:52 UTC (permalink / raw)
To: Vlastimil Babka, Brendan Jackman, Yeoreum Yun, Ryan Roberts
Cc: akpm, lorenzo.stoakes, Liam.Howlett, rppt, surenb, mhocko, ast,
daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, hannes, ziy, bigeasy,
clrkwllms, rostedt, catalin.marinas, will, kevin.brodsky,
dev.jain, yang, linux-mm, linux-kernel, bpf, linux-rt-devel,
linux-arm-kernel
>>
>> Link: https://lore.kernel.org/all/d912480a-5229-4efe-9336-b31acded30f5@suse.cz/
>> Signed-off-by: Brendan Jackman <jackmanb@google.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
Clarifies things for me
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-12 16:18 ` [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine() Yeoreum Yun
2025-12-13 7:05 ` Brendan Jackman
@ 2025-12-18 9:30 ` Michal Hocko
2025-12-18 9:36 ` Yeoreum Yun
1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-12-18 9:30 UTC (permalink / raw)
To: Yeoreum Yun
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, jackmanb, hannes,
ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Fri 12-12-25 16:18:32, Yeoreum Yun wrote:
> linear_map_split_to_ptes() and __kpti_install_ng_mappings()
> are called as callback of stop_machine().
> That means these functions context are preemption disabled.
>
> Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
> __get_free_pages() couldn't be called in this context
> since spin lock that becomes sleepable on RT,
> potentially causing a sleep during page allocation.
>
> To address this, pagetable_alloc_nolock().
As you cannot tolerate allocation failure and this is pretty much
permanent allocation (AFAIU) why don't you use a static allocation?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-18 9:30 ` Michal Hocko
@ 2025-12-18 9:36 ` Yeoreum Yun
2025-12-18 12:02 ` Ryan Roberts
0 siblings, 1 reply; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-18 9:36 UTC (permalink / raw)
To: Michal Hocko
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, jackmanb, hannes,
ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
ryan.roberts, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
Hi,
> On Fri 12-12-25 16:18:32, Yeoreum Yun wrote:
> > linear_map_split_to_ptes() and __kpti_install_ng_mappings()
> > are called as callback of stop_machine().
> > That means these functions context are preemption disabled.
> >
> > Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
> > __get_free_pages() couldn't be called in this context
> > since spin lock that becomes sleepable on RT,
> > potentially causing a sleep during page allocation.
> >
> > To address this, pagetable_alloc_nolock().
>
> As you cannot tolerate allocation failure and this is pretty much
> permanent allocation (AFAIU) why don't you use a static allocation?
Because of when bbl2_noabort is supported, that pages doesn't need to.
If static alloc, that would be a waste in the system where bbl2_noabort
is supported.
When I tested, these extra pages are more than 40 in my FVP.
So, it would be better dynamic allocation and I think since it's quite a
early time, it's probably not failed that's why former code runs as it
is.
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-18 9:36 ` Yeoreum Yun
@ 2025-12-18 12:02 ` Ryan Roberts
2025-12-18 12:17 ` Michal Hocko
0 siblings, 1 reply; 35+ messages in thread
From: Ryan Roberts @ 2025-12-18 12:02 UTC (permalink / raw)
To: Yeoreum Yun, Michal Hocko
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
ast, daniel, andrii, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, jackmanb, hannes,
ziy, bigeasy, clrkwllms, rostedt, catalin.marinas, will,
kevin.brodsky, dev.jain, yang, linux-mm, linux-kernel, bpf,
linux-rt-devel, linux-arm-kernel
On 18/12/2025 09:36, Yeoreum Yun wrote:
> Hi,
>> On Fri 12-12-25 16:18:32, Yeoreum Yun wrote:
>>> linear_map_split_to_ptes() and __kpti_install_ng_mappings()
>>> are called as callback of stop_machine().
>>> That means these functions context are preemption disabled.
>>>
>>> Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
>>> __get_free_pages() couldn't be called in this context
>>> since spin lock that becomes sleepable on RT,
>>> potentially causing a sleep during page allocation.
>>>
>>> To address this, pagetable_alloc_nolock().
>>
>> As you cannot tolerate allocation failure and this is pretty much
>> permanent allocation (AFAIU) why don't you use a static allocation?
>
> Because of when bbl2_noabort is supported, that pages doesn't need to.
> If static alloc, that would be a waste in the system where bbl2_noabort
> is supported.
>
> When I tested, these extra pages are more than 40 in my FVP.
> So, it would be better dynamic allocation and I think since it's quite a
> early time, it's probably not failed that's why former code runs as it
> is.
The required allocation size is also a function of the size of the installed RAM
so a static worst case allocation would consume all the RAM on small systems.
>
> --
> Sincerely,
> Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-18 12:02 ` Ryan Roberts
@ 2025-12-18 12:17 ` Michal Hocko
2025-12-18 12:24 ` Yeoreum Yun
0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-12-18 12:17 UTC (permalink / raw)
To: Ryan Roberts
Cc: Yeoreum Yun, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On Thu 18-12-25 12:02:14, Ryan Roberts wrote:
> On 18/12/2025 09:36, Yeoreum Yun wrote:
> > Hi,
> >> On Fri 12-12-25 16:18:32, Yeoreum Yun wrote:
> >>> linear_map_split_to_ptes() and __kpti_install_ng_mappings()
> >>> are called as callback of stop_machine().
> >>> That means these functions context are preemption disabled.
> >>>
> >>> Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
> >>> __get_free_pages() couldn't be called in this context
> >>> since spin lock that becomes sleepable on RT,
> >>> potentially causing a sleep during page allocation.
> >>>
> >>> To address this, pagetable_alloc_nolock().
> >>
> >> As you cannot tolerate allocation failure and this is pretty much
> >> permanent allocation (AFAIU) why don't you use a static allocation?
> >
> > Because of when bbl2_noabort is supported, that pages doesn't need to.
> > If static alloc, that would be a waste in the system where bbl2_noabort
> > is supported.
> >
> > When I tested, these extra pages are more than 40 in my FVP.
> > So, it would be better dynamic allocation and I think since it's quite a
> > early time, it's probably not failed that's why former code runs as it
> > is.
>
> The required allocation size is also a function of the size of the installed RAM
> so a static worst case allocation would consume all the RAM on small systems.
Understood. But is it possible to pre-allocate early on so that the
allocation itself doesn't have to happen from a constrained context.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine()
2025-12-18 12:17 ` Michal Hocko
@ 2025-12-18 12:24 ` Yeoreum Yun
0 siblings, 0 replies; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-18 12:24 UTC (permalink / raw)
To: Michal Hocko
Cc: Ryan Roberts, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, yang, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
> On Thu 18-12-25 12:02:14, Ryan Roberts wrote:
> > On 18/12/2025 09:36, Yeoreum Yun wrote:
> > > Hi,
> > >> On Fri 12-12-25 16:18:32, Yeoreum Yun wrote:
> > >>> linear_map_split_to_ptes() and __kpti_install_ng_mappings()
> > >>> are called as callback of stop_machine().
> > >>> That means these functions context are preemption disabled.
> > >>>
> > >>> Unfortunately, under PREEMPT_RT, the pagetable_alloc() or
> > >>> __get_free_pages() couldn't be called in this context
> > >>> since spin lock that becomes sleepable on RT,
> > >>> potentially causing a sleep during page allocation.
> > >>>
> > >>> To address this, pagetable_alloc_nolock().
> > >>
> > >> As you cannot tolerate allocation failure and this is pretty much
> > >> permanent allocation (AFAIU) why don't you use a static allocation?
> > >
> > > Because of when bbl2_noabort is supported, that pages doesn't need to.
> > > If static alloc, that would be a waste in the system where bbl2_noabort
> > > is supported.
> > >
> > > When I tested, these extra pages are more than 40 in my FVP.
> > > So, it would be better dynamic allocation and I think since it's quite a
> > > early time, it's probably not failed that's why former code runs as it
> > > is.
> >
> > The required allocation size is also a function of the size of the installed RAM
> > so a static worst case allocation would consume all the RAM on small systems.
>
> Understood. But is it possible to pre-allocate early on so that the
> allocation itself doesn't have to happen from a constrained context.
That's the same suggestion from Ryan
(https://lore.kernel.org/all/100cc8da-b826-4fc2-a624-746bf6fb049d@arm.com/)
And here is the v2 which have accepted his suggestion :)
(But soon respin):
https://lore.kernel.org/all/20251217182007.2345700-1-yeoreum.yun@arm.com/
Thanks ;)
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-17 12:04 ` Ryan Roberts
2025-12-17 12:52 ` Yeoreum Yun
@ 2025-12-23 22:59 ` Yang Shi
2025-12-24 7:00 ` Yeoreum Yun
1 sibling, 1 reply; 35+ messages in thread
From: Yang Shi @ 2025-12-23 22:59 UTC (permalink / raw)
To: Ryan Roberts, Yeoreum Yun
Cc: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko, ast, daniel, andrii, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
On 12/17/25 9:04 PM, Ryan Roberts wrote:
> On 17/12/2025 10:48, Yeoreum Yun wrote:
>> Hi Ryan,
>>
>>> On 16/12/2025 16:52, Yeoreum Yun wrote:
>>>> Hi Ryan,
>>>>
>>>>> On 12/12/2025 16:18, Yeoreum Yun wrote:
>>>>>> Some architectures invoke pagetable_alloc() or __get_free_pages()
>>>>>> with preemption disabled.
>>>>>> For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
>>>>>> while spliting block entry to ptes and __kpti_install_ng_mappings()
>>>>>> calls __get_free_pages() to create kpti pagetable.
>>>>>>
>>>>>> Under PREEMPT_RT, calling pagetable_alloc() with
>>>>>> preemption disabled is not allowed, because it may acquire
>>>>>> a spin lock that becomes sleepable on RT, potentially
>>>>>> causing a sleep during page allocation.
>>>>>>
>>>>>> Since above two functions is called as callback of stop_machine()
>>>>>> where its callback is called in preemption disabled,
>>>>>> They could make a potential problem. (sleeping in preemption disabled).
>>>>>>
>>>>>> To address this, introduce pagetable_alloc_nolock() API.
>>>>> I don't really understand what the problem is that you're trying to fix. As I
>>>>> see it, there are 2 call sites in arm64 arch code that are calling into the page
>>>>> allocator from stop_machine() - one via via pagetable_alloc() and another via
>>>>> __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
>>>>> understanding that the page allocator would ensure it never sleeps when
>>>>> GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
>>>> Although GFP_ATOMIC is specify, it only affects of "water mark" of the
>>>> page with __GFP_HIGH. and to get a page, it must grab the lock --
>>>> zone->lock or pcp_lock in the rmqueue().
>>>>
>>>> This zone->lock and pcp_lock is spin_lock and it's a sleepable in
>>>> PREEMPT_RT that's why the memory allocation/free using general API
>>>> except nolock() version couldn't be called since
>>>> if "contention" happens they'll sleep while waiting to get the lock.
>>>>
>>>> The reason why "nolock()" can use, it always uses "trylock" with
>>>> ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
>>>> PREEMPT_RT.
>>>>
>>>>> What is the actual symptom you are seeing?
>>>> Since the place where called while smp_cpus_done() and there seems no
>>>> contention, there seems no problem. However as I mention in another
>>>> thread
>>>> (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
>>>> This gives a the false impression --
>>>> GFP_ATOMIC are “safe to use in preemption disabled”
>>>> even though they are not in PREEMPT_RT case, I've changed it.
>>>>
>>>>> If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
>>>>> then isn't that a bug in the page allocator? I'm not sure why you would change
>>>>> the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
>>>> It doesn't ignore the GFP_ATOMIC feature:
>>>> - __GFP_HIGH: use water mark till min reserved
>>>> - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
>>>>
>>>> But, it's a restriction -- "page allocation / free" API cannot be called
>>>> in preempt-disabled context at PREEMPT_RT.
>>>>
>>>> That's why I think it's wrong usage not a page allocator bug.
>>> I've taken a look at this and I agree with your analysis. Thanks for explaining.
>>>
>>> Looking at other stop_machine() callbacks, there are some that call printk() and
>>> I would assume that spinlocks could be taken there which may present the same
>>> kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
>>> to allocate memory though.
>> IIRC, there was a problem related for printk while try to grab
>> pl011_console related lock (spin_lock) while holding
>> console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
>>
>> [ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
>> [ 230.381479] Hardware name: linux,dummy-virt (DT)
>> [ 230.381565] Call trace:
>> [ 230.381607] dump_backtrace+0x318/0x348
>> [ 230.381727] show_stack+0x4c/0x80
>> [ 230.381875] dump_stack_lvl+0x214/0x328
>> [ 230.382159] dump_stack+0x3c/0x58
>> [ 230.382456] __lock_acquire+0x4398/0x4720
>> [ 230.382683] lock_acquire+0x648/0xb70
>> [ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
>> [ 230.383121] pl011_console_write+0x240/0x8a0
>> [ 230.383356] console_flush_all+0x708/0x1368
>> [ 230.383571] console_unlock+0x180/0x440
>> [ 230.383742] vprintk_emit+0x1f8/0x9d0
>> [ 230.383832] vprintk_default+0x64/0x90
>> [ 230.383914] vprintk+0x2d0/0x400
>> [ 230.383971] _printk+0xdc/0x128
>> [ 230.384229] hrtimer_interrupt+0x8f0/0x920
>> [ 230.384414] arch_timer_handler_virt+0xc0/0x100
>> [ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
>> [ 230.385053] generic_handle_domain_irq+0xc0/0x120
>> [ 230.385367] gic_handle_irq+0x88/0x360
>> [ 230.385559] call_on_irq_stack+0x24/0x70
>> [ 230.385801] do_interrupt_handler+0xf8/0x200
>> [ 230.386092] el1_interrupt+0x68/0xc0
>> [ 230.386434] el1h_64_irq_handler+0x18/0x28
>> [ 230.386716] el1h_64_irq+0x64/0x68
>> [ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
>> [ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
>> [ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
>> [ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
>> [ 230.387822] folio_prealloc+0x5c/0x280
>> [ 230.388008] do_wp_page+0xc30/0x3bc0
>> [ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
>> [ 230.388448] handle_mm_fault+0x194/0x8a8
>> [ 230.388676] do_page_fault+0x6bc/0x1030
>> [ 230.388924] do_mem_abort+0x8c/0x240
>> [ 230.389056] el0_da+0xf0/0x3f8
>> [ 230.389178] el0t_64_sync_handler+0xb4/0x130
>> [ 230.389452] el0t_64_sync+0x190/0x198
>>
>> But this problem is gone when I try with some of patches in rt-tree
>> related for printk which are merged in current tree
>> (https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
>>
>> So I think printk() wouldn't be a problem.
>>
>>> Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
>>>
>>> - Call the nolock variant (as you are doing). But that would just convert a
>>> deadlock to a panic; if the lock is held when stop_machine() runs, without your
>>> change, we now have a deadlock due to waiting on the lock inside stop_machine().
>>> With your change, we notice the lock is already taken and panic. I guess it is
>>> marginally better, but not by much. Certainly I would just _always_ call the
>>> nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
>>> the lock is guarranteed to be free so nolock will always succeed.
>>>
>>> - Preallocate the memory before entering stop_machine(). I think this would be
>>> much more robust. For kpti_install_ng_mappings() I think you could hoist the
>>> allocation/free out of stop_machine() and pass the pointer in pretty easily. For
>>> linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
>>> pgtable to figure out how much to preallocate, allocate it, then set it up as a
>>> special allocator, wrapped by an allocation function and modify the callchain to
>>> take a callback function instead of gfp flags.
>>>
>>> What do you think?
>> Definitely, second suggestoin is much better.
>> My question is whether *memory contention* really happen in the point
>> both functions are called.
> My guess would be that it's unlikely, but not impossible. The secondary CPUs are
> up, and presumably running their idle thread. I think various power management
> things can be plugged into the idle thread; if so, then I guess it's possible
> that the CPU could be running some hook as part of a power state transition, and
> that could be dynamically allocating memory? That's all just a guess though; I
> don't know the details of that part of the system.
Sorry for chiming in late. I was just done my travel, but still suffered
from jet lag. I may be out of my mind...
I agree the sleeping lock is a problem for -rt kernel. But it is hard
for me to understand how come the lock contention could happen. When the
boot CPU is repainting the linear map, the secondary CPUs are running in
a busy loop to wait for idmap_kpti_bbml2_flag is cleared by the boot CPU
instead of idle thread. And the secondary CPUs running with idmap active
and init_mm inactive. So the nolock variant seems good enough to me if I
don't miss anything.
Thanks,
Yang
>
>> Above two functions are called as last step of "smp_init()" -- smp_cpus_done().
>> If we can be sure, I think we don't need to go to complex way and
>> I believe the reason why we couldn't find out this problem,
>> even using GFP_ATOMIC in PREEMPT_RT since there was *no contection*
>> in this time of both functions are called.
>>> That's why I first try with the "simple way".
>> What do you think?
> As far as linear_map_split_to_ptes() is concerned, it was implemented under the
> impression that doing allocation with GFP_ATOMIC was safe, even in
> stop_machine(). Given that's an incorrect assumption, I think we should fix it
> to pre-allocate outside of stop_machine() regardless of the likelihood of
> actually hitting the race.
>
> Thanks,
> Ryan
>
>> --
>> Sincerely,
>> Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/2] introduce pagetable_alloc_nolock()
2025-12-23 22:59 ` Yang Shi
@ 2025-12-24 7:00 ` Yeoreum Yun
0 siblings, 0 replies; 35+ messages in thread
From: Yeoreum Yun @ 2025-12-24 7:00 UTC (permalink / raw)
To: Yang Shi
Cc: Ryan Roberts, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, ast, daniel, andrii, martin.lau, eddyz87,
song, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
jackmanb, hannes, ziy, bigeasy, clrkwllms, rostedt,
catalin.marinas, will, kevin.brodsky, dev.jain, linux-mm,
linux-kernel, bpf, linux-rt-devel, linux-arm-kernel
Hi Yang,
> On 12/17/25 9:04 PM, Ryan Roberts wrote:
> > On 17/12/2025 10:48, Yeoreum Yun wrote:
> > > Hi Ryan,
> > >
> > > > On 16/12/2025 16:52, Yeoreum Yun wrote:
> > > > > Hi Ryan,
> > > > >
> > > > > > On 12/12/2025 16:18, Yeoreum Yun wrote:
> > > > > > > Some architectures invoke pagetable_alloc() or __get_free_pages()
> > > > > > > with preemption disabled.
> > > > > > > For example, in arm64, linear_map_split_to_ptes() calls pagetable_alloc()
> > > > > > > while spliting block entry to ptes and __kpti_install_ng_mappings()
> > > > > > > calls __get_free_pages() to create kpti pagetable.
> > > > > > >
> > > > > > > Under PREEMPT_RT, calling pagetable_alloc() with
> > > > > > > preemption disabled is not allowed, because it may acquire
> > > > > > > a spin lock that becomes sleepable on RT, potentially
> > > > > > > causing a sleep during page allocation.
> > > > > > >
> > > > > > > Since above two functions is called as callback of stop_machine()
> > > > > > > where its callback is called in preemption disabled,
> > > > > > > They could make a potential problem. (sleeping in preemption disabled).
> > > > > > >
> > > > > > > To address this, introduce pagetable_alloc_nolock() API.
> > > > > > I don't really understand what the problem is that you're trying to fix. As I
> > > > > > see it, there are 2 call sites in arm64 arch code that are calling into the page
> > > > > > allocator from stop_machine() - one via via pagetable_alloc() and another via
> > > > > > __get_free_pages(). But both of those calls are passing in GFP_ATOMIC. It was my
> > > > > > understanding that the page allocator would ensure it never sleeps when
> > > > > > GFP_ATOMIC is passed in, (even for PREEMPT_RT)?
> > > > > Although GFP_ATOMIC is specify, it only affects of "water mark" of the
> > > > > page with __GFP_HIGH. and to get a page, it must grab the lock --
> > > > > zone->lock or pcp_lock in the rmqueue().
> > > > >
> > > > > This zone->lock and pcp_lock is spin_lock and it's a sleepable in
> > > > > PREEMPT_RT that's why the memory allocation/free using general API
> > > > > except nolock() version couldn't be called since
> > > > > if "contention" happens they'll sleep while waiting to get the lock.
> > > > >
> > > > > The reason why "nolock()" can use, it always uses "trylock" with
> > > > > ALLOC_TRYLOCK flags. otherwise GFP_ATOMIC also can be sleepable in
> > > > > PREEMPT_RT.
> > > > >
> > > > > > What is the actual symptom you are seeing?
> > > > > Since the place where called while smp_cpus_done() and there seems no
> > > > > contention, there seems no problem. However as I mention in another
> > > > > thread
> > > > > (https://lore.kernel.org/all/aT%2FdrjN1BkvyAGoi@e129823.arm.com/),
> > > > > This gives a the false impression --
> > > > > GFP_ATOMIC are “safe to use in preemption disabled”
> > > > > even though they are not in PREEMPT_RT case, I've changed it.
> > > > >
> > > > > > If the page allocator is somehow ignoring the GFP_ATOMIC request for PREEMPT_RT,
> > > > > > then isn't that a bug in the page allocator? I'm not sure why you would change
> > > > > > the callsites? Can't you just change the page allocator based on GFP_ATOMIC?
> > > > > It doesn't ignore the GFP_ATOMIC feature:
> > > > > - __GFP_HIGH: use water mark till min reserved
> > > > > - __GFP_KSWAPD_RECLAIM: wake up kswapd if reclaim required.
> > > > >
> > > > > But, it's a restriction -- "page allocation / free" API cannot be called
> > > > > in preempt-disabled context at PREEMPT_RT.
> > > > >
> > > > > That's why I think it's wrong usage not a page allocator bug.
> > > > I've taken a look at this and I agree with your analysis. Thanks for explaining.
> > > >
> > > > Looking at other stop_machine() callbacks, there are some that call printk() and
> > > > I would assume that spinlocks could be taken there which may present the same
> > > > kind of issue or PREEMPT_RT? (I'm guessing). I don't see any others that attempt
> > > > to allocate memory though.
> > > IIRC, there was a problem related for printk while try to grab
> > > pl011_console related lock (spin_lock) while holding
> > > console_lock(raw_spin_lock) in v6.10.0-rc7 at rpi5:
> > >
> > > [ 230.381263] CPU: 2 PID: 5574 Comm: syz.4.1695 Not tainted 6.10.0-rc7-01903-g52828ea60dfd #3
> > > [ 230.381479] Hardware name: linux,dummy-virt (DT)
> > > [ 230.381565] Call trace:
> > > [ 230.381607] dump_backtrace+0x318/0x348
> > > [ 230.381727] show_stack+0x4c/0x80
> > > [ 230.381875] dump_stack_lvl+0x214/0x328
> > > [ 230.382159] dump_stack+0x3c/0x58
> > > [ 230.382456] __lock_acquire+0x4398/0x4720
> > > [ 230.382683] lock_acquire+0x648/0xb70
> > > [ 230.382928] _raw_spin_lock_irqsave+0x138/0x240
> > > [ 230.383121] pl011_console_write+0x240/0x8a0
> > > [ 230.383356] console_flush_all+0x708/0x1368
> > > [ 230.383571] console_unlock+0x180/0x440
> > > [ 230.383742] vprintk_emit+0x1f8/0x9d0
> > > [ 230.383832] vprintk_default+0x64/0x90
> > > [ 230.383914] vprintk+0x2d0/0x400
> > > [ 230.383971] _printk+0xdc/0x128
> > > [ 230.384229] hrtimer_interrupt+0x8f0/0x920
> > > [ 230.384414] arch_timer_handler_virt+0xc0/0x100
> > > [ 230.384812] handle_percpu_devid_irq+0x20c/0x4e0
> > > [ 230.385053] generic_handle_domain_irq+0xc0/0x120
> > > [ 230.385367] gic_handle_irq+0x88/0x360
> > > [ 230.385559] call_on_irq_stack+0x24/0x70
> > > [ 230.385801] do_interrupt_handler+0xf8/0x200
> > > [ 230.386092] el1_interrupt+0x68/0xc0
> > > [ 230.386434] el1h_64_irq_handler+0x18/0x28
> > > [ 230.386716] el1h_64_irq+0x64/0x68
> > > [ 230.386853] __sanitizer_cov_trace_const_cmp2+0x30/0x68
> > > [ 230.387026] alloc_pages_mpol_noprof+0x170/0x698
> > > [ 230.387309] vma_alloc_folio_noprof+0x128/0x2a8
> > > [ 230.387610] vma_alloc_zeroed_movable_folio+0xa0/0xe0
> > > [ 230.387822] folio_prealloc+0x5c/0x280
> > > [ 230.388008] do_wp_page+0xc30/0x3bc0
> > > [ 230.388206] __handle_mm_fault+0xdb8/0x2ba0
> > > [ 230.388448] handle_mm_fault+0x194/0x8a8
> > > [ 230.388676] do_page_fault+0x6bc/0x1030
> > > [ 230.388924] do_mem_abort+0x8c/0x240
> > > [ 230.389056] el0_da+0xf0/0x3f8
> > > [ 230.389178] el0t_64_sync_handler+0xb4/0x130
> > > [ 230.389452] el0t_64_sync+0x190/0x198
> > >
> > > But this problem is gone when I try with some of patches in rt-tree
> > > related for printk which are merged in current tree
> > > (https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/log/?h=linux-6.10.y-rt-rebase).
> > >
> > > So I think printk() wouldn't be a problem.
> > >
> > > > Anyway, to fix the 2 arm64 callsites, I see 2 possible approaches:
> > > >
> > > > - Call the nolock variant (as you are doing). But that would just convert a
> > > > deadlock to a panic; if the lock is held when stop_machine() runs, without your
> > > > change, we now have a deadlock due to waiting on the lock inside stop_machine().
> > > > With your change, we notice the lock is already taken and panic. I guess it is
> > > > marginally better, but not by much. Certainly I would just _always_ call the
> > > > nolock variant regardless of PREEMPT_RT if we take this route; For !PREEMPT_RT,
> > > > the lock is guarranteed to be free so nolock will always succeed.
> > > >
> > > > - Preallocate the memory before entering stop_machine(). I think this would be
> > > > much more robust. For kpti_install_ng_mappings() I think you could hoist the
> > > > allocation/free out of stop_machine() and pass the pointer in pretty easily. For
> > > > linear_map_split_to_ptes() its a bit more complex; Perhaps, we need to walk the
> > > > pgtable to figure out how much to preallocate, allocate it, then set it up as a
> > > > special allocator, wrapped by an allocation function and modify the callchain to
> > > > take a callback function instead of gfp flags.
> > > >
> > > > What do you think?
> > > Definitely, second suggestoin is much better.
> > > My question is whether *memory contention* really happen in the point
> > > both functions are called.
> > My guess would be that it's unlikely, but not impossible. The secondary CPUs are
> > up, and presumably running their idle thread. I think various power management
> > things can be plugged into the idle thread; if so, then I guess it's possible
> > that the CPU could be running some hook as part of a power state transition, and
> > that could be dynamically allocating memory? That's all just a guess though; I
> > don't know the details of that part of the system.
>
> Sorry for chiming in late. I was just done my travel, but still suffered
> from jet lag. I may be out of my mind...
No worries. and I hope you feel better soon :).
>
> I agree the sleeping lock is a problem for -rt kernel. But it is hard for me
> to understand how come the lock contention could happen. When the boot CPU
> is repainting the linear map, the secondary CPUs are running in a busy loop
> to wait for idmap_kpti_bbml2_flag is cleared by the boot CPU instead of idle
> thread. And the secondary CPUs running with idmap active and init_mm
> inactive. So the nolock variant seems good enough to me if I don't miss
> anything.
As Ryan said, “It’s unlikely, but not impossible.”
For example, suppose someone creates a kthread bound to a CPU
other than the kernel_init() task during early_initcall,
and that thread performs memory allocation.
(Of course, I don’t expect anyone to actually do this;
it’s just to illustrate that it’s not impossible.)
When that CPU comes online, the kthread will be scheduled and will attempt to allocate memory.
Meanwhile, another CPU executing smp_init() calls smp_cpus_done() and
then invokes linear_map_split_to_ptes().
If the kthread performing memory allocation is preempted
by the stopper thread at that point, linear_map_split_to_ptes() could fail,
because the memory-allocation-related lock is already held by the kthread.
So, I've sent a new version of this:
- https://lore.kernel.org/all/20251218194750.395301-1-yeoreum.yun@arm.com/
Thanks!
--
Sincerely,
Yeoreum Yun
^ permalink raw reply [flat|nested] 35+ messages in thread
end of thread, other threads:[~2025-12-24 7:02 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-12 16:18 [PATCH 0/2] introduce pagetable_alloc_nolock() Yeoreum Yun
2025-12-12 16:18 ` [PATCH 1/2] mm: " Yeoreum Yun
2025-12-12 16:18 ` [PATCH 2/2] arm64: mmu: use pagetable_alloc_nolock() while stop_machine() Yeoreum Yun
2025-12-13 7:05 ` Brendan Jackman
2025-12-14 9:13 ` Yeoreum Yun
2025-12-15 9:22 ` Brendan Jackman
2025-12-15 9:34 ` Yeoreum Yun
2025-12-15 9:55 ` Brendan Jackman
2025-12-15 10:06 ` Yeoreum Yun
2025-12-16 10:10 ` Brendan Jackman
2025-12-16 11:03 ` Yeoreum Yun
2025-12-16 11:26 ` Brendan Jackman
2025-12-16 12:01 ` Yeoreum Yun
2025-12-16 12:39 ` Brendan Jackman
2025-12-16 13:25 ` Yeoreum Yun
2025-12-18 9:30 ` Michal Hocko
2025-12-18 9:36 ` Yeoreum Yun
2025-12-18 12:02 ` Ryan Roberts
2025-12-18 12:17 ` Michal Hocko
2025-12-18 12:24 ` Yeoreum Yun
2025-12-16 15:11 ` [PATCH 0/2] introduce pagetable_alloc_nolock() Ryan Roberts
2025-12-16 16:52 ` Yeoreum Yun
2025-12-17 9:34 ` Ryan Roberts
2025-12-17 10:48 ` Yeoreum Yun
2025-12-17 12:04 ` Ryan Roberts
2025-12-17 12:52 ` Yeoreum Yun
2025-12-17 13:15 ` Vlastimil Babka
2025-12-17 13:35 ` Brendan Jackman
2025-12-17 13:56 ` Yeoreum Yun
2025-12-17 15:10 ` Vlastimil Babka
2025-12-17 17:19 ` Brendan Jackman
2025-12-18 7:47 ` Vlastimil Babka
2025-12-18 7:52 ` David Hildenbrand (Red Hat)
2025-12-23 22:59 ` Yang Shi
2025-12-24 7:00 ` Yeoreum Yun
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).