* [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
@ 2026-05-28 13:29 Kaitao Cheng
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw)
To: dennis, tj, cl, akpm
Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng
Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
pcpu_alloc_mutex and the chunk creation slow path. This restored the
allocation capability that was lost when those constrained allocations
were treated as atomic, but it also opens two possible reclaim recursion
problems.

The first problem is that the create and populate slow paths do not fully
preserve the caller's allocation constraints. pcpu_alloc_noprof() derives
pcpu_gfp from the caller supplied GFP mask and passes it to the backing
page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(),
and population can allocate temporary metadata or page tables while mapping
backing pages. Those internal allocations can use GFP_KERNEL. A caller
using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or
IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's
allocation context.

The second problem is a possible pcpu_alloc_mutex recursion from reclaim.
If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim
reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO,
the nested allocation can now try to take pcpu_alloc_mutex again because
9a5b183941b5 no longer treats those masks as atomic.

Another possible way to avoid these issues is to revert 9a5b183941b5.
However, that would also bring back the premature allocation failures for
sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended
to fix.

Kaitao Cheng (2):
mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim
mm/percpu.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
--
2.50.1 (Apple Git-155)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
@ 2026-05-28 13:29 ` Kaitao Cheng
2026-05-29 9:25 ` Pedro Falcato
2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
` (2 subsequent siblings)
3 siblings, 1 reply; 10+ messages in thread
From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw)
To: dennis, tj, cl, akpm
Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng
From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
passes it to the backing percpu allocators. This preserves GFP_NOFS and
GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.

However, the chunk creation and population slow paths also call helpers
which do not take a GFP mask and perform internal allocations with
GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
and population can allocate temporary metadata or page tables while mapping
backing pages. As a result, a caller which explicitly uses GFP_NOFS or
GFP_NOIO can still enter FS or IO reclaim while creating or populating a
percpu chunk.

This is problematic for callers which use GFP_NOFS or GFP_NOIO because
they are already holding filesystem or IO-path locks. If free chunks are
exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
unconstrained reclaim from these internal allocations, defeating the
caller's allocation context and potentially recreating reclaim lock
dependencies.

Wrap chunk creation and population in a scoped NOIO or NOFS context when
pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
allocations unchanged so they retain full reclaim capability.

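For readers less familiar with scoped allocation contexts, the idea can be sketched as a small userspace model. This is only an illustration, not kernel code: the MODEL_* bit values, the model_task_flags word (a stand-in for the task flags) and all model_* function names are assumptions made up for this sketch, and the save/restore pair is simplified to stash the whole flags word.

```c
#include <assert.h>

/* Illustrative bit values only; they do not match the kernel's definitions. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_RECLAIM	0x4u
#define MODEL_GFP_KERNEL	(MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS		(MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO		(MODEL_GFP_RECLAIM)

#define MODEL_PF_NOIO		0x1u
#define MODEL_PF_NOFS		0x2u

/* Hypothetical stand-in for the per-task flags word. */
unsigned int model_task_flags;

/* Enter a NOIO or NOFS scope if the caller's mask lacks IO or FS. */
unsigned int model_scope_save(model_gfp_t gfp)
{
	unsigned int old = model_task_flags;

	if (!(gfp & MODEL_GFP_IO))
		model_task_flags |= MODEL_PF_NOIO;
	else if (!(gfp & MODEL_GFP_FS))
		model_task_flags |= MODEL_PF_NOFS;
	return old;
}

/* Leave the scope by putting back the saved flags word. */
void model_scope_restore(unsigned int old)
{
	model_task_flags = old;
}

/* Apply the active scope to a mask, in the spirit of current_gfp_context(). */
model_gfp_t model_effective_gfp(model_gfp_t gfp)
{
	if (model_task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (model_task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* A helper that hardcodes GFP_KERNEL, like the internal allocations above. */
model_gfp_t model_internal_alloc_mask(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}
```

In this model, a helper that hardcodes the GFP_KERNEL stand-in ends up with the caller's NOIO or NOFS mask only while a scope is active, which is the property the wrapping in this patch relies on.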
Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
mm/percpu.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/mm/percpu.c b/mm/percpu.c
index 71a85d7245c7..1bb38467390b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
 }
 #endif

+static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
+{
+	if (!(gfp & __GFP_IO))
+		return memalloc_noio_save();
+	if (!(gfp & __GFP_FS))
+		return memalloc_nofs_save();
+	return 0;
+}
+
+static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
+{
+	if (!(gfp & __GFP_IO))
+		memalloc_noio_restore(flags);
+	else if (!(gfp & __GFP_FS))
+		memalloc_nofs_restore(flags);
+}
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,

 	/* No space left. Create a new chunk. */
 	if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
+		unsigned int pcpu_scope;
+
+		pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
 		chunk = pcpu_create_chunk(pcpu_gfp);
+		pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
+
 		if (!chunk) {
 			err = "failed to allocate new chunk";
 			goto fail;
@@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 		page_end = PFN_UP(off + size);

 		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
+			unsigned int pcpu_scope;
+
 			WARN_ON(chunk->immutable);

+			pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
 			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
+			pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);

 			spin_lock_irqsave(&pcpu_lock, flags);
 			if (ret) {
--
2.50.1 (Apple Git-155)
* [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim
2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
@ 2026-05-28 13:29 ` Kaitao Cheng
2026-05-29 9:34 ` Pedro Falcato
2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton
2026-05-28 21:10 ` Andrew Morton
3 siblings, 1 reply; 10+ messages in thread
From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw)
To: dennis, tj, cl, akpm
Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng
From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() takes pcpu_alloc_mutex for sleepable allocations
so that it can create chunks and populate backing pages. If reclaim is
entered while that mutex is already held, and reclaim reaches a path
which allocates percpu memory, the nested allocation can try to take
pcpu_alloc_mutex again.

That creates a reclaim recursion dependency:

  pcpu_alloc_noprof(GFP_KERNEL)
    mutex_lock(&pcpu_alloc_mutex)
    reclaim
      pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS)
        mutex_lock(&pcpu_alloc_mutex)

Avoid this by treating percpu allocations from reclaim context as atomic.
Such allocations may still be served from already available and populated
areas, but they must not enter the mutex-protected slow path or create new
chunks. If no space is available, fail the allocation and let the normal
balance work handle replenishment outside reclaim.

Update the function comment to describe that reclaim context allocations
are atomic regardless of whether the supplied GFP mask would otherwise
allow blocking.

This patch is a preventive fix. There may not currently be any path that
calls pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) from direct reclaim context.

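The decision described above can be modelled in a few lines of userspace C. This is a hedged sketch rather than the kernel implementation: struct model_task, MODEL_GFP_DIRECT_RECLAIM and the model_* helpers are hypothetical names, and the single bit stands in for whatever the real mask uses to say that the caller may sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins; not the kernel's types or bit values. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_DIRECT_RECLAIM 0x1u

struct model_task {
	void *reclaim_state;	/* non-NULL while the task runs reclaim */
};

/* Same idea as gfpflags_allow_blocking(): may the caller sleep? */
bool model_allow_blocking(model_gfp_t gfp)
{
	return (gfp & MODEL_GFP_DIRECT_RECLAIM) != 0;
}

/* Atomic if the mask forbids sleeping OR the task is already in reclaim. */
bool model_is_atomic(const struct model_task *task, model_gfp_t gfp)
{
	return !model_allow_blocking(gfp) || task->reclaim_state != NULL;
}

/* Only non-atomic requests may enter the mutex-protected slow path. */
bool model_may_take_alloc_mutex(const struct model_task *task, model_gfp_t gfp)
{
	return !model_is_atomic(task, gfp);
}
```

With this rule a request issued from reclaim never reaches the mutex, even when its mask would otherwise allow blocking, so it can only be satisfied from already populated space.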
Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
mm/percpu.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 1bb38467390b..9c30e5897813 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1803,9 +1803,9 @@ static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
  * @gfp: allocation flags
  *
  * Allocate percpu area of @size bytes aligned at @align. If @gfp doesn't
- * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
- * then no warning will be triggered on invalid or failed allocation
- * requests.
+ * allow blocking, or if allocation is requested from reclaim context, the
+ * allocation is atomic. If @gfp has __GFP_NOWARN then no warning will be
+ * triggered on invalid or failed allocation requests.
  *
  * RETURNS:
  * Percpu pointer to the allocated area on success, NULL on failure.
@@ -1828,7 +1828,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 	gfp = current_gfp_context(gfp);
 	/* whitelisted flags that can be passed to the backing allocators */
 	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
-	is_atomic = !gfpflags_allow_blocking(gfp);
+	/*
+	 * Reclaim can be entered while pcpu_alloc_mutex is already held by
+	 * another percpu allocation. Avoid recursing back into the mutex from
+	 * reclaim; best-effort allocations from already populated areas are OK.
+	 */
+	is_atomic = !gfpflags_allow_blocking(gfp) || current->reclaim_state;
 	do_warn = !(gfp & __GFP_NOWARN);

 	/*
--
2.50.1 (Apple Git-155)
* Re: [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
@ 2026-05-28 21:09 ` Andrew Morton
2026-05-28 21:10 ` Andrew Morton
3 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2026-05-28 21:09 UTC (permalink / raw)
To: Kaitao Cheng
Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel,
muchun.song
On Thu, 28 May 2026 21:29:15 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
> atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
> pcpu_alloc_mutex and the chunk creation slow path. This restored the
> allocation capability that was lost when those constrained allocations
> were treated as atomic, but it also opens two possible reclaim recursion
> problems.
>
> The first problem is that the create and populate slow paths do not fully
> preserve the caller's allocation constraints. pcpu_alloc_noprof() derives
> pcpu_gfp from the caller supplied GFP mask and passes it to the backing
> page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> and population can allocate temporary metadata or page tables while mapping
> backing pages. Those internal allocations can use GFP_KERNEL. A caller
> using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or
> IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's
> allocation context.
>
> The second problem is a possible pcpu_alloc_mutex recursion from reclaim.
> If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim
> reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO,
> the nested allocation can now try to take pcpu_alloc_mutex again because
> 9a5b183941b5 no longer treats those masks as atomic.
>
> Another possible way to avoid these issues is to revert 9a5b183941b5.
> However, that would also bring back the premature allocation failures for
> sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended
> to fix.
Thanks.
9a5b183941b5 has been in there for a year. How are you observing/triggering this bug
and what are the userspace-visible effects?
We might choose to backport fixes into -stable kernels, but this additional info
is needed to make that determination.
* Re: [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
` (2 preceding siblings ...)
2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton
@ 2026-05-28 21:10 ` Andrew Morton
3 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2026-05-28 21:10 UTC (permalink / raw)
To: Kaitao Cheng
Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel,
muchun.song
On Thu, 28 May 2026 21:29:15 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
> atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
> pcpu_alloc_mutex and the chunk creation slow path. This restored the
> allocation capability that was lost when those constrained allocations
> were treated as atomic, but it also opens two possible reclaim recursion
> problems.
>
> The first problem is that the create and populate slow paths do not fully
> preserve the caller's allocation constraints. pcpu_alloc_noprof() derives
> pcpu_gfp from the caller supplied GFP mask and passes it to the backing
> page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> and population can allocate temporary metadata or page tables while mapping
> backing pages. Those internal allocations can use GFP_KERNEL. A caller
> using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or
> IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's
> allocation context.
>
> The second problem is a possible pcpu_alloc_mutex recursion from reclaim.
> If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim
> reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO,
> the nested allocation can now try to take pcpu_alloc_mutex again because
> 9a5b183941b5 no longer treats those masks as atomic.
>
> Another possible way to avoid these issues is to revert 9a5b183941b5.
> However, that would also bring back the premature allocation failures for
> sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended
> to fix.
AI review might have found a couple of pre-existing issues:
https://sashiko.dev/#/patchset/20260528132917.81123-1-kaitao.cheng@linux.dev
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
@ 2026-05-29 9:25 ` Pedro Falcato
2026-05-29 9:38 ` Pedro Falcato
0 siblings, 1 reply; 10+ messages in thread
From: Pedro Falcato @ 2026-05-29 9:25 UTC (permalink / raw)
To: Kaitao Cheng
Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel,
muchun.song, Kaitao Cheng
On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>
> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> passes it to the backing percpu allocators. This preserves GFP_NOFS and
> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
>
> However, the chunk creation and population slow paths also call helpers
> which do not take a GFP mask and perform internal allocations with
> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> and population can allocate temporary metadata or page tables while mapping
> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
> percpu chunk.
>
> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
> they are already holding filesystem or IO-path locks. If free chunks are
> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
> unconstrained reclaim from these internal allocations, defeating the
> caller's allocation context and potentially recreating reclaim lock
> dependencies.
>
> Wrap chunk creation and population in a scoped NOIO or NOFS context when
> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
> allocations unchanged so they retain full reclaim capability.
>
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
I assume you _did not_ observe this in production? As in no reclaim path should be
insane^W daring enough to do pcpu allocations?
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
> mm/percpu.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 71a85d7245c7..1bb38467390b 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
> }
> #endif
>
> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
> +{
> + if (!(gfp & __GFP_IO))
> + return memalloc_noio_save();
> + if (!(gfp & __GFP_FS))
> + return memalloc_nofs_save();
> + return 0;
> +}
> +
> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
> +{
> + if (!(gfp & __GFP_IO))
> + memalloc_noio_restore(flags);
> + else if (!(gfp & __GFP_FS))
> + memalloc_nofs_restore(flags);
> +}
I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
and pcpu_populate_chunk(). It's their job to respect the gfp flags and
Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
mostly the vmalloc backend that's problematic.
> +
> /**
> * pcpu_alloc - the percpu allocator
> * @size: size of area to allocate in bytes
> @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>
> /* No space left. Create a new chunk. */
> if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
> + unsigned int pcpu_scope;
> +
> + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
> chunk = pcpu_create_chunk(pcpu_gfp);
> + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
> +
> if (!chunk) {
> err = "failed to allocate new chunk";
> goto fail;
> @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
> page_end = PFN_UP(off + size);
>
> for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
> + unsigned int pcpu_scope;
> +
> WARN_ON(chunk->immutable);
>
> + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
> ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
> + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>
> spin_lock_irqsave(&pcpu_lock, flags);
> if (ret) {
> --
> 2.50.1 (Apple Git-155)
>
--
Pedro
* Re: [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim
2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
@ 2026-05-29 9:34 ` Pedro Falcato
0 siblings, 0 replies; 10+ messages in thread
From: Pedro Falcato @ 2026-05-29 9:34 UTC (permalink / raw)
To: Kaitao Cheng
Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel,
muchun.song, Kaitao Cheng
On Thu, May 28, 2026 at 09:29:17PM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>
> pcpu_alloc_noprof() takes pcpu_alloc_mutex for sleepable allocations
> so that it can create chunks and populate backing pages. If reclaim is
> entered while that mutex is already held, and reclaim reaches a path
> which allocates percpu memory, the nested allocation can try to take
> pcpu_alloc_mutex again.
>
> That creates a reclaim recursion dependency:
>
> pcpu_alloc_noprof(GFP_KERNEL)
> mutex_lock(&pcpu_alloc_mutex)
> reclaim
> pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS)
> mutex_lock(&pcpu_alloc_mutex)
>
> Avoid this by treating percpu allocations from reclaim context as atomic.
> Such allocations may still be served from already available and populated
> areas, but they must not enter the mutex-protected slow path or create new
> chunks. If no space is available, fail the allocation and let the normal
> balance work handle replenishment outside reclaim.
>
> Update the function comment to describe that reclaim context allocations
> are atomic regardless of whether the supplied GFP mask would otherwise
> allow blocking.
>
> This patch is a preventive fix. There may not currently be any path that
> calls pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) from direct reclaim context.
I don't like this. The proper way of fixing this would probably be to release
pcpu_alloc_mutex (or not have it in the first place!) while you're allocating
memory.
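One way to read this suggestion, sketched in userspace C with pthreads: drop the lock around the allocation, then retake it and recheck before publishing the result. Everything here is hypothetical and only for illustration (struct model_pool, model_refill() and the single spare slot); it is not a proposal for how mm/percpu.c should actually be restructured.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical pool with one spare chunk, protected by a mutex. */
struct model_pool {
	pthread_mutex_t lock;
	void *spare;
};

/* Returns 0 if a spare chunk is available afterwards, -1 on failure. */
int model_refill(struct model_pool *pool, size_t chunk_size)
{
	void *chunk;

	pthread_mutex_lock(&pool->lock);
	if (pool->spare) {
		pthread_mutex_unlock(&pool->lock);
		return 0;
	}
	pthread_mutex_unlock(&pool->lock);

	/*
	 * Allocate with the lock dropped, so anything the allocator has
	 * to do on our behalf can never wait on pool->lock.
	 */
	chunk = malloc(chunk_size);
	if (!chunk)
		return -1;

	pthread_mutex_lock(&pool->lock);
	if (pool->spare) {
		/* Somebody else refilled while we were unlocked. */
		pthread_mutex_unlock(&pool->lock);
		free(chunk);
		return 0;
	}
	pool->spare = chunk;
	pthread_mutex_unlock(&pool->lock);
	return 0;
}
```

The cost of this pattern is the recheck after relocking and an occasional wasted allocation when two callers race; the benefit is that no allocation happens inside the critical section.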
>
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
> mm/percpu.c | 13 +++++++++----
> 1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 1bb38467390b..9c30e5897813 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1803,9 +1803,9 @@ static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
> * @gfp: allocation flags
> *
> * Allocate percpu area of @size bytes aligned at @align. If @gfp doesn't
> - * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
> - * then no warning will be triggered on invalid or failed allocation
> - * requests.
> + * allow blocking, or if allocation is requested from reclaim context, the
> + * allocation is atomic. If @gfp has __GFP_NOWARN then no warning will be
> + * triggered on invalid or failed allocation requests.
> *
> * RETURNS:
> * Percpu pointer to the allocated area on success, NULL on failure.
> @@ -1828,7 +1828,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
> gfp = current_gfp_context(gfp);
> /* whitelisted flags that can be passed to the backing allocators */
> pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
> - is_atomic = !gfpflags_allow_blocking(gfp);
> + /*
> + * Reclaim can be entered while pcpu_alloc_mutex is already held by
> + * another percpu allocation. Avoid recursing back into the mutex from
> + * reclaim; best-effort allocations from already populated areas are OK.
> + */
since this is an entirely theoretical issue:
/* Reclaim paths should not be hitting the percpu allocator, for now */
if (WARN_ON_ONCE(current->reclaim_state))
return NULL;
But that's just my 2c.
> + is_atomic = !gfpflags_allow_blocking(gfp) || current->reclaim_state;
> do_warn = !(gfp & __GFP_NOWARN);
>
> /*
> --
> 2.50.1 (Apple Git-155)
>
--
Pedro
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
2026-05-29 9:25 ` Pedro Falcato
@ 2026-05-29 9:38 ` Pedro Falcato
2026-05-30 12:47 ` Kaitao Cheng
0 siblings, 1 reply; 10+ messages in thread
From: Pedro Falcato @ 2026-05-29 9:38 UTC (permalink / raw)
To: Kaitao Cheng
Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel,
muchun.song, Kaitao Cheng
On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
> > From: Kaitao Cheng <chengkaitao@kylinos.cn>
> >
> > pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> > passes it to the backing percpu allocators. This preserves GFP_NOFS and
> > GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
> >
> > However, the chunk creation and population slow paths also call helpers
> > which do not take a GFP mask and perform internal allocations with
> > GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> > and population can allocate temporary metadata or page tables while mapping
> > backing pages. As a result, a caller which explicitly uses GFP_NOFS or
> > GFP_NOIO can still enter FS or IO reclaim while creating or populating a
> > percpu chunk.
> >
> > This is problematic for callers which use GFP_NOFS or GFP_NOIO because
> > they are already holding filesystem or IO-path locks. If free chunks are
> > exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
> > unconstrained reclaim from these internal allocations, defeating the
> > caller's allocation context and potentially recreating reclaim lock
> > dependencies.
> >
> > Wrap chunk creation and population in a scoped NOIO or NOFS context when
> > pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
> > allocations unchanged so they retain full reclaim capability.
> >
> > Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
>
> I assume you _did not_ observe this in production? As in no reclaim path should be
> insane^W daring enough to do pcpu allocations?
Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
"git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are
using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
NOFS or NOIO as far as I can see. So you probably did not observe this?
>
> > Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> > ---
> > mm/percpu.c | 26 ++++++++++++++++++++++++++
> > 1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 71a85d7245c7..1bb38467390b 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
> > }
> > #endif
> >
> > +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
> > +{
> > + if (!(gfp & __GFP_IO))
> > + return memalloc_noio_save();
> > + if (!(gfp & __GFP_FS))
> > + return memalloc_nofs_save();
> > + return 0;
> > +}
> > +
> > +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
> > +{
> > + if (!(gfp & __GFP_IO))
> > + memalloc_noio_restore(flags);
> > + else if (!(gfp & __GFP_FS))
> > + memalloc_nofs_restore(flags);
> > +}
>
> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
> mostly the vmalloc backend that's problematic.
>
> > +
> > /**
> > * pcpu_alloc - the percpu allocator
> > * @size: size of area to allocate in bytes
> > @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
> >
> > /* No space left. Create a new chunk. */
> > if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
> > + unsigned int pcpu_scope;
> > +
> > + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
> > chunk = pcpu_create_chunk(pcpu_gfp);
> > + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
> > +
> > if (!chunk) {
> > err = "failed to allocate new chunk";
> > goto fail;
> > @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
> > page_end = PFN_UP(off + size);
> >
> > for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
> > + unsigned int pcpu_scope;
> > +
> > WARN_ON(chunk->immutable);
> >
> > + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
> > ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
> > + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
> >
> > spin_lock_irqsave(&pcpu_lock, flags);
> > if (ret) {
> > --
> > 2.50.1 (Apple Git-155)
> >
>
> --
> Pedro
--
Pedro
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
2026-05-29 9:38 ` Pedro Falcato
@ 2026-05-30 12:47 ` Kaitao Cheng
2026-05-30 13:32 ` Dennis Zhou
0 siblings, 1 reply; 10+ messages in thread
From: Kaitao Cheng @ 2026-05-30 12:47 UTC (permalink / raw)
To: Pedro Falcato, akpm
Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel,
muchun.song, Kaitao Cheng
On 2026/5/29 17:38, Pedro Falcato wrote:
> On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
>> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>
>>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
>>> passes it to the backing percpu allocators. This preserves GFP_NOFS and
>>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
>>>
>>> However, the chunk creation and population slow paths also call helpers
>>> which do not take a GFP mask and perform internal allocations with
>>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
>>> and population can allocate temporary metadata or page tables while mapping
>>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
>>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
>>> percpu chunk.
>>>
>>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
>>> they are already holding filesystem or IO-path locks. If free chunks are
>>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
>>> unconstrained reclaim from these internal allocations, defeating the
>>> caller's allocation context and potentially recreating reclaim lock
>>> dependencies.
>>>
>>> Wrap chunk creation and population in a scoped NOIO or NOFS context when
>>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
>>> allocations unchanged so they retain full reclaim capability.
>>>
>>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
>>
>> I assume you _did not_ observe this in production? As in no reclaim path should be
>> insane^W daring enough to do pcpu allocations?
>
> Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
> "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are
> using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
> NOFS or NOIO as far as I can see. So you probably did not observe this?

Right, this issue has not been observed in production. It came from a
question raised by AI code review, and after carefully reading the code,
I found that there are indeed some synchronization concerns.

Here is one example of the scenario [PATCH 1/2] is trying to address:
blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while
configuring policy"). blkg_conf_prep() now serializes against
blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was
changed to GFP_NOIO for that reason:

CPU0: blkg_conf_prep()
        mutex_lock(q->blkcg_mutex)
        blkg_alloc(..., GFP_NOIO)
          alloc_percpu_gfp(..., GFP_NOIO)
            -> if percpu chunks are exhausted, chunk create/populate may do
               internal GFP_KERNEL allocations
            -> direct reclaim / writeback can issue IO to this queue
            -> IO waits because the queue is frozen

CPU1: blkcg_deactivate_policy()
        blk_mq_freeze_queue(q)
        mutex_lock(q->blkcg_mutex)
          -> waits for CPU0
        ... unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock which can be acquired after queue freeze, but the percpu
slow path can temporarily lose that allocation context. The failure
requires the slow path (new chunk creation or population), so it is not
expected to be common.

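To make the dependency chain above easier to check, here is a tiny wait-for model in userspace C. It is only a sketch under stated assumptions: the four node names and model_has_cycle() are invented for this illustration, each node waits on at most one other node, and nothing here touches real block layer state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors in the scenario above. */
enum model_node {
	MODEL_CPU0_ALLOC,	/* blkg_conf_prep() holding q->blkcg_mutex */
	MODEL_WRITEBACK_IO,	/* IO issued by direct reclaim */
	MODEL_QUEUE_UNFREEZE,	/* queue leaving the frozen state */
	MODEL_CPU1_DEACTIVATE,	/* blkcg_deactivate_policy() */
	MODEL_NODES
};

#define MODEL_NOT_WAITING (-1)

/*
 * waits_for[a] == b means a cannot make progress until b does.
 * Follow the chain from start; a deadlock shows up as a return to start.
 */
bool model_has_cycle(const int waits_for[MODEL_NODES], int start)
{
	int cur = start;

	for (int steps = 0; steps < MODEL_NODES; steps++) {
		cur = waits_for[cur];
		if (cur == MODEL_NOT_WAITING)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

If the internal allocations ignore GFP_NOIO, CPU0 waits on the writeback IO, the IO waits on the unfreeze, the unfreeze waits on CPU1, and CPU1 waits on CPU0, so the chain closes. If the NOIO context is honored, CPU0 waits on nothing in this model and the chain is open.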
>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>> ---
>>> mm/percpu.c | 26 ++++++++++++++++++++++++++
>>> 1 file changed, 26 insertions(+)
>>>
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index 71a85d7245c7..1bb38467390b 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>>> }
>>> #endif
>>>
>>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
>>> +{
>>> +	if (!(gfp & __GFP_IO))
>>> +		return memalloc_noio_save();
>>> +	if (!(gfp & __GFP_FS))
>>> +		return memalloc_nofs_save();
>>> +	return 0;
>>> +}
>>> +
>>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
>>> +{
>>> +	if (!(gfp & __GFP_IO))
>>> +		memalloc_noio_restore(flags);
>>> +	else if (!(gfp & __GFP_FS))
>>> +		memalloc_nofs_restore(flags);
>>> +}
>>
>> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
>> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
>> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
>> mostly the vmalloc backend that's problematic.
I’ll try to do it.
Following your suggestion, including in [PATCH 2/2], I will also try a
different approach and fix the issue by reducing the scope of the
pcpu_alloc_mutex critical section.
>>> /**
>>> * pcpu_alloc - the percpu allocator
>>> * @size: size of area to allocate in bytes
>>> @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>>>
>>>  	/* No space left. Create a new chunk. */
>>>  	if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
>>> +		unsigned int pcpu_scope;
>>> +
>>> +		pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>>>  		chunk = pcpu_create_chunk(pcpu_gfp);
>>> +		pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>>> +
>>>  		if (!chunk) {
>>>  			err = "failed to allocate new chunk";
>>>  			goto fail;
>>> @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>>>  		page_end = PFN_UP(off + size);
>>>
>>>  		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
>>> +			unsigned int pcpu_scope;
>>> +
>>>  			WARN_ON(chunk->immutable);
>>>
>>> +			pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>>>  			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
>>> +			pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>>>
>>>  			spin_lock_irqsave(&pcpu_lock, flags);
>>>  			if (ret) {
>>> --
>>> 2.50.1 (Apple Git-155)
>>>
--
Thanks
Kaitao Cheng
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
2026-05-30 12:47 ` Kaitao Cheng
@ 2026-05-30 13:32 ` Dennis Zhou
0 siblings, 0 replies; 10+ messages in thread
From: Dennis Zhou @ 2026-05-30 13:32 UTC (permalink / raw)
To: Kaitao Cheng
Cc: Pedro Falcato, akpm, dennis, tj, cl, mhocko, vbabka, linux-mm,
linux-kernel, muchun.song, Kaitao Cheng
Hello,
On Sat, May 30, 2026 at 08:47:34PM +0800, Kaitao Cheng wrote:
> On 2026/5/29 17:38, Pedro Falcato wrote:
> > On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
> >> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
> >>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>>
> >>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> >>> passes it to the backing percpu allocators. This preserves GFP_NOFS and
> >>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
> >>>
> >>> However, the chunk creation and population slow paths also call helpers
> >>> which do not take a GFP mask and perform internal allocations with
> >>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> >>> and population can allocate temporary metadata or page tables while mapping
> >>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
> >>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
> >>> percpu chunk.
> >>>
> >>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
> >>> they are already holding filesystem or IO-path locks. If free chunks are
> >>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
> >>> unconstrained reclaim from these internal allocations, defeating the
> >>> caller's allocation context and potentially recreating reclaim lock
> >>> dependencies.
> >>>
> >>> Wrap chunk creation and population in a scoped NOIO or NOFS context when
> >>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
> >>> allocations unchanged so they retain full reclaim capability.
> >>>
> >>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> >>
> >> I assume you _did not_ observe this in production? As in no reclaim path should be
> >> insane^W daring enough to do pcpu allocations?
> >
> > Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
> > "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are
> > using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
> > NOFS or NOIO as far as I can see. So you probably did not observe this?
>
> Right, this issue has not been observed in production. It came from a
> question raised by AI code review, and after carefully reading the code,
> I found that there are indeed some synchronization concerns.
>
> Here is one example of the scenario [PATCH 1/2] is trying to address:
>
> blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while
> configuring policy"). blkg_conf_prep() now serializes against
> blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was
> changed to GFP_NOIO for that reason:
>
> CPU0: blkg_conf_prep()
>         mutex_lock(q->blkcg_mutex)
>         blkg_alloc(..., GFP_NOIO)
>           alloc_percpu_gfp(..., GFP_NOIO)
>             -> if percpu chunks are exhausted, chunk create/populate may do
>                internal GFP_KERNEL allocations
>             -> direct reclaim / writeback can issue IO to this queue
>             -> IO waits because the queue is frozen
>
> CPU1: blkcg_deactivate_policy()
>         blk_mq_freeze_queue(q)
>         mutex_lock(q->blkcg_mutex)
>           -> waits for CPU0
>         ... unfreeze only happens after q->blkcg_mutex is acquired/released
>
> So the concern is that the caller deliberately uses GFP_NOIO because it
> may hold a lock which can be acquired after queue freeze, but the percpu
> slow path can temporarily lose that allocation context. The failure
> requires the slow path (new chunk creation or population), so it is not
> expected to be common.

This seems like it's just a miss from [1], where we switched to
gfpflags_allow_blocking(), a less conservative approach than
atomic == !GFP_KERNEL. If anything, we just need to allow the additional
flags through. Maybe just add GFP_NOIO and GFP_NOFS to pcpu_gfp via the
whitelist.

[1] 9a5b183941b ("mm, percpu: do not consider sleepable allocations atomic")
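
For readers unfamiliar with the whitelist idea, the following is a minimal userspace sketch of masking a caller's flags against an allowed set before handing them to a backing allocator. The bit values, the model_* names, and the contents of MODEL_PCPU_WHITELIST are assumptions made for illustration; the actual whitelist in mm/percpu.c is not reproduced here, and the sketch does not settle whether it needs to change.

```c
#include <assert.h>

/* Hypothetical bit values for this model only. */
#define MODEL_GFP_IO      0x01u
#define MODEL_GFP_FS      0x02u
#define MODEL_GFP_RECLAIM 0x04u
#define MODEL_GFP_NOWARN  0x08u
#define MODEL_GFP_OTHER   0x10u /* stands in for a flag outside the whitelist */

#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO   (MODEL_GFP_RECLAIM)

/* Assumed whitelist for the model; not copied from mm/percpu.c. */
#define MODEL_PCPU_WHITELIST (MODEL_GFP_KERNEL | MODEL_GFP_NOWARN)

/* Keep only the whitelisted bits of the caller's mask. */
static unsigned int model_pcpu_gfp(unsigned int gfp)
{
	return gfp & MODEL_PCPU_WHITELIST;
}
```

In this model, a flag outside the whitelist is dropped, while the bits a NOFS or NOIO style mask leaves cleared stay cleared after masking.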
>
> >>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>> ---
> >>> mm/percpu.c | 26 ++++++++++++++++++++++++++
> >>> 1 file changed, 26 insertions(+)
> >>>
> >>> diff --git a/mm/percpu.c b/mm/percpu.c
> >>> index 71a85d7245c7..1bb38467390b 100644
> >>> --- a/mm/percpu.c
> >>> +++ b/mm/percpu.c
> >>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
> >>> }
> >>> #endif
> >>>
> >>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
> >>> +{
> >>> +	if (!(gfp & __GFP_IO))
> >>> +		return memalloc_noio_save();
> >>> +	if (!(gfp & __GFP_FS))
> >>> +		return memalloc_nofs_save();
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
> >>> +{
> >>> +	if (!(gfp & __GFP_IO))
> >>> +		memalloc_noio_restore(flags);
> >>> +	else if (!(gfp & __GFP_FS))
> >>> +		memalloc_nofs_restore(flags);
> >>> +}
> >>
> >> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
> >> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
> >> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
> >> mostly the vmalloc backend that's problematic.
>
> I’ll try to do it.
>
> Following your suggestion, including in [PATCH 2/2], I will also try a
> different approach and fix the issue by reducing the scope of the
> pcpu_alloc_mutex critical section.
>
No, please don't. The point of the percpu mutex is to ensure that only
one caller can ever be creating a new chunk at a time. If you drop the
mutex, then you have to deal with concurrent callers when available
percpu memory is low. Percpu memory is expensive and unmovable, so we
accept the cost in the control plane to avoid excess fragmentation.
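
As a rough illustration of the serialization argument, here is a minimal userspace sketch using pthreads in which several callers race on an empty pool and only the first one through the mutex creates a chunk. All model_* names and the slot count are hypothetical; this is a simplification of the idea, not the percpu allocator's actual structure.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MODEL_THREADS     8
#define MODEL_CHUNK_SLOTS 64

static pthread_mutex_t model_alloc_mutex = PTHREAD_MUTEX_INITIALIZER;
static int model_free_slots;     /* slots left in already created chunks */
static int model_chunks_created; /* how many times the slow path ran */

/* Take one slot; only a caller that finds the pool empty creates a chunk. */
static void *model_alloc_one(void *unused)
{
	(void)unused;

	pthread_mutex_lock(&model_alloc_mutex);
	if (model_free_slots == 0) {
		/* Slow path: runs with the mutex held, so it cannot race. */
		model_chunks_created++;
		model_free_slots = MODEL_CHUNK_SLOTS;
	}
	model_free_slots--;
	pthread_mutex_unlock(&model_alloc_mutex);
	return NULL;
}

/* Start from an empty pool, race MODEL_THREADS callers, report chunk count. */
static int model_run(void)
{
	pthread_t threads[MODEL_THREADS];
	int i;

	model_free_slots = 0;
	model_chunks_created = 0;
	for (i = 0; i < MODEL_THREADS; i++)
		pthread_create(&threads[i], NULL, model_alloc_one, NULL);
	for (i = 0; i < MODEL_THREADS; i++)
		pthread_join(threads[i], NULL);
	return model_chunks_created;
}
```

Because the emptiness check and the chunk creation happen under the same mutex, later callers see the slots added by the first one. If the creation step ran outside the mutex, several callers could each observe an empty pool and each create a chunk.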
Thanks,
Dennis
Thread overview: 10+ messages
2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
2026-05-29 9:25 ` Pedro Falcato
2026-05-29 9:38 ` Pedro Falcato
2026-05-30 12:47 ` Kaitao Cheng
2026-05-30 13:32 ` Dennis Zhou
2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
2026-05-29 9:34 ` Pedro Falcato
2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton
2026-05-28 21:10 ` Andrew Morton