[PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion

* [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
@ 2026-05-28 13:29 Kaitao Cheng
  2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw)
  To: dennis, tj, cl, akpm
  Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng

Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
pcpu_alloc_mutex and the chunk creation slow path. This restored the
allocation capability that was lost when those constrained allocations
were treated as atomic, but it also opens two possible reclaim recursion
problems.

The first problem is that the create and populate slow paths do not fully
preserve the caller's allocation constraints. pcpu_alloc_noprof() derives
pcpu_gfp from the caller supplied GFP mask and passes it to the backing
page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(),
and population can allocate temporary metadata or page tables while mapping
backing pages. Those internal allocations can use GFP_KERNEL. A caller
using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or
IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's
allocation context.

The second problem is a possible pcpu_alloc_mutex recursion from reclaim.
If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim
reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO,
the nested allocation can now try to take pcpu_alloc_mutex again because
9a5b183941b5 no longer treats those masks as atomic.

Another possible way to avoid these issues is to revert 9a5b183941b5.
However, that would also bring back the premature allocation failures for
sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended
to fix.

Kaitao Cheng (2):
  mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim

 mm/percpu.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

-- 
2.50.1 (Apple Git-155)


* [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
@ 2026-05-28 13:29 ` Kaitao Cheng
  2026-05-29  9:25   ` Pedro Falcato
  2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw)
  To: dennis, tj, cl, akpm
  Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
passes it to the backing percpu allocators. This preserves GFP_NOFS and
GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.

However, the chunk creation and population slow paths also call helpers
which do not take a GFP mask and perform internal allocations with
GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
and population can allocate temporary metadata or page tables while mapping
backing pages. As a result, a caller which explicitly uses GFP_NOFS or
GFP_NOIO can still enter FS or IO reclaim while creating or populating a
percpu chunk.

This is problematic for callers which use GFP_NOFS or GFP_NOIO because
they are already holding filesystem or IO-path locks. If free chunks are
exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
unconstrained reclaim from these internal allocations, defeating the
caller's allocation context and potentially recreating reclaim lock
dependencies.

Wrap chunk creation and population in a scoped NOIO or NOFS context when
pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
allocations unchanged so they retain full reclaim capability.
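The effect of the scope can be illustrated with a small userspace model.
This is only a sketch, not kernel code: the bit values, the M_ prefixed
names, task_flags and every helper below are hypothetical stand-ins, and
only the relationships between the masks (NOIO clears both IO and FS, NOFS
clears FS only) are meant to mirror the kernel. It shows a helper that
hard-codes GFP_KERNEL seeing a narrowed mask once the caller's constraint
has been turned into a task scope.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int mgfp_t;

/* Hypothetical bit values; only the relationships mirror the kernel's. */
#define M_GFP_IO	0x1u	/* models __GFP_IO */
#define M_GFP_FS	0x2u	/* models __GFP_FS */
#define M_GFP_RECLAIM	0x4u	/* models __GFP_RECLAIM */
#define M_GFP_NOIO	(M_GFP_RECLAIM)
#define M_GFP_NOFS	(M_GFP_RECLAIM | M_GFP_IO)
#define M_GFP_KERNEL	(M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)

#define M_PF_NOIO	0x1u	/* models PF_MEMALLOC_NOIO */
#define M_PF_NOFS	0x2u	/* models PF_MEMALLOC_NOFS */

/* Stand-in for current->flags. */
static unsigned int task_flags;

/* Set one scope flag and return its previous state, like memalloc_*_save(). */
static unsigned int scope_save(unsigned int flag)
{
	unsigned int old = task_flags & flag;

	task_flags |= flag;
	return old;
}

/* Put the scope flag back to its saved state, like memalloc_*_restore(). */
static void scope_restore(unsigned int flag, unsigned int old)
{
	task_flags = (task_flags & ~flag) | old;
}

/* Models current_gfp_context(): the task scope narrows the requested mask. */
static mgfp_t gfp_context(mgfp_t gfp)
{
	if (task_flags & M_PF_NOIO)
		gfp &= ~(M_GFP_IO | M_GFP_FS);
	else if (task_flags & M_PF_NOFS)
		gfp &= ~M_GFP_FS;
	return gfp;
}

/* Pick the scope flag that matches the caller's mask, or 0 for none. */
static unsigned int scope_for(mgfp_t gfp)
{
	if (!(gfp & M_GFP_IO))
		return M_PF_NOIO;
	if (!(gfp & M_GFP_FS))
		return M_PF_NOFS;
	return 0;
}

/*
 * Return the mask that a helper hard-coding GFP_KERNEL effectively
 * allocates with, when the slow path was entered with caller_gfp.
 */
static mgfp_t internal_alloc_mask(mgfp_t caller_gfp, bool use_scope)
{
	unsigned int before = task_flags;
	unsigned int flag = use_scope ? scope_for(caller_gfp) : 0;
	unsigned int old = scope_save(flag);
	mgfp_t seen = gfp_context(M_GFP_KERNEL);

	scope_restore(flag, old);
	assert(task_flags == before);	/* the scope must not leak */
	return seen;
}
```

In this model, internal_alloc_mask(M_GFP_NOIO, false) yields the full
M_GFP_KERNEL mask (the caller's constraint is lost), while passing true
yields M_GFP_NOIO. A M_GFP_KERNEL caller is unaffected either way.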

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/percpu.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index 71a85d7245c7..1bb38467390b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
 }
 #endif

+static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
+{
+	if (!(gfp & __GFP_IO))
+		return memalloc_noio_save();
+	if (!(gfp & __GFP_FS))
+		return memalloc_nofs_save();
+	return 0;
+}
+
+static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
+{
+	if (!(gfp & __GFP_IO))
+		memalloc_noio_restore(flags);
+	else if (!(gfp & __GFP_FS))
+		memalloc_nofs_restore(flags);
+}
+
 /**
  * pcpu_alloc - the percpu allocator
  * @size: size of area to allocate in bytes
@@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,

 	/* No space left.  Create a new chunk. */
 	if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
+		unsigned int pcpu_scope;
+
+		pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
 		chunk = pcpu_create_chunk(pcpu_gfp);
+		pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
+
 		if (!chunk) {
 			err = "failed to allocate new chunk";
 			goto fail;
@@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 		page_end = PFN_UP(off + size);

 		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
+			unsigned int pcpu_scope;
+
 			WARN_ON(chunk->immutable);

+			pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
 			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
+			pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);

 			spin_lock_irqsave(&pcpu_lock, flags);
 			if (ret) {
-- 
2.50.1 (Apple Git-155)


* [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim
  2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
  2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
@ 2026-05-28 13:29 ` Kaitao Cheng
  2026-05-29  9:34   ` Pedro Falcato
  2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton
  2026-05-28 21:10 ` Andrew Morton
  3 siblings, 1 reply; 18+ messages in thread
From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw)
  To: dennis, tj, cl, akpm
  Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() takes pcpu_alloc_mutex for sleepable allocations
so that it can create chunks and populate backing pages. If reclaim is
entered while that mutex is already held, and reclaim reaches a path
which allocates percpu memory, the nested allocation can try to take
pcpu_alloc_mutex again.

That creates a reclaim recursion dependency:

  pcpu_alloc_noprof(GFP_KERNEL)
    mutex_lock(&pcpu_alloc_mutex)
    reclaim
      pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS)
        mutex_lock(&pcpu_alloc_mutex)

Avoid this by treating percpu allocations from reclaim context as atomic.
Such allocations may still be served from already available and populated
areas, but they must not enter the mutex-protected slow path or create new
chunks. If no space is available, fail the allocation and let the normal
balance work handle replenishment outside reclaim.
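The decision described above can be sketched in userspace as follows. This
is a simplified model under stated assumptions, not the kernel
implementation: the M_ prefixed bit values, struct model_mutex and all
helpers are hypothetical, the in_reclaim argument stands in for
current->reclaim_state, and a failed trylock stands in for a task blocking
on a mutex it already holds.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int mgfp_t;

/* Hypothetical bit values; only the relationships mirror the kernel's. */
#define M_GFP_DIRECT_RECLAIM	0x1u
#define M_GFP_KSWAPD_RECLAIM	0x2u
#define M_GFP_IO		0x4u
#define M_GFP_HIGH		0x8u
#define M_GFP_ATOMIC	(M_GFP_HIGH | M_GFP_KSWAPD_RECLAIM)
#define M_GFP_NOFS	(M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM | M_GFP_IO)

/* Stand-in for pcpu_alloc_mutex; not a real lock. */
struct model_mutex {
	bool held;
};

/* Returns false where a real mutex_lock() would block forever. */
static bool model_mutex_trylock(struct model_mutex *m)
{
	if (m->held)
		return false;
	m->held = true;
	return true;
}

static void model_mutex_unlock(struct model_mutex *m)
{
	assert(m->held);
	m->held = false;
}

/* Models gfpflags_allow_blocking(): blocking needs direct reclaim. */
static bool model_allow_blocking(mgfp_t gfp)
{
	return gfp & M_GFP_DIRECT_RECLAIM;
}

/* The is_atomic decision, before and after the change in this patch. */
static bool model_is_atomic(mgfp_t gfp, bool in_reclaim, bool patched)
{
	bool atomic = !model_allow_blocking(gfp);

	if (patched)
		atomic = atomic || in_reclaim;
	return atomic;
}

/* Does a nested allocation issued from reclaim block on the mutex? */
static bool model_nested_alloc_deadlocks(mgfp_t nested_gfp, bool patched)
{
	struct model_mutex pcpu_alloc_mutex = { .held = false };
	bool deadlock = false;

	/* Outer sleepable allocation takes the mutex, then enters reclaim. */
	(void)model_mutex_trylock(&pcpu_alloc_mutex);

	/* Only a non-atomic nested allocation goes for the mutex again. */
	if (!model_is_atomic(nested_gfp, true, patched))
		deadlock = !model_mutex_trylock(&pcpu_alloc_mutex);

	model_mutex_unlock(&pcpu_alloc_mutex);
	return deadlock;
}
```

With patched set to false, a nested M_GFP_NOFS allocation reaches the mutex
a second time; with patched set to true it is classified as atomic and never
touches the mutex. A M_GFP_ATOMIC request stays off the mutex in both cases.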

Update the function comment to describe that reclaim context allocations
are atomic regardless of whether the supplied GFP mask would otherwise
allow blocking.

This patch is a preventive fix. There may not currently be any path that
calls pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) from direct reclaim context.

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/percpu.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 1bb38467390b..9c30e5897813 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1803,9 +1803,9 @@ static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
  * @gfp: allocation flags
  *
  * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
- * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
- * then no warning will be triggered on invalid or failed allocation
- * requests.
+ * allow blocking, or if allocation is requested from reclaim context, the
+ * allocation is atomic. If @gfp has __GFP_NOWARN then no warning will be
+ * triggered on invalid or failed allocation requests.
  *
  * RETURNS:
  * Percpu pointer to the allocated area on success, NULL on failure.
@@ -1828,7 +1828,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 	gfp = current_gfp_context(gfp);
 	/* whitelisted flags that can be passed to the backing allocators */
 	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
-	is_atomic = !gfpflags_allow_blocking(gfp);
+	/*
+	 * Reclaim can be entered while pcpu_alloc_mutex is already held by
+	 * another percpu allocation. Avoid recursing back into the mutex from
+	 * reclaim; best-effort allocations from already populated areas are OK.
+	 */
+	is_atomic = !gfpflags_allow_blocking(gfp) || current->reclaim_state;
 	do_warn = !(gfp & __GFP_NOWARN);
 
 	/*
-- 
2.50.1 (Apple Git-155)




* Re: [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
  2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
  2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
  2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
@ 2026-05-28 21:09 ` Andrew Morton
  2026-05-28 21:10 ` Andrew Morton
  3 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2026-05-28 21:09 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel,
	muchun.song

On Thu, 28 May 2026 21:29:15 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
> atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
> pcpu_alloc_mutex and the chunk creation slow path. This restored the
> allocation capability that was lost when those constrained allocations
> were treated as atomic, but it also opens two possible reclaim recursion
> problems.
> 
> The first problem is that the create and populate slow paths do not fully
> preserve the caller's allocation constraints. pcpu_alloc_noprof() derives
> pcpu_gfp from the caller supplied GFP mask and passes it to the backing
> page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> and population can allocate temporary metadata or page tables while mapping
> backing pages. Those internal allocations can use GFP_KERNEL. A caller
> using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or
> IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's
> allocation context.
> 
> The second problem is a possible pcpu_alloc_mutex recursion from reclaim.
> If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim
> reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO,
> the nested allocation can now try to take pcpu_alloc_mutex again because
> 9a5b183941b5 no longer treats those masks as atomic.
> 
> Another possible way to avoid these issues is to revert 9a5b183941b5.
> However, that would also bring back the premature allocation failures for
> sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended
> to fix.

Thanks.

9a5b183941b5 has been in there for a year.  How are you observing/triggering this bug
and what are the userspace-visible effects?

We might choose to backport fixes into -stable kernels, but this additional info
is needed to make that determination.



* Re: [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
  2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
                   ` (2 preceding siblings ...)
  2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton
@ 2026-05-28 21:10 ` Andrew Morton
  3 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2026-05-28 21:10 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel,
	muchun.song

On Thu, 28 May 2026 21:29:15 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
> atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
> pcpu_alloc_mutex and the chunk creation slow path. This restored the
> allocation capability that was lost when those constrained allocations
> were treated as atomic, but it also opens two possible reclaim recursion
> problems.
> 
> The first problem is that the create and populate slow paths do not fully
> preserve the caller's allocation constraints. pcpu_alloc_noprof() derives
> pcpu_gfp from the caller supplied GFP mask and passes it to the backing
> page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> and population can allocate temporary metadata or page tables while mapping
> backing pages. Those internal allocations can use GFP_KERNEL. A caller
> using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or
> IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's
> allocation context.
> 
> The second problem is a possible pcpu_alloc_mutex recursion from reclaim.
> If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim
> reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO,
> the nested allocation can now try to take pcpu_alloc_mutex again because
> 9a5b183941b5 no longer treats those masks as atomic.
> 
> Another possible way to avoid these issues is to revert 9a5b183941b5.
> However, that would also bring back the premature allocation failures for
> sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended
> to fix.

AI review might have found a couple of pre-existing issues:
	https://sashiko.dev/#/patchset/20260528132917.81123-1-kaitao.cheng@linux.dev



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
@ 2026-05-29  9:25   ` Pedro Falcato
  2026-05-29  9:38     ` Pedro Falcato
  0 siblings, 1 reply; 18+ messages in thread
From: Pedro Falcato @ 2026-05-29  9:25 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel,
	muchun.song, Kaitao Cheng

On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> passes it to the backing percpu allocators. This preserves GFP_NOFS and
> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
> 
> However, the chunk creation and population slow paths also call helpers
> which do not take a GFP mask and perform internal allocations with
> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> and population can allocate temporary metadata or page tables while mapping
> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
> percpu chunk.
> 
> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
> they are already holding filesystem or IO-path locks. If free chunks are
> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
> unconstrained reclaim from these internal allocations, defeating the
> caller's allocation context and potentially recreating reclaim lock
> dependencies.
> 
> Wrap chunk creation and population in a scoped NOIO or NOFS context when
> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
> allocations unchanged so they retain full reclaim capability.
> 
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")

I assume you _did not_ observe this in production? As in no reclaim path should be
insane^W daring enough to do pcpu allocations?

> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  mm/percpu.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 71a85d7245c7..1bb38467390b 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>  }
>  #endif
>  
> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
> +{
> +	if (!(gfp & __GFP_IO))
> +		return memalloc_noio_save();
> +	if (!(gfp & __GFP_FS))
> +		return memalloc_nofs_save();
> +	return 0;
> +}
> +
> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
> +{
> +	if (!(gfp & __GFP_IO))
> +		memalloc_noio_restore(flags);
> +	else if (!(gfp & __GFP_FS))
> +		memalloc_nofs_restore(flags);
> +}

I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
and pcpu_populate_chunk(). It's their job to respect the gfp flags and
Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
mostly the vmalloc backend that's problematic.

> +
>  /**
>   * pcpu_alloc - the percpu allocator
>   * @size: size of area to allocate in bytes
> @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>  
>  	/* No space left.  Create a new chunk. */
>  	if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
> +		unsigned int pcpu_scope;
> +
> +		pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>  		chunk = pcpu_create_chunk(pcpu_gfp);
> +		pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
> +
>  		if (!chunk) {
>  			err = "failed to allocate new chunk";
>  			goto fail;
> @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>  		page_end = PFN_UP(off + size);
>  
>  		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
> +			unsigned int pcpu_scope;
> +
>  			WARN_ON(chunk->immutable);
>  
> +			pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>  			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
> +			pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>  
>  			spin_lock_irqsave(&pcpu_lock, flags);
>  			if (ret) {
> -- 
> 2.50.1 (Apple Git-155)
> 

-- 
Pedro



* Re: [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim
  2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
@ 2026-05-29  9:34   ` Pedro Falcato
  0 siblings, 0 replies; 18+ messages in thread
From: Pedro Falcato @ 2026-05-29  9:34 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel,
	muchun.song, Kaitao Cheng

On Thu, May 28, 2026 at 09:29:17PM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> pcpu_alloc_noprof() takes pcpu_alloc_mutex for sleepable allocations
> so that it can create chunks and populate backing pages. If reclaim is
> entered while that mutex is already held, and reclaim reaches a path
> which allocates percpu memory, the nested allocation can try to take
> pcpu_alloc_mutex again.
> 
> That creates a reclaim recursion dependency:
> 
>   pcpu_alloc_noprof(GFP_KERNEL)
>     mutex_lock(&pcpu_alloc_mutex)
>     reclaim
>       pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS)
>         mutex_lock(&pcpu_alloc_mutex)
> 
> Avoid this by treating percpu allocations from reclaim context as atomic.
> Such allocations may still be served from already available and populated
> areas, but they must not enter the mutex-protected slow path or create new
> chunks. If no space is available, fail the allocation and let the normal
> balance work handle replenishment outside reclaim.
> 
> Update the function comment to describe that reclaim context allocations
> are atomic regardless of whether the supplied GFP mask would otherwise
> allow blocking.
> 
> This patch is a preventive fix. There may not currently be any path that
> calls pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) from direct reclaim context.

I don't like this. The proper way of fixing this would probably be to release
pcpu_alloc_mutex (or not have it in the first place!) while you're allocating
memory.

> 
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  mm/percpu.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 1bb38467390b..9c30e5897813 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1803,9 +1803,9 @@ static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
>   * @gfp: allocation flags
>   *
>   * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
> - * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
> - * then no warning will be triggered on invalid or failed allocation
> - * requests.
> + * allow blocking, or if allocation is requested from reclaim context, the
> + * allocation is atomic. If @gfp has __GFP_NOWARN then no warning will be
> + * triggered on invalid or failed allocation requests.
>   *
>   * RETURNS:
>   * Percpu pointer to the allocated area on success, NULL on failure.
> @@ -1828,7 +1828,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>  	gfp = current_gfp_context(gfp);
>  	/* whitelisted flags that can be passed to the backing allocators */
>  	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
> -	is_atomic = !gfpflags_allow_blocking(gfp);
> +	/*
> +	 * Reclaim can be entered while pcpu_alloc_mutex is already held by
> +	 * another percpu allocation. Avoid recursing back into the mutex from
> +	 * reclaim; best-effort allocations from already populated areas are OK.
> +	 */

since this is an entirely theoretical issue:

	/* Reclaim paths should not be hitting the percpu allocator, for now */
	if (WARN_ON_ONCE(current->reclaim_state))
		return NULL;

But that's just my 2c.

> +	is_atomic = !gfpflags_allow_blocking(gfp) || current->reclaim_state;
>  	do_warn = !(gfp & __GFP_NOWARN);
>  
>  	/*
> -- 
> 2.50.1 (Apple Git-155)
> 

-- 
Pedro



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-05-29  9:25   ` Pedro Falcato
@ 2026-05-29  9:38     ` Pedro Falcato
  2026-05-30 12:47       ` Kaitao Cheng
  0 siblings, 1 reply; 18+ messages in thread
From: Pedro Falcato @ 2026-05-29  9:38 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel,
	muchun.song, Kaitao Cheng

On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
> > From: Kaitao Cheng <chengkaitao@kylinos.cn>
> > 
> > pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> > passes it to the backing percpu allocators. This preserves GFP_NOFS and
> > GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
> > 
> > However, the chunk creation and population slow paths also call helpers
> > which do not take a GFP mask and perform internal allocations with
> > GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> > and population can allocate temporary metadata or page tables while mapping
> > backing pages. As a result, a caller which explicitly uses GFP_NOFS or
> > GFP_NOIO can still enter FS or IO reclaim while creating or populating a
> > percpu chunk.
> > 
> > This is problematic for callers which use GFP_NOFS or GFP_NOIO because
> > they are already holding filesystem or IO-path locks. If free chunks are
> > exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
> > unconstrained reclaim from these internal allocations, defeating the
> > caller's allocation context and potentially recreating reclaim lock
> > dependencies.
> > 
> > Wrap chunk creation and population in a scoped NOIO or NOFS context when
> > pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
> > allocations unchanged so they retain full reclaim capability.
> > 
> > Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> 
> I assume you _did not_ observe this in production? As in no reclaim path should be
> insane^W daring enough to do pcpu allocations?

Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
"git grep alloc_percpu_gfp" shows that the vast majority of (all?) callers are
using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
NOFS or NOIO as far as I can see. So you probably did not observe this?

> 
> > Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> > ---
> >  mm/percpu.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 71a85d7245c7..1bb38467390b 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
> >  }
> >  #endif
> >  
> > +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
> > +{
> > +	if (!(gfp & __GFP_IO))
> > +		return memalloc_noio_save();
> > +	if (!(gfp & __GFP_FS))
> > +		return memalloc_nofs_save();
> > +	return 0;
> > +}
> > +
> > +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
> > +{
> > +	if (!(gfp & __GFP_IO))
> > +		memalloc_noio_restore(flags);
> > +	else if (!(gfp & __GFP_FS))
> > +		memalloc_nofs_restore(flags);
> > +}
> 
> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
> mostly the vmalloc backend that's problematic.
> 
> > +
> >  /**
> >   * pcpu_alloc - the percpu allocator
> >   * @size: size of area to allocate in bytes
> > @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
> >  
> >  	/* No space left.  Create a new chunk. */
> >  	if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
> > +		unsigned int pcpu_scope;
> > +
> > +		pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
> >  		chunk = pcpu_create_chunk(pcpu_gfp);
> > +		pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
> > +
> >  		if (!chunk) {
> >  			err = "failed to allocate new chunk";
> >  			goto fail;
> > @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
> >  		page_end = PFN_UP(off + size);
> >  
> >  		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
> > +			unsigned int pcpu_scope;
> > +
> >  			WARN_ON(chunk->immutable);
> >  
> > +			pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
> >  			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
> > +			pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
> >  
> >  			spin_lock_irqsave(&pcpu_lock, flags);
> >  			if (ret) {
> > -- 
> > 2.50.1 (Apple Git-155)
> > 
> 
> -- 
> Pedro

-- 
Pedro



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-05-29  9:38     ` Pedro Falcato
@ 2026-05-30 12:47       ` Kaitao Cheng
  2026-05-30 13:32         ` Dennis Zhou
  0 siblings, 1 reply; 18+ messages in thread
From: Kaitao Cheng @ 2026-05-30 12:47 UTC (permalink / raw)
  To: Pedro Falcato, akpm
  Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel,
	muchun.song, Kaitao Cheng

On 2026/5/29 17:38, Pedro Falcato wrote:
> On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
>> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>
>>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
>>> passes it to the backing percpu allocators. This preserves GFP_NOFS and
>>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
>>>
>>> However, the chunk creation and population slow paths also call helpers
>>> which do not take a GFP mask and perform internal allocations with
>>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
>>> and population can allocate temporary metadata or page tables while mapping
>>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
>>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
>>> percpu chunk.
>>>
>>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
>>> they are already holding filesystem or IO-path locks. If free chunks are
>>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
>>> unconstrained reclaim from these internal allocations, defeating the
>>> caller's allocation context and potentially recreating reclaim lock
>>> dependencies.
>>>
>>> Wrap chunk creation and population in a scoped NOIO or NOFS context when
>>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
>>> allocations unchanged so they retain full reclaim capability.
>>>
>>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
>>
>> I assume you _did not_ observe this in production? As in no reclaim path should be
>> insane^W daring enough to do pcpu allocations?
> 
> Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
> "git grep alloc_percpu_gfp" shows that the vast majority of (all?) callers are
> using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
> NOFS or NOIO as far as I can see. So you probably did not observe this?

Right, this issue has not been observed in production. It came from a
question raised by AI code review, and after carefully reading the code,
I found that there are indeed some synchronization concerns.

Here is one example of the scenario [PATCH 1/2] is trying to address:

Consider blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock
while configuring policy"). blkg_conf_prep() now serializes against
blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was
changed to use GFP_NOIO for that reason:

  CPU0: blkg_conf_prep()
    mutex_lock(q->blkcg_mutex)
    blkg_alloc(..., GFP_NOIO)
      alloc_percpu_gfp(..., GFP_NOIO)
        -> if percpu chunks are exhausted, chunk create/populate may do
           internal GFP_KERNEL allocations
        -> direct reclaim / writeback can issue IO to this queue
        -> IO waits because the queue is frozen

  CPU1: blkcg_deactivate_policy()
    blk_mq_freeze_queue(q)
    mutex_lock(q->blkcg_mutex)
      -> waits for CPU0
    ... unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock which can be acquired after queue freeze, but the percpu
slow path can temporarily lose that allocation context. The failure
requires the slow path (new chunk creation or population), so it is not
expected to be common.
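
To make the lost-context problem concrete, here is a minimal user-space
sketch (not kernel code). It assumes the usual scope semantics: a task flag
set by a save helper, and an effective-mask helper that strips IO and FS
from any nested allocation while the flag is set. All model_* names and
MODEL_* bit values are made up for illustration and are not the kernel's
definitions.

```c
#include <assert.h>

/* Model bit values only; these are NOT the kernel's real GFP definitions. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_RECLAIM 0x4u
#define MODEL_GFP_KERNEL  (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS    (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO    (MODEL_GFP_RECLAIM)

/* Model per-task scope flags (hypothetical stand-ins for task flags). */
#define MODEL_PF_NOIO 0x1u
#define MODEL_PF_NOFS 0x2u

static unsigned int model_task_flags;

/* Enter a NOIO scope and return the previous state of the NOIO bit. */
static unsigned int model_noio_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOIO;

	model_task_flags |= MODEL_PF_NOIO;
	return old;
}

/* Leave the NOIO scope by restoring the saved state of the NOIO bit. */
static void model_noio_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOIO) | old;
}

/* Apply the task scope to a requested mask, as a nested allocation would. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (model_task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Stand-in for an internal helper that hard-codes GFP_KERNEL. */
static unsigned int model_internal_alloc(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/*
 * Model slow path: returns the mask the internal allocation really used.
 * With use_scope set, a caller mask without IO enters a NOIO scope first.
 */
static unsigned int model_slow_path(unsigned int caller_gfp, int use_scope)
{
	int scoped = use_scope && !(caller_gfp & MODEL_GFP_IO);
	unsigned int old = 0, seen;

	if (scoped)
		old = model_noio_save();
	seen = model_internal_alloc();
	if (scoped)
		model_noio_restore(old);
	return seen;
}
```

In this model, model_slow_path(MODEL_GFP_NOIO, 0) still reports a full
GFP_KERNEL mask for the internal allocation, while the scoped variant
reports a NOIO mask and leaves the task flags unchanged afterwards.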

>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>> ---
>>>  mm/percpu.c | 26 ++++++++++++++++++++++++++
>>>  1 file changed, 26 insertions(+)
>>>
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index 71a85d7245c7..1bb38467390b 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>>>  }
>>>  #endif
>>>  
>>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
>>> +{
>>> +	if (!(gfp & __GFP_IO))
>>> +		return memalloc_noio_save();
>>> +	if (!(gfp & __GFP_FS))
>>> +		return memalloc_nofs_save();
>>> +	return 0;
>>> +}
>>> +
>>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
>>> +{
>>> +	if (!(gfp & __GFP_IO))
>>> +		memalloc_noio_restore(flags);
>>> +	else if (!(gfp & __GFP_FS))
>>> +		memalloc_nofs_restore(flags);
>>> +}
>>
>> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
>> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
>> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
>> mostly the vmalloc backend that's problematic.

I’ll try to do it.

Following your suggestion, including in [PATCH 2/2], I will also try a
different approach and fix the issue by reducing the scope of the
pcpu_alloc_mutex critical section.

>>>  /**
>>>   * pcpu_alloc - the percpu allocator
>>>   * @size: size of area to allocate in bytes
>>> @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>>>  
>>>  	/* No space left.  Create a new chunk. */
>>>  	if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
>>> +		unsigned int pcpu_scope;
>>> +
>>> +		pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>>>  		chunk = pcpu_create_chunk(pcpu_gfp);
>>> +		pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>>> +
>>>  		if (!chunk) {
>>>  			err = "failed to allocate new chunk";
>>>  			goto fail;
>>> @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>>>  		page_end = PFN_UP(off + size);
>>>  
>>>  		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
>>> +			unsigned int pcpu_scope;
>>> +
>>>  			WARN_ON(chunk->immutable);
>>>  
>>> +			pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>>>  			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
>>> +			pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>>>  
>>>  			spin_lock_irqsave(&pcpu_lock, flags);
>>>  			if (ret) {
>>> -- 
>>> 2.50.1 (Apple Git-155)
>>>
-- 
Thanks
Kaitao Cheng




* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-05-30 12:47       ` Kaitao Cheng
@ 2026-05-30 13:32         ` Dennis Zhou
  2026-06-01  2:27           ` Kaitao Cheng
  2026-06-02 13:46           ` Pedro Falcato
  0 siblings, 2 replies; 18+ messages in thread
From: Dennis Zhou @ 2026-05-30 13:32 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Pedro Falcato, akpm, dennis, tj, cl, mhocko, vbabka, linux-mm,
	linux-kernel, muchun.song, Kaitao Cheng

Hello,

On Sat, May 30, 2026 at 08:47:34PM +0800, Kaitao Cheng wrote:
> On 2026/5/29 17:38, Pedro Falcato wrote:
> > On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
> >> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
> >>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>>
> >>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> >>> passes it to the backing percpu allocators. This preserves GFP_NOFS and
> >>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
> >>>
> >>> However, the chunk creation and population slow paths also call helpers
> >>> which do not take a GFP mask and perform internal allocations with
> >>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
> >>> and population can allocate temporary metadata or page tables while mapping
> >>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
> >>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
> >>> percpu chunk.
> >>>
> >>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
> >>> they are already holding filesystem or IO-path locks. If free chunks are
> >>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
> >>> unconstrained reclaim from these internal allocations, defeating the
> >>> caller's allocation context and potentially recreating reclaim lock
> >>> dependencies.
> >>>
> >>> Wrap chunk creation and population in a scoped NOIO or NOFS context when
> >>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
> >>> allocations unchanged so they retain full reclaim capability.
> >>>
> >>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> >>
> >> I assume you _did not_ observe this in production? As in no reclaim path should be
> >> insane^W daring enough to do pcpu allocations?
> > 
> > Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
> > "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are
> > using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
> > NOFS or NOIO as far as I can see. So you probably did not observe this?
> 
> Right, this issue has not been observed in production. It came from a
> question raised by AI code review, and after carefully reading the code,
> I found that there are indeed some synchronization concerns.
> 
> Here is one example of the scenario [PATCH 1/2] is trying to address:
> 
> blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while
> configuring policy"). blkg_conf_prep() now serializes against
> blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was
> changed to GFP_NOIO for that reason:
> 
>   CPU0: blkg_conf_prep()
>     mutex_lock(q->blkcg_mutex)
>     blkg_alloc(..., GFP_NOIO)
>       alloc_percpu_gfp(..., GFP_NOIO)
>         -> if percpu chunks are exhausted, chunk create/populate may do
>            internal GFP_KERNEL allocations
>         -> direct reclaim / writeback can issue IO to this queue
>         -> IO waits because the queue is frozen
> 
>   CPU1: blkcg_deactivate_policy()
>     blk_mq_freeze_queue(q)
>     mutex_lock(q->blkcg_mutex)
>       -> waits for CPU0
>     ... unfreeze only happens after q->blkcg_mutex is acquired/released
> 
> So the concern is that the caller deliberately uses GFP_NOIO because it
> may hold a lock which can be acquired after queue freeze, but the percpu
> slow path can temporarily lose that allocation context. The failure
> requires the slow path (new chunk creation or population), so it is not
> expected to be common.

This seems like it's just a miss from [1], where we switched to
gfpflags_allow_blocking(), a less conservative approach than
atomic == !GFP_KERNEL. If anything we just need to allow the additional
flags through. Maybe just add GFP_NOIO and GFP_NOFS in pcpu_gfp via the
whitelist.

[1] 9a5b183941b ("mm, percpu: do not consider sleepable allocations atomic")
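
For readers following the flag arithmetic, here is a small user-space sketch
of an AND-style whitelist and a blocking check. It assumes the whitelist is
GFP_KERNEL plus two modifier bits and that "atomic" means the mask lacks
direct reclaim; both are assumptions about mm/percpu.c, not something this
thread confirms. The MODEL_* values and model_* helpers are invented for the
example, and the model GFP_ATOMIC deliberately omits the other bits of the
real one.

```c
#include <assert.h>
#include <stdbool.h>

/* Model bit values only; these are NOT the kernel's real GFP definitions. */
#define MODEL_GFP_IO             0x01u
#define MODEL_GFP_FS             0x02u
#define MODEL_GFP_DIRECT_RECLAIM 0x04u
#define MODEL_GFP_KSWAPD_RECLAIM 0x08u
#define MODEL_GFP_NORETRY        0x10u
#define MODEL_GFP_NOWARN         0x20u
#define MODEL_GFP_ZERO           0x40u	/* example of a non-whitelisted bit */

#define MODEL_GFP_RECLAIM (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_KERNEL  (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS    (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO    (MODEL_GFP_RECLAIM)
#define MODEL_GFP_ATOMIC  (MODEL_GFP_KSWAPD_RECLAIM)	/* simplified */

/* Keep only whitelisted bits; bits the caller cleared stay cleared. */
static unsigned int model_pcpu_gfp(unsigned int gfp)
{
	return gfp & (MODEL_GFP_KERNEL | MODEL_GFP_NORETRY | MODEL_GFP_NOWARN);
}

/* A mask counts as atomic here when it does not allow direct reclaim. */
static bool model_is_atomic(unsigned int gfp)
{
	return !(gfp & MODEL_GFP_DIRECT_RECLAIM);
}
```

The sketch only shows the bit arithmetic: an AND with the whitelist cannot
add IO or FS to a NOIO or NOFS mask, and such masks are not atomic under a
direct-reclaim test. It does not model helpers that ignore the mask they
are given.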
> 
> >>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> >>> ---
> >>>  mm/percpu.c | 26 ++++++++++++++++++++++++++
> >>>  1 file changed, 26 insertions(+)
> >>>
> >>> diff --git a/mm/percpu.c b/mm/percpu.c
> >>> index 71a85d7245c7..1bb38467390b 100644
> >>> --- a/mm/percpu.c
> >>> +++ b/mm/percpu.c
> >>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
> >>>  }
> >>>  #endif
> >>>  
> >>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
> >>> +{
> >>> +	if (!(gfp & __GFP_IO))
> >>> +		return memalloc_noio_save();
> >>> +	if (!(gfp & __GFP_FS))
> >>> +		return memalloc_nofs_save();
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
> >>> +{
> >>> +	if (!(gfp & __GFP_IO))
> >>> +		memalloc_noio_restore(flags);
> >>> +	else if (!(gfp & __GFP_FS))
> >>> +		memalloc_nofs_restore(flags);
> >>> +}
> >>
> >> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
> >> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
> >> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
> >> mostly the vmalloc backend that's problematic.
> 
> I’ll try to do it.
> 
> Following your suggestion, including in [PATCH 2/2], I will also try a
> different approach and fix the issue by reducing the scope of the
> pcpu_alloc_mutex critical section.
> 

No please don't. The point of the percpu mutex is to ensure that only
one person is ever possibly creating a new chunk. If you drop the mutex,
then you have to deal with concurrent callers when available percpu
memory is low. Percpu memory is expensive and unmovable so the cost is
in the control plane to avoid excess fragmentation.

Thanks,
Dennis



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-05-30 13:32         ` Dennis Zhou
@ 2026-06-01  2:27           ` Kaitao Cheng
  2026-06-01 15:45             ` Michal Hocko
  2026-06-02 13:46           ` Pedro Falcato
  1 sibling, 1 reply; 18+ messages in thread
From: Kaitao Cheng @ 2026-06-01  2:27 UTC (permalink / raw)
  To: Dennis Zhou, Pedro Falcato, akpm
  Cc: tj, cl, mhocko, vbabka, linux-mm, linux-kernel, muchun.song,
	Kaitao Cheng



On 2026/5/30 21:32, Dennis Zhou wrote:
> Hello,
> 
> On Sat, May 30, 2026 at 08:47:34PM +0800, Kaitao Cheng wrote:
>> On 2026/5/29 17:38, Pedro Falcato wrote:
>>> On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
>>>> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
>>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>>
>>>>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
>>>>> passes it to the backing percpu allocators. This preserves GFP_NOFS and
>>>>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
>>>>>
>>>>> However, the chunk creation and population slow paths also call helpers
>>>>> which do not take a GFP mask and perform internal allocations with
>>>>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
>>>>> and population can allocate temporary metadata or page tables while mapping
>>>>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
>>>>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
>>>>> percpu chunk.
>>>>>
>>>>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
>>>>> they are already holding filesystem or IO-path locks. If free chunks are
>>>>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
>>>>> unconstrained reclaim from these internal allocations, defeating the
>>>>> caller's allocation context and potentially recreating reclaim lock
>>>>> dependencies.
>>>>>
>>>>> Wrap chunk creation and population in a scoped NOIO or NOFS context when
>>>>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
>>>>> allocations unchanged so they retain full reclaim capability.
>>>>>
>>>>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
>>>>
>>>> I assume you _did not_ observe this in production? As in no reclaim path should be
>>>> insane^W daring enough to do pcpu allocations?
>>>
>>> Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
>>> "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are
>>> using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
>>> NOFS or NOIO as far as I can see. So you probably did not observe this?
>>
>> Right, this issue has not been observed in production. It came from a
>> question raised by AI code review, and after carefully reading the code,
>> I found that there are indeed some synchronization concerns.
>>
>> Here is one example of the scenario [PATCH 1/2] is trying to address:
>>
>> blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while
>> configuring policy"). blkg_conf_prep() now serializes against
>> blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was
>> changed to GFP_NOIO for that reason:
>>
>>   CPU0: blkg_conf_prep()
>>     mutex_lock(q->blkcg_mutex)
>>     blkg_alloc(..., GFP_NOIO)
>>       alloc_percpu_gfp(..., GFP_NOIO)
>>         -> if percpu chunks are exhausted, chunk create/populate may do
>>            internal GFP_KERNEL allocations
>>         -> direct reclaim / writeback can issue IO to this queue
>>         -> IO waits because the queue is frozen
>>
>>   CPU1: blkcg_deactivate_policy()
>>     blk_mq_freeze_queue(q)
>>     mutex_lock(q->blkcg_mutex)
>>       -> waits for CPU0
>>     ... unfreeze only happens after q->blkcg_mutex is acquired/released
>>
>> So the concern is that the caller deliberately uses GFP_NOIO because it
>> may hold a lock which can be acquired after queue freeze, but the percpu
>> slow path can temporarily lose that allocation context. The failure
>> requires the slow path (new chunk creation or population), so it is not
>> expected to be common.
> 
> This is likely just a miss for [1] where we switching to
> gfpflags_allow_blocking(),
> 
> This seems like it's just a miss from [1] where we switched to a less
> conservative approach than atomic == !GFP_KERNEL. If anything we just
> need to allow the additional flags through. Maybe just add GFP_NOIO and
> GFP_NOFS in pcpu_gfp via the whitelist.
> 
> [1] 9a5b183941b ("mm, percpu: do not consider sleepable allocations atomic")
>>
>>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>>> ---
>>>>>  mm/percpu.c | 26 ++++++++++++++++++++++++++
>>>>>  1 file changed, 26 insertions(+)
>>>>>
>>>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>>>> index 71a85d7245c7..1bb38467390b 100644
>>>>> --- a/mm/percpu.c
>>>>> +++ b/mm/percpu.c
>>>>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>>>>>  }
>>>>>  #endif
>>>>>  
>>>>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
>>>>> +{
>>>>> +	if (!(gfp & __GFP_IO))
>>>>> +		return memalloc_noio_save();
>>>>> +	if (!(gfp & __GFP_FS))
>>>>> +		return memalloc_nofs_save();
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
>>>>> +{
>>>>> +	if (!(gfp & __GFP_IO))
>>>>> +		memalloc_noio_restore(flags);
>>>>> +	else if (!(gfp & __GFP_FS))
>>>>> +		memalloc_nofs_restore(flags);
>>>>> +}
>>>>
>>>> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
>>>> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
>>>> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
>>>> mostly the vmalloc backend that's problematic.
>>
>> I’ll try to do it.
>>
>> Following your suggestion, including in [PATCH 2/2], I will also try a
>> different approach and fix the issue by reducing the scope of the
>> pcpu_alloc_mutex critical section.
>>
> 
> No please don't. The point of the percpu mutex is to ensure that only
> one person is ever possibly creating a new chunk. If you drop the mutex,
> then you have to deal with concurrent callers when available percpu
> memory is low. Percpu memory is expensive and unmovable so the cost is
> in the control plane to avoid excess fragmentation.
> 

You are right. pcpu_alloc_mutex seems necessary, and there does not appear
to be a good way to replace it for now.

The introduction of 9a5b183941b can lead to several synchronization issues.
There are already three known cases: the two described in [PATCH 0/2], plus
the one raised by AI review.

https://lore.kernel.org/all/20260528132917.81123-1-kaitao.cheng@linux.dev/
https://sashiko.dev/#/patchset/20260528132917.81123-1-kaitao.cheng@linux.dev

Since optimizing around pcpu_alloc_mutex is constrained, there does not seem
to be a perfect solution that addresses all of them at the same time.

However, if we revert 9a5b183941b, it seems that all of these issues would
be resolved. The only downside is that the failure rate of pcpu_alloc_noprof()
allocations may increase, which might be acceptable.

I would like to hear what others think.

-- 
Thanks
Kaitao Cheng




* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-06-01  2:27           ` Kaitao Cheng
@ 2026-06-01 15:45             ` Michal Hocko
  2026-06-02  3:03               ` Kaitao Cheng
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2026-06-01 15:45 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Dennis Zhou, Pedro Falcato, akpm, tj, cl, vbabka, linux-mm,
	linux-kernel, muchun.song, Kaitao Cheng

On Mon 01-06-26 10:27:53, Kaitao Cheng wrote:
> However, if we revert 9a5b183941b, it seems that all of these issues would
> be resolved. The only downside is that the failure rate of pcpu_alloc_noprof()
> allocations may increase, which might be acceptable.

That has practical impact on some versions of iscsid which do not have
PR_SET_IO_FLUSHER. And maybe some more, so I would rather not revert
based on theoretical concerns, which I believe is the case here.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-06-01 15:45             ` Michal Hocko
@ 2026-06-02  3:03               ` Kaitao Cheng
  2026-06-02  7:16                 ` Vlastimil Babka (SUSE)
  2026-06-02  7:17                 ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Kaitao Cheng @ 2026-06-02  3:03 UTC (permalink / raw)
  To: Michal Hocko, Dennis Zhou
  Cc: Pedro Falcato, akpm, tj, cl, vbabka, linux-mm, linux-kernel,
	muchun.song, Kaitao Cheng

On 2026/6/1 23:45, Michal Hocko wrote:
> On Mon 01-06-26 10:27:53, Kaitao Cheng wrote:
>> However, if we revert 9a5b183941b, it seems that all of these issues would
>> be resolved. The only downside is that the failure rate of pcpu_alloc_noprof()
>> allocations may increase, which might be acceptable.
> 
> That has practical impact on some versions of iscsid which do not have
> PR_SET_IO_FLUSHER. And maybe some more so I would rather not revert
> based on a theoretical concerns which I believe is the case here. 
> 

Based on the previous discussion, I think we have a way to address most
of the concurrency issues around percpu allocation.

However, there still seems to be one remaining case that I do not yet
have a good way to solve. For example:

Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
pcpu_alloc_mutex. Since the internal allocation is not constrained by
NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
creating a dependency like:

  pcpu_alloc_mutex -> fs_reclaim -> FS lock

At the same time, Thread B may already hold an FS lock and then call
pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
pcpu_alloc_mutex and block, creating the reverse dependency:

  FS lock -> pcpu_alloc_mutex

This can still form a potential deadlock cycle.
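
The two dependency chains can be checked mechanically. The following is a
toy user-space sketch of a lock-order graph with a cycle check, loosely in
the spirit of what lockdep reports; it is not lockdep's algorithm, and the
model_* names and MODEL_* lock classes are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes taken from the scenario described above. */
enum model_lock {
	MODEL_PCPU_ALLOC_MUTEX,
	MODEL_FS_RECLAIM,
	MODEL_FS_LOCK,
	MODEL_NR_LOCKS
};

/* model_edge[a][b] is true when b was acquired or waited on with a held. */
static bool model_edge[MODEL_NR_LOCKS][MODEL_NR_LOCKS];

static void model_reset(void)
{
	memset(model_edge, 0, sizeof(model_edge));
}

static void model_add_dependency(enum model_lock held, enum model_lock wanted)
{
	model_edge[held][wanted] = true;
}

/* Depth-first search: can "to" be reached from "from" along recorded edges? */
static bool model_reachable(int from, int to, bool *visited)
{
	if (model_edge[from][to])
		return true;
	visited[from] = true;
	for (int next = 0; next < MODEL_NR_LOCKS; next++) {
		if (model_edge[from][next] && !visited[next] &&
		    model_reachable(next, to, visited))
			return true;
	}
	return false;
}

/* A cycle exists when some lock class can reach itself. */
static bool model_has_cycle(void)
{
	for (int lock = 0; lock < MODEL_NR_LOCKS; lock++) {
		bool visited[MODEL_NR_LOCKS] = { false };

		if (model_reachable(lock, lock, visited))
			return true;
	}
	return false;
}
```

Recording only Thread A's chain (mutex -> fs_reclaim -> FS lock) yields no
cycle in this model; adding Thread B's edge (FS lock -> mutex) closes it.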

Does anyone have a good suggestion for how to handle this remaining case?
Or should we simply treat all GFP_KERNEL/GFP_NOFS allocation behavior in
pcpu_alloc_noprof() as GFP_NOIO?

If there is no clear solution for now, would it be acceptable to first
fix some of the issues introduced by commit 9a5b183941b, and leave this
remaining case as a pre-existing historical issue to be handled separately
later?

-- 
Thanks
Kaitao Cheng


* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-06-02  3:03               ` Kaitao Cheng
@ 2026-06-02  7:16                 ` Vlastimil Babka (SUSE)
  2026-06-02  8:05                   ` Michal Hocko
  2026-06-02  7:17                 ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-06-02  7:16 UTC (permalink / raw)
  To: Kaitao Cheng, Michal Hocko, Dennis Zhou
  Cc: Pedro Falcato, akpm, tj, cl, linux-mm, linux-kernel, muchun.song,
	Kaitao Cheng

On 6/2/26 05:03, Kaitao Cheng wrote:
> 
> 
> On 2026/6/1 23:45, Michal Hocko wrote:
>> On Mon 01-06-26 10:27:53, Kaitao Cheng wrote:
>>> However, if we revert 9a5b183941b, it seems that all of these issues would
>>> be resolved. The only downside is that the failure rate of pcpu_alloc_noprof()
>>> allocations may increase, which might be acceptable.
>> 
>> That has practical impact on some versions of iscsid which do not have
>> PR_SET_IO_FLUSHER. And maybe some more so I would rather not revert
>> based on a theoretical concerns which I believe is the case here. 
>> 
> 
> Based on the previous discussion, I think we have a way to address most
> of the concurrency issues around percpu allocation.
> 
> However, there still seems to be one remaining case that I do not yet
> have a good way to solve. For example:
> 
> Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
> pcpu_alloc_mutex. Since the internal allocation is not constrained by
> NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
> creating a dependency like:
> 
> pcpu_alloc_mutex -> fs_reclaim -> FS lock
> At the same time, Thread B may already hold an FS lock and then call
> pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
> pcpu_alloc_mutex and block, creating the reverse dependency:
> 
> FS lock -> pcpu_alloc_mutex
> This can still form a potential deadlock cycle.
> 
> Does anyone have a good suggestion for how to handle this remaining case?
> Or should we simply treat all GFP_KERNEL/GFP_NOFS allocation behavior in
> pcpu_alloc_noprof() as GFP_NOIO?
> 
> If there is no clear solution for now, would it be acceptable to first
> fix some of the issues introduced by commit 9a5b183941b, and leave this
> remaining case as a pre-existing historical issue to be handled separately
> later?

We don't need to solve any issues that are only theoretical and based on
scenarios that nobody sane should be doing, i.e. Pedro already pointed out
"As in no reclaim path should be insane^W daring enough to do pcpu allocations?"

If anyone would (start to) do that, we would likely have lockdep reports
from the testing bots, which warn that the scenario can now exist, even
before it results in an actual deadlock.

Elsewhere Pedro said "The proper way of fixing this would probably be to
release pcpu_alloc_mutex (or not have it in the first place!) while you're
allocating memory."

Such a refactoring might be worth it (if it's feasible to do cleanly and
doesn't come with downsides) just to eliminate these lock dependencies
properly for good. Patching over individual theoretical issues is IMHO not
worth it.



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-06-02  3:03               ` Kaitao Cheng
  2026-06-02  7:16                 ` Vlastimil Babka (SUSE)
@ 2026-06-02  7:17                 ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2026-06-02  7:17 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Dennis Zhou, Pedro Falcato, akpm, tj, cl, vbabka, linux-mm,
	linux-kernel, muchun.song, Kaitao Cheng

On Tue 02-06-26 11:03:15, Kaitao Cheng wrote:
> 
> 
> On 2026/6/1 23:45, Michal Hocko wrote:
> > On Mon 01-06-26 10:27:53, Kaitao Cheng wrote:
> >> However, if we revert 9a5b183941b, it seems that all of these issues would
> >> be resolved. The only downside is that the failure rate of pcpu_alloc_noprof()
> >> allocations may increase, which might be acceptable.
> > 
> > That has practical impact on some versions of iscsid which do not have
> > PR_SET_IO_FLUSHER. And maybe some more so I would rather not revert
> > based on a theoretical concerns which I believe is the case here. 
> > 
> 
> Based on the previous discussion, I think we have a way to address most
> of the concurrency issues around percpu allocation.
> 
> However, there still seems to be one remaining case that I do not yet
> have a good way to solve. For example:
> 
> Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
> pcpu_alloc_mutex. Since the internal allocation is not constrained by
> NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
> creating a dependency like:
> 
> pcpu_alloc_mutex -> fs_reclaim -> FS lock
> At the same time, Thread B may already hold an FS lock and then call
> pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
> pcpu_alloc_mutex and block, creating the reverse dependency:
> 
> FS lock -> pcpu_alloc_mutex
> This can still form a potential deadlock cycle.

Correct.

> Does anyone have a good suggestion for how to handle this remaining case?
> Or should we simply treat all GFP_KERNEL/GFP_NOFS allocation behavior in
> pcpu_alloc_noprof() as GFP_NOIO?

Yes, weakening the reclaim context would work around these dependencies.
This seems like a viable option as long as pcp allocations are not in
latency sensitive paths and stalling them under memory pressure is
acceptable. My insight into pcp allocator users is very limited so I
cannot really make any judgment call here.
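
As a rough illustration of what weakening the reclaim context means at the
flag level, here is a user-space sketch. The helper model_pcpu_weaken_gfp()
is hypothetical, as are the MODEL_* bit values; the sketch only assumes that
FS reclaim needs both direct reclaim and the FS bit in the mask.

```c
#include <assert.h>

/* Model bit values only; these are NOT the kernel's real GFP definitions. */
#define MODEL_GFP_IO             0x1u
#define MODEL_GFP_FS             0x2u
#define MODEL_GFP_DIRECT_RECLAIM 0x4u
#define MODEL_GFP_KERNEL (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO   (MODEL_GFP_DIRECT_RECLAIM)

/*
 * Hypothetical helper: the mask a percpu slow path would use for every
 * allocation made while pcpu_alloc_mutex is held, with IO and FS stripped.
 */
static unsigned int model_pcpu_weaken_gfp(unsigned int gfp)
{
	return gfp & ~(MODEL_GFP_IO | MODEL_GFP_FS);
}

/* Nonzero when reclaim run with this mask may call into a filesystem. */
static int model_may_enter_fs_reclaim(unsigned int gfp)
{
	return (gfp & MODEL_GFP_DIRECT_RECLAIM) && (gfp & MODEL_GFP_FS);
}
```

With the weakened mask, a GFP_KERNEL caller can no longer create the
pcpu_alloc_mutex -> fs_reclaim edge in this model. The cost is the one
noted above: reclaim has fewer options, so allocations may stall or fail
more readily under memory pressure.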

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-06-02  7:16                 ` Vlastimil Babka (SUSE)
@ 2026-06-02  8:05                   ` Michal Hocko
  2026-06-02  9:02                     ` Kaitao Cheng
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2026-06-02  8:05 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Kaitao Cheng, Dennis Zhou, Pedro Falcato, akpm, tj, cl, linux-mm,
	linux-kernel, muchun.song, Kaitao Cheng

On Tue 02-06-26 09:16:24, Vlastimil Babka (SUSE) wrote:
> On 6/2/26 05:03, Kaitao Cheng wrote:
> > 
> > 
> > On 2026/6/1 23:45, Michal Hocko wrote:
> >> On Mon 01-06-26 10:27:53, Kaitao Cheng wrote:
> >>> However, if we revert 9a5b183941b, it seems that all of these issues would
> >>> be resolved. The only downside is that the failure rate of pcpu_alloc_noprof()
> >>> allocations may increase, which might be acceptable.
> >> 
> >> That has practical impact on some versions of iscsid which do not have
> >> PR_SET_IO_FLUSHER. And maybe some more so I would rather not revert
> >> based on a theoretical concerns which I believe is the case here. 
> >> 
> > 
> > Based on the previous discussion, I think we have a way to address most
> > of the concurrency issues around percpu allocation.
> > 
> > However, there still seems to be one remaining case that I do not yet
> > have a good way to solve. For example:
> > 
> > Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
> > pcpu_alloc_mutex. Since the internal allocation is not constrained by
> > NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
> > creating a dependency like:
> > 
> > pcpu_alloc_mutex -> fs_reclaim -> FS lock
> > At the same time, Thread B may already hold an FS lock and then call
> > pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
> > pcpu_alloc_mutex and block, creating the reverse dependency:
> > 
> > FS lock -> pcpu_alloc_mutex
> > This can still form a potential deadlock cycle.
> > 
> > Does anyone have a good suggestion for how to handle this remaining case?
> > Or should we simply treat all GFP_KERNEL/GFP_NOFS allocation behavior in
> > pcpu_alloc_noprof() as GFP_NOIO?
> > 
> > If there is no clear solution for now, would it be acceptable to first
> > fix some of the issues introduced by commit 9a5b183941b, and leave this
> > remaining case as a pre-existing historical issue to be handled separately
> > later?
> 
> We don't need to solve any issues that are only theoretical and based on
> scenarios that nobody sane should be doing, i.e. Pedro already pointed out
> "As in no reclaim path should be insane^W daring enough to do pcpu allocations?"

Yes, but you do not need to do a pcp allocation from the reclaim path to
hit the deadlock. All you need is a NOFS pcp allocation - e.g. one done
from NOFS scope. Then you have fs lock <-> pcpu_alloc_mutex dependency
and a potential deadlock. It is hard to know whether we have any of
those in the kernel but I know for a fact (9a5b183941b) that there are
scoped NOIO allocations so I wouldn't be surprised if the same was the
case for NOFS. Not only that, NOIO is an even weaker reclaim context
(it implies NOFS).

> Elsewhere Pedro said "The proper way of fixing this would probably be to
> release pcpu_alloc_mutex (or not have it in the first place!) while you're
> allocating memory."

I do agree with this. Back then when I was dealing with the NOIO issue
I tried to look at the lock and drop it but it was not really
straightforward. Maybe my lack of close understanding of the pcp
allocator was an obstacle there. So if there is a path forward like that
then it would certainly be the best.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-06-02  8:05                   ` Michal Hocko
@ 2026-06-02  9:02                     ` Kaitao Cheng
  0 siblings, 0 replies; 18+ messages in thread
From: Kaitao Cheng @ 2026-06-02  9:02 UTC (permalink / raw)
  To: Michal Hocko, Vlastimil Babka (SUSE)
  Cc: Dennis Zhou, Pedro Falcato, akpm, tj, cl, linux-mm, linux-kernel,
	muchun.song, Kaitao Cheng

On 2026/6/2 16:05, Michal Hocko wrote:
> On Tue 02-06-26 09:16:24, Vlastimil Babka (SUSE) wrote:
>> On 6/2/26 05:03, Kaitao Cheng wrote:
>>>
>>>
>>> On 2026/6/1 23:45, Michal Hocko wrote:
>>>> On Mon 01-06-26 10:27:53, Kaitao Cheng wrote:
>>>>> However, if we revert 9a5b183941b, it seems that all of these issues would
>>>>> be resolved. The only downside is that the failure rate of pcpu_alloc_noprof()
>>>>> allocations may increase, which might be acceptable.
>>>>
>>>> That has practical impact on some versions of iscsid which do not have
>>>> PR_SET_IO_FLUSHER. And maybe some more so I would rather not revert
>>>> based on a theoretical concerns which I believe is the case here. 
>>>>
>>>
>>> Based on the previous discussion, I think we have a way to address most
>>> of the concurrency issues around percpu allocation.
>>>
>>> However, there still seems to be one remaining case that I do not yet
>>> have a good way to solve. For example:
>>>
>>> Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
>>> pcpu_alloc_mutex. Since the internal allocation is not constrained by
>>> NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
>>> creating a dependency like:
>>>
>>> pcpu_alloc_mutex -> fs_reclaim -> FS lock
>>> At the same time, Thread B may already hold an FS lock and then call
>>> pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
>>> pcpu_alloc_mutex and block, creating the reverse dependency:
>>>
>>> FS lock -> pcpu_alloc_mutex
>>> This can still form a potential deadlock cycle.
>>>
>>> Does anyone have a good suggestion for how to handle this remaining case?
>>> Or should we simply treat all GFP_KERNEL/GFP_NOFS allocation behavior in
>>> pcpu_alloc_noprof() as GFP_NOIO?
>>>
>>> If there is no clear solution for now, would it be acceptable to first
>>> fix some of the issues introduced by commit 9a5b183941b, and leave this
>>> remaining case as a pre-existing historical issue to be handled separately
>>> later?
>>
>> We don't need to solve any issues that are only theoretical and based on
>> scenarios that nobody sane should be doing, i.e. Pedro already pointed out
>> "As in no reclaim path should be insane^W daring enough to do pcpu allocations?"
> 
> Yes, but you do not need to do a pcp allocation from the reclaim path to
> hit the deadlock. All you need is a NOFS pcp allocation - e.g. one done
> from NOFS scope. Then you have fs lock <-> pcpu_alloc_mutex dependency
> and a potential deadlock. It is hard to know whether we have any of
> those in the kernel but I know for a fact (9a5b183941b) that there are
> scoped NOIO allocations so I wouldn't be surprised if the same was the
> case for NOFS. Not only that, NOIO is a weaker reclaim context.
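
To make the inversion discussed above concrete, here is a minimal userspace
sketch in C. It is not kernel code and not how lockdep is implemented. It
only records "taken while held" edges between three hypothetical lock
classes named after the locks in this thread and checks whether they form a
cycle. All type and function names below are assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes, named after the locks discussed in this thread. */
enum lock_class { PCPU_ALLOC_MUTEX, FS_RECLAIM, FS_LOCK, NR_CLASSES };

/* edge[a][b] is true when some code path takes b while holding a. */
struct lock_graph {
	bool edge[NR_CLASSES][NR_CLASSES];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Record that 'taken' was acquired while 'held' was already held. */
static void add_dependency(struct lock_graph *g, enum lock_class held,
			   enum lock_class taken)
{
	assert(held < NR_CLASSES && taken < NR_CLASSES);
	g->edge[held][taken] = true;
}

/* Depth-first search: can 'target' be reached from 'from'? */
static bool reaches(const struct lock_graph *g, int from, int target,
		    bool *seen)
{
	if (from == target)
		return true;
	if (seen[from])
		return false;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (g->edge[from][next] && reaches(g, next, target, seen))
			return true;
	return false;
}

/* A cycle through 'start' exists if a successor of 'start' leads back to it. */
static bool has_cycle_through(const struct lock_graph *g,
			      enum lock_class start)
{
	for (int next = 0; next < NR_CLASSES; next++) {
		bool seen[NR_CLASSES] = { false };

		if (g->edge[start][next] && reaches(g, next, start, seen))
			return true;
	}
	return false;
}
```

With only the GFP_KERNEL path recorded (PCPU_ALLOC_MUTEX -> FS_RECLAIM and
FS_RECLAIM -> FS_LOCK), has_cycle_through(&g, PCPU_ALLOC_MUTEX) is false.
Adding the NOFS caller's edge (FS_LOCK -> PCPU_ALLOC_MUTEX) closes the loop
and the same call returns true, without any percpu allocation from the
reclaim path itself.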

Besides the case mentioned by Michal Hocko, in my previous email I listed
a scenario where blkg_conf_prep races with blkcg_deactivate_policy. This
is also something that can actually happen in the existing code, and we
have discussed some possible solutions.

The scenario Pedro referred to ("As in no reclaim path should be
insane^W daring enough to do pcpu allocations?") may exist only at the
theoretical level. However, some of the other issues do genuinely exist.
They may not have been observed so far because the probability of
reproducing them is very low, or because they have already happened but no
one has reported them to the community.

>> Elsewhere Pedro said "The proper way of fixing this would probably be to
>> release pcpu_alloc_mutex (or not have it in the first place!) while you're
>> allocating memory."
> 
> I do agree with this. Back then when I was dealing with the NOIO issue
> I've tried to look at the lock and drop it but it was not really
> straightforward. Maybe my lack of close understanding of the pcp
> allocator was an obstacle there. So if there is a path forward like that
> then it would certainly be the best.

That is indeed the best solution, but after thinking about it briefly, it
may not be easy to implement, or it may require a large number of changes.
I would really appreciate any concrete suggestions on how to narrow or
remove the pcpu_alloc_mutex critical section. Otherwise, this issue will
remain blocked here.

-- 
Thanks
Kaitao Cheng




* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
  2026-05-30 13:32         ` Dennis Zhou
  2026-06-01  2:27           ` Kaitao Cheng
@ 2026-06-02 13:46           ` Pedro Falcato
  1 sibling, 0 replies; 18+ messages in thread
From: Pedro Falcato @ 2026-06-02 13:46 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Kaitao Cheng, akpm, tj, cl, mhocko, vbabka, linux-mm,
	linux-kernel, muchun.song, Kaitao Cheng

On Sat, May 30, 2026 at 03:32:19PM +0200, Dennis Zhou wrote:
<snip>
> > 
> > I’ll try to do it.
> > 
> > Following your suggestion, including in [PATCH 2/2], I will also try a
> > different approach and fix the issue by reducing the scope of the
> > pcpu_alloc_mutex critical section.
> > 
> 
> No please don't. The point of the percpu mutex is to ensure that only
> one person is ever possibly creating a new chunk. If you drop the mutex,
> then you have to deal with concurrent callers when available percpu
> memory is low. Percpu memory is expensive and unmovable so the cost is
> in the control plane to avoid excess fragmentation.

And so is slab memory, yet this is not a problem. We shouldn't keep
problematic locks just because they may slightly reduce fragmentation.
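
For illustration only, the general "allocate with the lock dropped, then
recheck" pattern looks roughly like the userspace sketch below. This is not
the percpu allocator and not a proposed patch. struct pool,
pool_init() and pool_ensure_spare() are hypothetical names, and malloc()
stands in for the backing allocation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pool that keeps at most one spare chunk. */
struct pool {
	pthread_mutex_t lock;
	void *spare_chunk;	/* NULL when the pool has no spare chunk */
	size_t chunk_size;
};

static void pool_init(struct pool *p, size_t chunk_size)
{
	assert(chunk_size > 0);
	pthread_mutex_init(&p->lock, NULL);
	p->spare_chunk = NULL;
	p->chunk_size = chunk_size;
}

/*
 * Make sure the pool has a spare chunk.
 * Returns 0 on success, -1 if the allocation failed.
 */
static int pool_ensure_spare(struct pool *p)
{
	void *chunk;
	int need;

	/* Check the state under the lock, but do not allocate here. */
	pthread_mutex_lock(&p->lock);
	need = (p->spare_chunk == NULL);
	pthread_mutex_unlock(&p->lock);
	if (!need)
		return 0;

	/* Allocate with the lock dropped, so allocation never nests in it. */
	chunk = malloc(p->chunk_size);
	if (!chunk)
		return -1;

	/* Retake the lock and recheck: another caller may have won the race. */
	pthread_mutex_lock(&p->lock);
	if (p->spare_chunk == NULL) {
		p->spare_chunk = chunk;
		chunk = NULL;
	}
	pthread_mutex_unlock(&p->lock);

	/* Lost the race: discard the duplicate. free(NULL) is a no-op. */
	free(chunk);
	return 0;
}
```

The cost is the one Dennis describes: under contention several callers can
allocate a chunk at the same time and all but one throw theirs away. The
benefit is that no allocation, and therefore no reclaim, happens while the
lock is held.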


-- 
Pedro



Thread overview: 18+ messages
2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
2026-05-29  9:25   ` Pedro Falcato
2026-05-29  9:38     ` Pedro Falcato
2026-05-30 12:47       ` Kaitao Cheng
2026-05-30 13:32         ` Dennis Zhou
2026-06-01  2:27           ` Kaitao Cheng
2026-06-01 15:45             ` Michal Hocko
2026-06-02  3:03               ` Kaitao Cheng
2026-06-02  7:16                 ` Vlastimil Babka (SUSE)
2026-06-02  8:05                   ` Michal Hocko
2026-06-02  9:02                     ` Kaitao Cheng
2026-06-02  7:17                 ` Michal Hocko
2026-06-02 13:46           ` Pedro Falcato
2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
2026-05-29  9:34   ` Pedro Falcato
2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton
2026-05-28 21:10 ` Andrew Morton
