* [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
@ 2026-05-28 13:29 Kaitao Cheng
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw)
To: dennis, tj, cl, akpm
Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng
Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
pcpu_alloc_mutex and the chunk creation slow path. This restored the
allocation capability that was lost when those constrained allocations
were treated as atomic, but it also opens two possible reclaim recursion
problems.
The first problem is that the create and populate slow paths do not fully
preserve the caller's allocation constraints. pcpu_alloc_noprof() derives
pcpu_gfp from the caller supplied GFP mask and passes it to the backing
page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(),
and population can allocate temporary metadata or page tables while mapping
backing pages. Those internal allocations can use GFP_KERNEL. A caller
using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or
IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's
allocation context.
The second problem is a possible pcpu_alloc_mutex recursion from reclaim.
If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim
reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO,
the nested allocation can now try to take pcpu_alloc_mutex again because
9a5b183941b5 no longer treats those masks as atomic.
Another possible way to avoid these issues is to revert 9a5b183941b5.
However, that would also bring back the premature allocation failures for
sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended
to fix.
Kaitao Cheng (2):
mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim
mm/percpu.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
--
2.50.1 (Apple Git-155)
^ permalink raw reply [flat|nested] 12+ messages in thread* [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate 2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng @ 2026-05-28 13:29 ` Kaitao Cheng 2026-05-29 9:25 ` Pedro Falcato 2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng ` (2 subsequent siblings) 3 siblings, 1 reply; 12+ messages in thread From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw) To: dennis, tj, cl, akpm Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng From: Kaitao Cheng <chengkaitao@kylinos.cn> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and passes it to the backing percpu allocators. This preserves GFP_NOFS and GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation. However, the chunk creation and population slow paths also call helpers which do not take a GFP mask and perform internal allocations with GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(), and population can allocate temporary metadata or page tables while mapping backing pages. As a result, a caller which explicitly uses GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while creating or populating a percpu chunk. This is problematic for callers which use GFP_NOFS or GFP_NOIO because they are already holding filesystem or IO-path locks. If free chunks are exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter unconstrained reclaim from these internal allocations, defeating the caller's allocation context and potentially recreating reclaim lock dependencies. Wrap chunk creation and population in a scoped NOIO or NOFS context when pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL allocations unchanged so they retain full reclaim capability. Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> --- mm/percpu.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/mm/percpu.c b/mm/percpu.c index 71a85d7245c7..1bb38467390b 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s } #endif +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp) +{ + if (!(gfp & __GFP_IO)) + return memalloc_noio_save(); + if (!(gfp & __GFP_FS)) + return memalloc_nofs_save(); + return 0; +} + +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) +{ + if (!(gfp & __GFP_IO)) + memalloc_noio_restore(flags); + else if (!(gfp & __GFP_FS)) + memalloc_nofs_restore(flags); +} + /** * pcpu_alloc - the percpu allocator * @size: size of area to allocate in bytes @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, /* No space left. Create a new chunk. */ if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) { + unsigned int pcpu_scope; + + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); chunk = pcpu_create_chunk(pcpu_gfp); + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); + if (!chunk) { err = "failed to allocate new chunk"; goto fail; @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, page_end = PFN_UP(off + size); for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) { + unsigned int pcpu_scope; + WARN_ON(chunk->immutable); + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp); + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); spin_lock_irqsave(&pcpu_lock, flags); if (ret) { -- 2.50.1 (Apple Git-155) ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate 2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng @ 2026-05-29 9:25 ` Pedro Falcato 2026-05-29 9:38 ` Pedro Falcato 0 siblings, 1 reply; 12+ messages in thread From: Pedro Falcato @ 2026-05-29 9:25 UTC (permalink / raw) To: Kaitao Cheng Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote: > From: Kaitao Cheng <chengkaitao@kylinos.cn> > > pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and > passes it to the backing percpu allocators. This preserves GFP_NOFS and > GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation. > > However, the chunk creation and population slow paths also call helpers > which do not take a GFP mask and perform internal allocations with > GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(), > and population can allocate temporary metadata or page tables while mapping > backing pages. As a result, a caller which explicitly uses GFP_NOFS or > GFP_NOIO can still enter FS or IO reclaim while creating or populating a > percpu chunk. > > This is problematic for callers which use GFP_NOFS or GFP_NOIO because > they are already holding filesystem or IO-path locks. If free chunks are > exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter > unconstrained reclaim from these internal allocations, defeating the > caller's allocation context and potentially recreating reclaim lock > dependencies. > > Wrap chunk creation and population in a scoped NOIO or NOFS context when > pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL > allocations unchanged so they retain full reclaim capability. > > Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") I assume you _did not_ observe this in production? As in no reclaim path should be insane^W daring enough to do pcpu allocations? > Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> > --- > mm/percpu.c | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/mm/percpu.c b/mm/percpu.c > index 71a85d7245c7..1bb38467390b 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s > } > #endif > > +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp) > +{ > + if (!(gfp & __GFP_IO)) > + return memalloc_noio_save(); > + if (!(gfp & __GFP_FS)) > + return memalloc_nofs_save(); > + return 0; > +} > + > +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) > +{ > + if (!(gfp & __GFP_IO)) > + memalloc_noio_restore(flags); > + else if (!(gfp & __GFP_FS)) > + memalloc_nofs_restore(flags); > +} I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk() and pcpu_populate_chunk(). It's their job to respect the gfp flags and Do The Right Thing(tm). Can you fix the problematic places? It seems like it's mostly the vmalloc backend that's problematic. > + > /** > * pcpu_alloc - the percpu allocator > * @size: size of area to allocate in bytes > @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, > > /* No space left. Create a new chunk. */ > if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) { > + unsigned int pcpu_scope; > + > + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); > chunk = pcpu_create_chunk(pcpu_gfp); > + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); > + > if (!chunk) { > err = "failed to allocate new chunk"; > goto fail; > @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, > page_end = PFN_UP(off + size); > > for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) { > + unsigned int pcpu_scope; > + > WARN_ON(chunk->immutable); > > + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); > ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp); > + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); > > spin_lock_irqsave(&pcpu_lock, flags); > if (ret) { > -- > 2.50.1 (Apple Git-155) > -- Pedro ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate 2026-05-29 9:25 ` Pedro Falcato @ 2026-05-29 9:38 ` Pedro Falcato 2026-05-30 12:47 ` Kaitao Cheng 0 siblings, 1 reply; 12+ messages in thread From: Pedro Falcato @ 2026-05-29 9:38 UTC (permalink / raw) To: Kaitao Cheng Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote: > On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote: > > From: Kaitao Cheng <chengkaitao@kylinos.cn> > > > > pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and > > passes it to the backing percpu allocators. This preserves GFP_NOFS and > > GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation. > > > > However, the chunk creation and population slow paths also call helpers > > which do not take a GFP mask and perform internal allocations with > > GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(), > > and population can allocate temporary metadata or page tables while mapping > > backing pages. As a result, a caller which explicitly uses GFP_NOFS or > > GFP_NOIO can still enter FS or IO reclaim while creating or populating a > > percpu chunk. > > > > This is problematic for callers which use GFP_NOFS or GFP_NOIO because > > they are already holding filesystem or IO-path locks. If free chunks are > > exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter > > unconstrained reclaim from these internal allocations, defeating the > > caller's allocation context and potentially recreating reclaim lock > > dependencies. > > > > Wrap chunk creation and population in a scoped NOIO or NOFS context when > > pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL > > allocations unchanged so they retain full reclaim capability. > > > > Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") > > I assume you _did not_ observe this in production? As in no reclaim path should be > insane^W daring enough to do pcpu allocations? Oops, I mixed my issues up. This is purely a GFP flags issue. A quick "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no NOFS or NOIO as far as I can see. So you probably did not observe this? > > > Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> > > --- > > mm/percpu.c | 26 ++++++++++++++++++++++++++ > > 1 file changed, 26 insertions(+) > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index 71a85d7245c7..1bb38467390b 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s > > } > > #endif > > > > +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp) > > +{ > > + if (!(gfp & __GFP_IO)) > > + return memalloc_noio_save(); > > + if (!(gfp & __GFP_FS)) > > + return memalloc_nofs_save(); > > + return 0; > > +} > > + > > +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) > > +{ > > + if (!(gfp & __GFP_IO)) > > + memalloc_noio_restore(flags); > > + else if (!(gfp & __GFP_FS)) > > + memalloc_nofs_restore(flags); > > +} > > I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk() > and pcpu_populate_chunk(). It's their job to respect the gfp flags and > Do The Right Thing(tm). Can you fix the problematic places? It seems like it's > mostly the vmalloc backend that's problematic. > > > + > > /** > > * pcpu_alloc - the percpu allocator > > * @size: size of area to allocate in bytes > > @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, > > > > /* No space left. Create a new chunk. */ > > if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) { > > + unsigned int pcpu_scope; > > + > > + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); > > chunk = pcpu_create_chunk(pcpu_gfp); > > + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); > > + > > if (!chunk) { > > err = "failed to allocate new chunk"; > > goto fail; > > @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, > > page_end = PFN_UP(off + size); > > > > for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) { > > + unsigned int pcpu_scope; > > + > > WARN_ON(chunk->immutable); > > > > + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); > > ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp); > > + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); > > > > spin_lock_irqsave(&pcpu_lock, flags); > > if (ret) { > > -- > > 2.50.1 (Apple Git-155) > > > > -- > Pedro -- Pedro ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate 2026-05-29 9:38 ` Pedro Falcato @ 2026-05-30 12:47 ` Kaitao Cheng 2026-05-30 13:32 ` Dennis Zhou 0 siblings, 1 reply; 12+ messages in thread From: Kaitao Cheng @ 2026-05-30 12:47 UTC (permalink / raw) To: Pedro Falcato, akpm Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng 在 2026/5/29 17:38, Pedro Falcato 写道: > On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote: >> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote: >>> From: Kaitao Cheng <chengkaitao@kylinos.cn> >>> >>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and >>> passes it to the backing percpu allocators. This preserves GFP_NOFS and >>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation. >>> >>> However, the chunk creation and population slow paths also call helpers >>> which do not take a GFP mask and perform internal allocations with >>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(), >>> and population can allocate temporary metadata or page tables while mapping >>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or >>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a >>> percpu chunk. >>> >>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because >>> they are already holding filesystem or IO-path locks. If free chunks are >>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter >>> unconstrained reclaim from these internal allocations, defeating the >>> caller's allocation context and potentially recreating reclaim lock >>> dependencies. >>> >>> Wrap chunk creation and population in a scoped NOIO or NOFS context when >>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL >>> allocations unchanged so they retain full reclaim capability. >>> >>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") >> >> I assume you _did not_ observe this in production? As in no reclaim path should be >> insane^W daring enough to do pcpu allocations? > > Oops, I mixed my issues up. This is purely a GFP flags issue. A quick > "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are > using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no > NOFS or NOIO as far as I can see. So you probably did not observe this? Right, this issue has not been observed in production. It came from a question raised by AI code review, and after carefully reading the code, I found that there are indeed some synchronization concerns. Here is one example of the scenario [PATCH 1/2] is trying to address: blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while configuring policy"). blkg_conf_prep() now serializes against blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason: CPU0: blkg_conf_prep() mutex_lock(q->blkcg_mutex) blkg_alloc(..., GFP_NOIO) alloc_percpu_gfp(..., GFP_NOIO) -> if percpu chunks are exhausted, chunk create/populate may do internal GFP_KERNEL allocations -> direct reclaim / writeback can issue IO to this queue -> IO waits because the queue is frozen CPU1: blkcg_deactivate_policy() blk_mq_freeze_queue(q) mutex_lock(q->blkcg_mutex) -> waits for CPU0 ... unfreeze only happens after q->blkcg_mutex is acquired/released So the concern is that the caller deliberately uses GFP_NOIO because it may hold a lock which can be acquired after queue freeze, but the percpu slow path can temporarily lose that allocation context. The failure requires the slow path (new chunk creation or population), so it is not expected to be common. >>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> >>> --- >>> mm/percpu.c | 26 ++++++++++++++++++++++++++ >>> 1 file changed, 26 insertions(+) >>> >>> diff --git a/mm/percpu.c b/mm/percpu.c >>> index 71a85d7245c7..1bb38467390b 100644 >>> --- a/mm/percpu.c >>> +++ b/mm/percpu.c >>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s >>> } >>> #endif >>> >>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp) >>> +{ >>> + if (!(gfp & __GFP_IO)) >>> + return memalloc_noio_save(); >>> + if (!(gfp & __GFP_FS)) >>> + return memalloc_nofs_save(); >>> + return 0; >>> +} >>> + >>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) >>> +{ >>> + if (!(gfp & __GFP_IO)) >>> + memalloc_noio_restore(flags); >>> + else if (!(gfp & __GFP_FS)) >>> + memalloc_nofs_restore(flags); >>> +} >> >> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk() >> and pcpu_populate_chunk(). It's their job to respect the gfp flags and >> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's >> mostly the vmalloc backend that's problematic. I’ll try to do it. Following your suggestion, including in [PATCH 2/2], I will also try a different approach and fix the issue by reducing the scope of the pcpu_alloc_mutex critical section. >>> /** >>> * pcpu_alloc - the percpu allocator >>> * @size: size of area to allocate in bytes >>> @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, >>> >>> /* No space left. Create a new chunk. */ >>> if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) { >>> + unsigned int pcpu_scope; >>> + >>> + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); >>> chunk = pcpu_create_chunk(pcpu_gfp); >>> + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); >>> + >>> if (!chunk) { >>> err = "failed to allocate new chunk"; >>> goto fail; >>> @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, >>> page_end = PFN_UP(off + size); >>> >>> for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) { >>> + unsigned int pcpu_scope; >>> + >>> WARN_ON(chunk->immutable); >>> >>> + pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp); >>> ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp); >>> + pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope); >>> >>> spin_lock_irqsave(&pcpu_lock, flags); >>> if (ret) { >>> -- >>> 2.50.1 (Apple Git-155) >>> -- Thanks Kaitao Cheng ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate 2026-05-30 12:47 ` Kaitao Cheng @ 2026-05-30 13:32 ` Dennis Zhou 2026-06-01 2:27 ` Kaitao Cheng 0 siblings, 1 reply; 12+ messages in thread From: Dennis Zhou @ 2026-05-30 13:32 UTC (permalink / raw) To: Kaitao Cheng Cc: Pedro Falcato, akpm, dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng Hello, On Sat, May 30, 2026 at 08:47:34PM +0800, Kaitao Cheng wrote: > 在 2026/5/29 17:38, Pedro Falcato 写道: > > On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote: > >> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote: > >>> From: Kaitao Cheng <chengkaitao@kylinos.cn> > >>> > >>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and > >>> passes it to the backing percpu allocators. This preserves GFP_NOFS and > >>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation. > >>> > >>> However, the chunk creation and population slow paths also call helpers > >>> which do not take a GFP mask and perform internal allocations with > >>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(), > >>> and population can allocate temporary metadata or page tables while mapping > >>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or > >>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a > >>> percpu chunk. > >>> > >>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because > >>> they are already holding filesystem or IO-path locks. If free chunks are > >>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter > >>> unconstrained reclaim from these internal allocations, defeating the > >>> caller's allocation context and potentially recreating reclaim lock > >>> dependencies. > >>> > >>> Wrap chunk creation and population in a scoped NOIO or NOFS context when > >>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL > >>> allocations unchanged so they retain full reclaim capability. > >>> > >>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") > >> > >> I assume you _did not_ observe this in production? As in no reclaim path should be > >> insane^W daring enough to do pcpu allocations? > > > > Oops, I mixed my issues up. This is purely a GFP flags issue. A quick > > "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are > > using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no > > NOFS or NOIO as far as I can see. So you probably did not observe this? > > Right, this issue has not been observed in production. It came from a > question raised by AI code review, and after carefully reading the code, > I found that there are indeed some synchronization concerns. > > Here is one example of the scenario [PATCH 1/2] is trying to address: > > blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while > configuring policy"). blkg_conf_prep() now serializes against > blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was > changed to GFP_NOIO for that reason: > > CPU0: blkg_conf_prep() > mutex_lock(q->blkcg_mutex) > blkg_alloc(..., GFP_NOIO) > alloc_percpu_gfp(..., GFP_NOIO) > -> if percpu chunks are exhausted, chunk create/populate may do > internal GFP_KERNEL allocations > -> direct reclaim / writeback can issue IO to this queue > -> IO waits because the queue is frozen > > CPU1: blkcg_deactivate_policy() > blk_mq_freeze_queue(q) > mutex_lock(q->blkcg_mutex) > -> waits for CPU0 > ... unfreeze only happens after q->blkcg_mutex is acquired/released > > So the concern is that the caller deliberately uses GFP_NOIO because it > may hold a lock which can be acquired after queue freeze, but the percpu > slow path can temporarily lose that allocation context. The failure > requires the slow path (new chunk creation or population), so it is not > expected to be common. This is likely just a miss for [1] where we switching to gfpflags_allow_blocking(), This seems like it's just a miss from [1] where we switched to a less conservative approach than atomic == !GFP_KERNEL. If anything we just need to allow the additional flags through. Maybe just add GFP_NOIO and GFP_NOFS in pcpu_gfp via the whitelist. [1] 9a5b183941b ("mm, percpu: do not consider sleepable allocations atomic") > > >>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> > >>> --- > >>> mm/percpu.c | 26 ++++++++++++++++++++++++++ > >>> 1 file changed, 26 insertions(+) > >>> > >>> diff --git a/mm/percpu.c b/mm/percpu.c > >>> index 71a85d7245c7..1bb38467390b 100644 > >>> --- a/mm/percpu.c > >>> +++ b/mm/percpu.c > >>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s > >>> } > >>> #endif > >>> > >>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp) > >>> +{ > >>> + if (!(gfp & __GFP_IO)) > >>> + return memalloc_noio_save(); > >>> + if (!(gfp & __GFP_FS)) > >>> + return memalloc_nofs_save(); > >>> + return 0; > >>> +} > >>> + > >>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) > >>> +{ > >>> + if (!(gfp & __GFP_IO)) > >>> + memalloc_noio_restore(flags); > >>> + else if (!(gfp & __GFP_FS)) > >>> + memalloc_nofs_restore(flags); > >>> +} > >> > >> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk() > >> and pcpu_populate_chunk(). It's their job to respect the gfp flags and > >> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's > >> mostly the vmalloc backend that's problematic. > > I’ll try to do it. > > Following your suggestion, including in [PATCH 2/2], I will also try a > different approach and fix the issue by reducing the scope of the > pcpu_alloc_mutex critical section. > No please don't. The point of the percpu mutex is to ensure that only one person is ever possibly creating a new chunk. If you drop the mutex, then you have to deal with concurrent callers when available percpu memory is low. Percpu memory is expensive and unmovable so the cost is in the control plane to avoid excess fragmentation. Thanks, Dennis ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate 2026-05-30 13:32 ` Dennis Zhou @ 2026-06-01 2:27 ` Kaitao Cheng 2026-06-01 15:45 ` Michal Hocko 0 siblings, 1 reply; 12+ messages in thread From: Kaitao Cheng @ 2026-06-01 2:27 UTC (permalink / raw) To: Dennis Zhou, Pedro Falcato, akpm Cc: tj, cl, mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng 在 2026/5/30 21:32, Dennis Zhou 写道: > Hello, > > On Sat, May 30, 2026 at 08:47:34PM +0800, Kaitao Cheng wrote: >> 在 2026/5/29 17:38, Pedro Falcato 写道: >>> On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote: >>>> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote: >>>>> From: Kaitao Cheng <chengkaitao@kylinos.cn> >>>>> >>>>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and >>>>> passes it to the backing percpu allocators. This preserves GFP_NOFS and >>>>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation. >>>>> >>>>> However, the chunk creation and population slow paths also call helpers >>>>> which do not take a GFP mask and perform internal allocations with >>>>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(), >>>>> and population can allocate temporary metadata or page tables while mapping >>>>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or >>>>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a >>>>> percpu chunk. >>>>> >>>>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because >>>>> they are already holding filesystem or IO-path locks. If free chunks are >>>>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter >>>>> unconstrained reclaim from these internal allocations, defeating the >>>>> caller's allocation context and potentially recreating reclaim lock >>>>> dependencies. >>>>> >>>>> Wrap chunk creation and population in a scoped NOIO or NOFS context when >>>>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL >>>>> allocations unchanged so they retain full reclaim capability. >>>>> >>>>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") >>>> >>>> I assume you _did not_ observe this in production? As in no reclaim path should be >>>> insane^W daring enough to do pcpu allocations? >>> >>> Oops, I mixed my issues up. This is purely a GFP flags issue. A quick >>> "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are >>> using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no >>> NOFS or NOIO as far as I can see. So you probably did not observe this? >> >> Right, this issue has not been observed in production. It came from a >> question raised by AI code review, and after carefully reading the code, >> I found that there are indeed some synchronization concerns. >> >> Here is one example of the scenario [PATCH 1/2] is trying to address: >> >> blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while >> configuring policy"). blkg_conf_prep() now serializes against >> blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was >> changed to GFP_NOIO for that reason: >> >> CPU0: blkg_conf_prep() >> mutex_lock(q->blkcg_mutex) >> blkg_alloc(..., GFP_NOIO) >> alloc_percpu_gfp(..., GFP_NOIO) >> -> if percpu chunks are exhausted, chunk create/populate may do >> internal GFP_KERNEL allocations >> -> direct reclaim / writeback can issue IO to this queue >> -> IO waits because the queue is frozen >> >> CPU1: blkcg_deactivate_policy() >> blk_mq_freeze_queue(q) >> mutex_lock(q->blkcg_mutex) >> -> waits for CPU0 >> ... unfreeze only happens after q->blkcg_mutex is acquired/released >> >> So the concern is that the caller deliberately uses GFP_NOIO because it >> may hold a lock which can be acquired after queue freeze, but the percpu >> slow path can temporarily lose that allocation context. The failure >> requires the slow path (new chunk creation or population), so it is not >> expected to be common. > > This is likely just a miss for [1] where we switching to > gfpflags_allow_blocking(), > > This seems like it's just a miss from [1] where we switched to a less > conservative approach than atomic == !GFP_KERNEL. If anything we just > need to allow the additional flags through. Maybe just add GFP_NOIO and > GFP_NOFS in pcpu_gfp via the whitelist. > > [1] 9a5b183941b ("mm, percpu: do not consider sleepable allocations atomic") >> >>>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> >>>>> --- >>>>> mm/percpu.c | 26 ++++++++++++++++++++++++++ >>>>> 1 file changed, 26 insertions(+) >>>>> >>>>> diff --git a/mm/percpu.c b/mm/percpu.c >>>>> index 71a85d7245c7..1bb38467390b 100644 >>>>> --- a/mm/percpu.c >>>>> +++ b/mm/percpu.c >>>>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s >>>>> } >>>>> #endif >>>>> >>>>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp) >>>>> +{ >>>>> + if (!(gfp & __GFP_IO)) >>>>> + return memalloc_noio_save(); >>>>> + if (!(gfp & __GFP_FS)) >>>>> + return memalloc_nofs_save(); >>>>> + return 0; >>>>> +} >>>>> + >>>>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) >>>>> +{ >>>>> + if (!(gfp & __GFP_IO)) >>>>> + memalloc_noio_restore(flags); >>>>> + else if (!(gfp & __GFP_FS)) >>>>> + memalloc_nofs_restore(flags); >>>>> +} >>>> >>>> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk() >>>> and pcpu_populate_chunk(). It's their job to respect the gfp flags and >>>> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's >>>> mostly the vmalloc backend that's problematic. >> >> I’ll try to do it. >> >> Following your suggestion, including in [PATCH 2/2], I will also try a >> different approach and fix the issue by reducing the scope of the >> pcpu_alloc_mutex critical section. >> > > No please don't. The point of the percpu mutex is to ensure that only > one person is ever possibly creating a new chunk. If you drop the mutex, > then you have to deal with concurrent callers when available percpu > memory is low. Percpu memory is expensive and unmovable so the cost is > in the control plane to avoid excess fragmentation. > You are right. pcpu_alloc_mutex seems necessary, and there does not appear to be a good way to replace it for now. The introduction of 9a5b183941b can lead to several synchronization issues. There are already three known cases: the two described in [PATCH 0/2], plus the one raised by AI review. https://lore.kernel.org/all/20260528132917.81123-1-kaitao.cheng@linux.dev/ https://sashiko.dev/#/patchset/20260528132917.81123-1-kaitao.cheng@linux.dev Since optimizing around pcpu_alloc_mutex is constrained, there does not seem to be a perfect solution that addresses all of them at the same time. However, if we revert 9a5b183941b, it seems that all of these issues would be resolved. The only downside is that the failure rate of pcpu_alloc_noprof() allocations may increase, which might be acceptable. I would like to hear what others think. -- Thanks Kaitao Cheng ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate 2026-06-01 2:27 ` Kaitao Cheng @ 2026-06-01 15:45 ` Michal Hocko 0 siblings, 0 replies; 12+ messages in thread From: Michal Hocko @ 2026-06-01 15:45 UTC (permalink / raw) To: Kaitao Cheng Cc: Dennis Zhou, Pedro Falcato, akpm, tj, cl, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng On Mon 01-06-26 10:27:53, Kaitao Cheng wrote: > However, if we revert 9a5b183941b, it seems that all of these issues would > be resolved. The only downside is that the failure rate of pcpu_alloc_noprof() > allocations may increase, which might be acceptable. That has practical impact on some versions of iscsid which do not have PR_SET_IO_FLUSHER. And maybe some more so I would rather not revert based on a theoretical concerns which I believe is the case here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim 2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng 2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng @ 2026-05-28 13:29 ` Kaitao Cheng 2026-05-29 9:34 ` Pedro Falcato 2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton 2026-05-28 21:10 ` Andrew Morton 3 siblings, 1 reply; 12+ messages in thread From: Kaitao Cheng @ 2026-05-28 13:29 UTC (permalink / raw) To: dennis, tj, cl, akpm Cc: mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng From: Kaitao Cheng <chengkaitao@kylinos.cn> pcpu_alloc_noprof() takes pcpu_alloc_mutex for sleepable allocations so that it can create chunks and populate backing pages. If reclaim is entered while that mutex is already held, and reclaim reaches a path which allocates percpu memory, the nested allocation can try to take pcpu_alloc_mutex again. That creates a reclaim recursion dependency: pcpu_alloc_noprof(GFP_KERNEL) mutex_lock(&pcpu_alloc_mutex) reclaim pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) mutex_lock(&pcpu_alloc_mutex) Avoid this by treating percpu allocations from reclaim context as atomic. Such allocations may still be served from already available and populated areas, but they must not enter the mutex-protected slow path or create new chunks. If no space is available, fail the allocation and let the normal balance work handle replenishment outside reclaim. Update the function comment to describe that reclaim context allocations are atomic regardless of whether the supplied GFP mask would otherwise allow blocking. This patch is a preventive fix. There may not currently be any path that calls pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) from direct reclaim context. Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> --- mm/percpu.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/mm/percpu.c b/mm/percpu.c index 1bb38467390b..9c30e5897813 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1803,9 +1803,9 @@ static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) * @gfp: allocation flags * * Allocate percpu area of @size bytes aligned at @align. If @gfp doesn't - * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN - * then no warning will be triggered on invalid or failed allocation - * requests. + * allow blocking, or if allocation is requested from reclaim context, the + * allocation is atomic. If @gfp has __GFP_NOWARN then no warning will be + * triggered on invalid or failed allocation requests. * * RETURNS: * Percpu pointer to the allocated area on success, NULL on failure. @@ -1828,7 +1828,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, gfp = current_gfp_context(gfp); /* whitelisted flags that can be passed to the backing allocators */ pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN); - is_atomic = !gfpflags_allow_blocking(gfp); + /* + * Reclaim can be entered while pcpu_alloc_mutex is already held by + * another percpu allocation. Avoid recursing back into the mutex from + * reclaim; best-effort allocations from already populated areas are OK. + */ + is_atomic = !gfpflags_allow_blocking(gfp) || current->reclaim_state; do_warn = !(gfp & __GFP_NOWARN); /* -- 2.50.1 (Apple Git-155) ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim 2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng @ 2026-05-29 9:34 ` Pedro Falcato 0 siblings, 0 replies; 12+ messages in thread From: Pedro Falcato @ 2026-05-29 9:34 UTC (permalink / raw) To: Kaitao Cheng Cc: dennis, tj, cl, akpm, mhocko, vbabka, linux-mm, linux-kernel, muchun.song, Kaitao Cheng On Thu, May 28, 2026 at 09:29:17PM +0800, Kaitao Cheng wrote: > From: Kaitao Cheng <chengkaitao@kylinos.cn> > > pcpu_alloc_noprof() takes pcpu_alloc_mutex for sleepable allocations > so that it can create chunks and populate backing pages. If reclaim is > entered while that mutex is already held, and reclaim reaches a path > which allocates percpu memory, the nested allocation can try to take > pcpu_alloc_mutex again. > > That creates a reclaim recursion dependency: > > pcpu_alloc_noprof(GFP_KERNEL) > mutex_lock(&pcpu_alloc_mutex) > reclaim > pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) > mutex_lock(&pcpu_alloc_mutex) > > Avoid this by treating percpu allocations from reclaim context as atomic. > Such allocations may still be served from already available and populated > areas, but they must not enter the mutex-protected slow path or create new > chunks. If no space is available, fail the allocation and let the normal > balance work handle replenishment outside reclaim. > > Update the function comment to describe that reclaim context allocations > are atomic regardless of whether the supplied GFP mask would otherwise > allow blocking. > > This patch is a preventive fix. There may not currently be any path that > calls pcpu_alloc_noprof(GFP_NOIO/GFP_NOFS) from direct reclaim context. I don't like this. The proper way of fixing this would probably be to release pcpu_alloc_mutex (or not have it in the first place!) while you're allocating memory. > > Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic") > Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> > --- > mm/percpu.c | 13 +++++++++---- > 1 file changed, 9 insertions(+), 4 deletions(-) > > diff --git a/mm/percpu.c b/mm/percpu.c > index 1bb38467390b..9c30e5897813 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1803,9 +1803,9 @@ static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags) > * @gfp: allocation flags > * > * Allocate percpu area of @size bytes aligned at @align. If @gfp doesn't > - * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN > - * then no warning will be triggered on invalid or failed allocation > - * requests. > + * allow blocking, or if allocation is requested from reclaim context, the > + * allocation is atomic. If @gfp has __GFP_NOWARN then no warning will be > + * triggered on invalid or failed allocation requests. > * > * RETURNS: > * Percpu pointer to the allocated area on success, NULL on failure. > @@ -1828,7 +1828,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved, > gfp = current_gfp_context(gfp); > /* whitelisted flags that can be passed to the backing allocators */ > pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN); > - is_atomic = !gfpflags_allow_blocking(gfp); > + /* > + * Reclaim can be entered while pcpu_alloc_mutex is already held by > + * another percpu allocation. Avoid recursing back into the mutex from > + * reclaim; best-effort allocations from already populated areas are OK. > + */ since this is an entirely theoretical issue: /* Reclaim paths should not be hitting the percpu allocator, for now */ if (WARN_ON_ONCE(current->reclaim_state)) return NULL; But that's just my 2c. > + is_atomic = !gfpflags_allow_blocking(gfp) || current->reclaim_state; > do_warn = !(gfp & __GFP_NOWARN); > > /* > -- > 2.50.1 (Apple Git-155) > -- Pedro ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion 2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng 2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng 2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng @ 2026-05-28 21:09 ` Andrew Morton 2026-05-28 21:10 ` Andrew Morton 3 siblings, 0 replies; 12+ messages in thread From: Andrew Morton @ 2026-05-28 21:09 UTC (permalink / raw) To: Kaitao Cheng Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel, muchun.song On Thu, 28 May 2026 21:29:15 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote: > Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations > atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use > pcpu_alloc_mutex and the chunk creation slow path. This restored the > allocation capability that was lost when those constrained allocations > were treated as atomic, but it also opens two possible reclaim recursion > problems. > > The first problem is that the create and populate slow paths do not fully > preserve the caller's allocation constraints. pcpu_alloc_noprof() derives > pcpu_gfp from the caller supplied GFP mask and passes it to the backing > page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(), > and population can allocate temporary metadata or page tables while mapping > backing pages. Those internal allocations can use GFP_KERNEL. A caller > using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or > IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's > allocation context. > > The second problem is a possible pcpu_alloc_mutex recursion from reclaim. > If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim > reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO, > the nested allocation can now try to take pcpu_alloc_mutex again because > 9a5b183941b5 no longer treats those masks as atomic. > > Another possible way to avoid these issues is to revert 9a5b183941b5. > However, that would also bring back the premature allocation failures for > sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended > to fix. Thanks. 9a5b183941b5 has been in there for a year. How are you observing/triggering this bug and what are the userspace-visible effects? We might choose to backport fixes into -stable kernels, but this additional info is needed to make that determination. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion 2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng ` (2 preceding siblings ...) 2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton @ 2026-05-28 21:10 ` Andrew Morton 3 siblings, 0 replies; 12+ messages in thread From: Andrew Morton @ 2026-05-28 21:10 UTC (permalink / raw) To: Kaitao Cheng Cc: dennis, tj, cl, mhocko, vbabka, linux-mm, linux-kernel, muchun.song On Thu, 28 May 2026 21:29:15 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote: > Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations > atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use > pcpu_alloc_mutex and the chunk creation slow path. This restored the > allocation capability that was lost when those constrained allocations > were treated as atomic, but it also opens two possible reclaim recursion > problems. > > The first problem is that the create and populate slow paths do not fully > preserve the caller's allocation constraints. pcpu_alloc_noprof() derives > pcpu_gfp from the caller supplied GFP mask and passes it to the backing > page allocator. However, pcpu_create_chunk() calls pcpu_get_vm_areas(), > and population can allocate temporary metadata or page tables while mapping > backing pages. Those internal allocations can use GFP_KERNEL. A caller > using GFP_NOFS or GFP_NOIO can therefore still enter unconstrained FS or > IO reclaim while holding pcpu_alloc_mutex. This defeats the caller's > allocation context. > > The second problem is a possible pcpu_alloc_mutex recursion from reclaim. > If reclaim is entered while pcpu_alloc_mutex is already held, and reclaim > reaches a path which allocates percpu memory with GFP_NOFS or GFP_NOIO, > the nested allocation can now try to take pcpu_alloc_mutex again because > 9a5b183941b5 no longer treats those masks as atomic. > > Another possible way to avoid these issues is to revert 9a5b183941b5. > However, that would also bring back the premature allocation failures for > sleepable GFP_NOFS/GFP_NOIO percpu users that 9a5b183941b5 was intended > to fix. AI review might have found a couple of pre-existing issues: https://sashiko.dev/#/patchset/20260528132917.81123-1-kaitao.cheng@linux.dev ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-06-01 15:45 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng 2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng 2026-05-29 9:25 ` Pedro Falcato 2026-05-29 9:38 ` Pedro Falcato 2026-05-30 12:47 ` Kaitao Cheng 2026-05-30 13:32 ` Dennis Zhou 2026-06-01 2:27 ` Kaitao Cheng 2026-06-01 15:45 ` Michal Hocko 2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng 2026-05-29 9:34 ` Pedro Falcato 2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton 2026-05-28 21:10 ` Andrew Morton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox