Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Kaitao Cheng <kaitao.cheng@linux.dev>
To: Pedro Falcato <pfalcato@suse.de>, akpm@linux-foundation.org
Cc: dennis@kernel.org, tj@kernel.org, cl@gentwo.org, mhocko@suse.com,
	vbabka@kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, muchun.song@linux.dev,
	Kaitao Cheng <chengkaitao@kylinos.cn>
Subject: Re: [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate
Date: Sat, 30 May 2026 20:47:34 +0800	[thread overview]
Message-ID: <5a4aa532-77a0-436a-8f5e-1bbcf2db6bbb@linux.dev> (raw)
In-Reply-To: <ahld5va1b9w8j659@pedro-suse.lan>

在 2026/5/29 17:38, Pedro Falcato 写道:
> On Fri, May 29, 2026 at 10:25:28AM +0100, Pedro Falcato wrote:
>> On Thu, May 28, 2026 at 09:29:16PM +0800, Kaitao Cheng wrote:
>>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>>
>>> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
>>> passes it to the backing percpu allocators. This preserves GFP_NOFS and
>>> GFP_NOIO for pcpu_alloc_pages() and for the initial pcpu_chunk allocation.
>>>
>>> However, the chunk creation and population slow paths also call helpers
>>> which do not take a GFP mask and perform internal allocations with
>>> GFP_KERNEL. For example, pcpu_create_chunk() calls pcpu_get_vm_areas(),
>>> and population can allocate temporary metadata or page tables while mapping
>>> backing pages. As a result, a caller which explicitly uses GFP_NOFS or
>>> GFP_NOIO can still enter FS or IO reclaim while creating or populating a
>>> percpu chunk.
>>>
>>> This is problematic for callers which use GFP_NOFS or GFP_NOIO because
>>> they are already holding filesystem or IO-path locks. If free chunks are
>>> exhausted, the percpu allocation can take pcpu_alloc_mutex and then enter
>>> unconstrained reclaim from these internal allocations, defeating the
>>> caller's allocation context and potentially recreating reclaim lock
>>> dependencies.
>>>
>>> Wrap chunk creation and population in a scoped NOIO or NOFS context when
>>> pcpu_gfp has the corresponding constraints. Leave ordinary GFP_KERNEL
>>> allocations unchanged so they retain full reclaim capability.
>>>
>>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
>>
>> I assume you _did not_ observe this in production? As in no reclaim path should be
>> insane^W daring enough to do pcpu allocations?
> 
> Oops, I mixed my issues up. This is purely a GFP flags issue. A quick
> "git grep alloc_percpu_gfp" shows that the vast majority (all?) callers are
> using some combination of GFP_KERNEL or GFP_ATOMIC + other GFP flags, but no
> NOFS or NOIO as far as I can see. So you probably did not observe this?

Right, this issue has not been observed in production. It came from a
question raised by AI code review, and after carefully reading the code,
I found that there are indeed some synchronization concerns.

Here is one example of the scenario [PATCH 1/2] is trying to address:

blk-cgroup after 5d726c4dbeed ("blk-cgroup: fix possible deadlock while
configuring policy"). blkg_conf_prep() now serializes against
blkcg_deactivate_policy() with q->blkcg_mutex, and blkg_alloc() was
changed to GFP_NOIO for that reason:

  CPU0: blkg_conf_prep()
    mutex_lock(q->blkcg_mutex)
    blkg_alloc(..., GFP_NOIO)
      alloc_percpu_gfp(..., GFP_NOIO)
        -> if percpu chunks are exhausted, chunk create/populate may do
           internal GFP_KERNEL allocations
        -> direct reclaim / writeback can issue IO to this queue
        -> IO waits because the queue is frozen

  CPU1: blkcg_deactivate_policy()
    blk_mq_freeze_queue(q)
    mutex_lock(q->blkcg_mutex)
      -> waits for CPU0
    ... unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock which can be acquired after queue freeze, but the percpu
slow path can temporarily lose that allocation context. The failure
requires the slow path (new chunk creation or population), so it is not
expected to be common.

>>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>>> ---
>>>  mm/percpu.c | 26 ++++++++++++++++++++++++++
>>>  1 file changed, 26 insertions(+)
>>>
>>> diff --git a/mm/percpu.c b/mm/percpu.c
>>> index 71a85d7245c7..1bb38467390b 100644
>>> --- a/mm/percpu.c
>>> +++ b/mm/percpu.c
>>> @@ -1778,6 +1778,23 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>>>  }
>>>  #endif
>>>  
>>> +static unsigned int pcpu_memalloc_scope_save(gfp_t gfp)
>>> +{
>>> +	if (!(gfp & __GFP_IO))
>>> +		return memalloc_noio_save();
>>> +	if (!(gfp & __GFP_FS))
>>> +		return memalloc_nofs_save();
>>> +	return 0;
>>> +}
>>> +
>>> +static void pcpu_memalloc_scope_restore(gfp_t gfp, unsigned int flags)
>>> +{
>>> +	if (!(gfp & __GFP_IO))
>>> +		memalloc_noio_restore(flags);
>>> +	else if (!(gfp & __GFP_FS))
>>> +		memalloc_nofs_restore(flags);
>>> +}
>>
>> I disagree with this. We already have gfp flags, they're already passed to pcpu_create_chunk()
>> and pcpu_populate_chunk(). It's their job to respect the gfp flags and
>> Do The Right Thing(tm). Can you fix the problematic places? It seems like it's
>> mostly the vmalloc backend that's problematic.

I’ll try to do it.

Following your suggestion, including in [PATCH 2/2], I will also try a
different approach and fix the issue by reducing the scope of the
pcpu_alloc_mutex critical section.

>>>  /**
>>>   * pcpu_alloc - the percpu allocator
>>>   * @size: size of area to allocate in bytes
>>> @@ -1901,7 +1918,12 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>>>  
>>>  	/* No space left.  Create a new chunk. */
>>>  	if (list_empty(&pcpu_chunk_lists[pcpu_free_slot])) {
>>> +		unsigned int pcpu_scope;
>>> +
>>> +		pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>>>  		chunk = pcpu_create_chunk(pcpu_gfp);
>>> +		pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>>> +
>>>  		if (!chunk) {
>>>  			err = "failed to allocate new chunk";
>>>  			goto fail;
>>> @@ -1931,9 +1953,13 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>>>  		page_end = PFN_UP(off + size);
>>>  
>>>  		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
>>> +			unsigned int pcpu_scope;
>>> +
>>>  			WARN_ON(chunk->immutable);
>>>  
>>> +			pcpu_scope = pcpu_memalloc_scope_save(pcpu_gfp);
>>>  			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
>>> +			pcpu_memalloc_scope_restore(pcpu_gfp, pcpu_scope);
>>>  
>>>  			spin_lock_irqsave(&pcpu_lock, flags);
>>>  			if (ret) {
>>> -- 
>>> 2.50.1 (Apple Git-155)
>>>
-- 
Thanks
Kaitao Cheng



  reply	other threads:[~2026-05-30 12:47 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-28 13:29 [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
2026-05-28 13:29 ` [PATCH 1/2] mm/percpu: Preserve NOFS/NOIO scope during chunk create and populate Kaitao Cheng
2026-05-29  9:25   ` Pedro Falcato
2026-05-29  9:38     ` Pedro Falcato
2026-05-30 12:47       ` Kaitao Cheng [this message]
2026-05-30 13:32         ` Dennis Zhou
2026-05-28 13:29 ` [PATCH 2/2] mm/percpu: Avoid pcpu_alloc_mutex recursion from reclaim Kaitao Cheng
2026-05-29  9:34   ` Pedro Falcato
2026-05-28 21:09 ` [PATCH 0/2] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Andrew Morton
2026-05-28 21:10 ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5a4aa532-77a0-436a-8f5e-1bbcf2db6bbb@linux.dev \
    --to=kaitao.cheng@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=chengkaitao@kylinos.cn \
    --cc=cl@gentwo.org \
    --cc=dennis@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=pfalcato@suse.de \
    --cc=tj@kernel.org \
    --cc=vbabka@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox