From: Dennis Zhou <dennis@kernel.org>
To: Kaitao Cheng <kaitao.cheng@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Uladzislau Rezki <urezki@gmail.com>, Tejun Heo <tj@kernel.org>,
Christoph Lameter <cl@gentwo.org>,
Vlastimil Babka <vbabka@kernel.org>,
Michal Hocko <mhocko@suse.com>,
muchun.song@linux.dev, linux-mm@kvack.org,
linux-kernel@vger.kernel.org,
Kaitao Cheng <chengkaitao@kylinos.cn>
Subject: Re: [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks
Date: Tue, 16 Jun 2026 23:29:37 -0700 [thread overview]
Message-ID: <ajI-0ZMWVPPbZa33@palisades.local> (raw)
In-Reply-To: <20260612022648.13008-3-kaitao.cheng@linux.dev>
On Fri, Jun 12, 2026 at 10:26:47AM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>
> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> passes it down to pcpu_populate_chunk(). pcpu_alloc_pages() already uses
> that mask for backing page allocation.
>
> However, the populate slow path still has internal allocations and page
> table allocations which can lose the caller's allocation context. The
> temporary pages array is allocated by pcpu_get_pages() with GFP_KERNEL,
> and pcpu_map_pages() maps the backing pages through
> vmap_pages_range_noflush() using GFP_KERNEL. The latter can allocate
> vmalloc page tables implicitly, so a caller which deliberately uses
> GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while populating
> a percpu chunk.
>
> This has the same concern as chunk creation: callers such as blk-cgroup
> may use GFP_NOIO because they hold locks which can be involved in queue
> freeze or IO reclaim dependencies. If an allocation reaches the percpu
> slow path and needs to populate previously unbacked pages, the internal
> GFP_KERNEL allocations can defeat that context.
>
> One possible case is blk-cgroup after commit 5d726c4dbeed
> ("blk-cgroup: fix possible deadlock while configuring policy").
> blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
> q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:
>
> CPU0: blkg_conf_prep()
> mutex_lock(q->blkcg_mutex)
> blkg_alloc(..., GFP_NOIO)
> alloc_percpu_gfp(..., GFP_NOIO)
> pcpu_alloc_noprof(..., GFP_NOIO)
> pcpu_populate_chunk(GFP_NOIO)
> pcpu_get_pages()
> pcpu_map_pages()
> -> if the selected percpu chunk has unpopulated pages,
> chunk population may do internal GFP_KERNEL allocations
> -> direct reclaim / writeback can issue IO to this queue
> -> IO waits because the queue is frozen
>
> CPU1: blkcg_deactivate_policy()
> blk_mq_freeze_queue(q)
> mutex_lock(q->blkcg_mutex)
> -> waits for CPU0
> ... unfreeze only happens after q->blkcg_mutex is acquired/released
>
> So the concern is that the caller deliberately uses GFP_NOIO because it
> may hold a lock which can be acquired after queue freeze, but the percpu
> slow path can temporarily lose that allocation context.
>
Maybe others have different takes on this, but I don't think this needs
a full duplicate explanation in each patch.
> Pass pcpu_gfp through pcpu_get_pages(), pcpu_map_pages() and
> __pcpu_map_pages(). Apply the corresponding memalloc scope around
> vmap_pages_range_noflush(), because vmalloc page table allocation does not
> pass the GFP mask down explicitly. Keep the first chunk setup path using
> GFP_KERNEL, matching the previous early-init behavior.
>
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
> mm/percpu-vm.c | 38 ++++++++++++++++++++++++++------------
> mm/percpu.c | 2 +-
> 2 files changed, 27 insertions(+), 13 deletions(-)
>
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 69b00741dc68..ccd03cc152d4 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -21,6 +21,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
>
> /**
> * pcpu_get_pages - get temp pages array
> + * @gfp: allocation flags passed to the underlying allocator
> *
> * Returns pointer to array of pointers to struct page which can be indexed
> * with pcpu_page_idx(). Note that there is only one array and accesses
> @@ -29,7 +30,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
> * RETURNS:
> * Pointer to temp pages array on success.
> */
> -static struct page **pcpu_get_pages(void)
> +static struct page **pcpu_get_pages(gfp_t gfp)
> {
> static struct page **pages;
> size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
> @@ -37,7 +38,7 @@ static struct page **pcpu_get_pages(void)
> lockdep_assert_held(&pcpu_alloc_mutex);
>
> if (!pages)
> - pages = pcpu_mem_zalloc(pages_size, GFP_KERNEL);
> + pages = pcpu_mem_zalloc(pages_size, gfp);
> return pages;
> }
>
> @@ -191,10 +192,22 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
> }
>
> static int __pcpu_map_pages(unsigned long addr, struct page **pages,
> - int nr_pages)
> + int nr_pages, gfp_t gfp)
> {
> - return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
> - PAGE_KERNEL, pages, PAGE_SHIFT, GFP_KERNEL);
> + unsigned int flags;
> + int ret;
> +
> + /*
> + * The vmalloc page table allocation path does not pass @gfp down
> + * explicitly. Apply the corresponding memalloc scope so implicit
> + * page table allocations preserve NOFS/NOIO constraints.
> + */
> + flags = memalloc_apply_gfp_scope(gfp);
> + ret = vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
> + PAGE_KERNEL, pages, PAGE_SHIFT, gfp);
> + memalloc_restore_scope(flags);
> +
> + return ret;
> }
>
> /**
> @@ -203,6 +216,7 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
> * @pages: pages array containing pages to be mapped
> * @page_start: page index of the first page to map
> * @page_end: page index of the last page to map + 1
> + * @gfp: allocation flags passed to the underlying allocator
> *
> * For each cpu, map pages [@page_start,@page_end) into @chunk. The
> * caller is responsible for calling pcpu_post_map_flush() after all
> @@ -211,8 +225,8 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
> * This function is responsible for setting up whatever is necessary for
> * reverse lookup (addr -> chunk).
> */
> -static int pcpu_map_pages(struct pcpu_chunk *chunk,
> - struct page **pages, int page_start, int page_end)
> +static int pcpu_map_pages(struct pcpu_chunk *chunk, struct page **pages,
> + int page_start, int page_end, gfp_t gfp)
> {
> unsigned int cpu, tcpu;
> int i, err;
> @@ -220,7 +234,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
> for_each_possible_cpu(cpu) {
> err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
> &pages[pcpu_page_idx(cpu, page_start)],
> - page_end - page_start);
> + page_end - page_start, gfp);
> if (err < 0)
> goto err;
>
> @@ -271,21 +285,21 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
> * @chunk.
> *
> * CONTEXT:
> - * pcpu_alloc_mutex, does GFP_KERNEL allocation.
> + * pcpu_alloc_mutex, does @gfp allocation.
> */
> static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
> int page_start, int page_end, gfp_t gfp)
> {
> struct page **pages;
>
> - pages = pcpu_get_pages();
> + pages = pcpu_get_pages(gfp);
> if (!pages)
> return -ENOMEM;
>
> if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp))
> return -ENOMEM;
>
> - if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
> + if (pcpu_map_pages(chunk, pages, page_start, page_end, gfp)) {
> pcpu_free_pages(chunk, pages, page_start, page_end);
> return -ENOMEM;
> }
> @@ -319,7 +333,7 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
> * successful population attempt so the temp pages array must
> * be available now.
> */
> - pages = pcpu_get_pages();
> + pages = pcpu_get_pages(GFP_KERNEL);
> BUG_ON(!pages);
>
nit: it's a little misleading to pass GFP_KERNEL here because this is
the deallocation path and we expect the pages array to be already
allocated and cached in the static variable.
A little terse might be just passing 0 and checking gfp != 0 to allocate
pages.
A little more verbose could be introducing pcpu_get_pages_cached() to
get to that static variable.
> /* unmap and free */
> diff --git a/mm/percpu.c b/mm/percpu.c
> index b0676b8054ed..4d89965cba16 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -3256,7 +3256,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
>
> /* pte already populated, the following shouldn't fail */
> rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
> - unit_pages);
> + unit_pages, GFP_KERNEL);
> if (rc < 0)
> panic("failed to map percpu area, err=%d\n", rc);
>
> --
> 2.43.0
>
I think this is correct regardless of the nit.
Acked-by: Dennis Zhou <dennis@kernel.org>
Thanks,
Dennis
next prev parent reply other threads:[~2026-06-17 6:29 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-12 2:26 [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
2026-06-12 2:26 ` [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
2026-06-17 6:02 ` Dennis Zhou
2026-06-12 2:26 ` [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks Kaitao Cheng
2026-06-17 6:29 ` Dennis Zhou [this message]
2026-06-12 2:26 ` [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
2026-06-17 6:53 ` Dennis Zhou
2026-06-17 8:56 ` Kaitao Cheng
2026-06-17 7:03 ` [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Dennis Zhou
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ajI-0ZMWVPPbZa33@palisades.local \
--to=dennis@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=chengkaitao@kylinos.cn \
--cc=cl@gentwo.org \
--cc=kaitao.cheng@linux.dev \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=muchun.song@linux.dev \
--cc=tj@kernel.org \
--cc=urezki@gmail.com \
--cc=vbabka@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox