[PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion

Linux-mm Archive on lore.kernel.org

* [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
@ 2026-06-12  2:26 Kaitao Cheng
  2026-06-12  2:26 ` [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Kaitao Cheng @ 2026-06-12  2:26 UTC
  To: Andrew Morton, Uladzislau Rezki, Dennis Zhou, Tejun Heo,
	Christoph Lameter
  Cc: Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, chengkaitao

From: chengkaitao <chengkaitao@kylinos.cn>

Hi all,

After v1 was posted, reviewers raised several differing opinions, mainly
about optimizing pcpu_alloc_mutex. This v3 aims to describe the existing
problems more clearly and to provide a conventional fix.

Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
pcpu_alloc_mutex and the chunk creation slow path. This restored the
allocation capability that was lost when those constrained allocations
were treated as atomic, but it also makes the percpu slow path visible
to callers from constrained reclaim contexts.

There are two related problems.

First, the create and populate slow paths do not fully preserve the
caller's allocation constraints. pcpu_alloc_noprof() derives pcpu_gfp from
the caller supplied GFP mask and passes it down to the percpu backing page
allocator. However, chunk creation calls pcpu_get_vm_areas(), and chunk
population can allocate temporary metadata or vmalloc page tables while
mapping backing pages. Those internal allocations can still use GFP_KERNEL,
so a caller using GFP_NOFS or GFP_NOIO can enter unconstrained FS or IO
reclaim while holding pcpu_alloc_mutex.
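The lost constraint can be illustrated with a small userspace model. This
is only a sketch: the MODEL_* bit values and all helper names are
hypothetical and are not kernel definitions. Only the relationship between
the three masks (GFP_KERNEL allows IO and FS reclaim, GFP_NOFS drops FS,
GFP_NOIO drops both) follows the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the relationships mirror the kernel. */
typedef unsigned int gfp_t;
#define MODEL_GFP_IO      0x1u	/* reclaim may start IO */
#define MODEL_GFP_FS      0x2u	/* reclaim may call into the filesystem */
#define MODEL_GFP_RECLAIM 0x4u	/* allocation may sleep and reclaim */

#define MODEL_GFP_NOIO   (MODEL_GFP_RECLAIM)
#define MODEL_GFP_NOFS   (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)

/* True if an allocation using @gfp may issue IO from reclaim. */
static inline bool may_enter_io_reclaim(gfp_t gfp)
{
	return (gfp & MODEL_GFP_RECLAIM) && (gfp & MODEL_GFP_IO);
}

/* True if an allocation using @gfp may recurse into the filesystem. */
static inline bool may_enter_fs_reclaim(gfp_t gfp)
{
	return (gfp & MODEL_GFP_RECLAIM) && (gfp & MODEL_GFP_FS);
}

/* Models an internal helper that ignores the caller's mask. */
static inline gfp_t internal_gfp_before(gfp_t caller_gfp)
{
	(void)caller_gfp;
	return MODEL_GFP_KERNEL;
}

/* Models an internal helper that uses the mask it was handed. */
static inline gfp_t internal_gfp_after(gfp_t caller_gfp)
{
	return caller_gfp;
}
```

In this model a GFP_NOIO caller can still reach IO reclaim through
internal_gfp_before(), and no longer can through internal_gfp_after().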

One possible case is blk-cgroup after commit 5d726c4dbeed
("blk-cgroup: fix possible deadlock while configuring policy").
blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
q->blkcg_mutex, and blkg_alloc() uses GFP_NOIO because queue freeze and IO
reclaim dependencies can otherwise deadlock. If the percpu slow path loses
that GFP_NOIO context, direct reclaim or writeback can issue IO to a frozen
queue while q->blkcg_mutex is held.

Second, allowing sleepable GFP_NOFS/GFP_NOIO allocations to take
pcpu_alloc_mutex means that unconstrained backing allocations made under
the mutex can create an FS/IO reclaim dependency against a constrained
caller which already holds an FS or IO lock and then waits for
pcpu_alloc_mutex.

This series fixes those issues in three steps:

  - pass the caller supplied GFP mask into pcpu_get_vm_areas() and use it
    for vmalloc metadata and KASAN shadow allocations;
  - pass the GFP mask through the chunk population path, including the
    temporary pages array and vmalloc page table allocation scope;
  - restrict percpu backing allocations performed while holding
    pcpu_alloc_mutex to GFP_NOIO, so they cannot recurse into IO or FS
    reclaim.

This keeps sleepable GFP_NOFS/GFP_NOIO percpu allocations working, while
avoiding the reclaim recursion risks introduced by making those allocations
eligible for the mutex-protected slow path.

Changes in v3:
  - allow @gfp to pass __GFP_NOFAIL through. (Andrew Morton)

Changes in v2:
  - split the previous first patch into vmalloc-area creation and chunk
    population changes; (Pedro Falcato)
  - pass the GFP mask explicitly to pcpu_get_vm_areas(); (Pedro Falcato)
  - apply the corresponding memalloc scope around vmalloc page table
    allocation during chunk population;
  - replace the reclaim recursion avoidance with a GFP_NOIO backing
    allocation mask instead of only rejecting nested reclaim.
    (Michal Hocko)

Link to v2:
https://lore.kernel.org/all/20260604113101.89510-1-kaitao.cheng@linux.dev/

Link to v1:
https://lore.kernel.org/all/20260528132917.81123-1-kaitao.cheng@linux.dev/

Kaitao Cheng (3):
  mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
  mm/percpu: honor GFP constraints when populating chunks
  mm/percpu: Avoid IO/FS reclaim in backing allocations

 include/linux/vmalloc.h |  4 ++--
 mm/percpu-vm.c          | 40 +++++++++++++++++++++++++++-------------
 mm/percpu.c             | 18 ++++++++++++------
 mm/vmalloc.c            | 23 ++++++++++++-----------
 4 files changed, 53 insertions(+), 32 deletions(-)

-- 
2.43.0


* [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
  2026-06-12  2:26 [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
@ 2026-06-12  2:26 ` Kaitao Cheng
  2026-06-17  6:02   ` Dennis Zhou
  2026-06-12  2:26 ` [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks Kaitao Cheng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Kaitao Cheng @ 2026-06-12  2:26 UTC
  To: Andrew Morton, Uladzislau Rezki, Dennis Zhou, Tejun Heo,
	Christoph Lameter
  Cc: Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask
and passes it down to the backing percpu allocator. However, when the
percpu vmalloc allocator has to create a new chunk, pcpu_create_chunk()
calls pcpu_get_vm_areas() to allocate the corresponding vmalloc areas.

pcpu_get_vm_areas() currently performs its internal allocations with
GFP_KERNEL, including vmap area metadata, vm_struct metadata and KASAN
vmalloc shadow population. This means that a caller which deliberately
uses GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while creating
the vmalloc areas for a new percpu chunk.

One possible case is blk-cgroup after commit 5d726c4dbeed
("blk-cgroup: fix possible deadlock while configuring policy").
blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:

  CPU0: blkg_conf_prep()
    mutex_lock(q->blkcg_mutex)
    blkg_alloc(..., GFP_NOIO)
      alloc_percpu_gfp(..., GFP_NOIO)
        pcpu_alloc_noprof(..., GFP_NOIO)
          pcpu_create_chunk(GFP_NOIO)
            pcpu_get_vm_areas()
              -> if percpu chunks are exhausted, chunk create may do
                 internal GFP_KERNEL allocations
              -> direct reclaim / writeback can issue IO to this queue
              -> IO waits because the queue is frozen

  CPU1: blkcg_deactivate_policy()
    blk_mq_freeze_queue(q)
    mutex_lock(q->blkcg_mutex)
      -> waits for CPU0
    ... unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock which can be acquired after queue freeze, but the percpu
slow path can temporarily lose that allocation context.

Pass the caller supplied GFP mask from pcpu_create_chunk() to
pcpu_get_vm_areas(), and use it for the internal vmalloc metadata and
KASAN shadow allocations.

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 include/linux/vmalloc.h |  4 ++--
 mm/percpu-vm.c          |  2 +-
 mm/vmalloc.c            | 23 ++++++++++++-----------
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index d87dc7f77f4e..e4d8d0a9f30f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -310,14 +310,14 @@ static inline void set_vm_flush_reset_perms(void *addr) {}
 #if defined(CONFIG_MMU) && defined(CONFIG_SMP)
 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     const size_t *sizes, int nr_vms,
-				     size_t align);
+				     size_t align, gfp_t gfp);
 
 void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
 # else
 static inline struct vm_struct **
 pcpu_get_vm_areas(const unsigned long *offsets,
 		const size_t *sizes, int nr_vms,
-		size_t align)
+		size_t align, gfp_t gfp)
 {
 	return NULL;
 }
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 4f5937090590..69b00741dc68 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -340,7 +340,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 		return NULL;
 
 	vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
-				pcpu_nr_groups, pcpu_atom_size);
+				pcpu_nr_groups, pcpu_atom_size, gfp);
 	if (!vms) {
 		pcpu_free_chunk(chunk);
 		return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1afca3568b9b..08f468135e4d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4946,16 +4946,17 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
  * @sizes: array containing size of each area
  * @nr_vms: the number of areas to allocate
  * @align: alignment, all entries in @offsets and @sizes must be aligned to this
+ * @gfp: allocation flags passed to the underlying memory allocator
  *
  * Returns: kmalloc'd vm_struct pointer array pointing to allocated
  *	    vm_structs on success, %NULL on failure
  *
  * Percpu allocator wants to use congruent vm areas so that it can
  * maintain the offsets among percpu areas.  This function allocates
- * congruent vmalloc areas for it with GFP_KERNEL.  These areas tend to
- * be scattered pretty far, distance between two areas easily going up
- * to gigabytes.  To avoid interacting with regular vmallocs, these
- * areas are allocated from top.
+ * congruent vmalloc areas for it. These areas tend to be scattered
+ * pretty far, distance between two areas easily going up to gigabytes.
+ * To avoid interacting with regular vmallocs, these areas are allocated
+ * from top.
  *
  * Despite its complicated look, this allocator is rather simple. It
  * does everything top-down and scans free blocks from the end looking
@@ -4966,7 +4967,7 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
  */
 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     const size_t *sizes, int nr_vms,
-				     size_t align)
+				     size_t align, gfp_t gfp)
 {
 	const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
 	const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
@@ -5004,14 +5005,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		return NULL;
 	}
 
-	vms = kzalloc_objs(vms[0], nr_vms);
-	vas = kzalloc_objs(vas[0], nr_vms);
+	vms = kzalloc_objs(vms[0], nr_vms, gfp);
+	vas = kzalloc_objs(vas[0], nr_vms, gfp);
 	if (!vas || !vms)
 		goto err_free2;
 
 	for (area = 0; area < nr_vms; area++) {
-		vas[area] = kmem_cache_zalloc(vmap_area_cachep, GFP_KERNEL);
-		vms[area] = kzalloc_obj(struct vm_struct);
+		vas[area] = kmem_cache_zalloc(vmap_area_cachep, gfp);
+		vms[area] = kzalloc_obj(struct vm_struct, gfp);
 		if (!vas[area] || !vms[area])
 			goto err_free;
 	}
@@ -5101,7 +5102,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 	/* populate the kasan shadow space */
 	for (area = 0; area < nr_vms; area++) {
-		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], GFP_KERNEL))
+		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], gfp))
 			goto err_free_shadow;
 	}
 
@@ -5158,7 +5159,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				continue;
 
 			vas[area] = kmem_cache_zalloc(
-				vmap_area_cachep, GFP_KERNEL);
+				vmap_area_cachep, gfp);
 			if (!vas[area])
 				goto err_free;
 		}
-- 
2.43.0




* Re: [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
  2026-06-12  2:26 ` [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
@ 2026-06-17  6:02   ` Dennis Zhou
  0 siblings, 0 replies; 10+ messages in thread
From: Dennis Zhou @ 2026-06-17  6:02 UTC
  To: Kaitao Cheng
  Cc: Andrew Morton, Uladzislau Rezki, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Vlastimil Babka, Michal Hocko, muchun.song,
	linux-mm, linux-kernel, Kaitao Cheng

Hello,

On Fri, Jun 12, 2026 at 10:26:46AM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask
> and passes it down to the backing percpu allocator. However, when the
> percpu vmalloc allocator has to create a new chunk, pcpu_create_chunk()
> calls pcpu_get_vm_areas() to allocate the corresponding vmalloc areas.
> 
> pcpu_get_vm_areas() currently performs its internal allocations with
> GFP_KERNEL, including vmap area metadata, vm_struct metadata and KASAN
> vmalloc shadow population. This means that a caller which deliberately
> uses GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while creating
> the vmalloc areas for a new percpu chunk.
> 
> One possible case is blk-cgroup after commit 5d726c4dbeed
> ("blk-cgroup: fix possible deadlock while configuring policy").
> blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
> q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:
> 
>   CPU0: blkg_conf_prep()
>     mutex_lock(q->blkcg_mutex)
>     blkg_alloc(..., GFP_NOIO)
>       alloc_percpu_gfp(..., GFP_NOIO)
>         pcpu_alloc_noprof(..., GFP_NOIO)
>           pcpu_create_chunk(GFP_NOIO)
>             pcpu_get_vm_areas()
>               -> if percpu chunks are exhausted, chunk create may do
>                  internal GFP_KERNEL allocations
>               -> direct reclaim / writeback can issue IO to this queue
>               -> IO waits because the queue is frozen
> 
>   CPU1: blkcg_deactivate_policy()
>     blk_mq_freeze_queue(q)
>     mutex_lock(q->blkcg_mutex)
>       -> waits for CPU0
>     ... unfreeze only happens after q->blkcg_mutex is acquired/released
> 
> So the concern is that the caller deliberately uses GFP_NOIO because it
> may hold a lock which can be acquired after queue freeze, but the percpu
> slow path can temporarily lose that allocation context.
> 
> Pass the caller supplied GFP mask from pcpu_create_chunk() to
> pcpu_get_vm_areas(), and use it for the internal vmalloc metadata and
> KASAN shadow allocations.
> 
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
 
Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis



* [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks
  2026-06-12  2:26 [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
  2026-06-12  2:26 ` [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
@ 2026-06-12  2:26 ` Kaitao Cheng
  2026-06-17  6:29   ` Dennis Zhou
  2026-06-12  2:26 ` [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
  2026-06-17  7:03 ` [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Dennis Zhou
  3 siblings, 1 reply; 10+ messages in thread
From: Kaitao Cheng @ 2026-06-12  2:26 UTC
  To: Andrew Morton, Uladzislau Rezki, Dennis Zhou, Tejun Heo,
	Christoph Lameter
  Cc: Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
passes it down to pcpu_populate_chunk().  pcpu_alloc_pages() already uses
that mask for backing page allocation.

However, the populate slow path still has internal allocations and page
table allocations which can lose the caller's allocation context.  The
temporary pages array is allocated by pcpu_get_pages() with GFP_KERNEL,
and pcpu_map_pages() maps the backing pages through
vmap_pages_range_noflush() using GFP_KERNEL.  The latter can allocate
vmalloc page tables implicitly, so a caller which deliberately uses
GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while populating
a percpu chunk.

This has the same concern as chunk creation: callers such as blk-cgroup
may use GFP_NOIO because they hold locks which can be involved in queue
freeze or IO reclaim dependencies.  If an allocation reaches the percpu
slow path and needs to populate previously unbacked pages, the internal
GFP_KERNEL allocations can defeat that context.

One possible case is blk-cgroup after commit 5d726c4dbeed
("blk-cgroup: fix possible deadlock while configuring policy").
blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:

  CPU0: blkg_conf_prep()
    mutex_lock(q->blkcg_mutex)
    blkg_alloc(..., GFP_NOIO)
      alloc_percpu_gfp(..., GFP_NOIO)
        pcpu_alloc_noprof(..., GFP_NOIO)
          pcpu_populate_chunk(GFP_NOIO)
            pcpu_get_pages()
            pcpu_map_pages()
              -> if the selected percpu chunk has unpopulated pages,
                 chunk population may do internal GFP_KERNEL allocations
              -> direct reclaim / writeback can issue IO to this queue
              -> IO waits because the queue is frozen

  CPU1: blkcg_deactivate_policy()
    blk_mq_freeze_queue(q)
    mutex_lock(q->blkcg_mutex)
      -> waits for CPU0
    ... unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock which can be acquired after queue freeze, but the percpu
slow path can temporarily lose that allocation context.

Pass pcpu_gfp through pcpu_get_pages(), pcpu_map_pages() and
__pcpu_map_pages().  Apply the corresponding memalloc scope around
vmap_pages_range_noflush(), because vmalloc page table allocation does not
pass the GFP mask down explicitly.  Keep the first chunk setup path using
GFP_KERNEL, matching the previous early-init behavior.
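The scope mechanism can be sketched as a userspace model. Everything here
is an assumption made for illustration: the MODEL_* values, the
model_task_flags variable (standing in for per-task flags) and all helper
names are hypothetical. The masking rule in model_effective_gfp() follows
the way the kernel's current_gfp_context() strips IO/FS bits inside a
NOIO or NOFS scope.

```c
#include <assert.h>

typedef unsigned int gfp_t;
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_RECLAIM 0x4u
#define MODEL_GFP_NOIO   (MODEL_GFP_RECLAIM)
#define MODEL_GFP_NOFS   (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical scope bits, standing in for per-task NOIO/NOFS flags. */
#define MODEL_SCOPE_NOIO 0x1u
#define MODEL_SCOPE_NOFS 0x2u

static unsigned int model_task_flags;	/* stands in for the task's flags */

/* Enter the scope implied by @gfp and return the previous flags. */
static inline unsigned int model_apply_gfp_scope(gfp_t gfp)
{
	unsigned int old = model_task_flags;

	if (!(gfp & MODEL_GFP_IO))
		model_task_flags |= MODEL_SCOPE_NOIO;
	else if (!(gfp & MODEL_GFP_FS))
		model_task_flags |= MODEL_SCOPE_NOFS;
	return old;
}

/* Leave the scope (simplified: restores the whole flags word). */
static inline void model_restore_scope(unsigned int old)
{
	model_task_flags = old;
}

/* Mask applied to every nested allocation while a scope is active. */
static inline gfp_t model_effective_gfp(gfp_t gfp)
{
	if (model_task_flags & MODEL_SCOPE_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (model_task_flags & MODEL_SCOPE_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* A nested helper that hard-codes GFP_KERNEL, like an implicit
 * page table allocation that never sees the caller's mask. */
static inline gfp_t model_page_table_alloc_gfp(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/* Wrap the nested allocation in the scope derived from @gfp and
 * report which mask the nested allocation effectively used. */
static inline gfp_t model_map_pages(gfp_t gfp)
{
	unsigned int old = model_apply_gfp_scope(gfp);
	gfp_t used = model_page_table_alloc_gfp();

	model_restore_scope(old);
	return used;
}
```

In this model, model_map_pages(MODEL_GFP_NOIO) makes the nested
GFP_KERNEL allocation behave as GFP_NOIO, and the task flags return to
their previous value afterwards.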

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/percpu-vm.c | 38 ++++++++++++++++++++++++++------------
 mm/percpu.c    |  2 +-
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 69b00741dc68..ccd03cc152d4 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -21,6 +21,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
 
 /**
  * pcpu_get_pages - get temp pages array
+ * @gfp: allocation flags passed to the underlying allocator
  *
  * Returns pointer to array of pointers to struct page which can be indexed
  * with pcpu_page_idx().  Note that there is only one array and accesses
@@ -29,7 +30,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
  * RETURNS:
  * Pointer to temp pages array on success.
  */
-static struct page **pcpu_get_pages(void)
+static struct page **pcpu_get_pages(gfp_t gfp)
 {
 	static struct page **pages;
 	size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
@@ -37,7 +38,7 @@ static struct page **pcpu_get_pages(void)
 	lockdep_assert_held(&pcpu_alloc_mutex);
 
 	if (!pages)
-		pages = pcpu_mem_zalloc(pages_size, GFP_KERNEL);
+		pages = pcpu_mem_zalloc(pages_size, gfp);
 	return pages;
 }
 
@@ -191,10 +192,22 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
 }
 
 static int __pcpu_map_pages(unsigned long addr, struct page **pages,
-			    int nr_pages)
+			    int nr_pages, gfp_t gfp)
 {
-	return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
-			PAGE_KERNEL, pages, PAGE_SHIFT, GFP_KERNEL);
+	unsigned int flags;
+	int ret;
+
+	/*
+	 * The vmalloc page table allocation path does not pass @gfp down
+	 * explicitly.  Apply the corresponding memalloc scope so implicit
+	 * page table allocations preserve NOFS/NOIO constraints.
+	 */
+	flags = memalloc_apply_gfp_scope(gfp);
+	ret = vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
+				       PAGE_KERNEL, pages, PAGE_SHIFT, gfp);
+	memalloc_restore_scope(flags);
+
+	return ret;
 }
 
 /**
@@ -203,6 +216,7 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
  * @pages: pages array containing pages to be mapped
  * @page_start: page index of the first page to map
  * @page_end: page index of the last page to map + 1
+ * @gfp: allocation flags passed to the underlying allocator
  *
  * For each cpu, map pages [@page_start,@page_end) into @chunk.  The
  * caller is responsible for calling pcpu_post_map_flush() after all
@@ -211,8 +225,8 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
  * This function is responsible for setting up whatever is necessary for
  * reverse lookup (addr -> chunk).
  */
-static int pcpu_map_pages(struct pcpu_chunk *chunk,
-			  struct page **pages, int page_start, int page_end)
+static int pcpu_map_pages(struct pcpu_chunk *chunk, struct page **pages,
+			  int page_start, int page_end, gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
 	int i, err;
@@ -220,7 +234,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	for_each_possible_cpu(cpu) {
 		err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
 				       &pages[pcpu_page_idx(cpu, page_start)],
-				       page_end - page_start);
+				       page_end - page_start, gfp);
 		if (err < 0)
 			goto err;
 
@@ -271,21 +285,21 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
  * @chunk.
  *
  * CONTEXT:
- * pcpu_alloc_mutex, does GFP_KERNEL allocation.
+ * pcpu_alloc_mutex, does @gfp allocation.
  */
 static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp)
 {
 	struct page **pages;
 
-	pages = pcpu_get_pages();
+	pages = pcpu_get_pages(gfp);
 	if (!pages)
 		return -ENOMEM;
 
 	if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp))
 		return -ENOMEM;
 
-	if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
+	if (pcpu_map_pages(chunk, pages, page_start, page_end, gfp)) {
 		pcpu_free_pages(chunk, pages, page_start, page_end);
 		return -ENOMEM;
 	}
@@ -319,7 +333,7 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	 * successful population attempt so the temp pages array must
 	 * be available now.
 	 */
-	pages = pcpu_get_pages();
+	pages = pcpu_get_pages(GFP_KERNEL);
 	BUG_ON(!pages);
 
 	/* unmap and free */
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..4d89965cba16 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3256,7 +3256,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 
 		/* pte already populated, the following shouldn't fail */
 		rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
-				      unit_pages);
+				      unit_pages, GFP_KERNEL);
 		if (rc < 0)
 			panic("failed to map percpu area, err=%d\n", rc);
 
-- 
2.43.0




* Re: [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks
  2026-06-12  2:26 ` [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks Kaitao Cheng
@ 2026-06-17  6:29   ` Dennis Zhou
  0 siblings, 0 replies; 10+ messages in thread
From: Dennis Zhou @ 2026-06-17  6:29 UTC
  To: Kaitao Cheng
  Cc: Andrew Morton, Uladzislau Rezki, Tejun Heo, Christoph Lameter,
	Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, Kaitao Cheng

On Fri, Jun 12, 2026 at 10:26:47AM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
> passes it down to pcpu_populate_chunk().  pcpu_alloc_pages() already uses
> that mask for backing page allocation.
> 
> However, the populate slow path still has internal allocations and page
> table allocations which can lose the caller's allocation context.  The
> temporary pages array is allocated by pcpu_get_pages() with GFP_KERNEL,
> and pcpu_map_pages() maps the backing pages through
> vmap_pages_range_noflush() using GFP_KERNEL.  The latter can allocate
> vmalloc page tables implicitly, so a caller which deliberately uses
> GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while populating
> a percpu chunk.
> 
> This has the same concern as chunk creation: callers such as blk-cgroup
> may use GFP_NOIO because they hold locks which can be involved in queue
> freeze or IO reclaim dependencies.  If an allocation reaches the percpu
> slow path and needs to populate previously unbacked pages, the internal
> GFP_KERNEL allocations can defeat that context.
> 
> One possible case is blk-cgroup after commit 5d726c4dbeed
> ("blk-cgroup: fix possible deadlock while configuring policy").
> blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
> q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:
> 
>   CPU0: blkg_conf_prep()
>     mutex_lock(q->blkcg_mutex)
>     blkg_alloc(..., GFP_NOIO)
>       alloc_percpu_gfp(..., GFP_NOIO)
>         pcpu_alloc_noprof(..., GFP_NOIO)
>           pcpu_populate_chunk(GFP_NOIO)
>             pcpu_get_pages()
>             pcpu_map_pages()
>               -> if the selected percpu chunk has unpopulated pages,
>                  chunk population may do internal GFP_KERNEL allocations
>               -> direct reclaim / writeback can issue IO to this queue
>               -> IO waits because the queue is frozen
> 
>   CPU1: blkcg_deactivate_policy()
>     blk_mq_freeze_queue(q)
>     mutex_lock(q->blkcg_mutex)
>       -> waits for CPU0
>     ... unfreeze only happens after q->blkcg_mutex is acquired/released
> 
> So the concern is that the caller deliberately uses GFP_NOIO because it
> may hold a lock which can be acquired after queue freeze, but the percpu
> slow path can temporarily lose that allocation context.
> 

Maybe others have different takes on this, but I don't think this needs
a full duplicate explanation in each patch.

> Pass pcpu_gfp through pcpu_get_pages(), pcpu_map_pages() and
> __pcpu_map_pages().  Apply the corresponding memalloc scope around
> vmap_pages_range_noflush(), because vmalloc page table allocation does not
> pass the GFP mask down explicitly.  Keep the first chunk setup path using
> GFP_KERNEL, matching the previous early-init behavior.
> 
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  mm/percpu-vm.c | 38 ++++++++++++++++++++++++++------------
>  mm/percpu.c    |  2 +-
>  2 files changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 69b00741dc68..ccd03cc152d4 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -21,6 +21,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
>  
>  /**
>   * pcpu_get_pages - get temp pages array
> + * @gfp: allocation flags passed to the underlying allocator
>   *
>   * Returns pointer to array of pointers to struct page which can be indexed
>   * with pcpu_page_idx().  Note that there is only one array and accesses
> @@ -29,7 +30,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
>   * RETURNS:
>   * Pointer to temp pages array on success.
>   */
> -static struct page **pcpu_get_pages(void)
> +static struct page **pcpu_get_pages(gfp_t gfp)
>  {
>  	static struct page **pages;
>  	size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
> @@ -37,7 +38,7 @@ static struct page **pcpu_get_pages(void)
>  	lockdep_assert_held(&pcpu_alloc_mutex);
>  
>  	if (!pages)
> -		pages = pcpu_mem_zalloc(pages_size, GFP_KERNEL);
> +		pages = pcpu_mem_zalloc(pages_size, gfp);
>  	return pages;
>  }
>  
> @@ -191,10 +192,22 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
>  }
>  
>  static int __pcpu_map_pages(unsigned long addr, struct page **pages,
> -			    int nr_pages)
> +			    int nr_pages, gfp_t gfp)
>  {
> -	return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
> -			PAGE_KERNEL, pages, PAGE_SHIFT, GFP_KERNEL);
> +	unsigned int flags;
> +	int ret;
> +
> +	/*
> +	 * The vmalloc page table allocation path does not pass @gfp down
> +	 * explicitly.  Apply the corresponding memalloc scope so implicit
> +	 * page table allocations preserve NOFS/NOIO constraints.
> +	 */
> +	flags = memalloc_apply_gfp_scope(gfp);
> +	ret = vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
> +				       PAGE_KERNEL, pages, PAGE_SHIFT, gfp);
> +	memalloc_restore_scope(flags);
> +
> +	return ret;
>  }
>  
>  /**
> @@ -203,6 +216,7 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
>   * @pages: pages array containing pages to be mapped
>   * @page_start: page index of the first page to map
>   * @page_end: page index of the last page to map + 1
> + * @gfp: allocation flags passed to the underlying allocator
>   *
>   * For each cpu, map pages [@page_start,@page_end) into @chunk.  The
>   * caller is responsible for calling pcpu_post_map_flush() after all
> @@ -211,8 +225,8 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
>   * This function is responsible for setting up whatever is necessary for
>   * reverse lookup (addr -> chunk).
>   */
> -static int pcpu_map_pages(struct pcpu_chunk *chunk,
> -			  struct page **pages, int page_start, int page_end)
> +static int pcpu_map_pages(struct pcpu_chunk *chunk, struct page **pages,
> +			  int page_start, int page_end, gfp_t gfp)
>  {
>  	unsigned int cpu, tcpu;
>  	int i, err;
> @@ -220,7 +234,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
>  	for_each_possible_cpu(cpu) {
>  		err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
>  				       &pages[pcpu_page_idx(cpu, page_start)],
> -				       page_end - page_start);
> +				       page_end - page_start, gfp);
>  		if (err < 0)
>  			goto err;
>  
> @@ -271,21 +285,21 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
>   * @chunk.
>   *
>   * CONTEXT:
> - * pcpu_alloc_mutex, does GFP_KERNEL allocation.
> + * pcpu_alloc_mutex, does @gfp allocation.
>   */
>  static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
>  			       int page_start, int page_end, gfp_t gfp)
>  {
>  	struct page **pages;
>  
> -	pages = pcpu_get_pages();
> +	pages = pcpu_get_pages(gfp);
>  	if (!pages)
>  		return -ENOMEM;
>  
>  	if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp))
>  		return -ENOMEM;
>  
> -	if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
> +	if (pcpu_map_pages(chunk, pages, page_start, page_end, gfp)) {
>  		pcpu_free_pages(chunk, pages, page_start, page_end);
>  		return -ENOMEM;
>  	}
> @@ -319,7 +333,7 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
>  	 * successful population attempt so the temp pages array must
>  	 * be available now.
>  	 */
> -	pages = pcpu_get_pages();
> +	pages = pcpu_get_pages(GFP_KERNEL);
>  	BUG_ON(!pages);
>  

nit: it's a little misleading to pass GFP_KERNEL here because this is
the deallocation path and we expect the pages array to be already
allocated and cached in the static variable.

A terse option would be to pass 0 and only allocate the pages array
when gfp != 0.

A more verbose option would be to introduce pcpu_get_pages_cached(),
which just returns that static variable.
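For illustration only, the second option might look roughly like the
following userspace model. This is not code from the thread: the model_*
names, the array size and the use of calloc() in place of the percpu
memory helper are all assumptions, and the gfp argument is ignored here.

```c
#include <assert.h>
#include <stdlib.h>

struct page;				/* opaque in this model */
typedef unsigned int gfp_t;

static struct page **model_pages;	/* the single cached array */
static const size_t model_nr_pages = 16;	/* hypothetical size */

/* Lookup only, never allocates; intended for teardown paths. */
static inline struct page **model_get_pages_cached(void)
{
	return model_pages;
}

/* Allocate the array on first use, then keep returning it. */
static inline struct page **model_get_pages(gfp_t gfp)
{
	(void)gfp;	/* a real allocator would honor this mask */
	if (!model_pages)
		model_pages = calloc(model_nr_pages, sizeof(model_pages[0]));
	return model_pages;
}
```

With this split the depopulate path would no longer need to name any GFP
mask, since it would only read the cached pointer.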

>  	/* unmap and free */
> diff --git a/mm/percpu.c b/mm/percpu.c
> index b0676b8054ed..4d89965cba16 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -3256,7 +3256,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
>  
>  		/* pte already populated, the following shouldn't fail */
>  		rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
> -				      unit_pages);
> +				      unit_pages, GFP_KERNEL);
>  		if (rc < 0)
>  			panic("failed to map percpu area, err=%d\n", rc);
>  
> -- 
> 2.43.0
> 

I think this is correct regardless of the nit.

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis



* [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
  2026-06-12  2:26 [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
  2026-06-12  2:26 ` [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
  2026-06-12  2:26 ` [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks Kaitao Cheng
@ 2026-06-12  2:26 ` Kaitao Cheng
  2026-06-17  6:53   ` Dennis Zhou
  2026-06-17  7:03 ` [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Dennis Zhou
  3 siblings, 1 reply; 10+ messages in thread
From: Kaitao Cheng @ 2026-06-12  2:26 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Dennis Zhou, Tejun Heo,
	Christoph Lameter
  Cc: Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable
allocations atomic") allows sleepable GFP_NOIO and GFP_NOFS percpu
allocations to take pcpu_alloc_mutex.  This avoids premature allocation
failures, but it also makes the mutex visible to callers from constrained
IO/FS contexts.

Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
pcpu_alloc_mutex. Since the internal allocation is not constrained by
NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
creating a dependency like: pcpu_alloc_mutex -> fs_reclaim -> FS lock

At the same time, Thread B may already hold an FS lock and then call
pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
pcpu_alloc_mutex and block, creating the reverse dependency:
FS lock -> pcpu_alloc_mutex

This can still form a potential deadlock cycle.

Avoid the dependency by restricting percpu backing allocations to GFP_NOIO.
The public allocation still uses the caller's GFP context to decide whether
it may block, but the internal memory allocations performed while
pcpu_alloc_mutex is held cannot recurse into IO or FS reclaim.

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/percpu.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 4d89965cba16..47824061a701 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1726,9 +1726,9 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
  * @gfp: allocation flags
  *
  * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
- * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
- * then no warning will be triggered on invalid or failed allocation
- * requests.
+ * allow blocking, the allocation is atomic. If @gfp has __GFP_NOFAIL, backing
+ * allocation failures are retried. If @gfp has __GFP_NOWARN then no warning
+ * will be triggered on invalid or failed allocation requests.
  *
  * RETURNS:
  * Percpu pointer to the allocated area on success, NULL on failure.
@@ -1749,8 +1749,14 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 	size_t bits, bit_align;
 
 	gfp = current_gfp_context(gfp);
-	/* whitelisted flags that can be passed to the backing allocators */
-	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+	/*
+	 * Allowlisted flags that can be passed to the backing allocators.
+	 * Backing allocations under pcpu_alloc_mutex must not recurse into
+	 * IO/FS reclaim.  Otherwise a GFP_KERNEL caller holding the mutex can
+	 * block on reclaim while a GFP_NOIO/NOFS caller holding an IO/FS lock
+	 * waits for the same mutex.
+	 */
+	pcpu_gfp = gfp & (GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN | __GFP_NOFAIL);
 	is_atomic = !gfpflags_allow_blocking(gfp);
 	do_warn = !(gfp & __GFP_NOWARN);
 
-- 
2.43.0




* Re: [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
  2026-06-12  2:26 ` [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
@ 2026-06-17  6:53   ` Dennis Zhou
  2026-06-17  8:56     ` Kaitao Cheng
  0 siblings, 1 reply; 10+ messages in thread
From: Dennis Zhou @ 2026-06-17  6:53 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Andrew Morton, Uladzislau Rezki, Tejun Heo, Christoph Lameter,
	Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, Kaitao Cheng

On Fri, Jun 12, 2026 at 10:26:48AM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable
> allocations atomic") allows sleepable GFP_NOIO and GFP_NOFS percpu
> allocations to take pcpu_alloc_mutex.  This avoids premature allocation
> failures, but it also makes the mutex visible to callers from constrained
> IO/FS contexts.
> 
> Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
> pcpu_alloc_mutex. Since the internal allocation is not constrained by
> NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
> creating a dependency like: pcpu_alloc_mutex -> fs_reclaim -> FS lock
> 
> At the same time, Thread B may already hold an FS lock and then call
> pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
> pcpu_alloc_mutex and block, creating the reverse dependency:
> FS lock -> pcpu_alloc_mutex
> 
> This can still form a potential deadlock cycle.
> 
> Avoid the dependency by restricting percpu backing allocations to GFP_NOIO.
> The public allocation still uses the caller's GFP context to decide whether
> it may block, but the internal memory allocations performed while
> pcpu_alloc_mutex is held cannot recurse into IO or FS reclaim.
> 
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  mm/percpu.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 4d89965cba16..47824061a701 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1726,9 +1726,9 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>   * @gfp: allocation flags
>   *
>   * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
> - * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
> - * then no warning will be triggered on invalid or failed allocation
> - * requests.
> + * allow blocking, the allocation is atomic. If @gfp has __GFP_NOFAIL, backing
> + * allocation failures are retried. If @gfp has __GFP_NOWARN then no warning
> + * will be triggered on invalid or failed allocation requests.
>   *
>   * RETURNS:
>   * Percpu pointer to the allocated area on success, NULL on failure.
> @@ -1749,8 +1749,14 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>  	size_t bits, bit_align;
>  
>  	gfp = current_gfp_context(gfp);
> -	/* whitelisted flags that can be passed to the backing allocators */
> -	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
> +	/*
> +	 * Allowlisted flags that can be passed to the backing allocators.
> +	 * Backing allocations under pcpu_alloc_mutex must not recurse into
> +	 * IO/FS reclaim.  Otherwise a GFP_KERNEL caller holding the mutex can
> +	 * block on reclaim while a GFP_NOIO/NOFS caller holding an IO/FS lock
> +	 * waits for the same mutex.
> +	 */
> +	pcpu_gfp = gfp & (GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN | __GFP_NOFAIL);
>  	is_atomic = !gfpflags_allow_blocking(gfp);
>  	do_warn = !(gfp & __GFP_NOWARN);
>  

I think GFP_KERNEL -> GFP_NOIO makes sense. It breaks the cycle.
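
A minimal userspace sketch of why the mask change breaks the cycle. The bit
positions and the M_ and model_ names are hypothetical; only the composition
of the masks (GFP_KERNEL = reclaim + IO + FS, GFP_NOFS = reclaim + IO,
GFP_NOIO = reclaim only) follows the kernel definitions. __GFP_NOFAIL is left
out because it does not affect the IO/FS bits.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical bit positions, chosen only for this model. */
#define M_GFP_IO             0x01u
#define M_GFP_FS             0x02u
#define M_GFP_DIRECT_RECLAIM 0x04u
#define M_GFP_KSWAPD_RECLAIM 0x08u
#define M_GFP_NORETRY        0x10u
#define M_GFP_NOWARN         0x20u

#define M_GFP_RECLAIM (M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)
#define M_GFP_NOIO    (M_GFP_RECLAIM)
#define M_GFP_NOFS    (M_GFP_RECLAIM | M_GFP_IO)
#define M_GFP_KERNEL  (M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)

/* Compile-time check: the NOIO mask carries neither the IO nor the FS bit. */
static_assert((M_GFP_NOIO & (M_GFP_IO | M_GFP_FS)) == 0,
	      "GFP_NOIO must not allow IO or FS reclaim");

/* Models the new pcpu_gfp derivation: keep only allowlisted bits. */
static gfp_t model_backing_gfp(gfp_t gfp)
{
	return gfp & (M_GFP_NOIO | M_GFP_NORETRY | M_GFP_NOWARN);
}

/* True if an allocation with @gfp may recurse into IO or FS reclaim. */
static bool model_may_enter_fs_or_io_reclaim(gfp_t gfp)
{
	return (gfp & (M_GFP_IO | M_GFP_FS)) != 0;
}
```

In this model a GFP_KERNEL caller and a GFP_NOFS caller both end up with a
GFP_NOIO backing mask, so neither can enter FS or IO reclaim while holding
pcpu_alloc_mutex.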

For __GFP_NOFAIL, my concern is that a chunk can be quite large and
might need numerous pages. If we allow __GFP_NOFAIL, then we could
potentially churn and stall out other allocations for quite some time
while GFP_NOIO tries to reclaim without access to fs or io paths.

Thanks,
Dennis



* Re: [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
  2026-06-17  6:53   ` Dennis Zhou
@ 2026-06-17  8:56     ` Kaitao Cheng
  2026-06-17 13:16       ` Michal Hocko
  0 siblings, 1 reply; 10+ messages in thread
From: Kaitao Cheng @ 2026-06-17  8:56 UTC (permalink / raw)
  To: Dennis Zhou
  Cc: Andrew Morton, Uladzislau Rezki, Tejun Heo, Christoph Lameter,
	Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, Kaitao Cheng

On 2026/6/17 14:53, Dennis Zhou wrote:
> On Fri, Jun 12, 2026 at 10:26:48AM +0800, Kaitao Cheng wrote:
>> From: Kaitao Cheng <chengkaitao@kylinos.cn>
>>
>> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable
>> allocations atomic") allows sleepable GFP_NOIO and GFP_NOFS percpu
>> allocations to take pcpu_alloc_mutex.  This avoids premature allocation
>> failures, but it also makes the mutex visible to callers from constrained
>> IO/FS contexts.
>>
>> Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
>> pcpu_alloc_mutex. Since the internal allocation is not constrained by
>> NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
>> creating a dependency like: pcpu_alloc_mutex -> fs_reclaim -> FS lock
>>
>> At the same time, Thread B may already hold an FS lock and then call
>> pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
>> pcpu_alloc_mutex and block, creating the reverse dependency:
>> FS lock -> pcpu_alloc_mutex
>>
>> This can still form a potential deadlock cycle.
>>
>> Avoid the dependency by restricting percpu backing allocations to GFP_NOIO.
>> The public allocation still uses the caller's GFP context to decide whether
>> it may block, but the internal memory allocations performed while
>> pcpu_alloc_mutex is held cannot recurse into IO or FS reclaim.
>>
>> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
>> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
>> ---
>>  mm/percpu.c | 16 +++++++++++-----
>>  1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/percpu.c b/mm/percpu.c
>> index 4d89965cba16..47824061a701 100644
>> --- a/mm/percpu.c
>> +++ b/mm/percpu.c
>> @@ -1726,9 +1726,9 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>>   * @gfp: allocation flags
>>   *
>>   * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
>> - * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
>> - * then no warning will be triggered on invalid or failed allocation
>> - * requests.
>> + * allow blocking, the allocation is atomic. If @gfp has __GFP_NOFAIL, backing
>> + * allocation failures are retried. If @gfp has __GFP_NOWARN then no warning
>> + * will be triggered on invalid or failed allocation requests.
>>   *
>>   * RETURNS:
>>   * Percpu pointer to the allocated area on success, NULL on failure.
>> @@ -1749,8 +1749,14 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>>  	size_t bits, bit_align;
>>  
>>  	gfp = current_gfp_context(gfp);
>> -	/* whitelisted flags that can be passed to the backing allocators */
>> -	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
>> +	/*
>> +	 * Allowlisted flags that can be passed to the backing allocators.
>> +	 * Backing allocations under pcpu_alloc_mutex must not recurse into
>> +	 * IO/FS reclaim.  Otherwise a GFP_KERNEL caller holding the mutex can
>> +	 * block on reclaim while a GFP_NOIO/NOFS caller holding an IO/FS lock
>> +	 * waits for the same mutex.
>> +	 */
>> +	pcpu_gfp = gfp & (GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN | __GFP_NOFAIL);
>>  	is_atomic = !gfpflags_allow_blocking(gfp);
>>  	do_warn = !(gfp & __GFP_NOWARN);
>>  
> 
> I think GFP_KERNEL -> GFP_NOIO makes sense. It breaks the cycle.
> 
> For __GFP_NOFAIL, I think my concern is that a chunk can be quite large
> and might need numerous pages. If we allow __GFP_NOFAIL, then we could
> potentially churn and stall out other allocations for quite some time
> while GFP_NOIO tries to reclaim without access to fs or io paths.

__GFP_NOFAIL is actually unnecessary here. The main reason is that,
for now, I have not found any in-kernel callers that pass __GFP_NOFAIL
to pcpu_alloc_noprof() or its wrapper functions. The reason I added
__GFP_NOFAIL was to address the issue reported by sashiko, and I
provided a detailed clarification in the link below.

https://lore.kernel.org/all/3de3a89b-92f0-4cd2-9f41-8e853eae4e78@linux.dev/

We should probably revert the current patch back to the v2 version,
and then add some comments explaining why pcpu_alloc_noprof() must
not be passed the __GFP_NOFAIL flag, as suggested by Andrew Morton.

-- 
Thanks
Kaitao Cheng




* Re: [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
  2026-06-17  8:56     ` Kaitao Cheng
@ 2026-06-17 13:16       ` Michal Hocko
  0 siblings, 0 replies; 10+ messages in thread
From: Michal Hocko @ 2026-06-17 13:16 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Dennis Zhou, Andrew Morton, Uladzislau Rezki, Tejun Heo,
	Christoph Lameter, Vlastimil Babka, muchun.song, linux-mm,
	linux-kernel, Kaitao Cheng

On Wed 17-06-26 16:56:56, Kaitao Cheng wrote:
[...]
> __GFP_NOFAIL is actually unnecessary here. The main reason is that,
> for now, I have not found any in-kernel callers that pass __GFP_NOFAIL
> to pcpu_alloc_noprof() or its wrapper functions. The reason I added
> __GFP_NOFAIL was to address the issue reported by sashiko, and I
> provided a detailed clarification in the link below.
> 
> https://lore.kernel.org/all/3de3a89b-92f0-4cd2-9f41-8e853eae4e78@linux.dev/
> 
> We should probably revert the current patch back to the v2 version,
> and then add some comments explaining why pcpu_alloc_noprof() must
> not be passed the __GFP_NOFAIL flag, as suggested by Andrew Morton.

Do not add support for nofail semantics until there is a clear demand for
it. Supporting these semantics is a big commitment and it shouldn't be
done without a very good use case in mind.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
  2026-06-12  2:26 [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
                   ` (2 preceding siblings ...)
  2026-06-12  2:26 ` [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
@ 2026-06-17  7:03 ` Dennis Zhou
  3 siblings, 0 replies; 10+ messages in thread
From: Dennis Zhou @ 2026-06-17  7:03 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Andrew Morton, Uladzislau Rezki, Tejun Heo, Christoph Lameter,
	Vlastimil Babka, Michal Hocko, muchun.song, linux-mm,
	linux-kernel, chengkaitao

Hello,

On Fri, Jun 12, 2026 at 10:26:45AM +0800, Kaitao Cheng wrote:
> From: chengkaitao <chengkaitao@kylinos.cn>
> 
> Hi all,
> 
> After v1 was posted, there were many different opinions, mainly around
> optimizing pcpu_alloc_mutex. This v3 is intended to describe the existing
> problems more clearly and provide a conventional fix approach.
> 
> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
> atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
> pcpu_alloc_mutex and the chunk creation slow path. This restored the
> allocation capability that was lost when those constrained allocations
> were treated as atomic, but it also makes the percpu slow path visible
> to callers from constrained reclaim contexts.
> 
> There are two related problems.
> 
> First, the create and populate slow paths do not fully preserve the
> caller's allocation constraints. pcpu_alloc_noprof() derives pcpu_gfp from
> the caller supplied GFP mask and passes it down to the percpu backing page
> allocator. However, chunk creation calls pcpu_get_vm_areas(), and chunk
> population can allocate temporary metadata or vmalloc page tables while
> mapping backing pages. Those internal allocations can still use GFP_KERNEL,
> so a caller using GFP_NOFS or GFP_NOIO can enter unconstrained FS or IO
> reclaim while holding pcpu_alloc_mutex.
> 
> One possible case is blk-cgroup after commit 5d726c4dbeed
> ("blk-cgroup: fix possible deadlock while configuring policy").
> blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
> q->blkcg_mutex, and blkg_alloc() uses GFP_NOIO because queue freeze and IO
> reclaim dependencies can otherwise deadlock. If the percpu slow path loses
> that GFP_NOIO context, direct reclaim or writeback can issue IO to a frozen
> queue while q->blkcg_mutex is held.
> 
> Second, allowing sleepable GFP_NOFS/GFP_NOIO allocations to take
> pcpu_alloc_mutex means that unconstrained backing allocations made under
> the mutex can create an FS/IO reclaim dependency against a constrained
> caller which already holds an FS or IO lock and then waits for
> pcpu_alloc_mutex.
> 
> This series fixes those issues in three steps:
> 
>   - pass the caller supplied GFP mask into pcpu_get_vm_areas() and use it
>     for vmalloc metadata and KASAN shadow allocations;
>   - pass the GFP mask through the chunk population path, including the
>     temporary pages array and vmalloc page table allocation scope;
>   - restrict percpu backing allocations performed while holding
>     pcpu_alloc_mutex to GFP_NOIO, so they cannot recurse into IO or FS
>     reclaim.
> 
> This keeps sleepable GFP_NOFS/GFP_NOIO percpu allocations working, while
> avoiding the reclaim recursion risks introduced by making those allocations
> eligible for the mutex-protected slow path.
> 
> Changes in v3:
> Allow @gfp to pass __GFP_NOFAIL through. (Andrew Morton)
> 
> Changes in v2:
>   - split the previous first patch into vmalloc-area creation and chunk
>     population changes; (Pedro Falcato)
>   - pass the GFP mask explicitly to pcpu_get_vm_areas(); (Pedro Falcato)
>   - apply the corresponding memalloc scope around vmalloc page table
>     allocation during chunk population;
>   - replace the reclaim recursion avoidance with a GFP_NOIO backing
>     allocation mask instead of only rejecting nested reclaim.
>     (Michal Hocko)
> 
> Link to v2:
> https://lore.kernel.org/all/20260604113101.89510-1-kaitao.cheng@linux.dev/
> 
> Link to v1:
> https://lore.kernel.org/all/20260528132917.81123-1-kaitao.cheng@linux.dev/
> 
> Kaitao Cheng (3):
>   mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
>   mm/percpu: honor GFP constraints when populating chunks
>   mm/percpu: Avoid IO/FS reclaim in backing allocations
> 
>  include/linux/vmalloc.h |  4 ++--
>  mm/percpu-vm.c          | 40 +++++++++++++++++++++++++++-------------
>  mm/percpu.c             | 18 ++++++++++++------
>  mm/vmalloc.c            | 23 ++++++++++++-----------
>  4 files changed, 53 insertions(+), 32 deletions(-)
> 
> -- 
> 2.43.0
> 

Thanks for taking on this work. I definitely missed this earlier.

I acked patches 1 and 2. I think 3 is good but the __GFP_NOFAIL part
warrants more discussion. My take back then was that a single percpu
allocation can trigger a large number of backing page allocations. As a
result, while the caller may not be asking for a lot of memory, we may
need substantially more to back that allocation. Given that discrepancy,
__GFP_NOFAIL only selects mutex_lock() over mutex_lock_killable().
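
As a rough model of that lock choice, under my reading of the remark above:
an allocation that may not block skips the mutex, a __GFP_NOFAIL one takes it
uninterruptibly, and anything else takes it killably. The enum, the M_ bit
values and model_lock_choice() are hypothetical and exist only for this
sketch; it does not take any real lock.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Hypothetical bit positions, chosen only for this model. */
#define M_GFP_DIRECT_RECLAIM 0x04u
#define M_GFP_NOFAIL         0x40u

/* Which primitive the allocation path would use on pcpu_alloc_mutex. */
enum model_lock_kind {
	MODEL_NO_MUTEX,			/* atomic: may not sleep on the mutex */
	MODEL_MUTEX_LOCK,		/* waits until the mutex is acquired */
	MODEL_MUTEX_LOCK_KILLABLE,	/* a fatal signal aborts the wait */
};

static enum model_lock_kind model_lock_choice(gfp_t gfp)
{
	/* Same test as is_atomic = !gfpflags_allow_blocking(gfp). */
	if (!(gfp & M_GFP_DIRECT_RECLAIM))
		return MODEL_NO_MUTEX;

	/* Assumption: NOFAIL only changes how the mutex is taken. */
	if (gfp & M_GFP_NOFAIL)
		return MODEL_MUTEX_LOCK;

	return MODEL_MUTEX_LOCK_KILLABLE;
}
```

Nothing in this model makes the backing page allocations themselves retry,
which is the distinction being drawn here.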

Thanks,
Dennis



Thread overview: 10+ messages
2026-06-12  2:26 [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
2026-06-12  2:26 ` [PATCH v3 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
2026-06-17  6:02   ` Dennis Zhou
2026-06-12  2:26 ` [PATCH v3 2/3] mm/percpu: honor GFP constraints when populating chunks Kaitao Cheng
2026-06-17  6:29   ` Dennis Zhou
2026-06-12  2:26 ` [PATCH v3 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
2026-06-17  6:53   ` Dennis Zhou
2026-06-17  8:56     ` Kaitao Cheng
2026-06-17 13:16       ` Michal Hocko
2026-06-17  7:03 ` [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Dennis Zhou
