[PATCH v2 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion

Linux-mm Archive on lore.kernel.org

* [PATCH v2 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion
@ 2026-06-04 11:30 Kaitao Cheng
  2026-06-04 11:30 ` [PATCH v2 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Kaitao Cheng @ 2026-06-04 11:30 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko
  Cc: muchun.song, linux-mm, linux-kernel, chengkaitao

From: chengkaitao <chengkaitao@kylinos.cn>

Hi all,

v1 drew a range of opinions, mostly about optimizing pcpu_alloc_mutex.
This v2 describes the existing problems more clearly and takes a
conventional approach to fixing them.

Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
pcpu_alloc_mutex and the chunk creation slow path. This restored the
allocation capability that was lost when those constrained allocations
were treated as atomic, but it also makes the percpu slow path visible
to callers from constrained reclaim contexts.

There are two related problems.

First, the create and populate slow paths do not fully preserve the
caller's allocation constraints. pcpu_alloc_noprof() derives pcpu_gfp from
the caller supplied GFP mask and passes it down to the percpu backing page
allocator. However, chunk creation calls pcpu_get_vm_areas(), and chunk
population can allocate temporary metadata or vmalloc page tables while
mapping backing pages. Those internal allocations can still use GFP_KERNEL,
so a caller using GFP_NOFS or GFP_NOIO can enter unconstrained FS or IO
reclaim while holding pcpu_alloc_mutex.

One possible case is blk-cgroup after commit 5d726c4dbeed
("blk-cgroup: fix possible deadlock while configuring policy").
blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
q->blkcg_mutex, and blkg_alloc() uses GFP_NOIO because queue freeze and IO
reclaim dependencies can otherwise deadlock. If the percpu slow path loses
that GFP_NOIO context, direct reclaim or writeback can issue IO to a frozen
queue while q->blkcg_mutex is held.

Second, now that sleepable GFP_NOFS/GFP_NOIO allocations can take
pcpu_alloc_mutex, an unconstrained backing allocation made under the mutex
can enter FS or IO reclaim and wait on an FS or IO lock.  If the holder of
that lock is a constrained caller which is itself waiting for
pcpu_alloc_mutex, the two form a reclaim dependency cycle.

This series fixes those issues in three steps:

  - pass the caller supplied GFP mask into pcpu_get_vm_areas() and use it
    for vmalloc metadata and KASAN shadow allocations;
  - pass the GFP mask through the chunk population path, including the
    temporary pages array and vmalloc page table allocation scope;
  - restrict percpu backing allocations performed while holding
    pcpu_alloc_mutex to GFP_NOIO, so they cannot recurse into IO or FS
    reclaim.

This keeps sleepable GFP_NOFS/GFP_NOIO percpu allocations working, while
avoiding the reclaim recursion risks introduced by making those allocations
eligible for the mutex-protected slow path.

Changes in v2:
  - split the previous first patch into vmalloc-area creation and chunk
    population changes; (Pedro Falcato)
  - pass the GFP mask explicitly to pcpu_get_vm_areas(); (Pedro Falcato)
  - apply the corresponding memalloc scope around vmalloc page table
    allocation during chunk population;
  - replace the reclaim recursion avoidance with a GFP_NOIO backing
    allocation mask instead of only rejecting nested reclaim.
    (Michal Hocko)

Link to v1:
https://lore.kernel.org/all/20260528132917.81123-1-kaitao.cheng@linux.dev/

Kaitao Cheng (3):
  mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
  mm/percpu: honor GFP constraints when populating chunks
  mm/percpu: Avoid IO/FS reclaim in backing allocations

 include/linux/vmalloc.h |  4 ++--
 mm/percpu-vm.c          | 40 +++++++++++++++++++++++++++-------------
 mm/percpu.c             | 17 +++++++++++------
 mm/vmalloc.c            | 23 ++++++++++++-----------
 4 files changed, 52 insertions(+), 32 deletions(-)

-- 
2.43.0


* [PATCH v2 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
  2026-06-04 11:30 [PATCH v2 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
@ 2026-06-04 11:30 ` Kaitao Cheng
  2026-06-04 16:49   ` Uladzislau Rezki
  2026-06-04 11:31 ` [PATCH v2 2/3] mm/percpu: honor GFP constraints when populating chunks Kaitao Cheng
  2026-06-04 11:31 ` [PATCH v2 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
  2 siblings, 1 reply; 7+ messages in thread
From: Kaitao Cheng @ 2026-06-04 11:30 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko
  Cc: muchun.song, linux-mm, linux-kernel, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask
and passes it down to the backing percpu allocator. However, when the
percpu vmalloc allocator has to create a new chunk, pcpu_create_chunk()
calls pcpu_get_vm_areas() to allocate the corresponding vmalloc areas.

pcpu_get_vm_areas() currently performs its internal allocations with
GFP_KERNEL, including vmap area metadata, vm_struct metadata and KASAN
vmalloc shadow population. This means that a caller which deliberately
uses GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while creating
the vmalloc areas for a new percpu chunk.

One possible case is blk-cgroup after commit 5d726c4dbeed
("blk-cgroup: fix possible deadlock while configuring policy").
blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:

  CPU0: blkg_conf_prep()
    mutex_lock(q->blkcg_mutex)
    blkg_alloc(..., GFP_NOIO)
      alloc_percpu_gfp(..., GFP_NOIO)
        pcpu_alloc_noprof(..., GFP_NOIO)
          pcpu_create_chunk(GFP_NOIO)
            pcpu_get_vm_areas()
              -> if percpu chunks are exhausted, chunk create may do
                 internal GFP_KERNEL allocations
              -> direct reclaim / writeback can issue IO to this queue
              -> IO waits because the queue is frozen

  CPU1: blkcg_deactivate_policy()
    blk_mq_freeze_queue(q)
    mutex_lock(q->blkcg_mutex)
      -> waits for CPU0
    ... unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock that another task can try to acquire after freezing the
queue, but the percpu slow path can temporarily lose that allocation
context.

Pass the caller supplied GFP mask from pcpu_create_chunk() to
pcpu_get_vm_areas(), and use it for the internal vmalloc metadata and
KASAN shadow allocations.
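
As a rough illustration of the effect (a userspace model, not kernel
code), the sketch below records the flags used by each internal
allocation before and after the change.  The bit values and every
model_* / MODEL_* name are hypothetical stand-ins; only the composition
of GFP_KERNEL (reclaim + IO + FS) and GFP_NOIO (reclaim only) mirrors
the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; the kernel's real values differ. */
typedef unsigned int model_gfp_t;
#define MODEL_RECLAIM    0x1u
#define MODEL_IO         0x2u
#define MODEL_FS         0x4u
#define MODEL_GFP_NOIO   (MODEL_RECLAIM)
#define MODEL_GFP_KERNEL (MODEL_RECLAIM | MODEL_IO | MODEL_FS)

#define MODEL_MAX_ALLOCS 8

/* Records the flags of every internal allocation made by one call. */
struct model_alloc_log {
	model_gfp_t used[MODEL_MAX_ALLOCS];
	int nr;
};

static void model_alloc(struct model_alloc_log *log, model_gfp_t gfp)
{
	assert(log->nr < MODEL_MAX_ALLOCS);
	log->used[log->nr++] = gfp;
}

/* Before the patch: internal allocations ignore the caller's context. */
static void model_get_vm_areas_old(struct model_alloc_log *log)
{
	model_alloc(log, MODEL_GFP_KERNEL);	/* vmap area metadata */
	model_alloc(log, MODEL_GFP_KERNEL);	/* vm_struct metadata */
	model_alloc(log, MODEL_GFP_KERNEL);	/* KASAN shadow */
}

/* After the patch: the caller's mask reaches every internal allocation. */
static void model_get_vm_areas_new(struct model_alloc_log *log,
				   model_gfp_t gfp)
{
	model_alloc(log, gfp);	/* vmap area metadata */
	model_alloc(log, gfp);	/* vm_struct metadata */
	model_alloc(log, gfp);	/* KASAN shadow */
}

/* True if any recorded allocation was allowed to start IO or FS reclaim. */
static bool model_may_enter_io_or_fs(const struct model_alloc_log *log)
{
	for (int i = 0; i < log->nr; i++)
		if (log->used[i] & (MODEL_IO | MODEL_FS))
			return true;
	return false;
}
```

With a MODEL_GFP_NOIO caller, the old variant still logs allocations that
may start IO or FS reclaim, while the new variant does not.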

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 include/linux/vmalloc.h |  4 ++--
 mm/percpu-vm.c          |  2 +-
 mm/vmalloc.c            | 23 ++++++++++++-----------
 3 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b02c0c6b371..9601e06624c8 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -308,14 +308,14 @@ static inline void set_vm_flush_reset_perms(void *addr) {}
 #if defined(CONFIG_MMU) && defined(CONFIG_SMP)
 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     const size_t *sizes, int nr_vms,
-				     size_t align);
+				     size_t align, gfp_t gfp);
 
 void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
 # else
 static inline struct vm_struct **
 pcpu_get_vm_areas(const unsigned long *offsets,
 		const size_t *sizes, int nr_vms,
-		size_t align)
+		size_t align, gfp_t gfp)
 {
 	return NULL;
 }
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 4f5937090590..69b00741dc68 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -340,7 +340,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 		return NULL;
 
 	vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
-				pcpu_nr_groups, pcpu_atom_size);
+				pcpu_nr_groups, pcpu_atom_size, gfp);
 	if (!vms) {
 		pcpu_free_chunk(chunk);
 		return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1afca3568b9b..08f468135e4d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4946,16 +4946,17 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
  * @sizes: array containing size of each area
  * @nr_vms: the number of areas to allocate
  * @align: alignment, all entries in @offsets and @sizes must be aligned to this
+ * @gfp: allocation flags passed to the underlying memory allocator
  *
  * Returns: kmalloc'd vm_struct pointer array pointing to allocated
  *	    vm_structs on success, %NULL on failure
  *
  * Percpu allocator wants to use congruent vm areas so that it can
  * maintain the offsets among percpu areas.  This function allocates
- * congruent vmalloc areas for it with GFP_KERNEL.  These areas tend to
- * be scattered pretty far, distance between two areas easily going up
- * to gigabytes.  To avoid interacting with regular vmallocs, these
- * areas are allocated from top.
+ * congruent vmalloc areas for it. These areas tend to be scattered
+ * pretty far, distance between two areas easily going up to gigabytes.
+ * To avoid interacting with regular vmallocs, these areas are allocated
+ * from top.
  *
  * Despite its complicated look, this allocator is rather simple. It
  * does everything top-down and scans free blocks from the end looking
@@ -4966,7 +4967,7 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
  */
 struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				     const size_t *sizes, int nr_vms,
-				     size_t align)
+				     size_t align, gfp_t gfp)
 {
 	const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
 	const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
@@ -5004,14 +5005,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		return NULL;
 	}
 
-	vms = kzalloc_objs(vms[0], nr_vms);
-	vas = kzalloc_objs(vas[0], nr_vms);
+	vms = kzalloc_objs(vms[0], nr_vms, gfp);
+	vas = kzalloc_objs(vas[0], nr_vms, gfp);
 	if (!vas || !vms)
 		goto err_free2;
 
 	for (area = 0; area < nr_vms; area++) {
-		vas[area] = kmem_cache_zalloc(vmap_area_cachep, GFP_KERNEL);
-		vms[area] = kzalloc_obj(struct vm_struct);
+		vas[area] = kmem_cache_zalloc(vmap_area_cachep, gfp);
+		vms[area] = kzalloc_obj(struct vm_struct, gfp);
 		if (!vas[area] || !vms[area])
 			goto err_free;
 	}
@@ -5101,7 +5102,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 	/* populate the kasan shadow space */
 	for (area = 0; area < nr_vms; area++) {
-		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], GFP_KERNEL))
+		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], gfp))
 			goto err_free_shadow;
 	}
 
@@ -5158,7 +5159,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 				continue;
 
 			vas[area] = kmem_cache_zalloc(
-				vmap_area_cachep, GFP_KERNEL);
+				vmap_area_cachep, gfp);
 			if (!vas[area])
 				goto err_free;
 		}
-- 
2.43.0




* [PATCH v2 2/3] mm/percpu: honor GFP constraints when populating chunks
  2026-06-04 11:30 [PATCH v2 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
  2026-06-04 11:30 ` [PATCH v2 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
@ 2026-06-04 11:31 ` Kaitao Cheng
  2026-06-04 11:31 ` [PATCH v2 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
  2 siblings, 0 replies; 7+ messages in thread
From: Kaitao Cheng @ 2026-06-04 11:31 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko
  Cc: muchun.song, linux-mm, linux-kernel, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask and
passes it down to pcpu_populate_chunk().  pcpu_alloc_pages() already uses
that mask for backing page allocation.

However, the populate slow path still has internal allocations and page
table allocations which can lose the caller's allocation context.  The
temporary pages array is allocated by pcpu_get_pages() with GFP_KERNEL,
and pcpu_map_pages() maps the backing pages through
vmap_pages_range_noflush() using GFP_KERNEL.  The latter can allocate
vmalloc page tables implicitly, so a caller which deliberately uses
GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while populating
a percpu chunk.

This has the same concern as chunk creation: callers such as blk-cgroup
may use GFP_NOIO because they hold locks which can be involved in queue
freeze or IO reclaim dependencies.  If an allocation reaches the percpu
slow path and needs to populate previously unbacked pages, the internal
GFP_KERNEL allocations can defeat that context.

One possible case is blk-cgroup after commit 5d726c4dbeed
("blk-cgroup: fix possible deadlock while configuring policy").
blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:

  CPU0: blkg_conf_prep()
    mutex_lock(q->blkcg_mutex)
    blkg_alloc(..., GFP_NOIO)
      alloc_percpu_gfp(..., GFP_NOIO)
        pcpu_alloc_noprof(..., GFP_NOIO)
          pcpu_populate_chunk(GFP_NOIO)
            pcpu_get_pages()
            pcpu_map_pages()
              -> if the selected percpu chunk has unpopulated pages,
                 chunk population may do internal GFP_KERNEL allocations
              -> direct reclaim / writeback can issue IO to this queue
              -> IO waits because the queue is frozen

  CPU1: blkcg_deactivate_policy()
    blk_mq_freeze_queue(q)
    mutex_lock(q->blkcg_mutex)
      -> waits for CPU0
    ... unfreeze only happens after q->blkcg_mutex is acquired/released

So the concern is that the caller deliberately uses GFP_NOIO because it
may hold a lock that another task can try to acquire after freezing the
queue, but the percpu slow path can temporarily lose that allocation
context.

Pass pcpu_gfp through pcpu_get_pages(), pcpu_map_pages() and
__pcpu_map_pages().  Apply the corresponding memalloc scope around
vmap_pages_range_noflush(), because vmalloc page table allocation does not
pass the GFP mask down explicitly.  Keep the first chunk setup path using
GFP_KERNEL, matching the previous early-init behavior.
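
To illustrate why a scope is needed in addition to the explicit argument,
here is a minimal userspace model of the save/restore pattern.  It is a
sketch under assumptions, not the kernel implementation: the bit values
and all model_* / MODEL_* names are hypothetical, and the deep allocation
site is assumed to hardcode an unconstrained mask.

```c
#include <assert.h>

/* Hypothetical bit values for illustration; the kernel's real values differ. */
typedef unsigned int model_gfp_t;
#define MODEL_RECLAIM    0x1u
#define MODEL_IO         0x2u
#define MODEL_FS         0x4u
#define MODEL_GFP_NOIO   (MODEL_RECLAIM)
#define MODEL_GFP_NOFS   (MODEL_RECLAIM | MODEL_IO)
#define MODEL_GFP_KERNEL (MODEL_RECLAIM | MODEL_IO | MODEL_FS)

/* Stand-ins for the per-task scope bits. */
#define MODEL_SCOPE_NOIO 0x1u
#define MODEL_SCOPE_NOFS 0x2u
#define MODEL_SCOPE_MASK (MODEL_SCOPE_NOIO | MODEL_SCOPE_NOFS)

static unsigned int model_task_flags;	/* stands in for current->flags */

/* Enter the scope implied by @gfp and return the previous scope bits. */
static unsigned int model_apply_scope(model_gfp_t gfp)
{
	unsigned int old = model_task_flags & MODEL_SCOPE_MASK;

	if (!(gfp & MODEL_IO))
		model_task_flags |= MODEL_SCOPE_NOIO;
	else if (!(gfp & MODEL_FS))
		model_task_flags |= MODEL_SCOPE_NOFS;
	return old;
}

/* Leave the scope by putting the saved scope bits back. */
static void model_restore_scope(unsigned int old)
{
	assert((old & ~MODEL_SCOPE_MASK) == 0);
	model_task_flags = (model_task_flags & ~MODEL_SCOPE_MASK) | old;
}

/* Filter a mask through whatever scope is currently active. */
static model_gfp_t model_current_context(model_gfp_t gfp)
{
	if (model_task_flags & MODEL_SCOPE_NOIO)
		gfp &= ~(MODEL_IO | MODEL_FS);
	else if (model_task_flags & MODEL_SCOPE_NOFS)
		gfp &= ~MODEL_FS;
	return gfp;
}

/* A deep allocation site that never sees the caller's mask. */
static model_gfp_t model_pgtable_alloc_flags(void)
{
	return model_current_context(MODEL_GFP_KERNEL);
}
```

In this model the deep site only sees a constrained mask while a scope is
active, and nested scopes unwind in the reverse order they were applied.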

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/percpu-vm.c | 38 ++++++++++++++++++++++++++------------
 mm/percpu.c    |  2 +-
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 69b00741dc68..ccd03cc152d4 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -21,6 +21,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
 
 /**
  * pcpu_get_pages - get temp pages array
+ * @gfp: allocation flags passed to the underlying allocator
  *
  * Returns pointer to array of pointers to struct page which can be indexed
  * with pcpu_page_idx().  Note that there is only one array and accesses
@@ -29,7 +30,7 @@ static struct page *pcpu_chunk_page(struct pcpu_chunk *chunk,
  * RETURNS:
  * Pointer to temp pages array on success.
  */
-static struct page **pcpu_get_pages(void)
+static struct page **pcpu_get_pages(gfp_t gfp)
 {
 	static struct page **pages;
 	size_t pages_size = pcpu_nr_units * pcpu_unit_pages * sizeof(pages[0]);
@@ -37,7 +38,7 @@ static struct page **pcpu_get_pages(void)
 	lockdep_assert_held(&pcpu_alloc_mutex);
 
 	if (!pages)
-		pages = pcpu_mem_zalloc(pages_size, GFP_KERNEL);
+		pages = pcpu_mem_zalloc(pages_size, gfp);
 	return pages;
 }
 
@@ -191,10 +192,22 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,
 }
 
 static int __pcpu_map_pages(unsigned long addr, struct page **pages,
-			    int nr_pages)
+			    int nr_pages, gfp_t gfp)
 {
-	return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
-			PAGE_KERNEL, pages, PAGE_SHIFT, GFP_KERNEL);
+	unsigned int flags;
+	int ret;
+
+	/*
+	 * The vmalloc page table allocation path does not pass @gfp down
+	 * explicitly.  Apply the corresponding memalloc scope so implicit
+	 * page table allocations preserve NOFS/NOIO constraints.
+	 */
+	flags = memalloc_apply_gfp_scope(gfp);
+	ret = vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
+				       PAGE_KERNEL, pages, PAGE_SHIFT, gfp);
+	memalloc_restore_scope(flags);
+
+	return ret;
 }
 
 /**
@@ -203,6 +216,7 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
  * @pages: pages array containing pages to be mapped
  * @page_start: page index of the first page to map
  * @page_end: page index of the last page to map + 1
+ * @gfp: allocation flags passed to the underlying allocator
  *
  * For each cpu, map pages [@page_start,@page_end) into @chunk.  The
  * caller is responsible for calling pcpu_post_map_flush() after all
@@ -211,8 +225,8 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
  * This function is responsible for setting up whatever is necessary for
  * reverse lookup (addr -> chunk).
  */
-static int pcpu_map_pages(struct pcpu_chunk *chunk,
-			  struct page **pages, int page_start, int page_end)
+static int pcpu_map_pages(struct pcpu_chunk *chunk, struct page **pages,
+			  int page_start, int page_end, gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
 	int i, err;
@@ -220,7 +234,7 @@ static int pcpu_map_pages(struct pcpu_chunk *chunk,
 	for_each_possible_cpu(cpu) {
 		err = __pcpu_map_pages(pcpu_chunk_addr(chunk, cpu, page_start),
 				       &pages[pcpu_page_idx(cpu, page_start)],
-				       page_end - page_start);
+				       page_end - page_start, gfp);
 		if (err < 0)
 			goto err;
 
@@ -271,21 +285,21 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,
  * @chunk.
  *
  * CONTEXT:
- * pcpu_alloc_mutex, does GFP_KERNEL allocation.
+ * pcpu_alloc_mutex, does @gfp allocation.
  */
 static int pcpu_populate_chunk(struct pcpu_chunk *chunk,
 			       int page_start, int page_end, gfp_t gfp)
 {
 	struct page **pages;
 
-	pages = pcpu_get_pages();
+	pages = pcpu_get_pages(gfp);
 	if (!pages)
 		return -ENOMEM;
 
 	if (pcpu_alloc_pages(chunk, pages, page_start, page_end, gfp))
 		return -ENOMEM;
 
-	if (pcpu_map_pages(chunk, pages, page_start, page_end)) {
+	if (pcpu_map_pages(chunk, pages, page_start, page_end, gfp)) {
 		pcpu_free_pages(chunk, pages, page_start, page_end);
 		return -ENOMEM;
 	}
@@ -319,7 +333,7 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk,
 	 * successful population attempt so the temp pages array must
 	 * be available now.
 	 */
-	pages = pcpu_get_pages();
+	pages = pcpu_get_pages(GFP_KERNEL);
 	BUG_ON(!pages);
 
 	/* unmap and free */
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..4d89965cba16 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -3256,7 +3256,7 @@ int __init pcpu_page_first_chunk(size_t reserved_size, pcpu_fc_cpu_to_node_fn_t
 
 		/* pte already populated, the following shouldn't fail */
 		rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
-				      unit_pages);
+				      unit_pages, GFP_KERNEL);
 		if (rc < 0)
 			panic("failed to map percpu area, err=%d\n", rc);
 
-- 
2.43.0




* [PATCH v2 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
  2026-06-04 11:30 [PATCH v2 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion Kaitao Cheng
  2026-06-04 11:30 ` [PATCH v2 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
  2026-06-04 11:31 ` [PATCH v2 2/3] mm/percpu: honor GFP constraints when populating chunks Kaitao Cheng
@ 2026-06-04 11:31 ` Kaitao Cheng
  2026-06-04 19:07   ` Andrew Morton
  2 siblings, 1 reply; 7+ messages in thread
From: Kaitao Cheng @ 2026-06-04 11:31 UTC (permalink / raw)
  To: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko
  Cc: muchun.song, linux-mm, linux-kernel, Kaitao Cheng

From: Kaitao Cheng <chengkaitao@kylinos.cn>

Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable
allocations atomic") allows sleepable GFP_NOIO and GFP_NOFS percpu
allocations to take pcpu_alloc_mutex.  This avoids premature allocation
failures, but it also makes the mutex visible to callers from constrained
IO/FS contexts.

Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
pcpu_alloc_mutex.  Because its backing allocation is not constrained by
NOFS, it may enter FS reclaim while still holding the mutex, creating the
dependency:

  pcpu_alloc_mutex -> fs_reclaim -> FS lock

At the same time, Thread B may already hold an FS lock when it calls
pcpu_alloc_noprof() with GFP_NOFS.  It then blocks trying to acquire
pcpu_alloc_mutex, creating the reverse dependency:

  FS lock -> pcpu_alloc_mutex

Together these two dependencies form a potential deadlock cycle.
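
The two dependencies above can be modelled as a small directed graph and
checked for a cycle.  This is only a userspace sketch in the spirit of a
lock-order checker; the node names, the dep_* helpers and
model_build_graph() are hypothetical and do not correspond to kernel
interfaces.

```c
#include <assert.h>
#include <stdbool.h>

enum { PCPU_ALLOC_MUTEX, FS_RECLAIM, FS_LOCK, NR_NODES };

struct dep_graph {
	bool edge[NR_NODES][NR_NODES];	/* edge[a][b]: a is held while waiting on b */
};

static void dep_add(struct dep_graph *g, int from, int to)
{
	assert(from >= 0 && from < NR_NODES && to >= 0 && to < NR_NODES);
	g->edge[from][to] = true;
}

/* Depth-first search: can @target be reached from @node? */
static bool dep_reaches(const struct dep_graph *g, int node, int target,
			bool *seen)
{
	if (node == target)
		return true;
	if (seen[node])
		return false;
	seen[node] = true;
	for (int next = 0; next < NR_NODES; next++)
		if (g->edge[node][next] && dep_reaches(g, next, target, seen))
			return true;
	return false;
}

/* A cycle through @node exists if one of its successors can reach it. */
static bool dep_has_cycle_through(const struct dep_graph *g, int node)
{
	for (int next = 0; next < NR_NODES; next++) {
		bool seen[NR_NODES] = { false };

		if (g->edge[node][next] && dep_reaches(g, next, node, seen))
			return true;
	}
	return false;
}

/* Build the graph with or without FS reclaim under pcpu_alloc_mutex. */
static struct dep_graph model_build_graph(bool backing_may_enter_fs_reclaim)
{
	struct dep_graph g = { { { false } } };

	if (backing_may_enter_fs_reclaim)
		dep_add(&g, PCPU_ALLOC_MUTEX, FS_RECLAIM);	/* Thread A */
	dep_add(&g, FS_RECLAIM, FS_LOCK);	/* reclaim may need an FS lock */
	dep_add(&g, FS_LOCK, PCPU_ALLOC_MUTEX);	/* Thread B */
	return g;
}
```

Dropping the pcpu_alloc_mutex -> fs_reclaim edge is enough to break the
cycle in this model, which is the edge that constraining the backing
allocations is meant to remove.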

Avoid the dependency by restricting percpu backing allocations to GFP_NOIO.
The public allocation still uses the caller's GFP context to decide whether
it may block, but the internal memory allocations performed while
pcpu_alloc_mutex is held cannot recurse into IO or FS reclaim.
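
A minimal userspace model of the masking may help here.  The bit values
and all model_* / MODEL_* names are hypothetical; only the way the
composite masks are built from reclaim, IO and FS bits follows the
kernel definitions.  It shows that the blocking decision is taken from
the caller's mask, while the backing mask never carries IO or FS.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; the kernel's real values differ. */
typedef unsigned int model_gfp_t;
#define MODEL_DIRECT_RECLAIM 0x01u
#define MODEL_KSWAPD_RECLAIM 0x02u
#define MODEL_IO             0x04u
#define MODEL_FS             0x08u
#define MODEL_NORETRY        0x10u
#define MODEL_NOWARN         0x20u
#define MODEL_HIGH           0x40u

#define MODEL_RECLAIM    (MODEL_DIRECT_RECLAIM | MODEL_KSWAPD_RECLAIM)
#define MODEL_GFP_NOIO   (MODEL_RECLAIM)
#define MODEL_GFP_NOFS   (MODEL_RECLAIM | MODEL_IO)
#define MODEL_GFP_KERNEL (MODEL_RECLAIM | MODEL_IO | MODEL_FS)
#define MODEL_GFP_ATOMIC (MODEL_HIGH | MODEL_KSWAPD_RECLAIM)

/* Models the new allowlist: keep the reclaim bits, drop IO and FS. */
static model_gfp_t model_pcpu_gfp(model_gfp_t gfp)
{
	model_gfp_t out = gfp & (MODEL_GFP_NOIO | MODEL_NORETRY | MODEL_NOWARN);

	assert(!(out & (MODEL_IO | MODEL_FS)));
	return out;
}

/* Models the blocking check: blocking requires direct reclaim. */
static bool model_allow_blocking(model_gfp_t gfp)
{
	return (gfp & MODEL_DIRECT_RECLAIM) != 0;
}
```

For example, a MODEL_GFP_KERNEL caller is still treated as blocking, but
its backing mask collapses to MODEL_GFP_NOIO.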

Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
---
 mm/percpu.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 4d89965cba16..e6f449323064 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1726,9 +1726,8 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
  * @gfp: allocation flags
  *
  * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
- * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
- * then no warning will be triggered on invalid or failed allocation
- * requests.
+ * allow blocking, the allocation is atomic. If @gfp has __GFP_NOWARN then no
+ * warning will be triggered on invalid or failed allocation requests.
  *
  * RETURNS:
  * Percpu pointer to the allocated area on success, NULL on failure.
@@ -1749,8 +1748,14 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
 	size_t bits, bit_align;
 
 	gfp = current_gfp_context(gfp);
-	/* whitelisted flags that can be passed to the backing allocators */
-	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
+	/*
+	 * Whitelisted flags that can be passed to the backing allocators.
+	 * Backing allocations under pcpu_alloc_mutex must not recurse into
+	 * IO/FS reclaim.  Otherwise a GFP_KERNEL caller holding the mutex can
+	 * block on reclaim while a GFP_NOIO/NOFS caller holding an IO/FS lock
+	 * waits for the same mutex.
+	 */
+	pcpu_gfp = gfp & (GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
 	is_atomic = !gfpflags_allow_blocking(gfp);
 	do_warn = !(gfp & __GFP_NOWARN);
 
-- 
2.43.0




* Re: [PATCH v2 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
  2026-06-04 11:30 ` [PATCH v2 1/3] mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas() Kaitao Cheng
@ 2026-06-04 16:49   ` Uladzislau Rezki
  0 siblings, 0 replies; 7+ messages in thread
From: Uladzislau Rezki @ 2026-06-04 16:49 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Andrew Morton, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Uladzislau Rezki, Pedro Falcato, Vlastimil Babka, Michal Hocko,
	muchun.song, linux-mm, linux-kernel, Kaitao Cheng

On Thu, Jun 04, 2026 at 07:30:59PM +0800, Kaitao Cheng wrote:
> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> pcpu_alloc_noprof() derives pcpu_gfp from the caller supplied GFP mask
> and passes it down to the backing percpu allocator. However, when the
> percpu vmalloc allocator has to create a new chunk, pcpu_create_chunk()
> calls pcpu_get_vm_areas() to allocate the corresponding vmalloc areas.
> 
> pcpu_get_vm_areas() currently performs its internal allocations with
> GFP_KERNEL, including vmap area metadata, vm_struct metadata and KASAN
> vmalloc shadow population. This means that a caller which deliberately
> uses GFP_NOFS or GFP_NOIO can still enter FS or IO reclaim while creating
> the vmalloc areas for a new percpu chunk.
> 
> One possible case is blk-cgroup after commit 5d726c4dbeed
> ("blk-cgroup: fix possible deadlock while configuring policy").
> blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
> q->blkcg_mutex, and blkg_alloc() was changed to GFP_NOIO for that reason:
> 
>   CPU0: blkg_conf_prep()
>     mutex_lock(q->blkcg_mutex)
>     blkg_alloc(..., GFP_NOIO)
>       alloc_percpu_gfp(..., GFP_NOIO)
>         pcpu_alloc_noprof(..., GFP_NOIO)
>           pcpu_create_chunk(GFP_NOIO)
>             pcpu_get_vm_areas()
>               -> if percpu chunks are exhausted, chunk create may do
>                  internal GFP_KERNEL allocations
>               -> direct reclaim / writeback can issue IO to this queue
>               -> IO waits because the queue is frozen
> 
>   CPU1: blkcg_deactivate_policy()
>     blk_mq_freeze_queue(q)
>     mutex_lock(q->blkcg_mutex)
>       -> waits for CPU0
>     ... unfreeze only happens after q->blkcg_mutex is acquired/released
> 
> So the concern is that the caller deliberately uses GFP_NOIO because it
> may hold a lock which can be acquired after queue freeze, but the percpu
> slow path can temporarily lose that allocation context.
> 
> Pass the caller supplied GFP mask from pcpu_create_chunk() to
> pcpu_get_vm_areas(), and use it for the internal vmalloc metadata and
> KASAN shadow allocations.
> 
> Fixes: 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations atomic")
> Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
> ---
>  include/linux/vmalloc.h |  4 ++--
>  mm/percpu-vm.c          |  2 +-
>  mm/vmalloc.c            | 23 ++++++++++++-----------
>  3 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 3b02c0c6b371..9601e06624c8 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -308,14 +308,14 @@ static inline void set_vm_flush_reset_perms(void *addr) {}
>  #if defined(CONFIG_MMU) && defined(CONFIG_SMP)
>  struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  				     const size_t *sizes, int nr_vms,
> -				     size_t align);
> +				     size_t align, gfp_t gfp);
>  
>  void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms);
>  # else
>  static inline struct vm_struct **
>  pcpu_get_vm_areas(const unsigned long *offsets,
>  		const size_t *sizes, int nr_vms,
> -		size_t align)
> +		size_t align, gfp_t gfp)
>  {
>  	return NULL;
>  }
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 4f5937090590..69b00741dc68 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -340,7 +340,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
>  		return NULL;
>  
>  	vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
> -				pcpu_nr_groups, pcpu_atom_size);
> +				pcpu_nr_groups, pcpu_atom_size, gfp);
>  	if (!vms) {
>  		pcpu_free_chunk(chunk);
>  		return NULL;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 1afca3568b9b..08f468135e4d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -4946,16 +4946,17 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
>   * @sizes: array containing size of each area
>   * @nr_vms: the number of areas to allocate
>   * @align: alignment, all entries in @offsets and @sizes must be aligned to this
> + * @gfp: allocation flags passed to the underlying memory allocator
>   *
>   * Returns: kmalloc'd vm_struct pointer array pointing to allocated
>   *	    vm_structs on success, %NULL on failure
>   *
>   * Percpu allocator wants to use congruent vm areas so that it can
>   * maintain the offsets among percpu areas.  This function allocates
> - * congruent vmalloc areas for it with GFP_KERNEL.  These areas tend to
> - * be scattered pretty far, distance between two areas easily going up
> - * to gigabytes.  To avoid interacting with regular vmallocs, these
> - * areas are allocated from top.
> + * congruent vmalloc areas for it. These areas tend to be scattered
> + * pretty far, distance between two areas easily going up to gigabytes.
> + * To avoid interacting with regular vmallocs, these areas are allocated
> + * from top.
>   *
>   * Despite its complicated look, this allocator is rather simple. It
>   * does everything top-down and scans free blocks from the end looking
> @@ -4966,7 +4967,7 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
>   */
>  struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  				     const size_t *sizes, int nr_vms,
> -				     size_t align)
> +				     size_t align, gfp_t gfp)
>  {
>  	const unsigned long vmalloc_start = ALIGN(VMALLOC_START, align);
>  	const unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
> @@ -5004,14 +5005,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  		return NULL;
>  	}
>  
> -	vms = kzalloc_objs(vms[0], nr_vms);
> -	vas = kzalloc_objs(vas[0], nr_vms);
> +	vms = kzalloc_objs(vms[0], nr_vms, gfp);
> +	vas = kzalloc_objs(vas[0], nr_vms, gfp);
>  	if (!vas || !vms)
>  		goto err_free2;
>  
>  	for (area = 0; area < nr_vms; area++) {
> -		vas[area] = kmem_cache_zalloc(vmap_area_cachep, GFP_KERNEL);
> -		vms[area] = kzalloc_obj(struct vm_struct);
> +		vas[area] = kmem_cache_zalloc(vmap_area_cachep, gfp);
> +		vms[area] = kzalloc_obj(struct vm_struct, gfp);
>  		if (!vas[area] || !vms[area])
>  			goto err_free;
>  	}
> @@ -5101,7 +5102,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  
>  	/* populate the kasan shadow space */
>  	for (area = 0; area < nr_vms; area++) {
> -		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], GFP_KERNEL))
> +		if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], gfp))
>  			goto err_free_shadow;
>  	}
>  
> @@ -5158,7 +5159,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  				continue;
>  
>  			vas[area] = kmem_cache_zalloc(
> -				vmap_area_cachep, GFP_KERNEL);
> +				vmap_area_cachep, gfp);
>  			if (!vas[area])
>  				goto err_free;
>  		}
> -- 
> 2.43.0
> 
Looks good to me:

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki



* Re: [PATCH v2 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
  2026-06-04 11:31 ` [PATCH v2 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations Kaitao Cheng
@ 2026-06-04 19:07   ` Andrew Morton
  2026-06-05  8:48     ` Kaitao Cheng
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2026-06-04 19:07 UTC (permalink / raw)
  To: Kaitao Cheng
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Uladzislau Rezki,
	Pedro Falcato, Vlastimil Babka, Michal Hocko, muchun.song,
	linux-mm, linux-kernel, Kaitao Cheng

On Thu,  4 Jun 2026 19:31:01 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:

> From: Kaitao Cheng <chengkaitao@kylinos.cn>
> 
> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable
> allocations atomic") allows sleepable GFP_NOIO and GFP_NOFS percpu
> allocations to take pcpu_alloc_mutex.  This avoids premature allocation
> failures, but it also makes the mutex visible to callers from constrained
> IO/FS contexts.
> 
> Thread A calls pcpu_alloc_noprof() with GFP_KERNEL and takes
> pcpu_alloc_mutex. Since the internal allocation is not constrained by
> NOFS, it may enter FS reclaim while still holding pcpu_alloc_mutex,
> creating a dependency like: pcpu_alloc_mutex -> fs_reclaim -> FS lock
> 
> At the same time, Thread B may already hold an FS lock and then call
> pcpu_alloc_noprof() with GFP_NOFS. It will try to acquire
> pcpu_alloc_mutex and block, creating the reverse dependency:
> FS lock -> pcpu_alloc_mutex
> 
> This can still form a potential deadlock cycle.
> 
> Avoid the dependency by restricting percpu backing allocations to GFP_NOIO.
> The public allocation still uses the caller's GFP context to decide whether
> it may block, but the internal memory allocations performed while
> pcpu_alloc_mutex is held cannot recurse into IO or FS reclaim.
>
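
The dependency chain described in the commit message above can be modeled as a small directed graph, in the spirit of what lockdep tracks. This is a minimal userspace sketch, not kernel code: the `demo_` names and the three-node graph are hypothetical, and "after" assumes the only change is that allocations made under pcpu_alloc_mutex no longer enter FS reclaim.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node names that mirror the locks named in the commit message. */
enum demo_lock { PCPU_ALLOC_MUTEX, FS_RECLAIM, FS_LOCK, DEMO_NR_LOCKS };

/* "held -> wanted": a task holding 'held' may wait for 'wanted'. */
struct demo_dep {
	enum demo_lock held;
	enum demo_lock wanted;
};

/* Depth-first search: can 'target' be reached from 'from' along the edges? */
static bool demo_reaches(const struct demo_dep *deps, size_t n,
			 enum demo_lock from, enum demo_lock target,
			 bool *visited)
{
	for (size_t i = 0; i < n; i++) {
		if (deps[i].held != from)
			continue;
		if (deps[i].wanted == target)
			return true;
		if (!visited[deps[i].wanted]) {
			visited[deps[i].wanted] = true;
			if (demo_reaches(deps, n, deps[i].wanted, target, visited))
				return true;
		}
	}
	return false;
}

/* A cycle exists if any node can reach itself. */
bool demo_has_cycle(const struct demo_dep *deps, size_t n)
{
	for (int start = 0; start < DEMO_NR_LOCKS; start++) {
		bool visited[DEMO_NR_LOCKS] = { false };

		if (demo_reaches(deps, n, (enum demo_lock)start,
				 (enum demo_lock)start, visited))
			return true;
	}
	return false;
}

/* Before: a GFP_KERNEL backing allocation may enter FS reclaim under the mutex. */
const struct demo_dep demo_before[] = {
	{ PCPU_ALLOC_MUTEX, FS_RECLAIM },	/* Thread A */
	{ FS_RECLAIM, FS_LOCK },		/* reclaim may take an FS lock */
	{ FS_LOCK, PCPU_ALLOC_MUTEX },		/* Thread B, GFP_NOFS caller */
};

/* After (assumed): backing allocations are limited to GFP_NOIO, so the
 * pcpu_alloc_mutex -> fs_reclaim edge is gone. */
const struct demo_dep demo_after[] = {
	{ FS_RECLAIM, FS_LOCK },
	{ FS_LOCK, PCPU_ALLOC_MUTEX },
};
```

Calling `demo_has_cycle(demo_before, 3)` reports a cycle, while `demo_has_cycle(demo_after, 2)` does not, which is the property the patch relies on.
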
> ...
>
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1726,9 +1726,8 @@ static void pcpu_alloc_tag_free_hook(struct pcpu_chunk *chunk, int off, size_t s
>   * @gfp: allocation flags
>   *
>   * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
> - * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
> - * then no warning will be triggered on invalid or failed allocation
> - * requests.
> + * allow blocking, the allocation is atomic. If @gfp has __GFP_NOWARN then no
> + * warning will be triggered on invalid or failed allocation requests.
>   *
>   * RETURNS:
>   * Percpu pointer to the allocated area on success, NULL on failure.
> @@ -1749,8 +1748,14 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
>  	size_t bits, bit_align;
>  
>  	gfp = current_gfp_context(gfp);
> -	/* whitelisted flags that can be passed to the backing allocators */
> -	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
> +	/*
> +	 * Whitelisted flags that can be passed to the backing allocators.

We're supposed to say "allowlist".

> +	 * Backing allocations under pcpu_alloc_mutex must not recurse into
> +	 * IO/FS reclaim.  Otherwise a GFP_KERNEL caller holding the mutex can
> +	 * block on reclaim while a GFP_NOIO/NOFS caller holding an IO/FS lock
> +	 * waits for the same mutex.
> +	 */
> +	pcpu_gfp = gfp & (GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);

AI review
(https://sashiko.dev/#/patchset/20260604113101.89510-1-kaitao.cheng@linux.dev)
asked why we're currently removing __GFP_NOFAIL here.  There are
probably good reasons for this, but it would be good to describe them
in that comment.
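
For readers following the mask change, here is a minimal userspace sketch of what the old and new allowlists pass down. The `DEMO_` bit values are illustrative only and are not the kernel's; the compositions follow the kernel's definitions (GFP_NOIO is the reclaim bits, GFP_NOFS adds __GFP_IO, GFP_KERNEL adds __GFP_FS). The helper names are hypothetical, and current_gfp_context() is not modeled.

```c
#include <assert.h>

/* Illustrative bit values; the real ones are defined in the kernel headers. */
#define DEMO_IO			0x01u	/* stands in for __GFP_IO */
#define DEMO_FS			0x02u	/* stands in for __GFP_FS */
#define DEMO_DIRECT_RECLAIM	0x04u	/* stands in for __GFP_DIRECT_RECLAIM */
#define DEMO_KSWAPD_RECLAIM	0x08u	/* stands in for __GFP_KSWAPD_RECLAIM */
#define DEMO_NORETRY		0x10u	/* stands in for __GFP_NORETRY */
#define DEMO_NOWARN		0x20u	/* stands in for __GFP_NOWARN */
#define DEMO_NOFAIL		0x40u	/* stands in for __GFP_NOFAIL */

#define DEMO_RECLAIM	(DEMO_DIRECT_RECLAIM | DEMO_KSWAPD_RECLAIM)
#define DEMO_GFP_NOIO	(DEMO_RECLAIM)
#define DEMO_GFP_NOFS	(DEMO_RECLAIM | DEMO_IO)
#define DEMO_GFP_KERNEL	(DEMO_RECLAIM | DEMO_IO | DEMO_FS)

/* Old allowlist: the '-' line of the hunk. */
unsigned int demo_old_backing_mask(unsigned int gfp)
{
	return gfp & (DEMO_GFP_KERNEL | DEMO_NORETRY | DEMO_NOWARN);
}

/* New allowlist: the '+' line of the hunk. IO and FS bits never pass. */
unsigned int demo_new_backing_mask(unsigned int gfp)
{
	return gfp & (DEMO_GFP_NOIO | DEMO_NORETRY | DEMO_NOWARN);
}
```

Under this model a GFP_KERNEL caller ends up with GFP_NOIO backing allocations, and the NOFAIL bit is filtered out by both the old and the new mask, which matches the observation that the patch does not change how __GFP_NOFAIL is treated.
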




* Re: [PATCH v2 3/3] mm/percpu: Avoid IO/FS reclaim in backing allocations
  2026-06-04 19:07   ` Andrew Morton
@ 2026-06-05  8:48     ` Kaitao Cheng
  0 siblings, 0 replies; 7+ messages in thread
From: Kaitao Cheng @ 2026-06-05  8:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dennis Zhou, Tejun Heo, Christoph Lameter, Uladzislau Rezki,
	Pedro Falcato, Vlastimil Babka, Michal Hocko, muchun.song,
	linux-mm, linux-kernel, Kaitao Cheng

On 2026/6/5 03:07, Andrew Morton wrote:
> On Thu,  4 Jun 2026 19:31:01 +0800 Kaitao Cheng <kaitao.cheng@linux.dev> wrote:
> 
>> +	pcpu_gfp = gfp & (GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
> 
> AI review
> (https://sashiko.dev/#/patchset/20260604113101.89510-1-kaitao.cheng@linux.dev)
> asked why we're currently removing __GFP_NOFAIL here.  There are
> probably good reasons for this, but it would be good to describe them
> in that comment.
> 

This behavior has been present since commit 554fef1c39ee
("percpu: allow select gfp to be passed to underlying allocators"),
which introduced the whitelist for GFP flags passed down to the backing
allocators.

I did a quick AI-assisted scan of the current tree and did not find any
in-tree caller passing __GFP_NOFAIL to pcpu_alloc_noprof() or its
wrappers. So the issue Sashiko described does not appear to be reachable
with current callers.

That said, I agree the semantics are somewhat incomplete: __GFP_NOFAIL
is handled when taking pcpu_alloc_mutex, but it is not propagated through
pcpu_gfp to the backing allocations. If we want to address this
defensively, I think it would be better as a separate patch. Even though it
touches the same line, it fixes a different issue from this change.

-- 
Thanks
Kaitao Cheng


