* [PATCH 0/8] __vmalloc() and no-block support
@ 2025-08-07 7:58 Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 1/8] lib/test_vmalloc: add no_block_alloc_test case Uladzislau Rezki (Sony)
` (8 more replies)
0 siblings, 9 replies; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
Hello.
This is the second series making __vmalloc() support the GFP_ATOMIC and
GFP_NOWAIT flags. It aims to improve the non-blocking behaviour.
The first one can be found here:
https://lore.kernel.org/all/20250704152537.55724-1-urezki@gmail.com/
that was an RFC. Testing with this series applied, I have not found any
more places which can trigger "scheduling while atomic". There is one
spot which still requires attention; it is explained in [1].
Please note, the non-blocking behaviour is improved in the __vmalloc()
call only, i.e. vmalloc_huge() still contains many cond_resched() points
in its paths and can not be used as non-blocking as of now.
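For illustration, the intended usage after this series is roughly the
following (a minimal sketch; the lock and the size are made up):
<snip>
	void *buf;

	spin_lock(&some_lock);	/* atomic context, sleeping is not allowed */
	buf = __vmalloc(4 * PAGE_SIZE, GFP_NOWAIT | __GFP_ZERO);
	spin_unlock(&some_lock);

	if (buf) {
		/* use the buffer... */
		vfree(buf);	/* freeing is done from a sleepable context */
	}
<snip>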
[1] vmap_pages_range_noflush() calls kmsan_vmap_pages_range_noflush(),
an external implementation specific to KCSAN, which is hard-coded to
GFP_KERNEL. Exercising that path requires a kernel built with the
CONFIG_KCSAN option. It does not look straightforward to run such a
kernel on my box, therefore I need more time to investigate what is
wrong with CONFIG_KCSAN and my environment.
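For reference, the call chain in question is roughly the following
(simplified outline; the hard-coded GFP_KERNEL lives in the external
helper):
<snip>
vmap_pages_range_noflush()
  kmsan_vmap_pages_range_noflush()
    kcalloc(..., GFP_KERNEL)	/* ignores the caller's gfp_mask */
<snip>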
Uladzislau Rezki (Sony) (8):
lib/test_vmalloc: add no_block_alloc_test case
lib/test_vmalloc: Remove xfail condition check
mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area()
mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages()
mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc()
mm/vmalloc: Defer freeing partly initialized vm_struct
mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node()
mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set
include/linux/kasan.h | 6 ++--
include/linux/sched/mm.h | 7 +++-
include/linux/vmalloc.h | 6 +++-
lib/test_vmalloc.c | 28 ++++++++++++++-
mm/kasan/shadow.c | 22 ++++++++----
mm/vmalloc.c | 77 ++++++++++++++++++++++++++++++++--------
6 files changed, 119 insertions(+), 27 deletions(-)
--
2.39.5
* [PATCH 1/8] lib/test_vmalloc: add no_block_alloc_test case
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 2/8] lib/test_vmalloc: Remove xfail condition check Uladzislau Rezki (Sony)
` (7 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
Introduce a new test case "no_block_alloc_test" that verifies
non-blocking allocations using __vmalloc() with GFP_ATOMIC and
GFP_NOWAIT flags.
It is recommended to build the kernel with CONFIG_DEBUG_ATOMIC_SLEEP
enabled to help catch "sleeping while atomic" issues. This test
ensures that the memory allocation logic under atomic constraints
does not inadvertently sleep.
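For reference, the new case can be exercised in isolation by loading
the module with only this test selected, e.g. (using the existing
module parameters of lib/test_vmalloc.c):
<snip>
modprobe test_vmalloc run_test_mask=2048 test_loop_count=1000
<snip>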
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
lib/test_vmalloc.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index 2815658ccc37..aae5f4910aff 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -54,6 +54,7 @@ __param(int, run_test_mask, 7,
"\t\tid: 256, name: kvfree_rcu_1_arg_vmalloc_test\n"
"\t\tid: 512, name: kvfree_rcu_2_arg_vmalloc_test\n"
"\t\tid: 1024, name: vm_map_ram_test\n"
+ "\t\tid: 2048, name: no_block_alloc_test\n"
/* Add a new test case description here. */
);
@@ -283,6 +284,30 @@ static int fix_size_alloc_test(void)
return 0;
}
+static int no_block_alloc_test(void)
+{
+ void *ptr;
+ int i;
+
+ for (i = 0; i < test_loop_count; i++) {
+ bool use_atomic = !!(get_random_u8() % 2);
+ gfp_t gfp = use_atomic ? GFP_ATOMIC : GFP_NOWAIT;
+ unsigned long size = (nr_pages > 0 ? nr_pages : 1) * PAGE_SIZE;
+
+ preempt_disable();
+ ptr = __vmalloc(size, gfp);
+ preempt_enable();
+
+ if (!ptr)
+ return -1;
+
+ *((__u8 *)ptr) = 0;
+ vfree(ptr);
+ }
+
+ return 0;
+}
+
static int
pcpu_alloc_test(void)
{
@@ -411,6 +436,7 @@ static struct test_case_desc test_case_array[] = {
{ "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test, },
{ "kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test, },
{ "vm_map_ram_test", vm_map_ram_test, },
+ { "no_block_alloc_test", no_block_alloc_test, true },
/* Add a new test case here. */
};
--
2.39.5
* [PATCH 2/8] lib/test_vmalloc: Remove xfail condition check
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 1/8] lib/test_vmalloc: add no_block_alloc_test case Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area() Uladzislau Rezki (Sony)
` (6 subsequent siblings)
8 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
A test marked with "xfail = true" is expected to fail, but that does
not mean it is guaranteed to fail. Remove the "xfail" condition from
the pass check so that such a test is still counted as passed when it
succeeds.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
lib/test_vmalloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index aae5f4910aff..6521c05c7816 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -500,7 +500,7 @@ static int test_func(void *private)
for (j = 0; j < test_repeat_count; j++) {
ret = test_case_array[index].test_func();
- if (!ret && !test_case_array[index].xfail)
+ if (!ret)
t->data[index].test_passed++;
else if (ret && test_case_array[index].xfail)
t->data[index].test_xfailed++;
--
2.39.5
* [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area()
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 1/8] lib/test_vmalloc: add no_block_alloc_test case Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 2/8] lib/test_vmalloc: Remove xfail condition check Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 11:20 ` Michal Hocko
2025-08-18 2:11 ` Baoquan He
2025-08-07 7:58 ` [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages() Uladzislau Rezki (Sony)
` (5 subsequent siblings)
8 siblings, 2 replies; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
alloc_vmap_area() currently assumes that sleeping is allowed during
allocation. This is not true for callers which pass non-blocking
GFP flags, such as GFP_ATOMIC or GFP_NOWAIT.
This patch adds logic to detect whether the given gfp_mask permits
blocking. It avoids invoking might_sleep() and does not fall back to
the reclaim path if blocking is not allowed.
This makes alloc_vmap_area() safer for use in non-sleeping contexts,
where previously it could sleep unexpectedly and trigger warnings.
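For reference, the blocking check used here boils down to testing
__GFP_DIRECT_RECLAIM, as currently defined in include/linux/gfp.h:
<snip>
static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
{
	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}
<snip>
GFP_ATOMIC and GFP_NOWAIT do not carry __GFP_DIRECT_RECLAIM, so they
take the new non-blocking path.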
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
mm/vmalloc.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6dbcdceecae1..81b6d3bde719 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2017,6 +2017,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
unsigned long freed;
unsigned long addr;
unsigned int vn_id;
+ bool allow_block;
int purged = 0;
int ret;
@@ -2026,7 +2027,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
if (unlikely(!vmap_initialized))
return ERR_PTR(-EBUSY);
- might_sleep();
+ allow_block = gfpflags_allow_blocking(gfp_mask);
+ might_sleep_if(allow_block);
/*
* If a VA is obtained from a global heap(if it fails here)
@@ -2065,8 +2067,16 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
* If an allocation fails, the error value is
* returned. Therefore trigger the overflow path.
*/
- if (IS_ERR_VALUE(addr))
- goto overflow;
+ if (IS_ERR_VALUE(addr)) {
+ if (allow_block)
+ goto overflow;
+
+ /*
+ * We can not trigger any reclaim logic because
+ * sleeping is not allowed, thus fail an allocation.
+ */
+ goto error;
+ }
va->va_start = addr;
va->va_end = addr + size;
@@ -2116,6 +2126,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
pr_warn("vmalloc_node_range for size %lu failed: Address range restricted to %#lx - %#lx\n",
size, vstart, vend);
+error:
kmem_cache_free(vmap_area_cachep, va);
return ERR_PTR(-EBUSY);
}
--
2.39.5
* [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages()
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
` (2 preceding siblings ...)
2025-08-07 7:58 ` [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area() Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 11:22 ` Michal Hocko
2025-08-18 2:14 ` Baoquan He
2025-08-07 7:58 ` [PATCH 5/8] mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc() Uladzislau Rezki (Sony)
` (4 subsequent siblings)
8 siblings, 2 replies; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
The vm_area_alloc_pages() function uses cond_resched() to yield the
CPU during potentially long-running loops. However, these loops are
not long-running by themselves: they only take long when they dive
into the page allocator, and its slow path already includes reschedule
points to mitigate latency. Calling cond_resched() is also
inappropriate in non-blocking contexts.
Remove these calls to ensure correctness for both blocking and
non-blocking contexts. This also simplifies the code path.
This patch was tested on a !CONFIG_PREEMPT kernel with large
allocation chunks (~1GB), without triggering any "BUG: soft lockup"
warnings.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
mm/vmalloc.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 81b6d3bde719..b0255e0c74b3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3633,7 +3633,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
pages + nr_allocated);
nr_allocated += nr;
- cond_resched();
/*
* If zero or pages were obtained partly,
@@ -3675,7 +3674,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
for (i = 0; i < (1U << order); i++)
pages[nr_allocated + i] = page + i;
- cond_resched();
nr_allocated += 1U << order;
}
--
2.39.5
* [PATCH 5/8] mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc()
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
` (3 preceding siblings ...)
2025-08-07 7:58 ` [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages() Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 16:05 ` Andrey Ryabinin
2025-08-07 7:58 ` [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct Uladzislau Rezki (Sony)
` (3 subsequent siblings)
8 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki,
Andrey Ryabinin, Alexander Potapenko
The function kasan_populate_vmalloc() internally allocates pages using
the hardcoded GFP_KERNEL flag. This is not safe in contexts where
non-blocking allocation flags are required, such as GFP_ATOMIC or
GFP_NOWAIT, for example during atomic vmalloc paths.
This patch modifies kasan_populate_vmalloc() and its helpers to accept
a gfp_mask argument and use it for page allocations. It allows the
caller to specify the correct allocation context.
Also, when non-blocking flags are used, apply_to_page_range() is wrapped
with memalloc_noreclaim_save/restore() to suppress potential reclaim
behavior that may otherwise violate atomic constraints.
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
include/linux/kasan.h | 6 +++---
mm/kasan/shadow.c | 22 +++++++++++++++-------
mm/vmalloc.c | 4 ++--
3 files changed, 20 insertions(+), 12 deletions(-)
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 890011071f2b..fe5ce9215821 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -562,7 +562,7 @@ static inline void kasan_init_hw_tags(void) { }
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
void kasan_populate_early_vm_area_shadow(void *start, unsigned long size);
-int kasan_populate_vmalloc(unsigned long addr, unsigned long size);
+int kasan_populate_vmalloc(unsigned long addr, unsigned long size, gfp_t gfp_mask);
void kasan_release_vmalloc(unsigned long start, unsigned long end,
unsigned long free_region_start,
unsigned long free_region_end,
@@ -574,7 +574,7 @@ static inline void kasan_populate_early_vm_area_shadow(void *start,
unsigned long size)
{ }
static inline int kasan_populate_vmalloc(unsigned long start,
- unsigned long size)
+ unsigned long size, gfp_t gfp_mask)
{
return 0;
}
@@ -610,7 +610,7 @@ static __always_inline void kasan_poison_vmalloc(const void *start,
static inline void kasan_populate_early_vm_area_shadow(void *start,
unsigned long size) { }
static inline int kasan_populate_vmalloc(unsigned long start,
- unsigned long size)
+ unsigned long size, gfp_t gfp_mask)
{
return 0;
}
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index d2c70cd2afb1..5edfc1f6b53e 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -335,13 +335,13 @@ static void ___free_pages_bulk(struct page **pages, int nr_pages)
}
}
-static int ___alloc_pages_bulk(struct page **pages, int nr_pages)
+static int ___alloc_pages_bulk(struct page **pages, int nr_pages, gfp_t gfp_mask)
{
unsigned long nr_populated, nr_total = nr_pages;
struct page **page_array = pages;
while (nr_pages) {
- nr_populated = alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
+ nr_populated = alloc_pages_bulk(gfp_mask, nr_pages, pages);
if (!nr_populated) {
___free_pages_bulk(page_array, nr_total - nr_pages);
return -ENOMEM;
@@ -353,25 +353,33 @@ static int ___alloc_pages_bulk(struct page **pages, int nr_pages)
return 0;
}
-static int __kasan_populate_vmalloc(unsigned long start, unsigned long end)
+static int __kasan_populate_vmalloc(unsigned long start, unsigned long end, gfp_t gfp_mask)
{
unsigned long nr_pages, nr_total = PFN_UP(end - start);
+ bool noblock = !gfpflags_allow_blocking(gfp_mask);
struct vmalloc_populate_data data;
+ unsigned int flags;
int ret = 0;
- data.pages = (struct page **)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ data.pages = (struct page **)__get_free_page(gfp_mask | __GFP_ZERO);
if (!data.pages)
return -ENOMEM;
while (nr_total) {
nr_pages = min(nr_total, PAGE_SIZE / sizeof(data.pages[0]));
- ret = ___alloc_pages_bulk(data.pages, nr_pages);
+ ret = ___alloc_pages_bulk(data.pages, nr_pages, gfp_mask);
if (ret)
break;
data.start = start;
+ if (noblock)
+ flags = memalloc_noreclaim_save();
+
ret = apply_to_page_range(&init_mm, start, nr_pages * PAGE_SIZE,
kasan_populate_vmalloc_pte, &data);
+ if (noblock)
+ memalloc_noreclaim_restore(flags);
+
___free_pages_bulk(data.pages, nr_pages);
if (ret)
break;
@@ -385,7 +393,7 @@ static int __kasan_populate_vmalloc(unsigned long start, unsigned long end)
return ret;
}
-int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
+int kasan_populate_vmalloc(unsigned long addr, unsigned long size, gfp_t gfp_mask)
{
unsigned long shadow_start, shadow_end;
int ret;
@@ -414,7 +422,7 @@ int kasan_populate_vmalloc(unsigned long addr, unsigned long size)
shadow_start = PAGE_ALIGN_DOWN(shadow_start);
shadow_end = PAGE_ALIGN(shadow_end);
- ret = __kasan_populate_vmalloc(shadow_start, shadow_end);
+ ret = __kasan_populate_vmalloc(shadow_start, shadow_end, gfp_mask);
if (ret)
return ret;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b0255e0c74b3..7f48a54ec108 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2099,7 +2099,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
BUG_ON(va->va_start < vstart);
BUG_ON(va->va_end > vend);
- ret = kasan_populate_vmalloc(addr, size);
+ ret = kasan_populate_vmalloc(addr, size, gfp_mask);
if (ret) {
free_vmap_area(va);
return ERR_PTR(ret);
@@ -4835,7 +4835,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
/* populate the kasan shadow space */
for (area = 0; area < nr_vms; area++) {
- if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area]))
+ if (kasan_populate_vmalloc(vas[area]->va_start, sizes[area], GFP_KERNEL))
goto err_free_shadow;
}
--
2.39.5
* [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
` (4 preceding siblings ...)
2025-08-07 7:58 ` [PATCH 5/8] mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc() Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 11:25 ` Michal Hocko
2025-08-18 4:21 ` Baoquan He
2025-08-07 7:58 ` [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node() Uladzislau Rezki (Sony)
` (2 subsequent siblings)
8 siblings, 2 replies; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
__vmalloc_area_node() may call free_vmap_area() or vfree() on
error paths, both of which can sleep. This becomes problematic
if the function is invoked from an atomic context, such as when
GFP_ATOMIC or GFP_NOWAIT is passed via gfp_mask.
To fix this, unify error paths and defer the cleanup of partly
initialized vm_struct objects to a workqueue. This ensures that
freeing happens in a process context and avoids invalid sleeps
in atomic regions.
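The problematic nesting this avoids is roughly the following
(simplified outline):
<snip>
__vmalloc_area_node()		/* gfp_mask is GFP_ATOMIC or GFP_NOWAIT */
    fail:
	vfree(area->addr)
	    might_sleep()	/* not allowed in this context */
<snip>
With the deferral, the error path only adds the vm_struct to a llist
and schedules a work item, both of which are safe in atomic context.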
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
include/linux/vmalloc.h | 6 +++++-
mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++---
2 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index fdc9aeb74a44..b1425fae8cbf 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -50,7 +50,11 @@ struct iov_iter; /* in uio.h */
#endif
struct vm_struct {
- struct vm_struct *next;
+ union {
+ struct vm_struct *next; /* Early registration of vm_areas. */
+ struct llist_node llnode; /* Asynchronous freeing on error paths. */
+ };
+
void *addr;
unsigned long size;
unsigned long flags;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7f48a54ec108..2424f80d524a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3680,6 +3680,35 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
return nr_allocated;
}
+static LLIST_HEAD(pending_vm_area_cleanup);
+static void cleanup_vm_area_work(struct work_struct *work)
+{
+ struct vm_struct *area, *tmp;
+ struct llist_node *head;
+
+ head = llist_del_all(&pending_vm_area_cleanup);
+ if (!head)
+ return;
+
+ llist_for_each_entry_safe(area, tmp, head, llnode) {
+ if (!area->pages)
+ free_vm_area(area);
+ else
+ vfree(area->addr);
+ }
+}
+
+/*
+ * Helper for __vmalloc_area_node() to defer cleanup
+ * of partially initialized vm_struct in error paths.
+ */
+static DECLARE_WORK(cleanup_vm_area, cleanup_vm_area_work);
+static void defer_vm_area_cleanup(struct vm_struct *area)
+{
+ if (llist_add(&area->llnode, &pending_vm_area_cleanup))
+ schedule_work(&cleanup_vm_area);
+}
+
static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
pgprot_t prot, unsigned int page_shift,
int node)
@@ -3711,8 +3740,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, failed to allocated page array size %lu",
nr_small_pages * PAGE_SIZE, array_size);
- free_vm_area(area);
- return NULL;
+ goto fail;
}
set_vm_area_page_order(area, page_shift - PAGE_SHIFT);
@@ -3789,7 +3817,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
return area->addr;
fail:
- vfree(area->addr);
+ defer_vm_area_cleanup(area);
return NULL;
}
--
2.39.5
^ permalink raw reply related [flat|nested] 35+ messages in thread
* [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node()
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
` (5 preceding siblings ...)
2025-08-07 7:58 ` [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 11:54 ` Michal Hocko
2025-08-18 4:35 ` Baoquan He
2025-08-07 7:58 ` [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set Uladzislau Rezki (Sony)
2025-08-07 11:01 ` [PATCH 0/8] __vmalloc() and no-block support Marco Elver
8 siblings, 2 replies; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
This patch makes __vmalloc_area_node() correctly handle non-blocking
allocation requests, such as GFP_ATOMIC and GFP_NOWAIT. Main changes:
- Add __GFP_HIGHMEM to gfp_mask only for blocking requests
if there are no DMA constraints.
- Wrap vmap_page_range() with memalloc_noreclaim_save/restore() for
non-blocking requests to avoid memory-reclaim operations that could
sleep during page table setup or while mapping pages.
This is particularly important for page table allocations that
internally use GFP_PGTABLE_KERNEL, which may sleep unless such
scope restrictions are applied. For example:
<snip>
__pte_alloc_kernel()
pte_alloc_one_kernel(&init_mm);
pagetable_alloc_noprof(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 0);
<snip>
Note: in most cases, PTE entries are established only up to the level
required by current vmap space usage, meaning the page tables are typically
fully populated during the mapping process.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
mm/vmalloc.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2424f80d524a..8a7eab810561 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3721,12 +3721,20 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
unsigned int nr_small_pages = size >> PAGE_SHIFT;
unsigned int page_order;
unsigned int flags;
+ bool noblock;
int ret;
array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
+ noblock = !gfpflags_allow_blocking(gfp_mask);
- if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
- gfp_mask |= __GFP_HIGHMEM;
+ if (noblock) {
+ /* __GFP_NOFAIL and "noblock" flags are mutually exclusive. */
+ nofail = false;
+ } else {
+ /* Allow highmem allocations if there are no DMA constraints. */
+ if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
+ gfp_mask |= __GFP_HIGHMEM;
+ }
/* Please note that the recursion is strictly bounded. */
if (array_size > PAGE_SIZE) {
@@ -3790,7 +3798,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
* page tables allocations ignore external gfp mask, enforce it
* by the scope API
*/
- if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+ if (noblock)
+ flags = memalloc_noreclaim_save();
+ else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
flags = memalloc_nofs_save();
else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
flags = memalloc_noio_save();
@@ -3802,7 +3812,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
schedule_timeout_uninterruptible(1);
} while (nofail && (ret < 0));
- if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+ if (noblock)
+ memalloc_noreclaim_restore(flags);
+ else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
memalloc_nofs_restore(flags);
else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
memalloc_noio_restore(flags);
--
2.39.5
* [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
` (6 preceding siblings ...)
2025-08-07 7:58 ` [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node() Uladzislau Rezki (Sony)
@ 2025-08-07 7:58 ` Uladzislau Rezki (Sony)
2025-08-07 11:58 ` Michal Hocko
2025-08-07 11:01 ` [PATCH 0/8] __vmalloc() and no-block support Marco Elver
8 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki (Sony) @ 2025-08-07 7:58 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML, Uladzislau Rezki
The memory allocator already avoids direct reclaim when PF_MEMALLOC is
set. Clear __GFP_DIRECT_RECLAIM explicitly in current_gfp_context() in
that case, to make the effective gfp context reflect this behaviour and
to suppress might_alloc() warnings.
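For illustration, the case this targets is a PF_MEMALLOC scope such as
the one used earlier in the series (a minimal sketch):
<snip>
	unsigned int flags;

	flags = memalloc_noreclaim_save();	/* sets PF_MEMALLOC */
	/*
	 * A nested allocation, e.g. kmalloc(size, GFP_KERNEL), will not
	 * enter direct reclaim because PF_MEMALLOC is set; with this
	 * patch current_gfp_context() also drops __GFP_DIRECT_RECLAIM
	 * for such a context.
	 */
	memalloc_noreclaim_restore(flags);
<snip>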
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
include/linux/sched/mm.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2201da0afecc..8332fc09f8ac 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -246,12 +246,14 @@ static inline bool in_vfork(struct task_struct *tsk)
* PF_MEMALLOC_NOIO implies GFP_NOIO
* PF_MEMALLOC_NOFS implies GFP_NOFS
* PF_MEMALLOC_PIN implies !GFP_MOVABLE
+ * PF_MEMALLOC implies !__GFP_DIRECT_RECLAIM
*/
static inline gfp_t current_gfp_context(gfp_t flags)
{
unsigned int pflags = READ_ONCE(current->flags);
- if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN))) {
+ if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS |
+ PF_MEMALLOC_PIN | PF_MEMALLOC))) {
/*
* NOIO implies both NOIO and NOFS and it is a weaker context
* so always make sure it makes precedence
@@ -263,6 +265,9 @@ static inline gfp_t current_gfp_context(gfp_t flags)
if (pflags & PF_MEMALLOC_PIN)
flags &= ~__GFP_MOVABLE;
+
+ if (pflags & PF_MEMALLOC)
+ flags &= ~__GFP_DIRECT_RECLAIM;
}
return flags;
}
--
2.39.5
* Re: [PATCH 0/8] __vmalloc() and no-block support
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
` (7 preceding siblings ...)
2025-08-07 7:58 ` [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set Uladzislau Rezki (Sony)
@ 2025-08-07 11:01 ` Marco Elver
2025-08-08 8:48 ` Uladzislau Rezki
8 siblings, 1 reply; 35+ messages in thread
From: Marco Elver @ 2025-08-07 11:01 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko,
Baoquan He, LKML, Alexander Potapenko, kasan-dev
On Thu, Aug 07, 2025 at 09:58AM +0200, Uladzislau Rezki (Sony) wrote:
> Hello.
>
> This is a second series of making __vmalloc() to support GFP_ATOMIC and
> GFP_NOWAIT flags. It tends to improve the non-blocking behaviour.
>
> The first one can be found here:
>
> https://lore.kernel.org/all/20250704152537.55724-1-urezki@gmail.com/
>
> that was an RFC. Using this series for testing i have not found more
> places which can trigger: scheduling during atomic. Though there is
> one which requires attention. I will explain in [1].
>
> Please note, non-blocking gets improved in the __vmalloc() call only,
> i.e. vmalloc_huge() still contains in its paths many cond_resched()
> points and can not be used as non-blocking as of now.
>
> [1] The vmap_pages_range_noflush() contains the kmsan_vmap_pages_range_noflush()
> external implementation for KCSAN specifically which is hard coded to GFP_KERNEL.
> The kernel should be built with CONFIG_KCSAN option. To me it looks like not
> straight forward to run such kernel on my box, therefore i need more time to
> investigate what is wrong with CONFIG_KCSAN and my env.
KMSAN or KCSAN?
[+Cc KMSAN maintainers]
* Re: [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area()
2025-08-07 7:58 ` [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area() Uladzislau Rezki (Sony)
@ 2025-08-07 11:20 ` Michal Hocko
2025-08-08 9:59 ` Uladzislau Rezki
2025-08-18 2:11 ` Baoquan He
1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-08-07 11:20 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Baoquan He, LKML
On Thu 07-08-25 09:58:05, Uladzislau Rezki wrote:
> alloc_vmap_area() currently assumes that sleeping is allowed during
> allocation. This is not true for callers which pass non-blocking
> GFP flags, such as GFP_ATOMIC or GFP_NOWAIT.
Those are currently not allowed so it would be better to mention this is
a preparation for those to _be_ supported later in the series.
> This patch adds logic to detect whether the given gfp_mask permits
> blocking. It avoids invoking might_sleep() or falling back to reclaim
> path if blocking is not allowed.
>
> This makes alloc_vmap_area() safer for use in non-sleeping contexts,
> where previously it could hit unexpected sleeps, trigger warnings.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
With the changelog clarified
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/vmalloc.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6dbcdceecae1..81b6d3bde719 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2017,6 +2017,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> unsigned long freed;
> unsigned long addr;
> unsigned int vn_id;
> + bool allow_block;
> int purged = 0;
> int ret;
>
> @@ -2026,7 +2027,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> if (unlikely(!vmap_initialized))
> return ERR_PTR(-EBUSY);
>
> - might_sleep();
> + allow_block = gfpflags_allow_blocking(gfp_mask);
> + might_sleep_if(allow_block);
>
> /*
> * If a VA is obtained from a global heap(if it fails here)
> @@ -2065,8 +2067,16 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> * If an allocation fails, the error value is
> * returned. Therefore trigger the overflow path.
> */
> - if (IS_ERR_VALUE(addr))
> - goto overflow;
> + if (IS_ERR_VALUE(addr)) {
> + if (allow_block)
> + goto overflow;
> +
> + /*
> + * We can not trigger any reclaim logic because
> + * sleeping is not allowed, thus fail an allocation.
> + */
> + goto error;
> + }
>
> va->va_start = addr;
> va->va_end = addr + size;
> @@ -2116,6 +2126,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> pr_warn("vmalloc_node_range for size %lu failed: Address range restricted to %#lx - %#lx\n",
> size, vstart, vend);
>
> +error:
> kmem_cache_free(vmap_area_cachep, va);
> return ERR_PTR(-EBUSY);
> }
> --
> 2.39.5
--
Michal Hocko
SUSE Labs
* Re: [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages()
2025-08-07 7:58 ` [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages() Uladzislau Rezki (Sony)
@ 2025-08-07 11:22 ` Michal Hocko
2025-08-08 10:08 ` Uladzislau Rezki
2025-08-18 2:14 ` Baoquan He
1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-08-07 11:22 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Baoquan He, LKML
On Thu 07-08-25 09:58:06, Uladzislau Rezki wrote:
> The vm_area_alloc_pages() function uses cond_resched() to yield the
> CPU during potentially long-running loops. However, these loops are
> not considered long-running under normal conditions.
To be more precise they can take long if they dive into the page
allocator but that already involves cond_rescheds where appropriate so
these are not needed in fact.
> In non-blocking
> contexts, calling cond_resched() is inappropriate also.
>
> Remove these calls to ensure correctness for blocking/non-blocking
> contexts. This also simplifies the code path. In fact, a slow path
> of page allocator already includes reschedule points to mitigate
> latency.
>
> This patch was tested for !CONFIG_PREEMPT kernel and with large
> allocation chunks(~1GB), without triggering any "BUG: soft lockup"
> warnings.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/vmalloc.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 81b6d3bde719..b0255e0c74b3 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3633,7 +3633,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> pages + nr_allocated);
>
> nr_allocated += nr;
> - cond_resched();
>
> /*
> * If zero or pages were obtained partly,
> @@ -3675,7 +3674,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> for (i = 0; i < (1U << order); i++)
> pages[nr_allocated + i] = page + i;
>
> - cond_resched();
> nr_allocated += 1U << order;
> }
>
> --
> 2.39.5
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct
2025-08-07 7:58 ` [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct Uladzislau Rezki (Sony)
@ 2025-08-07 11:25 ` Michal Hocko
2025-08-08 10:37 ` Uladzislau Rezki
2025-08-18 4:21 ` Baoquan He
1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-08-07 11:25 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Baoquan He, LKML
On Thu 07-08-25 09:58:08, Uladzislau Rezki wrote:
> __vmalloc_area_node() may call free_vmap_area() or vfree() on
> error paths, both of which can sleep. This becomes problematic
> if the function is invoked from an atomic context, such as when
> GFP_ATOMIC or GFP_NOWAIT is passed via gfp_mask.
>
> To fix this, unify error paths and defer the cleanup of partly
> initialized vm_struct objects to a workqueue. This ensures that
> freeing happens in a process context and avoids invalid sleeps
> in atomic regions.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
LGTM
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> include/linux/vmalloc.h | 6 +++++-
> mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++---
> 2 files changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index fdc9aeb74a44..b1425fae8cbf 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -50,7 +50,11 @@ struct iov_iter; /* in uio.h */
> #endif
>
> struct vm_struct {
> - struct vm_struct *next;
> + union {
> + struct vm_struct *next; /* Early registration of vm_areas. */
> + struct llist_node llnode; /* Asynchronous freeing on error paths. */
> + };
> +
> void *addr;
> unsigned long size;
> unsigned long flags;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 7f48a54ec108..2424f80d524a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3680,6 +3680,35 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> return nr_allocated;
> }
>
> +static LLIST_HEAD(pending_vm_area_cleanup);
> +static void cleanup_vm_area_work(struct work_struct *work)
> +{
> + struct vm_struct *area, *tmp;
> + struct llist_node *head;
> +
> + head = llist_del_all(&pending_vm_area_cleanup);
> + if (!head)
> + return;
> +
> + llist_for_each_entry_safe(area, tmp, head, llnode) {
> + if (!area->pages)
> + free_vm_area(area);
> + else
> + vfree(area->addr);
> + }
> +}
> +
> +/*
> + * Helper for __vmalloc_area_node() to defer cleanup
> + * of partially initialized vm_struct in error paths.
> + */
> +static DECLARE_WORK(cleanup_vm_area, cleanup_vm_area_work);
> +static void defer_vm_area_cleanup(struct vm_struct *area)
> +{
> + if (llist_add(&area->llnode, &pending_vm_area_cleanup))
> + schedule_work(&cleanup_vm_area);
> +}
> +
> static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> pgprot_t prot, unsigned int page_shift,
> int node)
> @@ -3711,8 +3740,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> warn_alloc(gfp_mask, NULL,
> "vmalloc error: size %lu, failed to allocated page array size %lu",
> nr_small_pages * PAGE_SIZE, array_size);
> - free_vm_area(area);
> - return NULL;
> + goto fail;
> }
>
> set_vm_area_page_order(area, page_shift - PAGE_SHIFT);
> @@ -3789,7 +3817,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> return area->addr;
>
> fail:
> - vfree(area->addr);
> + defer_vm_area_cleanup(area);
> return NULL;
> }
>
> --
> 2.39.5
--
Michal Hocko
SUSE Labs
* Re: [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node()
2025-08-07 7:58 ` [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node() Uladzislau Rezki (Sony)
@ 2025-08-07 11:54 ` Michal Hocko
2025-08-08 11:54 ` Uladzislau Rezki
2025-08-18 4:35 ` Baoquan He
1 sibling, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-08-07 11:54 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Baoquan He, LKML
On Thu 07-08-25 09:58:09, Uladzislau Rezki wrote:
> This patch makes __vmalloc_area_node() to correctly handle non-blocking
> allocation requests, such as GFP_ATOMIC and GFP_NOWAIT. Main changes:
>
> - Add a __GFP_HIGHMEM to gfp_mask only for blocking requests
> if there are no DMA constraints.
This begs for more explanation. Why does __GFP_HIGHMEM matter? I
suspect this is due to kmapping of those pages, but that could be done
in an atomic way. In practice, though, I do not think we really care
about highmem all that much for vmalloc. The vmalloc space is really
tiny on 32b systems where highmem matters, and failing vmalloc
allocations due to lack of __GFP_HIGHMEM is hard to consider important,
if relevant at all.
> - vmap_page_range() is wrapped by memalloc_noreclaim_save/restore()
> to avoid memory reclaim related operations that could sleep during
> page table setup or mapping pages.
>
> This is particularly important for page table allocations that
> internally use GFP_PGTABLE_KERNEL, which may sleep unless such
> scope restrictions are applied. For example:
>
> <snip>
> __pte_alloc_kernel()
> pte_alloc_one_kernel(&init_mm);
> pagetable_alloc_noprof(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 0);
> <snip>
As I've said on several occasions, I am not entirely happy about this
approach because it doesn't really guarantee atomicity. If any
architecture decides to use some sleeping locking down that path then
the whole thing just blows up. On the other hand this is mostly a
theoretical concern at this stage, and this is a feature people have
been asking for for a long time (especially from the kvmalloc side), so
better good than perfect here.
That being said, you are missing the documentation update for
__kvmalloc_node_noprof, __vmalloc_node_range_noprof (and maybe some
more places).
> Note: in most cases, PTE entries are established only up to the level
> required by current vmap space usage, meaning the page tables are typically
> fully populated during the mapping process.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
With the doc part fixed
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/vmalloc.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2424f80d524a..8a7eab810561 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3721,12 +3721,20 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> unsigned int nr_small_pages = size >> PAGE_SHIFT;
> unsigned int page_order;
> unsigned int flags;
> + bool noblock;
> int ret;
>
> array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> + noblock = !gfpflags_allow_blocking(gfp_mask);
>
> - if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> - gfp_mask |= __GFP_HIGHMEM;
> + if (noblock) {
> + /* __GFP_NOFAIL and "noblock" flags are mutually exclusive. */
> + nofail = false;
> + } else {
> + /* Allow highmem allocations if there are no DMA constraints. */
> + if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> + gfp_mask |= __GFP_HIGHMEM;
> + }
>
> /* Please note that the recursion is strictly bounded. */
> if (array_size > PAGE_SIZE) {
> @@ -3790,7 +3798,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> * page tables allocations ignore external gfp mask, enforce it
> * by the scope API
> */
> - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> + if (noblock)
> + flags = memalloc_noreclaim_save();
> + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> flags = memalloc_nofs_save();
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> flags = memalloc_noio_save();
> @@ -3802,7 +3812,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> schedule_timeout_uninterruptible(1);
> } while (nofail && (ret < 0));
>
> - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> + if (noblock)
> + memalloc_noreclaim_restore(flags);
> + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> memalloc_nofs_restore(flags);
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> memalloc_noio_restore(flags);
> --
> 2.39.5
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set
2025-08-07 7:58 ` [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set Uladzislau Rezki (Sony)
@ 2025-08-07 11:58 ` Michal Hocko
2025-08-08 13:12 ` Uladzislau Rezki
0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-08-07 11:58 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Baoquan He, LKML
On Thu 07-08-25 09:58:10, Uladzislau Rezki wrote:
> The memory allocator already avoids reclaim when PF_MEMALLOC is set.
> Clear __GFP_DIRECT_RECLAIM explicitly to suppress might_alloc() warnings
> to make more correct behavior.
Rather than changing the gfp mask, would it make more sense to update
might_alloc() instead?
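E.g. something along these lines (illustrative only, untested, assuming
the current might_alloc() body):
<snip>
static inline void might_alloc(gfp_t gfp_mask)
{
	fs_reclaim_acquire(gfp_mask);
	fs_reclaim_release(gfp_mask);

	might_sleep_if(gfpflags_allow_blocking(gfp_mask) &&
		       !(current->flags & PF_MEMALLOC));
}
<snip>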
--
Michal Hocko
SUSE Labs
* Re: [PATCH 5/8] mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc()
2025-08-07 7:58 ` [PATCH 5/8] mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc() Uladzislau Rezki (Sony)
@ 2025-08-07 16:05 ` Andrey Ryabinin
2025-08-08 10:18 ` Uladzislau Rezki
0 siblings, 1 reply; 35+ messages in thread
From: Andrey Ryabinin @ 2025-08-07 16:05 UTC (permalink / raw)
To: Uladzislau Rezki (Sony), linux-mm, Andrew Morton
Cc: Vlastimil Babka, Michal Hocko, Baoquan He, LKML,
Alexander Potapenko
On 8/7/25 9:58 AM, Uladzislau Rezki (Sony) wrote:
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index d2c70cd2afb1..5edfc1f6b53e 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -335,13 +335,13 @@ static void ___free_pages_bulk(struct page **pages, int nr_pages)
> }
> }
>
> -static int ___alloc_pages_bulk(struct page **pages, int nr_pages)
> +static int ___alloc_pages_bulk(struct page **pages, int nr_pages, gfp_t gfp_mask)
> {
> unsigned long nr_populated, nr_total = nr_pages;
> struct page **page_array = pages;
>
> while (nr_pages) {
> - nr_populated = alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
> + nr_populated = alloc_pages_bulk(gfp_mask, nr_pages, pages);
> if (!nr_populated) {
> ___free_pages_bulk(page_array, nr_total - nr_pages);
> return -ENOMEM;
> @@ -353,25 +353,33 @@ static int ___alloc_pages_bulk(struct page **pages, int nr_pages)
> return 0;
> }
>
> -static int __kasan_populate_vmalloc(unsigned long start, unsigned long end)
> +static int __kasan_populate_vmalloc(unsigned long start, unsigned long end, gfp_t gfp_mask)
> {
> unsigned long nr_pages, nr_total = PFN_UP(end - start);
> + bool noblock = !gfpflags_allow_blocking(gfp_mask);
> struct vmalloc_populate_data data;
> + unsigned int flags;
> int ret = 0;
gfp_mask = (gfp_mask & GFP_RECLAIM_MASK);
But it might be better to do this in alloc_vmap_area().
In alloc_vmap_area() we have this:
retry:
if (IS_ERR_VALUE(addr)) {
preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
which probably needs GFP_RECLAIM_MASK too.
>
> - data.pages = (struct page **)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> + data.pages = (struct page **)__get_free_page(gfp_mask | __GFP_ZERO);
> if (!data.pages)
> return -ENOMEM;
>
> while (nr_total) {
> nr_pages = min(nr_total, PAGE_SIZE / sizeof(data.pages[0]));
> - ret = ___alloc_pages_bulk(data.pages, nr_pages);
> + ret = ___alloc_pages_bulk(data.pages, nr_pages, gfp_mask);
> if (ret)
> break;
>
> data.start = start;
> + if (noblock)
> + flags = memalloc_noreclaim_save();
> +
This should be the same as in __vmalloc_area_node():
if (noblock)
flags = memalloc_noreclaim_save();
else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
flags = memalloc_nofs_save();
else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
flags = memalloc_noio_save();
It would be better to fix the noio/nofs stuff first with a separate
patch, as it's a bug and needs cc stable. And add support for noblock
in a follow-up.
It might be a good idea to consolidate such logic in a separate function,
memalloc_save(gfp_mask)/memalloc_restore(gfp_mask, flags)?
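For illustration, such helpers could look roughly like this
(hypothetical names, only to show the consolidation idea, not part of
this series):
<snip>
static unsigned int memalloc_apply_gfp_scope(gfp_t gfp_mask)
{
	if (!gfpflags_allow_blocking(gfp_mask))
		return memalloc_noreclaim_save();
	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
		return memalloc_nofs_save();
	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
		return memalloc_noio_save();

	/* Nothing to restrict for regular GFP_KERNEL-like masks. */
	return 0;
}

static void memalloc_restore_gfp_scope(gfp_t gfp_mask, unsigned int flags)
{
	if (!gfpflags_allow_blocking(gfp_mask))
		memalloc_noreclaim_restore(flags);
	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
		memalloc_nofs_restore(flags);
	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
		memalloc_noio_restore(flags);
}
<snip>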
> ret = apply_to_page_range(&init_mm, start, nr_pages * PAGE_SIZE,
> kasan_populate_vmalloc_pte, &data);
> + if (noblock)
> + memalloc_noreclaim_restore(flags);
> +
> ___free_pages_bulk(data.pages, nr_pages);
> if (ret)
* Re: [PATCH 0/8] __vmalloc() and no-block support
2025-08-07 11:01 ` [PATCH 0/8] __vmalloc() and no-block support Marco Elver
@ 2025-08-08 8:48 ` Uladzislau Rezki
2025-08-23 9:35 ` Uladzislau Rezki
0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 8:48 UTC (permalink / raw)
To: Marco Elver
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Michal Hocko, Baoquan He, LKML, Alexander Potapenko, kasan-dev
On Thu, Aug 07, 2025 at 01:01:00PM +0200, Marco Elver wrote:
> On Thu, Aug 07, 2025 at 09:58AM +0200, Uladzislau Rezki (Sony) wrote:
> > Hello.
> >
> > This is a second series of making __vmalloc() to support GFP_ATOMIC and
> > GFP_NOWAIT flags. It tends to improve the non-blocking behaviour.
> >
> > The first one can be found here:
> >
> > https://lore.kernel.org/all/20250704152537.55724-1-urezki@gmail.com/
> >
> > that was an RFC. Using this series for testing i have not found more
> > places which can trigger: scheduling during atomic. Though there is
> > one which requires attention. I will explain in [1].
> >
> > Please note, non-blocking gets improved in the __vmalloc() call only,
> > i.e. vmalloc_huge() still contains in its paths many cond_resched()
> > points and can not be used as non-blocking as of now.
> >
> > [1] The vmap_pages_range_noflush() contains the kmsan_vmap_pages_range_noflush()
> > external implementation for KCSAN specifically which is hard coded to GFP_KERNEL.
> > The kernel should be built with CONFIG_KCSAN option. To me it looks like not
> > straight forward to run such kernel on my box, therefore i need more time to
> > investigate what is wrong with CONFIG_KCSAN and my env.
>
> KMSAN or KCSAN?
>
> [+Cc KMSAN maintainers]
>
Sorry for the typo, yes, that was about CONFIG_KMSAN.
--
Uladzislau Rezki
* Re: [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area()
2025-08-07 11:20 ` Michal Hocko
@ 2025-08-08 9:59 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 9:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Baoquan He, LKML
On Thu, Aug 07, 2025 at 01:20:54PM +0200, Michal Hocko wrote:
> On Thu 07-08-25 09:58:05, Uladzislau Rezki wrote:
> > alloc_vmap_area() currently assumes that sleeping is allowed during
> > allocation. This is not true for callers which pass non-blocking
> > GFP flags, such as GFP_ATOMIC or GFP_NOWAIT.
>
> Those are currently not allowed so it would be better to mention this is
> a preparation for those to _be_ supported later in the series.
>
> > This patch adds logic to detect whether the given gfp_mask permits
> > blocking. It avoids invoking might_sleep() or falling back to reclaim
> > path if blocking is not allowed.
> >
> > This makes alloc_vmap_area() safer for use in non-sleeping contexts,
> > where previously it could hit unexpected sleeps, trigger warnings.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> With the changelog clarified
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks!
>
Thank you! Added at the end:
It is a preparation and adjustment step to later allow both GFP_ATOMIC
and GFP_NOWAIT allocations in this series.
--
Uladzislau Rezki
* Re: [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages()
2025-08-07 11:22 ` Michal Hocko
@ 2025-08-08 10:08 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 10:08 UTC (permalink / raw)
To: Michal Hocko
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Baoquan He, LKML
On Thu, Aug 07, 2025 at 01:22:36PM +0200, Michal Hocko wrote:
> On Thu 07-08-25 09:58:06, Uladzislau Rezki wrote:
> > The vm_area_alloc_pages() function uses cond_resched() to yield the
> > CPU during potentially long-running loops. However, these loops are
> > not considered long-running under normal conditions.
>
> To be more precise they can take long if they dive into the page
> allocator but that already involves cond_rescheds where appropriate so
> these are not needed in fact.
>
> > In non-blocking
> > contexts, calling cond_resched() is inappropriate also.
> >
> > Remove these calls to ensure correctness for blocking/non-blocking
> > contexts. This also simplifies the code path. In fact, a slow path
> > of page allocator already includes reschedule points to mitigate
> > latency.
> >
> > This patch was tested for !CONFIG_PREEMPT kernel and with large
> > allocation chunks(~1GB), without triggering any "BUG: soft lockup"
> > warnings.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
>
Updated the commit message. Right, it can take a long time.
Thank you!
--
Uladzislau Rezki
* Re: [PATCH 5/8] mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc()
2025-08-07 16:05 ` Andrey Ryabinin
@ 2025-08-08 10:18 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 10:18 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Michal Hocko, Baoquan He, LKML, Alexander Potapenko
On Thu, Aug 07, 2025 at 06:05:21PM +0200, Andrey Ryabinin wrote:
>
> On 8/7/25 9:58 AM, Uladzislau Rezki (Sony) wrote:
>
> > diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> > index d2c70cd2afb1..5edfc1f6b53e 100644
> > --- a/mm/kasan/shadow.c
> > +++ b/mm/kasan/shadow.c
> > @@ -335,13 +335,13 @@ static void ___free_pages_bulk(struct page **pages, int nr_pages)
> > }
> > }
> >
> > -static int ___alloc_pages_bulk(struct page **pages, int nr_pages)
> > +static int ___alloc_pages_bulk(struct page **pages, int nr_pages, gfp_t gfp_mask)
> > {
> > unsigned long nr_populated, nr_total = nr_pages;
> > struct page **page_array = pages;
> >
> > while (nr_pages) {
> > - nr_populated = alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
> > + nr_populated = alloc_pages_bulk(gfp_mask, nr_pages, pages);
> > if (!nr_populated) {
> > ___free_pages_bulk(page_array, nr_total - nr_pages);
> > return -ENOMEM;
> > @@ -353,25 +353,33 @@ static int ___alloc_pages_bulk(struct page **pages, int nr_pages)
> > return 0;
> > }
> >
> > -static int __kasan_populate_vmalloc(unsigned long start, unsigned long end)
> > +static int __kasan_populate_vmalloc(unsigned long start, unsigned long end, gfp_t gfp_mask)
> > {
> > unsigned long nr_pages, nr_total = PFN_UP(end - start);
> > + bool noblock = !gfpflags_allow_blocking(gfp_mask);
> > struct vmalloc_populate_data data;
> > + unsigned int flags;
> > int ret = 0;
>
> gfp_mask = (gfp_mask & GFP_RECLAIM_MASK);
>
>
> But it might be better to do this in alloc_vmap_area().
> In alloc_vmap_area() we have this:
>
> retry:
> if (IS_ERR_VALUE(addr)) {
> preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
>
> which probably needs GFP_RECLAIM_MASK too.
>
Thank you for pointing this out. I will check it!
> >
> > - data.pages = (struct page **)__get_free_page(GFP_KERNEL | __GFP_ZERO);
> > + data.pages = (struct page **)__get_free_page(gfp_mask | __GFP_ZERO);
> > if (!data.pages)
> > return -ENOMEM;
> >
> > while (nr_total) {
> > nr_pages = min(nr_total, PAGE_SIZE / sizeof(data.pages[0]));
> > - ret = ___alloc_pages_bulk(data.pages, nr_pages);
> > + ret = ___alloc_pages_bulk(data.pages, nr_pages, gfp_mask);
> > if (ret)
> > break;
> >
> > data.start = start;
> > + if (noblock)
> > + flags = memalloc_noreclaim_save();
> > +
>
>
> This should be the same as in __vmalloc_area_node():
>
> if (noblock)
> flags = memalloc_noreclaim_save();
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> flags = memalloc_nofs_save();
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> flags = memalloc_noio_save();
>
>
> It would be better to fix noio/nofs stuff first with separate patch, as it's
> bug and needs cc stable. And add support for noblock in follow up.
>
Right. KASAN was not fixed together with vmalloc. I will look into it.
> It might be a good idea to consolidate such logic in separate function,
> memalloc_save(gfp_mask)/memalloc_restore(gfp_mask, flags) ?
>
> > ret = apply_to_page_range(&init_mm, start, nr_pages * PAGE_SIZE,
> > kasan_populate_vmalloc_pte, &data);
> > + if (noblock)
> > + memalloc_noreclaim_restore(flags);
> > +
> > ___free_pages_bulk(data.pages, nr_pages);
> > if (ret)
>
Sounds good.
Thank you.
--
Uladzislau Rezki
* Re: [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct
2025-08-07 11:25 ` Michal Hocko
@ 2025-08-08 10:37 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 10:37 UTC (permalink / raw)
To: Michal Hocko
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Baoquan He, LKML
On Thu, Aug 07, 2025 at 01:25:01PM +0200, Michal Hocko wrote:
> On Thu 07-08-25 09:58:08, Uladzislau Rezki wrote:
> > __vmalloc_area_node() may call free_vmap_area() or vfree() on
> > error paths, both of which can sleep. This becomes problematic
> > if the function is invoked from an atomic context, such as when
> > GFP_ATOMIC or GFP_NOWAIT is passed via gfp_mask.
> >
> > To fix this, unify error paths and defer the cleanup of partly
> > initialized vm_struct objects to a workqueue. This ensures that
> > freeing happens in a process context and avoids invalid sleeps
> > in atomic regions.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> LGTM
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks!
>
Thanks, applied Acked-by.
--
Uladzislau Rezki
* Re: [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node()
2025-08-07 11:54 ` Michal Hocko
@ 2025-08-08 11:54 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 11:54 UTC (permalink / raw)
To: Michal Hocko
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Baoquan He, LKML
On Thu, Aug 07, 2025 at 01:54:21PM +0200, Michal Hocko wrote:
> On Thu 07-08-25 09:58:09, Uladzislau Rezki wrote:
> > This patch makes __vmalloc_area_node() to correctly handle non-blocking
> > allocation requests, such as GFP_ATOMIC and GFP_NOWAIT. Main changes:
> >
> > - Add a __GFP_HIGHMEM to gfp_mask only for blocking requests
> > if there are no DMA constraints.
>
> This begs for more explanation. Why does __GFP_HIGHMEM matter? I
> suspect this is due to kmapping of those pages, but that could be done in
> an atomic way. But in practice I do not think we really care about
> highmem all that much for vmalloc. The vmalloc space is really tiny on
> 32b systems where highmem matters, and failing vmalloc allocations due to
> the lack of __GFP_HIGHMEM is hard to consider important, if relevant at
> all.
>
Thank you for this note. Yes, __GFP_HIGHMEM is about 32-bit systems.
Initially, in the RFC series, i saw some kernel splats during testing when
it was used together with non-blocking flags. I do not see them anymore,
so it looks like i messed something up when testing the pre-RFC series.
I will not touch it, i.e. i will keep it as it used to be: apply
__GFP_HIGHMEM only if there is no (GFP_DMA | GFP_DMA32).
> > - vmap_page_range() is wrapped by memalloc_noreclaim_save/restore()
> > to avoid memory reclaim related operations that could sleep during
> > page table setup or mapping pages.
> >
> > This is particularly important for page table allocations that
> > internally use GFP_PGTABLE_KERNEL, which may sleep unless such
> > scope restrictions are applied. For example:
> >
> > <snip>
> > __pte_alloc_kernel()
> > pte_alloc_one_kernel(&init_mm);
> > pagetable_alloc_noprof(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 0);
> > <snip>
>
> As I've said on several occasions, I am not entirely happy about this
> approach because it doesn't really guarantee atomicity. If any
> architecture decides to use some sleeping locking down that path then
> the whole thing just blows up. On the other hand this is mostly a
> theoretical concern at this stage, and this is a feature people have
> been asking for for a long time (especially from the kvmalloc side), so
> better good than perfect at this point.
>
I agree with it. Unfortunately i cannot control the PTE allocation
layer used for the kernel page tables.
>
> That being said, you are missing __kvmalloc_node_noprof,
> __vmalloc_node_range_noprof (and maybe some more places) documentation
> update.
>
I would like to fix the documentation in a separate patch. That was deliberate.
> > Note: in most cases, PTE entries are established only up to the level
> > required by current vmap space usage, meaning the page tables are typically
> > fully populated during the mapping process.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>
> With the doc part fixed
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks!
Thanks! Applied.
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set
2025-08-07 11:58 ` Michal Hocko
@ 2025-08-08 13:12 ` Uladzislau Rezki
2025-08-08 14:16 ` Michal Hocko
0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 13:12 UTC (permalink / raw)
To: Michal Hocko
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Baoquan He, LKML
On Thu, Aug 07, 2025 at 01:58:20PM +0200, Michal Hocko wrote:
> On Thu 07-08-25 09:58:10, Uladzislau Rezki wrote:
> > The memory allocator already avoids reclaim when PF_MEMALLOC is set.
> > Clear __GFP_DIRECT_RECLAIM explicitly to suppress might_alloc() warnings
> > to make more correct behavior.
>
> Rather than chaning the gfp mask would it make more sense to update
> might_alloc instead?
>
Hm.. I was thinking about it but decided to drop __GFP_DIRECT_RECLAIM
instead, just to guarantee the no-reclaim behaviour, as that is what the
flag now states.
On the other hand, after this patch we would have some unneeded/dead
checks (if i am not missing anything). For example:
[1]
WARN_ON_ONCE(!can_direct_reclaim);
/*
* PF_MEMALLOC request from this context is rather bizarre
* because we cannot reclaim anything and only can loop waiting
* for somebody to do a work for us.
*/
WARN_ON_ONCE(current->flags & PF_MEMALLOC);
[2]
/* no reclaim without waiting on it */
if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
return false;
/* this guy won't enter reclaim */
if (current->flags & PF_MEMALLOC)
return false;
[3]
/* Caller is not willing to reclaim, we can't balance anything */
if (!can_direct_reclaim)
goto nopage;
/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
goto nopage;
etc.
But, yes, might_alloc() can be modified also.
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set
2025-08-08 13:12 ` Uladzislau Rezki
@ 2025-08-08 14:16 ` Michal Hocko
2025-08-08 16:56 ` Uladzislau Rezki
0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2025-08-08 14:16 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Baoquan He, LKML
On Fri 08-08-25 15:12:45, Uladzislau Rezki wrote:
> On Thu, Aug 07, 2025 at 01:58:20PM +0200, Michal Hocko wrote:
> > On Thu 07-08-25 09:58:10, Uladzislau Rezki wrote:
> > > The memory allocator already avoids reclaim when PF_MEMALLOC is set.
> > > Clear __GFP_DIRECT_RECLAIM explicitly to suppress might_alloc() warnings
> > > to make more correct behavior.
> >
> > Rather than chaning the gfp mask would it make more sense to update
> > might_alloc instead?
> >
> Hm.. I was thinking about it but decided to drop __GFP_DIRECT_RECLAIM
> instead, just to guarantee the no-reclaim behaviour, as that is what the
> flag now states.
>
> On the other hand, after this patch we would have some unneeded/dead
> checks (if i am not missing anything). For example:
>
> [1]
> WARN_ON_ONCE(!can_direct_reclaim);
> /*
> * PF_MEMALLOC request from this context is rather bizarre
> * because we cannot reclaim anything and only can loop waiting
> * for somebody to do a work for us.
> */
> WARN_ON_ONCE(current->flags & PF_MEMALLOC);
> [2]
> /* no reclaim without waiting on it */
> if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> return false;
>
> /* this guy won't enter reclaim */
> if (current->flags & PF_MEMALLOC)
> return false;
>
> [3]
> /* Caller is not willing to reclaim, we can't balance anything */
> if (!can_direct_reclaim)
> goto nopage;
>
> /* Avoid recursion of direct reclaim */
> if (current->flags & PF_MEMALLOC)
> goto nopage;
> etc.
>
> But, yes, might_alloc() can be modified also.
I do not have a _strong_ preference but my slight preference would be to
deal with this in might_alloc. Not sure what others think.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set
2025-08-08 14:16 ` Michal Hocko
@ 2025-08-08 16:56 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-08 16:56 UTC (permalink / raw)
To: Michal Hocko
Cc: Uladzislau Rezki, linux-mm, Andrew Morton, Vlastimil Babka,
Baoquan He, LKML
On Fri, Aug 08, 2025 at 04:16:04PM +0200, Michal Hocko wrote:
> On Fri 08-08-25 15:12:45, Uladzislau Rezki wrote:
> > On Thu, Aug 07, 2025 at 01:58:20PM +0200, Michal Hocko wrote:
> > > On Thu 07-08-25 09:58:10, Uladzislau Rezki wrote:
> > > > The memory allocator already avoids reclaim when PF_MEMALLOC is set.
> > > > Clear __GFP_DIRECT_RECLAIM explicitly to suppress might_alloc() warnings
> > > > to make more correct behavior.
> > >
> > > Rather than chaning the gfp mask would it make more sense to update
> > > might_alloc instead?
> > >
> > Hm.. I was thinking about it but decided to drop __GFP_DIRECT_RECLAIM
> > instead, just to guarantee the no-reclaim behaviour, as that is what the
> > flag now states.
> >
> > On the other hand, after this patch we would have some unneeded/dead
> > checks (if i am not missing anything). For example:
> >
> > [1]
> > WARN_ON_ONCE(!can_direct_reclaim);
> > /*
> > * PF_MEMALLOC request from this context is rather bizarre
> > * because we cannot reclaim anything and only can loop waiting
> > * for somebody to do a work for us.
> > */
> > WARN_ON_ONCE(current->flags & PF_MEMALLOC);
> > [2]
> > /* no reclaim without waiting on it */
> > if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> > return false;
> >
> > /* this guy won't enter reclaim */
> > if (current->flags & PF_MEMALLOC)
> > return false;
> >
> > [3]
> > /* Caller is not willing to reclaim, we can't balance anything */
> > if (!can_direct_reclaim)
> > goto nopage;
> >
> > /* Avoid recursion of direct reclaim */
> > if (current->flags & PF_MEMALLOC)
> > goto nopage;
> > etc.
> >
> > But, yes, might_alloc() can be modified also.
>
> I do not have a _strong_ preference but my slight preference would be to
> deal with this in might_alloc. Not sure what others think.
>
No problem, that is something i can easily switch to.
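Something like the below is what i have in mind (a rough sketch only and
not tested; the current might_alloc() body in include/linux/sched/mm.h may
differ, and the fs_reclaim annotations may need a similar treatment):
<snip>
/*
 * Rough, untested sketch of handling PF_MEMALLOC in might_alloc()
 * instead of clearing __GFP_DIRECT_RECLAIM in the caller.
 */
static inline void might_alloc(gfp_t gfp_mask)
{
	fs_reclaim_acquire(gfp_mask);
	fs_reclaim_release(gfp_mask);

	/*
	 * With PF_MEMALLOC set, the page allocator does not enter direct
	 * reclaim for this task, so do not warn about a blocking gfp_mask
	 * coming from such a context.
	 */
	might_sleep_if(gfpflags_allow_blocking(gfp_mask) &&
		       !(current->flags & PF_MEMALLOC));
}
<snip>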
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area()
2025-08-07 7:58 ` [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area() Uladzislau Rezki (Sony)
2025-08-07 11:20 ` Michal Hocko
@ 2025-08-18 2:11 ` Baoquan He
1 sibling, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-08-18 2:11 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko, LKML
On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> alloc_vmap_area() currently assumes that sleeping is allowed during
> allocation. This is not true for callers which pass non-blocking
> GFP flags, such as GFP_ATOMIC or GFP_NOWAIT.
>
> This patch adds logic to detect whether the given gfp_mask permits
> blocking. It avoids invoking might_sleep() or falling back to reclaim
> path if blocking is not allowed.
>
> This makes alloc_vmap_area() safer for use in non-sleeping contexts,
> where previously it could hit unexpected sleeps, trigger warnings.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
> mm/vmalloc.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
LGTM,
Reviewed-by: Baoquan He <bhe@redhat.com>
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 6dbcdceecae1..81b6d3bde719 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2017,6 +2017,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> unsigned long freed;
> unsigned long addr;
> unsigned int vn_id;
> + bool allow_block;
> int purged = 0;
> int ret;
>
> @@ -2026,7 +2027,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> if (unlikely(!vmap_initialized))
> return ERR_PTR(-EBUSY);
>
> - might_sleep();
> + allow_block = gfpflags_allow_blocking(gfp_mask);
> + might_sleep_if(allow_block);
>
> /*
> * If a VA is obtained from a global heap(if it fails here)
> @@ -2065,8 +2067,16 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> * If an allocation fails, the error value is
> * returned. Therefore trigger the overflow path.
> */
> - if (IS_ERR_VALUE(addr))
> - goto overflow;
> + if (IS_ERR_VALUE(addr)) {
> + if (allow_block)
> + goto overflow;
> +
> + /*
> + * We can not trigger any reclaim logic because
> + * sleeping is not allowed, thus fail an allocation.
> + */
> + goto error;
> + }
>
> va->va_start = addr;
> va->va_end = addr + size;
> @@ -2116,6 +2126,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> pr_warn("vmalloc_node_range for size %lu failed: Address range restricted to %#lx - %#lx\n",
> size, vstart, vend);
>
> +error:
> kmem_cache_free(vmap_area_cachep, va);
> return ERR_PTR(-EBUSY);
> }
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages()
2025-08-07 7:58 ` [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages() Uladzislau Rezki (Sony)
2025-08-07 11:22 ` Michal Hocko
@ 2025-08-18 2:14 ` Baoquan He
1 sibling, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-08-18 2:14 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko, LKML
On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> The vm_area_alloc_pages() function uses cond_resched() to yield the
> CPU during potentially long-running loops. However, these loops are
> not considered long-running under normal conditions. In non-blocking
> contexts, calling cond_resched() is inappropriate also.
>
> Remove these calls to ensure correctness for blocking/non-blocking
> contexts. This also simplifies the code path. In fact, a slow path
> of page allocator already includes reschedule points to mitigate
> latency.
>
> This patch was tested for !CONFIG_PREEMPT kernel and with large
> allocation chunks(~1GB), without triggering any "BUG: soft lockup"
> warnings.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
> mm/vmalloc.c | 2 --
> 1 file changed, 2 deletions(-)
Reviewed-by: Baoquan He <bhe@redhat.com>
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 81b6d3bde719..b0255e0c74b3 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3633,7 +3633,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> pages + nr_allocated);
>
> nr_allocated += nr;
> - cond_resched();
>
> /*
> * If zero or pages were obtained partly,
> @@ -3675,7 +3674,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> for (i = 0; i < (1U << order); i++)
> pages[nr_allocated + i] = page + i;
>
> - cond_resched();
> nr_allocated += 1U << order;
> }
>
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct
2025-08-07 7:58 ` [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct Uladzislau Rezki (Sony)
2025-08-07 11:25 ` Michal Hocko
@ 2025-08-18 4:21 ` Baoquan He
2025-08-18 13:02 ` Uladzislau Rezki
1 sibling, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-08-18 4:21 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko, LKML
On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> __vmalloc_area_node() may call free_vmap_area() or vfree() on
> error paths, both of which can sleep. This becomes problematic
> if the function is invoked from an atomic context, such as when
> GFP_ATOMIC or GFP_NOWAIT is passed via gfp_mask.
>
> To fix this, unify error paths and defer the cleanup of partly
> initialized vm_struct objects to a workqueue. This ensures that
> freeing happens in a process context and avoids invalid sleeps
> in atomic regions.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
> include/linux/vmalloc.h | 6 +++++-
> mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++---
> 2 files changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index fdc9aeb74a44..b1425fae8cbf 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -50,7 +50,11 @@ struct iov_iter; /* in uio.h */
> #endif
>
> struct vm_struct {
> - struct vm_struct *next;
> + union {
> + struct vm_struct *next; /* Early registration of vm_areas. */
> + struct llist_node llnode; /* Asynchronous freeing on error paths. */
> + };
> +
> void *addr;
> unsigned long size;
> unsigned long flags;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 7f48a54ec108..2424f80d524a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3680,6 +3680,35 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> return nr_allocated;
> }
>
> +static LLIST_HEAD(pending_vm_area_cleanup);
> +static void cleanup_vm_area_work(struct work_struct *work)
> +{
> + struct vm_struct *area, *tmp;
> + struct llist_node *head;
> +
> + head = llist_del_all(&pending_vm_area_cleanup);
> + if (!head)
> + return;
> +
> + llist_for_each_entry_safe(area, tmp, head, llnode) {
> + if (!area->pages)
> + free_vm_area(area);
> + else
> + vfree(area->addr);
> + }
> +}
> +
> +/*
> + * Helper for __vmalloc_area_node() to defer cleanup
> + * of partially initialized vm_struct in error paths.
> + */
> +static DECLARE_WORK(cleanup_vm_area, cleanup_vm_area_work);
> +static void defer_vm_area_cleanup(struct vm_struct *area)
> +{
> + if (llist_add(&area->llnode, &pending_vm_area_cleanup))
> + schedule_work(&cleanup_vm_area);
> +}
Wondering why we need to call schedule_work() here when
pending_vm_area_cleanup was empty before adding the new entry. Shouldn't
it be as below to schedule the job? Not sure if I am missing anything.
if (!llist_add(&area->llnode, &pending_vm_area_cleanup))
schedule_work(&cleanup_vm_area);
=====
/**
* llist_add - add a new entry
* @new: new entry to be added
* @head: the head for your lock-less list
*
* Returns true if the list was empty prior to adding this entry.
*/
static inline bool llist_add(struct llist_node *new, struct llist_head *head)
{
return llist_add_batch(new, new, head);
}
=====
> +
> static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> pgprot_t prot, unsigned int page_shift,
> int node)
> @@ -3711,8 +3740,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> warn_alloc(gfp_mask, NULL,
> "vmalloc error: size %lu, failed to allocated page array size %lu",
> nr_small_pages * PAGE_SIZE, array_size);
> - free_vm_area(area);
> - return NULL;
> + goto fail;
> }
>
> set_vm_area_page_order(area, page_shift - PAGE_SHIFT);
> @@ -3789,7 +3817,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> return area->addr;
>
> fail:
> - vfree(area->addr);
> + defer_vm_area_cleanup(area);
> return NULL;
> }
>
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node()
2025-08-07 7:58 ` [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node() Uladzislau Rezki (Sony)
2025-08-07 11:54 ` Michal Hocko
@ 2025-08-18 4:35 ` Baoquan He
2025-08-18 13:08 ` Uladzislau Rezki
1 sibling, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-08-18 4:35 UTC (permalink / raw)
To: Uladzislau Rezki (Sony)
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko, LKML
On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> This patch makes __vmalloc_area_node() to correctly handle non-blocking
> allocation requests, such as GFP_ATOMIC and GFP_NOWAIT. Main changes:
>
> - Add a __GFP_HIGHMEM to gfp_mask only for blocking requests
> if there are no DMA constraints.
>
> - vmap_page_range() is wrapped by memalloc_noreclaim_save/restore()
> to avoid memory reclaim related operations that could sleep during
> page table setup or mapping pages.
>
> This is particularly important for page table allocations that
> internally use GFP_PGTABLE_KERNEL, which may sleep unless such
> scope restrictions are applied. For example:
>
> <snip>
> __pte_alloc_kernel()
> pte_alloc_one_kernel(&init_mm);
> pagetable_alloc_noprof(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 0);
> <snip>
>
> Note: in most cases, PTE entries are established only up to the level
> required by current vmap space usage, meaning the page tables are typically
> fully populated during the mapping process.
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
> mm/vmalloc.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2424f80d524a..8a7eab810561 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3721,12 +3721,20 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> unsigned int nr_small_pages = size >> PAGE_SHIFT;
> unsigned int page_order;
> unsigned int flags;
> + bool noblock;
> int ret;
>
> array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> + noblock = !gfpflags_allow_blocking(gfp_mask);
>
> - if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> - gfp_mask |= __GFP_HIGHMEM;
> + if (noblock) {
> + /* __GFP_NOFAIL and "noblock" flags are mutually exclusive. */
> + nofail = false;
> + } else {
> + /* Allow highmem allocations if there are no DMA constraints. */
> + if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> + gfp_mask |= __GFP_HIGHMEM;
> + }
>
> /* Please note that the recursion is strictly bounded. */
> if (array_size > PAGE_SIZE) {
> @@ -3790,7 +3798,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> * page tables allocations ignore external gfp mask, enforce it
> * by the scope API
> */
> - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> + if (noblock)
> + flags = memalloc_noreclaim_save();
> + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> flags = memalloc_nofs_save();
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> flags = memalloc_noio_save();
> @@ -3802,7 +3812,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> schedule_timeout_uninterruptible(1);
> } while (nofail && (ret < 0));
>
> - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> + if (noblock)
> + memalloc_noreclaim_restore(flags);
> + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> memalloc_nofs_restore(flags);
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> memalloc_noio_restore(flags);
Can we use memalloc_flags_restore(flags) directly to replace the above
if/else checking? It can reduce LOC, though it might not be as readable as
the change in the patch. Not a strong opinion.
memalloc_flags_restore(flags);
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct
2025-08-18 4:21 ` Baoquan He
@ 2025-08-18 13:02 ` Uladzislau Rezki
2025-08-19 8:56 ` Baoquan He
0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-18 13:02 UTC (permalink / raw)
To: Baoquan He
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Michal Hocko, LKML
On Mon, Aug 18, 2025 at 12:21:15PM +0800, Baoquan He wrote:
> On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> > __vmalloc_area_node() may call free_vmap_area() or vfree() on
> > error paths, both of which can sleep. This becomes problematic
> > if the function is invoked from an atomic context, such as when
> > GFP_ATOMIC or GFP_NOWAIT is passed via gfp_mask.
> >
> > To fix this, unify error paths and defer the cleanup of partly
> > initialized vm_struct objects to a workqueue. This ensures that
> > freeing happens in a process context and avoids invalid sleeps
> > in atomic regions.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> > include/linux/vmalloc.h | 6 +++++-
> > mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++---
> > 2 files changed, 36 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index fdc9aeb74a44..b1425fae8cbf 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -50,7 +50,11 @@ struct iov_iter; /* in uio.h */
> > #endif
> >
> > struct vm_struct {
> > - struct vm_struct *next;
> > + union {
> > + struct vm_struct *next; /* Early registration of vm_areas. */
> > + struct llist_node llnode; /* Asynchronous freeing on error paths. */
> > + };
> > +
> > void *addr;
> > unsigned long size;
> > unsigned long flags;
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 7f48a54ec108..2424f80d524a 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3680,6 +3680,35 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > return nr_allocated;
> > }
> >
> > +static LLIST_HEAD(pending_vm_area_cleanup);
> > +static void cleanup_vm_area_work(struct work_struct *work)
> > +{
> > + struct vm_struct *area, *tmp;
> > + struct llist_node *head;
> > +
> > + head = llist_del_all(&pending_vm_area_cleanup);
> > + if (!head)
> > + return;
> > +
> > + llist_for_each_entry_safe(area, tmp, head, llnode) {
> > + if (!area->pages)
> > + free_vm_area(area);
> > + else
> > + vfree(area->addr);
> > + }
> > +}
> > +
> > +/*
> > + * Helper for __vmalloc_area_node() to defer cleanup
> > + * of partially initialized vm_struct in error paths.
> > + */
> > +static DECLARE_WORK(cleanup_vm_area, cleanup_vm_area_work);
> > +static void defer_vm_area_cleanup(struct vm_struct *area)
> > +{
> > + if (llist_add(&area->llnode, &pending_vm_area_cleanup))
> > + schedule_work(&cleanup_vm_area);
> > +}
>
> Wondering why we need to call schedule_work() here when
> pending_vm_area_cleanup was empty before adding the new entry. Shouldn't
> it be as below to schedule the job? Not sure if I am missing anything.
>
> if (!llist_add(&area->llnode, &pending_vm_area_cleanup))
> schedule_work(&cleanup_vm_area);
>
> =====
> /**
> * llist_add - add a new entry
> * @new: new entry to be added
> * @head: the head for your lock-less list
> *
> * Returns true if the list was empty prior to adding this entry.
> */
> static inline bool llist_add(struct llist_node *new, struct llist_head *head)
> {
> return llist_add_batch(new, new, head);
> }
> =====
>
But then you will not schedule. If the list is empty and we add one element,
llist_add() returns 1, but your condition expects 0.
How it works:
If someone keeps adding to the llist and it is not empty, we should not
trigger a new work, because a current work is in flight (it will cover the
newcomers), i.e. it has been scheduled but it has not yet completed
llist_del_all() on the head.
Once it is done, a newcomer will trigger a work again only if it sees NULL,
i.e. when the list is empty.
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node()
2025-08-18 4:35 ` Baoquan He
@ 2025-08-18 13:08 ` Uladzislau Rezki
2025-08-19 8:46 ` Baoquan He
0 siblings, 1 reply; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-18 13:08 UTC (permalink / raw)
To: Baoquan He
Cc: Uladzislau Rezki (Sony), linux-mm, Andrew Morton, Vlastimil Babka,
Michal Hocko, LKML
On Mon, Aug 18, 2025 at 12:35:16PM +0800, Baoquan He wrote:
> On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> > This patch makes __vmalloc_area_node() to correctly handle non-blocking
> > allocation requests, such as GFP_ATOMIC and GFP_NOWAIT. Main changes:
> >
> > - Add a __GFP_HIGHMEM to gfp_mask only for blocking requests
> > if there are no DMA constraints.
> >
> > - vmap_page_range() is wrapped by memalloc_noreclaim_save/restore()
> > to avoid memory reclaim related operations that could sleep during
> > page table setup or mapping pages.
> >
> > This is particularly important for page table allocations that
> > internally use GFP_PGTABLE_KERNEL, which may sleep unless such
> > scope restrictions are applied. For example:
> >
> > <snip>
> > __pte_alloc_kernel()
> > pte_alloc_one_kernel(&init_mm);
> > pagetable_alloc_noprof(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 0);
> > <snip>
> >
> > Note: in most cases, PTE entries are established only up to the level
> > required by current vmap space usage, meaning the page tables are typically
> > fully populated during the mapping process.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> > mm/vmalloc.c | 20 ++++++++++++++++----
> > 1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 2424f80d524a..8a7eab810561 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3721,12 +3721,20 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > unsigned int nr_small_pages = size >> PAGE_SHIFT;
> > unsigned int page_order;
> > unsigned int flags;
> > + bool noblock;
> > int ret;
> >
> > array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> > + noblock = !gfpflags_allow_blocking(gfp_mask);
> >
> > - if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> > - gfp_mask |= __GFP_HIGHMEM;
> > + if (noblock) {
> > + /* __GFP_NOFAIL and "noblock" flags are mutually exclusive. */
> > + nofail = false;
> > + } else {
> > + /* Allow highmem allocations if there are no DMA constraints. */
> > + if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> > + gfp_mask |= __GFP_HIGHMEM;
> > + }
> >
> > /* Please note that the recursion is strictly bounded. */
> > if (array_size > PAGE_SIZE) {
> > @@ -3790,7 +3798,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > * page tables allocations ignore external gfp mask, enforce it
> > * by the scope API
> > */
> > - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > + if (noblock)
> > + flags = memalloc_noreclaim_save();
> > + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > flags = memalloc_nofs_save();
> > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > flags = memalloc_noio_save();
> > @@ -3802,7 +3812,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > schedule_timeout_uninterruptible(1);
> > } while (nofail && (ret < 0));
> >
> > - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > + if (noblock)
> > + memalloc_noreclaim_restore(flags);
> > + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > memalloc_nofs_restore(flags);
> > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > memalloc_noio_restore(flags);
>
> > Can we use memalloc_flags_restore(flags) directly to replace the above
> > if/else checking? It can reduce LOC, though it might not be as readable as
> > the change in the patch. Not a strong opinion.
>
> memalloc_flags_restore(flags);
>
I agree, those if/else cases look ugly. Maybe adding two save/restore
functions is worth doing specifically for the vmalloc part.
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node()
2025-08-18 13:08 ` Uladzislau Rezki
@ 2025-08-19 8:46 ` Baoquan He
0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-08-19 8:46 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko, LKML
On 08/18/25 at 03:08pm, Uladzislau Rezki wrote:
> On Mon, Aug 18, 2025 at 12:35:16PM +0800, Baoquan He wrote:
> > On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> > > This patch makes __vmalloc_area_node() to correctly handle non-blocking
> > > allocation requests, such as GFP_ATOMIC and GFP_NOWAIT. Main changes:
> > >
> > > - Add a __GFP_HIGHMEM to gfp_mask only for blocking requests
> > > if there are no DMA constraints.
> > >
> > > - vmap_page_range() is wrapped by memalloc_noreclaim_save/restore()
> > > to avoid memory reclaim related operations that could sleep during
> > > page table setup or mapping pages.
> > >
> > > This is particularly important for page table allocations that
> > > internally use GFP_PGTABLE_KERNEL, which may sleep unless such
> > > scope restrictions are applied. For example:
> > >
> > > <snip>
> > > __pte_alloc_kernel()
> > > pte_alloc_one_kernel(&init_mm);
> > > pagetable_alloc_noprof(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 0);
> > > <snip>
> > >
> > > Note: in most cases, PTE entries are established only up to the level
> > > required by current vmap space usage, meaning the page tables are typically
> > > fully populated during the mapping process.
> > >
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > > mm/vmalloc.c | 20 ++++++++++++++++----
> > > 1 file changed, 16 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 2424f80d524a..8a7eab810561 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3721,12 +3721,20 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > unsigned int nr_small_pages = size >> PAGE_SHIFT;
> > > unsigned int page_order;
> > > unsigned int flags;
> > > + bool noblock;
> > > int ret;
> > >
> > > array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> > > + noblock = !gfpflags_allow_blocking(gfp_mask);
> > >
> > > - if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> > > - gfp_mask |= __GFP_HIGHMEM;
> > > + if (noblock) {
> > > + /* __GFP_NOFAIL and "noblock" flags are mutually exclusive. */
> > > + nofail = false;
> > > + } else {
> > > + /* Allow highmem allocations if there are no DMA constraints. */
> > > + if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
> > > + gfp_mask |= __GFP_HIGHMEM;
> > > + }
> > >
> > > /* Please note that the recursion is strictly bounded. */
> > > if (array_size > PAGE_SIZE) {
> > > @@ -3790,7 +3798,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > * page tables allocations ignore external gfp mask, enforce it
> > > * by the scope API
> > > */
> > > - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > > + if (noblock)
> > > + flags = memalloc_noreclaim_save();
> > > + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > > flags = memalloc_nofs_save();
> > > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > flags = memalloc_noio_save();
> > > @@ -3802,7 +3812,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > schedule_timeout_uninterruptible(1);
> > > } while (nofail && (ret < 0));
> > >
> > > - if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > > + if (noblock)
> > > + memalloc_noreclaim_restore(flags);
> > > + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > > memalloc_nofs_restore(flags);
> > > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > memalloc_noio_restore(flags);
> >
> > > Can we use memalloc_flags_restore(flags) directly to replace the above
> > > if/else checking? It can reduce LOC, though it might not be as readable as
> > > the change in the patch. Not a strong opinion.
> >
> > memalloc_flags_restore(flags);
> >
> I agree, those if/else cases look ugly. Maybe adding two save/restore
> functions is worth doing specifically for the vmalloc part.
Yeah, that is also a great idea.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct
2025-08-18 13:02 ` Uladzislau Rezki
@ 2025-08-19 8:56 ` Baoquan He
2025-08-19 9:20 ` Uladzislau Rezki
0 siblings, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-08-19 8:56 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Michal Hocko, LKML
On 08/18/25 at 03:02pm, Uladzislau Rezki wrote:
> On Mon, Aug 18, 2025 at 12:21:15PM +0800, Baoquan He wrote:
> > On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> > > __vmalloc_area_node() may call free_vmap_area() or vfree() on
> > > error paths, both of which can sleep. This becomes problematic
> > > if the function is invoked from an atomic context, such as when
> > > GFP_ATOMIC or GFP_NOWAIT is passed via gfp_mask.
> > >
> > > To fix this, unify error paths and defer the cleanup of partly
> > > initialized vm_struct objects to a workqueue. This ensures that
> > > freeing happens in a process context and avoids invalid sleeps
> > > in atomic regions.
> > >
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > > include/linux/vmalloc.h | 6 +++++-
> > > mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++---
> > > 2 files changed, 36 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > index fdc9aeb74a44..b1425fae8cbf 100644
> > > --- a/include/linux/vmalloc.h
> > > +++ b/include/linux/vmalloc.h
> > > @@ -50,7 +50,11 @@ struct iov_iter; /* in uio.h */
> > > #endif
> > >
> > > struct vm_struct {
> > > - struct vm_struct *next;
> > > + union {
> > > + struct vm_struct *next; /* Early registration of vm_areas. */
> > > + struct llist_node llnode; /* Asynchronous freeing on error paths. */
> > > + };
> > > +
> > > void *addr;
> > > unsigned long size;
> > > unsigned long flags;
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 7f48a54ec108..2424f80d524a 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3680,6 +3680,35 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > > return nr_allocated;
> > > }
> > >
> > > +static LLIST_HEAD(pending_vm_area_cleanup);
> > > +static void cleanup_vm_area_work(struct work_struct *work)
> > > +{
> > > + struct vm_struct *area, *tmp;
> > > + struct llist_node *head;
> > > +
> > > + head = llist_del_all(&pending_vm_area_cleanup);
> > > + if (!head)
> > > + return;
> > > +
> > > + llist_for_each_entry_safe(area, tmp, head, llnode) {
> > > + if (!area->pages)
> > > + free_vm_area(area);
> > > + else
> > > + vfree(area->addr);
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * Helper for __vmalloc_area_node() to defer cleanup
> > > + * of partially initialized vm_struct in error paths.
> > > + */
> > > +static DECLARE_WORK(cleanup_vm_area, cleanup_vm_area_work);
> > > +static void defer_vm_area_cleanup(struct vm_struct *area)
> > > +{
> > > + if (llist_add(&area->llnode, &pending_vm_area_cleanup))
> > > + schedule_work(&cleanup_vm_area);
> > > +}
> >
> > Wondering why we need to call schedule_work() here when
> > pending_vm_area_cleanup was empty before adding the new entry. Shouldn't
> > it be as below to schedule the job? Not sure if I am missing anything.
> >
> > if (!llist_add(&area->llnode, &pending_vm_area_cleanup))
> > schedule_work(&cleanup_vm_area);
> >
> > =====
> > /**
> > * llist_add - add a new entry
> > * @new: new entry to be added
> > * @head: the head for your lock-less list
> > *
> > * Returns true if the list was empty prior to adding this entry.
> > */
> > static inline bool llist_add(struct llist_node *new, struct llist_head *head)
> > {
> > return llist_add_batch(new, new, head);
> > }
> > =====
> >
> But then you will not schedule. If the list is empty and we add one element,
> llist_add() returns 1, but your condition expects 0.
>
> How it works:
>
> If someone keeps adding to the llist and it is not empty, we should not
> trigger a new work, because a current work is in flight (it will cover the
> newcomers), i.e. it has been scheduled but it has not yet completed
> llist_del_all() on the head.
>
> Once it is done, a newcomer will trigger a work again only if it sees NULL,
> i.e. when the list is empty.
Fair enough. I thought it was about deferring the work; in fact it is about
moving the error handling into a workqueue instead of doing it in the current
atomic context.
Thanks for the explanation.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct
2025-08-19 8:56 ` Baoquan He
@ 2025-08-19 9:20 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-19 9:20 UTC (permalink / raw)
To: Baoquan He
Cc: Uladzislau Rezki, linux-mm, Andrew Morton, Vlastimil Babka,
Michal Hocko, LKML
On Tue, Aug 19, 2025 at 04:56:25PM +0800, Baoquan He wrote:
> On 08/18/25 at 03:02pm, Uladzislau Rezki wrote:
> > On Mon, Aug 18, 2025 at 12:21:15PM +0800, Baoquan He wrote:
> > > On 08/07/25 at 09:58am, Uladzislau Rezki (Sony) wrote:
> > > > __vmalloc_area_node() may call free_vmap_area() or vfree() on
> > > > error paths, both of which can sleep. This becomes problematic
> > > > if the function is invoked from an atomic context, such as when
> > > > GFP_ATOMIC or GFP_NOWAIT is passed via gfp_mask.
> > > >
> > > > To fix this, unify error paths and defer the cleanup of partly
> > > > initialized vm_struct objects to a workqueue. This ensures that
> > > > freeing happens in a process context and avoids invalid sleeps
> > > > in atomic regions.
> > > >
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > ---
> > > > include/linux/vmalloc.h | 6 +++++-
> > > > mm/vmalloc.c | 34 +++++++++++++++++++++++++++++++---
> > > > 2 files changed, 36 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > > > index fdc9aeb74a44..b1425fae8cbf 100644
> > > > --- a/include/linux/vmalloc.h
> > > > +++ b/include/linux/vmalloc.h
> > > > @@ -50,7 +50,11 @@ struct iov_iter; /* in uio.h */
> > > > #endif
> > > >
> > > > struct vm_struct {
> > > > - struct vm_struct *next;
> > > > + union {
> > > > + struct vm_struct *next; /* Early registration of vm_areas. */
> > > > + struct llist_node llnode; /* Asynchronous freeing on error paths. */
> > > > + };
> > > > +
> > > > void *addr;
> > > > unsigned long size;
> > > > unsigned long flags;
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 7f48a54ec108..2424f80d524a 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -3680,6 +3680,35 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > > > return nr_allocated;
> > > > }
> > > >
> > > > +static LLIST_HEAD(pending_vm_area_cleanup);
> > > > +static void cleanup_vm_area_work(struct work_struct *work)
> > > > +{
> > > > + struct vm_struct *area, *tmp;
> > > > + struct llist_node *head;
> > > > +
> > > > + head = llist_del_all(&pending_vm_area_cleanup);
> > > > + if (!head)
> > > > + return;
> > > > +
> > > > + llist_for_each_entry_safe(area, tmp, head, llnode) {
> > > > + if (!area->pages)
> > > > + free_vm_area(area);
> > > > + else
> > > > + vfree(area->addr);
> > > > + }
> > > > +}
> > > > +
> > > > +/*
> > > > + * Helper for __vmalloc_area_node() to defer cleanup
> > > > + * of partially initialized vm_struct in error paths.
> > > > + */
> > > > +static DECLARE_WORK(cleanup_vm_area, cleanup_vm_area_work);
> > > > +static void defer_vm_area_cleanup(struct vm_struct *area)
> > > > +{
> > > > + if (llist_add(&area->llnode, &pending_vm_area_cleanup))
> > > > + schedule_work(&cleanup_vm_area);
> > > > +}
> > >
> > > Wondering why we need to call schedule_work() here when
> > > pending_vm_area_cleanup was empty before adding the new entry. Shouldn't
> > > it be as below to schedule the job? Not sure if I am missing anything.
> > >
> > > if (!llist_add(&area->llnode, &pending_vm_area_cleanup))
> > > schedule_work(&cleanup_vm_area);
> > >
> > > =====
> > > /**
> > > * llist_add - add a new entry
> > > * @new: new entry to be added
> > > * @head: the head for your lock-less list
> > > *
> > > * Returns true if the list was empty prior to adding this entry.
> > > */
> > > static inline bool llist_add(struct llist_node *new, struct llist_head *head)
> > > {
> > > return llist_add_batch(new, new, head);
> > > }
> > > =====
> > >
> > But then you will not schedule. If the list is empty and we add one element,
> > llist_add() returns 1, but your condition expects 0.
> >
> > How it works:
> >
> > If someone keeps adding to the llist and it is not empty, we should not
> > trigger a new work, because a current work is in flight (it will cover the
> > newcomers), i.e. it has been scheduled but it has not yet completed
> > llist_del_all() on the head.
> >
> > Once it is done, a newcomer will trigger a work again only if it sees NULL,
> > i.e. when the list is empty.
>
> Fair enough. I thought it was about deferring the work; in fact it is about
> moving the error handling into a workqueue instead of doing it in the current
> atomic context.
> Thanks for the explanation.
>
You are welcome!
--
Uladzislau Rezki
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 0/8] __vmalloc() and no-block support
2025-08-08 8:48 ` Uladzislau Rezki
@ 2025-08-23 9:35 ` Uladzislau Rezki
0 siblings, 0 replies; 35+ messages in thread
From: Uladzislau Rezki @ 2025-08-23 9:35 UTC (permalink / raw)
To: Alexander Potapenko
Cc: Marco Elver, linux-mm, Andrew Morton, Vlastimil Babka,
Michal Hocko, Baoquan He, LKML, Alexander Potapenko, kasan-dev
Hello, Alexander!
I am working on making vmalloc support extra non-blocking flags.
Currently i see one more place that i need to address: the
kmsan_vmap_pages_range_noflush() function, which uses a hard-coded GFP_KERNEL
flag for the allocation of two arrays for its internal use only.
I have a question for you: can we just get rid of those two allocations?
That would be the easiest way, if possible. Otherwise i can add an extra
"gfp_t gfp_mask" parameter and pass the corresponding gfp mask there. See below:
<snip>
diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index 2b1432cc16d5..e4b34e7a3b11 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -133,6 +133,7 @@ void kmsan_kfree_large(const void *ptr);
* @prot: page protection flags used for vmap.
* @pages: array of pages.
* @page_shift: page_shift passed to vmap_range_noflush().
+ * @gfp_mask: gfp_mask to use internally.
*
* KMSAN maps shadow and origin pages of @pages into contiguous ranges in
* vmalloc metadata address range. Returns 0 on success, callers must check
@@ -142,7 +143,8 @@ int __must_check kmsan_vmap_pages_range_noflush(unsigned long start,
unsigned long end,
pgprot_t prot,
struct page **pages,
- unsigned int page_shift);
+ unsigned int page_shift,
+ gfp_t gfp_mask);
/**
* kmsan_vunmap_kernel_range_noflush() - Notify KMSAN about a vunmap.
@@ -348,7 +350,7 @@ static inline void kmsan_kfree_large(const void *ptr)
static inline int __must_check kmsan_vmap_pages_range_noflush(
unsigned long start, unsigned long end, pgprot_t prot,
- struct page **pages, unsigned int page_shift)
+ struct page **pages, unsigned int page_shift, gfp_t gfp_mask)
{
return 0;
}
diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc03..6a13b8ee1e6c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1359,7 +1359,7 @@ size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
#ifdef CONFIG_MMU
void __init vmalloc_init(void);
int __must_check vmap_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages, unsigned int page_shift);
+ pgprot_t prot, struct page **pages, unsigned int page_shift, gfp_t gfp_mask);
unsigned int get_vm_area_page_order(struct vm_struct *vm);
#else
static inline void vmalloc_init(void)
@@ -1368,7 +1368,7 @@ static inline void vmalloc_init(void)
static inline
int __must_check vmap_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages, unsigned int page_shift)
+ pgprot_t prot, struct page **pages, unsigned int page_shift, gfp_t gfp_mask)
{
return -EINVAL;
}
diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index b14ce3417e65..5b74d6dbf0b8 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -233,5 +233,6 @@ void __init kmsan_init_runtime(void)
kmsan_memblock_discard();
pr_info("Starting KernelMemorySanitizer\n");
pr_info("ATTENTION: KMSAN is a debugging tool! Do not use it on production machines!\n");
- kmsan_enabled = true;
+ /* kmsan_enabled = true; */
+ kmsan_enabled = false;
}
diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 54f3c3c962f0..3cd733663100 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -215,7 +215,7 @@ void kmsan_free_page(struct page *page, unsigned int order)
int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end,
pgprot_t prot, struct page **pages,
- unsigned int page_shift)
+ unsigned int page_shift, gfp_t gfp_mask)
{
unsigned long shadow_start, origin_start, shadow_end, origin_end;
struct page **s_pages, **o_pages;
@@ -230,8 +230,8 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end,
return 0;
nr = (end - start) / PAGE_SIZE;
- s_pages = kcalloc(nr, sizeof(*s_pages), GFP_KERNEL);
- o_pages = kcalloc(nr, sizeof(*o_pages), GFP_KERNEL);
+ s_pages = kcalloc(nr, sizeof(*s_pages), gfp_mask);
+ o_pages = kcalloc(nr, sizeof(*o_pages), gfp_mask);
if (!s_pages || !o_pages) {
err = -ENOMEM;
goto ret;
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index cd69caf6aa8d..4f5937090590 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -194,7 +194,7 @@ static int __pcpu_map_pages(unsigned long addr, struct page **pages,
int nr_pages)
{
return vmap_pages_range_noflush(addr, addr + (nr_pages << PAGE_SHIFT),
- PAGE_KERNEL, pages, PAGE_SHIFT);
+ PAGE_KERNEL, pages, PAGE_SHIFT, GFP_KERNEL);
}
/**
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ee197f5b8cf0..9be01dcca690 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -671,16 +671,28 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
}
int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
- pgprot_t prot, struct page **pages, unsigned int page_shift)
+ pgprot_t prot, struct page **pages, unsigned int page_shift,
+ gfp_t gfp_mask)
{
int ret = kmsan_vmap_pages_range_noflush(addr, end, prot, pages,
- page_shift);
+ page_shift, gfp_mask);
if (ret)
return ret;
return __vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
}
+static int __vmap_pages_range(unsigned long addr, unsigned long end,
+ pgprot_t prot, struct page **pages, unsigned int page_shift,
+ gfp_t gfp_mask)
+{
+ int err;
+
+ err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift, gfp_mask);
+ flush_cache_vmap(addr, end);
+ return err;
+}
+
/**
* vmap_pages_range - map pages to a kernel virtual address
* @addr: start of the VM area to map
@@ -696,11 +708,7 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
int vmap_pages_range(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
- int err;
-
- err = vmap_pages_range_noflush(addr, end, prot, pages, page_shift);
- flush_cache_vmap(addr, end);
- return err;
+ return __vmap_pages_range(addr, end, prot, pages, page_shift, GFP_KERNEL);
}
static int check_sparse_vm_area(struct vm_struct *area, unsigned long start,
@@ -3804,8 +3812,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
flags = memalloc_noio_save();
do {
- ret = vmap_pages_range(addr, addr + size, prot, area->pages,
- page_shift);
+ ret = __vmap_pages_range(addr, addr + size, prot, area->pages,
+ page_shift, gfp_mask);
if (nofail && (ret < 0))
schedule_timeout_uninterruptible(1);
} while (nofail && (ret < 0));
<snip>
Thanks!
--
Uladzislau Rezki
^ permalink raw reply related [flat|nested] 35+ messages in thread
end of thread (newest: 2025-08-23 9:35 UTC)
Thread overview: 35+ messages
2025-08-07 7:58 [PATCH 0/8] __vmalloc() and no-block support Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 1/8] lib/test_vmalloc: add no_block_alloc_test case Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 2/8] lib/test_vmalloc: Remove xfail condition check Uladzislau Rezki (Sony)
2025-08-07 7:58 ` [PATCH 3/8] mm/vmalloc: Support non-blocking GFP flags in alloc_vmap_area() Uladzislau Rezki (Sony)
2025-08-07 11:20 ` Michal Hocko
2025-08-08 9:59 ` Uladzislau Rezki
2025-08-18 2:11 ` Baoquan He
2025-08-07 7:58 ` [PATCH 4/8] mm/vmalloc: Remove cond_resched() in vm_area_alloc_pages() Uladzislau Rezki (Sony)
2025-08-07 11:22 ` Michal Hocko
2025-08-08 10:08 ` Uladzislau Rezki
2025-08-18 2:14 ` Baoquan He
2025-08-07 7:58 ` [PATCH 5/8] mm/kasan, mm/vmalloc: Respect GFP flags in kasan_populate_vmalloc() Uladzislau Rezki (Sony)
2025-08-07 16:05 ` Andrey Ryabinin
2025-08-08 10:18 ` Uladzislau Rezki
2025-08-07 7:58 ` [PATCH 6/8] mm/vmalloc: Defer freeing partly initialized vm_struct Uladzislau Rezki (Sony)
2025-08-07 11:25 ` Michal Hocko
2025-08-08 10:37 ` Uladzislau Rezki
2025-08-18 4:21 ` Baoquan He
2025-08-18 13:02 ` Uladzislau Rezki
2025-08-19 8:56 ` Baoquan He
2025-08-19 9:20 ` Uladzislau Rezki
2025-08-07 7:58 ` [PATCH 7/8] mm/vmalloc: Support non-blocking GFP flags in __vmalloc_area_node() Uladzislau Rezki (Sony)
2025-08-07 11:54 ` Michal Hocko
2025-08-08 11:54 ` Uladzislau Rezki
2025-08-18 4:35 ` Baoquan He
2025-08-18 13:08 ` Uladzislau Rezki
2025-08-19 8:46 ` Baoquan He
2025-08-07 7:58 ` [PATCH 8/8] mm: Drop __GFP_DIRECT_RECLAIM flag if PF_MEMALLOC is set Uladzislau Rezki (Sony)
2025-08-07 11:58 ` Michal Hocko
2025-08-08 13:12 ` Uladzislau Rezki
2025-08-08 14:16 ` Michal Hocko
2025-08-08 16:56 ` Uladzislau Rezki
2025-08-07 11:01 ` [PATCH 0/8] __vmalloc() and no-block support Marco Elver
2025-08-08 8:48 ` Uladzislau Rezki
2025-08-23 9:35 ` Uladzislau Rezki