[PATCH v3 0/4] mm: clarify nofail memory allocation


* [PATCH v3 0/4] mm: clarify nofail memory allocation
@ 2024-08-17  6:24 Barry Song
  2024-08-17  6:24 ` [PATCH v3 1/4] vduse: avoid using __GFP_NOFAIL Barry Song
                   ` (4 more replies)
  0 siblings, 5 replies; 101+ messages in thread
From: Barry Song @ 2024-08-17  6:24 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization

From: Barry Song <v-songbaohua@oppo.com>

-v3:
 * collect reviewed-by, acked-by etc. Michal, Christoph, Vlastimil, Davidlohr,
   thanks!
 * use Jason's patch[1] to fix vdpa and refine his changelog.
 * refine changelogs
[1] https://lore.kernel.org/all/20240808054320.10017-1-jasowang@redhat.com/

-v2:
 https://lore.kernel.org/linux-mm/20240731000155.109583-1-21cnbao@gmail.com/

 * adjust the vdpa fix according to Jason's and Michal's feedback; I would
   expect Jason to test it, thanks!
 * split BUG_ON of unavoidable failure and the case GFP_ATOMIC |
   __GFP_NOFAIL into two patches according to Vlastimil and Michal.
 * collect Michal's acked-by for patch 2 - the doc;
 * remove the new GFP_NOFAIL from this series, that one would be a
   separate enhancement patchset later on.

-v1:
 https://lore.kernel.org/linux-mm/20240724085544.299090-1-21cnbao@gmail.com/

__GFP_NOFAIL carries the semantics of never failing, so its callers
do not check the return value:
  %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  cannot handle allocation failures. The allocation could block
  indefinitely but will never return with failure. Testing for
  failure is pointless.

However, __GFP_NOFAIL can sometimes fail if it exceeds size limits
or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context.
This can expose security vulnerabilities due to potential NULL
dereferences.

Since __GFP_NOFAIL does not support non-blocking allocation, we introduce
GFP_NOFAIL with inclusive blocking semantics and encourage using GFP_NOFAIL
as a replacement for __GFP_NOFAIL in non-mm.

If we must still fail a nofail allocation, we should trigger a BUG rather
than exposing NULL dereferences to callers who do not check the return
value.

* The discussion started from this topic:
 [PATCH RFC] mm: warn potential return NULL for kmalloc_array and
             kvmalloc_array with __GFP_NOFAIL

 https://lore.kernel.org/linux-mm/20240717230025.77361-1-21cnbao@gmail.com/

Thank you to Michal, Christoph, Vlastimil, and Hailong for all the
comments.

Barry Song (3):
  mm: document __GFP_NOFAIL must be blockable
  mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  mm: prohibit NULL deference exposed for unsupported non-blockable
    __GFP_NOFAIL

Jason Wang (1):
  vduse: avoid using __GFP_NOFAIL

 drivers/vdpa/vdpa_user/iova_domain.c | 19 +++++++++++--------
 drivers/vdpa/vdpa_user/iova_domain.h |  1 +
 include/linux/gfp_types.h            |  5 ++++-
 include/linux/slab.h                 |  4 +++-
 mm/page_alloc.c                      | 14 ++++++++------
 mm/util.c                            |  1 +
 6 files changed, 28 insertions(+), 16 deletions(-)

-- 
2.39.3 (Apple Git-146)


* [PATCH v3 1/4] vduse: avoid using __GFP_NOFAIL
  2024-08-17  6:24 [PATCH v3 0/4] mm: clarify nofail memory allocation Barry Song
@ 2024-08-17  6:24 ` Barry Song
  2024-08-17  6:24 ` [PATCH v3 2/4] mm: document __GFP_NOFAIL must be blockable Barry Song
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 101+ messages in thread
From: Barry Song @ 2024-08-17  6:24 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Jason Wang, Xie Yongji

From: Jason Wang <jasowang@redhat.com>

mm doesn't support non-blockable __GFP_NOFAIL allocations, because
persisting in providing __GFP_NOFAIL service to non-blocking users
who cannot perform direct memory reclaim may only result in an
endless busy loop.

Therefore, in such cases, the current mm-core may directly return
a NULL pointer:

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac)
{
        ...
        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
                        goto fail;
                ...
        }
        ...
fail:
        warn_alloc(gfp_mask, ac->nodemask,
                        "page allocation failure: order:%u", order);
got_pg:
        return page;
}

Unfortunately, vdpa does that nofail allocation under a non-sleepable
lock. A possible way to fix that is to move the page allocation out
of the lock and into the caller, but having to allocate a huge number
of pages and an auxiliary page array seems to be problematic as well,
per Tetsuo: "You should implement proper error handling instead of
using __GFP_NOFAIL if count can become large."

So I chose another way, which is to not release the kernel bounce pages
when the user tries to register userspace bounce pages. Then we can
avoid allocating in paths where failure is not expected (e.g. in
the release path). We pay for this with more memory usage, as we don't
release the kernel bounce pages, but further optimizations could be
done on top.
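
The idea above can be sketched in plain userspace C. This is only a
simplified model, not the driver code: the demo_* names, the tiny
DEMO_PAGE_SIZE and the in_use flag (standing in for
orig_phys != INVALID_PHYS_ADDR) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny stand-in for PAGE_SIZE */

struct demo_page {
	unsigned char data[DEMO_PAGE_SIZE];
};

struct demo_bounce_map {
	struct demo_page *bounce_page;		/* kernel page, never released here */
	struct demo_page *user_bounce_page;	/* set only while user pages are registered */
	bool in_use;				/* models orig_phys != INVALID_PHYS_ADDR */
};

/* Register a user page: copy live data over, but keep the kernel page. */
void demo_add_user_page(struct demo_bounce_map *map, struct demo_page *user)
{
	if (map->in_use && map->bounce_page)
		memcpy(user->data, map->bounce_page->data, DEMO_PAGE_SIZE);
	map->user_bounce_page = user;
}

/* Unregister: copy back into the retained kernel page, no allocation needed. */
void demo_remove_user_page(struct demo_bounce_map *map)
{
	if (!map->user_bounce_page)
		return;
	if (map->in_use && map->bounce_page)
		memcpy(map->bounce_page->data, map->user_bounce_page->data,
		       DEMO_PAGE_SIZE);
	map->user_bounce_page = NULL;
}

/* Select the page the bounce path should touch. */
struct demo_page *demo_active_page(const struct demo_bounce_map *map,
				   bool user_pages_enabled)
{
	return user_pages_enabled ? map->user_bounce_page : map->bounce_page;
}
```

Because the kernel page is still attached to the map, the removal path
only copies data and drops a pointer, so it has nothing that can fail.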

Fixes: 6c77ed22880d ("vduse: Support using userspace pages as bounce buffer")
Reviewed-by: Xie Yongji <xieyongji@bytedance.com>
Tested-by: Xie Yongji <xieyongji@bytedance.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
[v-songbaohua@oppo.com: Refine the changelog]
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 drivers/vdpa/vdpa_user/iova_domain.c | 19 +++++++++++--------
 drivers/vdpa/vdpa_user/iova_domain.h |  1 +
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
index 791d38d6284c..58116f89d8da 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.c
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -162,6 +162,7 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
 				enum dma_data_direction dir)
 {
 	struct vduse_bounce_map *map;
+	struct page *page;
 	unsigned int offset;
 	void *addr;
 	size_t sz;
@@ -178,7 +179,10 @@ static void vduse_domain_bounce(struct vduse_iova_domain *domain,
 			    map->orig_phys == INVALID_PHYS_ADDR))
 			return;
 
-		addr = kmap_local_page(map->bounce_page);
+		page = domain->user_bounce_pages ?
+		       map->user_bounce_page : map->bounce_page;
+
+		addr = kmap_local_page(page);
 		do_bounce(map->orig_phys + offset, addr + offset, sz, dir);
 		kunmap_local(addr);
 		size -= sz;
@@ -270,9 +274,8 @@ int vduse_domain_add_user_bounce_pages(struct vduse_iova_domain *domain,
 				memcpy_to_page(pages[i], 0,
 					       page_address(map->bounce_page),
 					       PAGE_SIZE);
-			__free_page(map->bounce_page);
 		}
-		map->bounce_page = pages[i];
+		map->user_bounce_page = pages[i];
 		get_page(pages[i]);
 	}
 	domain->user_bounce_pages = true;
@@ -297,17 +300,17 @@ void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
 		struct page *page = NULL;
 
 		map = &domain->bounce_maps[i];
-		if (WARN_ON(!map->bounce_page))
+		if (WARN_ON(!map->user_bounce_page))
 			continue;
 
 		/* Copy user page to kernel page if it's in use */
 		if (map->orig_phys != INVALID_PHYS_ADDR) {
-			page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
+			page = map->bounce_page;
 			memcpy_from_page(page_address(page),
-					 map->bounce_page, 0, PAGE_SIZE);
+					 map->user_bounce_page, 0, PAGE_SIZE);
 		}
-		put_page(map->bounce_page);
-		map->bounce_page = page;
+		put_page(map->user_bounce_page);
+		map->user_bounce_page = NULL;
 	}
 	domain->user_bounce_pages = false;
 out:
diff --git a/drivers/vdpa/vdpa_user/iova_domain.h b/drivers/vdpa/vdpa_user/iova_domain.h
index f92f22a7267d..7f3f0928ec78 100644
--- a/drivers/vdpa/vdpa_user/iova_domain.h
+++ b/drivers/vdpa/vdpa_user/iova_domain.h
@@ -21,6 +21,7 @@
 
 struct vduse_bounce_map {
 	struct page *bounce_page;
+	struct page *user_bounce_page;
 	u64 orig_phys;
 };
 
-- 
2.39.3 (Apple Git-146)




* [PATCH v3 2/4] mm: document __GFP_NOFAIL must be blockable
  2024-08-17  6:24 [PATCH v3 0/4] mm: clarify nofail memory allocation Barry Song
  2024-08-17  6:24 ` [PATCH v3 1/4] vduse: avoid using __GFP_NOFAIL Barry Song
@ 2024-08-17  6:24 ` Barry Song
  2024-08-17  6:24 ` [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 101+ messages in thread
From: Barry Song @ 2024-08-17  6:24 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Christoph Hellwig, Davidlohr Bueso,
	Eugenio Pérez, Jason Wang, Kees Cook, Lorenzo Stoakes,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

From: Barry Song <v-songbaohua@oppo.com>

Non-blocking allocation with __GFP_NOFAIL is not supported and may still
return a NULL pointer (if we did not return NULL, we would busy-loop
within non-sleepable contexts):

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
						struct alloc_context *ac)
{
	...
	/*
	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
	 * we always retry
	 */
	if (gfp_mask & __GFP_NOFAIL) {
		/*
		 * All existing users of the __GFP_NOFAIL are blockable, so warn
		 * of any new users that actually require GFP_NOWAIT
		 */
		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
			goto fail;
		...
	}
	...
fail:
	warn_alloc(gfp_mask, ac->nodemask,
			"page allocation failure: order:%u", order);
got_pg:
	return page;
}

Highlight this in the documentation of __GFP_NOFAIL so that non-mm
subsystems can reject any illegal usage of __GFP_NOFAIL with GFP_ATOMIC,
GFP_NOWAIT, etc.
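
As a rough illustration of the rule being documented, a subsystem-side
sanity check could look like the userspace sketch below. The DEMO_*
bit values are invented for the example (the real ones come from
gfp_types.h) and demo_nofail_mask_ok() is a hypothetical helper, not
an existing kernel function.

```c
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's actual encoding. */
#define DEMO_GFP_HIGH			0x01u
#define DEMO_GFP_IO			0x02u
#define DEMO_GFP_FS			0x04u
#define DEMO_GFP_DIRECT_RECLAIM		0x08u
#define DEMO_GFP_KSWAPD_RECLAIM		0x10u
#define DEMO_GFP_NOFAIL			0x20u

/* Simplified compositions mirroring the shape of the real masks. */
#define DEMO_GFP_ATOMIC	(DEMO_GFP_HIGH | DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_NOWAIT	(DEMO_GFP_KSWAPD_RECLAIM)
#define DEMO_GFP_KERNEL	(DEMO_GFP_DIRECT_RECLAIM | DEMO_GFP_KSWAPD_RECLAIM | \
			 DEMO_GFP_IO | DEMO_GFP_FS)

/* Hypothetical helper: true when the mask obeys "nofail must be blockable". */
bool demo_nofail_mask_ok(unsigned int gfp)
{
	if (!(gfp & DEMO_GFP_NOFAIL))
		return true;	/* the rule only constrains nofail requests */
	return (gfp & DEMO_GFP_DIRECT_RECLAIM) != 0;
}
```

With these definitions, DEMO_GFP_KERNEL | DEMO_GFP_NOFAIL passes the
check, while DEMO_GFP_ATOMIC | DEMO_GFP_NOFAIL and
DEMO_GFP_NOWAIT | DEMO_GFP_NOFAIL are rejected.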

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Hailong.Liu <hailong.liu@oppo.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/linux/gfp_types.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 313be4ad79fd..4a1fa7706b0c 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -215,7 +215,8 @@ enum {
  * the caller still has to check for failures) while costly requests try to be
  * not disruptive and back off even without invoking the OOM killer.
  * The following three modifiers might be used to override some of these
- * implicit rules.
+ * implicit rules. Please note that all of them must be used along with
+ * %__GFP_DIRECT_RECLAIM flag.
  *
  * %__GFP_NORETRY: The VM implementation will try only very lightweight
  * memory direct reclaim to get some memory under memory pressure (thus
@@ -246,6 +247,8 @@ enum {
  * cannot handle allocation failures. The allocation could block
  * indefinitely but will never return with failure. Testing for
  * failure is pointless.
+ * It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM.
+ * It should _never_ be used in non-sleepable contexts.
  * New users should be evaluated carefully (and the flag should be
  * used only when there is no reasonable failure policy) but it is
  * definitely preferable to use the flag rather than opencode endless
-- 
2.39.3 (Apple Git-146)




* [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-17  6:24 [PATCH v3 0/4] mm: clarify nofail memory allocation Barry Song
  2024-08-17  6:24 ` [PATCH v3 1/4] vduse: avoid using __GFP_NOFAIL Barry Song
  2024-08-17  6:24 ` [PATCH v3 2/4] mm: document __GFP_NOFAIL must be blockable Barry Song
@ 2024-08-17  6:24 ` Barry Song
  2024-08-19  9:43   ` David Hildenbrand
  2024-08-17  6:24 ` [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
  2024-08-19 13:02 ` [PATCH v3 0/4] mm: clarify nofail memory allocation David Hildenbrand
  4 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-17  6:24 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Christoph Hellwig, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

From: Barry Song <v-songbaohua@oppo.com>

There are cases where we still fail even though the caller passed
__GFP_NOFAIL.  Since such callers don't check the return value, we are
exposed to the security risk of a NULL dereference.

Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
situation.

Christoph Hellwig:
The whole freaking point of __GFP_NOFAIL is that callers don't handle
allocation failures.  So in fact a straight BUG is the right thing
here.

Vlastimil Babka:
It's just not a recoverable situation (WARN_ON is for recoverable
situations). The caller cannot handle allocation failure and at the same
time asked for an impossible allocation. BUG_ON() is a guaranteed oops
with stacktrace etc. We don't need to hope for the later NULL pointer
dereference (which might if really unlucky happen from a different
context where it's no longer obvious what led to the allocation failing).

Michal Hocko:
Linus tends to be against adding new BUG() calls unless the failure is
absolutely unrecoverable (e.g. corrupted data structures etc.). I am
not sure how he would look at simply incorrect memory allocator usage to
blow up the kernel. Now the argument could be made that those failures
could cause subtle memory corruptions or even be exploitable which might
be a sufficient reason to stop them early.
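
For readers who want to see the failure modes concretely, here is a
small userspace model of the two size checks (multiplication overflow
and the INT_MAX limit). It is a sketch under assumptions: the demo_*
names are hypothetical, the enum merely reports where the patched
kernel would hit BUG_ON() or return NULL, and the compiler builtin
stands in for check_mul_overflow(). The MAX_PAGE_ORDER check is not
modelled.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_outcome {
	DEMO_ALLOC_OK,		/* size is sane, allocation would proceed */
	DEMO_ALLOC_NULL,	/* unsupported size, caller gets NULL */
	DEMO_ALLOC_BUG,		/* unsupported size with nofail: BUG_ON() fires */
};

/* Hypothetical model of the size validation for an array allocation. */
enum demo_outcome demo_kvmalloc_array_check(size_t n, size_t size, bool nofail)
{
	size_t bytes;

	/* GCC/Clang builtin, used here in place of check_mul_overflow() */
	if (__builtin_mul_overflow(n, size, &bytes))
		return nofail ? DEMO_ALLOC_BUG : DEMO_ALLOC_NULL;

	/* "Don't even allow crazy sizes" */
	if (bytes > (size_t)INT_MAX)
		return nofail ? DEMO_ALLOC_BUG : DEMO_ALLOC_NULL;

	return DEMO_ALLOC_OK;
}
```

A request such as demo_kvmalloc_array_check(SIZE_MAX, 2, true) overflows
and, being nofail, maps to the BUG outcome rather than a NULL that
nobody checks.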

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <kees@kernel.org>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Hailong.Liu <hailong.liu@oppo.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/linux/slab.h | 4 +++-
 mm/page_alloc.c      | 4 +++-
 mm/util.c            | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index c9cb42203183..4a4d1fdc2afe 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
 {
 	size_t bytes;
 
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
+	if (unlikely(check_mul_overflow(n, size, &bytes))) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	return kvmalloc_node_noprof(bytes, flags, node);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 60742d057b05..d2c37f8f8d09 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
 	 * There are several places where we assume that the order value is sane
 	 * so bail out early if the request is out of bound.
 	 */
-	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
+	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
+		BUG_ON(gfp & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	gfp &= gfp_allowed_mask;
 	/*
diff --git a/mm/util.c b/mm/util.c
index ac01925a4179..678c647b778f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 
 	/* Don't even allow crazy sizes */
 	if (unlikely(size > INT_MAX)) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
 		return NULL;
 	}
-- 
2.39.3 (Apple Git-146)




* [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-17  6:24 [PATCH v3 0/4] mm: clarify nofail memory allocation Barry Song
                   ` (2 preceding siblings ...)
  2024-08-17  6:24 ` [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
@ 2024-08-17  6:24 ` Barry Song
  2024-08-18  2:55   ` Yafang Shao
  2024-08-19  9:44   ` David Hildenbrand
  2024-08-19 13:02 ` [PATCH v3 0/4] mm: clarify nofail memory allocation David Hildenbrand
  4 siblings, 2 replies; 101+ messages in thread
From: Barry Song @ 2024-08-17  6:24 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

From: Barry Song <v-songbaohua@oppo.com>

When users allocate memory with the __GFP_NOFAIL flag, they might
incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
attempt and still fail to allocate memory for these users, we have two
choices:

    1. We could busy-loop and hope that some other direct reclamation or
    kswapd rescues the current process. However, this is unreliable
    and could ultimately lead to hard or soft lockups, which might not
    be well supported by some architectures.

    2. We could use BUG_ON to trigger a reliable system crash, avoiding
    exposing NULL dereference.

Neither option is ideal, but both are improvements over the existing code.
This patch selects the second option because, with the introduction of a
scoped API and GFP_NOFAIL (which is in my plan), capable of enforcing
direct reclamation for nofail users, non-blockable nofail allocations will
no longer be possible.
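
The behaviour change can be summarised with a small decision model.
This is a userspace sketch only; the enum and
demo_slowpath_exhausted() are hypothetical names, and the "patched"
parameter simply selects between the behaviour before and after this
patch once the slowpath has run out of cheaper options.

```c
#include <stdbool.h>

enum demo_slowpath_result {
	DEMO_RETRY,		/* keep looping through reclaim and retry */
	DEMO_RETURN_NULL,	/* warn and hand NULL back to the caller */
	DEMO_BUG,		/* crash reliably via BUG_ON() */
};

/* Hypothetical model of the end of the allocator slowpath. */
enum demo_slowpath_result
demo_slowpath_exhausted(bool nofail, bool can_direct_reclaim, bool patched)
{
	if (!nofail)
		return DEMO_RETURN_NULL;	/* ordinary requests may fail */

	if (!can_direct_reclaim)	/* e.g. GFP_ATOMIC | __GFP_NOFAIL */
		return patched ? DEMO_BUG : DEMO_RETURN_NULL;

	return DEMO_RETRY;		/* blockable nofail never returns NULL */
}
```

Only the non-blockable nofail case changes: it used to return NULL
after a one-time warning and now stops the system at the point of
misuse.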

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <kees@kernel.org>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Hailong.Liu <hailong.liu@oppo.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 mm/page_alloc.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2c37f8f8d09..fb5850ecd3ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (gfp_mask & __GFP_NOFAIL) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * All existing users of the __GFP_NOFAIL are blockable
+		 * otherwise we introduce a busy loop with inside the page
+		 * allocator from non-sleepable contexts
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
+		BUG_ON(!can_direct_reclaim);
 
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre
@@ -4434,7 +4434,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		cond_resched();
 		goto retry;
 	}
-fail:
+
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg:
-- 
2.39.3 (Apple Git-146)




* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-17  6:24 ` [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
@ 2024-08-18  2:55   ` Yafang Shao
  2024-08-18  3:48     ` Barry Song
  2024-08-19  7:50     ` Michal Hocko
  2024-08-19  9:44   ` David Hildenbrand
  1 sibling, 2 replies; 101+ messages in thread
From: Yafang Shao @ 2024-08-18  2:55 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> When users allocate memory with the __GFP_NOFAIL flag, they might
> incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> attempt and still fail to allocate memory for these users, we have two
> choices:
>
>     1. We could busy-loop and hope that some other direct reclamation or
>     kswapd rescues the current process. However, this is unreliable
>     and could ultimately lead to hard or soft lockups,

That can occur even if we set both __GFP_NOFAIL and
__GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
design of __GFP_NOFAIL itself.

> which might not
>     be well supported by some architectures.
>
>     2. We could use BUG_ON to trigger a reliable system crash, avoiding
>     exposing NULL dereference.
>
> Neither option is ideal, but both are improvements over the existing code.
> This patch selects the second option because, with the introduction of
> scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> nofail users(which is in my plan), non-blockable nofail allocations will
> no longer be possible.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Hailong.Liu <hailong.liu@oppo.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  mm/page_alloc.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2c37f8f8d09..fb5850ecd3ae 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>          */
>         if (gfp_mask & __GFP_NOFAIL) {
>                 /*
> -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> -                * of any new users that actually require GFP_NOWAIT
> +                * All existing users of the __GFP_NOFAIL are blockable
> +                * otherwise we introduce a busy loop with inside the page
> +                * allocator from non-sleepable contexts
>                  */
> -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -                       goto fail;
> +               BUG_ON(!can_direct_reclaim);

I'm not in favor of using BUG_ON() here, as many call sites already
handle the return value of __GFP_NOFAIL.

If we believe BUG_ON() is necessary, why not place it at the beginning
of __alloc_pages_slowpath() instead of after numerous operations,
which could potentially obscure the issue?


>
>                 /*
>                  * PF_MEMALLOC request from this context is rather bizarre
> @@ -4434,7 +4434,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                 cond_resched();
>                 goto retry;
>         }
> -fail:
> +
>         warn_alloc(gfp_mask, ac->nodemask,
>                         "page allocation failure: order:%u", order);
>  got_pg:
> --
> 2.39.3 (Apple Git-146)
>
>


--
Regards
Yafang



* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  2:55   ` Yafang Shao
@ 2024-08-18  3:48     ` Barry Song
  2024-08-18  5:51       ` Yafang Shao
  2024-08-19  7:50     ` Michal Hocko
  1 sibling, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-18  3:48 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > When users allocate memory with the __GFP_NOFAIL flag, they might
> > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > attempt and still fail to allocate memory for these users, we have two
> > choices:
> >
> >     1. We could busy-loop and hope that some other direct reclamation or
> >     kswapd rescues the current process. However, this is unreliable
> >     and could ultimately lead to hard or soft lockups,
>
> That can occur even if we set both __GFP_NOFAIL and
> __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> design of __GFP_NOFAIL itself.

The point of GFP_NOFAIL is that it won't fail and its user won't check
the return value. Without direct reclaim, that is sometimes impossible,
but with direct reclaim, users keep waiting and eventually they
get memory. If you read the documentation of __GFP_NOFAIL you will find
this; it is clearly documented.

>
> > which might not
> >     be well supported by some architectures.
> >
> >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> >     exposing NULL dereference.
> >
> > Neither option is ideal, but both are improvements over the existing code.
> > This patch selects the second option because, with the introduction of
> > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > nofail users(which is in my plan), non-blockable nofail allocations will
> > no longer be possible.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >  mm/page_alloc.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d2c37f8f8d09..fb5850ecd3ae 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >          */
> >         if (gfp_mask & __GFP_NOFAIL) {
> >                 /*
> > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -                * of any new users that actually require GFP_NOWAIT
> > +                * All existing users of the __GFP_NOFAIL are blockable
> > +                * otherwise we introduce a busy loop with inside the page
> > +                * allocator from non-sleepable contexts
> >                  */
> > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -                       goto fail;
> > +               BUG_ON(!can_direct_reclaim);
>
> I'm not in favor of using BUG_ON() here, as many call sites already
> handle the return value of __GFP_NOFAIL.
>

It is not correct to handle the return value of __GFP_NOFAIL.
If you check the return value, don't use __GFP_NOFAIL.

> If we believe BUG_ON() is necessary, why not place it at the beginning
> of __alloc_pages_slowpath() instead of after numerous operations,
> which could potentially obscure the issue?

To some extent I agree with you, but the point here is that we
want to avoid this check in the hot path. So basically, we check only
when we have to; in 99%+ of cases, this check can be avoided.

>
>
> >
> >                 /*
> >                  * PF_MEMALLOC request from this context is rather bizarre
> > @@ -4434,7 +4434,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >                 cond_resched();
> >                 goto retry;
> >         }
> > -fail:
> > +
> >         warn_alloc(gfp_mask, ac->nodemask,
> >                         "page allocation failure: order:%u", order);
> >  got_pg:
> > --
> > 2.39.3 (Apple Git-146)
> >
> >
>
>
> --
> Regards
> Yafang



* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  3:48     ` Barry Song
@ 2024-08-18  5:51       ` Yafang Shao
  2024-08-18  6:27         ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Yafang Shao @ 2024-08-18  5:51 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > attempt and still fail to allocate memory for these users, we have two
> > > choices:
> > >
> > >     1. We could busy-loop and hope that some other direct reclamation or
> > >     kswapd rescues the current process. However, this is unreliable
> > >     and could ultimately lead to hard or soft lockups,
> >
> > That can occur even if we set both __GFP_NOFAIL and
> > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > design of __GFP_NOFAIL itself.
>
> the point of GFP_NOFAIL is that it won't fail and its user won't check
> the return value. without direct_reclamation, it is sometimes impossible.
> but with direct reclamation, users constantly wait and finally they can

So, what exactly is the difference between 'constantly waiting' and
'busy looping'? Could you please clarify? Also, why can't we
'constantly wait' when __GFP_DIRECT_RECLAIM is not set?

> get memory. if you read the doc of __GFP_NOFAIL you will find it.
> it is absolutely clearly documented.
>
> >
> > > which might not
> > >     be well supported by some architectures.
> > >
> > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > >     exposing NULL dereference.
> > >
> > > Neither option is ideal, but both are improvements over the existing code.
> > > This patch selects the second option because, with the introduction of
> > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > no longer be possible.
> > >
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Cc: Christoph Hellwig <hch@infradead.org>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Pekka Enberg <penberg@kernel.org>
> > > Cc: David Rientjes <rientjes@google.com>
> > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: Kees Cook <kees@kernel.org>
> > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > Cc: Jason Wang <jasowang@redhat.com>
> > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > ---
> > >  mm/page_alloc.c | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >          */
> > >         if (gfp_mask & __GFP_NOFAIL) {
> > >                 /*
> > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > -                * of any new users that actually require GFP_NOWAIT
> > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > +                * otherwise we introduce a busy loop with inside the page
> > > +                * allocator from non-sleepable contexts
> > >                  */
> > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > -                       goto fail;
> > > +               BUG_ON(!can_direct_reclaim);
> >
> > I'm not in favor of using BUG_ON() here, as many call sites already
> > handle the return value of __GFP_NOFAIL.
> >
>
> it is not correct to handle the return value of __GFP_NOFAIL.
> if you check the ret, don't use __GFP_NOFAIL.

If so, you have many code changes to make in the linux kernel ;)

>
> > If we believe BUG_ON() is necessary, why not place it at the beginning
> > of __alloc_pages_slowpath() instead of after numerous operations,
> > which could potentially obscure the issue?
>
> to some extent I agree with you. but the point here is that we might
> want to avoid this check in the hot path. so basically, we check when
> we have to check. in 99%+ case, this check can be avoided.

It's on the slow path, but that's not the main point here.

--
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  5:51       ` Yafang Shao
@ 2024-08-18  6:27         ` Barry Song
  2024-08-18  6:45           ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-18  6:27 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > attempt and still fail to allocate memory for these users, we have two
> > > > choices:
> > > >
> > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > >     kswapd rescues the current process. However, this is unreliable
> > > >     and could ultimately lead to hard or soft lockups,
> > >
> > > That can occur even if we set both __GFP_NOFAIL and
> > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > design of __GFP_NOFAIL itself.
> >
> > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > the return value. without direct_reclamation, it is sometimes impossible.
> > but with direct reclamation, users constantly wait and finally they can
>
> So, what exactly is the difference between 'constantly waiting' and
> 'busy looping'? Could you please clarify? Also, why can't we
> 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?

I list two options in the changelog: 1. busy-loop, 2. BUG_ON. I am actually
fine with either one; either is better than the existing code. But returning
NULL, as the current code does, is definitely wrong.

Option 1 is in effect an attempt to make __GFP_NOFAIL without direct reclaim
legal, so it looks like a step in the wrong direction.

A busy loop means you are neither reclaiming memory nor sleeping. The CPU
is constantly busy, so it might result in a lockup, either a soft lockup
or a hard lockup.

With direct reclaim, waiting means you can sleep. It does not hold the
CPU, so it is not a busy loop. In rare cases users might end up waiting
forever, but that matches the documentation of __GFP_NOFAIL: never return
until memory is obtained. (The current code is implemented this way unless
users incorrectly combine __GFP_NOFAIL with atomic/nowait etc.)
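
To make the cases concrete, here is a minimal userspace sketch of the
policy that option 2 amounts to once an allocation attempt has failed.
It is a model, not the kernel code: the SK_* flag bits, enum sk_outcome
and sk_on_failure() are hypothetical names invented for this sketch, and
the real slow path has many more steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this sketch only; not the kernel's values. */
#define SK_DIRECT_RECLAIM 0x1u
#define SK_NOFAIL         0x2u

static_assert((SK_DIRECT_RECLAIM & SK_NOFAIL) == 0,
	      "flag bits must not overlap");

enum sk_outcome {
	SK_MAY_RETURN_NULL,	/* ordinary request: failure goes back to the caller */
	SK_SLEEP_RETRY,		/* blockable nofail: reclaim, sleep, retry until success */
	SK_BUG,			/* non-blockable nofail: unsupported, crash reliably */
};

/* Models what happens after an allocation attempt has failed. */
static enum sk_outcome sk_on_failure(unsigned int gfp_mask)
{
	bool can_direct_reclaim = (gfp_mask & SK_DIRECT_RECLAIM) != 0;

	if (!(gfp_mask & SK_NOFAIL))
		return SK_MAY_RETURN_NULL;
	if (!can_direct_reclaim)
		return SK_BUG;
	return SK_SLEEP_RETRY;
}
```

In this model only the SK_NOFAIL request without SK_DIRECT_RECLAIM
reaches SK_BUG; a nofail request never maps to SK_MAY_RETURN_NULL.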

Note that long-term we won't expose __GFP_NOFAIL any more. We will only
expose GFP_NOFAIL, which enforces blockable allocations. I am quite busy
with other issues, so this won't happen soon.

>
> > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > it is absolutely clearly documented.
> >
> > >
> > > > which might not
> > > >     be well supported by some architectures.
> > > >
> > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > >     exposing NULL dereference.
> > > >
> > > > Neither option is ideal, but both are improvements over the existing code.
> > > > This patch selects the second option because, with the introduction of
> > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > no longer be possible.
> > > >
> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > Cc: Christoph Lameter <cl@linux.com>
> > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > Cc: David Rientjes <rientjes@google.com>
> > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Cc: Kees Cook <kees@kernel.org>
> > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > >  mm/page_alloc.c | 10 +++++-----
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > >          */
> > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > >                 /*
> > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > -                * of any new users that actually require GFP_NOWAIT
> > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > +                * otherwise we introduce a busy loop with inside the page
> > > > +                * allocator from non-sleepable contexts
> > > >                  */
> > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > -                       goto fail;
> > > > +               BUG_ON(!can_direct_reclaim);
> > >
> > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > handle the return value of __GFP_NOFAIL.
> > >
> >
> > it is not correct to handle the return value of __GFP_NOFAIL.
> > if you check the ret, don't use __GFP_NOFAIL.
>
> If so, you have many code changes to make in the linux kernel ;)
>

Please list the code that uses __GFP_NOFAIL and still checks whether the
result failed; we should get it fixed. This is insane. NOFAIL means
no fail.
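
As a rough illustration of the calling convention I mean, here is a
userspace sketch. sk_zalloc_nofail(), struct sk_args and sk_use_args()
are hypothetical names, and sched_yield() only stands in for "reclaim
and sleep"; this is not how the page allocator is implemented.

```c
#include <sched.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a blockable nofail allocation. */
static void *sk_zalloc_nofail(size_t size)
{
	void *p;

	if (size == 0)
		size = 1;	/* calloc(1, 0) may legally return NULL */
	while (!(p = calloc(1, size)))
		sched_yield();	/* stand-in for "reclaim and sleep", then retry */
	return p;
}

struct sk_args {
	int flags;
};

/* The caller relies on the contract: no NULL check, no -ENOMEM path. */
static int sk_use_args(void)
{
	struct sk_args *args = sk_zalloc_nofail(sizeof(*args));
	int flags = args->flags;	/* zero-initialised by calloc() */

	free(args);
	return flags;
}
```

A caller that wants an -ENOMEM path should use a plain allocation and
check the result instead.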

> >
> > > If we believe BUG_ON() is necessary, why not place it at the beginning
> > > of __alloc_pages_slowpath() instead of after numerous operations,
> > > which could potentially obscure the issue?
> >
> > to some extent I agree with you. but the point here is that we might
> > want to avoid this check in the hot path. so basically, we check when
> > we have to check. in 99%+ case, this check can be avoided.
>
> It's on the slow path, but that's not the main point here.

I actually recommended that approach: we can do an earlier check in the hot path.
Somehow, in the previous discussion, people didn't like it.

>
> --
> Regards
> Yafang

Thanks
barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  6:27         ` Barry Song
@ 2024-08-18  6:45           ` Barry Song
  2024-08-18  7:07             ` Yafang Shao
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-18  6:45 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > choices:
> > > > >
> > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > >     kswapd rescues the current process. However, this is unreliable
> > > > >     and could ultimately lead to hard or soft lockups,
> > > >
> > > > That can occur even if we set both __GFP_NOFAIL and
> > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > design of __GFP_NOFAIL itself.
> > >
> > > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > > the return value. without direct_reclamation, it is sometimes impossible.
> > > but with direct reclamation, users constantly wait and finally they can
> >
> > So, what exactly is the difference between 'constantly waiting' and
> > 'busy looping'? Could you please clarify? Also, why can't we
> > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?
>
> I list two options in changelog
> 1: busy loop 2. bug_on. I am actually fine with either one. either one is
> better than the existing code. but returning null in the current code
> is definitely wrong.
>
> 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
> legal. so it is a bit suspicious going in the wrong direction.
>
> busy-loop is that you are not reclaiming memory you are not sleeping.
> cpu is constantly working and busy, so it might result in a lockup, either
> soft lockup or hard lockup.
>
> with direct_reclamation, wait is the case you can sleep. it is not holding
> cpu, not a busy loop. in rare case, users might end in endless wait,
> but it matches the doc of __GFP_NOFAIL, never return till memory
> is gotten (the current code is implemented in this way unless users
> incorrectly combine __GFP_NOFAIL with atomic/nowait etc.)
>

And the essential difference between "w/ and w/o direct reclaim": with
direct reclaim, the caller actively reclaims memory to rescue itself
by every possible means (compaction, OOM, reclaim), while without
direct reclaim it can do nothing but loop (busy-loop).

> note, long-term we won't expose __GFP_NOFAIL any more. we
> will only expose GFP_NOFAIL which enforces Blockable.  I am
> quite busy on other issues, so this won't happen in a short time.
>
> >
> > > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > > it is absolutely clearly documented.
> > >
> > > >
> > > > > which might not
> > > > >     be well supported by some architectures.
> > > > >
> > > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > > >     exposing NULL dereference.
> > > > >
> > > > > Neither option is ideal, but both are improvements over the existing code.
> > > > > This patch selects the second option because, with the introduction of
> > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > > no longer be possible.
> > > > >
> > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > Cc: Christoph Lameter <cl@linux.com>
> > > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > > Cc: David Rientjes <rientjes@google.com>
> > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > Cc: Kees Cook <kees@kernel.org>
> > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > ---
> > > > >  mm/page_alloc.c | 10 +++++-----
> > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > >          */
> > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > >                 /*
> > > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > -                * of any new users that actually require GFP_NOWAIT
> > > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > > +                * otherwise we introduce a busy loop with inside the page
> > > > > +                * allocator from non-sleepable contexts
> > > > >                  */
> > > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > -                       goto fail;
> > > > > +               BUG_ON(!can_direct_reclaim);
> > > >
> > > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > > handle the return value of __GFP_NOFAIL.
> > > >
> > >
> > > it is not correct to handle the return value of __GFP_NOFAIL.
> > > if you check the ret, don't use __GFP_NOFAIL.
> >
> > If so, you have many code changes to make in the linux kernel ;)
> >
>
> Please list those code using __GFP_NOFAIL and check the result
> might fail, we should get them fixed. This is insane. NOFAIL means
> no fail.
>
> > >
> > > > If we believe BUG_ON() is necessary, why not place it at the beginning
> > > > of __alloc_pages_slowpath() instead of after numerous operations,
> > > > which could potentially obscure the issue?
> > >
> > > to some extent I agree with you. but the point here is that we might
> > > want to avoid this check in the hot path. so basically, we check when
> > > we have to check. in 99%+ case, this check can be avoided.
> >
> > It's on the slow path, but that's not the main point here.
>
> I actually recommended the approach, we can do an earlier check in the hotpath.
> somehow, in the previous discussion, people didn't like it.
>
> >
> > --
> > Regards
> > Yafang
>
> Thanks
> barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  6:45           ` Barry Song
@ 2024-08-18  7:07             ` Yafang Shao
  2024-08-18  7:25               ` Barry Song
  2024-08-19  7:51               ` Michal Hocko
  0 siblings, 2 replies; 101+ messages in thread
From: Yafang Shao @ 2024-08-18  7:07 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun, Aug 18, 2024 at 2:45 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > choices:
> > > > > >
> > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > >     and could ultimately lead to hard or soft lockups,
> > > > >
> > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > design of __GFP_NOFAIL itself.
> > > >
> > > > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > > > the return value. without direct_reclamation, it is sometimes impossible.
> > > > but with direct reclamation, users constantly wait and finally they can
> > >
> > > So, what exactly is the difference between 'constantly waiting' and
> > > 'busy looping'? Could you please clarify? Also, why can't we
> > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?
> >
> > I list two options in changelog
> > 1: busy loop 2. bug_on. I am actually fine with either one. either one is
> > better than the existing code. but returning null in the current code
> > is definitely wrong.
> >
> > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
> > legal. so it is a bit suspicious going in the wrong direction.
> >
> > busy-loop is that you are not reclaiming memory you are not sleeping.
> > cpu is constantly working and busy, so it might result in a lockup, either
> > soft lockup or hard lockup.

Thanks for the clarification.
That can be avoided by a simple cond_resched() if the hard lockup or
softlockup is the main issue ;)

> >
> > with direct_reclamation, wait is the case you can sleep. it is not holding
> > cpu, not a busy loop. in rare case, users might end in endless wait,
> > but it matches the doc of __GFP_NOFAIL, never return till memory
> > is gotten (the current code is implemented in this way unless users
> > incorrectly combine __GFP_NOFAIL with atomic/nowait etc.)
> >
>
> and the essential difference between "w/ and w/o direct_reclaim": with
> direct reclaim, the user is actively reclaiming memory to rescue itself
> by all kinds of possible ways(compact, oom, reclamation), while without
> direct reclamation, it can do nothing and just loop (busy-loop).

It can wake up kswapd, which can then reclaim memory. If kswapd can't
keep up, the system is likely under heavy memory pressure. In such a
case, it makes little difference whether __GFP_DIRECT_RECLAIM is set
or not. For reference, see the old issue:
https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/.

I believe the core issue persists, and the design of __GFP_NOFAIL
exacerbates it.

By the way, I believe we could trigger an asynchronous OOM kill in the
case without direct reclaim to avoid busy looping.

>
> > note, long-term we won't expose __GFP_NOFAIL any more. we
> > will only expose GFP_NOFAIL which enforces Blockable.  I am
> > quite busy on other issues, so this won't happen in a short time.
> >
> > >
> > > > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > > > it is absolutely clearly documented.
> > > >
> > > > >
> > > > > > which might not
> > > > > >     be well supported by some architectures.
> > > > > >
> > > > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > > > >     exposing NULL dereference.
> > > > > >
> > > > > > Neither option is ideal, but both are improvements over the existing code.
> > > > > > This patch selects the second option because, with the introduction of
> > > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > > > no longer be possible.
> > > > > >
> > > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > Cc: Christoph Lameter <cl@linux.com>
> > > > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > > > Cc: David Rientjes <rientjes@google.com>
> > > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > > Cc: Kees Cook <kees@kernel.org>
> > > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > ---
> > > > > >  mm/page_alloc.c | 10 +++++-----
> > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > > > --- a/mm/page_alloc.c
> > > > > > +++ b/mm/page_alloc.c
> > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > > >          */
> > > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > > >                 /*
> > > > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > > -                * of any new users that actually require GFP_NOWAIT
> > > > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > > > +                * otherwise we introduce a busy loop with inside the page
> > > > > > +                * allocator from non-sleepable contexts
> > > > > >                  */
> > > > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > > -                       goto fail;
> > > > > > +               BUG_ON(!can_direct_reclaim);
> > > > >
> > > > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > > > handle the return value of __GFP_NOFAIL.
> > > > >
> > > >
> > > > it is not correct to handle the return value of __GFP_NOFAIL.
> > > > if you check the ret, don't use __GFP_NOFAIL.
> > >
> > > If so, you have many code changes to make in the linux kernel ;)
> > >
> >
> > Please list those code using __GFP_NOFAIL and check the result
> > might fail, we should get them fixed. This is insane. NOFAIL means
> > no fail.

You can find some instances with grep commands, but there's no
reliable way to capture them all with a single command. Here are a few
examples:

                // drivers/infiniband/hw/cxgb4/mem.c
                skb = alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL);
                if (!skb)
                        return -ENOMEM;

        // fs/xfs/libxfs/xfs_dir2.c
        args = kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL);
        if (!args)
                return -ENOMEM;


--
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  7:07             ` Yafang Shao
@ 2024-08-18  7:25               ` Barry Song
  2024-08-19  7:51               ` Michal Hocko
  1 sibling, 0 replies; 101+ messages in thread
From: Barry Song @ 2024-08-18  7:25 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun, Aug 18, 2024 at 7:07 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 2:45 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > >
> > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > choices:
> > > > > > >
> > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > >
> > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > > design of __GFP_NOFAIL itself.
> > > > >
> > > > > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > > > > the return value. without direct_reclamation, it is sometimes impossible.
> > > > > but with direct reclamation, users constantly wait and finally they can
> > > >
> > > > So, what exactly is the difference between 'constantly waiting' and
> > > > 'busy looping'? Could you please clarify? Also, why can't we
> > > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?
> > >
> > > I list two options in changelog
> > > 1: busy loop 2. bug_on. I am actually fine with either one. either one is
> > > better than the existing code. but returning null in the current code
> > > is definitely wrong.
> > >
> > > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
> > > legal. so it is a bit suspicious going in the wrong direction.
> > >
> > > busy-loop means you are neither reclaiming memory nor sleeping. The
> > > cpu is constantly working and busy, so it might result in a lockup, either
> > > soft lockup or hard lockup.
>
> Thanks for the clarification.
> That can be avoided by a simple cond_resched() if the hard lockup or
> softlockup is the main issue ;)

Is your point to legitimize the combination of __GFP_NOFAIL and non-blockable
allocation? I don't think you can simply fix the lockup. A direct example is a
single-core, preemptible kernel: if you call __GFP_NOFAIL without direct
reclamation while holding a spinlock, cond_resched() will do nothing.

It doesn't have to be a single core. For example, on phones, CPU hotplug is
used to save power, and sometimes phones unplug lots of CPUs. And even
with multiple cores, we have things like cpuset cgroups, which can
prevent async OOM from running on other cores.

>
> > >
> > > with direct_reclamation, wait is the case you can sleep. it is not holding
> > > cpu, not a busy loop. in rare case, users might end in endless wait,
> > > but it matches the doc of __GFP_NOFAIL, never return till memory
> > > is gotten (the current code is implemented in this way unless users
> > > incorrectly combine __GFP_NOFAIL with atomic/nowait etc.)
> > >
> >
> > and the essential difference between "w/ and w/o direct_reclaim": with
> > direct reclaim, the user is actively reclaiming memory to rescue itself
> > by all kinds of possible ways(compact, oom, reclamation), while without
> > direct reclamation, it can do nothing and just loop (busy-loop).
>
> It can wake up kswapd, which can then reclaim memory. If kswapd can't
> keep up, the system is likely under heavy memory pressure. In such a
> case, it makes little difference whether __GFP_DIRECT_RECLAIM is set
> or not. For reference, see the old issue:
> https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/.
>
> I believe the core issue persists, and the design of __GFP_NOFAIL
> exacerbates it.
>
> By the way, I believe we could trigger an asynchronous OOM kill in the
> case without direct reclaim to avoid busy looping.

This async OOM is sometimes impossible, for the same reason as in
the above example.

>
> >
> > > note, long-term we won't expose __GFP_NOFAIL any more. we
> > > will only expose GFP_NOFAIL which enforces Blockable.  I am
> > > quite busy on other issues, so this won't happen in a short time.
> > >
> > > >
> > > > > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > > > > it is absolutely clearly documented.
> > > > >
> > > > > >
> > > > > > > which might not
> > > > > > >     be well supported by some architectures.
> > > > > > >
> > > > > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > > > > >     exposing NULL dereference.
> > > > > > >
> > > > > > > Neither option is ideal, but both are improvements over the existing code.
> > > > > > > This patch selects the second option because, with the introduction of
> > > > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > > > > no longer be possible.
> > > > > > >
> > > > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > Cc: Christoph Lameter <cl@linux.com>
> > > > > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > > > > Cc: David Rientjes <rientjes@google.com>
> > > > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > > > Cc: Kees Cook <kees@kernel.org>
> > > > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > ---
> > > > > > >  mm/page_alloc.c | 10 +++++-----
> > > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > > > > --- a/mm/page_alloc.c
> > > > > > > +++ b/mm/page_alloc.c
> > > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > > > >          */
> > > > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > > > >                 /*
> > > > > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > > > -                * of any new users that actually require GFP_NOWAIT
> > > > > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > > > > +                * otherwise we introduce a busy loop with inside the page
> > > > > > > +                * allocator from non-sleepable contexts
> > > > > > >                  */
> > > > > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > > > -                       goto fail;
> > > > > > > +               BUG_ON(!can_direct_reclaim);
> > > > > >
> > > > > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > > > > handle the return value of __GFP_NOFAIL.
> > > > > >
> > > > >
> > > > > it is not correct to handle the return value of __GFP_NOFAIL.
> > > > > if you check the ret, don't use __GFP_NOFAIL.
> > > >
> > > > If so, you have many code changes to make in the linux kernel ;)
> > > >
> > >
> > > Please list those code using __GFP_NOFAIL and check the result
> > > might fail, we should get them fixed. This is insane. NOFAIL means
> > > no fail.
>
> You can find some instances with grep commands, but there's no
> reliable way to capture them all with a single command. Here are a few
> examples:
>
>                 // drivers/infiniband/hw/cxgb4/mem.c
>                 skb = alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL);
>                 if (!skb)
>                         return -ENOMEM;
>

Thanks for pointing out this wrong code.

This is absolutely wrong. With GFP_KERNEL | __GFP_NOFAIL,
mm-core retries endlessly until direct reclamation gives
it memory. There is no possibility of returning NULL.

The only possibility is __GFP_NOFAIL | GFP_ATOMIC (or GFP_NOWAIT),
where mm-core returns NULL incorrectly. Actually, mm-core just ignores
__GFP_NOFAIL in this case. Once GFP_NOFAIL enforcing
direct reclamation is done, the wrong case __GFP_NOFAIL |
GFP_ATOMIC (or GFP_NOWAIT) will no longer exist.
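
To make the distinction above concrete, here is a minimal userspace sketch of
the check being discussed. It is only a model: the MODEL_* flag values and the
classify_nofail() helper are hypothetical and are not the kernel's definitions
(the real flags live in include/linux/gfp_types.h). It assumes, as in the
patch, that "blockable" means the direct-reclaim bit is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified flag bits; NOT the kernel's real values. */
typedef unsigned int gfp_model_t;
#define MODEL_DIRECT_RECLAIM 0x1u /* caller may sleep and reclaim */
#define MODEL_KSWAPD_RECLAIM 0x2u /* caller may only wake kswapd */
#define MODEL_NOFAIL         0x4u /* caller does not check for NULL */
#define MODEL_HIGH           0x8u /* high-priority request */

/* Simplified stand-ins for the composite masks named in the thread. */
#define MODEL_GFP_KERNEL (MODEL_DIRECT_RECLAIM | MODEL_KSWAPD_RECLAIM)
#define MODEL_GFP_NOWAIT (MODEL_KSWAPD_RECLAIM)
#define MODEL_GFP_ATOMIC (MODEL_HIGH | MODEL_KSWAPD_RECLAIM)

enum nofail_verdict {
	NOT_NOFAIL,         /* ordinary allocation, may return NULL */
	NOFAIL_SUPPORTED,   /* blockable nofail: retried until it succeeds */
	NOFAIL_UNSUPPORTED, /* non-blockable nofail: the case under debate */
};

/* Mirrors the can_direct_reclaim test in the slow path of the patch. */
static enum nofail_verdict classify_nofail(gfp_model_t flags)
{
	bool can_direct_reclaim = flags & MODEL_DIRECT_RECLAIM;

	if (!(flags & MODEL_NOFAIL))
		return NOT_NOFAIL;
	return can_direct_reclaim ? NOFAIL_SUPPORTED : NOFAIL_UNSUPPORTED;
}
```

In this model, MODEL_GFP_KERNEL | MODEL_NOFAIL classifies as supported, while
MODEL_GFP_ATOMIC | MODEL_NOFAIL and MODEL_GFP_NOWAIT | MODEL_NOFAIL classify
as unsupported, which is the combination the patch wants to stop early.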

>         // fs/xfs/libxfs/xfs_dir2.c
>         args = kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL);
>         if (!args)
>                 return -ENOMEM;
>
>
> --
> Regards
> Yafang



* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  2:55   ` Yafang Shao
  2024-08-18  3:48     ` Barry Song
@ 2024-08-19  7:50     ` Michal Hocko
  2024-08-19  9:25       ` Yafang Shao
  1 sibling, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-19  7:50 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > When users allocate memory with the __GFP_NOFAIL flag, they might
> > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > attempt and still fail to allocate memory for these users, we have two
> > choices:
> >
> >     1. We could busy-loop and hope that some other direct reclamation or
> >     kswapd rescues the current process. However, this is unreliable
> >     and could ultimately lead to hard or soft lockups,
> 
> That can occur even if we set both __GFP_NOFAIL and
> __GFP_DIRECT_RECLAIM, right?

No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
time to satisfy the allocation but it will reclaim to get the memory, it
will sleep if necessary and it will trigger OOM killer if there is
no other option. __GFP_DIRECT_RECLAIM is a completely different story
than without it which means _no_sleeping_ is allowed and therefore only
a busy loop waiting for the allocation to proceed is allowed.
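
The difference described above can be sketched as a toy userspace model. This
is an illustration under stated assumptions, not the allocator's code: the
model_zone structure, its counters, and the spin_budget parameter are all
hypothetical. The real slow path has no such budget; it is used here only so
that the non-blockable case terminates and can be observed.

```c
#include <stdbool.h>

/* Hypothetical toy zone: just two counters. */
struct model_zone {
	int free_pages;        /* pages available right now */
	int reclaimable_pages; /* pages that direct reclaim could free */
};

/* Take one page if any is free. */
static bool model_try_alloc(struct model_zone *z)
{
	if (z->free_pages > 0) {
		z->free_pages--;
		return true;
	}
	return false;
}

/* Stand-in for direct reclaim: move one reclaimable page to the free pool. */
static bool model_direct_reclaim(struct model_zone *z)
{
	if (z->reclaimable_pages > 0) {
		z->reclaimable_pages--;
		z->free_pages++;
		return true;
	}
	return false;
}

/*
 * Returns the number of attempts used, or -1 when the spin budget is
 * exhausted (a stand-in for the lockup case in this model).
 */
static int model_nofail_alloc(struct model_zone *z, bool can_direct_reclaim,
			      int spin_budget)
{
	for (int attempt = 1; attempt <= spin_budget; attempt++) {
		if (model_try_alloc(z))
			return attempt;
		if (can_direct_reclaim)
			model_direct_reclaim(z); /* caller makes its own progress */
		/* otherwise nothing changes: the loop only burns CPU */
	}
	return -1;
}
```

With one reclaimable page and no free pages, the blockable caller succeeds on
its second attempt, while the non-blockable caller spins until the budget runs
out because nothing in the loop can change the zone's state.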

> So, I don't believe the issue is related
> to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> design of __GFP_NOFAIL itself.

Care to elaborate?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-18  7:07             ` Yafang Shao
  2024-08-18  7:25               ` Barry Song
@ 2024-08-19  7:51               ` Michal Hocko
  1 sibling, 0 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-19  7:51 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Sun 18-08-24 15:07:00, Yafang Shao wrote:
> That can be avoided by a simple cond_resched() if the hard lockup or
> softlockup is the main issue ;)

No, it cannot! cond_resched from within an atomic context is a no-no. And NOWAIT
is used from within atomic contexts.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19  7:50     ` Michal Hocko
@ 2024-08-19  9:25       ` Yafang Shao
  2024-08-19  9:39         ` Barry Song
  2024-08-19 10:17         ` Michal Hocko
  0 siblings, 2 replies; 101+ messages in thread
From: Yafang Shao @ 2024-08-19  9:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > attempt and still fail to allocate memory for these users, we have two
> > > choices:
> > >
> > >     1. We could busy-loop and hope that some other direct reclamation or
> > >     kswapd rescues the current process. However, this is unreliable
> > >     and could ultimately lead to hard or soft lockups,
> >
> > That can occur even if we set both __GFP_NOFAIL and
> > __GFP_DIRECT_RECLAIM, right?
>
> No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> time to satisfy the allocation but it will reclaim to get the memory, it
> will sleep if necessary and it will trigger OOM killer if there is
> no other option. __GFP_DIRECT_RECLAIM is a completely different story
> than without it which means _no_sleeping_ is allowed and therefore only
> a busy loop waiting for the allocation to proceed is allowed.

That could be a livelock.
From the user's perspective, there's no noticeable difference between
a livelock, soft lockup, or hard lockup.

>
> > So, I don't believe the issue is related
> > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > design of __GFP_NOFAIL itself.
>
> Care to elaborate?

I've read the documentation explaining why the busy loop is embedded
within the page allocation process instead of letting users implement
it based on their needs. However, the complexity and numerous issues
suggest that this design might be fundamentally flawed.

--
Regards
Yafang



* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19  9:25       ` Yafang Shao
@ 2024-08-19  9:39         ` Barry Song
  2024-08-19  9:45           ` Yafang Shao
  2024-08-19 10:17         ` Michal Hocko
  1 sibling, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19  9:39 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Michal Hocko, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > attempt and still fail to allocate memory for these users, we have two
> > > > choices:
> > > >
> > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > >     kswapd rescues the current process. However, this is unreliable
> > > >     and could ultimately lead to hard or soft lockups,
> > >
> > > That can occur even if we set both __GFP_NOFAIL and
> > > __GFP_DIRECT_RECLAIM, right?
> >
> > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > time to satisfy the allocation but it will reclaim to get the memory, it
> > will sleep if necessary and it will trigger OOM killer if there is
> > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > than without it which means _no_sleeping_ is allowed and therefore only
> > a busy loop waiting for the allocation to proceed is allowed.
>
> That could be a livelock.
> From the user's perspective, there's no noticeable difference between
> a livelock, soft lockup, or hard lockup.

This is certainly different. A lockup occurs when tasks can't be scheduled,
causing the entire system to stop functioning.

>
> >
> > > So, I don't believe the issue is related
> > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > design of __GFP_NOFAIL itself.
> >
> > Care to elaborate?
>
> I've read the documentation explaining why the busy loop is embedded
> within the page allocation process instead of letting users implement
> it based on their needs. However, the complexity and numerous issues
> suggest that this design might be fundamentally flawed.

I don't see "numerous issues", only two issues:

1. allocation size overflow with __GFP_NOFAIL
2. unsupported case: GFP_NOWAIT/GFP_ATOMIC | __GFP_NOFAIL.

For 1, it has always been a bug to require an overflowed size to succeed.

For 2, it is an unsupported case. We just need to hide __GFP_NOFAIL
and only expose GFP_NOFAIL (which definitely includes blockable) so
any unsupported case like vdpa will no longer occur. I would greatly
appreciate it if you or someone else could take over this task, as I am
currently extremely busy.

>
> --
> Regards
> Yafang

Thanks
Barry



* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-17  6:24 ` [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
@ 2024-08-19  9:43   ` David Hildenbrand
  2024-08-19  9:47     ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19  9:43 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Christoph Hellwig, Lorenzo Stoakes, Kees Cook,
	Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On 17.08.24 08:24, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> they don't check the return, we are exposed to the security risks for NULL
> deference.
> 
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
> 
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures.  So in fact a straight BUG is the right thing
> here.
> 
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stacktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what led to the allocation failing).
> 
> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Hailong.Liu <hailong.liu@oppo.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   include/linux/slab.h | 4 +++-
>   mm/page_alloc.c      | 4 +++-
>   mm/util.c            | 1 +
>   3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c9cb42203183..4a4d1fdc2afe 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>   {
>   	size_t bytes;
>   
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		BUG_ON(flags & __GFP_NOFAIL);
>   		return NULL;
> +	}
>   
>   	return kvmalloc_node_noprof(bytes, flags, node);
>   }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 60742d057b05..d2c37f8f8d09 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>   	 * There are several places where we assume that the order value is sane
>   	 * so bail out early if the request is out of bound.
>   	 */
> -	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> +	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> +		BUG_ON(gfp & __GFP_NOFAIL);
>   		return NULL;
> +	}
>   
>   	gfp &= gfp_allowed_mask;
>   	/*
> diff --git a/mm/util.c b/mm/util.c
> index ac01925a4179..678c647b778f 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>   
>   	/* Don't even allow crazy sizes */
>   	if (unlikely(size > INT_MAX)) {
> +		BUG_ON(flags & __GFP_NOFAIL);

No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
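
As a rough illustration of the "warn once, then return NULL" shape for the
overflow case, here is a userspace sketch. It is an assumption-laden model,
not a proposed patch: model_alloc_array() and MODEL_NOFAIL are hypothetical
names, __builtin_mul_overflow() (a GCC/Clang builtin) stands in for
check_mul_overflow(), and fprintf() stands in for WARN_ON_ONCE().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define MODEL_NOFAIL 0x4u /* hypothetical flag bit, not the kernel value */

/*
 * Overflow-checked array allocation. An overflowing request can never be
 * satisfied, so it returns NULL even for a nofail caller and warns once.
 */
static void *model_alloc_array(size_t n, size_t size, unsigned int flags)
{
	static bool warned; /* models the "once" in WARN_ON_ONCE() */
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes)) {
		if ((flags & MODEL_NOFAIL) && !warned) {
			warned = true;
			fprintf(stderr, "nofail request with overflowing size\n");
		}
		return NULL;
	}
	return malloc(bytes);
}
```

The sketch also shows the open question: a caller that passed MODEL_NOFAIL
does not check for NULL, so the warning alone does not make the failure
recoverable for that caller.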

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-17  6:24 ` [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
  2024-08-18  2:55   ` Yafang Shao
@ 2024-08-19  9:44   ` David Hildenbrand
  2024-08-19 10:19     ` Michal Hocko
  1 sibling, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19  9:44 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 17.08.24 08:24, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> When users allocate memory with the __GFP_NOFAIL flag, they might
> incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> attempt and still fail to allocate memory for these users, we have two
> choices:
> 
>      1. We could busy-loop and hope that some other direct reclamation or
>      kswapd rescues the current process. However, this is unreliable
>      and could ultimately lead to hard or soft lockups, which might not
>      be well supported by some architectures.
> 
>      2. We could use BUG_ON to trigger a reliable system crash, avoiding
>      exposing NULL dereference.
> 
> Neither option is ideal, but both are improvements over the existing code.
> This patch selects the second option because, with the introduction of
> scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> nofail users(which is in my plan), non-blockable nofail allocations will
> no longer be possible.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Hailong.Liu <hailong.liu@oppo.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   mm/page_alloc.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2c37f8f8d09..fb5850ecd3ae 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   	 */
>   	if (gfp_mask & __GFP_NOFAIL) {
>   		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * All existing users of the __GFP_NOFAIL are blockable
> +		 * otherwise we introduce a busy loop with inside the page
> +		 * allocator from non-sleepable contexts
>   		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -			goto fail;
> +		BUG_ON(!can_direct_reclaim);

No new BUG_ON(), WARN_ON_ONCE() is good enough for something that should 
be found during ordinary testing.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19  9:39         ` Barry Song
@ 2024-08-19  9:45           ` Yafang Shao
  2024-08-19 10:10             ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Yafang Shao @ 2024-08-19  9:45 UTC (permalink / raw)
  To: Barry Song
  Cc: Michal Hocko, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > choices:
> > > > >
> > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > >     kswapd rescues the current process. However, this is unreliable
> > > > >     and could ultimately lead to hard or soft lockups,
> > > >
> > > > That can occur even if we set both __GFP_NOFAIL and
> > > > __GFP_DIRECT_RECLAIM, right?
> > >
> > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > will sleep if necessary and it will trigger OOM killer if there is
> > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > than without it which means _no_sleeping_ is allowed and therefore only
> > > a busy loop waiting for the allocation to proceed is allowed.
> >
> > That could be a livelock.
> > From the user's perspective, there's no noticeable difference between
> > a livelock, soft lockup, or hard lockup.
>
> This is certainly different. A lockup occurs when tasks can't be scheduled,
> causing the entire system to stop functioning.

When a livelock occurs, your only options are to migrate your
applications to other servers or reboot the system—there’s no other
resolution (except for using oomd, which is difficult for users
without cgroup2 or swap).

So, there's effectively no difference.

>
> >
> > >
> > > > So, I don't believe the issue is related
> > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > design of __GFP_NOFAIL itself.
> > >
> > > Care to elaborate?
> >
> > I've read the documentation explaining why the busy loop is embedded
> > within the page allocation process instead of letting users implement
> > it based on their needs. However, the complexity and numerous issues
> > suggest that this design might be fundamentally flawed.
>
> I don't see "numerous issues", only two issues:
>
> 1. allocation size overflow with __GFP_NOFAIL
> 2. unsupported case: GFP_NOWAIT/GFP_ATOMIC | __GFP_NOFAIL.
>
> for 1, it has been a BUG to require an overflowed size to always succeed.
>
> for 2,  it is an unsupported case. we just need to hide __GFP_NOFAIL
> and only expose GFP_NOFAIL(which definitely includes blockable) so
> any unsupported case like vdpa will no longer occur.  I would greatly
> appreciate it if you or someone else could take over this task, as I am
> currently extremely busy.
>
> >
> > --
> > Regards
> > Yafang
>
> Thanks
> Barry



--
Regards
Yafang



* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19  9:43   ` David Hildenbrand
@ 2024-08-19  9:47     ` Barry Song
  2024-08-19  9:55       ` David Hildenbrand
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19  9:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.08.24 08:24, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> > they don't check the return, we are exposed to the security risks for NULL
> > deference.
> >
> > Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> > situation.
> >
> > Christoph Hellwig:
> > The whole freaking point of __GFP_NOFAIL is that callers don't handle
> > allocation failures.  So in fact a straight BUG is the right thing
> > here.
> >
> > Vlastimil Babka:
> > It's just not a recoverable situation (WARN_ON is for recoverable
> > situations). The caller cannot handle allocation failure and at the same
> > time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> > > with stacktrace etc. We don't need to hope for the later NULL pointer
> > > dereference (which might if really unlucky happen from a different
> > > context where it's no longer obvious what led to the allocation failing).
> >
> > Michal Hocko:
> > Linus tends to be against adding new BUG() calls unless the failure is
> > absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> > not sure how he would look at simply incorrect memory allocator usage to
> > blow up the kernel. Now the argument could be made that those failures
> > could cause subtle memory corruptions or even be exploitable which might
> > be a sufficient reason to stop them early.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >   include/linux/slab.h | 4 +++-
> >   mm/page_alloc.c      | 4 +++-
> >   mm/util.c            | 1 +
> >   3 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index c9cb42203183..4a4d1fdc2afe 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >   {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             BUG_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       return kvmalloc_node_noprof(bytes, flags, node);
> >   }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 60742d057b05..d2c37f8f8d09 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> >        * There are several places where we assume that the order value is sane
> >        * so bail out early if the request is out of bound.
> >        */
> > -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> > +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> > +             BUG_ON(gfp & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       gfp &= gfp_allowed_mask;
> >       /*
> > diff --git a/mm/util.c b/mm/util.c
> > index ac01925a4179..678c647b778f 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >
> >       /* Don't even allow crazy sizes */
> >       if (unlikely(size > INT_MAX)) {
> > +             BUG_ON(flags & __GFP_NOFAIL);
>
> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.

Hi David,
WARN_ON_ONCE() might be fine, but I don't see how it is possible to recover.

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19  9:47     ` Barry Song
@ 2024-08-19  9:55       ` David Hildenbrand
  2024-08-19 10:02         ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19  9:55 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 11:47, Barry Song wrote:
> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 17.08.24 08:24, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
>>> they don't check the return, we are exposed to the security risks for NULL
>>> deference.
>>>
>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
>>> situation.
>>>
>>> Christoph Hellwig:
>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
>>> allocation failures.  So in fact a straight BUG is the right thing
>>> here.
>>>
>>> Vlastimil Babka:
>>> It's just not a recoverable situation (WARN_ON is for recoverable
>>> situations). The caller cannot handle allocation failure and at the same
>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
>>> with stacktrace etc. We don't need to hope for the later NULL pointer
>>> dereference (which might if really unlucky happen from a different
>>> context where it's no longer obvious what led to the allocation failing).
>>>
>>> Michal Hocko:
>>> Linus tends to be against adding new BUG() calls unless the failure is
>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
>>> not sure how he would look at simply incorrect memory allocator usage to
>>> blow up the kernel. Now the argument could be made that those failures
>>> could cause subtle memory corruptions or even be exploitable which might
>>> be a sufficient reason to stop them early.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Christoph Lameter <cl@linux.com>
>>> Cc: Pekka Enberg <penberg@kernel.org>
>>> Cc: David Rientjes <rientjes@google.com>
>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>> Cc: Kees Cook <kees@kernel.org>
>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
>>> Cc: Jason Wang <jasowang@redhat.com>
>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> ---
>>>    include/linux/slab.h | 4 +++-
>>>    mm/page_alloc.c      | 4 +++-
>>>    mm/util.c            | 1 +
>>>    3 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>> index c9cb42203183..4a4d1fdc2afe 100644
>>> --- a/include/linux/slab.h
>>> +++ b/include/linux/slab.h
>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>>    {
>>>        size_t bytes;
>>>
>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>                return NULL;
>>> +     }
>>>
>>>        return kvmalloc_node_noprof(bytes, flags, node);
>>>    }
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 60742d057b05..d2c37f8f8d09 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>         * There are several places where we assume that the order value is sane
>>>         * so bail out early if the request is out of bound.
>>>         */
>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>                return NULL;
>>> +     }
>>>
>>>        gfp &= gfp_allowed_mask;
>>>        /*
>>> diff --git a/mm/util.c b/mm/util.c
>>> index ac01925a4179..678c647b778f 100644
>>> --- a/mm/util.c
>>> +++ b/mm/util.c
>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>>>
>>>        /* Don't even allow crazy sizes */
>>>        if (unlikely(size > INT_MAX)) {
>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>
>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
> 
> Hi David,
> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.

Just return NULL? "shit in shit out" :) ?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19  9:55       ` David Hildenbrand
@ 2024-08-19 10:02         ` Barry Song
  2024-08-19 12:33           ` David Hildenbrand
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19 10:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 11:47, Barry Song wrote:
> > On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 17.08.24 08:24, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> >>> they don't check the return, we are exposed to the security risks for NULL
> >>> deference.
> >>>
> >>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> >>> situation.
> >>>
> >>> Christoph Hellwig:
> >>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> >>> allocation failures.  So in fact a straight BUG is the right thing
> >>> here.
> >>>
> >>> Vlastimil Babka:
> >>> It's just not a recoverable situation (WARN_ON is for recoverable
> >>> situations). The caller cannot handle allocation failure and at the same
> >>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> > >>> with stacktrace etc. We don't need to hope for the later NULL pointer
> > >>> dereference (which might if really unlucky happen from a different
> > >>> context where it's no longer obvious what led to the allocation failing).
> >>>
> >>> Michal Hocko:
> >>> Linus tends to be against adding new BUG() calls unless the failure is
> >>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> >>> not sure how he would look at simply incorrect memory allocator usage to
> >>> blow up the kernel. Now the argument could be made that those failures
> >>> could cause subtle memory corruptions or even be exploitable which might
> >>> be a sufficient reason to stop them early.
> >>>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >>> Acked-by: Michal Hocko <mhocko@suse.com>
> >>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Cc: Christoph Lameter <cl@linux.com>
> >>> Cc: Pekka Enberg <penberg@kernel.org>
> >>> Cc: David Rientjes <rientjes@google.com>
> >>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> >>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >>> Cc: Kees Cook <kees@kernel.org>
> >>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> >>> Cc: Hailong.Liu <hailong.liu@oppo.com>
> >>> Cc: Jason Wang <jasowang@redhat.com>
> >>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> >>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>> ---
> >>>    include/linux/slab.h | 4 +++-
> >>>    mm/page_alloc.c      | 4 +++-
> >>>    mm/util.c            | 1 +
> >>>    3 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/linux/slab.h b/include/linux/slab.h
> >>> index c9cb42203183..4a4d1fdc2afe 100644
> >>> --- a/include/linux/slab.h
> >>> +++ b/include/linux/slab.h
> >>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >>>    {
> >>>        size_t bytes;
> >>>
> >>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> >>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> >>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>>                return NULL;
> >>> +     }
> >>>
> >>>        return kvmalloc_node_noprof(bytes, flags, node);
> >>>    }
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 60742d057b05..d2c37f8f8d09 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> >>>         * There are several places where we assume that the order value is sane
> >>>         * so bail out early if the request is out of bound.
> >>>         */
> >>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> >>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> >>> +             BUG_ON(gfp & __GFP_NOFAIL);
> >>>                return NULL;
> >>> +     }
> >>>
> >>>        gfp &= gfp_allowed_mask;
> >>>        /*
> >>> diff --git a/mm/util.c b/mm/util.c
> >>> index ac01925a4179..678c647b778f 100644
> >>> --- a/mm/util.c
> >>> +++ b/mm/util.c
> >>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >>>
> >>>        /* Don't even allow crazy sizes */
> >>>        if (unlikely(size > INT_MAX)) {
> >>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>
> >> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
> >
> > Hi David,
> > WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
>
> Just return NULL? "shit in shit out" :) ?

Returning NULL is perfectly fine if gfp doesn't include __GFP_NOFAIL,
as it's then the caller's responsibility to check the return value. However,
with __GFP_NOFAIL, callers will directly dereference *(p + offset) even when
p == NULL, since that is how __GFP_NOFAIL is supposed to work.
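
A minimal userspace sketch of the distinction drawn above, assuming a
hypothetical model_alloc() that rejects oversized requests by returning NULL
regardless of flags. None of these names are kernel APIs; MODEL_NOFAIL merely
stands in for __GFP_NOFAIL, and the size limit is arbitrary:

```c
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NOFAIL     0x1u               /* hypothetical stand-in for __GFP_NOFAIL */
#define MODEL_SIZE_LIMIT ((size_t)1 << 20)  /* arbitrary "crazy size" threshold */

/* Models the "just return NULL" option: oversized requests fail even
 * when the caller passed the nofail flag. */
static void *model_alloc(size_t size, unsigned int flags)
{
	(void)flags;
	if (size > MODEL_SIZE_LIMIT)
		return NULL;
	return calloc(1, size);
}

/* Ordinary caller: checks the return value, so NULL is recoverable. */
static int fill_checked(size_t size, size_t offset)
{
	unsigned char *p;

	if (offset >= size)
		return -1;
	p = model_alloc(size, 0);
	if (!p)
		return -1;
	p[offset] = 1;
	free(p);
	return 0;
}

/* Nofail-style caller: by contract it has no error path. Instead of
 * performing the unchecked write, this model only reports whether that
 * write would have gone through a NULL pointer. */
static int nofail_caller_would_deref_null(size_t size)
{
	unsigned char *p = model_alloc(size, MODEL_NOFAIL);
	int would_crash = (p == NULL);

	free(p); /* free(NULL) is a no-op */
	return would_crash;
}
```

In this model the checked caller sees -1 for an oversized request, while the
nofail-style caller has nowhere to report the failure.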

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19  9:45           ` Yafang Shao
@ 2024-08-19 10:10             ` Barry Song
  2024-08-19 11:56               ` Yafang Shao
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19 10:10 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Michal Hocko, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > choices:
> > > > > >
> > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > >     and could ultimately lead to hard or soft lockups,
> > > > >
> > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > __GFP_DIRECT_RECLAIM, right?
> > > >
> > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > will sleep if necessary and it will trigger OOM killer if there is
> > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > a busy loop waiting for the allocation to proceed is allowed.
> > >
> > > That could be a livelock.
> > > From the user's perspective, there's no noticeable difference between
> > > a livelock, soft lockup, or hard lockup.
> >
> > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > causing the entire system to stop functioning.
>
> When a livelock occurs, your only options are to migrate your
> applications to other servers or reboot the system—there’s no other
> resolution (except for using oomd, which is difficult for users
> without cgroup2 or swap).
>
> So, there's effectively no difference.

Could you express your options more clearly? I am guessing two
possibilities?
1. entirely drop __GFP_NOFAIL and require all users who are
using __GFP_NOFAIL to add error handlers instead?

2. even in an unsupported case, such as GFP_ATOMIC |
__GFP_NOFAIL, always loop until a soft or hard lockup?

>
> >
> > >
> > > >
> > > > > So, I don't believe the issue is related
> > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > design of __GFP_NOFAIL itself.
> > > >
> > > > Care to elaborate?
> > >
> > > I've read the documentation explaining why the busy loop is embedded
> > > within the page allocation process instead of letting users implement
> > > it based on their needs. However, the complexity and numerous issues
> > > suggest that this design might be fundamentally flawed.
> >
> > I don't see "numerous issues", only two issues:
> >
> > 1. allocation size overflow with __GFP_NOFAIL
> > 2. unsupported case: __GFP_NOWAIT/ATOMIC | __GFP_NOFAIL.
> >
> > for 1, it has been a BUG to require an overflowed size to always succeed.
> >
> > for 2,  it is an unsupported case. we just need to hide __GFP_NOFAIL
> > and only expose GFP_NOFAIL(which definitely includes blockable) so
> > any unsupported case like vdpa will no longer occur.  I would greatly
> > appreciate it if you or someone else could take over this task, as I am
> > currently extremely busy.
> >
> > >
> > > --
> > > Regards
> > > Yafang
> >
> > Thanks
> > Barry
>
>
>
> --
> Regards
> Yafang

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19  9:25       ` Yafang Shao
  2024-08-19  9:39         ` Barry Song
@ 2024-08-19 10:17         ` Michal Hocko
  2024-08-19 11:56           ` Yafang Shao
  1 sibling, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-19 10:17 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon 19-08-24 17:25:18, Yafang Shao wrote:
> On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > attempt and still fail to allocate memory for these users, we have two
> > > > choices:
> > > >
> > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > >     kswapd rescues the current process. However, this is unreliable
> > > >     and could ultimately lead to hard or soft lockups,
> > >
> > > That can occur even if we set both __GFP_NOFAIL and
> > > __GFP_DIRECT_RECLAIM, right?
> >
> > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > time to satisfy the allocation but it will reclaim to get the memory, it
> > will sleep if necessary and it will trigger OOM killer if there is
> > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > than without it which means _no_sleeping_ is allowed and therefore only
> > a busy loop waiting for the allocation to proceed is allowed.
> 
> That could be a livelock.
> From the user's perspective, there's no noticeable difference between
> a livelock, soft lockup, or hard lockup.

Ohh, there is very much a difference between somebody in a sleepable
context taking too long to complete and somebody making a CPU completely
unusable for anything else.

Please consider that asking for a never-failing allocation is a major
requirement.

> > > So, I don't believe the issue is related
> > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > design of __GFP_NOFAIL itself.
> >
> > Care to elaborate?
> 
> I've read the documentation explaining why the busy loop is embedded
> within the page allocation process instead of letting users implement
> it based on their needs. However, the complexity and numerous issues
> suggest that this design might be fundamentally flawed.

I really fail to see what you mean.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19  9:44   ` David Hildenbrand
@ 2024-08-19 10:19     ` Michal Hocko
  2024-08-19 12:48       ` David Hildenbrand
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-19 10:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon 19-08-24 11:44:39, David Hildenbrand wrote:
[...]
> >   	if (gfp_mask & __GFP_NOFAIL) {
> >   		/*
> > -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -		 * of any new users that actually require GFP_NOWAIT
> > +		 * All existing users of the __GFP_NOFAIL are blockable
> > +		 * otherwise we introduce a busy loop with inside the page
> > +		 * allocator from non-sleepable contexts
> >   		 */
> > -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -			goto fail;
> > +		BUG_ON(!can_direct_reclaim);
> 
> No new BUG_ON(), WARN_ON_ONCE() is good enough for something that should be
> found during ordinary testing.

Do you mean 
	if (WARN_ON_ONCE_GFP(...))
		goto retry?

Barry has mentioned that option in the changelog. 
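
For illustration only, the three ways of treating a non-blockable
__GFP_NOFAIL request weighed here (warn and fail, warn and retry, or BUG) can
be modelled in userspace C. The flag values, both enums and resolve_nofail()
are hypothetical stand-ins, not kernel definitions:

```c
#include <stdbool.h>

#define MODEL_DIRECT_RECLAIM 0x1u /* hypothetical: caller is allowed to sleep */
#define MODEL_NOFAIL         0x2u /* hypothetical: caller never checks for NULL */

enum nofail_policy {
	POLICY_FAIL,  /* warn once, then return NULL (the removed "goto fail") */
	POLICY_RETRY, /* warn once, then keep looping ("goto retry") */
	POLICY_BUG    /* stop immediately (the BUG_ON in the patch) */
};

enum nofail_outcome {
	OUTCOME_NORMAL,      /* request is supported, nothing special happens */
	OUTCOME_RETURN_NULL, /* caller later dereferences NULL */
	OUTCOME_BUSY_LOOP,   /* allocator spins in a non-sleepable context */
	OUTCOME_OOPS         /* guaranteed oops at the point of misuse */
};

/* Maps a request plus a chosen policy to what the caller would observe. */
static enum nofail_outcome resolve_nofail(unsigned int flags,
					  enum nofail_policy policy)
{
	bool can_direct_reclaim = (flags & MODEL_DIRECT_RECLAIM) != 0;

	/* Only nofail requests that cannot sleep are unsupported. */
	if (!(flags & MODEL_NOFAIL) || can_direct_reclaim)
		return OUTCOME_NORMAL;

	switch (policy) {
	case POLICY_FAIL:
		return OUTCOME_RETURN_NULL;
	case POLICY_RETRY:
		return OUTCOME_BUSY_LOOP;
	case POLICY_BUG:
		return OUTCOME_OOPS;
	}
	return OUTCOME_NORMAL;
}
```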
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19 10:17         ` Michal Hocko
@ 2024-08-19 11:56           ` Yafang Shao
  2024-08-19 12:04             ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Yafang Shao @ 2024-08-19 11:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 6:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 19-08-24 17:25:18, Yafang Shao wrote:
> > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > choices:
> > > > >
> > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > >     kswapd rescues the current process. However, this is unreliable
> > > > >     and could ultimately lead to hard or soft lockups,
> > > >
> > > > That can occur even if we set both __GFP_NOFAIL and
> > > > __GFP_DIRECT_RECLAIM, right?
> > >
> > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > will sleep if necessary and it will trigger OOM killer if there is
> > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > than without it which means _no_sleeping_ is allowed and therefore only
> > > a busy loop waiting for the allocation to proceed is allowed.
> >
> > That could be a livelock.
> > From the user's perspective, there's no noticeable difference between
> > a livelock, soft lockup, or hard lockup.
>
> Ohh, it very much is different if somebody in a sleepable context is
> taking too long to complete and making a CPU completely unusable for
> anything else.

__alloc_pages_slowpath
retry:
    if (gfp_mask & __GFP_NOFAIL) {
        goto retry;
    }

When the loop continues indefinitely here, it indicates that the
system is unstable. In such a scenario, does it really matter whether
you sleep or not?

>
> Please consider that asking for never failing allocation is a major
> requirement.
>
> > > > So, I don't believe the issue is related
> > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > design of __GFP_NOFAIL itself.
> > >
> > > Care to elaborate?
> >
> > I've read the documentation explaining why the busy loop is embedded
> > within the page allocation process instead of letting users implement
> > it based on their needs. However, the complexity and numerous issues
> > suggest that this design might be fundamentally flawed.
>
> I really fail to see what you mean.

I mean giving the user the option to handle the loop at the call site,
rather than having it loop within __alloc_pages_slowpath().
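
As a rough sketch of what looping at the call site could look like, here is
a userspace model with a bounded retry count. alloc_with_bounded_retry(),
try_alloc_fn and flaky_alloc() are hypothetical names used for illustration
only, not existing kernel interfaces:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook that is allowed to fail. */
typedef void *(*try_alloc_fn)(size_t size, void *ctx);

/* Call-site retry loop: try up to max_tries times, then give up and
 * return NULL. *tries_used reports how many attempts were made. */
static void *alloc_with_bounded_retry(try_alloc_fn try_alloc, void *ctx,
				      size_t size, unsigned int max_tries,
				      unsigned int *tries_used)
{
	unsigned int i;

	for (i = 1; i <= max_tries; i++) {
		void *p = try_alloc(size, ctx);

		if (p) {
			*tries_used = i;
			return p;
		}
		/* A real caller would back off or release resources here. */
	}
	*tries_used = max_tries;
	return NULL;
}

/* Test double: fails while *ctx (a failure countdown) is non-zero. */
static void *flaky_alloc(size_t size, void *ctx)
{
	unsigned int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return NULL;
	}
	return malloc(size);
}
```

A caller written this way still has to handle the NULL return once the retry
budget is exhausted.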

--
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19 10:10             ` Barry Song
@ 2024-08-19 11:56               ` Yafang Shao
  2024-08-19 12:09                 ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Yafang Shao @ 2024-08-19 11:56 UTC (permalink / raw)
  To: Barry Song
  Cc: Michal Hocko, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 6:10 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > >
> > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > choices:
> > > > > > >
> > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > >
> > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > __GFP_DIRECT_RECLAIM, right?
> > > > >
> > > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > > will sleep if necessary and it will trigger OOM killer if there is
> > > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > > a busy loop waiting for the allocation to proceed is allowed.
> > > >
> > > > That could be a livelock.
> > > > From the user's perspective, there's no noticeable difference between
> > > > a livelock, soft lockup, or hard lockup.
> > >
> > > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > > causing the entire system to stop functioning.
> >
> > When a livelock occurs, your only options are to migrate your
> > applications to other servers or reboot the system—there’s no other
> > resolution (except for using oomd, which is difficult for users
> > without cgroup2 or swap).
> >
> > So, there's effectively no difference.
>
> Could you express your options more clearly? I am guessing two
> possibilities?
> 1. entirely drop __GFP_NOFAIL and require all users who are
> using __GFP_NOFAIL to add error handlers instead?

When the system is unstable—such as after reaching the maximum retries
without successfully allocating pages—simply failing the operation
might be the better option.

>
> 2. no matter if it is an unsupported case, such as, GFP_ATOMIC|
> __GFP_NOFAIL, we always loop till a soft or hard lockup?

--
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19 11:56           ` Yafang Shao
@ 2024-08-19 12:04             ` Michal Hocko
  0 siblings, 0 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-19 12:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon 19-08-24 19:56:16, Yafang Shao wrote:
> On Mon, Aug 19, 2024 at 6:18 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 19-08-24 17:25:18, Yafang Shao wrote:
> > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > choices:
> > > > > >
> > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > >     and could ultimately lead to hard or soft lockups,
> > > > >
> > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > __GFP_DIRECT_RECLAIM, right?
> > > >
> > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > will sleep if necessary and it will trigger OOM killer if there is
> > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > a busy loop waiting for the allocation to proceed is allowed.
> > >
> > > That could be a livelock.
> > > From the user's perspective, there's no noticeable difference between
> > > a livelock, soft lockup, or hard lockup.
> >
> > Ohh, it very much is different if somebody in a sleepable context is
> > taking too long to complete and making a CPU completely unusable for
> > anything else.
> 
> __alloc_pages_slowpath
> retry:
>     if (gfp_mask & __GFP_NOFAIL) {
>         goto retry;
>     }
> 
> When the loop continues indefinitely here, it indicates that the
> system is unstable.

No, it means the system is low on memory to satisfy the allocation
request. This doesn't automatically imply the system is unstable. The
requested NUMA node(s) or zone(s) might be depleted.

> In such a scenario, does it really matter whether
> you sleep or not?

Absolutely! Hogging CPU might prevent anybody else running on it.

> > Please consider that asking for never failing allocation is a major
> > requirement.
> >
> > > > > So, I don't believe the issue is related
> > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > design of __GFP_NOFAIL itself.
> > > >
> > > > Care to elaborate?
> > >
> > > I've read the documentation explaining why the busy loop is embedded
> > > within the page allocation process instead of letting users implement
> > > it based on their needs. However, the complexity and numerous issues
> > > suggest that this design might be fundamentally flawed.
> >
> > I really fail to see what you mean.
> 
> I mean giving the user the option to handle the loop at the call site,
> rather than having it loop within __alloc_pages_slowpath().

Users who have an allocation failure strategy do not and should not use
__GFP_NOFAIL.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19 11:56               ` Yafang Shao
@ 2024-08-19 12:09                 ` Michal Hocko
  2024-08-19 12:17                   ` Yafang Shao
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-19 12:09 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon 19-08-24 19:56:53, Yafang Shao wrote:
> On Mon, Aug 19, 2024 at 6:10 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > >
> > > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > > choices:
> > > > > > > >
> > > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > > >
> > > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > > __GFP_DIRECT_RECLAIM, right?
> > > > > >
> > > > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > > > will sleep if necessary and it will trigger OOM killer if there is
> > > > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > > > a busy loop waiting for the allocation to proceed is allowed.
> > > > >
> > > > > That could be a livelock.
> > > > > From the user's perspective, there's no noticeable difference between
> > > > > a livelock, soft lockup, or hard lockup.
> > > >
> > > > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > > > causing the entire system to stop functioning.
> > >
> > > When a livelock occurs, your only options are to migrate your
> > > applications to other servers or reboot the system—there’s no other
> > > resolution (except for using oomd, which is difficult for users
> > > without cgroup2 or swap).
> > >
> > > So, there's effectively no difference.
> >
> > Could you express your options more clearly? I am guessing two
> > possibilities?
> > 1. entirely drop __GFP_NOFAIL and require all users who are
> > using __GFP_NOFAIL to add error handlers instead?
> 
> When the system is unstable—such as after reaching the maximum retries
> without successfully allocating pages—simply failing the operation
> might be the better option.

It seems you are failing to understand the __GFP_NOFAIL semantic and you
are circling around that. So let me repeat that for you here. Make sure
you understand before going forward with the discussion. Feel free to ask
if something is not clear, but please do not continue with what-if kinds
of questions.

__GFP_NOFAIL means that the caller has no way to deal with an allocation
failure. The allocator simply cannot fail the request, even if that takes
ages to succeed! To put it more simply, if you have code like

	while (!(ptr = alloc()));
or
	BUG_ON(!(ptr = alloc()));

then you had better use __GFP_NOFAIL rather than open-code the endless
loop or the BUG_ON for the failure.

Our (page, vmalloc, kmalloc) allocators do support that mode for
allocations that are allowed to sleep. But those allocators have never
supported, and are unlikely to support, atomic non-failing allocations.
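
To make the contrast concrete, here is a minimal userspace sketch, not
kernel code. Every name in it (mock_alloc, mock_try_alloc, MOCK_NOFAIL,
MOCK_DIRECT_RECLAIM, the two counters) is hypothetical and only models
the semantics described above: the retry loop lives inside the allocator
and is only entered when the request is allowed to sleep and reclaim.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for gfp bits; NOT the kernel's real values. */
#define MOCK_DIRECT_RECLAIM 0x1u
#define MOCK_NOFAIL         0x2u

int mock_failures_left;  /* attempts that fail before one succeeds */
int mock_reclaim_calls;  /* how often the allocator "slept and reclaimed" */

/* One allocation attempt: fails while mock_failures_left > 0. */
static void *mock_try_alloc(size_t size)
{
	if (mock_failures_left > 0) {
		mock_failures_left--;
		return NULL;
	}
	return malloc(size);
}

/* Stand-in for sleeping, reclaiming, or invoking the OOM killer. */
static void mock_reclaim(void)
{
	mock_reclaim_calls++;
}

void *mock_alloc(size_t size, unsigned int flags)
{
	for (;;) {
		void *p = mock_try_alloc(size);

		if (p)
			return p;
		/*
		 * Not allowed to sleep: nothing can be done to make
		 * progress, so the request fails. Combining this with
		 * MOCK_NOFAIL is the unsupported case.
		 */
		if (!(flags & MOCK_DIRECT_RECLAIM))
			return NULL;
		mock_reclaim();
		/* Simplification: ordinary requests give up after one reclaim. */
		if (!(flags & MOCK_NOFAIL))
			return NULL;
		/* NOFAIL and sleepable: retry until the attempt succeeds. */
	}
}
```

A caller that would otherwise write "while (!(ptr = alloc()));" passes
MOCK_DIRECT_RECLAIM | MOCK_NOFAIL instead and gets the same guarantee,
with the waiting done where reclaim can actually make progress.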

More clear?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19 12:09                 ` Michal Hocko
@ 2024-08-19 12:17                   ` Yafang Shao
  2024-08-19 14:01                     ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Yafang Shao @ 2024-08-19 12:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 8:09 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 19-08-24 19:56:53, Yafang Shao wrote:
> > On Mon, Aug 19, 2024 at 6:10 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > > >
> > > > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > > > choices:
> > > > > > > > >
> > > > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > > > >
> > > > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > > > __GFP_DIRECT_RECLAIM, right?
> > > > > > >
> > > > > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > > > > will sleep if necessary and it will trigger OOM killer if there is
> > > > > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > > > > a busy loop waiting for the allocation to proceed is allowed.
> > > > > >
> > > > > > That could be a livelock.
> > > > > > From the user's perspective, there's no noticeable difference between
> > > > > > a livelock, soft lockup, or hard lockup.
> > > > >
> > > > > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > > > > causing the entire system to stop functioning.
> > > >
> > > > When a livelock occurs, your only options are to migrate your
> > > > applications to other servers or reboot the system—there’s no other
> > > > resolution (except for using oomd, which is difficult for users
> > > > without cgroup2 or swap).
> > > >
> > > > So, there's effectively no difference.
> > >
> > > Could you express your options more clearly? I am guessing two
> > > possibilities?
> > > 1. entirely drop __GFP_NOFAIL and require all users who are
> > > using __GFP_NOFAIL to add error handlers instead?
> >
> > When the system is unstable—such as after reaching the maximum retries
> > without successfully allocating pages—simply failing the operation
> > might be the better option.
>
> It seems you are failing to understand the __GFP_NOFAIL semantic and you
> are circling around that. So let me repeat that for you here. Make sure
> you understand before going forward with the discussion. Feel free to ask
> if something is not clear, but please do not continue with what-if kinds
> of questions.
>
> __GFP_NOFAIL means that the caller has no way to deal with an allocation
> failure. The allocator simply cannot fail the request, even if that takes
> ages to succeed! To put it more simply, if you have code like
>
>         while (!(ptr = alloc()));
> or
>         BUG_ON(!(ptr = alloc()));
>
> then you had better use __GFP_NOFAIL rather than open-code the endless
> loop or the BUG_ON for the failure.
>
> Our (page, vmalloc, kmalloc) allocators do support that mode for
> allocations that are allowed to sleep. But those allocators have never
> supported, and are unlikely to support, atomic non-failing allocations.
>
> More clear?

 * New users should be evaluated carefully (and the flag should be
 * used only when there is no reasonable failure policy) but it is
 * definitely preferable to use the flag rather than opencode endless
 * loop around allocator.

The doc has already expressed what I mean.  My question is: why is that?
Why not let the caller loop around the allocator?

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 10:02         ` Barry Song
@ 2024-08-19 12:33           ` David Hildenbrand
  2024-08-19 12:48             ` Barry Song
  2024-08-19 12:49             ` Christoph Hellwig
  0 siblings, 2 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 12:33 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 12:02, Barry Song wrote:
> On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 11:47, Barry Song wrote:
>>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 17.08.24 08:24, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
>>>>> they don't check the return, we are exposed to the security risks for NULL
>>>>> deference.
>>>>>
>>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
>>>>> situation.
>>>>>
>>>>> Christoph Hellwig:
>>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
>>>>> allocation failures.  So in fact a straight BUG is the right thing
>>>>> here.
>>>>>
>>>>> Vlastimil Babka:
>>>>> It's just not a recoverable situation (WARN_ON is for recoverable
>>>>> situations). The caller cannot handle allocation failure and at the same
>>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
>>>>> with stacktrace etc. We don't need to hope for the later NULL pointer
>>>>> dereference (which might if really unlucky happen from a different
>>>>> context where it's no longer obvious what led to the allocation failing).
>>>>>
>>>>> Michal Hocko:
>>>>> Linus tends to be against adding new BUG() calls unless the failure is
>>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
>>>>> not sure how he would look at simply incorrect memory allocator usage to
>>>>> blow up the kernel. Now the argument could be made that those failures
>>>>> could cause subtle memory corruptions or even be exploitable which might
>>>>> be a sufficient reason to stop them early.
>>>>>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> Cc: Christoph Lameter <cl@linux.com>
>>>>> Cc: Pekka Enberg <penberg@kernel.org>
>>>>> Cc: David Rientjes <rientjes@google.com>
>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>>>> Cc: Kees Cook <kees@kernel.org>
>>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
>>>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
>>>>> Cc: Jason Wang <jasowang@redhat.com>
>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>> ---
>>>>>     include/linux/slab.h | 4 +++-
>>>>>     mm/page_alloc.c      | 4 +++-
>>>>>     mm/util.c            | 1 +
>>>>>     3 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>>>> index c9cb42203183..4a4d1fdc2afe 100644
>>>>> --- a/include/linux/slab.h
>>>>> +++ b/include/linux/slab.h
>>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>>>>     {
>>>>>         size_t bytes;
>>>>>
>>>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
>>>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>>                 return NULL;
>>>>> +     }
>>>>>
>>>>>         return kvmalloc_node_noprof(bytes, flags, node);
>>>>>     }
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 60742d057b05..d2c37f8f8d09 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>>>          * There are several places where we assume that the order value is sane
>>>>>          * so bail out early if the request is out of bound.
>>>>>          */
>>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>>>                 return NULL;
>>>>> +     }
>>>>>
>>>>>         gfp &= gfp_allowed_mask;
>>>>>         /*
>>>>> diff --git a/mm/util.c b/mm/util.c
>>>>> index ac01925a4179..678c647b778f 100644
>>>>> --- a/mm/util.c
>>>>> +++ b/mm/util.c
>>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>>>>>
>>>>>         /* Don't even allow crazy sizes */
>>>>>         if (unlikely(size > INT_MAX)) {
>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>
>>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
>>>
>>> Hi David,
>>> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
>>
>> Just return NULL? "shit in shit out" :) ?
> 
> Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL,
> as it's the caller's responsibility to check the return value. However, with
> __GFP_NOFAIL, users will directly dereference *(p + offset) even when
> p == NULL. It is how __GFP_NOFAIL is supposed to work.

If the caller is not supposed to pass that flag combination (shit in), 
we are not obligated to give a reasonable result (shit out).

My point is that we should let the caller (possibly?) crash -- the one 
that did something that is wrong -- instead of forcing a crash using 
BUG_ON in this code here.

It should all be caught during testing either way. And if some OOT 
module does something nasty, that's not our responsibility.

BUG_ON is not a way to write assertions into the code.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19 10:19     ` Michal Hocko
@ 2024-08-19 12:48       ` David Hildenbrand
  0 siblings, 0 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 12:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 12:19, Michal Hocko wrote:
> On Mon 19-08-24 11:44:39, David Hildenbrand wrote:
> [...]
>>>    	if (gfp_mask & __GFP_NOFAIL) {
>>>    		/*
>>> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
>>> -		 * of any new users that actually require GFP_NOWAIT
>>> +		 * All existing users of the __GFP_NOFAIL are blockable
>>> +		 * otherwise we introduce a busy loop with inside the page
>>> +		 * allocator from non-sleepable contexts
>>>    		 */
>>> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>>> -			goto fail;
>>> +		BUG_ON(!can_direct_reclaim);
>>
>> No new BUG_ON(), WARN_ON_ONCE() is good enough for something that should be
>> found during ordinary testing.
> 
> Do you mean
> 	if (WARN_ON_ONCE_GFP(...))
> 		goto retry?

Not really ...

but now I read the description more carefully and I am not sure why we 
are so into throwing around BUG_ONs here, for something that is simply 
not allowed and doesn't make sense.

If __GFP_NOFAIL is documented to

"
+ * It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM.
+ * It should _never_ be used in non-sleepable contexts.
"

Why not document

"__GFP_NOFAIL always implies __GFP_DIRECT_RECLAIM" and do exactly that?
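
As a rough illustration of what "implies" could mean in code, here is a
userspace sketch. The flag values and the helper mock_normalize_gfp()
are hypothetical stand-ins, not the kernel's gfp bits or any existing
kernel function; the sketch shows only the flag arithmetic.

```c
/* Hypothetical stand-ins for gfp bits; NOT the kernel's real values. */
#define MOCK_DIRECT_RECLAIM 0x1u
#define MOCK_NOFAIL         0x2u

/*
 * If the caller asked for a non-failing allocation, force the
 * "may sleep and reclaim" bit on; leave every other request untouched.
 */
unsigned int mock_normalize_gfp(unsigned int flags)
{
	if (flags & MOCK_NOFAIL)
		flags |= MOCK_DIRECT_RECLAIM;
	return flags;
}
```

This only rewrites the mask. Whether it is safe to force reclaim on a
caller that really is running in a non-sleepable context is a separate
question that the sketch does not answer.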

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:33           ` David Hildenbrand
@ 2024-08-19 12:48             ` Barry Song
  2024-08-19 12:49               ` David Hildenbrand
  2024-08-19 12:49             ` Christoph Hellwig
  1 sibling, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19 12:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Tue, Aug 20, 2024 at 12:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 12:02, Barry Song wrote:
> > On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.08.24 11:47, Barry Song wrote:
> >>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 17.08.24 08:24, Barry Song wrote:
> >>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> >>>>> they don't check the return, we are exposed to the security risks for NULL
> >>>>> deference.
> >>>>>
> >>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> >>>>> situation.
> >>>>>
> >>>>> Christoph Hellwig:
> >>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> >>>>> allocation failures.  So in fact a straight BUG is the right thing
> >>>>> here.
> >>>>>
> >>>>> Vlastimil Babka:
> >>>>> It's just not a recoverable situation (WARN_ON is for recoverable
> >>>>> situations). The caller cannot handle allocation failure and at the same
> >>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> >>>>> with stacktrace etc. We don't need to hope for the later NULL pointer
> >>>>> dereference (which might if really unlucky happen from a different
> >>>>> context where it's no longer obvious what led to the allocation failing).
> >>>>>
> >>>>> Michal Hocko:
> >>>>> Linus tends to be against adding new BUG() calls unless the failure is
> >>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> >>>>> not sure how he would look at simply incorrect memory allocator usage to
> >>>>> blow up the kernel. Now the argument could be made that those failures
> >>>>> could cause subtle memory corruptions or even be exploitable which might
> >>>>> be a sufficient reason to stop them early.
> >>>>>
> >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >>>>> Acked-by: Michal Hocko <mhocko@suse.com>
> >>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>>>> Cc: Christoph Lameter <cl@linux.com>
> >>>>> Cc: Pekka Enberg <penberg@kernel.org>
> >>>>> Cc: David Rientjes <rientjes@google.com>
> >>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> >>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >>>>> Cc: Kees Cook <kees@kernel.org>
> >>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> >>>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
> >>>>> Cc: Jason Wang <jasowang@redhat.com>
> >>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> >>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>>>> ---
> >>>>>     include/linux/slab.h | 4 +++-
> >>>>>     mm/page_alloc.c      | 4 +++-
> >>>>>     mm/util.c            | 1 +
> >>>>>     3 files changed, 7 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
> >>>>> index c9cb42203183..4a4d1fdc2afe 100644
> >>>>> --- a/include/linux/slab.h
> >>>>> +++ b/include/linux/slab.h
> >>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >>>>>     {
> >>>>>         size_t bytes;
> >>>>>
> >>>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> >>>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> >>>>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>>>>                 return NULL;
> >>>>> +     }
> >>>>>
> >>>>>         return kvmalloc_node_noprof(bytes, flags, node);
> >>>>>     }
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index 60742d057b05..d2c37f8f8d09 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> >>>>>          * There are several places where we assume that the order value is sane
> >>>>>          * so bail out early if the request is out of bound.
> >>>>>          */
> >>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> >>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> >>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
> >>>>>                 return NULL;
> >>>>> +     }
> >>>>>
> >>>>>         gfp &= gfp_allowed_mask;
> >>>>>         /*
> >>>>> diff --git a/mm/util.c b/mm/util.c
> >>>>> index ac01925a4179..678c647b778f 100644
> >>>>> --- a/mm/util.c
> >>>>> +++ b/mm/util.c
> >>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >>>>>
> >>>>>         /* Don't even allow crazy sizes */
> >>>>>         if (unlikely(size > INT_MAX)) {
> >>>>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>>>
> >>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
> >>>
> >>> Hi David,
> >>> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
> >>
> >> Just return NULL? "shit in shit out" :) ?
> >
> > Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL,
> > as it's the caller's responsibility to check the return value. However, with
> > __GFP_NOFAIL, users will directly dereference *(p + offset) even when
> > p == NULL. It is how __GFP_NOFAIL is supposed to work.
>
> If the caller is not supposed to pass that flag combination (shit in),
> we are not obligated to give a reasonable result (shit out).
>
> My point is that we should let the caller (possibly?) crash -- the one
> that did something that is wrong -- instead of forcing a crash using
> BUG_ON in this code here.
>
> It should all be caught during testing either way. And if some OOT
> module does something nasty, that's not our responsibility.
>
> BUG_ON is not a way to write assertions into the code.

It seems there was a misunderstanding regarding the purpose of
this change. We actually have many details in the changelog.

Its aim is not to write an assertion, but rather to prevent exposing
a security vulnerability.

Returning NULL doesn't necessarily crash the caller's process: a p->field
or *(p + offset) dereference could be used by attackers to exploit the
system.
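
A small userspace sketch of why the offset matters. The struct, the
helper names and the 4096-byte page size are assumptions made purely for
illustration; nothing here is taken from the patch. It computes the
address a p->field access would touch when p is NULL, without ever
dereferencing it, and checks whether that address still falls inside the
first page, which is the part of the address space that is typically
left unmapped.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096u /* assumed page size, for illustration only */

/* Hypothetical object with a large payload in front of the field. */
struct mock_big {
	char pad[1 << 20]; /* 1 MiB before the field */
	int field;
};

/* Address that "p->field" would touch if p were NULL. */
uintptr_t mock_null_field_addr(void)
{
	return (uintptr_t)offsetof(struct mock_big, field);
}

/* Non-zero if addr lies within the first page. */
int mock_hits_first_page(uintptr_t addr)
{
	return addr < MOCK_PAGE_SIZE;
}
```

With a small offset the access lands in the first page and would most
likely fault; with a large or caller-influenced offset it can land well
past it, so a crash is not guaranteed.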

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:33           ` David Hildenbrand
  2024-08-19 12:48             ` Barry Song
@ 2024-08-19 12:49             ` Christoph Hellwig
  2024-08-19 12:51               ` David Hildenbrand
  1 sibling, 1 reply; 101+ messages in thread
From: Christoph Hellwig @ 2024-08-19 12:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	torvalds, urezki, v-songbaohua, vbabka, virtualization,
	Christoph Hellwig, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
> It should all be caught during testing either way. And if some OOT module 
> does something nasty, that's not our responsibility.
>
> BUG_ON is not a way to write assertions into the code.

So you'd rather create exploits than crash on a fundamental API
violation?  That's exactly what the series is trying to fix.



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:48             ` Barry Song
@ 2024-08-19 12:49               ` David Hildenbrand
  2024-08-19 17:12                 ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 12:49 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	mhocko, penberg, rientjes, roman.gushchin, torvalds, urezki,
	v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 14:48, Barry Song wrote:
> On Tue, Aug 20, 2024 at 12:33 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 12:02, Barry Song wrote:
>>> On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 19.08.24 11:47, Barry Song wrote:
>>>>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 17.08.24 08:24, Barry Song wrote:
>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>
>>>>>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
>>>>>>> they don't check the return, we are exposed to the security risks for NULL
>>>>>>> deference.
>>>>>>>
>>>>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
>>>>>>> situation.
>>>>>>>
>>>>>>> Christoph Hellwig:
>>>>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
>>>>>>> allocation failures.  So in fact a straight BUG is the right thing
>>>>>>> here.
>>>>>>>
>>>>>>> Vlastimil Babka:
>>>>>>> It's just not a recoverable situation (WARN_ON is for recoverable
>>>>>>> situations). The caller cannot handle allocation failure and at the same
>>>>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
>>>>>>> with stacktrace etc. We don't need to hope for the later NULL pointer
>>>>>>> dereference (which might if really unlucky happen from a different
>>>>>>> context where it's no longer obvious what led to the allocation failing).
>>>>>>>
>>>>>>> Michal Hocko:
>>>>>>> Linus tends to be against adding new BUG() calls unless the failure is
>>>>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
>>>>>>> not sure how he would look at simply incorrect memory allocator usage to
>>>>>>> blow up the kernel. Now the argument could be made that those failures
>>>>>>> could cause subtle memory corruptions or even be exploitable which might
>>>>>>> be a sufficient reason to stop them early.
>>>>>>>
>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>>>>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>>>> Cc: Christoph Lameter <cl@linux.com>
>>>>>>> Cc: Pekka Enberg <penberg@kernel.org>
>>>>>>> Cc: David Rientjes <rientjes@google.com>
>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>>>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>>>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>>>>>> Cc: Kees Cook <kees@kernel.org>
>>>>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
>>>>>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
>>>>>>> Cc: Jason Wang <jasowang@redhat.com>
>>>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>>>> ---
>>>>>>>      include/linux/slab.h | 4 +++-
>>>>>>>      mm/page_alloc.c      | 4 +++-
>>>>>>>      mm/util.c            | 1 +
>>>>>>>      3 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>>>>>> index c9cb42203183..4a4d1fdc2afe 100644
>>>>>>> --- a/include/linux/slab.h
>>>>>>> +++ b/include/linux/slab.h
>>>>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>>>>>>      {
>>>>>>>          size_t bytes;
>>>>>>>
>>>>>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
>>>>>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
>>>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>>>>                  return NULL;
>>>>>>> +     }
>>>>>>>
>>>>>>>          return kvmalloc_node_noprof(bytes, flags, node);
>>>>>>>      }
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 60742d057b05..d2c37f8f8d09 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>>>>>           * There are several places where we assume that the order value is sane
>>>>>>>           * so bail out early if the request is out of bound.
>>>>>>>           */
>>>>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>>>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>>>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>>>>>                  return NULL;
>>>>>>> +     }
>>>>>>>
>>>>>>>          gfp &= gfp_allowed_mask;
>>>>>>>          /*
>>>>>>> diff --git a/mm/util.c b/mm/util.c
>>>>>>> index ac01925a4179..678c647b778f 100644
>>>>>>> --- a/mm/util.c
>>>>>>> +++ b/mm/util.c
>>>>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>>>>>>>
>>>>>>>          /* Don't even allow crazy sizes */
>>>>>>>          if (unlikely(size > INT_MAX)) {
>>>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>>>
>>>>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
>>>>>
>>>>> Hi David,
>>>>> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
>>>>
>>>> Just return NULL? "shit in shit out" :) ?
>>>
>>> Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL,
>>> as it's the caller's responsibility to check the return value. However, with
>>> __GFP_NOFAIL, users will directly dereference *(p + offset) even when
>>> p == NULL. It is how __GFP_NOFAIL is supposed to work.
>>
>> If the caller is not supposed to pass that flag combination (shit in),
>> we are not obligated to give a reasonable result (shit out).
>>
>> My point is that we should let the caller (possibly?) crash -- the one
>> that did something that is wrong -- instead of forcing a crash using
>> BUG_ON in this code here.
>>
>> It should all be caught during testing either way. And if some OOT
>> module does something nasty, that's not our responsibility.
>>
>> BUG_ON is not a way to write assertions into the code.
> 
> It seems there was a misunderstanding regarding the purpose of
> this change. We actually have many details in the changelog.
> 
> Its aim is not to write an assertion, but rather to prevent exposing
> a security vulnerability.
> 
> Returning NULL doesn't necessarily crash the caller's process; a p->field or
> *(p + offset) dereference could be used by hackers to exploit the system.

See my other reply to Michal: why do we even allow specifying them
separately and not simply let one enforce the other?

That might result in an issue elsewhere, but likely no security 
vulnerability?

I really hate each and every BUG_ON I have to stare at.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:49             ` Christoph Hellwig
@ 2024-08-19 12:51               ` David Hildenbrand
  2024-08-19 12:53                 ` Christoph Hellwig
  2024-08-19 13:05                 ` Barry Song
  0 siblings, 2 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 12:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	torvalds, urezki, v-songbaohua, vbabka, virtualization,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 14:49, Christoph Hellwig wrote:
> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>> It should all be caught during testing either way. And if some OOT module
>> does something nasty, that's not our responsibility.
>>
>> BUG_ON is not a way to write assertions into the code.
> 
> So you'd rather create exploits than crashing on a fundamental API
> violation?  That's exactly what the series is trying to fix.

I'd rather have a sane API that doesn't even allow this level of 
flexibility with NOFAIL.

But probably I'm missing more details here why this all has to be so 
complicated ;)

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:51               ` David Hildenbrand
@ 2024-08-19 12:53                 ` Christoph Hellwig
  2024-08-19 13:14                   ` David Hildenbrand
  2024-08-19 13:05                 ` Barry Song
  1 sibling, 1 reply; 101+ messages in thread
From: Christoph Hellwig @ 2024-08-19 12:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon, Aug 19, 2024 at 02:51:06PM +0200, David Hildenbrand wrote:
> On 19.08.24 14:49, Christoph Hellwig wrote:
>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>> It should all be caught during testing either way. And if some OOT module
>>> does something nasty, that's not our responsibility.
>>>
>>> BUG_ON is not a way to write assertions into the code.
>>
>> So you'd rather create exploits than crashing on a fundamental API
>> violation?  That's exactly what the series is trying to fix.
>
> I'd rather have a sane API that doesn't even allow this level of 
> flexibility with NOFAIL.
>
> But probably I'm missing more details here why this all has to be so 
> complicated ;)

Well, the only way to do that is to remove (__)GFP_NOFAIL and require
either explicit _nofail variants without a way to pass gfp flags, or
endless loops in the callers.  I suggested the first one earlier, but
no one liked it due to the API complexity and overhead.  And I've not
heard anyone arguing for the latter yet.

One other way might be a compile-time check that requires any GFP
flag that contains (__)GFP_NOFAIL to be a compile-time constant.
This is true for many but not all callers currently.
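
A minimal userspace sketch of what such a compile-time check could look
like. All names here (demo_gfp_t, DEMO_NOFAIL, DEMO_DIRECT_RECLAIM,
demo_alloc, demo_alloc_checked) are hypothetical stand-ins, not kernel
APIs. As a simplification, the macro demands constant flags for every
caller that goes through it, not only for those that pass NOFAIL:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int demo_gfp_t;

#define DEMO_DIRECT_RECLAIM 0x1u /* caller may sleep */
#define DEMO_NOFAIL         0x2u /* caller does not check for NULL */

/* NOFAIL is only meaningful when the caller is allowed to block. */
#define DEMO_FLAGS_SANE(flags) \
	(!((flags) & DEMO_NOFAIL) || ((flags) & DEMO_DIRECT_RECLAIM))

/* Stand-in allocator; the assert is a run-time backstop only. */
static void *demo_alloc(size_t size, demo_gfp_t flags)
{
	assert(DEMO_FLAGS_SANE(flags));
	return malloc(size);
}

/*
 * _Static_assert needs an integer constant expression, so the build
 * breaks both for non-constant flags and for a constant but invalid
 * combination such as DEMO_NOFAIL without DEMO_DIRECT_RECLAIM.
 */
#define demo_alloc_checked(size, flags)                                  \
	((void)sizeof(struct {                                           \
		_Static_assert(DEMO_FLAGS_SANE(flags),                   \
			       "NOFAIL needs constant, blocking flags"); \
		int unused;                                              \
	}),                                                              \
	 demo_alloc((size), (flags)))
```

A call such as demo_alloc_checked(32, DEMO_NOFAIL | DEMO_DIRECT_RECLAIM)
compiles, while demo_alloc_checked(32, DEMO_NOFAIL) is rejected at build
time. Callers with run-time flags would need a separate entry point that
never honours NOFAIL, which is where the "not all callers" caveat bites.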




* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-17  6:24 [PATCH v3 0/4] mm: clarify nofail memory allocation Barry Song
                   ` (3 preceding siblings ...)
  2024-08-17  6:24 ` [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
@ 2024-08-19 13:02 ` David Hildenbrand
  2024-08-19 16:05   ` Linus Torvalds
  4 siblings, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 13:02 UTC (permalink / raw)
  To: Barry Song, akpm, linux-mm
  Cc: 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua, vbabka,
	virtualization

On 17.08.24 08:24, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> -v3:
>   * collect reviewed-by, acked-by etc. Michal, Christoph, Vlastimil, Davidlohr,
>     thanks!
>   * use Jason's patch[1] to fix vdpa and refine his changelog.
>   * refine changelogs
> [1] https://lore.kernel.org/all/20240808054320.10017-1-jasowang@redhat.com/
> 
> -v2:
>   https://lore.kernel.org/linux-mm/20240731000155.109583-1-21cnbao@gmail.com/
> 
>   * adjust vpda fix according to Jason and Michal's feedback, I would
>     expect Jason to test it, thanks!
>   * split BUG_ON of unavoidable failure and the case GFP_ATOMIC |
>     __GFP_NOFAIL into two patches according to Vlastimil and Michal.
>   * collect Michal's acked-by for patch 2 - the doc;
>   * remove the new GFP_NOFAIL from this series, that one would be a
>     separate enhancement patchset later on.
> 
> -v1:
>   https://lore.kernel.org/linux-mm/20240724085544.299090-1-21cnbao@gmail.com/
> 
> __GFP_NOFAIL carries the semantics of never failing, so its callers
> do not check the return value:
>    %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>    cannot handle allocation failures. The allocation could block
>    indefinitely but will never return with failure. Testing for
>    failure is pointless.
> 

You should have called this series "mm: clarify nofail and BUG_ON with
invalid arguments with nofail"

I was a bit surprised to find that many BUG_ON in this series ;)

> However, __GFP_NOFAIL can sometimes fail if it exceeds size limits
> or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context.
> This can expose security vulnerabilities due to potential NULL
> dereferences.

Which code are we concerned about? Some OOT driver? Some
untested in-tree driver? Just trying to understand what we are afraid
of here. I'm afraid we cannot save the world from some OOT drivers :(

Are we aware of such security vulnerabilities (out of pure interest!)?

Is the expectation that these code paths are never triggered during 
early testing where a WARN_ON_ONCE() would just highlight -- in addition 
to a likely crash -- that something is very wrong?

> 
> Since __GFP_NOFAIL does not support non-blocking allocation, we introduce
> GFP_NOFAIL with inclusive blocking semantics and encourage using GFP_NOFAIL
> as a replacement for __GFP_NOFAIL in non-mm.

I still wonder if specifying GFP_NOFAIL should not simply come with the 
semantics "implies direct reclaim. If called for non-blocking 
allocations, bad things will happen (no security bad things though)." ;)

> 
> If we must still fail a nofail allocation, we should trigger a BUG rather
> than exposing NULL dereferences to callers who do not check the return
> value.

I am not convinced that BUG_ON is the right tool here to save the world, 
but I see how we arrived here.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:51               ` David Hildenbrand
  2024-08-19 12:53                 ` Christoph Hellwig
@ 2024-08-19 13:05                 ` Barry Song
  2024-08-19 13:10                   ` David Hildenbrand
  1 sibling, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19 13:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu,
	hch, iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	torvalds, urezki, v-songbaohua, vbabka, virtualization,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 14:49, Christoph Hellwig wrote:
> > On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
> >> It should all be caught during testing either way. And if some OOT module
> >> does something nasty, that's not our responsibility.
> >>
> >> BUG_ON is not a way to write assertions into the code.
> >
> > So you'd rather create exploits than crashing on a fundamental API
> > violation?  That's exactly what the series is trying to fix.
>
> I'd rather have a sane API that doesn't even allow this level of
> flexibility with NOFAIL.

Yes, I have already sent an RFC enforcing direct reclaim:
https://www.spinics.net/lists/linux-mm/msg394659.html

Somehow, it is not ready yet. I think Christoph prefers a scope
API rather than a GFP_NOFAIL which definitely has
__GFP_DIRECT_RECLAIM set. I guess you know I have
at least 5 series running, so it will happen soon though.

>
> But probably I'm missing more details here why this all has to be so
> complicated ;)

enforcing direct_reclamation is right and will work for a reasonable size.
but for this overflow size, even if we enforce direct_reclamation
in GFP_NOFAIL, we are still failing.
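
To make the "overflow size" case concrete, here is a small userspace
sketch of the two size checks touched by the patch (the multiplication
overflow check and the size > INT_MAX check). It calls the GCC/Clang
builtin __builtin_mul_overflow directly; the demo_* names are
hypothetical and only illustrate why no amount of reclaim can satisfy
such a request:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Store n * size in *bytes; return false if the product overflows. */
static bool demo_array_bytes(size_t n, size_t size, size_t *bytes)
{
	assert(bytes != NULL);
	return !__builtin_mul_overflow(n, size, bytes);
}

/*
 * A request is "impossible" when the byte count cannot be represented
 * or exceeds the INT_MAX limit. Blocking and retrying cannot help,
 * because the problem is the request itself, not memory pressure.
 */
static bool demo_request_is_impossible(size_t n, size_t size)
{
	size_t bytes;

	if (!demo_array_bytes(n, size, &bytes))
		return true;
	return bytes > (size_t)INT_MAX;
}
```

With these helpers, demo_request_is_impossible(SIZE_MAX, 2) is true
because the multiplication overflows, and so is
demo_request_is_impossible(INT_MAX, 2) because the result exceeds the
limit, whatever flags the caller passes.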

>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 13:05                 ` Barry Song
@ 2024-08-19 13:10                   ` David Hildenbrand
  2024-08-19 13:19                     ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 13:10 UTC (permalink / raw)
  To: Barry Song
  Cc: Christoph Hellwig, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu,
	hch, iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	torvalds, urezki, v-songbaohua, vbabka, virtualization,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 15:05, Barry Song wrote:
> On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 14:49, Christoph Hellwig wrote:
>>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>>> It should all be caught during testing either way. And if some OOT module
>>>> does something nasty, that's not our responsibility.
>>>>
>>>> BUG_ON is not a way to write assertions into the code.
>>>
>>> So you'd rather create exploits than crashing on a fundamental API
>>> violation?  That's exactly what the series is trying to fix.
>>
>> I'd rather have a sane API that doesn't even allow this level of
>> flexibility with NOFAIL.
> 
> yes, i have already sent a RFC enforcing direct_reclamation:
> https://www.spinics.net/lists/linux-mm/msg394659.html
> 
> somehow, it is not ready yet. i think Christoph prefers scope
> api rather than GFP_NOFAIL which definitely has
> __GFP_DIRECT_RECLAIM set. I guess you know I have
> at least  5 series running, so it will happen soon though.

That really sounds like the right thing to do, at least with the "direct 
reclaim" problem ...

> 
>>
>> But probably I'm missing more details here why this all has to be so
>> complicated ;)
> 
> enforcing direct_reclamation is right and will work for a reasonable size.
> but for this overflow size, even if we enforce direct_reclamation
> in GFP_NOFAIL, we are still failing.

Right, someone requested something completely impossible. It's harder to 
do something here that doesn't return NULL. Except WARN_ON_ONCE() and 
loop for all infinity.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:53                 ` Christoph Hellwig
@ 2024-08-19 13:14                   ` David Hildenbrand
  0 siblings, 0 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 13:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	torvalds, urezki, v-songbaohua, vbabka, virtualization,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 14:53, Christoph Hellwig wrote:
> On Mon, Aug 19, 2024 at 02:51:06PM +0200, David Hildenbrand wrote:
>> On 19.08.24 14:49, Christoph Hellwig wrote:
>>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>>> It should all be caught during testing either way. And if some OOT module
>>>> does something nasty, that's not our responsibility.
>>>>
>>>> BUG_ON is not a way to write assertions into the code.
>>>
>>> So you'd rather create exploits than crashing on a fundamental API
>>> violation?  That's exactly what the series is trying to fix.
>>
>> I'd rather have a sane API that doesn't even allow this level of
>> flexibility with NOFAIL.
>>
>> But probably I'm missing more details here why this all has to be so
>> complicated ;)
> 
> Well, the only way to do that is to remove (__)GFP_NOFAIL and require
> either explicit _nofail variants without a way to pass gfp flags, or
> endless loops in the callers.  I suggested the first one earlier, but
> no one liked it due to the API complexity and overhead.  And I've not
> heard anyone arguing for the latter yet.

Right, and the "allocate more than even possible" case is extremely ugly.

> 
> One other way might be a compile-time check that requires any GFP
> flag that contains (__)GFP_NOFAIL to be a compile-time constant.
> This is true for many but not all callers currently.

Right.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 13:10                   ` David Hildenbrand
@ 2024-08-19 13:19                     ` Barry Song
  2024-08-19 13:22                       ` David Hildenbrand
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19 13:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu,
	hch, iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	torvalds, urezki, v-songbaohua, vbabka, virtualization,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Tue, Aug 20, 2024 at 1:10 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 15:05, Barry Song wrote:
> > On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.08.24 14:49, Christoph Hellwig wrote:
> >>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
> >>>> It should all be caught during testing either way. And if some OOT module
> >>>> does something nasty, that's not our responsibility.
> >>>>
> >>>> BUG_ON is not a way to write assertions into the code.
> >>>
> >>> So you'd rather create exploits than crashing on a fundamental API
> >>> violation?  That's exactly what the series is trying to fix.
> >>
> >> I'd rather have a sane API that doesn't even allow this level of
> >> flexibility with NOFAIL.
> >
> > yes, i have already sent a RFC enforcing direct_reclamation:
> > https://www.spinics.net/lists/linux-mm/msg394659.html
> >
> > somehow, it is not ready yet. i think Christoph prefers scope
> > api rather than GFP_NOFAIL which definitely has
> > __GFP_DIRECT_RECLAIM set. I guess you know I have
> > at least  5 series running, so it will happen soon though.
>
> That really sounds like the right thing to do, at least with the "direct
> reclaim" problem ...
>
> >
> >>
> >> But probably I'm missing more details here why this all has to be so
> >> complicated ;)
> >
> > enforcing direct_reclamation is right and will work for a reasonable size.
> > but for this overflow size, even if we enforce direct_reclamation
> > in GFP_NOFAIL, we are still failing.
>
> Right, someone requested something completely impossible. It's harder to
> do something here that doesn't return NULL. Except WARN_ON_ONCE() and
> loop for all infinity.

Returning NULL can introduce security vulnerabilities. While I’m not a hacker,
it’s hard to predict how they might exploit this. If we want to avoid using
BUG_ON, an alternative approach could be as follows:

while(gfp & __GFP_NOFAIL) some_cpu_relaxed_job;    ?

>
> --
> Cheers,
>
> David / dhildenb
>



* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 13:19                     ` Barry Song
@ 2024-08-19 13:22                       ` David Hildenbrand
  0 siblings, 0 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 13:22 UTC (permalink / raw)
  To: Barry Song
  Cc: Christoph Hellwig, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu,
	hch, iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	torvalds, urezki, v-songbaohua, vbabka, virtualization,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 15:19, Barry Song wrote:
> On Tue, Aug 20, 2024 at 1:10 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 15:05, Barry Song wrote:
>>> On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 19.08.24 14:49, Christoph Hellwig wrote:
>>>>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>>>>> It should all be caught during testing either way. And if some OOT module
>>>>>> does something nasty, that's not our responsibility.
>>>>>>
>>>>>> BUG_ON is not a way to write assertions into the code.
>>>>>
>>>>> So you'd rather create exploits than crashing on a fundamental API
>>>>> violation?  That's exactly what the series is trying to fix.
>>>>
>>>> I'd rather have a sane API that doesn't even allow this level of
>>>> flexibility with NOFAIL.
>>>
>>> yes, i have already sent a RFC enforcing direct_reclamation:
>>> https://www.spinics.net/lists/linux-mm/msg394659.html
>>>
>>> somehow, it is not ready yet. i think Christoph prefers scope
>>> api rather than GFP_NOFAIL which definitely has
>>> __GFP_DIRECT_RECLAIM set. I guess you know I have
>>> at least  5 series running, so it will happen soon though.
>>
>> That really sounds like the right thing to do, at least with the "direct
>> reclaim" problem ...
>>
>>>
>>>>
>>>> But probably I'm missing more details here why this all has to be so
>>>> complicated ;)
>>>
>>> enforcing direct_reclamation is right and will work for a reasonable size.
>>> but for this overflow size, even if we enforce direct_reclamation
>>> in GFP_NOFAIL, we are still failing.
>>
>> Right, someone requested something completely impossible. It's harder to
>> do something here that doesn't return NULL. Except WARN_ON_ONCE() and
>> loop for all infinity.
> 
> Returning NULL can introduce security vulnerabilities. While I’m not a hacker,
> it’s hard to predict how they might exploit this. If we want to avoid using
> BUG_ON, an alternative approach could be as follows:
> 
> while(gfp & __GFP_NOFAIL) some_cpu_relaxed_job;    ?

At least for the "allocate something impossible in size", retrying 
forever makes *some* sense (with a warning, of course). We would retry 
until it works for sizes that are possible.
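
A rough userspace sketch of "warn once, then retry until it works".
The backend is passed in as a function pointer so the loop can be
exercised with a fake allocator; demo_alloc_retry_forever,
demo_try_alloc_fn and demo_flaky_alloc are hypothetical names, and a
real implementation would sleep or relax the CPU between attempts:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef void *(*demo_try_alloc_fn)(size_t size);

/*
 * Never returns NULL: warns on the first failure and keeps retrying.
 * For a size the backend can never satisfy, this loops forever.
 */
static void *demo_alloc_retry_forever(size_t size,
				      demo_try_alloc_fn try_alloc,
				      unsigned long *retries)
{
	int warned = 0;

	assert(try_alloc != NULL && retries != NULL);
	*retries = 0;
	for (;;) {
		void *p = try_alloc(size);

		if (p)
			return p;
		if (!warned) {
			fprintf(stderr, "nofail allocation of %zu bytes failed, retrying\n",
				size);
			warned = 1;
		}
		(*retries)++;
		/* A kernel version would reschedule or relax the CPU here. */
	}
}

/* Fake backend for the example: fails twice, then behaves like malloc(). */
static int demo_failures_left = 2;

static void *demo_flaky_alloc(size_t size)
{
	if (demo_failures_left > 0) {
		demo_failures_left--;
		return NULL;
	}
	return malloc(size);
}
```

The caller never sees NULL, which keeps the nofail contract, at the
price of a task that hangs visibly when the request is impossible.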

For the other case, we should just make __GFP_NOFAIL imply whatever is 
currently a "must". Returning NULL here is indeed strange.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL
  2024-08-19 12:17                   ` Yafang Shao
@ 2024-08-19 14:01                     ` Michal Hocko
  0 siblings, 0 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-19 14:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Lorenzo Stoakes,
	Kees Cook, Eugenio Pérez, Jason Wang, Maxime Coquelin,
	Michael S. Tsirkin, Xuan Zhuo

On Mon 19-08-24 20:17:36, Yafang Shao wrote:
> My question is why is that ? Why not let it loop around the allocator?

For two reasons. It is much easier to see NOFAIL allocations when
they are annotated properly (try to grep for all sorts of endless loops),
and the allocator can also apply certain heuristics if it knows that the
allocation must not fail.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-19 13:02 ` [PATCH v3 0/4] mm: clarify nofail memory allocation David Hildenbrand
@ 2024-08-19 16:05   ` Linus Torvalds
  2024-08-19 19:23     ` Barry Song
  2024-08-21 12:40     ` Yafang Shao
  0 siblings, 2 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-19 16:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin, urezki,
	v-songbaohua, vbabka, virtualization

On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> >
> > If we must still fail a nofail allocation, we should trigger a BUG rather
> > than exposing NULL dereferences to callers who do not check the return
> > value.
>
> I am not convinced that BUG_ON is the right tool here to save the world,
> but I see how we arrived here.

I think the thing to do is to just add a

     WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));

or similar, where that bad_nofail_alloc() checks that the allocation
order is small and that the flags are sane for a NOFAIL allocation.
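
bad_nofail_alloc() is not defined anywhere in this thread, so the
following is only a userspace sketch of what such a helper might check.
The flag values, the order threshold (DEMO_NOFAIL_MAX_ORDER) and the
single-flag demo_warn_on_once() stand-in for WARN_ON_ONCE() are all
assumptions made for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int demo_gfp_t;

#define DEMO_DIRECT_RECLAIM   0x1u /* caller may sleep */
#define DEMO_NOFAIL           0x2u /* caller does not check for NULL */
#define DEMO_NOFAIL_MAX_ORDER 1u   /* assumed meaning of "small" */

/* True when a NOFAIL request could not possibly be honoured. */
static bool bad_nofail_alloc(unsigned int order, demo_gfp_t flags)
{
	if (order > DEMO_NOFAIL_MAX_ORDER)
		return true;
	if (!(flags & DEMO_DIRECT_RECLAIM))
		return true;
	return false;
}

/*
 * Stand-in for WARN_ON_ONCE(): prints at most once and returns the
 * condition. The kernel macro keeps one flag per call site; this
 * sketch keeps a single flag for simplicity.
 */
static bool demo_warn_on_once(bool cond, const char *msg)
{
	static bool warned;

	assert(msg != NULL);
	if (cond && !warned) {
		warned = true;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

static bool demo_nofail_is_bogus(unsigned int order, demo_gfp_t flags)
{
	return demo_warn_on_once((flags & DEMO_NOFAIL) &&
				 bad_nofail_alloc(order, flags),
				 "bogus NOFAIL allocation");
}
```

An allocator following this suggestion would warn and return NULL when
demo_nofail_is_bogus() is true, leaving the resulting oops in the caller.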

Because no, BUG_ON() is *never* the answer. The answer is to make sure
nobody ever sets NOFAIL in situations where the allocation can fail
and there is no way forward.

A BUG_ON() will quite likely just make things worse. You're better off
with a WARN_ON() and letting the caller just oops.

Honestly, I'm perfectly fine with just removing that stupid useless
flag entirely. The flag goes back to 2003 and was introduced in
2.5.69, and was meant to be for very particular uses that otherwise
just looped waiting for memory.

Back in 2.5.69, there was exactly one user: the jbd journal code, that
did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
expanded by another user in XFS, and even that one had a comment
saying that it needed to be narrowed down. And in fact, by the 2.6.12
release, that XFS use had been removed, but the jbd journal had grown
another jbd_kmalloc case for transaction data. So at the beginning of
the git archives, we had exactly *one* user (with two places).

*THAT* is the kind of use that the flag was meant for: small
allocations required to make forward progress in writeout during
memory pressure.

It has then expanded and is now a problem. The cases using GFP_NOFAIL
for things like vmalloc() - which is by definition not a small
allocation - should be just removed as outright bugs.

Note that we had this comment back in 2010:

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.  This modifier is deprecated and no new
 * users should be added.

and then it was softened in 2015 to the current

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. New users should be evaluated carefully
  ...

and clearly that "evaluated carefully" actually never happened, so the
new comment is just garbage.

I wonder how many modern users of GFP_NOFAIL are simply due to
over-eager allocation failure injection testing, and then people added
GFP_NOFAIL just because it shut up the mindless random allocation
failures.

I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
actually return an error just fine, but there was this crazy worry
about the IPC layer initialization failing:

   https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/

Things like that, where people just added mindless "theoretical
concerns" issues, or possibly had some error injection module that
inserted impossible failures.

I do NOT want those things to become BUG_ON()'s. It's better to just
return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
happen in the actual bad place that did an invalid allocation.

Because the blame should go *there*, and it should not even remotely
look like "oh, the MM code failed". No. The caller was garbage.

So no. No MM BUG_ON code.

                    Linus


* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 12:49               ` David Hildenbrand
@ 2024-08-19 17:12                 ` Michal Hocko
  2024-08-19 17:17                   ` Linus Torvalds
  2024-08-19 20:24                   ` David Hildenbrand
  0 siblings, 2 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-19 17:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon 19-08-24 14:49:55, David Hildenbrand wrote:
> On 19.08.24 14:48, Barry Song wrote:
[...]
> > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > > index 60742d057b05..d2c37f8f8d09 100644
> > > > > > > > --- a/mm/page_alloc.c
> > > > > > > > +++ b/mm/page_alloc.c
> > > > > > > > @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> > > > > > > >           * There are several places where we assume that the order value is sane
> > > > > > > >           * so bail out early if the request is out of bound.
> > > > > > > >           */
> > > > > > > > -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> > > > > > > > +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> > > > > > > > +             BUG_ON(gfp & __GFP_NOFAIL);
> > > > > > > >                  return NULL;
> > > > > > > > +     }
[...]
> > Returning NULL doesn't necessarily crash the caller's process; a p->field or
> > *(p + offset) dereference could be used by hackers to exploit the system.
> 
> See my other reply to Michal: why do we even allow specifying them
> separately and not simply let one enforce the other?

Are you replying to this patch? This is not about a combination of
flags. This is about the above (and other similar) boundary checks, which
return NULL if the size is deemed incorrect. I think those are potential
problems because a missing input check could be turned into potentially
malicious code: the return value goes unchecked (because NOFAIL never
fails, right?), so the failure might not even oops and could instead
become a silent read/write into memory.
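
A toy model of why an unchecked NULL does not always fault, assuming
only a fixed low address range is guaranteed to trap. The 4096-byte
bound (DEMO_GUARD_BYTES) and the demo_* structures are illustrative
assumptions; the real behaviour depends on architecture and
configuration. The code only computes addresses and never dereferences
them:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: only addresses below this bound are certain to fault. */
#define DEMO_GUARD_BYTES 4096u

struct demo_small { int field; };
struct demo_large { char table[1 << 20]; int field; };

/* The large structure's field lies beyond the assumed guard range. */
static_assert(offsetof(struct demo_large, field) >= DEMO_GUARD_BYTES,
	      "demo_large.field must sit past the guard range");

/* Address that p->member or p + offset would touch when p == NULL. */
static uintptr_t demo_null_target(size_t offset)
{
	return (uintptr_t)0 + offset;
}

/* True when the access lands in the range assumed to always trap. */
static bool demo_null_access_is_caught(size_t offset)
{
	return demo_null_target(offset) < DEMO_GUARD_BYTES;
}
```

Under this model a NULL struct demo_small pointer traps on p->field,
while a NULL struct demo_large pointer targets address 1 MiB, where
nothing guarantees a fault if something happens to be mapped.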

Whether to BUG_ON or simply loop for ever in the allocator if somebody
requests non-sleeping NOFAIL allocation is a different story.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 17:12                 ` Michal Hocko
@ 2024-08-19 17:17                   ` Linus Torvalds
  2024-08-19 20:24                   ` David Hildenbrand
  1 sibling, 0 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-19 17:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization,
	Christoph Hellwig, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon, 19 Aug 2024 at 10:12, Michal Hocko <mhocko@suse.com> wrote:
>
> Whether to BUG_ON or simply loop for ever in the allocator if somebody
> requests non-sleeping NOFAIL allocation is a different story.

Just return NULL.

The bug isn't in the VM. It's in the caller. Don't take on other
people's problems.

It was never valid to say "I want to allocate lots of memory and you
can't fail".

Don't validate that kind of bogus behavior, just tell them "you're
bad" and return NULL.

            Linus



* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-19 16:05   ` Linus Torvalds
@ 2024-08-19 19:23     ` Barry Song
  2024-08-19 19:33       ` Linus Torvalds
  2024-08-21 12:40     ` Yafang Shao
  1 sibling, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-19 19:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu,
	hch, iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	urezki, v-songbaohua, vbabka, virtualization

On Tue, Aug 20, 2024 at 4:05 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> > >
> > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > than exposing NULL dereferences to callers who do not check the return
> > > value.
> >
> > I am not convinced that BUG_ON is the right tool here to save the world,
> > but I see how we arrived here.
>
> I think the thing to do is to just add a
>
>      WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(oder, flags));
>
> or similar, where that bad_nofail_alloc() checks that the allocation
> order is small and that the flags are sane for a NOFAIL allocation.
>
> Because no, BUG_ON() is *never* the answer. The answer is to make sure
> nobody ever sets NOFAIL in situations where the allocation can fail
> and there is no way forward.
>
> A BUG_ON() will quite likely just make things worse. You're better off
> with a WARN_ON() and letting the caller just oops.

That could be an exploit taking advantage of those improper callers,
thus it wouldn’t necessarily result in an immediate oops in callers but
result in an exploit such as

p = malloc(gfp_nofail);

using p->field
or
using p + offset

where p->field or p + offset might refer to attacker-controlled code even
though p is NULL.

Could we consider infinitely blocking those bad callers:
while (illegal_gfp_nofail)  do_sth_relax_cpu();

>
> Honestly, I'm perfectly fine with just removing that stupid useless
> flag entirely. The flag goes back to 2003 and was introduced in
> 2.5.69, and was meant to be for very particular uses that otherwise
> just looped waiting for memory.
>
> Back in 2.5.69, there was exactly one user: the jbd journal code, that
> did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
> expanded by another user in XFS, and even that one had a comment
> saying that it needed to be narrowed down. And in fact, by the 2.6.12
> release, that XFS use had been removed, but the jbd journal had grown
> another jbd_kmalloc case for transaction data. So at the beginning of
> the git archives, we had exactly *one* user (with two places).
>
> *THAT* is the kind of use that the flag was meant for: small
> allocations required to make forward progress in writeout during
> memory pressure.
>
> It has then expanded and is now a problem. The cases using GFP_NOFAIL
> for things like vmalloc() - which is by definition not a small
> allocation - should be just removed as outright bugs.
>
> Note that we had this comment back in 2010:
>
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.  This modifier is deprecated and no new
>  * users should be added.
>
> and then it was softened in 2015 to the current
>
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures. New users should be evaluated carefully
>   ...
>
> and clearly that "evaluated carefully" actually never happened, so the
> new comment is just garbage.
>
> I wonder how many modern users of GFP_NOFAIL are simply due to
> over-eager allocation failure injection testing, and then people added
> GFP_NOFAIL just because it shut up the mindless random allocation
> failures.
>
> I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
> actually return an error just fine, but there was this crazy worry
> about the IPC layer initialization failing:
>
>    https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/
>
> Things like that, where people just added mindless "theoretical
> concerns" issues, or possibly had some error injection module that
> inserted impossible failures.
>
> I do NOT want those things to become BUG_ON()'s. It's better to just
> return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
> happen in the actual bad place that did an invalid allocation.
>
> Because the blame should go *there*, and it should not even remotely
> look like "oh, the MM code failed". No. The caller was garbage.
>
> So no. No MM BUG_ON code.
>
>                     Linus


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-19 19:23     ` Barry Song
@ 2024-08-19 19:33       ` Linus Torvalds
  2024-08-19 21:48         ` Barry Song
  2024-08-20  6:24         ` Michal Hocko
  0 siblings, 2 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-19 19:33 UTC (permalink / raw)
  To: Barry Song
  Cc: David Hildenbrand, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu,
	hch, iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	urezki, v-songbaohua, vbabka, virtualization

On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
>
>
> That could be an exploit taking advantage of those improper callers,

So?

FIX THE BUGGY CODE.

Don't make insane and incorrect changes to the MM code and spread
Fear, Uncertainty and Doubt.

> thus it wouldn’t necessarily result in an immediate oops in callers but
> result in an exploit

No. Any bug can be an exploit. Don't try to make this something
special by calling it an exploit.

NULL pointer dereferences are some of the *least* worrisome bugs,
because we don't allow people to mmap the NULL area anyway.

So just stop spreading FUD. We don't improve the kernel by making
excuses for bugs, we improve it by fixing things.

And any caller that asks for NOFAIL with bad parameters is buggy. The
MM code should NOT try to fix it up, and dammit, BUG_ON() is not
acceptable as a debugging help. Never was, never will be.

Worry-warts already do "reboot-on-warn".

            Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 17:12                 ` Michal Hocko
  2024-08-19 17:17                   ` Linus Torvalds
@ 2024-08-19 20:24                   ` David Hildenbrand
  2024-08-19 20:35                     ` Linus Torvalds
  1 sibling, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 20:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Barry Song, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch,
	iamjoonsoo.kim, penberg, rientjes, roman.gushchin, torvalds,
	urezki, v-songbaohua, vbabka, virtualization, Christoph Hellwig,
	Lorenzo Stoakes, Kees Cook, Eugenio Pérez, Jason Wang,
	Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 19:12, Michal Hocko wrote:
> On Mon 19-08-24 14:49:55, David Hildenbrand wrote:
>> On 19.08.24 14:48, Barry Song wrote:
> [...]
>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>> index 60742d057b05..d2c37f8f8d09 100644
>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>>>>>>>            * There are several places where we assume that the order value is sane
>>>>>>>>>            * so bail out early if the request is out of bound.
>>>>>>>>>            */
>>>>>>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>>>>>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>>>>>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>>>>>>>                   return NULL;
>>>>>>>>> +     }
> [...]
>>> Returning NULL doesn't necessarily crash the caller's process, p->field,
>>> *(p + offset) deference could be used by hackers to exploit the system.
>>
>> See my other reply to Michal: why do we even allow to specify them
>> separately and not simply let one enforce the other?
> 
> Are you replying to this patch? This is not about a combination of
> flags. This is about the above (and other similar) boundary checks which
> return NULL if the size is deemed incorrect. 
> I think those are potential
> problems because it could be a lack of input check which could be turned
> into a potentially malicious code. Because unchecked (return value
> because NOFAIL never fails, right?) return value might even not OOPs and
> become a silent read/write into memory.

Right, that's a different kind of issue than the simple "don't pass 
stupid flag combinations" thing where "we'll fix that up for you" is 
more reasonable.

Possibly NOFAIL allocations with an allocation size that is not a small
compile-time constant already have a bad smell to them, but I'm sure there
are reasonable exceptions ...

> 
> Whether to BUG_ON or simply loop for ever in the allocator if somebody
> requests non-sleeping NOFAIL allocation is a different story.

Right, "warn + loop forever" is one alternative where you could at least 
keep the system alive to some degree. Satisfying a large allocation 
might take a long time, satisfying a "too large" allocation would take 
forever.

But as Linus says, it's all workarounds for other buggy code, to make 
buggy code less exploitable, maybe, ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 20:24                   ` David Hildenbrand
@ 2024-08-19 20:35                     ` Linus Torvalds
  2024-08-19 21:57                       ` David Hildenbrand
  0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-19 20:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization,
	Christoph Hellwig, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote:
>
> Right, "warn + loop forever" is one alternative where you could at least
> keep the system alive to some degree.

Maybe. Or it might just lock up the machine.

For small allocations looping forever is probably fine, because in
practice there's always *something* that can be thrown out.

But anything larger than order-3 (handwavy, but that was our
historical limit, I think, and we call it PAGE_ALLOC_COSTLY_ORDER) has
to fail at _some_ point, and the caller setting GFP_NOFAIL is just
fantasy and "Daddy, I want a pony", and should be ignored.

With a WARN_ON_ONCE(), by all means, so that people can see who the
fantasist is.

            Linus
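
A minimal userspace sketch of the policy described in the message above,
not kernel code: a NOFAIL request above order-3 is warned about once and
the flag is then ignored, so the request becomes an ordinary allocation
that may return NULL. MODEL_GFP_NOFAIL, warn_once() and effective_flags()
are hypothetical names made up for illustration; only the value 3 for
PAGE_ALLOC_COSTLY_ORDER is taken from the thread.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's gfp bit; the value is arbitrary. */
#define MODEL_GFP_NOFAIL   0x1u
#define MODEL_COSTLY_ORDER 3u	/* PAGE_ALLOC_COSTLY_ORDER in the kernel */

static unsigned int nofail_warnings;	/* number of warnings printed */

/* Report only the first offender, loosely modelling WARN_ON_ONCE(). */
static bool warn_once(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		nofail_warnings++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/*
 * Return the flags that will actually be honoured. NOFAIL is dropped
 * for orders above the costly threshold, so the caller sees a normal
 * allocation failure instead of a BUG or an endless loop.
 */
static unsigned int effective_flags(unsigned int flags, unsigned int order)
{
	bool bogus = (flags & MODEL_GFP_NOFAIL) && order > MODEL_COSTLY_ORDER;

	if (warn_once(bogus, "bogus NOFAIL on a costly-order allocation"))
		flags &= ~MODEL_GFP_NOFAIL;
	return flags;
}
```

In this model the warning names the offending request once, and every
later bogus request is still stripped of NOFAIL without further noise.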

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-19 19:33       ` Linus Torvalds
@ 2024-08-19 21:48         ` Barry Song
  2024-08-20  6:24         ` Michal Hocko
  1 sibling, 0 replies; 101+ messages in thread
From: Barry Song @ 2024-08-19 21:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, akpm, linux-mm, 42.hyeyoo, cl, hailong.liu,
	hch, iamjoonsoo.kim, mhocko, penberg, rientjes, roman.gushchin,
	urezki, v-songbaohua, vbabka, virtualization

On Tue, Aug 20, 2024 at 7:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
> >
> >
> > That could be an exploit taking advantage of those improper callers,
>
> So?
>
> FIX THE BUGGY CODE.

That's definitely in progress, with patch 1/4 addressing vdpa. There's also
an RFC to enforce __GFP_DIRECT_RECLAIM for __GFP_NOFAIL, which
will prevent passing unsupported flags to the memory management
system:

https://lore.kernel.org/all/20240724085544.299090-6-21cnbao@gmail.com/

>
> Don't make insane and incorrect changes to the MM code and spread
> Fear, Uncertainty and Doubt.
>
> > thus it wouldn’t necessarily result in an immediate oops in callers but
> > result in an exploit
>
> No. Any bug can be an exploit. Don't try to make this something
> special by calling it an exploit.
>
> NULL pointer dereferences are some of the *least* worrisome bugs,
> because we don't allow people to mmap the NULL area anyway.
>
> So just stop spreading FUD. We don't improve the kernel by making
> excuses for bugs, we improve it by fixing things.
>
> And any caller that asks for NOFAIL with bad parameters is buggy. The
> MM code should NOT try to fix it up, and dammit, BUG_ON() is not
> acceptable as a debugging help. Never was, never will be.

Okay, I see your point. However, the discussion originally began with just
a simple WARN_ON() to flag improper usage:
https://lore.kernel.org/linux-mm/20240717230025.77361-1-21cnbao@gmail.com/

Now, it seems we've come full circle and are opting to use
WARN_ON_ONCE() instead?

>
> Worry-warts already do "reboot-on-warn".
>
>             Linus

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 20:35                     ` Linus Torvalds
@ 2024-08-19 21:57                       ` David Hildenbrand
  2024-08-19 22:13                         ` Linus Torvalds
  2024-08-20  6:17                         ` Michal Hocko
  0 siblings, 2 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-19 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization,
	Christoph Hellwig, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On 19.08.24 22:35, Linus Torvalds wrote:
> On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote:
>>
>> Right, "warn + loop forever" is one alternative where you could at least
>> keep the system alive to some degree.
> 
> Maybe. Or it might just lock up the machine.

Yes, regarding the security concerns it might be a bit better that way. 
(no security expert, so I cannot judge ...)

So, instead of saying "No, you can't have a pony, stupid", you would say 
"well, let me talk to your mother about that (WARN)" ... and just never 
reply to the real question ... :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 21:57                       ` David Hildenbrand
@ 2024-08-19 22:13                         ` Linus Torvalds
  2024-08-20  6:17                         ` Michal Hocko
  1 sibling, 0 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-19 22:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization,
	Christoph Hellwig, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon, 19 Aug 2024 at 14:57, David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 22:35, Linus Torvalds wrote:
> >
> > Maybe. Or it might just lock up the machine.
>
> Yes, regarding the security concerns it might be a bit better that way.
> (no security expert, so I cannot judge ...)

The thing is, in a static universe where time does not flow, that might be true.

But in such a universe, we wouldn't actually be alive, much less do
kernel development.

In *REALITY*, security is not a single point in time. Real security is
about the *future* more than it is about anything else.

From a user perspective, security is very much about things like "keep
your system up-to-date", because we _know_ old systems have (known)
bugs.

But the other side of that is that from a developer perspective, the
#1 thing in security is about fixing bugs.

Yes, you want to be conservative so that bugs you haven't fixed yet
don't cause problems, but that "be conservative" absolutely *has* to
be priority #2.

Because no amount of conservatism is going to help you unless you
actively fix the bugs.

And the thing is, a locked-up machine is really really hard to debug.
I can tell you from experience just what the bug reports will look
like, but I think you can guess.

So locking up is one of the worst things you can do. You are pretty
much always much better off making forward progress and try to report
the issue.

Including for security reasons. Because you'd rather have a security
issue today that you can debug, and fix for tomorrow.

A dead machine with a bug you can't debug is not actually suddenly a
really secure machine. Because somebody will just figure out a way to
take advantage of *that* bug too indirectly.

Being able to kill machines at your leisure is a great way to stress
other parts of your network, and suddenly that dead machine might be a
big security hazard when somebody figures out a weakness elsewhere
("Oh, look, there's a bug in the fail-over that allows me to do X").

              Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails
  2024-08-19 21:57                       ` David Hildenbrand
  2024-08-19 22:13                         ` Linus Torvalds
@ 2024-08-20  6:17                         ` Michal Hocko
  1 sibling, 0 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-20  6:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization,
	Christoph Hellwig, Lorenzo Stoakes, Kees Cook, Eugenio Pérez,
	Jason Wang, Maxime Coquelin, Michael S. Tsirkin, Xuan Zhuo

On Mon 19-08-24 23:57:29, David Hildenbrand wrote:
> On 19.08.24 22:35, Linus Torvalds wrote:
> > On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > Right, "warn + loop forever" is one alternative where you could at least
> > > keep the system alive to some degree.
> > 
> > Maybe. Or it might just lock up the machine.
> 
> Yes, regarding the security concerns it might be a bit better that way. (no
> security expert, so I cannot judge ...)

Would it make sense to simply enforce an oops? We do expect that an
incorrect usage would trigger one but we have no guarantee because the
actual user could be
array = kvmalloc(unsupported_size_provided_from_userspace, GFP_NOFAIL)

which might actually be a valid use case because the normally
requested size is large, yet reasonable. A lack of user input checks is
just a sad reality we have to live with. While those bugs absolutely
_need_ to be fixed, it is better not to let them reach
array[large_index] = payload
and make the kernel easier to exploit. Sure, you will get a warning,
but your machine has been compromised.

BUG_ON will shoot the whole machine down which I do understand is just
too drastic of a measure. Making the allocation loop for ever with
cond_resched() or a short sleep is slightly better because it contains
the bad user but the process context is still not killable that way,
which is a problem on its own. I am not aware of an OOPS_ON that would kill
the calling user process. Yes, this could still be leaving locks behind
but still better than a compromised system.

WDYT?
-- 
Michal Hocko
SUSE Labs
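
A small userspace sketch of the arithmetic behind the
array[large_index] = payload concern above. It only computes which
address an index would reach when the base pointer is NULL and compares
it with an assumed guard size. MODEL_MMAP_MIN_ADDR is an assumption (the
real limit is the vm.mmap_min_addr sysctl and differs between systems),
and both helper names are hypothetical. Whether an access beyond the
guarded region actually succeeds depends on what is mapped there and on
hardware protections, which this model ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed size of the low region that cannot be mapped by user space. */
#define MODEL_MMAP_MIN_ADDR 65536u

/* Address reached by array[index] when array is NULL (base address 0). */
static uint64_t null_based_address(uint64_t index, uint64_t elem_size)
{
	return index * elem_size;	/* overflow is not modelled here */
}

/* True if the access lands inside the guarded low region and must fault. */
static bool access_must_fault(uint64_t index, uint64_t elem_size)
{
	return null_based_address(index, elem_size) < MODEL_MMAP_MIN_ADDR;
}
```

With 8-byte elements, indices up to 8191 stay inside the assumed guard
region, while index 8192 already lands at its upper edge and is no longer
covered by the guarantee.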

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-19 19:33       ` Linus Torvalds
  2024-08-19 21:48         ` Barry Song
@ 2024-08-20  6:24         ` Michal Hocko
  1 sibling, 0 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-20  6:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Barry Song, David Hildenbrand, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Mon 19-08-24 12:33:17, Linus Torvalds wrote:
> On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
> >
> >
> > That could be an exploit taking advantage of those improper callers,
> 
> So?
> 
> FIX THE BUGGY CODE.
> 
> Don't make insane and incorrect changes to the MM code and spread
> Fear, Uncertainty and Doubt.
> 
> > thus it wouldn’t necessarily result in an immediate oops in callers but
> > result in an exploit
> 
> No. Any bug can be an exploit. Don't try to make this something
> special by calling it an exploit.
> 
> NULL pointer dereferences are some of the *least* worrisome bugs,
> because we don't allow people to mmap the NULL area anyway.
> 
> So just stop spreading FUD. We don't improve the kernel by making
> excuses for bugs, we improve it by fixing things.

I really do not think that accusing Barry of spreading FUD is fair!
I believe there is a good intention behind this initiative and he has
used tools that he has available.

So rather than throwing FUD arguments I would really like to stick to a
technical discussion, please.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-19 16:05   ` Linus Torvalds
  2024-08-19 19:23     ` Barry Song
@ 2024-08-21 12:40     ` Yafang Shao
  2024-08-21 22:59       ` Linus Torvalds
  2024-08-22  6:37       ` Barry Song
  1 sibling, 2 replies; 101+ messages in thread
From: Yafang Shao @ 2024-08-21 12:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> > >
> > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > than exposing NULL dereferences to callers who do not check the return
> > > value.
> >
> > I am not convinced that BUG_ON is the right tool here to save the world,
> > but I see how we arrived here.
>
> I think the thing to do is to just add a
>
>      WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(oder, flags));
>
> or similar, where that bad_nofail_alloc() checks that the allocation
> order is small and that the flags are sane for a NOFAIL allocation.
>
> Because no, BUG_ON() is *never* the answer. The answer is to make sure
> nobody ever sets NOFAIL in situations where the allocation can fail
> and there is no way forward.
>
> A BUG_ON() will quite likely just make things worse. You're better off
> with a WARN_ON() and letting the caller just oops.
>
> Honestly, I'm perfectly fine with just removing that stupid useless
> flag entirely. The flag goes back to 2003 and was introduced in
> 2.5.69, and was meant to be for very particular uses that otherwise
> just looped waiting for memory.
>
> Back in 2.5.69, there was exactly one user: the jbd journal code, that
> did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
> expanded by another user in XFS, and even that one had a comment
> saying that it needed to be narrowed down. And in fact, by the 2.6.12
> release, that XFS use had been removed, but the jbd journal had grown
> another jbd_kmalloc case for transaction data. So at the beginning of
> the git archives, we had exactly *one* user (with two places).
>
> *THAT* is the kind of use that the flag was meant for: small
> allocations required to make forward progress in writeout during
> memory pressure.
>
> It has then expanded and is now a problem. The cases using GFP_NOFAIL
> for things like vmalloc() - which is by definition not a small
> allocation - should be just removed as outright bugs.

One potential approach could be to rename GFP_NOFAIL to
GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
to clear this flag for larger allocations. However, the challenge lies
in determining what constitutes a 'small' allocation.

>
> Note that we had this comment back in 2010:
>
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.  This modifier is deprecated and no new
>  * users should be added.
>
> and then it was softened in 2015 to the current
>
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures. New users should be evaluated carefully
>   ...
>
> and clearly that "evaluated carefully" actually never happened, so the
> new comment is just garbage.
>
> I wonder how many modern users of GFP_NOFAIL are simply due to
> over-eager allocation failure injection testing, and then people added
> GFP_NOFAIL just because it shut up the mindless random allocation
> failures.
>
> I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
> actually return an error just fine, but there was this crazy worry
> about the IPC layer initialization failing:
>
>    https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/
>
> Things like that, where people just added mindless "theoretical
> concerns" issues, or possibly had some error injection module that
> inserted impossible failures.
>
> I do NOT want those things to become BUG_ON()'s. It's better to just
> return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
> happen in the actual bad place that did an invalid allocation.
>
> Because the blame should go *there*, and it should not even remotely
> look like "oh, the MM code failed". No. The caller was garbage.
>
> So no. No MM BUG_ON code.
>
>                     Linus
>


--
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-21 12:40     ` Yafang Shao
@ 2024-08-21 22:59       ` Linus Torvalds
  2024-08-22  6:21         ` Michal Hocko
  2024-08-22  6:37       ` Barry Song
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-21 22:59 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Hildenbrand, Barry Song, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Wed, 21 Aug 2024 at 20:41, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> One potential approach could be to rename GFP_NOFAIL to
> GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> to clear this flag for larger allocations.

Yes, that sounds like a good way to make sure people don't blame the
MM layer when they themselves were the cause of problems.

> However, the challenge lies
> in determining what constitutes a 'small' allocation.

I think we could easily just stick to the historical "order <
PAGE_ALLOC_COSTLY_ORDER":

 * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
 * costly to service.

(And the value for that is 3 - orders 0-2 are considered "cheap")

             Linus
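
To make the proposed cut-off concrete, here is a tiny userspace sketch
that maps orders to block sizes and applies the "order <
PAGE_ALLOC_COSTLY_ORDER" test from the message above. It assumes 4 KiB
pages; order_to_bytes() and nofail_is_small() are hypothetical helper
names, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096ul	/* assumes 4 KiB pages */
#define MODEL_COSTLY_ORDER 3u		/* value of PAGE_ALLOC_COSTLY_ORDER */

/* Bytes covered by one physically contiguous block of the given order. */
static size_t order_to_bytes(unsigned int order)
{
	return MODEL_PAGE_SIZE << order;
}

/* "Small" in the sense proposed above: order < PAGE_ALLOC_COSTLY_ORDER. */
static bool nofail_is_small(unsigned int order)
{
	return order < MODEL_COSTLY_ORDER;
}
```

Under these assumptions the largest request that would keep NOFAIL
semantics is an order-2 block of 16 KiB; an order-3 block of 32 KiB would
already lose the flag.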


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-21 22:59       ` Linus Torvalds
@ 2024-08-22  6:21         ` Michal Hocko
  2024-08-22  6:40           ` Linus Torvalds
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  6:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yafang Shao, David Hildenbrand, Barry Song, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 06:59:08, Linus Torvalds wrote:
> On Wed, 21 Aug 2024 at 20:41, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > One potential approach could be to rename GFP_NOFAIL to
> > GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> > to clear this flag for larger allocations.
> 
> Yes, that sounds like a good way to make sure people don't blame the
> MM layer when they themselves were the cause of problems.

The reality disagrees because there is a real demand for real GFP_NOFAIL
semantic. By that I do not mean arbitrary requests and sure GFP_NOFAIL
for higher orders is really hard to achieve but kvmalloc GFP_NOFAIL for
anything larger than PAGE_SIZE is doable without a considerable burden
on the MM end.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-21 12:40     ` Yafang Shao
  2024-08-21 22:59       ` Linus Torvalds
@ 2024-08-22  6:37       ` Barry Song
  2024-08-22 14:22         ` Yafang Shao
  1 sibling, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-22  6:37 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Linus Torvalds, David Hildenbrand, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Thu, Aug 22, 2024 at 12:41 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > > than exposing NULL dereferences to callers who do not check the return
> > > > value.
> > >
> > > I am not convinced that BUG_ON is the right tool here to save the world,
> > > but I see how we arrived here.
> >
> > I think the thing to do is to just add a
> >
> >      WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(oder, flags));
> >
> > or similar, where that bad_nofail_alloc() checks that the allocation
> > order is small and that the flags are sane for a NOFAIL allocation.
> >
> > Because no, BUG_ON() is *never* the answer. The answer is to make sure
> > nobody ever sets NOFAIL in situations where the allocation can fail
> > and there is no way forward.
> >
> > A BUG_ON() will quite likely just make things worse. You're better off
> > with a WARN_ON() and letting the caller just oops.
> >
> > Honestly, I'm perfectly fine with just removing that stupid useless
> > flag entirely. The flag goes back to 2003 and was introduced in
> > 2.5.69, and was meant to be for very particular uses that otherwise
> > just looped waiting for memory.
> >
> > Back in 2.5.69, there was exactly one user: the jbd journal code, that
> > did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
> > expanded by another user in XFS, and even that one had a comment
> > saying that it needed to be narrowed down. And in fact, by the 2.6.12
> > release, that XFS use had been removed, but the jbd journal had grown
> > another jbd_kmalloc case for transaction data. So at the beginning of
> > the git archives, we had exactly *one* user (with two places).
> >
> > *THAT* is the kind of use that the flag was meant for: small
> > allocations required to make forward progress in writeout during
> > memory pressure.
> >
> > It has then expanded and is now a problem. The cases using GFP_NOFAIL
> > for things like vmalloc() - which is by definition not a small
> > allocation - should be just removed as outright bugs.
>
> One potential approach could be to rename GFP_NOFAIL to
> GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> to clear this flag for larger allocations. However, the challenge lies
> in determining what constitutes a 'small' allocation.

I'm not entirely sure if our concern is with higher order or larger size.
Higher order might pose a problem, but a larger (though not too large) size
isn't always an issue. Allocating 100 * 4KiB pages is possibly easier than
allocating a single 128KiB folio.

Are we trying to limit the physical size or the physical order? If the concern
is order, vmalloc manages __GFP_NOFAIL by mapping order-0 pages. If the
concern is higher order, this sounds reasonable.  but it seems the buddy
system already has code to trigger a warning even for order > 1:

struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
        WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

        if (likely(pcp_allowed_order(order))) {
                page = rmqueue_pcplist(preferred_zone, zone, order,
                                       migratetype, alloc_flags);
                if (likely(page))
                        goto out;
        }
        ....
}
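
To make the size-versus-order distinction above concrete, here is a
minimal userspace sketch (not kernel code). It assumes 4KiB pages, and
order_for_size() and pages_for_size() are hypothetical helpers written
for this illustration only; the first mirrors the kind of rounding the
buddy allocator needs for a physically contiguous request, the second
the number of order-0 pages a vmalloc-style mapping would need.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Smallest order such that (page size << order) >= size, i.e. the
 * order a physically contiguous allocation of 'size' bytes would need.
 * Sizes are assumed to be small enough that the shift cannot overflow.
 */
static unsigned int order_for_size(size_t size)
{
        unsigned int order = 0;

        assert(size > 0);
        while ((SKETCH_PAGE_SIZE << order) < size)
                order++;
        return order;
}

/* Number of independent order-0 pages needed to cover 'size' bytes. */
static size_t pages_for_size(size_t size)
{
        return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

Under these assumptions a single 128KiB folio is an order-5 request,
while 400KiB obtained page by page is just 100 separate order-0
requests; the same 400KiB as one contiguous block would be order 7.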

>
> >
> > Note that we had this comment back in 2010:
> >
> >  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  * cannot handle allocation failures.  This modifier is deprecated and no new
> >  * users should be added.
> >
> > and then it was softened in 2015 to the current
> >
> >  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  * cannot handle allocation failures. New users should be evaluated carefully
> >   ...
> >
> > and clearly that "evaluated carefully" actually never happened, so the
> > new comment is just garbage.
> >
> > I wonder how many modern users of GFP_NOFAIL are simply due to
> > over-eager allocation failure injection testing, and then people added
> > GFP_NOFAIL just because it shut up the mindless random allocation
> > failures.
> >
> > I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
> > actually return an error just fine, but there was this crazy worry
> > about the IPC layer initialization failing:
> >
> >    https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/
> >
> > Things like that, where people just added mindless "theoretical
> > concerns" issues, or possibly had some error injection module that
> > inserted impossible failures.
> >
> > I do NOT want those things to become BUG_ON()'s. It's better to just
> > return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
> > happen in the actual bad place that did an invalid allocation.
> >
> > Because the blame should go *there*, and it should not even remotely
> > look like "oh, the MM code failed". No. The caller was garbage.
> >
> > So no. No MM BUG_ON code.
> >
> >                     Linus
> >
>
>
> --
> Regards
> Yafang

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  6:21         ` Michal Hocko
@ 2024-08-22  6:40           ` Linus Torvalds
  2024-08-22  6:56             ` Linus Torvalds
  2024-08-22  7:01             ` Gao Xiang
  0 siblings, 2 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  6:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yafang Shao, David Hildenbrand, Barry Song, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, 22 Aug 2024 at 14:21, Michal Hocko <mhocko@suse.com> wrote:
>
> The reality disagrees because there is a real demand for real GFP_NOFAIL
> semantic. By that I do not mean arbitrary requests and sure GFP_NOFAIL
> for higher orders is really hard to achieve but kvmalloc GFP_NOFAIL for
> anything larger than PAGE_SIZE is doable without a considerable burden
> on the MM end.

Doable? Sure. Sensible? Not clear.

I do not find a single case of that in the kernel.

I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
one in erofs. It's not clear that any of them make much sense (or that
the erofs one is actually a large allocation).

And there's some disgusting noise in hmm_test.c. Irrelevant.

Now, this was all from a pure mindless grep, and it didn't take any
context into account, so if somebody is building up the gfp flags and
not using them directly, that grep has failed.

But when you say "reality disagrees", I think you need to point at said reality.

Because the *real* reality is that large allocations HAVE ALWAYS
FAILED. In fact, with flags like GFP_ATOMIC etc, it will fail very
aggressively. Christ, that's what started this whole thread and
discussion in the first place.

So I think *that* is the reality you need to face: GFP_NOFAIL has
never ever been a hard guarantee to begin with, and expecting it to be
that is a sign of insanity. It's fundamentally an impossible
operation.

It really is a "Daddy, daddy, I want a pony" flag.

             Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  6:40           ` Linus Torvalds
@ 2024-08-22  6:56             ` Linus Torvalds
  2024-08-22  7:47               ` Michal Hocko
  2024-08-22  7:01             ` Gao Xiang
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  6:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yafang Shao, David Hildenbrand, Barry Song, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> one in erofs. It's not clear that any of them make much sense (or that
> the erofs one is actually a large allocation).

Oh, and I missed one in btrfs because it needed five lines of context
due to being the allocation from hell.

That said, yes, the vmalloc case at least has no fragmentation issues,
but I do think even that needs to be size limited for sanity.

The size limit might be larger than a couple of pages, but not
_hugely_ larger. You can't just say "I want a megabyte, and you can't
fail me". That kind of code is garbage, and needs to be called out for
being garbage.

                 Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  6:40           ` Linus Torvalds
  2024-08-22  6:56             ` Linus Torvalds
@ 2024-08-22  7:01             ` Gao Xiang
  2024-08-22  7:54               ` Michal Hocko
  1 sibling, 1 reply; 101+ messages in thread
From: Gao Xiang @ 2024-08-22  7:01 UTC (permalink / raw)
  To: Linus Torvalds, Michal Hocko
  Cc: Yafang Shao, David Hildenbrand, Barry Song, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

Hi Linus,

On 2024/8/22 14:40, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 14:21, Michal Hocko <mhocko@suse.com> wrote:
>>
>> The reality disagrees because there is a real demand for real GFP_NOFAIL
>> semantic. By that I do not mean arbitrary requests and sure GFP_NOFAIL
>> for higher orders is really hard to achieve but kvmalloc GFP_NOFAIL for
>> anything larger than PAGE_SIZE is doable without a considerable burden
>> on the MM end.
> 
> Doable? Sure. Sensible? Not clear.
> 
> I do not find a single case of that in the kernel.
> 
> I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> one in erofs. It's not clear that any of them make much sense (or that
> the erofs one is actually a large allocation).

I haven't followed the whole thread due to other ongoing internal work,
but EROFS can do a _large_ kvmalloc NOFAIL allocation by
PAGE_ALLOC_COSTLY_ORDER standards (~24KB at most due to an on-disk
restriction). My detailed story was outlined in my previous reply (and thread):
https://lore.kernel.org/r/20d782ad-c059-4029-9c75-0ef278c98d81@linux.alibaba.com

Because EROFS needs page arrays for vmap and then does decompression,
in the worst case it needs an almost ~24KB temporary page array,
but it is the end user's choice to use such extreme compression
(mostly just syzkaller-crafted images).

In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction
means for a single shot.  Even if you don't consider
a virtually contiguous buffer, people could also do
< PAGE_ALLOC_COSTLY_ORDER allocations multiple times to put almost
the same heavy workload on the whole system.  And we also allow
direct/kswapd reclaim here.

Failure path is complex in some cases like here and it's hard
to reach or get it right.  If kvmalloc() will be restricted on
< PAGE_ALLOC_COSTLY_ORDER anyway, I guess I will use a global
static buffer (and a sleeping lock) as a worst fallback to fulfill
the extreme on-disk restriction.
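
A rough userspace sketch of that fallback idea, purely as an
illustration: try a normal (fallible) allocation first, and only if
that fails serialize on one preallocated static buffer. All names here
(tmpbuf_get(), tmpbuf_put(), failing_alloc(), FALLBACK_BUF_SIZE) are
hypothetical, a pthread mutex stands in for a kernel sleeping lock, and
none of this is actual EROFS code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: sized for the ~24KB worst case mentioned above. */
#define FALLBACK_BUF_SIZE (24 * 1024)

static unsigned char fallback_buf[FALLBACK_BUF_SIZE];
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Demo allocator that always fails, to exercise the fallback path. */
static void *failing_alloc(size_t size)
{
        (void)size;
        return NULL;
}

/*
 * Get a temporary buffer. 'alloc' is the ordinary fallible allocator.
 * If it fails, sleep on the lock and hand out the static buffer.
 * Returns NULL only when 'size' exceeds the static buffer.
 */
static void *tmpbuf_get(size_t size, void *(*alloc)(size_t))
{
        void *p = alloc(size);

        if (p)
                return p;
        if (size > FALLBACK_BUF_SIZE)
                return NULL;
        pthread_mutex_lock(&fallback_lock);
        return fallback_buf;
}

/* Release a buffer obtained from tmpbuf_get(). */
static void tmpbuf_put(void *p)
{
        assert(p != NULL);
        if (p == (void *)fallback_buf)
                pthread_mutex_unlock(&fallback_lock);
        else
                free(p);
}
```

The cost of this design is that all users hitting the fallback are
serialized behind one lock, which seems acceptable only because the
path is expected to be extremely rare.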

Thanks,
Gao Xiang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  6:56             ` Linus Torvalds
@ 2024-08-22  7:47               ` Michal Hocko
  2024-08-22  7:57                 ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  7:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yafang Shao, David Hildenbrand, Barry Song, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 14:56:08, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> > one in erofs. It's not clear that any of them make much sense (or that
> > the erofs one is actually a large allocation).
> 
> Oh, and I missed one in btrfs because it needed five lines of context
> due to being the allocation from hell.
> 
> That said, yes, the vmalloc case at least has no fragmentation issues,
> but I do think even that needs to be size limited for sanity.
> 
> The size limit might be larger than a couple of pages, but not
> _hugely_ larger. You can't just say "I want a megabyte, and you can't
> fail me". That kind of code is garbage, and needs to be called out for
> being garbage.

Yes, no objection here. Our current limits are too large for any
practical purpose. We still need a strategy for communicating that
the size is not supported though. Just returning NULL is IMHO a bad thing
because it adds a potentially silent failure. In the other subthread I was
contemplating OOPS_ON, which would simply terminate the user
context and kill it right away. What do you think about something like
that?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  7:01             ` Gao Xiang
@ 2024-08-22  7:54               ` Michal Hocko
  2024-08-22  8:04                 ` Gao Xiang
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  7:54 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Linus Torvalds, Yafang Shao, David Hildenbrand, Barry Song, akpm,
	linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 15:01:43, Gao Xiang wrote:
> In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction
> means for a single shot.  Even if you don't consider
> a virtually contiguous buffer, people could also do
> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to put almost
> the same heavy workload on the whole system.  And we also allow
> direct/kswapd reclaim here.

Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
makes sense outside of the page allocator proper. There is no reason why
vmalloc NOFAIL should be constrained by that. Sure it should be
constrained to some value but considering it is just a bunch of PAGE_SIZE
allocations the limit could be higher. I am not sure where the
practical limit should be but anything that requires more than a couple of
MBs seems really excessive.

And for PAGE_ALLOC_COSTLY_ORDER and NOFAIL at the page allocator level I
would argue that we do not want to provide such a strong guarantee to
anything order > 0. Just use kvmalloc for that purpose. Sure we
practically never fail up to costly order but guaranteeing that is a
different story.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  7:47               ` Michal Hocko
@ 2024-08-22  7:57                 ` Barry Song
  2024-08-22  8:24                   ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-22  7:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Yafang Shao, David Hildenbrand, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, Aug 22, 2024 at 7:47 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 22-08-24 14:56:08, Linus Torvalds wrote:
> > On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> > > one in erofs. It's not clear that any of them make much sense (or that
> > > the erofs one is actually a large allocation).
> >
> > Oh, and I missed one in btrfs because it needed five lines of context
> > due to being the allocation from hell.
> >
> > That said, yes, the vmalloc case at least has no fragmentation issues,
> > but I do think even that needs to be size limited for sanity.
> >
> > The size limit might be larger than a couple of pages, but not
> > _hugely_ larger. You can't just say "I want a megabyte, and you can't
> > fail me". That kind of code is garbage, and needs to be called out for
> > being garbage.
>
> Yes, no objection here. Our current limits are too large for any
> practical purpose. We still need a strategy for communicating that
> the size is not supported though. Just returning NULL is IMHO a bad thing
> because it adds a potentially silent failure. In the other subthread I was
> contemplating OOPS_ON, which would simply terminate the user
> context and kill it right away. What do you think about something like
> that?

Personally, I fully support this approach that falls between BUG_ON
and returning NULL. Regarding the concern about 'leaving locks
behind' you have in that subthread,  I believe there's no difference
when returning NULL, as it could still leave locks behind but offers
a chance for the calling process to avoid an immediate crash.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  7:54               ` Michal Hocko
@ 2024-08-22  8:04                 ` Gao Xiang
  2024-08-22 14:35                   ` Yafang Shao
  0 siblings, 1 reply; 101+ messages in thread
From: Gao Xiang @ 2024-08-22  8:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Yafang Shao, David Hildenbrand, Barry Song, akpm,
	linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

Hi Michal,

On 2024/8/22 15:54, Michal Hocko wrote:
> On Thu 22-08-24 15:01:43, Gao Xiang wrote:
>> In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction
>> means for a single shot.  Even if you don't consider
>> a virtually contiguous buffer, people could also do
>> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to put almost
>> the same heavy workload on the whole system.  And we also allow
>> direct/kswapd reclaim here.
> 
> Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
> makes sense outside of the page allocator proper. There is no reason why
> vmalloc NOFAIL should be constrained by that. Sure it should be
> constrained to some value but considering it is just a bunch of PAGE_SIZE
> allocations the limit could be higher. I am not sure where the
> practical limit should be but anything that requires more than a couple of
> MBs seems really excessive.

Yeah, totally agreed, that would make my own life easier. Of
course I will not allocate MBs insanely.

I've always been trying to kill unnecessary NOFAILs (mostly together
with code cleanups), but if a failure path grows by more than
100 LOC just for rare failures and extreme workloads, I _do_
hope kvmalloc(NOFAIL) could work instead.


> 
> And for PAGE_ALLOC_COSTLY_ORDER and NOFAIL at the page allocator level I
> would argue that we do not want to provide such a strong guarantee to
> anything order > 0. Just use kvmalloc for that purpose. Sure we
> practically never fail up to costly order but guaranteeing that is a
> different story.

Agreed.

Thanks,
Gao Xiang

> 


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  7:57                 ` Barry Song
@ 2024-08-22  8:24                   ` Michal Hocko
  2024-08-22  8:39                     ` David Hildenbrand
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  8:24 UTC (permalink / raw)
  To: Barry Song
  Cc: Linus Torvalds, Yafang Shao, David Hildenbrand, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 19:57:41, Barry Song wrote:
> Regarding the concern about 'leaving locks
> behind' you have in that subthread,  I believe there's no difference
> when returning NULL, as it could still leave locks behind but offers
> a chance for the calling process to avoid an immediate crash.

Yes, I have mentioned this risk just for completeness. Without having
some sort of unwinding mechanism we are doomed to not be able to handle
this.

The sole difference between just returning NULL and oopsing right away
is that with the former the oops is not guaranteed to happen, and the
caller can cause actual harm by dereferencing non-oopsing addresses
close to 0, which would be a) much harder to find out b) could cause
much more damage than killing the context right away.

Besides that I believe we have many BUG_ON users which would really
prefer to just kill the current context instead, they just do not have
the means to do that, so OOPS_ON could be a safer way to stop bad users
and reduce the number of BUG_ONs as well.

I am just not really sure how to implement that. A stupid and obvious
way would be to have a dereference from a known (pre-defined) unmapped
area. But this smells like something that should be achievable in a
better way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  8:24                   ` Michal Hocko
@ 2024-08-22  8:39                     ` David Hildenbrand
  2024-08-22  9:08                       ` Linus Torvalds
  2024-08-22  9:11                       ` Michal Hocko
  0 siblings, 2 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-22  8:39 UTC (permalink / raw)
  To: Michal Hocko, Barry Song
  Cc: Linus Torvalds, Yafang Shao, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On 22.08.24 10:24, Michal Hocko wrote:
> On Thu 22-08-24 19:57:41, Barry Song wrote:
>> Regarding the concern about 'leaving locks
>> behind' you have in that subthread,  I believe there's no difference
>> when returning NULL, as it could still leave locks behind but offers
>> a chance for the calling process to avoid an immediate crash.
> 
> Yes, I have mentioned this risk just for completeness. Without having
> some sort of unwinding mechanism we are doomed to not be able to handle
> this.
> 
> The sole difference between just returning NULL and oopsing right away
> is that with the former the oops is not guaranteed to happen, and the
> caller can cause actual harm by dereferencing non-oopsing addresses
> close to 0, which would be a) much harder to find out b) could cause
> much more damage than killing the context right away.
> 
> Besides that I believe we have many BUG_ON users which would really
> prefer to just kill the current context instead, they just do not have
> the means to do that, so OOPS_ON could be a safer way to stop bad users
> and reduce the number of BUG_ONs as well.

To me that sounds better as well, but I was also wondering if it's easy 
to implement or easy to assemble from existing pieces.

Linus has a point that "retry forever" can also be nasty. I think the 
important part here is, though, that we report sufficient information 
(stacktrace), such that the problem can be debugged reasonably well, and 
not just having a locked-up system.

But then the question is: does it really make sense to differentiate
between a NOFAIL allocation under memory pressure of MAX_ORDER and one
of MAX_ORDER+1 (Linus also touched on that)? It could well take
minutes/hours/days to satisfy a very large NOFAIL allocation.
So callers should be prepared to run into effective lockups ... :/

NOFAIL shouldn't exist, or at least not be used to that degree.

I am to blame myself, I made use of it in kernel/resource.c, where there
is no turning back once memory unplug has completed to 99% (even having
freed the vmemmap), but then we might have to allocate a new node in the
resource tree when having to split an existing one. Maybe there would
be ways to preallocate before starting memory unplug, or to pre-split ...

But then again, sizeof(struct resource) is probably so small that it 
likely would never fail.

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  8:39                     ` David Hildenbrand
@ 2024-08-22  9:08                       ` Linus Torvalds
  2024-08-22  9:16                         ` Michal Hocko
  2024-08-22  9:11                       ` Michal Hocko
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  9:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Barry Song, Yafang Shao, akpm, linux-mm, 42.hyeyoo,
	cl, hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Thu, 22 Aug 2024 at 16:39, David Hildenbrand <david@redhat.com> wrote:
>
> Linus has a point that "retry forever" can also be nasty. I think the
> important part here is, though, that we report sufficient information
> (stacktrace), such that the problem can be debugged reasonably well, and
> not just having a locked-up system.

Unless I missed some case, I *think* most NOFAIL cases are actually
fairly small.

In fact, I suspect many of them are so small that we already
effectively give that guarantee:

> But then again, sizeof(struct resource) is probably so small that it
> likely would never fail.

Iirc, we had the policy of never failing unrestricted kernel
allocations that are smaller than a page (where "unrestricted" means
that it's a regular GFP_KERNEL, not some NOFS or similar allocation).

In fact, I think we practically speaking still do. We really *really*
tend to try very hard to retry small allocations.

That was one of the things that GFP_USER does - it's identical to
GFP_KERNEL, but it basically tells the MM that it should not try so
hard because an allocation failure was fine.

(I say "was", because the semantics have changed over time.
Originally, GFP_KERNEL had the "GFP_HIGH" bit set that said "access
emergency pools for this allocation", and GFP_USER did not have that
bit set. Now the only difference seems to be GFP_HARDWALL, so for a user
allocation the cgroup limits are hard limits etc. I think there's been
other versions of this kind of logic over the years as people try to
make it all work out well in practice).

In fact, kernel allocations try so hard that we have those "opposite
flags" of ___GFP_NORETRY and ___GFP_RETRY_MAYFAIL because we often try
*TOO* hard, and reasonably many code-paths have that whole "let's
optimistically ask for a big allocation, but not try very hard and not
warn if it fails, because we can fall back on a smaller one".
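
That "ask big, fall back small" pattern looks roughly like the
following userspace sketch. It is only an illustration under stated
assumptions: alloc_opportunistic() and alloc_capped_64k() are made-up
names, and the capped allocator merely stands in for a cheap attempt
that gives up quickly instead of retrying hard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator type: one cheap attempt, NULL on failure. */
typedef void *(*try_alloc_fn)(size_t size);

/* Demo allocator: refuses anything above 64KiB, otherwise uses malloc(). */
static void *alloc_capped_64k(size_t size)
{
        return size > 64 * 1024 ? NULL : malloc(size);
}

/*
 * Optimistically ask for 'want' bytes; on failure halve the request,
 * never going below 'min'. On success *got holds the size obtained.
 * Returns NULL (and *got == 0) if even 'min' cannot be allocated, so
 * the caller still has to handle failure.
 */
static void *alloc_opportunistic(size_t want, size_t min, size_t *got,
                                 try_alloc_fn try_alloc)
{
        size_t size = want;

        assert(got != NULL && min > 0 && want >= min);
        for (;;) {
                void *p = try_alloc(size);

                if (p) {
                        *got = size;
                        return p;
                }
                if (size == min)
                        break;
                size /= 2;
                if (size < min)
                        size = min;
        }
        *got = 0;
        return NULL;
}
```

The point of the pattern is that the large request is a pure
optimization: nothing depends on it succeeding, which is the opposite
of what a NOFAIL caller assumes.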

So it's _really_ hard to fail a small GFP_KERNEL allocation. It used
to be practically impossible, and in fact I think GFP_NOFAIL was
originally added long ago when the MM code was going through big
upheavals and one of the things that was mucked around with was the
whole "how hard to retry".

              Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  8:39                     ` David Hildenbrand
  2024-08-22  9:08                       ` Linus Torvalds
@ 2024-08-22  9:11                       ` Michal Hocko
  2024-08-22  9:18                         ` Linus Torvalds
  2024-08-22  9:27                         ` David Hildenbrand
  1 sibling, 2 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  9:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, Linus Torvalds, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 10:39:09, David Hildenbrand wrote:
[...]
> But then the question is: does it really make sense to differentiate
> between a NOFAIL allocation under memory pressure of MAX_ORDER and one
> of MAX_ORDER+1 (Linus also touched on that)? It could well take
> minutes/hours/days to satisfy a very large NOFAIL allocation. So callers
> should be prepared to run into effective lockups ... :/

As pointed out in the other subthread, we shouldn't really pretend we
support NOFAIL for order > 0, or at least anything > A_SMALL_ORDER, and
should encourage kvmalloc for those users.

A nofail order-2 allocation can kill most of the userspace on a terribly
fragmented system that is kernel allocation heavy.

> NOFAIL shouldn't exist, or at least not used to that degree.

Let's put wishful thinking aside. Unless somebody manages to go over
all existing NOFAIL users and fix them, we had better focus on
providing a reasonable, clearly documented and enforced semantic.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:08                       ` Linus Torvalds
@ 2024-08-22  9:16                         ` Michal Hocko
  2024-08-22  9:24                           ` Linus Torvalds
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  9:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Barry Song, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 17:08:15, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 16:39, David Hildenbrand <david@redhat.com> wrote:
> >
> > Linus has a point that "retry forever" can also be nasty. I think the
> > important part here is, though, that we report sufficient information
> > (stacktrace), such that the problem can be debugged reasonably well, and
> > not just having a locked-up system.
> 
> Unless I missed some case, I *think* most NOFAIL cases are actually
> fairly small.
> 
> In fact, I suspect many of them are so small that we already
> effectively give that guarantee:
> 
> > But then again, sizeof(struct resource) is probably so small that it
> > likely would never fail.
> 
> Iirc, we had the policy of never failing unrestricted kernel
> allocations that are smaller than a page (where "unrestricted" means
> that it's a regular GFP_KERNEL, not some NOFS or similar allocation).
> 
> In fact, I think we practically speaking still do. We really *really*
> tend to try very hard to retry small allocations.

Yes, we try very hard, but allocation failure is still possible in some
corner cases, so callers _must_ check the return value and deal with it.

> That was one of the things that GFP_USER does - it's identical to
> GFP_KERNEL, but it basically tells the MM that it should not try so
> hard because an allocation failure was fine.

GFP_USER allocations only imply __GFP_HARDWALL and that only makes a
difference for cpusets. It doesn't make a difference in most cases though.
 
> In fact, kernel allocations try so hard that we have those "opposite
> flags" of ___GFP_NORETRY and ___GFP_RETRY_MAYFAIL because we often try
> *TOO* hard, and reasonably many code-paths have that whole "let's
> optimistically ask for a big allocation, but not try very hard and not
> warn if it fails, because we can fall back on a smaller one".
> 
> So it's _really_ hard to fail a small GFP_KERNEL allocation. It used
> to be practically impossible, and in fact I think GFP_NOFAIL was
> originally added long ago when the MM code was going through big
> upheavals and one of the things that was mucked around with was the
> whole "how hard to retry".

There is a fundamental difference here. GFP_NOFAIL _guarantees_ that the
allocation will not fail so callers do not check for the failure because
they have (presumably) no (practical) way to handle the failure.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:11                       ` Michal Hocko
@ 2024-08-22  9:18                         ` Linus Torvalds
  2024-08-22  9:33                           ` Michal Hocko
  2024-08-22  9:27                         ` David Hildenbrand
  1 sibling, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  9:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Barry Song, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, 22 Aug 2024 at 17:11, Michal Hocko <mhocko@suse.com> wrote:
>
> Let's put wishful thinking aside. Unless somebody manages to go over
> all existing NOFAIL users and fix them, we had better focus on
> providing a reasonable, clearly documented and enforced semantic.

I do like changing the naming to make it clear that it's not some kind
of general MM guarantee for any random allocation.

So that's why I liked the NOFAIL_SMALL_ALLOC just to make people who
use it aware that no, they aren't getting a "get out of jail free"
card.

Admittedly that's probably _too_ long a name, but conceptually...

              Linus


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:16                         ` Michal Hocko
@ 2024-08-22  9:24                           ` Linus Torvalds
  0 siblings, 0 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  9:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Barry Song, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, 22 Aug 2024 at 17:16, Michal Hocko <mhocko@suse.com> wrote:
>
> GFP_USER allocations only imply __GFP_HARDWALL and that only makes a
> difference for cpusets. It doesn't make a difference in most cases though.

That's what it does today.

We used to have a very clear notion of "how hard to try". It was
"LOW", "MED" and "HIGH".

And GFP_USER used __GFP_LOW, exactly so that the MM layer knew to not
try very hard.

GFP_ATOMIC used __GFP_HIGH, to say "use the reserved resources".

GFP_KERNEL then at one point used __GFP_MED, to say "don't dip into the
reserved pool, but retry harder".

But exactly because people did want kernel allocations to basically
always succeed, then GFP_KERNEL ended up using __GFP_HIGH too.

> There is a fundamental difference here. GFP_NOFAIL _guarantees_ that the
> allocation will not fail so callers do not check for the failure because
> they have (presumably) no (practical) way to handle the failure.

And this mindset needs to go away. That's what I've been trying to say.

It absolutely MUST NOT GUARANTEE THAT.

I've seen crap patches that say "BUG_ON() if we cannot guarantee it",
and I'm NACKing those kinds of completely bogus models.

The hard reality needs to be that GFP_NOFAIL is simply IGNORED if
people mis-use it. It absolutely HAS to be a "conditional no-failure".

And it needs to be conditional on both size and things like "I'm
allowed to do reclaim".

Any discussion that starts with "GFP_NOFAIL is a guarantee" needs to *DIE*.

                  Linus
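
[ Illustration: a minimal userspace sketch of the "conditional no-failure"
  idea described above, where NOFAIL is only honored when the request is
  small and the caller is allowed to reclaim. The SK_* names, the flag
  values and the order-3 cut-off are assumptions made for this example;
  they are not kernel definitions and the thread has not agreed on a limit. ]

```c
#include <stdbool.h>

/* Hypothetical stand-ins for gfp bits; the values are illustrative only. */
#define SK_GFP_DIRECT_RECLAIM 0x1u
#define SK_GFP_NOFAIL         0x2u

/* Assumed cut-off for "small"; not a number taken from the kernel. */
#define SK_NOFAIL_MAX_ORDER 3u

/*
 * Returns true only when a NOFAIL request is of the limited kind that
 * the allocator would actually be willing to retry forever.
 */
static bool sk_nofail_honored(unsigned int order, unsigned int flags)
{
	if (!(flags & SK_GFP_NOFAIL))
		return false;	/* caller never asked for NOFAIL */
	if (!(flags & SK_GFP_DIRECT_RECLAIM))
		return false;	/* cannot sleep, so cannot reclaim */
	return order <= SK_NOFAIL_MAX_ORDER;
}
```

[ With this predicate, an atomic-context request or an oversized order is
  simply treated as an ordinary allocation that may fail. ]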

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:11                       ` Michal Hocko
  2024-08-22  9:18                         ` Linus Torvalds
@ 2024-08-22  9:27                         ` David Hildenbrand
  2024-08-22  9:34                           ` Linus Torvalds
  2024-08-22  9:41                           ` Michal Hocko
  1 sibling, 2 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-22  9:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Barry Song, Linus Torvalds, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On 22.08.24 11:11, Michal Hocko wrote:
> On Thu 22-08-24 10:39:09, David Hildenbrand wrote:
> [...]
>> But then the question is: does it really make sense to differentiate
>> between a NOFAIL allocation under memory pressure of MAX_ORDER
>> compared to MAX_ORDER+1 (Linus also touched on that)? It could well take
>> minutes/hours/days to satisfy a very large NOFAIL allocation. So callers
>> should be prepared to run into effective lockups ... :/
> 
> As pointed out in other subthread. We shouldn't really pretend we
> support NOFAIL for order > 0, or at least anything > A_SMALL_ORDER and
> encourage kvmalloc for those users.
> 
> A nofail order 2 allocation can kill most of the userspace on terribly
> fragmented system that is kernel allocation heavy.
> 
>> NOFAIL shouldn't exist, or at least not used to that degree.
> 
> Let's put wishful thinking aside. Unless somebody manages to go over
> all existing NOFAIL users and fix them then we should better focus on
> providing a reasonable clearly documented and enforced semantic.

Probably it would be time better spent than trying to find ways to deal 
with that mess. ;)


I think the documentation is mostly there:

"The VM implementation _must_ retry infinitely: the caller cannot handle 
allocation failures. The allocation could block indefinitely but will 
never return with failure. Testing for failure is pointless."

To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry 
infinitely". Whether that means just OOPSing or actually sitting in a busy 
loop, I don't care. It could effectively happen with MAX_ORDER as well, as 
stated. But certainly not BUG_ON.


"Using this flag for costly allocations is _highly_ discouraged" should 
be rephrased to "Using this flag with costly allocations is _highly 
dangerous_ and will likely result in the allocation never succeeding and 
this function never making any progress."

I do also agree that renaming NOFAIL to make some of that clearer makes 
sense.

Likely, checkpatch should be updated to warn on any new NOFAIL usage.



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:18                         ` Linus Torvalds
@ 2024-08-22  9:33                           ` Michal Hocko
  2024-08-22  9:44                             ` Linus Torvalds
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  9:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Barry Song, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 17:18:29, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:11, Michal Hocko <mhocko@suse.com> wrote:
> >
> > Let's put wishful thinking aside. Unless somebody manages to go over
> > all existing NOFAIL users and fix them then we should better focus on
> > providing a reasonable clearly documented and enforced semantic.
> 
> I do like changing the naming to make it clear that it's not some kind
> of general MM guarantee for any random allocation.
> 
> So that's why I liked the NOFAIL_SMALL_ALLOC just to make people who
> use it aware that no, they aren't getting a "get out of jail free"
> card.

Small means different things to different people. Also that small has a
completely different meaning for the page allocator and for kvmalloc. I
really do not like to carve any ambiguity like that into the flag that
is supposed to be used for both.

Quite honestly I am not even sure we have actual GFP_NOFAIL users of the
page allocator outside of the MM (e.g. to implement SLUB internal
allocations). git grep just takes too much time to process because the
underlying allocator is not always immediately visible.

Limiting NOFAIL semantic to SLUB and {kv}malloc allocators would make
some sense then as it could enforce reasonable use more easily I guess.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:27                         ` David Hildenbrand
@ 2024-08-22  9:34                           ` Linus Torvalds
  2024-08-22  9:43                             ` David Hildenbrand
  2024-08-26 12:10                             ` Vlastimil Babka
  2024-08-22  9:41                           ` Michal Hocko
  1 sibling, 2 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  9:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Barry Song, Yafang Shao, akpm, linux-mm, 42.hyeyoo,
	cl, hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>
> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
> infinitely". if that implies just OOPSing or actually be in a busy loop,
> I don't care. It could effectively happen with MAX_ORDER as well, as
> stated. But certainly not BUG_ON.

No BUG_ON(), but also no endless loop.

Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
make it easy to find offenders, and then let them deal with it.

Don't take it upon yourself to say "we have to deal with any amount of
stupidity".

The MM layer is not some slave to users. The MM layer is one of the
most core pieces of code in the kernel, and as such the MM layer is
damn well in charge.

Nobody has the right to say "I will not deal with allocation
failures". The MM should not bend over backwards over something like
that.

Seriously. Get a spine already, people. Tell random drivers that claim
that they cannot deal with errors to just f-ck off.

And you don't do it by looping forever, and you don't do it by killing
the kernel. You do it by ignoring their bullying tactics.

Then you document the *LIMITED* cases where you actually will try forever.

This discussion has gone on for too damn long.

              Linus
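
[ Illustration: a userspace model of "warn once, return NULL, and let the
  caller deal with it". The function name, the warning counter and the
  64 KiB limit are assumptions for this sketch; malloc() and fprintf()
  stand in for the page allocator and WARN_ON_ONCE(). ]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed size limit for requests that are retried forever. */
#define SK_NOFAIL_MAX_BYTES (64u * 1024u)

/* Number of warnings printed so far; at most one is ever emitted. */
static unsigned int sk_warn_count;

static void *sk_alloc_nofail(size_t size)
{
	static bool warned;

	if (size > SK_NOFAIL_MAX_BYTES) {
		/* Bogus request: report the first offender, then fail. */
		if (!warned) {
			warned = true;
			sk_warn_count++;
			fprintf(stderr, "nofail ignored: size=%zu\n", size);
		}
		return NULL;
	}

	/* The limited case: keep retrying until the allocation succeeds. */
	for (;;) {
		void *p = malloc(size ? size : 1);

		if (p)
			return p;
	}
}
```

[ A caller that passes an oversized request gets NULL back and has to
  handle it like any other allocation failure. ]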

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:27                         ` David Hildenbrand
  2024-08-22  9:34                           ` Linus Torvalds
@ 2024-08-22  9:41                           ` Michal Hocko
  2024-08-22  9:42                             ` David Hildenbrand
  1 sibling, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  9:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, Linus Torvalds, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 11:27:25, David Hildenbrand wrote:
[...]
> Likely, checkpatch should be updated to warn on any new NOFAIL usage.

What do you expect people to do? I do not see a pattern of willy-nilly
use of the flag (we have less than 200 in the kernel outside of mm,
compare that to 40k GFP_KERNEL allocations). If you warn and the only
answer is shrug and go on then this serves no purpose.

I have learned a couple of things. People do not really give deep
thought to the naming (e.g. the GFP_TEMPORARY story), or to the
documentation (e.g. the GFP_NOWAIT | GFP_NOFAIL user). They sometimes pay
attention to warnings, but WARN_ON_ONCE tells you about a single abuser...
I believe that enforcing constraints for GFP_NOFAIL by killing the
allocating user context would have a more visible effect than WARN_ON, and
it would stop a potential silent failure mode at the same time.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:41                           ` Michal Hocko
@ 2024-08-22  9:42                             ` David Hildenbrand
  0 siblings, 0 replies; 101+ messages in thread
From: David Hildenbrand @ 2024-08-22  9:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Barry Song, Linus Torvalds, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On 22.08.24 11:41, Michal Hocko wrote:
> On Thu 22-08-24 11:27:25, David Hildenbrand wrote:
> [...]
>> Likely, checkpatch should be updated to warn on any new NOFAIL usage.
> 
> What do you expect people to do? I do not see a pattern of willy-nilly
> use of the flag (we have less than 200 in the kernel outside of mm,
> compare that to 40k GFP_KERNEL allocations). If you warn and the only
> answer is shrug and go on then this serves no purpose.

Like the BUG_ON checks I added. Some people still try to ignore them and 
we end up in discussions like this when I spot it ;)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:34                           ` Linus Torvalds
@ 2024-08-22  9:43                             ` David Hildenbrand
  2024-08-22  9:53                               ` Linus Torvalds
  2024-08-26 12:10                             ` Vlastimil Babka
  1 sibling, 1 reply; 101+ messages in thread
From: David Hildenbrand @ 2024-08-22  9:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Barry Song, Yafang Shao, akpm, linux-mm, 42.hyeyoo,
	cl, hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On 22.08.24 11:34, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>>
>> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> infinitely". if that implies just OOPSing or actually be in a busy loop,
>> I don't care. It could effectively happen with MAX_ORDER as well, as
>> stated. But certainly not BUG_ON.
> 
> No BUG_ON(), but also no endless loop.
> 
> Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> make it easy to find offenders, and then let them deal with it.
> 
> Don't take it upon yourself to say "we have to deal with any amount of
> stupidity".
> 
> The MM layer is not some slave to users. The MM layer is one of the
> most core pieces of code in the kernel, and as such the MM layer is
> damn well in charge.
> 
> Nobody has the right to say "I will not deal with allocation
> failures". The MM should not bend over backwards over something like
> that.
> 
> Seriously. Get a spine already, people. Tell random drivers that claim
> that they cannot deal with errors to just f-ck off.
> 
> And you don't do it by looping forever, and you don't do it by killing
> the kernel. You do it by ignoring their bullying tactics.
> 
> Then you document the *LIMITED* cases where you actually will try forever.

So on the buddy level, that might mean that we limit it to a single 
page, and document "NOFAIL is ineffective and ignored when allocating 
pages of order > 0. Any attempt will result in a WARN_ON_ONCE()". 
(assuming we can find and eliminate users that allocate order > 0 fairly 
easily)

{kv}malloc allocators would be different, as Michal said.

No idea if that is feasible, but it sounds like something you have in mind.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:33                           ` Michal Hocko
@ 2024-08-22  9:44                             ` Linus Torvalds
  2024-08-22  9:59                               ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  9:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Barry Song, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, 22 Aug 2024 at 17:33, Michal Hocko <mhocko@suse.com> wrote:
>
> Limiting NOFAIL semantic to SLUB and {kv}malloc allocators would make
> some sense then as it could enforce reasonable use more easily I guess.

If by "limit to SLUB" you mean "limit it to the kmalloc() cases that
can be done using the standard *SMALL* buckets, then maybe.

But even then it should probably be only if you don't ask for specific
nodes or other limitations on the allocation.

Because realize that "kmalloc()" and friends will fall back to other
things like __kmalloc_large_node_noprof(), and HELL NO those should
not honor NOFAIL.

And dammit, those kvmalloc() sizes need to be limited too. A number
like 24kB was mentioned. That sounds fine. Maybe even 64kB. But make
it *SMALL*.

And make it clear that it will return NULL if somebody misuses it.

            Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:43                             ` David Hildenbrand
@ 2024-08-22  9:53                               ` Linus Torvalds
  2024-08-22 11:58                                 ` Johannes Weiner
  0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22  9:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, Barry Song, Yafang Shao, akpm, linux-mm, 42.hyeyoo,
	cl, hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Thu, 22 Aug 2024 at 17:43, David Hildenbrand <david@redhat.com> wrote:
>
> So on the buddy level, that might mean that we limit it to a single
> page,

Actually, for many SLUB allocations, you probably do have to accept
the small orders - the slab caches are often two or four pages.

For example, a kmalloc(256) is an order-1 allocation on a buddy level
from a quick look at /proc/slabinfo.

So it's not necessarily only single pages. We do handle small orders.
But it gets exponentially harder, so it really is just the small
orders that work.

Looks like slub will use up to order-3. That smells like an off-by-one
to me (I thought we made 0-2 be the "cheap" orders, but maybe I'm just
wrong), but it probably is still acceptable.

             Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:44                             ` Linus Torvalds
@ 2024-08-22  9:59                               ` Michal Hocko
  2024-08-22 10:30                                 ` Linus Torvalds
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-22  9:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Barry Song, Yafang Shao, akpm, linux-mm,
	42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim, penberg,
	rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu 22-08-24 17:44:22, Linus Torvalds wrote:
[...]
> And dammit, those kvmalloc() sizes need to be limited too. A number
> like 24kB was mentioned. That sounds fine. Maybe even 64kB. But make
> it *SMALL*.

I am not particularly bothered by exact size for kvmalloc. There should
be a limit all right, but whether that is 1MB or 64kB shouldn't make
much of a difference because we do use 4kB pages to back those
allocations anyway. Sure 32b is a different story because the vmalloc
space itself is a scarce resource but do we care about those?

> And make it clear that it will return NULL if somebody misuses it.

What do you expect users to do with the return value then?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:59                               ` Michal Hocko
@ 2024-08-22 10:30                                 ` Linus Torvalds
  2024-08-22 10:46                                   ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Linus Torvalds @ 2024-08-22 10:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, Barry Song, Yafang Shao, Andrew Morton,
	linux-mm, Hyeonggon Yoo, Christoph Lameter, hailong.liu,
	Christoph Hellwig, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, v-songbaohua, Vlastimil Babka,
	virtualization

[ Sorry, on mobile right now, so HTML crud ]

On Thu, Aug 22, 2024, 17:59 Michal Hocko <mhocko@suse.com> wrote:

>
> > And make it clear that it will return NULL if somebody misuses it.
>
> What do you expect users do with the return value then?
>

NOT. YOUR. PROBLEM.

It's the problem of the caller.

What's so hard to understand about that? The MM layer has enough problems
on its own, it doesn't need to take on the problems of others.

      Linus

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22 10:30                                 ` Linus Torvalds
@ 2024-08-22 10:46                                   ` Michal Hocko
  0 siblings, 0 replies; 101+ messages in thread
From: Michal Hocko @ 2024-08-22 10:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Barry Song, Yafang Shao, Andrew Morton,
	linux-mm, Hyeonggon Yoo, Christoph Lameter, hailong.liu,
	Christoph Hellwig, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Roman Gushchin, Uladzislau Rezki, v-songbaohua, Vlastimil Babka,
	virtualization

On Thu 22-08-24 18:30:12, Linus Torvalds wrote:
> [ Sorry, on mobile right now, so HTML crud ]
> 
> On Thu, Aug 22, 2024, 17:59 Michal Hocko <mhocko@suse.com> wrote:
> 
> >
> > > And make it clear that it will return NULL if somebody misuses it.
> >
> > What do you expect users do with the return value then?
> >
> 
> NOT. YOUR. PROBLEM.
> 
> It's the problem of the caller.

They have already told you they (believe they) have no way to handle the
failure. So if they learn they can get NULL they will either BUG_ON, put a
loop around that or just close their eyes and hope for the best. I argue
that none of those is a good option because they all lead to poor and
buggy code.

Look Linus, I believe we are getting into a circle and I do not have
much more to offer into this discussion. I do agree with you that we
should define boundaries of GFP_NOFAIL more explicitly. The documentation we
have is not an effective way to do that. I strongly disagree that WARN_ONCE,
returning NULL and closing our eyes is good programming, nor does it
encourage good programming on the caller side. I have offered to explicitly
oops in those cases inside the allocator (while not holding any internal
allocator locks) as a reasonable compromise. I even believe that this
is a useful tool in other contexts as well. I haven't heard your opinion
on that so far. If I had time to work on that myself I would give it a
try.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:53                               ` Linus Torvalds
@ 2024-08-22 11:58                                 ` Johannes Weiner
  0 siblings, 0 replies; 101+ messages in thread
From: Johannes Weiner @ 2024-08-22 11:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Hildenbrand, Michal Hocko, Barry Song, Yafang Shao, akpm,
	linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, Aug 22, 2024 at 05:53:22PM +0800, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:43, David Hildenbrand <david@redhat.com> wrote:
> >
> > So on the buddy level, that might mean that we limit it to a single
> > page,
> 
> Actually, for many SLUB allocations, you probably do have to accept
> the small orders - the slab caches are often two or four pages.
> 
> For example, a kmalloc(256) is an order-1 allocation on a buddy level
> from a quick look at /proc/slabinfo.

It will try higher orders to reduce fragmentation (and mask out NOFAIL
on those attempts), but it can fall back to the minimum size required
for the object, i.e. get_order(size).

> So it's not necessarily only single pages. We do handle small orders.
> But it gets exponentially harder, so it really is just the small
> orders that work.

Agreed.

> Looks like slub will use up to order-3. That smells like an off-by-one
> to me (I thought we made 0-2 be the "cheap" orders, but maybe I'm just
> wrong), but it probably is still acceptable.

It's 0-3. We #define PAGE_ALLOC_COSTLY_ORDER 3, but it's exclusive -
all the order checks are > costly and <= costly.
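
[ Illustration: a small userspace model of the two pieces mentioned above,
  get_order(size) and the exclusive PAGE_ALLOC_COSTLY_ORDER comparison. It
  assumes 4 KiB pages; the sk_ helpers are written for this example and are
  not the kernel's implementations, although the value 3 matches the
  #define quoted in the message. ]

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12u              /* assumes 4 KiB pages */
#define SK_PAGE_ALLOC_COSTLY_ORDER 3u  /* value quoted in the message */

/* Smallest order such that 2^order pages cover size bytes (size > 0). */
static unsigned int sk_get_order(size_t size)
{
	unsigned int order = 0;

	size = (size - 1) >> SK_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* "Exclusive" check: order 3 itself still counts as cheap. */
static bool sk_order_is_costly(unsigned int order)
{
	return order > SK_PAGE_ALLOC_COSTLY_ORDER;
}
```

[ Under these assumptions a 32 KiB request maps to order 3 and is still
  "cheap", while anything that needs order 4 or more is costly. ]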


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  6:37       ` Barry Song
@ 2024-08-22 14:22         ` Yafang Shao
  0 siblings, 0 replies; 101+ messages in thread
From: Yafang Shao @ 2024-08-22 14:22 UTC (permalink / raw)
  To: Barry Song
  Cc: Linus Torvalds, David Hildenbrand, akpm, linux-mm, 42.hyeyoo, cl,
	hailong.liu, hch, iamjoonsoo.kim, mhocko, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, vbabka, virtualization

On Thu, Aug 22, 2024 at 2:37 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Aug 22, 2024 at 12:41 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> > > > >
> > > > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > > > than exposing NULL dereferences to callers who do not check the return
> > > > > value.
> > > >
> > > > I am not convinced that BUG_ON is the right tool here to save the world,
> > > > but I see how we arrived here.
> > >
> > > I think the thing to do is to just add a
> > >
> > >      WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));
> > >
> > > or similar, where that bad_nofail_alloc() checks that the allocation
> > > order is small and that the flags are sane for a NOFAIL allocation.
> > >
> > > Because no, BUG_ON() is *never* the answer. The answer is to make sure
> > > nobody ever sets NOFAIL in situations where the allocation can fail
> > > and there is no way forward.
> > >
> > > A BUG_ON() will quite likely just make things worse. You're better off
> > > with a WARN_ON() and letting the caller just oops.
> > >
> > > Honestly, I'm perfectly fine with just removing that stupid useless
> > > flag entirely. The flag goes back to 2003 and was introduced in
> > > 2.5.69, and was meant to be for very particular uses that otherwise
> > > just looped waiting for memory.
> > >
> > > Back in 2.5.69, there was exactly one user: the jbd journal code, that
> > > did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
> > > expanded by another user in XFS, and even that one had a comment
> > > saying that it needed to be narrowed down. And in fact, by the 2.6.12
> > > release, that XFS use had been removed, but the jbd journal had grown
> > > another jbd_kmalloc case for transaction data. So at the beginning of
> > > the git archives, we had exactly *one* user (with two places).
> > >
> > > *THAT* is the kind of use that the flag was meant for: small
> > > allocations required to make forward progress in writeout during
> > > memory pressure.
> > >
> > > It has then expanded and is now a problem. The cases using GFP_NOFAIL
> > > for things like vmalloc() - which is by definition not a small
> > > allocation - should be just removed as outright bugs.
> >
> > One potential approach could be to rename GFP_NOFAIL to
> > GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> > to clear this flag for larger allocations. However, the challenge lies
> > in determining what constitutes a 'small' allocation.
>
> I'm not entirely sure if our concern is with higher order or larger size.

I believe both should be considered. Since the higher-order task might
be easier to address, starting with that seems like the more
straightforward approach.

> Higher
> order might pose a problem, but larger size(not too large) isn't
> always an issue.
> Allocating 100 * 4KiB pages is possibly easier than allocating a single
> 128KB folio.
>
> Are we trying to limit the physical size or the physical order? If the concern
> is order, vmalloc manages __GFP_NOFAIL by mapping order-0 pages. If the
> concern is higher order, this sounds reasonable.  but it seems the buddy
> system already has code to trigger a warning even for order > 1:

To avoid potential livelock, it may be wise to drop this flag for
higher-order allocations as well. Following Linus's suggestion, we
could start by removing it for "> PAGE_ALLOC_COSTLY_ORDER".

>
> struct page *rmqueue(struct zone *preferred_zone,
>                         struct zone *zone, unsigned int order,
>                         gfp_t gfp_flags, unsigned int alloc_flags,
>                         int migratetype)
> {
>         struct page *page;
>
>         /*
>          * We most definitely don't want callers attempting to
>          * allocate greater than order-1 page units with __GFP_NOFAIL.
>          */
>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

This line was added by Michal in commit 0f352e5392c8 ("mm: remove
__GFP_NOFAIL is deprecated comment"), but it appears that Michal has
since reconsidered his stance. ;)



--
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  8:04                 ` Gao Xiang
@ 2024-08-22 14:35                   ` Yafang Shao
  2024-08-22 15:02                     ` Gao Xiang
  0 siblings, 1 reply; 101+ messages in thread
From: Yafang Shao @ 2024-08-22 14:35 UTC (permalink / raw)
  To: Gao Xiang
  Cc: Michal Hocko, Linus Torvalds, David Hildenbrand, Barry Song, akpm,
	linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization

On Thu, Aug 22, 2024 at 4:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Michal,
>
> On 2024/8/22 15:54, Michal Hocko wrote:
> > On Thu 22-08-24 15:01:43, Gao Xiang wrote:
> >> In my opinion, I'm not sure how PAGE_ALLOC_COSTLY_ORDER restriction
> >> means for a single shot.  Because assume even if you don't consider
> >> a virtual consecutive buffer, people could also do
> >> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to get almost
> >> the same heavy workload to the whole system.  And we also allow
> >> direct/kswap reclaim here.
> >
> > Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
> > makes sense outside of the page allocator proper. There is no reason why
> > vmalloc NOFAIL should be constrained by that. Sure it should be
> > constrained to some value but considering it is just a bunch of PAGE_SIZE
> > allocation then the limit could be higher. I am not sure where the
> > practical limit should be but anything that requires more than couple of
> > MBs seems really excessive.
>
> Yeah, totally agreed, that would make my own life easier, of
> course I will not allocate MBs insanely.
>
> I've always been trying to kill unnecessary NOFAILs (mostly together
> with code cleanups), but if a failure path increases more than
> 100 LOCs just for rare failure and extreme workloads, I _do_
> hope kvmalloc(NOFAIL) could work instead.

If the LOCs in the error handler are a concern, I believe we can
simplify it to a single line: while (!alloc()), which is essentially
what NOFAIL does and is also the reason we want desperate NOFAIL.

A better approach might involve failing after a maximum number of
retries at the call site, for example:

  while (try < max_retries && !alloc())

At least that is better than the endless loop in the page allocator.
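
[ Illustration: one way the bounded call-site retry above could look,
  written as a self-contained userspace sketch. The function-pointer type,
  the helper names and the deliberately flaky allocator are all made up for
  this example; a real caller would still need a failure path once the
  retries are exhausted. ]

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator signature used by this sketch. */
typedef void *(*sk_alloc_fn)(size_t size);

/* Try the allocation at most max_retries times, then give up with NULL. */
static void *sk_alloc_bounded(sk_alloc_fn alloc, size_t size,
			      unsigned int max_retries)
{
	for (unsigned int attempt = 0; attempt < max_retries; attempt++) {
		void *p = alloc(size);

		if (p)
			return p;
	}
	return NULL;
}

/* Demo allocator: fails sk_fail_budget times, then falls back to malloc(). */
static unsigned int sk_fail_budget;

static void *sk_flaky_alloc(size_t size)
{
	if (sk_fail_budget) {
		sk_fail_budget--;
		return NULL;
	}
	return malloc(size);
}
```

[ Unlike a loop inside the allocator, the bound is visible at the call
  site, and the caller decides what happens after the last attempt. ]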

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22 14:35                   ` Yafang Shao
@ 2024-08-22 15:02                     ` Gao Xiang
  0 siblings, 0 replies; 101+ messages in thread
From: Gao Xiang @ 2024-08-22 15:02 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Michal Hocko, Linus Torvalds, David Hildenbrand, Barry Song, akpm,
	linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua, vbabka,
	virtualization



On 2024/8/22 22:35, Yafang Shao wrote:
> On Thu, Aug 22, 2024 at 4:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> Hi Michal,
>>
>> On 2024/8/22 15:54, Michal Hocko wrote:
>>> On Thu 22-08-24 15:01:43, Gao Xiang wrote:
>>>> In my opinion, I'm not sure how PAGE_ALLOC_COSTLY_ORDER restriction
>>>> means for a single shot.  Because assume even if you don't consider
>>>> a virtual consecutive buffer, people could also do
>>>> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to get almost
>>>> the same heavy workload to the whole system.  And we also allow
>>>> direct/kswap reclaim here.
>>>
>>> Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
>>> makes sense outside of the page allocator proper. There is no reason why
>>> vmalloc NOFAIL should be constrained by that. Sure it should be
>>> constrained to some value but considering it is just a bunch of PAGE_SIZE
>>> allocation then the limit could be higher. I am not sure where the
>>> practical limit should be but anything that requires more than couple of
>>> MBs seems really excessive.
>>
>> Yeah, totally agreed, that would make my own life easier, of
>> course I will not allocate MBs insanely.
>>
>> I've always been trying to kill unnecessary NOFAILs (mostly together
>> with code cleanups), but if a failure path increases more than
>> 100 LOCs just for rare failure and extreme workloads, I _do_
>> hope kvmalloc(NOFAIL) could work instead.
> 
> If the LOCs in the error handler are a concern, I believe we can
> simplify it to a single line: while (!alloc()), which is essentially
> what NOFAIL does and is also the reason we want desperate NOFAIL.
> 
> A better approach might involve failing after a maximum number of
> retries at the call site, for example:
> 
>    while (try < max_retries && !alloc())
> 
> At least that is better than the endless loop in the page allocator.

Funny.

Thanks,
Gao Xiang
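
For reference, the bounded-retry pattern quoted above could be sketched
roughly as follows. This is a userspace illustration only: try_alloc(),
alloc_with_retries(), fail_first and the retry budget are hypothetical
names introduced here, not kernel APIs, and malloc() merely stands in
for an allocation that is allowed to fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fault injection: the next fail_first calls to try_alloc() fail. */
static unsigned int fail_first;

/* Hypothetical stand-in for an allocation that is allowed to fail. */
static void *try_alloc(size_t size)
{
	if (fail_first > 0) {
		fail_first--;
		return NULL;
	}
	return malloc(size);
}

/*
 * Retry up to max_retries times, then give up and return NULL so that
 * the caller has to handle the failure instead of looping forever.
 */
static void *alloc_with_retries(size_t size, unsigned int max_retries)
{
	unsigned int attempt;
	void *p = NULL;

	for (attempt = 0; attempt < max_retries && !p; attempt++)
		p = try_alloc(size);
	return p;
}
```

Note that a caller of such a helper still needs an error path for the
NULL return, which is exactly the code the 100-LOC concern above is
about.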


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-22  9:34                           ` Linus Torvalds
  2024-08-22  9:43                             ` David Hildenbrand
@ 2024-08-26 12:10                             ` Vlastimil Babka
  2024-08-27  6:57                               ` Linus Torvalds
  2024-08-27  7:15                               ` Barry Song
  1 sibling, 2 replies; 101+ messages in thread
From: Vlastimil Babka @ 2024-08-26 12:10 UTC (permalink / raw)
  To: Linus Torvalds, David Hildenbrand
  Cc: Michal Hocko, Barry Song, Yafang Shao, akpm, linux-mm, 42.hyeyoo,
	cl, hailong.liu, hch, iamjoonsoo.kim, penberg, rientjes,
	roman.gushchin, urezki, v-songbaohua, virtualization

On 8/22/24 11:34, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>>
>> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> infinitely". if that implies just OOPSing or actually be in a busy loop,
>> I don't care. It could effectively happen with MAX_ORDER as well, as
>> stated. But certainly not BUG_ON.
> 
> No BUG_ON(), but also no endless loop.
> 
> Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> make it easy to find offenders, and then let them deal with it.

Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
we're about to actually return NULL, so the memory has to be depleted
already. To make it easier to find the offenders much more reliably, we
should consider doing it sooner, but also not add unnecessary overhead to
allocator fastpaths just because of the potentially buggy users. So either
always in __alloc_pages_slowpath(), which should be often enough (unless the
system never needs to wake up kswapd to reclaim) but with negligible enough
overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?
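
The slow-path placement discussed above can be illustrated with a small
userspace model. Everything here is an assumption made for the sketch:
the MODEL_* flag values, free_pages, warn_once(), model_slowpath() and
model_alloc() are hypothetical and heavily simplified (the real slow
path is entered on watermark failure, not only when memory is gone).
The point is only that a misusing caller costs nothing while the fast
path succeeds and is reported once the slow path is entered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model; the kernel's values differ. */
#define MODEL_DIRECT_RECLAIM 0x1u
#define MODEL_NOFAIL         0x2u

static unsigned int free_pages;	/* pages the model fast path can hand out */
static unsigned int warnings;	/* number of times the model warned */

/* Warn at most once, in the spirit of WARN_ON_ONCE(). */
static void warn_once(bool cond)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warnings++;
	}
}

/* Slow path: the only place where the misuse check runs. */
static bool model_slowpath(unsigned int gfp)
{
	bool nofail = gfp & MODEL_NOFAIL;
	bool can_direct_reclaim = gfp & MODEL_DIRECT_RECLAIM;

	warn_once(nofail && !can_direct_reclaim);
	return false;	/* the model has nothing to reclaim */
}

/* Fast path: no flag checks at all while free pages remain. */
static bool model_alloc(unsigned int gfp)
{
	if (free_pages > 0) {
		free_pages--;
		return true;
	}
	return model_slowpath(gfp);
}
```

In this model a MODEL_NOFAIL request without MODEL_DIRECT_RECLAIM goes
unnoticed as long as free_pages is non-zero, which mirrors the caveat
above about systems that never need to enter the slow path.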

> Don't take it upon yourself to say "we have to deal with any amount of
> stupidity".
> 
> The MM layer is not some slave to users. The MM layer is one of the
> most core pieces of code in the kernel, and as such the MM layer is
> damn well in charge.
> 
> Nobody has the right to say "I will not deal with allocation
> failures". The MM should not bend over backwards over something like
> that.
> 
> Seriously. Get a spine already, people. Tell random drivers that claim
> that they cannot deal with errors to just f-ck off.
> 
> And you don't do it by looping forever, and you don't do it by killing
> the kernel. You do it by ignoring their bullying tactics.
> 
> Then you document the *LIMITED* cases where you actually will try forever.
> 
> This discussion has gone on for too damn long.
> 
>               Linus



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-26 12:10                             ` Vlastimil Babka
@ 2024-08-27  6:57                               ` Linus Torvalds
  2024-08-27  7:15                               ` Barry Song
  1 sibling, 0 replies; 101+ messages in thread
From: Linus Torvalds @ 2024-08-27  6:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: David Hildenbrand, Michal Hocko, Barry Song, Yafang Shao, akpm,
	linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization

On Tue, 27 Aug 2024 at 00:10, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
> we're about to actually return NULL, so the memory has to be depleted
> already. To make it easier to find the offenders much more reliably, we
> should consider doing it sooner, but also not add unnecessary overhead to
> allocator fastpaths just because of the potentially buggy users.

Ack. Sounds like a sane model to me. And I agree that
__alloc_pages_slowpath() is likely a reasonable middle path between
"catch misuses, but don't put it in the hotpath".

               Linus


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-26 12:10                             ` Vlastimil Babka
  2024-08-27  6:57                               ` Linus Torvalds
@ 2024-08-27  7:15                               ` Barry Song
  2024-08-27  7:38                                 ` Vlastimil Babka
  1 sibling, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-27  7:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, David Hildenbrand, Michal Hocko, Yafang Shao,
	akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization

On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/22/24 11:34, Linus Torvalds wrote:
> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
> >>
> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
> >> infinitely". if that implies just OOPSing or actually be in a busy loop,
> >> I don't care. It could effectively happen with MAX_ORDER as well, as
> >> stated. But certainly not BUG_ON.
> >
> > No BUG_ON(), but also no endless loop.
> >
> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> > make it easy to find offenders, and then let them deal with it.
>
> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
> we're about to actually return NULL, so the memory has to be depleted
> already. To make it easier to find the offenders much more reliably, we
> should consider doing it sooner, but also not add unnecessary overhead to
> allocator fastpaths just because of the potentially buggy users. So either
> always in __alloc_pages_slowpath(), which should be often enough (unless the
> system never needs to wake up kswapd to reclaim) but with negligible enough
> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?

We already have a WARN_ON for order > 1 in rmqueue(). We might extend
the condition there to check the flags as well?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7dcb0713eb57..b5717c6569f9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone,
  /*
  * We most definitely don't want callers attempting to
  * allocate greater than order-1 page units with __GFP_NOFAIL.
+ * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
+ * which can result in a lockup
  */
- WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+ WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) &&
+     (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM)));

  if (likely(pcp_allowed_order(order))) {
  page = rmqueue_pcplist(preferred_zone, zone, order,
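
The extended condition in the diff above can be read as a plain
predicate. The sketch below restates it in userspace C; the MODEL_*
flag values and the nofail_misuse() name are hypothetical and chosen
only for illustration, they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model; the kernel's values differ. */
#define MODEL_DIRECT_RECLAIM 0x1u
#define MODEL_NOFAIL         0x2u

/*
 * True when a nofail request is one the proposed check would warn
 * about: order above 1, or no direct reclaim allowed.
 */
static bool nofail_misuse(unsigned int gfp, unsigned int order)
{
	return (gfp & MODEL_NOFAIL) &&
	       (order > 1 || !(gfp & MODEL_DIRECT_RECLAIM));
}
```

Requests without the nofail bit never trip the predicate, whatever
their order or reclaim flags.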

>
> > Don't take it upon yourself to say "we have to deal with any amount of
> > stupidity".
> >
> > The MM layer is not some slave to users. The MM layer is one of the
> > most core pieces of code in the kernel, and as such the MM layer is
> > damn well in charge.
> >
> > Nobody has the right to say "I will not deal with allocation
> > failures". The MM should not bend over backwards over something like
> > that.
> >
> > Seriously. Get a spine already, people. Tell random drivers that claim
> > that they cannot deal with errors to just f-ck off.
> >
> > And you don't do it by looping forever, and you don't do it by killing
> > the kernel. You do it by ignoring their bullying tactics.
> >
> > Then you document the *LIMITED* cases where you actually will try forever.
> >
> > This discussion has gone on for too damn long.
> >
> >               Linus
>


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-27  7:15                               ` Barry Song
@ 2024-08-27  7:38                                 ` Vlastimil Babka
  2024-08-27  7:50                                   ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Vlastimil Babka @ 2024-08-27  7:38 UTC (permalink / raw)
  To: Barry Song
  Cc: Linus Torvalds, David Hildenbrand, Michal Hocko, Yafang Shao,
	akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization

On 8/27/24 09:15, Barry Song wrote:
> On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 8/22/24 11:34, Linus Torvalds wrote:
>> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>> >>
>> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> >> infinitely". if that implies just OOPSing or actually be in a busy loop,
>> >> I don't care. It could effectively happen with MAX_ORDER as well, as
>> >> stated. But certainly not BUG_ON.
>> >
>> > No BUG_ON(), but also no endless loop.
>> >
>> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
>> > make it easy to find offenders, and then let them deal with it.
>>
>> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
>> we're about to actually return NULL, so the memory has to be depleted
>> already. To make it easier to find the offenders much more reliably, we
>> should consider doing it sooner, but also not add unnecessary overhead to
>> allocator fastpaths just because of the potentially buggy users. So either
>> always in __alloc_pages_slowpath(), which should be often enough (unless the
>> system never needs to wake up kswapd to reclaim) but with negligible enough
>> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?
> 
> We already have a WARN_ON for order > 1 in rmqueue. we might extend
> the condition there to include checking flags as well?

Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
existing users of __GFP_NOFAIL with order > 1 :)

But also the check is in the hotpath, even before trying the pcplists, so we
could move it to __alloc_pages_slowpath() while extending it?

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7dcb0713eb57..b5717c6569f9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone,
>   /*
>   * We most definitely don't want callers attempting to
>   * allocate greater than order-1 page units with __GFP_NOFAIL.
> + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
> + * which can result in a lockup
>   */
> - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) &&
> +     (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM)));
> 
>   if (likely(pcp_allowed_order(order))) {
>   page = rmqueue_pcplist(preferred_zone, zone, order,
> 
>>
>> > Don't take it upon yourself to say "we have to deal with any amount of
>> > stupidity".
>> >
>> > The MM layer is not some slave to users. The MM layer is one of the
>> > most core pieces of code in the kernel, and as such the MM layer is
>> > damn well in charge.
>> >
>> > Nobody has the right to say "I will not deal with allocation
>> > failures". The MM should not bend over backwards over something like
>> > that.
>> >
>> > Seriously. Get a spine already, people. Tell random drivers that claim
>> > that they cannot deal with errors to just f-ck off.
>> >
>> > And you don't do it by looping forever, and you don't do it by killing
>> > the kernel. You do it by ignoring their bullying tactics.
>> >
>> > Then you document the *LIMITED* cases where you actually will try forever.
>> >
>> > This discussion has gone on for too damn long.
>> >
>> >               Linus
>>



^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-27  7:38                                 ` Vlastimil Babka
@ 2024-08-27  7:50                                   ` Barry Song
  2024-08-29 10:24                                     ` Vlastimil Babka
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-27  7:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, David Hildenbrand, Michal Hocko, Yafang Shao,
	akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization

On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/27/24 09:15, Barry Song wrote:
> > On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 8/22/24 11:34, Linus Torvalds wrote:
> >> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
> >> >>
> >> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
> >> >> infinitely". if that implies just OOPSing or actually be in a busy loop,
> >> >> I don't care. It could effectively happen with MAX_ORDER as well, as
> >> >> stated. But certainly not BUG_ON.
> >> >
> >> > No BUG_ON(), but also no endless loop.
> >> >
> >> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> >> > make it easy to find offenders, and then let them deal with it.
> >>
> >> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
> >> we're about to actually return NULL, so the memory has to be depleted
> >> already. To make it easier to find the offenders much more reliably, we
> >> should consider doing it sooner, but also not add unnecessary overhead to
> >> allocator fastpaths just because of the potentially buggy users. So either
> >> always in __alloc_pages_slowpath(), which should be often enough (unless the
> >> system never needs to wake up kswapd to reclaim) but with negligible enough
> >> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?
> >
> > We already have a WARN_ON for order > 1 in rmqueue. we might extend
> > the condition there to include checking flags as well?
>
> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> existing users of __GFP_NOFAIL with order > 1 :)
>
> But also the check is in the hotpath, even before trying the pcplists, so we
> could move it to __alloc_pages_slowpath() while extending it?

Agreed. I don't think it is reasonable to check the order and flags in
two different places, especially since rmqueue() already pays the
overhead of the gfp_flags & __GFP_NOFAIL and order > 1 tests.

We can at least extend the current check to make some improvement,
though I still believe Michal's suggestion of implementing OOPS_ON is a
better approach to pursue, as it doesn't crash the entire system
while ensuring the problematic process is terminated.

>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7dcb0713eb57..b5717c6569f9 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone,
> >   /*
> >   * We most definitely don't want callers attempting to
> >   * allocate greater than order-1 page units with __GFP_NOFAIL.
> > + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
> > + * which can result in a lockup
> >   */
> > - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> > + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) &&
> > +     (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM)));
> >
> >   if (likely(pcp_allowed_order(order))) {
> >   page = rmqueue_pcplist(preferred_zone, zone, order,
> >
> >>
> >> > Don't take it upon yourself to say "we have to deal with any amount of
> >> > stupidity".
> >> >
> >> > The MM layer is not some slave to users. The MM layer is one of the
> >> > most core pieces of code in the kernel, and as such the MM layer is
> >> > damn well in charge.
> >> >
> >> > Nobody has the right to say "I will not deal with allocation
> >> > failures". The MM should not bend over backwards over something like
> >> > that.
> >> >
> >> > Seriously. Get a spine already, people. Tell random drivers that claim
> >> > that they cannot deal with errors to just f-ck off.
> >> >
> >> > And you don't do it by looping forever, and you don't do it by killing
> >> > the kernel. You do it by ignoring their bullying tactics.
> >> >
> >> > Then you document the *LIMITED* cases where you actually will try forever.
> >> >
> >> > This discussion has gone on for too damn long.
> >> >
> >> >               Linus
> >>
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-27  7:50                                   ` Barry Song
@ 2024-08-29 10:24                                     ` Vlastimil Babka
  2024-08-29 11:53                                       ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Vlastimil Babka @ 2024-08-29 10:24 UTC (permalink / raw)
  To: Barry Song
  Cc: Linus Torvalds, David Hildenbrand, Michal Hocko, Yafang Shao,
	akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization, linux-hardening@vger.kernel.org

On 8/27/24 09:50, Barry Song wrote:
> On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>>
>> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
>> existing users of __GFP_NOFAIL with order > 1 :)
>>
>> But also the check is in the hotpath, even before trying the pcplists, so we
>> could move it to __alloc_pages_slowpath() while extending it?
> 
> Agreed. I don't think it is reasonable to check the order and flags in
> two different places especially rmqueue() has already had
> gfp_flags & __GFP_NOFAIL operation and order > 1
> overhead.
> 
> We can at least extend the current check to make some improvement
> though I still believe Michal's suggestion of implementing OOPS_ON is a
> better approach to pursue, as it doesn't crash the entire system
> while ensuring the problematic process is terminated.

Linus made clear it's not an mm concern. If e.g. hardening people want to
pursue that instead, they can.

BTW I think BUG_ON already works like this: if possible, only the calling
process is terminated. A panic happens when in irq context, or
due to panic_on_oops, which the security people are setting to 1 anyway and
which OOPS_ON would have to observe too. So AFAICS the only difference from
BUG_ON would be not panicking in irq context if panic_on_oops isn't set
(as for "no mm locks held", I think it's already satisfied at the points we
check for __GFP_NOFAIL).
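
The behaviour described above can be written down as a small decision
function. This is only a model of the description in this message, not
a statement about every architecture; the enum and the
model_bug_outcome() name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible results of hitting BUG_ON() in this model. */
enum bug_outcome {
	BUG_KILLS_TASK,	/* only the calling process is terminated */
	BUG_PANICS	/* the whole system panics */
};

/*
 * Panic when in irq context or when panic_on_oops is set; otherwise
 * only the offending task is killed.
 */
static enum bug_outcome model_bug_outcome(bool in_irq, bool panic_on_oops)
{
	if (in_irq || panic_on_oops)
		return BUG_PANICS;
	return BUG_KILLS_TASK;
}
```

Under this model, an OOPS_ON as discussed would differ from BUG_ON only
in the (in_irq, !panic_on_oops) case.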


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-29 10:24                                     ` Vlastimil Babka
@ 2024-08-29 11:53                                       ` Barry Song
  2024-08-29 13:20                                         ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-29 11:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, David Hildenbrand, Michal Hocko, Yafang Shao,
	akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization, linux-hardening@vger.kernel.org

On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/27/24 09:50, Barry Song wrote:
> > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >>
> >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> >> existing users of __GFP_NOFAIL with order > 1 :)
> >>
> >> But also the check is in the hotpath, even before trying the pcplists, so we
> >> could move it to __alloc_pages_slowpath() while extending it?
> >
> > Agreed. I don't think it is reasonable to check the order and flags in
> > two different places especially rmqueue() has already had
> > gfp_flags & __GFP_NOFAIL operation and order > 1
> > overhead.
> >
> > We can at least extend the current check to make some improvement
> > though I still believe Michal's suggestion of implementing OOPS_ON is a
> > better approach to pursue, as it doesn't crash the entire system
> > while ensuring the problematic process is terminated.
>
> Linus made clear it's not a mm concern. If e.g. hardening people want to
> pursuit that instead, they can.
>
> BTW I think BUG_ON already works like this, if possible only the calling
> process is terminated. panic happens in case of being in a irq context, or

You are right. This is a detail I overlooked in the last discussion.
BUG_ON already terminates only the bad process when it can
(panic_on_oops=N and not in irq context).

> due to panic_on_oops. Which the security people are setting to 1 anyway and
> OOPS_ON would have to observe it too. So AFAICS the only difference from
> BUG_ON would be not panic in the irq context, if panic_on_oops isn't set.

right.

> (as for "no mm locks held" I think it's already satisfied at the points we
> check for __GFP_NOFAIL).

Let me summarize the discussion:

Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic
in the vdpa driver, is necessary.
Patch 2/4, which updates the documentation to clarify that non-blockable
gfp_nofail is not supported, is needed.
Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the
size is too large, where gfp_nofail will return NULL.
Patch 4/4: We will move the order > 1 check from the current fast path
to the slow path and also add the gfp_direct_reclaim flag check in the
slow path.

If nobody has an objection, I will prepare v4 as above.

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-29 11:53                                       ` Barry Song
@ 2024-08-29 13:20                                         ` Michal Hocko
  2024-08-29 21:27                                           ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-29 13:20 UTC (permalink / raw)
  To: Barry Song
  Cc: Vlastimil Babka, Linus Torvalds, David Hildenbrand, Yafang Shao,
	akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization, linux-hardening@vger.kernel.org

On Thu 29-08-24 23:53:33, Barry Song wrote:
> On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 8/27/24 09:50, Barry Song wrote:
> > > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >>
> > >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> > >> existing users of __GFP_NOFAIL with order > 1 :)
> > >>
> > >> But also the check is in the hotpath, even before trying the pcplists, so we
> > >> could move it to __alloc_pages_slowpath() while extending it?
> > >
> > > Agreed. I don't think it is reasonable to check the order and flags in
> > > two different places especially rmqueue() has already had
> > > gfp_flags & __GFP_NOFAIL operation and order > 1
> > > overhead.
> > >
> > > We can at least extend the current check to make some improvement
> > > though I still believe Michal's suggestion of implementing OOPS_ON is a
> > > better approach to pursue, as it doesn't crash the entire system
> > > while ensuring the problematic process is terminated.
> >
> > Linus made clear it's not a mm concern. If e.g. hardening people want to
> > pursuit that instead, they can.
> >
> > BTW I think BUG_ON already works like this, if possible only the calling
> > process is terminated. panic happens in case of being in a irq context, or
> 
> you are right. This is a detail I overlooked in the last discussion.
> BUG_ON has already been exactly the case to only terminate the bad
> process if it can
> (panic_on_oops=N and not in irq context).

Are you sure about that? Maybe x86 implementation treats BUG as oops but
is this what that does on all arches? BUG() has historically meant stop
everything and die and I am not really sure when that would have
changed TBH.

> > due to panic_on_oops. Which the security people are setting to 1 anyway and
> > OOPS_ON would have to observe it too. So AFAICS the only difference from
> > BUG_ON would be not panic in the irq context, if panic_on_oops isn't set.
> 
> right.
> 
> > (as for "no mm locks held" I think it's already satisfied at the points we
> > check for __GFP_NOFAIL).
> 
> Let me summarize the discussion:
> 
> Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic
> in vdpa driver, is necessary.
> Patch 2/4, which updates the documentation to clarify that
> non-blockable gfp_nofail is not
>                   supported, is needed.

Let's please have those merged now.

> Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the
> size is too large,
>                  where gfp_nofail will return NULL.


I would pull this one out for a separate discussion. We should really
define what "too large" really means; INT_MAX etc. is not it at
all.

> Patch 4/4: We will move the order > 1 check from the current fast path
> to the slow path and extend
>                  the check of gfp_direct_reclaim flag also in the slow path.

OK, let's have that go in now as well.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-29 13:20                                         ` Michal Hocko
@ 2024-08-29 21:27                                           ` Barry Song
  2024-08-29 22:31                                             ` Barry Song
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-29 21:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Linus Torvalds, David Hildenbrand, Yafang Shao,
	akpm, linux-mm, 42.hyeyoo, cl, hailong.liu, hch, iamjoonsoo.kim,
	penberg, rientjes, roman.gushchin, urezki, v-songbaohua,
	virtualization, linux-hardening@vger.kernel.org

On Fri, Aug 30, 2024 at 1:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 23:53:33, Barry Song wrote:
> > On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 8/27/24 09:50, Barry Song wrote:
> > > > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >>
> > > >>
> > > >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> > > >> existing users of __GFP_NOFAIL with order > 1 :)
> > > >>
> > > >> But also the check is in the hotpath, even before trying the pcplists, so we
> > > >> could move it to __alloc_pages_slowpath() while extending it?
> > > >
> > > > Agreed. I don't think it is reasonable to check the order and flags in
> > > > two different places especially rmqueue() has already had
> > > > gfp_flags & __GFP_NOFAIL operation and order > 1
> > > > overhead.
> > > >
> > > > We can at least extend the current check to make some improvement
> > > > though I still believe Michal's suggestion of implementing OOPS_ON is a
> > > > better approach to pursue, as it doesn't crash the entire system
> > > > while ensuring the problematic process is terminated.
> > >
> > > Linus made clear it's not a mm concern. If e.g. hardening people want to
> > > pursuit that instead, they can.
> > >
> > > BTW I think BUG_ON already works like this, if possible only the calling
> > > process is terminated. panic happens in case of being in a irq context, or
> >
> > you are right. This is a detail I overlooked in the last discussion.
> > BUG_ON has already been exactly the case to only terminate the bad
> > process if it can
> > (panic_on_oops=N and not in irq context).
>
> Are you sure about that? Maybe x86 implementation treats BUG as oops but
> is this what that does on all arches? BUG() has historically meant stop
> everything and die and I am not really sure when that would have
> changed TBH.

My ARM64 machine also terminates only the bad process on BUG_ON()
if we are not in irq context and panic_on_oops is not set.

I guess it depends on HAVE_ARCH_BUG? If an arch has no BUG() of its own,
BUG() will just be a panic?

#ifndef HAVE_ARCH_BUG
#define BUG() do { \
        printk("BUG: failure at %s:%d/%s()!\n", __FILE__, __LINE__, __func__); \
        barrier_before_unreachable(); \
        panic("BUG!"); \
} while (0)
#endif

I assume it is equally difficult to implement OOPS_ON() if an arch lacks
HAVE_ARCH_BUG?

"grep" shows that most mainstream archs have their own HAVE_ARCH_BUG:

$ git grep HAVE_ARCH_BUG
arch/alpha/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/arc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/arm/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/arm64/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/csky/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/loongarch/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/m68k/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/mips/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/mips/include/asm/bug.h:#define HAVE_ARCH_BUG_ON
arch/parisc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/powerpc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/powerpc/include/asm/bug.h:#define HAVE_ARCH_BUG_ON
arch/riscv/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/s390/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/sh/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/sparc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/x86/include/asm/bug.h:#define HAVE_ARCH_BUG

>
> > > due to panic_on_oops. Which the security people are setting to 1 anyway and
> > > OOPS_ON would have to observe it too. So AFAICS the only difference from
> > > BUG_ON would be not panic in the irq context, if panic_on_oops isn't set.
> >
> > right.
> >
> > > (as for "no mm locks held" I think it's already satisfied at the points we
> > > check for __GFP_NOFAIL).
> >
> > Let me summarize the discussion:
> >
> > Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic
> > in vdpa driver, is necessary.
> > Patch 2/4, which updates the documentation to clarify that
> > non-blockable gfp_nofail is not
> >                   supported, is needed.
>
> Let's please have those merged now.
>
> > Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the
> > size is too large,
> >                  where gfp_nofail will return NULL.
>
>
> I would pull this one out for a separate discussion. We should really
> define what the too large really means and INT_MAX etc. is not it at
> all.

Makes sense.

>
> > Patch 4/4: We will move the order > 1 check from the current fast path
> > to the slow path and extend
> >                  the check of gfp_direct_reclaim flag also in the slow path.
>
> OK, let's have that go in now as well.
>
> --
> Michal Hocko
> SUSE Labs

Thanks
Barry


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-29 21:27                                           ` Barry Song
@ 2024-08-29 22:31                                             ` Barry Song
  2024-08-30  7:24                                               ` Michal Hocko
  0 siblings, 1 reply; 101+ messages in thread
From: Barry Song @ 2024-08-29 22:31 UTC (permalink / raw)
  To: 21cnbao, mhocko, vbabka
  Cc: 42.hyeyoo, akpm, cl, david, hailong.liu, hch, iamjoonsoo.kim,
	laoar.shao, linux-hardening, linux-mm, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, virtualization

> > > Patch 4/4: We will move the order > 1 check from the current fast path
> > > to the slow path and extend
> > >                  the check of gfp_direct_reclaim flag also in the slow path.
> >
> > OK, let's have that go in now as well.

Hi Michal and Vlastimil,
Could you please review the changes below before I send v4 for patch 4/4?

1. We should consolidate all warnings in one place. Currently, the order > 1 warning is
in the hotpath, while others are in less likely scenarios. Moving all warnings to the
slowpath will reduce the overhead for order > 1 and increase the visibility of other
warnings.

2. We currently have two warnings for order: one for order > 1 in the hotpath and another
for order > costly_order in the laziest path. I suggest standardizing on order > 1 since
it’s been in use for a long time.

3. I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is
meant to suppress allocation failure reports, but here we're dealing with bug detection, not
allocation failures.
So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c81ee5662cc7..0d3dd679d0ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone,
 {
 	struct page *page;
 
-	/*
-	 * We most definitely don't want callers attempting to
-	 * allocate greater than order-1 page units with __GFP_NOFAIL.
-	 */
-	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
 	if (likely(pcp_allowed_order(order))) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				       migratetype, alloc_flags);
@@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
+	bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM;
 	bool can_compact = gfp_compaction_allowed(gfp_mask);
 	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
 	struct page *page = NULL;
@@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
 
+	if (nofail) {
+		/*
+		 * We most definitely don't want callers attempting to
+		 * allocate greater than order-1 page units with __GFP_NOFAIL.
+		 */
+		WARN_ON_ONCE(order > 1);
+		/*
+		 * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
+		 * otherwise, we may result in lockup.
+		 */
+		WARN_ON_ONCE(!can_direct_reclaim);
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us.
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+	}
+
 restart:
 	compaction_retries = 0;
 	no_progress_loops = 0;
@@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
 	 * we always retry
 	 */
-	if (gfp_mask & __GFP_NOFAIL) {
+	if (nofail) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * Lacking direct_reclaim we can't do anything to reclaim memory,
+		 * we disregard these unreasonable nofail requests and still
+		 * return NULL
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
+		if (!can_direct_reclaim)
 			goto fail;
 
-		/*
-		 * PF_MEMALLOC request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us
-		 */
-		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
-
-		/*
-		 * non failing costly orders are a hard requirement which we
-		 * are not prepared for much so let's warn about these users
-		 * so that we can identify them and convert them to something
-		 * else.
-		 */
-		WARN_ON_ONCE_GFP(costly_order, gfp_mask);
-
 		/*
 		 * Help non-failing allocations by giving some access to memory
 		 * reserves normally used for high priority non-blocking

> >
> > --
> > Michal Hocko
> > SUSE Labs

Thanks
Barry



* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-29 22:31                                             ` Barry Song
@ 2024-08-30  7:24                                               ` Michal Hocko
  2024-08-30  7:37                                                 ` Vlastimil Babka
  0 siblings, 1 reply; 101+ messages in thread
From: Michal Hocko @ 2024-08-30  7:24 UTC (permalink / raw)
  To: Barry Song
  Cc: vbabka, 42.hyeyoo, akpm, cl, david, hailong.liu, hch,
	iamjoonsoo.kim, laoar.shao, linux-hardening, linux-mm, penberg,
	rientjes, roman.gushchin, torvalds, urezki, v-songbaohua,
	virtualization

On Fri 30-08-24 10:31:14, Barry Song wrote:
> > > > Patch 4/4: We will move the order > 1 check from the current fast path
> > > > to the slow path and extend the check of gfp_direct_reclaim flag also
> > > > in the slow path.
> > >
> > > OK, let's have that go in now as well.
> 
> Hi Michal and Vlastimil,
> Could you please review the changes below before I send v4 for patch 4/4?
> 
> 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is
> in the hotpath, while others are in less likely scenarios. Moving all warnings to the
> slowpath will reduce the overhead for order > 1 and increase the visibility of other
> warnings.
> 
> 2. We currently have two warnings for order: one for order > 1 in the hotpath and another
> for order > costly_order in the laziest path. I suggest standardizing on order > 1 since
> it’s been in use for a long time.
> 
> 3. I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is
> meant to suppress allocation failure reports, but here we're dealing with bug detection, not
> allocation failures.
> So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c81ee5662cc7..0d3dd679d0ab 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone,
>  {
>  	struct page *page;
>  
> -	/*
> -	 * We most definitely don't want callers attempting to
> -	 * allocate greater than order-1 page units with __GFP_NOFAIL.
> -	 */
> -	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> -
>  	if (likely(pcp_allowed_order(order))) {
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>  				       migratetype, alloc_flags);
> @@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  						struct alloc_context *ac)
>  {
>  	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
> +	bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM;
>  	bool can_compact = gfp_compaction_allowed(gfp_mask);
>  	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
>  	struct page *page = NULL;
> @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned int zonelist_iter_cookie;
>  	int reserve_flags;
>  
> +	if (nofail) {
> +		/*
> +		 * We most definitely don't want callers attempting to
> +		 * allocate greater than order-1 page units with __GFP_NOFAIL.
> +		 */
> +		WARN_ON_ONCE(order > 1);
> +		/*
> +		 * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
> +		 * otherwise, we may result in lockup.
> +		 */
> +		WARN_ON_ONCE(!can_direct_reclaim);
> +		/*
> +		 * PF_MEMALLOC request from this context is rather bizarre
> +		 * because we cannot reclaim anything and only can loop waiting
> +		 * for somebody to do a work for us.
> +		 */
> +		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
> +	}

Yes, this makes sense. Any reason you have not put that in the nofail
branch below?

> +
>  restart:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
> @@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
>  	 * we always retry
>  	 */
> -	if (gfp_mask & __GFP_NOFAIL) {
> +	if (nofail) {
>  		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * Lacking direct_reclaim we can't do anything to reclaim memory,
> +		 * we disregard these unreasonable nofail requests and still
> +		 * return NULL
>  		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> +		if (!can_direct_reclaim)
>  			goto fail;
>  
> -		/*
> -		 * PF_MEMALLOC request from this context is rather bizarre
> -		 * because we cannot reclaim anything and only can loop waiting
> -		 * for somebody to do a work for us
> -		 */
> -		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
> -
> -		/*
> -		 * non failing costly orders are a hard requirement which we
> -		 * are not prepared for much so let's warn about these users
> -		 * so that we can identify them and convert them to something
> -		 * else.
> -		 */
> -		WARN_ON_ONCE_GFP(costly_order, gfp_mask);
> -
>  		/*
>  		 * Help non-failing allocations by giving some access to memory
>  		 * reserves normally used for high priority non-blocking
> 
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
> 
> Thanks
> Barry

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3 0/4] mm: clarify nofail memory allocation
  2024-08-30  7:24                                               ` Michal Hocko
@ 2024-08-30  7:37                                                 ` Vlastimil Babka
  0 siblings, 0 replies; 101+ messages in thread
From: Vlastimil Babka @ 2024-08-30  7:37 UTC (permalink / raw)
  To: Michal Hocko, Barry Song
  Cc: 42.hyeyoo, akpm, cl, david, hailong.liu, hch, iamjoonsoo.kim,
	laoar.shao, linux-hardening, linux-mm, penberg, rientjes,
	roman.gushchin, torvalds, urezki, v-songbaohua, virtualization

On 8/30/24 09:24, Michal Hocko wrote:
> On Fri 30-08-24 10:31:14, Barry Song wrote:
>> > > > Patch 4/4: We will move the order > 1 check from the current fast path
>> > > > to the slow path and extend the check of gfp_direct_reclaim flag also
>> > > > in the slow path.
>> > >
>> > > OK, let's have that go in now as well.
>> 
>> Hi Michal and Vlastimil,
>> Could you please review the changes below before I send v4 for patch 4/4?
>> 
>> 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is
>> in the hotpath, while others are in less likely scenarios. Moving all warnings to the
>> slowpath will reduce the overhead for order > 1 and increase the visibility of other
>> warnings.
>> 
>> 2. We currently have two warnings for order: one for order > 1 in the hotpath and another
>> for order > costly_order in the laziest path. I suggest standardizing on order > 1 since
>> it’s been in use for a long time.
>> 
>> 3. I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is
>> meant to suppress allocation failure reports, but here we're dealing with bug detection, not
>> allocation failures.

Ack. __GFP_NOWARN is to suppress warnings in case the allocation has a less
expensive fallback to the current attempt, which logically means the current
attempt can't be a __GFP_NOFAIL one. So having both is a bug itself (not
worth reporting) so we can just ignore __GFP_NOWARN.
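
The difference between the two warning helpers can be sketched in
userspace C. This is only a model, not the kernel's macros: the names
warn_site, warn_on_once(), warn_on_once_gfp() and the MODEL_GFP_* bit
values are hypothetical stand-ins, and the only behaviour assumed is the
one described in this thread, namely that the _GFP variant stays silent
when the nowarn bit is set while the plain variant always reports once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical bit values for this model only; not the kernel's. */
#define MODEL_GFP_NOWARN 0x1u
#define MODEL_GFP_NOFAIL 0x2u

/* One instance per call site, standing in for a macro's static flag. */
struct warn_site {
	bool warned;
};

/* Number of reports actually emitted, so the behaviour can be observed. */
static int reports;

static void report(struct warn_site *site, const char *what)
{
	assert(site != NULL);
	site->warned = true;
	reports++;
	fprintf(stderr, "WARNING: %s\n", what);
}

/* Models WARN_ON_ONCE(): report the first time cond is true. */
static bool warn_on_once(struct warn_site *site, bool cond, const char *what)
{
	if (cond && !site->warned)
		report(site, what);
	return cond;
}

/* Models WARN_ON_ONCE_GFP(): same, but silent if the NOWARN bit is set. */
static bool warn_on_once_gfp(struct warn_site *site, bool cond,
			     unsigned int gfp, const char *what)
{
	if (cond && !(gfp & MODEL_GFP_NOWARN) && !site->warned)
		report(site, what);
	return cond;
}
```

With a mask of MODEL_GFP_NOFAIL | MODEL_GFP_NOWARN, warn_on_once_gfp()
returns true without reporting anything, so the misuse goes unnoticed;
warn_on_once() reports it exactly once per call site regardless of the
mask, which is the behaviour argued for above.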

>> So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP.
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c81ee5662cc7..0d3dd679d0ab 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone,
>>  {
>>  	struct page *page;
>>  
>> -	/*
>> -	 * We most definitely don't want callers attempting to
>> -	 * allocate greater than order-1 page units with __GFP_NOFAIL.
>> -	 */
>> -	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>> -
>>  	if (likely(pcp_allowed_order(order))) {
>>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>>  				       migratetype, alloc_flags);
>> @@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  						struct alloc_context *ac)
>>  {
>>  	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
>> +	bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM;

__GFP_NOFAIL

>>  	bool can_compact = gfp_compaction_allowed(gfp_mask);
>>  	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
>>  	struct page *page = NULL;
>> @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  	unsigned int zonelist_iter_cookie;
>>  	int reserve_flags;
>>  
>> +	if (nofail) {

Could add unlikely() to put it off the instruction cache hotpath.

>> +		/*
>> +		 * We most definitely don't want callers attempting to
>> +		 * allocate greater than order-1 page units with __GFP_NOFAIL.
>> +		 */
>> +		WARN_ON_ONCE(order > 1);
>> +		/*
>> +		 * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
>> +		 * otherwise, we may result in lockup.
>> +		 */
>> +		WARN_ON_ONCE(!can_direct_reclaim);
>> +		/*
>> +		 * PF_MEMALLOC request from this context is rather bizarre
>> +		 * because we cannot reclaim anything and only can loop waiting
>> +		 * for somebody to do a work for us.
>> +		 */
>> +		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
>> +	}
> 
> Yes, this makes sense. Any reason you have not put that in the nofail
> branch below?

Because that branch is executed only when we're already so depleted we gave
up retrying, and we want to warn about the buggy users more reliably (see
point 1 above).
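
A small userspace model may make the placement argument concrete. It is
a sketch under stated assumptions, not the real slowpath:
slowpath_model(), nofail_misuse_checks(), struct model_result and the
MODEL_GFP_* bit values are all hypothetical, the retry loop is reduced
to a counter, and the PF_MEMALLOC check is left out. The nofail flag is
derived from the NOFAIL bit and the entry check is wrapped in an
unlikely()-style hint, as suggested in the comments above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for this model only; not the kernel's. */
#define MODEL_GFP_DIRECT_RECLAIM 0x1u
#define MODEL_GFP_NOFAIL         0x2u

/* The kernel's unlikely() expands to this GCC/Clang builtin. */
#define model_unlikely(x) __builtin_expect(!!(x), 0)

struct model_result {
	bool allocated;  /* did the request get a page? */
	int warnings;    /* how many misuse checks fired */
};

/* Count the misuse checks applied to a nofail request. */
static int nofail_misuse_checks(unsigned int gfp, unsigned int order)
{
	int fired = 0;

	if (order > 1)
		fired++;
	if (!(gfp & MODEL_GFP_DIRECT_RECLAIM))
		fired++;
	return fired;
}

/*
 * succeeds_on: which attempt (0-based) finds memory; a value >= max_tries
 * models a system so depleted that the retry loop gives up.
 */
static struct model_result slowpath_model(unsigned int gfp, unsigned int order,
					  int succeeds_on, int max_tries,
					  bool check_at_entry)
{
	struct model_result res = { false, 0 };
	/* Tests the NOFAIL bit, not the DIRECT_RECLAIM bit. */
	bool nofail = gfp & MODEL_GFP_NOFAIL;
	int attempt;

	assert(max_tries >= 0);

	if (check_at_entry && model_unlikely(nofail))
		res.warnings += nofail_misuse_checks(gfp, order);

	for (attempt = 0; attempt < max_tries; attempt++) {
		if (attempt == succeeds_on) {
			res.allocated = true;
			return res;
		}
	}

	/* Only reached once retrying has been given up. */
	if (nofail) {
		if (!check_at_entry)
			res.warnings += nofail_misuse_checks(gfp, order);
		/* Without direct reclaim the request still returns NULL. */
		if (gfp & MODEL_GFP_DIRECT_RECLAIM)
			res.allocated = true; /* stands in for "retry forever" */
	}
	return res;
}
```

In this model an order-2 nofail request that succeeds on the second
attempt records one warning when check_at_entry is true and none when
the checks sit only in the give-up branch, so the buggy caller is caught
even on a system that never runs out of memory.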

>> +
>>  restart:
>>  	compaction_retries = 0;
>>  	no_progress_loops = 0;
>> @@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
>>  	 * we always retry
>>  	 */
>> -	if (gfp_mask & __GFP_NOFAIL) {
>> +	if (nofail) {
>>  		/*
>> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
>> -		 * of any new users that actually require GFP_NOWAIT
>> +		 * Lacking direct_reclaim we can't do anything to reclaim memory,
>> +		 * we disregard these unreasonable nofail requests and still
>> +		 * return NULL
>>  		 */
>> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>> +		if (!can_direct_reclaim)
>>  			goto fail;
>>  
>> -		/*
>> -		 * PF_MEMALLOC request from this context is rather bizarre
>> -		 * because we cannot reclaim anything and only can loop waiting
>> -		 * for somebody to do a work for us
>> -		 */
>> -		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
>> -
>> -		/*
>> -		 * non failing costly orders are a hard requirement which we
>> -		 * are not prepared for much so let's warn about these users
>> -		 * so that we can identify them and convert them to something
>> -		 * else.
>> -		 */
>> -		WARN_ON_ONCE_GFP(costly_order, gfp_mask);
>> -
>>  		/*
>>  		 * Help non-failing allocations by giving some access to memory
>>  		 * reserves normally used for high priority non-blocking
>> 
>> > >
>> > > --
>> > > Michal Hocko
>> > > SUSE Labs
>> 
>> Thanks
>> Barry
> 




Thread overview: 101+ messages
2024-08-17  6:24 [PATCH v3 0/4] mm: clarify nofail memory allocation Barry Song
2024-08-17  6:24 ` [PATCH v3 1/4] vduse: avoid using __GFP_NOFAIL Barry Song
2024-08-17  6:24 ` [PATCH v3 2/4] mm: document __GFP_NOFAIL must be blockable Barry Song
2024-08-17  6:24 ` [PATCH v3 3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails Barry Song
2024-08-19  9:43   ` David Hildenbrand
2024-08-19  9:47     ` Barry Song
2024-08-19  9:55       ` David Hildenbrand
2024-08-19 10:02         ` Barry Song
2024-08-19 12:33           ` David Hildenbrand
2024-08-19 12:48             ` Barry Song
2024-08-19 12:49               ` David Hildenbrand
2024-08-19 17:12                 ` Michal Hocko
2024-08-19 17:17                   ` Linus Torvalds
2024-08-19 20:24                   ` David Hildenbrand
2024-08-19 20:35                     ` Linus Torvalds
2024-08-19 21:57                       ` David Hildenbrand
2024-08-19 22:13                         ` Linus Torvalds
2024-08-20  6:17                         ` Michal Hocko
2024-08-19 12:49             ` Christoph Hellwig
2024-08-19 12:51               ` David Hildenbrand
2024-08-19 12:53                 ` Christoph Hellwig
2024-08-19 13:14                   ` David Hildenbrand
2024-08-19 13:05                 ` Barry Song
2024-08-19 13:10                   ` David Hildenbrand
2024-08-19 13:19                     ` Barry Song
2024-08-19 13:22                       ` David Hildenbrand
2024-08-17  6:24 ` [PATCH v3 4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL Barry Song
2024-08-18  2:55   ` Yafang Shao
2024-08-18  3:48     ` Barry Song
2024-08-18  5:51       ` Yafang Shao
2024-08-18  6:27         ` Barry Song
2024-08-18  6:45           ` Barry Song
2024-08-18  7:07             ` Yafang Shao
2024-08-18  7:25               ` Barry Song
2024-08-19  7:51               ` Michal Hocko
2024-08-19  7:50     ` Michal Hocko
2024-08-19  9:25       ` Yafang Shao
2024-08-19  9:39         ` Barry Song
2024-08-19  9:45           ` Yafang Shao
2024-08-19 10:10             ` Barry Song
2024-08-19 11:56               ` Yafang Shao
2024-08-19 12:09                 ` Michal Hocko
2024-08-19 12:17                   ` Yafang Shao
2024-08-19 14:01                     ` Michal Hocko
2024-08-19 10:17         ` Michal Hocko
2024-08-19 11:56           ` Yafang Shao
2024-08-19 12:04             ` Michal Hocko
2024-08-19  9:44   ` David Hildenbrand
2024-08-19 10:19     ` Michal Hocko
2024-08-19 12:48       ` David Hildenbrand
2024-08-19 13:02 ` [PATCH v3 0/4] mm: clarify nofail memory allocation David Hildenbrand
2024-08-19 16:05   ` Linus Torvalds
2024-08-19 19:23     ` Barry Song
2024-08-19 19:33       ` Linus Torvalds
2024-08-19 21:48         ` Barry Song
2024-08-20  6:24         ` Michal Hocko
2024-08-21 12:40     ` Yafang Shao
2024-08-21 22:59       ` Linus Torvalds
2024-08-22  6:21         ` Michal Hocko
2024-08-22  6:40           ` Linus Torvalds
2024-08-22  6:56             ` Linus Torvalds
2024-08-22  7:47               ` Michal Hocko
2024-08-22  7:57                 ` Barry Song
2024-08-22  8:24                   ` Michal Hocko
2024-08-22  8:39                     ` David Hildenbrand
2024-08-22  9:08                       ` Linus Torvalds
2024-08-22  9:16                         ` Michal Hocko
2024-08-22  9:24                           ` Linus Torvalds
2024-08-22  9:11                       ` Michal Hocko
2024-08-22  9:18                         ` Linus Torvalds
2024-08-22  9:33                           ` Michal Hocko
2024-08-22  9:44                             ` Linus Torvalds
2024-08-22  9:59                               ` Michal Hocko
2024-08-22 10:30                                 ` Linus Torvalds
2024-08-22 10:46                                   ` Michal Hocko
2024-08-22  9:27                         ` David Hildenbrand
2024-08-22  9:34                           ` Linus Torvalds
2024-08-22  9:43                             ` David Hildenbrand
2024-08-22  9:53                               ` Linus Torvalds
2024-08-22 11:58                                 ` Johannes Weiner
2024-08-26 12:10                             ` Vlastimil Babka
2024-08-27  6:57                               ` Linus Torvalds
2024-08-27  7:15                               ` Barry Song
2024-08-27  7:38                                 ` Vlastimil Babka
2024-08-27  7:50                                   ` Barry Song
2024-08-29 10:24                                     ` Vlastimil Babka
2024-08-29 11:53                                       ` Barry Song
2024-08-29 13:20                                         ` Michal Hocko
2024-08-29 21:27                                           ` Barry Song
2024-08-29 22:31                                             ` Barry Song
2024-08-30  7:24                                               ` Michal Hocko
2024-08-30  7:37                                                 ` Vlastimil Babka
2024-08-22  9:41                           ` Michal Hocko
2024-08-22  9:42                             ` David Hildenbrand
2024-08-22  7:01             ` Gao Xiang
2024-08-22  7:54               ` Michal Hocko
2024-08-22  8:04                 ` Gao Xiang
2024-08-22 14:35                   ` Yafang Shao
2024-08-22 15:02                     ` Gao Xiang
2024-08-22  6:37       ` Barry Song
2024-08-22 14:22         ` Yafang Shao
