* [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS @ 2025-10-31 6:13 libaokun 2025-10-31 7:25 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: libaokun @ 2025-10-31 6:13 UTC (permalink / raw) To: linux-mm Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, willy, jack, yi.zhang, yangerkun, libaokun1 From: Baokun Li <libaokun1@huawei.com> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata reads at critical points, since they cannot afford to go read-only, shut down, or enter an inconsistent state due to memory pressure. Currently, attempting to allocate page units greater than order-1 with the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) can easily require allocations larger than order-1. As Matthew noted, if we have a filesystem with 64KiB sectors, there will be many clean folios in the page cache that are 64KiB or larger. Therefore, to avoid the warning when LBS is enabled, we relax this restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current maximum supported logical block size is 64KiB, meaning the maximum order handled here is 4. Suggested-by: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/all/aQPX1-XWQjKaMTZB@casper.infradead.org Signed-off-by: Baokun Li <libaokun1@huawei.com> --- mm/page_alloc.c | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fb91c566327c..913b9baa24b4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4663,6 +4663,25 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac) return false; } +/* + * We most definitely don't want callers attempting to + * allocate greater than order-1 page units with __GFP_NOFAIL. + * + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with + * __GFP_NOFAIL should always be supported. + */ +static inline void check_nofail_max_order(unsigned int order) +{ + unsigned int max_order = 1; + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (PAGE_SIZE << 1 < SZ_64K) + max_order = get_order(SZ_64K); +#endif + + WARN_ON_ONCE(order > max_order); +} + static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) @@ -4683,11 +4702,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int reserve_flags; if (unlikely(nofail)) { - /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE(order > 1); + check_nofail_max_order(order); /* * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, * otherwise, we may result in lockup. -- 2.46.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
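A note on the order arithmetic in the patch above: get_order(SZ_64K) depends on PAGE_SIZE, so the relaxed limit only applies where an order-1 page is still smaller than 64KiB. A minimal standalone sketch of that arithmetic (order_for() is a hypothetical stand-in for the kernel's get_order(); it is not part of the patch):

#include <stdio.h>

/* Hypothetical stand-in for the kernel's get_order(): the smallest
 * order such that (page_size << order) covers size. */
static unsigned int order_for(unsigned long size, unsigned long page_size)
{
        unsigned int order = 0;

        while ((page_size << order) < size)
                order++;
        return order;
}

int main(void)
{
        const unsigned long sz_64k = 64 * 1024;
        unsigned long ps;

        /* 4 KiB pages -> order 4, 16 KiB pages -> order 2; with 64 KiB
         * pages PAGE_SIZE << 1 >= SZ_64K, so the patch keeps the old
         * order-1 limit there. */
        for (ps = 4096; ps <= sz_64k; ps <<= 2)
                printf("page size %6lu -> order %u\n", ps, order_for(sz_64k, ps));
        return 0;
}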
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 6:13 [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS libaokun @ 2025-10-31 7:25 ` Michal Hocko 2025-10-31 10:12 ` Vlastimil Babka 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-10-31 7:25 UTC (permalink / raw) To: libaokun Cc: linux-mm, akpm, vbabka, surenb, jackmanb, hannes, ziy, willy, jack, yi.zhang, yangerkun, libaokun1 On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > From: Baokun Li <libaokun1@huawei.com> > > Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > reads at critical points, since they cannot afford to go read-only, > shut down, or enter an inconsistent state due to memory pressure. > > Currently, attempting to allocate page units greater than order-1 with > the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > can easily require allocations larger than order-1. > > As Matthew noted, if we have a filesystem with 64KiB sectors, there will > be many clean folios in the page cache that are 64KiB or larger. > > Therefore, to avoid the warning when LBS is enabled, we relax this > restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > maximum supported logical block size is 64KiB, meaning the maximum order > handled here is 4. Would using kvmalloc be an option instead of this? This change doesn't really make much sense to me TBH. While the order=1 is rather arbitrary it is an internal allocator constraint - i.e. the order which the allocator can sustain for NOFAIL requests is directly related to memory reclaim and internal allocator operation rather than something as external as block size. If the allocator needs to support 64kB NOFAIL requests because there is a strong demand for that then fine and we can see whether this is feasible. Please keep in mind that a 64kB request is above PAGE_ALLOC_COSTLY_ORDER and that is where page allocator behavior changes considerably (e.g. the oom killer is not invoked, so the allocation could stall forever). So it is not as simple as saying this is just going to work fine. > Suggested-by: Matthew Wilcox <willy@infradead.org> > Link: https://lore.kernel.org/all/aQPX1-XWQjKaMTZB@casper.infradead.org > Signed-off-by: Baokun Li <libaokun1@huawei.com> > --- > mm/page_alloc.c | 25 ++++++++++++++++++++----- > 1 file changed, 20 insertions(+), 5 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index fb91c566327c..913b9baa24b4 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4663,6 +4663,25 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac) > return false; > } > > +/* > + * We most definitely don't want callers attempting to > + * allocate greater than order-1 page units with __GFP_NOFAIL. > + * > + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with > + * __GFP_NOFAIL should always be supported.
> + */ > +static inline void check_nofail_max_order(unsigned int order) > +{ > + unsigned int max_order = 1; > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + if (PAGE_SIZE << 1 < SZ_64K) > + max_order = get_order(SZ_64K); > +#endif > + > + WARN_ON_ONCE(order > max_order); > +} > + > static inline struct page * > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > struct alloc_context *ac) > @@ -4683,11 +4702,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > int reserve_flags; > > if (unlikely(nofail)) { > - /* > - * We most definitely don't want callers attempting to > - * allocate greater than order-1 page units with __GFP_NOFAIL. > - */ > - WARN_ON_ONCE(order > 1); > + check_nofail_max_order(order); > /* > * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, > * otherwise, we may result in lockup. > -- > 2.46.1 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
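For context on the PAGE_ALLOC_COSTLY_ORDER caveat raised above, here is a heavily condensed sketch of the two places where the allocator special-cases costly orders (order > 3); this is abbreviated from mm/page_alloc.c and the exact checks vary by kernel version:

/* __alloc_pages_may_oom(), simplified: the OOM killer is never
 * invoked for costly orders, so a costly NOFAIL request can only
 * loop around reclaim/compaction waiting for a large enough page. */
if (order > PAGE_ALLOC_COSTLY_ORDER)    /* PAGE_ALLOC_COSTLY_ORDER == 3 */
        goto out;

/* __alloc_pages_slowpath(), simplified: costly requests also stop
 * retrying reclaim early unless the caller opted in; __GFP_NOFAIL
 * requests keep looping instead of failing. */
if (costly_order && (!can_compact || !(gfp_mask & __GFP_RETRY_MAYFAIL)))
        goto nopage;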
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 7:25 ` Michal Hocko @ 2025-10-31 10:12 ` Vlastimil Babka 2025-10-31 14:26 ` Matthew Wilcox 0 siblings, 1 reply; 20+ messages in thread From: Vlastimil Babka @ 2025-10-31 10:12 UTC (permalink / raw) To: Michal Hocko, libaokun Cc: linux-mm, akpm, surenb, jackmanb, hannes, ziy, willy, jack, yi.zhang, yangerkun, libaokun1 On 10/31/25 08:25, Michal Hocko wrote: > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: >> From: Baokun Li <libaokun1@huawei.com> >> >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata >> reads at critical points, since they cannot afford to go read-only, >> shut down, or enter an inconsistent state due to memory pressure. >> >> Currently, attempting to allocate page units greater than order-1 with >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) >> can easily require allocations larger than order-1. >> >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will >> be many clean folios in the page cache that are 64KiB or larger. >> >> Therefore, to avoid the warning when LBS is enabled, we relax this >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current >> maximum supported logical block size is 64KiB, meaning the maximum order >> handled here is 4. > > Would be using kvmalloc an option instead of this? The thread under Link: suggests xfs has its own vmalloc callback. But it's not one of the 5 options listed, so it's a good question how difficult it would be to implement that for ext4 or in general. > This change doesn't really make much sense to me TBH. While the order=1 > is rather arbitrary it is an internal allocator constrain - i.e. order which > the allocator can sustain for NOFAIL requests is directly related to > memory reclaim and internal allocator operation rather than something as > external as block size. If the allocator needs to support 64kB NOFAIL > requests because there is a strong demand for that then fine and we can > see whether this is feasible. > > Please keep in mind that 64kb > PAGE_ALLOC_COSTLY_ORDER and that is > where page allocator behavior changes considerably (e.g. oom killer is > not invoked so the allocation could stall for ever). So it is not as > simple as say this just going to work fine. True. >> Suggested-by: Matthew Wilcox <willy@infradead.org> >> Link: https://lore.kernel.org/all/aQPX1-XWQjKaMTZB@casper.infradead.org >> Signed-off-by: Baokun Li <libaokun1@huawei.com> >> --- >> mm/page_alloc.c | 25 ++++++++++++++++++++----- >> 1 file changed, 20 insertions(+), 5 deletions(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index fb91c566327c..913b9baa24b4 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -4663,6 +4663,25 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac) >> return false; >> } >> >> +/* >> + * We most definitely don't want callers attempting to >> + * allocate greater than order-1 page units with __GFP_NOFAIL. >> + * >> + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with >> + * __GFP_NOFAIL should always be supported. >> + */ >> +static inline void check_nofail_max_order(unsigned int order) >> +{ >> + unsigned int max_order = 1; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE This is a bit confusing to me since we are talking about block size.
Are filesystems with these large block sizes only possible to mount with a kernel with THPs? >> + if (PAGE_SIZE << 1 < SZ_64K) >> + max_order = get_order(SZ_64K); >> +#endif >> + >> + WARN_ON_ONCE(order > max_order); >> +} >> + >> static inline struct page * >> __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >> struct alloc_context *ac) >> @@ -4683,11 +4702,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >> int reserve_flags; >> >> if (unlikely(nofail)) { >> - /* >> - * We most definitely don't want callers attempting to >> - * allocate greater than order-1 page units with __GFP_NOFAIL. >> - */ >> - WARN_ON_ONCE(order > 1); >> + check_nofail_max_order(order); >> /* >> * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, >> * otherwise, we may result in lockup. >> -- >> 2.46.1 > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 10:12 ` Vlastimil Babka @ 2025-10-31 14:26 ` Matthew Wilcox 2025-10-31 15:35 ` Shakeel Butt 0 siblings, 1 reply; 20+ messages in thread From: Matthew Wilcox @ 2025-10-31 14:26 UTC (permalink / raw) To: Vlastimil Babka Cc: Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote: > On 10/31/25 08:25, Michal Hocko wrote: > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > >> From: Baokun Li <libaokun1@huawei.com> > >> > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > >> reads at critical points, since they cannot afford to go read-only, > >> shut down, or enter an inconsistent state due to memory pressure. > >> > >> Currently, attempting to allocate page units greater than order-1 with > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > >> can easily require allocations larger than order-1. > >> > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will > >> be many clean folios in the page cache that are 64KiB or larger. > >> > >> Therefore, to avoid the warning when LBS is enabled, we relax this > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > >> maximum supported logical block size is 64KiB, meaning the maximum order > >> handled here is 4. > > > > Would be using kvmalloc an option instead of this? > > The thread under Link: suggests xfs has its own vmalloc callback. But it's > not one of the 5 options listed, so it's good question how difficult would > be to implement that for ext4 or in general. It's implicit in options 1-4. Today, the buffer cache is an alias into the page cache. The page cache can only store folios. So to use vmalloc, we either have to make folios discontiguous, stop the buffer cache being an alias into the page cache, or stop ext4 from using the buffer cache. > > This change doesn't really make much sense to me TBH. While the order=1 > > is rather arbitrary it is an internal allocator constrain - i.e. order which > > the allocator can sustain for NOFAIL requests is directly related to > > memory reclaim and internal allocator operation rather than something as > > external as block size. If the allocator needs to support 64kB NOFAIL > > requests because there is a strong demand for that then fine and we can > > see whether this is feasible. Maybe Baokun's explanation for why this is unlikely to be a problem in practice didn't make sense to you. Let me try again, perhaps being more explicit about things which an fs developer would know but an MM person might not realise. Hard drive manufacturers are absolutely gagging to ship drives with a 64KiB sector size. Once they do, the minimum transfer size to/from a device becomes 64KiB. That means the page cache will cache all files (and fs metadata) from that drive in contiguous 64KiB chunks. That means that when reclaim shakes the page cache, it's going to find a lot of order-4 folios to free ... which means that the occasional GFP_NOFAIL order-4 allocation is going to have no trouble finding order-4 pages to satisfy the allocation. Now, the problem is the non-filesystems which may now take advantage of this to write lazy code.
It'd be nice if we had some token that said "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a NOFAIL high-order allocation, you can reclaim one I've already allocated and everything will be fine". But I can't see a way to put that kind of token into our interfaces. > >> +/* > >> + * We most definitely don't want callers attempting to > >> + * allocate greater than order-1 page units with __GFP_NOFAIL. > >> + * > >> + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with > >> + * __GFP_NOFAIL should always be supported. > >> + */ > >> +static inline void check_nofail_max_order(unsigned int order) > >> +{ > >> + unsigned int max_order = 1; > >> + > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > This is a bit confusing to me since we are talking about block size. Are > filesystems with these large block sizes only possible to mount with a > kernel with THPs? For the moment, yes. It's an artefact of how large folio support was originally developed. It's one of those things that's only a problem for weirdoes who compile their own kernels because all distros have turned it on since basically forever. Also some minority architectures don't support it yet. Anyway, fixing this is on the todo list, but it's not a high priority. ^ permalink raw reply [flat|nested] 20+ messages in thread
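For a concrete picture of how the page cache ends up filled with those contiguous 64KiB chunks: an LBS filesystem pins the minimum folio order of each mapping, roughly as sketched below (based on the mapping_set_folio_min_order() helper in recent kernels; exact call sites and helpers differ per filesystem):

/* blocksize = 64 KiB with PAGE_SIZE = 4 KiB  =>  min_order = 4 */
unsigned int min_order = ilog2(blocksize >> PAGE_SHIFT);

/* Every folio the page cache allocates for this mapping is now at
 * least order-4, so file data and metadata populate memory with
 * order-4 (or larger) folios that reclaim can later free. */
mapping_set_folio_min_order(inode->i_mapping, min_order);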
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 14:26 ` Matthew Wilcox @ 2025-10-31 15:35 ` Shakeel Butt 2025-10-31 15:52 ` Shakeel Butt 0 siblings, 1 reply; 20+ messages in thread From: Shakeel Butt @ 2025-10-31 15:35 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote: > > On 10/31/25 08:25, Michal Hocko wrote: > > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > > >> From: Baokun Li <libaokun1@huawei.com> > > >> > > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > > >> reads at critical points, since they cannot afford to go read-only, > > >> shut down, or enter an inconsistent state due to memory pressure. > > >> > > >> Currently, attempting to allocate page units greater than order-1 with > > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > > >> can easily require allocations larger than order-1. > > >> > > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will > > >> be many clean folios in the page cache that are 64KiB or larger. > > >> > > >> Therefore, to avoid the warning when LBS is enabled, we relax this > > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > > >> maximum supported logical block size is 64KiB, meaning the maximum order > > >> handled here is 4. > > > > > > Would be using kvmalloc an option instead of this? > > > > The thread under Link: suggests xfs has its own vmalloc callback. But it's > > not one of the 5 options listed, so it's good question how difficult would > > be to implement that for ext4 or in general. > > It's implicit in options 1-4. Today, the buffer cache is an alias into > the page cache. The page cache can only store folios. So to use > vmalloc, we either have to make folios discontiguous, stop the buffer > cache being an alias into the page cache, or stop ext4 from using the > buffer cache. > > > > This change doesn't really make much sense to me TBH. While the order=1 > > > is rather arbitrary it is an internal allocator constrain - i.e. order which > > > the allocator can sustain for NOFAIL requests is directly related to > > > memory reclaim and internal allocator operation rather than something as > > > external as block size. If the allocator needs to support 64kB NOFAIL > > > requests because there is a strong demand for that then fine and we can > > > see whether this is feasible. > > Maybe Baokun's explanation for why this is unlikel to be a problem in > practice didn't make sense to you. Let me try again, perhaps being more > explicit about things which an fs developer would know but an MM person > might not realise. > > Hard drive manufacturers are absolutely gagging to ship drives with a > 64KiB sector size. Once they do, the minimum transfer size to/from a > device becomes 64KiB. That means the page cache will cache all files > (and fs metadata) from that drive in contiguous 64KiB chunks. That means > that when reclaim shakes the page cache, it's going to find a lot of > order-4 folios to free ... which means that the occasional GFP_NOFAIL > order-4 allocation is going to have no trouble finding order-4 pages to > satisfy the allocation. 
> > Now, the problem is the non-filesystems which may now take advantage of > this to write lazy code. It'd be nice if we had some token that said > "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a > NOFAIL high-order allocation, you can reclaim one I've already allocated > and everything will be fine". But I can't see a way to put that kind > of token into our interfaces. A new gfp flag should be easy enough. However "you can reclaim one I've already allocated" is not something current allocation & reclaim can take any action on. Maybe that is something we can add. In addition the behavior change of costly order needs more thought. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 15:35 ` Shakeel Butt @ 2025-10-31 15:52 ` Shakeel Butt 2025-10-31 15:54 ` Matthew Wilcox 0 siblings, 1 reply; 20+ messages in thread From: Shakeel Butt @ 2025-10-31 15:52 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 08:35:50AM -0700, Shakeel Butt wrote: > On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote: > > On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote: > > > On 10/31/25 08:25, Michal Hocko wrote: > > > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > > > >> From: Baokun Li <libaokun1@huawei.com> > > > >> > > > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > > > >> reads at critical points, since they cannot afford to go read-only, > > > >> shut down, or enter an inconsistent state due to memory pressure. > > > >> > > > >> Currently, attempting to allocate page units greater than order-1 with > > > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > > > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > > > >> can easily require allocations larger than order-1. > > > >> > > > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will > > > >> be many clean folios in the page cache that are 64KiB or larger. > > > >> > > > >> Therefore, to avoid the warning when LBS is enabled, we relax this > > > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > > > >> maximum supported logical block size is 64KiB, meaning the maximum order > > > >> handled here is 4. > > > > > > > > Would be using kvmalloc an option instead of this? > > > > > > The thread under Link: suggests xfs has its own vmalloc callback. But it's > > > not one of the 5 options listed, so it's good question how difficult would > > > be to implement that for ext4 or in general. > > > > It's implicit in options 1-4. Today, the buffer cache is an alias into > > the page cache. The page cache can only store folios. So to use > > vmalloc, we either have to make folios discontiguous, stop the buffer > > cache being an alias into the page cache, or stop ext4 from using the > > buffer cache. > > > > > > This change doesn't really make much sense to me TBH. While the order=1 > > > > is rather arbitrary it is an internal allocator constrain - i.e. order which > > > > the allocator can sustain for NOFAIL requests is directly related to > > > > memory reclaim and internal allocator operation rather than something as > > > > external as block size. If the allocator needs to support 64kB NOFAIL > > > > requests because there is a strong demand for that then fine and we can > > > > see whether this is feasible. > > > > Maybe Baokun's explanation for why this is unlikel to be a problem in > > practice didn't make sense to you. Let me try again, perhaps being more > > explicit about things which an fs developer would know but an MM person > > might not realise. > > > > Hard drive manufacturers are absolutely gagging to ship drives with a > > 64KiB sector size. Once they do, the minimum transfer size to/from a > > device becomes 64KiB. That means the page cache will cache all files > > (and fs metadata) from that drive in contiguous 64KiB chunks. 
That means > > that when reclaim shakes the page cache, it's going to find a lot of > > order-4 folios to free ... which means that the occasional GFP_NOFAIL > > order-4 allocation is going to have no trouble finding order-4 pages to > > satisfy the allocation. > > > > Now, the problem is the non-filesystems which may now take advantage of > > this to write lazy code. It'd be nice if we had some token that said > > "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a > > NOFAIL high-order allocation, you can reclaim one I've already allocated > > and everything will be fine". But I can't see a way to put that kind > > of token into our interfaces. > > A new gfp flag should be easy enough. However "you can reclaim one I've > already allocated" is not something current allocation & reclaim can > take any action on. Maybe that is something we can add. In addition the > behavior change of costly order needs more thought. > After reading the background link, it seems like the actual allocation will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim cannot really reclaim any file memory (page cache). However, I wonder: with the writeback gone from the reclaim path, should we allow reclaiming clean file pages even for NOFS context (need some digging). ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 15:52 ` Shakeel Butt @ 2025-10-31 15:54 ` Matthew Wilcox 2025-10-31 16:46 ` Shakeel Butt 0 siblings, 1 reply; 20+ messages in thread From: Matthew Wilcox @ 2025-10-31 15:54 UTC (permalink / raw) To: Shakeel Butt Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 08:52:49AM -0700, Shakeel Butt wrote: > After reading the background link, it seems like the actual allocation > will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim can not > really reclaim any file memory (page cache). However I wonder with the > writeback gone from reclaim path, should we allow reclaiming clean file > pages even for NOFS context (need some digging). I thought that was true yesterday morning, but I read the code and it isn't. Look at where may_enter_fs() is called from. It's only for folios which are either marked dirty or writeback. ^ permalink raw reply [flat|nested] 20+ messages in thread
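A heavily simplified sketch of the logic being referred to, condensed from mm/vmscan.c (the real code has more cases and varies by version): may_enter_fs() only gates the writeback and dirty paths, so clean file folios remain reclaimable from GFP_NOFS context:

static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
{
        if (gfp_mask & __GFP_FS)
                return true;
        /* swap-backed folios in the swap cache only need __GFP_IO */
        return folio_test_swapcache(folio) && (gfp_mask & __GFP_IO);
}

/* shrink_folio_list(), simplified: */
if (folio_test_writeback(folio) && !may_enter_fs(folio, sc->gfp_mask))
        goto activate_locked;   /* defer this folio, don't reclaim now */
if (folio_test_dirty(folio) && !may_enter_fs(folio, sc->gfp_mask))
        goto keep_locked;       /* would have to call into the fs */
/* clean folios fall through and can be freed regardless of __GFP_FS */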
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 15:54 ` Matthew Wilcox @ 2025-10-31 16:46 ` Shakeel Butt 2025-10-31 16:55 ` Matthew Wilcox 0 siblings, 1 reply; 20+ messages in thread From: Shakeel Butt @ 2025-10-31 16:46 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 03:54:56PM +0000, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 08:52:49AM -0700, Shakeel Butt wrote: > > After reading the background link, it seems like the actual allocation > > will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim can not > > really reclaim any file memory (page cache). However I wonder with the > > writeback gone from reclaim path, should we allow reclaiming clean file > > pages even for NOFS context (need some digging). > > I thought that was true yesterday morning, but I read the code and it isn't. > > Look at where may_enter_fs() is called from. It's only for folios which > are either marked dirty or writeback. > Indeed you are right. Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new (FS specific) gfp is fine but will require some maintenance to avoid abuse. I am more interested in how to codify "you can reclaim one I've already allocated". I have a different scenario where the network stack keeps stealing memory from direct reclaimers and keeping them in reclaim for a long time. If we have some mechanism to allow reclaimers to get the memory they have reclaimed (at least for some cases), I think that can be used in both cases. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 16:46 ` Shakeel Butt @ 2025-10-31 16:55 ` Matthew Wilcox 2025-11-03 2:45 ` Baokun Li 2025-11-03 7:55 ` Michal Hocko 0 siblings, 2 replies; 20+ messages in thread From: Matthew Wilcox @ 2025-10-31 16:55 UTC (permalink / raw) To: Shakeel Butt Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new > (FS specific) gfp is fine but will require some maintenance to avoid > abuse. I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just doesn't feel right. > I am more interested in how to codify "you can reclaim one I've already > allocated". I have a different scenario where network stack keep > stealing memory from direct reclaimers and keeping them in reclaim for > long time. If we have some mechanism to allow reclaimers to get the > memory they have reclaimed (at least for some cases), I think that can > be used in both cases. The only thing that comes to mind is putting pages freed by reclaim on a list in task_struct instead of sending them back to the allocator. Then the task can allocate from there and free up anything else it's reclaimed at some later point. I don't think this is a good idea, but it's the only idea that comes to mind. ^ permalink raw reply [flat|nested] 20+ messages in thread
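For concreteness, the per-task stash being described might take roughly the following shape; every name here (reclaimed_pages, stash_instead_of_free(), task_take_reclaimed()) is hypothetical, a sketch of the idea rather than existing kernel code:

/* Hypothetical new field in struct task_struct: */
struct list_head reclaimed_pages;       /* pages this task's reclaim freed */

/* Hypothetical hook where direct reclaim would normally free a page:
 * keep it for ourselves instead of returning it to the buddy lists,
 * where another task could grab it first. */
static void stash_instead_of_free(struct page *page, unsigned int order)
{
        set_page_private(page, order);
        list_add(&page->lru, &current->reclaimed_pages);
}

/* Hypothetical allocation path: take a matching page from the stash;
 * anything left over is released back to the allocator later. */
static struct page *task_take_reclaimed(unsigned int order)
{
        struct page *page;

        list_for_each_entry(page, &current->reclaimed_pages, lru) {
                if (page_private(page) == order) {
                        list_del(&page->lru);
                        return page;
                }
        }
        return NULL;
}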
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 16:55 ` Matthew Wilcox @ 2025-11-03 2:45 ` Baokun Li 2025-11-03 7:55 ` Michal Hocko 1 sibling, 0 replies; 20+ messages in thread From: Baokun Li @ 2025-11-03 2:45 UTC (permalink / raw) To: Matthew Wilcox, Shakeel Butt, Michal Hocko, Vlastimil Babka Cc: linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1, Baokun Li Hi, Sorry for the late reply, I was traveling out of the office. Thanks to Matthew for helping with the explanation. On 2025-11-01 00:55, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: >> Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new >> (FS specific) gfp is fine but will require some maintenance to avoid >> abuse. > I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just > doesn't feel right. Agreed. If we add a new GFP flag, it will be hard to prevent its misuse, and __GFP_NOFAIL is a good example of this; we simply have no way to monitor all callers. >> I am more interested in how to codify "you can reclaim one I've already >> allocated". I have a different scenario where network stack keep >> stealing memory from direct reclaimers and keeping them in reclaim for >> long time. If we have some mechanism to allow reclaimers to get the >> memory they have reclaimed (at least for some cases), I think that can >> be used in both cases. > The only thing that comes to mind is putting pages freed by reclaim on > a list in task_struct instead of sending them back to the allocator. > Then the task can allocate from there and free up anything else it's > reclaimed at some later point. I don't think this is a good idea, > but it's the only idea that comes to mind. I am not familiar with the MM module, but here is a rough thought: What about passing the minimum folio order to the allocator? We could then ensure that pages are freed and allocated using the list matching that specific order. Thanks, Baokun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 16:55 ` Matthew Wilcox 2025-11-03 2:45 ` Baokun Li @ 2025-11-03 7:55 ` Michal Hocko 2025-11-03 9:01 ` Vlastimil Babka 1 sibling, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-03 7:55 UTC (permalink / raw) To: Matthew Wilcox Cc: Shakeel Butt, Vlastimil Babka, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri 31-10-25 16:55:44, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: > > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new > > (FS specific) gfp is fine but will require some maintenance to avoid > > abuse. > > I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just > doesn't feel right. Yeah, as usual a new gfp flag seems convenient except history has taught us this rarely works. > > I am more interested in how to codify "you can reclaim one I've already > > allocated". I have a different scenario where network stack keep > > stealing memory from direct reclaimers and keeping them in reclaim for > > long time. If we have some mechanism to allow reclaimers to get the > > memory they have reclaimed (at least for some cases), I think that can > > be used in both cases. > > The only thing that comes to mind is putting pages freed by reclaim on > a list in task_struct instead of sending them back to the allocator. > Then the task can allocate from there and free up anything else it's > reclaimed at some later point. I don't think this is a good idea, > but it's the only idea that comes to mind. I played with that idea years ago. Mostly to deal with direct reclaim unfairness when some reclaimers were doing a lot of work on behalf of everybody else. IIRC I ran into different problems, like reclaim throttling and over-reclaim. Anyway, the page allocator does respect GFP_NOFAIL even for high order requests. The oom killer will be disabled for order-4 but as these will likely be GFP_NOFS anyway, the order doesn't make much of a difference. So these requests could really take a long time to succeed but I guess this will be generally understood. As the vmalloc fallback doesn't seem to be a feasible option short (maybe even mid) term, this is the only choice we have other than failing allocations and seeing a lot of fs failures. That being said I would much rather go and drop the order warning than trying to invent some fine tuning based on usecase. We might need to invent some OOM protection for order-3 nofail requests as the OOM killer could do too much harm, killing tasks without much chance of defragmenting memory. Let's deal with that once we see that happening. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 7:55 ` Michal Hocko @ 2025-11-03 9:01 ` Vlastimil Babka 2025-11-03 9:25 ` Michal Hocko 2025-11-03 18:53 ` Shakeel Butt 0 siblings, 2 replies; 20+ messages in thread From: Vlastimil Babka @ 2025-11-03 9:01 UTC (permalink / raw) To: Michal Hocko, Matthew Wilcox Cc: Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 11/3/25 08:55, Michal Hocko wrote: > On Fri 31-10-25 16:55:44, Matthew Wilcox wrote: >> On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: >> > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new >> > (FS specific) gfp is fine but will require some maintenance to avoid >> > abuse. >> >> I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just >> doesn't feel right. > > Yeah, as usual a new gfp flag seems convenient except history has taught > us this rarely works. > >> > I am more interested in how to codify "you can reclaim one I've already >> > allocated". I have a different scenario where network stack keep >> > stealing memory from direct reclaimers and keeping them in reclaim for >> > long time. If we have some mechanism to allow reclaimers to get the >> > memory they have reclaimed (at least for some cases), I think that can >> > be used in both cases. >> >> The only thing that comes to mind is putting pages freed by reclaim on >> a list in task_struct instead of sending them back to the allocator. >> Then the task can allocate from there and free up anything else it's >> reclaimed at some later point. I don't think this is a good idea, >> but it's the only idea that comes to mind. > > I have played with that idea years ago. Mostly to deal with direct > reclaim unfairness when some reclaimers were doing a lot of work on > behalf of everybody else. IIRC I have hit into different problems, like > reclaim throttling and over-reclaim. Btw, meanwhile we got this implemented in compaction, see compaction_capture(). As the hook is in __free_one_page() it should now be straightforward to arm it also for direct reclaim of e.g. __GFP_NOFAIL costly order allocations. It probably wouldn't make sense for non-costly orders because they are freed to the pcplists and we wouldn't want to make those more expensive by adding the hook there too. It's likely the hook in compaction already helps such allocations. But if you expect the order-4 pages reclaim to be common thanks to the large blocks, it could maybe help if capture was done in reclaim too. > Anyway, page allocator does respect GFP_NOFAIL even for high order > requests. The oom killer will be disabled for order-4 but as these will > likely be GFP_NOFS anyway then the order doesn't make much of a > difference. So these requests could really take long time to succeed but > I guess this will be generally understood. As the vmalloc fallback > doesn't seem to be a feasible option short (maybe even mid) term then > this is the only choice we have other than failing allocations and > seeing a lot of fs failures. > > That being said I would much rather go and drop the order warning than > trying to invent some fine tuning based on usecase. We might need to Agreed. Note it would also solve the warnings we saw syzbot etc trigger via slab by allocating a <8k object with __GFP_NOFAIL. 
This would normally pass the __GFP_NOFAIL only to the fallback minimum size (order-1) slab allocation and thus be fine, but can result in an order>1 allocation if you enable KASAN or another debugging option that bumps the space needed for a <8k object above 8k with the debug metadata. Maybe we could keep the warning for >=PMD_ORDER as that would still mean someone made an error? > invent some OOM protection for order-3 nofail requests as OOM killer > could just make too much harm killing tasks without much of chance to > defragment memory. Let's deal with that once we see that happening. ^ permalink raw reply [flat|nested] 20+ messages in thread
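A simplified sketch of the compaction_capture() mechanism mentioned above, condensed from mm/page_alloc.c (fields and suitability checks abbreviated): the allocating task arms a capture_control, and the hook in __free_one_page() hands a matching freed page straight to it instead of merging it back into the buddy lists:

static inline bool compaction_capture(struct capture_control *capc,
                                      struct page *page, unsigned int order,
                                      int migratetype)
{
        if (!capc || order != capc->cc->order)
                return false;
        /* (migratetype and zone suitability checks omitted) */
        capc->page = page;      /* hand the page to the waiting allocator */
        return true;
}

/* __free_one_page(), simplified: */
if (compaction_capture(capc, page, order, migratetype))
        return;                 /* captured: skip buddy merging entirely */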
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 9:01 ` Vlastimil Babka @ 2025-11-03 9:25 ` Michal Hocko 2025-11-04 10:31 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-03 9:25 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: > Maybe we could keep the warning for >=PMD_ORDER as that would still mean > someone made an error? I am not sure TBH. For those large requests (anything that is costly order) it is essentially a loop around the allocator inside the allocator. I would be much more worried about order-3 which still triggers the oom killer and could kill half of the system without much progress. For order-2 you at least have the task_struct, which spans 2 pages, but I do not think we have any guaranteed order-3 page for each task to guarantee progress when killing those. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 9:25 ` Michal Hocko @ 2025-11-04 10:31 ` Michal Hocko 2025-11-04 12:32 ` Vlastimil Babka 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-04 10:31 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Mon 03-11-25 10:25:40, Michal Hocko wrote: > On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: > > Maybe we could keep the warning for >=PMD_ORDER as that would still mean > > someone made an error? > > I am not sure TBH. For those large requests (anything that is costly > order) it is essentially a loop around allocator inside the allocator. > I would be really much more worried about order-3 which still triggers > the oom killer and could kill half of the system without much progress. > For oder-2 you at least have task_struct which spans 2 pages but I do > not think we have any guaranteed order-3 page for each task to guarantee > anything when killing those. Essentially something like this diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 25923cfec9c6..2df477d97cee 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1142,6 +1142,14 @@ bool out_of_memory(struct oom_control *oc) if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) return true; + /* + * unlike for other !costly requests killing a task is not + * really guaranteed to free any order-3 pages. Warn about + * that to see whether that happens often enough to special + * case. + */ + WARN_ON(oc->order == 3 && (oc->gfp_mask & __GFP_NOFAIL)); + /* * Check if there were limitations on the allocation (only relevant for * NUMA and memcg) that may require different handling. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d1d037f97c5f..ca8795156b14 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3993,6 +3993,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* Coredumps can quickly deplete all memory reserves */ if (current->flags & PF_DUMPCORE) goto out; + /* The OOM killer will not help higher order allocs */ if (order > PAGE_ALLOC_COSTLY_ORDER) goto out; @@ -4612,11 +4613,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int reserve_flags; if (unlikely(nofail)) { - /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE(order > 1); /* * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, * otherwise, we may result in lockup. -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 10:31 ` Michal Hocko @ 2025-11-04 12:32 ` Vlastimil Babka 2025-11-04 12:50 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: Vlastimil Babka @ 2025-11-04 12:32 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 11/4/25 11:31 AM, Michal Hocko wrote: > On Mon 03-11-25 10:25:40, Michal Hocko wrote: >> On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: >>> Maybe we could keep the warning for >=PMD_ORDER as that would still mean >>> someone made an error? >> >> I am not sure TBH. For those large requests (anything that is costly >> order) it is essentially a loop around allocator inside the allocator. >> I would be really much more worried about order-3 which still triggers >> the oom killer and could kill half of the system without much progress. >> For oder-2 you at least have task_struct which spans 2 pages but I do >> not think we have any guaranteed order-3 page for each task to guarantee >> anything when killing those. > > Essentially something like this > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 25923cfec9c6..2df477d97cee 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1142,6 +1142,14 @@ bool out_of_memory(struct oom_control *oc) > if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) > return true; > > + /* > + * unlike for other !costly requests killing a task is not > + * really guaranteed to free any order-3 pages. Warn about > + * that to see whether that happens often enough to special > + * case. > + */ > + WARN_ON(oc->order == 3 && (oc->gfp_mask & __GFP_NOFAIL)); OK, it might not create an order-3 page immediately. But I'd expect it allows compaction to make progress thanks to making more free memory available? We do retry reclaim/compaction after OOM killing one process, and don't just kill until we succeed allocating, right? > + > /* > * Check if there were limitations on the allocation (only relevant for > * NUMA and memcg) that may require different handling. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d1d037f97c5f..ca8795156b14 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3993,6 +3993,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > /* Coredumps can quickly deplete all memory reserves */ > if (current->flags & PF_DUMPCORE) > goto out; > + > /* The OOM killer will not help higher order allocs */ > if (order > PAGE_ALLOC_COSTLY_ORDER) > goto out; > @@ -4612,11 +4613,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > int reserve_flags; > > if (unlikely(nofail)) { > - /* > - * We most definitely don't want callers attempting to > - * allocate greater than order-1 page units with __GFP_NOFAIL. > - */ > - WARN_ON_ONCE(order > 1); > /* > * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, > * otherwise, we may result in lockup. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 12:32 ` Vlastimil Babka @ 2025-11-04 12:50 ` Michal Hocko 2025-11-04 12:57 ` Vlastimil Babka 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-04 12:50 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: > On 11/4/25 11:31 AM, Michal Hocko wrote: > > On Mon 03-11-25 10:25:40, Michal Hocko wrote: > >> On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: > >>> Maybe we could keep the warning for >=PMD_ORDER as that would still mean > >>> someone made an error? > >> > >> I am not sure TBH. For those large requests (anything that is costly > >> order) it is essentially a loop around allocator inside the allocator. > >> I would be really much more worried about order-3 which still triggers > >> the oom killer and could kill half of the system without much progress. > >> For oder-2 you at least have task_struct which spans 2 pages but I do > >> not think we have any guaranteed order-3 page for each task to guarantee > >> anything when killing those. > > > > Essentially something like this > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 25923cfec9c6..2df477d97cee 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1142,6 +1142,14 @@ bool out_of_memory(struct oom_control *oc) > > if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) > > return true; > > > > + /* > > + * unlike for other !costly requests killing a task is not > > + * really guaranteed to free any order-3 pages. Warn about > > + * that to see whether that happens often enough to special > > + * case. > > + */ > > + WARN_ON(oc->order == 3 && (oc->gfp_mask & __GFP_NOFAIL)); > > OK, it might not create an order-3 page immediately. But I'd expect it > allows compaction to make progress thanks to making more free memory > available? We do retry reclaim/compaction after OOM killing one process, > and don't just kill until we succeed allocating, right? Yes we do go through the reclaim/compaction cycle. Do you think this warning is overzealous? The idea is that a flood of OOMs could be easier to pinpoint with this in place. It doesn't have to be a full WARN_ON, maybe pr_warn would be sufficient as the backtrace is usually printed in the oom report. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 12:50 ` Michal Hocko @ 2025-11-04 12:57 ` Vlastimil Babka 2025-11-04 16:43 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: Vlastimil Babka @ 2025-11-04 12:57 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 11/4/25 1:50 PM, Michal Hocko wrote: > On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: >> >> OK, it might not create an order-3 page immediately. But I'd expect it >> allows compaction to make progress thanks to making more free memory >> available? We do retry reclaim/compaction after OOM killing one process, >> and don't just kill until we succeed allocating, right? > > Yes we do go through the reclaim/compaction cycle. Do you think this > warning is overzealous? Th idea is that a flood of OOMs could be easier I think it's too odd to warn for one specific order and not for higher orders. We would risk someone making the allocation order-4 instead of order-3 just to avoid it. > to pin point with this in place. It doesn't have to be full WARN_ON, > maybe pr_warn would be sufficient as the backtrace is usually printed > in the oom report. With gfp flags and order also part of the OOM report, I think we could risk just removing the warning completely and seeing if such (flood of) reports ever comes up? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 12:57 ` Vlastimil Babka @ 2025-11-04 16:43 ` Michal Hocko 2025-11-05 6:23 ` Baokun Li 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-04 16:43 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Tue 04-11-25 13:57:35, Vlastimil Babka wrote: > On 11/4/25 1:50 PM, Michal Hocko wrote: > > On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: > >> > >> OK, it might not create an order-3 page immediately. But I'd expect it > >> allows compaction to make progress thanks to making more free memory > >> available? We do retry reclaim/compaction after OOM killing one process, > >> and don't just kill until we succeed allocating, right? > > > > Yes we do go through the reclaim/compaction cycle. Do you think this > > warning is overzealous? Th idea is that a flood of OOMs could be easier > > I think it's too odd to warn for a specific order and not that or higher > orders. We would risk someone would make the allocation order-4 instead > of order-3 just to avoid it. Higher orders simply avoid the OOM killer, so it is effectively a retry loop around the allocator. Order-3 is a bit odd and that is what the warning is trying to point out. But fair enough, let's just drop the existing warning and see how it goes. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 16:43 ` Michal Hocko @ 2025-11-05 6:23 ` Baokun Li 0 siblings, 0 replies; 20+ messages in thread From: Baokun Li @ 2025-11-05 6:23 UTC (permalink / raw) To: Michal Hocko, Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 2025-11-05 00:43, Michal Hocko wrote: > On Tue 04-11-25 13:57:35, Vlastimil Babka wrote: >> On 11/4/25 1:50 PM, Michal Hocko wrote: >>> On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: >>>> OK, it might not create an order-3 page immediately. But I'd expect it >>>> allows compaction to make progress thanks to making more free memory >>>> available? We do retry reclaim/compaction after OOM killing one process, >>>> and don't just kill until we succeed allocating, right? >>> Yes we do go through the reclaim/compaction cycle. Do you think this >>> warning is overzealous? Th idea is that a flood of OOMs could be easier >> I think it's too odd to warn for a specific order and not that or higher >> orders. We would risk someone would make the allocation order-4 instead >> of order-3 just to avoid it. > higher orders simply avoid OOM killer so it is effectivelly retry loop > around the allocator. Order-3 is a bit odd and that is what the warning > is trying to tell. But fair enough let's just drop the existing warning > and see how it goes. > Okay, since most people agree that we should first remove the current warning and then observe what happens before making further decisions, I will send a patch that directly deletes the warning. Afterwards, depending on the situation once the warning is removed, we can consider adding some special handling in other places if needed. Thanks to everyone for the discussion and suggestions! Cheers, Baokun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 9:01 ` Vlastimil Babka 2025-11-03 9:25 ` Michal Hocko @ 2025-11-03 18:53 ` Shakeel Butt 1 sibling, 0 replies; 20+ messages in thread From: Shakeel Butt @ 2025-11-03 18:53 UTC (permalink / raw) To: Vlastimil Babka, Michal Hocko Cc: Matthew Wilcox, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Mon, Nov 03, 2025 at 10:01:54AM +0100, Vlastimil Babka wrote: > On 11/3/25 08:55, Michal Hocko wrote: > > On Fri 31-10-25 16:55:44, Matthew Wilcox wrote: > >> On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: > >> > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new > >> > (FS specific) gfp is fine but will require some maintenance to avoid > >> > abuse. > >> > >> I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just > >> doesn't feel right. > > > > Yeah, as usual a new gfp flag seems convenient except history has taught > > us this rarely works. Point taken, and let me discuss below whether any interface for such allocation requests makes sense or not. > > > >> > I am more interested in how to codify "you can reclaim one I've already > >> > allocated". I have a different scenario where network stack keep > >> > stealing memory from direct reclaimers and keeping them in reclaim for > >> > long time. If we have some mechanism to allow reclaimers to get the > >> > memory they have reclaimed (at least for some cases), I think that can > >> > be used in both cases. > >> > >> The only thing that comes to mind is putting pages freed by reclaim on > >> a list in task_struct instead of sending them back to the allocator. > >> Then the task can allocate from there and free up anything else it's > >> reclaimed at some later point. I don't think this is a good idea, > >> but it's the only idea that comes to mind. > > > > I have played with that idea years ago. Mostly to deal with direct > > reclaim unfairness when some reclaimers were doing a lot of work on > > behalf of everybody else. IIRC I have hit into different problems, like > > reclaim throttling and over-reclaim. > > Btw, meanwhile we got this implemented in compaction, see > compaction_capture(). As the hook is in __free_one_page() it should now be > straightforward to arm it also for direct reclaim of e.g. __GFP_NOFAIL > costly order allocations. It probably wouldn't make sense for non-costly > orders because they are freed to the pcplists and we wouldn't want to make > those more expensive by adding the hook there too. > > It's likely the hook in compaction already helps such allocations. But if > you expect the order-4 pages reclaim to be common thanks to the large > blocks, it could maybe help if capture was done in reclaim too. Thanks for the pointer, I didn't know about this mechanism. I think we can expand the scope of this mechanism to the whole __alloc_pages_slowpath() which calls both reclaim and compaction. Currently the free hook is only triggered when compaction causes a free, but with the larger scope reclaim could trigger it as well. Now there are a couple of open questions: 1. Should we differentiate and prioritize between different allocators? That is, allocators with NOFS+NOFAIL get preference as they might be holding locks and impacting concurrent allocators; or maybe prefer allocators which will release memory in the near future. 2. At the moment, we do expect allocators in the slow path to work for the betterment of the whole system.
So, should we skip this mechanism for the first iteration (or first couple of iterations) of the slowpath and only use it during later iterations? 3. This mechanism still does not capture "reclaim from me" which was Willy's original point. "Reclaim from me" seems more involved, as reclaim in general prefers to reclaim cold memory. In addition there are memcg protections (low/min). So, the reclaim algo/heuristics may decide you are not reclaimable. Not sure if it is still worth trying the "reclaim from me" option. Anyway, at the moment I think if we go with this mechanism, we might really need an explicit interface. We may revisit that in the future if we try to be more fancy. thanks, Shakeel ^ permalink raw reply [flat|nested] 20+ messages in thread
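As a rough illustration of the wider capture scope discussed above: the slowpath would arm a capture_control around direct reclaim, much as compaction does today through current->capture_control. This is a hypothetical sketch of the idea, not existing code:

/* Hypothetical: __alloc_pages_slowpath() arming capture for a costly
 * __GFP_NOFAIL request before entering direct reclaim. */
struct compact_control cc = { .order = order }; /* carries the target order */
struct capture_control capc = { .cc = &cc, .page = NULL };

current->capture_control = &capc;
progress = __perform_reclaim(gfp_mask, order, ac);
current->capture_control = NULL;

if (capc.page)                  /* __free_one_page() handed us a page of */
        return capc.page;       /* the right order while reclaim ran */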