* [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS @ 2025-10-31 6:13 libaokun 2025-10-31 7:25 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: libaokun @ 2025-10-31 6:13 UTC (permalink / raw) To: linux-mm Cc: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy, willy, jack, yi.zhang, yangerkun, libaokun1 From: Baokun Li <libaokun1@huawei.com> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata reads at critical points, since they cannot afford to go read-only, shut down, or enter an inconsistent state due to memory pressure. Currently, attempting to allocate page units greater than order-1 with the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) can easily require allocations larger than order-1. As Matthew noted, if we have a filesystem with 64KiB sectors, there will be many clean folios in the page cache that are 64KiB or larger. Therefore, to avoid the warning when LBS is enabled, we relax this restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current maximum supported logical block size is 64KiB, meaning the maximum order handled here is 4. Suggested-by: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/all/aQPX1-XWQjKaMTZB@casper.infradead.org Signed-off-by: Baokun Li <libaokun1@huawei.com> --- mm/page_alloc.c | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fb91c566327c..913b9baa24b4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4663,6 +4663,25 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac) return false; } +/* + * We most definitely don't want callers attempting to + * allocate greater than order-1 page units with __GFP_NOFAIL. + * + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with + * __GFP_NOFAIL should always be supported. + */ +static inline void check_nofail_max_order(unsigned int order) +{ + unsigned int max_order = 1; + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (PAGE_SIZE << 1 < SZ_64K) + max_order = get_order(SZ_64K); +#endif + + WARN_ON_ONCE(order > max_order); +} + static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) @@ -4683,11 +4702,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int reserve_flags; if (unlikely(nofail)) { - /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE(order > 1); + check_nofail_max_order(order); /* * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, * otherwise, we may result in lockup. -- 2.46.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
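A note on the order arithmetic in the patch above: get_order(SZ_64K) depends on PAGE_SIZE, so the relaxed limit only applies where an order-1 page is still smaller than 64KiB. A minimal standalone sketch of that arithmetic (order_for() is a hypothetical stand-in for the kernel's get_order(); it is not part of the patch):

#include <stdio.h>

/* Hypothetical stand-in for the kernel's get_order(): the smallest
 * order such that (page_size << order) covers size. */
static unsigned int order_for(unsigned long size, unsigned long page_size)
{
        unsigned int order = 0;

        while ((page_size << order) < size)
                order++;
        return order;
}

int main(void)
{
        const unsigned long sz_64k = 64 * 1024;
        unsigned long ps;

        /* 4 KiB pages -> order 4, 16 KiB pages -> order 2; with 64 KiB
         * pages PAGE_SIZE << 1 >= SZ_64K, so the patch keeps the old
         * order-1 limit there. */
        for (ps = 4096; ps <= sz_64k; ps <<= 2)
                printf("page size %6lu -> order %u\n", ps, order_for(sz_64k, ps));
        return 0;
}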
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 6:13 [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS libaokun @ 2025-10-31 7:25 ` Michal Hocko 2025-10-31 10:12 ` Vlastimil Babka 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-10-31 7:25 UTC (permalink / raw) To: libaokun Cc: linux-mm, akpm, vbabka, surenb, jackmanb, hannes, ziy, willy, jack, yi.zhang, yangerkun, libaokun1 On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > From: Baokun Li <libaokun1@huawei.com> > > Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > reads at critical points, since they cannot afford to go read-only, > shut down, or enter an inconsistent state due to memory pressure. > > Currently, attempting to allocate page units greater than order-1 with > the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > can easily require allocations larger than order-1. > > As Matthew noted, if we have a filesystem with 64KiB sectors, there will > be many clean folios in the page cache that are 64KiB or larger. > > Therefore, to avoid the warning when LBS is enabled, we relax this > restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > maximum supported logical block size is 64KiB, meaning the maximum order > handled here is 4. Would using kvmalloc be an option instead of this? This change doesn't really make much sense to me TBH. While the order=1 is rather arbitrary it is an internal allocator constraint - i.e. the order which the allocator can sustain for NOFAIL requests is directly related to memory reclaim and internal allocator operation rather than something as external as block size. If the allocator needs to support 64kB NOFAIL requests because there is a strong demand for that then fine and we can see whether this is feasible. Please keep in mind that a 64kB request is above PAGE_ALLOC_COSTLY_ORDER and that is where page allocator behavior changes considerably (e.g. the oom killer is not invoked, so the allocation could stall forever). So it is not as simple as saying this is just going to work fine. > Suggested-by: Matthew Wilcox <willy@infradead.org> > Link: https://lore.kernel.org/all/aQPX1-XWQjKaMTZB@casper.infradead.org > Signed-off-by: Baokun Li <libaokun1@huawei.com> > --- > mm/page_alloc.c | 25 ++++++++++++++++++++----- > 1 file changed, 20 insertions(+), 5 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index fb91c566327c..913b9baa24b4 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4663,6 +4663,25 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac) > return false; > } > > +/* > + * We most definitely don't want callers attempting to > + * allocate greater than order-1 page units with __GFP_NOFAIL. > + * > + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with > + * __GFP_NOFAIL should always be supported.
> + */ > +static inline void check_nofail_max_order(unsigned int order) > +{ > + unsigned int max_order = 1; > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + if (PAGE_SIZE << 1 < SZ_64K) > + max_order = get_order(SZ_64K); > +#endif > + > + WARN_ON_ONCE(order > max_order); > +} > + > static inline struct page * > __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > struct alloc_context *ac) > @@ -4683,11 +4702,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > int reserve_flags; > > if (unlikely(nofail)) { > - /* > - * We most definitely don't want callers attempting to > - * allocate greater than order-1 page units with __GFP_NOFAIL. > - */ > - WARN_ON_ONCE(order > 1); > + check_nofail_max_order(order); > /* > * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, > * otherwise, we may result in lockup. > -- > 2.46.1 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
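For context on the PAGE_ALLOC_COSTLY_ORDER caveat raised above, here is a heavily condensed sketch of the two places where the allocator special-cases costly orders (order > 3); this is abbreviated from mm/page_alloc.c and the exact checks vary by kernel version:

/* __alloc_pages_may_oom(), simplified: the OOM killer is never
 * invoked for costly orders, so a costly NOFAIL request can only
 * loop around reclaim/compaction waiting for a large enough page. */
if (order > PAGE_ALLOC_COSTLY_ORDER)    /* PAGE_ALLOC_COSTLY_ORDER == 3 */
        goto out;

/* __alloc_pages_slowpath(), simplified: costly requests also stop
 * retrying reclaim early unless the caller opted in; __GFP_NOFAIL
 * requests keep looping instead of failing. */
if (costly_order && (!can_compact || !(gfp_mask & __GFP_RETRY_MAYFAIL)))
        goto nopage;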
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 7:25 ` Michal Hocko @ 2025-10-31 10:12 ` Vlastimil Babka 2025-10-31 14:26 ` Matthew Wilcox 0 siblings, 1 reply; 20+ messages in thread From: Vlastimil Babka @ 2025-10-31 10:12 UTC (permalink / raw) To: Michal Hocko, libaokun Cc: linux-mm, akpm, surenb, jackmanb, hannes, ziy, willy, jack, yi.zhang, yangerkun, libaokun1 On 10/31/25 08:25, Michal Hocko wrote: > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: >> From: Baokun Li <libaokun1@huawei.com> >> >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata >> reads at critical points, since they cannot afford to go read-only, >> shut down, or enter an inconsistent state due to memory pressure. >> >> Currently, attempting to allocate page units greater than order-1 with >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) >> can easily require allocations larger than order-1. >> >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will >> be many clean folios in the page cache that are 64KiB or larger. >> >> Therefore, to avoid the warning when LBS is enabled, we relax this >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current >> maximum supported logical block size is 64KiB, meaning the maximum order >> handled here is 4. > > Would be using kvmalloc an option instead of this? The thread under Link: suggests xfs has its own vmalloc callback. But it's not one of the 5 options listed, so it's a good question how difficult it would be to implement that for ext4 or in general. > This change doesn't really make much sense to me TBH. While the order=1 > is rather arbitrary it is an internal allocator constrain - i.e. order which > the allocator can sustain for NOFAIL requests is directly related to > memory reclaim and internal allocator operation rather than something as > external as block size. If the allocator needs to support 64kB NOFAIL > requests because there is a strong demand for that then fine and we can > see whether this is feasible. > > Please keep in mind that 64kb > PAGE_ALLOC_COSTLY_ORDER and that is > where page allocator behavior changes considerably (e.g. oom killer is > not invoked so the allocation could stall for ever). So it is not as > simple as say this just going to work fine. True. >> Suggested-by: Matthew Wilcox <willy@infradead.org> >> Link: https://lore.kernel.org/all/aQPX1-XWQjKaMTZB@casper.infradead.org >> Signed-off-by: Baokun Li <libaokun1@huawei.com> >> --- >> mm/page_alloc.c | 25 ++++++++++++++++++++----- >> 1 file changed, 20 insertions(+), 5 deletions(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index fb91c566327c..913b9baa24b4 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -4663,6 +4663,25 @@ check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac) >> return false; >> } >> >> +/* >> + * We most definitely don't want callers attempting to >> + * allocate greater than order-1 page units with __GFP_NOFAIL. >> + * >> + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with >> + * __GFP_NOFAIL should always be supported. >> + */ >> +static inline void check_nofail_max_order(unsigned int order) >> +{ >> + unsigned int max_order = 1; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE This is a bit confusing to me since we are talking about block size.
Are filesystems with these large block sizes only possible to mount with a kernel with THPs? >> + if (PAGE_SIZE << 1 < SZ_64K) >> + max_order = get_order(SZ_64K); >> +#endif >> + >> + WARN_ON_ONCE(order > max_order); >> +} >> + >> static inline struct page * >> __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >> struct alloc_context *ac) >> @@ -4683,11 +4702,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >> int reserve_flags; >> >> if (unlikely(nofail)) { >> - /* >> - * We most definitely don't want callers attempting to >> - * allocate greater than order-1 page units with __GFP_NOFAIL. >> - */ >> - WARN_ON_ONCE(order > 1); >> + check_nofail_max_order(order); >> /* >> * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, >> * otherwise, we may result in lockup. >> -- >> 2.46.1 > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 10:12 ` Vlastimil Babka @ 2025-10-31 14:26 ` Matthew Wilcox 2025-10-31 15:35 ` Shakeel Butt 0 siblings, 1 reply; 20+ messages in thread From: Matthew Wilcox @ 2025-10-31 14:26 UTC (permalink / raw) To: Vlastimil Babka Cc: Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote: > On 10/31/25 08:25, Michal Hocko wrote: > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > >> From: Baokun Li <libaokun1@huawei.com> > >> > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > >> reads at critical points, since they cannot afford to go read-only, > >> shut down, or enter an inconsistent state due to memory pressure. > >> > >> Currently, attempting to allocate page units greater than order-1 with > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > >> can easily require allocations larger than order-1. > >> > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will > >> be many clean folios in the page cache that are 64KiB or larger. > >> > >> Therefore, to avoid the warning when LBS is enabled, we relax this > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > >> maximum supported logical block size is 64KiB, meaning the maximum order > >> handled here is 4. > > > > Would be using kvmalloc an option instead of this? > > The thread under Link: suggests xfs has its own vmalloc callback. But it's > not one of the 5 options listed, so it's good question how difficult would > be to implement that for ext4 or in general. It's implicit in options 1-4. Today, the buffer cache is an alias into the page cache. The page cache can only store folios. So to use vmalloc, we either have to make folios discontiguous, stop the buffer cache being an alias into the page cache, or stop ext4 from using the buffer cache. > > This change doesn't really make much sense to me TBH. While the order=1 > > is rather arbitrary it is an internal allocator constrain - i.e. order which > > the allocator can sustain for NOFAIL requests is directly related to > > memory reclaim and internal allocator operation rather than something as > > external as block size. If the allocator needs to support 64kB NOFAIL > > requests because there is a strong demand for that then fine and we can > > see whether this is feasible. Maybe Baokun's explanation for why this is unlikely to be a problem in practice didn't make sense to you. Let me try again, perhaps being more explicit about things which an fs developer would know but an MM person might not realise. Hard drive manufacturers are absolutely gagging to ship drives with a 64KiB sector size. Once they do, the minimum transfer size to/from a device becomes 64KiB. That means the page cache will cache all files (and fs metadata) from that drive in contiguous 64KiB chunks. That means that when reclaim shakes the page cache, it's going to find a lot of order-4 folios to free ... which means that the occasional GFP_NOFAIL order-4 allocation is going to have no trouble finding order-4 pages to satisfy the allocation. Now, the problem is the non-filesystems which may now take advantage of this to write lazy code.
It'd be nice if we had some token that said "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a NOFAIL high-order allocation, you can reclaim one I've already allocated and everything will be fine". But I can't see a way to put that kind of token into our interfaces. > >> +/* > >> + * We most definitely don't want callers attempting to > >> + * allocate greater than order-1 page units with __GFP_NOFAIL. > >> + * > >> + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with > >> + * __GFP_NOFAIL should always be supported. > >> + */ > >> +static inline void check_nofail_max_order(unsigned int order) > >> +{ > >> + unsigned int max_order = 1; > >> + > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > This is a bit confusing to me since we are talking about block size. Are > filesystems with these large block sizes only possible to mount with a > kernel with THPs? For the moment, yes. It's an artefact of how large folio support was originally developed. It's one of those things that's only a problem for weirdoes who compile their own kernels because all distros have turned it on since basically forever. Also some minority architectures don't support it yet. Anyway, fixing this is on the todo list, but it's not a high priority. ^ permalink raw reply [flat|nested] 20+ messages in thread
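For a concrete picture of how the page cache ends up filled with those contiguous 64KiB chunks: an LBS filesystem pins the minimum folio order of each mapping, roughly as sketched below (based on the mapping_set_folio_min_order() helper in recent kernels; exact call sites and helpers differ per filesystem):

/* blocksize = 64 KiB with PAGE_SIZE = 4 KiB  =>  min_order = 4 */
unsigned int min_order = ilog2(blocksize >> PAGE_SHIFT);

/* Every folio the page cache allocates for this mapping is now at
 * least order-4, so file data and metadata populate memory with
 * order-4 (or larger) folios that reclaim can later free. */
mapping_set_folio_min_order(inode->i_mapping, min_order);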
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 14:26 ` Matthew Wilcox @ 2025-10-31 15:35 ` Shakeel Butt 2025-10-31 15:52 ` Shakeel Butt 0 siblings, 1 reply; 20+ messages in thread From: Shakeel Butt @ 2025-10-31 15:35 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote: > > On 10/31/25 08:25, Michal Hocko wrote: > > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > > >> From: Baokun Li <libaokun1@huawei.com> > > >> > > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > > >> reads at critical points, since they cannot afford to go read-only, > > >> shut down, or enter an inconsistent state due to memory pressure. > > >> > > >> Currently, attempting to allocate page units greater than order-1 with > > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > > >> can easily require allocations larger than order-1. > > >> > > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will > > >> be many clean folios in the page cache that are 64KiB or larger. > > >> > > >> Therefore, to avoid the warning when LBS is enabled, we relax this > > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > > >> maximum supported logical block size is 64KiB, meaning the maximum order > > >> handled here is 4. > > > > > > Would be using kvmalloc an option instead of this? > > > > The thread under Link: suggests xfs has its own vmalloc callback. But it's > > not one of the 5 options listed, so it's good question how difficult would > > be to implement that for ext4 or in general. > > It's implicit in options 1-4. Today, the buffer cache is an alias into > the page cache. The page cache can only store folios. So to use > vmalloc, we either have to make folios discontiguous, stop the buffer > cache being an alias into the page cache, or stop ext4 from using the > buffer cache. > > > > This change doesn't really make much sense to me TBH. While the order=1 > > > is rather arbitrary it is an internal allocator constrain - i.e. order which > > > the allocator can sustain for NOFAIL requests is directly related to > > > memory reclaim and internal allocator operation rather than something as > > > external as block size. If the allocator needs to support 64kB NOFAIL > > > requests because there is a strong demand for that then fine and we can > > > see whether this is feasible. > > Maybe Baokun's explanation for why this is unlikel to be a problem in > practice didn't make sense to you. Let me try again, perhaps being more > explicit about things which an fs developer would know but an MM person > might not realise. > > Hard drive manufacturers are absolutely gagging to ship drives with a > 64KiB sector size. Once they do, the minimum transfer size to/from a > device becomes 64KiB. That means the page cache will cache all files > (and fs metadata) from that drive in contiguous 64KiB chunks. That means > that when reclaim shakes the page cache, it's going to find a lot of > order-4 folios to free ... which means that the occasional GFP_NOFAIL > order-4 allocation is going to have no trouble finding order-4 pages to > satisfy the allocation. 
> > Now, the problem is the non-filesystems which may now take advantage of > this to write lazy code. It'd be nice if we had some token that said > "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a > NOFAIL high-order allocation, you can reclaim one I've already allocated > and everything will be fine". But I can't see a way to put that kind > of token into our interfaces. A new gfp flag should be easy enough. However "you can reclaim one I've already allocated" is not something current allocation & reclaim can take any action on. Maybe that is something we can add. In addition the behavior change of costly order needs more thought. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 15:35 ` Shakeel Butt @ 2025-10-31 15:52 ` Shakeel Butt 2025-10-31 15:54 ` Matthew Wilcox 0 siblings, 1 reply; 20+ messages in thread From: Shakeel Butt @ 2025-10-31 15:52 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 08:35:50AM -0700, Shakeel Butt wrote: > On Fri, Oct 31, 2025 at 02:26:56PM +0000, Matthew Wilcox wrote: > > On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote: > > > On 10/31/25 08:25, Michal Hocko wrote: > > > > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote: > > > >> From: Baokun Li <libaokun1@huawei.com> > > > >> > > > >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata > > > >> reads at critical points, since they cannot afford to go read-only, > > > >> shut down, or enter an inconsistent state due to memory pressure. > > > >> > > > >> Currently, attempting to allocate page units greater than order-1 with > > > >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath(). > > > >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE) > > > >> can easily require allocations larger than order-1. > > > >> > > > >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will > > > >> be many clean folios in the page cache that are 64KiB or larger. > > > >> > > > >> Therefore, to avoid the warning when LBS is enabled, we relax this > > > >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current > > > >> maximum supported logical block size is 64KiB, meaning the maximum order > > > >> handled here is 4. > > > > > > > > Would be using kvmalloc an option instead of this? > > > > > > The thread under Link: suggests xfs has its own vmalloc callback. But it's > > > not one of the 5 options listed, so it's good question how difficult would > > > be to implement that for ext4 or in general. > > > > It's implicit in options 1-4. Today, the buffer cache is an alias into > > the page cache. The page cache can only store folios. So to use > > vmalloc, we either have to make folios discontiguous, stop the buffer > > cache being an alias into the page cache, or stop ext4 from using the > > buffer cache. > > > > > > This change doesn't really make much sense to me TBH. While the order=1 > > > > is rather arbitrary it is an internal allocator constrain - i.e. order which > > > > the allocator can sustain for NOFAIL requests is directly related to > > > > memory reclaim and internal allocator operation rather than something as > > > > external as block size. If the allocator needs to support 64kB NOFAIL > > > > requests because there is a strong demand for that then fine and we can > > > > see whether this is feasible. > > > > Maybe Baokun's explanation for why this is unlikel to be a problem in > > practice didn't make sense to you. Let me try again, perhaps being more > > explicit about things which an fs developer would know but an MM person > > might not realise. > > > > Hard drive manufacturers are absolutely gagging to ship drives with a > > 64KiB sector size. Once they do, the minimum transfer size to/from a > > device becomes 64KiB. That means the page cache will cache all files > > (and fs metadata) from that drive in contiguous 64KiB chunks. 
That means > > that when reclaim shakes the page cache, it's going to find a lot of > > order-4 folios to free ... which means that the occasional GFP_NOFAIL > > order-4 allocation is going to have no trouble finding order-4 pages to > > satisfy the allocation. > > > > Now, the problem is the non-filesystems which may now take advantage of > > this to write lazy code. It'd be nice if we had some token that said > > "hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a > > NOFAIL high-order allocation, you can reclaim one I've already allocated > > and everything will be fine". But I can't see a way to put that kind > > of token into our interfaces. > > A new gfp flag should be easy enough. However "you can reclaim one I've > already allocated" is not something current allocation & reclaim can > take any action on. Maybe that is something we can add. In addition the > behavior change of costly order needs more thought. > After reading the background link, it seems like the actual allocation will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim cannot really reclaim any file memory (page cache). However, I wonder: with the writeback gone from the reclaim path, should we allow reclaiming clean file pages even for NOFS context (need some digging). ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 15:52 ` Shakeel Butt @ 2025-10-31 15:54 ` Matthew Wilcox 2025-10-31 16:46 ` Shakeel Butt 0 siblings, 1 reply; 20+ messages in thread From: Matthew Wilcox @ 2025-10-31 15:54 UTC (permalink / raw) To: Shakeel Butt Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 08:52:49AM -0700, Shakeel Butt wrote: > After reading the background link, it seems like the actual allocation > will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim can not > really reclaim any file memory (page cache). However I wonder with the > writeback gone from reclaim path, should we allow reclaiming clean file > pages even for NOFS context (need some digging). I thought that was true yesterday morning, but I read the code and it isn't. Look at where may_enter_fs() is called from. It's only for folios which are either marked dirty or writeback. ^ permalink raw reply [flat|nested] 20+ messages in thread
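A heavily simplified sketch of the logic being referred to, condensed from mm/vmscan.c (the real code has more cases and varies by version): may_enter_fs() only gates the writeback and dirty paths, so clean file folios remain reclaimable from GFP_NOFS context:

static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
{
        if (gfp_mask & __GFP_FS)
                return true;
        /* swap-backed folios in the swap cache only need __GFP_IO */
        return folio_test_swapcache(folio) && (gfp_mask & __GFP_IO);
}

/* shrink_folio_list(), simplified: */
if (folio_test_writeback(folio) && !may_enter_fs(folio, sc->gfp_mask))
        goto activate_locked;   /* defer this folio, don't reclaim now */
if (folio_test_dirty(folio) && !may_enter_fs(folio, sc->gfp_mask))
        goto keep_locked;       /* would have to call into the fs */
/* clean folios fall through and can be freed regardless of __GFP_FS */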
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 15:54 ` Matthew Wilcox @ 2025-10-31 16:46 ` Shakeel Butt 2025-10-31 16:55 ` Matthew Wilcox 0 siblings, 1 reply; 20+ messages in thread From: Shakeel Butt @ 2025-10-31 16:46 UTC (permalink / raw) To: Matthew Wilcox Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 03:54:56PM +0000, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 08:52:49AM -0700, Shakeel Butt wrote: > > After reading the background link, it seems like the actual allocation > > will be NOFS + NOFAIL + higher_order. With NOFS, current reclaim can not > > really reclaim any file memory (page cache). However I wonder with the > > writeback gone from reclaim path, should we allow reclaiming clean file > > pages even for NOFS context (need some digging). > > I thought that was true yesterday morning, but I read the code and it isn't. > > Look at where may_enter_fs() is called from. It's only for folios which > are either marked dirty or writeback. > Indeed you are right. Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new (FS specific) gfp is fine but will require some maintenance to avoid abuse. I am more interested in how to codify "you can reclaim one I've already allocated". I have a different scenario where the network stack keeps stealing memory from direct reclaimers and keeping them in reclaim for a long time. If we have some mechanism to allow reclaimers to get the memory they have reclaimed (at least for some cases), I think that can be used in both cases. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 16:46 ` Shakeel Butt @ 2025-10-31 16:55 ` Matthew Wilcox 2025-11-03 2:45 ` Baokun Li 2025-11-03 7:55 ` Michal Hocko 0 siblings, 2 replies; 20+ messages in thread From: Matthew Wilcox @ 2025-10-31 16:55 UTC (permalink / raw) To: Shakeel Butt Cc: Vlastimil Babka, Michal Hocko, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new > (FS specific) gfp is fine but will require some maintenance to avoid > abuse. I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just doesn't feel right. > I am more interested in how to codify "you can reclaim one I've already > allocated". I have a different scenario where network stack keep > stealing memory from direct reclaimers and keeping them in reclaim for > long time. If we have some mechanism to allow reclaimers to get the > memory they have reclaimed (at least for some cases), I think that can > be used in both cases. The only thing that comes to mind is putting pages freed by reclaim on a list in task_struct instead of sending them back to the allocator. Then the task can allocate from there and free up anything else it's reclaimed at some later point. I don't think this is a good idea, but it's the only idea that comes to mind. ^ permalink raw reply [flat|nested] 20+ messages in thread
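For concreteness, the per-task stash being described might take roughly the following shape; every name here (reclaimed_pages, stash_instead_of_free(), task_take_reclaimed()) is hypothetical, a sketch of the idea rather than existing kernel code:

/* Hypothetical new field in struct task_struct: */
struct list_head reclaimed_pages;       /* pages this task's reclaim freed */

/* Hypothetical hook where direct reclaim would normally free a page:
 * keep it for ourselves instead of returning it to the buddy lists,
 * where another task could grab it first. */
static void stash_instead_of_free(struct page *page, unsigned int order)
{
        set_page_private(page, order);
        list_add(&page->lru, &current->reclaimed_pages);
}

/* Hypothetical allocation path: take a matching page from the stash;
 * anything left over is released back to the allocator later. */
static struct page *task_take_reclaimed(unsigned int order)
{
        struct page *page;

        list_for_each_entry(page, &current->reclaimed_pages, lru) {
                if (page_private(page) == order) {
                        list_del(&page->lru);
                        return page;
                }
        }
        return NULL;
}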
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 16:55 ` Matthew Wilcox @ 2025-11-03 2:45 ` Baokun Li 2025-11-03 7:55 ` Michal Hocko 1 sibling, 0 replies; 20+ messages in thread From: Baokun Li @ 2025-11-03 2:45 UTC (permalink / raw) To: Matthew Wilcox, Shakeel Butt, Michal Hocko, Vlastimil Babka Cc: linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1, Baokun Li Hi, Sorry for the late reply, I was traveling out of the office. Thanks to Matthew for helping with the explanation. On 2025-11-01 00:55, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: >> Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new >> (FS specific) gfp is fine but will require some maintenance to avoid >> abuse. > I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just > doesn't feel right. Agreed. If we add a new GFP flag, it will be hard to prevent its misuse, and __GFP_NOFAIL is a good example of this; we simply have no way to monitor all callers. >> I am more interested in how to codify "you can reclaim one I've already >> allocated". I have a different scenario where network stack keep >> stealing memory from direct reclaimers and keeping them in reclaim for >> long time. If we have some mechanism to allow reclaimers to get the >> memory they have reclaimed (at least for some cases), I think that can >> be used in both cases. > The only thing that comes to mind is putting pages freed by reclaim on > a list in task_struct instead of sending them back to the allocator. > Then the task can allocate from there and free up anything else it's > reclaimed at some later point. I don't think this is a good idea, > but it's the only idea that comes to mind. I am not familiar with the MM module, but here is a rough thought: What about passing the minimum folio order to the allocator? We could then ensure that pages are freed and allocated using the list matching that specific order. Thanks, Baokun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-10-31 16:55 ` Matthew Wilcox 2025-11-03 2:45 ` Baokun Li @ 2025-11-03 7:55 ` Michal Hocko 2025-11-03 9:01 ` Vlastimil Babka 1 sibling, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-03 7:55 UTC (permalink / raw) To: Matthew Wilcox Cc: Shakeel Butt, Vlastimil Babka, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Fri 31-10-25 16:55:44, Matthew Wilcox wrote: > On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: > > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new > > (FS specific) gfp is fine but will require some maintenance to avoid > > abuse. > > I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just > doesn't feel right. Yeah, as usual a new gfp flag seems convenient except history has taught us this rarely works. > > I am more interested in how to codify "you can reclaim one I've already > > allocated". I have a different scenario where network stack keep > > stealing memory from direct reclaimers and keeping them in reclaim for > > long time. If we have some mechanism to allow reclaimers to get the > > memory they have reclaimed (at least for some cases), I think that can > > be used in both cases. > > The only thing that comes to mind is putting pages freed by reclaim on > a list in task_struct instead of sending them back to the allocator. > Then the task can allocate from there and free up anything else it's > reclaimed at some later point. I don't think this is a good idea, > but it's the only idea that comes to mind. I played with that idea years ago. Mostly to deal with direct reclaim unfairness when some reclaimers were doing a lot of work on behalf of everybody else. IIRC I ran into different problems, like reclaim throttling and over-reclaim. Anyway, the page allocator does respect GFP_NOFAIL even for high order requests. The oom killer will be disabled for order-4 but as these will likely be GFP_NOFS anyway, the order doesn't make much of a difference. So these requests could really take a long time to succeed but I guess this will be generally understood. As the vmalloc fallback doesn't seem to be a feasible option short (maybe even mid) term, this is the only choice we have other than failing allocations and seeing a lot of fs failures. That being said I would much rather go and drop the order warning than trying to invent some fine tuning based on usecase. We might need to invent some OOM protection for order-3 nofail requests as the OOM killer could do too much harm, killing tasks without much chance of defragmenting memory. Let's deal with that once we see that happening. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 7:55 ` Michal Hocko @ 2025-11-03 9:01 ` Vlastimil Babka 2025-11-03 9:25 ` Michal Hocko 2025-11-03 18:53 ` Shakeel Butt 0 siblings, 2 replies; 20+ messages in thread From: Vlastimil Babka @ 2025-11-03 9:01 UTC (permalink / raw) To: Michal Hocko, Matthew Wilcox Cc: Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 11/3/25 08:55, Michal Hocko wrote: > On Fri 31-10-25 16:55:44, Matthew Wilcox wrote: >> On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: >> > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new >> > (FS specific) gfp is fine but will require some maintenance to avoid >> > abuse. >> >> I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just >> doesn't feel right. > > Yeah, as usual a new gfp flag seems convenient except history has taught > us this rarely works. > >> > I am more interested in how to codify "you can reclaim one I've already >> > allocated". I have a different scenario where network stack keep >> > stealing memory from direct reclaimers and keeping them in reclaim for >> > long time. If we have some mechanism to allow reclaimers to get the >> > memory they have reclaimed (at least for some cases), I think that can >> > be used in both cases. >> >> The only thing that comes to mind is putting pages freed by reclaim on >> a list in task_struct instead of sending them back to the allocator. >> Then the task can allocate from there and free up anything else it's >> reclaimed at some later point. I don't think this is a good idea, >> but it's the only idea that comes to mind. > > I have played with that idea years ago. Mostly to deal with direct > reclaim unfairness when some reclaimers were doing a lot of work on > behalf of everybody else. IIRC I have hit into different problems, like > reclaim throttling and over-reclaim. Btw, meanwhile we got this implemented in compaction, see compaction_capture(). As the hook is in __free_one_page() it should now be straightforward to arm it also for direct reclaim of e.g. __GFP_NOFAIL costly order allocations. It probably wouldn't make sense for non-costly orders because they are freed to the pcplists and we wouldn't want to make those more expensive by adding the hook there too. It's likely the hook in compaction already helps such allocations. But if you expect the order-4 pages reclaim to be common thanks to the large blocks, it could maybe help if capture was done in reclaim too. > Anyway, page allocator does respect GFP_NOFAIL even for high order > requests. The oom killer will be disabled for order-4 but as these will > likely be GFP_NOFS anyway then the order doesn't make much of a > difference. So these requests could really take long time to succeed but > I guess this will be generally understood. As the vmalloc fallback > doesn't seem to be a feasible option short (maybe even mid) term then > this is the only choice we have other than failing allocations and > seeing a lot of fs failures. > > That being said I would much rather go and drop the order warning than > trying to invent some fine tuning based on usecase. We might need to Agreed. Note it would also solve the warnings we saw syzbot etc trigger via slab by allocating a <8k object with __GFP_NOFAIL. 
This would normally pass the __GFP_NOFAIL only to the fallback minimum size (order-1) slab allocation and thus be fine, but can result in an order>1 allocation if you enable KASAN or another debugging option that bumps the space needed for a <8k object above 8k with the debug metadata. Maybe we could keep the warning for >=PMD_ORDER as that would still mean someone made an error? > invent some OOM protection for order-3 nofail requests as OOM killer > could just make too much harm killing tasks without much of chance to > defragment memory. Let's deal with that once we see that happening. ^ permalink raw reply [flat|nested] 20+ messages in thread
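A simplified sketch of the compaction_capture() mechanism mentioned above, condensed from mm/page_alloc.c (fields and suitability checks abbreviated): the allocating task arms a capture_control, and the hook in __free_one_page() hands a matching freed page straight to it instead of merging it back into the buddy lists:

static inline bool compaction_capture(struct capture_control *capc,
                                      struct page *page, unsigned int order,
                                      int migratetype)
{
        if (!capc || order != capc->cc->order)
                return false;
        /* (migratetype and zone suitability checks omitted) */
        capc->page = page;      /* hand the page to the waiting allocator */
        return true;
}

/* __free_one_page(), simplified: */
if (compaction_capture(capc, page, order, migratetype))
        return;                 /* captured: skip buddy merging entirely */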
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 9:01 ` Vlastimil Babka @ 2025-11-03 9:25 ` Michal Hocko 2025-11-04 10:31 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-03 9:25 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: > Maybe we could keep the warning for >=PMD_ORDER as that would still mean > someone made an error? I am not sure TBH. For those large requests (anything that is costly order) it is essentially a loop around the allocator inside the allocator. I would be much more worried about order-3 which still triggers the oom killer and could kill half of the system without much progress. For order-2 you at least have the task_struct, which spans 2 pages, but I do not think we have any guaranteed order-3 page for each task to guarantee progress when killing those. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 9:25 ` Michal Hocko @ 2025-11-04 10:31 ` Michal Hocko 2025-11-04 12:32 ` Vlastimil Babka 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-04 10:31 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Mon 03-11-25 10:25:40, Michal Hocko wrote: > On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: > > Maybe we could keep the warning for >=PMD_ORDER as that would still mean > > someone made an error? > > I am not sure TBH. For those large requests (anything that is costly > order) it is essentially a loop around allocator inside the allocator. > I would be really much more worried about order-3 which still triggers > the oom killer and could kill half of the system without much progress. > For oder-2 you at least have task_struct which spans 2 pages but I do > not think we have any guaranteed order-3 page for each task to guarantee > anything when killing those. Essentially something like this diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 25923cfec9c6..2df477d97cee 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1142,6 +1142,14 @@ bool out_of_memory(struct oom_control *oc) if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) return true; + /* + * unlike for other !costly requests killing a task is not + * really guaranteed to free any order-3 pages. Warn about + * that to see whether that happens often enough to special + * case. + */ + WARN_ON(oc->order == 3 && (oc->gfp_mask & __GFP_NOFAIL)); + /* * Check if there were limitations on the allocation (only relevant for * NUMA and memcg) that may require different handling. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d1d037f97c5f..ca8795156b14 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3993,6 +3993,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* Coredumps can quickly deplete all memory reserves */ if (current->flags & PF_DUMPCORE) goto out; + /* The OOM killer will not help higher order allocs */ if (order > PAGE_ALLOC_COSTLY_ORDER) goto out; @@ -4612,11 +4613,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int reserve_flags; if (unlikely(nofail)) { - /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE(order > 1); /* * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, * otherwise, we may result in lockup. -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 10:31 ` Michal Hocko @ 2025-11-04 12:32 ` Vlastimil Babka 2025-11-04 12:50 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: Vlastimil Babka @ 2025-11-04 12:32 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 11/4/25 11:31 AM, Michal Hocko wrote: > On Mon 03-11-25 10:25:40, Michal Hocko wrote: >> On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: >>> Maybe we could keep the warning for >=PMD_ORDER as that would still mean >>> someone made an error? >> >> I am not sure TBH. For those large requests (anything that is costly >> order) it is essentially a loop around allocator inside the allocator. >> I would be really much more worried about order-3 which still triggers >> the oom killer and could kill half of the system without much progress. >> For oder-2 you at least have task_struct which spans 2 pages but I do >> not think we have any guaranteed order-3 page for each task to guarantee >> anything when killing those. > > Essentially something like this > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 25923cfec9c6..2df477d97cee 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -1142,6 +1142,14 @@ bool out_of_memory(struct oom_control *oc) > if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) > return true; > > + /* > + * unlike for other !costly requests killing a task is not > + * really guaranteed to free any order-3 pages. Warn about > + * that to see whether that happens often enough to special > + * case. > + */ > + WARN_ON(oc->order == 3 && (oc->gfp_mask & __GFP_NOFAIL)); OK, it might not create an order-3 page immediately. But I'd expect it allows compaction to make progress thanks to making more free memory available? We do retry reclaim/compaction after OOM killing one process, and don't just kill until we succeed allocating, right? > + > /* > * Check if there were limitations on the allocation (only relevant for > * NUMA and memcg) that may require different handling. > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d1d037f97c5f..ca8795156b14 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3993,6 +3993,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, > /* Coredumps can quickly deplete all memory reserves */ > if (current->flags & PF_DUMPCORE) > goto out; > + > /* The OOM killer will not help higher order allocs */ > if (order > PAGE_ALLOC_COSTLY_ORDER) > goto out; > @@ -4612,11 +4613,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > int reserve_flags; > > if (unlikely(nofail)) { > - /* > - * We most definitely don't want callers attempting to > - * allocate greater than order-1 page units with __GFP_NOFAIL. > - */ > - WARN_ON_ONCE(order > 1); > /* > * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, > * otherwise, we may result in lockup. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 12:32 ` Vlastimil Babka @ 2025-11-04 12:50 ` Michal Hocko 2025-11-04 12:57 ` Vlastimil Babka 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-04 12:50 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: > On 11/4/25 11:31 AM, Michal Hocko wrote: > > On Mon 03-11-25 10:25:40, Michal Hocko wrote: > >> On Mon 03-11-25 10:01:54, Vlastimil Babka wrote: > >>> Maybe we could keep the warning for >=PMD_ORDER as that would still mean > >>> someone made an error? > >> > >> I am not sure TBH. For those large requests (anything that is costly > >> order) it is essentially a loop around allocator inside the allocator. > >> I would be really much more worried about order-3 which still triggers > >> the oom killer and could kill half of the system without much progress. > >> For oder-2 you at least have task_struct which spans 2 pages but I do > >> not think we have any guaranteed order-3 page for each task to guarantee > >> anything when killing those. > > > > Essentially something like this > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index 25923cfec9c6..2df477d97cee 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -1142,6 +1142,14 @@ bool out_of_memory(struct oom_control *oc) > > if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) > > return true; > > > > + /* > > + * unlike for other !costly requests killing a task is not > > + * really guaranteed to free any order-3 pages. Warn about > > + * that to see whether that happens often enough to special > > + * case. > > + */ > > + WARN_ON(oc->order == 3 && (oc->gfp_mask & __GFP_NOFAIL)); > > OK, it might not create an order-3 page immediately. But I'd expect it > allows compaction to make progress thanks to making more free memory > available? We do retry reclaim/compaction after OOM killing one process, > and don't just kill until we succeed allocating, right? Yes we do go through the reclaim/compaction cycle. Do you think this warning is overzealous? The idea is that a flood of OOMs could be easier to pinpoint with this in place. It doesn't have to be a full WARN_ON, maybe pr_warn would be sufficient as the backtrace is usually printed in the oom report. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 12:50 ` Michal Hocko @ 2025-11-04 12:57 ` Vlastimil Babka 2025-11-04 16:43 ` Michal Hocko 0 siblings, 1 reply; 20+ messages in thread From: Vlastimil Babka @ 2025-11-04 12:57 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 11/4/25 1:50 PM, Michal Hocko wrote: > On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: >> >> OK, it might not create an order-3 page immediately. But I'd expect it >> allows compaction to make progress thanks to making more free memory >> available? We do retry reclaim/compaction after OOM killing one process, >> and don't just kill until we succeed allocating, right? > > Yes we do go through the reclaim/compaction cycle. Do you think this > warning is overzealous? Th idea is that a flood of OOMs could be easier I think it's too odd to warn for one specific order and not for higher orders. We would risk someone making the allocation order-4 instead of order-3 just to avoid it. > to pin point with this in place. It doesn't have to be full WARN_ON, > maybe pr_warn would be sufficient as the backtrace is usually printed > in the oom report. With gfp flags and order also part of the OOM report, I think we could risk just removing the warning completely and seeing if such (flood of) reports ever comes up? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 12:57 ` Vlastimil Babka @ 2025-11-04 16:43 ` Michal Hocko 2025-11-05 6:23 ` Baokun Li 0 siblings, 1 reply; 20+ messages in thread From: Michal Hocko @ 2025-11-04 16:43 UTC (permalink / raw) To: Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Tue 04-11-25 13:57:35, Vlastimil Babka wrote: > On 11/4/25 1:50 PM, Michal Hocko wrote: > > On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: > >> > >> OK, it might not create an order-3 page immediately. But I'd expect it > >> allows compaction to make progress thanks to making more free memory > >> available? We do retry reclaim/compaction after OOM killing one process, > >> and don't just kill until we succeed allocating, right? > > > > Yes we do go through the reclaim/compaction cycle. Do you think this > > warning is overzealous? Th idea is that a flood of OOMs could be easier > > I think it's too odd to warn for a specific order and not that or higher > orders. We would risk someone would make the allocation order-4 instead > of order-3 just to avoid it. Higher orders simply avoid the OOM killer, so it is effectively a retry loop around the allocator. Order-3 is a bit odd and that is what the warning is trying to point out. But fair enough, let's just drop the existing warning and see how it goes. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-04 16:43 ` Michal Hocko @ 2025-11-05 6:23 ` Baokun Li 0 siblings, 0 replies; 20+ messages in thread From: Baokun Li @ 2025-11-05 6:23 UTC (permalink / raw) To: Michal Hocko, Vlastimil Babka Cc: Matthew Wilcox, Shakeel Butt, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On 2025-11-05 00:43, Michal Hocko wrote: > On Tue 04-11-25 13:57:35, Vlastimil Babka wrote: >> On 11/4/25 1:50 PM, Michal Hocko wrote: >>> On Tue 04-11-25 13:32:52, Vlastimil Babka wrote: >>>> OK, it might not create an order-3 page immediately. But I'd expect it >>>> allows compaction to make progress thanks to making more free memory >>>> available? We do retry reclaim/compaction after OOM killing one process, >>>> and don't just kill until we succeed allocating, right? >>> Yes we do go through the reclaim/compaction cycle. Do you think this >>> warning is overzealous? Th idea is that a flood of OOMs could be easier >> I think it's too odd to warn for a specific order and not that or higher >> orders. We would risk someone would make the allocation order-4 instead >> of order-3 just to avoid it. > higher orders simply avoid OOM killer so it is effectivelly retry loop > around the allocator. Order-3 is a bit odd and that is what the warning > is trying to tell. But fair enough let's just drop the existing warning > and see how it goes. > Okay, since most people agree that we should first remove the current warning and then observe what happens before making further decisions, I will send a patch that directly deletes the warning. Afterwards, depending on the situation once the warning is removed, we can consider adding some special handling in other places if needed. Thanks to everyone for the discussion and suggestions! Cheers, Baokun ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS 2025-11-03 9:01 ` Vlastimil Babka 2025-11-03 9:25 ` Michal Hocko @ 2025-11-03 18:53 ` Shakeel Butt 1 sibling, 0 replies; 20+ messages in thread From: Shakeel Butt @ 2025-11-03 18:53 UTC (permalink / raw) To: Vlastimil Babka, Michal Hocko Cc: Matthew Wilcox, libaokun, linux-mm, akpm, surenb, jackmanb, hannes, ziy, jack, yi.zhang, yangerkun, libaokun1 On Mon, Nov 03, 2025 at 10:01:54AM +0100, Vlastimil Babka wrote: > On 11/3/25 08:55, Michal Hocko wrote: > > On Fri 31-10-25 16:55:44, Matthew Wilcox wrote: > >> On Fri, Oct 31, 2025 at 09:46:17AM -0700, Shakeel Butt wrote: > >> > Now for the interface to allow NOFS+NOFAIL+higher_order, I think a new > >> > (FS specific) gfp is fine but will require some maintenance to avoid > >> > abuse. > >> > >> I don't think a new GFP flag is the answer. GFP_TRUST_ME_BRO just > >> doesn't feel right. > > > > Yeah, as usual a new gfp flag seems convenient except history has taught > > us this rarely works. Point taken, and let me discuss below whether any interface for such allocation requests makes sense or not. > > > >> > I am more interested in how to codify "you can reclaim one I've already > >> > allocated". I have a different scenario where network stack keep > >> > stealing memory from direct reclaimers and keeping them in reclaim for > >> > long time. If we have some mechanism to allow reclaimers to get the > >> > memory they have reclaimed (at least for some cases), I think that can > >> > be used in both cases. > >> > >> The only thing that comes to mind is putting pages freed by reclaim on > >> a list in task_struct instead of sending them back to the allocator. > >> Then the task can allocate from there and free up anything else it's > >> reclaimed at some later point. I don't think this is a good idea, > >> but it's the only idea that comes to mind. > > > > I have played with that idea years ago. Mostly to deal with direct > > reclaim unfairness when some reclaimers were doing a lot of work on > > behalf of everybody else. IIRC I have hit into different problems, like > > reclaim throttling and over-reclaim. > > Btw, meanwhile we got this implemented in compaction, see > compaction_capture(). As the hook is in __free_one_page() it should now be > straightforward to arm it also for direct reclaim of e.g. __GFP_NOFAIL > costly order allocations. It probably wouldn't make sense for non-costly > orders because they are freed to the pcplists and we wouldn't want to make > those more expensive by adding the hook there too. > > It's likely the hook in compaction already helps such allocations. But if > you expect the order-4 pages reclaim to be common thanks to the large > blocks, it could maybe help if capture was done in reclaim too. Thanks for the pointer, I didn't know about this mechanism. I think we can expand the scope of this mechanism to the whole __alloc_pages_slowpath() which calls both reclaim and compaction. Currently the free hook is only triggered when compaction causes a free, but with the larger scope reclaim could trigger it as well. Now there are a couple of open questions: 1. Should we differentiate and prioritize between different allocators? That is, allocators with NOFS+NOFAIL get preference as they might be holding locks and impacting concurrent allocators; or maybe prefer allocators which will release memory in the near future. 2. At the moment, we do expect allocators in the slow path to work for the betterment of the whole system.
So, should we skip this mechanism for the first iteration (or first couple of iterations) of the slowpath and only use it during later iterations? 3. This mechanism still does not capture "reclaim from me" which was Willy's original point. "Reclaim from me" seems more involved, as reclaim in general prefers to reclaim cold memory. In addition there are memcg protections (low/min). So, the reclaim algo/heuristics may decide you are not reclaimable. Not sure if it is still worth trying the "reclaim from me" option. Anyway, at the moment I think if we go with this mechanism, we might really need an explicit interface. We may revisit that in the future if we try to be more fancy. thanks, Shakeel ^ permalink raw reply [flat|nested] 20+ messages in thread
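As a rough illustration of the wider capture scope discussed above: the slowpath would arm a capture_control around direct reclaim, much as compaction does today through current->capture_control. This is a hypothetical sketch of the idea, not existing code:

/* Hypothetical: __alloc_pages_slowpath() arming capture for a costly
 * __GFP_NOFAIL request before entering direct reclaim. */
struct compact_control cc = { .order = order }; /* carries the target order */
struct capture_control capc = { .cc = &cc, .page = NULL };

current->capture_control = &capc;
progress = __perform_reclaim(gfp_mask, order, ac);
current->capture_control = NULL;

if (capc.page)                  /* __free_one_page() handed us a page of */
        return capc.page;       /* the right order while reclaim ran */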