linux-mm.kvack.org archive mirror
From: Matthew Wilcox <willy@infradead.org>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>,
	libaokun@huaweicloud.com, linux-mm@kvack.org,
	akpm@linux-foundation.org, surenb@google.com,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	jack@suse.cz, yi.zhang@huawei.com, yangerkun@huawei.com,
	libaokun1@huawei.com
Subject: Re: [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS
Date: Fri, 31 Oct 2025 14:26:56 +0000	[thread overview]
Message-ID: <aQTHMI3t5mNXp0M1@casper.infradead.org> (raw)
In-Reply-To: <1ab71a9d-dc28-4fa0-8151-6e322728beae@suse.cz>

On Fri, Oct 31, 2025 at 11:12:16AM +0100, Vlastimil Babka wrote:
> On 10/31/25 08:25, Michal Hocko wrote:
> > On Fri 31-10-25 14:13:50, libaokun@huaweicloud.com wrote:
> >> From: Baokun Li <libaokun1@huawei.com>
> >> 
> >> Filesystems use __GFP_NOFAIL to allocate block-sized folios for metadata
> >> reads at critical points, since they cannot afford to go read-only,
> >> shut down, or enter an inconsistent state due to memory pressure.
> >> 
> >> Currently, attempting to allocate page units greater than order-1 with
> >> the __GFP_NOFAIL flag triggers a WARN_ON() in __alloc_pages_slowpath().
> >> However, filesystems supporting large block sizes (blocksize > PAGE_SIZE)
> >> can easily require allocations larger than order-1.
> >> 
> >> As Matthew noted, if we have a filesystem with 64KiB sectors, there will
> >> be many clean folios in the page cache that are 64KiB or larger.
> >> 
> >> Therefore, to avoid the warning when LBS is enabled, we relax this
> >> restriction to allow allocations up to BLK_MAX_BLOCK_SIZE. The current
> >> maximum supported logical block size is 64KiB, meaning the maximum order
> >> handled here is 4.
> > 
> > Would be using kvmalloc an option instead of this?
> 
> The thread under Link: suggests xfs has its own vmalloc callback. But it's
> not one of the 5 options listed, so it's a good question how difficult it
> would be to implement that for ext4 or in general.

It's implicit in options 1-4.  Today, the buffer cache is an alias into
the page cache.  The page cache can only store folios.  So to use
vmalloc, we either have to make folios discontiguous, stop the buffer
cache being an alias into the page cache, or stop ext4 from using the
buffer cache.
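
To put that contiguity point in code terms: a folio is physically
contiguous memory with a struct folio behind it that the page cache can
index, while vmalloc() gives only virtually contiguous memory with no
folio at all.  A minimal sketch (ordinary kernel APIs, not code from the
patch):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/vmalloc.h>

static void contiguity_example(void)
{
	/*
	 * Physically contiguous and backed by a struct folio, so the page
	 * cache (and the buffer cache aliased on top of it) can index it.
	 */
	struct folio *folio = folio_alloc(GFP_KERNEL, get_order(SZ_64K));

	/*
	 * Only virtually contiguous; there is no folio behind it, so the
	 * page cache cannot store it without one of the changes above.
	 */
	void *buf = vmalloc(SZ_64K);

	if (folio)
		folio_put(folio);
	vfree(buf);
}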

> > This change doesn't really make much sense to me TBH. While the order=1
> > is rather arbitrary, it is an internal allocator constraint - i.e. the
> > order which the allocator can sustain for NOFAIL requests is directly
> > related to memory reclaim and internal allocator operation, rather than
> > to something as external as block size. If the allocator needs to support
> > 64kB NOFAIL requests because there is a strong demand for that then fine
> > and we can see whether this is feasible.

Maybe Baokun's explanation for why this is unlikely to be a problem in
practice didn't make sense to you.  Let me try again, perhaps being more
explicit about things which an fs developer would know but an MM person
might not realise.

Hard drive manufacturers are absolutely gagging to ship drives with a
64KiB sector size.  Once they do, the minimum transfer size to/from a
device becomes 64KiB.  That means the page cache will cache all files
(and fs metadata) from that drive in contiguous 64KiB chunks.  That means
that when reclaim shakes the page cache, it's going to find a lot of
order-4 folios to free ... which means that the occasional GFP_NOFAIL
order-4 allocation is going to have no trouble finding order-4 pages to
satisfy the allocation.
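
Concretely, and assuming the common 4KiB PAGE_SIZE, a 64KiB block works
out to order 4.  A sketch (not taken from the RFC) of the kind of
allocation in question:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sizes.h>

static struct folio *alloc_block_folio(void)
{
	/* 64KiB / 4KiB pages = 16 pages, and ilog2(16) = 4. */
	unsigned int order = get_order(SZ_64K);

	/*
	 * The request the RFC wants to stop warning about: it must not
	 * fail, and reclaim should find plenty of clean order-4 page
	 * cache folios to satisfy it.
	 */
	return folio_alloc(GFP_NOFS | __GFP_NOFAIL, order);
}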

Now, the problem is the non-filesystems which may now take advantage of
this to write lazy code.  It'd be nice if we had some token that said
"hey, I'm the page cache, I know what I'm doing, trust me if I'm doing a
NOFAIL high-order allocation, you can reclaim one I've already allocated
and everything will be fine".  But I can't see a way to put that kind
of token into our interfaces.
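
Purely for illustration, such a token might look like a GFP flag along
these lines; __GFP_PAGECACHE_NOFAIL is hypothetical (nothing like it
exists, and it is defined as a no-op below), and the snag is exactly
that nothing would stop other callers from setting it too:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical flag -- no such thing exists; a no-op placeholder here. */
#define __GFP_PAGECACHE_NOFAIL	((__force gfp_t)0)

static struct folio *pagecache_nofail_alloc(unsigned int order)
{
	/*
	 * The caller asserts "I am the page cache: reclaim can free a
	 * clean folio of this order that I already own, so NOFAIL at
	 * this order is safe".  The problem is that any caller could
	 * claim the same.
	 */
	return folio_alloc(GFP_NOFS | __GFP_NOFAIL | __GFP_PAGECACHE_NOFAIL,
			   order);
}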

> >> +/*
> >> + * We most definitely don't want callers attempting to
> >> + * allocate greater than order-1 page units with __GFP_NOFAIL.
> >> + *
> >> + * However, folio allocations up to BLK_MAX_BLOCK_SIZE with
> >> + * __GFP_NOFAIL should always be supported.
> >> + */
> >> +static inline void check_nofail_max_order(unsigned int order)
> >> +{
> >> +	unsigned int max_order = 1;
> >> +
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> 
> This is a bit confusing to me since we are talking about block size. Are
> filesystems with these large block sizes only possible to mount with a
> kernel with THPs?

For the moment, yes.  It's an artefact of how large folio support was
originally developed.  It's one of those things that's only a problem for
weirdoes who compile their own kernels because all distros have turned
it on since basically forever.  Also some minority architectures don't
support it yet.  Anyway, fixing this is on the todo list, but it's not
a high priority.
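
For anyone following along, the hunk quoted above is cut short; a rough
reconstruction of the check as the changelog describes it (not the
posted patch itself; BLK_MAX_BLOCK_SIZE is currently 64KiB, so
get_order() of it is 4 with 4KiB pages):

#include <linux/blkdev.h>
#include <linux/bug.h>
#include <linux/mm.h>

/* Reconstruction from the changelog, not the patch as posted. */
static inline void check_nofail_max_order(unsigned int order)
{
	unsigned int max_order = 1;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/*
	 * Large (LBS) folios currently depend on large-folio support,
	 * which is tied to CONFIG_TRANSPARENT_HUGEPAGE -- hence the
	 * question and answer above.
	 */
	max_order = get_order(BLK_MAX_BLOCK_SIZE);
#endif
	WARN_ON_ONCE(order > max_order);
}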




Thread overview: 20+ messages
2025-10-31  6:13 [PATCH RFC] mm: allow __GFP_NOFAIL allocation up to BLK_MAX_BLOCK_SIZE to support LBS libaokun
2025-10-31  7:25 ` Michal Hocko
2025-10-31 10:12   ` Vlastimil Babka
2025-10-31 14:26     ` Matthew Wilcox [this message]
2025-10-31 15:35       ` Shakeel Butt
2025-10-31 15:52         ` Shakeel Butt
2025-10-31 15:54           ` Matthew Wilcox
2025-10-31 16:46             ` Shakeel Butt
2025-10-31 16:55               ` Matthew Wilcox
2025-11-03  2:45                 ` Baokun Li
2025-11-03  7:55                 ` Michal Hocko
2025-11-03  9:01                   ` Vlastimil Babka
2025-11-03  9:25                     ` Michal Hocko
2025-11-04 10:31                       ` Michal Hocko
2025-11-04 12:32                         ` Vlastimil Babka
2025-11-04 12:50                           ` Michal Hocko
2025-11-04 12:57                             ` Vlastimil Babka
2025-11-04 16:43                               ` Michal Hocko
2025-11-05  6:23                                 ` Baokun Li
2025-11-03 18:53                     ` Shakeel Butt
