Linux-mm Archive on lore.kernel.org
* revisiting alloc_pages_bulks semantics?
@ 2026-05-27  7:18 Christoph Hellwig
  2026-05-27  7:53 ` Zi Yan
  2026-05-27 10:06 ` Vlastimil Babka (SUSE)
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2026-05-27  7:18 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Chuck Lever,
	Matthew Wilcox (Oracle), linux-nfs, linux-mm, linux-kernel

Hi all,

I've been looking into using alloc_pages_bulks in a few places lately,
and have run into issues with the API.  Here are my suggestions for how
to make this more useful, although only some of them are something
I'd feel comfortable to do myself:

1) early fail semantics

alloc_pages_bulks can do partial allocations for some reasons, and
users usually have a fallback by either looping and calling it again
or falling back to single page allocations.  This sucks!  Why can't
we get our usual try as hard as you can semantics, requiring
GFP_NORETRY or similar to relax it?

2) pre-zeroed page array 

There is one single user (svc_fill_pages in sunrpc) that relies on it.
For everyone else it creates extra burden and is very error prone
(speaking from experience).

3) page instead of folio

We're allocating folios, so we should have a folio API.

4) > order 0 support

The bulk allocator is limited to order 0 which limits its usefulness
these days.  It would be really helpful to do bulk allocations for
the pagecache or bounce buffering.
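
For readers who have not written such a caller, points 1) and 2) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: mock_bulk_alloc() and fill_pages_with_fallback() are hypothetical names, malloc() stands in for page allocation, and the `budget` parameter exists only to force a partial result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096

/*
 * Hypothetical stand-in for the bulk allocator.  It fills only the NULL
 * slots of pages[0..nr) and gives up after `budget` new allocations, which
 * models a partial allocation.  Returns the number of populated slots.
 */
static size_t mock_bulk_alloc(size_t nr, void **pages, size_t budget)
{
	size_t populated = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!pages[i] && budget > 0) {
			pages[i] = malloc(MODEL_PAGE_SIZE);
			budget--;
		}
		if (pages[i])
			populated++;
	}
	return populated;
}

/*
 * The open-coded caller pattern: one bulk attempt, then single allocations
 * for whatever is still missing.  The caller must have set every slot to
 * NULL beforehand, otherwise stale pointers are taken for valid pages.
 */
static int fill_pages_with_fallback(size_t nr, void **pages, size_t budget)
{
	size_t done;

	assert(pages != NULL);
	done = mock_bulk_alloc(nr, pages, budget);

	for (size_t i = 0; i < nr && done < nr; i++) {
		if (pages[i])
			continue;
		pages[i] = malloc(MODEL_PAGE_SIZE);
		if (!pages[i])
			return -1;	/* caller still owns the pages filled so far */
		done++;
	}
	return 0;
}
```

Every caller that wants a full array ends up carrying some variant of the second function, plus the zeroing of the array before the first call.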



* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27  7:18 revisiting alloc_pages_bulks semantics? Christoph Hellwig
@ 2026-05-27  7:53 ` Zi Yan
  2026-05-27  8:00   ` Christoph Hellwig
  2026-05-27 10:06 ` Vlastimil Babka (SUSE)
  1 sibling, 1 reply; 11+ messages in thread
From: Zi Yan @ 2026-05-27  7:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Chuck Lever,
	Matthew Wilcox (Oracle), linux-nfs, linux-mm, linux-kernel

On 27 May 2026, at 15:18, Christoph Hellwig wrote:

> Hi all,
>
> I've been looking into using alloc_pages_bulks in a few places lately,
> and have run into issues with the API.  Here are my suggestions for how
> to make this more useful, although only some of them are something
> I'd feel comfortable to do myself:

I have some questions below to get more details on your needs.

>
> 1) early fail semantics
>
> alloc_pages_bulks can do partial allocations for some reasons, and
> users usually have a fallback by either looping and calling it again
> or falling back to single page allocations.  This sucks!  Why can't
> we get our usual try as hard as you can semantics, requiring
> GFP_NORETRY or similar to relax it?

IIUC, current alloc_pages_bulks() tries to get free pages without doing
compaction or reclaim unless none can be allocated. Does your “usual try”
mean possible invocation of compaction and/or reclaim for every page
allocation? I guess it also relates to the order > 0 bulk allocation
below? My gut feeling is that if one “usual try” fails, the following
“usual try” might not work. So making alloc_pages_bulks() do heavy
allocation might not buy you much.

But can you elaborate on why looping alloc_pages_bulks() does not work
well? That is essentially triggering compaction/reclaim repeatedly
like your proposed “usual try” idea.

>
> 2) pre-zeroed page array
>
> There is one single user (svc_fill_pages in sunrpc) that relies on it.
> For everyone else it creates extra burden and is very error prone
> (speaking from experience).

No comment for this one.

>
> 3) page instead of folio
>
> We're allocating folios, so we should have a folio API.

Sounds reasonable to me.

>
> 4) > order 0 support
>
> The bulk allocator is limited to order 0 which limits its usefulness
> these days.  It would be really helpful to do bulk allocations for
> the pagecache or bounce buffering.

Sounds reasonable to me, but when under memory pressure, I wonder
how many > order 0 folios you can get in the end. And that might
cause a storm of compaction and/or reclaim if combined with Idea 1.
For > order 0 bulk allocations, are you thinking about 1)
a try and bail-out early model or 2) a keep-trying model?
For the latter, I wonder how large the allocation latency can be
and if that is tolerable or even makes sense, since for THP
allocations, we have seen >30s allocation latency when under
memory pressure. Is waiting minutes for bulk > order 0 allocation
making sense in your use cases?

Thanks.


Best Regards,
Yan, Zi



* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27  7:53 ` Zi Yan
@ 2026-05-27  8:00   ` Christoph Hellwig
  2026-05-27  8:31     ` Zi Yan
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2026-05-27  8:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: Christoph Hellwig, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Chuck Lever, Matthew Wilcox (Oracle), linux-nfs,
	linux-mm, linux-kernel

On Wed, May 27, 2026 at 03:53:53PM +0800, Zi Yan wrote:
> > 1) early fail semantics
> >
> > alloc_pages_bulks can do partial allocations for some reasons, and
> > users usually have a fallback by either looping and calling it again
> > or falling back to single page allocations.  This sucks!  Why can't
> > we get our usual try as hard as you can semantics, requiring
> > GFP_NORETRY or similar to relax it?
> 
> IIUC, current alloc_pages_bulks() tries to get free pages without doing
> compaction or reclaim unless none can be allocated.

Yes, which is really odd, as other page/folio allocators make that an
opt-in through GFP flags.

> Does your “usual try”
> mean possible invocation of compaction and/or reclaim for every page
> allocation?

If you look at most callers in tree, and my recently merged or to be
merged work isn't any different, they just bloody want the pages just
as any other allocator.  Failing under grave memory pressure is fine
of course, but just failing because getting the memory requires effort
is not.

> I guess it also relates to the order > 0 bulk allocation
> below? My gut feeling is that if one “usual try” fails, the following
> “usual try” might not work. So making alloc_pages_bulks() do heavy
> allocation might not buy you much.

Well, we need to centralize this.  Right now there is a lot of diverging
cargo culting in the callers.

> But can you elaborate on why looping alloc_pages_bulks() does not work
> well? That is essentially triggering compaction/reclaim repeatedly
> like your proposed “usual try” idea.

I'm not even sure if it works well.  There are some callers that do that,
some use individual fallbacks.  I don't really want to think about that
when all I need is a few folios.

> > The bulk allocator is limited to order 0 which limits its usefulness
> > these days.  It would be really helpful to do bulk allocations for
> > the pagecache or bounce buffering.
> 
> Sounds reasonable to me, but when under memory pressure, I wonder
> how many > order 0 folios you can get in the end. And that might
> cause a storm of compaction and/or reclaim if combined with Idea 1.

Well, I really want them.  In some cases I might be fine falling down
to smaller sizes, but I also really don't want the logic in every
caller.

> For > order 0 bulk allocations, are you thinking about 1)
> a try and bail-out early model or 2) a keep-trying model?

Both are useful and as with other allocators should depend on the
passed in GFP flags.

> For the latter, I wonder how large the allocation latency can be
> and if that is tolerable or even makes sense, since for THP
> allocations, we have seen >30s allocation latency when under
> memory pressure. Is waiting minutes for bulk > order 0 allocation
> making sense in your use cases?

The allocations I have in mind would only require try hard allocations
for typical file system block sizes (64k at most), while everything
larger is fair game for falling back.




* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27  8:00   ` Christoph Hellwig
@ 2026-05-27  8:31     ` Zi Yan
  2026-05-27 12:15       ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Zi Yan @ 2026-05-27  8:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Chuck Lever,
	Matthew Wilcox (Oracle), linux-nfs, linux-mm, linux-kernel

On 27 May 2026, at 16:00, Christoph Hellwig wrote:

> On Wed, May 27, 2026 at 03:53:53PM +0800, Zi Yan wrote:
>>> 1) early fail semantics
>>>
>>> alloc_pages_bulks can do partial allocations for some reasons, and
>>> users usually have a fallback by either looping and calling it again
>>> or falling back to single page allocations.  This sucks!  Why can't
>>> we get our usual try as hard as you can semantics, requiring
>>> GFP_NORETRY or similar to relax it?
>>
>> IIUC, current alloc_pages_bulks() tries to get free pages without doing
>> compaction or reclaim unless none can be allocated.
>
> Yes, which is really odd, as other page/folio allocators make that an
> opt-in through GFP flags.

Based on my understanding of the code, the GFP flags are respected at
the __alloc_pages_noprof() in alloc_pages_bulk(). The loop of
rmqueue_pcplist() is just a quick try of getting free pages.
And I suspect it might be quicker than calling __alloc_pages_noprof()
in a loop, since other preparation work in __alloc_pages_noprof()
is only done once.
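
The two stages described above can be modeled in userspace roughly as follows. This is a sketch of the behaviour as characterized in this thread, not the kernel code: struct pcp_model, slow_path_alloc() and bulk_alloc_model() are hypothetical names, and malloc() stands in for the full allocator that would honor the GFP flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PCP_MODEL_MAX 16

/* Stand-in for a per-CPU list of already free pages. */
struct pcp_model {
	void *free_pages[PCP_MODEL_MAX];
	size_t count;
};

/* Stand-in for the regular allocator, the only stage that may do real work. */
static void *slow_path_alloc(void)
{
	return malloc(4096);
}

/* Returns how many entries of pages[0..nr) were filled. */
static size_t bulk_alloc_model(struct pcp_model *pcp, size_t nr, void **pages)
{
	size_t got = 0;

	assert(pcp != NULL && pages != NULL);

	/* Stage 1: take whatever the cheap list can give, no retries. */
	while (got < nr && pcp->count > 0)
		pages[got++] = pcp->free_pages[--pcp->count];

	/* Stage 2: only when stage 1 produced nothing, try hard for one page. */
	if (got == 0 && nr > 0) {
		void *page = slow_path_alloc();

		if (page)
			pages[got++] = page;
	}
	return got;
}
```

With three pages on the list and five requested, the model returns three and never enters stage 2, which is the partial result callers then have to handle.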

>
>> Does your “usual try”
>> mean possible invocation of compaction and/or reclaim for every page
>> allocation?
>
> If you look at most callers in tree, and my recently merged or to be
> merged work isn't any different, they just bloody want the pages just
> as any other allocator.  Failing under grave memory pressure is fine
> of course, but just failing because getting the memory requires effort
> is not.
>
>> I guess it also relates to the order > 0 bulk allocation
>> below? My gut feeling is that if one “usual try” fails, the following
>> “usual try” might not work. So making alloc_pages_bulks() do heavy
>> allocation might not buy you much.
>
> Well, we need to centralize this.  Right now there is a lot of diverging
> cargo culting in the callers.
>
>> But can you elaborate on why looping alloc_pages_bulks() does not work
>> well? That is essentially triggering compaction/reclaim repeatedly
>> like your proposed “usual try” idea.
>
> I'm not even sure if it works well.  There are some callers that do that,
> some use individual fallbacks.  I don't really want to think about that
> when all I need is a few folios.
>
>>> The bulk allocator is limited to order 0 which limits its usefulness
>>> these days.  It would be really helpful to do bulk allocations for
>>> the pagecache or bounce buffering.
>>
>> Sounds reasonable to me, but when under memory pressure, I wonder
>> how many > order 0 folios you can get in the end. And that might
>> cause a storm of compaction and/or reclaim if combined with Idea 1.
>
> Well, I really want them.  In some cases I might be fine falling down
> to smaller sizes, but I also really don't want the logic in every
> caller.

Based on your answers above, it sounds like a wrapper around
__alloc_pages_bulk() that allocates in a loop until all requested
pages are filled might be good enough for your case.

But let me know if I miss something.

>
>> For > order 0 bulk allocations, are you thinking about 1)
>> a try and bail-out early model or 2) a keep-trying model?
>
> Both are useful and as with other allocators should depend on the
> passed in GFP flags.

Like I said above, __alloc_pages_noprof() in alloc_pages_bulk()
respects the GFP flags.

>
>> For the latter, I wonder how large the allocation latency can be
>> and if that is tolerable or even makes sense, since for THP
>> allocations, we have seen >30s allocation latency when under
>> memory pressure. Is waiting minutes for bulk > order 0 allocation
>> making sense in your use cases?
>
> The allocations I have in mind would only require try hard allocations
> for typical file system block sizes (64k at most), while everything
> larger is fair game for falling back.

Sure. In MM, PAGE_ALLOC_COSTLY_ORDER is 3, so pages bigger than that
would take more effort to get and the allocation latency can be longer.
So it might take a long time to allocate the last 64KB page in
a bulk allocation.
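
To make the arithmetic concrete: with 4k pages a 64KB allocation is order 4, one step above the costly threshold quoted here. A minimal sketch, assuming a 4096-byte page; order_for_size(), is_costly() and the MODEL_* constants are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096u
#define MODEL_COSTLY_ORDER 3u	/* the PAGE_ALLOC_COSTLY_ORDER value quoted above */

/* Smallest order whose span of (page size << order) bytes covers `bytes`. */
static unsigned int order_for_size(size_t bytes)
{
	unsigned int order = 0;
	size_t span = MODEL_PAGE_SIZE;

	/* Keep the shift below from overflowing. */
	assert(bytes > 0 && bytes <= ((size_t)-1) / 2);

	while (span < bytes) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Orders above the threshold are the ones treated as costly. */
static int is_costly(unsigned int order)
{
	return order > MODEL_COSTLY_ORDER;
}
```

Under these assumptions a 32KB request is order 3 and not costly, while a 64KB request is order 4 and is.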

I do not have any data for such scenarios, but some trick I can think
of is to ask compaction and reclaim to aim for more free pages instead
of just the requested order (not higher order), so that after one round
of compaction and/or reclaim, more pages at the requested order can
be allocated afterwards.


Best Regards,
Yan, Zi



* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27  7:18 revisiting alloc_pages_bulks semantics? Christoph Hellwig
  2026-05-27  7:53 ` Zi Yan
@ 2026-05-27 10:06 ` Vlastimil Babka (SUSE)
  2026-05-27 12:19   ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-05-27 10:06 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Chuck Lever, Matthew Wilcox (Oracle), linux-nfs, linux-mm,
	linux-kernel

On 5/27/26 09:18, Christoph Hellwig wrote:
> Hi all,
> 
> I've been looking into using alloc_pages_bulks in a few places lately,
> and have run into issues with the API.  Here are my suggestions for how
> to make this more useful, although only some of them are something
> I'd feel comfortable to do myself:
> 
> 1) early fail semantics
> 
> alloc_pages_bulks can do partial allocations for some reasons, and
> users usually have a fallback by either looping and calling it again
> or falling back to single page allocations.  This sucks!  Why can't
> we get our usual try as hard as you can semantics, requiring
> GFP_NORETRY or similar to relax it?

If we do that, do we keep the possibility of partial success, i.e. return
how many were allocated? Seems wasteful to succeed N-1 and then throw all
away, if the caller can use a fallback only for the last one.
Do some callers need all-or-nothing semantics? Should a flag indicate which
one to use?

> 2) pre-zeroed page array 
> 
> There is one single user (svc_fill_pages in sunrpc) that relies on it.
> For everyone else it creates extra burden and is very error prone
> (speaking from experience).

Sounds good to me. Will sunrpc be easy to convert, or should it be another
flag to opt-in to the current behavior, that it would use?

> 3) page instead of folio
> 
> We're allocating folios, so we should have a folio API.

Hm, folios initially started as "base or compound page" but then the
semantics shifted and now they are also rmappable. See how
folio_alloc_noprof() does page_rmappable_folio(). The differences might grow
further with memdesc conversion I think.
So do all the callers actually want folios? If not, we could have both
alloc_pages_bulk() and folio_alloc_bulk()?

> 4) > order 0 support
> 
> The bulk allocator is limited to order 0 which limits its usefulness
> these days.  It would be really helpful to do bulk allocations for
> the pagecache or bounce buffering.

Fine, with implications for the comment for 1)



* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27  8:31     ` Zi Yan
@ 2026-05-27 12:15       ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2026-05-27 12:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: Christoph Hellwig, Andrew Morton, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Chuck Lever, Matthew Wilcox (Oracle), linux-nfs,
	linux-mm, linux-kernel

On Wed, May 27, 2026 at 04:31:24PM +0800, Zi Yan wrote:
> > Yes, which is really odd, as other page/folio allocators make that an
> > opt-in through GFP flags.
> 
> Based on my understanding of the code, the GFP flags are respected at
> the __alloc_pages_noprof() in alloc_pages_bulk().

As __alloc_pages_noprof is the core of the regular page/folio allocator
I'd expect that as well.

> The loop of
> rmqueue_pcplist() is just a quick try of getting free pages.
> And I suspect it might be quicker than calling __alloc_pages_noprof()
> in a loop, since other preparation work in __alloc_pages_noprof()
> is only done once.

Possibly.  But that means a whole bunch of callers have the wrong
assumption.

> > Well, I really want them.  In some cases I might be fine falling down
> > to smaller sizes, but I also really don't want the logic in every
> > caller.
> 
> Based on your answers above, it sounds like a wrapper of
> __alloc_pages_bulk() that does allocation in a loop until all requested
> pages are filled might be good enough for your case.
> 
> But let me know if I miss something.

Or just allocate all pages using a loop when alloc_pages_bulk_noprof
doesn't get enough pages from the PCP list?

> > The allocations I have in mind would only require try hard allocations
> > for typical file system block sizes (64k at most), while everything
> > larger is fair game for falling back.
> 
> Sure. In MM, PAGE_ALLOC_COSTLY_ORDER is 3, so pages bigger than that
> would take more effort to get and the allocation latency can be longer.
> So it might take a long time to allocate the last 64KB page in
> a bulk allocation.

Based on the LSF/MM sessions on large folios and MM fragmentation,
it seems like we should raise it to 4 for 4k page size platforms,
as this seems to be a problem for 64k folio allocations.




* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27 10:06 ` Vlastimil Babka (SUSE)
@ 2026-05-27 12:19   ` Christoph Hellwig
  2026-05-27 13:23     ` Matthew Wilcox
  2026-05-27 13:58     ` Chuck Lever
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2026-05-27 12:19 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Christoph Hellwig, Andrew Morton, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Chuck Lever, Matthew Wilcox (Oracle), linux-nfs, linux-mm,
	linux-kernel

On Wed, May 27, 2026 at 12:06:08PM +0200, Vlastimil Babka (SUSE) wrote:
> > alloc_pages_bulks can do partial allocations for some reasons, and
> > users usually have a fallback by either looping and calling it again
> > or falling back to single page allocations.  This sucks!  Why can't
> > we get our usual try as hard as you can semantics, requiring
> > GFP_NORETRY or similar to relax it?
> 
> If we do that, do we keep the possibility of partial success, i.e. return
> how many were allocated? Seems wasteful to succeed N-1 and then throw all
> away, if the caller can use a fallback only for the last one.
> Do some callers need all-or-nothing semantics? Should a flag indicate which
> one to use?

A lot of callers (but not all) need all or nothing semantics.  But
freeing already allocated pages is not a major problem - the caller
just has to add a release_pages call if it didn't already have one
for cleaning up later failures.
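
As a rough illustration of all-or-nothing on top of a partial allocator, here is a userspace sketch. Everything in it is an assumption for illustration: mock_partial_alloc() is a hypothetical allocator that stops after `limit` pages, and release_all() only models what a release_pages call would do for the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical partial allocator: fills at most `limit` slots from the front. */
static size_t mock_partial_alloc(size_t nr, void **pages, size_t limit)
{
	size_t got = 0;

	while (got < nr && got < limit) {
		pages[got] = malloc(4096);
		if (!pages[got])
			break;
		got++;
	}
	return got;
}

/* Models the cleanup step: drop the first nr pages and clear their slots. */
static void release_all(size_t nr, void **pages)
{
	for (size_t i = 0; i < nr; i++) {
		free(pages[i]);
		pages[i] = NULL;
	}
}

/* Returns 0 with every slot filled, or -1 with nothing left allocated. */
static int alloc_all_or_nothing(size_t nr, void **pages, size_t limit)
{
	size_t got;

	assert(pages != NULL);
	got = mock_partial_alloc(nr, pages, limit);
	if (got == nr)
		return 0;

	release_all(got, pages);
	return -1;
}
```

The cost of the failure case is the N-1 pages that were obtained and then given back, which is the waste Vlastimil points at.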

> > There is one single user (svc_fill_pages in sunrpc) that relies on it.
> > For everyone else it creates extra burden and is very error prone
> > (speaking from experience).
> 
> Sounds good to me. Will sunrpc be easy to convert, or should it be another
> flag to opt-in to the current behavior, that it would use?

I've added Chuck to the Cc list, but from memory sunrpc actually does
make use of this feature and he objected to previous attempts to
change it.  So a first step would be to have a lower-level helper
that works as-is and a wrapper that zeroes the array, even if that
doesn't feel as efficient as it could be.
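
A userspace sketch of that split, under stated assumptions: fill_null_slots() and alloc_fresh_pages() are hypothetical names, and malloc() stands in for page allocation. The lower-level helper keeps the current contract of filling only NULL slots, and the wrapper clears the array so ordinary callers do not have to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Lower-level helper with today's contract: only NULL slots are filled,
 * populated slots are left alone.  Returns the number of populated slots.
 */
static size_t fill_null_slots(size_t nr, void **pages)
{
	size_t populated = 0;

	assert(pages != NULL);
	for (size_t i = 0; i < nr; i++) {
		if (!pages[i])
			pages[i] = malloc(4096);
		if (pages[i])
			populated++;
	}
	return populated;
}

/*
 * Wrapper for callers that start from scratch: it clears the array itself,
 * so leftover garbage in the slots cannot be mistaken for valid pages.
 */
static size_t alloc_fresh_pages(size_t nr, void **pages)
{
	assert(pages != NULL);
	for (size_t i = 0; i < nr; i++)
		pages[i] = NULL;
	return fill_null_slots(nr, pages);
}
```

The extra pass over the array in the wrapper is the inefficiency mentioned above; a caller like sunrpc would keep using the lower-level helper directly.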

> > 3) page instead of folio
> > 
> > We're allocating folios, so we should have a folio API.
> 
> Hm, folios initially started as "base or compound page" but then the
> semantics shifted and now they are also rmappable. See how
> folio_alloc_noprof() does page_rmappable_folio(). The differences might grow
> further with memdesc conversion I think.
> So do all the callers actually want folios? If not, we could have both
> alloc_pages_bulk() and folio_alloc_bulk()?

I can't speak for all of them, but many do.




* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27 12:19   ` Christoph Hellwig
@ 2026-05-27 13:23     ` Matthew Wilcox
  2026-05-27 13:58     ` Chuck Lever
  1 sibling, 0 replies; 11+ messages in thread
From: Matthew Wilcox @ 2026-05-27 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vlastimil Babka (SUSE), Andrew Morton, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Chuck Lever, linux-nfs, linux-mm, linux-kernel

On Wed, May 27, 2026 at 02:19:20PM +0200, Christoph Hellwig wrote:
> > > There is one single user (svc_fill_pages in sunrpc) that relies on it.
> > > For everyone else it creates extra burden and is very error prone
> > > (speaking from experience).
> > 
> > Sounds good to me. Will sunrpc be easy to convert, or should it be another
> > flag to opt-in to the current behavior, that it would use?
> 
> I've added Chuck to the Cc list, but from memory sunrpc actually does
> make use of this feature and he objected to previous attempts to
> change it.  So a first step would be to have a lower-level helper
> that works as-is and a wrapper that zeroes the array, even if that
> doesn't feel as efficient as it could be.

I think the problem is that sunrpc uses the pages as a queue instead of
a stack.  If it consumed pages from the end instead of the beginning,
it could just refill the entire tail of the array.
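
A toy version of the stack idea, purely as a sketch: struct page_stack, stack_take() and stack_refill() are hypothetical names and say nothing about how sunrpc would actually have to be restructured. Pages are consumed from the end, so the empty slots always form one contiguous tail and no scan for NULL entries is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAP 8

/* Valid pages live in slot[0..top); slot[top..POOL_CAP) is the empty tail. */
struct page_stack {
	void *slot[POOL_CAP];
	size_t top;
};

/* Consume from the end; returns NULL when the pool is empty. */
static void *stack_take(struct page_stack *s)
{
	assert(s != NULL);
	return s->top ? s->slot[--s->top] : NULL;
}

/* Refill the whole tail in one go.  Returns the resulting page count. */
static size_t stack_refill(struct page_stack *s)
{
	assert(s != NULL);
	while (s->top < POOL_CAP) {
		void *page = malloc(4096);

		if (!page)
			break;
		s->slot[s->top++] = page;
	}
	return s->top;
}
```

After three pages are taken, a refill touches exactly the three tail slots, which is the range a bulk call could be pointed at directly.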



* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27 12:19   ` Christoph Hellwig
  2026-05-27 13:23     ` Matthew Wilcox
@ 2026-05-27 13:58     ` Chuck Lever
  2026-05-28  9:00       ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2026-05-27 13:58 UTC (permalink / raw)
  To: Christoph Hellwig, Vlastimil Babka (SUSE)
  Cc: Andrew Morton, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle), linux-nfs,
	linux-mm, linux-kernel

On 5/27/26 8:19 AM, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 12:06:08PM +0200, Vlastimil Babka (SUSE) wrote:
>>> alloc_pages_bulks can do partial allocations for some reasons, and
>>> users usually have a fallback by either looping and calling it again
>>> or falling back to single page allocations.  This sucks!  Why can't
>>> we get our usual try as hard as you can semantics, requiring
>>> GFP_NORETRY or similar to relax it?
>>
>> If we do that, do we keep the possibility of partial success, i.e. return
>> how many were allocated? Seems wasteful to succeed N-1 and then throw all
>> away, if the caller can use a fallback only for the last one.
>> Do some callers need all-or-nothing semantics? Should a flag indicate which
>> one to use?
> 
> A lot of callers (but not all) need all or nothing semantics.  But
> freeing already allocated pages is not a major problem - the caller
> just has to add a release_pages call if it didn't already have one
> for cleaning up later failures.

What the svc/nfsd thread is trying to avoid is sleeping uninterruptibly
waiting for memory resources. That stalls server shutdown, among other
things.

Guaranteeing forward progress is the bottom line.


>>> There is one single user (svc_fill_pages in sunrpc) that relies on it.
>>> For everyone else it creates extra burden and is very error prone
>>> (speaking from experience).
>>
>> Sounds good to me. Will sunrpc be easy to convert, or should it be another
>> flag to opt-in to the current behavior, that it would use?
> 
> I've added Chuck to the Cc list, but from memory sunrpc actually does
> make use of this feature and he objected to previous attempts to
> change it.  So a first step would be to have a lower-level helper
> that works as-is and a wrapper that zeroes the array, even if that
> doesn't feel as efficient as it could be.

If sunrpc is the only user, it might be sensible to hoist the "zero
fill" capability into sunrpc.ko.

The impact of walking the whole array, for this check, is measurable,
and I've got a patch in 7.1 to mitigate that:

commit d7f3efd9ff474867b04e1ea784690f02450a245b
(refs/patches/nfsd-fixes/sunrpc-optimize-rq_respages-allocation-in-svc_alloc_arg)
Author:     Chuck Lever <chuck.lever@oracle.com>
AuthorDate: Thu Feb 26 09:47:39 2026 -0500
Commit:     Chuck Lever <chuck.lever@oracle.com>
CommitDate: Sun Mar 29 21:25:09 2026 -0400

    SUNRPC: Optimize rq_respages allocation in svc_alloc_arg

    svc_alloc_arg() invokes alloc_pages_bulk() with the full rq_maxpages
    count (~259 for 1MB messages) for the rq_respages array, causing a
    full-array scan despite most slots holding valid pages.

    svc_rqst_release_pages() NULLs only the range

      [rq_respages, rq_next_page)

    after each RPC, so only that range contains NULL entries. Limit the
    rq_respages fill in svc_alloc_arg() to that range instead of
    scanning the full array.

    svc_init_buffer() initializes rq_next_page to span the entire
    rq_respages array, so the first svc_alloc_arg() call fills all
    slots.

    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>


-- 
Chuck Lever



* Re: revisiting alloc_pages_bulks semantics?
  2026-05-27 13:58     ` Chuck Lever
@ 2026-05-28  9:00       ` Christoph Hellwig
  2026-05-28 13:16         ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2026-05-28  9:00 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Christoph Hellwig, Vlastimil Babka (SUSE), Andrew Morton,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle), linux-nfs,
	linux-mm, linux-kernel

On Wed, May 27, 2026 at 09:58:11AM -0400, Chuck Lever wrote:
> On 5/27/26 8:19 AM, Christoph Hellwig wrote:
> > On Wed, May 27, 2026 at 12:06:08PM +0200, Vlastimil Babka (SUSE) wrote:
> >>> alloc_pages_bulks can do partial allocations for some reasons, and
> >>> users usually have a fallback by either looping and calling it again
> >>> or falling back to single page allocations.  This sucks!  Why can't
> >>> we get our usual try as hard as you can semantics, requiring
> >>> GFP_NORETRY or similar to relax it?
> >>
> >> If we do that, do we keep the possibility of partial success, i.e. return
> >> how many were allocated? Seems wasteful to succeed N-1 and then throw all
> >> away, if the caller can use a fallback only for the last one.
> >> Do some callers need all-or-nothing semantics? Should a flag indicate which
> >> one to use?
> > 
> > A lot of callers (but not all) need all or nothing semantics.  But
> > freeing already allocated pages is not a major problem - the caller
> > just has to add a release_pages call if it didn't already have one
> > for cleaning up later failures.
> 
> What the svc/nfsd thread is trying to avoid is sleeping uninterruptibly
> waiting for memory resources. That stalls server shutdown, among other
> things.

I'm not fully understanding the sentence.  I guess you mean that
you want svc_thread_should_stop to intercept some memory allocation
waits?

> > I've added Chuck to the Cc list, but from memory sunrpc actually does
> > make use of this feature and he objected to previous attempts to
> > change it.  So a first step would be to have a lower-level helper
> > that works as-is and a wrapper that zeroes the array, even if that
> > doesn't feel as efficient as it could be.
> 
> If sunrpc is the only user, it might be sensible to hoist the "zero
> fill" capability into sunrpc.ko.

As far as I can tell it is the only one.  But I don't really see
how you could implement that functionality outside the core, except
by falling back to single allocations, or looking for empty slots.

I'm curious what you think about willy's comment, or if there is
indeed a way to always use the pages from the beginning or end in
sunrpc.

>     svc_rqst_release_pages() NULLs only the range
> 
>       [rq_respages, rq_next_page)
> 
>     after each RPC, so only that range contains NULL entries. Limit the
>     rq_respages fill in svc_alloc_arg() to that range instead of
>     scanning the full array.

Does it NULL the entire range, or part of it?  Because if it is the
entire above range, you don't really need the check for NULL behavior
at all but just point the bulk allocation to this range.




* Re: revisiting alloc_pages_bulks semantics?
  2026-05-28  9:00       ` Christoph Hellwig
@ 2026-05-28 13:16         ` Chuck Lever
  0 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2026-05-28 13:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vlastimil Babka (SUSE), Andrew Morton, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Matthew Wilcox (Oracle), linux-nfs, linux-mm, linux-kernel

On 5/28/26 5:00 AM, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 09:58:11AM -0400, Chuck Lever wrote:
>> On 5/27/26 8:19 AM, Christoph Hellwig wrote:
>>> On Wed, May 27, 2026 at 12:06:08PM +0200, Vlastimil Babka (SUSE) wrote:
>>>>> alloc_pages_bulks can do partial allocations for some reasons, and
>>>>> users usually have a fallback by either looping and calling it again
>>>>> or falling back to single page allocations.  This sucks!  Why can't
>>>>> we get our usual try as hard as you can semantics, requiring
>>>>> GFP_NORETRY or similar to relax it?
>>>>
>>>> If we do that, do we keep the possibility of partial success, i.e. return
>>>> how many were allocated? Seems wasteful to succeed N-1 and then throw all
>>>> away, if the caller can use a fallback only for the last one.
>>>> Do some callers need all-or-nothing semantics? Should a flag indicate which
>>>> one to use?
>>>
>>> A lot of callers (but not all) need all or nothing semantics.  But
>>> freeing already allocated pages is not a major problem - the caller
>>> just has to add a release_pages call if it didn't already have one
>>> for cleaning up later failures.
>>
>> What the svc/nfsd thread is trying to avoid is sleeping uninterruptibly
>> waiting for memory resources. That stalls server shutdown, among other
>> things.
> 
> I'm not fully understanding the sentence.  I guess you mean that
> you want svc_thread_should_stop to intercept some memory allocation
> waits?

That's the gist of it.


> I'm curious what you think about willy's comment, or if there is
> indeed a way to always use the pages from the beginning or end in
> sunrpc.

It is a ready supply of fresh pages, but it's not used as a simple
queue. Many places in sunrpc point to rq_pages and use it as an array.

struct xdr_buf uses this array as the middle payload section of an RPC
message:

struct xdr_buf {
        struct kvec     head[1],        /* RPC header + non-page data */
                        tail[1];        /* Appended after page data */

        struct bio_vec  *bvec;
        struct page **  pages;          /* Array of pages */
        unsigned int    page_base,      /* Start of page data */
                        page_len,       /* Length of page data */
                        flags;          /* Flags for data disposition */

        unsigned int    buflen,         /* Total length of storage buffer */
                        len;            /* Length of XDR encoded message */
};

There's no actual array in struct xdr_buf: the "pages" field points to
the rq_pages array that contains the RPC message being processed.

Today, pages are consumed from the beginning of rq_pages.


>>     svc_rqst_release_pages() NULLs only the range
>>
>>       [rq_respages, rq_next_page)
>>
>>     after each RPC, so only that range contains NULL entries. Limit the
>>     rq_respages fill in svc_alloc_arg() to that range instead of
>>     scanning the full array.
> 
> Does it NULL the entire range, or part of it?  Because if it is the
> entire above range, you don't really need the check for NULL behavior
> at all but just point the bulk allocation to this range.

It's two-phase.

Phase A prepares the thread (svc_rqst) for the next RPC, and it's done
before an RPC is ready to be processed. alloc_pages_bulk() refills the
NULL entries.

Phase B is done while the client is waiting for the server to send an
RPC reply, so it has to be low latency. This is where the pages are
"removed" from the array and the array pointers are set to NULL.
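
The two phases can be pictured with a toy userspace model. This is a sketch only: struct rq_model and its `resp` and `next` indexes are hypothetical stand-ins for the rq_respages and rq_next_page pointers, and free() stands in for handing a page off.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RQ_MAX 8

/* Toy request: a page array plus the [resp, next) range used by one reply. */
struct rq_model {
	void *pages[RQ_MAX];
	size_t resp;
	size_t next;
};

/* Phase B, latency sensitive: give up the pages in [resp, next). */
static void rq_release(struct rq_model *rq)
{
	assert(rq->resp <= rq->next && rq->next <= RQ_MAX);
	for (size_t i = rq->resp; i < rq->next; i++) {
		free(rq->pages[i]);	/* the model frees; real code passes the page on */
		rq->pages[i] = NULL;
	}
}

/*
 * Phase A, before the next RPC: refill, scanning only the range that can
 * contain NULL entries.  Returns the number of newly allocated pages.
 */
static size_t rq_refill(struct rq_model *rq)
{
	size_t filled = 0;

	assert(rq->resp <= rq->next && rq->next <= RQ_MAX);
	for (size_t i = rq->resp; i < rq->next; i++) {
		if (rq->pages[i])
			continue;
		rq->pages[i] = malloc(4096);
		if (!rq->pages[i])
			break;
		filled++;
	}
	return filled;
}
```

In this model the initial state spans the whole array, so the first refill fills every slot, and later refills touch only what the previous reply consumed.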


Note that as struct xdr_buf is transitioned from "struct page **" to a
bvec array, we indeed have good low-risk opportunities to restructure
this code and how it uses rq_pages.


-- 
Chuck Lever

