[PATCH] svcrdma: Cap Read sink allocations at PAGE_ALLOC_COSTLY_ORDER

* [PATCH] svcrdma: Cap Read sink allocations at PAGE_ALLOC_COSTLY_ORDER
@ 2026-06-06  3:57 Chuck Lever
  2026-06-06 17:35 ` Jonathan Flynn
  0 siblings, 1 reply; 4+ messages in thread
From: Chuck Lever @ 2026-06-06  3:57 UTC
  To: Mike Snitzer; +Cc: linux-nfs, linux-rdma, Chuck Lever, Jonathan Flynn

From: Chuck Lever <chuck.lever@oracle.com>

Jonathan Flynn reports that commit 18755b8c2f24 ("svcrdma: Use
contiguous pages for RDMA Read sink buffers") regresses NFS/RDMA
WRITE throughput from 73.9 GiB/s to 30.3 GiB/s on a 128-core
single-NUMA-node server driving dual 400Gb/s links with 640 nfsd
threads. In the regressed configuration, server CPU utilization
rises from 8.5% to 76%, and 73% of all server CPU cycles are spent
in native_queued_spin_lock_slowpath.

The contended lock is zone->lock. The page allocator serves
allocations only up to PAGE_ALLOC_COSTLY_ORDER (3) from its per-CPU
page lists; SVC_RDMA_CONTIG_MAX_ORDER is 4, so every contiguous
sink buffer allocation falls through to rmqueue_buddy() and
acquires the zone lock. The workload above issues roughly half a
million order-4 allocations per second, all serialized on the
single zone lock of the one NUMA node. Replacing the GFP mask with
GFP_NOWAIT did not change the profile because direct reclaim never
ran: the cycles are spent acquiring the lock, not reclaiming
memory.

Cap the allocation order at PAGE_ALLOC_COSTLY_ORDER so contiguous
sink buffer allocations remain eligible for the per-CPU page
lists, where zone lock acquisition is amortized across pcp batch
refills. An order-3 chunk still replaces eight per-page bvecs with
one.

Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Fixes: 18755b8c2f24 ("svcrdma: Use contiguous pages for RDMA Read sink buffers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/svc_rdma_rw.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index efde26cac961..4546e594f2d7 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -746,11 +746,12 @@ int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
 }

 /*
- * Cap contiguous RDMA Read sink allocations at order-4. Higher orders risk
- * allocation failure under GFP_NOWAIT, which would negate the benefit of
- * the contiguous fast path.
+ * Cap contiguous RDMA Read sink allocations at PAGE_ALLOC_COSTLY_ORDER.
+ * The page allocator serves allocations at or below that order from
+ * its per-CPU page lists; above it, every allocation acquires the
+ * zone lock, which serializes all nfsd threads.
  */
-#define SVC_RDMA_CONTIG_MAX_ORDER	4
+#define SVC_RDMA_CONTIG_MAX_ORDER	PAGE_ALLOC_COSTLY_ORDER

 /**
  * svc_rdma_alloc_read_pages - Allocate physically contiguous pages
-- 
2.54.0
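
As a rough illustration of the reasoning in the patch description, the following user-space C sketch works through the order arithmetic. It assumes 4 KiB pages and that every byte of WRITE payload lands in a full-sized contiguous chunk; neither assumption is stated in the patch. The helper names are hypothetical, and pcp_eligible() is a simplification that ignores any other special cases in the kernel's own check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: 4 KiB pages. PAGE_ALLOC_COSTLY_ORDER is 3 per the patch text. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_COSTLY_ORDER	3

/* Number of base pages in one allocation of the given order. */
static inline uint64_t pages_per_chunk(unsigned int order)
{
	return UINT64_C(1) << order;
}

/* Bytes covered by one allocation of the given order. */
static inline uint64_t chunk_bytes(unsigned int order)
{
	return pages_per_chunk(order) << SKETCH_PAGE_SHIFT;
}

/* Simplified: only orders up to the costly order use the per-CPU lists. */
static inline bool pcp_eligible(unsigned int order)
{
	return order <= SKETCH_COSTLY_ORDER;
}

/* Estimated chunk allocations per second at a given payload rate. */
static inline double allocs_per_sec(double gib_per_sec, unsigned int order)
{
	double bytes_per_sec = gib_per_sec * 1024.0 * 1024.0 * 1024.0;

	return bytes_per_sec / (double)chunk_bytes(order);
}
```

Under these assumptions, allocs_per_sec(30.3, 4) comes to about 496,000, which is consistent with the "roughly half a million order-4 allocations per second" figure in the description. An order-3 chunk covers eight pages (32 KiB) and passes pcp_eligible(), while an order-4 chunk (64 KiB) does not.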


* RE: [PATCH] svcrdma: Cap Read sink allocations at PAGE_ALLOC_COSTLY_ORDER
  2026-06-06  3:57 [PATCH] svcrdma: Cap Read sink allocations at PAGE_ALLOC_COSTLY_ORDER Chuck Lever
@ 2026-06-06 17:35 ` Jonathan Flynn
  2026-06-07  3:17   ` Chuck Lever
  0 siblings, 1 reply; 4+ messages in thread
From: Jonathan Flynn @ 2026-06-06 17:35 UTC
  To: Chuck Lever, Mike Snitzer; +Cc: linux-nfs, linux-rdma, Chuck Lever

I tested the PAGE_ALLOC_COSTLY_ORDER change on the same setup.
Unfortunately, it did not improve the regression. Throughput was slightly
worse than the previous GFP_NOWAIT test, measuring 25.4 GiB/s.

Current results are:
Original regressed build: ~30.3 GiB/s
GFP_NOWAIT build: ~31.0 GiB/s
PAGE_ALLOC_COSTLY_ORDER: 25.4 GiB/s
Commit reverted: ~73.9 GiB/s
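
To put these figures on a common scale, the sketch below expresses each build's throughput as a percentage of the reverted baseline. It is only arithmetic on the approximate numbers listed above; the struct and function names are hypothetical and not part of any test tooling mentioned in this thread.

```c
#include <stddef.h>
#include <stdio.h>

struct build_result {
	const char *name;
	double gib_per_sec;	/* approximate, as reported above */
};

static const struct build_result results[] = {
	{ "Original regressed build", 30.3 },
	{ "GFP_NOWAIT build",         31.0 },
	{ "PAGE_ALLOC_COSTLY_ORDER",  25.4 },
	{ "Commit reverted",          73.9 },
};

/* Baseline: throughput with commit 18755b8c2f24 reverted. */
#define BASELINE_GIB_PER_SEC	73.9

/* Throughput as a percentage of the reverted baseline. */
static inline double percent_of_baseline(double gib_per_sec)
{
	return 100.0 * gib_per_sec / BASELINE_GIB_PER_SEC;
}

/* Print one line per build: name, GiB/s, percent of baseline. */
static inline void print_relative_throughput(FILE *out)
{
	size_t i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		fprintf(out, "%-26s %5.1f GiB/s  %5.1f%%\n",
			results[i].name, results[i].gib_per_sec,
			percent_of_baseline(results[i].gib_per_sec));
}
```

Calling print_relative_throughput(stdout) shows the original and GFP_NOWAIT builds at roughly 41-42% of the baseline and the PAGE_ALLOC_COSTLY_ORDER build at roughly 34%.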

I added the results to the shared bundle. (including flamegraph)

The GFP_NOWAIT and original-commit flamegraphs are nearly identical.
The dominant stack is:
svc_recv()
-> svc_rdma_build_read_segment_contig()
-> alloc_pages_noprof()
-> get_page_from_freelist()
-> rmqueue_buddy()

The PAGE_ALLOC_COSTLY_ORDER flamegraph is different. Time spent under
alloc_pages_noprof() is reduced, but the reduction does not translate into
improved throughput.

The following percentages were observed:

                            Original  GFP_NOWAIT  COSTLY_ORDER
svc_recv()                    76.09%      75.99%        78.44%
alloc_pages_noprof()          58.07%      57.99%        40.29%
folios_put_refs()              7.15%       7.19%        16.06%
svc_rdma_read_complete()       7.18%       7.21%        16.08%

In other words, the PAGE_ALLOC_COSTLY_ORDER change reduces time spent in
the allocation path, but a larger fraction of CPU time then appears under
svc_rdma_read_complete() and folios_put_refs(), while overall throughput
decreases further.

-Jon



* Re: [PATCH] svcrdma: Cap Read sink allocations at PAGE_ALLOC_COSTLY_ORDER
  2026-06-06 17:35 ` Jonathan Flynn
@ 2026-06-07  3:17   ` Chuck Lever
  2026-06-07 18:11     ` Jonathan Flynn
  0 siblings, 1 reply; 4+ messages in thread
From: Chuck Lever @ 2026-06-07  3:17 UTC
  To: Jonathan Flynn, Mike Snitzer; +Cc: linux-nfs, linux-rdma, Chuck Lever



On Sat, Jun 6, 2026, at 1:35 PM, Jonathan Flynn wrote:
> I tested the PAGE_ALLOC_COSTLY_ORDER change on the same setup.
> Unfortunately, it did not improve the regression. Throughput was slightly
> worse than the previous GFP_NOWAIT test, measuring 25.4 GiB/s.
>
> Current results are:
> Original regressed build: ~30.3 GiB/s
> GFP_NOWAIT build: ~31.0 GiB/s
> PAGE_ALLOC_COSTLY_ORDER: 25.4 GiB/s
> Commit reverted: ~73.9 GiB/s
>
> I added the results to the shared bundle. (including flamegraph)
>
> The GFP_NOWAIT and original-commit flamegraphs are nearly identical.
> The dominant stack is:
> svc_recv()
> -> svc_rdma_build_read_segment_contig()
> -> alloc_pages_noprof()
> -> get_page_from_freelist()
> -> rmqueue_buddy()
>
> The PAGE_ALLOC_COSTLY_ORDER flamegraph is different. Time spent under
> alloc_pages_noprof() is reduced, but the reduction does not translate into
> improved throughput.
>
> The following percentages were observed:
>
>                             Original  GFP_NOWAIT  COSTLY_ORDER
> svc_recv()                    76.09%      75.99%        78.44%
> alloc_pages_noprof()          58.07%      57.99%        40.29%
> folios_put_refs()              7.15%       7.19%        16.06%
> svc_rdma_read_complete()       7.18%       7.21%        16.08%
>
> In other words, the PAGE_ALLOC_COSTLY_ORDER change reduces time spent in
> the allocation path, but a larger fraction of CPU time then appears under
> svc_rdma_read_complete() and folios_put_refs(), while overall throughput
> decreases further.

The two failed fixes demonstrate that the current folio allocator is
not up to the task -- the problem appears to be on the release side,
where the individual pages have to be merged back into an order-4
compound page. I don't yet see a straightforward way to make it work.
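
For readers unfamiliar with the release-side merging mentioned above, here is a simplified user-space sketch of the index arithmetic a buddy allocator uses when coalescing freed blocks. It is not the kernel's code path and leaves out locking, free lists, and the check that the buddy is actually free. The function names are hypothetical, and this thread does not establish that this step is the bottleneck.

```c
#include <stdint.h>

/* PFN of the buddy that pairs with the block starting at pfn. */
static inline uint64_t buddy_pfn(uint64_t pfn, unsigned int order)
{
	return pfn ^ (UINT64_C(1) << order);
}

/* Starting PFN of the order+1 block formed by a block and its buddy. */
static inline uint64_t merged_pfn(uint64_t pfn, unsigned int order)
{
	return pfn & buddy_pfn(pfn, order);
}

/*
 * Starting PFN after merging a block upward from from_order to
 * to_order, assuming every buddy along the way is free. Each loop
 * iteration corresponds to one merge step.
 */
static inline uint64_t coalesce_start(uint64_t pfn, unsigned int from_order,
				      unsigned int to_order)
{
	unsigned int order;

	for (order = from_order; order < to_order; order++)
		pfn = merged_pfn(pfn, order);
	return pfn;
}
```

For example, coalesce_start(13, 0, 4) takes four merge steps (12, 12, 8, 0) and ends at PFN 0, the start of the order-4 block that contains page 13.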

Since we're right up against v7.1-rc7, I've added a patch to nfsd-next
to revert 18755b8c2f24 -- it will get pulled back into 7.1.y as soon as
the v7.2 merge window closes in three weeks.


-- 
Chuck Lever


* RE: [PATCH] svcrdma: Cap Read sink allocations at PAGE_ALLOC_COSTLY_ORDER
  2026-06-07  3:17   ` Chuck Lever
@ 2026-06-07 18:11     ` Jonathan Flynn
  0 siblings, 0 replies; 4+ messages in thread
From: Jonathan Flynn @ 2026-06-07 18:11 UTC
  To: Chuck Lever, Mike Snitzer; +Cc: linux-nfs, linux-rdma, Chuck Lever

This sounds good.

Thank you for taking the time to investigate it and for working through
the test results with us.

If you continue exploring this area in the future and still see promise in
the contiguous allocation approach, I'd be happy to help test additional
changes as time permits.

-Jon
