From: Jonathan Flynn <jonathan.flynn@hammerspace.com>
To: Chuck Lever <cel@kernel.org>, Mike Snitzer <snitzer@kernel.org>
Cc: linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org,
Chuck Lever <chuck.lever@oracle.com>
Subject: RE: [PATCH] svcrdma: Cap Read sink allocations at PAGE_ALLOC_COSTLY_ORDER
Date: Sat, 6 Jun 2026 11:35:51 -0600
Message-ID: <65a2cdb132b0c28e69a29955e3bd37e7@mail.gmail.com>
In-Reply-To: <20260606035722.83175-1-cel@kernel.org>
I tested the PAGE_ALLOC_COSTLY_ORDER change on the same setup.
Unfortunately, it did not fix the regression. Throughput dropped to
25.4 GiB/s, lower than both the original regressed build and the
previous GFP_NOWAIT test.
Current results are:
Original regressed build: ~30.3 GiB/s
GFP_NOWAIT build: ~31.0 GiB/s
PAGE_ALLOC_COSTLY_ORDER: 25.4 GiB/s
Commit reverted: ~73.9 GiB/s
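To put these figures on a common scale, here is a minimal sketch that expresses each build as a percentage of the reverted baseline. The struct and the helper name pct_of_baseline() are hypothetical and exist only for this illustration; the numbers are the ones listed above.

```c
#include <assert.h>

/* Throughput figures (GiB/s) copied from the results listed above. */
struct build_result {
	const char *name;
	double gib_per_sec;
};

static const struct build_result results[] = {
	{ "Original regressed build", 30.3 },
	{ "GFP_NOWAIT build",         31.0 },
	{ "PAGE_ALLOC_COSTLY_ORDER",  25.4 },
	{ "Commit reverted",          73.9 },
};

/* Hypothetical helper: throughput as a percentage of a baseline. */
static double pct_of_baseline(double gib_per_sec, double baseline)
{
	assert(baseline > 0.0);
	return 100.0 * gib_per_sec / baseline;
}
```

Calling pct_of_baseline(results[i].gib_per_sec, results[3].gib_per_sec) for the first three entries gives roughly 41%, 42%, and 34% of the reverted throughput.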
I added the results, including the flamegraph, to the shared bundle.
The flamegraphs for the GFP_NOWAIT build and the original regressed
build are nearly identical. The dominant stack is:
svc_recv()
-> svc_rdma_build_read_segment_contig()
-> alloc_pages_noprof()
-> get_page_from_freelist()
-> rmqueue_buddy()
The PAGE_ALLOC_COSTLY_ORDER flamegraph is different. Time spent under
alloc_pages_noprof() is reduced, but the reduction does not translate into
improved throughput.
The following percentages were observed:
                            Original   GFP_NOWAIT   COSTLY_ORDER
svc_recv()                  76.09%     75.99%       78.44%
alloc_pages_noprof()        58.07%     57.99%       40.29%
folios_put_refs()            7.15%      7.19%       16.06%
svc_rdma_read_complete()     7.18%      7.21%       16.08%
In other words, the PAGE_ALLOC_COSTLY_ORDER change reduces time spent in
the allocation path, but a larger fraction of CPU time then appears under
svc_rdma_read_complete() and folios_put_refs(), while overall throughput
decreases further.
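As a side calculation, the sketch below shows how the order cap changes the number of contiguous chunks needed per payload. It assumes 4 KiB base pages, and the 1 MiB payload in the usage note is hypothetical; neither value is stated in this thread. The names chunk_bytes() and chunks_needed() are made up for the illustration and are not svcrdma functions.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages; the thread does not state the page size. */
#define SKETCH_PAGE_SIZE 4096UL

/* Bytes covered by one contiguous allocation of the given order. */
static unsigned long chunk_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/* Contiguous chunks needed to cover a payload, rounding up. */
static unsigned long chunks_needed(unsigned long payload_bytes,
				   unsigned int order)
{
	unsigned long chunk = chunk_bytes(order);

	assert(payload_bytes > 0);
	return (payload_bytes + chunk - 1) / chunk;
}
```

Under these assumptions, chunks_needed(1048576, 4) is 16 and chunks_needed(1048576, 3) is 32, so the same payload is covered by twice as many allocations to set up and later release. Whether that accounts for the shift toward svc_rdma_read_complete() and folios_put_refs() is not established by these profiles.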
-Jon
> -----Original Message-----
> From: Chuck Lever <cel@kernel.org>
> Sent: Friday, June 5, 2026 9:57 PM
> To: Mike Snitzer <snitzer@kernel.org>
> Cc: linux-nfs@vger.kernel.org; linux-rdma@vger.kernel.org; Chuck Lever
> <chuck.lever@oracle.com>; Jonathan Flynn
> <jonathan.flynn@hammerspace.com>
> Subject: [PATCH] svcrdma: Cap Read sink allocations at
> PAGE_ALLOC_COSTLY_ORDER
>
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Jonathan Flynn reports that commit 18755b8c2f24 ("svcrdma: Use contiguous
> pages for RDMA Read sink buffers") regresses NFS/RDMA WRITE throughput
> from 73.9 GiB/s to 30.3 GiB/s on a 128-core single-NUMA-node server
> driving dual 400Gb/s links with 640 nfsd threads. In the regressed
> configuration, server CPU utilization rises from 8.5% to 76%, and 73% of
> all server CPU cycles are spent in native_queued_spin_lock_slowpath.
>
> The contended lock is zone->lock. The page allocator serves allocations
> only up to PAGE_ALLOC_COSTLY_ORDER (3) from its per-CPU page lists;
> SVC_RDMA_CONTIG_MAX_ORDER is 4, so every contiguous sink buffer
> allocation falls through to rmqueue_buddy() and acquires the zone lock.
> The workload above issues roughly half a million order-4 allocations per
> second, all serialized on the single zone lock of the one NUMA node.
> Replacing the GFP mask with GFP_NOWAIT did not change the profile because
> direct reclaim never ran: the cycles are spent acquiring the lock, not
> reclaiming memory.
>
> Cap the allocation order at PAGE_ALLOC_COSTLY_ORDER so contiguous sink
> buffer allocations remain eligible for the per-CPU page lists, where zone
> lock acquisition is amortized across pcp batch refills. An order-3 chunk
> still replaces eight per-page bvecs with one.
>
> Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
> Fixes: 18755b8c2f24 ("svcrdma: Use contiguous pages for RDMA Read sink buffers")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> index efde26cac961..4546e594f2d7 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
> @@ -746,11 +746,12 @@ int svc_rdma_prepare_reply_chunk(struct
> svcxprt_rdma *rdma, }
>
> /*
> - * Cap contiguous RDMA Read sink allocations at order-4. Higher orders risk
> - * allocation failure under GFP_NOWAIT, which would negate the benefit of
> - * the contiguous fast path.
> + * Cap contiguous RDMA Read sink allocations at PAGE_ALLOC_COSTLY_ORDER.
> + * The page allocator serves allocations at or below that order from
> + * its per-CPU page lists; above it, every allocation acquires the
> + * zone lock, which serializes all nfsd threads.
> */
> -#define SVC_RDMA_CONTIG_MAX_ORDER 4
> +#define SVC_RDMA_CONTIG_MAX_ORDER PAGE_ALLOC_COSTLY_ORDER
>
> /**
> * svc_rdma_alloc_read_pages - Allocate physically contiguous pages
> --
> 2.54.0