From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>,
Jonathan Flynn <jonathan.flynn@hammerspace.com>
Cc: Leon Romanovsky <leon@kernel.org>, Christoph Hellwig <hch@lst.de>,
NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>,
linux-nfs@vger.kernel.org, linux-rdma@vger.kernel.org,
Chuck Lever <chuck.lever@oracle.com>
Subject: Re: [PATCH v4] svcrdma: Use contiguous pages for RDMA Read sink buffers
Date: Thu, 4 Jun 2026 16:51:10 -0400
Message-ID: <aiHlPmeZq3WgMwoJ@kernel.org>
In-Reply-To: <20260319133610.2556826-1-cel@kernel.org>
On Thu, Mar 19, 2026 at 09:36:10AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> svc_rdma_build_read_segment() constructs RDMA Read sink
> buffers by consuming pages one-at-a-time from rq_pages[]
> and building one bvec per page. A 64KB NFS READ payload
> produces 16 separate bvecs, 16 DMA mappings, and
> potentially multiple RDMA Read WRs (on platforms with
> 4KB pages).
>
> A single higher-order allocation followed by split_page()
> yields physically contiguous memory while preserving
> per-page refcounts. A single bvec spanning the contiguous
> range causes rdma_rw_ctx_init_bvec() to take the
> rdma_rw_init_single_wr_bvec() fast path: one DMA mapping,
> one SGE, one WR.
>
> The split sub-pages replace the original rq_pages[] entries,
> so all downstream page tracking, completion handling, and
> xdr_buf assembly remain unchanged.
>
> Allocation uses __GFP_NORETRY | __GFP_NOWARN and falls back
> through decreasing orders. If even order-1 fails, the
> existing per-page path handles the segment.
>
> When nr_pages is not a power of two, get_order() rounds up
> and the allocation yields more pages than needed. The extra
> split pages replace existing rq_pages[] entries (freed via
> put_page() first), so there is no net increase in per-
> request page consumption. Successive segments reuse the
> same padding slots, preventing accumulation. The
> rq_maxpages guard rejects any allocation that would
> overrun the array, falling back to the per-page path.
> Under memory pressure, __GFP_NORETRY causes the higher-
> order allocation to fail without stalling.
>
> The contiguous path is attempted when the segment starts
> page-aligned (rc_pageoff == 0) and spans at least two
> pages. NFS WRITE segments carry application-modified byte
> ranges of arbitrary length, so the optimization is not
> restricted to power-of-two page counts.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> Changes since v3:
> - Drop 1/3 - 3/3, they have already been reviewed and queued
> - Incorporate hch's review comments
> - Remove the #ifdef SZ_64K -- the logic works on those systems too
>
>
> net/sunrpc/xprtrdma/svc_rdma_rw.c | 213 ++++++++++++++++++++++++++++++
> 1 file changed, 213 insertions(+)
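
For readers following along, the order selection and fallback described
in the commit message above can be sketched in userspace C. This is
only an illustration, not the code from svc_rdma_rw.c. The names
order_for_pages(), try_alloc_order() and alloc_contig() are
hypothetical, malloc() stands in for the page allocator, and the
max_ok parameter is an assumption used to simulate higher-order
allocations failing the way __GFP_NORETRY lets them fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumes 4KB pages, as in the commit message */

/* Smallest order such that (1 << order) >= nr_pages; this mirrors the
 * round-up behaviour the commit message attributes to get_order(). */
unsigned int order_for_pages(unsigned int nr_pages)
{
	unsigned int order = 0;

	assert(nr_pages > 0);
	while ((1u << order) < nr_pages)
		order++;
	return order;
}

/* Hypothetical stand-in for a higher-order allocation: any order above
 * max_ok fails immediately, simulating memory pressure without stalling. */
void *try_alloc_order(unsigned int order, unsigned int max_ok)
{
	if (order > max_ok)
		return NULL;
	return malloc((size_t)SKETCH_PAGE_SIZE << order);
}

/* Try the rounded-up order first, then fall back through decreasing
 * orders. Returns the order obtained, or -1 when even order-1 fails or
 * the segment is shorter than two pages, meaning the caller should use
 * the per-page path instead. */
int alloc_contig(unsigned int nr_pages, unsigned int max_ok, void **out)
{
	unsigned int order;

	*out = NULL;
	if (nr_pages < 2)
		return -1;

	for (order = order_for_pages(nr_pages); order >= 1; order--) {
		*out = try_alloc_order(order, max_ok);
		if (*out)
			return (int)order;
	}
	return -1;
}
```

With 4KB pages, a 64KB segment is 16 pages, so order_for_pages(16)
gives order 4 with no padding. A 5-page segment rounds up to order 3,
which is 8 pages, leaving 3 padding pages of the kind the commit
message says replace existing rq_pages[] entries. The sketch does not
model split_page(), the rq_maxpages guard, or how the remainder of a
segment is handled after a fallback to a smaller order.
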

This patch, which landed during the 7.1 merge window as commit
18755b8c2f241 ("svcrdma: Use contiguous pages for RDMA Read sink
buffers"), severely hurts RDMA performance when testing on very fast
RDMA networking on x86_64.

With this commit WRITE performance is "only" 24.2GB/s.
Without this commit WRITE performance is 60.6GB/s.

The CPU burn due to spinlock contention dominates the flamegraph that
was collected. Chuck, I'll send you the flamegraph off-list (and can
send it to anyone else who might be interested). Jon Flynn did the
testing and we can request more info from him.

We may have a window _now_ on the current Hammerspace testbed to try
an incremental fix, but so far simply reverting commit 18755b8c2f241
is as far as we have gotten.

Thanks,
Mike