From: Chuck Lever <cel@kernel.org>
To: NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Cc: <linux-nfs@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
Chuck Lever <chuck.lever@oracle.com>
Subject: [RFC PATCH 00/15] svcrdma performance scalability enhancements
Date: Tue, 10 Feb 2026 11:32:07 -0500
Message-ID: <20260210163222.2356793-1-cel@kernel.org>
From: Chuck Lever <chuck.lever@oracle.com>
When considering NFS I/O throughput and latency, the RPC transport is
typically not the primary bottleneck: the CPU cost of the RPC/RDMA
transport is insignificant compared with other resource utilization
on the server.
From a scalability standpoint, however, the incremental per-connection
costs -- memory footprint, interrupt rate, and doorbell rate -- do
add up.
The following series lowers the per-connection resource utilization
of svcrdma in several areas. The main benefits are lower lock
contention, lower interrupt and doorbell rate per RPC, and less CPU
cache theft.
Profiling an 8KB NFSv3 read/write workload over RDMA identifies
where overhead accumulates as connection count grows. Roughly 4% of
total CPU time goes to contention on the svcrdma_wq unbound
workqueue pool lock, driven by cascading work item re-queues through
the Send completion path. Receive completion processing acquires a
per-transport spinlock on every inbound message. Doorbell and
completion event counts scale with Write chunk usage.
svc_alloc_arg() scans ~259 rq_pages slots per receive even when only
a few pages need replacement.
Three strategies recur throughout this series.
Lock-free lists replace spinlock-protected queues on the hottest
paths. The receive completion queue, Read completion queue, and send
context release path all convert to llist, eliminating producer-side
locking. The global svcrdma_wq workqueue -- the single largest
contention source -- is replaced by per-transport kthreads that
drain completed send contexts from an llist in batches. The
intermediate re-queue for write chunk resource release is thus
removed as well. Those resources are now freed inline during send
context teardown.
Work Request chaining reduces per-RPC doorbell and completion rates.
Write chunk RDMA Write WRs are linked onto the Reply Send WR chain,
so a single ib_post_send() covers both operations with one
completion event. Receive Queue posting switches from small fixed
batches to watermark-triggered bulk replenishment, provisioned at
twice the negotiated credit limit. Ticket-based fair queuing for
Send Queue slot allocation prevents starvation when the SQ fills
under concurrent use.
Per-object caching and cache line separation reduce allocation cost
and cross-CPU invalidation traffic. Each recv_ctxt includes a
single-entry svc_rdma_chunk cache, covering the >99% common case
without kmalloc. Cache line annotations on struct svcxprt_rdma place
the Send context cache, R/W context cache, and SQ availability
counter in separate cache lines.
Finally, XPT_DATA handling upon sc_read_complete_q consumption is
corrected to clear the flag and then recompute it. Trace data from a
256KB write workload shows ~14 transport enqueue attempts per RPC; in
62% of cases no work is pending. Clearing the flag on consumption
eliminates the majority of these spurious dispatches.
Base commit: v6.19
URL: https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=svcrdma-next
---
Chuck Lever (15):
svcrdma: Add fair queuing for Send Queue access
svcrdma: Clean up use of rdma->sc_pd->device in Receive paths
svcrdma: Clean up use of rdma->sc_pd->device
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Factor out WR chain linking into helper
svcrdma: Reduce false sharing in struct svcxprt_rdma
svcrdma: Use lock-free list for Receive Queue tracking
svcrdma: Convert Read completion queue to use lock-free list
svcrdma: Release write chunk resources without re-queuing
svcrdma: Use per-transport kthread for send context release
svcrdma: Use watermark-based Receive Queue replenishment
svcrdma: Add per-recv_ctxt chunk context cache
svcrdma: clear XPT_DATA on sc_read_complete_q consumption
svcrdma: retry when receive queues drain transiently
svcrdma: clear XPT_DATA on sc_rq_dto_q consumption
include/linux/sunrpc/svc_rdma.h | 80 ++++++---
include/linux/sunrpc/svc_rdma_pcl.h | 12 +-
net/sunrpc/xprtrdma/svc_rdma.c | 18 +-
net/sunrpc/xprtrdma/svc_rdma_pcl.c | 55 +++++-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 158 +++++++++++------
net/sunrpc/xprtrdma/svc_rdma_rw.c | 169 ++++++++++--------
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 209 ++++++++++++++++-------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 28 +--
8 files changed, 488 insertions(+), 241 deletions(-)
--
2.52.0