* [RFC PATCH 01/15] svcrdma: Add fair queuing for Send Queue access
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 02/15] svcrdma: Clean up use of rdma->sc_pd->device in Receive paths Chuck Lever
` (13 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When the Send Queue fills, multiple threads may wait for SQ slots.
The previous implementation made no ordering guarantee among
waiters: a newly-arrived thread could repeatedly win SQ slots
while earlier waiters starved indefinitely.
Introduce a ticket-based fair queuing system. Each waiter takes a
ticket number and is served in FIFO order. This ensures forward
progress for all waiters when SQ capacity is constrained.
The implementation has two phases:
1. Fast path: attempt to reserve SQ slots without waiting
2. Slow path: take a ticket, wait for turn, then wait for slots
The ticket system adds two atomic counters to the transport:
- sc_sq_ticket_head: next ticket to issue
- sc_sq_ticket_tail: ticket currently being served
When a waiter successfully reserves slots, it advances the tail
counter and wakes the next waiter. This creates an orderly handoff
that prevents starvation while maintaining good throughput on the
fast path when contention is low.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 4 +
net/sunrpc/xprtrdma/svc_rdma_rw.c | 42 ++++-----
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 113 +++++++++++++++--------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +
4 files changed, 98 insertions(+), 63 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 57f4fd94166a..f68ac035fec6 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -84,6 +84,8 @@ struct svcxprt_rdma {
atomic_t sc_sq_avail; /* SQEs ready to be consumed */
unsigned int sc_sq_depth; /* Depth of SQ */
+ atomic_t sc_sq_ticket_head; /* Next ticket to issue */
+ atomic_t sc_sq_ticket_tail; /* Ticket currently serving */
__be32 sc_fc_credits; /* Forward credits */
u32 sc_max_requests; /* Max requests */
u32 sc_max_bc_requests;/* Backward credits */
@@ -306,6 +308,8 @@ extern void svc_rdma_send_error_msg(struct svcxprt_rdma *rdma,
struct svc_rdma_recv_ctxt *rctxt,
int status);
extern void svc_rdma_wake_send_waiters(struct svcxprt_rdma *rdma, int avail);
+extern int svc_rdma_sq_wait(struct svcxprt_rdma *rdma,
+ const struct rpc_rdma_cid *cid, int sqecount);
extern int svc_rdma_sendto(struct svc_rqst *);
extern int svc_rdma_result_payload(struct svc_rqst *rqstp, unsigned int offset,
unsigned int length);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 310de7a80be5..921dd2542d0d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -384,34 +384,24 @@ static int svc_rdma_post_chunk_ctxt(struct svcxprt_rdma *rdma,
cqe = NULL;
}
- do {
- if (atomic_sub_return(cc->cc_sqecount,
- &rdma->sc_sq_avail) > 0) {
- cc->cc_posttime = ktime_get();
- ret = ib_post_send(rdma->sc_qp, first_wr, &bad_wr);
- if (ret)
- break;
+ ret = svc_rdma_sq_wait(rdma, &cc->cc_cid, cc->cc_sqecount);
+ if (ret < 0)
+ return ret;
+
+ cc->cc_posttime = ktime_get();
+ ret = ib_post_send(rdma->sc_qp, first_wr, &bad_wr);
+ if (ret) {
+ trace_svcrdma_sq_post_err(rdma, &cc->cc_cid, ret);
+ svc_xprt_deferred_close(&rdma->sc_xprt);
+
+ /* If even one was posted, there will be a completion. */
+ if (bad_wr != first_wr)
return 0;
- }
- percpu_counter_inc(&svcrdma_stat_sq_starve);
- trace_svcrdma_sq_full(rdma, &cc->cc_cid);
- atomic_add(cc->cc_sqecount, &rdma->sc_sq_avail);
- wait_event(rdma->sc_send_wait,
- atomic_read(&rdma->sc_sq_avail) > cc->cc_sqecount);
- trace_svcrdma_sq_retry(rdma, &cc->cc_cid);
- } while (1);
-
- trace_svcrdma_sq_post_err(rdma, &cc->cc_cid, ret);
- svc_xprt_deferred_close(&rdma->sc_xprt);
-
- /* If even one was posted, there will be a completion. */
- if (bad_wr != first_wr)
- return 0;
-
- atomic_add(cc->cc_sqecount, &rdma->sc_sq_avail);
- wake_up(&rdma->sc_send_wait);
- return -ENOTCONN;
+ svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount);
+ return -ENOTCONN;
+ }
+ return 0;
}
/* Build and DMA-map an SGL that covers one kvec in an xdr_buf
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 914cd263c2f1..da0d637ba4fb 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -294,6 +294,66 @@ void svc_rdma_wake_send_waiters(struct svcxprt_rdma *rdma, int avail)
wake_up(&rdma->sc_send_wait);
}
+/**
+ * svc_rdma_sq_wait - Wait for SQ slots using fair queuing
+ * @rdma: controlling transport
+ * @cid: completion ID for tracing
+ * @sqecount: number of SQ entries needed
+ *
+ * A ticket-based system ensures fair ordering when multiple threads
+ * wait for Send Queue capacity. Each waiter takes a ticket and is
+ * served in order, preventing starvation.
+ *
+ * Return values:
+ * %0: SQ slots were reserved successfully
+ * %-ENOTCONN: The connection was lost
+ */
+int svc_rdma_sq_wait(struct svcxprt_rdma *rdma,
+ const struct rpc_rdma_cid *cid, int sqecount)
+{
+ int ticket;
+
+ /* Fast path: try to reserve SQ slots without waiting */
+ if (likely(atomic_sub_return(sqecount, &rdma->sc_sq_avail) >= 0))
+ return 0;
+ atomic_add(sqecount, &rdma->sc_sq_avail);
+
+ /* Slow path: take a ticket and wait in line */
+ ticket = atomic_fetch_inc(&rdma->sc_sq_ticket_head);
+
+ percpu_counter_inc(&svcrdma_stat_sq_starve);
+ trace_svcrdma_sq_full(rdma, cid);
+
+ /* Wait until all earlier tickets have been served */
+ wait_event(rdma->sc_send_wait,
+ test_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags) ||
+ atomic_read(&rdma->sc_sq_ticket_tail) == ticket);
+ if (test_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags))
+ goto out_close;
+
+ /* It's our turn. Wait for enough SQ slots to be available. */
+ while (atomic_sub_return(sqecount, &rdma->sc_sq_avail) < 0) {
+ atomic_add(sqecount, &rdma->sc_sq_avail);
+
+ wait_event(rdma->sc_send_wait,
+ test_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags) ||
+ atomic_read(&rdma->sc_sq_avail) >= sqecount);
+ if (test_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags))
+ goto out_close;
+ }
+
+ /* Slots reserved successfully. Let the next waiter proceed. */
+ atomic_inc(&rdma->sc_sq_ticket_tail);
+ wake_up(&rdma->sc_send_wait);
+ trace_svcrdma_sq_retry(rdma, cid);
+ return 0;
+
+out_close:
+ atomic_inc(&rdma->sc_sq_ticket_tail);
+ wake_up(&rdma->sc_send_wait);
+ return -ENOTCONN;
+}
+
/**
* svc_rdma_wc_send - Invoked by RDMA provider for each polled Send WC
* @cq: Completion Queue context
@@ -336,11 +396,6 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
* that these values remain available after the ib_post_send() call.
* In some error flow cases, svc_rdma_wc_send() releases @ctxt.
*
- * Note there is potential for starvation when the Send Queue is
- * full because there is no order to when waiting threads are
- * awoken. The transport is typically provisioned with a deep
- * enough Send Queue that SQ exhaustion should be a rare event.
- *
* Return values:
* %0: @ctxt's WR chain was posted successfully
* %-ENOTCONN: The connection was lost
@@ -362,42 +417,26 @@ int svc_rdma_post_send(struct svcxprt_rdma *rdma,
send_wr->sg_list[0].length,
DMA_TO_DEVICE);
- /* If the SQ is full, wait until an SQ entry is available */
- while (!test_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags)) {
- if (atomic_sub_return(sqecount, &rdma->sc_sq_avail) < 0) {
- svc_rdma_wake_send_waiters(rdma, sqecount);
+ ret = svc_rdma_sq_wait(rdma, &cid, sqecount);
+ if (ret < 0)
+ return ret;
- /* When the transport is torn down, assume
- * ib_drain_sq() will trigger enough Send
- * completions to wake us. The XPT_CLOSE test
- * above should then cause the while loop to
- * exit.
- */
- percpu_counter_inc(&svcrdma_stat_sq_starve);
- trace_svcrdma_sq_full(rdma, &cid);
- wait_event(rdma->sc_send_wait,
- atomic_read(&rdma->sc_sq_avail) > 0);
- trace_svcrdma_sq_retry(rdma, &cid);
- continue;
- }
+ trace_svcrdma_post_send(ctxt);
+ ret = ib_post_send(rdma->sc_qp, first_wr, &bad_wr);
+ if (ret) {
+ trace_svcrdma_sq_post_err(rdma, &cid, ret);
+ svc_xprt_deferred_close(&rdma->sc_xprt);
- trace_svcrdma_post_send(ctxt);
- ret = ib_post_send(rdma->sc_qp, first_wr, &bad_wr);
- if (ret) {
- trace_svcrdma_sq_post_err(rdma, &cid, ret);
- svc_xprt_deferred_close(&rdma->sc_xprt);
+ /* If even one WR was posted, there will be a
+ * Send completion that bumps sc_sq_avail.
+ */
+ if (bad_wr != first_wr)
+ return 0;
- /* If even one WR was posted, there will be a
- * Send completion that bumps sc_sq_avail.
- */
- if (bad_wr == first_wr) {
- svc_rdma_wake_send_waiters(rdma, sqecount);
- break;
- }
- }
- return 0;
+ svc_rdma_wake_send_waiters(rdma, sqecount);
+ return -ENOTCONN;
}
- return -ENOTCONN;
+ return 0;
}
/**
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b7b318ad25c4..d1dcffbf2fe7 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -474,6 +474,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
if (newxprt->sc_sq_depth > dev->attrs.max_qp_wr)
newxprt->sc_sq_depth = dev->attrs.max_qp_wr;
atomic_set(&newxprt->sc_sq_avail, newxprt->sc_sq_depth);
+ atomic_set(&newxprt->sc_sq_ticket_head, 0);
+ atomic_set(&newxprt->sc_sq_ticket_tail, 0);
newxprt->sc_pd = ib_alloc_pd(dev, 0);
if (IS_ERR(newxprt->sc_pd)) {
--
2.52.0
* [RFC PATCH 02/15] svcrdma: Clean up use of rdma->sc_pd->device in Receive paths
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 01/15] svcrdma: Add fair queuing for Send Queue access Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 03/15] svcrdma: Clean up use of rdma->sc_pd->device Chuck Lever
` (12 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
I can't think of a reason why svcrdma is using the PD's device. Most
other consumers of the IB DMA API use the ib_device pointer from the
connection's rdma_cm_id.
I don't believe there's any functional difference between the two,
but it is a little confusing to see some uses of rdma_cm_id->device
and some of ib_pd->device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e7e4a39ca6c6..29a71fa79e2b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -118,7 +118,7 @@ svc_rdma_next_recv_ctxt(struct list_head *list)
static struct svc_rdma_recv_ctxt *
svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
{
- int node = ibdev_to_node(rdma->sc_cm_id->device);
+ struct ib_device *device = rdma->sc_cm_id->device;
struct svc_rdma_recv_ctxt *ctxt;
unsigned long pages;
dma_addr_t addr;
@@ -126,16 +126,17 @@ svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
pages = svc_serv_maxpages(rdma->sc_xprt.xpt_server);
ctxt = kzalloc_node(struct_size(ctxt, rc_pages, pages),
- GFP_KERNEL, node);
+ GFP_KERNEL, ibdev_to_node(device));
if (!ctxt)
goto fail0;
ctxt->rc_maxpages = pages;
- buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL, node);
+ buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL,
+ ibdev_to_node(device));
if (!buffer)
goto fail1;
- addr = ib_dma_map_single(rdma->sc_pd->device, buffer,
- rdma->sc_max_req_size, DMA_FROM_DEVICE);
- if (ib_dma_mapping_error(rdma->sc_pd->device, addr))
+ addr = ib_dma_map_single(device, buffer, rdma->sc_max_req_size,
+ DMA_FROM_DEVICE);
+ if (ib_dma_mapping_error(device, addr))
goto fail2;
svc_rdma_recv_cid_init(rdma, &ctxt->rc_cid);
@@ -167,7 +168,7 @@ svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
static void svc_rdma_recv_ctxt_destroy(struct svcxprt_rdma *rdma,
struct svc_rdma_recv_ctxt *ctxt)
{
- ib_dma_unmap_single(rdma->sc_pd->device, ctxt->rc_recv_sge.addr,
+ ib_dma_unmap_single(rdma->sc_cm_id->device, ctxt->rc_recv_sge.addr,
ctxt->rc_recv_sge.length, DMA_FROM_DEVICE);
kfree(ctxt->rc_recv_buf);
kfree(ctxt);
@@ -962,7 +963,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
return 0;
percpu_counter_inc(&svcrdma_stat_recv);
- ib_dma_sync_single_for_cpu(rdma_xprt->sc_pd->device,
+ ib_dma_sync_single_for_cpu(rdma_xprt->sc_cm_id->device,
ctxt->rc_recv_sge.addr, ctxt->rc_byte_len,
DMA_FROM_DEVICE);
svc_rdma_build_arg_xdr(rqstp, ctxt);
--
2.52.0
* [RFC PATCH 03/15] svcrdma: Clean up use of rdma->sc_pd->device
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 01/15] svcrdma: Add fair queuing for Send Queue access Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 02/15] svcrdma: Clean up use of rdma->sc_pd->device in Receive paths Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 04/15] svcrdma: Add Write chunk WRs to the RPC's Send WR chain Chuck Lever
` (11 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
I can't think of a reason why svcrdma is using the PD's device. Most
other consumers of the IB DMA API use the ib_device pointer from the
connection's rdma_cm_id.
I don't think there's any functional difference between the two, but
it is a little confusing to see some uses of rdma_cm_id and some of
ib_pd.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index da0d637ba4fb..eb21544f4a61 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -116,7 +116,7 @@ static void svc_rdma_wc_send(struct ib_cq *cq, struct ib_wc *wc);
static struct svc_rdma_send_ctxt *
svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
{
- int node = ibdev_to_node(rdma->sc_cm_id->device);
+ struct ib_device *device = rdma->sc_cm_id->device;
struct svc_rdma_send_ctxt *ctxt;
unsigned long pages;
dma_addr_t addr;
@@ -124,21 +124,22 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
int i;
ctxt = kzalloc_node(struct_size(ctxt, sc_sges, rdma->sc_max_send_sges),
- GFP_KERNEL, node);
+ GFP_KERNEL, ibdev_to_node(device));
if (!ctxt)
goto fail0;
pages = svc_serv_maxpages(rdma->sc_xprt.xpt_server);
ctxt->sc_pages = kcalloc_node(pages, sizeof(struct page *),
- GFP_KERNEL, node);
+ GFP_KERNEL, ibdev_to_node(device));
if (!ctxt->sc_pages)
goto fail1;
ctxt->sc_maxpages = pages;
- buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL, node);
+ buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL,
+ ibdev_to_node(device));
if (!buffer)
goto fail2;
- addr = ib_dma_map_single(rdma->sc_pd->device, buffer,
- rdma->sc_max_req_size, DMA_TO_DEVICE);
- if (ib_dma_mapping_error(rdma->sc_pd->device, addr))
+ addr = ib_dma_map_single(device, buffer, rdma->sc_max_req_size,
+ DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(device, addr))
goto fail3;
svc_rdma_send_cid_init(rdma, &ctxt->sc_cid);
@@ -175,15 +176,14 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
*/
void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
{
+ struct ib_device *device = rdma->sc_cm_id->device;
struct svc_rdma_send_ctxt *ctxt;
struct llist_node *node;
while ((node = llist_del_first(&rdma->sc_send_ctxts)) != NULL) {
ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
- ib_dma_unmap_single(rdma->sc_pd->device,
- ctxt->sc_sges[0].addr,
- rdma->sc_max_req_size,
- DMA_TO_DEVICE);
+ ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
+ rdma->sc_max_req_size, DMA_TO_DEVICE);
kfree(ctxt->sc_xprt_buf);
kfree(ctxt->sc_pages);
kfree(ctxt);
@@ -412,7 +412,7 @@ int svc_rdma_post_send(struct svcxprt_rdma *rdma,
might_sleep();
/* Sync the transport header buffer */
- ib_dma_sync_single_for_device(rdma->sc_pd->device,
+ ib_dma_sync_single_for_device(rdma->sc_cm_id->device,
send_wr->sg_list[0].addr,
send_wr->sg_list[0].length,
DMA_TO_DEVICE);
--
2.52.0
* [RFC PATCH 04/15] svcrdma: Add Write chunk WRs to the RPC's Send WR chain
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (2 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 03/15] svcrdma: Clean up use of rdma->sc_pd->device Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 05/15] svcrdma: Factor out WR chain linking into helper Chuck Lever
` (10 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Previously, Write chunk RDMA Writes were posted via a separate
ib_post_send() call with their own completion handler. Each Write
chunk incurred a doorbell and generated a completion event.
Link Write chunk WRs onto the RPC Reply's Send WR chain so that a
single ib_post_send() call posts both the RDMA Writes and the Send
WR. A single completion event signals that all operations have
finished. This reduces both the doorbell rate and the completion
rate, and eliminates the round-trip latency between Write chunk
completion and the posting of the subsequent Send WR.
The lifecycle of Write chunk resources changes: previously, the
svc_rdma_write_done() completion handler released Write chunk
resources when RDMA Writes completed. With WR chaining, resources
remain live until the Send completion. A new sc_write_info_list
tracks Write chunk metadata attached to each Send context, and
svc_rdma_write_chunk_release() frees these resources when the
Send context is released.
The svc_rdma_write_done() handler now handles only error cases.
On success it returns immediately since the Send completion handles
resource release. On failure (WR flush), it closes the connection
to signal to the client that the RPC Reply is incomplete.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 13 +++-
net/sunrpc/xprtrdma/svc_rdma_rw.c | 94 ++++++++++++++++++++-------
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 10 ++-
3 files changed, 91 insertions(+), 26 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f68ac035fec6..d84946cf6176 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -215,6 +215,7 @@ struct svc_rdma_recv_ctxt {
*/
struct svc_rdma_write_info {
struct svcxprt_rdma *wi_rdma;
+ struct list_head wi_list;
const struct svc_rdma_chunk *wi_chunk;
@@ -243,7 +244,10 @@ struct svc_rdma_send_ctxt {
struct ib_cqe sc_cqe;
struct xdr_buf sc_hdrbuf;
struct xdr_stream sc_stream;
+
+ struct list_head sc_write_info_list;
struct svc_rdma_write_info sc_reply_info;
+
void *sc_xprt_buf;
int sc_page_count;
int sc_cur_sge_no;
@@ -276,11 +280,14 @@ extern void svc_rdma_cc_init(struct svcxprt_rdma *rdma,
extern void svc_rdma_cc_release(struct svcxprt_rdma *rdma,
struct svc_rdma_chunk_ctxt *cc,
enum dma_data_direction dir);
+extern void svc_rdma_write_chunk_release(struct svcxprt_rdma *rdma,
+ struct svc_rdma_send_ctxt *ctxt);
extern void svc_rdma_reply_chunk_release(struct svcxprt_rdma *rdma,
struct svc_rdma_send_ctxt *ctxt);
-extern int svc_rdma_send_write_list(struct svcxprt_rdma *rdma,
- const struct svc_rdma_recv_ctxt *rctxt,
- const struct xdr_buf *xdr);
+extern int svc_rdma_prepare_write_list(struct svcxprt_rdma *rdma,
+ const struct svc_rdma_recv_ctxt *rctxt,
+ struct svc_rdma_send_ctxt *sctxt,
+ const struct xdr_buf *xdr);
extern int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
const struct svc_rdma_pcl *write_pcl,
const struct svc_rdma_pcl *reply_pcl,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 921dd2542d0d..6057966d3502 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -230,6 +230,28 @@ static void svc_rdma_write_info_free(struct svc_rdma_write_info *info)
queue_work(svcrdma_wq, &info->wi_work);
}
+/**
+ * svc_rdma_write_chunk_release - Release Write chunk I/O resources
+ * @rdma: controlling transport
+ * @ctxt: Send context that is being released
+ *
+ * Write chunk resources remain live until Send completion because
+ * Write WRs are chained to the Send WR. This function releases all
+ * write_info structures accumulated on @ctxt->sc_write_info_list.
+ */
+void svc_rdma_write_chunk_release(struct svcxprt_rdma *rdma,
+ struct svc_rdma_send_ctxt *ctxt)
+{
+ struct svc_rdma_write_info *info;
+
+ while (!list_empty(&ctxt->sc_write_info_list)) {
+ info = list_first_entry(&ctxt->sc_write_info_list,
+ struct svc_rdma_write_info, wi_list);
+ list_del(&info->wi_list);
+ svc_rdma_write_info_free(info);
+ }
+}
+
/**
* svc_rdma_reply_chunk_release - Release Reply chunk I/O resources
* @rdma: controlling transport
@@ -286,13 +308,11 @@ static void svc_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc)
struct ib_cqe *cqe = wc->wr_cqe;
struct svc_rdma_chunk_ctxt *cc =
container_of(cqe, struct svc_rdma_chunk_ctxt, cc_cqe);
- struct svc_rdma_write_info *info =
- container_of(cc, struct svc_rdma_write_info, wi_cc);
switch (wc->status) {
case IB_WC_SUCCESS:
trace_svcrdma_wc_write(&cc->cc_cid);
- break;
+ return;
case IB_WC_WR_FLUSH_ERR:
trace_svcrdma_wc_write_flush(wc, &cc->cc_cid);
break;
@@ -300,12 +320,11 @@ static void svc_rdma_write_done(struct ib_cq *cq, struct ib_wc *wc)
trace_svcrdma_wc_write_err(wc, &cc->cc_cid);
}
- svc_rdma_wake_send_waiters(rdma, cc->cc_sqecount);
-
- if (unlikely(wc->status != IB_WC_SUCCESS))
- svc_xprt_deferred_close(&rdma->sc_xprt);
-
- svc_rdma_write_info_free(info);
+ /* The RDMA Write has flushed, so the client won't get
+ * some of the outgoing RPC message. Signal the loss
+ * to the client by closing the connection.
+ */
+ svc_xprt_deferred_close(&rdma->sc_xprt);
}
/**
@@ -591,13 +610,27 @@ static int svc_rdma_xb_write(const struct xdr_buf *xdr, void *data)
return xdr->len;
}
-static int svc_rdma_send_write_chunk(struct svcxprt_rdma *rdma,
- const struct svc_rdma_chunk *chunk,
- const struct xdr_buf *xdr)
+/*
+ * svc_rdma_prepare_write_chunk - Link Write WRs for @chunk onto @sctxt's chain
+ *
+ * Write WRs are prepended to the Send WR chain so that a single
+ * ib_post_send() posts both RDMA Writes and the final Send. Only
+ * the first WR in each chunk gets a CQE for error detection;
+ * subsequent WRs complete without individual completion events.
+ * The Send WR's signaled completion indicates all chained
+ * operations have finished.
+ */
+static int svc_rdma_prepare_write_chunk(struct svcxprt_rdma *rdma,
+ struct svc_rdma_send_ctxt *sctxt,
+ const struct svc_rdma_chunk *chunk,
+ const struct xdr_buf *xdr)
{
struct svc_rdma_write_info *info;
struct svc_rdma_chunk_ctxt *cc;
+ struct ib_send_wr *first_wr;
struct xdr_buf payload;
+ struct list_head *pos;
+ struct ib_cqe *cqe;
int ret;
if (xdr_buf_subsegment(xdr, &payload, chunk->ch_position,
@@ -613,10 +646,25 @@ static int svc_rdma_send_write_chunk(struct svcxprt_rdma *rdma,
if (ret != payload.len)
goto out_err;
- trace_svcrdma_post_write_chunk(&cc->cc_cid, cc->cc_sqecount);
- ret = svc_rdma_post_chunk_ctxt(rdma, cc);
- if (ret < 0)
+ ret = -EINVAL;
+ if (unlikely(sctxt->sc_sqecount + cc->cc_sqecount > rdma->sc_sq_depth))
goto out_err;
+
+ first_wr = sctxt->sc_wr_chain;
+ cqe = &cc->cc_cqe;
+ list_for_each(pos, &cc->cc_rwctxts) {
+ struct svc_rdma_rw_ctxt *rwc;
+
+ rwc = list_entry(pos, struct svc_rdma_rw_ctxt, rw_list);
+ first_wr = rdma_rw_ctx_wrs(&rwc->rw_ctx, rdma->sc_qp,
+ rdma->sc_port_num, cqe, first_wr);
+ cqe = NULL;
+ }
+ sctxt->sc_wr_chain = first_wr;
+ sctxt->sc_sqecount += cc->cc_sqecount;
+ list_add(&info->wi_list, &sctxt->sc_write_info_list);
+
+ trace_svcrdma_post_write_chunk(&cc->cc_cid, cc->cc_sqecount);
return 0;
out_err:
@@ -625,17 +673,19 @@ static int svc_rdma_send_write_chunk(struct svcxprt_rdma *rdma,
}
/**
- * svc_rdma_send_write_list - Send all chunks on the Write list
+ * svc_rdma_prepare_write_list - Construct WR chain for sending Write list
* @rdma: controlling RDMA transport
* @rctxt: Write list provisioned by the client
+ * @sctxt: Send WR resources
* @xdr: xdr_buf containing an RPC Reply message
*
- * Returns zero on success, or a negative errno if one or more
- * Write chunks could not be sent.
+ * Returns zero on success, or a negative errno if WR chain
+ * construction fails for one or more Write chunks.
*/
-int svc_rdma_send_write_list(struct svcxprt_rdma *rdma,
- const struct svc_rdma_recv_ctxt *rctxt,
- const struct xdr_buf *xdr)
+int svc_rdma_prepare_write_list(struct svcxprt_rdma *rdma,
+ const struct svc_rdma_recv_ctxt *rctxt,
+ struct svc_rdma_send_ctxt *sctxt,
+ const struct xdr_buf *xdr)
{
struct svc_rdma_chunk *chunk;
int ret;
@@ -643,7 +693,7 @@ int svc_rdma_send_write_list(struct svcxprt_rdma *rdma,
pcl_for_each_chunk(chunk, &rctxt->rc_write_pcl) {
if (!chunk->ch_payload_length)
break;
- ret = svc_rdma_send_write_chunk(rdma, chunk, xdr);
+ ret = svc_rdma_prepare_write_chunk(rdma, sctxt, chunk, xdr);
if (ret < 0)
return ret;
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index eb21544f4a61..e9056039c118 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -150,6 +150,7 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
ctxt->sc_send_wr.sg_list = ctxt->sc_sges;
ctxt->sc_send_wr.send_flags = IB_SEND_SIGNALED;
ctxt->sc_cqe.done = svc_rdma_wc_send;
+ INIT_LIST_HEAD(&ctxt->sc_write_info_list);
ctxt->sc_xprt_buf = buffer;
xdr_buf_init(&ctxt->sc_hdrbuf, ctxt->sc_xprt_buf,
rdma->sc_max_req_size);
@@ -237,6 +238,7 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
struct ib_device *device = rdma->sc_cm_id->device;
unsigned int i;
+ svc_rdma_write_chunk_release(rdma, ctxt);
svc_rdma_reply_chunk_release(rdma, ctxt);
if (ctxt->sc_page_count)
@@ -1015,6 +1017,12 @@ void svc_rdma_send_error_msg(struct svcxprt_rdma *rdma,
sctxt->sc_send_wr.num_sge = 1;
sctxt->sc_send_wr.opcode = IB_WR_SEND;
sctxt->sc_sges[0].length = sctxt->sc_hdrbuf.len;
+
+ /* Ensure only the error message is posted, not any previously
+ * prepared Write chunk WRs.
+ */
+ sctxt->sc_wr_chain = &sctxt->sc_send_wr;
+ sctxt->sc_sqecount = 1;
if (svc_rdma_post_send(rdma, sctxt))
goto put_ctxt;
return;
@@ -1062,7 +1070,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
if (!p)
goto put_ctxt;
- ret = svc_rdma_send_write_list(rdma, rctxt, &rqstp->rq_res);
+ ret = svc_rdma_prepare_write_list(rdma, rctxt, sctxt, &rqstp->rq_res);
if (ret < 0)
goto put_ctxt;
--
2.52.0
* [RFC PATCH 05/15] svcrdma: Factor out WR chain linking into helper
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (3 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 04/15] svcrdma: Add Write chunk WRs to the RPC's Send WR chain Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 06/15] svcrdma: Reduce false sharing in struct svcxprt_rdma Chuck Lever
` (9 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_prepare_write_chunk() and svc_rdma_prepare_reply_chunk()
contain identical code for linking RDMA R/W work requests onto a
Send context's WR chain. This duplication increases maintenance
burden and risks divergent bug fixes.
Introduce svc_rdma_cc_link_wrs() to consolidate the WR chain
linking logic. The helper walks the chunk context's rwctxts list,
chains each WR via rdma_rw_ctx_wrs(), and updates the Send
context's chain head and SQE count. Completion signaling is
requested only for the tail WR (posted first).
No functional change.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_rw.c | 67 +++++++++++++------------------
1 file changed, 28 insertions(+), 39 deletions(-)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 6057966d3502..9ac6a73e4b5d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -610,15 +610,32 @@ static int svc_rdma_xb_write(const struct xdr_buf *xdr, void *data)
return xdr->len;
}
-/*
- * svc_rdma_prepare_write_chunk - Link Write WRs for @chunk onto @sctxt's chain
- *
- * Write WRs are prepended to the Send WR chain so that a single
- * ib_post_send() posts both RDMA Writes and the final Send. Only
- * the first WR in each chunk gets a CQE for error detection;
- * subsequent WRs complete without individual completion events.
- * The Send WR's signaled completion indicates all chained
- * operations have finished.
+/* Link chunk WRs onto @sctxt's WR chain. Completion is requested
+ * for the tail WR, which is posted first.
+ */
+static inline void svc_rdma_cc_link_wrs(struct svcxprt_rdma *rdma,
+ struct svc_rdma_send_ctxt *sctxt,
+ struct svc_rdma_chunk_ctxt *cc)
+{
+ struct ib_send_wr *first_wr;
+ struct list_head *pos;
+ struct ib_cqe *cqe;
+
+ first_wr = sctxt->sc_wr_chain;
+ cqe = &cc->cc_cqe;
+ list_for_each(pos, &cc->cc_rwctxts) {
+ struct svc_rdma_rw_ctxt *rwc;
+
+ rwc = list_entry(pos, struct svc_rdma_rw_ctxt, rw_list);
+ first_wr = rdma_rw_ctx_wrs(&rwc->rw_ctx, rdma->sc_qp,
+ rdma->sc_port_num, cqe, first_wr);
+ cqe = NULL;
+ }
+ sctxt->sc_wr_chain = first_wr;
+ sctxt->sc_sqecount += cc->cc_sqecount;
+}
+
+/* Link Write WRs for @chunk onto @sctxt's WR chain.
*/
static int svc_rdma_prepare_write_chunk(struct svcxprt_rdma *rdma,
struct svc_rdma_send_ctxt *sctxt,
@@ -627,10 +644,7 @@ static int svc_rdma_prepare_write_chunk(struct svcxprt_rdma *rdma,
{
struct svc_rdma_write_info *info;
struct svc_rdma_chunk_ctxt *cc;
- struct ib_send_wr *first_wr;
struct xdr_buf payload;
- struct list_head *pos;
- struct ib_cqe *cqe;
int ret;
if (xdr_buf_subsegment(xdr, &payload, chunk->ch_position,
@@ -650,18 +664,7 @@ static int svc_rdma_prepare_write_chunk(struct svcxprt_rdma *rdma,
if (unlikely(sctxt->sc_sqecount + cc->cc_sqecount > rdma->sc_sq_depth))
goto out_err;
- first_wr = sctxt->sc_wr_chain;
- cqe = &cc->cc_cqe;
- list_for_each(pos, &cc->cc_rwctxts) {
- struct svc_rdma_rw_ctxt *rwc;
-
- rwc = list_entry(pos, struct svc_rdma_rw_ctxt, rw_list);
- first_wr = rdma_rw_ctx_wrs(&rwc->rw_ctx, rdma->sc_qp,
- rdma->sc_port_num, cqe, first_wr);
- cqe = NULL;
- }
- sctxt->sc_wr_chain = first_wr;
- sctxt->sc_sqecount += cc->cc_sqecount;
+ svc_rdma_cc_link_wrs(rdma, sctxt, cc);
list_add(&info->wi_list, &sctxt->sc_write_info_list);
trace_svcrdma_post_write_chunk(&cc->cc_cid, cc->cc_sqecount);
@@ -723,9 +726,6 @@ int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
{
struct svc_rdma_write_info *info = &sctxt->sc_reply_info;
struct svc_rdma_chunk_ctxt *cc = &info->wi_cc;
- struct ib_send_wr *first_wr;
- struct list_head *pos;
- struct ib_cqe *cqe;
int ret;
info->wi_rdma = rdma;
@@ -739,18 +739,7 @@ int svc_rdma_prepare_reply_chunk(struct svcxprt_rdma *rdma,
if (ret < 0)
return ret;
- first_wr = sctxt->sc_wr_chain;
- cqe = &cc->cc_cqe;
- list_for_each(pos, &cc->cc_rwctxts) {
- struct svc_rdma_rw_ctxt *rwc;
-
- rwc = list_entry(pos, struct svc_rdma_rw_ctxt, rw_list);
- first_wr = rdma_rw_ctx_wrs(&rwc->rw_ctx, rdma->sc_qp,
- rdma->sc_port_num, cqe, first_wr);
- cqe = NULL;
- }
- sctxt->sc_wr_chain = first_wr;
- sctxt->sc_sqecount += cc->cc_sqecount;
+ svc_rdma_cc_link_wrs(rdma, sctxt, cc);
trace_svcrdma_post_reply_chunk(&cc->cc_cid, cc->cc_sqecount);
return xdr->len;
--
2.52.0
^ permalink raw reply related [flat|nested] 16+ messages in thread

* [RFC PATCH 06/15] svcrdma: Reduce false sharing in struct svcxprt_rdma
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (4 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 05/15] svcrdma: Factor out WR chain linking into helper Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 07/15] svcrdma: Use lock-free list for Receive Queue tracking Chuck Lever
` (8 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Several frequently modified fields in struct svcxprt_rdma reside
in the same cache line, causing false sharing between independent
code paths:
- sc_sq_avail: atomic, modified on every ib_post_send and
completion
- sc_send_lock/sc_send_ctxts: Send context cache, accessed during
reply construction
- sc_rw_ctxt_lock/sc_rw_ctxts: R/W context cache, accessed during
Read/Write chunk processing
When any of these fields is modified, the entire cache line is
invalidated on other CPUs. Under load, concurrent operations on
different code paths cause the cache line to bounce between cores,
degrading performance.
Insert ____cacheline_aligned_in_smp annotations to place the Send
context cache, R/W context cache, and receive-path fields into
separate cache lines. To utilize the padding this creates:
- Move sc_pd, sc_ord, sc_max_send_sges into the Send cache line
(sc_pd is accessed during send context setup)
- Move sc_qp, sc_port_num, sc_rq_cq, sc_sq_cq into the R/W cache
line (sc_qp and sc_port_num are accessed together during every
rdma_rw_ctx_wrs call)
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index d84946cf6176..972d446439a6 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -78,8 +78,6 @@ struct svcxprt_rdma {
struct rdma_cm_id *sc_cm_id; /* RDMA connection id */
struct list_head sc_accept_q; /* Conn. waiting accept */
struct rpcrdma_notification sc_rn; /* removal notification */
- int sc_ord; /* RDMA read limit */
- int sc_max_send_sges;
bool sc_snd_w_inv; /* OK to use Send With Invalidate */
atomic_t sc_sq_avail; /* SQEs ready to be consumed */
@@ -90,23 +88,30 @@ struct svcxprt_rdma {
u32 sc_max_requests; /* Max requests */
u32 sc_max_bc_requests;/* Backward credits */
int sc_max_req_size; /* Size of each RQ WR buf */
- u8 sc_port_num;
- struct ib_pd *sc_pd;
-
- spinlock_t sc_send_lock;
+ /* Send context cache */
+ spinlock_t sc_send_lock ____cacheline_aligned_in_smp;
struct llist_head sc_send_ctxts;
- spinlock_t sc_rw_ctxt_lock;
- struct llist_head sc_rw_ctxts;
+ /* sc_pd accessed during send context alloc */
+ struct ib_pd *sc_pd;
+ int sc_ord; /* RDMA read limit */
+ int sc_max_send_sges;
- u32 sc_pending_recvs;
+ /* R/W context cache */
+ spinlock_t sc_rw_ctxt_lock ____cacheline_aligned_in_smp;
+ struct llist_head sc_rw_ctxts;
+ /* sc_qp and sc_port_num accessed together */
+ struct ib_qp *sc_qp;
+ u8 sc_port_num;
+ struct ib_cq *sc_rq_cq;
+ struct ib_cq *sc_sq_cq;
+
+ /* Receive path */
+ u32 sc_pending_recvs ____cacheline_aligned_in_smp;
u32 sc_recv_batch;
struct list_head sc_rq_dto_q;
struct list_head sc_read_complete_q;
spinlock_t sc_rq_dto_lock;
- struct ib_qp *sc_qp;
- struct ib_cq *sc_rq_cq;
- struct ib_cq *sc_sq_cq;
spinlock_t sc_lock; /* transport lock */
--
2.52.0
* [RFC PATCH 07/15] svcrdma: Use lock-free list for Receive Queue tracking
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (5 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 06/15] svcrdma: Reduce false sharing in struct svcxprt_rdma Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 08/15] svcrdma: Convert Read completion queue to use lock-free list Chuck Lever
` (7 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The sc_rq_dto_lock spinlock is acquired on every receive completion
to add the completed receive context to the sc_rq_dto_q list. Under
high message rates this creates contention between softirq contexts
processing completions.
Replace sc_rq_dto_q with a lock-free llist. Receive completions now
use llist_add(), which requires no locking. The consumer uses
llist_del_first() to retrieve one item at a time.
The lock remains for sc_read_complete_q, but the primary hot path
(receive completion and consumption) no longer requires it. This
eliminates producer-side contention entirely.
Note that llist provides LIFO ordering rather than FIFO. For
independent RPC requests this has no semantic impact and avoids
the overhead of reversing the list on the consumer side.
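The llist semantics relied on here (single-cmpxchg push, LIFO pop) can be sketched in userspace with C11 atomics. The model_ names mirror the kernel API but this is an assumption-laden sketch, not the kernel implementation; note that, as in the kernel, del_first assumes a single consumer because its cmpxchg is ABA-prone with concurrent poppers:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Userspace model of the llist operations used by this patch.
 * model_llist_add(): lock-free push onto the head (one cmpxchg loop).
 * model_llist_del_first(): pop the head; single-consumer only, since
 * a concurrent consumer could trigger the ABA problem on ->next.
 */
struct lnode { struct lnode *next; int id; };
struct lhead { _Atomic(struct lnode *) first; };

void model_llist_add(struct lnode *n, struct lhead *h)
{
	struct lnode *first = atomic_load(&h->first);

	do {
		n->next = first;	/* first is refreshed on CAS failure */
	} while (!atomic_compare_exchange_weak(&h->first, &first, n));
}

struct lnode *model_llist_del_first(struct lhead *h)
{
	struct lnode *first = atomic_load(&h->first);

	while (first &&
	       !atomic_compare_exchange_weak(&h->first, &first, first->next))
		;
	return first;		/* NULL when the list is empty */
}
```

Pushing 1 then 2 and popping yields 2 then 1, which is the LIFO ordering the commit message notes is acceptable for independent RPCs.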
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 2 +-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 38 +++++++++++++++++-------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
3 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 972d446439a6..ae4300364491 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -109,7 +109,7 @@ struct svcxprt_rdma {
/* Receive path */
u32 sc_pending_recvs ____cacheline_aligned_in_smp;
u32 sc_recv_batch;
- struct list_head sc_rq_dto_q;
+ struct llist_head sc_rq_dto_q;
struct list_head sc_read_complete_q;
spinlock_t sc_rq_dto_lock;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 29a71fa79e2b..fd4d3fbd7054 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -361,11 +361,12 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
/* All wc fields are now known to be valid */
ctxt->rc_byte_len = wc->byte_len;
- spin_lock(&rdma->sc_rq_dto_lock);
- list_add_tail(&ctxt->rc_list, &rdma->sc_rq_dto_q);
- /* Note the unlock pairs with the smp_rmb in svc_xprt_ready: */
+ llist_add(&ctxt->rc_node, &rdma->sc_rq_dto_q);
+ /*
+ * The implicit barrier of llist_add's cmpxchg pairs with
+ * the smp_rmb in svc_xprt_ready.
+ */
set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
- spin_unlock(&rdma->sc_rq_dto_lock);
if (!test_bit(RDMAXPRT_CONN_PENDING, &rdma->sc_flags))
svc_xprt_enqueue(&rdma->sc_xprt);
return;
@@ -388,13 +389,16 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
void svc_rdma_flush_recv_queues(struct svcxprt_rdma *rdma)
{
struct svc_rdma_recv_ctxt *ctxt;
+ struct llist_node *node;
while ((ctxt = svc_rdma_next_recv_ctxt(&rdma->sc_read_complete_q))) {
list_del(&ctxt->rc_list);
svc_rdma_recv_ctxt_put(rdma, ctxt);
}
- while ((ctxt = svc_rdma_next_recv_ctxt(&rdma->sc_rq_dto_q))) {
- list_del(&ctxt->rc_list);
+ node = llist_del_all(&rdma->sc_rq_dto_q);
+ while (node) {
+ ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
+ node = node->next;
svc_rdma_recv_ctxt_put(rdma, ctxt);
}
}
@@ -930,6 +934,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
struct svcxprt_rdma *rdma_xprt =
container_of(xprt, struct svcxprt_rdma, sc_xprt);
struct svc_rdma_recv_ctxt *ctxt;
+ struct llist_node *node;
int ret;
/* Prevent svc_xprt_release() from releasing pages in rq_pages
@@ -949,13 +954,24 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
svc_rdma_read_complete(rqstp, ctxt);
goto complete;
}
- ctxt = svc_rdma_next_recv_ctxt(&rdma_xprt->sc_rq_dto_q);
- if (ctxt)
- list_del(&ctxt->rc_list);
- else
+ spin_unlock(&rdma_xprt->sc_rq_dto_lock);
+
+ node = llist_del_first(&rdma_xprt->sc_rq_dto_q);
+ if (node) {
+ ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
+ } else {
+ ctxt = NULL;
/* No new incoming requests, terminate the loop */
clear_bit(XPT_DATA, &xprt->xpt_flags);
- spin_unlock(&rdma_xprt->sc_rq_dto_lock);
+
+ /*
+ * If a completion arrived after llist_del_first but
+ * before clear_bit, the producer's set_bit would be
+ * cleared above. Recheck to close this race window.
+ */
+ if (!llist_empty(&rdma_xprt->sc_rq_dto_q))
+ set_bit(XPT_DATA, &xprt->xpt_flags);
+ }
/* Unblock the transport for the next receive */
svc_xprt_received(xprt);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index d1dcffbf2fe7..e7f8898d09db 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -173,7 +173,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
svc_xprt_init(net, &svc_rdma_class, &cma_xprt->sc_xprt, serv);
INIT_LIST_HEAD(&cma_xprt->sc_accept_q);
- INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
+ init_llist_head(&cma_xprt->sc_rq_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
init_llist_head(&cma_xprt->sc_send_ctxts);
init_llist_head(&cma_xprt->sc_recv_ctxts);
--
2.52.0
* [RFC PATCH 08/15] svcrdma: Convert Read completion queue to use lock-free list
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (6 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 07/15] svcrdma: Use lock-free list for Receive Queue tracking Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 09/15] svcrdma: Release write chunk resources without re-queuing Chuck Lever
` (6 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Extend the lock-free list conversion to sc_read_complete_q. This
queue tracks receive contexts whose RDMA Read operations (used to
pull Read chunk payloads) have completed.
With both sc_rq_dto_q and sc_read_complete_q now using llist,
the sc_rq_dto_lock spinlock is no longer needed and is removed.
This eliminates all locking from the receive and Read completion
paths.
Note that llist provides LIFO ordering rather than FIFO. For
independent RPC requests this has no semantic impact.
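The consumer-side ordering that keeps the lockless conversion safe (clear the XPT_DATA flag only after finding both queues empty, then recheck both queues and restore the flag if a producer raced in) can be sketched as follows. The queue and flag names are illustrative models, not the kernel's:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the lost-wakeup avoidance in svc_rdma_recvfrom(): a
 * producer enqueues and then sets the flag; the consumer clears the
 * flag only when both queues look empty, then rechecks. A producer
 * whose set raced with the clear is caught by the recheck.
 */
struct qnode { struct qnode *next; };

_Atomic(struct qnode *) rq_dto_q;
_Atomic(struct qnode *) read_complete_q;
atomic_bool xpt_data;

/* Producer: llist_add() followed by set_bit(XPT_DATA) */
void producer_post(_Atomic(struct qnode *) *q, struct qnode *n)
{
	n->next = atomic_load(q);
	while (!atomic_compare_exchange_weak(q, &n->next, n))
		;
	atomic_store(&xpt_data, true);
}

/* Consumer: both queues empty -> clear flag -> recheck both queues */
bool consumer_idle_check(void)
{
	atomic_store(&xpt_data, false);
	if (atomic_load(&rq_dto_q) || atomic_load(&read_complete_q))
		atomic_store(&xpt_data, true);	/* close the race window */
	return atomic_load(&xpt_data);
}
```

With both queues empty the flag stays clear; once either queue holds an entry, the recheck leaves XPT_DATA set even if the clear raced with the producer's set.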
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 4 +--
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 34 +++++++++++-------------
net/sunrpc/xprtrdma/svc_rdma_rw.c | 9 ++++---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +---
4 files changed, 23 insertions(+), 29 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index ae4300364491..e6cb52285818 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -110,8 +110,7 @@ struct svcxprt_rdma {
u32 sc_pending_recvs ____cacheline_aligned_in_smp;
u32 sc_recv_batch;
struct llist_head sc_rq_dto_q;
- struct list_head sc_read_complete_q;
- spinlock_t sc_rq_dto_lock;
+ struct llist_head sc_read_complete_q;
spinlock_t sc_lock; /* transport lock */
@@ -183,7 +182,6 @@ struct svc_rdma_chunk_ctxt {
struct svc_rdma_recv_ctxt {
struct llist_node rc_node;
- struct list_head rc_list;
struct ib_recv_wr rc_recv_wr;
struct ib_cqe rc_cqe;
struct rpc_rdma_cid rc_cid;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index fd4d3fbd7054..0c048eaf2b8e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -108,13 +108,6 @@
static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc);
-static inline struct svc_rdma_recv_ctxt *
-svc_rdma_next_recv_ctxt(struct list_head *list)
-{
- return list_first_entry_or_null(list, struct svc_rdma_recv_ctxt,
- rc_list);
-}
-
static struct svc_rdma_recv_ctxt *
svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
{
@@ -385,14 +378,21 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
* svc_rdma_flush_recv_queues - Drain pending Receive work
* @rdma: svcxprt_rdma being shut down
*
+ * Called from svc_rdma_free() after ib_drain_qp() has blocked until
+ * completion queues are empty and flush_workqueue() has waited for
+ * pending work items. These preceding calls guarantee no concurrent
+ * producers (completion handlers) or consumers (svc_rdma_recvfrom)
+ * can be active, making unsynchronized llist_del_all() safe here.
*/
void svc_rdma_flush_recv_queues(struct svcxprt_rdma *rdma)
{
struct svc_rdma_recv_ctxt *ctxt;
struct llist_node *node;
- while ((ctxt = svc_rdma_next_recv_ctxt(&rdma->sc_read_complete_q))) {
- list_del(&ctxt->rc_list);
+ node = llist_del_all(&rdma->sc_read_complete_q);
+ while (node) {
+ ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
+ node = node->next;
svc_rdma_recv_ctxt_put(rdma, ctxt);
}
node = llist_del_all(&rdma->sc_rq_dto_q);
@@ -945,17 +945,13 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
rqstp->rq_xprt_ctxt = NULL;
- spin_lock(&rdma_xprt->sc_rq_dto_lock);
- ctxt = svc_rdma_next_recv_ctxt(&rdma_xprt->sc_read_complete_q);
- if (ctxt) {
- list_del(&ctxt->rc_list);
- spin_unlock(&rdma_xprt->sc_rq_dto_lock);
+ node = llist_del_first(&rdma_xprt->sc_read_complete_q);
+ if (node) {
+ ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
svc_xprt_received(xprt);
svc_rdma_read_complete(rqstp, ctxt);
goto complete;
}
- spin_unlock(&rdma_xprt->sc_rq_dto_lock);
-
node = llist_del_first(&rdma_xprt->sc_rq_dto_q);
if (node) {
ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
@@ -967,9 +963,11 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
/*
* If a completion arrived after llist_del_first but
* before clear_bit, the producer's set_bit would be
- * cleared above. Recheck to close this race window.
+ * cleared above. Recheck both queues to close this
+ * race window.
*/
- if (!llist_empty(&rdma_xprt->sc_rq_dto_q))
+ if (!llist_empty(&rdma_xprt->sc_rq_dto_q) ||
+ !llist_empty(&rdma_xprt->sc_read_complete_q))
set_bit(XPT_DATA, &xprt->xpt_flags);
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 9ac6a73e4b5d..b0bbebbecb3e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -349,11 +349,12 @@ static void svc_rdma_wc_read_done(struct ib_cq *cq, struct ib_wc *wc)
trace_svcrdma_wc_read(wc, &cc->cc_cid, ctxt->rc_readbytes,
cc->cc_posttime);
- spin_lock(&rdma->sc_rq_dto_lock);
- list_add_tail(&ctxt->rc_list, &rdma->sc_read_complete_q);
- /* the unlock pairs with the smp_rmb in svc_xprt_ready */
+ llist_add(&ctxt->rc_node, &rdma->sc_read_complete_q);
+ /*
+ * The implicit barrier of llist_add's cmpxchg pairs with
+ * the smp_rmb in svc_xprt_ready.
+ */
set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
- spin_unlock(&rdma->sc_rq_dto_lock);
svc_xprt_enqueue(&rdma->sc_xprt);
return;
case IB_WC_WR_FLUSH_ERR:
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index e7f8898d09db..286806ac0739 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -164,7 +164,6 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
{
static struct lock_class_key svcrdma_rwctx_lock;
static struct lock_class_key svcrdma_sctx_lock;
- static struct lock_class_key svcrdma_dto_lock;
struct svcxprt_rdma *cma_xprt;
cma_xprt = kzalloc_node(sizeof(*cma_xprt), GFP_KERNEL, node);
@@ -174,15 +173,13 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
svc_xprt_init(net, &svc_rdma_class, &cma_xprt->sc_xprt, serv);
INIT_LIST_HEAD(&cma_xprt->sc_accept_q);
init_llist_head(&cma_xprt->sc_rq_dto_q);
- INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
+ init_llist_head(&cma_xprt->sc_read_complete_q);
init_llist_head(&cma_xprt->sc_send_ctxts);
init_llist_head(&cma_xprt->sc_recv_ctxts);
init_llist_head(&cma_xprt->sc_rw_ctxts);
init_waitqueue_head(&cma_xprt->sc_send_wait);
spin_lock_init(&cma_xprt->sc_lock);
- spin_lock_init(&cma_xprt->sc_rq_dto_lock);
- lockdep_set_class(&cma_xprt->sc_rq_dto_lock, &svcrdma_dto_lock);
spin_lock_init(&cma_xprt->sc_send_lock);
lockdep_set_class(&cma_xprt->sc_send_lock, &svcrdma_sctx_lock);
spin_lock_init(&cma_xprt->sc_rw_ctxt_lock);
--
2.52.0
* [RFC PATCH 09/15] svcrdma: Release write chunk resources without re-queuing
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (7 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 08/15] svcrdma: Convert Read completion queue to use lock-free list Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 10/15] svcrdma: Use per-transport kthread for send context release Chuck Lever
` (5 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Each RDMA Send completion triggers a cascade of work items on the
svcrdma_wq unbound workqueue:
ib_cq_poll_work (on ib_comp_wq, per-CPU)
-> svc_rdma_send_ctxt_put -> queue_work [work item 1]
-> svc_rdma_write_info_free -> queue_work [work item 2]
Every transition through queue_work contends on the unbound
pool's spinlock. Profiling an 8KB NFSv3 read/write workload
over RDMA shows about 4% of total CPU cycles spent on this
lock, with the cascading re-queue of write_info release
contributing roughly 1%.
The initial queue_work in svc_rdma_send_ctxt_put is needed to
move release work off the CQ completion context (which runs on
a per-CPU bound workqueue). However, once executing on
svcrdma_wq, there is no need to re-queue for each write_info
structure. svc_rdma_reply_chunk_release already calls
svc_rdma_cc_release inline, confirming these operations are
safe in workqueue and nfsd thread context alike.
Release write chunk resources inline in
svc_rdma_write_info_free, removing the intermediate
svc_rdma_write_info_free_async work item and the wi_work
field from struct svc_rdma_write_info.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 1 -
net/sunrpc/xprtrdma/svc_rdma_rw.c | 13 ++-----------
2 files changed, 2 insertions(+), 12 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index e6cb52285818..9691238df47f 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -232,7 +232,6 @@ struct svc_rdma_write_info {
unsigned int wi_next_off;
struct svc_rdma_chunk_ctxt wi_cc;
- struct work_struct wi_work;
};
struct svc_rdma_send_ctxt {
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index b0bbebbecb3e..e62abdbf84f8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -215,19 +215,10 @@ svc_rdma_write_info_alloc(struct svcxprt_rdma *rdma,
return info;
}
-static void svc_rdma_write_info_free_async(struct work_struct *work)
-{
- struct svc_rdma_write_info *info;
-
- info = container_of(work, struct svc_rdma_write_info, wi_work);
- svc_rdma_cc_release(info->wi_rdma, &info->wi_cc, DMA_TO_DEVICE);
- kfree(info);
-}
-
static void svc_rdma_write_info_free(struct svc_rdma_write_info *info)
{
- INIT_WORK(&info->wi_work, svc_rdma_write_info_free_async);
- queue_work(svcrdma_wq, &info->wi_work);
+ svc_rdma_cc_release(info->wi_rdma, &info->wi_cc, DMA_TO_DEVICE);
+ kfree(info);
}
/**
--
2.52.0
* [RFC PATCH 10/15] svcrdma: Use per-transport kthread for send context release
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (8 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 09/15] svcrdma: Release write chunk resources without re-queuing Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 11/15] svcrdma: Use watermark-based Receive Queue replenishment Chuck Lever
` (4 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Each RDMA Send completion queues a separate work item to the
global svcrdma_wq (an unbound workqueue) to handle DMA
unmapping and page release. Under load, many worker threads
contend on the shared workqueue pool lock -- profiling an
NFSv3 8KB read+write workload over RDMA shows ~2.6% of
total CPU cycles spent in native_queued_spin_lock_slowpath
on this lock.
The contention arises from three directions: CQ completion
handlers acquiring the pool lock to enqueue work, a dozen
unbound workers re-acquiring it after each work item
completes, and XFS CIL flush callers hitting the same
unbound pool lock.
Replace the workqueue with a per-transport kthread that
drains a lock-free list. The CQ handler appends completed
send contexts via llist_add() (a single cmpxchg) and wakes
the kthread. The kthread collects all pending items with
llist_del_all() (a single xchg) and releases them in a
batch. Both operations are lock-free, eliminating the pool
lock entirely.
This also removes the global svcrdma_wq workqueue, which
has no remaining users.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 9 ++--
net/sunrpc/xprtrdma/svc_rdma.c | 18 +------
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 62 +++++++++++++++++++++---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 8 ++-
4 files changed, 69 insertions(+), 28 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 9691238df47f..874941b22485 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -66,8 +66,6 @@ extern unsigned int svcrdma_ord;
extern unsigned int svcrdma_max_requests;
extern unsigned int svcrdma_max_bc_requests;
extern unsigned int svcrdma_max_req_size;
-extern struct workqueue_struct *svcrdma_wq;
-
extern struct percpu_counter svcrdma_stat_read;
extern struct percpu_counter svcrdma_stat_recv;
extern struct percpu_counter svcrdma_stat_sq_starve;
@@ -120,6 +118,10 @@ struct svcxprt_rdma {
struct llist_head sc_recv_ctxts;
+ struct llist_head sc_send_release_list;
+ wait_queue_head_t sc_release_wait;
+ struct task_struct *sc_release_task;
+
atomic_t sc_completion_ids;
};
/* sc_flags */
@@ -237,7 +239,6 @@ struct svc_rdma_write_info {
struct svc_rdma_send_ctxt {
struct llist_node sc_node;
struct rpc_rdma_cid sc_cid;
- struct work_struct sc_work;
struct svcxprt_rdma *sc_rdma;
struct ib_send_wr sc_send_wr;
@@ -301,6 +302,8 @@ extern int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
/* svc_rdma_sendto.c */
extern void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma);
+extern int svc_rdma_start_release_thread(struct svcxprt_rdma *rdma);
+extern void svc_rdma_stop_release_thread(struct svcxprt_rdma *rdma);
extern struct svc_rdma_send_ctxt *
svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma);
extern void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 415c0310101f..f67f0612b1a9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -264,38 +264,22 @@ static int svc_rdma_proc_init(void)
return rc;
}
-struct workqueue_struct *svcrdma_wq;
-
void svc_rdma_cleanup(void)
{
svc_unreg_xprt_class(&svc_rdma_class);
svc_rdma_proc_cleanup();
- if (svcrdma_wq) {
- struct workqueue_struct *wq = svcrdma_wq;
-
- svcrdma_wq = NULL;
- destroy_workqueue(wq);
- }
dprintk("SVCRDMA Module Removed, deregister RPC RDMA transport\n");
}
int svc_rdma_init(void)
{
- struct workqueue_struct *wq;
int rc;
- wq = alloc_workqueue("svcrdma", WQ_UNBOUND, 0);
- if (!wq)
- return -ENOMEM;
-
rc = svc_rdma_proc_init();
- if (rc) {
- destroy_workqueue(wq);
+ if (rc)
return rc;
- }
- svcrdma_wq = wq;
svc_reg_xprt_class(&svc_rdma_class);
dprintk("SVCRDMA Module Init, register RPC RDMA transport\n");
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index e9056039c118..1ff39c88b3cb 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -99,6 +99,7 @@
* where two different Write segments send portions of the same page.
*/
+#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/unaligned.h>
@@ -260,12 +261,57 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
}
-static void svc_rdma_send_ctxt_put_async(struct work_struct *work)
+static int svc_rdma_release_fn(void *data)
{
- struct svc_rdma_send_ctxt *ctxt;
+ struct svcxprt_rdma *rdma = data;
+ struct svc_rdma_send_ctxt *ctxt, *next;
+ struct llist_node *node;
- ctxt = container_of(work, struct svc_rdma_send_ctxt, sc_work);
- svc_rdma_send_ctxt_release(ctxt->sc_rdma, ctxt);
+ while (!kthread_should_stop()) {
+ wait_event(rdma->sc_release_wait,
+ !llist_empty(&rdma->sc_send_release_list) ||
+ kthread_should_stop());
+
+ node = llist_del_all(&rdma->sc_send_release_list);
+ llist_for_each_entry_safe(ctxt, next, node, sc_node)
+ svc_rdma_send_ctxt_release(rdma, ctxt);
+ }
+
+ /* Defensive: the list is usually empty here. */
+ node = llist_del_all(&rdma->sc_send_release_list);
+ llist_for_each_entry_safe(ctxt, next, node, sc_node)
+ svc_rdma_send_ctxt_release(rdma, ctxt);
+ return 0;
+}
+
+/**
+ * svc_rdma_start_release_thread - Launch release kthread
+ * @rdma: controlling transport
+ *
+ * Returns zero on success, or a negative errno.
+ */
+int svc_rdma_start_release_thread(struct svcxprt_rdma *rdma)
+{
+ struct task_struct *task;
+
+ task = kthread_run(svc_rdma_release_fn, rdma,
+ "svcrdma-rel");
+ if (IS_ERR(task))
+ return PTR_ERR(task);
+ rdma->sc_release_task = task;
+ return 0;
+}
+
+/**
+ * svc_rdma_stop_release_thread - Stop release kthread
+ * @rdma: controlling transport
+ *
+ * Waits for the kthread to drain and exit.
+ */
+void svc_rdma_stop_release_thread(struct svcxprt_rdma *rdma)
+{
+ if (rdma->sc_release_task)
+ kthread_stop(rdma->sc_release_task);
}
/**
@@ -273,13 +319,15 @@ static void svc_rdma_send_ctxt_put_async(struct work_struct *work)
* @rdma: controlling svcxprt_rdma
* @ctxt: object to return to the free list
*
- * Pages left in sc_pages are DMA unmapped and released.
+ * DMA unmapping and page release are deferred to a
+ * per-transport kthread to keep these costs off the
+ * completion handler's critical path.
*/
void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
struct svc_rdma_send_ctxt *ctxt)
{
- INIT_WORK(&ctxt->sc_work, svc_rdma_send_ctxt_put_async);
- queue_work(svcrdma_wq, &ctxt->sc_work);
+ llist_add(&ctxt->sc_node, &rdma->sc_send_release_list);
+ wake_up(&rdma->sc_release_wait);
}
/**
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 286806ac0739..0a3969d36a80 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -177,7 +177,9 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
init_llist_head(&cma_xprt->sc_send_ctxts);
init_llist_head(&cma_xprt->sc_recv_ctxts);
init_llist_head(&cma_xprt->sc_rw_ctxts);
+ init_llist_head(&cma_xprt->sc_send_release_list);
init_waitqueue_head(&cma_xprt->sc_send_wait);
+ init_waitqueue_head(&cma_xprt->sc_release_wait);
spin_lock_init(&cma_xprt->sc_lock);
spin_lock_init(&cma_xprt->sc_send_lock);
@@ -526,6 +528,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
if (!svc_rdma_post_recvs(newxprt))
goto errout;
+ ret = svc_rdma_start_release_thread(newxprt);
+ if (ret)
+ goto errout;
+
/* Construct RDMA-CM private message */
pmsg.cp_magic = rpcrdma_cmp_magic;
pmsg.cp_version = RPCRDMA_CMP_VERSION;
@@ -605,7 +611,7 @@ static void svc_rdma_free(struct svc_xprt *xprt)
/* This blocks until the Completion Queues are empty */
if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
ib_drain_qp(rdma->sc_qp);
- flush_workqueue(svcrdma_wq);
+ svc_rdma_stop_release_thread(rdma);
svc_rdma_flush_recv_queues(rdma);
--
2.52.0
* [RFC PATCH 11/15] svcrdma: Use watermark-based Receive Queue replenishment
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (9 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 10/15] svcrdma: Use per-transport kthread for send context release Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 12/15] svcrdma: Add per-recv_ctxt chunk context cache Chuck Lever
` (3 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The current Receive posting strategy posts a small fixed batch of
Receives on every completion when the queue depth drops below the
maximum. At high message rates this results in frequent
ib_post_recv() calls, each incurring doorbell overhead.
The Receive Queue is now provisioned with twice the negotiated
credit limit (sc_max_requests). Replenishment is triggered when the
number of posted Receives drops below the credit limit (the low
watermark), posting enough Receives to refill the queue to capacity.
For a typical configuration with a credit limit of 128:
- Receive Queue depth: 256
- Low watermark: 128 (replenish when half consumed)
- Batch size: ~128 Receives per posting
Tying the watermark to the credit limit rather than a percentage of
queue capacity ensures adequate buffering regardless of the
configured credit limit. Even with a small credit limit, at least
one full credit window remains posted, guaranteeing forward
progress.
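The arithmetic behind the watermark scheme can be written out as small helpers. The function names here are illustrative (the patch open-codes these computations):

```c
/* Watermark-based Receive Queue replenishment arithmetic.
 * Queue depth is twice the credit limit; the low watermark is one
 * full credit window (sc_max_requests).
 */
enum { RQ_DEPTH_MULT = 2 };

unsigned int rq_depth(unsigned int credits)
{
	return credits * RQ_DEPTH_MULT;
}

/* Returns the number of Receives to post: zero while at least one
 * credit window remains posted, otherwise enough to refill the
 * queue to full capacity.
 */
unsigned int recv_batch(unsigned int credits, unsigned int posted)
{
	if (posted >= credits)
		return 0;
	return rq_depth(credits) - posted;
}
```

With a credit limit of 128, the queue depth is 256 and replenishment triggers once posted Receives fall to 127, posting 129 in one batch, matching the "~128 Receives per posting" figure above. A small credit limit behaves the same way: at least one full credit window is always restored.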
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 22 ++++++++++++-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 41 ++++++++++++++++--------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 11 +++----
3 files changed, 54 insertions(+), 20 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 874941b22485..8e78f958fa46 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -106,7 +106,6 @@ struct svcxprt_rdma {
/* Receive path */
u32 sc_pending_recvs ____cacheline_aligned_in_smp;
- u32 sc_recv_batch;
struct llist_head sc_rq_dto_q;
struct llist_head sc_read_complete_q;
@@ -143,6 +142,27 @@ enum {
RPCRDMA_MAX_BC_REQUESTS = 2,
};
+/*
+ * Receive Queue provisioning constants for watermark-based replenishment.
+ *
+ * The Receive Queue is sized at twice the credit limit to enable
+ * batched posting that reduces doorbell overhead. Replenishment
+ * occurs when posted receives drop below the credit limit (the
+ * low watermark), refilling to full capacity.
+ */
+enum {
+ /* Queue depth = sc_max_requests * multiplier */
+ SVCRDMA_RQ_DEPTH_MULT = 2,
+
+ /* Total recv_ctxt pool = sc_max_requests * multiplier
+ * (RQ_DEPTH_MULT for posted receives + 1 for RPCs in process)
+ */
+ SVCRDMA_RECV_CTXT_MULT = 3,
+
+ /* Overhead entries in RQ: sc_max_bc_requests + drain sentinel */
+ SVCRDMA_RQ_OVERHEAD = 3,
+};
+
#define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
/**
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0c048eaf2b8e..333b9468a15b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -301,10 +301,11 @@ bool svc_rdma_post_recvs(struct svcxprt_rdma *rdma)
{
unsigned int total;
- /* For each credit, allocate enough recv_ctxts for one
- * posted Receive and one RPC in process.
+ /* Allocate enough recv_ctxts for:
+ * - SVCRDMA_RQ_DEPTH_MULT * sc_max_requests posted on the RQ
+ * - sc_max_requests RPCs in process
*/
- total = (rdma->sc_max_requests * 2) + rdma->sc_recv_batch;
+ total = rdma->sc_max_requests * SVCRDMA_RECV_CTXT_MULT;
while (total--) {
struct svc_rdma_recv_ctxt *ctxt;
@@ -314,7 +315,8 @@ bool svc_rdma_post_recvs(struct svcxprt_rdma *rdma)
llist_add(&ctxt->rc_node, &rdma->sc_recv_ctxts);
}
- return svc_rdma_refresh_recvs(rdma, rdma->sc_max_requests);
+ return svc_rdma_refresh_recvs(rdma,
+ rdma->sc_max_requests * SVCRDMA_RQ_DEPTH_MULT);
}
/**
@@ -338,18 +340,31 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
goto flushed;
trace_svcrdma_wc_recv(wc, &ctxt->rc_cid);
- /* If receive posting fails, the connection is about to be
- * lost anyway. The server will not be able to send a reply
- * for this RPC, and the client will retransmit this RPC
- * anyway when it reconnects.
+ /* Watermark-based receive posting: The Receive Queue is
+ * provisioned with SVCRDMA_RQ_DEPTH_MULT times the number of
+ * credits (sc_max_requests). Replenish when posted Receives
+ * drop below sc_max_requests (the low watermark), posting
+ * back to full capacity.
*
- * Therefore we drop the Receive, even if status was SUCCESS
- * to reduce the likelihood of replayed requests once the
- * client reconnects.
+ * This batching reduces doorbell rate compared to posting a
+ * fixed small batch on every completion, while ensuring
+ * the Receive Queue is never empty.
+ *
+ * If posting fails, a connection teardown is imminent. The
+ * server will not be able to send a reply for this RPC, and
+ * the client will retransmit this RPC anyway when it
+ * reconnects. Therefore drop the Receive, even if status was
+ * SUCCESS, to reduce the likelihood of replayed requests once
+ * the client reconnects.
*/
- if (rdma->sc_pending_recvs < rdma->sc_max_requests)
- if (!svc_rdma_refresh_recvs(rdma, rdma->sc_recv_batch))
+ if (rdma->sc_pending_recvs < rdma->sc_max_requests) {
+ unsigned int target =
+ (rdma->sc_max_requests * SVCRDMA_RQ_DEPTH_MULT) -
+ rdma->sc_pending_recvs;
+
+ if (!svc_rdma_refresh_recvs(rdma, target))
goto dropped;
+ }
/* All wc fields are now known to be valid */
ctxt->rc_byte_len = wc->byte_len;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 0a3969d36a80..5982006c65a0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -439,7 +439,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
newxprt->sc_max_req_size = svcrdma_max_req_size;
newxprt->sc_max_requests = svcrdma_max_requests;
newxprt->sc_max_bc_requests = svcrdma_max_bc_requests;
- newxprt->sc_recv_batch = RPCRDMA_MAX_RECV_BATCH;
newxprt->sc_fc_credits = cpu_to_be32(newxprt->sc_max_requests);
/* Qualify the transport's resource defaults with the
@@ -452,12 +451,12 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
newxprt->sc_max_send_sges += (svcrdma_max_req_size / PAGE_SIZE) + 1;
if (newxprt->sc_max_send_sges > dev->attrs.max_send_sge)
newxprt->sc_max_send_sges = dev->attrs.max_send_sge;
- rq_depth = newxprt->sc_max_requests + newxprt->sc_max_bc_requests +
- newxprt->sc_recv_batch + 1 /* drain */;
+ rq_depth = (newxprt->sc_max_requests * SVCRDMA_RQ_DEPTH_MULT) +
+ newxprt->sc_max_bc_requests + 1 /* drain */;
if (rq_depth > dev->attrs.max_qp_wr) {
rq_depth = dev->attrs.max_qp_wr;
- newxprt->sc_recv_batch = 1;
- newxprt->sc_max_requests = rq_depth - 2;
+ newxprt->sc_max_requests =
+ (rq_depth - SVCRDMA_RQ_OVERHEAD) / SVCRDMA_RQ_DEPTH_MULT;
newxprt->sc_max_bc_requests = 2;
}
@@ -465,7 +464,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
*/
maxpayload = min(xprt->xpt_server->sv_max_payload,
RPCSVC_MAXPAYLOAD_RDMA);
- ctxts = newxprt->sc_max_requests * 3 *
+ ctxts = newxprt->sc_max_requests * SVCRDMA_RECV_CTXT_MULT *
rdma_rw_mr_factor(dev, newxprt->sc_port_num,
maxpayload >> PAGE_SHIFT);
--
2.52.0
* [RFC PATCH 12/15] svcrdma: Add per-recv_ctxt chunk context cache
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (10 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 11/15] svcrdma: Use watermark-based Receive Queue replenishment Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 13/15] svcrdma: clear XPT_DATA on sc_read_complete_q consumption Chuck Lever
` (2 subsequent siblings)
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Parsed chunk list (PCL) processing currently allocates a new
svc_rdma_chunk structure via kmalloc for each chunk in every
incoming RPC. These allocations add overhead to the receive path.
Introduce a per-recv_ctxt single-entry cache. Over 99% of RPC Calls
that specify RPC/RDMA chunks provide only a single chunk, so a
single cached chunk handles the common case. Chunks with up to
SVC_RDMA_CHUNK_SEGMAX (4) segments are eligible for caching; larger
chunks fall back to dynamic allocation.
Using per-recv_ctxt caching instead of a per-transport pool avoids
the need for locking or atomic operations, since a recv_ctxt is
used by only one thread at a time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc_rdma.h | 2 +
include/linux/sunrpc/svc_rdma_pcl.h | 12 +++++-
net/sunrpc/xprtrdma/svc_rdma_pcl.c | 55 +++++++++++++++++++++----
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 10 +++--
4 files changed, 67 insertions(+), 12 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 8e78f958fa46..2164504093fd 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -204,6 +204,8 @@ struct svc_rdma_chunk_ctxt {
struct svc_rdma_recv_ctxt {
struct llist_node rc_node;
+ struct svcxprt_rdma *rc_rdma;
+ struct svc_rdma_chunk *rc_chunk_cache;
struct ib_recv_wr rc_recv_wr;
struct ib_cqe rc_cqe;
struct rpc_rdma_cid rc_cid;
diff --git a/include/linux/sunrpc/svc_rdma_pcl.h b/include/linux/sunrpc/svc_rdma_pcl.h
index 7516ad0fae80..e23803b19e66 100644
--- a/include/linux/sunrpc/svc_rdma_pcl.h
+++ b/include/linux/sunrpc/svc_rdma_pcl.h
@@ -22,6 +22,7 @@ struct svc_rdma_chunk {
u32 ch_payload_length;
u32 ch_segcount;
+ u32 ch_segmax;
struct svc_rdma_segment ch_segments[];
};
@@ -114,7 +115,16 @@ pcl_chunk_end_offset(const struct svc_rdma_chunk *chunk)
struct svc_rdma_recv_ctxt;
-extern void pcl_free(struct svc_rdma_pcl *pcl);
+/*
+ * Cached chunks have capacity for this many segments.
+ * Typical clients can register up to 120KB per segment, so 4
+ * segments covers most NFS I/O operations. Larger chunks fall
+ * back to kmalloc.
+ */
+#define SVC_RDMA_CHUNK_SEGMAX 4
+
+extern void pcl_free(struct svc_rdma_recv_ctxt *rctxt,
+ struct svc_rdma_pcl *pcl);
extern bool pcl_alloc_call(struct svc_rdma_recv_ctxt *rctxt, __be32 *p);
extern bool pcl_alloc_read(struct svc_rdma_recv_ctxt *rctxt, __be32 *p);
extern bool pcl_alloc_write(struct svc_rdma_recv_ctxt *rctxt,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_pcl.c b/net/sunrpc/xprtrdma/svc_rdma_pcl.c
index b63cfeaa2923..079af7c633fd 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_pcl.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_pcl.c
@@ -9,30 +9,71 @@
#include "xprt_rdma.h"
#include <trace/events/rpcrdma.h>
+static struct svc_rdma_chunk *rctxt_chunk_get(struct svc_rdma_recv_ctxt *rctxt)
+{
+ struct svc_rdma_chunk *chunk = rctxt->rc_chunk_cache;
+
+ if (chunk)
+ rctxt->rc_chunk_cache = NULL;
+ return chunk;
+}
+
+static void rctxt_chunk_put(struct svc_rdma_recv_ctxt *rctxt,
+ struct svc_rdma_chunk *chunk)
+{
+ if (rctxt->rc_chunk_cache) {
+ kfree(chunk);
+ return;
+ }
+ rctxt->rc_chunk_cache = chunk;
+}
+
+static void rctxt_chunk_free(struct svc_rdma_recv_ctxt *rctxt,
+ struct svc_rdma_chunk *chunk)
+{
+ if (chunk->ch_segmax == SVC_RDMA_CHUNK_SEGMAX)
+ rctxt_chunk_put(rctxt, chunk);
+ else
+ kfree(chunk);
+}
+
/**
* pcl_free - Release all memory associated with a parsed chunk list
+ * @rctxt: receive context containing @pcl
* @pcl: parsed chunk list
*
*/
-void pcl_free(struct svc_rdma_pcl *pcl)
+void pcl_free(struct svc_rdma_recv_ctxt *rctxt, struct svc_rdma_pcl *pcl)
{
while (!list_empty(&pcl->cl_chunks)) {
struct svc_rdma_chunk *chunk;
chunk = pcl_first_chunk(pcl);
list_del(&chunk->ch_list);
- kfree(chunk);
+ rctxt_chunk_free(rctxt, chunk);
}
}
-static struct svc_rdma_chunk *pcl_alloc_chunk(u32 segcount, u32 position)
+static struct svc_rdma_chunk *pcl_alloc_chunk(struct svc_rdma_recv_ctxt *rctxt,
+ u32 segcount, u32 position)
{
+ struct ib_device *device = rctxt->rc_rdma->sc_cm_id->device;
struct svc_rdma_chunk *chunk;
- chunk = kmalloc(struct_size(chunk, ch_segments, segcount), GFP_KERNEL);
+ if (segcount <= SVC_RDMA_CHUNK_SEGMAX) {
+ chunk = rctxt_chunk_get(rctxt);
+ if (chunk)
+ goto out;
+ segcount = SVC_RDMA_CHUNK_SEGMAX;
+ }
+
+ chunk = kmalloc_node(struct_size(chunk, ch_segments, segcount),
+ GFP_KERNEL, ibdev_to_node(device));
if (!chunk)
return NULL;
+ chunk->ch_segmax = segcount;
+out:
chunk->ch_position = position;
chunk->ch_length = 0;
chunk->ch_payload_length = 0;
@@ -117,7 +158,7 @@ bool pcl_alloc_call(struct svc_rdma_recv_ctxt *rctxt, __be32 *p)
continue;
if (pcl_is_empty(pcl)) {
- chunk = pcl_alloc_chunk(segcount, position);
+ chunk = pcl_alloc_chunk(rctxt, segcount, position);
if (!chunk)
return false;
pcl_insert_position(pcl, chunk);
@@ -172,7 +213,7 @@ bool pcl_alloc_read(struct svc_rdma_recv_ctxt *rctxt, __be32 *p)
chunk = pcl_lookup_position(pcl, position);
if (!chunk) {
- chunk = pcl_alloc_chunk(segcount, position);
+ chunk = pcl_alloc_chunk(rctxt, segcount, position);
if (!chunk)
return false;
pcl_insert_position(pcl, chunk);
@@ -210,7 +251,7 @@ bool pcl_alloc_write(struct svc_rdma_recv_ctxt *rctxt,
p++; /* skip the list discriminator */
segcount = be32_to_cpup(p++);
- chunk = pcl_alloc_chunk(segcount, 0);
+ chunk = pcl_alloc_chunk(rctxt, segcount, 0);
if (!chunk)
return false;
list_add_tail(&chunk->ch_list, &pcl->cl_chunks);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 333b9468a15b..b48ef78c79c2 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -122,6 +122,7 @@ svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
GFP_KERNEL, ibdev_to_node(device));
if (!ctxt)
goto fail0;
+ ctxt->rc_rdma = rdma;
ctxt->rc_maxpages = pages;
buffer = kmalloc_node(rdma->sc_max_req_size, GFP_KERNEL,
ibdev_to_node(device));
@@ -161,6 +162,7 @@ svc_rdma_recv_ctxt_alloc(struct svcxprt_rdma *rdma)
static void svc_rdma_recv_ctxt_destroy(struct svcxprt_rdma *rdma,
struct svc_rdma_recv_ctxt *ctxt)
{
+ kfree(ctxt->rc_chunk_cache);
ib_dma_unmap_single(rdma->sc_cm_id->device, ctxt->rc_recv_sge.addr,
ctxt->rc_recv_sge.length, DMA_FROM_DEVICE);
kfree(ctxt->rc_recv_buf);
@@ -219,10 +221,10 @@ void svc_rdma_recv_ctxt_put(struct svcxprt_rdma *rdma,
*/
release_pages(ctxt->rc_pages, ctxt->rc_page_count);
- pcl_free(&ctxt->rc_call_pcl);
- pcl_free(&ctxt->rc_read_pcl);
- pcl_free(&ctxt->rc_write_pcl);
- pcl_free(&ctxt->rc_reply_pcl);
+ pcl_free(ctxt, &ctxt->rc_call_pcl);
+ pcl_free(ctxt, &ctxt->rc_read_pcl);
+ pcl_free(ctxt, &ctxt->rc_write_pcl);
+ pcl_free(ctxt, &ctxt->rc_reply_pcl);
llist_add(&ctxt->rc_node, &rdma->sc_recv_ctxts);
}
--
2.52.0
* [RFC PATCH 13/15] svcrdma: clear XPT_DATA on sc_read_complete_q consumption
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (11 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 12/15] svcrdma: Add per-recv_ctxt chunk context cache Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 14/15] svcrdma: retry when receive queues drain transiently Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 15/15] svcrdma: clear XPT_DATA on sc_rq_dto_q consumption Chuck Lever
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_wc_read_done() sets XPT_DATA when adding a
completed RDMA Read context to sc_read_complete_q. The
consumer in svc_rdma_recvfrom() takes the context but
leaves XPT_DATA set. The subsequent svc_xprt_received()
clears XPT_BUSY and re-enqueues the transport; because
XPT_DATA remains set, a second thread awakens. That thread
finds both queues empty, accomplishes nothing, and releases
its slot and reservation.
Trace data from a 256KB NFSv3 WRITE workload over RDMA
shows approximately 14 enqueue attempts per RPC, with 62%
returning immediately due to no pending data. The majority
originate from this spurious dispatch path.
After clearing XPT_DATA to acknowledge consumption, the
XPT_DATA state must be recomputed from both queue states.
A concurrent producer may call llist_add and then
set_bit(XPT_DATA) between this consumer's llist_del_first
and the clear_bit, causing clear_bit to erase the producer's
signal. An smp_mb__after_atomic() barrier after clear_bit
pairs with the implicit barrier in each producer's llist_add
cmpxchg, ensuring llist_empty rechecks observe any add whose
set_bit was erased. This barrier requirement applies at both
call sites: the new sc_read_complete_q path and the
pre-existing sc_rq_dto_q "both queues empty" path.
A new helper svc_rdma_update_xpt_data() centralizes this
clear/barrier/recheck/set pattern to ensure both locations
maintain the required memory ordering.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 33 ++++++++++++++++---------
1 file changed, 22 insertions(+), 11 deletions(-)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index b48ef78c79c2..2ee9819a53d7 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -917,6 +917,25 @@ static noinline void svc_rdma_read_complete(struct svc_rqst *rqstp,
trace_svcrdma_read_finished(&ctxt->rc_cid);
}
+/*
+ * Recompute XPT_DATA from queue state after consuming a completion. A
+ * concurrent producer may have called llist_add and then set_bit(XPT_DATA)
+ * between this consumer's llist_del_first and the clear_bit below, causing
+ * clear_bit to erase the producer's signal. The barrier pairs with the
+ * implicit barrier in each producer's llist_add so that the llist_empty
+ * rechecks observe any add whose set_bit was erased.
+ */
+static void svc_rdma_update_xpt_data(struct svcxprt_rdma *rdma)
+{
+ struct svc_xprt *xprt = &rdma->sc_xprt;
+
+ clear_bit(XPT_DATA, &xprt->xpt_flags);
+ smp_mb__after_atomic();
+ if (!llist_empty(&rdma->sc_rq_dto_q) ||
+ !llist_empty(&rdma->sc_read_complete_q))
+ set_bit(XPT_DATA, &xprt->xpt_flags);
+}
+
/**
* svc_rdma_recvfrom - Receive an RPC call
* @rqstp: request structure into which to receive an RPC Call
@@ -965,6 +984,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
node = llist_del_first(&rdma_xprt->sc_read_complete_q);
if (node) {
ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
+
+ svc_rdma_update_xpt_data(rdma_xprt);
svc_xprt_received(xprt);
svc_rdma_read_complete(rqstp, ctxt);
goto complete;
@@ -975,17 +996,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
} else {
ctxt = NULL;
/* No new incoming requests, terminate the loop */
- clear_bit(XPT_DATA, &xprt->xpt_flags);
-
- /*
- * If a completion arrived after llist_del_first but
- * before clear_bit, the producer's set_bit would be
- * cleared above. Recheck both queues to close this
- * race window.
- */
- if (!llist_empty(&rdma_xprt->sc_rq_dto_q) ||
- !llist_empty(&rdma_xprt->sc_read_complete_q))
- set_bit(XPT_DATA, &xprt->xpt_flags);
+ svc_rdma_update_xpt_data(rdma_xprt);
}
/* Unblock the transport for the next receive */
--
2.52.0
* [RFC PATCH 14/15] svcrdma: retry when receive queues drain transiently
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (12 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 13/15] svcrdma: clear XPT_DATA on sc_read_complete_q consumption Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
2026-02-10 16:32 ` [RFC PATCH 15/15] svcrdma: clear XPT_DATA on sc_rq_dto_q consumption Chuck Lever
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
When svc_rdma_recvfrom() finds both sc_read_complete_q
and sc_rq_dto_q empty, svc_rdma_update_xpt_data() clears
XPT_DATA, executes a barrier, and rechecks the queues.
If a completion arrived between the llist_del_first and
the recheck, XPT_DATA is re-set, but recvfrom returns
zero regardless. The thread then traverses the full
svc_recv cycle -- page allocation, dequeue, recvfrom,
release -- only to find the item that was already
available at the time of the recheck.
Trace data from a 256KB NFSv3 workload over RDMA shows
267,848 of 464,355 transport dequeues (57.7%) are these
empty bounces. Each bounce costs roughly 37 us. During
the READ phase, empty bounces consume 8.6% of thread
capacity and inflate inter-RPC gaps by an average of
87 us.
The calling thread holds XPT_BUSY for the duration, so
no other consumer can drain the queue between the
recheck and the retry. A retry is therefore guaranteed
to find data on its first iteration.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 2ee9819a53d7..a124c6ed057a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -981,6 +981,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
rqstp->rq_xprt_ctxt = NULL;
+retry:
node = llist_del_first(&rdma_xprt->sc_read_complete_q);
if (node) {
ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
@@ -995,8 +996,17 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
} else {
ctxt = NULL;
- /* No new incoming requests, terminate the loop */
svc_rdma_update_xpt_data(rdma_xprt);
+ /*
+ * A completion may have arrived between the
+ * llist_del_first above and the queue recheck
+ * inside svc_rdma_update_xpt_data. This thread
+ * holds XPT_BUSY, preventing any other consumer
+ * from draining the queue in the meantime.
+ * Retry to avoid a full svc_recv round-trip.
+ */
+ if (test_bit(XPT_DATA, &xprt->xpt_flags))
+ goto retry;
}
/* Unblock the transport for the next receive */
--
2.52.0
* [RFC PATCH 15/15] svcrdma: clear XPT_DATA on sc_rq_dto_q consumption
2026-02-10 16:32 [RFC PATCH 00/15] svcrdma performance scalability enhancements Chuck Lever
` (13 preceding siblings ...)
2026-02-10 16:32 ` [RFC PATCH 14/15] svcrdma: retry when receive queues drain transiently Chuck Lever
@ 2026-02-10 16:32 ` Chuck Lever
14 siblings, 0 replies; 16+ messages in thread
From: Chuck Lever @ 2026-02-10 16:32 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, linux-rdma, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
svc_rdma_wc_receive() sets XPT_DATA when adding a
completed Receive to sc_rq_dto_q. When
svc_rdma_recvfrom() consumes the item from sc_rq_dto_q,
XPT_DATA is left set. The subsequent svc_xprt_received()
clears XPT_BUSY and re-enqueues the transport; because
stale XPT_DATA remains set, svc_xprt_enqueue() dispatches
a second thread. That thread finds both queues empty,
accomplishes nothing, and returns zero.
Trace data from a 256KB NFSv3 workload over RDMA shows
172,280 of 467,171 transport dequeues (36.9%) are these
spurious dispatches. The READ phase averages 1.99
dequeues per RPC (expected 1.0) and the WRITE phase
averages 2.77 (expected 2.0). Each wasted cycle traverses
svc_alloc_arg, svc_thread_wait_for_work,
svc_rdma_recvfrom, and svc_xprt_release before the
thread can accept new work.
Add svc_rdma_update_xpt_data() on the sc_rq_dto_q
success path, matching the existing call on the
sc_read_complete_q path added by commit 6807f36a39b7
("svcrdma: clear XPT_DATA on sc_read_complete_q
consumption"). The same barrier semantics apply: the
clear/recheck pattern in svc_rdma_update_xpt_data()
ensures a concurrent producer's llist_add + set_bit
is not lost.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index a124c6ed057a..c56d70658068 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -994,6 +994,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
node = llist_del_first(&rdma_xprt->sc_rq_dto_q);
if (node) {
ctxt = llist_entry(node, struct svc_rdma_recv_ctxt, rc_node);
+ svc_rdma_update_xpt_data(rdma_xprt);
} else {
ctxt = NULL;
svc_rdma_update_xpt_data(rdma_xprt);
--
2.52.0