Linux RDMA and InfiniBand development
* [PATCH 0/2] svcrdma: Reduce svcrdma_wq contention on the Send completion path
@ 2026-05-06 15:26 Chuck Lever
  2026-05-06 15:26 ` [PATCH 1/2] svcrdma: Release write chunk resources without re-queuing Chuck Lever
  2026-05-06 15:26 ` [PATCH 2/2] svcrdma: Defer send context release to xpo_release_ctxt Chuck Lever
  0 siblings, 2 replies; 5+ messages in thread
From: Chuck Lever @ 2026-05-06 15:26 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey
  Cc: linux-nfs, linux-rdma, Chuck Lever

Profiling an 8KB NFSv3 read/write workload over RDMA shows about
4% of total CPU cycles spent on the spinlock of the svcrdma_wq
unbound workqueue pool. Each Send completion queues a work item on
svcrdma_wq to release its send_ctxt, and that work item queues
another work item for each write_info chunk it owns. Every
queue_work call contends on the same pool lock.

The first patch removes the inner re-queue.
svc_rdma_write_info_free already runs on svcrdma_wq from its
caller, so the extra work item only adds another spinlock
acquisition with no parallelism to gain. Inlining the chunk
release recovers roughly 1% of CPU cycles. Mike, your workload
might see relief from just this patch alone.
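
For reference, the shape of the change (a simplified excerpt of
patch 1; the complete hunk is in that patch):

  static void svc_rdma_write_info_free(struct svc_rdma_write_info *info)
  {
          /* The caller is already running on svcrdma_wq, so release
           * the chunk resources inline instead of bouncing through
           * another work item.
           */
          svc_rdma_cc_release(info->wi_rdma, &info->wi_cc, DMA_TO_DEVICE);
          kfree(info);
  }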

The second patch retires svcrdma_wq. Send completion handlers
append the send_ctxt to a per-transport lock-free list, and the
nfsd thread drains the list in xpo_release_ctxt between RPCs.
DMA unmap and page release move out of the completion context.
That matters when an IOMMU runs in strict mode, where each unmap
synchronously invalidates the IOTLB; the nfsd thread absorbs that
latency where it is harmless and batches teardown across all
completions that accumulated during the prior RPC.
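
The drain itself is a plain llist consumer; a lightly commented
excerpt from patch 2:

  void svc_rdma_send_ctxts_drain(struct svcxprt_rdma *rdma)
  {
          struct svc_rdma_send_ctxt *ctxt, *next;
          struct llist_node *node;

          /* Detach everything that Send completions have queued
           * since the last drain, in one atomic operation.
           */
          node = llist_del_all(&rdma->sc_send_release_list);

          /* DMA-unmap each ctxt, release its pages, and return it
           * to the transport's free list.
           */
          llist_for_each_entry_safe(ctxt, next, node, sc_node)
                  svc_rdma_send_ctxt_release(rdma, ctxt);
  }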

A self-enqueue covers the trailing edge of a burst. When a Send
completion finds sc_send_release_list previously empty on an idle
connection, it sets XPT_DATA and enqueues the transport. The nfsd
thread enters svc_rdma_recvfrom, finds nothing to receive, and
returns; svc_xprt_release then runs xpo_release_ctxt and drains
the list. Without that wakeup, a Send completion arriving after
the last xpo_release_ctxt would leave the send_ctxt's DMA mappings
and reply pages pinned until the next RPC, send-context exhaustion,
or transport close.
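
The completion-side hook relies on the empty-to-non-empty indication
that llist_add() already provides; excerpted from patch 2:

  void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
                              struct svc_rdma_send_ctxt *ctxt)
  {
          /* llist_add() returns true only when the list was
           * previously empty, so a busy transport takes this
           * branch at most once per burst.
           */
          if (llist_add(&ctxt->sc_node, &rdma->sc_send_release_list)) {
                  set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
                  svc_xprt_enqueue(&rdma->sc_xprt);
          }
  }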

The patches were rebased today but have not been tested recently.

---
Chuck Lever (2):
      svcrdma: Release write chunk resources without re-queuing
      svcrdma: Defer send context release to xpo_release_ctxt

 include/linux/sunrpc/svc_rdma.h          |  6 +--
 net/sunrpc/xprtrdma/svc_rdma.c           | 18 +------
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  9 ++++
 net/sunrpc/xprtrdma/svc_rdma_rw.c        | 13 +----
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 91 +++++++++++++++++++++++---------
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  3 +-
 6 files changed, 84 insertions(+), 56 deletions(-)
---
base-commit: d1c29a34fe35c1eb9331cab0537c7bb583692187
change-id: 20260506-svcrdma-next-2e736249390f

Best regards,
--  
Chuck Lever <chuck.lever@oracle.com>



* [PATCH 1/2] svcrdma: Release write chunk resources without re-queuing
  2026-05-06 15:26 [PATCH 0/2] svcrdma: Reduce svcrdma_wq contention on the Send completion path Chuck Lever
@ 2026-05-06 15:26 ` Chuck Lever
  2026-05-07 20:46   ` Mike Snitzer
  2026-05-06 15:26 ` [PATCH 2/2] svcrdma: Defer send context release to xpo_release_ctxt Chuck Lever
  1 sibling, 1 reply; 5+ messages in thread
From: Chuck Lever @ 2026-05-06 15:26 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Each RDMA Send completion triggers a cascade of work items on the
svcrdma_wq unbound workqueue:

  ib_cq_poll_work (on ib_comp_wq, per-CPU)
    -> svc_rdma_send_ctxt_put -> queue_work    [work item 1]
      -> svc_rdma_write_info_free -> queue_work [work item 2]

Every transition through queue_work contends on the unbound
pool's spinlock. Profiling an 8KB NFSv3 read/write workload
over RDMA shows about 4% of total CPU cycles spent on this
lock, with the cascading re-queue of write_info release
contributing roughly 1%.

The initial queue_work in svc_rdma_send_ctxt_put is needed to
move release work off the CQ completion context (which runs on
a per-CPU bound workqueue). However, once executing on
svcrdma_wq, there is no need to re-queue for each write_info
structure. svc_rdma_reply_chunk_release already calls
svc_rdma_cc_release inline from the same svcrdma_wq context,
and svc_rdma_recv_ctxt_put does the same from nfsd thread
context.

Release write chunk resources inline in
svc_rdma_write_info_free, removing the intermediate
svc_rdma_write_info_free_async work item and the wi_work
field from struct svc_rdma_write_info.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h   |  1 -
 net/sunrpc/xprtrdma/svc_rdma_rw.c | 13 ++-----------
 2 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad57..14eb9d52742e 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -230,7 +230,6 @@ struct svc_rdma_write_info {
 	unsigned int		wi_next_off;
 
 	struct svc_rdma_chunk_ctxt	wi_cc;
-	struct work_struct	wi_work;
 };
 
 struct svc_rdma_send_ctxt {
diff --git a/net/sunrpc/xprtrdma/svc_rdma_rw.c b/net/sunrpc/xprtrdma/svc_rdma_rw.c
index 402e2ceca4ff..cca8ec973de4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_rw.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_rw.c
@@ -236,19 +236,10 @@ svc_rdma_write_info_alloc(struct svcxprt_rdma *rdma,
 	return info;
 }
 
-static void svc_rdma_write_info_free_async(struct work_struct *work)
-{
-	struct svc_rdma_write_info *info;
-
-	info = container_of(work, struct svc_rdma_write_info, wi_work);
-	svc_rdma_cc_release(info->wi_rdma, &info->wi_cc, DMA_TO_DEVICE);
-	kfree(info);
-}
-
 static void svc_rdma_write_info_free(struct svc_rdma_write_info *info)
 {
-	INIT_WORK(&info->wi_work, svc_rdma_write_info_free_async);
-	queue_work(svcrdma_wq, &info->wi_work);
+	svc_rdma_cc_release(info->wi_rdma, &info->wi_cc, DMA_TO_DEVICE);
+	kfree(info);
 }
 
 /**

-- 
2.53.0



* [PATCH 2/2] svcrdma: Defer send context release to xpo_release_ctxt
  2026-05-06 15:26 [PATCH 0/2] svcrdma: Reduce svcrdma_wq contention on the Send completion path Chuck Lever
  2026-05-06 15:26 ` [PATCH 1/2] svcrdma: Release write chunk resources without re-queuing Chuck Lever
@ 2026-05-06 15:26 ` Chuck Lever
  1 sibling, 0 replies; 5+ messages in thread
From: Chuck Lever @ 2026-05-06 15:26 UTC (permalink / raw)
  To: Mike Snitzer, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey
  Cc: linux-nfs, linux-rdma, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Send completion currently queues a work item on the svcrdma_wq
unbound workqueue for each completed send context. Under load,
Send completion handlers contend for the shared workqueue pool
lock.

Replace the workqueue with a per-transport lock-free list
(llist). The Send completion handler appends the send_ctxt
to sc_send_release_list and does no further teardown. The
nfsd thread drains the list in xpo_release_ctxt between
RPCs, performing DMA unmapping, chunk I/O resource release,
and page release in a batch.

This eliminates both the workqueue pool lock and the DMA
unmap cost from the Send completion path. DMA unmapping can
be expensive when an IOMMU is present in strict mode, as
each unmap triggers a synchronous hardware IOTLB
invalidation. Moving it to the nfsd thread, where that
latency is harmless, avoids penalizing completion handler
throughput.

The nfsd threads absorb the release cost at a point where
the client is no longer waiting on a reply, and natural
batching amortizes the overhead when completions arrive
faster than RPCs complete.

A self-enqueue backstops the drain on a quiescing transport.
When svc_rdma_send_ctxt_put() observes that its llist_add()
transitions sc_send_release_list from empty to non-empty,
it sets XPT_DATA and calls svc_xprt_enqueue() so that
svc_xprt_ready() schedules an nfsd thread. The thread
enters svc_rdma_recvfrom(), finds no pending receive,
clears XPT_DATA, and returns 0; svc_xprt_release() then
runs xpo_release_ctxt and drains the list. Under steady
load the foreground drain keeps the list non-empty between
adds and no enqueue fires; only the trailing edge of a
burst pays for a wakeup. Without this path, a Send
completion arriving after the last xpo_release_ctxt on an
idle connection would leave the send_ctxt's DMA mappings
and reply pages pinned until the next RPC, send-context
exhaustion, or transport close.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h          |  5 +-
 net/sunrpc/xprtrdma/svc_rdma.c           | 18 +------
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  9 ++++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 91 +++++++++++++++++++++++---------
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  3 +-
 5 files changed, 82 insertions(+), 44 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 14eb9d52742e..4ba39f07371d 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -66,7 +66,6 @@ extern unsigned int svcrdma_ord;
 extern unsigned int svcrdma_max_requests;
 extern unsigned int svcrdma_max_bc_requests;
 extern unsigned int svcrdma_max_req_size;
-extern struct workqueue_struct *svcrdma_wq;
 
 extern struct percpu_counter svcrdma_stat_read;
 extern struct percpu_counter svcrdma_stat_recv;
@@ -117,6 +116,8 @@ struct svcxprt_rdma {
 
 	struct llist_head    sc_recv_ctxts;
 
+	struct llist_head    sc_send_release_list;
+
 	atomic_t	     sc_completion_ids;
 };
 /* sc_flags */
@@ -235,7 +236,6 @@ struct svc_rdma_write_info {
 struct svc_rdma_send_ctxt {
 	struct llist_node	sc_node;
 	struct rpc_rdma_cid	sc_cid;
-	struct work_struct	sc_work;
 
 	struct svcxprt_rdma	*sc_rdma;
 	struct ib_send_wr	sc_send_wr;
@@ -299,6 +299,7 @@ extern int svc_rdma_process_read_list(struct svcxprt_rdma *rdma,
 
 /* svc_rdma_sendto.c */
 extern void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma);
+extern void svc_rdma_send_ctxts_drain(struct svcxprt_rdma *rdma);
 extern struct svc_rdma_send_ctxt *
 		svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma);
 extern void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 415c0310101f..f67f0612b1a9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -264,38 +264,22 @@ static int svc_rdma_proc_init(void)
 	return rc;
 }
 
-struct workqueue_struct *svcrdma_wq;
-
 void svc_rdma_cleanup(void)
 {
 	svc_unreg_xprt_class(&svc_rdma_class);
 	svc_rdma_proc_cleanup();
-	if (svcrdma_wq) {
-		struct workqueue_struct *wq = svcrdma_wq;
-
-		svcrdma_wq = NULL;
-		destroy_workqueue(wq);
-	}
 
 	dprintk("SVCRDMA Module Removed, deregister RPC RDMA transport\n");
 }
 
 int svc_rdma_init(void)
 {
-	struct workqueue_struct *wq;
 	int rc;
 
-	wq = alloc_workqueue("svcrdma", WQ_UNBOUND, 0);
-	if (!wq)
-		return -ENOMEM;
-
 	rc = svc_rdma_proc_init();
-	if (rc) {
-		destroy_workqueue(wq);
+	if (rc)
 		return rc;
-	}
 
-	svcrdma_wq = wq;
 	svc_reg_xprt_class(&svc_rdma_class);
 
 	dprintk("SVCRDMA Module Init, register RPC RDMA transport\n");
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index f8a0638eb095..19503a12d0a2 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -242,6 +242,10 @@ void svc_rdma_recv_ctxt_put(struct svcxprt_rdma *rdma,
  * Ensure that the recv_ctxt is released whether or not a Reply
  * was sent. For example, the client could close the connection,
  * or svc_process could drop an RPC, before the Reply is sent.
+ *
+ * Also drain any send_ctxts queued for deferred release so that
+ * DMA unmap and page release run in nfsd thread context between
+ * RPCs rather than on the Send completion path.
  */
 void svc_rdma_release_ctxt(struct svc_xprt *xprt, void *vctxt)
 {
@@ -251,6 +255,8 @@ void svc_rdma_release_ctxt(struct svc_xprt *xprt, void *vctxt)
 
 	if (ctxt)
 		svc_rdma_recv_ctxt_put(rdma, ctxt);
+
+	svc_rdma_send_ctxts_drain(rdma);
 }
 
 static bool svc_rdma_refresh_recvs(struct svcxprt_rdma *rdma,
@@ -384,6 +390,9 @@ static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
  * svc_rdma_flush_recv_queues - Drain pending Receive work
  * @rdma: svcxprt_rdma being shut down
  *
+ * Caller must guarantee that @rdma's Send and Recv Completion
+ * Queues are empty (e.g., via ib_drain_qp()), so that no completion
+ * handlers can still produce work on the queues being drained.
  */
 void svc_rdma_flush_recv_queues(struct svcxprt_rdma *rdma)
 {
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 8b3f0c8c14b2..eceefd21bec8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -79,21 +79,21 @@
  * The ownership of all of the Reply's pages are transferred into that
  * ctxt, the Send WR is posted, and sendto returns.
  *
- * The svc_rdma_send_ctxt is presented when the Send WR completes. The
- * Send completion handler finally releases the Reply's pages.
- *
- * This mechanism also assumes that completions on the transport's Send
- * Completion Queue do not run in parallel. Otherwise a Write completion
- * and Send completion running at the same time could release pages that
- * are still DMA-mapped.
+ * The svc_rdma_send_ctxt is presented when the Send WR completes.
+ * The Send completion handler queues the send_ctxt onto the
+ * per-transport sc_send_release_list (a lock-free llist). The
+ * nfsd thread drains sc_send_release_list in xpo_release_ctxt
+ * between RPCs, DMA-unmapping SGEs, releasing chunk I/O
+ * resources and pages, and returning send_ctxts to the free
+ * list in a batch.
  *
  * Error Handling
  *
  * - If the Send WR is posted successfully, it will either complete
  *   successfully, or get flushed. Either way, the Send completion
- *   handler releases the Reply's pages.
- * - If the Send WR cannot be not posted, the forward path releases
- *   the Reply's pages.
+ *   handler queues the send_ctxt for deferred release.
+ * - If the Send WR cannot be posted, the forward path releases the
+ *   Reply's pages.
  *
  * This handles the case, without the use of page reference counting,
  * where two different Write segments send portions of the same page.
@@ -226,14 +226,25 @@ struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
 	return ctxt;
 
 out_empty:
+	svc_rdma_send_ctxts_drain(rdma);
+
+	spin_lock(&rdma->sc_send_lock);
+	node = llist_del_first(&rdma->sc_send_ctxts);
+	spin_unlock(&rdma->sc_send_lock);
+	if (node) {
+		ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
+		goto out;
+	}
+
 	ctxt = svc_rdma_send_ctxt_alloc(rdma);
 	if (!ctxt)
 		return NULL;
 	goto out;
 }
 
-static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
-				       struct svc_rdma_send_ctxt *ctxt)
+/* Release chunk I/O resources and DMA-unmap SGEs. */
+static void svc_rdma_send_ctxt_unmap(struct svcxprt_rdma *rdma,
+				     struct svc_rdma_send_ctxt *ctxt)
 {
 	struct ib_device *device = rdma->sc_cm_id->device;
 	unsigned int i;
@@ -241,9 +252,6 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
 	svc_rdma_write_chunk_release(rdma, ctxt);
 	svc_rdma_reply_chunk_release(rdma, ctxt);
 
-	if (ctxt->sc_page_count)
-		release_pages(ctxt->sc_pages, ctxt->sc_page_count);
-
 	/* The first SGE contains the transport header, which
 	 * remains mapped until @ctxt is destroyed.
 	 */
@@ -256,30 +264,56 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
 				  ctxt->sc_sges[i].length,
 				  DMA_TO_DEVICE);
 	}
+}
+
+/* Unmap, release pages, and return send_ctxt to the free list. */
+static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
+				       struct svc_rdma_send_ctxt *ctxt)
+{
+	svc_rdma_send_ctxt_unmap(rdma, ctxt);
+
+	if (ctxt->sc_page_count)
+		release_pages(ctxt->sc_pages, ctxt->sc_page_count);
 
 	llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
 }
 
-static void svc_rdma_send_ctxt_put_async(struct work_struct *work)
+/**
+ * svc_rdma_send_ctxts_drain - Release completed send_ctxts
+ * @rdma: controlling svcxprt_rdma
+ */
+void svc_rdma_send_ctxts_drain(struct svcxprt_rdma *rdma)
 {
-	struct svc_rdma_send_ctxt *ctxt;
+	struct svc_rdma_send_ctxt *ctxt, *next;
+	struct llist_node *node;
 
-	ctxt = container_of(work, struct svc_rdma_send_ctxt, sc_work);
-	svc_rdma_send_ctxt_release(ctxt->sc_rdma, ctxt);
+	node = llist_del_all(&rdma->sc_send_release_list);
+	llist_for_each_entry_safe(ctxt, next, node, sc_node)
+		svc_rdma_send_ctxt_release(rdma, ctxt);
 }
 
 /**
- * svc_rdma_send_ctxt_put - Return send_ctxt to free list
+ * svc_rdma_send_ctxt_put - Queue send_ctxt for deferred release
  * @rdma: controlling svcxprt_rdma
- * @ctxt: object to return to the free list
+ * @ctxt: send_ctxt to queue for deferred release
  *
- * Pages left in sc_pages are DMA unmapped and released.
+ * Queues @ctxt onto sc_send_release_list. DMA unmap and
+ * page release run later in svc_rdma_send_ctxts_drain(),
+ * typically from xpo_release_ctxt.
+ *
+ * On the empty-to-non-empty transition, set XPT_DATA and
+ * enqueue the transport. Without this self-trigger, a Send
+ * completion arriving after the last xpo_release_ctxt on an
+ * idle connection would leave the send_ctxt's DMA mappings
+ * and reply pages pinned until another drain occurred.
  */
 void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
 			    struct svc_rdma_send_ctxt *ctxt)
 {
-	INIT_WORK(&ctxt->sc_work, svc_rdma_send_ctxt_put_async);
-	queue_work(svcrdma_wq, &ctxt->sc_work);
+	if (llist_add(&ctxt->sc_node, &rdma->sc_send_release_list)) {
+		set_bit(XPT_DATA, &rdma->sc_xprt.xpt_flags);
+		svc_xprt_enqueue(&rdma->sc_xprt);
+	}
 }
 
 /**
@@ -367,6 +401,15 @@ int svc_rdma_sq_wait(struct svcxprt_rdma *rdma,
 	atomic_inc(&rdma->sc_sq_ticket_tail);
 	wake_up(&rdma->sc_sq_ticket_wait);
 	trace_svcrdma_sq_retry(rdma, cid);
+
+	/*
+	 * While this thread sat on sc_send_wait or sc_sq_ticket_wait,
+	 * Send completions that tried to enqueue this transport for a
+	 * release-list drain were rejected: svc_rdma_has_wspace returns
+	 * 0 while either waitqueue is active, and svc_xprt_ready
+	 * rejects the enqueue. Drain the release list now.
+	 */
+	svc_rdma_send_ctxts_drain(rdma);
 	return 0;
 
 out_close:
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4f..f99cd6177504 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -178,6 +178,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
 	init_llist_head(&cma_xprt->sc_send_ctxts);
 	init_llist_head(&cma_xprt->sc_recv_ctxts);
 	init_llist_head(&cma_xprt->sc_rw_ctxts);
+	init_llist_head(&cma_xprt->sc_send_release_list);
 	init_waitqueue_head(&cma_xprt->sc_send_wait);
 	init_waitqueue_head(&cma_xprt->sc_sq_ticket_wait);
 
@@ -614,7 +615,7 @@ static void svc_rdma_free(struct svc_xprt *xprt)
 	/* This blocks until the Completion Queues are empty */
 	if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
 		ib_drain_qp(rdma->sc_qp);
-	flush_workqueue(svcrdma_wq);
+	svc_rdma_send_ctxts_drain(rdma);
 
 	svc_rdma_flush_recv_queues(rdma);
 

-- 
2.53.0



* Re: [PATCH 1/2] svcrdma: Release write chunk resources without re-queuing
  2026-05-06 15:26 ` [PATCH 1/2] svcrdma: Release write chunk resources without re-queuing Chuck Lever
@ 2026-05-07 20:46   ` Mike Snitzer
  2026-05-08 20:14     ` Chuck Lever
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Snitzer @ 2026-05-07 20:46 UTC (permalink / raw)
  To: Chuck Lever, Jonathan Flynn
  Cc: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, linux-rdma, Chuck Lever, ben.coddington

On Wed, May 06, 2026 at 11:26:50AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Each RDMA Send completion triggers a cascade of work items on the
> svcrdma_wq unbound workqueue:
> 
>   ib_cq_poll_work (on ib_comp_wq, per-CPU)
>     -> svc_rdma_send_ctxt_put -> queue_work    [work item 1]
>       -> svc_rdma_write_info_free -> queue_work [work item 2]
> 
> Every transition through queue_work contends on the unbound
> pool's spinlock. Profiling an 8KB NFSv3 read/write workload
> over RDMA shows about 4% of total CPU cycles spent on this
> lock, with the cascading re-queue of write_info release
> contributing roughly 1%.
> 
> The initial queue_work in svc_rdma_send_ctxt_put is needed to
> move release work off the CQ completion context (which runs on
> a per-CPU bound workqueue). However, once executing on
> svcrdma_wq, there is no need to re-queue for each write_info
> structure. svc_rdma_reply_chunk_release already calls
> svc_rdma_cc_release inline from the same svcrdma_wq context,
> and svc_rdma_recv_ctxt_put does the same from nfsd thread
> context.
> 
> Release write chunk resources inline in
> svc_rdma_write_info_free, removing the intermediate
> svc_rdma_write_info_free_async work item and the wi_work
> field from struct svc_rdma_write_info.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

You were correct: this patch alone eliminates the OOM (we tested with
both 16K and then 4K read IO from 121 clients to 8 NFS servers, no
measurable memory growth while testing).

Feel free to add:

Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>

Thanks!
Mike

ps.
So you are aware, couldn't test your 2nd patch at the customer site
because the baseline kernel there is based on 6.12-stable but your
2nd patch builds on your 7.1 svcrdma changes. I think your 2nd patch
is ideal though, and will be able to pull it in to test in future, but
won't have the ability to test at this customer's scale until we can
roll that newer kernel out there... might take a couple months.


* Re: [PATCH 1/2] svcrdma: Release write chunk resources without re-queuing
  2026-05-07 20:46   ` Mike Snitzer
@ 2026-05-08 20:14     ` Chuck Lever
  0 siblings, 0 replies; 5+ messages in thread
From: Chuck Lever @ 2026-05-08 20:14 UTC (permalink / raw)
  To: Mike Snitzer, Jonathan Flynn
  Cc: Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, linux-rdma, Chuck Lever, ben.coddington

On 5/7/26 4:46 PM, Mike Snitzer wrote:
> On Wed, May 06, 2026 at 11:26:50AM -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Each RDMA Send completion triggers a cascade of work items on the
>> svcrdma_wq unbound workqueue:
>>
>>   ib_cq_poll_work (on ib_comp_wq, per-CPU)
>>     -> svc_rdma_send_ctxt_put -> queue_work    [work item 1]
>>       -> svc_rdma_write_info_free -> queue_work [work item 2]
>>
>> Every transition through queue_work contends on the unbound
>> pool's spinlock. Profiling an 8KB NFSv3 read/write workload
>> over RDMA shows about 4% of total CPU cycles spent on this
>> lock, with the cascading re-queue of write_info release
>> contributing roughly 1%.
>>
>> The initial queue_work in svc_rdma_send_ctxt_put is needed to
>> move release work off the CQ completion context (which runs on
>> a per-CPU bound workqueue). However, once executing on
>> svcrdma_wq, there is no need to re-queue for each write_info
>> structure. svc_rdma_reply_chunk_release already calls
>> svc_rdma_cc_release inline from the same svcrdma_wq context,
>> and svc_rdma_recv_ctxt_put does the same from nfsd thread
>> context.
>>
>> Release write chunk resources inline in
>> svc_rdma_write_info_free, removing the intermediate
>> svc_rdma_write_info_free_async work item and the wi_work
>> field from struct svc_rdma_write_info.
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
> You were correct: this patch alone eliminates the OOM (we tested with
> both 16K and then 4K read IO from 121 clients to 8 NFS servers, no
> measurable memory growth while testing).

Excellent news! Thanks to you both for testing.


> Feel free to add:
> 
> Reviewed-by: Mike Snitzer <snitzer@kernel.org>
> Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
> 
> Thanks!
> Mike
> 
> ps.
> So you are aware, couldn't test your 2nd patch at the customer site
> because the baseline kernel there is based on 6.12-stable but your
> 2nd patch builds on your 7.1 svcrdma changes. I think your 2nd patch
> is ideal though, and will be able to pull it in to test in future, but
> won't have the ability to test at this customer's scale until we can
> roll that newer kernel out there... might take a couple months.

Understood.


-- 
Chuck Lever


end of thread

Thread overview: 5+ messages
2026-05-06 15:26 [PATCH 0/2] svcrdma: Reduce svcrdma_wq contention on the Send completion path Chuck Lever
2026-05-06 15:26 ` [PATCH 1/2] svcrdma: Release write chunk resources without re-queuing Chuck Lever
2026-05-07 20:46   ` Mike Snitzer
2026-05-08 20:14     ` Chuck Lever
2026-05-06 15:26 ` [PATCH 2/2] svcrdma: Defer send context release to xpo_release_ctxt Chuck Lever
