Linux RDMA and InfiniBand development
* [PATCH v3 0/5] xprtrdma: Decouple req recycling from RPC completion
@ 2026-05-26 14:14 Chuck Lever
From: Chuck Lever @ 2026-05-26 14:14 UTC
  To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

rl_kref currently conflates two lifetimes through one refcount:
it gates when a Reply can wake its RPC task, and it gates when
an rpcrdma_req can return to its free pool. The marshal path
takes the Send-side reference only when sc_unmap_count > 0, so
a Send carrying only pre-registered buffers takes no Send-side
reference. When the Reply for such an RPC arrives before its
Send completes, the Reply handler drops rl_kref from 1 to 0
and frees the req while the HCA may still be DMA-reading from
its send buffer. If the req is reused for another RPC before
that Send completes, a retransmission can put different bytes
on the wire.

This series narrows rl_kref's job. The RPC layer takes one
reference at slot allocation; rpcrdma_prepare_send_sges() takes
a Send-side reference unconditionally after WR prep succeeds.
A req returns to its free pool only after both owners release.
Replies complete the RPC directly, without atomic activity on
rl_kref.
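
The two-owner rule above can be modeled in user space. The
sketch below is only an illustration, not code from the series:
every model_* name is hypothetical, and a plain counter stands
in for rl_kref. It shows that the req becomes reusable only
after both the RPC-layer owner and the Send-side owner have
released, in either order.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names exist in
 * xprtrdma. "refs" stands in for rl_kref.
 */
struct model_req {
	unsigned int refs;
	bool in_pool;	/* true once the req is back on its free pool */
};

/* Slot allocation: the RPC layer takes the first reference. */
static void model_alloc_slot(struct model_req *req)
{
	req->refs = 1;
	req->in_pool = false;
}

/* WR preparation succeeded: the Send side takes its own reference. */
static void model_prepare_send(struct model_req *req)
{
	req->refs++;
}

/* Either owner releases; only the last release recycles the req. */
static void model_put(struct model_req *req)
{
	if (--req->refs == 0)
		req->in_pool = true;
}
```

In this model a Reply that arrives before the Send completes
leads to one model_put() from the RPC layer, which leaves the
req off the pool until the Send-side model_put() follows.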

Three design choices shape the series.

The Send-side reference is taken only on the success path of
rpcrdma_prepare_send_sges(). Marshal failure runs
rpcrdma_sendctx_cancel(), which unmaps the SGEs and clears
sc_req without touching rl_kref. Sendctx ring walks in
rpcrdma_sendctx_put_locked() and rpcrdma_sendctxs_destroy()
skip entries whose sc_req is NULL, so a burst of -EIO marshal
failures cannot hold reqs off rb_send_bufs.
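
The skip-on-NULL ring walk described above can be sketched as
follows. This is a simplified user-space model under stated
assumptions: the ring size, the model_* names, and the return
value are hypothetical, and the real walkers also DMA-unmap and
drop a reference for each released entry.

```c
#include <stddef.h>

#define MODEL_RING_SIZE 8	/* hypothetical ring size */

struct model_sendctx {
	void *sc_req;	/* NULL after a cancel or a release */
};

struct model_ring {
	struct model_sendctx ctxs[MODEL_RING_SIZE];
	unsigned int tail;
};

static unsigned int model_next(unsigned int i)
{
	return (i + 1) % MODEL_RING_SIZE;
}

/* Walk from the entry after tail up to and including @upto.
 * Entries whose sc_req is NULL (for example, left behind by a
 * failed marshal) are skipped. Returns the number released.
 */
static unsigned int model_ring_sweep(struct model_ring *ring,
				     unsigned int upto)
{
	unsigned int released = 0;
	unsigned int i = ring->tail;

	do {
		i = model_next(i);
		if (ring->ctxs[i].sc_req) {
			ring->ctxs[i].sc_req = NULL;
			released++;
		}
	} while (i != upto);

	ring->tail = i;
	return released;
}
```

A cancelled entry between two live entries does not stop the
walk; it is passed over and the tail still advances to @upto.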

Connection teardown drains the sendctx ring against pre-reset
reqs by ordering rpcrdma_sendctxs_destroy() ahead of
rpcrdma_reqs_reset() in rpcrdma_xprt_disconnect(). The drain
releases Send-side references whose unsignaled Sends never had
a later signaled completion to walk the ring. On the
backchannel, releasing a bc_prealloc req re-adds it to
bc_pa_list, which xprt_destroy_backchannel() has already
emptied; xprt_rdma_destroy() runs xprt_rdma_bc_destroy() a
second time after the disconnect to reclaim those reqs.

With recycling now gated on Send completion, completed RPCs
can remain pinned by the sendctx ring until the next signaled
Send completion. The headroom is bounded: re_send_batch is set
to re_max_requests >> 3. The req pool gains max_reqs/8 slack
(patch 3) so the recycle delay does not stall a slot
allocation that the RPC/RDMA credit window would admit.
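
The slack arithmetic can be checked with a small user-space
sketch. The model_* helpers and the MODEL_DIV_ROUND_UP macro
are hypothetical stand-ins; the macro mirrors the usual
definition of the kernel's DIV_ROUND_UP(). The point of the
round-up is the small-pool case, where the right shift yields
zero but one unsignaled Send can still pin a req.

```c
/* Stand-in for the kernel's DIV_ROUND_UP(). */
#define MODEL_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Send-signaling batch size as described: max_reqs >> 3. */
static unsigned int model_send_batch(unsigned int max_reqs)
{
	return max_reqs >> 3;
}

/* Pool slack: one eighth of max_reqs, rounded up. */
static unsigned int model_pool_slack(unsigned int max_reqs)
{
	return MODEL_DIV_ROUND_UP(max_reqs, 8);
}

/* Total number of req objects to allocate. */
static unsigned int model_pool_size(unsigned int max_reqs)
{
	return max_reqs + model_pool_slack(max_reqs);
}
```

For max_reqs of 128 the pool grows to 144 reqs; for max_reqs
of 4 the batch size is 0 but the slack is still 1, giving a
pool of 5.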

Changes since v2:
- While addressing sashiko review comments, substantially
  reworked to simplify and better align the series with the
  current architecture of xprtrdma

Link to v2: https://lore.kernel.org/linux-nfs/20260523000252.465074-1-cel@kernel.org/#r

Changes since v1:
- Split into three patches. A prep patch converts the
  Send-signaling test from a kref_read to sc_unmap_count, and
  a separate patch names the request-pool slack at its
  allocation site.
- Wrap the bc_prealloc release branch in
  CONFIG_SUNRPC_BACKCHANNEL (kernel test robot, build break on
  configs without the backchannel).
- Order rpcrdma_sendctxs_destroy() ahead of
  rpcrdma_reqs_reset() in rpcrdma_xprt_disconnect() so the
  drain runs against pre-reset reqs.
- Run xprt_rdma_bc_destroy() a second time from
  xprt_rdma_destroy() to reclaim bc_prealloc reqs returned by
  the disconnect's drain.
- Add rpcrdma_sendctx_cancel() for the marshal-failure path;
  sendctx ring walkers skip entries with sc_req == NULL.

Link to v1: https://lore.kernel.org/r/20260520175016.29480-1-cel@kernel.org

Chuck Lever (5):
  xprtrdma: Use sendctx DMA state for Send signaling
  xprtrdma: Decouple req recycling from RPC completion
  xprtrdma: Add request-pool slack for delayed recycling
  xprtrdma: Clear receive-side ownership pointers on release
  xprtrdma: Document and assert reply-handler invariants

 net/sunrpc/xprtrdma/backchannel.c |   5 +-
 net/sunrpc/xprtrdma/frwr_ops.c    |   2 +-
 net/sunrpc/xprtrdma/rpc_rdma.c    | 134 ++++++++++++++++++++++--------
 net/sunrpc/xprtrdma/transport.c   |  68 +++++++++++++--
 net/sunrpc/xprtrdma/verbs.c       |  68 +++++++++++++--
 net/sunrpc/xprtrdma/xprt_rdma.h   |   2 +-
 6 files changed, 225 insertions(+), 54 deletions(-)

-- 
2.54.0



* [PATCH v3 1/5] xprtrdma: Use sendctx DMA state for Send signaling
From: Chuck Lever @ 2026-05-26 14:14 UTC
  To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Send signaling matters only when the prepared Send has page
mappings to unmap. Today that test is expressed indirectly with
rl_kref, because the Send-side reference is taken only for Sends
with mapped SGEs.

Split the SGE DMA unmap loop into its own helper and use
sc_unmap_count directly for the signaling decision. This keeps the
current behavior but removes one dependency on the old rl_kref
semantics before the request lifetime rules are changed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/frwr_ops.c |  2 +-
 net/sunrpc/xprtrdma/rpc_rdma.c | 22 +++++++++++++---------
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 7f79a0a2601e..e5c71cf705a3 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -474,7 +474,7 @@ int frwr_send(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
 		++num_wrs;
 	}
 
-	if ((kref_read(&req->rl_kref) > 1) || num_wrs > ep->re_send_count) {
+	if (req->rl_sendctx->sc_unmap_count || num_wrs > ep->re_send_count) {
 		send_wr->send_flags |= IB_SEND_SIGNALED;
 		ep->re_send_count = min_t(unsigned int, ep->re_send_batch,
 					  num_wrs - ep->re_send_count);
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 0e0f21974710..16b9987858d6 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -477,19 +477,11 @@ static void rpcrdma_sendctx_done(struct kref *kref)
 	rep->rr_rxprt->rx_stats.reply_waits_for_send++;
 }
 
-/**
- * rpcrdma_sendctx_unmap - DMA-unmap Send buffer
- * @sc: sendctx containing SGEs to unmap
- *
- */
-void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc)
+static void rpcrdma_sendctx_dma_unmap(struct rpcrdma_sendctx *sc)
 {
 	struct rpcrdma_regbuf *rb = sc->sc_req->rl_sendbuf;
 	struct ib_sge *sge;
 
-	if (!sc->sc_unmap_count)
-		return;
-
 	/* The first two SGEs contain the transport header and
 	 * the inline buffer. These are always left mapped so
 	 * they can be cheaply re-used.
@@ -498,7 +490,19 @@ void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc)
 	     ++sge, --sc->sc_unmap_count)
 		ib_dma_unmap_page(rdmab_device(rb), sge->addr, sge->length,
 				  DMA_TO_DEVICE);
+}
 
+/**
+ * rpcrdma_sendctx_unmap - DMA-unmap Send buffer
+ * @sc: sendctx containing SGEs to unmap
+ *
+ */
+void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc)
+{
+	if (!sc->sc_unmap_count)
+		return;
+
+	rpcrdma_sendctx_dma_unmap(sc);
 	kref_put(&sc->sc_req->rl_kref, rpcrdma_sendctx_done);
 }
 
-- 
2.54.0



* [PATCH v3 2/5] xprtrdma: Decouple req recycling from RPC completion
From: Chuck Lever @ 2026-05-26 14:14 UTC
  To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

rl_kref formerly served two distinct lifetimes through a single
refcount: it gated when a Reply could wake its RPC task, and it
gated when an rpcrdma_req could return to its free pool. The
marshal path took the Send-side reference only when SGEs needed
DMA-unmap (sc_unmap_count > 0), which made a Send carrying only
pre-registered buffers an exception: the Reply handler dropped
rl_kref from 1 to 0 and freed the req while the HCA might still
be DMA-reading from its send buffer.

Give rl_kref a narrower job. The RPC layer takes one reference
when slot allocation hands a req out. rpcrdma_prepare_send_sges()
takes a Send-side reference unconditionally after WR preparation
succeeds. xprt_rdma_free_slot() and xprt_rdma_bc_free_rqst() drop
the RPC-layer reference; rpcrdma_sendctx_unmap() drops the
Send-side reference. The req returns to its free pool only after
both owners have signed off.

The existing kref_init(&req->rl_kref) call in
rpcrdma_prepare_send_sges() is removed. Initialization moves to
the slot-allocation paths (xprt_rdma_alloc_slot and
rpcrdma_bc_rqst_get), and the release callback re-arms rl_kref
before the req returns to a free pool. A re-init in the marshal
path would discard the RPC-layer reference that already exists
on entry.

Three invariants follow:

  - Any rpcrdma_req held by an rpc_rqst has rl_kref >= 1.
    xprt_rdma_alloc_slot(), rpcrdma_bc_rqst_get(), and the
    backlog-wake branch in xprt_rdma_alloc_slot() each kref_init
    rl_kref before publishing the req. Without this invariant,
    an RPC task that aborts between slot allocation and marshal
    (gss_refresh failure or signal during call_connect, for
    example) would drive xprt_release() ->
    xprt_rdma_free_slot() -> kref_put against a refcount of
    zero, saturating refcount_t and stranding the slot.

  - The Send-side reference is taken only after WR prep
    succeeds. A mapping failure in rpcrdma_prepare_send_sges()
    runs rpcrdma_sendctx_cancel(), which DMA-unmaps the sendctx
    and clears sc_req without touching rl_kref. The sendctx
    ring walks in rpcrdma_sendctx_put_locked() and
    rpcrdma_sendctxs_destroy() skip entries with sc_req == NULL,
    so a burst of -EIO marshal failures cannot hold reqs off
    rb_send_bufs.

  - The release callback re-arms rl_kref so the next consumer
    enters with the invariant satisfied.

Replies now complete the RPC directly. rpcrdma_reply_handler()
calls rpcrdma_complete_rqst() in place of kref_put on the
non-LocalInv branch. The LocalInv branch already completes the
RPC from frwr_unmap_async() and is unaffected.

Because Send-side references can now outlive RPC completion,
connection teardown drains sendctx entries whose unsignaled
Sends never had a later signaled completion to walk the ring.
rpcrdma_sendctxs_destroy() walks the active range and runs
rpcrdma_sendctx_unmap() on each entry with a non-NULL sc_req
before the request buffers are reset, and is moved ahead of
rpcrdma_reqs_reset() in rpcrdma_xprt_disconnect() so the reqs
are still in their pre-reset state when the Send-side refs are
released.

The drain creates a teardown-ordering hazard on the backchannel
path. With the new lifetime, releasing a bc_prealloc req from
rpcrdma_req_release() re-adds it to bc_pa_list. The disconnect
in xprt_rdma_destroy() runs after xprt_destroy_backchannel() has
already emptied bc_pa_list, so the drained reqs would otherwise
leak. xprt_rdma_destroy() now runs xprt_rdma_bc_destroy(xprt, 0)
a second time after the disconnect to reclaim them.

Fixes: 0ab115237025 ("xprtrdma: Wake RPCs directly in rpcrdma_wc_send path")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/backchannel.c |  5 ++-
 net/sunrpc/xprtrdma/rpc_rdma.c    | 47 +++++++++++---------------
 net/sunrpc/xprtrdma/transport.c   | 55 ++++++++++++++++++++++++++++---
 net/sunrpc/xprtrdma/verbs.c       | 29 +++++++++++++---
 net/sunrpc/xprtrdma/xprt_rdma.h   |  2 +-
 5 files changed, 97 insertions(+), 41 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 2f0f9618dd05..e5b3463da25f 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -159,9 +159,7 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
 	rpcrdma_rep_put(&r_xprt->rx_buf, rep);
 	req->rl_reply = NULL;
 
-	spin_lock(&xprt->bc_pa_lock);
-	list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
-	spin_unlock(&xprt->bc_pa_lock);
+	rpcrdma_req_put(req);
 	xprt_put(xprt);
 }
 
@@ -203,6 +201,7 @@ static struct rpc_rqst *rpcrdma_bc_rqst_get(struct rpcrdma_xprt *r_xprt)
 	rqst->rq_xprt = xprt;
 	__set_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
 	xdr_buf_init(&rqst->rq_snd_buf, rdmab_data(req->rl_sendbuf), size);
+	kref_init(&req->rl_kref);
 	return rqst;
 }
 
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 16b9987858d6..69380f9dfa49 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -467,16 +467,6 @@ static int rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
 	return 0;
 }
 
-static void rpcrdma_sendctx_done(struct kref *kref)
-{
-	struct rpcrdma_req *req =
-		container_of(kref, struct rpcrdma_req, rl_kref);
-	struct rpcrdma_rep *rep = req->rl_reply;
-
-	rpcrdma_complete_rqst(rep);
-	rep->rr_rxprt->rx_stats.reply_waits_for_send++;
-}
-
 static void rpcrdma_sendctx_dma_unmap(struct rpcrdma_sendctx *sc)
 {
 	struct rpcrdma_regbuf *rb = sc->sc_req->rl_sendbuf;
@@ -493,17 +483,26 @@ static void rpcrdma_sendctx_dma_unmap(struct rpcrdma_sendctx *sc)
 }
 
 /**
- * rpcrdma_sendctx_unmap - DMA-unmap Send buffer
+ * rpcrdma_sendctx_unmap - DMA-unmap Send buffer and release Send owner
  * @sc: sendctx containing SGEs to unmap
  *
  */
 void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc)
 {
-	if (!sc->sc_unmap_count)
-		return;
+	struct rpcrdma_req *req = sc->sc_req;
 
 	rpcrdma_sendctx_dma_unmap(sc);
-	kref_put(&sc->sc_req->rl_kref, rpcrdma_sendctx_done);
+	sc->sc_req = NULL;
+	rpcrdma_req_put(req);
+}
+
+/* No Send was posted. Release DMA mappings prepared for this
+ * sendctx, but leave the request reference count alone.
+ */
+static void rpcrdma_sendctx_cancel(struct rpcrdma_sendctx *sc)
+{
+	rpcrdma_sendctx_dma_unmap(sc);
+	sc->sc_req = NULL;
 }
 
 /* Prepare an SGE for the RPC-over-RDMA transport header.
@@ -695,8 +694,6 @@ static bool rpcrdma_prepare_noch_mapped(struct rpcrdma_xprt *r_xprt,
 					      tail->iov_len))
 			return false;
 
-	if (req->rl_sendctx->sc_unmap_count)
-		kref_get(&req->rl_kref);
 	return true;
 }
 
@@ -726,7 +723,6 @@ static bool rpcrdma_prepare_readch(struct rpcrdma_xprt *r_xprt,
 		len -= len & 3;
 		if (!rpcrdma_prepare_tail_iov(req, xdr, page_base, len))
 			return false;
-		kref_get(&req->rl_kref);
 	}
 
 	return true;
@@ -755,7 +751,6 @@ inline int rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
 		goto out_nosc;
 	req->rl_sendctx->sc_unmap_count = 0;
 	req->rl_sendctx->sc_req = req;
-	kref_init(&req->rl_kref);
 	req->rl_wr.wr_cqe = &req->rl_sendctx->sc_cqe;
 	req->rl_wr.sg_list = req->rl_sendctx->sc_sges;
 	req->rl_wr.num_sge = 0;
@@ -783,10 +778,14 @@ inline int rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
 		goto out_unmap;
 	}
 
+	/* The Send-side owner releases this reference when the
+	 * Send has completed.
+	 */
+	kref_get(&req->rl_kref);
 	return 0;
 
 out_unmap:
-	rpcrdma_sendctx_unmap(req->rl_sendctx);
+	rpcrdma_sendctx_cancel(req->rl_sendctx);
 out_nosc:
 	trace_xprtrdma_prepsend_failed(&req->rl_slot, ret);
 	return ret;
@@ -1364,14 +1363,6 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
 	goto out;
 }
 
-static void rpcrdma_reply_done(struct kref *kref)
-{
-	struct rpcrdma_req *req =
-		container_of(kref, struct rpcrdma_req, rl_kref);
-
-	rpcrdma_complete_rqst(req->rl_reply);
-}
-
 /**
  * rpcrdma_reply_handler - Process received RPC/RDMA messages
  * @rep: Incoming rpcrdma_rep object to process
@@ -1443,7 +1434,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
 		frwr_unmap_async(r_xprt, req);
 		/* LocalInv completion will complete the RPC */
 	else
-		kref_put(&req->rl_kref, rpcrdma_reply_done);
+		rpcrdma_complete_rqst(rep);
 
 out_post:
 	rpcrdma_post_recvs(r_xprt,
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 61706df5e485..5569f17fdd9b 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -279,6 +279,13 @@ xprt_rdma_destroy(struct rpc_xprt *xprt)
 	cancel_delayed_work_sync(&r_xprt->rx_connect_worker);
 
 	rpcrdma_xprt_disconnect(r_xprt);
+
+	/* The disconnect's sendctx drain can return bc_prealloc reqs
+	 * to bc_pa_list after xprt_destroy_backchannel() emptied it.
+	 */
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+	xprt_rdma_bc_destroy(xprt, 0);
+#endif
 	rpcrdma_buffer_destroy(&r_xprt->rx_buf);
 
 	xprt_rdma_free_addresses(xprt);
@@ -487,6 +494,45 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
 	queue_delayed_work(system_long_wq, &r_xprt->rx_connect_worker, delay);
 }
 
+/* rl_kref has two owners while a Send is outstanding: the rpc_rqst
+ * owner and the sendctx. Replies complete the RPC but do not drop
+ * either reference. The req returns to its free pool only after
+ * xprt_rdma_free_slot() or xprt_rdma_bc_free_rqst() has dropped the
+ * RPC-layer reference and rpcrdma_sendctx_unmap() has dropped the
+ * Send-side reference.
+ */
+static void rpcrdma_req_release(struct kref *kref)
+{
+	struct rpcrdma_req *req =
+		container_of(kref, struct rpcrdma_req, rl_kref);
+	struct rpc_rqst *rqst = &req->rl_slot;
+	struct rpc_xprt *xprt = rqst->rq_xprt;
+	struct rpcrdma_xprt *r_xprt;
+
+	kref_init(&req->rl_kref);
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+	if (bc_prealloc(rqst)) {
+		spin_lock(&xprt->bc_pa_lock);
+		list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
+		spin_unlock(&xprt->bc_pa_lock);
+		return;
+	}
+#endif
+
+	if (xprt_wake_up_backlog(xprt, rqst))
+		return;
+
+	r_xprt = rpcx_to_rdmax(xprt);
+	memset(rqst, 0, sizeof(*rqst));
+	rpcrdma_buffer_put(&r_xprt->rx_buf, req);
+}
+
+void rpcrdma_req_put(struct rpcrdma_req *req)
+{
+	kref_put(&req->rl_kref, rpcrdma_req_release);
+}
+
 /**
  * xprt_rdma_alloc_slot - allocate an rpc_rqst
  * @xprt: controlling RPC transport
@@ -505,6 +551,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 	req = rpcrdma_buffer_get(&r_xprt->rx_buf);
 	if (!req)
 		goto out_sleep;
+	kref_init(&req->rl_kref);
 	task->tk_rqstp = &req->rl_slot;
 	task->tk_status = 0;
 	return;
@@ -520,6 +567,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
 	if (req) {
 		struct rpc_rqst *rqst = &req->rl_slot;
 
+		kref_init(&req->rl_kref);
 		if (!xprt_wake_up_backlog(xprt, rqst)) {
 			memset(rqst, 0, sizeof(*rqst));
 			rpcrdma_buffer_put(&r_xprt->rx_buf, req);
@@ -540,10 +588,7 @@ xprt_rdma_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *rqst)
 		container_of(xprt, struct rpcrdma_xprt, rx_xprt);
 
 	rpcrdma_reply_put(&r_xprt->rx_buf, rpcr_to_rdmar(rqst));
-	if (!xprt_wake_up_backlog(xprt, rqst)) {
-		memset(rqst, 0, sizeof(*rqst));
-		rpcrdma_buffer_put(&r_xprt->rx_buf, rpcr_to_rdmar(rqst));
-	}
+	rpcrdma_req_put(rpcr_to_rdmar(rqst));
 }
 
 static bool rpcrdma_check_regbuf(struct rpcrdma_xprt *r_xprt,
@@ -716,7 +761,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
 		   r_xprt->rx_stats.mrs_allocated,
 		   r_xprt->rx_stats.local_inv_needed,
 		   r_xprt->rx_stats.empty_sendctx_q,
-		   r_xprt->rx_stats.reply_waits_for_send);
+		   0LU); /* was reply_waits_for_send; column preserved */
 }
 
 static int
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index aecf9c0a153f..97b8b2376602 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -65,6 +65,8 @@
 
 static int rpcrdma_sendctxs_create(struct rpcrdma_xprt *r_xprt);
 static void rpcrdma_sendctxs_destroy(struct rpcrdma_xprt *r_xprt);
+static unsigned long rpcrdma_sendctx_next(struct rpcrdma_buffer *buf,
+					  unsigned long item);
 static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
 				       struct rpcrdma_sendctx *sc);
 static int rpcrdma_reqs_setup(struct rpcrdma_xprt *r_xprt);
@@ -571,9 +573,9 @@ void rpcrdma_xprt_disconnect(struct rpcrdma_xprt *r_xprt)
 
 	rpcrdma_xprt_drain(r_xprt);
 	rpcrdma_reps_unmap(r_xprt);
+	rpcrdma_sendctxs_destroy(r_xprt);
 	rpcrdma_reqs_reset(r_xprt);
 	rpcrdma_mrs_destroy(r_xprt);
-	rpcrdma_sendctxs_destroy(r_xprt);
 
 	if (rpcrdma_ep_put(ep))
 		rdma_destroy_id(id);
@@ -605,6 +607,20 @@ static void rpcrdma_sendctxs_destroy(struct rpcrdma_xprt *r_xprt)
 
 	if (!buf->rb_sc_ctxs)
 		return;
+
+	/* The QP is drained, but the final unsignaled Sends might not
+	 * have been walked by a signaled Send completion. Release those
+	 * Send owners before request buffers are reset.
+	 */
+	for (i = rpcrdma_sendctx_next(buf, buf->rb_sc_tail);
+	     i != rpcrdma_sendctx_next(buf, buf->rb_sc_head);
+	     i = rpcrdma_sendctx_next(buf, i)) {
+		struct rpcrdma_sendctx *sc = buf->rb_sc_ctxs[i];
+
+		if (sc && sc->sc_req)
+			rpcrdma_sendctx_unmap(sc);
+	}
+
 	for (i = 0; i <= buf->rb_sc_last; i++)
 		kfree(buf->rb_sc_ctxs[i]);
 	kfree(buf->rb_sc_ctxs);
@@ -739,15 +755,20 @@ static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
 	unsigned long next_tail;
 
-	/* Unmap SGEs of previously completed but unsignaled
-	 * Sends by walking up the queue until @sc is found.
+	/* Release previously completed but unsignaled Sends by walking
+	 * up the queue until @sc is found. Entries left behind by a
+	 * failed rpcrdma_prepare_send_sges() have sc_req cleared.
 	 */
 	next_tail = buf->rb_sc_tail;
 	do {
+		struct rpcrdma_sendctx *cur;
+
 		next_tail = rpcrdma_sendctx_next(buf, next_tail);
 
 		/* ORDER: item must be accessed _before_ tail is updated */
-		rpcrdma_sendctx_unmap(buf->rb_sc_ctxs[next_tail]);
+		cur = buf->rb_sc_ctxs[next_tail];
+		if (cur->sc_req)
+			rpcrdma_sendctx_unmap(cur);
 
 	} while (buf->rb_sc_ctxs[next_tail] != sc);
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index f53a77472724..f879d9b9f57e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -427,7 +427,6 @@ struct rpcrdma_stats {
 	/* accessed when receiving a reply */
 	unsigned long long	total_rdma_reply;
 	unsigned long long	fixup_copy_count;
-	unsigned long		reply_waits_for_send;
 	unsigned long		local_inv_needed;
 	unsigned long		nomsg_call_count;
 	unsigned long		bcall_count;
@@ -505,6 +504,7 @@ void rpcrdma_buffer_put(struct rpcrdma_buffer *buffers,
 			struct rpcrdma_req *req);
 void rpcrdma_rep_put(struct rpcrdma_buffer *buf, struct rpcrdma_rep *rep);
 void rpcrdma_reply_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req);
+void rpcrdma_req_put(struct rpcrdma_req *req);
 
 bool rpcrdma_regbuf_realloc(struct rpcrdma_regbuf *rb, size_t size,
 			    gfp_t flags);
-- 
2.54.0



* [PATCH v3 3/5] xprtrdma: Add request-pool slack for delayed recycling
From: Chuck Lever @ 2026-05-26 14:14 UTC
  To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

After the previous patch gates req recycling on Send completion,
a completed RPC's rpcrdma_req can remain pinned by the sendctx
ring until the next signaled Send completion releases it. The
transmitted-RPC ceiling is unchanged: xprt_request_get_cong()
gates Sends against xprt->cwnd, the RPC/RDMA credit window fed
by server-granted credits and capped at re_max_requests. The
req pool, however, must exceed max_reqs by enough that this
recycle delay does not stall a slot allocation that the credit
window would admit.

The headroom is bounded. frwr_open() sets re_send_batch to
re_max_requests >> 3, one eighth of the credit limit, so at
most re_send_batch unsignaled Sends can be outstanding before
the next signaled completion releases them. That equals
max_reqs / 8 reqs in the worst case, with a one-slot floor for
small max_reqs values where the right-shift rounds to zero.

The sendctx ring and the hardware Send Queue are not enlarged
to match. Both are sized in rpcrdma_sendctxs_create() and
frwr_query_device() for re_max_requests in-flight Sends, which
is the ceiling the credit window enforces. The pool slack does
not raise that ceiling -- it only lets allocation keep pace
with the credit window during the brief interval in which
earlier reqs are pinned waiting for the next signaled
completion. At any moment, at most re_send_batch sendctxes are
held by unswept unsignaled Sends, leaving the rest of the ring
available for newly admitted Sends.

Allocate max_reqs + DIV_ROUND_UP(max_reqs, 8) request objects
and name the slack calculation at the allocation site so the
1/8 bound stays tied to the Send-signaling batch size.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/verbs.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 97b8b2376602..98bd965787e6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1080,6 +1080,22 @@ static void rpcrdma_reps_destroy(struct rpcrdma_buffer *buf)
 	spin_unlock(&buf->rb_lock);
 }
 
+static unsigned int rpcrdma_req_pool_slack(unsigned int max_reqs)
+{
+	/* The sendctx ring can hold up to one Send-signaling batch
+	 * (re_send_batch, set by frwr_open() to re_max_requests >> 3)
+	 * of unfinished Sends. Each pins its req until a signaled Send
+	 * completion releases the sendctx. Size the pool above max_reqs
+	 * by that batch so the recycle delay does not stall a slot
+	 * allocation that the RPC/RDMA credit window would admit.
+	 *
+	 * Round up: re_max_requests >> 3 is zero when max_reqs < 8, but
+	 * a single unsignaled Send is still enough to pin one req. One
+	 * slack slot covers that case.
+	 */
+	return DIV_ROUND_UP(max_reqs, 8);
+}
+
 /**
  * rpcrdma_buffer_create - Create initial set of req/rep objects
  * @r_xprt: transport instance to (re)initialize
@@ -1089,6 +1105,7 @@ static void rpcrdma_reps_destroy(struct rpcrdma_buffer *buf)
 int rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
 {
 	struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+	unsigned int max_reqs;
 	int i, rc;
 
 	buf->rb_bc_srv_max_requests = 0;
@@ -1102,7 +1119,9 @@ int rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
 	INIT_LIST_HEAD(&buf->rb_all_reps);
 
 	rc = -ENOMEM;
-	for (i = 0; i < r_xprt->rx_xprt.max_reqs; i++) {
+	max_reqs = r_xprt->rx_xprt.max_reqs;
+	max_reqs += rpcrdma_req_pool_slack(max_reqs);
+	for (i = 0; i < max_reqs; i++) {
 		struct rpcrdma_req *req;
 
 		req = rpcrdma_req_create(r_xprt,
-- 
2.54.0



* [PATCH v3 4/5] xprtrdma: Clear receive-side ownership pointers on release
From: Chuck Lever @ 2026-05-26 14:14 UTC
  To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Three small ownership-state cleanups land the transport in a
state that lets future reviewers reason about each pointer
locally rather than tracing the whole reply path:

rpcrdma_rep_put() clears rep->rr_rqst before the rep enters
rb_free_reps so that no rep on the free list still carries a
stale rqst pointer.  rpcrdma_reply_handler() and
rpcrdma_unpin_rqst() are the only sites that set rr_rqst;
rpcrdma_reply_handler() hands the rep through
rpcrdma_rep_put(), and rpcrdma_unpin_rqst() NULLs rr_rqst
directly because its error path abandons the rep for
teardown cleanup rather than returning it to rb_free_reps.

rpcrdma_reply_put() NULLs req->rl_reply before calling
rpcrdma_rep_put().  The previous order placed the rep on
rb_free_reps while req->rl_reply still pointed at it; the
window was harmless because xprt_rdma_free_slot() holds the
req exclusively across the pair, but closing it makes the
invariant 'rep on rb_free_reps implies no req references it'
strictly checkable.

rpcrdma_sendctx_unmap() and rpcrdma_sendctx_cancel() clear
req->rl_sendctx after clearing the sendctx's sc_req back
pointer.  Without this, req->rl_sendctx survives across
Send completion and points at a sendctx that may already have
been reassigned by rpcrdma_sendctx_get_locked() to a different
req.  No caller dereferences the stale pointer today --
rpcrdma_prepare_send_sges() overwrites it before the next
Send -- but a NULL is a more honest representation of 'the
Send is no longer outstanding' and lets the assertion patch
that follows trip on any future regression.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c |  4 ++++
 net/sunrpc/xprtrdma/verbs.c    | 12 ++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 69380f9dfa49..f4b4abefc4e0 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -493,6 +493,7 @@ void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc)
 
 	rpcrdma_sendctx_dma_unmap(sc);
 	sc->sc_req = NULL;
+	req->rl_sendctx = NULL;
 	rpcrdma_req_put(req);
 }
 
@@ -501,8 +502,11 @@ void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc)
  */
 static void rpcrdma_sendctx_cancel(struct rpcrdma_sendctx *sc)
 {
+	struct rpcrdma_req *req = sc->sc_req;
+
 	rpcrdma_sendctx_dma_unmap(sc);
 	sc->sc_req = NULL;
+	req->rl_sendctx = NULL;
 }
 
 /* Prepare an SGE for the RPC-over-RDMA transport header.
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 98bd965787e6..60cbc14c5299 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1043,9 +1043,15 @@ static struct rpcrdma_rep *rpcrdma_rep_get_locked(struct rpcrdma_buffer *buf)
  * @buf: buffer pool
  * @rep: rep to release
  *
+ * The rep's transient association with an rpc_rqst, established
+ * by rpcrdma_reply_handler() and torn down here, must not survive
+ * onto rb_free_reps: rpcrdma_post_recvs() pulls reps from the free
+ * list to re-post them, and a non-NULL rr_rqst on a free-listed rep
+ * would imply the rep is still referenced by a req.
  */
 void rpcrdma_rep_put(struct rpcrdma_buffer *buf, struct rpcrdma_rep *rep)
 {
+	rep->rr_rqst = NULL;
 	llist_add(&rep->rr_node, &buf->rb_free_reps);
 }
 
@@ -1247,9 +1253,11 @@ rpcrdma_mr_get(struct rpcrdma_xprt *r_xprt)
  */
 void rpcrdma_reply_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req)
 {
-	if (req->rl_reply) {
-		rpcrdma_rep_put(buffers, req->rl_reply);
+	struct rpcrdma_rep *rep = req->rl_reply;
+
+	if (rep) {
 		req->rl_reply = NULL;
+		rpcrdma_rep_put(buffers, rep);
 	}
 }
 
-- 
2.54.0



* [PATCH v3 5/5] xprtrdma: Document and assert reply-handler invariants
  2026-05-26 14:14 [PATCH v3 0/5] xprtrdma: Decouple req recycling from RPC completion Chuck Lever
                   ` (3 preceding siblings ...)
  2026-05-26 14:14 ` [PATCH v3 4/5] xprtrdma: Clear receive-side ownership pointers on release Chuck Lever
@ 2026-05-26 14:14 ` Chuck Lever
  4 siblings, 0 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-26 14:14 UTC (permalink / raw)
  To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The xprtrdma reply path has been the subject of recurring
LLM-driven review claims that 'an RPC can complete while
receive buffers are still DMA-mapped' or that 'the req can be
freed while the HCA still owns the send buffer.'  No runtime
reproducer has surfaced, but the absence of a written-down
invariant set lets each pass of automated review reach the
same hypothetical conclusion.  Subsequent fixes against
ce2f9a4d9ccc ('xprtrdma: Decouple req recycling from RPC
completion') closed the underlying races but did not document
the closure where future readers will look for it.

State the invariants explicitly in a comment above
rpcrdma_reply_handler() and back four of them with
WARN_ON_ONCE() probes, each placed where its invariant can be
checked locally against the ownership state that the previous
patch cleaned up:

- I1 (Receive WR ownership): WARN at rpcrdma_post_recvs() that
  a rep pulled from rb_free_reps carries rr_rqst == NULL.

- I2 (rep attachment): WARN at rpcrdma_reply_put() that
  req->rl_reply is NULL once the rep has been handed to
  rpcrdma_rep_put().

- I3 (Registered-MR fence): WARN at rpcrdma_complete_rqst()
  that req->rl_registered is empty.  Strong send-queue
  ordering of the LocalInv WR chain makes the last
  completion observe the ib_dma_unmap_sg() of every earlier
  MR, so 'list empty' implies 'all MRs unmapped'.

- I4 (Send-buffer release): WARN at rpcrdma_req_release()
  that req->rl_sendctx is NULL.  Reaching the kref release
  callback requires both the RPC-layer and Send-side
  references to have dropped; the Send-side drop runs in
  rpcrdma_sendctx_unmap(), which clears rl_sendctx
  (previous patch).  A non-NULL rl_sendctx here would mean
  the Send-side owner had not run -- a contradiction.
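
The I4 argument can be sketched as a small userspace model. This
is an assumption-laden illustration, not the kernel code: the
model_* names are hypothetical, a plain int replaces struct kref,
and a flag replaces WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rpcrdma_req. */
struct model_req {
	int refcount;		/* models rl_kref */
	void *sendctx;		/* models rl_sendctx */
	bool in_pool;		/* true once the req is back on its free pool */
	bool invariant_hit;	/* set where the I4 probe would fire */
};

/* Release callback: reached only when the last owner drops. */
static void model_release(struct model_req *req)
{
	if (req->sendctx)
		req->invariant_hit = true;
	req->in_pool = true;
}

static void model_put(struct model_req *req)
{
	assert(req->refcount > 0);
	if (--req->refcount == 0)
		model_release(req);
}

/* Slot allocation: the RPC layer takes the first reference. */
static void model_alloc_slot(struct model_req *req)
{
	req->refcount = 1;
	req->sendctx = NULL;
	req->in_pool = false;
	req->invariant_hit = false;
}

/* Successful Send preparation: unconditional Send-side reference. */
static void model_prepare_send(struct model_req *req, void *sc)
{
	req->sendctx = sc;
	req->refcount++;
}

/* Send completion: clear the back-pointer, then drop the reference. */
static void model_send_done(struct model_req *req)
{
	req->sendctx = NULL;
	model_put(req);
}

/* RPC-layer release, whether a reply arrived or the task was signalled. */
static void model_free_slot(struct model_req *req)
{
	model_put(req);
}
```

Whichever of model_free_slot() and model_send_done() runs last
triggers model_release(), and in both orders the sendctx pointer
is already NULL by then, which is the property the probe checks.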

The XXX comment in xprt_rdma_free() about signal-driven
release racing the Send completion described the pre-decouple
state.  Replace it with a short note pointing at the
invariant set, since the kref scheme now holds the req across
the in-flight Send regardless of which path released the
rpc_task.

I5 (req lifecycle) is stated in the comment but not probed:
making it locally assertible would require moving kref_init
out of rpcrdma_req_release(), which in turn requires adding
kref_init to the bc_pa_list and backlog-wake reuse paths.
That restructuring is deferred -- the invariant is unchanged
either way.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c  | 69 +++++++++++++++++++++++++++++++++
 net/sunrpc/xprtrdma/transport.c | 13 +++++--
 net/sunrpc/xprtrdma/verbs.c     |  6 +++
 3 files changed, 84 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index f4b4abefc4e0..626cadec4555 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1336,6 +1336,11 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
 	struct rpc_rqst *rqst = rep->rr_rqst;
 	int status;
 
+	/* I3: every registered MR has been invalidated and
+	 * ib_dma_unmap_sg()'d before complete_rqst runs.
+	 */
+	WARN_ON_ONCE(!list_empty(&rpcr_to_rdmar(rqst)->rl_registered));
+
 	switch (rep->rr_proc) {
 	case rdma_msg:
 		status = rpcrdma_decode_msg(r_xprt, rep, rqst);
@@ -1367,6 +1372,70 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
 	goto out;
 }
 
+/* Reply-side ownership invariants
+ *
+ * I1 (Receive WR ownership).  A struct rpcrdma_rep is owned by the
+ *    HCA between ib_post_recv() and the matching Receive completion.
+ *    After ib_dma_sync_single_for_cpu() in rpcrdma_wc_receive() it is
+ *    owned by the CPU until rpcrdma_rep_put() returns it to
+ *    rb_free_reps; a rep on rb_free_reps is not re-posted until
+ *    rpcrdma_post_recvs() pulls it off.  Asserted: rpcrdma_post_recvs()
+ *    WARNs that a pulled rep has rr_rqst == NULL.
+ *
+ * I2 (rep attachment).  While req->rl_reply == rep, the rep cannot be
+ *    re-posted.  rpcrdma_reply_put() NULLs req->rl_reply before handing
+ *    the rep to rpcrdma_rep_put().  Asserted: rpcrdma_reply_put() WARNs
+ *    that rl_reply is NULL after the put.
+ *
+ * I3 (Registered-MR fence).  On entry to rpcrdma_complete_rqst() every
+ *    MR that was on req->rl_registered has had its rkey invalidated
+ *    (remotely via IB_WC_WITH_INVALIDATE or locally via IB_WR_LOCAL_INV)
+ *    and its pages ib_dma_unmap_sg()'d.  The LocalInv chain is posted
+ *    on a single QP; strong send-queue ordering makes the last
+ *    completion (frwr_wc_localinv_done) observe the
+ *    ib_dma_unmap_sg() that ran from each earlier completion's
+ *    frwr_mr_put() before complete_rqst is called.  The inline
+ *    frwr_reminv() path unmaps its one MR synchronously before
+ *    rpcrdma_reply_handler() reaches complete_rqst.  Asserted:
+ *    rpcrdma_complete_rqst() WARNs that rl_registered is empty.
+ *
+ * I4 (Send-buffer release).  req->rl_kref carries two unconditional
+ *    owners while a Send is outstanding: the RPC-layer reference (set
+ *    at xprt_rdma_alloc_slot / xprt_rdma_bc_rqst_get / rpcrdma_req_release
+ *    pool-entry) and the Send-side reference (kref_get() in
+ *    rpcrdma_prepare_send_sges()).  rpcrdma_req_release() runs only
+ *    after both have dropped, so the req does not return to its free
+ *    pool until rpcrdma_sendctx_unmap() has fired -- the HCA has
+ *    released the send buffer before the req can be reused.  Asserted:
+ *    rpcrdma_req_release() WARNs that rl_sendctx is NULL.
+ *
+ * I5 (req lifecycle).  A req is owned by the RPC layer between slot
+ *    acquisition and the matching xprt_rdma_free_slot() (or, for the
+ *    backchannel, xprt_rdma_bc_free_rqst()).  While owned, rl_kref >= 1.
+ *    The pools (rb_send_bufs, bc_pa_list, backlog wake target) never
+ *    contain a req with outstanding Send-side or Reply-side work.
+ *
+ * Non-hazards.  The following claims have been raised by adversarial
+ * review and are each closed by the invariants above:
+ *
+ *   * "Reply completes the RPC while the HCA still holds the send
+ *     buffer" -- excluded by I4.  The Send-side kref reference is held
+ *     until rpcrdma_sendctx_unmap() runs from Send completion.
+ *
+ *   * "Signal-driven release races the in-flight Send" -- same
+ *     resolution.  xprt_rdma_free() does not touch rl_kref; the
+ *     Send-side reference keeps the req out of its pool until Send
+ *     completion fires.
+ *
+ *   * "Receive completion races rep reuse" -- excluded by I1.  A rep
+ *     is on rb_free_reps only after rpcrdma_rep_put() has been called
+ *     and rpcrdma_post_recvs() owns the next transition back to the HCA.
+ *
+ *   * "Pages still DMA-mapped when call_decode reads them" -- excluded
+ *     by I3.  The matching ib_dma_unmap_sg() for every MR has run on
+ *     the same CPU thread that calls rpcrdma_complete_rqst().
+ */
+
 /**
  * rpcrdma_reply_handler - Process received RPC/RDMA messages
  * @rep: Incoming rpcrdma_rep object to process
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 5569f17fdd9b..5ff8e5126a6c 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -509,6 +509,11 @@ static void rpcrdma_req_release(struct kref *kref)
 	struct rpc_xprt *xprt = rqst->rq_xprt;
 	struct rpcrdma_xprt *r_xprt;
 
+	/* I4: both the RPC-layer and Send-side owners have dropped,
+	 * so rpcrdma_sendctx_unmap() has cleared rl_sendctx.
+	 */
+	WARN_ON_ONCE(req->rl_sendctx);
+
 	kref_init(&req->rl_kref);
 
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
@@ -652,10 +657,10 @@ xprt_rdma_free(struct rpc_task *task)
 		frwr_unmap_sync(rpcx_to_rdmax(rqst->rq_xprt), req);
 	}
 
-	/* XXX: If the RPC is completing because of a signal and
-	 * not because a reply was received, we ought to ensure
-	 * that the Send completion has fired, so that memory
-	 * involved with the Send is not still visible to the NIC.
+	/* The Send-side rl_kref owner keeps req out of its free pool
+	 * until rpcrdma_sendctx_unmap() has fired -- see I4 above
+	 * rpcrdma_reply_handler() -- so signal-driven release here
+	 * does not let the HCA touch a recycled send buffer.
 	 */
 }
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 60cbc14c5299..da2c6fa44154 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1259,6 +1259,10 @@ void rpcrdma_reply_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req)
 		req->rl_reply = NULL;
 		rpcrdma_rep_put(buffers, rep);
 	}
+	/* I2: rl_reply NULL after the put closes the
+	 * 'rep on rb_free_reps still referenced by req' window.
+	 */
+	WARN_ON_ONCE(req->rl_reply);
 }
 
 /**
@@ -1435,6 +1439,8 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed)
 			rep = rpcrdma_rep_create(r_xprt);
 		if (!rep)
 			break;
+		/* I1: a rep on rb_free_reps must carry no rqst pointer. */
+		WARN_ON_ONCE(rep->rr_rqst);
 		if (!rpcrdma_regbuf_dma_map(r_xprt, rep->rr_rdmabuf)) {
 			rpcrdma_rep_put(buf, rep);
 			break;
-- 
2.54.0



Thread overview: 6+ messages
2026-05-26 14:14 [PATCH v3 0/5] xprtrdma: Decouple req recycling from RPC completion Chuck Lever
2026-05-26 14:14 ` [PATCH v3 1/5] xprtrdma: Use sendctx DMA state for Send signaling Chuck Lever
2026-05-26 14:14 ` [PATCH v3 2/5] xprtrdma: Decouple req recycling from RPC completion Chuck Lever
2026-05-26 14:14 ` [PATCH v3 3/5] xprtrdma: Add request-pool slack for delayed recycling Chuck Lever
2026-05-26 14:14 ` [PATCH v3 4/5] xprtrdma: Clear receive-side ownership pointers on release Chuck Lever
2026-05-26 14:14 ` [PATCH v3 5/5] xprtrdma: Document and assert reply-handler invariants Chuck Lever
