* [PATCH v3 2/5] xprtrdma: Decouple req recycling from RPC completion
2026-05-26 14:14 [PATCH v3 0/5] xprtrdma: Decouple req recycling from RPC completion Chuck Lever
2026-05-26 14:14 ` [PATCH v3 1/5] xprtrdma: Use sendctx DMA state for Send signaling Chuck Lever
@ 2026-05-26 14:14 ` Chuck Lever
2026-05-26 14:14 ` [PATCH v3 3/5] xprtrdma: Add request-pool slack for delayed recycling Chuck Lever
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-26 14:14 UTC (permalink / raw)
To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
rl_kref formerly served two distinct lifetimes through a single
refcount: it gated when a Reply could wake its RPC task, and it
gated when an rpcrdma_req could return to its free pool. The
marshal path took the Send-side reference only when SGEs needed
DMA-unmap (sc_unmap_count > 0), which made a Send carrying only
pre-registered buffers an exception: the Reply handler dropped
rl_kref from 1 to 0 and freed the req while the HCA might still
be DMA-reading from its send buffer.
Give rl_kref a narrower job. The RPC layer takes one reference
when slot allocation hands a req out. rpcrdma_prepare_send_sges()
takes a Send-side reference unconditionally after WR preparation
succeeds. xprt_rdma_free_slot() and xprt_rdma_bc_free_rqst() drop
the RPC-layer reference; rpcrdma_sendctx_unmap() drops the
Send-side reference. The req returns to its free pool only after
both owners have signed off.
The existing kref_init(&req->rl_kref) call in
rpcrdma_prepare_send_sges() is removed. Initialization moves to
the slot-allocation paths (xprt_rdma_alloc_slot and
rpcrdma_bc_rqst_get), and the release callback re-arms rl_kref
before the req returns to a free pool. A re-init in the marshal
path would discard the RPC-layer reference that already exists
on entry.
Three invariants follow:
- Any rpcrdma_req held by an rpc_rqst has rl_kref >= 1.
xprt_rdma_alloc_slot(), rpcrdma_bc_rqst_get(), and the
backlog-wake branch in xprt_rdma_alloc_slot() each kref_init
rl_kref before publishing the req. Without this invariant,
an RPC task that aborts between slot allocation and marshal
(gss_refresh failure or signal during call_connect, for
example) would drive xprt_release() ->
xprt_rdma_free_slot() -> kref_put against a refcount of
zero, saturating refcount_t and stranding the slot.
- The Send-side reference is taken only after WR prep
succeeds. A mapping failure in rpcrdma_prepare_send_sges()
runs rpcrdma_sendctx_cancel(), which DMA-unmaps the sendctx
and clears sc_req without touching rl_kref. The sendctx
ring walks in rpcrdma_sendctx_put_locked() and
rpcrdma_sendctxs_destroy() skip entries with sc_req == NULL,
so a burst of -EIO marshal failures cannot hold reqs off
rb_send_bufs.
- The release callback re-arms rl_kref so the next consumer
enters with the invariant satisfied.
Replies now complete the RPC directly. rpcrdma_reply_handler()
calls rpcrdma_complete_rqst() in place of kref_put on the
non-LocalInv branch. The LocalInv branch already completes the
RPC from frwr_unmap_async() and is unaffected.
Because Send-side references can now outlive RPC completion,
connection teardown drains sendctx entries whose unsignaled
Sends never had a later signaled completion to walk the ring.
rpcrdma_sendctxs_destroy() walks the active range and runs
rpcrdma_sendctx_unmap() on each entry with a non-NULL sc_req
before the request buffers are reset, and is moved ahead of
rpcrdma_reqs_reset() in rpcrdma_xprt_disconnect() so the reqs
are still in their pre-reset state when the Send-side refs are
released.
The drain creates a teardown-ordering hazard on the backchannel
path. With the new lifetime, releasing a bc_prealloc req from
rpcrdma_req_release() re-adds it to bc_pa_list. The disconnect
in xprt_rdma_destroy() runs after xprt_destroy_backchannel() has
already emptied bc_pa_list, so the drained reqs would otherwise
leak. xprt_rdma_destroy() now runs xprt_rdma_bc_destroy(xprt, 0)
a second time after the disconnect to reclaim them.
Fixes: 0ab115237025 ("xprtrdma: Wake RPCs directly in rpcrdma_wc_send path")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/backchannel.c | 5 ++-
net/sunrpc/xprtrdma/rpc_rdma.c | 47 +++++++++++---------------
net/sunrpc/xprtrdma/transport.c | 55 ++++++++++++++++++++++++++++---
net/sunrpc/xprtrdma/verbs.c | 29 +++++++++++++---
net/sunrpc/xprtrdma/xprt_rdma.h | 2 +-
5 files changed, 97 insertions(+), 41 deletions(-)
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 2f0f9618dd05..e5b3463da25f 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -159,9 +159,7 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
rpcrdma_rep_put(&r_xprt->rx_buf, rep);
req->rl_reply = NULL;
- spin_lock(&xprt->bc_pa_lock);
- list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
- spin_unlock(&xprt->bc_pa_lock);
+ rpcrdma_req_put(req);
xprt_put(xprt);
}
@@ -203,6 +201,7 @@ static struct rpc_rqst *rpcrdma_bc_rqst_get(struct rpcrdma_xprt *r_xprt)
rqst->rq_xprt = xprt;
__set_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
xdr_buf_init(&rqst->rq_snd_buf, rdmab_data(req->rl_sendbuf), size);
+ kref_init(&req->rl_kref);
return rqst;
}
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 16b9987858d6..69380f9dfa49 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -467,16 +467,6 @@ static int rpcrdma_encode_reply_chunk(struct rpcrdma_xprt *r_xprt,
return 0;
}
-static void rpcrdma_sendctx_done(struct kref *kref)
-{
- struct rpcrdma_req *req =
- container_of(kref, struct rpcrdma_req, rl_kref);
- struct rpcrdma_rep *rep = req->rl_reply;
-
- rpcrdma_complete_rqst(rep);
- rep->rr_rxprt->rx_stats.reply_waits_for_send++;
-}
-
static void rpcrdma_sendctx_dma_unmap(struct rpcrdma_sendctx *sc)
{
struct rpcrdma_regbuf *rb = sc->sc_req->rl_sendbuf;
@@ -493,17 +483,26 @@ static void rpcrdma_sendctx_dma_unmap(struct rpcrdma_sendctx *sc)
}
/**
- * rpcrdma_sendctx_unmap - DMA-unmap Send buffer
+ * rpcrdma_sendctx_unmap - DMA-unmap Send buffer and release Send owner
* @sc: sendctx containing SGEs to unmap
*
*/
void rpcrdma_sendctx_unmap(struct rpcrdma_sendctx *sc)
{
- if (!sc->sc_unmap_count)
- return;
+ struct rpcrdma_req *req = sc->sc_req;
rpcrdma_sendctx_dma_unmap(sc);
- kref_put(&sc->sc_req->rl_kref, rpcrdma_sendctx_done);
+ sc->sc_req = NULL;
+ rpcrdma_req_put(req);
+}
+
+/* No Send was posted. Release DMA mappings prepared for this
+ * sendctx, but leave the request reference count alone.
+ */
+static void rpcrdma_sendctx_cancel(struct rpcrdma_sendctx *sc)
+{
+ rpcrdma_sendctx_dma_unmap(sc);
+ sc->sc_req = NULL;
}
/* Prepare an SGE for the RPC-over-RDMA transport header.
@@ -695,8 +694,6 @@ static bool rpcrdma_prepare_noch_mapped(struct rpcrdma_xprt *r_xprt,
tail->iov_len))
return false;
- if (req->rl_sendctx->sc_unmap_count)
- kref_get(&req->rl_kref);
return true;
}
@@ -726,7 +723,6 @@ static bool rpcrdma_prepare_readch(struct rpcrdma_xprt *r_xprt,
len -= len & 3;
if (!rpcrdma_prepare_tail_iov(req, xdr, page_base, len))
return false;
- kref_get(&req->rl_kref);
}
return true;
@@ -755,7 +751,6 @@ inline int rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
goto out_nosc;
req->rl_sendctx->sc_unmap_count = 0;
req->rl_sendctx->sc_req = req;
- kref_init(&req->rl_kref);
req->rl_wr.wr_cqe = &req->rl_sendctx->sc_cqe;
req->rl_wr.sg_list = req->rl_sendctx->sc_sges;
req->rl_wr.num_sge = 0;
@@ -783,10 +778,14 @@ inline int rpcrdma_prepare_send_sges(struct rpcrdma_xprt *r_xprt,
goto out_unmap;
}
+ /* The Send-side owner releases this reference when the
+ * Send has completed.
+ */
+ kref_get(&req->rl_kref);
return 0;
out_unmap:
- rpcrdma_sendctx_unmap(req->rl_sendctx);
+ rpcrdma_sendctx_cancel(req->rl_sendctx);
out_nosc:
trace_xprtrdma_prepsend_failed(&req->rl_slot, ret);
return ret;
@@ -1364,14 +1363,6 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
goto out;
}
-static void rpcrdma_reply_done(struct kref *kref)
-{
- struct rpcrdma_req *req =
- container_of(kref, struct rpcrdma_req, rl_kref);
-
- rpcrdma_complete_rqst(req->rl_reply);
-}
-
/**
* rpcrdma_reply_handler - Process received RPC/RDMA messages
* @rep: Incoming rpcrdma_rep object to process
@@ -1443,7 +1434,7 @@ void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
frwr_unmap_async(r_xprt, req);
/* LocalInv completion will complete the RPC */
else
- kref_put(&req->rl_kref, rpcrdma_reply_done);
+ rpcrdma_complete_rqst(rep);
out_post:
rpcrdma_post_recvs(r_xprt,
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 61706df5e485..5569f17fdd9b 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -279,6 +279,13 @@ xprt_rdma_destroy(struct rpc_xprt *xprt)
cancel_delayed_work_sync(&r_xprt->rx_connect_worker);
rpcrdma_xprt_disconnect(r_xprt);
+
+ /* The disconnect's sendctx drain can return bc_prealloc reqs
+ * to bc_pa_list after xprt_destroy_backchannel() emptied it.
+ */
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+ xprt_rdma_bc_destroy(xprt, 0);
+#endif
rpcrdma_buffer_destroy(&r_xprt->rx_buf);
xprt_rdma_free_addresses(xprt);
@@ -487,6 +494,45 @@ xprt_rdma_connect(struct rpc_xprt *xprt, struct rpc_task *task)
queue_delayed_work(system_long_wq, &r_xprt->rx_connect_worker, delay);
}
+/* rl_kref has two owners while a Send is outstanding: the rpc_rqst
+ * owner and the sendctx. Replies complete the RPC but do not drop
+ * either reference. The req returns to its free pool only after
+ * xprt_rdma_free_slot() or xprt_rdma_bc_free_rqst() has dropped the
+ * RPC-layer reference and rpcrdma_sendctx_unmap() has dropped the
+ * Send-side reference.
+ */
+static void rpcrdma_req_release(struct kref *kref)
+{
+ struct rpcrdma_req *req =
+ container_of(kref, struct rpcrdma_req, rl_kref);
+ struct rpc_rqst *rqst = &req->rl_slot;
+ struct rpc_xprt *xprt = rqst->rq_xprt;
+ struct rpcrdma_xprt *r_xprt;
+
+ kref_init(&req->rl_kref);
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+ if (bc_prealloc(rqst)) {
+ spin_lock(&xprt->bc_pa_lock);
+ list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
+ spin_unlock(&xprt->bc_pa_lock);
+ return;
+ }
+#endif
+
+ if (xprt_wake_up_backlog(xprt, rqst))
+ return;
+
+ r_xprt = rpcx_to_rdmax(xprt);
+ memset(rqst, 0, sizeof(*rqst));
+ rpcrdma_buffer_put(&r_xprt->rx_buf, req);
+}
+
+void rpcrdma_req_put(struct rpcrdma_req *req)
+{
+ kref_put(&req->rl_kref, rpcrdma_req_release);
+}
+
/**
* xprt_rdma_alloc_slot - allocate an rpc_rqst
* @xprt: controlling RPC transport
@@ -505,6 +551,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
req = rpcrdma_buffer_get(&r_xprt->rx_buf);
if (!req)
goto out_sleep;
+ kref_init(&req->rl_kref);
task->tk_rqstp = &req->rl_slot;
task->tk_status = 0;
return;
@@ -520,6 +567,7 @@ xprt_rdma_alloc_slot(struct rpc_xprt *xprt, struct rpc_task *task)
if (req) {
struct rpc_rqst *rqst = &req->rl_slot;
+ kref_init(&req->rl_kref);
if (!xprt_wake_up_backlog(xprt, rqst)) {
memset(rqst, 0, sizeof(*rqst));
rpcrdma_buffer_put(&r_xprt->rx_buf, req);
@@ -540,10 +588,7 @@ xprt_rdma_free_slot(struct rpc_xprt *xprt, struct rpc_rqst *rqst)
container_of(xprt, struct rpcrdma_xprt, rx_xprt);
rpcrdma_reply_put(&r_xprt->rx_buf, rpcr_to_rdmar(rqst));
- if (!xprt_wake_up_backlog(xprt, rqst)) {
- memset(rqst, 0, sizeof(*rqst));
- rpcrdma_buffer_put(&r_xprt->rx_buf, rpcr_to_rdmar(rqst));
- }
+ rpcrdma_req_put(rpcr_to_rdmar(rqst));
}
static bool rpcrdma_check_regbuf(struct rpcrdma_xprt *r_xprt,
@@ -716,7 +761,7 @@ void xprt_rdma_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
r_xprt->rx_stats.mrs_allocated,
r_xprt->rx_stats.local_inv_needed,
r_xprt->rx_stats.empty_sendctx_q,
- r_xprt->rx_stats.reply_waits_for_send);
+ 0LU); /* was reply_waits_for_send; column preserved */
}
static int
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index aecf9c0a153f..97b8b2376602 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -65,6 +65,8 @@
static int rpcrdma_sendctxs_create(struct rpcrdma_xprt *r_xprt);
static void rpcrdma_sendctxs_destroy(struct rpcrdma_xprt *r_xprt);
+static unsigned long rpcrdma_sendctx_next(struct rpcrdma_buffer *buf,
+ unsigned long item);
static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_sendctx *sc);
static int rpcrdma_reqs_setup(struct rpcrdma_xprt *r_xprt);
@@ -571,9 +573,9 @@ void rpcrdma_xprt_disconnect(struct rpcrdma_xprt *r_xprt)
rpcrdma_xprt_drain(r_xprt);
rpcrdma_reps_unmap(r_xprt);
+ rpcrdma_sendctxs_destroy(r_xprt);
rpcrdma_reqs_reset(r_xprt);
rpcrdma_mrs_destroy(r_xprt);
- rpcrdma_sendctxs_destroy(r_xprt);
if (rpcrdma_ep_put(ep))
rdma_destroy_id(id);
@@ -605,6 +607,20 @@ static void rpcrdma_sendctxs_destroy(struct rpcrdma_xprt *r_xprt)
if (!buf->rb_sc_ctxs)
return;
+
+ /* The QP is drained, but the final unsignaled Sends might not
+ * have been walked by a signaled Send completion. Release those
+ * Send owners before request buffers are reset.
+ */
+ for (i = rpcrdma_sendctx_next(buf, buf->rb_sc_tail);
+ i != rpcrdma_sendctx_next(buf, buf->rb_sc_head);
+ i = rpcrdma_sendctx_next(buf, i)) {
+ struct rpcrdma_sendctx *sc = buf->rb_sc_ctxs[i];
+
+ if (sc && sc->sc_req)
+ rpcrdma_sendctx_unmap(sc);
+ }
+
for (i = 0; i <= buf->rb_sc_last; i++)
kfree(buf->rb_sc_ctxs[i]);
kfree(buf->rb_sc_ctxs);
@@ -739,15 +755,20 @@ static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
unsigned long next_tail;
- /* Unmap SGEs of previously completed but unsignaled
- * Sends by walking up the queue until @sc is found.
+ /* Release previously completed but unsignaled Sends by walking
+ * up the queue until @sc is found. Entries left behind by a
+ * failed rpcrdma_prepare_send_sges() have sc_req cleared.
*/
next_tail = buf->rb_sc_tail;
do {
+ struct rpcrdma_sendctx *cur;
+
next_tail = rpcrdma_sendctx_next(buf, next_tail);
/* ORDER: item must be accessed _before_ tail is updated */
- rpcrdma_sendctx_unmap(buf->rb_sc_ctxs[next_tail]);
+ cur = buf->rb_sc_ctxs[next_tail];
+ if (cur->sc_req)
+ rpcrdma_sendctx_unmap(cur);
} while (buf->rb_sc_ctxs[next_tail] != sc);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index f53a77472724..f879d9b9f57e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -427,7 +427,6 @@ struct rpcrdma_stats {
/* accessed when receiving a reply */
unsigned long long total_rdma_reply;
unsigned long long fixup_copy_count;
- unsigned long reply_waits_for_send;
unsigned long local_inv_needed;
unsigned long nomsg_call_count;
unsigned long bcall_count;
@@ -505,6 +504,7 @@ void rpcrdma_buffer_put(struct rpcrdma_buffer *buffers,
struct rpcrdma_req *req);
void rpcrdma_rep_put(struct rpcrdma_buffer *buf, struct rpcrdma_rep *rep);
void rpcrdma_reply_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req);
+void rpcrdma_req_put(struct rpcrdma_req *req);
bool rpcrdma_regbuf_realloc(struct rpcrdma_regbuf *rb, size_t size,
gfp_t flags);
--
2.54.0
^ permalink raw reply related [flat|nested] 6+ messages in thread* [PATCH v3 5/5] xprtrdma: Document and assert reply-handler invariants
2026-05-26 14:14 [PATCH v3 0/5] xprtrdma: Decouple req recycling from RPC completion Chuck Lever
` (3 preceding siblings ...)
2026-05-26 14:14 ` [PATCH v3 4/5] xprtrdma: Clear receive-side ownership pointers on release Chuck Lever
@ 2026-05-26 14:14 ` Chuck Lever
4 siblings, 0 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-26 14:14 UTC (permalink / raw)
To: Anna Schumaker; +Cc: linux-rdma, linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The xprtrdma reply path has been the subject of recurring
LLM-driven review claims that 'an RPC can complete while
receive buffers are still DMA-mapped' or that 'the req can be
freed while the HCA still owns the send buffer.' No runtime
reproducer has surfaced, but the absence of a written-down
invariant set lets each pass of automated review reach the
same hypothetical conclusion. Subsequent fixes against
ce2f9a4d9ccc ('xprtrdma: Decouple req recycling from RPC
completion') closed the underlying races but did not document
the closure where future readers will look for it.
State the invariants explicitly in a comment above
rpcrdma_reply_handler() and back four of them with
WARN_ON_ONCE() probes positioned where each invariant is
locally checkable on the previous patch's cleaned-up
ownership state:
- I1 (Receive WR ownership): WARN at rpcrdma_post_recvs() that
a rep pulled from rb_free_reps carries rr_rqst == NULL.
- I2 (rep attachment): WARN at rpcrdma_reply_put() that
req->rl_reply was NULLed before the matching rep_put.
- I3 (Registered-MR fence): WARN at rpcrdma_complete_rqst()
that req->rl_registered is empty. Strong send-queue
ordering of the LocalInv WR chain makes the last
completion observe the ib_dma_unmap_sg() of every earlier
MR, so 'list empty' implies 'all MRs unmapped'.
- I4 (Send-buffer release): WARN at rpcrdma_req_release()
that req->rl_sendctx is NULL. Reaching the kref release
callback requires both the RPC-layer and Send-side
references to have dropped; the Send-side drop runs in
rpcrdma_sendctx_unmap(), which clears rl_sendctx
(previous patch). A non-NULL rl_sendctx here would mean
the Send-side owner had not run -- a contradiction.
The XXX comment in xprt_rdma_free() about signal-driven
release racing the Send completion described the pre-decouple
state. Replace it with a one-line note pointing at the
invariant set, since the kref scheme now holds the req across
the in-flight Send regardless of which path released the
rpc_task.
I5 (req lifecycle) is stated in the comment but not probed:
making it locally assertible would require moving kref_init
out of rpcrdma_req_release(), which in turn requires adding
kref_init to the bc_pa_list and backlog-wake reuse paths.
That restructuring is deferred -- the invariant is unchanged
either way.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/xprtrdma/rpc_rdma.c | 69 +++++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/transport.c | 13 +++++--
net/sunrpc/xprtrdma/verbs.c | 6 +++
3 files changed, 84 insertions(+), 4 deletions(-)
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index f4b4abefc4e0..626cadec4555 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -1336,6 +1336,11 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
struct rpc_rqst *rqst = rep->rr_rqst;
int status;
+ /* I3: every registered MR has been invalidated and
+ * ib_dma_unmap_sg()'d before complete_rqst runs.
+ */
+ WARN_ON_ONCE(!list_empty(&rpcr_to_rdmar(rqst)->rl_registered));
+
switch (rep->rr_proc) {
case rdma_msg:
status = rpcrdma_decode_msg(r_xprt, rep, rqst);
@@ -1367,6 +1372,70 @@ void rpcrdma_complete_rqst(struct rpcrdma_rep *rep)
goto out;
}
+/* Reply-side ownership invariants
+ *
+ * I1 (Receive WR ownership). A struct rpcrdma_rep is owned by the
+ * HCA between ib_post_recv() and the matching Receive completion.
+ * After ib_dma_sync_single_for_cpu() in rpcrdma_wc_receive() it is
+ * owned by the CPU until rpcrdma_rep_put() returns it to
+ * rb_free_reps; a rep on rb_free_reps is not re-posted until
+ * rpcrdma_post_recvs() pulls it off. Asserted: rpcrdma_post_recvs()
+ * WARNs that a pulled rep has rr_rqst == NULL.
+ *
+ * I2 (rep attachment). While req->rl_reply == rep, the rep cannot be
+ * re-posted. rpcrdma_reply_put() NULLs req->rl_reply before handing
+ * the rep to rpcrdma_rep_put(). Asserted: rpcrdma_reply_put() WARNs
+ * that rl_reply is NULL after the put.
+ *
+ * I3 (Registered-MR fence). On entry to rpcrdma_complete_rqst() every
+ * MR that was on req->rl_registered has had its rkey invalidated
+ * (remotely via IB_WC_WITH_INVALIDATE or locally via IB_WR_LOCAL_INV)
+ * and its pages ib_dma_unmap_sg()'d. The LocalInv chain is posted
+ * on a single QP; strong send-queue ordering makes the last
+ * completion (frwr_wc_localinv_done) observe the
+ * ib_dma_unmap_sg() that ran from each earlier completion's
+ * frwr_mr_put() before complete_rqst is called. The inline
+ * frwr_reminv() path unmaps its one MR synchronously before
+ * rpcrdma_reply_handler() reaches complete_rqst. Asserted:
+ * rpcrdma_complete_rqst() WARNs that rl_registered is empty.
+ *
+ * I4 (Send-buffer release). req->rl_kref carries two unconditional
+ * owners while a Send is outstanding: the RPC-layer reference (set
+ * at xprt_rdma_alloc_slot / xprt_rdma_bc_rqst_get / rpcrdma_req_release
+ * pool-entry) and the Send-side reference (kref_get() in
+ * rpcrdma_prepare_send_sges()). rpcrdma_req_release() runs only
+ * after both have dropped, so the req does not return to its free
+ * pool until rpcrdma_sendctx_unmap() has fired -- the HCA has
+ * released the send buffer before the req can be reused. Asserted:
+ * rpcrdma_req_release() WARNs that rl_sendctx is NULL.
+ *
+ * I5 (req lifecycle). A req is owned by the RPC layer between slot
+ * acquisition and the matching xprt_rdma_free_slot() (or, for the
+ * backchannel, xprt_rdma_bc_free_rqst()). While owned, rl_kref >= 1.
+ * The pools (rb_send_bufs, bc_pa_list, backlog wake target) never
+ * contain a req with outstanding Send-side or Reply-side work.
+ *
+ * Non-hazards. The following claims have been raised by adversarial
+ * review and are each closed by the invariants above:
+ *
+ * * "Reply completes the RPC while the HCA still holds the send
+ * buffer" -- excluded by I4. The Send-side kref reference is held
+ * until rpcrdma_sendctx_unmap() runs from Send completion.
+ *
+ * * "Signal-driven release races the in-flight Send" -- same
+ * resolution. xprt_rdma_free() does not touch rl_kref; the
+ * Send-side reference keeps the req out of its pool until Send
+ * completion fires.
+ *
+ * * "Receive completion races rep reuse" -- excluded by I1. A rep
+ * is on rb_free_reps only after rpcrdma_rep_put() has been called
+ * and rpcrdma_post_recvs() owns the next transition back to the HCA.
+ *
+ * * "Pages still DMA-mapped when call_decode reads them" -- excluded
+ * by I3. The matching ib_dma_unmap_sg() for every MR has run on
+ * the same CPU thread that calls rpcrdma_complete_rqst().
+ */
+
/**
* rpcrdma_reply_handler - Process received RPC/RDMA messages
* @rep: Incoming rpcrdma_rep object to process
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 5569f17fdd9b..5ff8e5126a6c 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -509,6 +509,11 @@ static void rpcrdma_req_release(struct kref *kref)
struct rpc_xprt *xprt = rqst->rq_xprt;
struct rpcrdma_xprt *r_xprt;
+ /* I4: both the RPC-layer and Send-side owners have dropped,
+ * so rpcrdma_sendctx_unmap() has cleared rl_sendctx.
+ */
+ WARN_ON_ONCE(req->rl_sendctx);
+
kref_init(&req->rl_kref);
#if defined(CONFIG_SUNRPC_BACKCHANNEL)
@@ -652,10 +657,10 @@ xprt_rdma_free(struct rpc_task *task)
frwr_unmap_sync(rpcx_to_rdmax(rqst->rq_xprt), req);
}
- /* XXX: If the RPC is completing because of a signal and
- * not because a reply was received, we ought to ensure
- * that the Send completion has fired, so that memory
- * involved with the Send is not still visible to the NIC.
+ /* The Send-side rl_kref owner keeps req out of its free pool
+ * until rpcrdma_sendctx_unmap() has fired -- see I4 above
+ * rpcrdma_reply_handler() -- so signal-driven release here
+ * does not let the HCA touch a recycled send buffer.
*/
}
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 60cbc14c5299..da2c6fa44154 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1259,6 +1259,10 @@ void rpcrdma_reply_put(struct rpcrdma_buffer *buffers, struct rpcrdma_req *req)
req->rl_reply = NULL;
rpcrdma_rep_put(buffers, rep);
}
+ /* I2: rl_reply NULL after the put closes the
+ * 'rep on rb_free_reps still referenced by req' window.
+ */
+ WARN_ON_ONCE(req->rl_reply);
}
/**
@@ -1435,6 +1439,8 @@ void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed)
rep = rpcrdma_rep_create(r_xprt);
if (!rep)
break;
+ /* I1: a rep on rb_free_reps must carry no rqst pointer. */
+ WARN_ON_ONCE(rep->rr_rqst);
if (!rpcrdma_regbuf_dma_map(r_xprt, rep->rr_rdmabuf)) {
rpcrdma_rep_put(buf, rep);
break;
--
2.54.0
^ permalink raw reply related [flat|nested] 6+ messages in thread