From: Mike Snitzer <snitzer@hammerspace.com>
To: Chuck Lever <cel@kernel.org>
Cc: linux-nfs@vger.kernel.org, ben.coddington@hammerspace.com,
jonathan.flynn@hammerspace.com
Subject: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
Date: Tue, 5 May 2026 17:55:34 -0400
Message-ID: <20260505215535.68412-2-snitzer@kernel.org>
In-Reply-To: <20260505215535.68412-1-snitzer@kernel.org>

From: Benjamin Coddington <ben.coddington@hammerspace.com>

Under sustained heavy load over RDMA, kNFSD servers can pin tens of
gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
released until the connection terminates. A customer site reported
OOM kills under heavy NFS READ workloads, with ~2.3M cached
send_ctxts visible via slab tracing: two stacks in
svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding (the
same ctxt population double-counted across the sc_pages and
sc_xprt_buf allocations). Aggregated across the customer's ~218
long-lived xprts, that worked out to roughly 80 GB pinned, freed
only by a knfsd restart.
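
(Sanity check on the slab numbers: ~2.3M kmalloc-4k objects * 4 KiB
each ~= 9.4 GB, which matches the ~9.5 GB observed.)
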
Root cause is an unbounded cache, not a per-op leak.
svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
when empty, allocates fresh. svc_rdma_send_ctxt_release() always
llist_add()s the ctxt back -- regardless of how many ctxts are
already on the list. The only kfree() site is
svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
shrinker, no cap, no aging: it can only grow.
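
For reference, the pre-patch shape of the cache, heavily abridged
(a paraphrase of the svc_rdma_sendto.c logic, not a verbatim quote):

    struct svc_rdma_send_ctxt *
    svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
    {
        struct llist_node *node;

        spin_lock(&rdma->sc_send_lock);
        node = llist_del_first(&rdma->sc_send_ctxts);
        spin_unlock(&rdma->sc_send_lock);
        if (!node)
            return svc_rdma_send_ctxt_alloc(rdma);  /* no bound */
        return llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
    }

    static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
                                           struct svc_rdma_send_ctxt *ctxt)
    {
        llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts); /* unconditional */
    }
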
Two effects compound to drive the high-water mark well above the
configured RPC slot count:
1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
INIT_WORK(...) ; queue_work(svcrdma_wq, ...) and returns. The
actual _release (which puts the ctxt back on the llist) runs
later on svcrdma_wq. Between wc_send -> _put and _put_async ->
_release, the ctxt is "in transit" -- off the list, off the SQ,
not yet reusable.
2. During that gap, a concurrent _get sees an empty llist and calls
_alloc to mint a fresh ctxt. When the in-transit one eventually
lands on the llist, the cache has grown by one. Under HCA-driven
completion rates with even small workqueue dispatch lag, this
happens constantly. The cache settles not at the steady-state
in-flight count but at the all-time peak of (in-flight +
workqueue-pending), and never shrinks.
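
Concretely, the losing interleaving (simplified to a single ctxt):

    CQ handler                         nfsd thread
    ----------                         -----------
    svc_rdma_wc_send()
      _put() -> queue_work(); returns
        [ctxt in transit]              _get(): llist empty
                                         -> _alloc() mints a new ctxt
    _put_async() runs on svcrdma_wq
      _release() -> llist_add()
        [cache is now one ctxt larger]
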
Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
svc_rdma_send_ctxt_destroy). Apply the cap in two places:
- svc_rdma_send_ctxt_get(): when the llist is empty and depth has
reached sc_max_requests, return NULL instead of allocating. The
caller drops the connection; the client reconnects with a fresh
xprt that starts at depth zero. This is the backpressure point
that prevents the memory growth seen in testing -- it stops
new allocations
regardless of where in the pipeline existing ctxts are stuck.
- svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
between concurrent _get callers, or transient burst), free the
ctxt instead of returning it to the llist. This keeps depth
convergent.

The cap is sc_max_requests because:
- It is the configured number of credit slots per xprt -- the client
can have at most this many RPCs outstanding on the transport.
- Each RPC reply uses one send_ctxt at a time; concurrent in-flight
ctxts therefore cannot legitimately exceed sc_max_requests in
steady state.
- Workqueue lag can momentarily push (in-flight + queued) above
sc_max_requests, but those ctxts are exactly what the cap should
shed -- they are not steady-state working set, just lag-inflation.
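
With the cap in place, the worst-case per-xprt footprint reduces to
a simple product (a rough bound; exact sizes depend on the device
and negotiated parameters):

    pinned_max ~= sc_max_requests *
                  (sizeof(struct svc_rdma_send_ctxt)
                   + sc_max_req_size        /* sc_xprt_buf */
                   + size of sc_pages array)
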
The reuse semantics of the cache are intentional and unchanged: ctxts
keep their first SGE DMA-mapped across cycles, so the steady-state
hot path stays alloc-free. Only the *excess* ctxts are freed.

A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
ctxt freed by the cap, so operators can confirm the cap is doing
real work on a given workload.

== Verification on the test rig ==

Diagnostic tool: svcrdma-wq-lag.bt (will be provided in reply to
this patch) -- per-5s rates of wc_send (queue inflow), _put_async
(workqueue dispatch), _get (demand), and the new tracepoint.

Negative case (an earlier attempt that capped the on-llist depth
alone, with the atomic incremented in _release and decremented in
_get), under sustained NFS READ load:

    wc_send ~432K/s, release ~342K/s
    -> ~90K/s of ctxts pinned as queued sc_work items
    -> ~2.7M pinned after 30s; matches the slab measurement
    -> svcrdma_send_ctxt_capped fires 0 times during the test, then
       floods (~3.25M events) on test stop as the workqueue catches up

The cap is structurally blind to ctxts pinned in workqueue items
because depth only counts what's currently on the llist; during
sustained load almost nothing makes it onto the llist before the
next _get takes it back off. Inflation accumulates as queued
sc_work items, invisible to the cap, until load stops.
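
For concreteness, the rejected scheme counted roughly like this
(a pseudocode sketch, not the shipped code):

    /* Rejected: depth == ctxts currently sitting on the llist. */
    _get:     node = llist_del_first(&rdma->sc_send_ctxts);
              if (node)
                      atomic_dec(&depth);  /* ctxt leaves the llist  */
    _release: atomic_inc(&depth);          /* ctxt rejoins the llist */
              llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
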
Post-patch (depth tracked at alloc/destroy + _get backpressure),
same workload, 5 minutes:
    wc_send and release rates match within 1% (~410K/s each)
    No accumulation; no flood at test stop
    svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
      recovery)
    Throughput slightly higher than the negative case (cache no longer
      bloats the slab/page allocator into reclaim)

The persistent wc_send/release gap in the negative case was itself
a consequence of the unbounded growth: cache bloat -> slab pressure
-> reclaim activity -> workqueue starvation -> larger gap. Once
the cap breaks that spiral, the workqueue runs at full capacity and
the rates equalize.

To watch the tracepoint on a live server:

    cd /sys/kernel/tracing
    echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
    cat trace_pipe

If a workload genuinely needs more than sc_max_requests concurrent
in-flight Sends, raise it via the svcrdma_max_requests sysctl
(exposed as /proc/sys/sunrpc/svc_rdma/max_requests) rather than
removing the cap.
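
For example (128 is purely illustrative; the new value applies only
to xprts created after the write):

    echo 128 > /proc/sys/sunrpc/svc_rdma/max_requests
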
== Diagnostics caveats ==

A previous diagnostic pass on this code path was misled by GCC
inlining svc_rdma_send_ctxt_put into all of its in-tree callers
(same TU): the symbol stays in kallsyms (lowercase 't') but no
caller jumps there, so kprobes attach without warning yet never
fire. This made an inventory script's "@inflight = gets - puts"
counter appear monotonically rising, falsely confirming a per-op
lifecycle leak.
Hooking svc_rdma_send_ctxt_put_async (a workqueue function-pointer
target, forced out-of-line by INIT_WORK) instead of _put yields
accurate accounting and shows gets/puts balanced within ~1% under
load. Future probes in this path should prefer function-pointer
targets, tracepoints, or kmem tracepoints over kprobes on small
non-static functions in the same TU as their callers.
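
For example, a probe pair along these lines gives trustworthy
counts (a bpftrace sketch in the spirit of svcrdma-wq-lag.bt, not
its actual contents):

    bpftrace -e '
        kprobe:svc_rdma_send_ctxt_get       { @gets = count(); }
        kprobe:svc_rdma_send_ctxt_put_async { @puts = count(); }
        interval:s:5 {
                print(@gets); print(@puts);
                clear(@gets); clear(@puts);
        }'
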
Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Diagnosed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Benjamin Coddington <ben.coddington@hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
include/linux/sunrpc/svc_rdma.h | 1 +
include/trace/events/rpcrdma.h | 2 ++
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 41 ++++++++++++++++++++----
net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
4 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad570..b8ae1032bf293 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -97,6 +97,7 @@ struct svcxprt_rdma {
spinlock_t sc_send_lock;
struct llist_head sc_send_ctxts;
+ atomic_t sc_send_ctxts_depth;
spinlock_t sc_rw_ctxt_lock;
struct llist_head sc_rw_ctxts;
diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index b79913048e1a0..945152e33af8c 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -2027,6 +2027,8 @@ DEFINE_SIMPLE_CID_EVENT(svcrdma_wc_send);
DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_flush);
DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_err);
+DEFINE_SIMPLE_CID_EVENT(svcrdma_send_ctxt_capped);
+
DEFINE_SIMPLE_CID_EVENT(svcrdma_post_recv);
DEFINE_RECEIVE_SUCCESS_EVENT(svcrdma_wc_recv);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 8b3f0c8c14b25..e487d2815b33e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -158,6 +158,7 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
for (i = 0; i < rdma->sc_max_send_sges; i++)
ctxt->sc_sges[i].lkey = rdma->sc_pd->local_dma_lkey;
+ atomic_inc(&rdma->sc_send_ctxts_depth);
return ctxt;
fail3:
@@ -170,6 +171,20 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
return NULL;
}
+/* Tear down a single send_ctxt: reverse of svc_rdma_send_ctxt_alloc. */
+static void svc_rdma_send_ctxt_destroy(struct svcxprt_rdma *rdma,
+ struct svc_rdma_send_ctxt *ctxt)
+{
+ struct ib_device *device = rdma->sc_cm_id->device;
+
+ ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
+ rdma->sc_max_req_size, DMA_TO_DEVICE);
+ kfree(ctxt->sc_xprt_buf);
+ kfree(ctxt->sc_pages);
+ kfree(ctxt);
+ atomic_dec(&rdma->sc_send_ctxts_depth);
+}
+
/**
* svc_rdma_send_ctxts_destroy - Release all send_ctxt's for an xprt
* @rdma: svcxprt_rdma being torn down
@@ -177,17 +192,12 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
*/
void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
{
- struct ib_device *device = rdma->sc_cm_id->device;
struct svc_rdma_send_ctxt *ctxt;
struct llist_node *node;
while ((node = llist_del_first(&rdma->sc_send_ctxts)) != NULL) {
ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
- ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
- rdma->sc_max_req_size, DMA_TO_DEVICE);
- kfree(ctxt->sc_xprt_buf);
- kfree(ctxt->sc_pages);
- kfree(ctxt);
+ svc_rdma_send_ctxt_destroy(rdma, ctxt);
}
}
@@ -226,6 +236,14 @@ struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
return ctxt;
out_empty:
+ /* Backpressure: refuse to mint a new ctxt once the per-xprt total
+ * (in-flight + queued for release + on-llist) has reached the
+ * configured slot count. The caller drops the connection; the
+ * client reconnects with a fresh xprt. Better than the unbounded
+ * allocation that lets workqueue lag inflate the cache to OOM.
+ */
+ if (atomic_read(&rdma->sc_send_ctxts_depth) >= rdma->sc_max_requests)
+ return NULL;
ctxt = svc_rdma_send_ctxt_alloc(rdma);
if (!ctxt)
return NULL;
@@ -257,6 +275,17 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
DMA_TO_DEVICE);
}
+ /* Depth is now tracked at alloc/destroy, so it reflects total
+ * live ctxts (in-flight + queued + on-llist), not just on-llist.
+ * If we've blown past the cap -- via a race in the _get
+ * backpressure check, or a transient burst -- destroy this ctxt
+ * instead of returning it to the llist so the depth converges.
+ */
+ if (atomic_read(&rdma->sc_send_ctxts_depth) > rdma->sc_max_requests) {
+ trace_svcrdma_send_ctxt_capped(&ctxt->sc_cid);
+ svc_rdma_send_ctxt_destroy(rdma, ctxt);
+ return;
+ }
llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4ff..7708634ebf587 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -176,6 +176,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
init_llist_head(&cma_xprt->sc_send_ctxts);
+ atomic_set(&cma_xprt->sc_send_ctxts_depth, 0);
init_llist_head(&cma_xprt->sc_recv_ctxts);
init_llist_head(&cma_xprt->sc_rw_ctxts);
init_waitqueue_head(&cma_xprt->sc_send_wait);
--
2.44.0