* [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache
@ 2026-05-05 21:55 Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 21:55 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, ben.coddington, jonathan.flynn
Hi,
I drew the short straw and took a hand-off from Ben on work he
started with Claude yesterday, in response to a really crazy OOM
situation that hits like a freight train at one large customer's
install: currently 121 NFS clients and 9 NFS servers, all connected
with RDMA networking. Working with Jon Flynn, we later scaled the
testing down to 15 clients reading from 1 server using 16K O_DIRECT
reads to bound the problem a bit more.
So I imported Ben's CLAUDE.md that he handed off and carried on.
With patch 1/2 we're able to avoid OOM-killing the NFS servers (each
with 128GB) -- without it, the 16K test workload would grow memory use
from ~12GB to exhaustion (128GB) within ~10 seconds of starting the
test.
The 2nd patch in this series provides the diagnostic svcrdma-wq-lag.bt
bpftrace script that Claude suggested -- I just dropped it in
Documentation/filesystems/nfs/, but it isn't intended to go upstream.
Chuck,
Patch 1/2 is marked RFC because ultimately we suspect you'll have a
better way to skin this cat... but Claude was pretty great at helping
us cut through this nasty OOM situation with RDMA.
Please feel free to ask follow-up questions and we'll fill in any
details as best we can.
Thanks,
Mike
Benjamin Coddington (1):
svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
Mike Snitzer (1):
for diagnostic use only: add svcrdma_wq lag diagnostic
.../filesystems/nfs/svcrdma-wq-lag.bt | 146 ++++++++++++++++++
include/linux/sunrpc/svc_rdma.h | 1 +
include/trace/events/rpcrdma.h | 2 +
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 41 ++++-
net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
5 files changed, 185 insertions(+), 6 deletions(-)
create mode 100755 Documentation/filesystems/nfs/svcrdma-wq-lag.bt
--
2.44.0
* [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
@ 2026-05-05 21:55 ` Mike Snitzer
2026-05-06 6:01 ` Chuck Lever
2026-05-05 21:55 ` [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic Mike Snitzer
2026-05-05 22:05 ` [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2 siblings, 1 reply; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 21:55 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, ben.coddington, jonathan.flynn
From: Benjamin Coddington <ben.coddington@hammerspace.com>
Under sustained heavy load over RDMA, kNFSD servers can pin tens of
gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
released until the connection terminates. A customer site reported
OOM kills under heavy NFS READ workloads with ~2.3M cached
send_ctxts visible via slab tracing (two stacks in
svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
the same ctxt population double-counted across the sc_pages and
sc_xprt_buf allocations). Aggregated across the customer's ~218
long-lived xprts, that worked out to roughly 80 GB pinned, freed only
by a knfsd restart.
Root cause is an unbounded cache, not a per-op leak.
svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
on empty, allocates fresh. svc_rdma_send_ctxt_release() always
llist_add()s the ctxt back -- regardless of how many ctxts are
already on the list. The only kfree() site is
svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
shrinker, no cap, no aging: it can only grow.
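For reference, the pre-patch shape of that cache, paraphrased and
simplified (locking, WR setup, and error handling omitted; the real
code lives in net/sunrpc/xprtrdma/svc_rdma_sendto.c):

    /* Simplified paraphrase of the pre-patch behaviour -- not the
     * patch itself.
     */
    struct svc_rdma_send_ctxt *
    svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
    {
            struct llist_node *node;

            node = llist_del_first(&rdma->sc_send_ctxts);
            if (!node)
                    /* Cache empty: mint a fresh ctxt, with no bound on
                     * how many have already been minted for this xprt.
                     */
                    return svc_rdma_send_ctxt_alloc(rdma);
            return llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
    }

    static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
                                           struct svc_rdma_send_ctxt *ctxt)
    {
            struct ib_device *device = rdma->sc_cm_id->device;
            unsigned int i;

            /* sge[0] (the transport header) stays DMA-mapped for the
             * life of the ctxt; only payload SGEs are unmapped here.
             */
            for (i = 1; i < ctxt->sc_send_wr.num_sge; i++)
                    ib_dma_unmap_page(device, ctxt->sc_sges[i].addr,
                                      ctxt->sc_sges[i].length,
                                      DMA_TO_DEVICE);

            /* Always recycled, never freed, no matter how many ctxts
             * the llist already holds.
             */
            llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
    }
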
Two effects compound to drive the high-water mark well above the
configured RPC slot count:
1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
INIT_WORK(...); queue_work(svcrdma_wq, ...) and returns (see the
sketch after this list). The actual _release (which puts the ctxt
back on the llist) runs later on svcrdma_wq. Between wc_send ->
_put and _put_async -> _release, the ctxt is "in transit" -- off
the list, off the SQ, not yet reusable.
2. During that gap, a concurrent _get sees an empty llist and calls
_alloc to mint a fresh ctxt. When the in-transit one eventually
lands on the llist, the cache has grown by one. Under HCA-driven
completion rates with even small workqueue dispatch lag, this
happens constantly. The cache settles not at the steady-state
in-flight count but at the all-time peak of (in-flight +
workqueue-pending), and never shrinks.
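A condensed sketch of that deferred-release path, again paraphrasing
the existing code (field names approximate):

    /* Paraphrase of the existing upstream path -- not part of this
     * patch.
     */
    static void svc_rdma_send_ctxt_put_async(struct work_struct *work)
    {
            struct svc_rdma_send_ctxt *ctxt =
                    container_of(work, struct svc_rdma_send_ctxt, sc_work);

            /* Only here, one workqueue dispatch later, does the ctxt
             * go back onto the llist and become reusable.
             */
            svc_rdma_send_ctxt_release(ctxt->sc_rdma, ctxt);
    }

    void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
                                struct svc_rdma_send_ctxt *ctxt)
    {
            /* Runs in Send completion context and returns in
             * microseconds; the real teardown work is deferred to
             * svcrdma_wq.
             */
            INIT_WORK(&ctxt->sc_work, svc_rdma_send_ctxt_put_async);
            queue_work(svcrdma_wq, &ctxt->sc_work);
    }

Every ctxt sitting between that queue_work() and the worker above is
the "in transit" population described in (1) and (2).
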
Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
svc_rdma_send_ctxt_destroy). Apply the cap in two places:
- svc_rdma_send_ctxt_get(): when the llist is empty and depth has
reached sc_max_requests, return NULL instead of allocating. The
caller drops the connection (its existing error path is sketched
after this list); the client reconnects with a fresh xprt that
starts at depth zero. This is the backpressure point that prevents
in-test memory growth -- it stops new allocations regardless of
where in the pipeline existing ctxts are stuck.
- svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
between concurrent _get callers, or transient burst), free the
ctxt instead of returning it to the llist. This keeps depth
convergent.
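For context, the svc_rdma_sendto() caller already treats a NULL
send_ctxt as grounds to drop the connection; roughly (paraphrased
from the existing error path, details vary by kernel version):

    /* Paraphrase of the existing svc_rdma_sendto() error handling --
     * not part of this patch.
     */
            sctxt = svc_rdma_send_ctxt_get(rdma);
            if (!sctxt)
                    goto drop_connection;
            ...
    drop_connection:
            svc_xprt_deferred_close(&rdma->sc_xprt);
            return -ENOTCONN;
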
The cap is sc_max_requests because:
- It is the configured number of credit slots per xprt -- the client
can have at most this many RPCs outstanding on the transport.
- Each RPC reply uses one send_ctxt at a time; concurrent in-flight
ctxts therefore cannot legitimately exceed sc_max_requests in
steady state.
- Workqueue lag can momentarily push (in-flight + queued) above
sc_max_requests, but those ctxts are exactly what the cap should
shed -- they are not steady-state working set, just lag-inflation.
The reuse semantics of the cache are intentional and unchanged: ctxts
keep their first SGE DMA-mapped across cycles, so the steady-state
hot path stays alloc-free. Only the *excess* ctxts are freed.
A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
freed-by-cap ctxt, so operators can confirm the cap is doing real
work on a given workload.
== Verification on the test rig ==
Diagnostic tool: svcrdma-wq-lag.bt (provided as patch 2/2 of this
series) -- per-5s rates of wc_send (queue inflow), _put_async
(workqueue dispatch), _get (demand), and the new tracepoint.
Negative case (cap on on-llist depth alone, with the atomic
incremented in _release and decremented in _get), sustained NFS
READ load:
wc_send ~432K/s, release ~342K/s
-> ~90K/s of ctxts pinned as queued sc_work items
-> ~2.7M pinned after 30s; matches the slab measurement
-> svcrdma_send_ctxt_capped fires 0 times during the test, then
floods (~3.25M events) on test stop as the workqueue catches up
The cap is structurally blind to ctxts pinned in workqueue items
because depth only counts what's currently on the llist; during
sustained load almost nothing makes it onto the llist before the
next _get takes it back off. Inflation accumulates as queued
sc_work items, invisible to the cap, until load stops.
Post-patch (depth tracked at alloc/destroy + _get backpressure),
same workload, 5 minutes:
wc_send and release rates match within 1% (~410K/s each)
No accumulation; no flood at test stop
svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
recovery)
Throughput slightly higher than the negative case (cache no longer
bloats the slab/page allocator into reclaim)
The persistent wc_send/release gap in the negative case was itself
a consequence of the unbounded growth: cache bloat -> slab pressure
-> reclaim activity -> workqueue starvation -> larger gap. Once
the cap breaks that spiral, the workqueue runs at full capacity and
the rates equalize.
Operators can confirm the cap is doing real work via:
cd /sys/kernel/tracing
echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
cat trace_pipe
If a workload genuinely needs more than sc_max_requests concurrent
in-flight Sends, raise it via the sunrpc.svc_rdma.max_requests sysctl
rather than removing the cap.
== Diagnostics caveats ==
A previous diagnostic pass on this code path was misled by GCC
inlining of svc_rdma_send_ctxt_put into all in-tree callers (same
TU): the symbol stays in kallsyms (lowercase t) but no caller jumps
there, so kprobes attach without warning yet fire zero times. This
made an inventory script's "@inflight = gets - puts" appear to be
monotonically rising, falsely confirming a per-op lifecycle leak.
Hooking svc_rdma_send_ctxt_put_async (a workqueue function-pointer
target, forced out-of-line by INIT_WORK) instead of _put gets
accurate accounting and shows gets/puts balanced within ~1% under
load. Future probes in this path should prefer function-pointer
targets, tracepoints, or kmem tracepoints over kprobes on small
non-static functions in the same TU as their callers.
Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Diagnosed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Benjamin Coddington <ben.coddington@hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
include/linux/sunrpc/svc_rdma.h | 1 +
include/trace/events/rpcrdma.h | 2 ++
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 41 ++++++++++++++++++++----
net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
4 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad570..b8ae1032bf293 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -97,6 +97,7 @@ struct svcxprt_rdma {
spinlock_t sc_send_lock;
struct llist_head sc_send_ctxts;
+ atomic_t sc_send_ctxts_depth;
spinlock_t sc_rw_ctxt_lock;
struct llist_head sc_rw_ctxts;
diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index b79913048e1a0..945152e33af8c 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -2027,6 +2027,8 @@ DEFINE_SIMPLE_CID_EVENT(svcrdma_wc_send);
DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_flush);
DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_err);
+DEFINE_SIMPLE_CID_EVENT(svcrdma_send_ctxt_capped);
+
DEFINE_SIMPLE_CID_EVENT(svcrdma_post_recv);
DEFINE_RECEIVE_SUCCESS_EVENT(svcrdma_wc_recv);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 8b3f0c8c14b25..e487d2815b33e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -158,6 +158,7 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
for (i = 0; i < rdma->sc_max_send_sges; i++)
ctxt->sc_sges[i].lkey = rdma->sc_pd->local_dma_lkey;
+ atomic_inc(&rdma->sc_send_ctxts_depth);
return ctxt;
fail3:
@@ -170,6 +171,20 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
return NULL;
}
+/* Tear down a single send_ctxt: reverse of svc_rdma_send_ctxt_alloc. */
+static void svc_rdma_send_ctxt_destroy(struct svcxprt_rdma *rdma,
+ struct svc_rdma_send_ctxt *ctxt)
+{
+ struct ib_device *device = rdma->sc_cm_id->device;
+
+ ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
+ rdma->sc_max_req_size, DMA_TO_DEVICE);
+ kfree(ctxt->sc_xprt_buf);
+ kfree(ctxt->sc_pages);
+ kfree(ctxt);
+ atomic_dec(&rdma->sc_send_ctxts_depth);
+}
+
/**
* svc_rdma_send_ctxts_destroy - Release all send_ctxt's for an xprt
* @rdma: svcxprt_rdma being torn down
@@ -177,17 +192,12 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
*/
void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
{
- struct ib_device *device = rdma->sc_cm_id->device;
struct svc_rdma_send_ctxt *ctxt;
struct llist_node *node;
while ((node = llist_del_first(&rdma->sc_send_ctxts)) != NULL) {
ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
- ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
- rdma->sc_max_req_size, DMA_TO_DEVICE);
- kfree(ctxt->sc_xprt_buf);
- kfree(ctxt->sc_pages);
- kfree(ctxt);
+ svc_rdma_send_ctxt_destroy(rdma, ctxt);
}
}
@@ -226,6 +236,14 @@ struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
return ctxt;
out_empty:
+ /* Backpressure: refuse to mint a new ctxt once the per-xprt total
+ * (in-flight + queued for release + on-llist) has reached the
+ * configured slot count. The caller drops the connection; the
+ * client reconnects with a fresh xprt. Better than the unbounded
+ * allocation that lets workqueue lag inflate the cache to OOM.
+ */
+ if (atomic_read(&rdma->sc_send_ctxts_depth) >= rdma->sc_max_requests)
+ return NULL;
ctxt = svc_rdma_send_ctxt_alloc(rdma);
if (!ctxt)
return NULL;
@@ -257,6 +275,17 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
DMA_TO_DEVICE);
}
+ /* Depth is now tracked at alloc/destroy, so it reflects total
+ * live ctxts (in-flight + queued + on-llist), not just on-llist.
+ * If we've blown past the cap -- via a race in the _get
+ * backpressure check, or a transient burst -- destroy this ctxt
+ * instead of returning it to the llist so the depth converges.
+ */
+ if (atomic_read(&rdma->sc_send_ctxts_depth) > rdma->sc_max_requests) {
+ trace_svcrdma_send_ctxt_capped(&ctxt->sc_cid);
+ svc_rdma_send_ctxt_destroy(rdma, ctxt);
+ return;
+ }
llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4ff..7708634ebf587 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -176,6 +176,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
init_llist_head(&cma_xprt->sc_send_ctxts);
+ atomic_set(&cma_xprt->sc_send_ctxts_depth, 0);
init_llist_head(&cma_xprt->sc_recv_ctxts);
init_llist_head(&cma_xprt->sc_rw_ctxts);
init_waitqueue_head(&cma_xprt->sc_send_wait);
--
2.44.0
* [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
@ 2026-05-05 21:55 ` Mike Snitzer
2026-05-05 22:05 ` [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2 siblings, 0 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 21:55 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, ben.coddington, jonathan.flynn
From: Mike Snitzer <snitzer@hammerspace.com>
This script reports per-5s rates of svc_rdma_wc_send (a proxy for
queue_work into svcrdma_wq, since each Send completion queues one
put), svc_rdma_send_ctxt_put_async (the actual workqueue dispatch),
svc_rdma_send_ctxt_get (demand on the cache), and the
svcrdma_send_ctxt_capped tracepoint.
It is used to diagnose whether svcrdma_wq dispatch is keeping up with
inflow under sustained load -- a sustained gap between the wc_send and
release rates means ctxts are pinned as queued sc_work items, which
the on-llist depth counter cannot see.
Uses only reliably-traceable hooks: wc_send and _put_async are
function-pointer call sites that cannot be inlined; _get is forced
out-of-line by external callers in recvfrom and backchannel TUs (see
CLAUDE.md "Probe-inlining gotcha").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
.../filesystems/nfs/svcrdma-wq-lag.bt | 146 ++++++++++++++++++
1 file changed, 146 insertions(+)
create mode 100755 Documentation/filesystems/nfs/svcrdma-wq-lag.bt
diff --git a/Documentation/filesystems/nfs/svcrdma-wq-lag.bt b/Documentation/filesystems/nfs/svcrdma-wq-lag.bt
new file mode 100755
index 0000000000000..16e374bcb7aa8
--- /dev/null
+++ b/Documentation/filesystems/nfs/svcrdma-wq-lag.bt
@@ -0,0 +1,146 @@
+#!/usr/bin/env bpftrace
+/*
+ * svcrdma-wq-lag.bt
+ *
+ * Test the "svcrdma_wq is the bottleneck" theory.
+ *
+ * Background
+ * ----------
+ * An earlier commit from Ben Coddington ("svcrdma: cap per-xprt
+ * sc_send_ctxts free list at sc_max_requests") added an atomic depth
+ * counter on the sc_send_ctxts llist and a cap check inside
+ * svc_rdma_send_ctxt_release(). The svcrdma_send_ctxt_capped
+ * tracepoint was supposed to fire whenever the cache was about to
+ * exceed sc_max_requests.
+ *
+ * Customer test on hot workload: tracepoint fires zero times during
+ * the test, then floods the moment the test is Ctrl-C'd. Memory
+ * keeps growing during the test as before.
+ *
+ * Hypothesis
+ * ----------
+ * svcrdma_wq is alloc_workqueue("svcrdma", WQ_UNBOUND, 0). WQ_UNBOUND
+ * has bounded concurrency but UNBOUNDED queue depth. Under load,
+ * svc_rdma_wc_send (Send completion ISR) calls svc_rdma_send_ctxt_put,
+ * which does INIT_WORK + queue_work and returns in microseconds. The
+ * worker, svc_rdma_send_ctxt_put_async -> svc_rdma_send_ctxt_release,
+ * has to do ib_dma_unmap_page per SGE plus release_pages plus
+ * svc_rdma_reply_chunk_release -- a much slower operation.
+ *
+ * If queue rate >> worker rate, work items pile up. Each pinned
+ * sc_work is embedded in its ctxt, so ctxts can't be released. The
+ * llist stays empty (depth ~ 0), and the cap at _release never
+ * fires. _get keeps seeing an empty llist and calling _alloc for new
+ * ctxts. Memory grows.
+ *
+ * After Ctrl-C: completions stop, workqueue drains, all queued
+ * releases run rapid-fire. NOW the llist depth shoots past
+ * sc_max_requests and the cap fires for each subsequent release --
+ * the post-test flood.
+ *
+ * What this script measures
+ * -------------------------
+ * Per 5-second window, four rates:
+ *
+ * @wc_send_rate svc_rdma_wc_send invocations
+ * -- proxy for svcrdma_wq.queue_work() rate
+ * (each completion queues one put)
+ *
+ * @release_rate svc_rdma_send_ctxt_put_async invocations
+ * -- actual workqueue dispatch rate
+ *
+ * @get_rate svc_rdma_send_ctxt_get invocations
+ * -- demand pressure on the cache
+ *
+ * @capped_rate svcrdma_send_ctxt_capped tracepoint count
+ * -- where Ben's cap actually fires
+ *
+ * Reading the result
+ * ------------------
+ * If @wc_send_rate >> @release_rate during the test, the workqueue
+ * is backed up -- queue items are accumulating, which means ctxts
+ * are pinned in workqueue items. Hypothesis confirmed.
+ *
+ * If those two rates are comparable, the workqueue is keeping up
+ * and the bottleneck is somewhere else.
+ *
+ * @capped_rate ~ 0 during the test, then @release_rate >>
+ * @wc_send_rate after Ctrl-C (workqueue draining), then
+ * @capped_rate spikes -- exactly the user's observed pattern.
+ *
+ * @get_rate gives the workload's natural cadence; if @get_rate >>
+ * @release_rate sustained, allocations are accumulating regardless
+ * of where in the pipeline they're stuck.
+ *
+ * Probes used
+ * -----------
+ * kprobe:svc_rdma_wc_send function pointer call;
+ * reliably traceable
+ * kprobe:svc_rdma_send_ctxt_put_async workqueue function
+ * pointer; reliably
+ * traceable
+ * kprobe:svc_rdma_send_ctxt_get external callers force
+ * it out-of-line;
+ * reliably traceable
+ * tracepoint:rpcrdma:svcrdma_send_ctxt_capped
+ * from Ben's patch
+ *
+ * If the tracepoint fails to attach (rpcrdma module's tracepoint
+ * table not visible to bpftrace at attach time), comment out that
+ * probe -- the rest of the script still answers the main question.
+ *
+ * Usage
+ * -----
+ * sudo bpftrace svcrdma-wq-lag.bt
+ * (Ctrl-C to exit; END block prints a final snapshot.)
+ *
+ * Run for 60-120s under the load that previously OOM'd. Watch the
+ * 5-second windows; the gap between @wc_send_rate and @release_rate
+ * is the smoking gun.
+ */
+
+config = {
+ max_map_keys = 4096;
+}
+
+BEGIN {
+ printf("svcrdma_wq lag diagnostic. Per-5s rates.\n");
+ printf("Watch for @wc_send_rate >> @release_rate during the test\n");
+ printf("(workqueue backed up -> ctxts pinned in queue items ->\n");
+ printf(" Ben's cap can't see them).\n");
+ printf("Ctrl-C to exit.\n\n");
+}
+
+kprobe:svc_rdma_wc_send {
+ @wc_send_window += 1;
+}
+
+kprobe:svc_rdma_send_ctxt_put_async {
+ @release_window += 1;
+}
+
+kprobe:svc_rdma_send_ctxt_get {
+ @get_window += 1;
+}
+
+tracepoint:rpcrdma:svcrdma_send_ctxt_capped {
+ @capped_window += 1;
+}
+
+interval:s:5 {
+ time("\n%H:%M:%S ");
+ printf("wc_send=%-8lld release=%-8lld get=%-8lld capped=%-8lld",
+ @wc_send_window, @release_window,
+ @get_window, @capped_window);
+ if (@release_window > 0) {
+ printf(" queue/release ratio=%lld",
+ @wc_send_window / @release_window);
+ }
+ printf("\n");
+ clear(@wc_send_window);
+ clear(@release_window);
+ clear(@get_window);
+ clear(@capped_window);
+}
+
+END { }
--
2.44.0
* Re: [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic Mike Snitzer
@ 2026-05-05 22:05 ` Mike Snitzer
2 siblings, 0 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 22:05 UTC (permalink / raw)
To: Chuck Lever, linux-nfs; +Cc: ben.coddington, jonathan.flynn
On Tue, May 05, 2026 at 05:55:33PM -0400, Mike Snitzer wrote:
> Hi,
>
> I drew the short-straw by having to take a hand-off from Ben on work
> he started with Claude yesterday in response to a really crazy OOM
> situation that hits like a freight train at one large customer's
> install that currently has 121 NFS clients and 9 NFS servers, all
> connected with RDMA networking. Working with Jon Flynn, to bound the
> problem a bit more we later scaled the testing down to 15 clients
> reading from 1 server using 16K O_DIRECT reads.
>
> So I imported Ben's CLAUDE.md that he handed off and carried on, with
> patch 1/2 we're able to avoid OOM killing the NFS servers (each with
> 128GB) -- with the 16K test workload memory use would grow from ~12GB
> to exhaustion (128GB) within ~10 seconds of starting the test.
>
> The 2nd patch in this series provides a diagnostic svcrdma-wq-lag.bt
> bpf script that Claude suggested -- I just dropped it in
> Documentation/filesystems/nfs/ but it isn't intended to go upstream.
Here is svcrdma-wq-lag.bt output before we capped in _get (short run
because otherwise we'd have hit OOM, but shows the flood of
sc_send_ctxts that only become visible once the test was killed at
17:45:36):
# ./svcrdma-wq-lag.bt
Attaching 7 probes...
svcrdma_wq lag diagnostic. Per-5s rates.
Watch for @wc_send_rate >> @release_rate during the test
(workqueue backed up -> ctxts pinned in queue items ->
Ben's cap can't see them).
Ctrl-C to exit.
17:45:06 wc_send=0 release=0 get=0 capped=0
17:45:11 wc_send=1984617 release=1571458 get=1916504 capped=0 queue/release ratio=1
17:45:16 wc_send=2176680 release=1698697 get=2097454 capped=0 queue/release ratio=1
17:45:21 wc_send=2181507 release=1716810 get=2105847 capped=0 queue/release ratio=1
17:45:26 wc_send=2178462 release=1728112 get=2104069 capped=726 queue/release ratio=1
17:45:31 wc_send=2158883 release=1714519 get=2092136 capped=205 queue/release ratio=1
17:45:36 wc_send=1029461 release=1585039 get=996644 capped=757304 queue/release ratio=0
17:45:41 wc_send=0 release=3092605 get=0 capped=1492238 queue/release ratio=0
17:45:46 wc_send=0 release=1000663 get=0 capped=1000764 queue/release ratio=0
17:45:51 wc_send=0 release=0 get=0 capped=0
^C
And here is svcrdma-wq-lag.bt output with patch 1/2 applied (full 5
minute run):
18:13:21 wc_send=1151440 release=1148610 get=1150980 capped=0 queue/release ratio=1
18:13:26 wc_send=2125499 release=2149303 get=2107227 capped=0 queue/release ratio=0
18:13:31 wc_send=1780821 release=1792233 get=1759546 capped=0 queue/release ratio=0
18:13:36 wc_send=3893061 release=2118765 get=2081403 capped=1 queue/release ratio=1
18:13:41 wc_send=1837599 release=1883821 get=1815453 capped=0 queue/release ratio=0
18:13:46 wc_send=1813727 release=1822876 get=1794087 capped=0 queue/release ratio=0
18:13:51 wc_send=1491464 release=1529370 get=1476924 capped=0 queue/release ratio=0
18:13:56 wc_send=1930254 release=1955278 get=1910046 capped=0 queue/release ratio=0
18:14:01 wc_send=1916374 release=1944863 get=1894205 capped=0 queue/release ratio=0
18:14:11 wc_send=4089684 release=4107873 get=4041449 capped=0 queue/release ratio=0
18:14:16 wc_send=1740066 release=1789864 get=5765163 capped=0 queue/release ratio=0
18:14:21 wc_send=3655551 release=1936245 get=1894173 capped=0 queue/release ratio=1
18:14:26 wc_send=1652813 release=3613434 get=1637738 capped=0 queue/release ratio=0
18:14:31 wc_send=1873398 release=1903780 get=1855887 capped=0 queue/release ratio=0
18:14:36 wc_send=1947061 release=1971983 get=1920987 capped=0 queue/release ratio=0
18:14:41 wc_send=1624716 release=1630181 get=1608404 capped=1 queue/release ratio=0
18:14:46 wc_send=1709804 release=1752711 get=1693828 capped=0 queue/release ratio=0
18:14:51 wc_send=1939672 release=1970791 get=3615177 capped=0 queue/release ratio=0
18:14:56 wc_send=1884914 release=3875602 get=1862209 capped=0 queue/release ratio=0
18:15:01 wc_send=2019244 release=2044366 get=1993761 capped=0 queue/release ratio=0
18:15:06 wc_send=2065910 release=2099472 get=2046106 capped=0 queue/release ratio=0
18:15:11 wc_send=1914899 release=1933756 get=1893723 capped=2 queue/release ratio=0
18:15:16 wc_send=1895244 release=1916124 get=1874634 capped=0 queue/release ratio=0
18:15:21 wc_send=2000953 release=2036536 get=3853528 capped=2 queue/release ratio=0
18:15:26 wc_send=3971332 release=1984930 get=5805016 capped=0 queue/release ratio=2
18:15:31 wc_send=1842215 release=1876278 get=1827305 capped=0 queue/release ratio=0
18:15:36 wc_send=1712031 release=1745388 get=1698252 capped=0 queue/release ratio=0
18:15:41 wc_send=1457677 release=1477376 get=1446707 capped=0 queue/release ratio=0
18:15:46 wc_send=1522972 release=1562257 get=1510851 capped=0 queue/release ratio=0
18:15:51 wc_send=2126919 release=2126524 get=2100919 capped=0 queue/release ratio=1
18:15:56 wc_send=1795205 release=1853495 get=1782554 capped=0 queue/release ratio=0
18:16:01 wc_send=1872737 release=1892789 get=1852457 capped=1 queue/release ratio=0
18:16:06 wc_send=1956552 release=1978503 get=1936808 capped=0 queue/release ratio=0
18:16:11 wc_send=1971408 release=2008768 get=1946069 capped=0 queue/release ratio=0
18:16:16 wc_send=1906940 release=1928668 get=1891132 capped=0 queue/release ratio=0
18:16:21 wc_send=2019228 release=2059320 get=1993856 capped=0 queue/release ratio=0
18:16:26 wc_send=2024263 release=2039861 get=2004786 capped=0 queue/release ratio=0
18:16:31 wc_send=1959046 release=1986165 get=1936975 capped=0 queue/release ratio=0
18:16:36 wc_send=3426760 release=3483327 get=1456324 capped=0 queue/release ratio=0
18:16:41 wc_send=2066845 release=2078503 get=2042898 capped=0 queue/release ratio=0
18:16:46 wc_send=1834131 release=1878182 get=1818034 capped=1 queue/release ratio=0
18:16:51 wc_send=1804133 release=1817222 get=1787677 capped=1 queue/release ratio=0
18:16:56 wc_send=1443325 release=1469384 get=1429495 capped=0 queue/release ratio=0
18:17:01 wc_send=1437236 release=1480771 get=1425373 capped=1 queue/release ratio=0
18:17:06 wc_send=3342896 release=1925346 get=1885982 capped=0 queue/release ratio=1
18:17:11 wc_send=2035328 release=2063539 get=2011609 capped=1 queue/release ratio=0
18:17:16 wc_send=1923583 release=1930805 get=1904366 capped=0 queue/release ratio=0
18:17:21 wc_send=1918902 release=1958907 get=1899028 capped=0 queue/release ratio=0
18:17:26 wc_send=1912808 release=1942152 get=1895315 capped=0 queue/release ratio=0
18:17:31 wc_send=2067295 release=2090789 get=2045747 capped=0 queue/release ratio=0
18:17:36 wc_send=1893866 release=1917954 get=3924985 capped=0 queue/release ratio=0
18:17:41 wc_send=1914595 release=1922461 get=1895841 capped=0 queue/release ratio=0
18:17:46 wc_send=1833010 release=1884469 get=1816313 capped=0 queue/release ratio=0
18:17:51 wc_send=1853085 release=3756671 get=3652976 capped=1 queue/release ratio=0
18:17:56 wc_send=1814888 release=1845390 get=1797898 capped=0 queue/release ratio=0
18:18:01 wc_send=1875252 release=3752430 get=1856752 capped=0 queue/release ratio=0
18:18:06 wc_send=1929648 release=1955900 get=1913162 capped=0 queue/release ratio=0
18:18:11 wc_send=1878956 release=1907828 get=1862308 capped=0 queue/release ratio=0
18:18:16 wc_send=1839023 release=1862031 get=1823009 capped=0 queue/release ratio=0
18:18:21 wc_send=77 release=77 get=77 capped=0 queue/release ratio=1
18:18:26 wc_send=0 release=0 get=0 capped=0
18:18:31 wc_send=0 release=0 get=0 capped=0
^C
* Re: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
@ 2026-05-06 6:01 ` Chuck Lever
2026-05-06 11:34 ` Mike Snitzer
0 siblings, 1 reply; 6+ messages in thread
From: Chuck Lever @ 2026-05-06 6:01 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, ben.coddington, jonathan.flynn
On Tue, May 5, 2026, at 11:55 PM, Mike Snitzer wrote:
> From: Benjamin Coddington <ben.coddington@hammerspace.com>
>
> Under sustained heavy load over RDMA, kNFSD servers can pin tens of
> gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
> released until the connection terminates. A customer site reported
> OOM kills under heavy NFS READ workloads with ~2.3M cached
> send_ctxts visible via slab tracing (two stacks in
> svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
> the same ctxt population double-counted across the sc_pages and
> sc_xprt_buf allocations). Aggregated across the customer's ~218
> long-lived xprts that worked out to roughly 80 GB pinned, freed only
> by knfsd restart.
>
> Root cause is an unbounded cache, not a per-op leak.
> svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
> on empty, allocates fresh. svc_rdma_send_ctxt_release() always
> llist_add()s the ctxt back -- regardless of how many ctxts are
> already on the list. The only kfree() site is
> svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
> shrinker, no cap, no aging: it can only grow.
>
> Two effects compound to drive the high-water mark well above the
> configured RPC slot count:
>
> 1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
> INIT_WORK(...) ; queue_work(svcrdma_wq, ...) and returns. The
> actual _release (which puts the ctxt back on the llist) runs
> later on svcrdma_wq. Between wc_send -> _put and _put_async ->
> _release, the ctxt is "in transit" -- off the list, off the SQ,
> not yet reusable.
>
> 2. During that gap, a concurrent _get sees an empty llist and calls
> _alloc to mint a fresh ctxt. When the in-transit one eventually
> lands on the llist, the cache has grown by one. Under HCA-driven
> completion rates with even small workqueue dispatch lag, this
> happens constantly. The cache settles not at the steady-state
> in-flight count but at the all-time peak of (in-flight +
> workqueue-pending), and never shrinks.
>
> Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
> xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
> svc_rdma_send_ctxt_destroy). Apply the cap in two places:
>
> - svc_rdma_send_ctxt_get(): when the llist is empty and depth has
> reached sc_max_requests, return NULL instead of allocating. The
> caller drops the connection; the client reconnects with a fresh
> xprt that starts at depth zero. This is the backpressure point
> that prevents in-test memory growth -- it stops new allocations
> regardless of where in the pipeline existing ctxts are stuck.
>
> - svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
> between concurrent _get callers, or transient burst), free the
> ctxt instead of returning it to the llist. This keeps depth
> convergent.
>
> The cap is sc_max_requests because:
> - It is the configured number of credit slots per xprt -- the client
> can have at most this many RPCs outstanding on the transport.
> - Each RPC reply uses one send_ctxt at a time; concurrent in-flight
> ctxts therefore cannot legitimately exceed sc_max_requests in
> steady state.
> - Workqueue lag can momentarily push (in-flight + queued) above
> sc_max_requests, but those ctxts are exactly what the cap should
> shed -- they are not steady-state working set, just lag-inflation.
>
> The reuse semantics of the cache are intentional and unchanged: ctxts
> keep their first SGE DMA-mapped across cycles, so the steady-state
> hot path stays alloc-free. Only the *excess* ctxts are freed.
>
> A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
> freed-by-cap ctxt, so operators can confirm the cap is doing real
> work on a given workload.
>
> == Verification on the test rig ==
>
> Diagnostic tool: svcrdma-wq-lag.bt (will be provided in reply to
> this patch) -- per-5s rates of wc_send (queue inflow), _put_async
> (workqueue dispatch), _get (demand), and the new tracepoint.
>
> Negative case (cap on on-llist depth alone, with the atomic
> incremented in _release and decremented in _get), sustained NFS
> READ load:
> wc_send ~432K/s, release ~342K/s
> -> ~90K/s of ctxts pinned as queued sc_work items
> -> ~2.7M pinned after 30s; matches the slab measurement
> -> svcrdma_send_ctxt_capped fires 0 times during the test, then
> floods (~3.25M events) on test stop as the workqueue catches up
>
> The cap is structurally blind to ctxts pinned in workqueue items
> because depth only counts what's currently on the llist; during
> sustained load almost nothing makes it onto the llist before the
> next _get takes it back off. Inflation accumulates as queued
> sc_work items, invisible to the cap, until load stops.
>
> Post-patch (depth tracked at alloc/destroy + _get backpressure),
> same workload, 5 minutes:
> wc_send and release rates match within 1% (~410K/s each)
> No accumulation; no flood at test stop
> svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
> recovery)
> Throughput slightly higher than the negative case (cache no longer
> bloats the slab/page allocator into reclaim)
>
> The persistent wc_send/release gap in the negative case was itself
> a consequence of the unbounded growth: cache bloat -> slab pressure
> -> reclaim activity -> workqueue starvation -> larger gap. Once
> the cap breaks that spiral, the workqueue runs at full capacity and
> the rates equalize.
>
> Operators can confirm the cap is doing real work via:
> cd /sys/kernel/tracing
> echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
> cat trace_pipe
>
> If a workload genuinely needs more than sc_max_requests concurrent
> in-flight Sends, raise it via the sunrpc.svcrdma_max_requests sysctl
> rather than removing the cap.
The current svcrdma design assumes that the workqueue always keeps up
with the transport's Reply transmission rate. For the send_ctxt cache
to reach 2.3M ctxts, the workqueue must be persistently falling behind
wc_send, not just transiently hiccuping.
The "in transit" gap between wc_send -> _put and _put_async ->
_release is structurally one workqueue dispatch; under normal
scheduling, microseconds. The verification numbers show a sustained
90K/s deficit (432K wc_send/s vs 342K release/s) that accumulates
linearly. That deficit is the actual pathology here...
So I'm wondering: why does svcrdma_wq lag at steady state? _put_async
is small work -- an ib_dma_unmap_page per SGE plus an llist_add. On a
busy box it should easily outpace wc_send, not trail it by ~21%.
The patch description posits a spiral: cache bloat -> slab pressure ->
reclaim -> workqueue starvation -> more bloat. That makes the eventual
collapse plausible, but it does not establish what initiates the
spiral. If the workqueue keeps up at low cache size and only loses
ground after slab/reclaim kicks in, the early-load gap should be
small and growing. If the gap is fixed at 21% from second zero,
something structural is throttling the workqueue independent of slab
state.
Under sustained workqueue lag the cap fires not because the working
set exceeds sc_max_requests but because the workqueue backlog does.
Capping at sc_max_requests then translates "workqueue is slow" into
"drop the connection." That bounds memory but substitutes one failure
mode for another, and it punishes the client for what is essentially
a server-side scheduling problem. Not to mention the risks inherent
in repeated spurious connection loss.
I have a couple of patches that replace the use of svcrdma_wq, and
could alleviate the spiral issue. Would you be interested in trying
them with your reproducer?
--
Chuck Lever
* Re: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
2026-05-06 6:01 ` Chuck Lever
@ 2026-05-06 11:34 ` Mike Snitzer
0 siblings, 0 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-06 11:34 UTC (permalink / raw)
To: Chuck Lever; +Cc: Mike Snitzer, linux-nfs, ben.coddington, jonathan.flynn
On Wed, May 06, 2026 at 08:01:50AM +0200, Chuck Lever wrote:
>
>
> On Tue, May 5, 2026, at 11:55 PM, Mike Snitzer wrote:
> > From: Benjamin Coddington <ben.coddington@hammerspace.com>
> >
> > Under sustained heavy load over RDMA, kNFSD servers can pin tens of
> > gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
> > released until the connection terminates. A customer site reported
> > OOM kills under heavy NFS READ workloads with ~2.3M cached
> > send_ctxts visible via slab tracing (two stacks in
> > svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
> > the same ctxt population double-counted across the sc_pages and
> > sc_xprt_buf allocations). Aggregated across the customer's ~218
> > long-lived xprts that worked out to roughly 80 GB pinned, freed only
> > by knfsd restart.
> >
> > Root cause is an unbounded cache, not a per-op leak.
> > svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
> > on empty, allocates fresh. svc_rdma_send_ctxt_release() always
> > llist_add()s the ctxt back -- regardless of how many ctxts are
> > already on the list. The only kfree() site is
> > svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
> > shrinker, no cap, no aging: it can only grow.
> >
> > Two effects compound to drive the high-water mark well above the
> > configured RPC slot count:
> >
> > 1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
> > INIT_WORK(...) ; queue_work(svcrdma_wq, ...) and returns. The
> > actual _release (which puts the ctxt back on the llist) runs
> > later on svcrdma_wq. Between wc_send -> _put and _put_async ->
> > _release, the ctxt is "in transit" -- off the list, off the SQ,
> > not yet reusable.
> >
> > 2. During that gap, a concurrent _get sees an empty llist and calls
> > _alloc to mint a fresh ctxt. When the in-transit one eventually
> > lands on the llist, the cache has grown by one. Under HCA-driven
> > completion rates with even small workqueue dispatch lag, this
> > happens constantly. The cache settles not at the steady-state
> > in-flight count but at the all-time peak of (in-flight +
> > workqueue-pending), and never shrinks.
> >
> > Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
> > xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
> > svc_rdma_send_ctxt_destroy). Apply the cap in two places:
> >
> > - svc_rdma_send_ctxt_get(): when the llist is empty and depth has
> > reached sc_max_requests, return NULL instead of allocating. The
> > caller drops the connection; the client reconnects with a fresh
> > xprt that starts at depth zero. This is the backpressure point
> > that prevents in-test memory growth -- it stops new allocations
> > regardless of where in the pipeline existing ctxts are stuck.
> >
> > - svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
> > between concurrent _get callers, or transient burst), free the
> > ctxt instead of returning it to the llist. This keeps depth
> > convergent.
> >
> > The cap is sc_max_requests because:
> > - It is the configured number of credit slots per xprt -- the client
> > can have at most this many RPCs outstanding on the transport.
> > - Each RPC reply uses one send_ctxt at a time; concurrent in-flight
> > ctxts therefore cannot legitimately exceed sc_max_requests in
> > steady state.
> > - Workqueue lag can momentarily push (in-flight + queued) above
> > sc_max_requests, but those ctxts are exactly what the cap should
> > shed -- they are not steady-state working set, just lag-inflation.
> >
> > The reuse semantics of the cache are intentional and unchanged: ctxts
> > keep their first SGE DMA-mapped across cycles, so the steady-state
> > hot path stays alloc-free. Only the *excess* ctxts are freed.
> >
> > A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
> > freed-by-cap ctxt, so operators can confirm the cap is doing real
> > work on a given workload.
> >
> > == Verification on the test rig ==
> >
> > Diagnostic tool: svcrdma-wq-lag.bt (will be provided in reply to
> > this patch) -- per-5s rates of wc_send (queue inflow), _put_async
> > (workqueue dispatch), _get (demand), and the new tracepoint.
> >
> > Negative case (cap on on-llist depth alone, with the atomic
> > incremented in _release and decremented in _get), sustained NFS
> > READ load:
> > wc_send ~432K/s, release ~342K/s
> > -> ~90K/s of ctxts pinned as queued sc_work items
> > -> ~2.7M pinned after 30s; matches the slab measurement
> > -> svcrdma_send_ctxt_capped fires 0 times during the test, then
> > floods (~3.25M events) on test stop as the workqueue catches up
> >
> > The cap is structurally blind to ctxts pinned in workqueue items
> > because depth only counts what's currently on the llist; during
> > sustained load almost nothing makes it onto the llist before the
> > next _get takes it back off. Inflation accumulates as queued
> > sc_work items, invisible to the cap, until load stops.
> >
> > Post-patch (depth tracked at alloc/destroy + _get backpressure),
> > same workload, 5 minutes:
> > wc_send and release rates match within 1% (~410K/s each)
> > No accumulation; no flood at test stop
> > svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
> > recovery)
> > Throughput slightly higher than the negative case (cache no longer
> > bloats the slab/page allocator into reclaim)
> >
> > The persistent wc_send/release gap in the negative case was itself
> > a consequence of the unbounded growth: cache bloat -> slab pressure
> > -> reclaim activity -> workqueue starvation -> larger gap. Once
> > the cap breaks that spiral, the workqueue runs at full capacity and
> > the rates equalize.
> >
> > Operators can confirm the cap is doing real work via:
> > cd /sys/kernel/tracing
> > echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
> > cat trace_pipe
> >
> > If a workload genuinely needs more than sc_max_requests concurrent
> > in-flight Sends, raise it via the sunrpc.svcrdma_max_requests sysctl
> > rather than removing the cap.
>
> The current svcrdma design assumes that the workqueue always keeps up
> with the transport's Reply transmission rate. For the send_ctxt cache
> to reach 2.3M ctxts, the workqueue itself must be running orders of
> magnitude slower than wc_send.
>
> The "in transit" gap between wc_send -> _put and _put_async ->
> _release is structurally one workqueue dispatch; under normal
> scheduling, microseconds. The verification numbers show a sustained
> 90K/s deficit (432K wc_send/s vs 342K release/s) that accumulates
> linearly. That deficit is the actual pathology here...
>
> So I'm wondering: Why does svcrdma_wq lag at steady state? _put_async
> is small work -- ib_dma_unmap_page_list plus an llist_add. On a busy
> box it should easily outpace wc_send, not trail it by ~21%.
>
> The cover letter posits a spiral: cache bloat -> slab pressure ->
> reclaim -> workqueue starvation -> more bloat. That makes the eventual
> collapse plausible, but it does not establish what initiates the
> spiral. If the workqueue keeps up at low cache size and only loses
> ground after slab/reclaim kicks in, the early-load gap should be
> small and growing. If the gap is fixed at 21% from second zero,
> something structural is throttling the workqueue independent of slab
> state.
>
> Under sustained workqueue lag the cap fires not because the working
> set exceeds sc_max_requests but because the workqueue backlog does.
> Capping at sc_max_requests then translates "workqueue is slow" into
> "drop the connection." That bounds memory but substitutes one failure
> mode for another, and it punishes the client for what is essentially
> a server-side scheduling problem. Not to mention the risks inherent
> in repeated spurious connection loss.
Thanks for all that context. Yes, the imposed connection loss isn't
ideal when the cap hits, but in practice it's relatively rare even with
the more extreme small (16K) workload. With larger I/O sizes the
workqueue is able to keep up.
> I have a couple of patches that replace the use of svcrdma_wq, and
> could alleviate the spiral issue. Would you be interested in trying
> them with your reproducer?
Yes, please share and we'll do our best to get time on the system.
We have some constraints at the moment that confine us to only making
module changes (otherwise, if core kernel changes were needed, it'd
trigger a much more involved general kernel update that'd require
coordination with the admins due to the netboot infra).
Thanks,
Mike