* [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache
@ 2026-05-05 21:55 Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 21:55 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, ben.coddington, jonathan.flynn
Hi,
I drew the short straw and took a hand-off from Ben on work he
started with Claude yesterday, in response to a really crazy OOM
situation that hits like a freight train at one large customer's
install: currently 121 NFS clients and 9 NFS servers, all connected
with RDMA networking. Working with Jon Flynn, we later scaled the
testing down to 15 clients reading from 1 server using 16K O_DIRECT
reads to bound the problem a bit more.
So I imported Ben's CLAUDE.md that he handed off and carried on.
With patch 1/2 we're able to avoid OOM-killing the NFS servers (each
with 128GB) -- without it, the 16K test workload would grow memory use
from ~12GB to exhaustion (128GB) within ~10 seconds of starting the
test.
The 2nd patch in this series provides the diagnostic svcrdma-wq-lag.bt
bpftrace script that Claude suggested -- I just dropped it in
Documentation/filesystems/nfs/, but it isn't intended to go upstream.
Chuck,
Patch 1/2 is marked RFC because ultimately we suspect you'll have a
better way to skin this cat... but Claude was pretty great at helping
us cut through this nasty OOM situation with RDMA.
Please feel free to ask follow-up questions and we'll fill in any
details as best we can.
Thanks,
Mike
Benjamin Coddington (1):
svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
Mike Snitzer (1):
for diagnostic use only: add svcrdma_wq lag diagnostic
.../filesystems/nfs/svcrdma-wq-lag.bt | 146 ++++++++++++++++++
include/linux/sunrpc/svc_rdma.h | 1 +
include/trace/events/rpcrdma.h | 2 +
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 41 ++++-
net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
5 files changed, 185 insertions(+), 6 deletions(-)
create mode 100755 Documentation/filesystems/nfs/svcrdma-wq-lag.bt
--
2.44.0
* [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
@ 2026-05-05 21:55 ` Mike Snitzer
2026-05-06 6:01 ` Chuck Lever
2026-05-05 21:55 ` [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic Mike Snitzer
2026-05-05 22:05 ` [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2 siblings, 1 reply; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 21:55 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, ben.coddington, jonathan.flynn
From: Benjamin Coddington <ben.coddington@hammerspace.com>
Under sustained heavy load over RDMA, kNFSD servers can pin tens of
gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
released until the connection terminates. A customer site reported
OOM kills under heavy NFS READ workloads with ~2.3M cached
send_ctxts visible via slab tracing (two stacks in
svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
the same ctxt population double-counted across the sc_pages and
sc_xprt_buf allocations). Aggregated across the customer's ~218
long-lived xprts, that worked out to roughly 80 GB pinned, freed only
by a knfsd restart.
Root cause is an unbounded cache, not a per-op leak.
svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
on empty, allocates fresh. svc_rdma_send_ctxt_release() always
llist_add()s the ctxt back -- regardless of how many ctxts are
already on the list. The only kfree() site is
svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
shrinker, no cap, no aging: it can only grow.
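For reference, the pre-patch shape of that cache, paraphrased and
simplified (locking, WR setup, and error handling omitted; the real
code lives in net/sunrpc/xprtrdma/svc_rdma_sendto.c):

    /* Simplified paraphrase of the pre-patch behaviour -- not the
     * patch itself.
     */
    struct svc_rdma_send_ctxt *
    svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
    {
            struct llist_node *node;

            node = llist_del_first(&rdma->sc_send_ctxts);
            if (!node)
                    /* Cache empty: mint a fresh ctxt, with no bound on
                     * how many have already been minted for this xprt.
                     */
                    return svc_rdma_send_ctxt_alloc(rdma);
            return llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
    }

    static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
                                           struct svc_rdma_send_ctxt *ctxt)
    {
            struct ib_device *device = rdma->sc_cm_id->device;
            unsigned int i;

            /* sge[0] (the transport header) stays DMA-mapped for the
             * life of the ctxt; only payload SGEs are unmapped here.
             */
            for (i = 1; i < ctxt->sc_send_wr.num_sge; i++)
                    ib_dma_unmap_page(device, ctxt->sc_sges[i].addr,
                                      ctxt->sc_sges[i].length,
                                      DMA_TO_DEVICE);

            /* Always recycled, never freed, no matter how many ctxts
             * the llist already holds.
             */
            llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
    }
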
Two effects compound to drive the high-water mark well above the
configured RPC slot count:
1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
INIT_WORK(...); queue_work(svcrdma_wq, ...) and returns (see the
sketch after this list). The actual _release (which puts the ctxt
back on the llist) runs later on svcrdma_wq. Between wc_send ->
_put and _put_async -> _release, the ctxt is "in transit" -- off
the list, off the SQ, not yet reusable.
2. During that gap, a concurrent _get sees an empty llist and calls
_alloc to mint a fresh ctxt. When the in-transit one eventually
lands on the llist, the cache has grown by one. Under HCA-driven
completion rates with even small workqueue dispatch lag, this
happens constantly. The cache settles not at the steady-state
in-flight count but at the all-time peak of (in-flight +
workqueue-pending), and never shrinks.
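A condensed sketch of that deferred-release path, again paraphrasing
the existing code (field names approximate):

    /* Paraphrase of the existing upstream path -- not part of this
     * patch.
     */
    static void svc_rdma_send_ctxt_put_async(struct work_struct *work)
    {
            struct svc_rdma_send_ctxt *ctxt =
                    container_of(work, struct svc_rdma_send_ctxt, sc_work);

            /* Only here, one workqueue dispatch later, does the ctxt
             * go back onto the llist and become reusable.
             */
            svc_rdma_send_ctxt_release(ctxt->sc_rdma, ctxt);
    }

    void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
                                struct svc_rdma_send_ctxt *ctxt)
    {
            /* Runs in Send completion context and returns in
             * microseconds; the real teardown work is deferred to
             * svcrdma_wq.
             */
            INIT_WORK(&ctxt->sc_work, svc_rdma_send_ctxt_put_async);
            queue_work(svcrdma_wq, &ctxt->sc_work);
    }

Every ctxt sitting between that queue_work() and the worker above is
the "in transit" population described in (1) and (2).
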
Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
svc_rdma_send_ctxt_destroy). Apply the cap in two places:
- svc_rdma_send_ctxt_get(): when the llist is empty and depth has
reached sc_max_requests, return NULL instead of allocating. The
caller drops the connection (its existing error path is sketched
after this list); the client reconnects with a fresh xprt that
starts at depth zero. This is the backpressure point that prevents
in-test memory growth -- it stops new allocations regardless of
where in the pipeline existing ctxts are stuck.
- svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
between concurrent _get callers, or transient burst), free the
ctxt instead of returning it to the llist. This keeps depth
convergent.
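For context, the svc_rdma_sendto() caller already treats a NULL
send_ctxt as grounds to drop the connection; roughly (paraphrased
from the existing error path, details vary by kernel version):

    /* Paraphrase of the existing svc_rdma_sendto() error handling --
     * not part of this patch.
     */
            sctxt = svc_rdma_send_ctxt_get(rdma);
            if (!sctxt)
                    goto drop_connection;
            ...
    drop_connection:
            svc_xprt_deferred_close(&rdma->sc_xprt);
            return -ENOTCONN;
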
The cap is sc_max_requests because:
- It is the configured number of credit slots per xprt -- the client
can have at most this many RPCs outstanding on the transport.
- Each RPC reply uses one send_ctxt at a time; concurrent in-flight
ctxts therefore cannot legitimately exceed sc_max_requests in
steady state.
- Workqueue lag can momentarily push (in-flight + queued) above
sc_max_requests, but those ctxts are exactly what the cap should
shed -- they are not steady-state working set, just lag-inflation.
The reuse semantics of the cache are intentional and unchanged: ctxts
keep their first SGE DMA-mapped across cycles, so the steady-state
hot path stays alloc-free. Only the *excess* ctxts are freed.
A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
freed-by-cap ctxt, so operators can confirm the cap is doing real
work on a given workload.
== Verification on the test rig ==
Diagnostic tool: svcrdma-wq-lag.bt (provided as patch 2/2 of this
series) -- per-5s rates of wc_send (queue inflow), _put_async
(workqueue dispatch), _get (demand), and the new tracepoint.
Negative case (cap on on-llist depth alone, with the atomic
incremented in _release and decremented in _get), sustained NFS
READ load:
wc_send ~432K/s, release ~342K/s
-> ~90K/s of ctxts pinned as queued sc_work items
-> ~2.7M pinned after 30s; matches the slab measurement
-> svcrdma_send_ctxt_capped fires 0 times during the test, then
floods (~3.25M events) on test stop as the workqueue catches up
The cap is structurally blind to ctxts pinned in workqueue items
because depth only counts what's currently on the llist; during
sustained load almost nothing makes it onto the llist before the
next _get takes it back off. Inflation accumulates as queued
sc_work items, invisible to the cap, until load stops.
Post-patch (depth tracked at alloc/destroy + _get backpressure),
same workload, 5 minutes:
wc_send and release rates match within 1% (~410K/s each)
No accumulation; no flood at test stop
svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
recovery)
Throughput slightly higher than the negative case (cache no longer
bloats the slab/page allocator into reclaim)
The persistent wc_send/release gap in the negative case was itself
a consequence of the unbounded growth: cache bloat -> slab pressure
-> reclaim activity -> workqueue starvation -> larger gap. Once
the cap breaks that spiral, the workqueue runs at full capacity and
the rates equalize.
Operators can confirm the cap is doing real work via:
cd /sys/kernel/tracing
echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
cat trace_pipe
If a workload genuinely needs more than sc_max_requests concurrent
in-flight Sends, raise it via the sunrpc.svc_rdma.max_requests sysctl
rather than removing the cap.
== Diagnostics caveats ==
A previous diagnostic pass on this code path was misled by GCC
inlining of svc_rdma_send_ctxt_put into all in-tree callers (same
TU): the symbol stays in kallsyms (lowercase t) but no caller jumps
there, so kprobes attach without warning yet fire zero times. This
made an inventory script's "@inflight = gets - puts" appear to be
monotonically rising, falsely confirming a per-op lifecycle leak.
Hooking svc_rdma_send_ctxt_put_async (a workqueue function-pointer
target, forced out-of-line by INIT_WORK) instead of _put gets
accurate accounting and shows gets/puts balanced within ~1% under
load. Future probes in this path should prefer function-pointer
targets, tracepoints, or kmem tracepoints over kprobes on small
non-static functions in the same TU as their callers.
Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Diagnosed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Benjamin Coddington <ben.coddington@hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
include/linux/sunrpc/svc_rdma.h | 1 +
include/trace/events/rpcrdma.h | 2 ++
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 41 ++++++++++++++++++++----
net/sunrpc/xprtrdma/svc_rdma_transport.c | 1 +
4 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad570..b8ae1032bf293 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -97,6 +97,7 @@ struct svcxprt_rdma {
spinlock_t sc_send_lock;
struct llist_head sc_send_ctxts;
+ atomic_t sc_send_ctxts_depth;
spinlock_t sc_rw_ctxt_lock;
struct llist_head sc_rw_ctxts;
diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index b79913048e1a0..945152e33af8c 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -2027,6 +2027,8 @@ DEFINE_SIMPLE_CID_EVENT(svcrdma_wc_send);
DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_flush);
DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_err);
+DEFINE_SIMPLE_CID_EVENT(svcrdma_send_ctxt_capped);
+
DEFINE_SIMPLE_CID_EVENT(svcrdma_post_recv);
DEFINE_RECEIVE_SUCCESS_EVENT(svcrdma_wc_recv);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 8b3f0c8c14b25..e487d2815b33e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -158,6 +158,7 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
for (i = 0; i < rdma->sc_max_send_sges; i++)
ctxt->sc_sges[i].lkey = rdma->sc_pd->local_dma_lkey;
+ atomic_inc(&rdma->sc_send_ctxts_depth);
return ctxt;
fail3:
@@ -170,6 +171,20 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
return NULL;
}
+/* Tear down a single send_ctxt: reverse of svc_rdma_send_ctxt_alloc. */
+static void svc_rdma_send_ctxt_destroy(struct svcxprt_rdma *rdma,
+ struct svc_rdma_send_ctxt *ctxt)
+{
+ struct ib_device *device = rdma->sc_cm_id->device;
+
+ ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
+ rdma->sc_max_req_size, DMA_TO_DEVICE);
+ kfree(ctxt->sc_xprt_buf);
+ kfree(ctxt->sc_pages);
+ kfree(ctxt);
+ atomic_dec(&rdma->sc_send_ctxts_depth);
+}
+
/**
* svc_rdma_send_ctxts_destroy - Release all send_ctxt's for an xprt
* @rdma: svcxprt_rdma being torn down
@@ -177,17 +192,12 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
*/
void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
{
- struct ib_device *device = rdma->sc_cm_id->device;
struct svc_rdma_send_ctxt *ctxt;
struct llist_node *node;
while ((node = llist_del_first(&rdma->sc_send_ctxts)) != NULL) {
ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
- ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
- rdma->sc_max_req_size, DMA_TO_DEVICE);
- kfree(ctxt->sc_xprt_buf);
- kfree(ctxt->sc_pages);
- kfree(ctxt);
+ svc_rdma_send_ctxt_destroy(rdma, ctxt);
}
}
@@ -226,6 +236,14 @@ struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
return ctxt;
out_empty:
+ /* Backpressure: refuse to mint a new ctxt once the per-xprt total
+ * (in-flight + queued for release + on-llist) has reached the
+ * configured slot count. The caller drops the connection; the
+ * client reconnects with a fresh xprt. Better than the unbounded
+ * allocation that lets workqueue lag inflate the cache to OOM.
+ */
+ if (atomic_read(&rdma->sc_send_ctxts_depth) >= rdma->sc_max_requests)
+ return NULL;
ctxt = svc_rdma_send_ctxt_alloc(rdma);
if (!ctxt)
return NULL;
@@ -257,6 +275,17 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
DMA_TO_DEVICE);
}
+ /* Depth is now tracked at alloc/destroy, so it reflects total
+ * live ctxts (in-flight + queued + on-llist), not just on-llist.
+ * If we've blown past the cap -- via a race in the _get
+ * backpressure check, or a transient burst -- destroy this ctxt
+ * instead of returning it to the llist so the depth converges.
+ */
+ if (atomic_read(&rdma->sc_send_ctxts_depth) > rdma->sc_max_requests) {
+ trace_svcrdma_send_ctxt_capped(&ctxt->sc_cid);
+ svc_rdma_send_ctxt_destroy(rdma, ctxt);
+ return;
+ }
llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4ff..7708634ebf587 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -176,6 +176,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
init_llist_head(&cma_xprt->sc_send_ctxts);
+ atomic_set(&cma_xprt->sc_send_ctxts_depth, 0);
init_llist_head(&cma_xprt->sc_recv_ctxts);
init_llist_head(&cma_xprt->sc_rw_ctxts);
init_waitqueue_head(&cma_xprt->sc_send_wait);
--
2.44.0
* [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
@ 2026-05-05 21:55 ` Mike Snitzer
2026-05-05 22:05 ` [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2 siblings, 0 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 21:55 UTC (permalink / raw)
To: Chuck Lever; +Cc: linux-nfs, ben.coddington, jonathan.flynn
From: Mike Snitzer <snitzer@hammerspace.com>
This script reports per-5s rates of svc_rdma_wc_send (a proxy for
queue_work into svcrdma_wq, since each Send completion queues one
put), svc_rdma_send_ctxt_put_async (the actual workqueue dispatch),
svc_rdma_send_ctxt_get (demand on the cache), and the
svcrdma_send_ctxt_capped tracepoint.
It is used to diagnose whether svcrdma_wq dispatch is keeping up with
inflow under sustained load -- a sustained gap between the wc_send and
release rates means ctxts are pinned as queued sc_work items, which
the on-llist depth counter cannot see.
Uses only reliably-traceable hooks: wc_send and _put_async are
function-pointer call sites that cannot be inlined; _get is forced
out-of-line by external callers in recvfrom and backchannel TUs (see
CLAUDE.md "Probe-inlining gotcha").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
.../filesystems/nfs/svcrdma-wq-lag.bt | 146 ++++++++++++++++++
1 file changed, 146 insertions(+)
create mode 100755 Documentation/filesystems/nfs/svcrdma-wq-lag.bt
diff --git a/Documentation/filesystems/nfs/svcrdma-wq-lag.bt b/Documentation/filesystems/nfs/svcrdma-wq-lag.bt
new file mode 100755
index 0000000000000..16e374bcb7aa8
--- /dev/null
+++ b/Documentation/filesystems/nfs/svcrdma-wq-lag.bt
@@ -0,0 +1,146 @@
+#!/usr/bin/env bpftrace
+/*
+ * svcrdma-wq-lag.bt
+ *
+ * Test the "svcrdma_wq is the bottleneck" theory.
+ *
+ * Background
+ * ----------
+ * An earlier commit from Ben Coddington ("svcrdma: cap per-xprt
+ * sc_send_ctxts free list at sc_max_requests") added an atomic depth
+ * counter on the sc_send_ctxts llist and a cap check inside
+ * svc_rdma_send_ctxt_release(). The svcrdma_send_ctxt_capped
+ * tracepoint was supposed to fire whenever the cache was about to
+ * exceed sc_max_requests.
+ *
+ * Customer test on hot workload: tracepoint fires zero times during
+ * the test, then floods the moment the test is Ctrl-C'd. Memory
+ * keeps growing during the test as before.
+ *
+ * Hypothesis
+ * ----------
+ * svcrdma_wq is alloc_workqueue("svcrdma", WQ_UNBOUND, 0). WQ_UNBOUND
+ * has bounded concurrency but UNBOUNDED queue depth. Under load,
+ * svc_rdma_wc_send (Send completion ISR) calls svc_rdma_send_ctxt_put,
+ * which does INIT_WORK + queue_work and returns in microseconds. The
+ * worker, svc_rdma_send_ctxt_put_async -> svc_rdma_send_ctxt_release,
+ * has to do ib_dma_unmap_page per SGE plus release_pages plus
+ * svc_rdma_reply_chunk_release -- a much slower operation.
+ *
+ * If queue rate >> worker rate, work items pile up. Each pinned
+ * sc_work is embedded in its ctxt, so ctxts can't be released. The
+ * llist stays empty (depth ~ 0), and the cap at _release never
+ * fires. _get keeps seeing an empty llist and calling _alloc for new
+ * ctxts. Memory grows.
+ *
+ * After Ctrl-C: completions stop, workqueue drains, all queued
+ * releases run rapid-fire. NOW the llist depth shoots past
+ * sc_max_requests and the cap fires for each subsequent release --
+ * the post-test flood.
+ *
+ * What this script measures
+ * -------------------------
+ * Per 5-second window, four rates:
+ *
+ * @wc_send_rate svc_rdma_wc_send invocations
+ * -- proxy for svcrdma_wq.queue_work() rate
+ * (each completion queues one put)
+ *
+ * @release_rate svc_rdma_send_ctxt_put_async invocations
+ * -- actual workqueue dispatch rate
+ *
+ * @get_rate svc_rdma_send_ctxt_get invocations
+ * -- demand pressure on the cache
+ *
+ * @capped_rate svcrdma_send_ctxt_capped tracepoint count
+ * -- where Ben's cap actually fires
+ *
+ * Reading the result
+ * ------------------
+ * If @wc_send_rate >> @release_rate during the test, the workqueue
+ * is backed up -- queue items are accumulating, which means ctxts
+ * are pinned in workqueue items. Hypothesis confirmed.
+ *
+ * If those two rates are comparable, the workqueue is keeping up
+ * and the bottleneck is somewhere else.
+ *
+ * @capped_rate ~ 0 during the test, then @release_rate >>
+ * @wc_send_rate after Ctrl-C (workqueue draining), then
+ * @capped_rate spikes -- exactly the user's observed pattern.
+ *
+ * @get_rate gives the workload's natural cadence; if @get_rate >>
+ * @release_rate sustained, allocations are accumulating regardless
+ * of where in the pipeline they're stuck.
+ *
+ * Probes used
+ * -----------
+ * kprobe:svc_rdma_wc_send function pointer call;
+ * reliably traceable
+ * kprobe:svc_rdma_send_ctxt_put_async workqueue function
+ * pointer; reliably
+ * traceable
+ * kprobe:svc_rdma_send_ctxt_get external callers force
+ * it out-of-line;
+ * reliably traceable
+ * tracepoint:rpcrdma:svcrdma_send_ctxt_capped
+ * from Ben's patch
+ *
+ * If the tracepoint fails to attach (rpcrdma module's tracepoint
+ * table not visible to bpftrace at attach time), comment out that
+ * probe -- the rest of the script still answers the main question.
+ *
+ * Usage
+ * -----
+ * sudo bpftrace svcrdma-wq-lag.bt
+ * (Ctrl-C to exit; END block prints a final snapshot.)
+ *
+ * Run for 60-120s under the load that previously OOM'd. Watch the
+ * 5-second windows; the gap between @wc_send_rate and @release_rate
+ * is the smoking gun.
+ */
+
+config = {
+ max_map_keys = 4096;
+}
+
+BEGIN {
+ printf("svcrdma_wq lag diagnostic. Per-5s rates.\n");
+ printf("Watch for @wc_send_rate >> @release_rate during the test\n");
+ printf("(workqueue backed up -> ctxts pinned in queue items ->\n");
+ printf(" Ben's cap can't see them).\n");
+ printf("Ctrl-C to exit.\n\n");
+}
+
+kprobe:svc_rdma_wc_send {
+ @wc_send_window += 1;
+}
+
+kprobe:svc_rdma_send_ctxt_put_async {
+ @release_window += 1;
+}
+
+kprobe:svc_rdma_send_ctxt_get {
+ @get_window += 1;
+}
+
+tracepoint:rpcrdma:svcrdma_send_ctxt_capped {
+ @capped_window += 1;
+}
+
+interval:s:5 {
+ time("\n%H:%M:%S ");
+ printf("wc_send=%-8lld release=%-8lld get=%-8lld capped=%-8lld",
+ @wc_send_window, @release_window,
+ @get_window, @capped_window);
+ if (@release_window > 0) {
+ printf(" queue/release ratio=%lld",
+ @wc_send_window / @release_window);
+ }
+ printf("\n");
+ clear(@wc_send_window);
+ clear(@release_window);
+ clear(@get_window);
+ clear(@capped_window);
+}
+
+END { }
--
2.44.0
* Re: [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic Mike Snitzer
@ 2026-05-05 22:05 ` Mike Snitzer
2 siblings, 0 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 22:05 UTC (permalink / raw)
To: Chuck Lever, linux-nfs; +Cc: ben.coddington, jonathan.flynn
On Tue, May 05, 2026 at 05:55:33PM -0400, Mike Snitzer wrote:
> Hi,
>
> I drew the short-straw by having to take a hand-off from Ben on work
> he started with Claude yesterday in response to a really crazy OOM
> situation that hits like a freight train at one large customer's
> install that currently has 121 NFS clients and 9 NFS servers, all
> connected with RDMA networking. Working with Jon Flynn, to bound the
> problem a bit more we later scaled the testing down to 15 clients
> reading from 1 server using 16K O_DIRECT reads.
>
> So I imported Ben's CLAUDE.md that he handed off and carried on, with
> patch 1/2 we're able to avoid OOM killing the NFS servers (each with
> 128GB) -- with the 16K test workload memory use would grow from ~12GB
> to exhaustion (128GB) within ~10 seconds of starting the test.
>
> The 2nd patch in this series provides a diagnostic svcrdma-wq-lag.bt
> bpf script that Claude suggested -- I just dropped it in
> Documentation/filesystems/nfs/ but it isn't intended to go upstream.
Here is svcrdma-wq-lag.bt output before we capped in _get (short run
because otherwise we'd have hit OOM, but shows the flood of
sc_send_ctxts that only become visible once the test was killed at
17:45:36):
# ./svcrdma-wq-lag.bt
Attaching 7 probes...
svcrdma_wq lag diagnostic. Per-5s rates.
Watch for @wc_send_rate >> @release_rate during the test
(workqueue backed up -> ctxts pinned in queue items ->
Ben's cap can't see them).
Ctrl-C to exit.
17:45:06 wc_send=0 release=0 get=0 capped=0
17:45:11 wc_send=1984617 release=1571458 get=1916504 capped=0 queue/release ratio=1
17:45:16 wc_send=2176680 release=1698697 get=2097454 capped=0 queue/release ratio=1
17:45:21 wc_send=2181507 release=1716810 get=2105847 capped=0 queue/release ratio=1
17:45:26 wc_send=2178462 release=1728112 get=2104069 capped=726 queue/release ratio=1
17:45:31 wc_send=2158883 release=1714519 get=2092136 capped=205 queue/release ratio=1
17:45:36 wc_send=1029461 release=1585039 get=996644 capped=757304 queue/release ratio=0
17:45:41 wc_send=0 release=3092605 get=0 capped=1492238 queue/release ratio=0
17:45:46 wc_send=0 release=1000663 get=0 capped=1000764 queue/release ratio=0
17:45:51 wc_send=0 release=0 get=0 capped=0
^C
And here is svcrdma-wq-lag.bt output with patch 1/2 applied (full 5
minute run):
18:13:21 wc_send=1151440 release=1148610 get=1150980 capped=0 queue/release ratio=1
18:13:26 wc_send=2125499 release=2149303 get=2107227 capped=0 queue/release ratio=0
18:13:31 wc_send=1780821 release=1792233 get=1759546 capped=0 queue/release ratio=0
18:13:36 wc_send=3893061 release=2118765 get=2081403 capped=1 queue/release ratio=1
18:13:41 wc_send=1837599 release=1883821 get=1815453 capped=0 queue/release ratio=0
18:13:46 wc_send=1813727 release=1822876 get=1794087 capped=0 queue/release ratio=0
18:13:51 wc_send=1491464 release=1529370 get=1476924 capped=0 queue/release ratio=0
18:13:56 wc_send=1930254 release=1955278 get=1910046 capped=0 queue/release ratio=0
18:14:01 wc_send=1916374 release=1944863 get=1894205 capped=0 queue/release ratio=0
18:14:11 wc_send=4089684 release=4107873 get=4041449 capped=0 queue/release ratio=0
18:14:16 wc_send=1740066 release=1789864 get=5765163 capped=0 queue/release ratio=0
18:14:21 wc_send=3655551 release=1936245 get=1894173 capped=0 queue/release ratio=1
18:14:26 wc_send=1652813 release=3613434 get=1637738 capped=0 queue/release ratio=0
18:14:31 wc_send=1873398 release=1903780 get=1855887 capped=0 queue/release ratio=0
18:14:36 wc_send=1947061 release=1971983 get=1920987 capped=0 queue/release ratio=0
18:14:41 wc_send=1624716 release=1630181 get=1608404 capped=1 queue/release ratio=0
18:14:46 wc_send=1709804 release=1752711 get=1693828 capped=0 queue/release ratio=0
18:14:51 wc_send=1939672 release=1970791 get=3615177 capped=0 queue/release ratio=0
18:14:56 wc_send=1884914 release=3875602 get=1862209 capped=0 queue/release ratio=0
18:15:01 wc_send=2019244 release=2044366 get=1993761 capped=0 queue/release ratio=0
18:15:06 wc_send=2065910 release=2099472 get=2046106 capped=0 queue/release ratio=0
18:15:11 wc_send=1914899 release=1933756 get=1893723 capped=2 queue/release ratio=0
18:15:16 wc_send=1895244 release=1916124 get=1874634 capped=0 queue/release ratio=0
18:15:21 wc_send=2000953 release=2036536 get=3853528 capped=2 queue/release ratio=0
18:15:26 wc_send=3971332 release=1984930 get=5805016 capped=0 queue/release ratio=2
18:15:31 wc_send=1842215 release=1876278 get=1827305 capped=0 queue/release ratio=0
18:15:36 wc_send=1712031 release=1745388 get=1698252 capped=0 queue/release ratio=0
18:15:41 wc_send=1457677 release=1477376 get=1446707 capped=0 queue/release ratio=0
18:15:46 wc_send=1522972 release=1562257 get=1510851 capped=0 queue/release ratio=0
18:15:51 wc_send=2126919 release=2126524 get=2100919 capped=0 queue/release ratio=1
18:15:56 wc_send=1795205 release=1853495 get=1782554 capped=0 queue/release ratio=0
18:16:01 wc_send=1872737 release=1892789 get=1852457 capped=1 queue/release ratio=0
18:16:06 wc_send=1956552 release=1978503 get=1936808 capped=0 queue/release ratio=0
18:16:11 wc_send=1971408 release=2008768 get=1946069 capped=0 queue/release ratio=0
18:16:16 wc_send=1906940 release=1928668 get=1891132 capped=0 queue/release ratio=0
18:16:21 wc_send=2019228 release=2059320 get=1993856 capped=0 queue/release ratio=0
18:16:26 wc_send=2024263 release=2039861 get=2004786 capped=0 queue/release ratio=0
18:16:31 wc_send=1959046 release=1986165 get=1936975 capped=0 queue/release ratio=0
18:16:36 wc_send=3426760 release=3483327 get=1456324 capped=0 queue/release ratio=0
18:16:41 wc_send=2066845 release=2078503 get=2042898 capped=0 queue/release ratio=0
18:16:46 wc_send=1834131 release=1878182 get=1818034 capped=1 queue/release ratio=0
18:16:51 wc_send=1804133 release=1817222 get=1787677 capped=1 queue/release ratio=0
18:16:56 wc_send=1443325 release=1469384 get=1429495 capped=0 queue/release ratio=0
18:17:01 wc_send=1437236 release=1480771 get=1425373 capped=1 queue/release ratio=0
18:17:06 wc_send=3342896 release=1925346 get=1885982 capped=0 queue/release ratio=1
18:17:11 wc_send=2035328 release=2063539 get=2011609 capped=1 queue/release ratio=0
18:17:16 wc_send=1923583 release=1930805 get=1904366 capped=0 queue/release ratio=0
18:17:21 wc_send=1918902 release=1958907 get=1899028 capped=0 queue/release ratio=0
18:17:26 wc_send=1912808 release=1942152 get=1895315 capped=0 queue/release ratio=0
18:17:31 wc_send=2067295 release=2090789 get=2045747 capped=0 queue/release ratio=0
18:17:36 wc_send=1893866 release=1917954 get=3924985 capped=0 queue/release ratio=0
18:17:41 wc_send=1914595 release=1922461 get=1895841 capped=0 queue/release ratio=0
18:17:46 wc_send=1833010 release=1884469 get=1816313 capped=0 queue/release ratio=0
18:17:51 wc_send=1853085 release=3756671 get=3652976 capped=1 queue/release ratio=0
18:17:56 wc_send=1814888 release=1845390 get=1797898 capped=0 queue/release ratio=0
18:18:01 wc_send=1875252 release=3752430 get=1856752 capped=0 queue/release ratio=0
18:18:06 wc_send=1929648 release=1955900 get=1913162 capped=0 queue/release ratio=0
18:18:11 wc_send=1878956 release=1907828 get=1862308 capped=0 queue/release ratio=0
18:18:16 wc_send=1839023 release=1862031 get=1823009 capped=0 queue/release ratio=0
18:18:21 wc_send=77 release=77 get=77 capped=0 queue/release ratio=1
18:18:26 wc_send=0 release=0 get=0 capped=0
18:18:31 wc_send=0 release=0 get=0 capped=0
^C
* Re: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
@ 2026-05-06 6:01 ` Chuck Lever
2026-05-06 11:34 ` Mike Snitzer
0 siblings, 1 reply; 6+ messages in thread
From: Chuck Lever @ 2026-05-06 6:01 UTC (permalink / raw)
To: Mike Snitzer; +Cc: linux-nfs, ben.coddington, jonathan.flynn
On Tue, May 5, 2026, at 11:55 PM, Mike Snitzer wrote:
> From: Benjamin Coddington <ben.coddington@hammerspace.com>
>
> Under sustained heavy load over RDMA, kNFSD servers can pin tens of
> gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
> released until the connection terminates. A customer site reported
> OOM kills under heavy NFS READ workloads with ~2.3M cached
> send_ctxts visible via slab tracing (two stacks in
> svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
> the same ctxt population double-counted across the sc_pages and
> sc_xprt_buf allocations). Aggregated across the customer's ~218
> long-lived xprts that worked out to roughly 80 GB pinned, freed only
> by knfsd restart.
>
> Root cause is an unbounded cache, not a per-op leak.
> svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
> on empty, allocates fresh. svc_rdma_send_ctxt_release() always
> llist_add()s the ctxt back -- regardless of how many ctxts are
> already on the list. The only kfree() site is
> svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
> shrinker, no cap, no aging: it can only grow.
>
> Two effects compound to drive the high-water mark well above the
> configured RPC slot count:
>
> 1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
> INIT_WORK(...) ; queue_work(svcrdma_wq, ...) and returns. The
> actual _release (which puts the ctxt back on the llist) runs
> later on svcrdma_wq. Between wc_send -> _put and _put_async ->
> _release, the ctxt is "in transit" -- off the list, off the SQ,
> not yet reusable.
>
> 2. During that gap, a concurrent _get sees an empty llist and calls
> _alloc to mint a fresh ctxt. When the in-transit one eventually
> lands on the llist, the cache has grown by one. Under HCA-driven
> completion rates with even small workqueue dispatch lag, this
> happens constantly. The cache settles not at the steady-state
> in-flight count but at the all-time peak of (in-flight +
> workqueue-pending), and never shrinks.
>
> Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
> xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
> svc_rdma_send_ctxt_destroy). Apply the cap in two places:
>
> - svc_rdma_send_ctxt_get(): when the llist is empty and depth has
> reached sc_max_requests, return NULL instead of allocating. The
> caller drops the connection; the client reconnects with a fresh
> xprt that starts at depth zero. This is the backpressure point
> that prevents in-test memory growth -- it stops new allocations
> regardless of where in the pipeline existing ctxts are stuck.
>
> - svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
> between concurrent _get callers, or transient burst), free the
> ctxt instead of returning it to the llist. This keeps depth
> convergent.
>
> The cap is sc_max_requests because:
> - It is the configured number of credit slots per xprt -- the client
> can have at most this many RPCs outstanding on the transport.
> - Each RPC reply uses one send_ctxt at a time; concurrent in-flight
> ctxts therefore cannot legitimately exceed sc_max_requests in
> steady state.
> - Workqueue lag can momentarily push (in-flight + queued) above
> sc_max_requests, but those ctxts are exactly what the cap should
> shed -- they are not steady-state working set, just lag-inflation.
>
> The reuse semantics of the cache are intentional and unchanged: ctxts
> keep their first SGE DMA-mapped across cycles, so the steady-state
> hot path stays alloc-free. Only the *excess* ctxts are freed.
>
> A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
> freed-by-cap ctxt, so operators can confirm the cap is doing real
> work on a given workload.
>
> == Verification on the test rig ==
>
> Diagnostic tool: svcrdma-wq-lag.bt (will be provided in reply to
> this patch) -- per-5s rates of wc_send (queue inflow), _put_async
> (workqueue dispatch), _get (demand), and the new tracepoint.
>
> Negative case (cap on on-llist depth alone, with the atomic
> incremented in _release and decremented in _get), sustained NFS
> READ load:
> wc_send ~432K/s, release ~342K/s
> -> ~90K/s of ctxts pinned as queued sc_work items
> -> ~2.7M pinned after 30s; matches the slab measurement
> -> svcrdma_send_ctxt_capped fires 0 times during the test, then
> floods (~3.25M events) on test stop as the workqueue catches up
>
> The cap is structurally blind to ctxts pinned in workqueue items
> because depth only counts what's currently on the llist; during
> sustained load almost nothing makes it onto the llist before the
> next _get takes it back off. Inflation accumulates as queued
> sc_work items, invisible to the cap, until load stops.
>
> Post-patch (depth tracked at alloc/destroy + _get backpressure),
> same workload, 5 minutes:
> wc_send and release rates match within 1% (~410K/s each)
> No accumulation; no flood at test stop
> svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
> recovery)
> Throughput slightly higher than the negative case (cache no longer
> bloats the slab/page allocator into reclaim)
>
> The persistent wc_send/release gap in the negative case was itself
> a consequence of the unbounded growth: cache bloat -> slab pressure
> -> reclaim activity -> workqueue starvation -> larger gap. Once
> the cap breaks that spiral, the workqueue runs at full capacity and
> the rates equalize.
>
> Operators can confirm the cap is doing real work via:
> cd /sys/kernel/tracing
> echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
> cat trace_pipe
>
> If a workload genuinely needs more than sc_max_requests concurrent
> in-flight Sends, raise it via the sunrpc.svcrdma_max_requests sysctl
> rather than removing the cap.
The current svcrdma design assumes that the workqueue always keeps up
with the transport's Reply transmission rate. For the send_ctxt cache
to reach 2.3M ctxts, the workqueue must be persistently falling behind
wc_send, not just transiently hiccuping.
The "in transit" gap between wc_send -> _put and _put_async ->
_release is structurally one workqueue dispatch; under normal
scheduling, microseconds. The verification numbers show a sustained
90K/s deficit (432K wc_send/s vs 342K release/s) that accumulates
linearly. That deficit is the actual pathology here...
So I'm wondering: why does svcrdma_wq lag at steady state? _put_async
is small work -- an ib_dma_unmap_page per SGE plus an llist_add. On a
busy box it should easily outpace wc_send, not trail it by ~21%.
The patch description posits a spiral: cache bloat -> slab pressure ->
reclaim -> workqueue starvation -> more bloat. That makes the eventual
collapse plausible, but it does not establish what initiates the
spiral. If the workqueue keeps up at low cache size and only loses
ground after slab/reclaim kicks in, the early-load gap should be
small and growing. If the gap is fixed at 21% from second zero,
something structural is throttling the workqueue independent of slab
state.
Under sustained workqueue lag the cap fires not because the working
set exceeds sc_max_requests but because the workqueue backlog does.
Capping at sc_max_requests then translates "workqueue is slow" into
"drop the connection." That bounds memory but substitutes one failure
mode for another, and it punishes the client for what is essentially
a server-side scheduling problem. Not to mention the risks inherent
in repeated spurious connection loss.
I have a couple of patches that replace the use of svcrdma_wq, and
could alleviate the spiral issue. Would you be interested in trying
them with your reproducer?
--
Chuck Lever
* Re: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
2026-05-06 6:01 ` Chuck Lever
@ 2026-05-06 11:34 ` Mike Snitzer
0 siblings, 0 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-06 11:34 UTC (permalink / raw)
To: Chuck Lever; +Cc: Mike Snitzer, linux-nfs, ben.coddington, jonathan.flynn
On Wed, May 06, 2026 at 08:01:50AM +0200, Chuck Lever wrote:
>
>
> On Tue, May 5, 2026, at 11:55 PM, Mike Snitzer wrote:
> > From: Benjamin Coddington <ben.coddington@hammerspace.com>
> >
> > Under sustained heavy load over RDMA, kNFSD servers can pin tens of
> > gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
> > released until the connection terminates. A customer site reported
> > OOM kills under heavy NFS READ workloads with ~2.3M cached
> > send_ctxts visible via slab tracing (two stacks in
> > svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
> > the same ctxt population double-counted across the sc_pages and
> > sc_xprt_buf allocations). Aggregated across the customer's ~218
> > long-lived xprts that worked out to roughly 80 GB pinned, freed only
> > by knfsd restart.
> >
> > Root cause is an unbounded cache, not a per-op leak.
> > svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
> > on empty, allocates fresh. svc_rdma_send_ctxt_release() always
> > llist_add()s the ctxt back -- regardless of how many ctxts are
> > already on the list. The only kfree() site is
> > svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
> > shrinker, no cap, no aging: it can only grow.
> >
> > Two effects compound to drive the high-water mark well above the
> > configured RPC slot count:
> >
> > 1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
> > INIT_WORK(...) ; queue_work(svcrdma_wq, ...) and returns. The
> > actual _release (which puts the ctxt back on the llist) runs
> > later on svcrdma_wq. Between wc_send -> _put and _put_async ->
> > _release, the ctxt is "in transit" -- off the list, off the SQ,
> > not yet reusable.
> >
> > 2. During that gap, a concurrent _get sees an empty llist and calls
> > _alloc to mint a fresh ctxt. When the in-transit one eventually
> > lands on the llist, the cache has grown by one. Under HCA-driven
> > completion rates with even small workqueue dispatch lag, this
> > happens constantly. The cache settles not at the steady-state
> > in-flight count but at the all-time peak of (in-flight +
> > workqueue-pending), and never shrinks.
> >
> > Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
> > xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
> > svc_rdma_send_ctxt_destroy). Apply the cap in two places:
> >
> > - svc_rdma_send_ctxt_get(): when the llist is empty and depth has
> > reached sc_max_requests, return NULL instead of allocating. The
> > caller drops the connection; the client reconnects with a fresh
> > xprt that starts at depth zero. This is the backpressure point
> > that prevents in-test memory growth -- it stops new allocations
> > regardless of where in the pipeline existing ctxts are stuck.
> >
> > - svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
> > between concurrent _get callers, or transient burst), free the
> > ctxt instead of returning it to the llist. This keeps depth
> > convergent.
> >
> > The cap is sc_max_requests because:
> > - It is the configured number of credit slots per xprt -- the client
> > can have at most this many RPCs outstanding on the transport.
> > - Each RPC reply uses one send_ctxt at a time; concurrent in-flight
> > ctxts therefore cannot legitimately exceed sc_max_requests in
> > steady state.
> > - Workqueue lag can momentarily push (in-flight + queued) above
> > sc_max_requests, but those ctxts are exactly what the cap should
> > shed -- they are not steady-state working set, just lag-inflation.
> >
> > The reuse semantics of the cache are intentional and unchanged: ctxts
> > keep their first SGE DMA-mapped across cycles, so the steady-state
> > hot path stays alloc-free. Only the *excess* ctxts are freed.
> >
> > A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
> > freed-by-cap ctxt, so operators can confirm the cap is doing real
> > work on a given workload.
> >
> > == Verification on the test rig ==
> >
> > Diagnostic tool: svcrdma-wq-lag.bt (will be provided in reply to
> > this patch) -- per-5s rates of wc_send (queue inflow), _put_async
> > (workqueue dispatch), _get (demand), and the new tracepoint.
> >
> > Negative case (cap on on-llist depth alone, with the atomic
> > incremented in _release and decremented in _get), sustained NFS
> > READ load:
> > wc_send ~432K/s, release ~342K/s
> > -> ~90K/s of ctxts pinned as queued sc_work items
> > -> ~2.7M pinned after 30s; matches the slab measurement
> > -> svcrdma_send_ctxt_capped fires 0 times during the test, then
> > floods (~3.25M events) on test stop as the workqueue catches up
> >
> > The cap is structurally blind to ctxts pinned in workqueue items
> > because depth only counts what's currently on the llist; during
> > sustained load almost nothing makes it onto the llist before the
> > next _get takes it back off. Inflation accumulates as queued
> > sc_work items, invisible to the cap, until load stops.
> >
> > Post-patch (depth tracked at alloc/destroy + _get backpressure),
> > same workload, 5 minutes:
> > wc_send and release rates match within 1% (~410K/s each)
> > No accumulation; no flood at test stop
> > svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
> > recovery)
> > Throughput slightly higher than the negative case (cache no longer
> > bloats the slab/page allocator into reclaim)
> >
> > The persistent wc_send/release gap in the negative case was itself
> > a consequence of the unbounded growth: cache bloat -> slab pressure
> > -> reclaim activity -> workqueue starvation -> larger gap. Once
> > the cap breaks that spiral, the workqueue runs at full capacity and
> > the rates equalize.
> >
> > Operators can confirm the cap is doing real work via:
> > cd /sys/kernel/tracing
> > echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
> > cat trace_pipe
> >
> > If a workload genuinely needs more than sc_max_requests concurrent
> > in-flight Sends, raise it via the sunrpc.svcrdma_max_requests sysctl
> > rather than removing the cap.
>
> The current svcrdma design assumes that the workqueue always keeps up
> with the transport's Reply transmission rate. For the send_ctxt cache
> to reach 2.3M ctxts, the workqueue itself must be running orders of
> magnitude slower than wc_send.
>
> The "in transit" gap between wc_send -> _put and _put_async ->
> _release is structurally one workqueue dispatch; under normal
> scheduling, microseconds. The verification numbers show a sustained
> 90K/s deficit (432K wc_send/s vs 342K release/s) that accumulates
> linearly. That deficit is the actual pathology here...
>
> So I'm wondering: Why does svcrdma_wq lag at steady state? _put_async
> is small work -- ib_dma_unmap_page_list plus an llist_add. On a busy
> box it should easily outpace wc_send, not trail it by ~21%.
>
> The cover letter posits a spiral: cache bloat -> slab pressure ->
> reclaim -> workqueue starvation -> more bloat. That makes the eventual
> collapse plausible, but it does not establish what initiates the
> spiral. If the workqueue keeps up at low cache size and only loses
> ground after slab/reclaim kicks in, the early-load gap should be
> small and growing. If the gap is fixed at 21% from second zero,
> something structural is throttling the workqueue independent of slab
> state.
>
> Under sustained workqueue lag the cap fires not because the working
> set exceeds sc_max_requests but because the workqueue backlog does.
> Capping at sc_max_requests then translates "workqueue is slow" into
> "drop the connection." That bounds memory but substitutes one failure
> mode for another, and it punishes the client for what is essentially
> a server-side scheduling problem. Not to mention the risks inherent
> in repeated spurious connection loss.
Thanks for all that context. Yes, the imposed connection loss isn't
ideal when the cap hits, but in practice it's relatively rare even with
the more extreme small (16K) workload. With larger I/O sizes the
workqueue is able to keep up.
> I have a couple of patches that replace the use of svcrdma_wq, and
> could alleviate the spiral issue. Would you be interested in trying
> them with your reproducer?
Yes, please share and we'll do our best to get time on the system.
We have some constraints at the moment that confine us to only making
module changes (otherwise, if core kernel changes were needed, it'd
trigger a much more involved general kernel update that'd require
coordination with the admins due to the netboot infra).
Thanks,
Mike