From: Mike Snitzer <snitzer@hammerspace.com>
To: Chuck Lever <cel@kernel.org>
Cc: linux-nfs@vger.kernel.org, ben.coddington@hammerspace.com,
	jonathan.flynn@hammerspace.com
Subject: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
Date: Tue,  5 May 2026 17:55:34 -0400
Message-ID: <20260505215535.68412-2-snitzer@kernel.org>
In-Reply-To: <20260505215535.68412-1-snitzer@kernel.org>

From: Benjamin Coddington <ben.coddington@hammerspace.com>

Under sustained heavy load over RDMA, kNFSD servers can pin tens of
gigabytes of memory in per-xprt svc_rdma_send_ctxt caches that are
never released until the connection terminates.  A customer site
reported OOM kills under heavy NFS READ workloads, with ~2.3M cached
send_ctxts visible via slab tracing (two stacks in
svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
the same ctxt population double-counted across the sc_pages and
sc_xprt_buf allocations).  Aggregated across the customer's ~218
long-lived xprts, that worked out to roughly 80 GB pinned, freed
only by a knfsd restart.

The root cause is an unbounded cache, not a per-op leak.
svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist)
or, on empty, allocates a fresh ctxt.  svc_rdma_send_ctxt_release()
always llist_add()s the ctxt back, regardless of how many ctxts are
already on the list.  The only kfree() site is
svc_rdma_send_ctxts_destroy() at xprt teardown.  The list has no
shrinker, no cap, no aging: it can only grow.
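
For reference, the pre-patch get/release pair reduces to roughly the
following (a simplified sketch; DMA setup, tracing, and error paths
elided -- see net/sunrpc/xprtrdma/svc_rdma_sendto.c for the real
code):

  struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
  {
          struct llist_node *node = llist_del_first(&rdma->sc_send_ctxts);

          if (node)       /* reuse a cached ctxt */
                  return llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
          return svc_rdma_send_ctxt_alloc(rdma);  /* or mint a new one */
  }

  static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
                                         struct svc_rdma_send_ctxt *ctxt)
  {
          /* Unconditional: cached until xprt teardown, never freed. */
          llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
  }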

Two effects compound to drive the high-water mark well above the
configured RPC slot count:

 1. _put runs through a workqueue.  svc_rdma_send_ctxt_put() does
    INIT_WORK(...); queue_work(svcrdma_wq, ...) and returns.  The
    actual _release (which puts the ctxt back on the llist) runs
    later on svcrdma_wq.  Between wc_send -> _put and _put_async ->
    _release, the ctxt is "in transit" -- off the list, off the SQ,
    not yet reusable.

 2. During that gap, a concurrent _get sees an empty llist and calls
    _alloc to mint a fresh ctxt.  When the in-transit one eventually
    lands on the llist, the cache has grown by one (see the sketch
    after this list).  Under HCA-driven completion rates with even
    small workqueue dispatch lag, this happens constantly.  The
    cache settles not at the steady-state in-flight count but at the
    all-time peak of (in-flight + workqueue-pending), and never
    shrinks.
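
Schematically, one iteration of the race (time flows downward;
simplified to a single ctxt and two CPUs):

    completion path                     reply path
    ---------------                     ----------
    wc_send -> _put(ctxt)
      queue_work(svcrdma_wq, ...)
      [ctxt in transit]                 _get(): llist empty
                                          -> _alloc() mints ctxt #2
    ... workqueue dispatch lag ...
    _put_async -> _release(ctxt)
      llist_add(ctxt)

    The cache is now one ctxt deeper than before the iteration.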

Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
svc_rdma_send_ctxt_destroy).  Apply the cap in two places:

  - svc_rdma_send_ctxt_get(): when the llist is empty and depth has
    reached sc_max_requests, return NULL instead of allocating.  The
    caller drops the connection; the client reconnects with a fresh
    xprt that starts at depth zero.  This is the backpressure point
    that prevents in-test memory growth -- it stops new allocations
    regardless of where in the pipeline existing ctxts are stuck.

  - svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
    between concurrent _get callers, or transient burst), free the
    ctxt instead of returning it to the llist.  This keeps depth
    convergent.

The cap is sc_max_requests because:
 - It is the configured number of credit slots per xprt -- the client
   can have at most this many RPCs outstanding on the transport.
 - Each RPC reply uses one send_ctxt at a time; concurrent in-flight
   ctxts therefore cannot legitimately exceed sc_max_requests in
   steady state.
 - Workqueue lag can momentarily push (in-flight + queued) above
   sc_max_requests, but those ctxts are exactly what the cap should
   shed -- they are not steady-state working set, just lag-inflation.

The reuse semantics of the cache are intentional and unchanged: ctxts
keep their first SGE DMA-mapped across cycles, so the steady-state
hot path stays alloc-free.  Only the *excess* ctxts are freed.

A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
ctxt freed by the cap.

== Verification on the test rig ==

Diagnostic tool: svcrdma-wq-lag.bt (to be provided in reply to this
patch).  It reports per-5s rates of wc_send (queue inflow),
_put_async (workqueue dispatch), _get (demand), and the new
tracepoint.

Negative case (an earlier variant that capped on-llist depth alone,
with the atomic incremented in _release and decremented in _get),
under sustained NFS READ load:
  wc_send ~432K/s, release ~342K/s
  -> ~90K/s of ctxts pinned as queued sc_work items
  -> ~2.7M pinned after 30s; matches the slab measurement
  -> svcrdma_send_ctxt_capped fires 0 times during the test, then
     floods (~3.25M events) on test stop as the workqueue catches up

  The cap is structurally blind to ctxts pinned in workqueue items
  because depth only counts what's currently on the llist; during
  sustained load almost nothing makes it onto the llist before the
  next _get takes it back off.  Inflation accumulates as queued
  sc_work items, invisible to the cap, until load stops.
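
  In code terms, the rejected variant amounted to roughly the
  following (a reconstruction for illustration, not the exact patch
  that was tested):

    static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
                                           struct svc_rdma_send_ctxt *ctxt)
    {
            /* Counts only ctxts resting on the llist, so under load
             * the cap never triggers: the next _get immediately pulls
             * the ctxt back off (decrementing depth) while millions
             * of ctxts hide in queued sc_work items, invisible here.
             */
            if (atomic_inc_return(&rdma->sc_send_ctxts_depth) >
                rdma->sc_max_requests) {
                    atomic_dec(&rdma->sc_send_ctxts_depth);
                    svc_rdma_send_ctxt_destroy(rdma, ctxt);
                    return;
            }
            llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
    }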

Post-patch (depth tracked at alloc/destroy + _get backpressure),
same workload, 5 minutes:
  wc_send and release rates match within 1% (~410K/s each)
  No accumulation; no flood at test stop
  svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
  recovery)
  Throughput slightly higher than the negative case (cache no longer
  bloats the slab/page allocator into reclaim)

The persistent wc_send/release gap in the negative case was itself
a consequence of the unbounded growth: cache bloat -> slab pressure
-> reclaim activity -> workqueue starvation -> larger gap.  Once
the cap breaks that spiral, the workqueue runs at full capacity and
the rates equalize.

Operators can confirm the cap is doing real work via:
   cd /sys/kernel/tracing
   echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
   cat trace_pipe

If a workload genuinely needs more than sc_max_requests concurrent
in-flight Sends, raise it via the sunrpc.svc_rdma.max_requests
sysctl (/proc/sys/sunrpc/svc_rdma/max_requests) rather than
removing the cap.

== Diagnostics caveats ==

A previous diagnostic pass on this code path was misled by GCC
inlining of svc_rdma_send_ctxt_put into all in-tree callers (same
TU): the symbol stays in kallsyms (lowercase t) but no caller jumps
there, so kprobes attach without warning yet fire zero times.  This
made an inventory script's "@inflight = gets - puts" appear to be
monotonically rising, falsely confirming a per-op lifecycle leak.
Hooking svc_rdma_send_ctxt_put_async (a workqueue function-pointer
target, forced out-of-line by INIT_WORK) instead of _put gets
accurate accounting and shows gets/puts balanced within ~1% under
load.  Future probes in this path should prefer function-pointer
targets, tracepoints, or kmem tracepoints over kprobes on small
non-static functions in the same TU as their callers.

Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Diagnosed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Benjamin Coddington <ben.coddington@hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 include/linux/sunrpc/svc_rdma.h          |  1 +
 include/trace/events/rpcrdma.h           |  2 ++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 41 ++++++++++++++++++++----
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  1 +
 4 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad570..b8ae1032bf293 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -97,6 +97,7 @@ struct svcxprt_rdma {
 
 	spinlock_t	     sc_send_lock;
 	struct llist_head    sc_send_ctxts;
+	atomic_t	     sc_send_ctxts_depth;
 	spinlock_t	     sc_rw_ctxt_lock;
 	struct llist_head    sc_rw_ctxts;
 
diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index b79913048e1a0..945152e33af8c 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -2027,6 +2027,8 @@ DEFINE_SIMPLE_CID_EVENT(svcrdma_wc_send);
 DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_flush);
 DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_err);
 
+DEFINE_SIMPLE_CID_EVENT(svcrdma_send_ctxt_capped);
+
 DEFINE_SIMPLE_CID_EVENT(svcrdma_post_recv);
 
 DEFINE_RECEIVE_SUCCESS_EVENT(svcrdma_wc_recv);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 8b3f0c8c14b25..e487d2815b33e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -158,6 +158,7 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 
 	for (i = 0; i < rdma->sc_max_send_sges; i++)
 		ctxt->sc_sges[i].lkey = rdma->sc_pd->local_dma_lkey;
+	atomic_inc(&rdma->sc_send_ctxts_depth);
 	return ctxt;
 
 fail3:
@@ -170,6 +171,20 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 	return NULL;
 }
 
+/* Tear down a single send_ctxt: reverse of svc_rdma_send_ctxt_alloc. */
+static void svc_rdma_send_ctxt_destroy(struct svcxprt_rdma *rdma,
+				       struct svc_rdma_send_ctxt *ctxt)
+{
+	struct ib_device *device = rdma->sc_cm_id->device;
+
+	ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
+			    rdma->sc_max_req_size, DMA_TO_DEVICE);
+	kfree(ctxt->sc_xprt_buf);
+	kfree(ctxt->sc_pages);
+	kfree(ctxt);
+	atomic_dec(&rdma->sc_send_ctxts_depth);
+}
+
 /**
  * svc_rdma_send_ctxts_destroy - Release all send_ctxt's for an xprt
  * @rdma: svcxprt_rdma being torn down
@@ -177,17 +192,12 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
  */
 void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
 {
-	struct ib_device *device = rdma->sc_cm_id->device;
 	struct svc_rdma_send_ctxt *ctxt;
 	struct llist_node *node;
 
 	while ((node = llist_del_first(&rdma->sc_send_ctxts)) != NULL) {
 		ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
-		ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
-				    rdma->sc_max_req_size, DMA_TO_DEVICE);
-		kfree(ctxt->sc_xprt_buf);
-		kfree(ctxt->sc_pages);
-		kfree(ctxt);
+		svc_rdma_send_ctxt_destroy(rdma, ctxt);
 	}
 }
 
@@ -226,6 +236,14 @@ struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
 	return ctxt;
 
 out_empty:
+	/* Backpressure: refuse to mint a new ctxt once the per-xprt total
+	 * (in-flight + queued for release + on-llist) has reached the
+	 * configured slot count. The caller drops the connection; the
+	 * client reconnects with a fresh xprt. Better than the unbounded
+	 * allocation that lets workqueue lag inflate the cache to OOM.
+	 */
+	if (atomic_read(&rdma->sc_send_ctxts_depth) >= rdma->sc_max_requests)
+		return NULL;
 	ctxt = svc_rdma_send_ctxt_alloc(rdma);
 	if (!ctxt)
 		return NULL;
@@ -257,6 +275,17 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
 				  DMA_TO_DEVICE);
 	}
 
+	/* Depth is now tracked at alloc/destroy, so it reflects total
+	 * live ctxts (in-flight + queued + on-llist), not just on-llist.
+	 * If we've blown past the cap -- via a race in the _get
+	 * backpressure check, or a transient burst -- destroy this ctxt
+	 * instead of returning it to the llist so the depth converges.
+	 */
+	if (atomic_read(&rdma->sc_send_ctxts_depth) > rdma->sc_max_requests) {
+		trace_svcrdma_send_ctxt_capped(&ctxt->sc_cid);
+		svc_rdma_send_ctxt_destroy(rdma, ctxt);
+		return;
+	}
 	llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
 }
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4ff..7708634ebf587 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -176,6 +176,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
 	INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
 	INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
 	init_llist_head(&cma_xprt->sc_send_ctxts);
+	atomic_set(&cma_xprt->sc_send_ctxts_depth, 0);
 	init_llist_head(&cma_xprt->sc_recv_ctxts);
 	init_llist_head(&cma_xprt->sc_rw_ctxts);
 	init_waitqueue_head(&cma_xprt->sc_send_wait);
-- 
2.44.0

