[RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mike Snitzer <snitzer@hammerspace.com>
To: Chuck Lever <cel@kernel.org>
Cc: linux-nfs@vger.kernel.org, ben.coddington@hammerspace.com,
	jonathan.flynn@hammerspace.com
Subject: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
Date: Tue,  5 May 2026 17:55:34 -0400	[thread overview]
Message-ID: <20260505215535.68412-2-snitzer@kernel.org> (raw)
In-Reply-To: <20260505215535.68412-1-snitzer@kernel.org>

From: Benjamin Coddington <ben.coddington@hammerspace.com>

Under sustained heavy load over RDMA, kNFSD servers can pin tens of
gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
released until the connection terminates.  A customer site reported
OOM kills under heavy NFS READ workloads with ~2.3M cached
send_ctxts visible via slab tracing (two stacks in
svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
the same ctxt population double-counted across the sc_pages and
sc_xprt_buf allocations).  Aggregated across the customer's ~218
long-lived xprts that worked out to roughly 80 GB pinned, freed only
by knfsd restart.

Root cause is an unbounded cache, not a per-op leak.
svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
on empty, allocates fresh.  svc_rdma_send_ctxt_release() always
llist_add()s the ctxt back -- regardless of how many ctxts are
already on the list.  The only kfree() site is
svc_rdma_send_ctxts_destroy() at xprt teardown.  The list has no
shrinker, no cap, no aging: it can only grow.

Two effects compound to drive the high-water mark well above the
configured RPC slot count:

 1. _put runs through a workqueue.  svc_rdma_send_ctxt_put() does
    INIT_WORK(...) ; queue_work(svcrdma_wq, ...) and returns.  The
    actual _release (which puts the ctxt back on the llist) runs
    later on svcrdma_wq.  Between wc_send -> _put and _put_async ->
    _release, the ctxt is "in transit" -- off the list, off the SQ,
    not yet reusable.

 2. During that gap, a concurrent _get sees an empty llist and calls
    _alloc to mint a fresh ctxt.  When the in-transit one eventually
    lands on the llist, the cache has grown by one.  Under HCA-driven
    completion rates with even small workqueue dispatch lag, this
    happens constantly.  The cache settles not at the steady-state
    in-flight count but at the all-time peak of (in-flight +
    workqueue-pending), and never shrinks.

Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
svc_rdma_send_ctxt_destroy).  Apply the cap in two places:

  - svc_rdma_send_ctxt_get(): when the llist is empty and depth has
    reached sc_max_requests, return NULL instead of allocating.  The
    caller drops the connection; the client reconnects with a fresh
    xprt that starts at depth zero.  This is the backpressure point
    that prevents in-test memory growth -- it stops new allocations
    regardless of where in the pipeline existing ctxts are stuck.

  - svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
    between concurrent _get callers, or transient burst), free the
    ctxt instead of returning it to the llist.  This keeps depth
    convergent.

The cap is sc_max_requests because:
 - It is the configured number of credit slots per xprt -- the client
   can have at most this many RPCs outstanding on the transport.
 - Each RPC reply uses one send_ctxt at a time; concurrent in-flight
   ctxts therefore cannot legitimately exceed sc_max_requests in
   steady state.
 - Workqueue lag can momentarily push (in-flight + queued) above
   sc_max_requests, but those ctxts are exactly what the cap should
   shed -- they are not steady-state working set, just lag-inflation.

The reuse semantics of the cache are intentional and unchanged: ctxts
keep their first SGE DMA-mapped across cycles, so the steady-state
hot path stays alloc-free.  Only the *excess* ctxts are freed.

A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
freed-by-cap ctxt, so operators can confirm the cap is doing real
work on a given workload.

== Verification on the test rig ==

Diagnostic tool: svcrdma-wq-lag.bt (will be provided in reply to
this patch) -- per-5s rates of wc_send (queue inflow), _put_async
(workqueue dispatch), _get (demand), and the new tracepoint.

Negative case (cap on on-llist depth alone, with the atomic
incremented in _release and decremented in _get), sustained NFS
READ load:
  wc_send ~432K/s, release ~342K/s
  -> ~90K/s of ctxts pinned as queued sc_work items
  -> ~2.7M pinned after 30s; matches the slab measurement
  -> svcrdma_send_ctxt_capped fires 0 times during the test, then
     floods (~3.25M events) on test stop as the workqueue catches up

  The cap is structurally blind to ctxts pinned in workqueue items
  because depth only counts what's currently on the llist; during
  sustained load almost nothing makes it onto the llist before the
  next _get takes it back off.  Inflation accumulates as queued
  sc_work items, invisible to the cap, until load stops.

Post-patch (depth tracked at alloc/destroy + _get backpressure),
same workload, 5 minutes:
  wc_send and release rates match within 1% (~410K/s each)
  No accumulation; no flood at test stop
  svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
  recovery)
  Throughput slightly higher than the negative case (cache no longer
  bloats the slab/page allocator into reclaim)

The persistent wc_send/release gap in the negative case was itself
a consequence of the unbounded growth: cache bloat -> slab pressure
-> reclaim activity -> workqueue starvation -> larger gap.  Once
the cap breaks that spiral, the workqueue runs at full capacity and
the rates equalize.

Operators can confirm the cap is doing real work via:
   cd /sys/kernel/tracing
   echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
   cat trace_pipe

If a workload genuinely needs more than sc_max_requests concurrent
in-flight Sends, raise it via the sunrpc.svcrdma_max_requests sysctl
rather than removing the cap.

== Diagnostics caveats ==

A previous diagnostic pass on this code path was misled by GCC
inlining of svc_rdma_send_ctxt_put into all in-tree callers (same
TU): the symbol stays in kallsyms (lowercase t) but no caller jumps
there, so kprobes attach without warning yet fire zero times.  This
made an inventory script's "@inflight = gets - puts" appear to be
monotonically rising, falsely confirming a per-op lifecycle leak.
Hooking svc_rdma_send_ctxt_put_async (a workqueue function-pointer
target, forced out-of-line by INIT_WORK) instead of _put gets
accurate accounting and shows gets/puts balanced within ~1% under
load.  Future probes in this path should prefer function-pointer
targets, tracepoints, or kmem tracepoints over kprobes on small
non-static functions in the same TU as their callers.

Reported-by: Jonathan Flynn <jonathan.flynn@hammerspace.com>
Diagnosed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Benjamin Coddington <ben.coddington@hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 include/linux/sunrpc/svc_rdma.h          |  1 +
 include/trace/events/rpcrdma.h           |  2 ++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 41 ++++++++++++++++++++----
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  1 +
 4 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index df6e08aaad570..b8ae1032bf293 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -97,6 +97,7 @@ struct svcxprt_rdma {
 
 	spinlock_t	     sc_send_lock;
 	struct llist_head    sc_send_ctxts;
+	atomic_t	     sc_send_ctxts_depth;
 	spinlock_t	     sc_rw_ctxt_lock;
 	struct llist_head    sc_rw_ctxts;
 
diff --git a/include/trace/events/rpcrdma.h b/include/trace/events/rpcrdma.h
index b79913048e1a0..945152e33af8c 100644
--- a/include/trace/events/rpcrdma.h
+++ b/include/trace/events/rpcrdma.h
@@ -2027,6 +2027,8 @@ DEFINE_SIMPLE_CID_EVENT(svcrdma_wc_send);
 DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_flush);
 DEFINE_SEND_FLUSH_EVENT(svcrdma_wc_send_err);
 
+DEFINE_SIMPLE_CID_EVENT(svcrdma_send_ctxt_capped);
+
 DEFINE_SIMPLE_CID_EVENT(svcrdma_post_recv);
 
 DEFINE_RECEIVE_SUCCESS_EVENT(svcrdma_wc_recv);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 8b3f0c8c14b25..e487d2815b33e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -158,6 +158,7 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 
 	for (i = 0; i < rdma->sc_max_send_sges; i++)
 		ctxt->sc_sges[i].lkey = rdma->sc_pd->local_dma_lkey;
+	atomic_inc(&rdma->sc_send_ctxts_depth);
 	return ctxt;
 
 fail3:
@@ -170,6 +171,20 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
 	return NULL;
 }
 
+/* Tear down a single send_ctxt: reverse of svc_rdma_send_ctxt_alloc. */
+static void svc_rdma_send_ctxt_destroy(struct svcxprt_rdma *rdma,
+				       struct svc_rdma_send_ctxt *ctxt)
+{
+	struct ib_device *device = rdma->sc_cm_id->device;
+
+	ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
+			    rdma->sc_max_req_size, DMA_TO_DEVICE);
+	kfree(ctxt->sc_xprt_buf);
+	kfree(ctxt->sc_pages);
+	kfree(ctxt);
+	atomic_dec(&rdma->sc_send_ctxts_depth);
+}
+
 /**
  * svc_rdma_send_ctxts_destroy - Release all send_ctxt's for an xprt
  * @rdma: svcxprt_rdma being torn down
@@ -177,17 +192,12 @@ svc_rdma_send_ctxt_alloc(struct svcxprt_rdma *rdma)
  */
 void svc_rdma_send_ctxts_destroy(struct svcxprt_rdma *rdma)
 {
-	struct ib_device *device = rdma->sc_cm_id->device;
 	struct svc_rdma_send_ctxt *ctxt;
 	struct llist_node *node;
 
 	while ((node = llist_del_first(&rdma->sc_send_ctxts)) != NULL) {
 		ctxt = llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
-		ib_dma_unmap_single(device, ctxt->sc_sges[0].addr,
-				    rdma->sc_max_req_size, DMA_TO_DEVICE);
-		kfree(ctxt->sc_xprt_buf);
-		kfree(ctxt->sc_pages);
-		kfree(ctxt);
+		svc_rdma_send_ctxt_destroy(rdma, ctxt);
 	}
 }
 
@@ -226,6 +236,14 @@ struct svc_rdma_send_ctxt *svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
 	return ctxt;
 
 out_empty:
+	/* Backpressure: refuse to mint a new ctxt once the per-xprt total
+	 * (in-flight + queued for release + on-llist) has reached the
+	 * configured slot count. The caller drops the connection; the
+	 * client reconnects with a fresh xprt. Better than the unbounded
+	 * allocation that lets workqueue lag inflate the cache to OOM.
+	 */
+	if (atomic_read(&rdma->sc_send_ctxts_depth) >= rdma->sc_max_requests)
+		return NULL;
 	ctxt = svc_rdma_send_ctxt_alloc(rdma);
 	if (!ctxt)
 		return NULL;
@@ -257,6 +275,17 @@ static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
 				  DMA_TO_DEVICE);
 	}
 
+	/* Depth is now tracked at alloc/destroy, so it reflects total
+	 * live ctxts (in-flight + queued + on-llist), not just on-llist.
+	 * If we've blown past the cap -- via a race in the _get
+	 * backpressure check, or a transient burst -- destroy this ctxt
+	 * instead of returning it to the llist so the depth converges.
+	 */
+	if (atomic_read(&rdma->sc_send_ctxts_depth) > rdma->sc_max_requests) {
+		trace_svcrdma_send_ctxt_capped(&ctxt->sc_cid);
+		svc_rdma_send_ctxt_destroy(rdma, ctxt);
+		return;
+	}
 	llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
 }
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index f18bc60d9f4ff..7708634ebf587 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -176,6 +176,7 @@ static struct svcxprt_rdma *svc_rdma_create_xprt(struct svc_serv *serv,
 	INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
 	INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
 	init_llist_head(&cma_xprt->sc_send_ctxts);
+	atomic_set(&cma_xprt->sc_send_ctxts_depth, 0);
 	init_llist_head(&cma_xprt->sc_recv_ctxts);
 	init_llist_head(&cma_xprt->sc_rw_ctxts);
 	init_waitqueue_head(&cma_xprt->sc_send_wait);
-- 
2.44.0

next prev parent reply	other threads:[~2026-05-05 21:55 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2026-05-05 21:55 ` Mike Snitzer [this message]
2026-05-06  6:01   ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Chuck Lever
2026-05-06 11:34     ` Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic Mike Snitzer
2026-05-05 22:05 ` [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:df6e08aaad57 dfblob:b8ae1032bf29 dfblob:b79913048e1a
dfblob:945152e33af8 dfblob:8b3f0c8c14b2 dfblob:e487d2815b33
dfblob:f18bc60d9f4f dfblob:7708634ebf58 )
 OR (
bs:"[RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260505215535.68412-2-snitzer@kernel.org \
    --to=snitzer@hammerspace.com \
    --cc=ben.coddington@hammerspace.com \
    --cc=cel@kernel.org \
    --cc=jonathan.flynn@hammerspace.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.