From: Mike Snitzer <snitzer@hammerspace.com>
To: Chuck Lever <cel@kernel.org>
Cc: linux-nfs@vger.kernel.org, ben.coddington@hammerspace.com,
jonathan.flynn@hammerspace.com
Subject: [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic
Date: Tue, 5 May 2026 17:55:35 -0400
Message-ID: <20260505215535.68412-3-snitzer@kernel.org>
In-Reply-To: <20260505215535.68412-1-snitzer@kernel.org>
From: Mike Snitzer <snitzer@hammerspace.com>
Add a bpftrace script that reports per-5s rates of svc_rdma_wc_send
(a proxy for queue_work onto svcrdma_wq, since each Send completion
queues one put), svc_rdma_send_ctxt_put_async (the actual workqueue
dispatch), svc_rdma_send_ctxt_get (demand on the cache), and the
svcrdma_send_ctxt_capped tracepoint.

It is used to diagnose whether svcrdma_wq dispatch keeps up with
inflow under sustained load: a sustained gap between the wc_send and
release rates means ctxts are pinned in queued sc_work items, which
the on-llist depth counter cannot see.
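If per-item worker cost is also of interest, a kprobe/kretprobe pair
on the worker function can histogram how long each release takes.
A minimal companion sketch (not part of this patch; assumes
svc_rdma_send_ctxt_put_async is kprobe-attachable, see below; the
@start/@put_async_ns names are mine):

    kprobe:svc_rdma_send_ctxt_put_async { @start[tid] = nsecs; }

    kretprobe:svc_rdma_send_ctxt_put_async {
        if (@start[tid]) {
            /* wall time to release one send ctxt */
            @put_async_ns = hist(nsecs - @start[tid]);
            delete(@start[tid]);
        }
    }

A @put_async_ns histogram that shifts right under load, while the
release rate lags the wc_send rate, would point at per-item release
cost rather than scheduling delay.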
Only reliably traceable hooks are used: wc_send and _put_async have
their addresses taken as function pointers, so out-of-line copies
always exist; _get is forced out of line by its external callers in
the recvfrom and backchannel translation units (see CLAUDE.md
"Probe-inlining gotcha").
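As a pre-flight sanity check before a long run, the probe targets can
be listed first (example invocations; adjust patterns to taste):

    bpftrace -l 'kprobe:svc_rdma_wc_send'
    bpftrace -l 'kprobe:svc_rdma_send_ctxt_*'
    bpftrace -l 'tracepoint:rpcrdma:svcrdma_send_ctxt_capped'

Any symbol that fails to list was inlined or renamed in that kernel
build, and the corresponding probe in the script will fail to attach.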
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
---
.../filesystems/nfs/svcrdma-wq-lag.bt | 146 ++++++++++++++++++
1 file changed, 146 insertions(+)
create mode 100755 Documentation/filesystems/nfs/svcrdma-wq-lag.bt
diff --git a/Documentation/filesystems/nfs/svcrdma-wq-lag.bt b/Documentation/filesystems/nfs/svcrdma-wq-lag.bt
new file mode 100755
index 0000000000000..16e374bcb7aa8
--- /dev/null
+++ b/Documentation/filesystems/nfs/svcrdma-wq-lag.bt
@@ -0,0 +1,146 @@
+#!/usr/bin/env bpftrace
+/*
+ * svcrdma-wq-lag.bt
+ *
+ * Test the "svcrdma_wq is the bottleneck" theory.
+ *
+ * Background
+ * ----------
+ * An earlier commit from Ben Coddington ("svcrdma: cap per-xprt
+ * sc_send_ctxts free list at sc_max_requests") added an atomic depth
+ * counter on the sc_send_ctxts llist and a cap check inside
+ * svc_rdma_send_ctxt_release(). The svcrdma_send_ctxt_capped
+ * tracepoint was supposed to fire whenever the cache was about to
+ * exceed sc_max_requests.
+ *
+ * Customer test on a hot workload: the tracepoint fires zero times
+ * during the test, then floods the moment the test is Ctrl-C'd,
+ * while memory keeps growing throughout, as before.
+ *
+ * Hypothesis
+ * ----------
+ * svcrdma_wq is alloc_workqueue("svcrdma", WQ_UNBOUND, 0). WQ_UNBOUND
+ * has bounded concurrency but UNBOUNDED queue depth. Under load,
+ * svc_rdma_wc_send (the Send completion handler) calls svc_rdma_send_ctxt_put,
+ * which does INIT_WORK + queue_work and returns in microseconds. The
+ * worker, svc_rdma_send_ctxt_put_async -> svc_rdma_send_ctxt_release,
+ * has to do ib_dma_unmap_page per SGE plus release_pages plus
+ * svc_rdma_reply_chunk_release -- a much slower operation.
+ *
+ * If queue rate >> worker rate, work items pile up. Each pinned
+ * sc_work is embedded in its ctxt, so ctxts can't be released. The
+ * llist stays empty (depth ~ 0), and the cap at _release never
+ * fires. _get keeps seeing an empty llist and calling _alloc for new
+ * ctxts. Memory grows.
+ *
+ * After Ctrl-C: completions stop, workqueue drains, all queued
+ * releases run rapid-fire. NOW the llist depth shoots past
+ * sc_max_requests and the cap fires for each subsequent release --
+ * the post-test flood.
+ *
+ * What this script measures
+ * -------------------------
+ * Per 5-second window, four rates:
+ *
+ *   @wc_send_rate   svc_rdma_wc_send invocations
+ *                   -- proxy for the queue_work(svcrdma_wq, ...) rate
+ *                      (each completion queues one put)
+ *
+ *   @release_rate   svc_rdma_send_ctxt_put_async invocations
+ *                   -- actual workqueue dispatch rate
+ *
+ *   @get_rate       svc_rdma_send_ctxt_get invocations
+ *                   -- demand pressure on the cache
+ *
+ *   @capped_rate    svcrdma_send_ctxt_capped tracepoint count
+ *                   -- where Ben's cap actually fires
+ *
+ * Reading the result
+ * ------------------
+ * If @wc_send_rate >> @release_rate during the test, the workqueue
+ * is backed up -- queue items are accumulating, which means ctxts
+ * are pinned in workqueue items. Hypothesis confirmed.
+ *
+ * If those two rates are comparable, the workqueue is keeping up
+ * and the bottleneck is somewhere else.
+ *
+ * @capped_rate ~ 0 during the test, then @release_rate >>
+ * @wc_send_rate after Ctrl-C (workqueue draining), then
+ * @capped_rate spikes -- exactly the customer's observed pattern.
+ *
+ * @get_rate gives the workload's natural cadence; if @get_rate >>
+ * @release_rate is sustained, allocations are accumulating regardless
+ * of where in the pipeline they're stuck.
+ *
+ * Probes used
+ * -----------
+ *   kprobe:svc_rdma_wc_send               function pointer call;
+ *                                         reliably traceable
+ *   kprobe:svc_rdma_send_ctxt_put_async   workqueue function
+ *                                         pointer; reliably
+ *                                         traceable
+ *   kprobe:svc_rdma_send_ctxt_get         external callers force
+ *                                         it out-of-line;
+ *                                         reliably traceable
+ *   tracepoint:rpcrdma:svcrdma_send_ctxt_capped
+ *                                         from Ben's patch
+ *
+ * If the tracepoint fails to attach (rpcrdma module's tracepoint
+ * table not visible to bpftrace at attach time), comment out that
+ * probe -- the rest of the script still answers the main question.
+ *
+ * Usage
+ * -----
+ * sudo bpftrace svcrdma-wq-lag.bt
+ *   (Ctrl-C to exit; the final partial-window counts print on exit.)
+ *
+ * Run for 60-120s under the load that previously OOM'd. Watch the
+ * 5-second windows; the gap between @wc_send_rate and @release_rate
+ * is the smoking gun.
+ */
+
+config = {
+ max_map_keys = 4096;
+}
+
+BEGIN {
+ printf("svcrdma_wq lag diagnostic. Per-5s rates.\n");
+ printf("Watch for @wc_send_rate >> @release_rate during the test\n");
+ printf("(workqueue backed up -> ctxts pinned in queue items ->\n");
+ printf(" Ben's cap can't see them).\n");
+ printf("Ctrl-C to exit.\n\n");
+}
+
+kprobe:svc_rdma_wc_send {
+	@wc_send_rate += 1;
+}
+
+kprobe:svc_rdma_send_ctxt_put_async {
+	@release_rate += 1;
+}
+
+kprobe:svc_rdma_send_ctxt_get {
+	@get_rate += 1;
+}
+
+tracepoint:rpcrdma:svcrdma_send_ctxt_capped {
+	@capped_rate += 1;
+}
+
+interval:s:5 {
+	time("\n%H:%M:%S ");
+	printf("wc_send=%-8lld release=%-8lld get=%-8lld capped=%-8lld",
+	       @wc_send_rate, @release_rate,
+	       @get_rate, @capped_rate);
+	if (@release_rate > 0) {
+		printf(" queue/release ratio=%lld",
+		       @wc_send_rate / @release_rate);
+	}
+	printf("\n");
+	clear(@wc_send_rate);
+	clear(@release_rate);
+	clear(@get_rate);
+	clear(@capped_rate);
+}
+
+END { printf("\nFinal partial-window counts:\n"); }
--
2.44.0