From: Mike Snitzer <snitzer@kernel.org>
To: Chuck Lever <cel@kernel.org>
Cc: Mike Snitzer <snitzer@hammerspace.com>,
linux-nfs@vger.kernel.org, ben.coddington@hammerspace.com,
jonathan.flynn@hammerspace.com
Subject: Re: [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get
Date: Wed, 6 May 2026 07:34:20 -0400
Message-ID: <afsnPPujM016BExw@kernel.org>
In-Reply-To: <0010c891-174d-468a-be80-f53fa60ac5c7@app.fastmail.com>

On Wed, May 06, 2026 at 08:01:50AM +0200, Chuck Lever wrote:
>
>
> On Tue, May 5, 2026, at 11:55 PM, Mike Snitzer wrote:
> > From: Benjamin Coddington <ben.coddington@hammerspace.com>
> >
> > Under sustained heavy load over RDMA, kNFSD servers can pin tens of
> > gigabytes of memory in per-xprt svc_rdma_send_ctxt caches, never
> > released until the connection terminates. A customer site reported
> > OOM kills under heavy NFS READ workloads with ~2.3M cached
> > send_ctxts visible via slab tracing (two stacks in
> > svc_rdma_send_ctxt_alloc, each kmalloc-4k, ~9.5 GB outstanding --
> > the same ctxt population double-counted across the sc_pages and
> > sc_xprt_buf allocations). Aggregated across the customer's ~218
> > long-lived xprts that worked out to roughly 80 GB pinned, freed only
> > by knfsd restart.
> >
> > Root cause is an unbounded cache, not a per-op leak.
> > svc_rdma_send_ctxt_get() pulls from rdma->sc_send_ctxts (an llist) or,
> > on empty, allocates fresh. svc_rdma_send_ctxt_release() always
> > llist_add()s the ctxt back -- regardless of how many ctxts are
> > already on the list. The only kfree() site is
> > svc_rdma_send_ctxts_destroy() at xprt teardown. The list has no
> > shrinker, no cap, no aging: it can only grow.
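> >
> > In sketch form, the pre-patch flow (simplified; locking, DMA setup
> > and error handling omitted; exact field names may differ):
> >
> >     struct svc_rdma_send_ctxt *
> >     svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
> >     {
> >             struct llist_node *node;
> >
> >             node = llist_del_first(&rdma->sc_send_ctxts);
> >             if (!node)
> >                     return svc_rdma_send_ctxt_alloc(rdma); /* cache grows */
> >             return llist_entry(node, struct svc_rdma_send_ctxt, sc_node);
> >     }
> >
> >     static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
> >                                            struct svc_rdma_send_ctxt *ctxt)
> >     {
> >             /* unconditional: nothing ever trims the llist */
> >             llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
> >     }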
> >
> > Two effects compound to drive the high-water mark well above the
> > configured RPC slot count:
> >
> > 1. _put runs through a workqueue. svc_rdma_send_ctxt_put() does
> > INIT_WORK(...) ; queue_work(svcrdma_wq, ...) and returns. The
> > actual _release (which puts the ctxt back on the llist) runs
> > later on svcrdma_wq. Between wc_send -> _put and _put_async ->
> > _release, the ctxt is "in transit" -- off the list, off the SQ,
> > not yet reusable (see the sketch after this list).
> >
> > 2. During that gap, a concurrent _get sees an empty llist and calls
> > _alloc to mint a fresh ctxt. When the in-transit one eventually
> > lands on the llist, the cache has grown by one. Under HCA-driven
> > completion rates with even small workqueue dispatch lag, this
> > happens constantly. The cache settles not at the steady-state
> > in-flight count but at the all-time peak of (in-flight +
> > workqueue-pending), and never shrinks.
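> >
> > The deferral in (1) looks roughly like this (sketch only; the
> > sc_rdma back-pointer and exact signatures are assumptions):
> >
> >     static void svc_rdma_send_ctxt_put_async(struct work_struct *work)
> >     {
> >             struct svc_rdma_send_ctxt *ctxt =
> >                     container_of(work, struct svc_rdma_send_ctxt, sc_work);
> >
> >             /* runs later, whenever svcrdma_wq gets to it */
> >             svc_rdma_send_ctxt_release(ctxt->sc_rdma, ctxt);
> >     }
> >
> >     void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
> >                                 struct svc_rdma_send_ctxt *ctxt)
> >     {
> >             INIT_WORK(&ctxt->sc_work, svc_rdma_send_ctxt_put_async);
> >             queue_work(svcrdma_wq, &ctxt->sc_work);
> >             /* ctxt is now "in transit": off the llist, not yet reusable */
> >     }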
> >
> > Fix: track sc_send_ctxts_depth as the count of *live* ctxts on the
> > xprt (incremented in svc_rdma_send_ctxt_alloc, decremented in
> > svc_rdma_send_ctxt_destroy). Apply the cap in two places, sketched
> > below:
> >
> > - svc_rdma_send_ctxt_get(): when the llist is empty and depth has
> > reached sc_max_requests, return NULL instead of allocating. The
> > caller drops the connection; the client reconnects with a fresh
> > xprt that starts at depth zero. This is the backpressure point
> > that prevents in-test memory growth -- it stops new allocations
> > regardless of where in the pipeline existing ctxts are stuck.
> >
> > - svc_rdma_send_ctxt_release(): if depth has overshot the cap (race
> > between concurrent _get callers, or transient burst), free the
> > ctxt instead of returning it to the llist. This keeps depth
> > convergent.
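> >
> > With the cap, the same two functions become, in outline (the depth
> > counter is assumed to be an atomic_t; tracepoint arguments and exact
> > signatures are simplified):
> >
> >     struct svc_rdma_send_ctxt *
> >     svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
> >     {
> >             struct llist_node *node;
> >
> >             node = llist_del_first(&rdma->sc_send_ctxts);
> >             if (node)
> >                     return llist_entry(node, struct svc_rdma_send_ctxt,
> >                                        sc_node);
> >             if (atomic_read(&rdma->sc_send_ctxts_depth) >=
> >                 rdma->sc_max_requests)
> >                     return NULL;   /* backpressure: caller drops the xprt */
> >             return svc_rdma_send_ctxt_alloc(rdma);  /* increments depth */
> >     }
> >
> >     static void svc_rdma_send_ctxt_release(struct svcxprt_rdma *rdma,
> >                                            struct svc_rdma_send_ctxt *ctxt)
> >     {
> >             if (atomic_read(&rdma->sc_send_ctxts_depth) >
> >                 rdma->sc_max_requests) {
> >                     trace_svcrdma_send_ctxt_capped(rdma);
> >                     /* frees the ctxt and decrements the depth count */
> >                     svc_rdma_send_ctxt_destroy(rdma, ctxt);
> >                     return;
> >             }
> >             llist_add(&ctxt->sc_node, &rdma->sc_send_ctxts);
> >     }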
> >
> > The cap is sc_max_requests because:
> > - It is the configured number of credit slots per xprt -- the client
> > can have at most this many RPCs outstanding on the transport.
> > - Each RPC reply uses one send_ctxt at a time; concurrent in-flight
> > ctxts therefore cannot legitimately exceed sc_max_requests in
> > steady state.
> > - Workqueue lag can momentarily push (in-flight + queued) above
> > sc_max_requests, but those ctxts are exactly what the cap should
> > shed -- they are not steady-state working set, just lag-inflation.
> >
> > The reuse semantics of the cache are intentional and unchanged: ctxts
> > keep their first SGE DMA-mapped across cycles, so the steady-state
> > hot path stays alloc-free. Only the *excess* ctxts are freed.
> >
> > A simple-CID tracepoint, svcrdma_send_ctxt_capped, fires once per
> > freed-by-cap ctxt, so operators can confirm the cap is doing real
> > work on a given workload.
> >
> > == Verification on the test rig ==
> >
> > Diagnostic tool: svcrdma-wq-lag.bt (will be provided in reply to
> > this patch) -- per-5s rates of wc_send (queue inflow), _put_async
> > (workqueue dispatch), _get (demand), and the new tracepoint.
> >
> > Negative case (cap on on-llist depth alone, with the atomic
> > incremented in _release and decremented in _get), sustained NFS
> > READ load:
> > wc_send ~432K/s, release ~342K/s
> > -> ~90K/s of ctxts pinned as queued sc_work items
> > -> ~2.7M pinned after 30s; matches the slab measurement
> > -> svcrdma_send_ctxt_capped fires 0 times during the test, then
> > floods (~3.25M events) on test stop as the workqueue catches up
> >
> > The cap is structurally blind to ctxts pinned in workqueue items
> > because depth only counts what's currently on the llist; during
> > sustained load almost nothing makes it onto the llist before the
> > next _get takes it back off. Inflation accumulates as queued
> > sc_work items, invisible to the cap, until load stops.
> >
> > Post-patch (depth tracked at alloc/destroy + _get backpressure),
> > same workload, 5 minutes:
> > wc_send and release rates match within 1% (~410K/s each)
> > No accumulation; no flood at test stop
> > svcrdma_send_ctxt_capped fires ~13 times total (rare overshoot
> > recovery)
> > Throughput slightly higher than the negative case (cache no longer
> > bloats the slab/page allocator into reclaim)
> >
> > The persistent wc_send/release gap in the negative case was itself
> > a consequence of the unbounded growth: cache bloat -> slab pressure
> > -> reclaim activity -> workqueue starvation -> larger gap. Once
> > the cap breaks that spiral, the workqueue runs at full capacity and
> > the rates equalize.
> >
> > Operators can confirm the cap is doing real work via:
> > cd /sys/kernel/tracing
> > echo 1 > events/rpcrdma/svcrdma_send_ctxt_capped/enable
> > cat trace_pipe
> >
> > If a workload genuinely needs more than sc_max_requests concurrent
> > in-flight Sends, raise it via the sunrpc.svcrdma_max_requests sysctl
> > rather than removing the cap.
>
> The current svcrdma design assumes that the workqueue always keeps up
> with the transport's Reply transmission rate. For the send_ctxt cache
> to reach 2.3M ctxts, the workqueue itself must be running orders of
> magnitude slower than wc_send.
>
> The "in transit" gap between wc_send -> _put and _put_async ->
> _release is structurally one workqueue dispatch; under normal
> scheduling, microseconds. The verification numbers show a sustained
> 90K/s deficit (432K wc_send/s vs 342K release/s) that accumulates
> linearly. That deficit is the actual pathology here...
>
> So I'm wondering: Why does svcrdma_wq lag at steady state? _put_async
> is small work -- ib_dma_unmap_page_list plus an llist_add. On a busy
> box it should easily outpace wc_send, not trail it by ~21%.
>
> The cover letter posits a spiral: cache bloat -> slab pressure ->
> reclaim -> workqueue starvation -> more bloat. That makes the eventual
> collapse plausible, but it does not establish what initiates the
> spiral. If the workqueue keeps up at low cache size and only loses
> ground after slab/reclaim kicks in, the early-load gap should be
> small and growing. If the gap is fixed at 21% from second zero,
> something structural is throttling the workqueue independent of slab
> state.
>
> Under sustained workqueue lag the cap fires not because the working
> set exceeds sc_max_requests but because the workqueue backlog does.
> Capping at sc_max_requests then translates "workqueue is slow" into
> "drop the connection." That bounds memory but substitutes one failure
> mode for another, and it punishes the client for what is essentially
> a server-side scheduling problem. Not to mention the risks inherent
> in repeated spurious connection loss.

Thanks for all that context. Yes, the imposed connection loss isn't
ideal when the cap hits, but in practice it's relatively rare even with
the more extreme small (16K) I/O workload. With larger I/O sizes the
workqueue is able to keep up.

> I have a couple of patches that replace the use of svcrdma_wq, and
> could alleviate the spiral issue. Would you be interested in trying
> them with your reproducer?

Yes, please share and we'll do our best to get time on the system.
We have some constraints at the moment that confine us to only making
module changes (otherwise, if core kernel changes are needed, it'd
trigger a much more involved general kernel update that'd require
coordination with the admins due to the netboot infra).

Thanks,
Mike