* [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache
@ 2026-05-05 21:55 Mike Snitzer
  2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Mike Snitzer @ 2026-05-05 21:55 UTC
  To: Chuck Lever; +Cc: linux-nfs, ben.coddington, jonathan.flynn

Hi,

I drew the short straw and took a hand-off from Ben on work he
started with Claude yesterday in response to a really crazy OOM
situation that hits like a freight train at one large customer's
install, which currently has 121 NFS clients and 9 NFS servers, all
connected with RDMA networking.  Working with Jon Flynn, we later
scaled the testing down to 15 clients reading from 1 server using
16K O_DIRECT reads, to bound the problem a bit more.

So I imported the CLAUDE.md that Ben handed off and carried on.  With
patch 1/2 we're able to avoid OOM killing the NFS servers (each with
128GB).  Without it, memory use under the 16K test workload would grow
from ~12GB to exhaustion (128GB) within ~10 seconds of starting the
test.
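
For anyone unfamiliar with the general pattern named in the patch 1/2
subject (a bounded cache with backpressure on get), here is a minimal
userspace sketch.  It is not the kernel code and makes no claim about
how patch 1/2 is implemented: BoundedCtxtPool, max_ctxts and the
16-byte placeholder context are all hypothetical names and values
chosen for illustration only.

```python
import threading
from collections import deque


class BoundedCtxtPool:
    """Hypothetical stand-in for a per-transport send-context cache."""

    def __init__(self, max_ctxts):
        self._free = deque()  # idle contexts, ready for reuse
        # Caps how many contexts may be handed out at once.
        self._slots = threading.BoundedSemaphore(max_ctxts)
        self._lock = threading.Lock()
        self.allocated = 0  # total contexts ever created

    def get(self, timeout=None):
        # Backpressure: wait for a slot rather than allocate without limit.
        if not self._slots.acquire(timeout=timeout):
            return None  # caller has to back off and retry
        with self._lock:
            if self._free:
                return self._free.popleft()
            self.allocated += 1
            return bytearray(16)  # placeholder for a real context

    def put(self, ctxt):
        # Return the context to the cache, then wake one waiter.
        with self._lock:
            self._free.append(ctxt)
        self._slots.release()
```

Because a new context is only created while holding a slot, the total
number of contexts in this sketch can never exceed max_ctxts, no
matter how far consumers fall behind producers.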

The 2nd patch in this series provides a diagnostic svcrdma-wq-lag.bt
bpf script that Claude suggested -- I just dropped it in
Documentation/filesystems/nfs/ but it isn't intended to go upstream.

Chuck,
Patch 1/2 is marked RFC because ultimately we suspect you'll have a
better way to skin this cat... but Claude was pretty great at helping
us cut through this nasty OOM situation with RDMA.

Please feel free to ask follow-up questions and we'll fill in any
details as best we can.

Thanks,
Mike

Benjamin Coddington (1):
  svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get

Mike Snitzer (1):
  for diagnostic use only: add svcrdma_wq lag diagnostic

 .../filesystems/nfs/svcrdma-wq-lag.bt         | 146 ++++++++++++++++++
 include/linux/sunrpc/svc_rdma.h               |   1 +
 include/trace/events/rpcrdma.h                |   2 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c         |  41 ++++-
 net/sunrpc/xprtrdma/svc_rdma_transport.c      |   1 +
 5 files changed, 185 insertions(+), 6 deletions(-)
 create mode 100755 Documentation/filesystems/nfs/svcrdma-wq-lag.bt

-- 
2.44.0



Thread overview: 6+ messages
2026-05-05 21:55 [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 1/2] svcrdma: bound per-xprt sc_send_ctxts cache and apply backpressure on _get Mike Snitzer
2026-05-06  6:01   ` Chuck Lever
2026-05-06 11:34     ` Mike Snitzer
2026-05-05 21:55 ` [RFC PATCH 2/2] for diagnostic use only: add svcrdma_wq lag diagnostic Mike Snitzer
2026-05-05 22:05 ` [RFC PATCH 0/2] svcrdma: avoid OOM due to unbounded sc_send_ctxts cache Mike Snitzer
