* [PATCH RFC 0/3] SUNRPC: a latency floor for interactive clients via sparse-flow dispatch
From: Benjamin Coddington @ 2026-06-24 17:04 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker
  Cc: Daire Byrne, linux-nfs

This RFC follows the per-client fair-scheduling discussion from May/June
[1], and Chuck's slot-growth clamp that grew out of it [2].  It takes the
direction that thread settled on: rather than enforce proportional
fairness between clients, give an interactive client a latency floor that
a busy neighbour cannot push it below.

Problem
-------

nfsd dispatches ready transports per-pool in FIFO order, with no notion of
how much work each transport already has outstanding.  A client that keeps
many requests in flight -- nconnect, a deep v4.1 slot table, or simply
several connections -- sits in that queue on equal terms with a client
that has been idle.  So the request a user is waiting on (a stat, an open,
a directory listing) waits behind a backup job's backlog, even though
servicing it costs a single round trip.  The user-visible symptom is an
interactive session that stutters whenever a bulk client is busy.

Approach
--------

This is starvation avoidance, not proportional fairness (per Chuck's
reframe on [1]).  The protected unit is the interactive cycle: one user
command fans out into a burst of correlated RPCs, so it is the whole
burst, not a single RPC, that must be allowed to jump ahead.

The signal is already on every transport: xpt_nr_rqsts, the count of
requests in flight.  A transport with nothing in flight when new work
arrives is, by definition, not the one hogging threads -- so it goes on a
second, high-priority per-pool queue that is drained ahead of the normal
one.  To cover the whole burst rather than only its leading edge, an
idle->active transport is granted a budget of 64 high-priority dispatches,
spent down as the burst is serviced and refilled only when the transport
next idles.  A transport that never idles spends its budget once and then
shares the normal queue like everyone else.
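
As a rough illustration of the rule above, here is a minimal user-space
model.  It is a sketch, not the code in patches 2 and 3: struct flow,
flow_classify() and the other helpers are hypothetical names, nr_rqsts
merely stands in for xpt_nr_rqsts, and the model assumes one credit is
spent per enqueue.

```c
#include <assert.h>

#define BURST_BUDGET 64	/* the budget quoted in this cover letter */

/* Hypothetical stand-in for a transport; this is not struct svc_xprt. */
struct flow {
	unsigned int nr_rqsts;	/* requests in flight (models xpt_nr_rqsts) */
	unsigned int budget;	/* high-priority dispatches remaining */
};

enum tier { TIER_NORMAL, TIER_HIGH };

/* Decide which ready queue a flow joins when new work arrives for it. */
static enum tier flow_classify(struct flow *f)
{
	if (f->nr_rqsts == 0)
		f->budget = BURST_BUDGET;	/* idle->active: refill */

	if (f->budget > 0) {
		f->budget--;			/* spend one credit */
		return TIER_HIGH;
	}
	return TIER_NORMAL;			/* budget spent: share the normal queue */
}

/* A request starts being serviced. */
static void flow_dispatch(struct flow *f)
{
	f->nr_rqsts++;
}

/* A request finishes; the flow is idle again once this reaches zero. */
static void flow_complete(struct flow *f)
{
	assert(f->nr_rqsts > 0);
	f->nr_rqsts--;
}

/*
 * Offer n back-to-back requests with no completions in between and
 * return how many of them were classified high-priority.
 */
static unsigned int burst_high_count(struct flow *f, unsigned int n)
{
	unsigned int high = 0;

	for (unsigned int i = 0; i < n; i++) {
		if (flow_classify(f) == TIER_HIGH)
			high++;
		flow_dispatch(f);
	}
	return high;
}
```

In this model an idle flow offering 32 requests has all 32 classified
high-priority, one offering 96 gets 64 high-priority and 32 normal, and a
flow that never drops to zero in flight gets no refill.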

There is no client identity anywhere in this -- it keys only on a
transport's own in-flight depth -- so it covers v3, v4.0 and v4.1
uniformly, needs no lookup or lifecycle, adds no lock (the existing lwq
barriers carry the sleep/wake), and is always on with nothing to tune.
(LOCALIO is out of scope: a local client's reads and writes bypass the
RPC dispatch path entirely, so they neither contend here nor are
reordered by this change.)

  patch 1  add the second per-pool ready queue (no functional change)
  patch 2  dispatch idle transports ahead of backlogged ones
  patch 3  grant the 64-RPC burst budget

Relationship to Chuck's slot-growth RFC
---------------------------------------

This series is based on Chuck's "Stop NFSv4.1 slot-growth heuristic from
rewarding busy clients" [2], and both kernels A and B in the Results below
include it -- so the A/B delta isolates this series and the clamp does not
account for the improvement.

The two are complementary, not overlapping.  [2] stops nfsd from growing a
busy session's slot table past the thread ceiling: nfsd slot accounting.
This series changes the order in which ready transports are dispatched:
sunrpc dispatch.  Basing on [2] changes neither the design nor the result
here -- a slot-capped session still fills the pool, so the dispatch-layer
floor is still wanted on top of it.

Results
-------

Because the goal is a latency floor and not a share, the measurement
differs from the earlier bucket RFC and the numbers are not comparable:
that series measured a greedy client's share of throughput; this one
measures how long an interactive client's command takes to complete while
a neighbour is busy.

Workload: a backlogged aggressor (K connections, deep offered load,
saturating the pool) shares the server with one interactive client that
issues a burst of N requests, waits for all replies, idles 50ms, and
repeats.  A 10ms service time is injected per op (a test-only debug hook,
not part of this series) so the measurement isolates the dispatch path from
filesystem work.  A is stock nfsd-testing; B is the same plus this series.
Figures are NFSv3 burst-completion p50 in ms.  NFSv4.1 tracks within a few
percent (the classifier uses no source identity, so the v3 runs work on
plain loopback and behave the same).

Interactive burst completion vs aggressor connection count K (N=32;
unobstructed floor 45ms):

    K       4      8     16     32
    A    81.6  242.0  432.9  826.6
    B    60.9   64.5   65.4   64.4
   A/B   1.3x   3.8x   6.6x  12.8x

Baseline climbs ~linearly with the aggressor's backlog; patched holds the
floor no matter how deep that backlog is -- the property that matters.

vs burst size N (K=16), showing the 64-credit knee (floor for reference):

    N        1      8     32     64     96    128
   floor  14.9   31.9   44.8   71.5   96.8  122.4
    A     28.6  122.0  434.2  862.4 1272.6 1696.3
    B     16.7   36.0   65.1  124.5  546.4  961.6

For N <= 64 the whole burst is covered and tracks the floor (B within
1.7x); beyond 64 the uncovered tail -- the N-64 RPCs that spill to the
normal queue -- degrades toward baseline (B/floor jumps 1.7x -> 5.6x
across the 64->96 boundary).  The knee at exactly 64 is the budget working
as designed, not a regression.
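
The ratios quoted above can be re-derived from the K=16 table.  The
snippet below only copies the table's p50 figures and divides them; the
array and helper names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* p50 burst completion in ms, copied from the K=16 table. */
static const int    burst_n[]  = {    1,     8,    32,    64,     96,    128 };
static const double floor_ms[] = { 14.9,  31.9,  44.8,  71.5,   96.8,  122.4 };
static const double a_ms[]     = { 28.6, 122.0, 434.2, 862.4, 1272.6, 1696.3 };
static const double b_ms[]     = { 16.7,  36.0,  65.1, 124.5,  546.4,  961.6 };

#define NPOINTS (sizeof(burst_n) / sizeof(burst_n[0]))

/* How far the patched kernel sits above the unobstructed floor. */
static double b_over_floor(size_t i)
{
	assert(i < NPOINTS);
	return b_ms[i] / floor_ms[i];
}

/* Improvement of the patched kernel over baseline. */
static double a_over_b(size_t i)
{
	assert(i < NPOINTS);
	return a_ms[i] / b_ms[i];
}

/* Index of the first burst size whose B/floor exceeds limit, or NPOINTS. */
static size_t first_above(double limit)
{
	for (size_t i = 0; i < NPOINTS; i++)
		if (b_over_floor(i) > limit)
			return i;
	return NPOINTS;
}
```

With these inputs b_over_floor() is about 1.74 at N=64 and about 5.64 at
N=96, and first_above(2.0) lands on the N=96 column, the first burst size
that exceeds the budget.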

Aggregate throughput (aggressor + interactive) is within 1% A vs B at
saturation (~1290 ops/s) -- the dispatcher reorders, it does not throttle.
With no aggressor the interactive floor is identical A vs B; the kernels
differ only under contention.

Figures are p50; longer runs are available for firmer tails.

What this deliberately does not do
----------------------------------

- It is not proportional fairness.  Two backlogged clients still split the
  pool by connection count; this protects only flows that idle.

- The normal queue can wait under sustained high-priority load.  v1 drains
  high-priority strictly first; bounding that starvation -- an inter-tier
  deficit round-robin, the way fq_codel actually interleaves its sparse
  and bulk tiers -- is left for a follow-up, pending the question below.

- The 64-dispatch budget defeats a pipelined aggressor (it lands wholly in
  the normal queue), but a client that shapes its traffic to look sparse
  -- send <=64, drain, repeat -- can still scale its high-priority share
  with connection count.  Whether to harden against that is the open
  question: if this is cooperative scheduling (Neil's framing) a bulk
  floor suffices; if it is an attack surface (Chuck's framing) the
  high-priority tier needs intra-tier fairness, which reintroduces
  identity.
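
To make the second point concrete, here is a toy model of the dequeue
side.  With high_quantum set to 0 it drains high-priority strictly first,
as this series does; a non-zero high_quantum forces one normal-queue pick
after that many consecutive high-priority picks.  That bound is a plain
fixed-ratio interleave, simpler than deficit round-robin, and is only one
possible shape for the follow-up, not something this series implements.
All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum tier { TIER_NONE, TIER_NORMAL, TIER_HIGH };

/* Hypothetical per-pool model: counts stand in for the two ready queues. */
struct pool_model {
	unsigned int high_ready;	/* entries on the high-priority queue */
	unsigned int normal_ready;	/* entries on the normal queue */
	unsigned int high_streak;	/* high picks since the last normal pick */
	unsigned int high_quantum;	/* 0 = strict priority; else max streak */
};

/* Pick the queue the next idle thread should service. */
static enum tier pool_pick(struct pool_model *p)
{
	bool force_normal = p->high_quantum != 0 &&
			    p->high_streak >= p->high_quantum &&
			    p->normal_ready > 0;

	if (p->high_ready > 0 && !force_normal) {
		p->high_ready--;
		p->high_streak++;
		return TIER_HIGH;
	}
	if (p->normal_ready > 0) {
		p->normal_ready--;
		p->high_streak = 0;
		return TIER_NORMAL;
	}
	return TIER_NONE;
}

/* Number of normal-queue entries serviced in the next `picks` picks. */
static unsigned int normal_served(struct pool_model *p, unsigned int picks)
{
	unsigned int served = 0;

	for (unsigned int i = 0; i < picks; i++)
		if (pool_pick(p) == TIER_NORMAL)
			served++;
	return served;
}
```

With 100 entries ready on each queue, strict priority services no
normal-queue entry in the first 100 picks; with high_quantum = 4 the
normal queue gets every fifth pick.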

Open questions for the list: is the budget (64 here) the right shape, and
should it be fixed, derived from the thread count, or hinted by the v4.1
slot table?  And does the cooperative framing hold, or must we resist
deliberate sparse-shaping?

[1] https://lore.kernel.org/linux-nfs/cover.1780498019.git.bcodding@hammerspace.com/
[2] https://lore.kernel.org/linux-nfs/20260610-nfsd-slot-growth-clamp-v1-0-7b966700df0b@kernel.org/

Benjamin Coddington (3):
  SUNRPC: add a second per-pool ready queue for high-priority transports
  SUNRPC: dispatch idle transports ahead of backlogged ones
  SUNRPC: grant an idle flow a burst allowance on the high-priority
    queue

 include/linux/sunrpc/svc.h      |  1 +
 include/linux/sunrpc/svc_xprt.h |  1 +
 net/sunrpc/svc.c                |  1 +
 net/sunrpc/svc_xprt.c           | 52 +++++++++++++++++++++++----------
 4 files changed, 39 insertions(+), 16 deletions(-)

-- 
2.53.0

