[PATCH RFC 0/3] SUNRPC: a latency floor for interactive clients via sparse-flow dispatch

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/3] SUNRPC: a latency floor for interactive clients via sparse-flow dispatch
@ 2026-06-24 17:04 Benjamin Coddington
  2026-06-24 17:04 ` [PATCH RFC 1/3] SUNRPC: add a second per-pool ready queue for high-priority transports Benjamin Coddington
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Benjamin Coddington @ 2026-06-24 17:04 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker
  Cc: Daire Byrne, linux-nfs

This RFC follows the per-client fair-scheduling discussion from May/June
[1], and Chuck's slot-growth clamp that grew out of it [2].  It takes the
direction that thread settled on: rather than enforce proportional
fairness between clients, give an interactive client a latency floor that
a busy neighbour cannot push it below.

Problem
-------

nfsd dispatches ready transports per-pool in FIFO order, with no notion of
how much work each transport already has outstanding.  A client that keeps
many requests in flight -- nconnect, a deep v4.1 slot table, or simply
several connections -- sits in that queue on equal terms with a client
that has been idle.  So the request a user is waiting on (a stat, an open,
a directory listing) waits behind a backup job's backlog, even though
servicing it costs a single round trip.  The user-visible symptom is an
interactive session that stutters whenever a bulk client is busy.

Approach
--------

This is starvation avoidance, not proportional fairness (per Chuck's
reframe on [1]).  The protected unit is the interactive cycle: one user
command fans out into a burst of correlated RPCs, so it is the whole
burst, not a single RPC, that must be allowed to jump ahead.

The signal is already on every transport: xpt_nr_rqsts, the count of
requests in flight.  A transport with nothing in flight when new work
arrives is, by definition, not the one hogging threads -- so it goes on a
second, high-priority per-pool queue that is drained ahead of the normal
one.  To cover the whole burst rather than only its leading edge, an
idle->active transport is granted a budget of 64 high-priority dispatches,
spent down as the burst is serviced and refilled only when the transport
next idles.  A transport that never idles spends its budget once and then
shares the normal queue like everyone else.

There is no client identity anywhere in this -- it keys only on a
transport's own in-flight depth -- so it covers v3, v4.0 and v4.1
uniformly, needs no lookup or lifecycle, adds no lock (the existing lwq
barriers carry the sleep/wake), and is always on with nothing to tune.
(LOCALIO is out of scope: a local client's reads and writes bypass the
RPC dispatch path entirely, so they neither contend here nor are
reordered by this change.)

  patch 1  add the second per-pool ready queue (no functional change)
  patch 2  dispatch idle transports ahead of backlogged ones
  patch 3  grant the 64-RPC burst budget

Relationship to Chuck's slot-growth RFC
---------------------------------------

This series is based on Chuck's "Stop NFSv4.1 slot-growth heuristic from
rewarding busy clients" [2], and both A and B above include it -- so the
A/B delta isolates this series and the clamp does not account for the
improvement.

The two are complementary, not overlapping.  [2] stops nfsd from growing a
busy session's slot table past the thread ceiling: nfsd slot accounting.
This series changes the order in which ready transports are dispatched:
sunrpc dispatch.  Basing on [2] changes neither the design nor the result
here -- a slot-capped session still fills the pool, so the dispatch-layer
floor is still wanted on top of it.

Results
-------

Because the goal is a latency floor and not a share, the measurement
differs from the earlier bucket RFC and the numbers are not comparable:
that series measured a greedy client's share of throughput; this one
measures how long an interactive client's command takes to complete while
a neighbour is busy.

Workload: a backlogged aggressor (K connections, deep offered load,
saturating the pool) shares the server with one interactive client that
issues a burst of N requests, waits for all replies, idles 50ms, and
repeats.  A 10ms service time is injected per op (a test-only debug hook,
not part of this series) so the measurement isolates the dispatch path from
filesystem work.  A is stock nfsd-testing; B is the same plus this series.
Figures are NFSv3 burst-completion p50 in ms; NFSv4.1 tracks within a few
percent (the classifier uses no source identity, so v3 runs on plain
loopback and behaves the same).

Interactive burst completion vs aggressor connection count K (N=32;
unobstructed floor 45ms):

    K       4      8     16     32
    A    81.6  242.0  432.9  826.6
    B    60.9   64.5   65.4   64.4
         1.3x   3.8x   6.6x  12.8x

Baseline climbs ~linearly with the aggressor's backlog; patched holds the
floor no matter how deep that backlog is -- the property that matters.

vs burst size N (K=16), showing the 64-credit knee (floor for reference):

    N        1      8     32     64     96    128
   floor  14.9   31.9   44.8   71.5   96.8  122.4
    A     28.6  122.0  434.2  862.4 1272.6 1696.3
    B     16.7   36.0   65.1  124.5  546.4  961.6

For N <= 64 the whole burst is covered and tracks the floor (B within
1.7x); beyond 64 the uncovered tail -- the N-64 RPCs that spill to the
normal queue -- degrades toward baseline (B/floor jumps 1.7x -> 5.6x
across the 64->96 boundary).  The knee at exactly 64 is the budget working
as designed, not a regression.

Aggregate throughput (aggressor + interactive) is within 1% A vs B at
saturation (~1290 ops/s) -- the dispatcher reorders, it does not throttle.
With no aggressor the interactive floor is identical A vs B; the kernels
differ only under contention.

Figures are p50; longer runs are available for firmer tails.

What this deliberately does not do
----------------------------------

- It is not proportional fairness.  Two backlogged clients still split the
  pool by connection count; this protects only flows that idle.

- The normal queue can wait under sustained high-priority load.  v1 drains
  high-priority strictly first; bounding that starvation -- an inter-tier
  deficit round-robin, the way fq_codel actually interleaves its sparse
  and bulk tiers -- is left for a follow-up, pending the question below.

- The 64-dispatch budget defeats a pipelined aggressor (it lands wholly in
  the normal queue), but a client that shapes its traffic to look sparse
  -- send <=64, drain, repeat -- can still scale its high-priority share
  with connection count.  Whether to harden against that is the open
  question: if this is cooperative scheduling (Neil's framing) a bulk
  floor suffices; if it is an attack surface (Chuck's framing) the
  high-priority tier needs intra-tier fairness, which reintroduces
  identity.

Open questions for the list: is the budget (64 here) the right shape, and
should it be fixed, derived from the thread count, or hinted by the v4.1
slot table?  And does the cooperative framing hold, or must we resist
deliberate sparse-shaping?

[1] https://lore.kernel.org/linux-nfs/cover.1780498019.git.bcodding@hammerspace.com/
[2] https://lore.kernel.org/linux-nfs/20260610-nfsd-slot-growth-clamp-v1-0-7b966700df0b@kernel.org/

Benjamin Coddington (3):
  SUNRPC: add a second per-pool ready queue for high-priority transports
  SUNRPC: dispatch idle transports ahead of backlogged ones
  SUNRPC: grant an idle flow a burst allowance on the high-priority
    queue

 include/linux/sunrpc/svc.h      |  1 +
 include/linux/sunrpc/svc_xprt.h |  1 +
 net/sunrpc/svc.c                |  1 +
 net/sunrpc/svc_xprt.c           | 52 +++++++++++++++++++++++----------
 4 files changed, 39 insertions(+), 16 deletions(-)

-- 
2.53.0

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH RFC 1/3] SUNRPC: add a second per-pool ready queue for high-priority transports
  2026-06-24 17:04 [PATCH RFC 0/3] SUNRPC: a latency floor for interactive clients via sparse-flow dispatch Benjamin Coddington
@ 2026-06-24 17:04 ` Benjamin Coddington
  2026-06-24 17:04 ` [PATCH RFC 2/3] SUNRPC: dispatch idle transports ahead of backlogged ones Benjamin Coddington
  2026-06-24 17:04 ` [PATCH RFC 3/3] SUNRPC: grant an idle flow a burst allowance on the high-priority queue Benjamin Coddington
  2 siblings, 0 replies; 4+ messages in thread
From: Benjamin Coddington @ 2026-06-24 17:04 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker
  Cc: Daire Byrne, linux-nfs

A pool's ready transports are tracked in a single lockless queue,
sp_xprts, and dispatched in FIFO order.  A transport's position in that
queue is independent of how much work it already has in flight, so a
connection that keeps many requests pending is serviced on equal terms
with one that has been idle -- the idle flow's next request waits behind
the busy flow's backlog.

Add a second queue, sp_xprts_hi, alongside sp_xprts and initialise it
with each pool.  Nothing is enqueued to it yet, so this is no functional
change; a later patch routes transports with no requests in flight onto
it and drains it ahead of sp_xprts, giving an idle flow's request a
latency floor that a backlogged flow cannot push it below.

Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
---
 include/linux/sunrpc/svc.h | 1 +
 net/sunrpc/svc.c           | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 3a0152d926fb..89fa90f6d360 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -39,6 +39,7 @@ struct svc_pool {
 	unsigned int		sp_nrthrmin;	/* Min number of threads to run per pool */
 	unsigned int		sp_nrthrmax;	/* Max requested number of threads in pool */
 	struct lwq		sp_xprts;	/* pending transports */
+	struct lwq		sp_xprts_hi;	/* high-priority pending transports */
 	struct list_head	sp_all_threads;	/* all server threads */
 	struct llist_head	sp_idle_threads; /* idle server threads */
 
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index dd80a2eaaa74..7ead2db314c7 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -566,6 +566,7 @@ __svc_create(struct svc_program *prog, int nprogs, struct svc_stat *stats,
 
 		pool->sp_id = i;
 		lwq_init(&pool->sp_xprts);
+		lwq_init(&pool->sp_xprts_hi);
 		INIT_LIST_HEAD(&pool->sp_all_threads);
 		init_llist_head(&pool->sp_idle_threads);
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC 2/3] SUNRPC: dispatch idle transports ahead of backlogged ones
  2026-06-24 17:04 [PATCH RFC 0/3] SUNRPC: a latency floor for interactive clients via sparse-flow dispatch Benjamin Coddington
  2026-06-24 17:04 ` [PATCH RFC 1/3] SUNRPC: add a second per-pool ready queue for high-priority transports Benjamin Coddington
@ 2026-06-24 17:04 ` Benjamin Coddington
  2026-06-24 17:04 ` [PATCH RFC 3/3] SUNRPC: grant an idle flow a burst allowance on the high-priority queue Benjamin Coddington
  2 siblings, 0 replies; 4+ messages in thread
From: Benjamin Coddington @ 2026-06-24 17:04 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker
  Cc: Daire Byrne, linux-nfs

A pool dispatches ready transports in FIFO order, so a connection that
already has requests in flight sits in the same queue, on equal terms,
as one that has been idle.  A client driving many concurrent requests
therefore delays the next request of an interactive client whose
previous reply has already completed: the interactive request waits
behind the busy client's backlog even though servicing it costs a
single round trip.

Route a transport with no requests in flight onto the new high-priority
queue, sp_xprts_hi, and drain that queue ahead of sp_xprts.  xpt_nr_rqsts
is incremented when a thread reserves a request slot and decremented when
the reply completes, so at enqueue time it counts the transport's prior
in-flight work, not the request that just arrived: a flow whose last
reply has completed reads zero and jumps the queue, while a flow with
requests still in flight stays in the bulk queue.  The classification
needs no client identity and no lock -- each queue is an independent lwq
with its own ordering, so svc_thread_should_sleep() simply tests both.

This gives an idle flow's request a latency floor that a backlogged flow
cannot push it below.  It is starvation avoidance, not proportional
fairness: under sustained load across many idle flows the bulk queue can
wait, and bounding that is left to later work.

Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
---
 net/sunrpc/svc_xprt.c | 44 +++++++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 63d1002e63e7..ec4c05094e9a 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -523,7 +523,10 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 
 	percpu_counter_inc(&pool->sp_sockets_queued);
 	xprt->xpt_qtime = ktime_get();
-	lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts);
+	if (atomic_read(&xprt->xpt_nr_rqsts))
+		lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts);
+	else
+		lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts_hi);
 
 	svc_pool_wake_idle_thread(pool);
 }
@@ -536,7 +539,9 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
 {
 	struct svc_xprt	*xprt = NULL;
 
-	xprt = lwq_dequeue(&pool->sp_xprts, struct svc_xprt, xpt_ready);
+	xprt = lwq_dequeue(&pool->sp_xprts_hi, struct svc_xprt, xpt_ready);
+	if (!xprt)
+		xprt = lwq_dequeue(&pool->sp_xprts, struct svc_xprt, xpt_ready);
 	if (xprt)
 		svc_xprt_get(xprt);
 	return xprt;
@@ -759,7 +764,7 @@ svc_thread_should_sleep(struct svc_rqst *rqstp)
 		return false;
 
 	/* was a socket queued? */
-	if (!lwq_empty(&pool->sp_xprts))
+	if (!lwq_empty(&pool->sp_xprts_hi) || !lwq_empty(&pool->sp_xprts))
 		return false;
 
 	/* are we shutting down? */
@@ -1183,26 +1188,33 @@ static int svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, st
 	return ret;
 }
 
-static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
+static void svc_clean_up_queue(struct lwq *queue, struct net *net)
 {
 	struct svc_xprt *xprt;
+	struct llist_node *q, **t1, *t2;
+
+	q = lwq_dequeue_all(queue);
+	lwq_for_each_safe(xprt, t1, t2, &q, xpt_ready) {
+		if (xprt->xpt_net == net) {
+			set_bit(XPT_CLOSE, &xprt->xpt_flags);
+			svc_delete_xprt(xprt);
+			xprt = NULL;
+		}
+	}
+
+	if (q)
+		lwq_enqueue_batch(q, queue);
+}
+
+static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
+{
 	int i;
 
 	for (i = 0; i < serv->sv_nrpools; i++) {
 		struct svc_pool *pool = &serv->sv_pools[i];
-		struct llist_node *q, **t1, *t2;
-
-		q = lwq_dequeue_all(&pool->sp_xprts);
-		lwq_for_each_safe(xprt, t1, t2, &q, xpt_ready) {
-			if (xprt->xpt_net == net) {
-				set_bit(XPT_CLOSE, &xprt->xpt_flags);
-				svc_delete_xprt(xprt);
-				xprt = NULL;
-			}
-		}
 
-		if (q)
-			lwq_enqueue_batch(q, &pool->sp_xprts);
+		svc_clean_up_queue(&pool->sp_xprts_hi, net);
+		svc_clean_up_queue(&pool->sp_xprts, net);
 	}
 }
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC 3/3] SUNRPC: grant an idle flow a burst allowance on the high-priority queue
  2026-06-24 17:04 [PATCH RFC 0/3] SUNRPC: a latency floor for interactive clients via sparse-flow dispatch Benjamin Coddington
  2026-06-24 17:04 ` [PATCH RFC 1/3] SUNRPC: add a second per-pool ready queue for high-priority transports Benjamin Coddington
  2026-06-24 17:04 ` [PATCH RFC 2/3] SUNRPC: dispatch idle transports ahead of backlogged ones Benjamin Coddington
@ 2026-06-24 17:04 ` Benjamin Coddington
  2 siblings, 0 replies; 4+ messages in thread
From: Benjamin Coddington @ 2026-06-24 17:04 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, Trond Myklebust, Anna Schumaker
  Cc: Daire Byrne, linux-nfs

Routing only the leading request of a flow to the high-priority queue
protects single, isolated round trips but not interactive work: one
client request -- a syscall, a page fault against an NFS-backed file --
fans out into many correlated RPCs.  With a budget of one, the first
jumps the queue and the rest fall to the bulk queue and finish at the
aggressor's rate, so the command completes no sooner than before.

Grant a transport a budget of SVC_XPRT_HI_BURST high-priority dispatches
when it becomes active after idling (xpt_nr_rqsts == 0 at enqueue),
spending one credit per dispatch and falling back to the bulk queue once
the budget is gone.  The whole burst a request fans out into is now
serviced ahead of backlogged flows, not just its leading edge.

The budget refills only on an idle-to-active transition, so a flow that
stays continuously backlogged spends its budget once and then lives in
the bulk queue: a client cannot hold the high-priority queue by keeping
its connections busy or by opening more of them.  xpt_hi_credit is
touched only under the XPT_BUSY serialisation in svc_xprt_enqueue, so it
needs no atomic or lock.

SVC_XPRT_HI_BURST is a fixed 64 for now -- generous enough to cover a
typical request's RPC fan-out, and cheap to over-provision since a
backlogged flow rides the high-priority queue for at most that many
dispatches per idle period.

Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
---
 include/linux/sunrpc/svc_xprt.h |  1 +
 net/sunrpc/svc_xprt.c           | 14 +++++++++++---
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index da2a2531e110..18b2c4237f1f 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -61,6 +61,7 @@ struct svc_xprt {
 	struct svc_serv		*xpt_server;	/* service for transport */
 	atomic_t    	    	xpt_reserved;	/* space on outq that is rsvd */
 	atomic_t		xpt_nr_rqsts;	/* Number of requests */
+	int			xpt_hi_credit;	/* high-priority dispatch budget */
 	struct mutex		xpt_mutex;	/* to serialize sending data */
 	spinlock_t		xpt_lock;	/* protects sk_deferred
 						 * and xpt_auth_cache */
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ec4c05094e9a..6b0da45aede6 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -499,6 +499,10 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
 	return false;
 }

+/* High-priority dispatch budget granted to a flow when it becomes active
+ * after idling -- sized to cover an interactive request's burst of RPCs. */
+#define SVC_XPRT_HI_BURST	64
+
 /**
  * svc_xprt_enqueue - Queue a transport on an idle nfsd thread
  * @xprt: transport with data pending
@@ -523,10 +527,14 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)

 	percpu_counter_inc(&pool->sp_sockets_queued);
 	xprt->xpt_qtime = ktime_get();
-	if (atomic_read(&xprt->xpt_nr_rqsts))
-		lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts);
-	else
+	if (atomic_read(&xprt->xpt_nr_rqsts) == 0)
+		xprt->xpt_hi_credit = SVC_XPRT_HI_BURST;
+	if (xprt->xpt_hi_credit > 0) {
+		xprt->xpt_hi_credit--;
 		lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts_hi);
+	} else {
+		lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts);
+	}

 	svc_pool_wake_idle_thread(pool);
 }
-- 
2.53.0

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-06-24 17:04 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24 17:04 [PATCH RFC 0/3] SUNRPC: a latency floor for interactive clients via sparse-flow dispatch Benjamin Coddington
2026-06-24 17:04 ` [PATCH RFC 1/3] SUNRPC: add a second per-pool ready queue for high-priority transports Benjamin Coddington
2026-06-24 17:04 ` [PATCH RFC 2/3] SUNRPC: dispatch idle transports ahead of backlogged ones Benjamin Coddington
2026-06-24 17:04 ` [PATCH RFC 3/3] SUNRPC: grant an idle flow a burst allowance on the high-priority queue Benjamin Coddington

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.