All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
@ 2026-06-03 15:09 Benjamin Coddington
  2026-06-03 15:09 ` [PATCH RFC 1/4] sunrpc: add a per-transport fairness key to svc_xprt Benjamin Coddington
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Benjamin Coddington @ 2026-06-03 15:09 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown; +Cc: linux-nfs, Daire Byrne

knfsd dispatches ready transports from a single per-pool FIFO, so a
client's share of nfsd service scales with the number of connections it
holds rather than being shared per client.  A client that opens many
connections (nconnect, or a farm of data movers) starves other clients
on the same server purely by out-numbering them in sockets.

I measured this with a load generator that pins each request to a fixed
service time and does no filesystem work, so that nfsd thread-time is the
only scarce resource (8 threads, 10ms/op, ~648 ops/s pool ceiling).  A
greedy client opens K connections alongside one single-connection
interactive client.

NFSv3, dispatch as it is today:

  greedy K   greedy share   interactive ops/s
     1           50%              241
     4           80%              129
     8           89%               72
    16           94%               38

The interactive client's share tracks 1/(K+1) and its throughput falls
roughly 6x while it does nothing different.  NFSv4.1 behaves identically
(89% greedy at K=8) even when the greedy connections are bound to a
single session, because the dispatch decision is below the NFS version.

The same NFSv4.1 workload with fair queueing enabled:

  greedy K   greedy share   interactive ops/s
     8           72%              182
    16           73%              177
    32           70%              193

The greedy client's share no longer climbs with its connection count and
the interactive client recovers (72 -> 182 ops/s at K=8).  Aggregate
throughput is unchanged: the T/D pool ceiling is the same with fair
queueing on and off.  The split does not reach 50/50 because a single
interactive connection is bounded by its request window and by XPT_BUSY
serialising one transport; with a deeper window it reaches ~59/41.

The approach:

  - sunrpc grows an opaque per-transport fairness key (patch 1), with a
    default derived from the source address (the source port is excluded
    so a client's several connections share one key), and an opt-in
    per-pool scheduler that buckets ready transports by that key and
    dispatches round-robin across keys (patch 2).  When it is disabled,
    which is the default, the existing lockless FIFO path is unchanged.

  - nfsd gains a "fairq" module parameter to turn it on (patch 3) and
    stamps the NFSv4.1 clientid as the key when a connection binds to a
    session (patch 4), so all of a client's connections share one key.
    NFSv3 uses the source-address default.

This is an RFC; a few questions for the list:

  - Unit of fairness: clientid (used here) or session?  Earlier
    discussion leaned toward exploring per-session.

  - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
    on the opt-in path, versus a lockless or per-flow structure.

  - Would a per-client in-flight cap be preferable to proportional fair
    queueing?

The measurement used a debug-only filehandle-latency hook that is not
part of this series.

Benjamin Coddington (4):
  sunrpc: add a per-transport fairness key to svc_xprt
  sunrpc: dispatch ready transports fairly per client
  nfsd: add a fairq module parameter
  nfsd: key NFSv4.1 connections by clientid for fair queueing

 fs/nfsd/nfs4state.c             |  17 +++
 fs/nfsd/nfssvc.c                |  19 +++
 include/linux/sunrpc/svc.h      |   5 +
 include/linux/sunrpc/svc_xprt.h |  46 ++++++-
 net/sunrpc/svc.c                |   2 +
 net/sunrpc/svc_xprt.c           | 216 +++++++++++++++++++++++++++++++-
 6 files changed, 302 insertions(+), 3 deletions(-)


base-commit: e7ca66ba17f1b5e4ecbb29b9c3c4a31aa062bed0
-- 
2.53.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH RFC 1/4] sunrpc: add a per-transport fairness key to svc_xprt
  2026-06-03 15:09 [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch Benjamin Coddington
@ 2026-06-03 15:09 ` Benjamin Coddington
  2026-06-03 15:09 ` [PATCH RFC 2/4] sunrpc: dispatch ready transports fairly per client Benjamin Coddington
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Benjamin Coddington @ 2026-06-03 15:09 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown; +Cc: linux-nfs, Daire Byrne

The server dispatches ready transports from a single per-pool FIFO
(svc_pool.sp_xprts), so a client's share of the service threads scales
with the number of connections it holds: a client with K connections is
served roughly K times as often as a single-connection client.  The unit
of fairness is the transport, not the client.

As the first step toward per-client fair queueing, give every transport
an opaque fairness key.  sunrpc supplies a default derived from the
peer's source address; the source port is deliberately excluded so that
a client's several connections (nconnect, or a fan of data movers) share
one key.  An upper layer that knows a better identity may overwrite
xpt_fairq_key, and a nonzero value takes precedence over the default.

Also union a list_head (xpt_fairq) over xpt_ready for linking a transport
into a fair-queue bucket; a transport is only ever on one of the two
ready structures.  Nothing uses either yet, so there is no functional
change.

Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 include/linux/sunrpc/svc_xprt.h | 46 ++++++++++++++++++++++++++++++++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index da2a2531e110..67d0c5d9b92c 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -55,7 +55,13 @@ struct svc_xprt {
 	struct kref		xpt_ref;
 	ktime_t			xpt_qtime;
 	struct list_head	xpt_list;
-	struct lwq_node		xpt_ready;
+	union {
+		struct lwq_node		xpt_ready;	/* lockless FIFO (default) */
+		struct list_head	xpt_fairq;	/* fair-queue bucket link */
+	};
+	unsigned long		xpt_fairq_key;	/* opaque per-client fairness
+						 * identity; 0 => derive from
+						 * source address (xpt_remote) */
 	unsigned long		xpt_flags;
 
 	struct svc_serv		*xpt_server;	/* service for transport */
@@ -157,6 +163,44 @@ static inline bool svc_xprt_is_dead(const struct svc_xprt *xprt)
 		(test_bit(XPT_CLOSE, &xprt->xpt_flags) != 0);
 }
 
+/*
+ * Per-transport fairness key.  The fair-queue dispatcher groups ready
+ * transports by this key and round-robins across keys, so that service is
+ * shared per client rather than per transport.
+ *
+ * sunrpc supplies a default derived from the peer's source *address*; the
+ * source port is deliberately excluded so that a client's several connections
+ * (e.g. nconnect, or a fan of data movers) share a single key.  An upper layer
+ * that knows a better identity -- e.g. nfsd stamping an NFSv4.1 clientid in
+ * nfsd4_init_conn() -- may set xpt_fairq_key directly; a nonzero value
+ * overrides the default.
+ */
+static inline unsigned long svc_xprt_fairq_default_key(const struct svc_xprt *xprt)
+{
+	const struct sockaddr *sa = (const struct sockaddr *)&xprt->xpt_remote;
+
+	switch (sa->sa_family) {
+	case AF_INET:
+		return (unsigned long)
+			((const struct sockaddr_in *)sa)->sin_addr.s_addr;
+	case AF_INET6: {
+		const struct in6_addr *a =
+			&((const struct sockaddr_in6 *)sa)->sin6_addr;
+
+		return (unsigned long)(a->s6_addr32[0] ^ a->s6_addr32[1] ^
+				       a->s6_addr32[2] ^ a->s6_addr32[3]);
+	}
+	default:
+		return (unsigned long)sa->sa_family;
+	}
+}
+
+static inline unsigned long svc_xprt_fairq_key(const struct svc_xprt *xprt)
+{
+	return xprt->xpt_fairq_key ? xprt->xpt_fairq_key
+				   : svc_xprt_fairq_default_key(xprt);
+}
+
 int	svc_reg_xprt_class(struct svc_xprt_class *);
 void	svc_unreg_xprt_class(struct svc_xprt_class *);
 void	svc_xprt_init(struct net *, struct svc_xprt_class *, struct svc_xprt *,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 2/4] sunrpc: dispatch ready transports fairly per client
  2026-06-03 15:09 [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch Benjamin Coddington
  2026-06-03 15:09 ` [PATCH RFC 1/4] sunrpc: add a per-transport fairness key to svc_xprt Benjamin Coddington
@ 2026-06-03 15:09 ` Benjamin Coddington
  2026-06-03 15:09 ` [PATCH RFC 3/4] nfsd: add a fairq module parameter Benjamin Coddington
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Benjamin Coddington @ 2026-06-03 15:09 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown; +Cc: linux-nfs, Daire Byrne

Add an opt-in per-pool scheduler that groups ready transports into
buckets by their fairness key and dispatches round-robin across the
non-empty buckets, so each client gets one turn per round regardless of
how many connections it has queued.  Buckets are a fixed hash array: a
key collision merely shares a turn between two clients and never drops
work.  The structure is guarded by a single per-pool spinlock taken only
on this path -- when it is absent (sp_fairq == NULL) the existing
lockless lwq FIFO fast path runs unchanged.

svc_set_fairq() allocates or frees the per-pool state for a service and
must be called while the service is quiescent, before any thread is
started or transport queued.

Enqueue can run in softirq context (a socket data_ready callback), so
the lock disables bottom halves.  The sleep/wake handshake the lockless
path gets for free from lwq_enqueue()'s cmpxchg is preserved here by
issuing an explicit barrier before waking an idle thread, pairing with
the full barrier the sleeper executes when it adds itself to the idle
list.

Nothing selects fair queueing yet, so there is no functional change.

Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 include/linux/sunrpc/svc.h |   5 +
 net/sunrpc/svc.c           |   2 +
 net/sunrpc/svc_xprt.c      | 216 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 221 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4be6204f6630..7fb09d18492d 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -33,12 +33,16 @@
  * have one pool per NUMA node.  This optimisation reduces cross-
  * node traffic on multi-node NUMA NFS servers.
  */
+struct svc_fairq;
+
 struct svc_pool {
 	unsigned int		sp_id;		/* pool id; also node id on NUMA */
 	unsigned int		sp_nrthreads;	/* # of threads currently running in pool */
 	unsigned int		sp_nrthrmin;	/* Min number of threads to run per pool */
 	unsigned int		sp_nrthrmax;	/* Max requested number of threads in pool */
 	struct lwq		sp_xprts;	/* pending transports */
+	struct svc_fairq	*sp_fairq;	/* per-client fair-queue dispatch
+						 * state; NULL => sp_xprts FIFO */
 	struct list_head	sp_all_threads;	/* all server threads */
 	struct llist_head	sp_idle_threads; /* idle server threads */
 
@@ -465,6 +469,7 @@ struct svc_serv *  svc_create_pooled(struct svc_program *prog,
 				     struct svc_stat *stats,
 				     unsigned int bufsize,
 				     int (*threadfn)(void *data));
+int		   svc_set_fairq(struct svc_serv *serv, bool enable);
 int		   svc_set_pool_threads(struct svc_serv *serv, struct svc_pool *pool,
 					unsigned int min_threads, unsigned int max_threads);
 int		   svc_set_num_threads(struct svc_serv *serv, unsigned int min_threads,
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 009373737ea9..d413eb669d85 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -656,6 +656,8 @@ svc_destroy(struct svc_serv **servp)
 	if (serv->sv_is_pooled)
 		svc_pool_map_put();
 
+	svc_set_fairq(serv, false);
+
 	for (i = 0; i < serv->sv_nrpools; i++) {
 		struct svc_pool *pool = &serv->sv_pools[i];
 
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 63d1002e63e7..1e81a5c3accf 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -9,6 +9,7 @@
 #include <linux/sched/mm.h>
 #include <linux/errno.h>
 #include <linux/freezer.h>
+#include <linux/hash.h>
 #include <linux/slab.h>
 #include <net/sock.h>
 #include <linux/sunrpc/addr.h>
@@ -44,6 +45,193 @@ static int svc_conn_age_period = 6*60;
 static DEFINE_SPINLOCK(svc_xprt_class_lock);
 static LIST_HEAD(svc_xprt_class_list);
 
+/*
+ * Per-client fair-queue dispatch.
+ *
+ * The default dispatcher keeps all ready transports in one lockless FIFO
+ * (svc_pool.sp_xprts), so a client's share of nfsd service scales with its
+ * connection count.  When fair queueing is enabled (svc_set_fairq()), ready
+ * transports are instead grouped into buckets by their fairness key
+ * (svc_xprt_fairq_key()) and dispatched round-robin across the non-empty
+ * buckets, so each client gets one turn per round however many connections it
+ * holds.
+ *
+ * Buckets are a fixed hash array: distinct clients almost always land in
+ * distinct buckets, and a collision merely shares a turn between two clients,
+ * never drops work.  The structure is guarded by a single per-pool spinlock
+ * taken only on this opt-in path; the default FIFO fast path is untouched.
+ */
+#define SVC_FAIRQ_BITS		8
+#define SVC_FAIRQ_BUCKETS	(1U << SVC_FAIRQ_BITS)
+
+struct svc_fairq_bucket {
+	struct list_head	fb_xprts;	/* ready transports, FIFO order */
+	struct list_head	fb_active;	/* link in fq_active when non-empty */
+};
+
+struct svc_fairq {
+	spinlock_t		fq_lock;
+	unsigned int		fq_nr;		/* total transports queued */
+	struct list_head	fq_active;	/* non-empty buckets, RR order */
+	struct svc_fairq_bucket	fq_buckets[SVC_FAIRQ_BUCKETS];
+};
+
+static struct svc_fairq *svc_fairq_alloc(void)
+{
+	struct svc_fairq *fq;
+	unsigned int i;
+
+	fq = kzalloc(sizeof(*fq), GFP_KERNEL);
+	if (!fq)
+		return NULL;
+	spin_lock_init(&fq->fq_lock);
+	INIT_LIST_HEAD(&fq->fq_active);
+	for (i = 0; i < SVC_FAIRQ_BUCKETS; i++) {
+		INIT_LIST_HEAD(&fq->fq_buckets[i].fb_xprts);
+		INIT_LIST_HEAD(&fq->fq_buckets[i].fb_active);
+	}
+	return fq;
+}
+
+static void svc_fairq_free(struct svc_fairq *fq)
+{
+	kfree(fq);
+}
+
+/**
+ * svc_set_fairq - enable or disable per-client fair-queue dispatch
+ * @serv: service whose pools to (re)configure
+ * @enable: %true to group ready transports by client and dispatch round-robin
+ *
+ * Allocates (or frees) a fair-queue structure for every pool of @serv.  Must
+ * be called while the service is quiescent -- before any thread is started or
+ * transport queued -- because it changes how ready transports are tracked.
+ *
+ * Return: 0 on success, or -ENOMEM if a structure could not be allocated, in
+ * which case @serv is left with fair queueing disabled.
+ */
+int svc_set_fairq(struct svc_serv *serv, bool enable)
+{
+	unsigned int i;
+
+	/* Drop any existing fair-queue state. */
+	for (i = 0; i < serv->sv_nrpools; i++) {
+		svc_fairq_free(serv->sv_pools[i].sp_fairq);
+		serv->sv_pools[i].sp_fairq = NULL;
+	}
+	if (!enable)
+		return 0;
+
+	for (i = 0; i < serv->sv_nrpools; i++) {
+		serv->sv_pools[i].sp_fairq = svc_fairq_alloc();
+		if (!serv->sv_pools[i].sp_fairq)
+			goto out_free;
+	}
+	return 0;
+
+out_free:
+	while (i-- > 0) {
+		svc_fairq_free(serv->sv_pools[i].sp_fairq);
+		serv->sv_pools[i].sp_fairq = NULL;
+	}
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(svc_set_fairq);
+
+static struct svc_fairq_bucket *svc_fairq_bucket(struct svc_fairq *fq,
+						 struct svc_xprt *xprt)
+{
+	return &fq->fq_buckets[hash_long(svc_xprt_fairq_key(xprt),
+					 SVC_FAIRQ_BITS)];
+}
+
+/*
+ * Add a ready transport to its client's bucket, FIFO within the bucket.
+ * Enqueue can run in softirq context (socket data_ready), so all acquisitions
+ * of fq_lock disable bottom halves.
+ */
+static void svc_fairq_enqueue(struct svc_fairq *fq, struct svc_xprt *xprt)
+{
+	struct svc_fairq_bucket *b = svc_fairq_bucket(fq, xprt);
+
+	spin_lock_bh(&fq->fq_lock);
+	if (list_empty(&b->fb_xprts))
+		list_add_tail(&b->fb_active, &fq->fq_active);
+	list_add_tail(&xprt->xpt_fairq, &b->fb_xprts);
+	WRITE_ONCE(fq->fq_nr, fq->fq_nr + 1);
+	spin_unlock_bh(&fq->fq_lock);
+}
+
+/*
+ * Round-robin across clients: take the bucket at the head of the active list,
+ * pop its oldest transport, then rotate the bucket to the tail (or drop it if
+ * it has no more ready transports).  Each client thus gets one turn per cycle
+ * regardless of how many connections it has queued.
+ */
+static struct svc_xprt *svc_fairq_dequeue(struct svc_fairq *fq)
+{
+	struct svc_fairq_bucket *b;
+	struct svc_xprt *xprt = NULL;
+
+	spin_lock_bh(&fq->fq_lock);
+	if (!list_empty(&fq->fq_active)) {
+		b = list_first_entry(&fq->fq_active, struct svc_fairq_bucket,
+				     fb_active);
+		xprt = list_first_entry(&b->fb_xprts, struct svc_xprt,
+					xpt_fairq);
+		list_del(&xprt->xpt_fairq);
+		WRITE_ONCE(fq->fq_nr, fq->fq_nr - 1);
+		list_del_init(&b->fb_active);
+		if (!list_empty(&b->fb_xprts))
+			list_add_tail(&b->fb_active, &fq->fq_active);
+		svc_xprt_get(xprt);
+	}
+	spin_unlock_bh(&fq->fq_lock);
+	return xprt;
+}
+
+/*
+ * Lockless emptiness hint for the sleep/wake decision.  Correctness against a
+ * concurrent enqueue relies on the full barrier the sleeper executes
+ * (llist_add() of itself to sp_idle_threads) pairing with the smp_mb() that
+ * svc_xprt_enqueue() issues before waking an idle thread -- the same
+ * store-buffer pattern lwq_enqueue()'s cmpxchg provides on the default path.
+ */
+static bool svc_fairq_empty(struct svc_fairq *fq)
+{
+	return READ_ONCE(fq->fq_nr) == 0;
+}
+
+/* Remove and delete all queued transports belonging to @net (namespace down). */
+static void svc_fairq_clean_net(struct svc_fairq *fq, struct net *net)
+{
+	struct svc_xprt *xprt, *tmp;
+	unsigned int i;
+	LIST_HEAD(dying);
+
+	spin_lock_bh(&fq->fq_lock);
+	for (i = 0; i < SVC_FAIRQ_BUCKETS; i++) {
+		struct svc_fairq_bucket *b = &fq->fq_buckets[i];
+
+		list_for_each_entry_safe(xprt, tmp, &b->fb_xprts, xpt_fairq) {
+			if (xprt->xpt_net != net)
+				continue;
+			list_move_tail(&xprt->xpt_fairq, &dying);
+			WRITE_ONCE(fq->fq_nr, fq->fq_nr - 1);
+		}
+		if (list_empty(&b->fb_xprts))
+			list_del_init(&b->fb_active);
+	}
+	spin_unlock_bh(&fq->fq_lock);
+
+	while (!list_empty(&dying)) {
+		xprt = list_first_entry(&dying, struct svc_xprt, xpt_fairq);
+		list_del_init(&xprt->xpt_fairq);
+		set_bit(XPT_CLOSE, &xprt->xpt_flags);
+		svc_delete_xprt(xprt);
+	}
+}
+
 /* SMP locking strategy:
  *
  *	svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
@@ -523,7 +711,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 
 	percpu_counter_inc(&pool->sp_sockets_queued);
 	xprt->xpt_qtime = ktime_get();
-	lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts);
+	if (pool->sp_fairq) {
+		svc_fairq_enqueue(pool->sp_fairq, xprt);
+		/*
+		 * Order the enqueue before the idle-thread scan below, so a
+		 * thread that just found no work and added itself to the idle
+		 * list is guaranteed to be seen here.  lwq_enqueue()'s cmpxchg
+		 * supplies this barrier on the default path; the locked fair
+		 * enqueue is release-only, so issue it explicitly.
+		 */
+		smp_mb();
+	} else {
+		lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts);
+	}
 
 	svc_pool_wake_idle_thread(pool);
 }
@@ -536,6 +736,9 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
 {
 	struct svc_xprt	*xprt = NULL;
 
+	if (pool->sp_fairq)
+		return svc_fairq_dequeue(pool->sp_fairq);
+
 	xprt = lwq_dequeue(&pool->sp_xprts, struct svc_xprt, xpt_ready);
 	if (xprt)
 		svc_xprt_get(xprt);
@@ -759,8 +962,12 @@ svc_thread_should_sleep(struct svc_rqst *rqstp)
 		return false;
 
 	/* was a socket queued? */
-	if (!lwq_empty(&pool->sp_xprts))
+	if (pool->sp_fairq) {
+		if (!svc_fairq_empty(pool->sp_fairq))
+			return false;
+	} else if (!lwq_empty(&pool->sp_xprts)) {
 		return false;
+	}
 
 	/* are we shutting down? */
 	if (svc_thread_should_stop(rqstp))
@@ -1192,6 +1399,11 @@ static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
 		struct svc_pool *pool = &serv->sv_pools[i];
 		struct llist_node *q, **t1, *t2;
 
+		if (pool->sp_fairq) {
+			svc_fairq_clean_net(pool->sp_fairq, net);
+			continue;
+		}
+
 		q = lwq_dequeue_all(&pool->sp_xprts);
 		lwq_for_each_safe(xprt, t1, t2, &q, xpt_ready) {
 			if (xprt->xpt_net == net) {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 3/4] nfsd: add a fairq module parameter
  2026-06-03 15:09 [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch Benjamin Coddington
  2026-06-03 15:09 ` [PATCH RFC 1/4] sunrpc: add a per-transport fairness key to svc_xprt Benjamin Coddington
  2026-06-03 15:09 ` [PATCH RFC 2/4] sunrpc: dispatch ready transports fairly per client Benjamin Coddington
@ 2026-06-03 15:09 ` Benjamin Coddington
  2026-06-03 15:09 ` [PATCH RFC 4/4] nfsd: key NFSv4.1 connections by clientid for fair queueing Benjamin Coddington
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Benjamin Coddington @ 2026-06-03 15:09 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown; +Cc: linux-nfs, Daire Byrne

Add nfsd.fairq (bool, default off).  When set, nfsd enables the sunrpc
per-client fair-queue dispatcher for its service via svc_set_fairq() at
service creation, so that a client cannot increase its share of nfsd
threads simply by opening more connections.  The parameter is read when
the service is created, so a change takes effect at the next nfsd start
(thread count 0 -> N).

With fair queueing enabled and no upper-layer identity stamped on a
transport, clients are distinguished by source address, which already
gives per-client fairness for NFSv3.

Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 fs/nfsd/nfssvc.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 551d3cf51036..3abe4d2aec5a 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -36,6 +36,18 @@
 #define NFSDDBG_FACILITY	NFSDDBG_SVC
 
 atomic_t			nfsd_th_cnt = ATOMIC_INIT(0);
+
+/*
+ * Per-client fair-queue dispatch.  When set, ready connections are scheduled
+ * round-robin per client (NFSv4.1 clientid, else source address) rather than
+ * per transport, so a client cannot grab a larger share of nfsd threads by
+ * opening more connections.  Read at service creation, so a change takes
+ * effect at the next nfsd start (thread count 0 -> N).
+ */
+static bool			nfsd_fairq;
+module_param_named(fairq, nfsd_fairq, bool, 0644);
+MODULE_PARM_DESC(fairq, "schedule nfsd service fairly per client (default off)");
+
 static int			nfsd(void *vrqstp);
 #if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
 static int			nfsd_acl_rpcbind_set(struct net *,
@@ -635,6 +647,13 @@ int nfsd_create_serv(struct net *net)
 		return -ENOMEM;
 	}
 
+	error = svc_set_fairq(serv, nfsd_fairq);
+	if (error) {
+		svc_destroy(&serv);
+		percpu_ref_exit(&nn->nfsd_net_ref);
+		return error;
+	}
+
 	error = svc_bind(serv, net);
 	if (error < 0) {
 		svc_destroy(&serv);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH RFC 4/4] nfsd: key NFSv4.1 connections by clientid for fair queueing
  2026-06-03 15:09 [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch Benjamin Coddington
                   ` (2 preceding siblings ...)
  2026-06-03 15:09 ` [PATCH RFC 3/4] nfsd: add a fairq module parameter Benjamin Coddington
@ 2026-06-03 15:09 ` Benjamin Coddington
  2026-06-03 22:52 ` [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch NeilBrown
  2026-06-04 22:48 ` Jeff Layton
  5 siblings, 0 replies; 14+ messages in thread
From: Benjamin Coddington @ 2026-06-03 15:09 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton, NeilBrown; +Cc: linux-nfs, Daire Byrne

An NFSv4.1 client may bind several connections to its session.  Stamp
each such transport's fairness key with a value derived from the clientid
in nfsd4_init_conn(), so that all of a client's connections share one key
and the fair-queue dispatcher schedules them as a single client rather
than per transport.  A multi-connection v4.1 client is thereby held to
the same per-client share as a single-connection client.

Signed-off-by: Benjamin Coddington <bcodding@hammerspace.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 fs/nfsd/nfs4state.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 5982fc9eb6b1..52bd9bc5cdbd 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2351,11 +2351,28 @@ static int nfsd4_register_conn(struct nfsd4_conn *conn)
 	return register_xpt_user(conn->cn_xprt, &conn->cn_xpt_user);
 }
 
+/*
+ * Derive the transport fairness key from the v4.1 clientid, so that all of a
+ * client's connections (bound to its session) share one key and the sunrpc
+ * fair-queue dispatcher schedules them as a single client rather than per
+ * transport.  cl_boot is the server's boot time and is effectively always
+ * nonzero, so this overrides the source-address default that
+ * svc_xprt_fairq_key() would otherwise use.
+ */
+static unsigned long nfsd4_fairq_key(const clientid_t *clid)
+{
+	u64 id = ((u64)clid->cl_boot << 32) | clid->cl_id;
+
+	return (unsigned long)(id ^ (id >> 32));
+}
+
 static void nfsd4_init_conn(struct svc_rqst *rqstp, struct nfsd4_conn *conn, struct nfsd4_session *ses)
 {
 	int ret;
 
 	nfsd4_hash_conn(conn, ses);
+	conn->cn_xprt->xpt_fairq_key =
+		nfsd4_fairq_key(&ses->se_client->cl_clientid);
 	ret = nfsd4_register_conn(conn);
 	if (ret)
 		/* oops; xprt is already down: */
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-03 15:09 [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch Benjamin Coddington
                   ` (3 preceding siblings ...)
  2026-06-03 15:09 ` [PATCH RFC 4/4] nfsd: key NFSv4.1 connections by clientid for fair queueing Benjamin Coddington
@ 2026-06-03 22:52 ` NeilBrown
  2026-06-04 12:11   ` Daire Byrne
                     ` (2 more replies)
  2026-06-04 22:48 ` Jeff Layton
  5 siblings, 3 replies; 14+ messages in thread
From: NeilBrown @ 2026-06-03 22:52 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Chuck Lever, Jeff Layton, linux-nfs, Daire Byrne

On Thu, 04 Jun 2026, Benjamin Coddington wrote:
> knfsd dispatches ready transports from a single per-pool FIFO, so a
> client's share of nfsd service scales with the number of connections it
> holds rather than being shared per client.  A client that opens many
> connections (nconnect, or a farm of data movers) starves other clients
> on the same server purely by out-numbering them in sockets.
> 
> I measured this with a load generator that pins each request to a fixed
> service time and does no filesystem work, so that nfsd thread-time is the
> only scarce resource (8 threads, 10ms/op, ~648 ops/s pool ceiling).  A
> greedy client opens K connections alongside one single-connection
> interactive client.
> 
> NFSv3, dispatch as it is today:
> 
>   greedy K   greedy share   interactive ops/s
>      1           50%              241
>      4           80%              129
>      8           89%               72
>     16           94%               38
> 
> The interactive client's share tracks 1/(K+1) and its throughput falls
> roughly 6x while it does nothing different.  NFSv4.1 behaves identically
> (89% greedy at K=8) even when the greedy connections are bound to a
> single session, because the dispatch decision is below the NFS version.
> 
> The same NFSv4.1 workload with fair queueing enabled:
> 
>   greedy K   greedy share   interactive ops/s
>      8           72%              182
>     16           73%              177
>     32           70%              193
> 
> The greedy client's share no longer climbs with its connection count and
> the interactive client recovers (72 -> 182 ops/s at K=8).  Aggregate
> throughput is unchanged: the T/D pool ceiling is the same with fair
> queueing on and off.  The split does not reach 50/50 because a single
> interactive connection is bounded by its request window and by XPT_BUSY
> serialising one transport; with a deeper window it reaches ~59/41.
> 
> The approach:
> 
>   - sunrpc grows an opaque per-transport fairness key (patch 1), with a
>     default derived from the source address (the source port is excluded
>     so a client's several connections share one key), and an opt-in
>     per-pool scheduler that buckets ready transports by that key and
>     dispatches round-robin across keys (patch 2).  When it is disabled,
>     which is the default, the existing lockless FIFO path is unchanged.
> 
>   - nfsd gains a "fairq" module parameter to turn it on (patch 3) and
>     stamps the NFSv4.1 clientid as the key when a connection binds to a
>     session (patch 4), so all of a client's connections share one key.
>     NFSv3 uses the source-address default.
> 
> This is an RFC; a few questions for the list:
> 
>   - Unit of fairness: clientid (used here) or session?  Earlier
>     discussion leaned toward exploring per-session.
> 
>   - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
>     on the opt-in path, versus a lockless or per-flow structure.

I'm not keen on the hash bucket approach, though I can't clearly say
why.
I'm also not keen on the opt-in design.  I don't like asking the admin to
tune performance.  We should always provide optimal performance.

I imagine creating an object which represents a client - using IP
address or possibly v4 client id.  These may well be located using a
hash table (rhashtable?), but that is peripheral to the design.
Each xprt has an associated client which can be changed at any time
(nfsd could change to a v4.1-client).

Each client has a lwq of xprts and is part of an lwq of clients.

To enqueue an xprt, we grab the client, enqueue to that, then if
necessary enqueue the client.

To dequeue next, we dequeue the first client, dequeue the first xprt,
then optionally enqueue the client again.

So this would be slightly more work than the current (2 dequeues instead
of 1) but I think that might be acceptable.

clients would be refcounted by xprts and probably rcu-freed.

> 
>   - Would a per-client in-flight cap be preferable to proportional fair
>     queueing?

A per-client cap would only be ok if the admin didn't have to tune it.
So the client would need to get some sort of feed-back from the server
so that it knows when it is pushing too hard.  If we were still using
UDP we could possibly use packet-loss for that feed-back, but we aren't
and don't want to.
With v4.1 we can of course use the slot based flow control and I think
we should if we can agree on a good design.  With v3 I don't think there
is any way to get the needed feed-back

Thanks,
NeilBrown


> 
> The measurement used a debug-only filehandle-latency hook that is not
> part of this series.
> 
> Benjamin Coddington (4):
>   sunrpc: add a per-transport fairness key to svc_xprt
>   sunrpc: dispatch ready transports fairly per client
>   nfsd: add a fairq module parameter
>   nfsd: key NFSv4.1 connections by clientid for fair queueing
> 
>  fs/nfsd/nfs4state.c             |  17 +++
>  fs/nfsd/nfssvc.c                |  19 +++
>  include/linux/sunrpc/svc.h      |   5 +
>  include/linux/sunrpc/svc_xprt.h |  46 ++++++-
>  net/sunrpc/svc.c                |   2 +
>  net/sunrpc/svc_xprt.c           | 216 +++++++++++++++++++++++++++++++-
>  6 files changed, 302 insertions(+), 3 deletions(-)
> 
> 
> base-commit: e7ca66ba17f1b5e4ecbb29b9c3c4a31aa062bed0
> -- 
> 2.53.0
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-03 22:52 ` [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch NeilBrown
@ 2026-06-04 12:11   ` Daire Byrne
  2026-06-04 14:54     ` Chuck Lever
  2026-06-04 14:25   ` Benjamin Coddington
  2026-06-04 22:27   ` Jeff Layton
  2 siblings, 1 reply; 14+ messages in thread
From: Daire Byrne @ 2026-06-04 12:11 UTC (permalink / raw)
  To: NeilBrown; +Cc: Benjamin Coddington, Chuck Lever, Jeff Layton, linux-nfs

I'm quite interested in this functionality - like I said we already
set svc_rpc_per_connection_limit as a way to limit the damage greedy
clients can do to wider service delivery (especially interactive
desktops).

Another thing we do is use the fq qdisc to limit the outbound
bandwidth (maxrate) per stream (client/nconnect).

And finally, we use some qdisc prio stuff to try and "prefer" things
like workstations over the batch render farm (differentiated by
subnet). We also mostly use NFSv3 (for a few different reasons).

I guess anything that can give us better control over these kinds of
"QOS" scenarios we'll happily take.

So for example preferring a subnet of workstations over a subnet of
batch render farm would be nice.

Daire


On Wed, 3 Jun 2026 at 23:52, NeilBrown <neilb@ownmail.net> wrote:
>
> On Thu, 04 Jun 2026, Benjamin Coddington wrote:
> > knfsd dispatches ready transports from a single per-pool FIFO, so a
> > client's share of nfsd service scales with the number of connections it
> > holds rather than being shared per client.  A client that opens many
> > connections (nconnect, or a farm of data movers) starves other clients
> > on the same server purely by out-numbering them in sockets.
> >
> > I measured this with a load generator that pins each request to a fixed
> > service time and does no filesystem work, so that nfsd thread-time is the
> > only scarce resource (8 threads, 10ms/op, ~648 ops/s pool ceiling).  A
> > greedy client opens K connections alongside one single-connection
> > interactive client.
> >
> > NFSv3, dispatch as it is today:
> >
> >   greedy K   greedy share   interactive ops/s
> >      1           50%              241
> >      4           80%              129
> >      8           89%               72
> >     16           94%               38
> >
> > The interactive client's share tracks 1/(K+1) and its throughput falls
> > roughly 6x while it does nothing different.  NFSv4.1 behaves identically
> > (89% greedy at K=8) even when the greedy connections are bound to a
> > single session, because the dispatch decision is below the NFS version.
> >
> > The same NFSv4.1 workload with fair queueing enabled:
> >
> >   greedy K   greedy share   interactive ops/s
> >      8           72%              182
> >     16           73%              177
> >     32           70%              193
> >
> > The greedy client's share no longer climbs with its connection count and
> > the interactive client recovers (72 -> 182 ops/s at K=8).  Aggregate
> > throughput is unchanged: the T/D pool ceiling is the same with fair
> > queueing on and off.  The split does not reach 50/50 because a single
> > interactive connection is bounded by its request window and by XPT_BUSY
> > serialising one transport; with a deeper window it reaches ~59/41.
> >
> > The approach:
> >
> >   - sunrpc grows an opaque per-transport fairness key (patch 1), with a
> >     default derived from the source address (the source port is excluded
> >     so a client's several connections share one key), and an opt-in
> >     per-pool scheduler that buckets ready transports by that key and
> >     dispatches round-robin across keys (patch 2).  When it is disabled,
> >     which is the default, the existing lockless FIFO path is unchanged.
> >
> >   - nfsd gains a "fairq" module parameter to turn it on (patch 3) and
> >     stamps the NFSv4.1 clientid as the key when a connection binds to a
> >     session (patch 4), so all of a client's connections share one key.
> >     NFSv3 uses the source-address default.
> >
> > This is an RFC; a few questions for the list:
> >
> >   - Unit of fairness: clientid (used here) or session?  Earlier
> >     discussion leaned toward exploring per-session.
> >
> >   - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
> >     on the opt-in path, versus a lockless or per-flow structure.
>
> I'm not keen on the hash bucket approach, though I can't clearly say
> why.
> I'm also not keen on the opt-in design.  I don't like asking the admin to
> tune performance.  We should always provide optimal performance.
>
> I imagine creating an object which represents a client - using IP
> address or possibly v4 client id.  These may well be located using a
> hash table (rhashtable?), but that is peripheral to the design.
> Each xprt has an associated client which can be changed at any time
> (nfsd could change to a v4.1-client).
>
> Each client has a lwq of xprts and is part of an lwq of clients.
>
> To enqueue an xprt, we grab the client, enqueue to that, then if
> necessary enqueue the client.
>
> To dequeue next, we dequeue the first client, dequeue the first xprt,
> then optionally enqueue the client again.
>
> So this would be slightly more work than the current (2 dequeues instead
> of 1) but I think that might be acceptable.
>
> clients would be refcounted by xprts and probably rcu-freed.
>
> >
> >   - Would a per-client in-flight cap be preferable to proportional fair
> >     queueing?
>
> A per-client cap would only be ok if the admin didn't have to tune it.
> So the client would need to get some sort of feed-back from the server
> so that it knows when it is pushing too hard.  If we were still using
> UDP we could possibly use packet-loss for that feed-back, but we aren't
> and don't want to.
> With v4.1 we can of course use the slot based flow control and I think
> we should if we can agree on a good design.  With v3 I don't think there
> is any way to get the needed feed-back
>
> Thanks,
> NeilBrown
>
>
> >
> > The measurement used a debug-only filehandle-latency hook that is not
> > part of this series.
> >
> > Benjamin Coddington (4):
> >   sunrpc: add a per-transport fairness key to svc_xprt
> >   sunrpc: dispatch ready transports fairly per client
> >   nfsd: add a fairq module parameter
> >   nfsd: key NFSv4.1 connections by clientid for fair queueing
> >
> >  fs/nfsd/nfs4state.c             |  17 +++
> >  fs/nfsd/nfssvc.c                |  19 +++
> >  include/linux/sunrpc/svc.h      |   5 +
> >  include/linux/sunrpc/svc_xprt.h |  46 ++++++-
> >  net/sunrpc/svc.c                |   2 +
> >  net/sunrpc/svc_xprt.c           | 216 +++++++++++++++++++++++++++++++-
> >  6 files changed, 302 insertions(+), 3 deletions(-)
> >
> >
> > base-commit: e7ca66ba17f1b5e4ecbb29b9c3c4a31aa062bed0
> > --
> > 2.53.0
> >
> >
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-03 22:52 ` [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch NeilBrown
  2026-06-04 12:11   ` Daire Byrne
@ 2026-06-04 14:25   ` Benjamin Coddington
  2026-06-04 22:27   ` Jeff Layton
  2 siblings, 0 replies; 14+ messages in thread
From: Benjamin Coddington @ 2026-06-04 14:25 UTC (permalink / raw)
  To: NeilBrown
  Cc: Benjamin Coddington, Chuck Lever, Jeff Layton, linux-nfs,
	Daire Byrne

On 3 Jun 2026, at 18:52, NeilBrown wrote:

> On Thu, 04 Jun 2026, Benjamin Coddington wrote:

>>   - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
>>     on the opt-in path, versus a lockless or per-flow structure.
>
> I'm not keen on the hash bucket approach, though I can't clearly say
> why.

I don't like the hash collisions.  Maybe that's your unspoken reason.
Unlucky client means I still get random hand-wavy reports.

> I'm also not keen on the opt-in design.  I don't like asking the admin to
> tune performance.  We should always provide optimal performance.
>
> I imagine creating an object which represents a client - using IP
> address or possibly v4 client id.  These may well be located using a
> hash table (rhashtable?), but that is peripheral to the design.
> Each xprt has an associated client which can be changed at any time
> (nfsd could change to a v4.1-client).
>
> Each client has a lwq of xprts and is part of an lwq of clients.
>
> To enqueue an xprt, we grab the client, enqueue to that, then if
> necessary enqueue the client.
>
> To dequeue next, we dequeue the first client, dequeue the first xprt,
> then optionally enqueue the client again.

ah - cool.  I think this is a much better design, the enqueue is lockless
and no bucket sharing.

> So this would be slightly more work than the current (2 dequeues instead
> of 1) but I think that might be acceptable.
>
> clients would be refcounted by xprts and probably rcu-freed.

This is the hard part - the xprt holding ref to client, moving from v3 style
(by ip addr) to v4.1 (by clientid) on session bind means repointing the xprt
to a different client object, which can race if the xprt is enqueued.

I suppose it's a one-time event for each v4.1 xprt, so could just create
another XPRT_ flag for it..

I'll see what I can come up with.

>
>>
>>   - Would a per-client in-flight cap be preferable to proportional fair
>>     queueing?
>
> A per-client cap would only be ok if the admin didn't have to tune it.
> So the client would need to get some sort of feed-back from the server
> so that it knows when it is pushing too hard.  If we were still using
> UDP we could possibly use packet-loss for that feed-back, but we aren't
> and don't want to.
> With v4.1 we can of course use the slot based flow control and I think
> we should if we can agree on a good design.  With v3 I don't think there
> is any way to get the needed feed-back

I need to fairly queue v3 and v4.1 clients at the same time... :/

Appreciate your look at this.

Ben

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-04 12:11   ` Daire Byrne
@ 2026-06-04 14:54     ` Chuck Lever
  2026-06-04 16:08       ` Benjamin Coddington
  0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2026-06-04 14:54 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Jeff Layton, linux-nfs, Daire Byrne, NeilBrown



On Thu, Jun 4, 2026, at 5:11 AM, Daire Byrne wrote:
> I'm quite interested in this functionality - like I said we already
> set svc_rpc_per_connection_limit as a way to limit the damage greedy
> clients can do to wider service delivery (especially interactive
> desktops).
>
> Another thing we do is use the fq qdisc to limit the outbound
> bandwidth (maxrate) per stream (client/nconnect).
>
> And finally, we use some qdisc prio stuff to try and "prefer" things
> like workstations over the batch render farm (differentiated by
> subnet). We also mostly use NFSv3 (for a few different reasons).
>
> I guess anything that can give us better control over these kinds of
> "QOS" scenarios we'll happily take.
>
> So for example preferring a subnet of workstations over a subnet of
> batch render farm would be nice.
>
> Daire
>
>
> On Wed, 3 Jun 2026 at 23:52, NeilBrown <neilb@ownmail.net> wrote:
>>
>> On Thu, 04 Jun 2026, Benjamin Coddington wrote:
>> > knfsd dispatches ready transports from a single per-pool FIFO, so a
>> > client's share of nfsd service scales with the number of connections it
>> > holds rather than being shared per client.  A client that opens many
>> > connections (nconnect, or a farm of data movers) starves other clients
>> > on the same server purely by out-numbering them in sockets.
>> >
>> > I measured this with a load generator that pins each request to a fixed
>> > service time and does no filesystem work, so that nfsd thread-time is the
>> > only scarce resource (8 threads, 10ms/op, ~648 ops/s pool ceiling).  A
>> > greedy client opens K connections alongside one single-connection
>> > interactive client.
>> >
>> > NFSv3, dispatch as it is today:
>> >
>> >   greedy K   greedy share   interactive ops/s
>> >      1           50%              241
>> >      4           80%              129
>> >      8           89%               72
>> >     16           94%               38
>> >
>> > The interactive client's share tracks 1/(K+1) and its throughput falls
>> > roughly 6x while it does nothing different.  NFSv4.1 behaves identically
>> > (89% greedy at K=8) even when the greedy connections are bound to a
>> > single session, because the dispatch decision is below the NFS version.
>> >
>> > The same NFSv4.1 workload with fair queueing enabled:
>> >
>> >   greedy K   greedy share   interactive ops/s
>> >      8           72%              182
>> >     16           73%              177
>> >     32           70%              193
>> >
>> > The greedy client's share no longer climbs with its connection count and
>> > the interactive client recovers (72 -> 182 ops/s at K=8).  Aggregate
>> > throughput is unchanged: the T/D pool ceiling is the same with fair
>> > queueing on and off.  The split does not reach 50/50 because a single
>> > interactive connection is bounded by its request window and by XPT_BUSY
>> > serialising one transport; with a deeper window it reaches ~59/41.
>> >
>> > The approach:
>> >
>> >   - sunrpc grows an opaque per-transport fairness key (patch 1), with a
>> >     default derived from the source address (the source port is excluded
>> >     so a client's several connections share one key), and an opt-in
>> >     per-pool scheduler that buckets ready transports by that key and
>> >     dispatches round-robin across keys (patch 2).  When it is disabled,
>> >     which is the default, the existing lockless FIFO path is unchanged.
>> >
>> >   - nfsd gains a "fairq" module parameter to turn it on (patch 3) and
>> >     stamps the NFSv4.1 clientid as the key when a connection binds to a
>> >     session (patch 4), so all of a client's connections share one key.
>> >     NFSv3 uses the source-address default.
>> >
>> > This is an RFC; a few questions for the list:
>> >
>> >   - Unit of fairness: clientid (used here) or session?  Earlier
>> >     discussion leaned toward exploring per-session.
>> >
>> >   - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
>> >     on the opt-in path, versus a lockless or per-flow structure.
>>
>> I'm not keen on the hash bucket approach, though I can't clearly say
>> why.
>> I'm also not keen on the opt-in design.  I don't like asking the admin to
>> tune performance.  We should always provide optimal performance.
>>
>> I imagine creating an object which represents a client - using IP
>> address or possibly v4 client id.  These may well be located using a
>> hash table (rhashtable?), but that is peripheral to the design.
>> Each xprt has an associated client which can be changed at any time
>> (nfsd could change to a v4.1-client).
>>
>> Each client has a lwq of xprts and is part of an lwq of clients.
>>
>> To enqueue an xprt, we grab the client, enqueue to that, then if
>> necessary enqueue the client.
>>
>> To dequeue next, we dequeue the first client, dequeue the first xprt,
>> then optionally enqueue the client again.
>>
>> So this would be slightly more work than the current (2 dequeues instead
>> of 1) but I think that might be acceptable.
>>
>> clients would be refcounted by xprts and probably rcu-freed.
>>
>> >
>> >   - Would a per-client in-flight cap be preferable to proportional fair
>> >     queueing?
>>
>> A per-client cap would only be ok if the admin didn't have to tune it.
>> So the client would need to get some sort of feed-back from the server
>> so that it knows when it is pushing too hard.  If we were still using
>> UDP we could possibly use packet-loss for that feed-back, but we aren't
>> and don't want to.
>> With v4.1 we can of course use the slot based flow control and I think
>> we should if we can agree on a good design.  With v3 I don't think there
>> is any way to get the needed feed-back
>>
>> Thanks,
>> NeilBrown
>>
>>
>> >
>> > The measurement used a debug-only filehandle-latency hook that is not
>> > part of this series.

[ Daire, we're generally a "bottom post" list. My comments below
are in response to your notes quoted at the top as well as Neil's
and Ben's quoted inline ]

First some general responses to the shape of Ben's patches.

- Thanks for giving us something to start kicking around!

- I'm not a fan of adding administrative controls, and these days, a
  module parameter is outdated and way too global. I'd like to see
  that removed until we have a clear need (use cases) for tuning

- Having to rely on client identity is going to be difficult to get
  right. The scope of the identity, for example, is never going to
  cover all the use cases that we want. For example, fairness-by-
  client-address means all clients behind a NAT get reduced to a
  single fairness unit.

- The observability you proposed for measuring your solution is
  worth some attention. We should consider making it a part of
  the patch series rather than something that was useful only
  while it was being developed.

- Probably the largest challenge will be including NFSv3, NFSv4.0,
  and LOCALIO, none of which have the concept of a "session".

After some thought, IMO the problem as initially stated has some
areas that are going to be challenging and delay a real solution.

Instead, we could look at the problem as "preventing starvation of
any one connection" rather than the more difficult goal of "ensuring
complete scheduling fairness". This changes the problem class
fundamentally.

Strictly speaking, nothing in Ben's data is starved today: the FIFO
transport queue guarantees every backlogged transport one receive
per cycle, and every op gets completed. What degrades is the floor:
the latency and throughput a lightly loaded connection can count on.
It degrades linearly with the aggressor's connection count: with K+1
backlogged transports and pool service rate T/D, the interactive
connection waits ~(K+1) * D/T between services, which is unbounded
as K grows.

The reframed goal is precise: make the service floor of any one
connection independent of how many connections and slots everyone
else holds. That is a much weaker contract than proportional
fairness, but the best part is it dissolves the hardest challenges.

Equal shares require knowing who "everyone" is; a floor does not.
No client identity, no clientid-vs-session-vs-address debate, no
NAT collapse, no client-object lifetime, no per-pool share
fragmentation. And IMO it is what both Ben and Daire actually
asked for: Ben wrote "I would prefer the 'interactive' clients
be prioritized, but I don't control their nconnect." And Daire
described something similar above.

Looking at prior art, networking has solved exactly this reframed
problem (and Daire alludes to that above): fq_codel's success
comes less from its fair queueing than from its sparse-flow
optimization; that is, flows with nothing in flight jump the queue
and heavy flows share what remains.

TLDR; this reframe turns "schedule fairly" into "let request-
response traffic overtake streams," and NFSD already appears to
track the bit of state needed to tell them apart.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-04 14:54     ` Chuck Lever
@ 2026-06-04 16:08       ` Benjamin Coddington
  2026-06-04 16:43         ` Chuck Lever
  2026-06-04 21:11         ` NeilBrown
  0 siblings, 2 replies; 14+ messages in thread
From: Benjamin Coddington @ 2026-06-04 16:08 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Benjamin Coddington, Jeff Layton, linux-nfs, Daire Byrne,
	NeilBrown

On 4 Jun 2026, at 10:54, Chuck Lever wrote:
> - I'm not a fan of adding administrative controls, and these days, a
>   module parameter is outdated and way too global. I'd like to see
>   that removed until we have a clear need (use cases) for tuning

Agreed - I'll drop the module parameter.  With the reframe below the fair
path is cheap enough to be always-on.

> - Having to rely on client identity is going to be difficult to get
>   right. ... fairness-by-client-address means all clients behind a NAT
>   get reduced to a single fairness unit.

Agreed, and avoiding identity entirely is most of the appeal of the
reframe - it also gets NFSv3, NFSv4.0 and LOCALIO for free, which solves
the no-session problem you raised.

> - The observability you proposed ... worth ... making it a part of the
>   patch series

Will do - I'll turn the latency-injection measurement into something that
stays in the series, but it's probably not appropriate for any final
inclusion.

> Instead, we could look at the problem as "preventing starvation of
> any one connection" ...
> fq_codel's success comes less from its fair queueing than from its
> sparse-flow optimization; that is, flows with nothing in flight jump
> the queue and heavy flows share what remains.

I like this - it dissolves the identity question, and xpt_nr_rqsts already
gives us the in-flight signal to classify on.

One wrinkle I'd want to get right first: what a user perceives as
"interactive" is a command, not a single RPC, and a command is a burst of
correlated RPCs - loading a web page whose browser cache is on NFS is a
readdir, then stat/open/read across many files.  Plain sparse-flow grants
priority for one quantum (~one RPC for us), so only the first RPC of the
burst jumps; RPCs 2..N find the connection backlogged, demote to the bulk
tier, and complete at the aggressor's rate.  The leading edge is fast but
the command isn't.

Rough numbers, ~650 ops/s pool, interactive on one connection, aggressor on
16, a 50-RPC command:

  today (per-transport FIFO):     50 * 17 / 650  ~= 1.3 s
  sparse-flow, 1-RPC quantum:     ~= 1.3 s  (only the first RPC saved)
  whole command prioritized:      50 / 650       ~= 0.08 s

So to make the command feel interactive the priority allowance has to cover
the cycle, not one RPC: grant an idle->active connection a budget of ~N RPCs,
re-granted each time it returns to idle, rather than a single jump.  This
stays identity-free and connection-count-proof for the same reason your
version does - a sustained stream never returns to idle, so it collects the
allowance once and is in the bulk tier forever after, however many
connections it opens.  A user-paced connection idles between commands and is
re-granted each time.  (This also naturally covers Daire's
workstations-over-render-farm case: bursty-then-idle vs sustained, no subnet
config.)

That turns the one open parameter into "how many RPCs is an interactive
cycle" (N).  I'd rather not make it an admin knob either:

  - a generous fixed default - over-provisioning costs almost nothing, since
    a sustained stream never idles to collect it, and under-provisioning just
    lets the tail of an unusually large command fall back to bulk;
  - measure it - track per-connection burst size between idle gaps;
  - borrow the v4.1 slot count as a per-connection hint (v3 would need a
    fallback).

Does a burst-sized allowance fit how you were picturing the sparse tier, or
were you thinking the single-RPC jump was enough and the rest should just
share bulk?  That's the main fork I see.

I'll rework the series around this - always-on, no module parameter, no
client identity, sparse/bulk tiers keyed on xpt_nr_rqsts, measurement
included.  I also owe new numbers: my current harness runs every connection
at a fixed window, so the "interactive" client is actually backlogged -
the wrong shape to show any of this.  The metric that matters is
command-completion time versus the aggressor's connection count.

Thanks for the reframe - it's a much better-shaped problem.

Ben

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-04 16:08       ` Benjamin Coddington
@ 2026-06-04 16:43         ` Chuck Lever
  2026-06-04 21:11         ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: Chuck Lever @ 2026-06-04 16:43 UTC (permalink / raw)
  To: Benjamin Coddington; +Cc: Jeff Layton, linux-nfs, Daire Byrne, NeilBrown


On Thu, Jun 4, 2026, at 9:08 AM, Benjamin Coddington wrote:
> That turns the one open parameter into "how many RPCs is an interactive
> cycle" (N).  I'd rather not make it an admin knob either:
>
>   - a generous fixed default - over-provisioning costs almost nothing, since
>     a sustained stream never idles to collect it, and under-provisioning just
>     lets the tail of an unusually large command fall back to bulk;
>   - measure it - track per-connection burst size between idle gaps;
>   - borrow the v4.1 slot count as a per-connection hint (v3 would need a
>     fallback).

If we can get away with a single mechanism rather than an additional
fallback, that will keep things "no more complicated than necessary"
and easier to test.

> Does a burst-sized allowance fit how you were picturing the sparse tier, or
> were you thinking the single-RPC jump was enough and the rest should just
> share bulk?  That's the main fork I see.

It's a good open question. Why not prototype a burst-sized allowance and
then we can examine the behavior using your test harness to see how much
of a difference it makes.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-04 16:08       ` Benjamin Coddington
  2026-06-04 16:43         ` Chuck Lever
@ 2026-06-04 21:11         ` NeilBrown
  1 sibling, 0 replies; 14+ messages in thread
From: NeilBrown @ 2026-06-04 21:11 UTC (permalink / raw)
  To: Benjamin Coddington
  Cc: Chuck Lever, Benjamin Coddington, Jeff Layton, linux-nfs,
	Daire Byrne

On Fri, 05 Jun 2026, Benjamin Coddington wrote:
> On 4 Jun 2026, at 10:54, Chuck Lever wrote:
> > - I'm not a fan of adding administrative controls, and these days, a
> >   module parameter is outdated and way too global. I'd like to see
> >   that removed until we have a clear need (use cases) for tuning
> 
> Agreed - I'll drop the module parameter.  With the reframe below the fair
> path is cheap enough to be always-on.
> 
> > - Having to rely on client identity is going to be difficult to get
> >   right. ... fairness-by-client-address means all clients behind a NAT
> >   get reduced to a single fairness unit.
> 
> Agreed, and avoiding identity entirely is most of the appeal of the
> reframe - it also gets NFSv3, NFSv4.0 and LOCALIO for free, which solves
> the no-session problem you raised.
> 
> > - The observability you proposed ... worth ... making it a part of the
> >   patch series
> 
> Will do - I'll turn the latency-injection measurement into something that
> stays in the series, but it's probably not appropriate for any final
> inclusion.
> 
> > Instead, we could look at the problem as "preventing starvation of
> > any one connection" ...
> > fq_codel's success comes less from its fair queueing than from its
> > sparse-flow optimization; that is, flows with nothing in flight jump
> > the queue and heavy flows share what remains.
> 
> I like this - it dissolves the identity question, and xpt_nr_rqsts already
> gives us the in-flight signal to classify on.
> 
> One wrinkle I'd want to get right first: what a user perceives as
> "interactive" is a command, not a single RPC, and a command is a burst of
> correlated RPCs - loading a web page whose browser cache is on NFS is a
> readdir, then stat/open/read across many files.  Plain sparse-flow grants
> priority for one quantum (~one RPC for us), so only the first RPC of the
> burst jumps; RPCs 2..N find the connection backlogged, demote to the bulk
> tier, and complete at the aggressor's rate.  The leading edge is fast but
> the command isn't.
> 
> Rough numbers, ~650 ops/s pool, interactive on one connection, aggressor on
> 16, a 50-RPC command:
> 
>   today (per-transport FIFO):     50 * 17 / 650  ~= 1.3 s
>   sparse-flow, 1-RPC quantum:     ~= 1.3 s  (only the first RPC saved)
>   whole command prioritized:      50 / 650       ~= 0.08 s
> 
> So to make the command feel interactive the priority allowance has to cover
> the cycle, not one RPC: grant an idle->active connection a budget of ~N RPCs,
> re-granted each time it returns to idle, rather than a single jump.  This
> stays identity-free and connection-count-proof for the same reason your
> version does - a sustained stream never returns to idle, so it collects the
> allowance once and is in the bulk tier forever after, however many
> connections it opens.  A user-paced connection idles between commands and is
> re-granted each time.  (This also naturally covers Daire's
> workstations-over-render-farm case: bursty-then-idle vs sustained, no subnet
> config.)
> 
> That turns the one open parameter into "how many RPCs is an interactive
> cycle" (N).  I'd rather not make it an admin knob either:

There is another parameter: how much priority boost do you give
"interactive" tasks.  You cannot let interactive completely dominate
batch, else a sufficiently large number of interactive clients can
starve the batch clients.

> 
>   - a generous fixed default - over-provisioning costs almost nothing, since
>     a sustained stream never idles to collect it, and under-provisioning just
>     lets the tail of an unusually large command fall back to bulk;
>   - measure it - track per-connection burst size between idle gaps;
>   - borrow the v4.1 slot count as a per-connection hint (v3 would need a
>     fallback).
> 
> Does a burst-sized allowance fit how you were picturing the sparse tier, or
> were you thinking the single-RPC jump was enough and the rest should just
> share bulk?  That's the main fork I see.
> 
> I'll rework the series around this - always-on, no module parameter, no
> client identity, sparse/bulk tiers keyed on xpt_nr_rqsts, measurement
> included.  I also owe new numbers: my current harness runs every connection
> at a fixed window, so the "interactive" client is actually backlogged -
> the wrong shape to show any of this.  The metric that matters is
> command-completion time versus the aggressor's connection count.
> 
> Thanks for the reframe - it's a much better-shaped problem.

It seems to be a completely different problem, but maybe I misunderstood
your original problem statement.
You seemed to start out focused on nconnect and wanting fairness between
clients with nconnect=1 and those with nconnect=16.  Is this not what
you really want?

The "interactive" frame can only work for clients that are single-user
or few-users.  A non-trivial multi-user system could easily start
looking like a batch system, at least in bursts, and this could give
users surprising changes in latency.

It might make sense to add an "interactive" export flag so that some
clients get a boost.  Manual configuration is best avoided, but not
always possible (which is why we even have export options).

Ideally the client would determine what is "interactive" and use v4.1
sessions to put "interactive" on a different session to "batch".  Then
it should be easier for interactive sessions to get an appropriate
share.


Thanks,
NeilBrown


> 
> Ben
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-03 22:52 ` [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch NeilBrown
  2026-06-04 12:11   ` Daire Byrne
  2026-06-04 14:25   ` Benjamin Coddington
@ 2026-06-04 22:27   ` Jeff Layton
  2 siblings, 0 replies; 14+ messages in thread
From: Jeff Layton @ 2026-06-04 22:27 UTC (permalink / raw)
  To: NeilBrown, Benjamin Coddington; +Cc: Chuck Lever, linux-nfs, Daire Byrne

On Thu, 2026-06-04 at 08:52 +1000, NeilBrown wrote:
> On Thu, 04 Jun 2026, Benjamin Coddington wrote:
> > knfsd dispatches ready transports from a single per-pool FIFO, so a
> > client's share of nfsd service scales with the number of connections it
> > holds rather than being shared per client.  A client that opens many
> > connections (nconnect, or a farm of data movers) starves other clients
> > on the same server purely by out-numbering them in sockets.
> > 
> > I measured this with a load generator that pins each request to a fixed
> > service time and does no filesystem work, so that nfsd thread-time is the
> > only scarce resource (8 threads, 10ms/op, ~648 ops/s pool ceiling).  A
> > greedy client opens K connections alongside one single-connection
> > interactive client.
> > 
> > NFSv3, dispatch as it is today:
> > 
> >   greedy K   greedy share   interactive ops/s
> >      1           50%              241
> >      4           80%              129
> >      8           89%               72
> >     16           94%               38
> > 
> > The interactive client's share tracks 1/(K+1) and its throughput falls
> > roughly 6x while it does nothing different.  NFSv4.1 behaves identically
> > (89% greedy at K=8) even when the greedy connections are bound to a
> > single session, because the dispatch decision is below the NFS version.
> > 
> > The same NFSv4.1 workload with fair queueing enabled:
> > 
> >   greedy K   greedy share   interactive ops/s
> >      8           72%              182
> >     16           73%              177
> >     32           70%              193
> > 
> > The greedy client's share no longer climbs with its connection count and
> > the interactive client recovers (72 -> 182 ops/s at K=8).  Aggregate
> > throughput is unchanged: the T/D pool ceiling is the same with fair
> > queueing on and off.  The split does not reach 50/50 because a single
> > interactive connection is bounded by its request window and by XPT_BUSY
> > serialising one transport; with a deeper window it reaches ~59/41.
> > 
> > The approach:
> > 
> >   - sunrpc grows an opaque per-transport fairness key (patch 1), with a
> >     default derived from the source address (the source port is excluded
> >     so a client's several connections share one key), and an opt-in
> >     per-pool scheduler that buckets ready transports by that key and
> >     dispatches round-robin across keys (patch 2).  When it is disabled,
> >     which is the default, the existing lockless FIFO path is unchanged.
> > 
> >   - nfsd gains a "fairq" module parameter to turn it on (patch 3) and
> >     stamps the NFSv4.1 clientid as the key when a connection binds to a
> >     session (patch 4), so all of a client's connections share one key.
> >     NFSv3 uses the source-address default.
> > 
> > This is an RFC; a few questions for the list:
> > 
> >   - Unit of fairness: clientid (used here) or session?  Earlier
> >     discussion leaned toward exploring per-session.
> > 
> >   - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
> >     on the opt-in path, versus a lockless or per-flow structure.
> 
> I'm not keen on the hash bucket approach, though I can't clearly say
> why.
> I'm also not keen on the opt-in design.  I don't like asking the admin to
> tune performance.  We should always provide optimal performance.
> 
> I imagine creating an object which represents a client - using IP
> address or possibly v4 client id.  These may well be located using a
> hash table (rhashtable?), but that is peripheral to the design.
> Each xprt has an associated client which can be changed at any time
> (nfsd could change to a v4.1-client).
> 
> Each client has a lwq of xprts and is part of an lwq of clients.
> 
> To enqueue an xprt, we grab the client, enqueue to that, then if
> necessary enqueue the client.
> 
> To dequeue next, we dequeue the first client, dequeue the first xprt,
> then optionally enqueue the client again.
> 
> So this would be slightly more work than the current (2 dequeues instead
> of 1) but I think that might be acceptable.
> 
> clients would be refcounted by xprts and probably rcu-freed.
> 
> > 
> >   - Would a per-client in-flight cap be preferable to proportional fair
> >     queueing?
> 
> A per-client cap would only be ok if the admin didn't have to tune it.
> So the client would need to get some sort of feed-back from the server
> so that it knows when it is pushing too hard.  If we were still using
> UDP we could possibly use packet-loss for that feed-back, but we aren't
> and don't want to.
> With v4.1 we can of course use the slot based flow control and I think
> we should if we can agree on a good design.  With v3 I don't think there
> is any way to get the needed feed-back
> 

Shrink the TCP window to 0 for a bit on the connection(s)? That's the
typical way to slow down a spammy client on a TCP socket. It's a bit
coarse grained but typically the best we can do with v3. IDK, could we
do something with the ECN flag?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
  2026-06-03 15:09 [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch Benjamin Coddington
                   ` (4 preceding siblings ...)
  2026-06-03 22:52 ` [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch NeilBrown
@ 2026-06-04 22:48 ` Jeff Layton
  5 siblings, 0 replies; 14+ messages in thread
From: Jeff Layton @ 2026-06-04 22:48 UTC (permalink / raw)
  To: Benjamin Coddington, Chuck Lever, NeilBrown; +Cc: linux-nfs, Daire Byrne

On Wed, 2026-06-03 at 11:09 -0400, Benjamin Coddington wrote:
> knfsd dispatches ready transports from a single per-pool FIFO, so a
> client's share of nfsd service scales with the number of connections it
> holds rather than being shared per client.  A client that opens many
> connections (nconnect, or a farm of data movers) starves other clients
> on the same server purely by out-numbering them in sockets.
> 
> I measured this with a load generator that pins each request to a fixed
> service time and does no filesystem work, so that nfsd thread-time is the
> only scarce resource (8 threads, 10ms/op, ~648 ops/s pool ceiling).  A
> greedy client opens K connections alongside one single-connection
> interactive client.
> 
> NFSv3, dispatch as it is today:
> 
>   greedy K   greedy share   interactive ops/s
>      1           50%              241
>      4           80%              129
>      8           89%               72
>     16           94%               38
> 
> The interactive client's share tracks 1/(K+1) and its throughput falls
> roughly 6x while it does nothing different.  NFSv4.1 behaves identically
> (89% greedy at K=8) even when the greedy connections are bound to a
> single session, because the dispatch decision is below the NFS version.
> 
> The same NFSv4.1 workload with fair queueing enabled:
> 
>   greedy K   greedy share   interactive ops/s
>      8           72%              182
>     16           73%              177
>     32           70%              193
> 

These seem like pretty clear improvements.

> The greedy client's share no longer climbs with its connection count and
> the interactive client recovers (72 -> 182 ops/s at K=8).  Aggregate
> throughput is unchanged: the T/D pool ceiling is the same with fair
> queueing on and off.  The split does not reach 50/50 because a single
> interactive connection is bounded by its request window and by XPT_BUSY
> serialising one transport; with a deeper window it reaches ~59/41.
> 
> The approach:
> 
>   - sunrpc grows an opaque per-transport fairness key (patch 1), with a
>     default derived from the source address (the source port is excluded
>     so a client's several connections share one key), and an opt-in
>     per-pool scheduler that buckets ready transports by that key and
>     dispatches round-robin across keys (patch 2).  When it is disabled,
>     which is the default, the existing lockless FIFO path is unchanged.
> 

That makes sense: you can make the IP address the key by default, and
then nfsd can override that later with the sessionid or whatever.

>   - nfsd gains a "fairq" module parameter to turn it on (patch 3) and
>     stamps the NFSv4.1 clientid as the key when a connection binds to a
>     session (patch 4), so all of a client's connections share one key.
>     NFSv3 uses the source-address default.
> 

Like Chuck, I greatly prefer no tunables, but while experimenting with
this, it may be helpful to add some under debugfs. The goal at the end
should be "no tunables" though.

> This is an RFC; a few questions for the list:
> 
>   - Unit of fairness: clientid (used here) or session?  Earlier
>     discussion leaned toward exploring per-session.
> 

I don't think we know. Per-session sounds like the right thing, but I'd
leave room for experimentation here. For v4.0, you might want to do
this by clientid. For v3, I think using the IP address without port
like you have is probably as good as can be done.

>   - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
>     on the opt-in path, versus a lockless or per-flow structure.
> 

I don't have a real preference here. Whatever performs best, most
likely. This will be in a hot path.

>   - Would a per-client in-flight cap be preferable to proportional fair
>     queueing?
> 

I think probably not, but this too might require some experimentation.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2026-06-04 22:48 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-03 15:09 [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch Benjamin Coddington
2026-06-03 15:09 ` [PATCH RFC 1/4] sunrpc: add a per-transport fairness key to svc_xprt Benjamin Coddington
2026-06-03 15:09 ` [PATCH RFC 2/4] sunrpc: dispatch ready transports fairly per client Benjamin Coddington
2026-06-03 15:09 ` [PATCH RFC 3/4] nfsd: add a fairq module parameter Benjamin Coddington
2026-06-03 15:09 ` [PATCH RFC 4/4] nfsd: key NFSv4.1 connections by clientid for fair queueing Benjamin Coddington
2026-06-03 22:52 ` [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch NeilBrown
2026-06-04 12:11   ` Daire Byrne
2026-06-04 14:54     ` Chuck Lever
2026-06-04 16:08       ` Benjamin Coddington
2026-06-04 16:43         ` Chuck Lever
2026-06-04 21:11         ` NeilBrown
2026-06-04 14:25   ` Benjamin Coddington
2026-06-04 22:27   ` Jeff Layton
2026-06-04 22:48 ` Jeff Layton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.