From: Benjamin Coddington <ben.coddington@hammerspace.com>
To: Chuck Lever <cel@kernel.org>, Jeff Layton <jlayton@kernel.org>,
	NeilBrown <neilb@ownmail.net>
Cc: linux-nfs@vger.kernel.org, Daire Byrne <daire@brahma.io>
Subject: [PATCH RFC 0/4] nfsd: per-client fair-queue dispatch
Date: Wed,  3 Jun 2026 11:09:38 -0400
Message-ID: <cover.1780498019.git.bcodding@hammerspace.com>

knfsd dispatches ready transports from a single per-pool FIFO, so a
client's share of nfsd service scales with the number of connections it
holds rather than being divided evenly among clients.  A client that
opens many connections (nconnect, or a farm of data movers) starves
other clients on the same server purely by outnumbering them in sockets.

I measured this with a load generator that pins each request to a fixed
service time and does no filesystem work, so that nfsd thread-time is the
only scarce resource (8 threads, 10ms/op, ~648 ops/s pool ceiling).  A
greedy client opens K connections alongside one single-connection
interactive client.

NFSv3, dispatch as it is today:

  greedy K   greedy share   interactive ops/s
     1           50%              241
     4           80%              129
     8           89%               72
    16           94%               38

The interactive client's share tracks 1/(K+1) and its throughput falls
roughly 6x (241 -> 38 ops/s) while it does nothing different.  NFSv4.1
behaves identically (89% greedy at K=8) even when the greedy connections
are bound to a single session, because the dispatch decision is made
below the NFS protocol layer and does not depend on the NFS version.
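
The 1/(K+1) figure falls out of a simple model, sketched below as an
illustration rather than as anything taken from the series.  It assumes
every connection always has a request ready and that FIFO dispatch
therefore serves each connection equally often; the function name is
hypothetical.

```c
#include <assert.h>

/*
 * Percentage of pool service received by a client that holds `conns`
 * of the `total` connections, assuming all connections are equally
 * busy and a single FIFO serves them in turn (a modelling assumption,
 * not a measurement).
 */
static double fifo_share_pct(unsigned int conns, unsigned int total)
{
	assert(total > 0 && conns <= total);
	return 100.0 * conns / total;
}
```

With one interactive connection next to K greedy ones, total is K+1, so
fifo_share_pct(8, 9) gives the greedy client just under 89% and
fifo_share_pct(1, 17) leaves the interactive client just under 6%,
in line with the K=8 and K=16 rows above.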

The same NFSv4.1 workload with fair queueing enabled:

  greedy K   greedy share   interactive ops/s
     8           72%              182
    16           73%              177
    32           70%              193

The greedy client's share no longer climbs with its connection count and
the interactive client recovers (72 -> 182 ops/s at K=8).  Aggregate
throughput is unchanged: the T/D pool ceiling is the same with fair
queueing on and off.  The split does not reach 50/50 because a single
interactive connection is bounded by its request window and by XPT_BUSY
serialising one transport; with a deeper window it reaches ~59/41.

The approach:

  - sunrpc grows an opaque per-transport fairness key (patch 1), with a
    default derived from the source address (the source port is excluded
    so a client's several connections share one key), and an opt-in
    per-pool scheduler that buckets ready transports by that key and
    dispatches round-robin across keys (patch 2).  When it is disabled,
    which is the default, the existing lockless FIFO path is unchanged.

  - nfsd gains a "fairq" module parameter to turn it on (patch 3) and
    stamps the NFSv4.1 clientid as the key when a connection binds to a
    session (patch 4), so all of a client's connections share one key.
    NFSv3 uses the source-address default.
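
The bucket-and-round-robin idea can be illustrated with a small
user-space sketch.  This is not the code in patch 2: every name, the
bucket count, the queue depth and the use of a plain modulo as the hash
are assumptions made for the example, and the per-pool locking is left
out entirely.

```c
#include <assert.h>
#include <string.h>

#define FQ_NBUCKETS 8	/* hypothetical number of key buckets */
#define FQ_QDEPTH   16	/* hypothetical per-bucket capacity */

/* One bucket: a small ring buffer of ready transport ids. */
struct fq_bucket {
	int items[FQ_QDEPTH];
	unsigned int head, count;
};

struct fq_sched {
	struct fq_bucket bucket[FQ_NBUCKETS];
	unsigned int next;	/* bucket to try first on the next dispatch */
};

static void fq_init(struct fq_sched *s)
{
	memset(s, 0, sizeof(*s));
}

/* Queue a ready transport under its fairness key; -1 if the bucket is full. */
static int fq_enqueue(struct fq_sched *s, unsigned long key, int xprt_id)
{
	struct fq_bucket *b = &s->bucket[key % FQ_NBUCKETS];

	if (b->count == FQ_QDEPTH)
		return -1;
	b->items[(b->head + b->count) % FQ_QDEPTH] = xprt_id;
	b->count++;
	return 0;
}

/* Take one transport from the next non-empty bucket; -1 if none is ready. */
static int fq_dequeue(struct fq_sched *s)
{
	unsigned int i;

	for (i = 0; i < FQ_NBUCKETS; i++) {
		unsigned int idx = (s->next + i) % FQ_NBUCKETS;
		struct fq_bucket *b = &s->bucket[idx];
		int id;

		if (!b->count)
			continue;
		id = b->items[b->head];
		b->head = (b->head + 1) % FQ_QDEPTH;
		b->count--;
		s->next = (idx + 1) % FQ_NBUCKETS;	/* move on to the next key */
		return id;
	}
	return -1;
}
```

If a greedy client queues transports 10, 11, 12 and 13 under key 1 and
an interactive client queues transport 20 under key 2, successive
fq_dequeue() calls return 10, 20, 11, 12, 13: the interactive transport
is served second rather than fifth as it would be from a single FIFO.
Keys that collide in the modulo share a bucket in this sketch, so they
would also share one turn.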

This is an RFC; a few questions for the list:

  - Unit of fairness: clientid (used here) or session?  Earlier
    discussion leaned toward exploring per-session.

  - Mechanism: a fixed bucket hash under a per-pool spinlock taken only
    on the opt-in path, versus a lockless or per-flow structure.

  - Would a per-client in-flight cap be preferable to proportional fair
    queueing?

The measurement used a debug-only filehandle-latency hook that is not
part of this series.

Benjamin Coddington (4):
  sunrpc: add a per-transport fairness key to svc_xprt
  sunrpc: dispatch ready transports fairly per client
  nfsd: add a fairq module parameter
  nfsd: key NFSv4.1 connections by clientid for fair queueing

 fs/nfsd/nfs4state.c             |  17 +++
 fs/nfsd/nfssvc.c                |  19 +++
 include/linux/sunrpc/svc.h      |   5 +
 include/linux/sunrpc/svc_xprt.h |  46 ++++++-
 net/sunrpc/svc.c                |   2 +
 net/sunrpc/svc_xprt.c           | 216 +++++++++++++++++++++++++++++++-
 6 files changed, 302 insertions(+), 3 deletions(-)


base-commit: e7ca66ba17f1b5e4ecbb29b9c3c4a31aa062bed0
-- 
2.53.0

