public inbox for linux-nfs@vger.kernel.org
* [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets
@ 2026-02-05 15:57 Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 1/7] workqueue: Automatic affinity scope fallback for single-pod topologies Chuck Lever
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

High-throughput NFSD workloads exhibit significant lock contention on
TCP connections. Worker threads compete for the socket lock during
receives and serialize on xpt_mutex during sends, limiting scalability.

This series addresses both paths:

 - Receive: A dedicated kernel thread per TCP connection owns all
   sock_recvmsg() calls and queues complete RPC messages for workers
   via lock-free llist. This eliminates socket lock contention among
   workers.

 - Transmit: Flat combining allows one thread to send on behalf of
   multiple waiters. Threads enqueue requests; the mutex holder
   ("combiner") processes the batch, amortizing lock acquisition and
   enabling TCP segment coalescing via MSG_MORE.
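For readers unfamiliar with the technique, the transmit-side idea can be
sketched in userspace C roughly as follows. Every name here (fc_submit and
friends) is invented for illustration; the patch itself operates on the
transport's xpt_mutex and an llist of queued sends, and a real combiner
would sleep rather than spin.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* A queued send request; 'done' is set once some combiner has sent it. */
struct fc_req {
	struct fc_req *next;
	int payload;
	atomic_int done;
};

struct fc_ctx {
	_Atomic(struct fc_req *) head;	/* lock-free request stack */
	pthread_mutex_t lock;		/* the combiner's mutex (cf. xpt_mutex) */
	int sent;			/* requests transmitted so far */
};

/* Push a request onto the lock-free list (cf. llist_add()). */
static void fc_enqueue(struct fc_ctx *ctx, struct fc_req *req)
{
	struct fc_req *old = atomic_load(&ctx->head);

	atomic_store(&req->done, 0);
	do {
		req->next = old;
	} while (!atomic_compare_exchange_weak(&ctx->head, &old, req));
}

/*
 * Submit a request. Whoever wins the mutex becomes the combiner and
 * transmits every queued request, not just its own; the losers merely
 * wait for 'done'. This is what amortizes the mutex acquisition.
 */
static void fc_submit(struct fc_ctx *ctx, struct fc_req *req)
{
	fc_enqueue(ctx, req);

	while (!atomic_load(&req->done)) {
		if (pthread_mutex_trylock(&ctx->lock) == 0) {
			/* Take the whole batch at once (cf. llist_del_all()). */
			struct fc_req *batch =
				atomic_exchange(&ctx->head, NULL);

			while (batch) {
				struct fc_req *next = batch->next;

				ctx->sent++;	/* stands in for sock_sendmsg() */
				atomic_store(&batch->done, 1);
				batch = next;
			}
			pthread_mutex_unlock(&ctx->lock);
		}
	}
}
```

In the real series the per-request loop is also where MSG_MORE is set on
all but the final message of a batch, which is what enables the TCP
segment coalescing mentioned above.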

Supporting changes include a workqueue affinity fix for single-LLC
systems, a page recycling pool for receive buffers, and explicit TCP
buffer sizing for high bandwidth-delay product networks.

Base commit: v6.19-rc8

---

Chuck Lever (7):
  workqueue: Automatic affinity scope fallback for single-pod topologies
  sunrpc: split svc_data_ready into protocol-specific callbacks
  sunrpc: add per-transport page recycling pool
  sunrpc: add dedicated TCP receiver thread
  sunrpc: implement flat combining for TCP socket sends
  sunrpc: unify fore and backchannel server TCP send paths
  SUNRPC: Set explicit TCP socket buffer sizes for NFSD

 include/linux/sunrpc/svc.h      |   1 +
 include/linux/sunrpc/svc_xprt.h |  32 ++
 include/linux/sunrpc/svcsock.h  |  40 ++
 include/linux/workqueue.h       |   8 +-
 kernel/workqueue.c              |  68 ++-
 net/sunrpc/svc.c                |  13 +
 net/sunrpc/svc_xprt.c           | 151 ++++++
 net/sunrpc/svcsock.c            | 797 +++++++++++++++++++++++++++++---
 net/sunrpc/xprtsock.c           |  60 +--
 9 files changed, 1044 insertions(+), 126 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/7] workqueue: Automatic affinity scope fallback for single-pod topologies
  2026-02-05 15:57 [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
@ 2026-02-05 15:57 ` Chuck Lever
  2026-02-06 14:57   ` Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 2/7] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The default affinity scope WQ_AFFN_CACHE assumes systems have
multiple last-level caches. On systems where all CPUs share a
single LLC (common with Intel monolithic dies), this scope
degenerates to a single worker pool. All queue_work() calls then
contend on that pool's single lock, causing severe performance
degradation under high-throughput workloads.

For example, on a 12-core system with a single shared L3 cache
running NFS over RDMA with 12 fio jobs, perf shows approximately
39% of CPU cycles spent in native_queued_spin_lock_slowpath,
nearly all from __queue_work() contending on the single pool lock.

On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
scopes all collapse to a single pod.

Add wq_effective_affn_scope() to detect when a selected affinity
scope provides only one pod despite having multiple CPUs, and
automatically fall back to a finer-grained scope. This enables lock
distribution to scale with CPU count without requiring manual
configuration via the workqueue.default_affinity_scope parameter or
per-workqueue sysfs tuning.

The fallback is conservative: it triggers only when a scope
degenerates to exactly one pod, and respects explicitly configured
(non-default) scopes.

Also update wq_affn_scope_show() to display the effective scope
when fallback occurs, making the behavior transparent to
administrators via sysfs (e.g., "default (cache -> smt)").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/workqueue.h |  8 ++++-
 kernel/workqueue.c        | 68 +++++++++++++++++++++++++++++++++++----
 2 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index dabc351cc127..1fca5791337d 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -128,10 +128,16 @@ struct rcu_work {
 	struct workqueue_struct *wq;
 };
 
+/*
+ * Affinity scopes are ordered from finest to coarsest granularity. This
+ * ordering is used by the automatic fallback logic in wq_effective_affn_scope()
+ * which walks from coarse toward fine when a scope degenerates to a single pod.
+ */
 enum wq_affn_scope {
 	WQ_AFFN_DFL,			/* use system default */
 	WQ_AFFN_CPU,			/* one pod per CPU */
-	WQ_AFFN_SMT,			/* one pod poer SMT */
+	WQ_AFFN_SMT,			/* one pod per SMT */
+	WQ_AFFN_CLUSTER,		/* one pod per cluster */
 	WQ_AFFN_CACHE,			/* one pod per LLC */
 	WQ_AFFN_NUMA,			/* one pod per NUMA node */
 	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 253311af47c6..32598b9cd1c2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -405,6 +405,7 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
 	[WQ_AFFN_DFL]		= "default",
 	[WQ_AFFN_CPU]		= "cpu",
 	[WQ_AFFN_SMT]		= "smt",
+	[WQ_AFFN_CLUSTER]	= "cluster",
 	[WQ_AFFN_CACHE]		= "cache",
 	[WQ_AFFN_NUMA]		= "numa",
 	[WQ_AFFN_SYSTEM]	= "system",
@@ -4753,6 +4754,39 @@ static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
 		cpumask_copy(attrs->cpumask, unbound_cpumask);
 }
 
+/*
+ * Determine the effective affinity scope. If the configured scope results
+ * in a single pod (e.g., WQ_AFFN_CACHE on a system with one shared LLC),
+ * fall back to a finer-grained scope to distribute pool lock contention.
+ *
+ * The search stops at WQ_AFFN_CPU, which always provides one pod per CPU
+ * and thus cannot degenerate further.
+ *
+ * Returns the scope to actually use, which may differ from the configured
+ * scope on systems where coarser scopes degenerate.
+ */
+static enum wq_affn_scope wq_effective_affn_scope(enum wq_affn_scope scope)
+{
+	struct wq_pod_type *pt;
+
+	/*
+	 * Walk from the requested scope toward finer granularity. Stop
+	 * when a scope provides more than one pod, or when CPU scope is
+	 * reached. CPU scope always provides nr_possible_cpus() pods.
+	 */
+	while (scope > WQ_AFFN_CPU) {
+		pt = &wq_pod_types[scope];
+
+		/* Multiple pods at this scope; no fallback needed */
+		if (pt->nr_pods > 1)
+			break;
+
+		scope--;
+	}
+
+	return scope;
+}
+
 /* find wq_pod_type to use for @attrs */
 static const struct wq_pod_type *
 wqattrs_pod_type(const struct workqueue_attrs *attrs)
@@ -4763,8 +4797,13 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
 	/* to synchronize access to wq_affn_dfl */
 	lockdep_assert_held(&wq_pool_mutex);
 
+	/*
+	 * For default scope, apply automatic fallback for degenerate
+	 * topologies. Explicit scope selection via sysfs or per-workqueue
+	 * attributes bypasses fallback, preserving administrator intent.
+	 */
 	if (attrs->affn_scope == WQ_AFFN_DFL)
-		scope = wq_affn_dfl;
+		scope = wq_effective_affn_scope(wq_affn_dfl);
 	else
 		scope = attrs->affn_scope;
 
@@ -7206,16 +7245,27 @@ static ssize_t wq_affn_scope_show(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
 	struct workqueue_struct *wq = dev_to_wq(dev);
+	enum wq_affn_scope scope, effective;
 	int written;
 
 	mutex_lock(&wq->mutex);
-	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL)
-		written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
-				    wq_affn_names[WQ_AFFN_DFL],
-				    wq_affn_names[wq_affn_dfl]);
-	else
+	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL) {
+		scope = wq_affn_dfl;
+		effective = wq_effective_affn_scope(scope);
+		if (wq_pod_types[effective].nr_pods >
+		    wq_pod_types[scope].nr_pods)
+			written = scnprintf(buf, PAGE_SIZE, "%s (%s -> %s)\n",
+					    wq_affn_names[WQ_AFFN_DFL],
+					    wq_affn_names[scope],
+					    wq_affn_names[effective]);
+		else
+			written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
+					    wq_affn_names[WQ_AFFN_DFL],
+					    wq_affn_names[scope]);
+	} else {
 		written = scnprintf(buf, PAGE_SIZE, "%s\n",
 				    wq_affn_names[wq->unbound_attrs->affn_scope]);
+	}
 	mutex_unlock(&wq->mutex);
 
 	return written;
@@ -8023,6 +8073,11 @@ static bool __init cpus_share_smt(int cpu0, int cpu1)
 #endif
 }
 
+static bool __init cpus_share_cluster(int cpu0, int cpu1)
+{
+	return cpumask_test_cpu(cpu0, topology_cluster_cpumask(cpu1));
+}
+
 static bool __init cpus_share_numa(int cpu0, int cpu1)
 {
 	return cpu_to_node(cpu0) == cpu_to_node(cpu1);
@@ -8042,6 +8097,7 @@ void __init workqueue_init_topology(void)
 
 	init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
 	init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
+	init_pod_type(&wq_pod_types[WQ_AFFN_CLUSTER], cpus_share_cluster);
 	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
 	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/7] sunrpc: split svc_data_ready into protocol-specific callbacks
  2026-02-05 15:57 [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 1/7] workqueue: Automatic affinity scope fallback for single-pod topologies Chuck Lever
@ 2026-02-05 15:57 ` Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 3/7] sunrpc: add per-transport page recycling pool Chuck Lever
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Separate data-ready callbacks enable protocol-specific
optimizations. UDP and TCP transports already have different
requirements: currently UDP sockets do not implement DTLS, so the
XPT_HANDSHAKE check is unnecessary overhead for them.

Prepare the server-side socket infrastructure for additional
changes to TCP's data_ready callback.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svcsock.c | 37 ++++++++++++++++++++++++++++++++-----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index d61cd9b40491..3ec50812b110 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -397,10 +397,37 @@ static void svc_sock_secure_port(struct svc_rqst *rqstp)
 		clear_bit(RQ_SECURE, &rqstp->rq_flags);
 }
 
-/*
- * INET callback when data has been received on the socket.
+/**
+ * svc_udp_data_ready - sk->sk_data_ready callback for UDP sockets
+ * @sk: socket whose receive buffer contains data
+ *
+ * This implementation does not yet support DTLS, so the
+ * XPT_HANDSHAKE check is not needed here.
  */
-static void svc_data_ready(struct sock *sk)
+static void svc_udp_data_ready(struct sock *sk)
+{
+	struct svc_sock	*svsk = (struct svc_sock *)sk->sk_user_data;
+
+	trace_sk_data_ready(sk);
+
+	if (svsk) {
+		/* Refer to svc_setup_socket() for details. */
+		rmb();
+		svsk->sk_odata(sk);
+		trace_svcsock_data_ready(&svsk->sk_xprt, 0);
+		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
+			svc_xprt_enqueue(&svsk->sk_xprt);
+	}
+}
+
+/**
+ * svc_tcp_data_ready - sk->sk_data_ready callback for TCP sockets
+ * @sk: socket whose receive buffer contains data
+ *
+ * Data ingest is skipped while a TLS handshake is in progress
+ * (XPT_HANDSHAKE).
+ */
+static void svc_tcp_data_ready(struct sock *sk)
 {
 	struct svc_sock	*svsk = (struct svc_sock *)sk->sk_user_data;
 
@@ -835,7 +862,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 	svc_xprt_init(sock_net(svsk->sk_sock->sk), &svc_udp_class,
 		      &svsk->sk_xprt, serv);
 	clear_bit(XPT_CACHE_AUTH, &svsk->sk_xprt.xpt_flags);
-	svsk->sk_sk->sk_data_ready = svc_data_ready;
+	svsk->sk_sk->sk_data_ready = svc_udp_data_ready;
 	svsk->sk_sk->sk_write_space = svc_write_space;
 
 	/* initialise setting must have enough space to
@@ -1368,7 +1395,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
 	} else {
 		sk->sk_state_change = svc_tcp_state_change;
-		sk->sk_data_ready = svc_data_ready;
+		sk->sk_data_ready = svc_tcp_data_ready;
 		sk->sk_write_space = svc_write_space;
 
 		svsk->sk_marker = xdr_zero;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/7] sunrpc: add per-transport page recycling pool
  2026-02-05 15:57 [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 1/7] workqueue: Automatic affinity scope fallback for single-pod topologies Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 2/7] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
@ 2026-02-05 15:57 ` Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 4/7] sunrpc: add dedicated TCP receiver thread Chuck Lever
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

RPC server transports allocate pages for receiving incoming data on
every request. Under high load, this repeated allocation and freeing
creates unnecessary overhead in the page allocator hot path.

Introduce svc_page_pool, a lock-free page recycling mechanism that
enables efficient page reuse between receive operations. A follow-up
commit wires this into the TCP transport's receive path; svcrdma's
RDMA Read path might also make use of this mechanism some day.

The pool uses llist for lock-free producer-consumer handoff: worker
threads returning pages after RPC processing act as producers, while
receiver threads allocating pages for incoming data act as
consumers. Pages are linked via page->pcp_llist, which is safe
because these pages are owned exclusively by the transport.

Each pool tracks its NUMA node affinity, allowing page allocations
to target the same node as the transport's receiver thread. Provide
svc_pool_node() to enable transports to determine the NUMA node
associated with a service pool for NUMA-aware resource allocation.
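The producer/consumer handoff can be modelled in plain C11. This is an
invented userspace analogue (fake_page, pool_put, and pool_get are not
names from the patch), with malloc/free standing in for the page
allocator and a next pointer standing in for page->pcp_llist:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for a page; the kernel links real struct pages
 * through page->pcp_llist instead of embedding a next pointer. */
struct fake_page {
	struct fake_page *next;
	char data[64];
};

struct page_pool {
	_Atomic(struct fake_page *) head;
	atomic_int count;
	int max;
};

/* Producer side (any svc thread): return a page to the pool, or free
 * it when the pool is full -- mirrors svc_page_pool_put(). */
static void pool_put(struct page_pool *pool, struct fake_page *pg)
{
	struct fake_page *old;

	if (atomic_load(&pool->count) >= pool->max) {
		free(pg);
		return;
	}
	old = atomic_load(&pool->head);
	do {
		pg->next = old;
	} while (!atomic_compare_exchange_weak(&pool->head, &old, pg));
	atomic_fetch_add(&pool->count, 1);
}

/* Consumer side (the single receiver thread): take a recycled page,
 * or NULL when empty -- mirrors svc_page_pool_get(). Lock-free pop
 * with multiple concurrent poppers would need ABA protection, which
 * is why svc_page_pool_get() requires a single serialized consumer. */
static struct fake_page *pool_get(struct page_pool *pool)
{
	struct fake_page *pg = atomic_load(&pool->head);

	do {
		if (!pg)
			return NULL;
	} while (!atomic_compare_exchange_weak(&pool->head, &pg, pg->next));
	atomic_fetch_sub(&pool->count, 1);
	return pg;
}
```

Note the LIFO order that falls out of a linked-list push/pop: the most
recently returned page is reused first, which is also the page most
likely to still be cache-warm.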

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h      |   1 +
 include/linux/sunrpc/svc_xprt.h |  32 +++++++
 net/sunrpc/svc.c                |  13 +++
 net/sunrpc/svc_xprt.c           | 151 ++++++++++++++++++++++++++++++++
 4 files changed, 197 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 5506d20857c3..f4efe60f4dad 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -457,6 +457,7 @@ void		   svc_wake_up(struct svc_serv *);
 void		   svc_reserve(struct svc_rqst *rqstp, int space);
 void		   svc_pool_wake_idle_thread(struct svc_pool *pool);
 struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
+int		   svc_pool_node(struct svc_pool *pool);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
 const char *	   svc_proc_name(const struct svc_rqst *rqstp);
 int		   svc_encode_result_payload(struct svc_rqst *rqstp,
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index da2a2531e110..e60c2936b1ce 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -9,9 +9,33 @@
 #define SUNRPC_SVC_XPRT_H
 
 #include <linux/sunrpc/svc.h>
+#include <linux/llist.h>
 
 struct module;
 
+/**
+ * struct svc_page_pool - per-transport page recycling pool
+ * @pp_pages: lock-free list of recycled pages
+ * @pp_count: number of pages currently in pool
+ * @pp_numa_node: NUMA node for page allocations
+ * @pp_max: maximum pages to retain in pool
+ *
+ * Lock-free page recycling between producers (svc threads returning
+ * pages) and a single consumer (the thread allocating pages for
+ * receives). Uses llist for efficient producer-consumer handoff
+ * without spinlocks.
+ *
+ * Callers must serialize calls to svc_page_pool_get(); multiple
+ * concurrent consumers are not supported.
+ * Allocate with svc_page_pool_alloc(); free with svc_page_pool_free().
+ */
+struct svc_page_pool {
+	struct llist_head	pp_pages;
+	atomic_t		pp_count;
+	int			pp_numa_node;
+	unsigned int		pp_max;
+};
+
 struct svc_xprt_ops {
 	struct svc_xprt	*(*xpo_create)(struct svc_serv *,
 				       struct net *net,
@@ -187,6 +211,14 @@ void	svc_add_new_perm_xprt(struct svc_serv *serv, struct svc_xprt *xprt);
 void	svc_age_temp_xprts_now(struct svc_serv *, struct sockaddr *);
 void	svc_xprt_deferred_close(struct svc_xprt *xprt);
 
+/* Page pool helpers */
+struct svc_page_pool *svc_page_pool_alloc(int numa_node, unsigned int max);
+void	svc_page_pool_free(struct svc_page_pool *pool);
+void	svc_page_pool_put(struct svc_page_pool *pool, struct page *page);
+void	svc_page_pool_put_bulk(struct svc_page_pool *pool,
+			       struct page **pages, unsigned int count);
+struct page *svc_page_pool_get(struct svc_page_pool *pool);
+
 static inline void svc_xprt_get(struct svc_xprt *xprt)
 {
 	kref_get(&xprt->xpt_ref);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4704dce7284e..6b350cb7d539 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -418,6 +418,19 @@ struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv)
 	return &serv->sv_pools[pidx % serv->sv_nrpools];
 }
 
+/**
+ * svc_pool_node - Return the NUMA node affinity of a service pool
+ * @pool: the service pool
+ *
+ * Return value:
+ *   The NUMA node the pool is associated with, or the local node
+ *   if no explicit mapping exists
+ */
+int svc_pool_node(struct svc_pool *pool)
+{
+	return svc_pool_map_get_node(pool->sp_id);
+}
+
 static int svc_rpcb_setup(struct svc_serv *serv, struct net *net)
 {
 	int err;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 6973184ff667..fe31cf6a9c5d 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -1497,4 +1497,155 @@ int svc_pool_stats_open(struct svc_info *info, struct file *file)
 }
 EXPORT_SYMBOL(svc_pool_stats_open);
 
+static struct llist_node *svc_page_to_llist(struct page *page)
+{
+	return &page->pcp_llist;
+}
+
+static struct page *svc_llist_to_page(struct llist_node *node)
+{
+	return container_of(node, struct page, pcp_llist);
+}
+
+/**
+ * svc_page_pool_alloc - Allocate a page pool
+ * @numa_node: NUMA node for page allocations
+ * @max: maximum pages to retain in pool
+ *
+ * Pages in an svc_page_pool are linked via page->pcp_llist, which is
+ * safe since these pages are owned exclusively by the transport.
+ *
+ * The caller must free the pool with svc_page_pool_free() when
+ * the transport is destroyed.
+ *
+ * Returns a new page pool, or NULL on allocation failure.
+ */
+struct svc_page_pool *svc_page_pool_alloc(int numa_node, unsigned int max)
+{
+	struct svc_page_pool *pool;
+
+	pool = kmalloc_node(sizeof(*pool), GFP_KERNEL, numa_node);
+	if (!pool)
+		return NULL;
+
+	init_llist_head(&pool->pp_pages);
+	atomic_set(&pool->pp_count, 0);
+	pool->pp_numa_node = numa_node;
+	pool->pp_max = max;
+	return pool;
+}
+
+/**
+ * svc_page_pool_free - Free a page pool and all pages in it
+ * @pool: pool to free (may be NULL)
+ */
+void svc_page_pool_free(struct svc_page_pool *pool)
+{
+	struct llist_node *node;
+
+	if (!pool)
+		return;
+
+	while ((node = llist_del_first(&pool->pp_pages)) != NULL)
+		put_page(svc_llist_to_page(node));
+	kfree(pool);
+}
+
+/**
+ * svc_page_pool_put - Return a page to the pool
+ * @pool: pool to return page to (may be NULL)
+ * @page: page to return (may be NULL)
+ *
+ * Transfers ownership of @page to the pool. The caller's reference
+ * is consumed: either the pool retains the page, or put_page() is
+ * called if @pool is NULL or full.
+ */
+void svc_page_pool_put(struct svc_page_pool *pool, struct page *page)
+{
+	if (!page)
+		return;
+	if (!pool || atomic_read(&pool->pp_count) >= pool->pp_max) {
+		put_page(page);
+		return;
+	}
+	llist_add(svc_page_to_llist(page), &pool->pp_pages);
+	atomic_inc(&pool->pp_count);
+}
+
+/**
+ * svc_page_pool_put_bulk - Return multiple pages to the pool
+ * @pool: pool to return pages to (may be NULL)
+ * @pages: array of pages to return
+ * @count: number of pages in @pages array
+ *
+ * Batch version of svc_page_pool_put() that reduces atomic operations
+ * when returning many pages at once. Transfers ownership of all pages
+ * in @pages to the pool. Uses release_pages() for efficient bulk
+ * freeing when the pool is full.
+ *
+ * Unlike svc_page_pool_put(), this function does not handle NULL
+ * entries in @pages. All @count entries must be valid page pointers.
+ */
+void svc_page_pool_put_bulk(struct svc_page_pool *pool,
+			    struct page **pages, unsigned int count)
+{
+	struct llist_node *head, *last, *node;
+	unsigned int i, to_add, avail;
+
+	if (!count)
+		return;
+	if (!pool) {
+		release_pages(pages, count);
+		return;
+	}
+
+	avail = pool->pp_max - min_t(unsigned int, atomic_read(&pool->pp_count), pool->pp_max);
+	to_add = min_t(unsigned int, count, avail);
+	if (!to_add) {
+		release_pages(pages, count);
+		return;
+	}
+
+	head = NULL;
+	last = NULL;
+	for (i = 0; i < to_add; i++) {
+		node = svc_page_to_llist(pages[i]);
+		node->next = head;
+		head = node;
+		if (!last)
+			last = node;
+	}
+	llist_add_batch(head, last, &pool->pp_pages);
+	atomic_add(to_add, &pool->pp_count);
+
+	/* Free overflow pages that didn't fit in the pool */
+	if (to_add < count)
+		release_pages(pages + to_add, count - to_add);
+}
+EXPORT_SYMBOL_GPL(svc_page_pool_put_bulk);
+
+/**
+ * svc_page_pool_get - Get a page from the pool
+ * @pool: pool to take from (may be NULL)
+ *
+ * Returns a recycled page with one reference, or NULL if @pool is
+ * NULL or empty. The caller owns the returned page and must either
+ * return it via svc_page_pool_put() or release it with put_page().
+ *
+ * Caller must serialize; concurrent calls for the same pool are
+ * not supported.
+ */
+struct page *svc_page_pool_get(struct svc_page_pool *pool)
+{
+	struct llist_node *node;
+
+	if (!pool)
+		return NULL;
+	node = llist_del_first(&pool->pp_pages);
+	if (!node)
+		return NULL;
+	atomic_dec(&pool->pp_count);
+	return svc_llist_to_page(node);
+}
+
 /*----------------------------------------------------------------------------*/
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 4/7] sunrpc: add dedicated TCP receiver thread
  2026-02-05 15:57 [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (2 preceding siblings ...)
  2026-02-05 15:57 ` [RFC PATCH 3/7] sunrpc: add per-transport page recycling pool Chuck Lever
@ 2026-02-05 15:57 ` Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 5/7] sunrpc: implement flat combining for TCP socket sends Chuck Lever
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Eliminate receive-side socket lock contention for NFS server TCP
connections by dedicating one kernel thread per TCP socket to handle
all receives.

Current architecture has multiple nfsd worker threads competing for
the same socket, serializing on socket lock inside sock_recvmsg().
The new design creates a single receiver thread per TCP connection
that owns all sock_recvmsg() calls and queues complete RPC messages
for workers to process.

Architecture:

  Before:
    Worker 1 --+                     +-- sock_recvmsg() --+
    Worker 2 --+-- compete for xprt--+-- sock_recvmsg() --+-- CONTENTION
    Worker 3 --+                     +-- sock_recvmsg() --+

  After:
    Receiver Thread -- sock_recvmsg() --+-- Worker 1 -- process -- send
          (no contention)               +-- Worker 2 -- process -- send
                                        +-- Worker 3 -- process -- send

The receiver thread uses a lock-free llist queue to pass complete RPC
messages to worker threads, avoiding spinlock overhead in the fast
path. Flow control limits queue depth to SVC_TCP_MSG_QUEUE_MAX (64)
messages per socket to bound memory usage.

This mirrors the architecture used by svcrdma, where RDMA completion
handlers queue received messages for worker threads rather than having
workers compete for hardware resources.

NUMA Affinity:

The receiver thread is created on the NUMA node associated with the
service pool handling the accept, following the same NUMA placement
strategy used for nfsd worker threads. Page allocations for receive
buffers explicitly target this node via __alloc_pages_bulk(),
providing memory locality for the receive path. This mirrors how
svcrdma allocates resources on the RNIC's NUMA node.

svc_tcp_data_ready() now wakes the dedicated receiver thread instead
of enqueueing the transport for worker threads. If receiver thread
creation fails during connection accept, the connection is rejected;
the client will retry.
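The flow-control behaviour described above can be modelled with an
ordinary bounded queue. This sketch deliberately simplifies: it uses a
pthread mutex/condvar pair where the patch uses a lock-free llist,
wait_event(), and TCP's own receive-window backpressure, and all names
below (rx_queue, worker_consume) are illustrative only:

```c
#include <pthread.h>

#define MSG_QUEUE_MAX 64		/* cf. SVC_TCP_MSG_QUEUE_MAX */

struct rx_queue {
	pthread_mutex_t lock;
	pthread_cond_t data_ready;	/* worker wakeup, cf. svc_xprt_enqueue() */
	pthread_cond_t space;		/* receiver-side backpressure */
	int depth;			/* queued complete RPC messages */
	int to_produce;			/* records left on the "socket" */
};

/* The dedicated receiver: the only thread that "reads the socket".
 * When the queue is full it stops producing; in the patch this means
 * not calling sock_recvmsg(), so TCP flow control throttles the peer. */
static void *receiver_thread(void *arg)
{
	struct rx_queue *q = arg;

	pthread_mutex_lock(&q->lock);
	while (q->to_produce > 0) {
		while (q->depth >= MSG_QUEUE_MAX)
			pthread_cond_wait(&q->space, &q->lock);
		q->to_produce--;	/* one complete record received */
		q->depth++;
		pthread_cond_signal(&q->data_ready);
	}
	pthread_mutex_unlock(&q->lock);
	return NULL;
}

/* Worker side: consume @n queued messages, releasing backpressure as
 * each one is taken off the queue. */
static void worker_consume(struct rx_queue *q, int n)
{
	pthread_mutex_lock(&q->lock);
	while (n > 0) {
		while (q->depth == 0)
			pthread_cond_wait(&q->data_ready, &q->lock);
		q->depth--;
		n--;
		pthread_cond_signal(&q->space);
	}
	pthread_mutex_unlock(&q->lock);
}
```

With more records than MSG_QUEUE_MAX in flight, both waits are
exercised: the receiver parks when the workers fall behind, and the
workers park when the queue drains, which is the bounded-memory
property the commit message claims.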

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h |  36 +++
 net/sunrpc/svcsock.c           | 472 ++++++++++++++++++++++++++++++---
 2 files changed, 477 insertions(+), 31 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index de37069aba90..391ce9c14f2d 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -12,6 +12,32 @@
 
 #include <linux/sunrpc/svc.h>
 #include <linux/sunrpc/svc_xprt.h>
+#include <linux/cache.h>
+#include <linux/llist.h>
+#include <linux/wait.h>
+#include <linux/completion.h>
+#include <linux/ktime.h>
+
+/* Maximum queued messages per TCP socket before backpressure */
+#define SVC_TCP_MSG_QUEUE_MAX	64
+
+/**
+ * struct svc_tcp_msg - queued RPC message ready for processing
+ * @tm_node: lock-free queue linkage
+ * @tm_len: total message length
+ * @tm_npages: number of pages holding message data
+ * @tm_pages: flexible array of pages containing the message
+ *
+ * The receiver thread allocates these to queue complete RPC messages
+ * for worker threads to process. Page ownership transfers from the
+ * receiver's rqstp to this structure, then to the worker's rqstp.
+ */
+struct svc_tcp_msg {
+	struct llist_node	tm_node;
+	size_t			tm_len;
+	unsigned int		tm_npages;
+	struct page		*tm_pages[];
+};
 
 /*
  * RPC server socket.
@@ -43,6 +69,16 @@ struct svc_sock {
 
 	struct completion	sk_handshake_done;
 
+	/* Dedicated receiver thread (TCP only) */
+	struct task_struct	*sk_receiver;
+	struct llist_head	sk_msg_queue;
+	wait_queue_head_t	sk_receiver_wq;
+	struct completion	sk_receiver_exit;
+	struct svc_page_pool	*sk_page_pool;
+	ktime_t			sk_partial_record_time;
+
+	atomic_t		sk_msg_count ____cacheline_aligned_in_smp;
+
 	/* received data */
 	unsigned long		sk_maxpages;
 	struct page *		sk_pages[] __counted_by(sk_maxpages);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 3ec50812b110..fa486a01ee3a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -22,6 +22,7 @@
 
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/kthread.h>
 #include <linux/module.h>
 #include <linux/errno.h>
 #include <linux/fcntl.h>
@@ -93,6 +94,9 @@ static int		svc_udp_sendto(struct svc_rqst *);
 static void		svc_sock_detach(struct svc_xprt *);
 static void		svc_tcp_sock_detach(struct svc_xprt *);
 static void		svc_sock_free(struct svc_xprt *);
+static int		svc_tcp_recv_msg(struct svc_rqst *);
+static int		svc_tcp_start_receiver(struct svc_sock *);
+static void		svc_tcp_stop_receiver(struct svc_sock *);
 
 static struct svc_xprt *svc_create_socket(struct svc_serv *, int,
 					  struct net *, struct sockaddr *,
@@ -440,8 +444,7 @@ static void svc_tcp_data_ready(struct sock *sk)
 		trace_svcsock_data_ready(&svsk->sk_xprt, 0);
 		if (test_bit(XPT_HANDSHAKE, &svsk->sk_xprt.xpt_flags))
 			return;
-		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
-			svc_xprt_enqueue(&svsk->sk_xprt);
+		wake_up(&svsk->sk_receiver_wq);
 	}
 }
 
@@ -934,8 +937,15 @@ static void svc_tcp_state_change(struct sock *sk)
 		rmb();
 		svsk->sk_ostate(sk);
 		trace_svcsock_tcp_state(&svsk->sk_xprt, svsk->sk_sock);
-		if (sk->sk_state != TCP_ESTABLISHED)
+		if (sk->sk_state != TCP_ESTABLISHED) {
 			svc_xprt_deferred_close(&svsk->sk_xprt);
+			/*
+			 * Wake the receiver thread so it sees XPT_CLOSE.
+			 * Without this, a receiver sleeping in wait_event
+			 * may not notice the connection has died.
+			 */
+			wake_up(&svsk->sk_receiver_wq);
+		}
 	}
 }
 
@@ -1003,8 +1013,22 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
 	if (serv->sv_stats)
 		serv->sv_stats->nettcpconn++;
 
+	/*
+	 * Disable busy polling for this socket. The dedicated receiver
+	 * thread does not benefit from busy polling since it is already
+	 * dedicated to this connection and will block in sock_recvmsg()
+	 * waiting for data. Busy polling just wastes CPU cycles.
+	 */
+	WRITE_ONCE(newsock->sk->sk_ll_usec, 0);
+
+	if (svc_tcp_start_receiver(newsvsk) < 0)
+		goto failed_start;
+
 	return &newsvsk->sk_xprt;
 
+failed_start:
+	svc_xprt_put(&newsvsk->sk_xprt);
+	return NULL;
 failed:
 	sockfd_put(newsock);
 	return NULL;
@@ -1151,25 +1175,365 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
 	svsk->sk_marker = xdr_zero;
 }
 
-/**
- * svc_tcp_recvfrom - Receive data from a TCP socket
- * @rqstp: request structure into which to receive an RPC Call
- *
- * Called in a loop when XPT_DATA has been set.
- *
- * Read the 4-byte stream record marker, then use the record length
- * in that marker to set up exactly the resources needed to receive
- * the next RPC message into @rqstp.
- *
- * Returns:
- *   On success, the number of bytes in a received RPC Call, or
- *   %0 if a complete RPC Call message was not ready to return
- *
- * The zero return case handles partial receives and callback Replies.
- * The state of a partial receive is preserved in the svc_sock for
- * the next call to svc_tcp_recvfrom.
+static struct svc_tcp_msg *svc_tcp_msg_alloc(unsigned int npages)
+{
+	return kmalloc(struct_size_t(struct svc_tcp_msg, tm_pages, npages),
+		       GFP_KERNEL);
+}
+
+static void svc_tcp_msg_free(struct svc_tcp_msg *msg)
+{
+	unsigned int i;
+
+	for (i = 0; i < msg->tm_npages; i++)
+		if (msg->tm_pages[i])
+			put_page(msg->tm_pages[i]);
+	kfree(msg);
+}
+
+static void svc_tcp_drain_msg_queue(struct svc_sock *svsk)
+{
+	struct llist_node *node;
+	struct svc_tcp_msg *msg;
+
+	while ((node = llist_del_first(&svsk->sk_msg_queue)) != NULL) {
+		msg = llist_entry(node, struct svc_tcp_msg, tm_node);
+		atomic_dec(&svsk->sk_msg_count);
+		svc_tcp_msg_free(msg);
+	}
+}
+
+static inline void svc_tcp_setup_rqst(struct svc_rqst *rqstp,
+				      struct svc_xprt *xprt)
+{
+	rqstp->rq_xprt_ctxt = NULL;
+	rqstp->rq_prot = IPPROTO_TCP;
+	if (test_bit(XPT_LOCAL, &xprt->xpt_flags))
+		set_bit(RQ_LOCAL, &rqstp->rq_flags);
+	else
+		clear_bit(RQ_LOCAL, &rqstp->rq_flags);
+}
+
+/*
+ * Transfer page ownership from @msg to @rqstp and set up the xdr_buf
+ * for RPC processing.
  */
-static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
+static void svc_tcp_msg_to_rqst(struct svc_rqst *rqstp, struct svc_tcp_msg *msg)
+{
+	struct svc_sock *svsk = container_of(rqstp->rq_xprt,
+					     struct svc_sock, sk_xprt);
+	struct svc_page_pool *pool = svsk->sk_page_pool;
+	unsigned int i;
+
+	for (i = 0; i < msg->tm_npages; i++) {
+		if (rqstp->rq_pages[i])
+			svc_page_pool_put(pool, rqstp->rq_pages[i]);
+		rqstp->rq_pages[i] = msg->tm_pages[i];
+		msg->tm_pages[i] = NULL;
+	}
+
+	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+	rqstp->rq_arg.head[0].iov_len = min_t(size_t, msg->tm_len, PAGE_SIZE);
+	rqstp->rq_arg.pages = rqstp->rq_pages + 1;
+	rqstp->rq_arg.page_base = 0;
+	if (msg->tm_len <= PAGE_SIZE)
+		rqstp->rq_arg.page_len = 0;
+	else
+		rqstp->rq_arg.page_len = msg->tm_len - PAGE_SIZE;
+	rqstp->rq_arg.len = msg->tm_len;
+	rqstp->rq_arg.buflen = msg->tm_npages * PAGE_SIZE;
+
+	rqstp->rq_respages = &rqstp->rq_pages[msg->tm_npages];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
+	svc_xprt_copy_addrs(rqstp, rqstp->rq_xprt);
+	svc_tcp_setup_rqst(rqstp, rqstp->rq_xprt);
+}
+
+static int svc_tcp_queue_msg(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	struct svc_tcp_msg *msg;
+	unsigned int npages;
+	unsigned int i;
+
+	npages = DIV_ROUND_UP(rqstp->rq_arg.len, PAGE_SIZE);
+	msg = svc_tcp_msg_alloc(npages);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->tm_len = rqstp->rq_arg.len;
+	msg->tm_npages = npages;
+
+	for (i = 0; i < npages; i++) {
+		msg->tm_pages[i] = rqstp->rq_pages[i];
+		rqstp->rq_pages[i] = NULL;
+	}
+
+	llist_add(&msg->tm_node, &svsk->sk_msg_queue);
+	atomic_inc(&svsk->sk_msg_count);
+
+	return 0;
+}
+
+static int svc_tcp_receiver_alloc_pages(struct svc_rqst *rqstp)
+{
+	struct svc_sock *svsk = container_of(rqstp->rq_xprt,
+					     struct svc_sock, sk_xprt);
+	struct svc_page_pool *pool = svsk->sk_page_pool;
+	unsigned long pages, filled, ret;
+	struct page *page;
+
+	pages = rqstp->rq_maxpages;
+
+	for (filled = 0; filled < pages; filled++) {
+		page = svc_page_pool_get(pool);
+		if (!page)
+			break;
+		rqstp->rq_pages[filled] = page;
+	}
+	while (filled < pages) {
+		ret = __alloc_pages_bulk(GFP_KERNEL, pool->pp_numa_node, NULL,
+					 pages - filled,
+					 rqstp->rq_pages + filled);
+		if (ret == 0) {
+			while (filled--) {
+				put_page(rqstp->rq_pages[filled]);
+				rqstp->rq_pages[filled] = NULL;
+			}
+			return -ENOMEM;
+		}
+		filled += ret;
+	}
+
+	rqstp->rq_page_end = &rqstp->rq_pages[pages];
+	rqstp->rq_pages[pages] = NULL;
+	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+	rqstp->rq_arg.head[0].iov_len = PAGE_SIZE;
+	rqstp->rq_arg.pages = rqstp->rq_pages + 1;
+	rqstp->rq_arg.page_base = 0;
+	rqstp->rq_arg.page_len = (pages - 2) * PAGE_SIZE;
+	rqstp->rq_arg.len = (pages - 1) * PAGE_SIZE;
+	rqstp->rq_arg.tail[0].iov_len = 0;
+
+	return 0;
+}
+
+/*
+ * Dedicated receiver thread for a TCP socket. This thread owns all
+ * sock_recvmsg() calls for its connection, eliminating socket lock
+ * contention between workers. Complete RPC messages are queued for
+ * worker threads to process.
+ */
+static int svc_tcp_receiver_thread(void *data)
+{
+	struct svc_sock *svsk = data;
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+	struct svc_rqst rqstp_storage;
+	struct svc_rqst *rqstp = &rqstp_storage;
+	unsigned int i;
+	bool progress;
+	int len;
+
+	memset(rqstp, 0, sizeof(*rqstp));
+	rqstp->rq_server = serv;
+	rqstp->rq_maxpages = svc_serv_maxpages(serv);
+	rqstp->rq_pages = kcalloc(rqstp->rq_maxpages + 1,
+				  sizeof(struct page *), GFP_KERNEL);
+	if (!rqstp->rq_pages)
+		goto out_close;
+	rqstp->rq_bvec = kcalloc(rqstp->rq_maxpages,
+				 sizeof(struct bio_vec), GFP_KERNEL);
+	if (!rqstp->rq_bvec)
+		goto out_close_free_pages;
+	rqstp->rq_xprt = &svsk->sk_xprt;
+
+	if (svc_tcp_receiver_alloc_pages(rqstp) < 0)
+		goto out_close_free_bvec;
+
+	while (!kthread_should_stop() &&
+	       !test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags)) {
+		/*
+		 * Wait until there is data in the socket and room in
+		 * the message queue. The condition is re-evaluated on
+		 * each wakeup, so spurious wakeups are harmless.
+		 *
+		 * Use a timeout when there is a partial RPC record.
+		 * This ensures periodic checks of the connection state
+		 * and timeout counters even if no new data arrives.
+		 */
+#define receiver_can_work(svsk) \
+		(tcp_inq((svsk)->sk_sk) > 0 && \
+		 atomic_read(&(svsk)->sk_msg_count) < SVC_TCP_MSG_QUEUE_MAX)
+
+		if (svsk->sk_tcplen > 0)
+			wait_event_interruptible_timeout(svsk->sk_receiver_wq,
+				receiver_can_work(svsk) ||
+				test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags) ||
+				kthread_should_stop(),
+				msecs_to_jiffies(5000));
+		else
+			wait_event_interruptible(svsk->sk_receiver_wq,
+				receiver_can_work(svsk) ||
+				test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags) ||
+				kthread_should_stop());
+#undef receiver_can_work
+
+		progress = false;
+		while (!kthread_should_stop() &&
+		       !test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags)) {
+			if (atomic_read(&svsk->sk_msg_count) >=
+			    SVC_TCP_MSG_QUEUE_MAX)
+				break;
+
+			len = svc_tcp_recv_msg(rqstp);
+			if (len <= 0)
+				break;
+
+			progress = true;
+			if (svc_tcp_queue_msg(svsk, rqstp) < 0) {
+				svc_xprt_deferred_close(&svsk->sk_xprt);
+				break;
+			}
+			if (svc_tcp_receiver_alloc_pages(rqstp) < 0) {
+				svc_xprt_deferred_close(&svsk->sk_xprt);
+				break;
+			}
+		}
+
+		if (!llist_empty(&svsk->sk_msg_queue)) {
+			set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+			svc_xprt_enqueue(&svsk->sk_xprt);
+		}
+
+		/*
+		 * Detect defunct connections with partial RPC records.
+		 * If data sits in the socket buffer but we cannot form
+		 * a complete RPC record, the client may have crashed
+		 * mid-request. Enable keepalives to probe the peer;
+		 * TCP will call state_change when the connection fails.
+		 */
+		if (!progress && svsk->sk_tcplen > 0) {
+			struct sock *sk = svsk->sk_sk;
+
+			/*
+			 * Check if TCP has already detected the dead peer.
+			 * state_change sets XPT_CLOSE, but we may have
+			 * missed the wake_up if we were not yet sleeping.
+			 */
+			if (sk->sk_state != TCP_ESTABLISHED || sk->sk_err) {
+				svc_xprt_deferred_close(&svsk->sk_xprt);
+				break;
+			}
+
+			/*
+			 * Still ESTABLISHED but stuck with partial record.
+			 * Enable keepalives to probe the peer. Use short
+			 * intervals to bound the time before detecting a
+			 * dead client.
+			 */
+			if (!sock_flag(sk, SOCK_KEEPOPEN)) {
+				sock_set_keepalive(sk);
+				tcp_sock_set_keepidle(sk, 10);
+				tcp_sock_set_keepintvl(sk, 5);
+				tcp_sock_set_keepcnt(sk, 3);
+			}
+
+			/*
+			 * Track how long we have been stuck. If keepalives
+			 * have not closed the connection after a reasonable
+			 * period, give up. This is a backstop against
+			 * pathological cases where keepalive probes succeed
+			 * but the client never sends more data.
+			 */
+			if (!svsk->sk_partial_record_time) {
+				svsk->sk_partial_record_time = ktime_get();
+			} else if (ktime_ms_delta(ktime_get(),
+					svsk->sk_partial_record_time) > 60000) {
+				svc_xprt_deferred_close(&svsk->sk_xprt);
+				break;
+			}
+		} else if (progress) {
+			svsk->sk_partial_record_time = 0;
+		}
+	}
+
+	for (i = 0; i < rqstp->rq_maxpages; i++)
+		if (rqstp->rq_pages[i])
+			put_page(rqstp->rq_pages[i]);
+
+	kfree(rqstp->rq_bvec);
+	kfree(rqstp->rq_pages);
+	complete(&svsk->sk_receiver_exit);
+	return 0;
+
+out_close_free_bvec:
+	kfree(rqstp->rq_bvec);
+out_close_free_pages:
+	kfree(rqstp->rq_pages);
+out_close:
+	svc_xprt_deferred_close(&svsk->sk_xprt);
+	complete(&svsk->sk_receiver_exit);
+	return 0;
+}
+
+/*
+ * The thread is created on the NUMA node associated with the current
+ * CPU's service pool, providing memory locality for receive buffer
+ * allocations.
+ */
+static int svc_tcp_start_receiver(struct svc_sock *svsk)
+{
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+	struct svc_page_pool *pool;
+	struct task_struct *task;
+	int numa_node;
+
+	/* Initialize receiver thread infrastructure.
+	 * The wait queue is initialized earlier in svc_tcp_init()
+	 * so svc_tcp_data_ready() can safely wake it.
+	 */
+	init_llist_head(&svsk->sk_msg_queue);
+	init_completion(&svsk->sk_receiver_exit);
+	atomic_set(&svsk->sk_msg_count, 0);
+
+	numa_node = svc_pool_node(svc_pool_for_cpu(serv));
+	pool = svc_page_pool_alloc(numa_node, svsk->sk_maxpages);
+	if (!pool)
+		return -ENOMEM;
+	svsk->sk_page_pool = pool;
+
+	task = kthread_create_on_node(svc_tcp_receiver_thread, svsk,
+				      numa_node, "tcp-recv/%s",
+				      svsk->sk_xprt.xpt_remotebuf);
+	if (IS_ERR(task)) {
+		svc_page_pool_free(pool);
+		svsk->sk_page_pool = NULL;
+		return PTR_ERR(task);
+	}
+
+	svsk->sk_receiver = task;
+	wake_up_process(task);
+	return 0;
+}
+
+static void svc_tcp_stop_receiver(struct svc_sock *svsk)
+{
+	if (!svsk->sk_receiver)
+		return;
+
+	wake_up(&svsk->sk_receiver_wq);
+	kthread_stop(svsk->sk_receiver);
+	wait_for_completion(&svsk->sk_receiver_exit);
+	svsk->sk_receiver = NULL;
+
+	svc_tcp_drain_msg_queue(svsk);
+	svc_page_pool_free(svsk->sk_page_pool);
+	svsk->sk_page_pool = NULL;
+}
+
+/*
+ * Called only by the dedicated receiver thread; does not call
+ * svc_xprt_received() since the receiver thread manages its own
+ * event loop.
+ */
+static int svc_tcp_recv_msg(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk =
 		container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
@@ -1179,7 +1543,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	__be32 *p;
 	__be32 calldir;
 
-	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	len = svc_tcp_read_marker(svsk, rqstp);
 	if (len < 0)
 		goto error;
@@ -1205,12 +1568,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	} else
 		rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;
 
-	rqstp->rq_xprt_ctxt   = NULL;
-	rqstp->rq_prot	      = IPPROTO_TCP;
-	if (test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags))
-		set_bit(RQ_LOCAL, &rqstp->rq_flags);
-	else
-		clear_bit(RQ_LOCAL, &rqstp->rq_flags);
+	svc_tcp_setup_rqst(rqstp, &svsk->sk_xprt);
 
 	p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
 	calldir = p[1];
@@ -1229,7 +1587,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		serv->sv_stats->nettcpcnt++;
 
 	svc_sock_secure_port(rqstp);
-	svc_xprt_received(rqstp->rq_xprt);
 	return rqstp->rq_arg.len;
 
 err_incomplete:
@@ -1254,10 +1611,56 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	trace_svcsock_tcp_recv_err(&svsk->sk_xprt, len);
 	svc_xprt_deferred_close(&svsk->sk_xprt);
 err_noclose:
-	svc_xprt_received(rqstp->rq_xprt);
 	return 0;	/* record not complete */
 }
 
+/**
+ * svc_tcp_recvfrom - Receive an RPC Call from a TCP socket
+ * @rqstp: request structure into which to receive an RPC Call
+ *
+ * Return values:
+ *   %0: no complete message ready
+ *   positive: length of received RPC Call, in bytes
+ */
+static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
+{
+	struct svc_sock *svsk = container_of(rqstp->rq_xprt,
+					     struct svc_sock, sk_xprt);
+	struct llist_node *node;
+	struct svc_tcp_msg *msg;
+	int len;
+
+	node = llist_del_first(&svsk->sk_msg_queue);
+	if (!node) {
+		clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+		svc_xprt_received(rqstp->rq_xprt);
+		return 0;
+	}
+
+	msg = llist_entry(node, struct svc_tcp_msg, tm_node);
+
+	/*
+	 * Wake the receiver thread when the queue drops below the
+	 * threshold. The receiver may have been sleeping while the
+	 * queue was full.
+	 */
+	if (atomic_dec_return(&svsk->sk_msg_count) == SVC_TCP_MSG_QUEUE_MAX - 1)
+		wake_up_interruptible(&svsk->sk_receiver_wq);
+
+	svc_tcp_msg_to_rqst(rqstp, msg);
+	len = rqstp->rq_arg.len;
+
+	svc_sock_secure_port(rqstp);
+
+	if (llist_empty(&svsk->sk_msg_queue))
+		clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+
+	svc_xprt_received(rqstp->rq_xprt);
+	kfree(msg);
+
+	return len;
+}
+
 /*
  * MSG_SPLICE_PAGES is used exclusively to reduce the number of
  * copy operations in this path. Therefore the caller must ensure
@@ -1394,6 +1797,12 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		sk->sk_data_ready = svc_tcp_listen_data_ready;
 		set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
 	} else {
+		/* Initialize receiver thread wait queue before installing
+		 * the data_ready callback. svc_tcp_data_ready() calls
+		 * wake_up() on this wait queue.
+		 */
+		init_waitqueue_head(&svsk->sk_receiver_wq);
+
 		sk->sk_state_change = svc_tcp_state_change;
 		sk->sk_data_ready = svc_tcp_data_ready;
 		sk->sk_write_space = svc_write_space;
@@ -1406,7 +1815,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 
 		tcp_sock_set_nodelay(sk);
 
-		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 		switch (sk->sk_state) {
 		case TCP_SYN_RECV:
 		case TCP_ESTABLISHED:
@@ -1677,6 +2085,8 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
 {
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
 
+	svc_tcp_stop_receiver(svsk);
+
 	tls_handshake_close(svsk->sk_sock);
 
 	svc_sock_detach(xprt);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 5/7] sunrpc: implement flat combining for TCP socket sends
  2026-02-05 15:57 [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (3 preceding siblings ...)
  2026-02-05 15:57 ` [RFC PATCH 4/7] sunrpc: add dedicated TCP receiver thread Chuck Lever
@ 2026-02-05 15:57 ` Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 6/7] sunrpc: unify fore and backchannel server TCP send paths Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 7/7] SUNRPC: Set explicit TCP socket buffer sizes for NFSD Chuck Lever
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

High contention on xpt_mutex during TCP reply transmission limits
nfsd scalability. When multiple threads send replies on the same
connection, mutex handoff triggers optimistic spinning that consumes
substantial CPU time. Profiling on high-throughput workloads shows
approximately 15% of cycles spent in the spin-wait path.

The flat combining pattern addresses this by allowing one thread to
perform send operations on behalf of multiple waiting threads. Rather
than each thread acquiring xpt_mutex independently, threads enqueue
their send requests to a lock-free llist. The first thread to
acquire the mutex becomes the combiner and processes all pending
sends in a batch. Other threads wait on a per-request completion
structure instead of spinning on the lock.

The combiner continues processing the queue until empty before
releasing xpt_mutex, amortizing acquisition cost across multiple
sends. Setting MSG_MORE on all but the final send in each batch
hints TCP to coalesce segments, reducing protocol overhead when
multiple replies transmit in quick succession.
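
The combining discipline described above can be sketched in userspace C.
This is a toy analogue, not the kernel code: all names (fc_*) are
illustrative, a Treiber stack stands in for the llist, a running total
stands in for the socket write, and the timed-wait fast path is omitted.
Producers push requests onto the lock-free stack; whichever thread holds
the mutex drains the whole snapshot on behalf of everyone, just as the
combiner drains sk_send_queue under xpt_mutex:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

struct fc_req {
	struct fc_req *next;
	int value;
	int done;		/* written by the combiner under fc_lock */
};

static _Atomic(struct fc_req *) fc_queue;
static pthread_mutex_t fc_lock = PTHREAD_MUTEX_INITIALIZER;
static long fc_total;		/* the shared "socket": updated under fc_lock */

/* Lock-free multi-producer push (Treiber stack). */
static void fc_push(struct fc_req *r)
{
	r->next = atomic_load(&fc_queue);
	while (!atomic_compare_exchange_weak(&fc_queue, &r->next, r))
		;		/* expected (r->next) is reloaded on failure */
}

/* Enqueue one request; return only after some combiner has applied it. */
static void fc_submit(struct fc_req *r)
{
	fc_push(r);

	pthread_mutex_lock(&fc_lock);
	if (!r->done) {
		/*
		 * We are the combiner: snapshot everything queued so far
		 * (our own request included) and apply the batch. Later
		 * arrivals are left for the next combiner.
		 */
		struct fc_req *n = atomic_exchange(&fc_queue, NULL);

		for (; n; n = n->next) {
			fc_total += n->value;	/* the "send" work */
			n->done = 1;
		}
	}
	pthread_mutex_unlock(&fc_lock);
}

static void *fc_worker(void *arg)
{
	struct fc_req r = { .value = (int)(long)arg };

	fc_submit(&r);
	return NULL;
}
```

The key invariant mirrors the patch: a request pushed before the mutex is
acquired is either already marked done by a previous combiner or still in
the snapshot the next combiner drains, so no enqueued send is lost.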

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h |   2 +
 net/sunrpc/svcsock.c           | 190 ++++++++++++++++++++++++++++-----
 2 files changed, 165 insertions(+), 27 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 391ce9c14f2d..d085093769a1 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -54,6 +54,8 @@ struct svc_sock {
 
 	/* For sends (protected by xpt_mutex) */
 	struct bio_vec		*sk_bvec;
+	struct llist_head	sk_send_queue;	/* pending sends for combining */
+	u64			sk_drain_avg_ns; /* EMA of combiner drain time */
 
 	/* private TCP part */
 	/* On-the-wire fragment header: */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index fa486a01ee3a..0a8f5695daf3 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1661,16 +1661,128 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	return len;
 }
 
+/*
+ * Pending send request for flat combining.
+ * Stack-allocated by each thread that wants to send on a TCP socket.
+ */
+struct svc_pending_send {
+	struct llist_node	node;
+	struct svc_rqst		*rqstp;
+	rpc_fraghdr		marker;
+	struct completion	done;
+	int			result;
+};
+
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+			   rpc_fraghdr marker, int msg_flags);
+
+/*
+ * svc_tcp_wait_timeout - Compute adaptive timeout for flat combining wait
+ * @svsk: the socket with drain time statistics
+ *
+ * Waiting threads need a timeout that balances two concerns: too short
+ * causes premature wakeups and unnecessary mutex acquisitions; too long
+ * delays threads when the combiner has already finished.
+ *
+ * LAN environments with fast networks and consistent latencies work well
+ * with fixed timeouts. WAN links exhibit higher variance in send times
+ * due to congestion, packet loss, and bandwidth constraints. An adaptive
+ * timeout based on observed drain times accommodates both cases without
+ * manual tuning.
+ *
+ * The timeout targets 2x the recent average drain time, clamped to
+ * [1ms, 100ms]. The multiplier provides headroom for variance while the
+ * floor prevents excessive wakeups and the ceiling bounds worst-case
+ * latency when measurements are anomalous.
+ *
+ * Returns: timeout in jiffies
+ */
+static unsigned long svc_tcp_wait_timeout(struct svc_sock *svsk)
+{
+	u64 avg_ns = READ_ONCE(svsk->sk_drain_avg_ns);
+	unsigned long timeout;
+
+	/* Initial timeout before measurements are available */
+	if (!avg_ns)
+		return msecs_to_jiffies(10);
+
+	timeout = nsecs_to_jiffies(avg_ns * 2);
+	return clamp(timeout, msecs_to_jiffies(1), msecs_to_jiffies(100));
+}
+
+/*
+ * svc_tcp_combine_sends - Process batched send requests as combiner
+ * @svsk: the socket to send on
+ * @xprt: the transport (for dead check and close)
+ *
+ * Called with xpt_mutex held. Drains sk_send_queue and processes each
+ * pending send. All items are completed before returning, even on error.
+ */
+static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
+{
+	struct llist_node *pending, *node, *next;
+	struct svc_pending_send *ps;
+	bool transport_dead = false;
+	u64 start, elapsed, avg;
+	int sent;
+
+	/* Take a snapshot of queued items; new arrivals go to next combiner */
+	pending = llist_del_all(&svsk->sk_send_queue);
+	if (!pending)
+		return;
+
+	start = ktime_get_ns();
+
+	for (node = pending; node; node = next) {
+		next = node->next;
+		ps = container_of(node, struct svc_pending_send, node);
+
+		if (transport_dead || svc_xprt_is_dead(xprt)) {
+			transport_dead = true;
+			ps->result = -ENOTCONN;
+			complete(&ps->done);
+			continue;
+		}
+
+		/*
+		 * Set MSG_MORE if there are more items queued, hinting
+		 * TCP to delay pushing until the batch completes.
+		 */
+		sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker,
+				       next ? MSG_MORE : 0);
+		trace_svcsock_tcp_send(xprt, sent);
+
+		if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+			ps->result = sent;
+		} else {
+			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+				  xprt->xpt_server->sv_name,
+				  sent < 0 ? "send error" : "short send", sent,
+				  ps->rqstp->rq_res.len + sizeof(ps->marker));
+			svc_xprt_deferred_close(xprt);
+			transport_dead = true;
+			ps->result = -EAGAIN;
+		}
+
+		complete(&ps->done);
+	}
+
+	/* Update drain time EMA: new = (7 * old + measured) / 8 */
+	elapsed = ktime_get_ns() - start;
+	avg = READ_ONCE(svsk->sk_drain_avg_ns);
+	WRITE_ONCE(svsk->sk_drain_avg_ns, avg ? (7 * avg + elapsed) / 8 : elapsed);
+}
+
 /*
  * MSG_SPLICE_PAGES is used exclusively to reduce the number of
  * copy operations in this path. Therefore the caller must ensure
  * that the pages backing @xdr are unchanging.
  */
 static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
-			   rpc_fraghdr marker)
+			   rpc_fraghdr marker, int msg_flags)
 {
 	struct msghdr msg = {
-		.msg_flags	= MSG_SPLICE_PAGES,
+		.msg_flags	= MSG_SPLICE_PAGES | msg_flags,
 	};
 	unsigned int count;
 	void *buf;
@@ -1700,44 +1812,68 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
  * svc_tcp_sendto - Send out a reply on a TCP socket
  * @rqstp: completed svc_rqst
  *
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Uses flat combining to reduce mutex contention: threads enqueue their
+ * send requests and one thread (the "combiner") processes the batch.
+ * xpt_mutex ensures RPC-level serialization while the combiner holds it.
  *
  * Returns the number of bytes sent, or a negative errno.
  */
 static int svc_tcp_sendto(struct svc_rqst *rqstp)
 {
 	struct svc_xprt *xprt = rqstp->rq_xprt;
-	struct svc_sock	*svsk = container_of(xprt, struct svc_sock, sk_xprt);
+	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
 	struct xdr_buf *xdr = &rqstp->rq_res;
-	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
-					 (u32)xdr->len);
-	int sent;
+	struct svc_pending_send ps = {
+		.rqstp = rqstp,
+		.marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+	};
 
 	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
 	rqstp->rq_xprt_ctxt = NULL;
 
-	mutex_lock(&xprt->xpt_mutex);
-	if (svc_xprt_is_dead(xprt))
-		goto out_notconn;
-	sent = svc_tcp_sendmsg(svsk, rqstp, marker);
-	trace_svcsock_tcp_send(xprt, sent);
-	if (sent < 0 || sent != (xdr->len + sizeof(marker)))
-		goto out_close;
-	mutex_unlock(&xprt->xpt_mutex);
-	return sent;
+	init_completion(&ps.done);
 
-out_notconn:
+	/* Enqueue this send request; lock-free via llist */
+	llist_add(&ps.node, &svsk->sk_send_queue);
+
+	/*
+	 * Flat combining: threads compete for xpt_mutex; the winner becomes
+	 * the combiner and processes all queued requests. The mutex provides
+	 * RPC-level serialization while combining reduces lock handoff overhead.
+	 *
+	 * Use trylock first for the fast path. On contention, wait briefly
+	 * to see if the current combiner handles our request, then fall back
+	 * to blocking mutex_lock.
+	 */
+	if (mutex_trylock(&xprt->xpt_mutex))
+		goto combine;
+
+	/*
+	 * Combiner holds the lock. Wait for completion with a timeout;
+	 * the combiner may process our request while we wait.
+	 */
+	if (wait_for_completion_timeout(&ps.done, svc_tcp_wait_timeout(svsk))) {
+		/* Completed by the combiner */
+		return ps.result;
+	}
+
+	/*
+	 * Timed out waiting. Acquire mutex to either become the new combiner
+	 * or find that our request was completed in the meantime.
+	 */
+	mutex_lock(&xprt->xpt_mutex);
+	if (completion_done(&ps.done)) {
+		mutex_unlock(&xprt->xpt_mutex);
+		return ps.result;
+	}
+
+combine:
+	/* We are the combiner. Process until queue is empty. */
+	while (!llist_empty(&svsk->sk_send_queue))
+		svc_tcp_combine_sends(svsk, xprt);
 	mutex_unlock(&xprt->xpt_mutex);
-	return -ENOTCONN;
-out_close:
-	pr_notice("rpc-srv/tcp: %s: %s %d when sending %zu bytes - shutting down socket\n",
-		  xprt->xpt_server->sv_name,
-		  (sent < 0) ? "got error" : "sent",
-		  sent, xdr->len + sizeof(marker));
-	svc_xprt_deferred_close(xprt);
-	mutex_unlock(&xprt->xpt_mutex);
-	return -EAGAIN;
+
+	return ps.result;
 }
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 6/7] sunrpc: unify fore and backchannel server TCP send paths
  2026-02-05 15:57 [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (4 preceding siblings ...)
  2026-02-05 15:57 ` [RFC PATCH 5/7] sunrpc: implement flat combining for TCP socket sends Chuck Lever
@ 2026-02-05 15:57 ` Chuck Lever
  2026-02-05 15:57 ` [RFC PATCH 7/7] SUNRPC: Set explicit TCP socket buffer sizes for NFSD Chuck Lever
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

High latency in callback processing for NFSv4.1+ sessions can occur
when the fore channel sustains heavy request traffic. The backchannel
implementation acquires xpt_mutex directly while the fore channel
uses flat combining to batch socket operations. A thread in the
combining loop processes queued sends continuously until the llist
empties, which under load means backchannel threads block on xpt_mutex
for extended periods waiting for a turn at the socket. Delegation
recalls and other callback operations carry time constraints that
make this starvation problematic.

Routing backchannel sends through the same flat combining
infrastructure eliminates this starvation; a shared llist queue
replaces direct mutex acquisition and separate code paths. The
struct svc_pending_send now holds an xdr_buf pointer instead of
svc_rqst, decoupling the queueing mechanism from RPC request
structures. A new svc_tcp_send() function accepts an xprt, xdr_buf,
and marker, then either enters the combining loop or enqueues for
processing by an active combiner.

The fore channel path through svc_tcp_sendto() now calls
svc_tcp_send() after preparing its xdr_buf. The backchannel
bc_send_request() similarly calls svc_tcp_send() in place of its
former mutex acquisition and direct bc_sendto() invocation. Both
channels queue into the same llist, so backchannel operations
receive fair treatment in the send ordering. When a backchannel
send queues behind fore channel traffic, the combining loop
processes both together with shared socket lock acquisition and
MSG_MORE coalescing where applicable.

Maintenance burden decreases with a single code path for TCP sends.
The backchannel gains batching benefits when concurrent with fore
channel load, and starvation no longer occurs because every queued
send is processed by some combiner, independent of mutex handoff
timing.
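
With both channels sharing the combining path, every waiter is paced by
the adaptive timeout introduced in the previous patch: a 7/8-weighted
EMA of combiner drain time, doubled and clamped to [1ms, 100ms]. The
arithmetic is easy to check in isolation; this standalone sketch mirrors
those constants (function names here are illustrative, not the kernel
symbols):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* EMA of combiner drain time: new = (7 * old + measured) / 8 */
static uint64_t drain_avg_update(uint64_t avg_ns, uint64_t measured_ns)
{
	return avg_ns ? (7 * avg_ns + measured_ns) / 8 : measured_ns;
}

/* Waiter timeout: 2x the average, clamped to [1ms, 100ms]. */
static uint64_t wait_timeout_ns(uint64_t avg_ns)
{
	uint64_t t;

	if (!avg_ns)
		return 10 * NSEC_PER_MSEC;	/* before any measurement */
	t = 2 * avg_ns;
	if (t < NSEC_PER_MSEC)
		t = NSEC_PER_MSEC;
	if (t > 100 * NSEC_PER_MSEC)
		t = 100 * NSEC_PER_MSEC;
	return t;
}
```

For example, after observing drains of 8ms and then 16ms, the EMA sits
at 9ms and a waiter sleeps for 18ms before falling back to mutex_lock.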

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h |  2 +
 net/sunrpc/svcsock.c           | 76 +++++++++++++++++++++++++---------
 net/sunrpc/xprtsock.c          | 60 ++++++---------------------
 3 files changed, 71 insertions(+), 67 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index d085093769a1..f7f0c5e47fc5 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -101,6 +101,8 @@ static inline u32 svc_sock_final_rec(struct svc_sock *svsk)
  */
 void		svc_recv(struct svc_rqst *rqstp);
 void		svc_send(struct svc_rqst *rqstp);
+int		svc_tcp_send(struct svc_xprt *xprt, struct xdr_buf *xdr,
+			     rpc_fraghdr marker);
 int		svc_addsock(struct svc_serv *serv, struct net *net,
 			    const int fd, char *name_return, const size_t len,
 			    const struct cred *cred);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0a8f5695daf3..8d7ac777dfe3 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1667,13 +1667,13 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
  */
 struct svc_pending_send {
 	struct llist_node	node;
-	struct svc_rqst		*rqstp;
+	struct xdr_buf		*xdr;
 	rpc_fraghdr		marker;
 	struct completion	done;
 	int			result;
 };
 
-static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
 			   rpc_fraghdr marker, int msg_flags);
 
 /*
@@ -1734,6 +1734,8 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
 	start = ktime_get_ns();
 
 	for (node = pending; node; node = next) {
+		size_t expected;
+
 		next = node->next;
 		ps = container_of(node, struct svc_pending_send, node);
 
@@ -1748,17 +1750,30 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
 		 * Set MSG_MORE if there are more items queued, hinting
 		 * TCP to delay pushing until the batch completes.
 		 */
-		sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker,
+		sent = svc_tcp_sendmsg(svsk, ps->xdr, ps->marker,
 				       next ? MSG_MORE : 0);
 		trace_svcsock_tcp_send(xprt, sent);
 
-		if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+		expected = ps->xdr->len + sizeof(ps->marker);
+		if (sent == expected) {
 			ps->result = sent;
 		} else {
 			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
 				  xprt->xpt_server->sv_name,
-				  sent < 0 ? "send error" : "short send", sent,
-				  ps->rqstp->rq_res.len + sizeof(ps->marker));
+				  sent < 0 ? "send error" : "short send",
+				  sent, expected);
 			svc_xprt_deferred_close(xprt);
 			transport_dead = true;
 			ps->result = -EAGAIN;
@@ -1778,7 +1793,7 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
  * copy operations in this path. Therefore the caller must ensure
  * that the pages backing @xdr are unchanging.
  */
-static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
 			   rpc_fraghdr marker, int msg_flags)
 {
 	struct msghdr msg = {
@@ -1798,39 +1813,40 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
 	memcpy(buf, &marker, sizeof(marker));
 	bvec_set_virt(svsk->sk_bvec, buf, sizeof(marker));
 
-	count = xdr_buf_to_bvec(svsk->sk_bvec + 1, rqstp->rq_maxpages,
-				&rqstp->rq_res);
+	count = xdr_buf_to_bvec(svsk->sk_bvec + 1, svsk->sk_maxpages, xdr);
 
 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, svsk->sk_bvec,
-		      1 + count, sizeof(marker) + rqstp->rq_res.len);
+		      1 + count, sizeof(marker) + xdr->len);
 	ret = sock_sendmsg(svsk->sk_sock, &msg);
 	page_frag_free(buf);
 	return ret;
 }
 
 /**
- * svc_tcp_sendto - Send out a reply on a TCP socket
- * @rqstp: completed svc_rqst
+ * svc_tcp_send - Send an XDR buffer on a TCP socket using flat combining
+ * @xprt: the transport to send on
+ * @xdr: the XDR buffer to send
+ * @marker: RPC record marker
  *
  * Uses flat combining to reduce mutex contention: threads enqueue their
  * send requests and one thread (the "combiner") processes the batch.
  * xpt_mutex ensures RPC-level serialization while the combiner holds it.
  *
+ * Can be used for both fore channel (NFS replies) and backchannel
+ * (NFSv4 callbacks) sends since both share the same TCP connection
+ * and xpt_mutex.
+ *
  * Returns the number of bytes sent, or a negative errno.
  */
-static int svc_tcp_sendto(struct svc_rqst *rqstp)
+int svc_tcp_send(struct svc_xprt *xprt, struct xdr_buf *xdr,
+		 rpc_fraghdr marker)
 {
-	struct svc_xprt *xprt = rqstp->rq_xprt;
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
-	struct xdr_buf *xdr = &rqstp->rq_res;
 	struct svc_pending_send ps = {
-		.rqstp = rqstp,
-		.marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+		.xdr = xdr,
+		.marker = marker,
 	};
 
-	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
-	rqstp->rq_xprt_ctxt = NULL;
-
 	init_completion(&ps.done);
 
 	/* Enqueue this send request; lock-free via llist */
@@ -1875,6 +1891,26 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
 
 	return ps.result;
 }
+EXPORT_SYMBOL_GPL(svc_tcp_send);
+
+/**
+ * svc_tcp_sendto - Send out a reply on a TCP socket
+ * @rqstp: completed svc_rqst
+ *
+ * Returns the number of bytes sent, or a negative errno.
+ */
+static int svc_tcp_sendto(struct svc_rqst *rqstp)
+{
+	struct svc_xprt *xprt = rqstp->rq_xprt;
+	struct xdr_buf *xdr = &rqstp->rq_res;
+	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
+					 (u32)xdr->len);
+
+	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
+	rqstp->rq_xprt_ctxt = NULL;
+
+	return svc_tcp_send(xprt, xdr, marker);
+}
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
 				       struct net *net,
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 2e1fe6013361..4e1d82186b00 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2979,36 +2979,13 @@ static void bc_free(struct rpc_task *task)
 	free_page((unsigned long)buf);
 }
 
-static int bc_sendto(struct rpc_rqst *req)
-{
-	struct xdr_buf *xdr = &req->rq_snd_buf;
-	struct sock_xprt *transport =
-			container_of(req->rq_xprt, struct sock_xprt, xprt);
-	struct msghdr msg = {
-		.msg_flags	= 0,
-	};
-	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
-					 (u32)xdr->len);
-	unsigned int sent = 0;
-	int err;
-
-	req->rq_xtime = ktime_get();
-	err = xdr_alloc_bvec(xdr, rpc_task_gfp_mask());
-	if (err < 0)
-		return err;
-	err = xprt_sock_sendmsg(transport->sock, &msg, xdr, 0, marker, &sent);
-	xdr_free_bvec(xdr);
-	if (err < 0 || sent != (xdr->len + sizeof(marker)))
-		return -EAGAIN;
-	return sent;
-}
-
 /**
  * bc_send_request - Send a backchannel Call on a TCP socket
  * @req: rpc_rqst containing Call message to be sent
  *
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Uses flat combining via svc_tcp_send() to participate in batched
+ * sending with fore channel traffic, ensuring fair ordering and
+ * reduced lock contention.
  *
  * Return values:
  *   %0 if the message was sent successfully
@@ -3016,29 +2993,18 @@ static int bc_sendto(struct rpc_rqst *req)
  */
 static int bc_send_request(struct rpc_rqst *req)
 {
-	struct svc_xprt	*xprt;
-	int len;
+	struct xdr_buf *xdr = &req->rq_snd_buf;
+	struct svc_xprt *xprt = req->rq_xprt->bc_xprt;
+	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
+					 (u32)xdr->len);
+	int ret;
 
-	/*
-	 * Get the server socket associated with this callback xprt
-	 */
-	xprt = req->rq_xprt->bc_xprt;
+	req->rq_xtime = ktime_get();
+	ret = svc_tcp_send(xprt, xdr, marker);
 
-	/*
-	 * Grab the mutex to serialize data as the connection is shared
-	 * with the fore channel
-	 */
-	mutex_lock(&xprt->xpt_mutex);
-	if (test_bit(XPT_DEAD, &xprt->xpt_flags))
-		len = -ENOTCONN;
-	else
-		len = bc_sendto(req);
-	mutex_unlock(&xprt->xpt_mutex);
-
-	if (len > 0)
-		len = 0;
-
-	return len;
+	if (ret < 0)
+		return ret;
+	return 0;
 }
 
 static void bc_close(struct rpc_xprt *xprt)
-- 
2.52.0
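For readers unfamiliar with the pattern, the flat-combining behavior of
svc_tcp_send() above can be sketched in userspace C. This is a minimal,
illustrative model only: the names (fc_send, send_req, do_send) and the
spinning waiter are assumptions of the sketch, not the kernel code, which
uses llist, xpt_mutex, and completions instead.

```c
/*
 * Userspace sketch of a flat-combining send path. Callers push their
 * request onto a lock-free LIFO; whichever caller wins the mutex (the
 * "combiner") drains and processes the entire batch, amortizing the
 * lock over many sends.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct send_req {
	struct send_req *next;
	int payload;			/* stand-in for an xdr_buf */
	int result;			/* bytes "sent" */
	atomic_bool done;
};

static _Atomic(struct send_req *) pending;	/* lock-free list head */
static pthread_mutex_t send_mutex = PTHREAD_MUTEX_INITIALIZER;

static int do_send(int payload)
{
	return payload;			/* pretend the payload was written */
}

static int fc_send(struct send_req *req)
{
	struct send_req *head, *batch, *next;

	atomic_store(&req->done, false);

	/* Enqueue without taking any lock (Treiber-stack push) */
	head = atomic_load(&pending);
	do {
		req->next = head;
	} while (!atomic_compare_exchange_weak(&pending, &head, req));

	/* Either become the combiner or wait until one finishes us */
	while (!atomic_load(&req->done)) {
		if (pthread_mutex_trylock(&send_mutex))
			continue;	/* another thread is combining */
		batch = atomic_exchange(&pending, NULL);
		for (; batch != NULL; batch = next) {
			next = batch->next;
			batch->result = do_send(batch->payload);
			atomic_store(&batch->done, true);
		}
		pthread_mutex_unlock(&send_mutex);
	}
	return req->result;
}
```

Even single-threaded, fc_send() pushes, wins the mutex, and drains its
own request; under contention, one mutex acquisition covers every queued
request, which is what lets the kernel version coalesce adjacent RPC
messages into fewer TCP segments via MSG_MORE.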


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 7/7] SUNRPC: Set explicit TCP socket buffer sizes for NFSD
  2026-02-05 15:57 [RFC PATCH 0/7] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (5 preceding siblings ...)
  2026-02-05 15:57 ` [RFC PATCH 6/7] sunrpc: unify fore and backchannel server TCP send paths Chuck Lever
@ 2026-02-05 15:57 ` Chuck Lever
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-05 15:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	daire, Mike Snitzer
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

NFSD TCP sockets currently rely on system defaults with TCP
auto-tuning. On networks with large bandwidth-delay products, the
default maximum buffer sizes (6MB receive, 4MB send) can throttle
throughput. Administrators must resort to system-wide sysctl
adjustments (tcp_rmem/tcp_wmem), which affect all TCP connections
rather than just NFS traffic.

This change sets explicit buffer sizes for NFSD TCP data sockets.
The buffer size is set to 4 * sv_max_mesg, yielding approximately
16MB with default NFS payload sizes. On memory-constrained systems,
the buffer size is capped at 1/1024 of physical RAM, with a hard
ceiling of 16MB. SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK disable
auto-tuning, providing predictable memory consumption.

The existing svc_sock_setbufsize() is renamed to
svc_udp_setbufsize() to reflect its UDP-specific purpose, and a
new svc_tcp_setbufsize() handles TCP data connections. Listener
sockets remain unaffected, as listeners do not transfer data.

This approach improves throughput on high-speed networks without
requiring system-wide configuration changes, while automatically
scaling down buffer sizes on small systems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svcsock.c | 52 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 8d7ac777dfe3..e019ae285d47 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -50,6 +50,7 @@
 #include <net/handshake.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
+#include <linux/mm.h>
 #include <asm/ioctls.h>
 #include <linux/key.h>
 
@@ -377,9 +378,12 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
 }
 
 /*
- * Set socket snd and rcv buffer lengths
+ * Set socket snd and rcv buffer lengths for UDP sockets.
+ *
+ * UDP sockets need large buffers because pending requests remain
+ * in the receive buffer until processed by a worker thread.
  */
-static void svc_sock_setbufsize(struct svc_sock *svsk, unsigned int nreqs)
+static void svc_udp_setbufsize(struct svc_sock *svsk, unsigned int nreqs)
 {
 	unsigned int max_mesg = svsk->sk_xprt.xpt_server->sv_max_mesg;
 	struct socket *sock = svsk->sk_sock;
@@ -393,6 +397,45 @@ static void svc_sock_setbufsize(struct svc_sock *svsk, unsigned int nreqs)
 	release_sock(sock->sk);
 }
 
+/* Accommodate high bandwidth-delay product connections */
+#define SVC_TCP_SNDBUF_MAX	(16 * 1024 * 1024)
+#define SVC_TCP_RCVBUF_MAX	(16 * 1024 * 1024)
+
+/*
+ * Set socket snd and rcv buffer lengths for TCP data sockets.
+ *
+ * Buffers are sized to accommodate high-bandwidth data transfers on
+ * high-latency networks (large bandwidth-delay product). Automatic
+ * buffer tuning is disabled to allow control of server memory
+ * consumption.
+ */
+static void svc_tcp_setbufsize(struct svc_sock *svsk)
+{
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+	struct socket *sock = svsk->sk_sock;
+	unsigned long mem_cap, ideal;
+	unsigned int sndbuf, rcvbuf;
+
+	/* Buffer multiple in-flight RPC messages */
+	ideal = serv->sv_max_mesg * 4;
+
+	/* Memory-based cap: 1/1024 of physical RAM */
+	mem_cap = (totalram_pages() >> 10) << PAGE_SHIFT;
+
+	sndbuf = clamp_t(unsigned long, ideal,
+			 serv->sv_max_mesg, min(mem_cap, SVC_TCP_SNDBUF_MAX));
+	rcvbuf = clamp_t(unsigned long, ideal,
+			 serv->sv_max_mesg, min(mem_cap, SVC_TCP_RCVBUF_MAX));
+
+	lock_sock(sock->sk);
+	sock->sk->sk_sndbuf = sndbuf;
+	sock->sk->sk_rcvbuf = rcvbuf;
+	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
+	sock->sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+	sock->sk->sk_write_space(sock->sk);
+	release_sock(sock->sk);
+}
+
 static void svc_sock_secure_port(struct svc_rqst *rqstp)
 {
 	if (svc_port_is_privileged(svc_addr(rqstp)))
@@ -656,7 +699,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 	     * provides an upper bound on the number of threads
 	     * which will access the socket.
 	     */
-	    svc_sock_setbufsize(svsk, serv->sv_nrthreads + 3);
+	    svc_udp_setbufsize(svsk, serv->sv_nrthreads + 3);
 
 	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
@@ -872,7 +915,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 	 * receive and respond to one request.
 	 * svc_udp_recvfrom will re-adjust if necessary
 	 */
-	svc_sock_setbufsize(svsk, 3);
+	svc_udp_setbufsize(svsk, 3);
 
 	/* data might have come in before data_ready set up */
 	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
@@ -1986,6 +2029,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		       svsk->sk_maxpages * sizeof(struct page *));
 
 		tcp_sock_set_nodelay(sk);
+		svc_tcp_setbufsize(svsk);
 
 		switch (sk->sk_state) {
 		case TCP_SYN_RECV:
-- 
2.52.0
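The sizing policy in svc_tcp_setbufsize() above can be mirrored in a
few lines of userspace C. The helper name tcp_bufsize() and the sample
numbers are illustrative; the kernel computes the memory cap from
totalram_pages() and PAGE_SHIFT, which reduces to roughly RAM/1024
bytes as modeled here.

```c
#include <assert.h>

#define SVC_TCP_BUF_MAX	(16UL * 1024 * 1024)	/* 16MB ceiling */

static unsigned long clamp_ul(unsigned long val, unsigned long lo,
			      unsigned long hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Model of the sizing policy: ideally four max-size messages, capped
 * by 1/1024 of physical RAM and a 16MB ceiling, floored at one
 * max-size message.
 */
static unsigned long tcp_bufsize(unsigned long max_mesg,
				 unsigned long ram_bytes)
{
	unsigned long ideal = max_mesg * 4;
	unsigned long mem_cap = ram_bytes >> 10;	/* 1/1024 of RAM */
	unsigned long hi = mem_cap < SVC_TCP_BUF_MAX ?
			   mem_cap : SVC_TCP_BUF_MAX;

	return clamp_ul(ideal, max_mesg, hi);
}
```

With a 4MB maximum message and 64GB of RAM, the ideal size (16MB) fits
under both caps; on an 8GB machine the 1/1024 memory cap (8MB) wins,
which is the automatic scale-down described in the patch description.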


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/7] workqueue: Automatic affinity scope fallback for single-pod topologies
  2026-02-05 15:57 ` [RFC PATCH 1/7] workqueue: Automatic affinity scope fallback for single-pod topologies Chuck Lever
@ 2026-02-06 14:57   ` Chuck Lever
  0 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-06 14:57 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	Daire Byrne, Mike Snitzer
  Cc: linux-nfs, Chuck Lever



On Thu, Feb 5, 2026, at 10:57 AM, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The default affinity scope WQ_AFFN_CACHE assumes systems have
> multiple last-level caches. On systems where all CPUs share a
> single LLC (common with Intel monolithic dies), this scope
> degenerates to a single worker pool. All queue_work() calls then
> contend on that pool's single lock, causing severe performance
> degradation under high-throughput workloads.
>
> For example, on a 12-core system with a single shared L3 cache
> running NFS over RDMA with 12 fio jobs, perf shows approximately
> 39% of CPU cycles spent in native_queued_spin_lock_slowpath,
> nearly all from __queue_work() contending on the single pool lock.
>
> On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
> scopes all collapse to a single pod.
>
> Add wq_effective_affn_scope() to detect when a selected affinity
> scope provides only one pod despite having multiple CPUs, and
> automatically fall back to a finer-grained scope. This enables lock
> distribution to scale with CPU count without requiring manual
> configuration via the workqueue.default_affinity_scope parameter or
> per-workqueue sysfs tuning.
>
> The fallback is conservative: it triggers only when a scope
> degenerates to exactly one pod, and respects explicitly configured
> (non-default) scopes.
>
> Also update wq_affn_scope_show() to display the effective scope
> when fallback occurs, making the behavior transparent to
> administrators via sysfs (e.g., "default (cache -> smt)").
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/workqueue.h |  8 ++++-
>  kernel/workqueue.c        | 68 +++++++++++++++++++++++++++++++++++----
>  2 files changed, 69 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index dabc351cc127..1fca5791337d 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -128,10 +128,16 @@ struct rcu_work {
>  	struct workqueue_struct *wq;
>  };
> 
> +/*
> + * Affinity scopes are ordered from finest to coarsest granularity.
> + * This ordering is used by the automatic fallback logic in
> + * wq_effective_affn_scope(), which walks from coarse toward fine
> + * when a scope degenerates to a single pod.
> + */
>  enum wq_affn_scope {
>  	WQ_AFFN_DFL,			/* use system default */
>  	WQ_AFFN_CPU,			/* one pod per CPU */
> -	WQ_AFFN_SMT,			/* one pod poer SMT */
> +	WQ_AFFN_SMT,			/* one pod per SMT */
> +	WQ_AFFN_CLUSTER,		/* one pod per cluster */
>  	WQ_AFFN_CACHE,			/* one pod per LLC */
>  	WQ_AFFN_NUMA,			/* one pod per NUMA node */
>  	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 253311af47c6..32598b9cd1c2 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -405,6 +405,7 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
>  	[WQ_AFFN_DFL]		= "default",
>  	[WQ_AFFN_CPU]		= "cpu",
>  	[WQ_AFFN_SMT]		= "smt",
> +	[WQ_AFFN_CLUSTER]	= "cluster",
>  	[WQ_AFFN_CACHE]		= "cache",
>  	[WQ_AFFN_NUMA]		= "numa",
>  	[WQ_AFFN_SYSTEM]	= "system",
> @@ -4753,6 +4754,39 @@ static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
>  		cpumask_copy(attrs->cpumask, unbound_cpumask);
>  }
> 
> +/*
> + * Determine the effective affinity scope. If the configured scope results
> + * in a single pod (e.g., WQ_AFFN_CACHE on a system with one shared LLC),
> + * fall back to a finer-grained scope to distribute pool lock contention.
> + *
> + * The search stops at WQ_AFFN_CPU, which always provides one pod per CPU
> + * and thus cannot degenerate further.
> + *
> + * Returns the scope to actually use, which may differ from the configured
> + * scope on systems where coarser scopes degenerate.
> + */
> +static enum wq_affn_scope wq_effective_affn_scope(enum wq_affn_scope scope)
> +{
> +	struct wq_pod_type *pt;
> +
> +	/*
> +	 * Walk from the requested scope toward finer granularity. Stop
> +	 * when a scope provides more than one pod, or when CPU scope is
> +	 * reached. CPU scope always provides nr_possible_cpus() pods.
> +	 */
> +	while (scope > WQ_AFFN_CPU) {
> +		pt = &wq_pod_types[scope];
> +
> +		/* Multiple pods at this scope; no fallback needed */
> +		if (pt->nr_pods > 1)
> +			break;
> +
> +		scope--;
> +	}
> +
> +	return scope;
> +}
> +
>  /* find wq_pod_type to use for @attrs */
>  static const struct wq_pod_type *
>  wqattrs_pod_type(const struct workqueue_attrs *attrs)
> @@ -4763,8 +4797,13 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
>  	/* to synchronize access to wq_affn_dfl */
>  	lockdep_assert_held(&wq_pool_mutex);
> 
> +	/*
> +	 * For default scope, apply automatic fallback for degenerate
> +	 * topologies. Explicit scope selection via sysfs or per-workqueue
> +	 * attributes bypasses fallback, preserving administrator intent.
> +	 */
>  	if (attrs->affn_scope == WQ_AFFN_DFL)
> -		scope = wq_affn_dfl;
> +		scope = wq_effective_affn_scope(wq_affn_dfl);
>  	else
>  		scope = attrs->affn_scope;
> 
> @@ -7206,16 +7245,27 @@ static ssize_t wq_affn_scope_show(struct device *dev,
>  				  struct device_attribute *attr, char *buf)
>  {
>  	struct workqueue_struct *wq = dev_to_wq(dev);
> +	enum wq_affn_scope scope, effective;
>  	int written;
> 
>  	mutex_lock(&wq->mutex);
> -	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL)
> -		written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
> -				    wq_affn_names[WQ_AFFN_DFL],
> -				    wq_affn_names[wq_affn_dfl]);
> -	else
> +	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL) {
> +		scope = wq_affn_dfl;
> +		effective = wq_effective_affn_scope(scope);
> +		if (wq_pod_types[effective].nr_pods >
> +		    wq_pod_types[scope].nr_pods)
> +			written = scnprintf(buf, PAGE_SIZE, "%s (%s -> %s)\n",
> +					    wq_affn_names[WQ_AFFN_DFL],
> +					    wq_affn_names[scope],
> +					    wq_affn_names[effective]);
> +		else
> +			written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
> +					    wq_affn_names[WQ_AFFN_DFL],
> +					    wq_affn_names[scope]);
> +	} else {
>  		written = scnprintf(buf, PAGE_SIZE, "%s\n",
>  				    wq_affn_names[wq->unbound_attrs->affn_scope]);
> +	}
>  	mutex_unlock(&wq->mutex);
> 
>  	return written;
> @@ -8023,6 +8073,11 @@ static bool __init cpus_share_smt(int cpu0, int cpu1)
>  #endif
>  }
> 
> +static bool __init cpus_share_cluster(int cpu0, int cpu1)
> +{
> +	return cpumask_test_cpu(cpu0, topology_cluster_cpumask(cpu1));
> +}
> +
>  static bool __init cpus_share_numa(int cpu0, int cpu1)
>  {
>  	return cpu_to_node(cpu0) == cpu_to_node(cpu1);
> @@ -8042,6 +8097,7 @@ void __init workqueue_init_topology(void)
> 
>  	init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
>  	init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
> +	init_pod_type(&wq_pod_types[WQ_AFFN_CLUSTER], cpus_share_cluster);
>  	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
>  	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
> 
> -- 
> 2.52.0
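Stripped of workqueue internals, the fallback walk in
wq_effective_affn_scope() amounts to the loop below. This is a
userspace sketch under assumptions: the enum names and pod counts are
illustrative, and nr_pods[] stands in for wq_pod_types[].nr_pods.

```c
#include <assert.h>

enum scope { SCOPE_CPU, SCOPE_SMT, SCOPE_CLUSTER,
	     SCOPE_CACHE, SCOPE_NUMA, SCOPE_SYSTEM, SCOPE_NR };

/* Pod counts per scope; filled in by the caller for a given topology */
static int nr_pods[SCOPE_NR];

/*
 * Walk from the requested scope toward finer granularity until a
 * scope yields more than one pod. CPU scope always has one pod per
 * CPU, so the walk cannot fall past it.
 */
static enum scope effective_scope(enum scope s)
{
	while (s > SCOPE_CPU && nr_pods[s] <= 1)
		s--;
	return s;
}
```

On the 12-core single-LLC example from the patch description (no SMT
siblings, one cluster, one NUMA node), a requested "cache" scope falls
through "cluster" and lands on "smt", restoring twelve pools and
spreading the pool-lock contention.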

Tejun has rejected this one. However, I've worked up a patch
that replaces svcrdma_wq with a kthread, which avoids the
workqueue pool spin lock entirely and solves the same problem
for NFSD.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 9+ messages in thread
