public inbox for linux-nfs@vger.kernel.org
* [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets
@ 2026-02-10 16:20 Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST Chuck Lever
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

High-throughput NFSD workloads exhibit significant lock contention on
TCP connections. Worker threads compete for the socket lock during
receives and serialize on xpt_mutex during sends, limiting scalability.

This series addresses both paths:

 - Receive: A dedicated kernel thread per TCP connection owns all
   sock_recvmsg() calls and queues complete RPC messages for workers
   via lock-free llist. This eliminates socket lock contention among
   workers.

 - Transmit: Flat combining allows one thread to send on behalf of
   multiple waiters. Threads enqueue requests; the mutex holder
   ("combiner") processes the batch, amortizing lock acquisition and
   enabling TCP segment coalescing via MSG_MORE.
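
For readers unfamiliar with flat combining, the transmit idea above can
be sketched in userspace C11. This is not code from the series: every
toy_* name is hypothetical, an atomic_flag stands in for xpt_mutex, and
"sending" is reduced to adding up byte counts. The sketch also handles a
batch newest-first, which a real transport may not want.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One pending send; hypothetical stand-in for a queued RPC reply. */
struct toy_send_req {
	struct toy_send_req *next;
	int bytes;
	atomic_bool done;
};

struct toy_combiner {
	_Atomic(struct toy_send_req *) pending;	/* lock-free request list */
	atomic_flag busy;	/* assumption: stands in for the send mutex */
	long bytes_sent;	/* touched only while holding busy */
	unsigned int batches;	/* touched only while holding busy */
};

void toy_combiner_init(struct toy_combiner *c)
{
	atomic_init(&c->pending, (struct toy_send_req *)NULL);
	atomic_flag_clear(&c->busy);
	c->bytes_sent = 0;
	c->batches = 0;
}

/* Combiner: detach the whole list once, then serve every request. */
static void toy_combine(struct toy_combiner *c)
{
	struct toy_send_req *req;

	req = atomic_exchange(&c->pending, (struct toy_send_req *)NULL);
	if (!req)
		return;
	c->batches++;
	while (req) {
		/* Read next first: the owner may reuse req once done is set. */
		struct toy_send_req *next = req->next;

		c->bytes_sent += req->bytes;	/* the simulated send */
		atomic_store(&req->done, true);
		req = next;
	}
}

/* Enqueue a request, then either combine or wait for a combiner. */
void toy_send(struct toy_combiner *c, struct toy_send_req *req)
{
	struct toy_send_req *old = atomic_load(&c->pending);

	assert(req->bytes >= 0);
	atomic_init(&req->done, false);
	do {
		req->next = old;
	} while (!atomic_compare_exchange_weak(&c->pending, &old, req));

	while (!atomic_load(&req->done)) {
		if (!atomic_flag_test_and_set(&c->busy)) {
			toy_combine(c);		/* we are the combiner */
			atomic_flag_clear(&c->busy);
		} else {
			sched_yield();		/* another thread combines */
		}
	}
}
```

The point of the pattern is that a waiter never needs the lock itself:
its request is either served by the current lock holder or by the waiter
once it wins the lock, so one acquisition can cover many requests.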

Supporting changes include a page recycling pool for receive buffers,
and explicit TCP buffer sizing for high bandwidth-delay product
networks.

Base commit: v6.19
URL: https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log/?h=svctcp-next

---

Changes since RFC:
- Drop the affinity scope patch
- Skip user memory hardening for kernel-to-kernel copies
- Avoid invoking wake_up when the receiver is already running
- Refactor svc_tcp_receiver_thread() for legibility
- Do not set MSG_MORE during batched sends

Chuck Lever (8):
  sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST
  net: datagram: bypass usercopy checks for kernel iterators
  sunrpc: split svc_data_ready into protocol-specific callbacks
  sunrpc: add per-transport page recycling pool
  sunrpc: add dedicated TCP receiver thread
  sunrpc: implement flat combining for TCP socket sends
  sunrpc: unify fore and backchannel server TCP send paths
  sunrpc: Set explicit TCP socket buffer sizes for NFSD

 include/linux/sunrpc/svc.h      |   1 +
 include/linux/sunrpc/svc_xprt.h |  32 ++
 include/linux/sunrpc/svcsock.h  |  40 ++
 include/trace/events/sunrpc.h   |   7 +-
 net/core/datagram.c             |  15 +-
 net/sunrpc/svc.c                |  13 +
 net/sunrpc/svc_xprt.c           | 151 ++++++
 net/sunrpc/svcsock.c            | 802 +++++++++++++++++++++++++++++---
 net/sunrpc/xprtsock.c           |  60 +--
 9 files changed, 999 insertions(+), 122 deletions(-)

-- 
2.52.0



* [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 2/8] net: datagram: bypass usercopy checks for kernel iterators Chuck Lever
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Commit eccbbc7c00a5 ("nfsd: don't use sv_nrthreads in connection
limiting calculations.") and commit 898374fdd7f0 ("nfsd: unregister
with rpcbind when deleting a transport") added new XPT flags but
neglected to update the show_svc_xprt_flags() macro.
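
The flag list is an X-macro: each entry names a flag once, and the
surrounding macros paste a prefix onto it and stringify it. A minimal
userspace sketch of that pattern follows. The DEMO_ and demo_ names are
hypothetical and only illustrate the mechanism; they are not the
kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bit numbers, named with a common prefix. */
enum { DEMO_BUSY, DEMO_CONN, DEMO_PEER_AUTH };

/* The list names each flag once, without the DEMO_ prefix. */
#define DEMO_FLAG_LIST		\
	demo_flag(BUSY)		\
	demo_flag(CONN)		\
	demo_flag_end(PEER_AUTH)

struct demo_flag_name {
	unsigned long mask;
	const char *name;
};

/* Expand the list: ## pastes the prefix back on, # makes the string. */
#define demo_flag(x)		{ 1UL << DEMO_##x, #x },
#define demo_flag_end(x)	{ 1UL << DEMO_##x, #x }
static const struct demo_flag_name demo_flag_names[] = { DEMO_FLAG_LIST };
#undef demo_flag
#undef demo_flag_end

/* Return the printable name for a single-bit mask, or NULL if unlisted. */
const char *demo_flag_lookup(unsigned long mask)
{
	size_t i;

	assert(mask != 0);
	for (i = 0; i < sizeof(demo_flag_names) / sizeof(demo_flag_names[0]); i++)
		if (demo_flag_names[i].mask == mask)
			return demo_flag_names[i].name;
	return NULL;
}
```

A flag that exists in the enum but is missing from the list simply has
no name in the table, which is the kind of gap this patch closes.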

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/sunrpc.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 750ecce56930..076182ae19ec 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1933,7 +1933,9 @@ TRACE_EVENT(svc_stats_latency,
 	svc_xprt_flag(CONG_CTRL)					\
 	svc_xprt_flag(HANDSHAKE)					\
 	svc_xprt_flag(TLS_SESSION)					\
-	svc_xprt_flag_end(PEER_AUTH)
+	svc_xprt_flag(PEER_AUTH)					\
+	svc_xprt_flag(PEER_VALID)					\
+	svc_xprt_flag_end(RPCB_UNREG)
 
 #undef svc_xprt_flag
 #undef svc_xprt_flag_end
-- 
2.52.0



* [PATCH v2 2/8] net: datagram: bypass usercopy checks for kernel iterators
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 3/8] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Profiling NFSD under an iozone workload showed that hardened usercopy
checks consume roughly 1.3% of CPU in the TCP receive path. These
checks validate memory regions during copies, but provide no security
benefit when both source (skb data) and destination (kernel pages in
BVEC/KVEC iterators) reside in kernel address space.

Modify simple_copy_to_iter() and crc32c_and_copy_to_iter() to call
_copy_to_iter() directly when the destination is a kernel-only
iterator, bypassing the usercopy hardening validation. User-backed
iterators (ITER_UBUF, ITER_IOVEC) continue to use copy_to_iter() with
full validation.

This benefits kernel consumers of TCP receive such as NFSD (SUNRPC)
and NVMe-TCP, which use ITER_BVEC for their receive buffers.
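
The dispatch described above can be modeled in a few lines of userspace
C. This is only a toy: struct toy_iter, the TOY_ITER_* kinds, and the
check counter are hypothetical stand-ins, and the "hardening check" is
reduced to incrementing a counter so the two paths can be told apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical iterator kinds; not the kernel's struct iov_iter. */
enum toy_iter_kind {
	TOY_ITER_UBUF, TOY_ITER_IOVEC,	/* user-backed */
	TOY_ITER_KVEC, TOY_ITER_BVEC,	/* kernel-only */
};

struct toy_iter {
	enum toy_iter_kind kind;
	char *buf;
	size_t room;	/* bytes still free in buf */
};

unsigned long toy_hardening_checks;	/* counts simulated validations */

static bool toy_user_backed(const struct toy_iter *i)
{
	return i->kind == TOY_ITER_UBUF || i->kind == TOY_ITER_IOVEC;
}

/* Fast path: plain bounded copy, no extra validation. */
static size_t toy_copy_raw(const void *addr, size_t bytes, struct toy_iter *i)
{
	size_t n = bytes < i->room ? bytes : i->room;

	memcpy(i->buf, addr, n);
	i->buf += n;
	i->room -= n;
	return n;
}

/* Slow path: simulated validation, then the same copy. */
static size_t toy_copy_checked(const void *addr, size_t bytes,
			       struct toy_iter *i)
{
	toy_hardening_checks++;
	return toy_copy_raw(addr, bytes, i);
}

/* Validate only when the destination could be user memory. */
size_t toy_simple_copy(const void *addr, size_t bytes, struct toy_iter *i)
{
	assert(addr && i);
	if (toy_user_backed(i))
		return toy_copy_checked(addr, bytes, i);
	return toy_copy_raw(addr, bytes, i);
}
```

The design choice is to decide per call from the iterator type, so
user-backed destinations keep their validation unchanged.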

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/core/datagram.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index c285c6465923..df6b87d7c415 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -490,7 +490,10 @@ static size_t crc32c_and_copy_to_iter(const void *addr, size_t bytes,
 	u32 *crcp = _crcp;
 	size_t copied;
 
-	copied = copy_to_iter(addr, bytes, i);
+	if (user_backed_iter(i))
+		copied = copy_to_iter(addr, bytes, i);
+	else
+		copied = _copy_to_iter(addr, bytes, i);
 	*crcp = crc32c(*crcp, addr, copied);
 	return copied;
 }
@@ -515,10 +518,18 @@ int skb_copy_and_crc32c_datagram_iter(const struct sk_buff *skb, int offset,
 EXPORT_SYMBOL(skb_copy_and_crc32c_datagram_iter);
 #endif /* CONFIG_NET_CRC32C */
 
+/*
+ * For kernel-only iterators (BVEC, KVEC, etc.), bypass usercopy
+ * hardening checks. Both the source (skb data) and destination
+ * (kernel pages/buffers) are kernel memory, so the checks add
+ * overhead without security benefit.
+ */
 static size_t simple_copy_to_iter(const void *addr, size_t bytes,
 		void *data __always_unused, struct iov_iter *i)
 {
-	return copy_to_iter(addr, bytes, i);
+	if (user_backed_iter(i))
+		return copy_to_iter(addr, bytes, i);
+	return _copy_to_iter(addr, bytes, i);
 }
 
 /**
-- 
2.52.0



* [PATCH v2 3/8] sunrpc: split svc_data_ready into protocol-specific callbacks
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 2/8] net: datagram: bypass usercopy checks for kernel iterators Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 4/8] sunrpc: add per-transport page recycling pool Chuck Lever
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Separate data-ready callbacks enable protocol-specific
optimizations. UDP and TCP transports already have different
requirements: currently UDP sockets do not implement DTLS, so the
XPT_HANDSHAKE check is unnecessary overhead for them.

Prepare the server-side socket infrastructure for additional
changes to TCP's data_ready callback.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/sunrpc.h |  3 ++-
 net/sunrpc/svcsock.c          | 39 +++++++++++++++++++++++++++++------
 2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 076182ae19ec..c5a15b7a321d 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -2313,7 +2313,8 @@ DEFINE_SVCSOCK_EVENT(tcp_send);
 DEFINE_SVCSOCK_EVENT(tcp_recv);
 DEFINE_SVCSOCK_EVENT(tcp_recv_eagain);
 DEFINE_SVCSOCK_EVENT(tcp_recv_err);
-DEFINE_SVCSOCK_EVENT(data_ready);
+DEFINE_SVCSOCK_EVENT(udp_data_ready);
+DEFINE_SVCSOCK_EVENT(tcp_data_ready);
 DEFINE_SVCSOCK_EVENT(write_space);
 
 TRACE_EVENT(svcsock_tcp_recv_short,
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index d61cd9b40491..73644f3b63c7 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -397,10 +397,14 @@ static void svc_sock_secure_port(struct svc_rqst *rqstp)
 		clear_bit(RQ_SECURE, &rqstp->rq_flags);
 }
 
-/*
- * INET callback when data has been received on the socket.
+/**
+ * svc_udp_data_ready - sk->sk_data_ready callback for UDP sockets
+ * @sk: socket whose receive buffer contains data
+ *
+ * This implementation does not yet support DTLS, so the
+ * XPT_HANDSHAKE check is not needed here.
  */
-static void svc_data_ready(struct sock *sk)
+static void svc_udp_data_ready(struct sock *sk)
 {
 	struct svc_sock	*svsk = (struct svc_sock *)sk->sk_user_data;
 
@@ -410,7 +414,30 @@ static void svc_data_ready(struct sock *sk)
 		/* Refer to svc_setup_socket() for details. */
 		rmb();
 		svsk->sk_odata(sk);
-		trace_svcsock_data_ready(&svsk->sk_xprt, 0);
+		trace_svcsock_udp_data_ready(&svsk->sk_xprt, 0);
+		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
+			svc_xprt_enqueue(&svsk->sk_xprt);
+	}
+}
+
+/**
+ * svc_tcp_data_ready - sk->sk_data_ready callback for TCP sockets
+ * @sk: socket whose receive buffer contains data
+ *
+ * Data ingest is skipped while a TLS handshake is in progress
+ * (XPT_HANDSHAKE).
+ */
+static void svc_tcp_data_ready(struct sock *sk)
+{
+	struct svc_sock	*svsk = (struct svc_sock *)sk->sk_user_data;
+
+	trace_sk_data_ready(sk);
+
+	if (svsk) {
+		/* Refer to svc_setup_socket() for details. */
+		rmb();
+		svsk->sk_odata(sk);
+		trace_svcsock_tcp_data_ready(&svsk->sk_xprt, 0);
 		if (test_bit(XPT_HANDSHAKE, &svsk->sk_xprt.xpt_flags))
 			return;
 		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
@@ -835,7 +862,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 	svc_xprt_init(sock_net(svsk->sk_sock->sk), &svc_udp_class,
 		      &svsk->sk_xprt, serv);
 	clear_bit(XPT_CACHE_AUTH, &svsk->sk_xprt.xpt_flags);
-	svsk->sk_sk->sk_data_ready = svc_data_ready;
+	svsk->sk_sk->sk_data_ready = svc_udp_data_ready;
 	svsk->sk_sk->sk_write_space = svc_write_space;
 
 	/* initialise setting must have enough space to
@@ -1368,7 +1395,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
 	} else {
 		sk->sk_state_change = svc_tcp_state_change;
-		sk->sk_data_ready = svc_data_ready;
+		sk->sk_data_ready = svc_tcp_data_ready;
 		sk->sk_write_space = svc_write_space;
 
 		svsk->sk_marker = xdr_zero;
-- 
2.52.0



* [PATCH v2 4/8] sunrpc: add per-transport page recycling pool
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (2 preceding siblings ...)
  2026-02-10 16:20 ` [PATCH v2 3/8] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 5/8] sunrpc: add dedicated TCP receiver thread Chuck Lever
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

RPC server transports allocate pages for receiving incoming data on
every request. Under high load, this repeated allocation and freeing
creates unnecessary overhead in the page allocator hot path.

Introduce svc_page_pool, a lock-free page recycling mechanism that
enables efficient page reuse between receive operations. A follow-up
commit wires this into the TCP transport's receive path; svcrdma's
RDMA Read path might also make use of this mechanism some day.

The pool uses llist for lock-free producer-consumer handoff: worker
threads returning pages after RPC processing act as producers, while
receiver threads allocating pages for incoming data act as
consumers. Pages are linked via page->pcp_llist, which is safe
because these pages are owned exclusively by the transport.

Each pool tracks its NUMA node affinity, allowing page allocations
to target the same node as the transport's receiver thread. Provide
svc_pool_node() to enable transports to determine the NUMA node
associated with a service pool for NUMA-aware resource allocation.
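
The producer/consumer handoff can be approximated in userspace with C11
atomics. The sketch below is an illustration only: struct toy_page and
the toy_pool_* functions are hypothetical, the list is a plain
compare-and-swap stack rather than the kernel's llist, and NUMA
placement is left out. As with the pool in this patch, the capacity
check is advisory and toy_pool_get() assumes a single consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page with an embedded list link. */
struct toy_page {
	struct toy_page *next;
	int id;
};

struct toy_pool {
	_Atomic(struct toy_page *) head;	/* recycled pages, LIFO */
	atomic_uint count;			/* pages currently pooled */
	unsigned int max;			/* retention limit */
};

void toy_pool_init(struct toy_pool *pool, unsigned int max)
{
	atomic_init(&pool->head, (struct toy_page *)NULL);
	atomic_init(&pool->count, 0);
	pool->max = max;
}

/*
 * Producer side; safe to call from many threads. Returns false when
 * the pool is full, in which case the caller still owns the page.
 */
bool toy_pool_put(struct toy_pool *pool, struct toy_page *page)
{
	struct toy_page *old;

	assert(page);
	if (atomic_load(&pool->count) >= pool->max)
		return false;
	old = atomic_load(&pool->head);
	do {
		page->next = old;
	} while (!atomic_compare_exchange_weak(&pool->head, &old, page));
	atomic_fetch_add(&pool->count, 1);
	return true;
}

/* Consumer side; callers must serialize (single consumer only). */
struct toy_page *toy_pool_get(struct toy_pool *pool)
{
	struct toy_page *old = atomic_load(&pool->head);

	while (old &&
	       !atomic_compare_exchange_weak(&pool->head, &old, old->next))
		;	/* old was refreshed by the failed exchange; retry */
	if (!old)
		return NULL;
	atomic_fetch_sub(&pool->count, 1);
	return old;
}
```

Restricting the pop side to one caller is what lets the pop read
old->next without extra protection; concurrent poppers would need
additional care against reuse of a node between the read and the
exchange.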

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h      |   1 +
 include/linux/sunrpc/svc_xprt.h |  32 +++++++
 net/sunrpc/svc.c                |  13 +++
 net/sunrpc/svc_xprt.c           | 151 ++++++++++++++++++++++++++++++++
 4 files changed, 197 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 5506d20857c3..f4efe60f4dad 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -457,6 +457,7 @@ void		   svc_wake_up(struct svc_serv *);
 void		   svc_reserve(struct svc_rqst *rqstp, int space);
 void		   svc_pool_wake_idle_thread(struct svc_pool *pool);
 struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
+int		   svc_pool_node(struct svc_pool *pool);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
 const char *	   svc_proc_name(const struct svc_rqst *rqstp);
 int		   svc_encode_result_payload(struct svc_rqst *rqstp,
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index da2a2531e110..e60c2936b1ce 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -9,9 +9,33 @@
 #define SUNRPC_SVC_XPRT_H
 
 #include <linux/sunrpc/svc.h>
+#include <linux/llist.h>
 
 struct module;
 
+/**
+ * struct svc_page_pool - per-transport page recycling pool
+ * @pp_pages: lock-free list of recycled pages
+ * @pp_count: number of pages currently in pool
+ * @pp_numa_node: NUMA node for page allocations
+ * @pp_max: maximum pages to retain in pool
+ *
+ * Lock-free page recycling between producers (svc threads returning
+ * pages) and a single consumer (the thread allocating pages for
+ * receives). Uses llist for efficient producer-consumer handoff
+ * without spinlocks.
+ *
+ * Callers must serialize calls to svc_page_pool_get(); multiple
+ * concurrent consumers are not supported.
+ * Allocate with svc_page_pool_alloc(); free with svc_page_pool_free().
+ */
+struct svc_page_pool {
+	struct llist_head	pp_pages;
+	atomic_t		pp_count;
+	int			pp_numa_node;
+	unsigned int		pp_max;
+};
+
 struct svc_xprt_ops {
 	struct svc_xprt	*(*xpo_create)(struct svc_serv *,
 				       struct net *net,
@@ -187,6 +211,14 @@ void	svc_add_new_perm_xprt(struct svc_serv *serv, struct svc_xprt *xprt);
 void	svc_age_temp_xprts_now(struct svc_serv *, struct sockaddr *);
 void	svc_xprt_deferred_close(struct svc_xprt *xprt);
 
+/* Page pool helpers */
+struct svc_page_pool *svc_page_pool_alloc(int numa_node, unsigned int max);
+void	svc_page_pool_free(struct svc_page_pool *pool);
+void	svc_page_pool_put(struct svc_page_pool *pool, struct page *page);
+void	svc_page_pool_put_bulk(struct svc_page_pool *pool,
+			       struct page **pages, unsigned int count);
+struct page *svc_page_pool_get(struct svc_page_pool *pool);
+
 static inline void svc_xprt_get(struct svc_xprt *xprt)
 {
 	kref_get(&xprt->xpt_ref);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 4704dce7284e..6b350cb7d539 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -418,6 +418,19 @@ struct svc_pool *svc_pool_for_cpu(struct svc_serv *serv)
 	return &serv->sv_pools[pidx % serv->sv_nrpools];
 }
 
+/**
+ * svc_pool_node - Return the NUMA node affinity of a service pool
+ * @pool: the service pool
+ *
+ * Return value:
+ *   The NUMA node the pool is associated with, or the local node
+ *   if no explicit mapping exists
+ */
+int svc_pool_node(struct svc_pool *pool)
+{
+	return svc_pool_map_get_node(pool->sp_id);
+}
+
 static int svc_rpcb_setup(struct svc_serv *serv, struct net *net)
 {
 	int err;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 6973184ff667..fe31cf6a9c5d 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -1497,4 +1497,155 @@ int svc_pool_stats_open(struct svc_info *info, struct file *file)
 }
 EXPORT_SYMBOL(svc_pool_stats_open);
 
+static struct llist_node *svc_page_to_llist(struct page *page)
+{
+	return &page->pcp_llist;
+}
+
+static struct page *svc_llist_to_page(struct llist_node *node)
+{
+	return container_of(node, struct page, pcp_llist);
+}
+
+/**
+ * svc_page_pool_alloc - Allocate a page pool
+ * @numa_node: NUMA node for page allocations
+ * @max: maximum pages to retain in pool
+ *
+ * Pages in an svc_page_pool are linked via page->pcp_llist, which is
+ * safe since these pages are owned exclusively by the transport.
+ *
+ * The caller must free the pool with svc_page_pool_free() when
+ * the transport is destroyed.
+ *
+ * Returns a new page pool, or NULL on allocation failure.
+ */
+struct svc_page_pool *svc_page_pool_alloc(int numa_node, unsigned int max)
+{
+	struct svc_page_pool *pool;
+
+	pool = kmalloc_node(sizeof(*pool), GFP_KERNEL, numa_node);
+	if (!pool)
+		return NULL;
+
+	init_llist_head(&pool->pp_pages);
+	atomic_set(&pool->pp_count, 0);
+	pool->pp_numa_node = numa_node;
+	pool->pp_max = max;
+	return pool;
+}
+
+/**
+ * svc_page_pool_free - Free a page pool and all pages in it
+ * @pool: pool to free (may be NULL)
+ */
+void svc_page_pool_free(struct svc_page_pool *pool)
+{
+	struct llist_node *node;
+
+	if (!pool)
+		return;
+
+	while ((node = llist_del_first(&pool->pp_pages)) != NULL)
+		put_page(svc_llist_to_page(node));
+	kfree(pool);
+}
+
+/**
+ * svc_page_pool_put - Return a page to the pool
+ * @pool: pool to return page to (may be NULL)
+ * @page: page to return (may be NULL)
+ *
+ * Transfers ownership of @page to the pool. The caller's reference
+ * is consumed: either the pool retains the page, or put_page() is
+ * called if @pool is NULL or full.
+ */
+void svc_page_pool_put(struct svc_page_pool *pool, struct page *page)
+{
+	if (!page)
+		return;
+	if (!pool || atomic_read(&pool->pp_count) >= pool->pp_max) {
+		put_page(page);
+		return;
+	}
+	llist_add(svc_page_to_llist(page), &pool->pp_pages);
+	atomic_inc(&pool->pp_count);
+}
+
+/**
+ * svc_page_pool_put_bulk - Return multiple pages to the pool
+ * @pool: pool to return pages to (may be NULL)
+ * @pages: array of pages to return
+ * @count: number of pages in @pages array
+ *
+ * Batch version of svc_page_pool_put() that reduces atomic operations
+ * when returning many pages at once. Transfers ownership of all pages
+ * in @pages to the pool. Uses release_pages() for efficient bulk
+ * freeing when the pool is full.
+ *
+ * Unlike svc_page_pool_put(), this function does not handle NULL
+ * entries in @pages. All @count entries must be valid page pointers.
+ */
+void svc_page_pool_put_bulk(struct svc_page_pool *pool,
+			    struct page **pages, unsigned int count)
+{
+	struct llist_node *head, *last, *node;
+	unsigned int i, to_add, avail;
+
+	if (!count)
+		return;
+	if (!pool) {
+		release_pages(pages, count);
+		return;
+	}
+
+	avail = pool->pp_max - atomic_read(&pool->pp_count);
+	to_add = min_t(unsigned int, count, avail);
+	if (!to_add) {
+		release_pages(pages, count);
+		return;
+	}
+
+	head = NULL;
+	last = NULL;
+	for (i = 0; i < to_add; i++) {
+		node = svc_page_to_llist(pages[i]);
+		node->next = head;
+		head = node;
+		if (!last)
+			last = node;
+	}
+	llist_add_batch(head, last, &pool->pp_pages);
+	atomic_add(to_add, &pool->pp_count);
+
+	/* Free overflow pages that didn't fit in the pool */
+	if (to_add < count)
+		release_pages(pages + to_add, count - to_add);
+}
+EXPORT_SYMBOL_GPL(svc_page_pool_put_bulk);
+
+/**
+ * svc_page_pool_get - Get a page from the pool
+ * @pool: pool to take from (may be NULL)
+ *
+ * Returns a recycled page with one reference, or NULL if @pool is
+ * NULL or empty. The caller owns the returned page and must either
+ * return it via svc_page_pool_put() or release it with put_page().
+ *
+ * Caller must serialize; concurrent calls for the same pool are
+ * not supported.
+ */
+struct page *svc_page_pool_get(struct svc_page_pool *pool)
+{
+	struct llist_node *node;
+
+	if (!pool)
+		return NULL;
+	node = llist_del_first(&pool->pp_pages);
+	if (!node)
+		return NULL;
+	atomic_dec(&pool->pp_count);
+	return svc_llist_to_page(node);
+}
+
 /*----------------------------------------------------------------------------*/
-- 
2.52.0



* [PATCH v2 5/8] sunrpc: add dedicated TCP receiver thread
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (3 preceding siblings ...)
  2026-02-10 16:20 ` [PATCH v2 4/8] sunrpc: add per-transport page recycling pool Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends Chuck Lever
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Receive-side socket lock contention on NFS server TCP connections
is eliminated by dedicating one kernel thread per TCP socket to
handle all receives.

In the current architecture, multiple nfsd threads compete for
the same socket, serializing on the socket lock inside
sock_recvmsg(). A single receiver kthread per TCP connection
issues all sock_recvmsg() calls and enqueues complete RPC
messages for nfsd threads to process.

Architecture:

  Before:
    nfsd 1 --+                    +-- sock_recvmsg() --+
    nfsd 2 --+-- compete for xprt-+-- sock_recvmsg() --+- CONTENTION
    nfsd 3 --+                    +-- sock_recvmsg() --+

  After:
    Receiver kthread - sock_recvmsg() -+- nfsd 1 - process - send
         (no contention)               +- nfsd 2 - process - send
                                       +- nfsd 3 - process - send

A lock-free llist queue passes complete RPC messages from the
receiver to nfsd threads, avoiding spinlock overhead in the
fast path. Flow control limits queue depth to
SVC_TCP_MSG_QUEUE_MAX (64) messages per socket to bound
memory usage.

This mirrors the svcrdma architecture, where RDMA completion
handlers enqueue received messages for nfsd threads rather
than having nfsd threads compete for hardware resources.

NUMA Affinity:

The receiver kthread is placed on the NUMA node associated
with the service pool handling the accept, matching the NUMA
placement strategy used for nfsd threads. Page allocations
for receive buffers explicitly target this node via
__alloc_pages_bulk(), providing memory locality for the
receive path. This mirrors how svcrdma allocates resources
on the RNIC's NUMA node.

svc_tcp_data_ready() now wakes the dedicated receiver kthread
instead of enqueueing the transport for nfsd threads. If
receiver kthread creation fails during connection accept, the
connection is rejected; the client retries.
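
The bounded handoff between the receiver and the nfsd threads can be
pictured with the userspace sketch below. It is an assumption-laden
illustration, not the patch's code: the toy_* names are hypothetical,
TOY_MSG_QUEUE_MAX merely mirrors SVC_TCP_MSG_QUEUE_MAX, and the consumer
here detaches the whole list and reverses it to restore arrival order,
which may differ from how the series dequeues.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MSG_QUEUE_MAX	64	/* mirrors SVC_TCP_MSG_QUEUE_MAX */

/* Hypothetical stand-in for a queued, fully received RPC message. */
struct toy_msg {
	struct toy_msg *next;
	size_t len;
};

struct toy_sock {
	_Atomic(struct toy_msg *) queue;	/* newest message first */
	atomic_int depth;			/* messages queued */
};

void toy_sock_init(struct toy_sock *s)
{
	atomic_init(&s->queue, (struct toy_msg *)NULL);
	atomic_init(&s->depth, 0);
}

/*
 * Receiver side. Returns false at the depth limit so the caller can
 * stop reading and let unread data wait in the socket buffer.
 */
bool toy_queue_msg(struct toy_sock *s, struct toy_msg *msg)
{
	struct toy_msg *old;

	assert(msg);
	if (atomic_load(&s->depth) >= TOY_MSG_QUEUE_MAX)
		return false;
	old = atomic_load(&s->queue);
	do {
		msg->next = old;
	} while (!atomic_compare_exchange_weak(&s->queue, &old, msg));
	atomic_fetch_add(&s->depth, 1);
	return true;
}

/* Worker side: take everything queued, oldest message first. */
struct toy_msg *toy_dequeue_all(struct toy_sock *s)
{
	struct toy_msg *lifo;
	struct toy_msg *fifo = NULL;
	int n = 0;

	lifo = atomic_exchange(&s->queue, (struct toy_msg *)NULL);
	while (lifo) {
		struct toy_msg *next = lifo->next;

		lifo->next = fifo;	/* reverse to arrival order */
		fifo = lifo;
		lifo = next;
		n++;
	}
	atomic_fetch_sub(&s->depth, n);
	return fifo;
}
```

Because a push-only lock-free list is newest-first, some step on the
consumer side has to account for ordering; the reversal above is one
simple way to do that.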

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h |  36 +++
 net/sunrpc/svcsock.c           | 502 ++++++++++++++++++++++++++++++---
 2 files changed, 498 insertions(+), 40 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index de37069aba90..b94096eeb890 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -12,6 +12,32 @@
 
 #include <linux/sunrpc/svc.h>
 #include <linux/sunrpc/svc_xprt.h>
+#include <linux/cache.h>
+#include <linux/llist.h>
+#include <linux/wait.h>
+#include <linux/completion.h>
+#include <linux/ktime.h>
+
+/* Maximum queued messages per TCP socket before backpressure */
+#define SVC_TCP_MSG_QUEUE_MAX	64
+
+/**
+ * struct svc_tcp_msg - queued RPC message ready for processing
+ * @tm_node: lock-free queue linkage
+ * @tm_len: total message length
+ * @tm_npages: number of pages holding message data
+ * @tm_pages: flexible array of pages containing the message
+ *
+ * Complete RPC messages are enqueued using these structures after
+ * reception. Page ownership transfers from the receiver's rqstp to
+ * this structure, then to an nfsd thread's rqstp during dequeue.
+ */
+struct svc_tcp_msg {
+	struct llist_node	tm_node;
+	size_t			tm_len;
+	unsigned int		tm_npages;
+	struct page		*tm_pages[];
+};
 
 /*
  * RPC server socket.
@@ -43,6 +69,16 @@ struct svc_sock {
 
 	struct completion	sk_handshake_done;
 
+	/* Dedicated receiver thread (TCP only) */
+	struct task_struct	*sk_receiver;
+	struct llist_head	sk_msg_queue;
+	wait_queue_head_t	sk_receiver_wq;
+	struct completion	sk_receiver_exit;
+	struct svc_page_pool	*sk_page_pool;
+	ktime_t			sk_partial_record_time;
+
+	atomic_t		sk_msg_count ____cacheline_aligned_in_smp;
+
 	/* received data */
 	unsigned long		sk_maxpages;
 	struct page *		sk_pages[] __counted_by(sk_maxpages);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 73644f3b63c7..a89427b5fc70 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -22,6 +22,7 @@
 
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/kthread.h>
 #include <linux/module.h>
 #include <linux/errno.h>
 #include <linux/fcntl.h>
@@ -93,6 +94,9 @@ static int		svc_udp_sendto(struct svc_rqst *);
 static void		svc_sock_detach(struct svc_xprt *);
 static void		svc_tcp_sock_detach(struct svc_xprt *);
 static void		svc_sock_free(struct svc_xprt *);
+static int		svc_tcp_recv_msg(struct svc_rqst *);
+static int		svc_tcp_start_receiver(struct svc_sock *);
+static void		svc_tcp_stop_receiver(struct svc_sock *);
 
 static struct svc_xprt *svc_create_socket(struct svc_serv *, int,
 					  struct net *, struct sockaddr *,
@@ -423,26 +427,37 @@ static void svc_udp_data_ready(struct sock *sk)
 /**
  * svc_tcp_data_ready - sk->sk_data_ready callback for TCP sockets
  * @sk: socket whose receive buffer contains data
- *
- * Data ingest is skipped while a TLS handshake is in progress
- * (XPT_HANDSHAKE).
  */
 static void svc_tcp_data_ready(struct sock *sk)
 {
 	struct svc_sock	*svsk = (struct svc_sock *)sk->sk_user_data;
 
 	trace_sk_data_ready(sk);
+	if (!svsk)
+		return;
 
-	if (svsk) {
-		/* Refer to svc_setup_socket() for details. */
-		rmb();
+	/* Refer to svc_setup_socket() for details. */
+	rmb();
+
+	trace_svcsock_tcp_data_ready(&svsk->sk_xprt, 0);
+
+	/* During a TLS handshake, the socket fd is installed in the
+	 * handshake daemon's fdtable (see handshake_nl_accept_doit).
+	 * The daemon blocks on the standard sk->sk_wq via read()/poll(),
+	 * so sk_odata (sock_def_readable) is needed to wake it.
+	 */
+	if (test_bit(XPT_HANDSHAKE, &svsk->sk_xprt.xpt_flags)) {
 		svsk->sk_odata(sk);
-		trace_svcsock_tcp_data_ready(&svsk->sk_xprt, 0);
-		if (test_bit(XPT_HANDSHAKE, &svsk->sk_xprt.xpt_flags))
-			return;
-		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
-			svc_xprt_enqueue(&svsk->sk_xprt);
+		return;
 	}
+
+	/* Skip sk_odata: the dedicated receiver kthread waits on
+	 * sk_receiver_wq, not sk->sk_wq, so sk_odata (sock_def_readable)
+	 * would invoke rcu_read_lock, skwq_has_sleeper, and
+	 * sk_wake_async_rcu on every TCP segment with no effect.
+	 */
+	if (wq_has_sleeper(&svsk->sk_receiver_wq))
+		wake_up(&svsk->sk_receiver_wq);
 }
 
 /*
@@ -934,8 +949,10 @@ static void svc_tcp_state_change(struct sock *sk)
 		rmb();
 		svsk->sk_ostate(sk);
 		trace_svcsock_tcp_state(&svsk->sk_xprt, svsk->sk_sock);
-		if (sk->sk_state != TCP_ESTABLISHED)
+		if (sk->sk_state != TCP_ESTABLISHED) {
 			svc_xprt_deferred_close(&svsk->sk_xprt);
+			wake_up(&svsk->sk_receiver_wq);
+		}
 	}
 }
 
@@ -1003,8 +1020,22 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
 	if (serv->sv_stats)
 		serv->sv_stats->nettcpconn++;
 
+	/*
+	 * Disable busy polling for this socket. The receiver kthread
+	 * blocks in sock_recvmsg() waiting for data on a single
+	 * connection. Busy polling adds CPU overhead without reducing
+	 * latency.
+	 */
+	WRITE_ONCE(newsock->sk->sk_ll_usec, 0);
+
+	if (svc_tcp_start_receiver(newsvsk) < 0)
+		goto failed_start;
+
 	return &newsvsk->sk_xprt;
 
+failed_start:
+	svc_xprt_put(&newsvsk->sk_xprt);
+	return NULL;
 failed:
 	sockfd_put(newsock);
 	return NULL;
@@ -1151,25 +1182,375 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
 	svsk->sk_marker = xdr_zero;
 }
 
-/**
- * svc_tcp_recvfrom - Receive data from a TCP socket
- * @rqstp: request structure into which to receive an RPC Call
- *
- * Called in a loop when XPT_DATA has been set.
- *
- * Read the 4-byte stream record marker, then use the record length
- * in that marker to set up exactly the resources needed to receive
- * the next RPC message into @rqstp.
- *
- * Returns:
- *   On success, the number of bytes in a received RPC Call, or
- *   %0 if a complete RPC Call message was not ready to return
- *
- * The zero return case handles partial receives and callback Replies.
- * The state of a partial receive is preserved in the svc_sock for
- * the next call to svc_tcp_recvfrom.
+static struct svc_tcp_msg *svc_tcp_msg_alloc(unsigned int npages)
+{
+	return kmalloc(struct_size_t(struct svc_tcp_msg, tm_pages, npages),
+		       GFP_KERNEL);
+}
+
+static void svc_tcp_msg_free(struct svc_tcp_msg *msg)
+{
+	unsigned int i;
+
+	for (i = 0; i < msg->tm_npages; i++)
+		if (msg->tm_pages[i])
+			put_page(msg->tm_pages[i]);
+	kfree(msg);
+}
+
+static void svc_tcp_drain_msg_queue(struct svc_sock *svsk)
+{
+	struct llist_node *node;
+	struct svc_tcp_msg *msg;
+
+	while ((node = llist_del_first(&svsk->sk_msg_queue)) != NULL) {
+		msg = llist_entry(node, struct svc_tcp_msg, tm_node);
+		atomic_dec(&svsk->sk_msg_count);
+		svc_tcp_msg_free(msg);
+	}
+}
+
+static inline void svc_tcp_setup_rqst(struct svc_rqst *rqstp,
+				      struct svc_xprt *xprt)
+{
+	rqstp->rq_xprt_ctxt = NULL;
+	rqstp->rq_prot = IPPROTO_TCP;
+	if (test_bit(XPT_LOCAL, &xprt->xpt_flags))
+		set_bit(RQ_LOCAL, &rqstp->rq_flags);
+	else
+		clear_bit(RQ_LOCAL, &rqstp->rq_flags);
+}
+
+/*
+ * Transfer page ownership from @msg to @rqstp and set up the xdr_buf
+ * for RPC processing.
  */
-static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
+static void svc_tcp_msg_to_rqst(struct svc_rqst *rqstp, struct svc_tcp_msg *msg)
+{
+	struct svc_sock *svsk = container_of(rqstp->rq_xprt,
+					     struct svc_sock, sk_xprt);
+	struct svc_page_pool *pool = svsk->sk_page_pool;
+	unsigned int i;
+
+	for (i = 0; i < msg->tm_npages; i++) {
+		if (rqstp->rq_pages[i])
+			svc_page_pool_put(pool, rqstp->rq_pages[i]);
+		rqstp->rq_pages[i] = msg->tm_pages[i];
+		msg->tm_pages[i] = NULL;
+	}
+
+	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+	rqstp->rq_arg.head[0].iov_len = min_t(size_t, msg->tm_len, PAGE_SIZE);
+	rqstp->rq_arg.pages = rqstp->rq_pages + 1;
+	rqstp->rq_arg.page_base = 0;
+	if (msg->tm_len <= PAGE_SIZE)
+		rqstp->rq_arg.page_len = 0;
+	else
+		rqstp->rq_arg.page_len = msg->tm_len - PAGE_SIZE;
+	rqstp->rq_arg.len = msg->tm_len;
+	rqstp->rq_arg.buflen = msg->tm_npages * PAGE_SIZE;
+
+	rqstp->rq_respages = &rqstp->rq_pages[msg->tm_npages];
+	rqstp->rq_next_page = rqstp->rq_respages + 1;
+	svc_xprt_copy_addrs(rqstp, rqstp->rq_xprt);
+	svc_tcp_setup_rqst(rqstp, rqstp->rq_xprt);
+}
+
+static int svc_tcp_queue_msg(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	struct svc_tcp_msg *msg;
+	unsigned int npages;
+	unsigned int i;
+
+	npages = DIV_ROUND_UP(rqstp->rq_arg.len, PAGE_SIZE);
+	msg = svc_tcp_msg_alloc(npages);
+	if (!msg)
+		return -ENOMEM;
+
+	msg->tm_len = rqstp->rq_arg.len;
+	msg->tm_npages = npages;
+
+	for (i = 0; i < npages; i++) {
+		msg->tm_pages[i] = rqstp->rq_pages[i];
+		rqstp->rq_pages[i] = NULL;
+	}
+
+	llist_add(&msg->tm_node, &svsk->sk_msg_queue);
+	atomic_inc(&svsk->sk_msg_count);
+
+	return 0;
+}
+
+static int svc_tcp_receiver_alloc_pages(struct svc_rqst *rqstp)
+{
+	struct svc_sock *svsk = container_of(rqstp->rq_xprt,
+					     struct svc_sock, sk_xprt);
+	struct svc_page_pool *pool = svsk->sk_page_pool;
+	unsigned long pages, filled, ret;
+	struct page *page;
+
+	pages = rqstp->rq_maxpages;
+
+	for (filled = 0; filled < pages; filled++) {
+		page = svc_page_pool_get(pool);
+		if (!page)
+			break;
+		rqstp->rq_pages[filled] = page;
+	}
+	while (filled < pages) {
+		ret = __alloc_pages_bulk(GFP_KERNEL, pool->pp_numa_node, NULL,
+					 pages - filled,
+					 rqstp->rq_pages + filled);
+		if (ret == 0) {
+			while (filled--)
+				put_page(rqstp->rq_pages[filled]);
+			return -ENOMEM;
+		}
+		filled += ret;
+	}
+
+	rqstp->rq_page_end = &rqstp->rq_pages[pages];
+	rqstp->rq_pages[pages] = NULL;
+	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+	rqstp->rq_arg.head[0].iov_len = PAGE_SIZE;
+	rqstp->rq_arg.pages = rqstp->rq_pages + 1;
+	rqstp->rq_arg.page_base = 0;
+	rqstp->rq_arg.page_len = (pages - 2) * PAGE_SIZE;
+	rqstp->rq_arg.len = (pages - 1) * PAGE_SIZE;
+	rqstp->rq_arg.tail[0].iov_len = 0;
+
+	return 0;
+}
+
+static inline bool svc_tcp_receiver_can_work(struct svc_sock *svsk)
+{
+	return tcp_inq(svsk->sk_sk) > 0 &&
+	       atomic_read(&svsk->sk_msg_count) < SVC_TCP_MSG_QUEUE_MAX;
+}
+
+static int svc_tcp_receiver_rqst_init(struct svc_rqst *rqstp,
+				      struct svc_sock *svsk)
+{
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+
+	memset(rqstp, 0, sizeof(*rqstp));
+	rqstp->rq_server = serv;
+	rqstp->rq_maxpages = svc_serv_maxpages(serv);
+	rqstp->rq_pages = kcalloc(rqstp->rq_maxpages + 1,
+				  sizeof(struct page *), GFP_KERNEL);
+	if (!rqstp->rq_pages)
+		return -ENOMEM;
+	rqstp->rq_bvec = kcalloc(rqstp->rq_maxpages,
+				 sizeof(struct bio_vec), GFP_KERNEL);
+	if (!rqstp->rq_bvec) {
+		kfree(rqstp->rq_pages);
+		return -ENOMEM;
+	}
+	rqstp->rq_xprt = &svsk->sk_xprt;
+
+	if (svc_tcp_receiver_alloc_pages(rqstp) < 0) {
+		kfree(rqstp->rq_bvec);
+		kfree(rqstp->rq_pages);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static void svc_tcp_receiver_rqst_free(struct svc_rqst *rqstp)
+{
+	unsigned int i;
+
+	for (i = 0; i < rqstp->rq_maxpages; i++)
+		if (rqstp->rq_pages[i])
+			put_page(rqstp->rq_pages[i]);
+	kfree(rqstp->rq_bvec);
+	kfree(rqstp->rq_pages);
+}
+
+/*
+ * Receive complete RPC messages and queue them for nfsd threads.
+ * Return true if at least one message was queued.
+ */
+static noinline bool
+svc_tcp_recv_and_queue(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	bool progress = false;
+	int len;
+
+	while (!kthread_should_stop() &&
+	       !test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags)) {
+		if (atomic_read(&svsk->sk_msg_count) >=
+		    SVC_TCP_MSG_QUEUE_MAX)
+			break;
+
+		len = svc_tcp_recv_msg(rqstp);
+		if (len <= 0)
+			break;
+
+		progress = true;
+		if (svc_tcp_queue_msg(svsk, rqstp) < 0) {
+			svc_xprt_deferred_close(&svsk->sk_xprt);
+			break;
+		}
+		if (svc_tcp_receiver_alloc_pages(rqstp) < 0) {
+			svc_xprt_deferred_close(&svsk->sk_xprt);
+			break;
+		}
+	}
+
+	return progress;
+}
+
+/*
+ * Check for a stalled partial RPC record. Enable keepalives to
+ * probe the peer; close the connection if it has already failed
+ * or the stall exceeds a timeout.
+ *
+ * Return true if the connection should be closed.
+ */
+static noinline bool
+svc_tcp_check_partial_record(struct svc_sock *svsk, bool progress)
+{
+	struct sock *sk = svsk->sk_sk;
+
+	if (!progress && svsk->sk_tcplen > 0) {
+		if (sk->sk_state != TCP_ESTABLISHED || sk->sk_err) {
+			svc_xprt_deferred_close(&svsk->sk_xprt);
+			return true;
+		}
+
+		if (!sock_flag(sk, SOCK_KEEPOPEN)) {
+			sock_set_keepalive(sk);
+			tcp_sock_set_keepidle(sk, 10);
+			tcp_sock_set_keepintvl(sk, 5);
+			tcp_sock_set_keepcnt(sk, 3);
+		}
+
+		if (!svsk->sk_partial_record_time) {
+			svsk->sk_partial_record_time = ktime_get();
+		} else if (ktime_ms_delta(ktime_get(),
+				svsk->sk_partial_record_time) > 60000) {
+			svc_xprt_deferred_close(&svsk->sk_xprt);
+			return true;
+		}
+	} else if (progress) {
+		svsk->sk_partial_record_time = 0;
+	}
+
+	return false;
+}
+
+/*
+ * Dedicated receiver kthread for a TCP socket. All sock_recvmsg()
+ * calls for this connection occur in this context, eliminating
+ * socket lock contention between nfsd threads. Complete RPC
+ * messages are enqueued for nfsd threads to process.
+ */
+static int svc_tcp_receiver_thread(void *data)
+{
+	struct svc_sock *svsk = data;
+	struct svc_rqst rqstp_storage;
+	struct svc_rqst *rqstp = &rqstp_storage;
+	bool progress;
+
+	if (svc_tcp_receiver_rqst_init(rqstp, svsk) < 0) {
+		svc_xprt_deferred_close(&svsk->sk_xprt);
+		complete(&svsk->sk_receiver_exit);
+		return 0;
+	}
+
+	while (!kthread_should_stop() &&
+	       !test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags)) {
+		/*
+		 * Wait until data arrives and the message queue has
+		 * room. Use a timeout when a partial RPC record
+		 * remains so connection health is checked periodically.
+		 */
+		if (svsk->sk_tcplen > 0)
+			wait_event_interruptible_timeout(svsk->sk_receiver_wq,
+				svc_tcp_receiver_can_work(svsk) ||
+				test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags) ||
+				kthread_should_stop(),
+				msecs_to_jiffies(5000));
+		else
+			wait_event_interruptible(svsk->sk_receiver_wq,
+				svc_tcp_receiver_can_work(svsk) ||
+				test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags) ||
+				kthread_should_stop());
+
+		progress = svc_tcp_recv_and_queue(svsk, rqstp);
+
+		if (!llist_empty(&svsk->sk_msg_queue)) {
+			set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+			svc_xprt_enqueue(&svsk->sk_xprt);
+		}
+
+		if (svc_tcp_check_partial_record(svsk, progress))
+			break;
+	}
+
+	svc_tcp_receiver_rqst_free(rqstp);
+	complete(&svsk->sk_receiver_exit);
+	return 0;
+}
+
+static int svc_tcp_start_receiver(struct svc_sock *svsk)
+{
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+	struct svc_page_pool *pool;
+	struct task_struct *task;
+	int numa_node;
+
+	/* The wait queue is initialized earlier in svc_tcp_init()
+	 * so svc_tcp_data_ready() can wake it before this function
+	 * runs.
+	 */
+	init_llist_head(&svsk->sk_msg_queue);
+	init_completion(&svsk->sk_receiver_exit);
+	atomic_set(&svsk->sk_msg_count, 0);
+
+	numa_node = svc_pool_node(svc_pool_for_cpu(serv));
+	pool = svc_page_pool_alloc(numa_node, svsk->sk_maxpages);
+	if (!pool)
+		return -ENOMEM;
+	svsk->sk_page_pool = pool;
+
+	task = kthread_create_on_node(svc_tcp_receiver_thread, svsk,
+				      numa_node, "tcp-recv/%s",
+				      svsk->sk_xprt.xpt_remotebuf);
+	if (IS_ERR(task)) {
+		svc_page_pool_free(pool);
+		svsk->sk_page_pool = NULL;
+		return PTR_ERR(task);
+	}
+
+	svsk->sk_receiver = task;
+	wake_up_process(task);
+	return 0;
+}
+
+static void svc_tcp_stop_receiver(struct svc_sock *svsk)
+{
+	if (!svsk->sk_receiver)
+		return;
+
+	wake_up(&svsk->sk_receiver_wq);
+	kthread_stop(svsk->sk_receiver);
+	wait_for_completion(&svsk->sk_receiver_exit);
+	svsk->sk_receiver = NULL;
+
+	svc_tcp_drain_msg_queue(svsk);
+	svc_page_pool_free(svsk->sk_page_pool);
+	svsk->sk_page_pool = NULL;
+}
+
+/*
+ * Called only by the dedicated receiver kthread. Does not call
+ * svc_xprt_received() because the receiver implements its own
+ * event loop separate from the nfsd thread pool.
+ */
+static int svc_tcp_recv_msg(struct svc_rqst *rqstp)
 {
 	struct svc_sock	*svsk =
 		container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
@@ -1179,7 +1560,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	__be32 *p;
 	__be32 calldir;
 
-	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	len = svc_tcp_read_marker(svsk, rqstp);
 	if (len < 0)
 		goto error;
@@ -1205,12 +1585,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	} else
 		rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;
 
-	rqstp->rq_xprt_ctxt   = NULL;
-	rqstp->rq_prot	      = IPPROTO_TCP;
-	if (test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags))
-		set_bit(RQ_LOCAL, &rqstp->rq_flags);
-	else
-		clear_bit(RQ_LOCAL, &rqstp->rq_flags);
+	svc_tcp_setup_rqst(rqstp, &svsk->sk_xprt);
 
 	p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
 	calldir = p[1];
@@ -1229,7 +1604,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		serv->sv_stats->nettcpcnt++;
 
 	svc_sock_secure_port(rqstp);
-	svc_xprt_received(rqstp->rq_xprt);
 	return rqstp->rq_arg.len;
 
 err_incomplete:
@@ -1254,10 +1628,55 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	trace_svcsock_tcp_recv_err(&svsk->sk_xprt, len);
 	svc_xprt_deferred_close(&svsk->sk_xprt);
 err_noclose:
-	svc_xprt_received(rqstp->rq_xprt);
 	return 0;	/* record not complete */
 }
 
+/**
+ * svc_tcp_recvfrom - Receive an RPC Call from a TCP socket
+ * @rqstp: request structure into which to receive an RPC Call
+ *
+ * Return values:
+ *   %0: no complete message ready
+ *   positive: length of received RPC Call, in bytes
+ */
+static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
+{
+	struct svc_sock *svsk = container_of(rqstp->rq_xprt,
+					     struct svc_sock, sk_xprt);
+	struct llist_node *node;
+	struct svc_tcp_msg *msg;
+	int len;
+
+	node = llist_del_first(&svsk->sk_msg_queue);
+	if (!node) {
+		clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+		svc_xprt_received(rqstp->rq_xprt);
+		return 0;
+	}
+
+	msg = llist_entry(node, struct svc_tcp_msg, tm_node);
+
+	/*
+	 * Wake the receiver when the queue drops below the threshold.
+	 * The receiver may be blocked waiting for queue space.
+	 */
+	if (atomic_dec_return(&svsk->sk_msg_count) == SVC_TCP_MSG_QUEUE_MAX - 1)
+		wake_up_interruptible(&svsk->sk_receiver_wq);
+
+	svc_tcp_msg_to_rqst(rqstp, msg);
+	len = rqstp->rq_arg.len;
+
+	svc_sock_secure_port(rqstp);
+
+	if (llist_empty(&svsk->sk_msg_queue))
+		clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+
+	svc_xprt_received(rqstp->rq_xprt);
+	kfree(msg);
+
+	return len;
+}
+
 /*
  * MSG_SPLICE_PAGES is used exclusively to reduce the number of
  * copy operations in this path. Therefore the caller must ensure
@@ -1394,6 +1813,8 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		sk->sk_data_ready = svc_tcp_listen_data_ready;
 		set_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags);
 	} else {
+		init_waitqueue_head(&svsk->sk_receiver_wq);
+
 		sk->sk_state_change = svc_tcp_state_change;
 		sk->sk_data_ready = svc_tcp_data_ready;
 		sk->sk_write_space = svc_write_space;
@@ -1406,7 +1827,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 
 		tcp_sock_set_nodelay(sk);
 
-		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 		switch (sk->sk_state) {
 		case TCP_SYN_RECV:
 		case TCP_ESTABLISHED:
@@ -1677,6 +2097,8 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
 {
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
 
+	svc_tcp_stop_receiver(svsk);
+
 	tls_handshake_close(svsk->sk_sock);
 
 	svc_sock_detach(xprt);
-- 
2.52.0
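
The page-to-xdr_buf mapping performed by svc_tcp_queue_msg() and
svc_tcp_msg_to_rqst() in the diff above can be checked in isolation.
What follows is a minimal userspace sketch, not kernel code: the
struct, the function name, and the fixed 4096-byte page size are
assumptions made for illustration. It reproduces only the arithmetic:
the first page becomes head[0], and any remainder lands in the page
array.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/* Hypothetical stand-in for the rq_arg fields the patch fills in. */
struct sketch_layout {
	size_t npages;		/* pages carrying the message */
	size_t head_len;	/* bytes in head[0], i.e. the first page */
	size_t page_len;	/* bytes in the page array after the head */
	size_t buflen;		/* total capacity of the transferred pages */
};

static struct sketch_layout sketch_split(size_t msg_len)
{
	struct sketch_layout l;

	/* The receiver queues only messages with a positive length. */
	assert(msg_len > 0);

	/* Same rounding as DIV_ROUND_UP(len, PAGE_SIZE). */
	l.npages = (msg_len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	l.head_len = msg_len < SKETCH_PAGE_SIZE ? msg_len : SKETCH_PAGE_SIZE;
	l.page_len = msg_len - l.head_len;
	l.buflen = l.npages * SKETCH_PAGE_SIZE;
	return l;
}
```

A 100-byte message fits entirely in the head; a message of one page
plus 10 bytes leaves 10 bytes in the page array.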


* [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (4 preceding siblings ...)
  2026-02-10 16:20 ` [PATCH v2 5/8] sunrpc: add dedicated TCP receiver thread Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 7/8] sunrpc: unify fore and backchannel server TCP send paths Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 8/8] sunrpc: Set explicit TCP socket buffer sizes for NFSD Chuck Lever
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

High contention on xpt_mutex during TCP reply transmission limits
nfsd scalability. When multiple threads send replies on the same
connection, mutex handoff triggers optimistic spinning that consumes
substantial CPU time. Profiling on high-throughput workloads shows
approximately 15% of cycles spent in the spin-wait path.

The flat combining pattern addresses this by allowing one thread to
perform send operations on behalf of multiple waiting threads. Rather
than each thread acquiring xpt_mutex independently, threads enqueue
their send requests to a lock-free llist. The first thread to
acquire the mutex becomes the combiner and processes all pending
sends in a batch. Other threads wait on a per-request completion
structure instead of spinning on the lock.

The combiner continues processing the queue until empty before
releasing xpt_mutex, amortizing acquisition cost across multiple
sends. TCP auto-corking naturally coalesces segments once the
first send places data in flight, avoiding the need for explicit
corking hints that would delay initial replies.
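
The enqueue-and-drain pattern described above can be sketched in
userspace with C11 atomics. This is an illustration under stated
assumptions, not the kernel implementation: all sketch_* names are
hypothetical, a plain flag stands in for the per-request completion,
and the mutex is left to the caller. The push mirrors what
llist_add() provides, and the atomic exchange mirrors llist_del_all().
Because such a list yields the newest entry first, the sketch reverses
each batch to restore arrival order; that is a choice made for the
illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical request record, stack-allocated by each sender. */
struct sketch_req {
	struct sketch_req *next;
	int result;	/* filled in by the combiner */
	int done;	/* stands in for the per-request completion */
};

struct sketch_queue {
	_Atomic(struct sketch_req *) head;
};

/* Lock-free push onto the head of the list. */
static void sketch_enqueue(struct sketch_queue *q, struct sketch_req *r)
{
	struct sketch_req *old = atomic_load(&q->head);

	assert(r != NULL);
	do {
		r->next = old;	/* old is refreshed on CAS failure */
	} while (!atomic_compare_exchange_weak(&q->head, &old, r));
}

/*
 * Combiner step: detach the whole batch in one atomic operation,
 * restore arrival order, then process each request. The caller is
 * assumed to hold the lock. Returns the number of requests handled.
 */
static int sketch_combine(struct sketch_queue *q, int *next_seq)
{
	struct sketch_req *batch =
		atomic_exchange(&q->head, (struct sketch_req *)NULL);
	struct sketch_req *ordered = NULL, *r;
	int n = 0;

	while (batch) {		/* newest-first becomes oldest-first */
		r = batch;
		batch = r->next;
		r->next = ordered;
		ordered = r;
	}
	while (ordered) {
		r = ordered;
		ordered = r->next;	/* read before completing r */
		r->result = (*next_seq)++;
		r->done = 1;
		n++;
	}
	return n;
}
```

The next pointer is read before a request is marked done because the
waiting thread owns the record and may reuse its stack once it
observes completion.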

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h |   2 +
 net/sunrpc/svcsock.c           | 181 ++++++++++++++++++++++++++++-----
 2 files changed, 158 insertions(+), 25 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index b94096eeb890..fae760eaa9f7 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -54,6 +54,8 @@ struct svc_sock {
 
 	/* For sends (protected by xpt_mutex) */
 	struct bio_vec		*sk_bvec;
+	struct llist_head	sk_send_queue;	/* queued sends awaiting batch processing */
+	u64			sk_drain_avg_ns; /* EMA of batch drain time */
 
 	/* private TCP part */
 	/* On-the-wire fragment header: */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a89427b5fc70..e3fa81b63191 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1677,6 +1677,113 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	return len;
 }
 
+/*
+ * Pending send request for flat combining.
+ * Stack-allocated by each thread enqueuing a send on a TCP socket.
+ */
+struct svc_pending_send {
+	struct llist_node	node;
+	struct svc_rqst		*rqstp;
+	rpc_fraghdr		marker;
+	struct completion	done;
+	int			result;
+};
+
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+			   rpc_fraghdr marker);
+
+/*
+ * svc_tcp_wait_timeout - Compute adaptive timeout for flat combining wait
+ * @svsk: the socket with drain time statistics
+ *
+ * Timeout selection balances two constraints: short timeouts cause
+ * premature wakeups and unnecessary mutex acquisitions; long timeouts
+ * delay threads after batch processing completes.
+ *
+ * LAN environments with fast networks and consistent latencies work well
+ * with fixed timeouts. WAN links exhibit higher variance in send times
+ * due to congestion, packet loss, and bandwidth constraints. An adaptive
+ * timeout based on observed drain times accommodates both cases without
+ * manual tuning.
+ *
+ * The timeout targets 2x the recent average drain time, clamped to
+ * [1ms, 100ms]. The multiplier provides headroom for variance while the
+ * floor prevents excessive wakeups and the ceiling bounds worst-case
+ * latency when measurements are anomalous.
+ *
+ * Returns: timeout in jiffies
+ */
+static unsigned long svc_tcp_wait_timeout(struct svc_sock *svsk)
+{
+	u64 avg_ns = READ_ONCE(svsk->sk_drain_avg_ns);
+	unsigned long timeout;
+
+	/* Initial timeout before measurements are available */
+	if (!avg_ns)
+		return msecs_to_jiffies(10);
+
+	timeout = nsecs_to_jiffies(avg_ns * 2);
+	return clamp(timeout, msecs_to_jiffies(1), msecs_to_jiffies(100));
+}
+
+/*
+ * svc_tcp_combine_sends - Process batched send requests
+ * @svsk: the socket to send on
+ * @xprt: the transport (for dead check and close)
+ *
+ * Caller holds xpt_mutex. Drains sk_send_queue and processes each
+ * pending send. All items are completed before return, even on error.
+ */
+static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
+{
+	struct llist_node *pending, *node, *next;
+	struct svc_pending_send *ps;
+	bool transport_dead = false;
+	u64 start, elapsed, avg;
+	int sent;
+
+	/* Snapshot queued items; subsequent arrivals processed in next batch */
+	pending = llist_del_all(&svsk->sk_send_queue);
+	if (!pending)
+		return;
+
+	start = ktime_get_ns();
+
+	for (node = pending; node; node = next) {
+		next = node->next;
+		ps = container_of(node, struct svc_pending_send, node);
+
+		if (transport_dead || svc_xprt_is_dead(xprt)) {
+			transport_dead = true;
+			ps->result = -ENOTCONN;
+			complete(&ps->done);
+			continue;
+		}
+
+		sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker);
+		trace_svcsock_tcp_send(xprt, sent);
+
+		if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+			ps->result = sent;
+		} else {
+			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+				  xprt->xpt_server->sv_name,
+				  sent < 0 ? "send error" : "short send", sent,
+				  ps->rqstp->rq_res.len + sizeof(ps->marker));
+			svc_xprt_deferred_close(xprt);
+			transport_dead = true;
+			ps->result = -EAGAIN;
+		}
+
+		complete(&ps->done);
+	}
+
+	/* EMA update: new = (7 * old + measured) / 8 */
+	elapsed = ktime_get_ns() - start;
+	avg = READ_ONCE(svsk->sk_drain_avg_ns);
+	WRITE_ONCE(svsk->sk_drain_avg_ns, avg ? (7 * avg + elapsed) / 8 : elapsed);
+}
+
 /*
  * MSG_SPLICE_PAGES is used exclusively to reduce the number of
  * copy operations in this path. Therefore the caller must ensure
@@ -1716,44 +1823,68 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
  * svc_tcp_sendto - Send out a reply on a TCP socket
  * @rqstp: completed svc_rqst
  *
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Flat combining reduces mutex contention: threads enqueue send
+ * requests; a single thread processes the batch while holding xpt_mutex
+ * to ensure RPC-level serialization.
  *
  * Returns the number of bytes sent, or a negative errno.
  */
 static int svc_tcp_sendto(struct svc_rqst *rqstp)
 {
 	struct svc_xprt *xprt = rqstp->rq_xprt;
-	struct svc_sock	*svsk = container_of(xprt, struct svc_sock, sk_xprt);
+	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
 	struct xdr_buf *xdr = &rqstp->rq_res;
-	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
-					 (u32)xdr->len);
-	int sent;
+	struct svc_pending_send ps = {
+		.rqstp = rqstp,
+		.marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+	};
 
 	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
 	rqstp->rq_xprt_ctxt = NULL;
 
-	mutex_lock(&xprt->xpt_mutex);
-	if (svc_xprt_is_dead(xprt))
-		goto out_notconn;
-	sent = svc_tcp_sendmsg(svsk, rqstp, marker);
-	trace_svcsock_tcp_send(xprt, sent);
-	if (sent < 0 || sent != (xdr->len + sizeof(marker)))
-		goto out_close;
-	mutex_unlock(&xprt->xpt_mutex);
-	return sent;
+	init_completion(&ps.done);
 
-out_notconn:
+	/* Enqueue send request; lock-free via llist */
+	llist_add(&ps.node, &svsk->sk_send_queue);
+
+	/*
+	 * Flat combining: trylock attempts xpt_mutex acquisition; success
+	 * enables processing all queued requests. The mutex provides RPC-level
+	 * serialization while batching reduces lock handoff overhead.
+	 *
+	 * Trylock first for fast path. On contention, wait briefly for batch
+	 * processing to complete this request, then fall back to blocking
+	 * mutex_lock.
+	 */
+	if (mutex_trylock(&xprt->xpt_mutex))
+		goto combine;
+
+	/*
+	 * Lock held for batch processing. Wait for completion with timeout;
+	 * batch processing may complete this request during the wait.
+	 */
+	if (wait_for_completion_timeout(&ps.done, svc_tcp_wait_timeout(svsk))) {
+		/* Completed by batch processing */
+		return ps.result;
+	}
+
+	/*
+	 * Timeout expired. Acquire mutex to enable batch processing or
+	 * discover completion occurred during mutex acquisition.
+	 */
+	mutex_lock(&xprt->xpt_mutex);
+	if (completion_done(&ps.done)) {
+		mutex_unlock(&xprt->xpt_mutex);
+		return ps.result;
+	}
+
+combine:
+	/* Mutex held; process batches until queue drains. */
+	while (!llist_empty(&svsk->sk_send_queue))
+		svc_tcp_combine_sends(svsk, xprt);
 	mutex_unlock(&xprt->xpt_mutex);
-	return -ENOTCONN;
-out_close:
-	pr_notice("rpc-srv/tcp: %s: %s %d when sending %zu bytes - shutting down socket\n",
-		  xprt->xpt_server->sv_name,
-		  (sent < 0) ? "got error" : "sent",
-		  sent, xdr->len + sizeof(marker));
-	svc_xprt_deferred_close(xprt);
-	mutex_unlock(&xprt->xpt_mutex);
-	return -EAGAIN;
+
+	return ps.result;
 }
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
-- 
2.52.0
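
The adaptive wait in svc_tcp_wait_timeout() and the EMA update at the
end of svc_tcp_combine_sends() are plain arithmetic and can be
exercised in userspace. The sketch below is an illustration only: the
sketch_* names are hypothetical, and it stays in nanoseconds rather
than converting to jiffies, so it ignores the rounding that HZ
introduces in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_MSEC 1000000ull

/*
 * Exponential moving average with weight 1/8 on the new sample.
 * A zero average means "no measurement yet", so the first sample
 * seeds the average directly.
 */
static uint64_t sketch_ema_update(uint64_t avg_ns, uint64_t sample_ns)
{
	/* Guard the 7 * avg_ns multiplication against overflow. */
	assert(avg_ns <= UINT64_MAX / 8);
	return avg_ns ? (7 * avg_ns + sample_ns) / 8 : sample_ns;
}

/*
 * Wait time in nanoseconds: 10 ms before any measurement exists,
 * otherwise twice the average, clamped to [1 ms, 100 ms].
 */
static uint64_t sketch_wait_ns(uint64_t avg_ns)
{
	uint64_t t;

	if (!avg_ns)
		return 10 * SKETCH_NSEC_PER_MSEC;

	t = 2 * avg_ns;
	if (t < 1 * SKETCH_NSEC_PER_MSEC)
		t = 1 * SKETCH_NSEC_PER_MSEC;
	if (t > 100 * SKETCH_NSEC_PER_MSEC)
		t = 100 * SKETCH_NSEC_PER_MSEC;
	return t;
}
```

With an 800 ns average and a 1600 ns sample, the new average is
(7 * 800 + 1600) / 8 = 900 ns, which is below the floor and so still
produces a 1 ms wait.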


* [PATCH v2 7/8] sunrpc: unify fore and backchannel server TCP send paths
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (5 preceding siblings ...)
  2026-02-10 16:20 ` [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  2026-02-10 16:20 ` [PATCH v2 8/8] sunrpc: Set explicit TCP socket buffer sizes for NFSD Chuck Lever
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

High latency in callback processing for NFSv4.1+ sessions can occur
when the fore channel sustains heavy request traffic. The backchannel
implementation acquires xpt_mutex directly while the fore channel
uses flat combining to batch socket operations. A thread in the
combining loop processes queued sends continuously until the llist
empties, which under load means backchannel threads block on xpt_mutex
for extended periods waiting for a turn at the socket. Delegation
recalls and other callback operations carry time constraints that
make this starvation problematic.

Routing backchannel sends through the same flat combining
infrastructure eliminates this starvation; a shared llist queue
replaces direct mutex acquisition and separate code paths. The
struct svc_pending_send now holds an xdr_buf pointer instead of
svc_rqst, decoupling the queueing mechanism from RPC request
structures. A new svc_tcp_send() function accepts an xprt, xdr_buf,
and marker, then either enters the combining loop or enqueues for
processing by an active combiner.
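
Both callers compute the same record marker before handing the
xdr_buf to svc_tcp_send(). As a userspace sketch of that marker
(sketch_* names are hypothetical; the bit layout follows ONC RPC
record marking, where the top bit flags the last fragment and the low
31 bits carry the fragment length):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htonl(), ntohl() */

#define SKETCH_LAST_FRAGMENT	0x80000000u	/* last fragment of a record */
#define SKETCH_LEN_MASK		0x7fffffffu	/* fragment length, 31 bits */

/* Build the on-the-wire marker for a single-fragment record. */
static uint32_t sketch_marker_encode(uint32_t len)
{
	assert(len <= SKETCH_LEN_MASK);
	return htonl(SKETCH_LAST_FRAGMENT | len);
}

static uint32_t sketch_marker_len(uint32_t wire)
{
	return ntohl(wire) & SKETCH_LEN_MASK;
}

static int sketch_marker_is_last(uint32_t wire)
{
	return (ntohl(wire) & SKETCH_LAST_FRAGMENT) != 0;
}

/* Byte count a complete send must report: payload plus the marker. */
static size_t sketch_expected_bytes(uint32_t len)
{
	return (size_t)len + sizeof(uint32_t);
}
```

The last helper corresponds to the short-send check in the combiner,
which compares the byte count returned by the socket against the
payload length plus the marker size.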

The fore channel path through svc_tcp_sendto() now calls
svc_tcp_send() after preparing its xdr_buf. The backchannel
bc_send_request() similarly calls svc_tcp_send() in place of its
former mutex acquisition and direct bc_sendto() invocation. Both
channels queue into the same llist, so backchannel operations
receive fair treatment in the send ordering. When a backchannel
send queues behind fore channel traffic, the combining loop
processes both in one batch under a single xpt_mutex acquisition,
with TCP auto-corking coalescing segments where applicable.

Maintenance burden decreases with a single code path for TCP sends.
The backchannel gains batching benefits when concurrent with fore
channel load, and starvation no longer occurs because queueing
provides deterministic ordering independent of mutex contention
timing.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h |  2 +
 net/sunrpc/svcsock.c           | 76 +++++++++++++++++++++++++---------
 net/sunrpc/xprtsock.c          | 60 ++++++---------------------
 3 files changed, 71 insertions(+), 67 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index fae760eaa9f7..07619bf2131c 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -101,6 +101,8 @@ static inline u32 svc_sock_final_rec(struct svc_sock *svsk)
  */
 void		svc_recv(struct svc_rqst *rqstp);
 void		svc_send(struct svc_rqst *rqstp);
+int		svc_tcp_send(struct svc_xprt *xprt, struct xdr_buf *xdr,
+			     rpc_fraghdr marker);
 int		svc_addsock(struct svc_serv *serv, struct net *net,
 			    const int fd, char *name_return, const size_t len,
 			    const struct cred *cred);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index e3fa81b63191..8b2e9f524506 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1683,13 +1683,13 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
  */
 struct svc_pending_send {
 	struct llist_node	node;
-	struct svc_rqst		*rqstp;
+	struct xdr_buf		*xdr;
 	rpc_fraghdr		marker;
 	struct completion	done;
 	int			result;
 };
 
-static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
 			   rpc_fraghdr marker);
 
 /*
@@ -1750,6 +1750,8 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
 	start = ktime_get_ns();
 
 	for (node = pending; node; node = next) {
+		size_t expected;
+
 		next = node->next;
 		ps = container_of(node, struct svc_pending_send, node);
 
@@ -1760,16 +1762,29 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
 			continue;
 		}
 
-		sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker);
+		sent = svc_tcp_sendmsg(svsk, ps->xdr, ps->marker);
 		trace_svcsock_tcp_send(xprt, sent);
 
-		if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+		expected = ps->xdr->len + sizeof(ps->marker);
+		if (sent == expected) {
 			ps->result = sent;
 		} else {
 			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
 				  xprt->xpt_server->sv_name,
-				  sent < 0 ? "send error" : "short send", sent,
-				  ps->rqstp->rq_res.len + sizeof(ps->marker));
+				  sent < 0 ? "send error" : "short send",
+				  sent, expected);
+			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+				  xprt->xpt_server->sv_name,
+				  sent < 0 ? "send error" : "short send",
+				  sent, expected);
+			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+				  xprt->xpt_server->sv_name,
+				  sent < 0 ? "send error" : "short send",
+				  sent, expected);
+			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+				  xprt->xpt_server->sv_name,
+				  sent < 0 ? "send error" : "short send",
+				  sent, expected);
 			svc_xprt_deferred_close(xprt);
 			transport_dead = true;
 			ps->result = -EAGAIN;
@@ -1789,7 +1804,7 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
  * copy operations in this path. Therefore the caller must ensure
  * that the pages backing @xdr are unchanging.
  */
-static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
 			   rpc_fraghdr marker)
 {
 	struct msghdr msg = {
@@ -1809,39 +1824,40 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
 	memcpy(buf, &marker, sizeof(marker));
 	bvec_set_virt(svsk->sk_bvec, buf, sizeof(marker));
 
-	count = xdr_buf_to_bvec(svsk->sk_bvec + 1, rqstp->rq_maxpages,
-				&rqstp->rq_res);
+	count = xdr_buf_to_bvec(svsk->sk_bvec + 1, svsk->sk_maxpages, xdr);
 
 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, svsk->sk_bvec,
-		      1 + count, sizeof(marker) + rqstp->rq_res.len);
+		      1 + count, sizeof(marker) + xdr->len);
 	ret = sock_sendmsg(svsk->sk_sock, &msg);
 	page_frag_free(buf);
 	return ret;
 }
 
 /**
- * svc_tcp_sendto - Send out a reply on a TCP socket
- * @rqstp: completed svc_rqst
+ * svc_tcp_send - Send an XDR buffer on a TCP socket using flat combining
+ * @xprt: the transport to send on
+ * @xdr: the XDR buffer to send
+ * @marker: RPC record marker
  *
  * Flat combining reduces mutex contention: threads enqueue send
  * requests; a single thread processes the batch while holding xpt_mutex
  * to ensure RPC-level serialization.
  *
+ * Can be used for both fore channel (NFS replies) and backchannel
+ * (NFSv4 callbacks) sends since both share the same TCP connection
+ * and xpt_mutex.
+ *
  * Returns the number of bytes sent, or a negative errno.
  */
-static int svc_tcp_sendto(struct svc_rqst *rqstp)
+int svc_tcp_send(struct svc_xprt *xprt, struct xdr_buf *xdr,
+		 rpc_fraghdr marker)
 {
-	struct svc_xprt *xprt = rqstp->rq_xprt;
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
-	struct xdr_buf *xdr = &rqstp->rq_res;
 	struct svc_pending_send ps = {
-		.rqstp = rqstp,
-		.marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+		.xdr = xdr,
+		.marker = marker,
 	};
 
-	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
-	rqstp->rq_xprt_ctxt = NULL;
-
 	init_completion(&ps.done);
 
 	/* Enqueue send request; lock-free via llist */
@@ -1886,6 +1902,26 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
 
 	return ps.result;
 }
+EXPORT_SYMBOL_GPL(svc_tcp_send);
+
+/**
+ * svc_tcp_sendto - Send out a reply on a TCP socket
+ * @rqstp: completed svc_rqst
+ *
+ * Returns the number of bytes sent, or a negative errno.
+ */
+static int svc_tcp_sendto(struct svc_rqst *rqstp)
+{
+	struct svc_xprt *xprt = rqstp->rq_xprt;
+	struct xdr_buf *xdr = &rqstp->rq_res;
+	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
+					 (u32)xdr->len);
+
+	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
+	rqstp->rq_xprt_ctxt = NULL;
+
+	return svc_tcp_send(xprt, xdr, marker);
+}
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
 				       struct net *net,
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 2e1fe6013361..4e1d82186b00 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2979,36 +2979,13 @@ static void bc_free(struct rpc_task *task)
 	free_page((unsigned long)buf);
 }
 
-static int bc_sendto(struct rpc_rqst *req)
-{
-	struct xdr_buf *xdr = &req->rq_snd_buf;
-	struct sock_xprt *transport =
-			container_of(req->rq_xprt, struct sock_xprt, xprt);
-	struct msghdr msg = {
-		.msg_flags	= 0,
-	};
-	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
-					 (u32)xdr->len);
-	unsigned int sent = 0;
-	int err;
-
-	req->rq_xtime = ktime_get();
-	err = xdr_alloc_bvec(xdr, rpc_task_gfp_mask());
-	if (err < 0)
-		return err;
-	err = xprt_sock_sendmsg(transport->sock, &msg, xdr, 0, marker, &sent);
-	xdr_free_bvec(xdr);
-	if (err < 0 || sent != (xdr->len + sizeof(marker)))
-		return -EAGAIN;
-	return sent;
-}
-
 /**
  * bc_send_request - Send a backchannel Call on a TCP socket
  * @req: rpc_rqst containing Call message to be sent
  *
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Uses flat combining via svc_tcp_send() to participate in batched
+ * sending with fore channel traffic, ensuring fair ordering and
+ * reduced lock contention.
  *
  * Return values:
  *   %0 if the message was sent successfully
@@ -3016,29 +2993,18 @@ static int bc_sendto(struct rpc_rqst *req)
  */
 static int bc_send_request(struct rpc_rqst *req)
 {
-	struct svc_xprt	*xprt;
-	int len;
+	struct xdr_buf *xdr = &req->rq_snd_buf;
+	struct svc_xprt *xprt = req->rq_xprt->bc_xprt;
+	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
+					 (u32)xdr->len);
+	int ret;
 
-	/*
-	 * Get the server socket associated with this callback xprt
-	 */
-	xprt = req->rq_xprt->bc_xprt;
+	req->rq_xtime = ktime_get();
+	ret = svc_tcp_send(xprt, xdr, marker);
 
-	/*
-	 * Grab the mutex to serialize data as the connection is shared
-	 * with the fore channel
-	 */
-	mutex_lock(&xprt->xpt_mutex);
-	if (test_bit(XPT_DEAD, &xprt->xpt_flags))
-		len = -ENOTCONN;
-	else
-		len = bc_sendto(req);
-	mutex_unlock(&xprt->xpt_mutex);
-
-	if (len > 0)
-		len = 0;
-
-	return len;
+	if (ret < 0)
+		return ret;
+	return 0;
 }
 
 static void bc_close(struct rpc_xprt *xprt)
-- 
2.52.0
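
For readers unfamiliar with the marker that both send paths above
hand to svc_tcp_send(): it is the 4-byte RPC record marking header
(RFC 5531), in which the top bit flags the last fragment and the low
31 bits carry the fragment length, in network byte order. A minimal
userspace sketch follows; the MODEL_* constants and the helper names
are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical stand-ins for the kernel's record marking constants */
#define MODEL_LAST_FRAGMENT	0x80000000u
#define MODEL_FRAGMENT_MASK	0x7fffffffu

/* Build a wire-format marker for a single-fragment record */
static uint32_t model_marker_encode(uint32_t len)
{
	/* The length must fit in the low 31 bits */
	assert(len <= MODEL_FRAGMENT_MASK);
	return htonl(MODEL_LAST_FRAGMENT | len);
}

/* Extract the fragment length from a wire-format marker */
static uint32_t model_marker_len(uint32_t wire)
{
	return ntohl(wire) & MODEL_FRAGMENT_MASK;
}

/* Report whether the wire-format marker flags the last fragment */
static int model_marker_is_last(uint32_t wire)
{
	return (ntohl(wire) & MODEL_LAST_FRAGMENT) != 0;
}
```

Because every message here is sent as one fragment, the receiver can
read the marker, then read exactly that many bytes to obtain a
complete RPC message.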



* [PATCH v2 8/8] sunrpc: Set explicit TCP socket buffer sizes for NFSD
  2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
                   ` (6 preceding siblings ...)
  2026-02-10 16:20 ` [PATCH v2 7/8] sunrpc: unify fore and backchannel server TCP send paths Chuck Lever
@ 2026-02-10 16:20 ` Chuck Lever
  7 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-02-10 16:20 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

NFSD TCP sockets currently rely on system defaults with TCP
auto-tuning. On networks with large bandwidth-delay products, the
default maximum buffer sizes (6MB receive, 4MB send) can throttle
throughput. Administrators must resort to system-wide sysctl
adjustments (tcp_rmem/tcp_wmem), which affect all TCP connections
rather than just NFS traffic.

This change sets explicit buffer sizes for NFSD TCP data sockets.
The buffer size is set to 4 * sv_max_mesg, yielding approximately
16MB with default NFS payload sizes. On memory-constrained systems,
the buffer size is capped at 1/1024 of physical RAM, with a hard
ceiling of 16MB. Setting SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK in
sk_userlocks disables auto-tuning for these sockets, which makes
their memory consumption predictable.
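
The sizing arithmetic described above can be modeled in userspace.
This is only a sketch: model_tcp_bufsize() and MODEL_TCP_BUF_MAX are
hypothetical names, and the three parameters stand in for
serv->sv_max_mesg, totalram_pages(), and PAGE_SHIFT respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for SVC_TCP_SNDBUF_MAX / SVC_TCP_RCVBUF_MAX (both 16MB) */
#define MODEL_TCP_BUF_MAX	(16ULL * 1024 * 1024)

/*
 * Hypothetical userspace model of the buffer size calculation.
 * @max_mesg:   stand-in for serv->sv_max_mesg, in bytes
 * @ram_pages:  stand-in for totalram_pages()
 * @page_shift: stand-in for PAGE_SHIFT
 */
static uint64_t model_tcp_bufsize(uint64_t max_mesg, uint64_t ram_pages,
				  unsigned int page_shift)
{
	uint64_t ideal, mem_cap, hi;

	assert(page_shift < 32);

	/* Room for four maximum-sized RPC messages */
	ideal = max_mesg * 4;

	/* 1/1024 of physical RAM, rounded down to whole pages */
	mem_cap = (ram_pages >> 10) << page_shift;

	/* Upper bound: the smaller of the RAM cap and the hard ceiling */
	hi = mem_cap < MODEL_TCP_BUF_MAX ? mem_cap : MODEL_TCP_BUF_MAX;

	/*
	 * The lower bound (max_mesg) never binds because ideal is
	 * 4 * max_mesg, so the clamp reduces to min(ideal, hi).
	 */
	return ideal > hi ? hi : ideal;
}
```

Assuming an sv_max_mesg of 4MB plus one 4KB page (4198400 bytes, an
illustrative value), a machine with 64GB of RAM lands on the 16MB
ceiling, while a machine with 8GB of RAM is held to 8MB by the
memory-based cap.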

The existing svc_sock_setbufsize() is renamed to
svc_udp_setbufsize() to reflect its UDP-specific purpose, and a
new svc_tcp_setbufsize() handles TCP data connections. Listener
sockets remain unaffected, as listeners do not transfer data.

This approach improves throughput on high-speed networks without
requiring system-wide configuration changes, while automatically
scaling down buffer sizes on small systems.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svcsock.c | 52 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 8b2e9f524506..91a472021d0a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -50,6 +50,7 @@
 #include <net/handshake.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
+#include <linux/mm.h>
 #include <asm/ioctls.h>
 #include <linux/key.h>
 
@@ -377,9 +378,12 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
 }
 
 /*
- * Set socket snd and rcv buffer lengths
+ * Set socket snd and rcv buffer lengths for UDP sockets.
+ *
+ * UDP sockets need large buffers because pending requests remain
+ * in the receive buffer until processed by a worker thread.
  */
-static void svc_sock_setbufsize(struct svc_sock *svsk, unsigned int nreqs)
+static void svc_udp_setbufsize(struct svc_sock *svsk, unsigned int nreqs)
 {
 	unsigned int max_mesg = svsk->sk_xprt.xpt_server->sv_max_mesg;
 	struct socket *sock = svsk->sk_sock;
@@ -393,6 +397,45 @@ static void svc_sock_setbufsize(struct svc_sock *svsk, unsigned int nreqs)
 	release_sock(sock->sk);
 }
 
+/* Accommodate high bandwidth-delay product connections */
+#define SVC_TCP_SNDBUF_MAX	(16 * 1024 * 1024)
+#define SVC_TCP_RCVBUF_MAX	(16 * 1024 * 1024)
+
+/*
+ * Set socket snd and rcv buffer lengths for TCP data sockets.
+ *
+ * Buffers are sized to accommodate high-bandwidth data transfers on
+ * high-latency networks (large bandwidth-delay product). Automatic
+ * buffer tuning is disabled to allow control of server memory
+ * consumption.
+ */
+static void svc_tcp_setbufsize(struct svc_sock *svsk)
+{
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+	struct socket *sock = svsk->sk_sock;
+	unsigned long mem_cap, ideal;
+	unsigned int sndbuf, rcvbuf;
+
+	/* Buffer multiple in-flight RPC messages */
+	ideal = serv->sv_max_mesg * 4;
+
+	/* Memory-based cap: 1/1024 of physical RAM */
+	mem_cap = (totalram_pages() >> 10) << PAGE_SHIFT;
+
+	sndbuf = clamp_t(unsigned long, ideal,
+			 serv->sv_max_mesg, min(mem_cap, SVC_TCP_SNDBUF_MAX));
+	rcvbuf = clamp_t(unsigned long, ideal,
+			 serv->sv_max_mesg, min(mem_cap, SVC_TCP_RCVBUF_MAX));
+
+	lock_sock(sock->sk);
+	sock->sk->sk_sndbuf = sndbuf;
+	sock->sk->sk_rcvbuf = rcvbuf;
+	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
+	sock->sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+	sock->sk->sk_write_space(sock->sk);
+	release_sock(sock->sk);
+}
+
 static void svc_sock_secure_port(struct svc_rqst *rqstp)
 {
 	if (svc_port_is_privileged(svc_addr(rqstp)))
@@ -668,7 +711,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 	     * provides an upper bound on the number of threads
 	     * which will access the socket.
 	     */
-	    svc_sock_setbufsize(svsk, serv->sv_nrthreads + 3);
+	    svc_udp_setbufsize(svsk, serv->sv_nrthreads + 3);
 
 	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
@@ -884,7 +927,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 	 * receive and respond to one request.
 	 * svc_udp_recvfrom will re-adjust if necessary
 	 */
-	svc_sock_setbufsize(svsk, 3);
+	svc_udp_setbufsize(svsk, 3);
 
 	/* data might have come in before data_ready set up */
 	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
@@ -1993,6 +2036,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		       svsk->sk_maxpages * sizeof(struct page *));
 
 		tcp_sock_set_nodelay(sk);
+		svc_tcp_setbufsize(svsk);
 
 		switch (sk->sk_state) {
 		case TCP_SYN_RECV:
-- 
2.52.0



end of thread, other threads:[~2026-02-10 16:20 UTC | newest]

Thread overview: 9+ messages
2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
2026-02-10 16:20 ` [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST Chuck Lever
2026-02-10 16:20 ` [PATCH v2 2/8] net: datagram: bypass usercopy checks for kernel iterators Chuck Lever
2026-02-10 16:20 ` [PATCH v2 3/8] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
2026-02-10 16:20 ` [PATCH v2 4/8] sunrpc: add per-transport page recycling pool Chuck Lever
2026-02-10 16:20 ` [PATCH v2 5/8] sunrpc: add dedicated TCP receiver thread Chuck Lever
2026-02-10 16:20 ` [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends Chuck Lever
2026-02-10 16:20 ` [PATCH v2 7/8] sunrpc: unify fore and backchannel server TCP send paths Chuck Lever
2026-02-10 16:20 ` [PATCH v2 8/8] sunrpc: Set explicit TCP socket buffer sizes for NFSD Chuck Lever
