From: Chuck Lever <cel@kernel.org>
To: NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Cc: <linux-nfs@vger.kernel.org>, Chuck Lever <chuck.lever@oracle.com>
Subject: [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends
Date: Tue, 10 Feb 2026 11:20:23 -0500
Message-ID: <20260210162025.2356389-7-cel@kernel.org>
In-Reply-To: <20260210162025.2356389-1-cel@kernel.org>
From: Chuck Lever <chuck.lever@oracle.com>
High contention on xpt_mutex during TCP reply transmission limits
nfsd scalability. When multiple threads send replies on the same
connection, mutex handoff triggers optimistic spinning that consumes
substantial CPU time. Profiling on high-throughput workloads shows
approximately 15% of cycles spent in the spin-wait path.
The flat combining pattern addresses this by allowing one thread to
perform send operations on behalf of multiple waiting threads. Rather
than each thread acquiring xpt_mutex independently, threads enqueue
their send requests to a lock-free llist. The first thread to
acquire the mutex becomes the combiner and processes all pending
sends in a batch. Other threads wait on a per-request completion
structure instead of spinning on the lock.
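In user-space terms, the pattern can be sketched with C11 atomics and a
pthread mutex standing in for the kernel's llist and xpt_mutex. All names
here (fc_queue, fc_request, fc_submit) are illustrative only, and the timed
completion wait used by the actual patch is simplified to a plain blocking
lock:

```c
/*
 * Minimal flat-combining sketch (illustrative, not kernel code).
 * A lock-free Treiber stack plays the role of llist; whichever
 * thread holds the mutex drains the stack and completes every
 * queued request, its own included.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct fc_request {
	struct fc_request *next;	/* lock-free stack link (llist analogue) */
	int payload;
	int result;
	int done;			/* set by the combiner, read under q->lock */
};

struct fc_queue {
	_Atomic(struct fc_request *) head;	/* llist_head analogue */
	pthread_mutex_t lock;			/* xpt_mutex analogue */
};

/* Drain a snapshot of the queue; caller holds q->lock. */
static void fc_combine(struct fc_queue *q)
{
	struct fc_request *pending = atomic_exchange(&q->head, NULL);

	while (pending) {
		struct fc_request *next = pending->next;

		pending->result = pending->payload * 2;	/* stand-in for the send */
		pending->done = 1;
		pending = next;
	}
}

int fc_submit(struct fc_queue *q, struct fc_request *req)
{
	struct fc_request *old = atomic_load(&q->head);

	req->done = 0;
	do {			/* lock-free enqueue, llist_add analogue */
		req->next = old;
	} while (!atomic_compare_exchange_weak(&q->head, &old, req));

	/*
	 * Try to become the combiner. The real patch first waits on a
	 * completion with an adaptive timeout; this sketch falls
	 * straight through to the blocking-lock path.
	 */
	if (pthread_mutex_trylock(&q->lock) != 0)
		pthread_mutex_lock(&q->lock);

	/*
	 * Either an earlier combiner already completed us, or our
	 * request is still queued and we drain the queue ourselves.
	 */
	while (!req->done || atomic_load(&q->head))
		fc_combine(q);

	pthread_mutex_unlock(&q->lock);
	return req->result;
}
```

Note the key property: correctness still comes from the mutex, exactly as
with xpt_mutex today; flat combining only changes who does the work while
the lock is held.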
The combiner continues processing the queue until empty before
releasing xpt_mutex, amortizing acquisition cost across multiple
sends. TCP auto-corking naturally coalesces segments once the
first send places data in flight, avoiding the need for explicit
corking hints that would delay initial replies.
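The adaptive wait timeout the patch introduces (2x an exponential moving
average of batch drain time, clamped to [1ms, 100ms], with a fixed 10ms
default before any measurement exists) can be illustrated as stand-alone
arithmetic; the constant names below are invented for the sketch:

```c
/* Sketch of the adaptive timeout policy; values in nanoseconds. */
#include <stdint.h>

#define FC_TIMEOUT_MIN_NS	1000000ULL	/* 1 ms floor */
#define FC_TIMEOUT_MAX_NS	100000000ULL	/* 100 ms ceiling */
#define FC_TIMEOUT_INIT_NS	10000000ULL	/* 10 ms before measurements */

/* EMA update matching the patch: new = (7 * old + sample) / 8 */
static uint64_t fc_ema_update(uint64_t avg_ns, uint64_t sample_ns)
{
	return avg_ns ? (7 * avg_ns + sample_ns) / 8 : sample_ns;
}

static uint64_t fc_wait_timeout_ns(uint64_t avg_ns)
{
	uint64_t t;

	if (!avg_ns)
		return FC_TIMEOUT_INIT_NS;

	t = avg_ns * 2;		/* headroom for variance */
	if (t < FC_TIMEOUT_MIN_NS)
		t = FC_TIMEOUT_MIN_NS;
	if (t > FC_TIMEOUT_MAX_NS)
		t = FC_TIMEOUT_MAX_NS;
	return t;
}
```

A 3ms average drain yields a 6ms timeout; a 200us average is clamped up to
the 1ms floor, and a 90ms average is clamped down to the 100ms ceiling.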
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svcsock.h | 2 +
net/sunrpc/svcsock.c | 181 ++++++++++++++++++++++++++++-----
2 files changed, 158 insertions(+), 25 deletions(-)
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index b94096eeb890..fae760eaa9f7 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -54,6 +54,8 @@ struct svc_sock {
/* For sends (protected by xpt_mutex) */
struct bio_vec *sk_bvec;
+ struct llist_head sk_send_queue; /* queued sends awaiting batch processing */
+ u64 sk_drain_avg_ns; /* EMA of batch drain time */
/* private TCP part */
/* On-the-wire fragment header: */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a89427b5fc70..e3fa81b63191 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1677,6 +1677,113 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
return len;
}
+/*
+ * Pending send request for flat combining.
+ * Stack-allocated by each thread enqueuing a send on a TCP socket.
+ */
+struct svc_pending_send {
+ struct llist_node node;
+ struct svc_rqst *rqstp;
+ rpc_fraghdr marker;
+ struct completion done;
+ int result;
+};
+
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+ rpc_fraghdr marker);
+
+/*
+ * svc_tcp_wait_timeout - Compute adaptive timeout for flat combining wait
+ * @svsk: the socket with drain time statistics
+ *
+ * Timeout selection balances two constraints: short timeouts cause
+ * premature wakeups and unnecessary mutex acquisitions; long timeouts
+ * delay threads after batch processing completes.
+ *
+ * LAN environments with fast networks and consistent latencies work well
+ * with fixed timeouts. WAN links exhibit higher variance in send times
+ * due to congestion, packet loss, and bandwidth constraints. An adaptive
+ * timeout based on observed drain times accommodates both cases without
+ * manual tuning.
+ *
+ * The timeout targets 2x the recent average drain time, clamped to
+ * [1ms, 100ms]. The multiplier provides headroom for variance while the
+ * floor prevents excessive wakeups and the ceiling bounds worst-case
+ * latency when measurements are anomalous.
+ *
+ * Returns: timeout in jiffies
+ */
+static unsigned long svc_tcp_wait_timeout(struct svc_sock *svsk)
+{
+ u64 avg_ns = READ_ONCE(svsk->sk_drain_avg_ns);
+ unsigned long timeout;
+
+ /* Initial timeout before measurements are available */
+ if (!avg_ns)
+ return msecs_to_jiffies(10);
+
+ timeout = nsecs_to_jiffies(avg_ns * 2);
+ return clamp(timeout, msecs_to_jiffies(1), msecs_to_jiffies(100));
+}
+
+/*
+ * svc_tcp_combine_sends - Process batched send requests
+ * @svsk: the socket to send on
+ * @xprt: the transport (for dead check and close)
+ *
+ * Caller holds xpt_mutex. Drains sk_send_queue and processes each
+ * pending send. All items are completed before return, even on error.
+ */
+static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
+{
+ struct llist_node *pending, *node, *next;
+ struct svc_pending_send *ps;
+ bool transport_dead = false;
+ u64 start, elapsed, avg;
+ int sent;
+
+ /* Snapshot queued items; subsequent arrivals processed in next batch */
+ pending = llist_del_all(&svsk->sk_send_queue);
+ if (!pending)
+ return;
+
+ start = ktime_get_ns();
+
+ for (node = pending; node; node = next) {
+ next = node->next;
+ ps = container_of(node, struct svc_pending_send, node);
+
+ if (transport_dead || svc_xprt_is_dead(xprt)) {
+ transport_dead = true;
+ ps->result = -ENOTCONN;
+ complete(&ps->done);
+ continue;
+ }
+
+ sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker);
+ trace_svcsock_tcp_send(xprt, sent);
+
+ if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+ ps->result = sent;
+ } else {
+ pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+ xprt->xpt_server->sv_name,
+ sent < 0 ? "send error" : "short send", sent,
+ ps->rqstp->rq_res.len + sizeof(ps->marker));
+ svc_xprt_deferred_close(xprt);
+ transport_dead = true;
+ ps->result = -EAGAIN;
+ }
+
+ complete(&ps->done);
+ }
+
+ /* EMA update: new = (7 * old + measured) / 8 */
+ elapsed = ktime_get_ns() - start;
+ avg = READ_ONCE(svsk->sk_drain_avg_ns);
+ WRITE_ONCE(svsk->sk_drain_avg_ns, avg ? (7 * avg + elapsed) / 8 : elapsed);
+}
+
/*
* MSG_SPLICE_PAGES is used exclusively to reduce the number of
* copy operations in this path. Therefore the caller must ensure
@@ -1716,44 +1823,68 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
* svc_tcp_sendto - Send out a reply on a TCP socket
* @rqstp: completed svc_rqst
*
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Flat combining reduces mutex contention: threads enqueue send
+ * requests; a single thread processes the batch while holding xpt_mutex
+ * to ensure RPC-level serialization.
*
* Returns the number of bytes sent, or a negative errno.
*/
static int svc_tcp_sendto(struct svc_rqst *rqstp)
{
struct svc_xprt *xprt = rqstp->rq_xprt;
- struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
+ struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
struct xdr_buf *xdr = &rqstp->rq_res;
- rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
- (u32)xdr->len);
- int sent;
+ struct svc_pending_send ps = {
+ .rqstp = rqstp,
+ .marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+ };
svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
rqstp->rq_xprt_ctxt = NULL;
- mutex_lock(&xprt->xpt_mutex);
- if (svc_xprt_is_dead(xprt))
- goto out_notconn;
- sent = svc_tcp_sendmsg(svsk, rqstp, marker);
- trace_svcsock_tcp_send(xprt, sent);
- if (sent < 0 || sent != (xdr->len + sizeof(marker)))
- goto out_close;
- mutex_unlock(&xprt->xpt_mutex);
- return sent;
+ init_completion(&ps.done);
-out_notconn:
+ /* Enqueue send request; lock-free via llist */
+ llist_add(&ps.node, &svsk->sk_send_queue);
+
+	/*
+	 * Flat combining: try to become the combiner. Whichever thread
+	 * holds xpt_mutex processes every queued request, so the mutex
+	 * still provides RPC-record serialization while batching
+	 * amortizes lock handoff overhead across all pending sends.
+	 *
+	 * Trylock first for the fast path. On contention, wait briefly
+	 * for the current combiner to complete this request, then fall
+	 * back to a blocking mutex_lock.
+	 */
+ if (mutex_trylock(&xprt->xpt_mutex))
+ goto combine;
+
+	/*
+	 * Another thread holds xpt_mutex and is combining. It may
+	 * complete this request during the wait, in which case no
+	 * mutex acquisition is needed here at all.
+	 */
+ if (wait_for_completion_timeout(&ps.done, svc_tcp_wait_timeout(svsk))) {
+ /* Completed by batch processing */
+ return ps.result;
+ }
+
+	/*
+	 * Timeout expired. Become the combiner ourselves, unless the
+	 * previous combiner completed this request while we were
+	 * waiting for the mutex.
+	 */
+ mutex_lock(&xprt->xpt_mutex);
+ if (completion_done(&ps.done)) {
+ mutex_unlock(&xprt->xpt_mutex);
+ return ps.result;
+ }
+
+combine:
+ /* Mutex held; process batches until queue drains. */
+ while (!llist_empty(&svsk->sk_send_queue))
+ svc_tcp_combine_sends(svsk, xprt);
mutex_unlock(&xprt->xpt_mutex);
- return -ENOTCONN;
-out_close:
- pr_notice("rpc-srv/tcp: %s: %s %d when sending %zu bytes - shutting down socket\n",
- xprt->xpt_server->sv_name,
- (sent < 0) ? "got error" : "sent",
- sent, xdr->len + sizeof(marker));
- svc_xprt_deferred_close(xprt);
- mutex_unlock(&xprt->xpt_mutex);
- return -EAGAIN;
+
+ return ps.result;
}
static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
--
2.52.0
Thread overview: 9+ messages
2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
2026-02-10 16:20 ` [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST Chuck Lever
2026-02-10 16:20 ` [PATCH v2 2/8] net: datagram: bypass usercopy checks for kernel iterators Chuck Lever
2026-02-10 16:20 ` [PATCH v2 3/8] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
2026-02-10 16:20 ` [PATCH v2 4/8] sunrpc: add per-transport page recycling pool Chuck Lever
2026-02-10 16:20 ` [PATCH v2 5/8] sunrpc: add dedicated TCP receiver thread Chuck Lever
2026-02-10 16:20 ` Chuck Lever [this message]
2026-02-10 16:20 ` [PATCH v2 7/8] sunrpc: unify fore and backchannel server TCP send paths Chuck Lever
2026-02-10 16:20 ` [PATCH v2 8/8] sunrpc: Set explicit TCP socket buffer sizes for NFSD Chuck Lever