From: Chuck Lever <cel@kernel.org>
To: NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Cc: <linux-nfs@vger.kernel.org>, Chuck Lever <chuck.lever@oracle.com>
Subject: [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends
Date: Tue, 10 Feb 2026 11:20:23 -0500
Message-ID: <20260210162025.2356389-7-cel@kernel.org>
In-Reply-To: <20260210162025.2356389-1-cel@kernel.org>
From: Chuck Lever <chuck.lever@oracle.com>
High contention on xpt_mutex during TCP reply transmission limits
nfsd scalability. When multiple threads send replies on the same
connection, mutex handoff triggers optimistic spinning that consumes
substantial CPU time. Profiling on high-throughput workloads shows
approximately 15% of cycles spent in the spin-wait path.
The flat combining pattern addresses this by allowing one thread to
perform send operations on behalf of multiple waiting threads. Rather
than each thread acquiring xpt_mutex independently, threads enqueue
their send requests to a lock-free llist. The first thread to
acquire the mutex becomes the combiner and processes all pending
sends in a batch. Other threads wait on a per-request completion
structure instead of spinning on the lock.
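In user-space terms, the pattern can be sketched with C11 atomics and a
pthread mutex standing in for the kernel's llist and xpt_mutex. All names
here (fc_queue, fc_request, fc_submit) are illustrative only, and the timed
completion wait used by the actual patch is simplified to a plain blocking
lock:

```c
/*
 * Minimal flat-combining sketch (illustrative, not kernel code).
 * A lock-free Treiber stack plays the role of llist; whichever
 * thread holds the mutex drains the stack and completes every
 * queued request, its own included.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct fc_request {
	struct fc_request *next;	/* lock-free stack link (llist analogue) */
	int payload;
	int result;
	int done;			/* set by the combiner, read under q->lock */
};

struct fc_queue {
	_Atomic(struct fc_request *) head;	/* llist_head analogue */
	pthread_mutex_t lock;			/* xpt_mutex analogue */
};

/* Drain a snapshot of the queue; caller holds q->lock. */
static void fc_combine(struct fc_queue *q)
{
	struct fc_request *pending = atomic_exchange(&q->head, NULL);

	while (pending) {
		struct fc_request *next = pending->next;

		pending->result = pending->payload * 2;	/* stand-in for the send */
		pending->done = 1;
		pending = next;
	}
}

int fc_submit(struct fc_queue *q, struct fc_request *req)
{
	struct fc_request *old = atomic_load(&q->head);

	req->done = 0;
	do {			/* lock-free enqueue, llist_add analogue */
		req->next = old;
	} while (!atomic_compare_exchange_weak(&q->head, &old, req));

	/*
	 * Try to become the combiner. The real patch first waits on a
	 * completion with an adaptive timeout; this sketch falls
	 * straight through to the blocking-lock path.
	 */
	if (pthread_mutex_trylock(&q->lock) != 0)
		pthread_mutex_lock(&q->lock);

	/*
	 * Either an earlier combiner already completed us, or our
	 * request is still queued and we drain the queue ourselves.
	 */
	while (!req->done || atomic_load(&q->head))
		fc_combine(q);

	pthread_mutex_unlock(&q->lock);
	return req->result;
}
```

Note the key property: correctness still comes from the mutex, exactly as
with xpt_mutex today; flat combining only changes who does the work while
the lock is held.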
The combiner continues processing the queue until empty before
releasing xpt_mutex, amortizing acquisition cost across multiple
sends. TCP auto-corking naturally coalesces segments once the
first send places data in flight, avoiding the need for explicit
corking hints that would delay initial replies.
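The adaptive wait timeout the patch introduces (2x an exponential moving
average of batch drain time, clamped to [1ms, 100ms], with a fixed 10ms
default before any measurement exists) can be illustrated as stand-alone
arithmetic; the constant names below are invented for the sketch:

```c
/* Sketch of the adaptive timeout policy; values in nanoseconds. */
#include <stdint.h>

#define FC_TIMEOUT_MIN_NS	1000000ULL	/* 1 ms floor */
#define FC_TIMEOUT_MAX_NS	100000000ULL	/* 100 ms ceiling */
#define FC_TIMEOUT_INIT_NS	10000000ULL	/* 10 ms before measurements */

/* EMA update matching the patch: new = (7 * old + sample) / 8 */
static uint64_t fc_ema_update(uint64_t avg_ns, uint64_t sample_ns)
{
	return avg_ns ? (7 * avg_ns + sample_ns) / 8 : sample_ns;
}

static uint64_t fc_wait_timeout_ns(uint64_t avg_ns)
{
	uint64_t t;

	if (!avg_ns)
		return FC_TIMEOUT_INIT_NS;

	t = avg_ns * 2;		/* headroom for variance */
	if (t < FC_TIMEOUT_MIN_NS)
		t = FC_TIMEOUT_MIN_NS;
	if (t > FC_TIMEOUT_MAX_NS)
		t = FC_TIMEOUT_MAX_NS;
	return t;
}
```

A 3ms average drain yields a 6ms timeout; a 200us average is clamped up to
the 1ms floor, and a 90ms average is clamped down to the 100ms ceiling.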
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svcsock.h | 2 +
net/sunrpc/svcsock.c | 181 ++++++++++++++++++++++++++++-----
2 files changed, 158 insertions(+), 25 deletions(-)
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index b94096eeb890..fae760eaa9f7 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -54,6 +54,8 @@ struct svc_sock {
/* For sends (protected by xpt_mutex) */
struct bio_vec *sk_bvec;
+ struct llist_head sk_send_queue; /* queued sends awaiting batch processing */
+ u64 sk_drain_avg_ns; /* EMA of batch drain time */
/* private TCP part */
/* On-the-wire fragment header: */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a89427b5fc70..e3fa81b63191 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1677,6 +1677,113 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
return len;
}
+/*
+ * Pending send request for flat combining.
+ * Stack-allocated by each thread enqueuing a send on a TCP socket.
+ */
+struct svc_pending_send {
+ struct llist_node node;
+ struct svc_rqst *rqstp;
+ rpc_fraghdr marker;
+ struct completion done;
+ int result;
+};
+
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+ rpc_fraghdr marker);
+
+/*
+ * svc_tcp_wait_timeout - Compute adaptive timeout for flat combining wait
+ * @svsk: the socket with drain time statistics
+ *
+ * Timeout selection balances two constraints: short timeouts cause
+ * premature wakeups and unnecessary mutex acquisitions; long timeouts
+ * delay threads after batch processing completes.
+ *
+ * LAN environments with fast networks and consistent latencies work well
+ * with fixed timeouts. WAN links exhibit higher variance in send times
+ * due to congestion, packet loss, and bandwidth constraints. An adaptive
+ * timeout based on observed drain times accommodates both cases without
+ * manual tuning.
+ *
+ * The timeout targets 2x the recent average drain time, clamped to
+ * [1ms, 100ms]. The multiplier provides headroom for variance while the
+ * floor prevents excessive wakeups and the ceiling bounds worst-case
+ * latency when measurements are anomalous.
+ *
+ * Returns: timeout in jiffies
+ */
+static unsigned long svc_tcp_wait_timeout(struct svc_sock *svsk)
+{
+ u64 avg_ns = READ_ONCE(svsk->sk_drain_avg_ns);
+ unsigned long timeout;
+
+ /* Initial timeout before measurements are available */
+ if (!avg_ns)
+ return msecs_to_jiffies(10);
+
+ timeout = nsecs_to_jiffies(avg_ns * 2);
+ return clamp(timeout, msecs_to_jiffies(1), msecs_to_jiffies(100));
+}
+
+/*
+ * svc_tcp_combine_sends - Process batched send requests
+ * @svsk: the socket to send on
+ * @xprt: the transport (for dead check and close)
+ *
+ * Caller holds xpt_mutex. Drains sk_send_queue and processes each
+ * pending send. All items are completed before return, even on error.
+ */
+static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
+{
+ struct llist_node *pending, *node, *next;
+ struct svc_pending_send *ps;
+ bool transport_dead = false;
+ u64 start, elapsed, avg;
+ int sent;
+
+ /* Snapshot queued items; subsequent arrivals processed in next batch */
+ pending = llist_del_all(&svsk->sk_send_queue);
+ if (!pending)
+ return;
+
+ start = ktime_get_ns();
+
+ for (node = pending; node; node = next) {
+ next = node->next;
+ ps = container_of(node, struct svc_pending_send, node);
+
+ if (transport_dead || svc_xprt_is_dead(xprt)) {
+ transport_dead = true;
+ ps->result = -ENOTCONN;
+ complete(&ps->done);
+ continue;
+ }
+
+ sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker);
+ trace_svcsock_tcp_send(xprt, sent);
+
+ if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+ ps->result = sent;
+ } else {
+ pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+ xprt->xpt_server->sv_name,
+ sent < 0 ? "send error" : "short send", sent,
+ ps->rqstp->rq_res.len + sizeof(ps->marker));
+ svc_xprt_deferred_close(xprt);
+ transport_dead = true;
+ ps->result = -EAGAIN;
+ }
+
+ complete(&ps->done);
+ }
+
+ /* EMA update: new = (7 * old + measured) / 8 */
+ elapsed = ktime_get_ns() - start;
+ avg = READ_ONCE(svsk->sk_drain_avg_ns);
+ WRITE_ONCE(svsk->sk_drain_avg_ns, avg ? (7 * avg + elapsed) / 8 : elapsed);
+}
+
/*
* MSG_SPLICE_PAGES is used exclusively to reduce the number of
* copy operations in this path. Therefore the caller must ensure
@@ -1716,44 +1823,68 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
* svc_tcp_sendto - Send out a reply on a TCP socket
* @rqstp: completed svc_rqst
*
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Flat combining reduces mutex contention: threads enqueue send
+ * requests; a single thread processes the batch while holding xpt_mutex
+ * to ensure RPC-level serialization.
*
* Returns the number of bytes sent, or a negative errno.
*/
static int svc_tcp_sendto(struct svc_rqst *rqstp)
{
struct svc_xprt *xprt = rqstp->rq_xprt;
- struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
+ struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
struct xdr_buf *xdr = &rqstp->rq_res;
- rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
- (u32)xdr->len);
- int sent;
+ struct svc_pending_send ps = {
+ .rqstp = rqstp,
+ .marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+ };
svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
rqstp->rq_xprt_ctxt = NULL;
- mutex_lock(&xprt->xpt_mutex);
- if (svc_xprt_is_dead(xprt))
- goto out_notconn;
- sent = svc_tcp_sendmsg(svsk, rqstp, marker);
- trace_svcsock_tcp_send(xprt, sent);
- if (sent < 0 || sent != (xdr->len + sizeof(marker)))
- goto out_close;
- mutex_unlock(&xprt->xpt_mutex);
- return sent;
+ init_completion(&ps.done);
-out_notconn:
+ /* Enqueue send request; lock-free via llist */
+ llist_add(&ps.node, &svsk->sk_send_queue);
+
+	/*
+	 * Flat combining: try to become the combiner. Whichever thread
+	 * holds xpt_mutex processes every queued request, so the mutex
+	 * still provides RPC-record serialization while batching
+	 * amortizes lock handoff overhead across all pending sends.
+	 *
+	 * Trylock first for the fast path. On contention, wait briefly
+	 * for the current combiner to complete this request, then fall
+	 * back to a blocking mutex_lock.
+	 */
+ if (mutex_trylock(&xprt->xpt_mutex))
+ goto combine;
+
+	/*
+	 * Another thread holds xpt_mutex and is combining. It may
+	 * complete this request during the wait, in which case no
+	 * mutex acquisition is needed here at all.
+	 */
+ if (wait_for_completion_timeout(&ps.done, svc_tcp_wait_timeout(svsk))) {
+ /* Completed by batch processing */
+ return ps.result;
+ }
+
+	/*
+	 * Timeout expired. Become the combiner ourselves, unless the
+	 * previous combiner completed this request while we were
+	 * waiting for the mutex.
+	 */
+ mutex_lock(&xprt->xpt_mutex);
+ if (completion_done(&ps.done)) {
+ mutex_unlock(&xprt->xpt_mutex);
+ return ps.result;
+ }
+
+combine:
+ /* Mutex held; process batches until queue drains. */
+ while (!llist_empty(&svsk->sk_send_queue))
+ svc_tcp_combine_sends(svsk, xprt);
mutex_unlock(&xprt->xpt_mutex);
- return -ENOTCONN;
-out_close:
- pr_notice("rpc-srv/tcp: %s: %s %d when sending %zu bytes - shutting down socket\n",
- xprt->xpt_server->sv_name,
- (sent < 0) ? "got error" : "sent",
- sent, xdr->len + sizeof(marker));
- svc_xprt_deferred_close(xprt);
- mutex_unlock(&xprt->xpt_mutex);
- return -EAGAIN;
+
+ return ps.result;
}
static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
--
2.52.0
Thread overview: 9+ messages
2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
2026-02-10 16:20 ` [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST Chuck Lever
2026-02-10 16:20 ` [PATCH v2 2/8] net: datagram: bypass usercopy checks for kernel iterators Chuck Lever
2026-02-10 16:20 ` [PATCH v2 3/8] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
2026-02-10 16:20 ` [PATCH v2 4/8] sunrpc: add per-transport page recycling pool Chuck Lever
2026-02-10 16:20 ` [PATCH v2 5/8] sunrpc: add dedicated TCP receiver thread Chuck Lever
2026-02-10 16:20 ` Chuck Lever [this message]
2026-02-10 16:20 ` [PATCH v2 7/8] sunrpc: unify fore and backchannel server TCP send paths Chuck Lever
2026-02-10 16:20 ` [PATCH v2 8/8] sunrpc: Set explicit TCP socket buffer sizes for NFSD Chuck Lever