From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chuck Lever
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: , Chuck Lever
Subject: [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends
Date: Tue, 10 Feb 2026 11:20:23 -0500
Message-ID: <20260210162025.2356389-7-cel@kernel.org>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260210162025.2356389-1-cel@kernel.org>
References: <20260210162025.2356389-1-cel@kernel.org>
X-Mailing-List: linux-nfs@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Chuck Lever

High contention on xpt_mutex during TCP reply transmission limits nfsd
scalability. When multiple threads send replies on the same connection,
mutex handoff triggers optimistic spinning that consumes substantial
CPU time. Profiling on high-throughput workloads shows approximately
15% of cycles spent in the spin-wait path.

The flat combining pattern addresses this by allowing one thread to
perform send operations on behalf of multiple waiting threads. Rather
than each thread acquiring xpt_mutex independently, threads enqueue
their send requests on a lock-free llist. The first thread to acquire
the mutex becomes the combiner and processes all pending sends in a
batch. Other threads wait on a per-request completion structure instead
of spinning on the lock.

The combiner continues processing the queue until it is empty before
releasing xpt_mutex, amortizing the acquisition cost across multiple
sends. TCP auto-corking naturally coalesces segments once the first
send places data in flight, avoiding the need for explicit corking
hints that would delay initial replies.
Signed-off-by: Chuck Lever
---
 include/linux/sunrpc/svcsock.h |   2 +
 net/sunrpc/svcsock.c           | 181 ++++++++++++++++++++++++++++-----
 2 files changed, 158 insertions(+), 25 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index b94096eeb890..fae760eaa9f7 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -54,6 +54,8 @@ struct svc_sock {

 	/* For sends (protected by xpt_mutex) */
 	struct bio_vec		*sk_bvec;
+	struct llist_head	sk_send_queue;	/* queued sends awaiting batch processing */
+	u64			sk_drain_avg_ns; /* EMA of batch drain time */

 	/* private TCP part */
 	/* On-the-wire fragment header: */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a89427b5fc70..e3fa81b63191 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1677,6 +1677,113 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	return len;
 }

+/*
+ * Pending send request for flat combining.
+ * Stack-allocated by each thread enqueuing a send on a TCP socket.
+ */
+struct svc_pending_send {
+	struct llist_node	node;
+	struct svc_rqst		*rqstp;
+	rpc_fraghdr		marker;
+	struct completion	done;
+	int			result;
+};
+
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+			   rpc_fraghdr marker);
+
+/*
+ * svc_tcp_wait_timeout - Compute adaptive timeout for flat combining wait
+ * @svsk: the socket with drain time statistics
+ *
+ * Timeout selection balances two constraints: short timeouts cause
+ * premature wakeups and unnecessary mutex acquisitions; long timeouts
+ * delay threads after batch processing completes.
+ *
+ * LAN environments with fast networks and consistent latencies work well
+ * with fixed timeouts. WAN links exhibit higher variance in send times
+ * due to congestion, packet loss, and bandwidth constraints. An adaptive
+ * timeout based on observed drain times accommodates both cases without
+ * manual tuning.
+ *
+ * The timeout targets 2x the recent average drain time, clamped to
+ * [1ms, 100ms]. The multiplier provides headroom for variance, while the
+ * floor prevents excessive wakeups and the ceiling bounds worst-case
+ * latency when measurements are anomalous.
+ *
+ * Returns: timeout in jiffies
+ */
+static unsigned long svc_tcp_wait_timeout(struct svc_sock *svsk)
+{
+	u64 avg_ns = READ_ONCE(svsk->sk_drain_avg_ns);
+	unsigned long timeout;
+
+	/* Initial timeout before measurements are available */
+	if (!avg_ns)
+		return msecs_to_jiffies(10);
+
+	timeout = nsecs_to_jiffies(avg_ns * 2);
+	return clamp(timeout, msecs_to_jiffies(1), msecs_to_jiffies(100));
+}
+
+/*
+ * svc_tcp_combine_sends - Process batched send requests
+ * @svsk: the socket to send on
+ * @xprt: the transport (for dead check and close)
+ *
+ * Caller holds xpt_mutex. Drains sk_send_queue and processes each
+ * pending send. All items are completed before return, even on error.
+ */
+static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
+{
+	struct llist_node *pending, *node, *next;
+	struct svc_pending_send *ps;
+	bool transport_dead = false;
+	u64 start, elapsed, avg;
+	int sent;
+
+	/* Snapshot queued items; subsequent arrivals are processed in the
+	 * next batch */
+	pending = llist_del_all(&svsk->sk_send_queue);
+	if (!pending)
+		return;
+
+	start = ktime_get_ns();
+
+	for (node = pending; node; node = next) {
+		next = node->next;
+		ps = container_of(node, struct svc_pending_send, node);
+
+		if (transport_dead || svc_xprt_is_dead(xprt)) {
+			transport_dead = true;
+			ps->result = -ENOTCONN;
+			complete(&ps->done);
+			continue;
+		}
+
+		sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker);
+		trace_svcsock_tcp_send(xprt, sent);
+
+		if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+			ps->result = sent;
+		} else {
+			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
+				  xprt->xpt_server->sv_name,
+				  sent < 0 ? "send error" : "short send", sent,
+				  ps->rqstp->rq_res.len + sizeof(ps->marker));
+			svc_xprt_deferred_close(xprt);
+			transport_dead = true;
+			ps->result = -EAGAIN;
+		}
+
+		complete(&ps->done);
+	}
+
+	/* EMA update: new = (7 * old + measured) / 8 */
+	elapsed = ktime_get_ns() - start;
+	avg = READ_ONCE(svsk->sk_drain_avg_ns);
+	WRITE_ONCE(svsk->sk_drain_avg_ns, avg ? (7 * avg + elapsed) / 8 : elapsed);
+}
+
 /*
  * MSG_SPLICE_PAGES is used exclusively to reduce the number of
  * copy operations in this path. Therefore the caller must ensure
@@ -1716,44 +1823,68 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
 /**
  * svc_tcp_sendto - Send out a reply on a TCP socket
  * @rqstp: completed svc_rqst
  *
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Flat combining reduces mutex contention: threads enqueue send
+ * requests; a single thread processes the batch while holding xpt_mutex
+ * to ensure RPC-level serialization.
  *
  * Returns the number of bytes sent, or a negative errno.
  */
 static int svc_tcp_sendto(struct svc_rqst *rqstp)
 {
 	struct svc_xprt *xprt = rqstp->rq_xprt;
-	struct svc_sock	*svsk = container_of(xprt, struct svc_sock, sk_xprt);
+	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
 	struct xdr_buf *xdr = &rqstp->rq_res;
-	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
-					 (u32)xdr->len);
-	int sent;
+	struct svc_pending_send ps = {
+		.rqstp = rqstp,
+		.marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+	};

 	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
 	rqstp->rq_xprt_ctxt = NULL;

-	mutex_lock(&xprt->xpt_mutex);
-	if (svc_xprt_is_dead(xprt))
-		goto out_notconn;
-	sent = svc_tcp_sendmsg(svsk, rqstp, marker);
-	trace_svcsock_tcp_send(xprt, sent);
-	if (sent < 0 || sent != (xdr->len + sizeof(marker)))
-		goto out_close;
-	mutex_unlock(&xprt->xpt_mutex);
-	return sent;
+	init_completion(&ps.done);

-out_notconn:
+	/* Enqueue the send request; lock-free via llist */
+	llist_add(&ps.node, &svsk->sk_send_queue);
+
+	/*
+	 * Flat combining: if the trylock succeeds, this thread becomes
+	 * the combiner and processes all queued requests. The mutex
+	 * provides RPC-level serialization while batching reduces lock
+	 * handoff overhead.
+	 *
+	 * On contention, wait briefly for the current combiner to
+	 * complete this request, then fall back to a blocking
+	 * mutex_lock().
+	 */
+	if (mutex_trylock(&xprt->xpt_mutex))
+		goto combine;
+
+	/*
+	 * Another thread holds xpt_mutex and is combining. Wait for it
+	 * to complete this request; the timeout covers the case where
+	 * our entry arrived after the combiner's batch snapshot.
+	 */
+	if (wait_for_completion_timeout(&ps.done, svc_tcp_wait_timeout(svsk))) {
+		/* Completed by batch processing */
+		return ps.result;
+	}
+
+	/*
+	 * Timeout expired. Acquire the mutex to process the batch
+	 * ourselves, or discover that completion occurred while we
+	 * were acquiring it.
+	 */
+	mutex_lock(&xprt->xpt_mutex);
+	if (completion_done(&ps.done)) {
+		mutex_unlock(&xprt->xpt_mutex);
+		return ps.result;
+	}
+
+combine:
+	/* Mutex held; process batches until the queue drains. */
+	while (!llist_empty(&svsk->sk_send_queue))
+		svc_tcp_combine_sends(svsk, xprt);
 	mutex_unlock(&xprt->xpt_mutex);
-	return -ENOTCONN;
-out_close:
-	pr_notice("rpc-srv/tcp: %s: %s %d when sending %zu bytes - shutting down socket\n",
-		  xprt->xpt_server->sv_name,
-		  (sent < 0) ? "got error" : "sent",
-		  sent, xdr->len + sizeof(marker));
-	svc_xprt_deferred_close(xprt);
-	mutex_unlock(&xprt->xpt_mutex);
-	return -EAGAIN;
+
+	return ps.result;
 }

 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
-- 
2.52.0