public inbox for linux-nfs@vger.kernel.org
From: Chuck Lever <cel@kernel.org>
To: NeilBrown <neilb@ownmail.net>, Jeff Layton <jlayton@kernel.org>,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <dai.ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Cc: <linux-nfs@vger.kernel.org>, Chuck Lever <chuck.lever@oracle.com>
Subject: [PATCH v2 7/8] sunrpc: unify fore and backchannel server TCP send paths
Date: Tue, 10 Feb 2026 11:20:24 -0500	[thread overview]
Message-ID: <20260210162025.2356389-8-cel@kernel.org> (raw)
In-Reply-To: <20260210162025.2356389-1-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

High latency in callback processing for NFSv4.1+ sessions can occur
when the fore channel sustains heavy request traffic. The backchannel
implementation acquires xpt_mutex directly while the fore channel
uses flat combining to batch socket operations. A thread in the
combining loop processes queued sends continuously until the llist
empties, which under load means backchannel threads block on xpt_mutex
for extended periods waiting for a turn at the socket. Delegation
recalls and other callback operations carry time constraints that
make this starvation problematic.

Routing backchannel sends through the same flat combining
infrastructure eliminates this starvation; a shared llist queue
replaces direct mutex acquisition and separate code paths. The
struct svc_pending_send now holds an xdr_buf pointer instead of
svc_rqst, decoupling the queueing mechanism from RPC request
structures. A new svc_tcp_send() function accepts an xprt, xdr_buf,
and marker, then either enters the combining loop or enqueues for
processing by an active combiner.

The fore channel path through svc_tcp_sendto() now calls
svc_tcp_send() after preparing its xdr_buf. The backchannel
bc_send_request() similarly calls svc_tcp_send() in place of its
former mutex acquisition and direct bc_sendto() invocation. Both
channels queue into the same llist, so backchannel operations
receive fair treatment in the send ordering. When a backchannel
send queues behind fore channel traffic, the combining loop
processes both together with shared socket lock acquisition and
MSG_MORE coalescing where applicable.

A single code path for TCP sends also reduces maintenance burden. The
backchannel gains the batching benefits whenever it runs concurrently
with fore channel load, and starvation no longer occurs because the
queue provides deterministic ordering that does not depend on mutex
contention timing.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svcsock.h |  2 +
 net/sunrpc/svcsock.c           | 64 ++++++++++++++++++++++---------
 net/sunrpc/xprtsock.c          | 60 ++++++---------------------
 3 files changed, 59 insertions(+), 67 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index fae760eaa9f7..07619bf2131c 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -101,6 +101,8 @@ static inline u32 svc_sock_final_rec(struct svc_sock *svsk)
  */
 void		svc_recv(struct svc_rqst *rqstp);
 void		svc_send(struct svc_rqst *rqstp);
+int		svc_tcp_send(struct svc_xprt *xprt, struct xdr_buf *xdr,
+			     rpc_fraghdr marker);
 int		svc_addsock(struct svc_serv *serv, struct net *net,
 			    const int fd, char *name_return, const size_t len,
 			    const struct cred *cred);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index e3fa81b63191..8b2e9f524506 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1683,13 +1683,13 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
  */
 struct svc_pending_send {
 	struct llist_node	node;
-	struct svc_rqst		*rqstp;
+	struct xdr_buf		*xdr;
 	rpc_fraghdr		marker;
 	struct completion	done;
 	int			result;
 };
 
-static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
 			   rpc_fraghdr marker);
 
 /*
@@ -1750,6 +1750,8 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
 	start = ktime_get_ns();
 
 	for (node = pending; node; node = next) {
+		size_t expected;
+
 		next = node->next;
 		ps = container_of(node, struct svc_pending_send, node);
 
@@ -1760,16 +1762,17 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
 			continue;
 		}
 
-		sent = svc_tcp_sendmsg(svsk, ps->rqstp, ps->marker);
+		sent = svc_tcp_sendmsg(svsk, ps->xdr, ps->marker);
 		trace_svcsock_tcp_send(xprt, sent);
 
-		if (sent == ps->rqstp->rq_res.len + sizeof(ps->marker)) {
+		expected = ps->xdr->len + sizeof(ps->marker);
+		if (sent == expected) {
 			ps->result = sent;
 		} else {
 			pr_notice("rpc-srv/tcp: %s: %s (%d of %zu bytes) - shutting down socket\n",
 				  xprt->xpt_server->sv_name,
-				  sent < 0 ? "send error" : "short send", sent,
-				  ps->rqstp->rq_res.len + sizeof(ps->marker));
+				  sent < 0 ? "send error" : "short send",
+				  sent, expected);
 			svc_xprt_deferred_close(xprt);
 			transport_dead = true;
 			ps->result = -EAGAIN;
@@ -1789,7 +1792,7 @@ static void svc_tcp_combine_sends(struct svc_sock *svsk, struct svc_xprt *xprt)
  * copy operations in this path. Therefore the caller must ensure
  * that the pages backing @xdr are unchanging.
  */
-static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
 			   rpc_fraghdr marker)
 {
 	struct msghdr msg = {
@@ -1809,39 +1812,40 @@ static int svc_tcp_sendmsg(struct svc_sock *svsk, struct svc_rqst *rqstp,
 	memcpy(buf, &marker, sizeof(marker));
 	bvec_set_virt(svsk->sk_bvec, buf, sizeof(marker));
 
-	count = xdr_buf_to_bvec(svsk->sk_bvec + 1, rqstp->rq_maxpages,
-				&rqstp->rq_res);
+	count = xdr_buf_to_bvec(svsk->sk_bvec + 1, svsk->sk_maxpages, xdr);
 
 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, svsk->sk_bvec,
-		      1 + count, sizeof(marker) + rqstp->rq_res.len);
+		      1 + count, sizeof(marker) + xdr->len);
 	ret = sock_sendmsg(svsk->sk_sock, &msg);
 	page_frag_free(buf);
 	return ret;
 }
 
 /**
- * svc_tcp_sendto - Send out a reply on a TCP socket
- * @rqstp: completed svc_rqst
+ * svc_tcp_send - Send an XDR buffer on a TCP socket using flat combining
+ * @xprt: the transport to send on
+ * @xdr: the XDR buffer to send
+ * @marker: RPC record marker
  *
  * Flat combining reduces mutex contention: threads enqueue send
  * requests; a single thread processes the batch while holding xpt_mutex
  * to ensure RPC-level serialization.
  *
+ * Can be used for both fore channel (NFS replies) and backchannel
+ * (NFSv4 callbacks) sends since both share the same TCP connection
+ * and xpt_mutex.
+ *
  * Returns the number of bytes sent, or a negative errno.
  */
-static int svc_tcp_sendto(struct svc_rqst *rqstp)
+int svc_tcp_send(struct svc_xprt *xprt, struct xdr_buf *xdr,
+		 rpc_fraghdr marker)
 {
-	struct svc_xprt *xprt = rqstp->rq_xprt;
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
-	struct xdr_buf *xdr = &rqstp->rq_res;
 	struct svc_pending_send ps = {
-		.rqstp = rqstp,
-		.marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | (u32)xdr->len),
+		.xdr = xdr,
+		.marker = marker,
 	};
 
-	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
-	rqstp->rq_xprt_ctxt = NULL;
-
 	init_completion(&ps.done);
 
 	/* Enqueue send request; lock-free via llist */
@@ -1886,6 +1890,26 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
 
 	return ps.result;
 }
+EXPORT_SYMBOL_GPL(svc_tcp_send);
+
+/**
+ * svc_tcp_sendto - Send out a reply on a TCP socket
+ * @rqstp: completed svc_rqst
+ *
+ * Returns the number of bytes sent, or a negative errno.
+ */
+static int svc_tcp_sendto(struct svc_rqst *rqstp)
+{
+	struct svc_xprt *xprt = rqstp->rq_xprt;
+	struct xdr_buf *xdr = &rqstp->rq_res;
+	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
+					 (u32)xdr->len);
+
+	svc_tcp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
+	rqstp->rq_xprt_ctxt = NULL;
+
+	return svc_tcp_send(xprt, xdr, marker);
+}
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
 				       struct net *net,
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 2e1fe6013361..4e1d82186b00 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2979,36 +2979,13 @@ static void bc_free(struct rpc_task *task)
 	free_page((unsigned long)buf);
 }
 
-static int bc_sendto(struct rpc_rqst *req)
-{
-	struct xdr_buf *xdr = &req->rq_snd_buf;
-	struct sock_xprt *transport =
-			container_of(req->rq_xprt, struct sock_xprt, xprt);
-	struct msghdr msg = {
-		.msg_flags	= 0,
-	};
-	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
-					 (u32)xdr->len);
-	unsigned int sent = 0;
-	int err;
-
-	req->rq_xtime = ktime_get();
-	err = xdr_alloc_bvec(xdr, rpc_task_gfp_mask());
-	if (err < 0)
-		return err;
-	err = xprt_sock_sendmsg(transport->sock, &msg, xdr, 0, marker, &sent);
-	xdr_free_bvec(xdr);
-	if (err < 0 || sent != (xdr->len + sizeof(marker)))
-		return -EAGAIN;
-	return sent;
-}
-
 /**
  * bc_send_request - Send a backchannel Call on a TCP socket
  * @req: rpc_rqst containing Call message to be sent
  *
- * xpt_mutex ensures @rqstp's whole message is written to the socket
- * without interruption.
+ * Uses flat combining via svc_tcp_send() to participate in batched
+ * sending with fore channel traffic, ensuring fair ordering and
+ * reduced lock contention.
  *
  * Return values:
  *   %0 if the message was sent successfully
@@ -3016,29 +2993,18 @@ static int bc_sendto(struct rpc_rqst *req)
  */
 static int bc_send_request(struct rpc_rqst *req)
 {
-	struct svc_xprt	*xprt;
-	int len;
+	struct xdr_buf *xdr = &req->rq_snd_buf;
+	struct svc_xprt *xprt = req->rq_xprt->bc_xprt;
+	rpc_fraghdr marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT |
+					 (u32)xdr->len);
+	int ret;
 
-	/*
-	 * Get the server socket associated with this callback xprt
-	 */
-	xprt = req->rq_xprt->bc_xprt;
+	req->rq_xtime = ktime_get();
+	ret = svc_tcp_send(xprt, xdr, marker);
 
-	/*
-	 * Grab the mutex to serialize data as the connection is shared
-	 * with the fore channel
-	 */
-	mutex_lock(&xprt->xpt_mutex);
-	if (test_bit(XPT_DEAD, &xprt->xpt_flags))
-		len = -ENOTCONN;
-	else
-		len = bc_sendto(req);
-	mutex_unlock(&xprt->xpt_mutex);
-
-	if (len > 0)
-		len = 0;
-
-	return len;
+	if (ret < 0)
+		return ret;
+	return 0;
 }
 
 static void bc_close(struct rpc_xprt *xprt)
-- 
2.52.0



Thread overview: 9+ messages
2026-02-10 16:20 [PATCH v2 0/8] sunrpc: Reduce lock contention for NFSD TCP sockets Chuck Lever
2026-02-10 16:20 ` [PATCH v2 1/8] sunrpc: Add XPT flags missing from SVC_XPRT_FLAG_LIST Chuck Lever
2026-02-10 16:20 ` [PATCH v2 2/8] net: datagram: bypass usercopy checks for kernel iterators Chuck Lever
2026-02-10 16:20 ` [PATCH v2 3/8] sunrpc: split svc_data_ready into protocol-specific callbacks Chuck Lever
2026-02-10 16:20 ` [PATCH v2 4/8] sunrpc: add per-transport page recycling pool Chuck Lever
2026-02-10 16:20 ` [PATCH v2 5/8] sunrpc: add dedicated TCP receiver thread Chuck Lever
2026-02-10 16:20 ` [PATCH v2 6/8] sunrpc: implement flat combining for TCP socket sends Chuck Lever
2026-02-10 16:20 ` Chuck Lever [this message]
2026-02-10 16:20 ` [PATCH v2 8/8] sunrpc: Set explicit TCP socket buffer sizes for NFSD Chuck Lever
