From: Chuck Lever <cel@kernel.org>
To: Hannes Reinecke <hare@suse.de>, Olga Kornievskaia <okorniev@redhat.com>
Cc: kernel-tls-handshake@lists.linux.dev,
Chuck Lever <chuck.lever@oracle.com>
Subject: [RFC PATCH 3/4] sunrpc: Use read_sock_cmsg for svcsock TCP receives
Date: Tue, 17 Feb 2026 17:20:32 -0500
Message-ID: <20260217222033.1929211-4-cel@kernel.org>
In-Reply-To: <20260217222033.1929211-1-cel@kernel.org>

From: Chuck Lever <chuck.lever@oracle.com>

The svcsock TCP receive path uses sock_recvmsg() with ancillary data
buffers to detect TLS alerts when kTLS is active. This CMSG-based
approach has two drawbacks: the MSG_CTRUNC recovery dance adds
overhead to every receive, and sock_recvmsg() cannot take advantage
of zero-copy optimizations available through read_sock.

When the socket provides a read_sock_cmsg method (now set by kTLS),
svc_tcp_recvfrom() dispatches to a new svc_tcp_recvfrom_readsock()
path. Two actor callbacks handle the data:

svc_tcp_recv_actor() parses the RPC record byte stream directly from
skbs. Fragment header bytes fill sk_marker first; subsequent body
bytes are copied into rq_pages at the position tracked by
sk_datalen. When the last fragment of a complete RPC message
arrives, the actor sets desc->count to zero, stopping the read loop.
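
As a rough userspace illustration of the two-phase parsing described
above (not the kernel code in this patch), the sketch below feeds a
byte stream through an RPC record-marking parser (RFC 5531). The names
rec_state, rec_feed, SKETCH_PAGE_SIZE and SKETCH_MAX_PAGES are
hypothetical; memcpy() stands in for skb_copy_bits(), and a fixed
array stands in for rq_pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAG_HDR_LEN     4u           /* record marker is 4 bytes on the wire */
#define LAST_FRAG_BIT    0x80000000u  /* high bit: last fragment of the record */
#define SKETCH_PAGE_SIZE 4096u        /* hypothetical page size */
#define SKETCH_MAX_PAGES 4u           /* hypothetical receive buffer size */

struct rec_state {
	uint8_t marker[FRAG_HDR_LEN];  /* plays the role of sk_marker */
	size_t  tcplen;                /* header + body bytes of current fragment */
	size_t  datalen;               /* body bytes stored across all fragments */
	uint8_t pages[SKETCH_MAX_PAGES][SKETCH_PAGE_SIZE]; /* stands in for rq_pages */
	int     complete;              /* set once the last fragment is fully stored */
};

/* Decode the big-endian marker word. */
static uint32_t marker_word(const struct rec_state *st)
{
	return ((uint32_t)st->marker[0] << 24) | ((uint32_t)st->marker[1] << 16) |
	       ((uint32_t)st->marker[2] << 8) | (uint32_t)st->marker[3];
}

/*
 * Consume up to len bytes from buf. Returns the number of bytes
 * consumed, or -1 if the fragment would overflow the page array.
 * Stops at a fragment boundary; the caller feeds the remainder again.
 */
static long rec_feed(struct rec_state *st, const uint8_t *buf, size_t len)
{
	size_t consumed = 0;

	assert(st != NULL && buf != NULL);
	if (st->complete)
		return 0;

	/* Phase 1: fill the fragment header, possibly across several calls */
	if (st->tcplen < FRAG_HDR_LEN) {
		size_t want = FRAG_HDR_LEN - st->tcplen;
		size_t n = want < len ? want : len;

		memcpy(st->marker + st->tcplen, buf, n);
		st->tcplen += n;
		consumed += n;
		if (st->tcplen < FRAG_HDR_LEN)
			return (long)consumed;
		if ((marker_word(st) & ~LAST_FRAG_BIT) + st->datalen >
		    sizeof(st->pages))
			return -1;
	}

	/* Phase 2: copy body bytes to the position tracked by datalen */
	size_t reclen = marker_word(st) & ~LAST_FRAG_BIT;
	size_t want = reclen - (st->tcplen - FRAG_HDR_LEN);
	size_t avail = len - consumed;
	size_t take = want < avail ? want : avail;
	size_t done = 0;

	while (done < take) {
		size_t pg = st->datalen / SKETCH_PAGE_SIZE;
		size_t off = st->datalen % SKETCH_PAGE_SIZE;
		size_t chunk = take - done;

		if (chunk > SKETCH_PAGE_SIZE - off)
			chunk = SKETCH_PAGE_SIZE - off;
		memcpy(&st->pages[pg][off], buf + consumed + done, chunk);
		done += chunk;
		st->datalen += chunk;
	}
	st->tcplen += take;
	consumed += take;

	/* Fragment complete: finish the record or expect another header */
	if (st->tcplen - FRAG_HDR_LEN >= reclen) {
		if (marker_word(st) & LAST_FRAG_BIT)
			st->complete = 1;
		else
			st->tcplen = 0;
	}
	return (long)consumed;
}
```

A caller would invoke rec_feed() repeatedly, advancing its input by the
returned count, until st.complete is set. In the patch the equivalent
stop signal is desc->count being set to zero.
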
svc_tcp_cmsg_actor() handles non-data TLS records. For fatal alerts,
the transport is marked for deferred close. All non-data records
stop the read loop so callers can inspect the error before
continuing.
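
For illustration only, the alert handling above can be sketched in
userspace as a small classifier. A TLS alert record carries two bytes,
level then description; content type 21 and level 2 (fatal) come from
the TLS specification. The SKETCH_* constants, enum cmsg_verdict and
classify_record() are hypothetical names, not part of the patch, which
uses TLS_RECORD_TYPE_ALERT and TLS_ALERT_LEVEL_FATAL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TLS_CT_ALERT    21  /* TLS ContentType value for alert */
#define SKETCH_TLS_ALERT_FATAL  2  /* TLS AlertLevel value for fatal */

enum cmsg_verdict {
	CMSG_IGNORE,   /* not an alert, or too short to parse */
	CMSG_WARNING,  /* alert that does not require closing */
	CMSG_CLOSE,    /* fatal alert: the transport should be closed */
};

/* Classify one non-data TLS record from its content type and payload. */
static enum cmsg_verdict classify_record(uint8_t content_type,
					 const uint8_t *payload, size_t len)
{
	assert(payload != NULL || len == 0);

	if (content_type != SKETCH_TLS_CT_ALERT)
		return CMSG_IGNORE;
	if (len < 2)
		return CMSG_IGNORE;
	return payload[0] == SKETCH_TLS_ALERT_FATAL ? CMSG_CLOSE : CMSG_WARNING;
}
```

In the patch, the CMSG_CLOSE case corresponds to calling
svc_xprt_deferred_close() and setting desc->error to -ENOTCONN; every
record type still ends the read loop.
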
The existing sock_recvmsg() path remains as a fallback for sockets
without read_sock_cmsg (plain TCP, non-kTLS configurations).
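
The dispatch-with-fallback idea can be shown with a minimal userspace
sketch, assuming a hypothetical ops table (struct sketch_ops) in place
of the socket's proto_ops. A NULL method pointer selects the existing
sock_recvmsg()-style path, mirroring the check added at the top of
svc_tcp_recvfrom().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the socket's proto_ops table. */
struct sketch_ops {
	int (*read_sock_cmsg)(void *sk);  /* NULL if the socket lacks the method */
};

enum recv_path {
	PATH_READ_SOCK_CMSG,  /* new actor-based receive path */
	PATH_RECVMSG,         /* existing fallback path */
};

/* Pick the receive path based on whether the optional method is set. */
static enum recv_path choose_recv_path(const struct sketch_ops *ops)
{
	assert(ops != NULL);
	return ops->read_sock_cmsg ? PATH_READ_SOCK_CMSG : PATH_RECVMSG;
}

/* Dummy method so a "kTLS-like" ops table has a non-NULL pointer. */
static int sketch_ktls_read(void *sk)
{
	(void)sk;
	return 0;
}
```
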
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svcsock.c | 245 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 245 insertions(+)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index d61cd9b40491..9600d15287e7 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1124,6 +1124,248 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
svsk->sk_marker = xdr_zero;
}
+/*
+ * read_sock_cmsg data actor: receives decrypted application data
+ * from the TLS layer, parsing the RPC record stream (fragment
+ * headers and message bodies) and assembling complete RPC messages
+ * into @rqstp->rq_pages.
+ */
+static int svc_tcp_recv_actor(read_descriptor_t *desc,
+ struct sk_buff *skb,
+ unsigned int offset, size_t len)
+{
+ struct svc_rqst *rqstp = desc->arg.data;
+ struct svc_sock *svsk =
+ container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
+ size_t consumed = 0;
+
+ /* Phase 1: consume fragment header bytes */
+ while (svsk->sk_tcplen < sizeof(rpc_fraghdr) && len > 0) {
+ size_t want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
+ size_t n = min(want, len);
+
+ if (skb_copy_bits(skb, offset,
+ (char *)&svsk->sk_marker +
+ svsk->sk_tcplen, n))
+ goto fault;
+ svsk->sk_tcplen += n;
+ offset += n;
+ len -= n;
+ consumed += n;
+
+ if (svsk->sk_tcplen < sizeof(rpc_fraghdr))
+ return consumed;
+
+ trace_svcsock_marker(&svsk->sk_xprt, svsk->sk_marker);
+ if (svc_sock_reclen(svsk) + svsk->sk_datalen >
+ svsk->sk_xprt.xpt_server->sv_max_mesg) {
+ net_notice_ratelimited("svc: %s oversized RPC fragment (%u octets)\n",
+ svsk->sk_xprt.xpt_server->sv_name,
+ svc_sock_reclen(svsk));
+ desc->error = -EMSGSIZE;
+ desc->count = 0;
+ return consumed;
+ }
+ }
+
+ if (len == 0)
+ return consumed;
+
+ /* Phase 2: copy body data into rq_pages */
+ {
+ size_t reclen = svc_sock_reclen(svsk);
+ size_t received = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+ size_t want = reclen - received;
+ size_t take = min(want, len);
+ size_t done = 0;
+
+ while (done < take) {
+ unsigned int pg = svsk->sk_datalen >> PAGE_SHIFT;
+ unsigned int pg_off = svsk->sk_datalen &
+ (PAGE_SIZE - 1);
+ size_t chunk = min(take - done,
+ PAGE_SIZE - (size_t)pg_off);
+
+ if (skb_copy_bits(skb, offset,
+ page_address(rqstp->rq_pages[pg])
+ + pg_off,
+ chunk))
+ goto fault;
+ offset += chunk;
+ done += chunk;
+ svsk->sk_datalen += chunk;
+ }
+ svsk->sk_tcplen += take;
+ consumed += take;
+
+ /* Fragment complete? */
+ if (svsk->sk_tcplen - sizeof(rpc_fraghdr) >= reclen) {
+ if (svc_sock_final_rec(svsk)) {
+ desc->count = 0;
+ } else {
+ svc_tcp_fragment_received(svsk);
+ }
+ }
+ }
+
+ return consumed;
+
+fault:
+ desc->error = -EFAULT;
+ desc->count = 0;
+ return consumed;
+}
+
+/*
+ * read_sock_cmsg control message actor: receives non-data TLS
+ * records (alerts, handshake messages) and translates them into
+ * transport-level actions.
+ */
+static int svc_tcp_cmsg_actor(read_descriptor_t *desc,
+ struct sk_buff *skb,
+ unsigned int offset, size_t len,
+ u8 content_type)
+{
+ struct svc_rqst *rqstp = desc->arg.data;
+ struct svc_sock *svsk =
+ container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
+
+ switch (content_type) {
+ case TLS_RECORD_TYPE_ALERT:
+ if (len >= 2) {
+ u8 alert[2];
+
+ if (!skb_copy_bits(skb, offset, alert,
+ sizeof(alert))) {
+ if (alert[0] == TLS_ALERT_LEVEL_FATAL) {
+ svc_xprt_deferred_close(
+ &svsk->sk_xprt);
+ desc->error = -ENOTCONN;
+ desc->count = 0;
+ }
+ }
+ }
+ break;
+ default:
+ break;
+ }
+ return -EAGAIN;
+}
+
+static int svc_tcp_recvfrom_readsock(struct svc_rqst *rqstp)
+{
+ struct svc_sock *svsk =
+ container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
+ struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+ struct sock *sk = svsk->sk_sk;
+ read_descriptor_t desc = {
+ .arg.data = rqstp,
+ };
+ ssize_t len;
+ __be32 *p;
+ __be32 calldir;
+
+ clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+
+ svc_tcp_restore_pages(svsk, rqstp);
+ rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+
+ /* Ensure no stale response pages are released if the
+ * receive returns without completing a full message.
+ */
+ rqstp->rq_respages = rqstp->rq_page_end;
+ rqstp->rq_next_page = rqstp->rq_page_end;
+
+ desc.count = serv->sv_max_mesg;
+ lock_sock(sk);
+ len = svsk->sk_sock->ops->read_sock_cmsg(sk, &desc,
+ svc_tcp_recv_actor,
+ svc_tcp_cmsg_actor);
+ release_sock(sk);
+
+ if (desc.error == -EMSGSIZE)
+ goto err_delete;
+ if (desc.error < 0) {
+ len = desc.error;
+ goto error;
+ }
+ if (desc.count != 0) {
+ /* Incomplete message */
+ if (len > 0)
+ set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+ goto err_incomplete;
+ }
+
+ /* Complete RPC message received */
+ if (svsk->sk_datalen < 8)
+ goto err_nuts;
+
+ rqstp->rq_arg.len = svsk->sk_datalen;
+ rqstp->rq_arg.page_base = 0;
+ if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
+ rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
+ rqstp->rq_arg.page_len = 0;
+ } else {
+ rqstp->rq_arg.page_len = rqstp->rq_arg.len -
+ rqstp->rq_arg.head[0].iov_len;
+ }
+
+ {
+ unsigned int pg_count =
+ (svsk->sk_datalen + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ rqstp->rq_respages = &rqstp->rq_pages[pg_count];
+ rqstp->rq_next_page = rqstp->rq_respages + 1;
+ }
+
+ rqstp->rq_xprt_ctxt = NULL;
+ rqstp->rq_prot = IPPROTO_TCP;
+ if (test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags))
+ set_bit(RQ_LOCAL, &rqstp->rq_flags);
+ else
+ clear_bit(RQ_LOCAL, &rqstp->rq_flags);
+
+ p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
+ calldir = p[1];
+ if (calldir)
+ len = receive_cb_reply(svsk, rqstp);
+
+ /* Reset TCP read info */
+ svsk->sk_datalen = 0;
+ svc_tcp_fragment_received(svsk);
+
+ if (len < 0)
+ goto error;
+
+ trace_svcsock_tcp_recv(&svsk->sk_xprt, rqstp->rq_arg.len);
+ svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
+ if (serv->sv_stats)
+ serv->sv_stats->nettcpcnt++;
+
+ svc_sock_secure_port(rqstp);
+ set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+ svc_xprt_received(rqstp->rq_xprt);
+ return rqstp->rq_arg.len;
+
+err_incomplete:
+ svc_tcp_save_pages(svsk, rqstp);
+ if (len < 0 && len != -EAGAIN)
+ goto err_delete;
+ goto err_noclose;
+error:
+ if (len != -EAGAIN)
+ goto err_delete;
+ trace_svcsock_tcp_recv_eagain(&svsk->sk_xprt, 0);
+ goto err_noclose;
+err_nuts:
+ svsk->sk_datalen = 0;
+err_delete:
+ trace_svcsock_tcp_recv_err(&svsk->sk_xprt, len);
+ svc_xprt_deferred_close(&svsk->sk_xprt);
+err_noclose:
+ svc_xprt_received(rqstp->rq_xprt);
+ return 0;
+}
+
/**
* svc_tcp_recvfrom - Receive data from a TCP socket
* @rqstp: request structure into which to receive an RPC Call
@@ -1152,6 +1394,9 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
__be32 *p;
__be32 calldir;
+ if (svsk->sk_sock->ops->read_sock_cmsg)
+ return svc_tcp_recvfrom_readsock(rqstp);
+
clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
len = svc_tcp_read_marker(svsk, rqstp);
if (len < 0)
--
2.53.0