* [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-17 22:20 [RFC PATCH 0/4] ->read_sock with cmsg Chuck Lever
@ 2026-02-17 22:20 ` Chuck Lever
2026-02-18 7:29 ` Hannes Reinecke
2026-02-17 22:20 ` [RFC PATCH 2/4] tls: Implement read_sock_cmsg for kTLS software path Chuck Lever
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2026-02-17 22:20 UTC (permalink / raw)
To: Hannes Reinecke, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Kernel TCP consumers that use the read_sock interface
(proto_ops.read_sock) cannot receive TLS control messages (Alerts,
Handshake records) when kTLS is active. The current
tls_sw_read_sock() method rejects non-data records with -EINVAL, and
the sk_read_actor_t callback has no channel for delivering record-
type metadata.
Four kernel subsystems are affected: NFSD (sunrpc svcsock), NFS
client (sunrpc xprtsock), NVMe target (nvmet-tcp), and NVMe host
(nvme-tcp). Each of these either falls back to the slower
sock_recvmsg() API or lacks TLS alert handling entirely.
A new read_sock_cmsg method in struct proto_ops provides a separate
code path that delivers non-data TLS records to a dedicated callback,
leaving the behavior of the existing read_sock method unchanged for
its current consumers.
The new sk_read_cmsg_actor_t callback type extends the
sk_read_actor_t signature with a content_type parameter carrying the
protocol-layer record type (for example, TLS_RECORD_TYPE_ALERT).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/net.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/linux/net.h b/include/linux/net.h
index f58b38ab37f8..94eb1c3c1cb6 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -154,6 +154,10 @@ struct sk_buff;
struct proto_accept_arg;
typedef int (*sk_read_actor_t)(read_descriptor_t *, struct sk_buff *,
unsigned int, size_t);
+typedef int (*sk_read_cmsg_actor_t)(read_descriptor_t *,
+ struct sk_buff *,
+ unsigned int, size_t,
+ u8 content_type);
typedef int (*skb_read_actor_t)(struct sock *, struct sk_buff *);
@@ -218,6 +222,10 @@ struct proto_ops {
*/
int (*read_sock)(struct sock *sk, read_descriptor_t *desc,
sk_read_actor_t recv_actor);
+ int (*read_sock_cmsg)(struct sock *sk,
+ read_descriptor_t *desc,
+ sk_read_actor_t recv_actor,
+ sk_read_cmsg_actor_t cmsg_actor);
/* This is different from read_sock(), it reads an entire skb at a time. */
int (*read_skb)(struct sock *sk, skb_read_actor_t recv_actor);
int (*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
--
2.53.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-17 22:20 ` [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery Chuck Lever
@ 2026-02-18 7:29 ` Hannes Reinecke
2026-02-18 14:33 ` Chuck Lever
0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2026-02-18 7:29 UTC (permalink / raw)
To: Chuck Lever, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
On 2/17/26 23:20, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Kernel TCP consumers that use the read_sock interface
> (proto_ops.read_sock) cannot receive TLS control messages (Alerts,
> Handshake records) when kTLS is active. The current
> tls_sw_read_sock() method rejects non-data records with -EINVAL, and
> the sk_read_actor_t callback has no channel for delivering record-
> type metadata.
>
> Four kernel subsystems are affected: NFSD (sunrpc svcsock), NFS
> client (sunrpc xprtsock), NVMe target (nvmet-tcp), and NVMe host
> (nvme-tcp). Each of these either falls back to the slower
> sock_recvmsg() API or lacks TLS alert handling entirely.
>
And that is the contentious topic.
Is it really slower?
The way I see it, ->read_sock() has an advantage when you can work
with skbuffs (ie network packets) directly. Then it has the benefit
of being paced with the packets arriving as they do on the network.
But for TLS this is no longer true; the TLS skbuffs are synthesized
on top of the TCP stream, and the boundary of a TLS skbuff is completely
decoupled from the TCP skbuff (in theory, at least).
This can result in TLS skbuff stalls when waiting for the remainder
of the TCP data stream to arrive.
So for TLS I _think_ it's beneficial to use recvmsg, as there you have
the full set of options available.
And actually I have a patch switching nvme over to recvmsg(), but that
has stalled due to precisely this question: is read_sock() inherently
faster than recvmsg()? Or is that just perception (we are closer
to the hardware, so it _must_ be faster)?
(Speaking as the author of the ->read_sock() callback for TLS.
I never really liked it ;-)
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-18 7:29 ` Hannes Reinecke
@ 2026-02-18 14:33 ` Chuck Lever
2026-02-18 15:52 ` Hannes Reinecke
0 siblings, 1 reply; 14+ messages in thread
From: Chuck Lever @ 2026-02-18 14:33 UTC (permalink / raw)
To: Hannes Reinecke, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
On 2/18/26 2:29 AM, Hannes Reinecke wrote:
> On 2/17/26 23:20, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Kernel TCP consumers that use the read_sock interface
>> (proto_ops.read_sock) cannot receive TLS control messages (Alerts,
>> Handshake records) when kTLS is active. The current
>> tls_sw_read_sock() method rejects non-data records with -EINVAL, and
>> the sk_read_actor_t callback has no channel for delivering record-
>> type metadata.
>>
>> Four kernel subsystems are affected: NFSD (sunrpc svcsock), NFS
>> client (sunrpc xprtsock), NVMe target (nvmet-tcp), and NVMe host
>> (nvme-tcp). Each of these either falls back to the slower
>> sock_recvmsg() API or lacks TLS alert handling entirely.
>>
> And that is the contentious topic.
> Is it really slower?
I measured 5-10% throughput increases and latency drops with NFSD,
/without/ TLS. The primary benefit of this series is that it handles
CMSG (TLS Alerts) much more cleanly.
> The way I see it, ->read_sock() has an advantage when you can work
> with skbuffs (ie network packets) directly. Then it has the benefit
> of being paced with the packets arriving as they do on the network.
>
> But for TLS this is no longer true; the TLS skbuffs are synthesized
> on top of the TCP stream, and the boundary of a TLS skbuff is completely
> decoupled from the TCP skbuff (in theory, at least).
>
> This can result in TLS skbuff stalls when waiting for the remainder
> of the TCP data stream to arrive.
An entire TLS Record has to be present before it can be decrypted and
passed to the socket consumer. Is that what you mean?
> So for TLS I _think_ it's beneficial to use recvmsg, as there you have
> the full set of options available.
>
> And actually I have a patch switching nvme over to recvmsg(), but that
> has stalled due to precisely this question: read_sock() inherently
> faster than recvmsg? Or is that just the perception (we are closer
> to the hardware, so it _must_ be faster)?
>
> (Speaking as the author of the ->read_sock() callback for TLS.
> I never really liked it ;-)
>
> Cheers,
>
> Hannes
--
Chuck Lever
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-18 14:33 ` Chuck Lever
@ 2026-02-18 15:52 ` Hannes Reinecke
2026-02-18 16:12 ` Chuck Lever
0 siblings, 1 reply; 14+ messages in thread
From: Hannes Reinecke @ 2026-02-18 15:52 UTC (permalink / raw)
To: Chuck Lever, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
On 2/18/26 15:33, Chuck Lever wrote:
> On 2/18/26 2:29 AM, Hannes Reinecke wrote:
>> On 2/17/26 23:20, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> Kernel TCP consumers that use the read_sock interface
>>> (proto_ops.read_sock) cannot receive TLS control messages (Alerts,
>>> Handshake records) when kTLS is active. The current
>>> tls_sw_read_sock() method rejects non-data records with -EINVAL, and
>>> the sk_read_actor_t callback has no channel for delivering record-
>>> type metadata.
>>>
>>> Four kernel subsystems are affected: NFSD (sunrpc svcsock), NFS
>>> client (sunrpc xprtsock), NVMe target (nvmet-tcp), and NVMe host
>>> (nvme-tcp). Each of these either falls back to the slower
>>> sock_recvmsg() API or lacks TLS alert handling entirely.
>>>
>> And that is the contentious topic.
>> Is it really slower?
>
> I measured 5-10% throughput increases and latency drops with NFSD,
> /without/ TLS. The primary benefit of this series is that it handles
> CMSG (TLS Alerts) much more cleanly.
>
>
>> The way I see it, ->read_sock() has an advantage when you can work
>> with skbuffs (ie network packets) directly. Then it has the benefit
>> of being paced with the packets arriving as they do on the network.
>>
>> But for TLS this is no longer true; the TLS skbuffs are synthesized
>> on top of the TCP stream, and the boundary of a TLS skbuff is completely
>> decoupled from the TCP skbuff (in theory, at least).
>>
>> This can result in TLS skbuff stalls when waiting for the remainder
>> of the TCP data stream to arrive.
>
> An entire TLS Record has to be present before it can be decrypted and
> passed to the socket consumer. Is that what you mean?
>
Yes. And that makes ->read_sock() a bit pointless (from my POV):
looking at the network stack, the skbuffs are assembled and decoded
by the TLS layer, only to be converted into skbuffs again for
->read_sock().
But the main point with ->read_sock() is that we can only process
skbuffs, ie we have to wait for the TLS layer to assemble the entire
record.
With recvmsg() we at least have a theoretical choice of returning
a short read, allowing the code to do something clever in the meantime.
Mind you, that is all conjecture. Might well be that ->read_sock()
is beneficial for TLS, too. Or, at least, doesn't do any harm.
It's just that I don't trust the author of the TLS read_sock()
implementation; his network stack knowledge isn't _that_ great.
(Problem is that my performance measurements always ran into
some obnoxious occasional stall, rendering the entire measurement
pretty worthless. So I couldn't tell whether the recvmsg()
implementation delivers a benefit or not)
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-18 15:52 ` Hannes Reinecke
@ 2026-02-18 16:12 ` Chuck Lever
2026-02-19 4:06 ` Alistair Francis
2026-02-19 8:10 ` Hannes Reinecke
0 siblings, 2 replies; 14+ messages in thread
From: Chuck Lever @ 2026-02-18 16:12 UTC (permalink / raw)
To: Hannes Reinecke, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
On Wed, Feb 18, 2026, at 10:52 AM, Hannes Reinecke wrote:
> On 2/18/26 15:33, Chuck Lever wrote:
>> On 2/18/26 2:29 AM, Hannes Reinecke wrote:
>>> The way I see it, ->read_sock() has an advantage when you can work
>>> with skbuffs (ie network packets) directly. Then it has the benefit
>>> of being paced with the packets arriving as they do on the network.
>>>
>>> But for TLS this is no longer true; the TLS skbuffs are synthesized
>>> on top of the TCP stream, and the boundary of a TLS skbuff is completely
>>> decoupled from the TCP skbuff (in theory, at least).
>>>
>>> This can result in TLS skbuff stalls when waiting for the remainder
>>> of the TCP data stream to arrive.
>>
>> An entire TLS Record has to be present before it can be decrypted and
>> passed to the socket consumer. Is that what you mean?
>>
> Yes. And that makes ->read_sock() a bit pointless (from my POV); looking
> at the network stack the skbuffs are assembled/decoded by the TLS layer,
> only to be converted into skbuffs again for ->read_sock().
> Which seems a bit pointless.
Is this true only of the software kTLS implementation? For
offload, I'd think the decrypted records are placed directly
in the skbs that are handed to consumers.
> But the main point with ->read_sock() is that we can only process
> skbuffs, ie we have to wait for the TLS layer to assemble the entire
> record.
> With recvmsg() we at least have a theoretical choice of returning
> a short read, allowing the code to do something clever in the meantime.
Do you mean that read_sock is always synchronous? Perhaps
judicious construction of the consumer can help there; but
I expect that plain TCP has similar pathologies during, for
instance, a network partition.
> Mind you, that is all conjecture. Might well be that ->read_sock()
> is beneficial for TLS, too. Or, at least, doesn't do any harm.
> It's just that I don't trust the author of the TLS read_sock()
> implementation; his network stack knowledge isn't _that_ great.
Time to get out our measuring sticks, I suppose.
> (Problem is that my performance measurements always ran into
> some obnoxious occasional stall, rendering the entire measurement
> pretty worthless. So I couldn't tell whether the recvmsg()
> implementation delivers a benefit or not)
Generally speaking (ie, hand-wave mode) that kind of stall arises
because the consumer expects it will get another data_ready and the
network layer does not conform with the implementer's expectations.
So I like the read_sock_cmsg design here because with recvmsg,
the CMSG payload can appear in the consumer's data buffer. For
direct I/O, that's bad. We also really don't want to place the
TLS Alerts in page cache pages. With read_sock_cmsg, the CMSG
payload can be efficiently directed to an independent buffer
that is never visible to user space.
And I prefer having a single receive path for both non-TLS and
TLS processing. That's so much less code to deal with.
This approach seems to me to be more architecturally sound.
We just have to get over the implementation bumps IMO. I think
we can certainly examine ->read_sock with software kTLS to see
if there's room for improvement.
--
Chuck Lever
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-18 16:12 ` Chuck Lever
@ 2026-02-19 4:06 ` Alistair Francis
2026-02-19 8:05 ` Hannes Reinecke
2026-02-19 8:10 ` Hannes Reinecke
1 sibling, 1 reply; 14+ messages in thread
From: Alistair Francis @ 2026-02-19 4:06 UTC (permalink / raw)
To: Chuck Lever
Cc: Hannes Reinecke, Olga Kornievskaia, kernel-tls-handshake,
Chuck Lever
On Thu, Feb 19, 2026 at 2:12 AM Chuck Lever <cel@kernel.org> wrote:
>
>
>
> On Wed, Feb 18, 2026, at 10:52 AM, Hannes Reinecke wrote:
> > On 2/18/26 15:33, Chuck Lever wrote:
> >> On 2/18/26 2:29 AM, Hannes Reinecke wrote:
>
> >>> The way I see it, ->read_sock() has an advantage when you can work
> >>> with skbuffs (ie network packets) directly. Then it has the benefit
> >>> of being paced with the packets arriving as they do on the network.
> >>>
> >>> But for TLS this is no longer true; the TLS skbuffs are synthesized
> >>> on top of the TCP stream, and the boundary of a TLS skbuff is completely
> >>> decoupled from the TCP skbuff (in theory, at least).
> >>>
> >>> This can result in TLS skbuff stalls when waiting for the remainder
> >>> of the TCP data stream to arrive.
> >>
> >> An entire TLS Record has to be present before it can be decrypted and
> >> passed to the socket consumer. Is that what you mean?
> >>
> > Yes. And that makes ->read_sock() a bit pointless (from my POV); looking
> > at the network stack the skbuffs are assembled/decoded by the TLS layer,
> > only to be converted into skbuffs again for ->read_sock().
> > Which seems a bit pointless.
>
> Is this true only of the software kTLS implementation? For
> offload, I'd think the decrypted records are placed directly
> in the skbs that are handed to consumers.
>
>
> > But the main point with ->read_sock() is that we can only process
> > skbuffs, ie we have to wait for the TLS layer to assemble the entire
> > record.
> > With recvmsg() we at least have a theoretical choice of returning
> > a short read, allowing the code to do something clever in the meantime.
>
> Do you mean that read_sock is always synchronous? Perhaps
> judicious construction of the consumer can help there; but
> I expect that plain TCP has similar pathologies during, for
> instance, a network partition.
>
>
> > Mind you, that is all conjecture. Might well be that ->read_sock()
> > is beneficial for TLS, too. Or, at least, doesn't do any harm.
> > It's just that I don't trust the author of the TLS read_sock()
> > implementation; his network stack knowledge isn't _that_ great.
>
> Time to get out our measuring sticks, I suppose.
>
>
> > (Problem is that my performance measurements always ran into
> > some obnoxious occasional stall, rendering the entire measurement
> > pretty worthless. So I couldn't tell whether the recvmsg()
> > implementation delivers a benefit or not)
>
> Generally speaking (ie, hand-wave mode) that kind of stall arises
> because the consumer expects it will get another data_ready and the
> network layer does not conform with the implementer's expectations.
>
> So I like the read_sock_cmsg design here because with recvmsg,
> the CMSG payload can appear in the consumer's data buffer. For
> direct I/O, that's bad. We also really don't want to place the
> TLS Alerts in page cache pages. With read_sock_cmsg, the CMSG
> payload can be efficiently directed to an independent buffer
> that is never visible to user space.
>
> And I prefer having a single receive path for both non-TLS and
> TLS processing. That's so much less code to deal with.
>
> This approach seems to me to be more architecturally sound.
> We just have to get over the implementation bumps IMO. I think
> we can certainly examine ->read_sock with software kTLS to see
> if there's room for improvement.
I completely agree!
The current read_sock approach is very brittle: as soon as a control
message appears, the whole thing falls over. In the NVMe world we are
seeing devices issuing Key-Updates, which is breaking things.
So far the approach has been to convert to sock_recvmsg() instead
of ->read_sock(), but at least for the nvme-host conversion (not
upstream) that has exposed a range of subtle, hard-to-narrow-down
bugs. All this churn just to handle a few control messages.
This approach (in this series) seems like a great step toward handling
control messages without having to redesign existing, otherwise working
recv flows around sock_recvmsg(), which doesn't seem to have any
benefit besides handling control messages.
Alistair
>
> --
> Chuck Lever
>
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-19 4:06 ` Alistair Francis
@ 2026-02-19 8:05 ` Hannes Reinecke
0 siblings, 0 replies; 14+ messages in thread
From: Hannes Reinecke @ 2026-02-19 8:05 UTC (permalink / raw)
To: Alistair Francis, Chuck Lever
Cc: Olga Kornievskaia, kernel-tls-handshake, Chuck Lever
On 2/19/26 05:06, Alistair Francis wrote:
> On Thu, Feb 19, 2026 at 2:12 AM Chuck Lever <cel@kernel.org> wrote:
>>
>>
>>
>> On Wed, Feb 18, 2026, at 10:52 AM, Hannes Reinecke wrote:
>>> On 2/18/26 15:33, Chuck Lever wrote:
>>>> On 2/18/26 2:29 AM, Hannes Reinecke wrote:
>>
>>>>> The way I see it, ->read_sock() has an advantage when you can work
>>>>> with skbuffs (ie network packets) directly. Then it has the benefit
>>>>> of being paced with the packets arriving as they do on the network.
>>>>>
>>>>> But for TLS this is no longer true; the TLS skbuffs are synthesized
>>>>> on top of the TCP stream, and the boundary of a TLS skbuff is completely
>>>>> decoupled from the TCP skbuff (in theory, at least).
>>>>>
>>>>> This can result in TLS skbuff stalls when waiting for the remainder
>>>>> of the TCP data stream to arrive.
>>>>
>>>> An entire TLS Record has to be present before it can be decrypted and
>>>> passed to the socket consumer. Is that what you mean?
>>>>
>>> Yes. And that makes ->read_sock() a bit pointless (from my POV); looking
>>> at the network stack the skbuffs are assembled/decoded by the TLS layer,
>>> only to be converted into skbuffs again for ->read_sock().
>>> Which seems a bit pointless.
>>
>> Is this true only of the software kTLS implementation? For
>> offload, I'd think the decrypted records are placed directly
>> in the skbs that are handed to consumers.
>>
>>
>>> But the main point with ->read_sock() is that we can only process
>>> skbuffs, ie we have to wait for the TLS layer to assemble the entire
>>> record.
>>> With recvmsg() we at least have a theoretical choice of returning
>>> a short read, allowing the code to do something clever in the meantime.
>>
>> Do you mean that read_sock is always synchronous? Perhaps
>> judicious construction of the consumer can help there; but
>> I expect that plain TCP has similar pathologies during, for
>> instance, a network partition.
>>
>>
>>> Mind you, that is all conjecture. Might well be that ->read_sock()
>>> is beneficial for TLS, too. Or, at least, doesn't do any harm.
>>> It's just that I don't trust the author of the TLS read_sock()
>>> implementation; his network stack knowledge isn't _that_ great.
>>
>> Time to get out our measuring sticks, I suppose.
>>
>>
>>> (Problem is that my performance measurements always ran into
>>> some obnoxious occasional stall, rendering the entire measurement
>>> pretty worthless. So I couldn't tell whether the recvmsg()
>>> implementation delivers a benefit or not)
>>
>> Generally speaking (ie, hand-wave mode) that kind of stall arises
>> because the consumer expects it will get another data_ready and the
>> network layer does not conform with the implementer's expectations.
>>
>> So I like the read_sock_cmsg design here because with recvmsg,
>> the CMSG payload can appear in the consumer's data buffer. For
>> direct I/O, that's bad. We also really don't want to place the
>> TLS Alerts in page cache pages. With read_sock_cmsg, the CMSG
>> payload can be efficiently directed to an independent buffer
>> that is never visible to user space.
>>
>> And I prefer having a single receive path for both non-TLS and
>> TLS processing. That's so much less code to deal with.
>>
>> This approach seems to me to be more architecturally sound.
>> We just have to get over the implementation bumps IMO. I think
>> we can certainly examine ->read_sock with software kTLS to see
>> if there's room for improvement.
>
> I completely agree!
>
> The current read_sock approach is very brittle, as soon as a control
> message appears the whole thing falls over. In the NVMe world we are
> seeing devices issuing Key-Updates, which is breaking things.
>
Indeed. That's why I thought recvmsg() would be a better fit.
> So far the approach has been to convert to use sock_recvmsg() instead
> of ->read_sock(), but at least for the nvme-host conversion (not
> upstream) that has exposed a range of subtle and hard to narrow down
> bugs. All this churn just to handle a few control messages.
>
Ah; didn't know this. Was wondering why you had kept quiet, but that
explains it.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-18 16:12 ` Chuck Lever
2026-02-19 4:06 ` Alistair Francis
@ 2026-02-19 8:10 ` Hannes Reinecke
2026-02-19 13:59 ` Chuck Lever
2026-02-28 11:09 ` Alistair Francis
1 sibling, 2 replies; 14+ messages in thread
From: Hannes Reinecke @ 2026-02-19 8:10 UTC (permalink / raw)
To: Chuck Lever, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
On 2/18/26 17:12, Chuck Lever wrote:
>
>
> On Wed, Feb 18, 2026, at 10:52 AM, Hannes Reinecke wrote:
>> On 2/18/26 15:33, Chuck Lever wrote:
>>> On 2/18/26 2:29 AM, Hannes Reinecke wrote:
>
>>>> The way I see it, ->read_sock() has an advantage when you can work
>>>> with skbuffs (ie network packets) directly. Then it has the benefit
>>>> of being paced with the packets arriving as they do on the network.
>>>>
>>>> But for TLS this is no longer true; the TLS skbuffs are synthesized
>>>> on top of the TCP stream, and the boundary of a TLS skbuff is completely
>>>> decoupled from the TCP skbuff (in theory, at least).
>>>>
>>>> This can result in TLS skbuff stalls when waiting for the remainder
>>>> of the TCP data stream to arrive.
>>>
>>> An entire TLS Record has to be present before it can be decrypted and
>>> passed to the socket consumer. Is that what you mean?
>>>
>> Yes. And that makes ->read_sock() a bit pointless (from my POV); looking
>> at the network stack the skbuffs are assembled/decoded by the TLS layer,
>> only to be converted into skbuffs again for ->read_sock().
>> Which seems a bit pointless.
>
> Is this true only of the software kTLS implementation? For
> offload, I'd think the decrypted records are placed directly
> in the skbs that are handed to consumers.
>
>
>> But the main point with ->read_sock() is that we can only process
>> skbuffs, ie we have to wait for the TLS layer to assemble the entire
>> record.
>> With recvmsg() we at least have a theoretical choice of returning
>> a short read, allowing the code to do something clever in the meantime.
>
> Do you mean that read_sock is always synchronous? Perhaps
> judicious construction of the consumer can help there; but
> I expect that plain TCP has similar pathologies during, for
> instance, a network partition.
>
>
>> Mind you, that is all conjecture. Might well be that ->read_sock()
>> is beneficial for TLS, too. Or, at least, doesn't do any harm.
>> It's just that I don't trust the author of the TLS read_sock()
>> implementation; his network stack knowledge isn't _that_ great.
>
> Time to get out our measuring sticks, I suppose.
>
>
>> (Problem is that my performance measurements always ran into
>> some obnoxious occasional stall, rendering the entire measurement
>> pretty worthless. So I couldn't tell whether the recvmsg()
>> implementation delivers a benefit or not)
>
> Generally speaking (ie, hand-wave mode) that kind of stall arises
> because the consumer expects it will get another data_ready and the
> network layer does not conform with the implementer's expectations.
>
Naa. Tested that one. I've seen stalls affecting _all_ queues at
the same time, at regular intervals spaced by powers of two.
Current suspicion is HW-based handling of LLDP frames.
> So I like the read_sock_cmsg design here because with recvmsg,
> the CMSG payload can appear in the consumer's data buffer. For
> direct I/O, that's bad. We also really don't want to place the
> TLS Alerts in page cache pages. With read_sock_cmsg, the CMSG
> payload can be efficiently directed to an independent buffer
> that is never visible to user space.
>
> And I prefer having a single receive path for both non-TLS and
> TLS processing. That's so much less code to deal with.
>
> This approach seems to me to be more architecturally sound.
> We just have to get over the implementation bumps IMO. I think
> we can certainly examine ->read_sock with software kTLS to see
> if there's room for improvement.
>
And that, I guess, is the real argument. We've seen with the
recent TLS alert CVE that posting TLS messages in the recvmsg
payload is a _bad_ idea, as this puts far too many restrictions
on the format of the message.
So you see me convinced. I'll give it a go to convert nvme-tcp.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-19 8:10 ` Hannes Reinecke
@ 2026-02-19 13:59 ` Chuck Lever
2026-02-28 11:09 ` Alistair Francis
1 sibling, 0 replies; 14+ messages in thread
From: Chuck Lever @ 2026-02-19 13:59 UTC (permalink / raw)
To: Hannes Reinecke, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
On 2/19/26 3:10 AM, Hannes Reinecke wrote:
> On 2/18/26 17:12, Chuck Lever wrote:
>>
>>
>> On Wed, Feb 18, 2026, at 10:52 AM, Hannes Reinecke wrote:
>>> On 2/18/26 15:33, Chuck Lever wrote:
>>>> On 2/18/26 2:29 AM, Hannes Reinecke wrote:
>>
>>>>> The way I see it, ->read_sock() has an advantage when you can work
>>>>> with skbuffs (ie network packets) directly. Then it has the benefit
>>>>> of being paced with the packets arriving as they do on the network.
>>>>>
>>>>> But for TLS this is no longer true; the TLS skbuffs are synthesized
>>>>> on top of the TCP stream, and the boundary of a TLS skbuff is
>>>>> completely
>>>>> decoupled from the TCP skbuff (in theory, at least).
>>>>>
>>>>> This can result in TLS skbuff stalls when waiting for the remainder
>>>>> of the TCP data stream to arrive.
>>>>
>>>> An entire TLS Record has to be present before it can be decrypted and
>>>> passed to the socket consumer. Is that what you mean?
>>>>
>>> Yes. And that makes ->read_sock() a bit pointless (from my POV); looking
>>> at the network stack the skbuffs are assembled/decoded by the TLS layer,
>>> only to be converted into skbuffs again for ->read_sock().
>>> Which seems a bit pointless.
>>
>> Is this true only of the software kTLS implementation? For
>> offload, I'd think the decrypted records are placed directly
>> in the skbs that are handed to consumers.
>>
>>
>>> But the main point with ->read_sock() is that we can only process
>>> skbuffs, ie we have to wait for the TLS layer to assemble the entire
>>> record.
>>> With recvmsg() we at least have a theoretical choice of returning
>>> a short read, allowing the code to do something clever in the meantime.
>>
>> Do you mean that read_sock is always synchronous? Perhaps
>> judicious construction of the consumer can help there; but
>> I expect that plain TCP has similar pathologies during, for
>> instance, a network partition.
>>
>>
>>> Mind you, that is all conjecture. Might well be that ->read_sock()
>>> is beneficial for TLS, too. Or, at least, doesn't do any harm.
>>> It's just that I don't trust the author of the TLS read_sock()
>>> implementation; his network stack knowledge isn't _that_ great.
>>
>> Time to get out our measuring sticks, I suppose.
>>
>>
>>> (Problem is that my performance measurements always ran into
>>> some obnoxious occasional stall, rendering the entire measurement
>>> pretty worthless. So I couldn't tell whether the recvmsg()
>>> implementation delivers a benefit or not)
>>
>> Generally speaking (ie, hand-wave mode) that kind of stall arises
>> because the consumer expects it will get another data_ready and the
>> network layer does not conform with the implementer's expectations.
>>
> Naa. Tested that one. I've seen stalls affecting _all_ queues at
> the same time, in regular intervals spaced with the power of 2.
> Current reasoning is HW-based LLDP frames handling.
>
>> So I like the read_sock_cmsg design here because with recvmsg,
>> the CMSG payload can appear in the consumer's data buffer. For
>> direct I/O, that's bad. We also really don't want to place the
>> TLS Alerts in page cache pages. With read_sock_cmsg, the CMSG
>> payload can be efficiently directed to an independent buffer
>> that is never visible to user space.
>>
>> And I prefer having a single receive path for both non-TLS and
>> TLS processing. That's so much less code to deal with.
>>
>> This approach seems to me to be more architecturally sound.
>> We just have to get over the implementation bumps IMO. I think
>> we can certainly examine ->read_sock with software kTLS to see
>> if there's room for improvement.
>>
> And that, I guess, is the real argument. We've seen with the
> recent TLS alert CVE that posting TLS messages in the recvmsg
> payload is a _bad_ idea, as this puts far too many restrictions
> on the format of the message.
>
> So you see me convinced. I'll give it a go to convert nvme-tcp.
Some exploration of the code base has borne out the concerns you
raised above. I'm going to try to address those before pushing
forward with read_sock_cmsg. Stand by ...
--
Chuck Lever
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery
2026-02-19 8:10 ` Hannes Reinecke
2026-02-19 13:59 ` Chuck Lever
@ 2026-02-28 11:09 ` Alistair Francis
1 sibling, 0 replies; 14+ messages in thread
From: Alistair Francis @ 2026-02-28 11:09 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Chuck Lever, Olga Kornievskaia, kernel-tls-handshake, Chuck Lever
On Thu, Feb 19, 2026 at 6:10 PM Hannes Reinecke <hare@suse.de> wrote:
>
> On 2/18/26 17:12, Chuck Lever wrote:
> >
> >
> > On Wed, Feb 18, 2026, at 10:52 AM, Hannes Reinecke wrote:
> >> On 2/18/26 15:33, Chuck Lever wrote:
> >>> On 2/18/26 2:29 AM, Hannes Reinecke wrote:
> >
> >>>> The way I see it, ->read_sock() has an advantage when you can work
> >>>> with skbuffs (ie network packets) directly. Then it has the benefit
> >>>> of being paced with the packets arriving as they do on the network.
> >>>>
> >>>> But for TLS this is no longer true; the TLS skbuffs are synthesized
> >>>> on top of the TCP stream, and the boundary of a TLS skbuff is completely
> >>>> decoupled from the TCP skbuff (in theory, at least).
> >>>>
> >>>> This can result in TLS skbuff stalls when waiting for the remainder
> >>>> of the TCP data stream to arrive.
> >>>
> >>> An entire TLS Record has to be present before it can be decrypted and
> >>> passed to the socket consumer. Is that what you mean?
> >>>
> >> Yes. And that makes ->read_sock() a bit pointless (from my POV); looking
> >> at the network stack the skbuffs are assembled/decoded by the TLS layer,
> >> only to be converted into skbuffs again for ->read_sock().
> >> Which seems a bit pointless.
> >
> > Is this true only of the software kTLS implementation? For
> > offload, I'd think the decrypted records are placed directly
> > in the skbs that are handed to consumers.
> >
> >
> >> But the main point with ->read_sock() is that we can only process
> >> skbuffs, ie we have to wait for the TLS layer to assemble the entire
> >> record.
> >> With recvmsg() we at least have a theoretical choice of returning
> >> a short read, allowing the code to do something clever in the meantime.
> >
> > Do you mean that read_sock is always synchronous? Perhaps
> > judicious construction of the consumer can help there; but
> > I expect that plain TCP has similar pathologies during, for
> > instance, a network partition.
> >
> >
> >> Mind you, that is all conjecture. Might well be that ->read_sock()
> >> is beneficial for TLS, too. Or, at least, doesn't do any harm.
> >> It's just that I don't trust the author of the TLS read_sock()
> >> implementation; his network stack knowledge isn't _that_ great.
> >
> > Time to get out our measuring sticks, I suppose.
> >
> >
> >> (Problem is that my performance measurements always ran into
> >> some obnoxious occasional stall, rendering the entire measurement
> >> pretty worthless. So I couldn't tell whether the recvmsg()
> >> implementation delivers a benefit or not)
> >
> > Generally speaking (ie, hand-wave mode) that kind of stall arises
> > because the consumer expects it will get another data_ready and the
> > network layer does not conform with the implementer's expectations.
> >
> Naa. Tested that one. I've seen stalls affecting _all_ queues at
> the same time, at regular intervals spaced by powers of 2.
> Current theory is HW-based LLDP frame handling.
>
> > So I like the read_sock_cmsg design here because with recvmsg,
> > the CMSG payload can appear in the consumer's data buffer. For
> > direct I/O, that's bad. We also really don't want to place the
> > TLS Alerts in page cache pages. With read_sock_cmsg, the CMSG
> > payload can be efficiently directed to an independent buffer
> > that is never visible to user space.
> >
> > And I prefer having a single receive path for both non-TLS and
> > TLS processing. That's so much less code to deal with.
> >
> > This approach seems to me to be more architecturally sound.
> > We just have to get over the implementation bumps IMO. I think
> > we can certainly examine ->read_sock with software kTLS to see
> > if there's room for improvement.
> >
> And that, I guess, is the real argument. We've seen with the
> recent TLS alert CVE that posting TLS messages in the recvmsg
> payload is a _bad_ idea, as this puts far too many restrictions
> on the format of the message.
>
> So you see me convinced. I'll give it a go to convert nvme-tcp.
Just a quick update that I have done this for nvme-tcp with KeyUpdate.
Just chasing down one last issue. Hopefully I can send some patches in
the next week or two.
Alistair
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> hare@suse.de +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [RFC PATCH 2/4] tls: Implement read_sock_cmsg for kTLS software path
2026-02-17 22:20 [RFC PATCH 0/4] ->read_sock with cmsg Chuck Lever
2026-02-17 22:20 ` [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery Chuck Lever
@ 2026-02-17 22:20 ` Chuck Lever
2026-02-17 22:20 ` [RFC PATCH 3/4] sunrpc: Use read_sock_cmsg for svcsock TCP receives Chuck Lever
2026-02-17 22:20 ` [RFC PATCH 4/4] sunrpc: Remove sock_recvmsg path from " Chuck Lever
3 siblings, 0 replies; 14+ messages in thread
From: Chuck Lever @ 2026-02-17 22:20 UTC (permalink / raw)
To: Hannes Reinecke, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
tls_sw_read_sock() rejects non-data records (alerts, handshake
messages) with -EINVAL. Kernel consumers that need TLS alert
delivery -- such as NFSD, NFS client, and NVMe target -- must fall
back to the slower sock_recvmsg() API to receive control messages
via CMSG.
Implement the new read_sock_cmsg() method for the kTLS software
path so these consumers can receive control records without
falling back to sock_recvmsg().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls.h | 3 +++
net/tls/tls_main.c | 2 ++
net/tls/tls_sw.c | 33 ++++++++++++++++++++++++++++-----
3 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/net/tls/tls.h b/net/tls/tls.h
index 2f86baeb71fc..2e1581b6ca25 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -168,6 +168,9 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
size_t len, unsigned int flags);
int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
sk_read_actor_t read_actor);
+int tls_sw_read_sock_cmsg(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t read_actor,
+ sk_read_cmsg_actor_t cmsg_actor);
int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
void tls_device_splice_eof(struct socket *sock);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 56ce0bc8317b..40163d7baab4 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -946,11 +946,13 @@ static void build_proto_ops(struct proto_ops ops[TLS_NUM_CONFIG][TLS_NUM_CONFIG]
ops[TLS_BASE][TLS_SW ].splice_read = tls_sw_splice_read;
ops[TLS_BASE][TLS_SW ].poll = tls_sk_poll;
ops[TLS_BASE][TLS_SW ].read_sock = tls_sw_read_sock;
+ ops[TLS_BASE][TLS_SW ].read_sock_cmsg = tls_sw_read_sock_cmsg;
ops[TLS_SW ][TLS_SW ] = ops[TLS_SW ][TLS_BASE];
ops[TLS_SW ][TLS_SW ].splice_read = tls_sw_splice_read;
ops[TLS_SW ][TLS_SW ].poll = tls_sk_poll;
ops[TLS_SW ][TLS_SW ].read_sock = tls_sw_read_sock;
+ ops[TLS_SW ][TLS_SW ].read_sock_cmsg = tls_sw_read_sock_cmsg;
#ifdef CONFIG_TLS_DEVICE
ops[TLS_HW ][TLS_BASE] = ops[TLS_BASE][TLS_BASE];
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 9937d4c810f2..e45352b167c4 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2325,8 +2325,9 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
goto splice_read_end;
}
-int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
- sk_read_actor_t read_actor)
+static int __tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t read_actor,
+ sk_read_cmsg_actor_t cmsg_actor)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
@@ -2387,10 +2388,19 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
tls_rx_rec_done(ctx);
}
- /* read_sock does not support reading control messages */
if (tlm->control != TLS_RECORD_TYPE_DATA) {
- err = -EINVAL;
- goto read_sock_requeue;
+ if (!cmsg_actor) {
+ err = -EINVAL;
+ goto read_sock_requeue;
+ }
+ err = cmsg_actor(desc, skb, rxm->offset,
+ rxm->full_len, tlm->control);
+ if (err < 0)
+ goto read_sock_requeue;
+ consume_skb(skb);
+ if (!desc->count)
+ skb = NULL;
+ continue;
}
used = read_actor(desc, skb, rxm->offset, rxm->full_len);
@@ -2421,6 +2431,19 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
goto read_sock_end;
}
+int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t read_actor)
+{
+ return __tls_sw_read_sock(sk, desc, read_actor, NULL);
+}
+
+int tls_sw_read_sock_cmsg(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t read_actor,
+ sk_read_cmsg_actor_t cmsg_actor)
+{
+ return __tls_sw_read_sock(sk, desc, read_actor, cmsg_actor);
+}
+
bool tls_sw_sock_is_readable(struct sock *sk)
{
struct tls_context *tls_ctx = tls_get_ctx(sk);
--
2.53.0
^ permalink raw reply related	[flat|nested] 14+ messages in thread
* [RFC PATCH 3/4] sunrpc: Use read_sock_cmsg for svcsock TCP receives
2026-02-17 22:20 [RFC PATCH 0/4] ->read_sock with cmsg Chuck Lever
2026-02-17 22:20 ` [RFC PATCH 1/4] net: Introduce read_sock_cmsg proto_ops for control message delivery Chuck Lever
2026-02-17 22:20 ` [RFC PATCH 2/4] tls: Implement read_sock_cmsg for kTLS software path Chuck Lever
@ 2026-02-17 22:20 ` Chuck Lever
2026-02-17 22:20 ` [RFC PATCH 4/4] sunrpc: Remove sock_recvmsg path from " Chuck Lever
3 siblings, 0 replies; 14+ messages in thread
From: Chuck Lever @ 2026-02-17 22:20 UTC (permalink / raw)
To: Hannes Reinecke, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The svcsock TCP receive path uses sock_recvmsg() with ancillary data
buffers to detect TLS alerts when kTLS is active. This CMSG-based
approach has two drawbacks: the MSG_CTRUNC recovery dance adds
overhead to every receive, and sock_recvmsg() cannot take advantage
of zero-copy optimizations available through read_sock.
When the socket provides a read_sock_cmsg method (which kTLS now
sets), svc_tcp_recvfrom() dispatches to a new
svc_tcp_recvfrom_readsock() path. Two actor callbacks handle the
data:
svc_tcp_recv_actor() parses the RPC record byte stream directly from
skbs. Fragment header bytes fill sk_marker first; subsequent body
bytes are copied into rq_pages at the position tracked by
sk_datalen. When the last fragment of a complete RPC message
arrives, the actor sets desc->count to zero, stopping the read loop.
svc_tcp_cmsg_actor() handles non-data TLS records. For fatal alerts,
the transport is marked for deferred close. All non-data records
stop the read loop so callers can inspect the error before
continuing.
The existing sock_recvmsg() path remains as a fallback for sockets
without read_sock_cmsg (plain TCP, non-kTLS configurations).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svcsock.c | 245 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 245 insertions(+)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index d61cd9b40491..9600d15287e7 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1124,6 +1124,248 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
svsk->sk_marker = xdr_zero;
}
+/*
+ * read_sock_cmsg data actor: receives decrypted application data
+ * from the TLS layer, parsing the RPC record stream (fragment
+ * headers and message bodies) and assembling complete RPC messages
+ * into @rqstp->rq_pages.
+ */
+static int svc_tcp_recv_actor(read_descriptor_t *desc,
+ struct sk_buff *skb,
+ unsigned int offset, size_t len)
+{
+ struct svc_rqst *rqstp = desc->arg.data;
+ struct svc_sock *svsk =
+ container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
+ size_t consumed = 0;
+
+ /* Phase 1: consume fragment header bytes */
+ while (svsk->sk_tcplen < sizeof(rpc_fraghdr) && len > 0) {
+ size_t want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
+ size_t n = min(want, len);
+
+ if (skb_copy_bits(skb, offset,
+ (char *)&svsk->sk_marker +
+ svsk->sk_tcplen, n))
+ goto fault;
+ svsk->sk_tcplen += n;
+ offset += n;
+ len -= n;
+ consumed += n;
+
+ if (svsk->sk_tcplen < sizeof(rpc_fraghdr))
+ return consumed;
+
+ trace_svcsock_marker(&svsk->sk_xprt, svsk->sk_marker);
+ if (svc_sock_reclen(svsk) + svsk->sk_datalen >
+ svsk->sk_xprt.xpt_server->sv_max_mesg) {
+ net_notice_ratelimited("svc: %s oversized RPC fragment (%u octets)\n",
+ svsk->sk_xprt.xpt_server->sv_name,
+ svc_sock_reclen(svsk));
+ desc->error = -EMSGSIZE;
+ desc->count = 0;
+ return consumed;
+ }
+ }
+
+ if (len == 0)
+ return consumed;
+
+ /* Phase 2: copy body data into rq_pages */
+ {
+ size_t reclen = svc_sock_reclen(svsk);
+ size_t received = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+ size_t want = reclen - received;
+ size_t take = min(want, len);
+ size_t done = 0;
+
+ while (done < take) {
+ unsigned int pg = svsk->sk_datalen >> PAGE_SHIFT;
+ unsigned int pg_off = svsk->sk_datalen &
+ (PAGE_SIZE - 1);
+ size_t chunk = min(take - done,
+ PAGE_SIZE - (size_t)pg_off);
+
+ if (skb_copy_bits(skb, offset,
+ page_address(rqstp->rq_pages[pg])
+ + pg_off,
+ chunk))
+ goto fault;
+ offset += chunk;
+ done += chunk;
+ svsk->sk_datalen += chunk;
+ }
+ svsk->sk_tcplen += take;
+ consumed += take;
+
+ /* Fragment complete? */
+ if (svsk->sk_tcplen - sizeof(rpc_fraghdr) >= reclen) {
+ if (svc_sock_final_rec(svsk)) {
+ desc->count = 0;
+ } else {
+ svc_tcp_fragment_received(svsk);
+ }
+ }
+ }
+
+ return consumed;
+
+fault:
+ desc->error = -EFAULT;
+ desc->count = 0;
+ return consumed;
+}
+
+/*
+ * read_sock_cmsg control message actor: receives non-data TLS
+ * records (alerts, handshake messages) and translates them into
+ * transport-level actions.
+ */
+static int svc_tcp_cmsg_actor(read_descriptor_t *desc,
+ struct sk_buff *skb,
+ unsigned int offset, size_t len,
+ u8 content_type)
+{
+ struct svc_rqst *rqstp = desc->arg.data;
+ struct svc_sock *svsk =
+ container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
+
+ switch (content_type) {
+ case TLS_RECORD_TYPE_ALERT:
+ if (len >= 2) {
+ u8 alert[2];
+
+ if (!skb_copy_bits(skb, offset, alert,
+ sizeof(alert))) {
+ if (alert[0] == TLS_ALERT_LEVEL_FATAL) {
+ svc_xprt_deferred_close(
+ &svsk->sk_xprt);
+ desc->error = -ENOTCONN;
+ desc->count = 0;
+ }
+ }
+ }
+ break;
+ default:
+ break;
+ }
+ return -EAGAIN;
+}
+
+static int svc_tcp_recvfrom_readsock(struct svc_rqst *rqstp)
+{
+ struct svc_sock *svsk =
+ container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
+ struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+ struct sock *sk = svsk->sk_sk;
+ read_descriptor_t desc = {
+ .arg.data = rqstp,
+ };
+ ssize_t len;
+ __be32 *p;
+ __be32 calldir;
+
+ clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+
+ svc_tcp_restore_pages(svsk, rqstp);
+ rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+
+ /* Ensure no stale response pages are released if the
+ * receive returns without completing a full message.
+ */
+ rqstp->rq_respages = rqstp->rq_page_end;
+ rqstp->rq_next_page = rqstp->rq_page_end;
+
+ desc.count = serv->sv_max_mesg;
+ lock_sock(sk);
+ len = svsk->sk_sock->ops->read_sock_cmsg(sk, &desc,
+ svc_tcp_recv_actor,
+ svc_tcp_cmsg_actor);
+ release_sock(sk);
+
+ if (desc.error == -EMSGSIZE)
+ goto err_delete;
+ if (desc.error < 0) {
+ len = desc.error;
+ goto error;
+ }
+ if (desc.count != 0) {
+ /* Incomplete message */
+ if (len > 0)
+ set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+ goto err_incomplete;
+ }
+
+ /* Complete RPC message received */
+ if (svsk->sk_datalen < 8)
+ goto err_nuts;
+
+ rqstp->rq_arg.len = svsk->sk_datalen;
+ rqstp->rq_arg.page_base = 0;
+ if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
+ rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
+ rqstp->rq_arg.page_len = 0;
+ } else {
+ rqstp->rq_arg.page_len = rqstp->rq_arg.len -
+ rqstp->rq_arg.head[0].iov_len;
+ }
+
+ {
+ unsigned int pg_count =
+ (svsk->sk_datalen + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ rqstp->rq_respages = &rqstp->rq_pages[pg_count];
+ rqstp->rq_next_page = rqstp->rq_respages + 1;
+ }
+
+ rqstp->rq_xprt_ctxt = NULL;
+ rqstp->rq_prot = IPPROTO_TCP;
+ if (test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags))
+ set_bit(RQ_LOCAL, &rqstp->rq_flags);
+ else
+ clear_bit(RQ_LOCAL, &rqstp->rq_flags);
+
+ p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
+ calldir = p[1];
+ if (calldir)
+ len = receive_cb_reply(svsk, rqstp);
+
+ /* Reset TCP read info */
+ svsk->sk_datalen = 0;
+ svc_tcp_fragment_received(svsk);
+
+ if (len < 0)
+ goto error;
+
+ trace_svcsock_tcp_recv(&svsk->sk_xprt, rqstp->rq_arg.len);
+ svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
+ if (serv->sv_stats)
+ serv->sv_stats->nettcpcnt++;
+
+ svc_sock_secure_port(rqstp);
+ set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+ svc_xprt_received(rqstp->rq_xprt);
+ return rqstp->rq_arg.len;
+
+err_incomplete:
+ svc_tcp_save_pages(svsk, rqstp);
+ if (len < 0 && len != -EAGAIN)
+ goto err_delete;
+ goto err_noclose;
+error:
+ if (len != -EAGAIN)
+ goto err_delete;
+ trace_svcsock_tcp_recv_eagain(&svsk->sk_xprt, 0);
+ goto err_noclose;
+err_nuts:
+ svsk->sk_datalen = 0;
+err_delete:
+ trace_svcsock_tcp_recv_err(&svsk->sk_xprt, len);
+ svc_xprt_deferred_close(&svsk->sk_xprt);
+err_noclose:
+ svc_xprt_received(rqstp->rq_xprt);
+ return 0;
+}
+
/**
* svc_tcp_recvfrom - Receive data from a TCP socket
* @rqstp: request structure into which to receive an RPC Call
@@ -1152,6 +1394,9 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
__be32 *p;
__be32 calldir;
+ if (svsk->sk_sock->ops->read_sock_cmsg)
+ return svc_tcp_recvfrom_readsock(rqstp);
+
clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
len = svc_tcp_read_marker(svsk, rqstp);
if (len < 0)
--
2.53.0
^ permalink raw reply related	[flat|nested] 14+ messages in thread
* [RFC PATCH 4/4] sunrpc: Remove sock_recvmsg path from svcsock TCP receives
2026-02-17 22:20 [RFC PATCH 0/4] ->read_sock with cmsg Chuck Lever
` (2 preceding siblings ...)
2026-02-17 22:20 ` [RFC PATCH 3/4] sunrpc: Use read_sock_cmsg for svcsock TCP receives Chuck Lever
@ 2026-02-17 22:20 ` Chuck Lever
3 siblings, 0 replies; 14+ messages in thread
From: Chuck Lever @ 2026-02-17 22:20 UTC (permalink / raw)
To: Hannes Reinecke, Olga Kornievskaia; +Cc: kernel-tls-handshake, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The svcsock TCP receive path maintains two code paths: one
using read_sock/read_sock_cmsg and a legacy path using
sock_recvmsg. Plain TCP sockets already provide read_sock
(tcp_read_sock) in their proto_ops, so the read_sock_cmsg
path can handle all cases relevant to NFSD by falling back
to read_sock when kTLS is not active.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svcsock.c | 314 +++----------------------------------------
1 file changed, 22 insertions(+), 292 deletions(-)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 9600d15287e7..7d614dc44a05 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -238,140 +238,6 @@ static int svc_one_sock_name(struct svc_sock *svsk, char *buf, int remaining)
return len;
}
-static int
-svc_tcp_sock_process_cmsg(struct socket *sock, struct msghdr *msg,
- struct cmsghdr *cmsg, int ret)
-{
- u8 content_type = tls_get_record_type(sock->sk, cmsg);
- u8 level, description;
-
- switch (content_type) {
- case 0:
- break;
- case TLS_RECORD_TYPE_DATA:
- /* TLS sets EOR at the end of each application data
- * record, even though there might be more frames
- * waiting to be decrypted.
- */
- msg->msg_flags &= ~MSG_EOR;
- break;
- case TLS_RECORD_TYPE_ALERT:
- tls_alert_recv(sock->sk, msg, &level, &description);
- ret = (level == TLS_ALERT_LEVEL_FATAL) ?
- -ENOTCONN : -EAGAIN;
- break;
- default:
- /* discard this record type */
- ret = -EAGAIN;
- }
- return ret;
-}
-
-static int
-svc_tcp_sock_recv_cmsg(struct socket *sock, unsigned int *msg_flags)
-{
- union {
- struct cmsghdr cmsg;
- u8 buf[CMSG_SPACE(sizeof(u8))];
- } u;
- u8 alert[2];
- struct kvec alert_kvec = {
- .iov_base = alert,
- .iov_len = sizeof(alert),
- };
- struct msghdr msg = {
- .msg_flags = *msg_flags,
- .msg_control = &u,
- .msg_controllen = sizeof(u),
- };
- int ret;
-
- iov_iter_kvec(&msg.msg_iter, ITER_DEST, &alert_kvec, 1,
- alert_kvec.iov_len);
- ret = sock_recvmsg(sock, &msg, MSG_DONTWAIT);
- if (ret > 0 &&
- tls_get_record_type(sock->sk, &u.cmsg) == TLS_RECORD_TYPE_ALERT) {
- iov_iter_revert(&msg.msg_iter, ret);
- ret = svc_tcp_sock_process_cmsg(sock, &msg, &u.cmsg, -EAGAIN);
- }
- return ret;
-}
-
-static int
-svc_tcp_sock_recvmsg(struct svc_sock *svsk, struct msghdr *msg)
-{
- int ret;
- struct socket *sock = svsk->sk_sock;
-
- ret = sock_recvmsg(sock, msg, MSG_DONTWAIT);
- if (msg->msg_flags & MSG_CTRUNC) {
- msg->msg_flags &= ~(MSG_CTRUNC | MSG_EOR);
- if (ret == 0 || ret == -EIO)
- ret = svc_tcp_sock_recv_cmsg(sock, &msg->msg_flags);
- }
- return ret;
-}
-
-#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-static void svc_flush_bvec(const struct bio_vec *bvec, size_t size, size_t seek)
-{
- struct bvec_iter bi = {
- .bi_size = size + seek,
- };
- struct bio_vec bv;
-
- bvec_iter_advance(bvec, &bi, seek & PAGE_MASK);
- for_each_bvec(bv, bvec, bi, bi)
- flush_dcache_page(bv.bv_page);
-}
-#else
-static inline void svc_flush_bvec(const struct bio_vec *bvec, size_t size,
- size_t seek)
-{
-}
-#endif
-
-/*
- * Read from @rqstp's transport socket. The incoming message fills whole
- * pages in @rqstp's rq_pages array until the last page of the message
- * has been received into a partial page.
- */
-static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
- size_t seek)
-{
- struct svc_sock *svsk =
- container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
- struct bio_vec *bvec = rqstp->rq_bvec;
- struct msghdr msg = { NULL };
- unsigned int i;
- ssize_t len;
- size_t t;
-
- clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
-
- for (i = 0, t = 0; t < buflen; i++, t += PAGE_SIZE)
- bvec_set_page(&bvec[i], rqstp->rq_pages[i], PAGE_SIZE, 0);
- rqstp->rq_respages = &rqstp->rq_pages[i];
- rqstp->rq_next_page = rqstp->rq_respages + 1;
-
- iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
- if (seek) {
- iov_iter_advance(&msg.msg_iter, seek);
- buflen -= seek;
- }
- len = svc_tcp_sock_recvmsg(svsk, &msg);
- if (len > 0)
- svc_flush_bvec(bvec, len, seek);
-
- /* If we read a full record, then assume there may be more
- * data to read (stream based sockets only!)
- */
- if (len == buflen)
- set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
-
- return len;
-}
-
/*
* Set socket snd and rcv buffer lengths
*/
@@ -1038,50 +904,6 @@ static void svc_tcp_clear_pages(struct svc_sock *svsk)
svsk->sk_datalen = 0;
}
-/*
- * Receive fragment record header into sk_marker.
- */
-static ssize_t svc_tcp_read_marker(struct svc_sock *svsk,
- struct svc_rqst *rqstp)
-{
- ssize_t want, len;
-
- /* If we haven't gotten the record length yet,
- * get the next four bytes.
- */
- if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) {
- struct msghdr msg = { NULL };
- struct kvec iov;
-
- want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
- iov.iov_base = ((char *)&svsk->sk_marker) + svsk->sk_tcplen;
- iov.iov_len = want;
- iov_iter_kvec(&msg.msg_iter, ITER_DEST, &iov, 1, want);
- len = svc_tcp_sock_recvmsg(svsk, &msg);
- if (len < 0)
- return len;
- svsk->sk_tcplen += len;
- if (len < want) {
- /* call again to read the remaining bytes */
- goto err_short;
- }
- trace_svcsock_marker(&svsk->sk_xprt, svsk->sk_marker);
- if (svc_sock_reclen(svsk) + svsk->sk_datalen >
- svsk->sk_xprt.xpt_server->sv_max_mesg)
- goto err_too_large;
- }
- return svc_sock_reclen(svsk);
-
-err_too_large:
- net_notice_ratelimited("svc: %s oversized RPC fragment (%u octets) from %pISpc\n",
- svsk->sk_xprt.xpt_server->sv_name,
- svc_sock_reclen(svsk),
- (struct sockaddr *)&svsk->sk_xprt.xpt_remote);
- svc_xprt_deferred_close(&svsk->sk_xprt);
-err_short:
- return -EAGAIN;
-}
-
static int receive_cb_reply(struct svc_sock *svsk, struct svc_rqst *rqstp)
{
struct rpc_xprt *bc_xprt = svsk->sk_xprt.xpt_bc_xprt;
@@ -1252,7 +1074,21 @@ static int svc_tcp_cmsg_actor(read_descriptor_t *desc,
return -EAGAIN;
}
-static int svc_tcp_recvfrom_readsock(struct svc_rqst *rqstp)
+/**
+ * svc_tcp_recvfrom - Receive data from a TCP socket
+ * @rqstp: request structure into which to receive an RPC Call
+ *
+ * Called in a loop when XPT_DATA has been set.
+ *
+ * Returns:
+ * On success, the number of bytes in a received RPC Call, or
+ * %0 if a complete RPC Call message was not ready to return
+ *
+ * The zero return case handles partial receives and callback Replies.
+ * The state of a partial receive is preserved in the svc_sock for
+ * the next call to svc_tcp_recvfrom.
+ */
+static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
{
struct svc_sock *svsk =
container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
@@ -1278,9 +1114,13 @@ static int svc_tcp_recvfrom_readsock(struct svc_rqst *rqstp)
desc.count = serv->sv_max_mesg;
lock_sock(sk);
- len = svsk->sk_sock->ops->read_sock_cmsg(sk, &desc,
- svc_tcp_recv_actor,
- svc_tcp_cmsg_actor);
+ if (svsk->sk_sock->ops->read_sock_cmsg)
+ len = svsk->sk_sock->ops->read_sock_cmsg(sk, &desc,
+ svc_tcp_recv_actor,
+ svc_tcp_cmsg_actor);
+ else
+ len = svsk->sk_sock->ops->read_sock(sk, &desc,
+ svc_tcp_recv_actor);
release_sock(sk);
if (desc.error == -EMSGSIZE)
@@ -1366,116 +1206,6 @@ static int svc_tcp_recvfrom_readsock(struct svc_rqst *rqstp)
return 0;
}
-/**
- * svc_tcp_recvfrom - Receive data from a TCP socket
- * @rqstp: request structure into which to receive an RPC Call
- *
- * Called in a loop when XPT_DATA has been set.
- *
- * Read the 4-byte stream record marker, then use the record length
- * in that marker to set up exactly the resources needed to receive
- * the next RPC message into @rqstp.
- *
- * Returns:
- * On success, the number of bytes in a received RPC Call, or
- * %0 if a complete RPC Call message was not ready to return
- *
- * The zero return case handles partial receives and callback Replies.
- * The state of a partial receive is preserved in the svc_sock for
- * the next call to svc_tcp_recvfrom.
- */
-static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
-{
- struct svc_sock *svsk =
- container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
- struct svc_serv *serv = svsk->sk_xprt.xpt_server;
- size_t want, base;
- ssize_t len;
- __be32 *p;
- __be32 calldir;
-
- if (svsk->sk_sock->ops->read_sock_cmsg)
- return svc_tcp_recvfrom_readsock(rqstp);
-
- clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
- len = svc_tcp_read_marker(svsk, rqstp);
- if (len < 0)
- goto error;
-
- base = svc_tcp_restore_pages(svsk, rqstp);
- want = len - (svsk->sk_tcplen - sizeof(rpc_fraghdr));
- len = svc_tcp_read_msg(rqstp, base + want, base);
- if (len >= 0) {
- trace_svcsock_tcp_recv(&svsk->sk_xprt, len);
- svsk->sk_tcplen += len;
- svsk->sk_datalen += len;
- }
- if (len != want || !svc_sock_final_rec(svsk))
- goto err_incomplete;
- if (svsk->sk_datalen < 8)
- goto err_nuts;
-
- rqstp->rq_arg.len = svsk->sk_datalen;
- rqstp->rq_arg.page_base = 0;
- if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
- rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
- rqstp->rq_arg.page_len = 0;
- } else
- rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;
-
- rqstp->rq_xprt_ctxt = NULL;
- rqstp->rq_prot = IPPROTO_TCP;
- if (test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags))
- set_bit(RQ_LOCAL, &rqstp->rq_flags);
- else
- clear_bit(RQ_LOCAL, &rqstp->rq_flags);
-
- p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
- calldir = p[1];
- if (calldir)
- len = receive_cb_reply(svsk, rqstp);
-
- /* Reset TCP read info */
- svsk->sk_datalen = 0;
- svc_tcp_fragment_received(svsk);
-
- if (len < 0)
- goto error;
-
- svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
- if (serv->sv_stats)
- serv->sv_stats->nettcpcnt++;
-
- svc_sock_secure_port(rqstp);
- svc_xprt_received(rqstp->rq_xprt);
- return rqstp->rq_arg.len;
-
-err_incomplete:
- svc_tcp_save_pages(svsk, rqstp);
- if (len < 0 && len != -EAGAIN)
- goto err_delete;
- if (len == want)
- svc_tcp_fragment_received(svsk);
- else
- trace_svcsock_tcp_recv_short(&svsk->sk_xprt,
- svc_sock_reclen(svsk),
- svsk->sk_tcplen - sizeof(rpc_fraghdr));
- goto err_noclose;
-error:
- if (len != -EAGAIN)
- goto err_delete;
- trace_svcsock_tcp_recv_eagain(&svsk->sk_xprt, 0);
- goto err_noclose;
-err_nuts:
- svsk->sk_datalen = 0;
-err_delete:
- trace_svcsock_tcp_recv_err(&svsk->sk_xprt, len);
- svc_xprt_deferred_close(&svsk->sk_xprt);
-err_noclose:
- svc_xprt_received(rqstp->rq_xprt);
- return 0; /* record not complete */
-}
-
/*
* MSG_SPLICE_PAGES is used exclusively to reduce the number of
* copy operations in this path. Therefore the caller must ensure
--
2.53.0
^ permalink raw reply related [flat|nested] 14+ messages in thread