* [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:31 ` Stefan Metzmacher
2025-08-18 14:04 ` [PATCH net-next v2 02/15] net: build socket infrastructure for QUIC protocol Xin Long
` (14 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch adds IPPROTO_QUIC and SOL_QUIC constants to the networking
subsystem. These definitions are essential for applications to set
socket options and protocol identifiers related to the QUIC protocol.
QUIC does not have a protocol number allocated by IANA; like
IPPROTO_MPTCP, IPPROTO_QUIC is merely a value used when opening a QUIC
socket with:
socket(AF_INET, SOCK_STREAM, IPPROTO_QUIC);
Note we did not opt for a UDP ULP for the QUIC implementation due to
several considerations:
- QUIC's connection migration requires at least 2 UDP sockets for one
QUIC connection at the same time, not to mention the multipath
feature in one of its draft RFCs.
- In-kernel QUIC, as a transport protocol, wants to provide users with
TCP- or SCTP-like socket APIs, such as connect()/listen()/accept()...
Note that a single UDP socket might even be used for multiple QUIC
connections.
The use of IPPROTO_QUIC sockets over a UDP tunnel effectively addresses
these challenges and provides a more flexible and scalable solution.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
include/linux/socket.h | 1 +
include/uapi/linux/in.h | 2 ++
2 files changed, 3 insertions(+)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 3b262487ec06..a7c05b064583 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -386,6 +386,7 @@ struct ucred {
#define SOL_MCTP 285
#define SOL_SMC 286
#define SOL_VSOCK 287
+#define SOL_QUIC 288
/* IPX options */
#define IPX_TYPE 1
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index ced0fc3c3aa5..34becd90d3a6 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -85,6 +85,8 @@ enum {
#define IPPROTO_RAW IPPROTO_RAW
IPPROTO_SMC = 256, /* Shared Memory Communications */
#define IPPROTO_SMC IPPROTO_SMC
+ IPPROTO_QUIC = 261, /* A UDP-Based Multiplexed and Secure Transport */
+#define IPPROTO_QUIC IPPROTO_QUIC
IPPROTO_MPTCP = 262, /* Multipath TCP connection */
#define IPPROTO_MPTCP IPPROTO_MPTCP
IPPROTO_MAX
--
2.47.1
* Re: [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants
2025-08-18 14:04 ` [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants Xin Long
@ 2025-08-18 14:31 ` Stefan Metzmacher
2025-08-18 16:20 ` Matthieu Baerts
2025-08-19 8:10 ` Namjae Jeon
0 siblings, 2 replies; 38+ messages in thread
From: Stefan Metzmacher @ 2025-08-18 14:31 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman, Moritz Buhl,
Tyler Fanelli, Pengtao He, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
Hi,
> diff --git a/include/linux/socket.h b/include/linux/socket.h
> index 3b262487ec06..a7c05b064583 100644
> --- a/include/linux/socket.h
> +++ b/include/linux/socket.h
> @@ -386,6 +386,7 @@ struct ucred {
> #define SOL_MCTP 285
> #define SOL_SMC 286
> #define SOL_VSOCK 287
> +#define SOL_QUIC 288
>
> /* IPX options */
> #define IPX_TYPE 1
> diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> index ced0fc3c3aa5..34becd90d3a6 100644
> --- a/include/uapi/linux/in.h
> +++ b/include/uapi/linux/in.h
> @@ -85,6 +85,8 @@ enum {
> #define IPPROTO_RAW IPPROTO_RAW
> IPPROTO_SMC = 256, /* Shared Memory Communications */
> #define IPPROTO_SMC IPPROTO_SMC
> + IPPROTO_QUIC = 261, /* A UDP-Based Multiplexed and Secure Transport */
> +#define IPPROTO_QUIC IPPROTO_QUIC
> IPPROTO_MPTCP = 262, /* Multipath TCP connection */
> #define IPPROTO_MPTCP IPPROTO_MPTCP
> IPPROTO_MAX
Can these constants be accepted soon?
Samba 4.23.0, to be released early September, will ship userspace code
to use them. It would be good to have them correct when kernels start
to support this...
It would also mean less risk of conflicts between projects that need such numbers.
I think it's useful to use a value lower than IPPROTO_MAX, because it means
the kernel module can also be built against older kernels as an out-of-tree
module, and it would still be transparent for userspace consumers like Samba.
There are hardcoded checks for IPPROTO_MAX in inet_create, inet6_create
and inet_diag_register, and the value of IPPROTO_MAX has been 263 since
commit d25a92ccae6bed02327b63d138e12e7806830f78 in 6.11.
Thanks!
metze
* Re: [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants
2025-08-18 14:31 ` Stefan Metzmacher
@ 2025-08-18 16:20 ` Matthieu Baerts
2025-08-18 18:37 ` Xin Long
2025-08-19 8:10 ` Namjae Jeon
1 sibling, 1 reply; 38+ messages in thread
From: Matthieu Baerts @ 2025-08-18 16:20 UTC (permalink / raw)
To: Stefan Metzmacher, Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman, Moritz Buhl,
Tyler Fanelli, Pengtao He, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
Hi Stefan, Xin,
On 18/08/2025 16:31, Stefan Metzmacher wrote:
> Hi,
>
>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>> index 3b262487ec06..a7c05b064583 100644
>> --- a/include/linux/socket.h
>> +++ b/include/linux/socket.h
>> @@ -386,6 +386,7 @@ struct ucred {
>> #define SOL_MCTP 285
>> #define SOL_SMC 286
>> #define SOL_VSOCK 287
>> +#define SOL_QUIC 288
>> /* IPX options */
>> #define IPX_TYPE 1
>> diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
>> index ced0fc3c3aa5..34becd90d3a6 100644
>> --- a/include/uapi/linux/in.h
>> +++ b/include/uapi/linux/in.h
>> @@ -85,6 +85,8 @@ enum {
>> #define IPPROTO_RAW IPPROTO_RAW
>> IPPROTO_SMC = 256, /* Shared Memory Communications */
>> #define IPPROTO_SMC IPPROTO_SMC
>> + IPPROTO_QUIC = 261, /* A UDP-Based Multiplexed and Secure
>> Transport */
>> +#define IPPROTO_QUIC IPPROTO_QUIC
>> IPPROTO_MPTCP = 262, /* Multipath TCP connection */
>> #define IPPROTO_MPTCP IPPROTO_MPTCP
>> IPPROTO_MAX
>
> Can these constants be accepted, soon?
>
> Samba 4.23.0 to be released early September will ship userspace code to
> use them. It would be good to have them correct when kernel's start to
> support this...
>
> It would also mean less risk for conflicting projects with the need for
> such numbers.
>
> I think it's useful to use a value lower than IPPROTO_MAX, because it means
> the kernel module can also be build against older kernels as out of tree
> module
> and still it would be transparent for userspace consumers like samba.
> There are hardcoded checks for IPPROTO_MAX in inet_create, inet6_create,
> inet_diag_register
> and the value of IPPROTO_MAX is 263 starting with commit
> d25a92ccae6bed02327b63d138e12e7806830f78 in 6.11.
I would also recommend not changing IPPROTO_MAX here. When IPPROTO_MAX
got increased to 263, it caused some small issues because it was
hardcoded in some userspace code, if I remember well.
It is unclear why IPPROTO_QUIC is using 261 and not 257, but it should
not make any difference, I suppose.
Note that for MPTCP, we picked 262 just in case the protocol number was
limited to 8 bits, to fall back to IPPROTO_TCP: 262 & 0xFF = 6. At that
time, we thought it was important because we were the first ones to use
a value higher than U8_MAX. In the end, it is good for new protocols
not to increase IPPROTO_MAX each time :)
(@Xin: BTW, thank you for working on this!)
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
* Re: [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants
2025-08-18 16:20 ` Matthieu Baerts
@ 2025-08-18 18:37 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 18:37 UTC (permalink / raw)
To: Matthieu Baerts
Cc: Stefan Metzmacher, network dev, davem, kuba, Eric Dumazet,
Paolo Abeni, Simon Horman, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
On Mon, Aug 18, 2025 at 12:20 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Stefan, Xin,
>
> On 18/08/2025 16:31, Stefan Metzmacher wrote:
> > Hi,
> >
> >> diff --git a/include/linux/socket.h b/include/linux/socket.h
> >> index 3b262487ec06..a7c05b064583 100644
> >> --- a/include/linux/socket.h
> >> +++ b/include/linux/socket.h
> >> @@ -386,6 +386,7 @@ struct ucred {
> >> #define SOL_MCTP 285
> >> #define SOL_SMC 286
> >> #define SOL_VSOCK 287
> >> +#define SOL_QUIC 288
> >> /* IPX options */
> >> #define IPX_TYPE 1
> >> diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> >> index ced0fc3c3aa5..34becd90d3a6 100644
> >> --- a/include/uapi/linux/in.h
> >> +++ b/include/uapi/linux/in.h
> >> @@ -85,6 +85,8 @@ enum {
> >> #define IPPROTO_RAW IPPROTO_RAW
> >> IPPROTO_SMC = 256, /* Shared Memory Communications */
> >> #define IPPROTO_SMC IPPROTO_SMC
> >> + IPPROTO_QUIC = 261, /* A UDP-Based Multiplexed and Secure
> >> Transport */
> >> +#define IPPROTO_QUIC IPPROTO_QUIC
> >> IPPROTO_MPTCP = 262, /* Multipath TCP connection */
> >> #define IPPROTO_MPTCP IPPROTO_MPTCP
> >> IPPROTO_MAX
> >
> > Can these constants be accepted, soon?
> >
> > Samba 4.23.0 to be released early September will ship userspace code to
> > use them. It would be good to have them correct when kernel's start to
> > support this...
> >
> > It would also mean less risk for conflicting projects with the need for
> > such numbers.
> >
> > I think it's useful to use a value lower than IPPROTO_MAX, because it means
> > the kernel module can also be build against older kernels as out of tree
> > module
> > and still it would be transparent for userspace consumers like samba.
> > There are hardcoded checks for IPPROTO_MAX in inet_create, inet6_create,
> > inet_diag_register
> > and the value of IPPROTO_MAX is 263 starting with commit
> > d25a92ccae6bed02327b63d138e12e7806830f78 in 6.11.
>
> I would also recommend not changing IPPROTO_MAX here. When IPPROTO_MAX
> got increased to 263, this caused some (small) small issues because it
> was hardcoded in some userspace code if I remember well.
>
> It is unclear why IPPROTO_QUIC is using 261 and not 257, but it should
> not make any differences I suppose.
>
I agree, it should not.
I wasn’t sure if any other project was using 257, so to minimize the risk
of conflicts, I’ve been using this large value from the beginning.
> Note that for MPTCP, we picked 262, just in case the protocol number was
> limited to 8 bits, to fallback to IPPROTO_TCP: 262 & 0xFF = 6. At that
> time, we thought it was important, because we were the first ones to use
> a value higher than U8_MAX. At the end, it is good for new protocols,
> not to increase IPPROTO_MAX each time :)
>
Yes, this approach saves a lot of trouble for new protocols!
Thanks.
* Re: [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants
2025-08-18 14:31 ` Stefan Metzmacher
2025-08-18 16:20 ` Matthieu Baerts
@ 2025-08-19 8:10 ` Namjae Jeon
2025-08-21 8:24 ` Stefan Metzmacher
1 sibling, 1 reply; 38+ messages in thread
From: Namjae Jeon @ 2025-08-19 8:10 UTC (permalink / raw)
To: Stefan Metzmacher
Cc: Xin Long, network dev, davem, kuba, Eric Dumazet, Paolo Abeni,
Simon Horman, Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs,
Steve French, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
On Mon, Aug 18, 2025 at 11:31 PM Stefan Metzmacher <metze@samba.org> wrote:
>
> Hi,
>
> > diff --git a/include/linux/socket.h b/include/linux/socket.h
> > index 3b262487ec06..a7c05b064583 100644
> > --- a/include/linux/socket.h
> > +++ b/include/linux/socket.h
> > @@ -386,6 +386,7 @@ struct ucred {
> > #define SOL_MCTP 285
> > #define SOL_SMC 286
> > #define SOL_VSOCK 287
> > +#define SOL_QUIC 288
> >
> > /* IPX options */
> > #define IPX_TYPE 1
> > diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
> > index ced0fc3c3aa5..34becd90d3a6 100644
> > --- a/include/uapi/linux/in.h
> > +++ b/include/uapi/linux/in.h
> > @@ -85,6 +85,8 @@ enum {
> > #define IPPROTO_RAW IPPROTO_RAW
> > IPPROTO_SMC = 256, /* Shared Memory Communications */
> > #define IPPROTO_SMC IPPROTO_SMC
> > + IPPROTO_QUIC = 261, /* A UDP-Based Multiplexed and Secure Transport */
> > +#define IPPROTO_QUIC IPPROTO_QUIC
> > IPPROTO_MPTCP = 262, /* Multipath TCP connection */
> > #define IPPROTO_MPTCP IPPROTO_MPTCP
> > IPPROTO_MAX
>
> Can these constants be accepted, soon?
>
> Samba 4.23.0 to be released early September will ship userspace code to
> use them. It would be good to have them correct when kernel's start to
> support this...
I'd like to test ksmbd with Samba's smbclient, which includes QUIC support.
Which Samba branch should I use? How do I enable QUIC in Samba?
Do I need to update smb.conf?
Thanks.
>
> It would also mean less risk for conflicting projects with the need for such numbers.
>
> I think it's useful to use a value lower than IPPROTO_MAX, because it means
> the kernel module can also be build against older kernels as out of tree module
> and still it would be transparent for userspace consumers like samba.
> There are hardcoded checks for IPPROTO_MAX in inet_create, inet6_create, inet_diag_register
> and the value of IPPROTO_MAX is 263 starting with commit
> d25a92ccae6bed02327b63d138e12e7806830f78 in 6.11.
>
> Thanks!
> metze
* Re: [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants
2025-08-19 8:10 ` Namjae Jeon
@ 2025-08-21 8:24 ` Stefan Metzmacher
0 siblings, 0 replies; 38+ messages in thread
From: Stefan Metzmacher @ 2025-08-21 8:24 UTC (permalink / raw)
To: Namjae Jeon
Cc: Xin Long, network dev, davem, kuba, Eric Dumazet, Paolo Abeni,
Simon Horman, Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs,
Steve French, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
Hi Namjae,
>>> diff --git a/include/linux/socket.h b/include/linux/socket.h
>>> index 3b262487ec06..a7c05b064583 100644
>>> --- a/include/linux/socket.h
>>> +++ b/include/linux/socket.h
>>> @@ -386,6 +386,7 @@ struct ucred {
>>> #define SOL_MCTP 285
>>> #define SOL_SMC 286
>>> #define SOL_VSOCK 287
>>> +#define SOL_QUIC 288
>>>
>>> /* IPX options */
>>> #define IPX_TYPE 1
>>> diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
>>> index ced0fc3c3aa5..34becd90d3a6 100644
>>> --- a/include/uapi/linux/in.h
>>> +++ b/include/uapi/linux/in.h
>>> @@ -85,6 +85,8 @@ enum {
>>> #define IPPROTO_RAW IPPROTO_RAW
>>> IPPROTO_SMC = 256, /* Shared Memory Communications */
>>> #define IPPROTO_SMC IPPROTO_SMC
>>> + IPPROTO_QUIC = 261, /* A UDP-Based Multiplexed and Secure Transport */
>>> +#define IPPROTO_QUIC IPPROTO_QUIC
>>> IPPROTO_MPTCP = 262, /* Multipath TCP connection */
>>> #define IPPROTO_MPTCP IPPROTO_MPTCP
>>> IPPROTO_MAX
>>
>> Can these constants be accepted, soon?
>>
>> Samba 4.23.0 to be released early September will ship userspace code to
>> use them. It would be good to have them correct when kernel's start to
>> support this...
> I'd like to test ksmbd with smbclient of samba, which includes quic support.
> Which Samba branch should I use? How do I enable quic in Samba?
> Do I need to update smb.conf?
With master or 4.23 the simplest way would be
smbclient //ksmbd-server/share \
-Uuser%Passw0rd \
--option='client smb transports = quic' \
--option='tls verify peer = no_check' \
-I 10.0.0.1
Note it only works with a name in the UNC path; otherwise QUIC can't
work.
For development you may want to use
SSLKEYLOGFILE=/dev/shm/sslkeylogfile.txt smbclient ...
And point wireshark to /dev/shm/sslkeylogfile.txt with
wireshark -o tls.keylog_file:/dev/shm/sslkeylogfile.txt
Or you merge it into a pcapng file like this:
editcap --inject-secrets tls,/dev/shm/sslkeylogfile.txt capture.pcap.gz capture.pcapng.gz
Then 'wireshark capture.pcapng.gz' will have everything to decrypt.
metze
* [PATCH net-next v2 02/15] net: build socket infrastructure for QUIC protocol
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-21 11:17 ` Paolo Abeni
2025-08-18 14:04 ` [PATCH net-next v2 03/15] quic: provide common utilities and data structures Xin Long
` (13 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch lays the groundwork for QUIC socket support in the kernel.
It defines the core structures and protocol hooks needed to create
QUIC sockets, without implementing any protocol behavior at this stage.
Basic integration is included to allow building the module via
CONFIG_IP_QUIC=m.
This provides the scaffolding necessary for adding actual QUIC socket
behavior in follow-up patches.
Signed-off-by: Pengtao He <hepengtao@xiaomi.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/Kconfig | 1 +
net/Makefile | 1 +
net/quic/Kconfig | 35 +++++
net/quic/Makefile | 8 +
net/quic/protocol.c | 370 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/protocol.h | 55 +++++++
net/quic/socket.c | 213 +++++++++++++++++++++++++
net/quic/socket.h | 79 ++++++++++
8 files changed, 762 insertions(+)
create mode 100644 net/quic/Kconfig
create mode 100644 net/quic/Makefile
create mode 100644 net/quic/protocol.c
create mode 100644 net/quic/protocol.h
create mode 100644 net/quic/socket.c
create mode 100644 net/quic/socket.h
diff --git a/net/Kconfig b/net/Kconfig
index d5865cf19799..1205f5b7cf59 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -249,6 +249,7 @@ source "net/bridge/netfilter/Kconfig"
endif # if NETFILTER
+source "net/quic/Kconfig"
source "net/sctp/Kconfig"
source "net/rds/Kconfig"
source "net/tipc/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index aac960c41db6..7c6de28e9aa5 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_PHONET) += phonet/
ifneq ($(CONFIG_VLAN_8021Q),)
obj-y += 8021q/
endif
+obj-$(CONFIG_IP_QUIC) += quic/
obj-$(CONFIG_IP_SCTP) += sctp/
obj-$(CONFIG_RDS) += rds/
obj-$(CONFIG_WIRELESS) += wireless/
diff --git a/net/quic/Kconfig b/net/quic/Kconfig
new file mode 100644
index 000000000000..b64fa398750e
--- /dev/null
+++ b/net/quic/Kconfig
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# QUIC configuration
+#
+
+menuconfig IP_QUIC
+ tristate "QUIC: A UDP-Based Multiplexed and Secure Transport (Experimental)"
+ depends on INET
+ depends on IPV6
+ select CRYPTO
+ select CRYPTO_HMAC
+ select CRYPTO_HKDF
+ select CRYPTO_AES
+ select CRYPTO_GCM
+ select CRYPTO_CCM
+ select CRYPTO_CHACHA20POLY1305
+ select NET_UDP_TUNNEL
+ help
+ QUIC: A UDP-Based Multiplexed and Secure Transport
+
+ From rfc9000 <https://www.rfc-editor.org/rfc/rfc9000.html>.
+
+ QUIC provides applications with flow-controlled streams for structured
+ communication, low-latency connection establishment, and network path
+ migration. QUIC includes security measures that ensure
+ confidentiality, integrity, and availability in a range of deployment
+ circumstances. Accompanying documents describe the integration of
+ TLS for key negotiation, loss detection, and an exemplary congestion
+ control algorithm.
+
+ To compile this protocol support as a module, choose M here: the
+ module will be called quic. Debug messages are handled by the
+ kernel's dynamic debugging framework.
+
+ If in doubt, say N.
diff --git a/net/quic/Makefile b/net/quic/Makefile
new file mode 100644
index 000000000000..020e4dd133d8
--- /dev/null
+++ b/net/quic/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Makefile for QUIC support code.
+#
+
+obj-$(CONFIG_IP_QUIC) += quic.o
+
+quic-y := protocol.o socket.o
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
new file mode 100644
index 000000000000..01a5fdfb5227
--- /dev/null
+++ b/net/quic/protocol.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Initialization/cleanup for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/inet_common.h>
+#include <linux/version.h>
+#include <linux/proc_fs.h>
+#include <net/protocol.h>
+#include <net/rps.h>
+#include <net/tls.h>
+
+#include "socket.h"
+
+static unsigned int quic_net_id __read_mostly;
+
+struct percpu_counter quic_sockets_allocated;
+
+long sysctl_quic_mem[3];
+int sysctl_quic_rmem[3];
+int sysctl_quic_wmem[3];
+
+static int quic_inet_connect(struct socket *sock, struct sockaddr *addr, int addr_len, int flags)
+{
+ struct sock *sk = sock->sk;
+ const struct proto *prot;
+
+ if (addr_len < (int)sizeof(addr->sa_family))
+ return -EINVAL;
+
+ prot = READ_ONCE(sk->sk_prot);
+
+ return prot->connect(sk, addr, addr_len);
+}
+
+static int quic_inet_listen(struct socket *sock, int backlog)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_inet_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
+{
+ return -EOPNOTSUPP;
+}
+
+static __poll_t quic_inet_poll(struct file *file, struct socket *sock, poll_table *wait)
+{
+ return 0;
+}
+
+static struct ctl_table quic_table[] = {
+ {
+ .procname = "quic_mem",
+ .data = &sysctl_quic_mem,
+ .maxlen = sizeof(sysctl_quic_mem),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax
+ },
+ {
+ .procname = "quic_rmem",
+ .data = &sysctl_quic_rmem,
+ .maxlen = sizeof(sysctl_quic_rmem),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "quic_wmem",
+ .data = &sysctl_quic_wmem,
+ .maxlen = sizeof(sysctl_quic_wmem),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+};
+
+struct quic_net *quic_net(struct net *net)
+{
+ return net_generic(net, quic_net_id);
+}
+
+#ifdef CONFIG_PROC_FS
+static const struct snmp_mib quic_snmp_list[] = {
+ SNMP_MIB_ITEM("QuicConnCurrentEstabs", QUIC_MIB_CONN_CURRENTESTABS),
+ SNMP_MIB_ITEM("QuicConnPassiveEstabs", QUIC_MIB_CONN_PASSIVEESTABS),
+ SNMP_MIB_ITEM("QuicConnActiveEstabs", QUIC_MIB_CONN_ACTIVEESTABS),
+ SNMP_MIB_ITEM("QuicPktRcvFastpaths", QUIC_MIB_PKT_RCVFASTPATHS),
+ SNMP_MIB_ITEM("QuicPktDecFastpaths", QUIC_MIB_PKT_DECFASTPATHS),
+ SNMP_MIB_ITEM("QuicPktEncFastpaths", QUIC_MIB_PKT_ENCFASTPATHS),
+ SNMP_MIB_ITEM("QuicPktRcvBacklogs", QUIC_MIB_PKT_RCVBACKLOGS),
+ SNMP_MIB_ITEM("QuicPktDecBacklogs", QUIC_MIB_PKT_DECBACKLOGS),
+ SNMP_MIB_ITEM("QuicPktEncBacklogs", QUIC_MIB_PKT_ENCBACKLOGS),
+ SNMP_MIB_ITEM("QuicPktInvHdrDrop", QUIC_MIB_PKT_INVHDRDROP),
+ SNMP_MIB_ITEM("QuicPktInvNumDrop", QUIC_MIB_PKT_INVNUMDROP),
+ SNMP_MIB_ITEM("QuicPktInvFrmDrop", QUIC_MIB_PKT_INVFRMDROP),
+ SNMP_MIB_ITEM("QuicPktRcvDrop", QUIC_MIB_PKT_RCVDROP),
+ SNMP_MIB_ITEM("QuicPktDecDrop", QUIC_MIB_PKT_DECDROP),
+ SNMP_MIB_ITEM("QuicPktEncDrop", QUIC_MIB_PKT_ENCDROP),
+ SNMP_MIB_ITEM("QuicFrmRcvBufDrop", QUIC_MIB_FRM_RCVBUFDROP),
+ SNMP_MIB_ITEM("QuicFrmRetrans", QUIC_MIB_FRM_RETRANS),
+ SNMP_MIB_ITEM("QuicFrmOutCloses", QUIC_MIB_FRM_OUTCLOSES),
+ SNMP_MIB_ITEM("QuicFrmInCloses", QUIC_MIB_FRM_INCLOSES),
+ SNMP_MIB_SENTINEL
+};
+
+static int quic_snmp_seq_show(struct seq_file *seq, void *v)
+{
+ unsigned long buff[QUIC_MIB_MAX];
+ struct net *net = seq->private;
+ u32 idx;
+
+ memset(buff, 0, sizeof(unsigned long) * QUIC_MIB_MAX);
+
+ snmp_get_cpu_field_batch(buff, quic_snmp_list, quic_net(net)->stat);
+ for (idx = 0; quic_snmp_list[idx].name; idx++)
+ seq_printf(seq, "%-32s\t%ld\n", quic_snmp_list[idx].name, buff[idx]);
+
+ return 0;
+}
+
+static int quic_net_proc_init(struct net *net)
+{
+ quic_net(net)->proc_net = proc_net_mkdir(net, "quic", net->proc_net);
+ if (!quic_net(net)->proc_net)
+ return -ENOMEM;
+
+ if (!proc_create_net_single("snmp", 0444, quic_net(net)->proc_net,
+ quic_snmp_seq_show, NULL))
+ goto free;
+ return 0;
+free:
+ remove_proc_subtree("quic", net->proc_net);
+ quic_net(net)->proc_net = NULL;
+ return -ENOMEM;
+}
+
+static void quic_net_proc_exit(struct net *net)
+{
+ remove_proc_subtree("quic", net->proc_net);
+ quic_net(net)->proc_net = NULL;
+}
+#endif
+
+static const struct proto_ops quic_proto_ops = {
+ .family = PF_INET,
+ .owner = THIS_MODULE,
+ .release = inet_release,
+ .bind = inet_bind,
+ .connect = quic_inet_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = inet_accept,
+ .getname = quic_inet_getname,
+ .poll = quic_inet_poll,
+ .ioctl = inet_ioctl,
+ .gettstamp = sock_gettstamp,
+ .listen = quic_inet_listen,
+ .shutdown = inet_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+};
+
+static struct inet_protosw quic_stream_protosw = {
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quic_prot,
+ .ops = &quic_proto_ops,
+};
+
+static struct inet_protosw quic_dgram_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quic_prot,
+ .ops = &quic_proto_ops,
+};
+
+static const struct proto_ops quicv6_proto_ops = {
+ .family = PF_INET6,
+ .owner = THIS_MODULE,
+ .release = inet6_release,
+ .bind = inet6_bind,
+ .connect = quic_inet_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = inet_accept,
+ .getname = quic_inet_getname,
+ .poll = quic_inet_poll,
+ .ioctl = inet6_ioctl,
+ .gettstamp = sock_gettstamp,
+ .listen = quic_inet_listen,
+ .shutdown = inet_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+};
+
+static struct inet_protosw quicv6_stream_protosw = {
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quicv6_prot,
+ .ops = &quicv6_proto_ops,
+};
+
+static struct inet_protosw quicv6_dgram_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quicv6_prot,
+ .ops = &quicv6_proto_ops,
+};
+
+static int quic_protosw_init(void)
+{
+ int err;
+
+ err = proto_register(&quic_prot, 1);
+ if (err)
+ return err;
+
+ err = proto_register(&quicv6_prot, 1);
+ if (err) {
+ proto_unregister(&quic_prot);
+ return err;
+ }
+
+ inet_register_protosw(&quic_stream_protosw);
+ inet_register_protosw(&quic_dgram_protosw);
+ inet6_register_protosw(&quicv6_stream_protosw);
+ inet6_register_protosw(&quicv6_dgram_protosw);
+
+ return 0;
+}
+
+static void quic_protosw_exit(void)
+{
+ inet_unregister_protosw(&quic_dgram_protosw);
+ inet_unregister_protosw(&quic_stream_protosw);
+ proto_unregister(&quic_prot);
+
+ inet6_unregister_protosw(&quicv6_dgram_protosw);
+ inet6_unregister_protosw(&quicv6_stream_protosw);
+ proto_unregister(&quicv6_prot);
+}
+
+static int __net_init quic_net_init(struct net *net)
+{
+ struct quic_net *qn = quic_net(net);
+ int err = 0;
+
+ qn->stat = alloc_percpu(struct quic_mib);
+ if (!qn->stat)
+ return -ENOMEM;
+
+#ifdef CONFIG_PROC_FS
+ err = quic_net_proc_init(net);
+ if (err) {
+ free_percpu(qn->stat);
+ qn->stat = NULL;
+ }
+#endif
+ return err;
+}
+
+static void __net_exit quic_net_exit(struct net *net)
+{
+ struct quic_net *qn = quic_net(net);
+
+#ifdef CONFIG_PROC_FS
+ quic_net_proc_exit(net);
+#endif
+ free_percpu(qn->stat);
+ qn->stat = NULL;
+}
+
+static struct pernet_operations quic_net_ops = {
+ .init = quic_net_init,
+ .exit = quic_net_exit,
+ .id = &quic_net_id,
+ .size = sizeof(struct quic_net),
+};
+
+#ifdef CONFIG_SYSCTL
+static struct ctl_table_header *quic_sysctl_header;
+
+static void quic_sysctl_register(void)
+{
+ quic_sysctl_header = register_net_sysctl(&init_net, "net/quic", quic_table);
+}
+
+static void quic_sysctl_unregister(void)
+{
+ unregister_net_sysctl_table(quic_sysctl_header);
+}
+#endif
+
+static __init int quic_init(void)
+{
+ int max_share, err = -ENOMEM;
+ unsigned long limit;
+
+ /* Set QUIC memory limits based on available system memory, similar to sctp_init(). */
+ limit = nr_free_buffer_pages() / 8;
+ limit = max(limit, 128UL);
+ sysctl_quic_mem[0] = (long)limit / 4 * 3;
+ sysctl_quic_mem[1] = (long)limit;
+ sysctl_quic_mem[2] = sysctl_quic_mem[0] * 2;
+
+ limit = (sysctl_quic_mem[1]) << (PAGE_SHIFT - 7);
+ max_share = min(4UL * 1024 * 1024, limit);
+
+ sysctl_quic_rmem[0] = PAGE_SIZE;
+ sysctl_quic_rmem[1] = 1024 * 1024;
+ sysctl_quic_rmem[2] = max(sysctl_quic_rmem[1], max_share);
+
+ sysctl_quic_wmem[0] = PAGE_SIZE;
+ sysctl_quic_wmem[1] = 16 * 1024;
+ sysctl_quic_wmem[2] = max(64 * 1024, max_share);
+
+ err = percpu_counter_init(&quic_sockets_allocated, 0, GFP_KERNEL);
+ if (err)
+ goto err_percpu_counter;
+
+ err = register_pernet_subsys(&quic_net_ops);
+ if (err)
+ goto err_def_ops;
+
+ err = quic_protosw_init();
+ if (err)
+ goto err_protosw;
+
+#ifdef CONFIG_SYSCTL
+ quic_sysctl_register();
+#endif
+ pr_info("quic: init\n");
+ return 0;
+
+err_protosw:
+ unregister_pernet_subsys(&quic_net_ops);
+err_def_ops:
+ percpu_counter_destroy(&quic_sockets_allocated);
+err_percpu_counter:
+ return err;
+}
+
+static __exit void quic_exit(void)
+{
+#ifdef CONFIG_SYSCTL
+ quic_sysctl_unregister();
+#endif
+ quic_protosw_exit();
+ unregister_pernet_subsys(&quic_net_ops);
+ percpu_counter_destroy(&quic_sockets_allocated);
+ pr_info("quic: exit\n");
+}
+
+module_init(quic_init);
+module_exit(quic_exit);
+
+MODULE_ALIAS("net-pf-" __stringify(PF_INET) "-proto-261");
+MODULE_ALIAS("net-pf-" __stringify(PF_INET6) "-proto-261");
+MODULE_AUTHOR("Xin Long <lucien.xin@gmail.com>");
+MODULE_DESCRIPTION("Support for the QUIC protocol (RFC9000)");
+MODULE_LICENSE("GPL");
diff --git a/net/quic/protocol.h b/net/quic/protocol.h
new file mode 100644
index 000000000000..6e6c5a6fc3f8
--- /dev/null
+++ b/net/quic/protocol.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+extern struct percpu_counter quic_sockets_allocated;
+
+extern long sysctl_quic_mem[3];
+extern int sysctl_quic_rmem[3];
+extern int sysctl_quic_wmem[3];
+
+enum {
+ QUIC_MIB_NUM = 0,
+ QUIC_MIB_CONN_CURRENTESTABS, /* Currently established connections */
+ QUIC_MIB_CONN_PASSIVEESTABS, /* Connections established passively (server-side accept) */
+ QUIC_MIB_CONN_ACTIVEESTABS, /* Connections established actively (client-side connect) */
+ QUIC_MIB_PKT_RCVFASTPATHS, /* Packets received on the fast path */
+ QUIC_MIB_PKT_DECFASTPATHS, /* Packets successfully decrypted on the fast path */
+ QUIC_MIB_PKT_ENCFASTPATHS, /* Packets encrypted on the fast path (for transmission) */
+ QUIC_MIB_PKT_RCVBACKLOGS, /* Packets received via backlog processing */
+ QUIC_MIB_PKT_DECBACKLOGS, /* Packets decrypted in backlog handler */
+ QUIC_MIB_PKT_ENCBACKLOGS, /* Packets encrypted in backlog handler */
+ QUIC_MIB_PKT_INVHDRDROP, /* Packets dropped due to invalid headers */
+ QUIC_MIB_PKT_INVNUMDROP, /* Packets dropped due to invalid packet numbers */
+ QUIC_MIB_PKT_INVFRMDROP, /* Packets dropped due to invalid frames */
+ QUIC_MIB_PKT_RCVDROP, /* Packets dropped on receive (general errors) */
+ QUIC_MIB_PKT_DECDROP, /* Packets dropped due to decryption failure */
+ QUIC_MIB_PKT_ENCDROP, /* Packets dropped due to encryption failure */
+ QUIC_MIB_FRM_RCVBUFDROP, /* Frames dropped due to receive buffer limits */
+ QUIC_MIB_FRM_RETRANS, /* Frames retransmitted */
+ QUIC_MIB_FRM_OUTCLOSES, /* Frames of CONNECTION_CLOSE sent */
+ QUIC_MIB_FRM_INCLOSES, /* Frames of CONNECTION_CLOSE received */
+ QUIC_MIB_MAX
+};
+
+struct quic_mib {
+ unsigned long mibs[QUIC_MIB_MAX]; /* Array of counters indexed by the enum above */
+};
+
+struct quic_net {
+ DEFINE_SNMP_STAT(struct quic_mib, stat); /* Per-network namespace MIB statistics */
+#ifdef CONFIG_PROC_FS
+ struct proc_dir_entry *proc_net; /* procfs entry for dumping QUIC socket stats */
+#endif
+};
+
+struct quic_net *quic_net(struct net *net);
+
+#define QUIC_INC_STATS(net, field) SNMP_INC_STATS(quic_net(net)->stat, field)
+#define QUIC_DEC_STATS(net, field) SNMP_DEC_STATS(quic_net(net)->stat, field)
diff --git a/net/quic/socket.c b/net/quic/socket.c
new file mode 100644
index 000000000000..320a9a5a3c53
--- /dev/null
+++ b/net/quic/socket.c
@@ -0,0 +1,213 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Initialization/cleanup for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/inet_common.h>
+#include <linux/version.h>
+#include <net/tls.h>
+
+#include "socket.h"
+
+static DEFINE_PER_CPU(int, quic_memory_per_cpu_fw_alloc);
+static unsigned long quic_memory_pressure;
+static atomic_long_t quic_memory_allocated;
+
+static void quic_enter_memory_pressure(struct sock *sk)
+{
+ WRITE_ONCE(quic_memory_pressure, 1);
+}
+
+static void quic_write_space(struct sock *sk)
+{
+ struct socket_wq *wq;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (skwq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND);
+ rcu_read_unlock();
+}
+
+static int quic_init_sock(struct sock *sk)
+{
+ sk->sk_destruct = inet_sock_destruct;
+ sk->sk_write_space = quic_write_space;
+ sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
+
+ WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(sysctl_quic_wmem[1]));
+ WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(sysctl_quic_rmem[1]));
+
+ local_bh_disable();
+ sk_sockets_allocated_inc(sk);
+ sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+ local_bh_enable();
+
+ return 0;
+}
+
+static void quic_destroy_sock(struct sock *sk)
+{
+ local_bh_disable();
+ sk_sockets_allocated_dec(sk);
+ sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+ local_bh_enable();
+}
+
+static int quic_bind(struct sock *sk, struct sockaddr *addr, int addr_len)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_connect(struct sock *sk, struct sockaddr *addr, int addr_len)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_hash(struct sock *sk)
+{
+ return 0;
+}
+
+static void quic_unhash(struct sock *sk)
+{
+}
+
+static int quic_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags,
+ int *addr_len)
+{
+ return -EOPNOTSUPP;
+}
+
+static struct sock *quic_accept(struct sock *sk, struct proto_accept_arg *arg)
+{
+ arg->err = -EOPNOTSUPP;
+ return NULL;
+}
+
+static void quic_close(struct sock *sk, long timeout)
+{
+ lock_sock(sk);
+
+ quic_set_state(sk, QUIC_SS_CLOSED);
+
+ release_sock(sk);
+
+ sk_common_release(sk);
+}
+
+static int quic_do_setsockopt(struct sock *sk, int optname, sockptr_t optval, unsigned int optlen)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen)
+{
+ if (level != SOL_QUIC)
+ return -EOPNOTSUPP;
+
+ return quic_do_setsockopt(sk, optname, optval, optlen);
+}
+
+static int quic_do_getsockopt(struct sock *sk, int optname, sockptr_t optval, sockptr_t optlen)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen)
+{
+ if (level != SOL_QUIC)
+ return -EOPNOTSUPP;
+
+ return quic_do_getsockopt(sk, optname, USER_SOCKPTR(optval), USER_SOCKPTR(optlen));
+}
+
+static void quic_release_cb(struct sock *sk)
+{
+}
+
+static int quic_disconnect(struct sock *sk, int flags)
+{
+ quic_set_state(sk, QUIC_SS_CLOSED); /* for a listen socket only */
+ return 0;
+}
+
+static void quic_shutdown(struct sock *sk, int how)
+{
+ quic_set_state(sk, QUIC_SS_CLOSED);
+}
+
+struct proto quic_prot = {
+ .name = "QUIC",
+ .owner = THIS_MODULE,
+ .init = quic_init_sock,
+ .destroy = quic_destroy_sock,
+ .shutdown = quic_shutdown,
+ .setsockopt = quic_setsockopt,
+ .getsockopt = quic_getsockopt,
+ .connect = quic_connect,
+ .bind = quic_bind,
+ .close = quic_close,
+ .disconnect = quic_disconnect,
+ .sendmsg = quic_sendmsg,
+ .recvmsg = quic_recvmsg,
+ .accept = quic_accept,
+ .hash = quic_hash,
+ .unhash = quic_unhash,
+ .release_cb = quic_release_cb,
+ .no_autobind = true,
+ .obj_size = sizeof(struct quic_sock),
+ .sysctl_mem = sysctl_quic_mem,
+ .sysctl_rmem = sysctl_quic_rmem,
+ .sysctl_wmem = sysctl_quic_wmem,
+ .memory_pressure = &quic_memory_pressure,
+ .enter_memory_pressure = quic_enter_memory_pressure,
+ .memory_allocated = &quic_memory_allocated,
+ .per_cpu_fw_alloc = &quic_memory_per_cpu_fw_alloc,
+ .sockets_allocated = &quic_sockets_allocated,
+};
+
+struct proto quicv6_prot = {
+ .name = "QUICv6",
+ .owner = THIS_MODULE,
+ .init = quic_init_sock,
+ .destroy = quic_destroy_sock,
+ .shutdown = quic_shutdown,
+ .setsockopt = quic_setsockopt,
+ .getsockopt = quic_getsockopt,
+ .connect = quic_connect,
+ .bind = quic_bind,
+ .close = quic_close,
+ .disconnect = quic_disconnect,
+ .sendmsg = quic_sendmsg,
+ .recvmsg = quic_recvmsg,
+ .accept = quic_accept,
+ .hash = quic_hash,
+ .unhash = quic_unhash,
+ .release_cb = quic_release_cb,
+ .no_autobind = true,
+ .obj_size = sizeof(struct quic6_sock),
+ .ipv6_pinfo_offset = offsetof(struct quic6_sock, inet6),
+ .sysctl_mem = sysctl_quic_mem,
+ .sysctl_rmem = sysctl_quic_rmem,
+ .sysctl_wmem = sysctl_quic_wmem,
+ .memory_pressure = &quic_memory_pressure,
+ .enter_memory_pressure = quic_enter_memory_pressure,
+ .memory_allocated = &quic_memory_allocated,
+ .per_cpu_fw_alloc = &quic_memory_per_cpu_fw_alloc,
+ .sockets_allocated = &quic_sockets_allocated,
+};
diff --git a/net/quic/socket.h b/net/quic/socket.h
new file mode 100644
index 000000000000..ded8eb2e6a9c
--- /dev/null
+++ b/net/quic/socket.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/udp_tunnel.h>
+
+#include "protocol.h"
+
+extern struct proto quic_prot;
+extern struct proto quicv6_prot;
+
+enum quic_state {
+ QUIC_SS_CLOSED = TCP_CLOSE,
+ QUIC_SS_LISTENING = TCP_LISTEN,
+ QUIC_SS_ESTABLISHING = TCP_SYN_RECV,
+ QUIC_SS_ESTABLISHED = TCP_ESTABLISHED,
+};
+
+struct quic_sock {
+ struct inet_sock inet;
+ struct list_head reqs;
+};
+
+struct quic6_sock {
+ struct quic_sock quic;
+ struct ipv6_pinfo inet6;
+};
+
+static inline struct quic_sock *quic_sk(const struct sock *sk)
+{
+ return (struct quic_sock *)sk;
+}
+
+static inline struct list_head *quic_reqs(const struct sock *sk)
+{
+ return &quic_sk(sk)->reqs;
+}
+
+static inline bool quic_is_establishing(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_ESTABLISHING;
+}
+
+static inline bool quic_is_established(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_ESTABLISHED;
+}
+
+static inline bool quic_is_listen(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_LISTENING;
+}
+
+static inline bool quic_is_closed(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_CLOSED;
+}
+
+static inline void quic_set_state(struct sock *sk, int state)
+{
+ struct net *net = sock_net(sk);
+
+ if (sk->sk_state == state)
+ return;
+
+ if (state == QUIC_SS_ESTABLISHED)
+ QUIC_INC_STATS(net, QUIC_MIB_CONN_CURRENTESTABS);
+ else if (quic_is_established(sk))
+ QUIC_DEC_STATS(net, QUIC_MIB_CONN_CURRENTESTABS);
+
+ inet_sk_set_state(sk, state);
+ sk->sk_state_change(sk);
+}
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 02/15] net: build socket infrastructure for QUIC protocol
2025-08-18 14:04 ` [PATCH net-next v2 02/15] net: build socket infrastructure for QUIC protocol Xin Long
@ 2025-08-21 11:17 ` Paolo Abeni
2025-08-23 18:38 ` Xin Long
0 siblings, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2025-08-21 11:17 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Simon Horman, Stefan Metzmacher,
Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs, Steve French,
Namjae Jeon, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
On 8/18/25 4:04 PM, Xin Long wrote:
> diff --git a/net/Makefile b/net/Makefile
> index aac960c41db6..7c6de28e9aa5 100644
> --- a/net/Makefile
> +++ b/net/Makefile
> @@ -42,6 +42,7 @@ obj-$(CONFIG_PHONET) += phonet/
> ifneq ($(CONFIG_VLAN_8021Q),)
> obj-y += 8021q/
> endif
> +obj-$(CONFIG_IP_QUIC) += quic/
> obj-$(CONFIG_IP_SCTP) += sctp/
> obj-$(CONFIG_RDS) += rds/
> obj-$(CONFIG_WIRELESS) += wireless/
> diff --git a/net/quic/Kconfig b/net/quic/Kconfig
> new file mode 100644
> index 000000000000..b64fa398750e
> --- /dev/null
> +++ b/net/quic/Kconfig
> @@ -0,0 +1,35 @@
> +# SPDX-License-Identifier: GPL-2.0-or-later
> +#
> +# QUIC configuration
> +#
> +
> +menuconfig IP_QUIC
> + tristate "QUIC: A UDP-Based Multiplexed and Secure Transport (Experimental)"
> + depends on INET
> + depends on IPV6
What if IPV6=m ?
> + select CRYPTO
> + select CRYPTO_HMAC
> + select CRYPTO_HKDF
> + select CRYPTO_AES
> + select CRYPTO_GCM
> + select CRYPTO_CCM
> + select CRYPTO_CHACHA20POLY1305
> + select NET_UDP_TUNNEL
Possibly:
default n
?
[...]
> +static int quic_init_sock(struct sock *sk)
> +{
> + sk->sk_destruct = inet_sock_destruct;
> + sk->sk_write_space = quic_write_space;
> + sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
> +
> + WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(sysctl_quic_wmem[1]));
> + WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(sysctl_quic_rmem[1]));
> +
> + local_bh_disable();
Why?
> + sk_sockets_allocated_inc(sk);
> + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
> + local_bh_enable();
> +
> + return 0;
> +}
> +
> +static void quic_destroy_sock(struct sock *sk)
> +{
> + local_bh_disable();
Same question :)
[...]
> +static int quic_disconnect(struct sock *sk, int flags)
> +{
> + quic_set_state(sk, QUIC_SS_CLOSED); /* for a listen socket only */
> + return 0;
> +}
disconnect()'s primary use-case is creating a lot of syzkaller reports.
Since there should be no legacy/backward compatibility issue, I suggest
considering a simple implementation always failing.
/P
* Re: [PATCH net-next v2 02/15] net: build socket infrastructure for QUIC protocol
2025-08-21 11:17 ` Paolo Abeni
@ 2025-08-23 18:38 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-23 18:38 UTC (permalink / raw)
To: Paolo Abeni
Cc: network dev, davem, kuba, Eric Dumazet, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
On Thu, Aug 21, 2025 at 7:17 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 4:04 PM, Xin Long wrote:
> > diff --git a/net/Makefile b/net/Makefile
> > index aac960c41db6..7c6de28e9aa5 100644
> > --- a/net/Makefile
> > +++ b/net/Makefile
> > @@ -42,6 +42,7 @@ obj-$(CONFIG_PHONET) += phonet/
> > ifneq ($(CONFIG_VLAN_8021Q),)
> > obj-y += 8021q/
> > endif
> > +obj-$(CONFIG_IP_QUIC) += quic/
> > obj-$(CONFIG_IP_SCTP) += sctp/
> > obj-$(CONFIG_RDS) += rds/
> > obj-$(CONFIG_WIRELESS) += wireless/
> > diff --git a/net/quic/Kconfig b/net/quic/Kconfig
> > new file mode 100644
> > index 000000000000..b64fa398750e
> > --- /dev/null
> > +++ b/net/quic/Kconfig
> > @@ -0,0 +1,35 @@
> > +# SPDX-License-Identifier: GPL-2.0-or-later
> > +#
> > +# QUIC configuration
> > +#
> > +
> > +menuconfig IP_QUIC
> > + tristate "QUIC: A UDP-Based Multiplexed and Secure Transport (Experimental)"
> > + depends on INET
> > + depends on IPV6
>
> What if IPV6=m ?
I think 'depends on IPV6' will include IPV6=m.
>
> > + select CRYPTO
> > + select CRYPTO_HMAC
> > + select CRYPTO_HKDF
> > + select CRYPTO_AES
> > + select CRYPTO_GCM
> > + select CRYPTO_CCM
> > + select CRYPTO_CHACHA20POLY1305
> > + select NET_UDP_TUNNEL
>
> Possibly:
> default n
I missed that..
>
> ?
> [...]
> > +static int quic_init_sock(struct sock *sk)
> > +{
> > + sk->sk_destruct = inet_sock_destruct;
> > + sk->sk_write_space = quic_write_space;
> > + sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
> > +
> > + WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(sysctl_quic_wmem[1]));
> > + WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(sysctl_quic_rmem[1]));
> > +
> > + local_bh_disable();
>
> Why?
Good catch, there was a quic_put_port() before, and I forgot to
delete local_bh_disable() while removing quic_put_port().
>
> > + sk_sockets_allocated_inc(sk);
> > + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
> > + local_bh_enable();
> > +
> > + return 0;
> > +}
> > +
> > +static void quic_destroy_sock(struct sock *sk)
> > +{
> > + local_bh_disable();
>
> Same question :)
>
> [...]
> > +static int quic_disconnect(struct sock *sk, int flags)
> > +{
> > + quic_set_state(sk, QUIC_SS_CLOSED); /* for a listen socket only */
> > + return 0;
> > +}
>
> disconnect() primary use-case is creating a lot of syzkaller reports.
> Since there should be no legacy/backward compatibility issue, I suggest
> considering a simple implementation always failing.
>
OK. This means shutdown(listensk) won't work.
Thanks Paolo for all the helpful feedback!
* [PATCH net-next v2 03/15] quic: provide common utilities and data structures
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 02/15] net: build socket infrastructure for QUIC protocol Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-21 12:58 ` Paolo Abeni
2025-08-18 14:04 ` [PATCH net-next v2 04/15] quic: provide family ops for address and protocol Xin Long
` (12 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch provides foundational data structures and utilities used
throughout the QUIC stack.
It introduces packet header types, connection ID support, and address
handling. Hash tables are added to manage socket lookup and connection
ID mapping.
A flexible binary data type is provided, along with helpers for parsing,
matching, and memory management. Helpers for encoding and decoding
transport parameters and frames are also included.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/common.c | 482 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/common.h | 219 ++++++++++++++++++++
net/quic/protocol.c | 6 +
net/quic/socket.c | 4 +
net/quic/socket.h | 21 ++
6 files changed, 733 insertions(+), 1 deletion(-)
create mode 100644 net/quic/common.c
create mode 100644 net/quic/common.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 020e4dd133d8..e0067272de7d 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := protocol.o socket.o
+quic-y := common.o protocol.o socket.o
diff --git a/net/quic/common.c b/net/quic/common.c
new file mode 100644
index 000000000000..5a7a8257565a
--- /dev/null
+++ b/net/quic/common.c
@@ -0,0 +1,482 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Initialization/cleanup for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include "common.h"
+
+#define QUIC_VARINT_1BYTE_MAX 0x3fULL
+#define QUIC_VARINT_2BYTE_MAX 0x3fffULL
+#define QUIC_VARINT_4BYTE_MAX 0x3fffffffULL
+
+#define QUIC_VARINT_2BYTE_PREFIX 0x40
+#define QUIC_VARINT_4BYTE_PREFIX 0x80
+#define QUIC_VARINT_8BYTE_PREFIX 0xc0
+
+#define QUIC_VARINT_LENGTH(p) BIT((*(p)) >> 6)
+#define QUIC_VARINT_VALUE_MASK 0x3f
+
+static struct quic_hash_table quic_hash_tables[QUIC_HT_MAX_TABLES];
+
+struct quic_hash_head *quic_sock_hash(u32 hash)
+{
+ return &quic_hash_tables[QUIC_HT_SOCK].hash[hash];
+}
+
+struct quic_hash_head *quic_sock_head(struct net *net, union quic_addr *s, union quic_addr *d)
+{
+ struct quic_hash_table *ht = &quic_hash_tables[QUIC_HT_SOCK];
+
+ return &ht->hash[quic_ahash(net, s, d) & (ht->size - 1)];
+}
+
+struct quic_hash_head *quic_listen_sock_head(struct net *net, u16 port)
+{
+ struct quic_hash_table *ht = &quic_hash_tables[QUIC_HT_LISTEN_SOCK];
+
+ return &ht->hash[port & (ht->size - 1)];
+}
+
+struct quic_hash_head *quic_source_conn_id_head(struct net *net, u8 *scid)
+{
+ struct quic_hash_table *ht = &quic_hash_tables[QUIC_HT_CONNECTION_ID];
+
+ return &ht->hash[jhash(scid, 4, 0) & (ht->size - 1)];
+}
+
+struct quic_hash_head *quic_udp_sock_head(struct net *net, u16 port)
+{
+ struct quic_hash_table *ht = &quic_hash_tables[QUIC_HT_UDP_SOCK];
+
+ return &ht->hash[port & (ht->size - 1)];
+}
+
+struct quic_hash_head *quic_stream_head(struct quic_hash_table *ht, s64 stream_id)
+{
+ return &ht->hash[stream_id & (ht->size - 1)];
+}
+
+void quic_hash_tables_destroy(void)
+{
+ struct quic_hash_table *ht;
+ int table;
+
+ for (table = 0; table < QUIC_HT_MAX_TABLES; table++) {
+ ht = &quic_hash_tables[table];
+ ht->size = QUIC_HT_SIZE;
+ kfree(ht->hash);
+ }
+}
+
+int quic_hash_tables_init(void)
+{
+ struct quic_hash_head *head;
+ struct quic_hash_table *ht;
+ int table, i;
+
+ for (table = 0; table < QUIC_HT_MAX_TABLES; table++) {
+ ht = &quic_hash_tables[table];
+ ht->size = QUIC_HT_SIZE;
+ head = kmalloc_array(ht->size, sizeof(*head), GFP_KERNEL);
+ if (!head) {
+ quic_hash_tables_destroy();
+ return -ENOMEM;
+ }
+ for (i = 0; i < ht->size; i++) {
+ INIT_HLIST_HEAD(&head[i].head);
+ if (table == QUIC_HT_UDP_SOCK) {
+ mutex_init(&head[i].m_lock);
+ continue;
+ }
+ spin_lock_init(&head[i].s_lock);
+ }
+ ht->hash = head;
+ }
+
+ return 0;
+}
+
+union quic_var {
+ u8 u8;
+ __be16 be16;
+ __be32 be32;
+ __be64 be64;
+};
+
+/* Returns the number of bytes required to encode a QUIC variable-length integer. */
+u8 quic_var_len(u64 n)
+{
+ if (n <= QUIC_VARINT_1BYTE_MAX)
+ return 1;
+ if (n <= QUIC_VARINT_2BYTE_MAX)
+ return 2;
+ if (n <= QUIC_VARINT_4BYTE_MAX)
+ return 4;
+ return 8;
+}
+
+/* Decodes a QUIC variable-length integer from a buffer. */
+u8 quic_get_var(u8 **pp, u32 *plen, u64 *val)
+{
+ union quic_var n = {};
+ u8 *p = *pp, len;
+ u64 v = 0;
+
+ if (!*plen)
+ return 0;
+
+ len = QUIC_VARINT_LENGTH(p);
+ if (*plen < len)
+ return 0;
+
+ switch (len) {
+ case 1:
+ v = *p;
+ break;
+ case 2:
+ memcpy(&n.be16, p, 2);
+ n.u8 &= QUIC_VARINT_VALUE_MASK;
+ v = be16_to_cpu(n.be16);
+ break;
+ case 4:
+ memcpy(&n.be32, p, 4);
+ n.u8 &= QUIC_VARINT_VALUE_MASK;
+ v = be32_to_cpu(n.be32);
+ break;
+ case 8:
+ memcpy(&n.be64, p, 8);
+ n.u8 &= QUIC_VARINT_VALUE_MASK;
+ v = be64_to_cpu(n.be64);
+ break;
+ default:
+ return 0;
+ }
+
+ *plen -= len;
+ *pp = p + len;
+ *val = v;
+ return len;
+}
+
+/* Reads a fixed-length integer from the buffer. */
+u32 quic_get_int(u8 **pp, u32 *plen, u64 *val, u32 len)
+{
+ union quic_var n;
+ u8 *p = *pp;
+ u64 v = 0;
+
+ if (*plen < len)
+ return 0;
+ *plen -= len;
+
+ switch (len) {
+ case 1:
+ v = *p;
+ break;
+ case 2:
+ memcpy(&n.be16, p, 2);
+ v = be16_to_cpu(n.be16);
+ break;
+ case 3:
+ n.be32 = 0;
+ memcpy(((u8 *)&n.be32) + 1, p, 3);
+ v = be32_to_cpu(n.be32);
+ break;
+ case 4:
+ memcpy(&n.be32, p, 4);
+ v = be32_to_cpu(n.be32);
+ break;
+ case 8:
+ memcpy(&n.be64, p, 8);
+ v = be64_to_cpu(n.be64);
+ break;
+ default:
+ return 0;
+ }
+ *pp = p + len;
+ *val = v;
+ return len;
+}
+
+u32 quic_get_data(u8 **pp, u32 *plen, u8 *data, u32 len)
+{
+ if (*plen < len)
+ return 0;
+
+ memcpy(data, *pp, len);
+ *pp += len;
+ *plen -= len;
+
+ return len;
+}
+
+/* Encodes a value into the QUIC variable-length integer format. */
+u8 *quic_put_var(u8 *p, u64 num)
+{
+ union quic_var n;
+
+ if (num <= QUIC_VARINT_1BYTE_MAX) {
+ *p++ = (u8)(num & 0xff);
+ return p;
+ }
+ if (num <= QUIC_VARINT_2BYTE_MAX) {
+ n.be16 = cpu_to_be16((u16)num);
+ *((__be16 *)p) = n.be16;
+ *p |= QUIC_VARINT_2BYTE_PREFIX;
+ return p + 2;
+ }
+ if (num <= QUIC_VARINT_4BYTE_MAX) {
+ n.be32 = cpu_to_be32((u32)num);
+ *((__be32 *)p) = n.be32;
+ *p |= QUIC_VARINT_4BYTE_PREFIX;
+ return p + 4;
+ }
+ n.be64 = cpu_to_be64(num);
+ *((__be64 *)p) = n.be64;
+ *p |= QUIC_VARINT_8BYTE_PREFIX;
+ return p + 8;
+}
+
+/* Writes a fixed-length integer to the buffer in network byte order. */
+u8 *quic_put_int(u8 *p, u64 num, u8 len)
+{
+ union quic_var n;
+
+ switch (len) {
+ case 1:
+ *p++ = (u8)(num & 0xff);
+ return p;
+ case 2:
+ n.be16 = cpu_to_be16((u16)(num & 0xffff));
+ *((__be16 *)p) = n.be16;
+ return p + 2;
+ case 4:
+ n.be32 = cpu_to_be32((u32)num);
+ *((__be32 *)p) = n.be32;
+ return p + 4;
+ default:
+ return NULL;
+ }
+}
+
+/* Encodes a value as a variable-length integer with explicit length. */
+u8 *quic_put_varint(u8 *p, u64 num, u8 len)
+{
+ union quic_var n;
+
+ switch (len) {
+ case 1:
+ *p++ = (u8)(num & 0xff);
+ return p;
+ case 2:
+ n.be16 = cpu_to_be16((u16)(num & 0xffff));
+ *((__be16 *)p) = n.be16;
+ *p |= QUIC_VARINT_2BYTE_PREFIX;
+ return p + 2;
+ case 4:
+ n.be32 = cpu_to_be32((u32)num);
+ *((__be32 *)p) = n.be32;
+ *p |= QUIC_VARINT_4BYTE_PREFIX;
+ return p + 4;
+ default:
+ return NULL;
+ }
+}
+
+u8 *quic_put_data(u8 *p, u8 *data, u32 len)
+{
+ if (!len)
+ return p;
+
+ memcpy(p, data, len);
+ return p + len;
+}
+
+/* Writes a transport parameter as two varints: ID and value length, followed by value. */
+u8 *quic_put_param(u8 *p, u16 id, u64 value)
+{
+ p = quic_put_var(p, id);
+ p = quic_put_var(p, quic_var_len(value));
+ return quic_put_var(p, value);
+}
+
+/* Reads a QUIC transport parameter value. */
+u8 quic_get_param(u64 *pdest, u8 **pp, u32 *plen)
+{
+ u64 valuelen;
+
+ if (!quic_get_var(pp, plen, &valuelen))
+ return 0;
+
+ if (*plen < valuelen)
+ return 0;
+
+ if (!quic_get_var(pp, plen, pdest))
+ return 0;
+
+ return (u8)valuelen;
+}
+
+/* rfc9000#section-a.3: DecodePacketNumber()
+ *
+ * Reconstructs the full packet number from a truncated one.
+ */
+s64 quic_get_num(s64 max_pkt_num, s64 pkt_num, u32 n)
+{
+ s64 expected = max_pkt_num + 1;
+ s64 win = BIT_ULL(n * 8);
+ s64 hwin = win / 2;
+ s64 mask = win - 1;
+ s64 cand;
+
+ cand = (expected & ~mask) | pkt_num;
+ if (cand <= expected - hwin && cand < (1ULL << 62) - win)
+ return cand + win;
+ if (cand > expected + hwin && cand >= win)
+ return cand - win;
+ return cand;
+}
+
+int quic_data_dup(struct quic_data *to, u8 *data, u32 len)
+{
+ if (!len)
+ return 0;
+
+ data = kmemdup(data, len, GFP_ATOMIC);
+ if (!data)
+ return -ENOMEM;
+
+ kfree(to->data);
+ to->data = data;
+ to->len = len;
+ return 0;
+}
+
+int quic_data_append(struct quic_data *to, u8 *data, u32 len)
+{
+ u8 *p;
+
+ if (!len)
+ return 0;
+
+ p = kzalloc(to->len + len, GFP_ATOMIC);
+ if (!p)
+ return -ENOMEM;
+ p = quic_put_data(p, to->data, to->len);
+ p = quic_put_data(p, data, len);
+
+ kfree(to->data);
+ to->len = to->len + len;
+ to->data = p - to->len;
+ return 0;
+}
+
+/* Check whether 'd2' is equal to any element inside the list 'd1'.
+ *
+ * 'd1' is assumed to be a sequence of length-prefixed elements. Each element
+ * is compared to 'd2' using 'quic_data_cmp()'.
+ *
+ * Returns 1 if a match is found, 0 otherwise.
+ */
+int quic_data_has(struct quic_data *d1, struct quic_data *d2)
+{
+ struct quic_data d;
+ u64 length;
+ u32 len;
+ u8 *p;
+
+ for (p = d1->data, len = d1->len; len; len -= length, p += length) {
+ quic_get_int(&p, &len, &length, 1);
+ quic_data(&d, p, length);
+ if (!quic_data_cmp(&d, d2))
+ return 1;
+ }
+ return 0;
+}
+
+/* Check if any element of 'd1' is present in the list 'd2'.
+ *
+ * Iterates through each element in 'd1', and uses 'quic_data_has()' to check
+ * for its presence in 'd2'.
+ *
+ * Returns 1 if any match is found, 0 otherwise.
+ */
+int quic_data_match(struct quic_data *d1, struct quic_data *d2)
+{
+ struct quic_data d;
+ u64 length;
+ u32 len;
+ u8 *p;
+
+ for (p = d1->data, len = d1->len; len; len -= length, p += length) {
+ quic_get_int(&p, &len, &length, 1);
+ quic_data(&d, p, length);
+ if (quic_data_has(d2, &d))
+ return 1;
+ }
+ return 0;
+}
+
+/* Serialize a list of 'quic_data' elements into a comma-separated string.
+ *
+ * Each element in 'from' is length-prefixed. This function copies their raw
+ * content into the output buffer 'to', inserting commas in between. The
+ * resulting string length is written to '*plen'.
+ */
+void quic_data_to_string(u8 *to, u32 *plen, struct quic_data *from)
+{
+ struct quic_data d;
+ u8 *data = to, *p;
+ u64 length;
+ u32 len;
+
+ for (p = from->data, len = from->len; len; len -= length, p += length) {
+ quic_get_int(&p, &len, &length, 1);
+ quic_data(&d, p, length);
+ data = quic_put_data(data, d.data, d.len);
+ if (len - length)
+ data = quic_put_int(data, ',', 1);
+ }
+ *plen = data - to;
+}
+
+/* Parse a comma-separated string into a 'quic_data' list format.
+ *
+ * Each comma-separated token is turned into a length-prefixed element. The
+ * first byte of each element stores the length (minus one). Elements are
+ * stored in 'to->data', and 'to->len' is updated.
+ */
+void quic_data_from_string(struct quic_data *to, u8 *from, u32 len)
+{
+ struct quic_data d;
+ u8 *p = to->data;
+
+ to->len = 0;
+ while (len) {
+ d.data = p++;
+ d.len = 1;
+ while (len && *from == ' ') {
+ from++;
+ len--;
+ }
+ while (len) {
+ if (*from == ',') {
+ from++;
+ len--;
+ break;
+ }
+ *p++ = *from++;
+ len--;
+ d.len++;
+ }
+ *d.data = (u8)(d.len - 1);
+ to->len += d.len;
+ }
+}
diff --git a/net/quic/common.h b/net/quic/common.h
new file mode 100644
index 000000000000..07f8fbc41683
--- /dev/null
+++ b/net/quic/common.h
@@ -0,0 +1,219 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/netns/hash.h>
+#include <linux/jhash.h>
+
+#define QUIC_HT_SIZE 64
+
+#define QUIC_MAX_ACK_DELAY (16384 * 1000)
+#define QUIC_DEF_ACK_DELAY 25000
+
+#define QUIC_STREAM_BIT_FIN 0x01
+#define QUIC_STREAM_BIT_LEN 0x02
+#define QUIC_STREAM_BIT_OFF 0x04
+#define QUIC_STREAM_BIT_MASK 0x08
+
+#define QUIC_CONN_ID_MAX_LEN 20
+#define QUIC_CONN_ID_DEF_LEN 8
+
+struct quic_conn_id {
+ u8 data[QUIC_CONN_ID_MAX_LEN];
+ u8 len;
+};
+
+static inline void quic_conn_id_update(struct quic_conn_id *conn_id, u8 *data, u32 len)
+{
+ memcpy(conn_id->data, data, len);
+ conn_id->len = (u8)len;
+}
+
+struct quic_skb_cb {
+ /* Callback when encryption/decryption completes in async mode */
+ void (*crypto_done)(struct sk_buff *skb, int err);
+ union {
+ struct sk_buff *last; /* Last packet in TX bundle */
+ s64 seqno; /* Dest connection ID number on RX */
+ };
+ s64 number_max; /* Largest packet number seen before parsing this one */
+ s64 number; /* Parsed packet number */
+ u16 errcode; /* Error code if encryption/decryption fails */
+ u16 length; /* Payload length + packet number length */
+ u32 time; /* Arrival time in UDP tunnel */
+
+ u16 number_offset; /* Offset of packet number field */
+ u16 udph_offset; /* Offset of UDP header */
+ u8 number_len; /* Length of the packet number field */
+ u8 level; /* Encryption level: Initial, Handshake, App, or Early */
+
+ u8 key_update:1; /* Key update triggered by this packet */
+ u8 key_phase:1; /* Key phase used (0 or 1) */
+ u8 resume:1; /* Crypto already processed (encrypted or decrypted) */
+ u8 path:1; /* Packet arrived from a new or migrating path */
+ u8 ecn:2; /* ECN marking used on TX */
+};
+
+#define QUIC_SKB_CB(skb) ((struct quic_skb_cb *)&((skb)->cb[0]))
+
+static inline struct udphdr *quic_udphdr(const struct sk_buff *skb)
+{
+ return (struct udphdr *)(skb->head + QUIC_SKB_CB(skb)->udph_offset);
+}
+
+struct quichdr {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+ __u8 pnl:2,
+ key:1,
+ reserved:2,
+ spin:1,
+ fixed:1,
+ form:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 form:1,
+ fixed:1,
+ spin:1,
+ reserved:2,
+ key:1,
+ pnl:2;
+#endif
+};
+
+static inline struct quichdr *quic_hdr(struct sk_buff *skb)
+{
+ return (struct quichdr *)skb_transport_header(skb);
+}
+
+struct quichshdr {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+ __u8 pnl:2,
+ reserved:2,
+ type:2,
+ fixed:1,
+ form:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 form:1,
+ fixed:1,
+ type:2,
+ reserved:2,
+ pnl:2;
+#endif
+};
+
+static inline struct quichshdr *quic_hshdr(struct sk_buff *skb)
+{
+ return (struct quichshdr *)skb_transport_header(skb);
+}
+
+union quic_addr {
+ struct sockaddr_in6 v6;
+ struct sockaddr_in v4;
+ struct sockaddr sa;
+};
+
+static inline union quic_addr *quic_addr(const void *addr)
+{
+ return (union quic_addr *)addr;
+}
+
+struct quic_hash_head {
+ struct hlist_head head;
+ union {
+ spinlock_t s_lock; /* Protects 'head' in atomic context */
+ struct mutex m_lock; /* Protects 'head' in process context */
+ };
+};
+
+struct quic_hash_table {
+ struct quic_hash_head *hash;
+ int size;
+};
+
+enum {
+ QUIC_HT_SOCK, /* Hash table for QUIC sockets */
+ QUIC_HT_UDP_SOCK, /* Hash table for UDP tunnel sockets */
+ QUIC_HT_LISTEN_SOCK, /* Hash table for QUIC listening sockets */
+ QUIC_HT_CONNECTION_ID, /* Hash table for source connection IDs */
+ QUIC_HT_MAX_TABLES,
+};
+
+static inline u32 quic_shash(const struct net *net, const union quic_addr *a)
+{
+ u32 addr = (a->sa.sa_family == AF_INET6) ? jhash(&a->v6.sin6_addr, 16, 0)
+ : (__force u32)a->v4.sin_addr.s_addr;
+
+ return jhash_3words(addr, (__force u32)a->v4.sin_port, net_hash_mix(net), 0);
+}
+
+static inline u32 quic_ahash(const struct net *net, const union quic_addr *s,
+ const union quic_addr *d)
+{
+ u32 ports = ((__force u32)s->v4.sin_port) << 16 | (__force u32)d->v4.sin_port;
+ u32 saddr = (s->sa.sa_family == AF_INET6) ? jhash(&s->v6.sin6_addr, 16, 0)
+ : (__force u32)s->v4.sin_addr.s_addr;
+ u32 daddr = (d->sa.sa_family == AF_INET6) ? jhash(&d->v6.sin6_addr, 16, 0)
+ : (__force u32)d->v4.sin_addr.s_addr;
+
+ return jhash_3words(saddr, ports, net_hash_mix(net), daddr);
+}
+
+struct quic_data {
+ u8 *data;
+ u32 len;
+};
+
+static inline struct quic_data *quic_data(struct quic_data *d, u8 *data, u32 len)
+{
+ d->data = data;
+ d->len = len;
+ return d;
+}
+
+static inline int quic_data_cmp(struct quic_data *d1, struct quic_data *d2)
+{
+ return d1->len != d2->len || memcmp(d1->data, d2->data, d1->len);
+}
+
+static inline void quic_data_free(struct quic_data *d)
+{
+ kfree(d->data);
+ d->data = NULL;
+ d->len = 0;
+}
+
+struct quic_hash_head *quic_sock_head(struct net *net, union quic_addr *s, union quic_addr *d);
+struct quic_hash_head *quic_listen_sock_head(struct net *net, u16 port);
+struct quic_hash_head *quic_stream_head(struct quic_hash_table *ht, s64 stream_id);
+struct quic_hash_head *quic_source_conn_id_head(struct net *net, u8 *scid);
+struct quic_hash_head *quic_udp_sock_head(struct net *net, u16 port);
+
+struct quic_hash_head *quic_sock_hash(u32 hash);
+void quic_hash_tables_destroy(void);
+int quic_hash_tables_init(void);
+
+u32 quic_get_data(u8 **pp, u32 *plen, u8 *data, u32 len);
+u32 quic_get_int(u8 **pp, u32 *plen, u64 *val, u32 len);
+s64 quic_get_num(s64 max_pkt_num, s64 pkt_num, u32 n);
+u8 quic_get_param(u64 *pdest, u8 **pp, u32 *plen);
+u8 quic_get_var(u8 **pp, u32 *plen, u64 *val);
+u8 quic_var_len(u64 n);
+
+u8 *quic_put_param(u8 *p, u16 id, u64 value);
+u8 *quic_put_data(u8 *p, u8 *data, u32 len);
+u8 *quic_put_varint(u8 *p, u64 num, u8 len);
+u8 *quic_put_int(u8 *p, u64 num, u8 len);
+u8 *quic_put_var(u8 *p, u64 num);
+
+void quic_data_from_string(struct quic_data *to, u8 *from, u32 len);
+void quic_data_to_string(u8 *to, u32 *plen, struct quic_data *from);
+
+int quic_data_match(struct quic_data *d1, struct quic_data *d2);
+int quic_data_append(struct quic_data *to, u8 *data, u32 len);
+int quic_data_has(struct quic_data *d1, struct quic_data *d2);
+int quic_data_dup(struct quic_data *to, u8 *data, u32 len);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index 01a5fdfb5227..522c194d4577 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -327,6 +327,9 @@ static __init int quic_init(void)
if (err)
goto err_percpu_counter;
+ err = quic_hash_tables_init();
+ if (err)
+ goto err_hash;
+
err = register_pernet_subsys(&quic_net_ops);
if (err)
goto err_def_ops;
@@ -344,6 +347,8 @@ static __init int quic_init(void)
err_protosw:
unregister_pernet_subsys(&quic_net_ops);
err_def_ops:
+ quic_hash_tables_destroy();
+err_hash:
percpu_counter_destroy(&quic_sockets_allocated);
err_percpu_counter:
return err;
@@ -356,6 +361,7 @@ static __exit void quic_exit(void)
#endif
quic_protosw_exit();
unregister_pernet_subsys(&quic_net_ops);
+ quic_hash_tables_destroy();
percpu_counter_destroy(&quic_sockets_allocated);
pr_info("quic: exit\n");
}
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 320a9a5a3c53..9cab01109db7 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -55,6 +55,10 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ quic_data_free(quic_ticket(sk));
+ quic_data_free(quic_token(sk));
+ quic_data_free(quic_alpn(sk));
+
local_bh_disable();
sk_sockets_allocated_dec(sk);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
diff --git a/net/quic/socket.h b/net/quic/socket.h
index ded8eb2e6a9c..6cbf12bcae75 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -10,6 +10,8 @@
#include <net/udp_tunnel.h>
+#include "common.h"
+
#include "protocol.h"
extern struct proto quic_prot;
@@ -25,6 +27,10 @@ enum quic_state {
struct quic_sock {
struct inet_sock inet;
struct list_head reqs;
+
+ struct quic_data ticket;
+ struct quic_data token;
+ struct quic_data alpn;
};
struct quic6_sock {
@@ -42,6 +48,21 @@ static inline struct list_head *quic_reqs(const struct sock *sk)
return &quic_sk(sk)->reqs;
}
+static inline struct quic_data *quic_token(const struct sock *sk)
+{
+ return &quic_sk(sk)->token;
+}
+
+static inline struct quic_data *quic_ticket(const struct sock *sk)
+{
+ return &quic_sk(sk)->ticket;
+}
+
+static inline struct quic_data *quic_alpn(const struct sock *sk)
+{
+ return &quic_sk(sk)->alpn;
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
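[Editor's note: the quic_get_var()/quic_put_var() helpers declared in common.h above implement QUIC variable-length integers (RFC 9000, Section 16): the two high bits of the first byte select a 1/2/4/8-byte big-endian encoding, leaving 6/14/30/62 bits for the value. A minimal userspace sketch of that encoding — the names put_var/get_var are illustrative, not the kernel API:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode 'num' as a QUIC variable-length integer (RFC 9000, Section 16).
 * Returns the number of bytes written (1, 2, 4 or 8).
 */
static size_t put_var(uint8_t *p, uint64_t num)
{
	if (num < (1ULL << 6)) {
		p[0] = (uint8_t)num;
		return 1;
	}
	if (num < (1ULL << 14)) {
		p[0] = 0x40 | (uint8_t)(num >> 8);
		p[1] = (uint8_t)num;
		return 2;
	}
	if (num < (1ULL << 30)) {
		p[0] = 0x80 | (uint8_t)(num >> 24);
		p[1] = (uint8_t)(num >> 16);
		p[2] = (uint8_t)(num >> 8);
		p[3] = (uint8_t)num;
		return 4;
	}
	p[0] = 0xc0 | (uint8_t)(num >> 56);
	for (int i = 1; i < 8; i++)
		p[i] = (uint8_t)(num >> (8 * (7 - i)));
	return 8;
}

/* Decode a varint; returns bytes consumed, or 0 if the buffer is short. */
static size_t get_var(const uint8_t *p, size_t len, uint64_t *val)
{
	size_t n;

	if (!len)
		return 0;
	n = (size_t)1 << (p[0] >> 6);	/* length from the two high bits */
	if (len < n)
		return 0;
	*val = p[0] & 0x3f;
	for (size_t i = 1; i < n; i++)
		*val = (*val << 8) | p[i];
	return n;
}
```

[The round-trip matches the worked examples in RFC 9000 Appendix A.1, e.g. 15293 encodes as 0x7bbd.]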
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 03/15] quic: provide common utilities and data structures
2025-08-18 14:04 ` [PATCH net-next v2 03/15] quic: provide common utilities and data structures Xin Long
@ 2025-08-21 12:58 ` Paolo Abeni
2025-08-23 18:15 ` Xin Long
0 siblings, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2025-08-21 12:58 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Simon Horman, Stefan Metzmacher,
Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs, Steve French,
Namjae Jeon, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
On 8/18/25 4:04 PM, Xin Long wrote:
[...]
> +void quic_hash_tables_destroy(void)
> +{
> + struct quic_hash_table *ht;
> + int table;
> +
> + for (table = 0; table < QUIC_HT_MAX_TABLES; table++) {
> + ht = &quic_hash_tables[table];
> + ht->size = QUIC_HT_SIZE;
Why?
> + kfree(ht->hash);
> + }
> +}
> +
> +int quic_hash_tables_init(void)
> +{
> + struct quic_hash_head *head;
> + struct quic_hash_table *ht;
> + int table, i;
> +
> + for (table = 0; table < QUIC_HT_MAX_TABLES; table++) {
> + ht = &quic_hash_tables[table];
> + ht->size = QUIC_HT_SIZE;
AFAICS the hash table size is always QUIC_HT_SIZE, which feels too
small for connection IDs and possibly QUIC sockets.
Do you need to differentiate the size among the different hash types?
> + head = kmalloc_array(ht->size, sizeof(*head), GFP_KERNEL);
If so, possibly you should resort to kvmalloc_array here.
> + if (!head) {
> + quic_hash_tables_destroy();
> + return -ENOMEM;
> + }
> + for (i = 0; i < ht->size; i++) {
> + INIT_HLIST_HEAD(&head[i].head);
> + if (table == QUIC_HT_UDP_SOCK) {
> + mutex_init(&head[i].m_lock);
> + continue;
> + }
> + spin_lock_init(&head[i].s_lock);
Doh, I missed the union mutex/spinlock. IMHO it would be cleaner to use
separate hash types.
[...]
> +/* Parse a comma-separated string into a 'quic_data' list format.
> + *
> + * Each comma-separated token is turned into a length-prefixed element. The
> + * first byte of each element stores the length (minus one). Elements are
> + * stored in 'to->data', and 'to->len' is updated.
> + */
> +void quic_data_from_string(struct quic_data *to, u8 *from, u32 len)
> +{
> + struct quic_data d;
> + u8 *p = to->data;
> +
> + to->len = 0;
> + while (len) {
> + d.data = p++;
> + d.len = 1;
> + while (len && *from == ' ') {
> + from++;
> + len--;
> + }
> + while (len) {
> + if (*from == ',') {
> + from++;
> + len--;
> + break;
> + }
> + *p++ = *from++;
> + len--;
> + d.len++;
> + }
> + *d.data = (u8)(d.len - 1);
> + to->len += d.len;
> + }
The above does not perform any bounds checking against the destination
buffer, which feels fragile.
/P
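[Editor's note: the quoted loop can be modeled in userspace to see the layout it produces — each comma-separated token becomes a length-prefixed element, with leading spaces skipped. A sketch with illustrative names, carrying the same implicit assumption Paolo points at, namely that the destination holds at least len + 1 bytes:]

```c
#include <stdint.h>

/* Userspace model of the quoted quic_data_from_string() loop: reserve one
 * length byte per token, copy the token body, then backfill the length.
 * Returns the total number of bytes written to 'to'.
 */
static uint32_t data_from_string(uint8_t *to, const uint8_t *from, uint32_t len)
{
	uint32_t out = 0;

	while (len) {
		uint8_t *lenp = &to[out++];	/* reserve the length byte */
		uint8_t n = 0;

		while (len && *from == ' ') {	/* skip leading spaces */
			from++;
			len--;
		}
		while (len) {
			if (*from == ',') {	/* token terminator, dropped */
				from++;
				len--;
				break;
			}
			to[out++] = *from++;
			len--;
			n++;
		}
		*lenp = n;			/* backfill payload length */
	}
	return out;
}
```

[For an ALPN string such as "h3, hq-29" this yields {2,'h','3',5,'h','q','-','2','9'} — the TLS-style length-prefixed list format.]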
* Re: [PATCH net-next v2 03/15] quic: provide common utilities and data structures
2025-08-21 12:58 ` Paolo Abeni
@ 2025-08-23 18:15 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-23 18:15 UTC (permalink / raw)
To: Paolo Abeni
Cc: network dev, davem, kuba, Eric Dumazet, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
On Thu, Aug 21, 2025 at 8:58 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 4:04 PM, Xin Long wrote:
> [...]
> > +void quic_hash_tables_destroy(void)
> > +{
> > + struct quic_hash_table *ht;
> > + int table;
> > +
> > + for (table = 0; table < QUIC_HT_MAX_TABLES; table++) {
> > + ht = &quic_hash_tables[table];
> > + ht->size = QUIC_HT_SIZE;
>
> Why?
>
> > + kfree(ht->hash);
> > + }
> > +}
> > +
> > +int quic_hash_tables_init(void)
> > +{
> > + struct quic_hash_head *head;
> > + struct quic_hash_table *ht;
> > + int table, i;
> > +
> > + for (table = 0; table < QUIC_HT_MAX_TABLES; table++) {
> > + ht = &quic_hash_tables[table];
> > + ht->size = QUIC_HT_SIZE;
>
> AFAICS the hash table size is always QUIC_HT_SIZE, which feels like too
> small for connection and possibly quick sockets.
indeed.
>
> Do yoi need to differentiate the size among the different hash types?
Yes, I will change to use alloc_large_system_hash() for these
hashtables, aligning with TCP:
- source connection IDs hashtable -> tcp_hashinfo.ehash
  (for data receiving)
- QUIC listening sockets hashtable -> tcp_hashinfo.lhash2
  (for new connections)
- UDP tunnel sockets hashtable -> tcp_hashinfo.bhash
  (for binding)
- QUIC sockets hashtable -> tcp_hashinfo.bhash
  (for some rare special cases, like receiving a stateless reset)
>
> > + head = kmalloc_array(ht->size, sizeof(*head), GFP_KERNEL);
>
> If so, possibly you should resort to kvmalloc_array here.
>
> > + if (!head) {
> > + quic_hash_tables_destroy();
> > + return -ENOMEM;
> > + }
> > + for (i = 0; i < ht->size; i++) {
> > + INIT_HLIST_HEAD(&head[i].head);
> > + if (table == QUIC_HT_UDP_SOCK) {
> > + mutex_init(&head[i].m_lock);
> > + continue;
> > + }
> > + spin_lock_init(&head[i].s_lock);
>
> Doh, I missed the union mutex/spinlock. IMHO it would be cleaner to use
> separate hash types.
Yeah, I will separate them. :D
>
> [...]
> > +/* Parse a comma-separated string into a 'quic_data' list format.
> > + *
> > + * Each comma-separated token is turned into a length-prefixed element. The
> > + * first byte of each element stores the length (minus one). Elements are
> > + * stored in 'to->data', and 'to->len' is updated.
> > + */
> > +void quic_data_from_string(struct quic_data *to, u8 *from, u32 len)
> > +{
> > + struct quic_data d;
> > + u8 *p = to->data;
> > +
> > + to->len = 0;
> > + while (len) {
> > + d.data = p++;
> > + d.len = 1;
> > + while (len && *from == ' ') {
> > + from++;
> > + len--;
> > + }
> > + while (len) {
> > + if (*from == ',') {
> > + from++;
> > + len--;
> > + break;
> > + }
> > + *p++ = *from++;
> > + len--;
> > + d.len++;
> > + }
> > + *d.data = (u8)(d.len - 1);
> > + to->len += d.len;
> > + }
>
> The above does not perform any bound checking vs the destination buffer,
> it feels fragile.
>
This function needs two checks:
1. len < U8_MAX
2. the 'to->data' buffer is at least len + 1 bytes.
Although callers like quic_sock_set_alpn() already ensure that, I guess
the checks should still be done in this function since it is a common
helper.
Thanks.
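[Editor's note: the two checks described above could look like this in a userspace sketch. The names and the 'cap' parameter (the destination buffer size) are illustrative, not the kernel API:]

```c
#include <stdint.h>

#define U8_MAX 255

/* Bounds-checked variant of the parser: rejects inputs of U8_MAX bytes or
 * more (check 1) and destination buffers smaller than len + 1 bytes
 * (check 2), the worst case being one length byte plus the whole input.
 * Returns the output length, or -1 on error.
 */
static int data_from_string_checked(uint8_t *to, uint32_t cap,
				    const uint8_t *from, uint32_t len)
{
	uint32_t out = 0;

	if (len >= U8_MAX)	/* check 1: token lengths must fit in a u8 */
		return -1;
	if (cap < len + 1)	/* check 2: destination must hold len + 1 */
		return -1;

	while (len) {
		uint8_t *lenp = &to[out++];
		uint8_t n = 0;

		while (len && *from == ' ') {
			from++;
			len--;
		}
		while (len) {
			if (*from == ',') {
				from++;
				len--;
				break;
			}
			to[out++] = *from++;
			len--;
			n++;
		}
		*lenp = n;
	}
	return (int)out;
}
```

[Since every input byte contributes at most one output byte and each token adds one length byte, the output can never exceed len + 1, so check 2 is sufficient for all inputs.]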
* [PATCH net-next v2 04/15] quic: provide family ops for address and protocol
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (2 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 03/15] quic: provide common utilities and data structures Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-21 13:17 ` Paolo Abeni
2025-08-18 14:04 ` [PATCH net-next v2 05/15] quic: provide quic.h header files for kernel and userspace Xin Long
` (11 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces two new abstraction structures to simplify handling
of IPv4 and IPv6 differences across the QUIC stack:
- quic_addr_family_ops: for address comparison, flow routing,
UDP config, MTU lookup, formatted output, etc.
- quic_proto_family_ops: for socket address helpers and preference.
With these additions, the QUIC core logic can remain agnostic of the
address family and socket type, improving modularity and reducing
repetitive checks throughout the codebase.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
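[Editor's note: the dispatch pattern this patch introduces — one ops table per address family, selected by a boolean index on sa_family so that common code never branches on the family itself — can be sketched in userspace. All names below are illustrative, not the kernel API:]

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

union xaddr {
	struct sockaddr_in6 v6;
	struct sockaddr_in v4;
	struct sockaddr sa;
};

struct addr_ops {
	unsigned int iph_len;			/* network header length */
	int (*is_any_addr)(const union xaddr *a);
};

static int v4_is_any_addr(const union xaddr *a)
{
	return a->v4.sin_addr.s_addr == htonl(INADDR_ANY);
}

static int v6_is_any_addr(const union xaddr *a)
{
	return !memcmp(&a->v6.sin6_addr, &in6addr_any, sizeof(in6addr_any));
}

static const struct addr_ops af_inet_ops = {
	.iph_len = 20,				/* sizeof(struct iphdr) */
	.is_any_addr = v4_is_any_addr,
};

static const struct addr_ops af_inet6_ops = {
	.iph_len = 40,				/* sizeof(struct ipv6hdr) */
	.is_any_addr = v6_is_any_addr,
};

static const struct addr_ops *afs[] = { &af_inet_ops, &af_inet6_ops };

/* Mirrors the patch's quic_af() macro: index is 1 iff the addr is IPv6. */
#define af_ops(a) afs[(a)->sa.sa_family == AF_INET6]

static int is_any_addr(const union xaddr *a)
{
	return af_ops(a)->is_any_addr(a);
}
```

[Callers such as is_any_addr() stay family-agnostic; adding a family-specific behavior means adding one function pointer to the ops struct rather than sprinkling sa_family checks through the core.]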
---
net/quic/Makefile | 2 +-
net/quic/family.c | 686 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/family.h | 41 +++
net/quic/protocol.c | 2 +-
net/quic/socket.c | 4 +-
net/quic/socket.h | 1 +
6 files changed, 732 insertions(+), 4 deletions(-)
create mode 100644 net/quic/family.c
create mode 100644 net/quic/family.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index e0067272de7d..13bf4a4e5442 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o protocol.o socket.o
+quic-y := common.o family.o protocol.o socket.o
diff --git a/net/quic/family.c b/net/quic/family.c
new file mode 100644
index 000000000000..3d52c4ee2b56
--- /dev/null
+++ b/net/quic/family.c
@@ -0,0 +1,686 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Address-family and protocol-family operations for QUIC.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/inet_common.h>
+#include <net/udp_tunnel.h>
+#include <linux/icmp.h>
+
+#include "common.h"
+#include "family.h"
+
+struct quic_addr_family_ops {
+ u32 iph_len; /* Network layer header length */
+ int (*is_any_addr)(union quic_addr *addr); /* Check if the addr is a wildcard (ANY) */
+ /* Dump the address into a seq_file (e.g., for /proc/net/quic/sks) */
+ void (*seq_dump_addr)(struct seq_file *seq, union quic_addr *addr);
+
+ /* Initialize UDP tunnel socket configuration */
+ void (*udp_conf_init)(struct sock *sk, struct udp_port_cfg *conf, union quic_addr *addr);
+ /* Perform IP route lookup */
+ int (*flow_route)(struct sock *sk, union quic_addr *da, union quic_addr *sa,
+ struct flowi *fl);
+ /* Transmit packet through UDP tunnel socket */
+ void (*lower_xmit)(struct sock *sk, struct sk_buff *skb, struct flowi *fl);
+
+ /* Extract source and destination IP addresses from the packet */
+ void (*get_msg_addrs)(struct sk_buff *skb, union quic_addr *da, union quic_addr *sa);
+ /* Extract MTU information from an ICMP packet */
+ int (*get_mtu_info)(struct sk_buff *skb, u32 *info);
+ /* Extract ECN bits from the packet */
+ u8 (*get_msg_ecn)(struct sk_buff *skb);
+};
+
+struct quic_proto_family_ops {
+ /* Validate and convert user address from bind/connect/setsockopt */
+ int (*get_user_addr)(struct sock *sk, union quic_addr *a, struct sockaddr *addr,
+ int addr_len);
+ /* Get the 'preferred_address' from transport parameters (rfc9000#section-18.2) */
+ void (*get_pref_addr)(struct sock *sk, union quic_addr *addr, u8 **pp, u32 *plen);
+ /* Set the 'preferred_address' into transport parameters (rfc9000#section-18.2) */
+ void (*set_pref_addr)(struct sock *sk, u8 *p, union quic_addr *addr);
+
+ /* Compare two addresses considering socket family and wildcard (ANY) match */
+ bool (*cmp_sk_addr)(struct sock *sk, union quic_addr *a, union quic_addr *addr);
+ /* Get socket's local or peer address (getsockname/getpeername) */
+ int (*get_sk_addr)(struct socket *sock, struct sockaddr *addr, int peer);
+ /* Set socket's source or destination address */
+ void (*set_sk_addr)(struct sock *sk, union quic_addr *addr, bool src);
+ /* Set ECN bits for the socket */
+ void (*set_sk_ecn)(struct sock *sk, u8 ecn);
+
+ /* Handle getsockopt() for non-SOL_QUIC levels */
+ int (*getsockopt)(struct sock *sk, int level, int optname, char __user *optval,
+ int __user *optlen);
+ /* Handle setsockopt() for non-SOL_QUIC levels */
+ int (*setsockopt)(struct sock *sk, int level, int optname, sockptr_t optval,
+ unsigned int optlen);
+};
+
+static int quic_v4_is_any_addr(union quic_addr *addr)
+{
+ return addr->v4.sin_addr.s_addr == htonl(INADDR_ANY);
+}
+
+static int quic_v6_is_any_addr(union quic_addr *addr)
+{
+ return ipv6_addr_any(&addr->v6.sin6_addr);
+}
+
+static void quic_v4_seq_dump_addr(struct seq_file *seq, union quic_addr *addr)
+{
+ seq_printf(seq, "%pI4:%d\t", &addr->v4.sin_addr.s_addr, ntohs(addr->v4.sin_port));
+}
+
+static void quic_v6_seq_dump_addr(struct seq_file *seq, union quic_addr *addr)
+{
+ seq_printf(seq, "%pI6c:%d\t", &addr->v6.sin6_addr, ntohs(addr->v4.sin_port));
+}
+
+static void quic_v4_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf, union quic_addr *a)
+{
+ conf->family = AF_INET;
+ conf->local_ip.s_addr = a->v4.sin_addr.s_addr;
+ conf->local_udp_port = a->v4.sin_port;
+ conf->use_udp6_rx_checksums = true;
+ conf->bind_ifindex = sk->sk_bound_dev_if;
+}
+
+static void quic_v6_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf, union quic_addr *a)
+{
+ conf->family = AF_INET6;
+ conf->local_ip6 = a->v6.sin6_addr;
+ conf->local_udp_port = a->v6.sin6_port;
+ conf->use_udp6_rx_checksums = true;
+ conf->ipv6_v6only = ipv6_only_sock(sk);
+ conf->bind_ifindex = sk->sk_bound_dev_if;
+}
+
+static int quic_v4_flow_route(struct sock *sk, union quic_addr *da, union quic_addr *sa,
+ struct flowi *fl)
+{
+ struct flowi4 *fl4;
+ struct rtable *rt;
+ struct flowi _fl;
+
+ if (__sk_dst_check(sk, 0))
+ return 1;
+
+ fl4 = &_fl.u.ip4;
+ memset(&_fl, 0x00, sizeof(_fl));
+ fl4->saddr = sa->v4.sin_addr.s_addr;
+ fl4->fl4_sport = sa->v4.sin_port;
+ fl4->daddr = da->v4.sin_addr.s_addr;
+ fl4->fl4_dport = da->v4.sin_port;
+ fl4->flowi4_proto = IPPROTO_UDP;
+ fl4->flowi4_oif = sk->sk_bound_dev_if;
+
+ fl4->flowi4_scope = ip_sock_rt_scope(sk);
+ fl4->flowi4_tos = ip_sock_rt_tos(sk);
+
+ rt = ip_route_output_key(sock_net(sk), fl4);
+ if (IS_ERR(rt))
+ return PTR_ERR(rt);
+
+ if (!sa->v4.sin_family) {
+ sa->v4.sin_family = AF_INET;
+ sa->v4.sin_addr.s_addr = fl4->saddr;
+ }
+ sk_setup_caps(sk, &rt->dst);
+ memcpy(fl, &_fl, sizeof(_fl));
+ return 0;
+}
+
+static int quic_v6_flow_route(struct sock *sk, union quic_addr *da, union quic_addr *sa,
+ struct flowi *fl)
+{
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct ip6_flowlabel *flowlabel;
+ struct dst_entry *dst;
+ struct flowi6 *fl6;
+ struct flowi _fl;
+
+ if (__sk_dst_check(sk, np->dst_cookie))
+ return 1;
+
+ fl6 = &_fl.u.ip6;
+ memset(&_fl, 0x0, sizeof(_fl));
+ fl6->saddr = sa->v6.sin6_addr;
+ fl6->fl6_sport = sa->v6.sin6_port;
+ fl6->daddr = da->v6.sin6_addr;
+ fl6->fl6_dport = da->v6.sin6_port;
+ fl6->flowi6_proto = IPPROTO_UDP;
+ fl6->flowi6_oif = sk->sk_bound_dev_if;
+
+ if (inet6_test_bit(SNDFLOW, sk)) {
+ fl6->flowlabel = (da->v6.sin6_flowinfo & IPV6_FLOWINFO_MASK);
+ if (fl6->flowlabel & IPV6_FLOWLABEL_MASK) {
+ flowlabel = fl6_sock_lookup(sk, fl6->flowlabel);
+ if (IS_ERR(flowlabel))
+ return -EINVAL;
+ fl6_sock_release(flowlabel);
+ }
+ }
+
+ dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, NULL);
+ if (IS_ERR(dst))
+ return PTR_ERR(dst);
+
+ if (!sa->v6.sin6_family) {
+ sa->v6.sin6_family = AF_INET6;
+ sa->v6.sin6_addr = fl6->saddr;
+ }
+ ip6_dst_store(sk, dst, NULL, NULL);
+ memcpy(fl, &_fl, sizeof(_fl));
+ return 0;
+}
+
+static void quic_v4_lower_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 tos = (inet_sk(sk)->tos | cb->ecn), ttl;
+ struct flowi4 *fl4 = &fl->u.ip4;
+ struct dst_entry *dst;
+ __be16 df = 0;
+
+ pr_debug("%s: skb: %p, len: %d, num: %llu, %pI4:%d -> %pI4:%d\n", __func__,
+ skb, skb->len, cb->number, &fl4->saddr, ntohs(fl4->fl4_sport),
+ &fl4->daddr, ntohs(fl4->fl4_dport));
+
+ dst = sk_dst_get(sk);
+ if (!dst) {
+ kfree_skb(skb);
+ return;
+ }
+ if (ip_dont_fragment(sk, dst) && !skb->ignore_df)
+ df = htons(IP_DF);
+
+ ttl = (u8)ip4_dst_hoplimit(dst);
+ udp_tunnel_xmit_skb((struct rtable *)dst, sk, skb, fl4->saddr, fl4->daddr,
+ tos, ttl, df, fl4->fl4_sport, fl4->fl4_dport, false, false, 0);
+}
+
+static void quic_v6_lower_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 tc = (inet6_sk(sk)->tclass | cb->ecn), ttl;
+ struct flowi6 *fl6 = &fl->u.ip6;
+ struct dst_entry *dst;
+ __be32 label;
+
+ pr_debug("%s: skb: %p, len: %d, num: %llu, %pI6c:%d -> %pI6c:%d\n", __func__,
+ skb, skb->len, cb->number, &fl6->saddr, ntohs(fl6->fl6_sport),
+ &fl6->daddr, ntohs(fl6->fl6_dport));
+
+ dst = sk_dst_get(sk);
+ if (!dst) {
+ kfree_skb(skb);
+ return;
+ }
+
+ ttl = (u8)ip6_dst_hoplimit(dst);
+ label = ip6_make_flowlabel(sock_net(sk), skb, fl6->flowlabel, true, fl6);
+ udp_tunnel6_xmit_skb(dst, sk, skb, NULL, &fl6->saddr, &fl6->daddr, tc,
+ ttl, label, fl6->fl6_sport, fl6->fl6_dport, false, 0);
+}
+
+static void quic_v4_get_msg_addrs(struct sk_buff *skb, union quic_addr *da, union quic_addr *sa)
+{
+ struct udphdr *uh = quic_udphdr(skb);
+
+ sa->v4.sin_family = AF_INET;
+ sa->v4.sin_port = uh->source;
+ sa->v4.sin_addr.s_addr = ip_hdr(skb)->saddr;
+
+ da->v4.sin_family = AF_INET;
+ da->v4.sin_port = uh->dest;
+ da->v4.sin_addr.s_addr = ip_hdr(skb)->daddr;
+}
+
+static void quic_v6_get_msg_addrs(struct sk_buff *skb, union quic_addr *da, union quic_addr *sa)
+{
+ struct udphdr *uh = quic_udphdr(skb);
+
+ sa->v6.sin6_family = AF_INET6;
+ sa->v6.sin6_port = uh->source;
+ sa->v6.sin6_addr = ipv6_hdr(skb)->saddr;
+
+ da->v6.sin6_family = AF_INET6;
+ da->v6.sin6_port = uh->dest;
+ da->v6.sin6_addr = ipv6_hdr(skb)->daddr;
+}
+
+static int quic_v4_get_mtu_info(struct sk_buff *skb, u32 *info)
+{
+ struct icmphdr *hdr;
+
+ hdr = (struct icmphdr *)(skb_network_header(skb) - sizeof(struct icmphdr));
+ if (hdr->type == ICMP_DEST_UNREACH && hdr->code == ICMP_FRAG_NEEDED) {
+ *info = ntohs(hdr->un.frag.mtu);
+ return 0;
+ }
+
+ /* Defer other types' processing to UDP error handler. */
+ return 1;
+}
+
+static int quic_v6_get_mtu_info(struct sk_buff *skb, u32 *info)
+{
+ struct icmp6hdr *hdr;
+
+ hdr = (struct icmp6hdr *)(skb_network_header(skb) - sizeof(struct icmp6hdr));
+ if (hdr->icmp6_type == ICMPV6_PKT_TOOBIG) {
+ *info = ntohl(hdr->icmp6_mtu);
+ return 0;
+ }
+
+ /* Defer other types' processing to UDP error handler. */
+ return 1;
+}
+
+static u8 quic_v4_get_msg_ecn(struct sk_buff *skb)
+{
+ return (ip_hdr(skb)->tos & INET_ECN_MASK);
+}
+
+static u8 quic_v6_get_msg_ecn(struct sk_buff *skb)
+{
+ return (ipv6_get_dsfield(ipv6_hdr(skb)) & INET_ECN_MASK);
+}
+
+static struct quic_addr_family_ops quic_af_inet = {
+ .iph_len = sizeof(struct iphdr),
+ .is_any_addr = quic_v4_is_any_addr,
+ .seq_dump_addr = quic_v4_seq_dump_addr,
+ .udp_conf_init = quic_v4_udp_conf_init,
+ .flow_route = quic_v4_flow_route,
+ .lower_xmit = quic_v4_lower_xmit,
+ .get_msg_addrs = quic_v4_get_msg_addrs,
+ .get_mtu_info = quic_v4_get_mtu_info,
+ .get_msg_ecn = quic_v4_get_msg_ecn,
+};
+
+static struct quic_addr_family_ops quic_af_inet6 = {
+ .iph_len = sizeof(struct ipv6hdr),
+ .is_any_addr = quic_v6_is_any_addr,
+ .seq_dump_addr = quic_v6_seq_dump_addr,
+ .udp_conf_init = quic_v6_udp_conf_init,
+ .flow_route = quic_v6_flow_route,
+ .lower_xmit = quic_v6_lower_xmit,
+ .get_msg_addrs = quic_v6_get_msg_addrs,
+ .get_mtu_info = quic_v6_get_mtu_info,
+ .get_msg_ecn = quic_v6_get_msg_ecn,
+};
+
+static struct quic_addr_family_ops *quic_afs[] = {
+ &quic_af_inet,
+ &quic_af_inet6
+};
+
+#define quic_af(a) quic_afs[(a)->sa.sa_family == AF_INET6]
+#define quic_af_skb(skb) quic_afs[ip_hdr(skb)->version == 6]
+
+static int quic_v4_get_user_addr(struct sock *sk, union quic_addr *a, struct sockaddr *addr,
+ int addr_len)
+{
+ u32 len = sizeof(struct sockaddr_in);
+
+ if (addr_len < len || addr->sa_family != AF_INET)
+ return 1;
+ if (ipv4_is_multicast(quic_addr(addr)->v4.sin_addr.s_addr))
+ return 1;
+ memcpy(a, addr, len);
+ return 0;
+}
+
+static int quic_v6_get_user_addr(struct sock *sk, union quic_addr *a, struct sockaddr *addr,
+ int addr_len)
+{
+ u32 len = sizeof(struct sockaddr_in);
+ int type;
+
+ if (addr_len < len)
+ return 1;
+
+ if (addr->sa_family != AF_INET6) {
+ if (ipv6_only_sock(sk))
+ return 1;
+ return quic_v4_get_user_addr(sk, a, addr, addr_len);
+ }
+
+ len = sizeof(struct sockaddr_in6);
+ if (addr_len < len)
+ return 1;
+ type = ipv6_addr_type(&quic_addr(addr)->v6.sin6_addr);
+ if (type != IPV6_ADDR_ANY && !(type & IPV6_ADDR_UNICAST))
+ return 1;
+ memcpy(a, addr, len);
+ return 0;
+}
+
+static void quic_v4_get_pref_addr(struct sock *sk, union quic_addr *addr, u8 **pp, u32 *plen)
+{
+ u8 *p = *pp;
+
+ memcpy(&addr->v4.sin_addr, p, QUIC_ADDR4_LEN);
+ p += QUIC_ADDR4_LEN;
+ memcpy(&addr->v4.sin_port, p, QUIC_PORT_LEN);
+ p += QUIC_PORT_LEN;
+ addr->v4.sin_family = AF_INET;
+ /* Skip over IPv6 address and port, not used for AF_INET sockets. */
+ p += QUIC_ADDR6_LEN;
+ p += QUIC_PORT_LEN;
+
+ if (!addr->v4.sin_port || quic_v4_is_any_addr(addr) ||
+ ipv4_is_multicast(addr->v4.sin_addr.s_addr))
+ memset(addr, 0, sizeof(*addr));
+ *plen -= (p - *pp);
+ *pp = p;
+}
+
+static void quic_v6_get_pref_addr(struct sock *sk, union quic_addr *addr, u8 **pp, u32 *plen)
+{
+ u8 *p = *pp;
+ int type;
+
+ /* Skip over IPv4 address and port. */
+ p += QUIC_ADDR4_LEN;
+ p += QUIC_PORT_LEN;
+ /* Try to use IPv6 address and port first. */
+ memcpy(&addr->v6.sin6_addr, p, QUIC_ADDR6_LEN);
+ p += QUIC_ADDR6_LEN;
+ memcpy(&addr->v6.sin6_port, p, QUIC_PORT_LEN);
+ p += QUIC_PORT_LEN;
+ addr->v6.sin6_family = AF_INET6;
+
+ type = ipv6_addr_type(&addr->v6.sin6_addr);
+ if (!addr->v6.sin6_port || !(type & IPV6_ADDR_UNICAST)) {
+ memset(addr, 0, sizeof(*addr));
+ if (ipv6_only_sock(sk))
+ goto out;
+ /* Fallback to IPv4 if IPv6 address is not usable. */
+ return quic_v4_get_pref_addr(sk, addr, pp, plen);
+ }
+out:
+ *plen -= (p - *pp);
+ *pp = p;
+}
+
+static void quic_v4_set_pref_addr(struct sock *sk, u8 *p, union quic_addr *addr)
+{
+ memcpy(p, &addr->v4.sin_addr, QUIC_ADDR4_LEN);
+ p += QUIC_ADDR4_LEN;
+ memcpy(p, &addr->v4.sin_port, QUIC_PORT_LEN);
+ p += QUIC_PORT_LEN;
+ memset(p, 0, QUIC_ADDR6_LEN);
+ p += QUIC_ADDR6_LEN;
+ memset(p, 0, QUIC_PORT_LEN);
+}
+
+static void quic_v6_set_pref_addr(struct sock *sk, u8 *p, union quic_addr *addr)
+{
+ if (addr->sa.sa_family == AF_INET)
+ return quic_v4_set_pref_addr(sk, p, addr);
+
+ memset(p, 0, QUIC_ADDR4_LEN);
+ p += QUIC_ADDR4_LEN;
+ memset(p, 0, QUIC_PORT_LEN);
+ p += QUIC_PORT_LEN;
+ memcpy(p, &addr->v6.sin6_addr, QUIC_ADDR6_LEN);
+ p += QUIC_ADDR6_LEN;
+ memcpy(p, &addr->v6.sin6_port, QUIC_PORT_LEN);
+}
+
+static bool quic_v4_cmp_sk_addr(struct sock *sk, union quic_addr *a, union quic_addr *addr)
+{
+ if (a->v4.sin_port != addr->v4.sin_port)
+ return false;
+ if (a->v4.sin_family != addr->v4.sin_family)
+ return false;
+ if (a->v4.sin_addr.s_addr == htonl(INADDR_ANY) ||
+ addr->v4.sin_addr.s_addr == htonl(INADDR_ANY))
+ return true;
+ return a->v4.sin_addr.s_addr == addr->v4.sin_addr.s_addr;
+}
+
+static bool quic_v6_cmp_sk_addr(struct sock *sk, union quic_addr *a, union quic_addr *addr)
+{
+ if (a->v4.sin_port != addr->v4.sin_port)
+ return false;
+
+ if (a->sa.sa_family == AF_INET && addr->sa.sa_family == AF_INET) {
+ if (a->v4.sin_addr.s_addr == htonl(INADDR_ANY) ||
+ addr->v4.sin_addr.s_addr == htonl(INADDR_ANY))
+ return true;
+ return a->v4.sin_addr.s_addr == addr->v4.sin_addr.s_addr;
+ }
+
+ if (a->sa.sa_family != addr->sa.sa_family) {
+ if (ipv6_only_sock(sk))
+ return false;
+ if (a->sa.sa_family == AF_INET6 && ipv6_addr_any(&a->v6.sin6_addr))
+ return true;
+ if (a->sa.sa_family == AF_INET && addr->sa.sa_family == AF_INET6 &&
+ ipv6_addr_v4mapped(&addr->v6.sin6_addr) &&
+ addr->v6.sin6_addr.s6_addr32[3] == a->v4.sin_addr.s_addr)
+ return true;
+ if (addr->sa.sa_family == AF_INET && a->sa.sa_family == AF_INET6 &&
+ ipv6_addr_v4mapped(&a->v6.sin6_addr) &&
+ a->v6.sin6_addr.s6_addr32[3] == addr->v4.sin_addr.s_addr)
+ return true;
+ return false;
+ }
+
+ if (ipv6_addr_any(&a->v6.sin6_addr) || ipv6_addr_any(&addr->v6.sin6_addr))
+ return true;
+ return ipv6_addr_equal(&a->v6.sin6_addr, &addr->v6.sin6_addr);
+}
+
+static int quic_v4_get_sk_addr(struct socket *sock, struct sockaddr *uaddr, int peer)
+{
+ return inet_getname(sock, uaddr, peer);
+}
+
+static int quic_v6_get_sk_addr(struct socket *sock, struct sockaddr *uaddr, int peer)
+{
+ union quic_addr *a = quic_addr(uaddr);
+ int ret;
+
+ ret = inet6_getname(sock, uaddr, peer);
+ if (ret < 0)
+ return ret;
+
+ if (a->sa.sa_family == AF_INET6 && ipv6_addr_v4mapped(&a->v6.sin6_addr)) {
+ a->v4.sin_family = AF_INET;
+ a->v4.sin_port = a->v6.sin6_port;
+ a->v4.sin_addr.s_addr = a->v6.sin6_addr.s6_addr32[3];
+ }
+
+ if (a->sa.sa_family == AF_INET) {
+ memset(a->v4.sin_zero, 0, sizeof(a->v4.sin_zero));
+ return sizeof(struct sockaddr_in);
+ }
+ return sizeof(struct sockaddr_in6);
+}
+
+static void quic_v4_set_sk_addr(struct sock *sk, union quic_addr *a, bool src)
+{
+ if (src) {
+ inet_sk(sk)->inet_sport = a->v4.sin_port;
+ inet_sk(sk)->inet_saddr = a->v4.sin_addr.s_addr;
+ } else {
+ inet_sk(sk)->inet_dport = a->v4.sin_port;
+ inet_sk(sk)->inet_daddr = a->v4.sin_addr.s_addr;
+ }
+}
+
+static void quic_v6_set_sk_addr(struct sock *sk, union quic_addr *a, bool src)
+{
+ if (src) {
+ inet_sk(sk)->inet_sport = a->v4.sin_port;
+ if (a->sa.sa_family == AF_INET) {
+ sk->sk_v6_rcv_saddr.s6_addr32[0] = 0;
+ sk->sk_v6_rcv_saddr.s6_addr32[1] = 0;
+ sk->sk_v6_rcv_saddr.s6_addr32[2] = htonl(0x0000ffff);
+ sk->sk_v6_rcv_saddr.s6_addr32[3] = a->v4.sin_addr.s_addr;
+ } else {
+ sk->sk_v6_rcv_saddr = a->v6.sin6_addr;
+ }
+ } else {
+ inet_sk(sk)->inet_dport = a->v4.sin_port;
+ if (a->sa.sa_family == AF_INET) {
+ sk->sk_v6_daddr.s6_addr32[0] = 0;
+ sk->sk_v6_daddr.s6_addr32[1] = 0;
+ sk->sk_v6_daddr.s6_addr32[2] = htonl(0x0000ffff);
+ sk->sk_v6_daddr.s6_addr32[3] = a->v4.sin_addr.s_addr;
+ } else {
+ sk->sk_v6_daddr = a->v6.sin6_addr;
+ }
+ }
+}
+
+static void quic_v4_set_sk_ecn(struct sock *sk, u8 ecn)
+{
+ inet_sk(sk)->tos = ((inet_sk(sk)->tos & ~INET_ECN_MASK) | ecn);
+}
+
+static void quic_v6_set_sk_ecn(struct sock *sk, u8 ecn)
+{
+ quic_v4_set_sk_ecn(sk, ecn);
+ inet6_sk(sk)->tclass = ((inet6_sk(sk)->tclass & ~INET_ECN_MASK) | ecn);
+}
+
+static struct quic_proto_family_ops quic_pf_inet = {
+ .get_user_addr = quic_v4_get_user_addr,
+ .get_pref_addr = quic_v4_get_pref_addr,
+ .set_pref_addr = quic_v4_set_pref_addr,
+ .cmp_sk_addr = quic_v4_cmp_sk_addr,
+ .get_sk_addr = quic_v4_get_sk_addr,
+ .set_sk_addr = quic_v4_set_sk_addr,
+ .set_sk_ecn = quic_v4_set_sk_ecn,
+ .setsockopt = ip_setsockopt,
+ .getsockopt = ip_getsockopt,
+};
+
+static struct quic_proto_family_ops quic_pf_inet6 = {
+ .get_user_addr = quic_v6_get_user_addr,
+ .get_pref_addr = quic_v6_get_pref_addr,
+ .set_pref_addr = quic_v6_set_pref_addr,
+ .cmp_sk_addr = quic_v6_cmp_sk_addr,
+ .get_sk_addr = quic_v6_get_sk_addr,
+ .set_sk_addr = quic_v6_set_sk_addr,
+ .set_sk_ecn = quic_v6_set_sk_ecn,
+ .setsockopt = ipv6_setsockopt,
+ .getsockopt = ipv6_getsockopt,
+};
+
+static struct quic_proto_family_ops *quic_pfs[] = {
+ &quic_pf_inet,
+ &quic_pf_inet6
+};
+
+#define quic_pf(sk) quic_pfs[(sk)->sk_family == AF_INET6]
+
+u32 quic_encap_len(union quic_addr *a)
+{
+ return sizeof(struct udphdr) + quic_af(a)->iph_len;
+}
+
+int quic_is_any_addr(union quic_addr *a)
+{
+ return quic_af(a)->is_any_addr(a);
+}
+
+void quic_seq_dump_addr(struct seq_file *seq, union quic_addr *addr)
+{
+ quic_af(addr)->seq_dump_addr(seq, addr);
+}
+
+void quic_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf, union quic_addr *a)
+{
+ quic_af(a)->udp_conf_init(sk, conf, a);
+}
+
+int quic_flow_route(struct sock *sk, union quic_addr *da, union quic_addr *sa, struct flowi *fl)
+{
+ return quic_af(da)->flow_route(sk, da, sa, fl);
+}
+
+void quic_lower_xmit(struct sock *sk, struct sk_buff *skb, union quic_addr *da, struct flowi *fl)
+{
+ quic_af(da)->lower_xmit(sk, skb, fl);
+}
+
+void quic_get_msg_addrs(struct sk_buff *skb, union quic_addr *da, union quic_addr *sa)
+{
+ memset(sa, 0, sizeof(*sa));
+ memset(da, 0, sizeof(*da));
+ quic_af_skb(skb)->get_msg_addrs(skb, da, sa);
+}
+
+int quic_get_mtu_info(struct sk_buff *skb, u32 *info)
+{
+ return quic_af_skb(skb)->get_mtu_info(skb, info);
+}
+
+u8 quic_get_msg_ecn(struct sk_buff *skb)
+{
+ return quic_af_skb(skb)->get_msg_ecn(skb);
+}
+
+int quic_get_user_addr(struct sock *sk, union quic_addr *a, struct sockaddr *addr, int addr_len)
+{
+ memset(a, 0, sizeof(*a));
+ return quic_pf(sk)->get_user_addr(sk, a, addr, addr_len);
+}
+
+void quic_get_pref_addr(struct sock *sk, union quic_addr *addr, u8 **pp, u32 *plen)
+{
+ memset(addr, 0, sizeof(*addr));
+ quic_pf(sk)->get_pref_addr(sk, addr, pp, plen);
+}
+
+void quic_set_pref_addr(struct sock *sk, u8 *p, union quic_addr *addr)
+{
+ quic_pf(sk)->set_pref_addr(sk, p, addr);
+}
+
+bool quic_cmp_sk_addr(struct sock *sk, union quic_addr *a, union quic_addr *addr)
+{
+ return quic_pf(sk)->cmp_sk_addr(sk, a, addr);
+}
+
+int quic_get_sk_addr(struct socket *sock, struct sockaddr *a, bool peer)
+{
+ return quic_pf(sock->sk)->get_sk_addr(sock, a, peer);
+}
+
+void quic_set_sk_addr(struct sock *sk, union quic_addr *a, bool src)
+{
+ return quic_pf(sk)->set_sk_addr(sk, a, src);
+}
+
+void quic_set_sk_ecn(struct sock *sk, u8 ecn)
+{
+ quic_pf(sk)->set_sk_ecn(sk, ecn);
+}
+
+int quic_common_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
+ unsigned int optlen)
+{
+ return quic_pf(sk)->setsockopt(sk, level, optname, optval, optlen);
+}
+
+int quic_common_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
+ int __user *optlen)
+{
+ return quic_pf(sk)->getsockopt(sk, level, optname, optval, optlen);
+}
diff --git a/net/quic/family.h b/net/quic/family.h
new file mode 100644
index 000000000000..dd7af2393d07
--- /dev/null
+++ b/net/quic/family.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_PORT_LEN 2
+#define QUIC_ADDR4_LEN 4
+#define QUIC_ADDR6_LEN 16
+
+#define QUIC_PREF_ADDR_LEN (QUIC_ADDR4_LEN + QUIC_PORT_LEN + QUIC_ADDR6_LEN + QUIC_PORT_LEN)
+
+void quic_seq_dump_addr(struct seq_file *seq, union quic_addr *addr);
+int quic_is_any_addr(union quic_addr *a);
+u32 quic_encap_len(union quic_addr *a);
+
+void quic_lower_xmit(struct sock *sk, struct sk_buff *skb, union quic_addr *da, struct flowi *fl);
+int quic_flow_route(struct sock *sk, union quic_addr *da, union quic_addr *sa, struct flowi *fl);
+void quic_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf, union quic_addr *a);
+
+void quic_get_msg_addrs(struct sk_buff *skb, union quic_addr *da, union quic_addr *sa);
+int quic_get_mtu_info(struct sk_buff *skb, u32 *info);
+u8 quic_get_msg_ecn(struct sk_buff *skb);
+
+int quic_get_user_addr(struct sock *sk, union quic_addr *a, struct sockaddr *addr, int addr_len);
+void quic_get_pref_addr(struct sock *sk, union quic_addr *addr, u8 **pp, u32 *plen);
+void quic_set_pref_addr(struct sock *sk, u8 *p, union quic_addr *addr);
+
+bool quic_cmp_sk_addr(struct sock *sk, union quic_addr *a, union quic_addr *addr);
+int quic_get_sk_addr(struct socket *sock, struct sockaddr *a, bool peer);
+void quic_set_sk_addr(struct sock *sk, union quic_addr *a, bool src);
+void quic_set_sk_ecn(struct sock *sk, u8 ecn);
+
+int quic_common_setsockopt(struct sock *sk, int level, int optname, sockptr_t optval,
+ unsigned int optlen);
+int quic_common_getsockopt(struct sock *sk, int level, int optname, char __user *optval,
+ int __user *optlen);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index 522c194d4577..08eb3b81f62f 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -47,7 +47,7 @@ static int quic_inet_listen(struct socket *sock, int backlog)
static int quic_inet_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
{
- return -EOPNOTSUPP;
+ return quic_get_sk_addr(sock, uaddr, peer);
}
static __poll_t quic_inet_poll(struct file *file, struct socket *sock, poll_table *wait)
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 9cab01109db7..025fb3ae2941 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -121,7 +121,7 @@ static int quic_setsockopt(struct sock *sk, int level, int optname,
sockptr_t optval, unsigned int optlen)
{
if (level != SOL_QUIC)
- return -EOPNOTSUPP;
+ return quic_common_setsockopt(sk, level, optname, optval, optlen);
return quic_do_setsockopt(sk, optname, optval, optlen);
}
@@ -135,7 +135,7 @@ static int quic_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *optlen)
{
if (level != SOL_QUIC)
- return -EOPNOTSUPP;
+ return quic_common_getsockopt(sk, level, optname, optval, optlen);
return quic_do_getsockopt(sk, optname, USER_SOCKPTR(optval), USER_SOCKPTR(optlen));
}
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 6cbf12bcae75..3f808489f571 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -11,6 +11,7 @@
#include <net/udp_tunnel.h>
#include "common.h"
+#include "family.h"
#include "protocol.h"
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 04/15] quic: provide family ops for address and protocol
2025-08-18 14:04 ` [PATCH net-next v2 04/15] quic: provide family ops for address and protocol Xin Long
@ 2025-08-21 13:17 ` Paolo Abeni
2025-08-23 17:22 ` Xin Long
0 siblings, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2025-08-21 13:17 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Simon Horman, Stefan Metzmacher,
Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs, Steve French,
Namjae Jeon, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
On 8/18/25 4:04 PM, Xin Long wrote:
> This patch introduces two new abstraction structures to simplify handling
> of IPv4 and IPv6 differences across the QUIC stack:
>
> - quic_addr_family_ops: for address comparison, flow routing,
> UDP config, MTU lookup, formatted output, etc.
>
> - quic_proto_family_ops: for socket address helpers and preference.
>
> With these additions, the QUIC core logic can remain agnostic of the
> address family and socket type, improving modularity and reducing
> repetitive checks throughout the codebase.
Given that you wrap the ops calls in quic_<op>() helpers, I'm wondering
if such abstraction is necessary/useful? 'if' statements in the quic
helpers will likely reduce the code size, and avoid the indirect function
call overhead.
[...]
> +static void quic_v6_set_sk_addr(struct sock *sk, union quic_addr *a, bool src)
> +{
> + if (src) {
> + inet_sk(sk)->inet_sport = a->v4.sin_port;
> + if (a->sa.sa_family == AF_INET) {
> + sk->sk_v6_rcv_saddr.s6_addr32[0] = 0;
> + sk->sk_v6_rcv_saddr.s6_addr32[1] = 0;
> + sk->sk_v6_rcv_saddr.s6_addr32[2] = htonl(0x0000ffff);
> + sk->sk_v6_rcv_saddr.s6_addr32[3] = a->v4.sin_addr.s_addr;
> + } else {
> + sk->sk_v6_rcv_saddr = a->v6.sin6_addr;
> + }
> + } else {
> + inet_sk(sk)->inet_dport = a->v4.sin_port;
> + if (a->sa.sa_family == AF_INET) {
> + sk->sk_v6_daddr.s6_addr32[0] = 0;
> + sk->sk_v6_daddr.s6_addr32[1] = 0;
> + sk->sk_v6_daddr.s6_addr32[2] = htonl(0x0000ffff);
> + sk->sk_v6_daddr.s6_addr32[3] = a->v4.sin_addr.s_addr;
> + } else {
> + sk->sk_v6_daddr = a->v6.sin6_addr;
> + }
> + }
You could factor the addr assignment in an helper and avoid some code
duplication.
/P
* Re: [PATCH net-next v2 04/15] quic: provide family ops for address and protocol
2025-08-21 13:17 ` Paolo Abeni
@ 2025-08-23 17:22 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-23 17:22 UTC (permalink / raw)
To: Paolo Abeni
Cc: network dev, davem, kuba, Eric Dumazet, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
On Thu, Aug 21, 2025 at 9:17 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 4:04 PM, Xin Long wrote:
> > This patch introduces two new abstraction structures to simplify handling
> > of IPv4 and IPv6 differences across the QUIC stack:
> >
> > - quic_addr_family_ops: for address comparison, flow routing,
> > UDP config, MTU lookup, formatted output, etc.
> >
> > - quic_proto_family_ops: for socket address helpers and preference.
> >
> > With these additions, the QUIC core logic can remain agnostic of the
> > address family and socket type, improving modularity and reducing
> > repetitive checks throughout the codebase.
>
> Given that you wrap the ops calls in quic_<op>() helpers, I'm wondering
> if such abstraction is necessary/useful? 'if' statements in the quic
> helpers will likely reduce the code size, and avoid the indirect function
> call overhead.
I'm completely fine to change things to be like:
int quic_flow_route(struct sock *sk, union quic_addr *da, union quic_addr *sa, struct flowi *fl)
{
	return da->sa.sa_family == AF_INET ? quic_v4_flow_route(sk, da, sa, fl) :
					     quic_v6_flow_route(sk, da, sa, fl);
}
>
> [...]
> > +static void quic_v6_set_sk_addr(struct sock *sk, union quic_addr *a, bool src)
> > +{
> > + if (src) {
> > + inet_sk(sk)->inet_sport = a->v4.sin_port;
> > + if (a->sa.sa_family == AF_INET) {
> > + sk->sk_v6_rcv_saddr.s6_addr32[0] = 0;
> > + sk->sk_v6_rcv_saddr.s6_addr32[1] = 0;
> > + sk->sk_v6_rcv_saddr.s6_addr32[2] = htonl(0x0000ffff);
> > + sk->sk_v6_rcv_saddr.s6_addr32[3] = a->v4.sin_addr.s_addr;
> > + } else {
> > + sk->sk_v6_rcv_saddr = a->v6.sin6_addr;
> > + }
> > + } else {
> > + inet_sk(sk)->inet_dport = a->v4.sin_port;
> > + if (a->sa.sa_family == AF_INET) {
> > + sk->sk_v6_daddr.s6_addr32[0] = 0;
> > + sk->sk_v6_daddr.s6_addr32[1] = 0;
> > + sk->sk_v6_daddr.s6_addr32[2] = htonl(0x0000ffff);
> > + sk->sk_v6_daddr.s6_addr32[3] = a->v4.sin_addr.s_addr;
> > + } else {
> > + sk->sk_v6_daddr = a->v6.sin6_addr;
> > + }
> > + }
>
> You could factor the addr assignment in an helper and avoid some code
> duplication.
>
Right, I will add a helper.
Thanks.
* [PATCH net-next v2 05/15] quic: provide quic.h header files for kernel and userspace
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (3 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 04/15] quic: provide family ops for address and protocol Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 06/15] quic: add stream management Xin Long
` (10 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This commit adds quic.h to include/uapi/linux, providing the necessary
definitions for the QUIC socket API. Exporting this header allows both
user space applications and kernel subsystems to access QUIC-related
control messages, socket options, and event/notification interfaces.
Since kernel_get/setsockopt() is no longer available to kernel consumers,
a corresponding internal header, include/linux/quic.h, is added. This
provides kernel subsystems with the necessary declarations to handle
QUIC socket options directly.
Detailed descriptions of these structures are available in [1], and will
be also provided when adding corresponding socket interfaces in the
later patches.
[1] https://datatracker.ietf.org/doc/html/draft-lxin-quic-socket-apis
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
include/linux/quic.h | 19 +++
include/uapi/linux/quic.h | 236 ++++++++++++++++++++++++++++++++++++++
net/quic/socket.c | 38 ++++++
net/quic/socket.h | 7 ++
4 files changed, 300 insertions(+)
create mode 100644 include/linux/quic.h
create mode 100644 include/uapi/linux/quic.h
diff --git a/include/linux/quic.h b/include/linux/quic.h
new file mode 100644
index 000000000000..d35ff40bb005
--- /dev/null
+++ b/include/linux/quic.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#ifndef _LINUX_QUIC_H
+#define _LINUX_QUIC_H
+
+#include <uapi/linux/quic.h>
+
+int quic_kernel_setsockopt(struct sock *sk, int optname, void *optval, unsigned int optlen);
+int quic_kernel_getsockopt(struct sock *sk, int optname, void *optval, unsigned int *optlen);
+
+#endif
diff --git a/include/uapi/linux/quic.h b/include/uapi/linux/quic.h
new file mode 100644
index 000000000000..f7c85399ac4a
--- /dev/null
+++ b/include/uapi/linux/quic.h
@@ -0,0 +1,236 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#ifndef _UAPI_LINUX_QUIC_H
+#define _UAPI_LINUX_QUIC_H
+
+#include <linux/types.h>
+#ifdef __KERNEL__
+#include <linux/socket.h>
+#else
+#include <sys/socket.h>
+#endif
+
+/* NOTE: Structure descriptions are specified in:
+ * https://datatracker.ietf.org/doc/html/draft-lxin-quic-socket-apis
+ */
+
+/* Send or Receive Options APIs */
+enum quic_cmsg_type {
+ QUIC_STREAM_INFO,
+ QUIC_HANDSHAKE_INFO,
+};
+
+#define QUIC_STREAM_TYPE_SERVER_MASK 0x01
+#define QUIC_STREAM_TYPE_UNI_MASK 0x02
+#define QUIC_STREAM_TYPE_MASK 0x03
+
+enum quic_msg_flags {
+ /* flags for stream_flags */
+ MSG_STREAM_NEW = MSG_SYN,
+ MSG_STREAM_FIN = MSG_FIN,
+ MSG_STREAM_UNI = MSG_CONFIRM,
+ MSG_STREAM_DONTWAIT = MSG_WAITFORONE,
+ MSG_STREAM_SNDBLOCK = MSG_ERRQUEUE,
+
+ /* extended flags for msg_flags */
+ MSG_DATAGRAM = MSG_RST,
+ MSG_NOTIFICATION = MSG_MORE,
+};
+
+enum quic_crypto_level {
+ QUIC_CRYPTO_APP,
+ QUIC_CRYPTO_INITIAL,
+ QUIC_CRYPTO_HANDSHAKE,
+ QUIC_CRYPTO_EARLY,
+ QUIC_CRYPTO_MAX,
+};
+
+struct quic_handshake_info {
+ __u8 crypto_level;
+};
+
+struct quic_stream_info {
+ __s64 stream_id;
+ __u32 stream_flags;
+};
+
+/* Socket Options APIs */
+#define QUIC_SOCKOPT_EVENT 0
+#define QUIC_SOCKOPT_STREAM_OPEN 1
+#define QUIC_SOCKOPT_STREAM_RESET 2
+#define QUIC_SOCKOPT_STREAM_STOP_SENDING 3
+#define QUIC_SOCKOPT_CONNECTION_ID 4
+#define QUIC_SOCKOPT_CONNECTION_CLOSE 5
+#define QUIC_SOCKOPT_CONNECTION_MIGRATION 6
+#define QUIC_SOCKOPT_KEY_UPDATE 7
+#define QUIC_SOCKOPT_TRANSPORT_PARAM 8
+#define QUIC_SOCKOPT_CONFIG 9
+#define QUIC_SOCKOPT_TOKEN 10
+#define QUIC_SOCKOPT_ALPN 11
+#define QUIC_SOCKOPT_SESSION_TICKET 12
+#define QUIC_SOCKOPT_CRYPTO_SECRET 13
+#define QUIC_SOCKOPT_TRANSPORT_PARAM_EXT 14
+
+#define QUIC_VERSION_V1 0x1
+#define QUIC_VERSION_V2 0x6b3343cf
+
+struct quic_transport_param {
+ __u8 remote;
+ __u8 disable_active_migration;
+ __u8 grease_quic_bit;
+ __u8 stateless_reset;
+ __u8 disable_1rtt_encryption;
+ __u8 disable_compatible_version;
+ __u8 active_connection_id_limit;
+ __u8 ack_delay_exponent;
+ __u16 max_datagram_frame_size;
+ __u16 max_udp_payload_size;
+ __u32 max_idle_timeout;
+ __u32 max_ack_delay;
+ __u16 max_streams_bidi;
+ __u16 max_streams_uni;
+ __u64 max_data;
+ __u64 max_stream_data_bidi_local;
+ __u64 max_stream_data_bidi_remote;
+ __u64 max_stream_data_uni;
+ __u64 reserved;
+};
+
+struct quic_config {
+ __u32 version;
+ __u32 plpmtud_probe_interval;
+ __u32 initial_smoothed_rtt;
+ __u32 payload_cipher_type;
+ __u8 congestion_control_algo;
+ __u8 validate_peer_address;
+ __u8 stream_data_nodelay;
+ __u8 receive_session_ticket;
+ __u8 certificate_request;
+ __u8 reserved[3];
+};
+
+struct quic_crypto_secret {
+ __u8 send; /* send or recv */
+ __u8 level; /* crypto level */
+ __u32 type; /* TLS_CIPHER_* */
+#define QUIC_CRYPTO_SECRET_BUFFER_SIZE 48
+ __u8 secret[QUIC_CRYPTO_SECRET_BUFFER_SIZE];
+};
+
+enum quic_cong_algo {
+ QUIC_CONG_ALG_RENO,
+ QUIC_CONG_ALG_CUBIC,
+ QUIC_CONG_ALG_MAX,
+};
+
+struct quic_errinfo {
+ __s64 stream_id;
+ __u32 errcode;
+};
+
+struct quic_connection_id_info {
+ __u8 dest;
+ __u32 active;
+ __u32 prior_to;
+};
+
+struct quic_event_option {
+ __u8 type;
+ __u8 on;
+};
+
+/* Event APIs */
+enum quic_event_type {
+ QUIC_EVENT_NONE,
+ QUIC_EVENT_STREAM_UPDATE,
+ QUIC_EVENT_STREAM_MAX_DATA,
+ QUIC_EVENT_STREAM_MAX_STREAM,
+ QUIC_EVENT_CONNECTION_ID,
+ QUIC_EVENT_CONNECTION_CLOSE,
+ QUIC_EVENT_CONNECTION_MIGRATION,
+ QUIC_EVENT_KEY_UPDATE,
+ QUIC_EVENT_NEW_TOKEN,
+ QUIC_EVENT_NEW_SESSION_TICKET,
+ QUIC_EVENT_MAX,
+};
+
+enum {
+ QUIC_STREAM_SEND_STATE_READY,
+ QUIC_STREAM_SEND_STATE_SEND,
+ QUIC_STREAM_SEND_STATE_SENT,
+ QUIC_STREAM_SEND_STATE_RECVD,
+ QUIC_STREAM_SEND_STATE_RESET_SENT,
+ QUIC_STREAM_SEND_STATE_RESET_RECVD,
+
+ QUIC_STREAM_RECV_STATE_RECV,
+ QUIC_STREAM_RECV_STATE_SIZE_KNOWN,
+ QUIC_STREAM_RECV_STATE_RECVD,
+ QUIC_STREAM_RECV_STATE_READ,
+ QUIC_STREAM_RECV_STATE_RESET_RECVD,
+ QUIC_STREAM_RECV_STATE_RESET_READ,
+};
+
+struct quic_stream_update {
+ __s64 id;
+ __u8 state;
+ __u32 errcode;
+ __u64 finalsz;
+};
+
+struct quic_stream_max_data {
+ __s64 id;
+ __u64 max_data;
+};
+
+struct quic_connection_close {
+ __u32 errcode;
+ __u8 frame;
+ __u8 phrase[];
+};
+
+union quic_event {
+ struct quic_stream_update update;
+ struct quic_stream_max_data max_data;
+ struct quic_connection_close close;
+ struct quic_connection_id_info info;
+ __u64 max_stream;
+ __u8 local_migration;
+ __u8 key_update_phase;
+};
+
+enum {
+ QUIC_TRANSPORT_ERROR_NONE = 0x00,
+ QUIC_TRANSPORT_ERROR_INTERNAL = 0x01,
+ QUIC_TRANSPORT_ERROR_CONNECTION_REFUSED = 0x02,
+ QUIC_TRANSPORT_ERROR_FLOW_CONTROL = 0x03,
+ QUIC_TRANSPORT_ERROR_STREAM_LIMIT = 0x04,
+ QUIC_TRANSPORT_ERROR_STREAM_STATE = 0x05,
+ QUIC_TRANSPORT_ERROR_FINAL_SIZE = 0x06,
+ QUIC_TRANSPORT_ERROR_FRAME_ENCODING = 0x07,
+ QUIC_TRANSPORT_ERROR_TRANSPORT_PARAM = 0x08,
+ QUIC_TRANSPORT_ERROR_CONNECTION_ID_LIMIT = 0x09,
+ QUIC_TRANSPORT_ERROR_PROTOCOL_VIOLATION = 0x0a,
+ QUIC_TRANSPORT_ERROR_INVALID_TOKEN = 0x0b,
+ QUIC_TRANSPORT_ERROR_APPLICATION = 0x0c,
+ QUIC_TRANSPORT_ERROR_CRYPTO_BUF_EXCEEDED = 0x0d,
+ QUIC_TRANSPORT_ERROR_KEY_UPDATE = 0x0e,
+ QUIC_TRANSPORT_ERROR_AEAD_LIMIT_REACHED = 0x0f,
+ QUIC_TRANSPORT_ERROR_NO_VIABLE_PATH = 0x10,
+
+ /* The cryptographic handshake failed. A range of 256 values is reserved
+ * for carrying error codes specific to the cryptographic handshake that
+ * is used. Codes for errors occurring when TLS is used for the
+ * cryptographic handshake are described in Section 4.8 of [QUIC-TLS].
+ */
+ QUIC_TRANSPORT_ERROR_CRYPTO = 0x0100,
+};
+
+#endif /* _UAPI_LINUX_QUIC_H */
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 025fb3ae2941..58711a224bfd 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -126,6 +126,25 @@ static int quic_setsockopt(struct sock *sk, int level, int optname,
return quic_do_setsockopt(sk, optname, optval, optlen);
}
+/**
+ * quic_kernel_setsockopt - set a QUIC socket option from within the kernel
+ * @sk: socket to configure
+ * @optname: option name (QUIC-level)
+ * @optval: pointer to the option value
+ * @optlen: size of the option value
+ *
+ * Sets a QUIC socket option on a kernel socket without involving user space.
+ *
+ * Return:
+ * - On success, 0 is returned.
+ * - On error, a negative error value is returned.
+ */
+int quic_kernel_setsockopt(struct sock *sk, int optname, void *optval, unsigned int optlen)
+{
+ return quic_do_setsockopt(sk, optname, KERNEL_SOCKPTR(optval), optlen);
+}
+EXPORT_SYMBOL_GPL(quic_kernel_setsockopt);
+
static int quic_do_getsockopt(struct sock *sk, int optname, sockptr_t optval, sockptr_t optlen)
{
return -EOPNOTSUPP;
@@ -140,6 +159,25 @@ static int quic_getsockopt(struct sock *sk, int level, int optname,
return quic_do_getsockopt(sk, optname, USER_SOCKPTR(optval), USER_SOCKPTR(optlen));
}
+/**
+ * quic_kernel_getsockopt - get a QUIC socket option from within the kernel
+ * @sk: socket to query
+ * @optname: option name (QUIC-level)
+ * @optval: pointer to the buffer to receive the option value
+ * @optlen: pointer to the size of the buffer; updated to actual length on return
+ *
+ * Gets a QUIC socket option from a kernel socket, bypassing user space.
+ *
+ * Return:
+ * - On success, 0 is returned.
+ * - On error, a negative error value is returned.
+ */
+int quic_kernel_getsockopt(struct sock *sk, int optname, void *optval, unsigned int *optlen)
+{
+ return quic_do_getsockopt(sk, optname, KERNEL_SOCKPTR(optval), KERNEL_SOCKPTR(optlen));
+}
+EXPORT_SYMBOL_GPL(quic_kernel_getsockopt);
+
static void quic_release_cb(struct sock *sk)
{
}
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 3f808489f571..aeaefc677973 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -9,6 +9,7 @@
*/
#include <net/udp_tunnel.h>
+#include <linux/quic.h>
#include "common.h"
#include "family.h"
@@ -29,6 +30,7 @@ struct quic_sock {
struct inet_sock inet;
struct list_head reqs;
+ struct quic_config config;
struct quic_data ticket;
struct quic_data token;
struct quic_data alpn;
@@ -49,6 +51,11 @@ static inline struct list_head *quic_reqs(const struct sock *sk)
return &quic_sk(sk)->reqs;
}
+static inline struct quic_config *quic_config(const struct sock *sk)
+{
+ return &quic_sk(sk)->config;
+}
+
static inline struct quic_data *quic_token(const struct sock *sk)
{
return &quic_sk(sk)->token;
--
2.47.1
* [PATCH net-next v2 06/15] quic: add stream management
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (4 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 05/15] quic: provide quic.h header files for kernel and userspace Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-21 13:43 ` Paolo Abeni
2025-08-18 14:04 ` [PATCH net-next v2 07/15] quic: add connection id management Xin Long
` (9 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'struct quic_stream_table' for managing QUIC streams,
each represented by 'struct quic_stream'.
It implements mechanisms for acquiring and releasing streams on both the
send and receive paths, ensuring efficient lifecycle management during
transmission and reception.
- quic_stream_send_get(): Acquire a send-side stream by ID and flags
during TX path.
- quic_stream_recv_get(): Acquire a receive-side stream by ID during
RX path.
- quic_stream_send_put(): Release a send-side stream when sending is
done.
- quic_stream_recv_put(): Release a receive-side stream when receiving
is done.
It includes logic to detect when stream ID limits are reached and when
control frames should be sent to update or request limits from the peer.
- quic_stream_id_send_exceeds(): Determines whether a
STREAMS_BLOCKED_UNI/BIDI frame should be sent to the peer.
- quic_stream_max_streams_update(): Determines whether a
MAX_STREAMS_UNI/BIDI frame should be sent to the peer.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/socket.c | 5 +
net/quic/socket.h | 8 +
net/quic/stream.c | 549 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/stream.h | 135 ++++++++++++
5 files changed, 698 insertions(+), 1 deletion(-)
create mode 100644 net/quic/stream.c
create mode 100644 net/quic/stream.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 13bf4a4e5442..094e9da5d739 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o
+quic-y := common.o family.o protocol.o socket.o stream.o
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 58711a224bfd..0ac51cc0c249 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -42,6 +42,9 @@ static int quic_init_sock(struct sock *sk)
sk->sk_write_space = quic_write_space;
sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
+ if (quic_stream_init(quic_streams(sk)))
+ return -ENOMEM;
+
WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(sysctl_quic_wmem[1]));
WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(sysctl_quic_rmem[1]));
@@ -55,6 +58,8 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ quic_stream_free(quic_streams(sk));
+
quic_data_free(quic_ticket(sk));
quic_data_free(quic_token(sk));
quic_data_free(quic_alpn(sk));
diff --git a/net/quic/socket.h b/net/quic/socket.h
index aeaefc677973..3eba18514ae6 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -13,6 +13,7 @@
#include "common.h"
#include "family.h"
+#include "stream.h"
#include "protocol.h"
@@ -34,6 +35,8 @@ struct quic_sock {
struct quic_data ticket;
struct quic_data token;
struct quic_data alpn;
+
+ struct quic_stream_table streams;
};
struct quic6_sock {
@@ -71,6 +74,11 @@ static inline struct quic_data *quic_alpn(const struct sock *sk)
return &quic_sk(sk)->alpn;
}
+static inline struct quic_stream_table *quic_streams(const struct sock *sk)
+{
+ return &quic_sk(sk)->streams;
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
diff --git a/net/quic/stream.c b/net/quic/stream.c
new file mode 100644
index 000000000000..f0558ee8d645
--- /dev/null
+++ b/net/quic/stream.c
@@ -0,0 +1,549 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Initialization/cleanup for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/quic.h>
+
+#include "common.h"
+#include "stream.h"
+
+/* Check if a stream ID is valid for sending. */
+static bool quic_stream_id_send(s64 stream_id, bool is_serv)
+{
+ u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
+
+ if (is_serv) {
+ if (type == QUIC_STREAM_TYPE_CLIENT_UNI)
+ return false;
+ } else if (type == QUIC_STREAM_TYPE_SERVER_UNI) {
+ return false;
+ }
+ return true;
+}
+
+/* Check if a stream ID is valid for receiving. */
+static bool quic_stream_id_recv(s64 stream_id, bool is_serv)
+{
+ u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
+
+ if (is_serv) {
+ if (type == QUIC_STREAM_TYPE_SERVER_UNI)
+ return false;
+ } else if (type == QUIC_STREAM_TYPE_CLIENT_UNI) {
+ return false;
+ }
+ return true;
+}
+
+/* Check if a stream ID was initiated locally. */
+static bool quic_stream_id_local(s64 stream_id, u8 is_serv)
+{
+ return is_serv ^ !(stream_id & QUIC_STREAM_TYPE_SERVER_MASK);
+}
+
+/* Check if a stream ID represents a unidirectional stream. */
+static bool quic_stream_id_uni(s64 stream_id)
+{
+ return stream_id & QUIC_STREAM_TYPE_UNI_MASK;
+}
+
+struct quic_stream *quic_stream_find(struct quic_stream_table *streams, s64 stream_id)
+{
+ struct quic_hash_head *head = quic_stream_head(&streams->ht, stream_id);
+ struct quic_stream *stream;
+
+ hlist_for_each_entry(stream, &head->head, node) {
+ if (stream->id == stream_id)
+ break;
+ }
+ return stream;
+}
+
+static void quic_stream_add(struct quic_stream_table *streams, struct quic_stream *stream)
+{
+ struct quic_hash_head *head;
+
+ head = quic_stream_head(&streams->ht, stream->id);
+ hlist_add_head(&stream->node, &head->head);
+}
+
+static void quic_stream_delete(struct quic_stream *stream)
+{
+ hlist_del_init(&stream->node);
+ kfree(stream);
+}
+
+/* Create and register new streams for sending. */
+static struct quic_stream *quic_stream_send_create(struct quic_stream_table *streams,
+ s64 max_stream_id, u8 is_serv)
+{
+ struct quic_stream *stream;
+ s64 stream_id;
+
+ stream_id = streams->send.next_bidi_stream_id;
+ if (quic_stream_id_uni(max_stream_id))
+ stream_id = streams->send.next_uni_stream_id;
+
+ /* rfc9000#section-2.1: A stream ID that is used out of order results in all streams
+ * of that type with lower-numbered stream IDs also being opened.
+ */
+ while (stream_id <= max_stream_id) {
+ stream = kzalloc(sizeof(*stream), GFP_KERNEL);
+ if (!stream)
+ return NULL;
+
+ stream->id = stream_id;
+ if (quic_stream_id_uni(stream_id)) {
+ stream->send.max_bytes = streams->send.max_stream_data_uni;
+
+ if (streams->send.next_uni_stream_id < stream_id + QUIC_STREAM_ID_STEP)
+ streams->send.next_uni_stream_id = stream_id + QUIC_STREAM_ID_STEP;
+ streams->send.streams_uni++;
+
+ quic_stream_add(streams, stream);
+ stream_id += QUIC_STREAM_ID_STEP;
+ continue;
+ }
+
+ if (streams->send.next_bidi_stream_id < stream_id + QUIC_STREAM_ID_STEP)
+ streams->send.next_bidi_stream_id = stream_id + QUIC_STREAM_ID_STEP;
+ streams->send.streams_bidi++;
+
+ if (quic_stream_id_local(stream_id, is_serv)) {
+ stream->send.max_bytes = streams->send.max_stream_data_bidi_remote;
+ stream->recv.max_bytes = streams->recv.max_stream_data_bidi_local;
+ } else {
+ stream->send.max_bytes = streams->send.max_stream_data_bidi_local;
+ stream->recv.max_bytes = streams->recv.max_stream_data_bidi_remote;
+ }
+ stream->recv.window = stream->recv.max_bytes;
+
+ quic_stream_add(streams, stream);
+ stream_id += QUIC_STREAM_ID_STEP;
+ }
+ return stream;
+}
+
+/* Create and register new streams for receiving. */
+static struct quic_stream *quic_stream_recv_create(struct quic_stream_table *streams,
+ s64 max_stream_id, u8 is_serv)
+{
+ struct quic_stream *stream;
+ s64 stream_id;
+
+ stream_id = streams->recv.next_bidi_stream_id;
+ if (quic_stream_id_uni(max_stream_id))
+ stream_id = streams->recv.next_uni_stream_id;
+
+ /* rfc9000#section-2.1: A stream ID that is used out of order results in all streams
+ * of that type with lower-numbered stream IDs also being opened.
+ */
+ while (stream_id <= max_stream_id) {
+ stream = kzalloc(sizeof(*stream), GFP_ATOMIC);
+ if (!stream)
+ return NULL;
+
+ stream->id = stream_id;
+ if (quic_stream_id_uni(stream_id)) {
+ stream->recv.window = streams->recv.max_stream_data_uni;
+ stream->recv.max_bytes = stream->recv.window;
+
+ if (streams->recv.next_uni_stream_id < stream_id + QUIC_STREAM_ID_STEP)
+ streams->recv.next_uni_stream_id = stream_id + QUIC_STREAM_ID_STEP;
+ streams->recv.streams_uni++;
+
+ quic_stream_add(streams, stream);
+ stream_id += QUIC_STREAM_ID_STEP;
+ continue;
+ }
+
+ if (streams->recv.next_bidi_stream_id < stream_id + QUIC_STREAM_ID_STEP)
+ streams->recv.next_bidi_stream_id = stream_id + QUIC_STREAM_ID_STEP;
+ streams->recv.streams_bidi++;
+
+ if (quic_stream_id_local(stream_id, is_serv)) {
+ stream->send.max_bytes = streams->send.max_stream_data_bidi_remote;
+ stream->recv.max_bytes = streams->recv.max_stream_data_bidi_local;
+ } else {
+ stream->send.max_bytes = streams->send.max_stream_data_bidi_local;
+ stream->recv.max_bytes = streams->recv.max_stream_data_bidi_remote;
+ }
+ stream->recv.window = stream->recv.max_bytes;
+
+ quic_stream_add(streams, stream);
+ stream_id += QUIC_STREAM_ID_STEP;
+ }
+ return stream;
+}
+
+/* Check if a send stream ID is already closed. */
+static bool quic_stream_id_send_closed(struct quic_stream_table *streams, s64 stream_id)
+{
+ if (quic_stream_id_uni(stream_id)) {
+ if (stream_id < streams->send.next_uni_stream_id)
+ return true;
+ } else {
+ if (stream_id < streams->send.next_bidi_stream_id)
+ return true;
+ }
+ return false;
+}
+
+/* Check if a receive stream ID is already closed. */
+static bool quic_stream_id_recv_closed(struct quic_stream_table *streams, s64 stream_id)
+{
+ if (quic_stream_id_uni(stream_id)) {
+ if (stream_id < streams->recv.next_uni_stream_id)
+ return true;
+ } else {
+ if (stream_id < streams->recv.next_bidi_stream_id)
+ return true;
+ }
+ return false;
+}
+
+/* Check if a receive stream ID would exceed local limits. */
+static bool quic_stream_id_recv_exceeds(struct quic_stream_table *streams, s64 stream_id)
+{
+ if (quic_stream_id_uni(stream_id)) {
+ if (stream_id > streams->recv.max_uni_stream_id)
+ return true;
+ } else {
+ if (stream_id > streams->recv.max_bidi_stream_id)
+ return true;
+ }
+ return false;
+}
+
+/* Check if a send stream ID would exceed peer's limits. */
+bool quic_stream_id_send_exceeds(struct quic_stream_table *streams, s64 stream_id)
+{
+ u64 nstreams;
+
+ if (quic_stream_id_uni(stream_id)) {
+ if (stream_id > streams->send.max_uni_stream_id)
+ return true;
+ } else {
+ if (stream_id > streams->send.max_bidi_stream_id)
+ return true;
+ }
+
+ if (quic_stream_id_uni(stream_id)) {
+ stream_id -= streams->send.next_uni_stream_id;
+ nstreams = quic_stream_id_to_streams(stream_id);
+ if (nstreams + streams->send.streams_uni > streams->send.max_streams_uni)
+ return true;
+ } else {
+ stream_id -= streams->send.next_bidi_stream_id;
+ nstreams = quic_stream_id_to_streams(stream_id);
+ if (nstreams + streams->send.streams_bidi > streams->send.max_streams_bidi)
+ return true;
+ }
+ return false;
+}
+
+/* Get or create a send stream by ID. */
+struct quic_stream *quic_stream_send_get(struct quic_stream_table *streams, s64 stream_id,
+ u32 flags, bool is_serv)
+{
+ struct quic_stream *stream;
+
+ if (!quic_stream_id_send(stream_id, is_serv))
+ return ERR_PTR(-EINVAL);
+
+ stream = quic_stream_find(streams, stream_id);
+ if (stream) {
+ if ((flags & MSG_STREAM_NEW) &&
+ stream->send.state != QUIC_STREAM_SEND_STATE_READY)
+ return ERR_PTR(-EINVAL);
+ return stream;
+ }
+
+ if (quic_stream_id_send_closed(streams, stream_id))
+ return ERR_PTR(-ENOSTR);
+
+ if (!(flags & MSG_STREAM_NEW))
+ return ERR_PTR(-EINVAL);
+
+ if (quic_stream_id_send_exceeds(streams, stream_id))
+ return ERR_PTR(-EAGAIN);
+
+ stream = quic_stream_send_create(streams, stream_id, is_serv);
+ if (!stream)
+ return ERR_PTR(-ENOSTR);
+ streams->send.active_stream_id = stream_id;
+ return stream;
+}
+
+/* Get or create a receive stream by ID. */
+struct quic_stream *quic_stream_recv_get(struct quic_stream_table *streams, s64 stream_id,
+ bool is_serv)
+{
+ struct quic_stream *stream;
+
+ if (!quic_stream_id_recv(stream_id, is_serv))
+ return ERR_PTR(-EINVAL);
+
+ stream = quic_stream_find(streams, stream_id);
+ if (stream)
+ return stream;
+
+ if (quic_stream_id_local(stream_id, is_serv)) {
+ if (quic_stream_id_send_closed(streams, stream_id))
+ return ERR_PTR(-ENOSTR);
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (quic_stream_id_recv_closed(streams, stream_id))
+ return ERR_PTR(-ENOSTR);
+
+ if (quic_stream_id_recv_exceeds(streams, stream_id))
+ return ERR_PTR(-EAGAIN);
+
+ stream = quic_stream_recv_create(streams, stream_id, is_serv);
+ if (!stream)
+ return ERR_PTR(-ENOSTR);
+ if (quic_stream_id_send(stream_id, is_serv))
+ streams->send.active_stream_id = stream_id;
+ return stream;
+}
+
+/* Release or clean up a send stream. This function updates stream counters and state when
+ * a send stream has either successfully sent all data or has been reset.
+ */
+void quic_stream_send_put(struct quic_stream_table *streams, struct quic_stream *stream,
+ bool is_serv)
+{
+ if (quic_stream_id_uni(stream->id)) {
+ /* For unidirectional streams, decrement uni count and delete immediately. */
+ streams->send.streams_uni--;
+ quic_stream_delete(stream);
+ return;
+ }
+
+ /* For bidi streams, only proceed if receive side is in a final state. */
+ if (stream->recv.state != QUIC_STREAM_RECV_STATE_RECVD &&
+ stream->recv.state != QUIC_STREAM_RECV_STATE_READ &&
+ stream->recv.state != QUIC_STREAM_RECV_STATE_RESET_RECVD)
+ return;
+
+ if (quic_stream_id_local(stream->id, is_serv)) {
+ /* Local-initiated stream: mark send done and decrement send.bidi count. */
+ if (!stream->send.done) {
+ stream->send.done = 1;
+ streams->send.streams_bidi--;
+ }
+ goto out;
+ }
+ /* Remote-initiated stream: mark recv done and decrement recv bidi count. */
+ if (!stream->recv.done) {
+ stream->recv.done = 1;
+ streams->recv.streams_bidi--;
+ streams->recv.bidi_pending = 1;
+ }
+out:
+ /* Delete stream if fully read or no data received. */
+ if (stream->recv.state == QUIC_STREAM_RECV_STATE_READ || !stream->recv.offset)
+ quic_stream_delete(stream);
+}
+
+/* Release or clean up a receive stream. This function updates stream counters and state when
+ * the receive side has either consumed all data or has been reset.
+ */
+void quic_stream_recv_put(struct quic_stream_table *streams, struct quic_stream *stream,
+ bool is_serv)
+{
+ if (quic_stream_id_uni(stream->id)) {
+ /* For uni streams, decrement uni count and mark done. */
+ if (!stream->recv.done) {
+ stream->recv.done = 1;
+ streams->recv.streams_uni--;
+ streams->recv.uni_pending = 1;
+ }
+ goto out;
+ }
+
+ /* For bidi streams, only proceed if send side is in a final state. */
+ if (stream->send.state != QUIC_STREAM_SEND_STATE_RECVD &&
+ stream->send.state != QUIC_STREAM_SEND_STATE_RESET_RECVD)
+ return;
+
+ if (quic_stream_id_local(stream->id, is_serv)) {
+ /* Local-initiated stream: mark send done and decrement send.bidi count. */
+ if (!stream->send.done) {
+ stream->send.done = 1;
+ streams->send.streams_bidi--;
+ }
+ goto out;
+ }
+ /* Remote-initiated stream: mark recv done and decrement recv bidi count. */
+ if (!stream->recv.done) {
+ stream->recv.done = 1;
+ streams->recv.streams_bidi--;
+ streams->recv.bidi_pending = 1;
+ }
+out:
+ /* Delete stream if fully read or no data received. */
+ if (stream->recv.state == QUIC_STREAM_RECV_STATE_READ || !stream->recv.offset)
+ quic_stream_delete(stream);
+}
+
+/* Updates the maximum allowed incoming stream IDs if any streams were recently closed.
+ * Recalculates the max_uni and max_bidi stream ID limits based on the number of open
+ * streams and whether any were marked for deletion.
+ *
+ * Returns true if either max_uni or max_bidi was updated, indicating that a
+ * MAX_STREAMS_UNI or MAX_STREAMS_BIDI frame should be sent to the peer.
+ */
+bool quic_stream_max_streams_update(struct quic_stream_table *streams, s64 *max_uni, s64 *max_bidi)
+{
+ if (streams->recv.uni_pending) {
+ streams->recv.max_uni_stream_id =
+ streams->recv.next_uni_stream_id - QUIC_STREAM_ID_STEP +
+ ((streams->recv.max_streams_uni - streams->recv.streams_uni) <<
+ QUIC_STREAM_TYPE_BITS);
+ *max_uni = quic_stream_id_to_streams(streams->recv.max_uni_stream_id);
+ streams->recv.uni_pending = 0;
+ }
+ if (streams->recv.bidi_pending) {
+ streams->recv.max_bidi_stream_id =
+ streams->recv.next_bidi_stream_id - QUIC_STREAM_ID_STEP +
+ ((streams->recv.max_streams_bidi - streams->recv.streams_bidi) <<
+ QUIC_STREAM_TYPE_BITS);
+ *max_bidi = quic_stream_id_to_streams(streams->recv.max_bidi_stream_id);
+ streams->recv.bidi_pending = 0;
+ }
+
+ return *max_uni || *max_bidi;
+}
+
+int quic_stream_init(struct quic_stream_table *streams)
+{
+ struct quic_hash_table *ht = &streams->ht;
+ struct quic_hash_head *head;
+ int i, size = QUIC_HT_SIZE;
+
+ head = kmalloc_array(size, sizeof(*head), GFP_KERNEL);
+ if (!head)
+ return -ENOMEM;
+ for (i = 0; i < size; i++)
+ INIT_HLIST_HEAD(&head[i].head);
+ ht->size = size;
+ ht->hash = head;
+ return 0;
+}
+
+void quic_stream_free(struct quic_stream_table *streams)
+{
+ struct quic_hash_table *ht = &streams->ht;
+ struct quic_hash_head *head;
+ struct quic_stream *stream;
+ struct hlist_node *tmp;
+ int i;
+
+ for (i = 0; i < ht->size; i++) {
+ head = &ht->hash[i];
+ hlist_for_each_entry_safe(stream, tmp, &head->head, node) {
+ hlist_del_init(&stream->node);
+ kfree(stream);
+ }
+ }
+ kfree(ht->hash);
+}
+
+/* Populate transport parameters from stream hash table. */
+void quic_stream_get_param(struct quic_stream_table *streams, struct quic_transport_param *p,
+ bool is_serv)
+{
+ if (p->remote) {
+ p->max_stream_data_bidi_remote = streams->send.max_stream_data_bidi_remote;
+ p->max_stream_data_bidi_local = streams->send.max_stream_data_bidi_local;
+ p->max_stream_data_uni = streams->send.max_stream_data_uni;
+ p->max_streams_bidi = streams->send.max_streams_bidi;
+ p->max_streams_uni = streams->send.max_streams_uni;
+ return;
+ }
+
+ p->max_stream_data_bidi_remote = streams->recv.max_stream_data_bidi_remote;
+ p->max_stream_data_bidi_local = streams->recv.max_stream_data_bidi_local;
+ p->max_stream_data_uni = streams->recv.max_stream_data_uni;
+ p->max_streams_bidi = streams->recv.max_streams_bidi;
+ p->max_streams_uni = streams->recv.max_streams_uni;
+}
+
+/* Configure stream hashtable from transport parameters. */
+void quic_stream_set_param(struct quic_stream_table *streams, struct quic_transport_param *p,
+ bool is_serv)
+{
+ u8 type;
+
+ if (p->remote) {
+ streams->send.max_stream_data_bidi_local = p->max_stream_data_bidi_local;
+ streams->send.max_stream_data_bidi_remote = p->max_stream_data_bidi_remote;
+ streams->send.max_stream_data_uni = p->max_stream_data_uni;
+ streams->send.max_streams_bidi = p->max_streams_bidi;
+ streams->send.max_streams_uni = p->max_streams_uni;
+ streams->send.active_stream_id = -1;
+
+ if (is_serv) {
+ type = QUIC_STREAM_TYPE_SERVER_BIDI;
+ streams->send.max_bidi_stream_id =
+ quic_stream_streams_to_id(p->max_streams_bidi, type);
+ streams->send.next_bidi_stream_id = type;
+
+ type = QUIC_STREAM_TYPE_SERVER_UNI;
+ streams->send.max_uni_stream_id =
+ quic_stream_streams_to_id(p->max_streams_uni, type);
+ streams->send.next_uni_stream_id = type;
+ return;
+ }
+
+ type = QUIC_STREAM_TYPE_CLIENT_BIDI;
+ streams->send.max_bidi_stream_id =
+ quic_stream_streams_to_id(p->max_streams_bidi, type);
+ streams->send.next_bidi_stream_id = type;
+
+ type = QUIC_STREAM_TYPE_CLIENT_UNI;
+ streams->send.max_uni_stream_id =
+ quic_stream_streams_to_id(p->max_streams_uni, type);
+ streams->send.next_uni_stream_id = type;
+ return;
+ }
+
+ streams->recv.max_stream_data_bidi_local = p->max_stream_data_bidi_local;
+ streams->recv.max_stream_data_bidi_remote = p->max_stream_data_bidi_remote;
+ streams->recv.max_stream_data_uni = p->max_stream_data_uni;
+ streams->recv.max_streams_bidi = p->max_streams_bidi;
+ streams->recv.max_streams_uni = p->max_streams_uni;
+
+ if (is_serv) {
+ type = QUIC_STREAM_TYPE_CLIENT_BIDI;
+ streams->recv.max_bidi_stream_id =
+ quic_stream_streams_to_id(p->max_streams_bidi, type);
+ streams->recv.next_bidi_stream_id = type;
+
+ type = QUIC_STREAM_TYPE_CLIENT_UNI;
+ streams->recv.max_uni_stream_id =
+ quic_stream_streams_to_id(p->max_streams_uni, type);
+ streams->recv.next_uni_stream_id = type;
+ return;
+ }
+
+ type = QUIC_STREAM_TYPE_SERVER_BIDI;
+ streams->recv.max_bidi_stream_id =
+ quic_stream_streams_to_id(p->max_streams_bidi, type);
+ streams->recv.next_bidi_stream_id = type;
+
+ type = QUIC_STREAM_TYPE_SERVER_UNI;
+ streams->recv.max_uni_stream_id =
+ quic_stream_streams_to_id(p->max_streams_uni, type);
+ streams->recv.next_uni_stream_id = type;
+}
diff --git a/net/quic/stream.h b/net/quic/stream.h
new file mode 100644
index 000000000000..4f570fdc55f2
--- /dev/null
+++ b/net/quic/stream.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_DEF_STREAMS 100
+#define QUIC_MAX_STREAMS 4096ULL
+
+/*
+ * rfc9000#section-2.1:
+ *
+ * The least significant bit (0x01) of the stream ID identifies the initiator of the stream.
+ * Client-initiated streams have even-numbered stream IDs (with the bit set to 0), and
+ * server-initiated streams have odd-numbered stream IDs (with the bit set to 1).
+ *
+ * The second least significant bit (0x02) of the stream ID distinguishes between bidirectional
+ * streams (with the bit set to 0) and unidirectional streams (with the bit set to 1).
+ */
+#define QUIC_STREAM_TYPE_BITS 2
+#define QUIC_STREAM_ID_STEP BIT(QUIC_STREAM_TYPE_BITS)
+
+#define QUIC_STREAM_TYPE_CLIENT_BIDI 0x00
+#define QUIC_STREAM_TYPE_SERVER_BIDI 0x01
+#define QUIC_STREAM_TYPE_CLIENT_UNI 0x02
+#define QUIC_STREAM_TYPE_SERVER_UNI 0x03
+
+struct quic_stream {
+ struct hlist_node node;
+ s64 id; /* Stream ID as defined in RFC 9000 Section 2.1 */
+ struct {
+ /* Sending-side stream level flow control */
+ u64 last_max_bytes; /* Maximum send offset advertised by peer at last update */
+ u64 max_bytes; /* Current maximum offset we are allowed to send to */
+ u64 bytes; /* Bytes already sent to peer */
+
+ u32 errcode; /* Application error code to send in RESET_STREAM */
+ u32 frags; /* Number of sent STREAM frames not yet acknowledged */
+ u8 state; /* Send stream state, per rfc9000#section-3.1 */
+
+ u8 data_blocked:1; /* True if flow control blocks sending more data */
+ u8 stop_sent:1; /* True if STOP_SENDING has been sent, not acknowledged */
+ u8 done:1; /* True if application indicated end of stream (FIN sent) */
+ } send;
+ struct {
+ /* Receiving-side stream level flow control */
+ u64 max_bytes; /* Maximum offset peer is allowed to send to */
+ u64 window; /* Remaining receive window before advertising a new limit */
+ u64 bytes; /* Bytes consumed by application from the stream */
+
+ u64 highest; /* Highest received offset */
+ u64 offset; /* Offset up to which data is in buffer or consumed */
+ u64 finalsz; /* Final size of the stream if FIN received */
+
+ u32 frags; /* Number of received STREAM frames pending reassembly */
+ u8 state; /* Receive stream state, per rfc9000#section-3.2 */
+ u8 done:1; /* True if FIN received and final size validated */
+ } recv;
+};
+
+struct quic_stream_table {
+ struct quic_hash_table ht; /* Hash table storing all active streams */
+
+ struct {
+ /* Parameters received from peer, defined in rfc9000#section-18.2 */
+ u64 max_stream_data_bidi_remote; /* initial_max_stream_data_bidi_remote */
+ u64 max_stream_data_bidi_local; /* initial_max_stream_data_bidi_local */
+ u64 max_stream_data_uni; /* initial_max_stream_data_uni */
+ u64 max_streams_bidi; /* initial_max_streams_bidi */
+ u64 max_streams_uni; /* initial_max_streams_uni */
+
+ s64 next_bidi_stream_id; /* Next bidi stream ID to be opened */
+ s64 next_uni_stream_id; /* Next uni stream ID to be opened */
+ s64 max_bidi_stream_id; /* Highest allowed bidi stream ID */
+ s64 max_uni_stream_id; /* Highest allowed uni stream ID */
+ s64 active_stream_id; /* Most recently opened stream ID */
+
+ u8 bidi_blocked:1; /* True if STREAMS_BLOCKED_BIDI was sent and not ACKed */
+ u8 uni_blocked:1; /* True if STREAMS_BLOCKED_UNI was sent and not ACKed */
+ u16 streams_bidi; /* Number of currently active bidi streams */
+ u16 streams_uni; /* Number of currently active uni streams */
+ } send;
+ struct {
+ /* Our advertised limits to the peer, per rfc9000#section-18.2 */
+ u64 max_stream_data_bidi_remote; /* initial_max_stream_data_bidi_remote */
+ u64 max_stream_data_bidi_local; /* initial_max_stream_data_bidi_local */
+ u64 max_stream_data_uni; /* initial_max_stream_data_uni */
+ u64 max_streams_bidi; /* initial_max_streams_bidi */
+ u64 max_streams_uni; /* initial_max_streams_uni */
+
+ s64 next_bidi_stream_id; /* Next expected bidi stream ID from peer */
+ s64 next_uni_stream_id; /* Next expected uni stream ID from peer */
+ s64 max_bidi_stream_id; /* Current allowed bidi stream ID range */
+ s64 max_uni_stream_id; /* Current allowed uni stream ID range */
+
+ u8 bidi_pending:1; /* True if MAX_STREAMS_BIDI needs to be sent */
+ u8 uni_pending:1; /* True if MAX_STREAMS_UNI needs to be sent */
+ u16 streams_bidi; /* Number of currently open bidi streams */
+ u16 streams_uni; /* Number of currently open uni streams */
+ } recv;
+};
+
+static inline u64 quic_stream_id_to_streams(s64 stream_id)
+{
+ return (u64)(stream_id >> QUIC_STREAM_TYPE_BITS) + 1;
+}
+
+static inline s64 quic_stream_streams_to_id(u64 streams, u8 type)
+{
+ return (s64)((streams - 1) << QUIC_STREAM_TYPE_BITS) | type;
+}
+
+struct quic_stream *quic_stream_send_get(struct quic_stream_table *streams, s64 stream_id,
+ u32 flags, bool is_serv);
+struct quic_stream *quic_stream_recv_get(struct quic_stream_table *streams, s64 stream_id,
+ bool is_serv);
+void quic_stream_send_put(struct quic_stream_table *streams, struct quic_stream *stream,
+ bool is_serv);
+void quic_stream_recv_put(struct quic_stream_table *streams, struct quic_stream *stream,
+ bool is_serv);
+
+bool quic_stream_max_streams_update(struct quic_stream_table *streams, s64 *max_uni, s64 *max_bidi);
+struct quic_stream *quic_stream_find(struct quic_stream_table *streams, s64 stream_id);
+bool quic_stream_id_send_exceeds(struct quic_stream_table *streams, s64 stream_id);
+
+void quic_stream_get_param(struct quic_stream_table *streams, struct quic_transport_param *p,
+ bool is_serv);
+void quic_stream_set_param(struct quic_stream_table *streams, struct quic_transport_param *p,
+ bool is_serv);
+void quic_stream_free(struct quic_stream_table *streams);
+int quic_stream_init(struct quic_stream_table *streams);
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 06/15] quic: add stream management
2025-08-18 14:04 ` [PATCH net-next v2 06/15] quic: add stream management Xin Long
@ 2025-08-21 13:43 ` Paolo Abeni
2025-08-23 17:14 ` Xin Long
0 siblings, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2025-08-21 13:43 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Simon Horman, Stefan Metzmacher,
Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs, Steve French,
Namjae Jeon, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
On 8/18/25 4:04 PM, Xin Long wrote:
> +/* Check if a stream ID is valid for sending. */
> +static bool quic_stream_id_send(s64 stream_id, bool is_serv)
> +{
> + u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
> +
> + if (is_serv) {
> + if (type == QUIC_STREAM_TYPE_CLIENT_UNI)
> + return false;
> + } else if (type == QUIC_STREAM_TYPE_SERVER_UNI) {
> + return false;
> + }
> + return true;
> +}
> +
> +/* Check if a stream ID is valid for receiving. */
> +static bool quic_stream_id_recv(s64 stream_id, bool is_serv)
> +{
> + u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
> +
> + if (is_serv) {
> + if (type == QUIC_STREAM_TYPE_SERVER_UNI)
> + return false;
> + } else if (type == QUIC_STREAM_TYPE_CLIENT_UNI) {
> + return false;
> + }
> + return true;
> +}
The above two functions could be implemented using a common helper
saving some code duplication.
> +/* Create and register new streams for sending. */
> +static struct quic_stream *quic_stream_send_create(struct quic_stream_table *streams,
> + s64 max_stream_id, u8 is_serv)
> +{
> + struct quic_stream *stream;
> + s64 stream_id;
> +
> + stream_id = streams->send.next_bidi_stream_id;
> + if (quic_stream_id_uni(max_stream_id))
> + stream_id = streams->send.next_uni_stream_id;
> +
> + /* rfc9000#section-2.1: A stream ID that is used out of order results in all streams
> + * of that type with lower-numbered stream IDs also being opened.
> + */
> + while (stream_id <= max_stream_id) {
Is wrap around theoretically possible?
Who provides `max_stream_id`, the user-space or a remote peer? What if
max_stream_id - stream_id is, say, 1M?
[...]
> +/* Check if a receive stream ID is already closed. */
> +static bool quic_stream_id_recv_closed(struct quic_stream_table *streams, s64 stream_id)
> +{
> + if (quic_stream_id_uni(stream_id)) {
> + if (stream_id < streams->recv.next_uni_stream_id)
> + return true;
> + } else {
> + if (stream_id < streams->recv.next_bidi_stream_id)
> + return true;
> + }
> + return false;
> +}
I guess the above answers my previous questions, but I think that memory
accounting for stream allocation is still deserved.
> +
> +/* Check if a receive stream ID would exceed local limits. */
> +static bool quic_stream_id_recv_exceeds(struct quic_stream_table *streams, s64 stream_id)
> +{
> + if (quic_stream_id_uni(stream_id)) {
> + if (stream_id > streams->recv.max_uni_stream_id)
> + return true;
> + } else {
> + if (stream_id > streams->recv.max_bidi_stream_id)
> + return true;
> + }
> + return false;
> +}
> +
> +/* Check if a send stream ID would exceed peer's limits. */
> +bool quic_stream_id_send_exceeds(struct quic_stream_table *streams, s64 stream_id)
> +{
> + u64 nstreams;
> +
> + if (quic_stream_id_uni(stream_id)) {
> + if (stream_id > streams->send.max_uni_stream_id)
> + return true;
> + } else {
> + if (stream_id > streams->send.max_bidi_stream_id)
> + return true;
> + }
> +
> + if (quic_stream_id_uni(stream_id)) {
> + stream_id -= streams->send.next_uni_stream_id;
> + nstreams = quic_stream_id_to_streams(stream_id);
> + if (nstreams + streams->send.streams_uni > streams->send.max_streams_uni)
> + return true;
> + } else {
> + stream_id -= streams->send.next_bidi_stream_id;
> + nstreams = quic_stream_id_to_streams(stream_id);
> + if (nstreams + streams->send.streams_bidi > streams->send.max_streams_bidi)
> + return true;
> + }
> + return false;
> +}
> +
> +/* Get or create a send stream by ID. */
> +struct quic_stream *quic_stream_send_get(struct quic_stream_table *streams, s64 stream_id,
> + u32 flags, bool is_serv)
> +{
> + struct quic_stream *stream;
> +
> + if (!quic_stream_id_send(stream_id, is_serv))
> + return ERR_PTR(-EINVAL);
> +
> + stream = quic_stream_find(streams, stream_id);
> + if (stream) {
> + if ((flags & MSG_STREAM_NEW) &&
> + stream->send.state != QUIC_STREAM_SEND_STATE_READY)
> + return ERR_PTR(-EINVAL);
> + return stream;
> + }
> +
> + if (quic_stream_id_send_closed(streams, stream_id))
> + return ERR_PTR(-ENOSTR);
> +
> + if (!(flags & MSG_STREAM_NEW))
> + return ERR_PTR(-EINVAL);
> +
> + if (quic_stream_id_send_exceeds(streams, stream_id))
> + return ERR_PTR(-EAGAIN);
> +
> + stream = quic_stream_send_create(streams, stream_id, is_serv);
> + if (!stream)
> + return ERR_PTR(-ENOSTR);
> + streams->send.active_stream_id = stream_id;
> + return stream;
There is no locking at all in lookup/add/remove. Lacking the callers of
such functions, it is hard to say whether that is safe. You should add some
info about that in the commit message (or lock here ;)
/P
* Re: [PATCH net-next v2 06/15] quic: add stream management
2025-08-21 13:43 ` Paolo Abeni
@ 2025-08-23 17:14 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-23 17:14 UTC (permalink / raw)
To: Paolo Abeni
Cc: network dev, davem, kuba, Eric Dumazet, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
On Thu, Aug 21, 2025 at 9:43 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 4:04 PM, Xin Long wrote:
> > +/* Check if a stream ID is valid for sending. */
> > +static bool quic_stream_id_send(s64 stream_id, bool is_serv)
> > +{
> > + u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
> > +
> > + if (is_serv) {
> > + if (type == QUIC_STREAM_TYPE_CLIENT_UNI)
> > + return false;
> > + } else if (type == QUIC_STREAM_TYPE_SERVER_UNI) {
> > + return false;
> > + }
> > + return true;
> > +}
> > +
> > +/* Check if a stream ID is valid for receiving. */
> > +static bool quic_stream_id_recv(s64 stream_id, bool is_serv)
> > +{
> > + u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
> > +
> > + if (is_serv) {
> > + if (type == QUIC_STREAM_TYPE_SERVER_UNI)
> > + return false;
> > + } else if (type == QUIC_STREAM_TYPE_CLIENT_UNI) {
> > + return false;
> > + }
> > + return true;
> > +}
>
> The above two functions could be implemented using a common helper
> saving some code duplication.
Not yet sure if it's worth a helper; I need to think about it.
>
> > +/* Create and register new streams for sending. */
> > +static struct quic_stream *quic_stream_send_create(struct quic_stream_table *streams,
> > + s64 max_stream_id, u8 is_serv)
> > +{
> > + struct quic_stream *stream;
> > + s64 stream_id;
> > +
> > + stream_id = streams->send.next_bidi_stream_id;
> > + if (quic_stream_id_uni(max_stream_id))
> > + stream_id = streams->send.next_uni_stream_id;
> > +
> > + /* rfc9000#section-2.1: A stream ID that is used out of order results in all streams
> > + * of that type with lower-numbered stream IDs also being opened.
> > + */
> > + while (stream_id <= max_stream_id) {
>
> Is wrap around theoretically possible?
> Who provides `max_stream_id`, the user-space or a remote peer? What if
> max_stream_id - stream_id is, say, 1M?
There are two values limiting this:
1. streams->send.max_uni/bidi_stream_id:
the max uni/bidi stream ID the peer has allowed us to open.
2. streams->send.max_streams_uni/bidi (max value: QUIC_MAX_STREAMS(4096)):
to limit the number of existing streams.
>
> [...]
> > +/* Check if a receive stream ID is already closed. */
> > +static bool quic_stream_id_recv_closed(struct quic_stream_table *streams, s64 stream_id)
> > +{
> > + if (quic_stream_id_uni(stream_id)) {
> > + if (stream_id < streams->recv.next_uni_stream_id)
> > + return true;
> > + } else {
> > + if (stream_id < streams->recv.next_bidi_stream_id)
> > + return true;
> > + }
> > + return false;
> > +}
>
> I guess the above answers my previous questions, but I think that memory
> accounting for stream allocation is still deserved.
>
I can give it a try. sk_r/wmem_schedule() should be used for this I suppose.
> > +
> > +/* Check if a receive stream ID would exceed local limits. */
> > +static bool quic_stream_id_recv_exceeds(struct quic_stream_table *streams, s64 stream_id)
> > +{
> > + if (quic_stream_id_uni(stream_id)) {
> > + if (stream_id > streams->recv.max_uni_stream_id)
> > + return true;
> > + } else {
> > + if (stream_id > streams->recv.max_bidi_stream_id)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* Check if a send stream ID would exceed peer's limits. */
> > +bool quic_stream_id_send_exceeds(struct quic_stream_table *streams, s64 stream_id)
> > +{
> > + u64 nstreams;
> > +
> > + if (quic_stream_id_uni(stream_id)) {
> > + if (stream_id > streams->send.max_uni_stream_id)
> > + return true;
> > + } else {
> > + if (stream_id > streams->send.max_bidi_stream_id)
> > + return true;
> > + }
> > +
> > + if (quic_stream_id_uni(stream_id)) {
> > + stream_id -= streams->send.next_uni_stream_id;
> > + nstreams = quic_stream_id_to_streams(stream_id);
> > + if (nstreams + streams->send.streams_uni > streams->send.max_streams_uni)
> > + return true;
> > + } else {
> > + stream_id -= streams->send.next_bidi_stream_id;
> > + nstreams = quic_stream_id_to_streams(stream_id);
> > + if (nstreams + streams->send.streams_bidi > streams->send.max_streams_bidi)
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +/* Get or create a send stream by ID. */
> > +struct quic_stream *quic_stream_send_get(struct quic_stream_table *streams, s64 stream_id,
> > + u32 flags, bool is_serv)
> > +{
> > + struct quic_stream *stream;
> > +
> > + if (!quic_stream_id_send(stream_id, is_serv))
> > + return ERR_PTR(-EINVAL);
> > +
> > + stream = quic_stream_find(streams, stream_id);
> > + if (stream) {
> > + if ((flags & MSG_STREAM_NEW) &&
> > + stream->send.state != QUIC_STREAM_SEND_STATE_READY)
> > + return ERR_PTR(-EINVAL);
> > + return stream;
> > + }
> > +
> > + if (quic_stream_id_send_closed(streams, stream_id))
> > + return ERR_PTR(-ENOSTR);
> > +
> > + if (!(flags & MSG_STREAM_NEW))
> > + return ERR_PTR(-EINVAL);
> > +
> > + if (quic_stream_id_send_exceeds(streams, stream_id))
> > + return ERR_PTR(-EAGAIN);
> > +
> > + stream = quic_stream_send_create(streams, stream_id, is_serv);
> > + if (!stream)
> > + return ERR_PTR(-ENOSTR);
> > + streams->send.active_stream_id = stream_id;
> > + return stream;
>
> There is no locking at all in lookup/add/remove. Lacking the callers of
> such functions, it is hard to say whether that is safe. You should add some
> info about that in the commit message (or lock here ;)
>
stream_hashtable is per connection/socket, it's always protected by
sock lock, I will add information into the commit message.
Thanks.
* [PATCH net-next v2 07/15] quic: add connection id management
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (5 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 06/15] quic: add stream management Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-21 13:55 ` Paolo Abeni
2025-08-22 17:10 ` Jason Baron
2025-08-18 14:04 ` [PATCH net-next v2 08/15] quic: add path management Xin Long
` (8 subsequent siblings)
15 siblings, 2 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'struct quic_conn_id_set' for managing Connection
IDs (CIDs), which are represented by 'struct quic_source_conn_id'
and 'struct quic_dest_conn_id'.
It provides helpers to add and remove CIDs from the set, and handles
insertion of source CIDs into the global connection ID hash table
when necessary.
- quic_conn_id_add(): Add a new Connection ID to the set, and insert
it into the conn_id hash table if it is a source conn_id.
- quic_conn_id_remove(): Remove connection IDs from the set with sequence
numbers less than or equal to a given number.
It also adds utilities to look up CIDs by value or sequence number,
search the global hash table for incoming packets, and check for
stateless reset tokens among destination CIDs. These functions are
essential for RX path connection lookup and stateless reset processing.
- quic_conn_id_find(): Find a Connection ID in the set by seq number.
- quic_conn_id_lookup(): Lookup a Connection ID from global hash table
using the ID value, typically used for socket lookup on the RX path.
- quic_conn_id_token_exists(): Check if a stateless reset token exists
in any dest Connection ID (used during stateless reset processing).
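The add/remove semantics listed above can be illustrated with a small userspace sketch: a set kept ordered by sequence number, duplicate numbers ignored, and retirement of everything at or below a given number. The array-based storage and all `demo_*` names are invented for illustration only; the kernel code uses a linked list:

```c
#include <assert.h>

#define DEMO_MAX 8 /* mirrors the per-connection CID cap discussed in review */

/* Sequence numbers kept sorted ascending, like the conn_id list. */
struct demo_id_set {
	unsigned int num[DEMO_MAX];
	int count;
};

/* Insert in order, ignoring duplicates, like quic_conn_id_add(). */
static int demo_id_add(struct demo_id_set *s, unsigned int number)
{
	int i, j;

	for (i = 0; i < s->count; i++) {
		if (s->num[i] == number)
			return 0; /* already present: ignore */
		if (s->num[i] > number)
			break;    /* insertion point found */
	}
	if (s->count == DEMO_MAX)
		return -1;
	for (j = s->count; j > i; j--) /* shift to keep the set ordered */
		s->num[j] = s->num[j - 1];
	s->num[i] = number;
	s->count++;
	return 0;
}

/* Retire every ID with sequence number <= number, like quic_conn_id_remove(). */
static void demo_id_remove(struct demo_id_set *s, unsigned int number)
{
	int i = 0, j;

	while (i < s->count && s->num[i] <= number)
		i++;
	for (j = 0; j + i < s->count; j++)
		s->num[j] = s->num[j + i];
	s->count -= i;
}
```

This is only the ordering/retirement shape; the real code additionally hashes source CIDs globally and tracks an active entry.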
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/connid.c | 222 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/connid.h | 162 +++++++++++++++++++++++++++++++++
net/quic/socket.c | 6 ++
net/quic/socket.h | 13 +++
5 files changed, 404 insertions(+), 1 deletion(-)
create mode 100644 net/quic/connid.c
create mode 100644 net/quic/connid.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 094e9da5d739..eee7501588d3 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o stream.o
+quic-y := common.o family.o protocol.o socket.o stream.o connid.o
diff --git a/net/quic/connid.c b/net/quic/connid.c
new file mode 100644
index 000000000000..5fe38092caba
--- /dev/null
+++ b/net/quic/connid.c
@@ -0,0 +1,222 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Connection ID management for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/quic.h>
+#include <net/sock.h>
+
+#include "common.h"
+#include "connid.h"
+
+/* Lookup a source connection ID (scid) in the global source connection ID hash table. */
+struct quic_conn_id *quic_conn_id_lookup(struct net *net, u8 *scid, u32 len)
+{
+ struct quic_hash_head *head = quic_source_conn_id_head(net, scid);
+ struct quic_source_conn_id *s_conn_id;
+ struct quic_conn_id *conn_id = NULL;
+
+ spin_lock(&head->s_lock);
+ hlist_for_each_entry(s_conn_id, &head->head, node) {
+ if (net == sock_net(s_conn_id->sk) && s_conn_id->common.id.len == len &&
+ !memcmp(scid, &s_conn_id->common.id.data, s_conn_id->common.id.len)) {
+ sock_hold(s_conn_id->sk);
+ conn_id = &s_conn_id->common.id;
+ break;
+ }
+ }
+
+ spin_unlock(&head->s_lock);
+ return conn_id;
+}
+
+/* Check if a given stateless reset token exists in any connection ID in the connection ID set. */
+bool quic_conn_id_token_exists(struct quic_conn_id_set *id_set, u8 *token)
+{
+ struct quic_common_conn_id *common;
+ struct quic_dest_conn_id *dcid;
+
+ dcid = (struct quic_dest_conn_id *)id_set->active;
+ if (!memcmp(dcid->token, token, QUIC_CONN_ID_TOKEN_LEN)) /* Fast path. */
+ return true;
+
+ list_for_each_entry(common, &id_set->head, list) {
+ dcid = (struct quic_dest_conn_id *)common;
+ if (common == id_set->active)
+ continue;
+ if (!memcmp(dcid->token, token, QUIC_CONN_ID_TOKEN_LEN))
+ return true;
+ }
+ return false;
+}
+
+static void quic_source_conn_id_free_rcu(struct rcu_head *head)
+{
+ struct quic_source_conn_id *s_conn_id;
+
+ s_conn_id = container_of(head, struct quic_source_conn_id, rcu);
+ kfree(s_conn_id);
+}
+
+static void quic_source_conn_id_free(struct quic_source_conn_id *s_conn_id)
+{
+ u8 *data = s_conn_id->common.id.data;
+ struct quic_hash_head *head;
+
+ if (!hlist_unhashed(&s_conn_id->node)) {
+ head = quic_source_conn_id_head(sock_net(s_conn_id->sk), data);
+ spin_lock_bh(&head->s_lock);
+ hlist_del_init(&s_conn_id->node);
+ spin_unlock_bh(&head->s_lock);
+ }
+
+ /* Freeing is deferred via RCU to avoid use-after-free during concurrent lookups. */
+ call_rcu(&s_conn_id->rcu, quic_source_conn_id_free_rcu);
+}
+
+static void quic_conn_id_del(struct quic_common_conn_id *common)
+{
+ list_del(&common->list);
+ if (!common->hashed) {
+ kfree(common);
+ return;
+ }
+ quic_source_conn_id_free((struct quic_source_conn_id *)common);
+}
+
+/* Add a connection ID with sequence number and associated private data to the connection ID set. */
+int quic_conn_id_add(struct quic_conn_id_set *id_set,
+ struct quic_conn_id *conn_id, u32 number, void *data)
+{
+ struct quic_source_conn_id *s_conn_id;
+ struct quic_dest_conn_id *d_conn_id;
+ struct quic_common_conn_id *common;
+ struct quic_hash_head *head;
+ struct list_head *list;
+
+ /* Locate insertion point to keep list ordered by number. */
+ list = &id_set->head;
+ list_for_each_entry(common, list, list) {
+ if (number == common->number)
+ return 0; /* Ignore if it already exists on the list. */
+ if (number < common->number) {
+ list = &common->list;
+ break;
+ }
+ }
+
+ if (conn_id->len > QUIC_CONN_ID_MAX_LEN)
+ return -EINVAL;
+ common = kzalloc(id_set->entry_size, GFP_ATOMIC);
+ if (!common)
+ return -ENOMEM;
+ common->id = *conn_id;
+ common->number = number;
+ if (id_set->entry_size == sizeof(struct quic_dest_conn_id)) {
+ /* For destination connection IDs, copy the stateless reset token if available. */
+ if (data) {
+ d_conn_id = (struct quic_dest_conn_id *)common;
+ memcpy(d_conn_id->token, data, QUIC_CONN_ID_TOKEN_LEN);
+ }
+ } else {
+ /* For source connection IDs, mark as hashed and insert into the global source
+ * connection ID hashtable.
+ */
+ common->hashed = 1;
+ s_conn_id = (struct quic_source_conn_id *)common;
+ s_conn_id->sk = data;
+
+ head = quic_source_conn_id_head(sock_net(s_conn_id->sk), common->id.data);
+ spin_lock_bh(&head->s_lock);
+ hlist_add_head(&s_conn_id->node, &head->head);
+ spin_unlock_bh(&head->s_lock);
+ }
+ list_add_tail(&common->list, list);
+
+ if (number == quic_conn_id_last_number(id_set) + 1) {
+ if (!id_set->active)
+ id_set->active = common;
+ id_set->count++;
+
+ /* Increment count for consecutive following IDs. */
+ list_for_each_entry_continue(common, &id_set->head, list) {
+ if (common->number != ++number)
+ break;
+ id_set->count++;
+ }
+ }
+ return 0;
+}
+
+/* Remove connection IDs from the set with sequence numbers less than or equal to a number. */
+void quic_conn_id_remove(struct quic_conn_id_set *id_set, u32 number)
+{
+ struct quic_common_conn_id *common, *tmp;
+ struct list_head *list;
+
+ list = &id_set->head;
+ list_for_each_entry_safe(common, tmp, list, list) {
+ if (common->number <= number) {
+ if (id_set->active == common)
+ id_set->active = tmp;
+ quic_conn_id_del(common);
+ id_set->count--;
+ }
+ }
+}
+
+struct quic_conn_id *quic_conn_id_find(struct quic_conn_id_set *id_set, u32 number)
+{
+ struct quic_common_conn_id *common;
+
+ list_for_each_entry(common, &id_set->head, list)
+ if (common->number == number)
+ return &common->id;
+ return NULL;
+}
+
+void quic_conn_id_update_active(struct quic_conn_id_set *id_set, u32 number)
+{
+ struct quic_conn_id *conn_id;
+
+ if (number == id_set->active->number)
+ return;
+ conn_id = quic_conn_id_find(id_set, number);
+ if (!conn_id)
+ return;
+ quic_conn_id_set_active(id_set, conn_id);
+}
+
+void quic_conn_id_set_init(struct quic_conn_id_set *id_set, bool source)
+{
+ id_set->entry_size = source ? sizeof(struct quic_source_conn_id)
+ : sizeof(struct quic_dest_conn_id);
+ INIT_LIST_HEAD(&id_set->head);
+}
+
+void quic_conn_id_set_free(struct quic_conn_id_set *id_set)
+{
+ struct quic_common_conn_id *common, *tmp;
+
+ list_for_each_entry_safe(common, tmp, &id_set->head, list)
+ quic_conn_id_del(common);
+ id_set->count = 0;
+ id_set->active = NULL;
+}
+
+void quic_conn_id_get_param(struct quic_conn_id_set *id_set, struct quic_transport_param *p)
+{
+ p->active_connection_id_limit = id_set->max_count;
+}
+
+void quic_conn_id_set_param(struct quic_conn_id_set *id_set, struct quic_transport_param *p)
+{
+ id_set->max_count = p->active_connection_id_limit;
+}
diff --git a/net/quic/connid.h b/net/quic/connid.h
new file mode 100644
index 000000000000..cff37b2fb95b
--- /dev/null
+++ b/net/quic/connid.h
@@ -0,0 +1,162 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_CONN_ID_LIMIT 8
+#define QUIC_CONN_ID_DEF 7
+#define QUIC_CONN_ID_LEAST 2
+
+#define QUIC_CONN_ID_TOKEN_LEN 16
+
+/* Common fields shared by both source and destination Connection IDs */
+struct quic_common_conn_id {
+ struct quic_conn_id id; /* The actual Connection ID value and its length */
+ struct list_head list; /* Linked list node for conn_id list management */
+ u32 number; /* Sequence number assigned to this Connection ID */
+ u8 hashed; /* Non-zero if this ID is stored in source_conn_id hashtable */
+};
+
+struct quic_source_conn_id {
+ struct quic_common_conn_id common;
+ struct hlist_node node; /* Hash table node for fast lookup by Connection ID */
+ struct rcu_head rcu; /* RCU header for deferred destruction */
+ struct sock *sk; /* Pointer to sk associated with this Connection ID */
+};
+
+struct quic_dest_conn_id {
+ struct quic_common_conn_id common;
+ u8 token[QUIC_CONN_ID_TOKEN_LEN]; /* Stateless reset token in rfc9000#section-10.3 */
+};
+
+struct quic_conn_id_set {
+ /* Connection ID in use on the current path */
+ struct quic_common_conn_id *active;
+ /* Connection ID to use for a new path (e.g., after migration) */
+ struct quic_common_conn_id *alt;
+ struct list_head head; /* Head of the linked list of available connection IDs */
+ u8 entry_size; /* Size of each connection ID entry (in bytes) in the list */
+ u8 max_count; /* active_connection_id_limit in rfc9000#section-18.2 */
+ u8 count; /* Current number of connection IDs in the list */
+};
+
+static inline u32 quic_conn_id_first_number(struct quic_conn_id_set *id_set)
+{
+ struct quic_common_conn_id *common;
+
+ common = list_first_entry(&id_set->head, struct quic_common_conn_id, list);
+ return common->number;
+}
+
+static inline u32 quic_conn_id_last_number(struct quic_conn_id_set *id_set)
+{
+ return quic_conn_id_first_number(id_set) + id_set->count - 1;
+}
+
+static inline void quic_conn_id_generate(struct quic_conn_id *conn_id)
+{
+ get_random_bytes(conn_id->data, QUIC_CONN_ID_DEF_LEN);
+ conn_id->len = QUIC_CONN_ID_DEF_LEN;
+}
+
+/* Select an alternate destination Connection ID for a new path (e.g., after migration). */
+static inline bool quic_conn_id_select_alt(struct quic_conn_id_set *id_set, bool active)
+{
+ if (id_set->alt)
+ return true;
+ /* NAT rebinding: peer keeps using the current source conn_id.
+ * In this case, continue using the same dest conn_id for the new path.
+ */
+ if (active) {
+ id_set->alt = id_set->active;
+ return true;
+ }
+ /* Treat the prev conn_ids as used.
+ * Try selecting the next conn_id in the list, unless at the end.
+ */
+ if (id_set->active->number != quic_conn_id_last_number(id_set)) {
+ id_set->alt = list_next_entry(id_set->active, list);
+ return true;
+ }
+ /* If there's only one conn_id in the list, reuse the active one. */
+ if (id_set->active->number == quic_conn_id_first_number(id_set)) {
+ id_set->alt = id_set->active;
+ return true;
+ }
+ /* No alternate conn_id could be selected. Caller should send a
+ * QUIC_FRAME_RETIRE_CONNECTION_ID frame to request new connection IDs from the peer.
+ */
+ return false;
+}
+
+static inline void quic_conn_id_set_alt(struct quic_conn_id_set *id_set, struct quic_conn_id *alt)
+{
+ id_set->alt = (struct quic_common_conn_id *)alt;
+}
+
+/* Swap the active and alternate destination Connection IDs after path migration completes,
+ * since the path has already been switched accordingly.
+ */
+static inline void quic_conn_id_swap_active(struct quic_conn_id_set *id_set)
+{
+ void *active = id_set->active;
+
+ id_set->active = id_set->alt;
+ id_set->alt = active;
+}
+
+/* Choose which destination Connection ID to use for a new path migration if alt is true. */
+static inline struct quic_conn_id *quic_conn_id_choose(struct quic_conn_id_set *id_set, u8 alt)
+{
+ return (alt && id_set->alt) ? &id_set->alt->id : &id_set->active->id;
+}
+
+static inline struct quic_conn_id *quic_conn_id_active(struct quic_conn_id_set *id_set)
+{
+ return &id_set->active->id;
+}
+
+static inline void quic_conn_id_set_active(struct quic_conn_id_set *id_set,
+ struct quic_conn_id *active)
+{
+ id_set->active = (struct quic_common_conn_id *)active;
+}
+
+static inline u32 quic_conn_id_number(struct quic_conn_id *conn_id)
+{
+ return ((struct quic_common_conn_id *)conn_id)->number;
+}
+
+static inline struct sock *quic_conn_id_sk(struct quic_conn_id *conn_id)
+{
+ return ((struct quic_source_conn_id *)conn_id)->sk;
+}
+
+static inline void quic_conn_id_set_token(struct quic_conn_id *conn_id, u8 *token)
+{
+ memcpy(((struct quic_dest_conn_id *)conn_id)->token, token, QUIC_CONN_ID_TOKEN_LEN);
+}
+
+static inline int quic_conn_id_cmp(struct quic_conn_id *a, struct quic_conn_id *b)
+{
+ return a->len != b->len || memcmp(a->data, b->data, a->len);
+}
+
+int quic_conn_id_add(struct quic_conn_id_set *id_set, struct quic_conn_id *conn_id,
+ u32 number, void *data);
+bool quic_conn_id_token_exists(struct quic_conn_id_set *id_set, u8 *token);
+void quic_conn_id_remove(struct quic_conn_id_set *id_set, u32 number);
+
+struct quic_conn_id *quic_conn_id_find(struct quic_conn_id_set *id_set, u32 number);
+struct quic_conn_id *quic_conn_id_lookup(struct net *net, u8 *scid, u32 len);
+void quic_conn_id_update_active(struct quic_conn_id_set *id_set, u32 number);
+
+void quic_conn_id_get_param(struct quic_conn_id_set *id_set, struct quic_transport_param *p);
+void quic_conn_id_set_param(struct quic_conn_id_set *id_set, struct quic_transport_param *p);
+void quic_conn_id_set_init(struct quic_conn_id_set *id_set, bool source);
+void quic_conn_id_set_free(struct quic_conn_id_set *id_set);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 0ac51cc0c249..02b2056078dc 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -42,6 +42,9 @@ static int quic_init_sock(struct sock *sk)
sk->sk_write_space = quic_write_space;
sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
+ quic_conn_id_set_init(quic_source(sk), 1);
+ quic_conn_id_set_init(quic_dest(sk), 0);
+
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
@@ -58,6 +61,9 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ quic_conn_id_set_free(quic_source(sk));
+ quic_conn_id_set_free(quic_dest(sk));
+
quic_stream_free(quic_streams(sk));
quic_data_free(quic_ticket(sk));
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 3eba18514ae6..43f86cabb698 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -14,6 +14,7 @@
#include "common.h"
#include "family.h"
#include "stream.h"
+#include "connid.h"
#include "protocol.h"
@@ -37,6 +38,8 @@ struct quic_sock {
struct quic_data alpn;
struct quic_stream_table streams;
+ struct quic_conn_id_set source;
+ struct quic_conn_id_set dest;
};
struct quic6_sock {
@@ -79,6 +82,16 @@ static inline struct quic_stream_table *quic_streams(const struct sock *sk)
return &quic_sk(sk)->streams;
}
+static inline struct quic_conn_id_set *quic_source(const struct sock *sk)
+{
+ return &quic_sk(sk)->source;
+}
+
+static inline struct quic_conn_id_set *quic_dest(const struct sock *sk)
+{
+ return &quic_sk(sk)->dest;
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 07/15] quic: add connection id management
2025-08-18 14:04 ` [PATCH net-next v2 07/15] quic: add connection id management Xin Long
@ 2025-08-21 13:55 ` Paolo Abeni
2025-08-23 15:57 ` Xin Long
2025-08-22 17:10 ` Jason Baron
1 sibling, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2025-08-21 13:55 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Simon Horman, Stefan Metzmacher,
Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs, Steve French,
Namjae Jeon, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
On 8/18/25 4:04 PM, Xin Long wrote:
> This patch introduces 'struct quic_conn_id_set' for managing Connection
> IDs (CIDs), which are represented by 'struct quic_source_conn_id'
> and 'struct quic_dest_conn_id'.
>
> It provides helpers to add and remove CIDs from the set, and handles
> insertion of source CIDs into the global connection ID hash table
> when necessary.
>
> - quic_conn_id_add(): Add a new Connection ID to the set, and insert
> it into the conn_id hash table if it is a source conn_id.
>
> - quic_conn_id_remove(): Remove connection IDs from the set with sequence
> numbers less than or equal to a given number.
It's unclear how many connection IDs are expected to be contained in each
set. If more than a handful, you should consider using an RB-tree instead
of lists.
[...]
> +static void quic_source_conn_id_free(struct quic_source_conn_id *s_conn_id)
> +{
> + u8 *data = s_conn_id->common.id.data;
> + struct quic_hash_head *head;
> +
> + if (!hlist_unhashed(&s_conn_id->node)) {
> + head = quic_source_conn_id_head(sock_net(s_conn_id->sk), data);
> + spin_lock_bh(&head->s_lock);
> + hlist_del_init(&s_conn_id->node);
> + spin_unlock_bh(&head->s_lock);
> + }
> +
> + /* Freeing is deferred via RCU to avoid use-after-free during concurrent lookups. */
> + call_rcu(&s_conn_id->rcu, quic_source_conn_id_free_rcu);
> +}
> +
> +static void quic_conn_id_del(struct quic_common_conn_id *common)
> +{
> + list_del(&common->list);
> + if (!common->hashed) {
> + kfree(common);
> + return;
> + }
> + quic_source_conn_id_free((struct quic_source_conn_id *)common);
It looks like the above cast is not needed.
> +}
> +
> +/* Add a connection ID with sequence number and associated private data to the connection ID set. */
> +int quic_conn_id_add(struct quic_conn_id_set *id_set,
> + struct quic_conn_id *conn_id, u32 number, void *data)
> +{
> + struct quic_source_conn_id *s_conn_id;
> + struct quic_dest_conn_id *d_conn_id;
> + struct quic_common_conn_id *common;
> + struct quic_hash_head *head;
> + struct list_head *list;
> +
> + /* Locate insertion point to keep list ordered by number. */
> + list = &id_set->head;
> + list_for_each_entry(common, list, list) {
> + if (number == common->number)
> > + return 0; /* Ignore if it already exists on the list. */
> + if (number < common->number) {
> + list = &common->list;
> + break;
> + }
> + }
> +
> + if (conn_id->len > QUIC_CONN_ID_MAX_LEN)
> + return -EINVAL;
> + common = kzalloc(id_set->entry_size, GFP_ATOMIC);
> + if (!common)
> + return -ENOMEM;
> + common->id = *conn_id;
> + common->number = number;
> + if (id_set->entry_size == sizeof(struct quic_dest_conn_id)) {
> + /* For destination connection IDs, copy the stateless reset token if available. */
> + if (data) {
> + d_conn_id = (struct quic_dest_conn_id *)common;
> + memcpy(d_conn_id->token, data, QUIC_CONN_ID_TOKEN_LEN);
> + }
> + } else {
> + /* For source connection IDs, mark as hashed and insert into the global source
> + * connection ID hashtable.
> + */
> + common->hashed = 1;
> + s_conn_id = (struct quic_source_conn_id *)common;
> + s_conn_id->sk = data;
> +
> + head = quic_source_conn_id_head(sock_net(s_conn_id->sk), common->id.data);
> + spin_lock_bh(&head->s_lock);
> + hlist_add_head(&s_conn_id->node, &head->head);
> + spin_unlock_bh(&head->s_lock);
> + }
> + list_add_tail(&common->list, list);
It's unclear if/how id_set->list is protected vs concurrent accesses.
/P
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 07/15] quic: add connection id management
2025-08-21 13:55 ` Paolo Abeni
@ 2025-08-23 15:57 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-23 15:57 UTC (permalink / raw)
To: Paolo Abeni
Cc: network dev, davem, kuba, Eric Dumazet, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
On Thu, Aug 21, 2025 at 9:55 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 4:04 PM, Xin Long wrote:
> > This patch introduces 'struct quic_conn_id_set' for managing Connection
> > IDs (CIDs), which are represented by 'struct quic_source_conn_id'
> > and 'struct quic_dest_conn_id'.
> >
> > It provides helpers to add and remove CIDs from the set, and handles
> > insertion of source CIDs into the global connection ID hash table
> > when necessary.
> >
> > - quic_conn_id_add(): Add a new Connection ID to the set, and insert
> > it into the conn_id hash table if it is a source conn_id.
> >
> > - quic_conn_id_remove(): Remove connection IDs from the set with sequence
> > numbers less than or equal to a given number.
>
> It's unclear how many connection IDs are expected to be contained in each
> set. If more than a handful, you should consider using an RB-tree instead
> of lists.
>
We limit the max number of issued CIDs to 8 per connection, and the
per-connection CID traversal is not on the data path, so it's fine to use
lists here.
Note that one connection/sk has one source CID set, which contains a
couple of CIDs used for connection migration, and one dest CID set
for saving the peer's CIDs.
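One subtlety behind the per-connection cap discussed here: in quic_conn_id_add(), id_set->count advances only across a consecutive run of sequence numbers starting from the first entry, so an ID that arrives with a gap before it is stored but not counted until the gap is filled. A hedged userspace sketch of that accounting over a sorted array (the `demo_` name is illustrative, not from the patch):

```c
#include <assert.h>

/* Return the length of the consecutive run of sequence numbers starting
 * at the first entry of a sorted array, mirroring how quic_conn_id_add()
 * advances id_set->count only while numbers remain gap-free.
 */
static int demo_consecutive_count(const unsigned int *num, int n)
{
	int count = 0;

	while (count < n && num[count] == num[0] + (unsigned int)count)
		count++;
	return count;
}
```

So with sequence numbers {0, 1, 2, 4} only the first three are counted; adding 3 later closes the gap and the count catches up to five.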
> [...]
> > +static void quic_source_conn_id_free(struct quic_source_conn_id *s_conn_id)
> > +{
> > + u8 *data = s_conn_id->common.id.data;
> > + struct quic_hash_head *head;
> > +
> > + if (!hlist_unhashed(&s_conn_id->node)) {
> > + head = quic_source_conn_id_head(sock_net(s_conn_id->sk), data);
> > + spin_lock_bh(&head->s_lock);
> > + hlist_del_init(&s_conn_id->node);
> > + spin_unlock_bh(&head->s_lock);
> > + }
> > +
> > + /* Freeing is deferred via RCU to avoid use-after-free during concurrent lookups. */
> > + call_rcu(&s_conn_id->rcu, quic_source_conn_id_free_rcu);
> > +}
> > +
> > +static void quic_conn_id_del(struct quic_common_conn_id *common)
> > +{
> > + list_del(&common->list);
> > + if (!common->hashed) {
> > + kfree(common);
> > + return;
> > + }
> > + quic_source_conn_id_free((struct quic_source_conn_id *)common);
>
> It looks like the above cast is not needed.
There would be a compile error:
/root/quic/modules/net/quic/connid.c:68:66: note: expected ‘struct
quic_source_conn_id *’ but argument is of type ‘struct
quic_common_conn_id *’
68 | static void quic_source_conn_id_free(struct
quic_source_conn_id *s_conn_id)
Or do you mean changing the parameter type of quic_source_conn_id_free() to:
static void quic_source_conn_id_free(struct quic_common_conn_id *common)
>
> > +}
> > +
> > +/* Add a connection ID with sequence number and associated private data to the connection ID set. */
> > +int quic_conn_id_add(struct quic_conn_id_set *id_set,
> > + struct quic_conn_id *conn_id, u32 number, void *data)
> > +{
> > + struct quic_source_conn_id *s_conn_id;
> > + struct quic_dest_conn_id *d_conn_id;
> > + struct quic_common_conn_id *common;
> > + struct quic_hash_head *head;
> > + struct list_head *list;
> > +
> > + /* Locate insertion point to keep list ordered by number. */
> > + list = &id_set->head;
> > + list_for_each_entry(common, list, list) {
> > + if (number == common->number)
> > + return 0; /* Ignore if it already exists on the list. */
> > + if (number < common->number) {
> > + list = &common->list;
> > + break;
> > + }
> > + }
> > +
> > + if (conn_id->len > QUIC_CONN_ID_MAX_LEN)
> > + return -EINVAL;
> > + common = kzalloc(id_set->entry_size, GFP_ATOMIC);
> > + if (!common)
> > + return -ENOMEM;
> > + common->id = *conn_id;
> > + common->number = number;
> > + if (id_set->entry_size == sizeof(struct quic_dest_conn_id)) {
> > + /* For destination connection IDs, copy the stateless reset token if available. */
> > + if (data) {
> > + d_conn_id = (struct quic_dest_conn_id *)common;
> > + memcpy(d_conn_id->token, data, QUIC_CONN_ID_TOKEN_LEN);
> > + }
> > + } else {
> > + /* For source connection IDs, mark as hashed and insert into the global source
> > + * connection ID hashtable.
> > + */
> > + common->hashed = 1;
> > + s_conn_id = (struct quic_source_conn_id *)common;
> > + s_conn_id->sk = data;
> > +
> > + head = quic_source_conn_id_head(sock_net(s_conn_id->sk), common->id.data);
> > + spin_lock_bh(&head->s_lock);
> > + hlist_add_head(&s_conn_id->node, &head->head);
> > + spin_unlock_bh(&head->s_lock);
> > + }
> > + list_add_tail(&common->list, list);
>
> It's unclear if/how id_set->list is protected vs concurrent accesses.
>
id_set is per connection/socket, it's always protected by sock lock.
I will leave an annotation in the description of the function for that.
Thanks.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 07/15] quic: add connection id management
2025-08-18 14:04 ` [PATCH net-next v2 07/15] quic: add connection id management Xin Long
2025-08-21 13:55 ` Paolo Abeni
@ 2025-08-22 17:10 ` Jason Baron
2025-08-23 16:15 ` Xin Long
1 sibling, 1 reply; 38+ messages in thread
From: Jason Baron @ 2025-08-22 17:10 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, illiliti,
Sabrina Dubroca, Marcelo Ricardo Leitner, Daniel Stenberg,
Andy Gospodarek
Hi Xin,
On 8/18/25 10:04 AM, Xin Long wrote:
>
> This patch introduces 'struct quic_conn_id_set' for managing Connection
> IDs (CIDs), which are represented by 'struct quic_source_conn_id'
> and 'struct quic_dest_conn_id'.
>
> It provides helpers to add and remove CIDs from the set, and handles
> insertion of source CIDs into the global connection ID hash table
> when necessary.
>
> - quic_conn_id_add(): Add a new Connection ID to the set, and insert
> it into the conn_id hash table if it is a source conn_id.
>
> - quic_conn_id_remove(): Remove connection IDs from the set with sequence
> numbers less than or equal to a given number.
>
> It also adds utilities to look up CIDs by value or sequence number,
> search the global hash table for incoming packets, and check for
> stateless reset tokens among destination CIDs. These functions are
> essential for RX path connection lookup and stateless reset processing.
>
> - quic_conn_id_find(): Find a Connection ID in the set by seq number.
>
> - quic_conn_id_lookup(): Lookup a Connection ID from global hash table
> using the ID value, typically used for socket lookup on the RX path.
>
> - quic_conn_id_token_exists(): Check if a stateless reset token exists
> in any dest Connection ID (used during stateless reset processing).
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
Thanks Xin for all your work on this!
For QUIC-LB, where the server endpoint may want to choose a specific
source CID to enable 'stateless' routing, I don't currently see an API
to allow that? It appears source CIDs are created with random values, and
while userspace can get/set the indexes of the current ones in use, I
don't see a way to set specific CID values.
For reference here is a proposal around it -
https://datatracker.ietf.org/doc/draft-ietf-quic-load-balancers/
In the reference above, the source CID is encrypted to help protect
traceability if the connection migrates. Thus, if the kernel were to
support such a feature, I don't think it wants to enforce a specific
encoding scheme, but perhaps it might want to be a privileged operation,
perhaps requiring CAP_NET_ADMIN to set specific source CIDs.
Thanks,
-Jason
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 07/15] quic: add connection id management
2025-08-22 17:10 ` Jason Baron
@ 2025-08-23 16:15 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-23 16:15 UTC (permalink / raw)
To: Jason Baron
Cc: network dev, davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, illiliti,
Sabrina Dubroca, Marcelo Ricardo Leitner, Daniel Stenberg,
Andy Gospodarek
On Fri, Aug 22, 2025 at 1:11 PM Jason Baron <jbaron@akamai.com> wrote:
>
> Hi Xin,
>
> On 8/18/25 10:04 AM, Xin Long wrote:
> >
> > This patch introduces 'struct quic_conn_id_set' for managing Connection
> > IDs (CIDs), which are represented by 'struct quic_source_conn_id'
> > and 'struct quic_dest_conn_id'.
> >
> > It provides helpers to add and remove CIDs from the set, and handles
> > insertion of source CIDs into the global connection ID hash table
> > when necessary.
> >
> > - quic_conn_id_add(): Add a new Connection ID to the set, and insert
> > it into the conn_id hash table if it is a source conn_id.
> >
> > - quic_conn_id_remove(): Remove connection IDs from the set with sequence
> > numbers less than or equal to a given number.
> >
> > It also adds utilities to look up CIDs by value or sequence number,
> > search the global hash table for incoming packets, and check for
> > stateless reset tokens among destination CIDs. These functions are
> > essential for RX path connection lookup and stateless reset processing.
> >
> > - quic_conn_id_find(): Find a Connection ID in the set by seq number.
> >
> > - quic_conn_id_lookup(): Lookup a Connection ID from global hash table
> > using the ID value, typically used for socket lookup on the RX path.
> >
> > - quic_conn_id_token_exists(): Check if a stateless reset token exists
> > in any dest Connection ID (used during stateless reset processing).
> >
> > Signed-off-by: Xin Long <lucien.xin@gmail.com>
> > ---
>
> Thanks Xin for all your work on this!
>
> For QUIC-LB, where the server endpoint may want to choose a specific
> source CID to enable 'stateless' routing, I don't currently see an API
> to allow that? It appears source CIDs are created with random values and
> while userspace can get/set the indexes of the current ones in use, I
> don't see a way to set specific CID values?
>
> For reference here is a proposal around it -
> https://datatracker.ietf.org/doc/draft-ietf-quic-load-balancers/
>
> In the reference above, the source CID is encrypted to help protect
> traceability if the connection migrates. Thus, if the kernel were to
> support such a feature, I don't think it wants to enforce a specific
> encoding scheme, but perhaps it might want to be a privileged operation,
> perhaps requiring CAP_NET_ADMIN to set specific source CIDs.
>
Hi, Jason,
We currently support only the finalized QUIC RFCs. Drafts like
'Load Balancers' and 'Multipath' aren't covered yet; we plan to
design the APIs once they are standardized.
Thanks for pointing out this draft.
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH net-next v2 08/15] quic: add path management
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (6 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 07/15] quic: add connection id management Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-21 14:18 ` Paolo Abeni
2025-08-18 14:04 ` [PATCH net-next v2 09/15] quic: add congestion control Xin Long
` (7 subsequent siblings)
15 siblings, 1 reply; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'quic_path_group' for managing paths, each
represented by 'struct quic_path'. A connection may use two paths
simultaneously during connection migration.
Each path is associated with a UDP tunnel socket (sk), and a single UDP
tunnel socket can be shared by multiple paths belonging to different
QUIC sockets. These UDP tunnel sockets are wrapped in 'quic_udp_sock'
structures and stored in a hash table.
It includes mechanisms to bind and unbind paths, detect alternate paths
for migration, and swap paths to support a seamless transition between
networks.
- quic_path_bind(): Bind a path to a port and associate it with a UDP sk.
- quic_path_free(): Unbind a path from a port and disassociate it from a
UDP sk.
- quic_path_swap(): Swap two paths to facilitate connection migration.
- quic_path_detect_alt(): Determine if a packet is using an alternative
path, used for connection migration.
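For illustration, the promotion logic behind quic_path_swap() can be
sketched in userspace C as follows. The types here are simplified
stand-ins for the kernel structs (union quic_addr is reduced to an int),
so this is a sketch of the idea, not the kernel implementation:

```c
#include <assert.h>

/* Two-slot path model: path[0] is the active path, path[1] the
 * alternate. On migration the alternate is promoted; if it owns no
 * UDP socket of its own, only the destination addresses are
 * exchanged, since the source address (and local binding) stays
 * the same. */
struct path {
	int daddr;	/* stand-in for union quic_addr daddr */
	int saddr;	/* stand-in for union quic_addr saddr */
	void *udp_sk;	/* non-NULL if this path owns a UDP socket */
};

struct path_group {
	struct path path[2];	/* [0] active, [1] alternate */
};

static void path_swap(struct path_group *g)
{
	struct path tmp = g->path[0];

	if (g->path[1].udp_sk) {	/* full swap: new local binding */
		g->path[0] = g->path[1];
		g->path[1] = tmp;
		return;
	}
	/* same source address: exchange destinations only */
	g->path[0].daddr = g->path[1].daddr;
	g->path[1].daddr = tmp.daddr;
}
```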
It also integrates basic support for Packetization Layer Path MTU
Discovery (PLPMTUD), using PING frames and ICMP feedback to adjust path
MTU and handle probe confirmation or resets during routing changes.
- quic_path_pl_recv(): State transition and PMTU update after a probe
packet is acked.
- quic_path_pl_toobig(): State transition and PMTU update after
receiving an ICMP Packet Too Big or Fragmentation Needed message.
- quic_path_pl_send(): State transition and PMTU update after sending a
probe packet.
- quic_path_pl_reset(): Restart probing when path routing changes.
- quic_path_pl_confirm(): Check whether a probe packet was acked.
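Two of the PLPMTUD transitions above can be sketched in userspace C.
The constants mirror the patch; the logic is deliberately simplified
for illustration (no probe counting or timers) and is not the kernel
implementation:

```c
#include <assert.h>

/* Minimal sketch of two rfc8899#section-5.2 transitions. */
enum pl_state { PL_BASE, PL_SEARCH, PL_COMPLETE, PL_ERROR };

#define BASE_PLPMTU 1200u
#define PL_BIG_STEP 32u

struct pl {
	enum pl_state state;
	unsigned int pmtu;		/* confirmed path MTU */
	unsigned int probe_size;	/* size of the next probe packet */
};

/* A probe of probe_size was acked: confirm it as the new PMTU and
 * keep searching upward in big steps (Base/Error -> Search). */
static void pl_probe_acked(struct pl *pl)
{
	pl->pmtu = pl->probe_size;
	pl->state = PL_SEARCH;
	pl->probe_size += PL_BIG_STEP;
}

/* The maximum number of probes at the already-confirmed size were
 * lost: a black hole; fall back to the base PLPMTU and start over
 * (Search/Complete -> Base). */
static void pl_black_hole(struct pl *pl)
{
	pl->state = PL_BASE;
	pl->pmtu = BASE_PLPMTU;
	pl->probe_size = BASE_PLPMTU;
}
```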
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/path.c | 512 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/path.h | 168 +++++++++++++++
net/quic/socket.c | 3 +
net/quic/socket.h | 21 +-
5 files changed, 703 insertions(+), 3 deletions(-)
create mode 100644 net/quic/path.c
create mode 100644 net/quic/path.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index eee7501588d3..1565fb5cef9d 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o stream.o connid.o
+quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o
diff --git a/net/quic/path.c b/net/quic/path.c
new file mode 100644
index 000000000000..1b750ea4b351
--- /dev/null
+++ b/net/quic/path.c
@@ -0,0 +1,512 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Path management and UDP tunnel socket handling for QUIC.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/udp_tunnel.h>
+#include <linux/quic.h>
+
+#include "common.h"
+#include "family.h"
+#include "path.h"
+
+static int (*quic_path_rcv)(struct sk_buff *skb, u8 err);
+static struct workqueue_struct *quic_wq __read_mostly;
+
+static int quic_udp_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ /* Save the UDP socket to skb->sk for later QUIC socket lookup. */
+ if (skb_linearize(skb) || !skb_set_owner_sk_safe(skb, sk))
+ return 0;
+
+ memset(skb->cb, 0, sizeof(skb->cb));
+ QUIC_SKB_CB(skb)->seqno = -1;
+ QUIC_SKB_CB(skb)->udph_offset = skb->transport_header;
+ QUIC_SKB_CB(skb)->time = jiffies_to_usecs(jiffies);
+ skb_set_transport_header(skb, sizeof(struct udphdr));
+ quic_path_rcv(skb, 0);
+ return 0;
+}
+
+static int quic_udp_err(struct sock *sk, struct sk_buff *skb)
+{
+ /* Save the UDP socket to skb->sk for later QUIC socket lookup. */
+ if (skb_linearize(skb) || !skb_set_owner_sk_safe(skb, sk))
+ return 0;
+
+ QUIC_SKB_CB(skb)->udph_offset = skb->transport_header;
+ return quic_path_rcv(skb, 1);
+}
+
+static void quic_udp_sock_put_work(struct work_struct *work)
+{
+ struct quic_udp_sock *us = container_of(work, struct quic_udp_sock, work);
+ struct quic_hash_head *head;
+
+ head = quic_udp_sock_head(sock_net(us->sk), ntohs(us->addr.v4.sin_port));
+ mutex_lock(&head->m_lock);
+ __hlist_del(&us->node);
+ udp_tunnel_sock_release(us->sk->sk_socket);
+ mutex_unlock(&head->m_lock);
+ kfree(us);
+}
+
+static struct quic_udp_sock *quic_udp_sock_create(struct sock *sk, union quic_addr *a)
+{
+ struct udp_tunnel_sock_cfg tuncfg = {};
+ struct udp_port_cfg udp_conf = {};
+ struct net *net = sock_net(sk);
+ struct quic_hash_head *head;
+ struct quic_udp_sock *us;
+ struct socket *sock;
+
+ us = kzalloc(sizeof(*us), GFP_KERNEL);
+ if (!us)
+ return NULL;
+
+ quic_udp_conf_init(sk, &udp_conf, a);
+ if (udp_sock_create(net, &udp_conf, &sock)) {
+ pr_debug("%s: failed to create udp sock\n", __func__);
+ kfree(us);
+ return NULL;
+ }
+
+ tuncfg.encap_type = 1;
+ tuncfg.encap_rcv = quic_udp_rcv;
+ tuncfg.encap_err_lookup = quic_udp_err;
+ setup_udp_tunnel_sock(net, sock, &tuncfg);
+
+ refcount_set(&us->refcnt, 1);
+ us->sk = sock->sk;
+ memcpy(&us->addr, a, sizeof(*a));
+ us->bind_ifindex = sk->sk_bound_dev_if;
+
+ head = quic_udp_sock_head(net, ntohs(a->v4.sin_port));
+ hlist_add_head(&us->node, &head->head);
+ INIT_WORK(&us->work, quic_udp_sock_put_work);
+
+ return us;
+}
+
+static bool quic_udp_sock_get(struct quic_udp_sock *us)
+{
+ return (us && refcount_inc_not_zero(&us->refcnt));
+}
+
+static void quic_udp_sock_put(struct quic_udp_sock *us)
+{
+ if (us && refcount_dec_and_test(&us->refcnt))
+ queue_work(quic_wq, &us->work);
+}
+
+/* Lookup a quic_udp_sock in the global hash table. If not found, creates and returns a new one
+ * associated with the given kernel socket.
+ */
+static struct quic_udp_sock *quic_udp_sock_lookup(struct sock *sk, union quic_addr *a, u16 port)
+{
+ struct net *net = sock_net(sk);
+ struct quic_hash_head *head;
+ struct quic_udp_sock *us;
+
+ head = quic_udp_sock_head(net, port);
+ hlist_for_each_entry(us, &head->head, node) {
+ if (net != sock_net(us->sk))
+ continue;
+ if (a) {
+ if (quic_cmp_sk_addr(us->sk, &us->addr, a) &&
+ (!us->bind_ifindex || !sk->sk_bound_dev_if ||
+ us->bind_ifindex == sk->sk_bound_dev_if))
+ return us;
+ continue;
+ }
+ if (ntohs(us->addr.v4.sin_port) == port)
+ return us;
+ }
+ return NULL;
+}
+
+/* Binds a QUIC path to a local port and sets up a UDP socket. */
+int quic_path_bind(struct sock *sk, struct quic_path_group *paths, u8 path)
+{
+ union quic_addr *a = quic_path_saddr(paths, path);
+ int rover, low, high, remaining;
+ struct net *net = sock_net(sk);
+ struct quic_hash_head *head;
+ struct quic_udp_sock *us;
+ u16 port;
+
+ port = ntohs(a->v4.sin_port);
+ if (port) {
+ head = quic_udp_sock_head(net, port);
+ mutex_lock(&head->m_lock);
+ us = quic_udp_sock_lookup(sk, a, port);
+ if (!quic_udp_sock_get(us)) {
+ us = quic_udp_sock_create(sk, a);
+ if (!us) {
+ mutex_unlock(&head->m_lock);
+ return -EINVAL;
+ }
+ }
+ mutex_unlock(&head->m_lock);
+
+ quic_udp_sock_put(paths->path[path].udp_sk);
+ paths->path[path].udp_sk = us;
+ return 0;
+ }
+
+ inet_get_local_port_range(net, &low, &high);
+ remaining = (high - low) + 1;
+ rover = (int)(((u64)get_random_u32() * remaining) >> 32) + low;
+ do {
+ rover++;
+ if (rover < low || rover > high)
+ rover = low;
+ port = (u16)rover;
+ if (inet_is_local_reserved_port(net, port))
+ continue;
+
+ head = quic_udp_sock_head(net, port);
+ mutex_lock(&head->m_lock);
+ if (quic_udp_sock_lookup(sk, NULL, port)) {
+ mutex_unlock(&head->m_lock);
+ cond_resched();
+ continue;
+ }
+ a->v4.sin_port = htons(port);
+ us = quic_udp_sock_create(sk, a);
+ if (!us) {
+ a->v4.sin_port = 0;
+ mutex_unlock(&head->m_lock);
+ return -EINVAL;
+ }
+ mutex_unlock(&head->m_lock);
+
+ quic_udp_sock_put(paths->path[path].udp_sk);
+ paths->path[path].udp_sk = us;
+ __sk_dst_reset(sk);
+ return 0;
+ } while (--remaining > 0);
+
+ return -EADDRINUSE;
+}
+
+/* Swaps the active and alternate QUIC paths.
+ *
+ * Promotes the alternate path (path[1]) to become the new active path (path[0]). If the
+ * alternate path has a valid UDP socket, the entire path is swapped. Otherwise, only the
+ * destination address is exchanged, assuming the source address is the same and no rebind is
+ * needed.
+ *
+ * This is typically used during path migration or alternate path promotion.
+ */
+void quic_path_swap(struct quic_path_group *paths)
+{
+ struct quic_path path = paths->path[0];
+
+ paths->alt_probes = 0;
+ paths->alt_state = QUIC_PATH_ALT_SWAPPED;
+
+ if (paths->path[1].udp_sk) {
+ paths->path[0] = paths->path[1];
+ paths->path[1] = path;
+ return;
+ }
+
+ paths->path[0].daddr = paths->path[1].daddr;
+ paths->path[1].daddr = path.daddr;
+}
+
+/* Frees resources associated with a QUIC path.
+ *
+ * This is used for cleanup during error handling or when the path is no longer needed.
+ */
+void quic_path_free(struct sock *sk, struct quic_path_group *paths, u8 path)
+{
+ paths->alt_probes = 0;
+ paths->alt_state = QUIC_PATH_ALT_NONE;
+
+ quic_udp_sock_put(paths->path[path].udp_sk);
+ paths->path[path].udp_sk = NULL;
+ memset(quic_path_daddr(paths, path), 0, sizeof(union quic_addr));
+ memset(quic_path_saddr(paths, path), 0, sizeof(union quic_addr));
+}
+
+/* Detects and records a potential alternate path.
+ *
+ * If the new source or destination address differs from the active path, and alternate path
+ * detection is not disabled, the function updates the alternate path slot (path[1]) with the
+ * new addresses.
+ *
+ * This is typically called on packet receive to detect new possible network paths (e.g., NAT
+ * rebinding, mobility).
+ *
+ * Returns 1 if a new alternate path was detected and updated, 0 otherwise.
+ */
+int quic_path_detect_alt(struct quic_path_group *paths, union quic_addr *sa, union quic_addr *da,
+ struct sock *sk)
+{
+ if ((!quic_cmp_sk_addr(sk, quic_path_saddr(paths, 0), sa) && !paths->disable_saddr_alt) ||
+ (!quic_cmp_sk_addr(sk, quic_path_daddr(paths, 0), da) && !paths->disable_daddr_alt)) {
+ if (!quic_path_saddr(paths, 1)->v4.sin_port)
+ quic_path_set_saddr(paths, 1, sa);
+
+ if (!quic_cmp_sk_addr(sk, quic_path_saddr(paths, 1), sa))
+ return 0;
+
+ if (!quic_path_daddr(paths, 1)->v4.sin_port)
+ quic_path_set_daddr(paths, 1, da);
+
+ return quic_cmp_sk_addr(sk, quic_path_daddr(paths, 1), da);
+ }
+ return 0;
+}
+
+void quic_path_get_param(struct quic_path_group *paths, struct quic_transport_param *p)
+{
+ if (p->remote) {
+ p->disable_active_migration = paths->disable_saddr_alt;
+ return;
+ }
+ p->disable_active_migration = paths->disable_daddr_alt;
+}
+
+void quic_path_set_param(struct quic_path_group *paths, struct quic_transport_param *p)
+{
+ if (p->remote) {
+ paths->disable_saddr_alt = p->disable_active_migration;
+ return;
+ }
+ paths->disable_daddr_alt = p->disable_active_migration;
+}
+
+/* State Machine defined in rfc8899#section-5.2 */
+enum quic_plpmtud_state {
+ QUIC_PL_DISABLED,
+ QUIC_PL_BASE,
+ QUIC_PL_SEARCH,
+ QUIC_PL_COMPLETE,
+ QUIC_PL_ERROR,
+};
+
+#define QUIC_BASE_PLPMTU 1200
+#define QUIC_MAX_PLPMTU 9000
+#define QUIC_MIN_PLPMTU 512
+
+#define QUIC_MAX_PROBES 3
+
+#define QUIC_PL_BIG_STEP 32
+#define QUIC_PL_MIN_STEP 4
+
+/* Handle PLPMTUD probe failure on a QUIC path.
+ *
+ * Called immediately after sending a probe packet in QUIC Path MTU Discovery. Tracks probe
+ * count and manages state transitions based on the number of probes sent and current PLPMTUD
+ * state (BASE, SEARCH, COMPLETE, ERROR). Detects probe failures and black holes, adjusting
+ * PMTU and probe sizes accordingly.
+ *
+ * Return: New PMTU value if updated, else 0.
+ */
+u32 quic_path_pl_send(struct quic_path_group *paths, s64 number)
+{
+ u32 pathmtu = 0;
+
+ paths->pl.number = number;
+ if (paths->pl.probe_count < QUIC_MAX_PROBES)
+ goto out;
+
+ paths->pl.probe_count = 0;
+ if (paths->pl.state == QUIC_PL_BASE) {
+ if (paths->pl.probe_size == QUIC_BASE_PLPMTU) { /* BASE_PLPMTU Confirming Failed */
+ paths->pl.state = QUIC_PL_ERROR; /* Base -> Error */
+
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ }
+ } else if (paths->pl.state == QUIC_PL_SEARCH) {
+ if (paths->pl.pmtu == paths->pl.probe_size) { /* Black Hole Detected */
+ paths->pl.state = QUIC_PL_BASE; /* Search -> Base */
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+ paths->pl.probe_high = 0;
+
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ } else { /* Normal probe failure. */
+ paths->pl.probe_high = paths->pl.probe_size;
+ paths->pl.probe_size = paths->pl.pmtu;
+ }
+ } else if (paths->pl.state == QUIC_PL_COMPLETE) {
+ if (paths->pl.pmtu == paths->pl.probe_size) { /* Black Hole Detected */
+ paths->pl.state = QUIC_PL_BASE; /* Search Complete -> Base */
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ }
+ }
+
+out:
+ pr_debug("%s: dst: %p, state: %d, pmtu: %d, size: %d, high: %d\n", __func__, paths,
+ paths->pl.state, paths->pl.pmtu, paths->pl.probe_size, paths->pl.probe_high);
+ paths->pl.probe_count++;
+ return pathmtu;
+}
+
+/* Handle successful reception of a PMTU probe.
+ *
+ * Called when a probe packet is acknowledged. Updates probe size and transitions state if
+ * needed (e.g., from SEARCH to COMPLETE). Expands PMTU using binary or linear search
+ * depending on state.
+ *
+ * Return: New PMTU to apply if search completes, or 0 if no change.
+ */
+u32 quic_path_pl_recv(struct quic_path_group *paths, bool *raise_timer, bool *complete)
+{
+ u32 pathmtu = 0;
+
+ pr_debug("%s: dst: %p, state: %d, pmtu: %d, size: %d, high: %d\n", __func__, paths,
+ paths->pl.state, paths->pl.pmtu, paths->pl.probe_size, paths->pl.probe_high);
+
+ *raise_timer = false;
+ paths->pl.number = 0;
+ paths->pl.pmtu = paths->pl.probe_size;
+ paths->pl.probe_count = 0;
+ if (paths->pl.state == QUIC_PL_BASE) {
+ paths->pl.state = QUIC_PL_SEARCH; /* Base -> Search */
+ paths->pl.probe_size += QUIC_PL_BIG_STEP;
+ } else if (paths->pl.state == QUIC_PL_ERROR) {
+ paths->pl.state = QUIC_PL_SEARCH; /* Error -> Search */
+
+ paths->pl.pmtu = paths->pl.probe_size;
+ pathmtu = (u32)paths->pl.pmtu;
+ paths->pl.probe_size += QUIC_PL_BIG_STEP;
+ } else if (paths->pl.state == QUIC_PL_SEARCH) {
+ if (!paths->pl.probe_high) {
+ if (paths->pl.probe_size < QUIC_MAX_PLPMTU) {
+ paths->pl.probe_size =
+ (u16)min(paths->pl.probe_size + QUIC_PL_BIG_STEP,
+ QUIC_MAX_PLPMTU);
+ *complete = false;
+ return pathmtu;
+ }
+ paths->pl.probe_high = QUIC_MAX_PLPMTU;
+ }
+ paths->pl.probe_size += QUIC_PL_MIN_STEP;
+ if (paths->pl.probe_size >= paths->pl.probe_high) {
+ paths->pl.probe_high = 0;
+ paths->pl.state = QUIC_PL_COMPLETE; /* Search -> Search Complete */
+
+ paths->pl.probe_size = paths->pl.pmtu;
+ pathmtu = (u32)paths->pl.pmtu;
+ *raise_timer = true;
+ }
+ } else if (paths->pl.state == QUIC_PL_COMPLETE) {
+ /* Raise probe_size again after 30 * interval in Search Complete */
+ paths->pl.state = QUIC_PL_SEARCH; /* Search Complete -> Search */
+ paths->pl.probe_size = (u16)min(paths->pl.probe_size + QUIC_PL_MIN_STEP,
+ QUIC_MAX_PLPMTU);
+ }
+
+ *complete = (paths->pl.state == QUIC_PL_COMPLETE);
+ return pathmtu;
+}
+
+/* Handle ICMP "Packet Too Big" messages.
+ *
+ * Responds to an incoming ICMP error by reducing the probe size or falling back to a safe
+ * baseline PMTU depending on current state. Also handles cases where the PMTU hint lies
+ * between probe and current PMTU.
+ *
+ * Return: New PMTU to apply if state changes, or 0 if no change.
+ */
+u32 quic_path_pl_toobig(struct quic_path_group *paths, u32 pmtu, bool *reset_timer)
+{
+ u32 pathmtu = 0;
+
+ pr_debug("%s: dst: %p, state: %d, pmtu: %d, size: %d, ptb: %d\n", __func__, paths,
+ paths->pl.state, paths->pl.pmtu, paths->pl.probe_size, pmtu);
+
+ *reset_timer = false;
+ if (pmtu < QUIC_MIN_PLPMTU || pmtu >= (u32)paths->pl.probe_size)
+ return pathmtu;
+
+ if (paths->pl.state == QUIC_PL_BASE) {
+ if (pmtu >= QUIC_MIN_PLPMTU && pmtu < QUIC_BASE_PLPMTU) {
+ paths->pl.state = QUIC_PL_ERROR; /* Base -> Error */
+
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ }
+ } else if (paths->pl.state == QUIC_PL_SEARCH) {
+ if (pmtu >= QUIC_BASE_PLPMTU && pmtu < (u32)paths->pl.pmtu) {
+ paths->pl.state = QUIC_PL_BASE; /* Search -> Base */
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+ paths->pl.probe_count = 0;
+
+ paths->pl.probe_high = 0;
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ } else if (pmtu > (u32)paths->pl.pmtu && pmtu < (u32)paths->pl.probe_size) {
+ paths->pl.probe_size = (u16)pmtu;
+ paths->pl.probe_count = 0;
+ }
+ } else if (paths->pl.state == QUIC_PL_COMPLETE) {
+ if (pmtu >= QUIC_BASE_PLPMTU && pmtu < (u32)paths->pl.pmtu) {
+ paths->pl.state = QUIC_PL_BASE; /* Complete -> Base */
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+ paths->pl.probe_count = 0;
+
+ paths->pl.probe_high = 0;
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ *reset_timer = true;
+ }
+ }
+ return pathmtu;
+}
+
+/* Reset PLPMTUD state for a path.
+ *
+ * Resets all PLPMTUD-related state to its initial configuration. Called when a new path is
+ * initialized or when recovering from errors.
+ */
+void quic_path_pl_reset(struct quic_path_group *paths)
+{
+ paths->pl.number = 0;
+ paths->pl.state = QUIC_PL_BASE;
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+}
+
+/* Check if a packet number confirms PLPMTUD probe.
+ *
+ * Checks whether the last probe (tracked by .number) has been acknowledged. If the probe
+ * number lies within the ACK range, confirmation is successful.
+ *
+ * Return: true if probe is confirmed, false otherwise.
+ */
+bool quic_path_pl_confirm(struct quic_path_group *paths, s64 largest, s64 smallest)
+{
+ return paths->pl.number && paths->pl.number >= smallest && paths->pl.number <= largest;
+}
+
+int quic_path_init(int (*rcv)(struct sk_buff *skb, u8 err))
+{
+ quic_wq = create_workqueue("quic_workqueue");
+ if (!quic_wq)
+ return -ENOMEM;
+
+ quic_path_rcv = rcv;
+ return 0;
+}
+
+void quic_path_destroy(void)
+{
+ destroy_workqueue(quic_wq);
+}
diff --git a/net/quic/path.h b/net/quic/path.h
new file mode 100644
index 000000000000..7886ba17be30
--- /dev/null
+++ b/net/quic/path.h
@@ -0,0 +1,168 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_PATH_MIN_PMTU 1200U
+#define QUIC_PATH_MAX_PMTU 65536U
+
+#define QUIC_MIN_UDP_PAYLOAD 1200
+#define QUIC_MAX_UDP_PAYLOAD 65527
+
+#define QUIC_PATH_ENTROPY_LEN 8
+
+/* Connection Migration State Machine:
+ *
+ * +--------+ recv non-probing, free old path +----------+
+ * | NONE | <-------------------------------------- | SWAPPED |
+ * +--------+ +----------+
+ * | ^ \ ^
+ * | \ \ |
+ * | \ \ new path detected, | recv
+ * | \ \ has another DCID, | Path
+ * | \ \ snd Path Challenge | Response
+ * | \ ------------------------------- |
+ * | ------------------------------- \ |
+ * | new path detected, Path \ \ |
+ * | has no other DCID, Challenge \ \ |
+ * | request a new DCID failed \ \ |
+ * v \ v |
+ * +----------+ +----------+
+ * | PENDING | ------------------------------------> | PROBING |
+ * +----------+ recv a new DCID, snd Path Challenge +----------+
+ */
+enum {
+ QUIC_PATH_ALT_NONE, /* No alternate path (migration complete or aborted) */
+ QUIC_PATH_ALT_PENDING, /* Waiting for a new destination CID for migration */
+ QUIC_PATH_ALT_PROBING, /* Validating the alternate path (PATH_CHALLENGE) */
+ QUIC_PATH_ALT_SWAPPED, /* Alternate path is now active; roles swapped */
+};
+
+struct quic_udp_sock {
+ struct work_struct work; /* Workqueue to destroy UDP tunnel socket */
+ struct hlist_node node; /* Entry in address-based UDP socket hash table */
+ union quic_addr addr;
+ int bind_ifindex;
+ refcount_t refcnt;
+ struct sock *sk; /* Underlying UDP tunnel socket */
+};
+
+struct quic_path {
+ union quic_addr daddr; /* Destination address */
+ union quic_addr saddr; /* Source address */
+ struct quic_udp_sock *udp_sk; /* Wrapped UDP socket used to receive QUIC packets */
+};
+
+struct quic_path_group {
+ /* Connection ID validation during handshake (rfc9000#section-7.3) */
+ struct quic_conn_id retry_dcid; /* Source CID from Retry packet */
+ struct quic_conn_id orig_dcid; /* Destination CID from first Initial */
+
+ /* Path validation (rfc9000#section-8.2) */
+ u8 entropy[QUIC_PATH_ENTROPY_LEN]; /* Entropy for PATH_CHALLENGE */
+ struct quic_path path[2]; /* Active path (0) and alternate path (1) */
+ struct flowi fl; /* Flow info from routing decisions */
+
+ /* Anti-amplification limit (rfc9000#section-8) */
+ u16 ampl_sndlen; /* Bytes sent before address is validated */
+ u16 ampl_rcvlen; /* Bytes received to lift amplification limit */
+
+ /* MTU discovery handling */
+ u32 mtu_info; /* PMTU value from received ICMP, pending apply */
+ struct { /* PLPMTUD probing (rfc8899) */
+ s64 number; /* Packet number used for current probe */
+ u16 pmtu; /* Confirmed path MTU */
+
+ u16 probe_size; /* Current probe packet size */
+ u16 probe_high; /* Highest failed probe size */
+ u8 probe_count; /* Retry count for current probe_size */
+ u8 state; /* Probe state machine (rfc8899#section-5.2) */
+ } pl;
+
+ /* Connection Migration (rfc9000#section-9) */
+ u8 disable_saddr_alt:1; /* Remote disable_active_migration (rfc9000#section-18.2) */
+ u8 disable_daddr_alt:1; /* Local disable_active_migration (rfc9000#section-18.2) */
+ u8 pref_addr:1; /* Preferred address offered (rfc9000#section-18.2) */
+ u8 alt_probes; /* Number of PATH_CHALLENGE probes sent */
+ u8 alt_state; /* State for alternate path migration logic (see above) */
+
+ u8 ecn_probes; /* ECN probe counter */
+ u8 validated:1; /* Path validated with PATH_RESPONSE */
+ u8 blocked:1; /* Blocked by anti-amplification limit */
+ u8 retry:1; /* Retry used in initial packet */
+ u8 serv:1; /* Indicates server side */
+};
+
+static inline union quic_addr *quic_path_saddr(struct quic_path_group *paths, u8 path)
+{
+ return &paths->path[path].saddr;
+}
+
+static inline void quic_path_set_saddr(struct quic_path_group *paths, u8 path,
+ union quic_addr *addr)
+{
+ memcpy(quic_path_saddr(paths, path), addr, sizeof(*addr));
+}
+
+static inline union quic_addr *quic_path_daddr(struct quic_path_group *paths, u8 path)
+{
+ return &paths->path[path].daddr;
+}
+
+static inline void quic_path_set_daddr(struct quic_path_group *paths, u8 path,
+ union quic_addr *addr)
+{
+ memcpy(quic_path_daddr(paths, path), addr, sizeof(*addr));
+}
+
+static inline union quic_addr *quic_path_uaddr(struct quic_path_group *paths, u8 path)
+{
+ return &paths->path[path].udp_sk->addr;
+}
+
+static inline struct sock *quic_path_usock(struct quic_path_group *paths, u8 path)
+{
+ return paths->path[path].udp_sk->sk;
+}
+
+static inline bool quic_path_alt_state(struct quic_path_group *paths, u8 state)
+{
+ return paths->alt_state == state;
+}
+
+static inline void quic_path_set_alt_state(struct quic_path_group *paths, u8 state)
+{
+ paths->alt_state = state;
+}
+
+/* Returns the destination Connection ID (DCID) used for identifying the connection.
+ * Per rfc9000#section-7.3, handshake packets are considered part of the same connection
+ * if their DCID matches the one returned here.
+ */
+static inline struct quic_conn_id *quic_path_orig_dcid(struct quic_path_group *paths)
+{
+ return paths->retry ? &paths->retry_dcid : &paths->orig_dcid;
+}
+
+int quic_path_detect_alt(struct quic_path_group *paths, union quic_addr *sa, union quic_addr *da,
+ struct sock *sk);
+int quic_path_bind(struct sock *sk, struct quic_path_group *paths, u8 path);
+void quic_path_free(struct sock *sk, struct quic_path_group *paths, u8 path);
+void quic_path_swap(struct quic_path_group *paths);
+
+u32 quic_path_pl_recv(struct quic_path_group *paths, bool *raise_timer, bool *complete);
+u32 quic_path_pl_toobig(struct quic_path_group *paths, u32 pmtu, bool *reset_timer);
+u32 quic_path_pl_send(struct quic_path_group *paths, s64 number);
+
+void quic_path_get_param(struct quic_path_group *paths, struct quic_transport_param *p);
+void quic_path_set_param(struct quic_path_group *paths, struct quic_transport_param *p);
+bool quic_path_pl_confirm(struct quic_path_group *paths, s64 largest, s64 smallest);
+void quic_path_pl_reset(struct quic_path_group *paths);
+
+int quic_path_init(int (*rcv)(struct sk_buff *skb, u8 err));
+void quic_path_destroy(void);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 02b2056078dc..c549e76623e3 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -61,6 +61,9 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ quic_path_free(sk, quic_paths(sk), 0);
+ quic_path_free(sk, quic_paths(sk), 1);
+
quic_conn_id_set_free(quic_source(sk));
quic_conn_id_set_free(quic_dest(sk));
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 43f86cabb698..3cff2a1d478a 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -15,6 +15,7 @@
#include "family.h"
#include "stream.h"
#include "connid.h"
+#include "path.h"
#include "protocol.h"
@@ -40,6 +41,7 @@ struct quic_sock {
struct quic_stream_table streams;
struct quic_conn_id_set source;
struct quic_conn_id_set dest;
+ struct quic_path_group paths;
};
struct quic6_sock {
@@ -92,6 +94,16 @@ static inline struct quic_conn_id_set *quic_dest(const struct sock *sk)
return &quic_sk(sk)->dest;
}
+static inline struct quic_path_group *quic_paths(const struct sock *sk)
+{
+ return &quic_sk(sk)->paths;
+}
+
+static inline bool quic_is_serv(const struct sock *sk)
+{
+ return quic_paths(sk)->serv;
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
@@ -115,14 +127,19 @@ static inline bool quic_is_closed(struct sock *sk)
static inline void quic_set_state(struct sock *sk, int state)
{
struct net *net = sock_net(sk);
+ int mib;
if (sk->sk_state == state)
return;
- if (state == QUIC_SS_ESTABLISHED)
+ if (state == QUIC_SS_ESTABLISHED) {
+ mib = quic_is_serv(sk) ? QUIC_MIB_CONN_PASSIVEESTABS
+ : QUIC_MIB_CONN_ACTIVEESTABS;
+ QUIC_INC_STATS(net, mib);
QUIC_INC_STATS(net, QUIC_MIB_CONN_CURRENTESTABS);
- else if (quic_is_established(sk))
+ } else if (quic_is_established(sk)) {
QUIC_DEC_STATS(net, QUIC_MIB_CONN_CURRENTESTABS);
+ }
inet_sk_set_state(sk, state);
sk->sk_state_change(sk);
--
2.47.1
* Re: [PATCH net-next v2 08/15] quic: add path management
2025-08-18 14:04 ` [PATCH net-next v2 08/15] quic: add path management Xin Long
@ 2025-08-21 14:18 ` Paolo Abeni
2025-08-23 15:40 ` Xin Long
0 siblings, 1 reply; 38+ messages in thread
From: Paolo Abeni @ 2025-08-21 14:18 UTC (permalink / raw)
To: Xin Long, network dev
Cc: davem, kuba, Eric Dumazet, Simon Horman, Stefan Metzmacher,
Moritz Buhl, Tyler Fanelli, Pengtao He, linux-cifs, Steve French,
Namjae Jeon, Paulo Alcantara, Tom Talpey, kernel-tls-handshake,
Chuck Lever, Jeff Layton, Benjamin Coddington, Steve Dickson,
Hannes Reinecke, Alexander Aring, David Howells, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek
On 8/18/25 4:04 PM, Xin Long wrote:
> +/* Lookup a quic_udp_sock in the global hash table. If not found, creates and returns a new one
> + * associated with the given kernel socket.
> + */
> +static struct quic_udp_sock *quic_udp_sock_lookup(struct sock *sk, union quic_addr *a, u16 port)
> +{
> + struct net *net = sock_net(sk);
> + struct quic_hash_head *head;
> + struct quic_udp_sock *us;
> +
> + head = quic_udp_sock_head(net, port);
> + hlist_for_each_entry(us, &head->head, node) {
> + if (net != sock_net(us->sk))
> + continue;
> + if (a) {
> + if (quic_cmp_sk_addr(us->sk, &us->addr, a) &&
> + (!us->bind_ifindex || !sk->sk_bound_dev_if ||
> + us->bind_ifindex == sk->sk_bound_dev_if))
> + return us;
> + continue;
> + }
> + if (ntohs(us->addr.v4.sin_port) == port)
> + return us;
> + }
> + return NULL;
> +}
The function description does not match the actual function implementation.
> +
> +/* Binds a QUIC path to a local port and sets up a UDP socket. */
> +int quic_path_bind(struct sock *sk, struct quic_path_group *paths, u8 path)
> +{
> + union quic_addr *a = quic_path_saddr(paths, path);
> + int rover, low, high, remaining;
> + struct net *net = sock_net(sk);
> + struct quic_hash_head *head;
> + struct quic_udp_sock *us;
> + u16 port;
> +
> + port = ntohs(a->v4.sin_port);
> + if (port) {
> + head = quic_udp_sock_head(net, port);
> + mutex_lock(&head->m_lock);
> + us = quic_udp_sock_lookup(sk, a, port);
> + if (!quic_udp_sock_get(us)) {
> + us = quic_udp_sock_create(sk, a);
> + if (!us) {
> + mutex_unlock(&head->m_lock);
> + return -EINVAL;
> + }
> + }
> + mutex_unlock(&head->m_lock);
> +
> + quic_udp_sock_put(paths->path[path].udp_sk);
> + paths->path[path].udp_sk = us;
> + return 0;
> + }
> +
> + inet_get_local_port_range(net, &low, &high);
you could/should use inet_sk_get_local_port_range().
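(For readers following the thread: independent of which helper supplies
the range, the quoted loop implements a randomized ephemeral-port
search. A minimal userspace sketch, where port_in_use() is a
hypothetical stand-in for quic_udp_sock_lookup() plus the
reserved-port check:)

```c
#include <assert.h>

/* Pretend everything below 40000 is already taken. */
static int port_in_use(unsigned short port)
{
	return port < 40000;
}

/* Start at a random offset in [low, high], walk forward with
 * wrap-around, and give up once the whole range has been scanned.
 * Note the first candidate is rover + 1, skipping the random value
 * itself, as in the kernel loop. */
static int pick_port(int low, int high, unsigned int seed)
{
	int remaining = high - low + 1;
	int rover = (int)(((unsigned long long)seed * remaining) >> 32) + low;

	do {
		rover++;
		if (rover < low || rover > high)
			rover = low;	/* wrap around */
		if (!port_in_use((unsigned short)rover))
			return rover;	/* found a free port */
	} while (--remaining > 0);

	return -1;	/* range exhausted, i.e. -EADDRINUSE */
}
```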
/P
* Re: [PATCH net-next v2 08/15] quic: add path management
2025-08-21 14:18 ` Paolo Abeni
@ 2025-08-23 15:40 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-23 15:40 UTC (permalink / raw)
To: Paolo Abeni
Cc: network dev, davem, kuba, Eric Dumazet, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
On Thu, Aug 21, 2025 at 10:18 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/18/25 4:04 PM, Xin Long wrote:
> > +/* Lookup a quic_udp_sock in the global hash table. If not found, creates and returns a new one
> > + * associated with the given kernel socket.
> > + */
> > +static struct quic_udp_sock *quic_udp_sock_lookup(struct sock *sk, union quic_addr *a, u16 port)
> > +{
> > + struct net *net = sock_net(sk);
> > + struct quic_hash_head *head;
> > + struct quic_udp_sock *us;
> > +
> > + head = quic_udp_sock_head(net, port);
> > + hlist_for_each_entry(us, &head->head, node) {
> > + if (net != sock_net(us->sk))
> > + continue;
> > + if (a) {
> > + if (quic_cmp_sk_addr(us->sk, &us->addr, a) &&
> > + (!us->bind_ifindex || !sk->sk_bound_dev_if ||
> > + us->bind_ifindex == sk->sk_bound_dev_if))
> > + return us;
> > + continue;
> > + }
> > + if (ntohs(us->addr.v4.sin_port) == port)
> > + return us;
> > + }
> > + return NULL;
> > +}
>
> The function description does not match the actual function implementation.
Right, I forgot to update the description after moving the creation out.
>
> > +
> > +/* Binds a QUIC path to a local port and sets up a UDP socket. */
> > +int quic_path_bind(struct sock *sk, struct quic_path_group *paths, u8 path)
> > +{
> > + union quic_addr *a = quic_path_saddr(paths, path);
> > + int rover, low, high, remaining;
> > + struct net *net = sock_net(sk);
> > + struct quic_hash_head *head;
> > + struct quic_udp_sock *us;
> > + u16 port;
> > +
> > + port = ntohs(a->v4.sin_port);
> > + if (port) {
> > + head = quic_udp_sock_head(net, port);
> > + mutex_lock(&head->m_lock);
> > + us = quic_udp_sock_lookup(sk, a, port);
> > + if (!quic_udp_sock_get(us)) {
> > + us = quic_udp_sock_create(sk, a);
> > + if (!us) {
> > + mutex_unlock(&head->m_lock);
> > + return -EINVAL;
> > + }
> > + }
> > + mutex_unlock(&head->m_lock);
> > +
> > + quic_udp_sock_put(paths->path[path].udp_sk);
> > + paths->path[path].udp_sk = us;
> > + return 0;
> > + }
> > +
> > + inet_get_local_port_range(net, &low, &high);
>
> you could/should use inet_sk_get_local_port_range().
>
True, will update.
Thanks.
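[For readers following along: the ephemeral-port fallback under discussion amounts to a bounded rover walk over the local port range (which inet_sk_get_local_port_range() would supply per socket). A hypothetical userspace model — names and the used[] array stand in for the kernel hash-table lookup, this is not kernel code:]

```c
/* Hypothetical userspace model of the ephemeral-port scan in
 * quic_path_bind(): starting from a (normally randomized) rover,
 * walk the local port range once, wrapping at the top, and take
 * the first port with no quic_udp_sock already bound to it.
 * used[] stands in for the hash-table lookup by port. */
static int pick_port(int low, int high, int rover, const unsigned char *used)
{
	int remaining = high - low + 1;

	while (remaining--) {
		if (rover < low || rover > high)
			rover = low;	/* wrap around the range */
		if (!used[rover - low])
			return rover;	/* free port found */
		rover++;
	}
	return -1;			/* every port in the range is taken */
}
```

With inet_sk_get_local_port_range(), the [low, high] pair reflects a per-socket IP_LOCAL_PORT_RANGE override when set, rather than only the per-netns sysctl range.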
* [PATCH net-next v2 09/15] quic: add congestion control
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (7 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 08/15] quic: add path management Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 10/15] quic: add packet number space Xin Long
` (6 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'quic_cong' for RTT measurement and congestion
control. The 'quic_cong_ops' struct is added to define the interface a
congestion control algorithm implements.
It implements a congestion control state machine with slow start,
congestion avoidance, and recovery phases, and introduces the New
Reno and CUBIC algorithms.
The implementation updates RTT estimates when packets are acknowledged,
reacts to loss and ECN signals, and adjusts the congestion window
accordingly during packet transmission and acknowledgment processing.
- quic_cong_rtt_update(): Performs RTT measurement; invoked when the
packet with the largest number in the ACK frame is newly acknowledged.
- quic_cong_on_packet_acked(): Invoked when a packet is acknowledged.
- quic_cong_on_packet_lost(): Invoked when a packet is marked as lost.
- quic_cong_on_process_ecn(): Invoked when an ACK_ECN frame is received.
- quic_cong_on_packet_sent(): Invoked when a packet is transmitted.
- quic_cong_on_ack_recv(): Invoked when an ACK frame is received.
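The RTT bookkeeping behind quic_cong_rtt_update() follows
rfc9002#section-5.3. As a rough userspace sketch of the estimator
(illustrative names and a fixed microsecond unit, not the kernel structs):

```c
#include <stdint.h>

/* Minimal sketch of the rfc9002#section-5.3 RTT estimator that
 * quic_cong_rtt_update() implements.  All times in microseconds. */
struct rtt_est {
	uint32_t smoothed_rtt;
	uint32_t rttvar;
	uint32_t min_rtt;
	int has_sample;
};

static void rtt_update(struct rtt_est *r, uint32_t latest, uint32_t ack_delay)
{
	uint32_t adjusted, sample;

	if (!r->has_sample) {
		/* First sample: smoothed_rtt = latest_rtt, rttvar = latest_rtt / 2 */
		r->smoothed_rtt = latest;
		r->rttvar = latest / 2;
		r->min_rtt = latest;
		r->has_sample = 1;
		return;
	}
	if (latest < r->min_rtt)
		r->min_rtt = latest;

	/* Subtract ack_delay only if that cannot push us below min_rtt. */
	adjusted = latest;
	if (latest >= r->min_rtt + ack_delay)
		adjusted = latest - ack_delay;

	/* EWMAs: 7/8 old + 1/8 new; 3/4 old + 1/4 |srtt - adjusted| */
	r->smoothed_rtt = (r->smoothed_rtt * 7 + adjusted) / 8;
	sample = r->smoothed_rtt > adjusted ? r->smoothed_rtt - adjusted
					    : adjusted - r->smoothed_rtt;
	r->rttvar = (r->rttvar * 3 + sample) / 4;
}
```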
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 3 +-
net/quic/cong.c | 700 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/cong.h | 120 ++++++++
net/quic/socket.c | 1 +
net/quic/socket.h | 7 +
5 files changed, 830 insertions(+), 1 deletion(-)
create mode 100644 net/quic/cong.c
create mode 100644 net/quic/cong.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 1565fb5cef9d..4d4a42c6d565 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,5 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o
+quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
+ cong.o
diff --git a/net/quic/cong.c b/net/quic/cong.c
new file mode 100644
index 000000000000..d598cc14b15e
--- /dev/null
+++ b/net/quic/cong.c
@@ -0,0 +1,700 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Congestion control and round-trip time estimation for QUIC.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/jiffies.h>
+#include <linux/quic.h>
+#include <net/sock.h>
+
+#include "common.h"
+#include "cong.h"
+
+/* CUBIC APIs */
+struct quic_cubic {
+ /* Variables of Interest in rfc9438#section-4.1.2 */
+ u32 pending_w_add; /* Accumulate fractional increments to W_est */
+ u32 origin_point; /* W_max */
+ u32 epoch_start; /* t_epoch */
+ u32 pending_add; /* Accumulates fractional additions to W_cubic */
+ u32 w_last_max; /* last W_max */
+ u32 w_tcp; /* W_est */
+ u64 k; /* K */
+
+ /* HyStart++ variables in rfc9406#section-4.2 */
+ u32 current_round_min_rtt; /* currentRoundMinRTT */
+ u32 css_baseline_min_rtt; /* cssBaselineMinRtt */
+ u32 last_round_min_rtt; /* lastRoundMinRTT */
+ u16 rtt_sample_count; /* rttSampleCount */
+ u16 css_rounds; /* Counter for consecutive rounds showing RTT increase */
+ s64 window_end; /* End of current CSS round (packet number) */
+};
+
+/* HyStart++ constants in rfc9406#section-4.3 */
+#define QUIC_HS_MIN_SSTHRESH 16
+#define QUIC_HS_N_RTT_SAMPLE 8
+#define QUIC_HS_MIN_ETA 4000
+#define QUIC_HS_MAX_ETA 16000
+#define QUIC_HS_MIN_RTT_DIVISOR 8
+#define QUIC_HS_CSS_GROWTH_DIVISOR 4
+#define QUIC_HS_CSS_ROUNDS 5
+
+static u64 cubic_root(u64 n)
+{
+ u64 a, d;
+
+ if (!n)
+ return 0;
+
+ d = (64 - __builtin_clzll(n)) / 3;
+ a = BIT_ULL(d + 1);
+
+ for (; a * a * a > n;) {
+ d = div64_ul(n, a * a);
+ a = div64_ul(2 * a + d, 3);
+ }
+ return a;
+}
+
+/* rfc9406#section-4: HyStart++ Algorithm */
+static void cubic_slow_start(struct quic_cong *cong, u32 bytes, s64 number)
+{
+ struct quic_cubic *cubic = quic_cong_priv(cong);
+ u32 eta;
+
+ if (cubic->window_end <= number)
+ cubic->window_end = -1;
+
+ /* cwnd = cwnd + (min(N, L * SMSS) / CSS_GROWTH_DIVISOR) */
+ if (cubic->css_baseline_min_rtt != U32_MAX)
+ bytes = bytes / QUIC_HS_CSS_GROWTH_DIVISOR;
+ cong->window = min_t(u32, cong->window + bytes, cong->max_window);
+
+ if (cubic->css_baseline_min_rtt != U32_MAX) {
+ /* If CSS_ROUNDS rounds are complete, enter congestion avoidance. */
+ if (++cubic->css_rounds > QUIC_HS_CSS_ROUNDS) {
+ cubic->css_baseline_min_rtt = U32_MAX;
+ cubic->w_last_max = cong->window;
+ cong->ssthresh = cong->window;
+ cubic->css_rounds = 0;
+ }
+ return;
+ }
+
+ /* if ((rttSampleCount >= N_RTT_SAMPLE) AND
+ * (currentRoundMinRTT != infinity) AND
+ * (lastRoundMinRTT != infinity))
+ * RttThresh = max(MIN_RTT_THRESH,
+ * min(lastRoundMinRTT / MIN_RTT_DIVISOR, MAX_RTT_THRESH))
+ * if (currentRoundMinRTT >= (lastRoundMinRTT + RttThresh))
+ * cssBaselineMinRtt = currentRoundMinRTT
+ * exit slow start and enter CSS
+ */
+ if (cubic->last_round_min_rtt != U32_MAX &&
+ cubic->current_round_min_rtt != U32_MAX &&
+ cong->window >= QUIC_HS_MIN_SSTHRESH * cong->mss &&
+ cubic->rtt_sample_count >= QUIC_HS_N_RTT_SAMPLE) {
+ eta = cubic->last_round_min_rtt / QUIC_HS_MIN_RTT_DIVISOR;
+ if (eta < QUIC_HS_MIN_ETA)
+ eta = QUIC_HS_MIN_ETA;
+ else if (eta > QUIC_HS_MAX_ETA)
+ eta = QUIC_HS_MAX_ETA;
+
+ pr_debug("%s: current_round_min_rtt: %u, last_round_min_rtt: %u, eta: %u\n",
+ __func__, cubic->current_round_min_rtt, cubic->last_round_min_rtt, eta);
+
+ /* Delay increase triggers slow start exit and enter CSS. */
+ if (cubic->current_round_min_rtt >= cubic->last_round_min_rtt + eta)
+ cubic->css_baseline_min_rtt = cubic->current_round_min_rtt;
+ }
+}
+
+/* rfc9438#section-4: CUBIC Congestion Control */
+static void cubic_cong_avoid(struct quic_cong *cong, u32 bytes)
+{
+ struct quic_cubic *cubic = quic_cong_priv(cong);
+ u64 tx, kx, time_delta, delta, t;
+ u64 target_add, tcp_add = 0;
+ u64 target, cwnd_thres, m;
+
+ if (cubic->epoch_start == U32_MAX) {
+ cubic->epoch_start = cong->time;
+ if (cong->window < cubic->w_last_max) {
+ /*
+ * ┌────────────────┐
+ * 3 │W - cwnd
+ * ╲ │ max epoch
+ * K = ╲ │────────────────
+ * ╲│ C
+ */
+ cubic->k = cubic->w_last_max - cong->window;
+ cubic->k = cubic_root(div64_ul(cubic->k * 10, (u64)cong->mss * 4));
+ cubic->origin_point = cubic->w_last_max;
+ } else {
+ cubic->k = 0;
+ cubic->origin_point = cong->window;
+ }
+ cubic->w_tcp = cong->window;
+ cubic->pending_add = 0;
+ cubic->pending_w_add = 0;
+ }
+
+ /*
+ * t = t - t
+ * current epoch
+ */
+ t = cong->time - cubic->epoch_start;
+ tx = div64_ul(t << 10, USEC_PER_SEC);
+ kx = (cubic->k << 10);
+ if (tx > kx)
+ time_delta = tx - kx;
+ else
+ time_delta = kx - tx;
+ /*
+ * 3
+ * W (t) = C * (t - K) + W
+ * cubic max
+ */
+ delta = cong->mss * ((((time_delta * time_delta) >> 10) * time_delta) >> 10);
+ delta = div64_ul(delta * 4, 10) >> 10;
+ if (tx > kx)
+ target = cubic->origin_point + delta;
+ else
+ target = cubic->origin_point - delta;
+
+ /*
+ * W (t + RTT)
+ * cubic
+ */
+ cwnd_thres = (div64_ul((t + cong->smoothed_rtt) << 10, USEC_PER_SEC) * target) >> 10;
+ pr_debug("%s: tgt: %llu, thres: %llu, delta: %llu, t: %llu, srtt: %u, tx: %llu, kx: %llu\n",
+ __func__, target, cwnd_thres, delta, t, cong->smoothed_rtt, tx, kx);
+ /*
+ * ⎧
+ * ⎪cwnd if W (t + RTT) < cwnd
+ * ⎪ cubic
+ * ⎨1.5 * cwnd if W (t + RTT) > 1.5 * cwnd
+ * target = ⎪ cubic
+ * ⎪W (t + RTT) otherwise
+ * ⎩ cubic
+ */
+ if (cwnd_thres < cong->window)
+ target = cong->window;
+ else if (cwnd_thres * 2 > (u64)cong->window * 3)
+ target = cong->window * 3 / 2;
+ else
+ target = cwnd_thres;
+
+ /*
+ * target - cwnd
+ * ─────────────
+ * cwnd
+ */
+ if (target > cong->window) {
+ target_add = cubic->pending_add + cong->mss * (target - cong->window);
+ cubic->pending_add = do_div(target_add, cong->window);
+ } else {
+ target_add = cubic->pending_add + cong->mss;
+ cubic->pending_add = do_div(target_add, 100 * cong->window);
+ }
+
+ pr_debug("%s: target: %llu, window: %u, target_add: %llu\n",
+ __func__, target, cong->window, target_add);
+
+ /*
+ * segments_acked
+ * W = W + α * ──────────────
+ * est est cubic cwnd
+ */
+ m = cubic->pending_w_add + cong->mss * bytes;
+ cubic->pending_w_add = do_div(m, cong->window);
+ cubic->w_tcp += m;
+
+ if (cubic->w_tcp > cong->window)
+ tcp_add = div64_ul((u64)cong->mss * (cubic->w_tcp - cong->window), cong->window);
+
+ pr_debug("%s: w_tcp: %u, window: %u, tcp_add: %llu\n",
+ __func__, cubic->w_tcp, cong->window, tcp_add);
+
+ /* W_cubic(_t_) or _W_est_, whichever is bigger. */
+ cong->window += max(tcp_add, target_add);
+}
+
+static void cubic_recovery(struct quic_cong *cong)
+{
+ struct quic_cubic *cubic = quic_cong_priv(cong);
+
+ cong->recovery_time = cong->time;
+ cubic->epoch_start = U32_MAX;
+
+ /* rfc9438#section-3.4:
+ * CUBIC sets the multiplicative window decrease factor (β__cubic_) to 0.7,
+ * whereas Reno uses 0.5.
+ *
+ * rfc9438#section-4.6:
+ * ssthresh = flight_size * β_cubic new ssthresh
+ *
+ * Some implementations of CUBIC currently use _cwnd_ instead of _flight_size_ when
+ * calculating a new _ssthresh_.
+ *
+ * rfc9438#section-4.7:
+ *
+ * ⎧ 1 + β
+ * ⎪ cubic
+ * ⎪cwnd * ────────── if cwnd < W_max and fast convergence
+ * W = ⎨ 2
+ * max ⎪ enabled, further reduce W_max
+ * ⎪
+ * ⎩cwnd otherwise, remember cwnd before reduction
+ */
+ if (cong->window < cubic->w_last_max)
+ cubic->w_last_max = cong->window * 17 / 10 / 2;
+ else
+ cubic->w_last_max = cong->window;
+
+ cong->ssthresh = cong->window * 7 / 10;
+ cong->ssthresh = max(cong->ssthresh, cong->min_window);
+ cong->window = cong->ssthresh;
+}
+
+static int quic_cong_check_persistent_congestion(struct quic_cong *cong, u32 time)
+{
+ u32 ssthresh;
+
+ /* rfc9002#section-7.6.1:
+ * (smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay) *
+ * kPersistentCongestionThreshold
+ */
+ ssthresh = cong->smoothed_rtt + max(4 * cong->rttvar, QUIC_KGRANULARITY);
+ ssthresh = (ssthresh + cong->max_ack_delay) * QUIC_KPERSISTENT_CONGESTION_THRESHOLD;
+ if (cong->time - time <= ssthresh)
+ return 0;
+
+ pr_debug("%s: persistent congestion, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ cong->min_rtt_valid = 0;
+ cong->window = cong->min_window;
+ cong->state = QUIC_CONG_SLOW_START;
+ return 1;
+}
+
+static void quic_cubic_on_packet_lost(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ if (quic_cong_check_persistent_congestion(cong, time))
+ return;
+
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ pr_debug("%s: slow_start -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ return;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ pr_debug("%s: cong_avoid -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__, cong->state);
+ return;
+ }
+
+ cong->state = QUIC_CONG_RECOVERY_PERIOD;
+ cubic_recovery(cong);
+}
+
+static void quic_cubic_on_packet_acked(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ cubic_slow_start(cong, bytes, number);
+ if (cong->window >= cong->ssthresh) {
+ cong->state = QUIC_CONG_CONGESTION_AVOIDANCE;
+ pr_debug("%s: slow_start -> cong_avoid, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ }
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ if (cong->recovery_time < time) {
+ cong->state = QUIC_CONG_CONGESTION_AVOIDANCE;
+ pr_debug("%s: recovery -> cong_avoid, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ }
+ break;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ cubic_cong_avoid(cong, bytes);
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__, cong->state);
+ return;
+ }
+}
+
+static void quic_cubic_on_process_ecn(struct quic_cong *cong)
+{
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ pr_debug("%s: slow_start -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ return;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ pr_debug("%s: cong_avoid -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__, cong->state);
+ return;
+ }
+
+ cong->state = QUIC_CONG_RECOVERY_PERIOD;
+ cubic_recovery(cong);
+}
+
+static void quic_cubic_on_init(struct quic_cong *cong)
+{
+ struct quic_cubic *cubic = quic_cong_priv(cong);
+
+ cubic->epoch_start = U32_MAX;
+ cubic->origin_point = 0;
+ cubic->w_last_max = 0;
+ cubic->w_tcp = 0;
+ cubic->k = 0;
+
+ cubic->current_round_min_rtt = U32_MAX;
+ cubic->css_baseline_min_rtt = U32_MAX;
+ cubic->last_round_min_rtt = U32_MAX;
+ cubic->rtt_sample_count = 0;
+ cubic->window_end = -1;
+ cubic->css_rounds = 0;
+}
+
+static void quic_cubic_on_packet_sent(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ struct quic_cubic *cubic = quic_cong_priv(cong);
+
+ if (cubic->window_end != -1)
+ return;
+
+ /* rfc9406#section-4.2:
+ * lastRoundMinRTT = currentRoundMinRTT
+ * currentRoundMinRTT = infinity
+ * rttSampleCount = 0
+ */
+ cubic->window_end = number;
+ cubic->last_round_min_rtt = cubic->current_round_min_rtt;
+ cubic->current_round_min_rtt = U32_MAX;
+ cubic->rtt_sample_count = 0;
+
+ pr_debug("%s: last_round_min_rtt: %u\n", __func__, cubic->last_round_min_rtt);
+}
+
+static void quic_cubic_on_rtt_update(struct quic_cong *cong)
+{
+ struct quic_cubic *cubic = quic_cong_priv(cong);
+
+ if (cubic->window_end == -1)
+ return;
+
+ pr_debug("%s: current_round_min_rtt: %u, latest_rtt: %u\n",
+ __func__, cubic->current_round_min_rtt, cong->latest_rtt);
+
+ /* rfc9406#section-4.2:
+ * currentRoundMinRTT = min(currentRoundMinRTT, currRTT)
+ * rttSampleCount += 1
+ */
+ if (cubic->current_round_min_rtt > cong->latest_rtt) {
+ cubic->current_round_min_rtt = cong->latest_rtt;
+ if (cubic->current_round_min_rtt < cubic->css_baseline_min_rtt) {
+ cubic->css_baseline_min_rtt = U32_MAX;
+ cubic->css_rounds = 0;
+ }
+ }
+ cubic->rtt_sample_count++;
+}
+
+/* NEW RENO APIs */
+static void quic_reno_on_packet_lost(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ if (quic_cong_check_persistent_congestion(cong, time))
+ return;
+
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ pr_debug("%s: slow_start -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ return;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ pr_debug("%s: cong_avoid -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__, cong->state);
+ return;
+ }
+
+ cong->recovery_time = cong->time;
+ cong->state = QUIC_CONG_RECOVERY_PERIOD;
+ cong->ssthresh = max(cong->window >> 1U, cong->min_window);
+ cong->window = cong->ssthresh;
+}
+
+static void quic_reno_on_packet_acked(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ cong->window = min_t(u32, cong->window + bytes, cong->max_window);
+ if (cong->window >= cong->ssthresh) {
+ cong->state = QUIC_CONG_CONGESTION_AVOIDANCE;
+ pr_debug("%s: slow_start -> cong_avoid, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ }
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ if (cong->recovery_time < time) {
+ cong->state = QUIC_CONG_CONGESTION_AVOIDANCE;
+ pr_debug("%s: recovery -> cong_avoid, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ }
+ break;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ cong->window += cong->mss * bytes / cong->window;
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__, cong->state);
+ return;
+ }
+}
+
+static void quic_reno_on_process_ecn(struct quic_cong *cong)
+{
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ pr_debug("%s: slow_start -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ return;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ pr_debug("%s: cong_avoid -> recovery, cwnd: %u, ssthresh: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__, cong->state);
+ return;
+ }
+
+ cong->recovery_time = cong->time;
+ cong->state = QUIC_CONG_RECOVERY_PERIOD;
+ cong->ssthresh = max(cong->window >> 1U, cong->min_window);
+ cong->window = cong->ssthresh;
+}
+
+static void quic_reno_on_init(struct quic_cong *cong)
+{
+}
+
+static struct quic_cong_ops quic_congs[] = {
+ { /* QUIC_CONG_ALG_RENO */
+ .on_packet_acked = quic_reno_on_packet_acked,
+ .on_packet_lost = quic_reno_on_packet_lost,
+ .on_process_ecn = quic_reno_on_process_ecn,
+ .on_init = quic_reno_on_init,
+ },
+ { /* QUIC_CONG_ALG_CUBIC */
+ .on_packet_acked = quic_cubic_on_packet_acked,
+ .on_packet_lost = quic_cubic_on_packet_lost,
+ .on_process_ecn = quic_cubic_on_process_ecn,
+ .on_init = quic_cubic_on_init,
+ .on_packet_sent = quic_cubic_on_packet_sent,
+ .on_rtt_update = quic_cubic_on_rtt_update,
+ },
+};
+
+/* COMMON APIs */
+void quic_cong_on_packet_lost(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ cong->ops->on_packet_lost(cong, time, bytes, number);
+}
+
+void quic_cong_on_packet_acked(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ cong->ops->on_packet_acked(cong, time, bytes, number);
+}
+
+void quic_cong_on_process_ecn(struct quic_cong *cong)
+{
+ cong->ops->on_process_ecn(cong);
+}
+
+/* Update Probe Timeout (PTO) and loss detection delay based on RTT stats. */
+static void quic_cong_pto_update(struct quic_cong *cong)
+{
+ u32 pto, loss_delay;
+
+ /* rfc9002#section-6.2.1:
+ * PTO = smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay
+ */
+ pto = cong->smoothed_rtt + max(4 * cong->rttvar, QUIC_KGRANULARITY);
+ cong->pto = pto + cong->max_ack_delay;
+
+ /* rfc9002#section-6.1.2:
+ * max(kTimeThreshold * max(smoothed_rtt, latest_rtt), kGranularity)
+ */
+ loss_delay = QUIC_KTIME_THRESHOLD(max(cong->smoothed_rtt, cong->latest_rtt));
+ cong->loss_delay = max(loss_delay, QUIC_KGRANULARITY);
+
+ pr_debug("%s: update pto: %u\n", __func__, pto);
+}
+
+/* Update pacing timestamp after sending 'bytes' bytes.
+ *
+ * This function tracks when the next packet is allowed to be sent based on pacing rate.
+ */
+static void quic_cong_update_pacing_time(struct quic_cong *cong, u32 bytes)
+{
+ unsigned long rate = READ_ONCE(cong->pacing_rate);
+ u64 prior_time, credit, len_ns;
+
+ if (!rate)
+ return;
+
+ prior_time = cong->pacing_time;
+ cong->pacing_time = max(cong->pacing_time, ktime_get_ns());
+ credit = cong->pacing_time - prior_time;
+
+ /* take into account OS jitter */
+ len_ns = div64_ul((u64)bytes * NSEC_PER_SEC, rate);
+ len_ns -= min_t(u64, len_ns / 2, credit);
+ cong->pacing_time += len_ns;
+}
+
+/* Compute and update the pacing rate based on congestion window and smoothed RTT. */
+static void quic_cong_pace_update(struct quic_cong *cong, u32 bytes, u32 max_rate)
+{
+ u64 rate;
+
+ /* rate = N * congestion_window / smoothed_rtt */
+ rate = (u64)cong->window * USEC_PER_SEC * 2;
+ if (likely(cong->smoothed_rtt))
+ rate = div64_ul(rate, cong->smoothed_rtt);
+
+ WRITE_ONCE(cong->pacing_rate, min_t(u64, rate, max_rate));
+ pr_debug("%s: update pacing rate: %u, max rate: %u, srtt: %u\n",
+ __func__, cong->pacing_rate, max_rate, cong->smoothed_rtt);
+}
+
+void quic_cong_on_packet_sent(struct quic_cong *cong, u32 time, u32 bytes, s64 number)
+{
+ if (!bytes)
+ return;
+ if (cong->ops->on_packet_sent)
+ cong->ops->on_packet_sent(cong, time, bytes, number);
+ quic_cong_update_pacing_time(cong, bytes);
+}
+
+void quic_cong_on_ack_recv(struct quic_cong *cong, u32 bytes, u32 max_rate)
+{
+ if (!bytes)
+ return;
+ if (cong->ops->on_ack_recv)
+ cong->ops->on_ack_recv(cong, bytes, max_rate);
+ quic_cong_pace_update(cong, bytes, max_rate);
+}
+
+/* rfc9002#section-5: Estimating the Round-Trip Time */
+void quic_cong_rtt_update(struct quic_cong *cong, u32 time, u32 ack_delay)
+{
+ u32 adjusted_rtt, rttvar_sample;
+
+ /* Ignore RTT sample if ACK delay is suspiciously large. */
+ if (ack_delay > cong->max_ack_delay * 2)
+ return;
+
+ /* rfc9002#section-5.1: latest_rtt = ack_time - send_time_of_largest_acked */
+ cong->latest_rtt = cong->time - time;
+
+ /* rfc9002#section-5.2: Estimating min_rtt */
+ if (!cong->min_rtt_valid) {
+ cong->min_rtt = cong->latest_rtt;
+ cong->min_rtt_valid = 1;
+ }
+ if (cong->min_rtt > cong->latest_rtt)
+ cong->min_rtt = cong->latest_rtt;
+
+ if (!cong->is_rtt_set) {
+ /* rfc9002#section-5.3:
+ * smoothed_rtt = latest_rtt
+ * rttvar = latest_rtt / 2
+ */
+ cong->smoothed_rtt = cong->latest_rtt;
+ cong->rttvar = cong->smoothed_rtt / 2;
+ quic_cong_pto_update(cong);
+ cong->is_rtt_set = 1;
+ return;
+ }
+
+ /* rfc9002#section-5.3:
+ * adjusted_rtt = latest_rtt
+ * if (latest_rtt >= min_rtt + ack_delay):
+ * adjusted_rtt = latest_rtt - ack_delay
+ * smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rtt
+ * rttvar_sample = abs(smoothed_rtt - adjusted_rtt)
+ * rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
+ */
+ adjusted_rtt = cong->latest_rtt;
+ if (cong->latest_rtt >= cong->min_rtt + ack_delay)
+ adjusted_rtt = cong->latest_rtt - ack_delay;
+
+ cong->smoothed_rtt = (cong->smoothed_rtt * 7 + adjusted_rtt) / 8;
+ if (cong->smoothed_rtt >= adjusted_rtt)
+ rttvar_sample = cong->smoothed_rtt - adjusted_rtt;
+ else
+ rttvar_sample = adjusted_rtt - cong->smoothed_rtt;
+ cong->rttvar = (cong->rttvar * 3 + rttvar_sample) / 4;
+ quic_cong_pto_update(cong);
+
+ if (cong->ops->on_rtt_update)
+ cong->ops->on_rtt_update(cong);
+}
+
+void quic_cong_set_algo(struct quic_cong *cong, u8 algo)
+{
+ if (algo >= QUIC_CONG_ALG_MAX)
+ algo = QUIC_CONG_ALG_RENO;
+
+ cong->state = QUIC_CONG_SLOW_START;
+ cong->ssthresh = U32_MAX;
+ cong->ops = &quic_congs[algo];
+ cong->ops->on_init(cong);
+}
+
+void quic_cong_set_srtt(struct quic_cong *cong, u32 srtt)
+{
+ /* rfc9002#section-5.3:
+ * smoothed_rtt = kInitialRtt
+ * rttvar = kInitialRtt / 2
+ */
+ cong->latest_rtt = srtt;
+ cong->smoothed_rtt = cong->latest_rtt;
+ cong->rttvar = cong->smoothed_rtt / 2;
+ quic_cong_pto_update(cong);
+}
+
+void quic_cong_init(struct quic_cong *cong)
+{
+ cong->max_ack_delay = QUIC_DEF_ACK_DELAY;
+ cong->max_window = S32_MAX / 2;
+ quic_cong_set_algo(cong, QUIC_CONG_ALG_RENO);
+ quic_cong_set_srtt(cong, QUIC_RTT_INIT);
+}
diff --git a/net/quic/cong.h b/net/quic/cong.h
new file mode 100644
index 000000000000..cb83c00a554f
--- /dev/null
+++ b/net/quic/cong.h
@@ -0,0 +1,120 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_KPERSISTENT_CONGESTION_THRESHOLD 3
+#define QUIC_KPACKET_THRESHOLD 3
+#define QUIC_KTIME_THRESHOLD(rtt) ((rtt) * 9 / 8)
+#define QUIC_KGRANULARITY 1000U
+
+#define QUIC_RTT_INIT 333000U
+#define QUIC_RTT_MAX 2000000U
+#define QUIC_RTT_MIN QUIC_KGRANULARITY
+
+/* rfc9002#section-7.3: Congestion Control States
+ *
+ * New path or +------------+
+ * persistent congestion | Slow |
+ * (O)---------------------->| Start |
+ * +------------+
+ * |
+ * Loss or |
+ * ECN-CE increase |
+ * v
+ * +------------+ Loss or +------------+
+ * | Congestion | ECN-CE increase | Recovery |
+ * | Avoidance |------------------>| Period |
+ * +------------+ +------------+
+ * ^ |
+ * | |
+ * +----------------------------+
+ * Acknowledgment of packet
+ * sent during recovery
+ */
+enum quic_cong_state {
+ QUIC_CONG_SLOW_START,
+ QUIC_CONG_RECOVERY_PERIOD,
+ QUIC_CONG_CONGESTION_AVOIDANCE,
+};
+
+struct quic_cong {
+ /* RTT tracking */
+ u32 smoothed_rtt; /* Smoothed RTT */
+ u32 latest_rtt; /* Latest RTT sample */
+ u32 min_rtt; /* Lowest observed RTT */
+ u32 rttvar; /* RTT variation */
+ u32 loss_delay; /* Time before marking loss */
+ u32 pto; /* Probe timeout */
+
+ /* Timing & pacing */
+ u32 max_ack_delay; /* max_ack_delay from rfc9000#section-18.2 */
+ u32 recovery_time; /* Recovery period start */
+ u32 pacing_rate; /* Packet sending speed Bytes/sec */
+ u64 pacing_time; /* Next send time */
+ u32 time; /* Cached time */
+
+ /* Congestion window */
+ u32 max_window; /* Max growth cap */
+ u32 min_window; /* Min window limit */
+ u32 ssthresh; /* Slow start threshold */
+ u32 window; /* Bytes in flight allowed */
+ u32 mss; /* QUIC MSS (excl. UDP) */
+
+ /* Algorithm-specific */
+ struct quic_cong_ops *ops;
+ u64 priv[8]; /* Algo private data */
+
+ /* Flags & state */
+ u8 min_rtt_valid; /* min_rtt initialized */
+ u8 is_rtt_set; /* RTT samples exist */
+ u8 state; /* State machine in rfc9002#section-7.3 */
+};
+
+/* Hooks for congestion control algorithms */
+struct quic_cong_ops {
+ void (*on_packet_acked)(struct quic_cong *cong, u32 time, u32 bytes, s64 number);
+ void (*on_packet_lost)(struct quic_cong *cong, u32 time, u32 bytes, s64 number);
+ void (*on_process_ecn)(struct quic_cong *cong);
+ void (*on_init)(struct quic_cong *cong);
+
+ /* Optional callbacks */
+ void (*on_packet_sent)(struct quic_cong *cong, u32 time, u32 bytes, s64 number);
+ void (*on_ack_recv)(struct quic_cong *cong, u32 bytes, u32 max_rate);
+ void (*on_rtt_update)(struct quic_cong *cong);
+};
+
+static inline void quic_cong_set_mss(struct quic_cong *cong, u32 mss)
+{
+ if (cong->mss == mss)
+ return;
+
+ /* rfc9002#section-7.2: Initial and Minimum Congestion Window */
+ cong->mss = mss;
+ cong->min_window = max(min(mss * 10, 14720U), mss * 2);
+
+ if (cong->window < cong->min_window)
+ cong->window = cong->min_window;
+}
+
+static inline void *quic_cong_priv(struct quic_cong *cong)
+{
+ return (void *)cong->priv;
+}
+
+void quic_cong_on_packet_acked(struct quic_cong *cong, u32 time, u32 bytes, s64 number);
+void quic_cong_on_packet_lost(struct quic_cong *cong, u32 time, u32 bytes, s64 number);
+void quic_cong_on_process_ecn(struct quic_cong *cong);
+
+void quic_cong_on_packet_sent(struct quic_cong *cong, u32 time, u32 bytes, s64 number);
+void quic_cong_on_ack_recv(struct quic_cong *cong, u32 bytes, u32 max_rate);
+void quic_cong_rtt_update(struct quic_cong *cong, u32 time, u32 ack_delay);
+
+void quic_cong_set_srtt(struct quic_cong *cong, u32 srtt);
+void quic_cong_set_algo(struct quic_cong *cong, u8 algo);
+void quic_cong_init(struct quic_cong *cong);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index c549e76623e3..8ee4e4cd4ee3 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -44,6 +44,7 @@ static int quic_init_sock(struct sock *sk)
quic_conn_id_set_init(quic_source(sk), 1);
quic_conn_id_set_init(quic_dest(sk), 0);
+ quic_cong_init(quic_cong(sk));
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 3cff2a1d478a..019f8752fc87 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -16,6 +16,7 @@
#include "stream.h"
#include "connid.h"
#include "path.h"
+#include "cong.h"
#include "protocol.h"
@@ -42,6 +43,7 @@ struct quic_sock {
struct quic_conn_id_set source;
struct quic_conn_id_set dest;
struct quic_path_group paths;
+ struct quic_cong cong;
};
struct quic6_sock {
@@ -104,6 +106,11 @@ static inline bool quic_is_serv(const struct sock *sk)
return quic_paths(sk)->serv;
}
+static inline struct quic_cong *quic_cong(const struct sock *sk)
+{
+ return &quic_sk(sk)->cong;
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
* [PATCH net-next v2 10/15] quic: add packet number space
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (8 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 09/15] quic: add congestion control Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 11/15] quic: add crypto key derivation and installation Xin Long
` (5 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'quic_pnspace', which manages the state kept per
packet number space.
It maintains the next packet number to assign, tracks the total length of
frames currently in flight, and records the time when the next packet may
be considered lost. It also keeps track of the largest acknowledged packet
number, the time it was acknowledged, and when the most recent
ack-eliciting packet was sent. These fields are useful for loss detection,
RTT estimation, and congestion control.
To support ACK frame generation, quic_pnspace includes a packet number
acknowledgment map (pn_ack_map) that tracks received packet numbers.
Supporting functions are provided to validate and mark received packet
numbers and compute the number of gap blocks needed during ACK frame
construction.
- quic_pnspace_check(): Validates a received packet number.
- quic_pnspace_mark(): Marks a received packet number in the ACK map.
- quic_pnspace_num_gabs(): Returns the number of gap ACK blocks needed
when constructing ACK frames.
Note that QUIC uses a separate packet number space for each encryption
level (INITIAL, HANDSHAKE, APP), except that EARLY (0-RTT) packets and
all generations of APP (1-RTT) keys share the APP packet number space,
as described in rfc9002#section-4.1.
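A toy model of the pn_ack_map idea may help: a bitmap tracks packet
numbers received at or above a base, so duplicates can be rejected and
gap blocks counted when building ACK frames. This sketch uses a fixed
64-bit window and hypothetical names; the kernel bitmap grows dynamically:

```c
#include <stdint.h>

#define PN_MAP_BITS 64

struct pn_map {
	uint64_t bits;	 /* bit i set => packet (base_pn + i) received */
	int64_t base_pn; /* lowest packet number still tracked */
};

/* Returns 1 if pn is new and trackable, 0 if duplicate or out of window. */
static int pn_check(const struct pn_map *m, int64_t pn)
{
	if (pn < m->base_pn || pn >= m->base_pn + PN_MAP_BITS)
		return 0;
	return !(m->bits & (1ULL << (pn - m->base_pn)));
}

static void pn_mark(struct pn_map *m, int64_t pn)
{
	if (pn_check(m, pn))
		m->bits |= 1ULL << (pn - m->base_pn);
}

/* Count gap blocks (runs of missing packet numbers below the largest
 * received one), as an ACK-frame builder would. */
static int pn_num_gaps(const struct pn_map *m)
{
	int gaps = 0, in_gap = 0, i, top = -1;

	for (i = 0; i < PN_MAP_BITS; i++)
		if (m->bits & (1ULL << i))
			top = i;
	for (i = 0; i <= top; i++) {
		if (!(m->bits & (1ULL << i))) {
			if (!in_gap) {
				gaps++;
				in_gap = 1;
			}
		} else {
			in_gap = 0;
		}
	}
	return gaps;
}
```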
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/pnspace.c | 224 +++++++++++++++++++++++++++++++++++++++++++++
net/quic/pnspace.h | 150 ++++++++++++++++++++++++++++++
net/quic/socket.c | 12 +++
net/quic/socket.h | 7 ++
5 files changed, 394 insertions(+), 1 deletion(-)
create mode 100644 net/quic/pnspace.c
create mode 100644 net/quic/pnspace.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 4d4a42c6d565..9d8e18297911 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o
+ cong.o pnspace.o
diff --git a/net/quic/pnspace.c b/net/quic/pnspace.c
new file mode 100644
index 000000000000..3f61b0bc6fc6
--- /dev/null
+++ b/net/quic/pnspace.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Packet number space management for the QUIC protocol.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/slab.h>
+
+#include "pnspace.h"
+
+int quic_pnspace_init(struct quic_pnspace *space)
+{
+ if (!space->pn_map) {
+ space->pn_map = kzalloc(BITS_TO_BYTES(QUIC_PN_MAP_INITIAL), GFP_KERNEL);
+ if (!space->pn_map)
+ return -ENOMEM;
+ space->pn_map_len = QUIC_PN_MAP_INITIAL;
+ } else {
+ bitmap_zero(space->pn_map, space->pn_map_len);
+ }
+
+ space->max_time_limit = QUIC_PNSPACE_TIME_LIMIT;
+ space->next_pn = QUIC_PNSPACE_NEXT_PN;
+ space->base_pn = -1;
+ return 0;
+}
+
+void quic_pnspace_free(struct quic_pnspace *space)
+{
+ space->pn_map_len = 0;
+ kfree(space->pn_map);
+}
+
+/* Expand the bitmap tracking received packet numbers. Ensures the pn_map bitmap can
+ * cover at least @size packet numbers. Allocates a larger bitmap, copies existing
+ * data, and updates metadata.
+ *
+ * Returns: 1 if the bitmap was successfully grown, 0 on failure or if the requested
+ * size exceeds QUIC_PN_MAP_SIZE.
+ */
+static int quic_pnspace_grow(struct quic_pnspace *space, u16 size)
+{
+ u16 len, inc, offset;
+ unsigned long *new;
+
+ if (size > QUIC_PN_MAP_SIZE)
+ return 0;
+
+ inc = ALIGN((size - space->pn_map_len), BITS_PER_LONG) + QUIC_PN_MAP_INCREMENT;
+ len = (u16)min(space->pn_map_len + inc, QUIC_PN_MAP_SIZE);
+
+ new = kzalloc(BITS_TO_BYTES(len), GFP_ATOMIC);
+ if (!new)
+ return 0;
+
+ offset = (u16)(space->max_pn_seen + 1 - space->base_pn);
+ bitmap_copy(new, space->pn_map, offset);
+ kfree(space->pn_map);
+ space->pn_map = new;
+ space->pn_map_len = len;
+
+ return 1;
+}
+
+/* Check if a packet number has been received.
+ *
+ * Returns: 0 if the packet number has not been received. 1 if it has already
+ * been received. -1 if the packet number is too old or too far in the future
+ * to track.
+ */
+int quic_pnspace_check(struct quic_pnspace *space, s64 pn)
+{
+ if (space->base_pn == -1) /* No packet number received yet. */
+ return 0;
+
+ if (pn < space->min_pn_seen || pn >= space->base_pn + QUIC_PN_MAP_SIZE)
+ return -1;
+
+ if (pn < space->base_pn || (pn - space->base_pn < space->pn_map_len &&
+ test_bit(pn - space->base_pn, space->pn_map)))
+ return 1;
+
+ return 0;
+}
+
+/* Advance base_pn past contiguous received packet numbers. Finds the next gap
+ * (unreceived packet) beyond @pn, shifts the bitmap, and updates base_pn
+ * accordingly.
+ */
+static void quic_pnspace_move(struct quic_pnspace *space, s64 pn)
+{
+ u16 offset;
+
+ offset = (u16)(pn + 1 - space->base_pn);
+ offset = (u16)find_next_zero_bit(space->pn_map, space->pn_map_len, offset);
+ space->base_pn += offset;
+ bitmap_shift_right(space->pn_map, space->pn_map, offset, space->pn_map_len);
+}
+
+/* Mark a packet number as received. Updates the packet number map to record
+ * reception of @pn. Advances base_pn if possible, and updates max/min/last seen
+ * fields as needed.
+ *
+ * Returns: 0 on success or if the packet was already marked. -ENOMEM if bitmap
+ * allocation failed during growth.
+ */
+int quic_pnspace_mark(struct quic_pnspace *space, s64 pn)
+{
+ s64 last_max_pn_seen;
+ u16 gap;
+
+ if (space->base_pn == -1) {
+ /* Initialize base_pn from the peer's first packet number, since the
+ * peer's packet numbers may start at a non-zero value.
+ */
+ quic_pnspace_set_base_pn(space, pn + 1);
+ return 0;
+ }
+
+ /* Ignore packets with number less than current base (already processed). */
+ if (pn < space->base_pn)
+ return 0;
+
+ /* If gap is beyond current map length, try to grow the bitmap to accommodate. */
+ gap = (u16)(pn - space->base_pn);
+ if (gap >= space->pn_map_len && !quic_pnspace_grow(space, gap + 1))
+ return -ENOMEM;
+
+ if (space->max_pn_seen < pn) {
+ space->max_pn_seen = pn;
+ space->max_pn_time = space->time;
+ }
+
+ if (space->base_pn == pn) { /* If packet is exactly at base_pn (next expected packet). */
+ if (quic_pnspace_has_gap(space)) /* Advance base_pn to next unacked packet. */
+ quic_pnspace_move(space, pn);
+ else /* Fast path: increment base_pn if no gaps. */
+ space->base_pn++;
+ } else { /* Mark this packet as received in the bitmap. */
+ set_bit(gap, space->pn_map);
+ }
+
+ /* Only update min and last_max_pn_seen if this packet is the current max_pn. */
+ if (space->max_pn_seen != pn)
+ return 0;
+
+ /* Check if enough time has elapsed or enough packets have been received to
+ * update tracking.
+ */
+ last_max_pn_seen = min_t(s64, space->last_max_pn_seen, space->base_pn);
+ if (space->max_pn_time < space->last_max_pn_time + space->max_time_limit &&
+ space->max_pn_seen <= last_max_pn_seen + QUIC_PN_MAP_LIMIT)
+ return 0;
+
+ /* Advance base_pn if last_max_pn_seen is ahead of current base_pn. This is
+ * needed because QUIC doesn't retransmit packets; retransmitted frames are
+ * carried in new packets, so we move forward.
+ */
+ if (space->last_max_pn_seen + 1 > space->base_pn)
+ quic_pnspace_move(space, space->last_max_pn_seen);
+
+ space->min_pn_seen = space->last_max_pn_seen;
+ space->last_max_pn_seen = space->max_pn_seen;
+ space->last_max_pn_time = space->max_pn_time;
+ return 0;
+}
+
+/* Find the next gap in received packet numbers. Scans pn_map for a gap starting from
+ * *@iter. A gap is a contiguous block of unreceived packets between received ones.
+ *
+ * Returns: 1 if a gap was found, 0 if no more gaps exist or are relevant.
+ */
+static int quic_pnspace_next_gap_ack(const struct quic_pnspace *space,
+ s64 *iter, u16 *start, u16 *end)
+{
+ u16 start_ = 0, end_ = 0, offset = (u16)(*iter - space->base_pn);
+
+ start_ = (u16)find_next_zero_bit(space->pn_map, space->pn_map_len, offset);
+ if (space->max_pn_seen <= space->base_pn + start_)
+ return 0;
+
+ end_ = (u16)find_next_bit(space->pn_map, space->pn_map_len, start_);
+ if (space->max_pn_seen <= space->base_pn + end_ - 1)
+ return 0;
+
+ *start = start_ + 1;
+ *end = end_;
+ *iter = space->base_pn + *end;
+ return 1;
+}
+
+/* Generate gap acknowledgment blocks (GABs). GABs describe ranges of unacknowledged
+ * packets between received ones, and are used in ACK frames.
+ *
+ * Returns: Number of generated GABs (up to QUIC_PN_MAX_GABS).
+ */
+u16 quic_pnspace_num_gabs(struct quic_pnspace *space, struct quic_gap_ack_block *gabs)
+{
+ u16 start, end, ngaps = 0;
+ s64 iter;
+
+ if (!quic_pnspace_has_gap(space))
+ return 0;
+
+ iter = space->base_pn;
+ /* Loop through all gaps until the end of the window or max allowed gaps. */
+ while (quic_pnspace_next_gap_ack(space, &iter, &start, &end)) {
+ gabs[ngaps].start = start;
+ if (ngaps == QUIC_PN_MAX_GABS - 1) {
+ gabs[ngaps].end = (u16)(space->max_pn_seen - space->base_pn);
+ ngaps++;
+ break;
+ }
+ gabs[ngaps].end = end;
+ ngaps++;
+ }
+ return ngaps;
+}
diff --git a/net/quic/pnspace.h b/net/quic/pnspace.h
new file mode 100644
index 000000000000..ff700c2cd2ef
--- /dev/null
+++ b/net/quic/pnspace.h
@@ -0,0 +1,150 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_PN_MAX_GABS 32
+#define QUIC_PN_MAP_MAX_PN (BIT_ULL(62) - 1)
+
+#define QUIC_PN_MAP_INITIAL 64
+#define QUIC_PN_MAP_INCREMENT QUIC_PN_MAP_INITIAL
+#define QUIC_PN_MAP_SIZE 4096
+#define QUIC_PN_MAP_LIMIT (QUIC_PN_MAP_SIZE * 3 / 4)
+
+#define QUIC_PNSPACE_MAX (QUIC_CRYPTO_MAX - 1)
+#define QUIC_PNSPACE_NEXT_PN 0
+#define QUIC_PNSPACE_TIME_LIMIT (333000 * 3)
+
+enum {
+ QUIC_ECN_ECT1,
+ QUIC_ECN_ECT0,
+ QUIC_ECN_CE,
+ QUIC_ECN_MAX
+};
+
+enum {
+ QUIC_ECN_LOCAL, /* ECN bits from incoming IP headers */
+ QUIC_ECN_PEER, /* ECN bits reported by peer in ACK frames */
+ QUIC_ECN_DIR_MAX
+};
+
+/* Represents a gap (range of missing packets) in the ACK map. The values are
+ * offsets from base_pn, with both 'start' and 'end' stored off by one (+1).
+ */
+struct quic_gap_ack_block {
+ u16 start;
+ u16 end;
+};
+
+/* Packet Number Map (pn_map) Layout:
+ *
+ * min_pn_seen -->++-----------------------+---------------------+---
+ * base_pn -----^ last_max_pn_seen --^ max_pn_seen --^
+ *
+ * Map Advancement Logic:
+ * - min_pn_seen = last_max_pn_seen;
+ * - base_pn = first zero bit after last_max_pn_seen;
+ * - last_max_pn_seen = max_pn_seen;
+ * - last_max_pn_time = current time;
+ *
+ * Conditions to Advance pn_map:
+ * - (max_pn_time - last_max_pn_time) >= max_time_limit, or
+ * - (max_pn_seen - last_max_pn_seen) > QUIC_PN_MAP_LIMIT
+ *
+ * Gap Search Range:
+ * - From (base_pn - 1) to max_pn_seen
+ */
+struct quic_pnspace {
+ /* ECN counters indexed by direction (local/peer) and ECN codepoint (ECT1, ECT0, CE) */
+ u64 ecn_count[QUIC_ECN_DIR_MAX][QUIC_ECN_MAX];
+ unsigned long *pn_map; /* Bit map tracking received packet numbers for ACK generation */
+ u16 pn_map_len; /* Length of the packet number bit map (in bits) */
+ u8 need_sack:1; /* Flag indicating a SACK frame should be sent for this space */
+ u8 sack_path:1; /* Path used for sending the SACK frame */
+
+ s64 last_max_pn_seen; /* Highest packet number seen before pn_map advanced */
+ u32 last_max_pn_time; /* Timestamp when last_max_pn_seen was received */
+ u32 max_time_limit; /* Time threshold to trigger pn_map advancement on packet receipt */
+ s64 min_pn_seen; /* Smallest packet number received in this space */
+ s64 max_pn_seen; /* Largest packet number received in this space */
+ u32 max_pn_time; /* Time at which max_pn_seen was received */
+ s64 base_pn; /* Packet number corresponding to the start of the pn_map */
+ u32 time; /* Cached current time, or the time the socket was accepted (listen socket) */
+
+ s64 max_pn_acked_seen; /* Largest packet number acknowledged by the peer */
+ u32 max_pn_acked_time; /* Time at which max_pn_acked_seen was acknowledged */
+ u32 last_sent_time; /* Time when the last ack-eliciting packet was sent */
+ u32 loss_time; /* Time after which the next packet can be declared lost */
+ u32 inflight; /* Bytes of all ack-eliciting frames in flight in this space */
+ s64 next_pn; /* Next packet number to send in this space */
+};
+
+static inline void quic_pnspace_set_max_pn_acked_seen(struct quic_pnspace *space,
+ s64 max_pn_acked_seen)
+{
+ if (space->max_pn_acked_seen >= max_pn_acked_seen)
+ return;
+ space->max_pn_acked_seen = max_pn_acked_seen;
+ space->max_pn_acked_time = jiffies_to_usecs(jiffies);
+}
+
+static inline void quic_pnspace_set_base_pn(struct quic_pnspace *space, s64 pn)
+{
+ space->base_pn = pn;
+ space->max_pn_seen = space->base_pn - 1;
+ space->last_max_pn_seen = space->max_pn_seen;
+ space->min_pn_seen = space->max_pn_seen;
+
+ space->max_pn_time = space->time;
+ space->last_max_pn_time = space->max_pn_time;
+}
+
+static inline bool quic_pnspace_has_gap(const struct quic_pnspace *space)
+{
+ return space->base_pn != space->max_pn_seen + 1;
+}
+
+static inline void quic_pnspace_inc_ecn_count(struct quic_pnspace *space, u8 ecn)
+{
+ if (!ecn)
+ return;
+ space->ecn_count[QUIC_ECN_LOCAL][ecn - 1]++;
+}
+
+/* Check if any ECN-marked packets were received. */
+static inline bool quic_pnspace_has_ecn_count(struct quic_pnspace *space)
+{
+ return space->ecn_count[QUIC_ECN_LOCAL][QUIC_ECN_ECT0] ||
+ space->ecn_count[QUIC_ECN_LOCAL][QUIC_ECN_ECT1] ||
+ space->ecn_count[QUIC_ECN_LOCAL][QUIC_ECN_CE];
+}
+
+/* Updates the stored ECN counters based on values received in the peer's ACK
+ * frame. Each counter is updated only if the new value is higher.
+ *
+ * Returns: 1 if CE count was increased (congestion indicated), 0 otherwise.
+ */
+static inline int quic_pnspace_set_ecn_count(struct quic_pnspace *space, u64 *ecn_count)
+{
+ if (space->ecn_count[QUIC_ECN_PEER][QUIC_ECN_ECT0] < ecn_count[QUIC_ECN_ECT0])
+ space->ecn_count[QUIC_ECN_PEER][QUIC_ECN_ECT0] = ecn_count[QUIC_ECN_ECT0];
+ if (space->ecn_count[QUIC_ECN_PEER][QUIC_ECN_ECT1] < ecn_count[QUIC_ECN_ECT1])
+ space->ecn_count[QUIC_ECN_PEER][QUIC_ECN_ECT1] = ecn_count[QUIC_ECN_ECT1];
+ if (space->ecn_count[QUIC_ECN_PEER][QUIC_ECN_CE] < ecn_count[QUIC_ECN_CE]) {
+ space->ecn_count[QUIC_ECN_PEER][QUIC_ECN_CE] = ecn_count[QUIC_ECN_CE];
+ return 1;
+ }
+ return 0;
+}
+
+u16 quic_pnspace_num_gabs(struct quic_pnspace *space, struct quic_gap_ack_block *gabs);
+int quic_pnspace_check(struct quic_pnspace *space, s64 pn);
+int quic_pnspace_mark(struct quic_pnspace *space, s64 pn);
+
+void quic_pnspace_free(struct quic_pnspace *space);
+int quic_pnspace_init(struct quic_pnspace *space);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 8ee4e4cd4ee3..05569b58f10b 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -38,6 +38,8 @@ static void quic_write_space(struct sock *sk)
static int quic_init_sock(struct sock *sk)
{
+ u8 i;
+
sk->sk_destruct = inet_sock_destruct;
sk->sk_write_space = quic_write_space;
sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
@@ -49,6 +51,11 @@ static int quic_init_sock(struct sock *sk)
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
+ for (i = 0; i < QUIC_PNSPACE_MAX; i++) {
+ if (quic_pnspace_init(quic_pnspace(sk, i)))
+ return -ENOMEM;
+ }
+
WRITE_ONCE(sk->sk_sndbuf, READ_ONCE(sysctl_quic_wmem[1]));
WRITE_ONCE(sk->sk_rcvbuf, READ_ONCE(sysctl_quic_rmem[1]));
@@ -62,6 +69,11 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ u8 i;
+
+ for (i = 0; i < QUIC_PNSPACE_MAX; i++)
+ quic_pnspace_free(quic_pnspace(sk, i));
+
quic_path_free(sk, quic_paths(sk), 0);
quic_path_free(sk, quic_paths(sk), 1);
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 019f8752fc87..7e87de319c40 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -11,6 +11,7 @@
#include <net/udp_tunnel.h>
#include <linux/quic.h>
+#include "pnspace.h"
#include "common.h"
#include "family.h"
#include "stream.h"
@@ -44,6 +45,7 @@ struct quic_sock {
struct quic_conn_id_set dest;
struct quic_path_group paths;
struct quic_cong cong;
+ struct quic_pnspace space[QUIC_PNSPACE_MAX];
};
struct quic6_sock {
@@ -111,6 +113,11 @@ static inline struct quic_cong *quic_cong(const struct sock *sk)
return &quic_sk(sk)->cong;
}
+static inline struct quic_pnspace *quic_pnspace(const struct sock *sk, u8 level)
+{
+ return &quic_sk(sk)->space[level % QUIC_CRYPTO_EARLY];
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
* [PATCH net-next v2 11/15] quic: add crypto key derivation and installation
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (9 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 10/15] quic: add packet number space Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 12/15] quic: add crypto packet encryption and decryption Xin Long
` (4 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'quic_crypto', a component responsible for QUIC
encryption key derivation and installation across the various key
levels: Initial, Handshake, 0-RTT (Early), and 1-RTT (Application).
It provides helpers to derive and install initial secrets, set traffic
secrets and install the corresponding keys, and perform key updates to
enable forward secrecy. Additionally, it implements stateless reset
token generation, used to support connection reset without state.
- quic_crypto_initial_keys_install(): Derive and install initial keys.
- quic_crypto_set_cipher(): Allocate all transforms based on the cipher
type provided.
- quic_crypto_set_secret(): Set the traffic secret and install derived
keys.
- quic_crypto_key_update(): Rekey and install new keys to the !phase
side.
- quic_crypto_generate_stateless_reset_token(): Generate token for
stateless reset.
These mechanisms are essential for establishing and maintaining secure
communication throughout the QUIC connection lifecycle.
Signed-off-by: Pengtao He <hepengtao@xiaomi.com>
Signed-off-by: Moritz Buhl <mbuhl@openbsd.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/crypto.c | 542 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/crypto.h | 73 ++++++
net/quic/protocol.c | 12 +
net/quic/protocol.h | 2 +
net/quic/socket.c | 2 +
net/quic/socket.h | 7 +
7 files changed, 639 insertions(+), 1 deletion(-)
create mode 100644 net/quic/crypto.c
create mode 100644 net/quic/crypto.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 9d8e18297911..58bb18f7926d 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o pnspace.o
+ cong.o pnspace.o crypto.o
diff --git a/net/quic/crypto.c b/net/quic/crypto.c
new file mode 100644
index 000000000000..860e3dfd4a28
--- /dev/null
+++ b/net/quic/crypto.c
@@ -0,0 +1,542 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Key derivation and installation for the QUIC protocol.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <crypto/skcipher.h>
+#include <linux/skbuff.h>
+#include <crypto/aead.h>
+#include <crypto/hkdf.h>
+#include <linux/quic.h>
+#include <net/tls.h>
+
+#include "common.h"
+#include "crypto.h"
+
+#define QUIC_RANDOM_DATA_LEN 32
+
+static u8 quic_random_data[QUIC_RANDOM_DATA_LEN] __read_mostly;
+
+/* HKDF-Extract. */
+static int quic_crypto_hkdf_extract(struct crypto_shash *tfm, struct quic_data *srt,
+ struct quic_data *hash, struct quic_data *key)
+{
+ return hkdf_extract(tfm, hash->data, hash->len, srt->data, srt->len, key->data);
+}
+
+#define QUIC_MAX_INFO_LEN 256
+
+/* HKDF-Expand-Label. */
+static int quic_crypto_hkdf_expand(struct crypto_shash *tfm, struct quic_data *srt,
+ struct quic_data *label, struct quic_data *hash,
+ struct quic_data *key)
+{
+ u8 info[QUIC_MAX_INFO_LEN], *p = info;
+ u8 LABEL[] = "tls13 ";
+ u32 infolen;
+ int err;
+
+ /* rfc8446#section-7.1:
+ *
+ * HKDF-Expand-Label(Secret, Label, Context, Length) =
+ * HKDF-Expand(Secret, HkdfLabel, Length)
+ *
+ * Where HkdfLabel is specified as:
+ *
+ * struct {
+ * uint16 length = Length;
+ * opaque label<7..255> = "tls13 " + Label;
+ * opaque context<0..255> = Context;
+ * } HkdfLabel;
+ */
+ *p++ = (u8)(key->len / QUIC_MAX_INFO_LEN);
+ *p++ = (u8)(key->len % QUIC_MAX_INFO_LEN);
+ *p++ = (u8)(sizeof(LABEL) - 1 + label->len);
+ p = quic_put_data(p, LABEL, sizeof(LABEL) - 1);
+ p = quic_put_data(p, label->data, label->len);
+ if (hash) {
+ *p++ = (u8)hash->len;
+ p = quic_put_data(p, hash->data, hash->len);
+ } else {
+ *p++ = 0;
+ }
+ infolen = (u32)(p - info);
+
+ err = crypto_shash_setkey(tfm, srt->data, srt->len);
+ if (err)
+ return err;
+
+ return hkdf_expand(tfm, info, infolen, key->data, key->len);
+}
+
+#define KEY_LABEL_V1 "quic key"
+#define IV_LABEL_V1 "quic iv"
+#define HP_KEY_LABEL_V1 "quic hp"
+
+#define KU_LABEL_V1 "quic ku"
+
+/* rfc9369#section-3.3.2:
+ *
+ * The labels used in rfc9001 to derive packet protection keys, header protection keys, Retry
+ * Integrity Tag keys, and key updates change from "quic key" to "quicv2 key", from "quic iv"
+ * to "quicv2 iv", from "quic hp" to "quicv2 hp", and from "quic ku" to "quicv2 ku".
+ */
+#define KEY_LABEL_V2 "quicv2 key"
+#define IV_LABEL_V2 "quicv2 iv"
+#define HP_KEY_LABEL_V2 "quicv2 hp"
+
+#define KU_LABEL_V2 "quicv2 ku"
+
+/* Packet Protection Keys. */
+static int quic_crypto_keys_derive(struct crypto_shash *tfm, struct quic_data *s,
+ struct quic_data *k, struct quic_data *i,
+ struct quic_data *hp_k, u32 version)
+{
+ struct quic_data hp_k_l = {HP_KEY_LABEL_V1, strlen(HP_KEY_LABEL_V1)};
+ struct quic_data k_l = {KEY_LABEL_V1, strlen(KEY_LABEL_V1)};
+ struct quic_data i_l = {IV_LABEL_V1, strlen(IV_LABEL_V1)};
+ struct quic_data z = {};
+ int err;
+
+ /* rfc9001#section-5.1:
+ *
+ * The current encryption level secret and the label "quic key" are input to the
+ * KDF to produce the AEAD key; the label "quic iv" is used to derive the
+ * Initialization Vector (IV). The header protection key uses the "quic hp" label.
+ * Using these labels provides key separation between QUIC and TLS.
+ */
+ if (version == QUIC_VERSION_V2) {
+ quic_data(&hp_k_l, HP_KEY_LABEL_V2, strlen(HP_KEY_LABEL_V2));
+ quic_data(&k_l, KEY_LABEL_V2, strlen(KEY_LABEL_V2));
+ quic_data(&i_l, IV_LABEL_V2, strlen(IV_LABEL_V2));
+ }
+
+ err = quic_crypto_hkdf_expand(tfm, s, &k_l, &z, k);
+ if (err)
+ return err;
+ err = quic_crypto_hkdf_expand(tfm, s, &i_l, &z, i);
+ if (err)
+ return err;
+ /* Don't change hp key for key update. */
+ if (!hp_k)
+ return 0;
+
+ return quic_crypto_hkdf_expand(tfm, s, &hp_k_l, &z, hp_k);
+}
+
+/* Derive and install transmission (TX) packet protection keys for the current key phase.
+ * This involves generating AEAD encryption key, IV, and optionally header protection key.
+ */
+static int quic_crypto_tx_keys_derive_and_install(struct quic_crypto *crypto)
+{
+ struct quic_data srt = {}, k, iv, hp_k = {}, *hp = NULL;
+ u8 tx_key[QUIC_KEY_LEN], tx_hp_key[QUIC_KEY_LEN];
+ int err, phase = crypto->key_phase;
+ u32 keylen, ivlen = QUIC_IV_LEN;
+
+ keylen = crypto->cipher->keylen;
+ quic_data(&srt, crypto->tx_secret, crypto->cipher->secretlen);
+ quic_data(&k, tx_key, keylen);
+ quic_data(&iv, crypto->tx_iv[phase], ivlen);
+ /* Only derive header protection key when not in key update. */
+ if (!crypto->key_pending)
+ hp = quic_data(&hp_k, tx_hp_key, keylen);
+ err = quic_crypto_keys_derive(crypto->secret_tfm, &srt, &k, &iv, hp, crypto->version);
+ if (err)
+ return err;
+ err = crypto_aead_setauthsize(crypto->tx_tfm[phase], QUIC_TAG_LEN);
+ if (err)
+ return err;
+ err = crypto_aead_setkey(crypto->tx_tfm[phase], tx_key, keylen);
+ if (err)
+ return err;
+ if (hp) {
+ err = crypto_skcipher_setkey(crypto->tx_hp_tfm, tx_hp_key, keylen);
+ if (err)
+ return err;
+ }
+ pr_debug("%s: k: %16phN, iv: %12phN, hp_k:%16phN\n", __func__, k.data, iv.data, hp_k.data);
+ return 0;
+}
+
+/* Derive and install reception (RX) packet protection keys for the current key phase.
+ * This installs AEAD decryption key, IV, and optionally header protection key.
+ */
+static int quic_crypto_rx_keys_derive_and_install(struct quic_crypto *crypto)
+{
+ struct quic_data srt = {}, k, iv, hp_k = {}, *hp = NULL;
+ u8 rx_key[QUIC_KEY_LEN], rx_hp_key[QUIC_KEY_LEN];
+ int err, phase = crypto->key_phase;
+ u32 keylen, ivlen = QUIC_IV_LEN;
+
+ keylen = crypto->cipher->keylen;
+ quic_data(&srt, crypto->rx_secret, crypto->cipher->secretlen);
+ quic_data(&k, rx_key, keylen);
+ quic_data(&iv, crypto->rx_iv[phase], ivlen);
+ /* Only derive header protection key when not in key update. */
+ if (!crypto->key_pending)
+ hp = quic_data(&hp_k, rx_hp_key, keylen);
+ err = quic_crypto_keys_derive(crypto->secret_tfm, &srt, &k, &iv, hp, crypto->version);
+ if (err)
+ return err;
+ err = crypto_aead_setauthsize(crypto->rx_tfm[phase], QUIC_TAG_LEN);
+ if (err)
+ return err;
+ err = crypto_aead_setkey(crypto->rx_tfm[phase], rx_key, keylen);
+ if (err)
+ return err;
+ if (hp) {
+ err = crypto_skcipher_setkey(crypto->rx_hp_tfm, rx_hp_key, keylen);
+ if (err)
+ return err;
+ }
+ pr_debug("%s: k: %16phN, iv: %12phN, hp_k:%16phN\n", __func__, k.data, iv.data, hp_k.data);
+ return 0;
+}
+
+#define QUIC_CIPHER_MIN TLS_CIPHER_AES_GCM_128
+#define QUIC_CIPHER_MAX TLS_CIPHER_CHACHA20_POLY1305
+
+#define TLS_CIPHER_AES_GCM_128_SECRET_SIZE 32
+#define TLS_CIPHER_AES_GCM_256_SECRET_SIZE 48
+#define TLS_CIPHER_AES_CCM_128_SECRET_SIZE 32
+#define TLS_CIPHER_CHACHA20_POLY1305_SECRET_SIZE 32
+
+#define CIPHER_DESC(type, aead_name, skc_name, sha_name)[type - QUIC_CIPHER_MIN] = { \
+ .secretlen = type ## _SECRET_SIZE, \
+ .keylen = type ## _KEY_SIZE, \
+ .aead = aead_name, \
+ .skc = skc_name, \
+ .shash = sha_name, \
+}
+
+static struct quic_cipher ciphers[QUIC_CIPHER_MAX + 1 - QUIC_CIPHER_MIN] = {
+ CIPHER_DESC(TLS_CIPHER_AES_GCM_128, "gcm(aes)", "ecb(aes)", "hmac(sha256)"),
+ CIPHER_DESC(TLS_CIPHER_AES_GCM_256, "gcm(aes)", "ecb(aes)", "hmac(sha384)"),
+ CIPHER_DESC(TLS_CIPHER_AES_CCM_128, "ccm(aes)", "ecb(aes)", "hmac(sha256)"),
+ CIPHER_DESC(TLS_CIPHER_CHACHA20_POLY1305,
+ "rfc7539(chacha20,poly1305)", "chacha20", "hmac(sha256)"),
+};
+
+int quic_crypto_set_cipher(struct quic_crypto *crypto, u32 type, u8 flag)
+{
+ struct quic_cipher *cipher;
+ int err = -EINVAL;
+ void *tfm;
+
+ if (type < QUIC_CIPHER_MIN || type > QUIC_CIPHER_MAX)
+ return -EINVAL;
+
+ cipher = &ciphers[type - QUIC_CIPHER_MIN];
+ tfm = crypto_alloc_shash(cipher->shash, 0, 0);
+ if (IS_ERR(tfm))
+ return PTR_ERR(tfm);
+ crypto->secret_tfm = tfm;
+
+ /* Request only synchronous crypto by specifying CRYPTO_ALG_ASYNC. This
+ * ensures tag generation does not rely on async callbacks.
+ */
+ tfm = crypto_alloc_aead(cipher->aead, 0, CRYPTO_ALG_ASYNC);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tag_tfm = tfm;
+
+ /* Allocate AEAD and HP transform for each RX key phase. */
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->rx_tfm[0] = tfm;
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->rx_tfm[1] = tfm;
+ tfm = crypto_alloc_sync_skcipher(cipher->skc, 0, 0);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->rx_hp_tfm = tfm;
+
+ /* Allocate AEAD and HP transform for each TX key phase. */
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tx_tfm[0] = tfm;
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tx_tfm[1] = tfm;
+ tfm = crypto_alloc_sync_skcipher(cipher->skc, 0, 0);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tx_hp_tfm = tfm;
+
+ crypto->cipher = cipher;
+ crypto->cipher_type = type;
+ return 0;
+err:
+ quic_crypto_free(crypto);
+ return err;
+}
+
+int quic_crypto_set_secret(struct quic_crypto *crypto, struct quic_crypto_secret *srt,
+ u32 version, u8 flag)
+{
+ int err;
+
+ /* If no cipher has been initialized yet, set it up. */
+ if (!crypto->cipher) {
+ err = quic_crypto_set_cipher(crypto, srt->type, flag);
+ if (err)
+ return err;
+ }
+
+ /* Handle RX path setup. */
+ if (!srt->send) {
+ crypto->version = version;
+ memcpy(crypto->rx_secret, srt->secret, crypto->cipher->secretlen);
+ err = quic_crypto_rx_keys_derive_and_install(crypto);
+ if (err)
+ return err;
+ crypto->recv_ready = 1;
+ return 0;
+ }
+
+ /* Handle TX path setup. */
+ crypto->version = version;
+ memcpy(crypto->tx_secret, srt->secret, crypto->cipher->secretlen);
+ err = quic_crypto_tx_keys_derive_and_install(crypto);
+ if (err)
+ return err;
+ crypto->send_ready = 1;
+ return 0;
+}
+
+int quic_crypto_get_secret(struct quic_crypto *crypto, struct quic_crypto_secret *srt)
+{
+ u8 *secret;
+
+ if (!crypto->cipher)
+ return -EINVAL;
+ srt->type = crypto->cipher_type;
+ secret = srt->send ? crypto->tx_secret : crypto->rx_secret;
+ memcpy(srt->secret, secret, crypto->cipher->secretlen);
+ return 0;
+}
+
+/* Initiating a Key Update. */
+int quic_crypto_key_update(struct quic_crypto *crypto)
+{
+ u8 tx_secret[QUIC_SECRET_LEN], rx_secret[QUIC_SECRET_LEN];
+ struct quic_data l = {KU_LABEL_V1, strlen(KU_LABEL_V1)};
+ struct quic_data z = {}, k, srt;
+ u32 secret_len;
+ int err;
+
+ if (crypto->key_pending || !crypto->recv_ready)
+ return -EINVAL;
+
+ /* rfc9001#section-6.1:
+ *
+ * Endpoints maintain separate read and write secrets for packet protection. An
+ * endpoint initiates a key update by updating its packet protection write secret
+ * and using that to protect new packets. The endpoint creates a new write secret
+ * from the existing write secret. This uses the KDF function provided by TLS with
+ * a label of "quic ku". The corresponding key and IV are created from that
+ * secret. The header protection key is not updated.
+ *
+ * For example, to update write keys with TLS 1.3, HKDF-Expand-Label is used as:
+ * secret_<n+1> = HKDF-Expand-Label(secret_<n>, "quic ku",
+ * "", Hash.length)
+ */
+ secret_len = crypto->cipher->secretlen;
+ if (crypto->version == QUIC_VERSION_V2)
+ quic_data(&l, KU_LABEL_V2, strlen(KU_LABEL_V2));
+
+ crypto->key_pending = 1;
+ memcpy(tx_secret, crypto->tx_secret, secret_len);
+ memcpy(rx_secret, crypto->rx_secret, secret_len);
+ crypto->key_phase = !crypto->key_phase;
+
+ quic_data(&srt, tx_secret, secret_len);
+ quic_data(&k, crypto->tx_secret, secret_len);
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &srt, &l, &z, &k);
+ if (err)
+ goto err;
+ err = quic_crypto_tx_keys_derive_and_install(crypto);
+ if (err)
+ goto err;
+
+ quic_data(&srt, rx_secret, secret_len);
+ quic_data(&k, crypto->rx_secret, secret_len);
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &srt, &l, &z, &k);
+ if (err)
+ goto err;
+ err = quic_crypto_rx_keys_derive_and_install(crypto);
+ if (err)
+ goto err;
+ return 0;
+err:
+ crypto->key_pending = 0;
+ memcpy(crypto->tx_secret, tx_secret, secret_len);
+ memcpy(crypto->rx_secret, rx_secret, secret_len);
+ crypto->key_phase = !crypto->key_phase;
+ return err;
+}
+
+void quic_crypto_free(struct quic_crypto *crypto)
+{
+ if (crypto->tag_tfm)
+ crypto_free_aead(crypto->tag_tfm);
+ if (crypto->rx_tfm[0])
+ crypto_free_aead(crypto->rx_tfm[0]);
+ if (crypto->rx_tfm[1])
+ crypto_free_aead(crypto->rx_tfm[1]);
+ if (crypto->tx_tfm[0])
+ crypto_free_aead(crypto->tx_tfm[0]);
+ if (crypto->tx_tfm[1])
+ crypto_free_aead(crypto->tx_tfm[1]);
+ if (crypto->secret_tfm)
+ crypto_free_shash(crypto->secret_tfm);
+ if (crypto->rx_hp_tfm)
+ crypto_free_skcipher(crypto->rx_hp_tfm);
+ if (crypto->tx_hp_tfm)
+ crypto_free_skcipher(crypto->tx_hp_tfm);
+
+ memset(crypto, 0, offsetof(struct quic_crypto, send_offset));
+}
+
+#define QUIC_INITIAL_SALT_V1 \
+ "\x38\x76\x2c\xf7\xf5\x59\x34\xb3\x4d\x17\x9a\xe6\xa4\xc8\x0c\xad\xcc\xbb\x7f\x0a"
+#define QUIC_INITIAL_SALT_V2 \
+ "\x0d\xed\xe3\xde\xf7\x00\xa6\xdb\x81\x93\x81\xbe\x6e\x26\x9d\xcb\xf9\xbd\x2e\xd9"
+
+#define QUIC_INITIAL_SALT_LEN 20
+
+/* Initial Secrets. */
+int quic_crypto_initial_keys_install(struct quic_crypto *crypto, struct quic_conn_id *conn_id,
+ u32 version, bool is_serv)
+{
+ u8 secret[TLS_CIPHER_AES_GCM_128_SECRET_SIZE];
+ struct quic_data salt, s, k, l, dcid, z = {};
+ struct quic_crypto_secret srt = {};
+ char *tl, *rl, *sal;
+ int err;
+
+ /* rfc9001#section-5.2:
+ *
+ * The secret used by clients to construct Initial packets uses the PRK and the
+ * label "client in" as input to the HKDF-Expand-Label function from TLS [TLS13]
+ * to produce a 32-byte secret. Packets constructed by the server use the same
+ * process with the label "server in". The hash function for HKDF when deriving
+ * initial secrets and keys is SHA-256 [SHA].
+ *
+ * This process in pseudocode is:
+ *
+ * initial_salt = 0x38762cf7f55934b34d179ae6a4c80cadccbb7f0a
+ * initial_secret = HKDF-Extract(initial_salt,
+ * client_dst_connection_id)
+ *
+ * client_initial_secret = HKDF-Expand-Label(initial_secret,
+ * "client in", "",
+ * Hash.length)
+ * server_initial_secret = HKDF-Expand-Label(initial_secret,
+ * "server in", "",
+ * Hash.length)
+ */
+ if (is_serv) {
+ rl = "client in";
+ tl = "server in";
+ } else {
+ tl = "client in";
+ rl = "server in";
+ }
+ sal = QUIC_INITIAL_SALT_V1;
+ if (version == QUIC_VERSION_V2)
+ sal = QUIC_INITIAL_SALT_V2;
+ quic_data(&salt, sal, QUIC_INITIAL_SALT_LEN);
+ quic_data(&dcid, conn_id->data, conn_id->len);
+ quic_data(&s, secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ err = quic_crypto_hkdf_extract(crypto->secret_tfm, &salt, &dcid, &s);
+ if (err)
+ return err;
+
+ quic_data(&l, tl, strlen(tl));
+ quic_data(&k, srt.secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ srt.type = TLS_CIPHER_AES_GCM_128;
+ srt.send = 1;
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &s, &l, &z, &k);
+ if (err)
+ return err;
+ err = quic_crypto_set_secret(crypto, &srt, version, 0);
+ if (err)
+ return err;
+
+ quic_data(&l, rl, strlen(rl));
+ quic_data(&k, srt.secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ srt.type = TLS_CIPHER_AES_GCM_128;
+ srt.send = 0;
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &s, &l, &z, &k);
+ if (err)
+ return err;
+ return quic_crypto_set_secret(crypto, &srt, version, 0);
+}
+
+/* Generate a derived key using HKDF-Extract and HKDF-Expand with a given label. */
+static int quic_crypto_generate_key(struct quic_crypto *crypto, void *data, u32 len,
+ char *label, u8 *token, u32 key_len)
+{
+ struct crypto_shash *tfm = crypto->secret_tfm;
+ u8 secret[TLS_CIPHER_AES_GCM_128_SECRET_SIZE];
+ struct quic_data salt, s, l, k, z = {};
+ int err;
+
+ quic_data(&salt, data, len);
+ quic_data(&k, quic_random_data, QUIC_RANDOM_DATA_LEN);
+ quic_data(&s, secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ err = quic_crypto_hkdf_extract(tfm, &salt, &k, &s);
+ if (err)
+ return err;
+
+ quic_data(&l, label, strlen(label));
+ quic_data(&k, token, key_len);
+ return quic_crypto_hkdf_expand(tfm, &s, &l, &z, &k);
+}
+
+/* Derive a stateless reset token from connection-specific input. */
+int quic_crypto_generate_stateless_reset_token(struct quic_crypto *crypto, void *data,
+ u32 len, u8 *key, u32 key_len)
+{
+ return quic_crypto_generate_key(crypto, data, len, "stateless_reset", key, key_len);
+}
+
+/* Derive a session ticket key using HKDF from connection-specific input. */
+int quic_crypto_generate_session_ticket_key(struct quic_crypto *crypto, void *data,
+ u32 len, u8 *key, u32 key_len)
+{
+ return quic_crypto_generate_key(crypto, data, len, "session_ticket", key, key_len);
+}
+
+void quic_crypto_init(void)
+{
+ get_random_bytes(quic_random_data, QUIC_RANDOM_DATA_LEN);
+}
diff --git a/net/quic/crypto.h b/net/quic/crypto.h
new file mode 100644
index 000000000000..2bc960a8489e
--- /dev/null
+++ b/net/quic/crypto.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_TAG_LEN 16
+#define QUIC_IV_LEN 12
+#define QUIC_KEY_LEN 32
+#define QUIC_SECRET_LEN 48
+
+#define QUIC_TOKEN_FLAG_REGULAR 0
+#define QUIC_TOKEN_FLAG_RETRY 1
+#define QUIC_TOKEN_TIMEOUT_REGULAR 3000000
+#define QUIC_TOKEN_TIMEOUT_RETRY 600000000
+
+struct quic_cipher {
+ u32 secretlen; /* Length of the traffic secret */
+ u32 keylen; /* Length of the AEAD key */
+
+ char *shash; /* Name of hash algorithm used for key derivation */
+ char *aead; /* Name of AEAD algorithm used for payload en/decryption */
+ char *skc; /* Name of cipher algorithm used for header protection */
+};
+
+struct quic_crypto {
+ struct crypto_skcipher *tx_hp_tfm; /* Transform for TX header protection */
+ struct crypto_skcipher *rx_hp_tfm; /* Transform for RX header protection */
+ struct crypto_shash *secret_tfm; /* Transform for key derivation (HKDF) */
+ struct crypto_aead *tx_tfm[2]; /* AEAD transform for TX (key phase 0 and 1) */
+ struct crypto_aead *rx_tfm[2]; /* AEAD transform for RX (key phase 0 and 1) */
+ struct crypto_aead *tag_tfm; /* AEAD transform used for Retry token validation */
+ struct quic_cipher *cipher; /* Cipher information (selected cipher suite) */
+ u32 cipher_type; /* Cipher suite (e.g., AES_GCM_128, etc.) */
+
+ u8 tx_secret[QUIC_SECRET_LEN]; /* TX secret derived or provided by user space */
+ u8 rx_secret[QUIC_SECRET_LEN]; /* RX secret derived or provided by user space */
+ u8 tx_iv[2][QUIC_IV_LEN]; /* IVs for TX (key phase 0 and 1) */
+ u8 rx_iv[2][QUIC_IV_LEN]; /* IVs for RX (key phase 0 and 1) */
+
+ u32 key_update_send_time; /* Time when 1st packet was sent after key update */
+ u32 key_update_time; /* Time to retain old keys after key update */
+ u32 version; /* QUIC version in use */
+
+ u8 ticket_ready:1; /* True if a session ticket is ready to read */
+ u8 key_pending:1; /* A key update is in progress */
+ u8 send_ready:1; /* TX encryption context is initialized */
+ u8 recv_ready:1; /* RX decryption context is initialized */
+ u8 key_phase:1; /* Current key phase being used (0 or 1) */
+
+ u64 send_offset; /* Number of handshake bytes sent by user at this level */
+ u64 recv_offset; /* Number of handshake bytes read by user at this level */
+};
+
+int quic_crypto_set_secret(struct quic_crypto *crypto, struct quic_crypto_secret *srt,
+ u32 version, u8 flag);
+int quic_crypto_get_secret(struct quic_crypto *crypto, struct quic_crypto_secret *srt);
+int quic_crypto_set_cipher(struct quic_crypto *crypto, u32 type, u8 flag);
+int quic_crypto_key_update(struct quic_crypto *crypto);
+
+int quic_crypto_initial_keys_install(struct quic_crypto *crypto, struct quic_conn_id *conn_id,
+ u32 version, bool is_serv);
+int quic_crypto_generate_session_ticket_key(struct quic_crypto *crypto, void *data,
+ u32 len, u8 *key, u32 key_len);
+int quic_crypto_generate_stateless_reset_token(struct quic_crypto *crypto, void *data,
+ u32 len, u8 *key, u32 key_len);
+
+void quic_crypto_free(struct quic_crypto *crypto);
+void quic_crypto_init(void);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index 08eb3b81f62f..fb98ef10f852 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -258,9 +258,18 @@ static int __net_init quic_net_init(struct net *net)
if (!qn->stat)
return -ENOMEM;
+ err = quic_crypto_set_cipher(&qn->crypto, TLS_CIPHER_AES_GCM_128, CRYPTO_ALG_ASYNC);
+ if (err) {
+ free_percpu(qn->stat);
+ qn->stat = NULL;
+ return err;
+ }
+ spin_lock_init(&qn->lock);
+
#ifdef CONFIG_PROC_FS
err = quic_net_proc_init(net);
if (err) {
+ quic_crypto_free(&qn->crypto);
free_percpu(qn->stat);
qn->stat = NULL;
}
@@ -275,6 +284,7 @@ static void __net_exit quic_net_exit(struct net *net)
#ifdef CONFIG_PROC_FS
quic_net_proc_exit(net);
#endif
+ quic_crypto_free(&qn->crypto);
free_percpu(qn->stat);
qn->stat = NULL;
}
@@ -323,6 +333,8 @@ static __init int quic_init(void)
sysctl_quic_wmem[1] = 16 * 1024;
sysctl_quic_wmem[2] = max(64 * 1024, max_share);
+ quic_crypto_init();
+
err = percpu_counter_init(&quic_sockets_allocated, 0, GFP_KERNEL);
if (err)
goto err_percpu_counter;
diff --git a/net/quic/protocol.h b/net/quic/protocol.h
index 6e6c5a6fc3f8..1df926ef0a75 100644
--- a/net/quic/protocol.h
+++ b/net/quic/protocol.h
@@ -47,6 +47,8 @@ struct quic_net {
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *proc_net; /* procfs entry for dumping QUIC socket stats */
#endif
+ struct quic_crypto crypto; /* Context for decrypting Initial packets for ALPN */
+ spinlock_t lock; /* Lock protecting crypto context for Initial packet decryption */
};
struct quic_net *quic_net(struct net *net);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 05569b58f10b..2425494a3df3 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -73,6 +73,8 @@ static void quic_destroy_sock(struct sock *sk)
for (i = 0; i < QUIC_PNSPACE_MAX; i++)
quic_pnspace_free(quic_pnspace(sk, i));
+ for (i = 0; i < QUIC_CRYPTO_MAX; i++)
+ quic_crypto_free(quic_crypto(sk, i));
quic_path_free(sk, quic_paths(sk), 0);
quic_path_free(sk, quic_paths(sk), 1);
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 7e87de319c40..c8df98351c6b 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -16,6 +16,7 @@
#include "family.h"
#include "stream.h"
#include "connid.h"
+#include "crypto.h"
#include "path.h"
#include "cong.h"
@@ -46,6 +47,7 @@ struct quic_sock {
struct quic_path_group paths;
struct quic_cong cong;
struct quic_pnspace space[QUIC_PNSPACE_MAX];
+ struct quic_crypto crypto[QUIC_CRYPTO_MAX];
};
struct quic6_sock {
@@ -118,6 +120,11 @@ static inline struct quic_pnspace *quic_pnspace(const struct sock *sk, u8 level)
return &quic_sk(sk)->space[level % QUIC_CRYPTO_EARLY];
}
+static inline struct quic_crypto *quic_crypto(const struct sock *sk, u8 level)
+{
+ return &quic_sk(sk)->crypto[level];
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
* [PATCH net-next v2 12/15] quic: add crypto packet encryption and decryption
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (10 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 11/15] quic: add crypto key derivation and installation Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 13/15] quic: add timer management Xin Long
` (3 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch adds core support for packet-level encryption and decryption
using AEAD, including both payload protection and QUIC header protection.
It introduces helpers to encrypt packets before transmission and to
remove header protection and decrypt payloads upon reception, in line
with QUIC's cryptographic requirements.
- quic_crypto_encrypt(): Perform header protection and payload
encryption (TX).
- quic_crypto_decrypt(): Perform header protection removal and
payload decryption (RX).
The patch also includes support for Retry token handling. It provides
helpers to compute the Retry integrity tag, generate tokens for address
validation, and verify tokens received from clients during the
handshake phase.
- quic_crypto_get_retry_tag(): Compute the integrity tag for Retry packets.
- quic_crypto_generate_token(): Generate a Retry or address validation token.
- quic_crypto_verify_token(): Verify a Retry or address validation token.
These additions establish the cryptographic primitives necessary for
secure QUIC packet exchange and address validation.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/crypto.c | 658 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/crypto.h | 10 +
2 files changed, 668 insertions(+)
diff --git a/net/quic/crypto.c b/net/quic/crypto.c
index 860e3dfd4a28..bb8fc6acce08 100644
--- a/net/quic/crypto.c
+++ b/net/quic/crypto.c
@@ -201,6 +201,343 @@ static int quic_crypto_rx_keys_derive_and_install(struct quic_crypto *crypto)
return 0;
}
+static void *quic_crypto_skcipher_mem_alloc(struct crypto_skcipher *tfm, u32 mask_size,
+ u8 **iv, struct skcipher_request **req)
+{
+ unsigned int iv_size, req_size;
+ unsigned int len;
+ u8 *mem;
+
+ iv_size = crypto_skcipher_ivsize(tfm);
+ req_size = sizeof(**req) + crypto_skcipher_reqsize(tfm);
+
+ len = mask_size;
+ len += iv_size;
+ len += crypto_skcipher_alignmask(tfm) & ~(crypto_tfm_ctx_alignment() - 1);
+ len = ALIGN(len, crypto_tfm_ctx_alignment());
+ len += req_size;
+
+ mem = kzalloc(len, GFP_ATOMIC);
+ if (!mem)
+ return NULL;
+
+ *iv = (u8 *)PTR_ALIGN(mem + mask_size, crypto_skcipher_alignmask(tfm) + 1);
+ *req = (struct skcipher_request *)PTR_ALIGN(*iv + iv_size,
+ crypto_tfm_ctx_alignment());
+
+ return (void *)mem;
+}
+
+#define QUIC_SAMPLE_LEN 16
+#define QUIC_MAX_PN_LEN 4
+
+#define QUIC_HEADER_FORM_BIT 0x80
+#define QUIC_LONG_HEADER_MASK 0x0f
+#define QUIC_SHORT_HEADER_MASK 0x1f
+
+/* Header Protection. */
+static int quic_crypto_header_encrypt(struct crypto_skcipher *tfm, struct sk_buff *skb, bool chacha)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct skcipher_request *req;
+ struct scatterlist sg;
+ u8 *mask, *iv, *p;
+ int err, i;
+
+ mask = quic_crypto_skcipher_mem_alloc(tfm, QUIC_SAMPLE_LEN, &iv, &req);
+ if (!mask)
+ return -ENOMEM;
+
+ /* rfc9001#section-5.4.2: Header Protection Sample:
+ *
+ * # pn_offset is the start of the Packet Number field.
+ * sample_offset = pn_offset + 4
+ *
+ * sample = packet[sample_offset..sample_offset+sample_length]
+ *
+ * rfc9001#section-5.4.3: AES-Based Header Protection:
+ *
+ * header_protection(hp_key, sample):
+ * mask = AES-ECB(hp_key, sample)
+ *
+ * rfc9001#section-5.4.4: ChaCha20-Based Header Protection:
+ *
+ * header_protection(hp_key, sample):
+ * counter = sample[0..3]
+ * nonce = sample[4..15]
+ * mask = ChaCha20(hp_key, counter, nonce, {0,0,0,0,0})
+ */
+ memcpy((chacha ? iv : mask), skb->data + cb->number_offset + QUIC_MAX_PN_LEN,
+ QUIC_SAMPLE_LEN);
+ sg_init_one(&sg, mask, QUIC_SAMPLE_LEN);
+ skcipher_request_set_tfm(req, tfm);
+ skcipher_request_set_crypt(req, &sg, &sg, QUIC_SAMPLE_LEN, iv);
+ err = crypto_skcipher_encrypt(req);
+ if (err)
+ goto err;
+
+ /* rfc9001#section-5.4.1:
+ *
+ * mask = header_protection(hp_key, sample)
+ *
+ * pn_length = (packet[0] & 0x03) + 1
+ * if (packet[0] & 0x80) == 0x80:
+ * # Long header: 4 bits masked
+ * packet[0] ^= mask[0] & 0x0f
+ * else:
+ * # Short header: 5 bits masked
+ * packet[0] ^= mask[0] & 0x1f
+ *
+ * # pn_offset is the start of the Packet Number field.
+ * packet[pn_offset:pn_offset+pn_length] ^= mask[1:1+pn_length]
+ */
+ p = skb->data;
+ *p = (u8)(*p ^ (mask[0] & (((*p & QUIC_HEADER_FORM_BIT) == QUIC_HEADER_FORM_BIT) ?
+ QUIC_LONG_HEADER_MASK : QUIC_SHORT_HEADER_MASK)));
+ p = skb->data + cb->number_offset;
+ for (i = 1; i <= cb->number_len; i++)
+ *p++ ^= mask[i];
+err:
+ kfree(mask);
+ return err;
+}
+
+/* Extracts and reconstructs the packet number from an incoming QUIC packet. */
+static void quic_crypto_get_header(struct sk_buff *skb)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct quichdr *hdr = quic_hdr(skb);
+ u32 len = QUIC_MAX_PN_LEN;
+ u8 *p = (u8 *)hdr;
+
+ /* rfc9000#section-17.1:
+ *
+ * Once header protection is removed, the packet number is decoded by finding the packet
+ * number value that is closest to the next expected packet. The next expected packet is
+ * the highest received packet number plus one.
+ */
+ p += cb->number_offset;
+ cb->key_phase = hdr->key;
+ cb->number_len = hdr->pnl + 1;
+ quic_get_int(&p, &len, &cb->number, cb->number_len);
+ cb->number = quic_get_num(cb->number_max, cb->number, cb->number_len);
+
+ if (cb->number > cb->number_max)
+ cb->number_max = cb->number;
+}
+
+#define QUIC_PN_LEN_BITS_MASK 0x03
+
+static int quic_crypto_header_decrypt(struct crypto_skcipher *tfm, struct sk_buff *skb, bool chacha)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct quichdr *hdr = quic_hdr(skb);
+ int err, i, len = cb->length;
+ struct skcipher_request *req;
+ struct scatterlist sg;
+ u8 *mask, *iv, *p;
+
+ mask = quic_crypto_skcipher_mem_alloc(tfm, QUIC_SAMPLE_LEN, &iv, &req);
+ if (!mask)
+ return -ENOMEM;
+
+ if (len < QUIC_MAX_PN_LEN + QUIC_SAMPLE_LEN) {
+ err = -EINVAL;
+ goto err;
+ }
+
+ /* Similar logic to quic_crypto_header_encrypt(). */
+ p = (u8 *)hdr + cb->number_offset;
+ memcpy((chacha ? iv : mask), p + QUIC_MAX_PN_LEN, QUIC_SAMPLE_LEN);
+ sg_init_one(&sg, mask, QUIC_SAMPLE_LEN);
+ skcipher_request_set_tfm(req, tfm);
+ skcipher_request_set_crypt(req, &sg, &sg, QUIC_SAMPLE_LEN, iv);
+ err = crypto_skcipher_encrypt(req);
+ if (err)
+ goto err;
+
+ p = (u8 *)hdr;
+ *p = (u8)(*p ^ (mask[0] & (((*p & QUIC_HEADER_FORM_BIT) == QUIC_HEADER_FORM_BIT) ?
+ QUIC_LONG_HEADER_MASK : QUIC_SHORT_HEADER_MASK)));
+ cb->number_len = (*p & QUIC_PN_LEN_BITS_MASK) + 1;
+ p += cb->number_offset;
+ for (i = 0; i < cb->number_len; ++i)
+ *(p + i) = *((u8 *)hdr + cb->number_offset + i) ^ mask[i + 1];
+ quic_crypto_get_header(skb);
+
+err:
+ kfree(mask);
+ return err;
+}
+
+static void *quic_crypto_aead_mem_alloc(struct crypto_aead *tfm, u32 ctx_size,
+ u8 **iv, struct aead_request **req,
+ struct scatterlist **sg, u32 nsg)
+{
+ unsigned int iv_size, req_size;
+ unsigned int len;
+ u8 *mem;
+
+ iv_size = crypto_aead_ivsize(tfm);
+ req_size = sizeof(**req) + crypto_aead_reqsize(tfm);
+
+ len = ctx_size;
+ len += iv_size;
+ len += crypto_aead_alignmask(tfm) & ~(crypto_tfm_ctx_alignment() - 1);
+ len = ALIGN(len, crypto_tfm_ctx_alignment());
+ len += req_size;
+ len = ALIGN(len, __alignof__(struct scatterlist));
+ len += nsg * sizeof(**sg);
+
+ mem = kzalloc(len, GFP_ATOMIC);
+ if (!mem)
+ return NULL;
+
+ *iv = (u8 *)PTR_ALIGN(mem + ctx_size, crypto_aead_alignmask(tfm) + 1);
+ *req = (struct aead_request *)PTR_ALIGN(*iv + iv_size,
+ crypto_tfm_ctx_alignment());
+ *sg = (struct scatterlist *)PTR_ALIGN((u8 *)*req + req_size,
+ __alignof__(struct scatterlist));
+
+ return (void *)mem;
+}
+
+static void quic_crypto_destruct_skb(struct sk_buff *skb)
+{
+ kfree(skb_shinfo(skb)->destructor_arg);
+ sock_efree(skb);
+}
+
+static void quic_crypto_done(void *data, int err)
+{
+ struct sk_buff *skb = data;
+
+ QUIC_SKB_CB(skb)->crypto_done(skb, err);
+}
+
+/* AEAD Usage. */
+static int quic_crypto_payload_encrypt(struct crypto_aead *tfm, struct sk_buff *skb,
+ u8 *tx_iv, bool ccm)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct quichdr *hdr = quic_hdr(skb);
+ u8 *iv, i, nonce[QUIC_IV_LEN];
+ struct aead_request *req;
+ struct sk_buff *trailer;
+ struct scatterlist *sg;
+ u32 nsg, hlen, len;
+ void *ctx;
+ __be64 n;
+ int err;
+
+ len = skb->len;
+ err = skb_cow_data(skb, QUIC_TAG_LEN, &trailer);
+ if (err < 0)
+ return err;
+ nsg = (u32)err;
+ pskb_put(skb, trailer, QUIC_TAG_LEN);
+ hdr->key = cb->key_phase;
+
+ ctx = quic_crypto_aead_mem_alloc(tfm, 0, &iv, &req, &sg, nsg);
+ if (!ctx)
+ return -ENOMEM;
+
+ sg_init_table(sg, nsg);
+ err = skb_to_sgvec(skb, sg, 0, (int)skb->len);
+ if (err < 0)
+ goto err;
+
+ /* rfc9001#section-5.3:
+ *
+ * The associated data, A, for the AEAD is the contents of the QUIC header,
+ * starting from the first byte of either the short or long header, up to and
+ * including the unprotected packet number.
+ *
+ * The nonce, N, is formed by combining the packet protection IV with the packet
+ * number. The 62 bits of the reconstructed QUIC packet number in network byte
+ * order are left-padded with zeros to the size of the IV. The exclusive OR of the
+ * padded packet number and the IV forms the AEAD nonce.
+ */
+ hlen = cb->number_offset + cb->number_len;
+ memcpy(nonce, tx_iv, QUIC_IV_LEN);
+ n = cpu_to_be64(cb->number);
+ for (i = 0; i < sizeof(n); i++)
+ nonce[QUIC_IV_LEN - sizeof(n) + i] ^= ((u8 *)&n)[i];
+
+ /* For CCM-based ciphers, the first byte of the IV is a constant. */

+ iv[0] = TLS_AES_CCM_IV_B0_BYTE;
+ memcpy(&iv[ccm], nonce, QUIC_IV_LEN);
+ aead_request_set_tfm(req, tfm);
+ aead_request_set_ad(req, hlen);
+ aead_request_set_crypt(req, sg, sg, len - hlen, iv);
+ aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, (void *)quic_crypto_done, skb);
+
+ err = crypto_aead_encrypt(req);
+ if (err == -EINPROGRESS) {
+ /* Will complete asynchronously; set destructor to free context. */
+ skb->destructor = quic_crypto_destruct_skb;
+ skb_shinfo(skb)->destructor_arg = ctx;
+ return err;
+ }
+
+err:
+ kfree(ctx);
+ return err;
+}
+
+static int quic_crypto_payload_decrypt(struct crypto_aead *tfm, struct sk_buff *skb,
+ u8 *rx_iv, bool ccm)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 *iv, i, nonce[QUIC_IV_LEN];
+ struct aead_request *req;
+ struct sk_buff *trailer;
+ int nsg, hlen, len, err;
+ struct scatterlist *sg;
+ void *ctx;
+ __be64 n;
+
+ len = cb->length + cb->number_offset;
+ hlen = cb->number_offset + cb->number_len;
+ if (len - hlen < QUIC_TAG_LEN)
+ return -EINVAL;
+ nsg = skb_cow_data(skb, 0, &trailer);
+ if (nsg < 0)
+ return nsg;
+ ctx = quic_crypto_aead_mem_alloc(tfm, 0, &iv, &req, &sg, nsg);
+ if (!ctx)
+ return -ENOMEM;
+
+ sg_init_table(sg, nsg);
+ err = skb_to_sgvec(skb, sg, 0, len);
+ if (err < 0)
+ goto err;
+ skb_dst_force(skb);
+
+ /* Similar logic to quic_crypto_payload_encrypt(). */
+ memcpy(nonce, rx_iv, QUIC_IV_LEN);
+ n = cpu_to_be64(cb->number);
+ for (i = 0; i < sizeof(n); i++)
+ nonce[QUIC_IV_LEN - sizeof(n) + i] ^= ((u8 *)&n)[i];
+
+ iv[0] = TLS_AES_CCM_IV_B0_BYTE;
+ memcpy(&iv[ccm], nonce, QUIC_IV_LEN);
+ aead_request_set_tfm(req, tfm);
+ aead_request_set_ad(req, hlen);
+ aead_request_set_crypt(req, sg, sg, len - hlen, iv);
+ aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG, (void *)quic_crypto_done, skb);
+
+ err = crypto_aead_decrypt(req);
+ if (err == -EINPROGRESS) {
+ skb->destructor = quic_crypto_destruct_skb;
+ skb_shinfo(skb)->destructor_arg = ctx;
+ return err;
+ }
+err:
+ kfree(ctx);
+ return err;
+}
+
#define QUIC_CIPHER_MIN TLS_CIPHER_AES_GCM_128
#define QUIC_CIPHER_MAX TLS_CIPHER_CHACHA20_POLY1305
@@ -225,6 +562,132 @@ static struct quic_cipher ciphers[QUIC_CIPHER_MAX + 1 - QUIC_CIPHER_MIN] = {
"rfc7539(chacha20,poly1305)", "chacha20", "hmac(sha256)"),
};
+static bool quic_crypto_is_cipher_ccm(struct quic_crypto *crypto)
+{
+ return crypto->cipher_type == TLS_CIPHER_AES_CCM_128;
+}
+
+static bool quic_crypto_is_cipher_chacha(struct quic_crypto *crypto)
+{
+ return crypto->cipher_type == TLS_CIPHER_CHACHA20_POLY1305;
+}
+
+/* Encrypts a QUIC packet before transmission. This function performs AEAD encryption of
+ * the packet payload and applies header protection. It handles key phase tracking and key
+ * update timing.
+ *
+ * Return: 0 on success, or a negative error code.
+ */
+int quic_crypto_encrypt(struct quic_crypto *crypto, struct sk_buff *skb)
+{
+ u8 *iv, cha, ccm, phase = crypto->key_phase;
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ int err;
+
+ cb->key_phase = phase;
+ iv = crypto->tx_iv[phase];
+ /* Packet payload is already encrypted (e.g., resumed from async), proceed to header
+ * protection only.
+ */
+ if (cb->resume)
+ goto out;
+
+ /* If a key update is pending and this is the first packet using the new key, record
+ * the current time. It is later used to discard the old keys once enough time has
+ * passed (see quic_crypto_decrypt()).
+ */
+ if (crypto->key_pending && !crypto->key_update_send_time)
+ crypto->key_update_send_time = jiffies_to_usecs(jiffies);
+
+ ccm = quic_crypto_is_cipher_ccm(crypto);
+ err = quic_crypto_payload_encrypt(crypto->tx_tfm[phase], skb, iv, ccm);
+ if (err)
+ return err;
+out:
+ cha = quic_crypto_is_cipher_chacha(crypto);
+ return quic_crypto_header_encrypt(crypto->tx_hp_tfm, skb, cha);
+}
+
+/* Decrypts a QUIC packet after reception. This function removes header protection,
+ * decrypts the payload, and processes any key updates if the key phase bit changes.
+ *
+ * Return: 0 on success, or a negative error code.
+ */
+int quic_crypto_decrypt(struct quic_crypto *crypto, struct sk_buff *skb)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 *iv, cha, ccm, phase;
+ int err = 0;
+ u32 time;
+
+ /* Payload was decrypted asynchronously. Proceed with parsing packet number and key
+ * phase.
+ */
+ if (cb->resume) {
+ quic_crypto_get_header(skb);
+ goto out;
+ }
+
+ cha = quic_crypto_is_cipher_chacha(crypto);
+ err = quic_crypto_header_decrypt(crypto->rx_hp_tfm, skb, cha);
+ if (err) {
+ pr_debug("%s: hd decrypt err %d\n", __func__, err);
+ return err;
+ }
+
+ /* rfc9001#section-6:
+ *
+ * The Key Phase bit allows a recipient to detect a change in keying material without
+ * needing to receive the first packet that triggered the change. An endpoint that
+ * notices a changed Key Phase bit updates keys and decrypts the packet that contains
+ * the changed value.
+ */
+ if (cb->key_phase != crypto->key_phase && !crypto->key_pending) {
+ if (!crypto->send_ready) /* Not ready for key update. */
+ return -EINVAL;
+ err = quic_crypto_key_update(crypto); /* Perform a key update. */
+ if (err) {
+ cb->errcode = QUIC_TRANSPORT_ERROR_KEY_UPDATE;
+ return err;
+ }
+ cb->key_update = 1; /* Mark packet as triggering key update. */
+ }
+
+ phase = cb->key_phase;
+ iv = crypto->rx_iv[phase];
+ ccm = quic_crypto_is_cipher_ccm(crypto);
+ err = quic_crypto_payload_decrypt(crypto->rx_tfm[phase], skb, iv, ccm);
+ if (err) {
+ if (err == -EINPROGRESS)
+ return err;
+ /* If the packet cannot be decrypted with the old keys, the peer might
+ * have started another key update. Clear key_pending so that
+ * subsequent packets can trigger a new key update.
+ */
+ if (crypto->key_pending && cb->key_phase != crypto->key_phase) {
+ crypto->key_pending = 0;
+ crypto->key_update_time = 0;
+ }
+ return err;
+ }
+
+out:
+ /* rfc9001#section-6.1:
+ *
+ * An endpoint MUST retain old keys until it has successfully unprotected a
+ * packet sent using the new keys. An endpoint SHOULD retain old keys for
+ * some time after unprotecting a packet sent using the new keys.
+ */
+ if (crypto->key_pending && cb->key_phase == crypto->key_phase) {
+ time = crypto->key_update_send_time;
+ if (time && jiffies_to_usecs(jiffies) - time >= crypto->key_update_time) {
+ crypto->key_pending = 0;
+ crypto->key_update_time = 0;
+ }
+ }
+ return err;
+}
+
int quic_crypto_set_cipher(struct quic_crypto *crypto, u32 type, u8 flag)
{
struct quic_cipher *cipher;
@@ -501,6 +964,201 @@ int quic_crypto_initial_keys_install(struct quic_crypto *crypto, struct quic_con
return quic_crypto_set_secret(crypto, &srt, version, 0);
}
+#define QUIC_RETRY_KEY_V1 "\xbe\x0c\x69\x0b\x9f\x66\x57\x5a\x1d\x76\x6b\x54\xe3\x68\xc8\x4e"
+#define QUIC_RETRY_KEY_V2 "\x8f\xb4\xb0\x1b\x56\xac\x48\xe2\x60\xfb\xcb\xce\xad\x7c\xcc\x92"
+
+#define QUIC_RETRY_NONCE_V1 "\x46\x15\x99\xd3\x5d\x63\x2b\xf2\x23\x98\x25\xbb"
+#define QUIC_RETRY_NONCE_V2 "\xd8\x69\x69\xbc\x2d\x7c\x6d\x99\x90\xef\xb0\x4a"
+
+/* Retry Packet Integrity. */
+int quic_crypto_get_retry_tag(struct quic_crypto *crypto, struct sk_buff *skb,
+ struct quic_conn_id *odcid, u32 version, u8 *tag)
+{
+ struct crypto_aead *tfm = crypto->tag_tfm;
+ u8 *pseudo_retry, *p, *iv, *key;
+ struct aead_request *req;
+ struct scatterlist *sg;
+ u32 plen;
+ int err;
+
+ /* rfc9001#section-5.8:
+ *
+ * The Retry Integrity Tag is a 128-bit field that is computed as the output of
+ * AEAD_AES_128_GCM used with the following inputs:
+ *
+ * - The secret key, K, is 128 bits equal to 0xbe0c690b9f66575a1d766b54e368c84e.
+ * - The nonce, N, is 96 bits equal to 0x461599d35d632bf2239825bb.
+ * - The plaintext, P, is empty.
+ * - The associated data, A, is the contents of the Retry Pseudo-Packet,
+ *
+ * The Retry Pseudo-Packet is not sent over the wire. It is computed by taking the
+ * transmitted Retry packet, removing the Retry Integrity Tag, and prepending the
+ * two following fields: ODCID Length + Original Destination Connection ID (ODCID).
+ */
+ err = crypto_aead_setauthsize(tfm, QUIC_TAG_LEN);
+ if (err)
+ return err;
+ key = QUIC_RETRY_KEY_V1;
+ if (version == QUIC_VERSION_V2)
+ key = QUIC_RETRY_KEY_V2;
+ err = crypto_aead_setkey(tfm, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ if (err)
+ return err;
+
+ plen = 1 + odcid->len + skb->len - QUIC_TAG_LEN;
+ pseudo_retry = quic_crypto_aead_mem_alloc(tfm, plen + QUIC_TAG_LEN, &iv, &req, &sg, 1);
+ if (!pseudo_retry)
+ return -ENOMEM;
+
+ p = pseudo_retry;
+ p = quic_put_int(p, odcid->len, 1);
+ p = quic_put_data(p, odcid->data, odcid->len);
+ p = quic_put_data(p, skb->data, skb->len - QUIC_TAG_LEN);
+ sg_init_one(sg, pseudo_retry, plen + QUIC_TAG_LEN);
+
+ memcpy(iv, QUIC_RETRY_NONCE_V1, QUIC_IV_LEN);
+ if (version == QUIC_VERSION_V2)
+ memcpy(iv, QUIC_RETRY_NONCE_V2, QUIC_IV_LEN);
+ aead_request_set_tfm(req, tfm);
+ aead_request_set_ad(req, plen);
+ aead_request_set_crypt(req, sg, sg, 0, iv);
+ err = crypto_aead_encrypt(req);
+ if (!err)
+ memcpy(tag, p, QUIC_TAG_LEN);
+ kfree(pseudo_retry);
+ return err;
+}
+
+/* Generate a token for Retry or address validation.
+ *
+ * Builds a token with the format: [client address][timestamp][original DCID][auth tag]
+ *
+ * Encrypts the token (excluding the first flag byte) using AES-GCM with a key and IV
+ * derived via HKDF. The original DCID is stored to be recovered later from a Client
+ * Initial packet. Ensures the token is bound to the client address and time, preventing
+ * reuse or tampering.
+ *
+ * Returns 0 on success or a negative error code on failure.
+ */
+int quic_crypto_generate_token(struct quic_crypto *crypto, void *addr, u32 addrlen,
+ struct quic_conn_id *conn_id, u8 *token, u32 *tlen)
+{
+ u8 key[TLS_CIPHER_AES_GCM_128_KEY_SIZE], iv[QUIC_IV_LEN], *retry_token, *tx_iv, *p;
+ struct crypto_aead *tfm = crypto->tag_tfm;
+ u32 ts = jiffies_to_usecs(jiffies), len;
+ struct quic_data srt = {}, k, i;
+ struct aead_request *req;
+ struct scatterlist *sg;
+ int err;
+
+ quic_data(&srt, quic_random_data, QUIC_RANDOM_DATA_LEN);
+ quic_data(&k, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ quic_data(&i, iv, QUIC_IV_LEN);
+ err = quic_crypto_keys_derive(crypto->secret_tfm, &srt, &k, &i, NULL, QUIC_VERSION_V1);
+ if (err)
+ return err;
+ err = crypto_aead_setauthsize(tfm, QUIC_TAG_LEN);
+ if (err)
+ return err;
+ err = crypto_aead_setkey(tfm, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ if (err)
+ return err;
+ token++;
+ len = addrlen + sizeof(ts) + conn_id->len + QUIC_TAG_LEN;
+ retry_token = quic_crypto_aead_mem_alloc(tfm, len, &tx_iv, &req, &sg, 1);
+ if (!retry_token)
+ return -ENOMEM;
+
+ p = retry_token;
+ p = quic_put_data(p, addr, addrlen);
+ p = quic_put_int(p, ts, sizeof(ts));
+ quic_put_data(p, conn_id->data, conn_id->len);
+ sg_init_one(sg, retry_token, len);
+ aead_request_set_tfm(req, tfm);
+ aead_request_set_ad(req, addrlen);
+ aead_request_set_crypt(req, sg, sg, len - addrlen - QUIC_TAG_LEN, iv);
+ err = crypto_aead_encrypt(req);
+ if (!err) {
+ memcpy(token, retry_token, len);
+ *tlen = len + 1;
+ }
+ kfree(retry_token);
+ return err;
+}
+
+/* Validate a Retry or address validation token.
+ *
+ * Decrypts the token using derived key and IV. Checks that the decrypted address matches
+ * the provided address, validates the embedded timestamp against current time with a
+ * version-specific timeout. If applicable, it extracts and returns the original
+ * destination connection ID (ODCID) for Retry packets.
+ *
+ * Returns 0 if the token is valid, -EINVAL if invalid, or another negative error code.
+ */
+int quic_crypto_verify_token(struct quic_crypto *crypto, void *addr, u32 addrlen,
+ struct quic_conn_id *conn_id, u8 *token, u32 len)
+{
+ u32 ts = jiffies_to_usecs(jiffies), timeout = QUIC_TOKEN_TIMEOUT_RETRY;
+ u8 key[TLS_CIPHER_AES_GCM_128_KEY_SIZE], iv[QUIC_IV_LEN];
+ u8 *retry_token, *rx_iv, *p, flag = *token;
+ struct crypto_aead *tfm = crypto->tag_tfm;
+ struct quic_data srt = {}, k, i;
+ struct aead_request *req;
+ struct scatterlist *sg;
+ int err;
+ u64 t;
+
+ if (len < sizeof(flag) + addrlen + sizeof(ts) + QUIC_TAG_LEN)
+ return -EINVAL;
+ quic_data(&srt, quic_random_data, QUIC_RANDOM_DATA_LEN);
+ quic_data(&k, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ quic_data(&i, iv, QUIC_IV_LEN);
+ err = quic_crypto_keys_derive(crypto->secret_tfm, &srt, &k, &i, NULL, QUIC_VERSION_V1);
+ if (err)
+ return err;
+ err = crypto_aead_setauthsize(tfm, QUIC_TAG_LEN);
+ if (err)
+ return err;
+ err = crypto_aead_setkey(tfm, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ if (err)
+ return err;
+ len--;
+ token++;
+ retry_token = quic_crypto_aead_mem_alloc(tfm, len, &rx_iv, &req, &sg, 1);
+ if (!retry_token)
+ return -ENOMEM;
+
+ memcpy(retry_token, token, len);
+ sg_init_one(sg, retry_token, len);
+ aead_request_set_tfm(req, tfm);
+ aead_request_set_ad(req, addrlen);
+ aead_request_set_crypt(req, sg, sg, len - addrlen, iv);
+ err = crypto_aead_decrypt(req);
+ if (err)
+ goto out;
+
+ err = -EINVAL;
+ p = retry_token;
+ if (memcmp(p, addr, addrlen))
+ goto out;
+ p += addrlen;
+ len -= addrlen;
+ if (flag == QUIC_TOKEN_FLAG_REGULAR)
+ timeout = QUIC_TOKEN_TIMEOUT_REGULAR;
+ if (!quic_get_int(&p, &len, &t, sizeof(ts)) || t + timeout < ts)
+ goto out;
+ len -= QUIC_TAG_LEN;
+ if (len > QUIC_CONN_ID_MAX_LEN)
+ goto out;
+
+ if (flag == QUIC_TOKEN_FLAG_RETRY)
+ quic_conn_id_update(conn_id, p, len);
+ err = 0;
+out:
+ kfree(retry_token);
+ return err;
+}
+
/* Generate a derived key using HKDF-Extract and HKDF-Expand with a given label. */
static int quic_crypto_generate_key(struct quic_crypto *crypto, void *data, u32 len,
char *label, u8 *token, u32 key_len)
diff --git a/net/quic/crypto.h b/net/quic/crypto.h
index 2bc960a8489e..91ccd0ec0590 100644
--- a/net/quic/crypto.h
+++ b/net/quic/crypto.h
@@ -62,6 +62,9 @@ int quic_crypto_get_secret(struct quic_crypto *crypto, struct quic_crypto_secret
int quic_crypto_set_cipher(struct quic_crypto *crypto, u32 type, u8 flag);
int quic_crypto_key_update(struct quic_crypto *crypto);
+int quic_crypto_encrypt(struct quic_crypto *crypto, struct sk_buff *skb);
+int quic_crypto_decrypt(struct quic_crypto *crypto, struct sk_buff *skb);
+
int quic_crypto_initial_keys_install(struct quic_crypto *crypto, struct quic_conn_id *conn_id,
u32 version, bool is_serv);
int quic_crypto_generate_session_ticket_key(struct quic_crypto *crypto, void *data,
@@ -69,5 +72,12 @@ int quic_crypto_generate_session_ticket_key(struct quic_crypto *crypto, void *da
int quic_crypto_generate_stateless_reset_token(struct quic_crypto *crypto, void *data,
u32 len, u8 *key, u32 key_len);
+int quic_crypto_generate_token(struct quic_crypto *crypto, void *addr, u32 addrlen,
+ struct quic_conn_id *conn_id, u8 *token, u32 *tlen);
+int quic_crypto_get_retry_tag(struct quic_crypto *crypto, struct sk_buff *skb,
+ struct quic_conn_id *odcid, u32 version, u8 *tag);
+int quic_crypto_verify_token(struct quic_crypto *crypto, void *addr, u32 addrlen,
+ struct quic_conn_id *conn_id, u8 *token, u32 len);
+
void quic_crypto_free(struct quic_crypto *crypto);
void quic_crypto_init(void);
--
2.47.1
* [PATCH net-next v2 13/15] quic: add timer management
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (11 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 12/15] quic: add crypto packet encryption and decryption Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 14/15] quic: add frame encoder and decoder base Xin Long
` (2 subsequent siblings)
15 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'quic_timer' to unify and manage the five main
timers used in QUIC: loss detection, delayed ACK, path validation,
PMTU probing, and pacing. These timers are critical for driving
retransmissions, connection liveness, and flow control.
Each timer type is initialized, started, reset, or stopped using a common
set of operations.
- quic_timer_reset(): Reset a timer with type and timeout
- quic_timer_start(): Start a timer with type and timeout
- quic_timer_stop(): Stop a timer with type
Although handler functions for each timer are defined, they are currently
placeholders; their logic will be implemented in upcoming patches for
packet transmission and outqueue handling.
Deferred timer actions are also integrated through quic_release_cb(),
which runs the corresponding handler for any timer that fired while the
socket was owned by the user.
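The deferred-dispatch pattern used by quic_release_cb() can be sketched in
userspace C. This is illustrative only: stdatomic stands in for the kernel's
try_cmpxchg() on sk->sk_tsq_flags, and the flag and handler names are
hypothetical, not kernel APIs. The key property shown is that the flag word
is claimed with a compare-exchange loop, so bits set concurrently by other
timers are not lost.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace sketch: a timer that fires while the socket is owned by the
 * user only sets a flag; the owner runs the handler when it releases the
 * socket.  Names are illustrative, not kernel APIs. */

enum {
	F_LOSS_DEFERRED = 1 << 0,
	F_SACK_DEFERRED = 1 << 1,
};
#define DEFERRED_ALL (F_LOSS_DEFERRED | F_SACK_DEFERRED)

static int loss_runs, sack_runs;

static void loss_handler(void) { loss_runs++; }
static void sack_handler(void) { sack_runs++; }

/* Timer context: record the event for later instead of handling it now. */
static void timer_fired(atomic_uint *tsq_flags, unsigned int bit)
{
	atomic_fetch_or(tsq_flags, bit);
}

/* Release path: atomically claim all deferred bits, then dispatch. */
static void release_cb(atomic_uint *tsq_flags)
{
	unsigned int flags = atomic_load(tsq_flags), nflags;

	do {
		if (!(flags & DEFERRED_ALL))
			return;
		nflags = flags & ~DEFERRED_ALL;
	} while (!atomic_compare_exchange_weak(tsq_flags, &flags, nflags));

	if (flags & F_LOSS_DEFERRED)
		loss_handler();
	if (flags & F_SACK_DEFERRED)
		sack_handler();
}
```

A second call to release_cb() with no new flags set is a no-op, mirroring the
early return in the kernel code when no QUIC_DEFERRED_ALL bits are pending.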
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/socket.c | 33 ++++++++
net/quic/socket.h | 33 ++++++++
net/quic/timer.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/timer.h | 47 +++++++++++
5 files changed, 310 insertions(+), 1 deletion(-)
create mode 100644 net/quic/timer.c
create mode 100644 net/quic/timer.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 58bb18f7926d..2ccf01ad9e22 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o pnspace.o crypto.o
+ cong.o pnspace.o crypto.o timer.o
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 2425494a3df3..cbcfec3a02b2 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -48,6 +48,8 @@ static int quic_init_sock(struct sock *sk)
quic_conn_id_set_init(quic_dest(sk), 0);
quic_cong_init(quic_cong(sk));
+ quic_timer_init(sk);
+
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
@@ -71,6 +73,8 @@ static void quic_destroy_sock(struct sock *sk)
{
u8 i;
+ quic_timer_free(sk);
+
for (i = 0; i < QUIC_PNSPACE_MAX; i++)
quic_pnspace_free(quic_pnspace(sk, i));
for (i = 0; i < QUIC_CRYPTO_MAX; i++)
@@ -209,6 +213,35 @@ EXPORT_SYMBOL_GPL(quic_kernel_getsockopt);
static void quic_release_cb(struct sock *sk)
{
+ /* Similar to tcp_release_cb(). */
+ unsigned long nflags, flags = smp_load_acquire(&sk->sk_tsq_flags);
+
+ do {
+ if (!(flags & QUIC_DEFERRED_ALL))
+ return;
+ nflags = flags & ~QUIC_DEFERRED_ALL;
+ } while (!try_cmpxchg(&sk->sk_tsq_flags, &flags, nflags));
+
+ if (flags & QUIC_F_LOSS_DEFERRED) {
+ quic_timer_loss_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_SACK_DEFERRED) {
+ quic_timer_sack_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_PATH_DEFERRED) {
+ quic_timer_path_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_PMTU_DEFERRED) {
+ quic_timer_pmtu_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_TSQ_DEFERRED) {
+ quic_timer_pace_handler(sk);
+ __sock_put(sk);
+ }
}
static int quic_disconnect(struct sock *sk, int flags)
diff --git a/net/quic/socket.h b/net/quic/socket.h
index c8df98351c6b..b3cf31e005ce 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -21,6 +21,7 @@
#include "cong.h"
#include "protocol.h"
+#include "timer.h"
extern struct proto quic_prot;
extern struct proto quicv6_prot;
@@ -32,6 +33,31 @@ enum quic_state {
QUIC_SS_ESTABLISHED = TCP_ESTABLISHED,
};
+enum quic_tsq_enum {
+ QUIC_MTU_REDUCED_DEFERRED,
+ QUIC_LOSS_DEFERRED,
+ QUIC_SACK_DEFERRED,
+ QUIC_PATH_DEFERRED,
+ QUIC_PMTU_DEFERRED,
+ QUIC_TSQ_DEFERRED,
+};
+
+enum quic_tsq_flags {
+ QUIC_F_MTU_REDUCED_DEFERRED = BIT(QUIC_MTU_REDUCED_DEFERRED),
+ QUIC_F_LOSS_DEFERRED = BIT(QUIC_LOSS_DEFERRED),
+ QUIC_F_SACK_DEFERRED = BIT(QUIC_SACK_DEFERRED),
+ QUIC_F_PATH_DEFERRED = BIT(QUIC_PATH_DEFERRED),
+ QUIC_F_PMTU_DEFERRED = BIT(QUIC_PMTU_DEFERRED),
+ QUIC_F_TSQ_DEFERRED = BIT(QUIC_TSQ_DEFERRED),
+};
+
+#define QUIC_DEFERRED_ALL (QUIC_F_MTU_REDUCED_DEFERRED | \
+ QUIC_F_LOSS_DEFERRED | \
+ QUIC_F_SACK_DEFERRED | \
+ QUIC_F_PATH_DEFERRED | \
+ QUIC_F_PMTU_DEFERRED | \
+ QUIC_F_TSQ_DEFERRED)
+
struct quic_sock {
struct inet_sock inet;
struct list_head reqs;
@@ -48,6 +74,8 @@ struct quic_sock {
struct quic_cong cong;
struct quic_pnspace space[QUIC_PNSPACE_MAX];
struct quic_crypto crypto[QUIC_CRYPTO_MAX];
+
+ struct quic_timer timers[QUIC_TIMER_MAX];
};
struct quic6_sock {
@@ -125,6 +153,11 @@ static inline struct quic_crypto *quic_crypto(const struct sock *sk, u8 level)
return &quic_sk(sk)->crypto[level];
}
+static inline void *quic_timer(const struct sock *sk, u8 type)
+{
+ return (void *)&quic_sk(sk)->timers[type];
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
diff --git a/net/quic/timer.c b/net/quic/timer.c
new file mode 100644
index 000000000000..10b304db84a9
--- /dev/null
+++ b/net/quic/timer.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Timer management for the QUIC protocol.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include "socket.h"
+
+void quic_timer_sack_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_sack_timeout(struct timer_list *t)
+{
+ struct quic_sock *qs = container_of(t, struct quic_sock, timers[QUIC_TIMER_SACK].t);
+ struct sock *sk = &qs->inet.sk;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ if (!test_and_set_bit(QUIC_SACK_DEFERRED, &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+
+ quic_timer_sack_handler(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+}
+
+void quic_timer_loss_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_loss_timeout(struct timer_list *t)
+{
+ struct quic_sock *qs = container_of(t, struct quic_sock, timers[QUIC_TIMER_LOSS].t);
+ struct sock *sk = &qs->inet.sk;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ if (!test_and_set_bit(QUIC_LOSS_DEFERRED, &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+
+ quic_timer_loss_handler(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+}
+
+void quic_timer_path_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_path_timeout(struct timer_list *t)
+{
+ struct quic_sock *qs = container_of(t, struct quic_sock, timers[QUIC_TIMER_PATH].t);
+ struct sock *sk = &qs->inet.sk;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ if (!test_and_set_bit(QUIC_PATH_DEFERRED, &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+
+ quic_timer_path_handler(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+}
+
+void quic_timer_reset_path(struct sock *sk)
+{
+ struct quic_cong *cong = quic_cong(sk);
+ u64 timeout = cong->pto * 2;
+
+ /* Calculate timeout based on cong.pto, but enforce a lower bound. */
+ if (timeout < QUIC_MIN_PATH_TIMEOUT)
+ timeout = QUIC_MIN_PATH_TIMEOUT;
+ quic_timer_reset(sk, QUIC_TIMER_PATH, timeout);
+}
+
+void quic_timer_pmtu_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_pmtu_timeout(struct timer_list *t)
+{
+ struct quic_sock *qs = container_of(t, struct quic_sock, timers[QUIC_TIMER_PMTU].t);
+ struct sock *sk = &qs->inet.sk;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ if (!test_and_set_bit(QUIC_PMTU_DEFERRED, &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+
+ quic_timer_pmtu_handler(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+}
+
+void quic_timer_pace_handler(struct sock *sk)
+{
+}
+
+static enum hrtimer_restart quic_timer_pace_timeout(struct hrtimer *hr)
+{
+ struct quic_sock *qs = container_of(hr, struct quic_sock, timers[QUIC_TIMER_PACE].hr);
+ struct sock *sk = &qs->inet.sk;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ if (!test_and_set_bit(QUIC_TSQ_DEFERRED, &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+
+ quic_timer_pace_handler(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ return HRTIMER_NORESTART;
+}
+
+void quic_timer_reset(struct sock *sk, u8 type, u64 timeout)
+{
+ struct timer_list *t = quic_timer(sk, type);
+
+ if (timeout && !mod_timer(t, jiffies + usecs_to_jiffies(timeout)))
+ sock_hold(sk);
+}
+
+void quic_timer_start(struct sock *sk, u8 type, u64 timeout)
+{
+ struct timer_list *t;
+ struct hrtimer *hr;
+
+ if (type == QUIC_TIMER_PACE) {
+ hr = quic_timer(sk, type);
+
+ if (!hrtimer_is_queued(hr)) {
+ hrtimer_start(hr, ns_to_ktime(timeout), HRTIMER_MODE_ABS_PINNED_SOFT);
+ sock_hold(sk);
+ }
+ return;
+ }
+
+ t = quic_timer(sk, type);
+ if (timeout && !timer_pending(t)) {
+ if (!mod_timer(t, jiffies + usecs_to_jiffies(timeout)))
+ sock_hold(sk);
+ }
+}
+
+void quic_timer_stop(struct sock *sk, u8 type)
+{
+ if (type == QUIC_TIMER_PACE) {
+ if (hrtimer_try_to_cancel(quic_timer(sk, type)) == 1)
+ sock_put(sk);
+ return;
+ }
+ if (timer_delete(quic_timer(sk, type)))
+ sock_put(sk);
+}
+
+void quic_timer_init(struct sock *sk)
+{
+ timer_setup(quic_timer(sk, QUIC_TIMER_LOSS), quic_timer_loss_timeout, 0);
+ timer_setup(quic_timer(sk, QUIC_TIMER_SACK), quic_timer_sack_timeout, 0);
+ timer_setup(quic_timer(sk, QUIC_TIMER_PATH), quic_timer_path_timeout, 0);
+ timer_setup(quic_timer(sk, QUIC_TIMER_PMTU), quic_timer_pmtu_timeout, 0);
+ /* Use hrtimer for pace timer, ensuring precise control over send timing. */
+ hrtimer_setup(quic_timer(sk, QUIC_TIMER_PACE), quic_timer_pace_timeout,
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_SOFT);
+}
+
+void quic_timer_free(struct sock *sk)
+{
+ quic_timer_stop(sk, QUIC_TIMER_LOSS);
+ quic_timer_stop(sk, QUIC_TIMER_SACK);
+ quic_timer_stop(sk, QUIC_TIMER_PATH);
+ quic_timer_stop(sk, QUIC_TIMER_PMTU);
+ quic_timer_stop(sk, QUIC_TIMER_PACE);
+}
diff --git a/net/quic/timer.h b/net/quic/timer.h
new file mode 100644
index 000000000000..61b094325334
--- /dev/null
+++ b/net/quic/timer.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+enum {
+ QUIC_TIMER_LOSS, /* Loss detection timer: triggers retransmission on packet loss */
+ QUIC_TIMER_SACK, /* Delayed ACK timer; aliased as the idle timer (QUIC_TIMER_IDLE) */
+ QUIC_TIMER_PATH, /* Path validation timer: verifies network path connectivity */
+ QUIC_TIMER_PMTU, /* Packetization Layer Path MTU Discovery probing timer */
+ QUIC_TIMER_PACE, /* Pacing timer: controls packet transmission pacing */
+ QUIC_TIMER_MAX,
+ QUIC_TIMER_IDLE = QUIC_TIMER_SACK,
+};
+
+struct quic_timer {
+ union {
+ struct timer_list t;
+ struct hrtimer hr;
+ };
+};
+
+#define QUIC_MIN_PROBE_TIMEOUT 5000000
+
+#define QUIC_MIN_PATH_TIMEOUT 1500000
+
+#define QUIC_MIN_IDLE_TIMEOUT 1000000
+#define QUIC_DEF_IDLE_TIMEOUT 30000000
+
+void quic_timer_reset(struct sock *sk, u8 type, u64 timeout);
+void quic_timer_start(struct sock *sk, u8 type, u64 timeout);
+void quic_timer_stop(struct sock *sk, u8 type);
+void quic_timer_init(struct sock *sk);
+void quic_timer_free(struct sock *sk);
+
+void quic_timer_reset_path(struct sock *sk);
+
+void quic_timer_loss_handler(struct sock *sk);
+void quic_timer_pace_handler(struct sock *sk);
+void quic_timer_path_handler(struct sock *sk);
+void quic_timer_sack_handler(struct sock *sk);
+void quic_timer_pmtu_handler(struct sock *sk);
--
2.47.1
* [PATCH net-next v2 14/15] quic: add frame encoder and decoder base
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (12 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 13/15] quic: add timer management Xin Long
@ 2025-08-18 14:04 ` Xin Long
2025-08-18 14:04 ` [PATCH net-next v2 15/15] quic: add packet builder and parser base Xin Long
2025-08-23 15:20 ` [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents John Ericson
15 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
linux-cifs, Steve French, Namjae Jeon, Paulo Alcantara,
Tom Talpey, kernel-tls-handshake, Chuck Lever, Jeff Layton,
Benjamin Coddington, Steve Dickson, Hannes Reinecke,
Alexander Aring, David Howells, Cong Wang, D . Wythe, Jason Baron,
illiliti, Sabrina Dubroca, Marcelo Ricardo Leitner,
Daniel Stenberg, Andy Gospodarek
This patch introduces 'quic_frame' to represent QUIC frames and
'quic_frame_ops' to define the associated operations for encoding,
processing, and acknowledgment.
This abstraction sets the foundation for flexible and modular frame
handling. While the core operations are defined here, their actual
implementations will follow in subsequent patches, once the packet
handling and inqueue/outqueue infrastructure are in place.
The patch introduces hooks for invoking frame-specific logic:
- quic_frame_create(): Invoke the .create operation of the frame.
- quic_frame_process(): Invoke the .process operation of the frame.
- quic_frame_ack(): Invoke the .ack operation of the frame.
To manage frame lifecycles, reference counting is used, supported by
- quic_frame_get(): Increment the reference count of a frame.
- quic_frame_put(): Decrement the reference count of a frame.
- quic_frame_alloc(): Allocate a frame and set its data.
Frames are allocated through quic_frame_alloc(), and a dedicated
kmem_cache (quic_frame_cachep) is added to optimize memory usage.
For STREAM frames, additional data can be appended using
- quic_frame_stream_append(): Append more data to a STREAM frame.
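The frame lifecycle above can be sketched in userspace C. This mirrors
quic_frame_alloc()/quic_frame_get()/quic_frame_put() in spirit only: the
kernel version uses refcount_t and a dedicated kmem_cache, while this sketch
uses a plain counter and malloc, and the names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace sketch: a frame owns its data buffer, starts with one
 * reference, and is freed when the last reference is dropped. */

struct frame {
	unsigned char *data;
	unsigned int len;
	int refcnt;
};

static struct frame *frame_alloc(unsigned int size)
{
	struct frame *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->data = malloc(size);
	if (!f->data) {
		free(f);
		return NULL;
	}
	f->len = size;
	f->refcnt = 1;	/* caller holds the initial reference */
	return f;
}

static struct frame *frame_get(struct frame *f)
{
	f->refcnt++;
	return f;
}

/* Returns 1 if this put freed the frame, 0 otherwise. */
static int frame_put(struct frame *f)
{
	if (--f->refcnt)
		return 0;
	free(f->data);
	free(f);
	return 1;
}
```

Holding a reference across a list (e.g. a retransmit queue) is then a matter
of pairing one frame_get() with one frame_put() per list membership.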
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/frame.c | 558 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/frame.h | 195 ++++++++++++++++
net/quic/protocol.c | 9 +
net/quic/protocol.h | 1 +
net/quic/socket.h | 2 +
6 files changed, 766 insertions(+), 1 deletion(-)
create mode 100644 net/quic/frame.c
create mode 100644 net/quic/frame.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 2ccf01ad9e22..645ee470c95e 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o pnspace.o crypto.o timer.o
+ cong.o pnspace.o crypto.o timer.o frame.o
diff --git a/net/quic/frame.c b/net/quic/frame.c
new file mode 100644
index 000000000000..d1e99c4f4804
--- /dev/null
+++ b/net/quic/frame.c
@@ -0,0 +1,558 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Frame encoding and decoding for the QUIC protocol.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/proto_memory.h>
+
+#include "socket.h"
+
+/* ACK Frame {
+ * Type (i) = 0x02..0x03,
+ * Largest Acknowledged (i),
+ * ACK Delay (i),
+ * ACK Range Count (i),
+ * First ACK Range (i),
+ * ACK Range (..) ...,
+ * [ECN Counts (..)],
+ * }
+ */
+
+static struct quic_frame *quic_frame_ack_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_ping_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_padding_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_new_token_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+/* STREAM Frame {
+ * Type (i) = 0x08..0x0f,
+ * Stream ID (i),
+ * [Offset (i)],
+ * [Length (i)],
+ * Stream Data (..),
+ * }
+ */
+
+static struct quic_frame *quic_frame_stream_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_handshake_done_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_crypto_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_retire_conn_id_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_new_conn_id_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_path_response_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_path_challenge_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_reset_stream_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_stop_sending_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_max_data_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_max_stream_data_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_max_streams_uni_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_max_streams_bidi_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_connection_close_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_data_blocked_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_stream_data_blocked_create(struct sock *sk,
+ void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_streams_blocked_uni_create(struct sock *sk,
+ void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_streams_blocked_bidi_create(struct sock *sk,
+ void *data, u8 type)
+{
+ return NULL;
+}
+
+static int quic_frame_crypto_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_stream_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_ack_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_new_conn_id_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_retire_conn_id_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_new_token_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_handshake_done_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_padding_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_ping_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_path_challenge_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_reset_stream_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_stop_sending_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_max_data_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_max_stream_data_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_max_streams_uni_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_max_streams_bidi_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_connection_close_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_data_blocked_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_stream_data_blocked_process(struct sock *sk, struct quic_frame *frame,
+ u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_streams_blocked_uni_process(struct sock *sk, struct quic_frame *frame,
+ u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_streams_blocked_bidi_process(struct sock *sk, struct quic_frame *frame,
+ u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_path_response_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static struct quic_frame *quic_frame_invalid_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static struct quic_frame *quic_frame_datagram_create(struct sock *sk, void *data, u8 type)
+{
+ return NULL;
+}
+
+static int quic_frame_invalid_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_frame_datagram_process(struct sock *sk, struct quic_frame *frame, u8 type)
+{
+ return -EOPNOTSUPP;
+}
+
+static void quic_frame_padding_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_ping_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_ack_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_reset_stream_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_stop_sending_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_crypto_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_new_token_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_stream_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_max_data_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_max_stream_data_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_max_streams_bidi_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_max_streams_uni_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_data_blocked_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_stream_data_blocked_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_streams_blocked_bidi_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_streams_blocked_uni_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_new_conn_id_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_retire_conn_id_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_path_challenge_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_path_response_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_connection_close_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_handshake_done_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_invalid_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+static void quic_frame_datagram_ack(struct sock *sk, struct quic_frame *frame)
+{
+}
+
+#define quic_frame_create_and_process_and_ack(type, eliciting) \
+ { \
+ .frame_create = quic_frame_##type##_create, \
+ .frame_process = quic_frame_##type##_process, \
+ .frame_ack = quic_frame_##type##_ack, \
+ .ack_eliciting = eliciting \
+ }
+
+static struct quic_frame_ops quic_frame_ops[QUIC_FRAME_MAX + 1] = {
+ quic_frame_create_and_process_and_ack(padding, 0), /* 0x00 */
+ quic_frame_create_and_process_and_ack(ping, 1),
+ quic_frame_create_and_process_and_ack(ack, 0),
+ quic_frame_create_and_process_and_ack(ack, 0), /* ack_ecn */
+ quic_frame_create_and_process_and_ack(reset_stream, 1),
+ quic_frame_create_and_process_and_ack(stop_sending, 1),
+ quic_frame_create_and_process_and_ack(crypto, 1),
+ quic_frame_create_and_process_and_ack(new_token, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(stream, 1),
+ quic_frame_create_and_process_and_ack(max_data, 1), /* 0x10 */
+ quic_frame_create_and_process_and_ack(max_stream_data, 1),
+ quic_frame_create_and_process_and_ack(max_streams_bidi, 1),
+ quic_frame_create_and_process_and_ack(max_streams_uni, 1),
+ quic_frame_create_and_process_and_ack(data_blocked, 1),
+ quic_frame_create_and_process_and_ack(stream_data_blocked, 1),
+ quic_frame_create_and_process_and_ack(streams_blocked_bidi, 1),
+ quic_frame_create_and_process_and_ack(streams_blocked_uni, 1),
+ quic_frame_create_and_process_and_ack(new_conn_id, 1),
+ quic_frame_create_and_process_and_ack(retire_conn_id, 1),
+ quic_frame_create_and_process_and_ack(path_challenge, 0),
+ quic_frame_create_and_process_and_ack(path_response, 0),
+ quic_frame_create_and_process_and_ack(connection_close, 0),
+ quic_frame_create_and_process_and_ack(connection_close, 0),
+ quic_frame_create_and_process_and_ack(handshake_done, 1),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0), /* 0x20 */
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(invalid, 0),
+ quic_frame_create_and_process_and_ack(datagram, 1), /* 0x30 */
+ quic_frame_create_and_process_and_ack(datagram, 1),
+};
+
+void quic_frame_ack(struct sock *sk, struct quic_frame *frame)
+{
+ quic_frame_ops[frame->type].frame_ack(sk, frame);
+
+ list_del_init(&frame->list);
+ frame->transmitted = 0;
+ quic_frame_put(frame);
+}
+
+int quic_frame_process(struct sock *sk, struct quic_frame *frame)
+{
+ u8 type, level = frame->level;
+ int ret;
+
+ while (frame->len > 0) {
+ type = *frame->data++;
+ frame->len--;
+
+ if (type > QUIC_FRAME_MAX) {
+ pr_debug("%s: unsupported frame, type: %x, level: %d\n",
+ __func__, type, level);
+ return -EPROTONOSUPPORT;
+ } else if (quic_frame_level_check(level, type)) {
+ pr_debug("%s: invalid frame, type: %x, level: %d\n",
+ __func__, type, level);
+ return -EINVAL;
+ }
+ ret = quic_frame_ops[type].frame_process(sk, frame, type);
+ if (ret < 0) {
+ pr_debug("%s: failed, type: %x, level: %d, err: %d\n",
+ __func__, type, level, ret);
+ return ret;
+ }
+ pr_debug("%s: done, type: %x, level: %d\n", __func__, type, level);
+
+ frame->data += ret;
+ frame->len -= ret;
+ }
+ return 0;
+}
+
+struct quic_frame *quic_frame_create(struct sock *sk, u8 type, void *data)
+{
+ struct quic_frame *frame;
+
+ if (type > QUIC_FRAME_MAX)
+ return NULL;
+ frame = quic_frame_ops[type].frame_create(sk, data, type);
+ if (!frame) {
+ pr_debug("%s: failed, type: %x\n", __func__, type);
+ return NULL;
+ }
+ INIT_LIST_HEAD(&frame->list);
+ if (!frame->type)
+ frame->type = type;
+ frame->ack_eliciting = quic_frame_ops[type].ack_eliciting;
+ pr_debug("%s: done, type: %x, len: %u\n", __func__, type, frame->len);
+ return frame;
+}
+
+struct quic_frame *quic_frame_alloc(u32 size, u8 *data, gfp_t gfp)
+{
+ struct quic_frame *frame;
+
+ frame = kmem_cache_zalloc(quic_frame_cachep, gfp);
+ if (!frame)
+ return NULL;
+ if (data) {
+ frame->data = data;
+ goto out;
+ }
+ frame->data = kmalloc(size, gfp);
+ if (!frame->data) {
+ kmem_cache_free(quic_frame_cachep, frame);
+ return NULL;
+ }
+out:
+ refcount_set(&frame->refcnt, 1);
+ frame->offset = -1;
+ frame->len = (u16)size;
+ frame->size = frame->len;
+ return frame;
+}
+
+static void quic_frame_free(struct quic_frame *frame)
+{
+ struct quic_frame_frag *frag, *next;
+
+ if (!frame->type && frame->skb) { /* RX path frame with skb. */
+ kfree_skb(frame->skb);
+ goto out;
+ }
+
+ for (frag = frame->flist; frag; frag = next) {
+ next = frag->next;
+ kfree(frag);
+ }
+ kfree(frame->data);
+out:
+ kmem_cache_free(quic_frame_cachep, frame);
+}
+
+struct quic_frame *quic_frame_get(struct quic_frame *frame)
+{
+ refcount_inc(&frame->refcnt);
+ return frame;
+}
+
+void quic_frame_put(struct quic_frame *frame)
+{
+ if (refcount_dec_and_test(&frame->refcnt))
+ quic_frame_free(frame);
+}
+
+int quic_frame_stream_append(struct sock *sk, struct quic_frame *frame,
+ struct quic_msginfo *info, u8 pack)
+{
+ return -1;
+}
diff --git a/net/quic/frame.h b/net/quic/frame.h
new file mode 100644
index 000000000000..7bcdba1e9bdd
--- /dev/null
+++ b/net/quic/frame.h
@@ -0,0 +1,195 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_CLOSE_PHRASE_MAX_LEN 80
+
+#define QUIC_TOKEN_MAX_LEN 120
+
+#define QUIC_TICKET_MIN_LEN 64
+#define QUIC_TICKET_MAX_LEN 4096
+
+#define QUIC_FRAME_BUF_SMALL 20
+#define QUIC_FRAME_BUF_LARGE 100
+
+enum {
+ QUIC_FRAME_PADDING = 0x00,
+ QUIC_FRAME_PING = 0x01,
+ QUIC_FRAME_ACK = 0x02,
+ QUIC_FRAME_ACK_ECN = 0x03,
+ QUIC_FRAME_RESET_STREAM = 0x04,
+ QUIC_FRAME_STOP_SENDING = 0x05,
+ QUIC_FRAME_CRYPTO = 0x06,
+ QUIC_FRAME_NEW_TOKEN = 0x07,
+ QUIC_FRAME_STREAM = 0x08,
+ QUIC_FRAME_MAX_DATA = 0x10,
+ QUIC_FRAME_MAX_STREAM_DATA = 0x11,
+ QUIC_FRAME_MAX_STREAMS_BIDI = 0x12,
+ QUIC_FRAME_MAX_STREAMS_UNI = 0x13,
+ QUIC_FRAME_DATA_BLOCKED = 0x14,
+ QUIC_FRAME_STREAM_DATA_BLOCKED = 0x15,
+ QUIC_FRAME_STREAMS_BLOCKED_BIDI = 0x16,
+ QUIC_FRAME_STREAMS_BLOCKED_UNI = 0x17,
+ QUIC_FRAME_NEW_CONNECTION_ID = 0x18,
+ QUIC_FRAME_RETIRE_CONNECTION_ID = 0x19,
+ QUIC_FRAME_PATH_CHALLENGE = 0x1a,
+ QUIC_FRAME_PATH_RESPONSE = 0x1b,
+ QUIC_FRAME_CONNECTION_CLOSE = 0x1c,
+ QUIC_FRAME_CONNECTION_CLOSE_APP = 0x1d,
+ QUIC_FRAME_HANDSHAKE_DONE = 0x1e,
+ QUIC_FRAME_DATAGRAM = 0x30, /* RFC 9221 */
+ QUIC_FRAME_DATAGRAM_LEN = 0x31,
+ QUIC_FRAME_MAX = QUIC_FRAME_DATAGRAM_LEN,
+};
+
+enum {
+ QUIC_TRANSPORT_PARAM_ORIGINAL_DESTINATION_CONNECTION_ID = 0x0000,
+ QUIC_TRANSPORT_PARAM_MAX_IDLE_TIMEOUT = 0x0001,
+ QUIC_TRANSPORT_PARAM_STATELESS_RESET_TOKEN = 0x0002,
+ QUIC_TRANSPORT_PARAM_MAX_UDP_PAYLOAD_SIZE = 0x0003,
+ QUIC_TRANSPORT_PARAM_INITIAL_MAX_DATA = 0x0004,
+ QUIC_TRANSPORT_PARAM_INITIAL_MAX_STREAM_DATA_BIDI_LOCAL = 0x0005,
+ QUIC_TRANSPORT_PARAM_INITIAL_MAX_STREAM_DATA_BIDI_REMOTE = 0x0006,
+ QUIC_TRANSPORT_PARAM_INITIAL_MAX_STREAM_DATA_UNI = 0x0007,
+ QUIC_TRANSPORT_PARAM_INITIAL_MAX_STREAMS_BIDI = 0x0008,
+ QUIC_TRANSPORT_PARAM_INITIAL_MAX_STREAMS_UNI = 0x0009,
+ QUIC_TRANSPORT_PARAM_ACK_DELAY_EXPONENT = 0x000a,
+ QUIC_TRANSPORT_PARAM_MAX_ACK_DELAY = 0x000b,
+ QUIC_TRANSPORT_PARAM_DISABLE_ACTIVE_MIGRATION = 0x000c,
+ QUIC_TRANSPORT_PARAM_PREFERRED_ADDRESS = 0x000d,
+ QUIC_TRANSPORT_PARAM_ACTIVE_CONNECTION_ID_LIMIT = 0x000e,
+ QUIC_TRANSPORT_PARAM_INITIAL_SOURCE_CONNECTION_ID = 0x000f,
+ QUIC_TRANSPORT_PARAM_RETRY_SOURCE_CONNECTION_ID = 0x0010,
+ QUIC_TRANSPORT_PARAM_MAX_DATAGRAM_FRAME_SIZE = 0x0020,
+ QUIC_TRANSPORT_PARAM_GREASE_QUIC_BIT = 0x2ab2,
+ QUIC_TRANSPORT_PARAM_VERSION_INFORMATION = 0x11,
+ QUIC_TRANSPORT_PARAM_DISABLE_1RTT_ENCRYPTION = 0xbaad,
+};
+
+/* Arguments passed to create a STREAM frame */
+struct quic_msginfo {
+ struct quic_stream *stream; /* The QUIC stream associated with this frame */
+ struct iov_iter *msg; /* Iterator over message data to send */
+ u32 flags; /* Flags controlling stream frame creation */
+ u8 level; /* Encryption level for this frame */
+};
+
+/* Arguments passed to create a PING frame */
+struct quic_probeinfo {
+ u16 size; /* Size of the PING packet */
+ u8 level; /* Encryption level for this frame */
+};
+
+/* Operations for creating, processing, and acknowledging QUIC frames */
+struct quic_frame_ops {
+ struct quic_frame *(*frame_create)(struct sock *sk, void *data, u8 type);
+ int (*frame_process)(struct sock *sk, struct quic_frame *frame, u8 type);
+ void (*frame_ack)(struct sock *sk, struct quic_frame *frame);
+ u8 ack_eliciting;
+};
+
+/* Fragment of data appended to a STREAM frame */
+struct quic_frame_frag {
+ struct quic_frame_frag *next; /* Next fragment in the linked list */
+ u16 size; /* Size of this data fragment */
+ u8 data[]; /* Flexible array member holding fragment data */
+};
+
+struct quic_frame {
+ union {
+ struct quic_frame_frag *flist; /* For TX: linked list of appended data fragments */
+ struct sk_buff *skb; /* For RX: skb containing the raw frame data */
+ };
+ struct quic_stream *stream; /* Stream related to this frame, NULL if none */
+ struct list_head list; /* Linked list node for queuing frames */
+ union {
+ s64 offset; /* For RX: stream/crypto data offset or read data offset */
+ s64 number; /* For TX: first packet number used */
+ };
+ u8 *data; /* Pointer to the actual frame data buffer */
+
+ refcount_t refcnt;
+ u16 errcode; /* Error code set during frame processing */
+ u8 level; /* Packet number space: Initial, Handshake, or App */
+ u8 type; /* Frame type identifier */
+ u16 bytes; /* Number of user data bytes */
+ u16 size; /* Allocated data buffer size */
+ u16 len; /* Total frame length including appended fragments */
+
+ u8 ack_eliciting:1; /* Frame requires acknowledgment */
+ u8 transmitted:1; /* Frame is in the transmitted queue */
+ u8 stream_fin:1; /* Frame includes FIN flag for stream */
+ u8 nodelay:1; /* Frame bypasses Nagle's algorithm for sending */
+ u8 padding:1; /* Padding is needed after this frame */
+ u8 dgram:1; /* Frame represents a datagram message (RX only) */
+ u8 event:1; /* Frame represents an event (RX only) */
+ u8 path:1; /* Path index used to send this frame */
+};
+
+static inline bool quic_frame_new_conn_id(u8 type)
+{
+ return type == QUIC_FRAME_NEW_CONNECTION_ID;
+}
+
+static inline bool quic_frame_dgram(u8 type)
+{
+ return type == QUIC_FRAME_DATAGRAM || type == QUIC_FRAME_DATAGRAM_LEN;
+}
+
+static inline bool quic_frame_stream(u8 type)
+{
+ return type >= QUIC_FRAME_STREAM && type < QUIC_FRAME_MAX_DATA;
+}
+
+static inline bool quic_frame_sack(u8 type)
+{
+ return type == QUIC_FRAME_ACK || type == QUIC_FRAME_ACK_ECN;
+}
+
+static inline bool quic_frame_ping(u8 type)
+{
+ return type == QUIC_FRAME_PING;
+}
+
+/* Check if a given frame type is valid for the specified encryption level,
+ * based on the Frame Types table from rfc9000#section-12.4.
+ *
+ * Returns 0 if valid, 1 otherwise.
+ */
+static inline int quic_frame_level_check(u8 level, u8 type)
+{
+ if (level == QUIC_CRYPTO_APP)
+ return 0;
+
+ if (level == QUIC_CRYPTO_EARLY) {
+ if (type == QUIC_FRAME_ACK || type == QUIC_FRAME_ACK_ECN ||
+ type == QUIC_FRAME_CRYPTO || type == QUIC_FRAME_HANDSHAKE_DONE ||
+ type == QUIC_FRAME_NEW_TOKEN || type == QUIC_FRAME_PATH_RESPONSE ||
+ type == QUIC_FRAME_RETIRE_CONNECTION_ID)
+ return 1;
+ return 0;
+ }
+
+ if (type != QUIC_FRAME_ACK && type != QUIC_FRAME_ACK_ECN &&
+ type != QUIC_FRAME_PADDING && type != QUIC_FRAME_PING &&
+ type != QUIC_FRAME_CRYPTO && type != QUIC_FRAME_CONNECTION_CLOSE)
+ return 1;
+ return 0;
+}
+
+int quic_frame_stream_append(struct sock *sk, struct quic_frame *frame,
+ struct quic_msginfo *info, u8 pack);
+
+struct quic_frame *quic_frame_alloc(u32 size, u8 *data, gfp_t gfp);
+struct quic_frame *quic_frame_get(struct quic_frame *frame);
+void quic_frame_put(struct quic_frame *frame);
+
+struct quic_frame *quic_frame_create(struct sock *sk, u8 type, void *data);
+int quic_frame_process(struct sock *sk, struct quic_frame *frame);
+void quic_frame_ack(struct sock *sk, struct quic_frame *frame);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index fb98ef10f852..4725e3aa7785 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -21,6 +21,7 @@
static unsigned int quic_net_id __read_mostly;
+struct kmem_cache *quic_frame_cachep __read_mostly;
struct percpu_counter quic_sockets_allocated;
long sysctl_quic_mem[3];
@@ -335,6 +336,11 @@ static __init int quic_init(void)
quic_crypto_init();
+ quic_frame_cachep = kmem_cache_create("quic_frame", sizeof(struct quic_frame),
+ 0, SLAB_HWCACHE_ALIGN, NULL);
+ if (!quic_frame_cachep)
+ goto err;
+
err = percpu_counter_init(&quic_sockets_allocated, 0, GFP_KERNEL);
if (err)
goto err_percpu_counter;
@@ -363,6 +369,8 @@ static __init int quic_init(void)
err_hash:
percpu_counter_destroy(&quic_sockets_allocated);
err_percpu_counter:
+ kmem_cache_destroy(quic_frame_cachep);
+err:
return err;
}
@@ -375,6 +383,7 @@ static __exit void quic_exit(void)
unregister_pernet_subsys(&quic_net_ops);
quic_hash_tables_destroy();
percpu_counter_destroy(&quic_sockets_allocated);
+ kmem_cache_destroy(quic_frame_cachep);
pr_info("quic: exit\n");
}
diff --git a/net/quic/protocol.h b/net/quic/protocol.h
index 1df926ef0a75..92ad261199c1 100644
--- a/net/quic/protocol.h
+++ b/net/quic/protocol.h
@@ -8,6 +8,7 @@
* Xin Long <lucien.xin@gmail.com>
*/
+extern struct kmem_cache *quic_frame_cachep __read_mostly;
extern struct percpu_counter quic_sockets_allocated;
extern long sysctl_quic_mem[3];
diff --git a/net/quic/socket.h b/net/quic/socket.h
index b3cf31e005ce..75cb90177a01 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -20,6 +20,8 @@
#include "path.h"
#include "cong.h"
+#include "frame.h"
+
#include "protocol.h"
#include "timer.h"
--
2.47.1
* [PATCH net-next v2 15/15] quic: add packet builder and parser base
From: Xin Long @ 2025-08-18 14:04 UTC (permalink / raw)
To: network dev
This patch introduces 'quic_packet' to handle packing and unpacking of
QUIC packets on both the transmit (TX) and receive (RX) paths.
On the TX path, it provides frame packing and packet construction. The
packet configuration includes setting the path, computing header overhead,
and verifying routing. Frames are first appended to a queue, and the
packet is then built from the queued frames.
Once assembled, the packet is encrypted, bundled, and sent out. There
is also support to flush the packet when no additional frames remain.
Functions to create application (short) and handshake (long) packets
are currently placeholders for future implementation.
- quic_packet_config(): Set the path, compute overhead, and verify routing.
- quic_packet_tail(): Append a frame to the packet for transmission.
- quic_packet_create(): Create the packet with the queued frames.
- quic_packet_xmit(): Encrypt, bundle, and send out the packet.
- quic_packet_flush(): Send the packet if there's nothing left to bundle.
On the RX side, the patch introduces mechanisms to parse the ALPN from
client Initial packets to determine the correct listener socket. Received
packets are then routed and processed accordingly. Similar to the TX path,
handling for application and handshake packets is not yet implemented.
- quic_packet_parse_alpn(): Parse the ALPN from a client Initial packet,
then locate the appropriate listener using the ALPN.
- quic_packet_rcv(): Locate the appropriate socket to handle the packet
via quic_packet_process().
- quic_packet_process(): Process the received packet.
In addition to packet flow, this patch adds support for ICMP-based MTU
updates by locating the relevant socket and updating the stored PMTU
accordingly.
- quic_packet_rcv_err_pmtu(): Find the socket and update the PMTU via
quic_packet_mss_update().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
net/quic/Makefile | 2 +-
net/quic/packet.c | 892 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/packet.h | 129 +++++++
net/quic/protocol.c | 7 +
net/quic/socket.c | 114 ++++++
net/quic/socket.h | 12 +
6 files changed, 1155 insertions(+), 1 deletion(-)
create mode 100644 net/quic/packet.c
create mode 100644 net/quic/packet.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 645ee470c95e..4a43052eb441 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o pnspace.o crypto.o timer.o frame.o
+ cong.o pnspace.o crypto.o timer.o frame.o packet.o
diff --git a/net/quic/packet.c b/net/quic/packet.c
new file mode 100644
index 000000000000..47e094b56a21
--- /dev/null
+++ b/net/quic/packet.c
@@ -0,0 +1,892 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Packet builder and parser for the QUIC protocol.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include "socket.h"
+
+#define QUIC_HLEN 1
+
+#define QUIC_LONG_HLEN(dcid, scid) \
+ (QUIC_HLEN + QUIC_VERSION_LEN + 1 + (dcid)->len + 1 + (scid)->len)
+
+#define QUIC_VERSION_NUM 2
+
+/* Supported QUIC versions and their compatible versions. Used for Compatible Version
+ * Negotiation in rfc9368#section-2.3.
+ */
+static u32 quic_versions[QUIC_VERSION_NUM][4] = {
+ /* Version, Compatible Versions */
+ { QUIC_VERSION_V1, QUIC_VERSION_V2, QUIC_VERSION_V1, 0 },
+ { QUIC_VERSION_V2, QUIC_VERSION_V2, QUIC_VERSION_V1, 0 },
+};
+
+/* Get the compatible version list for a given QUIC version. */
+u32 *quic_packet_compatible_versions(u32 version)
+{
+ u8 i;
+
+ for (i = 0; i < QUIC_VERSION_NUM; i++)
+ if (version == quic_versions[i][0])
+ return quic_versions[i];
+ return NULL;
+}
+
+/* Convert version-specific type to internal standard packet type. */
+static u8 quic_packet_version_get_type(u32 version, u8 type)
+{
+ if (version == QUIC_VERSION_V1)
+ return type;
+
+ switch (type) {
+ case QUIC_PACKET_INITIAL_V2:
+ return QUIC_PACKET_INITIAL;
+ case QUIC_PACKET_0RTT_V2:
+ return QUIC_PACKET_0RTT;
+ case QUIC_PACKET_HANDSHAKE_V2:
+ return QUIC_PACKET_HANDSHAKE;
+ case QUIC_PACKET_RETRY_V2:
+ return QUIC_PACKET_RETRY;
+ default:
+ return -1;
+ }
+}
+
+/* Parse QUIC version and connection IDs (DCID and SCID) from a Long header packet buffer. */
+static int quic_packet_get_version_and_connid(struct quic_conn_id *dcid, struct quic_conn_id *scid,
+ u32 *version, u8 **pp, u32 *plen)
+{
+ u64 len, v;
+
+ *pp += QUIC_HLEN;
+ *plen -= QUIC_HLEN;
+
+ if (!quic_get_int(pp, plen, &v, QUIC_VERSION_LEN))
+ return -EINVAL;
+ *version = v;
+
+ if (!quic_get_int(pp, plen, &len, 1) ||
+ len > *plen || len > QUIC_CONN_ID_MAX_LEN)
+ return -EINVAL;
+ quic_conn_id_update(dcid, *pp, len);
+ *plen -= len;
+ *pp += len;
+
+ if (!quic_get_int(pp, plen, &len, 1) ||
+ len > *plen || len > QUIC_CONN_ID_MAX_LEN)
+ return -EINVAL;
+ quic_conn_id_update(scid, *pp, len);
+ *plen -= len;
+ *pp += len;
+ return 0;
+}
+
+/* Change the QUIC version for the connection.
+ *
+ * Frees existing initial crypto keys and installs new initial keys compatible with the new
+ * version.
+ */
+static int quic_packet_version_change(struct sock *sk, struct quic_conn_id *dcid, u32 version)
+{
+ struct quic_crypto *crypto = quic_crypto(sk, QUIC_CRYPTO_INITIAL);
+
+ if (quic_crypto_initial_keys_install(crypto, dcid, version, quic_is_serv(sk)))
+ return -1;
+
+ quic_packet(sk)->version = version;
+ return 0;
+}
+
+/* Select the best compatible QUIC version from offered list.
+ *
+ * Considers the local preferred version, currently chosen version, and versions offered by
+ * the peer. Selects the best compatible version based on client/server role and updates the
+ * connection version accordingly.
+ */
+int quic_packet_select_version(struct sock *sk, u32 *versions, u8 count)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_config *c = quic_config(sk);
+ u8 i, pref_found = 0, ch_found = 0;
+ u32 preferred, chosen, best = 0;
+
+ preferred = c->version ?: QUIC_VERSION_V1;
+ chosen = packet->version;
+
+ for (i = 0; i < count; i++) {
+ if (!quic_packet_compatible_versions(versions[i]))
+ continue;
+ if (preferred == versions[i])
+ pref_found = 1;
+ if (chosen == versions[i])
+ ch_found = 1;
+ if (best < versions[i]) /* Track highest offered version. */
+ best = versions[i];
+ }
+
+ if (!pref_found && !ch_found && !best)
+ return -1;
+
+ if (quic_is_serv(sk)) { /* Server prefers preferred version if offered, else chosen. */
+ if (pref_found)
+ best = preferred;
+ else if (ch_found)
+ best = chosen;
+ } else { /* Client prefers chosen version, else preferred. */
+ if (ch_found)
+ best = chosen;
+ else if (pref_found)
+ best = preferred;
+ }
+
+ if (packet->version == best)
+ return 0;
+
+ /* Change to selected best version. */
+ return quic_packet_version_change(sk, &quic_paths(sk)->orig_dcid, best);
+}
+
+/* Extracts a QUIC token from a buffer in the Client Initial packet. */
+static int quic_packet_get_token(struct quic_data *token, u8 **pp, u32 *plen)
+{
+ u64 len;
+
+ if (!quic_get_var(pp, plen, &len) || len > *plen)
+ return -EINVAL;
+ quic_data(token, *pp, len);
+ *plen -= len;
+ *pp += len;
+ return 0;
+}
+
+/* Process PMTU reduction event on a QUIC socket. */
+void quic_packet_rcv_err_pmtu(struct sock *sk)
+{
+ struct quic_path_group *paths = quic_paths(sk);
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_config *c = quic_config(sk);
+ u32 pathmtu, info, taglen;
+ struct dst_entry *dst;
+ bool reset_timer;
+
+ if (!ip_sk_accept_pmtu(sk))
+ return;
+
+ info = clamp(paths->mtu_info, QUIC_PATH_MIN_PMTU, QUIC_PATH_MAX_PMTU);
+ /* If PLPMTUD is not enabled, update MSS using the route and ICMP info. */
+ if (!c->plpmtud_probe_interval) {
+ if (quic_packet_route(sk) < 0)
+ return;
+
+ dst = __sk_dst_get(sk);
+ dst->ops->update_pmtu(dst, sk, NULL, info, true);
+ quic_packet_mss_update(sk, info - packet->hlen);
+ return;
+ }
+ /* PLPMTUD is enabled: adjust to smaller PMTU, subtract headers and AEAD tag. Also
+ * notify the QUIC path layer for possible state changes and probing.
+ */
+ taglen = quic_packet_taglen(packet);
+ info = info - packet->hlen - taglen;
+ pathmtu = quic_path_pl_toobig(paths, info, &reset_timer);
+ if (reset_timer)
+ quic_timer_reset(sk, QUIC_TIMER_PMTU, c->plpmtud_probe_interval);
+ if (pathmtu)
+ quic_packet_mss_update(sk, pathmtu + taglen);
+}
+
+/* Handle ICMP Toobig packet and update QUIC socket path MTU. */
+static int quic_packet_rcv_err(struct sk_buff *skb)
+{
+ union quic_addr daddr, saddr;
+ struct sock *sk = NULL;
+ int ret = 0;
+ u32 info;
+
+ /* All we can do is lookup the matching QUIC socket by addresses. */
+ quic_get_msg_addrs(skb, &saddr, &daddr);
+ sk = quic_sock_lookup(skb, &daddr, &saddr, NULL);
+ if (!sk)
+ return -ENOENT;
+
+ bh_lock_sock(sk);
+ if (quic_is_listen(sk))
+ goto out;
+
+ if (quic_get_mtu_info(skb, &info))
+ goto out;
+
+ ret = 1; /* Success: update socket path MTU info. */
+ quic_paths(sk)->mtu_info = info;
+ if (sock_owned_by_user(sk)) {
+ /* Socket is in use by userspace context. Defer MTU processing to later via
+ * tasklet. Ensure the socket is not dropped before deferral.
+ */
+ if (!test_and_set_bit(QUIC_MTU_REDUCED_DEFERRED, &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+ /* Otherwise, process the MTU reduction now. */
+ quic_packet_rcv_err_pmtu(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ return ret;
+}
+
+#define TLS_MT_CLIENT_HELLO 1
+#define TLS_EXT_alpn 16
+
+/* TLS Client Hello Msg:
+ *
+ * uint16 ProtocolVersion;
+ * opaque Random[32];
+ * uint8 CipherSuite[2];
+ *
+ * struct {
+ * ExtensionType extension_type;
+ * opaque extension_data<0..2^16-1>;
+ * } Extension;
+ *
+ * struct {
+ * ProtocolVersion legacy_version = 0x0303;
+ * Random rand;
+ * opaque legacy_session_id<0..32>;
+ * CipherSuite cipher_suites<2..2^16-2>;
+ * opaque legacy_compression_methods<1..2^8-1>;
+ * Extension extensions<8..2^16-1>;
+ * } ClientHello;
+ */
+
+#define TLS_CH_RANDOM_LEN 32
+#define TLS_CH_VERSION_LEN 2
+
+/* Extract ALPN data from a TLS ClientHello message.
+ *
+ * Parses the TLS ClientHello handshake message to find the ALPN (Application Layer Protocol
+ * Negotiation) TLS extension. It validates the TLS ClientHello structure, including version,
+ * random, session ID, cipher suites, compression methods, and extensions. Once the ALPN
+ * extension is found, the ALPN protocols list is extracted and stored in @alpn.
+ *
+ * Return: 0 on success or no ALPN found, a negative error code on failed parsing.
+ */
+static int quic_packet_get_alpn(struct quic_data *alpn, u8 *p, u32 len)
+{
+ int err = -EINVAL, found = 0;
+ u64 length, type;
+
+ /* Verify handshake message type (ClientHello) and its length. */
+ if (!quic_get_int(&p, &len, &type, 1) || type != TLS_MT_CLIENT_HELLO)
+ return err;
+ if (!quic_get_int(&p, &len, &length, 3) ||
+ length < TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN)
+ return err;
+ if (len > (u32)length) /* Limit len to handshake message length if larger. */
+ len = length;
+ /* Skip legacy_version (2 bytes) + random (32 bytes). */
+ p += TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN;
+ len -= TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN;
+ /* legacy_session_id_len must be zero (QUIC requirement). */
+ if (!quic_get_int(&p, &len, &length, 1) || length)
+ return err;
+
+ /* Skip cipher_suites (2 bytes length + variable data). */
+ if (!quic_get_int(&p, &len, &length, 2) || length > (u64)len)
+ return err;
+ len -= length;
+ p += length;
+
+ /* Skip legacy_compression_methods (1 byte length + variable data). */
+ if (!quic_get_int(&p, &len, &length, 1) || length > (u64)len)
+ return err;
+ len -= length;
+ p += length;
+
+ if (!quic_get_int(&p, &len, &length, 2)) /* Read TLS extensions length (2 bytes). */
+ return err;
+ if (len > (u32)length) /* Limit len to extensions length if larger. */
+ len = length;
+ while (len > 4) { /* Iterate over extensions to find ALPN (type TLS_EXT_alpn). */
+ if (!quic_get_int(&p, &len, &type, 2))
+ break;
+ if (!quic_get_int(&p, &len, &length, 2))
+ break;
+ if (len < (u32)length) /* Incomplete TLS extensions. */
+ return 0;
+ if (type == TLS_EXT_alpn) { /* Found ALPN extension. */
+ len = length;
+ found = 1;
+ break;
+ }
+ /* Skip non-ALPN extensions. */
+ p += length;
+ len -= length;
+ }
+ if (!found) { /* no ALPN extension found: set alpn->len = 0 and alpn->data = p. */
+ quic_data(alpn, p, 0);
+ return 0;
+ }
+
+ /* Parse ALPN protocols list length (2 bytes). */
+ if (!quic_get_int(&p, &len, &length, 2) || length > (u64)len)
+ return err;
+ quic_data(alpn, p, length); /* Store ALPN protocols list in alpn->data. */
+ len = length;
+ while (len) { /* Validate ALPN protocols list format. */
+ if (!quic_get_int(&p, &len, &length, 1) || length > (u64)len) {
+ /* Malformed ALPN entry: set alpn->len = 0 and alpn->data = NULL. */
+ quic_data(alpn, NULL, 0);
+ return err;
+ }
+ len -= length;
+ p += length;
+ }
+ pr_debug("%s: alpn_len: %d\n", __func__, alpn->len);
+ return 0;
+}
+
+/* Parse ALPN from a QUIC Initial packet.
+ *
+ * This function processes a QUIC Initial packet to extract the ALPN from the TLS ClientHello
+ * message inside the QUIC CRYPTO frame. It verifies packet type, version compatibility,
+ * decrypts the packet payload, and locates the CRYPTO frame to parse the TLS ClientHello.
+ * Finally, it calls quic_packet_get_alpn() to extract the ALPN extension data.
+ *
+ * Return: 0 on success or no ALPN found, a negative error code on failed parsing.
+ */
+static int quic_packet_parse_alpn(struct sk_buff *skb, struct quic_data *alpn)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct net *net = dev_net(skb->dev);
+ struct quic_net *qn = quic_net(net);
+ u8 *p = skb->data, *data, type;
+ struct quic_conn_id dcid, scid;
+ u32 len = skb->len, version;
+ struct quic_crypto *crypto;
+ struct quic_data token;
+ u64 offset, length;
+ int err = -EINVAL;
+
+ if (quic_packet_get_version_and_connid(&dcid, &scid, &version, &p, &len))
+ return -EINVAL;
+ if (!quic_packet_compatible_versions(version))
+ return 0;
+ /* Only parse Initial packets. */
+ type = quic_packet_version_get_type(version, quic_hshdr(skb)->type);
+ if (type != QUIC_PACKET_INITIAL)
+ return 0;
+ if (quic_packet_get_token(&token, &p, &len))
+ return -EINVAL;
+ if (!quic_get_var(&p, &len, &length) || length > (u64)len)
+ return err;
+ cb->length = (u16)length;
+ /* Copy skb data for restoring in case of decrypt failure. */
+ data = kmemdup(skb->data, skb->len, GFP_ATOMIC);
+ if (!data)
+ return -ENOMEM;
+
+ spin_lock(&qn->lock);
+ /* Install initial keys for packet decryption to crypto. */
+ crypto = &quic_net(net)->crypto;
+ err = quic_crypto_initial_keys_install(crypto, &dcid, version, 1);
+ if (err) {
+ spin_unlock(&qn->lock);
+ goto out;
+ }
+ cb->number_offset = (u16)(p - skb->data);
+ err = quic_crypto_decrypt(crypto, skb);
+ if (err) {
+ spin_unlock(&qn->lock);
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_DECDROP);
+ /* Restore original data on decrypt failure. */
+ memcpy(skb->data, data, skb->len);
+ goto out;
+ }
+ spin_unlock(&qn->lock);
+
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_DECFASTPATHS);
+ cb->resume = 1; /* Mark this packet as already decrypted. */
+
+ /* Find the QUIC CRYPTO frame. */
+ p += cb->number_len;
+ len = cb->length - cb->number_len - QUIC_TAG_LEN;
+ for (; len && !(*p); p++, len--) /* Skip the padding frame. */
+ ;
+ if (!len-- || *p++ != QUIC_FRAME_CRYPTO)
+ goto out;
+ if (!quic_get_var(&p, &len, &offset) || offset)
+ goto out;
+ if (!quic_get_var(&p, &len, &length) || length > (u64)len)
+ goto out;
+
+ /* Parse the TLS CLIENT_HELLO message. */
+ err = quic_packet_get_alpn(alpn, p, length);
+
+out:
+ kfree(data);
+ return err;
+}
+
+/* Extract the Destination Connection ID (DCID) from a QUIC Long header packet. */
+int quic_packet_get_dcid(struct quic_conn_id *dcid, struct sk_buff *skb)
+{
+ u32 plen = skb->len;
+ u8 *p = skb->data;
+ u64 len;
+
+ if (plen < QUIC_HLEN + QUIC_VERSION_LEN)
+ return -EINVAL;
+ plen -= (QUIC_HLEN + QUIC_VERSION_LEN);
+ p += (QUIC_HLEN + QUIC_VERSION_LEN);
+
+ if (!quic_get_int(&p, &plen, &len, 1) ||
+ len > plen || len > QUIC_CONN_ID_MAX_LEN)
+ return -EINVAL;
+ quic_conn_id_update(dcid, p, len);
+ return 0;
+}
+
+/* Determine the QUIC socket associated with an incoming packet. */
+static struct sock *quic_packet_get_sock(struct sk_buff *skb)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct net *net = dev_net(skb->dev);
+ struct quic_conn_id dcid, *conn_id;
+ union quic_addr daddr, saddr;
+ struct quic_data alpns = {};
+ struct sock *sk = NULL;
+
+ if (skb->len < QUIC_HLEN)
+ return NULL;
+
+ if (!quic_hdr(skb)->form) { /* Short header path. */
+ if (skb->len < QUIC_HLEN + QUIC_CONN_ID_DEF_LEN)
+ return NULL;
+ /* Fast path: look up QUIC connection by fixed-length DCID
+ * (Currently, only source CIDs of size QUIC_CONN_ID_DEF_LEN are used).
+ */
+ conn_id = quic_conn_id_lookup(net, skb->data + QUIC_HLEN,
+ QUIC_CONN_ID_DEF_LEN);
+ if (conn_id) {
+ cb->seqno = quic_conn_id_number(conn_id);
+ return quic_conn_id_sk(conn_id); /* Return associated socket. */
+ }
+
+ /* Fallback: listener socket lookup
+ * (May be used to send a stateless reset from a listen socket).
+ */
+ quic_get_msg_addrs(skb, &daddr, &saddr);
+ sk = quic_listen_sock_lookup(skb, &daddr, &saddr, &alpns);
+ if (sk)
+ return sk;
+ /* Final fallback: address-based connection lookup
+ * (May be used to receive a stateless reset).
+ */
+ return quic_sock_lookup(skb, &daddr, &saddr, NULL);
+ }
+
+ /* Long header path. */
+ if (quic_packet_get_dcid(&dcid, skb))
+ return NULL;
+ /* Fast path: look up QUIC connection by parsed DCID. */
+ conn_id = quic_conn_id_lookup(net, dcid.data, dcid.len);
+ if (conn_id) {
+ cb->seqno = quic_conn_id_number(conn_id);
+ return quic_conn_id_sk(conn_id); /* Return associated socket. */
+ }
+
+ /* Fallback: address + DCID lookup
+ * (May be used for 0-RTT or a follow-up Client Initial packet).
+ */
+ quic_get_msg_addrs(skb, &daddr, &saddr);
+ sk = quic_sock_lookup(skb, &daddr, &saddr, &dcid);
+ if (sk)
+ return sk;
+ /* Final fallback: listener socket lookup
+ * (Used for receiving the first Client Initial packet).
+ */
+ if (quic_packet_parse_alpn(skb, &alpns))
+ return NULL;
+ return quic_listen_sock_lookup(skb, &daddr, &saddr, &alpns);
+}
+
+/* Entry point for processing received QUIC packets. */
+int quic_packet_rcv(struct sk_buff *skb, u8 err)
+{
+ struct net *net = dev_net(skb->dev);
+ struct sock *sk;
+
+ if (unlikely(err))
+ return quic_packet_rcv_err(skb);
+
+ skb_pull(skb, skb_transport_offset(skb));
+
+ /* Look up socket from socket or connection IDs hash tables. */
+ sk = quic_packet_get_sock(skb);
+ if (!sk)
+ goto err;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ /* Socket is busy (owned by user context): queue to backlog. */
+ if (sk_add_backlog(sk, skb, READ_ONCE(sk->sk_rcvbuf))) {
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_RCVDROP);
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ goto err;
+ }
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_RCVBACKLOGS);
+ } else {
+ /* Socket not busy: process immediately. */
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_RCVFASTPATHS);
+ sk->sk_backlog_rcv(sk, skb); /* quic_packet_process(). */
+ }
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ return 0;
+
+err:
+ kfree_skb(skb);
+ return -EINVAL;
+}
+
+static int quic_packet_listen_process(struct sock *sk, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+ return -EOPNOTSUPP;
+}
+
+static int quic_packet_handshake_process(struct sock *sk, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+ return -EOPNOTSUPP;
+}
+
+static int quic_packet_app_process(struct sock *sk, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+ return -EOPNOTSUPP;
+}
+
+int quic_packet_process(struct sock *sk, struct sk_buff *skb)
+{
+ if (quic_is_closed(sk)) {
+ kfree_skb(skb);
+ return 0;
+ }
+
+ if (quic_is_listen(sk))
+ return quic_packet_listen_process(sk, skb);
+
+ if (quic_hdr(skb)->form)
+ return quic_packet_handshake_process(sk, skb);
+
+ return quic_packet_app_process(sk, skb);
+}
+
+/* Use fixed packet number and length field sizes to simplify encoding. */
+#define QUIC_PACKET_NUMBER_LEN 4
+#define QUIC_PACKET_LENGTH_LEN 4
+
+static struct sk_buff *quic_packet_handshake_create(struct sock *sk)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_frame *frame, *next;
+
+ /* Free all frames for now, and future patches will implement the actual creation logic. */
+ list_for_each_entry_safe(frame, next, &packet->frame_list, list) {
+ list_del(&frame->list);
+ quic_frame_put(frame);
+ }
+ return NULL;
+}
+
+static int quic_packet_number_check(struct sock *sk)
+{
+ return 0;
+}
+
+static struct sk_buff *quic_packet_app_create(struct sock *sk)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_frame *frame, *next;
+
+ /* Free all frames for now, and future patches will implement the actual creation logic. */
+ list_for_each_entry_safe(frame, next, &packet->frame_list, list) {
+ list_del(&frame->list);
+ quic_frame_put(frame);
+ }
+ return NULL;
+}
+
+/* Update the MSS and inform congestion control. */
+void quic_packet_mss_update(struct sock *sk, u32 mss)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_cong *cong = quic_cong(sk);
+
+ packet->mss[0] = (u16)mss;
+ quic_cong_set_mss(cong, packet->mss[0] - packet->taglen[0]);
+}
+
+/* Perform routing for the QUIC packet on the specified path, update header length and MSS
+ * accordingly, reset path and start PMTU timer.
+ */
+int quic_packet_route(struct sock *sk)
+{
+ struct quic_path_group *paths = quic_paths(sk);
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_config *c = quic_config(sk);
+ union quic_addr *sa, *da;
+ u32 pmtu;
+ int err;
+
+ da = quic_path_daddr(paths, packet->path);
+ sa = quic_path_saddr(paths, packet->path);
+ err = quic_flow_route(sk, da, sa, &paths->fl);
+ if (err)
+ return err;
+
+ packet->hlen = quic_encap_len(da);
+ pmtu = min_t(u32, dst_mtu(__sk_dst_get(sk)), QUIC_PATH_MAX_PMTU);
+ quic_packet_mss_update(sk, pmtu - packet->hlen);
+
+ quic_path_pl_reset(paths);
+ quic_timer_reset(sk, QUIC_TIMER_PMTU, c->plpmtud_probe_interval);
+ return 0;
+}
+
+/* Configure the QUIC packet header and routing based on encryption level and path. */
+int quic_packet_config(struct sock *sk, u8 level, u8 path)
+{
+ struct quic_conn_id_set *dest = quic_dest(sk), *source = quic_source(sk);
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_config *c = quic_config(sk);
+ u32 hlen = QUIC_HLEN;
+
+ /* If packet already has data, no need to reconfigure. */
+ if (!quic_packet_empty(packet))
+ return 0;
+
+ packet->ack_eliciting = 0;
+ packet->frame_len = 0;
+ packet->ipfragok = 0;
+ packet->padding = 0;
+ packet->frames = 0;
+ hlen += QUIC_PACKET_NUMBER_LEN; /* Packet number length. */
+ hlen += quic_conn_id_choose(dest, path)->len; /* DCID length. */
+ if (level) {
+ hlen += 1; /* Length byte for DCID. */
+ hlen += 1 + quic_conn_id_active(source)->len; /* Length byte + SCID length. */
+ if (level == QUIC_CRYPTO_INITIAL) /* Include token for Initial packets. */
+ hlen += quic_var_len(quic_token(sk)->len) + quic_token(sk)->len;
+ hlen += QUIC_VERSION_LEN; /* Version length. */
+ hlen += QUIC_PACKET_LENGTH_LEN; /* Packet length field length. */
+ /* Allow fragmentation if PLPMTUD is enabled, as it no longer relies on ICMP
+ * Toobig messages to discover the path MTU.
+ */
+ packet->ipfragok = !!c->plpmtud_probe_interval;
+ }
+ packet->level = level;
+ packet->len = (u16)hlen;
+ packet->overhead = (u8)hlen;
+
+ if (packet->path != path) { /* If the path changed, update and reset routing cache. */
+ packet->path = path;
+ __sk_dst_reset(sk);
+ }
+
+ /* Perform routing and MSS update for the configured packet. */
+ if (quic_packet_route(sk) < 0)
+ return -1;
+ return 0;
+}
+
+static void quic_packet_encrypt_done(struct sk_buff *skb, int err)
+{
+ /* Free it for now; future patches will implement the actual deferred transmission logic. */
+ kfree_skb(skb);
+}
+
+/* Coalesce multiple packets into a single UDP datagram. */
+static int quic_packet_bundle(struct sock *sk, struct sk_buff *skb)
+{
+ struct quic_skb_cb *head_cb, *cb = QUIC_SKB_CB(skb);
+ struct quic_packet *packet = quic_packet(sk);
+ struct sk_buff *p;
+
+ if (!packet->head) { /* First packet to bundle: initialize the head. */
+ packet->head = skb;
+ cb->last = skb;
+ goto out;
+ }
+
+ /* If bundling would exceed MSS, flush the current bundle. */
+ if (packet->head->len + skb->len >= packet->mss[0]) {
+ quic_packet_flush(sk);
+ packet->head = skb;
+ cb->last = skb;
+ goto out;
+ }
+ /* Bundle it and update metadata for the aggregate skb. */
+ p = packet->head;
+ head_cb = QUIC_SKB_CB(p);
+ if (head_cb->last == p)
+ skb_shinfo(p)->frag_list = skb;
+ else
+ head_cb->last->next = skb;
+ p->data_len += skb->len;
+ p->truesize += skb->truesize;
+ p->len += skb->len;
+ head_cb->last = skb;
+ head_cb->ecn |= cb->ecn; /* Merge ECN flags. */
+
+out:
+ /* rfc9000#section-12.2:
+ * Packets with a short header (Section 17.3) do not contain a Length field and so
+ * cannot be followed by other packets in the same UDP datagram.
+ *
+ * So return 1 to flush if this is a short header packet.
+ */
+ return !cb->level;
+}
+
+/* Transmit a QUIC packet, possibly encrypting and bundling it. */
+int quic_packet_xmit(struct sock *sk, struct sk_buff *skb)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct net *net = sock_net(sk);
+ int err;
+
+ /* Associate skb with sk to ensure sk is valid during async encryption completion. */
+ WARN_ON(!skb_set_owner_sk_safe(skb, sk));
+
+ /* Skip encryption if taglen == 0 (e.g., disable_1rtt_encryption). */
+ if (!packet->taglen[quic_hdr(skb)->form])
+ goto xmit;
+
+ cb->crypto_done = quic_packet_encrypt_done;
+ err = quic_crypto_encrypt(quic_crypto(sk, packet->level), skb);
+ if (err) {
+ if (err != -EINPROGRESS) {
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_ENCDROP);
+ kfree_skb(skb);
+ return err;
+ }
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_ENCBACKLOGS);
+ return err;
+ }
+ if (!cb->resume) /* Encryption completes synchronously. */
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_ENCFASTPATHS);
+
+xmit:
+ if (quic_packet_bundle(sk, skb))
+ quic_packet_flush(sk);
+ return 0;
+}
+
+/* Create and transmit a new QUIC packet. */
+int quic_packet_create(struct sock *sk)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct sk_buff *skb;
+ int err;
+
+ err = quic_packet_number_check(sk);
+ if (err)
+ goto err;
+
+ if (packet->level)
+ skb = quic_packet_handshake_create(sk);
+ else
+ skb = quic_packet_app_create(sk);
+ if (!skb) {
+ err = -ENOMEM;
+ goto err;
+ }
+
+ err = quic_packet_xmit(sk, skb);
+ if (err && err != -EINPROGRESS)
+ goto err;
+
+ /* Return 1 if at least one ACK-eliciting (non-PING) frame was sent. */
+ return !!packet->frames;
+err:
+ pr_debug("%s: err: %d\n", __func__, err);
+ return 0;
+}
+
+/* Flush any coalesced/bundled QUIC packets. */
+void quic_packet_flush(struct sock *sk)
+{
+ struct quic_path_group *paths = quic_paths(sk);
+ struct quic_packet *packet = quic_packet(sk);
+
+ if (packet->head) {
+ quic_lower_xmit(sk, packet->head,
+ quic_path_daddr(paths, packet->path), &paths->fl);
+ packet->head = NULL;
+ }
+}
+
+/* Append a frame to the tail of the current QUIC packet. */
+int quic_packet_tail(struct sock *sk, struct quic_frame *frame)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ u8 taglen;
+
+ /* Reject frame if it doesn't match the packet's encryption level or path, or if
+ * padding is already in place (no further frames should be added).
+ */
+ if (frame->level != (packet->level % QUIC_CRYPTO_EARLY) ||
+ frame->path != packet->path || packet->padding)
+ return 0;
+
+ /* Check if frame would exceed the current datagram MSS (excluding AEAD tag). */
+ taglen = quic_packet_taglen(packet);
+ if (packet->len + frame->len > packet->mss[frame->dgram] - taglen) {
+ /* If some data has already been added to the packet, bail out. */
+ if (packet->len != packet->overhead)
+ return 0;
+ /* Otherwise, allow IP fragmentation for this packet unless it's a PING probe. */
+ if (!quic_frame_ping(frame->type))
+ packet->ipfragok = 1;
+ }
+ if (frame->padding)
+ packet->padding = frame->padding;
+
+ /* Track frames that require retransmission if lost (i.e., ACK-eliciting and non-PING). */
+ if (frame->ack_eliciting) {
+ packet->ack_eliciting = 1;
+ if (!quic_frame_ping(frame->type)) {
+ packet->frames++;
+ packet->frame_len += frame->len;
+ }
+ }
+
+ list_move_tail(&frame->list, &packet->frame_list);
+ packet->len += frame->len;
+ return frame->len;
+}
+
+void quic_packet_init(struct sock *sk)
+{
+ struct quic_packet *packet = quic_packet(sk);
+
+ INIT_LIST_HEAD(&packet->frame_list);
+ packet->taglen[0] = QUIC_TAG_LEN;
+ packet->taglen[1] = QUIC_TAG_LEN;
+ packet->mss[0] = QUIC_TAG_LEN;
+ packet->mss[1] = QUIC_TAG_LEN;
+
+ packet->version = QUIC_VERSION_V1;
+}
diff --git a/net/quic/packet.h b/net/quic/packet.h
new file mode 100644
index 000000000000..c2b4a3b2f16c
--- /dev/null
+++ b/net/quic/packet.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+struct quic_packet {
+ struct quic_conn_id dcid; /* Dest Connection ID from received packet */
+ struct quic_conn_id scid; /* Source Connection ID from received packet */
+ union quic_addr daddr; /* Dest address from received packet */
+ union quic_addr saddr; /* Source address from received packet */
+
+ struct list_head frame_list; /* List of frames to pack into packet for send */
+ struct sk_buff *head; /* Head skb for packet bundling on send */
+ u16 frame_len; /* Length of all ack-eliciting frames excluding PING */
+ u8 taglen[2]; /* Tag length for short and long packets */
+ u32 version; /* QUIC version used/selected during handshake */
+ u8 errframe; /* Frame type causing packet processing failure */
+ u8 overhead; /* QUIC header length excluding frames */
+ u16 errcode; /* Error code on packet processing failure */
+ u16 frames; /* Number of ack-eliciting frames excluding PING */
+ u16 mss[2]; /* MSS for datagram and non-datagram packets */
+ u16 hlen; /* UDP + IP header length for sending */
+ u16 len; /* QUIC packet length excluding taglen for sending */
+
+ u8 ack_eliciting:1; /* Packet contains ack-eliciting frames to send */
+ u8 ack_requested:1; /* Packet contains ack-eliciting frames received */
+ u8 ack_immediate:1; /* Send ACK immediately (skip ack_delay timer) */
+ u8 non_probing:1; /* Packet has ack-eliciting frames excluding NEW_CONNECTION_ID */
+ u8 has_sack:1; /* Packet has ACK frames received */
+ u8 ipfragok:1; /* Allow IP fragmentation */
+ u8 padding:1; /* Packet has padding frames */
+ u8 path:1; /* Path identifier used to send this packet */
+ u8 level; /* Encryption level used */
+};
+
+struct quic_packet_sent {
+ struct list_head list; /* Link in sent packet list for ACK tracking */
+ u32 sent_time; /* Time when packet was sent */
+ u16 frame_len; /* Combined length of all frames held */
+ u16 frames; /* Number of frames held */
+
+ s64 number; /* Packet number */
+ u8 level; /* Packet number space */
+ u8 ecn:2; /* ECN bits */
+
+ struct quic_frame *frame_array[]; /* Array of pointers to held frames */
+};
+
+#define QUIC_PACKET_INITIAL_V1 0
+#define QUIC_PACKET_0RTT_V1 1
+#define QUIC_PACKET_HANDSHAKE_V1 2
+#define QUIC_PACKET_RETRY_V1 3
+
+#define QUIC_PACKET_INITIAL_V2 1
+#define QUIC_PACKET_0RTT_V2 2
+#define QUIC_PACKET_HANDSHAKE_V2 3
+#define QUIC_PACKET_RETRY_V2 0
+
+#define QUIC_PACKET_INITIAL QUIC_PACKET_INITIAL_V1
+#define QUIC_PACKET_0RTT QUIC_PACKET_0RTT_V1
+#define QUIC_PACKET_HANDSHAKE QUIC_PACKET_HANDSHAKE_V1
+#define QUIC_PACKET_RETRY QUIC_PACKET_RETRY_V1
+
+#define QUIC_VERSION_LEN 4
+
+static inline u8 quic_packet_taglen(struct quic_packet *packet)
+{
+ return packet->taglen[!!packet->level];
+}
+
+static inline void quic_packet_set_taglen(struct quic_packet *packet, u8 taglen)
+{
+ packet->taglen[0] = taglen;
+}
+
+static inline u32 quic_packet_mss(struct quic_packet *packet)
+{
+ return packet->mss[0] - packet->taglen[!!packet->level];
+}
+
+static inline u32 quic_packet_max_payload(struct quic_packet *packet)
+{
+ return packet->mss[0] - packet->overhead - packet->taglen[!!packet->level];
+}
+
+static inline u32 quic_packet_max_payload_dgram(struct quic_packet *packet)
+{
+ return packet->mss[1] - packet->overhead - packet->taglen[!!packet->level];
+}
+
+static inline int quic_packet_empty(struct quic_packet *packet)
+{
+ return list_empty(&packet->frame_list);
+}
+
+static inline void quic_packet_reset(struct quic_packet *packet)
+{
+ packet->level = 0;
+ packet->errcode = 0;
+ packet->errframe = 0;
+ packet->has_sack = 0;
+ packet->non_probing = 0;
+ packet->ack_requested = 0;
+ packet->ack_immediate = 0;
+}
+
+int quic_packet_tail(struct sock *sk, struct quic_frame *frame);
+int quic_packet_process(struct sock *sk, struct sk_buff *skb);
+int quic_packet_config(struct sock *sk, u8 level, u8 path);
+
+int quic_packet_xmit(struct sock *sk, struct sk_buff *skb);
+int quic_packet_create(struct sock *sk);
+int quic_packet_route(struct sock *sk);
+
+void quic_packet_mss_update(struct sock *sk, u32 mss);
+void quic_packet_flush(struct sock *sk);
+void quic_packet_init(struct sock *sk);
+
+int quic_packet_get_dcid(struct quic_conn_id *dcid, struct sk_buff *skb);
+int quic_packet_select_version(struct sock *sk, u32 *versions, u8 count);
+u32 *quic_packet_compatible_versions(u32 version);
+
+void quic_packet_rcv_err_pmtu(struct sock *sk);
+int quic_packet_rcv(struct sk_buff *skb, u8 err);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index 4725e3aa7785..65f2036fb5f3 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -352,6 +352,10 @@ static __init int quic_init(void)
if (err)
goto err_def_ops;
+ err = quic_path_init(quic_packet_rcv);
+ if (err)
+ goto err_path;
+
err = quic_protosw_init();
if (err)
goto err_protosw;
@@ -363,6 +367,8 @@ static __init int quic_init(void)
return 0;
err_protosw:
+ quic_path_destroy();
+err_path:
unregister_pernet_subsys(&quic_net_ops);
err_def_ops:
quic_hash_tables_destroy();
@@ -380,6 +386,7 @@ static __exit void quic_exit(void)
quic_sysctl_unregister();
#endif
quic_protosw_exit();
+ quic_path_destroy();
unregister_pernet_subsys(&quic_net_ops);
quic_hash_tables_destroy();
percpu_counter_destroy(&quic_sockets_allocated);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index cbcfec3a02b2..868266fcd63c 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -25,6 +25,113 @@ static void quic_enter_memory_pressure(struct sock *sk)
WRITE_ONCE(quic_memory_pressure, 1);
}
+/* Lookup a connected QUIC socket based on address and dest connection ID.
+ *
+ * This function searches the established (non-listening) QUIC socket table for a socket that
+ * matches the source and dest addresses and, optionally, the dest connection ID (DCID). The
+ * value returned by quic_path_orig_dcid() might be the original dest connection ID from the
+ * ClientHello or the Source Connection ID from an earlier Retry packet.
+ *
+ * The DCID is provided from a handshake packet when searching by source connection ID fails,
+ * such as when the peer has not yet received the server's response and updated the DCID.
+ *
+ * Return: A pointer to the matching connected socket, or NULL if no match is found.
+ */
+struct sock *quic_sock_lookup(struct sk_buff *skb, union quic_addr *sa, union quic_addr *da,
+ struct quic_conn_id *dcid)
+{
+ struct net *net = dev_net(skb->dev);
+ struct quic_path_group *paths;
+ struct quic_hash_head *head;
+ struct sock *sk;
+
+ head = quic_sock_head(net, sa, da);
+ spin_lock(&head->s_lock);
+ sk_for_each(sk, &head->head) {
+ if (net != sock_net(sk))
+ continue;
+ paths = quic_paths(sk);
+ if (quic_cmp_sk_addr(sk, quic_path_saddr(paths, 0), sa) &&
+ quic_cmp_sk_addr(sk, quic_path_daddr(paths, 0), da) &&
+ quic_path_usock(paths, 0) == skb->sk &&
+ (!dcid || !quic_conn_id_cmp(quic_path_orig_dcid(paths), dcid))) {
+ sock_hold(sk);
+ break;
+ }
+ }
+ spin_unlock(&head->s_lock);
+
+ return sk;
+}
+
+/* Find the listening QUIC socket for an incoming packet.
+ *
+ * This function searches the QUIC socket table for a listening socket that matches the dest
+ * address and port, and the ALPN(s) if presented in the ClientHello. If multiple listening
+ * sockets are bound to the same address, port, and ALPN(s) (e.g., via SO_REUSEPORT), this
+ * function selects a socket from the reuseport group.
+ *
+ * Return: A pointer to the matching listening socket, or NULL if no match is found.
+ */
+struct sock *quic_listen_sock_lookup(struct sk_buff *skb, union quic_addr *sa, union quic_addr *da,
+ struct quic_data *alpns)
+{
+ struct net *net = dev_net(skb->dev);
+ struct sock *sk = NULL, *tmp;
+ struct quic_hash_head *head;
+ struct quic_data alpn;
+ union quic_addr *a;
+ u64 length;
+ u32 len;
+ u8 *p;
+
+ head = quic_listen_sock_head(net, ntohs(sa->v4.sin_port));
+ spin_lock(&head->s_lock);
+
+ if (!alpns->len) { /* No ALPN entries present or failed to parse the ALPNs. */
+ sk_for_each(tmp, &head->head) {
+ /* If alpns->data != NULL, TLS parsing succeeded but no ALPN was found.
+ * In this case, only match sockets that have no ALPN set.
+ */
+ a = quic_path_saddr(quic_paths(tmp), 0);
+ if (net == sock_net(tmp) && quic_cmp_sk_addr(tmp, a, sa) &&
+ quic_path_usock(quic_paths(tmp), 0) == skb->sk &&
+ (!alpns->data || !quic_alpn(tmp)->len)) {
+ sk = tmp;
+ if (!quic_is_any_addr(a)) /* Prefer specific address match. */
+ break;
+ }
+ }
+ goto unlock;
+ }
+
+ /* ALPN present: loop through each ALPN entry. */
+ for (p = alpns->data, len = alpns->len; len; len -= length, p += length) {
+ quic_get_int(&p, &len, &length, 1);
+ quic_data(&alpn, p, length);
+ sk_for_each(tmp, &head->head) {
+ a = quic_path_saddr(quic_paths(tmp), 0);
+ if (net == sock_net(tmp) && quic_cmp_sk_addr(tmp, a, sa) &&
+ quic_path_usock(quic_paths(tmp), 0) == skb->sk &&
+ quic_data_has(quic_alpn(tmp), &alpn)) {
+ sk = tmp;
+ if (!quic_is_any_addr(a))
+ break;
+ }
+ }
+ if (sk)
+ break;
+ }
+unlock:
+ if (sk && sk->sk_reuseport)
+ sk = reuseport_select_sock(sk, quic_shash(net, da), skb, 1);
+ if (sk)
+ sock_hold(sk);
+
+ spin_unlock(&head->s_lock);
+ return sk;
+}
+
static void quic_write_space(struct sock *sk)
{
struct socket_wq *wq;
@@ -49,6 +156,7 @@ static int quic_init_sock(struct sock *sk)
quic_cong_init(quic_cong(sk));
quic_timer_init(sk);
+ quic_packet_init(sk);
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
@@ -222,6 +330,10 @@ static void quic_release_cb(struct sock *sk)
nflags = flags & ~QUIC_DEFERRED_ALL;
} while (!try_cmpxchg(&sk->sk_tsq_flags, &flags, nflags));
+ if (flags & QUIC_F_MTU_REDUCED_DEFERRED) {
+ quic_packet_rcv_err_pmtu(sk);
+ __sock_put(sk);
+ }
if (flags & QUIC_F_LOSS_DEFERRED) {
quic_timer_loss_handler(sk);
__sock_put(sk);
@@ -272,6 +384,7 @@ struct proto quic_prot = {
.accept = quic_accept,
.hash = quic_hash,
.unhash = quic_unhash,
+ .backlog_rcv = quic_packet_process,
.release_cb = quic_release_cb,
.no_autobind = true,
.obj_size = sizeof(struct quic_sock),
@@ -302,6 +415,7 @@ struct proto quicv6_prot = {
.accept = quic_accept,
.hash = quic_hash,
.unhash = quic_unhash,
+ .backlog_rcv = quic_packet_process,
.release_cb = quic_release_cb,
.no_autobind = true,
.obj_size = sizeof(struct quic6_sock),
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 75cb90177a01..99cc460158a0 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -20,6 +20,7 @@
#include "path.h"
#include "cong.h"
+#include "packet.h"
#include "frame.h"
#include "protocol.h"
@@ -77,6 +78,7 @@ struct quic_sock {
struct quic_pnspace space[QUIC_PNSPACE_MAX];
struct quic_crypto crypto[QUIC_CRYPTO_MAX];
+ struct quic_packet packet;
struct quic_timer timers[QUIC_TIMER_MAX];
};
@@ -155,6 +157,11 @@ static inline struct quic_crypto *quic_crypto(const struct sock *sk, u8 level)
return &quic_sk(sk)->crypto[level];
}
+static inline struct quic_packet *quic_packet(const struct sock *sk)
+{
+ return &quic_sk(sk)->packet;
+}
+
static inline void *quic_timer(const struct sock *sk, u8 type)
{
return (void *)&quic_sk(sk)->timers[type];
@@ -200,3 +207,8 @@ static inline void quic_set_state(struct sock *sk, int state)
inet_sk_set_state(sk, state);
sk->sk_state_change(sk);
}
+
+struct sock *quic_listen_sock_lookup(struct sk_buff *skb, union quic_addr *sa, union quic_addr *da,
+ struct quic_data *alpns);
+struct sock *quic_sock_lookup(struct sk_buff *skb, union quic_addr *sa, union quic_addr *da,
+ struct quic_conn_id *dcid);
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents
2025-08-18 14:04 [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (14 preceding siblings ...)
2025-08-18 14:04 ` [PATCH net-next v2 15/15] quic: add packet builder and parser base Xin Long
@ 2025-08-23 15:20 ` John Ericson
2025-08-24 17:57 ` Xin Long
15 siblings, 1 reply; 38+ messages in thread
From: John Ericson @ 2025-08-23 15:20 UTC (permalink / raw)
To: Xin Long, network dev; +Cc: draft-lxin-quic-socket-apis
(Note: This is an interface more than implementation question --- apologies in advance if this is not the right place to ask. I originally sent this message to [0] about the IETF internet draft [1], but then I realized that is just an alias for the draft authors, and not a public mailing list, so I figured this would be better in order to have something in the public record.)
---
I was surprised to see that (if I understand correctly) in the current design, all communication over one connection must happen with the same socket, and instead stream ids are the sole mechanism to distinguish between different streams (e.g. for sending and receiving).
This does work, but it is bad for application programming which wants to take advantage of separate streams while being transport-agnostic. For example, it would be very nice to run an arbitrary program with stdout and stderr hooked up to separate QUIC streams. This can be elegantly accomplished if there is an option to create a fresh socket / file descriptor which is just associated with a single stream. Then "regular" send/recv, or even read/write, can be used with multiple streams.
I see that the SCTP socket interface has sctp_peeloff [2] for this purpose. Could something similar be included in this specification?
John
[0]: draft-lxin-quic-socket-apis@ietf.org
[1]: https://datatracker.ietf.org/doc/draft-lxin-quic-socket-apis/
[2]: https://man7.org/linux/man-pages/man3/sctp_peeloff.3.html
* Re: [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents
2025-08-23 15:20 ` [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents John Ericson
@ 2025-08-24 17:57 ` Xin Long
2025-08-26 21:48 ` Xin Long
0 siblings, 1 reply; 38+ messages in thread
From: Xin Long @ 2025-08-24 17:57 UTC (permalink / raw)
To: John Ericson; +Cc: network dev, draft-lxin-quic-socket-apis
On Sat, Aug 23, 2025 at 11:21 AM John Ericson <mail@johnericson.me> wrote:
>
> (Note: This is an interface more than implementation question --- apologies in advance if this is not the right place to ask. I originally sent this message to [0] about the IETF internet draft [1], but then I realized that is just an alias for the draft authors, and not a public mailing list, so I figured this would be better in order to have something in the public record.)
>
> ---
>
> I was surprised to see that (if I understand correctly) in the current design, all communication over one connection must happen with the same socket, and instead stream ids are the sole mechanism to distinguish between different streams (e.g. for sending and receiving).
>
> This does work, but it is bad for application programming which wants to take advantage of separate streams while being transport-agnostic. For example, it would be very nice to run an arbitrary program with stdout and stderr hooked up to separate QUIC streams. This can be elegantly accomplished if there is an option to create a fresh socket / file descriptor which is just associated with a single stream. Then "regular" send/recv, or even read/write, can be used with multiple streams.
>
> I see that the SCTP socket interface has sctp_peeloff [2] for this purpose. Could something similar be included in this specification?
Hi, John,
That is a bit different. In SCTP, sctp_peeloff() detaches an
association/connection from a one-to-many socket and returns it as a
new socket. It does not peel off a stream. Stream send/receive
operations in SCTP are actually quite similar to how QUIC handles
streams in the proposed QUIC socket API.
For QUIC, supporting 'stream peeloff' might mean creating a new socket
type that carries a stream ID and maps its sendmsg/recvmsg to the
'parent' QUIC socket. But there are details to sort out, like whether
the 'parent-child relationship' should be maintained. We also need to
consider whether this is worth implementing in the kernel, or if a
similar API could be provided in libquic.
I’ll be requesting a mailing list for QUIC development and new
interfaces, and this would be a good topic to continue there.
Thanks for your comment.
* Re: [PATCH net-next v2 00/15] net: introduce QUIC infrastructure and core subcomponents
2025-08-24 17:57 ` Xin Long
@ 2025-08-26 21:48 ` Xin Long
0 siblings, 0 replies; 38+ messages in thread
From: Xin Long @ 2025-08-26 21:48 UTC (permalink / raw)
To: John Ericson; +Cc: network dev, draft-lxin-quic-socket-apis
On Sun, Aug 24, 2025 at 1:57 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Sat, Aug 23, 2025 at 11:21 AM John Ericson <mail@johnericson.me> wrote:
> >
> > (Note: This is an interface more than implementation question --- apologies in advance if this is not the right place to ask. I originally sent this message to [0] about the IETF internet draft [1], but then I realized that is just an alias for the draft authors, and not a public mailing list, so I figured this would be better in order to have something in the public record.)
> >
> > ---
> >
> > I was surprised to see that (if I understand correctly) in the current design, all communication over one connection must happen with the same socket, and instead stream ids are the sole mechanism to distinguish between different streams (e.g. for sending and receiving).
> >
> > This does work, but it is bad for application programming which wants to take advantage of separate streams while being transport-agnostic. For example, it would be very nice to run an arbitrary program with stdout and stderr hooked up to separate QUIC streams. This can be elegantly accomplished if there is an option to create a fresh socket / file descriptor which is just associated with a single stream. Then "regular" send/recv, or even read/write, can be used with multiple streams.
> >
> > I see that the SCTP socket interface has sctp_peeloff [2] for this purpose. Could something similar be included in this specification?
> Hi, John,
>
> That is a bit different. In SCTP, sctp_peeloff() detaches an
> association/connection from a one-to-many socket and returns it as a
> new socket. It does not peel off a stream. Stream send/receive
> operations in SCTP are actually quite similar to how QUIC handles
> streams in the proposed QUIC socket API.
>
> For QUIC, supporting 'stream peeloff' might mean creating a new socket
> type that carries a stream ID and maps its sendmsg/recvmsg to the
> 'parent' QUIC socket. But there are details to sort out, like whether
> the 'parent-child relationship' should be maintained. We also need to
> consider whether this is worth implementing in the kernel, or if a
> similar API could be provided in libquic.
>
> I’ll be requesting a mailing list for QUIC development and new
> interfaces, and this would be a good topic to continue there.
>
Hi, John,
Feel free to create a thread on quic@lists.linux.dev for this.
Thanks.