* [PATCH net-next v11 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 02/15] net: build socket infrastructure for QUIC protocol Xin Long
` (13 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch adds the IPPROTO_QUIC and SOL_QUIC constants to the
networking subsystem. Applications need these definitions to select
the QUIC protocol when creating sockets and to set QUIC-level socket
options.
QUIC does not have a protocol number allocated by IANA; like
IPPROTO_MPTCP, IPPROTO_QUIC is merely a value used when opening a QUIC
socket with:
socket(AF_INET, SOCK_STREAM, IPPROTO_QUIC);
Note that we did not implement QUIC as a UDP ULP, for several reasons:
- QUIC's connection migration requires at least two UDP sockets for one
QUIC connection at the same time, not to mention the multipath
feature in one of its draft RFCs.
- As a transport protocol, in-kernel QUIC wants to provide users with
TCP- or SCTP-like socket APIs, such as connect()/listen()/accept()...
Note that a single UDP socket might even be used for multiple QUIC
connections.
Using IPPROTO_QUIC sockets over a UDP tunnel effectively addresses
these challenges and provides a more flexible and scalable solution.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v11:
- Set maximum line length to 80 characters.
---
include/linux/socket.h | 1 +
include/uapi/linux/in.h | 2 ++
2 files changed, 3 insertions(+)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ec4a0a025793..9b6c3cd766ca 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -401,6 +401,7 @@ struct ucred {
#define SOL_MCTP 285
#define SOL_SMC 286
#define SOL_VSOCK 287
+#define SOL_QUIC 288
/* IPX options */
#define IPX_TYPE 1
diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index ced0fc3c3aa5..e4072152f2e6 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -85,6 +85,8 @@ enum {
#define IPPROTO_RAW IPPROTO_RAW
IPPROTO_SMC = 256, /* Shared Memory Communications */
#define IPPROTO_SMC IPPROTO_SMC
+ IPPROTO_QUIC = 261, /* A UDP-Based Multiplexed Secure Transport */
+#define IPPROTO_QUIC IPPROTO_QUIC
IPPROTO_MPTCP = 262, /* Multipath TCP connection */
#define IPPROTO_MPTCP IPPROTO_MPTCP
IPPROTO_MAX
--
2.47.1
* [PATCH net-next v11 02/15] net: build socket infrastructure for QUIC protocol
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 03/15] quic: provide common utilities and data structures Xin Long
` (12 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch lays the groundwork for QUIC socket support in the kernel.
It defines the core structures and protocol hooks needed to create
QUIC sockets, without implementing any protocol behavior at this stage.
Basic integration is included to allow building the module via
CONFIG_IP_QUIC=m.
This provides the scaffolding necessary for adding actual QUIC socket
behavior in follow-up patches.
Signed-off-by: Pengtao He <hepengtao@xiaomi.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v3:
- Kconfig: add 'default n' for IP_QUIC (reported by Paolo).
- quic_disconnect(): return -EOPNOTSUPP (suggested by Paolo).
- quic_init/destroy_sock(): drop local_bh_disable/enable() calls (noted
by Paolo).
- sysctl: add alpn_demux option to en/disable ALPN-based demux.
- SNMP: remove SNMP_MIB_SENTINEL, switch to
snmp_get_cpu_field_batch_cnt() to align with latest net-next changes.
v4:
- Remove unnecessary READ_ONCE() in quic_inet_connect() (reported by
Paolo).
v5:
- Update the type of the parameter 'addr' in quic_inet_connect(),
quic_connect(), and quic_bind() to match the latest net-next changes.
- Define quic_is_serv() to reuse sk->sk_max_ack_backlog for server-side
detection; path->serv will be deleted in a later patch.
- Use MODULE_ALIAS_NET_PF_PROTO instead of MODULE_ALIAS (suggested by
Stefan).
- Add the missing Documentation entry for the new sysctl options (noted
by Paolo).
- Add the missing MAINTAINERS entry for the QUIC PROTOCOL (noted by
Jakub).
v6:
- Relocate the QUIC PROTOCOL MAINTAINERS entry to its proper section
(noted by Jakub).
v7:
- Replace #ifdef CONFIG_XXX with #if IS_ENABLED(CONFIG_XXX) (noted by
Paolo).
v8:
- Fix an issue where an uninitialized value could be returned from
quic_net_init() by initializing err to 0 (reported by AI review).
- Replace the global ALPN demultiplexing sysctl with a static key that
will be enabled only when ALPN is configured on a listening socket
(noted by Stefan).
v10:
- Note for AI reviews: inet6_register_protosw() never fails for QUIC,
so checking its return value is unnecessary.
- Remove sk_sndbuf/sk_rcvbuf initialization in quic_init_sock(), as it's
already done in sock_init_data_uid() with sysctl_w/rmem_default.
v11:
- Note for AI review: quic_sk(sk)->reqs is used only for listen sockets
and will be initialized in quic_inet_listen() in a later patch, so
there is no need to initialize it in quic_init_sock().
- Set maximum line length to 80 characters.
- Drop addr_len from quic_recvmsg() to match the latest
proto_ops.recvmsg() update.
- Use %lu for SNMP counters in quic_snmp_seq_show().
---
Documentation/networking/ip-sysctl.rst | 39 +++
MAINTAINERS | 7 +
net/Kconfig | 1 +
net/Makefile | 1 +
net/quic/Kconfig | 36 +++
net/quic/Makefile | 8 +
net/quic/protocol.c | 376 +++++++++++++++++++++++++
net/quic/protocol.h | 57 ++++
net/quic/socket.c | 210 ++++++++++++++
net/quic/socket.h | 89 ++++++
10 files changed, 824 insertions(+)
create mode 100644 net/quic/Kconfig
create mode 100644 net/quic/Makefile
create mode 100644 net/quic/protocol.c
create mode 100644 net/quic/protocol.h
create mode 100644 net/quic/socket.c
create mode 100644 net/quic/socket.h
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 2e3a746fcc6d..5e3aafe9e236 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -3805,6 +3805,45 @@ l3mdev_accept - BOOLEAN
Default: 1 (enabled)
+``/proc/sys/net/quic/*`` Variables
+===================================
+
+quic_mem - vector of 3 LONGs: min, pressure, max
+ Number of pages allowed for queueing by all QUIC sockets.
+
+ min: below this number of pages QUIC is not bothered about its
+ memory appetite.
+
+ pressure: when amount of memory allocated by QUIC exceeds this number
+ of pages, QUIC moderates its memory consumption and enters memory
+ pressure mode, which is exited when memory consumption falls
+ under "min".
+
+ max: number of pages allowed for queueing by all QUIC sockets.
+
+ Defaults are calculated at boot time from amount of available
+ memory.
+
+quic_rmem - vector of 3 INTEGERs: min, default, max
+ Only the first value ("min") is used, "default" and "max" are
+ ignored.
+
+ min: Minimal size of receive buffer used by QUIC sockets.
+ It is guaranteed to each QUIC socket, even under moderate memory
+ pressure.
+
+ Default: 4K
+
+quic_wmem - vector of 3 INTEGERs: min, default, max
+ Only the first value ("min") is used, "default" and "max" are
+ ignored.
+
+ min: Amount of memory reserved for send buffers for QUIC sockets.
+ Each QUIC socket has rights to use it due to fact of its birth.
+
+ Default: 4K
+
+
``/proc/sys/net/core/*``
========================
diff --git a/MAINTAINERS b/MAINTAINERS
index 7d65f9435950..532030036a8c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21896,6 +21896,13 @@ L: linux-wireless@vger.kernel.org
S: Maintained
F: drivers/net/wireless/quantenna/
+QUIC PROTOCOL
+M: Xin Long <lucien.xin@gmail.com>
+L: quic@lists.linux.dev
+S: Maintained
+W: https://github.com/lxin/quic
+F: net/quic/
+
RADEON and AMDGPU DRM DRIVERS
M: Alex Deucher <alexander.deucher@amd.com>
M: Christian König <christian.koenig@amd.com>
diff --git a/net/Kconfig b/net/Kconfig
index 62266eaf0e95..dd2ed8420102 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -251,6 +251,7 @@ source "net/bridge/netfilter/Kconfig"
endif # if NETFILTER
+source "net/quic/Kconfig"
source "net/sctp/Kconfig"
source "net/rds/Kconfig"
source "net/tipc/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 90e3d72bf58b..cd43d03907cd 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_PHONET) += phonet/
ifneq ($(CONFIG_VLAN_8021Q),)
obj-y += 8021q/
endif
+obj-$(CONFIG_IP_QUIC) += quic/
obj-$(CONFIG_IP_SCTP) += sctp/
obj-$(CONFIG_RDS) += rds/
obj-$(CONFIG_WIRELESS) += wireless/
diff --git a/net/quic/Kconfig b/net/quic/Kconfig
new file mode 100644
index 000000000000..bbd174c02c1f
--- /dev/null
+++ b/net/quic/Kconfig
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# QUIC configuration
+#
+
+menuconfig IP_QUIC
+ tristate "QUIC: A UDP-Based Multiplexed Secure Transport (Experimental)"
+ depends on INET
+ depends on IPV6
+ select CRYPTO
+ select CRYPTO_HMAC
+ select CRYPTO_HKDF
+ select CRYPTO_AES
+ select CRYPTO_GCM
+ select CRYPTO_CCM
+ select CRYPTO_CHACHA20POLY1305
+ select NET_UDP_TUNNEL
+ default n
+ help
+ QUIC: A UDP-Based Multiplexed and Secure Transport
+
+ From rfc9000 <https://www.rfc-editor.org/rfc/rfc9000.html>.
+
+ QUIC provides applications with flow-controlled streams for structured
+ communication, low-latency connection establishment, and network path
+ migration. QUIC includes security measures that ensure
+ confidentiality, integrity, and availability in a range of deployment
+ circumstances. Accompanying documents describe the integration of
+ TLS for key negotiation, loss detection, and an exemplary congestion
+ control algorithm.
+
+ To compile this protocol support as a module, choose M here: the
+ module will be called quic. Debug messages are handled by the
+ kernel's dynamic debugging framework.
+
+ If in doubt, say N.
diff --git a/net/quic/Makefile b/net/quic/Makefile
new file mode 100644
index 000000000000..020e4dd133d8
--- /dev/null
+++ b/net/quic/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Makefile for QUIC support code.
+#
+
+obj-$(CONFIG_IP_QUIC) += quic.o
+
+quic-y := protocol.o socket.o
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
new file mode 100644
index 000000000000..73ccbddeff79
--- /dev/null
+++ b/net/quic/protocol.c
@@ -0,0 +1,376 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Initialization/cleanup for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/inet_common.h>
+#include <linux/proc_fs.h>
+#include <net/protocol.h>
+#include <net/rps.h>
+#include <net/tls.h>
+
+#include "socket.h"
+
+static unsigned int quic_net_id __read_mostly;
+
+struct percpu_counter quic_sockets_allocated;
+
+DEFINE_STATIC_KEY_FALSE(quic_alpn_demux_key);
+
+long sysctl_quic_mem[3];
+int sysctl_quic_rmem[3];
+int sysctl_quic_wmem[3];
+
+static int quic_inet_connect(struct socket *sock, struct sockaddr_unsized *addr,
+ int addr_len, int flags)
+{
+ struct sock *sk = sock->sk;
+
+ if (addr_len < (int)sizeof(addr->sa_family))
+ return -EINVAL;
+
+ return sk->sk_prot->connect(sk, addr, addr_len);
+}
+
+static int quic_inet_listen(struct socket *sock, int backlog)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_inet_getname(struct socket *sock, struct sockaddr *uaddr,
+ int peer)
+{
+ return -EOPNOTSUPP;
+}
+
+static __poll_t quic_inet_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ return 0;
+}
+
+static struct ctl_table quic_table[] = {
+ {
+ .procname = "quic_mem",
+ .data = &sysctl_quic_mem,
+ .maxlen = sizeof(sysctl_quic_mem),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax
+ },
+ {
+ .procname = "quic_rmem",
+ .data = &sysctl_quic_rmem,
+ .maxlen = sizeof(sysctl_quic_rmem),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "quic_wmem",
+ .data = &sysctl_quic_wmem,
+ .maxlen = sizeof(sysctl_quic_wmem),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+};
+
+struct quic_net *quic_net(struct net *net)
+{
+ return net_generic(net, quic_net_id);
+}
+
+#if IS_ENABLED(CONFIG_PROC_FS)
+static const struct snmp_mib quic_snmp_list[] = {
+ SNMP_MIB_ITEM("QuicConnCurrentEstabs", QUIC_MIB_CONN_CURRENTESTABS),
+ SNMP_MIB_ITEM("QuicConnPassiveEstabs", QUIC_MIB_CONN_PASSIVEESTABS),
+ SNMP_MIB_ITEM("QuicConnActiveEstabs", QUIC_MIB_CONN_ACTIVEESTABS),
+ SNMP_MIB_ITEM("QuicPktRcvFastpaths", QUIC_MIB_PKT_RCVFASTPATHS),
+ SNMP_MIB_ITEM("QuicPktDecFastpaths", QUIC_MIB_PKT_DECFASTPATHS),
+ SNMP_MIB_ITEM("QuicPktEncFastpaths", QUIC_MIB_PKT_ENCFASTPATHS),
+ SNMP_MIB_ITEM("QuicPktRcvBacklogs", QUIC_MIB_PKT_RCVBACKLOGS),
+ SNMP_MIB_ITEM("QuicPktDecBacklogs", QUIC_MIB_PKT_DECBACKLOGS),
+ SNMP_MIB_ITEM("QuicPktEncBacklogs", QUIC_MIB_PKT_ENCBACKLOGS),
+ SNMP_MIB_ITEM("QuicPktInvHdrDrop", QUIC_MIB_PKT_INVHDRDROP),
+ SNMP_MIB_ITEM("QuicPktInvNumDrop", QUIC_MIB_PKT_INVNUMDROP),
+ SNMP_MIB_ITEM("QuicPktInvFrmDrop", QUIC_MIB_PKT_INVFRMDROP),
+ SNMP_MIB_ITEM("QuicPktRcvDrop", QUIC_MIB_PKT_RCVDROP),
+ SNMP_MIB_ITEM("QuicPktDecDrop", QUIC_MIB_PKT_DECDROP),
+ SNMP_MIB_ITEM("QuicPktEncDrop", QUIC_MIB_PKT_ENCDROP),
+ SNMP_MIB_ITEM("QuicFrmRcvBufDrop", QUIC_MIB_FRM_RCVBUFDROP),
+ SNMP_MIB_ITEM("QuicFrmRetrans", QUIC_MIB_FRM_RETRANS),
+ SNMP_MIB_ITEM("QuicFrmOutCloses", QUIC_MIB_FRM_OUTCLOSES),
+ SNMP_MIB_ITEM("QuicFrmInCloses", QUIC_MIB_FRM_INCLOSES),
+};
+
+static int quic_snmp_seq_show(struct seq_file *seq, void *v)
+{
+ unsigned long buff[ARRAY_SIZE(quic_snmp_list)];
+ const int cnt = ARRAY_SIZE(quic_snmp_list);
+ struct net *net = seq->private;
+ u32 idx;
+
+ memset(buff, 0, sizeof(buff));
+
+ snmp_get_cpu_field_batch_cnt(buff, quic_snmp_list, cnt,
+ quic_net(net)->stat);
+ for (idx = 0; idx < cnt; idx++)
+ seq_printf(seq, "%-32s\t%lu\n", quic_snmp_list[idx].name,
+ buff[idx]);
+
+ return 0;
+}
+
+static int quic_net_proc_init(struct net *net)
+{
+ quic_net(net)->proc_net = proc_net_mkdir(net, "quic", net->proc_net);
+ if (!quic_net(net)->proc_net)
+ return -ENOMEM;
+
+ if (!proc_create_net_single("snmp", 0444, quic_net(net)->proc_net,
+ quic_snmp_seq_show, NULL))
+ goto free;
+ return 0;
+free:
+ remove_proc_subtree("quic", net->proc_net);
+ quic_net(net)->proc_net = NULL;
+ return -ENOMEM;
+}
+
+static void quic_net_proc_exit(struct net *net)
+{
+ remove_proc_subtree("quic", net->proc_net);
+ quic_net(net)->proc_net = NULL;
+}
+#endif
+
+static const struct proto_ops quic_proto_ops = {
+ .family = PF_INET,
+ .owner = THIS_MODULE,
+ .release = inet_release,
+ .bind = inet_bind,
+ .connect = quic_inet_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = inet_accept,
+ .getname = quic_inet_getname,
+ .poll = quic_inet_poll,
+ .ioctl = inet_ioctl,
+ .gettstamp = sock_gettstamp,
+ .listen = quic_inet_listen,
+ .shutdown = inet_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+};
+
+static struct inet_protosw quic_stream_protosw = {
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quic_prot,
+ .ops = &quic_proto_ops,
+};
+
+static struct inet_protosw quic_dgram_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quic_prot,
+ .ops = &quic_proto_ops,
+};
+
+static const struct proto_ops quicv6_proto_ops = {
+ .family = PF_INET6,
+ .owner = THIS_MODULE,
+ .release = inet6_release,
+ .bind = inet6_bind,
+ .connect = quic_inet_connect,
+ .socketpair = sock_no_socketpair,
+ .accept = inet_accept,
+ .getname = quic_inet_getname,
+ .poll = quic_inet_poll,
+ .ioctl = inet6_ioctl,
+ .gettstamp = sock_gettstamp,
+ .listen = quic_inet_listen,
+ .shutdown = inet_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = inet_sendmsg,
+ .recvmsg = inet_recvmsg,
+ .mmap = sock_no_mmap,
+};
+
+static struct inet_protosw quicv6_stream_protosw = {
+ .type = SOCK_STREAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quicv6_prot,
+ .ops = &quicv6_proto_ops,
+};
+
+static struct inet_protosw quicv6_dgram_protosw = {
+ .type = SOCK_DGRAM,
+ .protocol = IPPROTO_QUIC,
+ .prot = &quicv6_prot,
+ .ops = &quicv6_proto_ops,
+};
+
+static int quic_protosw_init(void)
+{
+ int err;
+
+ err = proto_register(&quic_prot, 1);
+ if (err)
+ return err;
+
+ err = proto_register(&quicv6_prot, 1);
+ if (err) {
+ proto_unregister(&quic_prot);
+ return err;
+ }
+
+ inet_register_protosw(&quic_stream_protosw);
+ inet_register_protosw(&quic_dgram_protosw);
+ inet6_register_protosw(&quicv6_stream_protosw);
+ inet6_register_protosw(&quicv6_dgram_protosw);
+
+ return 0;
+}
+
+static void quic_protosw_exit(void)
+{
+ inet_unregister_protosw(&quic_dgram_protosw);
+ inet_unregister_protosw(&quic_stream_protosw);
+ proto_unregister(&quic_prot);
+
+ inet6_unregister_protosw(&quicv6_dgram_protosw);
+ inet6_unregister_protosw(&quicv6_stream_protosw);
+ proto_unregister(&quicv6_prot);
+}
+
+static int __net_init quic_net_init(struct net *net)
+{
+ struct quic_net *qn = quic_net(net);
+ int err = 0;
+
+ qn->stat = alloc_percpu(struct quic_mib);
+ if (!qn->stat)
+ return -ENOMEM;
+
+#if IS_ENABLED(CONFIG_PROC_FS)
+ err = quic_net_proc_init(net);
+ if (err) {
+ free_percpu(qn->stat);
+ qn->stat = NULL;
+ }
+#endif
+ return err;
+}
+
+static void __net_exit quic_net_exit(struct net *net)
+{
+ struct quic_net *qn = quic_net(net);
+
+#if IS_ENABLED(CONFIG_PROC_FS)
+ quic_net_proc_exit(net);
+#endif
+ free_percpu(qn->stat);
+ qn->stat = NULL;
+}
+
+static struct pernet_operations quic_net_ops = {
+ .init = quic_net_init,
+ .exit = quic_net_exit,
+ .id = &quic_net_id,
+ .size = sizeof(struct quic_net),
+};
+
+#if IS_ENABLED(CONFIG_SYSCTL)
+static struct ctl_table_header *quic_sysctl_header;
+
+static void quic_sysctl_register(void)
+{
+ quic_sysctl_header = register_net_sysctl(&init_net, "net/quic",
+ quic_table);
+}
+
+static void quic_sysctl_unregister(void)
+{
+ unregister_net_sysctl_table(quic_sysctl_header);
+}
+#endif
+
+static __init int quic_init(void)
+{
+ int max_share, err = -ENOMEM;
+ unsigned long limit;
+
+ /* Set QUIC memory limits based on available system memory, similar to
+ * sctp_init().
+ */
+ limit = nr_free_buffer_pages() / 8;
+ limit = max(limit, 128UL);
+ sysctl_quic_mem[0] = (long)limit / 4 * 3;
+ sysctl_quic_mem[1] = (long)limit;
+ sysctl_quic_mem[2] = sysctl_quic_mem[0] * 2;
+
+ limit = (sysctl_quic_mem[1]) << (PAGE_SHIFT - 7);
+ max_share = min(4UL * 1024 * 1024, limit);
+
+ sysctl_quic_rmem[0] = PAGE_SIZE;
+ sysctl_quic_rmem[1] = 1024 * 1024;
+ sysctl_quic_rmem[2] = max(sysctl_quic_rmem[1], max_share);
+
+ sysctl_quic_wmem[0] = PAGE_SIZE;
+ sysctl_quic_wmem[1] = 16 * 1024;
+ sysctl_quic_wmem[2] = max(64 * 1024, max_share);
+
+ err = percpu_counter_init(&quic_sockets_allocated, 0, GFP_KERNEL);
+ if (err)
+ goto err_percpu_counter;
+
+ err = register_pernet_subsys(&quic_net_ops);
+ if (err)
+ goto err_def_ops;
+
+ err = quic_protosw_init();
+ if (err)
+ goto err_protosw;
+
+#if IS_ENABLED(CONFIG_SYSCTL)
+ quic_sysctl_register();
+#endif
+ pr_info("quic: init\n");
+ return 0;
+
+err_protosw:
+ unregister_pernet_subsys(&quic_net_ops);
+err_def_ops:
+ percpu_counter_destroy(&quic_sockets_allocated);
+err_percpu_counter:
+ return err;
+}
+
+static __exit void quic_exit(void)
+{
+#if IS_ENABLED(CONFIG_SYSCTL)
+ quic_sysctl_unregister();
+#endif
+ quic_protosw_exit();
+ unregister_pernet_subsys(&quic_net_ops);
+ percpu_counter_destroy(&quic_sockets_allocated);
+ pr_info("quic: exit\n");
+}
+
+module_init(quic_init);
+module_exit(quic_exit);
+
+MODULE_ALIAS_NET_PF_PROTO(PF_INET, 261); /* IPPROTO_QUIC == 261 */
+MODULE_ALIAS_NET_PF_PROTO(PF_INET6, 261);
+MODULE_AUTHOR("Xin Long <lucien.xin@gmail.com>");
+MODULE_DESCRIPTION("Support for the QUIC protocol (RFC9000)");
+MODULE_LICENSE("GPL");
diff --git a/net/quic/protocol.h b/net/quic/protocol.h
new file mode 100644
index 000000000000..fbd0fe39eccc
--- /dev/null
+++ b/net/quic/protocol.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+extern struct percpu_counter quic_sockets_allocated;
+
+DECLARE_STATIC_KEY_FALSE(quic_alpn_demux_key);
+
+extern long sysctl_quic_mem[3];
+extern int sysctl_quic_rmem[3];
+extern int sysctl_quic_wmem[3];
+
+enum {
+ QUIC_MIB_NUM = 0,
+ QUIC_MIB_CONN_CURRENTESTABS, /* Current established connections */
+ QUIC_MIB_CONN_PASSIVEESTABS, /* Passively established connections */
+ QUIC_MIB_CONN_ACTIVEESTABS, /* Actively established connections */
+ QUIC_MIB_PKT_RCVFASTPATHS, /* Packets received on fast path */
+ QUIC_MIB_PKT_DECFASTPATHS, /* Packets decrypted on fast path */
+ QUIC_MIB_PKT_ENCFASTPATHS, /* Packets encrypted on fast path */
+ QUIC_MIB_PKT_RCVBACKLOGS, /* Packets processed via backlog */
+ QUIC_MIB_PKT_DECBACKLOGS, /* Packets decrypted in backlog */
+ QUIC_MIB_PKT_ENCBACKLOGS, /* Packets encrypted in backlog */
+ QUIC_MIB_PKT_INVHDRDROP, /* Dropped: invalid packet header */
+ QUIC_MIB_PKT_INVNUMDROP, /* Dropped: invalid packet number */
+ QUIC_MIB_PKT_INVFRMDROP, /* Dropped: invalid frame */
+ QUIC_MIB_PKT_RCVDROP, /* Dropped on receive (general) */
+ QUIC_MIB_PKT_DECDROP, /* Dropped: decryption failure */
+ QUIC_MIB_PKT_ENCDROP, /* Dropped: encryption failure */
+ QUIC_MIB_FRM_RCVBUFDROP, /* Frames dropped: recv buf limit */
+ QUIC_MIB_FRM_RETRANS, /* Frames retransmitted */
+ QUIC_MIB_FRM_OUTCLOSES, /* CONNECTION_CLOSE frames sent */
+ QUIC_MIB_FRM_INCLOSES, /* CONNECTION_CLOSE frames rcvd */
+ QUIC_MIB_MAX
+};
+
+struct quic_mib {
+ unsigned long mibs[QUIC_MIB_MAX]; /* Counters indexed by QUIC_MIB_* */
+};
+
+struct quic_net {
+ DEFINE_SNMP_STAT(struct quic_mib, stat); /* Per-net QUIC MIB stats */
+#if IS_ENABLED(CONFIG_PROC_FS)
+ struct proc_dir_entry *proc_net; /* procfs entry for QUIC stats */
+#endif
+};
+
+struct quic_net *quic_net(struct net *net);
+
+#define QUIC_INC_STATS(net, field) SNMP_INC_STATS(quic_net(net)->stat, field)
+#define QUIC_DEC_STATS(net, field) SNMP_DEC_STATS(quic_net(net)->stat, field)
diff --git a/net/quic/socket.c b/net/quic/socket.c
new file mode 100644
index 000000000000..6605266eaa59
--- /dev/null
+++ b/net/quic/socket.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Initialization/cleanup for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/inet_common.h>
+#include <net/tls.h>
+
+#include "socket.h"
+
+static DEFINE_PER_CPU(int, quic_memory_per_cpu_fw_alloc);
+static unsigned long quic_memory_pressure;
+static atomic_long_t quic_memory_allocated;
+
+static void quic_enter_memory_pressure(struct sock *sk)
+{
+ WRITE_ONCE(quic_memory_pressure, 1);
+}
+
+static void quic_write_space(struct sock *sk)
+{
+ __poll_t mask = EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND;
+ struct socket_wq *wq;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (skwq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, mask);
+ rcu_read_unlock();
+}
+
+static int quic_init_sock(struct sock *sk)
+{
+ sk->sk_destruct = inet_sock_destruct;
+ sk->sk_write_space = quic_write_space;
+ sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
+
+ sk_sockets_allocated_inc(sk);
+ sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+
+ return 0;
+}
+
+static void quic_destroy_sock(struct sock *sk)
+{
+ sk_sockets_allocated_dec(sk);
+ sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+}
+
+static int quic_bind(struct sock *sk, struct sockaddr_unsized *addr,
+ int addr_len)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_connect(struct sock *sk, struct sockaddr_unsized *addr,
+ int addr_len)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_hash(struct sock *sk)
+{
+ return 0;
+}
+
+static void quic_unhash(struct sock *sk)
+{
+}
+
+static int quic_sendmsg(struct sock *sk, struct msghdr *msg, size_t msg_len)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+ int flags)
+{
+ return -EOPNOTSUPP;
+}
+
+static struct sock *quic_accept(struct sock *sk, struct proto_accept_arg *arg)
+{
+ arg->err = -EOPNOTSUPP;
+ return NULL;
+}
+
+static void quic_close(struct sock *sk, long timeout)
+{
+ lock_sock(sk);
+
+ quic_set_state(sk, QUIC_SS_CLOSED);
+
+ release_sock(sk);
+
+ sk_common_release(sk);
+}
+
+static int quic_do_setsockopt(struct sock *sk, int optname, sockptr_t optval,
+ unsigned int optlen)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen)
+{
+ if (level != SOL_QUIC)
+ return -EOPNOTSUPP;
+
+ return quic_do_setsockopt(sk, optname, optval, optlen);
+}
+
+static int quic_do_getsockopt(struct sock *sk, int optname, sockptr_t optval,
+ sockptr_t optlen)
+{
+ return -EOPNOTSUPP;
+}
+
+static int quic_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen)
+{
+ if (level != SOL_QUIC)
+ return -EOPNOTSUPP;
+
+ return quic_do_getsockopt(sk, optname, USER_SOCKPTR(optval),
+ USER_SOCKPTR(optlen));
+}
+
+static void quic_release_cb(struct sock *sk)
+{
+}
+
+static int quic_disconnect(struct sock *sk, int flags)
+{
+ return -EOPNOTSUPP;
+}
+
+static void quic_shutdown(struct sock *sk, int how)
+{
+ quic_set_state(sk, QUIC_SS_CLOSED);
+}
+
+struct proto quic_prot = {
+ .name = "QUIC",
+ .owner = THIS_MODULE,
+ .init = quic_init_sock,
+ .destroy = quic_destroy_sock,
+ .shutdown = quic_shutdown,
+ .setsockopt = quic_setsockopt,
+ .getsockopt = quic_getsockopt,
+ .connect = quic_connect,
+ .bind = quic_bind,
+ .close = quic_close,
+ .disconnect = quic_disconnect,
+ .sendmsg = quic_sendmsg,
+ .recvmsg = quic_recvmsg,
+ .accept = quic_accept,
+ .hash = quic_hash,
+ .unhash = quic_unhash,
+ .release_cb = quic_release_cb,
+ .no_autobind = true,
+ .obj_size = sizeof(struct quic_sock),
+ .sysctl_mem = sysctl_quic_mem,
+ .sysctl_rmem = sysctl_quic_rmem,
+ .sysctl_wmem = sysctl_quic_wmem,
+ .memory_pressure = &quic_memory_pressure,
+ .enter_memory_pressure = quic_enter_memory_pressure,
+ .memory_allocated = &quic_memory_allocated,
+ .per_cpu_fw_alloc = &quic_memory_per_cpu_fw_alloc,
+ .sockets_allocated = &quic_sockets_allocated,
+};
+
+struct proto quicv6_prot = {
+ .name = "QUICv6",
+ .owner = THIS_MODULE,
+ .init = quic_init_sock,
+ .destroy = quic_destroy_sock,
+ .shutdown = quic_shutdown,
+ .setsockopt = quic_setsockopt,
+ .getsockopt = quic_getsockopt,
+ .connect = quic_connect,
+ .bind = quic_bind,
+ .close = quic_close,
+ .disconnect = quic_disconnect,
+ .sendmsg = quic_sendmsg,
+ .recvmsg = quic_recvmsg,
+ .accept = quic_accept,
+ .hash = quic_hash,
+ .unhash = quic_unhash,
+ .release_cb = quic_release_cb,
+ .no_autobind = true,
+ .obj_size = sizeof(struct quic6_sock),
+ .ipv6_pinfo_offset = offsetof(struct quic6_sock, inet6),
+ .sysctl_mem = sysctl_quic_mem,
+ .sysctl_rmem = sysctl_quic_rmem,
+ .sysctl_wmem = sysctl_quic_wmem,
+ .memory_pressure = &quic_memory_pressure,
+ .enter_memory_pressure = quic_enter_memory_pressure,
+ .memory_allocated = &quic_memory_allocated,
+ .per_cpu_fw_alloc = &quic_memory_per_cpu_fw_alloc,
+ .sockets_allocated = &quic_sockets_allocated,
+};
diff --git a/net/quic/socket.h b/net/quic/socket.h
new file mode 100644
index 000000000000..98d3f738e909
--- /dev/null
+++ b/net/quic/socket.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/udp_tunnel.h>
+
+#include "protocol.h"
+
+extern struct proto quic_prot;
+extern struct proto quicv6_prot;
+
+enum quic_state {
+ QUIC_SS_CLOSED = TCP_CLOSE,
+ QUIC_SS_LISTENING = TCP_LISTEN,
+ QUIC_SS_ESTABLISHING = TCP_SYN_RECV,
+ QUIC_SS_ESTABLISHED = TCP_ESTABLISHED,
+};
+
+struct quic_sock {
+ struct inet_sock inet;
+ struct list_head reqs;
+};
+
+struct quic6_sock {
+ struct quic_sock quic;
+ struct ipv6_pinfo inet6;
+};
+
+static inline struct quic_sock *quic_sk(const struct sock *sk)
+{
+ return (struct quic_sock *)sk;
+}
+
+static inline struct list_head *quic_reqs(const struct sock *sk)
+{
+ return &quic_sk(sk)->reqs;
+}
+
+static inline bool quic_is_serv(const struct sock *sk)
+{
+ return !!sk->sk_max_ack_backlog;
+}
+
+static inline bool quic_is_establishing(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_ESTABLISHING;
+}
+
+static inline bool quic_is_established(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_ESTABLISHED;
+}
+
+static inline bool quic_is_listen(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_LISTENING;
+}
+
+static inline bool quic_is_closed(struct sock *sk)
+{
+ return sk->sk_state == QUIC_SS_CLOSED;
+}
+
+static inline void quic_set_state(struct sock *sk, int state)
+{
+ struct net *net = sock_net(sk);
+ int mib;
+
+ if (sk->sk_state == state)
+ return;
+
+ if (state == QUIC_SS_ESTABLISHED) {
+ mib = quic_is_serv(sk) ? QUIC_MIB_CONN_PASSIVEESTABS :
+ QUIC_MIB_CONN_ACTIVEESTABS;
+ QUIC_INC_STATS(net, mib);
+ QUIC_INC_STATS(net, QUIC_MIB_CONN_CURRENTESTABS);
+ } else if (quic_is_established(sk)) {
+ QUIC_DEC_STATS(net, QUIC_MIB_CONN_CURRENTESTABS);
+ }
+
+ inet_sk_set_state(sk, state);
+ sk->sk_state_change(sk);
+}
--
2.47.1
* [PATCH net-next v11 03/15] quic: provide common utilities and data structures
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 01/15] net: define IPPROTO_QUIC and SOL_QUIC constants Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 02/15] net: build socket infrastructure for QUIC protocol Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 04/15] quic: provide family ops for address and protocol Xin Long
` (11 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
This patch provides foundational data structures and utilities used
throughout the QUIC stack.
It introduces packet header types, connection ID support, and address
handling. Hash tables are added to manage socket lookup and connection
ID mapping.
A flexible binary data type is provided, along with helpers for parsing,
matching, and memory management. Helpers for encoding and decoding
transport parameters and frames are also included.
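For reference, the variable-length integer format these helpers follow
(RFC 9000, Section 16) can be exercised in plain userspace C. This is a
minimal sketch mirroring the structure of quic_put_var()/quic_get_var();
the function names are illustrative and not part of the kernel API added
by this series:

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace sketch of RFC 9000 Section 16 variable-length integers,
 * mirroring quic_put_var()/quic_get_var(); illustrative names only. */
static size_t varint_encode(uint64_t v, uint8_t *out)
{
	size_t len, i;

	if (v <= 0x3f) {                /* prefix 0b00: 1 byte */
		out[0] = (uint8_t)v;
		return 1;
	}
	len = v <= 0x3fff ? 2 : (v <= 0x3fffffff ? 4 : 8);
	for (i = 0; i < len; i++)       /* big-endian value bytes */
		out[i] = (uint8_t)(v >> (8 * (len - 1 - i)));
	/* Fold the length prefix (0b01/0b10/0b11) into the first byte. */
	out[0] |= (uint8_t)((len == 2 ? 1 : (len == 4 ? 2 : 3)) << 6);
	return len;
}

static size_t varint_decode(const uint8_t *in, uint64_t *v)
{
	size_t len = (size_t)1 << (in[0] >> 6); /* 1, 2, 4 or 8 bytes */
	uint64_t r = in[0] & 0x3f;              /* drop the length prefix */
	size_t i;

	for (i = 1; i < len; i++)
		r = (r << 8) | in[i];
	*v = r;
	return len;
}
```

The two most-significant bits of the first byte select the encoded
length, which is why the kernel helper masks them off after each
big-endian read (e.g. RFC 9000 A.1: the value 15293 encodes as the two
bytes 0x7b 0xbd).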
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v3:
- Rework hashtables: split into two types and size them based on
totalram_pages(), similar to SCTP (reported by Paolo).
- struct quic_shash_table: use rwlock instead of spinlock.
- quic_data_from/to_string(): add safety and common-case checks
(noted by Paolo).
v4:
- Handle the error returned by quic_hash_tables_init() properly
(reported by Simon).
- Use vmalloc() to simplify hashtable allocations (suggested by Paolo).
- Replace rwlock_t with spinlock_t and use hlist_nulls_head in
quic_shash_head for lockless lookup/access (suggested by Paolo).
- Define QUIC_PN_BITS to replace a magic number in quic_get_num()
(reported by Paolo).
- Rename several hash-related functions:
* quic_(listen_)sock_hash() → quic_(listen_)sock_head()
* quic_(listen_)sock_head() → quic_(listen_)sock_hash()
* quic_shash() → quic_addr_hash()
* quic_ahash() → call its code directly in quic_sock_hash().
- Include net in the hash calculations in quic_listen_sock_hash() and
quic_udp_sock_head(), and include len in quic_source_conn_id_head().
v5:
- Use u64 for skb_cb->time and u32 for skb_cb->seqno, and introduce
quic_ktime_get_us() to obtain timestamps in microseconds.
- Remove skb_cb->number_max and reuse number as the largest previously
seen, and add skb_cb->crypto_ctx for async crypto context freeing.
- Add case 8 to quic_put_int(), which will be used to pack a u64
timestamp into the token in a later patch.
v6:
- Since transport_header is no longer set for QUIC, use skb->data
instead in quic_hdr() and quic_hshdr(). As the UDP header can now
be accessed via udp_hdr(), drop udph_offset from struct quic_skb_cb.
- Note for AI reviews: although ticket/token/alpn are not initialized
in quic_init_sock(), it is safe to kfree() them in quic_destroy_sock()
because they are always set to NULL (via sk_alloc() in inet/6_create()
or memset(0) in quic_accept() in a later patchset).
v7:
- Add the missing check len < length in quic_data_has/match() (reported
by AI review).
v9:
- Add BUILD_BUG_ON() to check size of struct quic_skb_cb in quic_init()
(suggested by Paolo).
v10:
- Add a comment to quic_conn_id_update() to clarify that the caller is
responsible for ensuring the connection ID length does not
exceed QUIC_CONN_ID_MAX_LEN.
- Ensure quic_get_param() validates that the decoded parameter value
consumes exactly the expected valuelen in quic_get_param() (noted by
AI review).
- Replace manual memcpy() + endian conversion and temporary union usage
with get_unaligned_beNN() and put_unaligned_beNN() helpers for reading
and writing integers in network byte order.
- Replace vmalloc(size * sizeof(type)) with vmalloc_array() in hash
table allocations.
- Move *plen update to after successful parse in quic_get_int().
v11:
- Set maximum line length to 80 characters.
- Change return type of quic_data_match() and quic_data_has() to bool.
- Add a check for len in quic_conn_id_update().
- Avoid roundup_pow_of_two(0) in quic_hash_tables_init().
---
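For reviewers, the DecodePacketNumber() reconstruction that
quic_get_num() implements (RFC 9000, Appendix A.3) can be checked
standalone. The following is a hedged userspace sketch of the same
logic, not the kernel code itself:

```c
#include <stdint.h>

/* Userspace sketch of rfc9000#section-a.3 DecodePacketNumber(),
 * mirroring the structure of quic_get_num() in common.c. */
static int64_t decode_pkt_num(int64_t largest, int64_t truncated,
			      unsigned int len)
{
	int64_t expected = largest + 1;        /* next expected number */
	int64_t win = (int64_t)1 << (len * 8); /* window for 'len' bytes */
	int64_t hwin = win / 2;
	int64_t mask = win - 1;
	int64_t cand = (expected & ~mask) | truncated;

	/* Pick the candidate closest to 'expected', clamped to the
	 * 62-bit packet number space (QUIC_PN_BITS). */
	if (cand <= expected - hwin && cand < ((int64_t)1 << 62) - win)
		return cand + win;
	if (cand > expected + hwin && cand >= win)
		return cand - win;
	return cand;
}
```

With the RFC's own example, a largest-received number of 0xa82f30ea and
a 2-byte truncated number 0x9b32 decode to 0xa82f9b32.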
net/quic/Makefile | 2 +-
net/quic/common.c | 564 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/common.h | 212 +++++++++++++++++
net/quic/protocol.c | 10 +
net/quic/socket.c | 4 +
net/quic/socket.h | 21 ++
6 files changed, 812 insertions(+), 1 deletion(-)
create mode 100644 net/quic/common.c
create mode 100644 net/quic/common.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 020e4dd133d8..e0067272de7d 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := protocol.o socket.o
+quic-y := common.o protocol.o socket.o
diff --git a/net/quic/common.c b/net/quic/common.c
new file mode 100644
index 000000000000..7b0d711a8f94
--- /dev/null
+++ b/net/quic/common.c
@@ -0,0 +1,564 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Common utilities and data structures for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/unaligned.h>
+#include <net/netns/hash.h>
+#include <linux/vmalloc.h>
+#include <linux/jhash.h>
+
+#include "common.h"
+
+#define QUIC_VARINT_1BYTE_MAX 0x3fULL
+#define QUIC_VARINT_2BYTE_MAX 0x3fffULL
+#define QUIC_VARINT_4BYTE_MAX 0x3fffffffULL
+#define QUIC_VARINT_8BYTE_MAX 0x3fffffffffffffffULL
+
+#define QUIC_VARINT_2BYTE_PREFIX 0x40
+#define QUIC_VARINT_4BYTE_PREFIX 0x80
+#define QUIC_VARINT_8BYTE_PREFIX 0xc0
+
+#define QUIC_VARINT_LENGTH(p) BIT((*(p)) >> 6)
+
+struct quic_hashinfo {
+ struct quic_shash_table shash; /* Source connection ID hashtable */
+ struct quic_shash_table lhash; /* Listening sock hashtable */
+ struct quic_shash_table chash; /* Connection sock hashtable */
+ struct quic_uhash_table uhash; /* UDP sock hashtable */
+};
+
+static struct quic_hashinfo quic_hashinfo;
+
+u32 quic_sock_hash_size(void)
+{
+ return quic_hashinfo.chash.size;
+}
+
+u32 quic_sock_hash(struct net *net, union quic_addr *s, union quic_addr *d)
+{
+ u32 ports = ((__force u32)s->v4.sin_port) << 16 |
+ (__force u32)d->v4.sin_port;
+ u32 saddr = (s->sa.sa_family == AF_INET6) ?
+ jhash(&s->v6.sin6_addr, 16, 0) :
+ (__force u32)s->v4.sin_addr.s_addr;
+ u32 daddr = (d->sa.sa_family == AF_INET6) ?
+ jhash(&d->v6.sin6_addr, 16, 0) :
+ (__force u32)d->v4.sin_addr.s_addr;
+ u32 hash = jhash_3words(saddr, ports, net_hash_mix(net), daddr);
+
+ return hash & (quic_sock_hash_size() - 1);
+}
+
+struct quic_shash_head *quic_sock_head(u32 hash)
+{
+ return &quic_hashinfo.chash.hash[hash];
+}
+
+u32 quic_listen_sock_hash_size(void)
+{
+ return quic_hashinfo.lhash.size;
+}
+
+u32 quic_listen_sock_hash(struct net *net, u16 port)
+{
+ u32 hash = jhash_2words((__force u32)port, net_hash_mix(net), 0);
+
+ return hash & (quic_listen_sock_hash_size() - 1);
+}
+
+struct quic_shash_head *quic_listen_sock_head(u32 hash)
+{
+ return &quic_hashinfo.lhash.hash[hash];
+}
+
+struct quic_shash_head *quic_source_conn_id_head(struct net *net, u8 *scid,
+ u32 len)
+{
+ u32 hash = jhash_2words(jhash(scid, len, 0), net_hash_mix(net), 0);
+ struct quic_shash_table *ht = &quic_hashinfo.shash;
+
+ return &ht->hash[hash & (ht->size - 1)];
+}
+
+struct quic_uhash_head *quic_udp_sock_head(struct net *net, u16 port)
+{
+ u32 hash = jhash_2words((__force u32)port, net_hash_mix(net), 0);
+ struct quic_uhash_table *ht = &quic_hashinfo.uhash;
+
+ return &ht->hash[hash & (ht->size - 1)];
+}
+
+u32 quic_addr_hash(struct net *net, union quic_addr *a)
+{
+ u32 addr = (a->sa.sa_family == AF_INET6) ?
+ jhash(&a->v6.sin6_addr, 16, 0) :
+ (__force u32)a->v4.sin_addr.s_addr;
+
+ return jhash_3words(addr, (__force u32)a->v4.sin_port,
+ net_hash_mix(net), 0);
+}
+
+void quic_hash_tables_destroy(void)
+{
+ vfree(quic_hashinfo.shash.hash);
+ vfree(quic_hashinfo.lhash.hash);
+ vfree(quic_hashinfo.chash.hash);
+ vfree(quic_hashinfo.uhash.hash);
+}
+
+static int quic_shash_table_init(struct quic_shash_table *ht, u32 size)
+{
+ int i;
+
+ ht->hash = vmalloc_array(size, sizeof(struct quic_shash_head));
+ if (!ht->hash)
+ return -ENOMEM;
+
+ ht->size = size;
+ for (i = 0; i < ht->size; i++) {
+ spin_lock_init(&ht->hash[i].lock);
+ INIT_HLIST_NULLS_HEAD(&ht->hash[i].head, i);
+ }
+ return 0;
+}
+
+static int quic_uhash_table_init(struct quic_uhash_table *ht, u32 size)
+{
+ int i;
+
+ ht->hash = vmalloc_array(size, sizeof(struct quic_uhash_head));
+ if (!ht->hash)
+ return -ENOMEM;
+
+ ht->size = size;
+ for (i = 0; i < ht->size; i++) {
+ mutex_init(&ht->hash[i].lock);
+ INIT_HLIST_HEAD(&ht->hash[i].head);
+ }
+ return 0;
+}
+
+int quic_hash_tables_init(void)
+{
+ unsigned long nr_pages = totalram_pages();
+ u32 limit, size;
+ int err;
+
+ /* Scale hash table size based on system memory, similar to SCTP. */
+ if (nr_pages >= (128 * 1024))
+ limit = nr_pages >> (22 - PAGE_SHIFT);
+ else
+ limit = nr_pages >> (24 - PAGE_SHIFT);
+
+ limit = roundup_pow_of_two(limit ?: 1);
+
+ /* Source connection ID table (fast lookup, larger size) */
+ size = min(limit, 64 * 1024U);
+ err = quic_shash_table_init(&quic_hashinfo.shash, size);
+ if (err)
+ goto err;
+ size = min(limit, 16 * 1024U);
+ err = quic_shash_table_init(&quic_hashinfo.lhash, size);
+ if (err)
+ goto err;
+ err = quic_shash_table_init(&quic_hashinfo.chash, size);
+ if (err)
+ goto err;
+ err = quic_uhash_table_init(&quic_hashinfo.uhash, size);
+ if (err)
+ goto err;
+ return 0;
+err:
+ quic_hash_tables_destroy();
+ return err;
+}
+
+/* Returns the number of bytes required to encode a QUIC variable-length
+ * integer.
+ */
+u8 quic_var_len(u64 n)
+{
+ if (n <= QUIC_VARINT_1BYTE_MAX)
+ return 1;
+ if (n <= QUIC_VARINT_2BYTE_MAX)
+ return 2;
+ if (n <= QUIC_VARINT_4BYTE_MAX)
+ return 4;
+ return 8;
+}
+
+/* Decodes a QUIC variable-length integer from a buffer. */
+u8 quic_get_var(u8 **pp, u32 *plen, u64 *val)
+{
+ u8 *p = *pp, len;
+ u64 v = 0;
+
+ if (!*plen)
+ return 0;
+
+ len = QUIC_VARINT_LENGTH(p);
+ if (*plen < len)
+ return 0;
+
+ switch (len) {
+ case 1:
+ v = *p;
+ break;
+ case 2:
+ v = get_unaligned_be16(p) & QUIC_VARINT_2BYTE_MAX;
+ break;
+ case 4:
+ v = get_unaligned_be32(p) & QUIC_VARINT_4BYTE_MAX;
+ break;
+ case 8:
+ v = get_unaligned_be64(p) & QUIC_VARINT_8BYTE_MAX;
+ break;
+ default:
+ return 0;
+ }
+
+ *plen -= len;
+ *pp = p + len;
+ *val = v;
+ return len;
+}
+
+/* Reads a fixed-length integer from the buffer. */
+u32 quic_get_int(u8 **pp, u32 *plen, u64 *val, u32 len)
+{
+ u8 *p = *pp;
+ u64 v = 0;
+
+ if (*plen < len)
+ return 0;
+
+ switch (len) {
+ case 1:
+ v = *p;
+ break;
+ case 2:
+ v = get_unaligned_be16(p);
+ break;
+ case 3:
+ v = get_unaligned_be24(p);
+ break;
+ case 4:
+ v = get_unaligned_be32(p);
+ break;
+ case 8:
+ v = get_unaligned_be64(p);
+ break;
+ default:
+ return 0;
+ }
+ *plen -= len;
+ *pp = p + len;
+ *val = v;
+ return len;
+}
+
+u32 quic_get_data(u8 **pp, u32 *plen, u8 *data, u32 len)
+{
+ if (*plen < len)
+ return 0;
+
+ memcpy(data, *pp, len);
+ *pp += len;
+ *plen -= len;
+
+ return len;
+}
+
+/* Encodes a value into the QUIC variable-length integer format. */
+u8 *quic_put_var(u8 *p, u64 num)
+{
+ if (num <= QUIC_VARINT_1BYTE_MAX) {
+ *p++ = (u8)num;
+ return p;
+ }
+ if (num <= QUIC_VARINT_2BYTE_MAX) {
+ put_unaligned_be16((u16)num, p);
+ *p |= QUIC_VARINT_2BYTE_PREFIX;
+ return p + 2;
+ }
+ if (num <= QUIC_VARINT_4BYTE_MAX) {
+ put_unaligned_be32((u32)num, p);
+ *p |= QUIC_VARINT_4BYTE_PREFIX;
+ return p + 4;
+ }
+ put_unaligned_be64(num, p);
+ *p |= QUIC_VARINT_8BYTE_PREFIX;
+ return p + 8;
+}
+
+/* Writes a fixed-length integer to the buffer in network byte order. */
+u8 *quic_put_int(u8 *p, u64 num, u8 len)
+{
+ switch (len) {
+ case 1:
+ *p++ = (u8)num;
+ return p;
+ case 2:
+ put_unaligned_be16((u16)num, p);
+ return p + 2;
+ case 4:
+ put_unaligned_be32((u32)num, p);
+ return p + 4;
+ case 8:
+ put_unaligned_be64(num, p);
+ return p + 8;
+ default:
+ return NULL;
+ }
+}
+
+/* Encodes a value as a variable-length integer with explicit length. */
+u8 *quic_put_varint(u8 *p, u64 num, u8 len)
+{
+ switch (len) {
+ case 1:
+ *p++ = (u8)num;
+ return p;
+ case 2:
+ put_unaligned_be16((u16)num, p);
+ *p |= QUIC_VARINT_2BYTE_PREFIX;
+ return p + 2;
+ case 4:
+ put_unaligned_be32((u32)num, p);
+ *p |= QUIC_VARINT_4BYTE_PREFIX;
+ return p + 4;
+ default:
+ return NULL;
+ }
+}
+
+u8 *quic_put_data(u8 *p, u8 *data, u32 len)
+{
+ if (!len)
+ return p;
+
+ memcpy(p, data, len);
+ return p + len;
+}
+
+/* Writes a transport parameter as two varints: ID and value length, followed
+ * by value.
+ */
+u8 *quic_put_param(u8 *p, u16 id, u64 value)
+{
+ p = quic_put_var(p, id);
+ p = quic_put_var(p, quic_var_len(value));
+ return quic_put_var(p, value);
+}
+
+/* Reads a QUIC transport parameter value. */
+u8 quic_get_param(u64 *pdest, u8 **pp, u32 *plen)
+{
+ u64 valuelen;
+
+ if (!quic_get_var(pp, plen, &valuelen))
+ return 0;
+
+ if (*plen < valuelen)
+ return 0;
+
+ if (quic_get_var(pp, plen, pdest) != valuelen)
+ return 0;
+
+ return (u8)valuelen;
+}
+
+/* rfc9000#section-a.3: DecodePacketNumber()
+ *
+ * Reconstructs the full packet number from a truncated one.
+ */
+s64 quic_get_num(s64 max_pkt_num, s64 pkt_num, u32 n)
+{
+ s64 expected = max_pkt_num + 1;
+ s64 win = BIT_ULL(n * 8);
+ s64 hwin = win / 2;
+ s64 mask = win - 1;
+ s64 cand;
+
+ cand = (expected & ~mask) | pkt_num;
+ if (cand <= expected - hwin && cand < BIT_ULL(QUIC_PN_BITS) - win)
+ return cand + win;
+ if (cand > expected + hwin && cand >= win)
+ return cand - win;
+ return cand;
+}
+
+int quic_data_dup(struct quic_data *to, u8 *data, u32 len)
+{
+ if (!len)
+ return 0;
+
+ data = kmemdup(data, len, GFP_ATOMIC);
+ if (!data)
+ return -ENOMEM;
+
+ kfree(to->data);
+ to->data = data;
+ to->len = len;
+ return 0;
+}
+
+int quic_data_append(struct quic_data *to, u8 *data, u32 len)
+{
+ u8 *p;
+
+ if (!len)
+ return 0;
+
+ p = kzalloc(to->len + len, GFP_ATOMIC);
+ if (!p)
+ return -ENOMEM;
+ p = quic_put_data(p, to->data, to->len);
+ p = quic_put_data(p, data, len);
+
+ kfree(to->data);
+ to->len = to->len + len;
+ to->data = p - to->len;
+ return 0;
+}
+
+/* Check whether 'd2' is equal to any element inside the list 'd1'.
+ *
+ * 'd1' is assumed to be a sequence of length-prefixed elements. Each element
+ * is compared to 'd2' using 'quic_data_cmp()'.
+ *
+ * Returns true if a match is found, false otherwise.
+ */
+bool quic_data_has(struct quic_data *d1, struct quic_data *d2)
+{
+ struct quic_data d;
+ u64 length;
+ u32 len;
+ u8 *p;
+
+ for (p = d1->data, len = d1->len; len; len -= length, p += length) {
+ if (!quic_get_int(&p, &len, &length, 1) || len < length)
+ return false;
+ quic_data(&d, p, length);
+ if (!quic_data_cmp(&d, d2))
+ return true;
+ }
+ return false;
+}
+
+/* Check if any element of 'd1' is present in the list 'd2'.
+ *
+ * Iterates through each element in 'd1', and uses 'quic_data_has()' to check
+ * for its presence in 'd2'.
+ *
+ * Returns true if any match is found, false otherwise.
+ */
+bool quic_data_match(struct quic_data *d1, struct quic_data *d2)
+{
+ struct quic_data d;
+ u64 length;
+ u32 len;
+ u8 *p;
+
+ for (p = d1->data, len = d1->len; len; len -= length, p += length) {
+ if (!quic_get_int(&p, &len, &length, 1) || len < length)
+ return false;
+ quic_data(&d, p, length);
+ if (quic_data_has(d2, &d))
+ return true;
+ }
+ return false;
+}
+
+/* Serialize a list of 'quic_data' elements into a comma-separated string.
+ *
+ * Each element in 'from' is length-prefixed. This function copies their raw
+ * content into the output buffer 'to', inserting commas in between. The
+ * resulting string length is written to '*plen'.
+ */
+int quic_data_to_string(u8 *to, u32 *plen, struct quic_data *from)
+{
+ u32 remlen = *plen;
+ struct quic_data d;
+ u8 *data = to, *p;
+ u64 length;
+ u32 len;
+
+ p = from->data;
+ len = from->len;
+ while (len) {
+ if (!quic_get_int(&p, &len, &length, 1) || len < length)
+ return -EINVAL;
+
+ quic_data(&d, p, length);
+ if (d.len > remlen)
+ return -EOVERFLOW;
+
+ data = quic_put_data(data, d.data, d.len);
+ remlen -= d.len;
+ p += d.len;
+ len -= d.len;
+ if (len) {
+ if (!remlen)
+ return -EOVERFLOW;
+ data = quic_put_int(data, ',', 1);
+ remlen--;
+ }
+ }
+ *plen = data - to;
+ return 0;
+}
+
+/* Parse a comma-separated string into a 'quic_data' list format.
+ *
+ * Each comma-separated token is turned into a length-prefixed element. The
+ * first byte of each element stores the length. Elements are stored in
+ * 'to->data', and 'to->len' is updated.
+ */
+int quic_data_from_string(struct quic_data *to, u8 *from, u32 len)
+{
+ u32 remlen = to->len;
+ struct quic_data d;
+ u8 *p = to->data;
+
+ to->len = 0;
+ while (len) {
+ while (len && *from == ' ') {
+ from++;
+ len--;
+ }
+ if (!len)
+ break;
+ if (!remlen)
+ return -EOVERFLOW;
+ d.data = p++;
+ d.len = 0;
+ remlen--;
+ while (len) {
+ if (*from == ',') {
+ from++;
+ len--;
+ break;
+ }
+ if (!remlen)
+ return -EOVERFLOW;
+ *p++ = *from++;
+ len--;
+ d.len++;
+ remlen--;
+ }
+ if (d.len > U8_MAX)
+ return -EINVAL;
+ *d.data = (u8)(d.len);
+ to->len += d.len + 1;
+ }
+ return 0;
+}
diff --git a/net/quic/common.h b/net/quic/common.h
new file mode 100644
index 000000000000..e8aceaecccce
--- /dev/null
+++ b/net/quic/common.h
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/net_namespace.h>
+
+#define QUIC_MAX_ACK_DELAY (16384 * 1000)
+#define QUIC_DEF_ACK_DELAY 25000
+
+#define QUIC_STREAM_BIT_FIN 0x01
+#define QUIC_STREAM_BIT_LEN 0x02
+#define QUIC_STREAM_BIT_OFF 0x04
+#define QUIC_STREAM_BIT_MASK 0x08
+
+#define QUIC_CONN_ID_MAX_LEN 20
+#define QUIC_CONN_ID_DEF_LEN 8
+
+#define QUIC_PN_MAX_LEN 4 /* For encoded packet number */
+#define QUIC_PN_BITS 62
+#define QUIC_PN_MAX (BIT_ULL(QUIC_PN_BITS) - 1)
+
+struct quic_conn_id {
+ u8 data[QUIC_CONN_ID_MAX_LEN];
+ u8 len;
+};
+
+static inline void quic_conn_id_update(struct quic_conn_id *conn_id, u8 *data,
+ u32 len)
+{
+ /* The caller must ensure len does not exceed QUIC_CONN_ID_MAX_LEN. */
+ if (WARN_ON_ONCE(len > QUIC_CONN_ID_MAX_LEN))
+ return;
+ memcpy(conn_id->data, data, len);
+ conn_id->len = (u8)len;
+}
+
+struct quic_skb_cb {
+ /* Callback and temporary context when encryption/decryption completes
+ * in async mode
+ */
+ void (*crypto_done)(struct sk_buff *skb, int err);
+ void *crypto_ctx;
+ union {
+ struct sk_buff *last; /* Last packet in bundle on TX */
+ u64 time; /* Arrival timestamp in UDP tunnel on RX */
+ };
+ s64 number; /* Parsed packet number, or the largest previously seen */
+ u32 seqno; /* Dest connection ID number on RX */
+ u16 errcode; /* Error code if encryption/decryption fails */
+ u16 length; /* Payload length + packet number length */
+
+ u16 number_offset; /* Offset of packet number field */
+ u8 number_len; /* Length of the packet number field */
+ u8 level; /* Encryption level: Initial, Handshake, App, or Early */
+
+ u8 key_update:1; /* Key update triggered by this packet */
+ u8 key_phase:1; /* Key phase used (0 or 1) */
+ u8 backlog:1; /* Enqueued into backlog list */
+ u8 resume:1; /* Crypto already processed (encrypted or decrypted) */
+ u8 path:1; /* Packet arrived from a new or migrating path */
+ u8 ecn:2; /* ECN marking used on TX */
+};
+
+#define QUIC_SKB_CB(skb) ((struct quic_skb_cb *)&((skb)->cb[0]))
+
+struct quichdr {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+ __u8 pnl:2,
+ key:1,
+ reserved:2,
+ spin:1,
+ fixed:1,
+ form:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 form:1,
+ fixed:1,
+ spin:1,
+ reserved:2,
+ key:1,
+ pnl:2;
+#endif
+};
+
+static inline struct quichdr *quic_hdr(struct sk_buff *skb)
+{
+ return (struct quichdr *)skb->data;
+}
+
+struct quichshdr {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+ __u8 pnl:2,
+ reserved:2,
+ type:2,
+ fixed:1,
+ form:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 form:1,
+ fixed:1,
+ type:2,
+ reserved:2,
+ pnl:2;
+#endif
+};
+
+static inline struct quichshdr *quic_hshdr(struct sk_buff *skb)
+{
+ return (struct quichshdr *)skb->data;
+}
+
+union quic_addr {
+ struct sockaddr_in6 v6;
+ struct sockaddr_in v4;
+ struct sockaddr sa;
+};
+
+static inline union quic_addr *quic_addr(const void *addr)
+{
+ return (union quic_addr *)addr;
+}
+
+struct quic_shash_head {
+ struct hlist_nulls_head head;
+ spinlock_t lock; /* Protects 'head' in atomic context */
+};
+
+struct quic_shash_table {
+ struct quic_shash_head *hash;
+ u32 size;
+};
+
+struct quic_uhash_head {
+ struct hlist_head head;
+ struct mutex lock; /* Protects 'head' in process context */
+};
+
+struct quic_uhash_table {
+ struct quic_uhash_head *hash;
+ u32 size;
+};
+
+struct quic_data {
+ u8 *data;
+ u32 len;
+};
+
+static inline struct quic_data *quic_data(struct quic_data *d, u8 *data,
+ u32 len)
+{
+ d->data = data;
+ d->len = len;
+ return d;
+}
+
+static inline int quic_data_cmp(struct quic_data *d1, struct quic_data *d2)
+{
+ return d1->len != d2->len || memcmp(d1->data, d2->data, d1->len);
+}
+
+static inline void quic_data_free(struct quic_data *d)
+{
+ kfree(d->data);
+ d->data = NULL;
+ d->len = 0;
+}
+
+static inline u64 quic_ktime_get_us(void)
+{
+ return ktime_to_us(ktime_get());
+}
+
+u32 quic_sock_hash(struct net *net, union quic_addr *s, union quic_addr *d);
+struct quic_shash_head *quic_sock_head(u32 hash);
+u32 quic_sock_hash_size(void);
+
+u32 quic_listen_sock_hash(struct net *net, u16 port);
+struct quic_shash_head *quic_listen_sock_head(u32 hash);
+u32 quic_listen_sock_hash_size(void);
+
+struct quic_shash_head *quic_source_conn_id_head(struct net *net, u8 *scid,
+ u32 len);
+struct quic_uhash_head *quic_udp_sock_head(struct net *net, u16 port);
+u32 quic_addr_hash(struct net *net, union quic_addr *a);
+
+void quic_hash_tables_destroy(void);
+int quic_hash_tables_init(void);
+
+u32 quic_get_data(u8 **pp, u32 *plen, u8 *data, u32 len);
+u32 quic_get_int(u8 **pp, u32 *plen, u64 *val, u32 len);
+s64 quic_get_num(s64 max_pkt_num, s64 pkt_num, u32 n);
+u8 quic_get_param(u64 *pdest, u8 **pp, u32 *plen);
+u8 quic_get_var(u8 **pp, u32 *plen, u64 *val);
+u8 quic_var_len(u64 n);
+
+u8 *quic_put_param(u8 *p, u16 id, u64 value);
+u8 *quic_put_data(u8 *p, u8 *data, u32 len);
+u8 *quic_put_varint(u8 *p, u64 num, u8 len);
+u8 *quic_put_int(u8 *p, u64 num, u8 len);
+u8 *quic_put_var(u8 *p, u64 num);
+
+int quic_data_from_string(struct quic_data *to, u8 *from, u32 len);
+int quic_data_to_string(u8 *to, u32 *plen, struct quic_data *from);
+
+bool quic_data_match(struct quic_data *d1, struct quic_data *d2);
+bool quic_data_has(struct quic_data *d1, struct quic_data *d2);
+int quic_data_append(struct quic_data *to, u8 *data, u32 len);
+int quic_data_dup(struct quic_data *to, u8 *data, u32 len);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index 73ccbddeff79..807e2a228ba8 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -309,6 +309,9 @@ static __init int quic_init(void)
int max_share, err = -ENOMEM;
unsigned long limit;
+ BUILD_BUG_ON(sizeof(struct quic_skb_cb) >
+ sizeof_field(struct sk_buff, cb));
+
/* Set QUIC memory limits based on available system memory, similar to
* sctp_init().
*/
@@ -333,6 +336,10 @@ static __init int quic_init(void)
if (err)
goto err_percpu_counter;
+ err = quic_hash_tables_init();
+ if (err)
+ goto err_hash;
+
err = register_pernet_subsys(&quic_net_ops);
if (err)
goto err_def_ops;
@@ -350,6 +357,8 @@ static __init int quic_init(void)
err_protosw:
unregister_pernet_subsys(&quic_net_ops);
err_def_ops:
+ quic_hash_tables_destroy();
+err_hash:
percpu_counter_destroy(&quic_sockets_allocated);
err_percpu_counter:
return err;
@@ -362,6 +371,7 @@ static __exit void quic_exit(void)
#endif
quic_protosw_exit();
unregister_pernet_subsys(&quic_net_ops);
+ quic_hash_tables_destroy();
percpu_counter_destroy(&quic_sockets_allocated);
pr_info("quic: exit\n");
}
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 6605266eaa59..6a742fe57df1 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -50,6 +50,10 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ quic_data_free(quic_ticket(sk));
+ quic_data_free(quic_token(sk));
+ quic_data_free(quic_alpn(sk));
+
sk_sockets_allocated_dec(sk);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
}
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 98d3f738e909..9a2f4b851676 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -10,6 +10,8 @@
#include <net/udp_tunnel.h>
+#include "common.h"
+
#include "protocol.h"
extern struct proto quic_prot;
@@ -25,6 +27,10 @@ enum quic_state {
struct quic_sock {
struct inet_sock inet;
struct list_head reqs;
+
+ struct quic_data ticket;
+ struct quic_data token;
+ struct quic_data alpn;
};
struct quic6_sock {
@@ -42,6 +48,21 @@ static inline struct list_head *quic_reqs(const struct sock *sk)
return &quic_sk(sk)->reqs;
}
+static inline struct quic_data *quic_token(const struct sock *sk)
+{
+ return &quic_sk(sk)->token;
+}
+
+static inline struct quic_data *quic_ticket(const struct sock *sk)
+{
+ return &quic_sk(sk)->ticket;
+}
+
+static inline struct quic_data *quic_alpn(const struct sock *sk)
+{
+ return &quic_sk(sk)->alpn;
+}
+
static inline bool quic_is_serv(const struct sock *sk)
{
return !!sk->sk_max_ack_backlog;
--
2.47.1
* [PATCH net-next v11 04/15] quic: provide family ops for address and protocol
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Introduce QUIC address and protocol family operations to handle IPv4/IPv6
specifics consistently, similar to SCTP. The new family.{c,h} provide
helpers for routing, skb transmit handling, address parsing and
comparison, and UDP socket configuration initialization.
This consolidates protocol-family logic and enables cleaner dual-stack
support in the QUIC socket implementation.
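The dual-stack handling rests on the union quic_addr overlay defined in
common.h. A minimal userspace sketch of the branching pattern the family
helpers use (addr_port() is an illustrative helper, not kernel API):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Same overlay as union quic_addr in net/quic/common.h. */
union quic_addr {
	struct sockaddr_in6 v6;
	struct sockaddr_in v4;
	struct sockaddr sa;
};

/* Branch on sa_family, as quic_addr_hash() and the v4/v6 family ops
 * do. sin_port and sin6_port overlay at the same offset, so either
 * member reads the same bytes; the explicit branch keeps the intent
 * clear. */
static unsigned short addr_port(const union quic_addr *a)
{
	return ntohs(a->sa.sa_family == AF_INET6 ?
		     a->v6.sin6_port : a->v4.sin_port);
}
```

The same shape lets a single `union quic_addr *` flow through bind,
connect, and routing paths without per-family structs in the socket
layer.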
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v2:
- Add more checks for addrs in .get_user_addr() and .get_pref_addr().
- Consider sk_bound_dev_if in .udp_conf_init() and .flow_route() to
support vrf.
v3:
- Remove quic_addr_family/proto_ops abstraction; use if statements to
reduce indirect call overhead (suggested by Paolo).
- quic_v6_set_sk_addr(): add quic_v6_copy_sk_addr() helper to avoid
duplicate code (noted by Paolo).
- quic_v4_flow_route(): use flowi4_dscp per latest net-next changes.
v4:
- Remove unnecessary _fl variable from flow_route() functions (noted
by Paolo).
- Fix coding style of ?: operator (noted by Paolo).
v5:
- Remove several unused functions from this patch series (suggested by Paolo):
* quic_seq_dump_addr()
* quic_get_msg_ecn()
* quic_get_user_addr()
* quic_get_pref_addr()
* quic_set_pref_addr()
* quic_set_sk_addr()
* quic_set_sk_ecn()
- Replace the sa->v4/v6.sin_family checks with quic_v4/v6_is_any_addr()
in quic_v4/v6_flow_route() (suggested by Paolo).
- Introduce quic_v4_match_v6_addr() to simplify family-mismatch checks
between sk and addr in quic_v6_cmp_sk_addr() (noted by Paolo).
v6:
- Use udp_hdr(skb) to access UDP header in quic_v4/6_get_msg_addrs(), as
transport_header is no longer reset for QUIC.
v10:
- Fix argument types passed to ip6_dst_store() in quic_v6_flow_route().
v11:
- Set maximum line length to 80 characters.
- Change return type of quic_is_any_addr() to bool.
- Call local_bh_disable() in quic_lower_xmit() because
udp(6)_tunnel_xmit_skb() requires a non-preemptible context.
- Return a negative errno (-EINVAL) instead of 1 in
quic_v4/v6_get_mtu_info().
---
net/quic/Makefile | 2 +-
net/quic/family.c | 402 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/family.h | 39 +++++
net/quic/protocol.c | 2 +-
net/quic/socket.c | 6 +-
net/quic/socket.h | 1 +
6 files changed, 448 insertions(+), 4 deletions(-)
create mode 100644 net/quic/family.c
create mode 100644 net/quic/family.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index e0067272de7d..13bf4a4e5442 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o protocol.o socket.o
+quic-y := common.o family.o protocol.o socket.o
diff --git a/net/quic/family.c b/net/quic/family.c
new file mode 100644
index 000000000000..c16fffb35d9e
--- /dev/null
+++ b/net/quic/family.c
@@ -0,0 +1,402 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Address family operations for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/inet_common.h>
+#include <net/udp_tunnel.h>
+#include <linux/icmp.h>
+
+#include "common.h"
+#include "family.h"
+
+static bool quic_v4_is_any_addr(union quic_addr *addr)
+{
+ return addr->v4.sin_addr.s_addr == htonl(INADDR_ANY);
+}
+
+static bool quic_v6_is_any_addr(union quic_addr *addr)
+{
+ return ipv6_addr_any(&addr->v6.sin6_addr);
+}
+
+static void quic_v4_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf,
+ union quic_addr *a)
+{
+ conf->family = AF_INET;
+ conf->local_ip.s_addr = a->v4.sin_addr.s_addr;
+ conf->local_udp_port = a->v4.sin_port;
+ conf->use_udp6_rx_checksums = true;
+ conf->bind_ifindex = sk->sk_bound_dev_if;
+}
+
+static void quic_v6_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf,
+ union quic_addr *a)
+{
+ conf->family = AF_INET6;
+ conf->local_ip6 = a->v6.sin6_addr;
+ conf->local_udp_port = a->v6.sin6_port;
+ conf->use_udp6_rx_checksums = true;
+ conf->ipv6_v6only = ipv6_only_sock(sk);
+ conf->bind_ifindex = sk->sk_bound_dev_if;
+}
+
+static int quic_v4_flow_route(struct sock *sk, union quic_addr *da,
+ union quic_addr *sa, struct flowi *fl)
+{
+ struct flowi4 *fl4;
+ struct rtable *rt;
+
+ if (__sk_dst_check(sk, 0))
+ return 1;
+
+ memset(fl, 0x00, sizeof(*fl));
+ fl4 = &fl->u.ip4;
+ fl4->saddr = sa->v4.sin_addr.s_addr;
+ fl4->fl4_sport = sa->v4.sin_port;
+ fl4->daddr = da->v4.sin_addr.s_addr;
+ fl4->fl4_dport = da->v4.sin_port;
+ fl4->flowi4_proto = IPPROTO_UDP;
+ fl4->flowi4_oif = sk->sk_bound_dev_if;
+
+ fl4->flowi4_scope = ip_sock_rt_scope(sk);
+ fl4->flowi4_dscp = inet_sk_dscp(inet_sk(sk));
+
+ rt = ip_route_output_key(sock_net(sk), fl4);
+ if (IS_ERR(rt))
+ return PTR_ERR(rt);
+
+ if (quic_v4_is_any_addr(sa)) {
+ sa->v4.sin_family = AF_INET;
+ sa->v4.sin_addr.s_addr = fl4->saddr;
+ }
+ sk_setup_caps(sk, &rt->dst);
+ return 0;
+}
+
+static int quic_v6_flow_route(struct sock *sk, union quic_addr *da,
+ union quic_addr *sa, struct flowi *fl)
+{
+ struct ipv6_pinfo *np = inet6_sk(sk);
+ struct ip6_flowlabel *flowlabel;
+ struct dst_entry *dst;
+ struct flowi6 *fl6;
+
+ if (__sk_dst_check(sk, np->dst_cookie))
+ return 1;
+
+ memset(fl, 0x00, sizeof(*fl));
+ fl6 = &fl->u.ip6;
+ fl6->saddr = sa->v6.sin6_addr;
+ fl6->fl6_sport = sa->v6.sin6_port;
+ fl6->daddr = da->v6.sin6_addr;
+ fl6->fl6_dport = da->v6.sin6_port;
+ fl6->flowi6_proto = IPPROTO_UDP;
+ fl6->flowi6_oif = sk->sk_bound_dev_if;
+
+ if (inet6_test_bit(SNDFLOW, sk)) {
+ fl6->flowlabel = (da->v6.sin6_flowinfo & IPV6_FLOWINFO_MASK);
+ if (fl6->flowlabel & IPV6_FLOWLABEL_MASK) {
+ flowlabel = fl6_sock_lookup(sk, fl6->flowlabel);
+ if (IS_ERR(flowlabel))
+ return -EINVAL;
+ fl6_sock_release(flowlabel);
+ }
+ }
+
+ dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, NULL);
+ if (IS_ERR(dst))
+ return PTR_ERR(dst);
+
+ if (quic_v6_is_any_addr(sa)) {
+ sa->v6.sin6_family = AF_INET6;
+ sa->v6.sin6_addr = fl6->saddr;
+ }
+ ip6_dst_store(sk, dst, false, false);
+ return 0;
+}
+
+static void quic_v4_lower_xmit(struct sock *sk, struct sk_buff *skb,
+ struct flowi *fl)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 tos = (inet_sk(sk)->tos | cb->ecn), ttl;
+ struct flowi4 *fl4 = &fl->u.ip4;
+ struct dst_entry *dst;
+ __be16 df = 0;
+
+ pr_debug("%s: skb: %p, len: %d, num: %llu, %pI4:%d -> %pI4:%d\n",
+ __func__, skb, skb->len, cb->number, &fl4->saddr,
+ ntohs(fl4->fl4_sport), &fl4->daddr, ntohs(fl4->fl4_dport));
+
+ dst = sk_dst_get(sk);
+ if (!dst) {
+ kfree_skb(skb);
+ return;
+ }
+ if (ip_dont_fragment(sk, dst) && !skb->ignore_df)
+ df = htons(IP_DF);
+
+ ttl = (u8)ip4_dst_hoplimit(dst);
+ udp_tunnel_xmit_skb((struct rtable *)dst, sk, skb, fl4->saddr,
+ fl4->daddr, tos, ttl, df, fl4->fl4_sport,
+ fl4->fl4_dport, false, false, 0);
+}
+
+static void quic_v6_lower_xmit(struct sock *sk, struct sk_buff *skb,
+ struct flowi *fl)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 tc = (inet6_sk(sk)->tclass | cb->ecn), ttl;
+ struct flowi6 *fl6 = &fl->u.ip6;
+ struct dst_entry *dst;
+ __be32 label;
+
+ pr_debug("%s: skb: %p, len: %d, num: %llu, %pI6c:%d -> %pI6c:%d\n",
+ __func__, skb, skb->len, cb->number, &fl6->saddr,
+ ntohs(fl6->fl6_sport), &fl6->daddr, ntohs(fl6->fl6_dport));
+
+ dst = sk_dst_get(sk);
+ if (!dst) {
+ kfree_skb(skb);
+ return;
+ }
+
+ ttl = (u8)ip6_dst_hoplimit(dst);
+ label = ip6_make_flowlabel(sock_net(sk), skb, fl6->flowlabel, true,
+ fl6);
+ udp_tunnel6_xmit_skb(dst, sk, skb, NULL, &fl6->saddr, &fl6->daddr, tc,
+ ttl, label, fl6->fl6_sport, fl6->fl6_dport, false,
+ 0);
+}
+
+static void quic_v4_get_msg_addrs(struct sk_buff *skb, union quic_addr *da,
+ union quic_addr *sa)
+{
+ struct udphdr *uh = udp_hdr(skb);
+
+ sa->v4.sin_family = AF_INET;
+ sa->v4.sin_port = uh->source;
+ sa->v4.sin_addr.s_addr = ip_hdr(skb)->saddr;
+
+ da->v4.sin_family = AF_INET;
+ da->v4.sin_port = uh->dest;
+ da->v4.sin_addr.s_addr = ip_hdr(skb)->daddr;
+}
+
+static void quic_v6_get_msg_addrs(struct sk_buff *skb, union quic_addr *da,
+ union quic_addr *sa)
+{
+ struct udphdr *uh = udp_hdr(skb);
+
+ sa->v6.sin6_family = AF_INET6;
+ sa->v6.sin6_port = uh->source;
+ sa->v6.sin6_addr = ipv6_hdr(skb)->saddr;
+
+ da->v6.sin6_family = AF_INET6;
+ da->v6.sin6_port = uh->dest;
+ da->v6.sin6_addr = ipv6_hdr(skb)->daddr;
+}
+
+static int quic_v4_get_mtu_info(struct sk_buff *skb, u32 *info)
+{
+ struct icmphdr *hdr;
+
+ hdr = (struct icmphdr *)(skb_network_header(skb) -
+ sizeof(struct icmphdr));
+ if (hdr->type == ICMP_DEST_UNREACH && hdr->code == ICMP_FRAG_NEEDED) {
+ *info = ntohs(hdr->un.frag.mtu);
+ return 0;
+ }
+
+ /* Defer other types' processing to UDP error handler. */
+ return -EINVAL;
+}
+
+static int quic_v6_get_mtu_info(struct sk_buff *skb, u32 *info)
+{
+ struct icmp6hdr *hdr;
+
+ hdr = (struct icmp6hdr *)(skb_network_header(skb) -
+ sizeof(struct icmp6hdr));
+ if (hdr->icmp6_type == ICMPV6_PKT_TOOBIG) {
+ *info = ntohl(hdr->icmp6_mtu);
+ return 0;
+ }
+
+ /* Defer other types' processing to UDP error handler. */
+ return -EINVAL;
+}
+
+static bool quic_v4_cmp_sk_addr(struct sock *sk, union quic_addr *a,
+ union quic_addr *addr)
+{
+ if (a->v4.sin_port != addr->v4.sin_port)
+ return false;
+ if (a->v4.sin_family != addr->v4.sin_family)
+ return false;
+ if (a->v4.sin_addr.s_addr == htonl(INADDR_ANY) ||
+ addr->v4.sin_addr.s_addr == htonl(INADDR_ANY))
+ return true;
+ return a->v4.sin_addr.s_addr == addr->v4.sin_addr.s_addr;
+}
+
+static bool quic_v4_match_v6_addr(union quic_addr *a4, union quic_addr *a6)
+{
+ if (ipv6_addr_any(&a6->v6.sin6_addr))
+ return true;
+ if (ipv6_addr_v4mapped(&a6->v6.sin6_addr) &&
+ a6->v6.sin6_addr.s6_addr32[3] == a4->v4.sin_addr.s_addr)
+ return true;
+ return false;
+}
+
+static bool quic_v6_cmp_sk_addr(struct sock *sk, union quic_addr *a,
+ union quic_addr *addr)
+{
+ if (a->v4.sin_port != addr->v4.sin_port)
+ return false;
+
+ if (a->sa.sa_family == AF_INET && addr->sa.sa_family == AF_INET) {
+ if (a->v4.sin_addr.s_addr == htonl(INADDR_ANY) ||
+ addr->v4.sin_addr.s_addr == htonl(INADDR_ANY))
+ return true;
+ return a->v4.sin_addr.s_addr == addr->v4.sin_addr.s_addr;
+ }
+
+ if (a->sa.sa_family != addr->sa.sa_family) {
+ if (ipv6_only_sock(sk))
+ return false;
+ if (a->sa.sa_family == AF_INET)
+ return quic_v4_match_v6_addr(a, addr);
+ return quic_v4_match_v6_addr(addr, a);
+ }
+
+ if (ipv6_addr_any(&a->v6.sin6_addr) ||
+ ipv6_addr_any(&addr->v6.sin6_addr))
+ return true;
+ return ipv6_addr_equal(&a->v6.sin6_addr, &addr->v6.sin6_addr);
+}
+
+static int quic_v4_get_sk_addr(struct socket *sock, struct sockaddr *uaddr,
+ int peer)
+{
+ return inet_getname(sock, uaddr, peer);
+}
+
+static int quic_v6_get_sk_addr(struct socket *sock, struct sockaddr *uaddr,
+ int peer)
+{
+ union quic_addr *a = quic_addr(uaddr);
+ int ret;
+
+ ret = inet6_getname(sock, uaddr, peer);
+ if (ret < 0)
+ return ret;
+
+ if (a->sa.sa_family == AF_INET6 &&
+ ipv6_addr_v4mapped(&a->v6.sin6_addr)) {
+ a->v4.sin_family = AF_INET;
+ a->v4.sin_port = a->v6.sin6_port;
+ a->v4.sin_addr.s_addr = a->v6.sin6_addr.s6_addr32[3];
+ }
+
+ if (a->sa.sa_family == AF_INET) {
+ memset(a->v4.sin_zero, 0, sizeof(a->v4.sin_zero));
+ return sizeof(struct sockaddr_in);
+ }
+ return sizeof(struct sockaddr_in6);
+}
+
+#define quic_af_ipv4(a) ((a)->sa.sa_family == AF_INET)
+
+u32 quic_encap_len(union quic_addr *a)
+{
+ return (quic_af_ipv4(a) ? sizeof(struct iphdr) :
+ sizeof(struct ipv6hdr)) +
+ sizeof(struct udphdr);
+}
+
+bool quic_is_any_addr(union quic_addr *a)
+{
+ return quic_af_ipv4(a) ? quic_v4_is_any_addr(a) :
+ quic_v6_is_any_addr(a);
+}
+
+void quic_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf,
+ union quic_addr *a)
+{
+ quic_af_ipv4(a) ? quic_v4_udp_conf_init(sk, conf, a) :
+ quic_v6_udp_conf_init(sk, conf, a);
+}
+
+int quic_flow_route(struct sock *sk, union quic_addr *da, union quic_addr *sa,
+ struct flowi *fl)
+{
+ return quic_af_ipv4(da) ? quic_v4_flow_route(sk, da, sa, fl) :
+ quic_v6_flow_route(sk, da, sa, fl);
+}
+
+void quic_lower_xmit(struct sock *sk, struct sk_buff *skb, union quic_addr *da,
+ struct flowi *fl)
+{
+ local_bh_disable();
+ quic_af_ipv4(da) ? quic_v4_lower_xmit(sk, skb, fl) :
+ quic_v6_lower_xmit(sk, skb, fl);
+ local_bh_enable();
+}
+
+#define quic_skb_ipv4(skb) (ip_hdr(skb)->version == 4)
+
+void quic_get_msg_addrs(struct sk_buff *skb, union quic_addr *da,
+ union quic_addr *sa)
+{
+ memset(sa, 0, sizeof(*sa));
+ memset(da, 0, sizeof(*da));
+ quic_skb_ipv4(skb) ? quic_v4_get_msg_addrs(skb, da, sa) :
+ quic_v6_get_msg_addrs(skb, da, sa);
+}
+
+int quic_get_mtu_info(struct sk_buff *skb, u32 *info)
+{
+ return quic_skb_ipv4(skb) ? quic_v4_get_mtu_info(skb, info) :
+ quic_v6_get_mtu_info(skb, info);
+}
+
+#define quic_pf_ipv4(sk) ((sk)->sk_family == PF_INET)
+
+bool quic_cmp_sk_addr(struct sock *sk, union quic_addr *a,
+ union quic_addr *addr)
+{
+ return quic_pf_ipv4(sk) ? quic_v4_cmp_sk_addr(sk, a, addr) :
+ quic_v6_cmp_sk_addr(sk, a, addr);
+}
+
+int quic_get_sk_addr(struct socket *sock, struct sockaddr *a, bool peer)
+{
+ return quic_pf_ipv4(sock->sk) ? quic_v4_get_sk_addr(sock, a, peer) :
+ quic_v6_get_sk_addr(sock, a, peer);
+}
+
+int quic_common_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen)
+{
+ return quic_pf_ipv4(sk) ?
+ ip_setsockopt(sk, level, optname, optval, optlen) :
+ ipv6_setsockopt(sk, level, optname, optval, optlen);
+}
+
+int quic_common_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen)
+{
+ return quic_pf_ipv4(sk) ?
+ ip_getsockopt(sk, level, optname, optval, optlen) :
+ ipv6_getsockopt(sk, level, optname, optval, optlen);
+}
diff --git a/net/quic/family.h b/net/quic/family.h
new file mode 100644
index 000000000000..a68356c1ffb5
--- /dev/null
+++ b/net/quic/family.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_PORT_LEN 2
+#define QUIC_ADDR4_LEN 4
+#define QUIC_ADDR6_LEN 16
+
+#define QUIC_PREF_ADDR_LEN \
+ (QUIC_ADDR4_LEN + QUIC_PORT_LEN + QUIC_ADDR6_LEN + QUIC_PORT_LEN)
+
+bool quic_is_any_addr(union quic_addr *a);
+u32 quic_encap_len(union quic_addr *a);
+
+void quic_lower_xmit(struct sock *sk, struct sk_buff *skb, union quic_addr *da,
+ struct flowi *fl);
+int quic_flow_route(struct sock *sk, union quic_addr *da, union quic_addr *sa,
+ struct flowi *fl);
+void quic_udp_conf_init(struct sock *sk, struct udp_port_cfg *conf,
+ union quic_addr *a);
+
+void quic_get_msg_addrs(struct sk_buff *skb, union quic_addr *da,
+ union quic_addr *sa);
+int quic_get_mtu_info(struct sk_buff *skb, u32 *info);
+
+bool quic_cmp_sk_addr(struct sock *sk, union quic_addr *a,
+ union quic_addr *addr);
+int quic_get_sk_addr(struct socket *sock, struct sockaddr *a, bool peer);
+
+int quic_common_setsockopt(struct sock *sk, int level, int optname,
+ sockptr_t optval, unsigned int optlen);
+int quic_common_getsockopt(struct sock *sk, int level, int optname,
+ char __user *optval, int __user *optlen);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index 807e2a228ba8..c247f00f7ddc 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -47,7 +47,7 @@ static int quic_inet_listen(struct socket *sock, int backlog)
static int quic_inet_getname(struct socket *sock, struct sockaddr *uaddr,
int peer)
{
- return -EOPNOTSUPP;
+ return quic_get_sk_addr(sock, uaddr, peer);
}
static __poll_t quic_inet_poll(struct file *file, struct socket *sock,
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 6a742fe57df1..f1181215ebbf 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -117,7 +117,8 @@ static int quic_setsockopt(struct sock *sk, int level, int optname,
sockptr_t optval, unsigned int optlen)
{
if (level != SOL_QUIC)
- return -EOPNOTSUPP;
+ return quic_common_setsockopt(sk, level, optname, optval,
+ optlen);
return quic_do_setsockopt(sk, optname, optval, optlen);
}
@@ -132,7 +133,8 @@ static int quic_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *optlen)
{
if (level != SOL_QUIC)
- return -EOPNOTSUPP;
+ return quic_common_getsockopt(sk, level, optname, optval,
+ optlen);
return quic_do_getsockopt(sk, optname, USER_SOCKPTR(optval),
USER_SOCKPTR(optlen));
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 9a2f4b851676..0aa642e3b0ae 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -11,6 +11,7 @@
#include <net/udp_tunnel.h>
#include "common.h"
+#include "family.h"
#include "protocol.h"
--
2.47.1
* [PATCH net-next v11 05/15] quic: provide quic.h header files for kernel and userspace
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
This commit adds quic.h to include/uapi/linux, providing the necessary
definitions for the QUIC socket API. Exporting this header allows both
user space applications and kernel subsystems to access QUIC-related
control messages, socket options, and event/notification interfaces.
Since kernel_get/setsockopt() is no longer available to kernel consumers,
a corresponding internal header, include/linux/quic.h, is added. This
exposes quic_do_get/setsockopt() to handle QUIC socket options directly
for kernel subsystems.
Detailed descriptions of these structures are available in [1], and will
also be provided when the corresponding socket interfaces are added in
later patches.
[1] https://datatracker.ietf.org/doc/html/draft-lxin-quic-socket-apis
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Thomas Dreibholz <dreibh@simula.no>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v2:
- Fix a kernel API description warning, found by Jakub.
- Replace uintN_t with __uN, capitalize _UAPI_LINUX_QUIC_H, and
assign explicit values for QUIC_TRANSPORT_ERROR_ enum in UAPI
quic.h, suggested by David Howells.
v4:
- Use MSG_QUIC_ prefix for MSG_* flags to avoid conflicts with other
protocols, such as MSG_NOTIFICATION in SCTP (reported by Thomas).
- Remove QUIC_CONG_ALG_CUBIC; only NEW RENO congestion control is
supported in this version.
v5:
- Add include/linux/quic.h and include/uapi/linux/quic.h to the
QUIC PROTOCOL entry in MAINTAINERS.
v6:
- Fix the uAPI path copy/pasted from the SCTP entry into the QUIC entry
(noted by Jakub).
v7:
- Expose quic_do_get/setsockopt() instead of quic_kernel_get/setsockopt()
(suggested by Paolo).
v10:
- Fix typo: 'extented' -> 'extended' (noted by AI review).
- Add comment for inclusion of sys/socket.h in uapi quic.h.
- Add uses-libc += linux/quic.h in usr/include/Makefile to fix the new
build error.
- Delete config from struct quic_sock; its members will be split into
other subcomponents in future patches.
- Add explicit reserved fields to multiple structs to account for
implicit padding and ensure UAPI stability.
- Expand reserved fields in struct transport_param and config, handshake
and stream_info to allow future extensions without breaking the UAPI.
v11:
- Set maximum line length to 80 characters.
- Drop trailing reserved fields in structs and rely on
copy_struct_to/from_user() for extensibility; keep reserved fields
in the middle to indicate memory holes.
---
MAINTAINERS | 2 +
include/linux/quic.h | 22 ++++
include/uapi/linux/quic.h | 237 ++++++++++++++++++++++++++++++++++++++
net/quic/socket.c | 36 +++++-
net/quic/socket.h | 1 +
usr/include/Makefile | 1 +
6 files changed, 295 insertions(+), 4 deletions(-)
create mode 100644 include/linux/quic.h
create mode 100644 include/uapi/linux/quic.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 532030036a8c..271ddf760eb4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21901,6 +21901,8 @@ M: Xin Long <lucien.xin@gmail.com>
L: quic@lists.linux.dev
S: Maintained
W: https://github.com/lxin/quic
+F: include/linux/quic.h
+F: include/uapi/linux/quic.h
F: net/quic/
RADEON and AMDGPU DRM DRIVERS
diff --git a/include/linux/quic.h b/include/linux/quic.h
new file mode 100644
index 000000000000..4e43d20c9c11
--- /dev/null
+++ b/include/linux/quic.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#ifndef _LINUX_QUIC_H
+#define _LINUX_QUIC_H
+
+#include <linux/sockptr.h>
+#include <uapi/linux/quic.h>
+
+int quic_do_setsockopt(struct sock *sk, int optname, sockptr_t optval,
+ unsigned int optlen);
+int quic_do_getsockopt(struct sock *sk, int optname, sockptr_t optval,
+ sockptr_t optlen);
+
+#endif
diff --git a/include/uapi/linux/quic.h b/include/uapi/linux/quic.h
new file mode 100644
index 000000000000..92a0336b9b89
--- /dev/null
+++ b/include/uapi/linux/quic.h
@@ -0,0 +1,237 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#ifndef _UAPI_LINUX_QUIC_H
+#define _UAPI_LINUX_QUIC_H
+
+#include <linux/types.h>
+#ifdef __KERNEL__
+#include <linux/socket.h>
+#else
+#include <sys/socket.h> /* for MSG_* flags */
+#endif
+
+/* NOTE: Structure descriptions are specified in:
+ * https://datatracker.ietf.org/doc/html/draft-lxin-quic-socket-apis
+ */
+
+/* Send or Receive Options APIs */
+enum quic_cmsg_type {
+ QUIC_STREAM_INFO,
+ QUIC_HANDSHAKE_INFO,
+};
+
+#define QUIC_STREAM_TYPE_SERVER_MASK 0x01
+#define QUIC_STREAM_TYPE_UNI_MASK 0x02
+#define QUIC_STREAM_TYPE_MASK 0x03
+
+enum quic_msg_flags {
+ /* flags for stream_flags */
+ MSG_QUIC_STREAM_NEW = MSG_SYN,
+ MSG_QUIC_STREAM_FIN = MSG_FIN,
+ MSG_QUIC_STREAM_UNI = MSG_CONFIRM,
+ MSG_QUIC_STREAM_DONTWAIT = MSG_WAITFORONE,
+ MSG_QUIC_STREAM_SNDBLOCK = MSG_ERRQUEUE,
+
+ /* extended flags for msg_flags */
+ MSG_QUIC_DATAGRAM = MSG_RST,
+ MSG_QUIC_NOTIFICATION = MSG_MORE,
+};
+
+enum quic_crypto_level {
+ QUIC_CRYPTO_APP,
+ QUIC_CRYPTO_INITIAL,
+ QUIC_CRYPTO_HANDSHAKE,
+ QUIC_CRYPTO_EARLY,
+ QUIC_CRYPTO_MAX,
+};
+
+struct quic_handshake_info {
+ __u8 crypto_level;
+};
+
+struct quic_stream_info {
+ __s64 stream_id;
+ __u32 stream_flags;
+};
+
+/* Socket Options APIs */
+#define QUIC_SOCKOPT_EVENT 0
+#define QUIC_SOCKOPT_STREAM_OPEN 1
+#define QUIC_SOCKOPT_STREAM_RESET 2
+#define QUIC_SOCKOPT_STREAM_STOP_SENDING 3
+#define QUIC_SOCKOPT_CONNECTION_ID 4
+#define QUIC_SOCKOPT_CONNECTION_CLOSE 5
+#define QUIC_SOCKOPT_CONNECTION_MIGRATION 6
+#define QUIC_SOCKOPT_KEY_UPDATE 7
+#define QUIC_SOCKOPT_TRANSPORT_PARAM 8
+#define QUIC_SOCKOPT_CONFIG 9
+#define QUIC_SOCKOPT_TOKEN 10
+#define QUIC_SOCKOPT_ALPN 11
+#define QUIC_SOCKOPT_SESSION_TICKET 12
+#define QUIC_SOCKOPT_CRYPTO_SECRET 13
+#define QUIC_SOCKOPT_TRANSPORT_PARAM_EXT 14
+
+#define QUIC_VERSION_V1 0x1
+#define QUIC_VERSION_V2 0x6b3343cf
+
+struct quic_transport_param {
+ __u8 remote;
+ __u8 disable_active_migration;
+ __u8 grease_quic_bit;
+ __u8 stateless_reset;
+ __u8 disable_1rtt_encryption;
+ __u8 disable_compatible_version;
+ __u8 active_connection_id_limit;
+ __u8 ack_delay_exponent;
+ __u16 max_datagram_frame_size;
+ __u16 max_udp_payload_size;
+ __u32 max_idle_timeout;
+ __u32 max_ack_delay;
+ __u16 max_streams_bidi;
+ __u16 max_streams_uni;
+ __u64 max_data;
+ __u64 max_stream_data_bidi_local;
+ __u64 max_stream_data_bidi_remote;
+ __u64 max_stream_data_uni;
+};
+
+struct quic_config {
+ __u32 version;
+ __u32 plpmtud_probe_interval;
+ __u32 initial_smoothed_rtt;
+ __u32 payload_cipher_type;
+ __u8 congestion_control_algo;
+ __u8 validate_peer_address;
+ __u8 stream_data_nodelay;
+ __u8 receive_session_ticket;
+ __u8 certificate_request;
+};
+
+struct quic_crypto_secret {
+ __u8 send; /* send or recv */
+ __u8 level; /* crypto level */
+ __u16 reserved;
+ __u32 type; /* TLS_CIPHER_* */
+#define QUIC_CRYPTO_SECRET_BUFFER_SIZE 48
+ __u8 secret[QUIC_CRYPTO_SECRET_BUFFER_SIZE];
+};
+
+enum quic_cong_algo {
+ QUIC_CONG_ALG_RENO,
+ QUIC_CONG_ALG_MAX,
+};
+
+struct quic_errinfo {
+ __s64 stream_id;
+ __u32 errcode;
+};
+
+struct quic_connection_id_info {
+ __u8 dest;
+ __u8 reserved[3];
+ __u32 active;
+ __u32 prior_to;
+};
+
+struct quic_event_option {
+ __u8 type;
+ __u8 on;
+};
+
+/* Event APIs */
+enum quic_event_type {
+ QUIC_EVENT_NONE,
+ QUIC_EVENT_STREAM_UPDATE,
+ QUIC_EVENT_STREAM_MAX_DATA,
+ QUIC_EVENT_STREAM_MAX_STREAM,
+ QUIC_EVENT_CONNECTION_ID,
+ QUIC_EVENT_CONNECTION_CLOSE,
+ QUIC_EVENT_CONNECTION_MIGRATION,
+ QUIC_EVENT_KEY_UPDATE,
+ QUIC_EVENT_NEW_TOKEN,
+ QUIC_EVENT_NEW_SESSION_TICKET,
+ QUIC_EVENT_MAX,
+};
+
+enum {
+ QUIC_STREAM_SEND_STATE_READY,
+ QUIC_STREAM_SEND_STATE_SEND,
+ QUIC_STREAM_SEND_STATE_SENT,
+ QUIC_STREAM_SEND_STATE_RECVD,
+ QUIC_STREAM_SEND_STATE_RESET_SENT,
+ QUIC_STREAM_SEND_STATE_RESET_RECVD,
+
+ QUIC_STREAM_RECV_STATE_RECV,
+ QUIC_STREAM_RECV_STATE_SIZE_KNOWN,
+ QUIC_STREAM_RECV_STATE_RECVD,
+ QUIC_STREAM_RECV_STATE_READ,
+ QUIC_STREAM_RECV_STATE_RESET_RECVD,
+ QUIC_STREAM_RECV_STATE_RESET_READ,
+};
+
+struct quic_stream_update {
+ __s64 id;
+ __u8 state;
+ __u8 reserved[3];
+ __u32 errcode;
+ __u64 finalsz;
+};
+
+struct quic_stream_max_data {
+ __s64 id;
+ __u64 max_data;
+};
+
+struct quic_connection_close {
+ __u32 errcode;
+ __u8 frame;
+ __u8 reserved[3];
+ __u8 phrase[];
+};
+
+union quic_event {
+ struct quic_stream_update update;
+ struct quic_stream_max_data max_data;
+ struct quic_connection_close close;
+ struct quic_connection_id_info info;
+ __u64 max_stream;
+ __u8 local_migration;
+ __u8 key_update_phase;
+};
+
+enum {
+ QUIC_TRANSPORT_ERROR_NONE = 0x00,
+ QUIC_TRANSPORT_ERROR_INTERNAL = 0x01,
+ QUIC_TRANSPORT_ERROR_CONNECTION_REFUSED = 0x02,
+ QUIC_TRANSPORT_ERROR_FLOW_CONTROL = 0x03,
+ QUIC_TRANSPORT_ERROR_STREAM_LIMIT = 0x04,
+ QUIC_TRANSPORT_ERROR_STREAM_STATE = 0x05,
+ QUIC_TRANSPORT_ERROR_FINAL_SIZE = 0x06,
+ QUIC_TRANSPORT_ERROR_FRAME_ENCODING = 0x07,
+ QUIC_TRANSPORT_ERROR_TRANSPORT_PARAM = 0x08,
+ QUIC_TRANSPORT_ERROR_CONNECTION_ID_LIMIT = 0x09,
+ QUIC_TRANSPORT_ERROR_PROTOCOL_VIOLATION = 0x0a,
+ QUIC_TRANSPORT_ERROR_INVALID_TOKEN = 0x0b,
+ QUIC_TRANSPORT_ERROR_APPLICATION = 0x0c,
+ QUIC_TRANSPORT_ERROR_CRYPTO_BUF_EXCEEDED = 0x0d,
+ QUIC_TRANSPORT_ERROR_KEY_UPDATE = 0x0e,
+ QUIC_TRANSPORT_ERROR_AEAD_LIMIT_REACHED = 0x0f,
+ QUIC_TRANSPORT_ERROR_NO_VIABLE_PATH = 0x10,
+
+ /* The cryptographic handshake failed. A range of 256 values is reserved
+ * for carrying error codes specific to the cryptographic handshake that
+ * is used. Codes for errors occurring when TLS is used for the
+ * cryptographic handshake are described in Section 4.8 of [QUIC-TLS].
+ */
+ QUIC_TRANSPORT_ERROR_CRYPTO = 0x0100,
+};
+
+#endif /* _UAPI_LINUX_QUIC_H */
diff --git a/net/quic/socket.c b/net/quic/socket.c
index f1181215ebbf..8dc2cb7628db 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -107,11 +107,25 @@ static void quic_close(struct sock *sk, long timeout)
sk_common_release(sk);
}
-static int quic_do_setsockopt(struct sock *sk, int optname, sockptr_t optval,
- unsigned int optlen)
+/**
+ * quic_do_setsockopt - set a QUIC socket option
+ * @sk: socket to configure
+ * @optname: option name (QUIC-level)
+ * @optval: user buffer containing the option value
+ * @optlen: size of the option value
+ *
+ * Sets a QUIC socket option on a given socket.
+ *
+ * Return:
+ * - On success, 0 is returned.
+ * - On error, a negative error value is returned.
+ */
+int quic_do_setsockopt(struct sock *sk, int optname, sockptr_t optval,
+ unsigned int optlen)
{
return -EOPNOTSUPP;
}
+EXPORT_SYMBOL_GPL(quic_do_setsockopt);
static int quic_setsockopt(struct sock *sk, int level, int optname,
sockptr_t optval, unsigned int optlen)
@@ -123,11 +137,25 @@ static int quic_setsockopt(struct sock *sk, int level, int optname,
return quic_do_setsockopt(sk, optname, optval, optlen);
}
-static int quic_do_getsockopt(struct sock *sk, int optname, sockptr_t optval,
- sockptr_t optlen)
+/**
+ * quic_do_getsockopt - get a QUIC socket option
+ * @sk: socket to query
+ * @optname: option name (QUIC-level)
+ * @optval: user buffer to receive the option value
+ * @optlen: pointer to buffer size; updated with actual size on return
+ *
+ * Gets a QUIC socket option from a given socket.
+ *
+ * Return:
+ * - On success, 0 is returned.
+ * - On error, a negative error value is returned.
+ */
+int quic_do_getsockopt(struct sock *sk, int optname, sockptr_t optval,
+ sockptr_t optlen)
{
return -EOPNOTSUPP;
}
+EXPORT_SYMBOL_GPL(quic_do_getsockopt);
static int quic_getsockopt(struct sock *sk, int level, int optname,
char __user *optval, int __user *optlen)
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 0aa642e3b0ae..61df0c5867be 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -9,6 +9,7 @@
*/
#include <net/udp_tunnel.h>
+#include <linux/quic.h>
#include "common.h"
#include "family.h"
diff --git a/usr/include/Makefile b/usr/include/Makefile
index 6d86a53c6f0a..a0442e4d77da 100644
--- a/usr/include/Makefile
+++ b/usr/include/Makefile
@@ -120,6 +120,7 @@ uses-libc += linux/netfilter_ipv4.h
uses-libc += linux/netfilter_ipv4/ip_tables.h
uses-libc += linux/netfilter_ipv6.h
uses-libc += linux/netfilter_ipv6/ip6_tables.h
+uses-libc += linux/quic.h
uses-libc += linux/route.h
uses-libc += linux/shm.h
uses-libc += linux/soundcard.h
--
2.47.1
* [PATCH net-next v11 06/15] quic: add stream management
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
This patch introduces 'struct quic_stream_table' for managing QUIC streams,
each represented by 'struct quic_stream'.
It implements mechanisms for acquiring and releasing streams on both the
send and receive paths, ensuring efficient lifecycle management during
transmission and reception.
- quic_stream_get(): Acquire a send-side stream by ID and flags during
TX path, or a receive-side stream by ID during RX path.
- quic_stream_put(): Release a send-side stream when sending is done,
or a receive-side stream when receiving is done.
It includes logic to detect when stream ID limits are reached and when
control frames should be sent to update or request limits from the peer.
- quic_stream_id_exceeds(): Check whether a stream ID would exceed the
local (recv) or peer (send) limits.
- quic_stream_max_streams_update(): Determine whether a
MAX_STREAMS_UNI/BIDI frame should be sent to the peer.
Note the stream hash table is per socket; operations on it are always
protected by the sock lock.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v3:
- Merge send/recv stream helpers into unified functions to reduce code:
* quic_stream_id_send/recv() → quic_stream_id_valid()
* quic_stream_id_send/recv_closed() → quic_stream_id_closed()
* quic_stream_id_send/recv_exceeds() → quic_stream_id_exceeds()
(pointed out by Paolo).
- Clarify in changelog that stream hash table is always protected by sock
lock (suggested by Paolo).
- quic_stream_init/free(): adjust for new hashtable type; call
quic_stream_delete() in quic_stream_free() to avoid open-coded logic.
- Receiving streams: delete stream only when fully read or reset, instead
of when no data was received. Prevents freeing a stream while a FIN
with no data is still queued.
v4:
- Replace struct quic_shash_table with struct hlist_head for the
stream hashtable. Since they are protected by the socket lock,
no per-chain lock is needed.
- Initialize stream to NULL in stream creation functions to avoid
warnings from Smatch (reported by Simon).
- Allocate send streams with GFP_KERNEL_ACCOUNT and receive streams
with GFP_ATOMIC | __GFP_ACCOUNT for memory accounting (suggested
by Paolo).
v5:
- Introduce struct quic_stream_limits to merge quic_stream_send_create()
and quic_stream_recv_create(), and to simplify quic_stream_get_param()
(suggested by Paolo).
- Annotate the sock-lock requirement for quic_stream_send/recv_get()
and quic_stream_send/recv_put() (noted by Paolo).
- Add quic_stream_bidi_put() to deduplicate the common logic between
quic_stream_send_put() and quic_stream_recv_put().
- Remove the unnecessary check when incrementing
streams->send.next_bidi/uni_stream_id in quic_stream_create().
- Remove the unused 'is_serv' parameter from quic_stream_get_param().
v7:
- Free the allocated streams on error path in quic_stream_create() (noted
by Paolo).
- Merge quic_stream_send_get/put() and quic_stream_recv_get/put() helpers
to quic_stream_get/put() (suggested by Paolo).
- Add more comments in quic_stream_id_exceeds() and quic_stream_create().
v8:
- Replace bitfields with plain u8 in struct quic_stream_limits and struct
quic_stream (suggested by Paolo).
v9:
- Fix grammar in the comment for quic_stream::send.window.
v10:
- Move quic_stream_init() to after sock_prot_inuse_add() to ensure
counters are incremented before any early return paths in quic_init_sock(),
preventing underflow in quic_destroy_sock() (noted by AI review).
- Initialize the output parameters '*max_uni' and '*max_bidi' to 0 at the
start of quic_stream_max_streams_update().
- Use 'stream->recv.state > QUIC_STREAM_RECV_STATE_RECVD' instead of '!='
for clearer intent.
- Simplify some state checks in quic_stream_put() by using range
comparisons (> or <) instead of multiple != conditions.
- streams_uni/bidi are u16 type, and their overflow is already prevented
by QUIC_MAX_STREAMS indirectly. Update comment in quic_stream_create().
- Replace open-coded kzalloc(sizeof(*stream)) with kzalloc_obj(*stream)
in quic_stream_create().
v11:
- Set maximum line length to 80 characters.
- Change is_serv parameter type to bool in quic_stream_id_local().
---
net/quic/Makefile | 2 +-
net/quic/socket.c | 5 +
net/quic/socket.h | 8 +
net/quic/stream.c | 444 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/stream.h | 133 ++++++++++++++
5 files changed, 591 insertions(+), 1 deletion(-)
create mode 100644 net/quic/stream.c
create mode 100644 net/quic/stream.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 13bf4a4e5442..094e9da5d739 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o
+quic-y := common.o family.o protocol.o socket.o stream.o
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 8dc2cb7628db..0006668551f4 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -45,11 +45,16 @@ static int quic_init_sock(struct sock *sk)
sk_sockets_allocated_inc(sk);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+ if (quic_stream_init(quic_streams(sk)))
+ return -ENOMEM;
+
return 0;
}
static void quic_destroy_sock(struct sock *sk)
{
+ quic_stream_free(quic_streams(sk));
+
quic_data_free(quic_ticket(sk));
quic_data_free(quic_token(sk));
quic_data_free(quic_alpn(sk));
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 61df0c5867be..e76737b9b74b 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -13,6 +13,7 @@
#include "common.h"
#include "family.h"
+#include "stream.h"
#include "protocol.h"
@@ -33,6 +34,8 @@ struct quic_sock {
struct quic_data ticket;
struct quic_data token;
struct quic_data alpn;
+
+ struct quic_stream_table streams;
};
struct quic6_sock {
@@ -65,6 +68,11 @@ static inline struct quic_data *quic_alpn(const struct sock *sk)
return &quic_sk(sk)->alpn;
}
+static inline struct quic_stream_table *quic_streams(const struct sock *sk)
+{
+ return &quic_sk(sk)->streams;
+}
+
static inline bool quic_is_serv(const struct sock *sk)
{
return !!sk->sk_max_ack_backlog;
diff --git a/net/quic/stream.c b/net/quic/stream.c
new file mode 100644
index 000000000000..4d980f9b03ce
--- /dev/null
+++ b/net/quic/stream.c
@@ -0,0 +1,444 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Stream creation, lookup, and lifecycle management for QUIC.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/quic.h>
+
+#include "common.h"
+#include "stream.h"
+
+/* Check if a stream ID is valid for sending or receiving. */
+static bool quic_stream_id_valid(s64 stream_id, bool is_serv, bool send)
+{
+ u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
+
+ if (send) {
+ if (is_serv)
+ return type != QUIC_STREAM_TYPE_CLIENT_UNI;
+ return type != QUIC_STREAM_TYPE_SERVER_UNI;
+ }
+ if (is_serv)
+ return type != QUIC_STREAM_TYPE_SERVER_UNI;
+ return type != QUIC_STREAM_TYPE_CLIENT_UNI;
+}
+
+/* Check if a stream ID was initiated locally. */
+static bool quic_stream_id_local(s64 stream_id, bool is_serv)
+{
+ return is_serv ^ !(stream_id & QUIC_STREAM_TYPE_SERVER_MASK);
+}
+
+/* Check if a stream ID represents a unidirectional stream. */
+static bool quic_stream_id_uni(s64 stream_id)
+{
+ return stream_id & QUIC_STREAM_TYPE_UNI_MASK;
+}
+
+#define QUIC_STREAM_HT_SIZE 64
+
+static struct hlist_head *quic_stream_head(struct quic_stream_table *streams,
+ s64 stream_id)
+{
+ return &streams->head[stream_id & (QUIC_STREAM_HT_SIZE - 1)];
+}
+
+struct quic_stream *quic_stream_find(struct quic_stream_table *streams,
+ s64 stream_id)
+{
+ struct hlist_head *head = quic_stream_head(streams, stream_id);
+ struct quic_stream *stream;
+
+ hlist_for_each_entry(stream, head, node) {
+ if (stream->id == stream_id)
+ break;
+ }
+ return stream;
+}
+
+static void quic_stream_add(struct quic_stream_table *streams,
+ struct quic_stream *stream)
+{
+ struct hlist_head *head;
+
+ head = quic_stream_head(streams, stream->id);
+ hlist_add_head(&stream->node, head);
+}
+
+static void quic_stream_delete(struct quic_stream *stream)
+{
+ hlist_del_init(&stream->node);
+ kfree(stream);
+}
+
+/* Create and register new streams for sending or receiving. */
+static struct quic_stream *quic_stream_create(struct quic_stream_table *streams,
+ s64 max_stream_id, bool send,
+ bool is_serv)
+{
+ struct quic_stream_limits *limits = &streams->send;
+ struct quic_stream *pos, *stream = NULL;
+ gfp_t gfp = GFP_KERNEL_ACCOUNT;
+ struct hlist_node *tmp;
+ HLIST_HEAD(head);
+ s64 stream_id;
+ u32 count = 0;
+
+ if (!send) {
+ limits = &streams->recv;
+ gfp = GFP_ATOMIC | __GFP_ACCOUNT;
+ }
+ stream_id = limits->next_bidi_stream_id;
+ if (quic_stream_id_uni(max_stream_id))
+ stream_id = limits->next_uni_stream_id;
+
+ /* rfc9000#section-2.1: A stream ID that is used out of order results in
+ * all streams of that type with lower-numbered stream IDs also being
+ * opened.
+ */
+ while (stream_id <= max_stream_id) {
+ stream = kzalloc_obj(*stream, gfp);
+ if (!stream)
+ goto free;
+
+ stream->id = stream_id;
+ if (quic_stream_id_uni(stream_id)) {
+ if (send) {
+ stream->send.max_bytes =
+ limits->max_stream_data_uni;
+ } else {
+ stream->recv.max_bytes =
+ limits->max_stream_data_uni;
+ stream->recv.window = stream->recv.max_bytes;
+ }
+ hlist_add_head(&stream->node, &head);
+ stream_id += QUIC_STREAM_ID_STEP;
+ continue;
+ }
+
+ if (quic_stream_id_local(stream_id, is_serv)) {
+ stream->send.max_bytes =
+ streams->send.max_stream_data_bidi_remote;
+ stream->recv.max_bytes =
+ streams->recv.max_stream_data_bidi_local;
+ } else {
+ stream->send.max_bytes =
+ streams->send.max_stream_data_bidi_local;
+ stream->recv.max_bytes =
+ streams->recv.max_stream_data_bidi_remote;
+ }
+ stream->recv.window = stream->recv.max_bytes;
+ hlist_add_head(&stream->node, &head);
+ stream_id += QUIC_STREAM_ID_STEP;
+ }
+
+ hlist_for_each_entry_safe(pos, tmp, &head, node) {
+ hlist_del_init(&pos->node);
+ quic_stream_add(streams, pos);
+ count++;
+ }
+
+ /* Streams must be opened sequentially. Update the next stream ID so the
+ * correct starting point is known if an out-of-order open is requested.
+ * Note overflow of next_uni/bidi_stream_id is impossible with s64.
+ */
+ if (quic_stream_id_uni(stream_id)) {
+ limits->next_uni_stream_id = stream_id;
+ limits->streams_uni += count;
+ return stream;
+ }
+
+ limits->next_bidi_stream_id = stream_id;
+ limits->streams_bidi += count;
+ return stream;
+
+free:
+ hlist_for_each_entry_safe(pos, tmp, &head, node) {
+ hlist_del_init(&pos->node);
+ kfree(pos);
+ }
+ return NULL;
+}
+
+/* Check if a send or receive stream ID is already closed. */
+static bool quic_stream_id_closed(struct quic_stream_table *streams,
+ s64 stream_id, bool send)
+{
+ struct quic_stream_limits *limits = send ? &streams->send :
+ &streams->recv;
+
+ if (quic_stream_id_uni(stream_id))
+ return stream_id < limits->next_uni_stream_id;
+ return stream_id < limits->next_bidi_stream_id;
+}
+
+/* Check if a stream ID would exceed local (recv) or peer (send) limits. */
+bool quic_stream_id_exceeds(struct quic_stream_table *streams, s64 stream_id,
+ bool send)
+{
+ u64 nstreams;
+
+ if (!send) {
+ /* recv.max_uni_stream_id is updated in
+ * quic_stream_max_streams_update() already based on
+ * next_uni/bidi_stream_id, max_streams_uni/bidi, and
+ * streams_uni/bidi, so only recv.max_uni/bidi_stream_id needs to
+ * be checked.
+ */
+ if (quic_stream_id_uni(stream_id))
+ return stream_id > streams->recv.max_uni_stream_id;
+
+ return stream_id > streams->recv.max_bidi_stream_id;
+ }
+
+ if (quic_stream_id_uni(stream_id)) {
+ if (stream_id > streams->send.max_uni_stream_id)
+ return true;
+ stream_id -= streams->send.next_uni_stream_id;
+ nstreams = quic_stream_id_to_streams(stream_id);
+
+ return nstreams + streams->send.streams_uni >
+ streams->send.max_streams_uni;
+ }
+
+ if (stream_id > streams->send.max_bidi_stream_id)
+ return true;
+ stream_id -= streams->send.next_bidi_stream_id;
+ nstreams = quic_stream_id_to_streams(stream_id);
+
+ return nstreams + streams->send.streams_bidi >
+ streams->send.max_streams_bidi;
+}
+
+/* Get or create a send or recv stream by ID. Requires sock lock held. */
+struct quic_stream *quic_stream_get(struct quic_stream_table *streams,
+ s64 stream_id, u32 flags, bool is_serv,
+ bool send)
+{
+ struct quic_stream *stream;
+
+ if (!quic_stream_id_valid(stream_id, is_serv, send))
+ return ERR_PTR(-EINVAL);
+
+ stream = quic_stream_find(streams, stream_id);
+ if (stream) {
+ if (send && (flags & MSG_QUIC_STREAM_NEW) &&
+ stream->send.state != QUIC_STREAM_SEND_STATE_READY)
+ return ERR_PTR(-EINVAL);
+ return stream;
+ }
+
+ if (!send && quic_stream_id_local(stream_id, is_serv)) {
+ if (quic_stream_id_closed(streams, stream_id, !send))
+ return ERR_PTR(-ENOSTR);
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (quic_stream_id_closed(streams, stream_id, send))
+ return ERR_PTR(-ENOSTR);
+
+ if (send && !(flags & MSG_QUIC_STREAM_NEW))
+ return ERR_PTR(-EINVAL);
+
+ if (quic_stream_id_exceeds(streams, stream_id, send))
+ return ERR_PTR(-EAGAIN);
+
+ stream = quic_stream_create(streams, stream_id, send, is_serv);
+ if (!stream)
+ return ERR_PTR(-ENOSTR);
+
+ if (send || quic_stream_id_valid(stream_id, is_serv, !send))
+ streams->send.active_stream_id = stream_id;
+
+ return stream;
+}
+
+/* Release or clean up a send or recv stream. This function updates stream
+ * counters and state when a send stream has either successfully sent all data
+ * or has been reset, or when a recv stream has either consumed all data or has
+ * been reset. Requires sock lock held.
+ */
+void quic_stream_put(struct quic_stream_table *streams,
+ struct quic_stream *stream, bool is_serv, bool send)
+{
+ if (quic_stream_id_uni(stream->id)) {
+ if (send) {
+ /* For uni streams, decrement uni count and delete
+ * immediately.
+ */
+ streams->send.streams_uni--;
+ quic_stream_delete(stream);
+ return;
+ }
+ /* For uni streams, decrement uni count and mark done. */
+ if (!stream->recv.done) {
+ stream->recv.done = 1;
+ streams->recv.streams_uni--;
+ streams->recv.uni_pending = 1;
+ }
+ /* Delete stream if fully read or reset. */
+ if (stream->recv.state > QUIC_STREAM_RECV_STATE_RECVD)
+ quic_stream_delete(stream);
+ return;
+ }
+
+ if (send) {
+ /* For bidi streams, only proceed if receive side is in a final
+ * state.
+ */
+ if (stream->recv.state < QUIC_STREAM_RECV_STATE_RECVD)
+ return;
+ } else {
+ /* For bidi streams, only proceed if send side is in a final
+ * state.
+ */
+ if (stream->send.state != QUIC_STREAM_SEND_STATE_RECVD &&
+ stream->send.state != QUIC_STREAM_SEND_STATE_RESET_RECVD)
+ return;
+ }
+
+ if (quic_stream_id_local(stream->id, is_serv)) {
+ /* Local-initiated stream: mark send done and decrement
+ * send.bidi count.
+ */
+ if (!stream->send.done) {
+ stream->send.done = 1;
+ streams->send.streams_bidi--;
+ }
+ } else {
+ /* Remote-initiated stream: mark recv done and decrement recv
+ * bidi count.
+ */
+ if (!stream->recv.done) {
+ stream->recv.done = 1;
+ streams->recv.streams_bidi--;
+ streams->recv.bidi_pending = 1;
+ }
+ }
+
+ /* Delete stream if fully read or reset. */
+ if (stream->recv.state > QUIC_STREAM_RECV_STATE_RECVD)
+ quic_stream_delete(stream);
+}
+
+/* Updates the maximum allowed incoming stream IDs if any streams were recently
+ * closed. Recalculates the max_uni and max_bidi stream ID limits based on the
+ * number of open streams and whether any were marked for deletion.
+ *
+ * Returns true if either max_uni or max_bidi was updated, indicating that a
+ * MAX_STREAMS_UNI or MAX_STREAMS_BIDI frame should be sent to the peer.
+ */
+bool quic_stream_max_streams_update(struct quic_stream_table *streams,
+ s64 *max_uni, s64 *max_bidi)
+{
+ s64 max, rem;
+
+ *max_uni = 0;
+ *max_bidi = 0;
+ if (streams->recv.uni_pending) {
+ rem = streams->recv.max_streams_uni - streams->recv.streams_uni;
+ max = streams->recv.next_uni_stream_id - QUIC_STREAM_ID_STEP +
+ (rem << QUIC_STREAM_TYPE_BITS);
+
+ streams->recv.max_uni_stream_id = max;
+ *max_uni = quic_stream_id_to_streams(max);
+ streams->recv.uni_pending = 0;
+ }
+ if (streams->recv.bidi_pending) {
+ rem = streams->recv.max_streams_bidi -
+ streams->recv.streams_bidi;
+ max = streams->recv.next_bidi_stream_id - QUIC_STREAM_ID_STEP +
+ (rem << QUIC_STREAM_TYPE_BITS);
+
+ streams->recv.max_bidi_stream_id = max;
+ *max_bidi = quic_stream_id_to_streams(max);
+ streams->recv.bidi_pending = 0;
+ }
+
+ return *max_uni || *max_bidi;
+}
+
+int quic_stream_init(struct quic_stream_table *streams)
+{
+ struct hlist_head *head;
+ int i;
+
+ head = kmalloc_array(QUIC_STREAM_HT_SIZE, sizeof(*head), GFP_KERNEL);
+ if (!head)
+ return -ENOMEM;
+ for (i = 0; i < QUIC_STREAM_HT_SIZE; i++)
+ INIT_HLIST_HEAD(&head[i]);
+ streams->head = head;
+ return 0;
+}
+
+void quic_stream_free(struct quic_stream_table *streams)
+{
+ struct quic_stream *stream;
+ struct hlist_head *head;
+ struct hlist_node *tmp;
+ int i;
+
+ if (!streams->head)
+ return;
+
+ for (i = 0; i < QUIC_STREAM_HT_SIZE; i++) {
+ head = &streams->head[i];
+ hlist_for_each_entry_safe(stream, tmp, head, node)
+ quic_stream_delete(stream);
+ }
+ kfree(streams->head);
+}
+
+/* Populate transport parameters from the stream table limits. */
+void quic_stream_get_param(struct quic_stream_table *streams,
+ struct quic_transport_param *p)
+{
+ struct quic_stream_limits *limits = p->remote ? &streams->send :
+ &streams->recv;
+
+ p->max_stream_data_bidi_remote = limits->max_stream_data_bidi_remote;
+ p->max_stream_data_bidi_local = limits->max_stream_data_bidi_local;
+ p->max_stream_data_uni = limits->max_stream_data_uni;
+ p->max_streams_bidi = limits->max_streams_bidi;
+ p->max_streams_uni = limits->max_streams_uni;
+}
+
+/* Configure stream table limits from transport parameters. */
+void quic_stream_set_param(struct quic_stream_table *streams,
+ struct quic_transport_param *p, bool is_serv)
+{
+ struct quic_stream_limits *limits = p->remote ? &streams->send :
+ &streams->recv;
+ u8 bidi_type, uni_type;
+
+ limits->max_stream_data_bidi_local = p->max_stream_data_bidi_local;
+ limits->max_stream_data_bidi_remote = p->max_stream_data_bidi_remote;
+ limits->max_stream_data_uni = p->max_stream_data_uni;
+ limits->max_streams_bidi = p->max_streams_bidi;
+ limits->max_streams_uni = p->max_streams_uni;
+ limits->active_stream_id = -1;
+
+ if (p->remote ^ is_serv) {
+ bidi_type = QUIC_STREAM_TYPE_CLIENT_BIDI;
+ uni_type = QUIC_STREAM_TYPE_CLIENT_UNI;
+ } else {
+ bidi_type = QUIC_STREAM_TYPE_SERVER_BIDI;
+ uni_type = QUIC_STREAM_TYPE_SERVER_UNI;
+ }
+
+ limits->max_bidi_stream_id =
+ quic_stream_streams_to_id(p->max_streams_bidi, bidi_type);
+ limits->next_bidi_stream_id = bidi_type;
+
+ limits->max_uni_stream_id =
+ quic_stream_streams_to_id(p->max_streams_uni, uni_type);
+ limits->next_uni_stream_id = uni_type;
+}
diff --git a/net/quic/stream.h b/net/quic/stream.h
new file mode 100644
index 000000000000..435ae1246e05
--- /dev/null
+++ b/net/quic/stream.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_DEF_STREAMS 100
+#define QUIC_MAX_STREAMS 4096ULL
+
+/*
+ * rfc9000#section-2.1:
+ *
+ * The least significant bit (0x01) of the stream ID identifies the initiator
+ * of the stream. Client-initiated streams have even-numbered stream IDs
+ * (with the bit set to 0), and server-initiated streams have odd-numbered
+ * stream IDs (with the bit set to 1).
+ *
+ * The second least significant bit (0x02) of the stream ID distinguishes
+ * between bidirectional streams (with the bit set to 0) and unidirectional
+ * streams (with the bit set to 1).
+ */
+#define QUIC_STREAM_TYPE_BITS 2
+#define QUIC_STREAM_ID_STEP BIT(QUIC_STREAM_TYPE_BITS)
+
+#define QUIC_STREAM_TYPE_CLIENT_BIDI 0x00
+#define QUIC_STREAM_TYPE_SERVER_BIDI 0x01
+#define QUIC_STREAM_TYPE_CLIENT_UNI 0x02
+#define QUIC_STREAM_TYPE_SERVER_UNI 0x03
+
+struct quic_stream {
+ struct hlist_node node;
+ s64 id; /* Stream ID as defined in RFC 9000 Section 2.1 */
+ struct {
+ /* Sending-side stream level flow control */
+ u64 last_max_bytes; /* Max send offset advertised by peer */
+ u64 max_bytes; /* Max offset allowed to send */
+ u64 bytes; /* Bytes already sent to peer */
+
+ u32 errcode; /* App error code for RESET_STREAM */
+ u32 frags; /* STREAM frames sent but not yet acked */
+ u8 state; /* Send stream state, per rfc9000#section-3.1 */
+
+ u8 data_blocked; /* True if flow control blocks sending */
+ u8 done; /* True if FIN has been sent */
+ } send;
+ struct {
+ /* Receiving-side stream level flow control */
+ u64 max_bytes; /* Max offset peer can send */
+ u64 window; /* Remaining receive window */
+ u64 bytes; /* Bytes consumed by app */
+
+ u64 highest; /* Highest received offset */
+ u64 offset; /* Data buffered or consumed */
+ u64 finalsz; /* Final stream size if FIN received */
+
+ u32 frags; /* STREAM frames pending reassembly */
+ u8 state; /* Receive stream state, per rfc9000#section-3.2 */
+
+ u8 stop_sent; /* True if STOP_SENDING has been sent */
+ u8 done; /* True if all data has been received or read */
+ } recv;
+};
+
+struct quic_stream_limits {
+ /* Stream limit parameters defined in rfc9000#section-18.2:
+ *
+ * - initial_max_stream_data_bidi_remote
+ * - initial_max_stream_data_bidi_local
+ * - initial_max_stream_data_uni
+ * - initial_max_streams_bidi
+ * - initial_max_streams_uni
+ */
+ u64 max_stream_data_bidi_remote;
+ u64 max_stream_data_bidi_local;
+ u64 max_stream_data_uni;
+ u64 max_streams_bidi;
+ u64 max_streams_uni;
+
+ s64 next_bidi_stream_id; /* Next bidi stream ID to open or accept */
+ s64 next_uni_stream_id; /* Next uni stream ID to open or accept */
+ s64 max_bidi_stream_id; /* Highest allowed bidi stream ID */
+ s64 max_uni_stream_id; /* Highest allowed uni stream ID */
+ s64 active_stream_id; /* Most recently opened stream ID */
+
+ u8 bidi_blocked; /* STREAMS_BLOCKED_BIDI sent, awaiting ACK */
+ u8 uni_blocked; /* STREAMS_BLOCKED_UNI sent, awaiting ACK */
+ u8 bidi_pending; /* MAX_STREAMS_BIDI needs to be sent */
+ u8 uni_pending; /* MAX_STREAMS_UNI needs to be sent */
+
+ u16 streams_bidi; /* Number of open bidi streams */
+ u16 streams_uni; /* Number of open uni streams */
+};
+
+struct quic_stream_table {
+ struct hlist_head *head; /* Hash table storing all active streams */
+
+ struct quic_stream_limits send; /* Limits advertised by peer */
+ struct quic_stream_limits recv; /* Limits we advertise to peer */
+};
+
+static inline u64 quic_stream_id_to_streams(s64 stream_id)
+{
+ return (u64)(stream_id >> QUIC_STREAM_TYPE_BITS) + 1;
+}
+
+static inline s64 quic_stream_streams_to_id(u64 streams, u8 type)
+{
+ return (s64)((streams - 1) << QUIC_STREAM_TYPE_BITS) | type;
+}
+
+struct quic_stream *quic_stream_get(struct quic_stream_table *streams,
+ s64 stream_id, u32 flags, bool is_serv,
+ bool send);
+void quic_stream_put(struct quic_stream_table *streams,
+ struct quic_stream *stream, bool is_serv, bool send);
+
+bool quic_stream_max_streams_update(struct quic_stream_table *streams,
+ s64 *max_uni, s64 *max_bidi);
+bool quic_stream_id_exceeds(struct quic_stream_table *streams,
+ s64 stream_id, bool send);
+struct quic_stream *quic_stream_find(struct quic_stream_table *streams,
+ s64 stream_id);
+
+void quic_stream_get_param(struct quic_stream_table *streams,
+ struct quic_transport_param *p);
+void quic_stream_set_param(struct quic_stream_table *streams,
+ struct quic_transport_param *p, bool is_serv);
+void quic_stream_free(struct quic_stream_table *streams);
+int quic_stream_init(struct quic_stream_table *streams);
--
2.47.1
* Re: [PATCH net-next v11 06/15] quic: add stream management
2026-03-25 3:47 ` [PATCH net-next v11 06/15] quic: add stream management Xin Long
@ 2026-03-26 15:06 ` Xin Long
0 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-26 15:06 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
On Tue, Mar 24, 2026 at 11:49 PM Xin Long <lucien.xin@gmail.com> wrote:
>
> This patch introduces 'struct quic_stream_table' for managing QUIC streams,
> each represented by 'struct quic_stream'.
>
> It implements mechanisms for acquiring and releasing streams on both the
> send and receive paths, ensuring efficient lifecycle management during
> transmission and reception.
>
> - quic_stream_get(): Acquire a send-side stream by ID and flags during
> TX path, or a receive-side stream by ID during RX path.
>
> - quic_stream_put(): Release a send-side stream when sending is done,
> or a receive-side stream when receiving is done.
>
> It includes logic to detect when stream ID limits are reached and when
> control frames should be sent to update or request limits from the peer.
>
> - quic_stream_id_exceeds(): Check if a stream ID would exceed local (recv)
> or peer (send) limits.
>
> - quic_stream_max_streams_update(): Determines whether a
> MAX_STREAMS_UNI/BIDI frame should be sent to the peer.
>
> Note the stream hash table is per socket; operations on it are always
> protected by the sock lock.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> Acked-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v3:
> - Merge send/recv stream helpers into unified functions to reduce code:
> * quic_stream_id_send/recv() → quic_stream_id_valid()
> * quic_stream_id_send/recv_closed() → quic_stream_id_closed()
> * quic_stream_id_send/recv_exceeds() → quic_stream_id_exceeds()
> (pointed out by Paolo).
> - Clarify in changelog that stream hash table is always protected by sock
> lock (suggested by Paolo).
> - quic_stream_init/free(): adjust for new hashtable type; call
> quic_stream_delete() in quic_stream_free() to avoid open-coded logic.
> - Receiving streams: delete stream only when fully read or reset, instead
> of when no data was received. Prevents freeing a stream while a FIN
> with no data is still queued.
> v4:
> - Replace struct quic_shash_table with struct hlist_head for the
> stream hashtable. Since they are protected by the socket lock,
> no per-chain lock is needed.
> - Initialize stream to NULL in stream creation functions to avoid
> warnings from Smatch (reported by Simon).
> - Allocate send streams with GFP_KERNEL_ACCOUNT and receive streams
> with GFP_ATOMIC | __GFP_ACCOUNT for memory accounting (suggested
> by Paolo).
> v5:
> - Introduce struct quic_stream_limits to merge quic_stream_send_create()
> and quic_stream_recv_create(), and to simplify quic_stream_get_param()
> (suggested by Paolo).
> - Annotate the sock-lock requirement for quic_stream_send/recv_get()
> and quic_stream_send/recv_put() (noted by Paolo).
> - Add quic_stream_bidi_put() to deduplicate the common logic between
> quic_stream_send_put() and quic_stream_recv_put().
> - Remove the unnecessary check when incrementing
> streams->send.next_bidi/uni_stream_id in quic_stream_create().
> - Remove the unused 'is_serv' parameter from quic_stream_get_param().
> v7:
> - Free the allocated streams on error path in quic_stream_create() (noted
> by Paolo).
> - Merge quic_stream_send_get/put() and quic_stream_recv_get/put() helpers
> to quic_stream_get/put() (suggested by Paolo).
> - Add more comments in quic_stream_id_exceeds() and quic_stream_create().
> v8:
> - Replace bitfields with plain u8 in struct quic_stream_limits and struct
> quic_stream (suggested by Paolo).
> v9:
> - Fix grammar in the comment for quic_stream::send.window.
> v10:
> - Move quic_stream_init() to after sock_prot_inuse_add() to ensure counters
> are incremented before any early return paths in quic_init_sock(),
> preventing underflow in quic_destroy_sock() (noted by AI review).
> - Initialize the output parameters '*max_uni' and '*max_bidi' to 0 at the
> start of quic_stream_max_streams_update().
> - Use 'stream->recv.state > QUIC_STREAM_RECV_STATE_RECVD' instead of '!='
> for clearer intent.
> - Simplify some state checks in quic_stream_put() by using range
> comparisons (> or <) instead of multiple != conditions.
> - streams_uni/bidi are u16 type, and their overflow is already prevented
> by QUIC_MAX_STREAMS indirectly. Update comment in quic_stream_create().
> - Replace open-coded kzalloc(sizeof(*stream)) with kzalloc_obj(*stream)
> in quic_stream_create().
> v11:
> - Set maximum line length to 80 characters.
> - Change is_serv parameter type to bool in quic_stream_id_local().
> ---
> net/quic/Makefile | 2 +-
> net/quic/socket.c | 5 +
> net/quic/socket.h | 8 +
> net/quic/stream.c | 444 ++++++++++++++++++++++++++++++++++++++++++++++
> net/quic/stream.h | 133 ++++++++++++++
> 5 files changed, 591 insertions(+), 1 deletion(-)
> create mode 100644 net/quic/stream.c
> create mode 100644 net/quic/stream.h
>
> diff --git a/net/quic/Makefile b/net/quic/Makefile
> index 13bf4a4e5442..094e9da5d739 100644
> --- a/net/quic/Makefile
> +++ b/net/quic/Makefile
> @@ -5,4 +5,4 @@
>
> obj-$(CONFIG_IP_QUIC) += quic.o
>
> -quic-y := common.o family.o protocol.o socket.o
> +quic-y := common.o family.o protocol.o socket.o stream.o
> diff --git a/net/quic/socket.c b/net/quic/socket.c
> index 8dc2cb7628db..0006668551f4 100644
> --- a/net/quic/socket.c
> +++ b/net/quic/socket.c
> @@ -45,11 +45,16 @@ static int quic_init_sock(struct sock *sk)
> sk_sockets_allocated_inc(sk);
> sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
>
> + if (quic_stream_init(quic_streams(sk)))
> + return -ENOMEM;
> +
> return 0;
> }
>
> static void quic_destroy_sock(struct sock *sk)
> {
> + quic_stream_free(quic_streams(sk));
> +
> quic_data_free(quic_ticket(sk));
> quic_data_free(quic_token(sk));
> quic_data_free(quic_alpn(sk));
> diff --git a/net/quic/socket.h b/net/quic/socket.h
> index 61df0c5867be..e76737b9b74b 100644
> --- a/net/quic/socket.h
> +++ b/net/quic/socket.h
> @@ -13,6 +13,7 @@
>
> #include "common.h"
> #include "family.h"
> +#include "stream.h"
>
> #include "protocol.h"
>
> @@ -33,6 +34,8 @@ struct quic_sock {
> struct quic_data ticket;
> struct quic_data token;
> struct quic_data alpn;
> +
> + struct quic_stream_table streams;
> };
>
> struct quic6_sock {
> @@ -65,6 +68,11 @@ static inline struct quic_data *quic_alpn(const struct sock *sk)
> return &quic_sk(sk)->alpn;
> }
>
> +static inline struct quic_stream_table *quic_streams(const struct sock *sk)
> +{
> + return &quic_sk(sk)->streams;
> +}
> +
> static inline bool quic_is_serv(const struct sock *sk)
> {
> return !!sk->sk_max_ack_backlog;
> diff --git a/net/quic/stream.c b/net/quic/stream.c
> new file mode 100644
> index 000000000000..4d980f9b03ce
> --- /dev/null
> +++ b/net/quic/stream.c
> @@ -0,0 +1,444 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/* QUIC kernel implementation
> + * (C) Copyright Red Hat Corp. 2023
> + *
> + * This file is part of the QUIC kernel implementation
> + *
> + * Stream creation, lookup, and lifecycle management for QUIC.
> + *
> + * Written or modified by:
> + * Xin Long <lucien.xin@gmail.com>
> + */
> +
> +#include <linux/quic.h>
> +
> +#include "common.h"
> +#include "stream.h"
> +
> +/* Check if a stream ID is valid for sending or receiving. */
> +static bool quic_stream_id_valid(s64 stream_id, bool is_serv, bool send)
> +{
> + u8 type = (stream_id & QUIC_STREAM_TYPE_MASK);
> +
> + if (send) {
> + if (is_serv)
> + return type != QUIC_STREAM_TYPE_CLIENT_UNI;
> + return type != QUIC_STREAM_TYPE_SERVER_UNI;
> + }
> + if (is_serv)
> + return type != QUIC_STREAM_TYPE_SERVER_UNI;
> + return type != QUIC_STREAM_TYPE_CLIENT_UNI;
> +}
> +
> +/* Check if a stream ID was initiated locally. */
> +static bool quic_stream_id_local(s64 stream_id, bool is_serv)
> +{
> + return is_serv ^ !(stream_id & QUIC_STREAM_TYPE_SERVER_MASK);
> +}
> +
> +/* Check if a stream ID represents a unidirectional stream. */
> +static bool quic_stream_id_uni(s64 stream_id)
> +{
> + return stream_id & QUIC_STREAM_TYPE_UNI_MASK;
> +}
> +
> +#define QUIC_STREAM_HT_SIZE 64
> +
> +static struct hlist_head *quic_stream_head(struct quic_stream_table *streams,
> + s64 stream_id)
> +{
> + return &streams->head[stream_id & (QUIC_STREAM_HT_SIZE - 1)];
> +}
> +
> +struct quic_stream *quic_stream_find(struct quic_stream_table *streams,
> + s64 stream_id)
> +{
> + struct hlist_head *head = quic_stream_head(streams, stream_id);
> + struct quic_stream *stream;
> +
> + hlist_for_each_entry(stream, head, node) {
> + if (stream->id == stream_id)
> + break;
> + }
> + return stream;
> +}
> +
> +static void quic_stream_add(struct quic_stream_table *streams,
> + struct quic_stream *stream)
> +{
> + struct hlist_head *head;
> +
> + head = quic_stream_head(streams, stream->id);
> + hlist_add_head(&stream->node, head);
> +}
> +
> +static void quic_stream_delete(struct quic_stream *stream)
> +{
> + hlist_del_init(&stream->node);
> + kfree(stream);
> +}
> +
> +/* Create and register new streams for sending or receiving. */
> +static struct quic_stream *quic_stream_create(struct quic_stream_table *streams,
> + s64 max_stream_id, bool send,
> + bool is_serv)
> +{
> + struct quic_stream_limits *limits = &streams->send;
> + struct quic_stream *pos, *stream = NULL;
> + gfp_t gfp = GFP_KERNEL_ACCOUNT;
> + struct hlist_node *tmp;
> + HLIST_HEAD(head);
> + s64 stream_id;
> + u32 count = 0;
> +
> + if (!send) {
> + limits = &streams->recv;
> + gfp = GFP_ATOMIC | __GFP_ACCOUNT;
> + }
> + stream_id = limits->next_bidi_stream_id;
> + if (quic_stream_id_uni(max_stream_id))
> + stream_id = limits->next_uni_stream_id;
> +
> + /* rfc9000#section-2.1: A stream ID that is used out of order results in
> + * all streams of that type with lower-numbered stream IDs also being
> + * opened.
> + */
> + while (stream_id <= max_stream_id) {
> + stream = kzalloc_obj(*stream, gfp);
> + if (!stream)
> + goto free;
> +
> + stream->id = stream_id;
> + if (quic_stream_id_uni(stream_id)) {
> + if (send) {
> + stream->send.max_bytes =
> + limits->max_stream_data_uni;
> + } else {
> + stream->recv.max_bytes =
> + limits->max_stream_data_uni;
> + stream->recv.window = stream->recv.max_bytes;
> + }
> + hlist_add_head(&stream->node, &head);
> + stream_id += QUIC_STREAM_ID_STEP;
> + continue;
> + }
> +
> + if (quic_stream_id_local(stream_id, is_serv)) {
> + stream->send.max_bytes =
> + streams->send.max_stream_data_bidi_remote;
> + stream->recv.max_bytes =
> + streams->recv.max_stream_data_bidi_local;
> + } else {
> + stream->send.max_bytes =
> + streams->send.max_stream_data_bidi_local;
> + stream->recv.max_bytes =
> + streams->recv.max_stream_data_bidi_remote;
> + }
> + stream->recv.window = stream->recv.max_bytes;
> + hlist_add_head(&stream->node, &head);
> + stream_id += QUIC_STREAM_ID_STEP;
> + }
> +
> + hlist_for_each_entry_safe(pos, tmp, &head, node) {
> + hlist_del_init(&pos->node);
> + quic_stream_add(streams, pos);
> + count++;
> + }
> +
> + /* Streams must be opened sequentially. Update the next stream ID so the
> + * correct starting point is known if an out-of-order open is requested.
> + * Note overflow of next_uni/bidi_stream_id is impossible with s64.
> + */
> + if (quic_stream_id_uni(stream_id)) {
> + limits->next_uni_stream_id = stream_id;
> + limits->streams_uni += count;
> + return stream;
> + }
> +
> + limits->next_bidi_stream_id = stream_id;
> + limits->streams_bidi += count;
> + return stream;
> +
> +free:
> + hlist_for_each_entry_safe(pos, tmp, &head, node) {
> + hlist_del_init(&pos->node);
> + kfree(pos);
> + }
> + return NULL;
> +}
> +
> +/* Check if a send or receive stream ID is already closed. */
> +static bool quic_stream_id_closed(struct quic_stream_table *streams,
> + s64 stream_id, bool send)
> +{
> + struct quic_stream_limits *limits = send ? &streams->send :
> + &streams->recv;
> +
> + if (quic_stream_id_uni(stream_id))
> + return stream_id < limits->next_uni_stream_id;
> + return stream_id < limits->next_bidi_stream_id;
> +}
> +
> +/* Check if a stream ID would exceed local (recv) or peer (send) limits. */
> +bool quic_stream_id_exceeds(struct quic_stream_table *streams, s64 stream_id,
> + bool send)
> +{
> + u64 nstreams;
> +
> + if (!send) {
> + /* recv.max_uni_stream_id is updated in
> + * quic_stream_max_streams_update() already based on
> + * next_uni/bidi_stream_id, max_streams_uni/bidi, and
> + * streams_uni/bidi, so only recv.max_uni/bidi_stream_id needs to
> + * be checked.
> + */
> + if (quic_stream_id_uni(stream_id))
> + return stream_id > streams->recv.max_uni_stream_id;
> +
> + return stream_id > streams->recv.max_bidi_stream_id;
> + }
> +
> + if (quic_stream_id_uni(stream_id)) {
> + if (stream_id > streams->send.max_uni_stream_id)
> + return true;
> + stream_id -= streams->send.next_uni_stream_id;
> + nstreams = quic_stream_id_to_streams(stream_id);
> +
> + return nstreams + streams->send.streams_uni >
> + streams->send.max_streams_uni;
> + }
> +
> + if (stream_id > streams->send.max_bidi_stream_id)
> + return true;
> + stream_id -= streams->send.next_bidi_stream_id;
> + nstreams = quic_stream_id_to_streams(stream_id);
> +
> + return nstreams + streams->send.streams_bidi >
> + streams->send.max_streams_bidi;
> +}
> +
> +/* Get or create a send or recv stream by ID. Requires sock lock held. */
> +struct quic_stream *quic_stream_get(struct quic_stream_table *streams,
> + s64 stream_id, u32 flags, bool is_serv,
> + bool send)
> +{
> + struct quic_stream *stream;
> +
> + if (!quic_stream_id_valid(stream_id, is_serv, send))
> + return ERR_PTR(-EINVAL);
> +
> + stream = quic_stream_find(streams, stream_id);
> + if (stream) {
> + if (send && (flags & MSG_QUIC_STREAM_NEW) &&
> + stream->send.state != QUIC_STREAM_SEND_STATE_READY)
> + return ERR_PTR(-EINVAL);
> + return stream;
> + }
> +
> + if (!send && quic_stream_id_local(stream_id, is_serv)) {
> + if (quic_stream_id_closed(streams, stream_id, !send))
> + return ERR_PTR(-ENOSTR);
> + return ERR_PTR(-EINVAL);
> + }
> +
> + if (quic_stream_id_closed(streams, stream_id, send))
> + return ERR_PTR(-ENOSTR);
> +
> + if (send && !(flags & MSG_QUIC_STREAM_NEW))
> + return ERR_PTR(-EINVAL);
> +
> + if (quic_stream_id_exceeds(streams, stream_id, send))
> + return ERR_PTR(-EAGAIN);
> +
> + stream = quic_stream_create(streams, stream_id, send, is_serv);
> + if (!stream)
> + return ERR_PTR(-ENOSTR);
> +
> + if (send || quic_stream_id_valid(stream_id, is_serv, !send))
> + streams->send.active_stream_id = stream_id;
> +
> + return stream;
> +}
> +
> +/* Release or clean up a send or recv stream. This function updates stream
> + * counters and state when a send stream has either successfully sent all data
> + * or has been reset, or when a recv stream has either consumed all data or has
> + * been reset. Requires sock lock held.
> + */
> +void quic_stream_put(struct quic_stream_table *streams,
> + struct quic_stream *stream, bool is_serv, bool send)
> +{
> + if (quic_stream_id_uni(stream->id)) {
> + if (send) {
> + /* For uni streams, decrement uni count and delete
> + * immediately.
> + */
> + streams->send.streams_uni--;
> + quic_stream_delete(stream);
> + return;
> + }
> + /* For uni streams, decrement uni count and mark done. */
> + if (!stream->recv.done) {
> + stream->recv.done = 1;
> + streams->recv.streams_uni--;
> + streams->recv.uni_pending = 1;
> + }
> + /* Delete stream if fully read or reset. */
> + if (stream->recv.state > QUIC_STREAM_RECV_STATE_RECVD)
> + quic_stream_delete(stream);
> + return;
> + }
> +
> + if (send) {
> + /* For bidi streams, only proceed if receive side is in a final
> + * state.
> + */
> + if (stream->recv.state < QUIC_STREAM_RECV_STATE_RECVD)
> + return;
> + } else {
> + /* For bidi streams, only proceed if send side is in a final
> + * state.
> + */
> + if (stream->send.state != QUIC_STREAM_SEND_STATE_RECVD &&
> + stream->send.state != QUIC_STREAM_SEND_STATE_RESET_RECVD)
> + return;
> + }
> +
> + if (quic_stream_id_local(stream->id, is_serv)) {
> + /* Local-initiated stream: mark send done and decrement
> + * send.bidi count.
> + */
> + if (!stream->send.done) {
> + stream->send.done = 1;
> + streams->send.streams_bidi--;
> + }
> + } else {
> + /* Remote-initiated stream: mark recv done and decrement recv
> + * bidi count.
> + */
> + if (!stream->recv.done) {
> + stream->recv.done = 1;
> + streams->recv.streams_bidi--;
> + streams->recv.bidi_pending = 1;
> + }
> + }
> +
> + /* Delete stream if fully read or reset. */
> + if (stream->recv.state > QUIC_STREAM_RECV_STATE_RECVD)
> + quic_stream_delete(stream);
> +}
> +
> +/* Updates the maximum allowed incoming stream IDs if any streams were recently
> + * closed. Recalculates the max_uni and max_bidi stream ID limits based on the
> + * number of open streams and whether any were marked for deletion.
> + *
> + * Returns true if either max_uni or max_bidi was updated, indicating that a
> + * MAX_STREAMS_UNI or MAX_STREAMS_BIDI frame should be sent to the peer.
> + */
> +bool quic_stream_max_streams_update(struct quic_stream_table *streams,
> + s64 *max_uni, s64 *max_bidi)
> +{
> + s64 max, rem;
> +
> + *max_uni = 0;
> + *max_bidi = 0;
> + if (streams->recv.uni_pending) {
> + rem = streams->recv.max_streams_uni - streams->recv.streams_uni;
> + max = streams->recv.next_uni_stream_id - QUIC_STREAM_ID_STEP +
> + (rem << QUIC_STREAM_TYPE_BITS);
> +
> + streams->recv.max_uni_stream_id = max;
> + *max_uni = quic_stream_id_to_streams(max);
> + streams->recv.uni_pending = 0;
> + }
> + if (streams->recv.bidi_pending) {
> + rem = streams->recv.max_streams_bidi -
> + streams->recv.streams_bidi;
> + max = streams->recv.next_bidi_stream_id - QUIC_STREAM_ID_STEP +
> + (rem << QUIC_STREAM_TYPE_BITS);
> +
> + streams->recv.max_bidi_stream_id = max;
> + *max_bidi = quic_stream_id_to_streams(max);
> + streams->recv.bidi_pending = 0;
> + }
> +
> + return *max_uni || *max_bidi;
> +}
> +
> +int quic_stream_init(struct quic_stream_table *streams)
> +{
> + struct hlist_head *head;
> + int i;
> +
> + head = kmalloc_array(QUIC_STREAM_HT_SIZE, sizeof(*head), GFP_KERNEL);
> + if (!head)
> + return -ENOMEM;
> + for (i = 0; i < QUIC_STREAM_HT_SIZE; i++)
> + INIT_HLIST_HEAD(&head[i]);
> + streams->head = head;
> + return 0;
> +}
> +
> +void quic_stream_free(struct quic_stream_table *streams)
> +{
> + struct quic_stream *stream;
> + struct hlist_head *head;
> + struct hlist_node *tmp;
> + int i;
> +
> + if (!streams->head)
> + return;
> +
> + for (i = 0; i < QUIC_STREAM_HT_SIZE; i++) {
> + head = &streams->head[i];
> + hlist_for_each_entry_safe(stream, tmp, head, node)
> + quic_stream_delete(stream);
> + }
> + kfree(streams->head);
The AI report on
https://netdev-ai.bots.linux.dev/ai-review.html?id=1624d906-c0b6-4e12-a63f-5cbfc51b660e#patch-5
is a false positive: sk_alloc() calls sk_prot_alloc() with __GFP_ZERO, so
streams->head is always initialized to NULL before quic_stream_init() runs.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH net-next v11 07/15] quic: add connection id management
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (5 preceding siblings ...)
2026-03-25 3:47 ` [PATCH net-next v11 06/15] quic: add stream management Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 08/15] quic: add path management Xin Long
` (7 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch introduces 'struct quic_conn_id_set' for managing Connection
IDs (CIDs), which are represented by 'struct quic_source_conn_id'
and 'struct quic_dest_conn_id'.
It provides helpers to add and remove CIDs from the set, and handles
insertion of source CIDs into the global connection ID hash table
when necessary.
- quic_conn_id_add(): Add a new Connection ID to the set, and insert
it into the conn_id hash table if it is a source conn_id.
- quic_conn_id_remove(): Remove Connection IDs from the set whose
sequence numbers are less than or equal to a given number.
It also adds utilities to look up CIDs by value or sequence number,
search the global hash table for incoming packets, and check for
stateless reset tokens among destination CIDs. These functions are
essential for RX path connection lookup and stateless reset processing.
- quic_conn_id_find(): Find a Connection ID in the set by seq number.
- quic_conn_id_lookup(): Lookup a Connection ID from global hash table
using the ID value, typically used for socket lookup on the RX path.
- quic_conn_id_token_exists(): Check if a stateless reset token exists
in any dest Connection ID (used during stateless reset processing).
Note the source/dest conn_id sets are per socket; operations on them are
always protected by the sock lock.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v3:
- Clarify in changelog that conn_id set is always protected by sock lock
(suggested by Paolo).
- Adjust global source conn_id hashtable operations for the new hashtable
type.
v4:
- Replace struct hlist_node with hlist_nulls_node for the node in
struct quic_source_conn_id to support lockless lookup.
v7:
- Break the loop earlier if common->number > number in
quic_conn_id_remove/find() (suggested by Paolo).
- Add a comment in quic_conn_id_first_number().
v8:
- Add a comment to quic_conn_id_remove() clarifying that the ID number
must be smaller than the sequence number of the last ID in the set.
v11:
- Note for AI review: each id_set contains at most 8 connection IDs, so
using an RB-tree for faster lookup is unnecessary.
- Set maximum line length to 80 characters.
- Add a check for number in quic_conn_id_remove().
---
net/quic/Makefile | 2 +-
net/quic/connid.c | 249 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/connid.h | 182 +++++++++++++++++++++++++++++++++
net/quic/socket.c | 6 ++
net/quic/socket.h | 13 +++
5 files changed, 451 insertions(+), 1 deletion(-)
create mode 100644 net/quic/connid.c
create mode 100644 net/quic/connid.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 094e9da5d739..eee7501588d3 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o stream.o
+quic-y := common.o family.o protocol.o socket.o stream.o connid.o
diff --git a/net/quic/connid.c b/net/quic/connid.c
new file mode 100644
index 000000000000..25913da89eeb
--- /dev/null
+++ b/net/quic/connid.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Connection ID management for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/quic.h>
+#include <net/sock.h>
+
+#include "common.h"
+#include "connid.h"
+
+/* Lookup a source connection ID (scid) in the global source connection ID hash
+ * table.
+ */
+struct quic_conn_id *quic_conn_id_lookup(struct net *net, u8 *scid, u32 len)
+{
+ struct quic_shash_head *head = quic_source_conn_id_head(net, scid, len);
+ struct quic_source_conn_id *s_conn_id;
+ struct quic_conn_id *conn_id = NULL;
+ struct hlist_nulls_node *node;
+
+ hlist_nulls_for_each_entry_rcu(s_conn_id, node, &head->head, node) {
+ if (net != sock_net(s_conn_id->sk))
+ continue;
+ if (s_conn_id->common.id.len != len ||
+ memcmp(scid, &s_conn_id->common.id.data, len))
+ continue;
+ if (likely(refcount_inc_not_zero(&s_conn_id->sk->sk_refcnt)))
+ conn_id = &s_conn_id->common.id;
+ break;
+ }
+ return conn_id;
+}
+
+/* Check if a given stateless reset token exists in any connection ID in the
+ * connection ID set.
+ */
+bool quic_conn_id_token_exists(struct quic_conn_id_set *id_set, u8 *token)
+{
+ struct quic_common_conn_id *common;
+ struct quic_dest_conn_id *dcid;
+
+ dcid = (struct quic_dest_conn_id *)id_set->active;
+ if (!memcmp(dcid->token, token, QUIC_CONN_ID_TOKEN_LEN))
+ return true; /* Fast path. */
+
+ list_for_each_entry(common, &id_set->head, list) {
+ dcid = (struct quic_dest_conn_id *)common;
+ if (common == id_set->active)
+ continue;
+ if (!memcmp(dcid->token, token, QUIC_CONN_ID_TOKEN_LEN))
+ return true;
+ }
+ return false;
+}
+
+static void quic_source_conn_id_free_rcu(struct rcu_head *head)
+{
+ struct quic_source_conn_id *s_conn_id;
+
+ s_conn_id = container_of(head, struct quic_source_conn_id, rcu);
+ kfree(s_conn_id);
+}
+
+static void quic_source_conn_id_free(struct quic_source_conn_id *s_conn_id)
+{
+ u8 *data = s_conn_id->common.id.data;
+ u32 len = s_conn_id->common.id.len;
+ struct quic_shash_head *head;
+
+ if (!hlist_nulls_unhashed(&s_conn_id->node)) {
+ head = quic_source_conn_id_head(sock_net(s_conn_id->sk), data,
+ len);
+ spin_lock_bh(&head->lock);
+ hlist_nulls_del_init_rcu(&s_conn_id->node);
+ spin_unlock_bh(&head->lock);
+ }
+
+ /* Freeing is deferred via RCU to avoid use-after-free during
+ * concurrent lookups.
+ */
+ call_rcu(&s_conn_id->rcu, quic_source_conn_id_free_rcu);
+}
+
+static void quic_conn_id_del(struct quic_common_conn_id *common)
+{
+ list_del(&common->list);
+ if (!common->hashed) {
+ kfree(common);
+ return;
+ }
+ quic_source_conn_id_free((struct quic_source_conn_id *)common);
+}
+
+/* Add a connection ID with sequence number and associated private data to the
+ * connection ID set.
+ */
+int quic_conn_id_add(struct quic_conn_id_set *id_set,
+ struct quic_conn_id *conn_id, u32 number, void *data)
+{
+ struct quic_source_conn_id *s_conn_id;
+ struct quic_dest_conn_id *d_conn_id;
+ struct quic_common_conn_id *common;
+ struct quic_shash_head *head;
+ struct list_head *list;
+
+ /* Locate insertion point to keep list ordered by number. */
+ list = &id_set->head;
+ list_for_each_entry(common, list, list) {
+ if (number == common->number)
+ return 0; /* Ignore if it already exists on the list. */
+ if (number < common->number) {
+ list = &common->list;
+ break;
+ }
+ }
+
+ if (conn_id->len > QUIC_CONN_ID_MAX_LEN)
+ return -EINVAL;
+ common = kzalloc(id_set->entry_size, GFP_ATOMIC);
+ if (!common)
+ return -ENOMEM;
+ common->id = *conn_id;
+ common->number = number;
+ if (id_set->entry_size == sizeof(struct quic_dest_conn_id)) {
+ /* For destination connection IDs, copy the stateless reset
+ * token if available.
+ */
+ if (data) {
+ d_conn_id = (struct quic_dest_conn_id *)common;
+ memcpy(d_conn_id->token, data, QUIC_CONN_ID_TOKEN_LEN);
+ }
+ } else {
+ /* For source connection IDs, mark as hashed and insert into
+ * the global source connection ID hashtable.
+ */
+ common->hashed = 1;
+ s_conn_id = (struct quic_source_conn_id *)common;
+ s_conn_id->sk = data;
+
+ head = quic_source_conn_id_head(sock_net(s_conn_id->sk),
+ common->id.data,
+ common->id.len);
+ spin_lock_bh(&head->lock);
+ hlist_nulls_add_head_rcu(&s_conn_id->node, &head->head);
+ spin_unlock_bh(&head->lock);
+ }
+ list_add_tail(&common->list, list);
+
+ if (number == quic_conn_id_last_number(id_set) + 1) {
+ if (!id_set->active)
+ id_set->active = common;
+ id_set->count++;
+
+ /* Increment count for consecutive following IDs. */
+ list_for_each_entry_continue(common, &id_set->head, list) {
+ if (common->number != ++number)
+ break;
+ id_set->count++;
+ }
+ }
+ return 0;
+}
+
+/* Remove consecutive connection IDs from the set with sequence numbers less
+ * than or equal to a number.
+ */
+void quic_conn_id_remove(struct quic_conn_id_set *id_set, u32 number)
+{
+ struct quic_common_conn_id *common, *tmp;
+ struct list_head *list;
+
+ /* The number must be less than the sequence number of the last
+ * consecutive connection ID in the set.
+ */
+ if (WARN_ON_ONCE(number >= quic_conn_id_last_number(id_set)))
+ return;
+ list = &id_set->head;
+ list_for_each_entry_safe(common, tmp, list, list) {
+ if (common->number > number)
+ break;
+ if (id_set->active == common)
+ id_set->active = tmp;
+ quic_conn_id_del(common);
+ id_set->count--;
+ }
+}
+
+struct quic_conn_id *quic_conn_id_find(struct quic_conn_id_set *id_set,
+ u32 number)
+{
+ struct quic_common_conn_id *common;
+
+ list_for_each_entry(common, &id_set->head, list) {
+ if (common->number > number)
+ break;
+ if (common->number == number)
+ return &common->id;
+ }
+ return NULL;
+}
+
+void quic_conn_id_update_active(struct quic_conn_id_set *id_set, u32 number)
+{
+ struct quic_conn_id *conn_id;
+
+ if (number == id_set->active->number)
+ return;
+ conn_id = quic_conn_id_find(id_set, number);
+ if (!conn_id)
+ return;
+ quic_conn_id_set_active(id_set, conn_id);
+}
+
+void quic_conn_id_set_init(struct quic_conn_id_set *id_set, bool source)
+{
+ id_set->entry_size = source ? sizeof(struct quic_source_conn_id) :
+ sizeof(struct quic_dest_conn_id);
+ INIT_LIST_HEAD(&id_set->head);
+}
+
+void quic_conn_id_set_free(struct quic_conn_id_set *id_set)
+{
+ struct quic_common_conn_id *common, *tmp;
+
+ list_for_each_entry_safe(common, tmp, &id_set->head, list)
+ quic_conn_id_del(common);
+ id_set->count = 0;
+ id_set->active = NULL;
+}
+
+void quic_conn_id_get_param(struct quic_conn_id_set *id_set,
+ struct quic_transport_param *p)
+{
+ p->active_connection_id_limit = id_set->max_count;
+}
+
+void quic_conn_id_set_param(struct quic_conn_id_set *id_set,
+ struct quic_transport_param *p)
+{
+ id_set->max_count = p->active_connection_id_limit;
+}
diff --git a/net/quic/connid.h b/net/quic/connid.h
new file mode 100644
index 000000000000..af5157959b2c
--- /dev/null
+++ b/net/quic/connid.h
@@ -0,0 +1,182 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_CONN_ID_LIMIT 8
+#define QUIC_CONN_ID_DEF 7
+#define QUIC_CONN_ID_LEAST 2
+
+#define QUIC_CONN_ID_TOKEN_LEN 16
+
+/* Common fields shared by both source and destination Connection IDs */
+struct quic_common_conn_id {
+ struct quic_conn_id id; /* Connection ID value and its length */
+ struct list_head list; /* List node for connection ID management */
+ u32 number; /* Sequence number assigned to this Connection ID */
+ u8 hashed; /* Non-zero if stored in source_conn_id hash table */
+};
+
+struct quic_source_conn_id {
+ struct quic_common_conn_id common;
+ struct hlist_nulls_node node; /* Hash table node for fast lookup */
+ struct rcu_head rcu; /* RCU header for deferred destruction */
+ struct sock *sk; /* Socket associated with this Connection ID */
+};
+
+struct quic_dest_conn_id {
+ struct quic_common_conn_id common;
+ /* Stateless reset token in rfc9000#section-10.3 */
+ u8 token[QUIC_CONN_ID_TOKEN_LEN];
+};
+
+struct quic_conn_id_set {
+ /* Connection ID in use on the current path */
+ struct quic_common_conn_id *active;
+ /* Connection ID to use for a new path (e.g., after migration) */
+ struct quic_common_conn_id *alt;
+ struct list_head head; /* List head of available connection IDs */
+ u8 entry_size; /* Size of each connection ID entry in the list */
+ u8 max_count; /* active_connection_id_limit in rfc9000#section-18.2 */
+ u8 count; /* Current number of connection IDs in the list */
+};
+
+static inline u32 quic_conn_id_first_number(struct quic_conn_id_set *id_set)
+{
+ struct quic_common_conn_id *common;
+
+ /* The id_set is guaranteed to be non-empty when called (sk is not in
+ * CLOSE state).
+ */
+ common = list_first_entry(&id_set->head, struct quic_common_conn_id,
+ list);
+ return common->number;
+}
+
+static inline u32 quic_conn_id_last_number(struct quic_conn_id_set *id_set)
+{
+ return quic_conn_id_first_number(id_set) + id_set->count - 1;
+}
+
+static inline void quic_conn_id_generate(struct quic_conn_id *conn_id)
+{
+ get_random_bytes(conn_id->data, QUIC_CONN_ID_DEF_LEN);
+ conn_id->len = QUIC_CONN_ID_DEF_LEN;
+}
+
+/* Select an alternate destination Connection ID for a new path (e.g., after
+ * migration).
+ */
+static inline bool quic_conn_id_select_alt(struct quic_conn_id_set *id_set,
+ bool active)
+{
+ if (id_set->alt)
+ return true;
+ /* NAT rebinding: peer keeps using the current source conn_id.
+ * In this case, continue using the same dest conn_id for the new path.
+ */
+ if (active) {
+ id_set->alt = id_set->active;
+ return true;
+ }
+ /* Treat the prev conn_ids as used.
+ * Try selecting the next conn_id in the list, unless at the end.
+ */
+ if (id_set->active->number != quic_conn_id_last_number(id_set)) {
+ id_set->alt = list_next_entry(id_set->active, list);
+ return true;
+ }
+ /* If there's only one conn_id in the list, reuse the active one. */
+ if (id_set->active->number == quic_conn_id_first_number(id_set)) {
+ id_set->alt = id_set->active;
+ return true;
+ }
+ /* No alternate conn_id could be selected. Caller should send a
+ * QUIC_FRAME_RETIRE_CONNECTION_ID frame to request new connection IDs
+ * from the peer.
+ */
+ return false;
+}
+
+static inline void quic_conn_id_set_alt(struct quic_conn_id_set *id_set,
+ struct quic_conn_id *alt)
+{
+ id_set->alt = (struct quic_common_conn_id *)alt;
+}
+
+/* Swap the active and alternate destination Connection IDs after path
+ * migration completes, since the path has already been switched accordingly.
+ */
+static inline void quic_conn_id_swap_active(struct quic_conn_id_set *id_set)
+{
+ void *active = id_set->active;
+
+ id_set->active = id_set->alt;
+ id_set->alt = active;
+}
+
+/* Choose which destination Connection ID to use for a new path migration if
+ * alt is true.
+ */
+static inline struct quic_conn_id *
+quic_conn_id_choose(struct quic_conn_id_set *id_set, u8 alt)
+{
+ return (alt && id_set->alt) ? &id_set->alt->id : &id_set->active->id;
+}
+
+static inline struct quic_conn_id *
+quic_conn_id_active(struct quic_conn_id_set *id_set)
+{
+ return &id_set->active->id;
+}
+
+static inline void quic_conn_id_set_active(struct quic_conn_id_set *id_set,
+ struct quic_conn_id *active)
+{
+ id_set->active = (struct quic_common_conn_id *)active;
+}
+
+static inline u32 quic_conn_id_number(struct quic_conn_id *conn_id)
+{
+ return ((struct quic_common_conn_id *)conn_id)->number;
+}
+
+static inline struct sock *quic_conn_id_sk(struct quic_conn_id *conn_id)
+{
+ return ((struct quic_source_conn_id *)conn_id)->sk;
+}
+
+static inline void quic_conn_id_set_token(struct quic_conn_id *conn_id,
+ u8 *token)
+{
+ memcpy(((struct quic_dest_conn_id *)conn_id)->token, token,
+ QUIC_CONN_ID_TOKEN_LEN);
+}
+
+static inline int quic_conn_id_cmp(struct quic_conn_id *a,
+ struct quic_conn_id *b)
+{
+ return a->len != b->len || memcmp(a->data, b->data, a->len);
+}
+
+int quic_conn_id_add(struct quic_conn_id_set *id_set,
+ struct quic_conn_id *conn_id, u32 number, void *data);
+bool quic_conn_id_token_exists(struct quic_conn_id_set *id_set, u8 *token);
+void quic_conn_id_remove(struct quic_conn_id_set *id_set, u32 number);
+
+struct quic_conn_id *quic_conn_id_find(struct quic_conn_id_set *id_set,
+ u32 number);
+struct quic_conn_id *quic_conn_id_lookup(struct net *net, u8 *scid, u32 len);
+void quic_conn_id_update_active(struct quic_conn_id_set *id_set, u32 number);
+
+void quic_conn_id_get_param(struct quic_conn_id_set *id_set,
+ struct quic_transport_param *p);
+void quic_conn_id_set_param(struct quic_conn_id_set *id_set,
+ struct quic_transport_param *p);
+void quic_conn_id_set_init(struct quic_conn_id_set *id_set, bool source);
+void quic_conn_id_set_free(struct quic_conn_id_set *id_set);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 0006668551f4..aa451ea8f516 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -45,6 +45,9 @@ static int quic_init_sock(struct sock *sk)
sk_sockets_allocated_inc(sk);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
+ quic_conn_id_set_init(quic_source(sk), 1);
+ quic_conn_id_set_init(quic_dest(sk), 0);
+
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
@@ -53,6 +56,9 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ quic_conn_id_set_free(quic_source(sk));
+ quic_conn_id_set_free(quic_dest(sk));
+
quic_stream_free(quic_streams(sk));
quic_data_free(quic_ticket(sk));
diff --git a/net/quic/socket.h b/net/quic/socket.h
index e76737b9b74b..68a58f0016cc 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -14,6 +14,7 @@
#include "common.h"
#include "family.h"
#include "stream.h"
+#include "connid.h"
#include "protocol.h"
@@ -36,6 +37,8 @@ struct quic_sock {
struct quic_data alpn;
struct quic_stream_table streams;
+ struct quic_conn_id_set source;
+ struct quic_conn_id_set dest;
};
struct quic6_sock {
@@ -73,6 +76,16 @@ static inline struct quic_stream_table *quic_streams(const struct sock *sk)
return &quic_sk(sk)->streams;
}
+static inline struct quic_conn_id_set *quic_source(const struct sock *sk)
+{
+ return &quic_sk(sk)->source;
+}
+
+static inline struct quic_conn_id_set *quic_dest(const struct sock *sk)
+{
+ return &quic_sk(sk)->dest;
+}
+
static inline bool quic_is_serv(const struct sock *sk)
{
return !!sk->sk_max_ack_backlog;
--
2.47.1
^ permalink raw reply related	[flat|nested] 17+ messages in thread
* [PATCH net-next v11 08/15] quic: add path management
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (6 preceding siblings ...)
2026-03-25 3:47 ` [PATCH net-next v11 07/15] quic: add connection id management Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 09/15] quic: add congestion control Xin Long
` (6 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch introduces 'quic_path_group' for managing paths, represented
by 'struct quic_path'. A connection may use two paths simultaneously
for connection migration.
Each path is associated with a UDP tunnel socket (sk), and a single
UDP tunnel socket can be shared by paths belonging to different QUIC sockets.
These UDP tunnel sockets are wrapped in 'quic_udp_sock' structures and
stored in a hash table.
It includes mechanisms to bind and unbind paths, detect alternative paths
for migration, and swap paths to support seamless transition between
networks.
- quic_path_bind(): Bind a path to a port and associate it with a UDP sk.
- quic_path_unbind(): Unbind a path from a port and disassociate it from a
UDP sk.
- quic_path_swap(): Swap two paths to facilitate connection migration.
- quic_path_detect_alt(): Determine if a packet is using an alternative
path, used for connection migration.
It also integrates basic support for Packetization Layer Path MTU
Discovery (PLPMTUD), using PING frames and ICMP feedback to adjust path
MTU and handle probe confirmation or resets during routing changes.
- quic_path_pl_recv(): state transition and pmtu update after the probe
packet is acked.
- quic_path_pl_toobig(): state transition and pmtu update after
receiving a toobig or needfrag icmp packet.
- quic_path_pl_send(): state transition and pmtu update after sending a
probe packet.
- quic_path_pl_reset(): restart the probing when path routing changes.
- quic_path_pl_confirm(): check if probe packet gets acked.
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v3:
- Fix annotation in quic_udp_sock_lookup() (noted by Paolo).
- Use inet_sk_get_local_port_range() instead of
inet_get_local_port_range() (suggested by Paolo).
- Adjust global UDP tunnel socket hashtable operations for the new
hashtable type.
- Delete quic_workqueue; use system_wq for UDP tunnel socket destroy.
v4:
- Cache UDP tunnel socket pointer and its source address in struct
quic_path for RCU-protected lookup/access.
- Return -EAGAIN instead of -EINVAL in quic_path_bind() when UDP
socket is being released in workqueue.
- Move udp_tunnel_sock_release() out of the mutex_lock to avoid a
warning of lockdep in quic_udp_sock_put_work().
- Introduce quic_wq for UDP socket release work, so all pending works
can be flushed before destroying the hashtable in quic_exit().
v5:
- Rename quic_path_free() to quic_path_unbind() (suggested by Paolo).
- Remove the 'serv' member from struct quic_path_group, since
quic_is_serv() defined in a previous patch now uses
sk->sk_max_ack_backlog for server-side detection.
- Use quic_ktime_get_us() to set skb_cb->time, as RTT is measured
in microseconds and jiffies_to_usecs() is not accurate enough.
v6:
- Do not reset transport_header for QUIC in quic_udp_rcv(), allowing
removal of udph_offset and enabling access to the UDP header via
udp_hdr(); Pull skb->data in quic_udp_rcv() to allow access to the
QUIC header via skb->data.
v7:
- Pass udp sk to quic_path_rcv() and move the call to skb_linearize()
and skb_set_owner_sk_safe() to .quic_path_rcv().
- Delete the call to skb_linearize() and skb_set_owner_sk_safe() from
quic_udp_err(), as it should not change skb in .encap_err_lookup()
(noted by AI review).
v8:
- Remove indirect quic_path_rcv and late call quic_packet_rcv()
directly via extern (noted by Paolo).
- Add a comment in quic_udp_rcv() clarifying it must return 0.
- Add a comment in quic_udp_sock_put() clarifying the UDP socket
may be freed in atomic RX context during connection migration.
- Reorder some quic_path_group members to reduce struct size.
v10:
- Replace open-coded kzalloc(sizeof(*us)) with kzalloc_obj(*us) in
quic_stream_create().
- Use get_random_u32_below() for ephemeral port selection instead of
manual scaling of get_random_u32() in quic_path_bind().
- Reset additional PLPMTUD probe state (probe_high, probe_count) in
quic_path_pl_reset() to ensure a clean probe restart.
- Add plpmtud_interval to struct quic_path_group to store the PLPMTUD
probe timer interval, previously kept in struct quic_sock.config.
v11:
- Set maximum line length to 80 characters.
- Add additional comments in quic_path_bind() and quic_path_pl_send()
for clarity.
- Return ERR_PTR() instead of NULL on error in quic_udp_sock_create().
- Change return type of quic_path_detect_alt() to bool.
- Allocate quic_wq using alloc_workqueue(WQ_MEM_RECLAIM | WQ_UNBOUND)
for UDP socket destruction and backlog packet processing (noted by
AI review).
---
net/quic/Makefile | 2 +-
net/quic/path.c | 560 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/path.h | 184 +++++++++++++++
net/quic/protocol.c | 15 ++
net/quic/socket.c | 3 +
net/quic/socket.h | 7 +
6 files changed, 770 insertions(+), 1 deletion(-)
create mode 100644 net/quic/path.c
create mode 100644 net/quic/path.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index eee7501588d3..1565fb5cef9d 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o stream.o connid.o
+quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o
diff --git a/net/quic/path.c b/net/quic/path.c
new file mode 100644
index 000000000000..7f72fdd9c45f
--- /dev/null
+++ b/net/quic/path.c
@@ -0,0 +1,560 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Path and UDP tunnel socket management for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <net/udp_tunnel.h>
+#include <linux/quic.h>
+
+#include "common.h"
+#include "family.h"
+#include "path.h"
+
+static int quic_udp_rcv(struct sock *sk, struct sk_buff *skb)
+{
+ memset(skb->cb, 0, sizeof(skb->cb));
+ QUIC_SKB_CB(skb)->seqno = -1;
+ QUIC_SKB_CB(skb)->time = quic_ktime_get_us();
+
+ skb_pull(skb, sizeof(struct udphdr));
+ skb_dst_force(skb);
+ kfree_skb(skb);
+ /* .encap_rcv must return 0 if skb was either consumed or dropped. */
+ return 0;
+}
+
+static int quic_udp_err(struct sock *sk, struct sk_buff *skb)
+{
+ return 0;
+}
+
+static void quic_udp_sock_put_work(struct work_struct *work)
+{
+ struct quic_udp_sock *us = container_of(work, struct quic_udp_sock,
+ work);
+ struct quic_uhash_head *head;
+ struct sock *sk = us->sk;
+
+ /* Hold the sock to safely access it in quic_udp_sock_lookup() even
+ * after udp_tunnel_sock_release(). The release must occur before
+ * __hlist_del() so a new UDP tunnel socket can be created for the same
+ * address and port if quic_udp_sock_lookup() fails to find one.
+ *
+ * Note: udp_tunnel_sock_release() cannot be called under the mutex due
+ * to some lockdep warnings.
+ */
+ sock_hold(sk);
+ udp_tunnel_sock_release(sk->sk_socket);
+
+ head = quic_udp_sock_head(sock_net(sk), ntohs(us->addr.v4.sin_port));
+ mutex_lock(&head->lock);
+ __hlist_del(&us->node);
+ mutex_unlock(&head->lock);
+
+ sock_put(sk);
+ kfree(us);
+}
+
+static struct quic_udp_sock *quic_udp_sock_create(struct sock *sk,
+ union quic_addr *a)
+{
+ struct udp_tunnel_sock_cfg tuncfg = {};
+ struct udp_port_cfg udp_conf = {};
+ struct net *net = sock_net(sk);
+ struct quic_uhash_head *head;
+ struct quic_udp_sock *us;
+ struct socket *sock;
+ int err;
+
+ us = kzalloc_obj(*us, GFP_KERNEL);
+ if (!us)
+ return ERR_PTR(-ENOMEM);
+
+ quic_udp_conf_init(sk, &udp_conf, a);
+ err = udp_sock_create(net, &udp_conf, &sock);
+ if (err) {
+ pr_debug("%s: failed to create udp sock\n", __func__);
+ kfree(us);
+ return ERR_PTR(err);
+ }
+
+ tuncfg.encap_type = 1;
+ tuncfg.encap_rcv = quic_udp_rcv;
+ tuncfg.encap_err_lookup = quic_udp_err;
+ setup_udp_tunnel_sock(net, sock, &tuncfg);
+
+ refcount_set(&us->refcnt, 1);
+ us->sk = sock->sk;
+ memcpy(&us->addr, a, sizeof(*a));
+ us->bind_ifindex = sk->sk_bound_dev_if;
+
+ head = quic_udp_sock_head(net, ntohs(a->v4.sin_port));
+ hlist_add_head(&us->node, &head->head);
+ INIT_WORK(&us->work, quic_udp_sock_put_work);
+
+ return us;
+}
+
+static bool quic_udp_sock_get(struct quic_udp_sock *us)
+{
+ return refcount_inc_not_zero(&us->refcnt);
+}
+
+static void quic_udp_sock_put(struct quic_udp_sock *us)
+{
+ /* The UDP socket may be freed in atomic RX context during connection
+ * migration; defer the release to a workqueue.
+ */
+ if (refcount_dec_and_test(&us->refcnt))
+ queue_work(quic_wq, &us->work);
+}
+
+/* Lookup a quic_udp_sock in the global hash table by port or address. If 'a'
+ * is provided, it searches for a socket whose local address matches 'a' and,
+ * if applicable, matches the device binding. If 'a' is NULL, it searches only
+ * by port.
+ */
+static struct quic_udp_sock *quic_udp_sock_lookup(struct sock *sk,
+ union quic_addr *a, u16 port)
+{
+ struct net *net = sock_net(sk);
+ struct quic_uhash_head *head;
+ struct quic_udp_sock *us;
+
+ head = quic_udp_sock_head(net, port);
+ hlist_for_each_entry(us, &head->head, node) {
+ if (net != sock_net(us->sk))
+ continue;
+ if (a) {
+ if (quic_cmp_sk_addr(us->sk, &us->addr, a) &&
+ (!us->bind_ifindex || !sk->sk_bound_dev_if ||
+ us->bind_ifindex == sk->sk_bound_dev_if))
+ return us;
+ continue;
+ }
+ if (ntohs(us->addr.v4.sin_port) == port)
+ return us;
+ }
+ return NULL;
+}
+
+static void quic_path_set_udp_sk(struct quic_path *path,
+ struct quic_udp_sock *us)
+{
+ if (path->udp_sk)
+ quic_udp_sock_put(path->udp_sk);
+
+ path->udp_sk = us;
+ if (!us) {
+ path->usk = NULL;
+ memset(&path->uaddr, 0, sizeof(path->uaddr));
+ return;
+ }
+ path->usk = us->sk;
+ memcpy(&path->uaddr, &us->addr, sizeof(us->addr));
+}
+
+/* Binds a QUIC path to a local port and sets up a UDP socket. */
+int quic_path_bind(struct sock *sk, struct quic_path_group *paths, u8 path)
+{
+ union quic_addr *a = quic_path_saddr(paths, path);
+ int rover, low, high, remaining;
+ struct net *net = sock_net(sk);
+ struct quic_uhash_head *head;
+ struct quic_udp_sock *us;
+ u16 port;
+
+ port = ntohs(a->v4.sin_port);
+ if (port) {
+ head = quic_udp_sock_head(net, port);
+ mutex_lock(&head->lock);
+ us = quic_udp_sock_lookup(sk, a, port);
+ if (us) {
+ /* Allow reuse of an existing UDP tunnel socket.
+ * However, if it is in the middle of asynchronous
+ * teardown (via workqueue), it is temporarily unusable.
+ * Return -EAGAIN (not -EADDRINUSE) to signal the caller
+ * to retry soon.
+ */
+ if (!quic_udp_sock_get(us)) {
+ mutex_unlock(&head->lock);
+ return -EAGAIN;
+ }
+ } else {
+ us = quic_udp_sock_create(sk, a);
+ if (IS_ERR(us)) {
+ mutex_unlock(&head->lock);
+ return PTR_ERR(us);
+ }
+ }
+ mutex_unlock(&head->lock);
+ quic_path_set_udp_sk(&paths->path[path], us);
+ return 0;
+ }
+
+ inet_sk_get_local_port_range(sk, &low, &high);
+ remaining = (high - low) + 1;
+ rover = get_random_u32_below(remaining) + low;
+ do {
+ rover++;
+ if (rover < low || rover > high)
+ rover = low;
+ port = (u16)rover;
+ if (inet_is_local_reserved_port(net, port))
+ continue;
+
+ head = quic_udp_sock_head(net, port);
+ mutex_lock(&head->lock);
+ if (quic_udp_sock_lookup(sk, NULL, port)) {
+ mutex_unlock(&head->lock);
+ cond_resched();
+ continue;
+ }
+ a->v4.sin_port = htons(port);
+ us = quic_udp_sock_create(sk, a);
+ if (IS_ERR(us)) {
+ a->v4.sin_port = 0;
+ mutex_unlock(&head->lock);
+ return PTR_ERR(us);
+ }
+ mutex_unlock(&head->lock);
+
+ quic_path_set_udp_sk(&paths->path[path], us);
+ __sk_dst_reset(sk);
+ return 0;
+ } while (--remaining > 0);
+
+ return -EADDRINUSE;
+}
+
+/* Swaps the active and alternate QUIC paths.
+ *
+ * Promotes the alternate path (path[1]) to become the new active path
+ * (path[0]). If the alternate path has a valid UDP socket, the entire path is
+ * swapped. Otherwise, only the destination address is exchanged, assuming the
+ * source address is the same and no rebind is needed.
+ *
+ * This is typically used during path migration or alternate path promotion.
+ */
+void quic_path_swap(struct quic_path_group *paths)
+{
+ struct quic_path path = paths->path[0];
+
+ paths->alt_probes = 0;
+ paths->alt_state = QUIC_PATH_ALT_SWAPPED;
+
+ if (paths->path[1].udp_sk) {
+ paths->path[0] = paths->path[1];
+ paths->path[1] = path;
+ return;
+ }
+
+ paths->path[0].daddr = paths->path[1].daddr;
+ paths->path[1].daddr = path.daddr;
+}
+
+/* Frees resources associated with a QUIC path.
+ *
+ * This is used for cleanup during error handling or when the path is no longer
+ * needed.
+ */
+void quic_path_unbind(struct sock *sk, struct quic_path_group *paths, u8 path)
+{
+ paths->alt_probes = 0;
+ paths->alt_state = QUIC_PATH_ALT_NONE;
+
+ quic_path_set_udp_sk(&paths->path[path], NULL);
+
+ memset(quic_path_daddr(paths, path), 0, sizeof(union quic_addr));
+ memset(quic_path_saddr(paths, path), 0, sizeof(union quic_addr));
+}
+
+/* Detects and records a potential alternate path.
+ *
+ * If the new source or destination address differs from the active path, and
+ * alternate path detection is not disabled, the function updates the alternate
+ * path slot (path[1]) with the new addresses.
+ *
+ * This is typically called on packet receive to detect new possible network
+ * paths (e.g., NAT rebinding, mobility).
+ *
+ * Returns true if a new alternate path was detected and updated, false
+ * otherwise.
+ */
+bool quic_path_detect_alt(struct quic_path_group *paths, union quic_addr *sa,
+ union quic_addr *da, struct sock *sk)
+{
+ union quic_addr *saddr = quic_path_saddr(paths, 0);
+ union quic_addr *daddr = quic_path_daddr(paths, 0);
+
+ if ((!quic_cmp_sk_addr(sk, saddr, sa) && !paths->disable_saddr_alt) ||
+ (!quic_cmp_sk_addr(sk, daddr, da) && !paths->disable_daddr_alt)) {
+ if (!quic_path_saddr(paths, 1)->v4.sin_port)
+ quic_path_set_saddr(paths, 1, sa);
+
+ if (!quic_cmp_sk_addr(sk, quic_path_saddr(paths, 1), sa))
+ return false;
+
+ if (!quic_path_daddr(paths, 1)->v4.sin_port)
+ quic_path_set_daddr(paths, 1, da);
+
+ return quic_cmp_sk_addr(sk, quic_path_daddr(paths, 1), da);
+ }
+ return false;
+}
+
+void quic_path_get_param(struct quic_path_group *paths,
+ struct quic_transport_param *p)
+{
+ if (p->remote) {
+ p->disable_active_migration = paths->disable_saddr_alt;
+ return;
+ }
+ p->disable_active_migration = paths->disable_daddr_alt;
+}
+
+void quic_path_set_param(struct quic_path_group *paths,
+ struct quic_transport_param *p)
+{
+ if (p->remote) {
+ paths->disable_saddr_alt = p->disable_active_migration;
+ return;
+ }
+ paths->disable_daddr_alt = p->disable_active_migration;
+}
+
+/* State Machine defined in rfc8899#section-5.2 */
+enum quic_plpmtud_state {
+ QUIC_PL_DISABLED,
+ QUIC_PL_BASE,
+ QUIC_PL_SEARCH,
+ QUIC_PL_COMPLETE,
+ QUIC_PL_ERROR,
+};
+
+#define QUIC_BASE_PLPMTU 1200
+#define QUIC_MAX_PLPMTU 9000
+#define QUIC_MIN_PLPMTU 512
+
+#define QUIC_MAX_PROBES 3
+
+#define QUIC_PL_BIG_STEP 32
+#define QUIC_PL_MIN_STEP 4
+
+/* Handle PLPMTUD probe failure on a QUIC path.
+ *
+ * Called immediately after sending a probe packet in QUIC Path MTU Discovery.
+ * Tracks probe count and manages state transitions based on the number of
+ * probes sent and current PLPMTUD state (BASE, SEARCH, COMPLETE, ERROR).
+ * Detects probe failures and black holes, adjusting PMTU and probe sizes
+ * accordingly.
+ *
+ * Return: New PMTU value if updated, else 0.
+ */
+u32 quic_path_pl_send(struct quic_path_group *paths, s64 number)
+{
+ u32 pathmtu = 0;
+
+ paths->pl.number = number;
+ if (paths->pl.probe_count < QUIC_MAX_PROBES)
+ goto out;
+
+ paths->pl.probe_count = 0;
+ if (paths->pl.state == QUIC_PL_BASE) {
+ if (paths->pl.probe_size == QUIC_BASE_PLPMTU) {
+ /* BASE_PLPMTU Confirming Failed: Base -> Error. */
+ paths->pl.state = QUIC_PL_ERROR;
+
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ }
+ } else if (paths->pl.state == QUIC_PL_SEARCH) {
+ if (paths->pl.pmtu == paths->pl.probe_size) {
+ /* Black Hole Detected: Search -> Base. */
+ paths->pl.state = QUIC_PL_BASE;
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+ paths->pl.probe_high = 0;
+
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ } else { /* Normal probe failure. */
+ paths->pl.probe_high = paths->pl.probe_size;
+ paths->pl.probe_size = paths->pl.pmtu;
+ }
+ } else if (paths->pl.state == QUIC_PL_COMPLETE) {
+ if (paths->pl.pmtu == paths->pl.probe_size) {
+ /* Black Hole Detected: Search Complete -> Base. */
+ paths->pl.state = QUIC_PL_BASE;
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+
+ /* probe_high already reset when entering COMPLETE. */
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ }
+ }
+
+out:
+ pr_debug("%s: dst: %p, state: %d, pmtu: %d, size: %d, high: %d\n",
+ __func__, paths, paths->pl.state, paths->pl.pmtu,
+ paths->pl.probe_size, paths->pl.probe_high);
+ paths->pl.probe_count++;
+ return pathmtu;
+}
+
+/* Handle successful reception of a PMTU probe.
+ *
+ * Called when a probe packet is acknowledged. Updates probe size and
+ * transitions state if needed (e.g., from SEARCH to COMPLETE). Expands PMTU
+ * using binary or linear search depending on state.
+ *
+ * Return: New PMTU to apply if search completes, or 0 if no change.
+ */
+u32 quic_path_pl_recv(struct quic_path_group *paths, bool *raise_timer,
+ bool *complete)
+{
+ u32 pathmtu = 0;
+ u16 next;
+
+ pr_debug("%s: dst: %p, state: %d, pmtu: %d, size: %d, high: %d\n",
+ __func__, paths, paths->pl.state, paths->pl.pmtu,
+ paths->pl.probe_size, paths->pl.probe_high);
+
+ *raise_timer = false;
+ paths->pl.number = 0;
+ paths->pl.pmtu = paths->pl.probe_size;
+ paths->pl.probe_count = 0;
+ if (paths->pl.state == QUIC_PL_BASE) {
+ paths->pl.state = QUIC_PL_SEARCH; /* Base -> Search */
+ paths->pl.probe_size += QUIC_PL_BIG_STEP;
+ } else if (paths->pl.state == QUIC_PL_ERROR) {
+ paths->pl.state = QUIC_PL_SEARCH; /* Error -> Search */
+
+ paths->pl.pmtu = paths->pl.probe_size;
+ pathmtu = (u32)paths->pl.pmtu;
+ paths->pl.probe_size += QUIC_PL_BIG_STEP;
+ } else if (paths->pl.state == QUIC_PL_SEARCH) {
+ if (!paths->pl.probe_high) {
+ if (paths->pl.probe_size < QUIC_MAX_PLPMTU) {
+ next = paths->pl.probe_size + QUIC_PL_BIG_STEP;
+ paths->pl.probe_size =
+ min_t(u16, next, QUIC_MAX_PLPMTU);
+ *complete = false;
+ return pathmtu;
+ }
+ paths->pl.probe_high = QUIC_MAX_PLPMTU;
+ }
+ paths->pl.probe_size += QUIC_PL_MIN_STEP;
+ if (paths->pl.probe_size >= paths->pl.probe_high) {
+ paths->pl.probe_high = 0;
+ /* Search -> Search Complete */
+ paths->pl.state = QUIC_PL_COMPLETE;
+
+ paths->pl.probe_size = paths->pl.pmtu;
+ pathmtu = (u32)paths->pl.pmtu;
+ *raise_timer = true;
+ }
+ } else if (paths->pl.state == QUIC_PL_COMPLETE) {
+ /* Raise probe_size after 30 * interval in Search Complete;
+ * Search Complete -> Search.
+ */
+ paths->pl.state = QUIC_PL_SEARCH;
+ next = paths->pl.probe_size + QUIC_PL_MIN_STEP;
+ paths->pl.probe_size = min_t(u16, next, QUIC_MAX_PLPMTU);
+ }
+
+ *complete = (paths->pl.state == QUIC_PL_COMPLETE);
+ return pathmtu;
+}
+
+/* Handle ICMP "Packet Too Big" messages.
+ *
+ * Responds to an incoming ICMP error by reducing the probe size or falling
+ * back to a safe baseline PMTU depending on current state. Also handles cases
+ * where the PMTU hint lies between probe and current PMTU.
+ *
+ * Return: New PMTU to apply if state changes, or 0 if no change.
+ */
+u32 quic_path_pl_toobig(struct quic_path_group *paths, u32 pmtu,
+ bool *reset_timer)
+{
+ u32 pathmtu = 0;
+
+ pr_debug("%s: dst: %p, state: %d, pmtu: %d, size: %d, ptb: %d\n",
+ __func__, paths, paths->pl.state, paths->pl.pmtu,
+ paths->pl.probe_size, pmtu);
+
+ *reset_timer = false;
+ if (pmtu < QUIC_MIN_PLPMTU || pmtu >= (u32)paths->pl.probe_size)
+ return pathmtu;
+
+ if (paths->pl.state == QUIC_PL_BASE) {
+ if (pmtu >= QUIC_MIN_PLPMTU && pmtu < QUIC_BASE_PLPMTU) {
+ paths->pl.state = QUIC_PL_ERROR; /* Base -> Error */
+
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ }
+ } else if (paths->pl.state == QUIC_PL_SEARCH) {
+ if (pmtu >= QUIC_BASE_PLPMTU && pmtu < (u32)paths->pl.pmtu) {
+ paths->pl.state = QUIC_PL_BASE; /* Search -> Base */
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+ paths->pl.probe_count = 0;
+
+ paths->pl.probe_high = 0;
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ } else if (pmtu > (u32)paths->pl.pmtu &&
+ pmtu < (u32)paths->pl.probe_size) {
+ paths->pl.probe_size = (u16)pmtu;
+ paths->pl.probe_count = 0;
+ }
+ } else if (paths->pl.state == QUIC_PL_COMPLETE) {
+ if (pmtu >= QUIC_BASE_PLPMTU && pmtu < (u32)paths->pl.pmtu) {
+ paths->pl.state = QUIC_PL_BASE; /* Complete -> Base */
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+ paths->pl.probe_count = 0;
+
+ paths->pl.probe_high = 0;
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ pathmtu = QUIC_BASE_PLPMTU;
+ *reset_timer = true;
+ }
+ }
+ return pathmtu;
+}
+
+/* Reset PLPMTUD state for a path.
+ *
+ * Resets all PLPMTUD-related state to its initial configuration. Called when
+ * a new path is initialized or when recovering from errors.
+ */
+void quic_path_pl_reset(struct quic_path_group *paths)
+{
+ paths->pl.number = 0;
+ paths->pl.probe_high = 0;
+ paths->pl.probe_count = 0;
+ paths->pl.state = QUIC_PL_BASE;
+ paths->pl.pmtu = QUIC_BASE_PLPMTU;
+ paths->pl.probe_size = QUIC_BASE_PLPMTU;
+}
+
+/* Check if a packet number confirms PLPMTUD probe.
+ *
+ * Checks whether the last probe (tracked by .number) has been acknowledged.
+ * If the probe number lies within the ACK range, confirmation is successful.
+ *
+ * Return: true if probe is confirmed, false otherwise.
+ */
+bool quic_path_pl_confirm(struct quic_path_group *paths, s64 largest,
+ s64 smallest)
+{
+ return paths->pl.number && paths->pl.number >= smallest &&
+ paths->pl.number <= largest;
+}
diff --git a/net/quic/path.h b/net/quic/path.h
new file mode 100644
index 000000000000..ca18eb38e907
--- /dev/null
+++ b/net/quic/path.h
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_PATH_MIN_PMTU 1200U
+#define QUIC_PATH_MAX_PMTU 65536U
+
+#define QUIC_MIN_UDP_PAYLOAD 1200
+#define QUIC_MAX_UDP_PAYLOAD 65527
+
+#define QUIC_PATH_ENTROPY_LEN 8
+
+extern struct workqueue_struct *quic_wq;
+
+/* Connection Migration State Machine:
+ *
+ * +--------+ recv non-probing, free old path +----------+
+ * | NONE | <-------------------------------------- | SWAPPED |
+ * +--------+ +----------+
+ * | ^ \ ^
+ * | \ \ |
+ * | \ \ new path detected, | recv
+ * | \ \ has another DCID, | Path
+ * | \ \ snd Path Challenge | Response
+ * | \ ------------------------------- |
+ * | ------------------------------- \ |
+ * | new path detected, Path \ \ |
+ * | has no other DCID, Challenge \ \ |
+ * | request a new DCID failed \ \ |
+ * v \ v |
+ * +----------+ +----------+
+ * | PENDING | ------------------------------------> | PROBING |
+ * +----------+ recv a new DCID, snd Path Challenge +----------+
+ */
+enum {
+ QUIC_PATH_ALT_NONE,
+ QUIC_PATH_ALT_PENDING, /* Waiting for new dest conn ID for migration */
+ QUIC_PATH_ALT_PROBING, /* Validating alternate path (PATH_CHALLENGE) */
+ QUIC_PATH_ALT_SWAPPED, /* Alternate path is now active; roles swapped */
+};
+
+struct quic_udp_sock {
+ struct work_struct work; /* Workqueue to destroy UDP tunnel socket */
+ struct hlist_node node; /* Node in addr-based UDP socket hash table */
+ union quic_addr addr; /* Source addr of underlying UDP tunnel socket */
+ int bind_ifindex;
+ refcount_t refcnt;
+ struct sock *sk; /* Underlying UDP tunnel socket */
+};
+
+struct quic_path {
+ union quic_addr daddr; /* Destination address */
+ union quic_addr saddr; /* Source address */
+
+ /* Wrapped UDP socket for receiving QUIC */
+ struct quic_udp_sock *udp_sk;
+ /* Cached UDP tunnel socket and source addr for RCU access */
+ union quic_addr uaddr;
+ struct sock *usk;
+};
+
+struct quic_path_group {
+ /* Connection ID validation during handshake (rfc9000#section-7.3) */
+ struct quic_conn_id retry_dcid; /* Source CID from Retry packet */
+ struct quic_conn_id orig_dcid; /* Destination CID from first Initial */
+
+ /* Path validation (rfc9000#section-8.2) */
+ u8 entropy[QUIC_PATH_ENTROPY_LEN]; /* Entropy for PATH_CHALLENGE */
+ struct quic_path path[2]; /* Active path (0) and alternate path (1) */
+ struct flowi fl; /* Flow info from routing decisions */
+
+ /* Anti-amplification limit (rfc9000#section-8) */
+ u16 ampl_sndlen; /* Bytes sent before address is validated */
+ u16 ampl_rcvlen; /* Bytes received to lift amplification limit */
+
+ /* MTU discovery handling */
+ u32 mtu_info; /* PMTU value from received ICMP, pending apply */
+ struct { /* PLPMTUD probing (rfc8899) */
+ s64 number; /* Packet number used for current probe */
+ u16 pmtu; /* Confirmed path MTU */
+
+ u16 probe_size; /* Current probe packet size */
+ u16 probe_high; /* Highest failed probe size */
+ u8 probe_count; /* Retry count for current probe_size */
+ u8 state; /* Probe state machine (rfc8899#section-5.2) */
+ } pl;
+
+ u32 plpmtud_interval; /* Time interval for the PLPMTUD probe timer */
+
+ u8 ecn_probes; /* ECN probe counter */
+ u8 validated:1; /* Path validated with PATH_RESPONSE */
+ u8 blocked:1; /* Blocked by anti-amplification limit */
+ u8 retry:1; /* Retry used in initial packet */
+
+ /* Connection Migration (rfc9000#section-9) */
+ u8 disable_saddr_alt:1; /* Remote disable_active_migration parameter */
+ u8 disable_daddr_alt:1; /* Local disable_active_migration parameter */
+ u8 pref_addr:1; /* Preferred address offered (rfc9000#section-18.2) */
+ u8 alt_probes; /* Number of PATH_CHALLENGE probes sent */
+ u8 alt_state; /* Connection migration state (see above) */
+};
+
+static inline union quic_addr *quic_path_saddr(struct quic_path_group *paths,
+ u8 path)
+{
+ return &paths->path[path].saddr;
+}
+
+static inline void quic_path_set_saddr(struct quic_path_group *paths, u8 path,
+ union quic_addr *addr)
+{
+ memcpy(quic_path_saddr(paths, path), addr, sizeof(*addr));
+}
+
+static inline union quic_addr *quic_path_daddr(struct quic_path_group *paths,
+ u8 path)
+{
+ return &paths->path[path].daddr;
+}
+
+static inline void quic_path_set_daddr(struct quic_path_group *paths, u8 path,
+ union quic_addr *addr)
+{
+ memcpy(quic_path_daddr(paths, path), addr, sizeof(*addr));
+}
+
+static inline union quic_addr *quic_path_uaddr(struct quic_path_group *paths,
+ u8 path)
+{
+ return &paths->path[path].uaddr;
+}
+
+static inline struct sock *quic_path_usock(struct quic_path_group *paths,
+ u8 path)
+{
+ return paths->path[path].usk;
+}
+
+static inline bool quic_path_alt_state(struct quic_path_group *paths, u8 state)
+{
+ return paths->alt_state == state;
+}
+
+static inline void quic_path_set_alt_state(struct quic_path_group *paths,
+ u8 state)
+{
+ paths->alt_state = state;
+}
+
+/* Returns the destination Connection ID (DCID) used for identifying the
+ * connection. Per rfc9000#section-7.3, handshake packets are considered part
+ * of the same connection if their DCID matches the one returned here.
+ */
+static inline struct quic_conn_id *
+quic_path_orig_dcid(struct quic_path_group *paths)
+{
+ return paths->retry ? &paths->retry_dcid : &paths->orig_dcid;
+}
+
+bool quic_path_detect_alt(struct quic_path_group *paths, union quic_addr *sa,
+ union quic_addr *da, struct sock *sk);
+int quic_path_bind(struct sock *sk, struct quic_path_group *paths, u8 path);
+void quic_path_unbind(struct sock *sk, struct quic_path_group *paths, u8 path);
+void quic_path_swap(struct quic_path_group *paths);
+
+u32 quic_path_pl_recv(struct quic_path_group *paths, bool *raise_timer,
+ bool *complete);
+u32 quic_path_pl_toobig(struct quic_path_group *paths, u32 pmtu,
+ bool *reset_timer);
+u32 quic_path_pl_send(struct quic_path_group *paths, s64 number);
+
+void quic_path_get_param(struct quic_path_group *paths,
+ struct quic_transport_param *p);
+void quic_path_set_param(struct quic_path_group *paths,
+ struct quic_transport_param *p);
+bool quic_path_pl_confirm(struct quic_path_group *paths,
+ s64 largest, s64 smallest);
+void quic_path_pl_reset(struct quic_path_group *paths);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index c247f00f7ddc..fd528fb2fc46 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -21,6 +21,7 @@
static unsigned int quic_net_id __read_mostly;
struct percpu_counter quic_sockets_allocated;
+struct workqueue_struct *quic_wq;
DEFINE_STATIC_KEY_FALSE(quic_alpn_demux_key);
@@ -340,6 +341,16 @@ static __init int quic_init(void)
if (err)
goto err_hash;
+ /* Allocate an unbound reclaimable workqueue for UDP socket destruction
+ * and backlog packet processing.
+ */
+ quic_wq = alloc_workqueue("quic_workqueue",
+ WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
+ if (!quic_wq) {
+ err = -ENOMEM;
+ goto err_wq;
+ }
+
err = register_pernet_subsys(&quic_net_ops);
if (err)
goto err_def_ops;
@@ -357,6 +368,8 @@ static __init int quic_init(void)
err_protosw:
unregister_pernet_subsys(&quic_net_ops);
err_def_ops:
+ destroy_workqueue(quic_wq);
+err_wq:
quic_hash_tables_destroy();
err_hash:
percpu_counter_destroy(&quic_sockets_allocated);
@@ -371,6 +384,8 @@ static __exit void quic_exit(void)
#endif
quic_protosw_exit();
unregister_pernet_subsys(&quic_net_ops);
+ flush_workqueue(quic_wq);
+ destroy_workqueue(quic_wq);
quic_hash_tables_destroy();
percpu_counter_destroy(&quic_sockets_allocated);
pr_info("quic: exit\n");
diff --git a/net/quic/socket.c b/net/quic/socket.c
index aa451ea8f516..d5ac77c02861 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -56,6 +56,9 @@ static int quic_init_sock(struct sock *sk)
static void quic_destroy_sock(struct sock *sk)
{
+ quic_path_unbind(sk, quic_paths(sk), 0);
+ quic_path_unbind(sk, quic_paths(sk), 1);
+
quic_conn_id_set_free(quic_source(sk));
quic_conn_id_set_free(quic_dest(sk));
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 68a58f0016cc..91338601905e 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -15,6 +15,7 @@
#include "family.h"
#include "stream.h"
#include "connid.h"
+#include "path.h"
#include "protocol.h"
@@ -39,6 +40,7 @@ struct quic_sock {
struct quic_stream_table streams;
struct quic_conn_id_set source;
struct quic_conn_id_set dest;
+ struct quic_path_group paths;
};
struct quic6_sock {
@@ -86,6 +88,11 @@ static inline struct quic_conn_id_set *quic_dest(const struct sock *sk)
return &quic_sk(sk)->dest;
}
+static inline struct quic_path_group *quic_paths(const struct sock *sk)
+{
+ return &quic_sk(sk)->paths;
+}
+
static inline bool quic_is_serv(const struct sock *sk)
{
return !!sk->sk_max_ack_backlog;
--
2.47.1
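The PLPMTUD logic in quic_path_pl_send()/quic_path_pl_recv() above follows the
rfc8899 section 5.2 state machine. As an illustration of the search step on
probe acknowledgment, here is a minimal userspace sketch (struct and function
names are illustrative, not the kernel's, and only the Base/Search/Complete
transitions of quic_path_pl_recv() are mirrored):

```c
#include <assert.h>
#include <stdint.h>

enum { PL_BASE, PL_SEARCH, PL_COMPLETE, PL_ERROR };

#define BASE_PLPMTU 1200
#define MAX_PLPMTU  9000
#define BIG_STEP    32
#define MIN_STEP    4

struct pl {
	int state;
	uint16_t pmtu, probe_size, probe_high;
};

/* Called when a probe of probe_size bytes is acknowledged: confirm the
 * probed size as the new PMTU, then pick the next probe size, widening
 * by BIG_STEP until the upper bound is hit and narrowing by MIN_STEP
 * until the search completes.
 */
static void pl_on_acked(struct pl *p)
{
	p->pmtu = p->probe_size;
	if (p->state == PL_BASE) {
		p->state = PL_SEARCH;	/* Base -> Search */
		p->probe_size += BIG_STEP;
		return;
	}
	if (p->state != PL_SEARCH)
		return;
	if (!p->probe_high) {
		if (p->probe_size < MAX_PLPMTU) {
			uint16_t next = p->probe_size + BIG_STEP;

			p->probe_size = next < MAX_PLPMTU ? next : MAX_PLPMTU;
			return;
		}
		p->probe_high = MAX_PLPMTU;
	}
	p->probe_size += MIN_STEP;
	if (p->probe_size >= p->probe_high) {
		p->probe_high = 0;
		p->state = PL_COMPLETE;	/* Search -> Search Complete */
		p->probe_size = p->pmtu;
	}
}
```

Starting from the 1200-byte base, each acknowledged probe raises the confirmed
PMTU and widens the next probe by the big step (QUIC_PL_BIG_STEP in the patch)
until the upper bound is reached, after which the search narrows in small
steps (QUIC_PL_MIN_STEP) and enters the Complete state.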
* [PATCH net-next v11 09/15] quic: add congestion control
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (7 preceding siblings ...)
2026-03-25 3:47 ` [PATCH net-next v11 08/15] quic: add path management Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 10/15] quic: add packet number space Xin Long
` (5 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch introduces 'quic_cong' for RTT measurement and congestion
control, and adds 'quic_cong_ops' to define the interface for
congestion control algorithms.
It implements a congestion control state machine with slow start,
congestion avoidance, and recovery phases, and currently introduces
the New Reno algorithm only.
The implementation updates RTT estimates when packets are acknowledged,
reacts to loss and ECN signals, and adjusts the congestion window
accordingly during packet transmission and acknowledgment processing.
- quic_cong_rtt_update(): Performs RTT measurement; invoked when the
packet with the largest number in the ACK frame is acknowledged.
- quic_cong_on_packet_acked(): Invoked when a packet is acknowledged.
- quic_cong_on_packet_lost(): Invoked when a packet is marked as lost.
- quic_cong_on_process_ecn(): Invoked when an ACK_ECN frame is received.
- quic_cong_on_packet_sent(): Invoked when a packet is transmitted.
- quic_cong_on_ack_recv(): Invoked when an ACK frame is received.
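For reference, quic_cong_rtt_update() follows the estimator from rfc9002
section 5.3. A minimal userspace sketch of the arithmetic (not the kernel
code; names are illustrative and the ack_delay adjustment is omitted), with
all times in microseconds:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative rfc9002 section 5.3 RTT estimator. */
struct rtt_est {
	uint64_t latest_rtt, smoothed_rtt, rttvar, min_rtt;
	int has_sample;
};

static uint64_t absdiff(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

static void rtt_update(struct rtt_est *r, uint64_t latest_rtt)
{
	r->latest_rtt = latest_rtt;
	if (!r->has_sample) {
		/* First sample seeds the estimator directly. */
		r->has_sample = 1;
		r->min_rtt = latest_rtt;
		r->smoothed_rtt = latest_rtt;
		r->rttvar = latest_rtt / 2;
		return;
	}
	if (latest_rtt < r->min_rtt)
		r->min_rtt = latest_rtt;
	/* rttvar = 3/4 * rttvar + 1/4 * |smoothed_rtt - latest_rtt| */
	r->rttvar = (3 * r->rttvar + absdiff(r->smoothed_rtt, latest_rtt)) / 4;
	/* smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * latest_rtt */
	r->smoothed_rtt = (7 * r->smoothed_rtt + latest_rtt) / 8;
}
```

The first sample seeds smoothed_rtt with the measured value and rttvar with
half of it; subsequent samples apply the 7/8 and 3/4 EWMA weights, which is
why PTO and loss-delay computations can rely on smoothed_rtt being non-zero
once any ACK has been processed.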
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v4:
- Remove the CUBIC congestion algorithm support for this version
(suggested by Paolo).
v5:
- Do not update the pacing rate when !cong->smoothed_rtt in
quic_cong_pace_update() (suggested by Paolo).
- Change timestamp variables from u32 to u64, as RTT is measured in
microseconds and u64 provides sufficient precision for timestamps
in microseconds.
v8:
- Add a comment in quic_reno_on_packet_acked() clarifying cong->window
is never zero (noted by AI review).
v9:
- Use abs_diff() to simplify RTT variance calculation (noted by AI
review).
- Fix a small typo in the comment for struct quic_cong::time (noted
by AI review).
- Fix another small typo in quic_cong_check_persistent_congestion().
v10:
- Note for AI reviews: cong->time will always be set before processing
the packet on the RX path in the future patch series.
- Add initial_srtt and algo to struct quic_cong to store the initial
smoothed RTT and congestion control algorithm, previously kept in
quic_sock.config.
v11:
- Note for AI reviews: later patches guarantee cong->time is set
before any quic_cong function uses it.
- Note for AI reviews: RFC 9002 section 5.3 does NOT specify that
rttvar_sample must use the previous smoothed_rtt value.
- Set maximum line length to 80 characters.
- Change return type of quic_cong_check_persistent_congestion() to bool.
- Add a check for algo in quic_cong_set_algo().
- Extract quic_reno_handle_packet_lost() from quic_reno_on_packet_lost()
and quic_reno_on_process_ecn() (noted by AI review).
---
net/quic/Makefile | 3 +-
net/quic/cong.c | 314 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/cong.h | 129 +++++++++++++++++++
net/quic/socket.c | 1 +
net/quic/socket.h | 7 ++
5 files changed, 453 insertions(+), 1 deletion(-)
create mode 100644 net/quic/cong.c
create mode 100644 net/quic/cong.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 1565fb5cef9d..4d4a42c6d565 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -5,4 +5,5 @@
obj-$(CONFIG_IP_QUIC) += quic.o
-quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o
+quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
+ cong.o
diff --git a/net/quic/cong.c b/net/quic/cong.c
new file mode 100644
index 000000000000..85c3dedf6a60
--- /dev/null
+++ b/net/quic/cong.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Congestion control and RTT measurement for the QUIC protocol.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/quic.h>
+
+#include "common.h"
+#include "cong.h"
+
+static bool quic_cong_check_persistent_congestion(struct quic_cong *cong,
+ u64 time)
+{
+ u32 ssthresh;
+
+ /* rfc9002#section-7.6.1:
+ * (smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay) *
+ * kPersistentCongestionThreshold
+ */
+ ssthresh = cong->smoothed_rtt +
+ max(4 * cong->rttvar, QUIC_KGRANULARITY);
+ ssthresh = (ssthresh + cong->max_ack_delay) *
+ QUIC_KPERSISTENT_CONGESTION_THRESHOLD;
+ if (cong->time - time <= ssthresh)
+ return false;
+
+ pr_debug("%s: persistent congestion, cwnd: %u, ssth: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ cong->min_rtt_valid = 0;
+ cong->window = cong->min_window;
+ cong->state = QUIC_CONG_SLOW_START;
+ return true;
+}
+
+/* NEW RENO APIs */
+static void quic_reno_handle_packet_lost(struct quic_cong *cong)
+{
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ pr_debug("%s: slow_start -> recovery, cwnd: %u, ssth: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ return;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ pr_debug("%s: cong_avoid -> recovery, cwnd: %u, ssth: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__,
+ cong->state);
+ return;
+ }
+
+ cong->recovery_time = cong->time;
+ cong->state = QUIC_CONG_RECOVERY_PERIOD;
+ cong->ssthresh = max(cong->window >> 1U, cong->min_window);
+ cong->window = cong->ssthresh;
+}
+
+static void quic_reno_on_packet_lost(struct quic_cong *cong, u64 time,
+ u32 bytes, s64 number)
+{
+ if (quic_cong_check_persistent_congestion(cong, time))
+ return;
+
+ quic_reno_handle_packet_lost(cong);
+}
+
+static void quic_reno_on_packet_acked(struct quic_cong *cong, u64 time,
+ u32 bytes, s64 number)
+{
+ switch (cong->state) {
+ case QUIC_CONG_SLOW_START:
+ cong->window = min_t(u32, cong->window + bytes,
+ cong->max_window);
+ if (cong->window < cong->ssthresh)
+ break;
+ cong->state = QUIC_CONG_CONGESTION_AVOIDANCE;
+ pr_debug("%s: slow_start -> cong_avoid, cwnd: %u, ssth: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ case QUIC_CONG_RECOVERY_PERIOD:
+ if (cong->recovery_time >= time)
+ break;
+ cong->state = QUIC_CONG_CONGESTION_AVOIDANCE;
+ pr_debug("%s: recovery -> cong_avoid, cwnd: %u, ssth: %u\n",
+ __func__, cong->window, cong->ssthresh);
+ break;
+ case QUIC_CONG_CONGESTION_AVOIDANCE:
+ /* cong->window is never zero; it is initialized by
+ * quic_packet_route() during connect/accept.
+ */
+ cong->window += cong->mss * bytes / cong->window;
+ break;
+ default:
+ pr_debug("%s: wrong congestion state: %d\n", __func__,
+ cong->state);
+ return;
+ }
+}
+
+static void quic_reno_on_process_ecn(struct quic_cong *cong)
+{
+ quic_reno_handle_packet_lost(cong);
+}
+
+static void quic_reno_on_init(struct quic_cong *cong)
+{
+}
+
+static struct quic_cong_ops quic_congs[] = {
+ { /* QUIC_CONG_ALG_RENO */
+ .on_packet_acked = quic_reno_on_packet_acked,
+ .on_packet_lost = quic_reno_on_packet_lost,
+ .on_process_ecn = quic_reno_on_process_ecn,
+ .on_init = quic_reno_on_init,
+ },
+};
+
+/* COMMON APIs */
+void quic_cong_on_packet_lost(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number)
+{
+ cong->ops->on_packet_lost(cong, time, bytes, number);
+}
+
+void quic_cong_on_packet_acked(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number)
+{
+ cong->ops->on_packet_acked(cong, time, bytes, number);
+}
+
+void quic_cong_on_process_ecn(struct quic_cong *cong)
+{
+ cong->ops->on_process_ecn(cong);
+}
+
+/* Update Probe Timeout (PTO) and loss detection delay based on RTT stats. */
+static void quic_cong_pto_update(struct quic_cong *cong)
+{
+ u32 pto, loss_delay;
+
+ /* rfc9002#section-6.2.1:
+ * PTO = smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay
+ */
+ pto = cong->smoothed_rtt + max(4 * cong->rttvar, QUIC_KGRANULARITY);
+ cong->pto = pto + cong->max_ack_delay;
+
+ /* rfc9002#section-6.1.2:
+ * max(kTimeThreshold * max(smoothed_rtt, latest_rtt), kGranularity)
+ */
+ loss_delay = QUIC_KTIME_THRESHOLD(max(cong->smoothed_rtt,
+ cong->latest_rtt));
+ cong->loss_delay = max(loss_delay, QUIC_KGRANULARITY);
+
+ pr_debug("%s: update pto: %u\n", __func__, pto);
+}
+
+/* Update pacing timestamp after sending 'bytes' bytes.
+ *
+ * This function tracks when the next packet is allowed to be sent based on
+ * pacing rate.
+ */
+static void quic_cong_update_pacing_time(struct quic_cong *cong, u32 bytes)
+{
+ u64 prior_time, credit, len_ns, rate = READ_ONCE(cong->pacing_rate);
+
+ if (!rate)
+ return;
+
+ prior_time = cong->pacing_time;
+ cong->pacing_time = max(cong->pacing_time, ktime_get_ns());
+ credit = cong->pacing_time - prior_time;
+
+ /* take into account OS jitter */
+ len_ns = div64_ul((u64)bytes * NSEC_PER_SEC, rate);
+ len_ns -= min_t(u64, len_ns / 2, credit);
+ cong->pacing_time += len_ns;
+}
+
+/* Compute and update the pacing rate based on congestion window and smoothed
+ * RTT.
+ */
+static void quic_cong_pace_update(struct quic_cong *cong, u32 bytes,
+ u64 max_rate)
+{
+ u64 rate;
+
+ if (unlikely(!cong->smoothed_rtt))
+ return;
+
+ /* rate = N * congestion_window / smoothed_rtt */
+ rate = div64_ul((u64)cong->window * USEC_PER_SEC * 2,
+ cong->smoothed_rtt);
+
+ WRITE_ONCE(cong->pacing_rate, min_t(u64, rate, max_rate));
+ pr_debug("%s: update pacing rate: %llu, max rate: %llu, srtt: %u\n",
+ __func__, cong->pacing_rate, max_rate, cong->smoothed_rtt);
+}
+
+void quic_cong_on_packet_sent(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number)
+{
+ if (!bytes)
+ return;
+ if (cong->ops->on_packet_sent)
+ cong->ops->on_packet_sent(cong, time, bytes, number);
+ quic_cong_update_pacing_time(cong, bytes);
+}
+
+void quic_cong_on_ack_recv(struct quic_cong *cong, u32 bytes, u64 max_rate)
+{
+ if (!bytes)
+ return;
+ if (cong->ops->on_ack_recv)
+ cong->ops->on_ack_recv(cong, bytes, max_rate);
+ quic_cong_pace_update(cong, bytes, max_rate);
+}
+
+/* rfc9002#section-5: Estimating the Round-Trip Time */
+void quic_cong_rtt_update(struct quic_cong *cong, u64 time, u32 ack_delay)
+{
+ u32 adjusted_rtt, rttvar_sample;
+
+ /* Ignore RTT sample if ACK delay is suspiciously large. */
+ if (ack_delay > cong->max_ack_delay * 2)
+ return;
+
+ /* rfc9002#section-5.1:
+ * latest_rtt = ack_time - send_time_of_largest_acked
+ */
+ cong->latest_rtt = cong->time - time;
+
+ /* rfc9002#section-5.2: Estimating min_rtt */
+ if (!cong->min_rtt_valid) {
+ cong->min_rtt = cong->latest_rtt;
+ cong->min_rtt_valid = 1;
+ }
+ if (cong->min_rtt > cong->latest_rtt)
+ cong->min_rtt = cong->latest_rtt;
+
+ if (!cong->is_rtt_set) {
+ /* rfc9002#section-5.3:
+ * smoothed_rtt = latest_rtt
+ * rttvar = latest_rtt / 2
+ */
+ cong->smoothed_rtt = cong->latest_rtt;
+ cong->rttvar = cong->smoothed_rtt / 2;
+ quic_cong_pto_update(cong);
+ cong->is_rtt_set = 1;
+ return;
+ }
+
+ /* rfc9002#section-5.3:
+ * adjusted_rtt = latest_rtt
+ * if (latest_rtt >= min_rtt + ack_delay):
+ * adjusted_rtt = latest_rtt - ack_delay
+ * smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * adjusted_rtt
+ * rttvar_sample = abs(smoothed_rtt - adjusted_rtt)
+ * rttvar = 3/4 * rttvar + 1/4 * rttvar_sample
+ */
+ adjusted_rtt = cong->latest_rtt;
+ if (cong->latest_rtt >= cong->min_rtt + ack_delay)
+ adjusted_rtt = cong->latest_rtt - ack_delay;
+
+ cong->smoothed_rtt = (cong->smoothed_rtt * 7 + adjusted_rtt) / 8;
+ rttvar_sample = abs_diff(cong->smoothed_rtt, adjusted_rtt);
+ cong->rttvar = (cong->rttvar * 3 + rttvar_sample) / 4;
+ quic_cong_pto_update(cong);
+
+ if (cong->ops->on_rtt_update)
+ cong->ops->on_rtt_update(cong);
+}
+
+void quic_cong_set_algo(struct quic_cong *cong, u8 algo)
+{
+ /* The caller must ensure algo < QUIC_CONG_ALG_MAX. */
+ if (WARN_ON_ONCE(algo >= QUIC_CONG_ALG_MAX))
+ return;
+ cong->algo = algo;
+ cong->state = QUIC_CONG_SLOW_START;
+ cong->ssthresh = U32_MAX;
+ cong->ops = &quic_congs[algo];
+ cong->ops->on_init(cong);
+}
+
+void quic_cong_set_srtt(struct quic_cong *cong, u32 srtt)
+{
+ /* rfc9002#section-5.3:
+ * smoothed_rtt = kInitialRtt
+ * rttvar = kInitialRtt / 2
+ */
+ cong->initial_srtt = srtt;
+ cong->latest_rtt = srtt;
+ cong->smoothed_rtt = cong->latest_rtt;
+ cong->rttvar = cong->smoothed_rtt / 2;
+ quic_cong_pto_update(cong);
+}
+
+void quic_cong_init(struct quic_cong *cong)
+{
+ cong->max_ack_delay = QUIC_DEF_ACK_DELAY;
+ cong->max_window = S32_MAX / 2;
+ quic_cong_set_algo(cong, QUIC_CONG_ALG_RENO);
+ quic_cong_set_srtt(cong, QUIC_RTT_INIT);
+}
diff --git a/net/quic/cong.h b/net/quic/cong.h
new file mode 100644
index 000000000000..aef765d097f2
--- /dev/null
+++ b/net/quic/cong.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_KPERSISTENT_CONGESTION_THRESHOLD 3
+#define QUIC_KPACKET_THRESHOLD 3
+#define QUIC_KTIME_THRESHOLD(rtt) ((rtt) * 9 / 8)
+#define QUIC_KGRANULARITY 1000U
+
+#define QUIC_RTT_INIT 333000U
+#define QUIC_RTT_MAX 2000000U
+#define QUIC_RTT_MIN QUIC_KGRANULARITY
+
+/* rfc9002#section-7.3: Congestion Control States
+ *
+ * New path or +------------+
+ * persistent congestion | Slow |
+ * (O)---------------------->| Start |
+ * +------------+
+ * |
+ * Loss or |
+ * ECN-CE increase |
+ * v
+ * +------------+ Loss or +------------+
+ * | Congestion | ECN-CE increase | Recovery |
+ * | Avoidance |------------------>| Period |
+ * +------------+ +------------+
+ * ^ |
+ * | |
+ * +----------------------------+
+ * Acknowledgment of packet
+ * sent during recovery
+ */
+enum quic_cong_state {
+ QUIC_CONG_SLOW_START,
+ QUIC_CONG_RECOVERY_PERIOD,
+ QUIC_CONG_CONGESTION_AVOIDANCE,
+};
+
+struct quic_cong {
+ /* RTT tracking */
+ u32 max_ack_delay; /* max_ack_delay from rfc9000#section-18.2 */
+ u32 smoothed_rtt; /* Smoothed RTT */
+ u32 latest_rtt; /* Latest RTT sample */
+ u32 min_rtt; /* Lowest observed RTT */
+ u32 rttvar; /* RTT variation */
+ u32 pto; /* Probe timeout */
+
+ /* Timing & pacing */
+ u64 recovery_time; /* Recovery period start timestamp */
+ u64 pacing_rate; /* Packet sending speed Bytes/sec */
+ u64 pacing_time; /* Next scheduled send timestamp (ns) */
+ u64 time; /* Cached current timestamp */
+
+ /* Congestion window */
+ u32 max_window; /* Max growth cap */
+ u32 min_window; /* Min window limit */
+ u32 loss_delay; /* Time before marking loss */
+ u32 ssthresh; /* Slow start threshold */
+ u32 window; /* Bytes in flight allowed */
+ u32 mss; /* QUIC MSS (excl. UDP) */
+
+ /* Algorithm-specific */
+ struct quic_cong_ops *ops;
+ u64 priv[8]; /* Algo private data */
+
+ u32 initial_srtt; /* Initial smoothed RTT */
+ u8 algo; /* Congestion control algorithm */
+
+ /* Flags & state */
+ u8 min_rtt_valid; /* min_rtt initialized */
+ u8 is_rtt_set; /* RTT samples exist */
+ u8 state; /* State machine in rfc9002#section-7.3 */
+};
+
+/* Hooks for congestion control algorithms */
+struct quic_cong_ops {
+ void (*on_packet_acked)(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number);
+ void (*on_packet_lost)(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number);
+ void (*on_process_ecn)(struct quic_cong *cong);
+ void (*on_init)(struct quic_cong *cong);
+
+ /* Optional callbacks */
+ void (*on_packet_sent)(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number);
+ void (*on_ack_recv)(struct quic_cong *cong, u32 bytes, u64 max_rate);
+ void (*on_rtt_update)(struct quic_cong *cong);
+};
+
+static inline void quic_cong_set_mss(struct quic_cong *cong, u32 mss)
+{
+ if (cong->mss == mss)
+ return;
+
+ /* rfc9002#section-7.2: Initial and Minimum Congestion Window */
+ cong->mss = mss;
+ cong->min_window = max(min(mss * 10, 14720U), mss * 2);
+
+ if (cong->window < cong->min_window)
+ cong->window = cong->min_window;
+}
+
+static inline void *quic_cong_priv(struct quic_cong *cong)
+{
+ return (void *)cong->priv;
+}
+
+void quic_cong_on_packet_acked(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number);
+void quic_cong_on_packet_lost(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number);
+void quic_cong_on_process_ecn(struct quic_cong *cong);
+
+void quic_cong_on_packet_sent(struct quic_cong *cong, u64 time, u32 bytes,
+ s64 number);
+void quic_cong_on_ack_recv(struct quic_cong *cong, u32 bytes, u64 max_rate);
+void quic_cong_rtt_update(struct quic_cong *cong, u64 time, u32 ack_delay);
+
+void quic_cong_set_srtt(struct quic_cong *cong, u32 srtt);
+void quic_cong_set_algo(struct quic_cong *cong, u8 algo);
+void quic_cong_init(struct quic_cong *cong);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index d5ac77c02861..7a4a00498b54 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -47,6 +47,7 @@ static int quic_init_sock(struct sock *sk)
quic_conn_id_set_init(quic_source(sk), 1);
quic_conn_id_set_init(quic_dest(sk), 0);
+ quic_cong_init(quic_cong(sk));
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 91338601905e..9201ca3edad0 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -16,6 +16,7 @@
#include "stream.h"
#include "connid.h"
#include "path.h"
+#include "cong.h"
#include "protocol.h"
@@ -41,6 +42,7 @@ struct quic_sock {
struct quic_conn_id_set source;
struct quic_conn_id_set dest;
struct quic_path_group paths;
+ struct quic_cong cong;
};
struct quic6_sock {
@@ -98,6 +100,11 @@ static inline bool quic_is_serv(const struct sock *sk)
return !!sk->sk_max_ack_backlog;
}
+static inline struct quic_cong *quic_cong(const struct sock *sk)
+{
+ return &quic_sk(sk)->cong;
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
^ permalink raw reply related [flat|nested] 17+ messages in thread

* [PATCH net-next v11 10/15] quic: add packet number space
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (8 preceding siblings ...)
2026-03-25 3:47 ` [PATCH net-next v11 09/15] quic: add congestion control Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 11/15] quic: add crypto key derivation and installation Xin Long
` (4 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch introduces 'quic_pnspace', which manages the state kept per
packet number space.
It maintains the next packet number to assign, tracks the total length of
frames currently in flight, and records the time when the next packet may
be considered lost. It also keeps track of the largest acknowledged packet
number, the time it was acknowledged, and when the most recent
ack-eliciting packet was sent. These fields are useful for loss detection,
RTT estimation, and congestion control.
To support ACK frame generation, quic_pnspace includes a packet number
acknowledgment map (pn_ack_map) that tracks received packet numbers.
Supporting functions are provided to validate and mark received packet
numbers and compute the number of gap blocks needed during ACK frame
construction.
- quic_pnspace_check(): Validates a received packet number.
- quic_pnspace_mark(): Marks a received packet number in the ACK map.
- quic_pnspace_num_gabs(): Computes the gap ACK blocks needed for
constructing ACK frames.
Note that QUIC uses a separate packet number space for each encryption
level (APP, INITIAL, HANDSHAKE, EARLY), except that EARLY and all
generations of APP keys share the same packet number space, as described
in rfc9002#section-4.1.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v5:
- Change timestamp variables from u32 to u64 and use quic_ktime_get_us()
to set max_pn_acked_time, as jiffies_to_usecs() is not accurate enough.
- Reorder some members in quic_pnspace to reduce 32-bit holes (noted
by Paolo).
v6:
- Note for AI reviews: it's safe to do cast (u16)(pn - space->base_pn)
in quic_pnspace_mark(), as the pn < base_pn + QUIC_PN_MAP_SIZE (4096)
validation is always done in quic_pnspace_check(), which will always
be called before quic_pnspace_mark() in a later patchset.
- Note for AI reviews: failures in quic_pnspace_init() do not result in a
pn_map leak in quic_init_sock(), because quic_destroy_sock() is always
called to free it in err path, either via inet/6_create() or through
quic_accept() in a later patchset.
v8:
- Replace bitfields with plain u8 in struct quic_pnspace.
v10:
- Fix a grammar error in the comment of quic_pnspace_check().
v11:
- Note for AI reviews: RFC 9000 does not define integer IDs for packet
number spaces. In this implementation, App=0, Initial=1,
Handshake=2, and Early maps to 0 (3 % 3).
- Set maximum line length to 80 characters.
- Clear space->pn_map pointer after free in quic_pnspace_free().
- Change quic_pnspace_grow() to return negative errno on failure or 0
on success.
- Change return type of quic_pnspace_next_gap_ack() and
quic_pnspace_set_ecn_count() to bool.
- Return -EINVAL instead of -1 on failure in quic_pnspace_check().
---
net/quic/Makefile | 2 +-
net/quic/pnspace.c | 250 +++++++++++++++++++++++++++++++++++++++++++++
net/quic/pnspace.h | 157 ++++++++++++++++++++++++++++
net/quic/socket.c | 12 +++
net/quic/socket.h | 7 ++
5 files changed, 427 insertions(+), 1 deletion(-)
create mode 100644 net/quic/pnspace.c
create mode 100644 net/quic/pnspace.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 4d4a42c6d565..9d8e18297911 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o
+ cong.o pnspace.o
diff --git a/net/quic/pnspace.c b/net/quic/pnspace.c
new file mode 100644
index 000000000000..d8f19f4e0fc6
--- /dev/null
+++ b/net/quic/pnspace.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Packet number space management for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <linux/slab.h>
+
+#include "common.h"
+#include "pnspace.h"
+
+int quic_pnspace_init(struct quic_pnspace *space)
+{
+ if (!space->pn_map) {
+ space->pn_map = kzalloc(BITS_TO_BYTES(QUIC_PN_MAP_INITIAL),
+ GFP_KERNEL);
+ if (!space->pn_map)
+ return -ENOMEM;
+ space->pn_map_len = QUIC_PN_MAP_INITIAL;
+ } else {
+ bitmap_zero(space->pn_map, space->pn_map_len);
+ }
+
+ space->max_time_limit = QUIC_PNSPACE_TIME_LIMIT;
+ space->next_pn = QUIC_PNSPACE_NEXT_PN;
+ space->base_pn = -1;
+ return 0;
+}
+
+void quic_pnspace_free(struct quic_pnspace *space)
+{
+ space->pn_map_len = 0;
+ kfree(space->pn_map);
+ space->pn_map = NULL;
+}
+
+/* Expand the bitmap tracking received packet numbers. Ensures the pn_map
+ * bitmap can cover at least @size packet numbers. Allocates a larger bitmap,
+ * copies existing data, and updates metadata.
+ *
+ * Return: 0 on success, or a negative errno value on failure.
+ */
+static int quic_pnspace_grow(struct quic_pnspace *space, u16 size)
+{
+ u16 len, inc, offset;
+ unsigned long *new;
+
+ if (size > QUIC_PN_MAP_SIZE)
+ return -EINVAL;
+
+ inc = ALIGN((size - space->pn_map_len), BITS_PER_LONG) +
+ QUIC_PN_MAP_INCREMENT;
+ len = (u16)min(space->pn_map_len + inc, QUIC_PN_MAP_SIZE);
+
+ new = kzalloc(BITS_TO_BYTES(len), GFP_ATOMIC);
+ if (!new)
+ return -ENOMEM;
+
+ offset = (u16)(space->max_pn_seen + 1 - space->base_pn);
+ bitmap_copy(new, space->pn_map, offset);
+ kfree(space->pn_map);
+ space->pn_map = new;
+ space->pn_map_len = len;
+
+ return 0;
+}
+
+/* Check if a packet number has been received.
+ *
+ * Returns: 0 if the packet number has not been received. 1 if it has already
+ * been received. -EINVAL if the packet number is too old or too far in the
+ * future to track.
+ */
+int quic_pnspace_check(struct quic_pnspace *space, s64 pn)
+{
+ if (space->base_pn == -1) /* No packet number received yet. */
+ return 0;
+
+ if (pn < space->min_pn_seen || pn >= space->base_pn + QUIC_PN_MAP_SIZE)
+ return -EINVAL;
+
+ if (pn < space->base_pn)
+ return 1;
+ if (pn - space->base_pn < space->pn_map_len &&
+ test_bit(pn - space->base_pn, space->pn_map))
+ return 1;
+
+ return 0;
+}
+
+/* Advance base_pn past contiguous received packet numbers. Finds the next gap
+ * (unreceived packet) beyond @pn, shifts the bitmap, and updates base_pn
+ * accordingly.
+ */
+static void quic_pnspace_move(struct quic_pnspace *space, s64 pn)
+{
+ u16 offset;
+
+ offset = (u16)(pn + 1 - space->base_pn);
+ offset = (u16)find_next_zero_bit(space->pn_map, space->pn_map_len,
+ offset);
+ space->base_pn += offset;
+ bitmap_shift_right(space->pn_map, space->pn_map, offset,
+ space->pn_map_len);
+}
+
+/* Mark a packet number as received. Updates the packet number map to record
+ * reception of @pn. Advances base_pn if possible, and updates max/min/last
+ * seen fields as needed.
+ *
+ * Returns: 0 on success or if the packet was already marked, or a negative
+ * error returned by bitmap growth when expanding the map.
+ */
+int quic_pnspace_mark(struct quic_pnspace *space, s64 pn)
+{
+ s64 last_max_pn_seen;
+ u64 last_max_pn_time;
+ u16 gap;
+ int err;
+
+ if (space->base_pn == -1) {
+ /* Initialize base_pn based on the peer's first packet number
+ * since peer's packet numbers may start at a non-zero value.
+ */
+ quic_pnspace_set_base_pn(space, pn + 1);
+ return 0;
+ }
+
+ /* Ignore packets with number less than current base (already
+ * processed).
+ */
+ if (pn < space->base_pn)
+ return 0;
+
+ /* If gap is beyond current map length, try to grow the bitmap to
+ * accommodate.
+ */
+ gap = (u16)(pn - space->base_pn);
+ if (gap >= space->pn_map_len) {
+ err = quic_pnspace_grow(space, gap + 1);
+ if (err)
+ return err;
+ }
+
+ if (space->max_pn_seen < pn) {
+ space->max_pn_seen = pn;
+ space->max_pn_time = space->time;
+ }
+
+ if (space->base_pn == pn) { /* PN is next expected packet. */
+ if (quic_pnspace_has_gap(space)) /* Advance to next gap. */
+ quic_pnspace_move(space, pn);
+ else /* Fast path: increment base_pn if no gaps. */
+ space->base_pn++;
+ } else { /* Mark this packet as received in the bitmap. */
+ set_bit(gap, space->pn_map);
+ }
+
+ /* Only update min and last_max_pn_seen if this packet is the current
+ * max_pn.
+ */
+ if (space->max_pn_seen != pn)
+ return 0;
+
+ /* Check if enough time has elapsed or enough packets have been
+ * received to update tracking.
+ */
+ last_max_pn_seen = min_t(s64, space->last_max_pn_seen, space->base_pn);
+ last_max_pn_time = space->last_max_pn_time;
+ if (space->max_pn_time < last_max_pn_time + space->max_time_limit &&
+ space->max_pn_seen <= last_max_pn_seen + QUIC_PN_MAP_LIMIT)
+ return 0;
+
+ /* Advance base_pn if last_max_pn_seen is ahead of current base_pn.
+ * This is needed because QUIC doesn't retransmit packets;
+ * retransmitted frames are carried in new packets, so we move forward.
+ */
+ if (space->last_max_pn_seen + 1 > space->base_pn)
+ quic_pnspace_move(space, space->last_max_pn_seen);
+
+ space->min_pn_seen = space->last_max_pn_seen;
+ space->last_max_pn_seen = space->max_pn_seen;
+ space->last_max_pn_time = space->max_pn_time;
+ return 0;
+}
+
+/* Find the next gap in received packet numbers. Scans pn_map for a gap
+ * starting from *@iter. A gap is a contiguous block of unreceived packets
+ * between received ones.
+ *
+ * Returns: true if a gap was found, false if no more gaps exist or are
+ * relevant.
+ */
+static bool quic_pnspace_next_gap_ack(const struct quic_pnspace *space,
+ s64 *iter, u16 *start, u16 *end)
+{
+ u16 start_ = 0, end_ = 0, offset = (u16)(*iter - space->base_pn);
+
+ start_ = (u16)find_next_zero_bit(space->pn_map, space->pn_map_len,
+ offset);
+ if (space->max_pn_seen <= space->base_pn + start_)
+ return false;
+
+ end_ = (u16)find_next_bit(space->pn_map, space->pn_map_len, start_);
+ if (space->max_pn_seen <= space->base_pn + end_ - 1)
+ return false;
+
+ *start = start_ + 1;
+ *end = end_;
+ *iter = space->base_pn + *end;
+ return true;
+}
+
+/* Generate gap acknowledgment blocks (GABs). GABs describe ranges of
+ * unacknowledged packets between received ones, and are used in ACK frames.
+ *
+ * Returns: Number of generated GABs (up to QUIC_PN_MAP_MAX_GABS).
+ */
+u16 quic_pnspace_num_gabs(struct quic_pnspace *space,
+ struct quic_gap_ack_block *gabs)
+{
+ u16 start, end, ngaps = 0;
+ s64 iter;
+
+ if (!quic_pnspace_has_gap(space))
+ return 0;
+
+ iter = space->base_pn;
+ /* Loop through all gaps until the end of the window or max allowed
+ * gaps.
+ */
+ while (quic_pnspace_next_gap_ack(space, &iter, &start, &end)) {
+ gabs[ngaps].start = start;
+ if (ngaps == QUIC_PN_MAP_MAX_GABS - 1) {
+ gabs[ngaps].end =
+ (u16)(space->max_pn_seen - space->base_pn);
+ ngaps++;
+ break;
+ }
+ gabs[ngaps].end = end;
+ ngaps++;
+ }
+ return ngaps;
+}
diff --git a/net/quic/pnspace.h b/net/quic/pnspace.h
new file mode 100644
index 000000000000..0961add68401
--- /dev/null
+++ b/net/quic/pnspace.h
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_PN_MAP_MAX_GABS 32
+
+#define QUIC_PN_MAP_INITIAL 64
+#define QUIC_PN_MAP_INCREMENT QUIC_PN_MAP_INITIAL
+#define QUIC_PN_MAP_SIZE 4096
+#define QUIC_PN_MAP_LIMIT (QUIC_PN_MAP_SIZE * 3 / 4)
+
+#define QUIC_PNSPACE_MAX (QUIC_CRYPTO_MAX - 1)
+#define QUIC_PNSPACE_NEXT_PN 0
+#define QUIC_PNSPACE_TIME_LIMIT (333000 * 3)
+
+enum {
+ QUIC_ECN_ECT1,
+ QUIC_ECN_ECT0,
+ QUIC_ECN_CE,
+ QUIC_ECN_MAX
+};
+
+enum {
+ QUIC_ECN_LOCAL, /* ECN bits from incoming IP headers */
+ QUIC_ECN_PEER, /* ECN bits reported by peer in ACK frames */
+ QUIC_ECN_DIR_MAX
+};
+
+/* Represents a gap (range of missing packets) in the ACK map. The values are
+ * offsets from base_pn, with both 'start' and 'end' being +1.
+ */
+struct quic_gap_ack_block {
+ u16 start;
+ u16 end;
+};
+
+/* Packet Number Map (pn_map) Layout:
+ *
+ * min_pn_seen -->++-----------------------+---------------------+---
+ * base_pn -----^ last_max_pn_seen --^ max_pn_seen --^
+ *
+ * Map Advancement Logic:
+ * - min_pn_seen = last_max_pn_seen;
+ * - base_pn = first zero bit after last_max_pn_seen;
+ * - last_max_pn_seen = max_pn_seen;
+ * - last_max_pn_time = current time;
+ *
+ * Conditions to Advance pn_map:
+ * - (max_pn_time - last_max_pn_time) >= max_time_limit, or
+ * - (max_pn_seen - last_max_pn_seen) > QUIC_PN_MAP_LIMIT
+ *
+ * Gap Search Range:
+ * - From (base_pn - 1) to max_pn_seen
+ */
+struct quic_pnspace {
+ /* ECN counters indexed by dir and ECN codepoint (ECT1, ECT0, CE) */
+ u64 ecn_count[QUIC_ECN_DIR_MAX][QUIC_ECN_MAX];
+ unsigned long *pn_map; /* Received PN bitmap for ACK generation */
+ u16 pn_map_len; /* Length of the PN bitmap (in bits) */
+ u8 need_sack; /* Flag indicating a SACK frame should be sent */
+ u8 sack_path; /* Path used for sending the SACK frame */
+
+ s64 last_max_pn_seen; /* Largest PN seen before pn_map advance */
+ u64 last_max_pn_time; /* Timestamp last_max_pn_seen was received */
+ s64 min_pn_seen; /* Smallest PN received */
+ s64 max_pn_seen; /* Largest PN received */
+ u64 max_pn_time; /* Timestamp max_pn_seen was received */
+ s64 base_pn; /* PN corresponding to the start of the pn_map */
+ u64 time; /* Cached now, or latest socket accept timestamp */
+
+ s64 max_pn_acked_seen; /* Largest PN ACKed by peer */
+ u64 max_pn_acked_time; /* Timestamp max_pn_acked_seen was ACKed */
+ u64 last_sent_time; /* Timestamp last ack-eliciting packet sent */
+ u64 loss_time; /* Timestamp the packet can be declared lost */
+ s64 next_pn; /* Next PN to send */
+
+ u32 max_time_limit; /* Time threshold to trigger pn_map advance */
+ u32 inflight; /* Ack-eliciting bytes in flight */
+};
+
+static inline void
+quic_pnspace_set_max_pn_acked_seen(struct quic_pnspace *space,
+ s64 max_pn_acked_seen)
+{
+ if (space->max_pn_acked_seen >= max_pn_acked_seen)
+ return;
+ space->max_pn_acked_seen = max_pn_acked_seen;
+ space->max_pn_acked_time = quic_ktime_get_us();
+}
+
+static inline void quic_pnspace_set_base_pn(struct quic_pnspace *space, s64 pn)
+{
+ space->base_pn = pn;
+ space->max_pn_seen = space->base_pn - 1;
+ space->last_max_pn_seen = space->max_pn_seen;
+ space->min_pn_seen = space->max_pn_seen;
+
+ space->max_pn_time = space->time;
+ space->last_max_pn_time = space->max_pn_time;
+}
+
+static inline bool quic_pnspace_has_gap(const struct quic_pnspace *space)
+{
+ return space->base_pn != space->max_pn_seen + 1;
+}
+
+static inline void quic_pnspace_inc_ecn_count(struct quic_pnspace *space,
+ u8 ecn)
+{
+ if (!ecn)
+ return;
+ space->ecn_count[QUIC_ECN_LOCAL][ecn - 1]++;
+}
+
+/* Check if any ECN-marked packets were received. */
+static inline bool quic_pnspace_has_ecn_count(struct quic_pnspace *space)
+{
+ return space->ecn_count[QUIC_ECN_LOCAL][QUIC_ECN_ECT0] ||
+ space->ecn_count[QUIC_ECN_LOCAL][QUIC_ECN_ECT1] ||
+ space->ecn_count[QUIC_ECN_LOCAL][QUIC_ECN_CE];
+}
+
+/* Updates the stored ECN counters based on values received in the peer's ACK
+ * frame. Each counter is updated only if the new value is higher.
+ *
+ * Returns: true if CE count was increased (congestion indicated), false
+ * otherwise.
+ */
+static inline bool quic_pnspace_set_ecn_count(struct quic_pnspace *space,
+ u64 *ecn_count)
+{
+ u64 *count = space->ecn_count[QUIC_ECN_PEER];
+
+ if (count[QUIC_ECN_ECT0] < ecn_count[QUIC_ECN_ECT0])
+ count[QUIC_ECN_ECT0] = ecn_count[QUIC_ECN_ECT0];
+ if (count[QUIC_ECN_ECT1] < ecn_count[QUIC_ECN_ECT1])
+ count[QUIC_ECN_ECT1] = ecn_count[QUIC_ECN_ECT1];
+ if (count[QUIC_ECN_CE] < ecn_count[QUIC_ECN_CE]) {
+ count[QUIC_ECN_CE] = ecn_count[QUIC_ECN_CE];
+ return true;
+ }
+ return false;
+}
+
+u16 quic_pnspace_num_gabs(struct quic_pnspace *space,
+ struct quic_gap_ack_block *gabs);
+int quic_pnspace_check(struct quic_pnspace *space, s64 pn);
+int quic_pnspace_mark(struct quic_pnspace *space, s64 pn);
+
+void quic_pnspace_free(struct quic_pnspace *space);
+int quic_pnspace_init(struct quic_pnspace *space);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 7a4a00498b54..3b66cf8a942a 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -38,6 +38,8 @@ static void quic_write_space(struct sock *sk)
static int quic_init_sock(struct sock *sk)
{
+ u8 i;
+
sk->sk_destruct = inet_sock_destruct;
sk->sk_write_space = quic_write_space;
sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
@@ -52,11 +54,21 @@ static int quic_init_sock(struct sock *sk)
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
+ for (i = 0; i < QUIC_PNSPACE_MAX; i++) {
+ if (quic_pnspace_init(quic_pnspace(sk, i)))
+ return -ENOMEM;
+ }
+
return 0;
}
static void quic_destroy_sock(struct sock *sk)
{
+ u8 i;
+
+ for (i = 0; i < QUIC_PNSPACE_MAX; i++)
+ quic_pnspace_free(quic_pnspace(sk, i));
+
quic_path_unbind(sk, quic_paths(sk), 0);
quic_path_unbind(sk, quic_paths(sk), 1);
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 9201ca3edad0..68c7b22d1e88 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -12,6 +12,7 @@
#include <linux/quic.h>
#include "common.h"
+#include "pnspace.h"
#include "family.h"
#include "stream.h"
#include "connid.h"
@@ -43,6 +44,7 @@ struct quic_sock {
struct quic_conn_id_set dest;
struct quic_path_group paths;
struct quic_cong cong;
+ struct quic_pnspace space[QUIC_PNSPACE_MAX];
};
struct quic6_sock {
@@ -105,6 +107,11 @@ static inline struct quic_cong *quic_cong(const struct sock *sk)
return &quic_sk(sk)->cong;
}
+static inline struct quic_pnspace *quic_pnspace(const struct sock *sk, u8 level)
+{
+ return &quic_sk(sk)->space[level % QUIC_CRYPTO_EARLY];
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
^ permalink raw reply related [flat|nested] 17+ messages in thread

* [PATCH net-next v11 11/15] quic: add crypto key derivation and installation
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (9 preceding siblings ...)
2026-03-25 3:47 ` [PATCH net-next v11 10/15] quic: add packet number space Xin Long
@ 2026-03-25 3:47 ` Xin Long
2026-03-25 3:47 ` [PATCH net-next v11 12/15] quic: add crypto packet encryption and decryption Xin Long
` (3 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch introduces 'quic_crypto', a component responsible for QUIC
encryption key derivation and installation across the various key
levels: Initial, Handshake, 0-RTT (Early), and 1-RTT (Application).
It provides helpers to derive and install initial secrets, set traffic
secrets and install the corresponding keys, and perform key updates to
enable forward secrecy. Additionally, it implements stateless reset
token generation, which allows resetting a connection without retained
connection state.
- quic_crypto_initial_keys_install(): Derive and install initial keys.
- quic_crypto_set_cipher(): Allocate all transforms based on the cipher
type provided.
- quic_crypto_set_secret(): Set the traffic secret and install derived
keys.
- quic_crypto_key_update(): Rekey and install new keys to the !phase
side.
- quic_crypto_generate_stateless_reset_token(): Generate token for
stateless reset.
These mechanisms are essential for establishing and maintaining secure
communication throughout the QUIC connection lifecycle.
Signed-off-by: Pengtao He <hepengtao@xiaomi.com>
Signed-off-by: Moritz Buhl <mbuhl@openbsd.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v3:
- Remove lock from quic_net, since Initial packet decryption for ALPN
will be handled serially in a workqueue when ALPN demux is enabled.
v4:
- Use local cipher pointer in quic_crypto_set_secret() to avoid a
warning from Smatch.
v5:
- Change the timestamp variables from u32 to u64, which provides
sufficient precision for timestamps in microsecond.
v8:
- Remove the redundant err initialization in quic_net_init(), since err
is now assigned from quic_crypto_set_cipher().
v10:
- Fix header protection key passed to pr_debug().
- Fix typo: "For example,to ..." -> "For example, to ..." in comment.
- Remove redundant initialization of err in quic_crypto_set_cipher().
v11:
- Note for AI reviews: later patches guarantee all tfms (including
secret_tfm) are set before any quic_crypto function uses them.
- Set maximum line length to 80 characters.
- Merge quic_crypto_rx/tx_keys_derive_and_install() to
quic_crypto_keys_derive_and_install() with an extra parameter to
reduce code duplication (noted by AI review).
- Change flag parameter type to u32 in quic_crypto_set_secret() and
quic_crypto_set_cipher().
---
net/quic/Makefile | 2 +-
net/quic/crypto.c | 566 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/crypto.h | 80 +++++++
net/quic/protocol.c | 14 +-
net/quic/protocol.h | 2 +
net/quic/socket.c | 2 +
net/quic/socket.h | 7 +
7 files changed, 671 insertions(+), 2 deletions(-)
create mode 100644 net/quic/crypto.c
create mode 100644 net/quic/crypto.h
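As background for quic_crypto_hkdf_expand() below, the HkdfLabel layout from
rfc8446#section-7.1 that it serializes into `info` can be sketched in plain
userspace C. This is illustrative only, not kernel code; `build_hkdf_label`
and its signature are invented for the example:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Userspace sketch of the RFC 8446 HkdfLabel encoding, mirroring what
 * quic_crypto_hkdf_expand() builds: a big-endian uint16 output length,
 * a length-prefixed "tls13 " + label vector, and a length-prefixed
 * context vector. Returns the number of bytes written to out.
 */
static size_t build_hkdf_label(uint8_t *out, uint16_t out_key_len,
			       const char *label,
			       const uint8_t *context, uint8_t context_len)
{
	static const char prefix[] = "tls13 ";	/* fixed label prefix */
	size_t label_len = strlen(label);
	uint8_t *p = out;

	*p++ = (uint8_t)(out_key_len >> 8);	/* uint16 length, big-endian */
	*p++ = (uint8_t)(out_key_len & 0xff);
	*p++ = (uint8_t)(sizeof(prefix) - 1 + label_len); /* label vec len */
	memcpy(p, prefix, sizeof(prefix) - 1);
	p += sizeof(prefix) - 1;
	memcpy(p, label, label_len);
	p += label_len;
	*p++ = context_len;			/* context vector length */
	if (context_len) {
		memcpy(p, context, context_len);
		p += context_len;
	}
	return (size_t)(p - out);
}
```

For a 16-byte AEAD key and the "quic key" label with an empty context, this
yields an 18-byte info block for HKDF-Expand.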
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 9d8e18297911..58bb18f7926d 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o pnspace.o
+ cong.o pnspace.o crypto.o
diff --git a/net/quic/crypto.c b/net/quic/crypto.c
new file mode 100644
index 000000000000..218d3fe49dff
--- /dev/null
+++ b/net/quic/crypto.c
@@ -0,0 +1,566 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Cryptographic key derivation and packet protection for QUIC.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include <crypto/skcipher.h>
+#include <linux/skbuff.h>
+#include <crypto/aead.h>
+#include <crypto/hkdf.h>
+#include <linux/quic.h>
+#include <net/tls.h>
+
+#include "common.h"
+#include "crypto.h"
+
+#define QUIC_RANDOM_DATA_LEN 32
+
+static u8 quic_random_data[QUIC_RANDOM_DATA_LEN] __read_mostly;
+
+/* HKDF-Extract. */
+static int quic_crypto_hkdf_extract(struct crypto_shash *tfm,
+ struct quic_data *srt,
+ struct quic_data *hash,
+ struct quic_data *key)
+{
+ return hkdf_extract(tfm, hash->data, hash->len, srt->data, srt->len,
+ key->data);
+}
+
+#define QUIC_MAX_INFO_LEN 256
+
+/* HKDF-Expand-Label. */
+static int quic_crypto_hkdf_expand(struct crypto_shash *tfm,
+ struct quic_data *srt,
+ struct quic_data *label,
+ struct quic_data *hash,
+ struct quic_data *key)
+{
+ u8 info[QUIC_MAX_INFO_LEN], *p = info;
+ u8 LABEL[] = "tls13 ";
+ u32 infolen;
+ int err;
+
+ /* rfc8446#section-7.1:
+ *
+ * HKDF-Expand-Label(Secret, Label, Context, Length) =
+ * HKDF-Expand(Secret, HkdfLabel, Length)
+ *
+ * Where HkdfLabel is specified as:
+ *
+ * struct {
+ * uint16 length = Length;
+ * opaque label<7..255> = "tls13 " + Label;
+ * opaque context<0..255> = Context;
+ * } HkdfLabel;
+ */
+ *p++ = (u8)(key->len / QUIC_MAX_INFO_LEN);
+ *p++ = (u8)(key->len % QUIC_MAX_INFO_LEN);
+ *p++ = (u8)(sizeof(LABEL) - 1 + label->len);
+ p = quic_put_data(p, LABEL, sizeof(LABEL) - 1);
+ p = quic_put_data(p, label->data, label->len);
+ if (hash) {
+ *p++ = (u8)hash->len;
+ p = quic_put_data(p, hash->data, hash->len);
+ } else {
+ *p++ = 0;
+ }
+ infolen = (u32)(p - info);
+
+ err = crypto_shash_setkey(tfm, srt->data, srt->len);
+ if (err)
+ return err;
+
+ return hkdf_expand(tfm, info, infolen, key->data, key->len);
+}
+
+#define KEY_LABEL_V1 "quic key"
+#define IV_LABEL_V1 "quic iv"
+#define HP_KEY_LABEL_V1 "quic hp"
+
+#define KU_LABEL_V1 "quic ku"
+
+/* rfc9369#section-3.3.2:
+ *
+ * The labels used in rfc9001 to derive packet protection keys, header
+ * protection keys, Retry Integrity Tag keys, and key updates change from "quic
+ * key" to "quicv2 key", from "quic iv" to "quicv2 iv", from "quic hp" to
+ * "quicv2 hp", and from "quic ku" to "quicv2 ku".
+ */
+#define KEY_LABEL_V2 "quicv2 key"
+#define IV_LABEL_V2 "quicv2 iv"
+#define HP_KEY_LABEL_V2 "quicv2 hp"
+
+#define KU_LABEL_V2 "quicv2 ku"
+
+/* Packet Protection Keys. */
+static int quic_crypto_keys_derive(struct crypto_shash *tfm,
+ struct quic_data *s, struct quic_data *k,
+ struct quic_data *i, struct quic_data *hp_k,
+ u32 version)
+{
+ struct quic_data hp_k_l = {HP_KEY_LABEL_V1, strlen(HP_KEY_LABEL_V1)};
+ struct quic_data k_l = {KEY_LABEL_V1, strlen(KEY_LABEL_V1)};
+ struct quic_data i_l = {IV_LABEL_V1, strlen(IV_LABEL_V1)};
+ struct quic_data z = {};
+ int err;
+
+ /* rfc9001#section-5.1:
+ *
+ * The current encryption level secret and the label "quic key" are
+ * input to the KDF to produce the AEAD key; the label "quic iv" is
+ * used to derive the Initialization Vector (IV). The header protection
+ * key uses the "quic hp" label. Using these labels provides key
+ * separation between QUIC and TLS.
+ */
+ if (version == QUIC_VERSION_V2) {
+ quic_data(&hp_k_l, HP_KEY_LABEL_V2, strlen(HP_KEY_LABEL_V2));
+ quic_data(&k_l, KEY_LABEL_V2, strlen(KEY_LABEL_V2));
+ quic_data(&i_l, IV_LABEL_V2, strlen(IV_LABEL_V2));
+ }
+
+ err = quic_crypto_hkdf_expand(tfm, s, &k_l, &z, k);
+ if (err)
+ return err;
+ err = quic_crypto_hkdf_expand(tfm, s, &i_l, &z, i);
+ if (err)
+ return err;
+ /* Don't change hp key for key update. */
+ if (!hp_k)
+ return 0;
+
+ return quic_crypto_hkdf_expand(tfm, s, &hp_k_l, &z, hp_k);
+}
+
+/* Derive and install reception (RX) or transmission (TX) packet protection
+ * keys for the current key phase. This installs AEAD protection key, IV, and
+ * optionally header protection key.
+ */
+static int quic_crypto_keys_derive_and_install(struct quic_crypto *crypto,
+ bool rx)
+{
+ struct quic_data srt = {}, k, iv, hp_k = {}, *hp = NULL;
+ u8 key[QUIC_KEY_LEN], hp_key[QUIC_KEY_LEN] = {};
+ int err, phase = crypto->key_phase;
+ u32 keylen, ivlen = QUIC_IV_LEN;
+ struct crypto_skcipher *hp_tfm;
+ struct crypto_aead *tfm;
+
+ keylen = crypto->cipher->keylen;
+ quic_data(&k, key, keylen);
+
+ if (rx) {
+ quic_data(&srt, crypto->rx_secret, crypto->cipher->secretlen);
+ quic_data(&iv, crypto->rx_iv[phase], ivlen);
+ tfm = crypto->rx_tfm[phase];
+ hp_tfm = crypto->rx_hp_tfm;
+ } else {
+ quic_data(&srt, crypto->tx_secret, crypto->cipher->secretlen);
+ quic_data(&iv, crypto->tx_iv[phase], ivlen);
+ tfm = crypto->tx_tfm[phase];
+ hp_tfm = crypto->tx_hp_tfm;
+ }
+
+ /* Only derive header protection key when not in key update. */
+ if (!crypto->key_pending)
+ hp = quic_data(&hp_k, hp_key, keylen);
+ err = quic_crypto_keys_derive(crypto->secret_tfm, &srt, &k, &iv, hp,
+ crypto->version);
+ if (err)
+ goto out;
+ err = crypto_aead_setauthsize(tfm, QUIC_TAG_LEN);
+ if (err)
+ goto out;
+ err = crypto_aead_setkey(tfm, key, keylen);
+ if (err)
+ goto out;
+ if (hp) {
+ err = crypto_skcipher_setkey(hp_tfm, hp_key, keylen);
+ if (err)
+ goto out;
+ }
+ pr_debug("%s: rx: %d k: %16phN, iv: %12phN, hp_k:%16phN\n", __func__,
+ rx, k.data, iv.data, hp_key);
+out:
+ memzero_explicit(key, sizeof(key));
+ memzero_explicit(hp_key, sizeof(hp_key));
+ return err;
+}
+
+#define QUIC_CIPHER_MIN TLS_CIPHER_AES_GCM_128
+#define QUIC_CIPHER_MAX TLS_CIPHER_CHACHA20_POLY1305
+
+#define TLS_CIPHER_AES_GCM_128_SECRET_SIZE 32
+#define TLS_CIPHER_AES_GCM_256_SECRET_SIZE 48
+#define TLS_CIPHER_AES_CCM_128_SECRET_SIZE 32
+#define TLS_CIPHER_CHACHA20_POLY1305_SECRET_SIZE 32
+
+#define CIPHER_DESC(type, aead_n, skc_n, sha_n)[type - QUIC_CIPHER_MIN] = { \
+ .secretlen = type ## _SECRET_SIZE, \
+ .keylen = type ## _KEY_SIZE, \
+ .aead = aead_n, \
+ .skc = skc_n, \
+ .shash = sha_n, \
+}
+
+static struct quic_cipher ciphers[QUIC_CIPHER_MAX + 1 - QUIC_CIPHER_MIN] = {
+ CIPHER_DESC(TLS_CIPHER_AES_GCM_128,
+ "gcm(aes)", "ecb(aes)", "hmac(sha256)"),
+ CIPHER_DESC(TLS_CIPHER_AES_GCM_256,
+ "gcm(aes)", "ecb(aes)", "hmac(sha384)"),
+ CIPHER_DESC(TLS_CIPHER_AES_CCM_128,
+ "ccm(aes)", "ecb(aes)", "hmac(sha256)"),
+ CIPHER_DESC(TLS_CIPHER_CHACHA20_POLY1305,
+ "rfc7539(chacha20,poly1305)", "chacha20", "hmac(sha256)"),
+};
+
+int quic_crypto_set_cipher(struct quic_crypto *crypto, u32 type, u32 flag)
+{
+ struct quic_cipher *cipher;
+ void *tfm;
+ int err;
+
+ if (type < QUIC_CIPHER_MIN || type > QUIC_CIPHER_MAX)
+ return -EINVAL;
+
+ cipher = &ciphers[type - QUIC_CIPHER_MIN];
+ tfm = crypto_alloc_shash(cipher->shash, 0, 0);
+ if (IS_ERR(tfm))
+ return PTR_ERR(tfm);
+ crypto->secret_tfm = tfm;
+
+ /* Request only synchronous crypto by setting CRYPTO_ALG_ASYNC in the
+ * mask, so tag generation does not rely on async completion callbacks.
+ */
+ tfm = crypto_alloc_aead(cipher->aead, 0, CRYPTO_ALG_ASYNC);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tag_tfm = tfm;
+
+ /* Allocate AEAD and HP transform for each RX key phase. */
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->rx_tfm[0] = tfm;
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->rx_tfm[1] = tfm;
+ tfm = crypto_alloc_sync_skcipher(cipher->skc, 0, 0);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->rx_hp_tfm = tfm;
+
+ /* Allocate AEAD and HP transform for each TX key phase. */
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tx_tfm[0] = tfm;
+ tfm = crypto_alloc_aead(cipher->aead, 0, flag);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tx_tfm[1] = tfm;
+ tfm = crypto_alloc_sync_skcipher(cipher->skc, 0, 0);
+ if (IS_ERR(tfm)) {
+ err = PTR_ERR(tfm);
+ goto err;
+ }
+ crypto->tx_hp_tfm = tfm;
+
+ crypto->cipher = cipher;
+ crypto->cipher_type = type;
+ return 0;
+err:
+ quic_crypto_free(crypto);
+ return err;
+}
+
+int quic_crypto_set_secret(struct quic_crypto *crypto,
+ struct quic_crypto_secret *srt,
+ u32 version, u32 flag)
+{
+ struct quic_cipher *cipher;
+ int err;
+
+ /* If no cipher has been initialized yet, set it up. */
+ if (!crypto->cipher) {
+ err = quic_crypto_set_cipher(crypto, srt->type, flag);
+ if (err)
+ return err;
+ }
+ cipher = crypto->cipher;
+
+ /* Handle RX path setup. */
+ if (!srt->send) {
+ crypto->version = version;
+ memcpy(crypto->rx_secret, srt->secret, cipher->secretlen);
+ err = quic_crypto_keys_derive_and_install(crypto, true);
+ if (err)
+ return err;
+ crypto->recv_ready = 1;
+ return 0;
+ }
+
+ /* Handle TX path setup. */
+ crypto->version = version;
+ memcpy(crypto->tx_secret, srt->secret, cipher->secretlen);
+ err = quic_crypto_keys_derive_and_install(crypto, false);
+ if (err)
+ return err;
+ crypto->send_ready = 1;
+ return 0;
+}
+
+int quic_crypto_get_secret(struct quic_crypto *crypto,
+ struct quic_crypto_secret *srt)
+{
+ u8 *secret;
+
+ if (!crypto->cipher)
+ return -EINVAL;
+ srt->type = crypto->cipher_type;
+ secret = srt->send ? crypto->tx_secret : crypto->rx_secret;
+ memcpy(srt->secret, secret, crypto->cipher->secretlen);
+ return 0;
+}
+
+/* Initiating a Key Update. */
+int quic_crypto_key_update(struct quic_crypto *crypto)
+{
+ u8 tx_secret[QUIC_SECRET_LEN], rx_secret[QUIC_SECRET_LEN];
+ struct quic_data l = {KU_LABEL_V1, strlen(KU_LABEL_V1)};
+ struct quic_data z = {}, k, srt;
+ u32 secret_len;
+ int err;
+
+ if (crypto->key_pending || !crypto->recv_ready)
+ return -EINVAL;
+
+ /* rfc9001#section-6.1:
+ *
+ * Endpoints maintain separate read and write secrets for packet
+ * protection. An endpoint initiates a key update by updating its
+ * packet protection write secret and using that to protect new
+ * packets. The endpoint creates a new write secret from the existing
+ * write secret. This uses the KDF function provided by TLS with a
+ * label of "quic ku". The corresponding key and IV are created from
+ * that secret. The header protection key is not updated.
+ *
+ * For example, to update write keys with TLS 1.3, HKDF-Expand-Label is
+ * used as:
+ * secret_<n+1> = HKDF-Expand-Label(secret_<n>, "quic ku",
+ * "", Hash.length)
+ */
+ secret_len = crypto->cipher->secretlen;
+ if (crypto->version == QUIC_VERSION_V2)
+ quic_data(&l, KU_LABEL_V2, strlen(KU_LABEL_V2));
+
+ crypto->key_pending = 1;
+ memcpy(tx_secret, crypto->tx_secret, secret_len);
+ memcpy(rx_secret, crypto->rx_secret, secret_len);
+ crypto->key_phase = !crypto->key_phase;
+
+ quic_data(&srt, tx_secret, secret_len);
+ quic_data(&k, crypto->tx_secret, secret_len);
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &srt, &l, &z, &k);
+ if (err)
+ goto err;
+ err = quic_crypto_keys_derive_and_install(crypto, false);
+ if (err)
+ goto err;
+
+ quic_data(&srt, rx_secret, secret_len);
+ quic_data(&k, crypto->rx_secret, secret_len);
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &srt, &l, &z, &k);
+ if (err)
+ goto err;
+ err = quic_crypto_keys_derive_and_install(crypto, true);
+ if (err)
+ goto err;
+out:
+ memzero_explicit(tx_secret, sizeof(tx_secret));
+ memzero_explicit(rx_secret, sizeof(rx_secret));
+ return err;
+err:
+ crypto->key_pending = 0;
+ memcpy(crypto->tx_secret, tx_secret, secret_len);
+ memcpy(crypto->rx_secret, rx_secret, secret_len);
+ crypto->key_phase = !crypto->key_phase;
+ goto out;
+}
+
+void quic_crypto_free(struct quic_crypto *crypto)
+{
+ if (crypto->tag_tfm)
+ crypto_free_aead(crypto->tag_tfm);
+ if (crypto->rx_tfm[0])
+ crypto_free_aead(crypto->rx_tfm[0]);
+ if (crypto->rx_tfm[1])
+ crypto_free_aead(crypto->rx_tfm[1]);
+ if (crypto->tx_tfm[0])
+ crypto_free_aead(crypto->tx_tfm[0]);
+ if (crypto->tx_tfm[1])
+ crypto_free_aead(crypto->tx_tfm[1]);
+ if (crypto->secret_tfm)
+ crypto_free_shash(crypto->secret_tfm);
+ if (crypto->rx_hp_tfm)
+ crypto_free_skcipher(crypto->rx_hp_tfm);
+ if (crypto->tx_hp_tfm)
+ crypto_free_skcipher(crypto->tx_hp_tfm);
+
+ memzero_explicit(crypto, offsetof(struct quic_crypto, send_offset));
+}
+
+#define QUIC_INITIAL_SALT_V1 \
+ "\x38\x76\x2c\xf7\xf5\x59\x34\xb3\x4d\x17" \
+ "\x9a\xe6\xa4\xc8\x0c\xad\xcc\xbb\x7f\x0a"
+
+#define QUIC_INITIAL_SALT_V2 \
+ "\x0d\xed\xe3\xde\xf7\x00\xa6\xdb\x81\x93" \
+ "\x81\xbe\x6e\x26\x9d\xcb\xf9\xbd\x2e\xd9"
+
+#define QUIC_INITIAL_SALT_LEN 20
+
+/* Initial Secrets. */
+int quic_crypto_initial_keys_install(struct quic_crypto *crypto,
+ struct quic_conn_id *conn_id,
+ u32 version, bool is_serv)
+{
+ u8 secret[TLS_CIPHER_AES_GCM_128_SECRET_SIZE];
+ struct quic_data salt, s, k, l, dcid, z = {};
+ struct quic_crypto_secret srt = {};
+ char *tl, *rl, *sal;
+ int err;
+
+ /* rfc9001#section-5.2:
+ *
+ * The secret used by clients to construct Initial packets uses the PRK
+ * and the label "client in" as input to the HKDF-Expand-Label function
+ * from TLS [TLS13] to produce a 32-byte secret. Packets constructed by
+ * the server use the same process with the label "server in". The hash
+ * function for HKDF when deriving initial secrets and keys is SHA-256
+ * [SHA].
+ *
+ * This process in pseudocode is:
+ *
+ * initial_salt = 0x38762cf7f55934b34d179ae6a4c80cadccbb7f0a
+ * initial_secret = HKDF-Extract(initial_salt,
+ * client_dst_connection_id)
+ *
+ * client_initial_secret = HKDF-Expand-Label(initial_secret,
+ * "client in", "",
+ * Hash.length)
+ * server_initial_secret = HKDF-Expand-Label(initial_secret,
+ * "server in", "",
+ * Hash.length)
+ */
+ if (is_serv) {
+ rl = "client in";
+ tl = "server in";
+ } else {
+ tl = "client in";
+ rl = "server in";
+ }
+ sal = QUIC_INITIAL_SALT_V1;
+ if (version == QUIC_VERSION_V2)
+ sal = QUIC_INITIAL_SALT_V2;
+ quic_data(&salt, sal, QUIC_INITIAL_SALT_LEN);
+ quic_data(&dcid, conn_id->data, conn_id->len);
+ quic_data(&s, secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ err = quic_crypto_hkdf_extract(crypto->secret_tfm, &salt, &dcid, &s);
+ if (err)
+ goto out;
+
+ quic_data(&l, tl, strlen(tl));
+ quic_data(&k, srt.secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ srt.type = TLS_CIPHER_AES_GCM_128;
+ srt.send = 1;
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &s, &l, &z, &k);
+ if (err)
+ goto out;
+ err = quic_crypto_set_secret(crypto, &srt, version, 0);
+ if (err)
+ goto out;
+
+ quic_data(&l, rl, strlen(rl));
+ quic_data(&k, srt.secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ srt.type = TLS_CIPHER_AES_GCM_128;
+ srt.send = 0;
+ err = quic_crypto_hkdf_expand(crypto->secret_tfm, &s, &l, &z, &k);
+ if (err)
+ goto out;
+ err = quic_crypto_set_secret(crypto, &srt, version, 0);
+out:
+ memzero_explicit(secret, sizeof(secret));
+ memzero_explicit(&srt, sizeof(srt));
+ return err;
+}
+
+/* Generate a derived key using HKDF-Extract and HKDF-Expand with a given
+ * label.
+ */
+static int quic_crypto_generate_key(struct quic_crypto *crypto, void *data,
+ u32 len, char *label, u8 *token,
+ u32 key_len)
+{
+ struct crypto_shash *tfm = crypto->secret_tfm;
+ u8 secret[TLS_CIPHER_AES_GCM_128_SECRET_SIZE];
+ struct quic_data salt, s, l, k, z = {};
+ int err;
+
+ quic_data(&salt, data, len);
+ quic_data(&k, quic_random_data, QUIC_RANDOM_DATA_LEN);
+ quic_data(&s, secret, TLS_CIPHER_AES_GCM_128_SECRET_SIZE);
+ err = quic_crypto_hkdf_extract(tfm, &salt, &k, &s);
+ if (err)
+ goto out;
+
+ quic_data(&l, label, strlen(label));
+ quic_data(&k, token, key_len);
+ err = quic_crypto_hkdf_expand(tfm, &s, &l, &z, &k);
+out:
+ memzero_explicit(secret, sizeof(secret));
+ return err;
+}
+
+/* Derive a stateless reset token from connection-specific input. */
+int quic_crypto_generate_stateless_reset_token(struct quic_crypto *crypto,
+ void *data, u32 len, u8 *key,
+ u32 key_len)
+{
+ return quic_crypto_generate_key(crypto, data, len, "stateless_reset",
+ key, key_len);
+}
+
+/* Derive a session ticket key using HKDF from connection-specific input. */
+int quic_crypto_generate_session_ticket_key(struct quic_crypto *crypto,
+ void *data, u32 len, u8 *key,
+ u32 key_len)
+{
+ return quic_crypto_generate_key(crypto, data, len, "session_ticket",
+ key, key_len);
+}
+
+void quic_crypto_init(void)
+{
+ get_random_bytes(quic_random_data, QUIC_RANDOM_DATA_LEN);
+}
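The flip-then-rollback flow of quic_crypto_key_update() in the file above can
be modeled with a small userspace toy. The struct, sizes, and the kdf callback
are stand-ins invented for this sketch; the real code derives next-generation
secrets via HKDF-Expand-Label with the "quic ku"/"quicv2 ku" label and clears
key_pending later, once the peer responds in the new phase:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Toy model (not kernel code) of the key-phase bookkeeping in
 * quic_crypto_key_update(): flip the phase and derive both next secrets
 * up front, and roll everything back if any derivation step fails.
 */
struct toy_crypto {
	uint8_t tx_secret[8];
	uint8_t rx_secret[8];
	bool key_pending;
	unsigned key_phase;	/* 0 or 1 */
};

typedef int (*toy_kdf_t)(const uint8_t in[8], uint8_t out[8]);

static int toy_kdf_ok(const uint8_t in[8], uint8_t out[8])
{
	for (int i = 0; i < 8; i++)	/* stand-in for HKDF-Expand-Label */
		out[i] = (uint8_t)(in[i] + 1);
	return 0;
}

static int toy_kdf_fail(const uint8_t in[8], uint8_t out[8])
{
	(void)in; (void)out;
	return -1;
}

static int toy_key_update(struct toy_crypto *c, toy_kdf_t kdf)
{
	uint8_t old_tx[8], old_rx[8];
	int err;

	if (c->key_pending)
		return -1;	/* previous update still in flight */

	memcpy(old_tx, c->tx_secret, 8);
	memcpy(old_rx, c->rx_secret, 8);
	c->key_pending = true;
	c->key_phase = !c->key_phase;

	err = kdf(old_tx, c->tx_secret);
	if (!err)
		err = kdf(old_rx, c->rx_secret);
	if (err) {	/* roll back to the previous generation */
		c->key_pending = false;
		memcpy(c->tx_secret, old_tx, 8);
		memcpy(c->rx_secret, old_rx, 8);
		c->key_phase = !c->key_phase;
	}
	return err;
}
```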
diff --git a/net/quic/crypto.h b/net/quic/crypto.h
new file mode 100644
index 000000000000..f9450e55d6dd
--- /dev/null
+++ b/net/quic/crypto.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#define QUIC_TAG_LEN 16
+#define QUIC_IV_LEN 12
+#define QUIC_KEY_LEN 32
+#define QUIC_SECRET_LEN 48
+
+#define QUIC_TOKEN_FLAG_REGULAR 0
+#define QUIC_TOKEN_FLAG_RETRY 1
+#define QUIC_TOKEN_TIMEOUT_RETRY 3000000
+#define QUIC_TOKEN_TIMEOUT_REGULAR 600000000
+
+struct quic_cipher {
+ u32 secretlen; /* Length of the traffic secret */
+ u32 keylen; /* Length of the AEAD key */
+
+ char *shash; /* Name of hash algorithm used for key derivation */
+ char *aead; /* Name of AEAD algorithm used for payload en/decryption */
+ char *skc; /* Name of cipher algorithm used for header protection */
+};
+
+struct quic_crypto {
+ struct crypto_skcipher *tx_hp_tfm; /* TX header protection tfm */
+ struct crypto_skcipher *rx_hp_tfm; /* RX header protection tfm */
+ struct crypto_shash *secret_tfm; /* Key derivation (HKDF) tfm */
+ struct crypto_aead *tx_tfm[2]; /* AEAD tfm for TX (key phase 0 and 1) */
+ struct crypto_aead *rx_tfm[2]; /* AEAD tfm for RX (key phase 0 and 1) */
+ struct crypto_aead *tag_tfm; /* AEAD tfm used for token validation */
+ struct quic_cipher *cipher; /* Cipher info (selected cipher suite) */
+ u32 cipher_type; /* Cipher suite (e.g., AES_GCM_128, etc.) */
+
+ u8 tx_secret[QUIC_SECRET_LEN]; /* TX secret (derived or from user) */
+ u8 rx_secret[QUIC_SECRET_LEN]; /* RX secret (derived or from user) */
+ u8 tx_iv[2][QUIC_IV_LEN]; /* IVs for TX (key phase 0 and 1) */
+ u8 rx_iv[2][QUIC_IV_LEN]; /* IVs for RX (key phase 0 and 1) */
+
+ /* Timestamp 1st packet sent after key update */
+ u64 key_update_send_time;
+ u64 key_update_time; /* Timestamp old keys retained after key update */
+ u32 version; /* QUIC version in use */
+
+ u8 ticket_ready:1; /* True if a session ticket is ready to read */
+ u8 key_pending:1; /* A key update is in progress */
+ u8 send_ready:1; /* TX encryption context is initialized */
+ u8 recv_ready:1; /* RX decryption context is initialized */
+ u8 key_phase:1; /* Current key phase being used (0 or 1) */
+
+ u64 send_offset; /* Number of handshake bytes sent by user */
+ u64 recv_offset; /* Number of handshake bytes read by user */
+};
+
+int quic_crypto_set_secret(struct quic_crypto *crypto,
+ struct quic_crypto_secret *srt,
+ u32 version, u32 flag);
+int quic_crypto_get_secret(struct quic_crypto *crypto,
+ struct quic_crypto_secret *srt);
+int quic_crypto_set_cipher(struct quic_crypto *crypto,
+ u32 type, u32 flag);
+int quic_crypto_key_update(struct quic_crypto *crypto);
+
+int quic_crypto_initial_keys_install(struct quic_crypto *crypto,
+ struct quic_conn_id *conn_id,
+ u32 version, bool is_serv);
+int quic_crypto_generate_session_ticket_key(struct quic_crypto *crypto,
+ void *data, u32 len, u8 *key,
+ u32 key_len);
+int quic_crypto_generate_stateless_reset_token(struct quic_crypto *crypto,
+ void *data, u32 len, u8 *key,
+ u32 key_len);
+
+void quic_crypto_free(struct quic_crypto *crypto);
+void quic_crypto_init(void);
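The ciphers[] table filled in by CIPHER_DESC in crypto.c is indexed by
`type - QUIC_CIPHER_MIN` with designated initializers, guarded by the range
check in quic_crypto_set_cipher(). A hypothetical standalone rendition of
that lookup pattern (the constants mirror the uapi TLS cipher values but are
local to this sketch):

```c
#include <stddef.h>

/* Hypothetical illustration of the bounds-check-then-index pattern used
 * by quic_crypto_set_cipher() over a designated-initializer table.
 */
enum {
	TOY_CIPHER_AES_GCM_128 = 51,	/* TLS_CIPHER_AES_GCM_128 */
	TOY_CIPHER_AES_GCM_256 = 52,
	TOY_CIPHER_AES_CCM_128 = 53,
	TOY_CIPHER_CHACHA20_POLY1305 = 54,
};
#define TOY_CIPHER_MIN TOY_CIPHER_AES_GCM_128
#define TOY_CIPHER_MAX TOY_CIPHER_CHACHA20_POLY1305

struct toy_cipher {
	const char *aead;	/* crypto API algorithm name */
	unsigned keylen;	/* AEAD key length in bytes */
};

static const struct toy_cipher toy_ciphers[TOY_CIPHER_MAX + 1 - TOY_CIPHER_MIN] = {
	[TOY_CIPHER_AES_GCM_128 - TOY_CIPHER_MIN] = { "gcm(aes)", 16 },
	[TOY_CIPHER_AES_GCM_256 - TOY_CIPHER_MIN] = { "gcm(aes)", 32 },
	[TOY_CIPHER_AES_CCM_128 - TOY_CIPHER_MIN] = { "ccm(aes)", 16 },
	[TOY_CIPHER_CHACHA20_POLY1305 - TOY_CIPHER_MIN] =
		{ "rfc7539(chacha20,poly1305)", 32 },
};

static const struct toy_cipher *toy_cipher_get(unsigned type)
{
	if (type < TOY_CIPHER_MIN || type > TOY_CIPHER_MAX)
		return NULL;	/* mirrors the -EINVAL path */
	return &toy_ciphers[type - TOY_CIPHER_MIN];
}
```

The table stays dense because the supported cipher-suite values are
contiguous; a gap in the range would leave a zeroed entry that callers would
need to reject.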
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index fd528fb2fc46..7f055c88bbde 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -256,15 +256,24 @@ static void quic_protosw_exit(void)
static int __net_init quic_net_init(struct net *net)
{
struct quic_net *qn = quic_net(net);
- int err = 0;
+ int err;
qn->stat = alloc_percpu(struct quic_mib);
if (!qn->stat)
return -ENOMEM;
+ err = quic_crypto_set_cipher(&qn->crypto, TLS_CIPHER_AES_GCM_128,
+ CRYPTO_ALG_ASYNC);
+ if (err) {
+ free_percpu(qn->stat);
+ qn->stat = NULL;
+ return err;
+ }
+
#if IS_ENABLED(CONFIG_PROC_FS)
err = quic_net_proc_init(net);
if (err) {
+ quic_crypto_free(&qn->crypto);
free_percpu(qn->stat);
qn->stat = NULL;
}
@@ -279,6 +288,7 @@ static void __net_exit quic_net_exit(struct net *net)
#if IS_ENABLED(CONFIG_PROC_FS)
quic_net_proc_exit(net);
#endif
+ quic_crypto_free(&qn->crypto);
free_percpu(qn->stat);
qn->stat = NULL;
}
@@ -333,6 +343,8 @@ static __init int quic_init(void)
sysctl_quic_wmem[1] = 16 * 1024;
sysctl_quic_wmem[2] = max(64 * 1024, max_share);
+ quic_crypto_init();
+
err = percpu_counter_init(&quic_sockets_allocated, 0, GFP_KERNEL);
if (err)
goto err_percpu_counter;
diff --git a/net/quic/protocol.h b/net/quic/protocol.h
index fbd0fe39eccc..b8584e72ff14 100644
--- a/net/quic/protocol.h
+++ b/net/quic/protocol.h
@@ -49,6 +49,8 @@ struct quic_net {
#if IS_ENABLED(CONFIG_PROC_FS)
struct proc_dir_entry *proc_net; /* procfs entry for QUIC stats */
#endif
+ /* Context for decrypting Initial packets for ALPN */
+ struct quic_crypto crypto;
};
struct quic_net *quic_net(struct net *net);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 3b66cf8a942a..4016de4b39fe 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -68,6 +68,8 @@ static void quic_destroy_sock(struct sock *sk)
for (i = 0; i < QUIC_PNSPACE_MAX; i++)
quic_pnspace_free(quic_pnspace(sk, i));
+ for (i = 0; i < QUIC_CRYPTO_MAX; i++)
+ quic_crypto_free(quic_crypto(sk, i));
quic_path_unbind(sk, quic_paths(sk), 0);
quic_path_unbind(sk, quic_paths(sk), 1);
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 68c7b22d1e88..d7811391cc8b 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -16,6 +16,7 @@
#include "family.h"
#include "stream.h"
#include "connid.h"
+#include "crypto.h"
#include "path.h"
#include "cong.h"
@@ -45,6 +46,7 @@ struct quic_sock {
struct quic_path_group paths;
struct quic_cong cong;
struct quic_pnspace space[QUIC_PNSPACE_MAX];
+ struct quic_crypto crypto[QUIC_CRYPTO_MAX];
};
struct quic6_sock {
@@ -112,6 +114,11 @@ static inline struct quic_pnspace *quic_pnspace(const struct sock *sk, u8 level)
return &quic_sk(sk)->space[level % QUIC_CRYPTO_EARLY];
}
+static inline struct quic_crypto *quic_crypto(const struct sock *sk, u8 level)
+{
+ return &quic_sk(sk)->crypto[level];
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
--
2.47.1
* [PATCH net-next v11 12/15] quic: add crypto packet encryption and decryption
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
This patch adds core support for packet-level encryption and decryption
using AEAD, including both payload protection and QUIC header protection.
It introduces helpers to encrypt packets before transmission and to
remove header protection and decrypt payloads upon reception, in line
with QUIC's cryptographic requirements.
- quic_crypto_encrypt(): Perform header protection and payload
encryption (TX).
- quic_crypto_decrypt(): Perform header protection removal and
payload decryption (RX).
The patch also includes support for Retry token handling. It provides
helpers to compute the Retry integrity tag, generate tokens for address
validation, and verify tokens received from clients during the
handshake phase.
- quic_crypto_get_retry_tag(): Compute tag for Retry packets.
- quic_crypto_generate_token(): Generate retry token.
- quic_crypto_verify_token(): Verify retry token.
These additions establish the cryptographic primitives necessary for
secure QUIC packet exchange and address validation.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v3:
- quic_crypto_decrypt(): return -EKEYREVOKED to defer key updates to
the workqueue when the packet is not marked backlog, since
quic_crypto_key_update()/crypto_aead_setkey() must run in process
context.
- Only perform header decryption if !cb->number_len to avoid double
decryption when a key-update packet (with flipped key_phase)
re-enters quic_crypto_decrypt() from the workqueue.
v4:
- skb_dst_force() is already called in quic_udp_rcv() on the recv path,
  so remove its call from quic_crypto_payload_decrypt(), which may be
  called without RCU protection.
- Remove the incorrect (void *) cast to quic_crypto_done.
v5:
- Use skb_cb->crypto_ctx for async crypto context freeing, which is
safer than using skb_shinfo(skb)->destructor_arg.
- Remove skb_cb->number_max, reuse number as the largest previously seen
  packet number, and update quic_crypto_get_header() accordingly.
- Change timestamp variables from u32 to u64 and use quic_ktime_get_us()
for current timestamps, as jiffies_to_usecs() is not accurate enough.
v6:
- Rename quic_crypto_get_header() to quic_crypto_get_number(), move
key_phase parsing out of it, check cb->length when parsing packet
number, and update all callers.
- Use hdr->pnl + 1 instead of (*p & QUIC_PN_LEN_BITS_MASK) + 1 to get
  the packet number length, and remove the now-unnecessary len variable
  and QUIC_PN_LEN_BITS_MASK macro from quic_crypto_header_decrypt().
v8:
- Move skb_cow_data() from quic_crypto_payload_decrypt() to
quic_crypto_header_decrypt(), as header decryption also writes to
the skb and is invoked earlier than payload decryption on RX path.
v10:
- Fix double period at end of the comment for quic_crypto_encrypt().
v11:
- Note for AI review: call quic_crypto_aead_mem_alloc() after
skb_cow_data() as it uses the return value for nsg. If the allocation
fails, skb is dropped, so no rollback of prior skb modifications is
needed.
- Set maximum line length to 80 characters.
- Add a check for skb->len in quic_crypto_get_retry_tag().
- Also reset key_update_send_time when key_pending is cleared in
quic_crypto_decrypt().
- Handle -EBUSY returned from crypto_aead_en/decrypt() and return when
err == -EINPROGRESS in quic_crypto_done().
- Extract quic_crypto_token_init() from quic_crypto_generate_token() and
quic_crypto_verify_token().
- Merge quic_crypto_header_en/decrypt() to quic_crypto_header_protect()
with an extra parameter to reduce code duplication.
- Merge quic_crypto_payload_en/decrypt() to quic_crypto_payload_protect()
with an extra parameter to reduce code duplication (noted by AI
review).
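The merged quic_crypto_header_protect() mentioned above applies the
rfc9001#section-5.4.1 masking, which is its own inverse: the same XOR adds
protection on TX and removes it on RX, with only the first-byte mask width
differing between long and short headers. A userspace sketch (names invented;
the real code derives mask[] with AES-ECB or ChaCha20 over the packet sample):

```c
#include <stdint.h>

/* Sketch of RFC 9001 section 5.4.1 header masking. pkt points at the
 * first byte of the packet; pn_offset/pn_len locate the packet number.
 * On real RX, pn_len is only known after the first byte is unmasked.
 */
#define TOY_HEADER_FORM_BIT   0x80
#define TOY_LONG_HEADER_MASK  0x0f
#define TOY_SHORT_HEADER_MASK 0x1f

static void toy_hp_apply(uint8_t *pkt, unsigned pn_offset,
			 unsigned pn_len, const uint8_t mask[5])
{
	uint8_t h = (pkt[0] & TOY_HEADER_FORM_BIT) ? TOY_LONG_HEADER_MASK
						   : TOY_SHORT_HEADER_MASK;
	unsigned i;

	pkt[0] ^= (uint8_t)(mask[0] & h);	/* 4 or 5 masked bits */
	for (i = 0; i < pn_len; i++)
		pkt[pn_offset + i] ^= mask[1 + i];
}
```

Applying the function twice with the same mask restores the original bytes,
which is why one merged helper can serve both directions.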
---
net/quic/crypto.c | 638 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/crypto.h | 12 +
2 files changed, 650 insertions(+)
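The packet number reconstruction done by quic_crypto_get_number() in the diff
below (via quic_get_num()) follows the candidate-window algorithm of RFC 9000
section 17.1 and appendix A.3. A standalone sketch, under the assumption that
quic_get_num() implements this algorithm (names here are local):

```c
#include <stdint.h>

/* Recover a full packet number from its truncated encoding by picking
 * the candidate closest to the next expected packet number, per RFC
 * 9000 appendix A.3. pn_nbits is the truncated field width in bits.
 */
static uint64_t toy_decode_pn(uint64_t largest_pn, uint64_t truncated_pn,
			      unsigned pn_nbits)
{
	uint64_t expected = largest_pn + 1;
	uint64_t win = 1ULL << pn_nbits;
	uint64_t hwin = win / 2;
	uint64_t mask = win - 1;
	uint64_t candidate = (expected & ~mask) | truncated_pn;

	/* The expected >= hwin guard avoids unsigned wrap for small
	 * packet numbers, which the RFC pseudocode leaves implicit.
	 */
	if (expected >= hwin && candidate <= expected - hwin &&
	    candidate < (1ULL << 62) - win)
		return candidate + win;
	if (candidate > expected + hwin && candidate >= win)
		return candidate - win;
	return candidate;
}
```

With the RFC's worked example (largest received 0xa82f30ea, 16-bit truncated
value 0x9b32), this recovers 0xa82f9b32.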
diff --git a/net/quic/crypto.c b/net/quic/crypto.c
index 218d3fe49dff..5b0a74e9d8ac 100644
--- a/net/quic/crypto.c
+++ b/net/quic/crypto.c
@@ -194,6 +194,275 @@ static int quic_crypto_keys_derive_and_install(struct quic_crypto *crypto,
return err;
}
+static void *quic_crypto_skcipher_mem_alloc(struct crypto_skcipher *tfm,
+ u32 mask_size, u8 **iv,
+ struct skcipher_request **req)
+{
+ unsigned int iv_size, req_size;
+ unsigned int len;
+ u8 *mem;
+
+ iv_size = crypto_skcipher_ivsize(tfm);
+ req_size = sizeof(**req) + crypto_skcipher_reqsize(tfm);
+
+ len = mask_size;
+ len += iv_size;
+ len += crypto_skcipher_alignmask(tfm) &
+ ~(crypto_tfm_ctx_alignment() - 1);
+ len = ALIGN(len, crypto_tfm_ctx_alignment());
+ len += req_size;
+
+ mem = kzalloc(len, GFP_ATOMIC);
+ if (!mem)
+ return NULL;
+
+ *iv = (u8 *)PTR_ALIGN(mem + mask_size,
+ crypto_skcipher_alignmask(tfm) + 1);
+ *req = (struct skcipher_request *)PTR_ALIGN(*iv + iv_size,
+ crypto_tfm_ctx_alignment());
+
+ return (void *)mem;
+}
+
+/* Extracts and reconstructs the packet number from an incoming QUIC packet. */
+static int quic_crypto_get_number(struct sk_buff *skb)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ s64 number_max = cb->number;
+ u32 len = cb->length;
+ u8 *p;
+
+ /* rfc9000#section-17.1:
+ *
+ * Once header protection is removed, the packet number is decoded by
+ * finding the packet number value that is closest to the next expected
+ * packet. The next expected packet is the highest received packet
+ * number plus one.
+ */
+ p = (u8 *)quic_hdr(skb) + cb->number_offset;
+ if (!quic_get_int(&p, &len, &cb->number, cb->number_len))
+ return -EINVAL;
+ cb->number = quic_get_num(number_max, cb->number, cb->number_len);
+ return 0;
+}
+
+#define QUIC_SAMPLE_LEN 16
+
+#define QUIC_HEADER_FORM_BIT 0x80
+#define QUIC_LONG_HEADER_MASK 0x0f
+#define QUIC_SHORT_HEADER_MASK 0x1f
+
+/* Header Protection. */
+static int quic_crypto_header_protect(struct crypto_skcipher *tfm,
+ struct sk_buff *skb, bool chacha,
+ bool enc)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct skcipher_request *req;
+ u8 *mask, *iv, *p, h_mask;
+ struct sk_buff *trailer;
+ struct scatterlist sg;
+ int err, i;
+
+ if (!enc) {
+ if (cb->length < QUIC_PN_MAX_LEN + QUIC_SAMPLE_LEN)
+ return -EINVAL;
+
+ err = skb_cow_data(skb, 0, &trailer);
+ if (err < 0)
+ return err;
+ }
+
+ mask = quic_crypto_skcipher_mem_alloc(tfm, QUIC_SAMPLE_LEN, &iv, &req);
+ if (!mask)
+ return -ENOMEM;
+
+ /* rfc9001#section-5.4.2: Header Protection Sample:
+ *
+ * # pn_offset is the start of the Packet Number field.
+ * sample_offset = pn_offset + 4
+ *
+ * sample = packet[sample_offset..sample_offset+sample_length]
+ *
+ * rfc9001#section-5.4.3: AES-Based Header Protection:
+ *
+ * header_protection(hp_key, sample):
+ * mask = AES-ECB(hp_key, sample)
+ *
+ * rfc9001#section-5.4.4: ChaCha20-Based Header Protection:
+ *
+ * header_protection(hp_key, sample):
+ * counter = sample[0..3]
+ * nonce = sample[4..15]
+ * mask = ChaCha20(hp_key, counter, nonce, {0,0,0,0,0})
+ */
+ p = skb->data + cb->number_offset + QUIC_PN_MAX_LEN;
+ memcpy((chacha ? iv : mask), p, QUIC_SAMPLE_LEN);
+ sg_init_one(&sg, mask, QUIC_SAMPLE_LEN);
+ skcipher_request_set_tfm(req, tfm);
+ skcipher_request_set_crypt(req, &sg, &sg, QUIC_SAMPLE_LEN, iv);
+ err = crypto_skcipher_encrypt(req);
+ if (err)
+ goto err;
+
+ /* rfc9001#section-5.4.1:
+ *
+ * mask = header_protection(hp_key, sample)
+ *
+ * pn_length = (packet[0] & 0x03) + 1
+ * if (packet[0] & 0x80) == 0x80:
+ * # Long header: 4 bits masked
+ * packet[0] ^= mask[0] & 0x0f
+ * else:
+ * # Short header: 5 bits masked
+ * packet[0] ^= mask[0] & 0x1f
+ *
+ * # pn_offset is the start of the Packet Number field.
+ * packet[pn_offset:pn_offset+pn_length] ^= mask[1:1+pn_length]
+ */
+ p = skb->data;
+ h_mask = ((*p & QUIC_HEADER_FORM_BIT) == QUIC_HEADER_FORM_BIT) ?
+ QUIC_LONG_HEADER_MASK : QUIC_SHORT_HEADER_MASK;
+ *p = (u8)(*p ^ (mask[0] & h_mask));
+ if (!enc) {
+ cb->key_phase = quic_hdr(skb)->key;
+ cb->number_len = quic_hdr(skb)->pnl + 1;
+ }
+ p += cb->number_offset;
+ for (i = 1; i <= cb->number_len; i++)
+ *p++ ^= mask[i];
+
+ if (!enc)
+ err = quic_crypto_get_number(skb);
+err:
+ kfree_sensitive(mask);
+ return err;
+}
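The masking rules above can be sketched in standalone userspace C. This is an illustrative model of the rfc9001#section-5.4.1 mask application only, assuming `mask` was already derived from the sample via AES-ECB or ChaCha20; the helper names are hypothetical and the kernel function above is the real path. Note the asymmetry it mirrors: on protection the pnl bits are read before masking, while on removal they are readable only after the first byte is unmasked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static void hp_protect(uint8_t *pkt, size_t pn_off, const uint8_t mask[5])
{
	/* pnl bits are still unprotected here, so read them first. */
	size_t i, pn_len = (size_t)(pkt[0] & 0x03) + 1;

	pkt[0] ^= (uint8_t)(mask[0] & ((pkt[0] & 0x80) ? 0x0f : 0x1f));
	for (i = 0; i < pn_len; i++)
		pkt[pn_off + i] ^= mask[1 + i];
}

static void hp_unprotect(uint8_t *pkt, size_t pn_off, const uint8_t mask[5])
{
	size_t i, pn_len;

	/* The form bit (0x80) is never masked, so testing it is safe. */
	pkt[0] ^= (uint8_t)(mask[0] & ((pkt[0] & 0x80) ? 0x0f : 0x1f));
	pn_len = (size_t)(pkt[0] & 0x03) + 1; /* pnl readable only now */
	for (i = 0; i < pn_len; i++)
		pkt[pn_off + i] ^= mask[1 + i];
}
```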
+
+static void *quic_crypto_aead_mem_alloc(struct crypto_aead *tfm, u32 ctx_size,
+ u8 **iv, struct aead_request **req,
+ struct scatterlist **sg, u32 nsg)
+{
+ unsigned int iv_size, req_size;
+ unsigned int len;
+ u8 *mem;
+
+ iv_size = crypto_aead_ivsize(tfm);
+ req_size = sizeof(**req) + crypto_aead_reqsize(tfm);
+
+ len = ctx_size;
+ len += iv_size;
+ len += crypto_aead_alignmask(tfm) & ~(crypto_tfm_ctx_alignment() - 1);
+ len = ALIGN(len, crypto_tfm_ctx_alignment());
+ len += req_size;
+ len = ALIGN(len, __alignof__(struct scatterlist));
+ len += nsg * sizeof(**sg);
+
+ mem = kzalloc(len, GFP_ATOMIC);
+ if (!mem)
+ return NULL;
+
+ *iv = (u8 *)PTR_ALIGN(mem + ctx_size, crypto_aead_alignmask(tfm) + 1);
+ *req = (struct aead_request *)PTR_ALIGN(*iv + iv_size,
+ crypto_tfm_ctx_alignment());
+ *sg = (struct scatterlist *)PTR_ALIGN((u8 *)*req + req_size,
+ __alignof__(struct scatterlist));
+
+ return (void *)mem;
+}
+
+static void quic_crypto_done(void *data, int err)
+{
+ struct sk_buff *skb = data;
+
+ if (err == -EINPROGRESS)
+ return;
+
+ kfree_sensitive(QUIC_SKB_CB(skb)->crypto_ctx);
+ QUIC_SKB_CB(skb)->crypto_done(skb, err);
+}
+
+/* AEAD Usage. */
+static int quic_crypto_payload_protect(struct crypto_aead *tfm,
+ struct sk_buff *skb, u8 *base_iv,
+ bool ccm, bool enc)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 *iv, i, nonce[QUIC_IV_LEN];
+ u32 len, hlen, sglen, nsg;
+ struct aead_request *req;
+ struct sk_buff *trailer;
+ struct scatterlist *sg;
+ void *ctx;
+ __be64 n;
+ int err;
+
+ hlen = cb->number_offset + cb->number_len;
+ if (enc) {
+ len = skb->len;
+ err = skb_cow_data(skb, QUIC_TAG_LEN, &trailer);
+ if (err < 0)
+ return err;
+ pskb_put(skb, trailer, QUIC_TAG_LEN);
+ quic_hdr(skb)->key = cb->key_phase;
+ sglen = skb->len;
+ nsg = (u32)err;
+ } else {
+ len = cb->length + cb->number_offset;
+ if (len - hlen < QUIC_TAG_LEN)
+ return -EINVAL;
+ sglen = len;
+ nsg = 1;
+ }
+
+ ctx = quic_crypto_aead_mem_alloc(tfm, 0, &iv, &req, &sg, nsg);
+ if (!ctx)
+ return -ENOMEM;
+
+ sg_init_table(sg, nsg);
+ err = skb_to_sgvec(skb, sg, 0, sglen);
+ if (err < 0)
+ goto err;
+
+ /* rfc9001#section-5.3:
+ *
+ * The associated data, A, for the AEAD is the contents of the QUIC
+ * header, starting from the first byte of either the short or long
+ * header, up to and including the unprotected packet number.
+ *
+ * The nonce, N, is formed by combining the packet protection IV with
+ * the packet number. The 62 bits of the reconstructed QUIC packet
+ * number in network byte order are left-padded with zeros to the size
+ * of the IV. The exclusive OR of the padded packet number and the IV
+ * forms the AEAD nonce.
+ */
+ memcpy(nonce, base_iv, QUIC_IV_LEN);
+ n = cpu_to_be64(cb->number);
+ for (i = 0; i < sizeof(n); i++)
+ nonce[QUIC_IV_LEN - sizeof(n) + i] ^= ((u8 *)&n)[i];
+
+ /* For CCM based ciphers, first byte of IV is a constant. */
+ iv[0] = TLS_AES_CCM_IV_B0_BYTE;
+ memcpy(&iv[ccm], nonce, QUIC_IV_LEN);
+ aead_request_set_tfm(req, tfm);
+ aead_request_set_ad(req, hlen);
+ aead_request_set_crypt(req, sg, sg, len - hlen, iv);
+ aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ quic_crypto_done, skb);
+
+ cb->crypto_ctx = ctx; /* Async free context for quic_crypto_done() */
+ err = enc ? crypto_aead_encrypt(req) : crypto_aead_decrypt(req);
+ if (err == -EINPROGRESS || err == -EBUSY) {
+ memzero_explicit(nonce, sizeof(nonce));
+ return -EINPROGRESS;
+ }
+
+err:
+ kfree_sensitive(ctx);
+ memzero_explicit(nonce, sizeof(nonce));
+ return err;
+}
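The nonce computation quoted from rfc9001#section-5.3 can be shown in isolation. This is a hedged userspace sketch, not the kernel path: the packet number in network byte order is left-padded with zeros to the IV size and XORed into the packet protection IV. The helper name and `IV_LEN` macro are illustrative stand-ins for the kernel's QUIC_IV_LEN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IV_LEN 12	/* QUIC_IV_LEN in the kernel code */

static void quic_nonce(uint8_t nonce[IV_LEN], const uint8_t iv[IV_LEN],
		       uint64_t pn)
{
	int i;

	memcpy(nonce, iv, IV_LEN);
	for (i = 0; i < 8; i++)	/* XOR the low 8 bytes, big-endian */
		nonce[IV_LEN - 1 - i] ^= (uint8_t)(pn >> (8 * i));
}
```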
+
#define QUIC_CIPHER_MIN TLS_CIPHER_AES_GCM_128
#define QUIC_CIPHER_MAX TLS_CIPHER_CHACHA20_POLY1305
@@ -221,6 +490,146 @@ static struct quic_cipher ciphers[QUIC_CIPHER_MAX + 1 - QUIC_CIPHER_MIN] = {
"rfc7539(chacha20,poly1305)", "chacha20", "hmac(sha256)"),
};
+static bool quic_crypto_is_cipher_ccm(struct quic_crypto *crypto)
+{
+ return crypto->cipher_type == TLS_CIPHER_AES_CCM_128;
+}
+
+static bool quic_crypto_is_cipher_chacha(struct quic_crypto *crypto)
+{
+ return crypto->cipher_type == TLS_CIPHER_CHACHA20_POLY1305;
+}
+
+/* Encrypts a QUIC packet before transmission. This function performs AEAD
+ * encryption of the packet payload and applies header protection. It handles
+ * key phase tracking and key update timing.
+ *
+ * Return: 0 on success, or a negative error code.
+ */
+int quic_crypto_encrypt(struct quic_crypto *crypto, struct sk_buff *skb)
+{
+ u8 *iv, cha, ccm, phase = crypto->key_phase;
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ int err;
+
+ cb->key_phase = phase;
+ iv = crypto->tx_iv[phase];
+ /* Packet payload is already encrypted (e.g., resumed from async),
+ * proceed to header protection only.
+ */
+ if (cb->resume)
+ goto out;
+
+ /* If a key update is pending and this is the first packet using the
+ * new key, save the current time. Later used to clear old keys after
+ * some time has passed (see quic_crypto_decrypt()).
+ */
+ if (crypto->key_pending && !crypto->key_update_send_time)
+ crypto->key_update_send_time = quic_ktime_get_us();
+
+ ccm = quic_crypto_is_cipher_ccm(crypto);
+ err = quic_crypto_payload_protect(crypto->tx_tfm[phase], skb, iv, ccm,
+ true);
+ if (err)
+ return err;
+out:
+ cha = quic_crypto_is_cipher_chacha(crypto);
+ return quic_crypto_header_protect(crypto->tx_hp_tfm, skb, cha, true);
+}
+
+/* Decrypts a QUIC packet after reception. This function removes header
+ * protection, decrypts the payload, and processes any key updates if the key
+ * phase bit changes.
+ *
+ * Return: 0 on success, or a negative error code.
+ */
+int quic_crypto_decrypt(struct quic_crypto *crypto, struct sk_buff *skb)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ u8 *iv, cha, ccm, phase;
+ int err = 0;
+ u64 time;
+
+ /* Payload was decrypted asynchronously. Proceed with parsing packet
+ * number and key phase.
+ */
+ if (cb->resume) {
+ err = quic_crypto_get_number(skb);
+ if (err)
+ return err;
+ goto out;
+ }
+ if (!cb->number_len) { /* Packet header not yet decrypted. */
+ cha = quic_crypto_is_cipher_chacha(crypto);
+ err = quic_crypto_header_protect(crypto->rx_hp_tfm, skb, cha,
+ false);
+ if (err) {
+ pr_debug("%s: hd decrypt err %d\n", __func__, err);
+ return err;
+ }
+ }
+
+ /* rfc9001#section-6:
+ *
+ * The Key Phase bit allows a recipient to detect a change in keying
+ * material without needing to receive the first packet that triggered
+ * the change. An endpoint that notices a changed Key Phase bit updates
+ * keys and decrypts the packet that contains the changed value.
+ */
+ if (cb->key_phase != crypto->key_phase && !crypto->key_pending) {
+ if (!crypto->send_ready) /* Not ready for key update. */
+ return -EINVAL;
+ if (!cb->backlog) /* Key update requires process context */
+ return -EKEYREVOKED;
+ err = quic_crypto_key_update(crypto); /* Perform key update. */
+ if (err) {
+ cb->errcode = QUIC_TRANSPORT_ERROR_KEY_UPDATE;
+ return err;
+ }
+ cb->key_update = 1; /* Mark packet as triggering key update. */
+ }
+
+ phase = cb->key_phase;
+ iv = crypto->rx_iv[phase];
+ ccm = quic_crypto_is_cipher_ccm(crypto);
+ err = quic_crypto_payload_protect(crypto->rx_tfm[phase], skb, iv, ccm,
+ false);
+ if (err) {
+ if (err == -EINPROGRESS)
+ return err;
+ /* If the old keys cannot decrypt the packet, the peer might have
+ * started another key update. Clear the stale key_pending state so
+ * that subsequent packets can trigger a new key update.
+ */
+ if (crypto->key_pending && cb->key_phase != crypto->key_phase) {
+ crypto->key_pending = 0;
+ crypto->key_update_time = 0;
+ crypto->key_update_send_time = 0;
+ }
+ return err;
+ }
+
+out:
+ /* rfc9001#section-6.1:
+ *
+ * An endpoint MUST retain old keys until it has successfully
+ * unprotected a packet sent using the new keys. An endpoint SHOULD
+ * retain old keys for some time after unprotecting a packet sent using
+ * the new keys.
+ */
+ if (crypto->key_pending && cb->key_phase == crypto->key_phase) {
+ time = crypto->key_update_send_time;
+ if (time &&
+ quic_ktime_get_us() - time >= crypto->key_update_time) {
+ crypto->key_pending = 0;
+ crypto->key_update_time = 0;
+ crypto->key_update_send_time = 0;
+ }
+ }
+ return err;
+}
+
int quic_crypto_set_cipher(struct quic_crypto *crypto, u32 type, u32 flag)
{
struct quic_cipher *cipher;
@@ -515,6 +924,235 @@ int quic_crypto_initial_keys_install(struct quic_crypto *crypto,
return err;
}
+#define QUIC_RETRY_KEY_V1 \
+ "\xbe\x0c\x69\x0b\x9f\x66\x57\x5a\x1d\x76\x6b\x54\xe3\x68\xc8\x4e"
+#define QUIC_RETRY_KEY_V2 \
+ "\x8f\xb4\xb0\x1b\x56\xac\x48\xe2\x60\xfb\xcb\xce\xad\x7c\xcc\x92"
+
+#define QUIC_RETRY_NONCE_V1 "\x46\x15\x99\xd3\x5d\x63\x2b\xf2\x23\x98\x25\xbb"
+#define QUIC_RETRY_NONCE_V2 "\xd8\x69\x69\xbc\x2d\x7c\x6d\x99\x90\xef\xb0\x4a"
+
+/* Retry Packet Integrity. */
+int quic_crypto_get_retry_tag(struct quic_crypto *crypto, struct sk_buff *skb,
+ struct quic_conn_id *odcid, u32 version, u8 *tag)
+{
+ struct crypto_aead *tfm = crypto->tag_tfm;
+ u8 *pseudo_retry, *p, *iv, *key;
+ struct aead_request *req;
+ struct scatterlist *sg;
+ u32 plen;
+ int err;
+
+ if (skb->len < QUIC_TAG_LEN)
+ return -EINVAL;
+
+ /* rfc9001#section-5.8:
+ *
+ * The Retry Integrity Tag is a 128-bit field that is computed as the
+ * output of AEAD_AES_128_GCM used with the following inputs:
+ *
+ * - The secret key, K, is 128 bits equal to
+ * 0xbe0c690b9f66575a1d766b54e368c84e.
+ * - The nonce, N, is 96 bits equal to 0x461599d35d632bf2239825bb.
+ * - The plaintext, P, is empty.
+ * - The associated data, A, is the contents of the Retry
+ * Pseudo-Packet,
+ *
+ * The Retry Pseudo-Packet is not sent over the wire. It is computed by
+ * taking the transmitted Retry packet, removing the Retry Integrity
+ * Tag, and prepending the two following fields: ODCID Length +
+ * Original Destination Connection ID (ODCID).
+ */
+ err = crypto_aead_setauthsize(tfm, QUIC_TAG_LEN);
+ if (err)
+ return err;
+ key = QUIC_RETRY_KEY_V1;
+ if (version == QUIC_VERSION_V2)
+ key = QUIC_RETRY_KEY_V2;
+ err = crypto_aead_setkey(tfm, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ if (err)
+ return err;
+
+ /* skb->len >= QUIC_TAG_LEN was checked at the top of this function. */
+ plen = 1 + odcid->len + skb->len - QUIC_TAG_LEN;
+ pseudo_retry = quic_crypto_aead_mem_alloc(tfm, plen + QUIC_TAG_LEN, &iv,
+ &req, &sg, 1);
+ if (!pseudo_retry)
+ return -ENOMEM;
+
+ p = pseudo_retry;
+ p = quic_put_int(p, odcid->len, 1);
+ p = quic_put_data(p, odcid->data, odcid->len);
+ p = quic_put_data(p, skb->data, skb->len - QUIC_TAG_LEN);
+ sg_init_one(sg, pseudo_retry, plen + QUIC_TAG_LEN);
+
+ memcpy(iv, QUIC_RETRY_NONCE_V1, QUIC_IV_LEN);
+ if (version == QUIC_VERSION_V2)
+ memcpy(iv, QUIC_RETRY_NONCE_V2, QUIC_IV_LEN);
+ aead_request_set_tfm(req, tfm);
+ aead_request_set_ad(req, plen);
+ aead_request_set_crypt(req, sg, sg, 0, iv);
+ err = crypto_aead_encrypt(req);
+ if (!err)
+ memcpy(tag, p, QUIC_TAG_LEN);
+ kfree_sensitive(pseudo_retry);
+ return err;
+}
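The Retry Pseudo-Packet layout assembled above can be modeled without the crypto. This standalone sketch only builds the associated-data buffer fed to AEAD_AES_128_GCM per rfc9001#section-5.8: ODCID length, ODCID, then the Retry packet minus its integrity tag. The helper name and buffer handling are illustrative, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TAG_LEN 16	/* QUIC_TAG_LEN */

static size_t build_pseudo_retry(uint8_t *out, const uint8_t *odcid,
				 uint8_t odcid_len, const uint8_t *retry,
				 size_t retry_len)
{
	size_t n = 0;

	if (retry_len < TAG_LEN)	/* nothing before the tag */
		return 0;
	out[n++] = odcid_len;
	memcpy(out + n, odcid, odcid_len);
	n += odcid_len;
	memcpy(out + n, retry, retry_len - TAG_LEN);
	return n + retry_len - TAG_LEN;
}
```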
+
+/* Initialize crypto for token operations.
+ *
+ * Derives a key and IV using HKDF, configures the AEAD transform, and
+ * allocates memory for the token request. Used by both generation and
+ * verification paths.
+ *
+ * Returns the token buffer on success or an ERR_PTR() on failure.
+ */
+static void *quic_crypto_token_init(struct quic_crypto *crypto, u32 len,
+ u8 **token_iv, struct aead_request **req,
+ struct scatterlist **sg)
+{
+ u8 key[TLS_CIPHER_AES_GCM_128_KEY_SIZE], iv[QUIC_IV_LEN];
+ struct crypto_aead *tfm = crypto->tag_tfm;
+ struct quic_data srt = {}, k, i;
+ void *token = NULL;
+ int err;
+
+ quic_data(&srt, quic_random_data, QUIC_RANDOM_DATA_LEN);
+ quic_data(&k, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ quic_data(&i, iv, QUIC_IV_LEN);
+ err = quic_crypto_keys_derive(crypto->secret_tfm, &srt, &k, &i, NULL,
+ QUIC_VERSION_V1);
+ if (err)
+ goto out;
+ err = crypto_aead_setauthsize(tfm, QUIC_TAG_LEN);
+ if (err)
+ goto out;
+ err = crypto_aead_setkey(tfm, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+ if (err)
+ goto out;
+ token = quic_crypto_aead_mem_alloc(tfm, len, token_iv, req, sg, 1);
+ if (!token) {
+ err = -ENOMEM;
+ goto out;
+ }
+ memcpy(*token_iv, iv, QUIC_IV_LEN);
+out:
+ memzero_explicit(key, sizeof(key));
+ memzero_explicit(iv, sizeof(iv));
+ return token ?: ERR_PTR(err);
+}
+
+/* Generate a token for Retry or address validation.
+ *
+ * Builds a token with the format: [client address][timestamp][original
+ * DCID][auth tag]
+ *
+ * Encrypts the token (excluding the first flag byte) using AES-GCM with a key
+ * and IV derived via HKDF. The original DCID is stored to be recovered later
+ * from a Client Initial packet. Ensures the token is bound to the client
+ * address and time, preventing reuse or tampering.
+ *
+ * Returns 0 on success or a negative error code on failure.
+ */
+int quic_crypto_generate_token(struct quic_crypto *crypto, void *addr,
+ u32 addrlen, struct quic_conn_id *conn_id,
+ u8 *token, u32 *tlen)
+{
+ u64 ts = quic_ktime_get_us();
+ struct aead_request *req;
+ struct scatterlist *sg;
+ u8 *token_buf, *iv, *p;
+ int err, len;
+
+ len = addrlen + sizeof(ts) + conn_id->len + QUIC_TAG_LEN;
+ token_buf = quic_crypto_token_init(crypto, len, &iv, &req, &sg);
+ if (IS_ERR(token_buf))
+ return PTR_ERR(token_buf);
+
+ p = token_buf;
+ p = quic_put_data(p, addr, addrlen);
+ p = quic_put_int(p, ts, sizeof(ts));
+ quic_put_data(p, conn_id->data, conn_id->len);
+
+ sg_init_one(sg, token_buf, len);
+ aead_request_set_tfm(req, crypto->tag_tfm);
+ aead_request_set_ad(req, addrlen);
+ aead_request_set_crypt(req, sg, sg, len - addrlen - QUIC_TAG_LEN, iv);
+ err = crypto_aead_encrypt(req);
+ if (err)
+ goto out;
+
+ memcpy(token + 1, token_buf, len);
+ *tlen = len + 1;
+out:
+ kfree_sensitive(token_buf);
+ return err;
+}
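The on-the-wire token size produced by the function above follows directly from its layout: [flag][client address][64-bit timestamp][ODCID][16-byte AEAD tag]. A small sketch of the arithmetic, with a made-up helper name and field widths taken from the function body:

```c
#include <assert.h>
#include <stddef.h>

static size_t quic_token_len(size_t addrlen, size_t cid_len)
{
	return 1 /* flag */ + addrlen + 8 /* timestamp */ + cid_len +
	       16 /* AEAD tag */;
}
```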
+
+/* Validate a Retry or address validation token.
+ *
+ * Decrypts the token using the derived key and IV. Checks that the
+ * decrypted address matches the provided address and validates the
+ * embedded timestamp against the current time with a token-type-specific
+ * timeout. For Retry tokens, it also extracts and returns the original
+ * destination connection ID (ODCID).
+ *
+ * Returns 0 if the token is valid, -EINVAL if invalid, or another negative
+ * error code.
+ */
+int quic_crypto_verify_token(struct quic_crypto *crypto, void *addr,
+ u32 addrlen, struct quic_conn_id *conn_id,
+ u8 *token, u32 len)
+{
+ u64 t, ts = quic_ktime_get_us(), timeout = QUIC_TOKEN_TIMEOUT_RETRY;
+ u8 *token_buf, *iv, *p, flag = *token;
+ struct aead_request *req;
+ struct scatterlist *sg;
+ int err;
+
+ if (len < sizeof(flag) + addrlen + sizeof(ts) + QUIC_TAG_LEN)
+ return -EINVAL;
+ len--;
+ token++;
+
+ token_buf = quic_crypto_token_init(crypto, len, &iv, &req, &sg);
+ if (IS_ERR(token_buf))
+ return PTR_ERR(token_buf);
+
+ memcpy(token_buf, token, len);
+
+ sg_init_one(sg, token_buf, len);
+ aead_request_set_tfm(req, crypto->tag_tfm);
+ aead_request_set_ad(req, addrlen);
+ aead_request_set_crypt(req, sg, sg, len - addrlen, iv);
+ err = crypto_aead_decrypt(req);
+ if (err)
+ goto out;
+
+ err = -EINVAL;
+ p = token_buf;
+ if (memcmp(p, addr, addrlen))
+ goto out;
+
+ p += addrlen;
+ len -= addrlen;
+ if (flag == QUIC_TOKEN_FLAG_REGULAR)
+ timeout = QUIC_TOKEN_TIMEOUT_REGULAR;
+ if (!quic_get_int(&p, &len, &t, sizeof(ts)) || t + timeout < ts)
+ goto out;
+
+ len -= QUIC_TAG_LEN;
+ if (len > QUIC_CONN_ID_MAX_LEN)
+ goto out;
+
+ if (flag == QUIC_TOKEN_FLAG_RETRY)
+ quic_conn_id_update(conn_id, p, len);
+ err = 0;
+out:
+ kfree_sensitive(token_buf);
+ return err;
+}
+
/* Generate a derived key using HKDF-Extract and HKDF-Expand with a given
* label.
*/
diff --git a/net/quic/crypto.h b/net/quic/crypto.h
index f9450e55d6dd..b84275783880 100644
--- a/net/quic/crypto.h
+++ b/net/quic/crypto.h
@@ -66,6 +66,9 @@ int quic_crypto_set_cipher(struct quic_crypto *crypto,
u32 type, u32 flag);
int quic_crypto_key_update(struct quic_crypto *crypto);
+int quic_crypto_encrypt(struct quic_crypto *crypto, struct sk_buff *skb);
+int quic_crypto_decrypt(struct quic_crypto *crypto, struct sk_buff *skb);
+
int quic_crypto_initial_keys_install(struct quic_crypto *crypto,
struct quic_conn_id *conn_id,
u32 version, bool is_serv);
@@ -76,5 +79,14 @@ int quic_crypto_generate_stateless_reset_token(struct quic_crypto *crypto,
void *data, u32 len, u8 *key,
u32 key_len);
+int quic_crypto_generate_token(struct quic_crypto *crypto, void *addr,
+ u32 addrlen, struct quic_conn_id *conn_id,
+ u8 *token, u32 *tlen);
+int quic_crypto_get_retry_tag(struct quic_crypto *crypto, struct sk_buff *skb,
+ struct quic_conn_id *odcid, u32 version, u8 *tag);
+int quic_crypto_verify_token(struct quic_crypto *crypto, void *addr,
+ u32 addrlen, struct quic_conn_id *conn_id,
+ u8 *token, u32 len);
+
void quic_crypto_free(struct quic_crypto *crypto);
void quic_crypto_init(void);
--
2.47.1
* [PATCH net-next v11 13/15] quic: add timer management
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
This patch introduces 'quic_timer' to unify and manage the five main
timers used in QUIC: loss detection, delayed ACK, path validation,
PMTU probing, and pacing. These timers are critical for driving
retransmissions, connection liveness, and flow control.
Each timer type is initialized, started, reset, or stopped using a common
set of operations.
- quic_timer_reset(): Reset a timer with type and timeout
- quic_timer_start(): Start a timer with type and timeout
- quic_timer_stop(): Stop a timer with type
Although handler functions for each timer are defined, they are currently
placeholders; their logic will be implemented in upcoming patches for
packet transmission and outqueue handling.
Deferred timer actions are also integrated through quic_release_cb(),
which dispatches to the appropriate handler when timers expire.
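The deferral scheme can be reduced to a minimal plain-C model: a timer that fires while the socket is user-owned only sets a bit, and the release callback later clears the accumulated bits and runs each handler exactly once. The names and the plain assignment are illustrative; the kernel uses try_cmpxchg() on sk->sk_tsq_flags and takes a socket reference per deferred bit.

```c
#include <assert.h>

#define F_LOSS 0x1UL
#define F_SACK 0x2UL
#define F_ALL  (F_LOSS | F_SACK)

static int loss_runs, sack_runs;

static void release_cb(unsigned long *tsq_flags)
{
	unsigned long flags = *tsq_flags;

	if (!(flags & F_ALL))
		return;
	*tsq_flags = flags & ~F_ALL;	/* kernel: try_cmpxchg() loop */

	if (flags & F_LOSS)
		loss_runs++;
	if (flags & F_SACK)
		sack_runs++;
}
```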
Signed-off-by: Tyler Fanelli <tfanelli@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
---
v5:
- Rename QUIC_TSQ_DEFERRED to QUIC_PACE_DEFERRED.
v11:
- Note for AI reviews: it is safe to avoid synchronous timer shutdown,
as the timer holds a socket reference and handlers will not access
pnspace/crypto/cong data once sk_state is closed in later patches.
- Note for AI reviews: QUIC_F_MTU_REDUCED_DEFERRED will be used in a
later patch; handling in quic_release_cb() will be added then.
- Set maximum line length to 80 characters.
- Add a check for type in quic_timer_reset().
- Extract quic_timer_timeout() from
quic_timer_sack/loss/path/pmtu/pace_timeout() (noted by AI review).
---
net/quic/Makefile | 2 +-
net/quic/socket.c | 33 ++++++++++
net/quic/socket.h | 33 ++++++++++
net/quic/timer.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/timer.h | 47 +++++++++++++
5 files changed, 279 insertions(+), 1 deletion(-)
create mode 100644 net/quic/timer.c
create mode 100644 net/quic/timer.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 58bb18f7926d..2ccf01ad9e22 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o pnspace.o crypto.o
+ cong.o pnspace.o crypto.o timer.o
diff --git a/net/quic/socket.c b/net/quic/socket.c
index 4016de4b39fe..c37cc1522254 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -51,6 +51,8 @@ static int quic_init_sock(struct sock *sk)
quic_conn_id_set_init(quic_dest(sk), 0);
quic_cong_init(quic_cong(sk));
+ quic_timer_init(sk);
+
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
@@ -66,6 +68,8 @@ static void quic_destroy_sock(struct sock *sk)
{
u8 i;
+ quic_timer_free(sk);
+
for (i = 0; i < QUIC_PNSPACE_MAX; i++)
quic_pnspace_free(quic_pnspace(sk, i));
for (i = 0; i < QUIC_CRYPTO_MAX; i++)
@@ -199,6 +203,35 @@ static int quic_getsockopt(struct sock *sk, int level, int optname,
static void quic_release_cb(struct sock *sk)
{
+ /* Similar to tcp_release_cb(). */
+ unsigned long nflags, flags = smp_load_acquire(&sk->sk_tsq_flags);
+
+ do {
+ if (!(flags & QUIC_DEFERRED_ALL))
+ return;
+ nflags = flags & ~QUIC_DEFERRED_ALL;
+ } while (!try_cmpxchg(&sk->sk_tsq_flags, &flags, nflags));
+
+ if (flags & QUIC_F_LOSS_DEFERRED) {
+ quic_timer_loss_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_SACK_DEFERRED) {
+ quic_timer_sack_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_PATH_DEFERRED) {
+ quic_timer_path_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_PMTU_DEFERRED) {
+ quic_timer_pmtu_handler(sk);
+ __sock_put(sk);
+ }
+ if (flags & QUIC_F_PACE_DEFERRED) {
+ quic_timer_pace_handler(sk);
+ __sock_put(sk);
+ }
}
static int quic_disconnect(struct sock *sk, int flags)
diff --git a/net/quic/socket.h b/net/quic/socket.h
index d7811391cc8b..c5654fdc06b5 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -21,6 +21,7 @@
#include "cong.h"
#include "protocol.h"
+#include "timer.h"
extern struct proto quic_prot;
extern struct proto quicv6_prot;
@@ -32,6 +33,31 @@ enum quic_state {
QUIC_SS_ESTABLISHED = TCP_ESTABLISHED,
};
+enum quic_tsq_enum {
+ QUIC_MTU_REDUCED_DEFERRED,
+ QUIC_LOSS_DEFERRED,
+ QUIC_SACK_DEFERRED,
+ QUIC_PATH_DEFERRED,
+ QUIC_PMTU_DEFERRED,
+ QUIC_PACE_DEFERRED,
+};
+
+enum quic_tsq_flags {
+ QUIC_F_MTU_REDUCED_DEFERRED = BIT(QUIC_MTU_REDUCED_DEFERRED),
+ QUIC_F_LOSS_DEFERRED = BIT(QUIC_LOSS_DEFERRED),
+ QUIC_F_SACK_DEFERRED = BIT(QUIC_SACK_DEFERRED),
+ QUIC_F_PATH_DEFERRED = BIT(QUIC_PATH_DEFERRED),
+ QUIC_F_PMTU_DEFERRED = BIT(QUIC_PMTU_DEFERRED),
+ QUIC_F_PACE_DEFERRED = BIT(QUIC_PACE_DEFERRED),
+};
+
+#define QUIC_DEFERRED_ALL (QUIC_F_MTU_REDUCED_DEFERRED | \
+ QUIC_F_LOSS_DEFERRED | \
+ QUIC_F_SACK_DEFERRED | \
+ QUIC_F_PATH_DEFERRED | \
+ QUIC_F_PMTU_DEFERRED | \
+ QUIC_F_PACE_DEFERRED)
+
struct quic_sock {
struct inet_sock inet;
struct list_head reqs;
@@ -47,6 +73,8 @@ struct quic_sock {
struct quic_cong cong;
struct quic_pnspace space[QUIC_PNSPACE_MAX];
struct quic_crypto crypto[QUIC_CRYPTO_MAX];
+
+ struct quic_timer timers[QUIC_TIMER_MAX];
};
struct quic6_sock {
@@ -119,6 +147,11 @@ static inline struct quic_crypto *quic_crypto(const struct sock *sk, u8 level)
return &quic_sk(sk)->crypto[level];
}
+static inline void *quic_timer(const struct sock *sk, u8 type)
+{
+ return (void *)&quic_sk(sk)->timers[type];
+}
+
static inline bool quic_is_establishing(struct sock *sk)
{
return sk->sk_state == QUIC_SS_ESTABLISHING;
diff --git a/net/quic/timer.c b/net/quic/timer.c
new file mode 100644
index 000000000000..a0914eae628e
--- /dev/null
+++ b/net/quic/timer.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Initialization/cleanup for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include "socket.h"
+
+static void quic_timer_timeout(struct quic_timer *t, int type, int defer_bit,
+ void (*handler)(struct sock *sk))
+{
+ struct quic_sock *qs = container_of(t, struct quic_sock, timers[type]);
+ struct sock *sk = &qs->inet.sk;
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ if (!test_and_set_bit(defer_bit, &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+
+ handler(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+}
+
+void quic_timer_sack_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_sack_timeout(struct timer_list *t)
+{
+ quic_timer_timeout((struct quic_timer *)t, QUIC_TIMER_SACK,
+ QUIC_SACK_DEFERRED, quic_timer_sack_handler);
+}
+
+void quic_timer_loss_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_loss_timeout(struct timer_list *t)
+{
+ quic_timer_timeout((struct quic_timer *)t, QUIC_TIMER_LOSS,
+ QUIC_LOSS_DEFERRED, quic_timer_loss_handler);
+}
+
+void quic_timer_path_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_path_timeout(struct timer_list *t)
+{
+ quic_timer_timeout((struct quic_timer *)t, QUIC_TIMER_PATH,
+ QUIC_PATH_DEFERRED, quic_timer_path_handler);
+}
+
+void quic_timer_reset_path(struct sock *sk)
+{
+ struct quic_cong *cong = quic_cong(sk);
+ u64 timeout = cong->pto * 2;
+
+ /* Calculate timeout based on cong.pto, but enforce a lower bound. */
+ if (timeout < QUIC_MIN_PATH_TIMEOUT)
+ timeout = QUIC_MIN_PATH_TIMEOUT;
+ quic_timer_reset(sk, QUIC_TIMER_PATH, timeout);
+}
+
+void quic_timer_pmtu_handler(struct sock *sk)
+{
+}
+
+static void quic_timer_pmtu_timeout(struct timer_list *t)
+{
+ quic_timer_timeout((struct quic_timer *)t, QUIC_TIMER_PMTU,
+ QUIC_PMTU_DEFERRED, quic_timer_pmtu_handler);
+}
+
+void quic_timer_pace_handler(struct sock *sk)
+{
+}
+
+static enum hrtimer_restart quic_timer_pace_timeout(struct hrtimer *hr)
+{
+ quic_timer_timeout((struct quic_timer *)hr, QUIC_TIMER_PACE,
+ QUIC_PACE_DEFERRED, quic_timer_pace_handler);
+ return HRTIMER_NORESTART;
+}
+
+void quic_timer_reset(struct sock *sk, u8 type, u64 timeout)
+{
+ struct timer_list *t = quic_timer(sk, type);
+
+ /* Note that type must never be QUIC_TIMER_PACE for this helper. */
+ if (WARN_ON_ONCE(type == QUIC_TIMER_PACE))
+ return;
+ if (timeout && !mod_timer(t, jiffies + usecs_to_jiffies(timeout)))
+ sock_hold(sk);
+}
+
+void quic_timer_start(struct sock *sk, u8 type, u64 timeout)
+{
+ struct timer_list *t;
+ struct hrtimer *hr;
+
+ if (type == QUIC_TIMER_PACE) {
+ hr = quic_timer(sk, type);
+
+ if (!hrtimer_is_queued(hr)) {
+ hrtimer_start(hr, ns_to_ktime(timeout),
+ HRTIMER_MODE_ABS_PINNED_SOFT);
+ sock_hold(sk);
+ }
+ return;
+ }
+
+ t = quic_timer(sk, type);
+ if (timeout && !timer_pending(t)) {
+ if (!mod_timer(t, jiffies + usecs_to_jiffies(timeout)))
+ sock_hold(sk);
+ }
+}
+
+void quic_timer_stop(struct sock *sk, u8 type)
+{
+ if (type == QUIC_TIMER_PACE) {
+ if (hrtimer_try_to_cancel(quic_timer(sk, type)) == 1)
+ sock_put(sk);
+ return;
+ }
+ if (timer_delete(quic_timer(sk, type)))
+ sock_put(sk);
+}
+
+void quic_timer_init(struct sock *sk)
+{
+ timer_setup(quic_timer(sk, QUIC_TIMER_LOSS), quic_timer_loss_timeout,
+ 0);
+ timer_setup(quic_timer(sk, QUIC_TIMER_SACK), quic_timer_sack_timeout,
+ 0);
+ timer_setup(quic_timer(sk, QUIC_TIMER_PATH), quic_timer_path_timeout,
+ 0);
+ timer_setup(quic_timer(sk, QUIC_TIMER_PMTU), quic_timer_pmtu_timeout,
+ 0);
+ /* Use hrtimer for pace timer, ensuring precise control over send
+ * timing.
+ */
+ hrtimer_setup(quic_timer(sk, QUIC_TIMER_PACE), quic_timer_pace_timeout,
+ CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_SOFT);
+}
+
+void quic_timer_free(struct sock *sk)
+{
+ quic_timer_stop(sk, QUIC_TIMER_LOSS);
+ quic_timer_stop(sk, QUIC_TIMER_SACK);
+ quic_timer_stop(sk, QUIC_TIMER_PATH);
+ quic_timer_stop(sk, QUIC_TIMER_PMTU);
+ quic_timer_stop(sk, QUIC_TIMER_PACE);
+}
diff --git a/net/quic/timer.h b/net/quic/timer.h
new file mode 100644
index 000000000000..a5dec85f7c3e
--- /dev/null
+++ b/net/quic/timer.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+enum {
+ QUIC_TIMER_LOSS, /* Loss detection timer: retransmit on packet loss */
+ QUIC_TIMER_SACK, /* ACK delay timer, also used as idle timer alias */
+ QUIC_TIMER_PATH, /* Path validation timer: verifies path connectivity */
+ QUIC_TIMER_PMTU, /* PLPMTUD probing timer */
+ QUIC_TIMER_PACE, /* Pacing timer: controls packet transmission pacing */
+ QUIC_TIMER_MAX,
+ QUIC_TIMER_IDLE = QUIC_TIMER_SACK,
+};
+
+struct quic_timer {
+ union {
+ struct timer_list t;
+ struct hrtimer hr;
+ };
+};
+
+#define QUIC_MIN_PROBE_TIMEOUT 5000000
+
+#define QUIC_MIN_PATH_TIMEOUT 1500000
+
+#define QUIC_MIN_IDLE_TIMEOUT 1000000
+#define QUIC_DEF_IDLE_TIMEOUT 30000000
+
+void quic_timer_reset(struct sock *sk, u8 type, u64 timeout);
+void quic_timer_start(struct sock *sk, u8 type, u64 timeout);
+void quic_timer_stop(struct sock *sk, u8 type);
+void quic_timer_init(struct sock *sk);
+void quic_timer_free(struct sock *sk);
+
+void quic_timer_reset_path(struct sock *sk);
+
+void quic_timer_loss_handler(struct sock *sk);
+void quic_timer_pace_handler(struct sock *sk);
+void quic_timer_path_handler(struct sock *sk);
+void quic_timer_sack_handler(struct sock *sk);
+void quic_timer_pmtu_handler(struct sock *sk);
--
2.47.1
* [PATCH net-next v11 14/15] quic: add packet builder base
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
This patch introduces 'quic_packet' to handle packing of QUIC packets on
the transmit (TX) path.
It provides functionality for frame packing and packet construction. The
packet configuration includes setting the path, calculating overhead,
and verifying routing. Frames are appended to the packet before it is
created with the queued frames.
Once assembled, the packet is encrypted, bundled, and sent out. There
is also support to flush the packet when no additional frames remain.
Functions to create application (short) and handshake (long) packets
are currently placeholders for future implementation.
- quic_packet_config(): Set the path, compute overhead, and verify routing.
- quic_packet_create_and_xmit(): Create and send the packet with the queued
frames.
- quic_packet_xmit(): Encrypt, bundle, and send out the packet.
- quic_packet_flush(): Send the packet if there's nothing left to bundle.
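The append/create/flush flow listed above can be sketched as a toy state machine. This is purely illustrative plain C with made-up names; it has no relation to the real kernel structures and only shows the invariant that creating a packet consumes the queued frames and that flush transmits only when something is pending.

```c
#include <assert.h>

struct pkt {
	int queued;	/* frames appended, not yet built */
	int sent;	/* packets handed to the UDP tunnel */
};

static void pkt_append(struct pkt *p)
{
	p->queued++;
}

static void pkt_create_and_xmit(struct pkt *p)
{
	if (p->queued) {	/* build one packet from queued frames */
		p->queued = 0;
		p->sent++;
	}
}

static void pkt_flush(struct pkt *p)
{
	pkt_create_and_xmit(p);	/* send only if something is pending */
}
```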
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v3:
- Adjust global connection and listen socket hashtable operations
based on the new hashtable type.
- Introduce quic_packet_backlog_schedule() to enqueue Initial packets
to quic_net.backlog_list and defer their decryption for ALPN demux
to quic_packet_backlog_work() on quic_net.work, since
quic_crypto_initial_keys_install()/crypto_aead_setkey() must run
in process context.
v4:
- Update quic_(listen_)sock_lookup() to support lockless socket
lookup using hlist_nulls_node APIs.
- Use quic_wq for QUIC packet backlog processing work.
v5:
- Rename quic_packet_create() to quic_packet_create_and_xmit()
(suggested by Paolo).
- Move the packet parser base code to a separate patch, keeping only
the packet builder base in this patch (suggested by Paolo).
- Change sent_time timestamp from u32 to u64 to improve accuracy.
v8:
- Remove the dependency on struct quic_frame by returning NULL in
quic_packet_handshake/app_create() and dropping quic_packet_tail()
and struct quic_packet_sent. This effectively strips out patch 14
(suggested by Paolo).
v9:
- Warn on oversized header length in quic_packet_config() (suggested by
Paolo).
- Factor bundle initialization into a common 'init' goto label in
quic_packet_bundle() (suggested by Paolo).
- Clarify comment for packet->ipfragok in quic_packet_config().
v10:
- Set MSS to QUIC_MIN_UDP_PAYLOAD in quic_packet_init(); it serves only
as a default for procfs dumps before a connection exists.
- Introduce QUIC_PACKET_INVALID as a return value for invalid packet
types used in the later patch.
- quic_sock.config.plpmtud_probe_interval has been moved to
quic_path_group.plpmtud_interval, so update its usage in
quic_packet_route() and quic_packet_config() accordingly.
v11:
- Set maximum line length to 80 characters.
- Change return type of quic_packet_empty() to bool.
- Propagate errors from quic_packet_route() in quic_packet_config()
(noted by AI review).
- Use quic_packet_taglen() instead of open-coded logic in
quic_packet_mss(), quic_packet_max_payload(), and
quic_packet_max_payload_dgram() (noted by AI review).
- Replace some magic numbers with QUIC_PACKET_FORM_SHORT/LONG and
QUIC_PACKET_MSS_NORMAL/DGRAM (noted by Paolo).
- Use WARN_ON_ONCE() instead of WARN_ON() in quic_packet_xmit().
skb_set_owner_w/r() cannot be used here because it performs memory
accounting, which is not desired in this context (noted by Paolo).
---
net/quic/Makefile | 2 +-
net/quic/packet.c | 268 ++++++++++++++++++++++++++++++++++++++++++++++
net/quic/packet.h | 117 ++++++++++++++++++++
net/quic/socket.c | 1 +
net/quic/socket.h | 8 ++
5 files changed, 395 insertions(+), 1 deletion(-)
create mode 100644 net/quic/packet.c
create mode 100644 net/quic/packet.h
diff --git a/net/quic/Makefile b/net/quic/Makefile
index 2ccf01ad9e22..0f903f4a7ff1 100644
--- a/net/quic/Makefile
+++ b/net/quic/Makefile
@@ -6,4 +6,4 @@
obj-$(CONFIG_IP_QUIC) += quic.o
quic-y := common.o family.o protocol.o socket.o stream.o connid.o path.o \
- cong.o pnspace.o crypto.o timer.o
+ cong.o pnspace.o crypto.o timer.o packet.o
diff --git a/net/quic/packet.c b/net/quic/packet.c
new file mode 100644
index 000000000000..0805bc77c2a2
--- /dev/null
+++ b/net/quic/packet.c
@@ -0,0 +1,268 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Packet builder for QUIC protocol support.
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+#include "socket.h"
+
+#define QUIC_HLEN 1
+
+/* Use fixed lengths here to simplify encoding. */
+#define QUIC_PACKET_NUMBER_LEN QUIC_PN_MAX_LEN
+#define QUIC_PACKET_LENGTH_LEN 4
+
+static struct sk_buff *quic_packet_handshake_create(struct sock *sk)
+{
+ return NULL;
+}
+
+static int quic_packet_number_check(struct sock *sk)
+{
+ return 0;
+}
+
+static struct sk_buff *quic_packet_app_create(struct sock *sk)
+{
+ return NULL;
+}
+
+/* Update the MSS and inform congestion control. */
+void quic_packet_mss_update(struct sock *sk, u32 mss)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_cong *cong = quic_cong(sk);
+
+ packet->mss[QUIC_PACKET_MSS_NORMAL] = (u16)mss;
+ quic_cong_set_mss(cong, packet->mss[QUIC_PACKET_MSS_NORMAL] -
+ packet->taglen[QUIC_PACKET_FORM_SHORT]);
+}
+
+/* Perform routing for the QUIC packet on the specified path, update header
+ * length and MSS accordingly, reset path and start PMTU timer.
+ */
+int quic_packet_route(struct sock *sk)
+{
+ struct quic_path_group *paths = quic_paths(sk);
+ struct quic_packet *packet = quic_packet(sk);
+ union quic_addr *sa, *da;
+ u32 pmtu;
+ int err;
+
+ da = quic_path_daddr(paths, packet->path);
+ sa = quic_path_saddr(paths, packet->path);
+ err = quic_flow_route(sk, da, sa, &paths->fl);
+ if (err)
+ return err < 0 ? err : 0;
+
+ packet->hlen = quic_encap_len(da);
+ pmtu = min_t(u32, dst_mtu(__sk_dst_get(sk)), QUIC_PATH_MAX_PMTU);
+ quic_packet_mss_update(sk, pmtu - packet->hlen);
+
+ quic_path_pl_reset(paths);
+ quic_timer_reset(sk, QUIC_TIMER_PMTU, paths->plpmtud_interval);
+ return 0;
+}
+
+/* Configure the QUIC packet header and routing based on encryption level and
+ * path.
+ */
+int quic_packet_config(struct sock *sk, u8 level, u8 path)
+{
+ struct quic_conn_id_set *source = quic_source(sk);
+ struct quic_conn_id_set *dest = quic_dest(sk);
+ struct quic_packet *packet = quic_packet(sk);
+ u32 hlen = QUIC_HLEN;
+
+ /* If packet already has data, no need to reconfigure. */
+ if (!quic_packet_empty(packet))
+ return 0;
+
+ packet->ack_eliciting = 0;
+ packet->frame_len = 0;
+ packet->ipfragok = 0;
+ packet->padding = 0;
+ packet->frames = 0;
+ hlen += QUIC_PACKET_NUMBER_LEN; /* Packet number length. */
+ hlen += quic_conn_id_choose(dest, path)->len; /* DCID length. */
+ if (level) {
+ hlen += 1; /* Length byte for DCID. */
+ /* Length byte + SCID length. */
+ hlen += 1 + quic_conn_id_active(source)->len;
+ /* Include token for Initial packets. */
+ if (level == QUIC_CRYPTO_INITIAL)
+ hlen += quic_var_len(quic_token(sk)->len) +
+ quic_token(sk)->len;
+ hlen += QUIC_VERSION_LEN; /* Version length. */
+ hlen += QUIC_PACKET_LENGTH_LEN; /* Packet length field. */
+ /* Allow fragmentation for handshake packets before PLPMTUD
+ * probing starts. MTU discovery does not rely on ICMP Packet
+ * Too Big once PLPMTUD is enabled.
+ */
+ packet->ipfragok = !!quic_paths(sk)->plpmtud_interval;
+ }
+ packet->level = level;
+ packet->len = (u16)hlen;
+ packet->overhead = (u8)hlen;
+ DEBUG_NET_WARN_ON_ONCE(hlen > 255);
+
+ if (packet->path != path) {
+ /* Path changed; update and reset routing cache */
+ packet->path = path;
+ __sk_dst_reset(sk);
+ }
+
+ /* Perform routing and MSS update for the configured packet. */
+ return quic_packet_route(sk);
+}
+
+static void quic_packet_encrypt_done(struct sk_buff *skb, int err)
+{
+ /* Free it for now, future patches will implement the actual deferred
+ * transmission logic.
+ */
+ kfree_skb(skb);
+}
+
+/* Coalesce packets into a single UDP datagram. */
+static int quic_packet_bundle(struct sock *sk, struct sk_buff *skb)
+{
+ struct quic_skb_cb *head_cb, *cb = QUIC_SKB_CB(skb);
+ struct quic_packet *packet = quic_packet(sk);
+ struct sk_buff *p;
+
+ if (!packet->head) /* First packet to bundle: initialize the head. */
+ goto init;
+
+ /* If bundling would exceed MSS, flush the current bundle. */
+ if (packet->head->len + skb->len >=
+ packet->mss[QUIC_PACKET_MSS_NORMAL]) {
+ quic_packet_flush(sk);
+ goto init;
+ }
+ /* Bundle it and update metadata for the aggregate skb. */
+ p = packet->head;
+ head_cb = QUIC_SKB_CB(p);
+ if (head_cb->last == p)
+ skb_shinfo(p)->frag_list = skb;
+ else
+ head_cb->last->next = skb;
+ p->data_len += skb->len;
+ p->truesize += skb->truesize;
+ p->len += skb->len;
+ head_cb->last = skb;
+ head_cb->ecn |= cb->ecn; /* Merge ECN flags. */
+
+out:
+ /* rfc9000#section-12.2: Packets with a short header (Section 17.3) do
+ * not contain a Length field and so cannot be followed by other
+ * packets in the same UDP datagram.
+ *
+ * So return 1 to flush if this is a short header packet.
+ */
+ return !cb->level;
+init:
+ packet->head = skb;
+ cb->last = skb;
+ goto out;
+}
+
+/* Transmit a QUIC packet, possibly encrypting and bundling it. */
+int quic_packet_xmit(struct sock *sk, struct sk_buff *skb)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct net *net = sock_net(sk);
+ int err;
+
+ /* Skip encryption if taglen == 0 (e.g., disable_1rtt_encryption). */
+ if (!packet->taglen[quic_hdr(skb)->form])
+ goto xmit;
+
+ cb->crypto_done = quic_packet_encrypt_done;
+ /* Associate skb with sk to ensure sk is valid during async encryption
+ * completion.
+ */
+ WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
+ err = quic_crypto_encrypt(quic_crypto(sk, packet->level), skb);
+ if (err) {
+ if (err != -EINPROGRESS) {
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_ENCDROP);
+ kfree_skb(skb);
+ return err;
+ }
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_ENCBACKLOGS);
+ return err;
+ }
+ if (!cb->resume) /* Encryption completes synchronously. */
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_ENCFASTPATHS);
+
+xmit:
+ if (quic_packet_bundle(sk, skb))
+ quic_packet_flush(sk);
+ return 0;
+}
+
+/* Create and transmit a new QUIC packet. */
+int quic_packet_create_and_xmit(struct sock *sk)
+{
+ struct quic_packet *packet = quic_packet(sk);
+ struct sk_buff *skb;
+ int err;
+
+ err = quic_packet_number_check(sk);
+ if (err)
+ goto err;
+
+ if (packet->level)
+ skb = quic_packet_handshake_create(sk);
+ else
+ skb = quic_packet_app_create(sk);
+ if (!skb) {
+ err = -ENOMEM;
+ goto err;
+ }
+
+ err = quic_packet_xmit(sk, skb);
+ if (err && err != -EINPROGRESS)
+ goto err;
+
+ /* Return 1 if at least one ACK-eliciting (non-PING) frame was sent. */
+ return !!packet->frames;
+err:
+ pr_debug("%s: err: %d\n", __func__, err);
+ return 0;
+}
+
+/* Flush any coalesced/bundled QUIC packets. */
+void quic_packet_flush(struct sock *sk)
+{
+ struct quic_path_group *paths = quic_paths(sk);
+ struct quic_packet *packet = quic_packet(sk);
+
+ if (packet->head) {
+ quic_lower_xmit(sk, packet->head,
+ quic_path_daddr(paths, packet->path),
+ &paths->fl);
+ packet->head = NULL;
+ }
+}
+
+void quic_packet_init(struct sock *sk)
+{
+ struct quic_packet *packet = quic_packet(sk);
+
+ INIT_LIST_HEAD(&packet->frame_list);
+ packet->taglen[QUIC_PACKET_FORM_SHORT] = QUIC_TAG_LEN;
+ packet->taglen[QUIC_PACKET_FORM_LONG] = QUIC_TAG_LEN;
+ packet->mss[QUIC_PACKET_MSS_NORMAL] = QUIC_MIN_UDP_PAYLOAD;
+ packet->mss[QUIC_PACKET_MSS_DGRAM] = QUIC_MIN_UDP_PAYLOAD;
+
+ packet->version = QUIC_VERSION_V1;
+}
diff --git a/net/quic/packet.h b/net/quic/packet.h
new file mode 100644
index 000000000000..834c4f72271b
--- /dev/null
+++ b/net/quic/packet.h
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* QUIC kernel implementation
+ * (C) Copyright Red Hat Corp. 2023
+ *
+ * This file is part of the QUIC kernel implementation
+ *
+ * Written or modified by:
+ * Xin Long <lucien.xin@gmail.com>
+ */
+
+struct quic_packet {
+ struct quic_conn_id dcid; /* Dest Conn ID from received packet */
+ struct quic_conn_id scid; /* Source Conn ID from received packet */
+ union quic_addr daddr; /* Dest address from received packet */
+ union quic_addr saddr; /* Source address from received packet */
+
+ struct list_head frame_list; /* Frames to pack into packet for send */
+ struct sk_buff *head; /* Head skb for packet bundling on send */
+ u16 frame_len; /* Length of all ack-eliciting frames excluding PING */
+ u8 taglen[2]; /* Tag length for short and long packets */
+ u32 version; /* QUIC version used/selected during handshake */
+ u8 errframe; /* Frame type causing packet processing failure */
+ u8 overhead; /* QUIC header length excluding frames */
+ u16 errcode; /* Error code on packet processing failure */
+ u16 frames; /* Number of ack-eliciting frames excluding PING */
+ u16 mss[2]; /* MSS for normal [0] and datagram [1] packets */
+ u16 hlen; /* UDP + IP header length for sending */
+ u16 len; /* QUIC packet length excluding taglen for sending */
+
+ u8 ack_eliciting:1; /* Packet contains ack-eliciting frames to send */
+ u8 ack_requested:1; /* Packet contains ack-eliciting frames received */
+ u8 ack_immediate:1; /* Send ACK immediately (skip ack_delay timer) */
+ u8 non_probing:1; /* Ack-eliciting packet (excl. NEW_CONNECTION_ID) */
+ u8 has_sack:1; /* Packet has ACK frames received */
+ u8 ipfragok:1; /* Allow IP fragmentation */
+ u8 padding:1; /* Packet has padding frames */
+ u8 path:1; /* Path identifier used to send this packet */
+ u8 level; /* Encryption level used */
+};
+
+#define QUIC_PACKET_INITIAL_V1 0
+#define QUIC_PACKET_0RTT_V1 1
+#define QUIC_PACKET_HANDSHAKE_V1 2
+#define QUIC_PACKET_RETRY_V1 3
+
+#define QUIC_PACKET_INITIAL_V2 1
+#define QUIC_PACKET_0RTT_V2 2
+#define QUIC_PACKET_HANDSHAKE_V2 3
+#define QUIC_PACKET_RETRY_V2 0
+
+#define QUIC_PACKET_INITIAL QUIC_PACKET_INITIAL_V1
+#define QUIC_PACKET_0RTT QUIC_PACKET_0RTT_V1
+#define QUIC_PACKET_HANDSHAKE QUIC_PACKET_HANDSHAKE_V1
+#define QUIC_PACKET_RETRY QUIC_PACKET_RETRY_V1
+
+#define QUIC_PACKET_INVALID 0xff
+
+#define QUIC_VERSION_LEN 4
+
+#define QUIC_PACKET_MSS_NORMAL 0
+#define QUIC_PACKET_MSS_DGRAM 1
+
+#define QUIC_PACKET_FORM_SHORT 0
+#define QUIC_PACKET_FORM_LONG 1
+
+static inline u8 quic_packet_taglen(struct quic_packet *packet)
+{
+ return packet->taglen[packet->level != QUIC_CRYPTO_APP];
+}
+
+static inline void quic_packet_set_taglen(struct quic_packet *packet, u8 taglen)
+{
+ packet->taglen[QUIC_PACKET_FORM_SHORT] = taglen;
+}
+
+static inline u32 quic_packet_mss(struct quic_packet *packet)
+{
+ return packet->mss[QUIC_PACKET_MSS_NORMAL] - quic_packet_taglen(packet);
+}
+
+static inline u32 quic_packet_max_payload(struct quic_packet *packet)
+{
+ return packet->mss[QUIC_PACKET_MSS_NORMAL] - packet->overhead -
+ quic_packet_taglen(packet);
+}
+
+static inline u32 quic_packet_max_payload_dgram(struct quic_packet *packet)
+{
+ return packet->mss[QUIC_PACKET_MSS_DGRAM] - packet->overhead -
+ quic_packet_taglen(packet);
+}
+
+static inline bool quic_packet_empty(struct quic_packet *packet)
+{
+ return list_empty(&packet->frame_list);
+}
+
+static inline void quic_packet_reset(struct quic_packet *packet)
+{
+ packet->level = 0;
+ packet->errcode = 0;
+ packet->errframe = 0;
+ packet->has_sack = 0;
+ packet->non_probing = 0;
+ packet->ack_requested = 0;
+ packet->ack_immediate = 0;
+}
+
+int quic_packet_config(struct sock *sk, u8 level, u8 path);
+
+int quic_packet_xmit(struct sock *sk, struct sk_buff *skb);
+int quic_packet_create_and_xmit(struct sock *sk);
+int quic_packet_route(struct sock *sk);
+
+void quic_packet_mss_update(struct sock *sk, u32 mss);
+void quic_packet_flush(struct sock *sk);
+void quic_packet_init(struct sock *sk);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index c37cc1522254..b9fbc33c0f79 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -52,6 +52,7 @@ static int quic_init_sock(struct sock *sk)
quic_cong_init(quic_cong(sk));
quic_timer_init(sk);
+ quic_packet_init(sk);
if (quic_stream_init(quic_streams(sk)))
return -ENOMEM;
diff --git a/net/quic/socket.h b/net/quic/socket.h
index c5654fdc06b5..1efc76ec2033 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -20,6 +20,8 @@
#include "path.h"
#include "cong.h"
+#include "packet.h"
+
#include "protocol.h"
#include "timer.h"
@@ -74,6 +76,7 @@ struct quic_sock {
struct quic_pnspace space[QUIC_PNSPACE_MAX];
struct quic_crypto crypto[QUIC_CRYPTO_MAX];
+ struct quic_packet packet;
struct quic_timer timers[QUIC_TIMER_MAX];
};
@@ -147,6 +150,11 @@ static inline struct quic_crypto *quic_crypto(const struct sock *sk, u8 level)
return &quic_sk(sk)->crypto[level];
}
+static inline struct quic_packet *quic_packet(const struct sock *sk)
+{
+ return &quic_sk(sk)->packet;
+}
+
static inline void *quic_timer(const struct sock *sk, u8 type)
{
return (void *)&quic_sk(sk)->timers[type];
--
2.47.1
* [PATCH net-next v11 15/15] quic: add packet parser base
2026-03-25 3:47 [PATCH net-next v11 00/15] net: introduce QUIC infrastructure and core subcomponents Xin Long
` (13 preceding siblings ...)
2026-03-25 3:47 ` [PATCH net-next v11 14/15] quic: add packet builder base Xin Long
@ 2026-03-25 3:47 ` Xin Long
14 siblings, 0 replies; 17+ messages in thread
From: Xin Long @ 2026-03-25 3:47 UTC (permalink / raw)
To: network dev, quic
Cc: davem, kuba, Eric Dumazet, Paolo Abeni, Simon Horman,
Stefan Metzmacher, Moritz Buhl, Tyler Fanelli, Pengtao He,
Thomas Dreibholz, linux-cifs, Steve French, Namjae Jeon,
Paulo Alcantara, Tom Talpey, kernel-tls-handshake, Chuck Lever,
Jeff Layton, Steve Dickson, Hannes Reinecke, Alexander Aring,
David Howells, Matthieu Baerts, John Ericson, Cong Wang,
D . Wythe, Jason Baron, illiliti, Sabrina Dubroca,
Marcelo Ricardo Leitner, Daniel Stenberg, Andy Gospodarek,
Marc E . Fiuczynski
This patch uses 'quic_packet' to handle parsing of QUIC packets on the
receive (RX) path.
It introduces mechanisms to parse the ALPN from client Initial packets
to determine the correct listener socket. Received packets are then
routed and processed accordingly. Similar to the TX path, handling for
application and handshake packets is not yet implemented.
- quic_packet_parse_alpn(): Parse the ALPN from a client Initial packet,
then locate the appropriate listener using the ALPN.
- quic_packet_rcv(): Locate the appropriate socket to handle the packet
via quic_packet_process().
- quic_packet_process(): Process the received packet.
In addition to packet flow, this patch adds support for ICMP-based MTU
updates by locating the relevant socket and updating the stored PMTU
accordingly.
- quic_packet_rcv_err_pmtu(): Find the socket and update the PMTU via
quic_packet_mss_update().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
v5:
- In quic_packet_rcv_err(), remove the unnecessary quic_is_listen()
check and move quic_get_mtu_info() out of sock lock (suggested
by Paolo).
- Replace cancel_work_sync() with disable_work_sync() (suggested by
Paolo).
v6:
- Fix the loop using skb_dequeue() in quic_packet_backlog_work(), and
kfree_skb() when sk is not found (reported by AI Reviews).
- Remove skb_pull() from quic_packet_rcv(), since it is now handled
in quic_path_rcv().
- Note for AI reviews: add if (dst) check in quic_packet_rcv_err_pmtu(),
although quic_packet_route() >= 0 already guarantees it is not NULL.
- Note for AI reviews: it is safe to do *plen -= QUIC_HLEN in
quic_packet_get_version_and_connid(), since quic_packet_get_sock()
already checks if (skb->len < QUIC_HLEN).
- Note for AI reviews: cb->length - cb->number_len - QUIC_TAG_LEN
cannot underflow, because quic_crypto_header_decrypt() already checks
if (cb->length < QUIC_PN_MAX_LEN + QUIC_SAMPLE_LEN).
- Note for AI reviews: the cast length in quic_packet_parse_alpn()
is safe, as there is a prior check if (length > (u16)len); len is
skb->len, which cannot exceed U16_MAX for UDP packet with QUIC.
- Note for AI reviews: it's correct to do if (flags &
QUIC_F_MTU_REDUCED_DEFERRED) in quic_release_cb(), since
QUIC_MTU_REDUCED_DEFERRED is the bit used with test_and_set_bit().
- Note for AI reviews: move skb_cb->backlog = 1 before adding skb to
backlog, although it's safe to write skb_cb after adding to backlog
with sk_lock.slock, as skb dequeue from backlog requires sk_lock.slock.
v7:
- Pass udp sk to quic_packet_rcv(), quic_packet_rcv_err() and
quic_sock_lookup().
- Move the call to skb_linearize() and skb_set_owner_sk_safe() to
.quic_path_rcv()/quic_packet_rcv().
v8:
- Replace the global ALPN demultiplexing sysctl with the static key in
quic_packet_parse_alpn() (noted by Stefan).
- Refetch skb->data after decrypt in ALPN parsing, as skb_cow_data()
may reallocate the skb data buffer (reported by Syzkaller).
- The indirect quic_path_rcv has been removed and call quic_packet_rcv()
directly via extern.
- Do not restore skb data when QUIC Initial decryption fails, as the
caller will free the skb for this failure anyway.
- With patch 14 removed, define a temporary QUIC_FRAME_CRYPTO ID when
parsing the ALPN.
v9:
- Remove local_bh_disable() in quic_packet_get_listen_sock() as it's now
using rcu_read_lock instead of spin_sock in quic_listen_sock_lookup()
(noted by Paolo).
v10:
- Return QUIC_PACKET_INVALID (instead of -1) for invalid packet types in
quic_packet_version_get_type().
- Update the comment to clarify in quic_packet_rcv_err() that ICMP errors
embed the original QUIC packet, reversing src/dst addrs when parsed.
- Use qn->backlog_list.lock in quic_packet_backlog_schedule() to
prevent a TOCTOU race between the head->qlen check and its update in
__skb_queue_tail().
- Add check 'len < TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN' before parsing
ClientHello in quic_packet_get_alpn().
- Add more limits in quic_packet_get_alpn() to improve robustness against
malformed TLS ClientHello messages.
- Move skb_queue_purge() to after disable_work_sync() in quic_net_exit()
for clarity and to satisfy AI review.
- quic_sock.config.plpmtud_probe_interval has been moved to
quic_path_group.plpmtud_interval, so update its usage in
quic_packet_rcv_err_pmtu() accordingly.
- Remove quic_packet_select_version() and quic_packet_version_change();
they will be reintroduced later when needed in the next patch series.
v11:
- Note for AI review: refcount increments in quic_listen_sock_lookup()
and quic_sock_lookup() are left unchanged due to code complexity.
- Set maximum line length to 80 characters.
- Do not mark backlog packets as sleepable (cb->backlog = 1) in
sk_add_backlog path; Replace spin_(un)lock() with spin_(un)lock_bh()
in quic_packet_backlog_schedule().
- Return -ENOBUFS instead of -1 in quic_packet_backlog_schedule().
- Change err parameter type from u8 to bool (icmp) in quic_packet_rcv().
- Propagate errors from quic_packet_get_sock() and sk_add_backlog() in
quic_packet_rcv().
- Propagate errors from quic_packet_get_dcid() and
quic_packet_parse_alpn() in quic_packet_get_sock() via ERR_PTR().
- Propagate errors from quic_packet_parse_alpn() in
quic_packet_get_listen_sock() via ERR_PTR().
- Propagate errors from quic_packet_get_version_and_connid() and
quic_packet_get_token() in quic_packet_parse_alpn().
- Do not hold skb when calling quic_packet_backlog_schedule() in
quic_packet_parse_alpn(); do not free skb when returning
-EINPROGRESS from quic_packet_get_sock() in quic_packet_rcv().
- Move the quic_packet_rcv() declaration from packet.h to path.h,
as it's only called in path.c (noted by AI review).
- Merge quic_packet_get_dcid() and quic_packet_get_version_and_connid()
into quic_packet_get_long_header() and extract quic_packet_get_connid()
(noted by AI review).
---
net/quic/packet.c | 605 ++++++++++++++++++++++++++++++++++++++++++++
net/quic/packet.h | 8 +
net/quic/path.c | 4 +-
net/quic/path.h | 2 +
net/quic/protocol.c | 5 +
net/quic/protocol.h | 4 +
net/quic/socket.c | 149 +++++++++++
net/quic/socket.h | 7 +
8 files changed, 782 insertions(+), 2 deletions(-)
diff --git a/net/quic/packet.c b/net/quic/packet.c
index 0805bc77c2a2..88fbe839789d 100644
--- a/net/quic/packet.c
+++ b/net/quic/packet.c
@@ -14,6 +14,611 @@
#define QUIC_HLEN 1
+#define QUIC_LONG_HLEN(dcid, scid) \
+ (QUIC_HLEN + QUIC_VERSION_LEN + 1 + (dcid)->len + 1 + (scid)->len)
+
+#define QUIC_VERSION_NUM 2
+
+/* Supported QUIC versions and their compatible versions. Used for Compatible
+ * Version Negotiation in rfc9368#section-2.3.
+ */
+static u32 quic_versions[QUIC_VERSION_NUM][4] = {
+ /* Version, Compatible Versions */
+ { QUIC_VERSION_V1, QUIC_VERSION_V2, QUIC_VERSION_V1, 0 },
+ { QUIC_VERSION_V2, QUIC_VERSION_V2, QUIC_VERSION_V1, 0 },
+};
+
+/* Get the compatible version list for a given QUIC version. */
+u32 *quic_packet_compatible_versions(u32 version)
+{
+ u8 i;
+
+ for (i = 0; i < QUIC_VERSION_NUM; i++)
+ if (version == quic_versions[i][0])
+ return quic_versions[i];
+ return NULL;
+}
+
+/* Convert version-specific type to internal standard packet type. */
+static u8 quic_packet_version_get_type(u32 version, u8 type)
+{
+ if (version == QUIC_VERSION_V1)
+ return type;
+
+ switch (type) {
+ case QUIC_PACKET_INITIAL_V2:
+ return QUIC_PACKET_INITIAL;
+ case QUIC_PACKET_0RTT_V2:
+ return QUIC_PACKET_0RTT;
+ case QUIC_PACKET_HANDSHAKE_V2:
+ return QUIC_PACKET_HANDSHAKE;
+ case QUIC_PACKET_RETRY_V2:
+ return QUIC_PACKET_RETRY;
+ default:
+ return QUIC_PACKET_INVALID;
+ }
+}
+
+/* Extracts a QUIC Connection ID from a buffer in the long header packet. */
+static int quic_packet_get_connid(struct quic_conn_id *connid, u8 **pp,
+ u32 *plen)
+{
+ u64 len;
+
+ if (!quic_get_int(pp, plen, &len, 1) ||
+ len > *plen || len > QUIC_CONN_ID_MAX_LEN)
+ return -EINVAL;
+
+ quic_conn_id_update(connid, *pp, len);
+ *plen -= len;
+ *pp += len;
+ return 0;
+}
+
+/* Parse QUIC version and connection IDs (DCID and SCID) from a Long header
+ * packet buffer.
+ */
+static int quic_packet_get_long_header(struct quic_conn_id *dcid,
+ struct quic_conn_id *scid, u32 *version,
+ u8 **pp, u32 *plen)
+{
+ int err;
+ u64 v;
+
+ *pp += QUIC_HLEN;
+ *plen -= QUIC_HLEN;
+
+ if (!quic_get_int(pp, plen, &v, QUIC_VERSION_LEN))
+ return -EINVAL;
+ if (version)
+ *version = v;
+
+ err = quic_packet_get_connid(dcid, pp, plen);
+ if (err)
+ return err;
+ if (!scid)
+ return 0;
+ return quic_packet_get_connid(scid, pp, plen);
+}
+
+/* Extracts a QUIC token from a buffer in the Client Initial packet. */
+static int quic_packet_get_token(struct quic_data *token, u8 **pp, u32 *plen)
+{
+ u64 len;
+
+ if (!quic_get_var(pp, plen, &len) || len > *plen)
+ return -EINVAL;
+ quic_data(token, *pp, len);
+ *plen -= len;
+ *pp += len;
+ return 0;
+}
+
+/* Process PMTU reduction event on a QUIC socket. */
+void quic_packet_rcv_err_pmtu(struct sock *sk)
+{
+ struct quic_path_group *paths = quic_paths(sk);
+ struct quic_packet *packet = quic_packet(sk);
+ u32 pathmtu, info, taglen;
+ struct dst_entry *dst;
+ bool reset_timer;
+
+ if (!ip_sk_accept_pmtu(sk))
+ return;
+
+ info = clamp(paths->mtu_info, QUIC_PATH_MIN_PMTU, QUIC_PATH_MAX_PMTU);
+ /* If PLPMTUD is not enabled, update MSS using route and ICMP info. */
+ if (!paths->plpmtud_interval) {
+ if (quic_packet_route(sk))
+ return;
+
+ dst = __sk_dst_get(sk);
+ if (dst)
+ dst->ops->update_pmtu(dst, sk, NULL, info, true);
+ quic_packet_mss_update(sk, info - packet->hlen);
+ return;
+ }
+ /* PLPMTUD is enabled: adjust to smaller PMTU, subtract headers and
+ * AEAD tag. Also notify the QUIC path layer for possible state
+ * changes and probing.
+ */
+ taglen = quic_packet_taglen(packet);
+ info = info - packet->hlen - taglen;
+ pathmtu = quic_path_pl_toobig(paths, info, &reset_timer);
+ if (reset_timer)
+ quic_timer_reset(sk, QUIC_TIMER_PMTU, paths->plpmtud_interval);
+ if (pathmtu)
+ quic_packet_mss_update(sk, pathmtu + taglen);
+}
+
+/* Handle ICMP Toobig packet and update QUIC socket path MTU. */
+static int quic_packet_rcv_err(struct sock *sk, struct sk_buff *skb)
+{
+ union quic_addr daddr, saddr;
+ u32 info;
+
+ /* ICMP embeds the original outgoing QUIC packet, so saddr/daddr are
+ * reversed when parsed. Only address-based socket lookup is possible
+ * in this case.
+ */
+ quic_get_msg_addrs(skb, &saddr, &daddr);
+ sk = quic_sock_lookup(skb, &daddr, &saddr, sk, NULL);
+ if (!sk)
+ return -ENOENT;
+
+ if (quic_get_mtu_info(skb, &info)) {
+ sock_put(sk);
+ return 0;
+ }
+
+ /* Success: update socket path MTU info. */
+ bh_lock_sock(sk);
+ quic_paths(sk)->mtu_info = info;
+ if (sock_owned_by_user(sk)) {
+ /* Socket locked by userspace. Defer MTU processing via
+ * release_cb. Hold socket reference to prevent it being
+ * freed before deferral.
+ */
+ if (!test_and_set_bit(QUIC_MTU_REDUCED_DEFERRED,
+ &sk->sk_tsq_flags))
+ sock_hold(sk);
+ goto out;
+ }
+ /* Otherwise, process the MTU reduction now. */
+ quic_packet_rcv_err_pmtu(sk);
+out:
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ return 1;
+}
+
+#define QUIC_PACKET_BACKLOG_MAX 4096
+
+/* Queue a packet for later processing when sleeping is allowed. */
+static int quic_packet_backlog_schedule(struct net *net, struct sk_buff *skb)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct quic_net *qn = quic_net(net);
+ struct sk_buff_head *head;
+
+ if (cb->backlog)
+ return 0;
+
+ head = &qn->backlog_list;
+ spin_lock_bh(&head->lock);
+ if (head->qlen >= QUIC_PACKET_BACKLOG_MAX) {
+ spin_unlock_bh(&head->lock);
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_RCVDROP);
+ kfree_skb(skb);
+ return -ENOBUFS;
+ }
+ cb->backlog = 1;
+ __skb_queue_tail(head, skb);
+ spin_unlock_bh(&head->lock);
+
+ queue_work(quic_wq, &qn->work);
+ return 1;
+}
+
+#define TLS_MT_CLIENT_HELLO 1
+#define TLS_EXT_alpn 16
+
+/* TLS Client Hello Msg:
+ *
+ * uint16 ProtocolVersion;
+ * opaque Random[32];
+ * uint8 CipherSuite[2];
+ *
+ * struct {
+ * ExtensionType extension_type;
+ * opaque extension_data<0..2^16-1>;
+ * } Extension;
+ *
+ * struct {
+ * ProtocolVersion legacy_version = 0x0303;
+ * Random rand;
+ * opaque legacy_session_id<0..32>;
+ * CipherSuite cipher_suites<2..2^16-2>;
+ * opaque legacy_compression_methods<1..2^8-1>;
+ * Extension extensions<8..2^16-1>;
+ * } ClientHello;
+ */
+
+#define TLS_CH_RANDOM_LEN 32
+#define TLS_CH_VERSION_LEN 2
+#define TLS_MAX_EXTENSIONS 128
+
+/* Extract ALPN data from a TLS ClientHello message.
+ *
+ * Parses the TLS ClientHello handshake message to find the ALPN (Application
+ * Layer Protocol Negotiation) TLS extension. It validates the TLS ClientHello
+ * structure, including version, random, session ID, cipher suites, compression
+ * methods, and extensions. Once the ALPN extension is found, the ALPN
+ * protocols list is extracted and stored in @alpn.
+ *
+ * Return: 0 on success or no ALPN found, a negative error code on failed
+ * parsing.
+ */
+static int quic_packet_get_alpn(struct quic_data *alpn, u8 *p, u32 len)
+{
+ int err = -EINVAL, found = 0, exts = 0;
+ u64 length, type;
+
+ /* Verify handshake message type (ClientHello) and its length. */
+ if (!quic_get_int(&p, &len, &type, 1) || type != TLS_MT_CLIENT_HELLO)
+ return err;
+ if (!quic_get_int(&p, &len, &length, 3) ||
+ len < TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN ||
+ length < TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN)
+ return err;
+ if (len > (u32)length) /* Cap len to handshake msg length. */
+ len = length;
+ /* Skip legacy_version (2 bytes) + random (32 bytes). */
+ p += TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN;
+ len -= TLS_CH_RANDOM_LEN + TLS_CH_VERSION_LEN;
+ /* legacy_session_id_len must be zero (QUIC requirement). */
+ if (!quic_get_int(&p, &len, &length, 1) || length)
+ return err;
+
+ /* Skip cipher_suites (2 bytes length + variable data). */
+ if (!quic_get_int(&p, &len, &length, 2) || length > (u64)len)
+ return err;
+ len -= length;
+ p += length;
+
+ /* Skip legacy_compression_methods (1 byte length + variable data). */
+ if (!quic_get_int(&p, &len, &length, 1) || length > (u64)len)
+ return err;
+ len -= length;
+ p += length;
+
+ /* Read TLS extensions length (2 bytes). */
+ if (!quic_get_int(&p, &len, &length, 2))
+ return err;
+ if (len > (u32)length) /* Limit len to extensions length if larger. */
+ len = length;
+ while (len > 4) { /* Scan extensions for ALPN (TLS_EXT_alpn) */
+ if (!quic_get_int(&p, &len, &type, 2))
+ break;
+ if (!quic_get_int(&p, &len, &length, 2))
+ break;
+ if (len < (u32)length) /* Incomplete TLS extensions. */
+ return 0;
+ if (type == TLS_EXT_alpn) { /* Found ALPN extension. */
+ if (length > QUIC_ALPN_MAX_LEN)
+ return err;
+ len = length;
+ found = 1;
+ break;
+ }
+ /* Skip non-ALPN extensions. */
+ p += length;
+ len -= length;
+ if (exts++ >= TLS_MAX_EXTENSIONS)
+ return err;
+ }
+ if (!found) { /* No ALPN ext: set alpn->len = 0 and alpn->data = p. */
+ quic_data(alpn, p, 0);
+ return 0;
+ }
+
+ /* Parse ALPN protocols list length (2 bytes). */
+ if (!quic_get_int(&p, &len, &length, 2) || length > (u64)len)
+ return err;
+ quic_data(alpn, p, length); /* Store ALPN list in alpn->data. */
+ len = length;
+ while (len) { /* Validate ALPN protocols list format. */
+ if (!quic_get_int(&p, &len, &length, 1) || length > (u64)len) {
+ /* Bad ALPN: set alpn->len = 0, alpn->data = NULL. */
+ quic_data(alpn, NULL, 0);
+ return err;
+ }
+ len -= length;
+ p += length;
+ }
+ pr_debug("%s: alpn_len: %d\n", __func__, alpn->len);
+ return 0;
+}
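The extension scan above walks TLS `type:2, length:2, data` records until it hits TLS_EXT_alpn. A minimal userspace sketch of that loop (hypothetical `get_int()`/`find_alpn()` helpers mirroring `quic_get_int()`, not the kernel code itself):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_EXT_ALPN 16 /* IANA extension type for ALPN (RFC 7301) */

/* Read an n-byte big-endian integer from *p, advancing the cursor and
 * shrinking the remaining length. Returns 0 if fewer than n bytes remain.
 */
static int get_int(const uint8_t **p, size_t *len, uint64_t *val, int n)
{
	if (*len < (size_t)n)
		return 0;
	*val = 0;
	for (int i = 0; i < n; i++)
		*val = (*val << 8) | (*p)[i];
	*p += n;
	*len -= n;
	return 1;
}

/* Scan a TLS extensions block for the ALPN extension; return a pointer
 * to its body (the protocol list) and its length, or NULL if absent.
 */
static const uint8_t *find_alpn(const uint8_t *exts, size_t len,
				size_t *alpn_len)
{
	const uint8_t *p = exts;
	uint64_t type, length;

	while (len > 4) {
		if (!get_int(&p, &len, &type, 2) ||
		    !get_int(&p, &len, &length, 2) || length > len)
			return NULL;
		if (type == TLS_EXT_ALPN) {
			*alpn_len = (size_t)length;
			return p;
		}
		p += length;	/* skip non-ALPN extension body */
		len -= length;
	}
	return NULL;
}
```

The kernel version additionally bounds the number of extensions scanned (TLS_MAX_EXTENSIONS) so a hostile ClientHello cannot make the loop spin.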
+
+#define QUIC_FRAME_CRYPTO 0x06
+
+/* Parse ALPN from a QUIC Initial packet.
+ *
+ * This function processes a QUIC Initial packet to extract the ALPN from the
+ * TLS ClientHello message inside the QUIC CRYPTO frame. It verifies packet
+ * type, version compatibility, decrypts the packet payload, and locates the
+ * CRYPTO frame to parse the TLS ClientHello. Finally, it calls
+ * quic_packet_get_alpn() to extract the ALPN extension data.
+ *
+ * Return: 0 on success or no ALPN found, a negative error code on failed
+ * parsing.
+ */
+static int quic_packet_parse_alpn(struct sk_buff *skb, struct quic_data *alpn)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct net *net = sock_net(skb->sk);
+ struct quic_conn_id dcid, scid;
+ u32 len = skb->len, version;
+ struct quic_crypto *crypto;
+ u8 *p = skb->data, type;
+ struct quic_data token;
+ u64 offset, length;
+ int err;
+
+ if (!static_branch_unlikely(&quic_alpn_demux_key))
+ return 0;
+ err = quic_packet_get_long_header(&dcid, &scid, &version, &p, &len);
+ if (err)
+ return err;
+ if (!quic_packet_compatible_versions(version))
+ return 0;
+ /* Only parse Initial packets. */
+ type = quic_packet_version_get_type(version, quic_hshdr(skb)->type);
+ if (type != QUIC_PACKET_INITIAL)
+ return 0;
+ err = quic_packet_get_token(&token, &p, &len);
+ if (err)
+ return err;
+ if (!quic_get_var(&p, &len, &length) || length > (u64)len)
+ return -EINVAL;
+ if (quic_packet_backlog_schedule(net, skb))
+ return -EINPROGRESS;
+ cb->length = (u16)length;
+
+ /* Install initial keys for packet decryption to crypto. */
+ crypto = &quic_net(net)->crypto;
+ err = quic_crypto_initial_keys_install(crypto, &dcid, version, 1);
+ if (err)
+ return err;
+ cb->number_offset = (u16)(p - skb->data);
+ err = quic_crypto_decrypt(crypto, skb);
+ if (err) {
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_DECDROP);
+ return err;
+ }
+
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_DECFASTPATHS);
+ cb->resume = 1; /* Mark this packet as already decrypted. */
+
+ /* Find the QUIC CRYPTO frame. */
+ p = skb->data + cb->number_offset + cb->number_len;
+ len = cb->length - cb->number_len - QUIC_TAG_LEN;
+	for (; len && !(*p); p++, len--) /* Skip any PADDING frames. */
+ ;
+ if (!len-- || *p++ != QUIC_FRAME_CRYPTO)
+ return 0;
+ if (!quic_get_var(&p, &len, &offset) || offset)
+ return 0;
+ if (!quic_get_var(&p, &len, &length) || length > (u64)len)
+ return 0;
+
+ /* Parse the TLS CLIENT_HELLO message. */
+ return quic_packet_get_alpn(alpn, p, length);
+}
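The CRYPTO frame's offset and length read above are QUIC variable-length integers. A standalone sketch of the decoder (a hypothetical `get_var()` mirroring what `quic_get_var()` does, per RFC 9000, Section 16):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a QUIC varint: the two most significant bits of the first byte
 * select a 1-, 2-, 4-, or 8-byte encoding; the remaining bits hold the
 * value, big-endian. Returns 0 if the buffer is too short.
 */
static int get_var(const uint8_t **p, size_t *len, uint64_t *val)
{
	size_t n;

	if (!*len)
		return 0;
	n = (size_t)1 << (**p >> 6);	/* 1, 2, 4 or 8 bytes */
	if (*len < n)
		return 0;
	*val = **p & 0x3f;		/* strip the 2 length bits */
	for (size_t i = 1; i < n; i++)
		*val = (*val << 8) | (*p)[i];
	*p += n;
	*len -= n;
	return 1;
}
```

The test vectors below are the worked examples from RFC 9000, Appendix A.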
+
+/* Lookup listening socket for Client Initial packet (in process context). */
+static struct sock *quic_packet_get_listen_sock(struct sk_buff *skb)
+{
+ union quic_addr daddr, saddr;
+ struct quic_data alpns = {};
+ struct sock *sk;
+ int err;
+
+ quic_get_msg_addrs(skb, &daddr, &saddr);
+
+ err = quic_packet_parse_alpn(skb, &alpns);
+ if (err)
+ return ERR_PTR(err);
+
+ sk = quic_listen_sock_lookup(skb, &daddr, &saddr, &alpns);
+ if (!sk)
+ return ERR_PTR(-ENOENT);
+ return sk;
+}
+
+/* Determine the QUIC socket associated with an incoming packet. */
+static struct sock *quic_packet_get_sock(struct sk_buff *skb)
+{
+ struct quic_skb_cb *cb = QUIC_SKB_CB(skb);
+ struct net *net = sock_net(skb->sk);
+ struct quic_conn_id dcid, *conn_id;
+ union quic_addr daddr, saddr;
+ struct quic_data alpns = {};
+ struct sock *sk = NULL;
+ u32 len = skb->len;
+ u8 *p = skb->data;
+ int err;
+
+ if (skb->len < QUIC_HLEN)
+ return ERR_PTR(-EINVAL);
+
+ if (quic_hdr(skb)->form == QUIC_PACKET_FORM_SHORT) {
+ /* Short header path. */
+ if (skb->len < QUIC_HLEN + QUIC_CONN_ID_DEF_LEN)
+ return ERR_PTR(-EINVAL);
+ /* Fast path: look up QUIC connection by fixed-length DCID
+ * (Currently, only QUIC_CONN_ID_DEF_LEN-length SCIDs are used).
+ */
+ conn_id = quic_conn_id_lookup(net, skb->data + QUIC_HLEN,
+ QUIC_CONN_ID_DEF_LEN);
+ if (conn_id) {
+ cb->seqno = quic_conn_id_number(conn_id);
+ /* Return associated socket. */
+ return quic_conn_id_sk(conn_id);
+ }
+
+ /* Fallback: listener socket lookup
+ * (May be used to send a stateless reset from a listen socket).
+ */
+ quic_get_msg_addrs(skb, &daddr, &saddr);
+ sk = quic_listen_sock_lookup(skb, &daddr, &saddr, &alpns);
+ if (sk)
+ return sk;
+ /* Final fallback: address-based connection lookup
+ * (May be used to receive a stateless reset).
+ */
+ sk = quic_sock_lookup(skb, &daddr, &saddr, skb->sk, NULL);
+ if (!sk)
+ return ERR_PTR(-ENOENT);
+ return sk;
+ }
+
+ /* Long header path. */
+ err = quic_packet_get_long_header(&dcid, NULL, NULL, &p, &len);
+ if (err)
+ return ERR_PTR(err);
+ /* Fast path: look up QUIC connection by parsed DCID. */
+ conn_id = quic_conn_id_lookup(net, dcid.data, dcid.len);
+ if (conn_id) {
+ cb->seqno = quic_conn_id_number(conn_id);
+ return quic_conn_id_sk(conn_id); /* Return associated socket. */
+ }
+
+ /* Fallback: address + DCID lookup
+ * (May be used for 0-RTT or a follow-up Client Initial packet).
+ */
+ quic_get_msg_addrs(skb, &daddr, &saddr);
+ sk = quic_sock_lookup(skb, &daddr, &saddr, skb->sk, &dcid);
+ if (sk)
+ return sk;
+ /* Final fallback: listener socket lookup
+ * (Used for receiving the first Client Initial packet).
+ */
+ err = quic_packet_parse_alpn(skb, &alpns);
+ if (err)
+ return ERR_PTR(err);
+ sk = quic_listen_sock_lookup(skb, &daddr, &saddr, &alpns);
+ if (!sk)
+ return ERR_PTR(-ENOENT);
+ return sk;
+}
+
+/* Entry point for processing received QUIC packets. */
+int quic_packet_rcv(struct sock *sk, struct sk_buff *skb, bool icmp)
+{
+ struct net *net = sock_net(sk);
+ int err;
+
+ if (unlikely(icmp))
+ return quic_packet_rcv_err(sk, skb);
+
+ /* Save the UDP socket to skb->sk for later QUIC socket lookup. */
+ if (skb_linearize(skb) || !skb_set_owner_sk_safe(skb, sk)) {
+ err = -EINVAL;
+ goto err;
+ }
+
+ /* Look up socket from socket or connection IDs hash tables. */
+ sk = quic_packet_get_sock(skb);
+ if (IS_ERR(sk)) {
+ err = PTR_ERR(sk);
+ if (err == -EINPROGRESS)
+ return 0;
+ goto err;
+ }
+
+ bh_lock_sock(sk);
+ if (sock_owned_by_user(sk)) {
+ /* Socket is busy (owned by user context): queue to backlog. */
+ err = sk_add_backlog(sk, skb, READ_ONCE(sk->sk_rcvbuf));
+ if (err) {
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ goto err;
+ }
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_RCVBACKLOGS);
+ } else {
+ /* Socket not busy: process immediately. */
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_RCVFASTPATHS);
+ sk->sk_backlog_rcv(sk, skb); /* quic_packet_process(). */
+ }
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ return 0;
+err:
+ pr_debug("%s: failed, len: %d, err: %d\n", __func__, skb->len, err);
+ QUIC_INC_STATS(net, QUIC_MIB_PKT_RCVDROP);
+ kfree_skb(skb);
+ return err;
+}
+
+static int quic_packet_listen_process(struct sock *sk, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+ return -EOPNOTSUPP;
+}
+
+static int quic_packet_handshake_process(struct sock *sk, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+ return -EOPNOTSUPP;
+}
+
+static int quic_packet_app_process(struct sock *sk, struct sk_buff *skb)
+{
+ kfree_skb(skb);
+ return -EOPNOTSUPP;
+}
+
+int quic_packet_process(struct sock *sk, struct sk_buff *skb)
+{
+ if (quic_is_closed(sk)) {
+ kfree_skb(skb);
+ return 0;
+ }
+
+ if (quic_is_listen(sk))
+ return quic_packet_listen_process(sk, skb);
+
+ if (quic_hdr(skb)->form == QUIC_PACKET_FORM_LONG)
+ return quic_packet_handshake_process(sk, skb);
+
+ return quic_packet_app_process(sk, skb);
+}
+
+/* Work function to process packets in the backlog queue. */
+void quic_packet_backlog_work(struct work_struct *work)
+{
+ struct quic_net *qn = container_of(work, struct quic_net, work);
+ struct sk_buff_head *head = &qn->backlog_list;
+ struct sk_buff *skb;
+ struct sock *sk;
+
+ while ((skb = skb_dequeue(head)) != NULL) {
+ sk = quic_packet_get_listen_sock(skb);
+ if (IS_ERR(sk)) {
+ QUIC_INC_STATS(sock_net(skb->sk), QUIC_MIB_PKT_RCVDROP);
+ kfree_skb(skb);
+ continue;
+ }
+
+ lock_sock(sk);
+ quic_packet_process(sk, skb);
+ release_sock(sk);
+ sock_put(sk);
+ }
+}
+
/* Make these fixed for easy coding. */
#define QUIC_PACKET_NUMBER_LEN QUIC_PN_MAX_LEN
#define QUIC_PACKET_LENGTH_LEN 4
diff --git a/net/quic/packet.h b/net/quic/packet.h
index 834c4f72271b..9e2f429d4d93 100644
--- a/net/quic/packet.h
+++ b/net/quic/packet.h
@@ -57,6 +57,8 @@ struct quic_packet {
#define QUIC_VERSION_LEN 4
+#define QUIC_ALPN_MAX_LEN 128
+
#define QUIC_PACKET_MSS_NORMAL 0
#define QUIC_PACKET_MSS_DGRAM 1
@@ -106,6 +108,7 @@ static inline void quic_packet_reset(struct quic_packet *packet)
packet->ack_immediate = 0;
}
+int quic_packet_process(struct sock *sk, struct sk_buff *skb);
int quic_packet_config(struct sock *sk, u8 level, u8 path);
int quic_packet_xmit(struct sock *sk, struct sk_buff *skb);
@@ -115,3 +118,8 @@ int quic_packet_route(struct sock *sk);
void quic_packet_mss_update(struct sock *sk, u32 mss);
void quic_packet_flush(struct sock *sk);
void quic_packet_init(struct sock *sk);
+
+u32 *quic_packet_compatible_versions(u32 version);
+
+void quic_packet_backlog_work(struct work_struct *work);
+void quic_packet_rcv_err_pmtu(struct sock *sk);
diff --git a/net/quic/path.c b/net/quic/path.c
index 7f72fdd9c45f..eb8cb48fe56e 100644
--- a/net/quic/path.c
+++ b/net/quic/path.c
@@ -25,14 +25,14 @@ static int quic_udp_rcv(struct sock *sk, struct sk_buff *skb)
skb_pull(skb, sizeof(struct udphdr));
skb_dst_force(skb);
- kfree_skb(skb);
+ quic_packet_rcv(sk, skb, false);
/* .encap_rcv must return 0 if skb was either consumed or dropped. */
return 0;
}
static int quic_udp_err(struct sock *sk, struct sk_buff *skb)
{
- return 0;
+ return quic_packet_rcv(sk, skb, true);
}
static void quic_udp_sock_put_work(struct work_struct *work)
diff --git a/net/quic/path.h b/net/quic/path.h
index ca18eb38e907..9f772d989676 100644
--- a/net/quic/path.h
+++ b/net/quic/path.h
@@ -163,6 +163,8 @@ quic_path_orig_dcid(struct quic_path_group *paths)
return paths->retry ? &paths->retry_dcid : &paths->orig_dcid;
}
+int quic_packet_rcv(struct sock *sk, struct sk_buff *skb, bool icmp);
+
bool quic_path_detect_alt(struct quic_path_group *paths, union quic_addr *sa,
union quic_addr *da, struct sock *sk);
int quic_path_bind(struct sock *sk, struct quic_path_group *paths, u8 path);
diff --git a/net/quic/protocol.c b/net/quic/protocol.c
index 7f055c88bbde..0012d362330a 100644
--- a/net/quic/protocol.c
+++ b/net/quic/protocol.c
@@ -270,6 +270,9 @@ static int __net_init quic_net_init(struct net *net)
return err;
}
+ INIT_WORK(&qn->work, quic_packet_backlog_work);
+ skb_queue_head_init(&qn->backlog_list);
+
#if IS_ENABLED(CONFIG_PROC_FS)
err = quic_net_proc_init(net);
if (err) {
@@ -288,6 +291,8 @@ static void __net_exit quic_net_exit(struct net *net)
#if IS_ENABLED(CONFIG_PROC_FS)
quic_net_proc_exit(net);
#endif
+ disable_work_sync(&qn->work);
+ skb_queue_purge(&qn->backlog_list);
quic_crypto_free(&qn->crypto);
free_percpu(qn->stat);
qn->stat = NULL;
diff --git a/net/quic/protocol.h b/net/quic/protocol.h
index b8584e72ff14..25001aaaad4a 100644
--- a/net/quic/protocol.h
+++ b/net/quic/protocol.h
@@ -51,6 +51,10 @@ struct quic_net {
#endif
/* Context for decrypting Initial packets for ALPN */
struct quic_crypto crypto;
+
+ /* Queue of packets deferred for processing in process context */
+ struct sk_buff_head backlog_list;
+ struct work_struct work; /* Work to drain/process backlog_list */
};
struct quic_net *quic_net(struct net *net);
diff --git a/net/quic/socket.c b/net/quic/socket.c
index b9fbc33c0f79..bb52f83e9e54 100644
--- a/net/quic/socket.c
+++ b/net/quic/socket.c
@@ -24,6 +24,149 @@ static void quic_enter_memory_pressure(struct sock *sk)
WRITE_ONCE(quic_memory_pressure, 1);
}
+/* Lookup a connected QUIC socket based on address and dest connection ID.
+ *
+ * This function searches the established (non-listening) QUIC socket table for
+ * a socket that matches the source and dest addresses and, optionally, the
+ * dest connection ID (DCID). The value returned by quic_path_orig_dcid() may
+ * be either the original dest connection ID from the ClientHello or the
+ * Source Connection ID from a previously received Retry packet.
+ *
+ * The DCID is taken from a handshake packet when lookup by source connection
+ * ID fails, such as when the peer has not yet received the server's response
+ * and updated its DCID.
+ *
+ * Return: A pointer to the matching connected socket, or NULL if no match is
+ * found.
+ */
+struct sock *quic_sock_lookup(struct sk_buff *skb, union quic_addr *sa,
+ union quic_addr *da, struct sock *usk,
+ struct quic_conn_id *dcid)
+{
+ struct net *net = sock_net(usk);
+ struct quic_path_group *paths;
+ struct hlist_nulls_node *node;
+ struct quic_shash_head *head;
+ struct sock *sk = NULL, *tmp;
+ struct quic_conn_id *odcid;
+ unsigned int hash;
+
+ hash = quic_sock_hash(net, sa, da);
+ head = quic_sock_head(hash);
+
+ rcu_read_lock();
+begin:
+ sk_nulls_for_each_rcu(tmp, node, &head->head) {
+ if (net != sock_net(tmp))
+ continue;
+ paths = quic_paths(tmp);
+ odcid = quic_path_orig_dcid(paths);
+ if (quic_cmp_sk_addr(tmp, quic_path_saddr(paths, 0), sa) &&
+ quic_cmp_sk_addr(tmp, quic_path_daddr(paths, 0), da) &&
+ quic_path_usock(paths, 0) == usk &&
+ (!dcid || !quic_conn_id_cmp(odcid, dcid))) {
+ sk = tmp;
+ break;
+ }
+ }
+ /* If the nulls value we got at the end of the iteration is different
+ * from the expected one, we must restart the lookup as the list was
+ * modified concurrently.
+ */
+ if (!sk && get_nulls_value(node) != hash)
+ goto begin;
+
+ if (sk && unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+ sk = NULL;
+ rcu_read_unlock();
+ return sk;
+}
+
+/* Find the listening QUIC socket for an incoming packet.
+ *
+ * This function searches the QUIC socket table for a listening socket that
+ * matches the dest address and port, and the ALPN(s), if present in the
+ * ClientHello. If multiple listening sockets are bound to the same address,
+ * port, and ALPN(s) (e.g., via SO_REUSEPORT), this function selects a socket
+ * from the reuseport group.
+ *
+ * Return: A pointer to the matching listening socket, or NULL if no match is
+ * found.
+ */
+struct sock *quic_listen_sock_lookup(struct sk_buff *skb, union quic_addr *sa,
+ union quic_addr *da,
+ struct quic_data *alpns)
+{
+ struct net *net = sock_net(skb->sk);
+ struct hlist_nulls_node *node;
+ struct sock *sk = NULL, *tmp;
+ struct quic_shash_head *head;
+ struct quic_data alpn;
+ union quic_addr *a;
+ u32 hash, len;
+ u64 length;
+ u8 *p;
+
+ hash = quic_listen_sock_hash(net, ntohs(sa->v4.sin_port));
+ head = quic_listen_sock_head(hash);
+
+ rcu_read_lock();
+begin:
+ if (!alpns->len) { /* No ALPNs or parse failed */
+ sk_nulls_for_each_rcu(tmp, node, &head->head) {
+ /* If alpns->data != NULL, TLS parsing succeeded but no
+ * ALPN was found. In this case, only match sockets
+ * that have no ALPN set.
+ */
+ a = quic_path_saddr(quic_paths(tmp), 0);
+ if (net == sock_net(tmp) &&
+ quic_cmp_sk_addr(tmp, a, sa) &&
+ quic_path_usock(quic_paths(tmp), 0) == skb->sk &&
+ (!alpns->data || !quic_alpn(tmp)->len)) {
+ sk = tmp;
+ if (!quic_is_any_addr(a))
+ break; /* Prefer specific addr match. */
+ }
+ }
+ goto out;
+ }
+
+ /* ALPN present: loop through each ALPN entry. */
+ for (p = alpns->data, len = alpns->len; len;
+ len -= length, p += length) {
+ quic_get_int(&p, &len, &length, 1);
+ quic_data(&alpn, p, length);
+ sk_nulls_for_each_rcu(tmp, node, &head->head) {
+ a = quic_path_saddr(quic_paths(tmp), 0);
+ if (net == sock_net(tmp) &&
+ quic_cmp_sk_addr(tmp, a, sa) &&
+ quic_path_usock(quic_paths(tmp), 0) == skb->sk &&
+ quic_data_has(quic_alpn(tmp), &alpn)) {
+ sk = tmp;
+ if (!quic_is_any_addr(a))
+ break;
+ }
+ }
+ if (sk)
+ break;
+ }
+out:
+ /* If the nulls value we got at the end of the iteration is different
+ * from the expected one, we must restart the lookup as the list was
+ * modified concurrently.
+ */
+ if (!sk && get_nulls_value(node) != hash)
+ goto begin;
+
+ if (sk && sk->sk_reuseport)
+ sk = reuseport_select_sock(sk, quic_addr_hash(net, da), skb, 1);
+
+ if (sk && unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
+ sk = NULL;
+ rcu_read_unlock();
+ return sk;
+}
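The inner loop matches each wire-format ALPN entry (1-byte length prefix per protocol name) against the socket's configured ALPNs via quic_data_has(). A userspace sketch of that per-entry membership check (hypothetical `alpn_list_has()`, illustrative only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Walk a wire-format ALPN list (each entry is a 1-byte length followed
 * by the protocol name) and report whether it contains "proto".
 */
static int alpn_list_has(const uint8_t *list, size_t len, const char *proto)
{
	size_t plen = strlen(proto);

	while (len) {
		size_t elen = list[0];

		if (1 + elen > len)
			return 0;	/* malformed list: entry overruns */
		if (elen == plen && !memcmp(list + 1, proto, plen))
			return 1;
		list += 1 + elen;
		len -= 1 + elen;
	}
	return 0;
}
```

Entries are tried in the client's preference order, which is why the outer loop above breaks on the first listener that advertises a matching protocol.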
+
static void quic_write_space(struct sock *sk)
{
__poll_t mask = EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND;
@@ -213,6 +356,10 @@ static void quic_release_cb(struct sock *sk)
nflags = flags & ~QUIC_DEFERRED_ALL;
} while (!try_cmpxchg(&sk->sk_tsq_flags, &flags, nflags));
+ if (flags & QUIC_F_MTU_REDUCED_DEFERRED) {
+ quic_packet_rcv_err_pmtu(sk);
+ __sock_put(sk);
+ }
if (flags & QUIC_F_LOSS_DEFERRED) {
quic_timer_loss_handler(sk);
__sock_put(sk);
@@ -262,6 +409,7 @@ struct proto quic_prot = {
.accept = quic_accept,
.hash = quic_hash,
.unhash = quic_unhash,
+ .backlog_rcv = quic_packet_process,
.release_cb = quic_release_cb,
.no_autobind = true,
.obj_size = sizeof(struct quic_sock),
@@ -292,6 +440,7 @@ struct proto quicv6_prot = {
.accept = quic_accept,
.hash = quic_hash,
.unhash = quic_unhash,
+ .backlog_rcv = quic_packet_process,
.release_cb = quic_release_cb,
.no_autobind = true,
.obj_size = sizeof(struct quic6_sock),
diff --git a/net/quic/socket.h b/net/quic/socket.h
index 1efc76ec2033..3c1bea767be9 100644
--- a/net/quic/socket.h
+++ b/net/quic/socket.h
@@ -200,3 +200,10 @@ static inline void quic_set_state(struct sock *sk, int state)
inet_sk_set_state(sk, state);
sk->sk_state_change(sk);
}
+
+struct sock *quic_listen_sock_lookup(struct sk_buff *skb, union quic_addr *sa,
+ union quic_addr *da,
+ struct quic_data *alpns);
+struct sock *quic_sock_lookup(struct sk_buff *skb, union quic_addr *sa,
+ union quic_addr *da, struct sock *usk,
+ struct quic_conn_id *dcid);
--
2.47.1