* [RFC net-next v1 1/2] udp: Introduce UDP_STOP_RCV option for UDP
From: Jiayuan Chen @ 2025-05-01 3:51 UTC
To: netdev
Cc: Jiayuan Chen, Willem de Bruijn, David S. Miller, David Ahern,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, linux-kernel, linux-kselftest
For some services we use an "established-over-unconnected" model.
'''
// create unconnected socket and 'listen()'
srv_fd = socket(AF_INET, SOCK_DGRAM)
setsockopt(srv_fd, SO_REUSEPORT)
bind(srv_fd, SERVER_ADDR, SERVER_PORT)
// 'accept()'
data, client_addr = recvmsg(srv_fd)
// create a connected socket for this request
cli_fd = socket(AF_INET, SOCK_DGRAM)
setsockopt(cli_fd, SO_REUSEPORT)
bind(cli_fd, SERVER_ADDR, SERVER_PORT)
connect(cli_fd, client_addr)
...
// do handshake with cli_fd
'''
This programming pattern simulates accept() using UDP, creating a new
socket for each client request. The server can then use separate sockets
to handle client requests, avoiding the need to use a single UDP socket
for I/O transmission.
However, there is a race between the bind() and connect() of the new
socket:
After bind() but before connect() it is still unconnected, so it can
receive packets that were meant for the listening (unconnected) socket,
which is not what we want.
(Conversely, before connect() the listening socket may also receive
packets destined for the soon-to-be-connected socket. That case is easy
to handle, because upper-layer protocols typically have explicit message
boundaries and we receive a complete packet before creating the
connected socket.)
Before this patch, the connected socket had to filter requests at
recvmsg() time, acting as a dispatcher to some extent. With this patch,
bind() and connect() can be treated as atomic with respect to packet
delivery.
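As a rough illustration of the intended effect, here is a toy userspace model of per-socket scoring. It is not the kernel's compute_score(): the struct, the function name and the score values are all hypothetical, and reuseport hashing is left out. It only shows why a bound-but-unconnected session socket ties with the listener, and how a stop-receive flag takes it out of the lookup until connect() has built the 4-tuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a UDP socket; not a kernel structure. */
struct toy_sock {
	uint16_t port;       /* local port the socket is bound to */
	bool connected;      /* connect() has been called */
	uint32_t peer_addr;  /* only meaningful when connected */
	uint16_t peer_port;  /* only meaningful when connected */
	bool stop_rcv;       /* models the proposed UDP_STOP_RCV flag */
};

/*
 * Return -1 if sk must not receive a packet sent from saddr:sport to
 * local port dport, otherwise a score where a higher value wins.
 * The numeric values are arbitrary (assumption for this sketch).
 */
static int toy_score(const struct toy_sock *sk, uint16_t dport,
		     uint32_t saddr, uint16_t sport)
{
	int score = 1;

	assert(sk);
	if (sk->port != dport || sk->stop_rcv)
		return -1;
	if (sk->connected) {
		/* A connected socket only accepts its own peer. */
		if (sk->peer_addr != saddr || sk->peer_port != sport)
			return -1;
		score += 2;
	}
	return score;
}
```

In this model a freshly bound session socket scores the same as the listener, so either could be chosen; with stop_rcv set it scores -1; once connected with the flag cleared it outranks the listener for its own peer and rejects everyone else.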
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
include/linux/udp.h | 1 +
include/uapi/linux/udp.h | 1 +
net/ipv4/udp.c | 13 ++++++++++---
net/ipv6/udp.c | 5 +++--
4 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 895240177f4f..8d281a0c0d9d 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -42,6 +42,7 @@ enum {
UDP_FLAGS_ENCAP_ENABLED, /* This socket enabled encap */
UDP_FLAGS_UDPLITE_SEND_CC, /* set via udplite setsockopt */
UDP_FLAGS_UDPLITE_RECV_CC, /* set via udplite setsockopt */
+ UDP_FLAGS_STOP_RCV, /* Stop receiving packets */
};
struct udp_sock {
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index edca3e430305..bb8e0a749a55 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -34,6 +34,7 @@ struct udphdr {
#define UDP_NO_CHECK6_RX 102 /* Disable accepting checksum for UDP6 */
#define UDP_SEGMENT 103 /* Set GSO segmentation size */
#define UDP_GRO 104 /* This socket can receive UDP GRO packets */
+#define UDP_STOP_RCV 105 /* This socket will not receive any packets */
/* UDP encapsulation types */
#define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* unused draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f9f5b92cf4b6..764d337ab1b3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -376,7 +376,8 @@ static int compute_score(struct sock *sk, const struct net *net,
if (!net_eq(sock_net(sk), net) ||
udp_sk(sk)->udp_port_hash != hnum ||
- ipv6_only_sock(sk))
+ ipv6_only_sock(sk) ||
+ udp_test_bit(STOP_RCV, sk))
return -1;
if (sk->sk_rcv_saddr != daddr)
@@ -494,7 +495,7 @@ static struct sock *udp4_lib_lookup2(const struct net *net,
result = inet_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
saddr, sport, daddr, hnum, udp_ehashfn);
- if (!result) {
+ if (!result || udp_test_bit(STOP_RCV, result)) {
result = sk;
continue;
}
@@ -3031,7 +3032,9 @@ int udp_lib_setsockopt(struct sock *sk, int level, int optname,
set_xfrm_gro_udp_encap_rcv(up->encap_type, sk->sk_family, sk);
sockopt_release_sock(sk);
break;
-
+ case UDP_STOP_RCV:
+ udp_assign_bit(STOP_RCV, sk, valbool);
+ break;
/*
* UDP-Lite's partial checksum coverage (RFC 3828).
*/
@@ -3120,6 +3123,10 @@ int udp_lib_getsockopt(struct sock *sk, int level, int optname,
val = udp_test_bit(GRO_ENABLED, sk);
break;
+ case UDP_STOP_RCV:
+ val = udp_test_bit(STOP_RCV, sk);
+ break;
+
/* The following two cannot be changed on UDP sockets, the return is
* always 0 (which corresponds to the full checksum coverage of UDP). */
case UDPLITE_SEND_CSCOV:
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7317f8e053f1..55896a78e94b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -137,7 +137,8 @@ static int compute_score(struct sock *sk, const struct net *net,
if (!net_eq(sock_net(sk), net) ||
udp_sk(sk)->udp_port_hash != hnum ||
- sk->sk_family != PF_INET6)
+ sk->sk_family != PF_INET6 ||
+ udp_test_bit(STOP_RCV, sk))
return -1;
if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
@@ -245,7 +246,7 @@ static struct sock *udp6_lib_lookup2(const struct net *net,
result = inet6_lookup_reuseport(net, sk, skb, sizeof(struct udphdr),
saddr, sport, daddr, hnum, udp6_ehashfn);
- if (!result) {
+ if (!result || udp_test_bit(STOP_RCV, result)) {
result = sk;
continue;
}
--
2.47.1
* [RFC net-next v1 2/2] selftests/net: Add udp UDP_STOP_RCV selftest
From: Jiayuan Chen @ 2025-05-01 3:51 UTC
To: netdev
Cc: Jiayuan Chen, Willem de Bruijn, David S. Miller, David Ahern,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, linux-kernel, linux-kselftest
Add a new selftest, which uses UDP_STOP_RCV to make UDP simulate TCP's
listen and accept.
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
tools/testing/selftests/net/.gitignore | 1 +
tools/testing/selftests/net/Makefile | 1 +
.../testing/selftests/net/test_udp_stop_rcv.c | 275 ++++++++++++++++++
3 files changed, 277 insertions(+)
create mode 100644 tools/testing/selftests/net/test_udp_stop_rcv.c
diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index 532bb732bc6d..293f7cd27e5e 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -61,3 +61,4 @@ udpgso
udpgso_bench_rx
udpgso_bench_tx
unix_connect
+test_udp_stop_rcv
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 124078b56fa4..0e8fcca9f133 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -108,6 +108,7 @@ TEST_GEN_PROGS += proc_net_pktgen
TEST_PROGS += lwt_dst_cache_ref_loop.sh
TEST_PROGS += skf_net_off.sh
TEST_GEN_FILES += skf_net_off
+TEST_GEN_FILES += test_udp_stop_rcv
# YNL files, must be before "include ..lib.mk"
YNL_GEN_FILES := busy_poller netlink-dumps
diff --git a/tools/testing/selftests/net/test_udp_stop_rcv.c b/tools/testing/selftests/net/test_udp_stop_rcv.c
new file mode 100644
index 000000000000..e01d097a93be
--- /dev/null
+++ b/tools/testing/selftests/net/test_udp_stop_rcv.c
@@ -0,0 +1,275 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <stddef.h>
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <net/if.h>
+#include <linux/in.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <netinet/if_ether.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/udp.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#ifndef UDP_STOP_RCV
+#define UDP_STOP_RCV 105
+#endif
+
+static bool cfg_do_ipv4;
+static bool cfg_do_ipv6;
+
+static char buf[1024];
+static const char *syn = "client request";
+static const char *synack = "server accepted";
+static const char *ack = "established";
+
+const struct in6_addr addr6 = {
+ { { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 } }, /* 0::1 */
+};
+
+const struct in_addr addr4 = {
+ __constant_htonl(0x7f000001), /* 127.0.0.1 */
+};
+
+static int __send_one(const struct sockaddr *srv, const socklen_t srv_len)
+{
+ int cli_fd = -1, ret = 0;
+
+ cli_fd = socket(srv->sa_family, SOCK_DGRAM, 0);
+ if (cli_fd < 0)
+ goto err;
+
+ ret = connect(cli_fd, srv, srv_len);
+ if (ret < 0)
+ goto err;
+
+ ret = send(cli_fd, syn, strlen(syn), 0);
+ if (ret != strlen(syn)) {
+ ret = -1;
+ goto err;
+ }
+
+ return cli_fd;
+err:
+ if (cli_fd >= 0)
+ close(cli_fd);
+ return -1;
+}
+
+static int send_one(const struct sockaddr *srv, const socklen_t srv_len)
+{
+ int cli_fd;
+
+ cli_fd = __send_one(srv, srv_len);
+ if (cli_fd < 0)
+ return -1;
+
+ close(cli_fd);
+ return 0;
+}
+
+static int send_many(const struct sockaddr *addr, const socklen_t alen)
+{
+ int i = 0, err;
+
+ for (i = 0; i < 100; i++) {
+ err = send_one(addr, alen);
+ if (err)
+ return err;
+ }
+ return 0;
+}
+
+/* client server
+ * "client request"->
+ * <- "server accepted"
+ * "established" ->
+ */
+static void run_test(struct sockaddr *srv, socklen_t srv_len,
+ struct sockaddr *cli, socklen_t cli_len)
+{
+ socklen_t size;
+ struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
+ int one = 1, srv_fd = -1, ret;
+ int session_fd = -1;
+ int cli_fd;
+
+ srv_fd = socket(srv->sa_family, SOCK_DGRAM, 0);
+ if (srv_fd == -1)
+ error(1, errno, "socket srv_fd");
+
+ if (setsockopt(srv_fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)))
+ error(1, errno, "SO_REUSEPORT");
+
+ if (bind(srv_fd, srv, srv_len))
+ error(1, errno, "bind srv_fd");
+
+ if (getsockname(srv_fd, srv, &srv_len))
+ error(1, errno, "getsockname()");
+
+ /* send syn to server */
+ cli_fd = __send_one(srv, srv_len);
+ if (cli_fd < 0)
+ error(1, errno, "__send_one()");
+
+ ret = recvfrom(srv_fd, (char *)buf, sizeof(buf), MSG_WAITALL, cli, &cli_len);
+ if (ret < 0)
+ error(1, errno, "recvfrom()");
+
+ /* create session for this request */
+ session_fd = socket(srv->sa_family, SOCK_DGRAM, 0);
+ if (session_fd == -1)
+ error(1, errno, "socket session_fd");
+
+ if (setsockopt(session_fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)))
+ error(1, errno, "SO_REUSEPORT");
+
+ /* we are ready to bind the server address but do not want to receive any packets yet */
+ if (setsockopt(session_fd, SOL_UDP, UDP_STOP_RCV, &one, sizeof(one)))
+ error(1, errno, "setsockopt UDP_STOP_RCV");
+
+ if (setsockopt(session_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
+ error(1, errno, "setsockopt SO_RCVTIMEO");
+
+ one = 0;
+ size = sizeof(one);
+ if (getsockopt(session_fd, SOL_UDP, UDP_STOP_RCV, &one, &size) ||
+ one != 1)
+ error(1, errno, "getsockopt UDP_STOP_RCV");
+
+ /* bind the same address as srv_fd */
+ if (bind(session_fd, srv, srv_len))
+ error(1, errno, "bind session_fd");
+
+ /* simulate many other requests */
+ if (send_many(srv, srv_len))
+ error(1, errno, "send_many()");
+
+ /* no data should be delivered to session_fd,
+ * as we set UDP_STOP_RCV before bind()
+ */
+ ret = read(session_fd, (char *)buf, sizeof(buf));
+ if (ret > 0)
+ error(1, 0, "session_fd should not receive any data");
+
+ /* build 4-tuple */
+ ret = connect(session_fd, cli, cli_len);
+ if (ret < 0)
+ error(1, errno, "connect(cli)");
+
+ /* now we are ready to communicate with specified client */
+ one = 0;
+ if (setsockopt(session_fd, SOL_UDP, UDP_STOP_RCV, &one, sizeof(one)))
+ error(1, errno, "setsockopt UDP_STOP_RCV");
+
+ /* server sends synack to the client */
+ ret = send(session_fd, synack, strlen(synack), 0);
+ if (ret != strlen(synack))
+ error(1, errno, "send(synack)");
+
+ /* client receives the synack */
+ ret = read(cli_fd, (char *)buf, sizeof(buf));
+ if (ret != strlen(synack))
+ error(1, errno, "read(synack)");
+
+ /* client sends the ack to server */
+ ret = send(cli_fd, ack, strlen(ack), 0);
+ if (ret != strlen(ack))
+ error(1, errno, "send(ack)");
+
+ /* the server should receive the ack */
+ ret = read(session_fd, (char *)buf, sizeof(buf));
+ if (ret != strlen(ack))
+ error(1, errno, "read(ack)");
+
+ /* send many requests that do not belong to the session */
+ if (send_many(srv, srv_len))
+ error(1, errno, "send_many()");
+
+ ret = read(session_fd, (char *)buf, sizeof(buf));
+ if (ret > 0)
+ error(1, 0, "session_fd should not receive any data");
+
+ if (cli_fd != -1)
+ close(cli_fd);
+ if (srv_fd != -1)
+ close(srv_fd);
+ if (session_fd != -1)
+ close(session_fd);
+}
+
+static void run_test_v4(void)
+{
+ struct sockaddr_in addr = {0};
+ struct sockaddr_in cli = {0};
+
+ addr.sin_family = AF_INET;
+ addr.sin_port = 0;
+ addr.sin_addr = addr4;
+
+ run_test((void *)&addr, sizeof(addr), (void *)&cli, sizeof(cli));
+ fprintf(stderr, "v4 OK\n");
+}
+
+static void run_test_v6(void)
+{
+ struct sockaddr_in6 addr = {0};
+ struct sockaddr_in6 cli = {0};
+
+ addr.sin6_family = AF_INET6;
+ addr.sin6_port = 0;
+ addr.sin6_addr = addr6;
+
+ run_test((void *)&addr, sizeof(addr), (void *)&cli, sizeof(cli));
+ fprintf(stderr, "v6 OK\n");
+}
+
+static void parse_opts(int argc, char **argv)
+{
+ int c;
+
+ while ((c = getopt(argc, argv, "46")) != -1) {
+ switch (c) {
+ case '4':
+ cfg_do_ipv4 = true;
+ break;
+ case '6':
+ cfg_do_ipv6 = true;
+ break;
+ default:
+ error(1, 0, "%s: parse error", argv[0]);
+ }
+ }
+
+ if (!cfg_do_ipv4 && !cfg_do_ipv6) {
+ cfg_do_ipv4 = 1;
+ cfg_do_ipv6 = 1;
+ }
+}
+
+int main(int argc, char **argv)
+{
+ parse_opts(argc, argv);
+
+ if (cfg_do_ipv4)
+ run_test_v4();
+ if (cfg_do_ipv6)
+ run_test_v6();
+
+ fprintf(stderr, "test OK\n");
+ return 0;
+}
--
2.47.1
* Re: [RFC net-next v1 1/2] udp: Introduce UDP_STOP_RCV option for UDP
From: Kuniyuki Iwashima @ 2025-05-01 4:42 UTC
To: jiayuan.chen
Cc: davem, dsahern, edumazet, horms, kuba, linux-kernel,
linux-kselftest, netdev, pabeni, shuah, willemdebruijn.kernel
SO_ATTACH_REUSEPORT_EBPF is what you want.
The socket won't receive any packets until the socket is added to
the BPF map.
No need to reinvent a subset of BPF functionalities.
* Re: [RFC net-next v1 1/2] udp: Introduce UDP_STOP_RCV option for UDP
From: Jiayuan Chen @ 2025-05-01 6:22 UTC
To: Kuniyuki Iwashima
Cc: davem, dsahern, edumazet, horms, kuba, linux-kernel,
linux-kselftest, netdev, pabeni, shuah, willemdebruijn.kernel
2025/5/1 12:42, "Kuniyuki Iwashima" <kuniyu@amazon.com> wrote:
> SO_ATTACH_REUSEPORT_EBPF is what you want.
> The socket won't receive any packets until the socket is added to
> the BPF map.
> No need to reinvent a subset of BPF functionalities.
I think this feature is for selecting one socket, not filtering out certain
sockets.
Does this mean that I need to first capture all sockets bound to the same
port, and then if the kernel selects a socket that I don't want to receive
packets on, I'll need to implement an algorithm in the BPF program to
choose another socket from the ones I've captured, in order to avoid
returning that socket?
This looks like it completely bypasses the kernel's built-in scoring
logic. Or is expanding BPF_PROG_TYPE_SK_REUSEPORT to have filtering
capabilities also an acceptable solution?
* Re: [RFC net-next v1 1/2] udp: Introduce UDP_STOP_RCV option for UDP
From: Kuniyuki Iwashima @ 2025-05-01 7:12 UTC
To: jiayuan.chen
Cc: davem, dsahern, edumazet, horms, kuba, kuniyu, linux-kernel,
linux-kselftest, netdev, pabeni, shuah, willemdebruijn.kernel
From: "Jiayuan Chen" <jiayuan.chen@linux.dev>
Date: Thu, 01 May 2025 06:22:17 +0000
> I think this feature is for selecting one socket, not filtering out certain
> sockets.
>
> Does this mean that I need to first capture all sockets bound to the same
> port, and then if the kernel selects a socket that I don't want to receive
> packets on, I'll need to implement an algorithm in the BPF program to
> choose another socket from the ones I've captured, in order to avoid
> returning that socket?
Right.
If you want a set of sockets to listen on the port, you can implement
that with BPF: register the sockets in the BPF map, and if the kernel picks
another socket and triggers the BPF prog, just return one of the
registered sockets.
Even when you have connect()ed sockets on the same port, the kernel will
fall back to the normal scoring to find the best one. That is not a
problem, as the final 'result' is either the one selected by BPF or a
connected sk, and the packet won't be routed to a not-yet-registered
unconnected sk.
>
> This looks like it completely bypasses the kernel's built-in scoring
> logic. Or is expanding BPF_PROG_TYPE_SK_REUSEPORT to have filtering
> capabilities also an acceptable solution?
* Re: [RFC net-next v1 1/2] udp: Introduce UDP_STOP_RCV option for UDP
From: Willem de Bruijn @ 2025-05-01 14:27 UTC
To: Kuniyuki Iwashima, jiayuan.chen
Cc: davem, dsahern, edumazet, horms, kuba, kuniyu, linux-kernel,
linux-kselftest, netdev, pabeni, shuah, willemdebruijn.kernel
Kuniyuki Iwashima wrote:
> From: "Jiayuan Chen" <jiayuan.chen@linux.dev>
> Date: Thu, 01 May 2025 06:22:17 +0000
> > I think this feature is for selecting one socket, not filtering out certain
> > sockets.
> >
> > Does this mean that I need to first capture all sockets bound to the same
> > port, and then if the kernel selects a socket that I don't want to receive
> > packets on, I'll need to implement an algorithm in the BPF program to
> > choose another socket from the ones I've captured, in order to avoid
> > returning that socket?
>
> Right.
>
> If you want a set of sockets to listen on the port, you can implement
> that with BPF: register the sockets in the BPF map, and if the kernel picks
> another socket and triggers the BPF prog, just return one of the
> registered sockets.
>
> Even when you have connect()ed sockets on the same port, the kernel will
> fall back to the normal scoring to find the best one. That is not a
> problem, as the final 'result' is either the one selected by BPF or a
> connected sk, and the packet won't be routed to a not-yet-registered
> unconnected sk.
>
>
> >
> > This looks like it completely bypasses the kernel's built-in scoring
> > logic. Or is expanding BPF_PROG_TYPE_SK_REUSEPORT to have filtering
> > capabilities also an acceptable solution?
Reuseport BPF exists because we want to avoid having to continue to
add custom rules in C for each scenario.
In this case, I did wonder whether it is possible to avoid hitting
the soon-to-be connected socket with the standard reuseport
algorithm in reuseport_select_sock_by_hash.
Setting SO_INCOMING_CPU to a cpu on which no packets arrive will
lower its priority relative to other sockets. It's a bit of a hack,
but should work?
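A small sketch of the SO_INCOMING_CPU idea, assuming a Linux kernel where the option is settable on UDP sockets. The helper names and the choice of CPU are hypothetical; picking a CPU that really sees no receive traffic is left to the caller, and whether this closes the bind()/connect() window in practice has not been verified here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_CPU
#define SO_INCOMING_CPU 49 /* asm-generic value; fallback for old headers */
#endif

/* Hypothetical helper: hint that fd prefers packets processed on 'cpu'.
 * Intended to be called before bind() on the session socket.
 * Returns 0 on success or a negative errno value.
 */
static int park_on_cpu(int fd, int cpu)
{
	assert(cpu >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu)))
		return -errno;
	return 0;
}

/* Read the current hint back into *cpu; 0 on success, negative errno on error. */
static int get_incoming_cpu(int fd, int *cpu)
{
	socklen_t len = sizeof(*cpu);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, cpu, &len))
		return -errno;
	return 0;
}
```

A caller would do park_on_cpu(session_fd, idle_cpu) before bind(), then connect(); once the socket is connected, the 4-tuple match is what steers the peer's packets to it.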