netdev.vger.kernel.org archive mirror
* [PATCH bpf-next v1 0/3] bpf: Fix FIONREAD and copied_seq issues
@ 2025-11-17 11:07 Jiayuan Chen
  2025-11-17 11:07 ` [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Jiayuan Chen @ 2025-11-17 11:07 UTC (permalink / raw)
  To: bpf
  Cc: jiayuan.chen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrii Nakryiko,
	Eduard Zingerman, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

syzkaller reported a bug [1] where a socket that had been using sockmap
exposed an incorrect copied_seq calculation after the sockmap was
unloaded. The selftest provided here can be used to reproduce the issue
reported by syzkaller.

TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
 <TASK>
 receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
 tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
 __sys_getsockopt+0x12f/0x260 net/socket.c:2450
 __do_sys_getsockopt net/socket.c:2457 [inline]
 __se_sys_getsockopt net/socket.c:2454 [inline]
 __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

A sockmap socket maintains its own receive queue (ingress_msg), which may
contain data either from its own protocol stack or forwarded from other
sockets.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

The issue occurs when reading from ingress_msg: we update tp->copied_seq
by default, but if the data comes from other sockets (not the socket's
own protocol stack), tp->rcv_nxt remains unchanged. Later, when the
socket is converted back to a native socket, reads may fail because
copied_seq can be significantly larger than rcv_nxt.

Additionally, a FIONREAD calculation based on copied_seq and rcv_nxt is
insufficient for sockmap sockets; a separate tracking field is required.
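
For reference, native TCP answers FIONREAD (SIOCINQ) along these lines
(a simplified sketch of tcp_inq() from include/net/tcp.h, with state
handling omitted):

	/* bytes readable = bytes the stack received minus bytes consumed */
	answ = tp->rcv_nxt - tp->copied_seq;

Once copied_seq runs ahead of rcv_nxt, this difference, and the
copied_seq sanity check in tcp_recvmsg_locked(), no longer make sense,
which is what produces the warning above.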

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983

Jiayuan Chen (3):
  bpf, sockmap: Fix incorrect copied_seq calculation
  bpf, sockmap: Fix FIONREAD for sockmap
  bpf, selftest: Add tests for FIONREAD and copied_seq

 include/linux/skmsg.h                         |  71 ++++++-
 net/core/skmsg.c                              |  20 +-
 net/ipv4/tcp_bpf.c                            |  26 ++-
 net/ipv4/udp_bpf.c                            |  25 ++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 192 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |   8 +
 6 files changed, 325 insertions(+), 17 deletions(-)

-- 
2.43.0



* [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2025-11-17 11:07 [PATCH bpf-next v1 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
@ 2025-11-17 11:07 ` Jiayuan Chen
  2025-11-19 19:53   ` Jakub Sitnicki
  2025-11-17 11:07 ` [PATCH bpf-next v1 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Jiayuan Chen @ 2025-11-17 11:07 UTC (permalink / raw)
  To: bpf
  Cc: jiayuan.chen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

The issue is that when reading from ingress_msg, we update tp->copied_seq
by default. However, if the data did not come from the socket's own
protocol stack, tp->rcv_nxt was never advanced. Later, if we convert this
socket back to a native socket, reading from it may fail because
copied_seq might be significantly larger than rcv_nxt.

This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.

This patch marks the skmsg objects in ingress_msg with their origin. When
reading, we update copied_seq only if the data came from the socket's own
protocol stack.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 include/linux/skmsg.h | 25 ++++++++++++++++++++++++-
 net/core/skmsg.c      | 19 ++++++++++++++++---
 net/ipv4/tcp_bpf.c    |  5 +++--
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 49847888c287..b7826cb2a388 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -23,6 +23,17 @@ enum __sk_action {
 	__SK_NONE,
 };
 
+/* The BPF program sets BPF_F_INGRESS on an sk_msg to indicate that data
+ * needs to be redirected to the ingress queue of a specified socket.
+ * BPF_F_INGRESS is defined in UAPI, so we can't extend that enum with our
+ * internal flags; we define them here instead, inheriting BPF_F_INGRESS.
+ */
+enum {
+	SK_MSG_F_INGRESS	= BPF_F_INGRESS, /* (1ULL << 0) */
+	/* internal flag */
+	SK_MSG_F_INGRESS_SELF	= (1ULL << 1)
+};
+
 struct sk_msg_sg {
 	u32				start;
 	u32				curr;
@@ -141,8 +152,20 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			     struct sk_msg *msg, u32 bytes);
 int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		   int len, int flags);
+int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		     int len, int flags, int *from_self_copied);
 bool sk_msg_is_readable(struct sock *sk);
 
+static inline bool sk_msg_is_to_self(struct sk_msg *msg)
+{
+	return msg->flags & SK_MSG_F_INGRESS_SELF;
+}
+
+static inline void sk_msg_set_to_self(struct sk_msg *msg)
+{
+	msg->flags |= SK_MSG_F_INGRESS_SELF;
+}
+
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
 {
 	WARN_ON(i == msg->sg.end && bytes);
@@ -235,7 +258,7 @@ static inline struct page *sk_msg_page(struct sk_msg *msg, int which)
 
 static inline bool sk_msg_to_ingress(const struct sk_msg *msg)
 {
-	return msg->flags & BPF_F_INGRESS;
+	return msg->flags & SK_MSG_F_INGRESS;
 }
 
 static inline void sk_msg_compute_data_pointers(struct sk_msg *msg)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2ac7731e1e0a..25d88c2082e9 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -409,14 +409,14 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 }
 EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
 
-/* Receive sk_msg from psock->ingress_msg to @msg. */
-int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
-		   int len, int flags)
+int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		     int len, int flags, int *from_self_copied)
 {
 	struct iov_iter *iter = &msg->msg_iter;
 	int peek = flags & MSG_PEEK;
 	struct sk_msg *msg_rx;
 	int i, copied = 0;
+	bool to_self;
 
 	msg_rx = sk_psock_peek_msg(psock);
 	while (copied != len) {
@@ -425,6 +425,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		if (unlikely(!msg_rx))
 			break;
 
+		to_self = sk_msg_is_to_self(msg_rx);
 		i = msg_rx->sg.start;
 		do {
 			struct page *page;
@@ -443,6 +444,9 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 			}
 
 			copied += copy;
+			if (to_self && from_self_copied)
+				*from_self_copied += copy;
+
 			if (likely(!peek)) {
 				sge->offset += copy;
 				sge->length -= copy;
@@ -487,6 +491,14 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 out:
 	return copied;
 }
+EXPORT_SYMBOL_GPL(__sk_msg_recvmsg);
+
+/* Receive sk_msg from psock->ingress_msg to @msg. */
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		   int len, int flags)
+{
+	return __sk_msg_recvmsg(sk, psock, msg, len, flags, NULL);
+}
 EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
 
 bool sk_msg_is_readable(struct sock *sk)
@@ -616,6 +628,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 	if (unlikely(!msg))
 		return -EAGAIN;
 	skb_set_owner_r(skb, sk);
+	sk_msg_set_to_self(msg);
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
 	if (err < 0)
 		kfree(msg);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index a268e1595b22..6332fc36ffe6 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -226,6 +226,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	int peek = flags & MSG_PEEK;
 	struct sk_psock *psock;
 	struct tcp_sock *tcp;
+	int from_self_copied = 0;
 	int copied = 0;
 	u32 seq;
 
@@ -262,7 +263,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	}
 
 msg_bytes_ready:
-	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	copied = __sk_msg_recvmsg(sk, psock, msg, len, flags, &from_self_copied);
 	/* The typical case for EFAULT is the socket was gracefully
 	 * shutdown with a FIN pkt. So check here the other case is
 	 * some error on copy_page_to_iter which would be unexpected.
@@ -277,7 +278,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 			goto out;
 		}
 	}
-	seq += copied;
+	seq += from_self_copied;
 	if (!copied) {
 		long timeo;
 		int data;
-- 
2.43.0



* [PATCH bpf-next v1 2/3] bpf, sockmap: Fix FIONREAD for sockmap
  2025-11-17 11:07 [PATCH bpf-next v1 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
  2025-11-17 11:07 ` [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
@ 2025-11-17 11:07 ` Jiayuan Chen
  2025-11-17 11:07 ` [PATCH bpf-next v1 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen
  2025-11-21 19:12 ` [syzbot ci] Re: bpf: Fix FIONREAD and copied_seq issues syzbot ci
  3 siblings, 0 replies; 9+ messages in thread
From: Jiayuan Chen @ 2025-11-17 11:07 UTC (permalink / raw)
  To: bpf
  Cc: jiayuan.chen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.

This patch adds a new ingress_size field to the psock structure to record
the amount of data queued in ingress_msg. Additionally, we implement new
ioctl interfaces for TCP and UDP to intercept FIONREAD operations. While
Unix and VSOCK also support sockmap and have similar FIONREAD calculation
issues, fixing them would require more extensive changes (please let me
know if modifications are needed); I don't think it's appropriate to
include those changes in this fix.
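
For context, the user-visible behaviour being fixed is the plain FIONREAD
query. A minimal userspace sketch (fd is assumed to be a TCP or UDP
socket managed by sockmap):

	#include <stdio.h>
	#include <sys/ioctl.h>

	static void print_readable(int fd)
	{
		int avail = 0;

		/* With this patch, avail reflects what is queued in
		 * ingress_msg: total queued bytes for TCP, the size of
		 * the next pending datagram for UDP.
		 */
		if (ioctl(fd, FIONREAD, &avail) == 0)
			printf("%d bytes readable\n", avail);
	}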

Previous work by John Fastabend made some effort towards FIONREAD support
in commit e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq").
Although the current patch builds on that work, it is acceptable for our
Fixes tag to point to the same commit.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 include/linux/skmsg.h | 46 ++++++++++++++++++++++++++++++++++++++++++-
 net/core/skmsg.c      |  1 +
 net/ipv4/tcp_bpf.c    | 21 ++++++++++++++++++++
 net/ipv4/udp_bpf.c    | 25 +++++++++++++++++++----
 4 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index b7826cb2a388..dab6844d7d43 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -108,6 +108,7 @@ struct sk_psock {
 	struct sk_buff_head		ingress_skb;
 	struct list_head		ingress_msg;
 	spinlock_t			ingress_lock;
+	ssize_t				ingress_size;
 	unsigned long			state;
 	struct list_head		link;
 	spinlock_t			link_lock;
@@ -342,6 +343,16 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static inline ssize_t sk_psock_get_msg_size(struct sk_psock *psock)
+{
+	return psock->ingress_size;
+}
+
+static inline void sk_psock_inc_msg_size(struct sk_psock *psock, ssize_t diff)
+{
+	psock->ingress_size += diff;
+}
+
 static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 				      struct sk_msg *msg)
 {
@@ -350,6 +361,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 	spin_lock_bh(&psock->ingress_lock);
 	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
 		list_add_tail(&msg->list, &psock->ingress_msg);
+		sk_psock_inc_msg_size(psock, msg->sg.size);
 		ret = true;
 	} else {
 		sk_msg_free(psock->sk, msg);
@@ -366,8 +378,10 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
 
 	spin_lock_bh(&psock->ingress_lock);
 	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
-	if (msg)
+	if (msg) {
 		list_del(&msg->list);
+		sk_psock_inc_msg_size(psock, -msg->sg.size);
+	}
 	spin_unlock_bh(&psock->ingress_lock);
 	return msg;
 }
@@ -544,6 +558,36 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
 	return !!psock->saved_data_ready;
 }
 
+static inline ssize_t sk_psock_msg_inq(struct sock *sk)
+{
+	struct sk_psock *psock;
+	ssize_t inq = 0;
+
+	psock = sk_psock_get(sk);
+	if (likely(psock)) {
+		inq = sk_psock_get_msg_size(psock);
+		sk_psock_put(sk, psock);
+	}
+	return inq;
+}
+
+/* for udp */
+static inline ssize_t sk_msg_first_length(struct sock *sk)
+{
+	struct sk_psock *psock;
+	struct sk_msg *msg;
+	ssize_t inq = 0;
+
+	psock = sk_psock_get(sk);
+	if (likely(psock)) {
+		msg = sk_psock_peek_msg(psock);
+		if (msg)
+			inq = msg->sg.size;
+		sk_psock_put(sk, psock);
+	}
+	return inq;
+}
+
 #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
 
 #define BPF_F_STRPARSER	(1UL << 1)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 25d88c2082e9..5cd449b196ae 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -455,6 +455,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
 					atomic_sub(copy, &sk->sk_rmem_alloc);
 				}
 				msg_rx->sg.size -= copy;
+				sk_psock_inc_msg_size(psock, -copy);
 
 				if (!sge->length) {
 					sk_msg_iter_var_next(i);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 6332fc36ffe6..a9c758868f13 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -10,6 +10,7 @@
 
 #include <net/inet_common.h>
 #include <net/tls.h>
+#include <asm/ioctls.h>
 
 void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
@@ -332,6 +333,25 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	return copied;
 }
 
+static int tcp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	bool slow;
+
+	/* we only care about FIONREAD */
+	if (cmd != SIOCINQ)
+		return tcp_ioctl(sk, cmd, karg);
+
+	/* works similarly to tcp_ioctl */
+	if (sk->sk_state == TCP_LISTEN)
+		return -EINVAL;
+
+	slow = lock_sock_fast(sk);
+	*karg = sk_psock_msg_inq(sk);
+	unlock_sock_fast(sk, slow);
+
+	return 0;
+}
+
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			   int flags, int *addr_len)
 {
@@ -610,6 +630,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_BASE].close		= sock_map_close;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
 	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
+	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;
 
 	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
 	prot[TCP_BPF_TX].sendmsg		= tcp_bpf_sendmsg;
diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 0735d820e413..7928bec7a53c 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -5,6 +5,7 @@
 #include <net/sock.h>
 #include <net/udp.h>
 #include <net/inet_common.h>
+#include <asm/ioctls.h>
 
 #include "udp_impl.h"
 
@@ -111,12 +112,28 @@ enum {
 static DEFINE_SPINLOCK(udpv6_prot_lock);
 static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
 
+static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	/* we only care about FIONREAD */
+	if (cmd != SIOCINQ)
+		return udp_ioctl(sk, cmd, karg);
+
+	/* Works similarly to udp_ioctl.
+	 * man udp(7): "FIONREAD (SIOCINQ): Returns the size of the next
+	 * pending datagram in the integer in bytes, or 0 when no datagram
+	 * is pending."
+	 */
+	*karg = sk_msg_first_length(sk);
+	return 0;
+}
+
 static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
 {
-	*prot        = *base;
-	prot->close  = sock_map_close;
-	prot->recvmsg = udp_bpf_recvmsg;
-	prot->sock_is_readable = sk_msg_is_readable;
+	*prot			= *base;
+	prot->close		= sock_map_close;
+	prot->recvmsg		= udp_bpf_recvmsg;
+	prot->sock_is_readable	= sk_msg_is_readable;
+	prot->ioctl		= udp_bpf_ioctl;
 }
 
 static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
-- 
2.43.0



* [PATCH bpf-next v1 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq
  2025-11-17 11:07 [PATCH bpf-next v1 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
  2025-11-17 11:07 ` [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
  2025-11-17 11:07 ` [PATCH bpf-next v1 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
@ 2025-11-17 11:07 ` Jiayuan Chen
  2025-11-21 19:12 ` [syzbot ci] Re: bpf: Fix FIONREAD and copied_seq issues syzbot ci
  3 siblings, 0 replies; 9+ messages in thread
From: Jiayuan Chen @ 2025-11-17 11:07 UTC (permalink / raw)
  To: bpf
  Cc: jiayuan.chen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrii Nakryiko,
	Eduard Zingerman, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

This commit adds two new test functions: one to reproduce the bug reported
by syzkaller [1], and another to cover the calculation of copied_seq.

The tests primarily involve installing and uninstalling sockmap on
sockets, then reading data to verify proper functionality.

Additionally, extend the do_test_sockmap_skb_verdict_fionread() function
to support UDP FIONREAD testing.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 192 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |   8 +
 2 files changed, 194 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index 1e3e4392dcca..e6cff25f4b75 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -1,7 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2020 Cloudflare
 #include <error.h>
-#include <netinet/tcp.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
 #include <sys/epoll.h>
 
 #include "test_progs.h"
@@ -22,6 +23,16 @@
 #define TCP_REPAIR_ON		1
 #define TCP_REPAIR_OFF_NO_WP	-1	/* Turn off without window probes */
 
+/* SOL_TCP is defined in <netinet/tcp.h>, but the copybuf_address
+ * field of struct tcp_zerocopy_receive is not, so we include
+ * <linux/tcp.h> instead. Although glibc has merged my patch to sync
+ * the headers, the fix will take time to propagate, hence this
+ * workaround.
+ */
+#ifndef SOL_TCP
+#define SOL_TCP 6
+#endif
+
 static int connected_socket_v4(void)
 {
 	struct sockaddr_in addr = {
@@ -536,13 +547,14 @@ static void test_sockmap_skb_verdict_shutdown(void)
 }
 
 
-static void test_sockmap_skb_verdict_fionread(bool pass_prog)
+static void do_test_sockmap_skb_verdict_fionread(int sotype, bool pass_prog)
 {
 	int err, map, verdict, c0 = -1, c1 = -1, p0 = -1, p1 = -1;
 	int expected, zero = 0, sent, recvd, avail;
 	struct test_sockmap_pass_prog *pass = NULL;
 	struct test_sockmap_drop_prog *drop = NULL;
 	char buf[256] = "0123456789";
+	int split_len = sizeof(buf) / 2;
 
 	if (pass_prog) {
 		pass = test_sockmap_pass_prog__open_and_load();
@@ -550,7 +562,10 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 			return;
 		verdict = bpf_program__fd(pass->progs.prog_skb_verdict);
 		map = bpf_map__fd(pass->maps.sock_map_rx);
-		expected = sizeof(buf);
+		if (sotype == SOCK_DGRAM)
+			expected = split_len; /* FIONREAD for UDP is different from TCP */
+		else
+			expected = sizeof(buf);
 	} else {
 		drop = test_sockmap_drop_prog__open_and_load();
 		if (!ASSERT_OK_PTR(drop, "open_and_load"))
@@ -566,7 +581,7 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 	if (!ASSERT_OK(err, "bpf_prog_attach"))
 		goto out;
 
-	err = create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1);
+	err = create_socket_pairs(AF_INET, sotype, &c0, &c1, &p0, &p1);
 	if (!ASSERT_OK(err, "create_socket_pairs()"))
 		goto out;
 
@@ -574,8 +589,9 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 	if (!ASSERT_OK(err, "bpf_map_update_elem(c1)"))
 		goto out_close;
 
-	sent = xsend(p1, &buf, sizeof(buf), 0);
-	ASSERT_EQ(sent, sizeof(buf), "xsend(p0)");
+	sent = xsend(p1, &buf, split_len, 0);
+	sent += xsend(p1, &buf, sizeof(buf) - split_len, 0);
+	ASSERT_EQ(sent, sizeof(buf), "xsend(p1)");
 	err = ioctl(c1, FIONREAD, &avail);
 	ASSERT_OK(err, "ioctl(FIONREAD) error");
 	ASSERT_EQ(avail, expected, "ioctl(FIONREAD)");
@@ -597,6 +613,12 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 		test_sockmap_drop_prog__destroy(drop);
 }
 
+static void test_sockmap_skb_verdict_fionread(bool pass_prog)
+{
+	do_test_sockmap_skb_verdict_fionread(SOCK_STREAM, pass_prog);
+	do_test_sockmap_skb_verdict_fionread(SOCK_DGRAM, pass_prog);
+}
+
 static void test_sockmap_skb_verdict_change_tail(void)
 {
 	struct test_sockmap_change_tail *skel;
@@ -1042,6 +1064,160 @@ static void test_sockmap_vsock_unconnected(void)
 	xclose(map);
 }
 
+/* used to reproduce the syzkaller-reported WARNING */
+static void test_sockmap_zc(void)
+{
+	int map, err, sent, recvd, zero = 0, one = 1, on = 1;
+	char buf[10] = "0123456789", rcv[11], addr[100];
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	struct tcp_zerocopy_receive zc;
+	socklen_t zc_len = sizeof(zc);
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1))
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict_ingress;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend"))
+		goto end;
+
+	/* trigger tcp_bpf_recvmsg_parser and inc copied_seq of p1 */
+	recvd = recv_timeout(p1, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1)"))
+		goto end;
+
+	/* uninstall sockmap of p1 */
+	bpf_map_delete_elem(map, &one);
+
+	/* trigger the tcp native stack; rcv_nxt of p1 stays less than copied_seq */
+	sent = xsend(c1, buf, sizeof(buf) - 1, 0);
+	if (!ASSERT_EQ(sent, sizeof(buf) - 1, "xsend"))
+		goto end;
+
+	err = setsockopt(p1, SOL_SOCKET, SO_ZEROCOPY, &on, sizeof(on));
+	if (!ASSERT_OK(err, "setsockopt"))
+		goto end;
+
+	memset(&zc, 0, sizeof(zc));
+	zc.copybuf_address = (__u64)((unsigned long)addr);
+	zc.copybuf_len = sizeof(addr);
+
+	err = getsockopt(p1, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
+	if (!ASSERT_OK(err, "getsockopt"))
+		goto end;
+
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
+/* used to check whether copied_seq of sk is correct */
+static void test_sockmap_copied_seq(void)
+{
+	int map, err, sent, recvd, zero = 0, one = 1;
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	char buf[10] = "0123456789", rcv[11];
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1))
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict_ingress;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(p0)"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(p1)"))
+		goto end;
+
+	/* just trigger sockmap: data sent by c0 will be received by p1 */
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c0), bpf"))
+		goto end;
+
+	recvd = recv_timeout(p1, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1), bpf"))
+		goto end;
+
+	/* uninstall sockmap of p1 and p0 */
+	err = bpf_map_delete_elem(map, &one);
+	if (!ASSERT_OK(err, "bpf_map_delete_elem(1)"))
+		goto end;
+	err = bpf_map_delete_elem(map, &zero);
+	if (!ASSERT_OK(err, "bpf_map_delete_elem(0)"))
+		goto end;
+
+	/* now all sockets become plain socket, they should work */
+
+	/* test copied_seq of p1 by running tcp native stack */
+	sent = xsend(c1, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c1), native"))
+		goto end;
+
+	recvd = recv(p1, rcv, sizeof(rcv), MSG_DONTWAIT);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1), native"))
+		goto end;
+
+	/* p0 previously redirected skb to p1, we also check copied_seq of p0 */
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c0), native"))
+		goto end;
+
+	recvd = recv(p0, rcv, sizeof(rcv), MSG_DONTWAIT);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p0), native"))
+		goto end;
+
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
 void test_sockmap_basic(void)
 {
 	if (test__start_subtest("sockmap create_update_free"))
@@ -1108,4 +1284,8 @@ void test_sockmap_basic(void)
 		test_sockmap_skb_verdict_vsock_poll();
 	if (test__start_subtest("sockmap vsock unconnected"))
 		test_sockmap_vsock_unconnected();
+	if (test__start_subtest("sockmap with zc"))
+		test_sockmap_zc();
+	if (test__start_subtest("sockmap recover"))
+		test_sockmap_copied_seq();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c b/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
index 69aacc96db36..4bc97da15a69 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
@@ -44,4 +44,12 @@ int prog_skb_parser(struct __sk_buff *skb)
 	return SK_PASS;
 }
 
+SEC("sk_skb/stream_parser")
+int prog_skb_verdict_ingress(struct __sk_buff *skb)
+{
+	int one = 1;
+
+	return bpf_sk_redirect_map(skb, &sock_map_rx, one, BPF_F_INGRESS);
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0



* Re: [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2025-11-17 11:07 ` [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
@ 2025-11-19 19:53   ` Jakub Sitnicki
  2025-11-20  2:49     ` Jiayuan Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Sitnicki @ 2025-11-19 19:53 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

On Mon, Nov 17, 2025 at 07:07 PM +08, Jiayuan Chen wrote:
> A socket using sockmap has its own independent receive queue: ingress_msg.
> This queue may contain data from its own protocol stack or from other
> sockets.
>
> The issue is that when reading from ingress_msg, we update tp->copied_seq
> by default. However, if the data is not from its own protocol stack,
> tcp->rcv_nxt is not increased. Later, if we convert this socket to a
> native socket, reading from this socket may fail because copied_seq might
> be significantly larger than rcv_nxt.
>
> This fix also addresses the syzkaller-reported bug referenced in the
> Closes tag.
>
> This patch marks the skmsg objects in ingress_msg. When reading, we update
> copied_seq only if the data is from its own protocol stack.
>
>                                                      FD1:read()
>                                                      --  FD1->copied_seq++
>                                                          |  [read data]
>                                                          |
>                                 [enqueue data]           v
>                   [sockmap]     -> ingress to self ->  ingress_msg queue
> FD1 native stack  ------>                                 ^
> -- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
>                                        |                  |
>                                        |             ingress to FD1
>                                        v                  ^
>                                       ...                 |  [sockmap]
>                                                      FD2 native stack
>
> Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  include/linux/skmsg.h | 25 ++++++++++++++++++++++++-
>  net/core/skmsg.c      | 19 ++++++++++++++++---
>  net/ipv4/tcp_bpf.c    |  5 +++--
>  3 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 49847888c287..b7826cb2a388 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -23,6 +23,17 @@ enum __sk_action {
>  	__SK_NONE,
>  };
>  
> +/* The BPF program sets BPF_F_INGRESS on sk_msg to indicate data needs to be
> + * redirected to the ingress queue of a specified socket. Since BPF_F_INGRESS is
> + * defined in UAPI so that we can't extend this enum for our internal flags. We
> + * define some internal flags here while inheriting BPF_F_INGRESS.
> + */
> +enum {
> +	SK_MSG_F_INGRESS	= BPF_F_INGRESS, /* (1ULL << 0) */
> +	/* internal flag */
> +	SK_MSG_F_INGRESS_SELF	= (1ULL << 1)
> +};
> +


I'm wondering if we need additional state to track this.
Can we track sk_msg's constructed from skb's that were not redirected by
setting `sk_msg.sk = sk` to indicate that the source socket is us in
sk_psock_skb_ingress_self()?

If not, then I'd just offset the internal flags like we do in
net/core/filter.c, BPF_F_REDIRECT_INTERNAL.
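
For reference, the pattern there looks roughly like this (a sketch from
memory; the exact bit positions may differ):

	/* Internal, non-exposed redirect flags. */
	enum {
		BPF_F_NEIGH	= (1ULL << 1),
		BPF_F_PEER	= (1ULL << 2),
		BPF_F_NEXTHOP	= (1ULL << 3),
	#define BPF_F_REDIRECT_INTERNAL	(BPF_F_NEIGH | BPF_F_PEER | BPF_F_NEXTHOP)
	};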

>  struct sk_msg_sg {
>  	u32				start;
>  	u32				curr;
> @@ -141,8 +152,20 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
>  			     struct sk_msg *msg, u32 bytes);
>  int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  		   int len, int flags);
> +int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> +		     int len, int flags, int *from_self_copied);
>  bool sk_msg_is_readable(struct sock *sk);
>  
> +static inline bool sk_msg_is_to_self(struct sk_msg *msg)
> +{
> +	return msg->flags & SK_MSG_F_INGRESS_SELF;
> +}
> +
> +static inline void sk_msg_set_to_self(struct sk_msg *msg)
> +{
> +	msg->flags |= SK_MSG_F_INGRESS_SELF;
> +}
> +
>  static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
>  {
>  	WARN_ON(i == msg->sg.end && bytes);
> @@ -235,7 +258,7 @@ static inline struct page *sk_msg_page(struct sk_msg *msg, int which)
>  
>  static inline bool sk_msg_to_ingress(const struct sk_msg *msg)
>  {
> -	return msg->flags & BPF_F_INGRESS;
> +	return msg->flags & SK_MSG_F_INGRESS;
>  }
>  
>  static inline void sk_msg_compute_data_pointers(struct sk_msg *msg)
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 2ac7731e1e0a..25d88c2082e9 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -409,14 +409,14 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
>  }
>  EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
>  
> -/* Receive sk_msg from psock->ingress_msg to @msg. */
> -int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> -		   int len, int flags)
> +int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> +		     int len, int flags, int *from_self_copied)
>  {
>  	struct iov_iter *iter = &msg->msg_iter;
>  	int peek = flags & MSG_PEEK;
>  	struct sk_msg *msg_rx;
>  	int i, copied = 0;
> +	bool to_self;
>  
>  	msg_rx = sk_psock_peek_msg(psock);
>  	while (copied != len) {
> @@ -425,6 +425,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  		if (unlikely(!msg_rx))
>  			break;
>  
> +		to_self = sk_msg_is_to_self(msg_rx);
>  		i = msg_rx->sg.start;
>  		do {
>  			struct page *page;
> @@ -443,6 +444,9 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  			}
>  
>  			copied += copy;
> +			if (to_self && from_self_copied)
> +				*from_self_copied += copy;
> +
>  			if (likely(!peek)) {
>  				sge->offset += copy;
>  				sge->length -= copy;
> @@ -487,6 +491,14 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  out:
>  	return copied;
>  }
> +EXPORT_SYMBOL_GPL(__sk_msg_recvmsg);
> +
> +/* Receive sk_msg from psock->ingress_msg to @msg. */
> +int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> +		   int len, int flags)
> +{
> +	return __sk_msg_recvmsg(sk, psock, msg, len, flags, NULL);
> +}
>  EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
>  
>  bool sk_msg_is_readable(struct sock *sk)
> @@ -616,6 +628,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>  	if (unlikely(!msg))
>  		return -EAGAIN;
>  	skb_set_owner_r(skb, sk);
> +	sk_msg_set_to_self(msg);
>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
>  	if (err < 0)
>  		kfree(msg);
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index a268e1595b22..6332fc36ffe6 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -226,6 +226,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  	int peek = flags & MSG_PEEK;
>  	struct sk_psock *psock;
>  	struct tcp_sock *tcp;
> +	int from_self_copied = 0;
>  	int copied = 0;
>  	u32 seq;
>  
> @@ -262,7 +263,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  	}
>  
>  msg_bytes_ready:
> -	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
> +	copied = __sk_msg_recvmsg(sk, psock, msg, len, flags, &from_self_copied);
>  	/* The typical case for EFAULT is the socket was gracefully
>  	 * shutdown with a FIN pkt. So check here the other case is
>  	 * some error on copy_page_to_iter which would be unexpected.
> @@ -277,7 +278,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  			goto out;
>  		}
>  	}
> -	seq += copied;
> +	seq += from_self_copied;
>  	if (!copied) {
>  		long timeo;
>  		int data;


* Re: [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2025-11-19 19:53   ` Jakub Sitnicki
@ 2025-11-20  2:49     ` Jiayuan Chen
  2025-11-20 12:58       ` Jakub Sitnicki
  0 siblings, 1 reply; 9+ messages in thread
From: Jiayuan Chen @ 2025-11-20  2:49 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Michal Luczaj, Stefano Garzarella, Cong Wang, netdev,
	linux-kernel, linux-kselftest

November 20, 2025 at 03:53, "Jakub Sitnicki" <jakub@cloudflare.com> wrote:

[...]
> >  +/* The BPF program sets BPF_F_INGRESS on sk_msg to indicate data needs to be
> >  + * redirected to the ingress queue of a specified socket. Since BPF_F_INGRESS is
> >  + * defined in UAPI so that we can't extend this enum for our internal flags. We
> >  + * define some internal flags here while inheriting BPF_F_INGRESS.
> >  + */
> >  +enum {
> >  + SK_MSG_F_INGRESS = BPF_F_INGRESS, /* (1ULL << 0) */
> >  + /* internal flag */
> >  + SK_MSG_F_INGRESS_SELF = (1ULL << 1)
> >  +};
> >  +
> > 
> I'm wondering if we need additional state to track this.
> Can we track sk_msg's construted from skb's that were not redirected by
> setting `sk_msg.sk = sk` to indicate that the source socket is us in
> sk_psock_skb_ingress_self()?

Functionally, that would work. However, in that case, we would have to hold
a reference to sk until the sk_msg is read, which would delay the release of
sk. One concern is that if there is a bug in the read-side application, sk
might never be released.


> If not, then I'd just offset the internal flags like we do in
> net/core/filter.c, BPF_F_REDIRECT_INTERNAL.

I think we can try offsetting the internal flags.


* Re: [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2025-11-20  2:49     ` Jiayuan Chen
@ 2025-11-20 12:58       ` Jakub Sitnicki
  2025-11-20 14:03         ` Jiayuan Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Sitnicki @ 2025-11-20 12:58 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

On Thu, Nov 20, 2025 at 02:49 AM GMT, Jiayuan Chen wrote:
> November 20, 2025 at 03:53, "Jakub Sitnicki" <jakub@cloudflare.com> wrote:
>
> [...]
>> >  +/* The BPF program sets BPF_F_INGRESS on sk_msg to indicate data needs to be
>> >  + * redirected to the ingress queue of a specified socket. Since BPF_F_INGRESS is
>> >  + * defined in UAPI so that we can't extend this enum for our internal flags. We
>> >  + * define some internal flags here while inheriting BPF_F_INGRESS.
>> >  + */
>> >  +enum {
>> >  + SK_MSG_F_INGRESS = BPF_F_INGRESS, /* (1ULL << 0) */
>> >  + /* internal flag */
>> >  + SK_MSG_F_INGRESS_SELF = (1ULL << 1)
>> >  +};
>> >  +
>> > 
>> I'm wondering if we need additional state to track this.
>> Can we track sk_msg's construted from skb's that were not redirected by
>> setting `sk_msg.sk = sk` to indicate that the source socket is us in
>> sk_psock_skb_ingress_self()?
>
> Functionally, that would work. However, in that case, we would have to hold
> a reference to sk until the sk_msg is read, which would delay the release of
> sk. One concern is that if there is a bug in the read-side application, sk
> might never be released.

We don't need to grab a reference to sk if we're talking about setting
it only in sk_psock_skb_ingress_self(). psock already holds a ref for
psock->sk, and we purge psock->ingress_msg queue when destroying the
psock before releasing the sock ref in sk_psock_destroy().

While there's nothing wrong with an internal flag, I'm trying to see if
we can make things somewhat consistent so that, as a result, sk_msg state
is easier to reason about.

My thinking here is that we already set sk_msg.sk to source socket in
sk_psock_msg_verdict() on sendmsg() path, so we know that this is the
purpose of that field. We could mimic this on recvmsg() path.
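
Something along these lines (an untested sketch):

	/* in sk_psock_skb_ingress_self(): source == destination */
	msg->sk = sk;

	/* in __sk_msg_recvmsg(): replaces the dedicated flag */
	to_self = (msg_rx->sk == sk);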


* Re: [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2025-11-20 12:58       ` Jakub Sitnicki
@ 2025-11-20 14:03         ` Jiayuan Chen
  0 siblings, 0 replies; 9+ messages in thread
From: Jiayuan Chen @ 2025-11-20 14:03 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

November 20, 2025 at 20:58, "Jakub Sitnicki" <jakub@cloudflare.com> wrote:


> 
> On Thu, Nov 20, 2025 at 02:49 AM GMT, Jiayuan Chen wrote:
> 
> > 
> > November 20, 2025 at 03:53, "Jakub Sitnicki" <jakub@cloudflare.com> wrote:
> > 
> >  [...]
> > 
> > > 
> > > > +/* The BPF program sets BPF_F_INGRESS on sk_msg to indicate data needs to be
> > >  > + * redirected to the ingress queue of a specified socket. Since BPF_F_INGRESS is
> > >  > + * defined in UAPI so that we can't extend this enum for our internal flags. We
> > >  > + * define some internal flags here while inheriting BPF_F_INGRESS.
> > >  > + */
> > >  > +enum {
> > >  > + SK_MSG_F_INGRESS = BPF_F_INGRESS, /* (1ULL << 0) */
> > >  > + /* internal flag */
> > >  > + SK_MSG_F_INGRESS_SELF = (1ULL << 1)
> > >  > +};
> > >  > +
> > >  > 
> > >  I'm wondering if we need additional state to track this.
> > >  Can we track sk_msg's construted from skb's that were not redirected by
> > >  setting `sk_msg.sk = sk` to indicate that the source socket is us in
> > >  sk_psock_skb_ingress_self()?
> > > 
> >  Functionally, that would work. However, in that case, we would have to hold
> >  a reference to sk until the sk_msg is read, which would delay the release of
> >  sk. One concern is that if there is a bug in the read-side application, sk
> >  might never be released.
> > 
> We don't need to grab a reference to sk if we're talking about setting
> it only in sk_psock_skb_ingress_self(). psock already holds a ref for
> psock->sk, and we purge psock->ingress_msg queue when destroying the
> psock before releasing the sock ref in sk_psock_destroy().

I see. When it's an ingress-to-self redirection, msg.sk would point to
the same socket as psock->sk (the socket itself), not to another socket,
so indeed no additional reference grab is needed.

> While there's nothing wrong with an internal flaag, I'm trying to see if
> we make things somewhat consistent so as a result sk_msg state is easier
> to reason about.
> 
> My thinking here is that we already set sk_msg.sk to source socket in
> sk_psock_msg_verdict() on sendmsg() path, so we know that this is the
> purpose of that field. We could mimic this on recvmsg() path.
>


* [syzbot ci] Re: bpf: Fix FIONREAD and copied_seq issues
  2025-11-17 11:07 [PATCH bpf-next v1 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
                   ` (2 preceding siblings ...)
  2025-11-17 11:07 ` [PATCH bpf-next v1 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen
@ 2025-11-21 19:12 ` syzbot ci
  3 siblings, 0 replies; 9+ messages in thread
From: syzbot ci @ 2025-11-21 19:12 UTC (permalink / raw)
  To: andrii, ast, bpf, cong.wang, daniel, davem, dsahern, eddyz87,
	edumazet, haoluo, horms, jakub, jiayuan.chen, john.fastabend,
	jolsa, kpsingh, kuba, kuniyu, linux-kernel, linux-kselftest,
	martin.lau, mhal, ncardwell, netdev, pabeni, sdf, sgarzare, shuah,
	song, yonghong.song
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] bpf: Fix FIONREAD and copied_seq issues
https://lore.kernel.org/all/20251117110736.293040-1-jiayuan.chen@linux.dev
* [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
* [PATCH bpf-next v1 2/3] bpf, sockmap: Fix FIONREAD for sockmap
* [PATCH bpf-next v1 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq

and found the following issues:
* KASAN: slab-out-of-bounds Read in tcp_ioctl
* KASAN: slab-use-after-free Read in tcp_ioctl

Full report is available here:
https://ci.syzbot.org/series/d61ee16d-47d7-4d43-ae17-0fb7c57066d9

***

KASAN: slab-out-of-bounds Read in tcp_ioctl

tree:      bpf-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git
base:      4722981cca373a338bbcf3a93ecf7144a892b03b
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/41c96565-ad0a-411a-985e-bed590104688/config
C repro:   https://ci.syzbot.org/findings/29a438ed-4757-4e3b-a8ba-f66ae067f793/c_repro
syz repro: https://ci.syzbot.org/findings/29a438ed-4757-4e3b-a8ba-f66ae067f793/syz_repro

==================================================================
BUG: KASAN: slab-out-of-bounds in tcp_ioctl+0x673/0x860 net/ipv4/tcp.c:659
Read of size 2 at addr ffff8881097608f6 by task syz.0.17/5965

CPU: 0 UID: 0 PID: 5965 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xca/0x240 mm/kasan/report.c:482
 kasan_report+0x118/0x150 mm/kasan/report.c:595
 tcp_ioctl+0x673/0x860 net/ipv4/tcp.c:659
 sock_ioctl_out net/core/sock.c:4392 [inline]
 sk_ioctl+0x3c7/0x600 net/core/sock.c:4420
 inet6_ioctl+0x204/0x280 net/ipv6/af_inet6.c:590
 sock_do_ioctl+0xdc/0x300 net/socket.c:1254
 sock_ioctl+0x576/0x790 net/socket.c:1375
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe4b998f749
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd756e87d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fe4b9be5fa0 RCX: 00007fe4b998f749
RDX: 0000000000000000 RSI: 0000000000008905 RDI: 0000000000000003
RBP: 00007fe4b9a13f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fe4b9be5fa0 R14: 00007fe4b9be5fa0 R15: 0000000000000003
 </TASK>

Allocated by task 5965:
 kasan_save_stack mm/kasan/common.c:56 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
 unpoison_slab_object mm/kasan/common.c:342 [inline]
 __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:368
 kasan_slab_alloc include/linux/kasan.h:252 [inline]
 slab_post_alloc_hook mm/slub.c:4978 [inline]
 slab_alloc_node mm/slub.c:5288 [inline]
 kmem_cache_alloc_noprof+0x367/0x6e0 mm/slub.c:5295
 sk_prot_alloc+0x57/0x220 net/core/sock.c:2233
 sk_alloc+0x3a/0x370 net/core/sock.c:2295
 inet6_create+0x7f0/0x1260 net/ipv6/af_inet6.c:193
 __sock_create+0x4b3/0x9f0 net/socket.c:1605
 sock_create net/socket.c:1663 [inline]
 __sys_socket_create net/socket.c:1700 [inline]
 __sys_socket+0xd7/0x1b0 net/socket.c:1747
 __do_sys_socket net/socket.c:1761 [inline]
 __se_sys_socket net/socket.c:1759 [inline]
 __x64_sys_socket+0x7a/0x90 net/socket.c:1759
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The buggy address belongs to the object at ffff888109760000
 which belongs to the cache UDPv6 of size 2176
The buggy address is located 118 bytes to the right of
 allocated 2176-byte region [ffff888109760000, ffff888109760880)

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109760
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff888112b82901
flags: 0x17ff00000000040(head|node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000040 ffff88817159e000 dead000000000122 0000000000000000
raw: 0000000000000000 00000000800e000e 00000000f5000000 ffff888112b82901
head: 017ff00000000040 ffff88817159e000 dead000000000122 0000000000000000
head: 0000000000000000 00000000800e000e 00000000f5000000 ffff888112b82901
head: 017ff00000000003 ffffea000425d801 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 5556, tgid 5556 (dhcpcd), ts 39360973378, free_ts 39270063669
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1850
 prep_new_page mm/page_alloc.c:1858 [inline]
 get_page_from_freelist+0x2365/0x2440 mm/page_alloc.c:3884
 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5183
 alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
 alloc_slab_page mm/slub.c:3059 [inline]
 allocate_slab+0x96/0x350 mm/slub.c:3232
 new_slab mm/slub.c:3286 [inline]
 ___slab_alloc+0xf56/0x1990 mm/slub.c:4655
 __slab_alloc+0x65/0x100 mm/slub.c:4778
 __slab_alloc_node mm/slub.c:4854 [inline]
 slab_alloc_node mm/slub.c:5276 [inline]
 kmem_cache_alloc_noprof+0x3f9/0x6e0 mm/slub.c:5295
 sk_prot_alloc+0x57/0x220 net/core/sock.c:2233
 sk_alloc+0x3a/0x370 net/core/sock.c:2295
 inet6_create+0x7f0/0x1260 net/ipv6/af_inet6.c:193
 __sock_create+0x4b3/0x9f0 net/socket.c:1605
 sock_create net/socket.c:1663 [inline]
 __sys_socket_create net/socket.c:1700 [inline]
 __sys_socket+0xd7/0x1b0 net/socket.c:1747
 __do_sys_socket net/socket.c:1761 [inline]
 __se_sys_socket net/socket.c:1759 [inline]
 __x64_sys_socket+0x7a/0x90 net/socket.c:1759
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5554 tgid 5554 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 free_pages_prepare mm/page_alloc.c:1394 [inline]
 __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2906
 discard_slab mm/slub.c:3330 [inline]
 __put_partials+0x146/0x170 mm/slub.c:3876
 put_cpu_partial+0x1f2/0x2e0 mm/slub.c:3951
 __slab_free+0x2b9/0x390 mm/slub.c:5929
 qlink_free mm/kasan/quarantine.c:163 [inline]
 qlist_free_all+0x97/0x140 mm/kasan/quarantine.c:179
 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286
 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:352
 kasan_slab_alloc include/linux/kasan.h:252 [inline]
 slab_post_alloc_hook mm/slub.c:4978 [inline]
 slab_alloc_node mm/slub.c:5288 [inline]
 kmem_cache_alloc_noprof+0x367/0x6e0 mm/slub.c:5295
 getname_flags+0xb8/0x540 fs/namei.c:146
 getname include/linux/fs.h:2922 [inline]
 do_sys_openat2+0xbc/0x1c0 fs/open.c:1431
 do_sys_open fs/open.c:1452 [inline]
 __do_sys_openat fs/open.c:1468 [inline]
 __se_sys_openat fs/open.c:1463 [inline]
 __x64_sys_openat+0x138/0x170 fs/open.c:1463
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Memory state around the buggy address:
 ffff888109760780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888109760800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff888109760880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                                             ^
 ffff888109760900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888109760980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================


***

KASAN: slab-use-after-free Read in tcp_ioctl

tree:      bpf-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/bpf/bpf-next.git
base:      4722981cca373a338bbcf3a93ecf7144a892b03b
arch:      amd64
compiler:  Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config:    https://ci.syzbot.org/builds/41c96565-ad0a-411a-985e-bed590104688/config
syz repro: https://ci.syzbot.org/findings/c04327b1-fdf2-4b1c-9c19-973a310a26d4/syz_repro

==================================================================
BUG: KASAN: slab-use-after-free in tcp_ioctl+0x76d/0x860 net/ipv4/tcp.c:678
Read of size 4 at addr ffff8881023c66e4 by task syz.0.21/6017

CPU: 0 UID: 0 PID: 6017 Comm: syz.0.21 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xca/0x240 mm/kasan/report.c:482
 kasan_report+0x118/0x150 mm/kasan/report.c:595
 tcp_ioctl+0x76d/0x860 net/ipv4/tcp.c:678
 sock_ioctl_out net/core/sock.c:4392 [inline]
 sk_ioctl+0x3c7/0x600 net/core/sock.c:4420
 inet_ioctl+0x416/0x4c0 net/ipv4/af_inet.c:1007
 sock_do_ioctl+0xdc/0x300 net/socket.c:1254
 sock_ioctl+0x576/0x790 net/socket.c:1375
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:597 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fd75e18f749
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fd75f094038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fd75e3e5fa0 RCX: 00007fd75e18f749
RDX: 0000200000000080 RSI: 000000000000894b RDI: 0000000000000003
RBP: 00007fd75e213f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fd75e3e6038 R14: 00007fd75e3e5fa0 R15: 00007ffc97663258
 </TASK>
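
The ioctl number in the register dump (RSI: 0x894b) is SIOCOUTQNSD, which
sk_ioctl() hands to sock_ioctl_out() and then to tcp_ioctl(), matching the
call trace above. For orientation, the SIOCOUTQNSD branch of tcp_ioctl()
has roughly the following shape; this is a simplified sketch, and the exact
statement at net/ipv4/tcp.c:678 is not confirmed here:

    /* Simplified sketch of the SIOCOUTQNSD case in tcp_ioctl()
     * (net/ipv4/tcp.c). KASAN's "Read of size 4" would be one of
     * these u32 sequence fields, read from an already-freed sock.
     */
    case SIOCOUTQNSD:
            if (sk->sk_state == TCP_LISTEN)
                    return -EINVAL;

            if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
                    answ = 0;
            else
                    answ = READ_ONCE(tp->write_seq) - READ_ONCE(tp->snd_nxt);
            break;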

Allocated by task 5843:
 kasan_save_stack mm/kasan/common.c:56 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
 unpoison_slab_object mm/kasan/common.c:342 [inline]
 __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:368
 kasan_slab_alloc include/linux/kasan.h:252 [inline]
 slab_post_alloc_hook mm/slub.c:4978 [inline]
 slab_alloc_node mm/slub.c:5288 [inline]
 kmem_cache_alloc_noprof+0x367/0x6e0 mm/slub.c:5295
 sk_prot_alloc+0x57/0x220 net/core/sock.c:2233
 sk_alloc+0x3a/0x370 net/core/sock.c:2295
 inet_create+0x7a0/0x1000 net/ipv4/af_inet.c:328
 __sock_create+0x4b3/0x9f0 net/socket.c:1605
 inet_ctl_sock_create+0x9a/0x220 net/ipv4/af_inet.c:1632
 igmp_net_init+0xb9/0x150 net/ipv4/igmp.c:3125
 ops_init+0x35c/0x5c0 net/core/net_namespace.c:137
 setup_net+0xfe/0x320 net/core/net_namespace.c:445
 copy_net_ns+0x34e/0x4e0 net/core/net_namespace.c:580
 create_new_namespaces+0x3f3/0x720 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0x11c/0x170 kernel/nsproxy.c:218
 ksys_unshare+0x4c8/0x8c0 kernel/fork.c:3129
 __do_sys_unshare kernel/fork.c:3200 [inline]
 __se_sys_unshare kernel/fork.c:3198 [inline]
 __x64_sys_unshare+0x38/0x50 kernel/fork.c:3198
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5682:
 kasan_save_stack mm/kasan/common.c:56 [inline]
 kasan_save_track+0x3e/0x80 mm/kasan/common.c:77
 __kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:587
 kasan_save_free_info mm/kasan/kasan.h:406 [inline]
 poison_slab_object mm/kasan/common.c:252 [inline]
 __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:284
 kasan_slab_free include/linux/kasan.h:234 [inline]
 slab_free_hook mm/slub.c:2543 [inline]
 slab_free mm/slub.c:6642 [inline]
 kmem_cache_free+0x19b/0x690 mm/slub.c:6752
 sk_prot_free net/core/sock.c:2276 [inline]
 __sk_destruct+0x4d2/0x660 net/core/sock.c:2373
 inet_release+0x144/0x190 net/ipv4/af_inet.c:437
 __sock_release net/socket.c:662 [inline]
 sock_release+0x85/0x150 net/socket.c:690
 ops_exit_list net/core/net_namespace.c:199 [inline]
 ops_undo_list+0x49a/0x990 net/core/net_namespace.c:252
 cleanup_net+0x4d8/0x820 net/core/net_namespace.c:695
 process_one_work kernel/workqueue.c:3263 [inline]
 process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427
 kthread+0x711/0x8a0 kernel/kthread.c:463
 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245

The buggy address belongs to the object at ffff8881023c6600
 which belongs to the cache UDP of size 1984
The buggy address is located 228 bytes inside of
 freed 1984-byte region [ffff8881023c6600, ffff8881023c6dc0)
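
The reported offsets are internally consistent: 0xffff8881023c66e4 -
0xffff8881023c6600 = 0xe4 = 228 bytes into the object, and
0xffff8881023c6dc0 - 0xffff8881023c6600 = 0x7c0 = 1984 bytes for the
region size, in line with the 4-byte read reported above.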

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1023c0
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff888114298b01
anon flags: 0x17ff00000000040(head|node=0|zone=2|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 017ff00000000040 ffff888104ab6640 0000000000000000 0000000000000001
raw: 0000000000000000 00000000800f000f 00000000f5000000 ffff888114298b01
head: 017ff00000000040 ffff888104ab6640 0000000000000000 0000000000000001
head: 0000000000000000 00000000800f000f 00000000f5000000 ffff888114298b01
head: 017ff00000000003 ffffea000408f001 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 1, tgid 1 (swapper/0), ts 4132853290, free_ts 0
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1850
 prep_new_page mm/page_alloc.c:1858 [inline]
 get_page_from_freelist+0x2365/0x2440 mm/page_alloc.c:3884
 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5183
 alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416
 alloc_slab_page mm/slub.c:3059 [inline]
 allocate_slab+0x96/0x350 mm/slub.c:3232
 new_slab mm/slub.c:3286 [inline]
 ___slab_alloc+0xf56/0x1990 mm/slub.c:4655
 __slab_alloc+0x65/0x100 mm/slub.c:4778
 __slab_alloc_node mm/slub.c:4854 [inline]
 slab_alloc_node mm/slub.c:5276 [inline]
 kmem_cache_alloc_noprof+0x3f9/0x6e0 mm/slub.c:5295
 sk_prot_alloc+0x57/0x220 net/core/sock.c:2233
 sk_alloc+0x3a/0x370 net/core/sock.c:2295
 inet_create+0x7a0/0x1000 net/ipv4/af_inet.c:328
 __sock_create+0x4b3/0x9f0 net/socket.c:1605
 inet_ctl_sock_create+0x9a/0x220 net/ipv4/af_inet.c:1632
 igmp_net_init+0xb9/0x150 net/ipv4/igmp.c:3125
 ops_init+0x35c/0x5c0 net/core/net_namespace.c:137
 __register_pernet_operations net/core/net_namespace.c:1314 [inline]
 register_pernet_operations+0x336/0x800 net/core/net_namespace.c:1391
page_owner free stack trace missing

Memory state around the buggy address:
 ffff8881023c6580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8881023c6600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8881023c6680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                       ^
 ffff8881023c6700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8881023c6780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
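
For orientation, the allocation and free traces above describe the normal
lifecycle of a per-netns control socket: igmp_net_init() creates it via
inet_ctl_sock_create() when a network namespace is set up, and cleanup_net()
frees it from a workqueue once the namespace is released. A minimal
userspace sequence that exercises those two paths (an illustration of the
lifecycle only, not the actual reproducer -- see the syz repro link above):

    #define _GNU_SOURCE
    #include <sched.h>

    int main(void)
    {
            /* setup_net() -> igmp_net_init() -> inet_ctl_sock_create()
             * allocates the per-netns IGMP control socket (requires
             * CAP_SYS_ADMIN).
             */
            if (unshare(CLONE_NEWNET))
                    return 1;

            /* On exit, the last reference to the namespace drops and
             * cleanup_net() later frees that socket via sk_prot_free().
             */
            return 0;
    }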


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com
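
For example, the tag belongs in the trailer block of the respun commit,
e.g. (hypothetical layout, reusing the title of patch 1/3):

    bpf, sockmap: Fix incorrect copied_seq calculation

    [...changelog...]

    Tested-by: syzbot@syzkaller.appspotmail.com
    Signed-off-by: ...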

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

Thread overview: 9+ messages
2025-11-17 11:07 [PATCH bpf-next v1 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
2025-11-17 11:07 ` [PATCH bpf-next v1 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
2025-11-19 19:53   ` Jakub Sitnicki
2025-11-20  2:49     ` Jiayuan Chen
2025-11-20 12:58       ` Jakub Sitnicki
2025-11-20 14:03         ` Jiayuan Chen
2025-11-17 11:07 ` [PATCH bpf-next v1 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
2025-11-17 11:07 ` [PATCH bpf-next v1 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen
2025-11-21 19:12 ` [syzbot ci] Re: bpf: Fix FIONREAD and copied_seq issues syzbot ci
