public inbox for linux-kernel@vger.kernel.org
* [PATCH bpf-next v7 0/3] bpf: Fix FIONREAD and copied_seq issues
@ 2026-01-13  2:50 Jiayuan Chen
  2026-01-13  2:50 ` [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jiayuan Chen @ 2026-01-13  2:50 UTC (permalink / raw)
  To: bpf
  Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Stefano Garzarella, Cong Wang, netdev, linux-kernel,
	linux-kselftest

syzkaller reported a bug [1] where a socket using sockmap, after being
removed from the sockmap, exposed an incorrect copied_seq calculation.
The selftest I provided can be used to reproduce the issue reported by
syzkaller.

TCP recvmsg seq # bug 2: copied E92C873, seq E68D125, rcvnxt E7CEB7C, fl 40
WARNING: CPU: 1 PID: 5997 at net/ipv4/tcp.c:2724 tcp_recvmsg_locked+0xb2f/0x2910 net/ipv4/tcp.c:2724
Call Trace:
 <TASK>
 receive_fallback_to_copy net/ipv4/tcp.c:1968 [inline]
 tcp_zerocopy_receive+0x131a/0x2120 net/ipv4/tcp.c:2200
 do_tcp_getsockopt+0xe28/0x26c0 net/ipv4/tcp.c:4713
 tcp_getsockopt+0xdf/0x100 net/ipv4/tcp.c:4812
 do_sock_getsockopt+0x34d/0x440 net/socket.c:2421
 __sys_getsockopt+0x12f/0x260 net/socket.c:2450
 __do_sys_getsockopt net/socket.c:2457 [inline]
 __se_sys_getsockopt net/socket.c:2454 [inline]
 __x64_sys_getsockopt+0xbd/0x160 net/socket.c:2454
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcd/0xfa0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

A sockmap socket maintains its own receive queue (ingress_msg) which may
contain data from either its own protocol stack or forwarded from other
sockets.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

The issue occurs when reading from ingress_msg: we update tp->copied_seq
by default, but if the data comes from other sockets (not the socket's
own protocol stack), tp->rcv_nxt remains unchanged. Later, when the
socket is converted back to a native socket, reads may fail because
copied_seq can be significantly larger than rcv_nxt.

Additionally, calculating FIONREAD from copied_seq and rcv_nxt is
insufficient for sockmap sockets; a separate field is needed to track the
bytes queued in ingress_msg.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
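
For reference, the user-visible sequence that trips the warning looks
roughly like this (a simplified sketch of the selftest in patch 3; error
handling omitted, and map_fd/key/c1/p1 are assumed to be set up as in the
selftest):

	/* p1 sits in a sockmap; reading data redirected from another
	 * socket used to advance p1's copied_seq anyway: */
	recv(p1, rcv, sizeof(rcv), 0);

	/* remove p1 from the sockmap, then deliver data via its own
	 * TCP stack, leaving rcv_nxt behind copied_seq: */
	bpf_map_delete_elem(map_fd, &key);
	send(c1, buf, sizeof(buf) - 1, 0);

	/* a zerocopy receive now hits the "TCP recvmsg seq # bug 2"
	 * warning in tcp_recvmsg_locked(): */
	getsockopt(p1, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);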

---
v5 -> v7: Applied modifications suggested by Jakub Sitnicki and added his
Reviewed-by tag.
https://lore.kernel.org/bpf/20260106051458.279151-1-jiayuan.chen@linux.dev/

v1 -> v5: Use skmsg.sk instead of extending the BPF_F_XXX macros and fix
          the test failure reported by CI
v1: https://lore.kernel.org/bpf/20251117110736.293040-1-jiayuan.chen@linux.dev/

Jiayuan Chen (3):
  bpf, sockmap: Fix incorrect copied_seq calculation
  bpf, sockmap: Fix FIONREAD for sockmap
  bpf, selftest: Add tests for FIONREAD and copied_seq

 include/linux/skmsg.h                         |  70 ++++-
 net/core/skmsg.c                              |  31 +-
 net/ipv4/tcp_bpf.c                            |  37 ++-
 net/ipv4/udp_bpf.c                            |  23 +-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 277 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |  14 +
 6 files changed, 435 insertions(+), 17 deletions(-)

-- 
2.43.0



* [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2026-01-13  2:50 [PATCH bpf-next v7 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
@ 2026-01-13  2:50 ` Jiayuan Chen
  2026-01-20 15:01   ` Jakub Sitnicki
  2026-01-13  2:50 ` [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
  2026-01-13  2:50 ` [PATCH bpf-next v7 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen
  2 siblings, 1 reply; 13+ messages in thread
From: Jiayuan Chen @ 2026-01-13  2:50 UTC (permalink / raw)
  To: bpf
  Cc: Jiayuan Chen, Jakub Sitnicki, John Fastabend, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Cong Wang, netdev, linux-kernel, linux-kselftest

A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

The issue is that when reading from ingress_msg, we update tp->copied_seq
by default. However, if the data is not from the socket's own protocol
stack, tp->rcv_nxt is not increased. Later, if we convert this socket to a
native socket, reading from it may fail because copied_seq might be
significantly larger than rcv_nxt.

This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.

This patch marks the skmsg objects in ingress_msg. When reading, we update
copied_seq only if the data is from its own protocol stack.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack
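
A worked example with hypothetical sequence numbers, assuming FD1 starts
with rcv_nxt == copied_seq == 1000:

	FD2 redirects 100B into FD1's ingress_msg  -> rcv_nxt stays 1000
	FD1's own stack delivers 50B               -> rcv_nxt becomes 1050
	read() of all 150B, before this patch      -> copied_seq = 1150 > rcv_nxt
	read() of all 150B, with this patch        -> copied_seq = 1050 == rcv_nxt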

Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 include/linux/skmsg.h |  2 ++
 net/core/skmsg.c      | 28 +++++++++++++++++++++++++---
 net/ipv4/tcp_bpf.c    |  5 +++--
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 49847888c287..dfdc158ab88c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -141,6 +141,8 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 			     struct sk_msg *msg, u32 bytes);
 int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 		   int len, int flags);
+int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		     int len, int flags, int *copied_from_self);
 bool sk_msg_is_readable(struct sock *sk);
 
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2ac7731e1e0a..ca22ecdbf192 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -409,22 +409,26 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 }
 EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
 
-/* Receive sk_msg from psock->ingress_msg to @msg. */
-int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
-		   int len, int flags)
+int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		     int len, int flags, int *copied_from_self)
 {
 	struct iov_iter *iter = &msg->msg_iter;
 	int peek = flags & MSG_PEEK;
 	struct sk_msg *msg_rx;
 	int i, copied = 0;
+	bool from_self;
 
 	msg_rx = sk_psock_peek_msg(psock);
+	if (copied_from_self)
+		*copied_from_self = 0;
+
 	while (copied != len) {
 		struct scatterlist *sge;
 
 		if (unlikely(!msg_rx))
 			break;
 
+		from_self = msg_rx->sk == sk;
 		i = msg_rx->sg.start;
 		do {
 			struct page *page;
@@ -443,6 +447,9 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 			}
 
 			copied += copy;
+			if (from_self && copied_from_self)
+				*copied_from_self += copy;
+
 			if (likely(!peek)) {
 				sge->offset += copy;
 				sge->length -= copy;
@@ -487,6 +494,14 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 out:
 	return copied;
 }
+EXPORT_SYMBOL_GPL(__sk_msg_recvmsg);
+
+/* Receive sk_msg from psock->ingress_msg to @msg. */
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+		   int len, int flags)
+{
+	return __sk_msg_recvmsg(sk, psock, msg, len, flags, NULL);
+}
 EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
 
 bool sk_msg_is_readable(struct sock *sk)
@@ -616,6 +631,12 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 	if (unlikely(!msg))
 		return -EAGAIN;
 	skb_set_owner_r(skb, sk);
+
+	/* This is used in tcp_bpf_recvmsg_parser() to determine whether the
+	 * data originates from the socket's own protocol stack. No need to
+	 * refcount sk because msg's lifetime is bound to sk via the ingress_msg.
+	 */
+	msg->sk = sk;
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
 	if (err < 0)
 		kfree(msg);
@@ -909,6 +930,7 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
 	sk_msg_compute_data_pointers(msg);
 	msg->sk = sk;
 	ret = bpf_prog_run_pin_on_cpu(prog, msg);
+	msg->sk = NULL;
 	ret = sk_psock_map_verd(ret, msg->sk_redir);
 	psock->apply_bytes = msg->apply_bytes;
 	if (ret == __SK_REDIRECT) {
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index a268e1595b22..5c698fd7fbf8 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -226,6 +226,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	int peek = flags & MSG_PEEK;
 	struct sk_psock *psock;
 	struct tcp_sock *tcp;
+	int copied_from_self = 0;
 	int copied = 0;
 	u32 seq;
 
@@ -262,7 +263,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	}
 
 msg_bytes_ready:
-	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	copied = __sk_msg_recvmsg(sk, psock, msg, len, flags, &copied_from_self);
 	/* The typical case for EFAULT is the socket was gracefully
 	 * shutdown with a FIN pkt. So check here the other case is
 	 * some error on copy_page_to_iter which would be unexpected.
@@ -277,7 +278,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 			goto out;
 		}
 	}
-	seq += copied;
+	seq += copied_from_self;
 	if (!copied) {
 		long timeo;
 		int data;
-- 
2.43.0



* [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap
  2026-01-13  2:50 [PATCH bpf-next v7 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
  2026-01-13  2:50 ` [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
@ 2026-01-13  2:50 ` Jiayuan Chen
  2026-01-20 15:00   ` Jakub Sitnicki
  2026-01-13  2:50 ` [PATCH bpf-next v7 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen
  2 siblings, 1 reply; 13+ messages in thread
From: Jiayuan Chen @ 2026-01-13  2:50 UTC (permalink / raw)
  To: bpf
  Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrii Nakryiko,
	Eduard Zingerman, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Cong Wang, netdev, linux-kernel, linux-kselftest

A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.

This patch adds a new msg_tot_len field in the psock structure to record
the data length in ingress_msg. Additionally, we implement new ioctl
interfaces for TCP and UDP to intercept FIONREAD operations.

Unix and VSOCK sockets have similar issues, but fixing them is outside
the scope of this patch as it would require more intrusive changes.

Previous work by John Fastabend, commit e5c6de5fa025 ("bpf, sockmap:
Incorrectly handling copied_seq"), made some progress towards FIONREAD
support. Although this patch builds on that work, it is acceptable for our
Fixes tag to point to the same commit.

                                                      FD1:read()
                                                      --  FD1->copied_seq++
                                                          |  [read data]
                                                          |
                                   [enqueue data]         v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack
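
From userspace nothing changes except the reported value. A minimal
sketch (fd is assumed to be a TCP or UDP socket that is in a sockmap):

	int avail = 0;
	int err;

	/* With this patch, FIONREAD reports the bytes queued in
	 * psock->ingress_msg (for UDP, the size of the first queued
	 * message) rather than a value derived from copied_seq and
	 * rcv_nxt. */
	err = ioctl(fd, FIONREAD, &avail);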

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 include/linux/skmsg.h | 68 +++++++++++++++++++++++++++++++++++++++++--
 net/core/skmsg.c      |  3 ++
 net/ipv4/tcp_bpf.c    | 32 ++++++++++++++++++++
 net/ipv4/udp_bpf.c    | 23 ++++++++++++---
 4 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index dfdc158ab88c..829b281d6c9c 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -97,6 +97,8 @@ struct sk_psock {
 	struct sk_buff_head		ingress_skb;
 	struct list_head		ingress_msg;
 	spinlock_t			ingress_lock;
+	/** @msg_tot_len: Total bytes queued in ingress_msg list. */
+	u32				msg_tot_len;
 	unsigned long			state;
 	struct list_head		link;
 	spinlock_t			link_lock;
@@ -321,6 +323,27 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static inline u32 sk_psock_get_msg_len_nolock(struct sk_psock *psock)
+{
+	/* Used by ioctl to read msg_tot_len only; lock-free for performance */
+	return READ_ONCE(psock->msg_tot_len);
+}
+
+static inline void sk_psock_msg_len_add_locked(struct sk_psock *psock, int diff)
+{
+	/* Use WRITE_ONCE to ensure correct read in sk_psock_get_msg_len_nolock().
+	 * ingress_lock should be held to prevent concurrent updates to msg_tot_len
+	 */
+	WRITE_ONCE(psock->msg_tot_len, psock->msg_tot_len + diff);
+}
+
+static inline void sk_psock_msg_len_add(struct sk_psock *psock, int diff)
+{
+	spin_lock_bh(&psock->ingress_lock);
+	sk_psock_msg_len_add_locked(psock, diff);
+	spin_unlock_bh(&psock->ingress_lock);
+}
+
 static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 				      struct sk_msg *msg)
 {
@@ -329,6 +352,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
 	spin_lock_bh(&psock->ingress_lock);
 	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
 		list_add_tail(&msg->list, &psock->ingress_msg);
+		sk_psock_msg_len_add_locked(psock, msg->sg.size);
 		ret = true;
 	} else {
 		sk_msg_free(psock->sk, msg);
@@ -345,18 +369,25 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
 
 	spin_lock_bh(&psock->ingress_lock);
 	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
-	if (msg)
+	if (msg) {
 		list_del(&msg->list);
+		sk_psock_msg_len_add_locked(psock, -msg->sg.size);
+	}
 	spin_unlock_bh(&psock->ingress_lock);
 	return msg;
 }
 
+static inline struct sk_msg *sk_psock_peek_msg_locked(struct sk_psock *psock)
+{
+	return list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
+}
+
 static inline struct sk_msg *sk_psock_peek_msg(struct sk_psock *psock)
 {
 	struct sk_msg *msg;
 
 	spin_lock_bh(&psock->ingress_lock);
-	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
+	msg = sk_psock_peek_msg_locked(psock);
 	spin_unlock_bh(&psock->ingress_lock);
 	return msg;
 }
@@ -523,6 +554,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
 	return !!psock->saved_data_ready;
 }
 
+/* for tcp only, sk is locked */
+static inline ssize_t sk_psock_msg_inq(struct sock *sk)
+{
+	struct sk_psock *psock;
+	ssize_t inq = 0;
+
+	psock = sk_psock_get(sk);
+	if (likely(psock)) {
+		inq = sk_psock_get_msg_len_nolock(psock);
+		sk_psock_put(sk, psock);
+	}
+	return inq;
+}
+
+/* for udp only, sk is not locked */
+static inline ssize_t sk_msg_first_len(struct sock *sk)
+{
+	struct sk_psock *psock;
+	struct sk_msg *msg;
+	ssize_t inq = 0;
+
+	psock = sk_psock_get(sk);
+	if (likely(psock)) {
+		spin_lock_bh(&psock->ingress_lock);
+		msg = sk_psock_peek_msg_locked(psock);
+		if (msg)
+			inq = msg->sg.size;
+		spin_unlock_bh(&psock->ingress_lock);
+		sk_psock_put(sk, psock);
+	}
+	return inq;
+}
+
 #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
 
 #define BPF_F_STRPARSER	(1UL << 1)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ca22ecdbf192..dd1965bf8f6a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -458,6 +458,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
 					atomic_sub(copy, &sk->sk_rmem_alloc);
 				}
 				msg_rx->sg.size -= copy;
+				sk_psock_msg_len_add(psock, -copy);
 
 				if (!sge->length) {
 					sk_msg_iter_var_next(i);
@@ -822,9 +823,11 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
 		list_del(&msg->list);
 		if (!msg->skb)
 			atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
+		sk_psock_msg_len_add(psock, -msg->sg.size);
 		sk_msg_free(psock->sk, msg);
 		kfree(msg);
 	}
+	WARN_ON_ONCE(psock->msg_tot_len);
 }
 
 static void __sk_psock_zap_ingress(struct sk_psock *psock)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 5c698fd7fbf8..476101533349 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -10,6 +10,7 @@
 
 #include <net/inet_common.h>
 #include <net/tls.h>
+#include <asm/ioctls.h>
 
 void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
@@ -332,6 +333,36 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 	return copied;
 }
 
+static int tcp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	bool slow;
+
+	if (cmd != SIOCINQ)
+		return tcp_ioctl(sk, cmd, karg);
+
+	/* works similarly to tcp_ioctl() */
+	if (sk->sk_state == TCP_LISTEN)
+		return -EINVAL;
+
+	slow = lock_sock_fast(sk);
+
+	/* sk_receive_queue being non-empty only occurs if the sk received
+	 * data before being accepted, and no new data has arrived to trigger
+	 * read_skb. If data comes from the sk's native stack, we will also
+	 * consume sk_receive_queue using read_skb.
+	 */
+	if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) {
+		unlock_sock_fast(sk, slow);
+		return tcp_ioctl(sk, cmd, karg);
+	}
+
+	*karg = sk_psock_msg_inq(sk);
+
+	unlock_sock_fast(sk, slow);
+
+	return 0;
+}
+
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			   int flags, int *addr_len)
 {
@@ -610,6 +641,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_BASE].close		= sock_map_close;
 	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
 	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
+	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;
 
 	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
 	prot[TCP_BPF_TX].sendmsg		= tcp_bpf_sendmsg;
diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 0735d820e413..91233e37cd97 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -5,6 +5,7 @@
 #include <net/sock.h>
 #include <net/udp.h>
 #include <net/inet_common.h>
+#include <asm/ioctls.h>
 
 #include "udp_impl.h"
 
@@ -111,12 +112,26 @@ enum {
 static DEFINE_SPINLOCK(udpv6_prot_lock);
 static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
 
+static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
+{
+	if (cmd != SIOCINQ)
+		return udp_ioctl(sk, cmd, karg);
+
+	/* Since we don't hold the lock, sk_receive_queue may contain data
+	 * that BPF has not yet finished processing. We only care about the
+	 * data already in ingress_msg here.
+	 */
+	*karg = sk_msg_first_len(sk);
+	return 0;
+}
+
 static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
 {
-	*prot        = *base;
-	prot->close  = sock_map_close;
-	prot->recvmsg = udp_bpf_recvmsg;
-	prot->sock_is_readable = sk_msg_is_readable;
+	*prot			= *base;
+	prot->close		= sock_map_close;
+	prot->recvmsg		= udp_bpf_recvmsg;
+	prot->sock_is_readable	= sk_msg_is_readable;
+	prot->ioctl		= udp_bpf_ioctl;
 }
 
 static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
-- 
2.43.0



* [PATCH bpf-next v7 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq
  2026-01-13  2:50 [PATCH bpf-next v7 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
  2026-01-13  2:50 ` [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
  2026-01-13  2:50 ` [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
@ 2026-01-13  2:50 ` Jiayuan Chen
  2026-01-21  9:45   ` Jakub Sitnicki
  2 siblings, 1 reply; 13+ messages in thread
From: Jiayuan Chen @ 2026-01-13  2:50 UTC (permalink / raw)
  To: bpf
  Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Neal Cardwell, Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Stefano Garzarella, Michal Luczaj, Cong Wang, netdev,
	linux-kernel, linux-kselftest

This commit adds two new test functions: one to reproduce the bug reported
by syzkaller [1], and another to cover the calculation of copied_seq.

The tests primarily involve installing and uninstalling sockmap on
sockets, then reading data to verify proper functionality.

Additionally, extend the do_test_sockmap_skb_verdict_fionread() function
to support UDP FIONREAD testing.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
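
The install/uninstall pattern the tests rely on is plain sockmap element
updates; a minimal sketch (map_fd and sock_fd assumed to exist):

	int zero = 0;

	/* install: sk_psock is attached and the sockmap recvmsg/ioctl
	 * handlers take over */
	bpf_map_update_elem(map_fd, &zero, &sock_fd, BPF_ANY);

	/* ... send and read data through the sockmap path ... */

	/* uninstall: the socket reverts to the native protocol handlers */
	bpf_map_delete_elem(map_fd, &zero);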

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 277 +++++++++++++++++-
 .../bpf/progs/test_sockmap_pass_prog.c        |  14 +
 2 files changed, 285 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index 1e3e4392dcca..1f1289f5a8c2 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -1,7 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 // Copyright (c) 2020 Cloudflare
 #include <error.h>
-#include <netinet/tcp.h>
+#include <linux/tcp.h>
+#include <linux/socket.h>
 #include <sys/epoll.h>
 
 #include "test_progs.h"
@@ -22,6 +23,15 @@
 #define TCP_REPAIR_ON		1
 #define TCP_REPAIR_OFF_NO_WP	-1	/* Turn off without window probes */
 
+/* SOL_TCP is defined in <netinet/tcp.h> (glibc), but the copybuf_address
+ * field of tcp_zerocopy_receive is not yet included in older glibc
+ * versions, which is why <linux/tcp.h> is included above instead.
+ * This fallback remains necessary until the glibc update propagates.
+ */
+#ifndef SOL_TCP
+#define SOL_TCP 6
+#endif
+
 static int connected_socket_v4(void)
 {
 	struct sockaddr_in addr = {
@@ -536,13 +546,14 @@ static void test_sockmap_skb_verdict_shutdown(void)
 }
 
 
-static void test_sockmap_skb_verdict_fionread(bool pass_prog)
+static void do_test_sockmap_skb_verdict_fionread(int sotype, bool pass_prog)
 {
 	int err, map, verdict, c0 = -1, c1 = -1, p0 = -1, p1 = -1;
 	int expected, zero = 0, sent, recvd, avail;
 	struct test_sockmap_pass_prog *pass = NULL;
 	struct test_sockmap_drop_prog *drop = NULL;
 	char buf[256] = "0123456789";
+	int split_len = sizeof(buf) / 2;
 
 	if (pass_prog) {
 		pass = test_sockmap_pass_prog__open_and_load();
@@ -550,7 +561,10 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 			return;
 		verdict = bpf_program__fd(pass->progs.prog_skb_verdict);
 		map = bpf_map__fd(pass->maps.sock_map_rx);
-		expected = sizeof(buf);
+		if (sotype == SOCK_DGRAM)
+			expected = split_len; /* FIONREAD for UDP is different from TCP */
+		else
+			expected = sizeof(buf);
 	} else {
 		drop = test_sockmap_drop_prog__open_and_load();
 		if (!ASSERT_OK_PTR(drop, "open_and_load"))
@@ -566,7 +580,7 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 	if (!ASSERT_OK(err, "bpf_prog_attach"))
 		goto out;
 
-	err = create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1);
+	err = create_socket_pairs(AF_INET, sotype, &c0, &c1, &p0, &p1);
 	if (!ASSERT_OK(err, "create_socket_pairs()"))
 		goto out;
 
@@ -574,8 +588,9 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 	if (!ASSERT_OK(err, "bpf_map_update_elem(c1)"))
 		goto out_close;
 
-	sent = xsend(p1, &buf, sizeof(buf), 0);
-	ASSERT_EQ(sent, sizeof(buf), "xsend(p0)");
+	sent = xsend(p1, &buf, split_len, 0);
+	sent += xsend(p1, &buf, sizeof(buf) - split_len, 0);
+	ASSERT_EQ(sent, sizeof(buf), "xsend(p1)");
 	err = ioctl(c1, FIONREAD, &avail);
 	ASSERT_OK(err, "ioctl(FIONREAD) error");
 	ASSERT_EQ(avail, expected, "ioctl(FIONREAD)");
@@ -597,6 +612,12 @@ static void test_sockmap_skb_verdict_fionread(bool pass_prog)
 		test_sockmap_drop_prog__destroy(drop);
 }
 
+static void test_sockmap_skb_verdict_fionread(bool pass_prog)
+{
+	do_test_sockmap_skb_verdict_fionread(SOCK_STREAM, pass_prog);
+	do_test_sockmap_skb_verdict_fionread(SOCK_DGRAM, pass_prog);
+}
+
 static void test_sockmap_skb_verdict_change_tail(void)
 {
 	struct test_sockmap_change_tail *skel;
@@ -1042,6 +1063,240 @@ static void test_sockmap_vsock_unconnected(void)
 	xclose(map);
 }
 
+/* used to reproduce the WARNING reported by syzkaller */
+static void test_sockmap_zc(void)
+{
+	int map, err, sent, recvd, zero = 0, one = 1, on = 1;
+	char buf[10] = "0123456789", rcv[11], addr[100];
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	struct tcp_zerocopy_receive zc;
+	socklen_t zc_len = sizeof(zc);
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1))
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict_ingress;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend"))
+		goto end;
+
+	/* trigger tcp_bpf_recvmsg_parser and inc copied_seq of p1 */
+	recvd = recv_timeout(p1, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1)"))
+		goto end;
+
+	/* uninstall sockmap of p1 */
+	bpf_map_delete_elem(map, &one);
+
+	/* trigger the tcp stack; rcv_nxt of p1 stays less than copied_seq */
+	sent = xsend(c1, buf, sizeof(buf) - 1, 0);
+	if (!ASSERT_EQ(sent, sizeof(buf) - 1, "xsend"))
+		goto end;
+
+	err = setsockopt(p1, SOL_SOCKET, SO_ZEROCOPY, &on, sizeof(on));
+	if (!ASSERT_OK(err, "setsockopt"))
+		goto end;
+
+	memset(&zc, 0, sizeof(zc));
+	zc.copybuf_address = (__u64)((unsigned long)addr);
+	zc.copybuf_len = sizeof(addr);
+
+	err = getsockopt(p1, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
+	if (!ASSERT_OK(err, "getsockopt"))
+		goto end;
+
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
+/* check whether copied_seq of sk stays correct after sockmap removal */
+static void test_sockmap_copied_seq(bool strp)
+{
+	int i, map, err, sent, recvd, zero = 0, one = 1;
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	char buf[10] = "0123456789", rcv[11];
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1))
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict_ingress;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach verdict"))
+		goto end;
+
+	if (strp) {
+		prog = skel->progs.prog_skb_verdict_ingress_strp;
+		err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_PARSER, 0);
+		if (!ASSERT_OK(err, "bpf_prog_attach parser"))
+			goto end;
+	}
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(p0)"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(p1)"))
+		goto end;
+
+	/* just trigger sockmap: data sent by c0 will be received by p1 */
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c0), bpf"))
+		goto end;
+
+	/* do partial read */
+	recvd = recv_timeout(p1, rcv, 1, MSG_DONTWAIT, 1);
+	recvd += recv_timeout(p1, rcv + 1, sizeof(rcv) - 1, MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1), bpf") ||
+	    !ASSERT_OK(memcmp(buf, rcv, recvd), "data mismatch"))
+		goto end;
+
+	/* uninstall sockmap of p1 and p0 */
+	err = bpf_map_delete_elem(map, &one);
+	if (!ASSERT_OK(err, "bpf_map_delete_elem(1)"))
+		goto end;
+
+	err = bpf_map_delete_elem(map, &zero);
+	if (!ASSERT_OK(err, "bpf_map_delete_elem(0)"))
+		goto end;
+
+	/* now all sockets become plain sockets; they should still work */
+	for (i = 0; i < 5; i++) {
+		/* test copied_seq of p1 by running tcp native stack */
+		sent = xsend(c1, buf, sizeof(buf), 0);
+		if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c1), native"))
+			goto end;
+
+		recvd = recv(p1, rcv, sizeof(rcv), MSG_DONTWAIT);
+		if (!ASSERT_EQ(recvd, sent, "recv_timeout(p1), native"))
+			goto end;
+
+		/* p0 previously redirected skb to p1; we also check copied_seq of p0 */
+		sent = xsend(c0, buf, sizeof(buf), 0);
+		if (!ASSERT_EQ(sent, sizeof(buf), "xsend(c0), native"))
+			goto end;
+
+		recvd = recv(p0, rcv, sizeof(rcv), MSG_DONTWAIT);
+		if (!ASSERT_EQ(recvd, sent, "recv_timeout(p0), native"))
+			goto end;
+	}
+
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
+/* send data via both the native stack and BPF redirection */
+static void test_sockmap_multi_channels(int sotype)
+{
+	int map, err, sent, recvd, zero = 0, one = 1, avail = 0;
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	char buf[10] = "0123456789", rcv[11];
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	err = create_socket_pairs(AF_INET, sotype, &c0, &c1, &p0, &p1);
+	if (err)
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict_ingress;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach verdict"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(p0)"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	/* send data to p1 via native stack */
+	sent = xsend(c1, buf, 2, 0);
+	if (!ASSERT_EQ(sent, 2, "xsend(2)"))
+		goto end;
+
+	sleep(1);
+	err = ioctl(p1, FIONREAD, &avail);
+	ASSERT_OK(err, "ioctl(FIONREAD) partial call");
+	ASSERT_EQ(avail, 2, "ioctl(FIONREAD) partial return");
+
+	/* send data to p1 via BPF redirection */
+	sent = xsend(c0, buf + 2, sizeof(buf) - 2, 0);
+	if (!ASSERT_EQ(sent, sizeof(buf) - 2, "xsend(remain-data)"))
+		goto end;
+
+	sleep(1);
+	err = ioctl(p1, FIONREAD, &avail);
+	ASSERT_OK(err, "ioctl(FIONREAD) full call");
+	ASSERT_EQ(avail, sotype == SOCK_DGRAM ? 2 : sizeof(buf), "ioctl(FIONREAD) full return");
+
+	recvd = recv_timeout(p1, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sizeof(buf), "recv_timeout(p1)") ||
+	    !ASSERT_OK(memcmp(buf, rcv, recvd), "data mismatch"))
+		goto end;
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
 void test_sockmap_basic(void)
 {
 	if (test__start_subtest("sockmap create_update_free"))
@@ -1108,4 +1363,14 @@ void test_sockmap_basic(void)
 		test_sockmap_skb_verdict_vsock_poll();
 	if (test__start_subtest("sockmap vsock unconnected"))
 		test_sockmap_vsock_unconnected();
+	if (test__start_subtest("sockmap with zc"))
+		test_sockmap_zc();
+	if (test__start_subtest("sockmap recover"))
+		test_sockmap_copied_seq(false);
+	if (test__start_subtest("sockmap recover with strp"))
+		test_sockmap_copied_seq(true);
+	if (test__start_subtest("sockmap tcp multi channels"))
+		test_sockmap_multi_channels(SOCK_STREAM);
+	if (test__start_subtest("sockmap udp multi channels"))
+		test_sockmap_multi_channels(SOCK_DGRAM);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c b/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
index 69aacc96db36..ef9edca184ea 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_pass_prog.c
@@ -44,4 +44,18 @@ int prog_skb_parser(struct __sk_buff *skb)
 	return SK_PASS;
 }
 
+SEC("sk_skb/stream_verdict")
+int prog_skb_verdict_ingress(struct __sk_buff *skb)
+{
+	int one = 1;
+
+	return bpf_sk_redirect_map(skb, &sock_map_rx, one, BPF_F_INGRESS);
+}
+
+SEC("sk_skb/stream_parser")
+int prog_skb_verdict_ingress_strp(struct __sk_buff *skb)
+{
+	return skb->len;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.43.0



* Re: [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap
  2026-01-13  2:50 ` [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
@ 2026-01-20 15:00   ` Jakub Sitnicki
  2026-01-21  9:36     ` Jakub Sitnicki
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Sitnicki @ 2026-01-20 15:00 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Andrii Nakryiko, Eduard Zingerman,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Michal Luczaj, Cong Wang, netdev, linux-kernel,
	linux-kselftest

On Tue, Jan 13, 2026 at 10:50 AM +08, Jiayuan Chen wrote:
> A socket using sockmap has its own independent receive queue: ingress_msg.
> This queue may contain data from its own protocol stack or from other
> sockets.
>
> Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
> calculate FIONREAD is not enough.
>
> This patch adds a new msg_tot_len field in the psock structure to record
> the data length in ingress_msg. Additionally, we implement new ioctl
> interfaces for TCP and UDP to intercept FIONREAD operations.
>
> Unix and VSOCK sockets have similar issues, but fixing them is outside
> the scope of this patch as it would require more intrusive changes.
>
> Previous work by John Fastabend, commit e5c6de5fa025 ("bpf, sockmap:
> Incorrectly handling copied_seq"), made some progress towards FIONREAD
> support. Although this patch builds on that work, it is acceptable for our
> Fixes tag to point to the same commit.
>
>                                                       FD1:read()
>                                                       --  FD1->copied_seq++
>                                                           |  [read data]
>                                                           |
>                                    [enqueue data]         v
>                   [sockmap]     -> ingress to self ->  ingress_msg queue
> FD1 native stack  ------>                                 ^
> -- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
>                                        |                  |
>                                        |             ingress to FD1
>                                        v                  ^
>                                       ...                 |  [sockmap]
>                                                      FD2 native stack
>
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  include/linux/skmsg.h | 68 +++++++++++++++++++++++++++++++++++++++++--
>  net/core/skmsg.c      |  3 ++
>  net/ipv4/tcp_bpf.c    | 32 ++++++++++++++++++++
>  net/ipv4/udp_bpf.c    | 23 ++++++++++++---
>  4 files changed, 120 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index dfdc158ab88c..829b281d6c9c 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -97,6 +97,8 @@ struct sk_psock {
>  	struct sk_buff_head		ingress_skb;
>  	struct list_head		ingress_msg;
>  	spinlock_t			ingress_lock;
> +	/** @msg_tot_len: Total bytes queued in ingress_msg list. */
> +	u32				msg_tot_len;
>  	unsigned long			state;
>  	struct list_head		link;
>  	spinlock_t			link_lock;
> @@ -321,6 +323,27 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
>  	kfree_skb(skb);
>  }
>  
> +static inline u32 sk_psock_get_msg_len_nolock(struct sk_psock *psock)
> +{
> +	/* Used by ioctl to read msg_tot_len only; lock-free for performance */
> +	return READ_ONCE(psock->msg_tot_len);
> +}
> +
> +static inline void sk_psock_msg_len_add_locked(struct sk_psock *psock, int diff)
> +{
> +	/* Use WRITE_ONCE to ensure correct read in sk_psock_get_msg_len_nolock().
> +	 * ingress_lock should be held to prevent concurrent updates to msg_tot_len
> +	 */
> +	WRITE_ONCE(psock->msg_tot_len, psock->msg_tot_len + diff);
> +}
> +
> +static inline void sk_psock_msg_len_add(struct sk_psock *psock, int diff)
> +{
> +	spin_lock_bh(&psock->ingress_lock);
> +	sk_psock_msg_len_add_locked(psock, diff);
> +	spin_unlock_bh(&psock->ingress_lock);
> +}
> +
>  static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>  				      struct sk_msg *msg)
>  {
> @@ -329,6 +352,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>  	spin_lock_bh(&psock->ingress_lock);
>  	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>  		list_add_tail(&msg->list, &psock->ingress_msg);
> +		sk_psock_msg_len_add_locked(psock, msg->sg.size);
>  		ret = true;
>  	} else {
>  		sk_msg_free(psock->sk, msg);
> @@ -345,18 +369,25 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
>  
>  	spin_lock_bh(&psock->ingress_lock);
>  	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> -	if (msg)
> +	if (msg) {
>  		list_del(&msg->list);
> +		sk_psock_msg_len_add_locked(psock, -msg->sg.size);
> +	}
>  	spin_unlock_bh(&psock->ingress_lock);
>  	return msg;
>  }
>  
> +static inline struct sk_msg *sk_psock_peek_msg_locked(struct sk_psock *psock)
> +{
> +	return list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> +}
> +
>  static inline struct sk_msg *sk_psock_peek_msg(struct sk_psock *psock)
>  {
>  	struct sk_msg *msg;
>  
>  	spin_lock_bh(&psock->ingress_lock);
> -	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> +	msg = sk_psock_peek_msg_locked(psock);
>  	spin_unlock_bh(&psock->ingress_lock);
>  	return msg;
>  }
> @@ -523,6 +554,39 @@ static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
>  	return !!psock->saved_data_ready;
>  }
>  
> +/* for tcp only, sk is locked */
> +static inline ssize_t sk_psock_msg_inq(struct sock *sk)
> +{
> +	struct sk_psock *psock;
> +	ssize_t inq = 0;
> +
> +	psock = sk_psock_get(sk);
> +	if (likely(psock)) {
> +		inq = sk_psock_get_msg_len_nolock(psock);
> +		sk_psock_put(sk, psock);
> +	}
> +	return inq;
> +}
> +
> +/* for udp only, sk is not locked */
> +static inline ssize_t sk_msg_first_len(struct sock *sk)
> +{
> +	struct sk_psock *psock;
> +	struct sk_msg *msg;
> +	ssize_t inq = 0;
> +
> +	psock = sk_psock_get(sk);
> +	if (likely(psock)) {
> +		spin_lock_bh(&psock->ingress_lock);
> +		msg = sk_psock_peek_msg_locked(psock);
> +		if (msg)
> +			inq = msg->sg.size;
> +		spin_unlock_bh(&psock->ingress_lock);
> +		sk_psock_put(sk, psock);
> +	}
> +	return inq;
> +}
> +
>  #if IS_ENABLED(CONFIG_NET_SOCK_MSG)
>  
>  #define BPF_F_STRPARSER	(1UL << 1)
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index ca22ecdbf192..dd1965bf8f6a 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -458,6 +458,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
>  					atomic_sub(copy, &sk->sk_rmem_alloc);
>  				}
>  				msg_rx->sg.size -= copy;
> +				sk_psock_msg_len_add(psock, -copy);
>  
>  				if (!sge->length) {
>  					sk_msg_iter_var_next(i);
> @@ -822,9 +823,11 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
>  		list_del(&msg->list);
>  		if (!msg->skb)
>  			atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
> +		sk_psock_msg_len_add(psock, -msg->sg.size);
>  		sk_msg_free(psock->sk, msg);
>  		kfree(msg);
>  	}
> +	WARN_ON_ONCE(psock->msg_tot_len);
>  }
>  
>  static void __sk_psock_zap_ingress(struct sk_psock *psock)
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index 5c698fd7fbf8..476101533349 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -10,6 +10,7 @@
>  
>  #include <net/inet_common.h>
>  #include <net/tls.h>
> +#include <asm/ioctls.h>
>  
>  void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
>  {
> @@ -332,6 +333,36 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  	return copied;
>  }
>  
> +static int tcp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> +{
> +	bool slow;
> +
> +	if (cmd != SIOCINQ)
> +		return tcp_ioctl(sk, cmd, karg);
> +
> +	/* works similarly to tcp_ioctl() */
> +	if (sk->sk_state == TCP_LISTEN)
> +		return -EINVAL;
> +
> +	slow = lock_sock_fast(sk);
> +
> +	/* sk_receive_queue being non-empty only occurs if the sk received
> +	 * data before being accepted, and no new data has arrived to trigger
> +	 * read_skb. If data comes from the sk's native stack, we will also
> +	 * consume sk_receive_queue using read_skb.
> +	 */

You mean "before being added to sockmap"?

> +	if (unlikely(!skb_queue_empty(&sk->sk_receive_queue))) {
> +		unlock_sock_fast(sk, slow);
> +		return tcp_ioctl(sk, cmd, karg);
> +	}
> +
> +	*karg = sk_psock_msg_inq(sk);
> +
> +	unlock_sock_fast(sk, slow);
> +
> +	return 0;
> +}
> +
>  static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  			   int flags, int *addr_len)
>  {
> @@ -610,6 +641,7 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
>  	prot[TCP_BPF_BASE].close		= sock_map_close;
>  	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
>  	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
> +	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;
>  
>  	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
>  	prot[TCP_BPF_TX].sendmsg		= tcp_bpf_sendmsg;
> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> index 0735d820e413..91233e37cd97 100644
> --- a/net/ipv4/udp_bpf.c
> +++ b/net/ipv4/udp_bpf.c
> @@ -5,6 +5,7 @@
>  #include <net/sock.h>
>  #include <net/udp.h>
>  #include <net/inet_common.h>
> +#include <asm/ioctls.h>
>  
>  #include "udp_impl.h"
>  
> @@ -111,12 +112,26 @@ enum {
>  static DEFINE_SPINLOCK(udpv6_prot_lock);
>  static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
>  
> +static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
> +{
> +	if (cmd != SIOCINQ)
> +		return udp_ioctl(sk, cmd, karg);
> +
> +	/* Since we don't hold the lock, sk_receive_queue may contain data
> +	 * that BPF has not yet finished processing. We only care about the
> +	 * data already in ingress_msg here.
> +	 */

I think we should strive for a design where FIONREAD does not go down
after you add your socket to sockmap, if there was no recvmsg call in
between. To show what I mean, I added this test that's currently failing
for udp:

---8<---
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index 1f1289f5a8c2..123c96fcaef0 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -1229,6 +1229,66 @@ static void test_sockmap_copied_seq(bool strp)
 	test_sockmap_pass_prog__destroy(skel);
 }
 
+/* Test FIONREAD when data exists in sk_receive_queue before sockmap insertion */
+static void test_sockmap_fionread_pre_insert(int sotype)
+{
+	int map, err, sent, recvd, zero = 0, avail = 0;
+	struct test_sockmap_pass_prog *skel = NULL;
+	int c = -1, p = -1;
+	char buf[10] = "0123456789", rcv[11];
+	struct bpf_program *prog;
+
+	skel = test_sockmap_pass_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	prog = skel->progs.prog_skb_verdict;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+
+	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach verdict"))
+		goto end;
+
+	err = create_pair(AF_INET, sotype, &c, &p);
+	if (!ASSERT_OK(err, "create_pair"))
+		goto end;
+
+	/* Step 1: Send data BEFORE sockmap insertion - lands in sk_receive_queue */
+	sent = xsend(p, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend pre-insert"))
+		goto end;
+
+	/* Step 2: Verify FIONREAD reports data in sk_receive_queue */
+	err = poll_read(c, IO_TIMEOUT_SEC);
+	if (!ASSERT_OK(err, "poll_read pre-insert"))
+		goto end;
+	err = ioctl(c, FIONREAD, &avail);
+	ASSERT_OK(err, "ioctl(FIONREAD) pre-insert error");
+	ASSERT_EQ(avail, sizeof(buf), "ioctl(FIONREAD) pre-insert");
+
+	/* Step 3: Insert socket into sockmap */
+	err = bpf_map_update_elem(map, &zero, &c, BPF_ANY);
+	if (!ASSERT_OK(err, "bpf_map_update_elem(c)"))
+		goto end;
+
+	/* Step 4: FIONREAD should still report the data in sk_receive_queue */
+	err = ioctl(c, FIONREAD, &avail);
+	ASSERT_OK(err, "ioctl(FIONREAD) post-insert error");
+	ASSERT_EQ(avail, sizeof(buf), "ioctl(FIONREAD) post-insert");
+
+	/* Verify we can still read the data */
+	recvd = recv_timeout(c, rcv, sizeof(rcv), MSG_DONTWAIT, IO_TIMEOUT_SEC);
+	ASSERT_EQ(recvd, sizeof(buf), "recv post-insert");
+	ASSERT_OK(memcmp(buf, rcv, recvd), "data mismatch");
+
+end:
+	if (c >= 0)
+		close(c);
+	if (p >= 0)
+		close(p);
+	test_sockmap_pass_prog__destroy(skel);
+}
+
 /* it is used to send data to via native stack and BPF redirecting */
 static void test_sockmap_multi_channels(int sotype)
 {
@@ -1373,4 +1433,8 @@ void test_sockmap_basic(void)
 		test_sockmap_multi_channels(SOCK_STREAM);
 	if (test__start_subtest("sockmap udp multi channels"))
 		test_sockmap_multi_channels(SOCK_DGRAM);
+	if (test__start_subtest("sockmap tcp fionread pre-insert"))
+		test_sockmap_fionread_pre_insert(SOCK_STREAM);
+	if (test__start_subtest("sockmap udp fionread pre-insert"))
+		test_sockmap_fionread_pre_insert(SOCK_DGRAM);
 }
--->8---


> +	*karg = sk_msg_first_len(sk);
> +	return 0;
> +}
> +
>  static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
>  {
> -	*prot        = *base;
> -	prot->close  = sock_map_close;
> -	prot->recvmsg = udp_bpf_recvmsg;
> -	prot->sock_is_readable = sk_msg_is_readable;
> +	*prot			= *base;
> +	prot->close		= sock_map_close;
> +	prot->recvmsg		= udp_bpf_recvmsg;
> +	prot->sock_is_readable	= sk_msg_is_readable;
> +	prot->ioctl		= udp_bpf_ioctl;
>  }
>  
>  static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)


* Re: [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2026-01-13  2:50 ` [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
@ 2026-01-20 15:01   ` Jakub Sitnicki
  2026-01-20 15:38     ` John Fastabend
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Sitnicki @ 2026-01-20 15:01 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan, Michal Luczaj,
	Cong Wang, netdev, linux-kernel, linux-kselftest

On Tue, Jan 13, 2026 at 10:50 AM +08, Jiayuan Chen wrote:
> A socket using sockmap has its own independent receive queue: ingress_msg.
> This queue may contain data from its own protocol stack or from other
> sockets.
>
> The issue is that when reading from ingress_msg, we update tp->copied_seq
> by default. However, if the data is not from the socket's own protocol
> stack, tp->rcv_nxt is not increased. Later, if we convert this socket to a
> native socket, reading from it may fail because copied_seq might be
> significantly larger than rcv_nxt.
>
> This fix also addresses the syzkaller-reported bug referenced in the
> Closes tag.
>
> This patch marks the skmsg objects in ingress_msg. When reading, we update
> copied_seq only if the data is from its own protocol stack.
>
>                                                      FD1:read()
>                                                      --  FD1->copied_seq++
>                                                          |  [read data]
>                                                          |
>                                 [enqueue data]           v
>                   [sockmap]     -> ingress to self ->  ingress_msg queue
> FD1 native stack  ------>                                 ^
> -- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
>                                        |                  |
>                                        |             ingress to FD1
>                                        v                  ^
>                                       ...                 |  [sockmap]
>                                                      FD2 native stack
>
> Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  include/linux/skmsg.h |  2 ++
>  net/core/skmsg.c      | 28 +++++++++++++++++++++++++---
>  net/ipv4/tcp_bpf.c    |  5 +++--
>  3 files changed, 30 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 49847888c287..dfdc158ab88c 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -141,6 +141,8 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
>  			     struct sk_msg *msg, u32 bytes);
>  int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  		   int len, int flags);
> +int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> +		     int len, int flags, int *copied_from_self);
>  bool sk_msg_is_readable(struct sock *sk);
>  
>  static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 2ac7731e1e0a..ca22ecdbf192 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -409,22 +409,26 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
>  }
>  EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
>  
> -/* Receive sk_msg from psock->ingress_msg to @msg. */
> -int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> -		   int len, int flags)
> +int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> +		     int len, int flags, int *copied_from_self)
>  {
>  	struct iov_iter *iter = &msg->msg_iter;
>  	int peek = flags & MSG_PEEK;
>  	struct sk_msg *msg_rx;
>  	int i, copied = 0;
> +	bool from_self;
>  
>  	msg_rx = sk_psock_peek_msg(psock);
> +	if (copied_from_self)
> +		*copied_from_self = 0;
> +
>  	while (copied != len) {
>  		struct scatterlist *sge;
>  
>  		if (unlikely(!msg_rx))
>  			break;
>  
> +		from_self = msg_rx->sk == sk;
>  		i = msg_rx->sg.start;
>  		do {
>  			struct page *page;
> @@ -443,6 +447,9 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  			}
>  
>  			copied += copy;
> +			if (from_self && copied_from_self)
> +				*copied_from_self += copy;
> +
>  			if (likely(!peek)) {
>  				sge->offset += copy;
>  				sge->length -= copy;
> @@ -487,6 +494,14 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
>  out:
>  	return copied;
>  }
> +EXPORT_SYMBOL_GPL(__sk_msg_recvmsg);

Nit: Sorry, I hadn't caught that before. tcp_bpf is a built-in. We
don't need to export this internal helper to modules.

> +
> +/* Receive sk_msg from psock->ingress_msg to @msg. */
> +int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> +		   int len, int flags)
> +{
> +	return __sk_msg_recvmsg(sk, psock, msg, len, flags, NULL);
> +}
>  EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
>  
>  bool sk_msg_is_readable(struct sock *sk)
> @@ -616,6 +631,12 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>  	if (unlikely(!msg))
>  		return -EAGAIN;
>  	skb_set_owner_r(skb, sk);
> +
> +	/* This is used in tcp_bpf_recvmsg_parser() to determine whether the
> +	 * data originates from the socket's own protocol stack. No need to
> +	 * refcount sk because msg's lifetime is bound to sk via the ingress_msg.
> +	 */
> +	msg->sk = sk;
>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
>  	if (err < 0)
>  		kfree(msg);
> @@ -909,6 +930,7 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
>  	sk_msg_compute_data_pointers(msg);
>  	msg->sk = sk;
>  	ret = bpf_prog_run_pin_on_cpu(prog, msg);
> +	msg->sk = NULL;
>  	ret = sk_psock_map_verd(ret, msg->sk_redir);
>  	psock->apply_bytes = msg->apply_bytes;
>  	if (ret == __SK_REDIRECT) {
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index a268e1595b22..5c698fd7fbf8 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -226,6 +226,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  	int peek = flags & MSG_PEEK;
>  	struct sk_psock *psock;
>  	struct tcp_sock *tcp;
> +	int copied_from_self = 0;
>  	int copied = 0;
>  	u32 seq;
>  
> @@ -262,7 +263,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  	}
>  
>  msg_bytes_ready:
> -	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
> +	copied = __sk_msg_recvmsg(sk, psock, msg, len, flags, &copied_from_self);
>  	/* The typical case for EFAULT is the socket was gracefully
>  	 * shutdown with a FIN pkt. So check here the other case is
>  	 * some error on copy_page_to_iter which would be unexpected.
> @@ -277,7 +278,7 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
>  			goto out;
>  		}
>  	}
> -	seq += copied;
> +	seq += copied_from_self;
>  	if (!copied) {
>  		long timeo;
>  		int data;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2026-01-20 15:01   ` Jakub Sitnicki
@ 2026-01-20 15:38     ` John Fastabend
  2026-01-21  9:43       ` Jakub Sitnicki
  0 siblings, 1 reply; 13+ messages in thread
From: John Fastabend @ 2026-01-20 15:38 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: Jiayuan Chen, bpf, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Neal Cardwell, Kuniyuki Iwashima,
	David Ahern, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Michal Luczaj, Cong Wang, netdev, linux-kernel, linux-kselftest

On 2026-01-20 16:01:21, Jakub Sitnicki wrote:
> On Tue, Jan 13, 2026 at 10:50 AM +08, Jiayuan Chen wrote:
> > A socket using sockmap has its own independent receive queue: ingress_msg.
> > This queue may contain data from its own protocol stack or from other
> > sockets.
> >
> > The issue is that when reading from ingress_msg, we update tp->copied_seq
> > by default. However, if the data is not from its own protocol stack,
> > tcp->rcv_nxt is not increased. Later, if we convert this socket to a
> > native socket, reading from this socket may fail because copied_seq might
> > be significantly larger than rcv_nxt.
> >
> > This fix also addresses the syzkaller-reported bug referenced in the
> > Closes tag.
> >
> > This patch marks the skmsg objects in ingress_msg. When reading, we update
> > copied_seq only if the data is from its own protocol stack.
> >
> >                                                      FD1:read()
> >                                                      --  FD1->copied_seq++
> >                                                          |  [read data]
> >                                                          |
> >                                 [enqueue data]           v
> >                   [sockmap]     -> ingress to self ->  ingress_msg queue
> > FD1 native stack  ------>                                 ^
> > -- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
> >                                        |                  |
> >                                        |             ingress to FD1
> >                                        v                  ^
> >                                       ...                 |  [sockmap]
> >                                                      FD2 native stack
> >
> > Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
> > Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> > Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>

[...]

> > @@ -487,6 +494,14 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
> >  out:
> >  	return copied;
> >  }
> > +EXPORT_SYMBOL_GPL(__sk_msg_recvmsg);
> 
> Nit: Sorry, I haven't caught that before. tcp_bpf is a built-in. We
> don't need to export this internal helper to modules.

We could probably push this without the 2/3 patch? If we are still
debating that patch, it would be good to get this one merged.

Reviewed-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap
  2026-01-20 15:00   ` Jakub Sitnicki
@ 2026-01-21  9:36     ` Jakub Sitnicki
  2026-01-21 12:55       ` Jiayuan Chen
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Sitnicki @ 2026-01-21  9:36 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Andrii Nakryiko, Eduard Zingerman,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Michal Luczaj, Cong Wang, netdev, linux-kernel,
	linux-kselftest

On Tue, Jan 20, 2026 at 04:00 PM +01, Jakub Sitnicki wrote:
> On Tue, Jan 13, 2026 at 10:50 AM +08, Jiayuan Chen wrote:

[...]

>> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
>> index 0735d820e413..91233e37cd97 100644
>> --- a/net/ipv4/udp_bpf.c
>> +++ b/net/ipv4/udp_bpf.c
>> @@ -5,6 +5,7 @@
>>  #include <net/sock.h>
>>  #include <net/udp.h>
>>  #include <net/inet_common.h>
>> +#include <asm/ioctls.h>
>>  
>>  #include "udp_impl.h"
>>  
>> @@ -111,12 +112,26 @@ enum {
>>  static DEFINE_SPINLOCK(udpv6_prot_lock);
>>  static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
>>  
>> +static int udp_bpf_ioctl(struct sock *sk, int cmd, int *karg)
>> +{
>> +	if (cmd != SIOCINQ)
>> +		return udp_ioctl(sk, cmd, karg);
>> +
>> +	/* Since we don't hold a lock, sk_receive_queue may contain data.
>> +	 * BPF may still be processing that data at the moment. We only
>> +	 * care about the data in the ingress_msg here.
>> +	 */
>
> I think we should strive for a design where FIONREAD does not go down
> after you add your socket to sockmap, if there was no recvmsg call in
> between. To show what I mean, I added this test that's currently failing
> for udp:
>
> ---8<---
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> index 1f1289f5a8c2..123c96fcaef0 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
> @@ -1229,6 +1229,66 @@ static void test_sockmap_copied_seq(bool strp)
>  	test_sockmap_pass_prog__destroy(skel);
>  }
>  
> +/* Test FIONREAD when data exists in sk_receive_queue before sockmap insertion */
> +static void test_sockmap_fionread_pre_insert(int sotype)
> +{
> +	int map, err, sent, recvd, zero = 0, avail = 0;
> +	struct test_sockmap_pass_prog *skel = NULL;
> +	int c = -1, p = -1;
> +	char buf[10] = "0123456789", rcv[11];
> +	struct bpf_program *prog;
> +
> +	skel = test_sockmap_pass_prog__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "open_and_load"))
> +		return;
> +
> +	prog = skel->progs.prog_skb_verdict;
> +	map = bpf_map__fd(skel->maps.sock_map_rx);
> +
> +	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
> +	if (!ASSERT_OK(err, "bpf_prog_attach verdict"))
> +		goto end;
> +
> +	err = create_pair(AF_INET, sotype, &c, &p);
> +	if (!ASSERT_OK(err, "create_pair"))
> +		goto end;
> +
> +	/* Step 1: Send data BEFORE sockmap insertion - lands in sk_receive_queue */
> +	sent = xsend(p, buf, sizeof(buf), 0);
> +	if (!ASSERT_EQ(sent, sizeof(buf), "xsend pre-insert"))
> +		goto end;
> +
> +	/* Step 2: Verify FIONREAD reports data in sk_receive_queue */
> +	err = poll_read(c, IO_TIMEOUT_SEC);
> +	if (!ASSERT_OK(err, "poll_read pre-insert"))
> +		goto end;
> +	err = ioctl(c, FIONREAD, &avail);
> +	ASSERT_OK(err, "ioctl(FIONREAD) pre-insert error");
> +	ASSERT_EQ(avail, sizeof(buf), "ioctl(FIONREAD) pre-insert");
> +
> +	/* Step 3: Insert socket into sockmap */
> +	err = bpf_map_update_elem(map, &zero, &c, BPF_ANY);
> +	if (!ASSERT_OK(err, "bpf_map_update_elem(c)"))
> +		goto end;
> +
> +	/* Step 4: FIONREAD should still report the data in sk_receive_queue */
> +	err = ioctl(c, FIONREAD, &avail);
> +	ASSERT_OK(err, "ioctl(FIONREAD) post-insert error");
> +	ASSERT_EQ(avail, sizeof(buf), "ioctl(FIONREAD) post-insert");
> +
> +	/* Verify we can still read the data */
> +	recvd = recv_timeout(c, rcv, sizeof(rcv), MSG_DONTWAIT, IO_TIMEOUT_SEC);
> +	ASSERT_EQ(recvd, sizeof(buf), "recv post-insert");
> +	ASSERT_OK(memcmp(buf, rcv, recvd), "data mismatch");
> +
> +end:
> +	if (c >= 0)
> +		close(c);
> +	if (p >= 0)
> +		close(p);
> +	test_sockmap_pass_prog__destroy(skel);
> +}
> +
>  /* it is used to send data via the native stack and BPF redirecting */
>  static void test_sockmap_multi_channels(int sotype)
>  {
> @@ -1373,4 +1433,8 @@ void test_sockmap_basic(void)
>  		test_sockmap_multi_channels(SOCK_STREAM);
>  	if (test__start_subtest("sockmap udp multi channels"))
>  		test_sockmap_multi_channels(SOCK_DGRAM);
> +	if (test__start_subtest("sockmap tcp fionread pre-insert"))
> +		test_sockmap_fionread_pre_insert(SOCK_STREAM);
> +	if (test__start_subtest("sockmap udp fionread pre-insert"))
> +		test_sockmap_fionread_pre_insert(SOCK_DGRAM);
>  }
> --->8---
>
>
>> +	*karg = sk_msg_first_len(sk);
>> +	return 0;
>> +}
>> +

I've been thinking about this some more and came to the conclusion that
this udp_bpf_ioctl implementation is actually what we want, while
tcp_bpf_ioctl *should not* be checking if the sk_receive_queue is
non-empty.

Why? Because the verdict prog might redirect or drop the skbs from
sk_receive_queue once it actually runs. The messages might never appear
on the ingress_msg queue.

What I think we should be doing, in the end, is kicking the
sk_receive_queue processing on bpf_map_update_elem, if there's data
ready.
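
Roughly along these lines - a hypothetical sketch only (helper name made
up, locking and the exact call site glossed over); once the map update
has installed the psock callbacks, sk_data_ready points at the verdict
handler, so poking it replays the queued skbs through the verdict prog:

	/* hypothetical, not part of this series */
	static void sock_map_kick_rx(struct sock *sk)
	{
		/* skbs that were already sitting in sk_receive_queue
		 * before the psock callbacks were installed get run
		 * through the verdict prog now
		 */
		if (!skb_queue_empty(&sk->sk_receive_queue))
			sk->sk_data_ready(sk);
	}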

The API semantics I'm proposing are:

1. ioctl(FIONREAD)              -> reports N bytes
2. bpf_map_update_elem(sk)      -> socket inserted into sockmap
3. poll() for POLLIN            -> wait for socket to be ready to read
4. ioctl(FIONREAD)              -> report N bytes if verdict prog didn't
                                   redirect or drop it

We don't have to add the queue kick on map update in this series.

If you decide to leave it for later, can I ask that you open an issue at
our GH project [1]?

I don't want it to fall through the cracks. And I sometimes have people
asking what they could help with in sockmap.

Thanks,
-jkbs

[1] https://github.com/sockmap-project/sockmap-project/issues



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation
  2026-01-20 15:38     ` John Fastabend
@ 2026-01-21  9:43       ` Jakub Sitnicki
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2026-01-21  9:43 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jiayuan Chen, bpf, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Neal Cardwell, Kuniyuki Iwashima,
	David Ahern, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Michal Luczaj, Cong Wang, netdev, linux-kernel, linux-kselftest

On Tue, Jan 20, 2026 at 07:38 AM -08, John Fastabend wrote:
> We could probably push this without the 2/3 patch? If we are debating
> that patch still would be good to get this merged.

Definitely. But also see my email from today [1]. I think I figured out
a way we can leave FIONREAD in better shape than it is today - with what
Jiayuan proposed - and improve it in the future.

[1] https://msgid.link/871pjjux2u.fsf@cloudflare.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next v7 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq
  2026-01-13  2:50 ` [PATCH bpf-next v7 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen
@ 2026-01-21  9:45   ` Jakub Sitnicki
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2026-01-21  9:45 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
	Stefano Garzarella, Michal Luczaj, Cong Wang, netdev,
	linux-kernel, linux-kselftest

On Tue, Jan 13, 2026 at 10:50 AM +08, Jiayuan Chen wrote:
> +/* it is used to send data via the native stack and BPF redirecting */
> +static void test_sockmap_multi_channels(int sotype)
> +{
> +	int map, err, sent, recvd, zero = 0, one = 1, avail = 0;
> +	struct test_sockmap_pass_prog *skel = NULL;
> +	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
> +	char buf[10] = "0123456789", rcv[11];
> +	struct bpf_program *prog;
> +
> +	skel = test_sockmap_pass_prog__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "open_and_load"))
> +		return;
> +
> +	err = create_socket_pairs(AF_INET, sotype, &c0, &c1, &p0, &p1);
> +	if (err)
> +		goto end;
> +
> +	prog = skel->progs.prog_skb_verdict_ingress;
> +	map = bpf_map__fd(skel->maps.sock_map_rx);
> +
> +	err = bpf_prog_attach(bpf_program__fd(prog), map, BPF_SK_SKB_STREAM_VERDICT, 0);
> +	if (!ASSERT_OK(err, "bpf_prog_attach verdict"))
> +		goto end;
> +
> +	err = bpf_map_update_elem(map, &zero, &p0, BPF_ANY);
> +	if (!ASSERT_OK(err, "bpf_map_update_elem(p0)"))
> +		goto end;
> +
> +	err = bpf_map_update_elem(map, &one, &p1, BPF_ANY);
> +	if (!ASSERT_OK(err, "bpf_map_update_elem"))
> +		goto end;
> +
> +	/* send data to p1 via native stack */
> +	sent = xsend(c1, buf, 2, 0);
> +	if (!ASSERT_EQ(sent, 2, "xsend(2)"))
> +		goto end;
> +
> +	sleep(1);
> +	err = ioctl(p1, FIONREAD, &avail);
> +	ASSERT_OK(err, "ioctl(FIONREAD) partial call");
> +	ASSERT_EQ(avail, 2, "ioctl(FIONREAD) partial return");
> +
> +	/* send data to p1 via bpf redirecting */
> +	sent = xsend(c0, buf + 2, sizeof(buf) - 2, 0);
> +	if (!ASSERT_EQ(sent, sizeof(buf) - 2, "xsend(remain-data)"))
> +		goto end;
> +
> +	sleep(1);
> +	err = ioctl(p1, FIONREAD, &avail);
> +	ASSERT_OK(err, "ioctl(FIONREAD) full call");
> +	ASSERT_EQ(avail, sotype == SOCK_DGRAM ? 2 : sizeof(buf), "ioctl(FIONREAD) full return");
> +
> +	recvd = recv_timeout(p1, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
> +	if (!ASSERT_EQ(recvd, sizeof(buf), "recv_timeout(p1)") ||
> +	    !ASSERT_OK(memcmp(buf, rcv, recvd), "data mismatch"))
> +		goto end;
> +end:
> +	if (c0 >= 0)
> +		close(c0);
> +	if (p0 >= 0)
> +		close(p0);
> +	if (c1 >= 0)
> +		close(c1);
> +	if (p1 >= 0)
> +		close(p1);
> +	test_sockmap_pass_prog__destroy(skel);
> +}
> +
>  void test_sockmap_basic(void)
>  {
>  	if (test__start_subtest("sockmap create_update_free"))

Can you replace sleep(1) with poll_read(fd, IO_TIMEOUT_SEC)?
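
That is, something along these lines (poll_read is the same helper
already used elsewhere in sockmap_basic.c):

	/* wait for the data to be queued instead of sleeping */
	err = poll_read(p1, IO_TIMEOUT_SEC);
	if (!ASSERT_OK(err, "poll_read(p1)"))
		goto end;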

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap
  2026-01-21  9:36     ` Jakub Sitnicki
@ 2026-01-21 12:55       ` Jiayuan Chen
  2026-01-22  3:56         ` Jiayuan Chen
  0 siblings, 1 reply; 13+ messages in thread
From: Jiayuan Chen @ 2026-01-21 12:55 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, John Fastabend, David  S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Andrii Nakryiko, Eduard Zingerman,
	Alexei Starovoitov, Daniel  Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP  Singh, Stanislav Fomichev, Hao  Luo, Jiri Olsa,
	Shuah Khan, Michal Luczaj, Cong Wang, netdev, linux-kernel,
	linux-kselftest

January 21, 2026 at 17:36, "Jakub Sitnicki" <jakub@cloudflare.com> wrote:

> On Tue, Jan 20, 2026 at 04:00 PM +01, Jakub Sitnicki wrote:
> > On Tue, Jan 13, 2026 at 10:50 AM +08, Jiayuan Chen wrote:

[...]

> I've been thinking about this some more and came to the conclusion that
> this udp_bpf_ioctl implementation is actually what we want, while
> tcp_bpf_ioctl *should not* be checking if the sk_receive_queue is
> non-empty.
>
> Why? Because the verdict prog might redirect or drop the skbs from
> sk_receive_queue once it actually runs. The messages might never appear
> on the ingress_msg queue.
>
> What I think we should be doing, in the end, is kicking the
> sk_receive_queue processing on bpf_map_update_elem, if there's data
> ready.
>
> The API semantics I'm proposing are:
>
> 1. ioctl(FIONREAD)         -> reports N bytes
> 2. bpf_map_update_elem(sk) -> socket inserted into sockmap
> 3. poll() for POLLIN       -> wait for socket to be ready to read
> 4. ioctl(FIONREAD)         -> report N bytes if verdict prog didn't
>                               redirect or drop it
>
> We don't have to add the queue kick on map update in this series.
>
> If you decide to leave it for later, can I ask that you open an issue at
> our GH project [1]?
>
> I don't want it to fall through the cracks. And I sometimes have people
> asking what they could help with in sockmap.
>
> Thanks,
> -jkbs
>
> [1] https://github.com/sockmap-project/sockmap-project/issues

Hi Jakub,

Thanks for taking the time to think through this carefully. I agree with your
analysis - reporting sk_receive_queue length is misleading since the verdict
prog might redirect or drop those skbs.

There's no rush to merge this patch.

Since the queue kick on bpf_map_update_elem addresses a closely related issue,
I think it makes sense to include it in this patchset for easier tracking rather
than splitting it out.

I'll spend more time looking into this and come back with an updated version.

Thanks,
Jiayuan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap
  2026-01-21 12:55       ` Jiayuan Chen
@ 2026-01-22  3:56         ` Jiayuan Chen
  2026-01-23 14:59           ` Jakub Sitnicki
  0 siblings, 1 reply; 13+ messages in thread
From: Jiayuan Chen @ 2026-01-22  3:56 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: bpf, John Fastabend, David  S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David  Ahern, Andrii Nakryiko,
	Eduard Zingerman, Alexei Starovoitov, Daniel  Borkmann,
	Martin  KaFai Lau, Song Liu, Yonghong Song, KP  Singh,
	Stanislav Fomichev, Hao  Luo, Jiri Olsa, Shuah Khan,
	Michal Luczaj, Cong Wang, netdev, linux-kernel, linux-kselftest

January 21, 2026 at 20:55, "Jiayuan Chen" <jiayuan.chen@linux.dev> wrote:

> January 21, 2026 at 17:36, "Jakub Sitnicki" <jakub@cloudflare.com> wrote:
>
> > What I think we should be doing, in the end, is kicking the
> > sk_receive_queue processing on bpf_map_update_elem, if there's data
> > ready.

[...]

> Since the queue kick on bpf_map_update_elem addresses a closely related issue,
> I think it makes sense to include it in this patchset for easier tracking rather
> than splitting it out.
>
> I'll spend more time looking into this and come back with an updated version.
>
> Thanks,
> Jiayuan

Hi Jakub,

  I've been thinking about this more, and I realize the problem is not as simple as it seemed.

  Regarding kicking the sk_receive_queue on bpf_map_update_elem: the BPF
  program may not be fully initialized at that point. For example, with a
  redirect program, the destination fd might not yet be inserted into the
  map. If we kick the data through the BPF program immediately, the
  redirect lookup would fail, leading to unexpected behavior (data being
  dropped or passed to the wrong socket).
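
  To make the ordering concrete - a hypothetical user-space sequence
  with made-up key/fd names, along the lines of what the redirect
  selftests do:

	bpf_map_update_elem(map, &zero, &src_fd, BPF_ANY);
	/* kicking src_fd's sk_receive_queue at this point would run the
	 * verdict prog, which redirects towards key 'one' ...
	 */
	bpf_map_update_elem(map, &one, &dst_fd, BPF_ANY);
	/* ... but the destination is only inserted now, so the earlier
	 * redirect lookup fails and the data is dropped
	 */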

  I also considered triggering the kick in poll/select via
  sk_msg_is_readable(). However, this approach doesn't work for TCP
  because tcp_poll() -> tcp_stream_is_readable() -> tcp_epollin_ready()
  will return early when sk_receive_queue has data, before ever calling
  sk_is_readable().
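
  For reference, tcp_stream_is_readable() in mainline looks roughly
  like this (paraphrased from memory, not an exact copy):

	static bool tcp_stream_is_readable(struct sock *sk, int target)
	{
		if (tcp_epollin_ready(sk, target))
			return true;	/* early return: queued bytes */
		return sk_is_readable(sk);	/* never reached then */
	}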

  In the next version, I'll address your other nits and remove the
  sk_receive_queue check from tcp_bpf_ioctl. I'll also open an issue on
  the GH project to track this problem so we can continue exploring
  better solutions.

  Thanks,
  Jiayuan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap
  2026-01-22  3:56         ` Jiayuan Chen
@ 2026-01-23 14:59           ` Jakub Sitnicki
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2026-01-23 14:59 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Neal Cardwell,
	Kuniyuki Iwashima, David Ahern, Andrii Nakryiko, Eduard Zingerman,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Michal Luczaj, Cong Wang, netdev, linux-kernel,
	linux-kselftest

On Thu, Jan 22, 2026 at 03:56 AM GMT, Jiayuan Chen wrote:
> January 21, 2026 at 20:55, "Jiayuan Chen" <jiayuan.chen@linux.dev> wrote:
>> January 21, 2026 at 17:36, "Jakub Sitnicki" <jakub@cloudflare.com> wrote:
>> >  What I think we should be doing, in the end, is kicking the
>> >  sk_receive_queue processing on bpf_map_update_elem, if there's data
>> >  ready.

[...]

> Hi Jakub,
>
>   I've been thinking about this more, and I realize the problem is not as simple as it seemed.
>
>   Regarding kicking the sk_receive_queue on bpf_map_update_elem: the BPF
>   program may not be fully initialized at that point. For example, with a
>   redirect program, the destination fd might not yet be inserted into the
>   map. If we kick the data through the BPF program immediately, the
>   redirect lookup would fail, leading to unexpected behavior (data being
>   dropped or passed to the wrong socket).

I reckon there is not much we can do about it because we have no control
over when sockets get inserted into or removed from the sockmap. It can
happen at any time.

Also, a newly received segment can trigger the sk_data_ready callback,
and that would also cause the skbs to get processed. We don't have
control over that either.

Does this change break any of our existing tests/benchmarks or some
other setup of yours?

>   I also considered triggering the kick in poll/select via
>   sk_msg_is_readable(). However, this approach doesn't work for TCP
>   because tcp_poll() -> tcp_stream_is_readable() -> tcp_epollin_ready()
>   will return early when sk_receive_queue has data, before ever calling
>   sk_is_readable().
>
>   In the next version, I'll address your other nits and remove the
>   sk_receive_queue check from tcp_bpf_ioctl. I'll also open an issue on
>   the GH project to track this problem so we can continue exploring
>   better solutions.

Sounds like a plan. Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-01-23 14:59 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-13  2:50 [PATCH bpf-next v7 0/3] bpf: Fix FIONREAD and copied_seq issues Jiayuan Chen
2026-01-13  2:50 ` [PATCH bpf-next v7 1/3] bpf, sockmap: Fix incorrect copied_seq calculation Jiayuan Chen
2026-01-20 15:01   ` Jakub Sitnicki
2026-01-20 15:38     ` John Fastabend
2026-01-21  9:43       ` Jakub Sitnicki
2026-01-13  2:50 ` [PATCH bpf-next v7 2/3] bpf, sockmap: Fix FIONREAD for sockmap Jiayuan Chen
2026-01-20 15:00   ` Jakub Sitnicki
2026-01-21  9:36     ` Jakub Sitnicki
2026-01-21 12:55       ` Jiayuan Chen
2026-01-22  3:56         ` Jiayuan Chen
2026-01-23 14:59           ` Jakub Sitnicki
2026-01-13  2:50 ` [PATCH bpf-next v7 3/3] bpf, selftest: Add tests for FIONREAD and copied_seq Jiayuan Chen
2026-01-21  9:45   ` Jakub Sitnicki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox