* [PATCH net 0/2] bpf, sockmap: fix copied_seq after partial TCP read
@ 2026-07-02 14:09 Dong Chenchen
2026-07-02 14:09 ` [PATCH net 1/2] bpf, sockmap: account only unread data in tcp_eat_skb Dong Chenchen
2026-07-02 14:09 ` [PATCH net 2/2] selftests/bpf: cover sockmap drop after partial TCP read Dong Chenchen
From: Dong Chenchen @ 2026-07-02 14:09 UTC (permalink / raw)
To: daniel, edumazet, ncardwell, kuniyu, john.fastabend, jakub,
jiayuan.chen
Cc: davem, kuba, pabeni, horms, zhangchangzhong, netdev, bpf,
Dong Chenchen

tcp_eat_skb() assumes that an skb dequeued by the sockmap verdict path
has not previously been consumed. However, a socket can be inserted
into a sockmap after userspace has partially read the skb at the head of
its receive queue.

When new data later triggers the verdict path, that partially read skb
is dequeued and tcp_eat_skb() advances copied_seq by its full length.
This counts the already-read prefix twice, moves copied_seq beyond
rcv_nxt, and makes subsequent native TCP reads fail.
TCP_ZEROCOPY_RECEIVE then triggers the tcp_recvmsg_locked() sequence
warning reported by syzbot:

TCP recvmsg seq # bug 2: copied AA28C633, seq AA28C601, rcvnxt AA28C602
WARNING: net/ipv4/tcp.c:2745 at tcp_recvmsg_locked
RIP: 0010:tcp_recvmsg_locked (net/ipv4/tcp.c:2745)
Call Trace:
<TASK>
tcp_zerocopy_receive (net/ipv4/tcp.c:1995 net/ipv4/tcp.c:2227)
do_tcp_getsockopt (net/ipv4/tcp.c:4771)
tcp_getsockopt (net/ipv4/tcp.c:4869)
do_sock_getsockopt (net/socket.c:2487)
__sys_getsockopt (net/socket.c:2518)
__x64_sys_getsockopt (net/socket.c:2525 net/socket.c:2522)
do_syscall_64 (arch/x86/entry/syscall_64.c:63)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)

Patch 1 sets copied_seq to the skb's end sequence number
(TCP_SKB_CB(skb)->end_seq) and passes only the unread sequence-space
delta to receive-buffer cleanup.

Patch 2 adds a deterministic regression test that reproduces the
tcp_recvmsg_locked() warning on an unpatched kernel.

Dong Chenchen (2):
  bpf, sockmap: account only unread data in tcp_eat_skb
  selftests/bpf: cover sockmap drop after partial TCP read

 net/ipv4/tcp_bpf.c                            |  9 ++-
 .../selftests/bpf/prog_tests/sockmap_basic.c  | 73 +++++++++++++++++++
 2 files changed, 78 insertions(+), 4 deletions(-)

--
2.43.0
* [PATCH net 1/2] bpf, sockmap: account only unread data in tcp_eat_skb
From: Dong Chenchen @ 2026-07-02 14:09 UTC (permalink / raw)
To: daniel, edumazet, ncardwell, kuniyu, john.fastabend, jakub,
jiayuan.chen
Cc: davem, kuba, pabeni, horms, zhangchangzhong, netdev, bpf,
Dong Chenchen, syzbot+06dbd397158ec0ea4983

tcp_eat_skb() advances copied_seq by the full skb length when a sockmap
verdict drops or redirects an skb. This assumes that no part of the skb
has been consumed yet.

That assumption does not hold when userspace partially reads an skb
before the socket is added to a sockmap. A later segment triggers the
verdict path, which dequeues the partially consumed skb. Adding its full
length counts the consumed prefix twice and can move copied_seq beyond
rcv_nxt, causing subsequent native TCP reads to fail.

The following sequence reproduces the corruption (sequence numbers are
relative to the first byte of the stream):

1. TCP receives a 200-byte segment; the skb sits on sk_receive_queue.
2. Userspace reads 50 bytes.      // copied_seq = 50, rcv_nxt = 200
3. The socket is inserted into a sockmap with an SK_DROP verdict.
4. A 1-byte segment arrives and tcp_try_coalesce() merges it into the
   existing skb.     // skb->len = 201, copied_seq = 50, rcv_nxt = 201
5. The verdict path calls tcp_eat_skb(), which does:
       copied_seq += skb->len;   // copied_seq = 251, rcv_nxt = 201
   This counts the 50 already-read bytes again.
6. After the socket is removed from the map and another byte arrives,
   a native receive triggers the sequence warning:
       TCP recvmsg seq # bug 2: copied AA28C633, seq AA28C601,
       rcvnxt AA28C602, fl 40
       WARNING: net/ipv4/tcp.c:2745 at tcp_recvmsg_locked+0x45e/0x9f0

Fix tcp_eat_skb() to set copied_seq to the skb's TCP end sequence
number (TCP_SKB_CB(skb)->end_seq) and pass only the unread delta,
end_seq - copied_seq, to __tcp_cleanup_rbuf().

Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
Reported-by: syzbot+06dbd397158ec0ea4983@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
---
net/ipv4/tcp_bpf.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cc0bd73f36b6..d640f8e06529 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -15,7 +15,7 @@
 void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tcp;
-	int copied;
+	u32 end_seq, delta;
 
 	if (!skb || !skb->len || !sk_is_tcp(sk))
 		return;
@@ -24,10 +24,11 @@ void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
 		return;
 
 	tcp = tcp_sk(sk);
-	copied = tcp->copied_seq + skb->len;
-	WRITE_ONCE(tcp->copied_seq, copied);
+	end_seq = TCP_SKB_CB(skb)->end_seq;
+	delta = end_seq - tcp->copied_seq;
+	WRITE_ONCE(tcp->copied_seq, end_seq);
 	tcp_rcv_space_adjust(sk);
-	__tcp_cleanup_rbuf(sk, skb->len);
+	__tcp_cleanup_rbuf(sk, delta);
 }
 
 static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
--
2.43.0
* [PATCH net 2/2] selftests/bpf: cover sockmap drop after partial TCP read
From: Dong Chenchen @ 2026-07-02 14:09 UTC (permalink / raw)
To: daniel, edumazet, ncardwell, kuniyu, john.fastabend, jakub,
jiayuan.chen
Cc: davem, kuba, pabeni, horms, zhangchangzhong, netdev, bpf,
Dong Chenchen

Add a regression test for a TCP socket that is partially read before it
is inserted into a sockmap with an SK_DROP verdict.

The test reads only part of the first skb, leaving the rest on the
receive queue, adds the socket to the map, and sends one more byte to
drive the verdict path. After removing the socket from the map, it sends
another byte and reads it natively with TCP_ZEROCOPY_RECEIVE.

Without the tcp_eat_skb() fix, copied_seq counts the already consumed
prefix twice, TCP_ZEROCOPY_RECEIVE triggers the sequence warning, and no
data is copied. With the fix, the new byte is copied without a warning.

Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com>
---
.../selftests/bpf/prog_tests/sockmap_basic.c | 73 +++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index cb3229711f93..106dd03cde84 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -1275,6 +1275,77 @@ static void test_sockmap_copied_seq(bool strp)
 	test_sockmap_pass_prog__destroy(skel);
 }
 
+static void test_sockmap_drop_after_partial_read(void)
+{
+	int map, err, sent, recvd, zero = 0, on = 1;
+	struct test_sockmap_drop_prog *skel;
+	int c0 = -1, p0 = -1, c1 = -1, p1 = -1;
+	struct tcp_zerocopy_receive zc = {};
+	socklen_t zc_len = sizeof(zc);
+	char buf[200] = {}, rcv[50], addr[100];
+	struct bpf_program *prog;
+
+	skel = test_sockmap_drop_prog__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		return;
+
+	if (create_socket_pairs(AF_INET, SOCK_STREAM, &c0, &c1, &p0, &p1))
+		goto end;
+
+	sent = xsend(c0, buf, sizeof(buf), 0);
+	if (!ASSERT_EQ(sent, sizeof(buf), "xsend(native)"))
+		goto end;
+
+	recvd = recv_timeout(p0, rcv, sizeof(rcv), MSG_DONTWAIT, 1);
+	if (!ASSERT_EQ(recvd, sizeof(rcv), "recv_timeout(partial)"))
+		goto end;
+
+	prog = skel->progs.prog_skb_verdict;
+	map = bpf_map__fd(skel->maps.sock_map_rx);
+	err = bpf_prog_attach(bpf_program__fd(prog), map,
+			      BPF_SK_SKB_STREAM_VERDICT, 0);
+	if (!ASSERT_OK(err, "bpf_prog_attach"))
+		goto end;
+
+	err = bpf_map_update_elem(map, &zero, &p0, BPF_NOEXIST);
+	if (!ASSERT_OK(err, "bpf_map_update_elem"))
+		goto end;
+
+	sent = xsend(c0, buf, 1, 0);
+	if (!ASSERT_EQ(sent, 1, "xsend(drop)"))
+		goto end;
+
+	err = bpf_map_delete_elem(map, &zero);
+	if (!ASSERT_OK(err, "bpf_map_delete_elem"))
+		goto end;
+
+	sent = xsend(c0, buf, 1, 0);
+	if (!ASSERT_EQ(sent, 1, "xsend(native again)"))
+		goto end;
+
+	err = setsockopt(p0, SOL_SOCKET, SO_ZEROCOPY, &on, sizeof(on));
+	if (!ASSERT_OK(err, "setsockopt(SO_ZEROCOPY)"))
+		goto end;
+
+	zc.copybuf_address = (__u64)(unsigned long)addr;
+	zc.copybuf_len = sizeof(addr);
+	err = getsockopt(p0, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
+	if (!ASSERT_OK(err, "getsockopt(TCP_ZEROCOPY_RECEIVE)"))
+		goto end;
+
+	ASSERT_EQ(zc.copybuf_len, 1, "TCP_ZEROCOPY_RECEIVE copied");
+end:
+	if (c0 >= 0)
+		close(c0);
+	if (p0 >= 0)
+		close(p0);
+	if (c1 >= 0)
+		close(c1);
+	if (p1 >= 0)
+		close(p1);
+	test_sockmap_drop_prog__destroy(skel);
+}
+
 /* Wait until FIONREAD returns the expected value or timeout */
 static int wait_for_fionread(int fd, int expected, unsigned int timeout_ms)
 {
@@ -1447,6 +1518,8 @@ void test_sockmap_basic(void)
 		test_sockmap_copied_seq(false);
 	if (test__start_subtest("sockmap recover with strp"))
 		test_sockmap_copied_seq(true);
+	if (test__start_subtest("sockmap drop after partial read"))
+		test_sockmap_drop_after_partial_read();
 	if (test__start_subtest("sockmap tcp multi channels"))
 		test_sockmap_multi_channels(SOCK_STREAM);
 	if (test__start_subtest("sockmap udp multi channels"))
--
2.43.0