* [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: zhengguoyong @ 2025-10-09 3:07 UTC (permalink / raw)
To: john.fastabend, jakub, davem, edumazet, kuba, pabeni; +Cc: netdev, bpf

When using sockmap to forward TCP traffic to the application
layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
processing flow will synchronously update the tp->copied_seq field.
This causes tp->rcv_nxt to become less than tp->copied_seq.

Later, when this socket receives SKB packets from the protocol stack,
in the call chain tcp_data_ready → tcp_epollin_ready, the function
tcp_epollin_ready will return false, preventing the socket from being
woken up to receive new packets.

Therefore, it is necessary to synchronously update the tp->rcv_nxt
information in sk_psock_skb_ingress.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
---
 net/core/skmsg.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9becadd..e9d841c 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	struct sock *sk = psock->sk;
 	struct sk_msg *msg;
 	int err;
+	u32 seq;
 
 	/* If we are receiving on the same sock skb->sk is already assigned,
 	 * skip memory accounting and owner transition seeing it already set
@@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	 */
 	skb_set_owner_r(skb, sk);
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
-	if (err < 0)
+	if (err < 0) {
 		kfree(msg);
+	} else {
+		bh_lock_sock_nested(sk);
+		seq = READ_ONCE(tcp_sk(sk)->rcv_nxt) + len;
+		WRITE_ONCE(tcp_sk(sk)->rcv_nxt, seq);
+		bh_unlock_sock(sk);
+	}
+
 	return err;
 }
-- 
1.8.3.1
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: Eric Dumazet @ 2025-10-09 7:07 UTC (permalink / raw)
To: zhengguoyong; +Cc: john.fastabend, jakub, davem, kuba, pabeni, netdev, bpf

On Wed, Oct 8, 2025 at 8:07 PM zhengguoyong <zhenggy@chinatelecom.cn> wrote:
>
> When using sockmap to forward TCP traffic to the application
> layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
> processing flow will synchronously update the tp->copied_seq field.
> This causes tp->rcv_nxt to become less than tp->copied_seq.
>
> Later, when this socket receives SKB packets from the protocol stack,
> in the call chain tcp_data_ready → tcp_epollin_ready, the function
> tcp_epollin_ready will return false, preventing the socket from being
> woken up to receive new packets.
>
> Therefore, it is necessary to synchronously update the tp->rcv_nxt
> information in sk_psock_skb_ingress.
>
> Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>

Hi GuoYong Zheng

We request a Fixes: tag for patches claiming to fix a bug.
How would stable teams decide to backport a patch or not, and to which
versions, without having to fully understand this code ?

> ---
>  net/core/skmsg.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9becadd..e9d841c 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	struct sock *sk = psock->sk;
>  	struct sk_msg *msg;
>  	int err;
> +	u32 seq;
>
>  	/* If we are receiving on the same sock skb->sk is already assigned,
>  	 * skip memory accounting and owner transition seeing it already set
> @@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	 */
>  	skb_set_owner_r(skb, sk);
>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
> -	if (err < 0)
> +	if (err < 0) {
>  		kfree(msg);
> +	} else {
> +		bh_lock_sock_nested(sk);
> +		seq = READ_ONCE(tcp_sk(sk)->rcv_nxt) + len;
> +		WRITE_ONCE(tcp_sk(sk)->rcv_nxt, seq);

This does not look to be the right place.

Re-locking a socket _after_ the fundamental change took place is
fundamentally racy.

Also do we have a guarantee sk is always a TCP socket at this point ?

If yes, why do we have sk_is_tcp() check in sk_psock_init_strp() ?

> +		bh_unlock_sock(sk);
> +	}
> +
>  	return err;
>  }
>
> --
> 1.8.3.1
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: zhengguoyong @ 2025-10-10 8:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: john.fastabend, jakub, davem, kuba, pabeni, netdev, bpf

Hi Eric,

Thank you for your reply. Indeed, using bh_lock_sock_nested can lead to
deadlock risks. I apologize for not noticing this earlier.

Can I change bh_lock_sock_nested to lock_sock to resolve this deadlock
issue? Or do you have any better suggestions on where to update the
tp->rcv_nxt field in the process?

Looking forward to your response. Thank you.

From: Eric Dumazet
To: zhengguoyong
Cc: john.fastabend@gmail.com, jakub@cloudflare.com, davem@davemloft.net,
    kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org,
    bpf@vger.kernel.org
Subject: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
Date: 2025-10-09 15:07:44

> On Wed, Oct 8, 2025 at 8:07 PM zhengguoyong <zhenggy@chinatelecom.cn> wrote:
>>
>> When using sockmap to forward TCP traffic to the application
>> layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
>> processing flow will synchronously update the tp->copied_seq field.
>> This causes tp->rcv_nxt to become less than tp->copied_seq.
>>
>> Later, when this socket receives SKB packets from the protocol stack,
>> in the call chain tcp_data_ready → tcp_epollin_ready, the function
>> tcp_epollin_ready will return false, preventing the socket from being
>> woken up to receive new packets.
>>
>> Therefore, it is necessary to synchronously update the tp->rcv_nxt
>> information in sk_psock_skb_ingress.
>>
>> Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
>
> Hi GuoYong Zheng
>
> We request a Fixes: tag for patches claiming to fix a bug.
>
> How would stable teams decide to backport a patch or not, and to which
> versions, without having to fully understand this code ?
>
>> ---
>>  net/core/skmsg.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 9becadd..e9d841c 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>>  	struct sock *sk = psock->sk;
>>  	struct sk_msg *msg;
>>  	int err;
>> +	u32 seq;
>>
>>  	/* If we are receiving on the same sock skb->sk is already assigned,
>>  	 * skip memory accounting and owner transition seeing it already set
>> @@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>>  	 */
>>  	skb_set_owner_r(skb, sk);
>>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
>> -	if (err < 0)
>> +	if (err < 0) {
>>  		kfree(msg);
>> +	} else {
>> +		bh_lock_sock_nested(sk);
>> +		seq = READ_ONCE(tcp_sk(sk)->rcv_nxt) + len;
>> +		WRITE_ONCE(tcp_sk(sk)->rcv_nxt, seq);
>
> This does not look to be the right place.
>
> Re-locking a socket _after_ the fundamental change took place is
> fundamentally racy.
>
> Also do we have a guarantee sk is always a TCP socket at this point ?
>
> If yes, why do we have sk_is_tcp() check in sk_psock_init_strp() ?
>
>> +		bh_unlock_sock(sk);
>> +	}
>> +
>>  	return err;
>>  }
>>
>> --
>> 1.8.3.1
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: Eric Dumazet @ 2025-10-10 9:16 UTC (permalink / raw)
To: zhengguoyong; +Cc: john.fastabend, jakub, davem, kuba, pabeni, netdev, bpf

On Fri, Oct 10, 2025 at 1:17 AM zhengguoyong <zhenggy@chinatelecom.cn> wrote:
>
> Hi Eric,
>
> Thank you for your reply. Indeed, using bh_lock_sock_nested can lead to
> deadlock risks. I apologize for not noticing this earlier.
>
> Can I change bh_lock_sock_nested to lock_sock to resolve this deadlock
> issue? Or do you have any better suggestions on where to update the
> tp->rcv_nxt field in the process?
>

Please look at my original feedback. lock_sock() is not an option.
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: Paolo Abeni @ 2025-10-09 7:59 UTC (permalink / raw)
To: zhengguoyong, john.fastabend, jakub, davem, edumazet, kuba; +Cc: netdev, bpf

On 10/9/25 5:07 AM, zhengguoyong wrote:
> When using sockmap to forward TCP traffic to the application
> layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
> processing flow will synchronously update the tp->copied_seq field.
> This causes tp->rcv_nxt to become less than tp->copied_seq.
>
> Later, when this socket receives SKB packets from the protocol stack,
> in the call chain tcp_data_ready → tcp_epollin_ready, the function
> tcp_epollin_ready will return false, preventing the socket from being
> woken up to receive new packets.
>
> Therefore, it is necessary to synchronously update the tp->rcv_nxt
> information in sk_psock_skb_ingress.
>
> Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
> ---
>  net/core/skmsg.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9becadd..e9d841c 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	struct sock *sk = psock->sk;
>  	struct sk_msg *msg;
>  	int err;
> +	u32 seq;
>
>  	/* If we are receiving on the same sock skb->sk is already assigned,
>  	 * skip memory accounting and owner transition seeing it already set
> @@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	 */
>  	skb_set_owner_r(skb, sk);
>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
> -	if (err < 0)
> +	if (err < 0) {
>  		kfree(msg);
> +	} else {
> +		bh_lock_sock_nested(sk);

Apparently this is triggering a deadlock in our CI:

WARNING: inconsistent lock state
6.17.0-gb9bdadc5b6ca-dirty #8 Tainted: G           OE
--------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kworker/1:36/3777 [HC0[0]:SC0[0]:HE1:SE1] takes:
00000000b80163e8 (slock-AF_INET/1){+.?.}-{2:2}, at: sk_psock_backlog+0x656/0xf18
{IN-SOFTIRQ-W} state was registered at:
  __lock_acquire+0x4dc/0xd58
  lock_acquire.part.0+0x114/0x278
  lock_acquire+0x9c/0x160
  _raw_spin_lock_nested+0x58/0xa8
  tcp_v4_rcv+0x23a0/0x32a8
  ip_protocol_deliver_rcu+0x6c/0x418
  ip_local_deliver_finish+0x364/0x5d0
  ip_local_deliver+0x17a/0x3f8
  ip_rcv+0xd6/0x318
  __netif_receive_skb_one_core+0x11c/0x158
  process_backlog+0x58c/0x1618
  __napi_poll+0x86/0x488
  net_rx_action+0x482/0xb08
  handle_softirqs+0x3cc/0xc88
  do_softirq+0x1fc/0x248
  __local_bh_enable_ip+0x332/0x3a0
  __dev_queue_xmit+0x90a/0x1738
  neigh_resolve_output+0x4c4/0x848
  ip_finish_output2+0x728/0x1bc8
  ip_output+0x1ea/0x5d0
  __ip_queue_xmit+0x71a/0x1088
  __tcp_transmit_skb+0x118c/0x2470
  tcp_connect+0x10ca/0x18b8
  tcp_v4_connect+0x11cc/0x1788
  __inet_stream_connect+0x324/0xc00
  inet_stream_connect+0x70/0xb8
  __sys_connect+0xea/0x148
  __do_sys_socketcall+0x2b4/0x4c0
  __do_syscall+0x138/0x3e0
  system_call+0x6e/0x90
irq event stamp: 531707
hardirqs last  enabled at (531707): [<0008bdf65ac55ad2>] __local_bh_enable_ip+0x23a/0x3a0
hardirqs last disabled at (531705): [<0008bdf65ac55b6a>] __local_bh_enable_ip+0x2d2/0x3a0
softirqs last  enabled at (531706): [<0008bdf65c31f142>] sk_psock_skb_ingress_enqueue+0x2aa/0x468
softirqs last disabled at (531704): [<0008bdf65c31f112>] sk_psock_skb_ingress_enqueue+0x27a/0x468

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_INET/1);
  <Interrupt>
    lock(slock-AF_INET/1);

 *** DEADLOCK ***

3 locks held by kworker/1:36/3777:
 #0: 0000000080042158 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x766/0x15e0
 #1: 0008bd765cbffba8 ((work_completion)(&(&psock->work)->work)){+.+.}-{0:0}, at: process_one_work+0x794/0x15e0
 #2: 000000008eb583c8 (&psock->work_mutex){+.+.}-{3:3}, at: sk_psock_backlog+0x198/0xf18

stack backtrace:
CPU: 1 UID: 0 PID: 3777 Comm: kworker/1:36 Tainted: G           OE
6.17.0-gb9bdadc5b6ca-dirty #8 NONE
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: IBM 8561 LT1 400 (KVM/Linux)
Workqueue: events sk_psock_backlog
Call Trace:
 [<0008bdf65aaca9de>] dump_stack_lvl+0x106/0x168
 [<0008bdf65adb5990>] print_usage_bug.part.0+0x2e8/0x398
 [<0008bdf65adb616a>] mark_lock_irq+0x72a/0x9c0
 [<0008bdf65adb670e>] mark_lock+0x30e/0x7c0
 [<0008bdf65adb6f38>] mark_usage+0xc8/0x178
 [<0008bdf65adb819c>] __lock_acquire+0x4dc/0xd58
 [<0008bdf65adb8b2c>] lock_acquire.part.0+0x114/0x278
 [<0008bdf65adb8d2c>] lock_acquire+0x9c/0x160
 [<0008bdf65cbc6498>] _raw_spin_lock_nested+0x58/0xa8
 [<0008bdf65c323716>] sk_psock_backlog+0x656/0xf18
 [<0008bdf65ac9b236>] process_one_work+0x83e/0x15e0
 [<0008bdf65ac9c790>] worker_thread+0x7b8/0x1020
 [<0008bdf65acbb0f8>] kthread+0x3c0/0x6e8
 [<0008bdf65aad02f4>] __ret_from_fork+0xdc/0x800
 [<0008bdf65cbc7fd2>] ret_from_fork+0xa/0x30
INFO: lockdep is turned off.

More details at:
https://github.com/kernel-patches/bpf/actions/runs/18367014116/job/52322106520

/P