* [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: zhengguoyong @ 2025-10-09 3:07 UTC (permalink / raw)
To: john.fastabend, jakub, davem, edumazet, kuba, pabeni; +Cc: netdev, bpf

When using sockmap to forward TCP traffic to the application
layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
processing flow will synchronously update the tp->copied_seq field.
This causes tp->rcv_nxt to become less than tp->copied_seq.

Later, when this socket receives SKB packets from the protocol stack,
in the call chain tcp_data_ready → tcp_epollin_ready, the function
tcp_epollin_ready will return false, preventing the socket from being
woken up to receive new packets.

Therefore, it is necessary to synchronously update the tp->rcv_nxt
information in sk_psock_skb_ingress.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
---
 net/core/skmsg.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9becadd..e9d841c 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	struct sock *sk = psock->sk;
 	struct sk_msg *msg;
 	int err;
+	u32 seq;
 
 	/* If we are receiving on the same sock skb->sk is already assigned,
 	 * skip memory accounting and owner transition seeing it already set
@@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
 	 */
 	skb_set_owner_r(skb, sk);
 	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
-	if (err < 0)
+	if (err < 0) {
 		kfree(msg);
+	} else {
+		bh_lock_sock_nested(sk);
+		seq = READ_ONCE(tcp_sk(sk)->rcv_nxt) + len;
+		WRITE_ONCE(tcp_sk(sk)->rcv_nxt, seq);
+		bh_unlock_sock(sk);
+	}
+
 	return err;
 }
-- 
1.8.3.1
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: Eric Dumazet @ 2025-10-09 7:07 UTC (permalink / raw)
To: zhengguoyong; +Cc: john.fastabend, jakub, davem, kuba, pabeni, netdev, bpf

On Wed, Oct 8, 2025 at 8:07 PM zhengguoyong <zhenggy@chinatelecom.cn> wrote:
>
> When using sockmap to forward TCP traffic to the application
> layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
> processing flow will synchronously update the tp->copied_seq field.
> This causes tp->rcv_nxt to become less than tp->copied_seq.
>
> Later, when this socket receives SKB packets from the protocol stack,
> in the call chain tcp_data_ready → tcp_epollin_ready, the function
> tcp_epollin_ready will return false, preventing the socket from being
> woken up to receive new packets.
>
> Therefore, it is necessary to synchronously update the tp->rcv_nxt
> information in sk_psock_skb_ingress.
>
> Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>

Hi GuoYong Zheng

We request a Fixes: tag for patches claiming to fix a bug.
How would stable teams decide to backport a patch or not, and to which
versions, without having to fully understand this code ?

> ---
>  net/core/skmsg.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9becadd..e9d841c 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	struct sock *sk = psock->sk;
>  	struct sk_msg *msg;
>  	int err;
> +	u32 seq;
>
>  	/* If we are receiving on the same sock skb->sk is already assigned,
>  	 * skip memory accounting and owner transition seeing it already set
> @@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	 */
>  	skb_set_owner_r(skb, sk);
>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
> -	if (err < 0)
> +	if (err < 0) {
>  		kfree(msg);
> +	} else {
> +		bh_lock_sock_nested(sk);
> +		seq = READ_ONCE(tcp_sk(sk)->rcv_nxt) + len;
> +		WRITE_ONCE(tcp_sk(sk)->rcv_nxt, seq);

This does not look to be the right place.

Re-locking a socket _after_ the fundamental change took place is
fundamentally racy.

Also do we have a guarantee sk is always a TCP socket at this point ?

If yes, why do we have sk_is_tcp() check in sk_psock_init_strp() ?

> +		bh_unlock_sock(sk);
> +	}
> +
>  	return err;
>  }
>
> --
> 1.8.3.1
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: zhengguoyong @ 2025-10-10 8:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: john.fastabend, jakub, davem, kuba, pabeni, netdev, bpf

Hi Eric,

Thank you for your reply. Indeed, using bh_lock_sock_nested can lead to
deadlock risks. I apologize for not noticing this earlier.

Can I change bh_lock_sock_nested to lock_sock to resolve this deadlock
issue? Or do you have any better suggestions on where to update the
tp->rcv_nxt field in the process?

Looking forward to your response. Thank you.

From: Eric Dumazet
To: zhengguoyong
Cc: john.fastabend@gmail.com, jakub@cloudflare.com, davem@davemloft.net,
    kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org,
    bpf@vger.kernel.org
Subject: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
Date: 2025-10-09 15:07:44

> On Wed, Oct 8, 2025 at 8:07 PM zhengguoyong <zhenggy@chinatelecom.cn> wrote:
>>
>> When using sockmap to forward TCP traffic to the application
>> layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
>> processing flow will synchronously update the tp->copied_seq field.
>> This causes tp->rcv_nxt to become less than tp->copied_seq.
>>
>> Later, when this socket receives SKB packets from the protocol stack,
>> in the call chain tcp_data_ready → tcp_epollin_ready, the function
>> tcp_epollin_ready will return false, preventing the socket from being
>> woken up to receive new packets.
>>
>> Therefore, it is necessary to synchronously update the tp->rcv_nxt
>> information in sk_psock_skb_ingress.
>>
>> Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
>
> Hi GuoYong Zheng
>
> We request a Fixes: tag for patches claiming to fix a bug.
>
> How would stable teams decide to backport a patch or not, and to which
> versions, without having to fully understand this code ?
>
>> ---
>>  net/core/skmsg.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 9becadd..e9d841c 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>>  	struct sock *sk = psock->sk;
>>  	struct sk_msg *msg;
>>  	int err;
>> +	u32 seq;
>>
>>  	/* If we are receiving on the same sock skb->sk is already assigned,
>>  	 * skip memory accounting and owner transition seeing it already set
>> @@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>>  	 */
>>  	skb_set_owner_r(skb, sk);
>>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
>> -	if (err < 0)
>> +	if (err < 0) {
>>  		kfree(msg);
>> +	} else {
>> +		bh_lock_sock_nested(sk);
>> +		seq = READ_ONCE(tcp_sk(sk)->rcv_nxt) + len;
>> +		WRITE_ONCE(tcp_sk(sk)->rcv_nxt, seq);
>
> This does not look to be the right place.
>
> Re-locking a socket _after_ the fundamental change took place is
> fundamentally racy.
>
> Also do we have a guarantee sk is always a TCP socket at this point ?
>
> If yes, why do we have sk_is_tcp() check in sk_psock_init_strp() ?
>
>> +		bh_unlock_sock(sk);
>> +	}
>> +
>>  	return err;
>>  }
>>
>> --
>> 1.8.3.1
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: Eric Dumazet @ 2025-10-10 9:16 UTC (permalink / raw)
To: zhengguoyong; +Cc: john.fastabend, jakub, davem, kuba, pabeni, netdev, bpf

On Fri, Oct 10, 2025 at 1:17 AM zhengguoyong <zhenggy@chinatelecom.cn> wrote:
>
> Hi Eric,
>
> Thank you for your reply. Indeed, using bh_lock_sock_nested can lead to
> deadlock risks. I apologize for not noticing this earlier.
>
> Can I change bh_lock_sock_nested to lock_sock to resolve this deadlock
> issue? Or do you have any better suggestions on where to update the
> tp->rcv_nxt field in the process?
>

Please look at my original feedback. lock_sock() is not an option.
* Re: [PATCH] bpf, sockmap: Update tp->rcv_nxt in sk_psock_skb_ingress
From: Paolo Abeni @ 2025-10-09 7:59 UTC (permalink / raw)
To: zhengguoyong, john.fastabend, jakub, davem, edumazet, kuba; +Cc: netdev, bpf

On 10/9/25 5:07 AM, zhengguoyong wrote:
> When using sockmap to forward TCP traffic to the application
> layer of the peer socket, the peer socket's tcp_bpf_recvmsg_parser
> processing flow will synchronously update the tp->copied_seq field.
> This causes tp->rcv_nxt to become less than tp->copied_seq.
>
> Later, when this socket receives SKB packets from the protocol stack,
> in the call chain tcp_data_ready → tcp_epollin_ready, the function
> tcp_epollin_ready will return false, preventing the socket from being
> woken up to receive new packets.
>
> Therefore, it is necessary to synchronously update the tp->rcv_nxt
> information in sk_psock_skb_ingress.
>
> Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
> ---
>  net/core/skmsg.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9becadd..e9d841c 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -576,6 +576,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	struct sock *sk = psock->sk;
>  	struct sk_msg *msg;
>  	int err;
> +	u32 seq;
>
>  	/* If we are receiving on the same sock skb->sk is already assigned,
>  	 * skip memory accounting and owner transition seeing it already set
> @@ -595,8 +596,15 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
>  	 */
>  	skb_set_owner_r(skb, sk);
>  	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
> -	if (err < 0)
> +	if (err < 0) {
>  		kfree(msg);
> +	} else {
> +		bh_lock_sock_nested(sk);

Apparently this is triggering a deadlock in our CI:

WARNING: inconsistent lock state
6.17.0-gb9bdadc5b6ca-dirty #8 Tainted: G           OE
--------------------------------
inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kworker/1:36/3777 [HC0[0]:SC0[0]:HE1:SE1] takes:
00000000b80163e8 (slock-AF_INET/1){+.?.}-{2:2}, at: sk_psock_backlog+0x656/0xf18
{IN-SOFTIRQ-W} state was registered at:
  __lock_acquire+0x4dc/0xd58
  lock_acquire.part.0+0x114/0x278
  lock_acquire+0x9c/0x160
  _raw_spin_lock_nested+0x58/0xa8
  tcp_v4_rcv+0x23a0/0x32a8
  ip_protocol_deliver_rcu+0x6c/0x418
  ip_local_deliver_finish+0x364/0x5d0
  ip_local_deliver+0x17a/0x3f8
  ip_rcv+0xd6/0x318
  __netif_receive_skb_one_core+0x11c/0x158
  process_backlog+0x58c/0x1618
  __napi_poll+0x86/0x488
  net_rx_action+0x482/0xb08
  handle_softirqs+0x3cc/0xc88
  do_softirq+0x1fc/0x248
  __local_bh_enable_ip+0x332/0x3a0
  __dev_queue_xmit+0x90a/0x1738
  neigh_resolve_output+0x4c4/0x848
  ip_finish_output2+0x728/0x1bc8
  ip_output+0x1ea/0x5d0
  __ip_queue_xmit+0x71a/0x1088
  __tcp_transmit_skb+0x118c/0x2470
  tcp_connect+0x10ca/0x18b8
  tcp_v4_connect+0x11cc/0x1788
  __inet_stream_connect+0x324/0xc00
  inet_stream_connect+0x70/0xb8
  __sys_connect+0xea/0x148
  __do_sys_socketcall+0x2b4/0x4c0
  __do_syscall+0x138/0x3e0
  system_call+0x6e/0x90
irq event stamp: 531707
hardirqs last  enabled at (531707): [<0008bdf65ac55ad2>] __local_bh_enable_ip+0x23a/0x3a0
hardirqs last disabled at (531705): [<0008bdf65ac55b6a>] __local_bh_enable_ip+0x2d2/0x3a0
softirqs last  enabled at (531706): [<0008bdf65c31f142>] sk_psock_skb_ingress_enqueue+0x2aa/0x468
softirqs last disabled at (531704): [<0008bdf65c31f112>] sk_psock_skb_ingress_enqueue+0x27a/0x468

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_INET/1);
  <Interrupt>
    lock(slock-AF_INET/1);

 *** DEADLOCK ***

3 locks held by kworker/1:36/3777:
 #0: 0000000080042158 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x766/0x15e0
 #1: 0008bd765cbffba8 ((work_completion)(&(&psock->work)->work)){+.+.}-{0:0}, at: process_one_work+0x794/0x15e0
 #2: 000000008eb583c8 (&psock->work_mutex){+.+.}-{3:3}, at: sk_psock_backlog+0x198/0xf18

stack backtrace:
CPU: 1 UID: 0 PID: 3777 Comm: kworker/1:36 Tainted: G           OE
6.17.0-gb9bdadc5b6ca-dirty #8 NONE
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: IBM 8561 LT1 400 (KVM/Linux)
Workqueue: events sk_psock_backlog
Call Trace:
 [<0008bdf65aaca9de>] dump_stack_lvl+0x106/0x168
 [<0008bdf65adb5990>] print_usage_bug.part.0+0x2e8/0x398
 [<0008bdf65adb616a>] mark_lock_irq+0x72a/0x9c0
 [<0008bdf65adb670e>] mark_lock+0x30e/0x7c0
 [<0008bdf65adb6f38>] mark_usage+0xc8/0x178
 [<0008bdf65adb819c>] __lock_acquire+0x4dc/0xd58
 [<0008bdf65adb8b2c>] lock_acquire.part.0+0x114/0x278
 [<0008bdf65adb8d2c>] lock_acquire+0x9c/0x160
 [<0008bdf65cbc6498>] _raw_spin_lock_nested+0x58/0xa8
 [<0008bdf65c323716>] sk_psock_backlog+0x656/0xf18
 [<0008bdf65ac9b236>] process_one_work+0x83e/0x15e0
 [<0008bdf65ac9c790>] worker_thread+0x7b8/0x1020
 [<0008bdf65acbb0f8>] kthread+0x3c0/0x6e8
 [<0008bdf65aad02f4>] __ret_from_fork+0xdc/0x800
 [<0008bdf65cbc7fd2>] ret_from_fork+0xa/0x30
INFO: lockdep is turned off.

More details at:
https://github.com/kernel-patches/bpf/actions/runs/18367014116/job/52322106520

/P