netdev.vger.kernel.org archive mirror
* [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
@ 2024-08-11  1:29 syzbot
  2024-08-11  2:29 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: syzbot @ 2024-08-11  1:29 UTC (permalink / raw)
  To: davem, dsahern, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com

------------[ cut here ]------------
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Modules linked in:
CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Code: 8b e8 37 85 cf fc 90 0f 0b 90 90 e9 c3 fe ff ff e8 68 34 0d fd c6 05 0d 81 4c 0b 01 90 48 c7 c7 20 2b 8f 8b e8 14 85 cf fc 90 <0f> 0b 90 90 e9 a0 fe ff ff 48 89 ef e8 e2 e8 68 fd e9 44 fe ff ff
RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __refcount_dec include/linux/refcount.h:336 [inline]
 refcount_dec include/linux/refcount.h:351 [inline]
 inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
 tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
 setup_net+0x714/0xb40 net/core/net_namespace.c:375
 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
 ksys_unshare+0x419/0x970 kernel/fork.c:3323
 __do_sys_unshare kernel/fork.c:3394 [inline]
 __se_sys_unshare kernel/fork.c:3392 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f56d7c7cee9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f56d897c0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f56d7dac1f0 RCX: 00007f56d7c7cee9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000042000000
RBP: 00007f56d7cc949e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f56d7dac1f0 R15: 00007ffe66454be8
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  1:29 [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill syzbot
@ 2024-08-11  2:29 ` Kuniyuki Iwashima
  2024-08-11  5:42   ` Jason Xing
                     ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11  2:29 UTC (permalink / raw)
  To: syzbot+8ea26396ff85d23a8929
  Cc: davem, dsahern, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs

From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
Date: Sat, 10 Aug 2024 18:29:20 -0700
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> 
> ------------[ cut here ]------------
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31

Eric, this is the weird report I was talking about at netdevconf :)

It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
than refcount_inc().

I started to see the same splat at a very low rate after consuming
commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").

The commit deferred refcount_inc(tw_refcount) a bit, until after the hash
dance, so the twsk is now visible before tw_dr->tw_refcount is incremented.

I came up with the diff below but was suspecting a bug in another place,
possibly QEMU, so I haven't posted the diff officially.

refcount_inc() was actually deferred, but it's still under an ehash lock,
and inet_twsk_deschedule_put() must be serialised with the same ehash
lock.  Even inet_twsk_kill() performs the ehash lock dance before calling
refcount_dec().

So it should be impossible for refcount_inc() to not be visible after the
double lock/unlock and before refcount_dec(); this report looks bogus to me :S

---8<---
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 337390ba85b4..c3b2f0426e01 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -115,6 +115,8 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
 	struct inet_bind_hashbucket *bhead, *bhead2;
 
+	refcount_inc(&tw->tw_dr->tw_refcount);
+
 	/* Step 1: Put TW into bind hash. Original socket stays there too.
 	   Note, that any socket with inet->num != 0 MUST be bound in
 	   binding cache, even if it is closed.
@@ -301,7 +303,6 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
 		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
 						     LINUX_MIB_TIMEWAITED);
 		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
-		refcount_inc(&tw->tw_dr->tw_refcount);
 	} else {
 		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
 	}
---8<---


> Modules linked in:
> CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> Code: 8b e8 37 85 cf fc 90 0f 0b 90 90 e9 c3 fe ff ff e8 68 34 0d fd c6 05 0d 81 4c 0b 01 90 48 c7 c7 20 2b 8f 8b e8 14 85 cf fc 90 <0f> 0b 90 90 e9 a0 fe ff ff 48 89 ef e8 e2 e8 68 fd e9 44 fe ff ff
> RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> Call Trace:
>  <TASK>
>  __refcount_dec include/linux/refcount.h:336 [inline]
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
>  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
>  ksys_unshare+0x419/0x970 kernel/fork.c:3323
>  __do_sys_unshare kernel/fork.c:3394 [inline]
>  __se_sys_unshare kernel/fork.c:3392 [inline]
>  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f56d7c7cee9
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f56d897c0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
> RAX: ffffffffffffffda RBX: 00007f56d7dac1f0 RCX: 00007f56d7c7cee9
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000042000000
> RBP: 00007f56d7cc949e R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 000000000000006e R14: 00007f56d7dac1f0 R15: 00007ffe66454be8
>  </TASK>
> 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  2:29 ` Kuniyuki Iwashima
@ 2024-08-11  5:42   ` Jason Xing
  2024-08-11 13:24   ` Florian Westphal
  2024-08-11 13:32   ` [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill Florian Westphal
  2 siblings, 0 replies; 24+ messages in thread
From: Jason Xing @ 2024-08-11  5:42 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs

Hello Kuniyuki,

On Sun, Aug 11, 2024 at 10:35 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> >
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
>
> Eric, this is the weird report I was talking about at netdevconf :)
>
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
>
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
>
> The commit deferred refcount_inc(tw_refcount) a bit, until after the hash
> dance, so the twsk is now visible before tw_dr->tw_refcount is incremented.
>
> I came up with the diff below but was suspecting a bug in another place,
> possibly QEMU, so I haven't posted the diff officially.
>
> refcount_inc() was actually deferred, but it's still under an ehash lock,
> and inet_twsk_deschedule_put() must be serialised with the same ehash
> lock.  Even inet_twsk_kill() performs the ehash lock dance before calling
> refcount_dec().
>
> So it should be impossible for refcount_inc() to not be visible after the
> double lock/unlock and before refcount_dec(); this report looks bogus to me :S

In normal cases, I agree with this, since inc/dec are all protected
under the spin lock. There is no way we can decrement it if we don't
increment it first.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  2:29 ` Kuniyuki Iwashima
  2024-08-11  5:42   ` Jason Xing
@ 2024-08-11 13:24   ` Florian Westphal
  2024-08-11 14:54     ` Florian Westphal
  2024-08-11 13:32   ` [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill Florian Westphal
  2 siblings, 1 reply; 24+ messages in thread
From: Florian Westphal @ 2024-08-11 13:24 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> > 
> > Unfortunately, I don't have any reproducer for this issue yet.
> > 
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> > 
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> 
> Eric, this is the weird report I was talking about at netdevconf :)
> 
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
> 
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").

I think I see why.

The reported splat is from a kernel without the above commit.
But from the backtrace, we entered here:

                if (net->ipv4.tcp_death_row.hashinfo->pernet) {
                        /* Even if tw_refcount == 1, we must clean up kernel reqsk */
                        inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
                } else if (!purged_once) {
-------------->         inet_twsk_purge(&tcp_hashinfo);  // THIS
                        purged_once = true;

> The commit deferred refcount_inc(tw_refcount) a bit, until after the hash
> dance, so the twsk is now visible before tw_dr->tw_refcount is incremented.
> 
> I came up with the diff below but was suspecting a bug in another place,
> possibly QEMU, so I haven't posted the diff officially.
> 
> refcount_inc() was actually deferred, but it's still under an ehash lock,

but different struct inet_hashinfo, so the locks don't help :/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  2:29 ` Kuniyuki Iwashima
  2024-08-11  5:42   ` Jason Xing
  2024-08-11 13:24   ` Florian Westphal
@ 2024-08-11 13:32   ` Florian Westphal
  2024-08-11 22:35     ` Kuniyuki Iwashima
  2 siblings, 1 reply; 24+ messages in thread
From: Florian Westphal @ 2024-08-11 13:32 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> > 
> > Unfortunately, I don't have any reproducer for this issue yet.
> > 
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> > 
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> 
> Eric, this is the weird report I was talking about at netdevconf :)
> 
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
> 
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").

Could you share a splat you saw?

Was it also for the fallback tcp_hashinfo and/or from error
unwinding after an ops->init() failure on netns creation?

I don't think it makes sense to chase this specific report given
b334b924c9b7 made changes in this area; a post-b334b924c9b7 splat
would be more relevant, I think.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 13:24   ` Florian Westphal
@ 2024-08-11 14:54     ` Florian Westphal
  2024-08-11 16:28       ` Florian Westphal
  0 siblings, 1 reply; 24+ messages in thread
From: Florian Westphal @ 2024-08-11 14:54 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, syzbot+8ea26396ff85d23a8929, davem, dsahern,
	edumazet, kuba, linux-kernel, netdev, pabeni, syzkaller-bugs

Florian Westphal <fw@strlen.de> wrote:
> > I came up with the diff below but was suspecting a bug in another place,
> > possibly QEMU, so I haven't posted the diff officially.
> > 
> > refcount_inc() was actually deferred, but it's still under an ehash lock,
> 
> but different struct inet_hashinfo, so the locks don't help :/

No, the fallback is fine: pernet tw_refcount, init_net ehash lock array,
so the same buckets should serialize on the same ehash lock.
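
Roughly (a sketch from memory of the fallback setup, not verbatim kernel
code): without a pernet ehash, the netns just points at the global table,
so both purge paths derive the bucket lock from the same lock array:

	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row.hashinfo;

	/* hinfo == &tcp_hashinfo unless a pernet ehash was allocated,
	 * so inet_ehash_lockp(hinfo, sk->sk_hash) and
	 * inet_ehash_lockp(&tcp_hashinfo, sk->sk_hash) return the
	 * same per-bucket spinlock.
	 */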

https://syzkaller.appspot.com/x/log.txt?x=117f3182980000

... shows two cores racing:

[ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
and
[ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).


first splat backtrace shows invocation of tcp_sk_exit_batch() from
netns error unwinding code.

Second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
likely walking init_net tcp_hashinfo.

The warn of second core is:
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

Looks like somehow netns cleanup work queue skipped at least one tw sk,
hitting above splat.

Then, the first core did refcount_dec() on tw_refcount, which produces
the dec-to-0 warn (which makes sense if the "supposedly final" decrement
was already done).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 14:54     ` Florian Westphal
@ 2024-08-11 16:28       ` Florian Westphal
  2024-08-11 23:00         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: Florian Westphal @ 2024-08-11 16:28 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, syzbot+8ea26396ff85d23a8929, davem, dsahern,
	edumazet, kuba, linux-kernel, netdev, pabeni, syzkaller-bugs

Florian Westphal <fw@strlen.de> wrote:
> https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> 
> ... shows two cores racing:
> 
> [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> and
> [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> 
> 
> first splat backtrace shows invocation of tcp_sk_exit_batch() from
> netns error unwinding code.
> 
> Second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),

... which doesn't work.  Does this look like a plausible
theory/explanation?

Given:
1 exiting netns, has >= 1 tw sk.
1 (unrelated) netns that failed in setup_net

... we run into following race:

exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
inet_twsk_purge(&tcp_hashinfo).

At same time, from error unwinding code, we also call tcp_sk_exit_batch().

Both threads walk tcp_hashinfo ehash buckets.

From work queue (normal netns exit path), we hit

303                         if (state == TCP_TIME_WAIT) {
304                                 inet_twsk_deschedule_put(inet_twsk(sk));

Because both threads operate on tcp_hashinfo, the unrelated
struct net (exiting net) is also visible to error-unwinding thread.

So, error unwinding code will call

303                         if (state == TCP_TIME_WAIT) {
304                                 inet_twsk_deschedule_put(inet_twsk(sk));

for the same tw sk and both threads do

218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
219 {
220         if (del_timer_sync(&tw->tw_timer))
221                 inet_twsk_kill(tw);

The error unwind path cancels the timer and calls inet_twsk_kill(), while
the work queue sees the timer as already shut down, so it ends up
returning to tcp_sk_exit_batch(), where it will WARN here:

  WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

... because the supposedly-last tw_refcount decrement did not drop
it down to 0.

Meanwhile, the error unwinding thread calls refcount_dec() on
tw_refcount, which now drops down to 0 instead of 1, which
produces another warn splat.
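
The suspected interleaving, spelled out (hypothetical timing; the thread
assignment matches the two splats above, the rest is invented for
illustration, assuming a single twsk so tw_refcount starts at 2):

  cleanup_net worker (exiting netns)    unshare() error unwind
  ----------------------------------    ------------------------------
  tcp_sk_exit_batch()
    inet_twsk_purge(&tcp_hashinfo)
      finds tw of the exiting netns
                                        tcp_sk_exit_batch()
                                          inet_twsk_purge(&tcp_hashinfo)
                                            finds the same tw
                                            inet_twsk_deschedule_put(tw)
                                              del_timer_sync() -> 1
      inet_twsk_deschedule_put(tw)
        del_timer_sync() -> 0
        (timer already gone, skips
         inet_twsk_kill())
    refcount_dec_and_test(): 2 -> 1
      -> WARN_ON_ONCE splat             inet_twsk_kill(tw)
                                          refcount_dec(): 1 -> 0
                                          -> "decrement hit 0" splat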

I'll ponder on ways to fix this tomorrow unless someone
else already has better theory/solution.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 13:32   ` [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill Florian Westphal
@ 2024-08-11 22:35     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 22:35 UTC (permalink / raw)
  To: fw
  Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
	pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Florian Westphal <fw@strlen.de>
Date: Sun, 11 Aug 2024 15:32:00 +0200
> > > ------------[ cut here ]------------
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > 
> > Eric, this is the weird report I was talking about at netdevconf :)
> > 
> > It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> > than refcount_inc().
> > 
> > I started to see the same splat at a very low rate after consuming
> > commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
> 
> Could you share a splat you saw?
> 
> Was it also for the fallback tcp_hashinfo and/or from error
> unwinding after an ops->init() failure on netns creation?
> 
> I don't think it makes sense to chase this specific report given
> b334b924c9b7 made changes in this area; a post-b334b924c9b7 splat
> would be more relevant, I think.

Ah, I missed that the initial syzbot report was created before the commit.

Here are the splats I saw:

[ Note: both splats were mixed in the log, as in syzbot's report,
  so two threads were calling tcp_sk_exit_batch() concurrently. ]

---8<---
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 1760 at lib/refcount.c:31 refcount_warn_saturate (lib/refcount.c:31 (discriminator 3))
Modules linked in:
CPU: 0 PID: 1760 Comm: syz-executor.4 Not tainted 6.10.0-rc3-00688-g934c29999b57-dirty #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate (lib/refcount.c:31 (discriminator 3))
Code: 05 cd 8d aa 04 01 e8 48 e4 e9 fe 0f 0b e9 d3 fe ff ff e8 1c 36 1c ff 48 c7 c7 a0 4a 80 85 c6 05 aa 8d aa 04 01 e8 29 e4 e9 fe <0f> 0b e9 b4 fe ff ff 48 89 ef e8 fa 56 69 ff e9 5c fe ff ff 0f 1f
RSP: 0018:ffffc9000118fa78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000b13b000
RDX: 0000000000040000 RSI: ffffffff8118dedf RDI: 0000000000000001
RBP: ffff88800bb59440 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 000000002d2d2d2d R12: ffff88800bb59440
R13: ffff88810130f3d0 R14: ffff8881013ee820 R15: ffff888102001198
FS:  00007f5252dbd640(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004a6128 CR3: 00000001038b8003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 inet_twsk_kill (./include/linux/refcount.h:336 ./include/linux/refcount.h:351 net/ipv4/inet_timewait_sock.c:70)
 inet_twsk_deschedule_put (net/ipv4/inet_timewait_sock.c:266)
 inet_twsk_purge (net/ipv4/inet_timewait_sock.c:349)
 tcp_twsk_purge (net/ipv4/tcp_minisocks.c:402)
 tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3512)
 ops_exit_list (net/core/net_namespace.c:179)
 setup_net (net/core/net_namespace.c:374 (discriminator 3))
 copy_net_ns (net/core/net_namespace.c:510)
 create_new_namespaces (kernel/nsproxy.c:110)
 unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
 ksys_unshare (kernel/fork.c:3325)
 __x64_sys_unshare (kernel/fork.c:3392)
 do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f5253a29e5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007f5252dbcc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00000000004d4070 RCX: 00007f5253a29e5d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000
RBP: 00000000004d4070 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f5253a8a530 R15: 0000000000000000
 </TASK>

WARNING: CPU: 1 PID: 671 at net/ipv4/tcp_ipv4.c:3514 tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3514)
Modules linked in:
CPU: 1 PID: 671 Comm: kworker/u8:15 Not tainted 6.10.0-rc3-00688-g934c29999b57-dirty #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: netns cleanup_net
RIP: 0010:tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3514)
Code: 89 ee e8 9f 1b 28 fd 85 ed 0f 88 b3 00 00 00 e8 92 22 28 fd 31 ff 89 ee e8 89 1b 28 fd 85 ed 0f 84 9d 00 00 00 e8 7c 22 28 fd <0f> 0b e8 75 22 28 fd 48 89 df e8 ed 23 02 00 48 8d 7b 28 48 89 f8
RSP: 0018:ffffc90000ebfc60 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88800bb59140 RCX: ffffffff84230cb7
RDX: ffff888103871180 RSI: ffffffff84230cc4 RDI: 0000000000000005
RBP: 0000000000000002 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000002 R11: 00000000000002bd R12: ffff88800bb59440
R13: dffffc0000000000 R14: ffffc90000ebfd08 R15: fffffbfff0d7fb5c
FS:  0000000000000000(0000) GS:ffff88811af00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f274114ddf8 CR3: 000000010a11c006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ops_exit_list (net/core/net_namespace.c:179)
 cleanup_net (net/core/net_namespace.c:639 (discriminator 3))
 process_one_work (kernel/workqueue.c:3236)
 worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393)
 kthread (kernel/kthread.c:389)
 ret_from_fork (arch/x86/kernel/process.c:153)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
 </TASK>
---8<---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 16:28       ` Florian Westphal
@ 2024-08-11 23:00         ` Kuniyuki Iwashima
  2024-08-11 23:08           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 23:00 UTC (permalink / raw)
  To: fw
  Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
	pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Florian Westphal <fw@strlen.de>
Date: Sun, 11 Aug 2024 18:28:50 +0200
> Florian Westphal <fw@strlen.de> wrote:
> > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > 
> > ... shows two cores racing:
> > 
> > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > and
> > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > 
> > 
> > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > netns error unwinding code.
> > 
> > Second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
> 
> ... which doesn't work.  Does this look like a plausible
> theory/explanation?

Yes!  The problem here is that inet_twsk_purge() operates on twsk
not in net_exit_list, but I think such a check is overkill and we
can work around it in another way.
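
For context, the selection logic in inet_twsk_purge() (paraphrased from
memory of v6.9's net/ipv4/inet_timewait_sock.c, not a verbatim quote)
keys on "the owning netns is dead", not on membership in the exiting
batch:

	sk_nulls_for_each_rcu(sk, node, &head->chain) {
		int state = inet_sk_state_load(sk);

		if ((1 << state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV))
			continue;

		/* skip sockets whose netns is still alive ... */
		if (refcount_read(&sock_net(sk)->ns.count))
			continue;

		/* ... but a twsk of ANY dead netns is purged here, even
		 * one that belongs to a different exiting batch.
		 */
		...
	}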


> 
> Given:
> 1 exiting netns, has >= 1 tw sk.
> 1 (unrelated) netns that failed in setup_net
> 
> ... we run into following race:
> 
> exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> inet_twsk_purge(&tcp_hashinfo).
> 
> At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> 
> Both threads walk tcp_hashinfo ehash buckets.
> 
> From work queue (normal netns exit path), we hit
> 
> 303                         if (state == TCP_TIME_WAIT) {
> 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> 
> Because both threads operate on tcp_hashinfo, the unrelated
> struct net (exiting net) is also visible to error-unwinding thread.
> 
> So, error unwinding code will call
> 
> 303                         if (state == TCP_TIME_WAIT) {
> 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> 
> for the same tw sk and both threads do
> 
> 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> 219 {
> 220         if (del_timer_sync(&tw->tw_timer))
> 221                 inet_twsk_kill(tw);
> 
> Error unwind path cancel timer, calls inet_twsk_kill, while
> work queue sees timer as already shut-down so it ends up
> returning to tcp_sk_exit_batch(), where it will WARN here:
> 
>   WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> ... because the supposedly-last tw_refcount decrement did not drop
> it down to 0.
> 
> Meanwhile, error unwiding thread calls refcount_dec() on
> tw_refcount, which now drops down to 0 instead of 1, which
> provides another warn splat.
> 
> I'll ponder on ways to fix this tomorrow unless someone
> else already has better theory/solution.

We need to sync the two inet_twsk_kill() calls, so maybe give up in one
of them if the twsk is not hashed?

---8<---
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 337390ba85b4..51889567274b 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	struct inet_bind_hashbucket *bhead, *bhead2;
 
 	spin_lock(lock);
-	sk_nulls_del_node_init_rcu((struct sock *)tw);
+	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
+		spin_unlock(lock);
+		return false;
+	}
 	spin_unlock(lock);
 
 	/* Disassociate with bind bucket. */
---8<---

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 23:00         ` Kuniyuki Iwashima
@ 2024-08-11 23:08           ` Kuniyuki Iwashima
  2024-08-12  0:36             ` Jason Xing
  2024-08-12 14:01             ` Florian Westphal
  0 siblings, 2 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 23:08 UTC (permalink / raw)
  To: kuniyu
  Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Sun, 11 Aug 2024 16:00:29 -0700
> From: Florian Westphal <fw@strlen.de>
> Date: Sun, 11 Aug 2024 18:28:50 +0200
> > Florian Westphal <fw@strlen.de> wrote:
> > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > 
> > > ... shows two cores racing:
> > > 
> > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > and
> > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > 
> > > 
> > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > netns error unwinding code.
> > > 
> > > Second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
> > 
> > ... which doesn't work.  Does this look like a plausible
> > theory/explanation?
> 
> Yes!  The problem here is that inet_twsk_purge() operates on twsk
> not in net_exit_list, but I think such a check is overkill and we
> can work around it in another way.
> 
> 
> > 
> > Given:
> > 1 exiting netns, has >= 1 tw sk.
> > 1 (unrelated) netns that failed in setup_net
> > 
> > ... we run into following race:
> > 
> > exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> > inet_twsk_purge(&tcp_hashinfo).
> > 
> > At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> > 
> > Both threads walk tcp_hashinfo ehash buckets.
> > 
> > From work queue (normal netns exit path), we hit
> > 
> > 303                         if (state == TCP_TIME_WAIT) {
> > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > 
> > Because both threads operate on tcp_hashinfo, the unrelated
> > struct net (exiting net) is also visible to error-unwinding thread.
> > 
> > So, error unwinding code will call
> > 
> > 303                         if (state == TCP_TIME_WAIT) {
> > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > 
> > for the same tw sk and both threads do
> > 
> > 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > 219 {
> > 220         if (del_timer_sync(&tw->tw_timer))
> > 221                 inet_twsk_kill(tw);
> > 
> > The error unwind path cancels the timer and calls inet_twsk_kill(), while
> > the work queue sees the timer as already shut down, so it ends up
> > returning to tcp_sk_exit_batch(), where it will WARN here:
> > 
> >   WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> > 
> > ... because the supposedly-last tw_refcount decrement did not drop
> > it down to 0.
> > 
> > Meanwhile, the error unwinding thread calls refcount_dec() on
> > tw_refcount, which now drops down to 0 instead of 1, which
> > produces another warn splat.
> > 
> > I'll ponder on ways to fix this tomorrow unless someone
> > else already has better theory/solution.
> 
> We need to sync the two inet_twsk_kill() calls, so maybe give up in one
> of them if the twsk is not hashed?
> 
> ---8<---
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 337390ba85b4..51889567274b 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
>  	struct inet_bind_hashbucket *bhead, *bhead2;
>  
>  	spin_lock(lock);
> -	sk_nulls_del_node_init_rcu((struct sock *)tw);
> +	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> +		spin_unlock(lock);
> +		return false;

forgot to remove false, just return :)


> +	}
>  	spin_unlock(lock);
>  
>  	/* Disassociate with bind bucket. */
> ---8<---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 23:08           ` Kuniyuki Iwashima
@ 2024-08-12  0:36             ` Jason Xing
  2024-08-12 14:01             ` Florian Westphal
  1 sibling, 0 replies; 24+ messages in thread
From: Jason Xing @ 2024-08-12  0:36 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929, syzkaller-bugs

On Mon, Aug 12, 2024 at 7:09 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Sun, 11 Aug 2024 16:00:29 -0700
> > From: Florian Westphal <fw@strlen.de>
> > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > Florian Westphal <fw@strlen.de> wrote:
> > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > >
> > > > ... shows two cores racing:
> > > >
> > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > and
> > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > >
> > > >
> > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > netns error unwinding code.
> > > >
> > > > Second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
> > >
> > > ... which doesn't work.  Does this look like a plausible
> > > theory/explanation?
> >
> > Yes!  The problem here is that inet_twsk_purge() operates on twsk
> > not in net_exit_list, but I think such a check is overkill and we
> > can work around it in another way.
> >
> >
> > >
> > > Given:
> > > 1 exiting netns, has >= 1 tw sk.
> > > 1 (unrelated) netns that failed in setup_net
> > >
> > > ... we run into following race:
> > >
> > > exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> > > inet_twsk_purge(&tcp_hashinfo).
> > >
> > > At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> > >
> > > Both threads walk tcp_hashinfo ehash buckets.
> > >
> > > From work queue (normal netns exit path), we hit
> > >
> > > 303                         if (state == TCP_TIME_WAIT) {
> > > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > >
> > > Because both threads operate on tcp_hashinfo, the unrelated
> > > struct net (exiting net) is also visible to error-unwinding thread.
> > >
> > > So, error unwinding code will call
> > >
> > > 303                         if (state == TCP_TIME_WAIT) {
> > > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > >
> > > for the same tw sk and both threads do
> > >
> > > 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > > 219 {
> > > 220         if (del_timer_sync(&tw->tw_timer))
> > > 221                 inet_twsk_kill(tw);
> > >
> > > The error unwind path cancels the timer and calls inet_twsk_kill(), while
> > > the work queue sees the timer as already shut down, so it ends up
> > > returning to tcp_sk_exit_batch(), where it will WARN here:
> > >
> > >   WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> > >
> > > ... because the supposedly-last tw_refcount decrement did not drop
> > > it down to 0.
> > >
> > > Meanwhile, the error unwinding thread calls refcount_dec() on
> > > tw_refcount, which now drops down to 0 instead of 1, which
> > > produces another warn splat.

Right, last night I should have thought of it. I noticed that two
'killers' may call inet_twsk_kill() concurrently (at that time, I was
paying attention to tcp_abort(), which can cause such an issue), but I
let it go too easily :(

Thanks for the detailed explanation:)

> > >
> > > I'll ponder on ways to fix this tomorrow unless someone
> > > else already has better theory/solution.
> >
> > We need to sync two inet_twsk_kill(), so maybe give up one
> > if twsk is not hashed ?
> >
> > ---8<---
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 337390ba85b4..51889567274b 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> >       struct inet_bind_hashbucket *bhead, *bhead2;
> >
> >       spin_lock(lock);
> > -     sk_nulls_del_node_init_rcu((struct sock *)tw);
> > +     if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > +             spin_unlock(lock);
> > +             return false;
>
> forgot to remove false, just return :)

It does make sense to me!

Thanks,
Jason

>
>
> > +     }
> >       spin_unlock(lock);
> >
> >       /* Disassociate with bind bucket. */
> > ---8<---
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 23:08           ` Kuniyuki Iwashima
  2024-08-12  0:36             ` Jason Xing
@ 2024-08-12 14:01             ` Florian Westphal
  2024-08-12 14:30               ` Jason Xing
  2024-08-12 20:00               ` Kuniyuki Iwashima
  1 sibling, 2 replies; 24+ messages in thread
From: Florian Westphal @ 2024-08-12 14:01 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929, syzkaller-bugs

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Sun, 11 Aug 2024 16:00:29 -0700
> > From: Florian Westphal <fw@strlen.de>
> > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > Florian Westphal <fw@strlen.de> wrote:
> > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > 
> > > > ... shows two cores racing:
> > > > 
> > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > and
> > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > 
> > > > 
> > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > netns error unwinding code.
> > > > 
> > > > Second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
> > > 
> > > ... which doesn't work.  Does this look like a plausible
> > > theory/explanation?
> > 
> > Yes!  The problem here is that inet_twsk_purge() operates on twsk
> > not in net_exit_list, but I think such a check is overkill and we
> > can work around it in another way.

I'm not so sure.  Once 'other' inet_twsk_purge() found the twsk and
called inet_twsk_kill(), 'our' task has to wait for that to complete.

We need to force proper ordering so that all twsk found

static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
        struct net *net;

/*HERE*/tcp_twsk_purge(net_exit_list);

        list_for_each_entry(net, net_exit_list, exit_list) {
                inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);

.... have gone through inet_twsk_kill() so tw_refcount managed to
drop back to 1 before doing
                WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
.

> > We need to sync the two inet_twsk_kill() calls, so maybe give up in one
> > of them if the twsk is not hashed?

Not sure, afaiu only one thread enters inet_twsk_kill()
(the one that manages to deactivate the timer).

> > ---8<---
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 337390ba85b4..51889567274b 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> >  	struct inet_bind_hashbucket *bhead, *bhead2;
> >  
> >  	spin_lock(lock);
> > -	sk_nulls_del_node_init_rcu((struct sock *)tw);
> > +	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > +		spin_unlock(lock);
> > +		return false;
> 
> forgot to remove false, just return :)

I don't see how this helps; we need to wait until the 'stolen' twsk
has gone through inet_twsk_kill() and decremented tw_refcount.
Obviously it would be a bit simpler if we had a reliable reproducer :-)

Possible solutions I came up with so far:

1) revert b099ce2602d8 ("net: Batch inet_twsk_purge").

This commit replaced a net_eq(twsk_net(tw), ...) check with a check for
a dead netns (ns.count == 0).

Downside: We need to remove the purged_once trick that calls
inet_twsk_purge(&tcp_hashinfo) only once per exiting batch in
tcp_twsk_purge() as well.

As per b099ce2602d8 changelog, likely increases netns dismantle times.

Upside: simpler code, so this is my preferred solution.

No concurrent runoff anymore: by the time tcp_twsk_purge() returns, it has
called refcount_dec(->tw_refcount) for every twsk in the exiting netns
list, without another task stealing twsks owned by an exiting netns.
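
I.e., the revert would bring back a per-netns ownership test of the form
(paraphrased; not the exact pre-b099ce2602d8 code):

	/* only purge twsk owned by the netns we are dismantling */
	if (!net_eq(twsk_net(tw), net))
		continue;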

Solution 2: change tcp_sk_exit_batch like this:

   tcp_twsk_purge(net_exit_list);

+  list_for_each_entry(net, net_exit_list, exit_list) {
+      while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
+         schedule();
+
+  }

    list_for_each_entry(net, net_exit_list, exit_list) {
       inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
       WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

This synchronizes two concurrent tcp_sk_exit_batch() calls via the
existing refcount; if netns setup error unwinding ran off with one of
'our' twsks, we wait until the other task has completed the refcount
decrement.

I don't expect it to increase netns dismantle times, else we'd have seen
the WARN_ON_ONCE splat frequently.

Solution 3:

Similar to 2), but via mutex_lock/unlock pair:

static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
        struct net *net;

	mutex_lock(&tcp_exit_batch_mutex);

        tcp_twsk_purge(net_exit_list);

        list_for_each_entry(net, net_exit_list, exit_list) {
                inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
                WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
                tcp_fastopen_ctx_destroy(net);
        }
	mutex_unlock(&tcp_exit_batch_mutex);
}

Solution 4:

I have doubts wrt. the tcp_twsk_purge() interaction with the tw timer
firing at the 'wrong' time.  This is an independent "problem"; I might be
imagining things here.

Consider:
313 void inet_twsk_purge(struct inet_hashinfo *hashinfo)
314 {
[..]
321         for (slot = 0; slot <= ehash_mask; slot++, head++) {

tw sk timer fires on other cpu, inet_twsk_kill() does:

56         spin_lock(lock);
57         sk_nulls_del_node_init_rcu((struct sock *)tw);
58         spin_unlock(lock);

... then other cpu gets preempted.
inet_twsk_purge() resumes and hits empty chain head:

322                 if (hlist_nulls_empty(&head->chain))
323                         continue;

so we don't(can't) wait for the timer to run to completion.

If this sounds plausible to you, this gives us solution 4:

Restart the inet_twsk_purge() loop until tw_dr->tw_refcount has
dropped down to 1.

Alternatively (still assuming the above race is real), sk_nulls_del_node_init_rcu
needs to be moved down:

 48 static void inet_twsk_kill(struct inet_timewait_sock *tw)
...
 58     /* Disassociate with bind bucket. */
...
 68     spin_unlock(&bhead->lock);

 70     refcount_dec(&tw->tw_dr->tw_refcount);

 +      spin_lock(lock);
 +      sk_nulls_del_node_init_rcu((struct sock *)tw);
 +      spin_unlock(lock);
71      inet_twsk_put(tw);
72 }

... so a concurrent purge() call will find us on
the node list (and then wait on timer_shutdown_sync()
until the other cpu executing the timer is done).

If the twsk was already unlinked from the table before
inet_twsk_purge() had a chance to find it, then in the worst case the
call to tcp_twsk_destructor() is missing, but I don't see any ordering
requirements that need us to wait for this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 14:01             ` Florian Westphal
@ 2024-08-12 14:30               ` Jason Xing
  2024-08-12 15:03                 ` Florian Westphal
  2024-08-12 20:00               ` Kuniyuki Iwashima
  1 sibling, 1 reply; 24+ messages in thread
From: Jason Xing @ 2024-08-12 14:30 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, davem, dsahern, edumazet, kuba, linux-kernel,
	netdev, pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

Hello Florian,

On Mon, Aug 12, 2024 at 10:02 PM Florian Westphal <fw@strlen.de> wrote:
>
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Sun, 11 Aug 2024 16:00:29 -0700
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > > Florian Westphal <fw@strlen.de> wrote:
> > > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > >
> > > > > ... shows two cores racing:
> > > > >
> > > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > > and
> > > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > >
> > > > >
> > > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > > netns error unwinding code.
> > > > >
> > > > > Second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
> > > >
> > > > ... which doesn't work.  Does this look like a plausible
> > > > theory/explanation?
> > >
> > > Yes!  The problem here is that inet_twsk_purge() operates on twsk
> > > not in net_exit_list, but I think such a check is overkill and we
> > > can work around it in another way.
>
> I'm not so sure.  Once 'other' inet_twsk_purge() found the twsk and
> called inet_twsk_kill(), 'our' task has to wait for that to complete.
>
> We need to force proper ordering so that all twsk found
>
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
>         struct net *net;
>
> /*HERE*/tcp_twsk_purge(net_exit_list);
>
>         list_for_each_entry(net, net_exit_list, exit_list) {
>                 inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>
> .... have gone through inet_twsk_kill() so tw_refcount managed to
> drop back to 1 before doing
>                 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> .
>
> > > We need to sync the two inet_twsk_kill() calls, so maybe give up in one
> > > of them if the twsk is not hashed?
>
> Not sure, afaiu only one thread enters inet_twsk_kill()
> (the one that manages to deactivate the timer).
>
> > > ---8<---
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 337390ba85b4..51889567274b 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > >     struct inet_bind_hashbucket *bhead, *bhead2;
> > >
> > >     spin_lock(lock);
> > > -   sk_nulls_del_node_init_rcu((struct sock *)tw);
> > > +   if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > > +           spin_unlock(lock);
> > > +           return false;
> >
> > forgot to remove false, just return :)
>
> I don't see how this helps; we need to wait until the 'stolen' twsk
> has gone through inet_twsk_kill() and decremented tw_refcount.
> Obviously it would be a bit simpler if we had a reliable reproducer :-)

Allow me to say something irrelevant to this bug report.

Do you think that Kuniyuki's patch can solve the race between two
'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
concurrently at two cores, say, inet_twsk_purge() and tcp_abort()?

It at least does help avoid decrementing tw_refcount twice in the
above case if I understand correctly.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 14:30               ` Jason Xing
@ 2024-08-12 15:03                 ` Florian Westphal
  2024-08-12 15:49                   ` Jason Xing
  0 siblings, 1 reply; 24+ messages in thread
From: Florian Westphal @ 2024-08-12 15:03 UTC (permalink / raw)
  To: Jason Xing
  Cc: Florian Westphal, Kuniyuki Iwashima, davem, dsahern, edumazet,
	kuba, linux-kernel, netdev, pabeni, syzbot+8ea26396ff85d23a8929,
	syzkaller-bugs

Jason Xing <kerneljasonxing@gmail.com> wrote:
> > I don't see how this helps; we need to wait until the 'stolen' twsk
> > has gone through inet_twsk_kill() and decremented tw_refcount.
> > Obviously it would be a bit simpler if we had a reliable reproducer :-)
> 
> Allow me to say something irrelevant to this bug report.
> 
> Do you think that Kuniyuki's patch can solve the race between two
> 'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
> concurrently at two cores, say, inet_twsk_purge() and tcp_abort()?

I don't think it's possible: tcp_abort() calls inet_twsk_deschedule_put(),
which does:

        if (timer_shutdown_sync(&tw->tw_timer))
                inet_twsk_kill(tw);

So I don't see how two concurrent callers, working on same tw address,
would both be able to shut down the timer.

One will shut it down and call inet_twsk_kill(); the other will wait until
the callback has completed, but it doesn't call inet_twsk_kill().
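
That is (the same call, annotated; this relies on the documented timer
API semantics rather than anything specific to this thread):

	/* timer_shutdown_sync() returns true only for the caller that
	 * actually deactivated a pending timer; a concurrent caller gets
	 * false back, after waiting for a running handler to finish.
	 * So at most one caller proceeds to inet_twsk_kill().
	 */
	if (timer_shutdown_sync(&tw->tw_timer))
		inet_twsk_kill(tw);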

> It at least does help avoid decrementing tw_refcount twice in the
> above case if I understand correctly.

I don't think the refcount is decremented twice.

Problem is one thread is already at the 'final' decrement of
 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

in tcp_sk_exit_batch(), while other thread has not yet called
refcount_dec() on it (inet_twsk_kill still executing).

So we get two splats: refcount_dec_and_test() leaves the count at 1, not
the expected 0, and the refcount_dec() coming right afterwards from the
other task observes the transition to 0, while it should have dropped
the count down to 1.
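
In numbers (illustrative, assuming a single twsk, i.e. tw_refcount
starting at 1 + 1 = 2):

	/* exiting-netns worker, tcp_sk_exit_batch():
	 *   refcount_dec_and_test(): 2 -> 1, returns false -> WARN_ON_ONCE
	 * error-unwind task, inet_twsk_kill():
	 *   refcount_dec():          1 -> 0 -> "decrement hit 0" splat
	 */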

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 15:03                 ` Florian Westphal
@ 2024-08-12 15:49                   ` Jason Xing
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Xing @ 2024-08-12 15:49 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, davem, dsahern, edumazet, kuba, linux-kernel,
	netdev, pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

On Mon, Aug 12, 2024 at 11:03 PM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > I don't see how this helps, we need to wait until the 'stolen' twsk
> > > has gone through inet_twsk_kill() and decremented tw_refcount.
> > > Obviously it would be a bit simpler if we had a reliable reproducer :-)
> >
> > Allow me to ask about something slightly unrelated to this bug report.
> >
> > Do you think that Kuniyuki's patch can solve the race between two
> > 'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
> > concurrently on two cores, say, inet_twsk_purge() and tcp_abort()?
>
> I don't think it's possible; tcp_abort() calls inet_twsk_deschedule_put(),
> which does:
>
>         if (timer_shutdown_sync(&tw->tw_timer))
>                 inet_twsk_kill(tw);
>
> So I don't see how two concurrent callers, working on the same tw address,
> would both be able to shut down the timer.
>
> One will shut it down and call inet_twsk_kill(); the other will wait until
> the callback has completed, but it doesn't call inet_twsk_kill().

Oh, thanks. Since timer_shutdown_sync() can be used as a lock, there
is indeed no way to call inet_twsk_kill() concurrently.

>
> > It would at least help avoid decrementing tw_refcount twice in the
> > above case, if I understand correctly.
>
> I don't think the refcount is decremented twice.
>
> The problem is that one thread is already at the 'final' decrement of
>  WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> in tcp_sk_exit_batch(), while the other thread has not yet called
> refcount_dec() on it (inet_twsk_kill() is still executing).
>
> So we get two splats: refcount_dec_and_test() sees the counter drop to 1
> instead of the expected 0, and the refcount_dec() coming right afterwards
> from the other task observes the transition to 0, when it should have
> dropped down to 1.

I see.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 14:01             ` Florian Westphal
  2024-08-12 14:30               ` Jason Xing
@ 2024-08-12 20:00               ` Kuniyuki Iwashima
  2024-08-12 22:28                 ` [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch Florian Westphal
  1 sibling, 1 reply; 24+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-12 20:00 UTC (permalink / raw)
  To: fw
  Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
	pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Florian Westphal <fw@strlen.de>
Date: Mon, 12 Aug 2024 16:01:04 +0200
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Sun, 11 Aug 2024 16:00:29 -0700
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > > Florian Westphal <fw@strlen.de> wrote:
> > > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > > 
> > > > > ... shows at two cores racing:
> > > > > 
> > > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > > and
> > > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > > 
> > > > > 
> > > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > > netns error unwinding code.
> > > > > 
> > > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > > > 
> > > > ... which doesn't work.  Does this look like a plausible
> > > > theory/explanation?
> > > 
> > > Yes!  The problem here is that inet_twsk_purge() operates on twsks
> > > that are not in net_exit_list, but I think such a check is overkill
> > > and we can work around it in another way.
> 
> I'm not so sure.  Once the 'other' inet_twsk_purge() has found the twsk
> and called inet_twsk_kill(), 'our' task has to wait for that to complete.
> 
> We need to force proper ordering so that all twsks found
> 
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
>         struct net *net;
> 
> /*HERE*/tcp_twsk_purge(net_exit_list);
> 
>         list_for_each_entry(net, net_exit_list, exit_list) {
>                 inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
> 
> .... have gone through inet_twsk_kill(), so tw_refcount has managed to
> drop back to 1 before doing
>                 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> .
> 
> > > We need to sync two inet_twsk_kill() calls, so maybe give up one
> > > if the twsk is not hashed?
> 
> Not sure, afaiu only one thread enters inet_twsk_kill()
> (the one that manages to deactivate the timer).

Ah, you're right.

> 
> > > ---8<---
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 337390ba85b4..51889567274b 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > >  	struct inet_bind_hashbucket *bhead, *bhead2;
> > >  
> > >  	spin_lock(lock);
> > > -	sk_nulls_del_node_init_rcu((struct sock *)tw);
> > > +	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > > +		spin_unlock(lock);
> > > +		return false;
> > 
> > forgot to remove false, just return :)
> 
> I don't see how this helps, we need to wait until 'stolen' twsk
> has gone through inet_twsk_kill() and decremented tw_refcount.
> Obviously It would be a bit simpler if we had a reliable reproducer :-)
> 
> Possible solutions I came up with so far:
> 
> 1) revert b099ce2602d8 ("net: Batch inet_twsk_purge").
> 
> This commit replaced a net_eq(twsk_net(tw), ...) check with a check
> for dead netns (ns.count == 0).
> 
> Downside: We need to remove the purged_once trick that calls
> inet_twsk_purge(&tcp_hashinfo) only once per exiting batch in
> tcp_twsk_purge() as well.
> 
> As per the b099ce2602d8 changelog, this likely increases netns
> dismantle times.
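>
> The check in the ehash walk would then go back to something like this
> (hypothetical sketch of the revert, untested):
>
>     sk_nulls_for_each_rcu(sk, node, &head->chain) {
>             tw = inet_twsk(sk);
>             /* per-net match instead of the "any dead netns" test */
>             if (tw->tw_family != family || !net_eq(twsk_net(tw), net))
>                     continue;
>             ...
>     }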

Yes, so I think we should avoid this approach, although users could
mitigate the cost with a per-net ehash.


> 
> Upside: simpler code, so this is my preferred solution.
> 
> No concurrent runoff anymore: by the time tcp_twsk_purge() returns, it
> has called refcount_dec(->tw_refcount) for every twsk in the exiting
> netns list, without another task stealing twsks owned by the exiting
> netns.
> 
> Solution 2: change tcp_sk_exit_batch like this:
> 
>    tcp_twsk_purge(net_exit_list);
> 
> +  list_for_each_entry(net, net_exit_list, exit_list) {
> +      while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
> +          schedule();
> +  }
> 
>     list_for_each_entry(net, net_exit_list, exit_list) {
>        inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>        WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> This synchronizes two concurrent tcp_sk_exit_batch() calls via the
> existing refcount; if the netns setup error unwinding ran off with one
> of 'our' twsks, it will wait until the other task has completed the
> refcount decrement.
> 
> I don't expect it to increase netns dismantle times, else we'd have seen
> the WARN_ON_ONCE splat frequently.
> 
> Solution 3:

I prefer this one because in all of my local reports I see two splats
from exit_batch calls.

If the same issue still happens with solution 3, we'll see the same
splats with almost no risk.  But solution 2 could give us a hung task,
possibly in cleanup_net, which is worse.

Then we can consider the other possibilities explained in solution 4,
or something like putting the kernel twsk into the ehash of the dying
netns.

Another option would be to add a ref tracker for
tcp_death_row.tw_refcount to detect the root cause?
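
Something like this, completely untested and from memory (assuming the
generic ref_tracker API from include/linux/ref_tracker.h and a new,
hypothetical tw_tracker field next to the twsk):

    static struct ref_tracker_dir tw_ref_dir;

    /* once, at init time: */
    ref_tracker_dir_init(&tw_ref_dir, 16, "tw_refcount");

    /* wherever tw_refcount is incremented: */
    refcount_inc(&tw_dr->tw_refcount);
    ref_tracker_alloc(&tw_ref_dir, &tw->tw_tracker, GFP_ATOMIC);

    /* wherever tw_refcount is decremented: */
    ref_tracker_free(&tw_ref_dir, &tw->tw_tracker);
    refcount_dec(&tw_dr->tw_refcount);

    /* on a splat, dump the still-outstanding holders: */
    ref_tracker_dir_print(&tw_ref_dir, 16);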


> 
> Similar to 2), but via mutex_lock/unlock pair:
> 
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
>         struct net *net;
> 
> 	mutex_lock(&tcp_exit_batch_mutex);
> 
>         tcp_twsk_purge(net_exit_list);
> 
>         list_for_each_entry(net, net_exit_list, exit_list) {
>                 inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>                 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>                 tcp_fastopen_ctx_destroy(net);
>         }
> 	mutex_unlock(&tcp_exit_batch_mutex);
> }
> 
> Solution 4:
> 
> I have doubts wrt. the tcp_twsk_purge() interaction with the tw timer
> firing at the 'wrong' time.  This is an independent "problem"; I might
> be imagining things here.
> 
> Consider:
> 313 void inet_twsk_purge(struct inet_hashinfo *hashinfo)
> 314 {
> [..]
> 321         for (slot = 0; slot <= ehash_mask; slot++, head++) {
> 
> the tw sk timer fires on another cpu, and inet_twsk_kill() does:
> 
> 56         spin_lock(lock);
> 57         sk_nulls_del_node_init_rcu((struct sock *)tw);
> 58         spin_unlock(lock);
> 
> ... then the other cpu gets preempted.
> inet_twsk_purge() resumes and hits an empty chain head:
> 
> 322                 if (hlist_nulls_empty(&head->chain))
> 323                         continue;
> 
> so we don't (can't) wait for the timer to run to completion.
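>
> I.e. (sketch of the suspected interleaving):
>
>     tw timer cpu (inet_twsk_kill)     purge cpu (inet_twsk_purge)
>
>     spin_lock(lock);
>     sk_nulls_del_node_init_rcu(tw);
>     spin_unlock(lock);
>     /* preempted before the
>        refcount_dec() further down */
>                                       hlist_nulls_empty(&head->chain)
>                                         -> true, continue, i.e. we
>                                            never wait for the timer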
> 
> If this sounds plausible to you, this gives us solution 4:
> 
> Restart the inet_twsk_purge() loop until tw_dr->tw_refcount has
> dropped down to 1.
> 
> Alternatively (still assuming the above race is real), sk_nulls_del_node_init_rcu
> needs to be moved down:
> 
>  48 static void inet_twsk_kill(struct inet_timewait_sock *tw)
> ...
>  58     /* Disassociate with bind bucket. */
> ...
>  68     spin_unlock(&bhead->lock);
> 
>  70     refcount_dec(&tw->tw_dr->tw_refcount);
> 
>  +      spin_lock(lock);
>  +      sk_nulls_del_node_init_rcu((struct sock *)tw);
>  +      spin_unlock(lock);
> 71      inet_twsk_put(tw);
> 72 }
> 
> ... so a concurrent purge() call will still find us on
> the node list (and then wait on timer_shutdown_sync())
> until the other cpu executing the timer is done.
> 
> If the twsk was already unlinked from the table before
> inet_twsk_purge() had a chance to find it, then in the worst
> case a call to tcp_twsk_destructor() is missing, but I don't
> see any ordering requirements that need us to wait for this.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-12 20:00               ` Kuniyuki Iwashima
@ 2024-08-12 22:28                 ` Florian Westphal
  2024-08-12 23:28                   ` Kuniyuki Iwashima
                                     ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Florian Westphal @ 2024-08-12 22:28 UTC (permalink / raw)
  To: netdev
  Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Kuniyuki Iwashima, Jason Xing, syzbot+8ea26396ff85d23a8929

It's possible that two threads call tcp_sk_exit_batch() concurrently,
once from the cleanup_net workqueue, once from a task that failed to clone
a new netns.  In the latter case, error unwinding calls the exit handlers
in reverse order for the 'failed' netns.

tcp_sk_exit_batch() calls tcp_twsk_purge().
The problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
this function picks up twsks in any dying netns, not just the one passed
in via the exit_batch list.

This means that the error unwind of setup_net() can "steal" and destroy
timewait sockets belonging to the exiting netns.

This allows the netns exit worker to proceed to call

WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

without the expected 1 -> 0 transition, which then splats.

At the same time, the error unwind path that is also running inet_twsk_purge()
will splat as well:

WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
...
 refcount_dec include/linux/refcount.h:351 [inline]
 inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
 setup_net+0x714/0xb40 net/core/net_namespace.c:375
 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110

... because refcount_dec() of tw_refcount unexpectedly dropped to 0.

This doesn't seem like an actual bug (no tw sockets got lost and I don't
see a use-after-free), but rather an erroneous trigger of a debug check.

Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
blocks the other task from doing the final _dec_and_test before the mutex
owner has removed all tw sockets of the dying netns.

Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Jason Xing <kerneljasonxing@gmail.com>
Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/tcp_ipv4.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fd17f25ff288..a4e510846905 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -97,6 +97,8 @@ static DEFINE_PER_CPU(struct sock_bh_locked, ipv4_tcp_sk) = {
 	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
 };
 
+static DEFINE_MUTEX(tcp_exit_batch_mutex);
+
 static u32 tcp_v4_init_seq(const struct sk_buff *skb)
 {
 	return secure_tcp_seq(ip_hdr(skb)->daddr,
@@ -3514,6 +3516,16 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 {
 	struct net *net;
 
+	/* make sure concurrent calls to tcp_sk_exit_batch from net_cleanup_work
+	 * and failed setup_net error unwinding path are serialized.
+	 *
+	 * tcp_twsk_purge() handles twsk in any dead netns, not just those in
+	 * net_exit_list, the thread that dismantles a particular twsk must
+	 * do so without other thread progressing to refcount_dec_and_test() of
+	 * tcp_death_row.tw_refcount.
+	 */
+	mutex_lock(&tcp_exit_batch_mutex);
+
 	tcp_twsk_purge(net_exit_list);
 
 	list_for_each_entry(net, net_exit_list, exit_list) {
@@ -3521,6 +3533,8 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 		WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
 		tcp_fastopen_ctx_destroy(net);
 	}
+
+	mutex_unlock(&tcp_exit_batch_mutex);
 }
 
 static struct pernet_operations __net_initdata tcp_sk_ops = {
-- 
2.44.2


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-12 22:28                 ` [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch Florian Westphal
@ 2024-08-12 23:28                   ` Kuniyuki Iwashima
  2024-08-12 23:52                     ` Florian Westphal
  2024-08-13  2:48                   ` Jason Xing
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-12 23:28 UTC (permalink / raw)
  To: fw
  Cc: davem, edumazet, kerneljasonxing, kuba, kuniyu, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929

From: Florian Westphal <fw@strlen.de>
Date: Tue, 13 Aug 2024 00:28:25 +0200
> It's possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
> 
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> The problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsks in any dying netns, not just the one passed
> in via the exit_batch list.
> 
> This means that the error unwind of setup_net() can "steal" and destroy
> timewait sockets belonging to the exiting netns.
> 
> This allows the netns exit worker to proceed to call
> 
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> without the expected 1 -> 0 transition, which then splats.
> 
> At the same time, the error unwind path that is also running inet_twsk_purge()
> will splat as well:
> 
> WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> ...
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> 
> ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> 
> This doesn't seem like an actual bug (no tw sockets got lost and I don't
> see a use-after-free), but rather an erroneous trigger of a debug check.

I guess the reason you don't move the check back to tcp_sk_exit() is
to catch a potential issue explained in solution 4 in the link, right?

Then,

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>

> 
> Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> blocks the other task from doing the final _dec_and_test before the mutex
> owner has removed all tw sockets of the dying netns.
> 
> Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> Cc: Jason Xing <kerneljasonxing@gmail.com>
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> Signed-off-by: Florian Westphal <fw@strlen.de>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-12 23:28                   ` Kuniyuki Iwashima
@ 2024-08-12 23:52                     ` Florian Westphal
  2024-08-13  0:01                       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 24+ messages in thread
From: Florian Westphal @ 2024-08-12 23:52 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: fw, davem, edumazet, kerneljasonxing, kuba, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> > 
> > This doesn't seem like an actual bug (no tw sockets got lost and I don't
> > see a use-after-free) but as erroneous trigger of debug check.
> 
> I guess the reason you don't move the check back to tcp_sk_exit() is

No, it would not work. .exit runs before .exit_batch, we'd splat.

Before e9bd0cca09d1, ordering didn't matter because
refcount_dec_and_test() was used consistently, so it was not relevant
whether the 0-transition occurred from .exit or later via inet_twsk_kill().
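
The pattern difference, roughly (sketch only, not the exact old/new code):

    /* before: every path used refcount_dec_and_test(), so any path
     * could legitimately observe the final 1 -> 0 transition:
     */
    if (refcount_dec_and_test(&tw->tw_dr->tw_refcount))
            kfree(tw->tw_dr);

    /* now: inet_twsk_kill() does a plain refcount_dec(), which WARNs
     * when it hits 0, so the final transition is only valid from
     * tcp_sk_exit_batch():
     */
    refcount_dec(&tw->tw_dr->tw_refcount);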

> > to catch a potential issue explained in solution 4 in the link, right?

Yes, it would help to catch such an issue wrt. twsk lifetime.
Having the WARN ensures no twsk can escape .exit_batch completion.

E.g. if twsk destructors in the future refer to some other
object that has to be released before netns pointers become invalid
or something like that.

Does that make sense?  Otherwise I'm open to alternative approaches.
Or we can wait until syzbot finds a reproducer; I don't think it's
a real/urgent bug.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-12 23:52                     ` Florian Westphal
@ 2024-08-13  0:01                       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 24+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-13  0:01 UTC (permalink / raw)
  To: fw
  Cc: davem, edumazet, kerneljasonxing, kuba, kuniyu, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929

From: Florian Westphal <fw@strlen.de>
Date: Tue, 13 Aug 2024 01:52:59 +0200
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> > > 
> > > This doesn't seem like an actual bug (no tw sockets got lost and I don't
> > > see a use-after-free) but as erroneous trigger of debug check.
> > 
> > I guess the reason you don't move the check back to tcp_sk_exit() is
> 
> No, it would not work. .exit runs before .exit_batch, we'd splat.

ahh, sorry, somehow I checked the order of exit_batch_rtnl() and
ops_exit_list() ... :/


> 
> Before e9bd0cca09d1, ordering didn't matter because
> refcount_dec_and_test() was used consistently, so it was not relevant
> whether the 0-transition occurred from .exit or later via inet_twsk_kill().
> 
> > to catch a potential issue explained in solution 4 in the link, right?
> 
> Yes, it would help to catch such an issue wrt. twsk lifetime.
> Having the WARN ensures no twsk can escape .exit_batch completion.
> 
> E.g. if twsk destructors in the future refer to some other
> object that has to be released before netns pointers become invalid
> or something like that.
> 
> Does that make sense?

Yes.


> Otherwise I'm open to alternative approaches.
> Or we can wait until syzbot finds a reproducer; I don't think it's
> a real/urgent bug.

Agreed, and I guess syzbot will not find a repro, as it's been
3 months since the first report.

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-12 22:28                 ` [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch Florian Westphal
  2024-08-12 23:28                   ` Kuniyuki Iwashima
@ 2024-08-13  2:48                   ` Jason Xing
  2024-08-15 10:47                   ` Paolo Abeni
  2024-08-19 15:50                   ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 24+ messages in thread
From: Jason Xing @ 2024-08-13  2:48 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netdev, Paolo Abeni, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Kuniyuki Iwashima, syzbot+8ea26396ff85d23a8929

On Tue, Aug 13, 2024 at 7:04 AM Florian Westphal <fw@strlen.de> wrote:
>
> It's possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
>
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> The problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsks in any dying netns, not just the one passed
> in via the exit_batch list.
>
> This means that the error unwind of setup_net() can "steal" and destroy
> timewait sockets belonging to the exiting netns.
>
> This allows the netns exit worker to proceed to call
>
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> without the expected 1 -> 0 transition, which then splats.
>
> At the same time, the error unwind path that is also running inet_twsk_purge()
> will splat as well:
>
> WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> ...
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
>
> ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
>
> This doesn't seem like an actual bug (no tw sockets got lost and I don't
> see a use-after-free), but rather an erroneous trigger of a debug check.
>
> Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> blocks the other task from doing the final _dec_and_test before the mutex
> owner has removed all tw sockets of the dying netns.
>
> Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> Cc: Jason Xing <kerneljasonxing@gmail.com>
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> Signed-off-by: Florian Westphal <fw@strlen.de>

Thanks a lot for your great effort :) It's not easy to analyze at all.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks,
Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-12 22:28                 ` [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch Florian Westphal
  2024-08-12 23:28                   ` Kuniyuki Iwashima
  2024-08-13  2:48                   ` Jason Xing
@ 2024-08-15 10:47                   ` Paolo Abeni
  2024-08-19 15:36                     ` Eric Dumazet
  2024-08-19 15:50                   ` patchwork-bot+netdevbpf
  3 siblings, 1 reply; 24+ messages in thread
From: Paolo Abeni @ 2024-08-15 10:47 UTC (permalink / raw)
  To: Florian Westphal, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Kuniyuki Iwashima,
	Jason Xing, syzbot+8ea26396ff85d23a8929

On 8/13/24 00:28, Florian Westphal wrote:
> It's possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
> 
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> The problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsks in any dying netns, not just the one passed
> in via the exit_batch list.
> 
> This means that the error unwind of setup_net() can "steal" and destroy
> timewait sockets belonging to the exiting netns.
> 
> This allows the netns exit worker to proceed to call
> 
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> without the expected 1 -> 0 transition, which then splats.
> 
> At the same time, the error unwind path that is also running inet_twsk_purge()
> will splat as well:
> 
> WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> ...
>   refcount_dec include/linux/refcount.h:351 [inline]
>   inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>   inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
>   inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>   tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>   ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>   setup_net+0x714/0xb40 net/core/net_namespace.c:375
>   copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>   create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> 
> ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> 
> This doesn't seem like an actual bug (no tw sockets got lost and I don't
> see a use-after-free), but rather an erroneous trigger of a debug check.
> 
> Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> blocks the other task from doing the final _dec_and_test before the mutex
> owner has removed all tw sockets of the dying netns.
> 
> Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> Cc: Jason Xing <kerneljasonxing@gmail.com>
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> Signed-off-by: Florian Westphal <fw@strlen.de>

The fix LGTM, but I prefer to keep the patch in PW a little longer to
allow Eric to have a look here.

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-15 10:47                   ` Paolo Abeni
@ 2024-08-19 15:36                     ` Eric Dumazet
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2024-08-19 15:36 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Florian Westphal, netdev, David S. Miller, Jakub Kicinski,
	Kuniyuki Iwashima, Jason Xing, syzbot+8ea26396ff85d23a8929

On Thu, Aug 15, 2024 at 12:47 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 8/13/24 00:28, Florian Westphal wrote:
> > It's possible that two threads call tcp_sk_exit_batch() concurrently,
> > once from the cleanup_net workqueue, once from a task that failed to clone
> > a new netns.  In the latter case, error unwinding calls the exit handlers
> > in reverse order for the 'failed' netns.
> >
> > tcp_sk_exit_batch() calls tcp_twsk_purge().
> > The problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> > this function picks up twsks in any dying netns, not just the one passed
> > in via the exit_batch list.
> >
> > This means that the error unwind of setup_net() can "steal" and destroy
> > timewait sockets belonging to the exiting netns.
> >
> > This allows the netns exit worker to proceed to call
> >
> > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> >
> > without the expected 1 -> 0 transition, which then splats.
> >
> > At the same time, the error unwind path that is also running inet_twsk_purge()
> > will splat as well:
> >
> > WARNING: .. at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210
> > ...
> >   refcount_dec include/linux/refcount.h:351 [inline]
> >   inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> >   inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221
> >   inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> >   tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> >   ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> >   setup_net+0x714/0xb40 net/core/net_namespace.c:375
> >   copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> >   create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> >
> > ... because refcount_dec() of tw_refcount unexpectedly dropped to 0.
> >
> > This doesn't seem like an actual bug (no tw sockets got lost and I don't
> > see a use-after-free), but rather an erroneous trigger of a debug check.
> >
> > Add a mutex to force strict ordering: the task that calls tcp_twsk_purge()
> > blocks the other task from doing the final _dec_and_test before the mutex
> > owner has removed all tw sockets of the dying netns.
> >
> > Fixes: e9bd0cca09d1 ("tcp: Don't allocate tcp_death_row outside of struct netns_ipv4.")
> > Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Cc: Jason Xing <kerneljasonxing@gmail.com>
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/netdev/0000000000003a5292061f5e4e19@google.com/
> > Link: https://lore.kernel.org/netdev/20240812140104.GA21559@breakpoint.cc/
> > Signed-off-by: Florian Westphal <fw@strlen.de>
>
> The fix LGTM, but I prefer to keep the patch in PW a little longer to
> allow Eric to have a look here.

Excellent patch, thanks a lot Florian !

I had this syzbot report in my queue for a while, but I presumed it
was caused by RDS and was thus waiting for more signal.

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch
  2024-08-12 22:28                 ` [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch Florian Westphal
                                     ` (2 preceding siblings ...)
  2024-08-15 10:47                   ` Paolo Abeni
@ 2024-08-19 15:50                   ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 24+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-08-19 15:50 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netdev, pabeni, davem, edumazet, kuba, kuniyu, kerneljasonxing,
	syzbot+8ea26396ff85d23a8929

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 13 Aug 2024 00:28:25 +0200 you wrote:
> It's possible that two threads call tcp_sk_exit_batch() concurrently,
> once from the cleanup_net workqueue, once from a task that failed to clone
> a new netns.  In the latter case, error unwinding calls the exit handlers
> in reverse order for the 'failed' netns.
> 
> tcp_sk_exit_batch() calls tcp_twsk_purge().
> The problem is that since commit b099ce2602d8 ("net: Batch inet_twsk_purge"),
> this function picks up twsks in any dying netns, not just the one passed
> in via the exit_batch list.
> 
> [...]

Here is the summary with links:
  - [net] tcp: prevent concurrent execution of tcp_sk_exit_batch
    https://git.kernel.org/netdev/net/c/565d121b6998

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-08-19 15:50 UTC | newest]

Thread overview: 24+ messages:
2024-08-11  1:29 [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill syzbot
2024-08-11  2:29 ` Kuniyuki Iwashima
2024-08-11  5:42   ` Jason Xing
2024-08-11 13:24   ` Florian Westphal
2024-08-11 14:54     ` Florian Westphal
2024-08-11 16:28       ` Florian Westphal
2024-08-11 23:00         ` Kuniyuki Iwashima
2024-08-11 23:08           ` Kuniyuki Iwashima
2024-08-12  0:36             ` Jason Xing
2024-08-12 14:01             ` Florian Westphal
2024-08-12 14:30               ` Jason Xing
2024-08-12 15:03                 ` Florian Westphal
2024-08-12 15:49                   ` Jason Xing
2024-08-12 20:00               ` Kuniyuki Iwashima
2024-08-12 22:28                 ` [PATCH net] tcp: prevent concurrent execution of tcp_sk_exit_batch Florian Westphal
2024-08-12 23:28                   ` Kuniyuki Iwashima
2024-08-12 23:52                     ` Florian Westphal
2024-08-13  0:01                       ` Kuniyuki Iwashima
2024-08-13  2:48                   ` Jason Xing
2024-08-15 10:47                   ` Paolo Abeni
2024-08-19 15:36                     ` Eric Dumazet
2024-08-19 15:50                   ` patchwork-bot+netdevbpf
2024-08-11 13:32   ` [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill Florian Westphal
2024-08-11 22:35     ` Kuniyuki Iwashima
