public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
@ 2024-08-11  1:29 syzbot
  2024-08-11  2:29 ` Kuniyuki Iwashima
  0 siblings, 1 reply; 16+ messages in thread
From: syzbot @ 2024-08-11  1:29 UTC (permalink / raw)
  To: davem, dsahern, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com

------------[ cut here ]------------
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Modules linked in:
CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Code: 8b e8 37 85 cf fc 90 0f 0b 90 90 e9 c3 fe ff ff e8 68 34 0d fd c6 05 0d 81 4c 0b 01 90 48 c7 c7 20 2b 8f 8b e8 14 85 cf fc 90 <0f> 0b 90 90 e9 a0 fe ff ff 48 89 ef e8 e2 e8 68 fd e9 44 fe ff ff
RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __refcount_dec include/linux/refcount.h:336 [inline]
 refcount_dec include/linux/refcount.h:351 [inline]
 inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
 inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
 inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
 tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
 tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
 ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
 setup_net+0x714/0xb40 net/core/net_namespace.c:375
 copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
 create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
 ksys_unshare+0x419/0x970 kernel/fork.c:3323
 __do_sys_unshare kernel/fork.c:3394 [inline]
 __se_sys_unshare kernel/fork.c:3392 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f56d7c7cee9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f56d897c0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f56d7dac1f0 RCX: 00007f56d7c7cee9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000042000000
RBP: 00007f56d7cc949e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f56d7dac1f0 R15: 00007ffe66454be8
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  1:29 [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill syzbot
@ 2024-08-11  2:29 ` Kuniyuki Iwashima
  2024-08-11  5:42   ` Jason Xing
                     ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11  2:29 UTC (permalink / raw)
  To: syzbot+8ea26396ff85d23a8929
  Cc: davem, dsahern, edumazet, kuba, linux-kernel, netdev, pabeni,
	syzkaller-bugs

From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
Date: Sat, 10 Aug 2024 18:29:20 -0700
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> Downloadable assets:
> disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> 
> ------------[ cut here ]------------
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31

Eric, this is the weird report I was talking about at netdevconf :)

It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
than refcount_inc().

I started to see the same splat at a very low rate after consuming
commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").

The commit deferred refcount_inc(tw_refcount) slightly, to after the hash
dance, so the twsk is now visible before tw_dr->tw_refcount is incremented.

I came up with the diff below, but I suspected a bug in another place,
possibly QEMU, so I haven't posted the diff officially.

refcount_inc() was actually deferred, but it's still under an ehash lock,
and inet_twsk_deschedule_put() must be serialised with the same ehash
lock.  Even inet_twsk_kill() performs the ehash lock dance before calling
refcount_dec().

So, it should be impossible that the refcount_inc() is not visible after the
double lock/unlock and before refcount_dec(), so this report looks bogus to me :S

---8<---
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 337390ba85b4..c3b2f0426e01 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -115,6 +115,8 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
 	struct inet_bind_hashbucket *bhead, *bhead2;
 
+	refcount_inc(&tw->tw_dr->tw_refcount);
+
 	/* Step 1: Put TW into bind hash. Original socket stays there too.
 	   Note, that any socket with inet->num != 0 MUST be bound in
 	   binding cache, even if it is closed.
@@ -301,7 +303,6 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
 		__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
 						     LINUX_MIB_TIMEWAITED);
 		BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
-		refcount_inc(&tw->tw_dr->tw_refcount);
 	} else {
 		mod_timer_pending(&tw->tw_timer, jiffies + timeo);
 	}
---8<---
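
[ To restate the locking argument above as a sketch -- hand-written
  pseudo-C, not code copied from the tree; the function names are real,
  but the bodies are condensed: ]

	/* hashdance side: the twsk becomes visible in the ehash chain and
	 * tw_refcount is incremented while the per-bucket lock is held */
	spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);

	spin_lock(lock);
	/* ... link tw into the ehash chain ... */
	refcount_inc(&tw->tw_dr->tw_refcount);
	spin_unlock(lock);

	/* kill side: the same bucket lock is taken before the decrement,
	 * so the increment above must already be visible here */
	spin_lock(lock);
	sk_nulls_del_node_init_rcu((struct sock *)tw);
	spin_unlock(lock);
	refcount_dec(&tw->tw_dr->tw_refcount);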


> Modules linked in:
> CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> Code: 8b e8 37 85 cf fc 90 0f 0b 90 90 e9 c3 fe ff ff e8 68 34 0d fd c6 05 0d 81 4c 0b 01 90 48 c7 c7 20 2b 8f 8b e8 14 85 cf fc 90 <0f> 0b 90 90 e9 a0 fe ff ff 48 89 ef e8 e2 e8 68 fd e9 44 fe ff ff
> RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> FS:  00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> Call Trace:
>  <TASK>
>  __refcount_dec include/linux/refcount.h:336 [inline]
>  refcount_dec include/linux/refcount.h:351 [inline]
>  inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
>  inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
>  inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
>  tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
>  tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
>  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
>  setup_net+0x714/0xb40 net/core/net_namespace.c:375
>  copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
>  create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
>  unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
>  ksys_unshare+0x419/0x970 kernel/fork.c:3323
>  __do_sys_unshare kernel/fork.c:3394 [inline]
>  __se_sys_unshare kernel/fork.c:3392 [inline]
>  __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f56d7c7cee9
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f56d897c0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
> RAX: ffffffffffffffda RBX: 00007f56d7dac1f0 RCX: 00007f56d7c7cee9
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000042000000
> RBP: 00007f56d7cc949e R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 000000000000006e R14: 00007f56d7dac1f0 R15: 00007ffe66454be8
>  </TASK>
> 

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  2:29 ` Kuniyuki Iwashima
@ 2024-08-11  5:42   ` Jason Xing
  2024-08-11 13:24   ` Florian Westphal
  2024-08-11 13:32   ` Florian Westphal
  2 siblings, 0 replies; 16+ messages in thread
From: Jason Xing @ 2024-08-11  5:42 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs

Hello Kuniyuki,

On Sun, Aug 11, 2024 at 10:35 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> >
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
>
> Eric, this is the weird report I was talking about at netdevconf :)
>
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
>
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
>
> The commit a bit deferred refcount_inc(tw_refcount) after the hash dance,
> so twsk is now visible before tw_dr->tw_refcount is incremented.
>
> I came up with the diff below but was suspecting a bug in another place,
> possibly QEMU, so I haven't posted the diff officially.
>
> refcount_inc() was actually deferred, but it's still under an ehash lock,
> and inet_twsk_deschedule_put() must be serialised with the same ehash
> lock.  Even inet_twsk_kill() performs the ehash lock dance before calling
> refcount_dec().
>
> So, it should be impossible that refcount_inc() is not visible after double
> lock/unlock and before refcount_dec(), so this report looks bogus to me :S

In normal cases, I agree with this, since inc/dec are all protected
under the spin lock. There is no way we can decrement it if we don't
increment it first.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  2:29 ` Kuniyuki Iwashima
  2024-08-11  5:42   ` Jason Xing
@ 2024-08-11 13:24   ` Florian Westphal
  2024-08-11 14:54     ` Florian Westphal
  2024-08-11 13:32   ` Florian Westphal
  2 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 13:24 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> > 
> > Unfortunately, I don't have any reproducer for this issue yet.
> > 
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> > 
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> 
> Eric, this is the weird report I was talking about at netdevconf :)
> 
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
> 
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").

I think I see why.

The reported splat is from a kernel without the above commit.
But from the backtrace we entered here:

                if (net->ipv4.tcp_death_row.hashinfo->pernet) {
                        /* Even if tw_refcount == 1, we must clean up kernel reqsk */
                        inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
                } else if (!purged_once) {
-------------->         inet_twsk_purge(&tcp_hashinfo);  // THIS
                        purged_once = true;

> The commit a bit deferred refcount_inc(tw_refcount) after the hash dance,
> so twsk is now visible before tw_dr->tw_refcount is incremented.
> 
> I came up with the diff below but was suspecting a bug in another place,
> possibly QEMU, so I haven't posted the diff officially.
> 
> refcount_inc() was actually deferred, but it's still under an ehash lock,

but a different struct inet_hashinfo, so the locks don't help :/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11  2:29 ` Kuniyuki Iwashima
  2024-08-11  5:42   ` Jason Xing
  2024-08-11 13:24   ` Florian Westphal
@ 2024-08-11 13:32   ` Florian Westphal
  2024-08-11 22:35     ` Kuniyuki Iwashima
  2 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 13:32 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
	linux-kernel, netdev, pabeni, syzkaller-bugs

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> > 
> > Unfortunately, I don't have any reproducer for this issue yet.
> > 
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> > 
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> 
> Eric, this is the weird report I was talking about at netdevconf :)
> 
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
> 
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").

Could you share a splat you saw?

Was it also for the fallback tcp_hashinfo and/or from error
unwinding after an ops->init() failure on netns creation?

I don't think it makes sense to keep following this specific report, given
that b334b924c9b7 made changes in this area; a post-b334b924c9b7 splat
would be more relevant, I think.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 13:24   ` Florian Westphal
@ 2024-08-11 14:54     ` Florian Westphal
  2024-08-11 16:28       ` Florian Westphal
  0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 14:54 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, syzbot+8ea26396ff85d23a8929, davem, dsahern,
	edumazet, kuba, linux-kernel, netdev, pabeni, syzkaller-bugs

Florian Westphal <fw@strlen.de> wrote:
> > I came up with the diff below but was suspecting a bug in another place,
> > possibly QEMU, so I haven't posted the diff officially.
> > 
> > refcount_inc() was actually deferred, but it's still under an ehash lock,
> 
> but different struct inet_hashinfo, so the locks don't help :/

No, the fallback is fine: pernet tw_refcount, init_net ehash lock array, so
the same buckets should serialize on the same ehash lock.

https://syzkaller.appspot.com/x/log.txt?x=117f3182980000

... shows two cores racing:

[ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
and
[ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).


The first splat's backtrace shows an invocation of tcp_sk_exit_batch() from
the netns error unwinding code.

The second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
likely walking the init_net tcp_hashinfo.

The warning on the second core is:
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

Looks like the netns cleanup work queue somehow skipped at least one tw sk,
hitting the above splat.

Then, the first core did refcount_dec() on tw_refcount, which produces the
dec-to-0 warning (which makes sense if the "supposedly final" decrement was
already done).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 14:54     ` Florian Westphal
@ 2024-08-11 16:28       ` Florian Westphal
  2024-08-11 23:00         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 16:28 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, syzbot+8ea26396ff85d23a8929, davem, dsahern,
	edumazet, kuba, linux-kernel, netdev, pabeni, syzkaller-bugs

Florian Westphal <fw@strlen.de> wrote:
> https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> 
> ... shows at two cores racing:
> 
> [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> and
> [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> 
> 
> first splat backtrace shows invocation of tcp_sk_exit_batch() from
> netns error unwinding code.
> 
> Second one lacks backtrace, but its also in tcp_sk_exit_batch(),

... which doesn't work.  Does this look like a plausible
theory/explanation?

Given:
1 exiting netns that has >= 1 tw sk.
1 (unrelated) netns that failed in setup_net.

... we run into the following race:

The exiting netns, from the cleanup wq, calls tcp_sk_exit_batch(), which calls
inet_twsk_purge(&tcp_hashinfo).

At the same time, from the error unwinding code, we also call tcp_sk_exit_batch().

Both threads walk tcp_hashinfo ehash buckets.

From the work queue (normal netns exit path), we hit

303                         if (state == TCP_TIME_WAIT) {
304                                 inet_twsk_deschedule_put(inet_twsk(sk));

Because both threads operate on tcp_hashinfo, the unrelated
struct net (the exiting net) is also visible to the error-unwinding thread.

So, the error unwinding code will call

303                         if (state == TCP_TIME_WAIT) {
304                                 inet_twsk_deschedule_put(inet_twsk(sk));

for the same tw sk and both threads do

218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
219 {
220         if (del_timer_sync(&tw->tw_timer))
221                 inet_twsk_kill(tw);

The error unwind path cancels the timer and calls inet_twsk_kill(), while
the work queue sees the timer as already shut down, so it ends up
returning to tcp_sk_exit_batch(), where it will WARN here:

  WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

... because the supposedly-last tw_refcount decrement did not drop
it down to 0.

Meanwhile, the error unwinding thread calls refcount_dec() on
tw_refcount, which now drops it down to 0 instead of 1, which
produces another warn splat.
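
[ A compressed timeline of the race described above (a sketch; the column
  assignment follows the two splats in the log): ]

	netns cleanup wq                  setup_net error unwind
	----------------                  ----------------------
	tcp_sk_exit_batch()
	  inet_twsk_purge(&tcp_hashinfo)
	    finds tw of exiting netns
	                                  tcp_sk_exit_batch()
	                                    inet_twsk_purge(&tcp_hashinfo)
	                                      finds the same tw
	                                      del_timer_sync() -> true
	                                      inet_twsk_kill(tw) ... (running)
	    del_timer_sync() -> false,
	    so inet_twsk_kill() is skipped
	  WARN_ON_ONCE(!refcount_dec_and_test(...))  <- count still > 1: splat
	                                      refcount_dec() -> hits 0: splat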

I'll ponder ways to fix this tomorrow unless someone
else already has a better theory/solution.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 13:32   ` Florian Westphal
@ 2024-08-11 22:35     ` Kuniyuki Iwashima
  0 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 22:35 UTC (permalink / raw)
  To: fw
  Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
	pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Florian Westphal <fw@strlen.de>
Date: Sun, 11 Aug 2024 15:32:00 +0200
> > > ------------[ cut here ]------------
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> > 
> > Eric, this is the weird report I was talking about at netdevconf :)
> > 
> > It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> > than refcount_inc().
> > 
> > I started to see the same splat at a very low rate after consuming
> > commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
> 
> Could you share a splat you saw?
> 
> Was it also for the fallback tcp_hashinfo and/or from error
> unwindinding after a ops->init() failure on netns creation?
> 
> I don't think it makes sense following this specific report given
> b334b924c9b7 made changes in this area, but a post-b334b924c9b7
> would be more relevant I think.

Ah, I missed that the initial syzbot report was created before the commit.

Here are the splats I saw:

[ Note both splats were mixed together in the log, as with the syzbot report,
  so two threads were calling tcp_sk_exit_batch() concurrently. ]

---8<---
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 1760 at lib/refcount.c:31 refcount_warn_saturate (lib/refcount.c:31 (discriminator 3))
Modules linked in:
CPU: 0 PID: 1760 Comm: syz-executor.4 Not tainted 6.10.0-rc3-00688-g934c29999b57-dirty #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate (lib/refcount.c:31 (discriminator 3))
Code: 05 cd 8d aa 04 01 e8 48 e4 e9 fe 0f 0b e9 d3 fe ff ff e8 1c 36 1c ff 48 c7 c7 a0 4a 80 85 c6 05 aa 8d aa 04 01 e8 29 e4 e9 fe <0f> 0b e9 b4 fe ff ff 48 89 ef e8 fa 56 69 ff e9 5c fe ff ff 0f 1f
RSP: 0018:ffffc9000118fa78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000b13b000
RDX: 0000000000040000 RSI: ffffffff8118dedf RDI: 0000000000000001
RBP: ffff88800bb59440 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 000000002d2d2d2d R12: ffff88800bb59440
R13: ffff88810130f3d0 R14: ffff8881013ee820 R15: ffff888102001198
FS:  00007f5252dbd640(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004a6128 CR3: 00000001038b8003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 inet_twsk_kill (./include/linux/refcount.h:336 ./include/linux/refcount.h:351 net/ipv4/inet_timewait_sock.c:70)
 inet_twsk_deschedule_put (net/ipv4/inet_timewait_sock.c:266)
 inet_twsk_purge (net/ipv4/inet_timewait_sock.c:349)
 tcp_twsk_purge (net/ipv4/tcp_minisocks.c:402)
 tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3512)
 ops_exit_list (net/core/net_namespace.c:179)
 setup_net (net/core/net_namespace.c:374 (discriminator 3))
 copy_net_ns (net/core/net_namespace.c:510)
 create_new_namespaces (kernel/nsproxy.c:110)
 unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
 ksys_unshare (kernel/fork.c:3325)
 __x64_sys_unshare (kernel/fork.c:3392)
 do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f5253a29e5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007f5252dbcc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00000000004d4070 RCX: 00007f5253a29e5d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000
RBP: 00000000004d4070 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f5253a8a530 R15: 0000000000000000
 </TASK>

WARNING: CPU: 1 PID: 671 at net/ipv4/tcp_ipv4.c:3514 tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3514)
Modules linked in:
CPU: 1 PID: 671 Comm: kworker/u8:15 Not tainted 6.10.0-rc3-00688-g934c29999b57-dirty #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: netns cleanup_net
RIP: 0010:tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3514)
Code: 89 ee e8 9f 1b 28 fd 85 ed 0f 88 b3 00 00 00 e8 92 22 28 fd 31 ff 89 ee e8 89 1b 28 fd 85 ed 0f 84 9d 00 00 00 e8 7c 22 28 fd <0f> 0b e8 75 22 28 fd 48 89 df e8 ed 23 02 00 48 8d 7b 28 48 89 f8
RSP: 0018:ffffc90000ebfc60 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88800bb59140 RCX: ffffffff84230cb7
RDX: ffff888103871180 RSI: ffffffff84230cc4 RDI: 0000000000000005
RBP: 0000000000000002 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000002 R11: 00000000000002bd R12: ffff88800bb59440
R13: dffffc0000000000 R14: ffffc90000ebfd08 R15: fffffbfff0d7fb5c
FS:  0000000000000000(0000) GS:ffff88811af00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f274114ddf8 CR3: 000000010a11c006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ops_exit_list (net/core/net_namespace.c:179)
 cleanup_net (net/core/net_namespace.c:639 (discriminator 3))
 process_one_work (kernel/workqueue.c:3236)
 worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393)
 kthread (kernel/kthread.c:389)
 ret_from_fork (arch/x86/kernel/process.c:153)
 ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
 </TASK>
---8<---

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 16:28       ` Florian Westphal
@ 2024-08-11 23:00         ` Kuniyuki Iwashima
  2024-08-11 23:08           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 23:00 UTC (permalink / raw)
  To: fw
  Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
	pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Florian Westphal <fw@strlen.de>
Date: Sun, 11 Aug 2024 18:28:50 +0200
> Florian Westphal <fw@strlen.de> wrote:
> > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > 
> > ... shows at two cores racing:
> > 
> > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > and
> > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > 
> > 
> > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > netns error unwinding code.
> > 
> > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> 
> ... which doesn't work.  Does this look like a plausible
> theory/exlanation?

Yes!  The problem here is that inet_twsk_purge() also operates on twsks
that are not in net_exit_list, but I think such a check is overkill and we
can work around it in another way.


> 
> Given:
> 1 exiting netns, has >= 1 tw sk.
> 1 (unrelated) netns that failed in setup_net
> 
> ... we run into following race:
> 
> exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> inet_twsk_purge(&tcp_hashinfo).
> 
> At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> 
> Both threads walk tcp_hashinfo ehash buckets.
> 
> From work queue (normal netns exit path), we hit
> 
> 303                         if (state == TCP_TIME_WAIT) {
> 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> 
> Because both threads operate on tcp_hashinfo, the unrelated
> struct net (exiting net) is also visible to error-unwinding thread.
> 
> So, error unwinding code will call
> 
> 303                         if (state == TCP_TIME_WAIT) {
> 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> 
> for the same tw sk and both threads do
> 
> 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> 219 {
> 220         if (del_timer_sync(&tw->tw_timer))
> 221                 inet_twsk_kill(tw);
> 
> Error unwind path cancel timer, calls inet_twsk_kill, while
> work queue sees timer as already shut-down so it ends up
> returning to tcp_sk_exit_batch(), where it will WARN here:
> 
>   WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> ... because the supposedly-last tw_refcount decrement did not drop
> it down to 0.
> 
> Meanwhile, error unwiding thread calls refcount_dec() on
> tw_refcount, which now drops down to 0 instead of 1, which
> provides another warn splat.
> 
> I'll ponder on ways to fix this tomorrow unless someone
> else already has better theory/solution.

We need to sync the two inet_twsk_kill() calls, so maybe give up in one
if the twsk is no longer hashed?

---8<---
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 337390ba85b4..51889567274b 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	struct inet_bind_hashbucket *bhead, *bhead2;
 
 	spin_lock(lock);
-	sk_nulls_del_node_init_rcu((struct sock *)tw);
+	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
+		spin_unlock(lock);
+		return false;
+	}
 	spin_unlock(lock);
 
 	/* Disassociate with bind bucket. */
---8<---

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 23:00         ` Kuniyuki Iwashima
@ 2024-08-11 23:08           ` Kuniyuki Iwashima
  2024-08-12  0:36             ` Jason Xing
  2024-08-12 14:01             ` Florian Westphal
  0 siblings, 2 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 23:08 UTC (permalink / raw)
  To: kuniyu
  Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Sun, 11 Aug 2024 16:00:29 -0700
> From: Florian Westphal <fw@strlen.de>
> Date: Sun, 11 Aug 2024 18:28:50 +0200
> > Florian Westphal <fw@strlen.de> wrote:
> > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > 
> > > ... shows at two cores racing:
> > > 
> > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > and
> > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > 
> > > 
> > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > netns error unwinding code.
> > > 
> > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > 
> > ... which doesn't work.  Does this look like a plausible
> > theory/exlanation?
> 
> Yes!  The problem here is that inet_twsk_purge() operates on twsk
> not in net_exit_list, but I think such a check is overkill and we
> can work around it in another way.
> 
> 
> > 
> > Given:
> > 1 exiting netns, has >= 1 tw sk.
> > 1 (unrelated) netns that failed in setup_net
> > 
> > ... we run into following race:
> > 
> > exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> > inet_twsk_purge(&tcp_hashinfo).
> > 
> > At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> > 
> > Both threads walk tcp_hashinfo ehash buckets.
> > 
> > From work queue (normal netns exit path), we hit
> > 
> > 303                         if (state == TCP_TIME_WAIT) {
> > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > 
> > Because both threads operate on tcp_hashinfo, the unrelated
> > struct net (exiting net) is also visible to error-unwinding thread.
> > 
> > So, error unwinding code will call
> > 
> > 303                         if (state == TCP_TIME_WAIT) {
> > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > 
> > for the same tw sk and both threads do
> > 
> > 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > 219 {
> > 220         if (del_timer_sync(&tw->tw_timer))
> > 221                 inet_twsk_kill(tw);
> > 
> > Error unwind path cancel timer, calls inet_twsk_kill, while
> > work queue sees timer as already shut-down so it ends up
> > returning to tcp_sk_exit_batch(), where it will WARN here:
> > 
> >   WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> > 
> > ... because the supposedly-last tw_refcount decrement did not drop
> > it down to 0.
> > 
> > Meanwhile, error unwiding thread calls refcount_dec() on
> > tw_refcount, which now drops down to 0 instead of 1, which
> > provides another warn splat.
> > 
> > I'll ponder on ways to fix this tomorrow unless someone
> > else already has better theory/solution.
> 
> We need to sync two inet_twsk_kill(), so maybe give up one
> if twsk is not hashed ?
> 
> ---8<---
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 337390ba85b4..51889567274b 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
>  	struct inet_bind_hashbucket *bhead, *bhead2;
>  
>  	spin_lock(lock);
> -	sk_nulls_del_node_init_rcu((struct sock *)tw);
> +	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> +		spin_unlock(lock);
> +		return false;

Forgot to remove the false; it should just be 'return;' :)


> +	}
>  	spin_unlock(lock);
>  
>  	/* Disassociate with bind bucket. */
> ---8<---
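
[ For reference, the corrected hunk would then read (sketch): ]

---8<---
 	spin_lock(lock);
-	sk_nulls_del_node_init_rcu((struct sock *)tw);
+	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
+		spin_unlock(lock);
+		return;
+	}
 	spin_unlock(lock);
---8<---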

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 23:08           ` Kuniyuki Iwashima
@ 2024-08-12  0:36             ` Jason Xing
  2024-08-12 14:01             ` Florian Westphal
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Xing @ 2024-08-12  0:36 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929, syzkaller-bugs

On Mon, Aug 12, 2024 at 7:09 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Sun, 11 Aug 2024 16:00:29 -0700
> > From: Florian Westphal <fw@strlen.de>
> > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > Florian Westphal <fw@strlen.de> wrote:
> > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > >
> > > > ... shows at two cores racing:
> > > >
> > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > and
> > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > >
> > > >
> > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > netns error unwinding code.
> > > >
> > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > >
> > > ... which doesn't work.  Does this look like a plausible
> > > theory/exlanation?
> >
> > Yes!  The problem here is that inet_twsk_purge() operates on twsk
> > not in net_exit_list, but I think such a check is overkill and we
> > can work around it in another way.
> >
> >
> > >
> > > Given:
> > > 1 exiting netns, has >= 1 tw sk.
> > > 1 (unrelated) netns that failed in setup_net
> > >
> > > ... we run into following race:
> > >
> > > exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> > > inet_twsk_purge(&tcp_hashinfo).
> > >
> > > At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> > >
> > > Both threads walk tcp_hashinfo ehash buckets.
> > >
> > > From work queue (normal netns exit path), we hit
> > >
> > > 303                         if (state == TCP_TIME_WAIT) {
> > > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > >
> > > Because both threads operate on tcp_hashinfo, the unrelated
> > > struct net (exiting net) is also visible to error-unwinding thread.
> > >
> > > So, error unwinding code will call
> > >
> > > 303                         if (state == TCP_TIME_WAIT) {
> > > 304                                 inet_twsk_deschedule_put(inet_twsk(sk));
> > >
> > > for the same tw sk and both threads do
> > >
> > > 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > > 219 {
> > > 220         if (del_timer_sync(&tw->tw_timer))
> > > 221                 inet_twsk_kill(tw);
> > >
> > > Error unwind path cancel timer, calls inet_twsk_kill, while
> > > work queue sees timer as already shut-down so it ends up
> > > returning to tcp_sk_exit_batch(), where it will WARN here:
> > >
> > >   WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> > >
> > > ... because the supposedly-last tw_refcount decrement did not drop
> > > it down to 0.
> > >
> > > Meanwhile, error unwiding thread calls refcount_dec() on
> > > tw_refcount, which now drops down to 0 instead of 1, which
> > > provides another warn splat.

Right, I should have thought of it last night. I noticed that two
'killers' may enter inet_twsk_kill() concurrently (at the time, I was
paying attention to tcp_abort(), which can cause such an issue), but I
let it go too easily :(

Thanks for the detailed explanation :)

> > >
> > > I'll ponder on ways to fix this tomorrow unless someone
> > > else already has better theory/solution.
> >
> > We need to sync two inet_twsk_kill(), so maybe give up one
> > if twsk is not hashed ?
> >
> > ---8<---
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 337390ba85b4..51889567274b 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> >       struct inet_bind_hashbucket *bhead, *bhead2;
> >
> >       spin_lock(lock);
> > -     sk_nulls_del_node_init_rcu((struct sock *)tw);
> > +     if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > +             spin_unlock(lock);
> > +             return false;
>
> forgot to remove false, just return :)

It does make sense to me!

Thanks,
Jason

>
>
> > +     }
> >       spin_unlock(lock);
> >
> >       /* Disassociate with bind bucket. */
> > ---8<---
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-11 23:08           ` Kuniyuki Iwashima
  2024-08-12  0:36             ` Jason Xing
@ 2024-08-12 14:01             ` Florian Westphal
  2024-08-12 14:30               ` Jason Xing
  2024-08-12 20:00               ` Kuniyuki Iwashima
  1 sibling, 2 replies; 16+ messages in thread
From: Florian Westphal @ 2024-08-12 14:01 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
	syzbot+8ea26396ff85d23a8929, syzkaller-bugs

Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Sun, 11 Aug 2024 16:00:29 -0700
> > From: Florian Westphal <fw@strlen.de>
> > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > Florian Westphal <fw@strlen.de> wrote:
> > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > 
> > > > ... shows at two cores racing:
> > > > 
> > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > and
> > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > 
> > > > 
> > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > netns error unwinding code.
> > > > 
> > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > > 
> > > ... which doesn't work.  Does this look like a plausible
> > > theory/exlanation?
> > 
> > Yes!  The problem here is that inet_twsk_purge() operates on twsk
> > not in net_exit_list, but I think such a check is overkill and we
> > can work around it in another way.

I'm not so sure.  Once the 'other' inet_twsk_purge() has found the twsk and
called inet_twsk_kill(), 'our' task has to wait for that to complete.

We need to force proper ordering so that all twsks found

static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
        struct net *net;

/*HERE*/tcp_twsk_purge(net_exit_list);

        list_for_each_entry(net, net_exit_list, exit_list) {
                inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);

.... have gone through inet_twsk_kill(), so that tw_refcount has dropped
back to 1 before doing
                WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

> > We need to sync two inet_twsk_kill(), so maybe give up one
> > if twsk is not hashed ?

Not sure, afaiu only one thread enters inet_twsk_kill()
(the one that manages to deactivate the timer).

> > ---8<---
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 337390ba85b4..51889567274b 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> >  	struct inet_bind_hashbucket *bhead, *bhead2;
> >  
> >  	spin_lock(lock);
> > -	sk_nulls_del_node_init_rcu((struct sock *)tw);
> > +	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > +		spin_unlock(lock);
> > +		return false;
> 
> forgot to remove false, just return :)

I don't see how this helps; we need to wait until the 'stolen' twsk
has gone through inet_twsk_kill() and decremented tw_refcount.
Obviously it would be a bit simpler if we had a reliable reproducer :-)

Possible solutions I came up with so far:

1) revert b099ce2602d8 ("net: Batch inet_twsk_purge").

This commit replaced a net_eq(twsk_net(tw) ... with a check for a
dead netns (ns.count == 0).

Downside: We need to remove the purged_once trick that calls
inet_twsk_purge(&tcp_hashinfo) only once per exiting batch in
tcp_twsk_purge() as well.

As per the b099ce2602d8 changelog, this likely increases netns dismantle times.

Upside: simpler code, so this is my preferred solution.

No concurrent runoff anymore: by the time tcp_twsk_purge() returns, it has
called refcount_dec(->tw_refcount) for every twsk in the exiting netns
list, without another task stealing twsks owned by the exiting netns.

Solution 2: change tcp_sk_exit_batch like this:

   tcp_twsk_purge(net_exit_list);

+  list_for_each_entry(net, net_exit_list, exit_list) {
+      while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
+         schedule();
+
+  }

    list_for_each_entry(net, net_exit_list, exit_list) {
       inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
       WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

This synchronizes two concurrent tcp_sk_exit_batch() calls via the
existing refcount; if the netns setup error unwinding ran off with one of
'our' twsks, we will wait until the other task has completed the refcount decrement.

I don't expect it to increase netns dismantle times, else we'd have seen
the WARN_ON_ONCE splat frequently.

Solution 3:

Similar to 2), but via mutex_lock/unlock pair:

static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
        struct net *net;

	mutex_lock(&tcp_exit_batch_mutex);

        tcp_twsk_purge(net_exit_list);

        list_for_each_entry(net, net_exit_list, exit_list) {
                inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
                WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
                tcp_fastopen_ctx_destroy(net);
        }
	mutex_unlock(&tcp_exit_batch_mutex);
}
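
[ Solution 3 assumes a file-scope mutex next to tcp_sk_exit_batch(),
  e.g. (sketch): static DEFINE_MUTEX(tcp_exit_batch_mutex); ]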

Solution 4:

I have doubts wrt. tcp_twsk_purge()'s interaction with the tw timer firing
at the 'wrong' time.  This is an independent "problem"; I might be
imagining things here.

Consider:
313 void inet_twsk_purge(struct inet_hashinfo *hashinfo)
314 {
[..]
321         for (slot = 0; slot <= ehash_mask; slot++, head++) {

tw sk timer fires on other cpu, inet_twsk_kill() does:

56         spin_lock(lock);
57         sk_nulls_del_node_init_rcu((struct sock *)tw);
58         spin_unlock(lock);

... then the other cpu gets preempted.
inet_twsk_purge() resumes and hits an empty chain head:

322                 if (hlist_nulls_empty(&head->chain))
323                         continue;

so we don't(can't) wait for the timer to run to completion.

If this sounds plausible to you, this gives us solution 4:

Restart the inet_twsk_purge() loop until tw_dr->tw_refcount has
dropped down to 1.

Alternatively (still assuming the above race is real), sk_nulls_del_node_init_rcu
needs to be moved down:

 48 static void inet_twsk_kill(struct inet_timewait_sock *tw)
...
 58     /* Disassociate with bind bucket. */
...
 68     spin_unlock(&bhead->lock);

 70     refcount_dec(&tw->tw_dr->tw_refcount);

 +      spin_lock(lock);
 +      sk_nulls_del_node_init_rcu((struct sock *)tw);
 +      spin_unlock(lock);
71      inet_twsk_put(tw);
72 }

... so a concurrent purge() call will find us on
the node list (and then wait in timer_shutdown_sync())
until the other cpu executing the timer is done.

If the twsk was already unlinked from the table before
inet_twsk_purge() had a chance to find it, then in the worst
case the call to tcp_twsk_destructor() is missing, but I don't
see any ordering requirements that need us to wait for this.
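
[ The "restart until tw_refcount drops to 1" variant, as a sketch for the
  fallback tcp_hashinfo case -- assuming 'net' is one of the exiting netns,
  as in the solution 2 loop: ]

	/* re-walk the ehash table until every twsk of this exiting net
	 * has actually gone through inet_twsk_kill() */
	do {
		inet_twsk_purge(&tcp_hashinfo);
	} while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1);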

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 14:01             ` Florian Westphal
@ 2024-08-12 14:30               ` Jason Xing
  2024-08-12 15:03                 ` Florian Westphal
  2024-08-12 20:00               ` Kuniyuki Iwashima
  1 sibling, 1 reply; 16+ messages in thread
From: Jason Xing @ 2024-08-12 14:30 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, davem, dsahern, edumazet, kuba, linux-kernel,
	netdev, pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

Hello Florian,

On Mon, Aug 12, 2024 at 10:02 PM Florian Westphal <fw@strlen.de> wrote:
>
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Sun, 11 Aug 2024 16:00:29 -0700
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > > Florian Westphal <fw@strlen.de> wrote:
> > > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > >
> > > > > ... shows at two cores racing:
> > > > >
> > > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > > and
> > > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > >
> > > > >
> > > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > > netns error unwinding code.
> > > > >
> > > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > > >
> > > > ... which doesn't work.  Does this look like a plausible
> > > > theory/exlanation?
> > >
> > > Yes!  The problem here is that inet_twsk_purge() operates on twsk
> > > not in net_exit_list, but I think such a check is overkill and we
> > > can work around it in another way.
>
> I'm not so sure.  Once 'other' inet_twsk_purge() found the twsk and
> called inet_twsk_kill(), 'our' task has to wait for that to complete.
>
> We need to force proper ordering so that all twsk found
>
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
>         struct net *net;
>
> /*HERE*/tcp_twsk_purge(net_exit_list);
>
>         list_for_each_entry(net, net_exit_list, exit_list) {
>                 inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>
> .... have gone through inet_twsk_kill() so tw_refcount managed to
> drop back to 1 before doing
>                 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> .
>
> > > We need to sync two inet_twsk_kill(), so maybe give up one
> > > if twsk is not hashed ?
>
> Not sure, afaiu only one thread enters inet_twsk_kill()
> (the one that manages to deactivate the timer).
>
> > > ---8<---
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 337390ba85b4..51889567274b 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > >     struct inet_bind_hashbucket *bhead, *bhead2;
> > >
> > >     spin_lock(lock);
> > > -   sk_nulls_del_node_init_rcu((struct sock *)tw);
> > > +   if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > > +           spin_unlock(lock);
> > > +           return false;
> >
> > forgot to remove false, just return :)
>
> I don't see how this helps, we need to wait until 'stolen' twsk
> has gone through inet_twsk_kill() and decremented tw_refcount.
> Obviously It would be a bit simpler if we had a reliable reproducer :-)

Allow me to say something not directly related to this bug report.

Do you think that Kuniyuki's patch can solve the race between two
'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
concurrently at two cores, say, inet_twsk_purge() and tcp_abort()?

It at least does help avoid decrementing tw_refcount twice in the
above case if I understand correctly.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 14:30               ` Jason Xing
@ 2024-08-12 15:03                 ` Florian Westphal
  2024-08-12 15:49                   ` Jason Xing
  0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-12 15:03 UTC (permalink / raw)
  To: Jason Xing
  Cc: Florian Westphal, Kuniyuki Iwashima, davem, dsahern, edumazet,
	kuba, linux-kernel, netdev, pabeni, syzbot+8ea26396ff85d23a8929,
	syzkaller-bugs

Jason Xing <kerneljasonxing@gmail.com> wrote:
> > I don't see how this helps, we need to wait until 'stolen' twsk
> > has gone through inet_twsk_kill() and decremented tw_refcount.
> > Obviously It would be a bit simpler if we had a reliable reproducer :-)
> 
> Allow me to say something irrelevant to this bug report.
> 
> Do you think that Kuniyuki's patch can solve the race between two
> 'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
> concurrently at two cores, say, inet_twsk_purge() and tcp_abort()?

I don't think it's possible; tcp_abort() calls inet_twsk_deschedule_put(),
which does:

        if (timer_shutdown_sync(&tw->tw_timer))
                inet_twsk_kill(tw);

So I don't see how two concurrent callers, working on the same tw address,
would both be able to shut down the timer.

One will shut it down and call inet_twsk_kill(); the other will wait until
the callback has completed, but it doesn't call inet_twsk_kill().
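
[ I.e., for two hypothetical concurrent callers A and B racing on the
  same tw (a sketch): ]

	A: timer_shutdown_sync(&tw->tw_timer) -> true
	     /* A deactivated the pending timer; A calls inet_twsk_kill() */
	B: timer_shutdown_sync(&tw->tw_timer) -> false
	     /* timer already gone; B only waits for a running callback to
	      * finish and does not call inet_twsk_kill() */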

> It at least does help avoid decrementing tw_refcount twice in the
> above case if I understand correctly.

I don't think the refcount is decremented twice.

The problem is that one thread is already at the 'final' decrement of
 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));

in tcp_sk_exit_batch(), while the other thread has not yet called
refcount_dec() on it (inet_twsk_kill() is still executing).

So we get two splats: refcount_dec_and_test() drops the count to 1 instead
of the expected 0, and the refcount_dec() coming right afterwards from the
other task observes the transition to 0, while it should have dropped it
down to 1.
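
[ In numbers, assuming a single twsk: tw_refcount sits at 2 (1 base
  reference + 1 twsk).  The dec_and_test does 2 -> 1, not the expected
  1 -> 0, so the WARN_ON_ONCE fires; the late refcount_dec() then does
  1 -> 0, which triggers the "decrement hit 0" splat. ]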

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 15:03                 ` Florian Westphal
@ 2024-08-12 15:49                   ` Jason Xing
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Xing @ 2024-08-12 15:49 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Kuniyuki Iwashima, davem, dsahern, edumazet, kuba, linux-kernel,
	netdev, pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

On Mon, Aug 12, 2024 at 11:03 PM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > I don't see how this helps, we need to wait until the 'stolen' twsk
> > > has gone through inet_twsk_kill() and decremented tw_refcount.
> > > Obviously it would be a bit simpler if we had a reliable reproducer :-)
> >
> > Allow me to ask something tangential to this bug report.
> >
> > Do you think that Kuniyuki's patch can solve the race between two
> > 'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
> > concurrently on two cores, say, inet_twsk_purge() and tcp_abort()?
>
> I don't think it's possible; tcp_abort() calls inet_twsk_deschedule_put(),
> which does:
>
>         if (timer_shutdown_sync(&tw->tw_timer))
>                 inet_twsk_kill(tw);
>
> So I don't see how two concurrent callers, working on the same tw address,
> would both be able to shut down the timer.
>
> One will shut it down and call inet_twsk_kill(); the other will wait until
> the callback has completed, but it won't call inet_twsk_kill().

Oh, thanks. Since timer_shutdown_sync() effectively acts as a lock here,
there is indeed no way to call inet_twsk_kill() concurrently.

>
> > It would at least help avoid decrementing tw_refcount twice in the
> > above case, if I understand correctly.
>
> I don't think the refcount is decremented twice.
>
> The problem is that one thread is already at the 'final' decrement of
>  WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> in tcp_sk_exit_batch(), while the other thread has not yet called
> refcount_dec() on it (inet_twsk_kill() is still executing).
>
> So we get two splats: refcount_dec_and_test() drops the count to 1 instead
> of the expected 0 (so the WARN_ON_ONCE fires), and the refcount_dec() coming
> right afterwards from the other task observes the transition to 0, when it
> should only have taken the count down to 1.

I see.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
  2024-08-12 14:01             ` Florian Westphal
  2024-08-12 14:30               ` Jason Xing
@ 2024-08-12 20:00               ` Kuniyuki Iwashima
  1 sibling, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-12 20:00 UTC (permalink / raw)
  To: fw
  Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
	pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs

From: Florian Westphal <fw@strlen.de>
Date: Mon, 12 Aug 2024 16:01:04 +0200
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Sun, 11 Aug 2024 16:00:29 -0700
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > > Florian Westphal <fw@strlen.de> wrote:
> > > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > > 
> > > > > ... shows at two cores racing:
> > > > > 
> > > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > > and
> > > > > [ 3127.257864][   T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > > 
> > > > > 
> > > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > > netns error unwinding code.
> > > > > 
> > > > > Second one lacks backtrace, but it's also in tcp_sk_exit_batch(),
> > > > 
> > > > ... which doesn't work.  Does this look like a plausible
> > > > theory/explanation?
> > > 
> > > Yes!  The problem here is that inet_twsk_purge() operates on a twsk
> > > whose netns is not in net_exit_list, but I think such a check is
> > > overkill and we can work around it in another way.
> 
> I'm not so sure.  Once the 'other' inet_twsk_purge() has found the twsk
> and called inet_twsk_kill(), 'our' task has to wait for that to complete.
> 
> We need to force proper ordering so that all twsks found
> 
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
>         struct net *net;
> 
> /*HERE*/tcp_twsk_purge(net_exit_list);
> 
>         list_for_each_entry(net, net_exit_list, exit_list) {
>                 inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
> 
> .... have gone through inet_twsk_kill(), so that tw_refcount has managed
> to drop back to 1 before doing
>                 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> .
> 
> > > We need to sync two inet_twsk_kill() calls, so maybe give up in one
> > > if the twsk is not hashed?
> 
> Not sure, afaiu only one thread enters inet_twsk_kill()
> (the one that manages to deactivate the timer).

Ah, you're right.

> 
> > > ---8<---
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 337390ba85b4..51889567274b 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > >  	struct inet_bind_hashbucket *bhead, *bhead2;
> > >  
> > >  	spin_lock(lock);
> > > -	sk_nulls_del_node_init_rcu((struct sock *)tw);
> > > +	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > > +		spin_unlock(lock);
> > > +		return false;
> > 
> > forgot to remove false, just return :)
> 
> I don't see how this helps, we need to wait until the 'stolen' twsk
> has gone through inet_twsk_kill() and decremented tw_refcount.
> Obviously it would be a bit simpler if we had a reliable reproducer :-)
> 
> Possible solutions I came up with so far:
> 
> 1) revert b099ce2602d8 ("net: Batch inet_twsk_purge").
> 
> This commit replaced a net_eq(twsk_net(tw) ... with a check for
> dead netns (ns.count == 0).
> 
> Downside: we also need to remove the purged_once trick in
> tcp_twsk_purge() that calls inet_twsk_purge(&tcp_hashinfo) only once
> per exiting batch.
> 
> As per the b099ce2602d8 changelog, this likely increases netns dismantle times.

Yes, so I think we should avoid this approach, although users could
mitigate the cost with a per-net ehash.
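
Just to double-check we mean the same thing, the revert direction would
roughly bring back a per-netns check in inet_twsk_purge()'s bucket walk,
something like this inner loop (untested sketch, not the literal
pre-b099ce2602d8 code; locking and restart handling kept minimal):

restart:
	rcu_read_lock();
	sk_nulls_for_each_rcu(sk, node, &head->chain) {
		if (sk->sk_state != TCP_TIME_WAIT)
			continue;

		tw = inet_twsk(sk);
		if (!net_eq(twsk_net(tw), net))
			continue;	/* twsk of another netns, skip it */

		if (!refcount_inc_not_zero(&tw->tw_refcnt))
			continue;

		rcu_read_unlock();
		inet_twsk_deschedule_put(tw);
		goto restart;	/* chain may have changed, rescan */
	}
	rcu_read_unlock();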


> 
> Upside: simpler code, so this is my preferred solution.
> 
> No concurrent runoff anymore: by the time tcp_twsk_purge() returns, it
> has called refcount_dec(->tw_refcount) for every twsk in the exiting
> netns list, without another task stealing twsks owned by the exiting
> netns.
> 
> Solution 2: change tcp_sk_exit_batch like this:
> 
>    tcp_twsk_purge(net_exit_list);
> 
> +  list_for_each_entry(net, net_exit_list, exit_list) {
> +      while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
> +         schedule();
> +
> +  }
> 
>     list_for_each_entry(net, net_exit_list, exit_list) {
>        inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>        WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> 
> This synchronizes two concurrent tcp_sk_exit_batch() calls via the
> existing refcount; if netns setup error unwinding ran off with one of
> 'our' twsks, we will wait until the other task has completed the
> refcount decrement.
> 
> I don't expect it to increase netns dismantle times; otherwise we'd have
> seen the WARN_ON_ONCE splat frequently.
> 
> Solution 3:

I prefer this one because in all of my local reports I see two splats
from exit_batch calls.

If the same issue still happens with solution 3, we'll see the same
splats with mostly no risk.  But solution 2 would give us a hung task,
possibly in cleanup_net, which is worse.

Then we can consider the other possibilities explained in solution 4,
or something like a kernel twsk being put into the ehash of a dying
netns.

Another option would be to add a ref_tracker for
tcp_death_row.tw_refcount to detect the root cause?
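
Something like this, roughly (hypothetical sketch; the tw_ref_dir and
tw_ref_tracker fields and the call sites are invented here just for
illustration):

struct inet_timewait_death_row {
	refcount_t		tw_refcount;
	struct ref_tracker_dir	tw_ref_dir;	/* invented debug field */
	...
};

/* wherever a tw_refcount reference is taken, e.g. at twsk allocation */
refcount_inc(&tw_dr->tw_refcount);
ref_tracker_alloc(&tw_dr->tw_ref_dir, &tw->tw_ref_tracker, GFP_ATOMIC);

/* ... and the matching release in inet_twsk_kill() */
ref_tracker_free(&tw->tw_dr->tw_ref_dir, &tw->tw_ref_tracker);
refcount_dec(&tw->tw_dr->tw_refcount);

Then dumping the outstanding trackers at the WARN_ON_ONCE site should
tell us which get is missing its put.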


> 
> Similar to 2), but via mutex_lock/unlock pair:
> 
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
>         struct net *net;
> 
> 	mutex_lock(&tcp_exit_batch_mutex);
> 
>         tcp_twsk_purge(net_exit_list);
> 
>         list_for_each_entry(net, net_exit_list, exit_list) {
>                 inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>                 WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>                 tcp_fastopen_ctx_destroy(net);
>         }
> 	mutex_unlock(&tcp_exit_batch_mutex);
> }
> 
> Solution 4:
> 
> I have doubts wrt. tcp_twsk_purge()'s interaction with the tw timer
> firing at the 'wrong' time.  This is an independent "problem"; I might
> be imagining things here.
> 
> Consider:
> 313 void inet_twsk_purge(struct inet_hashinfo *hashinfo)
> 314 {
> [..]
> 321         for (slot = 0; slot <= ehash_mask; slot++, head++) {
> 
> The tw sk timer fires on another cpu, and inet_twsk_kill() does:
> 
> 56         spin_lock(lock);
> 57         sk_nulls_del_node_init_rcu((struct sock *)tw);
> 58         spin_unlock(lock);
> 
> ... then the other cpu gets preempted.
> inet_twsk_purge() resumes and hits an empty chain head:
> 
> 322                 if (hlist_nulls_empty(&head->chain))
> 323                         continue;
> 
> so we don't (and can't) wait for the timer to run to completion.
> 
> If this sounds plausible to you, this gives us solution 4:
> 
> Restart the inet_twsk_purge() loop until tw_dr->tw_refcount has
> dropped down to 1.
> 
> Alternatively (still assuming the above race is real),
> sk_nulls_del_node_init_rcu() needs to be moved down:
> 
>  48 static void inet_twsk_kill(struct inet_timewait_sock *tw)
> ...
>  58     /* Disassociate with bind bucket. */
> ...
>  68     spin_unlock(&bhead->lock);
> 
>  70     refcount_dec(&tw->tw_dr->tw_refcount);
> 
>  +      spin_lock(lock);
>  +      sk_nulls_del_node_init_rcu((struct sock *)tw);
>  +      spin_unlock(lock);
> 71      inet_twsk_put(tw);
> 72 }
> 
> ... so a concurrent purge() call will find us on the node list (and
> then wait via timer_shutdown_sync()) until the other cpu executing the
> timer is done.
> 
> If the twsk was already unlinked from the table before
> inet_twsk_purge() had a chance to find it, then in the worst case we
> miss the call to tcp_twsk_destructor(), but I don't see any ordering
> requirements that would make us wait for it.
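
If we end up trying the restart idea from solution 4, I guess the loop
in tcp_twsk_purge() would look something like this (rough, untested
sketch; it glosses over netns sharing the global tcp_hashinfo):

	list_for_each_entry(net, net_exit_list, exit_list) {
		struct inet_timewait_death_row *tw_dr = &net->ipv4.tcp_death_row;

		/* rescan until every twsk of this netns has gone through
		 * inet_twsk_kill(), i.e. tw_refcount is back at its base
		 * value of 1
		 */
		do {
			inet_twsk_purge(tw_dr->hashinfo);
			cond_resched();
		} while (refcount_read(&tw_dr->tw_refcount) > 1);
	}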

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-08-12 20:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-11  1:29 [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill syzbot
2024-08-11  2:29 ` Kuniyuki Iwashima
2024-08-11  5:42   ` Jason Xing
2024-08-11 13:24   ` Florian Westphal
2024-08-11 14:54     ` Florian Westphal
2024-08-11 16:28       ` Florian Westphal
2024-08-11 23:00         ` Kuniyuki Iwashima
2024-08-11 23:08           ` Kuniyuki Iwashima
2024-08-12  0:36             ` Jason Xing
2024-08-12 14:01             ` Florian Westphal
2024-08-12 14:30               ` Jason Xing
2024-08-12 15:03                 ` Florian Westphal
2024-08-12 15:49                   ` Jason Xing
2024-08-12 20:00               ` Kuniyuki Iwashima
2024-08-11 13:32   ` Florian Westphal
2024-08-11 22:35     ` Kuniyuki Iwashima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox