* [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
@ 2024-08-11 1:29 syzbot
2024-08-11 2:29 ` Kuniyuki Iwashima
0 siblings, 1 reply; 16+ messages in thread
From: syzbot @ 2024-08-11 1:29 UTC (permalink / raw)
To: davem, dsahern, edumazet, kuba, linux-kernel, netdev, pabeni,
syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: 33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
kernel config: https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
------------[ cut here ]------------
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Modules linked in:
CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Code: 8b e8 37 85 cf fc 90 0f 0b 90 90 e9 c3 fe ff ff e8 68 34 0d fd c6 05 0d 81 4c 0b 01 90 48 c7 c7 20 2b 8f 8b e8 14 85 cf fc 90 <0f> 0b 90 90 e9 a0 fe ff ff 48 89 ef e8 e2 e8 68 fd e9 44 fe ff ff
RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:336 [inline]
refcount_dec include/linux/refcount.h:351 [inline]
inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
setup_net+0x714/0xb40 net/core/net_namespace.c:375
copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
ksys_unshare+0x419/0x970 kernel/fork.c:3323
__do_sys_unshare kernel/fork.c:3394 [inline]
__se_sys_unshare kernel/fork.c:3392 [inline]
__x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f56d7c7cee9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f56d897c0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00007f56d7dac1f0 RCX: 00007f56d7c7cee9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000042000000
RBP: 00007f56d7cc949e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f56d7dac1f0 R15: 00007ffe66454be8
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 1:29 [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill syzbot
@ 2024-08-11 2:29 ` Kuniyuki Iwashima
2024-08-11 5:42 ` Jason Xing
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 2:29 UTC (permalink / raw)
To: syzbot+8ea26396ff85d23a8929
Cc: davem, dsahern, edumazet, kuba, linux-kernel, netdev, pabeni,
syzkaller-bugs
From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
Date: Sat, 10 Aug 2024 18:29:20 -0700
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> kernel config: https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> Downloadable assets:
> disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
>
> ------------[ cut here ]------------
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
Eric, this is the weird report I was talking about at netdevconf :)
It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
than refcount_inc().
I started to see the same splat at a very low rate after consuming
commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
The commit deferred refcount_inc(tw_refcount) slightly, until after the hash dance,
so the twsk is now visible before tw_dr->tw_refcount is incremented.
I came up with the diff below but was suspecting a bug in another place,
possibly QEMU, so I haven't posted the diff officially.
refcount_inc() was actually deferred, but it's still under an ehash lock,
and inet_twsk_deschedule_put() must be serialised with the same ehash
lock. Even inet_twsk_kill() performs the ehash lock dance before calling
refcount_dec().
So it should be impossible for the refcount_inc() to be invisible after the
double lock/unlock and before the refcount_dec(); this report looks bogus to me :S
---8<---
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 337390ba85b4..c3b2f0426e01 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -115,6 +115,8 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
struct inet_bind_hashbucket *bhead, *bhead2;
+ refcount_inc(&tw->tw_dr->tw_refcount);
+
/* Step 1: Put TW into bind hash. Original socket stays there too.
Note, that any socket with inet->num != 0 MUST be bound in
binding cache, even if it is closed.
@@ -301,7 +303,6 @@ void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
__NET_INC_STATS(twsk_net(tw), kill ? LINUX_MIB_TIMEWAITKILLED :
LINUX_MIB_TIMEWAITED);
BUG_ON(mod_timer(&tw->tw_timer, jiffies + timeo));
- refcount_inc(&tw->tw_dr->tw_refcount);
} else {
mod_timer_pending(&tw->tw_timer, jiffies + timeo);
}
---8<---
> Modules linked in:
> CPU: 3 PID: 1396 Comm: syz-executor.3 Not tainted 6.9.0-syzkaller-07370-g33e02dc69afb #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> Code: 8b e8 37 85 cf fc 90 0f 0b 90 90 e9 c3 fe ff ff e8 68 34 0d fd c6 05 0d 81 4c 0b 01 90 48 c7 c7 20 2b 8f 8b e8 14 85 cf fc 90 <0f> 0b 90 90 e9 a0 fe ff ff 48 89 ef e8 e2 e8 68 fd e9 44 fe ff ff
> RSP: 0018:ffffc9000480fa70 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9002ce28000
> RDX: 0000000000040000 RSI: ffffffff81505406 RDI: 0000000000000001
> RBP: ffff88804d8b3f80 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000002 R12: ffff88804d8b3f80
> R13: ffff888031c601c0 R14: ffffc900013c04f8 R15: 000000002a3e5567
> FS: 00007f56d897c6c0(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b3182b000 CR3: 0000000034ed6000 CR4: 0000000000350ef0
> Call Trace:
> <TASK>
> __refcount_dec include/linux/refcount.h:336 [inline]
> refcount_dec include/linux/refcount.h:351 [inline]
> inet_twsk_kill+0x758/0x9c0 net/ipv4/inet_timewait_sock.c:70
> inet_twsk_deschedule_put net/ipv4/inet_timewait_sock.c:221 [inline]
> inet_twsk_purge+0x725/0x890 net/ipv4/inet_timewait_sock.c:304
> tcp_twsk_purge+0x115/0x150 net/ipv4/tcp_minisocks.c:402
> tcp_sk_exit_batch+0x1c/0x170 net/ipv4/tcp_ipv4.c:3522
> ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
> setup_net+0x714/0xb40 net/core/net_namespace.c:375
> copy_net_ns+0x2f0/0x670 net/core/net_namespace.c:508
> create_new_namespaces+0x3ea/0xb10 kernel/nsproxy.c:110
> unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228
> ksys_unshare+0x419/0x970 kernel/fork.c:3323
> __do_sys_unshare kernel/fork.c:3394 [inline]
> __se_sys_unshare kernel/fork.c:3392 [inline]
> __x64_sys_unshare+0x31/0x40 kernel/fork.c:3392
> do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
> entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f56d7c7cee9
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007f56d897c0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
> RAX: ffffffffffffffda RBX: 00007f56d7dac1f0 RCX: 00007f56d7c7cee9
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000042000000
> RBP: 00007f56d7cc949e R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 000000000000006e R14: 00007f56d7dac1f0 R15: 00007ffe66454be8
> </TASK>
>
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 2:29 ` Kuniyuki Iwashima
@ 2024-08-11 5:42 ` Jason Xing
2024-08-11 13:24 ` Florian Westphal
2024-08-11 13:32 ` Florian Westphal
2 siblings, 0 replies; 16+ messages in thread
From: Jason Xing @ 2024-08-11 5:42 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
linux-kernel, netdev, pabeni, syzkaller-bugs
Hello Kuniyuki,
On Sun, Aug 11, 2024 at 10:35 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: 33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> >
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
>
> Eric, this is the weird report I was talking about at netdevconf :)
>
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
>
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
>
> The commit deferred refcount_inc(tw_refcount) slightly, until after the hash dance,
> so the twsk is now visible before tw_dr->tw_refcount is incremented.
>
> I came up with the diff below but was suspecting a bug in another place,
> possibly QEMU, so I haven't posted the diff officially.
>
> refcount_inc() was actually deferred, but it's still under an ehash lock,
> and inet_twsk_deschedule_put() must be serialised with the same ehash
> lock. Even inet_twsk_kill() performs the ehash lock dance before calling
> refcount_dec().
>
> So it should be impossible for the refcount_inc() to be invisible after the
> double lock/unlock and before the refcount_dec(); this report looks bogus to me :S
In normal cases, I agree with this, since inc/dec are all protected
under the spin lock. There is no way we can decrement it if we don't
increment it first.
Thanks,
Jason
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 2:29 ` Kuniyuki Iwashima
2024-08-11 5:42 ` Jason Xing
@ 2024-08-11 13:24 ` Florian Westphal
2024-08-11 14:54 ` Florian Westphal
2024-08-11 13:32 ` Florian Westphal
2 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 13:24 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
linux-kernel, netdev, pabeni, syzkaller-bugs
Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: 33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> >
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
>
> Eric, this is the weird report I was talking about at netdevconf :)
>
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
>
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
I think I see why.
The reported splat is from a kernel without the above commit.
But from the backtrace, we entered here:
if (net->ipv4.tcp_death_row.hashinfo->pernet) {
/* Even if tw_refcount == 1, we must clean up kernel reqsk */
inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
} else if (!purged_once) {
--------------> inet_twsk_purge(&tcp_hashinfo); // THIS
purged_once = true;
> The commit deferred refcount_inc(tw_refcount) slightly, until after the hash dance,
> so the twsk is now visible before tw_dr->tw_refcount is incremented.
>
> I came up with the diff below but was suspecting a bug in another place,
> possibly QEMU, so I haven't posted the diff officially.
>
> refcount_inc() was actually deferred, but it's still under an ehash lock,
but a different struct inet_hashinfo, so the locks don't help :/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 2:29 ` Kuniyuki Iwashima
2024-08-11 5:42 ` Jason Xing
2024-08-11 13:24 ` Florian Westphal
@ 2024-08-11 13:32 ` Florian Westphal
2024-08-11 22:35 ` Kuniyuki Iwashima
2 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 13:32 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: syzbot+8ea26396ff85d23a8929, davem, dsahern, edumazet, kuba,
linux-kernel, netdev, pabeni, syzkaller-bugs
Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: syzbot <syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com>
> Date: Sat, 10 Aug 2024 18:29:20 -0700
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: 33e02dc69afb Merge tag 'sound-6.10-rc1' of git://git.kerne..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=25544a2faf4bae65
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8ea26396ff85d23a8929
> > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-33e02dc6.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/573c88ac3233/vmlinux-33e02dc6.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/760a52b9a00a/bzImage-33e02dc6.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8ea26396ff85d23a8929@syzkaller.appspotmail.com
> >
> > ------------[ cut here ]------------
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
>
> Eric, this is the weird report I was talking about at netdevconf :)
>
> It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> than refcount_inc().
>
> I started to see the same splat at a very low rate after consuming
> commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
Could you share a splat you saw?
Was it also for the fallback tcp_hashinfo and/or from error
unwinding after an ops->init() failure on netns creation?
I don't think it makes sense to follow up on this specific report, given that
b334b924c9b7 made changes in this area; a post-b334b924c9b7 splat would be
more relevant, I think.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 13:24 ` Florian Westphal
@ 2024-08-11 14:54 ` Florian Westphal
2024-08-11 16:28 ` Florian Westphal
0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 14:54 UTC (permalink / raw)
To: Florian Westphal
Cc: Kuniyuki Iwashima, syzbot+8ea26396ff85d23a8929, davem, dsahern,
edumazet, kuba, linux-kernel, netdev, pabeni, syzkaller-bugs
Florian Westphal <fw@strlen.de> wrote:
> > I came up with the diff below but was suspecting a bug in another place,
> > possibly QEMU, so I haven't posted the diff officially.
> >
> > refcount_inc() was actually deferred, but it's still under an ehash lock,
>
> but different struct inet_hashinfo, so the locks don't help :/
No, the fallback is fine: per-netns tw_refcount, init_net ehash lock array,
so the same buckets should serialize on the same ehash lock.
https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
... shows two cores racing:
[ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
and
[ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
The first splat's backtrace shows an invocation of tcp_sk_exit_batch() from
the netns error-unwinding code.
The second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
likely walking the init_net tcp_hashinfo.
The warning on the second core is:
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
Looks like the netns cleanup work queue somehow skipped at least one tw sk,
hitting the above splat.
Then the first core did refcount_dec() on tw_refcount, which produces the
dec-to-0 warning (which makes sense if the "supposedly final" decrement was
already done).
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 14:54 ` Florian Westphal
@ 2024-08-11 16:28 ` Florian Westphal
2024-08-11 23:00 ` Kuniyuki Iwashima
0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-11 16:28 UTC (permalink / raw)
To: Florian Westphal
Cc: Kuniyuki Iwashima, syzbot+8ea26396ff85d23a8929, davem, dsahern,
edumazet, kuba, linux-kernel, netdev, pabeni, syzkaller-bugs
Florian Westphal <fw@strlen.de> wrote:
> https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
>
> ... shows two cores racing:
>
> [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> and
> [ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
>
>
> The first splat's backtrace shows an invocation of tcp_sk_exit_batch() from
> the netns error-unwinding code.
>
> The second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
... which doesn't work. Does this look like a plausible
theory/explanation?
Given:
1 exiting netns with >= 1 tw sk, and
1 (unrelated) netns that failed in setup_net,
... we run into the following race:
The exiting netns, from the cleanup wq, calls tcp_sk_exit_batch(), which calls
inet_twsk_purge(&tcp_hashinfo).
At the same time, from the error-unwinding code, we also call tcp_sk_exit_batch().
Both threads walk tcp_hashinfo ehash buckets.
From work queue (normal netns exit path), we hit
303 if (state == TCP_TIME_WAIT) {
304 inet_twsk_deschedule_put(inet_twsk(sk));
Because both threads operate on tcp_hashinfo, the unrelated
struct net (the exiting net) is also visible to the error-unwinding thread.
So, error unwinding code will call
303 if (state == TCP_TIME_WAIT) {
304 inet_twsk_deschedule_put(inet_twsk(sk));
for the same tw sk and both threads do
218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
219 {
220 if (del_timer_sync(&tw->tw_timer))
221 inet_twsk_kill(tw);
The error-unwind path cancels the timer and calls inet_twsk_kill(), while
the work queue sees the timer as already shut down, so it ends up
returning to tcp_sk_exit_batch(), where it will WARN here:
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
... because the supposedly-last tw_refcount decrement did not drop
it down to 0.
Meanwhile, the error-unwinding thread calls refcount_dec() on
tw_refcount, which now drops it to 0 instead of 1, which
produces the other warn splat.
I'll ponder on ways to fix this tomorrow unless someone
else already has better theory/solution.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 13:32 ` Florian Westphal
@ 2024-08-11 22:35 ` Kuniyuki Iwashima
0 siblings, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 22:35 UTC (permalink / raw)
To: fw
Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs
From: Florian Westphal <fw@strlen.de>
Date: Sun, 11 Aug 2024 15:32:00 +0200
> > > ------------[ cut here ]------------
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 3 PID: 1396 at lib/refcount.c:31 refcount_warn_saturate+0x1ed/0x210 lib/refcount.c:31
> >
> > Eric, this is the weird report I was talking about at netdevconf :)
> >
> > It seems refcount_dec(&tw->tw_dr->tw_refcount) is somehow done earlier
> > than refcount_inc().
> >
> > I started to see the same splat at a very low rate after consuming
> > commit b334b924c9b7 ("net: tcp/dccp: prepare for tw_timer un-pinning").
>
> Could you share a splat you saw?
>
> Was it also for the fallback tcp_hashinfo and/or from error
> unwinding after an ops->init() failure on netns creation?
>
> I don't think it makes sense to follow up on this specific report, given that
> b334b924c9b7 made changes in this area; a post-b334b924c9b7 splat would be
> more relevant, I think.
Ah, I missed that the initial syzbot report was created before the commit.
Here are the splats I saw:
[ Note both splats were mixed together in the log, as in syzbot's report,
so two threads were calling tcp_sk_exit_batch() concurrently. ]
---8<---
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 1760 at lib/refcount.c:31 refcount_warn_saturate (lib/refcount.c:31 (discriminator 3))
Modules linked in:
CPU: 0 PID: 1760 Comm: syz-executor.4 Not tainted 6.10.0-rc3-00688-g934c29999b57-dirty #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:refcount_warn_saturate (lib/refcount.c:31 (discriminator 3))
Code: 05 cd 8d aa 04 01 e8 48 e4 e9 fe 0f 0b e9 d3 fe ff ff e8 1c 36 1c ff 48 c7 c7 a0 4a 80 85 c6 05 aa 8d aa 04 01 e8 29 e4 e9 fe <0f> 0b e9 b4 fe ff ff 48 89 ef e8 fa 56 69 ff e9 5c fe ff ff 0f 1f
RSP: 0018:ffffc9000118fa78 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000b13b000
RDX: 0000000000040000 RSI: ffffffff8118dedf RDI: 0000000000000001
RBP: ffff88800bb59440 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 000000002d2d2d2d R12: ffff88800bb59440
R13: ffff88810130f3d0 R14: ffff8881013ee820 R15: ffff888102001198
FS: 00007f5252dbd640(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004a6128 CR3: 00000001038b8003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
inet_twsk_kill (./include/linux/refcount.h:336 ./include/linux/refcount.h:351 net/ipv4/inet_timewait_sock.c:70)
inet_twsk_deschedule_put (net/ipv4/inet_timewait_sock.c:266)
inet_twsk_purge (net/ipv4/inet_timewait_sock.c:349)
tcp_twsk_purge (net/ipv4/tcp_minisocks.c:402)
tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3512)
ops_exit_list (net/core/net_namespace.c:179)
setup_net (net/core/net_namespace.c:374 (discriminator 3))
copy_net_ns (net/core/net_namespace.c:510)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3325)
__x64_sys_unshare (kernel/fork.c:3392)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
RIP: 0033:0x7f5253a29e5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007f5252dbcc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 00000000004d4070 RCX: 00007f5253a29e5d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000
RBP: 00000000004d4070 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000006e R14: 00007f5253a8a530 R15: 0000000000000000
</TASK>
WARNING: CPU: 1 PID: 671 at net/ipv4/tcp_ipv4.c:3514 tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3514)
Modules linked in:
CPU: 1 PID: 671 Comm: kworker/u8:15 Not tainted 6.10.0-rc3-00688-g934c29999b57-dirty #20
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Workqueue: netns cleanup_net
RIP: 0010:tcp_sk_exit_batch (net/ipv4/tcp_ipv4.c:3514)
Code: 89 ee e8 9f 1b 28 fd 85 ed 0f 88 b3 00 00 00 e8 92 22 28 fd 31 ff 89 ee e8 89 1b 28 fd 85 ed 0f 84 9d 00 00 00 e8 7c 22 28 fd <0f> 0b e8 75 22 28 fd 48 89 df e8 ed 23 02 00 48 8d 7b 28 48 89 f8
RSP: 0018:ffffc90000ebfc60 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88800bb59140 RCX: ffffffff84230cb7
RDX: ffff888103871180 RSI: ffffffff84230cc4 RDI: 0000000000000005
RBP: 0000000000000002 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000002 R11: 00000000000002bd R12: ffff88800bb59440
R13: dffffc0000000000 R14: ffffc90000ebfd08 R15: fffffbfff0d7fb5c
FS: 0000000000000000(0000) GS:ffff88811af00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f274114ddf8 CR3: 000000010a11c006 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
ops_exit_list (net/core/net_namespace.c:179)
cleanup_net (net/core/net_namespace.c:639 (discriminator 3))
process_one_work (kernel/workqueue.c:3236)
worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393)
kthread (kernel/kthread.c:389)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
</TASK>
---8<---
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 16:28 ` Florian Westphal
@ 2024-08-11 23:00 ` Kuniyuki Iwashima
2024-08-11 23:08 ` Kuniyuki Iwashima
0 siblings, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 23:00 UTC (permalink / raw)
To: fw
Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs
From: Florian Westphal <fw@strlen.de>
Date: Sun, 11 Aug 2024 18:28:50 +0200
> Florian Westphal <fw@strlen.de> wrote:
> > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> >
> > ... shows two cores racing:
> >
> > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > and
> > [ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> >
> >
> > The first splat's backtrace shows an invocation of tcp_sk_exit_batch() from
> > the netns error-unwinding code.
> >
> > The second one lacks a backtrace, but it's also in tcp_sk_exit_batch(),
>
> ... which doesn't work. Does this look like a plausible
> theory/explanation?
Yes! The problem here is that inet_twsk_purge() operates on twsks that
are not in net_exit_list, but I think such a check would be overkill and
we can work around it in another way.
>
> Given:
> 1 exiting netns, has >= 1 tw sk.
> 1 (unrelated) netns that failed in setup_net
>
> ... we run into following race:
>
> exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> inet_twsk_purge(&tcp_hashinfo).
>
> At same time, from error unwinding code, we also call tcp_sk_exit_batch().
>
> Both threads walk tcp_hashinfo ehash buckets.
>
> From work queue (normal netns exit path), we hit
>
> 303 if (state == TCP_TIME_WAIT) {
> 304 inet_twsk_deschedule_put(inet_twsk(sk));
>
> Because both threads operate on tcp_hashinfo, the unrelated
> struct net (exiting net) is also visible to error-unwinding thread.
>
> So, error unwinding code will call
>
> 303 if (state == TCP_TIME_WAIT) {
> 304 inet_twsk_deschedule_put(inet_twsk(sk));
>
> for the same tw sk and both threads do
>
> 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> 219 {
> 220 if (del_timer_sync(&tw->tw_timer))
> 221 inet_twsk_kill(tw);
>
> Error unwind path cancel timer, calls inet_twsk_kill, while
> work queue sees timer as already shut-down so it ends up
> returning to tcp_sk_exit_batch(), where it will WARN here:
>
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> ... because the supposedly-last tw_refcount decrement did not drop
> it down to 0.
>
> Meanwhile, error unwiding thread calls refcount_dec() on
> tw_refcount, which now drops down to 0 instead of 1, which
> provides another warn splat.
>
> I'll ponder on ways to fix this tomorrow unless someone
> else already has better theory/solution.
We need to synchronize the two inet_twsk_kill() calls, so maybe give up
in one of them if the twsk is no longer hashed?
---8<---
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 337390ba85b4..51889567274b 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 	struct inet_bind_hashbucket *bhead, *bhead2;
 
 	spin_lock(lock);
-	sk_nulls_del_node_init_rcu((struct sock *)tw);
+	if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
+		spin_unlock(lock);
+		return false;
+	}
 	spin_unlock(lock);
 
 	/* Disassociate with bind bucket. */
---8<---
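The intent of the check above can be modeled in userspace. This is only a sketch with invented `model_*` names (not the kernel helpers): like sk_nulls_del_node_init_rcu(), the delete helper reports whether *this* caller actually unlinked the node, so a killer that lost the race can bail out early.

```c
#include <stdbool.h>

/* Userspace model, not the kernel API: like sk_nulls_del_node_init_rcu(),
 * report whether *this* caller unlinked the node from the hash. */
static bool model_del_node_init(bool *hashed)
{
	bool won = *hashed;

	*hashed = false;
	return won;
}

/* Sketch of the patched inet_twsk_kill(): give up if someone else
 * already unhashed the twsk. */
static bool model_twsk_kill(bool *hashed)
{
	if (!model_del_node_init(hashed))
		return false;	/* lost the race: another killer handles it */
	/* ... disassociate from bind bucket, drop tw_refcount, etc. ... */
	return true;
}
```

With this model, the second killer of the same twsk returns early instead of tearing it down twice.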
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 23:00 ` Kuniyuki Iwashima
@ 2024-08-11 23:08 ` Kuniyuki Iwashima
2024-08-12 0:36 ` Jason Xing
2024-08-12 14:01 ` Florian Westphal
0 siblings, 2 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-11 23:08 UTC (permalink / raw)
To: kuniyu
Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
syzbot+8ea26396ff85d23a8929, syzkaller-bugs
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Sun, 11 Aug 2024 16:00:29 -0700
> From: Florian Westphal <fw@strlen.de>
> Date: Sun, 11 Aug 2024 18:28:50 +0200
> > Florian Westphal <fw@strlen.de> wrote:
> > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > >
> > > ... shows at two cores racing:
> > >
> > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > and
> > > [ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > >
> > >
> > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > netns error unwinding code.
> > >
> > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> >
> > ... which doesn't work. Does this look like a plausible
> > theory/exlanation?
>
> Yes! The problem here is that inet_twsk_purge() operates on twsk
> not in net_exit_list, but I think such a check is overkill and we
> can work around it in another way.
>
>
> >
> > Given:
> > 1 exiting netns, has >= 1 tw sk.
> > 1 (unrelated) netns that failed in setup_net
> >
> > ... we run into following race:
> >
> > exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> > inet_twsk_purge(&tcp_hashinfo).
> >
> > At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> >
> > Both threads walk tcp_hashinfo ehash buckets.
> >
> > From work queue (normal netns exit path), we hit
> >
> > 303 if (state == TCP_TIME_WAIT) {
> > 304 inet_twsk_deschedule_put(inet_twsk(sk));
> >
> > Because both threads operate on tcp_hashinfo, the unrelated
> > struct net (exiting net) is also visible to error-unwinding thread.
> >
> > So, error unwinding code will call
> >
> > 303 if (state == TCP_TIME_WAIT) {
> > 304 inet_twsk_deschedule_put(inet_twsk(sk));
> >
> > for the same tw sk and both threads do
> >
> > 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > 219 {
> > 220 if (del_timer_sync(&tw->tw_timer))
> > 221 inet_twsk_kill(tw);
> >
> > Error unwind path cancel timer, calls inet_twsk_kill, while
> > work queue sees timer as already shut-down so it ends up
> > returning to tcp_sk_exit_batch(), where it will WARN here:
> >
> > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> >
> > ... because the supposedly-last tw_refcount decrement did not drop
> > it down to 0.
> >
> > Meanwhile, error unwiding thread calls refcount_dec() on
> > tw_refcount, which now drops down to 0 instead of 1, which
> > provides another warn splat.
> >
> > I'll ponder on ways to fix this tomorrow unless someone
> > else already has better theory/solution.
>
> We need to sync two inet_twsk_kill(), so maybe give up one
> if twsk is not hashed ?
>
> ---8<---
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 337390ba85b4..51889567274b 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> struct inet_bind_hashbucket *bhead, *bhead2;
>
> spin_lock(lock);
> - sk_nulls_del_node_init_rcu((struct sock *)tw);
> + if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> + spin_unlock(lock);
> + return false;
I forgot to remove the false here; it should be a bare return :)
> + }
> spin_unlock(lock);
>
> /* Disassociate with bind bucket. */
> ---8<---
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 23:08 ` Kuniyuki Iwashima
@ 2024-08-12 0:36 ` Jason Xing
2024-08-12 14:01 ` Florian Westphal
1 sibling, 0 replies; 16+ messages in thread
From: Jason Xing @ 2024-08-12 0:36 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
syzbot+8ea26396ff85d23a8929, syzkaller-bugs
On Mon, Aug 12, 2024 at 7:09 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Sun, 11 Aug 2024 16:00:29 -0700
> > From: Florian Westphal <fw@strlen.de>
> > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > Florian Westphal <fw@strlen.de> wrote:
> > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > >
> > > > ... shows at two cores racing:
> > > >
> > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > and
> > > > [ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > >
> > > >
> > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > netns error unwinding code.
> > > >
> > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > >
> > > ... which doesn't work. Does this look like a plausible
> > > theory/exlanation?
> >
> > Yes! The problem here is that inet_twsk_purge() operates on twsk
> > not in net_exit_list, but I think such a check is overkill and we
> > can work around it in another way.
> >
> >
> > >
> > > Given:
> > > 1 exiting netns, has >= 1 tw sk.
> > > 1 (unrelated) netns that failed in setup_net
> > >
> > > ... we run into following race:
> > >
> > > exiting netns, from cleanup wq, calls tcp_sk_exit_batch(), which calls
> > > inet_twsk_purge(&tcp_hashinfo).
> > >
> > > At same time, from error unwinding code, we also call tcp_sk_exit_batch().
> > >
> > > Both threads walk tcp_hashinfo ehash buckets.
> > >
> > > From work queue (normal netns exit path), we hit
> > >
> > > 303 if (state == TCP_TIME_WAIT) {
> > > 304 inet_twsk_deschedule_put(inet_twsk(sk));
> > >
> > > Because both threads operate on tcp_hashinfo, the unrelated
> > > struct net (exiting net) is also visible to error-unwinding thread.
> > >
> > > So, error unwinding code will call
> > >
> > > 303 if (state == TCP_TIME_WAIT) {
> > > 304 inet_twsk_deschedule_put(inet_twsk(sk));
> > >
> > > for the same tw sk and both threads do
> > >
> > > 218 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> > > 219 {
> > > 220 if (del_timer_sync(&tw->tw_timer))
> > > 221 inet_twsk_kill(tw);
> > >
> > > Error unwind path cancel timer, calls inet_twsk_kill, while
> > > work queue sees timer as already shut-down so it ends up
> > > returning to tcp_sk_exit_batch(), where it will WARN here:
> > >
> > > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> > >
> > > ... because the supposedly-last tw_refcount decrement did not drop
> > > it down to 0.
> > >
> > > Meanwhile, error unwiding thread calls refcount_dec() on
> > > tw_refcount, which now drops down to 0 instead of 1, which
> > > provides another warn splat.
Right, I should have thought of it last night. I noticed that two
'killers' may enter inet_twsk_kill() concurrently (at the time, I was
paying attention to tcp_abort(), which can cause such an issue), but I
let it go too easily :(
Thanks for the detailed explanation :)
> > >
> > > I'll ponder on ways to fix this tomorrow unless someone
> > > else already has better theory/solution.
> >
> > We need to sync two inet_twsk_kill(), so maybe give up one
> > if twsk is not hashed ?
> >
> > ---8<---
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 337390ba85b4..51889567274b 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > struct inet_bind_hashbucket *bhead, *bhead2;
> >
> > spin_lock(lock);
> > - sk_nulls_del_node_init_rcu((struct sock *)tw);
> > + if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > + spin_unlock(lock);
> > + return false;
>
> forgot to remove false, just return :)
It does make sense to me!
Thanks,
Jason
>
>
> > + }
> > spin_unlock(lock);
> >
> > /* Disassociate with bind bucket. */
> > ---8<---
>
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-11 23:08 ` Kuniyuki Iwashima
2024-08-12 0:36 ` Jason Xing
@ 2024-08-12 14:01 ` Florian Westphal
2024-08-12 14:30 ` Jason Xing
2024-08-12 20:00 ` Kuniyuki Iwashima
1 sibling, 2 replies; 16+ messages in thread
From: Florian Westphal @ 2024-08-12 14:01 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: davem, dsahern, edumazet, fw, kuba, linux-kernel, netdev, pabeni,
syzbot+8ea26396ff85d23a8929, syzkaller-bugs
Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> From: Kuniyuki Iwashima <kuniyu@amazon.com>
> Date: Sun, 11 Aug 2024 16:00:29 -0700
> > From: Florian Westphal <fw@strlen.de>
> > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > Florian Westphal <fw@strlen.de> wrote:
> > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > >
> > > > ... shows at two cores racing:
> > > >
> > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > and
> > > > [ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > >
> > > >
> > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > netns error unwinding code.
> > > >
> > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > >
> > > ... which doesn't work. Does this look like a plausible
> > > theory/exlanation?
> >
> > Yes! The problem here is that inet_twsk_purge() operates on twsk
> > not in net_exit_list, but I think such a check is overkill and we
> > can work around it in another way.
I'm not so sure. Once the 'other' inet_twsk_purge() has found the twsk and
called inet_twsk_kill(), 'our' task has to wait for that to complete.
We need to force proper ordering so that all twsks found
static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
struct net *net;
/*HERE*/tcp_twsk_purge(net_exit_list);
list_for_each_entry(net, net_exit_list, exit_list) {
inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
.... have gone through inet_twsk_kill(), so that tw_refcount has
dropped back to 1 before doing
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> > We need to sync two inet_twsk_kill(), so maybe give up one
> > if twsk is not hashed ?
Not sure; AFAIU only one thread enters inet_twsk_kill()
(the one that manages to deactivate the timer).
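The "only one caller wins" property can be replayed as a userspace sketch (invented `model_*` names, not the kernel API): timer deactivation behaves like an atomic test-and-clear, so exactly one of any number of callers observes the armed-to-disarmed transition and is the one allowed to call inet_twsk_kill().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model, not the kernel helper: like
 * del_timer_sync()/timer_shutdown_sync(), return true only for the
 * single caller that actually deactivates a still-armed timer. */
static bool model_timer_shutdown_sync(atomic_bool *armed)
{
	/* Atomic exchange: only the caller flipping true -> false "wins"
	 * and may therefore proceed to inet_twsk_kill(). */
	return atomic_exchange(armed, false);
}
```

Under this model a second inet_twsk_deschedule_put() on the same twsk returns without entering inet_twsk_kill(), matching the observation above.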
> > ---8<---
> > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > index 337390ba85b4..51889567274b 100644
> > --- a/net/ipv4/inet_timewait_sock.c
> > +++ b/net/ipv4/inet_timewait_sock.c
> > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > struct inet_bind_hashbucket *bhead, *bhead2;
> >
> > spin_lock(lock);
> > - sk_nulls_del_node_init_rcu((struct sock *)tw);
> > + if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > + spin_unlock(lock);
> > + return false;
>
> forgot to remove false, just return :)
I don't see how this helps; we need to wait until the 'stolen' twsk
has gone through inet_twsk_kill() and decremented tw_refcount.
Obviously it would be a bit simpler if we had a reliable reproducer :-)
Possible solutions I came up with so far:
1) revert b099ce2602d8 ("net: Batch inet_twsk_purge").
This commit replaced a net_eq(twsk_net(tw), ...) check with a check for
a dead netns (ns.count == 0).
Downside: we need to remove the purged_once trick that calls
inet_twsk_purge(&tcp_hashinfo) only once per exiting batch in
tcp_twsk_purge() as well.
As per the b099ce2602d8 changelog, this likely increases netns dismantle times.
Upside: simpler code, so this is my preferred solution.
No concurrent runoff anymore: by the time tcp_twsk_purge() returns, it has
called refcount_dec(->tw_refcount) for every twsk in the exiting netns
list, without another task stealing twsks owned by the exiting netns.
Solution 2: change tcp_sk_exit_batch like this:
tcp_twsk_purge(net_exit_list);
+ list_for_each_entry(net, net_exit_list, exit_list) {
+ while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
+ schedule();
+
+ }
list_for_each_entry(net, net_exit_list, exit_list) {
inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
This synchronizes two concurrent tcp_sk_exit_batch() calls via the
existing refcount; if netns setup error unwinding ran off with one of
'our' twsks, we wait until the other task has completed its refcount decrement.
I don't expect this to increase netns dismantle times; otherwise we'd have
seen the WARN_ON_ONCE splat frequently.
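Solution 2's wait loop can be sketched as a userspace model. Assumptions are labeled: a plain C11 atomic counter and sched_yield() stand in for the kernel refcount and schedule(), and the `model_*` names are invented for illustration.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Userspace sketch of solution 2 (invented model_* names, not kernel code):
 * the exiting path spins until every in-flight inet_twsk_kill() has dropped
 * its reference, so the final decrement reliably sees 1 -> 0. */
static atomic_int model_tw_refcount = 2;	/* 1 netns base + 1 stolen twsk */

static void *model_stolen_kill(void *arg)
{
	(void)arg;
	/* The other task's refcount_dec() for the stolen twsk. */
	atomic_fetch_sub(&model_tw_refcount, 1);
	return NULL;
}

static int model_exit_batch_wait(void)
{
	pthread_t t;

	pthread_create(&t, NULL, model_stolen_kill, NULL);

	/* Analogue of: while (refcount_read(...) > 1) schedule(); */
	while (atomic_load(&model_tw_refcount) > 1)
		sched_yield();

	pthread_join(t, NULL);

	/* The "final" decrement now performs the expected 1 -> 0 transition;
	 * return the resulting count. */
	return atomic_fetch_sub(&model_tw_refcount, 1) - 1;
}
```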
Solution 3:
Similar to 2), but via a mutex_lock/unlock pair:
static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
{
struct net *net;
mutex_lock(&tcp_exit_batch_mutex);
tcp_twsk_purge(net_exit_list);
list_for_each_entry(net, net_exit_list, exit_list) {
inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
tcp_fastopen_ctx_destroy(net);
}
mutex_unlock(&tcp_exit_batch_mutex);
}
Solution 4:
I have doubts wrt. tcp_twsk_purge()'s interaction with the tw timer
firing at the 'wrong' time. This is an independent "problem"; I might be
imagining things here.
Consider:
313 void inet_twsk_purge(struct inet_hashinfo *hashinfo)
314 {
[..]
321 for (slot = 0; slot <= ehash_mask; slot++, head++) {
the tw sk timer fires on another cpu, and inet_twsk_kill() does:
56 spin_lock(lock);
57 sk_nulls_del_node_init_rcu((struct sock *)tw);
58 spin_unlock(lock);
... then the other cpu gets preempted.
inet_twsk_purge() resumes and hits an empty chain head:
322 if (hlist_nulls_empty(&head->chain))
323 continue;
so we don't (can't) wait for the timer to run to completion.
If this sounds plausible to you, this gives us solution 4:
Restart the inet_twsk_purge() loop until tw_dr->tw_refcount has
dropped down to 1.
Alternatively (still assuming the above race is real), sk_nulls_del_node_init_rcu
needs to be moved down:
48 static void inet_twsk_kill(struct inet_timewait_sock *tw)
...
58 /* Disassociate with bind bucket. */
...
68 spin_unlock(&bhead->lock);
70 refcount_dec(&tw->tw_dr->tw_refcount);
+ spin_lock(lock);
+ sk_nulls_del_node_init_rcu((struct sock *)tw);
+ spin_unlock(lock);
71 inet_twsk_put(tw);
72 }
... so a concurrent purge() call will find us on the node list
(and then wait on timer_shutdown_sync())
until the other cpu executing the timer is done.
If the twsk was already unlinked from the table before
inet_twsk_purge() had a chance to find it, then in the worst
case the call to tcp_twsk_destructor() is missing, but I don't
see any ordering requirements that need us to wait for this.
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-12 14:01 ` Florian Westphal
@ 2024-08-12 14:30 ` Jason Xing
2024-08-12 15:03 ` Florian Westphal
2024-08-12 20:00 ` Kuniyuki Iwashima
1 sibling, 1 reply; 16+ messages in thread
From: Jason Xing @ 2024-08-12 14:30 UTC (permalink / raw)
To: Florian Westphal
Cc: Kuniyuki Iwashima, davem, dsahern, edumazet, kuba, linux-kernel,
netdev, pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs
Hello Florian,
On Mon, Aug 12, 2024 at 10:02 PM Florian Westphal <fw@strlen.de> wrote:
>
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Sun, 11 Aug 2024 16:00:29 -0700
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > > Florian Westphal <fw@strlen.de> wrote:
> > > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > >
> > > > > ... shows at two cores racing:
> > > > >
> > > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > > and
> > > > > [ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > >
> > > > >
> > > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > > netns error unwinding code.
> > > > >
> > > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > > >
> > > > ... which doesn't work. Does this look like a plausible
> > > > theory/exlanation?
> > >
> > > Yes! The problem here is that inet_twsk_purge() operates on twsk
> > > not in net_exit_list, but I think such a check is overkill and we
> > > can work around it in another way.
>
> I'm not so sure. Once 'other' inet_twsk_purge() found the twsk and
> called inet_twsk_kill(), 'our' task has to wait for that to complete.
>
> We need to force proper ordering so that all twsk found
>
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
> struct net *net;
>
> /*HERE*/tcp_twsk_purge(net_exit_list);
>
> list_for_each_entry(net, net_exit_list, exit_list) {
> inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>
> .... have gone through inet_twsk_kill() so tw_refcount managed to
> drop back to 1 before doing
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> .
>
> > > We need to sync two inet_twsk_kill(), so maybe give up one
> > > if twsk is not hashed ?
>
> Not sure, afaiu only one thread enters inet_twsk_kill()
> (the one that manages to deactivate the timer).
>
> > > ---8<---
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 337390ba85b4..51889567274b 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > > struct inet_bind_hashbucket *bhead, *bhead2;
> > >
> > > spin_lock(lock);
> > > - sk_nulls_del_node_init_rcu((struct sock *)tw);
> > > + if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > > + spin_unlock(lock);
> > > + return false;
> >
> > forgot to remove false, just return :)
>
> I don't see how this helps, we need to wait until 'stolen' twsk
> has gone through inet_twsk_kill() and decremented tw_refcount.
> Obviously It would be a bit simpler if we had a reliable reproducer :-)
Allow me to say something irrelevant to this bug report.
Do you think that Kuniyuki's patch can solve the race between two
'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
concurrently at two cores, say, inet_twsk_purge() and tcp_abort()?
It at least does help avoid decrementing tw_refcount twice in the
above case if I understand correctly.
Thanks,
Jason
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-12 14:30 ` Jason Xing
@ 2024-08-12 15:03 ` Florian Westphal
2024-08-12 15:49 ` Jason Xing
0 siblings, 1 reply; 16+ messages in thread
From: Florian Westphal @ 2024-08-12 15:03 UTC (permalink / raw)
To: Jason Xing
Cc: Florian Westphal, Kuniyuki Iwashima, davem, dsahern, edumazet,
kuba, linux-kernel, netdev, pabeni, syzbot+8ea26396ff85d23a8929,
syzkaller-bugs
Jason Xing <kerneljasonxing@gmail.com> wrote:
> > I don't see how this helps, we need to wait until 'stolen' twsk
> > has gone through inet_twsk_kill() and decremented tw_refcount.
> > Obviously It would be a bit simpler if we had a reliable reproducer :-)
>
> Allow me to say something irrelevant to this bug report.
>
> Do you think that Kuniyuki's patch can solve the race between two
> 'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
> concurrently at two cores, say, inet_twsk_purge() and tcp_abort()?
I don't think it's possible; tcp_abort() calls inet_twsk_deschedule_put(),
which does:
if (timer_shutdown_sync(&tw->tw_timer))
inet_twsk_kill(tw);
So I don't see how two concurrent callers, working on the same tw address,
would both be able to shut down the timer.
One will shut it down and call inet_twsk_kill(); the other will wait until
the callback has completed, but it does not call inet_twsk_kill().
> It at least does help avoid decrementing tw_refcount twice in the
> above case if I understand correctly.
I don't think the refcount is decremented twice.
The problem is that one thread is already at the 'final' decrement of
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
in tcp_sk_exit_batch(), while the other thread has not yet called
refcount_dec() on it (inet_twsk_kill() is still executing).
So we get two splats: refcount_dec_and_test() leaves the count at 1
instead of the expected 0, and the refcount_dec() coming right afterwards
from the other task observes the transition to 0, when it should have
dropped the count down to 1.
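This interleaving can be replayed deterministically with a toy counter. It is only a userspace model (the `model_refcount` type is invented here and is not the kernel's refcount_t): the count starts at 2 (the netns base reference plus one twsk), the "final" dec_and_test runs before the stolen twsk's decrement and so misses zero, and the later plain decrement is then the one that hits zero.

```c
#include <stdbool.h>

/* Toy model, not refcount_t: just enough to replay the interleaving. */
struct model_refcount { int val; };

/* Like refcount_dec_and_test(): true only if the count reached zero. */
static bool model_dec_and_test(struct model_refcount *r)
{
	return --r->val == 0;
}

/* Like refcount_dec(): report whether this decrement hit zero, the
 * condition behind "refcount_t: decrement hit 0; leaking memory". */
static bool model_dec_hit_zero(struct model_refcount *r)
{
	return --r->val == 0;
}
```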
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-12 15:03 ` Florian Westphal
@ 2024-08-12 15:49 ` Jason Xing
0 siblings, 0 replies; 16+ messages in thread
From: Jason Xing @ 2024-08-12 15:49 UTC (permalink / raw)
To: Florian Westphal
Cc: Kuniyuki Iwashima, davem, dsahern, edumazet, kuba, linux-kernel,
netdev, pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs
On Mon, Aug 12, 2024 at 11:03 PM Florian Westphal <fw@strlen.de> wrote:
>
> Jason Xing <kerneljasonxing@gmail.com> wrote:
> > > I don't see how this helps, we need to wait until 'stolen' twsk
> > > has gone through inet_twsk_kill() and decremented tw_refcount.
> > > Obviously It would be a bit simpler if we had a reliable reproducer :-)
> >
> > Allow me to say something irrelevant to this bug report.
> >
> > Do you think that Kuniyuki's patch can solve the race between two
> > 'killers' calling inet_twsk_deschedule_put()->inet_twsk_kill()
> > concurrently at two cores, say, inet_twsk_purge() and tcp_abort()?
>
> I don't think its possible, tcp_abort() calls inet_twsk_deschedule_put,
> which does:
>
> if (timer_shutdown_sync(&tw->tw_timer))
> inet_twsk_kill(tw);
>
> So I don't see how two concurrent callers, working on same tw address,
> would both be able to shut down the timer.
>
> One will shut it down and calls inet_twsk_kill(), other will wait until
> the callback has completed, but it doesn't call inet_twsk_kill().
Oh, thanks. Since timer_shutdown_sync() effectively acts as a lock, there
is indeed no way for inet_twsk_kill() to be called concurrently.
>
> > It at least does help avoid decrementing tw_refcount twice in the
> > above case if I understand correctly.
>
> I don't think the refcount is decremented twice.
>
> Problem is one thread is already at the 'final' decrement of
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> in tcp_sk_exit_batch(), while other thread has not yet called
> refcount_dec() on it (inet_twsk_kill still executing).
>
> So we get two splats, refcount_dec_and_test() returns 1 not expected 0
> and refcount_dec() coming right afterwards from other task observes the
> transition to 0, while it should have dropped down to 1.
I see.
Thanks,
Jason
* Re: [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill
2024-08-12 14:01 ` Florian Westphal
2024-08-12 14:30 ` Jason Xing
@ 2024-08-12 20:00 ` Kuniyuki Iwashima
1 sibling, 0 replies; 16+ messages in thread
From: Kuniyuki Iwashima @ 2024-08-12 20:00 UTC (permalink / raw)
To: fw
Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, netdev,
pabeni, syzbot+8ea26396ff85d23a8929, syzkaller-bugs
From: Florian Westphal <fw@strlen.de>
Date: Mon, 12 Aug 2024 16:01:04 +0200
> Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > From: Kuniyuki Iwashima <kuniyu@amazon.com>
> > Date: Sun, 11 Aug 2024 16:00:29 -0700
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Sun, 11 Aug 2024 18:28:50 +0200
> > > > Florian Westphal <fw@strlen.de> wrote:
> > > > > https://syzkaller.appspot.com/x/log.txt?x=117f3182980000
> > > > >
> > > > > ... shows at two cores racing:
> > > > >
> > > > > [ 3127.234402][ T1396] CPU: 3 PID: 1396 Comm: syz-executor.3 Not
> > > > > and
> > > > > [ 3127.257864][ T13] CPU: 1 PID: 13 Comm: kworker/u32:1 Not tainted 6.9.0-syzkalle (netns cleanup net).
> > > > >
> > > > >
> > > > > first splat backtrace shows invocation of tcp_sk_exit_batch() from
> > > > > netns error unwinding code.
> > > > >
> > > > > Second one lacks backtrace, but its also in tcp_sk_exit_batch(),
> > > >
> > > > ... which doesn't work. Does this look like a plausible
> > > > theory/exlanation?
> > >
> > > Yes! The problem here is that inet_twsk_purge() operates on twsk
> > > not in net_exit_list, but I think such a check is overkill and we
> > > can work around it in another way.
>
> I'm not so sure. Once 'other' inet_twsk_purge() found the twsk and
> called inet_twsk_kill(), 'our' task has to wait for that to complete.
>
> We need to force proper ordering so that all twsk found
>
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
> struct net *net;
>
> /*HERE*/tcp_twsk_purge(net_exit_list);
>
> list_for_each_entry(net, net_exit_list, exit_list) {
> inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
>
> .... have gone through inet_twsk_kill() so tw_refcount managed to
> drop back to 1 before doing
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> .
>
> > > We need to sync two inet_twsk_kill(), so maybe give up one
> > > if twsk is not hashed ?
>
> Not sure, afaiu only one thread enters inet_twsk_kill()
> (the one that manages to deactivate the timer).
Ah, you're right.
>
> > > ---8<---
> > > diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> > > index 337390ba85b4..51889567274b 100644
> > > --- a/net/ipv4/inet_timewait_sock.c
> > > +++ b/net/ipv4/inet_timewait_sock.c
> > > @@ -52,7 +52,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> > > struct inet_bind_hashbucket *bhead, *bhead2;
> > >
> > > spin_lock(lock);
> > > - sk_nulls_del_node_init_rcu((struct sock *)tw);
> > > + if (!sk_nulls_del_node_init_rcu((struct sock *)tw)) {
> > > + spin_unlock(lock);
> > > + return false;
> >
> > forgot to remove false, just return :)
>
> I don't see how this helps, we need to wait until 'stolen' twsk
> has gone through inet_twsk_kill() and decremented tw_refcount.
> Obviously It would be a bit simpler if we had a reliable reproducer :-)
>
> Possible solutions I came up with so far:
>
> 1) revert b099ce2602d8 ("net: Batch inet_twsk_purge").
>
> This commit replaced a net_eq(twsk_net(tw) ... with a check for
> dead netns (ns.count == 0),
>
> Downside: We need to remove the purged_once trick that calls
> inet_twsk_purge(&tcp_hashinfo) only once per exiting batch in
> tcp_twsk_purge() as well.
>
> As per b099ce2602d8 changelog, likely increases netns dismantle times.
Yes, so I think we should avoid this approach, although users could
mitigate the cost with a per-netns ehash.
>
> Upside: simpler code, so this is my preferred solution.
>
> No concurrent runoff anymore, by time tcp_twsk_purge() returns it has
> called refcount_dec(->tw_refcount) for every twsk in the exiting netns
> list, without other task stealing twsks owned by exiting netns.
>
> Solution 2: change tcp_sk_exit_batch like this:
>
> tcp_twsk_purge(net_exit_list);
>
> + list_for_each_entry(net, net_exit_list, exit_list) {
> + while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
> + schedule();
> +
> + }
>
> list_for_each_entry(net, net_exit_list, exit_list) {
> inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> This synchronizes two concurrent tcp_sk_exit_batch() calls via
> existing refcount; if netns setup error unwinding ran off with one of
> 'our' twsk, it will wait until other task has completed the refcount decrement.
>
> I don't expect it to increase netns dismantle times, else we'd have seen
> the WARN_ON_ONCE splat frequently.
>
> Solution 3:
I prefer this one because in all of my local reports I see two splats from
exit_batch calls.
If the same issue still happens with solution 3, we'll see the same
splats, with mostly no risk. But solution 2 would give us a hung task,
possibly in cleanup_net, which is worse.
Then, we can consider the other possibilities explained in solution 4,
or that a kernel twsk is put into the ehash of a dying netns.
Another option would be to add a ref_tracker for tcp_death_row.tw_refcount
to detect the root cause?
>
> Similar to 2), but via mutex_lock/unlock pair:
>
> static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
> {
> struct net *net;
>
> mutex_lock(&tcp_exit_batch_mutex);
>
> tcp_twsk_purge(net_exit_list);
>
> list_for_each_entry(net, net_exit_list, exit_list) {
> inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> tcp_fastopen_ctx_destroy(net);
> }
> mutex_unlock(&tcp_exit_batch_mutex);
> }
>
> Solution 4:
>
> I have doubts wrt. tcp_twsk_purge() interaction with tw timer firing at
> the 'wrong' time. This is independent "problem", I might be
> imagining things here.
>
> Consider:
> 313 void inet_twsk_purge(struct inet_hashinfo *hashinfo)
> 314 {
> [..]
> 321 for (slot = 0; slot <= ehash_mask; slot++, head++) {
>
> tw sk timer fires on other cpu, inet_twsk_kill() does:
>
> 56 spin_lock(lock);
> 57 sk_nulls_del_node_init_rcu((struct sock *)tw);
> 58 spin_unlock(lock);
>
> ... then other cpu gets preempted.
> inet_twsk_purge() resumes and hits empty chain head:
>
> 322 if (hlist_nulls_empty(&head->chain))
> 323 continue;
>
> so we don't(can't) wait for the timer to run to completion.
>
> If this sounds plausible to you, this gives us solution 4:
>
> Restart inet_twsk_purge() loop until tw_dr->tw_refcount) has
> dropped down to 1.
>
> Alternatively (still assuming the above race is real), sk_nulls_del_node_init_rcu
> needs to be moved down:
>
> 48 static void inet_twsk_kill(struct inet_timewait_sock *tw)
> ...
> 58 /* Disassociate with bind bucket. */
> ...
> 68 spin_unlock(&bhead->lock);
>
> 70 refcount_dec(&tw->tw_dr->tw_refcount);
>
> + spin_lock(lock);
> + sk_nulls_del_node_init_rcu((struct sock *)tw);
> + spin_unlock(lock);
> 71 inet_twsk_put(tw);
> 72 }
>
> ... so concurrent purge() call will find us
> the node list (and then wait on timer_shutdown_sync())
> until other cpu executing the timer is done.
>
> If twsk was unlinked from table already before
> inet_twsk_purge() had chance to find it sk, then in worst
> case call to tcp_twsk_destructor() is missing, but I don't
> see any ordering requirements that need us to wait for this.
end of thread, other threads:[~2024-08-12 20:00 UTC | newest]
Thread overview: 16+ messages
2024-08-11 1:29 [syzbot] [net?] WARNING: refcount bug in inet_twsk_kill syzbot
2024-08-11 2:29 ` Kuniyuki Iwashima
2024-08-11 5:42 ` Jason Xing
2024-08-11 13:24 ` Florian Westphal
2024-08-11 14:54 ` Florian Westphal
2024-08-11 16:28 ` Florian Westphal
2024-08-11 23:00 ` Kuniyuki Iwashima
2024-08-11 23:08 ` Kuniyuki Iwashima
2024-08-12 0:36 ` Jason Xing
2024-08-12 14:01 ` Florian Westphal
2024-08-12 14:30 ` Jason Xing
2024-08-12 15:03 ` Florian Westphal
2024-08-12 15:49 ` Jason Xing
2024-08-12 20:00 ` Kuniyuki Iwashima
2024-08-11 13:32 ` Florian Westphal
2024-08-11 22:35 ` Kuniyuki Iwashima