* [syzbot] [net?] INFO: rcu detected stall in unix_release
@ 2023-08-13 20:45 syzbot
2023-08-14 23:03 ` Jakub Kicinski
0 siblings, 1 reply; 9+ messages in thread
From: syzbot @ 2023-08-13 20:45 UTC (permalink / raw)
To: bpf, brauner, davem, edumazet, jiri, kuba, linux-kernel, netdev,
pabeni, syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: d0378ae6d16c Merge branch 'enetc-probe-fix'
git tree: net
console output: https://syzkaller.appspot.com/x/log.txt?x=1052ea2ba80000
kernel config: https://syzkaller.appspot.com/x/.config?x=fa5bd4cd5ab6259d
dashboard link: https://syzkaller.appspot.com/bug?extid=a3618a167af2021433cd
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1152c6eda80000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=13b1eddda80000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/c893f52cd6ab/disk-d0378ae6.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/dfb7a8b86a99/vmlinux-d0378ae6.xz
kernel image: https://storage.googleapis.com/syzbot-assets/cb9134e0a22c/bzImage-d0378ae6.xz
The issue was bisected to:
commit c2368b19807affd7621f7c4638cd2e17fec13021
Author: Jiri Pirko <jiri@nvidia.com>
Date: Fri Jul 29 07:10:35 2022 +0000
net: devlink: introduce "unregistering" mark and use it during devlinks iteration
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=134f1179a80000
final oops: https://syzkaller.appspot.com/x/report.txt?x=10cf1179a80000
console output: https://syzkaller.appspot.com/x/log.txt?x=174f1179a80000
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+a3618a167af2021433cd@syzkaller.appspotmail.com
Fixes: c2368b19807a ("net: devlink: introduce "unregistering" mark and use it during devlinks iteration")
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (10499 ticks this GP) idle=9774/1/0x4000000000000000 softirq=8757/8758 fqs=5219
rcu: hardirqs softirqs csw/system
rcu: number: 1 0 0
rcu: cputime: 26308 26181 17 ==> 52490(ms)
rcu: (t=10500 jiffies g=8417 q=457 ncpus=2)
CPU: 0 PID: 5047 Comm: syz-executor224 Not tainted 6.5.0-rc4-syzkaller-00212-gd0378ae6d16c #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
RIP: 0010:taprio_dequeue_tc_priority+0x263/0x4b0 net/sched/sch_taprio.c:798
Code: 8b 74 24 10 89 ef 44 89 f6 e8 29 b8 2c f9 44 39 f5 0f 84 40 ff ff ff e8 2b bd 2c f9 49 83 ff 0f 0f 87 e1 01 00 00 48 8b 04 24 <0f> b6 00 38 44 24 36 7c 08 84 c0 0f 85 bf 01 00 00 8b 33 8b 4c 24
RSP: 0018:ffffc90000007d60 EFLAGS: 00000293
RAX: ffffed10047a4a72 RBX: ffff888023d25394 RCX: 0000000000000100
RDX: ffff888028efbb80 RSI: ffffffff88594af5 RDI: 0000000000000004
RBP: 0000000000000008 R08: 0000000000000004 R09: 0000000000000008
R10: 0000000000000000 R11: ffffc90000007ff8 R12: 0000000000000010
R13: ffff88802d19ab60 R14: 0000000000000000 R15: 0000000000000001
FS: 0000555555857380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000600 CR3: 000000002cdd1000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
taprio_dequeue+0x12e/0x5f0 net/sched/sch_taprio.c:868
dequeue_skb net/sched/sch_generic.c:292 [inline]
qdisc_restart net/sched/sch_generic.c:397 [inline]
__qdisc_run+0x1c4/0x19d0 net/sched/sch_generic.c:415
qdisc_run include/net/pkt_sched.h:125 [inline]
qdisc_run include/net/pkt_sched.h:122 [inline]
net_tx_action+0x71e/0xc80 net/core/dev.c:5049
__do_softirq+0x218/0x965 kernel/softirq.c:553
invoke_softirq kernel/softirq.c:427 [inline]
__irq_exit_rcu kernel/softirq.c:632 [inline]
irq_exit_rcu+0xb7/0x120 kernel/softirq.c:644
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1109
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:645
RIP: 0010:unwind_next_frame+0x5ba/0x2020 arch/x86/kernel/unwind_orc.c:517
Code: 31 02 00 00 41 80 fe 04 0f 84 08 0c 00 00 41 80 fe 05 0f 85 d7 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 14 24 48 c1 ea 03 <80> 3c 02 00 0f 85 42 19 00 00 48 89 c8 4d 8b 7d 38 48 ba 00 00 00
RSP: 0018:ffffc90003b9f748 EFLAGS: 00000a02
RAX: dffffc0000000000 RBX: 0000000000000001 RCX: ffffffff8f3ed5c8
RDX: 1ffff92000773efe RSI: 0000000000000001 RDI: ffffffff8ec31910
RBP: ffffc90003b9f800 R08: ffffffff8f3ed646 R09: ffffffff8f3ed5cc
R10: ffffc90003b9f7b8 R11: 000000000000d9e9 R12: ffffc90003b9f808
R13: ffffc90003b9f7b8 R14: 0000000000000005 R15: 0000000000000000
arch_stack_walk+0x8b/0xf0 arch/x86/kernel/stacktrace.c:25
stack_trace_save+0x96/0xd0 kernel/stacktrace.c:122
kasan_save_stack+0x33/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
____kasan_slab_free+0x15e/0x1b0 mm/kasan/common.c:200
kasan_slab_free include/linux/kasan.h:162 [inline]
slab_free_hook mm/slub.c:1792 [inline]
slab_free_freelist_hook+0x10b/0x1e0 mm/slub.c:1818
slab_free mm/slub.c:3801 [inline]
kmem_cache_free+0xf0/0x490 mm/slub.c:3823
sk_prot_free net/core/sock.c:2122 [inline]
__sk_destruct+0x49e/0x770 net/core/sock.c:2216
sk_destruct+0xc2/0xf0 net/core/sock.c:2231
__sk_free+0xc4/0x3a0 net/core/sock.c:2242
sk_free+0x7c/0xa0 net/core/sock.c:2253
sock_put include/net/sock.h:1975 [inline]
unix_release_sock+0xa76/0xf70 net/unix/af_unix.c:668
unix_release+0x88/0xe0 net/unix/af_unix.c:1065
__sock_release+0xcd/0x290 net/socket.c:654
sock_close+0x1c/0x20 net/socket.c:1386
__fput+0x3fd/0xac0 fs/file_table.c:384
task_work_run+0x14d/0x240 kernel/task_work.c:179
resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x210/0x240 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:297
do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc3bb116ef7
Code: 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8
RSP: 002b:00007ffd1d8ead88 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 00007fc3bb116ef7
RDX: 0000000000000000 RSI: 0000000000008933 RDI: 0000000000000004
RBP: 00007ffd1d8ead90 R08: 0000000000000008 R09: 0000000000000004
R10: 000000000000000b R11: 0000000000000246 R12: 00007ffd1d8eafc0
R13: 00003faaaaaaaaaa R14: 00007ffd1d8eaff0 R15: 00007fc3bb164376
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
If the bug is already fixed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to change bug's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the bug is a duplicate of another bug, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-13 20:45 [syzbot] [net?] INFO: rcu detected stall in unix_release syzbot
@ 2023-08-14 23:03 ` Jakub Kicinski
  2023-08-15 11:28   ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2023-08-14 23:03 UTC (permalink / raw)
  To: syzbot, Vladimir Oltean
  Cc: bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs

Hi Vladimir, any ideas for this one?
The bisection looks pooped, FWIW, looks like a taprio inf loop.

On Sun, 13 Aug 2023 13:45:59 -0700 syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    d0378ae6d16c Merge branch 'enetc-probe-fix'
> git tree:       net
> console output: https://syzkaller.appspot.com/x/log.txt?x=1052ea2ba80000
> dashboard link: https://syzkaller.appspot.com/bug?extid=a3618a167af2021433cd
>
> [full syzbot report quoted above trimmed]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-14 23:03 ` Jakub Kicinski
@ 2023-08-15 11:28   ` Vladimir Oltean
  2023-08-16 22:57     ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Oltean @ 2023-08-15 11:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs

On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> Hi Vladimir, any ideas for this one?
> The bisection looks pooped, FWIW, looks like a taprio inf loop.

I'm looking into it.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-15 11:28 ` Vladimir Oltean
@ 2023-08-16 22:57   ` Vladimir Oltean
  2023-08-17  2:58     ` Jakub Kicinski
  2023-08-17 16:30     ` Jamal Hadi Salim
  0 siblings, 2 replies; 9+ messages in thread
From: Vladimir Oltean @ 2023-08-16 22:57 UTC (permalink / raw)
  To: Jakub Kicinski, Jamal Hadi Salim, Cong Wang, Pedro Tammela,
	Victor Nogueira
  Cc: syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs, Vinicius Costa Gomes

Hi Jakub,

On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> > Hi Vladimir, any ideas for this one?
> > The bisection looks pooped, FWIW, looks like a taprio inf loop.
>
> I'm looking into it.

Here's what I've found out and what help I'll need going forward.

Indeed there is an infinite loop in taprio_dequeue() ->
taprio_dequeue_tc_priority(), leading to an RCU stall.

Short description of taprio_dequeue_tc_priority(): it cycles
q->cur_txq[tc] in the range between [ offset, offset + count ), where:

	int offset = dev->tc_to_txq[tc].offset;
	int count = dev->tc_to_txq[tc].count;

with the initial q->cur_txq[tc], aka the "first_txq" variable, being set
by the control path: taprio_change(), also called by taprio_init():

	if (mqprio) {
		(...)
		for (i = 0; i < mqprio->num_tc; i++) {
			(...)
			q->cur_txq[i] = mqprio->offset[i];
		}
	}

In the buggy case that leads to the RCU stall, the line in taprio_change()
which sets q->cur_txq[i] never gets executed. So first_txq will be 0
(pre-initialized memory), and if that's outside of the
[ offset, offset + count ) range that taprio_dequeue_tc_priority() ->
taprio_next_tc_txq() expects to cycle through, the kernel is toast.

The nitty gritty of that is boring.
It's because "mqprio" is NULL, and that's because taprio_change() (important: also tail-called from taprio_init()) has this logic to detect a change in the traffic class settings of the device, compared to the passed TCA_TAPRIO_ATTR_PRIOMAP netlink attribute: /* no changes - no new mqprio settings */ if (!taprio_mqprio_cmp(q, dev, mqprio)) mqprio = NULL; And what happens is that: - we go through taprio_init() - a TCA_TAPRIO_ATTR_PRIOMAP gets passed to us - taprio_mqprio_cmp() sees that there's no change compared to the netdev's existing traffic class config - taprio_change() sets "mqprio" to NULL, ignoring the given TCA_TAPRIO_ATTR_PRIOMAP - we skip modifying q->cur_txq[i], as if it was a taprio_change() call that came straight from Qdisc_ops :: change(), rather than what it really is: one from Qdisc_ops :: init() So the next question: why does taprio_mqprio_cmp() see that there's no change? Because there is no change. When Qdisc_ops :: init() is called, the netdev really has a non-zero dev->num_tc, prio_tc_map, tc_to_txq and all that. But why? A previous taprio, if that existed, will call taprio_destroy() -> netdev_reset_tc(), so it won't leave state behind that will hinder the current taprio. Checking for stuff in the netdev state is just so that taprio_change() can distinguish between a direct Qdisc_ops :: change() call vs one coming from init(). Finally, here's where the syzbot repro becomes relevant. It crafts the RTM_NEWQDISC netlink message in such a way, that it makes tc_modify_qdisc() in sch_api.c call a Qdisc_ops sequence with which taprio wasn't written in mind. With "tc qdisc replace && tc qdisc replace", tc_modify_qdisc() is supposed to call init() the first time and replace() the second time. What the repro does is make the above sequence call two init() methods back to back. 
To create an iproute2-based reproducer rather than the C one provided by
syzbot, we need this iproute2 change:

diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index 56086c43b7fa..20d9622b6bf3 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -448,6 +448,8 @@ int do_qdisc(int argc, char **argv)
 		return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1);
 	if (matches(*argv, "change") == 0)
 		return tc_qdisc_modify(RTM_NEWQDISC, 0, argc-1, argv+1);
+	if (strcmp(*argv, "replace-exclusive") == 0)
+		return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL, argc-1, argv+1);
 	if (matches(*argv, "replace") == 0)
 		return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE, argc-1, argv+1);
 	if (matches(*argv, "link") == 0)

which basically implements a crafted alternative of "tc qdisc replace"
which also sets the NLM_F_EXCL flag in n->nlmsg_flags.

Then, the minimal repro script can simply be expressed as:

	#!/bin/bash

	ip link add veth0 numtxqueues 16 numrxqueues 16 type veth peer name veth1
	ip link set veth0 up && ip link set veth1 up

	for ((i = 0; i < 2; i++)); do
		tc qdisc replace-exclusive dev veth0 root stab overhead 24 taprio \
			num_tc 2 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
			queues 8@0 4@8 \
			clockid REALTIME \
			base-time 0 \
			cycle-time 61679 \
			sched-entry S 0 54336 \
			sched-entry S 0x8a27 7343 \
			max-sdu 18343 18343 \
			flags 0
	done

	ip link del veth0

Here's how things go sideways if sch_api.c goes through the
Qdisc_ops :: init() code path instead of change() for the second Qdisc.

The first taprio_attach() (i=0) will attach the root taprio Qdisc (aka
itself) to all netdev TX queues, and qdisc_put() the existing pfifo
default Qdiscs.

When the taprio_init() method executes for i=1, taprio_destroy() hasn't
been called yet. So neither has netdev_reset_tc() been called, and
that's part of the problem (the one that causes the infinite loop in
dequeue()).

But, taprio_destroy() will finally get called for the initial taprio
created at i=0.
The call trace looks like this:

rtnetlink_rcv_msg()
-> tc_modify_qdisc()
   -> qdisc_graft()
      -> taprio_attach() for i=1
         -> qdisc_put() for the old Qdiscs attached to the TX queues,
            aka the taprio from i=0
            -> __qdisc_destroy()
               -> taprio_destroy()

What's more interesting is that the late taprio_destroy() for i=0
effectively destroys the netdev state - the netdev_reset_tc() call -
done by taprio_init() -> taprio_change() for i=1, and that can't be too
good, either. Even if there's no immediately observable hang, the
traffic classes are reset even though the Qdisc thinks they aren't.

Taprio isn't the only one affected by this. Mqprio also has the pattern
of calling netdev_set_num_tc() from Qdisc_ops :: init() and destroy().
But with the possibility of destroy(i=0) not being serialized with
init(i=1), that's buggy.

Sorry for the long message. This is where I'm at. For me, this is the
bottom of where things are intuitive. I don't understand what is
considered to be expected behavior from tc_modify_qdisc(), and what is
considered to be sane Qdisc-facing API, and I need help.

I've completely stopped debugging when I saw that the code enters
through this path at i=1, so I really can't tell you more:

	/* This magic test requires explanation.
	 *
	 * We know, that some child q is already
	 * attached to this parent and have choice:
	 * either to change it or to create/graft new one.
	 *
	 * 1. We are allowed to create/graft only
	 * if CREATE and REPLACE flags are set.
	 *
	 * 2. If EXCL is set, requestor wanted to say,
	 * that qdisc tcm_handle is not expected
	 * to exist, so that we choose create/graft too.
	 *
	 * 3. The last case is when no flags are set.
	 * Alas, it is sort of hole in API, we
	 * cannot decide what to do unambiguously.
	 * For now we select create/graft, if
	 * user gave KIND, which does not match existing.
	 */
	if ((n->nlmsg_flags & NLM_F_CREATE) &&
	    (n->nlmsg_flags & NLM_F_REPLACE) &&
	    ((n->nlmsg_flags & NLM_F_EXCL) ||
	     (tca[TCA_KIND] &&
	      nla_strcmp(tca[TCA_KIND], q->ops->id)))) {
		netdev_err(dev, "magic test\n");
		goto create_n_graft;
	}

I've added more Qdisc people to the discussion. The problem description
is pretty much self-contained in this email, and going to the original
syzbot report won't bring much else.

There are multiple workarounds that can be done in taprio (and mqprio)
depending on what is considered as being sane API. Though I don't want
to get ahead of myself. Maybe there is a way to fast-forward the
qdisc_destroy() of the previous taprio so it doesn't overlap with the
new one's qdisc_create().

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-16 22:57 ` Vladimir Oltean
@ 2023-08-17  2:58   ` Jakub Kicinski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2023-08-17 2:58 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Jamal Hadi Salim, Cong Wang, Pedro Tammela, Victor Nogueira,
	syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs, Vinicius Costa Gomes

On Thu, 17 Aug 2023 01:57:59 +0300 Vladimir Oltean wrote:
> There are multiple workarounds that can be done in taprio (and mqprio)
> depending on what is considered as being sane API. Though I don't want
> to get ahead of myself. Maybe there is a way to fast-forward the
> qdisc_destroy() of the previous taprio so it doesn't overlap with the
> new one's qdisc_create().

Thanks for the details. I'm going to let others comment, but sounds
a bit similar to the recent problem with the ingress qdisc. The qdisc
expects to own the netdev which explodes when its lifetime rules are
fully exercised :(

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release 2023-08-16 22:57 ` Vladimir Oltean 2023-08-17 2:58 ` Jakub Kicinski @ 2023-08-17 16:30 ` Jamal Hadi Salim 2023-08-18 15:27 ` Jamal Hadi Salim 1 sibling, 1 reply; 9+ messages in thread From: Jamal Hadi Salim @ 2023-08-17 16:30 UTC (permalink / raw) To: Vladimir Oltean Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni, syzkaller-bugs, Vinicius Costa Gomes On Wed, Aug 16, 2023 at 6:58 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > Hi Jakub, > > On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote: > > On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote: > > > Hi Vladimir, any ideas for this one? > > > The bisection looks pooped, FWIW, looks like a taprio inf loop. > > > > I'm looking into it. > > Here's what I've found out and what help I'll need going forward. > > Indeed there is an infinite loop in taprio_dequeue() -> taprio_dequeue_tc_priority(), > leading to an RCU stall. > > Short description of taprio_dequeue_tc_priority(): it cycles > q->cur_txq[tc] in the range between [ offset, offset + count ), where: > > int offset = dev->tc_to_txq[tc].offset; > int count = dev->tc_to_txq[tc].count; > > with the initial q->cur_txq[tc], aka the "first_txq" variable, being set > by the control path: taprio_change(), also called by taprio_init(): > > if (mqprio) { > (...) > for (i = 0; i < mqprio->num_tc; i++) { > (...) > q->cur_txq[i] = mqprio->offset[i]; > } > } > > In the buggy case that leads to the RCU stall, the line in taprio_change() > which sets q->cur_txq[i] never gets executed. So first_txq will be 0 > (pre-initialized memory), and if that's outside of the [ offset, offset + count ) > range that taprio_dequeue_tc_priority() -> taprio_next_tc_txq() expects > to cycle through, the kernel is toast. > > The nitty gritty of that is boring. 
What's not boring is how come the > control path skips the q->cur_txq[i] assignment. It's because "mqprio" > is NULL, and that's because taprio_change() (important: also tail-called > from taprio_init()) has this logic to detect a change in the traffic > class settings of the device, compared to the passed TCA_TAPRIO_ATTR_PRIOMAP > netlink attribute: > > /* no changes - no new mqprio settings */ > if (!taprio_mqprio_cmp(q, dev, mqprio)) > mqprio = NULL; > > And what happens is that: > - we go through taprio_init() > - a TCA_TAPRIO_ATTR_PRIOMAP gets passed to us > - taprio_mqprio_cmp() sees that there's no change compared to the > netdev's existing traffic class config > - taprio_change() sets "mqprio" to NULL, ignoring the given > TCA_TAPRIO_ATTR_PRIOMAP > - we skip modifying q->cur_txq[i], as if it was a taprio_change() call > that came straight from Qdisc_ops :: change(), rather than what it > really is: one from Qdisc_ops :: init() > > So the next question: why does taprio_mqprio_cmp() see that there's no > change? Because there is no change. When Qdisc_ops :: init() is called, > the netdev really has a non-zero dev->num_tc, prio_tc_map, tc_to_txq and > all that. > > But why? A previous taprio, if that existed, will call taprio_destroy() > -> netdev_reset_tc(), so it won't leave state behind that will hinder > the current taprio. Checking for stuff in the netdev state is just so > that taprio_change() can distinguish between a direct Qdisc_ops :: change() > call vs one coming from init(). > > Finally, here's where the syzbot repro becomes relevant. It crafts the > RTM_NEWQDISC netlink message in such a way, that it makes tc_modify_qdisc() > in sch_api.c call a Qdisc_ops sequence with which taprio wasn't written > in mind. > > With "tc qdisc replace && tc qdisc replace", tc_modify_qdisc() is > supposed to call init() the first time and replace() the second time. > What the repro does is make the above sequence call two init() methods > back to back. 
> > To create an iproute2-based reproducer rather than the C one provided by > syzbot, we need this iproute2 change: > > diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c > index 56086c43b7fa..20d9622b6bf3 100644 > --- a/tc/tc_qdisc.c > +++ b/tc/tc_qdisc.c > @@ -448,6 +448,8 @@ int do_qdisc(int argc, char **argv) > return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1); > if (matches(*argv, "change") == 0) > return tc_qdisc_modify(RTM_NEWQDISC, 0, argc-1, argv+1); > + if (strcmp(*argv, "replace-exclusive") == 0) > + return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL, argc-1, argv+1); > if (matches(*argv, "replace") == 0) > return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE, argc-1, argv+1); > if (matches(*argv, "link") == 0) > > which basically implements a crafted alternative of "tc qdisc replace" > which also sets the NLM_F_EXCL flag in n->nlmsg_flags. > > Then, the minimal repro script can simply be expressed as: > > #!/bin/bash > > ip link add veth0 numtxqueues 16 numrxqueues 16 type veth peer name veth1 > ip link set veth0 up && ip link set veth1 up > > for ((i = 0; i < 2; i++)); do > tc qdisc replace-exclusive dev veth0 root stab overhead 24 taprio \ > num_tc 2 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ > queues 8@0 4@8 \ > clockid REALTIME \ > base-time 0 \ > cycle-time 61679 \ > sched-entry S 0 54336 \ > sched-entry S 0x8a27 7343 \ > max-sdu 18343 18343 \ > flags 0 > done > > ip link del veth0 > > Here's how things go sideways if sch_api.c goes through the Qdisc_ops :: init() > code path instead of change() for the second Qdisc. > > The first taprio_attach() (i=0) will attach the root taprio Qdisc (aka itself) > to all netdev TX queues, and qdisc_put() the existing pfifo default Qdiscs. > > When the taprio_init() method executes for i=1, taprio_destroy() hasn't > been called yet. 
> So neither has netdev_reset_tc() been called, and
> that's part of the problem (the one that causes the infinite loop in
> dequeue()).
>
> But, taprio_destroy() will finally get called for the initial taprio
> created at i=0. The call trace looks like this:
>
> rtnetlink_rcv_msg()
> -> tc_modify_qdisc()
> -> qdisc_graft()
> -> taprio_attach() for i=1
> -> qdisc_put() for the old Qdiscs attached to the TX queues, aka the taprio from i=0
> -> __qdisc_destroy()
> -> taprio_destroy()
>
> What's more interesting is that the late taprio_destroy() for i=0
> effectively destroys, via its netdev_reset_tc() call, the netdev state
> set up by taprio_init() -> taprio_change() for i=1, and that can't be
> too good, either. Even if there's no immediately observable hang, the
> traffic classes are reset even though the Qdisc thinks they aren't.
>
> Taprio isn't the only one affected by this. Mqprio also has the pattern
> of calling netdev_set_num_tc() from Qdisc_ops :: init() and destroy().
> But with the possibility of destroy(i=0) not being serialized with
> init(i=1), that's buggy.
>
> Sorry for the long message. This is where I'm at. For me, this is the
> bottom of where things are intuitive. I don't understand what is
> considered to be expected behavior from tc_modify_qdisc(), and what is
> considered to be a sane Qdisc-facing API, and I need help.
>
> I've completely stopped debugging when I saw that the code enters
> through this path at i=1, so I really can't tell you more:
>
> /* This magic test requires explanation.
>  *
>  * We know, that some child q is already
>  * attached to this parent and have choice:
>  * either to change it or to create/graft new one.
>  *
>  * 1. We are allowed to create/graft only
>  * if CREATE and REPLACE flags are set.
>  *
>  * 2. If EXCL is set, requestor wanted to say,
>  * that qdisc tcm_handle is not expected
>  * to exist, so that we choose create/graft too.
>  *
>  * 3. The last case is when no flags are set.
>  * Alas, it is sort of hole in API, we
>  * cannot decide what to do unambiguously.
>  * For now we select create/graft, if
>  * user gave KIND, which does not match existing.
>  */
> if ((n->nlmsg_flags & NLM_F_CREATE) &&
>     (n->nlmsg_flags & NLM_F_REPLACE) &&
>     ((n->nlmsg_flags & NLM_F_EXCL) ||
>      (tca[TCA_KIND] &&
>       nla_strcmp(tca[TCA_KIND], q->ops->id)))) {
> 	netdev_err(dev, "magic test\n");
> 	goto create_n_graft;
> }
>
> I've added more Qdisc people to the discussion. The problem description
> is pretty much self-contained in this email, and going to the original
> syzbot report won't bring much else.

I will take a look tomorrow.

cheers,
jamal

> There are multiple workarounds that can be done in taprio (and mqprio)
> depending on what is considered as being a sane API. Though I don't want
> to get ahead of myself. Maybe there is a way to fast-forward the
> qdisc_destroy() of the previous taprio so it doesn't overlap with the
> new one's qdisc_create().

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-17 16:30         ` Jamal Hadi Salim
@ 2023-08-18 15:27           ` Jamal Hadi Salim
  2023-08-18 16:07             ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Jamal Hadi Salim @ 2023-08-18 15:27 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot,
	bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni,
	syzkaller-bugs, Vinicius Costa Gomes

[-- Attachment #1: Type: text/plain, Size: 10390 bytes --]

On Thu, Aug 17, 2023 at 12:30 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Aug 16, 2023 at 6:58 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> >
> > Hi Jakub,
> >
> > On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote:
> > > On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> > > > Hi Vladimir, any ideas for this one?
> > > > The bisection looks pooped, FWIW, looks like a taprio inf loop.
> > >
> > > I'm looking into it.
> >
> > Here's what I've found out and what help I'll need going forward.
> >
> > Indeed there is an infinite loop in taprio_dequeue() -> taprio_dequeue_tc_priority(),
> > leading to an RCU stall.
> >
> > Short description of taprio_dequeue_tc_priority(): it cycles
> > q->cur_txq[tc] in the range between [ offset, offset + count ), where:
> >
> > 	int offset = dev->tc_to_txq[tc].offset;
> > 	int count = dev->tc_to_txq[tc].count;
> >
> > with the initial q->cur_txq[tc], aka the "first_txq" variable, being set
> > by the control path: taprio_change(), also called by taprio_init():
> >
> > 	if (mqprio) {
> > 		(...)
> > 		for (i = 0; i < mqprio->num_tc; i++) {
> > 			(...)
> > 			q->cur_txq[i] = mqprio->offset[i];
> > 		}
> > 	}
> >
> > In the buggy case that leads to the RCU stall, the line in taprio_change()
> > which sets q->cur_txq[i] never gets executed.
> > So first_txq will be 0
> > (pre-initialized memory), and if that's outside of the [ offset, offset + count )
> > range that taprio_dequeue_tc_priority() -> taprio_next_tc_txq() expects
> > to cycle through, the kernel is toast.
> >
> > The nitty gritty of that is boring. [... the remainder of the analysis
> > is snipped here; it is quoted in full in the previous message ...]
>
> I will take a look tomorrow.

Can you try the attached patchlet?
cheers,
jamal

> cheers,
> jamal
>
> > There are multiple workarounds that can be done in taprio (and mqprio)
> > depending on what is considered as being a sane API. Though I don't want
> > to get ahead of myself. Maybe there is a way to fast-forward the
> > qdisc_destroy() of the previous taprio so it doesn't overlap with the
> > new one's qdisc_create().

[-- Attachment #2: patchlet-qdisc --]
[-- Type: application/octet-stream, Size: 2233 bytes --]

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index aa6b1fe65151..8409b61d3313 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1551,6 +1551,25 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
  * Create/change qdisc.
  */

+static inline bool cmd_create_or_replace(struct nlmsghdr *n)
+{
+	return (n->nlmsg_flags & NLM_F_CREATE &&
+		n->nlmsg_flags & NLM_F_REPLACE);
+}
+
+static inline bool cmd_create_exclusive(struct nlmsghdr *n)
+{
+	return (n->nlmsg_flags & NLM_F_CREATE &&
+		n->nlmsg_flags & NLM_F_EXCL);
+}
+
+static inline bool cmd_change(struct nlmsghdr *n)
+{
+	return (!(n->nlmsg_flags & NLM_F_CREATE) &&
+		!(n->nlmsg_flags & NLM_F_REPLACE) &&
+		!(n->nlmsg_flags & NLM_F_EXCL));
+}
+
 static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			   struct netlink_ext_ack *extack)
 {
@@ -1659,12 +1678,17 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 				 * For now we select create/graft, if
 				 * user gave KIND, which does not match existing.
 				 */
-				if ((n->nlmsg_flags & NLM_F_CREATE) &&
-				    (n->nlmsg_flags & NLM_F_REPLACE) &&
-				    ((n->nlmsg_flags & NLM_F_EXCL) ||
-				     (tca[TCA_KIND] &&
-				      nla_strcmp(tca[TCA_KIND], q->ops->id))))
-					goto create_n_graft;
+				if (tca[TCA_KIND] &&
+				    nla_strcmp(tca[TCA_KIND], q->ops->id)) {
+					if (cmd_create_or_replace(n) ||
+					    cmd_create_exclusive(n)) {
+						goto create_n_graft;
+					} else {
+						if (cmd_change(n))
+							goto create_n_graft2;
+					}
+				}
+			}
 		}
 	} else {
@@ -1698,6 +1722,7 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 		NL_SET_ERR_MSG(extack, "Qdisc not found. To create specify NLM_F_CREATE flag");
 		return -ENOENT;
 	}
+create_n_graft2:
 	if (clid == TC_H_INGRESS) {
 		if (dev_ingress_queue(dev)) {
 			q = qdisc_create(dev, dev_ingress_queue(dev),

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-18 15:27           ` Jamal Hadi Salim
@ 2023-08-18 16:07             ` Vladimir Oltean
  2023-08-18 17:10               ` Jamal Hadi Salim
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Oltean @ 2023-08-18 16:07 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot,
	bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni,
	syzkaller-bugs, Vinicius Costa Gomes

Hi Jamal,

On Fri, Aug 18, 2023 at 11:27:27AM -0400, Jamal Hadi Salim wrote:
> Can you try the attached patchlet?

Thanks for the patch. I've tried it, and it eliminates the code path
(and thus the problem) exposed by the syzbot program, by responding to
RTM_NEWQDISC messages having the NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL
flags with "Error: Exclusivity flag on, cannot modify.".

Actually, to be precise, the first such netlink message successfully
creates the qdisc, but then the subsequent ones leave that qdisc alone
(don't change it), by failing with this extack message.

If that's the behavior that you intended, then I guess the answer is
that it works. Thanks a lot.

What would be an appropriate Fixes: tag?

Side note: I believe that we can now also revert commit be3618d96510
("net/sched: taprio: fix slab-out-of-bounds Read in taprio_dequeue_from_txq"),
which was papering over an unknown (at the time) issue - the same as
this one - without really even completely covering it, either. Hence
this other syzbot report.

https://lore.kernel.org/netdev/3b977f76-0289-270e-8310-179315ee927d@huawei.com/T/
https://lore.kernel.org/netdev/20230608062756.3626573-1-shaozhengchao@huawei.com/

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-18 16:07             ` Vladimir Oltean
@ 2023-08-18 17:10               ` Jamal Hadi Salim
  0 siblings, 0 replies; 9+ messages in thread
From: Jamal Hadi Salim @ 2023-08-18 17:10 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot,
	bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni,
	syzkaller-bugs, Vinicius Costa Gomes, Zhengchao Shao

Hi Vladimir,

On Fri, Aug 18, 2023 at 12:07 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> Hi Jamal,
>
> On Fri, Aug 18, 2023 at 11:27:27AM -0400, Jamal Hadi Salim wrote:
> > Can you try the attached patchlet?
>
> Thanks for the patch. I've tried it, and it eliminates the code path
> (and thus the problem) exposed by the syzbot program, by responding to
> RTM_NEWQDISC messages having the NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL
> flags with "Error: Exclusivity flag on, cannot modify.".

Ok, that is more of the expected behavior. No one should ever send that
mumbo-jumbo (I doubt there is a "legit" control app that will do that).

> Actually, to be precise, the first such netlink message successfully
> creates the qdisc, but then the subsequent ones leave that qdisc alone
> (don't change it), by failing with this extack message.

Yes, the first one will succeed because the root qdisc hasn't been
grafted yet (and the only interesting bit is NLM_F_CREATE; everything
else is ignored).

> If that's the behavior that you intended, then I guess the answer is
> that it works. Thanks a lot.
>
> What would be an appropriate Fixes: tag?

This should have been from the early days, when we trusted that iproute2
would do the right thing. I will look. I don't think taprio is the only
potential victim; it's just that syzbot was able to aggravate taprio
sooner (it probably would have got to some other qdisc later in its
adventures).
> Side note: I believe that we can now also revert commit be3618d96510
> ("net/sched: taprio: fix slab-out-of-bounds Read in taprio_dequeue_from_txq"),
> which was papering over an unknown (at the time) issue - the same as
> this one - without really even completely covering it, either.

Unfortunately the commit log is not helpful - I can't tell what
"replace" means, and I can't seem to find the repro either. If you
revert it and see the problem going away, then we are good.

+Cc Zhengchao Shao <shaozhengchao@huawei.com>

> Hence this other syzbot report.
> https://lore.kernel.org/netdev/3b977f76-0289-270e-8310-179315ee927d@huawei.com/T/
> https://lore.kernel.org/netdev/20230608062756.3626573-1-shaozhengchao@huawei.com/

Makes sense.

BTW, thanks for your report - it made it faster to zero in on the
issue. The comments above that code also need a bit of fixing to
provide clarity.

cheers,
jamal

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-08-18 17:11 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-13 20:45 [syzbot] [net?] INFO: rcu detected stall in unix_release syzbot
2023-08-14 23:03 ` Jakub Kicinski
2023-08-15 11:28   ` Vladimir Oltean
2023-08-16 22:57     ` Vladimir Oltean
2023-08-17  2:58       ` Jakub Kicinski
2023-08-17 16:30         ` Jamal Hadi Salim
2023-08-18 15:27           ` Jamal Hadi Salim
2023-08-18 16:07             ` Vladimir Oltean
2023-08-18 17:10               ` Jamal Hadi Salim