* [syzbot] [net?] INFO: rcu detected stall in unix_release
@ 2023-08-13 20:45 syzbot
2023-08-14 23:03 ` Jakub Kicinski
0 siblings, 1 reply; 9+ messages in thread
From: syzbot @ 2023-08-13 20:45 UTC (permalink / raw)
To: bpf, brauner, davem, edumazet, jiri, kuba, linux-kernel, netdev,
pabeni, syzkaller-bugs
Hello,
syzbot found the following issue on:
HEAD commit: d0378ae6d16c Merge branch 'enetc-probe-fix'
git tree: net
console output: https://syzkaller.appspot.com/x/log.txt?x=1052ea2ba80000
kernel config: https://syzkaller.appspot.com/x/.config?x=fa5bd4cd5ab6259d
dashboard link: https://syzkaller.appspot.com/bug?extid=a3618a167af2021433cd
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1152c6eda80000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=13b1eddda80000
Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/c893f52cd6ab/disk-d0378ae6.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/dfb7a8b86a99/vmlinux-d0378ae6.xz
kernel image: https://storage.googleapis.com/syzbot-assets/cb9134e0a22c/bzImage-d0378ae6.xz
The issue was bisected to:
commit c2368b19807affd7621f7c4638cd2e17fec13021
Author: Jiri Pirko <jiri@nvidia.com>
Date: Fri Jul 29 07:10:35 2022 +0000
net: devlink: introduce "unregistering" mark and use it during devlinks iteration
bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=134f1179a80000
final oops: https://syzkaller.appspot.com/x/report.txt?x=10cf1179a80000
console output: https://syzkaller.appspot.com/x/log.txt?x=174f1179a80000
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+a3618a167af2021433cd@syzkaller.appspotmail.com
Fixes: c2368b19807a ("net: devlink: introduce "unregistering" mark and use it during devlinks iteration")
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-....: (10499 ticks this GP) idle=9774/1/0x4000000000000000 softirq=8757/8758 fqs=5219
rcu: hardirqs softirqs csw/system
rcu: number: 1 0 0
rcu: cputime: 26308 26181 17 ==> 52490(ms)
rcu: (t=10500 jiffies g=8417 q=457 ncpus=2)
CPU: 0 PID: 5047 Comm: syz-executor224 Not tainted 6.5.0-rc4-syzkaller-00212-gd0378ae6d16c #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023
RIP: 0010:taprio_dequeue_tc_priority+0x263/0x4b0 net/sched/sch_taprio.c:798
Code: 8b 74 24 10 89 ef 44 89 f6 e8 29 b8 2c f9 44 39 f5 0f 84 40 ff ff ff e8 2b bd 2c f9 49 83 ff 0f 0f 87 e1 01 00 00 48 8b 04 24 <0f> b6 00 38 44 24 36 7c 08 84 c0 0f 85 bf 01 00 00 8b 33 8b 4c 24
RSP: 0018:ffffc90000007d60 EFLAGS: 00000293
RAX: ffffed10047a4a72 RBX: ffff888023d25394 RCX: 0000000000000100
RDX: ffff888028efbb80 RSI: ffffffff88594af5 RDI: 0000000000000004
RBP: 0000000000000008 R08: 0000000000000004 R09: 0000000000000008
R10: 0000000000000000 R11: ffffc90000007ff8 R12: 0000000000000010
R13: ffff88802d19ab60 R14: 0000000000000000 R15: 0000000000000001
FS: 0000555555857380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000600 CR3: 000000002cdd1000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
taprio_dequeue+0x12e/0x5f0 net/sched/sch_taprio.c:868
dequeue_skb net/sched/sch_generic.c:292 [inline]
qdisc_restart net/sched/sch_generic.c:397 [inline]
__qdisc_run+0x1c4/0x19d0 net/sched/sch_generic.c:415
qdisc_run include/net/pkt_sched.h:125 [inline]
qdisc_run include/net/pkt_sched.h:122 [inline]
net_tx_action+0x71e/0xc80 net/core/dev.c:5049
__do_softirq+0x218/0x965 kernel/softirq.c:553
invoke_softirq kernel/softirq.c:427 [inline]
__irq_exit_rcu kernel/softirq.c:632 [inline]
irq_exit_rcu+0xb7/0x120 kernel/softirq.c:644
sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1109
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:645
RIP: 0010:unwind_next_frame+0x5ba/0x2020 arch/x86/kernel/unwind_orc.c:517
Code: 31 02 00 00 41 80 fe 04 0f 84 08 0c 00 00 41 80 fe 05 0f 85 d7 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8b 14 24 48 c1 ea 03 <80> 3c 02 00 0f 85 42 19 00 00 48 89 c8 4d 8b 7d 38 48 ba 00 00 00
RSP: 0018:ffffc90003b9f748 EFLAGS: 00000a02
RAX: dffffc0000000000 RBX: 0000000000000001 RCX: ffffffff8f3ed5c8
RDX: 1ffff92000773efe RSI: 0000000000000001 RDI: ffffffff8ec31910
RBP: ffffc90003b9f800 R08: ffffffff8f3ed646 R09: ffffffff8f3ed5cc
R10: ffffc90003b9f7b8 R11: 000000000000d9e9 R12: ffffc90003b9f808
R13: ffffc90003b9f7b8 R14: 0000000000000005 R15: 0000000000000000
arch_stack_walk+0x8b/0xf0 arch/x86/kernel/stacktrace.c:25
stack_trace_save+0x96/0xd0 kernel/stacktrace.c:122
kasan_save_stack+0x33/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
____kasan_slab_free+0x15e/0x1b0 mm/kasan/common.c:200
kasan_slab_free include/linux/kasan.h:162 [inline]
slab_free_hook mm/slub.c:1792 [inline]
slab_free_freelist_hook+0x10b/0x1e0 mm/slub.c:1818
slab_free mm/slub.c:3801 [inline]
kmem_cache_free+0xf0/0x490 mm/slub.c:3823
sk_prot_free net/core/sock.c:2122 [inline]
__sk_destruct+0x49e/0x770 net/core/sock.c:2216
sk_destruct+0xc2/0xf0 net/core/sock.c:2231
__sk_free+0xc4/0x3a0 net/core/sock.c:2242
sk_free+0x7c/0xa0 net/core/sock.c:2253
sock_put include/net/sock.h:1975 [inline]
unix_release_sock+0xa76/0xf70 net/unix/af_unix.c:668
unix_release+0x88/0xe0 net/unix/af_unix.c:1065
__sock_release+0xcd/0x290 net/socket.c:654
sock_close+0x1c/0x20 net/socket.c:1386
__fput+0x3fd/0xac0 fs/file_table.c:384
task_work_run+0x14d/0x240 kernel/task_work.c:179
resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x210/0x240 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:297
do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fc3bb116ef7
Code: 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8
RSP: 002b:00007ffd1d8ead88 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 00007fc3bb116ef7
RDX: 0000000000000000 RSI: 0000000000008933 RDI: 0000000000000004
RBP: 00007ffd1d8ead90 R08: 0000000000000008 R09: 0000000000000004
R10: 000000000000000b R11: 0000000000000246 R12: 00007ffd1d8eafc0
R13: 00003faaaaaaaaaa R14: 00007ffd1d8eaff0 R15: 00007fc3bb164376
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
If the bug is already fixed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to change bug's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the bug is a duplicate of another bug, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-13 20:45 [syzbot] [net?] INFO: rcu detected stall in unix_release syzbot
@ 2023-08-14 23:03 ` Jakub Kicinski
  2023-08-15 11:28   ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Kicinski @ 2023-08-14 23:03 UTC (permalink / raw)
  To: syzbot, Vladimir Oltean
  Cc: bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs

Hi Vladimir, any ideas for this one?
The bisection looks pooped, FWIW, looks like a taprio inf loop.

On Sun, 13 Aug 2023 13:45:59 -0700 syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:    d0378ae6d16c Merge branch 'enetc-probe-fix'
> git tree:       net
> console output: https://syzkaller.appspot.com/x/log.txt?x=1052ea2ba80000
> dashboard link: https://syzkaller.appspot.com/bug?extid=a3618a167af2021433cd
>
> [full syzbot report quoted above trimmed]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-14 23:03 ` Jakub Kicinski
@ 2023-08-15 11:28   ` Vladimir Oltean
  2023-08-16 22:57     ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Oltean @ 2023-08-15 11:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs

On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> Hi Vladimir, any ideas for this one?
> The bisection looks pooped, FWIW, looks like a taprio inf loop.

I'm looking into it.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-15 11:28 ` Vladimir Oltean
@ 2023-08-16 22:57   ` Vladimir Oltean
  2023-08-17  2:58     ` Jakub Kicinski
  2023-08-17 16:30     ` Jamal Hadi Salim
  0 siblings, 2 replies; 9+ messages in thread
From: Vladimir Oltean @ 2023-08-16 22:57 UTC (permalink / raw)
  To: Jakub Kicinski, Jamal Hadi Salim, Cong Wang, Pedro Tammela,
	Victor Nogueira
  Cc: syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs, Vinicius Costa Gomes

Hi Jakub,

On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> > Hi Vladimir, any ideas for this one?
> > The bisection looks pooped, FWIW, looks like a taprio inf loop.
>
> I'm looking into it.

Here's what I've found out and what help I'll need going forward.

Indeed there is an infinite loop in taprio_dequeue() ->
taprio_dequeue_tc_priority(), leading to an RCU stall.

Short description of taprio_dequeue_tc_priority(): it cycles
q->cur_txq[tc] in the range between [ offset, offset + count ), where:

	int offset = dev->tc_to_txq[tc].offset;
	int count = dev->tc_to_txq[tc].count;

with the initial q->cur_txq[tc], aka the "first_txq" variable, being set
by the control path: taprio_change(), also called by taprio_init():

	if (mqprio) {
		(...)
		for (i = 0; i < mqprio->num_tc; i++) {
			(...)
			q->cur_txq[i] = mqprio->offset[i];
		}
	}

In the buggy case that leads to the RCU stall, the line in taprio_change()
which sets q->cur_txq[i] never gets executed. So first_txq will be 0
(pre-initialized memory), and if that's outside of the
[ offset, offset + count ) range that taprio_dequeue_tc_priority() ->
taprio_next_tc_txq() expects to cycle through, the kernel is toast.

The nitty gritty of that is boring.
It's because "mqprio" is NULL, and that's because taprio_change() (important: also tail-called from taprio_init()) has this logic to detect a change in the traffic class settings of the device, compared to the passed TCA_TAPRIO_ATTR_PRIOMAP netlink attribute: /* no changes - no new mqprio settings */ if (!taprio_mqprio_cmp(q, dev, mqprio)) mqprio = NULL; And what happens is that: - we go through taprio_init() - a TCA_TAPRIO_ATTR_PRIOMAP gets passed to us - taprio_mqprio_cmp() sees that there's no change compared to the netdev's existing traffic class config - taprio_change() sets "mqprio" to NULL, ignoring the given TCA_TAPRIO_ATTR_PRIOMAP - we skip modifying q->cur_txq[i], as if it was a taprio_change() call that came straight from Qdisc_ops :: change(), rather than what it really is: one from Qdisc_ops :: init() So the next question: why does taprio_mqprio_cmp() see that there's no change? Because there is no change. When Qdisc_ops :: init() is called, the netdev really has a non-zero dev->num_tc, prio_tc_map, tc_to_txq and all that. But why? A previous taprio, if that existed, will call taprio_destroy() -> netdev_reset_tc(), so it won't leave state behind that will hinder the current taprio. Checking for stuff in the netdev state is just so that taprio_change() can distinguish between a direct Qdisc_ops :: change() call vs one coming from init(). Finally, here's where the syzbot repro becomes relevant. It crafts the RTM_NEWQDISC netlink message in such a way, that it makes tc_modify_qdisc() in sch_api.c call a Qdisc_ops sequence with which taprio wasn't written in mind. With "tc qdisc replace && tc qdisc replace", tc_modify_qdisc() is supposed to call init() the first time and replace() the second time. What the repro does is make the above sequence call two init() methods back to back. 
To create an iproute2-based reproducer rather than the C one provided by
syzbot, we need this iproute2 change:

diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index 56086c43b7fa..20d9622b6bf3 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -448,6 +448,8 @@ int do_qdisc(int argc, char **argv)
 		return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1);
 	if (matches(*argv, "change") == 0)
 		return tc_qdisc_modify(RTM_NEWQDISC, 0, argc-1, argv+1);
+	if (strcmp(*argv, "replace-exclusive") == 0)
+		return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL, argc-1, argv+1);
 	if (matches(*argv, "replace") == 0)
 		return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE, argc-1, argv+1);
 	if (matches(*argv, "link") == 0)

which basically implements a crafted alternative of "tc qdisc replace"
which also sets the NLM_F_EXCL flag in n->nlmsg_flags.

Then, the minimal repro script can simply be expressed as:

	#!/bin/bash

	ip link add veth0 numtxqueues 16 numrxqueues 16 type veth peer name veth1
	ip link set veth0 up && ip link set veth1 up

	for ((i = 0; i < 2; i++)); do
		tc qdisc replace-exclusive dev veth0 root stab overhead 24 taprio \
			num_tc 2 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
			queues 8@0 4@8 \
			clockid REALTIME \
			base-time 0 \
			cycle-time 61679 \
			sched-entry S 0 54336 \
			sched-entry S 0x8a27 7343 \
			max-sdu 18343 18343 \
			flags 0
	done

	ip link del veth0

Here's how things go sideways if sch_api.c goes through the
Qdisc_ops :: init() code path instead of change() for the second Qdisc.

The first taprio_attach() (i=0) will attach the root taprio Qdisc (aka
itself) to all netdev TX queues, and qdisc_put() the existing pfifo
default Qdiscs.

When the taprio_init() method executes for i=1, taprio_destroy() hasn't
been called yet. So neither has netdev_reset_tc() been called, and
that's part of the problem (the one that causes the infinite loop in
dequeue()).

But, taprio_destroy() will finally get called for the initial taprio
created at i=0.
The call trace looks like this:

rtnetlink_rcv_msg()
-> tc_modify_qdisc()
   -> qdisc_graft()
      -> taprio_attach() for i=1
         -> qdisc_put() for the old Qdiscs attached to the TX queues,
            aka the taprio from i=0
            -> __qdisc_destroy()
               -> taprio_destroy()

What's more interesting is that the late taprio_destroy() for i=0
effectively destroys the netdev state - the netdev_reset_tc() call -
done by taprio_init() -> taprio_change() for i=1, and that can't be too
good, either. Even if there's no immediately observable hang, the
traffic classes are reset even though the Qdisc thinks they aren't.

Taprio isn't the only one affected by this. Mqprio also has the pattern
of calling netdev_set_num_tc() from Qdisc_ops :: init() and destroy().
But with the possibility of destroy(i=0) not being serialized with
init(i=1), that's buggy.

Sorry for the long message. This is where I'm at. For me, this is the
bottom of where things are intuitive. I don't understand what is
considered to be expected behavior from tc_modify_qdisc(), and what is
considered to be sane Qdisc-facing API, and I need help.

I've completely stopped debugging when I saw that the code enters
through this path at i=1, so I really can't tell you more:

	/* This magic test requires explanation.
	 *
	 * We know, that some child q is already
	 * attached to this parent and have choice:
	 * either to change it or to create/graft new one.
	 *
	 * 1. We are allowed to create/graft only
	 * if CREATE and REPLACE flags are set.
	 *
	 * 2. If EXCL is set, requestor wanted to say,
	 * that qdisc tcm_handle is not expected
	 * to exist, so that we choose create/graft too.
	 *
	 * 3. The last case is when no flags are set.
	 * Alas, it is sort of hole in API, we
	 * cannot decide what to do unambiguously.
	 * For now we select create/graft, if
	 * user gave KIND, which does not match existing.
	 */
	if ((n->nlmsg_flags & NLM_F_CREATE) &&
	    (n->nlmsg_flags & NLM_F_REPLACE) &&
	    ((n->nlmsg_flags & NLM_F_EXCL) ||
	     (tca[TCA_KIND] &&
	      nla_strcmp(tca[TCA_KIND], q->ops->id)))) {
		netdev_err(dev, "magic test\n");
		goto create_n_graft;
	}

I've added more Qdisc people to the discussion. The problem description
is pretty much self-contained in this email, and going to the original
syzbot report won't bring much else.

There are multiple workarounds that can be done in taprio (and mqprio)
depending on what is considered as being sane API. Though I don't want
to get ahead of myself. Maybe there is a way to fast-forward the
qdisc_destroy() of the previous taprio so it doesn't overlap with the
new one's qdisc_create().

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-16 22:57 ` Vladimir Oltean
@ 2023-08-17  2:58   ` Jakub Kicinski
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2023-08-17 2:58 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Jamal Hadi Salim, Cong Wang, Pedro Tammela, Victor Nogueira,
	syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev,
	pabeni, syzkaller-bugs, Vinicius Costa Gomes

On Thu, 17 Aug 2023 01:57:59 +0300 Vladimir Oltean wrote:
> There are multiple workarounds that can be done in taprio (and mqprio)
> depending on what is considered as being sane API. Though I don't want
> to get ahead of myself. Maybe there is a way to fast-forward the
> qdisc_destroy() of the previous taprio so it doesn't overlap with the
> new one's qdisc_create().

Thanks for the details. I'm going to let others comment, but sounds
a bit similar to the recent problem with the ingress qdisc. The qdisc
expects to own the netdev which explodes when its lifetime rules are
fully exercised :(

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release 2023-08-16 22:57 ` Vladimir Oltean 2023-08-17 2:58 ` Jakub Kicinski @ 2023-08-17 16:30 ` Jamal Hadi Salim 2023-08-18 15:27 ` Jamal Hadi Salim 1 sibling, 1 reply; 9+ messages in thread From: Jamal Hadi Salim @ 2023-08-17 16:30 UTC (permalink / raw) To: Vladimir Oltean Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot, bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni, syzkaller-bugs, Vinicius Costa Gomes On Wed, Aug 16, 2023 at 6:58 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > Hi Jakub, > > On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote: > > On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote: > > > Hi Vladimir, any ideas for this one? > > > The bisection looks pooped, FWIW, looks like a taprio inf loop. > > > > I'm looking into it. > > Here's what I've found out and what help I'll need going forward. > > Indeed there is an infinite loop in taprio_dequeue() -> taprio_dequeue_tc_priority(), > leading to an RCU stall. > > Short description of taprio_dequeue_tc_priority(): it cycles > q->cur_txq[tc] in the range between [ offset, offset + count ), where: > > int offset = dev->tc_to_txq[tc].offset; > int count = dev->tc_to_txq[tc].count; > > with the initial q->cur_txq[tc], aka the "first_txq" variable, being set > by the control path: taprio_change(), also called by taprio_init(): > > if (mqprio) { > (...) > for (i = 0; i < mqprio->num_tc; i++) { > (...) > q->cur_txq[i] = mqprio->offset[i]; > } > } > > In the buggy case that leads to the RCU stall, the line in taprio_change() > which sets q->cur_txq[i] never gets executed. So first_txq will be 0 > (pre-initialized memory), and if that's outside of the [ offset, offset + count ) > range that taprio_dequeue_tc_priority() -> taprio_next_tc_txq() expects > to cycle through, the kernel is toast. > > The nitty gritty of that is boring. 
What's not boring is how come the > control path skips the q->cur_txq[i] assignment. It's because "mqprio" > is NULL, and that's because taprio_change() (important: also tail-called > from taprio_init()) has this logic to detect a change in the traffic > class settings of the device, compared to the passed TCA_TAPRIO_ATTR_PRIOMAP > netlink attribute: > > /* no changes - no new mqprio settings */ > if (!taprio_mqprio_cmp(q, dev, mqprio)) > mqprio = NULL; > > And what happens is that: > - we go through taprio_init() > - a TCA_TAPRIO_ATTR_PRIOMAP gets passed to us > - taprio_mqprio_cmp() sees that there's no change compared to the > netdev's existing traffic class config > - taprio_change() sets "mqprio" to NULL, ignoring the given > TCA_TAPRIO_ATTR_PRIOMAP > - we skip modifying q->cur_txq[i], as if it was a taprio_change() call > that came straight from Qdisc_ops :: change(), rather than what it > really is: one from Qdisc_ops :: init() > > So the next question: why does taprio_mqprio_cmp() see that there's no > change? Because there is no change. When Qdisc_ops :: init() is called, > the netdev really has a non-zero dev->num_tc, prio_tc_map, tc_to_txq and > all that. > > But why? A previous taprio, if that existed, will call taprio_destroy() > -> netdev_reset_tc(), so it won't leave state behind that will hinder > the current taprio. Checking for stuff in the netdev state is just so > that taprio_change() can distinguish between a direct Qdisc_ops :: change() > call vs one coming from init(). > > Finally, here's where the syzbot repro becomes relevant. It crafts the > RTM_NEWQDISC netlink message in such a way, that it makes tc_modify_qdisc() > in sch_api.c call a Qdisc_ops sequence with which taprio wasn't written > in mind. > > With "tc qdisc replace && tc qdisc replace", tc_modify_qdisc() is > supposed to call init() the first time and replace() the second time. > What the repro does is make the above sequence call two init() methods > back to back. 
> > To create an iproute2-based reproducer rather than the C one provided by > syzbot, we need this iproute2 change: > > diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c > index 56086c43b7fa..20d9622b6bf3 100644 > --- a/tc/tc_qdisc.c > +++ b/tc/tc_qdisc.c > @@ -448,6 +448,8 @@ int do_qdisc(int argc, char **argv) > return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_EXCL|NLM_F_CREATE, argc-1, argv+1); > if (matches(*argv, "change") == 0) > return tc_qdisc_modify(RTM_NEWQDISC, 0, argc-1, argv+1); > + if (strcmp(*argv, "replace-exclusive") == 0) > + return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL, argc-1, argv+1); > if (matches(*argv, "replace") == 0) > return tc_qdisc_modify(RTM_NEWQDISC, NLM_F_CREATE|NLM_F_REPLACE, argc-1, argv+1); > if (matches(*argv, "link") == 0) > > which basically implements a crafted alternative of "tc qdisc replace" > which also sets the NLM_F_EXCL flag in n->nlmsg_flags. > > Then, the minimal repro script can simply be expressed as: > > #!/bin/bash > > ip link add veth0 numtxqueues 16 numrxqueues 16 type veth peer name veth1 > ip link set veth0 up && ip link set veth1 up > > for ((i = 0; i < 2; i++)); do > tc qdisc replace-exclusive dev veth0 root stab overhead 24 taprio \ > num_tc 2 map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \ > queues 8@0 4@8 \ > clockid REALTIME \ > base-time 0 \ > cycle-time 61679 \ > sched-entry S 0 54336 \ > sched-entry S 0x8a27 7343 \ > max-sdu 18343 18343 \ > flags 0 > done > > ip link del veth0 > > Here's how things go sideways if sch_api.c goes through the Qdisc_ops :: init() > code path instead of change() for the second Qdisc. > > The first taprio_attach() (i=0) will attach the root taprio Qdisc (aka itself) > to all netdev TX queues, and qdisc_put() the existing pfifo default Qdiscs. > > When the taprio_init() method executes for i=1, taprio_destroy() hasn't > been called yet. 
> So neither has netdev_reset_tc() been called, and
> that's part of the problem (the one that causes the infinite loop in
> dequeue()).
>
> But, taprio_destroy() will finally get called for the initial taprio
> created at i=0. The call trace looks like this:
>
> rtnetlink_rcv_msg()
> -> tc_modify_qdisc()
> -> qdisc_graft()
> -> taprio_attach() for i=1
> -> qdisc_put() for the old Qdiscs attached to the TX queues, aka the taprio from i=0
> -> __qdisc_destroy()
> -> taprio_destroy()
>
> What's more interesting is that the late taprio_destroy() for i=0
> effectively destroys, via its netdev_reset_tc() call, the netdev state
> set up by taprio_init() -> taprio_change() for i=1, and that can't be
> too good, either. Even if there's no immediately observable hang, the
> traffic classes are reset even though the Qdisc thinks they aren't.
>
> Taprio isn't the only one affected by this. Mqprio also has the pattern
> of calling netdev_set_num_tc() from Qdisc_ops :: init() and destroy().
> But with the possibility of destroy(i=0) not being serialized with
> init(i=1), that's buggy.
>
> Sorry for the long message. This is where I'm at. For me, this is the
> bottom of where things are intuitive. I don't understand what is
> considered to be expected behavior from tc_modify_qdisc(), and what is
> considered to be a sane Qdisc-facing API, and I need help.
>
> I've completely stopped debugging when I saw that the code enters
> through this path at i=1, so I really can't tell you more:
>
> /* This magic test requires explanation.
>  *
>  * We know, that some child q is already
>  * attached to this parent and have choice:
>  * either to change it or to create/graft new one.
>  *
>  * 1. We are allowed to create/graft only
>  * if CREATE and REPLACE flags are set.
>  *
>  * 2. If EXCL is set, requestor wanted to say,
>  * that qdisc tcm_handle is not expected
>  * to exist, so that we choose create/graft too.
>  *
>  * 3. The last case is when no flags are set.
>  * Alas, it is sort of hole in API, we
>  * cannot decide what to do unambiguously.
>  * For now we select create/graft, if
>  * user gave KIND, which does not match existing.
>  */
> if ((n->nlmsg_flags & NLM_F_CREATE) &&
>     (n->nlmsg_flags & NLM_F_REPLACE) &&
>     ((n->nlmsg_flags & NLM_F_EXCL) ||
>      (tca[TCA_KIND] &&
>       nla_strcmp(tca[TCA_KIND], q->ops->id)))) {
> 	netdev_err(dev, "magic test\n");
> 	goto create_n_graft;
> }
>
> I've added more Qdisc people to the discussion. The problem description
> is pretty much self-contained in this email, and going to the original
> syzbot report won't bring much else.

I will take a look tomorrow.

cheers,
jamal

> There are multiple workarounds that can be done in taprio (and mqprio)
> depending on what is considered as being a sane API. Though I don't want
> to get ahead of myself. Maybe there is a way to fast-forward the
> qdisc_destroy() of the previous taprio so it doesn't overlap with the
> new one's qdisc_create().

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-17 16:30         ` Jamal Hadi Salim
@ 2023-08-18 15:27           ` Jamal Hadi Salim
  2023-08-18 16:07             ` Vladimir Oltean
  0 siblings, 1 reply; 9+ messages in thread
From: Jamal Hadi Salim @ 2023-08-18 15:27 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot,
	bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni,
	syzkaller-bugs, Vinicius Costa Gomes

[-- Attachment #1: Type: text/plain, Size: 10390 bytes --]

On Thu, Aug 17, 2023 at 12:30 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Aug 16, 2023 at 6:58 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> >
> > Hi Jakub,
> >
> > On Tue, Aug 15, 2023 at 02:28:21PM +0300, Vladimir Oltean wrote:
> > > On Mon, Aug 14, 2023 at 04:03:03PM -0700, Jakub Kicinski wrote:
> > > > Hi Vladimir, any ideas for this one?
> > > > The bisection looks pooped, FWIW, looks like a taprio inf loop.
> > >
> > > I'm looking into it.
> >
> > Here's what I've found out and what help I'll need going forward.
> >
> > Indeed there is an infinite loop in taprio_dequeue() -> taprio_dequeue_tc_priority(),
> > leading to an RCU stall.
> >
> > Short description of taprio_dequeue_tc_priority(): it cycles
> > q->cur_txq[tc] in the range between [ offset, offset + count ), where:
> >
> > 	int offset = dev->tc_to_txq[tc].offset;
> > 	int count = dev->tc_to_txq[tc].count;
> >
> > with the initial q->cur_txq[tc], aka the "first_txq" variable, being set
> > by the control path: taprio_change(), also called by taprio_init():
> >
> > 	if (mqprio) {
> > 		(...)
> > 		for (i = 0; i < mqprio->num_tc; i++) {
> > 			(...)
> > 			q->cur_txq[i] = mqprio->offset[i];
> > 		}
> > 	}
> >
> > In the buggy case that leads to the RCU stall, the line in taprio_change()
> > which sets q->cur_txq[i] never gets executed.
> > So first_txq will be 0
> > (pre-initialized memory), and if that's outside of the [ offset, offset + count )
> > range that taprio_dequeue_tc_priority() -> taprio_next_tc_txq() expects
> > to cycle through, the kernel is toast.
> >
> > The nitty gritty of that is boring. [... the remainder of the analysis
> > is snipped here; it is quoted in full in the previous message ...]
>
> I will take a look tomorrow.

Can you try the attached patchlet?
cheers,
jamal

> cheers,
> jamal
>
> > There are multiple workarounds that can be done in taprio (and mqprio)
> > depending on what is considered as being a sane API. Though I don't want
> > to get ahead of myself. Maybe there is a way to fast-forward the
> > qdisc_destroy() of the previous taprio so it doesn't overlap with the
> > new one's qdisc_create().

[-- Attachment #2: patchlet-qdisc --]
[-- Type: application/octet-stream, Size: 2233 bytes --]

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index aa6b1fe65151..8409b61d3313 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1551,6 +1551,25 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
  * Create/change qdisc.
  */

+static inline bool cmd_create_or_replace(struct nlmsghdr *n)
+{
+	return (n->nlmsg_flags & NLM_F_CREATE &&
+		n->nlmsg_flags & NLM_F_REPLACE);
+}
+
+static inline bool cmd_create_exclusive(struct nlmsghdr *n)
+{
+	return (n->nlmsg_flags & NLM_F_CREATE &&
+		n->nlmsg_flags & NLM_F_EXCL);
+}
+
+static inline bool cmd_change(struct nlmsghdr *n)
+{
+	return (!(n->nlmsg_flags & NLM_F_CREATE) &&
+		!(n->nlmsg_flags & NLM_F_REPLACE) &&
+		!(n->nlmsg_flags & NLM_F_EXCL));
+}
+
 static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			   struct netlink_ext_ack *extack)
 {
@@ -1659,12 +1678,17 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 				 * For now we select create/graft, if
 				 * user gave KIND, which does not match existing.
 				 */
-				if ((n->nlmsg_flags & NLM_F_CREATE) &&
-				    (n->nlmsg_flags & NLM_F_REPLACE) &&
-				    ((n->nlmsg_flags & NLM_F_EXCL) ||
-				     (tca[TCA_KIND] &&
-				      nla_strcmp(tca[TCA_KIND], q->ops->id))))
-					goto create_n_graft;
+				if (tca[TCA_KIND] &&
+				    nla_strcmp(tca[TCA_KIND], q->ops->id)) {
+					if (cmd_create_or_replace(n) ||
+					    cmd_create_exclusive(n)) {
+						goto create_n_graft;
+					} else {
+						if (cmd_change(n))
+							goto create_n_graft2;
+					}
+				}
+			}
 		}
 	} else {
@@ -1698,6 +1722,7 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 		NL_SET_ERR_MSG(extack, "Qdisc not found. To create specify NLM_F_CREATE flag");
 		return -ENOENT;
 	}
+create_n_graft2:
 	if (clid == TC_H_INGRESS) {
 		if (dev_ingress_queue(dev)) {
 			q = qdisc_create(dev, dev_ingress_queue(dev),

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-18 15:27           ` Jamal Hadi Salim
@ 2023-08-18 16:07             ` Vladimir Oltean
  2023-08-18 17:10               ` Jamal Hadi Salim
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Oltean @ 2023-08-18 16:07 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot,
	bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni,
	syzkaller-bugs, Vinicius Costa Gomes

Hi Jamal,

On Fri, Aug 18, 2023 at 11:27:27AM -0400, Jamal Hadi Salim wrote:
> Can you try the attached patchlet?

Thanks for the patch. I've tried it, and it eliminates the code path
(and thus the problem) exposed by the syzbot program, by responding to
RTM_NEWQDISC messages having the NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL
flags with "Error: Exclusivity flag on, cannot modify.".

Actually, to be precise, the first such netlink message successfully
creates the qdisc, but then the subsequent ones leave that qdisc alone
(don't change it), by failing with this extack message.

If that's the behavior that you intended, then I guess the answer is
that it works. Thanks a lot.

What would be an appropriate Fixes: tag?

Side note: I believe that we can now also revert commit be3618d96510
("net/sched: taprio: fix slab-out-of-bounds Read in taprio_dequeue_from_txq"),
which was papering over an unknown (at the time) issue - the same as
this one - without really even completely covering it, either. Hence
this other syzbot report.

https://lore.kernel.org/netdev/3b977f76-0289-270e-8310-179315ee927d@huawei.com/T/
https://lore.kernel.org/netdev/20230608062756.3626573-1-shaozhengchao@huawei.com/

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [syzbot] [net?] INFO: rcu detected stall in unix_release
  2023-08-18 16:07             ` Vladimir Oltean
@ 2023-08-18 17:10               ` Jamal Hadi Salim
  0 siblings, 0 replies; 9+ messages in thread
From: Jamal Hadi Salim @ 2023-08-18 17:10 UTC (permalink / raw)
To: Vladimir Oltean
Cc: Jakub Kicinski, Cong Wang, Pedro Tammela, Victor Nogueira, syzbot,
	bpf, brauner, davem, edumazet, jiri, linux-kernel, netdev, pabeni,
	syzkaller-bugs, Vinicius Costa Gomes, Zhengchao Shao

Hi Vladimir,

On Fri, Aug 18, 2023 at 12:07 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> Hi Jamal,
>
> On Fri, Aug 18, 2023 at 11:27:27AM -0400, Jamal Hadi Salim wrote:
> > Can you try the attached patchlet?
>
> Thanks for the patch. I've tried it, and it eliminates the code path
> (and thus the problem) exposed by the syzbot program, by responding to
> RTM_NEWQDISC messages having the NLM_F_CREATE|NLM_F_REPLACE|NLM_F_EXCL
> flags with "Error: Exclusivity flag on, cannot modify.".

Ok, that is more of the expected behavior. No one should ever send that
mumbo-jumbo (I doubt there is a "legit" control app that will do that).

> Actually, to be precise, the first such netlink message successfully
> creates the qdisc, but then the subsequent ones leave that qdisc alone
> (don't change it), by failing with this extack message.

Yes, the first one will succeed because the root qdisc hasn't been
grafted yet (and the only interesting bit is NLM_F_CREATE; everything
else is ignored).

> If that's the behavior that you intended, then I guess the answer is
> that it works. Thanks a lot.
>
> What would be an appropriate Fixes: tag?

This should have been from the early days, when we trusted that iproute2
would do the right thing. I will look. I don't think taprio is the only
potential victim; it's just that syzbot was able to aggravate taprio
sooner (it probably would have got to some other qdisc later in its
adventures).
> Side note: I believe that we can now also revert commit be3618d96510
> ("net/sched: taprio: fix slab-out-of-bounds Read in taprio_dequeue_from_txq"),
> which was papering over an unknown (at the time) issue - the same as
> this one - without really even completely covering it, either.

Unfortunately the commit log is not helpful - I can't tell what
"replace" means, and I can't seem to find the repro either. If you
revert it and see the problem going away, then we are good.

+Cc Zhengchao Shao <shaozhengchao@huawei.com>

> Hence this other syzbot report.
> https://lore.kernel.org/netdev/3b977f76-0289-270e-8310-179315ee927d@huawei.com/T/
> https://lore.kernel.org/netdev/20230608062756.3626573-1-shaozhengchao@huawei.com/

Makes sense.

BTW, thanks for your report - it made it faster to zero in on the
issue. The comments above that code also need a bit of fixing to
provide clarity.

cheers,
jamal

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-08-18 17:11 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-13 20:45 [syzbot] [net?] INFO: rcu detected stall in unix_release syzbot
2023-08-14 23:03 ` Jakub Kicinski
2023-08-15 11:28   ` Vladimir Oltean
2023-08-16 22:57     ` Vladimir Oltean
2023-08-17  2:58       ` Jakub Kicinski
2023-08-17 16:30         ` Jamal Hadi Salim
2023-08-18 15:27           ` Jamal Hadi Salim
2023-08-18 16:07             ` Vladimir Oltean
2023-08-18 17:10               ` Jamal Hadi Salim