Lockdep warnings on kexec (virtio

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Lockdep warnings on kexec (virtio_blk, hrtimers)
@ 2024-12-09 14:28 David Woodhouse
  2024-12-10  1:56 ` Jason Wang
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-09 14:28 UTC (permalink / raw)
  To: x86@kernel.org
  Cc: hpa, dyoung, kexec, linux-ext4, Michael S. Tsirkin, Jason Wang

[-- Attachment #1: Type: text/plain, Size: 23292 bytes --]

Using the test case in https://git.kernel.org/torvalds/c/07fa619f2a40c
I see sporadic lockdep warnings.

This warning on the way into kexec seems to happen every time:

[   67.416890] Freezing user space processes
[   67.419277] Freezing user space processes completed (elapsed 0.001 seconds)
[   67.420754] OOM killer disabled.
[   67.433337] BUG: workqueue leaked atomic, lock or RCU: kworker/u8:7[558]
[   67.433337]      preempt=0x00000000 lock=0->2 RCU=0->0 workfn=async_run_entry_fn
[   67.436941] 2 locks held by kworker/u8:7/558:
[   67.437912]  #0: ffff893982fdb858 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
[   67.439980]  #1: ffff893982fdb890 (&q->q_usage_counter(queue)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
[   67.441783] CPU: 0 UID: 0 PID: 558 Comm: kworker/u8:7 Not tainted 6.13.0-rc1+ #2032
[   67.442462] Disabling non-boot CPUs ...
[   67.443101] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   67.443106] Workqueue: async async_run_entry_fn
[   67.443114] Call Trace:
[   67.443117]  <TASK>
[   67.443123]  dump_stack_lvl+0x84/0xd0
[   67.447325]  process_one_work.cold+0x6d/0xc8
[   67.447991]  ? __pfx_async_run_entry_fn+0x10/0x10
[   67.448644]  ? process_one_work+0x24a/0x590
[   67.449226]  worker_thread+0x1c3/0x3b0
[   67.449757]  ? __pfx_worker_thread+0x10/0x10
[   67.450345]  kthread+0xd5/0x100
[   67.450822]  ? __pfx_kthread+0x10/0x10
[   67.451320]  ret_from_fork+0x34/0x50
[   67.451775]  ? __pfx_kthread+0x10/0x10
[   67.452239]  ret_from_fork_asm+0x1a/0x30
[   67.452747]  </TASK>
[   67.471104] smpboot: CPU 1 is now offline



This one happens only occasionally (1 in 20 or so):

B[   67.487529] ------------[ cut here ]------------
[   67.488018] Interrupts enabled after irqrouter_resume+0x0/0x50
[   67.488684] WARNING: CPU: 0 PID: 571 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
[   67.489571] Modules linked in:
[   67.489920] CPU: 0 UID: 0 PID: 571 Comm: loadret Not tainted 6.13.0-rc1+ #2032
[   67.490692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   67.491856] RIP: 0010:syscore_resume+0x18a/0x220
[   67.492347] Code: 00 e9 21 ff ff ff 80 3d 62 d0 d3 01 00 0f 85 03 ff ff ff 48 8b 73 18 48 c7 c7 06 f6 e9 8d c6 05 4a d0 d3 01 01 e8 16 56 46 ff <0f> 0b e9 e5 fe ff ff e8 ea e9 54 ff 84 c0 0f 85 fb fe ff ff 80 3d
[   67.494253] RSP: 0018:ffffaca540773a08 EFLAGS: 00010286
[   67.494816] RAX: 0000000000000000 RBX: ffffffff8e9918c0 RCX: 0000000000000027
[   67.495560] RDX: ffff8939fdc21a88 RSI: 0000000000000001 RDI: ffff8939fdc21a80
[   67.496324] RBP: 0000000000037e0c R08: 0000000000000000 R09: 0000000000000000
[   67.497052] R10: 0000000000000001 R11: ffffffff8e782fd8 R12: ffffaca540773a38
[   67.497788] R13: ffffffff8e68ebe0 R14: 00000000fee1dead R15: 0000000000000000
[   67.498504] FS:  00007f9ae771f540(0000) GS:ffff8939fdc00000(0000) knlGS:0000000000000000
[   67.499320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   67.499930] CR2: 0000559aeea00b70 CR3: 0000000007a5e001 CR4: 0000000000170ef0
[   67.500650] Call Trace:
[   67.500913]  <TASK>
[   67.501147]  ? __warn.cold+0xb7/0x151
[   67.501535]  ? syscore_resume+0x18a/0x220
[   67.501963]  ? report_bug+0xff/0x140
[   67.502331]  ? console_unlock+0x9d/0x150
[   67.502757]  ? handle_bug+0x58/0x90
[   67.503128]  ? exc_invalid_op+0x17/0x70
[   67.503529]  ? asm_exc_invalid_op+0x1a/0x20
[   67.503975]  ? syscore_resume+0x18a/0x220
[   67.504390]  ? syscore_resume+0x18a/0x220
[   67.504814]  kernel_kexec+0xf6/0x180
[   67.505190]  __do_sys_reboot+0x206/0x250
[   67.505653]  do_syscall_64+0x95/0x180
[   67.506037]  ? __lock_acquire+0x45f/0x25c0
[   67.506451]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.506974]  ? smp_call_function_many_cond+0x11c/0x790
[   67.507503]  ? lock_acquire+0xd0/0x310
[   67.507896]  ? free_unref_page+0x22b/0x6a0
[   67.508316]  ? find_held_lock+0x2b/0x80
[   67.508725]  ? free_unref_page+0x510/0x6a0
[   67.509157]  ? do_raw_spin_unlock+0x4d/0xb0
[   67.509592]  ? _raw_spin_unlock+0x23/0x40
[   67.510006]  ? free_unref_page+0x510/0x6a0
[   67.510428]  ? arch_kexec_pre_free_pages+0x1a/0x40
[   67.510938]  ? do_kexec_load+0x11d/0x340
[   67.511343]  ? kfree+0xdb/0x3a0
[   67.511697]  ? __x64_sys_kexec_load+0xa9/0xe0
[   67.512149]  ? kfree+0xdb/0x3a0
[   67.512485]  ? do_kexec_load+0x11d/0x340
[   67.512912]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.513430]  ? syscall_exit_to_user_mode+0x97/0x290
[   67.513935]  ? do_syscall_64+0xa1/0x180
[   67.514333]  ? find_held_lock+0x2b/0x80
[   67.514736]  ? do_user_addr_fault+0x59f/0x8a0
[   67.515193]  ? do_user_addr_fault+0x5a9/0x8a0
[   67.515655]  ? trace_hardirqs_off+0x4b/0xc0
[   67.516102]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.516625]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   67.517138] RIP: 0033:0x7f9ae765115d
[   67.517515] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
[   67.519312] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[   67.520057] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
[   67.520760] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[   67.521454] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
[   67.522172] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
[   67.522881] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
[   67.523593]  </TASK>
[   67.523832] irq event stamp: 15605
[   67.524181] hardirqs last  enabled at (15613): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
[   67.525034] hardirqs last disabled at (15620): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
[   67.525935] softirqs last  enabled at (14732): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
[   67.526780] softirqs last disabled at (14717): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
[   67.527616] ---[ end trace 0000000000000000 ]---
[   67.530225] ------------[ cut here ]------------
[   67.530760] WARNING: CPU: 0 PID: 571 at kernel/time/hrtimer.c:995 hrtimers_resume_local+0x29/0x40
[   67.531654] Modules linked in:
[   67.531982] CPU: 0 UID: 0 PID: 571 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2032
[   67.532862] Tainted: [W]=WARN
[   67.533185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   67.534313] RIP: 0010:hrtimers_resume_local+0x29/0x40
[   67.534833] Code: 90 66 0f 1f 00 0f 1f 44 00 00 8b 05 c5 57 81 02 85 c0 74 18 65 8b 05 0e 8c d4 73 85 c0 75 0d 65 8b 05 c7 88 d4 73 85 c0 74 02 <0f> 0b 31 ff e9 de ee ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
[   67.536771] RSP: 0018:ffffaca5407739a0 EFLAGS: 00010202
[   67.537296] RAX: 0000000000000001 RBX: 0000000fcd31f08b RCX: 00000000000006e0
[   67.538000] RDX: 0000000000000029 RSI: 00000000007f56f4 RDI: 00000000000006e0
[   67.538706] RBP: ffffaca5407739f8 R08: 0000000000000001 R09: 0000000000000000
[   67.539406] R10: 0000000000000001 R11: ffffffff9018d188 R12: 0000000000000202
[   67.540104] R13: ffffffff8e666ca0 R14: 00000000fee1dead R15: 0000000000000000
[   67.540840] FS:  00007f9ae771f540(0000) GS:ffff8939fdc00000(0000) knlGS:0000000000000000
[   67.541638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   67.542218] CR2: 0000559aeea00b70 CR3: 0000000007a5e001 CR4: 0000000000170ef0
[   67.542915] Call Trace:
[   67.543185]  <TASK>
[   67.543413]  ? __warn.cold+0xb7/0x151
[   67.543804]  ? hrtimers_resume_local+0x29/0x40
[   67.544269]  ? report_bug+0xff/0x140
[   67.544646]  ? handle_bug+0x58/0x90
[   67.544999]  ? exc_invalid_op+0x17/0x70
[   67.545396]  ? asm_exc_invalid_op+0x1a/0x20
[   67.545862]  ? hrtimers_resume_local+0x29/0x40
[   67.546318]  timekeeping_resume+0x148/0x190
[   67.546743]  syscore_resume+0x67/0x220
[   67.547132]  kernel_kexec+0xf6/0x180
[   67.547491]  __do_sys_reboot+0x206/0x250
[   67.547902]  do_syscall_64+0x95/0x180
[   67.548285]  ? __lock_acquire+0x45f/0x25c0
[   67.548714]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.549255]  ? smp_call_function_many_cond+0x11c/0x790
[   67.549783]  ? lock_acquire+0xd0/0x310
[   67.550176]  ? free_unref_page+0x22b/0x6a0
[   67.550594]  ? find_held_lock+0x2b/0x80
[   67.550973]  ? free_unref_page+0x510/0x6a0
[   67.551409]  ? do_raw_spin_unlock+0x4d/0xb0
[   67.551849]  ? _raw_spin_unlock+0x23/0x40
[   67.552263]  ? free_unref_page+0x510/0x6a0
[   67.552682]  ? arch_kexec_pre_free_pages+0x1a/0x40
[   67.553169]  ? do_kexec_load+0x11d/0x340
[   67.553553]  ? kfree+0xdb/0x3a0
[   67.553895]  ? __x64_sys_kexec_load+0xa9/0xe0
[   67.554345]  ? kfree+0xdb/0x3a0
[   67.554678]  ? do_kexec_load+0x11d/0x340
[   67.555073]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.555581]  ? syscall_exit_to_user_mode+0x97/0x290
[   67.556121]  ? do_syscall_64+0xa1/0x180
[   67.556506]  ? find_held_lock+0x2b/0x80
[   67.556898]  ? do_user_addr_fault+0x59f/0x8a0
[   67.557347]  ? do_user_addr_fault+0x5a9/0x8a0
[   67.557795]  ? trace_hardirqs_off+0x4b/0xc0
[   67.558221]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.558721]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   67.559230] RIP: 0033:0x7f9ae765115d
[   67.559625] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
[   67.561378] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[   67.562124] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
[   67.562815] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[   67.563507] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
[   67.564223] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
[   67.564912] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
[   67.565614]  </TASK>
[   67.565847] irq event stamp: 16351
[   67.566214] hardirqs last  enabled at (16361): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
[   67.567039] hardirqs last disabled at (16368): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
[   67.567867] softirqs last  enabled at (16260): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
[   67.568702] softirqs last disabled at (16245): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
[   67.569516] ---[ end trace 0000000000000000 ]---
[   67.569981] 
[   67.570150] ================================
[   67.570561] WARNING: inconsistent lock state
[   67.570982] 6.13.0-rc1+ #2032 Tainted: G        W         
[   67.571501] --------------------------------
[   67.571922] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[   67.572490] loadret/571 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   67.572984] ffff8939fdc267d8 (hrtimer_bases.lock){?.-.}-{2:2}, at: retrigger_next_event+0x38/0xd0
[   67.573831] {IN-HARDIRQ-W} state was registered at:
[   67.574299]   lock_acquire+0xd0/0x310
[   67.574670]   _raw_spin_lock_irqsave+0x48/0x70
[   67.575102]   hrtimer_run_queues+0x4d/0x150
[   67.575504]   update_process_times+0x34/0xf0
[   67.575926]   tick_periodic+0x29/0xe0
[   67.576287]   tick_handle_periodic+0x24/0x70
[   67.576715]   timer_interrupt+0x18/0x30
[   67.577093]   __handle_irq_event_percpu+0x87/0x260
[   67.577573]   handle_irq_event+0x38/0x90
[   67.577970]   handle_level_irq+0x8e/0x160
[   67.578350]   __common_interrupt+0x5c/0x120
[   67.578763]   common_interrupt+0x80/0xa0
[   67.579144]   asm_common_interrupt+0x26/0x40
[   67.579558]   __x86_return_thunk+0x0/0x10
[   67.579951]   _raw_spin_unlock_irqrestore+0x45/0x70
[   67.580418]   __setup_irq+0x34d/0x6a0
[   67.580787]   request_threaded_irq+0x115/0x1b0
[   67.581214]   hpet_time_init+0x31/0x50
[   67.581577]   x86_late_time_init+0x1b/0x40
[   67.581975]   start_kernel+0x998/0xa40
[   67.582336]   x86_64_start_reservations+0x24/0x30
[   67.582792]   x86_64_start_kernel+0xed/0xf0
[   67.583187]   common_startup_64+0x13e/0x141
[   67.583603] irq event stamp: 16407
[   67.583934] hardirqs last  enabled at (16407): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
[   67.584742] hardirqs last disabled at (16406): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
[   67.585537] softirqs last  enabled at (16260): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
[   67.586338] softirqs last disabled at (16245): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
[   67.587143] 
[   67.587143] other info that might help us debug this:
[   67.587759]  Possible unsafe locking scenario:
[   67.587759] 
[   67.588319]        CPU0
[   67.588564]        ----
[   67.588822]   lock(hrtimer_bases.lock);
[   67.589192]   <Interrupt>
[   67.589451]     lock(hrtimer_bases.lock);
[   67.589843] 
[   67.589843]  *** DEADLOCK ***
[   67.589843] 
[   67.590400] 1 lock held by loadret/571:
[   67.590775]  #0: ffffffff8e6902c8 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xc5/0x250
[   67.591655] 
[   67.591655] stack backtrace:
[   67.592073] CPU: 0 UID: 0 PID: 571 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2032
[   67.592883] Tainted: [W]=WARN
[   67.593176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   67.594234] Call Trace:
[   67.594480]  <TASK>
[   67.594707]  dump_stack_lvl+0x84/0xd0
[   67.595067]  print_usage_bug.part.0+0x257/0x340
[   67.595505]  mark_lock+0x735/0x960
[   67.595849]  ? vprintk_emit+0x111/0x460
[   67.596221]  ? hrtimers_resume_local+0x29/0x40
[   67.596666]  ? _printk+0x6c/0x90
[   67.596986]  __lock_acquire+0x7ee/0x25c0
[   67.597367]  ? __warn.cold+0x7f/0x151
[   67.597728]  ? hrtimers_resume_local+0x29/0x40
[   67.598160]  ? nbcon_get_cpu_emergency_nesting+0xa/0x30
[   67.598671]  ? nbcon_cpu_emergency_exit+0xe/0x40
[   67.599115]  ? report_bug+0xff/0x140
[   67.599462]  lock_acquire+0xd0/0x310
[   67.599815]  ? retrigger_next_event+0x38/0xd0
[   67.600234]  _raw_spin_lock+0x30/0x40
[   67.600603]  ? retrigger_next_event+0x38/0xd0
[   67.601025]  retrigger_next_event+0x38/0xd0
[   67.601425]  timekeeping_resume+0x148/0x190
[   67.601832]  syscore_resume+0x67/0x220
[   67.602193]  kernel_kexec+0xf6/0x180
[   67.602542]  __do_sys_reboot+0x206/0x250
[   67.602934]  do_syscall_64+0x95/0x180
[   67.603292]  ? __lock_acquire+0x45f/0x25c0
[   67.603706]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.604188]  ? smp_call_function_many_cond+0x11c/0x790
[   67.604684]  ? lock_acquire+0xd0/0x310
[   67.605049]  ? free_unref_page+0x22b/0x6a0
[   67.605444]  ? find_held_lock+0x2b/0x80
[   67.605820]  ? free_unref_page+0x510/0x6a0
[   67.606209]  ? do_raw_spin_unlock+0x4d/0xb0
[   67.606621]  ? _raw_spin_unlock+0x23/0x40
[   67.607008]  ? free_unref_page+0x510/0x6a0
[   67.607406]  ? arch_kexec_pre_free_pages+0x1a/0x40
[   67.607873]  ? do_kexec_load+0x11d/0x340
[   67.608251]  ? kfree+0xdb/0x3a0
[   67.608565]  ? __x64_sys_kexec_load+0xa9/0xe0
[   67.608999]  ? kfree+0xdb/0x3a0
[   67.609314]  ? do_kexec_load+0x11d/0x340
[   67.609699]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.610179]  ? syscall_exit_to_user_mode+0x97/0x290
[   67.610654]  ? do_syscall_64+0xa1/0x180
[   67.611027]  ? find_held_lock+0x2b/0x80
[   67.611399]  ? do_user_addr_fault+0x59f/0x8a0
[   67.611839]  ? do_user_addr_fault+0x5a9/0x8a0
[   67.612265]  ? trace_hardirqs_off+0x4b/0xc0
[   67.612662]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   67.613147]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   67.613639] RIP: 0033:0x7f9ae765115d
[   67.613988] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
[   67.615719] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[   67.616429] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
[   67.617098] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[   67.617770] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
[   67.618440] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
[   67.619120] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
[   67.619803]  </TASK>
[   67.620127] Enabling non-boot CPUs ...


I also saw this one *between* attempts, presumably caused by the virtblk_freeze() one:

[   23.699450] 
[   23.699826] ======================================================
[   23.701054] WARNING: possible circular locking dependency detected
[   23.702409] 6.13.0-rc1+ #2032 Not tainted
[   23.703148] ------------------------------------------------------
[   23.704248] kworker/u8:4/76 is trying to acquire lock:
[   23.705177] ffff892c811a2d48 ((wq_completion)ext4-rsv-conversion){+.+.}-{0:0}, at: process_one_work+0x51d/0x590
[   23.706899] 
[   23.706899] but task is already holding lock:
[   23.707829] ffff892c82fb5430 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
[   23.709152] 
[   23.709152] which lock already depends on the new lock.
[   23.709152] 
[   23.710309] 
[   23.710309] the existing dependency chain (in reverse order) is:
[   23.711374] 
[   23.711374] -> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
[   23.712380]        blk_mq_submit_bio+0x90d/0xb00
[   23.713006]        __submit_bio+0x10d/0x1f0
[   23.713559]        submit_bio_noacct_nocheck+0x324/0x420
[   23.714266]        ext4_bio_write_folio+0x1fc/0x750
[   23.714851]        mpage_submit_folio+0x8d/0xb0
[   23.715409]        mpage_process_page_bufs+0xd0/0x1b0
[   23.716024]        mpage_prepare_extent_to_map+0x1d0/0x510
[   23.716669]        ext4_do_writepages+0x4ec/0xee0
[   23.717246]        ext4_writepages+0xe0/0x280
[   23.717772]        do_writepages+0xeb/0x290
[   23.718264]        filemap_fdatawrite_wbc+0x4f/0x70
[   23.718830]        __filemap_fdatawrite_range+0x60/0x90
[   23.719440]        file_write_and_wait_range+0x47/0xb0
[   23.720049]        ext4_sync_file+0xac/0x3e0
[   23.720548]        do_fsync+0x39/0x70
[   23.720995]        __x64_sys_fsync+0x13/0x20
[   23.721481]        do_syscall_64+0x95/0x180
[   23.721943]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   23.722522] 
[   23.722522] -> #2 (jbd2_handle){++++}-{0:0}:
[   23.723139]        start_this_handle+0x193/0x540
[   23.723634]        jbd2_journal_start_reserved+0x54/0x1e0
[   23.724210]        __ext4_journal_start_reserved+0x7a/0x170
[   23.724756]        ext4_convert_unwritten_io_end_vec+0x2b/0xe0
[   23.725304]        ext4_end_io_rsv_work+0x102/0x1d0
[   23.725798]        process_one_work+0x21f/0x590
[   23.726255]        worker_thread+0x1c3/0x3b0
[   23.726690]        kthread+0xd5/0x100
[   23.727072]        ret_from_fork+0x34/0x50
[   23.727481]        ret_from_fork_asm+0x1a/0x30
[   23.727939] 
[   23.727939] -> #1 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}-{0:0}:
[   23.728788]        process_one_work+0x1f4/0x590
[   23.729267]        worker_thread+0x1c3/0x3b0
[   23.729715]        kthread+0xd5/0x100
[   23.730100]        ret_from_fork+0x34/0x50
[   23.730502]        ret_from_fork_asm+0x1a/0x30
[   23.730944] 
[   23.730944] -> #0 ((wq_completion)ext4-rsv-conversion){+.+.}-{0:0}:
[   23.731688]        __lock_acquire+0x14ba/0x25c0
[   23.732142]        lock_acquire+0xd0/0x310
[   23.732550]        process_one_work+0x52e/0x590
[   23.732995]        worker_thread+0x1c3/0x3b0
[   23.733424]        kthread+0xd5/0x100
[   23.733783]        ret_from_fork+0x34/0x50
[   23.734191]        ret_from_fork_asm+0x1a/0x30
[   23.734632] 
[   23.734632] other info that might help us debug this:
[   23.734632] 
[   23.735409] Chain exists of:
[   23.735409]   (wq_completion)ext4-rsv-conversion --> jbd2_handle --> &q->q_usage_counter(io)
[   23.735409] 
[   23.736630]  Possible unsafe locking scenario:
[   23.736630] 
[   23.737214]        CPU0                    CPU1
[   23.737662]        ----                    ----
[   23.738124]   lock(&q->q_usage_counter(io));
[   23.738539]                                lock(jbd2_handle);
[   23.739112]                                lock(&q->q_usage_counter(io));
[   23.739770]   lock((wq_completion)ext4-rsv-conversion);
[   23.740296] 
[   23.740296]  *** DEADLOCK ***
[   23.740296] 
[   23.740850] 2 locks held by kworker/u8:4/76:
[   23.741284]  #0: ffff892c82fb5430 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
[   23.742172]  #1: ffff892c82fb5468 (&q->q_usage_counter(queue)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
[   23.743083] 
[   23.743083] stack backtrace:
[   23.743514] CPU: 0 UID: 0 PID: 76 Comm: kworker/u8:4 Not tainted 6.13.0-rc1+ #2032
[   23.744259] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   23.745355] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
[   23.745947] Call Trace:
[   23.746207]  <TASK>
[   23.746428]  dump_stack_lvl+0x84/0xd0
[   23.746807]  print_circular_bug.cold+0x178/0x1be
[   23.747280]  check_noncircular+0x148/0x160
[   23.747694]  __lock_acquire+0x14ba/0x25c0
[   23.748114]  lock_acquire+0xd0/0x310
[   23.748480]  ? process_one_work+0x51d/0x590
[   23.748906]  ? mark_held_locks+0x40/0x70
[   23.749303]  process_one_work+0x52e/0x590
[   23.749705]  ? process_one_work+0x51d/0x590
[   23.750136]  worker_thread+0x1c3/0x3b0
[   23.750503]  ? __pfx_worker_thread+0x10/0x10
[   23.750923]  kthread+0xd5/0x100
[   23.751235]  ? __pfx_kthread+0x10/0x10
[   23.751601]  ret_from_fork+0x34/0x50
[   23.751968]  ? __pfx_kthread+0x10/0x10
[   23.752334]  ret_from_fork_asm+0x1a/0x30
[   23.752721]  </TASK>
[   23.753031] BUG: workqueue leaked atomic, lock or RCU: kworker/u8:4[76]
[   23.753031]      preempt=0x00000000 lock=2->0 RCU=0->0 workfn=ext4_end_io_rsv_work
[   23.754419] INFO: lockdep is turned off.
[   23.754814] CPU: 0 UID: 0 PID: 76 Comm: kworker/u8:4 Not tainted 6.13.0-rc1+ #2032
[   23.755537] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   23.756621] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
[   23.757214] Call Trace:
[   23.757464]  <TASK>
[   23.757682]  dump_stack_lvl+0x84/0xd0
[   23.758060]  process_one_work.cold+0x6d/0xc8
[   23.758481]  ? __pfx_ext4_end_io_rsv_work+0x10/0x10
[   23.758946]  ? process_one_work+0x24a/0x590
[   23.759339]  worker_thread+0x1c3/0x3b0
[   23.759706]  ? __pfx_worker_thread+0x10/0x10
[   23.760276]  kthread+0xd5/0x100
[   23.760581]  ? __pfx_kthread+0x10/0x10
[   23.760956]  ret_from_fork+0x34/0x50
[   23.761306]  ? __pfx_kthread+0x10/0x10
[   23.761669]  ret_from_fork_asm+0x1a/0x30
[   23.762064]  </TASK>




[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-09 14:28 Lockdep warnings on kexec (virtio_blk, hrtimers) David Woodhouse
@ 2024-12-10  1:56 ` Jason Wang
  2024-12-11 12:42   ` Stefan Hajnoczi
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Wang @ 2024-12-10  1:56 UTC (permalink / raw)
  To: David Woodhouse
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, Stefan Hajnoczi, eperezma,
	Paolo Bonzini

Adding more virtio-blk people here.

On Mon, Dec 9, 2024 at 10:49 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> Using the test case in https://git.kernel.org/torvalds/c/07fa619f2a40c
> I see sporadic lockdep warnings.
>
> This warning on the way into kexec seems to happen every time:
>
> [   67.416890] Freezing user space processes
> [   67.419277] Freezing user space processes completed (elapsed 0.001 seconds)
> [   67.420754] OOM killer disabled.
> [   67.433337] BUG: workqueue leaked atomic, lock or RCU: kworker/u8:7[558]
> [   67.433337]      preempt=0x00000000 lock=0->2 RCU=0->0 workfn=async_run_entry_fn
> [   67.436941] 2 locks held by kworker/u8:7/558:
> [   67.437912]  #0: ffff893982fdb858 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> [   67.439980]  #1: ffff893982fdb890 (&q->q_usage_counter(queue)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> [   67.441783] CPU: 0 UID: 0 PID: 558 Comm: kworker/u8:7 Not tainted 6.13.0-rc1+ #2032
> [   67.442462] Disabling non-boot CPUs ...
> [   67.443101] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   67.443106] Workqueue: async async_run_entry_fn
> [   67.443114] Call Trace:
> [   67.443117]  <TASK>
> [   67.443123]  dump_stack_lvl+0x84/0xd0
> [   67.447325]  process_one_work.cold+0x6d/0xc8
> [   67.447991]  ? __pfx_async_run_entry_fn+0x10/0x10
> [   67.448644]  ? process_one_work+0x24a/0x590
> [   67.449226]  worker_thread+0x1c3/0x3b0
> [   67.449757]  ? __pfx_worker_thread+0x10/0x10
> [   67.450345]  kthread+0xd5/0x100
> [   67.450822]  ? __pfx_kthread+0x10/0x10
> [   67.451320]  ret_from_fork+0x34/0x50
> [   67.451775]  ? __pfx_kthread+0x10/0x10
> [   67.452239]  ret_from_fork_asm+0x1a/0x30
> [   67.452747]  </TASK>
> [   67.471104] smpboot: CPU 1 is now offline
>
>
>
> This one happens only occasionally (1 in 20 or so):
>
> B[   67.487529] ------------[ cut here ]------------
> [   67.488018] Interrupts enabled after irqrouter_resume+0x0/0x50
> [   67.488684] WARNING: CPU: 0 PID: 571 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
> [   67.489571] Modules linked in:
> [   67.489920] CPU: 0 UID: 0 PID: 571 Comm: loadret Not tainted 6.13.0-rc1+ #2032
> [   67.490692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   67.491856] RIP: 0010:syscore_resume+0x18a/0x220
> [   67.492347] Code: 00 e9 21 ff ff ff 80 3d 62 d0 d3 01 00 0f 85 03 ff ff ff 48 8b 73 18 48 c7 c7 06 f6 e9 8d c6 05 4a d0 d3 01 01 e8 16 56 46 ff <0f> 0b e9 e5 fe ff ff e8 ea e9 54 ff 84 c0 0f 85 fb fe ff ff 80 3d
> [   67.494253] RSP: 0018:ffffaca540773a08 EFLAGS: 00010286
> [   67.494816] RAX: 0000000000000000 RBX: ffffffff8e9918c0 RCX: 0000000000000027
> [   67.495560] RDX: ffff8939fdc21a88 RSI: 0000000000000001 RDI: ffff8939fdc21a80
> [   67.496324] RBP: 0000000000037e0c R08: 0000000000000000 R09: 0000000000000000
> [   67.497052] R10: 0000000000000001 R11: ffffffff8e782fd8 R12: ffffaca540773a38
> [   67.497788] R13: ffffffff8e68ebe0 R14: 00000000fee1dead R15: 0000000000000000
> [   67.498504] FS:  00007f9ae771f540(0000) GS:ffff8939fdc00000(0000) knlGS:0000000000000000
> [   67.499320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   67.499930] CR2: 0000559aeea00b70 CR3: 0000000007a5e001 CR4: 0000000000170ef0
> [   67.500650] Call Trace:
> [   67.500913]  <TASK>
> [   67.501147]  ? __warn.cold+0xb7/0x151
> [   67.501535]  ? syscore_resume+0x18a/0x220
> [   67.501963]  ? report_bug+0xff/0x140
> [   67.502331]  ? console_unlock+0x9d/0x150
> [   67.502757]  ? handle_bug+0x58/0x90
> [   67.503128]  ? exc_invalid_op+0x17/0x70
> [   67.503529]  ? asm_exc_invalid_op+0x1a/0x20
> [   67.503975]  ? syscore_resume+0x18a/0x220
> [   67.504390]  ? syscore_resume+0x18a/0x220
> [   67.504814]  kernel_kexec+0xf6/0x180
> [   67.505190]  __do_sys_reboot+0x206/0x250
> [   67.505653]  do_syscall_64+0x95/0x180
> [   67.506037]  ? __lock_acquire+0x45f/0x25c0
> [   67.506451]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.506974]  ? smp_call_function_many_cond+0x11c/0x790
> [   67.507503]  ? lock_acquire+0xd0/0x310
> [   67.507896]  ? free_unref_page+0x22b/0x6a0
> [   67.508316]  ? find_held_lock+0x2b/0x80
> [   67.508725]  ? free_unref_page+0x510/0x6a0
> [   67.509157]  ? do_raw_spin_unlock+0x4d/0xb0
> [   67.509592]  ? _raw_spin_unlock+0x23/0x40
> [   67.510006]  ? free_unref_page+0x510/0x6a0
> [   67.510428]  ? arch_kexec_pre_free_pages+0x1a/0x40
> [   67.510938]  ? do_kexec_load+0x11d/0x340
> [   67.511343]  ? kfree+0xdb/0x3a0
> [   67.511697]  ? __x64_sys_kexec_load+0xa9/0xe0
> [   67.512149]  ? kfree+0xdb/0x3a0
> [   67.512485]  ? do_kexec_load+0x11d/0x340
> [   67.512912]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.513430]  ? syscall_exit_to_user_mode+0x97/0x290
> [   67.513935]  ? do_syscall_64+0xa1/0x180
> [   67.514333]  ? find_held_lock+0x2b/0x80
> [   67.514736]  ? do_user_addr_fault+0x59f/0x8a0
> [   67.515193]  ? do_user_addr_fault+0x5a9/0x8a0
> [   67.515655]  ? trace_hardirqs_off+0x4b/0xc0
> [   67.516102]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.516625]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   67.517138] RIP: 0033:0x7f9ae765115d
> [   67.517515] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
> [   67.519312] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
> [   67.520057] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
> [   67.520760] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
> [   67.521454] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
> [   67.522172] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
> [   67.522881] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
> [   67.523593]  </TASK>
> [   67.523832] irq event stamp: 15605
> [   67.524181] hardirqs last  enabled at (15613): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
> [   67.525034] hardirqs last disabled at (15620): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
> [   67.525935] softirqs last  enabled at (14732): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> [   67.526780] softirqs last disabled at (14717): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> [   67.527616] ---[ end trace 0000000000000000 ]---
> [   67.530225] ------------[ cut here ]------------
> [   67.530760] WARNING: CPU: 0 PID: 571 at kernel/time/hrtimer.c:995 hrtimers_resume_local+0x29/0x40
> [   67.531654] Modules linked in:
> [   67.531982] CPU: 0 UID: 0 PID: 571 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2032
> [   67.532862] Tainted: [W]=WARN
> [   67.533185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   67.534313] RIP: 0010:hrtimers_resume_local+0x29/0x40
> [   67.534833] Code: 90 66 0f 1f 00 0f 1f 44 00 00 8b 05 c5 57 81 02 85 c0 74 18 65 8b 05 0e 8c d4 73 85 c0 75 0d 65 8b 05 c7 88 d4 73 85 c0 74 02 <0f> 0b 31 ff e9 de ee ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> [   67.536771] RSP: 0018:ffffaca5407739a0 EFLAGS: 00010202
> [   67.537296] RAX: 0000000000000001 RBX: 0000000fcd31f08b RCX: 00000000000006e0
> [   67.538000] RDX: 0000000000000029 RSI: 00000000007f56f4 RDI: 00000000000006e0
> [   67.538706] RBP: ffffaca5407739f8 R08: 0000000000000001 R09: 0000000000000000
> [   67.539406] R10: 0000000000000001 R11: ffffffff9018d188 R12: 0000000000000202
> [   67.540104] R13: ffffffff8e666ca0 R14: 00000000fee1dead R15: 0000000000000000
> [   67.540840] FS:  00007f9ae771f540(0000) GS:ffff8939fdc00000(0000) knlGS:0000000000000000
> [   67.541638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   67.542218] CR2: 0000559aeea00b70 CR3: 0000000007a5e001 CR4: 0000000000170ef0
> [   67.542915] Call Trace:
> [   67.543185]  <TASK>
> [   67.543413]  ? __warn.cold+0xb7/0x151
> [   67.543804]  ? hrtimers_resume_local+0x29/0x40
> [   67.544269]  ? report_bug+0xff/0x140
> [   67.544646]  ? handle_bug+0x58/0x90
> [   67.544999]  ? exc_invalid_op+0x17/0x70
> [   67.545396]  ? asm_exc_invalid_op+0x1a/0x20
> [   67.545862]  ? hrtimers_resume_local+0x29/0x40
> [   67.546318]  timekeeping_resume+0x148/0x190
> [   67.546743]  syscore_resume+0x67/0x220
> [   67.547132]  kernel_kexec+0xf6/0x180
> [   67.547491]  __do_sys_reboot+0x206/0x250
> [   67.547902]  do_syscall_64+0x95/0x180
> [   67.548285]  ? __lock_acquire+0x45f/0x25c0
> [   67.548714]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.549255]  ? smp_call_function_many_cond+0x11c/0x790
> [   67.549783]  ? lock_acquire+0xd0/0x310
> [   67.550176]  ? free_unref_page+0x22b/0x6a0
> [   67.550594]  ? find_held_lock+0x2b/0x80
> [   67.550973]  ? free_unref_page+0x510/0x6a0
> [   67.551409]  ? do_raw_spin_unlock+0x4d/0xb0
> [   67.551849]  ? _raw_spin_unlock+0x23/0x40
> [   67.552263]  ? free_unref_page+0x510/0x6a0
> [   67.552682]  ? arch_kexec_pre_free_pages+0x1a/0x40
> [   67.553169]  ? do_kexec_load+0x11d/0x340
> [   67.553553]  ? kfree+0xdb/0x3a0
> [   67.553895]  ? __x64_sys_kexec_load+0xa9/0xe0
> [   67.554345]  ? kfree+0xdb/0x3a0
> [   67.554678]  ? do_kexec_load+0x11d/0x340
> [   67.555073]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.555581]  ? syscall_exit_to_user_mode+0x97/0x290
> [   67.556121]  ? do_syscall_64+0xa1/0x180
> [   67.556506]  ? find_held_lock+0x2b/0x80
> [   67.556898]  ? do_user_addr_fault+0x59f/0x8a0
> [   67.557347]  ? do_user_addr_fault+0x5a9/0x8a0
> [   67.557795]  ? trace_hardirqs_off+0x4b/0xc0
> [   67.558221]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.558721]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   67.559230] RIP: 0033:0x7f9ae765115d
> [   67.559625] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
> [   67.561378] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
> [   67.562124] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
> [   67.562815] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
> [   67.563507] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
> [   67.564223] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
> [   67.564912] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
> [   67.565614]  </TASK>
> [   67.565847] irq event stamp: 16351
> [   67.566214] hardirqs last  enabled at (16361): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
> [   67.567039] hardirqs last disabled at (16368): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
> [   67.567867] softirqs last  enabled at (16260): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> [   67.568702] softirqs last disabled at (16245): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> [   67.569516] ---[ end trace 0000000000000000 ]---
> [   67.569981]
> [   67.570150] ================================
> [   67.570561] WARNING: inconsistent lock state
> [   67.570982] 6.13.0-rc1+ #2032 Tainted: G        W
> [   67.571501] --------------------------------
> [   67.571922] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> [   67.572490] loadret/571 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [   67.572984] ffff8939fdc267d8 (hrtimer_bases.lock){?.-.}-{2:2}, at: retrigger_next_event+0x38/0xd0
> [   67.573831] {IN-HARDIRQ-W} state was registered at:
> [   67.574299]   lock_acquire+0xd0/0x310
> [   67.574670]   _raw_spin_lock_irqsave+0x48/0x70
> [   67.575102]   hrtimer_run_queues+0x4d/0x150
> [   67.575504]   update_process_times+0x34/0xf0
> [   67.575926]   tick_periodic+0x29/0xe0
> [   67.576287]   tick_handle_periodic+0x24/0x70
> [   67.576715]   timer_interrupt+0x18/0x30
> [   67.577093]   __handle_irq_event_percpu+0x87/0x260
> [   67.577573]   handle_irq_event+0x38/0x90
> [   67.577970]   handle_level_irq+0x8e/0x160
> [   67.578350]   __common_interrupt+0x5c/0x120
> [   67.578763]   common_interrupt+0x80/0xa0
> [   67.579144]   asm_common_interrupt+0x26/0x40
> [   67.579558]   __x86_return_thunk+0x0/0x10
> [   67.579951]   _raw_spin_unlock_irqrestore+0x45/0x70
> [   67.580418]   __setup_irq+0x34d/0x6a0
> [   67.580787]   request_threaded_irq+0x115/0x1b0
> [   67.581214]   hpet_time_init+0x31/0x50
> [   67.581577]   x86_late_time_init+0x1b/0x40
> [   67.581975]   start_kernel+0x998/0xa40
> [   67.582336]   x86_64_start_reservations+0x24/0x30
> [   67.582792]   x86_64_start_kernel+0xed/0xf0
> [   67.583187]   common_startup_64+0x13e/0x141
> [   67.583603] irq event stamp: 16407
> [   67.583934] hardirqs last  enabled at (16407): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
> [   67.584742] hardirqs last disabled at (16406): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
> [   67.585537] softirqs last  enabled at (16260): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> [   67.586338] softirqs last disabled at (16245): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> [   67.587143]
> [   67.587143] other info that might help us debug this:
> [   67.587759]  Possible unsafe locking scenario:
> [   67.587759]
> [   67.588319]        CPU0
> [   67.588564]        ----
> [   67.588822]   lock(hrtimer_bases.lock);
> [   67.589192]   <Interrupt>
> [   67.589451]     lock(hrtimer_bases.lock);
> [   67.589843]
> [   67.589843]  *** DEADLOCK ***
> [   67.589843]
> [   67.590400] 1 lock held by loadret/571:
> [   67.590775]  #0: ffffffff8e6902c8 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xc5/0x250
> [   67.591655]
> [   67.591655] stack backtrace:
> [   67.592073] CPU: 0 UID: 0 PID: 571 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2032
> [   67.592883] Tainted: [W]=WARN
> [   67.593176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   67.594234] Call Trace:
> [   67.594480]  <TASK>
> [   67.594707]  dump_stack_lvl+0x84/0xd0
> [   67.595067]  print_usage_bug.part.0+0x257/0x340
> [   67.595505]  mark_lock+0x735/0x960
> [   67.595849]  ? vprintk_emit+0x111/0x460
> [   67.596221]  ? hrtimers_resume_local+0x29/0x40
> [   67.596666]  ? _printk+0x6c/0x90
> [   67.596986]  __lock_acquire+0x7ee/0x25c0
> [   67.597367]  ? __warn.cold+0x7f/0x151
> [   67.597728]  ? hrtimers_resume_local+0x29/0x40
> [   67.598160]  ? nbcon_get_cpu_emergency_nesting+0xa/0x30
> [   67.598671]  ? nbcon_cpu_emergency_exit+0xe/0x40
> [   67.599115]  ? report_bug+0xff/0x140
> [   67.599462]  lock_acquire+0xd0/0x310
> [   67.599815]  ? retrigger_next_event+0x38/0xd0
> [   67.600234]  _raw_spin_lock+0x30/0x40
> [   67.600603]  ? retrigger_next_event+0x38/0xd0
> [   67.601025]  retrigger_next_event+0x38/0xd0
> [   67.601425]  timekeeping_resume+0x148/0x190
> [   67.601832]  syscore_resume+0x67/0x220
> [   67.602193]  kernel_kexec+0xf6/0x180
> [   67.602542]  __do_sys_reboot+0x206/0x250
> [   67.602934]  do_syscall_64+0x95/0x180
> [   67.603292]  ? __lock_acquire+0x45f/0x25c0
> [   67.603706]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.604188]  ? smp_call_function_many_cond+0x11c/0x790
> [   67.604684]  ? lock_acquire+0xd0/0x310
> [   67.605049]  ? free_unref_page+0x22b/0x6a0
> [   67.605444]  ? find_held_lock+0x2b/0x80
> [   67.605820]  ? free_unref_page+0x510/0x6a0
> [   67.606209]  ? do_raw_spin_unlock+0x4d/0xb0
> [   67.606621]  ? _raw_spin_unlock+0x23/0x40
> [   67.607008]  ? free_unref_page+0x510/0x6a0
> [   67.607406]  ? arch_kexec_pre_free_pages+0x1a/0x40
> [   67.607873]  ? do_kexec_load+0x11d/0x340
> [   67.608251]  ? kfree+0xdb/0x3a0
> [   67.608565]  ? __x64_sys_kexec_load+0xa9/0xe0
> [   67.608999]  ? kfree+0xdb/0x3a0
> [   67.609314]  ? do_kexec_load+0x11d/0x340
> [   67.609699]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.610179]  ? syscall_exit_to_user_mode+0x97/0x290
> [   67.610654]  ? do_syscall_64+0xa1/0x180
> [   67.611027]  ? find_held_lock+0x2b/0x80
> [   67.611399]  ? do_user_addr_fault+0x59f/0x8a0
> [   67.611839]  ? do_user_addr_fault+0x5a9/0x8a0
> [   67.612265]  ? trace_hardirqs_off+0x4b/0xc0
> [   67.612662]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> [   67.613147]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   67.613639] RIP: 0033:0x7f9ae765115d
> [   67.613988] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
> [   67.615719] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
> [   67.616429] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
> [   67.617098] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
> [   67.617770] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
> [   67.618440] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
> [   67.619120] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
> [   67.619803]  </TASK>
> [   67.620127] Enabling non-boot CPUs ...
>
>
> I also saw this one *between* attempts, presumably caused by the virtblk_freeze() one:
>
> [   23.699450]
> [   23.699826] ======================================================
> [   23.701054] WARNING: possible circular locking dependency detected
> [   23.702409] 6.13.0-rc1+ #2032 Not tainted
> [   23.703148] ------------------------------------------------------
> [   23.704248] kworker/u8:4/76 is trying to acquire lock:
> [   23.705177] ffff892c811a2d48 ((wq_completion)ext4-rsv-conversion){+.+.}-{0:0}, at: process_one_work+0x51d/0x590
> [   23.706899]
> [   23.706899] but task is already holding lock:
> [   23.707829] ffff892c82fb5430 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> [   23.709152]
> [   23.709152] which lock already depends on the new lock.
> [   23.709152]
> [   23.710309]
> [   23.710309] the existing dependency chain (in reverse order) is:
> [   23.711374]
> [   23.711374] -> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
> [   23.712380]        blk_mq_submit_bio+0x90d/0xb00
> [   23.713006]        __submit_bio+0x10d/0x1f0
> [   23.713559]        submit_bio_noacct_nocheck+0x324/0x420
> [   23.714266]        ext4_bio_write_folio+0x1fc/0x750
> [   23.714851]        mpage_submit_folio+0x8d/0xb0
> [   23.715409]        mpage_process_page_bufs+0xd0/0x1b0
> [   23.716024]        mpage_prepare_extent_to_map+0x1d0/0x510
> [   23.716669]        ext4_do_writepages+0x4ec/0xee0
> [   23.717246]        ext4_writepages+0xe0/0x280
> [   23.717772]        do_writepages+0xeb/0x290
> [   23.718264]        filemap_fdatawrite_wbc+0x4f/0x70
> [   23.718830]        __filemap_fdatawrite_range+0x60/0x90
> [   23.719440]        file_write_and_wait_range+0x47/0xb0
> [   23.720049]        ext4_sync_file+0xac/0x3e0
> [   23.720548]        do_fsync+0x39/0x70
> [   23.720995]        __x64_sys_fsync+0x13/0x20
> [   23.721481]        do_syscall_64+0x95/0x180
> [   23.721943]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   23.722522]
> [   23.722522] -> #2 (jbd2_handle){++++}-{0:0}:
> [   23.723139]        start_this_handle+0x193/0x540
> [   23.723634]        jbd2_journal_start_reserved+0x54/0x1e0
> [   23.724210]        __ext4_journal_start_reserved+0x7a/0x170
> [   23.724756]        ext4_convert_unwritten_io_end_vec+0x2b/0xe0
> [   23.725304]        ext4_end_io_rsv_work+0x102/0x1d0
> [   23.725798]        process_one_work+0x21f/0x590
> [   23.726255]        worker_thread+0x1c3/0x3b0
> [   23.726690]        kthread+0xd5/0x100
> [   23.727072]        ret_from_fork+0x34/0x50
> [   23.727481]        ret_from_fork_asm+0x1a/0x30
> [   23.727939]
> [   23.727939] -> #1 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}-{0:0}:
> [   23.728788]        process_one_work+0x1f4/0x590
> [   23.729267]        worker_thread+0x1c3/0x3b0
> [   23.729715]        kthread+0xd5/0x100
> [   23.730100]        ret_from_fork+0x34/0x50
> [   23.730502]        ret_from_fork_asm+0x1a/0x30
> [   23.730944]
> [   23.730944] -> #0 ((wq_completion)ext4-rsv-conversion){+.+.}-{0:0}:
> [   23.731688]        __lock_acquire+0x14ba/0x25c0
> [   23.732142]        lock_acquire+0xd0/0x310
> [   23.732550]        process_one_work+0x52e/0x590
> [   23.732995]        worker_thread+0x1c3/0x3b0
> [   23.733424]        kthread+0xd5/0x100
> [   23.733783]        ret_from_fork+0x34/0x50
> [   23.734191]        ret_from_fork_asm+0x1a/0x30
> [   23.734632]
> [   23.734632] other info that might help us debug this:
> [   23.734632]
> [   23.735409] Chain exists of:
> [   23.735409]   (wq_completion)ext4-rsv-conversion --> jbd2_handle --> &q->q_usage_counter(io)
> [   23.735409]
> [   23.736630]  Possible unsafe locking scenario:
> [   23.736630]
> [   23.737214]        CPU0                    CPU1
> [   23.737662]        ----                    ----
> [   23.738124]   lock(&q->q_usage_counter(io));
> [   23.738539]                                lock(jbd2_handle);
> [   23.739112]                                lock(&q->q_usage_counter(io));
> [   23.739770]   lock((wq_completion)ext4-rsv-conversion);
> [   23.740296]
> [   23.740296]  *** DEADLOCK ***
> [   23.740296]
> [   23.740850] 2 locks held by kworker/u8:4/76:
> [   23.741284]  #0: ffff892c82fb5430 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> [   23.742172]  #1: ffff892c82fb5468 (&q->q_usage_counter(queue)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> [   23.743083]
> [   23.743083] stack backtrace:
> [   23.743514] CPU: 0 UID: 0 PID: 76 Comm: kworker/u8:4 Not tainted 6.13.0-rc1+ #2032
> [   23.744259] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   23.745355] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> [   23.745947] Call Trace:
> [   23.746207]  <TASK>
> [   23.746428]  dump_stack_lvl+0x84/0xd0
> [   23.746807]  print_circular_bug.cold+0x178/0x1be
> [   23.747280]  check_noncircular+0x148/0x160
> [   23.747694]  __lock_acquire+0x14ba/0x25c0
> [   23.748114]  lock_acquire+0xd0/0x310
> [   23.748480]  ? process_one_work+0x51d/0x590
> [   23.748906]  ? mark_held_locks+0x40/0x70
> [   23.749303]  process_one_work+0x52e/0x590
> [   23.749705]  ? process_one_work+0x51d/0x590
> [   23.750136]  worker_thread+0x1c3/0x3b0
> [   23.750503]  ? __pfx_worker_thread+0x10/0x10
> [   23.750923]  kthread+0xd5/0x100
> [   23.751235]  ? __pfx_kthread+0x10/0x10
> [   23.751601]  ret_from_fork+0x34/0x50
> [   23.751968]  ? __pfx_kthread+0x10/0x10
> [   23.752334]  ret_from_fork_asm+0x1a/0x30
> [   23.752721]  </TASK>
> [   23.753031] BUG: workqueue leaked atomic, lock or RCU: kworker/u8:4[76]
> [   23.753031]      preempt=0x00000000 lock=2->0 RCU=0->0 workfn=ext4_end_io_rsv_work
> [   23.754419] INFO: lockdep is turned off.
> [   23.754814] CPU: 0 UID: 0 PID: 76 Comm: kworker/u8:4 Not tainted 6.13.0-rc1+ #2032
> [   23.755537] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   23.756621] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> [   23.757214] Call Trace:
> [   23.757464]  <TASK>
> [   23.757682]  dump_stack_lvl+0x84/0xd0
> [   23.758060]  process_one_work.cold+0x6d/0xc8
> [   23.758481]  ? __pfx_ext4_end_io_rsv_work+0x10/0x10
> [   23.758946]  ? process_one_work+0x24a/0x590
> [   23.759339]  worker_thread+0x1c3/0x3b0
> [   23.759706]  ? __pfx_worker_thread+0x10/0x10
> [   23.760276]  kthread+0xd5/0x100
> [   23.760581]  ? __pfx_kthread+0x10/0x10
> [   23.760956]  ret_from_fork+0x34/0x50
> [   23.761306]  ? __pfx_kthread+0x10/0x10
> [   23.761669]  ret_from_fork_asm+0x1a/0x30
> [   23.762064]  </TASK>
>
>
>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-10  1:56 ` Jason Wang
@ 2024-12-11 12:42   ` Stefan Hajnoczi
  2024-12-12 11:07     ` David Woodhouse
  2024-12-12 11:12     ` Lockdep warnings on kexec (virtio_blk, hrtimers) Ming Lei
  0 siblings, 2 replies; 34+ messages in thread
From: Stefan Hajnoczi @ 2024-12-11 12:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Woodhouse, x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei

[-- Attachment #1: Type: text/plain, Size: 25435 bytes --]

On Tue, Dec 10, 2024 at 09:56:43AM +0800, Jason Wang wrote:
> Adding more virtio-blk people here.

Please try Ming Lei's recent fix in Jens' tree:

  virtio-blk: don't keep queue frozen during system suspend
  commit: 7678abee0867e6b7fb89aa40f6e9f575f755fb37

https://git.kernel.dk/cgit/linux/commit/?h=block-6.13&id=7678abee0867e6b7fb89aa40f6e9f575f755fb37

Stefan

> 
> On Mon, Dec 9, 2024 at 10:49 PM David Woodhouse <dwmw2@infradead.org> wrote:
> >
> > Using the test case in https://git.kernel.org/torvalds/c/07fa619f2a40c
> > I see sporadic lockdep warnings.
> >
> > This warning on the way into kexec seems to happen every time:
> >
> > [   67.416890] Freezing user space processes
> > [   67.419277] Freezing user space processes completed (elapsed 0.001 seconds)
> > [   67.420754] OOM killer disabled.
> > [   67.433337] BUG: workqueue leaked atomic, lock or RCU: kworker/u8:7[558]
> > [   67.433337]      preempt=0x00000000 lock=0->2 RCU=0->0 workfn=async_run_entry_fn
> > [   67.436941] 2 locks held by kworker/u8:7/558:
> > [   67.437912]  #0: ffff893982fdb858 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> > [   67.439980]  #1: ffff893982fdb890 (&q->q_usage_counter(queue)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> > [   67.441783] CPU: 0 UID: 0 PID: 558 Comm: kworker/u8:7 Not tainted 6.13.0-rc1+ #2032
> > [   67.442462] Disabling non-boot CPUs ...
> > [   67.443101] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > [   67.443106] Workqueue: async async_run_entry_fn
> > [   67.443114] Call Trace:
> > [   67.443117]  <TASK>
> > [   67.443123]  dump_stack_lvl+0x84/0xd0
> > [   67.447325]  process_one_work.cold+0x6d/0xc8
> > [   67.447991]  ? __pfx_async_run_entry_fn+0x10/0x10
> > [   67.448644]  ? process_one_work+0x24a/0x590
> > [   67.449226]  worker_thread+0x1c3/0x3b0
> > [   67.449757]  ? __pfx_worker_thread+0x10/0x10
> > [   67.450345]  kthread+0xd5/0x100
> > [   67.450822]  ? __pfx_kthread+0x10/0x10
> > [   67.451320]  ret_from_fork+0x34/0x50
> > [   67.451775]  ? __pfx_kthread+0x10/0x10
> > [   67.452239]  ret_from_fork_asm+0x1a/0x30
> > [   67.452747]  </TASK>
> > [   67.471104] smpboot: CPU 1 is now offline
> >
> >
> >
> > This one happens only occasionally (1 in 20 or so):
> >
> > B[   67.487529] ------------[ cut here ]------------
> > [   67.488018] Interrupts enabled after irqrouter_resume+0x0/0x50
> > [   67.488684] WARNING: CPU: 0 PID: 571 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
> > [   67.489571] Modules linked in:
> > [   67.489920] CPU: 0 UID: 0 PID: 571 Comm: loadret Not tainted 6.13.0-rc1+ #2032
> > [   67.490692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > [   67.491856] RIP: 0010:syscore_resume+0x18a/0x220
> > [   67.492347] Code: 00 e9 21 ff ff ff 80 3d 62 d0 d3 01 00 0f 85 03 ff ff ff 48 8b 73 18 48 c7 c7 06 f6 e9 8d c6 05 4a d0 d3 01 01 e8 16 56 46 ff <0f> 0b e9 e5 fe ff ff e8 ea e9 54 ff 84 c0 0f 85 fb fe ff ff 80 3d
> > [   67.494253] RSP: 0018:ffffaca540773a08 EFLAGS: 00010286
> > [   67.494816] RAX: 0000000000000000 RBX: ffffffff8e9918c0 RCX: 0000000000000027
> > [   67.495560] RDX: ffff8939fdc21a88 RSI: 0000000000000001 RDI: ffff8939fdc21a80
> > [   67.496324] RBP: 0000000000037e0c R08: 0000000000000000 R09: 0000000000000000
> > [   67.497052] R10: 0000000000000001 R11: ffffffff8e782fd8 R12: ffffaca540773a38
> > [   67.497788] R13: ffffffff8e68ebe0 R14: 00000000fee1dead R15: 0000000000000000
> > [   67.498504] FS:  00007f9ae771f540(0000) GS:ffff8939fdc00000(0000) knlGS:0000000000000000
> > [   67.499320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   67.499930] CR2: 0000559aeea00b70 CR3: 0000000007a5e001 CR4: 0000000000170ef0
> > [   67.500650] Call Trace:
> > [   67.500913]  <TASK>
> > [   67.501147]  ? __warn.cold+0xb7/0x151
> > [   67.501535]  ? syscore_resume+0x18a/0x220
> > [   67.501963]  ? report_bug+0xff/0x140
> > [   67.502331]  ? console_unlock+0x9d/0x150
> > [   67.502757]  ? handle_bug+0x58/0x90
> > [   67.503128]  ? exc_invalid_op+0x17/0x70
> > [   67.503529]  ? asm_exc_invalid_op+0x1a/0x20
> > [   67.503975]  ? syscore_resume+0x18a/0x220
> > [   67.504390]  ? syscore_resume+0x18a/0x220
> > [   67.504814]  kernel_kexec+0xf6/0x180
> > [   67.505190]  __do_sys_reboot+0x206/0x250
> > [   67.505653]  do_syscall_64+0x95/0x180
> > [   67.506037]  ? __lock_acquire+0x45f/0x25c0
> > [   67.506451]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.506974]  ? smp_call_function_many_cond+0x11c/0x790
> > [   67.507503]  ? lock_acquire+0xd0/0x310
> > [   67.507896]  ? free_unref_page+0x22b/0x6a0
> > [   67.508316]  ? find_held_lock+0x2b/0x80
> > [   67.508725]  ? free_unref_page+0x510/0x6a0
> > [   67.509157]  ? do_raw_spin_unlock+0x4d/0xb0
> > [   67.509592]  ? _raw_spin_unlock+0x23/0x40
> > [   67.510006]  ? free_unref_page+0x510/0x6a0
> > [   67.510428]  ? arch_kexec_pre_free_pages+0x1a/0x40
> > [   67.510938]  ? do_kexec_load+0x11d/0x340
> > [   67.511343]  ? kfree+0xdb/0x3a0
> > [   67.511697]  ? __x64_sys_kexec_load+0xa9/0xe0
> > [   67.512149]  ? kfree+0xdb/0x3a0
> > [   67.512485]  ? do_kexec_load+0x11d/0x340
> > [   67.512912]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.513430]  ? syscall_exit_to_user_mode+0x97/0x290
> > [   67.513935]  ? do_syscall_64+0xa1/0x180
> > [   67.514333]  ? find_held_lock+0x2b/0x80
> > [   67.514736]  ? do_user_addr_fault+0x59f/0x8a0
> > [   67.515193]  ? do_user_addr_fault+0x5a9/0x8a0
> > [   67.515655]  ? trace_hardirqs_off+0x4b/0xc0
> > [   67.516102]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.516625]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   67.517138] RIP: 0033:0x7f9ae765115d
> > [   67.517515] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
> > [   67.519312] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
> > [   67.520057] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
> > [   67.520760] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
> > [   67.521454] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
> > [   67.522172] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
> > [   67.522881] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
> > [   67.523593]  </TASK>
> > [   67.523832] irq event stamp: 15605
> > [   67.524181] hardirqs last  enabled at (15613): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
> > [   67.525034] hardirqs last disabled at (15620): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
> > [   67.525935] softirqs last  enabled at (14732): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> > [   67.526780] softirqs last disabled at (14717): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> > [   67.527616] ---[ end trace 0000000000000000 ]---
> > [   67.530225] ------------[ cut here ]------------
> > [   67.530760] WARNING: CPU: 0 PID: 571 at kernel/time/hrtimer.c:995 hrtimers_resume_local+0x29/0x40
> > [   67.531654] Modules linked in:
> > [   67.531982] CPU: 0 UID: 0 PID: 571 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2032
> > [   67.532862] Tainted: [W]=WARN
> > [   67.533185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > [   67.534313] RIP: 0010:hrtimers_resume_local+0x29/0x40
> > [   67.534833] Code: 90 66 0f 1f 00 0f 1f 44 00 00 8b 05 c5 57 81 02 85 c0 74 18 65 8b 05 0e 8c d4 73 85 c0 75 0d 65 8b 05 c7 88 d4 73 85 c0 74 02 <0f> 0b 31 ff e9 de ee ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
> > [   67.536771] RSP: 0018:ffffaca5407739a0 EFLAGS: 00010202
> > [   67.537296] RAX: 0000000000000001 RBX: 0000000fcd31f08b RCX: 00000000000006e0
> > [   67.538000] RDX: 0000000000000029 RSI: 00000000007f56f4 RDI: 00000000000006e0
> > [   67.538706] RBP: ffffaca5407739f8 R08: 0000000000000001 R09: 0000000000000000
> > [   67.539406] R10: 0000000000000001 R11: ffffffff9018d188 R12: 0000000000000202
> > [   67.540104] R13: ffffffff8e666ca0 R14: 00000000fee1dead R15: 0000000000000000
> > [   67.540840] FS:  00007f9ae771f540(0000) GS:ffff8939fdc00000(0000) knlGS:0000000000000000
> > [   67.541638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   67.542218] CR2: 0000559aeea00b70 CR3: 0000000007a5e001 CR4: 0000000000170ef0
> > [   67.542915] Call Trace:
> > [   67.543185]  <TASK>
> > [   67.543413]  ? __warn.cold+0xb7/0x151
> > [   67.543804]  ? hrtimers_resume_local+0x29/0x40
> > [   67.544269]  ? report_bug+0xff/0x140
> > [   67.544646]  ? handle_bug+0x58/0x90
> > [   67.544999]  ? exc_invalid_op+0x17/0x70
> > [   67.545396]  ? asm_exc_invalid_op+0x1a/0x20
> > [   67.545862]  ? hrtimers_resume_local+0x29/0x40
> > [   67.546318]  timekeeping_resume+0x148/0x190
> > [   67.546743]  syscore_resume+0x67/0x220
> > [   67.547132]  kernel_kexec+0xf6/0x180
> > [   67.547491]  __do_sys_reboot+0x206/0x250
> > [   67.547902]  do_syscall_64+0x95/0x180
> > [   67.548285]  ? __lock_acquire+0x45f/0x25c0
> > [   67.548714]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.549255]  ? smp_call_function_many_cond+0x11c/0x790
> > [   67.549783]  ? lock_acquire+0xd0/0x310
> > [   67.550176]  ? free_unref_page+0x22b/0x6a0
> > [   67.550594]  ? find_held_lock+0x2b/0x80
> > [   67.550973]  ? free_unref_page+0x510/0x6a0
> > [   67.551409]  ? do_raw_spin_unlock+0x4d/0xb0
> > [   67.551849]  ? _raw_spin_unlock+0x23/0x40
> > [   67.552263]  ? free_unref_page+0x510/0x6a0
> > [   67.552682]  ? arch_kexec_pre_free_pages+0x1a/0x40
> > [   67.553169]  ? do_kexec_load+0x11d/0x340
> > [   67.553553]  ? kfree+0xdb/0x3a0
> > [   67.553895]  ? __x64_sys_kexec_load+0xa9/0xe0
> > [   67.554345]  ? kfree+0xdb/0x3a0
> > [   67.554678]  ? do_kexec_load+0x11d/0x340
> > [   67.555073]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.555581]  ? syscall_exit_to_user_mode+0x97/0x290
> > [   67.556121]  ? do_syscall_64+0xa1/0x180
> > [   67.556506]  ? find_held_lock+0x2b/0x80
> > [   67.556898]  ? do_user_addr_fault+0x59f/0x8a0
> > [   67.557347]  ? do_user_addr_fault+0x5a9/0x8a0
> > [   67.557795]  ? trace_hardirqs_off+0x4b/0xc0
> > [   67.558221]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.558721]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   67.559230] RIP: 0033:0x7f9ae765115d
> > [   67.559625] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
> > [   67.561378] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
> > [   67.562124] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
> > [   67.562815] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
> > [   67.563507] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
> > [   67.564223] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
> > [   67.564912] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
> > [   67.565614]  </TASK>
> > [   67.565847] irq event stamp: 16351
> > [   67.566214] hardirqs last  enabled at (16361): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
> > [   67.567039] hardirqs last disabled at (16368): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
> > [   67.567867] softirqs last  enabled at (16260): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> > [   67.568702] softirqs last disabled at (16245): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> > [   67.569516] ---[ end trace 0000000000000000 ]---
> > [   67.569981]
> > [   67.570150] ================================
> > [   67.570561] WARNING: inconsistent lock state
> > [   67.570982] 6.13.0-rc1+ #2032 Tainted: G        W
> > [   67.571501] --------------------------------
> > [   67.571922] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
> > [   67.572490] loadret/571 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > [   67.572984] ffff8939fdc267d8 (hrtimer_bases.lock){?.-.}-{2:2}, at: retrigger_next_event+0x38/0xd0
> > [   67.573831] {IN-HARDIRQ-W} state was registered at:
> > [   67.574299]   lock_acquire+0xd0/0x310
> > [   67.574670]   _raw_spin_lock_irqsave+0x48/0x70
> > [   67.575102]   hrtimer_run_queues+0x4d/0x150
> > [   67.575504]   update_process_times+0x34/0xf0
> > [   67.575926]   tick_periodic+0x29/0xe0
> > [   67.576287]   tick_handle_periodic+0x24/0x70
> > [   67.576715]   timer_interrupt+0x18/0x30
> > [   67.577093]   __handle_irq_event_percpu+0x87/0x260
> > [   67.577573]   handle_irq_event+0x38/0x90
> > [   67.577970]   handle_level_irq+0x8e/0x160
> > [   67.578350]   __common_interrupt+0x5c/0x120
> > [   67.578763]   common_interrupt+0x80/0xa0
> > [   67.579144]   asm_common_interrupt+0x26/0x40
> > [   67.579558]   __x86_return_thunk+0x0/0x10
> > [   67.579951]   _raw_spin_unlock_irqrestore+0x45/0x70
> > [   67.580418]   __setup_irq+0x34d/0x6a0
> > [   67.580787]   request_threaded_irq+0x115/0x1b0
> > [   67.581214]   hpet_time_init+0x31/0x50
> > [   67.581577]   x86_late_time_init+0x1b/0x40
> > [   67.581975]   start_kernel+0x998/0xa40
> > [   67.582336]   x86_64_start_reservations+0x24/0x30
> > [   67.582792]   x86_64_start_kernel+0xed/0xf0
> > [   67.583187]   common_startup_64+0x13e/0x141
> > [   67.583603] irq event stamp: 16407
> > [   67.583934] hardirqs last  enabled at (16407): [<ffffffff8c281b8e>] __up_console_sem+0x7e/0x90
> > [   67.584742] hardirqs last disabled at (16406): [<ffffffff8c281b73>] __up_console_sem+0x63/0x90
> > [   67.585537] softirqs last  enabled at (16260): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> > [   67.586338] softirqs last disabled at (16245): [<ffffffff8c1c6c12>] __irq_exit_rcu+0xe2/0x100
> > [   67.587143]
> > [   67.587143] other info that might help us debug this:
> > [   67.587759]  Possible unsafe locking scenario:
> > [   67.587759]
> > [   67.588319]        CPU0
> > [   67.588564]        ----
> > [   67.588822]   lock(hrtimer_bases.lock);
> > [   67.589192]   <Interrupt>
> > [   67.589451]     lock(hrtimer_bases.lock);
> > [   67.589843]
> > [   67.589843]  *** DEADLOCK ***
> > [   67.589843]
> > [   67.590400] 1 lock held by loadret/571:
> > [   67.590775]  #0: ffffffff8e6902c8 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xc5/0x250
> > [   67.591655]
> > [   67.591655] stack backtrace:
> > [   67.592073] CPU: 0 UID: 0 PID: 571 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2032
> > [   67.592883] Tainted: [W]=WARN
> > [   67.593176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > [   67.594234] Call Trace:
> > [   67.594480]  <TASK>
> > [   67.594707]  dump_stack_lvl+0x84/0xd0
> > [   67.595067]  print_usage_bug.part.0+0x257/0x340
> > [   67.595505]  mark_lock+0x735/0x960
> > [   67.595849]  ? vprintk_emit+0x111/0x460
> > [   67.596221]  ? hrtimers_resume_local+0x29/0x40
> > [   67.596666]  ? _printk+0x6c/0x90
> > [   67.596986]  __lock_acquire+0x7ee/0x25c0
> > [   67.597367]  ? __warn.cold+0x7f/0x151
> > [   67.597728]  ? hrtimers_resume_local+0x29/0x40
> > [   67.598160]  ? nbcon_get_cpu_emergency_nesting+0xa/0x30
> > [   67.598671]  ? nbcon_cpu_emergency_exit+0xe/0x40
> > [   67.599115]  ? report_bug+0xff/0x140
> > [   67.599462]  lock_acquire+0xd0/0x310
> > [   67.599815]  ? retrigger_next_event+0x38/0xd0
> > [   67.600234]  _raw_spin_lock+0x30/0x40
> > [   67.600603]  ? retrigger_next_event+0x38/0xd0
> > [   67.601025]  retrigger_next_event+0x38/0xd0
> > [   67.601425]  timekeeping_resume+0x148/0x190
> > [   67.601832]  syscore_resume+0x67/0x220
> > [   67.602193]  kernel_kexec+0xf6/0x180
> > [   67.602542]  __do_sys_reboot+0x206/0x250
> > [   67.602934]  do_syscall_64+0x95/0x180
> > [   67.603292]  ? __lock_acquire+0x45f/0x25c0
> > [   67.603706]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.604188]  ? smp_call_function_many_cond+0x11c/0x790
> > [   67.604684]  ? lock_acquire+0xd0/0x310
> > [   67.605049]  ? free_unref_page+0x22b/0x6a0
> > [   67.605444]  ? find_held_lock+0x2b/0x80
> > [   67.605820]  ? free_unref_page+0x510/0x6a0
> > [   67.606209]  ? do_raw_spin_unlock+0x4d/0xb0
> > [   67.606621]  ? _raw_spin_unlock+0x23/0x40
> > [   67.607008]  ? free_unref_page+0x510/0x6a0
> > [   67.607406]  ? arch_kexec_pre_free_pages+0x1a/0x40
> > [   67.607873]  ? do_kexec_load+0x11d/0x340
> > [   67.608251]  ? kfree+0xdb/0x3a0
> > [   67.608565]  ? __x64_sys_kexec_load+0xa9/0xe0
> > [   67.608999]  ? kfree+0xdb/0x3a0
> > [   67.609314]  ? do_kexec_load+0x11d/0x340
> > [   67.609699]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.610179]  ? syscall_exit_to_user_mode+0x97/0x290
> > [   67.610654]  ? do_syscall_64+0xa1/0x180
> > [   67.611027]  ? find_held_lock+0x2b/0x80
> > [   67.611399]  ? do_user_addr_fault+0x59f/0x8a0
> > [   67.611839]  ? do_user_addr_fault+0x5a9/0x8a0
> > [   67.612265]  ? trace_hardirqs_off+0x4b/0xc0
> > [   67.612662]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
> > [   67.613147]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   67.613639] RIP: 0033:0x7f9ae765115d
> > [   67.613988] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
> > [   67.615719] RSP: 002b:00007ffd17db0ad8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
> > [   67.616429] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9ae765115d
> > [   67.617098] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
> > [   67.617770] RBP: 00007ffd17db0b20 R08: 0000004d00000000 R09: 0000004d00000000
> > [   67.618440] R10: 00007f9ae7745f30 R11: 0000000000000246 R12: 0000000000401070
> > [   67.619120] R13: 00007ffd17db0c00 R14: 0000000000000000 R15: 0000000000000000
> > [   67.619803]  </TASK>
> > [   67.620127] Enabling non-boot CPUs ...
> >
> >
> > I also saw this one *between* attempts, presumably caused by the virtblk_freeze() one:
> >
> > [   23.699450]
> > [   23.699826] ======================================================
> > [   23.701054] WARNING: possible circular locking dependency detected
> > [   23.702409] 6.13.0-rc1+ #2032 Not tainted
> > [   23.703148] ------------------------------------------------------
> > [   23.704248] kworker/u8:4/76 is trying to acquire lock:
> > [   23.705177] ffff892c811a2d48 ((wq_completion)ext4-rsv-conversion){+.+.}-{0:0}, at: process_one_work+0x51d/0x590
> > [   23.706899]
> > [   23.706899] but task is already holding lock:
> > [   23.707829] ffff892c82fb5430 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> > [   23.709152]
> > [   23.709152] which lock already depends on the new lock.
> > [   23.709152]
> > [   23.710309]
> > [   23.710309] the existing dependency chain (in reverse order) is:
> > [   23.711374]
> > [   23.711374] -> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
> > [   23.712380]        blk_mq_submit_bio+0x90d/0xb00
> > [   23.713006]        __submit_bio+0x10d/0x1f0
> > [   23.713559]        submit_bio_noacct_nocheck+0x324/0x420
> > [   23.714266]        ext4_bio_write_folio+0x1fc/0x750
> > [   23.714851]        mpage_submit_folio+0x8d/0xb0
> > [   23.715409]        mpage_process_page_bufs+0xd0/0x1b0
> > [   23.716024]        mpage_prepare_extent_to_map+0x1d0/0x510
> > [   23.716669]        ext4_do_writepages+0x4ec/0xee0
> > [   23.717246]        ext4_writepages+0xe0/0x280
> > [   23.717772]        do_writepages+0xeb/0x290
> > [   23.718264]        filemap_fdatawrite_wbc+0x4f/0x70
> > [   23.718830]        __filemap_fdatawrite_range+0x60/0x90
> > [   23.719440]        file_write_and_wait_range+0x47/0xb0
> > [   23.720049]        ext4_sync_file+0xac/0x3e0
> > [   23.720548]        do_fsync+0x39/0x70
> > [   23.720995]        __x64_sys_fsync+0x13/0x20
> > [   23.721481]        do_syscall_64+0x95/0x180
> > [   23.721943]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [   23.722522]
> > [   23.722522] -> #2 (jbd2_handle){++++}-{0:0}:
> > [   23.723139]        start_this_handle+0x193/0x540
> > [   23.723634]        jbd2_journal_start_reserved+0x54/0x1e0
> > [   23.724210]        __ext4_journal_start_reserved+0x7a/0x170
> > [   23.724756]        ext4_convert_unwritten_io_end_vec+0x2b/0xe0
> > [   23.725304]        ext4_end_io_rsv_work+0x102/0x1d0
> > [   23.725798]        process_one_work+0x21f/0x590
> > [   23.726255]        worker_thread+0x1c3/0x3b0
> > [   23.726690]        kthread+0xd5/0x100
> > [   23.727072]        ret_from_fork+0x34/0x50
> > [   23.727481]        ret_from_fork_asm+0x1a/0x30
> > [   23.727939]
> > [   23.727939] -> #1 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}-{0:0}:
> > [   23.728788]        process_one_work+0x1f4/0x590
> > [   23.729267]        worker_thread+0x1c3/0x3b0
> > [   23.729715]        kthread+0xd5/0x100
> > [   23.730100]        ret_from_fork+0x34/0x50
> > [   23.730502]        ret_from_fork_asm+0x1a/0x30
> > [   23.730944]
> > [   23.730944] -> #0 ((wq_completion)ext4-rsv-conversion){+.+.}-{0:0}:
> > [   23.731688]        __lock_acquire+0x14ba/0x25c0
> > [   23.732142]        lock_acquire+0xd0/0x310
> > [   23.732550]        process_one_work+0x52e/0x590
> > [   23.732995]        worker_thread+0x1c3/0x3b0
> > [   23.733424]        kthread+0xd5/0x100
> > [   23.733783]        ret_from_fork+0x34/0x50
> > [   23.734191]        ret_from_fork_asm+0x1a/0x30
> > [   23.734632]
> > [   23.734632] other info that might help us debug this:
> > [   23.734632]
> > [   23.735409] Chain exists of:
> > [   23.735409]   (wq_completion)ext4-rsv-conversion --> jbd2_handle --> &q->q_usage_counter(io)
> > [   23.735409]
> > [   23.736630]  Possible unsafe locking scenario:
> > [   23.736630]
> > [   23.737214]        CPU0                    CPU1
> > [   23.737662]        ----                    ----
> > [   23.738124]   lock(&q->q_usage_counter(io));
> > [   23.738539]                                lock(jbd2_handle);
> > [   23.739112]                                lock(&q->q_usage_counter(io));
> > [   23.739770]   lock((wq_completion)ext4-rsv-conversion);
> > [   23.740296]
> > [   23.740296]  *** DEADLOCK ***
> > [   23.740296]
> > [   23.740850] 2 locks held by kworker/u8:4/76:
> > [   23.741284]  #0: ffff892c82fb5430 (&q->q_usage_counter(io)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> > [   23.742172]  #1: ffff892c82fb5468 (&q->q_usage_counter(queue)){++++}-{0:0}, at: virtblk_freeze+0x28/0x70
> > [   23.743083]
> > [   23.743083] stack backtrace:
> > [   23.743514] CPU: 0 UID: 0 PID: 76 Comm: kworker/u8:4 Not tainted 6.13.0-rc1+ #2032
> > [   23.744259] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > [   23.745355] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> > [   23.745947] Call Trace:
> > [   23.746207]  <TASK>
> > [   23.746428]  dump_stack_lvl+0x84/0xd0
> > [   23.746807]  print_circular_bug.cold+0x178/0x1be
> > [   23.747280]  check_noncircular+0x148/0x160
> > [   23.747694]  __lock_acquire+0x14ba/0x25c0
> > [   23.748114]  lock_acquire+0xd0/0x310
> > [   23.748480]  ? process_one_work+0x51d/0x590
> > [   23.748906]  ? mark_held_locks+0x40/0x70
> > [   23.749303]  process_one_work+0x52e/0x590
> > [   23.749705]  ? process_one_work+0x51d/0x590
> > [   23.750136]  worker_thread+0x1c3/0x3b0
> > [   23.750503]  ? __pfx_worker_thread+0x10/0x10
> > [   23.750923]  kthread+0xd5/0x100
> > [   23.751235]  ? __pfx_kthread+0x10/0x10
> > [   23.751601]  ret_from_fork+0x34/0x50
> > [   23.751968]  ? __pfx_kthread+0x10/0x10
> > [   23.752334]  ret_from_fork_asm+0x1a/0x30
> > [   23.752721]  </TASK>
> > [   23.753031] BUG: workqueue leaked atomic, lock or RCU: kworker/u8:4[76]
> > [   23.753031]      preempt=0x00000000 lock=2->0 RCU=0->0 workfn=ext4_end_io_rsv_work
> > [   23.754419] INFO: lockdep is turned off.
> > [   23.754814] CPU: 0 UID: 0 PID: 76 Comm: kworker/u8:4 Not tainted 6.13.0-rc1+ #2032
> > [   23.755537] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> > [   23.756621] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> > [   23.757214] Call Trace:
> > [   23.757464]  <TASK>
> > [   23.757682]  dump_stack_lvl+0x84/0xd0
> > [   23.758060]  process_one_work.cold+0x6d/0xc8
> > [   23.758481]  ? __pfx_ext4_end_io_rsv_work+0x10/0x10
> > [   23.758946]  ? process_one_work+0x24a/0x590
> > [   23.759339]  worker_thread+0x1c3/0x3b0
> > [   23.759706]  ? __pfx_worker_thread+0x10/0x10
> > [   23.760276]  kthread+0xd5/0x100
> > [   23.760581]  ? __pfx_kthread+0x10/0x10
> > [   23.760956]  ret_from_fork+0x34/0x50
> > [   23.761306]  ? __pfx_kthread+0x10/0x10
> > [   23.761669]  ret_from_fork_asm+0x1a/0x30
> > [   23.762064]  </TASK>
> >
> >
> >
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-11 12:42   ` Stefan Hajnoczi
@ 2024-12-12 11:07     ` David Woodhouse
  2024-12-12 13:34       ` Thomas Gleixner
  2024-12-12 11:12     ` Lockdep warnings on kexec (virtio_blk, hrtimers) Ming Lei
  1 sibling, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-12 11:07 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei

[-- Attachment #1: Type: text/plain, Size: 14110 bytes --]

On Wed, 2024-12-11 at 07:42 -0500, Stefan Hajnoczi wrote:
> On Tue, Dec 10, 2024 at 09:56:43AM +0800, Jason Wang wrote:
> > Adding more virtio-blk people here.
> 
> Please try Ming Lei's recent fix in Jens' tree:
> 
>   virtio-blk: don't keep queue frozen during system suspend
>   commit: 7678abee0867e6b7fb89aa40f6e9f575f755fb37
> 
> https://git.kernel.dk/cgit/linux/commit/?h=block-6.13&id=7678abee0867e6b7fb89aa40f6e9f575f755fb37

Thanks. That does make those warnings go away. I do still get this one
occasionally though. It seems to go away without 'no_console_suspend'
on the command line, but I'm not sure that makes it OK.

B[   23.665211] ------------[ cut here ]------------
[   23.665790] Interrupts enabled after irqrouter_resume+0x0/0x50
[   23.666596] WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
[   23.667587] Modules linked in:
[   23.667964] CPU: 0 UID: 0 PID: 560 Comm: loadret Not tainted 6.13.0-rc1+ #2033
[   23.668806] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   23.670086] RIP: 0010:syscore_resume+0x18a/0x220
[   23.670614] Code: 00 e9 21 ff ff ff 80 3d 62 d0 d3 01 00 0f 85 03 ff ff ff 48 8b 73 18 48 c7 c7 06 f6 e9 a9 c6 05 4a d0 d3 01 01 e8 16 56 46 ff <0f> 0b e9 e5 fe ff ff e8 ea e9 54 ff 84 c0 0f 85 fb fe ff ff 80 3d
[   23.672525] RSP: 0018:ffffb2f640aefbb8 EFLAGS: 00010282
[   23.673060] RAX: 0000000000000000 RBX: ffffffffaa9918c0 RCX: 0000000000000027
[   23.673759] RDX: ffff8d75fdc21a88 RSI: 0000000000000001 RDI: ffff8d75fdc21a80
[   23.674474] RBP: 0000000000037e0c R08: 0000000000000000 R09: 0000000000000000
[   23.675198] R10: 0000000000000001 R11: ffffffffaa782fd8 R12: ffffb2f640aefbe8
[   23.675905] R13: ffffffffaa68ebe0 R14: 00000000fee1dead R15: 0000000000000000
[   23.676610] FS:  00007f058661a540(0000) GS:ffff8d75fdc00000(0000) knlGS:0000000000000000
[   23.677431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.678015] CR2: 000055a0b9f49b70 CR3: 0000000014798001 CR4: 0000000000170ef0
[   23.678727] Call Trace:
[   23.679015]  <TASK>
[   23.679248]  ? __warn.cold+0xb7/0x151
[   23.679629]  ? syscore_resume+0x18a/0x220
[   23.680086]  ? report_bug+0xff/0x140
[   23.680484]  ? console_unlock+0x9d/0x150
[   23.680891]  ? handle_bug+0x58/0x90
[   23.681257]  ? exc_invalid_op+0x17/0x70
[   23.681643]  ? asm_exc_invalid_op+0x1a/0x20
[   23.682077]  ? syscore_resume+0x18a/0x220
[   23.682493]  ? syscore_resume+0x18a/0x220
[   23.682929]  kernel_kexec+0xf6/0x180
[   23.683305]  __do_sys_reboot+0x206/0x250
[   23.683721]  do_syscall_64+0x95/0x180
[   23.684163]  ? __lock_acquire+0x45f/0x25c0
[   23.684622]  ? __lock_acquire+0x45f/0x25c0
[   23.685100]  ? __handle_mm_fault+0x7df/0xfa0
[   23.685612]  ? reacquire_held_locks+0xd2/0x1f0
[   23.686118]  ? do_user_addr_fault+0x555/0x8a0
[   23.686615]  ? lock_acquire+0xd0/0x310
[   23.687022]  ? find_held_lock+0x2b/0x80
[   23.687470]  ? do_user_addr_fault+0x59f/0x8a0
[   23.687928]  ? do_user_addr_fault+0x5a9/0x8a0
[   23.688406]  ? trace_hardirqs_off+0x4b/0xc0
[   23.688837]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   23.689367]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   23.689866] RIP: 0033:0x7f058654c15d
[   23.690244] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
[   23.692088] RSP: 002b:00007ffccf368718 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[   23.692857] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f058654c15d
[   23.693575] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[   23.694317] RBP: 00007ffccf368760 R08: 0000004d00000000 R09: 0000004d00000000
[   23.695017] R10: 00007f0586640f30 R11: 0000000000000246 R12: 0000000000401070
[   23.695719] R13: 00007ffccf368840 R14: 0000000000000000 R15: 0000000000000000
[   23.696453]  </TASK>
[   23.696681] irq event stamp: 15565
[   23.697043] hardirqs last  enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
[   23.697855] hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90
[   23.698673] softirqs last  enabled at (14798): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
[   23.699481] softirqs last disabled at (14777): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
[   23.700284] ---[ end trace 0000000000000000 ]---
[   23.702460] ------------[ cut here ]------------
[   23.702963] WARNING: CPU: 0 PID: 560 at kernel/time/hrtimer.c:995 hrtimers_resume_local+0x29/0x40
[   23.703885] Modules linked in:
[   23.704223] CPU: 0 UID: 0 PID: 560 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2033
[   23.705089] Tainted: [W]=WARN
[   23.705415] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   23.706536] RIP: 0010:hrtimers_resume_local+0x29/0x40
[   23.707068] Code: 90 66 0f 1f 00 0f 1f 44 00 00 8b 05 c5 57 81 02 85 c0 74 18 65 8b 05 0e 8c d4 57 85 c0 75 0d 65 8b 05 c7 88 d4 57 85 c0 74 02 <0f> 0b 31 ff e9 de ee ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
[   23.708834] RSP: 0018:ffffb2f640aefb50 EFLAGS: 00010202
[   23.709429] RAX: 0000000000000001 RBX: 000000059af022d0 RCX: 00000000000006e0
[   23.710121] RDX: 000000000000000e RSI: 000000008e2b60ca RDI: 00000000000006e0
[   23.710831] RBP: ffffb2f640aefba8 R08: 0000000000000001 R09: 0000000000000000
[   23.711559] R10: 0000000000000001 R11: ffffffffac18d188 R12: 0000000000000206
[   23.712249] R13: ffffffffaa666ca0 R14: 00000000fee1dead R15: 0000000000000000
[   23.712952] FS:  00007f058661a540(0000) GS:ffff8d75fdc00000(0000) knlGS:0000000000000000
[   23.713746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.714301] CR2: 000055a0b9f49b70 CR3: 0000000014798001 CR4: 0000000000170ef0
[   23.714997] Call Trace:
[   23.715274]  <TASK>
[   23.715512]  ? __warn.cold+0xb7/0x151
[   23.715884]  ? hrtimers_resume_local+0x29/0x40
[   23.716330]  ? report_bug+0xff/0x140
[   23.716711]  ? handle_bug+0x58/0x90
[   23.717071]  ? exc_invalid_op+0x17/0x70
[   23.717465]  ? asm_exc_invalid_op+0x1a/0x20
[   23.717890]  ? hrtimers_resume_local+0x29/0x40
[   23.718356]  timekeeping_resume+0x148/0x190
[   23.718772]  syscore_resume+0x67/0x220
[   23.719151]  kernel_kexec+0xf6/0x180
[   23.719568]  __do_sys_reboot+0x206/0x250
[   23.720044]  do_syscall_64+0x95/0x180
[   23.720429]  ? __lock_acquire+0x45f/0x25c0
[   23.720849]  ? __lock_acquire+0x45f/0x25c0
[   23.721258]  ? __handle_mm_fault+0x7df/0xfa0
[   23.721697]  ? reacquire_held_locks+0xd2/0x1f0
[   23.722139]  ? do_user_addr_fault+0x555/0x8a0
[   23.722618]  ? lock_acquire+0xd0/0x310
[   23.723053]  ? find_held_lock+0x2b/0x80
[   23.723505]  ? do_user_addr_fault+0x59f/0x8a0
[   23.724010]  ? do_user_addr_fault+0x5a9/0x8a0
[   23.724467]  ? trace_hardirqs_off+0x4b/0xc0
[   23.724916]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   23.725404]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   23.725935] RIP: 0033:0x7f058654c15d
[   23.726318] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
[   23.728079] RSP: 002b:00007ffccf368718 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[   23.728827] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f058654c15d
[   23.729560] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[   23.730358] RBP: 00007ffccf368760 R08: 0000004d00000000 R09: 0000004d00000000
[   23.731089] R10: 00007f0586640f30 R11: 0000000000000246 R12: 0000000000401070
[   23.731755] R13: 00007ffccf368840 R14: 0000000000000000 R15: 0000000000000000
[   23.732455]  </TASK>
[   23.732681] irq event stamp: 16269
[   23.733016] hardirqs last  enabled at (16277): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
[   23.733823] hardirqs last disabled at (16286): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90
[   23.734639] softirqs last  enabled at (15872): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
[   23.735463] softirqs last disabled at (15863): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
[   23.736316] ---[ end trace 0000000000000000 ]---
[   23.736789] 
[   23.736993] ================================
[   23.737439] WARNING: inconsistent lock state
[   23.737908] 6.13.0-rc1+ #2033 Tainted: G        W         
[   23.738493] --------------------------------
[   23.738900] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[   23.739468] loadret/560 [HC0[0]:SC0[0]:HE1:SE1] takes:
[   23.739957] ffff8d75fdc267d8 (hrtimer_bases.lock){?.-.}-{2:2}, at: retrigger_next_event+0x38/0xd0
[   23.740813] {IN-HARDIRQ-W} state was registered at:
[   23.741276]   lock_acquire+0xd0/0x310
[   23.741628]   _raw_spin_lock_irqsave+0x48/0x70
[   23.742058]   hrtimer_run_queues+0x4d/0x150
[   23.742455]   update_process_times+0x34/0xf0
[   23.742853]   tick_periodic+0x29/0xe0
[   23.743216]   tick_handle_periodic+0x24/0x70
[   23.743621]   timer_interrupt+0x18/0x30
[   23.743999]   __handle_irq_event_percpu+0x87/0x260
[   23.744449]   handle_irq_event+0x38/0x90
[   23.744819]   handle_level_irq+0x8e/0x160
[   23.745212]   __common_interrupt+0x5c/0x120
[   23.745603]   common_interrupt+0x80/0xa0
[   23.745986]   asm_common_interrupt+0x26/0x40
[   23.746395]   __x86_return_thunk+0x0/0x10
[   23.746772]   _raw_spin_unlock_irqrestore+0x45/0x70
[   23.747251]   __setup_irq+0x34d/0x6a0
[   23.747602]   request_threaded_irq+0x115/0x1b0
[   23.748039]   hpet_time_init+0x31/0x50
[   23.748396]   x86_late_time_init+0x1b/0x40
[   23.748786]   start_kernel+0x998/0xa40
[   23.749154]   x86_64_start_reservations+0x24/0x30
[   23.749600]   x86_64_start_kernel+0xed/0xf0
[   23.750011]   common_startup_64+0x13e/0x141
[   23.750398] irq event stamp: 16325
[   23.750745] hardirqs last  enabled at (16325): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
[   23.751549] hardirqs last disabled at (16324): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90
[   23.752349] softirqs last  enabled at (15872): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
[   23.753138] softirqs last disabled at (15863): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
[   23.753956] 
[   23.753956] other info that might help us debug this:
[   23.754597]  Possible unsafe locking scenario:
[   23.754597] 
[   23.755163]        CPU0
[   23.755397]        ----
[   23.755637]   lock(hrtimer_bases.lock);
[   23.756010]   <Interrupt>
[   23.756264]     lock(hrtimer_bases.lock);
[   23.756640] 
[   23.756640]  *** DEADLOCK ***
[   23.756640] 
[   23.757197] 1 lock held by loadret/560:
[   23.757555]  #0: ffffffffaa6902c8 (system_transition_mutex){+.+.}-{4:4}, at: __do_sys_reboot+0xc5/0x250
[   23.758426] 
[   23.758426] stack backtrace:
[   23.758867] CPU: 0 UID: 0 PID: 560 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2033
[   23.759806] Tainted: [W]=WARN
[   23.760151] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   23.761228] Call Trace:
[   23.761481]  <TASK>
[   23.761713]  dump_stack_lvl+0x84/0xd0
[   23.762117]  print_usage_bug.part.0+0x257/0x340
[   23.762553]  mark_lock+0x735/0x960
[   23.762911]  ? vprintk_emit+0x111/0x460
[   23.763288]  ? hrtimers_resume_local+0x29/0x40
[   23.763704]  ? _printk+0x6c/0x90
[   23.764018]  __lock_acquire+0x7ee/0x25c0
[   23.764385]  ? __warn.cold+0x7f/0x151
[   23.764727]  ? hrtimers_resume_local+0x29/0x40
[   23.765161]  ? nbcon_get_cpu_emergency_nesting+0xa/0x30
[   23.765643]  ? nbcon_cpu_emergency_exit+0xe/0x40
[   23.766081]  ? report_bug+0xff/0x140
[   23.766428]  lock_acquire+0xd0/0x310
[   23.766771]  ? retrigger_next_event+0x38/0xd0
[   23.767189]  _raw_spin_lock+0x30/0x40
[   23.767542]  ? retrigger_next_event+0x38/0xd0
[   23.767967]  retrigger_next_event+0x38/0xd0
[   23.768365]  timekeeping_resume+0x148/0x190
[   23.768763]  syscore_resume+0x67/0x220
[   23.769136]  kernel_kexec+0xf6/0x180
[   23.769478]  __do_sys_reboot+0x206/0x250
[   23.769851]  do_syscall_64+0x95/0x180
[   23.770205]  ? __lock_acquire+0x45f/0x25c0
[   23.770590]  ? __lock_acquire+0x45f/0x25c0
[   23.770987]  ? __handle_mm_fault+0x7df/0xfa0
[   23.771395]  ? reacquire_held_locks+0xd2/0x1f0
[   23.771812]  ? do_user_addr_fault+0x555/0x8a0
[   23.772229]  ? lock_acquire+0xd0/0x310
[   23.772584]  ? find_held_lock+0x2b/0x80
[   23.772934]  ? do_user_addr_fault+0x59f/0x8a0
[   23.773350]  ? do_user_addr_fault+0x5a9/0x8a0
[   23.773761]  ? trace_hardirqs_off+0x4b/0xc0
[   23.774169]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   23.774643]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   23.775111] RIP: 0033:0x7f058654c15d
[   23.775451] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d fb 5c 0c 00 f7 d8 64 89 01 48
[   23.777180] RSP: 002b:00007ffccf368718 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[   23.777872] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f058654c15d
[   23.778535] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[   23.779181] RBP: 00007ffccf368760 R08: 0000004d00000000 R09: 0000004d00000000
[   23.779836] R10: 00007f0586640f30 R11: 0000000000000246 R12: 0000000000401070
[   23.780497] R13: 00007ffccf368840 R14: 0000000000000000 R15: 0000000000000000
[   23.781181]  </TASK>
[   23.781485] Enabling non-boot CPUs ...
[   23.781875] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[   23.782558] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   23.784675] CPU1 is up
[   23.789380] virtio_blk virtio1: 2/0/0 default/read/poll queues
[   23.794137] OOM killer enabled.
Success794454] Restarting tasks ... done.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-11 12:42   ` Stefan Hajnoczi
  2024-12-12 11:07     ` David Woodhouse
@ 2024-12-12 11:12     ` Ming Lei
  1 sibling, 0 replies; 34+ messages in thread
From: Ming Lei @ 2024-12-12 11:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jason Wang, David Woodhouse, x86@kernel.org, hpa, dyoung, kexec,
	linux-ext4, Michael S. Tsirkin, Stefano Garzarella, eperezma,
	Paolo Bonzini

On Thu, Dec 12, 2024 at 3:01 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Dec 10, 2024 at 09:56:43AM +0800, Jason Wang wrote:
> > Adding more virtio-blk people here.
>
> Please try Ming Lei's recent fix in Jens' tree:
>
>   virtio-blk: don't keep queue frozen during system suspend
>   commit: 7678abee0867e6b7fb89aa40f6e9f575f755fb37

This commit has been merged to v6.13-rc2.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-12 11:07     ` David Woodhouse
@ 2024-12-12 13:34       ` Thomas Gleixner
  2024-12-12 13:46         ` David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-12 13:34 UTC (permalink / raw)
  To: David Woodhouse, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra

CC+ printk folks

On Thu, Dec 12 2024 at 11:07, David Woodhouse wrote:
> On Wed, 2024-12-11 at 07:42 -0500, Stefan Hajnoczi wrote:
>> On Tue, Dec 10, 2024 at 09:56:43AM +0800, Jason Wang wrote:
>> > Adding more virtio-blk people here.
>> 
>> Please try Ming Lei's recent fix in Jens' tree:
>> 
>>   virtio-blk: don't keep queue frozen during system suspend
>>   commit: 7678abee0867e6b7fb89aa40f6e9f575f755fb37
>> 
>> https://git.kernel.dk/cgit/linux/commit/?h=block-6.13&id=7678abee0867e6b7fb89aa40f6e9f575f755fb37
>
> Thanks. That does make those warnings go away. I do still get this one
> occasionally though. It seems to go away without 'no_console_suspend'
> on the command line, but I'm not sure that makes it OK.

Not really.

> [   23.665790] Interrupts enabled after irqrouter_resume+0x0/0x50

The resume callback irqrouter_resume() returns with interrupts enabled,
but it's absolutely unclear where this happens. The lockdep tracking is
not really helpful:

> [   23.697043] hardirqs last  enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
> [   23.697855] hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90

__up_console_sem()
{
	printk_safe_enter_irqsave(flags);       // Assuming this is __up_console_sem+0x63/0x90
                                                // Saves state in @flags and disables interrupts
        up(&console_sem);
        printk_safe_exit_irqrestore(flags);     // Assuming this is __up_console_sem+0x7e/0x90
                                                // Restores the interrupt state from @flags
}

Though the events are in reverse order:

    last enabled  at 15573
    last disabled at 15580

At event #15573 printk_safe_exit_irqrestore(flags) enabled interrupts,
which means the preceeding printk_safe_enter_irqsave(flags) was invoked
with interrupts enabled. But that enable event wiped the real culprit,
which enabled interrupts before __up_console_sem() was invoked.

At event #15580 printk_safe_enter_irqsave(flags); disables interrupts
again, which is probably at the point where printk() dumps the bug, but
I might be misreading this.

Now David's observation that the problem "goes away" when he adds
"no_console_suspend" on the command line is definitely interesting, but
does not really help in figuring out the root cause.

> [   23.698673] softirqs last  enabled at (14798): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
> [   23.699481] softirqs last disabled at (14777): [<ffffffffa81c6c12>] __irq_exit_rcu+0xe2/0x100
> [   23.700284] ---[ end trace 0000000000000000 ]---
> [   23.702460] ------------[ cut here ]------------
> [   23.702963] WARNING: CPU: 0 PID: 560 at kernel/time/hrtimer.c:995 hrtimers_resume_local+0x29/0x40

This one is just a consequence of the above.

David, can you retest with the debug patch below? That should pin-point
the real culprit.

Thanks,

        tglx
---
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -621,6 +621,9 @@ do {									\
 
 extern void lockdep_assert_in_softirq_func(void);
 
+extern void lockdep_suspend_syscore_enter(void);
+extern void lockdep_suspend_syscore_exit(void);
+
 #else
 # define might_lock(lock) do { } while (0)
 # define might_lock_read(lock) do { } while (0)
@@ -635,6 +638,8 @@ extern void lockdep_assert_in_softirq_fu
 # define lockdep_assert_preemption_disabled() do { } while (0)
 # define lockdep_assert_in_softirq() do { } while (0)
 # define lockdep_assert_in_softirq_func() do { } while (0)
+static inline void lockdep_suspend_syscore_enter(void) { }
+static inline void lockdep_suspend_syscore_exit(void) { }
 #endif
 
 #ifdef CONFIG_PROVE_RAW_LOCK_NESTING
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1025,6 +1025,7 @@ int kernel_kexec(void)
 		if (error)
 			goto Enable_cpus;
 		local_irq_disable();
+		lockdep_suspend_syscore_enter();
 		error = syscore_suspend();
 		if (error)
 			goto Enable_irqs;
@@ -1054,6 +1055,7 @@ int kernel_kexec(void)
 	if (kexec_image->preserve_context) {
 		syscore_resume();
  Enable_irqs:
+		lockdep_suspend_syscore_exit();
 		local_irq_enable();
  Enable_cpus:
 		suspend_enable_secondary_cpus();
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4408,6 +4408,18 @@ void lockdep_hardirqs_on_prepare(void)
 }
 EXPORT_SYMBOL_GPL(lockdep_hardirqs_on_prepare);
 
+static bool suspend_syscore_active;
+
+void noinstr lockdep_suspend_syscore_enter(void)
+{
+	suspend_syscore_active = true;
+}
+
+void noinstr lockdep_suspend_syscore_exit(void)
+{
+	suspend_syscore_active = false;
+}
+
 void noinstr lockdep_hardirqs_on(unsigned long ip)
 {
 	struct irqtrace_events *trace = &current->irqtrace;
@@ -4456,6 +4468,8 @@ void noinstr lockdep_hardirqs_on(unsigne
 	if (DEBUG_LOCKS_WARN_ON(!irqs_disabled()))
 		return;
 
+	DEBUG_LOCKS_WARN_ON(suspend_syscore_active);
+
 	/*
 	 * Ensure the lock stack remained unchanged between
 	 * lockdep_hardirqs_on_prepare() and lockdep_hardirqs_on().

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-12 13:34       ` Thomas Gleixner
@ 2024-12-12 13:46         ` David Woodhouse
  2024-12-12 18:04           ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-12 13:46 UTC (permalink / raw)
  To: Thomas Gleixner, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 10754 bytes --]

On Thu, 2024-12-12 at 14:34 +0100, Thomas Gleixner wrote:
> 
> David, can you retest with the debug patch below? That should pin-point
> the real culprit.

B[    1.545489] ------------[ cut here ]------------
[    1.546338] DEBUG_LOCKS_WARN_ON(suspend_syscore_active)
[    1.546375] WARNING: CPU: 0 PID: 18 at kernel/locking/lockdep.c:4471 lockdep_hardirqs_on+0x13a/0x140
[    1.548658] Modules linked in:
[    1.549164] CPU: 0 UID: 0 PID: 18 Comm: rcu_preempt Not tainted 6.13.0-rc1+ #2034
[    1.550421] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.552173] RIP: 0010:lockdep_hardirqs_on+0x13a/0x140
[    1.552918] Code: 85 c9 74 13 8b 15 c6 b8 de 02 85 d2 0f 84 4d ff ff ff e9 58 ff ff ff 48 c7 c6 a3 47 e5 bb 48 c7 c7 4c 8f e4 bb e8 96 52 e6 fe <0f> 0b eb d6 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[    1.555728] RSP: 0018:ffffbde64009bcc8 EFLAGS: 00010082
[    1.556513] RAX: 0000000000000000 RBX: ffff9d23c18db340 RCX: ffffffffbc782a08
[    1.557579] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
[    1.558629] RBP: ffffffffba20fe54 R08: 0000000000000000 R09: 0000000000000000
[    1.559683] R10: 0000000000037e0c R11: 0000000000000000 R12: ffff9d23c3a18000
[    1.560721] R13: 0000000000000000 R14: ffff9d23c18db340 R15: 0000000000000000
[    1.561758] FS:  0000000000000000(0000) GS:ffff9d243d600000(0000) knlGS:0000000000000000
[    1.562925] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.563765] CR2: 00007fa03567f070 CR3: 00000000039b0001 CR4: 0000000000170ef0
[    1.564825] Call Trace:
[    1.565178]  <TASK>
[    1.565483]  ? __warn.cold+0xb7/0x151
[    1.566009]  ? lockdep_hardirqs_on+0x13a/0x140
[    1.566663]  ? report_bug+0xff/0x140
[    1.567176]  ? console_unlock+0x9d/0x150
[    1.567745]  ? handle_bug+0x58/0x90
[    1.568239]  ? exc_invalid_op+0x17/0x70
[    1.568796]  ? asm_exc_invalid_op+0x1a/0x20
[    1.569411]  ? finish_task_switch.isra.0+0xc4/0x2d0
[    1.570125]  ? lockdep_hardirqs_on+0x13a/0x140
[    1.570759]  ? lockdep_hardirqs_on+0x13a/0x140
[    1.571399]  finish_task_switch.isra.0+0xc4/0x2d0
[    1.572062]  __schedule+0x50a/0x1a10
[    1.572586]  ? find_held_lock+0x2b/0x80
[    1.573125]  ? schedule+0xea/0x140
[    1.573616]  ? __pfx_rcu_gp_kthread+0x10/0x10
[    1.574232]  schedule+0x3a/0x140
[    1.574708]  schedule_timeout+0x91/0x110
[    1.575261]  ? __pfx_process_timeout+0x10/0x10
[    1.575896]  rcu_gp_fqs_loop+0x10b/0x5b0
[    1.576455]  ? _raw_spin_unlock_irq+0x28/0x50
[    1.577072]  rcu_gp_kthread+0xf8/0x1b0
[    1.577621]  kthread+0xd5/0x100
[    1.578066]  ? __pfx_kthread+0x10/0x10
[    1.578606]  ret_from_fork+0x34/0x50
[    1.579110]  ? __pfx_kthread+0x10/0x10
[    1.579661]  ret_from_fork_asm+0x1a/0x30
[    1.580210]  </TASK>
[    1.580533] irq event stamp: 4182
[    1.581000] hardirqs last  enabled at (4181): [<ffffffffbb36b74a>] _raw_spin_unlock_irqrestore+0x5a/0x70
[    1.582346] hardirqs last disabled at (4182): [<ffffffffbb35fa37>] __schedule+0xf67/0x1a10
[    1.583504] softirqs last  enabled at (0): [<ffffffffba1b92d7>] copy_process+0xac7/0x2ba0
[    1.584651] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    1.585531] ---[ end trace 0000000000000000 ]---
[    1.586205] ------------[ cut here ]------------
[    1.586863] WARNING: CPU: 0 PID: 18 at kernel/time/timekeeping.c:807 ktime_get+0xd6/0x100
[    1.588024] Modules linked in:
[    1.588464] CPU: 0 UID: 0 PID: 18 Comm: rcu_preempt Tainted: G        W          6.13.0-rc1+ #2034
[    1.589748] Tainted: [W]=WARN
[    1.590162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.591771] RIP: 0010:ktime_get+0xd6/0x100
[    1.592362] Code: cc cc cc cc 48 d1 e8 48 f7 d0 48 85 c8 74 22 8b 0d 23 16 eb 03 83 f9 3f 0f 87 f9 13 d4 ff 48 8b 05 17 16 eb 03 48 d3 e8 eb be <0f> 0b e9 40 ff ff ff 48 8b 15 04 16 eb 03 8b 35 fa 15 eb 03 8b 3d
[    1.594979] RSP: 0018:ffffbde64009bb90 EFLAGS: 00010002
[    1.595724] RAX: 0000000000000001 RBX: ffff9d243d639b28 RCX: 0000000000000003
[    1.596727] RDX: 00000000ffffff01 RSI: 0000000000000000 RDI: ffff9d243d639b28
[    1.597732] RBP: ffff9d243d6390c0 R08: 0000000000000000 R09: ffff9d243d739bc1
[    1.598735] R10: ffff9d243d627580 R11: 0000000000000000 R12: 000000001cb51477
[    1.599736] R13: 00000000972b390b R14: 000000000335a525 R15: 0000000000000001
[    1.600750] FS:  0000000000000000(0000) GS:ffff9d243d600000(0000) knlGS:0000000000000000
[    1.601871] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.602670] CR2: 00007fa03567f070 CR3: 00000000039b0001 CR4: 0000000000170ef0
[    1.603669] Call Trace:
[    1.604007]  <TASK>
[    1.604304]  ? __warn.cold+0xb7/0x151
[    1.604811]  ? ktime_get+0xd6/0x100
[    1.605300]  ? report_bug+0xff/0x140
[    1.605794]  ? handle_bug+0x58/0x90
[    1.606276]  ? exc_invalid_op+0x17/0x70
[    1.606813]  ? asm_exc_invalid_op+0x1a/0x20
[    1.607413]  ? ktime_get+0xd6/0x100
[    1.607907]  start_dl_timer+0x7c/0x200
[    1.608433]  update_curr_dl_se+0x1bf/0x1f0
[    1.609002]  update_curr+0x1a6/0x1d0
[    1.609523]  dequeue_entity+0x2b/0x630
[    1.610047]  ? asm_sysvec_kvm_asyncpf_interrupt+0x1a/0x20
[    1.610797]  dequeue_entities+0x113/0x5d0
[    1.611364]  ? __pfx_rcu_gp_kthread+0x10/0x10
[    1.611964]  dequeue_task_fair+0x139/0x2a0
[    1.612539]  ? __pfx_rcu_gp_kthread+0x10/0x10
[    1.613140]  __schedule+0x859/0x1a10
[    1.613636]  ? lock_timer_base+0x2b/0xf0
[    1.614176]  ? lock_acquire+0x2a2/0x310
[    1.614718]  ? lock_release+0x218/0x2c0
[    1.615248]  ? __pfx_rcu_gp_kthread+0x10/0x10
[    1.615861]  schedule+0x3a/0x140
[    1.616317]  schedule_timeout+0x91/0x110
[    1.616859]  ? __pfx_process_timeout+0x10/0x10
[    1.617477]  rcu_gp_fqs_loop+0x10b/0x5b0
[    1.618018]  ? _raw_spin_unlock_irq+0x28/0x50
[    1.618624]  rcu_gp_kthread+0xf8/0x1b0
[    1.619145]  kthread+0xd5/0x100
[    1.619617]  ? __pfx_kthread+0x10/0x10
[    1.620135]  ret_from_fork+0x34/0x50
[    1.620678]  ? __pfx_kthread+0x10/0x10
[    1.621202]  ret_from_fork_asm+0x1a/0x30
[    1.621756]  </TASK>
[    1.622057] irq event stamp: 4183
[    1.622519] hardirqs last  enabled at (4183): [<ffffffffba20fe54>] finish_task_switch.isra.0+0xc4/0x2d0
[    1.623816] hardirqs last disabled at (4182): [<ffffffffbb35fa37>] __schedule+0xf67/0x1a10
[    1.624952] softirqs last  enabled at (0): [<ffffffffba1b92d7>] copy_process+0xac7/0x2ba0
[    1.626085] softirqs last disabled at (0): [<0000000000000000>] 0x0
[    1.626958] ---[ end trace 0000000000000000 ]---
[    1.627802] ------------[ cut here ]------------
[    1.628457] Interrupts enabled after irqrouter_resume+0x0/0x50
[    1.629279] WARNING: CPU: 0 PID: 214 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
[    1.630485] Modules linked in:
[    1.630901] CPU: 0 UID: 0 PID: 214 Comm: loadret Tainted: G        W          6.13.0-rc1+ #2034
[    1.632088] Tainted: [W]=WARN
[    1.632500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    1.634079] RIP: 0010:syscore_resume+0x18a/0x220
[    1.634717] Code: 00 e9 21 ff ff ff 80 3d 22 d0 d3 01 00 0f 85 03 ff ff ff 48 8b 73 18 48 c7 c7 8d f6 e9 bb c6 05 0a d0 d3 01 01 e8 d6 55 46 ff <0f> 0b e9 e5 fe ff ff e8 aa e9 54 ff 84 c0 0f 85 fb fe ff ff 80 3d
[    1.637298] RSP: 0018:ffffbde6402c7b38 EFLAGS: 00010286
[    1.638015] RAX: 0000000000000000 RBX: ffffffffbc9918c0 RCX: ffffffffbc782a08
[    1.638995] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
[    1.639982] RBP: 0000000000037e0c R08: 0000000000000000 R09: 0000000000000000
[    1.640961] R10: 0000000000000000 R11: 7075727265746e49 R12: ffffbde6402c7b68
[    1.641956] R13: ffffffffbc68ebe0 R14: 00000000fee1dead R15: 0000000000000000
[    1.642933] FS:  00007fed47db1680(0000) GS:ffff9d243d600000(0000) knlGS:0000000000000000
[    1.644036] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.644825] CR2: 00007fa03567f070 CR3: 00000000039b0001 CR4: 0000000000170ef0
[    1.645811] Call Trace:
[    1.646145]  <TASK>
[    1.646437]  ? __warn.cold+0xb7/0x151
[    1.646935]  ? syscore_resume+0x18a/0x220
[    1.647485]  ? report_bug+0xff/0x140
[    1.647973]  ? console_unlock+0x9d/0x150
[    1.648513]  ? handle_bug+0x58/0x90
[    1.648990]  ? exc_invalid_op+0x17/0x70
[    1.649528]  ? asm_exc_invalid_op+0x1a/0x20
[    1.650106]  ? syscore_resume+0x18a/0x220
[    1.650661]  ? syscore_resume+0x18a/0x220
[    1.651210]  kernel_kexec+0xff/0x190
[    1.651712]  __do_sys_reboot+0x206/0x250
[    1.652267]  do_syscall_64+0x95/0x180
[    1.652774]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[    1.653473]  ? smp_call_function_many_cond+0x11c/0x790
[    1.654172]  ? lock_acquire+0xd0/0x310
[    1.654690]  ? free_unref_page+0x22b/0x6a0
[    1.655249]  ? __slab_free+0xdf/0x330
[    1.655760]  ? do_kexec_load+0x11d/0x340
[    1.656304]  ? kfree+0x2bf/0x3a0
[    1.656749]  ? __x64_sys_kexec_load+0xa9/0xe0
[    1.657356]  ? kfree+0xdb/0x3a0
[    1.657797]  ? do_kexec_load+0x11d/0x340
[    1.658342]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[    1.659042]  ? syscall_exit_to_user_mode+0x97/0x290
[    1.659720]  ? do_syscall_64+0xa1/0x180
[    1.660245]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    1.660941] RIP: 0033:0x7fed47ce160d
[    1.661434] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f3 47 0c 00 f7 d8 64 89 01 48
[    1.663995] RSP: 002b:00007ffc75b2de78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a9
[    1.665024] RAX: ffffffffffffffda RBX: 00007ffc75b2dfe8 RCX: 00007fed47ce160d
[    1.666000] RDX: 0000000045584543 RSI: 0000000028121969 RDI: 00000000fee1dead
[    1.666974] RBP: 00007ffc75b2dec0 R08: 0000000000000000 R09: 0000000000000000
[    1.667947] R10: 00007fed47dbce90 R11: 0000000000000246 R12: 0000000000000001
[    1.668918] R13: 0000000000000000 R14: 00007fed47dec000 R15: 0000000000403e00
[    1.669899]  </TASK>
[    1.670195] irq event stamp: 12392
[    1.670661] hardirqs last  enabled at (12391): [<ffffffffba2ff03f>] smp_call_function_many_cond+0x66f/0x790
[    1.671990] hardirqs last disabled at (12392): [<ffffffffba305b2c>] kernel_kexec+0x13c/0x190
[    1.673135] softirqs last  enabled at (12376): [<ffffffffba1c6c12>] __irq_exit_rcu+0xe2/0x100
[    1.674307] softirqs last disabled at (12363): [<ffffffffba1c6c12>] __irq_exit_rcu+0xe2/0x100
[    1.675472] ---[ end trace 0000000000000000 ]---


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-12 13:46         ` David Woodhouse
@ 2024-12-12 18:04           ` Thomas Gleixner
  2024-12-12 19:19             ` David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-12 18:04 UTC (permalink / raw)
  To: David Woodhouse, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra

On Thu, Dec 12 2024 at 13:46, David Woodhouse wrote:
> On Thu, 2024-12-12 at 14:34 +0100, Thomas Gleixner wrote:
>> 
>> David, can you retest with the debug patch below? That should pin-point
>> the real culprit.
>
> B[    1.545489] ------------[ cut here ]------------
> [    1.546338] DEBUG_LOCKS_WARN_ON(suspend_syscore_active)
> [    1.546375] WARNING: CPU: 0 PID: 18 at kernel/locking/lockdep.c:4471 lockdep_hardirqs_on+0x13a/0x140

Now this starts to be completely mysterious:

> [    1.571399]  finish_task_switch.isra.0+0xc4/0x2d0
> [    1.572062]  __schedule+0x50a/0x1a10

How on earth did this end up in schedule() on the rcu_gp_kthread between
syscore_suspend() and syscore_resume()?

> [    1.575896]  rcu_gp_fqs_loop+0x10b/0x5b0
> [    1.576455]  ? _raw_spin_unlock_irq+0x28/0x50
> [    1.577072]  rcu_gp_kthread+0xf8/0x1b0
> [    1.577621]  kthread+0xd5/0x100
> [    1.578066]  ? __pfx_kthread+0x10/0x10

I tried to reproduce, but failed. Do you have a simple reproducer
recipe?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-12 18:04           ` Thomas Gleixner
@ 2024-12-12 19:19             ` David Woodhouse
  2024-12-13  0:14               ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-12 19:19 UTC (permalink / raw)
  To: Thomas Gleixner, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2633 bytes --]

On Thu, 2024-12-12 at 19:04 +0100, Thomas Gleixner wrote:
> 
> I tried to reproduce, but failed. Do you have a simple reproducer
> recipe?

Build current master (231825b2e1ff here). The config I'm using is at
http://david.woodhou.se/config-x86-kjump-irqs although I don't think
there's anything special other than CONFIG_KEXEC_JUMP and enough
lockdep to trigger the complaints.

Build the test case from
https://git.kernel.org/torvalds/c/07fa619f2a40c (which at some point
I'll clean up and add to the kernel selftests).

Then just run it repeatedly. Ten or twenty invocations should trigger
it at least once.

I've been running in qemu with the test case shoved into an initrd for
faster testing, but it works just as well done manually. If it matters,
the QEMU command line on my Haswell box is

 qemu-system-x86_64 -accel kvm,kernel-irqchip=split -display none \
   -serial mon:stdio -kernel arch/x86/boot/bzImage -smp 2 -m 2g \
   -append "console=ttyS0 root=/dev/vda1 no_console_suspend earlyprintk=serial" \
   -drive file=/var/lib/libvirt/images/fedora.qcow2,if=virtio \
   -cpu host --no-reboot -nic user,model=virtio 

Probably the only important part of that is the no_console_suspend.

A normal invocation looks like this:

[root@localhost ~]# ./loadret
[   83.908243] Freezing user space processes
[   83.910587] Freezing user space processes completed (elapsed 0.001 seconds)
[   83.911614] OOM killer disabled.
[   83.923550] ata2: found unknown device (class 0)
[   83.938623] Disabling non-boot CPUs ...
[   83.962083] smpboot: CPU 1 is now offline
[   83.963022] crash hp: kexec_trylock() failed, kdump image may be inaccurate
B[   83.983498] Enabling non-boot CPUs ...
[   83.984410] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[   83.985792] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   83.989823] CPU1 is up
[   83.999270] virtio_blk virtio1: 2/0/0 default/read/poll queues
[   84.004705] OOM killer enabled.
[   84.005299] Restarting tasks ... done.
Success


But sometimes it looks like this...

[root@localhost ~]# ./loadret
[   84.618610] Freezing user space processes
[   84.621427] Freezing user space processes completed (elapsed 0.001 seconds)
[   84.622510] OOM killer disabled.
[   84.628613] ata2: found unknown device (class 0)
[   84.652758] Disabling non-boot CPUs ...
[   84.672223] smpboot: CPU 1 is now offline
[   84.673091] crash hp: kexec_trylock() failed, kdump image may be inaccurate
B[   84.682946] ------------[ cut here ]------------
[   84.683584] Interrupts enabled after irqrouter_resume+0x0/0x50


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-12 19:19             ` David Woodhouse
@ 2024-12-13  0:14               ` Thomas Gleixner
  2024-12-13  9:31                 ` David Woodhouse
  2024-12-13 11:20                 ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-13  0:14 UTC (permalink / raw)
  To: David Woodhouse, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra

On Thu, Dec 12 2024 at 19:19, David Woodhouse wrote:
> On Thu, 2024-12-12 at 19:04 +0100, Thomas Gleixner wrote:
> Build current master (231825b2e1ff here). The config I'm using is at
> http://david.woodhou.se/config-x86-kjump-irqs although I don't think
> there's anything special other than CONFIG_KEXEC_JUMP and enough
> lockdep to trigger the complaints.
>
> I've been running in qemu with the test case shoved into an initrd for
> faster testing, but it works just as well done manually. If it matters,
> the QEMU command line on my Haswell box is
>
>  qemu-system-x86_64 -accel kvm,kernel-irqchip=split -display none \
>    -serial mon:stdio -kernel arch/x86/boot/bzImage -smp 2 -m 2g \
>    -append "console=ttyS0 root=/dev/vda1 no_console_suspend earlyprintk=serial" \
>    -drive file=/var/lib/libvirt/images/fedora.qcow2,if=virtio \
>    -cpu host --no-reboot -nic user,model=virtio 
>
> Probably the only important part of that is the no_console_suspend.

I misread that and somehow thought it does _not_ happen with
no_console_suspend. Duh!

So that made it reproduce and after adding some more lockdep hacks here
is the culprit:

[   15.177945] DEBUG_LOCKS_WARN_ON(suspend_syscore_active)
[   15.178980] CPU: 0 UID: 0 PID: 410 Comm: k.1 Not tainted 6.13.0-rc2+ #118
[   15.185613] Call Trace:
[   15.185760]  <TASK>
[   15.187678]  __cond_resched+0x21/0x60
[   15.187923]  down_timeout+0x18/0x60
[   15.188159]  acpi_os_wait_semaphore+0x4c/0x80
[   15.188424]  acpi_ut_acquire_mutex+0x3d/0x100
[   15.188665]  acpi_ns_get_node+0x27/0x60
[   15.188879]  acpi_ns_evaluate+0x1cb/0x2d0
[   15.189106]  acpi_rs_set_srs_method_data+0x156/0x190
[   15.189382]  acpi_pci_link_set+0x11c/0x290
[   15.189618]  irqrouter_resume+0x54/0x60
[   15.189826]  syscore_resume+0x6a/0x200
[   15.190033]  kernel_kexec+0x145/0x1c0
[   15.190239]  __do_sys_reboot+0xeb/0x240
[   15.190561]  do_syscall_64+0x95/0x180

This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a wakeup and the scheduler
sets the NEED_RESCHED flag.

cond_resched() sees it set and invokes schedule() from a completely bogus
context.

Nothing knowns about the current system state and therefore happily
assumes that everything is cool and shiny.

I also found that kexec jump fails to set system_state = SYSTEM_SUSPEND
before syscore_suspend() and back to SYSTEM_RUNNING after
syscore_resume().  See suspend/hibernate...

But setting this does not solve the problem because NEED_RESCHED still
gets set and cond_resched() is oblivious of the state.

Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operation
it's just a question of time.

And of course then you need a code path which is invoked after that,
which actually invokes cond_resched().

Adding more debug, there are indeed a couple of ways to set NEED_RESCHED
between invoking syscore_suspend() and returning from syscore_resume().
Some stuff schedules work [un]conditonally.

So the real question is how to deal with the general problem, which
obviously affects suspend as well.

The only thing I came up with so far is the hack below. It cures the
problem for PREEMPT_NONE and PREEMPT_VOLUNTARY. PREEMPT_FULL is not
affected because preemptible() checks whether the preemption count is
zero and interrupts are enabled and cond_resched() not active.

We could check for interrupts enabled in cond_resched() too, but that's
not pretty either.

With that applied the problem goes away, but after a lot of repetitions
of the reproducer in a tight loop the whole machinery stops dead:

[   29.913179] Disabling non-boot CPUs ...
[   29.930328] smpboot: CPU 1 is now offline
[   29.930593] crash hp: kexec_trylock() failed, kdump image may be inaccurate
B[   29.940588] Enabling non-boot CPUs ...
[   29.940856] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[   29.941242] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   29.942654] CPU1 is up
[   29.945856] virtio_blk virtio1: 2/0/0 default/read/poll queues
[   29.948556] OOM killer enabled.
[   29.948750] Restarting tasks ... done.
Success
[   29.960044] Freezing user space processes
[   29.961447] Freezing user space processes completed (elapsed 0.001 seconds)
[   29.961861] OOM killer disabled.
[   30.102485] ata2: found unknown device (class 0)
[   30.107387] Disabling non-boot CPUs ...

That happens without 'no_console_suspend' on the command line as
well, but that's for tomorrow ...

Thanks,

        tglx
---
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1025,6 +1025,7 @@ int kernel_kexec(void)
 		if (error)
 			goto Enable_cpus;
 		local_irq_disable();
+		system_state = SYSTEM_SUSPEND;
 		error = syscore_suspend();
 		if (error)
 			goto Enable_irqs;
@@ -1054,6 +1055,7 @@ int kernel_kexec(void)
 	if (kexec_image->preserve_context) {
 		syscore_resume();
  Enable_irqs:
+		system_state = SYSTEM_RUNNING;
 		local_irq_enable();
  Enable_cpus:
 		suspend_enable_secondary_cpus();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
 #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
 int __sched __cond_resched(void)
 {
-	if (should_resched(0)) {
+	if (should_resched(0) && system_state != SYSTEM_SUSPEND) {
 		preempt_schedule_common();
 		return 1;
 	}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13  0:14               ` Thomas Gleixner
@ 2024-12-13  9:31                 ` David Woodhouse
  2024-12-13  9:43                   ` David Woodhouse
  2024-12-13 11:20                 ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-13  9:31 UTC (permalink / raw)
  To: Thomas Gleixner, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 4141 bytes --]

On Fri, 2024-12-13 at 01:14 +0100, Thomas Gleixner wrote:
> 
> With that applied the problem goes away, but after a lot of repetitions
> of the reproducer in a tight loop the whole machinery stops dead:
> 
> [   29.913179] Disabling non-boot CPUs ...
> [   29.930328] smpboot: CPU 1 is now offline
> [   29.930593] crash hp: kexec_trylock() failed, kdump image may be inaccurate
> B[   29.940588] Enabling non-boot CPUs ...
> [   29.940856] crash hp: kexec_trylock() failed, kdump image may be inaccurate
> [   29.941242] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [   29.942654] CPU1 is up
> [   29.945856] virtio_blk virtio1: 2/0/0 default/read/poll queues
> [   29.948556] OOM killer enabled.
> [   29.948750] Restarting tasks ... done.
> Success
> [   29.960044] Freezing user space processes
> [   29.961447] Freezing user space processes completed (elapsed 0.001 seconds)
> [   29.961861] OOM killer disabled.
> [   30.102485] ata2: found unknown device (class 0)
> [   30.107387] Disabling non-boot CPUs ...
> 
> That happens without 'no_console_suspend' on the command line as
> well, but that's for tomorrow ...

I think I saw that lockup once last night too. This morning I did not
see it after hundreds of invocations on my kexec-debug tree (based on
tip/x86/boot which is 6.13-rc1).

I switched to master (231825b2e1 still) and saw it after a few
attempts.

[   34.250006] Freezing user space processes
[   34.251930] Freezing user space processes completed (elapsed 0.001 seconds)
[   34.252730] OOM killer disabled.
[   34.253141] printk: Suspending console(s) (use no_console_suspend to debug)

(gdb) t a a bt

Thread 2 (Thread 1.2 (CPU#1 [halted ])):
#0  0xffffffff8235886f in pv_native_safe_halt () at arch/x86/kernel/paravirt.c:127
#1  0xffffffff8235b699 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:175
#2  default_idle () at arch/x86/kernel/process.c:742
#3  0xffffffff8235bb0a in default_idle_call () at kernel/sched/idle.c:117
#4  0xffffffff81243195 in cpuidle_idle_call () at kernel/sched/idle.c:185
#5  do_idle () at kernel/sched/idle.c:325
#6  0xffffffff812434b9 in cpu_startup_entry (state=state@entry=CPUHP_AP_ONLINE_IDLE) at kernel/sched/idle.c:423
#7  0xffffffff8115b572 in start_secondary (unused=<optimized out>) at arch/x86/kernel/smpboot.c:314
#8  0xffffffff8110a38d in secondary_startup_64 () at arch/x86/kernel/head_64.S:414
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 1.1 (CPU#0 [halted ])):
#0  0xffffffff8235886f in pv_native_safe_halt () at arch/x86/kernel/paravirt.c:127
#1  0xffffffff8235b699 in arch_safe_halt () at ./arch/x86/include/asm/paravirt.h:175
#2  default_idle () at arch/x86/kernel/process.c:742
#3  0xffffffff8235bb0a in default_idle_call () at kernel/sched/idle.c:117
#4  0xffffffff81243195 in cpuidle_idle_call () at kernel/sched/idle.c:185
#5  do_idle () at kernel/sched/idle.c:325
#6  0xffffffff812434b9 in cpu_startup_entry (state=state@entry=CPUHP_ONLINE) at kernel/sched/idle.c:423
#7  0xffffffff8235c9c7 in rest_init () at init/main.c:747
#8  0xffffffff8419a694 in start_kernel () at init/main.c:1102
#9  0xffffffff841ac6a4 in x86_64_start_reservations (real_mode_data=real_mode_data@entry=0x147b0 <exception_stacks+34736> <error: Cannot access memory at address 0x147b0>) at arch/x86/kernel/head64.c:507
#10 0xffffffff841ac7fd in x86_64_start_kernel (real_mode_data=0x147b0 <exception_stacks+34736> <error: Cannot access memory at address 0x147b0>) at arch/x86/kernel/head64.c:488
#11 0xffffffff8110a38d in secondary_startup_64 () at arch/x86/kernel/head_64.S:414
#12 0x0000000000000000 in ?? ()
(gdb) 

But I haven't ingested your fix yet, maybe so I can't be entirely
surprised if CPU0 scheduled away and ended up in the idle thread? 

If I were cleverer I'd remember how to make gdb give me a backtrace for
the process which is actually in the kexec sys_reboot() system call,
instead of the boring idle thread.

(gdb) p sysrq_handle_showstate('t')

That didn't work. Maybe if I'd actually had no_console_suspend on this
boot. Will try again.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13  9:31                 ` David Woodhouse
@ 2024-12-13  9:43                   ` David Woodhouse
  2024-12-13 10:42                     ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-13  9:43 UTC (permalink / raw)
  To: Thomas Gleixner, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1591 bytes --]

On Fri, 2024-12-13 at 09:31 +0000, David Woodhouse wrote:
> 
> (gdb) p sysrq_handle_showstate('t')
> 
> That didn't work. Maybe if I'd actually had no_console_suspend on this
> boot. Will try again.

With your fix I get the same thing (both CPUs in idle thread). And with
no_console_suspend on the command line, 'p sysrq_handle_showstate('t')'
does work...

[  113.462898] task:loadret         state:D stack:0     pid:707   tgid:707   ppid:531    flags:0x00004002
[  113.463615] Call Trace:
[  113.463841]  <TASK>
[  113.464029]  __schedule+0x502/0x1a10
[  113.464331]  ? find_held_lock+0x2b/0x80
[  113.464661]  ? schedule+0xea/0x140
[  113.464961]  schedule+0x3a/0x140
[  113.465234]  schedule_timeout+0xcc/0x110
[  113.465580]  __wait_for_common+0x91/0x1c0
[  113.465923]  ? __pfx_schedule_timeout+0x10/0x10
[  113.466304]  cpuhp_kick_ap_work+0x13e/0x390
[  113.466657]  _cpu_down+0xd4/0x370
[  113.466936]  freeze_secondary_cpus.cold+0x3f/0xd4
[  113.467326]  kernel_kexec+0xa2/0x1a0
[  113.467634]  __do_sys_reboot+0x206/0x250
[  113.467974]  do_syscall_64+0x95/0x180
[  113.468266]  ? kfree+0xdb/0x3a0
[  113.468542]  ? __x64_sys_kexec_load+0xa9/0xe0
[  113.468906]  ? kfree+0xdb/0x3a0
[  113.469177]  ? do_kexec_load+0x120/0x340
[  113.469525]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[  113.469940]  ? syscall_exit_to_user_mode+0x97/0x290
[  113.470338]  ? do_syscall_64+0xa1/0x180
[  113.470667]  ? trace_hardirqs_off+0x4b/0xc0
[  113.471021]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[  113.471414]  entry_SYSCALL_64_after_hwframe+0x76/0x7e


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13  9:43                   ` David Woodhouse
@ 2024-12-13 10:42                     ` Thomas Gleixner
  2024-12-13 11:09                       ` Ming Lei
  2024-12-13 11:12                       ` David Woodhouse
  0 siblings, 2 replies; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-13 10:42 UTC (permalink / raw)
  To: David Woodhouse, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra, Jens Axboe

On Fri, Dec 13 2024 at 09:43, David Woodhouse wrote:
> On Fri, 2024-12-13 at 09:31 +0000, David Woodhouse wrote:
>> 
>> (gdb) p sysrq_handle_showstate('t')
>> 
>> That didn't work. Maybe if I'd actually had no_console_suspend on this
>> boot. Will try again.
>
> With your fix I get the same thing (both CPUs in idle thread). And with
> no_console_suspend on the command line, 'p sysrq_handle_showstate('t')'
> does work...
>
> [  113.462898] task:loadret         state:D stack:0     pid:707   tgid:707   ppid:531    flags:0x00004002
> [  113.463615] Call Trace:
> [  113.463841]  <TASK>
> [  113.464029]  __schedule+0x502/0x1a10
> [  113.464961]  schedule+0x3a/0x140
> [  113.465234]  schedule_timeout+0xcc/0x110
> [  113.465580]  __wait_for_common+0x91/0x1c0
> [  113.466304]  cpuhp_kick_ap_work+0x13e/0x390
> [  113.466657]  _cpu_down+0xd4/0x370
> [  113.466936]  freeze_secondary_cpus.cold+0x3f/0xd4
> [  113.467326]  kernel_kexec+0xa2/0x1a0

That's the control thread on CPU0. The hotplug thread on CPU1 is stuck
here:

 task:cpuhp/1         state:D stack:0     pid:24    tgid:24    ppid:2      flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x51f/0x1a80
  schedule+0x3a/0x140
  schedule_timeout+0x90/0x110
  msleep+0x2b/0x40
  blk_mq_hctx_notify_offline+0x160/0x3a0
  cpuhp_invoke_callback+0x2a8/0x6c0
  cpuhp_thread_fun+0x1ed/0x270
  smpboot_thread_fn+0xda/0x1d0

So something with those blk_mq fixes went sideways.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 10:42                     ` Thomas Gleixner
@ 2024-12-13 11:09                       ` Ming Lei
  2024-12-13 11:31                         ` Thomas Gleixner
  2024-12-13 11:12                       ` David Woodhouse
  1 sibling, 1 reply; 34+ messages in thread
From: Ming Lei @ 2024-12-13 11:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe

On Fri, Dec 13, 2024 at 11:42:59AM +0100, Thomas Gleixner wrote:
> On Fri, Dec 13 2024 at 09:43, David Woodhouse wrote:
> > On Fri, 2024-12-13 at 09:31 +0000, David Woodhouse wrote:
> >> 
> >> (gdb) p sysrq_handle_showstate('t')
> >> 
> >> That didn't work. Maybe if I'd actually had no_console_suspend on this
> >> boot. Will try again.
> >
> > With your fix I get the same thing (both CPUs in idle thread). And with
> > no_console_suspend on the command line, 'p sysrq_handle_showstate('t')'
> > does work...
> >
> > [  113.462898] task:loadret         state:D stack:0     pid:707   tgid:707   ppid:531    flags:0x00004002
> > [  113.463615] Call Trace:
> > [  113.463841]  <TASK>
> > [  113.464029]  __schedule+0x502/0x1a10
> > [  113.464961]  schedule+0x3a/0x140
> > [  113.465234]  schedule_timeout+0xcc/0x110
> > [  113.465580]  __wait_for_common+0x91/0x1c0
> > [  113.466304]  cpuhp_kick_ap_work+0x13e/0x390
> > [  113.466657]  _cpu_down+0xd4/0x370
> > [  113.466936]  freeze_secondary_cpus.cold+0x3f/0xd4
> > [  113.467326]  kernel_kexec+0xa2/0x1a0
> 
> That's the control thread on CPU0. The hotplug thread on CPU1 is stuck
> here:
> 
>  task:cpuhp/1         state:D stack:0     pid:24    tgid:24    ppid:2      flags:0x00004000
>  Call Trace:
>   <TASK>
>   __schedule+0x51f/0x1a80
>   schedule+0x3a/0x140
>   schedule_timeout+0x90/0x110
>   msleep+0x2b/0x40
>   blk_mq_hctx_notify_offline+0x160/0x3a0
>   cpuhp_invoke_callback+0x2a8/0x6c0
>   cpuhp_thread_fun+0x1ed/0x270
>   smpboot_thread_fn+0xda/0x1d0
> 
> So something with those blk_mq fixes went sideways.

The cpuhp callback is just waiting for inflight IOs to be completed when
the irq is still live.

It looks same with the following report:

https://lore.kernel.org/linux-scsi/F991D40F7D096653+20241203211857.0291ab1b@john-PC/

Still triggered in case of kexec & qemu, which should be one qemu
problem.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 10:42                     ` Thomas Gleixner
  2024-12-13 11:09                       ` Ming Lei
@ 2024-12-13 11:12                       ` David Woodhouse
  2024-12-13 11:33                         ` Ming Lei
  1 sibling, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-13 11:12 UTC (permalink / raw)
  To: Thomas Gleixner, Stefan Hajnoczi, Jason Wang
  Cc: x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	ming.lei, Petr Mladek, John Ogness, Peter Zijlstra, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 916 bytes --]

On Fri, 2024-12-13 at 11:42 +0100, Thomas Gleixner wrote:
> 
> That's the control thread on CPU0. The hotplug thread on CPU1 is stuck
> here:
> 
>  task:cpuhp/1         state:D stack:0     pid:24    tgid:24    ppid:2      flags:0x00004000
>  Call Trace:
>   <TASK>
>   __schedule+0x51f/0x1a80
>   schedule+0x3a/0x140
>   schedule_timeout+0x90/0x110
>   msleep+0x2b/0x40
>   blk_mq_hctx_notify_offline+0x160/0x3a0
>   cpuhp_invoke_callback+0x2a8/0x6c0
>   cpuhp_thread_fun+0x1ed/0x270
>   smpboot_thread_fn+0xda/0x1d0
> 
> So something with those blk_mq fixes went sideways.

 $ git bisect bad
7678abee0867e6b7fb89aa40f6e9f575f755fb37 is the first bad commit
commit 7678abee0867e6b7fb89aa40f6e9f575f755fb37 (HEAD)
Author: Ming Lei <ming.lei@redhat.com>
Date:   Tue Nov 12 20:58:21 2024 +0800

    virtio-blk: don't keep queue frozen during system suspend


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13  0:14               ` Thomas Gleixner
  2024-12-13  9:31                 ` David Woodhouse
@ 2024-12-13 11:20                 ` Peter Zijlstra
  2024-12-13 13:13                   ` Thomas Gleixner
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2024-12-13 11:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, ming.lei, Petr Mladek, John Ogness

On Fri, Dec 13, 2024 at 01:14:54AM +0100, Thomas Gleixner wrote:

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
>  #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
>  int __sched __cond_resched(void)
>  {
> -	if (should_resched(0)) {
> +	if (should_resched(0) && system_state != SYSTEM_SUSPEND) {
>  		preempt_schedule_common();
>  		return 1;
>  	}

Perhaps we should just do:

	if (should_resched(0) && !irqs_disabled())

That's more or less what preemptible() does too. Yes, mucking about with
IF is expensive, but it's only done if preempt_count is agreeing we
should schedule, which it shouldn't be most of the time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 11:09                       ` Ming Lei
@ 2024-12-13 11:31                         ` Thomas Gleixner
  2024-12-13 11:48                           ` Ming Lei
  2024-12-13 13:17                           ` David Woodhouse
  0 siblings, 2 replies; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-13 11:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: David Woodhouse, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

On Fri, Dec 13 2024 at 19:09, Ming Lei wrote:
> On Fri, Dec 13, 2024 at 11:42:59AM +0100, Thomas Gleixner wrote:
>> That's the control thread on CPU0. The hotplug thread on CPU1 is stuck
>> here:
>> 
>>  task:cpuhp/1         state:D stack:0     pid:24    tgid:24    ppid:2      flags:0x00004000
>>  Call Trace:
>>   <TASK>
>>   __schedule+0x51f/0x1a80
>>   schedule+0x3a/0x140
>>   schedule_timeout+0x90/0x110
>>   msleep+0x2b/0x40
>>   blk_mq_hctx_notify_offline+0x160/0x3a0
>>   cpuhp_invoke_callback+0x2a8/0x6c0
>>   cpuhp_thread_fun+0x1ed/0x270
>>   smpboot_thread_fn+0xda/0x1d0
>> 
>> So something with those blk_mq fixes went sideways.
>
> The cpuhp callback is just waiting for inflight IOs to be completed when
> the irq is still live.
>
> It looks same with the following report:
>
> https://lore.kernel.org/linux-scsi/F991D40F7D096653+20241203211857.0291ab1b@john-PC/
>
> Still triggered in case of kexec & qemu, which should be one qemu
> problem.

I'd rather say, that's a kexec problem. On the same instance a loop test
of suspend to ram with pm_test=core just works fine. That's equivalent
to the kexec scenario. It goes down to syscore_suspend() and skips the
actual suspend low level magic. It then resumes with syscore_resume()
and brings the machine back up.

That runs for 2 hours now, while the kexec muck dies within 2
minutes....

And if you look at the difference of these implementations, you might
notice that kexec just implemented some rudimentary version of the
actual suspend logic. Based on let's hope it works that way.

This is just insane and should be rewritten to actually reuse the suspend
mechanism, which is way better tested than this kexec jump muck.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 11:12                       ` David Woodhouse
@ 2024-12-13 11:33                         ` Ming Lei
  0 siblings, 0 replies; 34+ messages in thread
From: Ming Lei @ 2024-12-13 11:33 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe

On Fri, Dec 13, 2024 at 11:12:41AM +0000, David Woodhouse wrote:
> On Fri, 2024-12-13 at 11:42 +0100, Thomas Gleixner wrote:
> > 
> > That's the control thread on CPU0. The hotplug thread on CPU1 is stuck
> > here:
> > 
> >  task:cpuhp/1         state:D stack:0     pid:24    tgid:24    ppid:2      flags:0x00004000
> >  Call Trace:
> >   <TASK>
> >   __schedule+0x51f/0x1a80
> >   schedule+0x3a/0x140
> >   schedule_timeout+0x90/0x110
> >   msleep+0x2b/0x40
> >   blk_mq_hctx_notify_offline+0x160/0x3a0
> >   cpuhp_invoke_callback+0x2a8/0x6c0
> >   cpuhp_thread_fun+0x1ed/0x270
> >   smpboot_thread_fn+0xda/0x1d0
> > 
> > So something with those blk_mq fixes went sideways.
> 
>  $ git bisect bad
> 7678abee0867e6b7fb89aa40f6e9f575f755fb37 is the first bad commit
> commit 7678abee0867e6b7fb89aa40f6e9f575f755fb37 (HEAD)
> Author: Ming Lei <ming.lei@redhat.com>
> Date:   Tue Nov 12 20:58:21 2024 +0800
> 
>     virtio-blk: don't keep queue frozen during system suspend

The above commit just adds blk_mq_unfreeze_queue() in virtblk_freeze(),
which may wake up pending request allocation.

Seems frozen processes still can be woken up after suspend_freeze_processes()
returns, then new request allocation can't be completed because there
isn't good CPU for this allocation.

Is it expected to see frozen process to be wakeup during suspend?

If yes, the following patch may be helpful:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index aa340b097b6e..0cfc7a927e7e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -557,6 +557,9 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 	if (tag == BLK_MQ_NO_TAG) {
 		if (data->flags & BLK_MQ_REQ_NOWAIT)
 			return NULL;
+
+		if (system_state != SYSTEM_RUNNING)
+			return NULL;
 		/*
 		 * Give up the CPU and sleep for a random short time to
 		 * ensure that thread using a realtime scheduling class


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 11:31                         ` Thomas Gleixner
@ 2024-12-13 11:48                           ` Ming Lei
  2024-12-13 13:23                             ` Thomas Gleixner
  2024-12-13 13:17                           ` David Woodhouse
  1 sibling, 1 reply; 34+ messages in thread
From: Ming Lei @ 2024-12-13 11:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

On Fri, Dec 13, 2024 at 12:31:24PM +0100, Thomas Gleixner wrote:
> On Fri, Dec 13 2024 at 19:09, Ming Lei wrote:
> > On Fri, Dec 13, 2024 at 11:42:59AM +0100, Thomas Gleixner wrote:
> >> That's the control thread on CPU0. The hotplug thread on CPU1 is stuck
> >> here:
> >> 
> >>  task:cpuhp/1         state:D stack:0     pid:24    tgid:24    ppid:2      flags:0x00004000
> >>  Call Trace:
> >>   <TASK>
> >>   __schedule+0x51f/0x1a80
> >>   schedule+0x3a/0x140
> >>   schedule_timeout+0x90/0x110
> >>   msleep+0x2b/0x40
> >>   blk_mq_hctx_notify_offline+0x160/0x3a0
> >>   cpuhp_invoke_callback+0x2a8/0x6c0
> >>   cpuhp_thread_fun+0x1ed/0x270
> >>   smpboot_thread_fn+0xda/0x1d0
> >> 
> >> So something with those blk_mq fixes went sideways.
> >
> > The cpuhp callback is just waiting for inflight IOs to be completed when
> > the irq is still live.
> >
> > It looks same with the following report:
> >
> > https://lore.kernel.org/linux-scsi/F991D40F7D096653+20241203211857.0291ab1b@john-PC/
> >
> > Still triggered in case of kexec & qemu, which should be one qemu
> > problem.
> 
> I'd rather say, that's a kexec problem. On the same instance a loop test
> of suspend to ram with pm_test=core just works fine. That's equivalent
> to the kexec scenario. It goes down to syscore_suspend() and skips the
> actual suspend low level magic. It then resumes with syscore_resume()
> and brings the machine back up.
> 
> That runs for 2 hours now, while the kexec muck dies within 2
> minutes....
> 
> And if you look at the difference of these implementations, you might
> notice that kexec just implemented some rudimentary version of the
> actual suspend logic. Based on let's hope it works that way.
> 
> This is just insane and should be rewritten to actually reuse the suspend
> mechanism, which is way better tested than this kexec jump muck.

But kexec is supposed to align with reboot/shutdown, instead of suspend,
and it is calling ->shutdown() for notifying driver & device.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 11:20                 ` Peter Zijlstra
@ 2024-12-13 13:13                   ` Thomas Gleixner
  2024-12-16 13:20                     ` [PATCH] sched: Prevent rescheduling when interrupts are disabled Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-13 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Woodhouse, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, ming.lei, Petr Mladek, John Ogness

On Fri, Dec 13 2024 at 12:20, Peter Zijlstra wrote:
> On Fri, Dec 13, 2024 at 01:14:54AM +0100, Thomas Gleixner wrote:
>
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
>>  #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
>>  int __sched __cond_resched(void)
>>  {
>> -	if (should_resched(0)) {
>> +	if (should_resched(0) && system_state != SYSTEM_SUSPEND) {
>>  		preempt_schedule_common();
>>  		return 1;
>>  	}
>
> Perhaps we should just do:
>
> 	if (should_resched(0) && !irqs_disabled())
>
> That's more or less what preemptible() does too. Yes, mucking about with
> IF is expensive, but it's only done if preempt_count is agreeing we
> should schedule, which it shouldn't be most of the time.

Makes sense. Let me write a real patch.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 11:31                         ` Thomas Gleixner
  2024-12-13 11:48                           ` Ming Lei
@ 2024-12-13 13:17                           ` David Woodhouse
  1 sibling, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2024-12-13 13:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ming Lei
  Cc: Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa, dyoung, kexec,
	linux-ext4, Michael S. Tsirkin, Stefano Garzarella, eperezma,
	Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 4537 bytes --]

On Fri, 2024-12-13 at 12:31 +0100, Thomas Gleixner wrote:
> On Fri, Dec 13 2024 at 19:09, Ming Lei wrote:
> > On Fri, Dec 13, 2024 at 11:42:59AM +0100, Thomas Gleixner wrote:
> > > That's the control thread on CPU0. The hotplug thread on CPU1 is stuck
> > > here:
> > > 
> > >  task:cpuhp/1         state:D stack:0     pid:24    tgid:24    ppid:2      flags:0x00004000
> > >  Call Trace:
> > >   <TASK>
> > >   __schedule+0x51f/0x1a80
> > >   schedule+0x3a/0x140
> > >   schedule_timeout+0x90/0x110
> > >   msleep+0x2b/0x40
> > >   blk_mq_hctx_notify_offline+0x160/0x3a0
> > >   cpuhp_invoke_callback+0x2a8/0x6c0
> > >   cpuhp_thread_fun+0x1ed/0x270
> > >   smpboot_thread_fn+0xda/0x1d0
> > > 
> > > So something with those blk_mq fixes went sideways.
> > 
> > The cpuhp callback is just waiting for inflight IOs to be completed when
> > the irq is still live.
> > 
> > It looks same with the following report:
> > 
> > https://lore.kernel.org/linux-scsi/F991D40F7D096653+20241203211857.0291ab1b@john-PC/
> > 
> > Still triggered in case of kexec & qemu, which should be one qemu
> > problem.
> 
> I'd rather say, that's a kexec problem. On the same instance a loop test
> of suspend to ram with pm_test=core just works fine. That's equivalent
> to the kexec scenario. It goes down to syscore_suspend() and skips the
> actual suspend low level magic. It then resumes with syscore_resume()
> and brings the machine back up.
> 
> That runs for 2 hours now, while the kexec muck dies within 2
> minutes....
> 
> And if you look at the difference of these implementations, you might
> notice that kexec just implemented some rudimentary version of the
> actual suspend logic. Based on let's hope it works that way.
> 
> This is just insane and should be rewritten to actually reuse the suspend
> mechanism, which is way better tested than this kexec jump muck.

Not sure it helps for the above linux-scsi issue since that's an
*actual* kexec, not 'kexec jump muck'. But for the kjump this dirty
proof of concept seems to work:

--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -19,6 +19,7 @@
 #include <linux/gfp.h>
 #include <linux/io.h>
 #include <linux/kernel.h>
+#include <linux/kexec.h>
 #include <linux/list.h>
 #include <linux/mm.h>
 #include <linux/slab.h>
@@ -446,6 +447,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
        error = syscore_suspend();
        if (!error) {
                *wakeup = pm_wakeup_pending();
+               if (kexec_image && kexec_image->preserve_context) {
+                       machine_kexec(kexec_image);
+               } else
                if (!(suspend_test(TEST_CORE) || *wakeup)) {
                        trace_suspend_resume(TPS("machine_suspend"),
                                state, true);


[root@localhost ~]# echo mem > /sys/power/state
[   61.854085] PM: suspend entry (deep)
[   61.868380] Filesystems sync: 0.013 seconds
[   61.873692] Freezing user space processes
[   61.876739] Freezing user space processes completed (elapsed 0.002 seconds)
[   61.878175] OOM killer disabled.
[   61.878861] Freezing remaining freezable tasks
[   61.880818] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[   61.889138] ata2.00: Check power mode failed (err_mask=0x1)
[   61.893351] PM: suspend devices took 0.011 seconds
[   61.899373] ACPI: PM: Preparing to enter system sleep state S3
[   61.900802] ACPI: PM: Saving platform NVS memory
[   61.901861] Disabling non-boot CPUs ...
[   61.906841] smpboot: CPU 1 is now offline
Exc:0000000000000003
Err:0000000000000000
rip:0000228e60970000
rax:0000000000000018
rbx:0000000000000000
rcx:0000000000000001
rdx:0000000000000000
rsi:00000000228e6540
rdi:00000000228e4002
r8 :0000000000000000
r9 :0000000022927000
r10:0000000000000000
r11:0000000000000001
r12:0000000000170e70
r13:0000000000170ef0
r14:ffff888006064110
r15:ffff888006f61e20
cr2:00007f408f990098
B[   61.925154] Enabling non-boot CPUs ...
[   61.925987] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   61.929886] CPU1 is up
[   61.930514] ACPI: PM: Waking up from system sleep state S3
[   61.950391] virtio_blk virtio1: 2/0/0 default/read/poll queues
[   61.954844] PM: resume devices took 0.020 seconds
[   61.955968] OOM killer enabled.
[   61.956500] Restarting tasks ... done.
[   61.958890] random: crng reseeded on system resumption
[   61.962280] PM: suspend exit


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 11:48                           ` Ming Lei
@ 2024-12-13 13:23                             ` Thomas Gleixner
  2024-12-13 14:07                               ` David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-13 13:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: David Woodhouse, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

On Fri, Dec 13 2024 at 19:48, Ming Lei wrote:
> On Fri, Dec 13, 2024 at 12:31:24PM +0100, Thomas Gleixner wrote:
>> I'd rather say, that's a kexec problem. On the same instance a loop test
>> of suspend to ram with pm_test=core just works fine. That's equivalent
>> to the kexec scenario. It goes down to syscore_suspend() and skips the
>> actual suspend low level magic. It then resumes with syscore_resume()
>> and brings the machine back up.
>> 
>> That runs for 2 hours now, while the kexec muck dies within 2
>> minutes....
>> 
>> And if you look at the difference of these implementations, you might
>> notice that kexec just implemented some rudimentary version of the
>> actual suspend logic. Based on let's hope it works that way.
>> 
>> This is just insane and should be rewritten to actually reuse the suspend
>> mechanism, which is way better tested than this kexec jump muck.
>
> But kexec is supposed to align with reboot/shutdown, instead of suspend,
> and it is calling ->shutdown() for notifying driver & device.

That's only true for the case where the new kernel takes over.

In the case KEXEC_JUMP=n and kexec_image->preserve_context == true, then
it is supposed to align with suspend/resume and if you look at the code
then it actually mimics suspend/resume in the most dilettanteish way.

It's a patently bad idea to clobber the kernel with kexec jump "fixes"
instead of using the well tested and established suspend/resume
machinery.

All it takes is to:

    1) disable the wakeup logic

    2) provide a mechanism to invoke machine_kexec() instead of the
       actual suspend mechanism.

No?

Thanks

        tglx




^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 13:23                             ` Thomas Gleixner
@ 2024-12-13 14:07                               ` David Woodhouse
  2024-12-13 17:05                                 ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-13 14:07 UTC (permalink / raw)
  To: Thomas Gleixner, Ming Lei
  Cc: Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa, dyoung, kexec,
	linux-ext4, Michael S. Tsirkin, Stefano Garzarella, eperezma,
	Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 2750 bytes --]

On Fri, 2024-12-13 at 14:23 +0100, Thomas Gleixner wrote:
> On Fri, Dec 13 2024 at 19:48, Ming Lei wrote:
> > On Fri, Dec 13, 2024 at 12:31:24PM +0100, Thomas Gleixner wrote:
> > > I'd rather say, that's a kexec problem. On the same instance a loop test
> > > of suspend to ram with pm_test=core just works fine. That's equivalent
> > > to the kexec scenario. It goes down to syscore_suspend() and skips the
> > > actual suspend low level magic. It then resumes with syscore_resume()
> > > and brings the machine back up.
> > > 
> > > That runs for 2 hours now, while the kexec muck dies within 2
> > > minutes....
> > > 
> > > And if you look at the difference of these implementations, you might
> > > notice that kexec just implemented some rudimentary version of the
> > > actual suspend logic. Based on let's hope it works that way.
> > > 
> > > This is just insane and should be rewritten to actually reuse the suspend
> > > mechanism, which is way better tested than this kexec jump muck.
> > 
> > But kexec is supposed to align with reboot/shutdown, instead of suspend,
> > and it is calling ->shutdown() for notifying driver & device.
> 
> That's only true for the case where the new kernel takes over.
> 
> In the case KEXEC_JUMP=n and kexec_image->preserve_context == true, then
> it is supposed to align with suspend/resume and if you look at the code
> then it actually mimics suspend/resume in the most dilettanteish way.

Did you mean KEXEC_JUMP=y there?

I spent a while the other week trying to understand the case where
CONFIG_KEXEC_JUMP=n and kexec_image->preserve_context=true, and came to
the conclusion that it was a mirage. Userspace can't *actually* set the
KEXEC_PRESERVE_CONTEXT bit when setting up the image, if KEXEC_JUMP=n.

The whole of the code path for that case is dead code. It's confusing
because as discussed elsewhere, we don't just #ifdef out the whole of
that dead code path, but only the bits which don't actually *compile*
(like references to restore_processor_state() etc.).

> It's a patently bad idea to clobber the kernel with kexec jump "fixes"
> instead of using the well tested and established suspend/resume
> machinery.
> 
> All it takes is to:
> 
>     1) disable the wakeup logic
> 
>     2) provide a mechanism to invoke machine_kexec() instead of the
>        actual suspend mechanism.
> 
> No?

Agreed. The hacky proof of concept I posted earlier invoking
machine_kexec() instead of suspend_ops->enter() works fine. I'll look
at cleaning it up and making it not invoke all the ACPI hooks for
*actual* suspend to RAM, etc.

As I noted though, it doesn't address that linux-scsi report which was
a *real* kexec, not a kjump.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 14:07                               ` David Woodhouse
@ 2024-12-13 17:05                                 ` Thomas Gleixner
  2024-12-13 17:17                                   ` David Woodhouse
                                                     ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-13 17:05 UTC (permalink / raw)
  To: David Woodhouse, Ming Lei
  Cc: Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa, dyoung, kexec,
	linux-ext4, Michael S. Tsirkin, Stefano Garzarella, eperezma,
	Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

On Fri, Dec 13 2024 at 14:07, David Woodhouse wrote:
> On Fri, 2024-12-13 at 14:23 +0100, Thomas Gleixner wrote:
>> That's only true for the case where the new kernel takes over.
>> 
>> In the case KEXEC_JUMP=n and kexec_image->preserve_context == true, then
>> it is supposed to align with suspend/resume and if you look at the code
>> then it actually mimics suspend/resume in the most dilettanteish way.
>
> Did you mean KEXEC_JUMP=y there?

Yes, of course.

> I spent a while the other week trying to understand the case where
> CONFIG_KEXEC_JUMP=n and kexec_image->preserve_context=true, and came to
> the conclusion that it was a mirage. Userspace can't *actually* set the
> KEXEC_PRESERVE_CONTEXT bit when setting up the image, if KEXEC_JUMP=n.
>
> The whole of the code path for that case is dead code. It's confusing
> because as discussed elsewhere, we don't just #ifdef out the whole of
> that dead code path, but only the bits which don't actually *compile*
> (like references to restore_processor_state() etc.).

Yes, I had to stare at it quite a while. :)

>> It's a patently bad idea to clobber the kernel with kexec jump "fixes"
>> instead of using the well tested and established suspend/resume
>> machinery.
>> 
>> All it takes is to:
>> 
>>     1) disable the wakeup logic
>> 
>>     2) provide a mechanism to invoke machine_kexec() instead of the
>>        actual suspend mechanism.
>> 
>> No?
>
> Agreed. The hacky proof of concept I posted earlier invoking
> machine_kexec() instead of suspend_ops->enter() works fine. I'll look
> at cleaning it up and making it not invoke all the ACPI hooks for
> *actual* suspend to RAM, etc.

Something like the below? It survived an hour of loop testing.

> As I noted though, it doesn't address that linux-scsi report which was
> a *real* kexec, not a kjump.

I was not looking at that path at all.

Thanks,

        tglx
---
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -582,4 +582,7 @@ enum suspend_stat_step {
 void dpm_save_failed_dev(const char *name);
 void dpm_save_failed_step(enum suspend_stat_step step);
 
+int kexec_suspend(void);
+void kexec_machine_execute(void);
+
 #endif /* _LINUX_SUSPEND_H */
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -984,6 +984,12 @@ bool kexec_load_permitted(int kexec_imag
 	return true;
 }
 
+void kexec_machine_execute(void)
+{
+	kmsg_dump(KMSG_DUMP_SHUTDOWN);
+	machine_kexec(kexec_image);
+}
+
 /*
  * Move into place and start executing a preloaded standalone
  * executable.  If nothing was preloaded return an error.
@@ -999,38 +1005,9 @@ int kernel_kexec(void)
 		goto Unlock;
 	}
 
-#ifdef CONFIG_KEXEC_JUMP
-	if (kexec_image->preserve_context) {
-		pm_prepare_console();
-		error = freeze_processes();
-		if (error) {
-			error = -EBUSY;
-			goto Restore_console;
-		}
-		suspend_console();
-		error = dpm_suspend_start(PMSG_FREEZE);
-		if (error)
-			goto Resume_console;
-		/* At this point, dpm_suspend_start() has been called,
-		 * but *not* dpm_suspend_end(). We *must* call
-		 * dpm_suspend_end() now.  Otherwise, drivers for
-		 * some devices (e.g. interrupt controllers) become
-		 * desynchronized with the actual state of the
-		 * hardware at resume time, and evil weirdness ensues.
-		 */
-		error = dpm_suspend_end(PMSG_FREEZE);
-		if (error)
-			goto Resume_devices;
-		error = suspend_disable_secondary_cpus();
-		if (error)
-			goto Enable_cpus;
-		local_irq_disable();
-		error = syscore_suspend();
-		if (error)
-			goto Enable_irqs;
-	} else
-#endif
-	{
+	if (IS_ENABLED(CONFIG_KEXEC_JUMP) && kexec_image->preserve_context) {
+		error = kexec_suspend();
+	} else {
 		kexec_in_progress = true;
 		kernel_restart_prepare("kexec reboot");
 		migrate_to_reboot_cpu();
@@ -1045,30 +1022,10 @@ int kernel_kexec(void)
 		cpu_hotplug_enable();
 		pr_notice("Starting new kernel\n");
 		machine_shutdown();
+		kexec_machine_execute();
 	}
 
-	kmsg_dump(KMSG_DUMP_SHUTDOWN);
-	machine_kexec(kexec_image);
-
-#ifdef CONFIG_KEXEC_JUMP
-	if (kexec_image->preserve_context) {
-		syscore_resume();
- Enable_irqs:
-		local_irq_enable();
- Enable_cpus:
-		suspend_enable_secondary_cpus();
-		dpm_resume_start(PMSG_RESTORE);
- Resume_devices:
-		dpm_resume_end(PMSG_RESTORE);
- Resume_console:
-		resume_console();
-		thaw_processes();
- Restore_console:
-		pm_restore_console();
-	}
-#endif
-
- Unlock:
+Unlock:
 	kexec_unlock();
 	return error;
 }
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -400,7 +400,7 @@ void __weak arch_suspend_enable_irqs(voi
  *
  * This function should be called after devices have been suspended.
  */
-static int suspend_enter(suspend_state_t state, bool *wakeup)
+static int suspend_enter(suspend_state_t state, bool *wakeup, bool kexec_jump)
 {
 	int error;
 
@@ -445,15 +445,19 @@ static int suspend_enter(suspend_state_t
 
 	error = syscore_suspend();
 	if (!error) {
-		*wakeup = pm_wakeup_pending();
-		if (!(suspend_test(TEST_CORE) || *wakeup)) {
-			trace_suspend_resume(TPS("machine_suspend"),
-				state, true);
-			error = suspend_ops->enter(state);
-			trace_suspend_resume(TPS("machine_suspend"),
-				state, false);
-		} else if (*wakeup) {
-			error = -EBUSY;
+		if (IS_ENABLED(CONFIG_KEXEC_JUMP) && kexec_jump) {
+			kexec_machine_execute();
+		} else {
+			*wakeup = pm_wakeup_pending();
+			if (!(suspend_test(TEST_CORE) || *wakeup)) {
+				trace_suspend_resume(TPS("machine_suspend"),
+						     state, true);
+				error = suspend_ops->enter(state);
+				trace_suspend_resume(TPS("machine_suspend"),
+						     state, false);
+			} else if (*wakeup) {
+				error = -EBUSY;
+			}
 		}
 		syscore_resume();
 	}
@@ -485,7 +489,7 @@ static int suspend_enter(suspend_state_t
  * suspend_devices_and_enter - Suspend devices and enter system sleep state.
  * @state: System sleep state to enter.
  */
-int suspend_devices_and_enter(suspend_state_t state)
+static int __suspend_devices_and_enter(suspend_state_t state, bool kexec_jump)
 {
 	int error;
 	bool wakeup = false;
@@ -514,7 +518,7 @@ int suspend_devices_and_enter(suspend_st
 		goto Recover_platform;
 
 	do {
-		error = suspend_enter(state, &wakeup);
+		error = suspend_enter(state, &wakeup, kexec_jump);
 	} while (!error && !wakeup && platform_suspend_again(state));
 
  Resume_devices:
@@ -536,6 +540,15 @@ int suspend_devices_and_enter(suspend_st
 }
 
 /**
+ * suspend_devices_and_enter - Suspend devices and enter system sleep state.
+ * @state: System sleep state to enter.
+ */
+int suspend_devices_and_enter(suspend_state_t state)
+{
+	return __suspend_devices_and_enter(state, false);
+}
+
+/**
  * suspend_finish - Clean up before finishing the suspend sequence.
  *
  * Call platform code to clean up, restart processes, and free the console that
@@ -607,6 +620,21 @@ static int enter_state(suspend_state_t s
 	return error;
 }
 
+#ifdef CONFIG_KEXEC_JUMP
+int kexec_suspend(void)
+{
+	int error;
+
+	ksys_sync_helper();
+	error = freeze_processes();
+	if (error)
+		return error;
+	error = __suspend_devices_and_enter(PM_SUSPEND_MEM, true);
+	thaw_processes();
+	return error;
+}
+#endif
+
 /**
  * pm_suspend - Externally visible function for suspending the system.
  * @state: System sleep state to enter.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 17:05                                 ` Thomas Gleixner
@ 2024-12-13 17:17                                   ` David Woodhouse
  2024-12-13 17:48                                     ` Rafael J. Wysocki
  2024-12-13 17:32                                   ` Rafael J. Wysocki
  2024-12-13 17:59                                   ` Rafael J. Wysocki
  2 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-13 17:17 UTC (permalink / raw)
  To: Thomas Gleixner, Ming Lei
  Cc: Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa, dyoung, kexec,
	linux-ext4, Michael S. Tsirkin, Stefano Garzarella, eperezma,
	Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 1811 bytes --]

On Fri, 2024-12-13 at 18:05 +0100, Thomas Gleixner wrote:
> 
> > Agreed. The hacky proof of concept I posted earlier invoking
> > machine_kexec() instead of suspend_ops->enter() works fine. I'll look
> > at cleaning it up and making it not invoke all the ACPI hooks for
> > *actual* suspend to RAM, etc.
> 
> Something like the below? It survived an hour of loop testing.

If I read that correctly, it's still invoking the standard platform
(e.g. ACPI) hooks for suspend-to-RAM, when it probably shouldn't?

I suspect it wants its *own* set of platform_suspend_ops, which are
mostly empty apart from the ->enter() ? 

I started looking at that, but now my eyes are currently bleeding after
seeing the existing platform_suspend_ops vs. platform_s2idle_ops
structures, which are kind of similar but not the same. And the set of
helper functions which invoke one or the other, from the barely
tolerable platform_resume_end()...

static void platform_resume_end(suspend_state_t state)
{
	if (state == PM_SUSPEND_TO_IDLE && s2idle_ops && s2idle_ops->end)
		s2idle_ops->end();
	else if (suspend_ops && suspend_ops->end)
		suspend_ops->end();
}

... to the extra-special platform_resume_noirq() which is similar
except that it needs three *different* names (_resume_noirq vs.
restore_early vs. wake):

static void platform_resume_noirq(suspend_state_t state)
{
	if (state == PM_SUSPEND_TO_IDLE) {
		if (s2idle_ops && s2idle_ops->restore_early)
			s2idle_ops->restore_early();
	} else if (suspend_ops->wake) {
		suspend_ops->wake();
	}
}

I wonder if we end up wanting a *third* set there, for the kjump_ops?
Except can we unify the structure definitions and then just *use* the
appropriate one of the three, which is either passed down or selected
using the 'state'?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 17:05                                 ` Thomas Gleixner
  2024-12-13 17:17                                   ` David Woodhouse
@ 2024-12-13 17:32                                   ` Rafael J. Wysocki
  2024-12-13 19:06                                     ` Rafael J. Wysocki
  2024-12-13 17:59                                   ` Rafael J. Wysocki
  2 siblings, 1 reply; 34+ messages in thread
From: Rafael J. Wysocki @ 2024-12-13 17:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Ming Lei, Stefan Hajnoczi, Jason Wang,
	x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	Petr Mladek, John Ogness, Peter Zijlstra, Jens Axboe,
	Rafael J. Wysocki

On Fri, Dec 13, 2024 at 6:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, Dec 13 2024 at 14:07, David Woodhouse wrote:
> > On Fri, 2024-12-13 at 14:23 +0100, Thomas Gleixner wrote:
> >> That's only true for the case where the new kernel takes over.
> >>
> >> In the case KEXEC_JUMP=n and kexec_image->preserve_context == true, then
> >> it is supposed to align with suspend/resume and if you look at the code
> >> then it actually mimics suspend/resume in the most dilettanteish way.
> >
> > Did you mean KEXEC_JUMP=y there?
>
> Yes, of course.
>
> > I spent a while the other week trying to understand the case where
> > CONFIG_KEXEC_JUMP=n and kexec_image->preserve_context=true, and came to
> > the conclusion that it was a mirage. Userspace can't *actually* set the
> > KEXEC_PRESERVE_CONTEXT bit when setting up the image, if KEXEC_JUMP=n.
> >
> > The whole of the code path for that case is dead code. It's confusing
> > because as discussed elsewhere, we don't just #ifdef out the whole of
> > that dead code path, but only the bits which don't actually *compile*
> > (like references to restore_processor_state() etc.).
>
> Yes, I had to stare at it quite a while. :)
>
> >> It's a patently bad idea to clobber the kernel with kexec jump "fixes"
> >> instead of using the well tested and established suspend/resume
> >> machinery.
> >>
> >> All it takes is to:
> >>
> >>     1) disable the wakeup logic
> >>
> >>     2) provide a mechanism to invoke machine_kexec() instead of the
> >>        actual suspend mechanism.
> >>
> >> No?
> >
> > Agreed. The hacky proof of concept I posted earlier invoking
> > machine_kexec() instead of suspend_ops->enter() works fine. I'll look
> > at cleaning it up and making it not invoke all the ACPI hooks for
> > *actual* suspend to RAM, etc.
>
> Something like the below? It survived an hour of loop testing.

I think that this KEXEC_JUMP thing can be dropped entirely and forgotten.

I'm not aware of anyone actually using it.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 17:17                                   ` David Woodhouse
@ 2024-12-13 17:48                                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 34+ messages in thread
From: Rafael J. Wysocki @ 2024-12-13 17:48 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, Ming Lei, Stefan Hajnoczi, Jason Wang,
	x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	Petr Mladek, John Ogness, Peter Zijlstra, Jens Axboe,
	Rafael J. Wysocki

On Fri, Dec 13, 2024 at 6:17 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Fri, 2024-12-13 at 18:05 +0100, Thomas Gleixner wrote:
> >
> > > Agreed. The hacky proof of concept I posted earlier invoking
> > > machine_kexec() instead of suspend_ops->enter() works fine. I'll look
> > > at cleaning it up and making it not invoke all the ACPI hooks for
> > > *actual* suspend to RAM, etc.
> >
> > Something like the below? It survived an hour of loop testing.
>
> If I read that correctly, it's still invoking the standard platform
> (e.g. ACPI) hooks for suspend-to-RAM, when it probably shouldn't?
>
> I suspect it wants its *own* set of platform_suspend_ops, which are
> mostly empty apart from the ->enter() ?
>
> I started looking at that, but now my eyes are currently bleeding after
> seeing the existing platform_suspend_ops vs. platform_s2idle_ops
> structures, which are kind of similar but not the same. And the set of
> helper functions which invoke one or the other, from the barely
> tolerable platform_resume_end()...
>
> static void platform_resume_end(suspend_state_t state)
> {
>         if (state == PM_SUSPEND_TO_IDLE && s2idle_ops && s2idle_ops->end)
>                 s2idle_ops->end();
>         else if (suspend_ops && suspend_ops->end)
>                 suspend_ops->end();
> }
>
> ... to the extra-special platform_resume_noirq() which is similar
> except that it needs three *different* names (_resume_noirq vs.
> restore_early vs. wake):
>
> static void platform_resume_noirq(suspend_state_t state)
> {
>         if (state == PM_SUSPEND_TO_IDLE) {
>                 if (s2idle_ops && s2idle_ops->restore_early)
>                         s2idle_ops->restore_early();
>         } else if (suspend_ops->wake) {
>                 suspend_ops->wake();
>         }
> }
>
>
> I wonder if we end up wanting a *third* set there, for the kjump_ops?

No, no.

The "vision" behind kexec jump was to use it for implementing
hibernation, but that never happened.

It doesn't actually need any platform ops at all because everything it
does is to jump from one kernel to another, both residing in memory at
this point.  No firmware is involved.

> Except can we unify the structure definitions and then just *use* the
> appropriate one of the three, which is either passed down or selected
> using the 'state'?

That can be done I suppose, but it won't help much in the kexec jump case.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 17:05                                 ` Thomas Gleixner
  2024-12-13 17:17                                   ` David Woodhouse
  2024-12-13 17:32                                   ` Rafael J. Wysocki
@ 2024-12-13 17:59                                   ` Rafael J. Wysocki
  2 siblings, 0 replies; 34+ messages in thread
From: Rafael J. Wysocki @ 2024-12-13 17:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Ming Lei, Stefan Hajnoczi, Jason Wang,
	x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	Petr Mladek, John Ogness, Peter Zijlstra, Jens Axboe,
	Rafael J. Wysocki

On Fri, Dec 13, 2024 at 6:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, Dec 13 2024 at 14:07, David Woodhouse wrote:
> > On Fri, 2024-12-13 at 14:23 +0100, Thomas Gleixner wrote:
> >> That's only true for the case where the new kernel takes over.
> >>
> >> In the case KEXEC_JUMP=n and kexec_image->preserve_context == true, then
> >> it is supposed to align with suspend/resume and if you look at the code
> >> then it actually mimics suspend/resume in the most dilettanteish way.
> >
> > Did you mean KEXEC_JUMP=y there?
>
> Yes, of course.
>
> > I spent a while the other week trying to understand the case where
> > CONFIG_KEXEC_JUMP=n and kexec_image->preserve_context=true, and came to
> > the conclusion that it was a mirage. Userspace can't *actually* set the
> > KEXEC_PRESERVE_CONTEXT bit when setting up the image, if KEXEC_JUMP=n.
> >
> > The whole of the code path for that case is dead code. It's confusing
> > because as discussed elsewhere, we don't just #ifdef out the whole of
> > that dead code path, but only the bits which don't actually *compile*
> > (like references to restore_processor_state() etc.).
>
> Yes, I had to stare at it quite a while. :)
>
> >> It's a patently bad idea to clobber the kernel with kexec jump "fixes"
> >> instead of using the well tested and established suspend/resume
> >> machinery.
> >>
> >> All it takes is to:
> >>
> >>     1) disable the wakeup logic
> >>
> >>     2) provide a mechanism to invoke machine_kexec() instead of the
> >>        actual suspend mechanism.
> >>
> >> No?
> >
> > Agreed. The hacky proof of concept I posted earlier invoking
> > machine_kexec() instead of suspend_ops->enter() works fine. I'll look
> > at cleaning it up and making it not invoke all the ACPI hooks for
> > *actual* suspend to RAM, etc.
>
> Something like the below? It survived an hour of loop testing.
>
> > As I noted though, it doesn't address that linux-scsi report which was
> > a *real* kexec, not a kjump.
>
> I was not looking at that path at all.
>
> Thanks,
>
>         tglx
> ---
> --- a/include/linux/suspend.h
> +++ b/include/linux/suspend.h
> @@ -582,4 +582,7 @@ enum suspend_stat_step {
>  void dpm_save_failed_dev(const char *name);
>  void dpm_save_failed_step(enum suspend_stat_step step);
>
> +int kexec_suspend(void);
> +void kexec_machine_execute(void);
> +
>  #endif /* _LINUX_SUSPEND_H */
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -984,6 +984,12 @@ bool kexec_load_permitted(int kexec_imag
>         return true;
>  }
>
> +void kexec_machine_execute(void)
> +{
> +       kmsg_dump(KMSG_DUMP_SHUTDOWN);
> +       machine_kexec(kexec_image);
> +}
> +
>  /*
>   * Move into place and start executing a preloaded standalone
>   * executable.  If nothing was preloaded return an error.
> @@ -999,38 +1005,9 @@ int kernel_kexec(void)
>                 goto Unlock;
>         }
>
> -#ifdef CONFIG_KEXEC_JUMP
> -       if (kexec_image->preserve_context) {
> -               pm_prepare_console();
> -               error = freeze_processes();
> -               if (error) {
> -                       error = -EBUSY;
> -                       goto Restore_console;
> -               }
> -               suspend_console();
> -               error = dpm_suspend_start(PMSG_FREEZE);
> -               if (error)
> -                       goto Resume_console;
> -               /* At this point, dpm_suspend_start() has been called,
> -                * but *not* dpm_suspend_end(). We *must* call
> -                * dpm_suspend_end() now.  Otherwise, drivers for
> -                * some devices (e.g. interrupt controllers) become
> -                * desynchronized with the actual state of the
> -                * hardware at resume time, and evil weirdness ensues.
> -                */
> -               error = dpm_suspend_end(PMSG_FREEZE);
> -               if (error)
> -                       goto Resume_devices;
> -               error = suspend_disable_secondary_cpus();
> -               if (error)
> -                       goto Enable_cpus;
> -               local_irq_disable();
> -               error = syscore_suspend();
> -               if (error)
> -                       goto Enable_irqs;
> -       } else
> -#endif
> -       {
> +       if (IS_ENABLED(CONFIG_KEXEC_JUMP) && kexec_image->preserve_context) {
> +               error = kexec_suspend();
> +       } else {
>                 kexec_in_progress = true;
>                 kernel_restart_prepare("kexec reboot");
>                 migrate_to_reboot_cpu();
> @@ -1045,30 +1022,10 @@ int kernel_kexec(void)
>                 cpu_hotplug_enable();
>                 pr_notice("Starting new kernel\n");
>                 machine_shutdown();
> +               kexec_machine_execute();
>         }
>
> -       kmsg_dump(KMSG_DUMP_SHUTDOWN);
> -       machine_kexec(kexec_image);
> -
> -#ifdef CONFIG_KEXEC_JUMP
> -       if (kexec_image->preserve_context) {
> -               syscore_resume();
> - Enable_irqs:
> -               local_irq_enable();
> - Enable_cpus:
> -               suspend_enable_secondary_cpus();
> -               dpm_resume_start(PMSG_RESTORE);
> - Resume_devices:
> -               dpm_resume_end(PMSG_RESTORE);
> - Resume_console:
> -               resume_console();
> -               thaw_processes();
> - Restore_console:
> -               pm_restore_console();
> -       }
> -#endif
> -
> - Unlock:
> +Unlock:
>         kexec_unlock();
>         return error;
>  }
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -400,7 +400,7 @@ void __weak arch_suspend_enable_irqs(voi
>   *
>   * This function should be called after devices have been suspended.
>   */
> -static int suspend_enter(suspend_state_t state, bool *wakeup)
> +static int suspend_enter(suspend_state_t state, bool *wakeup, bool kexec_jump)
>  {
>         int error;
>
> @@ -445,15 +445,19 @@ static int suspend_enter(suspend_state_t
>
>         error = syscore_suspend();
>         if (!error) {
> -               *wakeup = pm_wakeup_pending();
> -               if (!(suspend_test(TEST_CORE) || *wakeup)) {
> -                       trace_suspend_resume(TPS("machine_suspend"),
> -                               state, true);
> -                       error = suspend_ops->enter(state);
> -                       trace_suspend_resume(TPS("machine_suspend"),
> -                               state, false);
> -               } else if (*wakeup) {
> -                       error = -EBUSY;
> +               if (IS_ENABLED(CONFIG_KEXEC_JUMP) && kexec_jump) {
> +                       kexec_machine_execute();
> +               } else {
> +                       *wakeup = pm_wakeup_pending();
> +                       if (!(suspend_test(TEST_CORE) || *wakeup)) {
> +                               trace_suspend_resume(TPS("machine_suspend"),
> +                                                    state, true);
> +                               error = suspend_ops->enter(state);
> +                               trace_suspend_resume(TPS("machine_suspend"),
> +                                                    state, false);
> +                       } else if (*wakeup) {
> +                               error = -EBUSY;
> +                       }
>                 }
>                 syscore_resume();
>         }
> @@ -485,7 +489,7 @@ static int suspend_enter(suspend_state_t
>   * suspend_devices_and_enter - Suspend devices and enter system sleep state.
>   * @state: System sleep state to enter.
>   */
> -int suspend_devices_and_enter(suspend_state_t state)
> +static int __suspend_devices_and_enter(suspend_state_t state, bool kexec_jump)
>  {
>         int error;
>         bool wakeup = false;
> @@ -514,7 +518,7 @@ int suspend_devices_and_enter(suspend_st
>                 goto Recover_platform;
>
>         do {
> -               error = suspend_enter(state, &wakeup);
> +               error = suspend_enter(state, &wakeup, kexec_jump);
>         } while (!error && !wakeup && platform_suspend_again(state));
>
>   Resume_devices:
> @@ -536,6 +540,15 @@ int suspend_devices_and_enter(suspend_st
>  }
>
>  /**
> + * suspend_devices_and_enter - Suspend devices and enter system sleep state.
> + * @state: System sleep state to enter.
> + */
> +int suspend_devices_and_enter(suspend_state_t state)
> +{
> +       return __suspend_devices_and_enter(state, false);
> +}
> +
> +/**
>   * suspend_finish - Clean up before finishing the suspend sequence.
>   *
>   * Call platform code to clean up, restart processes, and free the console that
> @@ -607,6 +620,21 @@ static int enter_state(suspend_state_t s
>         return error;
>  }
>
> +#ifdef CONFIG_KEXEC_JUMP
> +int kexec_suspend(void)
> +{
> +       int error;
> +
> +       ksys_sync_helper();
> +       error = freeze_processes();
> +       if (error)
> +               return error;
> +       error = __suspend_devices_and_enter(PM_SUSPEND_MEM, true);

Why do you want to hook this up to the suspend code?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 17:32                                   ` Rafael J. Wysocki
@ 2024-12-13 19:06                                     ` Rafael J. Wysocki
  2024-12-13 20:16                                       ` David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: Rafael J. Wysocki @ 2024-12-13 19:06 UTC (permalink / raw)
  To: Thomas Gleixner, David Woodhouse
  Cc: Ming Lei, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe, Rafael J. Wysocki

On Fri, Dec 13, 2024 at 6:32 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Dec 13, 2024 at 6:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Fri, Dec 13 2024 at 14:07, David Woodhouse wrote:
> > > On Fri, 2024-12-13 at 14:23 +0100, Thomas Gleixner wrote:
> > >> That's only true for the case where the new kernel takes over.
> > >>
> > >> In the case KEXEC_JUMP=n and kexec_image->preserve_context == true, then
> > >> it is supposed to align with suspend/resume and if you look at the code
> > >> then it actually mimics suspend/resume in the most dilettanteish way.
> > >
> > > Did you mean KEXEC_JUMP=y there?
> >
> > Yes, of course.
> >
> > > I spent a while the other week trying to understand the case where
> > > CONFIG_KEXEC_JUMP=n and kexec_image->preserve_context=true, and came to
> > > the conclusion that it was a mirage. Userspace can't *actually* set the
> > > KEXEC_PRESERVE_CONTEXT bit when setting up the image, if KEXEC_JUMP=n.
> > >
> > > The whole of the code path for that case is dead code. It's confusing
> > > because as discussed elsewhere, we don't just #ifdef out the whole of
> > > that dead code path, but only the bits which don't actually *compile*
> > > (like references to restore_processor_state() etc.).
> >
> > Yes, I had to stare at it quite a while. :)
> >
> > >> It's a patently bad idea to clobber the kernel with kexec jump "fixes"
> > >> instead of using the well tested and established suspend/resume
> > >> machinery.
> > >>
> > >> All it takes is to:
> > >>
> > >>     1) disable the wakeup logic
> > >>
> > >>     2) provide a mechanism to invoke machine_kexec() instead of the
> > >>        actual suspend mechanism.
> > >>
> > >> No?
> > >
> > > Agreed. The hacky proof of concept I posted earlier invoking
> > > machine_kexec() instead of suspend_ops->enter() works fine. I'll look
> > > at cleaning it up and making it not invoke all the ACPI hooks for
> > > *actual* suspend to RAM, etc.
> >
> > Something like the below? It survived an hour of loop testing.
>
> I think that this KEXEC_JUMP thing can be dropped entirely and forgotten.
>
> I'm not aware of anyone actually using it.

And now I've been made aware that it's used.  Oh well.

As discussed with Dave over IRC, the current implementation isn't
actually that bad.  It might use PMSG_THAW instead of PMSG_RESTORE in
kernel_kexec(), but other than this it reflects the code flow around
the jump from the restore kernel to the image one during resume from
hibernation.

This means that hibernation and kexec jump could be unified somewhat.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 19:06                                     ` Rafael J. Wysocki
@ 2024-12-13 20:16                                       ` David Woodhouse
  2024-12-14  9:57                                         ` David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-13 20:16 UTC (permalink / raw)
  To: Rafael J. Wysocki, Thomas Gleixner
  Cc: Ming Lei, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe

On 13 December 2024 19:06:52 GMT, "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>On Fri, Dec 13, 2024 at 6:32 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> On Fri, Dec 13, 2024 at 6:05 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> >
>> > On Fri, Dec 13 2024 at 14:07, David Woodhouse wrote:
>> > > On Fri, 2024-12-13 at 14:23 +0100, Thomas Gleixner wrote:
>> > >> That's only true for the case where the new kernel takes over.
>> > >>
>> > >> In the case KEXEC_JUMP=n and kexec_image->preserve_context == true, then
>> > >> it is supposed to align with suspend/resume and if you look at the code
>> > >> then it actually mimics suspend/resume in the most dilettanteish way.
>> > >
>> > > Did you mean KEXEC_JUMP=y there?
>> >
>> > Yes, of course.
>> >
>> > > I spent a while the other week trying to understand the case where
>> > > CONFIG_KEXEC_JUMP=n and kexec_image->preserve_context=true, and came to
>> > > the conclusion that it was a mirage. Userspace can't *actually* set the
>> > > KEXEC_PRESERVE_CONTEXT bit when setting up the image, if KEXEC_JUMP=n.
>> > >
>> > > The whole of the code path for that case is dead code. It's confusing
>> > > because as discussed elsewhere, we don't just #ifdef out the whole of
>> > > that dead code path, but only the bits which don't actually *compile*
>> > > (like references to restore_processor_state() etc.).
>> >
>> > Yes, I had to stare at it quite a while. :)
>> >
>> > >> It's a patently bad idea to clobber the kernel with kexec jump "fixes"
>> > >> instead of using the well tested and established suspend/resume
>> > >> machinery.
>> > >>
>> > >> All it takes is to:
>> > >>
>> > >>     1) disable the wakeup logic
>> > >>
>> > >>     2) provide a mechanism to invoke machine_kexec() instead of the
>> > >>        actual suspend mechanism.
>> > >>
>> > >> No?
>> > >
>> > > Agreed. The hacky proof of concept I posted earlier invoking
>> > > machine_kexec() instead of suspend_ops->enter() works fine. I'll look
>> > > at cleaning it up and making it not invoke all the ACPI hooks for
>> > > *actual* suspend to RAM, etc.
>> >
>> > Something like the below? It survived an hour of loop testing.
>>
>> I think that this KEXEC_JUMP thing can be dropped entirely and forgotten.
>>
>> I'm not aware of anyone actually using it.
>
>And now I've been made aware that it's used.  Oh well.
>
>As discussed with Dave over IRC, the current implementation isn't
>actually that bad.  It might use PMSG_THAW instead of PMSG_RESTORE in
>kernel_kexec(), but other than this it reflects the code flow around
>the jump from the restore kernel to the image one during resume from
>hibernation.
>
>This means that hibernation and kexec jump could be unified somewhat.

Fair enough. I'm happy to do whatever cleanups or consolidation make sense, if we have a consensus.

There remains the question of why the blk-mq thing explodes on the way down for both kjump and, apparently, even the plain kexec case.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-13 20:16                                       ` David Woodhouse
@ 2024-12-14  9:57                                         ` David Woodhouse
  2024-12-16 12:14                                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 34+ messages in thread
From: David Woodhouse @ 2024-12-14  9:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Thomas Gleixner
  Cc: Ming Lei, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, Petr Mladek, John Ogness, Peter Zijlstra,
	Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 2769 bytes --]

On Fri, 2024-12-13 at 20:16 +0000, David Woodhouse wrote:
> 
> > As discussed with Dave over IRC, the current implementation isn't
> > actually that bad.  It might use PMSG_THAW instead of PMSG_RESTORE in
> > kernel_kexec(), but other than this it reflects the code flow around
> > the jump from the restore kernel to the image one during resume from
> > hibernation.
> > 
> > This means that hibernation and kexec jump could be unified somewhat.
> 
> Fair enough. I'm happy to do whatever cleanups or consolidation make
> sense, if we have a consensus.
> 
> There remains the question of why the blk-mq thing explodes on the
> way down for both kjump and, apparently, even the plain kexec case.

In case it's of any use, here's a test case I put together recently for
kexec stress testing.

 http://david.woodhou.se/kexec.initramfs

It's just an initrd I boot in qemu with '-initrd kexec.initramfs' and
it builds a copy of itself, then kexecs back into the same kernel with
the same initrd. You'll need to drop your own bzImage into it.

It was designed to run without a block device, but to trigger the
blk-mq thing or the one at
https://lore.kernel.org/linux-scsi/F991D40F7D096653+20241203211857.0291ab1b@john-PC/
we'd probably need to actually mount something and maybe do some disk
I/O.

(Which means the fact that the initrd can rebuild itself with cpio is
no longer quite so useful, as it could have just loaded the initrd for
the next kernel from the file system. But I already did that part, so
whatever.)


 $ cat init
#!/bin/sh

find . > files.txt
mount -tproc none /proc

cat /proc/sys/kernel/watchdog_thresh
echo 20 >  /proc/sys/kernel/watchdog_thresh

mount -tramfs none /tmp
NEXTCOUNT=$(($LOOPCOUNT+1))
CMDLINE=$(cat /proc/cmdline)
NEWCMDLINE="${CMDLINE/LOOPCOUNT=$LOOPCOUNT/} LOOPCOUNT=$NEXTCOUNT"

echo KEXEC LOOP $LOOPCOUNT
echo $NEWCMDLINE

cpio --quiet -H newc -o < files.txt | gzip -n9 > /tmp/initramfs.gz
kexec -f /bzImage --initrd /tmp/initramfs.gz --append "$NEWCMDLINE"


 $ cp ~/git/linux/arch/x86/boot/bzImage .
 $ find . | cpio --quiet -H newc -o  | gzip -n9 > ../kexec.initramfs
 $ ls -d `find .`
.              ./bin/mount              ./lib64/ld-linux-x86-64.so.2
./bin          ./bin/sh                 ./lib64/libc.so.6
./bin/busybox  ./bin/sleep              ./lib64/liblzma.so.5
./bin/cat      ./dev                    ./lib64/liblzma.so.5.4.4
./bin/cpio     ./dev/console            ./lib64/libz.so.1
./bin/echo     ./dev/null               ./lib64/libz.so.1.2.13
./bin/find     ./dev/ttyS5              ./loadret
./bin/gzip     ./init                   ./loadret.c
./bin/kexec    ./init.preserve-context  ./proc
./bin/mknod    ./lib64                  ./tmp


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
  2024-12-14  9:57                                         ` David Woodhouse
@ 2024-12-16 12:14                                           ` Rafael J. Wysocki
  0 siblings, 0 replies; 34+ messages in thread
From: Rafael J. Wysocki @ 2024-12-16 12:14 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Rafael J. Wysocki, Thomas Gleixner, Ming Lei, Stefan Hajnoczi,
	Jason Wang, x86@kernel.org, hpa, dyoung, kexec, linux-ext4,
	Michael S. Tsirkin, Stefano Garzarella, eperezma, Paolo Bonzini,
	Petr Mladek, John Ogness, Peter Zijlstra, Jens Axboe

On Sat, Dec 14, 2024 at 10:57 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Fri, 2024-12-13 at 20:16 +0000, David Woodhouse wrote:
> >
> > > As discussed with Dave over IRC, the current implementation isn't
> > > actually that bad.  It might use PMSG_THAW instead of PMSG_RESTORE in
> > > kernel_kexec(), but other than this it reflects the code flow around
> > > the jump from the restore kernel to the image one during resume from
> > > hibernation.
> > >
> > > This means that hibernation and kexec jump could be unified somewhat.
> >
> > Fair enough. I'm happy to do whatever cleanups or consolidation make
> > sense, if we have a consensus.
> >
> > There remains the question of why the blk-mq thing explodes on the
> > way down for both kjump and, apparently, even the plain kexec case.
>
> In case it's of any use, here's a test case I put together recently for
> kexec stress testing.
>
>  http://david.woodhou.se/kexec.initramfs
>
> It's just an initrd I boot in qemu with '-initrd kexec.initramfs' and
> it builds a copy of itself, then kexecs back into the same kernel with
> the same initrd. You'll need to drop your own bzImage into it.
>
> It was designed to run without a block device, but to trigger the
> blk-mq thing or the one at
> https://lore.kernel.org/linux-scsi/F991D40F7D096653+20241203211857.0291ab1b@john-PC/

Which BTW includes some quite convincing analysis of what's going on.

> we'd probably need to actually mount something and maybe do some disk
> I/O.

If I'm not mistaken, this should be also reproducible by echoing
"core" into /sys/power/pm_test and running "echo disk >
/sys/power/state" in a loop after that, unless the ksys_sync()
triggered by hibernate() makes it extremely unlikely.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] sched: Prevent rescheduling when interrupts are disabled
  2024-12-13 13:13                   ` Thomas Gleixner
@ 2024-12-16 13:20                     ` Thomas Gleixner
  2024-12-16 17:41                       ` David Woodhouse
  0 siblings, 1 reply; 34+ messages in thread
From: Thomas Gleixner @ 2024-12-16 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Woodhouse, Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa,
	dyoung, kexec, linux-ext4, Michael S. Tsirkin, Stefano Garzarella,
	eperezma, Paolo Bonzini, ming.lei, Petr Mladek, John Ogness

David reported a warning observed while loop testing kexec jump:

  Interrupts enabled after irqrouter_resume+0x0/0x50
  WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
   kernel_kexec+0xf6/0x180
   __do_sys_reboot+0x206/0x250
   do_syscall_64+0x95/0x180

The corresponding interrupt flag trace:

  hardirqs last  enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
  hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90

That means __up_console_sem() was invoked with interrupts enabled. Further
instrumentation revealed that in the interrupt disabled section of kexec
jump one of the syscore_suspend() callbacks woke up a task, which set the
NEED_RESCHED flag. A later callback in the resume path invoked
cond_resched() which in turn led to the invocation of the scheduler:

  __cond_resched+0x21/0x60
  down_timeout+0x18/0x60
  acpi_os_wait_semaphore+0x4c/0x80
  acpi_ut_acquire_mutex+0x3d/0x100
  acpi_ns_get_node+0x27/0x60
  acpi_ns_evaluate+0x1cb/0x2d0
  acpi_rs_set_srs_method_data+0x156/0x190
  acpi_pci_link_set+0x11c/0x290
  irqrouter_resume+0x54/0x60
  syscore_resume+0x6a/0x200
  kernel_kexec+0x145/0x1c0
  __do_sys_reboot+0xeb/0x240
  do_syscall_64+0x95/0x180

This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a task wakeup and the
scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and
invokes schedule() from a completely bogus context. The scheduler
enables interrupts after context switching, which causes the above
warning at the end.

Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operations
it's just a question of time.

The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling
models. Full preemption is not affected as cond_resched() is disabled and
the preemption check preemptible() takes the interrupt disabled flag into
account.

Cure the problem by adding a corresponding check into cond_resched().

Reported-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: David Woodhouse <dwmw2@infradead.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
---
 kernel/sched/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
 #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
 int __sched __cond_resched(void)
 {
-	if (should_resched(0)) {
+	if (should_resched(0) && !irqs_disabled()) {
 		preempt_schedule_common();
 		return 1;
 	}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] sched: Prevent rescheduling when interrupts are disabled
  2024-12-16 13:20                     ` [PATCH] sched: Prevent rescheduling when interrupts are disabled Thomas Gleixner
@ 2024-12-16 17:41                       ` David Woodhouse
  0 siblings, 0 replies; 34+ messages in thread
From: David Woodhouse @ 2024-12-16 17:41 UTC (permalink / raw)
  To: Thomas Gleixner, Peter Zijlstra
  Cc: Stefan Hajnoczi, Jason Wang, x86@kernel.org, hpa, dyoung, kexec,
	linux-ext4, Michael S. Tsirkin, Stefano Garzarella, eperezma,
	Paolo Bonzini, ming.lei, Petr Mladek, John Ogness

On 16 December 2024 13:20:56 GMT, Thomas Gleixner <tglx@linutronix.de> wrote:
>David reported a warning observed while loop testing kexec jump:
>
>  Interrupts enabled after irqrouter_resume+0x0/0x50
>  WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220
>   kernel_kexec+0xf6/0x180
>   __do_sys_reboot+0x206/0x250
>   do_syscall_64+0x95/0x180
>
>The corresponding interrupt flag trace:
>
>  hardirqs last  enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90
>  hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90
>
>That means __up_console_sem() was invoked with interrupts enabled. Further
>instrumentation revealed that in the interrupt disabled section of kexec
>jump one of the syscore_suspend() callbacks woke up a task, which set the
>NEED_RESCHED flag. A later callback in the resume path invoked
>cond_resched() which in turn led to the invocation of the scheduler:
>
>  __cond_resched+0x21/0x60
>  down_timeout+0x18/0x60
>  acpi_os_wait_semaphore+0x4c/0x80
>  acpi_ut_acquire_mutex+0x3d/0x100
>  acpi_ns_get_node+0x27/0x60
>  acpi_ns_evaluate+0x1cb/0x2d0
>  acpi_rs_set_srs_method_data+0x156/0x190
>  acpi_pci_link_set+0x11c/0x290
>  irqrouter_resume+0x54/0x60
>  syscore_resume+0x6a/0x200
>  kernel_kexec+0x145/0x1c0
>  __do_sys_reboot+0xeb/0x240
>  do_syscall_64+0x95/0x180
>
>This is a long standing problem, which probably got more visible with
>the recent printk changes. Something does a task wakeup and the
>scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and
>invokes schedule() from a completely bogus context. The scheduler
>enables interrupts after context switching, which causes the above
>warning at the end.
>
>Quite some of the code paths in syscore_suspend()/resume() can result in
>triggering a wakeup with the exactly same consequences. They might not
>have done so yet, but as they share a lot of code with normal operations
>it's just a question of time.
>
>The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling
>models. Full preemption is not affected as cond_resched() is disabled and
>the preemption check preemptible() takes the interrupt disabled flag into
>account.
>
>Cure the problem by adding a corresponding check into cond_resched().
>
>Reported-by: David Woodhouse <dwmw2@infradead.org>
>Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>Tested-by: David Woodhouse <dwmw2@infradead.org>
>Cc: stable@vger.kernel.org
>Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org
>---
> kernel/sched/core.c |    2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>--- a/kernel/sched/core.c
>+++ b/kernel/sched/core.c
>@@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
> #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
> int __sched __cond_resched(void)
> {
>-	if (should_resched(0)) {
>+	if (should_resched(0) && !irqs_disabled()) {
> 		preempt_schedule_common();
> 		return 1;
> 	}

Thank you. Slight preference for dwmw@amazon.co.uk as the Reported-by and Tested-by addresses if it's not too late. I'm assuming you will handle this and don't want me to round it up with the kexec bits.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2024-12-16 17:41 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-09 14:28 Lockdep warnings on kexec (virtio_blk, hrtimers) David Woodhouse
2024-12-10  1:56 ` Jason Wang
2024-12-11 12:42   ` Stefan Hajnoczi
2024-12-12 11:07     ` David Woodhouse
2024-12-12 13:34       ` Thomas Gleixner
2024-12-12 13:46         ` David Woodhouse
2024-12-12 18:04           ` Thomas Gleixner
2024-12-12 19:19             ` David Woodhouse
2024-12-13  0:14               ` Thomas Gleixner
2024-12-13  9:31                 ` David Woodhouse
2024-12-13  9:43                   ` David Woodhouse
2024-12-13 10:42                     ` Thomas Gleixner
2024-12-13 11:09                       ` Ming Lei
2024-12-13 11:31                         ` Thomas Gleixner
2024-12-13 11:48                           ` Ming Lei
2024-12-13 13:23                             ` Thomas Gleixner
2024-12-13 14:07                               ` David Woodhouse
2024-12-13 17:05                                 ` Thomas Gleixner
2024-12-13 17:17                                   ` David Woodhouse
2024-12-13 17:48                                     ` Rafael J. Wysocki
2024-12-13 17:32                                   ` Rafael J. Wysocki
2024-12-13 19:06                                     ` Rafael J. Wysocki
2024-12-13 20:16                                       ` David Woodhouse
2024-12-14  9:57                                         ` David Woodhouse
2024-12-16 12:14                                           ` Rafael J. Wysocki
2024-12-13 17:59                                   ` Rafael J. Wysocki
2024-12-13 13:17                           ` David Woodhouse
2024-12-13 11:12                       ` David Woodhouse
2024-12-13 11:33                         ` Ming Lei
2024-12-13 11:20                 ` Peter Zijlstra
2024-12-13 13:13                   ` Thomas Gleixner
2024-12-16 13:20                     ` [PATCH] sched: Prevent rescheduling when interrupts are disabled Thomas Gleixner
2024-12-16 17:41                       ` David Woodhouse
2024-12-12 11:12     ` Lockdep warnings on kexec (virtio_blk, hrtimers) Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).