From: Maxim Levitsky <mlevitsk@redhat.com>
To: kvm@vger.kernel.org
Cc: Sean Christopherson <seanjc@google.com>,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
Date: Tue, 18 Mar 2025 22:19:25 -0400
Message-ID: <e61d23ddc87cb45063637e0e5375cc4e09db18cd.camel@redhat.com>

Hi!
I recently ran into an interesting failure in our CI pipeline:
[ 592.704446] WARNING: possible circular locking dependency detected
[ 592.710625] 6.12.0-36.el10.x86_64+debug #1 Not tainted
[ 592.715764] ------------------------------------------------------
[ 592.721946] swapper/19/0 is trying to acquire lock:
[ 592.726823] ff110001b0e64ec0 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xa7/0x15c0
[ 592.734761]
[ 592.734761] but task is already holding lock:
[ 592.740596] ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel]
[ 592.752185]
[ 592.752185] which lock already depends on the new lock.
[ 592.752185]
[ 592.760357]
[ 592.760357] the existing dependency chain (in reverse order) is:
[ 592.767837]
[ 592.767837] -> #2 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}:
[ 592.776027] __lock_acquire+0xcac/0x1d20
[ 592.780489] lock_acquire.part.0+0x11b/0x360
[ 592.785300] _raw_spin_lock+0x37/0x80
[ 592.789484] pi_enable_wakeup_handler+0x11a/0x5a0 [kvm_intel]
[ 592.795776] vmx_vcpu_pi_put+0x1d3/0x230 [kvm_intel]
[ 592.801282] vmx_vcpu_put+0x12/0x20 [kvm_intel]
[ 592.806360] kvm_arch_vcpu_put+0x494/0x7a0 [kvm]
[ 592.811592] kvm_sched_out+0x161/0x1c0 [kvm]
[ 592.816456] prepare_task_switch+0x36c/0xe20
[ 592.821247] __schedule+0x481/0x1a50
[ 592.825346] schedule+0xd4/0x280
[ 592.829100] kvm_vcpu_block+0xe4/0x1d0 [kvm]
[ 592.833964] kvm_vcpu_halt+0x1a2/0x800 [kvm]
[ 592.838824] vcpu_run+0x53f/0x9e0 [kvm]
[ 592.843263] kvm_arch_vcpu_ioctl_run+0x371/0x1480 [kvm]
[ 592.849078] kvm_vcpu_ioctl+0x45e/0xc70 [kvm]
[ 592.854034] __x64_sys_ioctl+0x12e/0x1a0
[ 592.858481] do_syscall_64+0x92/0x180
[ 592.862685] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 592.868276]
[ 592.868276] -> #1 (&rq->__lock){-.-.}-{2:2}:
[ 592.874038] __lock_acquire+0xcac/0x1d20
[ 592.878485] lock_acquire.part.0+0x11b/0x360
[ 592.883277] _raw_spin_lock_nested+0x3b/0x70
[ 592.888069] raw_spin_rq_lock_nested+0x2e/0x130
[ 592.893121] __task_rq_lock+0xdb/0x570
[ 592.897393] wake_up_new_task+0x750/0x1050
[ 592.902014] kernel_clone+0x13e/0x5f0
[ 592.906209] user_mode_thread+0xa4/0xe0
[ 592.910568] rest_init+0x1e/0x1c0
[ 592.914407] start_kernel+0x3ab/0x3b0
[ 592.918592] x86_64_start_reservations+0x24/0x30
[ 592.923732] x86_64_start_kernel+0x9c/0xa0
[ 592.928353] common_startup_64+0x13e/0x141
[ 592.932988]
[ 592.932988] -> #0 (&p->pi_lock){-.-.}-{2:2}:
[ 592.938754] check_prev_add+0x1b7/0x23e0
[ 592.943200] validate_chain+0xa8a/0xf00
[ 592.947557] __lock_acquire+0xcac/0x1d20
[ 592.952005] lock_acquire.part.0+0x11b/0x360
[ 592.956797] _raw_spin_lock_irqsave+0x46/0x90
[ 592.961676] try_to_wake_up+0xa7/0x15c0
[ 592.966035] rcuwait_wake_up+0x80/0x190
[ 592.970396] kvm_vcpu_wake_up+0x19/0xa0 [kvm]
[ 592.975353] pi_wakeup_handler+0xf4/0x130 [kvm_intel]
[ 592.980950] sysvec_kvm_posted_intr_wakeup_ipi+0x9c/0xd0
[ 592.986783] asm_sysvec_kvm_posted_intr_wakeup_ipi+0x1a/0x20
[ 592.992964] finish_task_switch.isra.0+0x228/0x830
[ 592.998275] __schedule+0x7c7/0x1a50
[ 593.002376] schedule_idle+0x59/0x90
[ 593.006477] do_idle+0x155/0x200
[ 593.010244] cpu_startup_entry+0x54/0x60
[ 593.014693] start_secondary+0x212/0x290
[ 593.019138] common_startup_64+0x13e/0x141
[ 593.023758]
[ 593.023758] other info that might help us debug this:
[ 593.023758]
[ 593.031757] Chain exists of:
[ 593.031757] &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
[ 593.031757]
[ 593.043932] Possible unsafe locking scenario:
[ 593.043932]
[ 593.049852]        CPU0                    CPU1
[ 593.054384]        ----                    ----
[ 593.058918]   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
[ 593.064576]                               lock(&rq->__lock);
[ 593.070323]                               lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
[ 593.078496]   lock(&p->pi_lock);
[ 593.081745]
[ 593.081745] *** DEADLOCK ***
[ 593.081745]
[ 593.087666] 2 locks held by swapper/19/0:
[ 593.091679] #0: ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel]
[ 593.103692] #1: ffffffffa9b3eec0 (rcu_read_lock){....}-{1:2}, at: rcuwait_wake_up+0x22/0x190
[ 593.112227]
[ 593.112227] stack backtrace:
[ 593.116590] CPU: 19 UID: 0 PID: 0 Comm: swapper/19 Kdump: loaded Not tainted 6.12.0-36.el10.x86_64+debug #1
[ 593.126338] Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024
[ 593.133818] Call Trace:
[ 593.136271] <IRQ>
[ 593.138291] dump_stack_lvl+0x6f/0xb0
[ 593.141965] print_circular_bug.cold+0x38/0x48
[ 593.146427] check_noncircular+0x308/0x3f0
[ 593.150545] ? __pfx_check_noncircular+0x10/0x10
[ 593.155180] ? __module_address+0x95/0x240
[ 593.159300] ? alloc_chain_hlocks+0x33b/0x520
[ 593.163675] check_prev_add+0x1b7/0x23e0
[ 593.167620] validate_chain+0xa8a/0xf00
[ 593.171478] ? __pfx_validate_chain+0x10/0x10
[ 593.175850] ? mark_lock+0x78/0x860
[ 593.179360] ? __pfx_lockdep_lock+0x10/0x10
[ 593.183549] __lock_acquire+0xcac/0x1d20
[ 593.187494] ? __pfx_validate_chain+0x10/0x10
[ 593.191867] lock_acquire.part.0+0x11b/0x360
[ 593.196156] ? try_to_wake_up+0xa7/0x15c0
[ 593.200172] ? __pfx_lock_acquire.part.0+0x10/0x10
[ 593.204980] ? rcu_is_watching+0x15/0xb0
[ 593.208907] ? trace_lock_acquire+0x1b9/0x280
[ 593.213265] ? try_to_wake_up+0xa7/0x15c0
[ 593.217294] ? lock_acquire+0x31/0xc0
[ 593.220961] ? try_to_wake_up+0xa7/0x15c0
[ 593.224975] _raw_spin_lock_irqsave+0x46/0x90
[ 593.229351] ? try_to_wake_up+0xa7/0x15c0
[ 593.233362] try_to_wake_up+0xa7/0x15c0
[ 593.237220] ? trace_lock_acquire+0x1b9/0x280
[ 593.241581] ? __pfx_try_to_wake_up+0x10/0x10
[ 593.245958] rcuwait_wake_up+0x80/0x190
[ 593.249816] kvm_vcpu_wake_up+0x19/0xa0 [kvm]
[ 593.254260] pi_wakeup_handler+0xf4/0x130 [kvm_intel]
[ 593.259338] sysvec_kvm_posted_intr_wakeup_ipi+0x9c/0xd0
[ 593.264668] </IRQ>
[ 593.266772] <TASK>
[ 593.268880] asm_sysvec_kvm_posted_intr_wakeup_ipi+0x1a/0x20
[ 593.274557] RIP: 0010:finish_task_switch.isra.0+0x228/0x830
[ 593.280147] Code: 41 ff 4d 00 0f 84 95 03 00 00 81 7d d0 80 00 00 00 0f 84 bd 03 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc <49> 8d be 00 05 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1
[ 593.298894] RSP: 0018:ffa0000000747cd8 EFLAGS: 00000206
[ 593.304137] RAX: 000000000016e47d RBX: ff1100079ec08e40 RCX: 0000000000000000
[ 593.311269] RDX: 0000000000000000 RSI: ffffffffa82f4140 RDI: ffffffffa85a2fa0
[ 593.318403] RBP: ffa0000000747d18 R08: 0000000000000001 R09: 0000000000000000
[ 593.325534] R10: 0000000000000001 R11: 0000000000000000 R12: ff110001b0e63f80
[ 593.332667] R13: 0000000000000000 R14: ff11000101dabf80 R15: ff1100079ec09e60
[ 593.339803] ? finish_task_switch.isra.0+0x1b9/0x830
[ 593.344785] ? __switch_to+0xc31/0xfa0
[ 593.348537] ? __switch_to_asm+0x3d/0x70
[ 593.352483] __schedule+0x7c7/0x1a50
[ 593.356077] ? __pfx___schedule+0x10/0x10
[ 593.360107] ? __pfx_sched_ttwu_pending+0x10/0x10
[ 593.364829] ? rcu_is_watching+0x15/0xb0
[ 593.368777] schedule_idle+0x59/0x90
[ 593.372370] do_idle+0x155/0x200
[ 593.375605] cpu_startup_entry+0x54/0x60
[ 593.379547] start_secondary+0x212/0x290
[ 593.383483] ? __pfx_start_secondary+0x10/0x10
[ 593.387945] ? startup_64_load_idt+0xc1/0xf0
[ 593.392235] common_startup_64+0x13e/0x141
[ 593.396355] </TASK>
I reproduced the failure upstream as well. In fact, it reproduces reliably on the Sapphire Rapids system I use for testing
(Intel(R) Xeon(R) Silver 4410Y).
To reproduce it, I just need to run the KVM selftests on a lockdep-enabled kernel.
If I am not mistaken, the following happens:

1. pi_enable_wakeup_handler() is called when the current vCPU is scheduled out.
It will then:
- *disable interrupts*
- take the per-CPU lock 'wakeup_vcpus_on_cpu_lock' of the *current CPU*
- add the current vCPU to the per-CPU list which is protected by the lock above
- re-enable interrupts
2. pi_wakeup_handler() (the IRQ handler) will also:
- take the wakeup_vcpus_on_cpu_lock of the *current CPU*
- go over the vCPUs in the wakeup_vcpus_on_cpu list of the *current CPU* and kick them
AFAIK, (1) and (2) can't race with each other, and in fact no locking is needed here to prevent
such a race.
The lock is required to prevent a race with another place which needs access to the 'wakeup_vcpus_on_cpu' list:
3. When a sleeping vCPU is woken up (e.g. due to the kick above or for another reason) and scheduled to run,
possibly on a different CPU, *vmx_vcpu_pi_load* will:
- disable interrupts (to avoid a race with (1))
- take the wakeup_vcpus_on_cpu_lock (possibly of a different CPU)
- remove this vCPU from that CPU's wakeup_vcpus_on_cpu list
- release the lock and re-enable interrupts
As far as I can see, there is no actual race, but lockdep doesn't understand this.
It thinks that:
1. pi_enable_wakeup_handler() is called from schedule(), which holds rq->lock, and it itself takes the wakeup_vcpus_on_cpu_lock lock.
2. pi_wakeup_handler() takes the wakeup_vcpus_on_cpu_lock and then calls try_to_wake_up(), which can eventually take rq->lock
(there is a list of the cases in which it takes it at the start of that function).
I don't know lockdep well yet, but maybe a lockdep annotation would help,
if we can tell it that there are multiple 'wakeup_vcpus_on_cpu_lock' locks (one per CPU).
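To make the idea concrete, something along these lines is what I have in mind. This is untested kernel-side pseudocode written from memory of the lockdep API, not a patch, and whether separate per-CPU lock classes would actually break the reported cycle is exactly the open question:

```c
/* Sketch only: give each CPU's wakeup lock its own lockdep class so
 * that lockdep stops folding all per-CPU instances into one class.
 * pi_init_cpu() and wakeup_vcpus_on_cpu_lock exist in posted_intr.c;
 * the per-CPU key is the hypothetical part. */
static DEFINE_PER_CPU(struct lock_class_key, wakeup_vcpus_on_cpu_lock_key);

void pi_init_cpu(int cpu)
{
	raw_spin_lock_init(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
	lockdep_set_class(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu),
			  &per_cpu(wakeup_vcpus_on_cpu_lock_key, cpu));
}
```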
What do you think?
Best regards,
Maxim Levitsky
Thread overview: 10+ messages
2025-03-19 2:19 Maxim Levitsky [this message]
2025-03-19 16:17 ` Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock Sean Christopherson
2025-03-19 22:49 ` Maxim Levitsky
2025-03-21 11:11 ` Yan Zhao
2025-03-21 11:49 ` Paolo Bonzini
2025-03-24 13:25 ` Maxim Levitsky
2025-03-27 12:06 ` Yan Zhao
2025-03-27 23:41 ` Sean Christopherson
2025-03-28 0:52 ` Yan Zhao
2025-03-28 23:02 ` Maxim Levitsky