* [PATCH 0/2] KVM: VMX: Fix lockdep false positive on PI wakeup
@ 2025-04-01 15:47 Sean Christopherson
2025-04-01 15:47 ` [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list Sean Christopherson
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Sean Christopherson @ 2025-04-01 15:47 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao
Yan's fix for the PI wakeup deadlock false positive, plus a prep patch to
make the dependency on IRQs being disabled during sched_out explicit.
Sean Christopherson (1):
KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup
list
Yan Zhao (1):
KVM: VMX: Use separate subclasses for PI wakeup lock to squash false
positive
arch/x86/kvm/vmx/posted_intr.c | 37 +++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)
base-commit: 782f9feaa9517caf33186dcdd6b50a8f770ed29b
--
2.49.0.472.ge94155a9ec-goog
* [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
  2025-04-01 15:47 [PATCH 0/2] KVM: VMX: Fix lockdep false positive on PI wakeup Sean Christopherson
@ 2025-04-01 15:47 ` Sean Christopherson
  2025-04-01 19:04   ` Maxim Levitsky
  2025-04-02  2:17   ` Yan Zhao
  2 siblings, 2 replies; 9+ messages in thread
From: Sean Christopherson @ 2025-04-01 15:47 UTC
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao

Assert that IRQs are already disabled when putting a vCPU on a CPU's PI
wakeup list, as opposed to saving/disabling+restoring IRQs.  KVM relies on
IRQs being disabled until the vCPU task is fully scheduled out, i.e. until
the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue),
as attempting to wake the task while it's being scheduled out could lead
to deadlock.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index ec08fa3caf43..840d435229a8 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -148,9 +148,8 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct pi_desc old, new;
-	unsigned long flags;
 
-	local_irq_save(flags);
+	lockdep_assert_irqs_disabled();
 
 	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 	list_add_tail(&vmx->pi_wakeup_list,
@@ -176,8 +175,6 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 	 */
 	if (pi_test_on(&new))
 		__apic_send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
-
-	local_irq_restore(flags);
 }
 
 static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
--
2.49.0.472.ge94155a9ec-goog
* Re: [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
  2025-04-01 15:47 ` [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list Sean Christopherson
@ 2025-04-01 19:04 ` Maxim Levitsky
  1 sibling, 0 replies; 9+ messages in thread
From: Maxim Levitsky @ 2025-04-01 19:04 UTC
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao

On Tue, 2025-04-01 at 08:47 -0700, Sean Christopherson wrote:
> Assert that IRQs are already disabled when putting a vCPU on a CPU's PI
> wakeup list, as opposed to saving/disabling+restoring IRQs.  KVM relies on
> IRQs being disabled until the vCPU task is fully scheduled out, i.e. until
> the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue),
> as attempting to wake the task while it's being scheduled out could lead
> to deadlock.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index ec08fa3caf43..840d435229a8 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -148,9 +148,8 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct pi_desc old, new;
> -	unsigned long flags;
>  
> -	local_irq_save(flags);
> +	lockdep_assert_irqs_disabled();
>  
>  	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
>  	list_add_tail(&vmx->pi_wakeup_list,
> @@ -176,8 +175,6 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  	 */
>  	if (pi_test_on(&new))
>  		__apic_send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
> -
> -	local_irq_restore(flags);
>  }
>  
>  static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
* Re: [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
  2025-04-01 15:47 ` [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list Sean Christopherson
  2025-04-01 19:04   ` Maxim Levitsky
@ 2025-04-02  2:17 ` Yan Zhao
  1 sibling, 0 replies; 9+ messages in thread
From: Yan Zhao @ 2025-04-02  2:17 UTC
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>

On Tue, Apr 01, 2025 at 08:47:26AM -0700, Sean Christopherson wrote:
> Assert that IRQs are already disabled when putting a vCPU on a CPU's PI
> wakeup list, as opposed to saving/disabling+restoring IRQs.  KVM relies on
> IRQs being disabled until the vCPU task is fully scheduled out, i.e. until
> the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue),
> as attempting to wake the task while it's being scheduled out could lead
> to deadlock.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index ec08fa3caf43..840d435229a8 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -148,9 +148,8 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct pi_desc old, new;
> -	unsigned long flags;
>  
> -	local_irq_save(flags);
> +	lockdep_assert_irqs_disabled();
>  
>  	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
>  	list_add_tail(&vmx->pi_wakeup_list,
> @@ -176,8 +175,6 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  	 */
>  	if (pi_test_on(&new))
>  		__apic_send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
> -
> -	local_irq_restore(flags);
>  }
>  
>  static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
> --
> 2.49.0.472.ge94155a9ec-goog
* [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
  2025-04-01 15:47 [PATCH 0/2] KVM: VMX: Fix lockdep false positive on PI wakeup Sean Christopherson
  2025-04-01 15:47 ` [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list Sean Christopherson
@ 2025-04-01 15:47 ` Sean Christopherson
  2025-04-01 19:49   ` Maxim Levitsky
  2025-04-02  2:54   ` Yan Zhao
  2 siblings, 2 replies; 9+ messages in thread
From: Sean Christopherson @ 2025-04-01 15:47 UTC
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao

From: Yan Zhao <yan.y.zhao@intel.com>

Use a separate subclass when acquiring KVM's per-CPU posted interrupts
wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list
of vCPUs to wake, to work around a false positive deadlock.

  Chain exists of:
    &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

  Possible unsafe locking scenario:

        CPU0                CPU1
        ----                ----
   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
                            lock(&rq->__lock);
                            lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
   lock(&p->pi_lock);

  *** DEADLOCK ***

In the wakeup handler, the callchain is *always*:

  sysvec_kvm_posted_intr_wakeup_ipi()
  |
  --> pi_wakeup_handler()
      |
      --> kvm_vcpu_wake_up()
          |
          --> try_to_wake_up(),

and the lock order is:

  &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.

For the schedule out path, the callchain is always (for all intents and
purposes; if the kernel is preemptible, kvm_sched_out() can be called from
something other than schedule(), but the beginning of the callchain will
be the same point in vcpu_block()):

  vcpu_block()
  |
  --> schedule()
      |
      --> kvm_sched_out()
          |
          --> vmx_vcpu_put()
              |
              --> vmx_vcpu_pi_put()
                  |
                  --> pi_enable_wakeup_handler()

and the lock order is:

  &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)

I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
deadlock can't occur between schedule out and the wakeup handler as they
are mutually exclusive.  The entirety of the schedule out code that runs
with the problematic scheduler locks held does so with IRQs disabled,
i.e. can't run concurrently with the wakeup handler.

Use a subclass instead of disabling lockdep entirely, and tell lockdep that
both subclasses are being acquired when loading a vCPU, as the sched_out
and sched_in paths are NOT mutually exclusive, e.g.

      CPU 0              CPU 1
  ---------------    ---------------
  vCPU0 sched_out
  vCPU1 sched_in
  vCPU1 sched_out    vCPU0 sched_in

where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
list+lock.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 840d435229a8..51116fe69a50 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
  */
 static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
 
+#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
+
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
 	return &(to_vmx(vcpu)->pi_desc);
@@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	 * current pCPU if the task was migrated.
 	 */
 	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
-		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+		raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
+
+		/*
+		 * In addition to taking the wakeup lock for the regular/IRQ
+		 * context, tell lockdep it is being taken for the "sched out"
+		 * context as well.  vCPU loads happen in task context, and
+		 * this is taking the lock of the *previous* CPU, i.e. can race
+		 * with both the scheduler and the wakeup handler.
+		 */
+		raw_spin_lock(spinlock);
+		spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
 		list_del(&vmx->pi_wakeup_list);
-		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+		spin_release(&spinlock->dep_map, _RET_IP_);
+		raw_spin_unlock(spinlock);
 	}
 
 	dest = cpu_physical_id(cpu);
@@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_irqs_disabled();
 
-	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+	/*
+	 * Acquire the wakeup lock using the "sched out" context to work around
+	 * a lockdep false positive.  When this is called, schedule() holds
+	 * various per-CPU scheduler locks.  When the wakeup handler runs, it
+	 * holds this CPU's wakeup lock while calling try_to_wake_up(), which
+	 * can eventually take the aforementioned scheduler locks, which causes
+	 * lockdep to assume there is deadlock.
+	 *
+	 * Deadlock can't actually occur because IRQs are disabled for the
+	 * entirety of the sched_out critical section, i.e. the wakeup handler
+	 * can't run while the scheduler locks are held.
+	 */
+	raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
+			     PI_LOCK_SCHED_OUT);
 	list_add_tail(&vmx->pi_wakeup_list,
 		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
 	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
--
2.49.0.472.ge94155a9ec-goog
* Re: [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
  2025-04-01 15:47 ` [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive Sean Christopherson
@ 2025-04-01 19:49 ` Maxim Levitsky
  1 sibling, 0 replies; 9+ messages in thread
From: Maxim Levitsky @ 2025-04-01 19:49 UTC
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Yan Zhao

On Tue, 2025-04-01 at 08:47 -0700, Sean Christopherson wrote:
> From: Yan Zhao <yan.y.zhao@intel.com>
>
> Use a separate subclass when acquiring KVM's per-CPU posted interrupts
> wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list
> of vCPUs to wake, to work around a false positive deadlock.
>
>   Chain exists of:
>     &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
>   Possible unsafe locking scenario:
>
>         CPU0                CPU1
>         ----                ----
>    lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>                             lock(&rq->__lock);
>                             lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
>    lock(&p->pi_lock);
>
>   *** DEADLOCK ***
>
> In the wakeup handler, the callchain is *always*:
>
>   sysvec_kvm_posted_intr_wakeup_ipi()
>   |
>   --> pi_wakeup_handler()
>       |
>       --> kvm_vcpu_wake_up()
>           |
>           --> try_to_wake_up(),
>
> and the lock order is:
>
>   &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock.
>
> For the schedule out path, the callchain is always (for all intents and
> purposes; if the kernel is preemptible, kvm_sched_out() can be called from
> something other than schedule(), but the beginning of the callchain will
> be the same point in vcpu_block()):
>
>   vcpu_block()
>   |
>   --> schedule()
>       |
>       --> kvm_sched_out()
>           |
>           --> vmx_vcpu_put()
>               |
>               --> vmx_vcpu_pi_put()
>                   |
>                   --> pi_enable_wakeup_handler()
>
> and the lock order is:
>
>   &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu)
>
> I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
> deadlock can't occur between schedule out and the wakeup handler as they
> are mutually exclusive.  The entirety of the schedule out code that runs
> with the problematic scheduler locks held does so with IRQs disabled,
> i.e. can't run concurrently with the wakeup handler.
>
> Use a subclass instead of disabling lockdep entirely, and tell lockdep that
> both subclasses are being acquired when loading a vCPU, as the sched_out
> and sched_in paths are NOT mutually exclusive, e.g.
>
>       CPU 0              CPU 1
>   ---------------    ---------------
>   vCPU0 sched_out
>   vCPU1 sched_in
>   vCPU1 sched_out    vCPU0 sched_in
>
> where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
> list+lock.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 840d435229a8..51116fe69a50 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
>   */
>  static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
>  
> +#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
> +
>  static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
>  {
>  	return &(to_vmx(vcpu)->pi_desc);
> @@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>  	 * current pCPU if the task was migrated.
>  	 */
>  	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> -		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +		raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
> +
> +		/*
> +		 * In addition to taking the wakeup lock for the regular/IRQ
> +		 * context, tell lockdep it is being taken for the "sched out"
> +		 * context as well.  vCPU loads happen in task context, and
> +		 * this is taking the lock of the *previous* CPU, i.e. can race
> +		 * with both the scheduler and the wakeup handler.
> +		 */
> +		raw_spin_lock(spinlock);
> +		spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
>  		list_del(&vmx->pi_wakeup_list);
> -		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +		spin_release(&spinlock->dep_map, _RET_IP_);
> +		raw_spin_unlock(spinlock);
>  	}
>  
>  	dest = cpu_physical_id(cpu);
> @@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  
>  	lockdep_assert_irqs_disabled();
>  
> -	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +	/*
> +	 * Acquire the wakeup lock using the "sched out" context to work around
> +	 * a lockdep false positive.  When this is called, schedule() holds
> +	 * various per-CPU scheduler locks.  When the wakeup handler runs, it
> +	 * holds this CPU's wakeup lock while calling try_to_wake_up(), which
> +	 * can eventually take the aforementioned scheduler locks, which causes
> +	 * lockdep to assume there is deadlock.
> +	 *
> +	 * Deadlock can't actually occur because IRQs are disabled for the
> +	 * entirety of the sched_out critical section, i.e. the wakeup handler
> +	 * can't run while the scheduler locks are held.
> +	 */
> +	raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
> +			     PI_LOCK_SCHED_OUT);
>  	list_add_tail(&vmx->pi_wakeup_list,
>  		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
>  	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
* Re: [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
  2025-04-01 15:47 ` [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive Sean Christopherson
  2025-04-01 19:49   ` Maxim Levitsky
@ 2025-04-02  2:54 ` Yan Zhao
  2025-04-02 20:06   ` Sean Christopherson
  1 sibling, 1 reply; 9+ messages in thread
From: Yan Zhao @ 2025-04-02  2:54 UTC
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Apr 01, 2025 at 08:47:27AM -0700, Sean Christopherson wrote:
> I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
> deadlock can't occur between schedule out and the wakeup handler as they
> are mutually exclusive.  The entirety of the schedule out code that runs
> with the problematic scheduler locks held does so with IRQs disabled,
> i.e. can't run concurrently with the wakeup handler.
>
> Use a subclass instead of disabling lockdep entirely, and tell lockdep that

Paolo initially recommended utilizing the subclass.
Do you think it's good to add his suggested-by tag?

BTW: is it necessary to state the subclass assignment explicitly in the
patch msg? e.g.,

wakeup handler: subclass 0
sched_out: subclass 1
sched_in: subclasses 0 and 1

Aside from the minor nits, LGTM!
Thanks for polishing the patch and helping with the msg/comments :)

> both subclasses are being acquired when loading a vCPU, as the sched_out
> and sched_in paths are NOT mutually exclusive, e.g.
>
>       CPU 0              CPU 1
>   ---------------    ---------------
>   vCPU0 sched_out
>   vCPU1 sched_in
>   vCPU1 sched_out    vCPU0 sched_in
>
> where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup
> list+lock.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/posted_intr.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index 840d435229a8..51116fe69a50 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -31,6 +31,8 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
>   */
>  static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
>  
> +#define PI_LOCK_SCHED_OUT SINGLE_DEPTH_NESTING
> +
>  static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
>  {
>  	return &(to_vmx(vcpu)->pi_desc);
> @@ -89,9 +91,20 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>  	 * current pCPU if the task was migrated.
>  	 */
>  	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> -		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +		raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
> +
> +		/*
> +		 * In addition to taking the wakeup lock for the regular/IRQ
> +		 * context, tell lockdep it is being taken for the "sched out"
> +		 * context as well.  vCPU loads happen in task context, and
> +		 * this is taking the lock of the *previous* CPU, i.e. can race
> +		 * with both the scheduler and the wakeup handler.
> +		 */
> +		raw_spin_lock(spinlock);
> +		spin_acquire(&spinlock->dep_map, PI_LOCK_SCHED_OUT, 0, _RET_IP_);
>  		list_del(&vmx->pi_wakeup_list);
> -		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +		spin_release(&spinlock->dep_map, _RET_IP_);
> +		raw_spin_unlock(spinlock);
>  	}
>  
>  	dest = cpu_physical_id(cpu);
> @@ -151,7 +164,20 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
>  
>  	lockdep_assert_irqs_disabled();
>  
> -	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +	/*
> +	 * Acquire the wakeup lock using the "sched out" context to work around
> +	 * a lockdep false positive.  When this is called, schedule() holds
> +	 * various per-CPU scheduler locks.  When the wakeup handler runs, it
> +	 * holds this CPU's wakeup lock while calling try_to_wake_up(), which
> +	 * can eventually take the aforementioned scheduler locks, which causes
> +	 * lockdep to assume there is deadlock.
> +	 *
> +	 * Deadlock can't actually occur because IRQs are disabled for the
> +	 * entirety of the sched_out critical section, i.e. the wakeup handler
> +	 * can't run while the scheduler locks are held.
> +	 */
> +	raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu),
> +			     PI_LOCK_SCHED_OUT);
>  	list_add_tail(&vmx->pi_wakeup_list,
>  		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
>  	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> --
> 2.49.0.472.ge94155a9ec-goog
* Re: [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
  2025-04-02  2:54 ` Yan Zhao
@ 2025-04-02 20:06 ` Sean Christopherson
  0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2025-04-02 20:06 UTC
To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel

On Wed, Apr 02, 2025, Yan Zhao wrote:
> On Tue, Apr 01, 2025 at 08:47:27AM -0700, Sean Christopherson wrote:
> > I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for
> > wakeup, and complains about the A=>C versus C=>A inversion.  In practice,
> > deadlock can't occur between schedule out and the wakeup handler as they
> > are mutually exclusive.  The entirety of the schedule out code that runs
> > with the problematic scheduler locks held does so with IRQs disabled,
> > i.e. can't run concurrently with the wakeup handler.
> >
> > Use a subclass instead of disabling lockdep entirely, and tell lockdep that
>
> Paolo initially recommended utilizing the subclass.
> Do you think it's good to add his suggested-by tag?

Sure.

> BTW: is it necessary to state the subclass assignment explicitly in the
> patch msg? e.g.,
>
> wakeup handler: subclass 0
> sched_out: subclass 1
> sched_in: subclasses 0 and 1

Yeah, explicitly stating the effective rules would be helpful.  If those are
the only issues, I'll just fixup the changelog when applying.
* Re: [PATCH 0/2] KVM: VMX: Fix lockdep false positive on PI wakeup
  2025-04-01 15:47 [PATCH 0/2] KVM: VMX: Fix lockdep false positive on PI wakeup Sean Christopherson
@ 2025-04-04 10:34 ` Paolo Bonzini
  2 siblings, 0 replies; 9+ messages in thread
From: Paolo Bonzini @ 2025-04-04 10:34 UTC
To: Sean Christopherson; +Cc: kvm, linux-kernel, Yan Zhao, mlevitsk

Queued, thanks.

Paolo
end of thread, other threads: [~2025-04-04 10:34 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2025-04-01 15:47 [PATCH 0/2] KVM: VMX: Fix lockdep false positive on PI wakeup Sean Christopherson
2025-04-01 15:47 ` [PATCH 1/2] KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list Sean Christopherson
2025-04-01 19:04   ` Maxim Levitsky
2025-04-02  2:17   ` Yan Zhao
2025-04-01 15:47 ` [PATCH 2/2] KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive Sean Christopherson
2025-04-01 19:49   ` Maxim Levitsky
2025-04-02  2:54   ` Yan Zhao
2025-04-02 20:06     ` Sean Christopherson
2025-04-04 10:34 ` [PATCH 0/2] KVM: VMX: Fix lockdep false positive on PI wakeup Paolo Bonzini