public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
@ 2025-03-19  2:19 Maxim Levitsky
  2025-03-19 16:17 ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Maxim Levitsky @ 2025-03-19  2:19 UTC (permalink / raw)
  To: kvm; +Cc: Sean Christopherson, Paolo Bonzini

Hi!

I recently ran into an interesting failure in the CI pipeline.


[  592.704446] WARNING: possible circular locking dependency detected 
[  592.710625] 6.12.0-36.el10.x86_64+debug #1 Not tainted 
[  592.715764] ------------------------------------------------------ 
[  592.721946] swapper/19/0 is trying to acquire lock: 
[  592.726823] ff110001b0e64ec0 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xa7/0x15c0 
[  592.734761]  
[  592.734761] but task is already holding lock: 
[  592.740596] ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel] 
[  592.752185]  
[  592.752185] which lock already depends on the new lock. 
[  592.752185]  
[  592.760357]  
[  592.760357] the existing dependency chain (in reverse order) is: 
[  592.767837]  
[  592.767837] -> #2 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}: 
[  592.776027]        __lock_acquire+0xcac/0x1d20 
[  592.780489]        lock_acquire.part.0+0x11b/0x360 
[  592.785300]        _raw_spin_lock+0x37/0x80 
[  592.789484]        pi_enable_wakeup_handler+0x11a/0x5a0 [kvm_intel] 
[  592.795776]        vmx_vcpu_pi_put+0x1d3/0x230 [kvm_intel] 
[  592.801282]        vmx_vcpu_put+0x12/0x20 [kvm_intel] 
[  592.806360]        kvm_arch_vcpu_put+0x494/0x7a0 [kvm] 
[  592.811592]        kvm_sched_out+0x161/0x1c0 [kvm] 
[  592.816456]        prepare_task_switch+0x36c/0xe20 
[  592.821247]        __schedule+0x481/0x1a50 
[  592.825346]        schedule+0xd4/0x280 
[  592.829100]        kvm_vcpu_block+0xe4/0x1d0 [kvm] 
[  592.833964]        kvm_vcpu_halt+0x1a2/0x800 [kvm] 
[  592.838824]        vcpu_run+0x53f/0x9e0 [kvm] 
[  592.843263]        kvm_arch_vcpu_ioctl_run+0x371/0x1480 [kvm] 
[  592.849078]        kvm_vcpu_ioctl+0x45e/0xc70 [kvm] 
[  592.854034]        __x64_sys_ioctl+0x12e/0x1a0 
[  592.858481]        do_syscall_64+0x92/0x180 
[  592.862685]        entry_SYSCALL_64_after_hwframe+0x76/0x7e 
[  592.868276]  
[  592.868276] -> #1 (&rq->__lock){-.-.}-{2:2}: 
[  592.874038]        __lock_acquire+0xcac/0x1d20 
[  592.878485]        lock_acquire.part.0+0x11b/0x360 
[  592.883277]        _raw_spin_lock_nested+0x3b/0x70 
[  592.888069]        raw_spin_rq_lock_nested+0x2e/0x130 
[  592.893121]        __task_rq_lock+0xdb/0x570 
[  592.897393]        wake_up_new_task+0x750/0x1050 
[  592.902014]        kernel_clone+0x13e/0x5f0 
[  592.906209]        user_mode_thread+0xa4/0xe0 
[  592.910568]        rest_init+0x1e/0x1c0 
[  592.914407]        start_kernel+0x3ab/0x3b0 
[  592.918592]        x86_64_start_reservations+0x24/0x30 
[  592.923732]        x86_64_start_kernel+0x9c/0xa0 
[  592.928353]        common_startup_64+0x13e/0x141 
[  592.932988]  
[  592.932988] -> #0 (&p->pi_lock){-.-.}-{2:2}: 
[  592.938754]        check_prev_add+0x1b7/0x23e0 
[  592.943200]        validate_chain+0xa8a/0xf00 
[  592.947557]        __lock_acquire+0xcac/0x1d20 
[  592.952005]        lock_acquire.part.0+0x11b/0x360 
[  592.956797]        _raw_spin_lock_irqsave+0x46/0x90 
[  592.961676]        try_to_wake_up+0xa7/0x15c0 
[  592.966035]        rcuwait_wake_up+0x80/0x190 
[  592.970396]        kvm_vcpu_wake_up+0x19/0xa0 [kvm] 
[  592.975353]        pi_wakeup_handler+0xf4/0x130 [kvm_intel] 
[  592.980950]        sysvec_kvm_posted_intr_wakeup_ipi+0x9c/0xd0 
[  592.986783]        asm_sysvec_kvm_posted_intr_wakeup_ipi+0x1a/0x20 
[  592.992964]        finish_task_switch.isra.0+0x228/0x830 
[  592.998275]        __schedule+0x7c7/0x1a50 
[  593.002376]        schedule_idle+0x59/0x90 
[  593.006477]        do_idle+0x155/0x200 
[  593.010244]        cpu_startup_entry+0x54/0x60 
[  593.014693]        start_secondary+0x212/0x290 
[  593.019138]        common_startup_64+0x13e/0x141 
[  593.023758]  
[  593.023758] other info that might help us debug this: 
[  593.023758]  
[  593.031757] Chain exists of: 
[  593.031757]   &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) 
[  593.031757]  
[  593.043932]  Possible unsafe locking scenario: 
[  593.043932]  
[  593.049852]        CPU0                    CPU1 
[  593.054384]        ----                    ---- 
[  593.058918]   lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); 
[  593.064576]                                lock(&rq->__lock);
[  593.070323]                                lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); 
[  593.078496]   lock(&p->pi_lock); 
[  593.081745]  
[  593.081745]  *** DEADLOCK *** 
[  593.081745]  
[  593.087666] 2 locks held by swapper/19/0: 
[  593.091679]  #0: ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel] 
[  593.103692]  #1: ffffffffa9b3eec0 (rcu_read_lock){....}-{1:2}, at: rcuwait_wake_up+0x22/0x190 
[  593.112227]  
[  593.112227] stack backtrace: 
[  593.116590] CPU: 19 UID: 0 PID: 0 Comm: swapper/19 Kdump: loaded Not tainted 6.12.0-36.el10.x86_64+debug #1 
[  593.126338] Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024 
[  593.133818] Call Trace: 
[  593.136271]  <IRQ> 
[  593.138291]  dump_stack_lvl+0x6f/0xb0 
[  593.141965]  print_circular_bug.cold+0x38/0x48 
[  593.146427]  check_noncircular+0x308/0x3f0 
[  593.150545]  ? __pfx_check_noncircular+0x10/0x10 
[  593.155180]  ? __module_address+0x95/0x240 
[  593.159300]  ? alloc_chain_hlocks+0x33b/0x520 
[  593.163675]  check_prev_add+0x1b7/0x23e0 
[  593.167620]  validate_chain+0xa8a/0xf00 
[  593.171478]  ? __pfx_validate_chain+0x10/0x10 
[  593.175850]  ? mark_lock+0x78/0x860 
[  593.179360]  ? __pfx_lockdep_lock+0x10/0x10 
[  593.183549]  __lock_acquire+0xcac/0x1d20 
[  593.187494]  ? __pfx_validate_chain+0x10/0x10 
[  593.191867]  lock_acquire.part.0+0x11b/0x360 
[  593.196156]  ? try_to_wake_up+0xa7/0x15c0 
[  593.200172]  ? __pfx_lock_acquire.part.0+0x10/0x10 
[  593.204980]  ? rcu_is_watching+0x15/0xb0 
[  593.208907]  ? trace_lock_acquire+0x1b9/0x280 
[  593.213265]  ? try_to_wake_up+0xa7/0x15c0 
[  593.217294]  ? lock_acquire+0x31/0xc0 
[  593.220961]  ? try_to_wake_up+0xa7/0x15c0 
[  593.224975]  _raw_spin_lock_irqsave+0x46/0x90 
[  593.229351]  ? try_to_wake_up+0xa7/0x15c0 
[  593.233362]  try_to_wake_up+0xa7/0x15c0 
[  593.237220]  ? trace_lock_acquire+0x1b9/0x280 
[  593.241581]  ? __pfx_try_to_wake_up+0x10/0x10 
[  593.245958]  rcuwait_wake_up+0x80/0x190 
[  593.249816]  kvm_vcpu_wake_up+0x19/0xa0 [kvm] 
[  593.254260]  pi_wakeup_handler+0xf4/0x130 [kvm_intel] 
[  593.259338]  sysvec_kvm_posted_intr_wakeup_ipi+0x9c/0xd0 
[  593.264668]  </IRQ> 
[  593.266772]  <TASK> 
[  593.268880]  asm_sysvec_kvm_posted_intr_wakeup_ipi+0x1a/0x20 
[  593.274557] RIP: 0010:finish_task_switch.isra.0+0x228/0x830 
[  593.280147] Code: 41 ff 4d 00 0f 84 95 03 00 00 81 7d d0 80 00 00 00 0f 84 bd 03 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc <49> 8d be 00 05 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 
[  593.298894] RSP: 0018:ffa0000000747cd8 EFLAGS: 00000206 
[  593.304137] RAX: 000000000016e47d RBX: ff1100079ec08e40 RCX: 0000000000000000 
[  593.311269] RDX: 0000000000000000 RSI: ffffffffa82f4140 RDI: ffffffffa85a2fa0 
[  593.318403] RBP: ffa0000000747d18 R08: 0000000000000001 R09: 0000000000000000 
[  593.325534] R10: 0000000000000001 R11: 0000000000000000 R12: ff110001b0e63f80 
[  593.332667] R13: 0000000000000000 R14: ff11000101dabf80 R15: ff1100079ec09e60 
[  593.339803]  ? finish_task_switch.isra.0+0x1b9/0x830 
[  593.344785]  ? __switch_to+0xc31/0xfa0 
[  593.348537]  ? __switch_to_asm+0x3d/0x70 
[  593.352483]  __schedule+0x7c7/0x1a50 
[  593.356077]  ? __pfx___schedule+0x10/0x10 
[  593.360107]  ? __pfx_sched_ttwu_pending+0x10/0x10 
[  593.364829]  ? rcu_is_watching+0x15/0xb0 
[  593.368777]  schedule_idle+0x59/0x90 
[  593.372370]  do_idle+0x155/0x200 
[  593.375605]  cpu_startup_entry+0x54/0x60 
[  593.379547]  start_secondary+0x212/0x290 
[  593.383483]  ? __pfx_start_secondary+0x10/0x10 
[  593.387945]  ? startup_64_load_idt+0xc1/0xf0 
[  593.392235]  common_startup_64+0x13e/0x141 
[  593.396355]  </TASK> 


I reproduced the failure upstream as well. 
In fact, it reproduces readily on a Sapphire Rapids system which I use for testing
('Intel(R) Xeon(R) Silver 4410Y').

To reproduce it, I just need to run the kvm selftests on a lockdep-enabled kernel.

If I am not mistaken the following happens:


1. pi_enable_wakeup_handler() is called when the current vCPU is scheduled out

   it will then
	- *disable interrupts*
	- take a per-cpu lock 'wakeup_vcpus_on_cpu_lock' of the *current CPU*
	- add the current vCPU to the per-cpu list which is protected by the lock above.
	- enable interrupts


2. pi_wakeup_handler (IRQ handler) will also
	- take wakeup_vcpus_on_cpu_lock of *current CPU*
	- go over the vCPUs in the wakeup_vcpus_on_cpu list of *current CPU*, and kick them.


AFAIK, (1) and (2) can't race with each other and in fact no locking is needed here to prevent
a race.

The lock is required to prevent a race with another place which needs access to the 'wakeup_vcpus_on_cpu' list:

(3)

When a sleeping vCPU is woken up (e.g. due to the kick above or for another reason)
and is scheduled to run, possibly on a different CPU, *vmx_vcpu_pi_load* will

	- disable interrupts (to avoid race with (1))
	- take wakeup_vcpus_on_cpu_lock (maybe of a different cpu)
	- remove this vCPU from the wakeup_vcpus_on_cpu list of that different CPU
	- release the lock and re-enable interrupts.

As far as I see, there is no race, but lockdep doesn't understand this.

It thinks that:

1. pi_enable_wakeup_handler is called from schedule() which holds rq->lock, and it itself takes wakeup_vcpus_on_cpu_lock lock

2. pi_wakeup_handler takes wakeup_vcpus_on_cpu_lock and then calls try_to_wake_up which can eventually take rq->lock
(at the start of the function there is a list of cases when it takes it)
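Lockdep tracks dependencies between lock *classes*, not lock instances, so all of the per-CPU 'wakeup_vcpus_on_cpu_lock' instances collapse into a single class, and the two edges above plus the wakeup-handler path close a loop. The following toy model of that class-level reasoning is an illustrative sketch only (not lockdep's actual algorithm; lock names simplified):

```python
from collections import defaultdict

class LockGraph:
    """Toy model: record 'acquired B while holding A' edges between lock
    classes and check whether the class-level graph contains a cycle."""
    def __init__(self):
        self.edges = defaultdict(set)  # held class -> set of acquired classes

    def acquire(self, held, new):
        self.edges[held].add(new)

    def has_cycle(self):
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)
        def dfs(n):
            color[n] = GRAY
            for m in self.edges[n]:
                if color[m] == GRAY or (color[m] == WHITE and dfs(m)):
                    return True
            color[n] = BLACK
            return False
        return any(color[n] == WHITE and dfs(n) for n in list(self.edges))

g = LockGraph()
# sched_out: wakeup_vcpus_on_cpu_lock taken while rq->__lock is held
g.acquire("rq->__lock", "wakeup_vcpus_on_cpu_lock")
# try_to_wake_up: rq->__lock taken while p->pi_lock is held
g.acquire("p->pi_lock", "rq->__lock")
# pi_wakeup_handler: p->pi_lock (via kvm_vcpu_wake_up) taken under the wakeup lock
g.acquire("wakeup_vcpus_on_cpu_lock", "p->pi_lock")

print(g.has_cycle())  # True: lockdep sees a cycle at the class level
```

Because the graph is built over classes, the per-CPU instance distinctions and the IRQs-disabled exclusion that make the deadlock impossible in practice are invisible to this kind of analysis.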

I don't know lockdep well yet, but maybe a lockdep annotation will help, 
if we can tell it that there are multiple 'wakeup_vcpus_on_cpu_lock' locks.


What do you think?

Best regards,
	Maxim Levitsky






^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-19  2:19 Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock Maxim Levitsky
@ 2025-03-19 16:17 ` Sean Christopherson
  2025-03-19 22:49   ` Maxim Levitsky
                     ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Sean Christopherson @ 2025-03-19 16:17 UTC (permalink / raw)
  To: Maxim Levitsky; +Cc: kvm, Paolo Bonzini, Yan Zhao, James Houghton

+James and Yan

On Tue, Mar 18, 2025, Maxim Levitsky wrote:
> Hi!
> 
> I recently ran into an interesting failure in the CI pipeline.
> 
> 
> [  592.704446] WARNING: possible circular locking dependency detected 
> [  592.710625] 6.12.0-36.el10.x86_64+debug #1 Not tainted 
> [  592.715764] ------------------------------------------------------ 
> [  592.721946] swapper/19/0 is trying to acquire lock: 
> [  592.726823] ff110001b0e64ec0 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xa7/0x15c0 
> [  592.734761]  
> [  592.734761] but task is already holding lock: 
> [  592.740596] ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel] 
> [  592.752185]  
> [  592.752185] which lock already depends on the new lock. 

...

> As far as I see, there is no race, but lockdep doesn't understand this.

Yep :-(

This splat fires every time (literally) I run through my battery of tests on
systems with IPI virtualization, it's basically an old friend at this point.

> It thinks that:
> 
> 1. pi_enable_wakeup_handler is called from schedule() which holds rq->lock, and it itself takes wakeup_vcpus_on_cpu_lock lock
> 
> 2. pi_wakeup_handler takes wakeup_vcpus_on_cpu_lock and then calls try_to_wake_up which can eventually take rq->lock
> (at the start of the function there is a list of cases when it takes it)
> 
> I don't know lockdep well yet, but maybe a lockdep annotation will help, 
> if we can tell it that there are multiple 'wakeup_vcpus_on_cpu_lock' locks.

Yan posted a patch to fudge around the issue[*], I strongly objected (and still
object) to making a functional and confusing code change to fudge around a lockdep
false positive.

James has also looked at the issue, and wasn't able to find a way to cleanly tell
lockdep about the situation.

Looking at this (yet) again, what if we temporarily tell lockdep that
wakeup_vcpus_on_cpu_lock isn't held when calling kvm_vcpu_wake_up()?  Gross, but
more surgical than disabling lockdep entirely on the lock.  This appears to squash
the warning without any unwanted side effects.

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index ec08fa3caf43..5984ad6f6f21 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -224,9 +224,17 @@ void pi_wakeup_handler(void)
 
        raw_spin_lock(spinlock);
        list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
-
+               /*
+                * Temporarily lie to lockdep to avoid false positives due to
+                * lockdep not understanding that deadlock is impossible.  This
+                * is called only in IRQ context, and the problematic locks
+                * taken in the kvm_vcpu_wake_up() call chain are only acquired
+                * with IRQs disabled.
+                */
+               spin_release(&spinlock->dep_map, _RET_IP_);
                if (pi_test_on(&vmx->pi_desc))
                        kvm_vcpu_wake_up(&vmx->vcpu);
+               spin_acquire(&spinlock->dep_map, 0, 0, _RET_IP_);
        }
        raw_spin_unlock(spinlock);
 }

[*] https://lore.kernel.org/all/20230313111022.13793-1-yan.y.zhao@intel.com

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-19 16:17 ` Sean Christopherson
@ 2025-03-19 22:49   ` Maxim Levitsky
  2025-03-21 11:11   ` Yan Zhao
  2025-03-21 11:49   ` Paolo Bonzini
  2 siblings, 0 replies; 10+ messages in thread
From: Maxim Levitsky @ 2025-03-19 22:49 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, Paolo Bonzini, Yan Zhao, James Houghton

On Wed, 2025-03-19 at 09:17 -0700, Sean Christopherson wrote:
> +James and Yan
> 
> On Tue, Mar 18, 2025, Maxim Levitsky wrote:
> > Hi!
> > 
> > I recently ran into an interesting failure in the CI pipeline.
> > 
> > 
> > [  592.704446] WARNING: possible circular locking dependency detected 
> > [  592.710625] 6.12.0-36.el10.x86_64+debug #1 Not tainted 
> > [  592.715764] ------------------------------------------------------ 
> > [  592.721946] swapper/19/0 is trying to acquire lock: 
> > [  592.726823] ff110001b0e64ec0 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xa7/0x15c0 
> > [  592.734761]  
> > [  592.734761] but task is already holding lock: 
> > [  592.740596] ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel] 
> > [  592.752185]  
> > [  592.752185] which lock already depends on the new lock. 
> 
> ...
> 
> > As far as I see, there is no race, but lockdep doesn't understand this.
> 
> Yep :-(
> 
> This splat fires every time (literally) I run through my battery of tests on
> systems with IPI virtualization, it's basically an old friend at this point.
> 
> > It thinks that:
> > 
> > 1. pi_enable_wakeup_handler is called from schedule() which holds rq->lock, and it itself takes wakeup_vcpus_on_cpu_lock lock
> > 
> > 2. pi_wakeup_handler takes wakeup_vcpus_on_cpu_lock and then calls try_to_wake_up which can eventually take rq->lock
> > (at the start of the function there is a list of cases when it takes it)
> > 
> > I don't know lockdep well yet, but maybe a lockdep annotation will help, 
> > if we can tell it that there are multiple 'wakeup_vcpus_on_cpu_lock' locks.
> 
> Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> object) to making a functional and confusing code change to fudge around a lockdep
> false positive.
> 
> James has also looked at the issue, and wasn't able to find a way to cleanly tell
> lockdep about the situation.
> 
> Looking at this (yet) again, what if we temporarily tell lockdep that
> wakeup_vcpus_on_cpu_lock isn't held when calling kvm_vcpu_wake_up()?  Gross, but
> more surgical than disabling lockdep entirely on the lock.  This appears to squash
> the warning without any unwanted side effects.
> 
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index ec08fa3caf43..5984ad6f6f21 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -224,9 +224,17 @@ void pi_wakeup_handler(void)
>  
>         raw_spin_lock(spinlock);
>         list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
> -
> +               /*
> +                * Temporarily lie to lockdep to avoid false positives due to
> +                * lockdep not understanding that deadlock is impossible.  This
> +                * is called only in IRQ context, and the problematic locks
> +                * taken in the kvm_vcpu_wake_up() call chain are only acquired
> +                * with IRQs disabled.
> +                */
> +               spin_release(&spinlock->dep_map, _RET_IP_);
>                 if (pi_test_on(&vmx->pi_desc))
>                         kvm_vcpu_wake_up(&vmx->vcpu);
> +               spin_acquire(&spinlock->dep_map, 0, 0, _RET_IP_);
>         }
>         raw_spin_unlock(spinlock);
>  }
> 
> [*] https://lore.kernel.org/all/20230313111022.13793-1-yan.y.zhao@intel.com
> 

This is a good idea.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-19 16:17 ` Sean Christopherson
  2025-03-19 22:49   ` Maxim Levitsky
@ 2025-03-21 11:11   ` Yan Zhao
  2025-03-21 11:49   ` Paolo Bonzini
  2 siblings, 0 replies; 10+ messages in thread
From: Yan Zhao @ 2025-03-21 11:11 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Maxim Levitsky, kvm, Paolo Bonzini, James Houghton

On Wed, Mar 19, 2025 at 09:17:36AM -0700, Sean Christopherson wrote:
> +James and Yan
> 
> On Tue, Mar 18, 2025, Maxim Levitsky wrote:
> > Hi!
> > 
> > I recently ran into an interesting failure in the CI pipeline.
> > 
> > 
> > [  592.704446] WARNING: possible circular locking dependency detected 
> > [  592.710625] 6.12.0-36.el10.x86_64+debug #1 Not tainted 
> > [  592.715764] ------------------------------------------------------ 
> > [  592.721946] swapper/19/0 is trying to acquire lock: 
> > [  592.726823] ff110001b0e64ec0 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0xa7/0x15c0 
> > [  592.734761]  
> > [  592.734761] but task is already holding lock: 
> > [  592.740596] ff1100079ec0c058 (&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)){-...}-{2:2}, at: pi_wakeup_handler+0x60/0x130 [kvm_intel] 
> > [  592.752185]  
> > [  592.752185] which lock already depends on the new lock. 
> 
> ...
> 
> > As far as I see, there is no race, but lockdep doesn't understand this.
> 
> Yep :-(
> 
> This splat fires every time (literally) I run through my battery of tests on
> systems with IPI virtualization, it's basically an old friend at this point.
> 
> > It thinks that:
> > 
> > 1. pi_enable_wakeup_handler is called from schedule() which holds rq->lock, and it itself takes wakeup_vcpus_on_cpu_lock lock
> > 
> > 2. pi_wakeup_handler takes wakeup_vcpus_on_cpu_lock and then calls try_to_wake_up which can eventually take rq->lock
> > (at the start of the function there is a list of cases when it takes it)
> > 
> > I don't know lockdep well yet, but maybe a lockdep annotation will help, 
> > if we can tell it that there are multiple 'wakeup_vcpus_on_cpu_lock' locks.
> 
> Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> object) to making a functional and confusing code change to fudge around a lockdep
> false positive.
Not a functional code change? It just splits one big lock into two smaller ones
to protect a single resource. :)
It's kind of like having threads A, B, C where:
A, B: take lock 1,
A, C: take lock 2,
B, C: don't need a lock since they are mutually exclusive.

> James has also looked at the issue, and wasn't able to find a way to cleanly tell
> lockdep about the situation.
> 
> Looking at this (yet) again, what if we temporarily tell lockdep that
> wakeup_vcpus_on_cpu_lock isn't held when calling kvm_vcpu_wake_up()?  Gross, but
> more surgical than disabling lockdep entirely on the lock.  This appears to squash
> the warning without any unwanted side effects.

This can fix the lockdep warning on my side as well.

> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index ec08fa3caf43..5984ad6f6f21 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -224,9 +224,17 @@ void pi_wakeup_handler(void)
>  
>         raw_spin_lock(spinlock);
>         list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
> -
> +               /*
> +                * Temporarily lie to lockdep to avoid false positives due to
> +                * lockdep not understanding that deadlock is impossible.  This
> +                * is called only in IRQ context, and the problematic locks
> +                * taken in the kvm_vcpu_wake_up() call chain are only acquired
> +                * with IRQs disabled.
> +                */
> +               spin_release(&spinlock->dep_map, _RET_IP_);
>                 if (pi_test_on(&vmx->pi_desc))
>                         kvm_vcpu_wake_up(&vmx->vcpu);
> +               spin_acquire(&spinlock->dep_map, 0, 0, _RET_IP_);
>         }
>         raw_spin_unlock(spinlock);
>  }
> 
> [*] https://lore.kernel.org/all/20230313111022.13793-1-yan.y.zhao@intel.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-19 16:17 ` Sean Christopherson
  2025-03-19 22:49   ` Maxim Levitsky
  2025-03-21 11:11   ` Yan Zhao
@ 2025-03-21 11:49   ` Paolo Bonzini
  2025-03-24 13:25     ` Maxim Levitsky
  2025-03-27 12:06     ` Yan Zhao
  2 siblings, 2 replies; 10+ messages in thread
From: Paolo Bonzini @ 2025-03-21 11:49 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Maxim Levitsky, kvm, Yan Zhao, James Houghton

On Wed, Mar 19, 2025 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
> Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> object) to making a functional and confusing code change to fudge around a lockdep
> false positive.

In that thread I had made another suggestion, which Yan also tried,
which was to use subclasses:

- in the sched_out path, which cannot race with the others:
  raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);

- in the irq and sched_in paths, which can race with each other:
  raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

Paolo


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-21 11:49   ` Paolo Bonzini
@ 2025-03-24 13:25     ` Maxim Levitsky
  2025-03-27 12:06     ` Yan Zhao
  1 sibling, 0 replies; 10+ messages in thread
From: Maxim Levitsky @ 2025-03-24 13:25 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, Yan Zhao, James Houghton

On Fri, 2025-03-21 at 12:49 +0100, Paolo Bonzini wrote:
> On Wed, Mar 19, 2025 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> > object) to making a functional and confusing code change to fudge around a lockdep
> > false positive.
> 
> In that thread I had made another suggestion, which Yan also tried,
> which was to use subclasses:
> 
> - in the sched_out path, which cannot race with the others:
>   raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
> 
> - in the irq and sched_in paths, which can race with each other:
>   raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));

Hi,

I also have no objections to this. As was said in the original
discussion, this can make lockdep miss a bug, but so can Sean's solution.

I don't think there is a way to both fix the warning and not degrade the check in some way,
but since the check disables lockdep, either of these two fixes is much better
than what we have now.

So what should I do to fix this? Can I assume that one of you will post a patch
to fix this, or should I post a patch with one of these solutions?
(I want to close a ticket that I have :) )

Best regards,
	Maxim Levitsky


> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-21 11:49   ` Paolo Bonzini
  2025-03-24 13:25     ` Maxim Levitsky
@ 2025-03-27 12:06     ` Yan Zhao
  2025-03-27 23:41       ` Sean Christopherson
  2025-03-28 23:02       ` Maxim Levitsky
  1 sibling, 2 replies; 10+ messages in thread
From: Yan Zhao @ 2025-03-27 12:06 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Sean Christopherson, Maxim Levitsky, kvm, James Houghton

On Fri, Mar 21, 2025 at 12:49:42PM +0100, Paolo Bonzini wrote:
> On Wed, Mar 19, 2025 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> > object) to making a functional and confusing code change to fudge around a lockdep
> > false positive.
> 
> In that thread I had made another suggestion, which Yan also tried,
> which was to use subclasses:
> 
> - in the sched_out path, which cannot race with the others:
>   raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
>
> - in the irq and sched_in paths, which can race with each other:
>   raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
Hi Paolo, Sean, Maxim,

The sched_out path may still race with the sched_in path, e.g.:
    CPU 0                 CPU 1
-----------------     ---------------
vCPU 0 sched_out
vCPU 1 sched_in
vCPU 1 sched_out      vCPU 0 sched_in

vCPU 0 sched_in may race with vCPU 1 sched_out on CPU 0's wakeup list.


So, the situation is
sched_in, sched_out: race
sched_in, irq:       race
sched_out, irq:      mutually exclusive, do not race


Hence, do you think the subclass assignments below are reasonable?
irq: subclass 0
sched_out: subclass 1
sched_in: subclasses 0 and 1

Inspired by Sean's solution, I made the patch below to inform lockdep that the
sched_in path involves both subclasses 0 and 1, by adding the line
"spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_)".

I like it because it accurately conveys the situation to lockdep :)
What are your thoughts?

Thanks
Yan

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index ec08fa3caf43..c5684225255a 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -89,9 +89,12 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
         * current pCPU if the task was migrated.
         */
        if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
-               raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+               raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
+               raw_spin_lock(spinlock);
+               spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_);
                list_del(&vmx->pi_wakeup_list);
-               raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+               spin_release(&spinlock->dep_map, _RET_IP_);
+               raw_spin_unlock(spinlock);
        }

        dest = cpu_physical_id(cpu);
@@ -152,7 +155,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)

        local_irq_save(flags);

-       raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+       raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
        list_add_tail(&vmx->pi_wakeup_list,
                      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
        raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
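For what it's worth, modeling the class-level edges after this patch suggests why the splat goes away: with the wakeup lock split into subclass 0 (irq, sched_in) and subclass 1 (sched_out, plus sched_in's extra acquire), the recorded dependencies no longer form a cycle. This is a toy sketch only, not lockdep's real bookkeeping, and "wakeup/0" / "wakeup/1" are made-up names for the two subclasses:

```python
# Hypothetical class-level dependency edges with the patch applied.
edges = {
    "p->pi_lock": {"rq->__lock"},              # try_to_wake_up
    "rq->__lock": {"wakeup/1"},                # sched_out: raw_spin_lock_nested(.., 1)
    "wakeup/0":   {"p->pi_lock", "wakeup/1"},  # irq wakeup; sched_in's spin_acquire
}

def reachable(src, dst, seen=frozenset()):
    # Depth-first search: can `dst` be reached from `src` via recorded edges?
    return any(m == dst or (m not in seen and reachable(m, dst, seen | {m}))
               for m in edges.get(src, ()))

# No lock class can reach itself, i.e. the dependency graph is acyclic:
print(any(reachable(n, n) for n in edges))  # False
```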



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-27 12:06     ` Yan Zhao
@ 2025-03-27 23:41       ` Sean Christopherson
  2025-03-28  0:52         ` Yan Zhao
  2025-03-28 23:02       ` Maxim Levitsky
  1 sibling, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2025-03-27 23:41 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, Maxim Levitsky, kvm, James Houghton

On Thu, Mar 27, 2025, Yan Zhao wrote:
> On Fri, Mar 21, 2025 at 12:49:42PM +0100, Paolo Bonzini wrote:
> > On Wed, Mar 19, 2025 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > > Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> > > object) to making a functional and confusing code change to fudge around a lockdep
> > > false positive.
> > 
> > In that thread I had made another suggestion, which Yan also tried,
> > which was to use subclasses:
> > 
> > - in the sched_out path, which cannot race with the others:
> >   raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
> >
> > - in the irq and sched_in paths, which can race with each other:
> >   raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> Hi Paolo, Sean, Maxim,
> 
> The sched_out path may still race with the sched_in path, e.g.:
>     CPU 0                 CPU 1
> -----------------     ---------------
> vCPU 0 sched_out
> vCPU 1 sched_in
> vCPU 1 sched_out      vCPU 0 sched_in
> 
> vCPU 0 sched_in may race with vCPU 1 sched_out on CPU 0's wakeup list.
> 
> 
> So, the situation is
> sched_in, sched_out: race
> sched_in, irq:       race
> sched_out, irq:      mutually exclusive, do not race
> 
> 
> Hence, do you think the subclass assignments below are reasonable?
> irq: subclass 0
> sched_out: subclass 1
> sched_in: subclasses 0 and 1
> 
> Inspired by Sean's solution, I made the patch below to inform lockdep that the
> sched_in path involves both subclasses 0 and 1, by adding the line
> "spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_)".
> 
> I like it because it accurately conveys the situation to lockdep :)

Me too :-)

Can you give your SoB?  I wrote comments and a changelog to explain to myself
(yet again) what the problem is, and why it's a false positive.  I also want
to change the local_irq_{save,restore}() into a lockdep assertion in a prep patch,
because this and the self-IPI trick rely on IRQs being disabled until the task
is fully scheduled out and the scheduler locks are dropped.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-27 23:41       ` Sean Christopherson
@ 2025-03-28  0:52         ` Yan Zhao
  0 siblings, 0 replies; 10+ messages in thread
From: Yan Zhao @ 2025-03-28  0:52 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, Maxim Levitsky, kvm, James Houghton

On Thu, Mar 27, 2025 at 04:41:42PM -0700, Sean Christopherson wrote:
> On Thu, Mar 27, 2025, Yan Zhao wrote:
> > On Fri, Mar 21, 2025 at 12:49:42PM +0100, Paolo Bonzini wrote:
> > > On Wed, Mar 19, 2025 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > Yan posted a patch to fudge around the issue[*], I strongly objected (and still
> > > > object) to making a functional and confusing code change to fudge around a lockdep
> > > > false positive.
> > > 
> > > In that thread I had made another suggestion, which Yan also tried,
> > > which was to use subclasses:
> > > 
> > > - in the sched_out path, which cannot race with the others:
> > >   raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
> > >
> > > - in the irq and sched_in paths, which can race with each other:
> > >   raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> > Hi Paolo, Sean, Maxim,
> > 
> > The sched_out path may still race with the sched_in path, e.g.
> >     CPU 0                 CPU 1
> > -----------------     ---------------
> > vCPU 0 sched_out
> > vCPU 1 sched_in
> > vCPU 1 sched_out      vCPU 0 sched_in
> > 
> > vCPU 0 sched_in may race with vCPU 1 sched_out on CPU 0's wakeup list.
> > 
> > 
> > So, the situation is
> > sched_in, sched_out: race
> > sched_in, irq:       race
> > sched_out, irq: mutually exclusive, do not race
> > 
> > 
> > Hence, do you think the subclass assignments below are reasonable?
> > irq: subclass 0
> > sched_out: subclass 1
> > sched_in: subclasses 0 and 1
> > 
> > Inspired by Sean's solution, I made the patch below to inform lockdep that the
> > sched_in path involves both subclasses 0 and 1 by adding the line
> > "spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_)".
> > 
> > I like it because it accurately conveys the situation to lockdep :)
> 
> Me too :-)
Great!

> Can you give your SoB?  I wrote comments and a changelog to explain to myself
Sure. Thanks for helping with the comments and changelog :)
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>

> (yet again), what the problem is, and why it's a false positive.  I also want
> to change the local_irq_{save,restore}() into a lockdep assertion in a prep patch,
> because this and the self-IPI trick rely on IRQs being disabled until the task
> is fully scheduled out and the scheduler locks are dropped.
Fair enough.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock
  2025-03-27 12:06     ` Yan Zhao
  2025-03-27 23:41       ` Sean Christopherson
@ 2025-03-28 23:02       ` Maxim Levitsky
  1 sibling, 0 replies; 10+ messages in thread
From: Maxim Levitsky @ 2025-03-28 23:02 UTC (permalink / raw)
  To: Yan Zhao, Paolo Bonzini; +Cc: Sean Christopherson, kvm, James Houghton

On Thu, 2025-03-27 at 20:06 +0800, Yan Zhao wrote:
> On Fri, Mar 21, 2025 at 12:49:42PM +0100, Paolo Bonzini wrote:
> > On Wed, Mar 19, 2025 at 5:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > > Yan posted a patch to fudge around the issue[*]; I strongly objected (and still
> > > object) to making a functional and confusing code change to fudge around a lockdep
> > > false positive.
> > 
> > In that thread I had made another suggestion, which Yan also tried,
> > which was to use subclasses:
> > 
> > - in the sched_out path, which cannot race with the others:
> >   raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
> > 
> > - in the irq and sched_in paths, which can race with each other:
> >   raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> Hi Paolo, Sean, Maxim,
> 
> The sched_out path may still race with the sched_in path, e.g.
>     CPU 0                 CPU 1
> -----------------     ---------------
> vCPU 0 sched_out
> vCPU 1 sched_in
> vCPU 1 sched_out      vCPU 0 sched_in
> 
> vCPU 0 sched_in may race with vCPU 1 sched_out on CPU 0's wakeup list.
> 
> 
> So, the situation is
> sched_in, sched_out: race
> sched_in, irq:       race
> sched_out, irq: mutually exclusive, do not race
> 
> 
> Hence, do you think the subclass assignments below are reasonable?
> irq: subclass 0
> sched_out: subclass 1
> sched_in: subclasses 0 and 1
> 
> Inspired by Sean's solution, I made the patch below to inform lockdep that the
> sched_in path involves both subclasses 0 and 1 by adding the line
> "spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_)".
> 
> I like it because it accurately conveys the situation to lockdep :)
> What are your thoughts?
> 
> Thanks
> Yan
> 
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index ec08fa3caf43..c5684225255a 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -89,9 +89,12 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
>          * current pCPU if the task was migrated.
>          */
>         if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
> -               raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +               raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu);
> +               raw_spin_lock(spinlock);
> +               spin_acquire(&spinlock->dep_map, 1, 0, _RET_IP_);
>                 list_del(&vmx->pi_wakeup_list);
> -               raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +               spin_release(&spinlock->dep_map, _RET_IP_);
> +               raw_spin_unlock(spinlock);
>         }
> 
>         dest = cpu_physical_id(cpu);
> @@ -152,7 +155,7 @@ static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
> 
>         local_irq_save(flags);
> 
> -       raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> +       raw_spin_lock_nested(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu), 1);
>         list_add_tail(&vmx->pi_wakeup_list,
>                       &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
>         raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
> 
> 
I also agree that this is a good idea!
Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-03-28 23:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-19  2:19 Lockdep failure due to 'weird' per-cpu wakeup_vcpus_on_cpu_lock lock Maxim Levitsky
2025-03-19 16:17 ` Sean Christopherson
2025-03-19 22:49   ` Maxim Levitsky
2025-03-21 11:11   ` Yan Zhao
2025-03-21 11:49   ` Paolo Bonzini
2025-03-24 13:25     ` Maxim Levitsky
2025-03-27 12:06     ` Yan Zhao
2025-03-27 23:41       ` Sean Christopherson
2025-03-28  0:52         ` Yan Zhao
2025-03-28 23:02       ` Maxim Levitsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox