public inbox for kvm@vger.kernel.org
* [RFC PATCH 0/2] KVM: x86: Strengthen locking rules for kvm_lock
@ 2025-01-24 19:11 Paolo Bonzini
  2025-01-24 19:11 ` [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages() Paolo Bonzini
  2025-01-24 19:11 ` [PATCH 2/2] Documentation: explain issues with taking locks inside kvm_lock Paolo Bonzini
  0 siblings, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-24 19:11 UTC
  To: linux-kernel, kvm; +Cc: seanjc

Last June, Sean reported a possible deadlock scenario due to
cpu_hotplug_lock being taken, unknowingly, while other KVM locks
are held.

While he solved his immediate needs by introducing kvm_usage_lock,
the same possibility could exist elsewhere.  The simplest sequence that
I could concoct requires four CPUs and no constant TSC, so this is really
theoretical, but since the fix is easy and can be documented, let's do it.

At the time the suggested solution was to use RCU for vm_list, but that's
not even necessary; it's enough to just keep the critical sections small,
avoiding _any_ mutex_lock while holding kvm_lock.  This is not hard to do
because you can always drop kvm_lock in the middle of the vm_list walk if
you first take a reference to the current struct kvm with kvm_get_kvm();
and anyway most walks of vm_list are already relatively small and only
take spinlocks.

The only case in which concurrent readers could be useful is really
accessing statistics *from debugfs*, but then even an rwsem would do.
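
For the record, if concurrent debugfs readers ever became interesting,
the conversion would be mechanical.  Just as a sketch, assuming kvm_lock
were turned into an rwsem (hypothetical, not part of this series), the
stats reader would become:

  static int vm_stat_get(void *_offset, u64 *val)
  {
    unsigned offset = (long)_offset;
    struct kvm *kvm;
    u64 tmp_val;

    *val = 0;
    down_read(&kvm_lock);  /* concurrent readers are fine here */
    list_for_each_entry(kvm, &vm_list, vm_list) {
      kvm_get_stat_per_vm(kvm, offset, &tmp_val);
      *val += tmp_val;
    }
    up_read(&kvm_lock);
    return 0;
  }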

RFC because it's compile-tested only.

Paolo

Paolo Bonzini (2):
  KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  Documentation: explain issues with taking locks inside kvm_lock

 Documentation/virt/kvm/locking.rst | 27 ++++++++++++++++++++-------
 arch/x86/kvm/mmu/mmu.c             | 13 +++++++------
 2 files changed, 27 insertions(+), 13 deletions(-)

-- 
2.43.5



* [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-24 19:11 [RFC PATCH 0/2] KVM: x86: Strengthen locking rules for kvm_lock Paolo Bonzini
@ 2025-01-24 19:11 ` Paolo Bonzini
  2025-01-24 20:11   ` Sean Christopherson
  2025-01-24 19:11 ` [PATCH 2/2] Documentation: explain issues with taking locks inside kvm_lock Paolo Bonzini
  1 sibling, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-24 19:11 UTC
  To: linux-kernel, kvm; +Cc: seanjc

Protect the whole function with kvm_lock so that all accesses to
nx_hugepage_mitigation_hard_disabled are under the lock, but drop it
when calling out to the MMU to avoid complex circular locking
situations such as the following:

__kvm_set_memory_region()
  lock(&kvm->slots_lock)
                           set_nx_huge_pages()
                             lock(kvm_lock)
                             lock(&kvm->slots_lock)
                                                     __kvmclock_cpufreq_notifier()
                                                       lock(cpu_hotplug_lock)
                                                       lock(kvm_lock)
                                                                                   lock(&kvm->srcu)
                                                                                   kvm_lapic_set_base()
                                                                                     static_branch_inc()
                                                                                       lock(cpu_hotplug_lock)
  sync(&kvm->srcu)

The deadlock is mostly theoretical, but it works as follows:
- __kvm_set_memory_region() waits for kvm->srcu with kvm->slots_lock taken
- set_nx_huge_pages() waits for kvm->slots_lock with kvm_lock taken
- __kvmclock_cpufreq_notifier() waits for kvm_lock with cpu_hotplug_lock taken
- KVM_RUN waits for cpu_hotplug_lock with kvm->srcu taken
- therefore __kvm_set_memory_region() never completes synchronize_srcu(&kvm->srcu).

To break the deadlock, release kvm_lock while taking kvm->slots_lock, which
breaks the chain:

  lock(&kvm->slots_lock)
                           set_nx_huge_pages()
                             lock(kvm_lock)
                                                     __kvmclock_cpufreq_notifier()
                                                       lock(cpu_hotplug_lock)
                                                       lock(kvm_lock)
                                                                                   lock(&kvm->srcu)
                                                                                   kvm_lapic_set_base()
                                                                                     static_branch_inc()
                                                                                       lock(cpu_hotplug_lock)
                             unlock(kvm_lock)
                                                       unlock(kvm_lock)
                                                       unlock(cpu_hotplug_lock)
                                                                                       unlock(cpu_hotplug_lock)
                                                                                   unlock(&kvm->srcu)
                             lock(&kvm->slots_lock)
  sync(&kvm->srcu)
  unlock(&kvm->slots_lock)
                             unlock(&kvm->slots_lock)

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2401606db260..1d8b45e7bb94 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7114,6 +7114,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	bool old_val = nx_huge_pages;
 	bool new_val;
 
+	guard(mutex)(&kvm_lock);
 	if (nx_hugepage_mitigation_hard_disabled)
 		return -EPERM;
 
@@ -7127,13 +7128,10 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	} else if (sysfs_streq(val, "never")) {
 		new_val = 0;
 
-		mutex_lock(&kvm_lock);
 		if (!list_empty(&vm_list)) {
-			mutex_unlock(&kvm_lock);
 			return -EBUSY;
 		}
 		nx_hugepage_mitigation_hard_disabled = true;
-		mutex_unlock(&kvm_lock);
 	} else if (kstrtobool(val, &new_val) < 0) {
 		return -EINVAL;
 	}
@@ -7143,16 +7141,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	if (new_val != old_val) {
 		struct kvm *kvm;
 
-		mutex_lock(&kvm_lock);
-
 		list_for_each_entry(kvm, &vm_list, vm_list) {
+			kvm_get_kvm(kvm);
+			mutex_unlock(&kvm_lock);
+
 			mutex_lock(&kvm->slots_lock);
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
 			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
+
+			mutex_lock(&kvm_lock);
+			kvm_put_kvm(kvm);
 		}
-		mutex_unlock(&kvm_lock);
 	}
 
 	return 0;
-- 
2.43.5




* [PATCH 2/2] Documentation: explain issues with taking locks inside kvm_lock
  2025-01-24 19:11 [RFC PATCH 0/2] KVM: x86: Strengthen locking rules for kvm_lock Paolo Bonzini
  2025-01-24 19:11 ` [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages() Paolo Bonzini
@ 2025-01-24 19:11 ` Paolo Bonzini
  1 sibling, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-24 19:11 UTC
  To: linux-kernel, kvm; +Cc: seanjc

kvm_lock should be used sparingly, and it is easy to protect
vm_list walks with kvm_get_kvm and kvm_put_kvm.  Make it
a hard rule to drop kvm_lock before taking another mutex,
and document it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/virt/kvm/locking.rst | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index c56d5f26c750..f94aad9b95ab 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -26,13 +26,6 @@ The acquisition orders for mutexes are as follows:
   are taken on the waiting side when modifying memslots, so MMU notifiers
   must not take either kvm->slots_lock or kvm->slots_arch_lock.
 
-cpus_read_lock() vs kvm_lock:
-
-- Taking cpus_read_lock() outside of kvm_lock is problematic, despite that
-  being the official ordering, as it is quite easy to unknowingly trigger
-  cpus_read_lock() while holding kvm_lock.  Use caution when walking vm_list,
-  e.g. avoid complex operations when possible.
-
 For SRCU:
 
 - ``synchronize_srcu(&kvm->srcu)`` is called inside critical sections
@@ -59,6 +52,23 @@ On x86:
 Everything else is a leaf: no other lock is taken inside the critical
 sections.
 
+In particular, no other mutex should be taken inside kvm_lock, and the
+amount of code that can be run inside kvm_lock should be limited; this
+is because ``cpus_read_lock()`` might be triggered unknowingly and cause
+a circular dependency.  For example, if you take ``kvm->slots_lock``
+inside ``kvm_lock``, the following can happen on x86:
+
+- ``kvm->srcu`` is synchronized with ``kvm->slots_lock`` taken
+- you wait for ``kvm->slots_lock`` with ``kvm_lock`` taken
+- ``__kvmclock_cpufreq_notifier()`` waits for ``kvm_lock`` and
+  is called within ``cpus_read_lock()``.
+- ``KVM_RUN`` can trigger static key updates, which call ``cpus_read_lock()``,
+  with ``kvm->srcu`` taken
+- therefore ``synchronize_srcu(&kvm->srcu)`` never completes.
+
+This rule applies to all architectures.
+
+
 2. Exception
 ------------
 
@@ -238,6 +248,9 @@ time it will be set using the Dirty tracking mechanism described above.
 :Type:		mutex
 :Arch:		any
 :Protects:	- vm_list
+                - kvm_createvm_count
+                - kvm_active_vms
+:Comment:       Do not take any mutex inside.
 
 ``kvm_usage_lock``
 ^^^^^^^^^^^^^^^^^^
-- 
2.43.5



* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-24 19:11 ` [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages() Paolo Bonzini
@ 2025-01-24 20:11   ` Sean Christopherson
  2025-01-24 22:19     ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-01-24 20:11 UTC
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

On Fri, Jan 24, 2025, Paolo Bonzini wrote:
> Protect the whole function with kvm_lock so that all accesses to
> nx_hugepage_mitigation_hard_disabled are under the lock, but drop it
> when calling out to the MMU to avoid complex circular locking
> situations such as the following:

...

> To break the deadlock, release kvm_lock while taking kvm->slots_lock, which
> breaks the chain:

Heh, except it's all kinds of broken.  IMO, biting the bullet and converting to
an SRCU-protected list is going to be far less work in the long run.

> @@ -7143,16 +7141,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
>  	if (new_val != old_val) {
>  		struct kvm *kvm;
>  
> -		mutex_lock(&kvm_lock);
> -
>  		list_for_each_entry(kvm, &vm_list, vm_list) {

This is unsafe, as vm_list can be modified while kvm_lock is dropped.  And
using list_for_each_entry_safe() doesn't help, because the _next_ entry may
already have been freed.

> +			kvm_get_kvm(kvm);

This needs to be:

		if (!kvm_get_kvm_safe(kvm))
			continue;

because the last reference to the VM could already have been put.

> +			mutex_unlock(&kvm_lock);
> +
>  			mutex_lock(&kvm->slots_lock);
>  			kvm_mmu_zap_all_fast(kvm);
>  			mutex_unlock(&kvm->slots_lock);
>  
>  			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);

See my bug report on this being a NULL pointer deref.

> +
> +			mutex_lock(&kvm_lock);
> +			kvm_put_kvm(kvm);

The order is backwards, kvm_put_kvm() needs to be called before acquiring kvm_lock.
If the last reference is put, kvm_put_kvm() => kvm_destroy_vm() will deadlock on
kvm_lock.

>  		}
> -		mutex_unlock(&kvm_lock);
>  	}


* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-24 20:11   ` Sean Christopherson
@ 2025-01-24 22:19     ` Paolo Bonzini
  2025-01-24 23:44       ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-24 22:19 UTC
  To: Sean Christopherson; +Cc: linux-kernel, kvm

Il ven 24 gen 2025, 21:11 Sean Christopherson <seanjc@google.com> ha scritto:
> Heh, except it's all kinds of broken.

Yes, I didn't even try.

> IMO, biting the bullet and converting to
> an SRCU-protected list is going to be far less work in the long run.

I did try a long SRCU critical section and it was unreviewable. It
ends up a lot less manageable than just making the lock a leaf,
especially as the lock hierarchy spans multiple subsystems (static
key, KVM, cpufreq---thanks CPU hotplug lock...). I also didn't like
adding a synchronization primitive that's... kinda single-use, but
that would not be a blocker of course.

So the second attempt was regular RCU, which looked a lot like this
patch. I started writing all the dances to find a struct kvm that
passes kvm_get_kvm_safe() before you do rcu_read_unlock() and drop the
previous one (because you cannot do kvm_put_kvm() within the RCU read
side), and I set the idea aside, incorrectly thinking those dances were
not needed with kvm_lock. Plus I didn't like having to keep alive a bunch
of data for a whole grace period if call_rcu() is used.

So for the third attempt I could choose between SRCU (dropping the
read lock in the middle of the walk) or just using kvm_lock. I didn't
even think of SRCU to be honest, because everything so far looked so
bad, but it does seem a little better than RCU. At least, if
kvm_destroy_vm() uses call_srcu(), you can call kvm_put_kvm() within
srcu_read_lock()...srcu_read_unlock().
It would look something like

  idx = srcu_read_lock(&kvm_srcu);
  list_for_each_entry_srcu(kvm, &vm_list, vm_list, 1) {
    if (!kvm_get_kvm_safe(kvm))
      continue;

    /* kvm is protected by the reference count now. */
    srcu_read_unlock(&kvm_srcu, idx);
    ...
    idx = srcu_read_lock(&kvm_srcu);
    /*
     * kvm stays alive, and the next entry can be read, until the
     * next srcu_read_unlock().
     */
    kvm_put_kvm(kvm);
  }
  srcu_read_unlock(&kvm_srcu, idx);

but again I am not sure how speedy call_srcu() is in reclaiming the
data, even in the common case where set_nx_huge_pages() or any other
SRCU reader (none of them is frequent) isn't running. If you don't use
call_srcu() it becomes just as bad as RCU or kvm_lock.

So... let's talk about kvm_lock.

> > @@ -7143,16 +7141,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
> > +                     kvm_get_kvm(kvm);
>
> This needs to be:
>
>                 if (!kvm_get_kvm_safe(kvm))
>                         continue;

If we go for kvm_lock, kvm_get_kvm() *can* be made safe within the
critical section; if kvm_put_kvm() uses refcount_dec_and_mutex_lock(),
then the 1->0 transition happens under kvm_lock and cannot race with
kvm_get_kvm() (the mutex can be dropped as soon as
refcount_dec_and_mutex_lock() returns, it's really just the decrement
that needs to be within the critical section).
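
For example, kvm_put_kvm() could be reworked along these lines (only a
sketch of the idea, assuming kvm->users_count stays a refcount_t):

  void kvm_put_kvm(struct kvm *kvm)
  {
    /*
     * Do the decrement under kvm_lock, so that a 1->0 transition
     * cannot race with a kvm_get_kvm() done inside a kvm_lock
     * critical section; the lock can be dropped as soon as the
     * decrement is done.
     */
    if (!refcount_dec_and_mutex_lock(&kvm->users_count, &kvm_lock))
      return;

    mutex_unlock(&kvm_lock);
    kvm_destroy_vm(kvm);
  }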

> >       if (new_val != old_val) {
> >               struct kvm *kvm;
> >
> > -             mutex_lock(&kvm_lock);
> > -
> >               list_for_each_entry(kvm, &vm_list, vm_list) {
>
> This is unsafe, as vm_list can be modified while kvm_lock is dropped.  And
> using list_for_each_entry_safe() doesn't help, because the _next_ entry have been
> freed.

list_for_each_entry_safe() is broken, but list_for_each_entry() can be
used. The problem is the call to kvm_put_kvm(), which is where the kvm
struct is freed, thus breaking the foreach loop. So the next entry must
be read and ref'd _before_ the current one is put with kvm_put_kvm(),
then you can do

  struct kvm *kvm, *prev = NULL;
  mutex_lock(&kvm_lock);
  list_for_each_entry(kvm, &vm_list, vm_list) {
    kvm_get_kvm(kvm);
    mutex_unlock(&kvm_lock);
    if (prev)
      kvm_put_kvm(prev);
    ...
    mutex_lock(&kvm_lock);
    prev = kvm;
  }
  mutex_unlock(&kvm_lock);
  if (prev)
    kvm_put_kvm(prev);

I don't know... there are few-enough readers that SRCU seems a bit
misplaced and it has the issue of keeping the VM data alive; while
kvm_lock has uglier code with the kvm_put_kvm() looking really
misplaced. If there were many instances one could write a nice
iterator, but for just one use?

Hmm... I wonder if something like

  if (poll_state_synchronize_srcu(&kvm_srcu,
          get_state_synchronize_srcu(&kvm_srcu))) {
    kvm_destroy_vm_cb(&kvm->rcu_head);
  } else {
    call_srcu(&kvm_srcu, &kvm->rcu_head, kvm_destroy_vm_cb);
  }

catches the case where there's no concurrent reader. If so, SRCU would
be a winner undoubtedly, but being the only user of a tricky RCU API
doesn't give me warm and fuzzy feelings. I'm still team kvm_lock for
now.

Anyhow I can prepare a tested version next Monday, with either
kvm_lock or with SRCU if the above trick works. Unless I showed that
it's trickier than it seems and successfully nerd-sniped you.
Seriously - just tell me what you prefer.

Paolo



* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-24 22:19     ` Paolo Bonzini
@ 2025-01-24 23:44       ` Sean Christopherson
  2025-01-25  0:08         ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-01-24 23:44 UTC
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

On Fri, Jan 24, 2025, Paolo Bonzini wrote:
> Il ven 24 gen 2025, 21:11 Sean Christopherson <seanjc@google.com> ha scritto:
> > Heh, except it's all kinds of broken.
> 
> Yes, I didn't even try.
> 
> > IMO, biting the bullet and converting to
> > an SRCU-protected list is going to be far less work in the long run.
> 
> I did try a long SRCU critical section and it was unreviewable. It
> ends up a lot less manageable than just making the lock a leaf,
> especially as the lock hierarchy spans multiple subsystems (static
> key, KVM, cpufreq---thanks CPU hotplug lock...).

I'm not following.  If __kvmclock_cpufreq_notifier() and set_nx_huge_pages()
switch to SRCU, then the deadlock goes away (it might even go away if just one
of those two switches).

SRCU readers would only interact with kvm_destroy_vm() from a locking perspective,
and if that's problematic then we would already have a plethora of issues.

> I also didn't like adding a synchronization primitive that's... kinda
> single-use, but that would not be a blocker of course.

It would be single use in that it only protects pure readers of vm_list, but
there are plenty of those users.

> So the second attempt was regular RCU, which looked a lot like this
> patch. I started writing all the dances to find a struct kvm that
> passes kvm_get_kvm_safe() before you do rcu_read_unlock() and drop the
> previous one (because you cannot do kvm_put_kvm() within the RCU read
> side) and set aside the idea, incorrectly thinking that they were not
> needed with kvm_lock. Plus I didn't like having to keep alive a bunch
> of data for a whole grace period if call_rcu() is used.
> 
> So for the third attempt I could have chosen between dropping the SRCU
> or just using kvm_lock. I didn't even think of SRCU to be honest,
> because everything so far looked so bad, but it does seem a little
> better than RCU. At least, if kvm_destroy_vm() uses call_srcu(), you
> can call kvm_put_kvm() within srcu_read_lock()...srcu_read_unlock().
> It would look something like
> 
>   list_for_each_entry_srcu(kvm, &vm_list, vm_list, 1) {
>     if (!kvm_get_kvm_safe(kvm))

Unless I'm missing something, we shouldn't need to take a reference so long as
SRCU is synchronized before destroying any part of the VM.  If we don't take a
reference, then we don't need to deal with the complexity of kvm_put_kvm()
creating a recursive lock snafu.

This is what I'm thinking, lightly tested...

---
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 24 Jan 2025 15:15:05 -0800
Subject: [PATCH] KVM: Use an SRCU lock to protect readers of vm_list

Introduce a global SRCU lock to protect KVM's global list of VMs, and use
it in all locations that currently take kvm_lock purely to prevent a VM
from being destroyed.

Keep using kvm_lock for flows that need to prevent VMs from being created,
as SRCU synchronization only guards against use-after-free; it doesn't
ensure a stable vm_list for readers.

This fixes a largely theoretical deadlock where:

  - __kvm_set_memory_region() waits for kvm->srcu with kvm->slots_lock taken
  - set_nx_huge_pages() waits for kvm->slots_lock with kvm_lock taken
  - __kvmclock_cpufreq_notifier() waits for kvm_lock with cpu_hotplug_lock taken
  - KVM_RUN waits for cpu_hotplug_lock with kvm->srcu taken

and therefore __kvm_set_memory_region() never completes
synchronize_srcu(&kvm->srcu).

  __kvm_set_memory_region()
    lock(&kvm->slots_lock)
                           set_nx_huge_pages()
                             lock(kvm_lock)
                             lock(&kvm->slots_lock)
                                                     __kvmclock_cpufreq_notifier()
                                                       lock(cpu_hotplug_lock)
                                                       lock(kvm_lock)
                                                                                   lock(&kvm->srcu)
                                                                                   kvm_lapic_set_base()
                                                                                     static_branch_inc()
                                                                                       lock(cpu_hotplug_lock)
  sync(&kvm->srcu)

Opportunistically add macros to walk the list of VMs, and the array of
vCPUs in each VM, to cut down on the amount of boilerplate.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 19 +++++++++++------
 arch/x86/kvm/x86.c       | 36 +++++++++++++++++--------------
 include/linux/kvm_host.h |  9 ++++++++
 virt/kvm/kvm_main.c      | 46 +++++++++++++++++++++++++---------------
 4 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 74fa38ebddbf..f5b7ceb7ca0e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7127,6 +7127,10 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 	} else if (sysfs_streq(val, "never")) {
 		new_val = 0;
 
+		/*
+		 * Take kvm_lock to ensure no VMs are *created* before the flag
+		 * is set.  vm_list_srcu only protects against VMs being deleted.
+		 */
 		mutex_lock(&kvm_lock);
 		if (!list_empty(&vm_list)) {
 			mutex_unlock(&kvm_lock);
@@ -7142,17 +7146,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 
 	if (new_val != old_val) {
 		struct kvm *kvm;
+		int idx;
 
-		mutex_lock(&kvm_lock);
+		idx = srcu_read_lock(&vm_list_srcu);
 
-		list_for_each_entry(kvm, &vm_list, vm_list) {
+		kvm_for_each_vm_srcu(kvm) {
 			mutex_lock(&kvm->slots_lock);
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
 			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 		}
-		mutex_unlock(&kvm_lock);
+
+		srcu_read_unlock(&vm_list_srcu, idx);
 	}
 
 	return 0;
@@ -7275,13 +7281,14 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 	if (is_recovery_enabled &&
 	    (!was_recovery_enabled || old_period > new_period)) {
 		struct kvm *kvm;
+		int idx;
 
-		mutex_lock(&kvm_lock);
+		idx = srcu_read_lock(&vm_list_srcu);
 
-		list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_for_each_vm_srcu(kvm)
 			vhost_task_wake(kvm->arch.nx_huge_page_recovery_thread);
 
-		mutex_unlock(&kvm_lock);
+		srcu_read_unlock(&vm_list_srcu, idx);
 	}
 
 	return err;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2d9a16fd4d3..8fb49237d179 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9428,6 +9428,11 @@ static void kvm_hyperv_tsc_notifier(void)
 	struct kvm *kvm;
 	int cpu;
 
+	/*
+	 * Take kvm_lock, not just vm_list_srcu, to prevent new VMs from coming
+	 * along in the middle of the update and not getting the in-progress
+	 * request.
+	 */
 	mutex_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list)
 		kvm_make_mclock_inprogress_request(kvm);
@@ -9456,7 +9461,7 @@ static void __kvmclock_cpufreq_notifier(struct cpufreq_freqs *freq, int cpu)
 {
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
-	int send_ipi = 0;
+	int send_ipi = 0, idx;
 	unsigned long i;
 
 	/*
@@ -9500,17 +9505,16 @@ static void __kvmclock_cpufreq_notifier(struct cpufreq_freqs *freq, int cpu)
 
 	smp_call_function_single(cpu, tsc_khz_changed, freq, 1);
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (vcpu->cpu != cpu)
-				continue;
-			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-			if (vcpu->cpu != raw_smp_processor_id())
-				send_ipi = 1;
-		}
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vcpu_in_each_vm(kvm, vcpu, i) {
+		if (vcpu->cpu != cpu)
+			continue;
+
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+		if (vcpu->cpu != raw_smp_processor_id())
+			send_ipi = 1;
 	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 
 	if (freq->old < freq->new && send_ipi) {
 		/*
@@ -9588,13 +9592,13 @@ static void pvclock_gtod_update_fn(struct work_struct *work)
 	struct kvm *kvm;
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
+	int idx;
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list)
-		kvm_for_each_vcpu(i, vcpu, kvm)
-			kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vcpu_in_each_vm(kvm, vcpu, i)
+		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	srcu_read_unlock(&vm_list_srcu, idx);
 	atomic_set(&kvm_guest_has_master_clock, 0);
-	mutex_unlock(&kvm_lock);
 }
 
 static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9df590e8f3da..0d0edb697160 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -193,6 +193,11 @@ bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
 
 extern struct mutex kvm_lock;
 extern struct list_head vm_list;
+extern struct srcu_struct vm_list_srcu;
+
+#define kvm_for_each_vm_srcu(__kvm)				\
+	list_for_each_entry_srcu(__kvm, &vm_list, vm_list,	\
+				 srcu_read_lock_held(&vm_list_srcu))
 
 struct kvm_io_range {
 	gpa_t addr;
@@ -1001,6 +1006,10 @@ static inline struct kvm_vcpu *kvm_get_vcpu_by_id(struct kvm *kvm, int id)
 	return NULL;
 }
 
+#define kvm_for_each_vcpu_in_each_vm(__kvm, __vcpu, __i)		\
+	kvm_for_each_vm_srcu(__kvm)					\
+		kvm_for_each_vcpu(__i, __vcpu, __kvm)
+
 void kvm_destroy_vcpus(struct kvm *kvm);
 
 void vcpu_load(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e0b9d6dd6a85..7fcc4433bf35 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -109,6 +109,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
 
 DEFINE_MUTEX(kvm_lock);
 LIST_HEAD(vm_list);
+DEFINE_SRCU(vm_list_srcu);
 
 static struct kmem_cache *kvm_vcpu_cache;
 
@@ -1261,13 +1262,21 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	mutex_lock(&kvm_lock);
+	list_del(&kvm->vm_list);
+	mutex_unlock(&kvm_lock);
+
+	/*
+	 * Ensure all readers of the global list go away before destroying any
+	 * aspect of the VM.  After this, the VM object is reachable only via
+	 * this task and notifiers that are registered to the VM itself.
+	 */
+	synchronize_srcu(&vm_list_srcu);
+
 	kvm_destroy_pm_notifier(kvm);
 	kvm_uevent_notify_change(KVM_EVENT_DESTROY_VM, kvm);
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
-	mutex_lock(&kvm_lock);
-	list_del(&kvm->vm_list);
-	mutex_unlock(&kvm_lock);
 	kvm_arch_pre_destroy_vm(kvm);
 
 	kvm_free_irq_routing(kvm);
@@ -6096,14 +6105,16 @@ static int vm_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	u64 tmp_val;
+	int idx;
 
 	*val = 0;
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm) {
 		kvm_get_stat_per_vm(kvm, offset, &tmp_val);
 		*val += tmp_val;
 	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
+
 	return 0;
 }
 
@@ -6111,15 +6122,15 @@ static int vm_stat_clear(void *_offset, u64 val)
 {
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
+	int idx;
 
 	if (val)
 		return -EINVAL;
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm)
 		kvm_clear_stat_per_vm(kvm, offset);
-	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 
 	return 0;
 }
@@ -6132,14 +6143,15 @@ static int vcpu_stat_get(void *_offset, u64 *val)
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
 	u64 tmp_val;
+	int idx;
 
 	*val = 0;
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm) {
 		kvm_get_stat_per_vcpu(kvm, offset, &tmp_val);
 		*val += tmp_val;
 	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 	return 0;
 }
 
@@ -6147,15 +6159,15 @@ static int vcpu_stat_clear(void *_offset, u64 val)
 {
 	unsigned offset = (long)_offset;
 	struct kvm *kvm;
+	int idx;
 
 	if (val)
 		return -EINVAL;
 
-	mutex_lock(&kvm_lock);
-	list_for_each_entry(kvm, &vm_list, vm_list) {
+	idx = srcu_read_lock(&vm_list_srcu);
+	kvm_for_each_vm_srcu(kvm)
 		kvm_clear_stat_per_vcpu(kvm, offset);
-	}
-	mutex_unlock(&kvm_lock);
+	srcu_read_unlock(&vm_list_srcu, idx);
 
 	return 0;
 }

base-commit: eb723766b1030a23c38adf2348b7c3d1409d11f0
-- 


* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-24 23:44       ` Sean Christopherson
@ 2025-01-25  0:08         ` Paolo Bonzini
  2025-01-25  0:44           ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-25  0:08 UTC
  To: Sean Christopherson; +Cc: linux-kernel, kvm

On 1/25/25 00:44, Sean Christopherson wrote:
>> I did try a long SRCU critical section and it was unreviewable. It
>> ends up a lot less manageable than just making the lock a leaf,
>> especially as the lock hierarchy spans multiple subsystems (static
>> key, KVM, cpufreq---thanks CPU hotplug lock...).
> 
> I'm not following.  If __kvmclock_cpufreq_notifier() and set_nx_huge_pages()
> switch to SRCU, then the deadlock goes away (it might even go away if just one
> of those two switches). [...] It would be single use in that it only
> protects pure readers of vm_list, but there are plenty of those users.

Yes, single use in the sense that only set_nx_huge_pages() really needs 
it.  kvm_lock doesn't produce any noticeable contention and as you noted 
sometimes you really need it even in pure readers.

> SRCU readers would only interact with kvm_destroy_vm() from a locking perspective,
> and if that's problematic then we would already have a plethora of issues.

Ah yeah, I missed that you cannot hold any lock when calling 
kvm_put_kvm().  So the waiting side is indeed a leaf and cannot block 
someone else.

Still, from your patch (thanks!) I don't really like the special cases on 
taking SRCU vs. kvm_lock... It really seems like a job for a mutex or 
rwsem.  It keeps the complexity in the one place that is different (i.e. 
where a lock is taken inside the iteration) and everything else can just 
iterate normally.

Paolo



* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-25  0:08         ` Paolo Bonzini
@ 2025-01-25  0:44           ` Sean Christopherson
  2025-01-27 17:27             ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-01-25  0:44 UTC
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

On Sat, Jan 25, 2025, Paolo Bonzini wrote:
> On 1/25/25 00:44, Sean Christopherson wrote:
> > SRCU readers would only interact with kvm_destroy_vm() from a locking perspective,
> > and if that's problematic then we would already have a plethora of issues.
> 
> Ah yeah, I missed that you cannot hold any lock when calling kvm_put_kvm().
> So the waiting side is indeed a leaf and cannot block someone else.
> 
> Still from your patch (thanks!) I don't really like the special cases on
> taking SRCU vs. kvm_lock... It really seems like a job for a mutex or rwsem.
> It keeps the complexity in the one place that is different (i.e. where a
> lock is taken inside the iteration) and everything else can just iterate
> normally.

I like the special casing; it makes the oddballs stand out, which in turn (hopefully)
makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
the set_nx_huge_pages() "never" path is a write in disguise, and
kvm_hyperv_tsc_notifier() is a very special snowflake.


* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-25  0:44           ` Sean Christopherson
@ 2025-01-27 17:27             ` Paolo Bonzini
  2025-01-27 17:56               ` Paolo Bonzini
  2025-01-27 18:01               ` Sean Christopherson
  0 siblings, 2 replies; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-27 17:27 UTC
  To: Sean Christopherson; +Cc: linux-kernel, kvm

On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> the set_nx_huge_pages() "never" path is a write in disguise, and
> kvm_hyperv_tsc_notifier() is a very special snowflake.

set_nx_huge_pages() is not a writer in disguise.  Rather, it's
a *real* writer for nx_hugepage_mitigation_hard_disabled which is
also protected by kvm_lock; and there's a (mostly theoretical)
bug in set_nx_huge_pages_recovery_param() which reads it without
taking the lock.  But it's still a reader as far as vm_list is
concerned.

Likewise, kvm_hyperv_tsc_notifier()'s requirement does deserve a comment,
but its specialness is self-inflicted pain due to using (S)RCU even when
it's not the most appropriate synchronization mechanism.

Paolo



* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-27 17:27             ` Paolo Bonzini
@ 2025-01-27 17:56               ` Paolo Bonzini
  2025-01-27 18:01               ` Sean Christopherson
  1 sibling, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-27 17:56 UTC
  To: Sean Christopherson; +Cc: linux-kernel, kvm

On Mon, Jan 27, 2025 at 6:27 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> > I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> > makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> > the set_nx_huge_pages() "never" path is a write in disguise, and
> > kvm_hyperv_tsc_notifier() is a very special snowflake.
>
> Likewise, kvm_hyperv_tsc_notifier()'s requirement does deserve a comment,
> but its specialness is self-inflicted pain due to using (S)RCU even when
> it's not the most appropriate synchronization mechanism.

... in fact, you could have a KVM_CREATE_VCPU and KVM_RUN after this
point:

        mutex_lock(&kvm_lock);
        list_for_each_entry(kvm, &vm_list, vm_list)
                kvm_make_mclock_inprogress_request(kvm);

because kvm_lock is not enough to ensure that all vCPUs got the
KVM_REQ_MCLOCK_INPROGRESS memo.  So kvm_hyperv_tsc_notifier()'s
complications go beyond kvm_lock and the choice to use SRCU or not.

Paolo



* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-27 17:27             ` Paolo Bonzini
  2025-01-27 17:56               ` Paolo Bonzini
@ 2025-01-27 18:01               ` Sean Christopherson
  2025-01-27 18:17                 ` Paolo Bonzini
  1 sibling, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2025-01-27 18:01 UTC
  To: Paolo Bonzini; +Cc: linux-kernel, kvm

On Mon, Jan 27, 2025, Paolo Bonzini wrote:
> On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> > I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> > makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> > the set_nx_huge_pages() "never" path is a write in disguise, and
> > kvm_hyperv_tsc_notifier() is a very special snowflake.
> 
> set_nx_huge_pages() is not a writer in disguise.  Rather, it's
> a *real* writer for nx_hugepage_mitigation_hard_disabled which is
> also protected by kvm_lock;

Heh, agreed, I was trying to say that it's a writer that is disguised as a reader.

> and there's a (mostly theoretical) bug in set_nx_huge_pages_recovery_param()
> which reads it without taking the lock.

It's arguably not a bug.  Userspace has no visibility into the order in which
param writes are processed.  If there are racing writes to the period/ratio and
"never", both outcomes are legal (rejected with -EPERM or period/ratio changes).
If nx_hugepage_mitigation_hard_disabled becomes set after the params are changed,
then vm_list is guaranteed to be empty, so the wakeup walk is still a nop.


* Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()
  2025-01-27 18:01               ` Sean Christopherson
@ 2025-01-27 18:17                 ` Paolo Bonzini
  0 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2025-01-27 18:17 UTC
  To: Sean Christopherson; +Cc: linux-kernel, kvm

On Mon, Jan 27, 2025 at 7:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Jan 27, 2025, Paolo Bonzini wrote:
> > On Sat, Jan 25, 2025 at 1:44 AM Sean Christopherson <seanjc@google.com> wrote:
> > > I like the special casing, it makes the oddballs stand out, which in turn (hopefully)
> > > makes developers pause and take note.  I.e. the SRCU walkers are all normal readers,
> > > the set_nx_huge_pages() "never" path is a write in disguise, and
> > > kvm_hyperv_tsc_notifier() is a very special snowflake.
> >
> > set_nx_huge_pages() is not a writer in disguise.  Rather, it's
> > a *real* writer for nx_hugepage_mitigation_hard_disabled which is
> > also protected by kvm_lock;
>
> Heh, agreed, I was trying to say that it's a write that is disguised as a reader.
>
> > and there's a (mostly theoretical) bug in set_nx_huge_pages_recovery_param()
> > which reads it without taking the lock.
>
> It's arguably not a bug.  Userspace has no visibility into the order in which
> param writes are processed.  If there are racing writes to the period/ratio and
> "never", both outcomes are legal (rejected with -EPERM or period/ratio changes).
> If nx_hugepage_mitigation_hard_disabled becomes set after the params are changed,
> then vm_list is guaranteed to be empty, so the wakeup walk is still a nop.

I think we have enough arguably necessary lockless cases already, without
also having to think about linearizability of sysfs writes and reads of
vm_list. :)  If you agree, set_nx_huge_pages() and
set_nx_huge_pages_recovery_param() can be changed to simply have a
guard(mutex)(&kvm_lock) around them, instead of protecting just the
vm_list walk.
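
Something along these lines (a hypothetical sketch, not a tested patch;
__set_nx_huge_pages_recovery_param() is just a made-up name standing in
for the existing body of the function):

  static int set_nx_huge_pages_recovery_param(const char *val,
                                              const struct kernel_param *kp)
  {
    /* Hold kvm_lock for the whole function; dropped on every return. */
    guard(mutex)(&kvm_lock);

    if (nx_hugepage_mitigation_hard_disabled)
      return -EPERM;

    return __set_nx_huge_pages_recovery_param(val, kp);
  }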

Paolo


