linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/3] KVM: Speed up MMIO registrations
@ 2025-06-24  9:22 Keir Fraser
  2025-06-24  9:22 ` [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Keir Fraser @ 2025-06-24  9:22 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, kvm
  Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
	Paolo Bonzini, Keir Fraser

This series improves performance of MMIO registrations, which occur
multiple times when booting a typical VM. The existing call to
synchronize_srcu() may block for a considerable time: in our tests it
accounted for approximately 25% of a VM's startup time.

The arm64 vgic code is cleaned up and made responsible for its own
setup synchronization, and the MMIO registration logic replaces
synchronize_srcu() with a deferred callback that frees the old
kvm_io_bus struct.

Keir Fraser (3):
  KVM: arm64: vgic-init: Remove vgic_ready() macro
  KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering
  KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()

 arch/arm64/kvm/vgic/vgic-init.c | 14 +++-----------
 include/kvm/arm_vgic.h          |  1 -
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             | 10 ++++++++--
 4 files changed, 12 insertions(+), 14 deletions(-)

-- 
2.50.0.rc2.761.g2dc52ea45b-goog



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro
  2025-06-24  9:22 [PATCH 0/3] KVM: Speed up MMIO registrations Keir Fraser
@ 2025-06-24  9:22 ` Keir Fraser
  2025-06-24  9:22 ` [PATCH 2/3] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering Keir Fraser
  2025-06-24  9:22 ` [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser
  2 siblings, 0 replies; 11+ messages in thread
From: Keir Fraser @ 2025-06-24  9:22 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, kvm
  Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
	Paolo Bonzini, Keir Fraser

The vgic_ready() macro is now used only within kvm_vgic_map_resources().
vgic_dist::ready is already written directly by this function, so it is
clearer to bypass the macro for reads as well.

Signed-off-by: Keir Fraser <keirf@google.com>
---
 arch/arm64/kvm/vgic/vgic-init.c | 5 ++---
 include/kvm/arm_vgic.h          | 1 -
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index eb1205654ac8..502b65049703 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -559,7 +559,6 @@ int vgic_lazy_init(struct kvm *kvm)
  * Also map the virtual CPU interface into the VM.
  * v2 calls vgic_init() if not already done.
  * v3 and derivatives return an error if the VGIC is not initialized.
- * vgic_ready() returns true if this function has succeeded.
  */
 int kvm_vgic_map_resources(struct kvm *kvm)
 {
@@ -568,12 +567,12 @@ int kvm_vgic_map_resources(struct kvm *kvm)
 	gpa_t dist_base;
 	int ret = 0;
 
-	if (likely(vgic_ready(kvm)))
+	if (likely(dist->ready))
 		return 0;
 
 	mutex_lock(&kvm->slots_lock);
 	mutex_lock(&kvm->arch.config_lock);
-	if (vgic_ready(kvm))
+	if (dist->ready)
 		goto out;
 
 	if (!irqchip_in_kernel(kvm))
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 4a34f7f0a864..233eaa6d1267 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -399,7 +399,6 @@ u64 vgic_v3_get_misr(struct kvm_vcpu *vcpu);
 
 #define irqchip_in_kernel(k)	(!!((k)->arch.vgic.in_kernel))
 #define vgic_initialized(k)	((k)->arch.vgic.initialized)
-#define vgic_ready(k)		((k)->arch.vgic.ready)
 #define vgic_valid_spi(k, i)	(((i) >= VGIC_NR_PRIVATE_IRQS) && \
 			((i) < (k)->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS))
 
-- 
2.50.0.rc2.761.g2dc52ea45b-goog




* [PATCH 2/3] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering
  2025-06-24  9:22 [PATCH 0/3] KVM: Speed up MMIO registrations Keir Fraser
  2025-06-24  9:22 ` [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser
@ 2025-06-24  9:22 ` Keir Fraser
  2025-06-24  9:22 ` [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser
  2 siblings, 0 replies; 11+ messages in thread
From: Keir Fraser @ 2025-06-24  9:22 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, kvm
  Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
	Paolo Bonzini, Keir Fraser

In preparation for removing synchronize_srcu() from MMIO registration,
remove the distributor's dependency on this implicit barrier by using
direct acquire-release synchronization on the flag write and its
lock-free check.

Signed-off-by: Keir Fraser <keirf@google.com>
---
 arch/arm64/kvm/vgic/vgic-init.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 502b65049703..bc83672e461b 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -567,7 +567,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
 	gpa_t dist_base;
 	int ret = 0;
 
-	if (likely(dist->ready))
+	if (likely(smp_load_acquire(&dist->ready)))
 		return 0;
 
 	mutex_lock(&kvm->slots_lock);
@@ -598,14 +598,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
 		goto out_slots;
 	}
 
-	/*
-	 * kvm_io_bus_register_dev() guarantees all readers see the new MMIO
-	 * registration before returning through synchronize_srcu(), which also
-	 * implies a full memory barrier. As such, marking the distributor as
-	 * 'ready' here is guaranteed to be ordered after all vCPUs having seen
-	 * a completely configured distributor.
-	 */
-	dist->ready = true;
+	smp_store_release(&dist->ready, true);
 	goto out_slots;
 out:
 	mutex_unlock(&kvm->arch.config_lock);
-- 
2.50.0.rc2.761.g2dc52ea45b-goog




* [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-06-24  9:22 [PATCH 0/3] KVM: Speed up MMIO registrations Keir Fraser
  2025-06-24  9:22 ` [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser
  2025-06-24  9:22 ` [PATCH 2/3] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering Keir Fraser
@ 2025-06-24  9:22 ` Keir Fraser
  2025-06-24 15:11   ` Sean Christopherson
  2 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2025-06-24  9:22 UTC (permalink / raw)
  To: linux-arm-kernel, linux-kernel, kvm
  Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
	Paolo Bonzini, Keir Fraser

Device MMIO registration may happen quite frequently during VM boot,
and the SRCU synchronization each time has a measurable effect
on VM startup time. In our experiments it can account for around 25%
of a VM's startup time.

Replace the synchronization with a deferred free of the old kvm_io_bus
structure.

Signed-off-by: Keir Fraser <keirf@google.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3bde4fb5c6aa..28a63f1ad314 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -205,6 +205,7 @@ struct kvm_io_range {
 struct kvm_io_bus {
 	int dev_count;
 	int ioeventfd_count;
+	struct rcu_head rcu;
 	struct kvm_io_range range[];
 };
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eec82775c5bf..b7d4da8ba0b2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
 }
 EXPORT_SYMBOL_GPL(kvm_io_bus_read);
 
+static void __free_bus(struct rcu_head *rcu)
+{
+	struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
+
+	kfree(bus);
+}
+
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 			    int len, struct kvm_io_device *dev)
 {
@@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 	memcpy(new_bus->range + i + 1, bus->range + i,
 		(bus->dev_count - i) * sizeof(struct kvm_io_range));
 	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
-	synchronize_srcu_expedited(&kvm->srcu);
-	kfree(bus);
+	call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
 
 	return 0;
 }
-- 
2.50.0.rc2.761.g2dc52ea45b-goog




* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-06-24  9:22 ` [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser
@ 2025-06-24 15:11   ` Sean Christopherson
  2025-06-30  9:59     ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2025-06-24 15:11 UTC (permalink / raw)
  To: Keir Fraser
  Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
	Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing

+Li

On Tue, Jun 24, 2025, Keir Fraser wrote:
> Device MMIO registration may happen quite frequently during VM boot,
> and the SRCU synchronization each time has a measurable effect
> on VM startup time. In our experiments it can account for around 25%
> of a VM's startup time.
> 
> Replace the synchronization with a deferred free of the old kvm_io_bus
> structure.
> 
> Signed-off-by: Keir Fraser <keirf@google.com>
> ---
>  include/linux/kvm_host.h |  1 +
>  virt/kvm/kvm_main.c      | 10 ++++++++--
>  2 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3bde4fb5c6aa..28a63f1ad314 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -205,6 +205,7 @@ struct kvm_io_range {
>  struct kvm_io_bus {
>  	int dev_count;
>  	int ioeventfd_count;
> +	struct rcu_head rcu;
>  	struct kvm_io_range range[];
>  };
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index eec82775c5bf..b7d4da8ba0b2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
>  }
>  EXPORT_SYMBOL_GPL(kvm_io_bus_read);
>  
> +static void __free_bus(struct rcu_head *rcu)
> +{
> +	struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
> +
> +	kfree(bus);
> +}
> +
>  int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
>  			    int len, struct kvm_io_device *dev)
>  {
> @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
>  	memcpy(new_bus->range + i + 1, bus->range + i,
>  		(bus->dev_count - i) * sizeof(struct kvm_io_range));
>  	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> -	synchronize_srcu_expedited(&kvm->srcu);
> -	kfree(bus);
> +	call_srcu(&kvm->srcu, &bus->rcu, __free_bus);

I'm 99% certain this will break ABI.  KVM needs to ensure all readers are guaranteed
to see the new device prior to returning to userspace.  I'm quite confident there
are other flows that rely on the synchronization, the vGIC case is simply the one
that's documented.

https://lore.kernel.org/all/aAkAY40UbqzQNr8m@google.com



* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-06-24 15:11   ` Sean Christopherson
@ 2025-06-30  9:59     ` Keir Fraser
  2025-07-07 18:49       ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2025-06-30  9:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
	Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing

On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote:
> +Li
> 
> On Tue, Jun 24, 2025, Keir Fraser wrote:
> > Device MMIO registration may happen quite frequently during VM boot,
> > and the SRCU synchronization each time has a measurable effect
> > on VM startup time. In our experiments it can account for around 25%
> > of a VM's startup time.
> > 
> > Replace the synchronization with a deferred free of the old kvm_io_bus
> > structure.
> > 
> > Signed-off-by: Keir Fraser <keirf@google.com>
> > ---
> >  include/linux/kvm_host.h |  1 +
> >  virt/kvm/kvm_main.c      | 10 ++++++++--
> >  2 files changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3bde4fb5c6aa..28a63f1ad314 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -205,6 +205,7 @@ struct kvm_io_range {
> >  struct kvm_io_bus {
> >  	int dev_count;
> >  	int ioeventfd_count;
> > +	struct rcu_head rcu;
> >  	struct kvm_io_range range[];
> >  };
> >  
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index eec82775c5bf..b7d4da8ba0b2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_io_bus_read);
> >  
> > +static void __free_bus(struct rcu_head *rcu)
> > +{
> > +	struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
> > +
> > +	kfree(bus);
> > +}
> > +
> >  int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> >  			    int len, struct kvm_io_device *dev)
> >  {
> > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> >  	memcpy(new_bus->range + i + 1, bus->range + i,
> >  		(bus->dev_count - i) * sizeof(struct kvm_io_range));
> >  	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> > -	synchronize_srcu_expedited(&kvm->srcu);
> > -	kfree(bus);
> > +	call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> 
> I'm 99% certain this will break ABI.  KVM needs to ensure all readers are guaranteed
> to see the new device prior to returning to userspace.

I'm not sure I understand this. How can userspace (or a guest VCPU)
know that it is executing *after* the MMIO registration, except via
some form of synchronization or other ordering of its own? For
example, PCI BAR setup happens as part of PCI probing, early in
device registration in the guest OS, strictly before the MMIO region
is accessed. Otherwise the access is inherently racy

> I'm quite confident there
> are other flows that rely on the synchronization, the vGIC case is simply the one
> that's documented.

If they're in the kernel they can be fixed? If necessary I'll go audit the callers.

 Regards,
  Keir

> https://lore.kernel.org/all/aAkAY40UbqzQNr8m@google.com



* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-06-30  9:59     ` Keir Fraser
@ 2025-07-07 18:49       ` Sean Christopherson
  2025-07-10  6:51         ` Keir Fraser
  2025-07-15 12:43         ` Keir Fraser
  0 siblings, 2 replies; 11+ messages in thread
From: Sean Christopherson @ 2025-07-07 18:49 UTC (permalink / raw)
  To: Keir Fraser
  Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
	Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing

On Mon, Jun 30, 2025, Keir Fraser wrote:
> On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote:
> > +Li
> > 
> > On Tue, Jun 24, 2025, Keir Fraser wrote:
> > > Device MMIO registration may happen quite frequently during VM boot,
> > > and the SRCU synchronization each time has a measurable effect
> > > on VM startup time. In our experiments it can account for around 25%
> > > of a VM's startup time.
> > > 
> > > Replace the synchronization with a deferred free of the old kvm_io_bus
> > > structure.
> > > 
> > > Signed-off-by: Keir Fraser <keirf@google.com>
> > > ---
> > >  include/linux/kvm_host.h |  1 +
> > >  virt/kvm/kvm_main.c      | 10 ++++++++--
> > >  2 files changed, 9 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 3bde4fb5c6aa..28a63f1ad314 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -205,6 +205,7 @@ struct kvm_io_range {
> > >  struct kvm_io_bus {
> > >  	int dev_count;
> > >  	int ioeventfd_count;
> > > +	struct rcu_head rcu;
> > >  	struct kvm_io_range range[];
> > >  };
> > >  
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index eec82775c5bf..b7d4da8ba0b2 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_io_bus_read);
> > >  
> > > +static void __free_bus(struct rcu_head *rcu)
> > > +{
> > > +	struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
> > > +
> > > +	kfree(bus);
> > > +}
> > > +
> > >  int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > >  			    int len, struct kvm_io_device *dev)
> > >  {
> > > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > >  	memcpy(new_bus->range + i + 1, bus->range + i,
> > >  		(bus->dev_count - i) * sizeof(struct kvm_io_range));
> > >  	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> > > -	synchronize_srcu_expedited(&kvm->srcu);
> > > -	kfree(bus);
> > > +	call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> > 
> > I'm 99% certain this will break ABI.  KVM needs to ensure all readers are
> > guaranteed to see the new device prior to returning to userspace.
> 
> I'm not sure I understand this. How can userspace (or a guest VCPU) know that
> it is executing *after* the MMIO registration, except via some form of
> synchronization or other ordering of its own? For example, that PCI BAR setup
> happens as part of PCI probing happening early in device registration in the
> guest OS, strictly before the MMIO region will be accessed. Otherwise the
> access is inherently racy against the registration?

Yes, guest software needs its own synchronization.  What I am pointing out is that,
very strictly speaking, KVM relies on synchronize_srcu_expedited() to ensure that
KVM's emulation of MMIO accesses is correctly ordered with respect to the guest's
synchronization.

It's legal, though *extremely* uncommon, for KVM to emulate large swaths of guest
code, including emulated MMIO accesses.  If KVM grabs kvm->buses at the start of
an emulation block, and then uses that reference to resolve MMIO, it's theoretically
possible for KVM to mishandle an access due to using a stale bus.

Today, such scenarios are effectively prevented by synchronize_srcu_expedited().
Using kvm->buses outside of SRCU protection would be a bug (per KVM's locking
rules), i.e. a large emulation block must take and hold SRCU for its entire
duration.  And so waiting for all SRCU readers to go away ensures that the new
kvm->buses will be observed if KVM starts a new emulation block.

AFAIK, the only example of such emulation is x86's handle_invalid_guest_state().
And in practice, it's probably impossible for the compiler to keep a reference to
kvm->buses across multiple invocations of kvm_emulate_instruction() while still
honoring the READ_ONCE() in __rcu_dereference_check().
 
But I don't want to simply drop KVM's synchronization, because we need a rule of
some kind to ensure correct ordering, even if it's only for documentation purposes
for 99% of cases.  And because the existence of kvm_get_bus() means that it would
be possible for KVM to grab a long-term reference to kvm->buses and use it across
emulation of multiple instructions (though actually doing that would be all kinds
of crazy).

> > I'm quite confident there are other flows that rely on the synchronization,
> > the vGIC case is simply the one that's documented.
> 
> If they're in the kernel they can be fixed? If necessary I'll go audit the callers.

Yes, I'm sure there's a solution.  Thinking more about this, you make a good
point that KVM needs to order access with respect to instruction execution, not
with respect to the start of KVM_RUN.

For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is
disallowed (though I don't think this is formally documented), i.e. every
architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to
reading kvm->buses.  And srcu_read_lock() contains a full smp_mb(), which ensures
KVM will get a fresh kvm->buses relative to the instruction that triggered the
exit.

So for the common case of one-off accesses after a VM-Exit, I think we can simply
add calls to smp_mb__after_srcu_read_lock() (which is a nop on all architectures)
to formalize the dependency on reacquiring SRCU.  AFAICT, that would also suffice
for arm64's use of kvm_io_bus_get_dev().  And then add an explicit barrier of some
kind in handle_invalid_guest_state()?

Then to prevent grabbing long-term references to a bus, require kvm->slots_lock
in kvm_get_bus() (and special case the kfree() in VM destruction).

So something like this?  I think the barriers would pair with the smp_store_release()
in rcu_assign_pointer()?

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4953846cb30d..057fb4ce66b0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
                if (kvm_test_request(KVM_REQ_EVENT, vcpu))
                        return 1;
 
+               /* Or maybe smp_mb()?  Not sure what this needs to be. */
+               barrier();
+
                if (!kvm_emulate_instruction(vcpu, 0))
                        return 0;
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3bde4fb5c6aa..066438b6571a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -967,9 +967,8 @@ static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
 
 static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
 {
-       return srcu_dereference_check(kvm->buses[idx], &kvm->srcu,
-                                     lockdep_is_held(&kvm->slots_lock) ||
-                                     !refcount_read(&kvm->users_count));
+       return rcu_dereference_protected(kvm->buses[idx],
+                                        lockdep_is_held(&kvm->slots_lock));
 }
 
 static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eec82775c5bf..7b0e881351f7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1228,7 +1228,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 out_err_no_arch_destroy_vm:
        WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
        for (i = 0; i < KVM_NR_BUSES; i++)
-               kfree(kvm_get_bus(kvm, i));
+               kfree(srcu_dereference_check(kvm->buses[i], &kvm->srcu,
+                                            !refcount_read(&kvm->users_count)));
        kvm_free_irq_routing(kvm);
 out_err_no_irq_routing:
        cleanup_srcu_struct(&kvm->irq_srcu);
@@ -5847,6 +5848,9 @@ int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
                .len = len,
        };
 
+       /* comment goes here */
+       smp_mb__after_srcu_read_lock();
+
        bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
        if (!bus)
                return -ENOMEM;
@@ -5866,6 +5870,9 @@ int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx,
                .len = len,
        };
 
+       /* comment goes here */
+       smp_mb__after_srcu_read_lock();
+
        bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
        if (!bus)
                return -ENOMEM;
@@ -6025,6 +6032,9 @@ struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 
        srcu_idx = srcu_read_lock(&kvm->srcu);
 
+       /* comment goes here */
+       smp_mb__after_srcu_read_lock();
+
        bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu);
        if (!bus)
                goto out_unlock;





* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-07-07 18:49       ` Sean Christopherson
@ 2025-07-10  6:51         ` Keir Fraser
  2025-07-10 13:34           ` Sean Christopherson
  2025-07-15 12:43         ` Keir Fraser
  1 sibling, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2025-07-10  6:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
	Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing

On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote:
> On Mon, Jun 30, 2025, Keir Fraser wrote:
> > On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote:
> > > +Li
> > > 
> > > On Tue, Jun 24, 2025, Keir Fraser wrote:
> > > > Device MMIO registration may happen quite frequently during VM boot,
> > > > and the SRCU synchronization each time has a measurable effect
> > > > on VM startup time. In our experiments it can account for around 25%
> > > > of a VM's startup time.
> > > > 
> > > > Replace the synchronization with a deferred free of the old kvm_io_bus
> > > > structure.
> > > > 
> > > > Signed-off-by: Keir Fraser <keirf@google.com>
> > > > ---
> > > >  include/linux/kvm_host.h |  1 +
> > > >  virt/kvm/kvm_main.c      | 10 ++++++++--
> > > >  2 files changed, 9 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 3bde4fb5c6aa..28a63f1ad314 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -205,6 +205,7 @@ struct kvm_io_range {
> > > >  struct kvm_io_bus {
> > > >  	int dev_count;
> > > >  	int ioeventfd_count;
> > > > +	struct rcu_head rcu;
> > > >  	struct kvm_io_range range[];
> > > >  };
> > > >  
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index eec82775c5bf..b7d4da8ba0b2 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_io_bus_read);
> > > >  
> > > > +static void __free_bus(struct rcu_head *rcu)
> > > > +{
> > > > +	struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
> > > > +
> > > > +	kfree(bus);
> > > > +}
> > > > +
> > > >  int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > > >  			    int len, struct kvm_io_device *dev)
> > > >  {
> > > > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > > >  	memcpy(new_bus->range + i + 1, bus->range + i,
> > > >  		(bus->dev_count - i) * sizeof(struct kvm_io_range));
> > > >  	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> > > > -	synchronize_srcu_expedited(&kvm->srcu);
> > > > -	kfree(bus);
> > > > +	call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> > > 
> > > I'm 99% certain this will break ABI.  KVM needs to ensure all readers are
> > > guaranteed to see the new device prior to returning to userspace.
> > 
> > I'm not sure I understand this. How can userspace (or a guest VCPU) know that
> > it is executing *after* the MMIO registration, except via some form of
> > synchronization or other ordering of its own? For example, that PCI BAR setup
> > happens as part of PCI probing happening early in device registration in the
> > guest OS, strictly before the MMIO region will be accessed. Otherwise the
> > access is inherently racy against the registration?
> 
> Yes, guest software needs its own synchronization.  What I am pointing out is that,
> very strictly speaking, KVM relies on synchronize_srcu_expedited() to ensure that
> KVM's emulation of MMIO accesses is correctly ordered with respect to the guest's
> synchronization.
> 
> It's legal, though *extremely* uncommon, for KVM to emulate large swaths of guest
> code, including emulated MMIO accesses.  If KVM grabs kvm->buses at the start of
> an emulation block, and then uses that reference to resolve MMIO, it's theoretically
> possible for KVM to mishandle an access due to using a stale bus.

But it doesn't do that? I think I understand now though that you are
concerned about the buses API exposed to the kernel at large. And yes
I see this would be a problem for example if a kvm_get_bus() return
value was cached.

Will and I also had a brainstorm in the office and theorised that a
really "smart" compiler might somehow unroll
handle_invalid_guest_state() 130 times and hoist all the READ_ONCE()s
of the bus to the start. It's impractical, even likely impossible, but
we couldn't outright say it's disallowed by the current enforcements
in the KVM subsystem itself.

> Today, such scenarios are effectively prevented by synchronize_srcu_expedited().
> Using kvm->buses outside of SRCU protection would be a bug (per KVM's locking
> rules), i.e. a large emulation block must take and hold SRCU for its entire
> duration.  And so waiting for all SRCU readers to go away ensures that the new
> kvm->buses will be observed if KVM starts a new emulation block.

Understood. Yes that does make the current code definitely safe in
this regard, just slow!

> AFAIK, the only example of such emulation is x86's handle_invalid_guest_state().
> And in practice, it's probably impossible for the compiler to keep a reference to
> kvm->buses across multiple invocations of kvm_emulate_instruction() while still
> honoring the READ_ONCE() in __rcu_dereference_check().

That certainly stops compiler reordering. But I think I now agree with
your concern about needing an actual memory barrier. I am worried for
example if the guest is relying on an address dependency to
synchronise an MMIO access. For example:
     data = READ_ONCE(*READ_ONCE(mmio_base_addr));

Two loads, ordered by address dependency. But that dependency would
not prevent the access to kvm->buses[idx] from being
hoisted/speculated by the CPU, since it's not data-dependent on the
preceding load...

That said, on x86, loads are ordered anyway.

> But I don't want to simply drop KVM's synchronization, because we need a rule of
> some kind to ensure correct ordering, even if it's only for documentation purposes
> for 99% of cases.  And because the existence of kvm_get_bus() means that it would
> be possible for KVM to grab a long-term reference to kvm->buses and use it across
> emulation of multiple instructions (though actually doing that would be all kinds
> of crazy).

That seems reasonable, in terms of maintaining a fool-proof API to kvm->buses.

> 
> > > I'm quite confident there are other flows that rely on the synchronization,
> > > the vGIC case is simply the one that's documented.
> > 
> > If they're in the kernel they can be fixed? If necessary I'll go audit the callers.
> 
> Yes, I'm sure there's a solution.  Thinking more about this, you make a good
> point that KVM needs to order access with respect to instruction execution, not
> with respect to the start of KVM_RUN.
> 
> For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is
> disallowed (though I don't think this is formally documented), i.e. every
> architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to
> reading kvm->buses.  And srcu_read_lock() contains a full smp_mb(), which ensures
> KVM will get a fresh kvm->buses relative to the instruction that triggered the
> exit.
> 
> So for the common case of one-off accesses after a VM-Exit, I think we can simply
> add calls to smp_mb__after_srcu_read_lock() (which is a nop on all architectures)
> to formalize the dependency on reacquiring SRCU.  AFAICT, that would also suffice
> for arm64's use of kvm_io_bus_get_dev().  And then add an explicit barrier of some
> kind in handle_invalid_guest_state()?
> 
> Then to prevent grabbing long-term references to a bus, require kvm->slots_lock
> in kvm_get_bus() (and special case the kfree() in VM destruction).
> 
> So something like this?  I think the barriers would pair with the smp_store_release()
> in rcu_assign_pointer()?

It would certainly mean that kvm->buses would be accessed *after* any
preceding vCPU memory access. Including any that "observed" the
newly-registered IO region, somehow (lock acquisition, read of a flag
or base address, or whatever).

Would it be satisfactory to put a patch along the lines of your
suggestions below into a v2 of this patch series? I have made some
comments below.

> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4953846cb30d..057fb4ce66b0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>                 if (kvm_test_request(KVM_REQ_EVENT, vcpu))
>                         return 1;
>  
> +               /* Or maybe smp_mb()?  Not sure what this needs to be. */
> +               barrier();
> +

Looks weak, but maybe strong enough for x86? Maybe smp_rmb() would be a better statement of intent?

>                 if (!kvm_emulate_instruction(vcpu, 0))
>                         return 0;
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3bde4fb5c6aa..066438b6571a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -967,9 +967,8 @@ static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
>  
>  static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
>  {
> -       return srcu_dereference_check(kvm->buses[idx], &kvm->srcu,
> -                                     lockdep_is_held(&kvm->slots_lock) ||
> -                                     !refcount_read(&kvm->users_count));
> +       return rcu_dereference_protected(kvm->buses[idx],
> +                                        lockdep_is_held(&kvm->slots_lock));
>  }
>  
>  static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index eec82775c5bf..7b0e881351f7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1228,7 +1228,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  out_err_no_arch_destroy_vm:
>         WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
>         for (i = 0; i < KVM_NR_BUSES; i++)
> -               kfree(kvm_get_bus(kvm, i));
> +               kfree(rcu_dereference_check(kvm->buses[i], &kvm->srcu,
> +                                           !refcount_read(&kvm->users_count));

srcu_dereference_check()

>         kvm_free_irq_routing(kvm);
>  out_err_no_irq_routing:
>         cleanup_srcu_struct(&kvm->irq_srcu);
> @@ -5847,6 +5848,9 @@ int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
>                 .len = len,
>         };
>  
> +       /* comment goes here */
> +       smp_mb__after_srcu_read_lock();
> +
>         bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
>         if (!bus)
>                 return -ENOMEM;
> @@ -5866,6 +5870,9 @@ int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx,
>                 .len = len,
>         };
>  
> +       /* comment goes here */
> +       smp_mb__after_srcu_read_lock();
> +
>         bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
>         if (!bus)
>                 return -ENOMEM;
> @@ -6025,6 +6032,9 @@ struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
>  
>         srcu_idx = srcu_read_lock(&kvm->srcu);
>  
> +       /* comment goes here */
> +       smp_mb__after_srcu_read_lock();
> +
>         bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu);
>         if (!bus)
>                 goto out_unlock;
> 
> 

I guess kvm_io_bus_read() needs the same treatment as well? Perhaps the
barrier and dereference should be pulled into a helper, with the comment
in just one place?
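For what it's worth, such a helper might look something like this (a sketch
only -- the name and comment wording are placeholders, not actual KVM code):

```
/*
 * Callers must hold kvm->srcu.  The barrier pairs with the
 * smp_store_release() in rcu_assign_pointer() on the registration side,
 * ensuring that a vCPU which has observed the effects of a bus
 * registration also observes the new kvm->buses.  It is currently a nop
 * on all architectures.
 */
static inline struct kvm_io_bus *kvm_get_current_bus(struct kvm *kvm,
						     enum kvm_bus idx)
{
	smp_mb__after_srcu_read_lock();
	return srcu_dereference(kvm->buses[idx], &kvm->srcu);
}
```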

 -- Keir (with thanks to Will for brainstorming!)
 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-07-10  6:51         ` Keir Fraser
@ 2025-07-10 13:34           ` Sean Christopherson
  0 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2025-07-10 13:34 UTC (permalink / raw)
  To: Keir Fraser
  Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
	Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing

On Thu, Jul 10, 2025, Keir Fraser wrote:
> On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote:
> Would it be satisfactory to put a patch along the lines of your
> suggestions below into a v2 of this patch series?

Ya, works for me.

> I have made some comments below.
> 
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 4953846cb30d..057fb4ce66b0 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> >                 if (kvm_test_request(KVM_REQ_EVENT, vcpu))
> >                         return 1;
> >  
> > +               /* Or maybe smp_mb()?  Not sure what this needs to be. */
> > +               barrier();
> > +
> 
> Looks weak, but maybe strong enough for x86? Maybe smp_rmb() would be a
> better statement of intent?

Hmm, yeah, smp_rmb() is better.  I was thinking it just needs to be a compiler
barrier, to ensure KVM reads kvm->buses as needed for each emulated instruction.
But ignoring that x86 is strongly ordered, KVM also needs to ensure a store to
kvm->buses that is supposed to be observed by the next guest instruction is fully
visible before that instruction executes.

> 
> >                 if (!kvm_emulate_instruction(vcpu, 0))
> >                         return 0;
> >  

...

> I guess kvm_io_bus_read() needs the same treatment as well? Perhaps the
> barrier and dereference should be pulled into a helper, with the comment
> in just one place?

Ya, +1 to a helper.



* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-07-07 18:49       ` Sean Christopherson
  2025-07-10  6:51         ` Keir Fraser
@ 2025-07-15 12:43         ` Keir Fraser
  2025-07-15 14:01           ` Sean Christopherson
  1 sibling, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2025-07-15 12:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
	Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing

On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote:
> On Mon, Jun 30, 2025, Keir Fraser wrote:
> > On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote:
> > > +Li
> > > 
> > > On Tue, Jun 24, 2025, Keir Fraser wrote:
> > > > Device MMIO registration may happen quite frequently during VM boot,
> > > > and the SRCU synchronization each time has a measurable effect
> > > > on VM startup time. In our experiments it can account for around 25%
> > > > of a VM's startup time.
> > > > 
> > > > Replace the synchronization with a deferred free of the old kvm_io_bus
> > > > structure.
> > > > 
> > > > Signed-off-by: Keir Fraser <keirf@google.com>
> > > > ---
> > > >  include/linux/kvm_host.h |  1 +
> > > >  virt/kvm/kvm_main.c      | 10 ++++++++--
> > > >  2 files changed, 9 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 3bde4fb5c6aa..28a63f1ad314 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -205,6 +205,7 @@ struct kvm_io_range {
> > > >  struct kvm_io_bus {
> > > >  	int dev_count;
> > > >  	int ioeventfd_count;
> > > > +	struct rcu_head rcu;
> > > >  	struct kvm_io_range range[];
> > > >  };
> > > >  
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index eec82775c5bf..b7d4da8ba0b2 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(kvm_io_bus_read);
> > > >  
> > > > +static void __free_bus(struct rcu_head *rcu)
> > > > +{
> > > > +	struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
> > > > +
> > > > +	kfree(bus);
> > > > +}
> > > > +
> > > >  int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > > >  			    int len, struct kvm_io_device *dev)
> > > >  {
> > > > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > > >  	memcpy(new_bus->range + i + 1, bus->range + i,
> > > >  		(bus->dev_count - i) * sizeof(struct kvm_io_range));
> > > >  	rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> > > > -	synchronize_srcu_expedited(&kvm->srcu);
> > > > -	kfree(bus);
> > > > +	call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> > > 
> > > I'm 99% certain this will break ABI.  KVM needs to ensure all readers are
> > > guaranteed to see the new device prior to returning to userspace.
> > 
> > I'm not sure I understand this. How can userspace (or a guest VCPU) know that
> > it is executing *after* the MMIO registration, except via some form of
> > synchronization or other ordering of its own? For example, that PCI BAR setup
> > happens as part of PCI probing happening early in device registration in the
> > guest OS, strictly before the MMIO region will be accessed. Otherwise the
> > access is inherently racy against the registration?
> 
> Yes, guest software needs its own synchronization.  What I am pointing out is that,
> very strictly speaking, KVM relies on synchronize_srcu_expedited() to ensure that
> KVM's emulation of MMIO accesses are correctly ordered with respect to the guest's
> synchronization.
> 
> It's legal, though *extremely* uncommon, for KVM to emulate large swaths of guest
> code, including emulated MMIO accesses.  If KVM grabs kvm->buses at the start of
> an emulation block, and then uses that reference to resolve MMIO, it's theoretically
> possible for KVM to mishandle an access due to using a stale bus.
> 
> Today, such scenarios are effectively prevented by synchronize_srcu_expedited().
> Using kvm->buses outside of SRCU protection would be a bug (per KVM's locking
> rules), i.e. a large emulation block must take and hold SRCU for its entire
> duration.  And so waiting for all SRCU readers to go away ensures that the new
> kvm->buses will be observed if KVM starts a new emulation block.
> 
> AFAIK, the only example of such emulation is x86's handle_invalid_guest_state().
> And in practice, it's probably impossible for the compiler to keep a reference to
> kvm->buses across multiple invocations of kvm_emulate_instruction() while still
> honoring the READ_ONCE() in __rcu_dereference_check().
>  
> But I don't want to simply drop KVM's synchronization, because we need a rule of
> some kind to ensure correct ordering, even if it's only for documentation purposes
> for 99% of cases.  And because the existence of kvm_get_bus() means that it would
> be possible for KVM to grab a long-term reference to kvm->buses and use it across
> emulation of multiple instructions (though actually doing that would be all kinds
> of crazy).
> 
> > > I'm quite confident there are other flows that rely on the synchronization,
> > > the vGIC case is simply the one that's documented.
> > 
> > If they're in the kernel they can be fixed? If necessary I'll go audit the callers.
> 
> Yes, I'm sure there's a solution.  Thinking more about this, you make a good
> point that KVM needs to order access with respect to instruction execution, not
> with respect to the start of KVM_RUN.
> 
> For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is
> disallowed (though I don't think this is formally documented), i.e. every
> architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to
> reading kvm->buses.  And srcu_read_lock() contains a full smp_mb(), which ensures
> KVM will get a fresh kvm->buses relative to the instruction that triggered the
> exit.

I've got a new patch series ready to go, but thinking more about the
one-off accesses after a VM-Exit: I think VM-Exit is a barrier on all
architectures? That would mean the changes to include
smp_mb__after_srcu_read_lock() are unnecessary and confusing. Maybe I
can drop those hunks. What do you think?

 Regards,
 Keir

> 
> So for the common case of one-off accesses after a VM-Exit, I think we can simply
> add calls to smp_mb__after_srcu_read_lock() (which is a nop on all architectures)
> to formalize the dependency on reacquiring SRCU.  AFAICT, that would also suffice
> for arm64's use of kvm_io_bus_get_dev().  And then add an explicit barrier of some
> kind in handle_invalid_guest_state()?
> 
> Then to prevent grabbing long-term references to a bus, require kvm->slots_lock
> in kvm_get_bus() (and special case the kfree() in VM destruction).
> 
> So something like this?  I think the barriers would pair with the smp_store_release()
> in rcu_assign_pointer()?
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4953846cb30d..057fb4ce66b0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>                 if (kvm_test_request(KVM_REQ_EVENT, vcpu))
>                         return 1;
>  
> +               /* Or maybe smp_mb()?  Not sure what this needs to be. */
> +               barrier();
> +
>                 if (!kvm_emulate_instruction(vcpu, 0))
>                         return 0;
>  
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3bde4fb5c6aa..066438b6571a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -967,9 +967,8 @@ static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
>  
>  static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
>  {
> -       return srcu_dereference_check(kvm->buses[idx], &kvm->srcu,
> -                                     lockdep_is_held(&kvm->slots_lock) ||
> -                                     !refcount_read(&kvm->users_count));
> +       return rcu_dereference_protected(kvm->buses[idx],
> +                                        lockdep_is_held(&kvm->slots_lock));
>  }
>  
>  static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index eec82775c5bf..7b0e881351f7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1228,7 +1228,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
>  out_err_no_arch_destroy_vm:
>         WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
>         for (i = 0; i < KVM_NR_BUSES; i++)
> -               kfree(kvm_get_bus(kvm, i));
> +               kfree(rcu_dereference_check(kvm->buses[i], &kvm->srcu,
> +                                           !refcount_read(&kvm->users_count));
>         kvm_free_irq_routing(kvm);
>  out_err_no_irq_routing:
>         cleanup_srcu_struct(&kvm->irq_srcu);
> @@ -5847,6 +5848,9 @@ int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
>                 .len = len,
>         };
>  
> +       /* comment goes here */
> +       smp_mb__after_srcu_read_lock();
> +
>         bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
>         if (!bus)
>                 return -ENOMEM;
> @@ -5866,6 +5870,9 @@ int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx,
>                 .len = len,
>         };
>  
> +       /* comment goes here */
> +       smp_mb__after_srcu_read_lock();
> +
>         bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
>         if (!bus)
>                 return -ENOMEM;
> @@ -6025,6 +6032,9 @@ struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
>  
>         srcu_idx = srcu_read_lock(&kvm->srcu);
>  
> +       /* comment goes here */
> +       smp_mb__after_srcu_read_lock();
> +
>         bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu);
>         if (!bus)
>                 goto out_unlock;
> 
> 



* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
  2025-07-15 12:43         ` Keir Fraser
@ 2025-07-15 14:01           ` Sean Christopherson
  0 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2025-07-15 14:01 UTC (permalink / raw)
  To: Keir Fraser
  Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
	Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing

On Tue, Jul 15, 2025, Keir Fraser wrote:
> On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote:
> > For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is
> > disallowed (though I don't think this is formally documented), i.e. every
> > architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to
> > reading kvm->buses.  And srcu_read_lock() contains a full smp_mb(), which ensures
> > KVM will get a fresh kvm->buses relative to the instruction that triggered the
> > exit.
> 
> I've got a new patch series ready to go, but thinking more about the
> one-off accesses after a VM-Exit: I think VM-Exit is a barrier on all
> architectures? That would mean the changes to include
> smp_mb__after_srcu_read_lock() are unnecessary and confusing. Maybe I
> can drop those hunks. What do you think?

It's not.

commit 65a4de0ffd975af7e2ffc9acb875b6a8ae7ee1aa
Author:     Yan Zhao <yan.y.zhao@intel.com>
AuthorDate: Fri Mar 8 17:09:28 2024 -0800
Commit:     Sean Christopherson <seanjc@google.com>
CommitDate: Fri Jun 7 07:18:02 2024 -0700

    KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path
    
    Ensure a full memory barrier is emitted in the VM-Exit path, as a full
    barrier is required on Intel CPUs to evict WC buffers.  This will allow
    unconditionally honoring guest PAT on Intel CPUs that support self-snoop.



end of thread, other threads:[~2025-07-15 15:30 UTC | newest]

Thread overview: 11+ messages
2025-06-24  9:22 [PATCH 0/3] KVM: Speed up MMIO registrations Keir Fraser
2025-06-24  9:22 ` [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser
2025-06-24  9:22 ` [PATCH 2/3] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering Keir Fraser
2025-06-24  9:22 ` [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser
2025-06-24 15:11   ` Sean Christopherson
2025-06-30  9:59     ` Keir Fraser
2025-07-07 18:49       ` Sean Christopherson
2025-07-10  6:51         ` Keir Fraser
2025-07-10 13:34           ` Sean Christopherson
2025-07-15 12:43         ` Keir Fraser
2025-07-15 14:01           ` Sean Christopherson
