* [PATCH 0/3] KVM: Speed up MMIO registrations
@ 2025-06-24 9:22 Keir Fraser
2025-06-24 9:22 ` [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Keir Fraser @ 2025-06-24 9:22 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, kvm
Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
Paolo Bonzini, Keir Fraser
This series improves the performance of MMIO registration, which occurs
multiple times when booting a typical VM. The existing call to
synchronize_srcu() may block for a considerable time: in our tests it
accounted for approximately 25% of a VM's startup time.
The arm64 vgic code is cleaned up and made responsible for its own
setup synchronization, and the MMIO registration logic replaces
synchronize_srcu() with a deferred callback that frees the old
kvm_io_bus struct.
Keir Fraser (3):
KVM: arm64: vgic-init: Remove vgic_ready() macro
KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering
KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
arch/arm64/kvm/vgic/vgic-init.c | 14 +++-----------
include/kvm/arm_vgic.h | 1 -
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 10 ++++++++--
4 files changed, 12 insertions(+), 14 deletions(-)
--
2.50.0.rc2.761.g2dc52ea45b-goog
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro 2025-06-24 9:22 [PATCH 0/3] KVM: Speed up MMIO registrations Keir Fraser @ 2025-06-24 9:22 ` Keir Fraser 2025-06-24 9:22 ` [PATCH 2/3] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering Keir Fraser 2025-06-24 9:22 ` [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser 2 siblings, 0 replies; 11+ messages in thread From: Keir Fraser @ 2025-06-24 9:22 UTC (permalink / raw) To: linux-arm-kernel, linux-kernel, kvm Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Keir Fraser It is now used only within kvm_vgic_map_resources(). vgic_dist::ready is already written directly by this function, so it is clearer to bypass the macro for reads as well. Signed-off-by: Keir Fraser <keirf@google.com> --- arch/arm64/kvm/vgic/vgic-init.c | 5 ++--- include/kvm/arm_vgic.h | 1 - 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index eb1205654ac8..502b65049703 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -559,7 +559,6 @@ int vgic_lazy_init(struct kvm *kvm) * Also map the virtual CPU interface into the VM. * v2 calls vgic_init() if not already done. * v3 and derivatives return an error if the VGIC is not initialized. - * vgic_ready() returns true if this function has succeeded. 
*/ int kvm_vgic_map_resources(struct kvm *kvm) { @@ -568,12 +567,12 @@ int kvm_vgic_map_resources(struct kvm *kvm) gpa_t dist_base; int ret = 0; - if (likely(vgic_ready(kvm))) + if (likely(dist->ready)) return 0; mutex_lock(&kvm->slots_lock); mutex_lock(&kvm->arch.config_lock); - if (vgic_ready(kvm)) + if (dist->ready) goto out; if (!irqchip_in_kernel(kvm)) diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h index 4a34f7f0a864..233eaa6d1267 100644 --- a/include/kvm/arm_vgic.h +++ b/include/kvm/arm_vgic.h @@ -399,7 +399,6 @@ u64 vgic_v3_get_misr(struct kvm_vcpu *vcpu); #define irqchip_in_kernel(k) (!!((k)->arch.vgic.in_kernel)) #define vgic_initialized(k) ((k)->arch.vgic.initialized) -#define vgic_ready(k) ((k)->arch.vgic.ready) #define vgic_valid_spi(k, i) (((i) >= VGIC_NR_PRIVATE_IRQS) && \ ((i) < (k)->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS)) -- 2.50.0.rc2.761.g2dc52ea45b-goog ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/3] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering 2025-06-24 9:22 [PATCH 0/3] KVM: Speed up MMIO registrations Keir Fraser 2025-06-24 9:22 ` [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser @ 2025-06-24 9:22 ` Keir Fraser 2025-06-24 9:22 ` [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser 2 siblings, 0 replies; 11+ messages in thread From: Keir Fraser @ 2025-06-24 9:22 UTC (permalink / raw) To: linux-arm-kernel, linux-kernel, kvm Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Keir Fraser In preparation to remove synchronize_srcu() from MMIO registration, remove the distributor's dependency on this implicit barrier by direct acquire-release synchronization on the flag write and its lock-free check. Signed-off-by: Keir Fraser <keirf@google.com> --- arch/arm64/kvm/vgic/vgic-init.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c index 502b65049703..bc83672e461b 100644 --- a/arch/arm64/kvm/vgic/vgic-init.c +++ b/arch/arm64/kvm/vgic/vgic-init.c @@ -567,7 +567,7 @@ int kvm_vgic_map_resources(struct kvm *kvm) gpa_t dist_base; int ret = 0; - if (likely(dist->ready)) + if (likely(smp_load_acquire(&dist->ready))) return 0; mutex_lock(&kvm->slots_lock); @@ -598,14 +598,7 @@ int kvm_vgic_map_resources(struct kvm *kvm) goto out_slots; } - /* - * kvm_io_bus_register_dev() guarantees all readers see the new MMIO - * registration before returning through synchronize_srcu(), which also - * implies a full memory barrier. As such, marking the distributor as - * 'ready' here is guaranteed to be ordered after all vCPUs having seen - * a completely configured distributor. 
- */ - dist->ready = true; + smp_store_release(&dist->ready, true); goto out_slots; out: mutex_unlock(&kvm->arch.config_lock); -- 2.50.0.rc2.761.g2dc52ea45b-goog ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-06-24 9:22 [PATCH 0/3] KVM: Speed up MMIO registrations Keir Fraser 2025-06-24 9:22 ` [PATCH 1/3] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser 2025-06-24 9:22 ` [PATCH 2/3] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering Keir Fraser @ 2025-06-24 9:22 ` Keir Fraser 2025-06-24 15:11 ` Sean Christopherson 2 siblings, 1 reply; 11+ messages in thread From: Keir Fraser @ 2025-06-24 9:22 UTC (permalink / raw) To: linux-arm-kernel, linux-kernel, kvm Cc: Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Keir Fraser Device MMIO registration may happen quite frequently during VM boot, and the SRCU synchronization each time has a measurable effect on VM startup time. In our experiments it can account for around 25% of a VM's startup time. Replace the synchronization with a deferred free of the old kvm_io_bus structure. Signed-off-by: Keir Fraser <keirf@google.com> --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 10 ++++++++-- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3bde4fb5c6aa..28a63f1ad314 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -205,6 +205,7 @@ struct kvm_io_range { struct kvm_io_bus { int dev_count; int ioeventfd_count; + struct rcu_head rcu; struct kvm_io_range range[]; }; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index eec82775c5bf..b7d4da8ba0b2 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, } EXPORT_SYMBOL_GPL(kvm_io_bus_read); +static void __free_bus(struct rcu_head *rcu) +{ + struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu); + + kfree(bus); +} + int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, int len, struct kvm_io_device *dev) { @@ -5962,8 +5969,7 @@ 
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, memcpy(new_bus->range + i + 1, bus->range + i, (bus->dev_count - i) * sizeof(struct kvm_io_range)); rcu_assign_pointer(kvm->buses[bus_idx], new_bus); - synchronize_srcu_expedited(&kvm->srcu); - kfree(bus); + call_srcu(&kvm->srcu, &bus->rcu, __free_bus); return 0; } -- 2.50.0.rc2.761.g2dc52ea45b-goog ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-06-24 9:22 ` [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser @ 2025-06-24 15:11 ` Sean Christopherson 2025-06-30 9:59 ` Keir Fraser 0 siblings, 1 reply; 11+ messages in thread From: Sean Christopherson @ 2025-06-24 15:11 UTC (permalink / raw) To: Keir Fraser Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing +Li On Tue, Jun 24, 2025, Keir Fraser wrote: > Device MMIO registration may happen quite frequently during VM boot, > and the SRCU synchronization each time has a measurable effect > on VM startup time. In our experiments it can account for around 25% > of a VM's startup time. > > Replace the synchronization with a deferred free of the old kvm_io_bus > structure. > > Signed-off-by: Keir Fraser <keirf@google.com> > --- > include/linux/kvm_host.h | 1 + > virt/kvm/kvm_main.c | 10 ++++++++-- > 2 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 3bde4fb5c6aa..28a63f1ad314 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -205,6 +205,7 @@ struct kvm_io_range { > struct kvm_io_bus { > int dev_count; > int ioeventfd_count; > + struct rcu_head rcu; > struct kvm_io_range range[]; > }; > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index eec82775c5bf..b7d4da8ba0b2 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, > } > EXPORT_SYMBOL_GPL(kvm_io_bus_read); > > +static void __free_bus(struct rcu_head *rcu) > +{ > + struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu); > + > + kfree(bus); > +} > + > int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > int len, struct kvm_io_device *dev) > { > @@ -5962,8 +5969,7 @@ int 
kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > memcpy(new_bus->range + i + 1, bus->range + i, > (bus->dev_count - i) * sizeof(struct kvm_io_range)); > rcu_assign_pointer(kvm->buses[bus_idx], new_bus); > - synchronize_srcu_expedited(&kvm->srcu); > - kfree(bus); > + call_srcu(&kvm->srcu, &bus->rcu, __free_bus); I'm 99% certain this will break ABI. KVM needs to ensure all readers are guaranteed to see the new device prior to returning to userspace. I'm quite confident there are other flows that rely on the synchronization, the vGIC case is simply the one that's documented. https://lore.kernel.org/all/aAkAY40UbqzQNr8m@google.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-06-24 15:11 ` Sean Christopherson @ 2025-06-30 9:59 ` Keir Fraser 2025-07-07 18:49 ` Sean Christopherson 0 siblings, 1 reply; 11+ messages in thread From: Keir Fraser @ 2025-06-30 9:59 UTC (permalink / raw) To: Sean Christopherson Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote: > +Li > > On Tue, Jun 24, 2025, Keir Fraser wrote: > > Device MMIO registration may happen quite frequently during VM boot, > > and the SRCU synchronization each time has a measurable effect > > on VM startup time. In our experiments it can account for around 25% > > of a VM's startup time. > > > > Replace the synchronization with a deferred free of the old kvm_io_bus > > structure. > > > > Signed-off-by: Keir Fraser <keirf@google.com> > > --- > > include/linux/kvm_host.h | 1 + > > virt/kvm/kvm_main.c | 10 ++++++++-- > > 2 files changed, 9 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 3bde4fb5c6aa..28a63f1ad314 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -205,6 +205,7 @@ struct kvm_io_range { > > struct kvm_io_bus { > > int dev_count; > > int ioeventfd_count; > > + struct rcu_head rcu; > > struct kvm_io_range range[]; > > }; > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index eec82775c5bf..b7d4da8ba0b2 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, > > } > > EXPORT_SYMBOL_GPL(kvm_io_bus_read); > > > > +static void __free_bus(struct rcu_head *rcu) > > +{ > > + struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu); > > + > > + kfree(bus); > > +} > > + > > int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus 
bus_idx, gpa_t addr, > > int len, struct kvm_io_device *dev) > > { > > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > > memcpy(new_bus->range + i + 1, bus->range + i, > > (bus->dev_count - i) * sizeof(struct kvm_io_range)); > > rcu_assign_pointer(kvm->buses[bus_idx], new_bus); > > - synchronize_srcu_expedited(&kvm->srcu); > > - kfree(bus); > > + call_srcu(&kvm->srcu, &bus->rcu, __free_bus); > > I'm 99% certain this will break ABI. KVM needs to ensure all readers are guaranteed > to see the new device prior to returning to userspace. I'm not sure I understand this. How can userspace (or a guest VCPU) know that it is executing *after* the MMIO registration, except via some form of synchronization or other ordering of its own? For example, that PCI BAR setup happens as part of PCI probing happening early in device registration in the guest OS, strictly before the MMIO region will be accessed. Otherwise the access is inherently racy against the registration? > I'm quite confident there > are other flows that rely on the synchronization, the vGIC case is simply the one > that's documented. If they're in the kernel they can be fixed? If necessary I'll go audit the callers. Regards, Keir > https://lore.kernel.org/all/aAkAY40UbqzQNr8m@google.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-06-30 9:59 ` Keir Fraser @ 2025-07-07 18:49 ` Sean Christopherson 2025-07-10 6:51 ` Keir Fraser 2025-07-15 12:43 ` Keir Fraser 0 siblings, 2 replies; 11+ messages in thread From: Sean Christopherson @ 2025-07-07 18:49 UTC (permalink / raw) To: Keir Fraser Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing On Mon, Jun 30, 2025, Keir Fraser wrote: > On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote: > > +Li > > > > On Tue, Jun 24, 2025, Keir Fraser wrote: > > > Device MMIO registration may happen quite frequently during VM boot, > > > and the SRCU synchronization each time has a measurable effect > > > on VM startup time. In our experiments it can account for around 25% > > > of a VM's startup time. > > > > > > Replace the synchronization with a deferred free of the old kvm_io_bus > > > structure. > > > > > > Signed-off-by: Keir Fraser <keirf@google.com> > > > --- > > > include/linux/kvm_host.h | 1 + > > > virt/kvm/kvm_main.c | 10 ++++++++-- > > > 2 files changed, 9 insertions(+), 2 deletions(-) > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index 3bde4fb5c6aa..28a63f1ad314 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -205,6 +205,7 @@ struct kvm_io_range { > > > struct kvm_io_bus { > > > int dev_count; > > > int ioeventfd_count; > > > + struct rcu_head rcu; > > > struct kvm_io_range range[]; > > > }; > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > index eec82775c5bf..b7d4da8ba0b2 100644 > > > --- a/virt/kvm/kvm_main.c > > > +++ b/virt/kvm/kvm_main.c > > > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, > > > } > > > EXPORT_SYMBOL_GPL(kvm_io_bus_read); > > > > > > +static void __free_bus(struct rcu_head *rcu) > > > +{ > > > + struct kvm_io_bus 
*bus = container_of(rcu, struct kvm_io_bus, rcu); > > > + > > > + kfree(bus); > > > +} > > > + > > > int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > > > int len, struct kvm_io_device *dev) > > > { > > > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > > > memcpy(new_bus->range + i + 1, bus->range + i, > > > (bus->dev_count - i) * sizeof(struct kvm_io_range)); > > > rcu_assign_pointer(kvm->buses[bus_idx], new_bus); > > > - synchronize_srcu_expedited(&kvm->srcu); > > > - kfree(bus); > > > + call_srcu(&kvm->srcu, &bus->rcu, __free_bus); > > > > I'm 99% certain this will break ABI. KVM needs to ensure all readers are > > guaranteed to see the new device prior to returning to userspace. > > I'm not sure I understand this. How can userspace (or a guest VCPU) know that > it is executing *after* the MMIO registration, except via some form of > synchronization or other ordering of its own? For example, that PCI BAR setup > happens as part of PCI probing happening early in device registration in the > guest OS, strictly before the MMIO region will be accessed. Otherwise the > access is inherently racy against the registration? Yes, guest software needs its own synchronization. What I am pointing out is that, very strictly speaking, KVM relies on synchronize_srcu_expedited() to ensure that KVM's emulation of MMIO accesses are correctly ordered with respect to the guest's synchronization. It's legal, though *extremely* uncommon, for KVM to emulate large swaths of guest code, including emulated MMIO accesses. If KVM grabs kvm->buses at the start of an emulation block, and then uses that reference to resolve MMIO, it's theoretically possible for KVM to mishandle an access due to using a stale bus. Today, such scenarios are effectively prevented by synchronize_srcu_expedited(). Using kvm->buses outside of SRCU protection would be a bug (per KVM's locking rules), i.e. 
a large emulation block must take and hold SRCU for its entire duration. And so waiting for all SRCU readers to go away ensures that the new kvm->buses will be observed if KVM starts a new emulation block. AFAIK, the only example of such emulation is x86's handle_invalid_guest_state(). And in practice, it's probably impossible for the compiler to keep a reference to kvm->buses across multiple invocations of kvm_emulate_instruction() while still honoring the READ_ONCE() in __rcu_dereference_check(). But I don't want to simply drop KVM's synchronization, because we need a rule of some kind to ensure correct ordering, even if it's only for documentation purposes for 99% of cases. And because the existence of kvm_get_bus() means that it would be possible for KVM to grab a long-term reference to kvm->buses and use it across emulation of multiple instructions (though actually doing that would be all kinds of crazy). > > I'm quite confident there are other flows that rely on the synchronization, > > the vGIC case is simply the one that's documented. > > If they're in the kernel they can be fixed? If necessary I'll go audit the callers. Yes, I'm sure there's a solution. Thinking more about this, you make a good point that KVM needs to order access with respect to instruction execution, not with respect to the start of KVM_RUN. For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is disallowed (though I don't think this is formally documented), i.e. every architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to reading kvm->buses. And srcu_read_lock() contains a full smp_mb(), which ensures KVM will get a fresh kvm->buses relative to the instruction that triggered the exit. So for the common case of one-off accesses after a VM-Exit, I think we can simply add calls to smp_mb__after_srcu_read_lock() (which is a nop on all architectures) to formalize the dependency on reacquiring SRCU. 
AFAICT, that would also suffice for arm64's use of kvm_io_bus_get_dev(). And then add an explicit barrier of some kind in handle_invalid_guest_state()? Then to prevent grabbing long-term references to a bus, require kvm->slots_lock in kvm_get_bus() (and special case the kfree() in VM destruction). So something like this? I think the barriers would pair with the smp_store_release() in rcu_assign_pointer()? diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 4953846cb30d..057fb4ce66b0 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) if (kvm_test_request(KVM_REQ_EVENT, vcpu)) return 1; + /* Or maybe smp_mb()? Not sure what this needs to be. */ + barrier(); + if (!kvm_emulate_instruction(vcpu, 0)) return 0; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3bde4fb5c6aa..066438b6571a 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -967,9 +967,8 @@ static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm) static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx) { - return srcu_dereference_check(kvm->buses[idx], &kvm->srcu, - lockdep_is_held(&kvm->slots_lock) || - !refcount_read(&kvm->users_count)); + return rcu_dereference_protected(kvm->buses[idx], + lockdep_is_held(&kvm->slots_lock)); } static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index eec82775c5bf..7b0e881351f7 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1228,7 +1228,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) out_err_no_arch_destroy_vm: WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count)); for (i = 0; i < KVM_NR_BUSES; i++) - kfree(kvm_get_bus(kvm, i)); + kfree(rcu_dereference_check(kvm->buses[i], &kvm->srcu, + !refcount_read(&kvm->users_count)); kvm_free_irq_routing(kvm); 
out_err_no_irq_routing: cleanup_srcu_struct(&kvm->irq_srcu); @@ -5847,6 +5848,9 @@ int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, .len = len, }; + /* comment goes here */ + smp_mb__after_srcu_read_lock(); + bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu); if (!bus) return -ENOMEM; @@ -5866,6 +5870,9 @@ int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, .len = len, }; + /* comment goes here */ + smp_mb__after_srcu_read_lock(); + bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu); if (!bus) return -ENOMEM; @@ -6025,6 +6032,9 @@ struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx, srcu_idx = srcu_read_lock(&kvm->srcu); + /* comment goes here */ + smp_mb__after_srcu_read_lock(); + bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu); if (!bus) goto out_unlock; ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-07-07 18:49 ` Sean Christopherson @ 2025-07-10 6:51 ` Keir Fraser 2025-07-10 13:34 ` Sean Christopherson 2025-07-15 12:43 ` Keir Fraser 1 sibling, 1 reply; 11+ messages in thread From: Keir Fraser @ 2025-07-10 6:51 UTC (permalink / raw) To: Sean Christopherson Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote: > On Mon, Jun 30, 2025, Keir Fraser wrote: > > On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote: > > > +Li > > > > > > On Tue, Jun 24, 2025, Keir Fraser wrote: > > > > Device MMIO registration may happen quite frequently during VM boot, > > > > and the SRCU synchronization each time has a measurable effect > > > > on VM startup time. In our experiments it can account for around 25% > > > > of a VM's startup time. > > > > > > > > Replace the synchronization with a deferred free of the old kvm_io_bus > > > > structure. 
> > > > > > > > Signed-off-by: Keir Fraser <keirf@google.com> > > > > --- > > > > include/linux/kvm_host.h | 1 + > > > > virt/kvm/kvm_main.c | 10 ++++++++-- > > > > 2 files changed, 9 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > > index 3bde4fb5c6aa..28a63f1ad314 100644 > > > > --- a/include/linux/kvm_host.h > > > > +++ b/include/linux/kvm_host.h > > > > @@ -205,6 +205,7 @@ struct kvm_io_range { > > > > struct kvm_io_bus { > > > > int dev_count; > > > > int ioeventfd_count; > > > > + struct rcu_head rcu; > > > > struct kvm_io_range range[]; > > > > }; > > > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > > index eec82775c5bf..b7d4da8ba0b2 100644 > > > > --- a/virt/kvm/kvm_main.c > > > > +++ b/virt/kvm/kvm_main.c > > > > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, > > > > } > > > > EXPORT_SYMBOL_GPL(kvm_io_bus_read); > > > > > > > > +static void __free_bus(struct rcu_head *rcu) > > > > +{ > > > > + struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu); > > > > + > > > > + kfree(bus); > > > > +} > > > > + > > > > int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > > > > int len, struct kvm_io_device *dev) > > > > { > > > > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > > > > memcpy(new_bus->range + i + 1, bus->range + i, > > > > (bus->dev_count - i) * sizeof(struct kvm_io_range)); > > > > rcu_assign_pointer(kvm->buses[bus_idx], new_bus); > > > > - synchronize_srcu_expedited(&kvm->srcu); > > > > - kfree(bus); > > > > + call_srcu(&kvm->srcu, &bus->rcu, __free_bus); > > > > > > I'm 99% certain this will break ABI. KVM needs to ensure all readers are > > > guaranteed to see the new device prior to returning to userspace. > > > > I'm not sure I understand this. 
How can userspace (or a guest VCPU) know that > > it is executing *after* the MMIO registration, except via some form of > > synchronization or other ordering of its own? For example, that PCI BAR setup > > happens as part of PCI probing happening early in device registration in the > > guest OS, strictly before the MMIO region will be accessed. Otherwise the > > access is inherently racy against the registration? > > Yes, guest software needs its own synchronization. What I am pointing out is that, > very strictly speaking, KVM relies on synchronize_srcu_expedited() to ensure that > KVM's emulation of MMIO accesses are correctly ordered with respect to the guest's > synchronization. > > It's legal, though *extremely* uncommon, for KVM to emulate large swaths of guest > code, including emulated MMIO accesses. If KVM grabs kvm->buses at the start of > an emulation block, and then uses that reference to resolve MMIO, it's theoretically > possible for KVM to mishandle an access due to using a stale bus. But it doesn't do that? I think I understand now though that you are concerned about the buses API exposed to the kernel at large. And yes I see this would be a problem for example if a kvm_get_bus() return value was cached. Will and I also had a brainstorm in the office and theorised that a really "smart" compiler might somehow unroll handle_invalid_guest_state() 130 times and hoist all the READ_ONCE()s of the bus to the start. It's impractical, even likely impossible, but we couldn't outright say it's disallowed by the current enforcements in the KVM subsystem itself. > Today, such scenarios are effectively prevented by synchronize_srcu_expedited(). > Using kvm->buses outside of SRCU protection would be a bug (per KVM's locking > rules), i.e. a large emulation block must take and hold SRCU for its entire > duration. And so waiting for all SRCU readers to go away ensures that the new > kvm->buses will be observed if KVM starts a new emulation block. Understood. 
Yes that does make the current code definitely safe in this regard, just slow! > AFAIK, the only example of such emulation is x86's handle_invalid_guest_state(). > And in practice, it's probably impossible for the compiler to keep a reference to > kvm->buses across multiple invocations of kvm_emulate_instruction() while still > honoring the READ_ONCE() in __rcu_dereference_check(). That certainly stops compiler reordering. But I think I now agree with your concern about needing an actual memory barrier. I am worried for example if the guest is relying on an address dependency to synchronise an MMIO access. For example: data = READ_ONCE(*READ_ONCE(mmio_base_addr)); Two loads, ordered by address dependency. But that dependency would not prevent the access to kvm->buses[idx] from being hoisted/speculated by the CPU, since it's not data-dependent on the preceding load... That said, on x86, loads are ordered anyway. > But I don't want to simply drop KVM's synchronization, because we need a rule of > some kind to ensure correct ordering, even if it's only for documentation purposes > for 99% of cases. And because the existence of kvm_get_bus() means that it would > be possible for KVM to grab a long-term reference to kvm->buses and use it across > emulation of multiple instructions (though actually doing that would be all kinds > of crazy). That seems reasonable, in terms of maintaining a fool-proof API to kvm->buses. > > > > I'm quite confident there are other flows that rely on the synchronization, > > > the vGIC case is simply the one that's documented. > > > > If they're in the kernel they can be fixed? If necessary I'll go audit the callers. > > Yes, I'm sure there's a solution. Thinking more about this, you make a good > point that KVM needs to order access with respect to instruction execution, not > with respect to the start of KVM_RUN. 
> > For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is > disallowed (though I don't think this is formally documented), i.e. every > architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to > reading kvm->buses. And srcu_read_lock() contains a full smp_mb(), which ensures > KVM will get a fresh kvm->buses relative to the instruction that triggered the > exit. > > So for the common case of one-off accesses after a VM-Exit, I think we can simply > add calls to smp_mb__after_srcu_read_lock() (which is a nop on all architectures) > to formalize the dependency on reacquiring SRCU. AFAICT, that would also suffice > for arm64's use of kvm_io_bus_get_dev(). And then add an explicit barrier of some > kind in handle_invalid_guest_state()? > > Then to prevent grabbing long-term references to a bus, require kvm->slots_lock > in kvm_get_bus() (and special case the kfree() in VM destruction). > > So something like this? I think the barriers would pair with the smp_store_release() > in rcu_assign_pointer()? It would certainly mean that kvm->buses would be accessed *after* any preceding vCPU memory access. Including any that "observed" the newly-registered IO region, somehow (lock acquisition, read of a flag or base address, or whatever). Would it be satisfactory to put a patch along the lines of your suggestions below into a v2 of this patch series? I have made some comments below. > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 4953846cb30d..057fb4ce66b0 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) > if (kvm_test_request(KVM_REQ_EVENT, vcpu)) > return 1; > > + /* Or maybe smp_mb()? Not sure what this needs to be. */ > + barrier(); > + Looks weak but maybe strong enough for x86? Maybe smp_rmb() would be better statement of intention? 
> if (!kvm_emulate_instruction(vcpu, 0)) > return 0; > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 3bde4fb5c6aa..066438b6571a 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -967,9 +967,8 @@ static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm) > > static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx) > { > - return srcu_dereference_check(kvm->buses[idx], &kvm->srcu, > - lockdep_is_held(&kvm->slots_lock) || > - !refcount_read(&kvm->users_count)); > + return rcu_dereference_protected(kvm->buses[idx], > + lockdep_is_held(&kvm->slots_lock)); > } > > static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i) > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index eec82775c5bf..7b0e881351f7 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1228,7 +1228,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > out_err_no_arch_destroy_vm: > WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count)); > for (i = 0; i < KVM_NR_BUSES; i++) > - kfree(kvm_get_bus(kvm, i)); > + kfree(rcu_dereference_check(kvm->buses[i], &kvm->srcu, > + !refcount_read(&kvm->users_count)); srcu_dereference_check() > kvm_free_irq_routing(kvm); > out_err_no_irq_routing: > cleanup_srcu_struct(&kvm->irq_srcu); > @@ -5847,6 +5848,9 @@ int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, > .len = len, > }; > > + /* comment goes here */ > + smp_mb__after_srcu_read_lock(); > + > bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu); > if (!bus) > return -ENOMEM; > @@ -5866,6 +5870,9 @@ int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, > .len = len, > }; > > + /* comment goes here */ > + smp_mb__after_srcu_read_lock(); > + > bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu); > if (!bus) > return -ENOMEM; > @@ -6025,6 +6032,9 @@ struct kvm_io_device 
*kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx, > > srcu_idx = srcu_read_lock(&kvm->srcu); > > + /* comment goes here */ > + smp_mb__after_srcu_read_lock(); > + > bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu); > if (!bus) > goto out_unlock; > > I guess kvm_io_bus_read() is to be done as well? Perhaps the barrier and dereference should be pulled into a helper with the comment, just in one place? -- Keir (with thanks to Will for brainstorming!) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-07-10 6:51 ` Keir Fraser @ 2025-07-10 13:34 ` Sean Christopherson 0 siblings, 0 replies; 11+ messages in thread From: Sean Christopherson @ 2025-07-10 13:34 UTC (permalink / raw) To: Keir Fraser Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing On Thu, Jul 10, 2025, Keir Fraser wrote: > On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote: > Would it be satisfactory to put a patch along the lines of your > suggestions below into a v2 of this patch series? Ya, works for me. > I have made some comments below. > > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index 4953846cb30d..057fb4ce66b0 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) > > if (kvm_test_request(KVM_REQ_EVENT, vcpu)) > > return 1; > > > > + /* Or maybe smp_mb()? Not sure what this needs to be. */ > > + barrier(); > > + > > Looks weak but maybe strong enough for x86? Maybe smp_rmb() would be better > statement of intention? Hmm, yeah, smp_rmb() is better. I was thinking it just needs to be a compiler barrier, to ensure KVM reads kvm->buses as needed for each emulated instruction. But ignoring that x86 is strongly ordered, KVM also needs to ensure a store to kvm->buses that is supposed to be observed by the next guest instruction is fully visibile before that instruction executes. > > > if (!kvm_emulate_instruction(vcpu, 0)) > > return 0; > > ... > I guess kvm_io_bus_read() is to be done as well? Perhaps the barrier > and dereference should be pulled into a helper with the comment, just > in one place? Ya, +1 to a helper. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-07-07 18:49 ` Sean Christopherson 2025-07-10 6:51 ` Keir Fraser @ 2025-07-15 12:43 ` Keir Fraser 2025-07-15 14:01 ` Sean Christopherson 1 sibling, 1 reply; 11+ messages in thread From: Keir Fraser @ 2025-07-15 12:43 UTC (permalink / raw) To: Sean Christopherson Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote: > On Mon, Jun 30, 2025, Keir Fraser wrote: > > On Tue, Jun 24, 2025 at 08:11:49AM -0700, Sean Christopherson wrote: > > > +Li > > > > > > On Tue, Jun 24, 2025, Keir Fraser wrote: > > > > Device MMIO registration may happen quite frequently during VM boot, > > > > and the SRCU synchronization each time has a measurable effect > > > > on VM startup time. In our experiments it can account for around 25% > > > > of a VM's startup time. > > > > > > > > Replace the synchronization with a deferred free of the old kvm_io_bus > > > > structure. 
> > > > > > > > Signed-off-by: Keir Fraser <keirf@google.com> > > > > --- > > > > include/linux/kvm_host.h | 1 + > > > > virt/kvm/kvm_main.c | 10 ++++++++-- > > > > 2 files changed, 9 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > > index 3bde4fb5c6aa..28a63f1ad314 100644 > > > > --- a/include/linux/kvm_host.h > > > > +++ b/include/linux/kvm_host.h > > > > @@ -205,6 +205,7 @@ struct kvm_io_range { > > > > struct kvm_io_bus { > > > > int dev_count; > > > > int ioeventfd_count; > > > > + struct rcu_head rcu; > > > > struct kvm_io_range range[]; > > > > }; > > > > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > > > index eec82775c5bf..b7d4da8ba0b2 100644 > > > > --- a/virt/kvm/kvm_main.c > > > > +++ b/virt/kvm/kvm_main.c > > > > @@ -5924,6 +5924,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, > > > > } > > > > EXPORT_SYMBOL_GPL(kvm_io_bus_read); > > > > > > > > +static void __free_bus(struct rcu_head *rcu) > > > > +{ > > > > + struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu); > > > > + > > > > + kfree(bus); > > > > +} > > > > + > > > > int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > > > > int len, struct kvm_io_device *dev) > > > > { > > > > @@ -5962,8 +5969,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, > > > > memcpy(new_bus->range + i + 1, bus->range + i, > > > > (bus->dev_count - i) * sizeof(struct kvm_io_range)); > > > > rcu_assign_pointer(kvm->buses[bus_idx], new_bus); > > > > - synchronize_srcu_expedited(&kvm->srcu); > > > > - kfree(bus); > > > > + call_srcu(&kvm->srcu, &bus->rcu, __free_bus); > > > > > > I'm 99% certain this will break ABI. KVM needs to ensure all readers are > > > guaranteed to see the new device prior to returning to userspace. > > > > I'm not sure I understand this. 
How can userspace (or a guest VCPU) know that > > it is executing *after* the MMIO registration, except via some form of > > synchronization or other ordering of its own? For example, that PCI BAR setup > > happens as part of PCI probing happening early in device registration in the > > guest OS, strictly before the MMIO region will be accessed. Otherwise the > > access is inherently racy against the registration? > > Yes, guest software needs its own synchronization. What I am pointing out is that, > very strictly speaking, KVM relies on synchronize_srcu_expedited() to ensure that > KVM's emulation of MMIO accesses are correctly ordered with respect to the guest's > synchronization. > > It's legal, though *extremely* uncommon, for KVM to emulate large swaths of guest > code, including emulated MMIO accesses. If KVM grabs kvm->buses at the start of > an emulation block, and then uses that reference to resolve MMIO, it's theoretically > possible for KVM to mishandle an access due to using a stale bus. > > Today, such scenarios are effectively prevented by synchronize_srcu_expedited(). > Using kvm->buses outside of SRCU protection would be a bug (per KVM's locking > rules), i.e. a large emulation block must take and hold SRCU for its entire > duration. And so waiting for all SRCU readers to go away ensures that the new > kvm->buses will be observed if KVM starts a new emulation block. > > AFAIK, the only example of such emulation is x86's handle_invalid_guest_state(). > And in practice, it's probably impossible for the compiler to keep a reference to > kvm->buses across multiple invocations of kvm_emulate_instruction() while still > honoring the READ_ONCE() in __rcu_dereference_check(). > > But I don't want to simply drop KVM's synchronization, because we need a rule of > some kind to ensure correct ordering, even if it's only for documentation purposes > for 99% of cases. 
And because the existence of kvm_get_bus() means that it would > be possible for KVM to grab a long-term reference to kvm->buses and use it across > emulation of multiple instructions (though actually doing that would be all kinds > of crazy). > > > > I'm quite confident there are other flows that rely on the synchronization, > > > the vGIC case is simply the one that's documented. > > > > If they're in the kernel they can be fixed? If necessary I'll go audit the callers. > > Yes, I'm sure there's a solution. Thinking more about this, you make a good > point that KVM needs to order access with respect to instruction execution, not > with respect to the start of KVM_RUN. > > For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is > disallowed (though I don't think this is formally documented), i.e. every > architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to > reading kvm->buses. And srcu_read_lock() contains a full smp_mb(), which ensures > KVM will get a fresh kvm->buses relative to the instruction that triggered the > exit. I've got a new patch series ready to go, but thinking more about the one-off accesses after a VM-Exit: I think VM-Exit is a barrier on all architectures? That would mean the changes to include smp_mb__after_srcu_read_lock() are unnecessary and confusing. Maybe I can drop those hunks. What do you think? Regards, Keir > > So for the common case of one-off accesses after a VM-Exit, I think we can simply > add calls to smp_mb__after_srcu_read_lock() (which is a nop on all architectures) > to formalize the dependency on reacquiring SRCU. AFAICT, that would also suffice > for arm64's use of kvm_io_bus_get_dev(). And then add an explicit barrier of some > kind in handle_invalid_guest_state()? > > Then to prevent grabbing long-term references to a bus, require kvm->slots_lock > in kvm_get_bus() (and special case the kfree() in VM destruction). > > So something like this? 
I think the barriers would pair with the smp_store_release() > in rcu_assign_pointer()? > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 4953846cb30d..057fb4ce66b0 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -5861,6 +5861,9 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu) > if (kvm_test_request(KVM_REQ_EVENT, vcpu)) > return 1; > > + /* Or maybe smp_mb()? Not sure what this needs to be. */ > + barrier(); > + > if (!kvm_emulate_instruction(vcpu, 0)) > return 0; > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 3bde4fb5c6aa..066438b6571a 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -967,9 +967,8 @@ static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm) > > static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx) > { > - return srcu_dereference_check(kvm->buses[idx], &kvm->srcu, > - lockdep_is_held(&kvm->slots_lock) || > - !refcount_read(&kvm->users_count)); > + return rcu_dereference_protected(kvm->buses[idx], > + lockdep_is_held(&kvm->slots_lock)); > } > > static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i) > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index eec82775c5bf..7b0e881351f7 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1228,7 +1228,8 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname) > out_err_no_arch_destroy_vm: > WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count)); > for (i = 0; i < KVM_NR_BUSES; i++) > - kfree(kvm_get_bus(kvm, i)); > + kfree(rcu_dereference_check(kvm->buses[i], &kvm->srcu, > + !refcount_read(&kvm->users_count)); > kvm_free_irq_routing(kvm); > out_err_no_irq_routing: > cleanup_srcu_struct(&kvm->irq_srcu); > @@ -5847,6 +5848,9 @@ int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr, > .len = len, > }; > > + /* comment goes here */ > + 
smp_mb__after_srcu_read_lock(); > + > bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu); > if (!bus) > return -ENOMEM; > @@ -5866,6 +5870,9 @@ int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, > .len = len, > }; > > + /* comment goes here */ > + smp_mb__after_srcu_read_lock(); > + > bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu); > if (!bus) > return -ENOMEM; > @@ -6025,6 +6032,9 @@ struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx, > > srcu_idx = srcu_read_lock(&kvm->srcu); > > + /* comment goes here */ > + smp_mb__after_srcu_read_lock(); > + > bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu); > if (!bus) > goto out_unlock; > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() 2025-07-15 12:43 ` Keir Fraser @ 2025-07-15 14:01 ` Sean Christopherson 0 siblings, 0 replies; 11+ messages in thread From: Sean Christopherson @ 2025-07-15 14:01 UTC (permalink / raw) To: Keir Fraser Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing On Tue, Jul 15, 2025, Keir Fraser wrote: > On Mon, Jul 07, 2025 at 11:49:34AM -0700, Sean Christopherson wrote: > > For all intents and purposes, holding kvm->srcu across VM-Enter/VM-Exit is > > disallowed (though I don't think this is formally documented), i.e. every > > architecture is guaranteed to do srcu_read_lock() after a VM-Exit, prior to > > reading kvm->buses. And srcu_read_lock() contains a full smp_mb(), which ensures > > KVM will get a fresh kvm->buses relative to the instruction that triggered the > > exit. > > I've got a new patch series ready to go, but thinking more about the > one-off accesses after a VM-Exit: I think VM-Exit is a barrier on all > architectures? That would mean the changes to include > smp_mb__after_srcu_read_lock() are unnecessary and confusing. Maybe I > can drop those hunks. What do you think? It's not. commit 65a4de0ffd975af7e2ffc9acb875b6a8ae7ee1aa Author: Yan Zhao <yan.y.zhao@intel.com> AuthorDate: Fri Mar 8 17:09:28 2024 -0800 Commit: Sean Christopherson <seanjc@google.com> CommitDate: Fri Jun 7 07:18:02 2024 -0700 KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path Ensure a full memory barrier is emitted in the VM-Exit path, as a full barrier is required on Intel CPUs to evict WC buffers. This will allow unconditionally honoring guest PAT on Intel CPUs that support self-snoop. ^ permalink raw reply [flat|nested] 11+ messages in thread