* [PATCH v4 1/4] KVM: arm64: vgic-init: Remove vgic_ready() macro
2025-09-09 10:00 [PATCH v4 0/4] KVM: Speed up MMIO registrations Keir Fraser
@ 2025-09-09 10:00 ` Keir Fraser
2025-09-09 10:00 ` [PATCH v4 2/4] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering Keir Fraser
` (3 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Keir Fraser @ 2025-09-09 10:00 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, kvm
Cc: Sean Christopherson, Eric Auger, Oliver Upton, Marc Zyngier,
Will Deacon, Paolo Bonzini, Keir Fraser
It is now used only within kvm_vgic_map_resources(). vgic_dist::ready
is already written directly by this function, so it is clearer to
bypass the macro for reads as well.
Signed-off-by: Keir Fraser <keirf@google.com>
---
arch/arm64/kvm/vgic/vgic-init.c | 5 ++---
include/kvm/arm_vgic.h | 1 -
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 1e680ad6e863..3f207b5f80a5 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -554,7 +554,6 @@ int vgic_lazy_init(struct kvm *kvm)
* Also map the virtual CPU interface into the VM.
* v2 calls vgic_init() if not already done.
* v3 and derivatives return an error if the VGIC is not initialized.
- * vgic_ready() returns true if this function has succeeded.
*/
int kvm_vgic_map_resources(struct kvm *kvm)
{
@@ -563,12 +562,12 @@ int kvm_vgic_map_resources(struct kvm *kvm)
gpa_t dist_base;
int ret = 0;
- if (likely(vgic_ready(kvm)))
+ if (likely(dist->ready))
return 0;
mutex_lock(&kvm->slots_lock);
mutex_lock(&kvm->arch.config_lock);
- if (vgic_ready(kvm))
+ if (dist->ready)
goto out;
if (!irqchip_in_kernel(kvm))
diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 404883c7af6e..e7ffaf4bf2e7 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -406,7 +406,6 @@ u64 vgic_v3_get_misr(struct kvm_vcpu *vcpu);
#define irqchip_in_kernel(k) (!!((k)->arch.vgic.in_kernel))
#define vgic_initialized(k) ((k)->arch.vgic.initialized)
-#define vgic_ready(k) ((k)->arch.vgic.ready)
#define vgic_valid_spi(k, i) (((i) >= VGIC_NR_PRIVATE_IRQS) && \
((i) < (k)->arch.vgic.nr_spis + VGIC_NR_PRIVATE_IRQS))
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v4 2/4] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering
2025-09-09 10:00 [PATCH v4 0/4] KVM: Speed up MMIO registrations Keir Fraser
2025-09-09 10:00 ` [PATCH v4 1/4] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser
@ 2025-09-09 10:00 ` Keir Fraser
2025-09-09 10:00 ` [PATCH v4 3/4] KVM: Implement barriers before accessing kvm->buses[] on SRCU read paths Keir Fraser
` (2 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Keir Fraser @ 2025-09-09 10:00 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, kvm
Cc: Sean Christopherson, Eric Auger, Oliver Upton, Marc Zyngier,
Will Deacon, Paolo Bonzini, Keir Fraser
In preparation for removing synchronize_srcu() from MMIO registration,
remove the distributor's dependency on that implicit barrier by using
direct acquire/release synchronization on the flag write and its
lock-free check.
Signed-off-by: Keir Fraser <keirf@google.com>
---
arch/arm64/kvm/vgic/vgic-init.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index 3f207b5f80a5..ccccb5c04ac1 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -562,7 +562,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
gpa_t dist_base;
int ret = 0;
- if (likely(dist->ready))
+ if (likely(smp_load_acquire(&dist->ready)))
return 0;
mutex_lock(&kvm->slots_lock);
@@ -593,14 +593,7 @@ int kvm_vgic_map_resources(struct kvm *kvm)
goto out_slots;
}
- /*
- * kvm_io_bus_register_dev() guarantees all readers see the new MMIO
- * registration before returning through synchronize_srcu(), which also
- * implies a full memory barrier. As such, marking the distributor as
- * 'ready' here is guaranteed to be ordered after all vCPUs having seen
- * a completely configured distributor.
- */
- dist->ready = true;
+ smp_store_release(&dist->ready, true);
goto out_slots;
out:
mutex_unlock(&kvm->arch.config_lock);
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v4 3/4] KVM: Implement barriers before accessing kvm->buses[] on SRCU read paths
2025-09-09 10:00 [PATCH v4 0/4] KVM: Speed up MMIO registrations Keir Fraser
2025-09-09 10:00 ` [PATCH v4 1/4] KVM: arm64: vgic-init: Remove vgic_ready() macro Keir Fraser
2025-09-09 10:00 ` [PATCH v4 2/4] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering Keir Fraser
@ 2025-09-09 10:00 ` Keir Fraser
2025-09-09 10:00 ` [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser
2025-09-15 9:59 ` [PATCH v4 0/4] KVM: Speed up MMIO registrations Marc Zyngier
4 siblings, 0 replies; 15+ messages in thread
From: Keir Fraser @ 2025-09-09 10:00 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, kvm
Cc: Sean Christopherson, Eric Auger, Oliver Upton, Marc Zyngier,
Will Deacon, Paolo Bonzini, Keir Fraser
This ensures that, if a VCPU has "observed" that an IO registration has
occurred, the instruction currently being trapped or emulated will also
observe the IO registration.
At the same time, enforce that kvm_get_bus() is used only on the
update side, ensuring that a long-term reference cannot be obtained by
an SRCU reader.
Signed-off-by: Keir Fraser <keirf@google.com>
---
arch/x86/kvm/vmx/vmx.c | 7 +++++++
include/linux/kvm_host.h | 10 +++++++---
virt/kvm/kvm_main.c | 32 ++++++++++++++++++++++++++------
3 files changed, 40 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index aa157fe5b7b3..0bdf9405969a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5785,6 +5785,13 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
if (kvm_test_request(KVM_REQ_EVENT, vcpu))
return 1;
+ /*
+ * Ensure that any updates to kvm->buses[] observed by the
+ * previous instruction (emulated or otherwise) are also
+ * visible to the instruction KVM is about to emulate.
+ */
+ smp_rmb();
+
if (!kvm_emulate_instruction(vcpu, 0))
return 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 15656b7fba6c..e7d6111cf254 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -966,11 +966,15 @@ static inline bool kvm_dirty_log_manual_protect_and_init_set(struct kvm *kvm)
return !!(kvm->manual_dirty_log_protect & KVM_DIRTY_LOG_INITIALLY_SET);
}
+/*
+ * Get a bus reference under the update-side lock. No long-term SRCU reader
+ * references are permitted, to avoid stale reads vs concurrent IO
+ * registrations.
+ */
static inline struct kvm_io_bus *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
{
- return srcu_dereference_check(kvm->buses[idx], &kvm->srcu,
- lockdep_is_held(&kvm->slots_lock) ||
- !refcount_read(&kvm->users_count));
+ return rcu_dereference_protected(kvm->buses[idx],
+ lockdep_is_held(&kvm->slots_lock));
}
static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c07dd423458..870ad8ea93a7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1103,6 +1103,14 @@ void __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
{
}
+/* Called only on cleanup and destruction paths when there are no users. */
+static inline struct kvm_io_bus *kvm_get_bus_for_destruction(struct kvm *kvm,
+ enum kvm_bus idx)
+{
+ return rcu_dereference_protected(kvm->buses[idx],
+ !refcount_read(&kvm->users_count));
+}
+
static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
{
struct kvm *kvm = kvm_arch_alloc_vm();
@@ -1228,7 +1236,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_arch_destroy_vm:
WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
for (i = 0; i < KVM_NR_BUSES; i++)
- kfree(kvm_get_bus(kvm, i));
+ kfree(kvm_get_bus_for_destruction(kvm, i));
kvm_free_irq_routing(kvm);
out_err_no_irq_routing:
cleanup_srcu_struct(&kvm->irq_srcu);
@@ -1276,7 +1284,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_free_irq_routing(kvm);
for (i = 0; i < KVM_NR_BUSES; i++) {
- struct kvm_io_bus *bus = kvm_get_bus(kvm, i);
+ struct kvm_io_bus *bus = kvm_get_bus_for_destruction(kvm, i);
if (bus)
kvm_io_bus_destroy(bus);
@@ -5843,6 +5851,18 @@ static int __kvm_io_bus_write(struct kvm_vcpu *vcpu, struct kvm_io_bus *bus,
return -EOPNOTSUPP;
}
+static struct kvm_io_bus *kvm_get_bus_srcu(struct kvm *kvm, enum kvm_bus idx)
+{
+ /*
+ * Ensure that any updates to kvm_buses[] observed by the previous vCPU
+ * machine instruction are also visible to the vCPU machine instruction
+ * that triggered this call.
+ */
+ smp_mb__after_srcu_read_lock();
+
+ return srcu_dereference(kvm->buses[idx], &kvm->srcu);
+}
+
int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
int len, const void *val)
{
@@ -5855,7 +5875,7 @@ int kvm_io_bus_write(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
.len = len,
};
- bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
+ bus = kvm_get_bus_srcu(vcpu->kvm, bus_idx);
if (!bus)
return -ENOMEM;
r = __kvm_io_bus_write(vcpu, bus, &range, val);
@@ -5874,7 +5894,7 @@ int kvm_io_bus_write_cookie(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx,
.len = len,
};
- bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
+ bus = kvm_get_bus_srcu(vcpu->kvm, bus_idx);
if (!bus)
return -ENOMEM;
@@ -5924,7 +5944,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
.len = len,
};
- bus = srcu_dereference(vcpu->kvm->buses[bus_idx], &vcpu->kvm->srcu);
+ bus = kvm_get_bus_srcu(vcpu->kvm, bus_idx);
if (!bus)
return -ENOMEM;
r = __kvm_io_bus_read(vcpu, bus, &range, val);
@@ -6033,7 +6053,7 @@ struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
srcu_idx = srcu_read_lock(&kvm->srcu);
- bus = srcu_dereference(kvm->buses[bus_idx], &kvm->srcu);
+ bus = kvm_get_bus_srcu(kvm, bus_idx);
if (!bus)
goto out_unlock;
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2025-09-09 10:00 [PATCH v4 0/4] KVM: Speed up MMIO registrations Keir Fraser
` (2 preceding siblings ...)
2025-09-09 10:00 ` [PATCH v4 3/4] KVM: Implement barriers before accessing kvm->buses[] on SRCU read paths Keir Fraser
@ 2025-09-09 10:00 ` Keir Fraser
2026-02-13 15:42 ` Nikita Kalyazin
2025-09-15 9:59 ` [PATCH v4 0/4] KVM: Speed up MMIO registrations Marc Zyngier
4 siblings, 1 reply; 15+ messages in thread
From: Keir Fraser @ 2025-09-09 10:00 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, kvm
Cc: Sean Christopherson, Eric Auger, Oliver Upton, Marc Zyngier,
Will Deacon, Paolo Bonzini, Keir Fraser, Li RongQing
Device MMIO registration may happen quite frequently during VM boot,
and the SRCU synchronization each time has a measurable effect
on VM startup time. In our experiments it can account for around 25%
of a VM's startup time.
Replace the synchronization with a deferred free of the old kvm_io_bus
structure.
Tested-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Keir Fraser <keirf@google.com>
---
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 11 +++++++++--
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e7d6111cf254..103be35caf0d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -206,6 +206,7 @@ struct kvm_io_range {
struct kvm_io_bus {
int dev_count;
int ioeventfd_count;
+ struct rcu_head rcu;
struct kvm_io_range range[];
};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 870ad8ea93a7..bcef324ccbf2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1320,6 +1320,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
}
cleanup_srcu_struct(&kvm->irq_srcu);
+ srcu_barrier(&kvm->srcu);
cleanup_srcu_struct(&kvm->srcu);
#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
xa_destroy(&kvm->mem_attr_array);
@@ -5952,6 +5953,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
}
EXPORT_SYMBOL_GPL(kvm_io_bus_read);
+static void __free_bus(struct rcu_head *rcu)
+{
+ struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
+
+ kfree(bus);
+}
+
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
int len, struct kvm_io_device *dev)
{
@@ -5990,8 +5998,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
memcpy(new_bus->range + i + 1, bus->range + i,
(bus->dev_count - i) * sizeof(struct kvm_io_range));
rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
- synchronize_srcu_expedited(&kvm->srcu);
- kfree(bus);
+ call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
return 0;
}
--
2.51.0.384.g4c02a37b29-goog
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2025-09-09 10:00 ` [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser
@ 2026-02-13 15:42 ` Nikita Kalyazin
2026-02-13 23:20 ` Sean Christopherson
0 siblings, 1 reply; 15+ messages in thread
From: Nikita Kalyazin @ 2026-02-13 15:42 UTC (permalink / raw)
To: Keir Fraser, linux-arm-kernel, linux-kernel, kvm
Cc: Sean Christopherson, Eric Auger, Oliver Upton, Marc Zyngier,
Will Deacon, Paolo Bonzini, Li RongQing
On 09/09/2025 11:00, Keir Fraser wrote:
> Device MMIO registration may happen quite frequently during VM boot,
> and the SRCU synchronization each time has a measurable effect
> on VM startup time. In our experiments it can account for around 25%
> of a VM's startup time.
>
> Replace the synchronization with a deferred free of the old kvm_io_bus
> structure.
Hi,
We noticed that this change introduced a regression of ~20 ms to the
first KVM_CREATE_VCPU call of a VM, which is significant for our use case.
Before the patch:
45726 14:45:32.914330 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.000137>
45726 14:45:32.914533 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000046>
After the patch:
30295 14:47:08.057412 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.025182>
30295 14:47:08.082663 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000031>
The reason, as I understand it, is that the call_srcu() calls from
kvm_io_bus_register_dev() add callbacks to be invoked after a
normal GP, which is 10 ms with HZ=100. The subsequent
synchronize_srcu_expedited() called from kvm_swap_active_memslots()
(from KVM_CREATE_VCPU) has to wait for the normal GP to complete before
making progress. I don't fully understand why the delay is consistently
greater than one GP, but that's what we see across our testing scenarios.
I verified that the problem is mitigated if the GP is shortened by
configuring HZ=1000. In that case, the regression is on the order of 1 ms.
It looks like in our case we don't benefit much from the intended
optimisation, as the number of device MMIO registrations is limited and
they don't cost us much (each takes at most 16 us, but most commonly
~6 us):
firecracker 68452 [054] 3053.183991:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.184007:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc03aa190)
firecracker 68452 [054] 3053.184007:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.184014:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc03aa1b9)
firecracker 68452 [054] 3053.184015:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.184021:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc03aa1db)
firecracker 68452 [054] 3053.184028:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.184034:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc03ac957)
firecracker 68452 [054] 3053.184093:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.184099:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc03ab51a)
firecracker 68452 [054] 3053.184100:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.184106:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc03ab549)
firecracker 68452 [054] 3053.193145:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.193164:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc0348c9f)
firecracker 68452 [054] 3053.193165:
kprobes:kvm_io_bus_register_dev: (ffffffffc0348390)
firecracker 68452 [054] 3053.193171:
kprobes:kvm_io_bus_register_dev__return: (ffffffffc0348390 <-
ffffffffc0348c9f)
Our env:
- 6.18
- Arch: the analysis above is from x86, but ARM regressed very similarly
- CONFIG_HZ=100
- VMM: Firecracker (https://github.com/firecracker-microvm/firecracker)
I am not aware of a way to make it fast for both use cases and would be
more than happy to hear about possible solutions.
Thanks,
Nikita
>
> Tested-by: Li RongQing <lirongqing@baidu.com>
> Signed-off-by: Keir Fraser <keirf@google.com>
> ---
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 11 +++++++++--
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e7d6111cf254..103be35caf0d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -206,6 +206,7 @@ struct kvm_io_range {
> struct kvm_io_bus {
> int dev_count;
> int ioeventfd_count;
> + struct rcu_head rcu;
> struct kvm_io_range range[];
> };
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 870ad8ea93a7..bcef324ccbf2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1320,6 +1320,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
> kvm_free_memslots(kvm, &kvm->__memslots[i][1]);
> }
> cleanup_srcu_struct(&kvm->irq_srcu);
> + srcu_barrier(&kvm->srcu);
> cleanup_srcu_struct(&kvm->srcu);
> #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> xa_destroy(&kvm->mem_attr_array);
> @@ -5952,6 +5953,13 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
> }
> EXPORT_SYMBOL_GPL(kvm_io_bus_read);
>
> +static void __free_bus(struct rcu_head *rcu)
> +{
> + struct kvm_io_bus *bus = container_of(rcu, struct kvm_io_bus, rcu);
> +
> + kfree(bus);
> +}
> +
> int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> int len, struct kvm_io_device *dev)
> {
> @@ -5990,8 +5998,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> memcpy(new_bus->range + i + 1, bus->range + i,
> (bus->dev_count - i) * sizeof(struct kvm_io_range));
> rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> - synchronize_srcu_expedited(&kvm->srcu);
> - kfree(bus);
> + call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
>
> return 0;
> }
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-13 15:42 ` Nikita Kalyazin
@ 2026-02-13 23:20 ` Sean Christopherson
2026-02-16 17:53 ` Nikita Kalyazin
0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-13 23:20 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: Keir Fraser, linux-arm-kernel, linux-kernel, kvm, Eric Auger,
Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini,
Li RongQing
On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
>
>
> On 09/09/2025 11:00, Keir Fraser wrote:
> > Device MMIO registration may happen quite frequently during VM boot,
> > and the SRCU synchronization each time has a measurable effect
> > on VM startup time. In our experiments it can account for around 25%
> > of a VM's startup time.
> >
> > Replace the synchronization with a deferred free of the old kvm_io_bus
> > structure.
>
>
> Hi,
>
> We noticed that this change introduced a regression of ~20 ms to the first
> KVM_CREATE_VCPU call of a VM, which is significant for our use case.
>
> Before the patch:
> 45726 14:45:32.914330 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.000137>
> 45726 14:45:32.914533 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000046>
>
> After the patch:
> 30295 14:47:08.057412 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.025182>
> 30295 14:47:08.082663 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000031>
>
> The reason, as I understand it, is that the call_srcu() calls from
> kvm_io_bus_register_dev() add callbacks to be invoked after a normal
> GP, which is 10 ms with HZ=100. The subsequent synchronize_srcu_expedited()
> called from kvm_swap_active_memslots() (from KVM_CREATE_VCPU) has to wait
> for the normal GP to complete before making progress. I don't fully
> understand why the delay is consistently greater than one GP, but that's what
> we see across our testing scenarios.
>
> I verified that the problem is relaxed if the GP is reduced by configuring
> HZ=1000. In that case, the regression is in the order of 1 ms.
>
> It looks like in our case we don't benefit much from the intended
> optimisation, as the number of device MMIO registrations is limited and
> they don't cost us much (each takes at most 16 us, but most commonly ~6 us):
Maybe differences in platforms for arm64 vs x86?
> I am not aware of a way to make it fast for both use cases and would be more
> than happy to hear about possible solutions.
What if we key off of vCPUs being created? The motivation for Keir's change was
to avoid stalling during VM boot, i.e. *after* initial VM creation.
--
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 13 Feb 2026 15:15:01 -0800
Subject: [PATCH] KVM: Synchronize SRCU on I/O device registration if vCPUs
haven't been created
TODO: Write a changelog if this works.
Fixes: 7d9a0273c459 ("KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()")
Reported-by: Nikita Kalyazin <kalyazin@amazon.com>
Closes: https://lkml.kernel.org/r/a84ddba8-12da-489a-9dd1-ccdf7451a1ba%40amazon.com
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 571cf0d6ec01..043b1c3574ab 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6027,7 +6027,30 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
memcpy(new_bus->range + i + 1, bus->range + i,
(bus->dev_count - i) * sizeof(struct kvm_io_range));
rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
- call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
+
+ /*
+ * To optimize VM creation *and* boot time, use different tactics for
+ * safely freeing the old bus based on where the VM is at in its
+ * lifecycle. If vCPUs haven't yet been created, simply synchronize
+ * and free, as there are unlikely to be active SRCU readers; if not,
+ * defer freeing the bus via SRCU callback.
+ *
+ * If there are active SRCU readers, synchronizing will stall until the
+ * current grace period completes, which can meaningfully impact boot
+ * time for VMs that trigger a large number of registrations.
+ *
+ * If there aren't SRCU readers, using an SRCU callback can be a net
+ * negative due to starting a grace period of its own, which in turn
+ * can unnecessarily cause a future synchronization to stall. E.g. if
+ * devices are registered before memslots are created, then creating
+ * the first memslot will have to wait for a superfluous grace period.
+ */
+ if (!READ_ONCE(kvm->created_vcpus)) {
+ synchronize_srcu_expedited(&kvm->srcu);
+ kfree(bus);
+ } else {
+ call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
+ }
return 0;
}
base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
--
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-13 23:20 ` Sean Christopherson
@ 2026-02-16 17:53 ` Nikita Kalyazin
2026-02-17 19:07 ` Sean Christopherson
0 siblings, 1 reply; 15+ messages in thread
From: Nikita Kalyazin @ 2026-02-16 17:53 UTC (permalink / raw)
To: Sean Christopherson
Cc: Keir Fraser, linux-arm-kernel, linux-kernel, kvm, Eric Auger,
Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini,
Li RongQing
On 13/02/2026 23:20, Sean Christopherson wrote:
> On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
>>
>>
>> On 09/09/2025 11:00, Keir Fraser wrote:
>>> Device MMIO registration may happen quite frequently during VM boot,
>>> and the SRCU synchronization each time has a measurable effect
>>> on VM startup time. In our experiments it can account for around 25%
>>> of a VM's startup time.
>>>
>>> Replace the synchronization with a deferred free of the old kvm_io_bus
>>> structure.
>>
>>
>> Hi,
>>
>> We noticed that this change introduced a regression of ~20 ms to the first
>> KVM_CREATE_VCPU call of a VM, which is significant for our use case.
>>
>> Before the patch:
>> 45726 14:45:32.914330 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.000137>
>> 45726 14:45:32.914533 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000046>
>>
>> After the patch:
>> 30295 14:47:08.057412 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.025182>
>> 30295 14:47:08.082663 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000031>
>>
>> The reason, as I understand it, is that the call_srcu() calls from
>> kvm_io_bus_register_dev() add callbacks to be invoked after a normal
>> GP, which is 10 ms with HZ=100. The subsequent synchronize_srcu_expedited()
>> called from kvm_swap_active_memslots() (from KVM_CREATE_VCPU) has to wait
>> for the normal GP to complete before making progress. I don't fully
>> understand why the delay is consistently greater than one GP, but that's what
>> we see across our testing scenarios.
>>
>> I verified that the problem is relaxed if the GP is reduced by configuring
>> HZ=1000. In that case, the regression is in the order of 1 ms.
>>
>> It looks like in our case we don't benefit much from the intended
>> optimisation, as the number of device MMIO registrations is limited and
>> they don't cost us much (each takes at most 16 us, but most commonly ~6 us):
>
> Maybe differences in platforms for arm64 vs x86?
Tested on ARM, and indeed the kvm_io_bus_register_dev() calls occur after
KVM_CREATE_VCPU, and the patch produces a visible optimisation:
Without the patch (15-23 us per call):
firecracker 19916 [033] 404.518430:
probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
firecracker 19916 [033] 404.518446:
probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
firecracker 19916 [033] 404.518462:
probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.518495:
probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <-
ffff8000800a198c)
firecracker 19916 [032] 404.518498:
probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [033] 404.518521:
probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <-
ffff8000800a198c)
firecracker 19916 [033] 404.518524:
probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.518539:
probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <-
ffff8000800a6d2c)
firecracker 19916 [032] 404.526900:
probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [033] 404.526924:
probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <-
ffff800080060168)
firecracker 19916 [033] 404.526926:
probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.526941:
probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <-
ffff800080060168)
fc_vcpu 0 19924 [035] 404.530829:
probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
fc_vcpu 0 19924 [035] 404.530848:
probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <-
ffff80008009f6b4)
With the patch (1-6 us per call):
firecracker 22806 [032] 427.687157:
probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
firecracker 22806 [032] 427.687174:
probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
firecracker 22806 [032] 427.687193:
probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687196:
probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <-
ffff8000800a19cc)
firecracker 22806 [032] 427.687196:
probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687197:
probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <-
ffff8000800a19cc)
firecracker 22806 [032] 427.687201:
probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687202:
probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <-
ffff8000800a6d6c)
firecracker 22806 [029] 427.707660:
probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [029] 427.707666:
probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <-
ffff8000800601a8)
firecracker 22806 [029] 427.707667:
probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [029] 427.707668:
probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <-
ffff8000800601a8)
fc_vcpu 0 22829 [030] 427.711642:
probe:kvm_io_bus_register_dev: (ffff80008005f128)
fc_vcpu 0 22829 [030] 427.711645:
probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <-
ffff80008009f6f4)
Also, on ARM it is KVM_SET_USER_MEMORY_REGION (not KVM_CREATE_VCPU) that
is hit (but seemingly for the same reason):
45736 17:30:10.251430 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0,
flags=0, guest_phys_addr=0x80000000, memory_size=12884901888,
userspace_addr=0xfffcbedd6000}) = 0 <0.021021>
vs
30694 17:33:01.128985 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0,
flags=0, guest_phys_addr=0x80000000, memory_size=12884901888,
userspace_addr=0xfffc91fc9000}) = 0 <0.000016>
>
>> I am not aware of a way to make it fast for both use cases and would be more
>> than happy to hear about possible solutions.
>
> What if we key off of vCPUs being created? The motivation for Keir's change was
> to avoid stalling during VM boot, i.e. *after* initial VM creation.
It doesn't work as is on x86 because the delay we're seeing occurs after
created_vcpus gets incremented, so it doesn't allow us to differentiate
the two cases (below is kvm_vm_ioctl_create_vcpu):
kvm->created_vcpus++; // <===== incremented here
mutex_unlock(&kvm->lock);
vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
if (!vcpu) {
r = -ENOMEM;
goto vcpu_decrement;
}
BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page) {
r = -ENOMEM;
goto vcpu_free;
}
vcpu->run = page_address(page);
kvm_vcpu_init(vcpu, kvm, id);
r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
firecracker 583 [001] 151.297145:
probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76
([kernel.kallsyms])
6512de ioctl+0x32 (/mnt/host/firecracker)
d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
Also, given that ARM stalls after KVM_CREATE_VCPU (in
KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Fri, 13 Feb 2026 15:15:01 -0800
> Subject: [PATCH] KVM: Synchronize SRCU on I/O device registration if vCPUs
> haven't been created
>
> TODO: Write a changelog if this works.
>
> Fixes: 7d9a0273c459 ("KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()")
> Reported-by: Nikita Kalyazin <kalyazin@amazon.com>
> Closes: https://lkml.kernel.org/r/a84ddba8-12da-489a-9dd1-ccdf7451a1ba%40amazon.com
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 571cf0d6ec01..043b1c3574ab 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -6027,7 +6027,30 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> memcpy(new_bus->range + i + 1, bus->range + i,
> (bus->dev_count - i) * sizeof(struct kvm_io_range));
> rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> - call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> +
> + /*
> + * To optimize VM creation *and* boot time, use different tactics for
> + * safely freeing the old bus based on where the VM is at in its
> + * lifecycle. If vCPUs haven't yet been created, simply synchronize
> + * and free, as there are unlikely to be active SRCU readers; if not,
> + * defer freeing the bus via SRCU callback.
> + *
> + * If there are active SRCU readers, synchronizing will stall until the
> + * current grace period completes, which can meaningfully impact boot
> + * time for VMs that trigger a large number of registrations.
> + *
> + * If there aren't SRCU readers, using an SRCU callback can be a net
> + * negative due to starting a grace period of its own, which in turn
> + * can unnecessarily cause a future synchronization to stall. E.g. if
> + * devices are registered before memslots are created, then creating
> + * the first memslot will have to wait for a superfluous grace period.
> + */
> + if (!READ_ONCE(kvm->created_vcpus)) {
> + synchronize_srcu_expedited(&kvm->srcu);
> + kfree(bus);
> + } else {
> + call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> + }
>
> return 0;
> }
>
> base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
> --
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-16 17:53 ` Nikita Kalyazin
@ 2026-02-17 19:07 ` Sean Christopherson
2026-02-18 12:55 ` Nikita Kalyazin
0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-02-17 19:07 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: Keir Fraser, linux-arm-kernel, linux-kernel, kvm, Eric Auger,
Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini,
Li RongQing
On Mon, Feb 16, 2026, Nikita Kalyazin wrote:
> On 13/02/2026 23:20, Sean Christopherson wrote:
> > On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
> > > I am not aware of a way to make it fast for both use cases and would be more
> > > than happy to hear about possible solutions.
> >
> > What if we key off of vCPUs being created? The motivation for Keir's change was
> > to avoid stalling during VM boot, i.e. *after* initial VM creation.
>
> It doesn't work as is on x86 because the delay we're seeing occurs after the
> created_vcpus gets incremented
I don't follow, the suggestion was to key off created_vcpus in
kvm_io_bus_register_dev(), not in kvm_swap_active_memslots(). I can totally
imagine the patch not working, but the ordering in kvm_vm_ioctl_create_vcpu()
should be largely irrelevant.
Probably a moot point though.
> so it doesn't allow differentiating the two
> cases (below is kvm_vm_ioctl_create_vcpu):
>
> 	kvm->created_vcpus++; // <===== incremented here
> 	mutex_unlock(&kvm->lock);
>
> 	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
> 	if (!vcpu) {
> 		r = -ENOMEM;
> 		goto vcpu_decrement;
> 	}
>
> 	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
> 	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> 	if (!page) {
> 		r = -ENOMEM;
> 		goto vcpu_free;
> 	}
> 	vcpu->run = page_address(page);
>
> 	kvm_vcpu_init(vcpu, kvm, id);
>
> 	r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
>
>
> firecracker 583 [001] 151.297145: probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
> ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
> ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
> ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
> ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
> ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
> ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
> ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
> ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
> ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
> ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
> ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
> ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
> 6512de ioctl+0x32 (/mnt/host/firecracker)
> d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
>
> Also, given that it stumbles after the KVM_CREATE_VCPU on ARM (in
> KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
Hmm. Under the hood, __synchronize_srcu() itself uses __call_srcu, so I _think_
the only practical difference (aside from waiting, obviously) between call_srcu()
and synchronize_srcu_expedited() with respect to "transferring" grace period
latency is that using call_srcu() could start a normal, non-expedited grace period.
IIUC, SRCU has best-effort logic to shift in-flight non-expedited grace periods
to expedited mode, but if the normal grace period has already started the timer
for the delayed invocation of process_srcu(), then SRCU will still wait for one
jiffie, i.e. won't immediately queue the work.
I have no idea if this is sane and/or acceptable, but before looping in Paul and
others, can you try this to see if it helps?
diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 344ad51c8f6c..30437dc8d818 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -89,6 +89,8 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
void (*func)(struct rcu_head *head));
+void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
+ rcu_callback_t func);
void cleanup_srcu_struct(struct srcu_struct *ssp);
void synchronize_srcu(struct srcu_struct *ssp);
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index ea3f128de06f..03333b079092 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1493,6 +1493,13 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp,
}
EXPORT_SYMBOL_GPL(call_srcu);
+void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
+ rcu_callback_t func)
+{
+ __call_srcu(ssp, rhp, func, rcu_gp_is_normal());
+}
+EXPORT_SYMBOL_GPL(call_srcu_expedited);
+
/*
* Helper function for synchronize_srcu() and synchronize_srcu_expedited().
*/
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 737b74b15bb5..26215f98c98f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6036,7 +6036,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
memcpy(new_bus->range + i + 1, bus->range + i,
(bus->dev_count - i) * sizeof(struct kvm_io_range));
rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
- call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
+ call_srcu_expedited(&kvm->srcu, &bus->rcu, __free_bus);
return 0;
}
^ permalink raw reply related [flat|nested] 15+ messages in thread

* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-17 19:07 ` Sean Christopherson
@ 2026-02-18 12:55 ` Nikita Kalyazin
2026-02-18 16:02 ` Keir Fraser
0 siblings, 1 reply; 15+ messages in thread
From: Nikita Kalyazin @ 2026-02-18 12:55 UTC (permalink / raw)
To: Sean Christopherson
Cc: Keir Fraser, linux-arm-kernel, linux-kernel, kvm, Eric Auger,
Oliver Upton, Marc Zyngier, Will Deacon, Paolo Bonzini,
Li RongQing
On 17/02/2026 19:07, Sean Christopherson wrote:
> On Mon, Feb 16, 2026, Nikita Kalyazin wrote:
>> On 13/02/2026 23:20, Sean Christopherson wrote:
>>> On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
>>>> I am not aware of a way to make it fast for both use cases and would be more
>>>> than happy to hear about possible solutions.
>>>
>>> What if we key off of vCPUs being created? The motivation for Keir's change was
>>> to avoid stalling during VM boot, i.e. *after* initial VM creation.
>>
>> It doesn't work as is on x86 because the delay we're seeing occurs after the
>> created_vcpus gets incremented
>
> I don't follow, the suggestion was to key off created_vcpus in
> kvm_io_bus_register_dev(), not in kvm_swap_active_memslots(). I can totally
> imagine the patch not working, but the ordering in kvm_vm_ioctl_create_vcpu()
> should be largely irrelevant.
Yes, you're right, it's irrelevant. I had made the change in
kvm_io_bus_register_dev() as proposed, but I have no idea why I didn't
see the effect at the time. I retested it now and it clearly works on
x86. Sorry for the confusion.
>
> Probably a moot point though.
Yes, this will not solve the problem on ARM.
>
>> so it doesn't allow differentiating the two
>> cases (below is kvm_vm_ioctl_create_vcpu):
>>
>> 	kvm->created_vcpus++; // <===== incremented here
>> 	mutex_unlock(&kvm->lock);
>>
>> 	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
>> 	if (!vcpu) {
>> 		r = -ENOMEM;
>> 		goto vcpu_decrement;
>> 	}
>>
>> 	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
>> 	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>> 	if (!page) {
>> 		r = -ENOMEM;
>> 		goto vcpu_free;
>> 	}
>> 	vcpu->run = page_address(page);
>>
>> 	kvm_vcpu_init(vcpu, kvm, id);
>>
>> 	r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
>>
>>
>> firecracker 583 [001] 151.297145: probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
>> ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
>> ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
>> ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
>> ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
>> ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
>> ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
>> ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
>> ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
>> ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
>> ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
>> ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
>> ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
>> 6512de ioctl+0x32 (/mnt/host/firecracker)
>> d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
>>
>> Also, given that it stumbles after the KVM_CREATE_VCPU on ARM (in
>> KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
>
> Hmm. Under the hood, __synchronize_srcu() itself uses __call_srcu, so I _think_
> the only practical difference (aside from waiting, obviously) between call_srcu()
> and synchronize_srcu_expedited() with respect to "transferring" grace period
> latency is that using call_srcu() could start a normal, non-expedited grace period.
>
> IIUC, SRCU has best-effort logic to shift in-flight non-expedited grace periods
> to expedited mode, but if the normal grace period has already started the timer
> for the delayed invocation of process_srcu(), then SRCU will still wait for one
> jiffie, i.e. won't immediately queue the work.
>
> I have no idea if this is sane and/or acceptable, but before looping in Paul and
> others, can you try this to see if it helps?
That's exactly what I tried myself before and it didn't help, probably
for the reason you mentioned above (a normal GP being already started).
>
> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> index 344ad51c8f6c..30437dc8d818 100644
> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -89,6 +89,8 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
>
> void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
> void (*func)(struct rcu_head *head));
> +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
> + rcu_callback_t func);
> void cleanup_srcu_struct(struct srcu_struct *ssp);
> void synchronize_srcu(struct srcu_struct *ssp);
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index ea3f128de06f..03333b079092 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -1493,6 +1493,13 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp,
> }
> EXPORT_SYMBOL_GPL(call_srcu);
>
> +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
> + rcu_callback_t func)
> +{
> + __call_srcu(ssp, rhp, func, rcu_gp_is_normal());
> +}
> +EXPORT_SYMBOL_GPL(call_srcu_expedited);
> +
> /*
> * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 737b74b15bb5..26215f98c98f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -6036,7 +6036,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> memcpy(new_bus->range + i + 1, bus->range + i,
> (bus->dev_count - i) * sizeof(struct kvm_io_range));
> rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> - call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> + call_srcu_expedited(&kvm->srcu, &bus->rcu, __free_bus);
>
> return 0;
> }
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-18 12:55 ` Nikita Kalyazin
@ 2026-02-18 16:02 ` Keir Fraser
2026-02-18 16:15 ` Nikita Kalyazin
0 siblings, 1 reply; 15+ messages in thread
From: Keir Fraser @ 2026-02-18 16:02 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: Sean Christopherson, linux-arm-kernel, linux-kernel, kvm,
Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
Paolo Bonzini, Li RongQing
On Wed, Feb 18, 2026 at 12:55:11PM +0000, Nikita Kalyazin wrote:
>
>
> On 17/02/2026 19:07, Sean Christopherson wrote:
> > On Mon, Feb 16, 2026, Nikita Kalyazin wrote:
> > > On 13/02/2026 23:20, Sean Christopherson wrote:
> > > > On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
> > > > > I am not aware of a way to make it fast for both use cases and would be more
> > > > > than happy to hear about possible solutions.
> > > >
> > > > What if we key off of vCPUs being created? The motivation for Keir's change was
> > > > to avoid stalling during VM boot, i.e. *after* initial VM creation.
> > >
> > > It doesn't work as is on x86 because the delay we're seeing occurs after the
> > > created_vcpus gets incremented
> >
> > I don't follow, the suggestion was to key off created_vcpus in
> > kvm_io_bus_register_dev(), not in kvm_swap_active_memslots(). I can totally
> > imagine the patch not working, but the ordering in kvm_vm_ioctl_create_vcpu()
> > should be largely irrelevant.
>
> Yes, you're right, it's irrelevant. I had made the change in
> kvm_io_bus_register_dev() like proposed, but have no idea how I couldn't see
> the effect. I retested it now and it's obvious that it works on x86. Sorry
> for the confusion.
>
> >
> > Probably a moot point though.
>
> Yes, this will not solve the problem on ARM.
Sorry for being late to this thread. I'm a bit confused now. Did
Sean's original patch (reintroducing the old logic, based on whether
any vcpus have been created) work for both/either/neither arch? I
would have expected it to work for both ARM and X86, despite the
offending synchronize_srcu() not being in the vcpu-creation ioctl on
ARM, and I think that is finally what your testing seems to show? If
so then that seems the pragmatic if somewhat ugly way forward.
Cheers,
Keir
> >
> > > so it doesn't allow differentiating the two
> > > cases (below is kvm_vm_ioctl_create_vcpu):
> > >
> > > 	kvm->created_vcpus++; // <===== incremented here
> > > 	mutex_unlock(&kvm->lock);
> > >
> > > 	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
> > > 	if (!vcpu) {
> > > 		r = -ENOMEM;
> > > 		goto vcpu_decrement;
> > > 	}
> > >
> > > 	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
> > > 	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > > 	if (!page) {
> > > 		r = -ENOMEM;
> > > 		goto vcpu_free;
> > > 	}
> > > 	vcpu->run = page_address(page);
> > >
> > > 	kvm_vcpu_init(vcpu, kvm, id);
> > >
> > > 	r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
> > >
> > >
> > > firecracker 583 [001] 151.297145: probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
> > > ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
> > > ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
> > > ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
> > > ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
> > > ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
> > > ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
> > > ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
> > > ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
> > > ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
> > > ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
> > > ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
> > > ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
> > > 6512de ioctl+0x32 (/mnt/host/firecracker)
> > > d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
> > >
> > > Also, given that it stumbles after the KVM_CREATE_VCPU on ARM (in
> > > KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
> >
> > Hmm. Under the hood, __synchronize_srcu() itself uses __call_srcu, so I _think_
> > the only practical difference (aside from waiting, obviously) between call_srcu()
> > and synchronize_srcu_expedited() with respect to "transferring" grace period
> > latency is that using call_srcu() could start a normal, non-expedited grace period.
> >
> > IIUC, SRCU has best-effort logic to shift in-flight non-expedited grace periods
> > to expedited mode, but if the normal grace period has already started the timer
> > for the delayed invocation of process_srcu(), then SRCU will still wait for one
> > jiffie, i.e. won't immediately queue the work.
> >
> > I have no idea if this is sane and/or acceptable, but before looping in Paul and
> > others, can you try this to see if it helps?
>
> That's exactly what I tried myself before and it didn't help, probably for
> the reason you mentioned above (a normal GP being already started).
>
> >
> > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > index 344ad51c8f6c..30437dc8d818 100644
> > --- a/include/linux/srcu.h
> > +++ b/include/linux/srcu.h
> > @@ -89,6 +89,8 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
> >
> > void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
> > void (*func)(struct rcu_head *head));
> > +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
> > + rcu_callback_t func);
> > void cleanup_srcu_struct(struct srcu_struct *ssp);
> > void synchronize_srcu(struct srcu_struct *ssp);
> >
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index ea3f128de06f..03333b079092 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -1493,6 +1493,13 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp,
> > }
> > EXPORT_SYMBOL_GPL(call_srcu);
> >
> > +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
> > + rcu_callback_t func)
> > +{
> > + __call_srcu(ssp, rhp, func, rcu_gp_is_normal());
> > +}
> > +EXPORT_SYMBOL_GPL(call_srcu_expedited);
> > +
> > /*
> > * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> > */
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 737b74b15bb5..26215f98c98f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -6036,7 +6036,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > memcpy(new_bus->range + i + 1, bus->range + i,
> > (bus->dev_count - i) * sizeof(struct kvm_io_range));
> > rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> > - call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> > + call_srcu_expedited(&kvm->srcu, &bus->rcu, __free_bus);
> >
> > return 0;
> > }
>
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-18 16:02 ` Keir Fraser
@ 2026-02-18 16:15 ` Nikita Kalyazin
2026-02-19 7:50 ` Keir Fraser
0 siblings, 1 reply; 15+ messages in thread
From: Nikita Kalyazin @ 2026-02-18 16:15 UTC (permalink / raw)
To: Keir Fraser
Cc: Sean Christopherson, linux-arm-kernel, linux-kernel, kvm,
Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
Paolo Bonzini, Li RongQing
On 18/02/2026 16:02, Keir Fraser wrote:
> On Wed, Feb 18, 2026 at 12:55:11PM +0000, Nikita Kalyazin wrote:
>>
>>
>> On 17/02/2026 19:07, Sean Christopherson wrote:
>>> On Mon, Feb 16, 2026, Nikita Kalyazin wrote:
>>>> On 13/02/2026 23:20, Sean Christopherson wrote:
>>>>> On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
>>>>>> I am not aware of a way to make it fast for both use cases and would be more
>>>>>> than happy to hear about possible solutions.
>>>>>
>>>>> What if we key off of vCPUs being created? The motivation for Keir's change was
>>>>> to avoid stalling during VM boot, i.e. *after* initial VM creation.
>>>>
>>>> It doesn't work as is on x86 because the delay we're seeing occurs after the
>>>> created_vcpus gets incremented
>>>
>>> I don't follow, the suggestion was to key off created_vcpus in
>>> kvm_io_bus_register_dev(), not in kvm_swap_active_memslots(). I can totally
>>> imagine the patch not working, but the ordering in kvm_vm_ioctl_create_vcpu()
>>> should be largely irrelevant.
>>
>> Yes, you're right, it's irrelevant. I had made the change in
>> kvm_io_bus_register_dev() like proposed, but have no idea how I couldn't see
>> the effect. I retested it now and it's obvious that it works on x86. Sorry
>> for the confusion.
>>
>>>
>>> Probably a moot point though.
>>
>> Yes, this will not solve the problem on ARM.
>
> Sorry for being late to this thread. I'm a bit confused now. Did
> Sean's original patch (reintroducing the old logic, based on whether
> any vcpus have been created) work for both/either/neither arch? I
> would have expected it to work for both ARM and X86, despite the
> offending synchronize_srcu() not being in the vcpu-creation ioctl on
> ARM, and I think that is finally what your testing seems to show? If
> so then that seems the pragmatic if somewhat ugly way forward.
The original patch from Sean works for x86. I didn't test it on ARM as
it's harder for me to do, but I don't expect it to work because it only
affects the pre-vcpu-creation phase.
We discussed the second patch at the KVM sync earlier today, then I
retested it and it appears to solve the issue for both architectures;
I'll have more complete results tomorrow.
Are you by chance able to check whether KVM_SET_USER_MEMORY_REGION
execution time increases on ARM in your environment (with the 4/4
patch)? If it doesn't, I'd be curious to know why not.
>
> Cheers,
> Keir
>
>
>>>
>>>> so it doesn't allow differentiating the two
>>>> cases (below is kvm_vm_ioctl_create_vcpu):
>>>>
>>>> 	kvm->created_vcpus++; // <===== incremented here
>>>> 	mutex_unlock(&kvm->lock);
>>>>
>>>> 	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
>>>> 	if (!vcpu) {
>>>> 		r = -ENOMEM;
>>>> 		goto vcpu_decrement;
>>>> 	}
>>>>
>>>> 	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
>>>> 	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>>>> 	if (!page) {
>>>> 		r = -ENOMEM;
>>>> 		goto vcpu_free;
>>>> 	}
>>>> 	vcpu->run = page_address(page);
>>>>
>>>> 	kvm_vcpu_init(vcpu, kvm, id);
>>>>
>>>> 	r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
>>>>
>>>>
>>>> firecracker 583 [001] 151.297145: probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
>>>> ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
>>>> ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
>>>> ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
>>>> ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
>>>> ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
>>>> ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
>>>> ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
>>>> ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
>>>> ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
>>>> ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
>>>> ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
>>>> ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
>>>> 6512de ioctl+0x32 (/mnt/host/firecracker)
>>>> d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
>>>>
>>>> Also, given that it stumbles after the KVM_CREATE_VCPU on ARM (in
>>>> KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
>>>
>>> Hmm. Under the hood, __synchronize_srcu() itself uses __call_srcu, so I _think_
>>> the only practical difference (aside from waiting, obviously) between call_srcu()
>>> and synchronize_srcu_expedited() with respect to "transferring" grace period
>>> latency is that using call_srcu() could start a normal, non-expedited grace period.
>>>
>>> IIUC, SRCU has best-effort logic to shift in-flight non-expedited grace periods
>>> to expedited mode, but if the normal grace period has already started the timer
>>> for the delayed invocation of process_srcu(), then SRCU will still wait for one
>>> jiffie, i.e. won't immediately queue the work.
>>>
>>> I have no idea if this is sane and/or acceptable, but before looping in Paul and
>>> others, can you try this to see if it helps?
>>
>> That's exactly what I tried myself before and it didn't help, probably for
>> the reason you mentioned above (a normal GP being already started).
>>
>>>
>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>>> index 344ad51c8f6c..30437dc8d818 100644
>>> --- a/include/linux/srcu.h
>>> +++ b/include/linux/srcu.h
>>> @@ -89,6 +89,8 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
>>>
>>> void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
>>> void (*func)(struct rcu_head *head));
>>> +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
>>> + rcu_callback_t func);
>>> void cleanup_srcu_struct(struct srcu_struct *ssp);
>>> void synchronize_srcu(struct srcu_struct *ssp);
>>>
>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>> index ea3f128de06f..03333b079092 100644
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -1493,6 +1493,13 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp,
>>> }
>>> EXPORT_SYMBOL_GPL(call_srcu);
>>>
>>> +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
>>> + rcu_callback_t func)
>>> +{
>>> + __call_srcu(ssp, rhp, func, rcu_gp_is_normal());
>>> +}
>>> +EXPORT_SYMBOL_GPL(call_srcu_expedited);
>>> +
>>> /*
>>> * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
>>> */
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 737b74b15bb5..26215f98c98f 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -6036,7 +6036,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
>>> memcpy(new_bus->range + i + 1, bus->range + i,
>>> (bus->dev_count - i) * sizeof(struct kvm_io_range));
>>> rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
>>> - call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
>>> + call_srcu_expedited(&kvm->srcu, &bus->rcu, __free_bus);
>>>
>>> return 0;
>>> }
>>
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-18 16:15 ` Nikita Kalyazin
@ 2026-02-19 7:50 ` Keir Fraser
2026-02-19 11:02 ` Nikita Kalyazin
0 siblings, 1 reply; 15+ messages in thread
From: Keir Fraser @ 2026-02-19 7:50 UTC (permalink / raw)
To: Nikita Kalyazin
Cc: Sean Christopherson, linux-arm-kernel, linux-kernel, kvm,
Eric Auger, Oliver Upton, Marc Zyngier, Will Deacon,
Paolo Bonzini, Li RongQing
On Wed, Feb 18, 2026 at 04:15:33PM +0000, Nikita Kalyazin wrote:
>
>
> On 18/02/2026 16:02, Keir Fraser wrote:
> > On Wed, Feb 18, 2026 at 12:55:11PM +0000, Nikita Kalyazin wrote:
> > >
> > >
> > > On 17/02/2026 19:07, Sean Christopherson wrote:
> > > > On Mon, Feb 16, 2026, Nikita Kalyazin wrote:
> > > > > On 13/02/2026 23:20, Sean Christopherson wrote:
> > > > > > On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
> > > > > > > I am not aware of a way to make it fast for both use cases and would be more
> > > > > > > than happy to hear about possible solutions.
> > > > > >
> > > > > > What if we key off of vCPUs being created? The motivation for Keir's change was
> > > > > > to avoid stalling during VM boot, i.e. *after* initial VM creation.
> > > > >
> > > > > It doesn't work as is on x86 because the delay we're seeing occurs after the
> > > > > created_vcpus gets incremented
> > > >
> > > > I don't follow, the suggestion was to key off created_vcpus in
> > > > kvm_io_bus_register_dev(), not in kvm_swap_active_memslots(). I can totally
> > > > imagine the patch not working, but the ordering in kvm_vm_ioctl_create_vcpu()
> > > > should be largely irrelevant.
> > >
> > > Yes, you're right, it's irrelevant. I had made the change in
> > > kvm_io_bus_register_dev() like proposed, but have no idea how I couldn't see
> > > the effect. I retested it now and it's obvious that it works on x86. Sorry
> > > for the confusion.
> > >
> > > >
> > > > Probably a moot point though.
> > >
> > > Yes, this will not solve the problem on ARM.
> >
> > Sorry for being late to this thread. I'm a bit confused now. Did
> > Sean's original patch (reintroducing the old logic, based on whether
> > any vcpus have been created) work for both/either/neither arch? I
> > would have expected it to work for both ARM and X86, despite the
> > offending synchronize_srcu() not being in the vcpu-creation ioctl on
> > ARM, and I think that is finally what your testing seems to show? If
> > so then that seems the pragmatic if somewhat ugly way forward.
>
> The original patch from Sean works for x86. I didn't test it on ARM as it's
> harder for me to do, but I don't expect it to work because it only affects
> the pre-vcpu-creation phase.
Ok, looking closer at one of your previous replies, the first fix
doesn't work for you on ARM because there your vcpu creations occur
earlier than on X86? Fair enough.
> We discussed the second patch at the KVM sync earlier today, then I retested
> it and it appears to solve the issue for both, but I'm going to have more
> complete results tomorrow.
>
> Are you by chance able to have a look whether KVM_SET_USER_MEMORY_REGION
> execution elongates on ARM in your environment (with the 4/4 patch)? I'd be
> curious to know why not if it doesn't.
On our VMM (crosvm) the kvm_io_bus_register_dev() calls happen much
later, during actual VM boot (the device probe phase), so the results
would not be comparable. In our scenario we generally save milliseconds
on every single kvm_io_bus_register_dev() invocation.
> >
> > Cheers,
> > Keir
> >
> >
> > > >
> > > > > so it doesn't allow differentiating the two
> > > > > cases (below is kvm_vm_ioctl_create_vcpu):
> > > > >
> > > > > 	kvm->created_vcpus++; // <===== incremented here
> > > > > 	mutex_unlock(&kvm->lock);
> > > > >
> > > > > 	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
> > > > > 	if (!vcpu) {
> > > > > 		r = -ENOMEM;
> > > > > 		goto vcpu_decrement;
> > > > > 	}
> > > > >
> > > > > 	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
> > > > > 	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > > > > 	if (!page) {
> > > > > 		r = -ENOMEM;
> > > > > 		goto vcpu_free;
> > > > > 	}
> > > > > 	vcpu->run = page_address(page);
> > > > >
> > > > > 	kvm_vcpu_init(vcpu, kvm, id);
> > > > >
> > > > > 	r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
> > > > >
> > > > >
> > > > > firecracker 583 [001] 151.297145: probe:synchronize_srcu_expedited:
> > > > > (ffffffff813e5cf0)
> > > > > ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
> > > > > ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
> > > > > ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
> > > > > ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
> > > > > ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
> > > > > ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
> > > > > ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
> > > > > ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
> > > > > ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
> > > > > ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
> > > > > ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
> > > > > ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
> > > > > 6512de ioctl+0x32 (/mnt/host/firecracker)
> > > > > d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
> > > > >
> > > > > Also, given that it stumbles after the KVM_CREATE_VCPU on ARM (in
> > > > > KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
> > > >
> > > > Hmm. Under the hood, __synchronize_srcu() itself uses __call_srcu, so I _think_
> > > > the only practical difference (aside from waiting, obviously) between call_srcu()
> > > > and synchronize_srcu_expedited() with respect to "transferring" grace period
> > > > latency is that using call_srcu() could start a normal, non-expedited grace period.
> > > >
> > > > IIUC, SRCU has best-effort logic to shift in-flight non-expedited grace periods
> > > > to expedited mode, but if the normal grace period has already started the timer
> > > > for the delayed invocation of process_srcu(), then SRCU will still wait for one
> > > > jiffy, i.e. won't immediately queue the work.
> > > >
> > > > I have no idea if this is sane and/or acceptable, but before looping in Paul and
> > > > others, can you try this to see if it helps?
> > >
> > > That's exactly what I tried myself before and it didn't help, probably for
> > > the reason you mentioned above (a normal GP being already started).
> > >
> > > >
> > > > diff --git a/include/linux/srcu.h b/include/linux/srcu.h
> > > > index 344ad51c8f6c..30437dc8d818 100644
> > > > --- a/include/linux/srcu.h
> > > > +++ b/include/linux/srcu.h
> > > > @@ -89,6 +89,8 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
> > > >
> > > > void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
> > > > void (*func)(struct rcu_head *head));
> > > > +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
> > > > + rcu_callback_t func);
> > > > void cleanup_srcu_struct(struct srcu_struct *ssp);
> > > > void synchronize_srcu(struct srcu_struct *ssp);
> > > >
> > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > index ea3f128de06f..03333b079092 100644
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -1493,6 +1493,13 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp,
> > > > }
> > > > EXPORT_SYMBOL_GPL(call_srcu);
> > > >
> > > > +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
> > > > + rcu_callback_t func)
> > > > +{
> > > > + __call_srcu(ssp, rhp, func, rcu_gp_is_normal());
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(call_srcu_expedited);
> > > > +
> > > > /*
> > > > * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
> > > > */
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index 737b74b15bb5..26215f98c98f 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -6036,7 +6036,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > > > memcpy(new_bus->range + i + 1, bus->range + i,
> > > > (bus->dev_count - i) * sizeof(struct kvm_io_range));
> > > > rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> > > > - call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> > > > + call_srcu_expedited(&kvm->srcu, &bus->rcu, __free_bus);
> > > >
> > > > return 0;
> > > > }
> > >
>
^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
2026-02-19 7:50 ` Keir Fraser
@ 2026-02-19 11:02 ` Nikita Kalyazin
0 siblings, 0 replies; 15+ messages in thread
From: Nikita Kalyazin @ 2026-02-19 11:02 UTC (permalink / raw)
To: Keir Fraser, Sean Christopherson
Cc: linux-arm-kernel, linux-kernel, kvm, Eric Auger, Oliver Upton,
Marc Zyngier, Will Deacon, Paolo Bonzini, Li RongQing
On 19/02/2026 07:50, Keir Fraser wrote:
> On Wed, Feb 18, 2026 at 04:15:33PM +0000, Nikita Kalyazin wrote:
>>
>>
>> On 18/02/2026 16:02, Keir Fraser wrote:
>>> On Wed, Feb 18, 2026 at 12:55:11PM +0000, Nikita Kalyazin wrote:
>>>>
>>>>
>>>> On 17/02/2026 19:07, Sean Christopherson wrote:
>>>>> On Mon, Feb 16, 2026, Nikita Kalyazin wrote:
>>>>>> On 13/02/2026 23:20, Sean Christopherson wrote:
>>>>>>> On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
>>>>>>>> I am not aware of a way to make it fast for both use cases and would be
>>>>>>>> more than happy to hear about possible solutions.
>>>>>>>
>>>>>>> What if we key off of vCPUS being created? The motivation for Keir's change was
>>>>>>> to avoid stalling during VM boot, i.e. *after* initial VM creation.
>>>>>>
>>>>>> It doesn't work as is on x86 because the delay we're seeing occurs after
>>>>>> created_vcpus gets incremented
>>>>>
>>>>> I don't follow; the suggestion was to key off created_vcpus in
>>>>> kvm_io_bus_register_dev(), not in kvm_swap_active_memslots(). I can totally
>>>>> imagine the patch not working, but the ordering in kvm_vm_ioctl_create_vcpu()
>>>>> should be largely irrelevant.
>>>>
>>>> Yes, you're right, it's irrelevant. I had made the change in
>>>> kvm_io_bus_register_dev() as proposed, but have no idea why I didn't see
>>>> the effect. I retested it now and it's obvious that it works on x86. Sorry
>>>> for the confusion.
>>>>
>>>>>
>>>>> Probably a moot point though.
>>>>
>>>> Yes, this will not solve the problem on ARM.
>>>
>>> Sorry for being late to this thread. I'm a bit confused now. Did
>>> Sean's original patch (reintroducing the old logic, based on whether
>>> any vcpus have been created) work for both/either/neither arch? I
>>> would have expected it to work for both ARM and X86, despite the
>>> offending synchronize_srcu() not being in the vcpu-creation ioctl on
>>> ARM, and I think that is finally what your testing seems to show? If
>>> so then that seems the pragmatic if somewhat ugly way forward.
>>
>> The original patch from Sean works for x86. I didn't test it on ARM as it's
>> harder for me to do, but I don't expect it to work because it only affects
>> the pre-vcpu-creation phase.
>
> Ok, looking closer at one of your previous replies, the first fix
> doesn't work for you on ARM because your vcpu creations occur earlier
> there than on x86? Fair enough.
Yes, that's correct.
>
>> We discussed the second patch at the KVM sync earlier today, then I retested
>> it and it appears to solve the issue for both, but I'm going to have more
>> complete results tomorrow.
Sean,
I looked at the tests we ran overnight and your 2nd patch
(call_srcu_expedited) brings the latencies back to the original
baselines on both x86 and ARM. What would be the next steps? Looping
Paul in to make sure the proposal is sensible?
>>
>> Are you by chance able to have a look whether KVM_SET_USER_MEMORY_REGION
>> execution elongates on ARM in your environment (with the 4/4 patch)? I'd be
>> curious to know why not if it doesn't.
>
> On our VMM (crosvm), the kvm_io_bus_register_dev calls happen much
> later, during actual VM boot (the device probe phase), so the results
> would not be comparable. In our scenario we generally save milliseconds
> on every single kvm_io_bus_register_dev invocation.
Ok, thanks.
>
>>>
>>> Cheers,
>>> Keir
>>>
>>>
>>>>>
>>>>>> so it doesn't allow differentiating the two
>>>>>> cases (below is kvm_vm_ioctl_create_vcpu):
>>>>>>
>>>>>> kvm->created_vcpus++; // <===== incremented here
>>>>>> mutex_unlock(&kvm->lock);
>>>>>>
>>>>>> vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
>>>>>> if (!vcpu) {
>>>>>> r = -ENOMEM;
>>>>>> goto vcpu_decrement;
>>>>>> }
>>>>>>
>>>>>> BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
>>>>>> page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>>>>>> if (!page) {
>>>>>> r = -ENOMEM;
>>>>>> goto vcpu_free;
>>>>>> }
>>>>>> vcpu->run = page_address(page);
>>>>>>
>>>>>> kvm_vcpu_init(vcpu, kvm, id);
>>>>>>
>>>>>> r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
>>>>>>
>>>>>>
>>>>>> firecracker 583 [001] 151.297145: probe:synchronize_srcu_expedited:
>>>>>> (ffffffff813e5cf0)
>>>>>> ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
>>>>>> ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
>>>>>> ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
>>>>>> ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
>>>>>> ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
>>>>>> ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
>>>>>> ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
>>>>>> ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
>>>>>> ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
>>>>>> ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
>>>>>> ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
>>>>>> ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
>>>>>> 6512de ioctl+0x32 (/mnt/host/firecracker)
>>>>>> d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
>>>>>>
>>>>>> Also, given that it stumbles after the KVM_CREATE_VCPU on ARM (in
>>>>>> KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
>>>>>
>>>>> Hmm. Under the hood, __synchronize_srcu() itself uses __call_srcu, so I _think_
>>>>> the only practical difference (aside from waiting, obviously) between call_srcu()
>>>>> and synchronize_srcu_expedited() with respect to "transferring" grace period
>>>>> latency is that using call_srcu() could start a normal, non-expedited grace period.
>>>>>
>>>>> IIUC, SRCU has best-effort logic to shift in-flight non-expedited grace periods
>>>>> to expedited mode, but if the normal grace period has already started the timer
>>>>> for the delayed invocation of process_srcu(), then SRCU will still wait for one
>>>>> jiffy, i.e. won't immediately queue the work.
>>>>>
>>>>> I have no idea if this is sane and/or acceptable, but before looping in Paul and
>>>>> others, can you try this to see if it helps?
>>>>
>>>> That's exactly what I tried myself before and it didn't help, probably for
>>>> the reason you mentioned above (a normal GP being already started).
I also realised why the same change didn't work for me earlier. Apparently
other changes I had made in my tree for debugging skewed the results. Sorry
again for the confusion.
>>>>
>>>>>
>>>>> diff --git a/include/linux/srcu.h b/include/linux/srcu.h
>>>>> index 344ad51c8f6c..30437dc8d818 100644
>>>>> --- a/include/linux/srcu.h
>>>>> +++ b/include/linux/srcu.h
>>>>> @@ -89,6 +89,8 @@ void __srcu_read_unlock(struct srcu_struct *ssp, int idx) __releases(ssp);
>>>>>
>>>>> void call_srcu(struct srcu_struct *ssp, struct rcu_head *head,
>>>>> void (*func)(struct rcu_head *head));
>>>>> +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
>>>>> + rcu_callback_t func);
>>>>> void cleanup_srcu_struct(struct srcu_struct *ssp);
>>>>> void synchronize_srcu(struct srcu_struct *ssp);
>>>>>
>>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>>> index ea3f128de06f..03333b079092 100644
>>>>> --- a/kernel/rcu/srcutree.c
>>>>> +++ b/kernel/rcu/srcutree.c
>>>>> @@ -1493,6 +1493,13 @@ void call_srcu(struct srcu_struct *ssp, struct rcu_head *rhp,
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(call_srcu);
>>>>>
>>>>> +void call_srcu_expedited(struct srcu_struct *ssp, struct rcu_head *rhp,
>>>>> + rcu_callback_t func)
>>>>> +{
>>>>> + __call_srcu(ssp, rhp, func, rcu_gp_is_normal());
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(call_srcu_expedited);
>>>>> +
>>>>> /*
>>>>> * Helper function for synchronize_srcu() and synchronize_srcu_expedited().
>>>>> */
>>>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>>>> index 737b74b15bb5..26215f98c98f 100644
>>>>> --- a/virt/kvm/kvm_main.c
>>>>> +++ b/virt/kvm/kvm_main.c
>>>>> @@ -6036,7 +6036,7 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
>>>>> memcpy(new_bus->range + i + 1, bus->range + i,
>>>>> (bus->dev_count - i) * sizeof(struct kvm_io_range));
>>>>> rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
>>>>> - call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
>>>>> + call_srcu_expedited(&kvm->srcu, &bus->rcu, __free_bus);
>>>>>
>>>>> return 0;
>>>>> }
>>>>
>>
* Re: [PATCH v4 0/4] KVM: Speed up MMIO registrations
2025-09-09 10:00 [PATCH v4 0/4] KVM: Speed up MMIO registrations Keir Fraser
` (3 preceding siblings ...)
2025-09-09 10:00 ` [PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev() Keir Fraser
@ 2025-09-15 9:59 ` Marc Zyngier
4 siblings, 0 replies; 15+ messages in thread
From: Marc Zyngier @ 2025-09-15 9:59 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, kvm, Keir Fraser
Cc: Sean Christopherson, Eric Auger, Oliver Upton, Will Deacon,
Paolo Bonzini
On Tue, 09 Sep 2025 10:00:03 +0000, Keir Fraser wrote:
> This is version 4 of the patches I previously posted here:
>
> https://lore.kernel.org/all/20250819090853.3988626-1-keirf@google.com/
>
> Changes since v3:
>
> * Rebased to v6.17-rc5
> * Added Tested-by tag to patch 4
> * Fixed reproducible syzkaller splat
> * Tweaked comments to Sean's specification
>
> [...]
Applied to next, thanks!
[1/4] KVM: arm64: vgic-init: Remove vgic_ready() macro
commit: 8810c6e7cca8fbfce7652b53e05acc465e671d28
[2/4] KVM: arm64: vgic: Explicitly implement vgic_dist::ready ordering
commit: 11490b5ec6bc4fe3a36f90817bbc8021ba8b05cd
[3/4] KVM: Implement barriers before accessing kvm->buses[] on SRCU read paths
commit: 7788255aba6545a27b8d143c5256536f8dfb2c0a
[4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
commit: 7d9a0273c45962e9a6bc06f3b87eef7c431c1853
Cheers,
M.
--
Without deviation from the norm, progress is not possible.