linux-arm-kernel.lists.infradead.org archive mirror
* [RFC PATCH 0/1] KVM-arm: Optimize cache flush by only flushing on vcpu0
@ 2025-04-18 10:22 Jiayuan Liang
  2025-04-18 10:22 ` [RFC PATCH 1/1] KVM: arm: " Jiayuan Liang
  2025-04-18 13:10 ` [RFC PATCH 0/1] KVM-arm: " Marc Zyngier
  0 siblings, 2 replies; 3+ messages in thread
From: Jiayuan Liang @ 2025-04-18 10:22 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton
  Cc: Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
	Will Deacon, linux-arm-kernel, kvmarm, linux-kernel,
	Jiayuan Liang

This is an RFC patch to optimize cache flushing behavior in KVM/arm64.

When toggling cache state in a multi-vCPU guest, we currently flush the VM's
stage2 page tables on every vCPU that transitions cache state. This leads to
redundant cache flushes during guest boot, as each vCPU performs the same
flush operation.

In a typical guest boot sequence, vcpu0 is the first to enable caches, and
other vCPUs follow afterward. By the time secondary vCPUs enable their caches,
the flush performed by vcpu0 has already ensured cache coherency for the
entire VM.

I'm proposing to optimize this by performing the stage2_flush_vm() operation
only on vcpu0, which is sufficient to maintain cache coherency while eliminating
redundant flushes on the other vCPUs. This can improve performance during guest
boot in multi-vCPU configurations.

I'm submitting this as an RFC because:
1. This is my first contribution to the KVM/arm64 subsystem
2. I want to confirm whether this approach is architecturally sound
3. I'd like feedback on potential corner cases I may have missed:
   - Could there be scenarios where secondary vCPUs need their own flushes?
   - Is the assumption that vcpu0 is always first valid?

Implementation details:
- The patch identifies vcpu0 by checking if vcpu->vcpu_id == 0

Testing with a 64-core VM with 128GB memory using hugepages shows dramatic
performance improvements, reducing busybox boot time from 33s to 5s.

I'd appreciate any feedback on the correctness and approach of this optimization.

Jiayuan Liang (1):
  KVM: arm: Optimize cache flush by only flushing on vcpu0

 arch/arm64/kvm/mmu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)


base-commit: fc96b232f8e7c0a6c282f47726b2ff6a5fb341d2
--
2.43.0




* [RFC PATCH 1/1] KVM: arm: Optimize cache flush by only flushing on vcpu0
  2025-04-18 10:22 [RFC PATCH 0/1] KVM-arm: Optimize cache flush by only flushing on vcpu0 Jiayuan Liang
@ 2025-04-18 10:22 ` Jiayuan Liang
  2025-04-18 13:10 ` [RFC PATCH 0/1] KVM-arm: " Marc Zyngier
  1 sibling, 0 replies; 3+ messages in thread
From: Jiayuan Liang @ 2025-04-18 10:22 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton
  Cc: Joey Gouly, Suzuki K Poulose, Zenghui Yu, Catalin Marinas,
	Will Deacon, linux-arm-kernel, kvmarm, linux-kernel,
	Jiayuan Liang

When toggling cache state in a multi-vCPU guest, we currently flush the VM's
stage2 page tables on every vCPU that transitions cache state. This leads to
redundant cache flushes during guest boot, as each vCPU performs the same
flush operation.

In a typical guest boot sequence, vcpu0 is the first to enable caches, and
other vCPUs follow afterward. By the time secondary vCPUs enable their caches,
the flush performed by vcpu0 has already ensured cache coherency for the
entire VM.

Optimize this by performing the stage2_flush_vm() operation only on vcpu0,
which is sufficient to maintain cache coherency while eliminating redundant
flushes on the other vCPUs. This can improve performance during guest boot in
multi-vCPU configurations.

Testing with a 64-core VM with 128GB memory using hugepages shows dramatic
performance improvements, reducing busybox boot time from 33s to 5s.

Test command:
qemu-kvm \
    -nographic \
    -m 128G \
    -mem-path /dev/hugepages \
    -mem-prealloc \
    -cpu host -M virt \
    -smp 64 \
    -kernel ./Image \
    -append "root=/dev/ram earlycon=pl011,0x9000000 console=ttyAMA0 init=/linuxrc systemd.unified_cgroup_hierarchy=1 psi=1"

    Signed-off-by: Jiayuan Liang <ljykernel@163.com>
---
 arch/arm64/kvm/mmu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 754f2fe0cc67..fbc736657666 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2300,8 +2300,10 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
 	 * If switching it off, need to clean the caches.
 	 * Clean + invalidate does the trick always.
 	 */
-	if (now_enabled != was_enabled)
-		stage2_flush_vm(vcpu->kvm);
+	if (now_enabled != was_enabled) {
+		if (vcpu->vcpu_id == 0)
+			stage2_flush_vm(vcpu->kvm);
+	}
 
 	/* Caches are now on, stop trapping VM ops (until a S/W op) */
 	if (now_enabled)
-- 
2.43.0




* Re: [RFC PATCH 0/1] KVM-arm: Optimize cache flush by only flushing on vcpu0
  2025-04-18 10:22 [RFC PATCH 0/1] KVM-arm: Optimize cache flush by only flushing on vcpu0 Jiayuan Liang
  2025-04-18 10:22 ` [RFC PATCH 1/1] KVM: arm: " Jiayuan Liang
@ 2025-04-18 13:10 ` Marc Zyngier
  1 sibling, 0 replies; 3+ messages in thread
From: Marc Zyngier @ 2025-04-18 13:10 UTC (permalink / raw)
  To: Jiayuan Liang
  Cc: Oliver Upton, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, linux-arm-kernel, kvmarm,
	linux-kernel

On Fri, 18 Apr 2025 11:22:43 +0100,
Jiayuan Liang <ljykernel@163.com> wrote:
> 
> This is an RFC patch to optimize cache flushing behavior in KVM/arm64.
> 
> When toggling cache state in a multi-vCPU guest, we currently flush the VM's
> stage2 page tables on every vCPU that transitions cache state. This leads to
> redundant cache flushes during guest boot, as each vCPU performs the same
> flush operation.
> 
> In a typical guest boot sequence, vcpu0 is the first to enable caches, and
> other vCPUs follow afterward. By the time secondary vCPUs enable their caches,
> the flush performed by vcpu0 has already ensured cache coherency for the
> entire VM.

The most immediate issue I can spot is that vcpu0 is not special.
There is nothing that says vcpu0 will be the first switching its MMU
on, nor that vcpu0 will ever be running. I guess what you would want
instead is that the *first* vcpu that enables its MMU performs the
CMOs, while the others may not have to.

But even then, this changes a behaviour some guests *may* be relying
on, which is that what they have written while their MMU was off is
visible with the MMU on, without the guest doing any CMO of its own.

A lot of this stuff comes from the days where we were mostly running
32bit guests, some of which had (and still have) pretty bad
assumptions (set/way operations being one of them).

64bit guests *should* be much better behaved, and I wonder whether we
could actually drop the whole thing altogether for those. Something
like the hack below.

But this requires testing and more thought than I'm prepared to put in
on a day off... ;-)

Thanks,

	M.

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index bd020fc28aa9c..9d05e65433916 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -85,9 +85,11 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
 	 * For non-FWB CPUs, we trap VM ops (HCR_EL2.TVM) until M+C
 	 * get set in SCTLR_EL1 such that we can detect when the guest
 	 * MMU gets turned on and do the necessary cache maintenance
-	 * then.
+	 * then. Limit this dance to 32bit guests, assuming that 64bit
+	 * guests are reasonably behaved.
 	 */
-	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
+	if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) &&
+	    vcpu_el1_is_32bit(vcpu))
 		vcpu->arch.hcr_el2 |= HCR_TVM;
 }
 

-- 
Jazz isn't dead. It just smells funny.


