All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] KVM: arm64: nv: Fixes for stage-2 MMU recycling
@ 2024-10-01  0:17 Oliver Upton
  2024-10-01  0:17 ` [PATCH 1/3] KVM: arm64: Treat stage-2 MMUs as refcounted generally Oliver Upton
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Oliver Upton @ 2024-10-01  0:17 UTC (permalink / raw)
  To: kvmarm; +Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Oliver Upton

I recently found that the path for reclaiming stage-2 MMUs is a bit
dodgy. We call kvm_stage2_unmap_range() from vcpu_load() when
repurposing a valid MMU, which itself is allowed to block. We repeat
this silliness on the next vcpu_load(), eventually leading to a stack
overflow.

It also looks like the nested hooks into the MMU notifiers are similarly
screwed up, ignoring the @may_block parameter and allowing the page
table walk to block unconditionally.

In addition to that, we are not handling vcpu->arch.hw_mmu safely in
preemptible code, i.e. we could be using a stale pointer on the loaded
hardware MMU.

Set of fixes to address all of the above, by:

1) Allowing vCPUs to 'pin' the loaded stage-2 MMU, guaranteeing that
   vcpu->arch.hw_mmu is stable when used in preemptible code

2) Preventing unmap walks from blocking when disallowed by the calling
   context.

   This is the correct fix for the MMU notifiers, and a "fix" for the
   reclaim case. Unmapping a whole MMU behind the write lock w/o periodically
   releasing the lock hurts vCPU scheduling, especially on larger VMs.

3) Push the cleanup of reclaimed MMUs into a vCPU request, allowing the
   walk to release the lock + CPU as needed.

Applies to 6.12-rc1, tested on M2. An easy way to force the reclaim case
is to constantly rerun a KVM selftest inside the L1 guest, which is how
I found this in the first place.

Oliver Upton (3):
  KVM: arm64: Treat stage-2 MMUs as refcounted generally
  KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
  KVM: arm64: nv: Punt stage-2 recycling to a vCPU request

 arch/arm64/include/asm/kvm_host.h   | 13 ++++++++--
 arch/arm64/include/asm/kvm_mmu.h    | 40 ++++++++++++++++++++++++++++-
 arch/arm64/include/asm/kvm_nested.h |  4 ++-
 arch/arm64/kvm/arm.c                | 12 +++++----
 arch/arm64/kvm/hyp/nvhe/hyp-main.c  |  2 +-
 arch/arm64/kvm/hyp/nvhe/pkvm.c      |  2 +-
 arch/arm64/kvm/hyp/nvhe/switch.c    |  2 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c       |  4 +--
 arch/arm64/kvm/hyp/vhe/switch.c     |  2 +-
 arch/arm64/kvm/hyp/vhe/tlb.c        |  4 +--
 arch/arm64/kvm/mmu.c                | 34 +++++++++++++-----------
 arch/arm64/kvm/nested.c             | 39 ++++++++++++++++++++--------
 arch/arm64/kvm/sys_regs.c           |  6 ++---
 13 files changed, 119 insertions(+), 45 deletions(-)


base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
-- 
2.46.1.824.gd892dcdcdd-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] KVM: arm64: Treat stage-2 MMUs as refcounted generally
  2024-10-01  0:17 [PATCH 0/3] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
@ 2024-10-01  0:17 ` Oliver Upton
  2024-10-01  0:17 ` [PATCH 2/3] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton
  2024-10-01  0:17 ` [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
  2 siblings, 0 replies; 19+ messages in thread
From: Oliver Upton @ 2024-10-01  0:17 UTC (permalink / raw)
  To: kvmarm; +Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Oliver Upton

Nested stage-2 MMUs are assigned/released at the vcpu_load()/vcpu_put()
boundary. And, if the guest has more than 2*nr_vcpus in play then KVM
starts reclaiming nested MMUs. This has the effect of making the vCPU's
reference to the loaded hardware MMU volatile in preemptible code, as
the reference can be recycled at any time.

The fix for this issue is surprisingly straightforward, as nested MMUs
have refcounts already. Start pinning nested stage-2 MMUs by raising the
refcount where vcpu->arch.hw_mmu gets used, effectively guaranteeing
that lookup_s2_mmu() finds the same thing every time. Barring any bugs,
this shouldn't have any effect on the guarantee that an unused stage-2
exists at any time, as there are more MMUs than vCPUs capable of pinning
them.

If you haven't looked at the stage-2 abort path in a while, its picked
up plenty of warts, early returns and whatnot. Rather than explicitly
handle all the exit paths, take advantage of the new cleanups
infrastructure to bind the reference to a particular scope. Very nice.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/include/asm/kvm_host.h  | 10 ++++++--
 arch/arm64/include/asm/kvm_mmu.h   | 37 ++++++++++++++++++++++++++++++
 arch/arm64/kvm/arm.c               | 10 ++++----
 arch/arm64/kvm/hyp/nvhe/hyp-main.c |  2 +-
 arch/arm64/kvm/hyp/nvhe/pkvm.c     |  2 +-
 arch/arm64/kvm/hyp/nvhe/switch.c   |  2 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c      |  4 ++--
 arch/arm64/kvm/hyp/vhe/switch.c    |  2 +-
 arch/arm64/kvm/hyp/vhe/tlb.c       |  4 ++--
 arch/arm64/kvm/mmu.c               | 19 ++++++++-------
 arch/arm64/kvm/nested.c            | 14 ++++++-----
 11 files changed, 77 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 329619c6fa96..268f69196380 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -685,8 +685,14 @@ struct kvm_vcpu_arch {
 	enum fp_type fp_type;
 	unsigned int sve_max_vl;
 
-	/* Stage 2 paging state used by the hardware on next switch */
-	struct kvm_s2_mmu *hw_mmu;
+	/*
+	 * Loaded stage-2 MMU context, this is **not** safe to dereference in
+	 * preemptible code as nested MMUs get attached at vcpu_load().
+	 *
+	 * Use the vcpu_hw_mmu 'class' instead to associate a pin on the loaded
+	 * MMU with a scope.
+	 */
+	struct kvm_s2_mmu *__hw_mmu;
 
 	/* Values of trap registers for the guest. */
 	u64 hcr_el2;
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index cd4087fbda9a..a6379ca1b0d7 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -358,5 +358,42 @@ void kvm_s2_ptdump_create_debugfs(struct kvm *kvm);
 static inline void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) {}
 #endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
 
+static inline struct kvm_s2_mmu *vcpu_to_hw_mmu_unsafe(struct kvm_vcpu *vcpu)
+{
+	/* The nVHE hypervisor code is always non-preemptible */
+	if (!is_nvhe_hyp_code())
+		lockdep_assert_preemption_disabled();
+
+	return vcpu->arch.__hw_mmu;
+}
+
+static inline void kvm_get_s2_mmu(struct kvm_s2_mmu *mmu)
+{
+	if (kvm_is_nested_s2_mmu(kvm_s2_mmu_to_kvm(mmu), mmu))
+		atomic_inc(&mmu->refcnt);
+}
+
+static inline void kvm_put_s2_mmu(struct kvm_s2_mmu *mmu)
+{
+	if (kvm_is_nested_s2_mmu(kvm_s2_mmu_to_kvm(mmu), mmu))
+		atomic_dec(&mmu->refcnt);
+}
+
+static inline struct kvm_s2_mmu *vcpu_get_hw_mmu(struct kvm_vcpu *vcpu)
+{
+	struct kvm_s2_mmu *mmu;
+
+	guard(preempt)();
+
+	mmu = vcpu_to_hw_mmu_unsafe(vcpu);
+	kvm_get_s2_mmu(mmu);
+
+	return mmu;
+}
+
+DEFINE_CLASS(vcpu_hw_mmu, struct kvm_s2_mmu *,
+	     kvm_put_s2_mmu(_T), vcpu_get_hw_mmu(vcpu),
+	     struct kvm_vcpu *vcpu)
+
 #endif /* __ASSEMBLY__ */
 #endif /* __ARM64_KVM_MMU_H__ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a0d01c46e408..367f0d3d3194 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -481,7 +481,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 	kvm_arm_pvtime_vcpu_init(&vcpu->arch);
 
-	vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
+	vcpu->arch.__hw_mmu = &vcpu->kvm->arch.mmu;
 
 	/*
 	 * This vCPU may have been created after mpidr_data was initialized.
@@ -581,7 +581,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (vcpu_has_nv(vcpu))
 		kvm_vcpu_load_hw_mmu(vcpu);
 
-	mmu = vcpu->arch.hw_mmu;
+	mmu = vcpu_to_hw_mmu_unsafe(vcpu);
 	last_ran = this_cpu_ptr(mmu->last_vcpu_ran);
 
 	/*
@@ -1169,10 +1169,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * making a thread's VMID inactive. So we need to call
 		 * kvm_arm_vmid_update() in non-premptible context.
 		 */
-		if (kvm_arm_vmid_update(&vcpu->arch.hw_mmu->vmid) &&
+		if (kvm_arm_vmid_update(&vcpu_to_hw_mmu_unsafe(vcpu)->vmid) &&
 		    has_vhe())
-			__load_stage2(vcpu->arch.hw_mmu,
-				      vcpu->arch.hw_mmu->arch);
+			__load_stage2(vcpu_to_hw_mmu_unsafe(vcpu),
+				      &vcpu->kvm->arch);
 
 		kvm_pmu_flush_hwstate(vcpu);
 
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 87692b566d90..8c823b1bc909 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -102,7 +102,7 @@ static void flush_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu)
 	/* Limit guest vector length to the maximum supported by the host.  */
 	hyp_vcpu->vcpu.arch.sve_max_vl	= min(host_vcpu->arch.sve_max_vl, kvm_host_sve_max_vl);
 
-	hyp_vcpu->vcpu.arch.hw_mmu	= host_vcpu->arch.hw_mmu;
+	hyp_vcpu->vcpu.arch.__hw_mmu	= vcpu_to_hw_mmu_unsafe(host_vcpu);
 
 	hyp_vcpu->vcpu.arch.hcr_el2	= host_vcpu->arch.hcr_el2;
 	hyp_vcpu->vcpu.arch.mdcr_el2	= host_vcpu->arch.mdcr_el2;
diff --git a/arch/arm64/kvm/hyp/nvhe/pkvm.c b/arch/arm64/kvm/hyp/nvhe/pkvm.c
index 187a5f4d56c0..d26949892cd7 100644
--- a/arch/arm64/kvm/hyp/nvhe/pkvm.c
+++ b/arch/arm64/kvm/hyp/nvhe/pkvm.c
@@ -333,7 +333,7 @@ static int init_pkvm_hyp_vcpu(struct pkvm_hyp_vcpu *hyp_vcpu,
 	hyp_vcpu->vcpu.vcpu_id = READ_ONCE(host_vcpu->vcpu_id);
 	hyp_vcpu->vcpu.vcpu_idx = vcpu_idx;
 
-	hyp_vcpu->vcpu.arch.hw_mmu = &hyp_vm->kvm.arch.mmu;
+	hyp_vcpu->vcpu.arch.__hw_mmu = &hyp_vm->kvm.arch.mmu;
 	hyp_vcpu->vcpu.arch.cflags = READ_ONCE(host_vcpu->arch.cflags);
 done:
 	if (ret)
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index cc69106734ca..0189d592bb64 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -326,7 +326,7 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
 	__sysreg32_restore_state(vcpu);
 	__sysreg_restore_state_nvhe(guest_ctxt);
 
-	mmu = kern_hyp_va(vcpu->arch.hw_mmu);
+	mmu = kern_hyp_va(vcpu_to_hw_mmu_unsafe(vcpu));
 	__load_stage2(mmu, kern_hyp_va(mmu->arch));
 	__activate_traps(vcpu);
 
diff --git a/arch/arm64/kvm/hyp/nvhe/tlb.c b/arch/arm64/kvm/hyp/nvhe/tlb.c
index 48da9ca9763f..fcc10783b1f4 100644
--- a/arch/arm64/kvm/hyp/nvhe/tlb.c
+++ b/arch/arm64/kvm/hyp/nvhe/tlb.c
@@ -59,10 +59,10 @@ static void enter_vmid_context(struct kvm_s2_mmu *mmu,
 		 * to be called from within __kvm_vcpu_run(), which ensures that
 		 * __hyp_running_vcpu is set to the current guest vcpu.
 		 */
-		if (mmu == vcpu->arch.hw_mmu || WARN_ON(mmu != host_s2_mmu))
+		if (mmu == vcpu_to_hw_mmu_unsafe(vcpu) || WARN_ON(mmu != host_s2_mmu))
 			return;
 
-		cxt->mmu = vcpu->arch.hw_mmu;
+		cxt->mmu = vcpu_to_hw_mmu_unsafe(vcpu);
 	} else {
 		/* We're in host context. */
 		if (mmu == host_s2_mmu)
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 80581b1c3995..c73969f0c687 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -245,7 +245,7 @@ void kvm_vcpu_load_vhe(struct kvm_vcpu *vcpu)
 
 	__vcpu_load_switch_sysregs(vcpu);
 	__vcpu_load_activate_traps(vcpu);
-	__load_stage2(vcpu->arch.hw_mmu, vcpu->arch.hw_mmu->arch);
+	__load_stage2(vcpu_to_hw_mmu_unsafe(vcpu), &vcpu->kvm->arch);
 }
 
 void kvm_vcpu_put_vhe(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/hyp/vhe/tlb.c b/arch/arm64/kvm/hyp/vhe/tlb.c
index 3d50a1bd2bdb..b21168ce7682 100644
--- a/arch/arm64/kvm/hyp/vhe/tlb.c
+++ b/arch/arm64/kvm/hyp/vhe/tlb.c
@@ -25,8 +25,8 @@ static void enter_vmid_context(struct kvm_s2_mmu *mmu,
 
 	local_irq_save(cxt->flags);
 
-	if (vcpu && mmu != vcpu->arch.hw_mmu)
-		cxt->mmu = vcpu->arch.hw_mmu;
+	if (vcpu && mmu != vcpu_to_hw_mmu_unsafe(vcpu))
+		cxt->mmu = vcpu_to_hw_mmu_unsafe(vcpu);
 	else
 		cxt->mmu = NULL;
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index a509b63bd4dd..800cf3e7ecae 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1440,6 +1440,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
 
+	CLASS(vcpu_hw_mmu, mmu)(vcpu);
+
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
 	write_fault = kvm_is_write_fault(vcpu);
@@ -1459,7 +1461,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	if (!fault_is_perm || (logging_active && write_fault)) {
 		ret = kvm_mmu_topup_memory_cache(memcache,
-						 kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu));
+						 kvm_mmu_cache_min_pages(mmu));
 		if (ret)
 			return ret;
 	}
@@ -1621,7 +1623,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 	read_lock(&kvm->mmu_lock);
-	pgt = vcpu->arch.hw_mmu->pgt;
+	pgt = mmu->pgt;
 	if (mmu_invalidate_retry(kvm, mmu_seq)) {
 		ret = -EAGAIN;
 		goto out_unlock;
@@ -1708,12 +1710,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 {
 	kvm_pte_t pte;
-	struct kvm_s2_mmu *mmu;
 
 	trace_kvm_access_fault(fault_ipa);
 
+	CLASS(vcpu_hw_mmu, mmu)(vcpu);
+
 	read_lock(&vcpu->kvm->mmu_lock);
-	mmu = vcpu->arch.hw_mmu;
 	pte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
 	read_unlock(&vcpu->kvm->mmu_lock);
 
@@ -1744,6 +1746,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	gfn_t gfn;
 	int ret, idx;
 
+	CLASS(vcpu_hw_mmu, mmu)(vcpu);
+
 	esr = kvm_vcpu_get_esr(vcpu);
 
 	ipa = fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
@@ -1757,7 +1761,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 		}
 
 		/* Falls between the IPA range and the PARange? */
-		if (fault_ipa >= BIT_ULL(vcpu->arch.hw_mmu->pgt->ia_bits)) {
+		if (fault_ipa >= BIT_ULL(mmu->pgt->ia_bits)) {
 			fault_ipa |= kvm_vcpu_get_hfar(vcpu) & GENMASK(11, 0);
 
 			if (is_iabt)
@@ -1809,8 +1813,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	 * nothing to walk and we treat it as a 1:1 before going through the
 	 * canonical translation.
 	 */
-	if (kvm_is_nested_s2_mmu(vcpu->kvm,vcpu->arch.hw_mmu) &&
-	    vcpu->arch.hw_mmu->nested_stage2_enabled) {
+	if (kvm_is_nested_s2_mmu(vcpu->kvm, mmu) && mmu->nested_stage2_enabled) {
 		u32 esr;
 
 		ret = kvm_walk_nested_s2(vcpu, fault_ipa, &nested_trans);
@@ -1881,7 +1884,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 	}
 
 	/* Userspace should not be able to register out-of-bounds IPAs */
-	VM_BUG_ON(ipa >= kvm_phys_size(vcpu->arch.hw_mmu));
+	VM_BUG_ON(ipa >= kvm_phys_size(mmu));
 
 	if (esr_fsc_is_access_flag_fault(esr)) {
 		handle_access_fault(vcpu, fault_ipa);
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index f9e30dd34c7a..c5b61415b852 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -649,7 +649,7 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 	s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, HCR_EL2) & HCR_VM;
 
 out:
-	atomic_inc(&s2_mmu->refcnt);
+	kvm_get_s2_mmu(s2_mmu);
 	return s2_mmu;
 }
 
@@ -664,19 +664,21 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
 {
 	if (is_hyp_ctxt(vcpu)) {
-		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
+		vcpu->arch.__hw_mmu = &vcpu->kvm->arch.mmu;
 	} else {
 		write_lock(&vcpu->kvm->mmu_lock);
-		vcpu->arch.hw_mmu = get_s2_mmu_nested(vcpu);
+		vcpu->arch.__hw_mmu = get_s2_mmu_nested(vcpu);
 		write_unlock(&vcpu->kvm->mmu_lock);
 	}
 }
 
 void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
 {
-	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
-		atomic_dec(&vcpu->arch.hw_mmu->refcnt);
-		vcpu->arch.hw_mmu = NULL;
+	struct kvm_s2_mmu *mmu = vcpu_to_hw_mmu_unsafe(vcpu);
+
+	if (kvm_is_nested_s2_mmu(vcpu->kvm, mmu)) {
+		kvm_put_s2_mmu(mmu);
+		vcpu->arch.__hw_mmu = NULL;
 	}
 }
 
-- 
2.46.1.824.gd892dcdcdd-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed
  2024-10-01  0:17 [PATCH 0/3] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
  2024-10-01  0:17 ` [PATCH 1/3] KVM: arm64: Treat stage-2 MMUs as refcounted generally Oliver Upton
@ 2024-10-01  0:17 ` Oliver Upton
  2024-10-01  0:17 ` [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
  2 siblings, 0 replies; 19+ messages in thread
From: Oliver Upton @ 2024-10-01  0:17 UTC (permalink / raw)
  To: kvmarm; +Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Oliver Upton

Right now the nested code allows unmap operations on a shadow stage-2 to
block unconditionally. This is wrong in a couple places, such as a
non-blocking MMU notifier or on the back of a sched_in() notifier as
part of shadow MMU recycling.

Carry through whether or not blocking is allowed to
kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU
reclaim would precipitate a stack overflow from a pile of vcpu_load()
calls, all trying to recycle a stage-2 MMU.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/include/asm/kvm_mmu.h    |  3 ++-
 arch/arm64/include/asm/kvm_nested.h |  2 +-
 arch/arm64/kvm/mmu.c                | 15 ++++++++-------
 arch/arm64/kvm/nested.c             |  6 +++---
 arch/arm64/kvm/sys_regs.c           |  6 +++---
 5 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index a6379ca1b0d7..a9a3225c9f75 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -166,7 +166,8 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
 int create_hyp_stack(phys_addr_t phys_addr, unsigned long *haddr);
 void __init free_hyp_pgds(void);
 
-void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size);
+void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
+			    u64 size, bool may_block);
 void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);
 void kvm_stage2_wp_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end);
 
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index e8bc6d67aba2..e74b90dcfac4 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -124,7 +124,7 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
 				    struct kvm_s2_trans *trans);
 extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
 extern void kvm_nested_s2_wp(struct kvm *kvm);
-extern void kvm_nested_s2_unmap(struct kvm *kvm);
+extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
 extern void kvm_nested_s2_flush(struct kvm *kvm);
 
 unsigned long compute_tlb_inval_range(struct kvm_s2_mmu *mmu, u64 val);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 800cf3e7ecae..02275854c288 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -328,9 +328,10 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
 				   may_block));
 }
 
-void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size)
+void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
+			    u64 size, bool may_block)
 {
-	__unmap_stage2_range(mmu, start, size, true);
+	__unmap_stage2_range(mmu, start, size, may_block);
 }
 
 void kvm_stage2_flush_range(struct kvm_s2_mmu *mmu, phys_addr_t addr, phys_addr_t end)
@@ -1015,7 +1016,7 @@ static void stage2_unmap_memslot(struct kvm *kvm,
 
 		if (!(vma->vm_flags & VM_PFNMAP)) {
 			gpa_t gpa = addr + (vm_start - memslot->userspace_addr);
-			kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start);
+			kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, vm_end - vm_start, true);
 		}
 		hva = vm_end;
 	} while (hva < reg_end);
@@ -1042,7 +1043,7 @@ void stage2_unmap_vm(struct kvm *kvm)
 	kvm_for_each_memslot(memslot, bkt, slots)
 		stage2_unmap_memslot(kvm, memslot);
 
-	kvm_nested_s2_unmap(kvm);
+	kvm_nested_s2_unmap(kvm, true);
 
 	write_unlock(&kvm->mmu_lock);
 	mmap_read_unlock(current->mm);
@@ -1915,7 +1916,7 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 			     (range->end - range->start) << PAGE_SHIFT,
 			     range->may_block);
 
-	kvm_nested_s2_unmap(kvm);
+	kvm_nested_s2_unmap(kvm, range->may_block);
 	return false;
 }
 
@@ -2182,8 +2183,8 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 	phys_addr_t size = slot->npages << PAGE_SHIFT;
 
 	write_lock(&kvm->mmu_lock);
-	kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size);
-	kvm_nested_s2_unmap(kvm);
+	kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
+	kvm_nested_s2_unmap(kvm, true);
 	write_unlock(&kvm->mmu_lock);
 }
 
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index c5b61415b852..1ba211541290 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -634,7 +634,7 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 
 	/* Clear the old state */
 	if (kvm_s2_mmu_valid(s2_mmu))
-		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu));
+		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
 
 	/*
 	 * The virtual VMID (modulo CnP) will be used as a key when matching
@@ -732,7 +732,7 @@ void kvm_nested_s2_wp(struct kvm *kvm)
 	}
 }
 
-void kvm_nested_s2_unmap(struct kvm *kvm)
+void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 {
 	int i;
 
@@ -742,7 +742,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm)
 		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
 
 		if (kvm_s2_mmu_valid(mmu))
-			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu));
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
 	}
 }
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index dad88e31f953..e0c7cc16466e 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2903,7 +2903,7 @@ static bool handle_alle1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 	 * Drop all shadow S2s, resulting in S1/S2 TLBIs for each of the
 	 * corresponding VMIDs.
 	 */
-	kvm_nested_s2_unmap(vcpu->kvm);
+	kvm_nested_s2_unmap(vcpu->kvm, true);
 
 	write_unlock(&vcpu->kvm->mmu_lock);
 
@@ -2955,7 +2955,7 @@ union tlbi_info {
 static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
 			       const union tlbi_info *info)
 {
-	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size);
+	kvm_stage2_unmap_range(mmu, info->range.start, info->range.size, true);
 }
 
 static bool handle_vmalls12e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
@@ -3050,7 +3050,7 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
 	max_size = compute_tlb_inval_range(mmu, info->ipa.addr);
 	base_addr &= ~(max_size - 1);
 
-	kvm_stage2_unmap_range(mmu, base_addr, max_size);
+	kvm_stage2_unmap_range(mmu, base_addr, max_size, true);
 }
 
 static bool handle_ipas2e1is(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
-- 
2.46.1.824.gd892dcdcdd-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01  0:17 [PATCH 0/3] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
  2024-10-01  0:17 ` [PATCH 1/3] KVM: arm64: Treat stage-2 MMUs as refcounted generally Oliver Upton
  2024-10-01  0:17 ` [PATCH 2/3] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton
@ 2024-10-01  0:17 ` Oliver Upton
  2024-10-01 19:05   ` Sean Christopherson
  2024-10-01 23:23   ` Marc Zyngier
  2 siblings, 2 replies; 19+ messages in thread
From: Oliver Upton @ 2024-10-01  0:17 UTC (permalink / raw)
  To: kvmarm; +Cc: Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu,
	Oliver Upton

Currently, when a nested MMU is repurposed for some other MMU context,
KVM unmaps everything during vcpu_load() while holding the MMU lock for
write. This is quite a performance bottleneck for large nested VMs, as
all vCPU scheduling will spin until the unmap completes.

Start punting the MMU cleanup to a vCPU request, where it is then
possible to periodically release the MMU lock and CPU in the presence of
contention. To make that possible, take a pin on the hardware MMU while
unmapping it to guarantee the responsible vCPU finds the MMU on the next
vcpu_load().

Ensure that no vCPU winds up using a stale MMU by tracking the pending
unmap on the S2 MMU itself and requesting an unmap on every vCPU that
finds it.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
---
 arch/arm64/include/asm/kvm_host.h   |  3 +++
 arch/arm64/include/asm/kvm_nested.h |  2 ++
 arch/arm64/kvm/arm.c                |  2 ++
 arch/arm64/kvm/nested.c             | 21 +++++++++++++++++++--
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 268f69196380..1d310292c623 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -51,6 +51,7 @@
 #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
 #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
 #define KVM_REQ_RESYNC_PMU_EL0	KVM_ARCH_REQ(7)
+#define KVM_REQ_NESTED_S2_UNMAP	KVM_ARCH_REQ(8)
 
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 				     KVM_DIRTY_LOG_INITIALLY_SET)
@@ -216,6 +217,8 @@ struct kvm_s2_mmu {
 	 * >0: Somebody is actively using this.
 	 */
 	atomic_t refcnt;
+
+	bool	pending_unmap;
 };
 
 struct kvm_arch_memory_slot {
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index e74b90dcfac4..233e65522716 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -78,6 +78,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
 extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
 extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
 
+extern void check_nested_vcpu_requests(struct kvm_vcpu *vcpu);
+
 struct kvm_s2_trans {
 	phys_addr_t output;
 	unsigned long block_size;
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 367f0d3d3194..ab7d387379a0 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_dirty_ring_check_request(vcpu))
 			return 0;
+
+		check_nested_vcpu_requests(vcpu);
 	}
 
 	return 1;
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 1ba211541290..749babdbda81 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 	/* Set the scene for the next search */
 	kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size;
 
-	/* Clear the old state */
+	/* Make sure we don't forget to do the laundry */
 	if (kvm_s2_mmu_valid(s2_mmu))
-		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
+		s2_mmu->pending_unmap = true;
 
 	/*
 	 * The virtual VMID (modulo CnP) will be used as a key when matching
@@ -649,6 +649,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 	s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, HCR_EL2) & HCR_VM;
 
 out:
+	if (s2_mmu->pending_unmap)
+		kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu);
+
 	kvm_get_s2_mmu(s2_mmu);
 	return s2_mmu;
 }
@@ -1186,3 +1189,17 @@ int kvm_init_nv_sysregs(struct kvm *kvm)
 
 	return 0;
 }
+
+void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
+{
+	if (kvm_check_request(KVM_REQ_NESTED_S2_UNMAP, vcpu)) {
+		CLASS(vcpu_hw_mmu, mmu)(vcpu);
+
+		write_lock(&vcpu->kvm->mmu_lock);
+		if (mmu->pending_unmap) {
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
+			mmu->pending_unmap = false;
+		}
+		write_unlock(&vcpu->kvm->mmu_lock);
+	}
+}
-- 
2.46.1.824.gd892dcdcdd-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01  0:17 ` [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
@ 2024-10-01 19:05   ` Sean Christopherson
  2024-10-01 20:41     ` Oliver Upton
  2024-10-01 23:23   ` Marc Zyngier
  1 sibling, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-10-01 19:05 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Tue, Oct 01, 2024, Oliver Upton wrote:
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 367f0d3d3194..ab7d387379a0 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>  
>  		if (kvm_dirty_ring_check_request(vcpu))
>  			return 0;
> +
> +		check_nested_vcpu_requests(vcpu);
>  	}
>  
>  	return 1;
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 1ba211541290..749babdbda81 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
>  	/* set the scene for the next search */
>  	kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size;
>  
> -	/* clear the old state */
> +	/* make sure we don't forget to do the laundry */
>  	if (kvm_s2_mmu_valid(s2_mmu))
> -		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
> +		s2_mmu->pending_unmap = true;

Question time.  The code that's just out of sight modifies the nested MMU's tlb_vttbr:

	s2_mmu->tlb_vttbr = vcpu_read_sys_reg(vcpu, vttbr_el2) & ~vttbr_cnp_bit;
	s2_mmu->tlb_vtcr = vcpu_read_sys_reg(vcpu, vtcr_el2);
	s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, hcr_el2) & hcr_vm;

which is consumed by kvm_s2_mmu_iterate_by_vmid().  IIUC, kvm_s2_mmu_iterate_by_vmid()
is used only when emulating some form of tlbi that's executed by the guest?  I.e.
is guaranteed to do the right thing because KVM ensures the old state is purged
prior to entering the guest?

>  	/*
>  	 * the virtual vmid (modulo cnp) will be used as a key when matching
> @@ -649,6 +649,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
>  	s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, hcr_el2) & hcr_vm;
>  
>  out:
> +	if (s2_mmu->pending_unmap)
> +		kvm_make_request(kvm_req_nested_s2_unmap, vcpu);

If I followed everything correctly, I don't think a request is needed.  the
request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so
the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the
vCPU's usage of the MMU.  More thoughts below in check_nested_vcpu_requests().

Somewhat of a side topic, the handling of xfer_to_guest_mode_handle_work() in
kvm_arch_vcpu_ioctl_run() is confusing.  If a signal is pending (ret == -eintr),
KVM will skip checking requests, e.g. not unmap the nested stage-2, but then continue
with other pieces of the entry flow.  I'm guessing that's intentional as it looks
like KVM needs to flush hardware state in case that info is exposed to userspace.

But it really warps the mind when trying to reason about what will happen if
there is a relevant request, e.g. if the below code needs to load a new stage-2
for a new vmid, it could do so with the to-be-unmapped contents.

		/*
		 * check conditions before entering the guest
		 */
		ret = xfer_to_guest_mode_handle_work(vcpu);
		if (!ret)
			ret = 1;

		if (ret > 0)
			ret = check_vcpu_requests(vcpu);

		/*
		 * preparing the interrupts to be injected also
		 * involves poking the gic, which must be done in a
		 * non-preemptible context.
		 */
		preempt_disable();

		/*
		 * the vmid allocator only tracks active vmids per
		 * physical cpu, and therefore the vmid allocated may not be
		 * preserved on vmid roll-over if the task was preempted,
		 * making a thread's vmid inactive. so we need to call
		 * kvm_arm_vmid_update() in non-premptible context.
		 */
		if (kvm_arm_vmid_update(&vcpu_to_hw_mmu_unsafe(vcpu)->vmid) &&
		    has_vhe())
			__load_stage2(vcpu_to_hw_mmu_unsafe(vcpu),
				      &vcpu->kvm->arch);

> +
>  	kvm_get_s2_mmu(s2_mmu);
>  	return s2_mmu;
>  }
> @@ -1186,3 +1189,17 @@ int kvm_init_nv_sysregs(struct kvm *kvm)
>  
>  	return 0;
>  }
> +
> +void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
> +{
> +	if (kvm_check_request(kvm_req_nested_s2_unmap, vcpu)) {
> +		CLASS(vcpu_hw_mmu, mmu)(vcpu);
> +
> +		write_lock(&vcpu->kvm->mmu_lock);
> +		if (mmu->pending_unmap) {

I doubt it's a functional issue, but if multiple vCPUs are purging the same nested
stage-2 MMU, and the first vCPU yields while unmapping, then KVM will end up with
multiple vCPUs doing the unmap.  And irrespective of whether or not the first vCPU
(or vCPUs) yields, they'll all contend for mmu_lock, which is especially problematic
since pending writers will block future readers.

Would it instead make sense to do something like the below?  Checking for the
need to unmap outside of mmu_lock would also more or less eliminate the need for
a request.

#define KVM_S2_UNMAP_PENDING		BIT(0)
#define KVM_S2_UNMAP_IN_PROGRESS	BIT(1)

	unsigned long ops = atomic_read(mmu->unmap_operations);
	unsigned long new;

	if (!(ops & KVM_S2_UNMAP_PENDING))
		return;

	if (!(ops & KVM_S2_UNMAP_IN_PROGRESS)) {
		ops = cmpxchg(mmu->unmap_operations, ops,
			      ops | KVM_S2_UNMAP_IN_PROGRESS);

	/* Nothing more to do if the unmap already completed. */
	if (!(ops & KVM_S2_UNMAP_PENDING))
		return;

	/* If a different vCPU started the unmap, simply wait for it to finish. */
	if (ops & KVM_S2_UNMAP_IN_PROGRESS) {
		while (atomic_read(&mmu->unmap_operations))
			cond_resched(); /* enter a waitq? */
		return;
	}

	write_lock(&vcpu->kvm->mmu_lock);
	kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
	write_unlock(&vcpu->kvm->mmu_lock);

	atomic_set(&mmu->unmap_operations, 0);

> +			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
> +			mmu->pending_unmap = false;
> +		}
> +		write_unlock(&vcpu->kvm->mmu_lock);
> +	}
> +}
> -- 
> 2.46.1.824.gd892dcdcdd-goog

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01 19:05   ` Sean Christopherson
@ 2024-10-01 20:41     ` Oliver Upton
  2024-10-01 23:28       ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2024-10-01 20:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

Hey,

sidebar: I was a bit confused by the diff for a second, since it looks
like your email client lowercased some stuff :)

On Tue, Oct 01, 2024 at 12:05:01PM -0700, Sean Christopherson wrote:
> On Tue, Oct 01, 2024, Oliver Upton wrote:
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 367f0d3d3194..ab7d387379a0 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> >  
> >  		if (kvm_dirty_ring_check_request(vcpu))
> >  			return 0;
> > +
> > +		check_nested_vcpu_requests(vcpu);
> >  	}
> >  
> >  	return 1;
> > diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> > index 1ba211541290..749babdbda81 100644
> > --- a/arch/arm64/kvm/nested.c
> > +++ b/arch/arm64/kvm/nested.c
> > @@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
> >  	/* set the scene for the next search */
> >  	kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size;
> >  
> > -	/* clear the old state */
> > +	/* make sure we don't forget to do the laundry */
> >  	if (kvm_s2_mmu_valid(s2_mmu))
> > -		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
> > +		s2_mmu->pending_unmap = true;
> 
> Question time.  The code that's just out of sight modifies the nested MMU's tlb_vttbr:
> 
> 	s2_mmu->tlb_vttbr = vcpu_read_sys_reg(vcpu, vttbr_el2) & ~vttbr_cnp_bit;
> 	s2_mmu->tlb_vtcr = vcpu_read_sys_reg(vcpu, vtcr_el2);
> 	s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, hcr_el2) & hcr_vm;
> 
> which is consumed by kvm_s2_mmu_iterate_by_vmid().  IIUC, kvm_s2_mmu_iterate_by_vmid()
> is used only when emulating some form of tlbi that's executed by the guest?  I.e.
> is guaranteed to do the right thing because KVM ensures the old state is purged
> prior to entering the guest?

Heh, yeah I meant to add some comments on why iterators that walk the
nested_mmus array are safe. But it's basically what you've said, if the
current MMU gets reclaimed when we're in the middle of unmapping it for
TLBI, then the intent of the TLBI is still upheld since all the
old crap gets kicked out before reuse.

The architecture requires the guest put its own stage-2 into a
consistent state before issuing a TLBI, so even if we wind up doing more
shadow stage-2 fills after relinquishing the lock it is still legal. And
until the TLBI instruction retires (i.e. ERET back to the guest), other
vCPUs can use stale 'TLB entries' all they want.

I'll happily add some comments to clarify all this.

> >  	/*
> >  	 * the virtual vmid (modulo cnp) will be used as a key when matching
> > @@ -649,6 +649,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
> >  	s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, hcr_el2) & hcr_vm;
> >  
> >  out:
> > +	if (s2_mmu->pending_unmap)
> > +		kvm_make_request(kvm_req_nested_s2_unmap, vcpu);
> 
> If I followed everything correctly, I don't think a request is needed.  the
> request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so
> the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the
> vCPU's usage of the MMU.  More thoughts below in check_nested_vcpu_requests().

I'm (ab)using the request to prevent the vCPU thread from actually
entering the VM without first having done the laundry. We have other
examples of strictly per-vCPU tasks that are tracked with a request so
this doesn't stick out that much.

Otherwise we'd need an open-coded check in kvm_vcpu_exit_request() to
catch a 'dirty' MMU or take a pin on it from the point we check the
dirtiness to the point we disable preemption.

> Somewhat of a side topic, the handling of xfer_to_guest_mode_handle_work() in
> kvm_arch_vcpu_ioctl_run() is confusing.  If a signal is pending (ret == -eintr),
> KVM will skip checking requests, e.g. not unmap the nested stage-2, but then continue
> with other pieces of the entry flow.  I'm guessing that's intentional as it looks
> like KVM needs to flush hardware state in case that info is exposed to userspace.
> 
> But it really warps the mind when trying to reason about what will happen if
> there is a relevant request, e.g. if the below code needs to load a new stage-2
> for a new vmid, it could do so with the to-be-unmapped contents.

What you describe sounds worrying, but is still completely fine. Even
without VMID rollover, we're loading a potentially-stale MMU at the
point of vcpu_load().

What saves us is the fact that the EL1&0 translation regime is
out-of-context at EL2, meaning that hardware is disallowed from
speculatively using the translation regime. Ignoring the
ARM64_WORKAROUND_SPECULATIVE_AT turd :)

The point at which the MMU and translation tables *must* be consistent
is the ERET into EL1/EL0.

> > +void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
> > +{
> > +	if (kvm_check_request(kvm_req_nested_s2_unmap, vcpu)) {
> > +		CLASS(vcpu_hw_mmu, mmu)(vcpu);
> > +
> > +		write_lock(&vcpu->kvm->mmu_lock);
> > +		if (mmu->pending_unmap) {
> 
> I doubt it's a functional issue, but if multiple vCPUs are purging the same nested
> stage-2 MMU, and the first vCPU yields while unmapping, then KVM will end up with
> multiple vCPUs doing the unmap.  And irrespective of whether or not the first vCPU
> (or vCPUs) yields, they'll all contend for mmu_lock, which is especially problematic
> since pending writers will block future readers.
> 
> Would it instead make sense to do something like the below?  Checking for the
> need to unmap outside of mmu_lock would also more or less eliminate the need for
> a request.

Maybe? The intent of this change is to restore functional correctness
without _too_ much interest in the performance of it. The general aim of
the nested work on arm64 is getting something functional upstreamed
first, then fixing performance after. You'll notice we have no concept
of an rmap yet either...

I don't really see anything wrong with your approach, it's just that we
aren't necessarily committed to this allocation scheme and we may throw
the whole damn thing out in favor of something else in the future.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01  0:17 ` [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
  2024-10-01 19:05   ` Sean Christopherson
@ 2024-10-01 23:23   ` Marc Zyngier
  2024-10-02  0:06     ` Oliver Upton
  1 sibling, 1 reply; 19+ messages in thread
From: Marc Zyngier @ 2024-10-01 23:23 UTC (permalink / raw)
  To: Oliver Upton; +Cc: kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Tue, 01 Oct 2024 01:17:09 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> Currently, when a nested MMU is repurposed for some other MMU context,
> KVM unmaps everything during vcpu_load() while holding the MMU lock for
> write. This is quite a performance bottleneck for large nested VMs, as
> all vCPU scheduling will spin until the unmap completes.
> 
> Start punting the MMU cleanup to a vCPU request, where it is then
> possible to periodically release the MMU lock and CPU in the presence of
> contention. To make that possible, take a pin on the hardware MMU while
> unmapping it to guarantee the responsible vCPU finds the MMU on the next
> vcpu_load().
> 
> Ensure that no vCPU winds up using a stale MMU by tracking the pending
> unmap on the S2 MMU itself and requesting an unmap on every vCPU that
> finds it.
> 
> Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
> ---
>  arch/arm64/include/asm/kvm_host.h   |  3 +++
>  arch/arm64/include/asm/kvm_nested.h |  2 ++
>  arch/arm64/kvm/arm.c                |  2 ++
>  arch/arm64/kvm/nested.c             | 21 +++++++++++++++++++--
>  4 files changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 268f69196380..1d310292c623 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -51,6 +51,7 @@
>  #define KVM_REQ_RELOAD_PMU	KVM_ARCH_REQ(5)
>  #define KVM_REQ_SUSPEND		KVM_ARCH_REQ(6)
>  #define KVM_REQ_RESYNC_PMU_EL0	KVM_ARCH_REQ(7)
> +#define KVM_REQ_NESTED_S2_UNMAP	KVM_ARCH_REQ(8)
>  
>  #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
>  				     KVM_DIRTY_LOG_INITIALLY_SET)
> @@ -216,6 +217,8 @@ struct kvm_s2_mmu {
>  	 * >0: Somebody is actively using this.
>  	 */
>  	atomic_t refcnt;
> +
> +	bool	pending_unmap;
>  };
>  
>  struct kvm_arch_memory_slot {
> diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
> index e74b90dcfac4..233e65522716 100644
> --- a/arch/arm64/include/asm/kvm_nested.h
> +++ b/arch/arm64/include/asm/kvm_nested.h
> @@ -78,6 +78,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
>  extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
>  extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
>  
> +extern void check_nested_vcpu_requests(struct kvm_vcpu *vcpu);
> +
>  struct kvm_s2_trans {
>  	phys_addr_t output;
>  	unsigned long block_size;
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 367f0d3d3194..ab7d387379a0 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1031,6 +1031,8 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>  
>  		if (kvm_dirty_ring_check_request(vcpu))
>  			return 0;
> +
> +		check_nested_vcpu_requests(vcpu);
>  	}
>  
>  	return 1;
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 1ba211541290..749babdbda81 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -632,9 +632,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
>  	/* Set the scene for the next search */
>  	kvm->arch.nested_mmus_next = (i + 1) % kvm->arch.nested_mmus_size;
>  
> -	/* Clear the old state */
> +	/* Make sure we don't forget to do the laundry */
>  	if (kvm_s2_mmu_valid(s2_mmu))
> -		kvm_stage2_unmap_range(s2_mmu, 0, kvm_phys_size(s2_mmu), false);
> +		s2_mmu->pending_unmap = true;
>  
>  	/*
>  	 * The virtual VMID (modulo CnP) will be used as a key when matching
> @@ -649,6 +649,9 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
>  	s2_mmu->nested_stage2_enabled = vcpu_read_sys_reg(vcpu, HCR_EL2) & HCR_VM;
>  
>  out:
> +	if (s2_mmu->pending_unmap)
> +		kvm_make_request(KVM_REQ_NESTED_S2_UNMAP, vcpu);
> +
>  	kvm_get_s2_mmu(s2_mmu);
>  	return s2_mmu;
>  }
> @@ -1186,3 +1189,17 @@ int kvm_init_nv_sysregs(struct kvm *kvm)
>  
>  	return 0;
>  }
> +
> +void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
> +{
> +	if (kvm_check_request(KVM_REQ_NESTED_S2_UNMAP, vcpu)) {
> +		CLASS(vcpu_hw_mmu, mmu)(vcpu);

I really think a small comment outlining *why* we need to elevate the
refcount would help. This single line is what holds the whole thing
together.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01 20:41     ` Oliver Upton
@ 2024-10-01 23:28       ` Sean Christopherson
  2024-10-01 23:49         ` Marc Zyngier
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-10-01 23:28 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvmarm, Marc Zyngier, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Tue, Oct 01, 2024, Oliver Upton wrote:
> Hey,
> 
> sidebar: I was a bit confused by the diff for a second, since it looks
> like your email client lowercased some stuff :)

Wasn't my mail client, it was PEBKAC.  I copy+pasted a large chunk in Vim because
I wanted to pull in the changelog (which I had deleted from my response), but then
I changed my mind, and in doing so I managed to fat-finger something that converted
everything to lowercase.  And yeah, it confused me too.

> > >  out:
> > > +	if (s2_mmu->pending_unmap)
> > > +		kvm_make_request(kvm_req_nested_s2_unmap, vcpu);
> > 
> > If I followed everything correctly, I don't think a request is needed.  the
> > request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so
> > the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the
> > vCPU's usage of the MMU.  More thoughts below in check_nested_vcpu_requests().
> 
> I'm (ab)using the request to prevent the vCPU thread from actually
> entering the VM without first having done the laundry. We have other
> examples of strictly per-vCPU tasks that are tracked with a request so
> this doesn't stick out that much.
> 
> Otherwise we'd need an open-coded check in kvm_vcpu_exit_request() to
> catch a 'dirty' MMU or take a pin on it from the point we check the
> dirtiness to the point we disable preemption.

Ewww, because kvm_arch_vcpu_put() puts the nested stage-2 when the vCPU is
scheduled out.  Mostly out of curiosity, why?  99.9% of the time, the vCPU will
be scheduled back in.

Now that vcpu->scheduled_out is a thing, retaining the nested s2 MMU should be
quite straightforward.  kvm_arch_vcpu_destroy() would need to put the MMU, but
that should also be straightforward.

It wouldn't magically fix the mmu_lock bottleneck, but it would make hw_mmu
stable (obviously except for the flows that explicitly change it).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01 23:28       ` Sean Christopherson
@ 2024-10-01 23:49         ` Marc Zyngier
  2024-10-02  0:06           ` Oliver Upton
  0 siblings, 1 reply; 19+ messages in thread
From: Marc Zyngier @ 2024-10-01 23:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Wed, 02 Oct 2024 00:28:18 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Tue, Oct 01, 2024, Oliver Upton wrote:
> > Hey,
> > 
> > sidebar: I was a bit confused by the diff for a second, since it looks
> > like your email client lowercased some stuff :)
> 
> Wasn't my mail client, it was PEBKAC.  I copy+pasted a large chunk in Vim because
> I wanted to pull in the changelog (which I had deleted from my response), but then
> I changed my mind, and in doing so I managed to fat-finger something that converted
> everything to lowercase.  And yeah, it confused me too.
> 
> > > >  out:
> > > > +	if (s2_mmu->pending_unmap)
> > > > +		kvm_make_request(kvm_req_nested_s2_unmap, vcpu);
> > > 
> > > If I followed everything correctly, I don't think a request is needed.  the
> > > request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so
> > > the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the
> > > vCPU's usage of the MMU.  More thoughts below in check_nested_vcpu_requests().
> > 
> > I'm (ab)using the request to prevent the vCPU thread from actually
> > entering the VM without first having done the laundry. We have other
> > examples of strictly per-vCPU tasks that are tracked with a request so
> > this doesn't stick out that much.
> > 
> > Otherwise we'd need an open-coded check in kvm_vcpu_exit_request() to
> > catch a 'dirty' MMU or take a pin on it from the point we check the
> > dirtiness to the point we disable preemption.
> 
> Ewww, because kvm_arch_vcpu_put() puts the nested stage-2 when the vCPU is
> scheduled out.  Mostly out of curiosity, why?  99.9% of the time, the vCPU will
> be scheduled back in.

Because s2 MMU structures are a scarce resource. and other vcpus could
have the opportunity to make use of an unused slot.

> Now that vcpu->scheduled_out is a thing, retaining the nested s2 MMU should be
> quite straightforward.  kvm_arch_vcpu_destroy() would need to put the MMU, but
> that should also be straightforward.

This code long predates scheduled_out, and I don't think this brings
much to the table. If the vcpu comes back quickly, it will find its
toys where it left them. If not, someone else will have borrowed them,
and it will have to pick new ones. It isn't any different from TLBs,
which s2 MMUs model.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01 23:49         ` Marc Zyngier
@ 2024-10-02  0:06           ` Oliver Upton
  2024-10-02  0:23             ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2024-10-02  0:06 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Sean Christopherson, kvmarm, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu

On Wed, Oct 02, 2024 at 12:49:27AM +0100, Marc Zyngier wrote:
> On Wed, 02 Oct 2024 00:28:18 +0100,
> Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Tue, Oct 01, 2024, Oliver Upton wrote:
> > > Hey,
> > > 
> > > sidebar: I was a bit confused by the diff for a second, since it looks
> > > like your email client lowercased some stuff :)
> > 
> > Wasn't my mail client, it was PEBKAC.  I copy+pasted a large chunk in Vim because
> > I wanted to pull in the changelog (which I had deleted from my response), but then
> > I changed my mind, and in doing so I managed to fat-finger something that converted
> > everything to lowercase.  And yeah, it confused me too.
> > 
> > > > >  out:
> > > > > +	if (s2_mmu->pending_unmap)
> > > > > +		kvm_make_request(kvm_req_nested_s2_unmap, vcpu);
> > > > 
> > > > If I followed everything correctly, I don't think a request is needed.  the
> > > > request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so
> > > > the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the
> > > > vCPU's usage of the MMU.  More thoughts below in check_nested_vcpu_requests().
> > > 
> > > I'm (ab)using the request to prevent the vCPU thread from actually
> > > entering the VM without first having done the laundry. We have other
> > > examples of strictly per-vCPU tasks that are tracked with a request so
> > > this doesn't stick out that much.
> > > 
> > > Otherwise we'd need an open-coded check in kvm_vcpu_exit_request() to
> > > catch a 'dirty' MMU or take a pin on it from the point we check the
> > > dirtiness to the point we disable preemption.
> > 
> > Ewww, because kvm_arch_vcpu_put() puts the nested stage-2 when the vCPU is
> > scheduled out.  Mostly out of curiosity, why?  99.9% of the time, the vCPU will
> > be scheduled back in.
> 
> Because s2 MMU structures are a scarce resource. and other vcpus could
> have the opportunity to make use of an unused slot.
> 
> > Now that vcpu->scheduled_out is a thing, retaining the nested s2 MMU should be
> > quite straightforward.  kvm_arch_vcpu_destroy() would need to put the MMU, but
> > that should also be straightforward.
> 
> This code long predates scheduled_out, and I don't think this brings
> much to the table. If the vcpu comes back quickly, it will find its
> toys where it left them. If not, someone else will have borrowed them,
> and it will have to pick new ones. It isn't any different from TLBs,
> which s2 MMUs model.

In line with what Sean is recommending, it might make sense to explore
holding the reference while a vCPU is loaded and runnable, i.e. the vCPU
isn't scheduling out due to an emulated WFI. While not perfect, it would
increase the likelihood that we evict a 'cold' MMU.

But again, we should make sure we're actually happy with this allocation
scheme first before bothering with optimizing it a lot.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-01 23:23   ` Marc Zyngier
@ 2024-10-02  0:06     ` Oliver Upton
  0 siblings, 0 replies; 19+ messages in thread
From: Oliver Upton @ 2024-10-02  0:06 UTC (permalink / raw)
  To: Marc Zyngier; +Cc: kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Wed, Oct 02, 2024 at 12:23:43AM +0100, Marc Zyngier wrote:
> > +
> > +void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
> > +{
> > +	if (kvm_check_request(KVM_REQ_NESTED_S2_UNMAP, vcpu)) {
> > +		CLASS(vcpu_hw_mmu, mmu)(vcpu);
> 
> I really think a small comment outlining *why* we need to elevate the
> refcount would help. This single line is what holds the whole thing
> together.

Sure, it is only _slightly_ load bearing...

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-02  0:06           ` Oliver Upton
@ 2024-10-02  0:23             ` Sean Christopherson
  2024-10-02 23:31               ` Marc Zyngier
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-10-02  0:23 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Marc Zyngier, kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Tue, Oct 01, 2024, Oliver Upton wrote:
> On Wed, Oct 02, 2024 at 12:49:27AM +0100, Marc Zyngier wrote:
> > On Wed, 02 Oct 2024 00:28:18 +0100,
> > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > On Tue, Oct 01, 2024, Oliver Upton wrote:
> > > > Hey,
> > > > 
> > > > sidebar: I was a bit confused by the diff for a second, since it looks
> > > > like your email client lowercased some stuff :)
> > > 
> > > Wasn't my mail client, it was PEBKAC.  I copy+pasted a large chunk in Vim because
> > > I wanted to pull in the changelog (which I had deleted from my response), but then
> > > I changed my mind, and in doing so I managed to fat-finger something that converted
> > > everything to lowercase.  And yeah, it confused me too.
> > > 
> > > > > >  out:
> > > > > > +	if (s2_mmu->pending_unmap)
> > > > > > +		kvm_make_request(kvm_req_nested_s2_unmap, vcpu);
> > > > > 
> > > > > If I followed everything correctly, I don't think a request is needed.  the
> > > > > request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so
> > > > > the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the
> > > > > vCPU's usage of the MMU.  More thoughts below in check_nested_vcpu_requests().
> > > > 
> > > > I'm (ab)using the request to prevent the vCPU thread from actually
> > > > entering the VM without first having done the laundry. We have other
> > > > examples of strictly per-vCPU tasks that are tracked with a request so
> > > > this doesn't stick out that much.
> > > > 
> > > > Otherwise we'd need an open-coded check in kvm_vcpu_exit_request() to
> > > > catch a 'dirty' MMU or take a pin on it from the point we check the
> > > > dirtiness to the point we disable preemption.
> > > 
> > > Ewww, because kvm_arch_vcpu_put() puts the nested stage-2 when the vCPU is
> > > scheduled out.  Mostly out of curiosity, why?  99.9% of the time, the vCPU will
> > > be scheduled back in.
> > 
> > Because s2 MMU structures are a scarce resource. and other vcpus could
> > have the opportunity to make use of an unused slot.

But that slot is less unused than other unused slots, in the sense that KVM _knows_
at least one vCPU intends to use that MMU in the near future, whereas KVM has no
tracking to know if an MMU with no references whatsoever is likely to be reused.

IIUC, KVM round-robins across 2*nr_vcpus MMUs, and when L1 switches to a different
VTTBR, it will first drop its reference to the previous MMU.  So at any given time,
there are nr_vcpus worth of unused MMUs, i.e. a vCPU is guaranteed to be able to
find an unused slot, even if vCPUs that are scheduled out hold onto their S2 MMU
reference.

At that point, choosing an MMU that no vCPU is using seems more likely to recycle
a cold/dead MMU than a soon-to-be-reused MMU.

And the round-robin approach makes it all heavily luck-based anyways.  E.g. if
a vCPU puts VTTBR A and then loads VTTBR B, B could recycle A's S2 MMU if that
MMU slot is next up for recycling.

> > > Now that vcpu->scheduled_out is a thing, retaining the nested s2 MMU should be
> > > quite straightforward.  kvm_arch_vcpu_destroy() would need to put the MMU, but
> > > that should also be straightforward.
> > 
> > This code long predates scheduled_out, and I don't think this brings
> > much to the table. If the vcpu comes back quickly, it will find its
> > toys where it left them. If not, someone else will have borrowed them,
> > and it will have to pick new ones. It isn't any different from TLBs,
> > which s2 MMUs model.
> 
> In line with what Sean is recommending, it might make sense to explore
> holding the reference while a vCPU is loaded and runnable, i.e. the vCPU
> isn't scheduling out due to an emulated WFI. While not perfect, it would
> increase the likelihood that we evict a 'cold' MMU.
> 
> But again, we should make sure we're actually happy with this allocation
> scheme first before bothering with optimizing it a lot.

Heh, yeah.  I wasn't coming at this from a performance angle, so much as an "avoid
footguns and weird edgecases" angle.  Allowing the nested S2 MMU to be blown away
at essentially any time is likely to be quite surprising to most folks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-02  0:23             ` Sean Christopherson
@ 2024-10-02 23:31               ` Marc Zyngier
  2024-10-03  0:04                 ` Oliver Upton
  0 siblings, 1 reply; 19+ messages in thread
From: Marc Zyngier @ 2024-10-02 23:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Oliver Upton, kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Wed, 02 Oct 2024 01:23:41 +0100,
Sean Christopherson <seanjc@google.com> wrote:
> 
> On Tue, Oct 01, 2024, Oliver Upton wrote:
> > On Wed, Oct 02, 2024 at 12:49:27AM +0100, Marc Zyngier wrote:
> > > On Wed, 02 Oct 2024 00:28:18 +0100,
> > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > On Tue, Oct 01, 2024, Oliver Upton wrote:
> > > > > Hey,
> > > > > 
> > > > > sidebar: I was a bit confused by the diff for a second, since it looks
> > > > > like your email client lowercased some stuff :)
> > > > 
> > > > Wasn't my mail client, it was PEBKAC.  I copy+pasted a large chunk in Vim because
> > > > I wanted to pull in the changelog (which I had deleted from my response), but then
> > > > I changed my mind, and in doing so I managed to fat-finger something that converted
> > > > everything to lowercase.  And yeah, it confused me too.
> > > > 
> > > > > > >  out:
> > > > > > > +	if (s2_mmu->pending_unmap)
> > > > > > > +		kvm_make_request(kvm_req_nested_s2_unmap, vcpu);
> > > > > > 
> > > > > > If I followed everything correctly, I don't think a request is needed.  the
> > > > > > request will never be cross-vCPU, and each vCPU holds a reference to the MMU, so
> > > > > > the MMU can't be recycled, i.e. pending_unmap is guaranteed to be relevant to the
> > > > > > vCPU's usage of the MMU.  More thoughts below in check_nested_vcpu_requests().
> > > > > 
> > > > > I'm (ab)using the request to prevent the vCPU thread from actually
> > > > > entering the VM without first having done the laundry. We have other
> > > > > examples of strictly per-vCPU tasks that are tracked with a request so
> > > > > this doesn't stick out that much.
> > > > > 
> > > > > Otherwise we'd need an open-coded check in kvm_vcpu_exit_request() to
> > > > > catch a 'dirty' MMU or take a pin on it from the point we check the
> > > > > dirtiness to the point we disable preemption.
> > > > 
> > > > Ewww, because kvm_arch_vcpu_put() puts the nested stage-2 when the vCPU is
> > > > scheduled out.  Mostly out of curiosity, why?  99.9% of the time, the vCPU will
> > > > be scheduled back in.
> > > 
> > > Because s2 MMU structures are a scarce resource. and other vcpus could
> > > have the opportunity to make use of an unused slot.
> 
> But that slot is less unused than other unused slots, in the sense that KVM _knows_
> at least one vCPU intends to use that MMU in the near future, whereas KVM has no
> tracking to know if an MMU with no references whatsoever is likely to be reused.

How do you know that? I'd happily borrow your crystal ball. By the
time that vcpu is restarted, other vcpus could have done a lot of
useful work by using that S2 MMU.

> IIUC, KVM round-robins across 2*nr_vcpus MMUs, and when L1 switches to a different
> VTTBR, it will first drop its reference to the previous MMU.  So at any given time,
> there are nr_vcpus worth of unused MMUs, i.e. a vCPU is guaranteed to be able to
> find an unused slot, even if vCPUs that are scheduled out hold onto their S2 MMU
> reference.

It's not about not finding a slot, but about making sure that vcpus
that context switch rapidly between VTTBRs for their own guests can do
so freely without sacrificing the TLBs they have just produced. Not
reusing the TLBs hogged by a vcpu that cannot run is a waste of
resource.

>
> At that point, choosing an MMU that no vCPU is using seems more likely to recycle
> a cold/dead MMU than a soon-to-be-reused MMU.
> 
> And the round-robin approach makes it all heavily luck-based anyways.  E.g. if
> a vCPU puts VTTBR A and then loads VTTBR B, B could recycle A's S2 MMU if that
> MMU slot is next up for recycling.

Well, we'll have to agree to disagree. It's a terrible hack to add
artificial ties between a vcpu and TLBs. Because that's what the
shadow MMU is, nothing else.

So if you don't like the TLB eviction policy, please come up with a
better one, making sure that a recently preempted vcpu gets its S2 MMU
recycled last. But please don't add the notion of "locked TLBs" to the
mix, because that's a pretty dodgy architectural concept.

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-02 23:31               ` Marc Zyngier
@ 2024-10-03  0:04                 ` Oliver Upton
  2024-10-03  0:12                   ` Oliver Upton
  0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2024-10-03  0:04 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Sean Christopherson, kvmarm, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu

Hey,

On Thu, Oct 03, 2024 at 12:31:33AM +0100, Marc Zyngier wrote:
> On Wed, 02 Oct 2024 01:23:41 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > IIUC, KVM round-robins across 2*nr_vcpus MMUs, and when L1 switches to a different
> > VTTBR, it will first drop its reference to the previous MMU.  So at any given time,
> > there are nr_vcpus worth of unused MMUs, i.e. a vCPU is guaranteed to be able to
> > find an unused slot, even if vCPUs that are scheduled out hold onto their S2 MMU
> > reference.
> 
> It's not about not finding a slot, but about making sure that vcpus
> that context switch rapidly between VTTBRs for their own guests can do
> so freely without sacrificing the TLBs they have just produced. Not
> reusing the TLBs hogged by a vcpu that cannot run is a waste of
> resource.

OTOH, our global TLBs don't model hardware exactly since a vCPU doing
rapid context switches trash the TLBs of *all* vCPUs in the system.
The cost of reusing an MMU is quite noticeable, since our unmap
implementation is slightly crap at the moment, the cost of which shows
up both on sides of the reclaim (victim and user).

> >
> > At that point, choosing an MMU that no vCPU is using seems more likely to recycle
> > a cold/dead MMU than a soon-to-be-reused MMU.
> > 
> > And the round-robin approach makes it all heavily luck-based anyways.  E.g. if
> > a vCPU puts VTTBR A and then loads VTTBR B, B could recycle A's S2 MMU if that
> > MMU slot is next up for recycling.
> 
> Well, we'll have to agree to disagree. It's a terrible hack to add
> artificial ties between a vcpu and TLBs. Because that's what the
> shadow MMU is, nothing else.
> 
> So if you don't like the TLB eviction policy, please come up with a
> better one, making sure that a recently preempted vcpu gets its S2 MMU
> recycled last. But please don't add the notion of "locked TLBs" to the
> mix, because that's a pretty dodgy architectural concept.

Well, I've effectively implemented "locked" TLBs by way of allowing
vCPUs to pin an MMU when doing some MMU operation. It's just that the
detail gets encoded in the callsite instead of being some general
property of MMU assignment.

After fiddling with this a bit more (diff below), I'm actually inclined
to go for holding a reference on scheduled out vCPUs *if* we have reason
to believe it is gonna do something useful. You get a stable hw_mmu
pointer in the places where it matters and can avoid taking the MMU lock
for write during vcpu_load() in the 'fast path'.

Still should drop the reference in most other cases, as I do *not* want
to entertain vCPUs holding a reference when they've gone out to
userspace.

diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index f9e30dd34c7a..df670c14e1c6 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -663,6 +663,13 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * The vCPU kept its reference on the MMU after the last put, keep
+	 * rolling with it.
+	 */
+	if (vcpu->arch.hw_mmu)
+		return;
+
 	if (is_hyp_ctxt(vcpu)) {
 		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
 	} else {
@@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
 
 void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
 {
-	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
+	/*
+	 * Keep a reference on the associated stage-2 MMU if the vCPU is
+	 * scheduling out and not in WFI emulation, suggesting it is likely to
+	 * reuse the MMU sometime soon.
+	 */
+	if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
+		return;
+
+	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu))
 		atomic_dec(&vcpu->arch.hw_mmu->refcnt);
-		vcpu->arch.hw_mmu = NULL;
-	}
+
+	vcpu->arch.hw_mmu = NULL;
 }
 
 /*
-- 
2.47.0.rc0.187.ge670bccf7e-goog


-- 
Thanks,
Oliver

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-03  0:04                 ` Oliver Upton
@ 2024-10-03  0:12                   ` Oliver Upton
  2024-10-03 16:45                     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2024-10-03  0:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Sean Christopherson, kvmarm, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu

On Thu, Oct 03, 2024 at 12:04:06AM +0000, Oliver Upton wrote:
> Hey,
> 
> On Thu, Oct 03, 2024 at 12:31:33AM +0100, Marc Zyngier wrote:
> > On Wed, 02 Oct 2024 01:23:41 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > IIUC, KVM round-robins across 2*nr_vcpus MMUs, and when L1 switches to a different
> > > VTTBR, it will first drop its reference to the previous MMU.  So at any given time,
> > > there are nr_vcpus worth of unused MMUs, i.e. a vCPU is guaranteed to be able to
> > > find an unused slot, even if vCPUs that are scheduled out hold onto their S2 MMU
> > > reference.
> > 
> > It's not about not finding a slot, but about making sure that vcpus
> > that context switch rapidly between VTTBRs for their own guests can do
> > so freely without sacrificing the TLBs they have just produced. Not
> > reusing the TLBs hogged by a vcpu that cannot run is a waste of
> > resource.
> 
> OTOH, our global TLBs don't model hardware exactly since a vCPU doing
> rapid context switches trash the TLBs of *all* vCPUs in the system.
> The cost of reusing an MMU is quite noticeable, since our unmap
> implementation is slightly crap at the moment, the cost of which shows
> up both on sides of the reclaim (victim and user).

Oh, and why unmap is crap:

The walker has no way of informing the iterator how far into the range
the walk actually got. Instead, the iterator forcibly walks every
PUD/PMD sized chunk.

So if you have massive holes in your IPA space (which we generally do)
then you wind up revisiting the same invalid PGD entry a bunch of times.
And TLBI for it too when you have TLBIRANGE.

I'm working on a fix for this too, but what I have right now needs to be
aggressively de-crapified first.

> > >
> > > At that point, choosing an MMU that no vCPU is using seems more likely to recycle
> > > a cold/dead MMU than a soon-to-be-reused MMU.
> > > 
> > > And the round-robin approach makes it all heavily luck-based anyways.  E.g. if
> > > a vCPU puts VTTBR A and then loads VTTBR B, B could recycle A's S2 MMU if that
> > > MMU slot is next up for recycling.
> > 
> > Well, we'll have to agree to disagree. It's a terrible hack to add
> > artificial ties between a vcpu and TLBs. Because that's what the
> > shadow MMU is, nothing else.
> > 
> > So if you don't like the TLB eviction policy, please come up with a
> > better one, making sure that a recently preempted vcpu gets its S2 MMU
> > recycled last. But please don't add the notion of "locked TLBs" to the
> > mix, because that's a pretty dodgy architectural concept.
> 
> Well, I've effectively implemented "locked" TLBs by way of allowing
> vCPUs to pin an MMU when doing some MMU operation. It's just that the
> detail gets encoded in the callsite instead of being some general
> property of MMU assignment.
> 
> After fiddling with this a bit more (diff below), I'm actually inclined
> to go for holding a reference on scheduled out vCPUs *if* we have reason
> to believe it is gonna do something useful. You get a stable hw_mmu
> pointer in the places where it matters and can avoid taking the MMU lock
> for write during vcpu_load() in the 'fast path'.
> 
> Still should drop the reference in most other cases, as I do *not* want
> to entertain vCPUs holding a reference when they've gone out to
> userspace.
> 
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index f9e30dd34c7a..df670c14e1c6 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -663,6 +663,13 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
>  
>  void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
>  {
> +	/*
> +	 * The vCPU kept its reference on the MMU after the last put, keep
> +	 * rolling with it.
> +	 */
> +	if (vcpu->arch.hw_mmu)
> +		return;
> +
>  	if (is_hyp_ctxt(vcpu)) {
>  		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
>  	} else {
> @@ -674,10 +681,18 @@ void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
>  
>  void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu)
>  {
> -	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu)) {
> +	/*
> +	 * Keep a reference on the associated stage-2 MMU if the vCPU is
> +	 * scheduling out and not in WFI emulation, suggesting it is likely to
> +	 * reuse the MMU sometime soon.
> +	 */
> +	if (vcpu->scheduled_out && !vcpu_get_flag(vcpu, IN_WFI))
> +		return;
> +
> +	if (kvm_is_nested_s2_mmu(vcpu->kvm, vcpu->arch.hw_mmu))
>  		atomic_dec(&vcpu->arch.hw_mmu->refcnt);
> -		vcpu->arch.hw_mmu = NULL;
> -	}
> +
> +	vcpu->arch.hw_mmu = NULL;
>  }
>  
>  /*
> -- 
> 2.47.0.rc0.187.ge670bccf7e-goog
> 
> 
> -- 
> Thanks,
> Oliver

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-03  0:12                   ` Oliver Upton
@ 2024-10-03 16:45                     ` Sean Christopherson
  2024-10-03 17:52                       ` Oliver Upton
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-10-03 16:45 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Marc Zyngier, kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Thu, Oct 03, 2024, Oliver Upton wrote:
> On Thu, Oct 03, 2024 at 12:04:06AM +0000, Oliver Upton wrote:
> > Hey,
> > 
> > On Thu, Oct 03, 2024 at 12:31:33AM +0100, Marc Zyngier wrote:
> > > On Wed, 02 Oct 2024 01:23:41 +0100, Sean Christopherson <seanjc@google.com> wrote:
> > > > IIUC, KVM round-robins across 2*nr_vcpus MMUs, and when L1 switches to a different
> > > > VTTBR, it will first drop its reference to the previous MMU.  So at any given time,
> > > > there are nr_vcpus worth of unused MMUs, i.e. a vCPU is guaranteed to be able to
> > > > find an unused slot, even if vCPUs that are scheduled out hold onto their S2 MMU
> > > > reference.
> > > 
> > > It's not about not finding a slot, but about making sure that vcpus
> > > that context switch rapidly between VTTBRs for their own guests can do
> > > so freely without sacrificing the TLBs they have just produced. Not
> > > reusing the TLBs hogged by a vcpu that cannot run is a waste of
> > > resource.

I don't think it's a complete waste.  There's value in not having to unmap and
rebuild an S2 MMU that is likely to be reused in the near future, especially for
a vCPU that was preempted.  E.g. the preempted L2 vCPU is already going to
experience some amount of jitter, forcing it to recycle a different S2 MMU and
then rebuild its S2 MMU is going to exacerbate the jitter.

Jitter aside, for a well-behaved system, it's unlikely a vCPU in the same VM will
be able to switch between L2s (VTTBRs) so fast that it would be a net positive to
recycle the TLB/MMU of a VTTBR that is "active", but scheduled out.  E.g. if the
L0 scheduler doesn't give the scheduled out vCPU a time slice "soon", then the L1
VM is going to be quite unhappy.

> > OTOH, our global TLBs don't model hardware exactly since a vCPU doing
> > rapid context switches trash the TLBs of *all* vCPUs in the system.
> > The cost of reusing an MMU is quite noticeable, since our unmap
> > implementation is slightly crap at the moment, the cost of which shows
> > up both on sides of the reclaim (victim and user).
> 
> Oh, and why unmap is crap:

Heh, isn't unmap by definition crap?  If KVM needs to unmap and rebuild an S2 MMU,
then KVM is already in a slow, sub-optimal situation.  I'm not saying unmap and
rebuild shouldn't be optimized as much as possible, just that weighting heavily
twoards avoiding unmap+rebuild will likely yield better overall performance and
experience, even if it means holding references across certain boundaries.

> > > > At that point, choosing an MMU that no vCPU is using seems more likely to recycle
> > > > a cold/dead MMU than a soon-to-be-reused MMU.
> > > > 
> > > > And the round-robin approach makes it all heavily luck-based anyways.  E.g. if
> > > > a vCPU puts VTTBR A and then loads VTTBR B, B could recycle A's S2 MMU if that
> > > > MMU slot is next up for recycling.
> > > 
> > > Well, we'll have to agree to disagree. It's a terrible hack to add
> > > artificial ties between a vcpu and TLBs.

But those ties already exist, in that nested_mmus_size scales with the number of
L1 vCPUs.
 
> > Still should drop the reference in most other cases, as I do *not* want
> > to entertain vCPUs holding a reference when they've gone out to
> > userspace.

Why not?  The vCPU is still running, keeping its S2 MMU resident is desirable, no?
It's extremely unlikely userspace will terminate or refuse to run the vCPU unless
the entire L1 VM is doomed.

Similarly, if an L1 vCPU is running multiple L2s, i.e. multiple VTTBRs, then it's
desirable to keep references to multiple S2 MMUs as well, e.g. so that switching
between two VTTBRs is guaranteed to get a cache hit on both.

Essentially all I'm suggesting is that instead of having a common pool of 2*vCPUs
TLBs per L1 VMM, have 2 (or however many) TLBs per L1 vCPU, plus maybe N extra
TLBs per L1 VMM.  I.e. mimic the hierarchical design of hardware caches and TLBs
to some extent.

vCPUs can still spill past their dedicated 2 TLBs, e.g. if L1 is only running L2s
in a subset of vCPUs.  Yeah, there will be wasted memory if a vCPU stops running
L2s, as KVM won't know when to drop references to its VTTBRs.  But that's a very
manageable problem (though definitely something for the future), e.g. by detecting
effective TLB pressure and requesting vCPUs to drop references to VTTBRs that
aren't actively being used, or by dropping references if the VM clobbers the
associated stage-2 PTEs that are being shadowed (i.e. if L1 is no longer using
that memory for page tables).

I'm not suggesting an overhaul to fix the preemption bug, but I do think that
effectively acquiring references on-demand is an unnecessarily complex solution.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-03 16:45                     ` Sean Christopherson
@ 2024-10-03 17:52                       ` Oliver Upton
  2024-10-03 18:23                         ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Oliver Upton @ 2024-10-03 17:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Thu, Oct 03, 2024 at 09:45:40AM -0700, Sean Christopherson wrote:

[...[

> > > OTOH, our global TLBs don't model hardware exactly since a vCPU doing
> > > rapid context switches trash the TLBs of *all* vCPUs in the system.
> > > The cost of reusing an MMU is quite noticeable, since our unmap
> > > implementation is slightly crap at the moment, the cost of which shows
> > > up both on sides of the reclaim (victim and user).
> > 
> > Oh, and why unmap is crap:
> 
> Heh, isn't unmap by definition crap?  If KVM needs to unmap and rebuild an S2 MMU,
> then KVM is already in a slow, sub-optimal situation.

Not really, the unmap plumbing is used for applying the intent of a
guest TLBI too. Sub-optimal or not, it is exactly what the VM asked for,
and it'd be in our interest to handle the unmap as expeditiously as
possible.

> > > Still should drop the reference in most other cases, as I do *not* want
> > > to entertain vCPUs holding a reference when they've gone out to
> > > userspace.
> 
> Why not?  The vCPU is still running, keeping its S2 MMU resident is desirable, no?

How could we possibly know what the intent of userspace is? The VMM
could just as well throw that vCPU fd on ice for an eternity.

For example, you could have a PSCI implementation that lives in
userspace. Guest does CPU_OFF and the VMM decides to terminate the
backing thread and keep the FD around for the next CPU_ON.

Since KVM still views that fd as 'runnable', it'd sit on the reference
that vCPU holds indefinitely. On top of that, it adds complexity to the
implementation since we would need more refcount cleanup flows to handle
these straggler references.

> Essentially all I'm suggesting is that instead of having a common pool of 2*vCPUs
> TLBs per L1 VMM, have 2 (or however many) TLBs per L1 vCPU, plus maybe N extra
> TLBs per L1 VMM.  I.e. mimic the hierarchical design of hardware caches and TLBs
> to some extent.

Making TLBs private to the L1 vCPU is almost guaranteed to be a net loss
in performance. The common case for an L2 VM is that all L2 vCPUs share the
*exact* same translation tables and MMU configuration. So even if you're
running an N vCPU L2, you've only allocated 1 nested MMU context to it.

The ARM architecture has gone as far as making this an explicit contract
between hardware + software. That is, FEAT_TTCNP allows translations to
be shared across the Inner Shareable domain. There's hardware out there
that takes advantage of this, and there is a performance advantage to
enabling the feature.

The guest hypervisor is free to disregard this part of the architecture
and keep MMUs private between CPUs. At the same time, I see zero
incentive for optimizing this use case and am absolutely fine with there
being a performance penalty associated with this configuration.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-03 17:52                       ` Oliver Upton
@ 2024-10-03 18:23                         ` Sean Christopherson
  2024-10-03 22:03                           ` Oliver Upton
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-10-03 18:23 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Marc Zyngier, kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Thu, Oct 03, 2024, Oliver Upton wrote:
> On Thu, Oct 03, 2024 at 09:45:40AM -0700, Sean Christopherson wrote:
> 
> [...[
> 
> > > > OTOH, our global TLBs don't model hardware exactly since a vCPU doing
> > > > rapid context switches trash the TLBs of *all* vCPUs in the system.
> > > > The cost of reusing an MMU is quite noticeable, since our unmap
> > > > implementation is slightly crap at the moment, the cost of which shows
> > > > up both on sides of the reclaim (victim and user).
> > > 
> > > Oh, and why unmap is crap:
> > 
> > Heh, isn't unmap by definition crap?  If KVM needs to unmap and rebuild an S2 MMU,
> > then KVM is already in a slow, sub-optimal situation.
> 
> Not really, the unmap plumbing is used for applying the intent of a
> guest TLBI too. Sub-optimal or not, it is exactly what the VM asked for,
> and it'd be in our interest to handle the unmap as expeditiously as
> possible.

Sorry, I meant "unnecessary unmap".

> > > > Still should drop the reference in most other cases, as I do *not* want
> > > > to entertain vCPUs holding a reference when they've gone out to
> > > > userspace.
> > 
> > Why not?  The vCPU is still running, keeping its S2 MMU resident is desirable, no?
> 
> How could we possibly know what the intent of userspace is? The VMM
> could just as well throw that vCPU fd on ice for an eternity.
> 
> For example, you could have a PSCI implementation that lives in
> userspace. Guest does CPU_OFF and the VMM decides to terminate the
> backing thread and keep the FD around for the next CPU_ON.

Yes, but we need to play the odds.  I.e. make the common case fast/efficient.
KVM obviously needs to not fallover or crater performance in the presence of edge
cases, but IMO, disallowing a vCPU from pinning a vCPU because it _might_ go
offline is the wrong tradeoff.

> Since KVM still views that fd as 'runnable', it'd sit on the reference
> that vCPU holds indefinitely. On top of that, it adds complexity to the
> implementation since we would need more refcount cleanup flows to handle
> these straggler references.

But only one flow, vCPU destruction, is mandatory.  Anything beyond that is pure
optimization.

> > Essentially all I'm suggesting is that instead of having a common pool of 2*vCPUs
> > TLBs per L1 VMM, have 2 (or however many) TLBs per L1 vCPU, plus maybe N extra
> > TLBs per L1 VMM.  I.e. mimic the hierarchical design of hardware caches and TLBs
> > to some extent.
> 
> Making TLBs private to the L1 vCPU is almost guaranteed to be a net loss
> in performance.

I'm not saying make TLBs private, I'm saying allow each vCPU to "pin" (i.e. hold
a reference) up to N TLBs/MMUs, regardless of "where" that vCPU is in the flow
of things.  Versus the proposed behavior of pinning TLBs only when it's absolutely
mandatory to do so for functional correctness.

Holding a reference across preemption would be the first step towards that model.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request
  2024-10-03 18:23                         ` Sean Christopherson
@ 2024-10-03 22:03                           ` Oliver Upton
  0 siblings, 0 replies; 19+ messages in thread
From: Oliver Upton @ 2024-10-03 22:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Marc Zyngier, kvmarm, Joey Gouly, Suzuki K Poulose, Zenghui Yu

On Thu, Oct 03, 2024 at 11:23:36AM -0700, Sean Christopherson wrote:
> > > Why not?  The vCPU is still running, keeping its S2 MMU resident is desirable, no?
> > 
> > How could we possibly know what the intent of userspace is? The VMM
> > could just as well throw that vCPU fd on ice for an eternity.
> > 
> > For example, you could have a PSCI implementation that lives in
> > userspace. Guest does CPU_OFF and the VMM decides to terminate the
> > backing thread and keep the FD around for the next CPU_ON.
> 
> Yes, but we need to play the odds.

I agree that we can make an educated guess about the state of a vCPU
when it remains in kernel, but anything outside of that is guesswork.

> I.e. make the common case fast/efficient.
> KVM obviously needs to not fallover or crater performance in the presence of edge
> cases, but IMO, disallowing a vCPU from pinning a vCPU because it _might_ go
> offline is the wrong tradeoff.

But in the event of a 'rare' offline event the vCPU took out an MMU slot
forever, or at least until the VM decides to online it again. That feels
off to me.

So like I mentioned earlier, the common case is that the L1 is running a
VM where all of the vCPUs are sharing the same stage-2 MMU context.

In this case, it is highly likely that the L2 VM's nested MMU keeps an
elevated refcount, as at least one of the vCPUs remains in the KVM_RUN
loop.

In addition to that, we likely have quite a few free slots as we
overprovision the nested MMUs to make sure the worst case remains
functional. The only practical situation in which we would see thrashing
of the nested stage-2 MMUs is if the L1 were running more than 2*NR_VCPUS
VMs, which is already a 2x overcommit of the L1.

> > Since KVM still views that fd as 'runnable', it'd sit on the reference
> > that vCPU holds indefinitely. On top of that, it adds complexity to the
> > implementation since we would need more refcount cleanup flows to handle
> > these straggler references.
> 
> But only one flow, vCPU destruction, is mandatory.  Anything beyond that is pure
> optimization.

vcpu_load() / vcpu_put() is the mandatory flow in this design. We reload
the vCPU to handle nested transitions (i.e. L1<->L2), and we need to attach
an MMU that matches the new context.

> > > Essentially all I'm suggesting is that instead of having a common pool of 2*vCPUs
> > > TLBs per L1 VMM, have 2 (or however many) TLBs per L1 vCPU, plus maybe N extra
> > > TLBs per L1 VMM.  I.e. mimic the hierarchical design of hardware caches and TLBs
> > > to some extent.
> > 
> > Making TLBs private to the L1 vCPU is almost guaranteed to be a net loss
> > in performance.
> 
> I'm not saying make TLBs private, I'm saying allow each vCPU to "pin" (i.e. hold
> a reference) up to N TLBs/MMUs, regardless of "where" that vCPU is in the flow
> of things.  Versus the proposed behavior of pinning TLBs only when it's absolutely
> mandatory to do so for functional correctness.

Ah, got it. It is an interesting idea, if we want to explore any meaningful
value of N then we're gonna need to fix the allocation scheme. We'd
probably also need that mechanism to be more tightly integrated into
TLBIs to potentially drop references when a scope has been invalidated.

> Holding a reference across preemption would be the first step towards that model.

I'm OK with doing this for non-WFI preemption, which has the favorable
property of avoiding lock serialization at vcpu_load() in most cases
too.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-10-03 22:03 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-01  0:17 [PATCH 0/3] KVM: arm64: nv: Fixes for stage-2 MMU recycling Oliver Upton
2024-10-01  0:17 ` [PATCH 1/3] KVM: arm64: Treat stage-2 MMUs as refcounted generally Oliver Upton
2024-10-01  0:17 ` [PATCH 2/3] KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed Oliver Upton
2024-10-01  0:17 ` [PATCH 3/3] KVM: arm64: nv: Punt stage-2 recycling to a vCPU request Oliver Upton
2024-10-01 19:05   ` Sean Christopherson
2024-10-01 20:41     ` Oliver Upton
2024-10-01 23:28       ` Sean Christopherson
2024-10-01 23:49         ` Marc Zyngier
2024-10-02  0:06           ` Oliver Upton
2024-10-02  0:23             ` Sean Christopherson
2024-10-02 23:31               ` Marc Zyngier
2024-10-03  0:04                 ` Oliver Upton
2024-10-03  0:12                   ` Oliver Upton
2024-10-03 16:45                     ` Sean Christopherson
2024-10-03 17:52                       ` Oliver Upton
2024-10-03 18:23                         ` Sean Christopherson
2024-10-03 22:03                           ` Oliver Upton
2024-10-01 23:23   ` Marc Zyngier
2024-10-02  0:06     ` Oliver Upton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.