* [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs
@ 2024-08-09 19:02 Sean Christopherson
2024-08-09 19:02 ` [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) Sean Christopherson
` (22 more replies)
0 siblings, 23 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:02 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
The folks doing TDX enabling ran into a problem where exposing a read-only
memslot to a TDX guest put it into an infinite loop. The most immediate
issue is that KVM never creates MMIO SPTEs for RO memslots, because except
for TDX (which isn't officially supported yet), such SPTEs can't distinguish
between reads and writes, i.e. would trigger MMIO on everything and thus
defeat the purpose of having a RO memslot.
That breaks TDX, SEV-ES, and SNP, i.e. VM types that rely on MMIO caching
to reflect MMIO faults into the guest as #VC/#VE, as the guest never sees
the fault, KVM refuses to emulate, and the guest loops indefinitely. That's
patch 1.
Patches 2-4 fix an amusing number of other bugs that made it difficult to
figure out the true root cause.
The rest is a bunch of cleanups to consolidate all of the unprotect+retry
paths (there are four-ish).
As a bonus, adding RET_PF_WRITE_PROTECTED obviates the need for
kvm_lookup_pfn()[*].
[*] https://lore.kernel.org/all/63c41e25-2523-4397-96b4-557394281443@redhat.com
Sean Christopherson (22):
KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is
valid
KVM: x86/mmu: Trigger unprotect logic only on write-protection page
faults
KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is
zapped
KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip
KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for
retry
KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path
KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive
helper
KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction()
KVM: x86: Fold retry_instruction() into x86_emulate_instruction()
KVM: x86/mmu: Don't try to unprotect an INVALID_GPA
KVM: x86/mmu: Always walk guest PTEs with WRITE access when
unprotecting
KVM: x86/mmu: Move event re-injection unprotect+retry into common path
KVM: x86: Remove manual pfn lookup when retrying #PF after failed
emulation
KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
KVM: x86: Apply retry protection to "unprotect on failure" path
KVM: x86: Update retry protection fields when forcing retry on
emulation failure
KVM: x86: Rename
reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry()
version
KVM: x86/mmu: Detect if unprotect will do anything based on
invalid_list
arch/x86/include/asm/kvm_host.h | 16 ++-
arch/x86/kvm/mmu/mmu.c | 175 ++++++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 3 +
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 6 +-
arch/x86/kvm/vmx/vmx.c | 5 +-
arch/x86/kvm/x86.c | 133 +++++++-----------------
include/linux/kvm_host.h | 7 ++
virt/kvm/kvm_main.c | 5 +-
10 files changed, 184 insertions(+), 169 deletions(-)
base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
@ 2024-08-09 19:02 ` Sean Christopherson
2024-08-14 16:31 ` Paolo Bonzini
2025-12-03 13:04 ` Naveen N Rao
2024-08-09 19:02 ` [PATCH 02/22] KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid Sean Christopherson
` (21 subsequent siblings)
22 siblings, 2 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:02 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation. Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes. Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC. And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.
Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0. But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.
Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.
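For illustration only (this snippet is not part of the patch): a minimal
userspace sketch, assuming vm_fd is an SEV-ES/SNP (or TDX) VM and "rom" is a
valid host mapping, of the failure mode userspace now sees up front:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int add_readonly_slot(int vm_fd, void *rom)
	{
		struct kvm_userspace_memory_region region = {
			.slot            = 1,
			.flags           = KVM_MEM_READONLY,
			.guest_phys_addr = 0xffff0000,
			.memory_size     = 0x10000,
			.userspace_addr  = (unsigned long)rom,
		};

		/*
		 * With this patch, the ioctl fails with EINVAL because
		 * kvm_arch_has_readonly_mem() is false for VMs with protected
		 * state.  Previously it succeeded, and any vCPU write to the
		 * slot later wedged in an infinite #NPF loop.
		 */
		return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
	}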
Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
include/linux/kvm_host.h | 7 +++++++
virt/kvm/kvm_main.c | 5 ++---
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..37c4a573e5fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2191,6 +2191,8 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
#define kvm_arch_has_private_mem(kvm) false
#endif
+#define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
+
static inline u16 kvm_read_ldt(void)
{
u16 ldt;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 689e8be873a7..62a3d1c0cc07 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -715,6 +715,13 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
}
#endif
+#ifndef kvm_arch_has_readonly_mem
+static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
+{
+ return IS_ENABLED(CONFIG_HAVE_KVM_READONLY_MEM);
+}
+#endif
+
struct kvm_memslots {
u64 generation;
atomic_long_t last_used_slot;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..fad2d5932844 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1578,15 +1578,14 @@ static int check_memory_region_flags(struct kvm *kvm,
if (mem->flags & KVM_MEM_GUEST_MEMFD)
valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
-#ifdef CONFIG_HAVE_KVM_READONLY_MEM
/*
* GUEST_MEMFD is incompatible with read-only memslots, as writes to
* read-only memslots have emulated MMIO, not page fault, semantics,
* and KVM doesn't allow emulated MMIO for private memory.
*/
- if (!(mem->flags & KVM_MEM_GUEST_MEMFD))
+ if (kvm_arch_has_readonly_mem(kvm) &&
+ !(mem->flags & KVM_MEM_GUEST_MEMFD))
valid_flags |= KVM_MEM_READONLY;
-#endif
if (mem->flags & ~valid_flags)
return -EINVAL;
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 02/22] KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
2024-08-09 19:02 ` [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) Sean Christopherson
@ 2024-08-09 19:02 ` Sean Christopherson
2024-08-14 11:11 ` Yuan Yao
2024-08-09 19:03 ` [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults Sean Christopherson
` (20 subsequent siblings)
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:02 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Set PFERR_GUEST_{FINAL,PAGE}_MASK based on EPT_VIOLATION_GVA_TRANSLATED if
and only if EPT_VIOLATION_GVA_IS_VALID is also set in exit qualification.
Per the SDM, bit 8 (EPT_VIOLATION_GVA_TRANSLATED) is valid if and only if
bit 7 (EPT_VIOLATION_GVA_IS_VALID) is set, and is '0' if bit 7 is '0'.
Bit 7 (a.k.a. EPT_VIOLATION_GVA_IS_VALID)
Set if the guest linear-address field is valid. The guest linear-address
field is valid for all EPT violations except those resulting from an
attempt to load the guest PDPTEs as part of the execution of the MOV CR
instruction and those due to trace-address pre-translation
Bit 8 (a.k.a. EPT_VIOLATION_GVA_TRANSLATED)
If bit 7 is 1:
• Set if the access causing the EPT violation is to a guest-physical
address that is the translation of a linear address.
• Clear if the access causing the EPT violation is to a paging-structure
entry as part of a page walk or the update of an accessed or dirty bit.
Reserved if bit 7 is 0 (cleared to 0).
Failure to guard the logic on GVA_IS_VALID results in KVM marking the page
fault as PFERR_GUEST_PAGE_MASK when there is no known GVA, which can put
the vCPU into an infinite loop due to kvm_mmu_page_fault() getting a false
positive on its PFERR_NESTED_GUEST_PAGE logic (though only because that
logic is also buggy/flawed).
In practice, this is largely a non-issue because GVA_IS_VALID is almost
always set. However, when TDX comes along, GVA_IS_VALID will *never* be
set, as the TDX Module deliberately clears bits 12:7 in exit qualification,
e.g. so that the faulting virtual address and other metadata that aren't
practically useful for the hypervisor aren't leaked to the untrusted host.
When exit is due to EPT violation, bits 12-7 of the exit qualification
are cleared to 0.
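For illustration only (not KVM code): a standalone sketch of the intended
decode, with the bit positions taken from the SDM text above and KVM's
PFERR_GUEST_{FINAL,PAGE} values (bits 32 and 33) hard-coded as an assumption:

	#include <stdint.h>
	#include <stdio.h>

	#define EPT_VIOLATION_GVA_IS_VALID    (1ULL << 7)
	#define EPT_VIOLATION_GVA_TRANSLATED  (1ULL << 8)
	#define PFERR_GUEST_FINAL_MASK        (1ULL << 32)
	#define PFERR_GUEST_PAGE_MASK         (1ULL << 33)

	static uint64_t gva_bits_to_pferr(uint64_t exit_qualification)
	{
		uint64_t error_code = 0;

		/* Bit 8 is meaningful only if bit 7 says the GVA is valid. */
		if (exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
			error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
				      PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
		return error_code;
	}

	int main(void)
	{
		/* Write to a final translation: bits 1, 7 and 8 set => GUEST_FINAL. */
		printf("%#llx\n", (unsigned long long)gva_bits_to_pferr(0x182));
		/* TDX-style exit, bits 12:7 cleared: no GUEST_{FINAL,PAGE} at all. */
		printf("%#llx\n", (unsigned long long)gva_bits_to_pferr(0x002));
		return 0;
	}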
Fixes: eebed2438923 ("kvm: nVMX: Add support for fast unprotection of nested guest page tables")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/vmx.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..52de013550e9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5804,8 +5804,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
? PFERR_PRESENT_MASK : 0;
- error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
- PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+ if (exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
+ error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
+ PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
/*
* Check that the GPA doesn't exceed physical memory limits, as that is
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
2024-08-09 19:02 ` [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) Sean Christopherson
2024-08-09 19:02 ` [PATCH 02/22] KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 11:42 ` Yuan Yao
2024-08-14 16:40 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected Sean Christopherson
` (19 subsequent siblings)
22 siblings, 2 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Trigger KVM's various "unprotect gfn" paths if and only if the page fault
was a write to a write-protected gfn. To do so, add a new page fault
return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track
such page faults.
If a page fault requires emulation due to MMIO (or any reason besides
write-protection), trying to unprotect the gfn is pointless and risks
putting the vCPU into an infinite loop. E.g. KVM will put the vCPU into
an infinite loop if the vCPU manages to trigger MMIO on a page table walk.
Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 78 +++++++++++++++++++--------------
arch/x86/kvm/mmu/mmu_internal.h | 3 ++
arch/x86/kvm/mmu/mmutrace.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 6 +--
5 files changed, 53 insertions(+), 37 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..e3aa04c498ea 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2914,10 +2914,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
trace_kvm_mmu_set_spte(level, gfn, sptep);
}
- if (wrprot) {
- if (write_fault)
- ret = RET_PF_EMULATE;
- }
+ if (wrprot && write_fault)
+ ret = RET_PF_WRITE_PROTECTED;
if (flush)
kvm_flush_remote_tlbs_gfn(vcpu->kvm, gfn, level);
@@ -4549,7 +4547,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
return RET_PF_RETRY;
if (page_fault_handle_page_track(vcpu, fault))
- return RET_PF_EMULATE;
+ return RET_PF_WRITE_PROTECTED;
r = fast_page_fault(vcpu, fault);
if (r != RET_PF_INVALID)
@@ -4642,7 +4640,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
int r;
if (page_fault_handle_page_track(vcpu, fault))
- return RET_PF_EMULATE;
+ return RET_PF_WRITE_PROTECTED;
r = fast_page_fault(vcpu, fault);
if (r != RET_PF_INVALID)
@@ -4726,6 +4724,9 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
case RET_PF_EMULATE:
return -ENOENT;
+ case RET_PF_WRITE_PROTECTED:
+ return -EPERM;
+
case RET_PF_RETRY:
case RET_PF_CONTINUE:
case RET_PF_INVALID:
@@ -5960,6 +5961,41 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
write_unlock(&vcpu->kvm->mmu_lock);
}
+static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u64 error_code, int *emulation_type)
+{
+ bool direct = vcpu->arch.mmu->root_role.direct;
+
+ /*
+ * Before emulating the instruction, check if the error code
+ * was due to a RO violation while translating the guest page.
+ * This can occur when using nested virtualization with nested
+ * paging in both guests. If true, we simply unprotect the page
+ * and resume the guest.
+ */
+ if (direct &&
+ (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
+ kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
+ return RET_PF_FIXED;
+ }
+
+ /*
+ * The gfn is write-protected, but if emulation fails we can still
+ * optimistically try to just unprotect the page and let the processor
+ * re-execute the instruction that caused the page fault. Do not allow
+ * retrying MMIO emulation, as it's not only pointless but could also
+ * cause us to enter an infinite loop because the processor will keep
+ * faulting on the non-existent MMIO address. Retrying an instruction
+ * from a nested guest is also pointless and dangerous as we are only
+ * explicitly shadowing L1's page tables, i.e. unprotecting something
+ * for L1 isn't going to magically fix whatever issue caused L2 to fail.
+ */
+ if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
+ *emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
+
+ return RET_PF_EMULATE;
+}
+
int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
void *insn, int insn_len)
{
@@ -6005,6 +6041,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (r < 0)
return r;
+ if (r == RET_PF_WRITE_PROTECTED)
+ r = kvm_mmu_write_protect_fault(vcpu, cr2_or_gpa, error_code,
+ &emulation_type);
+
if (r == RET_PF_FIXED)
vcpu->stat.pf_fixed++;
else if (r == RET_PF_EMULATE)
@@ -6015,32 +6055,6 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
if (r != RET_PF_EMULATE)
return 1;
- /*
- * Before emulating the instruction, check if the error code
- * was due to a RO violation while translating the guest page.
- * This can occur when using nested virtualization with nested
- * paging in both guests. If true, we simply unprotect the page
- * and resume the guest.
- */
- if (vcpu->arch.mmu->root_role.direct &&
- (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
- kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
- return 1;
- }
-
- /*
- * vcpu->arch.mmu.page_fault returned RET_PF_EMULATE, but we can still
- * optimistically try to just unprotect the page and let the processor
- * re-execute the instruction that caused the page fault. Do not allow
- * retrying MMIO emulation, as it's not only pointless but could also
- * cause us to enter an infinite loop because the processor will keep
- * faulting on the non-existent MMIO address. Retrying an instruction
- * from a nested guest is also pointless and dangerous as we are only
- * explicitly shadowing L1's page tables, i.e. unprotecting something
- * for L1 isn't going to magically fix whatever issue cause L2 to fail.
- */
- if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
- emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
emulate:
return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
insn_len);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1721d97743e9..50d2624111f8 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -258,6 +258,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
* RET_PF_CONTINUE: So far, so good, keep handling the page fault.
* RET_PF_RETRY: let CPU fault again on the address.
* RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
+ * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotect the
+ * gfn and retry, or emulate the instruction directly.
* RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
* RET_PF_FIXED: The faulting entry has been fixed.
* RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
@@ -274,6 +276,7 @@ enum {
RET_PF_CONTINUE = 0,
RET_PF_RETRY,
RET_PF_EMULATE,
+ RET_PF_WRITE_PROTECTED,
RET_PF_INVALID,
RET_PF_FIXED,
RET_PF_SPURIOUS,
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
index 195d98bc8de8..f35a830ce469 100644
--- a/arch/x86/kvm/mmu/mmutrace.h
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -57,6 +57,7 @@
TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
TRACE_DEFINE_ENUM(RET_PF_RETRY);
TRACE_DEFINE_ENUM(RET_PF_EMULATE);
+TRACE_DEFINE_ENUM(RET_PF_WRITE_PROTECTED);
TRACE_DEFINE_ENUM(RET_PF_INVALID);
TRACE_DEFINE_ENUM(RET_PF_FIXED);
TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 69941cebb3a8..a722a3c96af9 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -805,7 +805,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
if (page_fault_handle_page_track(vcpu, fault)) {
shadow_page_table_clear_flood(vcpu, fault->addr);
- return RET_PF_EMULATE;
+ return RET_PF_WRITE_PROTECTED;
}
r = mmu_topup_memory_caches(vcpu, true);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7dc49ee7388..8bf44ac9372f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1046,10 +1046,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
* protected, emulation is needed. If the emulation was skipped,
* the vCPU would have the same fault again.
*/
- if (wrprot) {
- if (fault->write)
- ret = RET_PF_EMULATE;
- }
+ if (wrprot && fault->write)
+ ret = RET_PF_WRITE_PROTECTED;
/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (2 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 14:22 ` Yuan Yao
2024-08-14 16:47 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 05/22] KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped Sean Christopherson
` (18 subsequent siblings)
22 siblings, 2 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
When doing "fast unprotection" of nested TDP page tables, skip emulation
if and only if at least one gfn was unprotected, i.e. continue with
emulation if simply resuming is likely to hit the same fault and risk
putting the vCPU into an infinite loop.
Note, it's entirely possible to get a false negative, e.g. if a different
vCPU faults on the same gfn and unprotects the gfn first, but that's a
relatively rare edge case, and emulating is still functionally ok, i.e.
risking an infinite loop just to avoid a rare unnecessary emulation isn't
justified.
Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e3aa04c498ea..95058ac4b78c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
bool direct = vcpu->arch.mmu->root_role.direct;
/*
- * Before emulating the instruction, check if the error code
- * was due to a RO violation while translating the guest page.
- * This can occur when using nested virtualization with nested
- * paging in both guests. If true, we simply unprotect the page
- * and resume the guest.
+ * Before emulating the instruction, check to see if the access may be
+ * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
+ * gfn being written is for gPTEs that KVM is shadowing and has write-
+ * protected. Because AMD CPUs walk nested page tables using a write
+ * operation, walking NPT entries in L1 can trigger write faults even
+ * when L1 isn't modifying PTEs, and thus result in KVM emulating an
+ * excessive number of L1 instructions without triggering KVM's write-
+ * flooding detection, i.e. without unprotecting the gfn.
+ *
+ * If the error code was due to a RO violation while translating the
+ * guest page, the current MMU is direct (L1 is active), and KVM has
+ * shadow pages, then the above scenario is likely being hit. Try to
+ * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
+ * its NPT entries without triggering emulation. If one or more shadow
+ * pages was zapped, skip emulation and resume L1 to let it natively
+ * execute the instruction. If no shadow pages were zapped, then the
+ * write-fault is due to something else entirely, i.e. KVM needs to
+ * emulate, as resuming the guest will put it into an infinite loop.
*/
if (direct &&
- (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
- kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
+ (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
+ kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
return RET_PF_FIXED;
- }
/*
* The gfn is write-protected, but if emulation fails we can still
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 05/22] KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (3 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 06/22] KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip Sean Christopherson
` (17 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Resume the guest and thus skip emulation of a non-PTE-writing instruction
if and only if unprotecting the gfn actually zapped at least one shadow
page. If the gfn is write-protected for some reason other than shadow
paging, attempting to unprotect the gfn will effectively fail, and thus
retrying the instruction is all but guaranteed to be pointless. This bug
has existed for a long time, but was effectively fudged around by the
retry RIP+address anti-loop detection.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index af6c8cf6a37a..2072cceac68f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8967,14 +8967,14 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
if (ctxt->eip == last_retry_eip && last_retry_addr == cr2_or_gpa)
return false;
+ if (!vcpu->arch.mmu->root_role.direct)
+ gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
+
+ if (!kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
+ return false;
+
vcpu->arch.last_retry_eip = ctxt->eip;
vcpu->arch.last_retry_addr = cr2_or_gpa;
-
- if (!vcpu->arch.mmu->root_role.direct)
- gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
-
- kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
-
return true;
}
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 06/22] KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (4 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 05/22] KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 17:01 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 07/22] KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry Sean Christopherson
` (16 subsequent siblings)
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Read RIP from vCPU state instead of pulling it from the emulation context
when filling last_retry_eip, which is part of the anti-infinite-loop
protection used when unprotecting and retrying instructions that hit a
write-protected gfn.
This will allow reusing the anti-infinite-loop protection in flows that
never make it into the emulator.
This is a glorified nop as ctxt->eip is set to kvm_rip_read() in
init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with
EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through
x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2072cceac68f..372ed3842732 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8973,7 +8973,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
if (!kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
return false;
- vcpu->arch.last_retry_eip = ctxt->eip;
+ vcpu->arch.last_retry_eip = kvm_rip_read(vcpu);
vcpu->arch.last_retry_addr = cr2_or_gpa;
return true;
}
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 07/22] KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (5 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 06/22] KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 08/22] KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path Sean Christopherson
` (15 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Store the gpa used to unprotect the faulting gfn for retry as a gpa_t, not
an unsigned long. This fixes a bug where 32-bit KVM would unprotect and
retry the wrong gfn if the gpa had bits 63:32!=0. In practice, this bug
is functionally benign, as unprotecting the wrong gfn is purely a
performance issue (thanks to the anti-infinite-loop logic). And of course,
almost no one runs 32-bit KVM these days.
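For illustration only (not from the patch): a standalone sketch of the
truncation, assuming the kernel's gpa_to_gfn() is a PAGE_SHIFT (12-bit) right
shift and that "unsigned long" is 32 bits on a 32-bit build:

	#include <stdint.h>
	#include <stdio.h>

	typedef uint64_t gpa_t;

	int main(void)
	{
		gpa_t gpa = 0x100001000ULL;   /* bits 63:32 != 0 */
		uint32_t ulong_gpa = gpa;     /* "unsigned long" on 32-bit KVM */

		/* gpa_to_gfn() == gpa >> 12 */
		printf("correct gfn: %#llx\n", (unsigned long long)(gpa >> 12));
		printf("zapped gfn:  %#llx\n", (unsigned long long)((gpa_t)ulong_gpa >> 12));
		return 0;
	}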
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 372ed3842732..4c3493ffce0b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8934,7 +8934,8 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
gpa_t cr2_or_gpa, int emulation_type)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
- unsigned long last_retry_eip, last_retry_addr, gpa = cr2_or_gpa;
+ unsigned long last_retry_eip, last_retry_addr;
+ gpa_t gpa = cr2_or_gpa;
last_retry_eip = vcpu->arch.last_retry_eip;
last_retry_addr = vcpu->arch.last_retry_addr;
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 08/22] KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (6 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 07/22] KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs Sean Christopherson
` (14 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Move the anti-infinite-loop protection provided by last_retry_{eip,addr}
into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that
never hits the emulator, as well as reexecute_instruction(), which is the
last ditch "might as well try it" logic that kicks in when emulation fails
on an instruction that faulted on a write-protected gfn.
Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry
fields and deduplicate other code (with more to come).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu/mmu.c | 39 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 27 +----------------------
3 files changed, 40 insertions(+), 27 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37c4a573e5fb..10b47c310ff9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2136,6 +2136,7 @@ int kvm_get_nr_pending_nmis(struct kvm_vcpu *vcpu);
void kvm_update_dr7(struct kvm_vcpu *vcpu);
int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
+bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa);
void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
ulong roots_to_free);
void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 95058ac4b78c..09a42dc1fe5a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2731,6 +2731,22 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
return r;
}
+bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
+{
+ gpa_t gpa = cr2_or_gpa;
+ bool r;
+
+ if (!vcpu->arch.mmu->root_role.direct)
+ gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
+
+ r = kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
+ if (r) {
+ vcpu->arch.last_retry_eip = kvm_rip_read(vcpu);
+ vcpu->arch.last_retry_addr = cr2_or_gpa;
+ }
+ return r;
+}
+
static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
{
gpa_t gpa;
@@ -5966,6 +5982,27 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
{
bool direct = vcpu->arch.mmu->root_role.direct;
+ /*
+ * Do not try to unprotect and retry if the vCPU re-faulted on the same
+ * RIP with the same address that was previously unprotected, as doing
+ * so will likely put the vCPU into an infinite loop. E.g. if the vCPU uses
+ * a non-page-table modifying instruction on the PDE that points to the
+ * instruction, then unprotecting the gfn will unmap the instruction's
+ * code, i.e. make it impossible for the instruction to ever complete.
+ */
+ if (vcpu->arch.last_retry_eip == kvm_rip_read(vcpu) &&
+ vcpu->arch.last_retry_addr == cr2_or_gpa)
+ return RET_PF_EMULATE;
+
+ /*
+ * Reset the unprotect+retry values that guard against infinite loops.
+ * The values will be refreshed if KVM explicitly unprotects a gfn and
+ * retries, in all other cases it's safe to retry in the future even if
+ * the next page fault happens on the same RIP+address.
+ */
+ vcpu->arch.last_retry_eip = 0;
+ vcpu->arch.last_retry_addr = 0;
+
/*
* Before emulating the instruction, check to see if the access may be
* due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
@@ -5988,7 +6025,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
*/
if (direct &&
(error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
- kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+ kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
return RET_PF_FIXED;
/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4c3493ffce0b..5377ca55161a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8934,27 +8934,13 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
gpa_t cr2_or_gpa, int emulation_type)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
- unsigned long last_retry_eip, last_retry_addr;
- gpa_t gpa = cr2_or_gpa;
-
- last_retry_eip = vcpu->arch.last_retry_eip;
- last_retry_addr = vcpu->arch.last_retry_addr;
/*
* If the emulation is caused by #PF and it is non-page_table
* writing instruction, it means the VM-EXIT is caused by shadow
* page protected, we can zap the shadow page and retry this
* instruction directly.
- *
- * Note: if the guest uses a non-page-table modifying instruction
- * on the PDE that points to the instruction, then we will unmap
- * the instruction and go to an infinite loop. So, we cache the
- * last retried eip and the last fault address, if we meet the eip
- * and the address again, we can break out of the potential infinite
- * loop.
*/
- vcpu->arch.last_retry_eip = vcpu->arch.last_retry_addr = 0;
-
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
@@ -8965,18 +8951,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
if (x86_page_table_writing_insn(ctxt))
return false;
- if (ctxt->eip == last_retry_eip && last_retry_addr == cr2_or_gpa)
- return false;
-
- if (!vcpu->arch.mmu->root_role.direct)
- gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
-
- if (!kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
- return false;
-
- vcpu->arch.last_retry_eip = kvm_rip_read(vcpu);
- vcpu->arch.last_retry_addr = cr2_or_gpa;
- return true;
+ return kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa);
}
static int complete_emulated_mmio(struct kvm_vcpu *vcpu);
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (7 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 08/22] KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 17:30 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 10/22] KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper Sean Christopherson
` (13 subsequent siblings)
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Try to unprotect shadow pages if and only if indirect_shadow_pages is non-
zero, i.e. iff there is at least one write-protected indirect shadow page. Pre-
checking indirect_shadow_pages avoids taking mmu_lock for write when the
gfn is write-protected by a third party, i.e. not for KVM shadow paging,
and in the *extremely* unlikely case that a different task has already
unprotected the last shadow page.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 09a42dc1fe5a..358294889baa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2736,6 +2736,9 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
gpa_t gpa = cr2_or_gpa;
bool r;
+ if (!vcpu->kvm->arch.indirect_shadow_pages)
+ return false;
+
if (!vcpu->arch.mmu->root_role.direct)
gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 10/22] KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (8 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 17:32 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 11/22] KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction() Sean Christopherson
` (12 subsequent siblings)
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Drop the globally visible PFERR_NESTED_GUEST_PAGE and replace it with a
more appropriately named is_write_to_guest_page_table(). The macro name
is misleading, because while all nNPT walks match PAGE|WRITE|PRESENT, the
reverse is not true.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 4 ----
arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 10b47c310ff9..25a3d84ca5e2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -280,10 +280,6 @@ enum x86_intercept_stage;
#define PFERR_PRIVATE_ACCESS BIT_ULL(49)
#define PFERR_SYNTHETIC_MASK (PFERR_IMPLICIT_ACCESS | PFERR_PRIVATE_ACCESS)
-#define PFERR_NESTED_GUEST_PAGE (PFERR_GUEST_PAGE_MASK | \
- PFERR_WRITE_MASK | \
- PFERR_PRESENT_MASK)
-
/* apic attention bits */
#define KVM_APIC_CHECK_VAPIC 0
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 358294889baa..065bb6180988 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5980,6 +5980,13 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
write_unlock(&vcpu->kvm->mmu_lock);
}
+static bool is_write_to_guest_page_table(u64 error_code)
+{
+ const u64 mask = PFERR_GUEST_PAGE_MASK | PFERR_WRITE_MASK | PFERR_PRESENT_MASK;
+
+ return (error_code & mask) == mask;
+}
+
static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u64 error_code, int *emulation_type)
{
@@ -6026,8 +6033,7 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* write-fault is due to something else entirely, i.e. KVM needs to
* emulate, as resuming the guest will put it into an infinite loop.
*/
- if (direct &&
- (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
+ if (direct && (is_write_to_guest_page_table(error_code)) &&
kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
return RET_PF_FIXED;
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 11/22] KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction()
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (9 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 10/22] KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 12/22] KVM: x86: Fold retry_instruction() into x86_emulate_instruction() Sean Christopherson
` (11 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Move the sanity checks for EMULTYPE_ALLOW_RETRY_PF to the top of
x86_emulate_instruction(). In addition to deduplicating a small amount
of code, this makes the connection between EMULTYPE_ALLOW_RETRY_PF and
EMULTYPE_PF even more explicit, and will allow dropping retry_instruction()
entirely.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5377ca55161a..7e90c3b888c2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8872,10 +8872,6 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
- if (WARN_ON_ONCE(is_guest_mode(vcpu)) ||
- WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF)))
- return false;
-
if (!vcpu->arch.mmu->root_role.direct) {
/*
* Write permission should be allowed since only
@@ -8944,10 +8940,6 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
- if (WARN_ON_ONCE(is_guest_mode(vcpu)) ||
- WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF)))
- return false;
-
if (x86_page_table_writing_insn(ctxt))
return false;
@@ -9150,6 +9142,11 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
bool writeback = true;
+ if ((emulation_type & EMULTYPE_ALLOW_RETRY_PF) &&
+ (WARN_ON_ONCE(is_guest_mode(vcpu)) ||
+ WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF))))
+ emulation_type &= ~EMULTYPE_ALLOW_RETRY_PF;
+
r = kvm_check_emulate_insn(vcpu, emulation_type, insn, insn_len);
if (r != X86EMUL_CONTINUE) {
if (r == X86EMUL_RETRY_INSTR || r == X86EMUL_PROPAGATE_FAULT)
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 12/22] KVM: x86: Fold retry_instruction() into x86_emulate_instruction()
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (10 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 11/22] KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction() Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 13/22] KVM: x86/mmu: Don't try to unprotect an INVALID_GPA Sean Christopherson
` (10 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Now that retry_instruction() is reasonably tiny, fold it into its sole
caller, x86_emulate_instruction(). In addition to getting rid of the
absurdly confusing retry_instruction() name, handling the retry in
x86_emulate_instruction() pairs it back up with the code that resets
last_retry_{eip,addr}.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 30 +++++++++---------------------
1 file changed, 9 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7e90c3b888c2..771e67381fce 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8926,26 +8926,6 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return !(emulation_type & EMULTYPE_WRITE_PF_TO_SP);
}
-static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
- gpa_t cr2_or_gpa, int emulation_type)
-{
- struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-
- /*
- * If the emulation is caused by #PF and it is non-page_table
- * writing instruction, it means the VM-EXIT is caused by shadow
- * page protected, we can zap the shadow page and retry this
- * instruction directly.
- */
- if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
- return false;
-
- if (x86_page_table_writing_insn(ctxt))
- return false;
-
- return kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa);
-}
-
static int complete_emulated_mmio(struct kvm_vcpu *vcpu);
static int complete_emulated_pio(struct kvm_vcpu *vcpu);
@@ -9225,7 +9205,15 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return 1;
}
- if (retry_instruction(ctxt, cr2_or_gpa, emulation_type))
+ /*
+ * If emulation was caused by a write-protection #PF on a non-page_table
+ * writing instruction, try to unprotect the gfn, i.e. zap shadow pages,
+ * and retry the instruction, as the vCPU is likely no longer using the
+ * gfn as a page table.
+ */
+ if ((emulation_type & EMULTYPE_ALLOW_RETRY_PF) &&
+ !x86_page_table_writing_insn(ctxt) &&
+ kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
return 1;
/* this is needed for vmware backdoor interface to work since it
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 13/22] KVM: x86/mmu: Don't try to unprotect an INVALID_GPA
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (11 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 12/22] KVM: x86: Fold retry_instruction() into x86_emulate_instruction() Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 14/22] KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting Sean Christopherson
` (9 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
If getting the gpa for a gva fails, e.g. because the gva isn't mapped in
the guest page tables, don't try to unprotect the invalid gfn. This is
mostly a performance fix (avoids unnecessarily taking mmu_lock), as
for_each_gfn_valid_sp_with_gptes() won't explode on garbage input;
unprotecting an invalid gfn is simply pointless.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 065bb6180988..a5d1f6232f8c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2739,8 +2739,11 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
if (!vcpu->kvm->arch.indirect_shadow_pages)
return false;
- if (!vcpu->arch.mmu->root_role.direct)
+ if (!vcpu->arch.mmu->root_role.direct) {
gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
+ if (gpa == INVALID_GPA)
+ return false;
+ }
r = kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
if (r) {
@@ -2759,6 +2762,8 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
return 0;
gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
+ if (gpa == INVALID_GPA)
+ return 0;
r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 14/22] KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (12 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 13/22] KVM: x86/mmu: Don't try to unprotect an INVALID_GPA Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path Sean Christopherson
` (8 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
When getting a gpa from a gva to unprotect the associated gfn when an
event is awaiting reinjection, walk the guest PTEs for WRITE as there's no
point in unprotecting the gfn if the guest is unable to write the page,
i.e. if write-protection can't trigger emulation.
Note, the entire flow should be guarded on the access being a write, and
even better should be conditioned on actually triggering a write-protect
fault. This will be addressed in a future commit.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a5d1f6232f8c..f64ad36ca9e0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2761,7 +2761,7 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
if (vcpu->arch.mmu->root_role.direct)
return 0;
- gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
+ gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, NULL);
if (gpa == INVALID_GPA)
return 0;
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (13 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 14/22] KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 17:43 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 16/22] KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation Sean Christopherson
` (7 subsequent siblings)
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Move the event re-injection unprotect+retry logic into
kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
the #PF actually hit a write-protected gfn. Note, there is a small
possibility that the gfn was unprotected by a different task between
hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
the guest immediately anyway because KVM will treat the fault as spurious.
As a bonus, unprotecting _after_ handling the page fault also addresses the
case where installing a SPTE to handle the fault encounters a shadowed PTE,
i.e. *creates* a read-only SPTE.
Opportunistically add a comment explaining what on earth the intent of the
code is, based on the changelog from commit 577bdc496614 ("KVM: Avoid
instruction emulation when event delivery is pending").
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 28 ++++++++--------------------
1 file changed, 8 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f64ad36ca9e0..d3c0220ff7ee 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2753,23 +2753,6 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
return r;
}
-static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
-{
- gpa_t gpa;
- int r;
-
- if (vcpu->arch.mmu->root_role.direct)
- return 0;
-
- gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, NULL);
- if (gpa == INVALID_GPA)
- return 0;
-
- r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
-
- return r;
-}
-
static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
trace_kvm_mmu_unsync_page(sp);
@@ -4640,8 +4623,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
if (!flags) {
trace_kvm_page_fault(vcpu, fault_address, error_code);
- if (kvm_event_needs_reinjection(vcpu))
- kvm_mmu_unprotect_page_virt(vcpu, fault_address);
r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
insn_len);
} else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
@@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* execute the instruction. If no shadow pages were zapped, then the
* write-fault is due to something else entirely, i.e. KVM needs to
* emulate, as resuming the guest will put it into an infinite loop.
+ *
+ * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
+ * unprotect the gfn and retry if an event is awaiting reinjection. If
+ * KVM emulates multiple instructions before completing event injection,
+ * the event could be delayed beyond what is architecturally allowed,
+ * e.g. KVM could inject an IRQ after the TPR has been raised.
*/
- if (direct && (is_write_to_guest_page_table(error_code)) &&
+ if (((direct && is_write_to_guest_page_table(error_code)) ||
+ (!direct && kvm_event_needs_reinjection(vcpu))) &&
kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
return RET_PF_FIXED;
--
2.46.0.76.ge559c4bf1a-goog
* [PATCH 16/22] KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (14 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 17:50 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 17/22] KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn Sean Christopherson
` (6 subsequent siblings)
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Drop the manual pfn lookup when retrying an instruction that KVM failed to
emulate in response to a #PF due to a write-protected gfn. Now that KVM
sets EMULTYPE_ALLOW_RETRY_PF if and only if the page fault hit a write-protected gfn,
i.e. if and only if there's a writable memslot, there's no need to redo
the lookup to avoid retrying an instruction that failed on emulated MMIO
(no slot, or a write to a read-only slot).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 18 ------------------
1 file changed, 18 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 771e67381fce..67f9871990fb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8867,7 +8867,6 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type)
{
gpa_t gpa = cr2_or_gpa;
- kvm_pfn_t pfn;
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
@@ -8887,23 +8886,6 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return true;
}
- /*
- * Do not retry the unhandleable instruction if it faults on the
- * readonly host memory, otherwise it will goto a infinite loop:
- * retry instruction -> write #PF -> emulation fail -> retry
- * instruction -> ...
- */
- pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
-
- /*
- * If the instruction failed on the error pfn, it can not be fixed,
- * report the error to userspace.
- */
- if (is_error_noslot_pfn(pfn))
- return false;
-
- kvm_release_pfn_clean(pfn);
-
/*
* If emulation may have been triggered by a write to a shadowed page
* table, unprotect the gfn (zap any relevant SPTEs) and re-enter the
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 17/22] KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (15 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 16/22] KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 17:53 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 18/22] KVM: x86: Apply retry protection to "unprotect on failure" path Sean Christopherson
` (5 subsequent siblings)
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Don't bother unprotecting the target gfn if EMULTYPE_WRITE_PF_TO_SP is
set, as KVM will simply report the emulation failure to userspace. This
will allow converting reexecute_instruction() to use
kvm_mmu_unprotect_gfn_and_retry() instead of kvm_mmu_unprotect_page().
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 67f9871990fb..bbb63cf9fe2c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8871,6 +8871,19 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
+ /*
+ * If the failed instruction faulted on an access to page tables that
+ * are used to translate any part of the instruction, KVM can't resolve
+ * the issue by unprotecting the gfn, as zapping the shadow page will
+ * result in the instruction taking a !PRESENT page fault and thus put
+ * the vCPU into an infinite loop of page faults. E.g. KVM will create
+ * a SPTE and write-protect the gfn to resolve the !PRESENT fault, and
+ * then zap the SPTE to unprotect the gfn, and then do it all over
+ * again. Report the error to userspace.
+ */
+ if (emulation_type & EMULTYPE_WRITE_PF_TO_SP)
+ return false;
+
if (!vcpu->arch.mmu->root_role.direct) {
/*
* Write permission should be allowed since only
@@ -8896,16 +8909,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
/*
- * If the failed instruction faulted on an access to page tables that
- * are used to translate any part of the instruction, KVM can't resolve
- * the issue by unprotecting the gfn, as zapping the shadow page will
- * result in the instruction taking a !PRESENT page fault and thus put
- * the vCPU into an infinite loop of page faults. E.g. KVM will create
- * a SPTE and write-protect the gfn to resolve the !PRESENT fault, and
- * then zap the SPTE to unprotect the gfn, and then do it all over
- * again. Report the error to userspace.
+ * Retry even if _this_ vCPU didn't unprotect the gfn, as it's possible
+ * all SPTEs were already zapped by a different task. The alternative
+ * is to report the error to userspace and likely terminate the guest,
+ * and the infinite loop detection logic will prevent retrying the page
+ * fault indefinitely, i.e. there's nothing to lose by retrying.
*/
- return !(emulation_type & EMULTYPE_WRITE_PF_TO_SP);
+ return true;
}
static int complete_emulated_mmio(struct kvm_vcpu *vcpu);
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 18/22] KVM: x86: Apply retry protection to "unprotect on failure" path
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (16 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 17/22] KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 19/22] KVM: x86: Update retry protection fields when forcing retry on emulation failure Sean Christopherson
` (4 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Use kvm_mmu_unprotect_gfn_and_retry() in reexecute_instruction() to pick
up protection against infinite loops, e.g. if KVM somehow manages to
encounter an unsupported instruction and unprotecting the gfn doesn't
allow the vCPU to make forward progress. Other than that, the retry-on-
failure logic is a functionally equivalent, open-coded version of
kvm_mmu_unprotect_gfn_and_retry().
Note, the emulation failure path still isn't fully protected, as KVM
won't update the retry protection fields if no shadow pages are zapped
(but this change is still a step forward). That flaw will be addressed
in a future patch.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 20 +-------------------
1 file changed, 1 insertion(+), 19 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bbb63cf9fe2c..ddeda91b0530 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8866,8 +8866,6 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type)
static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
int emulation_type)
{
- gpa_t gpa = cr2_or_gpa;
-
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
@@ -8884,29 +8882,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (emulation_type & EMULTYPE_WRITE_PF_TO_SP)
return false;
- if (!vcpu->arch.mmu->root_role.direct) {
- /*
- * Write permission should be allowed since only
- * write access need to be emulated.
- */
- gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
-
- /*
- * If the mapping is invalid in guest, let cpu retry
- * it to generate fault.
- */
- if (gpa == INVALID_GPA)
- return true;
- }
-
/*
* If emulation may have been triggered by a write to a shadowed page
* table, unprotect the gfn (zap any relevant SPTEs) and re-enter the
* guest to let the CPU re-execute the instruction in the hope that the
* CPU can cleanly execute the instruction that KVM failed to emulate.
*/
- if (vcpu->kvm->arch.indirect_shadow_pages)
- kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
+ kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa);
/*
* Retry even if _this_ vCPU didn't unprotect the gfn, as it's possible
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 19/22] KVM: x86: Update retry protection fields when forcing retry on emulation failure
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (17 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 18/22] KVM: x86: Apply retry protection to "unprotect on failure" path Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 20/22] KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() Sean Christopherson
` (3 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
When retrying the faulting instruction after emulation failure, refresh
the infinite loop protection fields even if no shadow pages were zapped,
i.e. avoid hitting an infinite loop even when retrying the instruction as
a last-ditch effort to avoid terminating the guest.
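For context, the consuming side of these fields (folded into
x86_emulate_instruction() earlier in the series) boils down to something
like the sketch below. This is paraphrased, not quoted; the point is that
a repeat fault at the same RIP and address is not retried a second time,
which is what actually breaks any would-be infinite loop:

        /* A repeat fault at the same RIP and address isn't retried again. */
        if (kvm_rip_read(vcpu) == vcpu->arch.last_retry_eip &&
            cr2_or_gpa == vcpu->arch.last_retry_addr)
                return false;   /* fall through to emulation/failure handling */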
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 10 +++++++++-
arch/x86/kvm/mmu/mmu.c | 12 +++++++-----
arch/x86/kvm/x86.c | 2 +-
3 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 25a3d84ca5e2..b3a2793fc89c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2132,7 +2132,15 @@ int kvm_get_nr_pending_nmis(struct kvm_vcpu *vcpu);
void kvm_update_dr7(struct kvm_vcpu *vcpu);
int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
-bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa);
+bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ bool always_retry);
+
+static inline bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu,
+ gpa_t cr2_or_gpa)
+{
+ return __kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa, false);
+}
+
void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
ulong roots_to_free);
void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d3c0220ff7ee..59af085a6e8e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2731,22 +2731,24 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
return r;
}
-bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
+bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ bool always_retry)
{
gpa_t gpa = cr2_or_gpa;
- bool r;
+ bool r = false;
if (!vcpu->kvm->arch.indirect_shadow_pages)
- return false;
+ goto out;
if (!vcpu->arch.mmu->root_role.direct) {
gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
if (gpa == INVALID_GPA)
- return false;
+ goto out;
}
r = kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
- if (r) {
+out:
+ if (r || always_retry) {
vcpu->arch.last_retry_eip = kvm_rip_read(vcpu);
vcpu->arch.last_retry_addr = cr2_or_gpa;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ddeda91b0530..65531768bb1e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8888,7 +8888,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* guest to let the CPU re-execute the instruction in the hope that the
* CPU can cleanly execute the instruction that KVM failed to emulate.
*/
- kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa);
+ __kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa, true);
/*
* Retry even if _this_ vCPU didn't unprotect the gfn, as it's possible
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 20/22] KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (18 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 19/22] KVM: x86: Update retry protection fields when forcing retry on emulation failure Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 21/22] KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version Sean Christopherson
` (2 subsequent siblings)
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Rename reexecute_instruction() to kvm_unprotect_and_retry_on_failure() to
make the intent and purpose of the helper much more obvious.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65531768bb1e..2f4bb5028226 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8863,8 +8863,9 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type)
return 1;
}
-static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- int emulation_type)
+static bool kvm_unprotect_and_retry_on_failure(struct kvm_vcpu *vcpu,
+ gpa_t cr2_or_gpa,
+ int emulation_type)
{
if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
return false;
@@ -9131,8 +9132,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_queue_exception(vcpu, UD_VECTOR);
return 1;
}
- if (reexecute_instruction(vcpu, cr2_or_gpa,
- emulation_type))
+ if (kvm_unprotect_and_retry_on_failure(vcpu, cr2_or_gpa,
+ emulation_type))
return 1;
if (ctxt->have_exception &&
@@ -9218,7 +9219,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return 1;
if (r == EMULATION_FAILED) {
- if (reexecute_instruction(vcpu, cr2_or_gpa, emulation_type))
+ if (kvm_unprotect_and_retry_on_failure(vcpu, cr2_or_gpa,
+ emulation_type))
return 1;
return handle_emulation_failure(vcpu, emulation_type);
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 21/22] KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (19 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 20/22] KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list Sean Christopherson
2024-08-14 17:58 ` [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Paolo Bonzini
22 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Fold kvm_mmu_unprotect_page() into kvm_mmu_unprotect_gfn_and_retry() now
that all other direct usage is gone.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++--------------------
2 files changed, 13 insertions(+), 21 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b3a2793fc89c..e2df07b3c411 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2131,7 +2131,6 @@ int kvm_get_nr_pending_nmis(struct kvm_vcpu *vcpu);
void kvm_update_dr7(struct kvm_vcpu *vcpu);
-int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn);
bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
bool always_retry);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 59af085a6e8e..300a47801685 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2713,31 +2713,16 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
write_unlock(&kvm->mmu_lock);
}
-int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
-{
- struct kvm_mmu_page *sp;
- LIST_HEAD(invalid_list);
- int r;
-
- r = 0;
- write_lock(&kvm->mmu_lock);
- for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
- r = 1;
- kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
- }
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
- write_unlock(&kvm->mmu_lock);
-
- return r;
-}
-
bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
bool always_retry)
{
+ struct kvm *kvm = vcpu->kvm;
+ LIST_HEAD(invalid_list);
+ struct kvm_mmu_page *sp;
gpa_t gpa = cr2_or_gpa;
bool r = false;
- if (!vcpu->kvm->arch.indirect_shadow_pages)
+ if (!kvm->arch.indirect_shadow_pages)
goto out;
if (!vcpu->arch.mmu->root_role.direct) {
@@ -2746,7 +2731,15 @@ bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
goto out;
}
- r = kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
+ r = false;
+ write_lock(&kvm->mmu_lock);
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa)) {
+ r = true;
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ }
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ write_unlock(&kvm->mmu_lock);
+
out:
if (r || always_retry) {
vcpu->arch.last_retry_eip = kvm_rip_read(vcpu);
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (20 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 21/22] KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version Sean Christopherson
@ 2024-08-09 19:03 ` Sean Christopherson
2024-08-14 17:57 ` Paolo Bonzini
2024-08-14 17:58 ` [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Paolo Bonzini
22 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-09 19:03 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
Explicitly query the list of to-be-zapped shadow pages when checking to
see if unprotecting a gfn for retry has succeeded, i.e. if KVM should
retry the faulting instruction.
Add a comment to explain why the list needs to be checked before zapping,
which is the primary motivation for this change.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 300a47801685..50695eb2ee22 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2731,12 +2731,15 @@ bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
goto out;
}
- r = false;
write_lock(&kvm->mmu_lock);
- for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa)) {
- r = true;
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa))
kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
- }
+
+ /*
+ * Snapshot the result before zapping, as zapping will remove all list
+ * entries, i.e. checking the list later would yield a false negative.
+ */
+ r = !list_empty(&invalid_list);
kvm_mmu_commit_zap_page(kvm, &invalid_list);
write_unlock(&kvm->mmu_lock);
--
2.46.0.76.ge559c4bf1a-goog
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 02/22] KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
2024-08-09 19:02 ` [PATCH 02/22] KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid Sean Christopherson
@ 2024-08-14 11:11 ` Yuan Yao
0 siblings, 0 replies; 53+ messages in thread
From: Yuan Yao @ 2024-08-14 11:11 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Peter Gonda, Michael Roth,
Vishal Annapurve, Ackerly Tng
On Fri, Aug 09, 2024 at 12:02:59PM -0700, Sean Christopherson wrote:
> Set PFERR_GUEST_{FINAL,PAGE}_MASK based on EPT_VIOLATION_GVA_TRANSLATED if
> and only if EPT_VIOLATION_GVA_IS_VALID is also set in exit qualification.
> Per the SDM, bit 8 (EPT_VIOLATION_GVA_TRANSLATED) is valid if and only if
> bit 7 (EPT_VIOLATION_GVA_IS_VALID) is set, and is '0' if bit 7 is '0'.
>
> Bit 7 (a.k.a. EPT_VIOLATION_GVA_IS_VALID)
>
> Set if the guest linear-address field is valid. The guest linear-address
> field is valid for all EPT violations except those resulting from an
> attempt to load the guest PDPTEs as part of the execution of the MOV CR
> instruction and those due to trace-address pre-translation
>
> Bit 8 (a.k.a. EPT_VIOLATION_GVA_TRANSLATED)
>
> If bit 7 is 1:
> • Set if the access causing the EPT violation is to a guest-physical
> address that is the translation of a linear address.
> • Clear if the access causing the EPT violation is to a paging-structure
> entry as part of a page walk or the update of an accessed or dirty bit.
> Reserved if bit 7 is 0 (cleared to 0).
>
> Failure to guard the logic on GVA_IS_VALID results in KVM marking the page
> fault as PFERR_GUEST_PAGE_MASK when there is no known GVA, which can put
> the vCPU into an infinite loop due to kvm_mmu_page_fault() getting a false
> positive on its PFERR_NESTED_GUEST_PAGE logic (though only because that
> logic is also buggy/flawed).
>
> In practice, this is largely a non-issue because GVA_IS_VALID is almost
> always set. However, when TDX comes along, GVA_IS_VALID will *never* be
> set, as the TDX Module deliberately clears bits 12:7 in exit qualification,
> e.g. so that the faulting virtual address and other metadata that aren't
> practically useful for the hypervisor aren't leaked to the untrusted host.
>
> When exit is due to EPT violation, bits 12-7 of the exit qualification
> are cleared to 0.
>
> Fixes: eebed2438923 ("kvm: nVMX: Add support for fast unprotection of nested guest page tables")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f18c2d8c7476..52de013550e9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5804,8 +5804,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
> ? PFERR_PRESENT_MASK : 0;
>
> - error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
> - PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> + if (exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
> + error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) ?
> + PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
>
> /*
> * Check that the GPA doesn't exceed physical memory limits, as that is
> --
> 2.46.0.76.ge559c4bf1a-goog
>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
2024-08-09 19:03 ` [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults Sean Christopherson
@ 2024-08-14 11:42 ` Yuan Yao
2024-08-14 14:21 ` Sean Christopherson
2024-08-14 16:40 ` Paolo Bonzini
1 sibling, 1 reply; 53+ messages in thread
From: Yuan Yao @ 2024-08-14 11:42 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Peter Gonda, Michael Roth,
Vishal Annapurve, Ackerly Tng
On Fri, Aug 09, 2024 at 12:03:00PM -0700, Sean Christopherson wrote:
> Trigger KVM's various "unprotect gfn" paths if and only if the page fault
> was a write to a write-protected gfn. To do so, add a new page fault
> return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track
> such page faults.
>
> If a page fault requires emulation for any MMIO (or any reason besides
> write-protection), trying to unprotect the gfn is pointless and risks
> putting the vCPU into an infinite loop. E.g. KVM will put the vCPU into
> an infinite loop if the vCPU manages to trigger MMIO on a page table walk.
>
> Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 78 +++++++++++++++++++--------------
> arch/x86/kvm/mmu/mmu_internal.h | 3 ++
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 6 +--
> 5 files changed, 53 insertions(+), 37 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 901be9e420a4..e3aa04c498ea 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2914,10 +2914,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> trace_kvm_mmu_set_spte(level, gfn, sptep);
> }
>
> - if (wrprot) {
> - if (write_fault)
> - ret = RET_PF_EMULATE;
> - }
> + if (wrprot && write_fault)
> + ret = RET_PF_WRITE_PROTECTED;
>
> if (flush)
> kvm_flush_remote_tlbs_gfn(vcpu->kvm, gfn, level);
> @@ -4549,7 +4547,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_RETRY;
>
> if (page_fault_handle_page_track(vcpu, fault))
> - return RET_PF_EMULATE;
> + return RET_PF_WRITE_PROTECTED;
>
> r = fast_page_fault(vcpu, fault);
> if (r != RET_PF_INVALID)
> @@ -4642,7 +4640,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> int r;
>
> if (page_fault_handle_page_track(vcpu, fault))
> - return RET_PF_EMULATE;
> + return RET_PF_WRITE_PROTECTED;
>
> r = fast_page_fault(vcpu, fault);
> if (r != RET_PF_INVALID)
> @@ -4726,6 +4724,9 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> case RET_PF_EMULATE:
> return -ENOENT;
>
> + case RET_PF_WRITE_PROTECTED:
> + return -EPERM;
> +
> case RET_PF_RETRY:
> case RET_PF_CONTINUE:
> case RET_PF_INVALID:
> @@ -5960,6 +5961,41 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> write_unlock(&vcpu->kvm->mmu_lock);
> }
>
> +static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> + u64 error_code, int *emulation_type)
> +{
> + bool direct = vcpu->arch.mmu->root_role.direct;
> +
> + /*
> + * Before emulating the instruction, check if the error code
> + * was due to a RO violation while translating the guest page.
> + * This can occur when using nested virtualization with nested
> + * paging in both guests. If true, we simply unprotect the page
> + * and resume the guest.
> + */
> + if (direct &&
> + (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> + return RET_PF_FIXED;
> + }
> +
> + /*
> + * The gfn is write-protected, but if emulation fails we can still
> + * optimistically try to just unprotect the page and let the processor
> + * re-execute the instruction that caused the page fault. Do not allow
> + * retrying MMIO emulation, as it's not only pointless but could also
> + * cause us to enter an infinite loop because the processor will keep
> + * faulting on the non-existent MMIO address. Retrying an instruction
> + * from a nested guest is also pointless and dangerous as we are only
> + * explicitly shadowing L1's page tables, i.e. unprotecting something
> + * for L1 isn't going to magically fix whatever issue cause L2 to fail.
> + */
> + if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
Looks like the mmio_info_in_cache() check can be removed; emulation
should not come here now that RET_PF_WRITE_PROTECTED is introduced.
Maybe WARN_ON_ONCE() it.
> + *emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
> +
> + return RET_PF_EMULATE;
> +}
> +
> int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> void *insn, int insn_len)
> {
> @@ -6005,6 +6041,10 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> if (r < 0)
> return r;
>
> + if (r == RET_PF_WRITE_PROTECTED)
> + r = kvm_mmu_write_protect_fault(vcpu, cr2_or_gpa, error_code,
> + &emulation_type);
> +
> if (r == RET_PF_FIXED)
> vcpu->stat.pf_fixed++;
> else if (r == RET_PF_EMULATE)
> @@ -6015,32 +6055,6 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err
> if (r != RET_PF_EMULATE)
> return 1;
>
> - /*
> - * Before emulating the instruction, check if the error code
> - * was due to a RO violation while translating the guest page.
> - * This can occur when using nested virtualization with nested
> - * paging in both guests. If true, we simply unprotect the page
> - * and resume the guest.
> - */
> - if (vcpu->arch.mmu->root_role.direct &&
> - (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> - return 1;
> - }
> -
> - /*
> - * vcpu->arch.mmu.page_fault returned RET_PF_EMULATE, but we can still
> - * optimistically try to just unprotect the page and let the processor
> - * re-execute the instruction that caused the page fault. Do not allow
> - * retrying MMIO emulation, as it's not only pointless but could also
> - * cause us to enter an infinite loop because the processor will keep
> - * faulting on the non-existent MMIO address. Retrying an instruction
> - * from a nested guest is also pointless and dangerous as we are only
> - * explicitly shadowing L1's page tables, i.e. unprotecting something
> - * for L1 isn't going to magically fix whatever issue cause L2 to fail.
> - */
> - if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
> - emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
> emulate:
> return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
> insn_len);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1721d97743e9..50d2624111f8 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -258,6 +258,8 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
> * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
> * RET_PF_RETRY: let CPU fault again on the address.
> * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
> + * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
> + * gfn and retry, or emulate the instruction directly.
> * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
> * RET_PF_FIXED: The faulting entry has been fixed.
> * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
> @@ -274,6 +276,7 @@ enum {
> RET_PF_CONTINUE = 0,
> RET_PF_RETRY,
> RET_PF_EMULATE,
> + RET_PF_WRITE_PROTECTED,
> RET_PF_INVALID,
> RET_PF_FIXED,
> RET_PF_SPURIOUS,
> diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
> index 195d98bc8de8..f35a830ce469 100644
> --- a/arch/x86/kvm/mmu/mmutrace.h
> +++ b/arch/x86/kvm/mmu/mmutrace.h
> @@ -57,6 +57,7 @@
> TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
> TRACE_DEFINE_ENUM(RET_PF_RETRY);
> TRACE_DEFINE_ENUM(RET_PF_EMULATE);
> +TRACE_DEFINE_ENUM(RET_PF_WRITE_PROTECTED);
> TRACE_DEFINE_ENUM(RET_PF_INVALID);
> TRACE_DEFINE_ENUM(RET_PF_FIXED);
> TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 69941cebb3a8..a722a3c96af9 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -805,7 +805,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> if (page_fault_handle_page_track(vcpu, fault)) {
> shadow_page_table_clear_flood(vcpu, fault->addr);
> - return RET_PF_EMULATE;
> + return RET_PF_WRITE_PROTECTED;
> }
>
> r = mmu_topup_memory_caches(vcpu, true);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c7dc49ee7388..8bf44ac9372f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1046,10 +1046,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> * protected, emulation is needed. If the emulation was skipped,
> * the vCPU would have the same fault again.
> */
> - if (wrprot) {
> - if (fault->write)
> - ret = RET_PF_EMULATE;
> - }
> + if (wrprot && fault->write)
> + ret = RET_PF_WRITE_PROTECTED;
>
> /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
> --
> 2.46.0.76.ge559c4bf1a-goog
>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
2024-08-14 11:42 ` Yuan Yao
@ 2024-08-14 14:21 ` Sean Christopherson
2024-08-15 8:30 ` Yuan Yao
0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-14 14:21 UTC (permalink / raw)
To: Yuan Yao
Cc: Paolo Bonzini, kvm, linux-kernel, Peter Gonda, Michael Roth,
Vishal Annapurve, Ackerly Tng
On Wed, Aug 14, 2024, Yuan Yao wrote:
> > @@ -5960,6 +5961,41 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> > write_unlock(&vcpu->kvm->mmu_lock);
> > }
> >
> > +static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > + u64 error_code, int *emulation_type)
> > +{
> > + bool direct = vcpu->arch.mmu->root_role.direct;
> > +
> > + /*
> > + * Before emulating the instruction, check if the error code
> > + * was due to a RO violation while translating the guest page.
> > + * This can occur when using nested virtualization with nested
> > + * paging in both guests. If true, we simply unprotect the page
> > + * and resume the guest.
> > + */
> > + if (direct &&
> > + (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> > + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> > + return RET_PF_FIXED;
> > + }
> > +
> > + /*
> > + * The gfn is write-protected, but if emulation fails we can still
> > + * optimistically try to just unprotect the page and let the processor
> > + * re-execute the instruction that caused the page fault. Do not allow
> > + * retrying MMIO emulation, as it's not only pointless but could also
> > + * cause us to enter an infinite loop because the processor will keep
> > + * faulting on the non-existent MMIO address. Retrying an instruction
> > + * from a nested guest is also pointless and dangerous as we are only
> > + * explicitly shadowing L1's page tables, i.e. unprotecting something
> > + * for L1 isn't going to magically fix whatever issue cause L2 to fail.
> > + */
> > + if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
>
> Looks like the mmio_info_in_cache() check can be removed; emulation
> should not come here now that RET_PF_WRITE_PROTECTED is introduced.
> Maybe WARN_ON_ONCE() it.
Yeah, that was my instinct as well. I kept it mostly because I liked having the
comment, but also because I was thinking the cache could theoretically get a hit.
But that's not true. KVM should return RET_PF_WRITE_PROTECTED if and only if
there is a memslot, and creating a memslot is supposed to invalidate the MMIO
cache by virtue of changing the memslot generation.
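For the curious, the cached MMIO info is already keyed to the memslot
generation; from memory (so treat the exact helper as an approximation),
the gva/gpa caches are only honored when something like this holds:

        /* Stale cached MMIO info is ignored once the memslots change. */
        static inline bool vcpu_match_mmio_gen(struct kvm_vcpu *vcpu)
        {
                return vcpu->arch.mmio_gen == kvm_memslots(vcpu->kvm)->generation;
        }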
Unless someone feels strongly that the mmio_info_in_cache() call should be
deleted entirely, I'll tack on this patch. The cache lookup is cheap, and IMO
it's helpful to explicitly document that it should be impossible to reach this
point with what appears to be MMIO.
---
arch/x86/kvm/mmu/mmu.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 50695eb2ee22..7f3f57237f23 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5997,6 +5997,18 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
vcpu->arch.last_retry_eip = 0;
vcpu->arch.last_retry_addr = 0;
+ /*
+ * It should be impossible to reach this point with an MMIO cache hit,
+ * as RET_PF_WRITE_PROTECTED is returned if and only if there's a valid,
+ * writable memslot, and creating a memslot should invalidate the MMIO
+ * cache by way of changing the memslot generation. WARN and disallow
+ * retry if MMIO is detected, as retrying MMIO emulation is pointless and
+ * could put the vCPU into an infinite loop because the processor will
+ * keep faulting on the non-existent MMIO address.
+ */
+ if (WARN_ON_ONCE(mmio_info_in_cache(vcpu, cr2_or_gpa, direct)))
+ return RET_PF_EMULATE;
+
/*
* Before emulating the instruction, check to see if the access may be
* due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
@@ -6029,17 +6041,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
return RET_PF_FIXED;
/*
- * The gfn is write-protected, but if emulation fails we can still
- * optimistically try to just unprotect the page and let the processor
+ * The gfn is write-protected, but if KVM detects it's emulating an
+ * instruction that is unlikely to be used to modify page tables, or if
+ * emulation fails, KVM can try to unprotect the gfn and let the CPU
* re-execute the instruction that caused the page fault. Do not allow
- * retrying MMIO emulation, as it's not only pointless but could also
- * cause us to enter an infinite loop because the processor will keep
- * faulting on the non-existent MMIO address. Retrying an instruction
- * from a nested guest is also pointless and dangerous as we are only
- * explicitly shadowing L1's page tables, i.e. unprotecting something
- * for L1 isn't going to magically fix whatever issue cause L2 to fail.
+ * retrying an instruction from a nested guest as KVM is only explicitly
+ * shadowing L1's page tables, i.e. unprotecting something for L1 isn't
+ * going to magically fix whatever issue cause L2 to fail.
*/
- if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
+ if (!is_guest_mode(vcpu))
*emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
return RET_PF_EMULATE;
base-commit: 7d33880356496eb0640c6c825cc60898063c4902
--
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
2024-08-09 19:03 ` [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected Sean Christopherson
@ 2024-08-14 14:22 ` Yuan Yao
2024-08-15 23:31 ` Yao Yuan
2024-08-14 16:47 ` Paolo Bonzini
1 sibling, 1 reply; 53+ messages in thread
From: Yuan Yao @ 2024-08-14 14:22 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Peter Gonda, Michael Roth,
Vishal Annapurve, Ackerly Tng
On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> When doing "fast unprotection" of nested TDP page tables, skip emulation
> if and only if at least one gfn was unprotected, i.e. continue with
> emulation if simply resuming is likely to hit the same fault and risk
> putting the vCPU into an infinite loop.
>
> Note, it's entirely possible to get a false negative, e.g. if a different
> vCPU faults on the same gfn and unprotects the gfn first, but that's a
> relatively rare edge case, and emulating is still functionally ok, i.e.
> the risk of putting the vCPU isn't an infinite loop isn't justified.
>
> Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> 1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e3aa04c498ea..95058ac4b78c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> bool direct = vcpu->arch.mmu->root_role.direct;
>
> /*
> - * Before emulating the instruction, check if the error code
> - * was due to a RO violation while translating the guest page.
> - * This can occur when using nested virtualization with nested
> - * paging in both guests. If true, we simply unprotect the page
> - * and resume the guest.
> + * Before emulating the instruction, check to see if the access may be
> + * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> + * gfn being written is for gPTEs that KVM is shadowing and has write-
> + * protected. Because AMD CPUs walk nested page table using a write
> + * operation, walking NPT entries in L1 can trigger write faults even
> + * when L1 isn't modifying PTEs, and thus result in KVM emulating an
> + * excessive number of L1 instructions without triggering KVM's write-
> + * flooding detection, i.e. without unprotecting the gfn.
> + *
> + * If the error code was due to a RO violation while translating the
> + * guest page, the current MMU is direct (L1 is active), and KVM has
> + * shadow pages, then the above scenario is likely being hit. Try to
> + * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
> + * its NPT entries without triggering emulation. If one or more shadow
> + * pages was zapped, skip emulation and resume L1 to let it natively
> + * execute the instruction. If no shadow pages were zapped, then the
> + * write-fault is due to something else entirely, i.e. KVM needs to
> + * emulate, as resuming the guest will put it into an infinite loop.
> */
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
> if (direct &&
> - (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> + (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> return RET_PF_FIXED;
> - }
>
> /*
> * The gfn is write-protected, but if emulation fails we can still
> --
> 2.46.0.76.ge559c4bf1a-goog
>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
2024-08-09 19:02 ` [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) Sean Christopherson
@ 2024-08-14 16:31 ` Paolo Bonzini
2025-12-03 13:04 ` Naveen N Rao
1 sibling, 0 replies; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 16:31 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:02, Sean Christopherson wrote:
> Disallow read-only memory for all VMs with protected state, i.e. for
> upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
> to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> protections and SUPPRESS_VE=0. But there is no strong use case for
> supporting read-only memslots on TDX, e.g. the main historical usage is
> to emulate option ROMs, but TDX disallows executing from shared memory.
> And if someone comes along with a legitimate, strong use case, the
> restriction can always be lifted for TDX.
>
> Don't bother trying to retroactively apply the restriction to SEV-ES
> VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
> possibly work for SEV-ES, i.e. disallowing such memslots really just
> means reporting an error to userspace instead of silently hanging vCPUs.
> Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> isn't worth the marginal benefit it would provide userspace.
Queuing this one for 6.11, thanks.
Paolo
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
2024-08-09 19:03 ` [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults Sean Christopherson
2024-08-14 11:42 ` Yuan Yao
@ 2024-08-14 16:40 ` Paolo Bonzini
2024-08-14 19:34 ` Sean Christopherson
1 sibling, 1 reply; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 16:40 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> Trigger KVM's various "unprotect gfn" paths if and only if the page fault
> was a write to a write-protected gfn. To do so, add a new page fault
> return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track
> such page faults.
>
> If a page fault requires emulation for any MMIO (or any reason besides
> write-protection), trying to unprotect the gfn is pointless and risks
> putting the vCPU into an infinite loop. E.g. KVM will put the vCPU into
> an infinite loop if the vCPU manages to trigger MMIO on a page table walk.
>
> Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> Cc: stable@vger.kernel.org
Do we really want Cc: stable@ for all these patches? Most of them are
of the "if it hurts, don't do it" kind; as long as there are no infinite
loops in a non-killable region, I prefer not to complicate our lives
with cherry picks of unknown quality.
That said, this patch could be interesting for 6.11 because of the
effect on prefaulting (see below).
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 78 +++++++++++++++++++--------------
> arch/x86/kvm/mmu/mmu_internal.h | 3 ++
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 6 +--
> 5 files changed, 53 insertions(+), 37 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 901be9e420a4..e3aa04c498ea 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2914,10 +2914,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> trace_kvm_mmu_set_spte(level, gfn, sptep);
> }
>
> - if (wrprot) {
> - if (write_fault)
> - ret = RET_PF_EMULATE;
> - }
> + if (wrprot && write_fault)
> + ret = RET_PF_WRITE_PROTECTED;
>
> if (flush)
> kvm_flush_remote_tlbs_gfn(vcpu->kvm, gfn, level);
> @@ -4549,7 +4547,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> return RET_PF_RETRY;
>
> if (page_fault_handle_page_track(vcpu, fault))
> - return RET_PF_EMULATE;
> + return RET_PF_WRITE_PROTECTED;
>
> r = fast_page_fault(vcpu, fault);
> if (r != RET_PF_INVALID)
> @@ -4642,7 +4640,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> int r;
>
> if (page_fault_handle_page_track(vcpu, fault))
> - return RET_PF_EMULATE;
> + return RET_PF_WRITE_PROTECTED;
>
> r = fast_page_fault(vcpu, fault);
> if (r != RET_PF_INVALID)
> @@ -4726,6 +4724,9 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> case RET_PF_EMULATE:
> return -ENOENT;
>
> + case RET_PF_WRITE_PROTECTED:
> + return -EPERM;
Shouldn't this be a "return 0"? Even if kvm_mmu_do_page_fault() cannot
fully unprotect the page, it was nevertheless prefaulted as much as
possible.
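i.e., just to spell out the suggestion (same switch statement as quoted
above):

        case RET_PF_WRITE_PROTECTED:
                return 0;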
Paolo
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
2024-08-09 19:03 ` [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected Sean Christopherson
2024-08-14 14:22 ` Yuan Yao
@ 2024-08-14 16:47 ` Paolo Bonzini
1 sibling, 0 replies; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 16:47 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> When doing "fast unprotection" of nested TDP page tables, skip emulation
> if and only if at least one gfn was unprotected, i.e. continue with
> emulation if simply resuming is likely to hit the same fault and risk
> putting the vCPU into an infinite loop.
>
> Note, it's entirely possible to get a false negative, e.g. if a different
> vCPU faults on the same gfn and unprotects the gfn first, but that's a
> relatively rare edge case, and emulating is still functionally ok, i.e.
> the risk of putting the vCPU isn't an infinite loop isn't justified.
English snafu - "the risk of causing a livelock for the vCPU is
negligible", perhaps?
Paolo
> Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> 1 file changed, 20 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e3aa04c498ea..95058ac4b78c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> bool direct = vcpu->arch.mmu->root_role.direct;
>
> /*
> - * Before emulating the instruction, check if the error code
> - * was due to a RO violation while translating the guest page.
> - * This can occur when using nested virtualization with nested
> - * paging in both guests. If true, we simply unprotect the page
> - * and resume the guest.
> + * Before emulating the instruction, check to see if the access may be
> + * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> + * gfn being written is for gPTEs that KVM is shadowing and has write-
> + * protected. Because AMD CPUs walk nested page table using a write
> + * operation, walking NPT entries in L1 can trigger write faults even
> + * when L1 isn't modifying PTEs, and thus result in KVM emulating an
> + * excessive number of L1 instructions without triggering KVM's write-
> + * flooding detection, i.e. without unprotecting the gfn.
> + *
> + * If the error code was due to a RO violation while translating the
> + * guest page, the current MMU is direct (L1 is active), and KVM has
> + * shadow pages, then the above scenario is likely being hit. Try to
> + * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
> + * its NPT entries without triggering emulation. If one or more shadow
> + * pages was zapped, skip emulation and resume L1 to let it natively
> + * execute the instruction. If no shadow pages were zapped, then the
> + * write-fault is due to something else entirely, i.e. KVM needs to
> + * emulate, as resuming the guest will put it into an infinite loop.
> */
> if (direct &&
> - (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> + (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> return RET_PF_FIXED;
> - }
>
> /*
> * The gfn is write-protected, but if emulation fails we can still
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 06/22] KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip
2024-08-09 19:03 ` [PATCH 06/22] KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip Sean Christopherson
@ 2024-08-14 17:01 ` Paolo Bonzini
0 siblings, 0 replies; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:01 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> Read RIP from vCPU state instead of pulling it from the emulation context
> when filling last_retry_eip, which is part of the anti-infinite-loop
> protection used when unprotecting and retrying instructions that hit a
> write-protected gfn.
>
> This will allow reusing the anti-infinite-loop protection in flows that
> never make it into the emulator.
>
> This is a glorified nop as ctxt->eip is set to kvm_rip_read() in
> init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with
> EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through
> x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet).
This is as much a nit as it can be, but "glorified nop" would be
interpreted more as "the assignment is not needed at all", or something
similarly wrong. Just "This has no functional change because..." will do.
Paolo
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/x86.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2072cceac68f..372ed3842732 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8973,7 +8973,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
> if (!kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
> return false;
>
> - vcpu->arch.last_retry_eip = ctxt->eip;
> + vcpu->arch.last_retry_eip = kvm_rip_read(vcpu);
> vcpu->arch.last_retry_addr = cr2_or_gpa;
> return true;
> }
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
2024-08-09 19:03 ` [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs Sean Christopherson
@ 2024-08-14 17:30 ` Paolo Bonzini
2024-08-15 14:09 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:30 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> Try to unprotect shadow pages if and only if indirect_shadow_pages is non-
> zero, i.e. iff there is at least one such protected shadow page. Pre-
> checking indirect_shadow_pages avoids taking mmu_lock for write when the
> gfn is write-protected by a third party, i.e. not for KVM shadow paging,
> and in the *extremely* unlikely case that a different task has already
> unprotected the last shadow page.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 09a42dc1fe5a..358294889baa 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2736,6 +2736,9 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
> gpa_t gpa = cr2_or_gpa;
> bool r;
>
> + if (!vcpu->kvm->arch.indirect_shadow_pages)
> + return false;
indirect_shadow_pages is accessed without a lock, so here please add a
note that, while it may be stale, a false negative will only cause KVM
to skip the "unprotect and retry" optimization. (This is preexisting in
reexecute_instruction() and goes away in patch 18, if I'm pre-reading
that part of the series correctly).
Bonus points for opportunistically adding a READ_ONCE() here and in
kvm_mmu_track_write().
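Something along these lines (untested, just to illustrate; the
kvm_mmu_track_write() side would be analogous):

        if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
                return false;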
Paolo
> if (!vcpu->arch.mmu->root_role.direct)
> gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 10/22] KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper
2024-08-09 19:03 ` [PATCH 10/22] KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper Sean Christopherson
@ 2024-08-14 17:32 ` Paolo Bonzini
2024-08-15 14:10 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:32 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> - if (direct &&
> - (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> + if (direct && (is_write_to_guest_page_table(error_code)) &&
Too many parentheses. :)
Maybe put it before patch 3 if we decide not to Cc: stable? Or even
squash it in there?
Paolo
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path
2024-08-09 19:03 ` [PATCH 15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path Sean Christopherson
@ 2024-08-14 17:43 ` Paolo Bonzini
2024-08-26 21:52 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:43 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> Move the event re-injection unprotect+retry logic into
> kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if
> the #PF actually hit a write-protected gfn. Note, there is a small
> possibility that the gfn was unprotected by a different task between
> hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume
> the guest immediately anyway because KVM will treat the fault as spurious.
>
> As a bonus, unprotecting _after_ handling the page fault also addresses the
> case where installing a SPTE to handle the fault encounters a shadowed PTE,
> i.e. *creates* a read-only SPTE.
>
> Opportunistically add a comment explaining what on earth the intent of the
> code is, based on the changelog from commit 577bdc496614 ("KVM: Avoid
> instruction emulation when event delivery is pending").
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 28 ++++++++--------------------
> 1 file changed, 8 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f64ad36ca9e0..d3c0220ff7ee 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2753,23 +2753,6 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
> return r;
> }
>
> -static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
> -{
> - gpa_t gpa;
> - int r;
> -
> - if (vcpu->arch.mmu->root_role.direct)
> - return 0;
> -
> - gpa = kvm_mmu_gva_to_gpa_write(vcpu, gva, NULL);
> - if (gpa == INVALID_GPA)
> - return 0;
> -
> - r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
> -
> - return r;
> -}
> -
> static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> {
> trace_kvm_mmu_unsync_page(sp);
> @@ -4640,8 +4623,6 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> if (!flags) {
> trace_kvm_page_fault(vcpu, fault_address, error_code);
>
> - if (kvm_event_needs_reinjection(vcpu))
> - kvm_mmu_unprotect_page_virt(vcpu, fault_address);
> r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
> insn_len);
> } else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
> @@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> * execute the instruction. If no shadow pages were zapped, then the
> * write-fault is due to something else entirely, i.e. KVM needs to
> * emulate, as resuming the guest will put it into an infinite loop.
> + *
> + * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
> + * unprotect the gfn and retry if an event is awaiting reinjection. If
> + * KVM emulates multiple instructions before completing event injection,
> + * the event could be delayed beyond what is architecturally allowed,
> + * e.g. KVM could inject an IRQ after the TPR has been raised.
This paragraph should go before the description of
kvm_mmu_unprotect_gfn_and_retry:
* There are two cases in which we try to unprotect the page here
* preemptively, i.e. zap any shadow pages, before emulating the
* instruction.
*
* First, the access may be due to L1 accessing nested NPT/EPT entries
* used for L2, i.e. if the gfn being written is for gPTEs that KVM is
* shadowing and has write-protected. In this case, because AMD CPUs
* walk nested page table using a write operation, walking NPT entries
* in L1 can trigger write faults even when L1 isn't modifying PTEs.
* KVM would then emulate an excessive number of L1 instructions
* without triggering KVM's write-flooding detection, i.e. without
* unprotecting the gfn. This is detected as a RO violation while
* translating the guest page when the current MMU is direct.
*
* Second, for indirect MMUs, i.e. if KVM is shadowing the current MMU,
* unprotect the gfn and reenter the guest if an event is awaiting
* reinjection. If KVM emulates multiple instructions before completing
* event injection, the event could be delayed beyond what is
* architecturally allowed, e.g. KVM could inject an IRQ after the
* TPR has been raised.
*
* In both cases, if one or more shadow pages were zapped, skip
* emulation and resume L1 to let it natively execute the instruction.
* If no shadow pages were zapped, then the write-fault is due to
* something else entirely and KVM needs to emulate, as resuming
* the guest will put it into an infinite loop.
Thanks,
Paolo
> */
> - if (direct && (is_write_to_guest_page_table(error_code)) &&
> + if (((direct && is_write_to_guest_page_table(error_code)) ||
> + (!direct && kvm_event_needs_reinjection(vcpu))) &&
> kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
> return RET_PF_FIXED;
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 16/22] KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation
2024-08-09 19:03 ` [PATCH 16/22] KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation Sean Christopherson
@ 2024-08-14 17:50 ` Paolo Bonzini
0 siblings, 0 replies; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:50 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> Drop the manual pfn lookup when retrying an instruction that KVM failed to
> emulate in response to a #PF due to a write-protected gfn. Now that KVM
> sets EMULTYPE_PF if and only if the page fault hit a write-protected gfn,
Pointing out where this happened will likely help a few years from now:
With the introduction of RET_PF_WRITE_PROTECTED, KVM sets EMULTYPE_PF if
and only if the page fault hit a write-protected gfn, i.e. if and only if
there's a writable memslot. KVM will never try to redo an instruction
that failed on emulated MMIO (no slot, or a write to a read-only slot),
so there's no need to redo the lookup in reexecute_instruction().
Paolo
> i.e. if and only if there's a writable memslot, there's no need to redo
> the lookup to avoid retrying an instruction that failed on emulated MMIO
> (no slot, or a write to a read-only slot).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/x86.c | 18 ------------------
> 1 file changed, 18 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 771e67381fce..67f9871990fb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8867,7 +8867,6 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> int emulation_type)
> {
> gpa_t gpa = cr2_or_gpa;
> - kvm_pfn_t pfn;
>
> if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
> return false;
> @@ -8887,23 +8886,6 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> return true;
> }
>
> - /*
> - * Do not retry the unhandleable instruction if it faults on the
> - * readonly host memory, otherwise it will goto a infinite loop:
> - * retry instruction -> write #PF -> emulation fail -> retry
> - * instruction -> ...
> - */
> - pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> -
> - /*
> - * If the instruction failed on the error pfn, it can not be fixed,
> - * report the error to userspace.
> - */
> - if (is_error_noslot_pfn(pfn))
> - return false;
> -
> - kvm_release_pfn_clean(pfn);
> -
> /*
> * If emulation may have been triggered by a write to a shadowed page
> * table, unprotect the gfn (zap any relevant SPTEs) and re-enter the
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 17/22] KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
2024-08-09 19:03 ` [PATCH 17/22] KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn Sean Christopherson
@ 2024-08-14 17:53 ` Paolo Bonzini
0 siblings, 0 replies; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:53 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> + * Retry even if _this_ vCPU didn't unprotect the gfn, as it's possible
> + * all SPTEs were already zapped by a different task. The alternative
> + * is to report the error to userspace and likely terminate the guest,
> + * and the infinite loop detection logic will prevent retrying the page
> + * fault indefinitely, i.e. there's nothing to lose by retrying.
Putting myself in the shoes of someone unfamiliar with the code, I might
prefer "the last_retry_eip/last_retry_addr checks" to "the infinite loop
detection logic"; after all, you're saying in the same sentence that
it's preventing an infinite loop.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
2024-08-09 19:03 ` [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list Sean Christopherson
@ 2024-08-14 17:57 ` Paolo Bonzini
2024-08-15 14:25 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:57 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:03, Sean Christopherson wrote:
> Explicitly query the list of to-be-zapped shadow pages when checking to
> see if unprotecting a gfn for retry has succeeded, i.e. if KVM should
> retry the faulting instruction.
>
> Add a comment to explain why the list needs to be checked before zapping,
> which is the primary motivation for this change.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 300a47801685..50695eb2ee22 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2731,12 +2731,15 @@ bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> goto out;
> }
>
> - r = false;
> write_lock(&kvm->mmu_lock);
> - for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa)) {
> - r = true;
> + for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa))
> kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> - }
> +
> + /*
> + * Snapshot the result before zapping, as zapping will remove all list
> + * entries, i.e. checking the list later would yield a false negative.
> + */
Hmm, the comment is kinda overkill? Maybe just
/* Return whether there were sptes to zap. */
r = !list_empty(&invalid_list);
I'm not sure about patch 21 - I like the simple kvm_mmu_unprotect_page()
function. Maybe rename it to kvm_mmu_zap_gfn() and make it static in
the same patch?
Either way, this small cleanup applies even if the function is not inlined.
Thanks,
Paolo
> + r = !list_empty(&invalid_list);
> kvm_mmu_commit_zap_page(kvm, &invalid_list);
> write_unlock(&kvm->mmu_lock);
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
` (21 preceding siblings ...)
2024-08-09 19:03 ` [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list Sean Christopherson
@ 2024-08-14 17:58 ` Paolo Bonzini
22 siblings, 0 replies; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-14 17:58 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On 8/9/24 21:02, Sean Christopherson wrote:
> The folks doing TDX enabling ran into a problem where exposing a read-only
> memslot to a TDX guest put it into an infinite loop. The most immediate
> issue is that KVM never creates MMIO SPTEs for RO memslots, because except
> for TDX (which isn't officially supported yet), such SPTEs can't distinguish
> between reads and writes, i.e. would trigger MMIO on everything and thus
> defeat the purpose of having a RX memslot.
>
> That breaks TDX, SEV-ES, and SNP, i.e. VM types that rely on MMIO caching
> to reflect MMIO faults into the guest as #VC/#VE, as the guest never sees
> the fault, KVM refuses to emulate, the guest loops indefinitely. That's
> patch 1.
>
> Patches 2-4 fix an amusing number of other bugs that made it difficult to
> figure out the true root cause.
>
> The rest is a bunch of cleanups to consolidate all of the unprotect+retry
> paths (there are four-ish).
>
> As a bonus, adding RET_PF_WRITE_PROTECTED obviates the need for
> kvm_lookup_pfn()[*].
>
> [*] https://lore.kernel.org/all/63c41e25-2523-4397-96b4-557394281443@redhat.com
Nice! For now I've placed it in kvm/queue as this is clearly 6.12
material. It will be replaced by the v2 of course before graduating to
kvm/next.
Thanks,
Paolo
> Sean Christopherson (22):
> KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
> KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is
> valid
> KVM: x86/mmu: Trigger unprotect logic only on write-protection page
> faults
> KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
> KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is
> zapped
> KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip
> KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for
> retry
> KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path
> KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
> KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive
> helper
> KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction()
> KVM: x86: Fold retry_instruction() into x86_emulate_instruction()
> KVM: x86/mmu: Don't try to unprotect an INVALID_GPA
> KVM: x86/mmu: Always walk guest PTEs with WRITE access when
> unprotecting
> KVM: x86/mmu: Move event re-injection unprotect+retry into common path
> KVM: x86: Remove manual pfn lookup when retrying #PF after failed
> emulation
> KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
> KVM: x86: Apply retry protection to "unprotect on failure" path
> KVM: x86: Update retry protection fields when forcing retry on
> emulation failure
> KVM: x86: Rename
> reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
> KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry()
> version
> KVM: x86/mmu: Detect if unprotect will do anything based on
> invalid_list
>
> arch/x86/include/asm/kvm_host.h | 16 ++-
> arch/x86/kvm/mmu/mmu.c | 175 ++++++++++++++++++++++----------
> arch/x86/kvm/mmu/mmu_internal.h | 3 +
> arch/x86/kvm/mmu/mmutrace.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 6 +-
> arch/x86/kvm/vmx/vmx.c | 5 +-
> arch/x86/kvm/x86.c | 133 +++++++-----------------
> include/linux/kvm_host.h | 7 ++
> virt/kvm/kvm_main.c | 5 +-
> 10 files changed, 184 insertions(+), 169 deletions(-)
>
>
> base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
2024-08-14 16:40 ` Paolo Bonzini
@ 2024-08-14 19:34 ` Sean Christopherson
0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-14 19:34 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> On 8/9/24 21:03, Sean Christopherson wrote:
> > Trigger KVM's various "unprotect gfn" paths if and only if the page fault
> > was a write to a write-protected gfn. To do so, add a new page fault
> > return code, RET_PF_WRITE_PROTECTED, to explicitly and precisely track
> > such page faults.
> >
> > If a page fault requires emulation for any MMIO (or any reason besides
> > write-protection), trying to unprotect the gfn is pointless and risks
> > putting the vCPU into an infinite loop. E.g. KVM will put the vCPU into
> > an infinite loop if the vCPU manages to trigger MMIO on a page table walk.
> >
> > Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> > Cc: stable@vger.kernel.org
>
> Do we really want Cc: stable@ for all these patches? Most of them are of
> the "if it hurts, don't do it" kind;
True. I was thinking that the VMX PFERR_GUEST_{FINAL,PAGE}_MASK bug in particular
was stable-worthy, but until TDX comes along, it's only relevant if a guest puts
PDPTRs in an MMIO region. And in that case, the guest is likely hosed anyway;
the only difference is whether it gets stuck or killed.
I'll drop the stable@ tags unless someone objects.
> as long as there are no infinite loops in a non-killable region, I prefer not
> to complicate our lives with cherry picks of unknown quality.
Yeah, the RET_PF_WRITE_PROTECTED one in particular has high potential for a bad
cherry-pick.
> That said, this patch could be interesting for 6.11 because of the effect on
> prefaulting (see below).
>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 78 +++++++++++++++++++--------------
> > arch/x86/kvm/mmu/mmu_internal.h | 3 ++
> > arch/x86/kvm/mmu/mmutrace.h | 1 +
> > arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> > arch/x86/kvm/mmu/tdp_mmu.c | 6 +--
> > 5 files changed, 53 insertions(+), 37 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 901be9e420a4..e3aa04c498ea 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2914,10 +2914,8 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > trace_kvm_mmu_set_spte(level, gfn, sptep);
> > }
> > - if (wrprot) {
> > - if (write_fault)
> > - ret = RET_PF_EMULATE;
> > - }
> > + if (wrprot && write_fault)
> > + ret = RET_PF_WRITE_PROTECTED;
> > if (flush)
> > kvm_flush_remote_tlbs_gfn(vcpu->kvm, gfn, level);
> > @@ -4549,7 +4547,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > return RET_PF_RETRY;
> > if (page_fault_handle_page_track(vcpu, fault))
> > - return RET_PF_EMULATE;
> > + return RET_PF_WRITE_PROTECTED;
> > r = fast_page_fault(vcpu, fault);
> > if (r != RET_PF_INVALID)
> > @@ -4642,7 +4640,7 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> > int r;
> > if (page_fault_handle_page_track(vcpu, fault))
> > - return RET_PF_EMULATE;
> > + return RET_PF_WRITE_PROTECTED;
> > r = fast_page_fault(vcpu, fault);
> > if (r != RET_PF_INVALID)
> > @@ -4726,6 +4724,9 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > case RET_PF_EMULATE:
> > return -ENOENT;
> > + case RET_PF_WRITE_PROTECTED:
> > + return -EPERM;
>
> Shouldn't this be a "return 0"? Even if kvm_mmu_do_page_fault() cannot
> fully unprotect the page, it was nevertheless prefaulted as much as
> possible.
Hmm, I hadn't thought about it from that perspective. Ah, right, and the early
check in page_fault_handle_page_track() only handles PRESENT faults, so KVM will
at least install a read-only mapping.
So yeah, agreed this should return 0.
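I.e. the switch in kvm_tdp_map_page() would end up looking something like this
(sketch only, exact form TBD for v2):

	case RET_PF_EMULATE:
		return -ENOENT;

	case RET_PF_WRITE_PROTECTED:
		/*
		 * The gfn was prefaulted as far as possible, at worst with a
		 * read-only mapping, so treat the prefault as a success.
		 */
		return 0;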
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults
2024-08-14 14:21 ` Sean Christopherson
@ 2024-08-15 8:30 ` Yuan Yao
0 siblings, 0 replies; 53+ messages in thread
From: Yuan Yao @ 2024-08-15 8:30 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Peter Gonda, Michael Roth,
Vishal Annapurve, Ackerly Tng
On Wed, Aug 14, 2024 at 07:21:06AM -0700, Sean Christopherson wrote:
> On Wed, Aug 14, 2024, Yuan Yao wrote:
> > > @@ -5960,6 +5961,41 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
> > > write_unlock(&vcpu->kvm->mmu_lock);
> > > }
> > >
> > > +static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > + u64 error_code, int *emulation_type)
> > > +{
> > > + bool direct = vcpu->arch.mmu->root_role.direct;
> > > +
> > > + /*
> > > + * Before emulating the instruction, check if the error code
> > > + * was due to a RO violation while translating the guest page.
> > > + * This can occur when using nested virtualization with nested
> > > + * paging in both guests. If true, we simply unprotect the page
> > > + * and resume the guest.
> > > + */
> > > + if (direct &&
> > > + (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> > > + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> > > + return RET_PF_FIXED;
> > > + }
> > > +
> > > + /*
> > > + * The gfn is write-protected, but if emulation fails we can still
> > > + * optimistically try to just unprotect the page and let the processor
> > > + * re-execute the instruction that caused the page fault. Do not allow
> > > + * retrying MMIO emulation, as it's not only pointless but could also
> > > + * cause us to enter an infinite loop because the processor will keep
> > > + * faulting on the non-existent MMIO address. Retrying an instruction
> > > + * from a nested guest is also pointless and dangerous as we are only
> > > + * explicitly shadowing L1's page tables, i.e. unprotecting something
> > > + * for L1 isn't going to magically fix whatever issue cause L2 to fail.
> > > + */
> > > + if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
> >
> > It looks like the mmio_info_in_cache() check can be removed;
> > emulation should not come here once RET_PF_WRITE_PROTECTED is
> > introduced. Maybe WARN_ON_ONCE() it.
>
> Yeah, that was my instinct as well. I kept it mostly because I liked having the
> comment, but also because I was thinking the cache could theoretically get a hit.
> But that's not true. KVM should return RET_PF_WRITE_PROTECTED if and only if
> there is a memslot, and creating a memslot is supposed to invalidate the MMIO
> cache by virtue of changing the memslot generation.
>
> Unless someone feels strongly that the mmio_info_in_cache() call should be
> deleted entirely, I'll tack on this patch. The cache lookup is cheap, and IMO
> it's helpful to explicitly document that it should be impossible to reach this
> point with what appears to be MMIO.
>
> ---
> arch/x86/kvm/mmu/mmu.c | 28 +++++++++++++++++++---------
> 1 file changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 50695eb2ee22..7f3f57237f23 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5997,6 +5997,18 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> vcpu->arch.last_retry_eip = 0;
> vcpu->arch.last_retry_addr = 0;
>
> + /*
> + * It should be impossible to reach this point with an MMIO cache hit,
> + * as RET_PF_WRITE_PROTECTED is returned if and only if there's a valid,
> + * writable memslot, and creating a memslot should invalidate the MMIO
> + * cache by way of changing the memslot generation. WARN and disallow
> + * retry if MMIO is detect, as retrying MMIO emulation is pointless and
> + * could put the vCPU into an infinite loop because the processor will
> + * keep faulting on the non-existent MMIO address.
> + */
> + if (WARN_ON_ONCE(mmio_info_in_cache(vcpu, cr2_or_gpa, direct)))
> + return RET_PF_EMULATE;
> +
LGTM.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
> /*
> * Before emulating the instruction, check to see if the access may be
> * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> @@ -6029,17 +6041,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> return RET_PF_FIXED;
>
> /*
> - * The gfn is write-protected, but if emulation fails we can still
> - * optimistically try to just unprotect the page and let the processor
> + * The gfn is write-protected, but if KVM detects it's emulating an
> + * instruction that is unlikely to be used to modify page tables, or if
> + * emulation fails, KVM can try to unprotect the gfn and let the CPU
> * re-execute the instruction that caused the page fault. Do not allow
> - * retrying MMIO emulation, as it's not only pointless but could also
> - * cause us to enter an infinite loop because the processor will keep
> - * faulting on the non-existent MMIO address. Retrying an instruction
> - * from a nested guest is also pointless and dangerous as we are only
> - * explicitly shadowing L1's page tables, i.e. unprotecting something
> - * for L1 isn't going to magically fix whatever issue cause L2 to fail.
> + * retrying an instruction from a nested guest as KVM is only explicitly
> + * shadowing L1's page tables, i.e. unprotecting something for L1 isn't
> + * going to magically fix whatever issue caused L2 to fail.
> */
> - if (!mmio_info_in_cache(vcpu, cr2_or_gpa, direct) && !is_guest_mode(vcpu))
> + if (!is_guest_mode(vcpu))
> *emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
>
> return RET_PF_EMULATE;
>
> base-commit: 7d33880356496eb0640c6c825cc60898063c4902
> --
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
2024-08-14 17:30 ` Paolo Bonzini
@ 2024-08-15 14:09 ` Sean Christopherson
2024-08-15 16:48 ` Paolo Bonzini
0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-15 14:09 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> On 8/9/24 21:03, Sean Christopherson wrote:
> > Try to unprotect shadow pages if and only if indirect_shadow_pages is non-
> > zero, i.e. iff there is at least one protected such shadow page. Pre-
> > checking indirect_shadow_pages avoids taking mmu_lock for write when the
> > gfn is write-protected by a third party, i.e. not for KVM shadow paging,
> > and in the *extremely* unlikely case that a different task has already
> > unprotected the last shadow page.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 09a42dc1fe5a..358294889baa 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2736,6 +2736,9 @@ bool kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa)
> > gpa_t gpa = cr2_or_gpa;
> > bool r;
> > + if (!vcpu->kvm->arch.indirect_shadow_pages)
> > + return false;
>
> indirect_shadow_pages is accessed without a lock, so here please add a note
> that, while it may be stale, a false negative will only cause KVM to skip
> the "unprotect and retry" optimization.
Correct, I'll add a comment.
> (This is preexisting in reexecute_instruction() and goes away in patch 18, if
> I'm pre-reading that part of the series correctly).
>
> Bonus points for opportunistically adding a READ_ONCE() here and in
> kvm_mmu_track_write().
Hmm, right, this one should have a READ_ONCE(), but I don't see any reason to
add one in kvm_mmu_track_write(). If the compiler was crazy and generated multiple
loads between the smp_mb() and write_lock(), _and_ the value transitioned from
1->0, reading '0' on the second go is totally fine because it means the last
shadow page was zapped. Amusingly, it'd actually be "better" in that it would
avoid unnecessarily taking mmu_lock.
Practically speaking, the compiler would have to be broken to generate multiple
loads in the 0->1 case, as that would mean the generated code loaded the value
but ignored the result. But even if that were to happen, a final read of '1' is
again a-ok.
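For reference, the pattern in kvm_mmu_track_write() is (from memory, so don't
quote me on the exact code) roughly:

	/* Pairs with the smp_mb() in mmu_try_to_unsync_pages(). */
	smp_mb();

	if (!vcpu->kvm->arch.indirect_shadow_pages)
		return;

	write_lock(&vcpu->kvm->mmu_lock);

i.e. the lockless read is already sandwiched between the barrier and the lock.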
This code is different because a READ_ONCE() would ensure that indirect_shadow_pages
isn't reloaded for every check. Though that too would be functionally ok, just
weird.
Obviously the READ_ONCE() would be harmless, but IMO it would be more confusing
than helpful, e.g. would beg the question of why kvm_vcpu_exit_request() doesn't
wrap vcpu->mode with READ_ONCE(). Heh, though arguably vcpu->mode should be
wrapped with READ_ONCE() since it's a helper and could be called multiple times
without any code in between that would guarantee a reload.
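From memory, that helper is just:

	static inline bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
	{
		return vcpu->mode == EXITING_GUEST_MODE ||
		       kvm_request_pending(vcpu) ||
		       xfer_to_guest_mode_work_pending();
	}

so the hypothetical change would simply be to wrap vcpu->mode in READ_ONCE().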
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 10/22] KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper
2024-08-14 17:32 ` Paolo Bonzini
@ 2024-08-15 14:10 ` Sean Christopherson
0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-15 14:10 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> On 8/9/24 21:03, Sean Christopherson wrote:
> > - if (direct &&
> > - (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> > + if (direct && (is_write_to_guest_page_table(error_code)) &&
>
> Too many parentheses. :)
>
> Maybe put it before patch 3 if we decide not to Cc: stable?
Ya, I'll do this.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
2024-08-14 17:57 ` Paolo Bonzini
@ 2024-08-15 14:25 ` Sean Christopherson
2024-08-30 23:54 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-15 14:25 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> On 8/9/24 21:03, Sean Christopherson wrote:
> > Explicitly query the list of to-be-zapped shadow pages when checking to
> > see if unprotecting a gfn for retry has succeeded, i.e. if KVM should
> > retry the faulting instruction.
> >
> > Add a comment to explain why the list needs to be checked before zapping,
> > which is the primary motivation for this change.
> >
> > No functional change intended.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 300a47801685..50695eb2ee22 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -2731,12 +2731,15 @@ bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > goto out;
> > }
> > - r = false;
> > write_lock(&kvm->mmu_lock);
> > - for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa)) {
> > - r = true;
> > + for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa))
> > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> > - }
> > +
> > + /*
> > + * Snapshot the result before zapping, as zapping will remove all list
> > + * entries, i.e. checking the list later would yield a false negative.
> > + */
>
> Hmm, the comment is kinda overkill? Maybe just
>
> /* Return whether there were sptes to zap. */
> r = !list_empty(&invalid_test);
I would strongly prefer to keep the verbose comment. I was "this" close to
removing the local variable and checking list_empty() after the commit phase.
If we made that goof, it would only show up at the worst time, i.e. when a guest
triggers retry and gets stuck. And the logical outcome of fixing such a bug
would be to add a comment to prevent it from happening again, so I say just add
the comment straightaway.
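I.e. the goof would have been (intentionally broken, purely for illustration):

	kvm_mmu_commit_zap_page(kvm, &invalid_list);

	/* Too late, the commit above already emptied invalid_list. */
	r = !list_empty(&invalid_list);

and it would "work" right up until a guest actually needed the retry.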
> I'm not sure about patch 21 - I like the simple kvm_mmu_unprotect_page()
> function.
From a code perspective, I kinda like having a separate helper too. As you
likely suspect given your below suggestion, KVM should never unprotect a gfn
without retry protection, i.e. there should never be another caller, and I want
to enforce that.
> Maybe rename it to kvm_mmu_zap_gfn() and make it static in the same patch?
kvm_mmu_zap_gfn() would be quite misleading. Unlike kvm_zap_gfn_range(), it only
zaps non-leaf shadow pages. E.g. the name would suggest that it could be used by
__kvm_set_or_clear_apicv_inhibit(), but it would do completely the wrong thing.
kvm_mmu_zap_shadow_pages() is the least awful I can come up with (it needs to be
plural because it zaps all SPs related to the gfn), but that's somewhat confusing
too since it would take in a single gfn. So I think my vote is to keep patch 21
and dodge the naming entirely.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
2024-08-15 14:09 ` Sean Christopherson
@ 2024-08-15 16:48 ` Paolo Bonzini
2024-08-28 23:28 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Paolo Bonzini @ 2024-08-15 16:48 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Thu, Aug 15, 2024 at 4:09 PM Sean Christopherson <seanjc@google.com> wrote:
> > (This is preexisting in reexecute_instruction() and goes away in patch 18, if
> > I'm pre-reading that part of the series correctly).
> >
> > Bonus points for opportunistically adding a READ_ONCE() here and in
> > kvm_mmu_track_write().
>
> Hmm, right, this one should have a READ_ONCE(), but I don't see any reason to
> add one in kvm_mmu_track_write(). If the compiler was crazy and generated multiple
> loads between the smp_mb() and write_lock(), _and_ the value transitioned from
> 1->0, reading '0' on the second go is totally fine because it means the last
> shadow page was zapped. Amusingly, it'd actually be "better" in that it would
> avoid unnecessarily taking mmu_lock.
Your call, but I have started leaning towards always using
READ_ONCE(), similar to how all atomic_t accesses are done with
atomic_read(); that is, just as much as a marker for cross-thread
lock-free accesses, in addition to limiting the compiler's
optimizations.
tools/memory-model/Documentation/access-marking.txt also suggests
using READ_ONCE() and WRITE_ONCE() always except in special cases.
They are also more friendly to KCSAN (though I have never used it).
This of course has the issue of being yet another unfinished transition.
> Obviously the READ_ONCE() would be harmless, but IMO it would be more confusing
> than helpful, e.g. would beg the question of why kvm_vcpu_exit_request() doesn't
> wrap vcpu->mode with READ_ONCE(). Heh, though arguably vcpu->mode should be
> wrapped with READ_ONCE() since it's a helper and could be called multiple times
> without any code in between that would guarantee a reload.
Indeed, who said I wouldn't change that one as well? :)
Paolo
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
2024-08-14 14:22 ` Yuan Yao
@ 2024-08-15 23:31 ` Yao Yuan
2024-08-23 0:39 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Yao Yuan @ 2024-08-15 23:31 UTC (permalink / raw)
To: Yuan Yao
Cc: Sean Christopherson, Paolo Bonzini, kvm, linux-kernel,
Peter Gonda, Michael Roth, Vishal Annapurve, Ackerly Tng
On Wed, Aug 14, 2024 at 10:22:56PM GMT, Yuan Yao wrote:
> On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> > When doing "fast unprotection" of nested TDP page tables, skip emulation
> > if and only if at least one gfn was unprotected, i.e. continue with
> > emulation if simply resuming is likely to hit the same fault and risk
> > putting the vCPU into an infinite loop.
> >
> > Note, it's entirely possible to get a false negative, e.g. if a different
> > vCPU faults on the same gfn and unprotects the gfn first, but that's a
> > relatively rare edge case, and emulating is still functionally ok, i.e.
> > the risk of putting the vCPU into an infinite loop isn't justified.
> >
> > Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> > 1 file changed, 20 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e3aa04c498ea..95058ac4b78c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > bool direct = vcpu->arch.mmu->root_role.direct;
> >
> > /*
> > - * Before emulating the instruction, check if the error code
> > - * was due to a RO violation while translating the guest page.
> > - * This can occur when using nested virtualization with nested
> > - * paging in both guests. If true, we simply unprotect the page
> > - * and resume the guest.
> > + * Before emulating the instruction, check to see if the access may be
> > + * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> > + * gfn being written is for gPTEs that KVM is shadowing and has write-
> > + * protected. Because AMD CPUs walk nested page table using a write
Hi Sean,
I just want to ask how often this happens with EPT:
PFERR_GUEST_PAGE_MASK is set when an EPT violation happens
in the middle of walking the guest CR3 page tables, and the guest
CR3 page table page is write-protected in EPT01. Are these
guest CR3 page table pages often also EPT12 page table pages?
I would think that most of the time they are data pages in the
guest CR3 tables that L1 accesses via L1 GVAs; if so,
PFERR_GUEST_FINAL_MASK should be set rather than
PFERR_GUEST_PAGE_MASK.
> > + * operation, walking NPT entries in L1 can trigger write faults even
> > + * when L1 isn't modifying PTEs, and thus result in KVM emulating an
> > + * excessive number of L1 instructions without triggering KVM's write-
> > + * flooding detection, i.e. without unprotecting the gfn.
> > + *
> > + * If the error code was due to a RO violation while translating the
> > + * guest page, the current MMU is direct (L1 is active), and KVM has
> > + * shadow pages, then the above scenario is likely being hit. Try to
> > + * unprotect the gfn, i.e. zap any shadow pages, so that L1 can walk
> > + * its NPT entries without triggering emulation. If one or more shadow
> > + * pages was zapped, skip emulation and resume L1 to let it natively
> > + * execute the instruction. If no shadow pages were zapped, then the
> > + * write-fault is due to something else entirely, i.e. KVM needs to
> > + * emulate, as resuming the guest will put it into an infinite loop.
> > */
>
> Reviewed-by: Yuan Yao <yuan.yao@intel.com>
>
> > if (direct &&
> > - (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> > - kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> > + (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE &&
> > + kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
> > return RET_PF_FIXED;
> > - }
> >
> > /*
> > * The gfn is write-protected, but if emulation fails we can still
> > --
> > 2.46.0.76.ge559c4bf1a-goog
> >
> >
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
2024-08-15 23:31 ` Yao Yuan
@ 2024-08-23 0:39 ` Sean Christopherson
2024-08-23 23:46 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-23 0:39 UTC (permalink / raw)
To: Yao Yuan
Cc: Yuan Yao, Paolo Bonzini, kvm, linux-kernel, Peter Gonda,
Michael Roth, Vishal Annapurve, Ackerly Tng
+Tom
On Fri, Aug 16, 2024, Yao Yuan wrote:
> On Wed, Aug 14, 2024 at 10:22:56PM GMT, Yuan Yao wrote:
> > On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> > > When doing "fast unprotection" of nested TDP page tables, skip emulation
> > > if and only if at least one gfn was unprotected, i.e. continue with
> > > emulation if simply resuming is likely to hit the same fault and risk
> > > putting the vCPU into an infinite loop.
> > >
> > > Note, it's entirely possible to get a false negative, e.g. if a different
> > > vCPU faults on the same gfn and unprotects the gfn first, but that's a
> > > relatively rare edge case, and emulating is still functionally ok, i.e.
> > > the risk of putting the vCPU into an infinite loop isn't justified.
> > >
> > > Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> > > 1 file changed, 20 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e3aa04c498ea..95058ac4b78c 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > bool direct = vcpu->arch.mmu->root_role.direct;
> > >
> > > /*
> > > - * Before emulating the instruction, check if the error code
> > > - * was due to a RO violation while translating the guest page.
> > > - * This can occur when using nested virtualization with nested
> > > - * paging in both guests. If true, we simply unprotect the page
> > > - * and resume the guest.
> > > + * Before emulating the instruction, check to see if the access may be
> > > + * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> > > + * gfn being written is for gPTEs that KVM is shadowing and has write-
> > > + * protected. Because AMD CPUs walk nested page table using a write
>
> Hi Sean,
>
> I Just want to consult how often of this on EPT:
>
> The PFERR_GUEST_PAGE_MASK is set when EPT violation happens
> in middle of walking the guest CR3 page table, and the guest
> CR3 page table page is write-protected on EPT01, are these
> guest CR3 page table pages also are EPT12 page table pages
> often? I just think most of time they should be data page
> on guest CR3 table for L1 to access them by L1 GVA, if so
> the PFERR_GUEST_FINAL_MASK should be set but not
> PFERR_GUEST_PAGE_MASK.
Hmm, now I'm confused too. I thought this was handling the case where L1 is
accessing EPT12/NPT12 entries, but as you say, that doesn't make sense because
those accesses would be PFERR_GUEST_FINAL_MASK. And the EPT12/NPT12 entries
should never be used by hardware.
The only thing that I can think of is if L1 stops using the pages for EPT/NPT
entries, and instead reallocates them for IA32 page tables. But that doesn't
make much sense either, because KVM's write-flooding detection should kick in
when L1 repurposes the page for its own page tables, before L1 actually starts
using the page tables.
I'll try to reproduce the excessive emulation that this code is handling, because
I think I'm missing something too.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
2024-08-23 0:39 ` Sean Christopherson
@ 2024-08-23 23:46 ` Sean Christopherson
2024-08-26 20:28 ` Sean Christopherson
0 siblings, 1 reply; 53+ messages in thread
From: Sean Christopherson @ 2024-08-23 23:46 UTC (permalink / raw)
To: Yao Yuan
Cc: Yuan Yao, Paolo Bonzini, kvm, linux-kernel, Peter Gonda,
Michael Roth, Vishal Annapurve, Ackerly Tng
On Thu, Aug 22, 2024, Sean Christopherson wrote:
> On Fri, Aug 16, 2024, Yao Yuan wrote:
> > On Wed, Aug 14, 2024 at 10:22:56PM GMT, Yuan Yao wrote:
> > > On Fri, Aug 09, 2024 at 12:03:01PM -0700, Sean Christopherson wrote:
> > > > When doing "fast unprotection" of nested TDP page tables, skip emulation
> > > > if and only if at least one gfn was unprotected, i.e. continue with
> > > > emulation if simply resuming is likely to hit the same fault and risk
> > > > putting the vCPU into an infinite loop.
> > > >
> > > > Note, it's entirely possible to get a false negative, e.g. if a different
> > > > vCPU faults on the same gfn and unprotects the gfn first, but that's a
> > > > relatively rare edge case, and emulating is still functionally ok, i.e.
> > > > the risk of putting the vCPU into an infinite loop isn't justified.
> > > >
> > > > Fixes: 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes")
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 28 ++++++++++++++++++++--------
> > > > 1 file changed, 20 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e3aa04c498ea..95058ac4b78c 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -5967,17 +5967,29 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > > bool direct = vcpu->arch.mmu->root_role.direct;
> > > >
> > > > /*
> > > > - * Before emulating the instruction, check if the error code
> > > > - * was due to a RO violation while translating the guest page.
> > > > - * This can occur when using nested virtualization with nested
> > > > - * paging in both guests. If true, we simply unprotect the page
> > > > - * and resume the guest.
> > > > + * Before emulating the instruction, check to see if the access may be
> > > > + * due to L1 accessing nested NPT/EPT entries used for L2, i.e. if the
> > > > + * gfn being written is for gPTEs that KVM is shadowing and has write-
> > > > + * protected. Because AMD CPUs walk nested page table using a write
> >
> > Hi Sean,
> >
> > I Just want to consult how often of this on EPT:
> >
> > The PFERR_GUEST_PAGE_MASK is set when EPT violation happens
> > in middle of walking the guest CR3 page table, and the guest
> > CR3 page table page is write-protected on EPT01, are these
> > guest CR3 page table pages also are EPT12 page table pages
> > often? I just think most of time they should be data page
> > on guest CR3 table for L1 to access them by L1 GVA, if so
> > the PFERR_GUEST_FINAL_MASK should be set but not
> > PFERR_GUEST_PAGE_MASK.
>
> Hmm, now I'm confused too. I thought this was handling the case where L1 is
> accessing EPT12/NPT12 entries, but as you say, that doesn't make sense because
> those accesses would be PFERR_GUEST_FINAL_MASK. And the EPT12/NPT12 entries
> should never be used by hardware.
>
> The only thing that I can think of is if L1 stops using the pages for EPT/NPT
> entries, and instead reallocates them for IA32 page tables. But that doesn't
> make much sense either, because KVM's write-flooding detection should kick in
> when L1 repurposes the page for its own page tables, before L1 actually starts
> using the page tables.
>
> I'll try to reproduce the excessive emulation that this code is handling, because
> I think I'm missing something too.
*sigh*
After poking for an hour and never being able to hit that error code, _period_,
even without the vcpu->arch.mmu->root_role.direct guard, I came to the conclusion
that the only way to hit this is if L1 is reusing its own page tables for L2.
And then I finally wised up and went hunting on lore, and indeed that appears to
be what this is trying to handle[1]:
: I think this patch is necessary for functional reasons (not just perf),
: because we added the other patch to look at the GPA and stop walking the
: guest page tables on a NPF.
:
: The issue I think was that hardware has taken an NPF because the page
: table is marked RO, and it saves the GPA in the VMCB. KVM was then
: going and emulating the instruction and it saw that a GPA was available.
: But that GPA was not the GPA of the instruction it is emulating, since
: it was the GPA of the tablewalk page that had the fault. It was debugged
: that at the time and realized that emulating the instruction was
: unnecessary so we added this new code in there which fixed the functional
: issue and helps perf.
:
: I don't have any data on how much perf, as I recall it was most effective
: when the L1 guest page tables and L2 nested page tables were exactly the
: same. In that case, it avoided emulations for code that L1 executes which
: I think could be as much as one emulation per 4kb code page.
To muddy the waters more, the vcpu->arch.mmu->root_role.direct check was a proactive
suggestion from Paolo[2], i.e. not in response to an actual failure that someone
encountered.
Further adding to the confusion was Paolo's commit that extended the behavior
to Intel CPUs. The AuthorDate vs. CommitDate is particularly interesting, as
it suggests that Paolo wrote the patch in response to the SVM series being
posted:
Subject: [PATCH v2 1/3] kvm: svm: Add support for additional SVM NPF error codes
Date: Wed, 23 Nov 2016 12:01:38 -0500 [thread overview]
commit eebed2438923f8df465c27f8fa41303771fdb2e8
Author: Paolo Bonzini <pbonzini@redhat.com>
AuthorDate: Mon Nov 28 14:39:58 2016 +0100
and then committed after asking many of the same questions we just asked, with
Brijesh's reply[1] coming just a few days before:
Subject: Re: [PATCH v2 1/3] kvm: svm: Add support for additional SVM NPF error codes
Date: Wed, 2 Aug 2017 12:42:20 +0200 [thread overview]
Commit: Paolo Bonzini <pbonzini@redhat.com>
CommitDate: Thu Aug 10 16:44:04 2017 +0200
kvm: nVMX: Add support for fast unprotection of nested guest page tables
This is the same as commit 147277540bbc ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.
Subject: [PATCH v2 1/3] kvm: svm: Add support for additional SVM NPF error codes
Date: Wed, 23 Nov 2016 12:01:38 -0500 [thread overview]
Intel support is especially misleading, because sharing page tables between EPT
and IA32 is rather nonsensical due to them having different formats. I.e. I doubt
Paolo had a use case for the VMX changes, and was just providing parity with SVM.
Of course, reusing L1's page tables as the NPT tables for L2 is quite crazy too,
but at least the PTE formats are identical.
Given that the patches were originally posted as part of the SEV enabling series[3],
my best guess is that the AMD folks were doing some prototyping or emulation
hackery that involved running a nested guest by simply reusing CR3 as hCR3.
So, I'll rewrite the comment to make it clear that practically speaking, this
scenario can be hit if and only if L1 is reusing its own page tables for L2's NPT.
[1] https://lore.kernel.org/all/f8a2c258-0b53-b33c-1dae-a6f6e0e68239@amd.com
[2] https://lore.kernel.org/all/21b9f4db-f929-80f6-6ad2-6fa3b77f82c0@redhat.com
[3] https://lore.kernel.org/all/147190822443.9523.7814744422402462127.stgit__43469.155036337$1471909238$gmane$org@brijesh-build-machine
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected
2024-08-23 23:46 ` Sean Christopherson
@ 2024-08-26 20:28 ` Sean Christopherson
0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-26 20:28 UTC (permalink / raw)
To: Yao Yuan
Cc: Yuan Yao, Paolo Bonzini, kvm, linux-kernel, Peter Gonda,
Michael Roth, Vishal Annapurve, Ackerly Tng
On Fri, Aug 23, 2024, Sean Christopherson wrote:
> On Thu, Aug 22, 2024, Sean Christopherson wrote:
> Intel support is especially misleading, because sharing page tables between EPT
> and IA32 is rather nonsensical due to them having different formats. I.e. I doubt
> Paolo had a use case for the VMX changes, and was just providing parity with SVM.
> Of course, reusing L1's page tables as the NPT tables for L2 is quite crazy too,
Actually, it's not _that_ crazy, e.g. KVM s390 does this for last-level page tables
so that changes to the host userspace mappings don't require mmu_notifier-induced
changes in KVM.
> but at least the PTE formats are identical.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path
2024-08-14 17:43 ` Paolo Bonzini
@ 2024-08-26 21:52 ` Sean Christopherson
0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-26 21:52 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> On 8/9/24 21:03, Sean Christopherson wrote:
> > @@ -6037,8 +6018,15 @@ static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > * execute the instruction. If no shadow pages were zapped, then the
> > * write-fault is due to something else entirely, i.e. KVM needs to
> > * emulate, as resuming the guest will put it into an infinite loop.
> > + *
> > + * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
> > + * unprotect the gfn and retry if an event is awaiting reinjection. If
> > + * KVM emulates multiple instructions before completing event injection,
> > + * the event could be delayed beyond what is architecturally allowed,
> > + * e.g. KVM could inject an IRQ after the TPR has been raised.
>
> This paragraph should go before the description of
> kvm_mmu_unprotect_gfn_and_retry:
Hmm, I disagree. The comment ends up being disconnected from the code, especially
by the end of the series. E.g. when reading kvm_mmu_write_protect_fault(), someone
would have to jump twice (to kvm_mmu_unprotect_gfn_and_retry and then
__kvm_mmu_unprotect_gfn_and_retry()) in order to understand the checks buried
in kvm_mmu_write_protect_fault().
And the comment also becomes stale when kvm_mmu_unprotect_gfn_and_retry() is
used by x86_emulate_instruction(). That's obviously solvable by extending the
function comment, but then we end up with a rather massive function comment that
documents its callers, not the function itself.
>
> * There are two cases in which we try to unprotect the page here
> * preemptively, i.e. zap any shadow pages, before emulating the
> * instruction.
> *
> * First, the access may be due to L1 accessing nested NPT/EPT entries
> * used for L2, i.e. if the gfn being written is for gPTEs that KVM is
> * shadowing and has write-protected. In this case, because AMD CPUs
> * walk nested page table using a write operation, walking NPT entries
> * in L1 can trigger write faults even when L1 isn't modifying PTEs.
> * KVM would then emulate an excessive number of L1 instructions
> * without triggering KVM's write-flooding detection, i.e. without
> * unprotecting the gfn. This is detected as a RO violation while
> * translating the guest page when the current MMU is direct.
> *
> * Second, for indirect MMUs, i.e. if KVM is shadowing the current MMU,
> * unprotect the gfn and reenter the guest if an event is awaiting
> * reinjection. If KVM emulates multiple instructions before completing
> * event injection, the event could be delayed beyond what is
> * architecturally allowed, e.g. KVM could inject an IRQ after the
> * TPR has been raised.
> *
> * In both cases, if one or more shadow pages were zapped, skip
> * emulation and resume L1 to let it natively execute the instruction.
> * If no shadow pages were zapped, then the write-fault is due to
> * something else entirely and KVM needs to emulate, as resuming
> * the guest will put it into an infinite loop.
>
> Thanks,
>
> Paolo
>
> > */
> > - if (direct && (is_write_to_guest_page_table(error_code)) &&
> > + if (((direct && is_write_to_guest_page_table(error_code)) ||
> > + (!direct && kvm_event_needs_reinjection(vcpu))) &&
> > kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
> > return RET_PF_FIXED;
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs
2024-08-15 16:48 ` Paolo Bonzini
@ 2024-08-28 23:28 ` Sean Christopherson
0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-28 23:28 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Thu, Aug 15, 2024, Paolo Bonzini wrote:
> On Thu, Aug 15, 2024 at 4:09 PM Sean Christopherson <seanjc@google.com> wrote:
> > > (This is preexisting in reexecute_instruction() and goes away in patch 18, if
> > > I'm pre-reading that part of the series correctly).
> > >
> > > Bonus points for opportunistically adding a READ_ONCE() here and in
> > > kvm_mmu_track_write().
> >
> > Hmm, right, this one should have a READ_ONCE(), but I don't see any reason to
> > add one in kvm_mmu_track_write(). If the compiler was crazy and generated multiple
> > loads between the smp_mb() and write_lock(), _and_ the value transitioned from
> > 1->0, reading '0' on the second go is totally fine because it means the last
> > shadow page was zapped. Amusingly, it'd actually be "better" in that it would
> > avoid unnecessarily taking mmu_lock.
>
> Your call, but I have started leaning towards always using
> READ_ONCE(), similar to how all atomic_t accesses are done with
> atomic_read(); that is, just as much as a marker for cross-thread
> lock-free accesses, in addition to limiting the compiler's
> optimizations.
>
> tools/memory-model/Documentation/access-marking.txt also suggests
> using READ_ONCE() and WRITE_ONCE() always except in special cases.
> They are also more friendly to KCSAN (though I have never used it).
>
> This of course has the issue of being yet another unfinished transition.
I opted to fix the kvm_vcpu_exit_request() case[*], and add the READ_ONCE() to
this patch, but left kvm_mmu_track_write() as-is.
My reasoning, and what I think makes for a decent policy, is that while I 100%
agree lockless accesses need _some_ form of protection/documentation, I think
adding READ_ONCE() (and WRITE_ONCE()) on top adds confusion and makes the actual
requirement unclear.
In other words, if there's already an smp_rmb() or smp_wmb() (or similar), then
> don't add READ/WRITE_ONCE() (unless that's also necessary for some reason) because
doing so detracts from the barriers that are actually necessary.
[*] https://lore.kernel.org/all/20240828232013.768446-1-seanjc@google.com
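For reference, the shape of the check in question, heavily abbreviated
(zap_sps_for_gfn() is a stand-in for the real zap loop, not an actual function):

	bool r;

	if (!READ_ONCE(kvm->arch.indirect_shadow_pages))
		return false;

	write_lock(&kvm->mmu_lock);
	r = zap_sps_for_gfn(kvm, gpa_to_gfn(gpa));	/* stand-in for the real zap loop */
	write_unlock(&kvm->mmu_lock);

	return r;

The READ_ONCE() just marks the cross-thread, lock-free read; any ordering that is
actually required still comes from the existing barriers and mmu_lock.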
> > Obviously the READ_ONCE() would be harmless, but IMO it would be more confusing
> > than helpful, e.g. would beg the question of why kvm_vcpu_exit_request() doesn't
> > wrap vcpu->mode with READ_ONCE(). Heh, though arguably vcpu->mode should be
> > wrapped with READ_ONCE() since it's a helper and could be called multiple times
> > without any code in between that would guarantee a reload.
>
> Indeed, who said I wouldn't change that one as well? :)
>
> Paolo
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
2024-08-15 14:25 ` Sean Christopherson
@ 2024-08-30 23:54 ` Sean Christopherson
0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2024-08-30 23:54 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-kernel, Peter Gonda, Michael Roth, Vishal Annapurve,
Ackerly Tng
On Thu, Aug 15, 2024, Sean Christopherson wrote:
> On Wed, Aug 14, 2024, Paolo Bonzini wrote:
> > On 8/9/24 21:03, Sean Christopherson wrote:
> > > Explicitly query the list of to-be-zapped shadow pages when checking to
> > > see if unprotecting a gfn for retry has succeeded, i.e. if KVM should
> > > retry the faulting instruction.
> > >
> > > Add a comment to explain why the list needs to be checked before zapping,
> > > which is the primary motivation for this change.
> > >
> > > No functional change intended.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 11 +++++++----
> > > 1 file changed, 7 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 300a47801685..50695eb2ee22 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2731,12 +2731,15 @@ bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > > goto out;
> > > }
> > > - r = false;
> > > write_lock(&kvm->mmu_lock);
> > > - for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa)) {
> > > - r = true;
> > > + for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa))
> > > kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> > > - }
> > > +
> > > + /*
> > > + * Snapshot the result before zapping, as zapping will remove all list
> > > + * entries, i.e. checking the list later would yield a false negative.
> > > + */
> >
> > Hmm, the comment is kinda overkill? Maybe just
> >
> > /* Return whether there were sptes to zap. */
> > r = !list_empty(&invalid_list);
>
> I would strongly prefer to keep the verbose comment. I was "this" close to
> removing the local variable and checking list_empty() after the commit phase.
> If we made that goof, it would only show up at the worst time, i.e. when a guest
> triggers retry and gets stuck. And the logical outcome of fixing such a bug
> would be to add a comment to prevent it from happening again, so I say just add
> the comment straightaway.
>
> > I'm not sure about patch 21 - I like the simple kvm_mmu_unprotect_page()
> > function.
>
> From a code perspective, I kinda like having a separate helper too. As you
> likely suspect given your below suggestion, KVM should never unprotect a gfn
> without retry protection, i.e. there should never be another caller, and I want
> to enforce that.
Oh, another argument for eliminating the separate helper is that having a separate
helper makes it really hard to write a comment for why reading indirect_shadow_pages
outside of mmu_lock is ok (it reads/looks weird if mmu_lock is taken in a different
helper).
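For completeness, the tail of __kvm_mmu_unprotect_gfn_and_retry() as it should read
with the snapshot taken before the commit phase (reconstructed from the diff quoted
above, so a sketch rather than the final code):

	write_lock(&kvm->mmu_lock);
	for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa))
		kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

	/*
	 * Snapshot the result before zapping, as zapping will remove all list
	 * entries, i.e. checking the list later would yield a false negative.
	 */
	r = !list_empty(&invalid_list);

	kvm_mmu_commit_zap_page(kvm, &invalid_list);
	write_unlock(&kvm->mmu_lock);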
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
2024-08-09 19:02 ` [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) Sean Christopherson
2024-08-14 16:31 ` Paolo Bonzini
@ 2025-12-03 13:04 ` Naveen N Rao
2025-12-23 16:06 ` Sean Christopherson
1 sibling, 1 reply; 53+ messages in thread
From: Naveen N Rao @ 2025-12-03 13:04 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Peter Gonda, Michael Roth,
Vishal Annapurve, Ackerly Tng, Nikunj A Dadhania, Tom Lendacky
Hi Sean,
On Fri, Aug 09, 2024 at 12:02:58PM -0700, Sean Christopherson wrote:
> Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
> directly emulate instructions for ES/SNP, and instead the guest must
> explicitly request emulation. Unless the guest explicitly requests
> emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
> SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
>
> But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
> because except for ES/SNP, doing so requires setting reserved bits in the
> SPTE, i.e. the SPTE can't be readable while also generating a #VC on
> writes. Because KVM never creates MMIO SPTEs and jumps directly to
> emulation, the guest never gets a #VC. And since KVM simply resumes the
> guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
> into an infinite #NPF loop if the vCPU attempts to write read-only memory.
>
> Disallow read-only memory for all VMs with protected state, i.e. for
> upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
> to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> protections and SUPPRESS_VE=0. But there is no strong use case for
> supporting read-only memslots on TDX, e.g. the main historical usage is
> to emulate option ROMs, but TDX disallows executing from shared memory.
> And if someone comes along with a legitimate, strong use case, the
> restriction can always be lifted for TDX.
>
> Don't bother trying to retroactively apply the restriction to SEV-ES
> VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
> possibly work for SEV-ES, i.e. disallowing such memslots really just
> means reporting an error to userspace instead of silently hanging vCPUs.
> Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> isn't worth the marginal benefit it would provide userspace.
>
> Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
> Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
> Cc: Peter Gonda <pgonda@google.com>
> Cc: Michael Roth <michael.roth@amd.com>
> Cc: Vishal Annapurve <vannapurve@google.com>
> Cc: Ackerly Tng <ackerleytng@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> include/linux/kvm_host.h | 7 +++++++
> virt/kvm/kvm_main.c | 5 ++---
> 3 files changed, 11 insertions(+), 3 deletions(-)
As discussed in one of the previous PUCK calls, this is causing Qemu to
throw an error when trying to enable debug-swap for a SEV-ES guest when
using a pflash drive for OVMF. Sample qemu invocation (*):
qemu-system-x86_64 ... \
-drive if=pflash,format=raw,unit=0,file=/path/to/OVMF_CODE.fd,readonly=on \
-drive if=pflash,format=raw,unit=1,file=/path/to/OVMF_VARS.fd \
-machine q35,confidential-guest-support=sev0 \
-object sev-guest,id=sev0,policy=0x5,cbitpos=51,reduced-phys-bits=1,debug-swap=on
This is expected since enabling debug-swap requires use of
KVM_SEV_INIT2, which implies a VM type of KVM_X86_SEV_ES_VM. However,
SEV-ES VMs that do not enable any VMSA SEV features (and are hence
KVM_X86_DEFAULT_VM type) are allowed to continue to launch though they
are also susceptible to this issue.
One of the suggestions in the call was to consider returning an error to
userspace instead. Is this close to what you had in mind:
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 73cdcbccc89e..19e27ed27e17 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -387,8 +387,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
* they can fix it by changing memory to shared, or they can
* provide a better error.
*/
- if (r == RET_PF_EMULATE && fault.is_private) {
- pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
+ if (r == RET_PF_EMULATE && (fault.is_private ||
+ (!fault.map_writable && fault.write && vcpu->arch.guest_state_protected))) {
+ if (fault.is_private)
+ pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
return -EFAULT;
}
This seems to work, though Qemu seems to think we are asking it to
convert the memory to shared (so we probably need to signal this error
some other way?):
qemu-system-x86_64: Convert non guest_memfd backed memory region (0xf0000 ,+ 0x1000) to shared
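(For illustration, roughly what the VMM sees for that exit; the field names are
from <linux/kvm.h>, and the fprintf() stands in for whatever QEMU actually does
with it:)

#include <stdbool.h>
#include <stdio.h>
#include <linux/kvm.h>

static void handle_memory_fault_exit(struct kvm_run *run)
{
	/*
	 * The exit carries only gpa/size/flags, and the only defined flag is
	 * KVM_MEMORY_EXIT_FLAG_PRIVATE, so userspace can't distinguish a
	 * private<->shared conversion request from a write that hit
	 * write-protected memory on a protected-state VM.
	 */
	bool to_private = run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

	fprintf(stderr, "memory fault: gpa=0x%llx size=0x%llx private=%d\n",
		(unsigned long long)run->memory_fault.gpa,
		(unsigned long long)run->memory_fault.size, to_private);
}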
Thoughts?
Thanks,
Naveen
--
(*) This requires below patches for Qemu, unless using IGVM:
https://lore.kernel.org/qemu-devel/cover.1761648149.git.naveen@kernel.org/
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)
2025-12-03 13:04 ` Naveen N Rao
@ 2025-12-23 16:06 ` Sean Christopherson
0 siblings, 0 replies; 53+ messages in thread
From: Sean Christopherson @ 2025-12-23 16:06 UTC (permalink / raw)
To: Naveen N Rao
Cc: Paolo Bonzini, kvm, linux-kernel, Peter Gonda, Michael Roth,
Vishal Annapurve, Ackerly Tng, Nikunj A Dadhania, Tom Lendacky
On Wed, Dec 03, 2025, Naveen N Rao wrote:
> Hi Sean,
>
> On Fri, Aug 09, 2024 at 12:02:58PM -0700, Sean Christopherson wrote:
> > Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
> > directly emulate instructions for ES/SNP, and instead the guest must
> > explicitly request emulation. Unless the guest explicitly requests
> > emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
> > SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
> >
> > But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
> > because except for ES/SNP, doing so requires setting reserved bits in the
> > SPTE, i.e. the SPTE can't be readable while also generating a #VC on
> > writes. Because KVM never creates MMIO SPTEs and jumps directly to
> > emulation, the guest never gets a #VC. And since KVM simply resumes the
> > guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
> > into an infinite #NPF loop if the vCPU attempts to write read-only memory.
> >
> > Disallow read-only memory for all VMs with protected state, i.e. for
> > upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
> > to support read-only memory, as TDX uses EPT Violation #VE to reflect the
> > fault into the guest, e.g. KVM could configure read-only SPTEs with RX
> > protections and SUPPRESS_VE=0. But there is no strong use case for
> > supporting read-only memslots on TDX, e.g. the main historical usage is
> > to emulate option ROMs, but TDX disallows executing from shared memory.
> > And if someone comes along with a legitimate, strong use case, the
> > restriction can always be lifted for TDX.
> >
> > Don't bother trying to retroactively apply the restriction to SEV-ES
> > VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
> > possibly work for SEV-ES, i.e. disallowing such memslots really just
> > means reporting an error to userspace instead of silently hanging vCPUs.
> > Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
> > isn't worth the marginal benefit it would provide userspace.
> >
> > Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES")
> > Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
> > Cc: Peter Gonda <pgonda@google.com>
> > Cc: Michael Roth <michael.roth@amd.com>
> > Cc: Vishal Annapurve <vannapurve@google.com>
> > Cc: Ackerly Tng <ackerleytng@google.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > arch/x86/include/asm/kvm_host.h | 2 ++
> > include/linux/kvm_host.h | 7 +++++++
> > virt/kvm/kvm_main.c | 5 ++---
> > 3 files changed, 11 insertions(+), 3 deletions(-)
>
> As discussed in one of the previous PUCK calls, this is causing Qemu to
> throw an error when trying to enable debug-swap for a SEV-ES guest when
> using a pflash drive for OVMF. Sample qemu invocation (*):
> qemu-system-x86_64 ... \
> -drive if=pflash,format=raw,unit=0,file=/path/to/OVMF_CODE.fd,readonly=on \
> -drive if=pflash,format=raw,unit=1,file=/path/to/OVMF_VARS.fd \
> -machine q35,confidential-guest-support=sev0 \
> -object sev-guest,id=sev0,policy=0x5,cbitpos=51,reduced-phys-bits=1,debug-swap=on
>
> This is expected since enabling debug-swap requires use of
> KVM_SEV_INIT2, which implies a VM type of KVM_X86_SEV_ES_VM. However,
> SEV-ES VMs that do not enable any VMSA SEV features (and are hence
> KVM_X86_DEFAULT_VM type) are allowed to continue to launch though they
> are also susceptible to this issue.
>
> One of the suggestions in the call was to consider returning an error to
> userspace instead. Is this close to what you had in mind:
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 73cdcbccc89e..19e27ed27e17 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -387,8 +387,10 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> * they can fix it by changing memory to shared, or they can
> * provide a better error.
> */
> - if (r == RET_PF_EMULATE && fault.is_private) {
> - pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
> + if (r == RET_PF_EMULATE && (fault.is_private ||
> + (!fault.map_writable && fault.write && vcpu->arch.guest_state_protected))) {
> + if (fault.is_private)
> + pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
> kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
> return -EFAULT;
> }
>
> This seems to work, though Qemu seems to think we are asking it to
> convert the memory to shared (so we probably need to signal this error
> some other way?):
> qemu-system-x86_64: Convert non guest_memfd backed memory region (0xf0000 ,+ 0x1000) to shared
>
> Thoughts?
The choke point would be kvm_handle_error_pfn() (see below), where the RET_PF_EMULATE
originates. But looking at all of this again, I am opposed to changing KVM's
ABI to allow KVM_MEM_READONLY for SEV-ES guests; it simply can't work. And KVM
enumerates as much.
case KVM_CAP_READONLY_MEM:
r = kvm ? kvm_arch_has_readonly_mem(kvm) : 1;
break;
More importantly, if QEMU wants to provide a not-fully-functional configuration
to allow KVM_SEV_INIT2 with pflash, QEMU can fudge around the lack of read-only
memory without KVM's assistance. It likely won't be pretty, but it's doable,
by clearing PROT_WRITE in the backing VMA that's handed to the KVM memslot.
KVM will see a normal memslot that the guest can read/execute, and if the guest
attempts to write to the memory, hva_to_pfn() will return KVM_PFN_ERR_FAULT and
kvm_handle_error_pfn() will send that out to userspace as -EFAULT.
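(A minimal userspace sketch of that idea; the path, size, GPA and slot number are
made up, and real code would need proper error reporting. Not proposed QEMU code,
just the shape of the workaround:)

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Back a "read-only" pflash region with a VMA that lacks PROT_WRITE. */
static int map_pflash_without_write(int vm_fd)
{
	const size_t size = 4 * 1024 * 1024;		/* example size */
	int fd = open("/path/to/OVMF_CODE.fd", O_RDONLY);
	void *hva;

	if (fd < 0)
		return -1;

	hva = mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	if (hva == MAP_FAILED)
		return -1;

	struct kvm_userspace_memory_region region = {
		.slot            = 1,			/* example slot */
		.flags           = 0,			/* deliberately NOT KVM_MEM_READONLY */
		.guest_phys_addr = 0xffc00000,		/* example GPA */
		.memory_size     = size,
		.userspace_addr  = (uint64_t)(uintptr_t)hva,
	};

	/* Guest writes now fail hva_to_pfn() and come back to the VMM as -EFAULT. */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}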
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f17324546900..27dc909b8225 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3493,8 +3493,12 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
* into the spte otherwise read access on readonly gfn also can
* caused mmio page fault and treat it as mmio access.
*/
- if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
+ if (fault->pfn == KVM_PFN_ERR_RO_FAULT) {
+ if (!kvm_arch_has_readonly_mem(vcpu->kvm))
+ return -EFAULT;
+
return RET_PF_EMULATE;
+ }
if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(fault->slot, fault->gfn);
^ permalink raw reply related [flat|nested] 53+ messages in thread
Thread overview: 53+ messages
2024-08-09 19:02 [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Sean Christopherson
2024-08-09 19:02 ` [PATCH 01/22] KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) Sean Christopherson
2024-08-14 16:31 ` Paolo Bonzini
2025-12-03 13:04 ` Naveen N Rao
2025-12-23 16:06 ` Sean Christopherson
2024-08-09 19:02 ` [PATCH 02/22] KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid Sean Christopherson
2024-08-14 11:11 ` Yuan Yao
2024-08-09 19:03 ` [PATCH 03/22] KVM: x86/mmu: Trigger unprotect logic only on write-protection page faults Sean Christopherson
2024-08-14 11:42 ` Yuan Yao
2024-08-14 14:21 ` Sean Christopherson
2024-08-15 8:30 ` Yuan Yao
2024-08-14 16:40 ` Paolo Bonzini
2024-08-14 19:34 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 04/22] KVM: x86/mmu: Skip emulation on page fault iff 1+ SPs were unprotected Sean Christopherson
2024-08-14 14:22 ` Yuan Yao
2024-08-15 23:31 ` Yao Yuan
2024-08-23 0:39 ` Sean Christopherson
2024-08-23 23:46 ` Sean Christopherson
2024-08-26 20:28 ` Sean Christopherson
2024-08-14 16:47 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 05/22] KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped Sean Christopherson
2024-08-09 19:03 ` [PATCH 06/22] KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip Sean Christopherson
2024-08-14 17:01 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 07/22] KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry Sean Christopherson
2024-08-09 19:03 ` [PATCH 08/22] KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path Sean Christopherson
2024-08-09 19:03 ` [PATCH 09/22] KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPs Sean Christopherson
2024-08-14 17:30 ` Paolo Bonzini
2024-08-15 14:09 ` Sean Christopherson
2024-08-15 16:48 ` Paolo Bonzini
2024-08-28 23:28 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 10/22] KVM: x86/mmu: Replace PFERR_NESTED_GUEST_PAGE with a more descriptive helper Sean Christopherson
2024-08-14 17:32 ` Paolo Bonzini
2024-08-15 14:10 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 11/22] KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction() Sean Christopherson
2024-08-09 19:03 ` [PATCH 12/22] KVM: x86: Fold retry_instruction() into x86_emulate_instruction() Sean Christopherson
2024-08-09 19:03 ` [PATCH 13/22] KVM: x86/mmu: Don't try to unprotect an INVALID_GPA Sean Christopherson
2024-08-09 19:03 ` [PATCH 14/22] KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotecting Sean Christopherson
2024-08-09 19:03 ` [PATCH 15/22] KVM: x86/mmu: Move event re-injection unprotect+retry into common path Sean Christopherson
2024-08-14 17:43 ` Paolo Bonzini
2024-08-26 21:52 ` Sean Christopherson
2024-08-09 19:03 ` [PATCH 16/22] KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation Sean Christopherson
2024-08-14 17:50 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 17/22] KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn Sean Christopherson
2024-08-14 17:53 ` Paolo Bonzini
2024-08-09 19:03 ` [PATCH 18/22] KVM: x86: Apply retry protection to "unprotect on failure" path Sean Christopherson
2024-08-09 19:03 ` [PATCH 19/22] KVM: x86: Update retry protection fields when forcing retry on emulation failure Sean Christopherson
2024-08-09 19:03 ` [PATCH 20/22] KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() Sean Christopherson
2024-08-09 19:03 ` [PATCH 21/22] KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version Sean Christopherson
2024-08-09 19:03 ` [PATCH 22/22] KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list Sean Christopherson
2024-08-14 17:57 ` Paolo Bonzini
2024-08-15 14:25 ` Sean Christopherson
2024-08-30 23:54 ` Sean Christopherson
2024-08-14 17:58 ` [PATCH 00/22] KVM: x86: Fix multiple #PF RO infinite loop bugs Paolo Bonzini