public inbox for kvm@vger.kernel.org
* [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes
@ 2024-08-01 18:34 Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 1/9] KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled Sean Christopherson
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

This applies on top of the massive "follow pfn" rework[*].  The gist is to
avoid losing accessed information, e.g. because NUMA balancing mucks with
PTEs, by preserving accessed state when KVM zaps SPTEs in response to
mmu_notifier invalidations that are for protection changes, e.g. PROT_NUMA.

RFC as I haven't done any testing to verify whether or not this has any
impact on page aging, let alone has a _positive_ impact.  Personally, I'm not
at all convinced that this is necessary outside of tests that care about
exact counts, e.g. KVM selftests.

That said, I do think patches 1-7 would be worth merging on their own.
Using A/D bits to track state even when A/D bits are disabled in hardware
is a nice cleanup.

[*] https://lore.kernel.org/all/20240726235234.228822-1-seanjc@google.com

Sean Christopherson (9):
  KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally
    enabled
  KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits
    disabled
  KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled
  KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are
    disabled
  KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE
  KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range
  KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE
    found
  KVM: Plumb mmu_notifier invalidation event type into arch code
  KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT
    changes

 arch/x86/kvm/mmu/mmu.c     |  10 ++--
 arch/x86/kvm/mmu/spte.c    |  16 ++++--
 arch/x86/kvm/mmu/spte.h    |  39 +++++--------
 arch/x86/kvm/mmu/tdp_mmu.c | 113 +++++++++++++++++++++----------------
 include/linux/kvm_host.h   |   1 +
 virt/kvm/kvm_main.c        |   1 +
 6 files changed, 99 insertions(+), 81 deletions(-)


base-commit: 93a198738e0aeb3193ca39c9f01f66060b3c4910
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/9] KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 2/9] KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled Sean Christopherson
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Add a dedicated flag to track if KVM has enabled A/D bits at the module
level, instead of inferring the state based on whether or not the MMU's
shadow_accessed_mask is non-zero.  This will allow defining and using
shadow_accessed_mask even when A/D bits aren't used by hardware.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     |  6 +++---
 arch/x86/kvm/mmu/spte.c    |  6 ++++++
 arch/x86/kvm/mmu/spte.h    | 20 +++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.c |  4 ++--
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5979eeb916cd..1e24bc4a06db 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3319,7 +3319,7 @@ static bool page_fault_can_be_fast(struct kvm *kvm, struct kvm_page_fault *fault
 	 *    by setting the Writable bit, which can be done out of mmu_lock.
 	 */
 	if (!fault->present)
-		return !kvm_ad_enabled();
+		return !kvm_ad_enabled;
 
 	/*
 	 * Note, instruction fetches and writes are mutually exclusive, ignore
@@ -3454,7 +3454,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * uses A/D bits for non-nested MMUs.  Thus, if A/D bits are
 		 * enabled, the SPTE can't be an access-tracked SPTE.
 		 */
-		if (unlikely(!kvm_ad_enabled()) && is_access_track_spte(spte))
+		if (unlikely(!kvm_ad_enabled) && is_access_track_spte(spte))
 			new_spte = restore_acc_track_spte(new_spte);
 
 		/*
@@ -5429,7 +5429,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
 	role.efer_nx = true;
 	role.smm = cpu_role.base.smm;
 	role.guest_mode = cpu_role.base.guest_mode;
-	role.ad_disabled = !kvm_ad_enabled();
+	role.ad_disabled = !kvm_ad_enabled;
 	role.level = kvm_mmu_get_tdp_level(vcpu);
 	role.direct = true;
 	role.has_4_byte_gpte = false;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 2c5650390d3b..b713a6542eeb 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -24,6 +24,8 @@ static bool __ro_after_init allow_mmio_caching;
 module_param_named(mmio_caching, enable_mmio_caching, bool, 0444);
 EXPORT_SYMBOL_GPL(enable_mmio_caching);
 
+bool __read_mostly kvm_ad_enabled;
+
 u64 __read_mostly shadow_host_writable_mask;
 u64 __read_mostly shadow_mmu_writable_mask;
 u64 __read_mostly shadow_nx_mask;
@@ -435,6 +437,8 @@ EXPORT_SYMBOL_GPL(kvm_mmu_set_me_spte_mask);
 
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
 {
+	kvm_ad_enabled		= has_ad_bits;
+
 	shadow_user_mask	= VMX_EPT_READABLE_MASK;
 	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
 	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
@@ -468,6 +472,8 @@ void kvm_mmu_reset_all_pte_masks(void)
 	u8 low_phys_bits;
 	u64 mask;
 
+	kvm_ad_enabled = true;
+
 	/*
 	 * If the CPU has 46 or less physical address bits, then set an
 	 * appropriate mask to guard against L1TF attacks. Otherwise, it is
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ef793c459b05..d722b37b7434 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -167,6 +167,15 @@ static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
 #define SHADOW_NONPRESENT_VALUE	0ULL
 #endif
 
+
+/*
+ * True if A/D bits are supported in hardware and are enabled by KVM.  When
+ * enabled, KVM uses A/D bits for all non-nested MMUs.  Because L1 can disable
+ * A/D bits in EPTP12, SP and SPTE variants are needed to handle the scenario
+ * where KVM is using A/D bits for L1, but not L2.
+ */
+extern bool __read_mostly kvm_ad_enabled;
+
 extern u64 __read_mostly shadow_host_writable_mask;
 extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
@@ -285,17 +294,6 @@ static inline bool is_ept_ve_possible(u64 spte)
 	       (spte & VMX_EPT_RWX_MASK) != VMX_EPT_MISCONFIG_WX_VALUE;
 }
 
-/*
- * Returns true if A/D bits are supported in hardware and are enabled by KVM.
- * When enabled, KVM uses A/D bits for all non-nested MMUs.  Because L1 can
- * disable A/D bits in EPTP12, SP and SPTE variants are needed to handle the
- * scenario where KVM is using A/D bits for L1, but not L2.
- */
-static inline bool kvm_ad_enabled(void)
-{
-	return !!shadow_accessed_mask;
-}
-
 static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
 {
 	return sp->role.ad_disabled;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dc153cf92a40..2b0fc601d2ce 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1072,7 +1072,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 			   struct kvm_mmu_page *sp, bool shared)
 {
-	u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
+	u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled);
 	int ret = 0;
 
 	if (shared) {
@@ -1488,7 +1488,7 @@ static bool tdp_mmu_need_write_protect(struct kvm_mmu_page *sp)
 	 * from level, so it is valid to key off any shadow page to determine if
 	 * write protection is needed for an entire tree.
 	 */
-	return kvm_mmu_page_ad_need_write_protect(sp) || !kvm_ad_enabled();
+	return kvm_mmu_page_ad_need_write_protect(sp) || !kvm_ad_enabled;
 }
 
 static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 2/9] KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 1/9] KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 3/9] KVM: x86/mmu: Set shadow_dirty_mask " Sean Christopherson
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D
bits are enabled, set shadow_accessed_mask for EPT even when A/D bits
are disabled in hardware.  This will allow using shadow_accessed_mask for
software purposes, e.g. to preserve accessed status in a non-present SPTE.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/spte.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index b713a6542eeb..cae45825617c 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -440,7 +440,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
 	kvm_ad_enabled		= has_ad_bits;
 
 	shadow_user_mask	= VMX_EPT_READABLE_MASK;
-	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
+	shadow_accessed_mask	= VMX_EPT_ACCESS_BIT;
 	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
 	shadow_nx_mask		= 0ull;
 	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 3/9] KVM: x86/mmu: Set shadow_dirty_mask for EPT even if A/D bits disabled
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 1/9] KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 2/9] KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 4/9] KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled Sean Christopherson
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Set shadow_dirty_mask to the architectural EPT Dirty bit value even if
A/D bits are disabled at the module level, i.e. even if KVM will never
enable A/D bits in hardware.  Doing so provides consistent behavior for
Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets
shadow_accessed_mask but not shadow_dirty_mask.

Functionally, this should be one big nop, as consumption of
shadow_dirty_mask is always guarded by a check that hardware A/D bits are
enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/spte.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cae45825617c..a0ff504f1e7e 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -441,7 +441,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
 
 	shadow_user_mask	= VMX_EPT_READABLE_MASK;
 	shadow_accessed_mask	= VMX_EPT_ACCESS_BIT;
-	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
+	shadow_dirty_mask	= VMX_EPT_DIRTY_BIT;
 	shadow_nx_mask		= 0ull;
 	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
 	/* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 4/9] KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
                   ` (2 preceding siblings ...)
  2024-08-01 18:34 ` [RFC PATCH 3/9] KVM: x86/mmu: Set shadow_dirty_mask " Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-05 16:49   ` David Matlack
  2024-08-01 18:34 ` [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE Sean Christopherson
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware,
i.e. propagate accessed information to SPTE.Accessed even when KVM is
doing manual tracking by making SPTEs not-present.  In addition to
eliminating a small amount of code in is_accessed_spte(), this also paves
the way for preserving Accessed information when a SPTE is zapped in
response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped
because NUMA balancing kicks in.

Note, EPT is the only flavor of paging in which A/D bits are conditionally
enabled, and the Accessed (and Dirty) bit is software-available when A/D
bits are disabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c  |  6 ++++--
 arch/x86/kvm/mmu/spte.c |  6 +++---
 arch/x86/kvm/mmu/spte.h | 11 +----------
 3 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e24bc4a06db..c8fc59fcc8e0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3454,8 +3454,10 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * uses A/D bits for non-nested MMUs.  Thus, if A/D bits are
 		 * enabled, the SPTE can't be an access-tracked SPTE.
 		 */
-		if (unlikely(!kvm_ad_enabled) && is_access_track_spte(spte))
-			new_spte = restore_acc_track_spte(new_spte);
+		if (unlikely(!kvm_ad_enabled) && is_access_track_spte(spte)) {
+			new_spte = restore_acc_track_spte(new_spte) |
+				   shadow_accessed_mask;
+		}
 
 		/*
 		 * To keep things simple, only SPTEs that are MMU-writable can
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index a0ff504f1e7e..ca1a8116de34 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -181,7 +181,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 
 	spte |= shadow_present_mask;
 	if (!prefetch)
-		spte |= spte_shadow_accessed_mask(spte);
+		spte |= shadow_accessed_mask;
 
 	/*
 	 * For simplicity, enforce the NX huge page mitigation even if not
@@ -258,7 +258,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	}
 
 	if (pte_access & ACC_WRITE_MASK)
-		spte |= spte_shadow_dirty_mask(spte);
+		spte |= shadow_dirty_mask;
 
 out:
 	if (prefetch)
@@ -367,7 +367,7 @@ u64 mark_spte_for_access_track(u64 spte)
 
 	spte |= (spte & SHADOW_ACC_TRACK_SAVED_BITS_MASK) <<
 		SHADOW_ACC_TRACK_SAVED_BITS_SHIFT;
-	spte &= ~shadow_acc_track_mask;
+	spte &= ~(shadow_acc_track_mask | shadow_accessed_mask);
 
 	return spte;
 }
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index d722b37b7434..ba7ff1dfbeb2 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -316,12 +316,6 @@ static inline bool spte_ad_need_write_protect(u64 spte)
 	return (spte & SPTE_TDP_AD_MASK) != SPTE_TDP_AD_ENABLED;
 }
 
-static inline u64 spte_shadow_accessed_mask(u64 spte)
-{
-	KVM_MMU_WARN_ON(!is_shadow_present_pte(spte));
-	return spte_ad_enabled(spte) ? shadow_accessed_mask : 0;
-}
-
 static inline u64 spte_shadow_dirty_mask(u64 spte)
 {
 	KVM_MMU_WARN_ON(!is_shadow_present_pte(spte));
@@ -355,10 +349,7 @@ static inline kvm_pfn_t spte_to_pfn(u64 pte)
 
 static inline bool is_accessed_spte(u64 spte)
 {
-	u64 accessed_mask = spte_shadow_accessed_mask(spte);
-
-	return accessed_mask ? spte & accessed_mask
-			     : !is_access_track_spte(spte);
+	return spte & shadow_accessed_mask;
 }
 
 static inline bool is_dirty_spte(u64 spte)
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
                   ` (3 preceding siblings ...)
  2024-08-01 18:34 ` [RFC PATCH 4/9] KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-05  7:20   ` Yuan Yao
  2024-08-01 18:34 ` [RFC PATCH 6/9] KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range Sean Christopherson
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Remove all flavors of A/D bits from FROZEN_SPTE so that KVM can keep A/D
bits set in SPTEs that are frozen, without getting false positives.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/spte.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ba7ff1dfbeb2..d403ecdfcb8e 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -216,15 +216,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  * should not modify the SPTE.
  *
  * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
- * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
- * vulnerability.
+ * both AMD and Intel CPUs, doesn't set any A/D bits, and doesn't set PFN bits,
+ * i.e. doesn't create a L1TF vulnerability.
  *
  * Only used by the TDP MMU.
  */
-#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
+#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x498ULL)
 
 /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
 static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(FROZEN_SPTE & (PT_ACCESSED_MASK | VMX_EPT_ACCESS_BIT)));
+static_assert(!(FROZEN_SPTE & (PT_DIRTY_MASK | VMX_EPT_DIRTY_BIT)));
 
 static inline bool is_frozen_spte(u64 spte)
 {
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 6/9] KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
                   ` (4 preceding siblings ...)
  2024-08-01 18:34 ` [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 7/9] KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found Sean Christopherson
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Skip invalid TDP MMU roots when aging a gfn range.  There is zero reason
to process invalid roots, as they by definition hold stale information.
E.g. if a root is invalid because it's from a previous memslot generation,
in the unlikely event the root has a SPTE for the gfn, then odds are good
that the gfn=>hva mapping is different, i.e. doesn't map to the hva that
is being aged by the primary MMU.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 2b0fc601d2ce..b358642890e1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1202,9 +1202,11 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 
 	/*
 	 * Don't support rescheduling, none of the MMU notifiers that funnel
-	 * into this helper allow blocking; it'd be dead, wasteful code.
+	 * into this helper allow blocking; it'd be dead, wasteful code.  Note,
+	 * this helper must NOT be used to unmap GFNs, as it processes only
+	 * valid roots!
 	 */
-	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+	for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
 		rcu_read_lock();
 
 		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 7/9] KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
                   ` (5 preceding siblings ...)
  2024-08-01 18:34 ` [RFC PATCH 6/9] KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 8/9] KVM: Plumb mmu_notifier invalidation event type into arch code Sean Christopherson
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Return immediately if a young SPTE is found when testing, but not updating,
SPTEs.  The return value is a boolean, i.e. whether there is one young SPTE
or fifty is irrelevant (ignoring the fact that it's impossible for there to
be fifty SPTEs, as KVM has a hard limit on the number of valid TDP MMU
roots).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 84 ++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b358642890e1..ac3200ce00f9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1189,35 +1189,6 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 	return flush;
 }
 
-typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
-			      struct kvm_gfn_range *range);
-
-static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
-						   struct kvm_gfn_range *range,
-						   tdp_handler_t handler)
-{
-	struct kvm_mmu_page *root;
-	struct tdp_iter iter;
-	bool ret = false;
-
-	/*
-	 * Don't support rescheduling, none of the MMU notifiers that funnel
-	 * into this helper allow blocking; it'd be dead, wasteful code.  Note,
-	 * this helper must NOT be used to unmap GFNs, as it processes only
-	 * valid roots!
-	 */
-	for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
-		rcu_read_lock();
-
-		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
-			ret |= handler(kvm, &iter, range);
-
-		rcu_read_unlock();
-	}
-
-	return ret;
-}
-
 /*
  * Mark the SPTEs range of GFNs [start, end) unaccessed and return non-zero
  * if any of the GFNs in the range have been accessed.
@@ -1226,15 +1197,10 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
  * from the clear_young() or clear_flush_young() notifier, which uses the
  * return value to determine if the page has been accessed.
  */
-static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
-			  struct kvm_gfn_range *range)
+static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
 {
 	u64 new_spte;
 
-	/* If we have a non-accessed entry we don't need to change the pte. */
-	if (!is_accessed_spte(iter->old_spte))
-		return false;
-
 	if (spte_ad_enabled(iter->old_spte)) {
 		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
 							 iter->old_spte,
@@ -1250,23 +1216,53 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 
 	trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
 				       iter->old_spte, new_spte);
-	return true;
+}
+
+static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
+					struct kvm_gfn_range *range,
+					bool test_only)
+{
+	struct kvm_mmu_page *root;
+	struct tdp_iter iter;
+	bool ret = false;
+
+	/*
+	 * Don't support rescheduling, none of the MMU notifiers that funnel
+	 * into this helper allow blocking; it'd be dead, wasteful code.  Note,
+	 * this helper must NOT be used to unmap GFNs, as it processes only
+	 * valid roots!
+	 */
+	for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		rcu_read_lock();
+
+		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
+			if (!is_accessed_spte(iter.old_spte))
+				continue;
+
+			ret = true;
+			if (test_only)
+				break;
+
+			kvm_tdp_mmu_age_spte(&iter);
+		}
+
+		rcu_read_unlock();
+
+		if (ret && test_only)
+			break;
+	}
+
+	return ret;
 }
 
 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
-}
-
-static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
-			 struct kvm_gfn_range *range)
-{
-	return is_accessed_spte(iter->old_spte);
+	return __kvm_tdp_mmu_age_gfn_range(kvm, range, false);
 }
 
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
+	return __kvm_tdp_mmu_age_gfn_range(kvm, range, true);
 }
 
 /*
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 8/9] KVM: Plumb mmu_notifier invalidation event type into arch code
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
                   ` (6 preceding siblings ...)
  2024-08-01 18:34 ` [RFC PATCH 7/9] KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-01 18:34 ` [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes Sean Christopherson
  2024-08-05 16:45 ` [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on " David Matlack
  9 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Forward the mmu_notifier invalidation event information into the arch
handler so that arch code can take different actions based on the
invalidation type.  E.g. x86 will use the information to preserve
Accessed information when zapping SPTEs because of a protection change.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 006668db9911..1fce5cf73b8e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -252,6 +252,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
 	unsigned long attributes;
+	enum mmu_notifier_event event;
 };
 
 struct kvm_gfn_range {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e279140f2425..3aa04e785d32 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -719,6 +719,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		.start		= range->start,
 		.end		= range->end,
 		.handler	= kvm_mmu_unmap_gfn_range,
+		.arg.event	= range->event,
 		.on_lock	= kvm_mmu_invalidate_begin,
 		.flush_on_ret	= true,
 		.may_block	= mmu_notifier_range_blockable(range),
-- 
2.46.0.rc1.232.g9752f9e123-goog



* [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
                   ` (7 preceding siblings ...)
  2024-08-01 18:34 ` [RFC PATCH 8/9] KVM: Plumb mmu_notifier invalidation event type into arch code Sean Christopherson
@ 2024-08-01 18:34 ` Sean Christopherson
  2024-08-05  7:59   ` Yuan Yao
  2024-08-05 16:45 ` [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on " David Matlack
  9 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-08-01 18:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel

Preserve Accessed information when zapping SPTEs in response to an
mmu_notifier protection change, e.g. if KVM is zapping SPTEs because
NUMA balancing kicked in.  KVM is not required to fully unmap the SPTE,
and the core VMA information isn't changing, i.e. the information is
still fresh and useful.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ac3200ce00f9..780f35a22c05 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
  * operation can cause a soft lockup.
  */
 static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
-			      gfn_t start, gfn_t end, bool can_yield, bool flush)
+			      gfn_t start, gfn_t end, bool can_yield,
+			      bool keep_accessed_bit, bool flush)
 {
 	struct tdp_iter iter;
 
@@ -849,17 +850,29 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
+		u64 new_spte = SHADOW_NONPRESENT_VALUE;
+
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
 			flush = false;
 			continue;
 		}
 
+		/*
+		 * Note, this will fail to clear non-present, accessed SPTEs,
+		 * but that isn't a functional problem, it can only result in
+		 * a _potential_ false positive in the unlikely scenario that
+		 * the primary MMU zaps an hva, reinstalls a new hva, and ages
+		 * the new hva, all before KVM accesses the hva.
+		 */
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
-		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+		if (keep_accessed_bit)
+			new_spte |= iter.old_spte & shadow_accessed_mask;
+
+		tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
 
 		/*
 		 * Zappings SPTEs in invalid roots doesn't require a TLB flush,
@@ -889,7 +902,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1)
-		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
+		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, false, flush);
 
 	return flush;
 }
@@ -1180,11 +1193,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush)
 {
+	bool keep_a_bit = range->arg.event == MMU_NOTIFY_PROTECTION_VMA ||
+			  range->arg.event == MMU_NOTIFY_PROTECTION_PAGE;
 	struct kvm_mmu_page *root;
 
 	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
 		flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
-					  range->may_block, flush);
+					  range->may_block, keep_a_bit, flush);
 
 	return flush;
 }
@@ -1201,7 +1216,11 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
 {
 	u64 new_spte;
 
-	if (spte_ad_enabled(iter->old_spte)) {
+	if (spte_ad_enabled(iter->old_spte) ||
+	    !is_shadow_present_pte(iter->old_spte)) {
+		KVM_MMU_WARN_ON(!is_shadow_present_pte(iter->old_spte) &&
+				iter->old_spte != (SHADOW_NONPRESENT_VALUE | shadow_accessed_mask));
+
 		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
 							 iter->old_spte,
 							 shadow_accessed_mask,
@@ -1235,7 +1254,7 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
 	for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
 		rcu_read_lock();
 
-		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
+		tdp_root_for_each_pte(iter, root, range->start, range->end) {
 			if (!is_accessed_spte(iter.old_spte))
 				continue;
 
-- 
2.46.0.rc1.232.g9752f9e123-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE
  2024-08-01 18:34 ` [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE Sean Christopherson
@ 2024-08-05  7:20   ` Yuan Yao
  2024-08-05 22:17     ` Sean Christopherson
  0 siblings, 1 reply; 19+ messages in thread
From: Yuan Yao @ 2024-08-05  7:20 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Thu, Aug 01, 2024 at 11:34:49AM -0700, Sean Christopherson wrote:
> Remove all flavors of A/D bits from FROZEN_SPTE so that KVM can keep A/D
> bits set in SPTEs that are frozen, without getting false positives.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/spte.h | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index ba7ff1dfbeb2..d403ecdfcb8e 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -216,15 +216,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>   * should not modify the SPTE.
>   *
>   * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> - * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> - * vulnerability.
> + * both AMD and Intel CPUs, doesn't set any A/D bits, and doesn't set PFN bits,
> + * i.e. doesn't create a L1TF vulnerability.
>   *
>   * Only used by the TDP MMU.
>   */
> -#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> +#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x498ULL)

Question:
Why did bit 3 and bit 4 also change from 0 to 1?
They're not part of the A/D bits for the EPT and CR3 page tables/AMD NPT:

EPT: Abit:8 Dbit:9
CR3: Abit:5 Dbit:6

>
>  /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
>  static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
> +static_assert(!(FROZEN_SPTE & (PT_ACCESSED_MASK | VMX_EPT_ACCESS_BIT)));
> +static_assert(!(FROZEN_SPTE & (PT_DIRTY_MASK | VMX_EPT_DIRTY_BIT)));
>
>  static inline bool is_frozen_spte(u64 spte)
>  {
> --
> 2.46.0.rc1.232.g9752f9e123-goog
>
>


* Re: [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes
  2024-08-01 18:34 ` [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes Sean Christopherson
@ 2024-08-05  7:59   ` Yuan Yao
  2024-08-05  9:12     ` Yuan Yao
  2024-08-07  6:41     ` Yuan Yao
  0 siblings, 2 replies; 19+ messages in thread
From: Yuan Yao @ 2024-08-05  7:59 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Thu, Aug 01, 2024 at 11:34:53AM -0700, Sean Christopherson wrote:
> Preserve Accessed information when zapping SPTEs in response to an
> mmu_notifier protection change, e.g. if KVM is zapping SPTEs because
> NUMA balancing kicked in.  KVM is not required to fully unmap the SPTE,
> and the core VMA information isn't changing, i.e. the information is
> still fresh and useful.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ac3200ce00f9..780f35a22c05 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>   * operation can cause a soft lockup.
>   */
>  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> -			      gfn_t start, gfn_t end, bool can_yield, bool flush)
> +			      gfn_t start, gfn_t end, bool can_yield,
> +			      bool keep_accessed_bit, bool flush)
>  {
>  	struct tdp_iter iter;
>
> @@ -849,17 +850,29 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
>  	rcu_read_lock();
>
>  	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> +		u64 new_spte = SHADOW_NONPRESENT_VALUE;
> +
>  		if (can_yield &&
>  		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
>  			flush = false;
>  			continue;
>  		}
>
> +		/*
> +		 * Note, this will fail to clear non-present, accessed SPTEs,
> +		 * but that isn't a functional problem, it can only result in
> +		 * a _potential_ false positive in the unlikely scenario that
> +		 * the primary MMU zaps an hva, reinstalls a new hva, and ages
> +		 * the new hva, all before KVM accesses the hva.
> +		 */
>  		if (!is_shadow_present_pte(iter.old_spte) ||
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>
> -		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> +		if (keep_accessed_bit)
> +			new_spte |= iter.old_spte & shadow_accessed_mask;
> +
> +		tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
>
>  		/*
>  		 * Zapping SPTEs in invalid roots doesn't require a TLB flush,
> @@ -889,7 +902,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
>
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1)
> -		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
> +		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, false, flush);
>
>  	return flush;
>  }
> @@ -1180,11 +1193,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>  				 bool flush)
>  {
> +	bool keep_a_bit = range->arg.event == MMU_NOTIFY_PROTECTION_VMA ||
> +			  range->arg.event == MMU_NOTIFY_PROTECTION_PAGE;
>  	struct kvm_mmu_page *root;
>
>  	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
>  		flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
> -					  range->may_block, flush);
> +					  range->may_block, keep_a_bit, flush);
>
>  	return flush;
>  }
> @@ -1201,7 +1216,11 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
>  {
>  	u64 new_spte;
>
> -	if (spte_ad_enabled(iter->old_spte)) {
> +	if (spte_ad_enabled(iter->old_spte) ||
> +	    !is_shadow_present_pte(iter->old_spte)) {
> +		KVM_MMU_WARN_ON(!is_shadow_present_pte(iter->old_spte) &&
> +				iter->old_spte != (SHADOW_NONPRESENT_VALUE | shadow_accessed_mask));

Is it possible that some SPTEs are zapped by
kvm_tdp_mmu_zap_leafs(keep_accessed_bit = false), e.g. from kvm_post_set_cr0(),
and then handled by __kvm_tdp_mmu_age_gfn_range() for aging before being
accessed by the guest again?
In that scenario the SPTE is non-present w/o the A bit set.

> +
>  		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
>  							 iter->old_spte,
>  							 shadow_accessed_mask,
> @@ -1235,7 +1254,7 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
>  	for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
>  		rcu_read_lock();
>
> -		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
> +		tdp_root_for_each_pte(iter, root, range->start, range->end) {

This also clears the A bit of non-leaf entries during aging; as I recall,
KVM didn't care about those before. Could you please explain the reason
for this?

>  			if (!is_accessed_spte(iter.old_spte))
>  				continue;
>
> --
> 2.46.0.rc1.232.g9752f9e123-goog
>
>


* Re: [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes
  2024-08-05  7:59   ` Yuan Yao
@ 2024-08-05  9:12     ` Yuan Yao
  2024-08-07  6:41     ` Yuan Yao
  1 sibling, 0 replies; 19+ messages in thread
From: Yuan Yao @ 2024-08-05  9:12 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Aug 05, 2024 at 03:59:11PM +0800, Yuan Yao wrote:
> On Thu, Aug 01, 2024 at 11:34:53AM -0700, Sean Christopherson wrote:
> > Preserve Accessed information when zapping SPTEs in response to an
> > mmu_notifier protection change, e.g. if KVM is zapping SPTEs because
> > NUMA balancing kicked in.  KVM is not required to fully unmap the SPTE,
> > and the core VMA information isn't changing, i.e. the information is
> > still fresh and useful.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 31 +++++++++++++++++++++++++------
> >  1 file changed, 25 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index ac3200ce00f9..780f35a22c05 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >   * operation can cause a soft lockup.
> >   */
> >  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > -			      gfn_t start, gfn_t end, bool can_yield, bool flush)
> > +			      gfn_t start, gfn_t end, bool can_yield,
> > +			      bool keep_accessed_bit, bool flush)
> >  {
> >  	struct tdp_iter iter;
> >
> > @@ -849,17 +850,29 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >  	rcu_read_lock();
> >
> >  	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > +		u64 new_spte = SHADOW_NONPRESENT_VALUE;
> > +
> >  		if (can_yield &&
> >  		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
> >  			flush = false;
> >  			continue;
> >  		}
> >
> > +		/*
> > +		 * Note, this will fail to clear non-present, accessed SPTEs,
> > +		 * but that isn't a functional problem, it can only result in
> > +		 * a _potential_ false positive in the unlikely scenario that
> > +		 * the primary MMU zaps an hva, reinstalls a new hva, and ages
> > +		 * the new hva, all before KVM accesses the hva.
> > +		 */
> >  		if (!is_shadow_present_pte(iter.old_spte) ||
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> >
> > -		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> > +		if (keep_accessed_bit)
> > +			new_spte |= iter.old_spte & shadow_accessed_mask;
> > +
> > +		tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
> >
> >  		/*
> >  		 * Zapping SPTEs in invalid roots doesn't require a TLB flush,
> > @@ -889,7 +902,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
> >
> >  	lockdep_assert_held_write(&kvm->mmu_lock);
> >  	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1)
> > -		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
> > +		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, false, flush);
> >
> >  	return flush;
> >  }
> > @@ -1180,11 +1193,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
> >  				 bool flush)
> >  {
> > +	bool keep_a_bit = range->arg.event == MMU_NOTIFY_PROTECTION_VMA ||
> > +			  range->arg.event == MMU_NOTIFY_PROTECTION_PAGE;
> >  	struct kvm_mmu_page *root;
> >
> >  	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
> >  		flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
> > -					  range->may_block, flush);
> > +					  range->may_block, keep_a_bit, flush);
> >
> >  	return flush;
> >  }
> > @@ -1201,7 +1216,11 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
> >  {
> >  	u64 new_spte;
> >
> > -	if (spte_ad_enabled(iter->old_spte)) {
> > +	if (spte_ad_enabled(iter->old_spte) ||
> > +	    !is_shadow_present_pte(iter->old_spte)) {
> > +		KVM_MMU_WARN_ON(!is_shadow_present_pte(iter->old_spte) &&
> > +				iter->old_spte != (SHADOW_NONPRESENT_VALUE | shadow_accessed_mask));
>
> Is it possible that some SPTEs are zapped by
> kvm_tdp_mmu_zap_leafs(keep_accessed_bit = false), e.g. from kvm_post_set_cr0(),
> and then handled by __kvm_tdp_mmu_age_gfn_range() for aging before being
> accessed by the guest again?
> In that scenario the SPTE is non-present w/o the A bit set.

No, I just overlooked that the A bit is already checked in
__kvm_tdp_mmu_age_gfn_range(), so non-accessed SPTEs will
be skipped.

>
> > +
> >  		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> >  							 iter->old_spte,
> >  							 shadow_accessed_mask,
> > @@ -1235,7 +1254,7 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
> >  	for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
> >  		rcu_read_lock();
> >
> > -		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
> > +		tdp_root_for_each_pte(iter, root, range->start, range->end) {
>
> This also clears the A bit of non-leaf entries during aging; as I recall,
> KVM didn't care about those before. Could you please explain the reason
> for this?
>
> >  			if (!is_accessed_spte(iter.old_spte))
> >  				continue;
> >
> > --
> > 2.46.0.rc1.232.g9752f9e123-goog
> >
> >
>


* Re: [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes
  2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
                   ` (8 preceding siblings ...)
  2024-08-01 18:34 ` [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes Sean Christopherson
@ 2024-08-05 16:45 ` David Matlack
  2024-08-05 20:11   ` Sean Christopherson
  9 siblings, 1 reply; 19+ messages in thread
From: David Matlack @ 2024-08-05 16:45 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Thu, Aug 1, 2024 at 11:35 AM Sean Christopherson <seanjc@google.com> wrote:
>
> This applies on top of the massive "follow pfn" rework[*].  The gist is to
> avoid losing accessed information, e.g. because NUMA balancing mucks with
> PTEs,

What do you mean by "NUMA balancing mucks with PTEs"?


* Re: [RFC PATCH 4/9] KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled
  2024-08-01 18:34 ` [RFC PATCH 4/9] KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled Sean Christopherson
@ 2024-08-05 16:49   ` David Matlack
  0 siblings, 0 replies; 19+ messages in thread
From: David Matlack @ 2024-08-05 16:49 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Thu, Aug 1, 2024 at 11:36 AM Sean Christopherson <seanjc@google.com> wrote:
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -181,7 +181,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>
>         spte |= shadow_present_mask;
>         if (!prefetch)
> -               spte |= spte_shadow_accessed_mask(spte);
> +               spte |= shadow_accessed_mask;
>
>         /*
>          * For simplicity, enforce the NX huge page mitigation even if not
> @@ -258,7 +258,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>         }
>
>         if (pte_access & ACC_WRITE_MASK)
> -               spte |= spte_shadow_dirty_mask(spte);
> +               spte |= shadow_accessed_mask;

spte |= shadow_dirty_mask;


* Re: [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes
  2024-08-05 16:45 ` [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on " David Matlack
@ 2024-08-05 20:11   ` Sean Christopherson
  0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2024-08-05 20:11 UTC (permalink / raw)
  To: David Matlack; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Aug 05, 2024, David Matlack wrote:
> On Thu, Aug 1, 2024 at 11:35 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > This applies on top of the massive "follow pfn" rework[*].  The gist is to
> > avoid losing accessed information, e.g. because NUMA balancing mucks with
> > PTEs,
> 
> What do you mean by "NUMA balancing mucks with PTEs"?

When NUMA auto-balancing is enabled, for VMAs the current task has been accessing,
the kernel will periodically change PTEs (in the primary MMU) to PROT_NONE, i.e.
make them !PRESENT.  That in turn results in mmu_notifier invalidations (usually
for the entire VMA, eventually) that cause KVM to unmap SPTEs.  If KVM doesn't
mark folios accessed when SPTEs are zapped, the NUMA-induced zapping effectively
loses the accessed information.

For non-KVM setups, NUMA balancing works quite well because the cost of the #PF
to "fix" the NUMA-induced PROT_NONE is relatively cheap, especially compared to
the long-term costs of accessing remote memory.

For KVM, the cost vs. benefit is very different, as each mmu_notifier invalidation
forces KVM to emit a remote TLB flush, i.e. the cost is much higher.  And it's
also much more feasible (in practice) to affine vCPUs to single NUMA nodes, even
if vCPUs are pinned 1:1 with pCPUs, than it is to affine a random userspace task
to a NUMA node.

Which is why I'm not terribly concerned about optimizing NUMA auto-balancing; it's
already sub-optimal for KVM.


* Re: [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE
  2024-08-05  7:20   ` Yuan Yao
@ 2024-08-05 22:17     ` Sean Christopherson
  2024-08-06  3:31       ` Yuan Yao
  0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2024-08-05 22:17 UTC (permalink / raw)
  To: Yuan Yao; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Aug 05, 2024, Yuan Yao wrote:
> On Thu, Aug 01, 2024 at 11:34:49AM -0700, Sean Christopherson wrote:
> > Remove all flavors of A/D bits from FROZEN_SPTE so that KVM can keep A/D
> > bits set in SPTEs that are frozen, without getting false positives.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/spte.h | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index ba7ff1dfbeb2..d403ecdfcb8e 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -216,15 +216,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> >   * should not modify the SPTE.
> >   *
> >   * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> > - * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> > - * vulnerability.
> > + * both AMD and Intel CPUs, doesn't set any A/D bits, and doesn't set PFN bits,
> > + * i.e. doesn't create a L1TF vulnerability.
> >   *
> >   * Only used by the TDP MMU.
> >   */
> > -#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> > +#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x498ULL)
> 
> Question:
> Why did bit 3 and bit 4 also change from 0 to 1?

Purely so that more bits are set, i.e. so that KVM doesn't rely on one or two
bits to identify frozen SPTEs.

> They're not part of the A/D bits for the EPT and CR3 page tables/AMD NPT

This is very deliberate.  The A/D bits need to be '0' when frozen, i.e. bits
5, 6, 8, and 9 must not be set in FROZEN_SPTE.

> 
> EPT: Abit:8 Dbit:9
> CR3: Abit:5 Dbit:6


* Re: [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE
  2024-08-05 22:17     ` Sean Christopherson
@ 2024-08-06  3:31       ` Yuan Yao
  0 siblings, 0 replies; 19+ messages in thread
From: Yuan Yao @ 2024-08-06  3:31 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Aug 05, 2024 at 03:17:20PM -0700, Sean Christopherson wrote:
> On Mon, Aug 05, 2024, Yuan Yao wrote:
> > On Thu, Aug 01, 2024 at 11:34:49AM -0700, Sean Christopherson wrote:
> > > Remove all flavors of A/D bits from FROZEN_SPTE so that KVM can keep A/D
> > > bits set in SPTEs that are frozen, without getting false positives.
> > >
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/spte.h | 8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > index ba7ff1dfbeb2..d403ecdfcb8e 100644
> > > --- a/arch/x86/kvm/mmu/spte.h
> > > +++ b/arch/x86/kvm/mmu/spte.h
> > > @@ -216,15 +216,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > >   * should not modify the SPTE.
> > >   *
> > >   * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
> > > - * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> > > - * vulnerability.
> > > + * both AMD and Intel CPUs, doesn't set any A/D bits, and doesn't set PFN bits,
> > > + * i.e. doesn't create a L1TF vulnerability.
> > >   *
> > >   * Only used by the TDP MMU.
> > >   */
> > > -#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> > > +#define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x498ULL)
> >
> > Question:
> > Why did bit 3 and bit 4 also change from 0 to 1?
>
> Purely so that more bits are set, i.e. so that KVM doesn't rely on one or two
> bits to identify frozen SPTEs.

Thanks for your explanation!

Please consider adding this to the commit log; it explains
why these particular non-A/D bits were selected.

>
> > They're not part of the A/D bits for the EPT and CR3 page tables/AMD NPT
>
> This is very deliberate.  The A/D bits need to be '0' when frozen, i.e. bits
> 5, 6, 8, and 9 must not be set in FROZEN_SPTE.
>
> >
> > EPT: Abit:8 Dbit:9
> > CR3: Abit:5 Dbit:6


* Re: [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes
  2024-08-05  7:59   ` Yuan Yao
  2024-08-05  9:12     ` Yuan Yao
@ 2024-08-07  6:41     ` Yuan Yao
  1 sibling, 0 replies; 19+ messages in thread
From: Yuan Yao @ 2024-08-07  6:41 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Aug 05, 2024 at 03:59:11PM +0800, Yuan Yao wrote:
> On Thu, Aug 01, 2024 at 11:34:53AM -0700, Sean Christopherson wrote:
> > Preserve Accessed information when zapping SPTEs in response to an
> > mmu_notifier protection change, e.g. if KVM is zapping SPTEs because
> > NUMA balancing kicked in.  KVM is not required to fully unmap the SPTE,
> > and the core VMA information isn't changing, i.e. the information is
> > still fresh and useful.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/tdp_mmu.c | 31 +++++++++++++++++++++++++------
> >  1 file changed, 25 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index ac3200ce00f9..780f35a22c05 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >   * operation can cause a soft lockup.
> >   */
> >  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > -			      gfn_t start, gfn_t end, bool can_yield, bool flush)
> > +			      gfn_t start, gfn_t end, bool can_yield,
> > +			      bool keep_accessed_bit, bool flush)
> >  {
> >  	struct tdp_iter iter;
> >
> > @@ -849,17 +850,29 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >  	rcu_read_lock();
> >
> >  	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > +		u64 new_spte = SHADOW_NONPRESENT_VALUE;
> > +
> >  		if (can_yield &&
> >  		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
> >  			flush = false;
> >  			continue;
> >  		}
> >
> > +		/*
> > +		 * Note, this will fail to clear non-present, accessed SPTEs,
> > +		 * but that isn't a functional problem, it can only result in
> > +		 * a _potential_ false positive in the unlikely scenario that
> > +		 * the primary MMU zaps an hva, reinstalls a new hva, and ages
> > +		 * the new hva, all before KVM accesses the hva.
> > +		 */
> >  		if (!is_shadow_present_pte(iter.old_spte) ||
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> >
> > -		tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> > +		if (keep_accessed_bit)
> > +			new_spte |= iter.old_spte & shadow_accessed_mask;
> > +
> > +		tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
> >
> >  		/*
> >  		 * Zapping SPTEs in invalid roots doesn't require a TLB flush,
> > @@ -889,7 +902,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
> >
> >  	lockdep_assert_held_write(&kvm->mmu_lock);
> >  	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1)
> > -		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
> > +		flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, false, flush);
> >
> >  	return flush;
> >  }
> > @@ -1180,11 +1193,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
> >  				 bool flush)
> >  {
> > +	bool keep_a_bit = range->arg.event == MMU_NOTIFY_PROTECTION_VMA ||
> > +			  range->arg.event == MMU_NOTIFY_PROTECTION_PAGE;
> >  	struct kvm_mmu_page *root;
> >
> >  	__for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
> >  		flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
> > -					  range->may_block, flush);
> > +					  range->may_block, keep_a_bit, flush);
> >
> >  	return flush;
> >  }
> > @@ -1201,7 +1216,11 @@ static void kvm_tdp_mmu_age_spte(struct tdp_iter *iter)
> >  {
> >  	u64 new_spte;
> >
> > -	if (spte_ad_enabled(iter->old_spte)) {
> > +	if (spte_ad_enabled(iter->old_spte) ||
> > +	    !is_shadow_present_pte(iter->old_spte)) {
> > +		KVM_MMU_WARN_ON(!is_shadow_present_pte(iter->old_spte) &&
> > +				iter->old_spte != (SHADOW_NONPRESENT_VALUE | shadow_accessed_mask));
>
> Is it possible that some SPTEs are zapped by
> kvm_tdp_mmu_zap_leafs(keep_accessed_bit = false), e.g. from kvm_post_set_cr0(),
> and then handled by __kvm_tdp_mmu_age_gfn_range() for aging before being
> accessed by the guest again?
> In that scenario the SPTE is non-present w/o the A bit set.
>
> > +
> >  		iter->old_spte = tdp_mmu_clear_spte_bits(iter->sptep,
> >  							 iter->old_spte,
> >  							 shadow_accessed_mask,
> > @@ -1235,7 +1254,7 @@ static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
> >  	for_each_valid_tdp_mmu_root(kvm, root, range->slot->as_id) {
> >  		rcu_read_lock();
> >
> > -		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end) {
> > +		tdp_root_for_each_pte(iter, root, range->start, range->end) {
>
> This also clears the A bit of non-leaf entries during aging; as I recall,
> KVM didn't care about those before. Could you please explain the reason
> for this?

The new __kvm_tdp_mmu_age_gfn_range() covers both aging and testing, so it
allows testing on non-present SPTEs, because the A bit is preserved there.

I was worried that, since handle_changed_spte() updates the access state
when zapping, preserving the A bit would give an inaccurate access state
to a future .test_young() if no one accesses the zapped guest page again.
But this should be addressed by patches 8 and 81 in the 'massive "follow pfn"
rework' patch set mentioned in the cover letter.

>
> >  			if (!is_accessed_spte(iter.old_spte))
> >  				continue;
> >
> > --
> > 2.46.0.rc1.232.g9752f9e123-goog
> >
> >
>


end of thread, other threads:[~2024-08-07  6:41 UTC | newest]

Thread overview: 19+ messages
2024-08-01 18:34 [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on PROT changes Sean Christopherson
2024-08-01 18:34 ` [RFC PATCH 1/9] KVM: x86/mmu: Add a dedicated flag to track if A/D bits are globally enabled Sean Christopherson
2024-08-01 18:34 ` [RFC PATCH 2/9] KVM: x86/mmu: Set shadow_accessed_mask for EPT even if A/D bits disabled Sean Christopherson
2024-08-01 18:34 ` [RFC PATCH 3/9] KVM: x86/mmu: Set shadow_dirty_mask " Sean Christopherson
2024-08-01 18:34 ` [RFC PATCH 4/9] KVM: x86/mmu: Use Accessed bit even when _hardware_ A/D bits are disabled Sean Christopherson
2024-08-05 16:49   ` David Matlack
2024-08-01 18:34 ` [RFC PATCH 5/9] KVM: x86/mmu: Free up A/D bits in FROZEN_SPTE Sean Christopherson
2024-08-05  7:20   ` Yuan Yao
2024-08-05 22:17     ` Sean Christopherson
2024-08-06  3:31       ` Yuan Yao
2024-08-01 18:34 ` [RFC PATCH 6/9] KVM: x86/mmu: Process only valid TDP MMU roots when aging a gfn range Sean Christopherson
2024-08-01 18:34 ` [RFC PATCH 7/9] KVM: x86/mmu: Stop processing TDP MMU roots for test_age if young SPTE found Sean Christopherson
2024-08-01 18:34 ` [RFC PATCH 8/9] KVM: Plumb mmu_notifier invalidation event type into arch code Sean Christopherson
2024-08-01 18:34 ` [RFC PATCH 9/9] KVM: x86/mmu: Track SPTE accessed info across mmu_notifier PROT changes Sean Christopherson
2024-08-05  7:59   ` Yuan Yao
2024-08-05  9:12     ` Yuan Yao
2024-08-07  6:41     ` Yuan Yao
2024-08-05 16:45 ` [RFC PATCH 0/9] KVM: x86/mmu: Preserve Accessed bits on " David Matlack
2024-08-05 20:11   ` Sean Christopherson
