* [PATCH v3 0/2] KVM: x86/mmu: Run NX huge page recovery under MMU read lock
@ 2024-09-06 20:45 Vipin Sharma
2024-09-06 20:45 ` [PATCH v3 1/2] KVM: x86/mmu: Track TDP MMU NX huge pages separately Vipin Sharma
2024-09-06 20:45 ` [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock Vipin Sharma
0 siblings, 2 replies; 8+ messages in thread
From: Vipin Sharma @ 2024-09-06 20:45 UTC (permalink / raw)
To: seanjc, pbonzini, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
Split NX huge page recovery into two separate flows, one for the TDP MMU
and one for the non-TDP MMU.
The TDP MMU flow uses the MMU read lock and the non-TDP MMU flow uses the
MMU write lock. This change unblocks vCPUs which are waiting for the MMU
read lock while NX huge page recovery is running and zapping MMU pages.
A Windows guest was showing network latency jitter, which was root caused
to vCPUs waiting for the MMU read lock while the NX huge page recovery
thread was holding the MMU write lock. Disabling NX huge page recovery
fixed the jitter issue.
So, to optimize NX huge page recovery, it was modified to run under the
MMU read lock. This switch made the jitter issue disappear completely, and
the time vCPUs spend waiting for the MMU read lock dropped drastically.
The commit log of patch 2 has data from the tool showing the observed
improvement.
Patch 1 splits the NX huge page tracking into two lists, one for the TDP
MMU and one for the shadow and legacy MMU. Patch 2 adds support for
running the recovery worker under the MMU read lock for TDP MMU pages. A
simplified sketch of the resulting flow is below.
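At a high level, the recovery worker ends up doing the following. This is
a simplified sketch consolidating both patches; recover() is a stand-in
for kvm_recover_nx_huge_pages(), and in the actual patches the lock choice
lives inside that helper:

	static void recover_nx_huge_pages(struct kvm *kvm)
	{
		/* Shadow/legacy MMU pages still need the exclusive lock. */
		write_lock(&kvm->mmu_lock);
		recover(kvm, &kvm->arch.possible_nx_huge_pages);
		write_unlock(&kvm->mmu_lock);

		if (tdp_mmu_enabled) {
			/* TDP MMU pages are zapped under the shared lock. */
			read_lock(&kvm->mmu_lock);
			recover(kvm, &kvm->arch.tdp_mmu_possible_nx_huge_pages);
			read_unlock(&kvm->mmu_lock);
		}
	}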
v3:
- Use pointers in track and untrack NX huge pages APIs for accounting.
- Remove #ifdefs from v2.
- Fix error in v2 where the TDP MMU flow was using
  cond_resched_rwlock_write() instead of cond_resched_rwlock_read().
- Keep common code for both TDP and non-TDP MMU logic.
- Create wrappers for TDP MMU data structures to avoid #ifdefs.
v2: https://lore.kernel.org/kvm/20240829191135.2041489-1-vipinsh@google.com/#t
- Track legacy and TDP MMU NX huge pages separately.
- Each list has its own calculation of "to_zap", i.e. the number of pages
  to zap.
- Unaccount the huge page before the dirty log check and zap logic in the
  TDP MMU recovery worker. See patch 4 of v2 for more details.
- 32 bit build issue fix.
- Sparse warning fix for comparing RCU pointer with non-RCU pointer.
(sp->spt == spte_to_child_pt())
v1: https://lore.kernel.org/kvm/20240812171341.1763297-1-vipinsh@google.com/#t
Vipin Sharma (2):
KVM: x86/mmu: Track TDP MMU NX huge pages separately
KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
arch/x86/include/asm/kvm_host.h | 13 +++-
arch/x86/kvm/mmu/mmu.c | 116 ++++++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 8 ++-
arch/x86/kvm/mmu/tdp_mmu.c | 73 ++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 6 +-
5 files changed, 164 insertions(+), 52 deletions(-)
base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
--
2.46.0.469.g59c65b2a67-goog
* [PATCH v3 1/2] KVM: x86/mmu: Track TDP MMU NX huge pages separately
2024-09-06 20:45 [PATCH v3 0/2] KVM: x86/mmu: Run NX huge page recovery under MMU read lock Vipin Sharma
@ 2024-09-06 20:45 ` Vipin Sharma
2024-10-30 14:12 ` Sean Christopherson
2024-09-06 20:45 ` [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock Vipin Sharma
1 sibling, 1 reply; 8+ messages in thread
From: Vipin Sharma @ 2024-09-06 20:45 UTC (permalink / raw)
To: seanjc, pbonzini, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
Create separate list for storing TDP MMU NX huge pages and provide
counter for it. Use this list in NX huge page recovery worker along with
the existing NX huge pages list. Use old NX huge pages list for storing
only non-TDP MMU pages and provide separate counter for it.
Separate list will allow to optimize TDP MMU NX huge page recovery in
future patches by using MMU read lock.
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/include/asm/kvm_host.h | 13 ++++++++++-
arch/x86/kvm/mmu/mmu.c | 39 +++++++++++++++++++++++----------
arch/x86/kvm/mmu/mmu_internal.h | 8 +++++--
arch/x86/kvm/mmu/tdp_mmu.c | 19 ++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 1 +
5 files changed, 61 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 950a03e0181e..0f21f9a69285 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1318,8 +1318,12 @@ struct kvm_arch {
* guarantee an NX huge page will be created in its stead, e.g. if the
* guest attempts to execute from the region then KVM obviously can't
* create an NX huge page (without hanging the guest).
+ *
+ * This list only contains shadow and legacy MMU pages. TDP MMU pages
+ * are stored separately in tdp_mmu_possible_nx_huge_pages.
*/
struct list_head possible_nx_huge_pages;
+ u64 nr_possible_nx_huge_pages;
#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
struct kvm_page_track_notifier_head track_notifier_head;
#endif
@@ -1474,7 +1478,7 @@ struct kvm_arch {
* is held in read mode:
* - tdp_mmu_roots (above)
* - the link field of kvm_mmu_page structs used by the TDP MMU
- * - possible_nx_huge_pages;
+ * - tdp_mmu_possible_nx_huge_pages
* - the possible_nx_huge_page_link field of kvm_mmu_page structs used
* by the TDP MMU
* Because the lock is only taken within the MMU lock, strictly
@@ -1483,6 +1487,13 @@ struct kvm_arch {
* the code to do so.
*/
spinlock_t tdp_mmu_pages_lock;
+
+ /*
+ * Similar to possible_nx_huge_pages list but this one stores only TDP
+ * MMU pages.
+ */
+ struct list_head tdp_mmu_possible_nx_huge_pages;
+ u64 tdp_mmu_nr_possible_nx_huge_pages;
#endif /* CONFIG_X86_64 */
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 901be9e420a4..455caaaa04f5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -857,7 +857,8 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_flush_remote_tlbs_gfn(kvm, gfn, PG_LEVEL_4K);
}
-void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *pages, u64 *nr_pages)
{
/*
* If it's possible to replace the shadow page with an NX huge page,
@@ -870,9 +871,9 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
if (!list_empty(&sp->possible_nx_huge_page_link))
return;
+ ++(*nr_pages);
++kvm->stat.nx_lpage_splits;
- list_add_tail(&sp->possible_nx_huge_page_link,
- &kvm->arch.possible_nx_huge_pages);
+ list_add_tail(&sp->possible_nx_huge_page_link, pages);
}
static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
@@ -881,7 +882,10 @@ static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
sp->nx_huge_page_disallowed = true;
if (nx_huge_page_possible)
- track_possible_nx_huge_page(kvm, sp);
+ track_possible_nx_huge_page(kvm,
+ sp,
+ &kvm->arch.possible_nx_huge_pages,
+ &kvm->arch.nr_possible_nx_huge_pages);
}
static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -900,11 +904,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
kvm_mmu_gfn_allow_lpage(slot, gfn);
}
-void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ u64 *nr_pages)
{
if (list_empty(&sp->possible_nx_huge_page_link))
return;
+ --(*nr_pages);
--kvm->stat.nx_lpage_splits;
list_del_init(&sp->possible_nx_huge_page_link);
}
@@ -913,7 +919,7 @@ static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
{
sp->nx_huge_page_disallowed = false;
- untrack_possible_nx_huge_page(kvm, sp);
+ untrack_possible_nx_huge_page(kvm, sp, &kvm->arch.nr_possible_nx_huge_pages);
}
static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
@@ -7311,9 +7317,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
return err;
}
-static void kvm_recover_nx_huge_pages(struct kvm *kvm)
+void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
+ unsigned long nr_pages)
{
- unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
struct kvm_memory_slot *slot;
int rcu_idx;
struct kvm_mmu_page *sp;
@@ -7333,9 +7339,9 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
rcu_read_lock();
ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
- to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
+ to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
for ( ; to_zap; --to_zap) {
- if (list_empty(&kvm->arch.possible_nx_huge_pages))
+ if (list_empty(pages))
break;
/*
@@ -7345,7 +7351,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
* the total number of shadow pages. And because the TDP MMU
* doesn't use active_mmu_pages.
*/
- sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
+ sp = list_first_entry(pages,
struct kvm_mmu_page,
possible_nx_huge_page_link);
WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
@@ -7417,6 +7423,12 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
: MAX_SCHEDULE_TIMEOUT;
}
+static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm)
+{
+ kvm_recover_nx_huge_pages(kvm, &kvm->arch.possible_nx_huge_pages,
+ kvm->arch.nr_possible_nx_huge_pages);
+}
+
static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
{
u64 start_time;
@@ -7438,7 +7450,10 @@ static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
if (kthread_should_stop())
return 0;
- kvm_recover_nx_huge_pages(kvm);
+ kvm_mmu_recover_nx_huge_pages(kvm);
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_recover_nx_huge_pages(kvm);
+
}
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1721d97743e9..2d2e1231996a 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -351,7 +351,11 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
-void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
-void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *pages, u64 *nr_pages);
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ u64 *nr_pages);
+void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
+ unsigned long nr_pages);
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c7dc49ee7388..9a6c26d20210 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -15,6 +15,7 @@
void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
{
INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+ INIT_LIST_HEAD(&kvm->arch.tdp_mmu_possible_nx_huge_pages);
spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
}
@@ -73,6 +74,13 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}
+void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
+{
+ kvm_recover_nx_huge_pages(kvm,
+ &kvm->arch.tdp_mmu_possible_nx_huge_pages,
+ kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
+}
+
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
{
if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
@@ -318,7 +326,7 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
sp->nx_huge_page_disallowed = false;
- untrack_possible_nx_huge_page(kvm, sp);
+ untrack_possible_nx_huge_page(kvm, sp, &kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
@@ -1162,10 +1170,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
}
if (fault->huge_page_disallowed &&
- fault->req_level >= iter.level) {
+ fault->req_level >= iter.level &&
+ sp->nx_huge_page_disallowed) {
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- if (sp->nx_huge_page_disallowed)
- track_possible_nx_huge_page(kvm, sp);
+ track_possible_nx_huge_page(kvm,
+ sp,
+ &kvm->arch.tdp_mmu_possible_nx_huge_pages,
+ &kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1b74e058a81c..510baf3eb3f1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -66,6 +66,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level);
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
u64 *spte);
+void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm);
#ifdef CONFIG_X86_64
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
--
2.46.0.469.g59c65b2a67-goog
* [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
2024-09-06 20:45 [PATCH v3 0/2] KVM: x86/mmu: Run NX huge page recovery under MMU read lock Vipin Sharma
2024-09-06 20:45 ` [PATCH v3 1/2] KVM: x86/mmu: Track TDP MMU NX huge pages separately Vipin Sharma
@ 2024-09-06 20:45 ` Vipin Sharma
2024-09-08 23:29 ` kernel test robot
2024-10-30 14:28 ` Sean Christopherson
1 sibling, 2 replies; 8+ messages in thread
From: Vipin Sharma @ 2024-09-06 20:45 UTC (permalink / raw)
To: seanjc, pbonzini, dmatlack; +Cc: kvm, linux-kernel, Vipin Sharma
Use the MMU read lock to recover TDP MMU NX huge pages. Iterate over the
huge pages list under tdp_mmu_pages_lock protection and unaccount the page
before dropping the lock.
Rename kvm_tdp_mmu_zap_sp() to kvm_tdp_mmu_zap_possible_nx_huge_page() as
there are no other users of it. Ignore the zap if any of the following
conditions is true (a condensed sketch of the checks follows this list):
- It is a root page.
- The parent is pointing to:
  - A different page table.
  - A huge page.
  - Nothing (not present).
Warn if zapping the SPTE fails and the current SPTE is still pointing to
the same page table; this should never happen.
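Condensed from the diff below, the checks in the new helper reduce to
roughly the following (iterator setup and most comments trimmed; see the
full patch for the real code):

	if (WARN_ON_ONCE(!sp->ptep))
		return false;	/* Root page, no parent entry to zap. */

	/* Bail if the parent SPTE no longer points at this page table. */
	if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
		return false;

	/* Zap atomically; failure means another task modified the SPTE. */
	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
		WARN_ON_ONCE((tdp_ptep_t)sp->spt ==
			     spte_to_child_pt(iter.old_spte, iter.level));
		return false;
	}
	return true;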
There is always a race between dirty logging, vCPU faults, and NX huge
page recovery for backing a gfn by an NX huge page or an executable small
page. Unaccounting sooner during the list traversal increases the window
of that race. Functionally, it is okay, because accounting doesn't protect
against the iTLB multi-hit bug; it is there purely to prevent KVM from
bouncing a gfn between two page sizes. The only downside is that a vCPU
will end up doing more work tearing down all the child SPTEs. This should
be a very rare race.
Zapping under the MMU read lock unblocks vCPUs which are waiting for it.
This optimization was done to solve a guest jitter issue on a Windows VM
which was observing an increase in network latency. The test workload sets
up two Windows VMs and uses the latte.exe[1] binary to run a network
latency benchmark. Running NX huge page recovery under the MMU write lock
was causing latency to increase by up to 30 ms because vCPUs were waiting
for the lock.
Running the tool on VMs using MMU read lock NX huge page recovery removed
the jitter issue completely, and the MMU lock wait time by vCPUs was also
reduced.
Command used for testing:
Server:
latte.exe -udp -a 192.168.100.1:9000 -i 10000000
Client:
latte.exe -c -udp -a 192.168.100.1:9000 -i 10000000 -hist -hl 1000 -hc 30
Output from the latency tool on client:
Before
------
Protocol UDP
SendMethod Blocking
ReceiveMethod Blocking
SO_SNDBUF Default
SO_RCVBUF Default
MsgSize(byte) 4
Iterations 10000000
Latency(usec) 69.98
CPU(%) 2.8
CtxSwitch/sec 32783 (2.29/iteration)
SysCall/sec 99948 (6.99/iteration)
Interrupt/sec 55164 (3.86/iteration)
Interval(usec) Frequency
0 9999967
1000 14
2000 0
3000 5
4000 1
5000 0
6000 0
7000 0
8000 0
9000 0
10000 0
11000 0
12000 2
13000 2
14000 4
15000 2
16000 2
17000 0
18000 1
After
-----
Protocol UDP
SendMethod Blocking
ReceiveMethod Blocking
SO_SNDBUF Default
SO_RCVBUF Default
MsgSize(byte) 4
Iterations 10000000
Latency(usec) 67.66
CPU(%) 1.6
CtxSwitch/sec 32869 (2.22/iteration)
SysCall/sec 69366 (4.69/iteration)
Interrupt/sec 50693 (3.43/iteration)
Interval(usec) Frequency
0 9999972
1000 27
2000 1
[1] https://github.com/microsoft/latte
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/kvm/mmu/mmu.c | 85 ++++++++++++++++++++++-----------
arch/x86/kvm/mmu/mmu_internal.h | 4 +-
arch/x86/kvm/mmu/tdp_mmu.c | 56 ++++++++++++++++++----
arch/x86/kvm/mmu/tdp_mmu.h | 5 +-
4 files changed, 110 insertions(+), 40 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 455caaaa04f5..fc597f66aa11 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7317,8 +7317,8 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
return err;
}
-void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
- unsigned long nr_pages)
+void kvm_recover_nx_huge_pages(struct kvm *kvm, bool shared,
+ struct list_head *pages, unsigned long nr_pages)
{
struct kvm_memory_slot *slot;
int rcu_idx;
@@ -7329,7 +7329,10 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
ulong to_zap;
rcu_idx = srcu_read_lock(&kvm->srcu);
- write_lock(&kvm->mmu_lock);
+ if (shared)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
/*
* Zapping TDP MMU shadow pages, including the remote TLB flush, must
@@ -7341,8 +7344,13 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
for ( ; to_zap; --to_zap) {
- if (list_empty(pages))
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_pages_lock(kvm);
+ if (list_empty(pages)) {
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_pages_unlock(kvm);
break;
+ }
/*
* We use a separate list instead of just using active_mmu_pages
@@ -7358,24 +7366,41 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
WARN_ON_ONCE(!sp->role.direct);
/*
- * Unaccount and do not attempt to recover any NX Huge Pages
- * that are being dirty tracked, as they would just be faulted
- * back in as 4KiB pages. The NX Huge Pages in this slot will be
- * recovered, along with all the other huge pages in the slot,
- * when dirty logging is disabled.
+ * Unaccount the shadow page before zapping its SPTE so as to
+ * avoid bouncing tdp_mmu_pages_lock more than is necessary.
+ * Clearing nx_huge_page_disallowed before zapping is safe, as
+ * the flag doesn't protect against iTLB multi-hit, it's there
+ * purely to prevent bouncing the gfn between an NX huge page
+ * and an X small page. The only downside is that a vCPU may
+ * end up doing more work tearing down all the child SPTEs,
+ * e.g. if it faults on the region and replaces the zapped
+ * shadow page with an NX huge page, but that is a rare race,
+ * i.e. shouldn't impact performance.
+ */
+ unaccount_nx_huge_page(kvm, sp);
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_pages_unlock(kvm);
+
+ /*
+ * Do not attempt to recover any NX Huge Pages that are being
+ * dirty tracked, as they would just be faulted back in as 4KiB
+ * pages. The NX Huge Pages in this slot will be recovered,
+ * along with all the other huge pages in the slot, when dirty
+ * logging is disabled.
*
* Since gfn_to_memslot() is relatively expensive, it helps to
* skip it if it the test cannot possibly return true. On the
* other hand, if any memslot has logging enabled, chances are
- * good that all of them do, in which case unaccount_nx_huge_page()
- * is much cheaper than zapping the page.
+ * good that all of them do, in which case
+ * unaccount_nx_huge_page() is much cheaper than zapping the
+ * page.
*
- * If a memslot update is in progress, reading an incorrect value
- * of kvm->nr_memslots_dirty_logging is not a problem: if it is
- * becoming zero, gfn_to_memslot() will be done unnecessarily; if
- * it is becoming nonzero, the page will be zapped unnecessarily.
- * Either way, this only affects efficiency in racy situations,
- * and not correctness.
+ * If a memslot update is in progress, reading an incorrect
+ * value of kvm->nr_memslots_dirty_logging is not a problem: if
+ * it is becoming zero, gfn_to_memslot() will be done
+ * unnecessarily; if it is becoming nonzero, the page will be
+ * zapped unnecessarily. Either way, this only affects
+ * efficiency in racy situations, and not correctness.
*/
slot = NULL;
if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
@@ -7385,20 +7410,21 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
slot = __gfn_to_memslot(slots, sp->gfn);
WARN_ON_ONCE(!slot);
}
-
- if (slot && kvm_slot_dirty_track_enabled(slot))
- unaccount_nx_huge_page(kvm, sp);
- else if (is_tdp_mmu_page(sp))
- flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
- else
- kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ if (!slot || !kvm_slot_dirty_track_enabled(slot)) {
+ if (shared)
+ flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
+ else
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ }
WARN_ON_ONCE(sp->nx_huge_page_disallowed);
if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
rcu_read_unlock();
-
- cond_resched_rwlock_write(&kvm->mmu_lock);
+ if (shared)
+ cond_resched_rwlock_read(&kvm->mmu_lock);
+ else
+ cond_resched_rwlock_write(&kvm->mmu_lock);
flush = false;
rcu_read_lock();
@@ -7408,7 +7434,10 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
rcu_read_unlock();
- write_unlock(&kvm->mmu_lock);
+ if (shared)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, rcu_idx);
}
@@ -7425,7 +7454,7 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm)
{
- kvm_recover_nx_huge_pages(kvm, &kvm->arch.possible_nx_huge_pages,
+ kvm_recover_nx_huge_pages(kvm, false, &kvm->arch.possible_nx_huge_pages,
kvm->arch.nr_possible_nx_huge_pages);
}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2d2e1231996a..e6b757c59ccc 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -355,7 +355,7 @@ void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
struct list_head *pages, u64 *nr_pages);
void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
u64 *nr_pages);
-void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
- unsigned long nr_pages);
+void kvm_recover_nx_huge_pages(struct kvm *kvm, bool shared,
+ struct list_head *pages, unsigned long nr_pages);
#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9a6c26d20210..8a6ffc150c99 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -74,9 +74,19 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
tdp_mmu_free_sp(sp);
}
+void kvm_tdp_mmu_pages_lock(struct kvm *kvm)
+{
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+}
+
+void kvm_tdp_mmu_pages_unlock(struct kvm *kvm)
+{
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+}
+
void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
{
- kvm_recover_nx_huge_pages(kvm,
+ kvm_recover_nx_huge_pages(kvm, true,
&kvm->arch.tdp_mmu_possible_nx_huge_pages,
kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
}
@@ -825,23 +835,51 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
rcu_read_unlock();
}
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp)
{
- u64 old_spte;
+ struct tdp_iter iter = {
+ .old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
+ .sptep = sp->ptep,
+ .level = sp->role.level + 1,
+ .gfn = sp->gfn,
+ .as_id = kvm_mmu_page_as_id(sp),
+ };
+
+ lockdep_assert_held_read(&kvm->mmu_lock);
+ if (WARN_ON_ONCE(!is_tdp_mmu_page(sp)))
+ return false;
/*
- * This helper intentionally doesn't allow zapping a root shadow page,
- * which doesn't have a parent page table and thus no associated entry.
+ * Root shadow pages don't a parent page table and thus no associated
+ * entry, but they can never be possible NX huge pages.
*/
if (WARN_ON_ONCE(!sp->ptep))
return false;
- old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
- if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
+ /*
+ * Since mmu_lock is held in read mode, it's possible another task has
+ * already modified the SPTE. Zap the SPTE if and only if the SPTE
+ * points at the SP's page table, as checking shadow-present isn't
+ * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
+ * another SP. Note, spte_to_child_pt() also checks that the SPTE is
+ * shadow-present, i.e. guards against zapping a frozen SPTE.
+ */
+ if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
return false;
- tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
- SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
+ /*
+ * If a different task modified the SPTE, then it should be impossible
+ * for the SPTE to still be used for the to-be-zapped SP. Non-leaf
+ * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
+ * creating non-leaf SPTEs, and all other bits are immutable for non-
+ * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
+ * zapping and replacement.
+ */
+ if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
+ WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
+ return false;
+ }
return true;
}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 510baf3eb3f1..ed4bdceb9aec 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -20,7 +20,8 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
-bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
@@ -66,6 +67,8 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
int *root_level);
u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
u64 *spte);
+void kvm_tdp_mmu_pages_lock(struct kvm *kvm);
+void kvm_tdp_mmu_pages_unlock(struct kvm *kvm);
void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm);
#ifdef CONFIG_X86_64
--
2.46.0.469.g59c65b2a67-goog
* Re: [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
2024-09-06 20:45 ` [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock Vipin Sharma
@ 2024-09-08 23:29 ` kernel test robot
2024-09-09 16:37 ` Vipin Sharma
2024-10-30 14:28 ` Sean Christopherson
1 sibling, 1 reply; 8+ messages in thread
From: kernel test robot @ 2024-09-08 23:29 UTC (permalink / raw)
To: Vipin Sharma, seanjc, pbonzini, dmatlack
Cc: llvm, oe-kbuild-all, kvm, linux-kernel, Vipin Sharma
Hi Vipin,
kernel test robot noticed the following build errors:
[auto build test ERROR on 332d2c1d713e232e163386c35a3ba0c1b90df83f]
url: https://github.com/intel-lab-lkp/linux/commits/Vipin-Sharma/KVM-x86-mmu-Track-TDP-MMU-NX-huge-pages-separately/20240907-044800
base: 332d2c1d713e232e163386c35a3ba0c1b90df83f
patch link: https://lore.kernel.org/r/20240906204515.3276696-3-vipinsh%40google.com
patch subject: [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
config: i386-randconfig-005-20240908 (https://download.01.org/0day-ci/archive/20240909/202409090949.xuOxMsJ2-lkp@intel.com/config)
compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240909/202409090949.xuOxMsJ2-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409090949.xuOxMsJ2-lkp@intel.com/
All errors (new ones prefixed by >>):
>> ld.lld: error: undefined symbol: kvm_tdp_mmu_zap_possible_nx_huge_page
>>> referenced by mmu.c:7415 (arch/x86/kvm/mmu/mmu.c:7415)
>>> arch/x86/kvm/mmu/mmu.o:(kvm_recover_nx_huge_pages) in archive vmlinux.a
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
2024-09-08 23:29 ` kernel test robot
@ 2024-09-09 16:37 ` Vipin Sharma
0 siblings, 0 replies; 8+ messages in thread
From: Vipin Sharma @ 2024-09-09 16:37 UTC (permalink / raw)
To: kernel test robot
Cc: seanjc, pbonzini, dmatlack, llvm, oe-kbuild-all, kvm,
linux-kernel
On 2024-09-09 07:29:55, kernel test robot wrote:
> Hi Vipin,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on 332d2c1d713e232e163386c35a3ba0c1b90df83f]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Vipin-Sharma/KVM-x86-mmu-Track-TDP-MMU-NX-huge-pages-separately/20240907-044800
> base: 332d2c1d713e232e163386c35a3ba0c1b90df83f
> patch link: https://lore.kernel.org/r/20240906204515.3276696-3-vipinsh%40google.com
> patch subject: [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
> config: i386-randconfig-005-20240908 (https://download.01.org/0day-ci/archive/20240909/202409090949.xuOxMsJ2-lkp@intel.com/config)
> compiler: clang version 18.1.5 (https://github.com/llvm/llvm-project 617a15a9eac96088ae5e9134248d8236e34b91b1)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240909/202409090949.xuOxMsJ2-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202409090949.xuOxMsJ2-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> ld.lld: error: undefined symbol: kvm_tdp_mmu_zap_possible_nx_huge_page
> >>> referenced by mmu.c:7415 (arch/x86/kvm/mmu/mmu.c:7415)
> >>> arch/x86/kvm/mmu/mmu.o:(kvm_recover_nx_huge_pages) in archive vmlinux.a
I missed it because the i386 command I used was from the config given in
v1 of the series by the lkp bot. That command only built the kvm
directory, so ld.lld never got invoked.
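For reference, a full build (rather than just the KVM directory) would
have caught it. Something like the following, which is my approximation of
a clang i386 build and not the exact lkp reproduce command:

	make ARCH=i386 LLVM=1 olddefconfig
	make ARCH=i386 LLVM=1 W=1 -j$(nproc)   # links vmlinux, so ld.lld runs

whereas building only the directory, e.g. "make ARCH=i386 LLVM=1
arch/x86/kvm/", compiles the objects but never links them.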
I will send out a new version after collecting feedback on this version
of the series. I am thinking of the below change to fix the error.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index ed4bdceb9aec..37620496f64a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -20,8 +20,6 @@ __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
-bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
- struct kvm_mmu_page *sp);
void kvm_tdp_mmu_zap_all(struct kvm *kvm);
void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
@@ -73,8 +71,17 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm);
#ifdef CONFIG_X86_64
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp);
#else
static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
+static inline bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp)
+{
+ WARN_ONCE(1, "TDP MMU not supported in 32bit builds");
+ return false;
+}
+
#endif
#endif /* __KVM_X86_MMU_TDP_MMU_H */
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v3 1/2] KVM: x86/mmu: Track TDP MMU NX huge pages separately
2024-09-06 20:45 ` [PATCH v3 1/2] KVM: x86/mmu: Track TDP MMU NX huge pages separately Vipin Sharma
@ 2024-10-30 14:12 ` Sean Christopherson
2024-10-30 14:25 ` Sean Christopherson
0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2024-10-30 14:12 UTC (permalink / raw)
To: Vipin Sharma; +Cc: pbonzini, dmatlack, kvm, linux-kernel
On Fri, Sep 06, 2024, Vipin Sharma wrote:
> Create separate list for storing TDP MMU NX huge pages and provide
Create a separate list...
> counter for it. Use this list in NX huge page recovery worker along with
> the existing NX huge pages list. Use old NX huge pages list for storing
> only non-TDP MMU pages and provide separate counter for it.
>
> Separate list will allow to optimize TDP MMU NX huge page recovery in
> future patches by using MMU read lock.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: David Matlack <dmatlack@google.com>
> Signed-off-by: Vipin Sharma <vipinsh@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 13 ++++++++++-
> arch/x86/kvm/mmu/mmu.c | 39 +++++++++++++++++++++++----------
> arch/x86/kvm/mmu/mmu_internal.h | 8 +++++--
> arch/x86/kvm/mmu/tdp_mmu.c | 19 ++++++++++++----
> arch/x86/kvm/mmu/tdp_mmu.h | 1 +
> 5 files changed, 61 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 950a03e0181e..0f21f9a69285 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1318,8 +1318,12 @@ struct kvm_arch {
> * guarantee an NX huge page will be created in its stead, e.g. if the
> * guest attempts to execute from the region then KVM obviously can't
> * create an NX huge page (without hanging the guest).
> + *
> + * This list only contains shadow and legacy MMU pages. TDP MMU pages
> + * are stored separately in tdp_mmu_possible_nx_huge_pages.
Hmm, ideally that would be reflected in the name. More thoughts two comments
down.
> */
> struct list_head possible_nx_huge_pages;
> + u64 nr_possible_nx_huge_pages;
Changelog says nothing about tracking the number of possible pages. This clearly
belongs in a separate patch.
> #ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
> struct kvm_page_track_notifier_head track_notifier_head;
> #endif
> @@ -1474,7 +1478,7 @@ struct kvm_arch {
> * is held in read mode:
> * - tdp_mmu_roots (above)
> * - the link field of kvm_mmu_page structs used by the TDP MMU
> - * - possible_nx_huge_pages;
> + * - tdp_mmu_possible_nx_huge_pages
> * - the possible_nx_huge_page_link field of kvm_mmu_page structs used
> * by the TDP MMU
> * Because the lock is only taken within the MMU lock, strictly
> @@ -1483,6 +1487,13 @@ struct kvm_arch {
> * the code to do so.
> */
> spinlock_t tdp_mmu_pages_lock;
> +
> + /*
> + * Similar to possible_nx_huge_pages list but this one stores only TDP
> + * MMU pages.
> + */
> + struct list_head tdp_mmu_possible_nx_huge_pages;
> + u64 tdp_mmu_nr_possible_nx_huge_pages;
These obviously come in a pair, and must be passed around as such. To make the
relevant code easier on the eyes (long lines), and to avoid passing a mismatched
pair, add a parent structure.
E.g.
struct kvm_possible_nx_huge_pages {
struct list_head list;
u64 nr_pages;
};
And then you can have
struct kvm_possible_nx_huge_pages shadow_mmu_possible_nx_huge_pages;
struct kvm_possible_nx_huge_pages tdp_mmu_possible_nx_huge_pages;
And the comments about the lists can go away, since the structures and names are
fairly self-explanatory.
Aha! An even better idea.
enum kvm_mmu_types {
KVM_SHADOW_MMU,
#ifdef CONFIG_X86_64
KVM_TDP_MMU,
#endif
KVM_NR_MMU_TYPES,
};
struct kvm_possible_nx_huge_pages possible_nx_huge_pages[KVM_NR_MMU_TYPES];
And then the line lengths aren't heinous and there's no need to trampoline in and
out of the TDP MMU.
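To make that concrete, track_possible_nx_huge_page() would then look
something like this (my sketch of the suggested shape, not code from this
series):

	void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
					 enum kvm_mmu_types mmu_type)
	{
		struct kvm_possible_nx_huge_pages *nx =
			&kvm->arch.possible_nx_huge_pages[mmu_type];

		if (!list_empty(&sp->possible_nx_huge_page_link))
			return;

		++nx->nr_pages;
		++kvm->stat.nx_lpage_splits;
		list_add_tail(&sp->possible_nx_huge_page_link, &nx->list);
	}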
> static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> @@ -881,7 +882,10 @@ static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> sp->nx_huge_page_disallowed = true;
>
> if (nx_huge_page_possible)
> - track_possible_nx_huge_page(kvm, sp);
> + track_possible_nx_huge_page(kvm,
> + sp,
No need to put "kvm" and "sp" on separate lines. And if we go with the array
approach, this becomes:
if (nx_huge_page_possible)
track_possible_nx_huge_page(kvm, sp, KVM_SHADOW_MMU);
> + &kvm->arch.possible_nx_huge_pages,
> + &kvm->arch.nr_possible_nx_huge_pages);
> }
>
> static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
> @@ -7311,9 +7317,9 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
> return err;
> }
>
> -static void kvm_recover_nx_huge_pages(struct kvm *kvm)
> +void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
> + unsigned long nr_pages)
And this becomes something like:
static void kvm_recover_nx_huge_pages(struct kvm *kvm,
enum kvm_mmu_types mmu_type)
{
struct kvm_possible_nx_huge_pages *possible_nx_pages;
struct kvm_memory_slot *slot;
int rcu_idx;
struct kvm_mmu_page *sp;
unsigned int ratio;
LIST_HEAD(invalid_list);
bool flush = false;
ulong to_zap;
possible_nx_pages = &kvm->arch.possible_nx_huge_pages[mmu_type];
> {
> - unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
> struct kvm_memory_slot *slot;
> int rcu_idx;
> struct kvm_mmu_page *sp;
> @@ -7333,9 +7339,9 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
> rcu_read_lock();
>
> ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> - to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
> + to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
> for ( ; to_zap; --to_zap) {
> - if (list_empty(&kvm->arch.possible_nx_huge_pages))
> + if (list_empty(pages))
> break;
>
> /*
> @@ -7345,7 +7351,7 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
> * the total number of shadow pages. And because the TDP MMU
> * doesn't use active_mmu_pages.
> */
> - sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
> + sp = list_first_entry(pages,
> struct kvm_mmu_page,
> possible_nx_huge_page_link);
> WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
> @@ -7417,6 +7423,12 @@ static long get_nx_huge_page_recovery_timeout(u64 start_time)
> : MAX_SCHEDULE_TIMEOUT;
> }
>
> +static void kvm_mmu_recover_nx_huge_pages(struct kvm *kvm)
> +{
> + kvm_recover_nx_huge_pages(kvm, &kvm->arch.possible_nx_huge_pages,
> + kvm->arch.nr_possible_nx_huge_pages);
> +}
> +
> static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
> {
> u64 start_time;
> @@ -7438,7 +7450,10 @@ static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
> if (kthread_should_stop())
> return 0;
>
> - kvm_recover_nx_huge_pages(kvm);
> + kvm_mmu_recover_nx_huge_pages(kvm);
> + if (tdp_mmu_enabled)
> + kvm_tdp_mmu_recover_nx_huge_pages(kvm);
And this:
for (i = KVM_SHADOW_MMU; i < KVM_NR_MMU_TYPES; i++)
kvm_recover_nx_huge_pages(kvm, i);
> #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index c7dc49ee7388..9a6c26d20210 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -15,6 +15,7 @@
> void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> {
> INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
> + INIT_LIST_HEAD(&kvm->arch.tdp_mmu_possible_nx_huge_pages);
And this copy-paste goes away.
> spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
> }
>
> @@ -73,6 +74,13 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> tdp_mmu_free_sp(sp);
> }
>
> +void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
> +{
> + kvm_recover_nx_huge_pages(kvm,
> + &kvm->arch.tdp_mmu_possible_nx_huge_pages,
> + kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
> +}
> +
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
> {
> if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
> @@ -318,7 +326,7 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> sp->nx_huge_page_disallowed = false;
> - untrack_possible_nx_huge_page(kvm, sp);
> + untrack_possible_nx_huge_page(kvm, sp, &kvm->arch.tdp_mmu_nr_possible_nx_huge_pages);
> spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> }
>
> @@ -1162,10 +1170,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> }
>
> if (fault->huge_page_disallowed &&
> - fault->req_level >= iter.level) {
> + fault->req_level >= iter.level &&
> + sp->nx_huge_page_disallowed) {
> spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> - if (sp->nx_huge_page_disallowed)
> - track_possible_nx_huge_page(kvm, sp);
And this is _exactly_ why I am so strict about the "one change per patch" rule.
commit 21a36ac6b6c7059965bac0cc73ef3cbb8ef576dd
Author: Sean Christopherson <seanjc@google.com>
AuthorDate: Tue Dec 13 03:30:28 2022 +0000
Commit: Paolo Bonzini <pbonzini@redhat.com>
CommitDate: Fri Dec 23 12:33:53 2022 -0500
KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowed
Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
when adding a new shadow page in the TDP MMU. To ensure the NX reclaim
kthread can't see a not-yet-linked shadow page, the page fault path links
the new page table prior to adding the page to possible_nx_huge_pages.
If the page is zapped by different task, e.g. because dirty logging is
disabled, between linking the page and adding it to the list, KVM can end
up triggering use-after-free by adding the zapped SP to the aforementioned
list, as the zapped SP's memory is scheduled for removal via RCU callback.
The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
i.e. the below splat is just one possible signature.
------------[ cut here ]------------
list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
Modules linked in: kvm_intel
CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #71
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__list_add_valid+0x79/0xa0
RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
FS: 00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
Call Trace:
<TASK>
track_possible_nx_huge_page+0x53/0x80
kvm_tdp_mmu_map+0x242/0x2c0
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 61f94478547b ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
Reported-by: Greg Thelen <gthelen@google.com>
Analyzed-by: David Matlack <dmatlack@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fbdef59374fe..62a687d094bb 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1214,7 +1214,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
if (fault->huge_page_disallowed &&
fault->req_level >= iter.level) {
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
- track_possible_nx_huge_page(kvm, sp);
+ if (sp->nx_huge_page_disallowed)
+ track_possible_nx_huge_page(kvm, sp);
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
}
}
* Re: [PATCH v3 1/2] KVM: x86/mmu: Track TDP MMU NX huge pages separately
2024-10-30 14:12 ` Sean Christopherson
@ 2024-10-30 14:25 ` Sean Christopherson
0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2024-10-30 14:25 UTC (permalink / raw)
To: Vipin Sharma; +Cc: pbonzini, dmatlack, kvm, linux-kernel
On Wed, Oct 30, 2024, Sean Christopherson wrote:
> On Fri, Sep 06, 2024, Vipin Sharma wrote:
> > + struct list_head tdp_mmu_possible_nx_huge_pages;
> > + u64 tdp_mmu_nr_possible_nx_huge_pages;
>
> These obviously come in a pair, and must be passed around as such. To make the
> relevant code easier on the eyes (long lines), and to avoid passing a mismatched
> pair, add a parent structure.
>
> E.g.
>
> struct kvm_possible_nx_huge_pages {
> struct list_head list;
Actually, I vote for s/list/pages, as that makes the usage in code more intuitive.
> u64 nr_pages;
> };
* Re: [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
2024-09-06 20:45 ` [PATCH v3 2/2] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock Vipin Sharma
2024-09-08 23:29 ` kernel test robot
@ 2024-10-30 14:28 ` Sean Christopherson
1 sibling, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2024-10-30 14:28 UTC (permalink / raw)
To: Vipin Sharma; +Cc: pbonzini, dmatlack, kvm, linux-kernel
On Fri, Sep 06, 2024, Vipin Sharma wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 455caaaa04f5..fc597f66aa11 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7317,8 +7317,8 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
> return err;
> }
>
> -void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
> - unsigned long nr_pages)
> +void kvm_recover_nx_huge_pages(struct kvm *kvm, bool shared,
> + struct list_head *pages, unsigned long nr_pages)
> {
> struct kvm_memory_slot *slot;
> int rcu_idx;
> @@ -7329,7 +7329,10 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
> ulong to_zap;
>
> rcu_idx = srcu_read_lock(&kvm->srcu);
> - write_lock(&kvm->mmu_lock);
> + if (shared)
Hmm, what if we do this?
enum kvm_mmu_types {
KVM_SHADOW_MMU,
#ifdef CONFIG_X86_64
KVM_TDP_MMU,
#endif
KVM_NR_MMU_TYPES,
};
#ifndef CONFIG_X86_64
#define KVM_TDP_MMU -1
#endif
And then this becomes:
if (mmu_type == KVM_TDP_MMU)
> + read_lock(&kvm->mmu_lock);
> + else
> + write_lock(&kvm->mmu_lock);
>
> /*
> * Zapping TDP MMU shadow pages, including the remote TLB flush, must
> @@ -7341,8 +7344,13 @@ void kvm_recover_nx_huge_pages(struct kvm *kvm, struct list_head *pages,
> ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
> to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
> for ( ; to_zap; --to_zap) {
> - if (list_empty(pages))
> + if (tdp_mmu_enabled)
Shouldn't this be?
if (shared)
Or if we do the above
if (mmu_type == KVM_TDP_MMU)
Actually, better idea (sans comments)
if (mmu_type == KVM_TDP_MMU) {
read_lock(&kvm->mmu_lock);
kvm_tdp_mmu_pages_lock(kvm);
} else {
write_lock(&kvm->mmu_lock);
}
rcu_read_lock();
ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
to_zap = ratio ? DIV_ROUND_UP(possible_nx->nr_pages, ratio) : 0;
for ( ; to_zap; --to_zap) {
if (list_empty(&possible_nx->pages))
break;
...
/* Blah blah blah. */
if (mmu_type == KVM_TDP_MMU)
kvm_tdp_mmu_pages_unlock(kvm);
...
/* Blah blah blah. */
if (mmu_type == KVM_TDP_MMU)
kvm_tdp_mmu_pages_lock(kvm);
}
kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
rcu_read_unlock();
if (mmu_type == KVM_TDP_MMU) {
kvm_tdp_mmu_pages_unlock(kvm);
read_unlock(&kvm->mmu_lock);
} else {
write_unlock(&kvm->mmu_lock);
}
srcu_read_unlock(&kvm->srcu, rcu_idx);
> @@ -825,23 +835,51 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> rcu_read_unlock();
> }
>
> -bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
This rename, and any refactoring that is associated with said rename, e.g. comments,
belongs in a separate patch.
> + struct kvm_mmu_page *sp)
> {
> - u64 old_spte;
> + struct tdp_iter iter = {
> + .old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
> + .sptep = sp->ptep,
> + .level = sp->role.level + 1,
> + .gfn = sp->gfn,
> + .as_id = kvm_mmu_page_as_id(sp),
> + };
> +
> + lockdep_assert_held_read(&kvm->mmu_lock);
Newline here, to isolate the lockdep assertion from the functional code.
> + if (WARN_ON_ONCE(!is_tdp_mmu_page(sp)))
> + return false;
>
> /*
> - * This helper intentionally doesn't allow zapping a root shadow page,
> - * which doesn't have a parent page table and thus no associated entry.
> + * Root shadow pages don't a parent page table and thus no associated
Missed a word or three.
> + * entry, but they can never be possible NX huge pages.
> */
> if (WARN_ON_ONCE(!sp->ptep))
> return false;
>
> - old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> - if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
> + /*
> + * Since mmu_lock is held in read mode, it's possible another task has
> + * already modified the SPTE. Zap the SPTE if and only if the SPTE
> + * points at the SP's page table, as checking shadow-present isn't
> + * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
> + * another SP. Note, spte_to_child_pt() also checks that the SPTE is
> + * shadow-present, i.e. guards against zapping a frozen SPTE.
> + */
> + if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
> return false;
>
> - tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> - SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
> + /*
> + * If a different task modified the SPTE, then it should be impossible
> + * for the SPTE to still be used for the to-be-zapped SP. Non-leaf
> + * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
> + * creating non-leaf SPTEs, and all other bits are immutable for non-
> + * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
> + * zapping and replacement.
> + */
> + if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
> + WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
> + return false;
> + }
>
> return true;
> }